I’m thrilled to sit down with Vijay Raina, a renowned expert in enterprise SaaS technology and software design. With his deep knowledge of software architecture and a passion for innovative tools, Vijay has been at the forefront of exploring how emerging technologies like AI can transform the way we approach secure coding. Today, we’ll dive into his insights on a recent research project evaluating AI-driven vulnerability detection using large language models (LLMs). Our conversation will explore the strengths and limitations of AI coding assistants, the specific security challenges they tackle or struggle with, and the implications for developers looking to integrate these tools into their workflows.
Can you walk us through the core purpose of the research project on AI vulnerability detection and what you hoped to achieve by testing tools like Claude Code and Codex?
Absolutely, Grace. The primary goal of this project was to assess whether AI coding assistants, specifically large language models like Claude Code and Codex, could effectively identify security vulnerabilities in real-world software. We wanted to see if these tools could serve as a viable alternative or complement to traditional static analysis tools in security scanning. The idea was to push the boundaries of AI’s capabilities by testing how well these models understand entire code repositories within a context window and whether they could spot flaws that might otherwise be missed. We focused on 11 large, actively maintained open-source Python web applications to ensure the results were relevant to modern development environments. In total, we manually reviewed over 400 findings to analyze the accuracy and reliability of these AI tools.
What stood out to you about the performance of Claude Code and Codex when it came to identifying real vulnerabilities in the tested applications?
The results were quite revealing. Claude Code identified 46 real vulnerabilities with a true positive rate of 14%, meaning 86% of its findings were false positives. Codex, on the other hand, found 21 real vulnerabilities with an 18% true positive rate, so it was slightly more accurate but produced fewer correct detections overall. What struck me was the high rate of false positives from both tools—over 80% in each case. While they did uncover genuine security issues in live code, the sheer volume of incorrect alerts highlighted a significant challenge in relying on these tools for routine security workflows without additional validation.
One area where AI seemed to shine was in detecting Insecure Direct Object Reference (IDOR) vulnerabilities. Can you explain what these are and why AI performed so well in this area?
Sure, IDOR vulnerabilities occur when an application exposes internal resources through predictable identifiers—like IDs in a URL—without properly checking if the user is authorized to access them. For example, if you’re viewing an order history and the URL includes an ID, changing that ID might let you see someone else’s data if there’s no authorization check. AI, particularly Claude Code, excelled here with a 22% true positive rate, much higher than for other vulnerability types. I believe this success comes from AI’s strength in contextual reasoning. These models can recognize patterns in code where something critical, like an authorization step, is missing, even if the code looks syntactically correct. It’s less about raw data analysis and more about understanding the logical flow, which plays to AI’s strengths.
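To make the pattern concrete, here is a minimal Flask-style sketch of the order-history example Vijay describes. The routes, data store, and helper are hypothetical stand-ins rather than code from the applications in the study; the point is simply what the missing authorization check looks like.

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Toy data store: order_id -> record. Purely illustrative.
ORDERS = {
    101: {"owner_id": 1, "item": "keyboard"},
    102: {"owner_id": 2, "item": "monitor"},
}

def current_user_id() -> int:
    # Stand-in for whatever session or token lookup the real app would use.
    return 1

@app.route("/orders/<int:order_id>")
def get_order_vulnerable(order_id: int):
    # IDOR: any logged-in user can read any order simply by changing the ID.
    order = ORDERS.get(order_id)
    if order is None:
        abort(404)
    return jsonify(order)

@app.route("/v2/orders/<int:order_id>")
def get_order_fixed(order_id: int):
    order = ORDERS.get(order_id)
    if order is None:
        abort(404)
    # The step the AI tools were good at noticing when it was missing:
    # check that the requester actually owns the resource.
    if order["owner_id"] != current_user_id():
        abort(403)
    return jsonify(order)
```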
On the flip side, AI struggled with vulnerabilities related to data flows, like SQL Injection or Cross-Site Scripting (XSS). What makes these issues so challenging for AI tools to detect?
Injection vulnerabilities, such as SQL Injection or XSS, involve untrusted input traveling through an application to a sensitive point—like a database query or HTML output—without proper sanitization. Detecting these requires taint tracking, which means following data from its source to its destination and understanding if it’s been safely handled along the way. AI tools like Claude Code and Codex had dismal performance here, with true positive rates as low as 5% for SQL Injection and 0% for XSS in some cases. The problem is that LLMs don’t have a deep grasp of data flows across complex codebases, especially when logic is spread across multiple modules or libraries. They might spot a risky pattern but often miss whether the input is actually dangerous or already mitigated, which limits their effectiveness for these types of flaws.
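To illustrate why this is harder, here is a small sketch of the kind of source-to-sink flow involved, with the untrusted value entering in one function and reaching the query in another. The module and function names are hypothetical, and sqlite3 stands in for whatever database layer a real application would use.

```python
import sqlite3

# --- request layer: where the taint originates ----------------------------
def handle_search(request_args: dict, conn: sqlite3.Connection):
    # `term` is attacker-controlled; nothing sanitizes it before it leaves here.
    term = request_args.get("q", "")
    return find_products(conn, term)

# --- data layer: the sink, possibly in a different module -----------------
def find_products(conn: sqlite3.Connection, term: str):
    # Vulnerable sink: the tainted value is spliced directly into the SQL string.
    # Judging this correctly means knowing where `term` came from and whether it
    # was validated along the way, which is exactly the cross-module reasoning
    # the LLMs struggled with.
    query = f"SELECT name, price FROM products WHERE name LIKE '%{term}%'"
    return conn.execute(query).fetchall()

def find_products_safe(conn: sqlite3.Connection, term: str):
    # Safe variant: the value travels as a bound parameter, never as SQL text.
    return conn.execute(
        "SELECT name, price FROM products WHERE name LIKE ?",
        (f"%{term}%",),
    ).fetchall()
```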
Another fascinating aspect of the study was the inconsistency of AI tools, often described as non-determinism. Can you elaborate on what this means and how it showed up in your research?
Non-determinism refers to the unpredictable behavior of AI tools when you run them multiple times on the same input. In our tests, we used the same prompt on the same application three times, and the results varied wildly—one run found 3 vulnerabilities, the next found 6, and the third found 11. It wasn’t a case of the tool getting smarter; it was just producing different outputs each time. This happens because LLMs summarize and compress large code contexts internally, a process that can lose critical details like function names or variable relationships. For security workflows, this is a big concern because you can’t trust that a missing finding means the issue is fixed—it might just be that the model overlooked it in that particular run.
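One practical response to that variance, sketched below, is to treat the union of several runs as the working result rather than trusting any single run. The finding labels are hypothetical placeholders, not results from the study.

```python
from collections import Counter

# Hypothetical findings from three runs of the same prompt on the same app.
runs = [
    {"idor:orders_view", "sqli:search", "xss:comment_form"},
    {"idor:orders_view", "sqli:search", "xss:comment_form",
     "idor:profile_export", "ssrf:webhook", "sqli:report_filter"},
    {"idor:orders_view", "sqli:search", "ssrf:webhook"},
]

counts = Counter(finding for run in runs for finding in run)
total_runs = len(runs)

stable = sorted(f for f, seen in counts.items() if seen == total_runs)
flaky = sorted(f for f, seen in counts.items() if seen < total_runs)

print("Reported in every run:", stable)
print("Reported in some runs:", flaky)
# The union across runs, not any single run, is the honest picture of what the
# model can surface; a finding that disappears is not evidence the issue is fixed.
```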
How do you think this inconsistency impacts the practical use of AI tools in a security pipeline for developers?
The inconsistency introduces a layer of uncertainty that’s tough to manage in a production environment. In traditional static application security testing, if a vulnerability disappears from a scan, you assume it’s resolved or the code has changed. With AI’s non-deterministic nature, a finding might vanish simply because the model didn’t catch it that time, not because the issue is gone. This makes it hard to rely on AI as a standalone tool for security scanning. Instead, it suggests that AI should be used as a supportive tool—great for generating ideas or prioritizing issues—but paired with deterministic, rule-based systems to ensure reliability and reduce noise from false positives.
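A rough sketch of that hybrid split might look like the following, with the rule-based scanner’s results treated as the source of truth and the AI findings routed to manual review; all finding labels here are invented for illustration.

```python
# Results from a deterministic SAST scan and from an AI assistant (hypothetical).
sast_findings = {"sqli:search", "xss:comment_form"}
ai_findings = {"sqli:search", "idor:orders_view", "csrf:settings_form"}

confirmed_by_both = sast_findings & ai_findings  # highest confidence, fix first
sast_only = sast_findings - ai_findings          # still reliable; the AI simply missed them
ai_only = ai_findings - sast_findings            # candidates, not verdicts: triage by hand

for label, findings in [
    ("fix first", confirmed_by_both),
    ("fix (rule-based)", sast_only),
    ("manual triage", ai_only),
]:
    print(f"{label:18} {sorted(findings)}")
```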
Even with high false positive rates, the study noted that some of AI’s incorrect findings still had value. Can you share how these so-called mistakes could still be useful to developers?
Absolutely. While false positive rates of 80-90% sound discouraging, not all of those incorrect findings were useless. For instance, Claude Code often flagged SQL queries as risky and suggested parameterizing them, even if they were already safe. Technically, that’s a false positive, but it’s also a good secure coding practice, much like a linter pointing out style improvements. When paired with AI’s ability to generate quick fixes, these suggestions can act as guardrails, encouraging better habits. However, the challenge is balancing this benefit against the noise—too many false alerts in a production pipeline can overwhelm teams, so it’s best to treat AI as a brainstorming aid rather than a definitive authority.
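As a hypothetical example of that kind of useful false positive, the query below is arguably already safe because the interpolated values are constrained, yet the parameterized form an assistant would suggest is still the better habit.

```python
import sqlite3

ALLOWED_SORT_COLUMNS = {"name", "price", "created_at"}

def list_products(conn: sqlite3.Connection, sort_by: str, min_price: float):
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError("unsupported sort column")

    # Flagged version: min_price is a float and sort_by is allow-listed, so this
    # is not exploitable as written, but it still pattern-matches as "risky".
    # query = f"SELECT * FROM products WHERE price >= {min_price} ORDER BY {sort_by}"

    # Suggested version: bind the value; keep the allow-list for the identifier,
    # which placeholders cannot parameterize anyway.
    query = f"SELECT * FROM products WHERE price >= ? ORDER BY {sort_by}"
    return conn.execute(query, (min_price,)).fetchall()
```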
Looking ahead, what is your forecast for the role of AI in software security, especially considering both its potential and its current limitations?
I’m optimistic about AI’s future in software security, but I don’t see it replacing security engineers or traditional tools anytime soon. Its potential lies in enhancing contextual reasoning, especially for logic flaws like IDOR or broken access control, where human-like pattern recognition is invaluable. However, the limitations—struggles with data flows, non-determinism, and high false positives—mean it can’t stand alone. I foresee a hybrid approach gaining traction, where AI’s strengths are combined with the precision of static analysis tools. This blend could reshape how we prioritize and triage security issues, making workflows more efficient while maintaining reliability. Over the next few years, I expect advancements in AI to reduce inconsistency and improve data flow analysis, but it will still be a collaborative tool, augmenting human expertise rather than replacing it.
