
How to Evaluate NL2SQL Accuracy on Oracle Using MCP?

May 18, 2026 Interview

Benjamin Daigle, Software Development Expert

In the rapidly evolving landscape of enterprise software, the ability to bridge the gap between natural language and complex database queries is becoming a cornerstone of modern SaaS architecture. Vijay Raina, a seasoned specialist in enterprise technology and software design, has spent a significant portion of his career navigating the intricacies of how large language models interact with robust relational systems. His work on the SQLclMCP framework represents a pivot toward rigorous, automated evaluation—ensuring that when an AI generates code for an Oracle Database, it isn’t just a best guess, but a precise, high-performance command. By leveraging the Model Context Protocol, he has helped establish a standard where correctness, execution success, and efficiency are measured against human-written benchmarks with surgical precision.

This conversation delves into the technical hurdles of SQL generation, specifically focusing on the architectural benefits of decoupling logic through HTTP-based protocols and the unique challenges of the Oracle dialect. We explore the necessity of comparing execution plans, the patterns of accuracy loss across different query complexities, and the iterative process of refining prompts based on automated failure reports.

Using an HTTP-based protocol like MCP decouples SQL generation from evaluation logic. How does this architecture simplify the process of swapping LLMs or prompt versions, and what specific steps are required to ensure the evaluation runner remains independent of the server’s internal logic?

The beauty of the Model Context Protocol lies in the clean, unyielding boundary it creates between the “thinking” part of the system and the “doing” part. When we use an HTTP-based approach, the evaluation runner functions as a pure auditor that only cares about two endpoints—a health check and a generation request—meaning it never has to get its hands dirty with LLM SDKs or specific prompt formatting. This decoupling allows a developer to swap an entire model or jump from prompt version one to version two by simply updating a single URL, which provides a sense of immense freedom during rapid prototyping. To keep this independence absolute, the server must strictly adhere to a contract where the question goes in and the SQL comes out, leaving the evaluation logic to focus entirely on the Oracle Database environment. It feels like moving from a tangled web of dependencies to a modular “plug-and-play” system where you can run a 500-question TPC-H benchmark without ever changing a line of the Python runner’s code.
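A minimal sketch of what such a runner-side client could look like, using only the Python standard library. The endpoint paths (`/health`, `/generate`) and the JSON field names (`question`, `sql`) are assumptions for illustration; the interview only states that there is a health check and a generation request.

```python
import json
import urllib.request

# Hypothetical endpoint paths; the source only says there are two endpoints.
HEALTH_PATH = "/health"
GENERATE_PATH = "/generate"


def build_generate_request(base_url: str, question: str) -> urllib.request.Request:
    """Build the POST request that asks the server for SQL."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + GENERATE_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sql(body: bytes) -> str:
    """Extract the SQL text from the (assumed) JSON response body."""
    return json.loads(body.decode("utf-8"))["sql"].strip().rstrip(";")


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        url = base_url.rstrip("/") + HEALTH_PATH
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def generate_sql(base_url: str, question: str, timeout: float = 60.0) -> str:
    """Question in, SQL out: the only contract the runner relies on."""
    request = build_generate_request(base_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_sql(resp.read())
```

Because the runner only knows `base_url`, pointing it at a different model or prompt version would mean changing that one string, for example from a server on port 8001 to one on port 8002.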

Oracle Database has unique syntax requirements, such as using FETCH FIRST instead of LIMIT or specific date extraction methods. How do you identify these dialect-specific failures during a large-scale benchmark, and what prompt engineering strategies effectively mitigate these errors for production-ready SQL?

Identifying these failures requires a meticulous look at execution success rates and the specific ORA-error codes that bubble up when the model misses a dialect quirk. For instance, we often saw the model try to use a LIMIT clause or a nonexistent EXTRACT(QUARTER…) function, which immediately triggers an ORA-00907 error and grinds the process to a halt. To mitigate this for production-ready output, we baked very specific “thou shalt not” rules into the schema hints, explicitly instructing the LLM to use FETCH FIRST N ROWS ONLY and to handle quarters through CEIL(EXTRACT(MONTH FROM col)/3). There is a certain satisfaction in watching the error rates drop once the prompt clarifies that date ranges are preferred over complex extract functions for filtering. By providing these guardrails, we transform a generic model into one that speaks “fluent Oracle,” ensuring it handles the unique syntax of a system like Oracle 26ai with the same nuance as a human DBA.
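The guardrails described above could be paired with a cheap pre-execution check, sketched here under stated assumptions. The hint text paraphrases the rules mentioned in the answer (the real schema hints are not quoted in the interview), and `dialect_violations` with its regex patterns is a hypothetical helper, not part of any framework.

```python
import re

# Illustrative prompt rules paraphrased from the interview; the exact
# wording of the real schema hints is not given.
ORACLE_HINTS = """\
- Use FETCH FIRST N ROWS ONLY, never LIMIT.
- For quarters use CEIL(EXTRACT(MONTH FROM col)/3), never EXTRACT(QUARTER ...).
- Prefer date ranges over EXTRACT functions when filtering.
"""

# Hypothetical lint patterns for the two failures named in the answer.
DIALECT_PATTERNS = [
    ("LIMIT clause", re.compile(r"\bLIMIT\s+\d+", re.IGNORECASE)),
    ("EXTRACT(QUARTER ...)", re.compile(r"\bEXTRACT\s*\(\s*QUARTER\b", re.IGNORECASE)),
]


def dialect_violations(sql: str) -> list[str]:
    """Return the labels of every non-Oracle construct found in the SQL."""
    return [label for label, pattern in DIALECT_PATTERNS if pattern.search(sql)]
```

A regex check like this is deliberately crude: it would also match the word LIMIT inside a string literal, so it is best treated as a fast way to bucket failures, with the ORA error from the database remaining the authoritative signal.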

Beyond simple execution success, why is it essential to compare EXPLAIN PLAN metrics like cost and cardinality against human-written baselines? What specific performance deltas or latency ratios should a developer look for to determine if a generated query is efficient enough for a live database?

Execution success is a low bar because a query can be “correct” but still be an absolute nightmare for the database’s resources, potentially locking up tables or causing massive spikes in CPU usage. By pulling EXPLAIN PLAN metrics such as cost, cardinality, and total bytes, we can see exactly how the Oracle optimizer plans to tear through the data compared to our gold-standard, human-written SQL. If the latency ratio shows the generated query is taking five times longer than the baseline, or if the cost delta is significantly higher, it’s a red flag that the model has likely missed an index or joined tables in a sub-optimal order. In a live environment, a developer should be looking for a latency ratio close to 1:1; anything that deviates wildly suggests that while the result set might be semantically accurate, the query is far from production-ready. We use 13 different graphs and 6 detailed tables to visualize these deltas, providing a visual, data-driven look at where the AI’s logic is lagging behind human intuition.
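The comparison described here reduces to a few ratios and deltas, sketched below. The `QueryMetrics` container and `compare_to_baseline` function are hypothetical names; the 5x latency threshold comes from the answer above, while the 2x cost threshold is an assumption chosen only to make "significantly higher" concrete.

```python
from dataclasses import dataclass


@dataclass
class QueryMetrics:
    cost: float         # optimizer cost reported by EXPLAIN PLAN
    cardinality: float  # estimated row count
    total_bytes: float  # estimated bytes processed
    latency_ms: float   # measured wall-clock execution time


def compare_to_baseline(baseline, generated, max_latency_ratio=5.0, max_cost_ratio=2.0):
    """Compare generated-query metrics with the human-written baseline."""

    def ratio(gen_value, base_value):
        # A zero baseline is treated as an infinite ratio, which flags the query.
        return gen_value / base_value if base_value else float("inf")

    latency_ratio = ratio(generated.latency_ms, baseline.latency_ms)
    cost_ratio = ratio(generated.cost, baseline.cost)
    return {
        "latency_ratio": latency_ratio,
        "cost_delta": generated.cost - baseline.cost,
        "cardinality_delta": generated.cardinality - baseline.cardinality,
        "flagged": latency_ratio >= max_latency_ratio or cost_ratio >= max_cost_ratio,
    }
```

A query whose metrics equal the baseline yields a latency ratio of 1.0 and is not flagged, which matches the "close to 1:1" target mentioned above.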

When running a 500-question TPC-H benchmark across simple and complex tiers, what patterns of accuracy degradation do you typically observe? Could you share a scenario where a query achieved a semantic match but was flagged as a failure due to improper column prefixes or ordering?

As we move from the “simple” tier to the “complex” tier, we typically see a noticeable dip in accuracy as the LLM struggles with the sheer number of joins and nested subqueries required by TPC-H. In the simple tier, the model is usually quite confident, but once you introduce multiple CTEs, it’s easy for it to get confused, leading to errors like the infamous ORA-00904. A classic scenario involves a query over the LINEITEM table where the model tries to use L.P_PARTKEY (a column that actually belongs to the PART table) instead of the correct L.L_PARTKEY. Even if the query somehow returns the right rows, it might fail our “exact order match” or “exact string” tests if the model neglected a specific ORDER BY clause mandated by the benchmark. These moments are frustrating but enlightening, as they highlight that “semantic match” is only the first hurdle; true robustness requires the model to honor the specific structural constraints of the data schema.
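The gap between a semantic match and an exact order match can be illustrated with two small checks. This is a sketch of one plausible interpretation: "semantic match" is taken to mean the same rows regardless of order, and both function names are assumptions rather than the framework's actual API.

```python
from collections import Counter


def semantic_match(baseline_rows, generated_rows) -> bool:
    """Same rows with the same multiplicities, ignoring row order."""
    return Counter(map(tuple, baseline_rows)) == Counter(map(tuple, generated_rows))


def exact_order_match(baseline_rows, generated_rows) -> bool:
    """Same rows in exactly the same order, as an ORDER BY would require."""
    return [tuple(r) for r in baseline_rows] == [tuple(r) for r in generated_rows]
```

With a baseline of `[(1, "x"), (2, "y")]` and a generated result of `[(2, "y"), (1, "x")]`, the first check passes and the second fails, which is the situation a missing ORDER BY would produce.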

Effective troubleshooting requires side-by-side comparisons of baseline and generated SQL. How do you utilize automated failure reports to refine schema hints, and what is your step-by-step process for ensuring that these prompt adjustments do not negatively impact the performance of other query tiers?

Our troubleshooting process is anchored by an automated script that transforms the evaluation JSON into a readable markdown report, placing the human-written SQL and the LLM’s attempt side by side for immediate visual comparison. When we see a recurring failure, such as a missed join condition in the “medium” complexity tier, we update the schema hints to emphasize that specific relationship. To ensure these changes don’t break the “simple” queries, we follow a rigorous re-testing cycle: first, we run a quick smoke test on 10 queries, then we move to a full tier-specific run, and finally, we re-run all 500 questions to confirm the global accuracy hasn’t regressed. This 30-to-60-minute feedback loop is essential because it prevents “prompt drift,” where fixing one complex edge case inadvertently makes the model overthink basic requests. It’s a delicate balance of adding just enough detail to the prompt to be helpful without cluttering the context window to the point of confusion.
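A report generator of the kind described might look like the sketch below. The JSON field names (`id`, `tier`, `question`, `passed`, `error`, `baseline_sql`, `generated_sql`) are assumptions, since the interview does not show the evaluation file's schema, and the two queries are stacked one above the other rather than laid out in true columns.

```python
import json

FENCE = "`" * 3  # markdown code fence, built indirectly so this example nests cleanly


def failure_report(evaluation_json: str) -> str:
    """Render failed evaluation results as a markdown report."""
    results = json.loads(evaluation_json)
    lines = ["# NL2SQL failure report", ""]
    for result in results:
        if result["passed"]:
            continue  # only failures are worth a human's attention
        lines += [
            f"## {result['id']} ({result['tier']})",
            f"Question: {result['question']}",
            f"Error: {result.get('error') or 'none'}",
            "",
            "Baseline SQL:",
            FENCE + "sql",
            result["baseline_sql"],
            FENCE,
            "",
            "Generated SQL:",
            FENCE + "sql",
            result["generated_sql"],
            FENCE,
            "",
        ]
    return "\n".join(lines)
```

Grouping the output by tier would make it easier to check, after each schema-hint change, whether a fix aimed at one tier has introduced new failures in another.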

What is your forecast for NL2SQL?

I believe we are moving away from the era of “generic” SQL generation toward a period of extreme specialization where the LLM becomes an expert in specific database dialects and enterprise-grade performance tuning. In the near future, the focus will shift from just getting the “right answer” to generating SQL that is natively optimized for specific hardware and cloud configurations, using tools like MCP to act as the standard interface. We will likely see self-healing pipelines where the database feedback, such as an ORA-error or a high-cost EXPLAIN PLAN, is fed back to the model in real time to refine the query before it ever reaches a human reviewer. Ultimately, NL2SQL will stop being an experimental feature and become a core, invisible layer of the enterprise stack, allowing anyone to interact with massive datasets like TPC-H as easily as they would carry on a conversation, provided we maintain the rigorous evaluation standards we’ve discussed today.
