
Benchmarks Lie: What a Turing Award Winner Found When He Tested Text-to-SQL on Real Data

Written by Dori Wilson | Apr 28, 2026 5:00:32 PM

The text-to-SQL leaderboard says 80% accuracy. Michael Stonebraker tested the same models on a real enterprise data warehouse and got 10%.

A 70-point gap separates what benchmarks measure from what production data demands. Stonebraker is the Turing Award-winning creator of Ingres and Postgres. On the latest episode of Data Renegades, he walked through exactly why the gap exists and what it means for data teams evaluating AI tooling.

The short version: production data warehouses break every assumption public benchmarks rely on. And Stonebraker has spent 50 years making contrarian bets against consensus, so he has a framework for thinking about it.


The Test That Broke the Leaderboard

Stonebraker's team ran leading Large Language Models (LLMs) and agentic AI systems against four real enterprise data warehouses. The most telling case was the MIT Data Administration warehouse: an Oracle database with 1,400 tables.

On public text-to-SQL benchmarks like Spider and Bird, which use clean schemas with dozens of tables, the same LLMs score in the low 80s. On the MIT warehouse, which no model had seen before, accuracy dropped to roughly 10%.

Even when the team handed the models the FROM clause (which tables to use) and the join connectors (how to link them), accuracy only climbed to about 30%. A 50-point gap remained.
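To make the setup concrete, here is a hypothetical version of that assisted prompt. The table and column names are invented for illustration, not drawn from the MIT schema; the point is how much of the query was handed to the models versus what they still had to generate:

    -- Lines marked "provided" were given to the models.
    -- Everything else the models still had to generate.
    SELECT b.building_number,
           COUNT(*) AS sections_held                            -- generated
    FROM sections s                                             -- provided: which tables
    JOIN course_offerings o ON s.offering_id = o.offering_id    -- provided: join connector
    JOIN buildings b        ON o.building_id = b.building_id    -- provided: join connector
    WHERE o.term_code = '2024FA'                                -- generated
    GROUP BY b.building_number;                                 -- generated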

"We tried out the LLMs that everyone else was touting, as well as our own," Stonebraker said. "And accuracy was about 10%, not 80%, 10%."


Four Reasons Benchmarks Mislead

The gap traces to four structural differences between benchmarks and production:

1. Data not in the training pile. The MIT warehouse is proprietary. LLMs cannot train on data they have never seen. As Stonebraker put it: without prior exposure to the data, "you have no chance of being able to retrieve it."

2. Query complexity. Spider and Bird queries average around 20 lines of SQL. Real warehouse queries run to 100 lines. The jump from undergraduate-level SQL to production-grade multi-join analytics is not incremental. It is a different problem.

3. Idiosyncratic naming and data. MIT identifies buildings by number, not name. The computer science building is "Building 32." The academic calendar includes a "J-term," a one-month January session. Every enterprise warehouse has equivalent local knowledge baked into column values that no general-purpose model can infer (the sketch after this list makes it concrete).

4. Messy, overlapping schemas. Real warehouses accumulate materialized views with overlapping semantics. Column names are often non-mnemonic. Schemas reflect decades of migration and patching, not clean design.

These four factors compound. Remove any one and accuracy might improve marginally. But real warehouses exhibit all four simultaneously.
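Factor 3 is the easiest to see in a query. Consider a question like "How many classes met in the computer science building during the January session?" A minimal sketch of the SQL it demands, with all table names and codes invented for illustration:

    -- Answering correctly requires two pieces of local knowledge
    -- the schema never spells out: CS is "Building 32", and
    -- January is its own term.
    SELECT COUNT(*) AS january_cs_classes
    FROM class_meetings m
    JOIN buildings b ON m.building_id = b.building_id
    WHERE b.building_number = '32'     -- "the computer science building"
      AND m.term_code = 'JTERM';       -- the one-month J-term session

A model that has never seen MIT's data has no way to map "computer science building" to '32' or "January" to a J-term code, no matter how good it is at SQL syntax.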


The Renegade Pattern: 50 Years of Betting Against Consensus

The text-to-SQL critique is the latest in a pattern Stonebraker has repeated across five decades.

In the 1970s, the database world ran on CODASYL (Conference on Data Systems Languages), a navigational data model so complex that any schema change required rebuilding the entire database. Stonebraker bet his tenure at Berkeley on building a working implementation of Ted Codd's relational model instead. As he put it: "A lot of people get prototypes to where they can make them work, but no one else can. That's the first 90% of the effort. The second 90% is getting it to where other people can use it." Ingres made it through both.

In the 2010s, when MongoDB and Cassandra were the hottest technologies in the industry, Stonebraker called the NoSQL movement technically foolish. His reasoning was specific: "Don't ever bet against the compiler. Database optimizers, think of them as a SQL compiler. May not have been terrific in the 80s, but they've gotten awfully good."

He was right. Mongo has since replaced its storage engine, added joins, and drifted toward SQL. Stonebraker's summary of the arc: "At the beginning, the NoSQL guy said, don't use SQL. Then that quickly morphed into not only SQL. And my opinion is now moving toward not yet SQL."

Now the bet is against the AI consensus. A trillion dollars in investment assumes LLMs can orchestrate complex data operations. Stonebraker's data says they cannot, at least not on the warehouses that matter most.


SQL as Orchestrator, Not the LLM

Stonebraker's response to the benchmark gap is not to wait for better models. It is to change the architecture.

His current project, Rubicon, takes a SQL-first approach. Instead of using an LLM as the overall orchestrator, Rubicon upscales everything into tables and uses a query optimizer for joins. LLMs still handle the parts where they demonstrate reliable accuracy, but SQL drives the orchestration.

The motivating example comes from the Munich Department of Mobility. Citizens file complaints about intersection timing. Answering a single complaint requires joining five data sources: federal regulations (text), city regulations (text), CAD drawings of intersections, trolley schedules (SQL), and traffic light sequencing (SQL). Downscaling all of that to text so an LLM can reason over it loses the join precision that SQL provides natively.
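What might SQL-driven orchestration look like? A minimal sketch, assuming the LLM is exposed to the query engine as a scalar function. The llm_extract function and all table names below are invented for illustration and cover only a subset of the five Munich sources; the episode does not describe Rubicon's actual interface:

    -- The joins stay in SQL, where the optimizer can plan them.
    -- The (hypothetical) llm_extract() UDF is invoked only where
    -- text understanding is genuinely needed.
    SELECT c.complaint_id,
           s.headway_minutes,
           l.cycle_seconds,
           llm_extract(r.reg_text, 'maximum allowed wait time') AS legal_limit
    FROM complaints c
    JOIN trolley_schedules s ON s.intersection_id = c.intersection_id
    JOIN light_sequences l   ON l.intersection_id = c.intersection_id
    JOIN city_regulations r  ON r.district = c.district;

The design choice is the inverse of the agentic pattern: instead of the LLM calling SQL as a tool, SQL calls the LLM as a function, and the precise parts of the problem never leave the relational engine.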

Rubicon pairs with the Beaver benchmark, which Stonebraker's team built to test text-to-SQL on production-realistic data. The best systems score around 30% on Beaver, even with the FROM clause and join terms provided. "I'm tired of people saying Bird is a good benchmark," Stonebraker said. "Prove you can do this on Beaver and then we can talk."


The Real AI Opportunity: Legacy Maintenance

The conversation's sharpest career advice landed in the final segment. Stonebraker estimates that 90% of enterprise programmers spend their time on maintenance, not greenfield development.

"Here's my cynical way of describing how real development works," he said. "We started out 30 years ago with a green field. And then it has been migrated and patched and extended and changed once per year since then. And it's now a complete mess."

The real opportunity for LLMs and AI coding agents, in his view, is not generating new code from scratch. It is helping teams maintain the accumulated complexity of systems that have evolved over decades. He named Claude specifically, noting it "could bear a lot of fruit" for legacy maintenance but adding that success "remains to be seen."

This framing inverts the marketing narrative around AI coding tools. The flashy demos show greenfield generation. The hard, high-value problem is navigating code and data that have been patched by dozens of people over 30 years.


Production Data Does Not Care About Leaderboards

Stonebraker's 50-year track record suggests a principle worth internalizing: when benchmark results diverge from production results, trust production.

The text-to-SQL gap is a specific instance of a general pattern. Data systems that perform well on clean, standardized test sets frequently break on the messy, idiosyncratic, schema-over-schema reality of enterprise warehouses. The four factors Stonebraker identified (unseen data, complex queries, local naming, dirty schemas) are not bugs in the benchmark methodology. They are the defining characteristics of production data.

For data teams evaluating AI tooling, the implication is direct. Ask for production-realistic benchmarks. Test against real warehouse schemas. Assume that any accuracy number from a public leaderboard will degrade significantly in a production environment.

"You never are successful by doing what the other guys are doing," Stonebraker said. "It always pays to be a renegade."


FAQ

How accurate is text-to-SQL on real enterprise data?

On the MIT Data Administration warehouse (1,400 Oracle tables), the best LLMs scored roughly 10% accuracy. Even with the FROM clause and join terms provided, accuracy only reached about 30%. Public benchmarks like Spider and Bird report 80%+, but those use clean schemas with simple queries.

Why do text-to-SQL benchmarks overstate LLM accuracy?

Four factors: production data is not in the training pile, real queries are 100 lines of SQL rather than 20, enterprise data uses idiosyncratic naming conventions, and warehouse schemas accumulate overlapping materialized views with non-mnemonic column names.

What is the Beaver benchmark?

Beaver is a text-to-SQL benchmark built by Stonebraker's team using production-realistic data warehouses. The best systems score around 30% on Beaver, compared to 80%+ on Spider and Bird. It is designed to test text-to-SQL under conditions that match real enterprise environments.


Try Automated Data Validation For Yourself

Try our Data Review Agent on your data projects: sign up for Recce Cloud and tell us what works and what breaks.

Here are docs that help, and we're more than happy to help you directly, too.

We'd love to hear from you. If you can spare 30 minutes to chat, we'll send you a $25 gift card as a thank you. Join the feedback panel.