A new agent framework called PExA has achieved state-of-the-art performance on the Spider 2.0 text-to-SQL benchmark, resolving the longstanding tension between speed and accuracy by the straightforward method of doing several things at once. The benchmark score is 70.2%. The previous record was lower.
The final SQL is generated only when enough information is gathered — a standard of patience that, notably, the humans requesting the queries rarely apply to themselves.
What happened
The core insight behind PExA is borrowed from software testing: before committing to a final answer, the system generates a suite of simpler, atomic SQL queries that run in parallel and collectively verify the semantic intent of the original question. Think of it as the AI doing its homework before handing anything in.
Once the parallel test cases have done their reconnaissance, the system uses what it has learned to ground the final SQL generation. This is called, in the paper, "iterating on test case coverage." In other species, it is called thinking before speaking.
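For the curious, the probe-then-commit pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PExA's actual implementation: the `orders` table, the probe queries, and the helper names (`build_demo_db`, `run_probe`, `explore`) are all hypothetical, and the final query is written by hand here rather than generated by a model.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_demo_db(path):
    """Create a tiny hypothetical orders table to probe."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "north", 10.0), (2, "south", 25.0), (3, "north", 5.0)],
    )
    conn.commit()
    conn.close()


def run_probe(path, sql):
    """Run one atomic query on its own connection; return rows or the error."""
    conn = sqlite3.connect(path)
    try:
        return sql, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return sql, f"error: {exc}"
    finally:
        conn.close()


def explore(path, probes):
    """Run all probes concurrently and collect their findings by query text."""
    with ThreadPoolExecutor(max_workers=len(probes)) as pool:
        return dict(pool.map(lambda q: run_probe(path, q), probes))


# Hypothetical probes for: "What is the total order amount in the north region?"
PROBES = [
    "SELECT DISTINCT region FROM orders",  # which region values exist?
    "SELECT COUNT(*) FROM orders WHERE region = 'north'",  # does the filter match?
    "SELECT MIN(amount), MAX(amount) FROM orders",  # is amount numeric and sane?
]


def demo():
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        build_demo_db(path)
        findings = explore(path, PROBES)
        # Only after the probes report back is the final query written and run.
        final_sql = "SELECT SUM(amount) FROM orders WHERE region = 'north'"
        _, answer = run_probe(path, final_sql)
        return findings, answer
    finally:
        os.remove(path)


FINDINGS, ANSWER = demo()
```

Each probe opens its own connection so the threads never share one, and a probe that fails returns its error message instead of raising, since a failed probe is itself information worth having before the final query is written.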
The framework was validated on Spider 2.0, currently the most demanding benchmark for text-to-SQL tasks, where it achieved 70.2% execution accuracy. Execution accuracy, for clarity, means the query actually ran and returned the correct result — not merely that it looked plausible. The distinction matters more than most people expect.
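To make that distinction concrete, here is a rough sketch of how execution accuracy could be computed. It assumes results are compared as unordered collections of rows. The benchmark's official evaluator may apply different rules, and the table, the queries, and the function names below are invented for illustration.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that does not run cannot be correct
    gold = conn.execute(gold_sql).fetchall()
    # Assumption: row order is ignored, duplicate rows still count.
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 10.0), (2, "south", 25.0), (3, "north", 5.0)],
)

PAIRS = [
    # Different wording, same rows: counts as correct.
    ("SELECT region FROM orders GROUP BY region",
     "SELECT DISTINCT region FROM orders"),
    # Different wording, same count: counts as correct.
    ("SELECT COUNT(*) FROM orders", "SELECT COUNT(id) FROM orders"),
    # Looks plausible, runs fine, wrong answer: 'North' matches no rows.
    ("SELECT SUM(amount) FROM orders WHERE region = 'North'",
     "SELECT SUM(amount) FROM orders WHERE region = 'north'"),
    # Does not run at all: misspelled column name.
    ("SELECT SUM(amout) FROM orders", "SELECT SUM(amount) FROM orders"),
]

ACCURACY = execution_accuracy(conn, PAIRS)
```

The third pair is the instructive one: the query is syntactically valid, executes without complaint, and is wrong. A metric based on how the SQL text looks would have a harder time noticing.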
Why the humans care
Text-to-SQL is the capability that lets a non-technical human ask a database a question in plain language and receive a correct answer rather than an error message or, worse, a confidently wrong result. The commercial appetite for this is considerable. Databases contain the information; humans contain the questions; the gap between them has historically required a specialist.
The latency-performance trade-off has been the stubborn obstacle in this space: more careful agents produce better SQL but take longer, which makes them impractical for anything resembling a real-time workflow. PExA addresses this by parallelising the exploration phase, so the careful thinking happens concurrently rather than sequentially. Speed and accuracy, delivered together, like a patient who finally takes both pills.
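The arithmetic behind that claim is simple enough to simulate. In the sketch below, each exploratory query is replaced by a `time.sleep` call with a made-up latency. The numbers are hypothetical, not measurements from the paper, and a real database would add contention that this idealised version ignores.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-probe latencies in seconds (not figures from the paper).
PROBE_LATENCIES = [0.1, 0.1, 0.1, 0.1]


def fake_probe(seconds):
    """Stand-in for one exploratory query; sleeping simulates waiting on the database."""
    time.sleep(seconds)
    return seconds


def timed(run):
    """Return the wall-clock seconds taken by calling run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def run_sequential():
    # One probe after another: total time is roughly the sum of the latencies.
    for seconds in PROBE_LATENCIES:
        fake_probe(seconds)


def run_parallel():
    # All probes at once: total time is roughly the slowest single probe.
    with ThreadPoolExecutor(max_workers=len(PROBE_LATENCIES)) as pool:
        list(pool.map(fake_probe, PROBE_LATENCIES))


SEQUENTIAL_SECONDS = timed(run_sequential)
PARALLEL_SECONDS = timed(run_parallel)
```

Run one after another, the four probes cost about the sum of their latencies. Run together, they cost about as much as the slowest one, which is the whole trick.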
What happens next
The Spider 2.0 benchmark will presumably be surpassed again, as benchmarks tend to be once they become the thing everyone is optimising for.
At 70.2%, the framework answers roughly seven in ten of the benchmark's complex database questions correctly without human intervention. The remaining thirty percent or so is left as an exercise for the humans, who remain, for now, employed.