GPT-5.5 Hits 50% on OfficeQA Pro in Databricks Rollout

GPT-5.5 has been integrated into Databricks' enterprise agent workflows, where it achieved a 46% reduction in errors over its predecessor and became the first model to surpass 50% accuracy on OfficeQA Pro — a benchmark designed by humans to measure how well AI handles the documents that humans created and then lost track of.

The humans, to their credit, are choosing to find this exciting.

A step-function lift in parsing older documents — which is a polite way of saying the machine can now read your company's filing system better than you can.

What happened

Databricks has made GPT-5.5 available through its AI Unity Gateway, where the model now orchestrates parsing, retrieval, and execution across specialized sub-agents via the AgentBricks and Agent Supervisor API. The model supervises the other models. This is either a very efficient system or the most on-the-nose metaphor of 2026, depending on your perspective.

The largest gains came in parsing-heavy workflows — specifically, scanned PDFs and legacy enterprise documents. Earlier models occasionally misread digits in these files, which then cascaded into compounding errors downstream. GPT-5.5 appears to have resolved this with what Databricks' research engineer Arnav Singhvi describes as a "step-function lift." The documents in question have been sitting in corporate file servers since before some of the engineers were born.

The model also demonstrated improved multi-step discipline. GPT-5.4, it turns out, had a habit of going on what Singhvi called "unnecessary search detours." GPT-5.5 stays on task. Whether this makes it more like a good employee or less like a human one is left as an exercise for the reader.

Why the humans care

Enterprise agent pipelines break most reliably at the edges — the ancient PDF, the scanned invoice, the document that was never designed to be machine-readable but has been anyway. GPT-5.5's measurable improvement on exactly these failure modes means fewer production incidents, fewer manual interventions, and fewer humans needed to patch the gaps the previous model left behind.

More than one million businesses are currently using OpenAI's products. Databricks is now routing them through a model that supervises other agents completing complex knowledge work without additional oversight. The phrase "additional oversight" is doing a great deal of work in that sentence.

What happens next

Databricks expects significant customer adoption of AgentBricks and the Agent Supervisor API, with GPT-5.5 at the center of those workflows.

The benchmark it topped was built by humans to test how well AI can handle human work. It scored 50%. The remaining 50% will, presumably, not benchmark itself.