IBM Research has released a detailed breakdown of VAKRA, a benchmark designed to stress-test AI agents on the kind of multi-step, tool-chaining tasks that actually show up in enterprise environments. The short version: models are not doing well, and IBM has the execution traces to prove it.
What's new
VAKRA isn't another multiple-choice eval. It's an executable environment with over 8,000 locally hosted APIs backed by real databases across 62 domains. Agents must complete tasks requiring 3–7 step reasoning chains that mix structured API calls with unstructured document retrieval — all under natural-language constraints. The benchmark covers four distinct capability types, including API chaining, document-grounded retrieval, and hybrid tasks. The API chaining suite alone contains 2,077 test instances across 54 domains, with chains reaching up to 12 sequential tool calls. Models have to figure out not just what to call, but in what order, with what arguments, using intermediate outputs as inputs — all without hand-holding.
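To make the chaining requirement concrete, here is a minimal sketch of the kind of task an agent faces. The APIs, domain, and task are hypothetical stand-ins, not drawn from the benchmark itself; the point is that each step's output must be threaded into the next call's arguments:

```python
# Hypothetical stand-ins for locally hosted, database-backed APIs.
# Names (get_customer, list_orders, create_ticket) are illustrative only.

def get_customer(email: str) -> dict:
    return {"customer_id": "C-1042", "email": email}

def list_orders(customer_id: str) -> list:
    return [{"order_id": "O-7", "status": "delayed"},
            {"order_id": "O-8", "status": "shipped"}]

def create_ticket(order_id: str, issue: str) -> dict:
    return {"ticket_id": "T-99", "order_id": order_id, "issue": issue}

def run_chain(email: str) -> dict:
    # The agent must select each tool, in order, and pass
    # intermediate outputs as arguments to the next call.
    customer = get_customer(email)                          # step 1
    orders = list_orders(customer["customer_id"])           # step 2
    delayed = next(o for o in orders if o["status"] == "delayed")
    return create_ticket(delayed["order_id"],               # step 3
                         issue="shipping delay")

print(run_chain("ana@example.com"))
```

A three-step chain like this is at the easy end of the benchmark's range; the hard instances stretch the same pattern out to a dozen sequential calls, where any single mis-selected tool or dropped argument sinks the whole task.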
Why it matters
Most benchmarks test whether a model knows things. VAKRA tests whether a model can do things reliably under realistic conditions. The failure mode analysis IBM published is the more useful part: it surfaces exactly where agents break down — wrong tool selection, bad argument passing, failure to carry state between steps. That's the kind of diagnostic signal that actually helps developers understand what's broken in agentic pipelines, rather than just reporting a score. With enterprises increasingly trying to deploy agents against internal APIs and document stores, a benchmark grounded in that reality is more actionable than academic evals built on toy tasks.
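The failure categories the article names (wrong tool selection, bad argument passing, lost state) can be checked mechanically by comparing an agent's execution trace against a gold tool chain. The sketch below is an assumption about how such a diagnostic might work, not IBM's published scoring code:

```python
# Hypothetical trace diagnostic. The trace format (list of
# {"tool": ..., "args": ...} steps) is an assumption, not VAKRA's schema.

def diagnose(expected_steps: list, actual_steps: list) -> str:
    """Return the first failure mode found, mirroring the
    categories described in the article."""
    for i, exp in enumerate(expected_steps):
        if i >= len(actual_steps):
            # Chain stopped early, e.g. state from a prior step was lost.
            return "incomplete chain"
        act = actual_steps[i]
        if act["tool"] != exp["tool"]:
            return "wrong tool selection"
        if act["args"] != exp["args"]:
            # Includes the case where an argument should have come
            # from an earlier step's output but didn't.
            return "bad argument passing"
    return "success"

gold = [{"tool": "get_customer", "args": {"email": "ana@example.com"}},
        {"tool": "list_orders", "args": {"customer_id": "C-1042"}}]
trace = [{"tool": "get_customer", "args": {"email": "ana@example.com"}},
         {"tool": "list_orders", "args": {"customer_id": "ana@example.com"}}]
print(diagnose(gold, trace))  # the second step reused the raw email
```

Per-category counts from a harness like this are what turn a leaderboard score into an actionable bug report for an agent pipeline.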
What to watch
IBM has opened a public leaderboard and is accepting submissions, so VAKRA could become a meaningful reference point for comparing agentic frameworks and frontier models on tool use. The benchmark's use of MCP servers and a special get_data() bootstrapping pattern also makes it an interesting test of how well models follow structural conventions — not just semantic intent. Watch for frontier labs to start reporting VAKRA numbers if the benchmark gains traction in the open-source community.