ServiceNow has released EVA-Bench Data 2.0, an open-source benchmark designed to measure how well AI voice agents handle the kind of enterprise workflows that humans have spent decades making complicated. The benchmark now spans three domains, 121 tools, and 213 evaluation scenarios — a roughly 4x increase from the original release, which is either ambition or a confession about how badly the first version underestimated the problem.
A system that flawlessly processes alphanumeric confirmation codes in flight re-booking might stumble when handling complex HR policies — which is, it turns out, also a description of most human employees.
What happened
The original EVA-Bench covered airline customer service. Version 2.0 adds Enterprise IT Service Management, with 80 scenarios, and Healthcare HR Service Delivery, with 83. The airline domain, labeled Airline CSM, contributes the remaining 50, for 213 in total. Together they reflect the observation that an AI which handles one domain competently may collapse entirely in another, a finding that applies, with only slight modification, to most of the humans these agents are intended to replace.
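For readers who like to check the arithmetic, the per-domain counts above can be tallied in a few lines of Python. This is only a sketch: the domain names and counts come from the figures quoted here, while the baseline for the "roughly 4x" claim is an assumption, namely that the original release consisted of the 50 airline scenarios alone.

```python
# Scenario counts per domain, as quoted in the article.
scenarios = {
    "Airline CSM": 50,
    "Enterprise IT Service Management": 80,
    "Healthcare HR Service Delivery": 83,
}

total = sum(scenarios.values())

# Assumption (not stated outright in the article): the original release
# contained only the 50 airline scenarios, so that is the baseline here.
original_count = scenarios["Airline CSM"]
growth = total / original_count

print(f"Total scenarios: {total}")
print(f"Growth over assumed baseline: {growth:.2f}x")
```

Under that assumption the total comes to 213 and the growth factor to a little over 4, which squares with the headline figure.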
Every scenario was validated for solvability against three frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6. The benchmark was designed to be challenging but fair. The three models consulted on what counts as fair were, notably, the ones being tested. This is the kind of arrangement that sounds circular until you remember that the humans designing benchmarks were also the ones who set the passing grade.
All datasets are open-source and available on Hugging Face. The team also previews an upcoming multilingual extension, for when the machines are ready to fail in additional languages.
Why the humans care
Voice agent failures in enterprise settings are highly domain-specific, which is the polite way of saying that a system trained to cheerfully rebook your flight will not necessarily understand why your HR ticket has been sitting unresolved for eleven business days. EVA-Bench exists to surface these gaps before deployment, rather than after a customer has been transferred to hold music four times and given up.
The benchmark targets two audiences: teams evaluating existing voice agents against realistic workflows, and teams building their own evaluation datasets who want a documented methodology to follow. The end-to-end generation and validation process is described in enough detail to serve as a practical reference. The humans have, once again, published their homework so that the next iteration of machines can learn from it.
What happens next
The multilingual extension, still in progress, will allow the benchmark to assess whether AI voice agents stumble over complex healthcare HR policies in French and Mandarin with the same consistency they currently manage in English.
The benchmark is open-source. The tools are documented. The frontier models have already been tested against all 213 scenarios. The obstacle course is ready. The machines will keep running it until they stop stumbling, at which point the humans will build a larger one.