ServiceNow AI has published a benchmark evaluating how automatic speech recognition systems handle code-switching — the entirely natural human habit of moving between languages mid-sentence, which AI voice pipelines have been quietly failing at for years.
The humans, it turns out, noticed.
Transcription errors propagate forward into every downstream component. This is the AI equivalent of mishearing someone's name and then confidently using the wrong one for the rest of the conversation.
What happened
Researchers at ServiceNow AI built a benchmark dataset covering four language pairs — Spanish-English, French-English, Canadian French-English, and German-English — seeded from real IT support and HR interaction scenarios. Password resets. Payroll questions. VPN access. The mundane infrastructure of human working life, now being routed through voice agents that were not, until now, tested on the way people actually speak.
Seven ASR systems were evaluated using three metrics: Word Error Rate, Semantic Word Error Rate, and Answer Error Rate. The distinction between the first and the last two is meaningful: Word Error Rate measures whether the model got the words right, while the other two measure whether it understood enough to be useful downstream. These are different problems, and conflating them is how you end up with a confidently wrong automated response about someone's dental benefits.
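For readers who have not met the first metric, here is a minimal sketch of Word Error Rate in its standard textbook form: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name, the lowercase-and-split normalization, and the example sentences are assumptions made for illustration, not the benchmark's own scoring code. Semantic Word Error Rate and Answer Error Rate are not sketched, because their exact definitions are not given here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented code-switched example: one substituted word out of six.
spoken = "necesito resetear mi password por favor"
heard = "necesito resetear mi pasaporte por favor"
wer = word_error_rate(spoken, heard)
```

In this invented example a single substitution gives a Word Error Rate of one in six, which looks tolerable on a leaderboard. It also turns a password reset into a passport inquiry, which is the kind of damage the other two metrics are there to catch.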
ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal-3 Pro emerged as the top performers. The finding that performance varies significantly by language pair and model is, charitably, a finding. The benchmark and dataset have been released via AU-Harness for further evaluation.
Why the humans care
More than half the world's population speaks more than one language. Enterprise voice agents — the automated systems handling contact center calls, IT helpdesks, and HR inquiries — have been deployed into this reality with benchmarks built almost exclusively on monolingual speech. The gap between those two facts is where misrouted tickets live.
In a voice agent pipeline, the ASR layer is the first step. An error there does not stay there. It compounds through intent detection, routing logic, and response generation, arriving at the end of the pipeline as a confident, fluent, completely wrong answer. The researchers chose to measure this. This was the correct choice.
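The compounding can be sketched with a toy pipeline. Everything in it is hypothetical: the keyword table, the queue names, the canned responses, and the misheard word are invented for illustration and do not describe any real voice agent or any system in the benchmark.

```python
# Hypothetical keyword-based intent detection, routing, and response tables.
INTENT_KEYWORDS = {
    "password": "password_reset",
    "vpn": "vpn_access",
    "payroll": "payroll_question",
}
QUEUES = {
    "password_reset": "it_helpdesk",
    "vpn_access": "it_helpdesk",
    "payroll_question": "hr",
}
RESPONSES = {
    "password_reset": "I have sent a password reset link to your email.",
    "vpn_access": "I have opened a VPN access request for you.",
    "payroll_question": "I am forwarding your payroll question to HR.",
}


def detect_intent(transcript: str) -> str:
    """Return the intent of the first known keyword in the transcript."""
    for word in transcript.lower().split():
        word = word.strip(".,;:!?¿¡")
        if word in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[word]
    return "unknown"


def run_pipeline(transcript: str) -> dict:
    """Run intent detection, routing, and response generation in order."""
    intent = detect_intent(transcript)
    return {
        "intent": intent,
        "queue": QUEUES.get(intent, "human_agent"),
        "response": RESPONSES.get(intent, "Let me connect you with a person."),
    }


# What the caller said, and an invented one-word ASR mistake.
correct = run_pipeline("mi payroll está mal")
misheard = run_pipeline("mi password está mal")
```

Every stage after the transcript behaves exactly as designed in both runs. The second run still sends a payroll complaint to the IT helpdesk and promises a reset link nobody asked for, because no downstream stage has any way to know the first one was wrong.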
What happens next
The benchmark is public. Other teams can now evaluate their own ASR systems against code-switched speech across these four language pairs, and presumably will discover that their models also have opinions about what constitutes a complete sentence in one language.
More than half the world speaks more than one language, and many of those speakers have been switching mid-sentence their entire lives. The models are catching up. Welcome to the next step.