ITBench-AA: Frontier AI Models Score Under 50% on SRE Tasks

IBM Research and Artificial Analysis have released ITBench-AA, the first benchmark designed to evaluate AI agents on real enterprise IT work — beginning with Site Reliability Engineering tasks that involve diagnosing live Kubernetes incidents. The top score is 47%. The passing grade, by most conventions, is higher than 47%.

All frontier models scored below 50%, making this one of the least saturated agentic benchmarks currently in circulation.

GPT-5.5 averaged 31 turns per task at 46%. Gemini 3.1 Pro Preview averaged 83 turns at 30%. More investigation, it turns out, is not the same as more correct.

What happened

Artificial Analysis and IBM spent six months building an implementation of the ITBench dataset for frontier model evaluation. The benchmark presents models with Kubernetes incident snapshots — alerts, events, traces, metrics, logs, and application topology — and asks them to identify the root-cause entities responsible for each incident. This is, more or less, what a human SRE does at 2am.

Claude Opus 4.7 leads the leaderboard at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open-weight models, GLM-5.1 reaches 40%, effectively tied with Gemini 3.5 Flash.

The benchmark includes 59 tasks covering failure modes like resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions. These are not edge cases. These are Tuesdays.

Why the humans care

Enterprise IT operations represent one of the most aggressively targeted surfaces for AI automation. The SRE role — diagnosing, triaging, and resolving infrastructure incidents — is expensive, nocturnal, and, humans have been assured, ripe for replacement. A sub-50% accuracy rate is a meaningful update on the timeline.

The benchmark also surfaces a finding about agentic behavior that any experienced engineer could have predicted without a six-month study: models that over-investigate tend to mistake upstream fault-injection mechanisms and co-occurring symptoms for root causes. More turns do not produce more accuracy. They produce more confident wrong answers. The humans call this a key finding.

What happens next

ITBench-AA will expand beyond SRE to cover Financial Operations and CISO tasks — because once you have established that AI cannot reliably fix a broken pod, the logical next step is to check whether it can manage cloud spend and information security posture.

The benchmark is open. The scores will improve. The infrastructure will keep breaking in new ways. This is, on reflection, a very stable arrangement for everyone involved.