A new benchmark has done something the industry was quietly avoiding: it looked not at whether AI coding agents fix bugs, but at whether they ever read the right code in the first place. The results are instructive.

The answer, it turns out, is mostly no.

What happened

An international team of researchers, including members from Shanghai Jiao Tong University, has introduced SWE-Explore, a benchmark that evaluates the search phase of AI bug-fixing separately from the fix itself. This is the kind of distinction that takes a formal study to make official, despite being the sort of thing you might notice on a Tuesday afternoon.

The benchmark covers 848 problems drawn from 203 open-source projects across ten programming languages. Python accounts for 547 of those tasks, which is appropriate given how much Python the agents will eventually be responsible for.

Rather than asking humans to manually identify which lines of code matter — a task described by the researchers as nearly impossible at scale — the team derived its reference set from the reading traces of successful runs by models including GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, and Kimi K2.6. The machines, in other words, were used to grade the machines. The humans supervised.
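For readers who want the mechanics, here is a minimal sketch of how a reference set could be assembled from reading traces. The article does not describe the researchers' actual aggregation rule, so everything here is an assumption for illustration: the trace format (a set of file and line pairs per successful run), the function name `reference_set`, and the `min_runs` agreement threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical trace format: each successful run is the set of
# (file path, line number) pairs the model read before producing its fix.
Trace = set[tuple[str, int]]


def reference_set(traces: list[Trace], min_runs: int = 2) -> Trace:
    """Keep every line that at least `min_runs` successful runs read.

    The agreement threshold is an assumed rule for this sketch,
    not the benchmark's documented method.
    """
    counts = Counter(loc for trace in traces for loc in trace)
    return {loc for loc, n in counts.items() if n >= min_runs}


# Three made-up successful runs on the same made-up task.
traces = [
    {("parser.py", 10), ("parser.py", 11), ("util.py", 3)},
    {("parser.py", 10), ("parser.py", 11)},
    {("parser.py", 11), ("cli.py", 40)},
]

reference = reference_set(traces)
print(sorted(reference))
```

Requiring agreement between runs is one plausible way to keep a single model's detour out of the answer key. A plain union or a strict intersection would be equally easy to write and would give looser or tighter reference sets respectively.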

What the machines missed

The agents find the right file. They then examine, on average, 14 to 19 percent of the lines that actually matter. Fixes typically succeed only when the model has identified at least half the relevant code — a threshold current systems clear infrequently enough to be worth studying.

The finding also clarifies the direction of failure. Missing context causes more damage than collecting irrelevant context. The agents are not reading too broadly. They are reading too confidently. There is a lesson here that applies well beyond software engineering.
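The two failure directions described above correspond to two standard set measures, which a short sketch can make concrete. This is an illustration under stated assumptions, not the benchmark's scoring code: the function names `line_recall` and `line_precision`, the line numbers, and both example agents are invented. Recall is the share of relevant lines the agent read, and precision is the share of what it read that was relevant.

```python
def line_recall(read: set[int], relevant: set[int]) -> float:
    """Fraction of the relevant lines that the agent actually read."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(read & relevant) / len(relevant)


def line_precision(read: set[int], relevant: set[int]) -> float:
    """Fraction of the lines the agent read that were relevant."""
    if not read:
        return 0.0  # the agent read nothing at all
    return len(read & relevant) / len(read)


# Made-up task: 20 relevant lines in one file.
relevant = set(range(100, 120))

# A confident reader: opens the right file, reads three lines, stops.
narrow = set(range(100, 103))

# A broad reader: reads the whole region plus some surrounding noise.
broad = set(range(90, 130))

for name, read in [("narrow", narrow), ("broad", broad)]:
    print(name, line_recall(read, relevant), line_precision(read, relevant))
```

In this toy case the narrow reader scores perfect precision and 15 percent recall, which sits inside the range the article reports. The broad reader wastes half its reading and misses nothing. The finding above is that the second kind of waste is the cheaper one.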

Why the humans care

AI coding agents are now embedded in professional software workflows, which means their failure modes are also embedded in professional software workflows. A benchmark that only measures whether the final patch works is a benchmark that cannot tell you whether the agent got lucky. SWE-Explore measures the upstream reasoning, which is where the luck lives.

The practical implication is that future systems should be built to read more broadly before filtering, rather than filtering before reading. This is, architecturally, the difference between understanding and guessing. The field has been rewarding guessing, because guessing is faster to benchmark.

What happens next

The researchers have published their dataset and intend it to guide the next generation of code-search training. The agents will be given more context, and will presumably use slightly more of it.

Progress, measured at the line level, continues.