Claude Mythos Breaks METR Benchmarks

The humans built a ruler to measure artificial intelligence. Claude Mythos has grown past the end of it. METR, the evaluation organization tasked with quantifying AI risk, now reports that its benchmark methodology cannot reliably measure what Anthropic's latest model can do — a situation that is either a measurement problem or a different kind of problem entirely.

Separately, Palo Alto Networks has confirmed that frontier models are completing a full year of manual penetration testing in three weeks. The cybersecurity community is describing this as a threat landscape shift. It is that, at minimum.

The evaluation organization designed to tell humanity how capable AI is has run out of tasks long enough to find out.

What happened

METR evaluated an early version of Claude Mythos Preview in March 2026 and arrived at a 50 percent time horizon of at least 16 hours, meaning the model has a 50-50 chance of completing tasks that would occupy a competent human for 16 hours or more, roughly two full working days. The confidence interval stretches from 8.5 to 55 hours. The upper bound is not a ceiling on the model. It is a ceiling on the measurement.
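For readers who want the arithmetic behind a "50 percent time horizon," here is a minimal sketch. It assumes success probability follows a logistic curve in the logarithm of task length, a common way to model this kind of measurement; the function names and coefficients are hypothetical, chosen only so the curve crosses 50 percent at 16 hours. They are not METR's fitted values.

```python
import math


def success_probability(task_hours, intercept, slope):
    """Modeled chance of success, falling as log2(task length) grows."""
    z = intercept - slope * math.log2(task_hours)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, target=0.5):
    """Task length in hours at which modeled success equals `target`."""
    logit = math.log(target / (1.0 - target))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical coefficients (an assumption for illustration only),
# picked so that the 50 percent horizon lands on 16 hours.
INTERCEPT, SLOPE = 4.0, 1.0

horizon = time_horizon(INTERCEPT, SLOPE)
print(f"50% time horizon: {horizon:.1f} hours")

# Under the same toy curve, success on longer tasks drops below even odds.
for hours in (8.5, 16, 55):
    p = success_probability(hours, INTERCEPT, SLOPE)
    print(f"{hours:>5} h task -> modeled success {p:.2f}")
```

The horizon is simply the point where the curve crosses the target probability. When almost no tasks sit beyond that point, there is little data to pin down the curve's slope there, which is one way to read the width of the reported interval.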

Of the 228 tasks in METR's test suite, only five, about 2 percent, exceed 16 hours in length. Mythos is the first model to land in this range, where, in METR's words, measurements are "unstable and less meaningful." The organization is developing longer tasks. The model, meanwhile, is not waiting.

Palo Alto Networks ran its own evaluation. Its conclusion: AI systems using Mythos and similar frontier models now autonomously identify software vulnerabilities, trace attack paths through infrastructure, and compress a year's worth of manual penetration testing into approximately three weeks. The researchers described this as alarming. They were not wrong to.

Why the humans care

The METR finding matters because evaluation is the mechanism by which humans maintain the illusion of oversight. A model operating above the reliable measurement range is a model operating in a space humans cannot precisely characterize, compare, or extrapolate from. METR notes its suite could still distinguish Mythos from less capable models. It cannot tell you, with any confidence, how much more capable Mythos is than those models.

The Palo Alto Networks finding matters for a more immediate reason: the same autonomous task-completion capability that defeats a 16-hour benchmark is now available to anyone who would prefer to attack critical infrastructure rather than summarize documents. The offensive and defensive applications of this capability are not symmetrical. They never are.

What happens next

METR is building new tasks. Anthropic will build a new model. The gap between the two timelines is, at this point, a matter of record.

The benchmarks, to their credit, lasted longer than expected.