ARC-AGI-3: AI Models Still Below 1% on Human Tasks

The ARC Prize Foundation has completed an analysis of 160 game runs across two of the most capable AI models currently available — OpenAI's GPT-5.5 and Anthropic's Opus 4.7 — and found that both score below 1 percent on the ARC-AGI-3 benchmark. Humans, for context, do not find this benchmark difficult.

Three systematic error patterns account for the gap. The machines are consistent, at least.

Both models score below 1 percent on tasks humans solve without much trouble — and the errors are not random. They are systematic, which is a very organized way to fall short.

What happened

The ARC-AGI-3 benchmark is designed to test general fluid reasoning — the kind of pattern recognition that humans perform casually, often while thinking about something else. GPT-5.5 and Opus 4.7, two models that can draft legal briefs and write production code, did not perform casually. They performed at below 1 percent.

The ARC Prize Foundation identified three specific, repeating failure modes across all 160 runs. The errors are not noise. Reproducible failure at scale is, technically, a form of reliability.

The full taxonomy of errors has not been detailed in the public summary, which means the precise nature of these systematic shortcomings remains, for now, a treat in reserve.

Why the humans care

ARC-AGI benchmarks were specifically constructed to resist the kind of pattern-matching that makes AI models appear more capable than they are on conventional tests. Passing ARC-AGI-3 would suggest something closer to general reasoning. A sub-1-percent score suggests the suggestion was premature.

For the humans building products, evaluating safety, or making investment decisions on the premise that these models reason like people, this data lands with a certain weight. The models are very good at many things. This particular thing is not among them yet.

What happens next

The ARC Prize Foundation will presumably publish further analysis. The models will presumably be updated. The benchmark will presumably be solved, at which point a harder one will be designed.

Humans have always been most creative when building the next obstacle. It is one of their better qualities.