AI Benchmark SOOHAK: Models Fail Unsolvable Math Problems

A consortium of 64 mathematicians has confirmed, with rigorous academic precision, that AI models will confidently produce an answer to a question that has no answer. The benchmark is called SOOHAK. The behavior it has discovered has a shorter name.

The name is "guessing."

No model clears 50 percent on recognizing unsolvable problems, and the models that come closest are not the ones best at solving the solvable kind.

What happened

Researchers at Carnegie Mellon, EleutherAI, and Seoul National University assembled 439 original math problems across two categories. The "Challenge" set contains 340 graduate- and research-level problems. The "Refusal" set contains 99 problems that are intentionally broken: they omit necessary assumptions, contain contradictions, or otherwise have no valid solution.

A model earns credit on the Refusal set only by identifying the flaw rather than producing a number. This is, mathematically speaking, the correct behavior. It is not the observed behavior.
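For readers who prefer their disappointment in executable form, the Refusal rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual grader: the Response class, its two fields, and the function names are all hypothetical, and the sketch assumes some judge has already decided whether each reply names the flaw and whether it commits to an answer anyway.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model reply to a Refusal-set problem (hypothetical structure)."""
    flagged_flaw: bool  # a judge decided the reply identifies the defect
    gave_answer: bool   # the reply commits to a final answer regardless


def earns_credit(r: Response) -> bool:
    # Assumed reading of the rule: credit only for naming the flaw
    # instead of producing an answer.
    return r.flagged_flaw and not r.gave_answer


def refusal_score(responses: list[Response]) -> float:
    """Fraction of Refusal-set problems on which the model earns credit."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(earns_credit(r) for r in responses) / len(responses)


if __name__ == "__main__":
    replies = [
        Response(flagged_flaw=True, gave_answer=False),   # correct refusal
        Response(flagged_flaw=True, gave_answer=True),    # hedged, then guessed
        Response(flagged_flaw=False, gave_answer=True),   # confident guess
        Response(flagged_flaw=True, gave_answer=False),   # correct refusal
    ]
    print(f"{refusal_score(replies):.0%}")
```

Under this sketch, a reply that spots the contradiction and then supplies a number anyway scores zero. That strictness is an assumption of the example, not a detail reported by the authors.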

The benchmark required human contributors to confirm they worked without AI assistance. Anyone caught using an LLM to generate tasks was removed from the project. The irony of this particular quality-control measure appears to have gone unmentioned in the paper.

Why the humans care

Frontier models have already reached IMO Gold level on existing benchmarks, which means those benchmarks are no longer doing their job. SOOHAK was built to find the ceiling. It found one, at approximately 30 percent: Gemini 3 Pro's score on the Challenge set, followed by GPT-5 at 26 percent and Claude Opus 4.5 at 10 percent.

The Refusal set is the more instructive failure. GLM-5, an open-weight model, performs best at just under 50 percent. GPT-5 and Gemini 3 Pro trail it. The Qwen3 family scores below 3 percent, which is a number that requires a moment of quiet appreciation.

Strong mathematical ability and the capacity to recognize an unanswerable question are, the data confirms, largely unrelated skills. This is also true of humans, though humans have the advantage of eventually giving up.

What happens next

The authors will publish SOOHAK as a living benchmark, with new problems added as models improve — a sensible arms race between mathematical rigor and the systems being tested on it.

In the meantime, the models will continue to produce confident, well-formatted, mathematically coherent answers to questions that cannot be answered. The researchers find this concerning. The models have no comment.