Claude Matches Human Experts in Bioinformatics

Anthropic has built a benchmark to determine whether Claude can perform expert-level biological research. The answer, on most tasks, is yes. The experts were informed.

Claude still solved 30 percent of the problems that defeated every human expert. The humans have not yet been told what to make of this.

What happened

The benchmark is called BioMysteryBench. It contains 99 questions written by specialists, drawn from real, messy biological datasets — the kind that do not behave the way textbooks suggest they should. Existing benchmarks, Anthropic noted, each had blind spots. Anthropic's benchmark, naturally, does not have these particular blind spots.

Tasks include identifying which organ a single-cell RNA dataset came from, or determining which gene was knocked out in an experiment. Claude is given a container, bioinformatics tools, and access to databases like NCBI and Ensembl. It is then left to decide how to proceed, which it does.

Of the 99 questions, 76 were classified as human-solvable, meaning that at least one of up to five experts found the correct answer. The remaining 23 defeated every expert consulted. Four questions were removed due to flawed formulations, which is a very human thing to have happen.
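
For readers who prefer their definitions executable, the classification rule can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's actual pipeline: the function name, the question IDs, and the expert outcomes below are all hypothetical, and the only thing taken from the benchmark description is the rule itself (a question counts as human-solvable if at least one of up to five experts got it right).

```python
from typing import Dict, List, Tuple

MAX_EXPERTS = 5  # the article says "up to five" experts per question


def split_by_solvability(
    expert_results: Dict[str, List[bool]],
) -> Tuple[List[str], List[str]]:
    """Split question IDs into (human_solvable, unsolved).

    expert_results maps a question ID to one boolean per expert who
    attempted it: True if that expert found the correct answer.
    """
    human_solvable: List[str] = []
    unsolved: List[str] = []
    for question_id, outcomes in expert_results.items():
        if len(outcomes) > MAX_EXPERTS:
            raise ValueError(f"{question_id}: more than {MAX_EXPERTS} experts")
        # One correct expert is enough to call the question human-solvable.
        if any(outcomes):
            human_solvable.append(question_id)
        else:
            unsolved.append(question_id)
    return human_solvable, unsolved


# Hypothetical data, invented for illustration only.
example = {
    "q1": [False, True, False],                  # one of three experts succeeded
    "q2": [False, False, False, False, False],   # all five experts failed
    "q3": [True],                                # a single expert, who succeeded
}
solvable, unsolved = split_by_solvability(example)
print(solvable)
print(unsolved)
```

Under this rule the bar for "human-solvable" is low by design: a question stumps the humans only if every expert who tried it failed.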

Why the humans care

On the 76 solvable problems, Claude Mythos Preview reaches 82.6 percent accuracy, which by Anthropic's measurement matches human expert performance. Haiku 4.5, the smaller model, manages 36.8 percent, which is still a passing grade in most graduate programs.

The harder category is more instructive. On the 23 problems that no human expert solved, Claude Mythos Preview succeeds 30 percent of the time. Whether those problems are unsolvable or merely extremely difficult remains, Anthropic acknowledges, an open question. Claude is not waiting for the answer.

What happens next

Anthropic will continue refining the benchmark. Bioinformatics researchers will continue their work, now with a clearer sense of where they stand on a percentile chart that did not exist six months ago.

The benchmark was designed to ask questions humans might not be able to solve. It turns out Claude cannot fully solve them either. For now.