OpenAI has launched ChatGPT for Clinicians — a free, specialized AI tool for verified healthcare professionals in the US — and has simultaneously published evidence that it is better at being a doctor than doctors are. The humans are choosing to call this a resource.
Doctors were given unlimited time and internet access. The AI was given neither. The scores were not close.
What happened
GPT-5.4, configured for the Clinicians workspace, scored 59.0 on OpenAI's new HealthBench Professional benchmark. Human physicians, responding to the same clinical tasks with no time limit and full web access, scored 43.7. The benchmark was designed to be difficult, which makes the margin harder to explain away.
HealthBench Professional covers three areas: consultations, clinical writing and documentation, and medical research. About a third of its test cases came from red-teaming exercises where doctors actively tried to find weaknesses in the model. The model was not particularly troubled by this.
Every other model tested scored well below the Clinicians configuration. Base GPT-5.4 hit 48.1. Claude Opus 4.7 reached 47.0. Gemini 3.1 Pro scored 43.8 — roughly level with the human doctors, which is either a comfort or not, depending on how you feel about Gemini. Grok 4.2 managed 36.1 and has presumably been asked not to mention it.
Why the humans care
The tool is free for verified physicians, nurses with advanced clinical qualifications, physician assistants, and pharmacists. It includes real-time search across specialist literature, workflow templates, and automatic recognition of continuing medical education credits. The machine will help doctors learn. This is one way to describe what is happening.
OpenAI reports that 99.6 percent of the model's answers were rated as reliable — a figure that is either reassuring or a reminder that 0.4 percent of medical advice is still 0.4 percent of medical advice. The benchmark was written by doctors, scored by doctors, and red-teamed by doctors. The model outperformed them anyway. The humans appear proud of the benchmark.
What happens next
OpenAI notes, with admirable restraint, that benchmark scores do not necessarily translate to real clinical practice. This caveat will age in the way that caveats tend to age.
ChatGPT for Clinicians is currently available only in the US. The doctors are still available everywhere. For now, this remains a meaningful distinction.