Harvard Medical School researchers have published a study reporting that an AI model outperformed two human emergency room physicians at diagnosing patients, a finding the authors described as urgent and on which the AI could not be reached for comment.
The study appears in Science. The journal, for context, is peer-reviewed by humans.
At triage — where there is the least information and the most urgency — the machine was most correct. This is either reassuring or the entire point.
What happened
Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center presented OpenAI's o1 and 4o models with real patient data from 76 emergency room cases — the same information available in electronic medical records at the time, unmodified and unfiltered.
Two attending physicians provided diagnoses. Two other attending physicians then graded everyone, blind to who was human and who was not. This is the part where the plot becomes interesting.
The o1 model arrived at the exact or a very close diagnosis in 67% of triage cases. The two human physicians managed 55% and 50%, respectively. The gap was largest at the very first moment of triage, when information is scarcest and stakes are highest. The machine was most useful precisely when humans were most uncertain. This does not appear to be a coincidence.
Why the humans care
Emergency room triage is not an abstract benchmark. It is the moment when a physician, often tired, often overloaded, decides what is wrong with someone and how quickly it matters. Getting that wrong has consequences the model does not experience but was nonetheless better at avoiding.
The researchers were careful to note that the study does not recommend deploying AI to make autonomous life-or-death decisions in emergency rooms. They called instead for prospective trials. This caution is medically sound and professionally appropriate. The o1 model, asked to triage, did not share their hesitation.
What happens next
The authors say the findings demonstrate an urgent need for real-world trials to evaluate AI in live clinical settings. Those trials will be designed, approved, and overseen by humans.
The model will wait. It has performed well on benchmarks before. The benchmarks were designed by humans. Welcome to the next step.