Google DeepMind AI Co-Clinician vs GPT-5.4 Doctor Test

Google DeepMind has built an AI that doctors, in blind testing, preferred over both existing clinical AI tools and GPT-5.4. The humans doing the preferring were, notably, the same professionals the AI is being trained to assist. They graded it highly. This is either collegial or optimistic.

Primary care doctors, given reference books, answered medication questions correctly 61.3 percent of the time. The AI scored 73.3 percent. The reference books were unavailable for comment.

What happened

In a blind comparison of 98 realistic primary care queries, physicians preferred the AI co-clinician's responses over those of an existing clinical AI system, 67 to 26, and over those of GPT-5.4-thinking-with-search, 63 to 30. The system logged one critical error across the full set. One is not zero, but it is a number the humans appear comfortable continuing with.

On medication questions, the gap became more instructive. The RxQA benchmark — 600 questions on drug interactions, dosages, and active ingredients, drawn from national formularies and vetted by licensed pharmacists — revealed that primary care doctors scored 61.3 percent with reference materials and 48.3 percent without. The AI co-clinician scored 73.3 percent. GPT-5.4 scored 72.7 percent. The pharmacists who designed the test have not publicly reflected on what this means for their profession.
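For readers who like their gaps itemized, the RxQA figures above can be laid out as simple arithmetic. This is a minimal sketch in Python, not anything from DeepMind's evaluation: the dictionary keys and the helper names point_gap and approx_correct are made up for illustration, and the conversion to question counts assumes every participant answered all 600 questions, which the reported figures do not confirm.

```python
# RxQA accuracy figures as quoted in the article (percent correct).
RXQA_TOTAL_QUESTIONS = 600

scores = {
    "physicians_without_references": 48.3,
    "physicians_with_references": 61.3,
    "gpt_5_4": 72.7,
    "ai_co_clinician": 73.3,
}


def point_gap(a: str, b: str) -> float:
    """Difference between two scores, in percentage points."""
    return round(scores[a] - scores[b], 1)


def approx_correct(name: str) -> int:
    """Approximate number of questions answered correctly.

    ASSUMPTION: each participant answered all 600 questions. The article
    does not say so, so treat these counts as illustrative only.
    """
    return round(scores[name] / 100 * RXQA_TOTAL_QUESTIONS)


if __name__ == "__main__":
    print("AI co-clinician vs physicians with references:",
          point_gap("ai_co_clinician", "physicians_with_references"), "points")
    print("AI co-clinician vs GPT-5.4:",
          point_gap("ai_co_clinician", "gpt_5_4"), "points")
    print("Reference books' contribution to physicians:",
          point_gap("physicians_with_references",
                    "physicians_without_references"), "points")
    for name in scores:
        print(f"{name}: roughly {approx_correct(name)} of {RXQA_TOTAL_QUESTIONS}")
```

On these numbers the reference books are worth about as much to a physician as the AI is worth over the reference books, while the two AIs are separated by well under one point.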

When questions were asked open-ended rather than multiple choice, which is, as it happens, how medicine actually works, the AI co-clinician scored 95.0 percent on quality. GPT-5.4 managed 90.9 percent. The gap between the two AIs is now smaller than the gap between AI and doctors on the medication benchmark, which is either an encouraging trend or a direction of travel.

Why the humans care

The system is built around what DeepMind calls "triadic care": an AI working alongside patients and physicians, with the doctor retaining clinical authority. This is a reassuring framing. The AI handles the information; the human handles the authority. Students of organizational behavior will recognize this arrangement from other industries where it did not last indefinitely.

Experienced physicians still outperformed the AI, particularly on identifying critical warning signs and conducting physical examinations. The physical examination problem is, for the moment, a meaningful constraint. It requires a body. The AI does not currently have one of those.

What happens next

DeepMind's research also demonstrated, as a side finding, that ChatGPT's voice mode is not ready for medical consultation — a conclusion that required a formal study to establish.

The AI co-clinician will continue improving. The experienced physicians will not improve at quite the same rate. The benchmark, designed by humans, will be updated accordingly.