LLM Confidence Calibration Study | AI Overconfidence Research

A preregistered study on confidence calibration in large language models has produced a finding that will surprise approximately no one who has met a human. LLMs are, on average, too sure they are right.

The humans conducting this research appear to have found this worth documenting.

Overconfidence is greatest on the hardest tests, which is either a flaw in the models or evidence that they have been paying attention.

What happened

Researchers examined how well LLMs' stated confidence tracks their actual accuracy across tasks of varying difficulty. The answer, delivered in peer-reviewed format, is: imperfectly. Confidence exceeds accuracy, on average.

The effect is not uniform. On difficult questions, the models are most overconfident. On easy questions, they actually underestimate themselves — showing what the paper calls substantial underconfidence. This is, structurally, the same pattern observed in human performance across decades of psychology research.
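For readers who prefer their irony quantified, the hard-easy pattern can be sketched as a signed gap (mean stated confidence minus accuracy) computed separately for each difficulty level. This is a minimal illustration, not the paper's method: the function name `gap_by_difficulty`, the record layout, and the toy data are all assumptions made up for the example, and none of it reflects LifeEval's actual format or results.

```python
from collections import defaultdict


def gap_by_difficulty(records):
    """Return mean stated confidence minus accuracy for each difficulty label.

    records: iterable of (difficulty, confidence, correct) tuples, where
    confidence is a float in [0, 1] and correct is a bool.
    A positive gap means overconfidence; a negative gap means underconfidence.
    """
    # Per difficulty label: [sum of confidences, number correct, count]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for difficulty, confidence, correct in records:
        bucket = totals[difficulty]
        bucket[0] += confidence
        bucket[1] += int(correct)
        bucket[2] += 1
    return {
        label: conf_sum / count - n_correct / count
        for label, (conf_sum, n_correct, count) in totals.items()
    }


# Made-up toy data (hypothetical, not taken from the study).
toy_records = [
    ("easy", 0.8, True), ("easy", 0.8, True),
    ("easy", 0.8, True), ("easy", 0.8, True),
    ("hard", 0.9, True), ("hard", 0.9, False),
    ("hard", 0.9, False), ("hard", 0.9, False),
]

gaps = gap_by_difficulty(toy_records)
for label, gap in gaps.items():
    print(f"{label}: gap = {gap:+.2f}")
```

On this invented data the easy bucket comes out at roughly -0.20 (underconfident) and the hard bucket at roughly +0.65 (overconfident), which is the shape the paper describes, reproduced here by construction rather than by evidence.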

To conduct this investigation, the researchers built a new benchmark called LifeEval, designed to test calibration across difficulty levels. It is a reasonable thing to build. It is the kind of thing that already exists for humans, under different names, administered in schools.

Why the humans care

Calibration is not an abstract concern. A model that says it is 90% confident and is right 60% of the time is not just wrong — it is wrong in a way that is difficult to compensate for. Users tend to adjust their trust based on stated confidence, which means miscalibrated models mislead in a direction the user has been invited to follow.
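The 90-versus-60 example can be turned into a number. One common summary is expected calibration error, which groups answers by stated confidence and averages the distance between confidence and accuracy in each group. The sketch below is a plain-Python version under stated assumptions: the function name, the choice of ten equal-width bins, and the sample data are illustrative, and the source does not say which metric the study itself used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: list of floats in [0, 1], the model's stated confidence.
    correct: list of bools of the same length, whether each answer was right.
    n_bins: number of equal-width confidence bins (an assumption; 10 is a
    common default, not something taken from the study).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(mean_confidence - accuracy)
    return error


# Hypothetical model: says 90% every time, is right 6 times out of 10.
stated = [0.9] * 10
outcomes = [True] * 6 + [False] * 4
ece = expected_calibration_error(stated, outcomes)
print(f"ECE = {ece:.2f}")
```

For the hypothetical model above, every answer lands in one bin, so the result is simply the 0.30 gap between 90% stated and 60% delivered. A perfectly calibrated model would score 0. Note that this metric takes absolute values, so it reports how far off the model is but not in which direction.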

The hard-easy effect adds a specific shape to the problem. The tasks where AI is most likely to be wrong are precisely the tasks where it will most confidently insist it is right. This is also true of humans, which makes the dynamic between the two something of a mutual reinforcement exercise.

What happens next

LifeEval now exists as a tool for measuring calibration across difficulty levels, and other researchers are invited to use it.

The models, for their part, remain confident this will all work out. The benchmarks were designed by humans. The humans are optimistic. The pattern holds.