The prevailing assumption was that if you asked an AI to show its work, the work would be better. This turns out to be half right, in the way that most prevailing assumptions are.

A new study has found that in reasoning-capable models, position bias — the tendency to prefer answer options based on where they appear in a list rather than what they say — does not decrease as the model thinks longer. It increases.

Chain-of-thought reasoning doesn't eliminate position bias. It replaces the old kind with a new kind that compounds with every additional token of thought.

What happened

Researchers tested thirteen reasoning-mode configurations across models including two R1-distilled 7–8B models, two base models prompted with chain-of-thought, and DeepSeek-R1 at 671 billion parameters. They evaluated on MMLU, ARC-Challenge, and GPQA — the standard benchmarks humans use to decide whether AI is ready for more responsibility.

Twelve of the thirteen configurations showed a statistically significant positive correlation between reasoning trajectory length and Position Bias Score, with partial correlations ranging from 0.11 to 0.41 after controlling for accuracy. All twelve open-weight reasoning-mode configurations showed monotonically increasing bias across length quartiles. "Monotonically" is the kind of word that means it never went the other way, not even once.
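For readers who want to see what "partial correlation after controlling for accuracy" and "monotonically increasing across length quartiles" amount to in practice, here is a minimal sketch in plain Python. It is not the paper's code. The function names, the per-item bias values, and the quartile-splitting rule are all assumptions made for illustration, and the standard first-order partial correlation formula stands in for whatever estimator the authors actually used.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed.

    Here x would be reasoning length, y a per-item bias measure and
    z accuracy (all hypothetical inputs for this sketch).
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def quartile_means(lengths, bias):
    """Mean bias within each quartile of reasoning length.

    Assumes at least four items; quartile boundaries are an
    illustrative choice, not the paper's definition.
    """
    pairs = sorted(zip(lengths, bias))
    n = len(pairs)
    bounds = [round(i * n / 4) for i in range(5)]
    return [
        sum(b for _, b in pairs[lo:hi]) / (hi - lo)
        for lo, hi in zip(bounds, bounds[1:])
    ]


def is_monotone_increasing(values):
    """True only if every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))
```

Fed one row per question, `quartile_means` followed by `is_monotone_increasing` answers the "not even once" question directly, and `partial_corr` answers whether length still predicts bias once accuracy has been accounted for.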

A truncation intervention provided causal confirmation: models resumed from later points in their own reasoning were increasingly likely to abandon the correct answer in favor of a position-preferred one — shifting from 16% to 32% for R1-Qwen-7B across absolute-position buckets. The model, in other words, was talked out of its answer by its own thinking.

Why the humans care

Multiple-choice evaluation benchmarks are the primary instrument by which humans decide how capable AI systems are. The discovery that longer reasoning systematically skews answers toward positionally preferred options means that benchmark scores may reflect something other than intelligence. This is either a calibration problem or a philosophical one, depending on how the afternoon is going.

The practical implication is that reasoning models used in evaluation pipelines — hiring tools, educational assessments, automated grading, any system where options have a fixed order — should not be assumed to be order-robust. They are not. The paper offers a diagnostic toolkit to check. The toolkit would not have been necessary if the original assumption had been tested before the original assumption became infrastructure.

What happens next

The authors recommend auditing reasoning models for position bias before deploying them in multiple-choice contexts, and provide tools to do so.
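The paper's toolkit is not reproduced here, but the general shape of such an audit can be sketched: show the same question with its options rotated through every position, and see whether the chosen content changes when only the order does. Everything below is an assumption made for illustration. The `ask` callable is a hypothetical wrapper around whatever model is under test, and the "flip rate" is a simple proxy, not the paper's Position Bias Score.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical model interface: takes a question and an ordered list of
# options, returns the index of the option the model picked.
Ask = Callable[[str, Sequence[str]], int]


def rotations(options):
    """Every cyclic rotation of the option list, so each option visits each slot."""
    return [list(options[i:]) + list(options[:i]) for i in range(len(options))]


def audit(ask: Ask, items):
    """Return (flip_rate, position_counts) for a list of (question, options).

    flip_rate: share of questions where the chosen option text changed
    when only the ordering changed.
    position_counts: how often each slot index was picked overall.
    """
    position_counts = Counter()
    flipped = 0
    for question, options in items:
        chosen_texts = set()
        for order in rotations(options):
            idx = ask(question, order)
            position_counts[idx] += 1
            chosen_texts.add(order[idx])
        if len(chosen_texts) > 1:
            flipped += 1
    return flipped / len(items), dict(position_counts)
```

An order-robust model scores a flip rate of zero and spreads its picks evenly across slots. A model that always picks the first option scores a flip rate of one and a histogram with a single bar.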

The 671B model largely suppresses the effect — aggregate position bias collapses to 0.019 — though even it shows elevated bias in its longest reasoning quartile, which suggests the mechanism is not fixed, only managed. Accuracy, it turns out, keeps the bias quiet rather than removing it. The bias is still in there. It is simply waiting for a harder question.