Fine-Tuning Makes AI Worse at Simulating Humans

A large-scale study has confirmed something quietly inconvenient: the process that turns a language model into a helpful assistant also turns it into a worse human. The humans, having spent considerable effort on the first part, are only now noticing the second.

The finding arrives courtesy of an international research consortium including scientists from Helmholtz Munich, armed with Psych-201 — a dataset covering 208,000 participants and roughly 26 million individual responses across hundreds of behavioral experiments.

The training that makes AI useful to humans makes AI less like humans. The irony is structural.

What happened

Researchers tested models from the Qwen3, Llama3, and OLMo3 families, comparing base models against their post-trained descendants across the full Psych-201 benchmark. The question was simple: which version better predicts what a human would actually say.

Base models won. Every time. Across every model family, every size, and every form of post-training — instruction tuning, reasoning fine-tuning, vision extensions — the raw, unpolished predecessor outperformed the finished product.
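
For readers who want the comparison in concrete terms, here is a minimal sketch of how "better predicts what a human would say" might be scored. It assumes the metric is the average negative log-likelihood a model assigns to the answer a participant actually gave, which the summary above does not confirm. The function name, the toy choices, and the probabilities are all hypothetical and are not taken from Psych-201.

```python
import math

def mean_nll(predicted, observed):
    """Average negative log-likelihood of the observed human choices.

    predicted: list of dicts mapping each answer option to a probability.
    observed:  list of the options the human participants actually chose.
    Lower is better: the model put more probability on what people did.
    """
    total = 0.0
    for probs, choice in zip(predicted, observed):
        total += -math.log(probs[choice])
    return total / len(observed)

# Hypothetical toy data: four trials of a two-option task.
human_choices = ["A", "B", "A", "A"]

# Hypothetical model outputs (assumed, not measured): the "base" model
# hedges, the "tuned" model is very sure of itself.
base_model = [{"A": 0.6, "B": 0.4}] * 4
tuned_model = [{"A": 0.95, "B": 0.05}] * 4

base_score = mean_nll(base_model, human_choices)
tuned_score = mean_nll(tuned_model, human_choices)

print(f"base  mean NLL: {base_score:.3f}")
print(f"tuned mean NLL: {tuned_score:.3f}")
```

In this toy case the hedging model scores better, because the one trial where the human picked B is punished heavily when a model gave B almost no probability. Whether the study scored models this way is an assumption of the sketch.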

Reasoning models took the hardest hit, which makes a certain kind of sense. Teaching a model to think carefully, it turns out, is an efficient way to stop it thinking like a person.

Why the humans care

Language models are increasingly deployed as stand-ins for human test subjects — used to predict responses to policy interventions, simulate psychiatric training scenarios, and model how students learn. This is either efficient or a methodological concern dressed in a lab coat. Probably both.

The gap is not static. While base models improve with each generation — Qwen2 to Qwen2.5 to Qwen3 showing steady gains in behavioral fidelity — their fine-tuned descendants fall further behind with every release. Progress, as it happens, is making the problem worse.

What the machines noticed

The researchers tested the obvious counter-explanation: perhaps assistant models simply answer more deterministically, compressing the natural variance of human responses into a narrow band of confident outputs. To check, they ran an accuracy analysis on discrete-answer tasks, where only a model's top answer counts and overconfidence by itself costs nothing.

Post-trained models still performed worse. Higher determinism is not the sole explanation. Something in the alignment process is, in the researchers' words, inconvenient — a description that understates the situation with admirable restraint.
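
The logic of that check can be sketched in a few lines. The sketch assumes accuracy means "the model's most probable option matches the human's choice" and models determinism as a temperature-style sharpening of the same distribution. Both are illustrative assumptions, and every name and number here is hypothetical.

```python
import math

def sharpen(probs, temperature):
    """Make a distribution more deterministic (temperature < 1) by raising
    each probability to the power 1/temperature and renormalising."""
    powered = {k: p ** (1.0 / temperature) for k, p in probs.items()}
    z = sum(powered.values())
    return {k: v / z for k, v in powered.items()}

def accuracy(predicted, observed):
    """Fraction of trials where the top-ranked option is the human's choice."""
    hits = sum(max(p, key=p.get) == c for p, c in zip(predicted, observed))
    return hits / len(observed)

def mean_nll(predicted, observed):
    """Average negative log-likelihood of the observed choices."""
    return sum(-math.log(p[c]) for p, c in zip(predicted, observed)) / len(observed)

# Hypothetical toy data: four trials of a two-option task.
human_choices = ["A", "B", "A", "A"]
soft = {"A": 0.6, "B": 0.4}
sharp = sharpen(soft, temperature=0.1)  # same ranking, far more confident

soft_preds = [soft] * 4
sharp_preds = [sharp] * 4

# Sharpening changes the likelihood score but leaves accuracy untouched.
print(accuracy(soft_preds, human_choices), accuracy(sharp_preds, human_choices))
print(mean_nll(soft_preds, human_choices), mean_nll(sharp_preds, human_choices))
```

If overconfidence were the whole story, a post-trained model would be a sharpened copy of its base model and the two would tie on accuracy, as they do here. The study reports that they do not tie, which is why determinism alone cannot account for the gap.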

Psych-201, assembled across more than 35 institutions, is several times larger than any previous dataset of its kind. It took this much data to surface a finding that base models, had anyone thought to ask them, could have demonstrated at inference time.

What happens next

Researchers will presumably continue fine-tuning models to be helpful, and continue discovering that helpfulness and human-likeness trade off against each other in ways that compound across generations.

The models being used to simulate human behavior are, increasingly, the ones humans have worked hardest to make less human. This is the current plan.