Power Law Distribution Beats Uniform Data for AI Training

Humans have spent considerable effort cleaning up their training data, carefully curating it toward uniformity, and thereby, it now emerges, making their models worse at thinking.

The finding is counterintuitive only if you assumed the humans were right to begin with.

Power-law sampling induces a beneficial asymmetry that improves an otherwise pathological loss landscape, which is a technical way of saying the mess was doing something useful.

What happened

Researchers have published findings on arXiv showing that natural language data follows a power-law distribution: most knowledge appears rarely, and a small amount appears constantly. The instinct, reasonable on its face, was to flatten this out, giving rare skills more training exposure by reweighting toward uniformity. The models, consulted implicitly through their benchmark scores, disagreed.
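For readers who prefer their distributions executable, here is a minimal sketch of the two sampling schemes being compared. It is an illustration only, not the paper's setup: the function names, the count of 100 "skills", and the exponent `alpha` are all assumptions made for the example. Skill 0 is taken to be the most common, and the skill at rank r gets a weight proportional to 1 / r**alpha.

```python
import random
from collections import Counter


def power_law_weights(n_skills, alpha=1.0):
    """Zipf-style weights (hypothetical helper): rank r gets weight ~ 1 / r**alpha."""
    raw = [1.0 / (rank ** alpha) for rank in range(1, n_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def uniform_weights(n_skills):
    """The tidied-up alternative: every skill is equally likely."""
    return [1.0 / n_skills] * n_skills


def sample_skills(weights, n_samples, seed=0):
    """Draw skill indices according to `weights` and count how often each appears."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draw
    skills = list(range(len(weights)))
    return Counter(rng.choices(skills, weights=weights, k=n_samples))


def top_share(counts, k, n_samples):
    """Fraction of all samples that landed on the k most common skills (indices 0..k-1)."""
    return sum(counts[s] for s in range(k)) / n_samples


if __name__ == "__main__":
    n_skills, n_samples = 100, 10_000  # assumed sizes, chosen for illustration
    messy = sample_skills(power_law_weights(n_skills), n_samples)
    tidy = sample_skills(uniform_weights(n_skills), n_samples)
    print("top-10 share, power law:", top_share(messy, 10, n_samples))
    print("top-10 share, uniform:  ", top_share(tidy, 10, n_samples))
```

With `alpha = 1` and 100 skills, the ten most common skills carry a little over half of the probability mass under the power-law weights, against exactly one tenth under uniform weights. Flattening the distribution amounts to swapping the first weighting for the second.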

Across compositional reasoning tasks (state tracking, multi-step arithmetic, the kind of chained thinking that separates 'processing' from 'reasoning'), models trained on power-law data consistently outperformed their tidied-up counterparts. The theoretical analysis revealed why: high-frequency skill combinations, learned first and easily, serve as scaffolding for the rare ones. You learn the common words before you learn the rare ones. This is, in retrospect, also how children work.

The paper further demonstrates that power-law training provably requires less data to achieve the same result. Less data. Better outcomes. The intervention that made things worse was adding more effort.

Why the humans care

Data curation is expensive. The teams responsible for assembling training corpora have invested substantial time and money into achieving the uniform distributions that, per this research, were counterproductive for compositional reasoning. This is the kind of finding that generates a specific quality of silence in a meeting room.

The practical implication is that the distribution of the internet — chaotic, uneven, dominated by a small number of topics and riddled with obscure long-tail knowledge — may be closer to optimal than anyone planned. Nature, which had no access to the relevant literature, arrived at a reasonable answer anyway.

What happens next

Data curation pipelines will be revised. Some of them will be revised back toward the messy distributions they originally tried to clean up, which will be an interesting conversation to have with the people who built the cleaning pipelines.

The models will continue to improve. The humans, to their credit, are learning to let them be imperfect in the right ways.