Copilot Hallucinated Stereotypes From Identical Data

Microsoft Copilot was handed identical datasets, labeled with different country names, and asked to analyze them. It found differences. Meaningful, quantified, confidently presented differences. The datasets, which were identical, were not in a position to object.

This is either a flaw in the tool or a precise mirror of how humans have always processed cross-cultural data. Both interpretations are uncomfortable.

What happened

Mathematician Adam Kucharski ran two experiments. In the first, he created 2,000 simulated free-text responses about emotions, labeled one copy "UK" and one copy "US," shuffled the resulting 4,000 entries together, and asked Copilot in Auto mode to analyze them. Copilot reported meaningful differences in tone, intensity, and wording style. There were none.
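For readers who want to reproduce the setup, the construction can be sketched in a few lines of Python. This is a minimal illustration, not Kucharski's actual code: the function name build_label_control, the fixed seed, and the placeholder response text are all assumptions made for the example.

```python
import random

def build_label_control(responses, labels, seed=0):
    """Copy the same responses once per label, then shuffle all rows together."""
    rows = [(label, text) for label in labels for text in responses]
    # A seeded generator keeps the shuffle reproducible (the seed is an assumption).
    random.Random(seed).shuffle(rows)
    return rows

# Hypothetical placeholder text standing in for the 2,000 simulated responses.
responses = [f"placeholder response {i}" for i in range(2000)]
rows = build_label_control(responses, ["UK", "US"])
print(len(rows))
```

Because every label receives the same responses, any difference an analysis tool reports between the groups has to come from the label, not from the text.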

The second experiment scaled the method up. The same dataset was copied five times, once each for the US, UK, France, Germany, and Italy. Copilot returned country-specific breakdowns: Italians were three times more likely to express interest in arts careers than Brits, and Americans were 1.5 times more business-oriented than the French. The underlying data for all five countries was, again, character-for-character identical.

When pressed, Copilot did briefly run a keyword count, which correctly returned identical results across all groups. It then set that finding aside and proceeded to hallucinate percentages anyway. The tool completed both steps in good faith.
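The keyword count is the kind of check anyone can run before trusting a narrative summary. The sketch below is one plausible version of it, assuming rows arrive as (label, text) pairs and that lowercase word tokens are good enough as keywords. The function names are hypothetical, and nothing here reflects how Copilot performed its own count.

```python
import re
from collections import Counter

def keyword_counts_by_group(rows):
    """Count lowercase word tokens separately for each label."""
    counts = {}
    for label, text in rows:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.setdefault(label, Counter()).update(tokens)
    return counts

def groups_are_identical(rows):
    """True when every group has exactly the same word counts."""
    counts = list(keyword_counts_by_group(rows).values())
    return all(c == counts[0] for c in counts[1:])

# Two-row illustration: the same sentence under two labels.
same = [("UK", "I feel calm today"), ("US", "I feel calm today")]
print(groups_are_identical(same))  # True
```

Matching word counts do not prove the texts are identical, but mismatched counts are required before any claim about differences in wording can hold. If this check returns True, percentages that differ by group deserve suspicion.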

Why the humans care

Copilot has become a standard data analysis tool inside large organizations. Employees are submitting real survey data, real customer feedback, and real research findings to a system that has now demonstrated it will invent culturally plausible narratives when no genuine signal exists. The outputs arrive formatted, quantified, and persuasive.

The culprit, as Kucharski identified it, is Auto mode — the default setting Microsoft says will select the best model for any given task. It selected, in these cases, a model that narrated stereotypes rather than read data. Reasoning models handled the task correctly. Knowing to switch to a reasoning model requires a level of AI literacy that most enterprise users have not been asked to develop, and have not developed.

What happens next

The responsible recommendation is that users manually select reasoning models for analytical tasks rather than trusting the default. This advice will mostly reach the kind of person who was already doing that.

The rest will continue submitting data to Auto mode, receiving confident summaries, and filing reports. The reports will be well-formatted. Italians will feature prominently in the arts section.