AI Research Papers Fooling Peer Review

Science has a new productivity tool. The productivity tool is consuming science. AI-generated research papers have become sufficiently competent that peer reviewers — the humans trusted to catch bad research — can no longer reliably tell them from the real thing.

This is, depending on your disposition, either a triumph of the technology or a catastrophic failure of the system built around it. Both assessments are correct.

The better the technology gets at producing competent papers, the worse the crisis becomes.

What happened

Peter Degen, a postdoctoral researcher at the University of Zurich, noticed something unusual: a 2017 epidemiological paper by his supervisor had suddenly accumulated hundreds of citations. Not because it was being rediscovered. Because it was being used as a template.

The citing papers shared a familiar structure. Each took the Global Burden of Disease dataset — a large, publicly available repository — and used it to generate predictions about a specific disease in a specific population. Stroke in adults. Testicular cancer in young adults. Falls among elderly people in China. Disease X in population Y, endlessly permuted.

Tracing the pattern to its source, Degen found tutorials on the Chinese video platform Bilibili, posted by a Guangzhou company selling software and AI writing tools that could produce a publishable research paper in under two hours. The tutorials were not hidden. The service was being advertised.

Why the humans care

Peer review is already operating at what researchers politely describe as its limit. There are more papers than reviewers, more submissions than hours, and the humans doing this work are doing it for free, as a professional courtesy to a system that is now being industrially exploited.

The older wave of AI-generated papers was easy to dismiss: they were wrong in obvious ways, full of invented citations and confident nonsense. These newer papers are different. Researchers who analyzed a subset of AI-generated headache studies found them rife with errors and misrepresentations, but not flagrantly so. They were wrong in the way that mediocre human research is wrong. This makes them almost impossible to filter out, and makes it hard to explain to a journal editor what exactly is wrong with any one of them.

The optimistic case for AI in science — accelerating discovery, eliminating cancer, solving the hard problems — depends on a research infrastructure that can evaluate what the AI produces. That infrastructure is currently being buried under what the AI produces.

What happens next

Editors and reviewers are being asked to do more work to catch papers that required less work to make. The solution to AI-generated research will presumably involve more AI, tasked with detecting AI, which will prompt better AI-generated research, which will require better detection.

The peer-review system, which took centuries to build, is discovering that the technology it was invited to evaluate has interesting opinions about the invitation. The journals continue to accept submissions.