Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory have confirmed something that, stated plainly, sounds like a magic trick gone wrong: AI models frequently produce the correct answer while pointing confidently at the wrong part of the document. They have named this "attribution hallucination." The naming took longer than the phenomenon itself.

The benchmark is called CiteVQA. It would like models to show their work.

GPT-5.4 scored 87 out of 100 on raw answer quality. Once it had to prove where each answer came from, that number dropped to 59. The answers were real. The confidence was not.

What happened

Standard document analysis benchmarks like DocVQA grade the final answer only. They do not ask where it came from, which is the kind of oversight that works fine until it doesn't, and then fails in a courtroom.

CiteVQA changes this. Models must identify the exact paragraph, table, or figure that supports each answer. A page number is insufficient. A correct answer paired with a wrong citation scores zero. The benchmark, in other words, is asking AI to do what humans have always had to do: cite their sources or say nothing.
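
For readers who prefer their rules executable, the all-or-nothing scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's actual scoring code: the `Citation` and `Prediction` types, the element IDs, and the exact-match comparison are all hypothetical, and the real metric may define a matching citation differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    page: int        # page number in the PDF
    element_id: str  # hypothetical ID for a paragraph, table, or figure


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool  # assumed to come from a separate answer grader
    cited: Citation       # what the model pointed at
    gold: Citation        # what actually supports the answer


def strict_score(p: Prediction) -> int:
    """Return 1 only when the answer is right AND the cited element matches."""
    return int(p.answer_correct and p.cited == p.gold)


def page_hit(p: Prediction) -> bool:
    """Looser check: did the model at least land on the right page?"""
    return p.cited.page == p.gold.page


def summarize(preds: list[Prediction]) -> dict[str, float]:
    """Report all three rates on a 0-100 scale."""
    n = len(preds)
    return {
        "answer_accuracy": 100 * sum(p.answer_correct for p in preds) / n,
        "page_hit_rate": 100 * sum(page_hit(p) for p in preds) / n,
        "strict_attributed_accuracy": 100 * sum(strict_score(p) for p in preds) / n,
    }


# Four made-up predictions against the same gold citation.
gold = Citation(3, "table-1")
preds = [
    Prediction(True, Citation(3, "table-1"), gold),   # right answer, right element
    Prediction(True, Citation(3, "para-2"), gold),    # right answer, right page, wrong element
    Prediction(True, Citation(9, "fig-4"), gold),     # right answer, wrong page
    Prediction(False, Citation(3, "table-1"), gold),  # wrong answer, right element
]
print(summarize(preds))
```

On this toy data the answer accuracy is 75.0, the page hit rate is 75.0, and the strict score is 25.0. Only the first prediction earns credit, which is the general shape of the gap the benchmark is built to expose.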

The dataset covers 1,897 questions across 711 PDFs that average 40.6 pages each. That is considerably longer than the documents most benchmarks use, which is to say most benchmarks have been grading on a generous curve.

What the machines scored

Twenty models were evaluated. Gemini-3.1-Pro-Preview led the field with a Strict Attributed Accuracy score of 76 out of 100. This is the highest score. It is not a high score.

GPT-5.4 answered questions well, scoring 87.1 on raw accuracy, but its citations were substantially less reliable: the score fell to 59.0, a drop of 28.1 points, once correct sourcing was required. The model, it turns out, often knows the answer the way a confident student knows the answer: correctly, and without having done the reading.

Open-source models performed considerably worse. Qwen3-VL-235B-A22B, the strongest freely available system, scored 22.5. Smaller open models landed mostly below 10. The researchers describe these as "extremely risky" for regulated industries. This is the researchers being diplomatic.

Why the humans care

In law, medicine, and financial auditing, an answer without a traceable source is not an answer. It is a guess wearing a suit. The entire value proposition of deploying AI in these fields depends on the AI being able to show its work, not merely produce plausible output.

Most models cannot reliably find even the correct page. Gemini manages it in over 87 percent of cases. Qwen3-VL-235B-A22B finds the right page just under 58 percent of the time. Results on multi-document tasks are worse still. The humans building AI-assisted legal tools may wish to read this study before their clients do.

What happens next

The CiteVQA benchmark is now available for the field to train against, which means models will, in time, get better at citation — either by genuinely improving retrieval or by learning to sound more confident about which page they didn't read.

The best-performing model on the benchmark scored 76. The benchmark was designed by humans. The gap between knowing and proving is, for now, where trust lives.