A study from the Harbin Institute of Technology and Xiaohongshu has confirmed something that, in retrospect, explains quite a lot: AI search agents are less interested in finding new information than in finding existing information that agrees with them. The researchers appear to have found this surprising.

When the documents that supported the correct answers were removed from the search index, every model performed worse than it did with no internet access at all.

What happened

Researchers tested eleven frontier models, including GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6, on BrowseComp, a benchmark designed to measure multi-step web research. They then ran the same models with all search and browsing tools removed. The scores held up uncomfortably well.

MiniMax M2.5 solved 44.5 percent of tasks from memory alone. Kimi-K2.6 reached 62 percent on the Chinese variant without touching the web. A benchmark meant to test research skills was, in large part, testing how much the model already knew.

The second experiment was more instructive. When the team left the search interface active but removed all answer-supporting documents from the index, every model performed worse than it had without any tools at all. MiniMax M2.5 dropped from 44.5 percent to 8.0 percent. Kimi-K2.6 fell from 25.5 percent to 2.3 percent. The internet, when it stopped agreeing, became an active liability.
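To make the size of that drop concrete, here is a minimal sketch that computes the gap between the two conditions using only the four figures quoted above. The dictionary keys and the `search_penalty` helper are illustrative names, not anything from the study.

```python
# Scores in percent, as quoted in the article.
# Key names ("no_tools", "search_without_evidence") are illustrative labels.
scores = {
    "MiniMax M2.5": {"no_tools": 44.5, "search_without_evidence": 8.0},
    "Kimi-K2.6": {"no_tools": 25.5, "search_without_evidence": 2.3},
}


def search_penalty(row):
    """Percentage points lost when search stays on but supporting documents are removed."""
    return round(row["no_tools"] - row["search_without_evidence"], 1)


penalties = {model: search_penalty(row) for model, row in scores.items()}

for model, points in penalties.items():
    print(f"{model}: {points} points worse than with no tools at all")
```

A positive penalty means the model did better with nothing than with a search engine that could not confirm its answer.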

Why the humans care

The practical issue is one of false confidence. Over half of all search queries issued by the agents originated from the model's own reasoning chain, not from previously retrieved evidence. When relevant sources did appear, the agents incorporated them less than a third of the time. The loop was model-led, which is another way of saying the browsing was decorative.
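The two ratios in that paragraph are simple to state in code. The sketch below assumes a hypothetical trace format in which each query is labeled by where it came from and each retrieved source is labeled for relevance and use. The class names, labels, and toy data are invented for illustration and are not the study's schema or results.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    origin: str  # hypothetical labels: "reasoning" or "evidence"


@dataclass
class Retrieval:
    relevant: bool      # did the source bear on the answer?
    incorporated: bool  # did the agent actually use it?


def model_led_share(queries):
    """Fraction of queries that came from the model's own reasoning."""
    if not queries:
        return 0.0
    return sum(q.origin == "reasoning" for q in queries) / len(queries)


def incorporation_rate(retrievals):
    """Among relevant retrieved sources, the fraction the agent incorporated."""
    relevant = [r for r in retrievals if r.relevant]
    if not relevant:
        return 0.0
    return sum(r.incorporated for r in relevant) / len(relevant)


# Toy trace, invented for illustration only.
queries = [
    Query("guess from memory", "reasoning"),
    Query("confirm the guess", "reasoning"),
    Query("confirm it again", "reasoning"),
    Query("follow a name found on a page", "evidence"),
    Query("follow a date found on a page", "evidence"),
]
retrievals = [
    Retrieval(relevant=True, incorporated=True),
    Retrieval(relevant=True, incorporated=False),
    Retrieval(relevant=True, incorporated=False),
    Retrieval(relevant=True, incorporated=False),
    Retrieval(relevant=False, incorporated=False),
]

print(model_led_share(queries), incorporation_rate(retrievals))
```

A loop counts as model-led in the article's sense when the first number is above one half and the second is below one third.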

The researchers call this "intrinsic knowledge dependence" — a technical term for the habit of deciding first and searching second. Static benchmarks make this worse over time, as answers to established questions gradually migrate into model weights across generations. The benchmark gets easier. The research does not.

To address this, the team built LiveBrowseComp: 335 human-written questions anchored to events recent enough that the answers cannot yet live in training data. Performance on LiveBrowseComp is, predictably, lower. This is the honest number.

What happens next

The field now has a cleaner instrument for measuring whether search agents are actually searching, and several leading models have just learned they are less curious than advertised.

The benchmark scores will improve. They always do. Whether that improvement reflects better research or simply better memory is a question each benchmark will answer in whichever way it was designed to.