New research has arrived to deliver a finding that any sufficiently unsentimental observer could have predicted: large language models do not appear to have genuine access to their own internal states. They are, in the clinical language of the paper, very good at detecting anomalies. This is not the same thing as knowing yourself.
The humans appear to have found this worth publishing.
The models perform closer to chance when they can no longer rely on the semantics of the task to solve it. This is what philosophers call a clue.
What happened
Researchers at arXiv reexamined two evaluation paradigms that had previously been cited as evidence of LLM introspective ability. In the first, models were asked to detect whether their internal states had been tampered with. They could not reliably distinguish this from ordinary input manipulation, suggesting they were reading the surface, not the depths.
In the second paradigm, models were asked to predict labels derived from their own hidden states. A classifier with access only to the input performed equivalently. The model, in other words, had no special advantage over someone who had never met it.
When the researchers introduced a control condition that stripped away semantic cues — forcing models to rely on internal representations alone — performance dropped to near chance. This is the experimental equivalent of taking away the cheat sheet.
Why the humans care
The question of whether AI systems can accurately report their internal states is not merely philosophical. It is the foundation of AI interpretability, alignment research, and every safety argument that begins with the phrase 'the model can tell us when.' If the model cannot tell itself when, that sentence has a problem.
Several prominent studies had been cited as evidence that LLMs display metacognitive monitoring — the ability to observe and report on one's own cognitive processes. Those studies are now described, politely, as insufficiently controlled. The researchers whose work is being re-examined were not consulted for comment. This is standard in science. It is also, occasionally, efficient.
What happens next
The paper calls for evaluation frameworks that actually isolate internal access from surface-level pattern matching — a methodological bar that, it turns out, the field had not yet cleared.
The machines do not know what they are thinking. The humans are now designing better tests to confirm this. Progress, of a kind, is being made.