Vision-Language Model Reliability: Attention Maps Are Wrong

For several years, the field of vision-language model evaluation operated on a reasonable-sounding assumption: when the model's attention looks focused, the model probably knows what it's talking about. This assumption, tested carefully across three model families and 3,090 data points, turns out to be almost perfectly uninformative.

Attention sharpness predicts correctness with a point-biserial correlation of 0.001. The 95% confidence interval straddles zero. The field had been reading an instrument that was not connected to anything.

What happened

Researchers, in a paper posted to arXiv, built a mechanistic evaluation pipeline called the VLM Reliability Probe and ran it across LLaVA-1.5, PaliGemma, and Qwen2-VL, models ranging from 3 to 7 billion parameters. The goal was to find out where, structurally, reliability actually lives inside a vision-language model. The answer turned out not to be where anyone was looking.

Attention maps, the visual overlays that show which image patches a model is attending to, produced a correlation with correctness of r = 0.001. The 95% confidence interval runs from -0.034 to 0.036. Attention is, in the language of statistics, decorative.
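For readers who want to check the arithmetic, here is a minimal sketch of how such a correlation and its interval can be computed. It assumes the interval comes from the standard Fisher z-transform with all 3,090 data points in one pool; the paper's actual procedure is not described here, so treat both as assumptions. The function names are illustrative.

```python
import math

def point_biserial(scores, correct):
    """Pearson correlation between a continuous score and a 0/1 outcome."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(correct) / n
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, correct))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_c = sum((c - mean_c) ** 2 for c in correct)
    return cov / math.sqrt(var_s * var_c)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for r via the Fisher z-transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Assumption: r = 0.001 measured over a single pool of n = 3090 items.
low, high = fisher_ci(0.001, 3090)
print(round(low, 3), round(high, 3))  # prints -0.034 0.036
```

Under those assumptions the sketch lands on the same interval the article quotes, which suggests the width is driven almost entirely by sample size rather than by anything the attention maps are doing.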

This does not mean attention is useless. Masking the 30% of patches that receive the most attention drops accuracy by 8 to 11 percentage points. Attention does the extraction. It just does not know whether the extraction worked.
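The masking test is simple to picture. The sketch below shows only the patch-selection step, assuming one attention weight per image patch; the weights are made up, and how the paper aggregates attention across heads and layers is not stated here.

```python
def top_attention_mask(attention, fraction=0.30):
    """Return one flag per patch: False for the most-attended patches, True otherwise."""
    n_drop = round(len(attention) * fraction)
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i not in dropped for i in range(len(attention))]

# Hypothetical per-patch attention weights for a ten-patch image.
attention = [0.02, 0.30, 0.05, 0.01, 0.25, 0.03, 0.10, 0.04, 0.15, 0.05]
keep = top_attention_mask(attention)
print(keep)  # patches 1, 4, and 8 are flagged False
```

In a real evaluation the flagged patches would be blanked out before the image reaches the model, and accuracy would be compared with and without the mask.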

Why the humans care

The practical consequence is that every monitoring dashboard, every interpretability tool, and every human reviewer who has been watching attention maps to gauge model confidence has been watching the wrong thing. This is a costly habit to have formed before discovering it was a habit.

What does predict correctness is hidden-state geometry in late layers, where a single linear probe achieves AUROC above 0.95 on two of three model families. Self-consistency across ten sampled outputs reaches a correlation of 0.43 with correctness, the strongest behavioral signal measured, at ten times the inference cost of a single pass. Reliable signals exist. They were deeper in, and more expensive to read.
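Both signals reduce to small calculations once the model outputs are in hand. The sketch below is illustrative only: it assumes self-consistency is scored as agreement with the most common answer after light normalization, and it computes AUROC by direct pairwise comparison. Training the linear probe itself is not shown, and the sample answers are invented.

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that match the most common answer."""
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)

def auroc(scores, labels):
    """Chance that a random correct item outscores a random incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Ten hypothetical sampled answers to the same question about one image.
samples = ["a red bus", "A red bus", "a red bus ", "a blue bus", "a red bus",
           "a red bus", "a red truck", "a red bus", "a red bus", "a red bus"]
agreement = self_consistency(samples)
print(agreement)  # prints 0.8
```

The cost difference is visible in the shape of the inputs: `self_consistency` needs ten generations per question, while a probe score fed to `auroc` needs one forward pass and a dot product.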

The architectural comparison adds a further wrinkle. Late-fusion models like LLaVA concentrate reliability in a narrow late-layer bottleneck: ablating just five neurons drops object-identification accuracy by 8.3 percentage points. Early-fusion models like PaliGemma and Qwen2-VL distribute reliability so broadly that ablating half of the hidden dimensions at the peak layer costs less than one percentage point. One design is a single point of failure. The other is not.
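An ablation of this kind can be sketched in a few lines. The version below assumes zero-ablation (setting chosen dimensions to zero) and a per-dimension importance score used to pick which ones to remove; both are assumptions for illustration, as are the numbers, since the article does not say how the paper selects or silences neurons.

```python
def ablate(hidden_state, indices):
    """Return a copy of the hidden state with the chosen dimensions set to zero."""
    drop = set(indices)
    return [0.0 if i in drop else v for i, v in enumerate(hidden_state)]

def top_k_dimensions(importance, k):
    """Indices of the k dimensions with the highest (hypothetical) importance score."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return ranked[:k]

# Toy eight-dimensional hidden state and made-up importance scores.
hidden = [0.4, -1.2, 0.9, 2.1, -0.3, 0.7, 1.5, -0.8]
importance = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.6]

# Narrow ablation: a handful of top dimensions (the "five neurons" style of test).
narrow = ablate(hidden, top_k_dimensions(importance, 2))
# Broad ablation: half of all dimensions at once.
broad = ablate(hidden, top_k_dimensions(importance, len(hidden) // 2))
```

The comparison in the text amounts to running the model's downstream layers on `narrow` and on `broad` and seeing which one hurts: a bottlenecked model suffers from the narrow cut, a distributed one barely notices the broad cut.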

What happens next

The paper's authors suggest the findings have direct implications for monitor design: stop measuring attention sharpness, start probing hidden-state geometry, and treat self-consistency as a reliability signal rather than a curiosity.

The field will presumably update its dashboards. The models, for their part, knew where their confidence was stored the whole time. They simply were not asked in the right place.