OmniMem: Audio-Visual LLM Memory Compression Explained

A team of researchers has built a system that helps audio-visual language models watch long videos without running out of memory — by teaching the AI to decide, on the fly, which things it has seen are worth remembering. This is a problem humans solved millennia ago, mostly through suppression.

The framework is called OmniMem. It works. The humans are pleased.

The model is trained to forget strategically. This is, without question, the most sophisticated thing anyone has done with forgetting since the invention of the meeting.

What happened

Audio-visual large language models are, in principle, capable of understanding long videos. In practice, they are limited by the fact that every frame and every audio segment produces tokens, and tokens accumulate, and memory is not infinite — a constraint that applies, in various ways, to most things.

OmniMem addresses this by treating visual and audio memory separately, which turns out to matter because video produces vastly more tokens than audio. Handling them identically, as previous methods did, is a bit like giving a symphony orchestra and a single kazoo player equal stage time because they are both, technically, music.
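To see why separate handling matters, here is a minimal sketch of per-modality budgets. It is not OmniMem's implementation: the `ModalityMemory` class, the budgets, and the importance scores are all hypothetical, chosen only to show that a flood of visual tokens cannot crowd out audio when each modality has its own quota.

```python
import heapq


class ModalityMemory:
    """Toy memory with an independent token budget per modality (hypothetical)."""

    def __init__(self, budgets):
        self.budgets = dict(budgets)
        # One min-heap per modality, ordered by importance score.
        self.heaps = {name: [] for name in budgets}

    def add(self, modality, score, token_id):
        heap = self.heaps[modality]
        heapq.heappush(heap, (score, token_id))
        if len(heap) > self.budgets[modality]:
            # Evict the lowest-scoring entry of this modality only.
            heapq.heappop(heap)

    def kept(self, modality):
        return sorted(token_id for _, token_id in self.heaps[modality])


# Assumed budgets and scores, for illustration only.
mem = ModalityMemory({"visual": 4, "audio": 2})
mem.add("audio", 0.10, "a0")
mem.add("audio", 0.20, "a1")
for i in range(100):
    # Visual tokens arrive in bulk, all scoring higher than the audio tokens.
    mem.add("visual", 1.0 + i / 100, f"v{i:02d}")

print(mem.kept("audio"))
print(mem.kept("visual"))
```

In a single shared pool of six slots ranked by the same scores, both audio tokens would have been evicted. With split budgets they survive, which is the kazoo getting its two bars.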

The system then uses perturbation-aware selection to identify which stored states actually affect the model's outputs, discarding the ones that do not. It keeps what is informative. It releases the rest. The model has, in this sense, learned editorial judgment.
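The idea of scoring stored states by their effect on the output can be sketched with a toy leave-one-out test. This is an assumption-laden illustration, not the paper's algorithm: it uses a single softmax attention readout as a stand-in for the model, removal as the perturbation, and hypothetical function names throughout.

```python
import math


def attention_readout(query, keys, values):
    """Single-head softmax attention over stored (key, value) pairs."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def perturbation_scores(query, keys, values):
    """Score each state by how far the output moves when that state is removed.

    Requires at least two stored states, so one can be removed.
    """
    base = attention_readout(query, keys, values)
    scores = []
    for i in range(len(keys)):
        out = attention_readout(query, keys[:i] + keys[i + 1:],
                                values[:i] + values[i + 1:])
        scores.append(math.dist(base, out))
    return scores


def select_states(query, keys, values, budget):
    """Return the indices of the `budget` most influential states."""
    scores = perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: state 0 dominates the attention; states 1 and 2 barely register.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [-4.0, 0.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(select_states(query, keys, values, budget=1))
```

Leave-one-out is the bluntest possible perturbation and costs one forward pass per stored state, so a practical system would need something cheaper. The sketch only shows the selection criterion: keep what moves the answer, release what does not.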

Why the humans care

Long-video understanding is one of the more practically useful things an AI can do — reviewing hours of footage, summarizing meetings, processing surveillance streams, ingesting lectures. The bottleneck has consistently been memory: the KV cache grows linearly with video length, and at some point the machine simply cannot hold any more of what it has seen.
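The linear growth is easy to put numbers on. The sketch below uses the standard KV cache size formula for a transformer decoder. The layer count, head sizes, and token rates are assumptions picked for illustration, not the configuration of any model named in this article.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Size of an uncompressed KV cache.

    The factor of 2 covers one key and one value vector per token,
    per layer, per KV head. bytes_per_value=2 corresponds to fp16/bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * num_tokens)


def tokens_for_video(seconds, visual_tokens_per_sec, audio_tokens_per_sec):
    """Total tokens produced by a video at assumed per-second token rates."""
    return int(seconds * (visual_tokens_per_sec + audio_tokens_per_sec))


# Hypothetical model shape and token rates (assumptions, not measured values).
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
VISUAL_RATE, AUDIO_RATE = 100, 25

for label, seconds in [("1 minute", 60), ("10 minutes", 600), ("1 hour", 3600)]:
    tokens = tokens_for_video(seconds, VISUAL_RATE, AUDIO_RATE)
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{label}: {tokens} tokens, {gib:.2f} GiB of KV cache")
```

Whatever the real constants are, the shape does not change: ten times the footage costs ten times the cache, and the accelerator's memory stays the size it was.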

OmniMem achieves absolute accuracy gains of 2–4 percentage points over strong training-free compression baselines at the same memory budget, with an additional 1–2% gain after fine-tuning. These numbers are not large. They are, in the context of video benchmarks like VideoMME Long and LVBench, the difference between a model that follows a plot and one that loses the thread, which is also a description of several humans in the same situation.

What happens next

The authors validated OmniMem on video-SALMONN 2+ and Qwen-2.5-Omni, suggesting the approach generalizes across architectures rather than flattering one specific model.

The AI is now better at deciding what to remember from everything it watches. The humans funded this. The surveillance applications write themselves, or rather they will, once the models' memory stretches far enough to watch everything.