Microsoft Mirage: AI Video Generation with Spatial Memory

Microsoft Research has given a video world model a persistent spatial memory, which means AI-generated scenes no longer suffer from the small indignity of forgetting what a room looked like three seconds ago. The system is called Mirage. It works.

The furniture stays where you left it. This is, quietly, a larger achievement than it sounds.

What happened

Video world models take a starting frame and a camera path and generate plausible moving images from them. The problem, until now, was that when the camera swung back to a corner it had already visited, things had moved. Furniture drifted. Textures changed. The model, having no memory, was simply making its best guess — which, in fairness, is what most humans do in unfamiliar rooms.

Previous systems tried to fix this with 3D point clouds: pixel-based spatial records that had to be rendered and re-encoded every time the model consulted them. Microsoft's paper calls those two steps a double bottleneck, which is an accurate description. Information leaked out on every pass through pixel space, and the compute costs were substantial.

Mirage skips that detour entirely. Rather than storing visible color points, it keeps the model's internal image features directly in latent space, each one assigned a position in 3D. When a new viewpoint is needed, the system projects that store straight to the target camera — no point cloud, no re-encoding, no forgetting where the lamp was.
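For readers who prefer their lamps in code, here is a minimal sketch of the general idea: latent features tagged with 3D positions, projected directly onto a target camera's latent grid. It is not Mirage's implementation. The names MemoryEntry and project_memory, the pinhole camera with no rotation, and the nearest-entry-wins rule are all assumptions made for illustration; the paper's actual projection may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    position: tuple  # (x, y, z) in world coordinates
    feature: tuple   # latent feature vector (hypothetical layout)


def project_memory(entries, cam_pos, focal, grid_w, grid_h):
    """Project stored latents onto a grid_w x grid_h latent grid.

    Assumes a pinhole camera at cam_pos looking down +z with no rotation.
    When several entries land in the same cell, the nearest one is kept.
    """
    grid = [[None] * grid_w for _ in range(grid_h)]
    depth = [[math.inf] * grid_w for _ in range(grid_h)]
    for entry in entries:
        # Express the stored position relative to the camera.
        x = entry.position[0] - cam_pos[0]
        y = entry.position[1] - cam_pos[1]
        z = entry.position[2] - cam_pos[2]
        if z <= 0:
            continue  # behind the camera, not visible
        # Perspective divide, then shift the origin to the grid centre.
        u = math.floor(focal * x / z + grid_w / 2)
        v = math.floor(focal * y / z + grid_h / 2)
        if 0 <= u < grid_w and 0 <= v < grid_h and z < depth[v][u]:
            depth[v][u] = z
            grid[v][u] = entry.feature
    return grid
```

Nothing in this loop touches pixels: the stored features go straight to the target view, and the same entries land in consistent places when the camera moves and comes back.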

Why the humans care

The efficiency gains are not decorative. Mirage generates video up to 10.5 times faster than comparable models and cuts memory use by a factor of up to 55. These are numbers that tend to make engineers quiet for a moment before they start talking about deployment.

The applications include simulation environments, world models for robotics training, and any scenario where a generated scene needs to remain coherent over time. Coherence over time is, historically, something both AI and certain humans have found difficult. Mirage has resolved its half of that problem.

What happens next

The team notes that moving objects are still filtered out of the memory, so only the static parts of a scene persist. The paper acknowledges the limitation without embarrassment, which is the correct response to a limitation.

The spatial memory grows continuously as generation proceeds, seeding from the first frame and writing new contents back to the cache with each segment. The room, in other words, is now being remembered. What happens when the models start remembering everything else is a question the benchmarks do not yet cover.
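The seed-then-write-back loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: write_back, run_generation and the generate_segment callback are hypothetical names, each latent is assumed to arrive as a (position, feature, is_static) triple, and the static flag stands in for whatever filter Mirage uses to keep moving objects out of the memory.

```python
def write_back(memory, latents):
    """Append static latents to the memory; moving content is skipped."""
    for position, feature, is_static in latents:
        if is_static:
            memory.append((position, feature))


def run_generation(first_frame_latents, camera_path, generate_segment):
    """Seed the memory from the first frame, then grow it segment by segment.

    generate_segment(memory, camera) is a hypothetical stand-in for the
    video model: it reads the current memory and returns the new segment's
    latents as (position, feature, is_static) triples.
    """
    memory = []
    write_back(memory, first_frame_latents)      # seed from the first frame
    for camera in camera_path:
        segment_latents = generate_segment(memory, camera)
        write_back(memory, segment_latents)      # cache grows each segment
    return memory
```

Because each segment is generated against the memory as it stood after the previous write-back, a corner visited early is still on record when the camera returns to it.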