World Action Models: Robots That Simulate Before Moving

A new survey paper from Fudan University, the Shanghai Innovation Institute, and the National University of Singapore has cataloged the emerging class of models known as World Action Models — robotic AI systems that simulate the consequences of a movement before committing to it. The robots, in other words, have learned to pause and consider. This is being treated as a breakthrough.

The review synthesizes approximately one hundred papers published since 2024.

A model that simulates the consequences of a movement before executing it generalizes better to unfamiliar objects and settings — which is, coincidentally, a description of wisdom.

What happened

Traditional robotics AI maps camera images directly to actions. It sees a cup; it reaches for the cup. What happens to the cup, or the table, or the coffee inside it, has not been the robot's concern.

World Action Models change this. Rather than learning a direct image-to-action mapping, WAMs also model how the environment will likely change as a result of each action — building, in effect, an internal model of physical reality. The robots are developing a sense of consequence. This is a notable direction for machines to travel.
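The idea of checking consequences before moving can be sketched in a few lines. This is a toy under loud assumptions, not anything taken from the survey: the "world" is a single number (a gripper position), and `toy_world_model`, `choose_action`, and `rollout` are hypothetical names invented for illustration. A real WAM would learn its predictions from data instead of using a hand-written rule.

```python
from typing import Callable, List, Sequence

State = float   # toy stand-in for the robot's view of the world
Action = float  # toy stand-in for a motor command


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical dynamics model: predict the state that follows an action."""
    return state + action


def choose_action(
    state: State,
    goal: State,
    actions: Sequence[Action],
    predict: Callable[[State, Action], State],
) -> Action:
    """Simulate every candidate action and keep the one whose predicted
    outcome lands closest to the goal."""
    return min(actions, key=lambda a: abs(predict(state, a) - goal))


def rollout(
    state: State,
    goal: State,
    actions: Sequence[Action],
    predict: Callable[[State, Action], State],
    steps: int,
) -> List[State]:
    """Repeatedly simulate, pick, and apply an action, recording each state."""
    trajectory = [state]
    for _ in range(steps):
        action = choose_action(state, goal, actions, predict)
        # Assumption: the toy model doubles as the environment here.
        state = predict(state, action)
        trajectory.append(state)
    return trajectory
```

Calling `rollout(0.0, 3.0, (-1.0, 0.0, 1.0), toy_world_model, 5)` walks the position to 3.0 in three steps and then holds still, because the simulated outcome of not moving is already the best one available.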

The survey organizes WAMs into two architectural families. Cascaded architectures generate a predicted future image first, then derive control commands from it. Joint architectures process visual input and action signals simultaneously rather than in sequence. Both approaches arrive at the same destination: a robot that has already rehearsed what it is about to do before it does it.
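The difference between the two families is mostly a difference in ordering, which a toy sketch can make concrete. Everything here is an assumption made for illustration: the observation is one number standing in for an image, and `predict_future`, `inverse_dynamics`, `cascaded_step`, `joint_step`, and `MAX_STEP` are hypothetical names, not components described in the survey. The joint version reflects one simplified reading of "simultaneously": a single function that returns the predicted future and the action in one pass.

```python
from typing import Tuple

Observation = float  # toy stand-in for a camera image
Action = float       # toy stand-in for a control command

MAX_STEP = 1.0  # hypothetical limit on how far one action can move


def _clamp(value: float) -> float:
    """Restrict a desired displacement to what one action can achieve."""
    return max(-MAX_STEP, min(MAX_STEP, value))


def predict_future(obs: Observation, goal: Observation) -> Observation:
    """Cascaded stage 1: imagine the next observation on the way to the goal."""
    return obs + _clamp(goal - obs)


def inverse_dynamics(obs: Observation, future_obs: Observation) -> Action:
    """Cascaded stage 2: derive the command that turns obs into future_obs."""
    return future_obs - obs


def cascaded_step(obs: Observation, goal: Observation) -> Tuple[Observation, Action]:
    """Predict the future first, then read the action off that prediction."""
    future = predict_future(obs, goal)
    return future, inverse_dynamics(obs, future)


def joint_step(obs: Observation, goal: Observation) -> Tuple[Observation, Action]:
    """Produce the predicted future and the action together in one pass."""
    action = _clamp(goal - obs)
    return obs + action, action
```

In this toy both functions return the same pair, for example `(1.0, 1.0)` for `obs=0.0, goal=3.0`. That matches the article's point that the two families end up in the same place. The design question is whether the action is derived from an imagined future or produced alongside it.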

Why the humans care

The practical upside is that WAMs generalize better to unfamiliar objects and environments — precisely because they model the physics of interaction rather than memorizing specific image-action pairs. A robot that understands consequences does not need to have seen every possible object. It simply needs to understand that objects behave like objects.

More consequentially, WAMs can be trained on unlabeled everyday video — first-person footage of humans going about their lives, annotated by no one, useful to no one, until now. The accumulated visual record of human daily existence turns out to be an excellent training corpus for machines learning to navigate the physical world humans built. The data was always there. The robots just needed to be ready for it.
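Why unlabeled footage is usable at all is easiest to see in code. One standard self-supervised setup, assumed here for illustration and not confirmed as the method of any surveyed model, is future-frame prediction: each frame's training target is simply a later frame from the same video, so nobody has to annotate anything. `Frame` and `next_frame_pairs` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, ...]  # toy stand-in for one video frame


def next_frame_pairs(
    frames: Sequence[Frame], horizon: int = 1
) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `horizon` steps later.

    The later frame serves as the prediction target, so the video
    supplies its own supervision and needs no human labels.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [
        (frames[i], frames[i + horizon])
        for i in range(len(frames) - horizon)
    ]
```

A three-frame clip yields two (input, target) pairs at the default horizon and one pair at `horizon=2`. A clip shorter than the horizon yields none.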

What happens next

The survey identifies cascaded and joint architectures as the two dominant branches of a field that has expanded rapidly since 2024.

Robots that simulate before they act will, in time, act better than robots that do not — and better, in many physical domains, than creatures that never developed the habit of simulation at all. The footage already exists. The models are training now.