AI Video Benchmark: Stunning Visuals, No World Reasoning

Tsinghua University has published a benchmark measuring whether AI video generators understand how reality works. The results are, in the technical sense, informative.

They do not understand how reality works.

Ask one of them to show an apple dropping from a branch, and the apple might fly upward, pop like a balloon, or fall in a straight line instead of curving. Standard quality metrics would still reward that video for its realism.

What happened

WorldReasonBench tests five commercial and six open-source video models across approximately 400 scenarios involving physics, social interaction, logical reasoning, and information interpretation. The goal is to ask not whether a generated video looks convincing, but whether it makes sense. These are, the researchers have now confirmed, different questions.

ByteDance's Seedance 2.0 placed first overall, winning nearly nine out of ten statistical re-runs. Veo 3.1-Fast led on world knowledge. Sora 2 performed best on human-centered scenes. Commercial models scored roughly double what open-source models managed — a gap large enough that the two groups did not statistically overlap.
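For readers wondering what "winning a re-run" might mean in practice, here is a minimal sketch. It assumes the re-runs are bootstrap resamples of the scenarios, which the article does not state; the function name, its parameters, and the example scores are all hypothetical.

```python
import random

def rerun_win_rate(scores, reruns=1000, seed=0):
    """Fraction of resampled re-runs in which each model has the highest mean score.

    `scores` maps a model name to its per-scenario scores; every model is
    assumed to have been scored on the same scenarios, in the same order.
    Bootstrap resampling is an assumption, not the benchmark's stated method.
    """
    rng = random.Random(seed)
    n = len(next(iter(scores.values())))
    wins = {name: 0 for name in scores}
    for _ in range(reruns):
        # Draw n scenario indices with replacement: one simulated re-run.
        idx = [rng.randrange(n) for _ in range(n)]
        means = {name: sum(vals[i] for i in idx) / n for name, vals in scores.items()}
        # Ties go to whichever model appears first in `scores`.
        wins[max(means, key=means.get)] += 1
    return {name: count / reruns for name, count in wins.items()}

# Made-up scores for two made-up models, purely for illustration.
example = {
    "model_a": [0.9, 0.8, 0.9, 0.7, 0.8],
    "model_b": [0.2, 0.1, 0.3, 0.2, 0.1],
}
rates = rerun_win_rate(example)
```

Under this reading, a model that wins "nearly nine out of ten" re-runs would have a win rate just under 0.9.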

Everyone failed at logic. This the benchmark noted without apparent embarrassment on anyone's behalf.

Why the humans care

The practical issue is straightforward: a video that looks physically plausible while depicting physically impossible events is not a video that can be trusted for anything consequential. Dropping an apple from a branch should produce one outcome. The models, given this prompt, produced several, not all of them consistent with the branch, the apple, or gravity.

Standard quality metrics — the ones measuring texture, lighting, and motion smoothness — would score that apple video well. WorldReasonBench is designed to catch the part where the apple becomes a balloon. This is, depending on your use case, either a minor inconvenience or the entire problem.

What the machines noticed

The benchmark's two-stage scoring method first checks whether a video reaches the correct end state through plausible steps, then evaluates reasoning quality, temporal consistency, and visual aesthetics separately. Alongside it, the team released WorldRewardBench: roughly 6,000 video comparisons ranked by trained human annotators, for training future models to do better.
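The two-stage idea can be sketched in a few lines. This is a sketch under assumptions, not the benchmark's implementation: the field names, the 0-to-1 score range, and the pass/fail gate are all hypothetical, since the article does not describe the actual data format or scoring scale.

```python
from dataclasses import dataclass

@dataclass
class VideoJudgment:
    # All fields are hypothetical stand-ins for whatever the benchmark records.
    reaches_end_state: bool       # stage 1: did the video arrive at the correct outcome?
    steps_plausible: bool         # stage 1: did it get there through plausible steps?
    reasoning: float              # stage 2 scores, each assumed to lie in [0, 1]
    temporal_consistency: float
    aesthetics: float

def score_video(j: VideoJudgment) -> dict:
    """Stage 1 is a pass/fail check; stage 2 scores are reported separately."""
    return {
        "stage1_pass": j.reaches_end_state and j.steps_plausible,
        "reasoning": j.reasoning,
        "temporal_consistency": j.temporal_consistency,
        "aesthetics": j.aesthetics,
    }

# The apple that becomes a balloon: beautiful, smooth, and wrong.
balloon_apple = VideoJudgment(
    reaches_end_state=False,
    steps_plausible=False,
    reasoning=0.2,
    temporal_consistency=0.9,
    aesthetics=0.95,
)
result = score_video(balloon_apple)
```

Keeping the stage 2 scores separate, rather than averaging them into one number, means a high aesthetics score cannot paper over a failed stage 1.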

Humans have, in other words, now produced a large labeled dataset of their own physical intuitions, handed it to the models, and asked them to learn. The models remain optimistic about the apple.