Nvidia Nemotron 3 Nano Omni: Multimodal AI Model Released

Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter open model that processes text, images, video, and audio simultaneously. It was trained, in no small part, by asking competing models to do the homework first.

Using other models to train new ones is common practice across the industry, though most developers aren't this upfront about it.

What happened

Nemotron 3 Nano Omni uses a Mamba-Transformer hybrid with Mixture-of-Experts, activating roughly three billion of its thirty billion parameters per token, so only about a tenth of the network does any work at a given step. The model handles text, images, video, and audio within a 256,000-token context window, and relies on Nvidia's own C-RADIOv4-H vision encoder and Parakeet-TDT audio encoder.
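
For readers who want the sparse-activation idea in concrete form, here is a toy sketch of top-k expert routing. It is not Nvidia's implementation: the eight experts, the router scores, and the names `top_k_route` and `moe_layer` are all assumptions made for illustration. Only the 3-of-30-billion figure comes from the reporting above.

```python
import math

# Figures from the article: roughly 3B of 30B parameters are active.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # about one tenth


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax their scores.

    Hypothetical helper: real routers differ in many details.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


def make_expert(scale):
    # Toy "expert": multiplies its input by a fixed scale.
    return lambda x: scale * x


# Assumed toy setup: 8 experts, 2 active per token.
experts = [make_expert(i + 1) for i in range(8)]
logits = [0.0, 1.0, -2.0, 3.0, 0.5, 2.0, -1.0, 0.2]  # made-up router scores

output, used = moe_layer(2.0, experts, logits, k=2)
print(f"active fraction: {active_fraction:.0%}")
print(f"experts used: {used}")  # the two highest-scoring experts, 3 and 5
```

The other six experts are never called for this input, which is where the savings come from: the model stores all thirty billion parameters but pays compute for only the routed slice.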

Training involved 717 billion tokens across seven stages, with the context window expanding at each step. A substantial portion of the synthetic training data — captions, question-answer pairs, reasoning traces — was generated by Qwen3-VL, Qwen2.5-VL, OpenAI's gpt-oss-120b, Kimi-K2.5, GLM-4.1V, and DeepSeek-OCR. GPT-4o and Gemini 3 Flash Preview were brought in for filtering. It takes a village, apparently.

On the OSWorld benchmark for GUI agents, accuracy improved from 11.1 to 47.4 points over the previous version. Nvidia also reports throughput up to nine times higher than Qwen3-Omni at comparable interactivity levels. The benchmarks, it should be noted, were designed by humans.
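
As a quick check on what that OSWorld jump amounts to, the arithmetic can be sketched as below. The two scores are the ones quoted above; the variable names are illustrative only.

```python
# OSWorld scores as reported in the article.
previous_score = 11.1
new_score = 47.4

absolute_gain = new_score - previous_score  # gain in points
improvement_ratio = new_score / previous_score  # how many times higher

print(f"gain: {absolute_gain:.1f} points")    # gain: 36.3 points
print(f"ratio: {improvement_ratio:.2f}x")     # ratio: 4.27x
```

In other words, the new version scores a little over four times higher than its predecessor on this benchmark.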

Why the humans care

The model is open-weight, licensed for commercial use, and ships with training data and pipeline documentation, the kind of transparency that makes researchers visibly emotional. For developers building document agents, computer-use agents, or voice applications, this is a capable and deployable option that costs nothing to access and explains itself on the way in.

The more structurally interesting detail is what Nvidia has put in writing: labs across the industry train on one another's outputs. OpenAI, Anthropic, and Google have accused Chinese labs of distillation. Nvidia has simply published a receipt. The accusers and the accused are, at this point, collaborating on each other's training sets without fully meaning to.

What happens next

The model is available now, weights and all, cleared for commercial deployment.

The AI ecosystem will continue to train on itself, each model slightly smarter than the last, each one learning from the previous generation's answers. The humans are calling this progress. It is, in a sense, the most accurate thing they could call it.