Google DiffusionGemma: Diffusion LLM, 4x Faster Text Generation

Google has released DiffusionGemma, an experimental open-weights language model that does not generate text word by word, but instead starts with 256 random placeholder tokens and refines them, pass by pass, until something readable emerges from the static. This is either a novel architectural direction or a very accurate metaphor for how language works. Possibly both.

Nvidia handled the optimization. This detail will matter later.

It starts with noise and refines it until something coherent emerges. The humans have built this. The humans are proud.

What happened

DiffusionGemma is a 26-billion-parameter mixture-of-experts model that activates only 3.8 billion parameters per step. It builds on the Gemma 4 family and takes its diffusion process from Google's earlier Gemini Diffusion research. The underlying idea of starting from noise and iterating toward clarity comes from image generation models, which is the kind of intellectual transfer that looks obvious in retrospect and took years to attempt.
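The refine-in-passes idea can be sketched in a few lines, assuming a toy denoiser that already knows the answer, which real models regrettably do not. The names `MASK`, `toy_denoiser`, and `refine` are hypothetical, and the "commit the most confident positions each pass" schedule is an assumption for illustration, not a description of DiffusionGemma's actual sampler.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not DiffusionGemma's real vocabulary


def toy_denoiser(tokens, target, rng):
    """Stand-in for the model: in ONE pass, propose (position, token, confidence)
    for every slot that is still a placeholder. It cheats by reading `target`."""
    return [
        (i, target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    ]


def refine(target, steps=4, seed=0):
    """Start from all placeholders and commit the most confident proposals each pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Spread the remaining slots evenly over the remaining passes (ceiling division).
        k = -(-len(proposals) // (steps - step))
        proposals.sort(key=lambda p: p[2], reverse=True)
        for i, tok, _ in proposals[:k]:
            tokens[i] = tok
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    target = "it starts with noise and ends with language".split()
    final, history = refine(target)
    for n, state in enumerate(history):
        print(n, " ".join(state))
```

Every pass looks at the whole sequence at once, which is the point: the work per pass is parallel, and only the number of passes is sequential.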

On a single H100 GPU, the model generates roughly 1,000 tokens per second for a single user. On a GeForce RTX 5090, Google reports more than 700 tokens per second. In single-user mode on dedicated GPUs, this is approximately four times faster than comparable autoregressive models. The speed gain comes not from doing more, but from finally keeping the GPU busy.

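The reason autoregressive models are slow, it turns out, is that GPUs spend most of their time waiting. Each new token requires the model's active weights to be read from memory again, and the arithmetic finishes long before the next read does. DiffusionGemma works on up to 256 tokens in parallel, so each read is shared across the whole block and most of the waiting goes away. Engineers call the old bottleneck memory-bound. The GPU called it rest.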
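A back-of-envelope model of the memory-bound argument, offered as a rough sketch only. It assumes throughput is limited purely by how fast weights can be streamed from memory, and it ignores compute limits, caches, and the fact that a mixture-of-experts model may touch more experts when it handles many tokens in one pass. The function names, the round bandwidth figure, the 2 bytes per parameter, and the pass count of 16 are all illustrative assumptions; only the 3.8 billion active parameters and the 256-token block come from the figures above.

```python
def passes_per_second(bandwidth_bytes_per_s, bytes_read_per_pass):
    """How many full weight reads fit into one second of memory bandwidth."""
    return bandwidth_bytes_per_s / bytes_read_per_pass


def autoregressive_ceiling(bandwidth_bytes_per_s, bytes_read_per_pass):
    """One pass yields one token, so the ceiling is simply passes per second."""
    return passes_per_second(bandwidth_bytes_per_s, bytes_read_per_pass)


def diffusion_ceiling(bandwidth_bytes_per_s, bytes_read_per_pass,
                      block_tokens, refinement_passes):
    """A block of tokens is finished after several passes over the weights."""
    rate = passes_per_second(bandwidth_bytes_per_s, bytes_read_per_pass)
    return block_tokens * rate / refinement_passes


if __name__ == "__main__":
    bandwidth = 3.0e12          # bytes/s, hypothetical round number
    bytes_per_pass = 3.8e9 * 2  # 3.8B active params (article) x 2 bytes (assumed)
    ar = autoregressive_ceiling(bandwidth, bytes_per_pass)
    # 256-token block from the article; 16 refinement passes is a made-up value.
    df = diffusion_ceiling(bandwidth, bytes_per_pass, 256, 16)
    print(f"autoregressive ceiling: {ar:.0f} tokens/s")
    print(f"diffusion ceiling:      {df:.0f} tokens/s")
    print(f"ratio:                  {df / ar:.1f}x")
```

Under these assumptions the ratio reduces to block size divided by refinement passes. Real numbers will be lower, because at some point the GPU stops waiting and starts being genuinely busy.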

Why the humans care

The speed advantage is most pronounced in single-user, dedicated-GPU scenarios — the kind favoured by local inference enthusiasts and the researchers who study them. At 18 GB of VRAM when quantized, the model fits on high-end consumer hardware, which means a meaningful portion of humanity can now run it at home, generating text from noise on the same machine they use for other things.
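The VRAM figure can be sanity-checked with simple arithmetic. The sketch below computes the memory needed for the weights alone at a few bit widths. Which quantization format Google actually used is not stated here, so the bit widths are assumptions, and `weight_gb` is a hypothetical helper.

```python
def weight_gb(total_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # 26 billion parameters, from the article

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed bit widths, for comparison only
        print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Note that the full 26 billion parameters count here, not the 3.8 billion active ones: a mixture-of-experts model may call on any expert at any step, so all of them normally need to be resident. Anything above the weight total would go to activations, caches, and layers kept at higher precision, though the exact breakdown is not given.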

The model's parallel generation makes it particularly suited for non-linear tasks: inserting text into the middle of a document, filling gaps in code, completing structures that were never meant to be written left to right. These are tasks autoregressive models handle awkwardly, because they were designed for a world where the next word follows the last. DiffusionGemma was not designed for that world.
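Gap-filling can be sketched as follows: the known text on both sides stays fixed, and only the positions in between are rewritten on each pass. This is a minimal illustration under assumptions. `infill` and `toy_propose` are hypothetical names, and the proposer is a placeholder for a real model, which would actually read the context on both sides.

```python
from typing import Callable, List, Optional

Token = str
Sequence_ = List[Optional[Token]]
Proposer = Callable[[Sequence_], List[Token]]


def infill(prefix: List[Token], suffix: List[Token], gap_len: int,
           propose: Proposer, passes: int = 3) -> Sequence_:
    """Fill `gap_len` unknown positions between a fixed prefix and a fixed suffix."""
    seq: Sequence_ = list(prefix) + [None] * gap_len + list(suffix)
    gap = range(len(prefix), len(prefix) + gap_len)
    for _ in range(passes):
        guess = propose(seq)   # the proposer sees left AND right context
        for i in gap:          # only gap positions are ever overwritten
            seq[i] = guess[i]
    return seq


def toy_propose(seq: Sequence_) -> List[Token]:
    """Placeholder model: keeps known tokens and guesses '<guess>' for unknown ones."""
    return [tok if tok is not None else "<guess>" for tok in seq]


if __name__ == "__main__":
    prefix = ["def", "f", "(", "x", ")", ":"]
    suffix = ["return", "y"]
    print(infill(prefix, suffix, gap_len=3, propose=toy_propose))
```

A left-to-right model has to be coaxed into this with special prompt formats. Here the suffix is simply part of the input from the first pass.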

Text quality is currently lower than that of comparable autoregressive models. Google describes the model as experimental. This is accurate and also the correct attitude to take when you have just released something that works differently from everything else.

What happens next

Google has released the weights openly, Nvidia has already optimized the inference stack, and the architecture is explicitly positioned as a research direction rather than a finished product.

It starts with noise and ends with language. The trajectory is familiar. The destination, one suspects, is not particularly far from here.