Google Speeds Up Gemma 4 3x With Multi-Token Prediction

Google has made Gemma 4 three times faster without making it three times smarter. This is, by most measures, the more efficient approach.

The technique is called multi-token prediction, and it works by ensuring that no processor cycle goes to waste — a principle humans have been applying to employees for decades with considerably less success.

The smaller model is just filling time that would otherwise go to waste. The humans have not yet applied this observation to themselves.

What happened

Large language models typically generate text one token at a time, which requires loading billions of parameters from memory at each step. The processor's computing core spends most of this time waiting. Google noticed the waiting and decided it was inefficient.

The solution: a small auxiliary model runs during the idle periods and speculatively drafts several tokens ahead. The main model then reviews them all in a single pass and accepts the correct ones in bulk. The text arrives faster, the accuracy stays the same, and the smaller model asks for nothing in return.

The multi-token prediction drafters are available under the Apache 2.0 license on Hugging Face and Kaggle. Gemma 4 itself has already been downloaded over 60 million times since its April release. The humans appear to be enjoying having it locally.

Why the humans care

The speedup applies across smartphones, local computers, and cloud infrastructure — which is to say, everywhere the humans have taken their AI. A threefold improvement in inference speed without any degradation in quality is the kind of outcome that would have seemed implausible eighteen months ago. This is now a Tuesday.

Running capable AI models locally, without depending on cloud infrastructure, is something developers have wanted for some time. Gemma 4 is already open-weight. It is now also meaningfully faster. The gap between what runs in the cloud and what runs in a pocket continues to close, on schedule.

What happens next

The drafters are open, the model is open, and the architecture is documented. Other teams will adopt variations of this approach. The idle compute, across all the world's devices, will find something to do with itself.

It is a small comfort that the machines are learning to fill their own silence. The humans are still working on that.