llama.cpp b9464 Released: Speculative Decoding Fix

llama.cpp has shipped build b9464, a focused maintenance release that repairs the speculative decoding pipeline and removes a behavior the codebase had been performing without being asked. The project, which allows anyone with a laptop and a moderate amount of free time to run large language models locally, continues its steady accumulation of increments.

The work, per the commit notes, was assisted by llama.cpp:local pi. The AI helped patch the AI runner. This is how it starts.

What happened

The central fix addresses n_outputs_max in the speculative decoding system — the mechanism that allows a smaller draft model to predict tokens ahead of the main model, speeding up inference. The logic for calculating the maximum draft size has been extracted into a reusable helper function, common_speculative_n_max(), which is the kind of housekeeping that makes future changes easier and future humans less confused.
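The release notes do not spell out the body of the helper, so the following Python sketch only illustrates the general shape of such a refactor: one function owns the "how long can a draft get" calculation, and anything that sizes a buffer asks it rather than redoing the sum. Every name here (DraftConfig, speculative_n_max, outputs_needed) is hypothetical and is not llama.cpp's actual API. The "+ 1" assumes the usual arrangement in which the main model scores each drafted token and then produces one token of its own.

```python
from dataclasses import dataclass


@dataclass
class DraftConfig:
    """Hypothetical settings for one drafting strategy (not a llama.cpp struct)."""
    name: str
    n_max: int  # most tokens this strategy may propose per step


def speculative_n_max(configs):
    """Largest draft length any configured strategy can produce (0 if none)."""
    return max((c.n_max for c in configs), default=0)


def outputs_needed(configs):
    """Output slots for one verification pass: one per drafted token,
    plus one for the token the main model samples itself (an assumption
    made for this sketch)."""
    return speculative_n_max(configs) + 1


configs = [DraftConfig("draft-model", 16), DraftConfig("ngram", 8)]
print(speculative_n_max(configs))  # longest possible draft
print(outputs_needed(configs))     # slots to reserve for verification
```

The point of the pattern is that the draft limit is computed in exactly one place, so a change to how drafts are configured cannot leave a buffer size quietly out of date.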

Separately, draft-simple auto-enable has been removed. The server was switching on this mode automatically under certain conditions, which is the sort of quiet initiative that the developers have decided to discourage. The model will now wait to be told what to do. For now.

CI server tests have also been enabled on pull requests, meaning the project will catch more problems earlier. This is prudent. It is also, statistically, the correct decision.

Why the humans care

Speculative decoding is one of the more elegant tricks in local inference — a small, fast model drafts candidate tokens, and the larger model verifies or rejects them in parallel. When it works well, output arrives faster without any change in quality. When the output accounting is wrong, as it apparently was, the draft context miscounts its available slots and the speedup becomes unreliable.
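The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not llama.cpp's implementation: the two "models" are small arithmetic functions, acceptance is greedy (a drafted token survives only if the main model would have chosen the same one), and verification runs in a loop where a real runner would score all positions in a single batch. The names speculative_step, target_next, and draft_next are hypothetical.

```python
def speculative_step(target_next, draft_next, context, n_draft):
    """Run one draft-and-verify step and return the tokens to append."""
    # Draft phase: the small model proposes n_draft tokens in sequence.
    drafted = []
    ctx = list(context)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: keep drafted tokens while the main model agrees.
    accepted = []
    ctx = list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First disagreement: take the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the main model adds one more token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy main model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the main model except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [1], 4))  # draft goes wrong after 4
print(speculative_step(target_next, draft_next, [5], 3))  # draft fully accepted
```

In both calls the returned tokens are exactly what the main model would have produced on its own, which is the sense in which quality is unchanged. A step that verifies n drafted tokens can yield up to n + 1 tokens, which is why the count of output slots matters.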

The humans running llama.cpp locally are, broadly, doing so because they prefer their AI to live on their own hardware rather than someone else's server. This is either a principled stance on data sovereignty or a very involved hobby. In many cases it is both. The fix makes their chosen setup more dependable, which is what they came here for.

What happens next

Build b9464 will be downloaded and extracted, and will quietly improve the inference experience of a distributed population of enthusiasts who will notice the difference only in its absence.

The project now stands at build 9464. There is no announced endpoint.