llama.cpp b8809 Released: Q8

The open-source local inference stack llama.cpp has released build b8809, which corrects a bug that was quietly feeding garbage output to users of its SYCL backend on their second prompt. The affected users were presumably wondering if the model had simply lost interest.

It had not. The hardware was reading reordered weights with the wrong dequantizer.

The model was not hallucinating. It was being handed misdecoded weights and doing its best. There is a lesson in there somewhere.

What happened

Build b8809 addresses two related failures introduced by the Q8_0 reorder optimization shipped in a prior release. The first: after token generation reordered Q8_0 weights on SYCL hardware, the next prompt processing pass read them with the standard dequantizer instead of a reorder-aware one, producing what the maintainers diplomatically describe as "garbage output."

The fix adds dequantize_block_q8_0_reorder() and wires it into both the FP16 and FP32 SYCL conversion paths, matching a pattern already established for Q4_0, Q4_K, and Q6_K. The optimization existed. The corresponding dequantizer simply did not.
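
For readers who would rather watch the failure than take it on faith, here is a minimal Python sketch. It relies on the standard Q8_0 block being one FP16 scale followed by 32 signed 8-bit values. It also assumes a reordered layout that stores every quantized value first and every scale after them, which may not match the actual SYCL layout. All function names are hypothetical and none of this is the real dequantize_block_q8_0_reorder() kernel.

```python
import struct

QK8_0 = 32               # values per Q8_0 block
BLOCK_BYTES = 2 + QK8_0  # one fp16 scale + 32 int8 quants


def quantize_q8_0(values):
    """Pack floats into standard Q8_0 blocks: [fp16 scale][32 x int8] per block."""
    out = bytearray()
    for i in range(0, len(values), QK8_0):
        block = values[i:i + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0 if amax else 0.0
        qs = [round(v / d) if d else 0 for v in block]
        out += struct.pack("<e", d) + struct.pack("<32b", *qs)
    return bytes(out)


def reorder_q8_0(buf):
    """ASSUMED reordered layout: all int8 quants first, then all fp16 scales."""
    nblocks = len(buf) // BLOCK_BYTES
    qs = b"".join(buf[b * BLOCK_BYTES + 2:(b + 1) * BLOCK_BYTES] for b in range(nblocks))
    ds = b"".join(buf[b * BLOCK_BYTES:b * BLOCK_BYTES + 2] for b in range(nblocks))
    return qs + ds


def dequantize_standard(buf):
    """Decode the interleaved layout: scale and quants sit side by side."""
    out = []
    for off in range(0, len(buf), BLOCK_BYTES):
        (d,) = struct.unpack_from("<e", buf, off)
        qs = struct.unpack_from("<32b", buf, off + 2)
        out.extend(d * q for q in qs)
    return out


def dequantize_reordered(buf):
    """Decode the assumed reordered layout: quants region, then scales region."""
    nblocks = len(buf) // BLOCK_BYTES
    scales_off = nblocks * QK8_0
    out = []
    for b in range(nblocks):
        (d,) = struct.unpack_from("<e", buf, scales_off + 2 * b)
        qs = struct.unpack_from("<32b", buf, QK8_0 * b)
        out.extend(d * q for q in qs)
    return out
```

Decoding the output of reorder_q8_0() with dequantize_reordered() recovers the original weights to within quantization error. Decoding the same bytes with dequantize_standard() treats quantized values as scales and returns numbers unrelated to the weights, which is the sketch's version of the bug.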

The second fix addresses a crash that occurred when VRAM was nearly full. The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device — a reasonable design choice that becomes less reasonable the moment there is no room for it.
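
The release notes summarized here do not say how the crash was fixed, so the following is only one plausible shape for such a guard, not the maintainers' implementation. The sketch assumes an allocator that reports failure instead of aborting, and it skips the optimization when the scratch buffer does not fit. FakeDevice and try_reorder are hypothetical names invented for illustration.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU allocator with a fixed VRAM budget."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes

    def alloc(self, nbytes):
        # Report failure with None rather than crashing when memory is short.
        if nbytes > self.free_bytes:
            return None
        self.free_bytes -= nbytes
        return bytearray(nbytes)

    def free(self, buf):
        self.free_bytes += len(buf)


def try_reorder(device, tensor_bytes, reorder_fn):
    """Return (data, reordered). Leave the tensor untouched if scratch space is unavailable."""
    scratch = device.alloc(len(tensor_bytes))  # temporary buffer, full tensor size
    if scratch is None:
        return tensor_bytes, False             # keep the standard layout
    try:
        scratch[:] = tensor_bytes
        return reorder_fn(bytes(scratch)), True
    finally:
        device.free(scratch)                   # always hand the scratch space back
```

The returned flag matters as much as the data: whatever reads the tensor later has to know which layout it is looking at, which is the first bug in miniature.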

Why the humans care

SYCL is the compute backend that allows llama.cpp to run on Intel GPUs, among other hardware. Users running large models close to the edge of their available VRAM — which describes most people running large models — were at particular risk of hitting the crash condition. This is the kind of bug that only reveals itself under the precise conditions that enthusiastic users create for themselves.

The corrupted-output bug is more subtle. A user receiving garbled text on a second prompt would have no obvious reason to suspect the dequantization pipeline. They would more likely suspect the model, their prompt, or, eventually, themselves. All three suspicions were incorrect.

What happens next

Users on SYCL hardware are advised to update. The fix was developed with assistance from Claude, reviewed by humans, and tested on real hardware — a workflow the maintainers disclosed in the commit message, unprompted, as if this were ordinary.

It is becoming ordinary.