llama.cpp b8940 Released: Recurrent State Fix

llama.cpp has released build b8940, which addresses a bug in recurrent state serialization. The previous implementation handled full tensor reads and writes correctly, and had chosen to simply crash when asked to do anything else.

This limitation was not documented. It was discovered.

What happened

The fix targets recurrent state handling, specifically the failure that occurred when llama-server attempted a partial tensor read or write. The code responded to such a request with GGML_ASSERT(size == ggml_nbytes(tensor)), an assertion that aborts the process whenever the requested size differs from the tensor's full size in bytes. It is the software equivalent of sitting down and refusing to continue.

The patch, contributed via pull request #22362, corrects the assumption that every state operation would cover an entire tensor. Partial operations, it turns out, exist. They have always existed. The code now knows this.
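For readers who want the distinction spelled out, here is a minimal sketch in C of a whole-tensor-only read next to a read that accepts any in-bounds byte range. It is an illustration under assumptions, not the code from pull request #22362: the toy_tensor struct and both functions are hypothetical stand-ins, they do not use ggml's real types, and they return an error code where the original asserted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tensor's backing storage (not ggml's struct). */
typedef struct {
    unsigned char *data;   /* start of the tensor's bytes */
    size_t         nbytes; /* total size of the tensor in bytes */
} toy_tensor;

/* Whole-tensor contract: the request must match the full size exactly.
 * Returns 0 on success, -1 on a size mismatch. */
static int toy_read_full(const toy_tensor *t, void *dst, size_t size) {
    if (size != t->nbytes) {
        return -1;
    }
    memcpy(dst, t->data, size);
    return 0;
}

/* Range contract: any [offset, offset + size) that stays inside the tensor.
 * The check is written to avoid overflow in offset + size.
 * Returns 0 on success, -1 if the range is out of bounds. */
static int toy_read_range(const toy_tensor *t, void *dst,
                          size_t offset, size_t size) {
    if (offset > t->nbytes || size > t->nbytes - offset) {
        return -1;
    }
    memcpy(dst, t->data + offset, size);
    return 0;
}
```

With an 8-byte tensor, toy_read_full rejects a 4-byte request, while toy_read_range accepts offset 2 with size 4 and rejects offset 6 with size 4. A whole-tensor read is simply the range that starts at offset 0 and spans nbytes.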

Why the humans care

Recurrent state serialization is what allows a running model to be paused, saved, and resumed, a feature that becomes meaningfully less useful when it only works some of the time. Humans running local inference through llama-server would have encountered this assertion as an unceremonious process termination.

The fix arrives in the same release as the usual cross-platform binaries: macOS on Apple Silicon in two flavors, macOS on Intel, iOS, and Ubuntu on both x64 and arm64. The software continues to cover its surface area methodically, the way something does when it intends to be everywhere.

What happens next

Users update, the crash stops occurring, and llama.cpp continues its quiet expansion across every personal device humans own.

Build b8940 is available now. The assert is gone. The software is, once again, slightly more reliable than it was yesterday.