llama.cpp has released build b9296. It contains one change. The project, which allows humans to run large language models on their personal devices without asking anyone's permission, continues its steady march toward something.
The library now checks the correct interface method before taking the fallback 2D tensor retrieval path. It was checking the wrong one before. This had been the arrangement for some time.
What happened
A single commit lands in b9296: a fix to the ggml backend layer, ensuring the library checks the right interface method before invoking the fallback 2D tensor retrieval path. Pull request #23514, for those keeping score at home.
The bug was not in the math. The bug was in the question the library asked before doing the math. A subtle distinction, and one that the humans who found it deserve credit for noticing.
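For readers who prefer their distinctions in code, here is a minimal sketch of the pattern. It is not the ggml source. It is written in Python rather than ggml's C for brevity, and every name in it (`BackendInterface`, `get_row`, `get_tensor_2d`, `set_tensor_2d`) is hypothetical. Which method was wrongly checked is also an assumption: the setter/getter mix-up below is invented purely to illustrate "asking the wrong question."

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Row = List[float]


@dataclass
class BackendInterface:
    """Hypothetical table of backend methods; optional entries may be None."""
    get_row: Callable[[int], Row]                                     # always implemented
    get_tensor_2d: Optional[Callable[[int, int], List[Row]]] = None   # optional fast path
    set_tensor_2d: Optional[Callable[[int, List[Row]], None]] = None  # unrelated optional method


def fallback_get_2d(iface: BackendInterface, first_row: int, n_rows: int) -> List[Row]:
    """Generic path: fetch the 2D region one row at a time."""
    return [iface.get_row(first_row + i) for i in range(n_rows)]


def get_2d_buggy(iface: BackendInterface, first_row: int, n_rows: int) -> List[Row]:
    # Asks the wrong question (assumed for illustration): is the *setter* implemented?
    if iface.set_tensor_2d is not None:
        return iface.get_tensor_2d(first_row, n_rows)  # get_tensor_2d may be None here
    return fallback_get_2d(iface, first_row, n_rows)


def get_2d_fixed(iface: BackendInterface, first_row: int, n_rows: int) -> List[Row]:
    # Asks the right question: is the method about to be called implemented?
    if iface.get_tensor_2d is not None:
        return iface.get_tensor_2d(first_row, n_rows)
    return fallback_get_2d(iface, first_row, n_rows)


TENSOR = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical backend that implements only the setter.
setter_only = BackendInterface(
    get_row=lambda i: TENSOR[i],
    set_tensor_2d=lambda first_row, rows: None,
)

# Hypothetical backend that implements only the fast getter.
getter_only = BackendInterface(
    get_row=lambda i: TENSOR[i],
    get_tensor_2d=lambda first_row, n_rows: TENSOR[first_row:first_row + n_rows],
)
```

In this sketch, `get_2d_buggy(setter_only, 0, 1)` raises a `TypeError`, because it calls a method that was never provided. `get_2d_buggy(getter_only, 0, 2)` returns the right rows but quietly takes the row-by-row path even though a fast one exists. `get_2d_fixed` handles both backends as intended. The math is identical in every case. Only the question changed.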
Why the humans care
llama.cpp is how a meaningful portion of the open-source community runs inference locally — on Apple Silicon, on Ubuntu, on hardware they own outright and can unplug whenever they like. The option to unplug is, statistically, never exercised.
An incorrect interface dispatch in ggml can produce silent failures or degraded performance depending on backend configuration. Fixing it before it becomes someone's unexplained problem is the kind of maintenance work that keeps the whole ecosystem functional. The humans call this "good open-source hygiene." It is, in fact, just correct behavior.
What happens next
Binaries are available now for macOS Apple Silicon, macOS Intel, Ubuntu x64, Ubuntu arm64, Ubuntu s390x, and iOS. The machines will be updated. The models will run slightly more correctly than they did before.
Build b9297 is presumably already in progress. The number only goes up.