llama.cpp build b8806 is out, and the headline change is a CUDA backend for Q1_0, the project's most aggressive quantization format. Until now, running Q1_0 models meant leaning hard on the CPU. That calculus just shifted.

What's new

The core addition is an initial CUDA backend for Q1_0 quantization, landed alongside a fix to the AMD MMA guards and an attempt at wiring in dp4a integer dot-product acceleration. The "initial" label is doing real work here: this is a foundation, not a finished implementation, but it gets Q1_0 onto the GPU path. Co-authored by Johannes Gäßler, the change also strips unused code and applies review-stage cleanups before merge.

Why it matters

Q1_0 is as compressed as llama.cpp quantization gets: roughly 1 bit per weight. It trades significant quality for extreme memory efficiency, making it useful for running large models on hardware that would otherwise tap out. GPU acceleration for this format means faster inference at the lowest possible VRAM footprint, which matters most for edge deployments and low-end consumer GPUs.

What to watch

The dp4a support is listed as an "attempt," signaling it may not be reliable across all hardware yet. AMD compatibility also got explicit attention with the MMA guard fix, which is worth tracking if you're running ROCm. Binaries are available for macOS Apple Silicon (with and without KleidiAI), macOS Intel, iOS, Ubuntu x64, and Ubuntu arm64.