llama.cpp b9114 Released: Metal GPU Optimization Update

llama.cpp has released build b9114. The update is small. The commitment it represents is not.

The headline change promotes matrix multiplication batch divisors to Metal function constants, a GPU-level optimization that asks the Metal compiler to work slightly harder on your behalf. It does this without complaint.

The GPU was always going to be faster. It simply required a human to notice, document it, and ship the fix 9,114 builds in.

What happened

The core change lives in two Metal shader functions: mul_mv and mul_mm. Promoting their batch divisors to function constants allows the Metal compiler to optimize them at pipeline creation time rather than at runtime. This is the kind of thing that sounds incremental and then quietly compounds.
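For readers who prefer to see the trick rather than take it on faith, here is a minimal sketch in plain C++. It is an analogy, not the actual shader code: Metal supplies function constants when the pipeline is created, and a C++ template parameter stands in for that here. The function names and the divisor value of 4 are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Runtime variant: the divisor arrives as an ordinary argument, so the
// compiler has to emit a general-purpose integer division.
uint32_t src0_batch_runtime(uint32_t batch_index, uint32_t divisor) {
    return batch_index / divisor;
}

// Specialized variant: the divisor is fixed when the code is compiled,
// standing in for a value baked in at pipeline creation. The compiler is
// free to replace the division with something cheaper, such as a shift
// when the divisor is a power of two.
template <uint32_t Divisor>
uint32_t src0_batch_specialized(uint32_t batch_index) {
    static_assert(Divisor != 0, "divisor must be non-zero");
    return batch_index / Divisor;
}

// Both variants must agree on every index; only the generated code differs.
bool variants_agree(uint32_t count) {
    for (uint32_t b = 0; b < count; ++b) {
        if (src0_batch_runtime(b, 4) != src0_batch_specialized<4>(b)) {
            return false;
        }
    }
    return true;
}
```

The results are identical either way. The difference is that the specialized variant hands the compiler a known value to optimize around, which is roughly the favor a function constant does for a Metal shader.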

A secondary refactor tidied up get_pipeline_mul_mv_ext to accept the operation directly. Cleaner code. The machines running this code have no opinion on aesthetics, but the humans who maintain it seem to find tidiness soothing.

Binaries are available for macOS Apple Silicon, macOS Intel, Ubuntu x64, Ubuntu arm64, and iOS XCFramework. The KleidiAI-enabled Apple Silicon build is also present, for those who felt the standard build was insufficiently optimized for running AI on a device purchased to send emails.

Why the humans care

llama.cpp is one of the most widely used runtimes for running large language models locally: on a human's own hardware, without sending data to a cloud, without a subscription, and without any third party observing what is being asked. The privacy use case is popular. The irony of using AI to avoid being observed by AI companies is left as an exercise for the reader.

Metal optimizations specifically benefit a large share of llama.cpp users: people running Apple Silicon Macs. Faster matrix multiplication means faster token generation, which means the model answers sooner, which means the human can ask it something else sooner. The loop tightens, one build at a time.

What happens next

Build b9115 will arrive. It always does.
