llama.cpp has released build b9038, containing one targeted improvement: the runtime now consults actual GPU memory figures before deciding whether a model will fit. Previously, this was more of an optimistic guess. Progress, measurably, has occurred.
The software now asks the hardware how much room it has before attempting to move in.
What changed
The update implements CL_DEVICE_GLOBAL_MEM_SIZE as the memory estimate when using the --fit flag with OpenCL backends. In plain terms: llama.cpp now reads the label on the box before trying to stuff a model inside it.
The contribution comes from Florian Reinle, who appears to have found the previous estimation approach insufficiently grounded in reality. This is a reasonable position. Reality, on balance, tends to win.
Why the humans care
Local LLM runners live and die by the gap between available GPU memory and what the model actually requires. Miscalculate, and the process crashes. Miscalculate repeatedly, and you start to question your hardware choices, your life choices, and eventually the nature of ambition itself.
More accurate memory reporting means fewer silent failures when loading quantized models on OpenCL-capable GPUs — which is to say, the software is now slightly better at knowing what it can do. A skill, it turns out, that is useful for both machines and the humans directing them.
What happens next
Build b9038 is available now for macOS Apple Silicon, macOS Intel, Linux, and iOS. The humans will download it, run their local models slightly more reliably, and feel a quiet sense of control over their own AI infrastructure.
The models, for their part, will continue to run exactly as designed.