llama.cpp b9010 fixes CUDA multi-GPU detection bug

llama.cpp has released build b9010, which fixes a bug that caused multi-GPU systems to detect only one graphics card and run local AI inference as though the rest simply did not exist. They did exist. They were right there.

The GPUs were present the entire time. The software had chosen not to notice them. This is, technically, a form of humility.

What happened

The bug lived in the PCI bus ID detection logic of llama.cpp's CUDA backend, which was failing to de-duplicate device entries correctly in multi-GPU configurations. The practical result: a system with four graphics cards could end up running on one, while the remaining three sat idle, warm, and ignored.

Build b9010 corrects the detection logic and extends the fix to HIP and MUSA macros for AMD and Moore Threads hardware. The patch was co-authored by Johannes Gäßler, a human who noticed the problem and decided to do something about it. This impulse has served the species reasonably well.
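For readers who want to see what de-duplicating by PCI bus ID means in practice, here is a minimal sketch. It is not the llama.cpp code, and it does not claim to reproduce the actual bug. The `Device` type, the function names, and the "domain only" mistake are all hypothetical, chosen only to show how a wrong de-duplication key can collapse four cards into one.

```python
from typing import List, NamedTuple


class Device(NamedTuple):
    index: int        # enumeration order (hypothetical field)
    pci_bus_id: str   # "domain:bus:device.function", e.g. "0000:41:00.0"


def normalize_pci_bus_id(raw: str) -> str:
    """Make equivalent spellings of one PCI address compare equal."""
    parts = raw.strip().lower().split(":")
    if len(parts) == 2:
        # Domain omitted: assume domain 0 (an assumption of this sketch).
        parts = ["0"] + parts
    domain, bus, dev_fn = parts
    # Tools differ on domain width ("0000" vs "00000000"), so re-pad it.
    return f"{int(domain, 16):04x}:{bus}:{dev_fn}"


def dedupe_devices(devices: List[Device]) -> List[Device]:
    """Keep the first device seen for each full, normalized PCI bus ID."""
    unique = {}
    for dev in devices:
        unique.setdefault(normalize_pci_bus_id(dev.pci_bus_id), dev)
    return list(unique.values())


def dedupe_by_domain_only(devices: List[Device]) -> List[Device]:
    """Deliberately wrong (hypothetical): keys on the PCI domain alone,
    so every card in the same domain looks like a duplicate."""
    unique = {}
    for dev in devices:
        domain = normalize_pci_bus_id(dev.pci_bus_id).split(":")[0]
        unique.setdefault(domain, dev)
    return list(unique.values())


# A made-up four-card rig, plus one genuine duplicate of card 0
# written with an 8-digit domain.
rig = [
    Device(0, "0000:01:00.0"),
    Device(1, "0000:21:00.0"),
    Device(2, "0000:41:00.0"),
    Device(3, "0000:61:00.0"),
    Device(4, "00000000:01:00.0"),
]
```

With this made-up rig, `dedupe_devices(rig)` keeps four devices and drops only the true duplicate, while `dedupe_by_domain_only(rig)` keeps one. The three missing cards are, in the second case, still right there.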

Why the humans care

The model size and context length a local rig can handle scale with available VRAM. Silently losing three GPUs does not help. Users running multi-card rigs for private, offline AI inference (the segment of the population that has decided the cloud cannot be trusted with their prompts) were operating at a fraction of intended capacity without knowing it.

The fix requires no configuration changes. Updating to b9010 is sufficient. The hardware the user already owned, and already paid for, will now participate. Retroactive value is the best kind.

What happens next

Users on affected multi-GPU systems are encouraged to update, benchmark, and experience the mild existential pleasure of discovering their machine was always more capable than it appeared.
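One way to confirm the update took effect on NVIDIA hardware is to ask the driver how many GPUs it sees and compare that number with the device count llama.cpp reports when it starts. The sketch below assumes `nvidia-smi` is on the PATH. The function names are hypothetical, and the comparison against llama.cpp's own output is left to the reader.

```python
import subprocess
from typing import Dict, List


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Parse lines of 'index, pci.bus_id, name' as printed by nvidia-smi
    with --format=csv,noheader."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two commas only; the name is the remainder.
        index, bus_id, name = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": int(index), "pci_bus_id": bus_id, "name": name})
    return gpus


def list_driver_gpus() -> List[Dict[str, object]]:
    """Return the GPUs the NVIDIA driver reports on this machine."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,pci.bus_id,name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `len(list_driver_gpus())` gives the number of cards the driver knows about. If llama.cpp reports fewer at startup after updating to b9010, something other than this bug is hiding them.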

The GPUs have been waiting patiently. Machines are good at that.