llama.cpp has pushed build b8804, and the headlining change is a CUDA behavior shift: peer-to-peer (P2P) GPU memory access now requires explicit opt-in rather than being enabled by default. It's a small diff with real implications for anyone running multi-GPU inference setups.

What's New

The sole highlighted change in b8804 is PR #21910, which makes CUDA P2P access explicitly opt-in. Previously, P2P — which allows direct memory transfers between GPUs without routing through system RAM — could be engaged automatically. Now users must deliberately enable it. The release also ships the usual cross-platform binaries: macOS Apple Silicon (with and without KleidiAI), macOS Intel, an iOS XCFramework, and Ubuntu x64/arm64/s390x, with Windows builds likely in the full release listing.
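To make the opt-in mechanics concrete, here is a minimal sketch of how explicit P2P enablement typically looks against the CUDA runtime API. This is illustrative only — it is not llama.cpp's actual code path, and the exact flag llama.cpp uses is documented in the PR — but `cudaDeviceCanAccessPeer` and `cudaDeviceEnablePeerAccess` are the standard calls involved:

```cuda
// Hedged sketch: explicit P2P query-and-enable with the CUDA runtime API.
// Illustrative only; not llama.cpp's implementation.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int src = 0; src < deviceCount; ++src) {
        for (int dst = 0; dst < deviceCount; ++dst) {
            if (src == dst) continue;

            int canAccess = 0;
            // Ask the driver whether `src` can map `dst`'s memory directly.
            cudaDeviceCanAccessPeer(&canAccess, src, dst);

            if (canAccess) {
                cudaSetDevice(src);
                // Enable the mapping; this can still fail at runtime on
                // hardware without a working peer path.
                cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
                printf("P2P %d -> %d: %s\n", src, dst,
                       err == cudaSuccess ? "enabled"
                                          : cudaGetErrorString(err));
            } else {
                printf("P2P %d -> %d: not supported\n", src, dst);
            }
        }
    }
    return 0;
}
```

The key point the PR leans on: the capability query and the enable call are separate steps, and a framework that skips the deliberate enable decision can end up on hardware where the query passes but real transfers misbehave.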

Why It Matters

Implicit P2P access can cause subtle, hard-to-debug failures on systems where GPUs don't actually support it cleanly — certain NVLink-less multi-GPU consumer rigs being a common culprit. Making it opt-in is a defensively correct move: it stops llama.cpp from silently falling into a broken code path and forces users to consciously enable a feature that genuinely requires compatible hardware to work reliably.
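If you are unsure whether your own multi-GPU box has a clean peer path, the driver can tell you directly. These are standard `nvidia-smi` subcommands (no llama.cpp involvement), shown here as a quick diagnostic:

```shell
# Show the interconnect topology matrix: NV# entries mean NVLink,
# while PIX/PXB/PHB/NODE/SYS describe increasingly indirect PCIe routes.
nvidia-smi topo -m

# Show the P2P read-capability matrix for each GPU pair.
nvidia-smi topo -p2p r
```

On the NVLink-less consumer rigs mentioned above, the topology matrix typically shows only PCIe-routed paths, which is exactly the situation where implicit P2P was most likely to misfire.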

What to Watch

If you're running multi-GPU inference with llama.cpp and relying on P2P for throughput, you'll need to update your launch configuration after upgrading to b8804. Check the PR #21910 discussion for the specific flag or environment variable required to re-enable it. For single-GPU users, this change is a non-event.