llama.cpp build b8802 is out, and the headlining change is a native RDMA transport layer for the RPC backend using RoCEv2 — the same high-throughput, low-latency networking fabric that datacenter GPU clusters lean on. This is not a cosmetic update.

What's New

The core addition is PR #20590: native RDMA (Remote Direct Memory Access) over RoCEv2 for the RPC backend. Previously, llama.cpp's RPC backend, which lets you offload model layers to remote machines, was limited to standard TCP transport. RDMA bypasses the kernel networking stack and lets the NIC move data directly between registered memory regions on different machines, with no intermediate copies through the host CPU. For multi-node inference rigs running large models across several machines on a fast local network, this is a material improvement in both throughput and latency. Standard binary builds are available for macOS Apple Silicon (with and without KleidiAI), macOS Intel, iOS XCFramework, and Linux targets including x64, arm64, and s390x.
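For context, the RPC workflow looks roughly like this. A sketch, not taken from the PR: the binary names (`rpc-server`, `llama-cli`) and the `--rpc` flag match the existing RPC backend's documented usage, but the host, port, model path, and flag values here are illustrative, and whether RDMA needs extra flags beyond a RoCE-capable NIC is not confirmed by this release note.

```shell
# On the remote machine (the one lending GPU/RAM), start the RPC endpoint.
# Requires a build with the RPC backend enabled (GGML_RPC=ON).
rpc-server --host 0.0.0.0 --port 50052

# On the local machine, point inference at the remote backend.
# Model path and layer count (-ngl) are placeholders.
llama-cli -m ./models/llama-70b-q4_k_m.gguf \
  --rpc 192.168.1.42:50052 \
  -ngl 99 \
  -p "Hello"
```

Multiple remote endpoints can be passed as a comma-separated list to `--rpc`, which is how layers end up spread across several machines.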

Why It Matters

RDMA support signals that llama.cpp is increasingly serious about distributed inference, not just single-machine local use. Enthusiasts and small labs running models too large for one machine — think 70B+ at full or near-full precision — now have a proper high-speed interconnect path without routing traffic through the kernel. RoCEv2 is widely supported on modern 25GbE and 100GbE NICs, so this isn't exotic hardware for anyone already running a small homelab cluster.

What to Watch

The immediate question is stability and real-world performance data from the community. RDMA deployments require careful network configuration (congestion control, ECN, and PFC on the switches and NICs) and the implementation is fresh. Expect benchmarks and bug reports to surface quickly on the llama.cpp GitHub. Longer term, this kind of transport work lays the groundwork for proper multi-node tensor parallelism down the road.
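If you want to check whether your hardware is even a candidate before the benchmarks land, the standard `rdma-core` tooling is the place to start. A hedged sketch: `ibv_devinfo` ships with `rdma-core`; `show_gids` is a convenience script from NVIDIA/Mellanox OFED and may not be present on every system, in which case the sysfs path below covers the same ground. Output varies by NIC and driver.

```shell
# List RDMA-capable devices, their port state, and link layer.
# RoCE shows link_layer "Ethernet" (vs "InfiniBand").
ibv_devinfo | grep -E 'hca_id|state|link_layer'

# RoCEv2 requires a routable GID entry. With Mellanox/NVIDIA tools:
show_gids

# Without show_gids, inspect the GID type table directly via sysfs;
# entries reading "RoCE v2" indicate v2 support on that port.
cat /sys/class/infiniband/*/ports/*/gid_attrs/types/* 2>/dev/null
```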