A developer has shipped a llama.cpp fork that caches the most frequently-routed experts in VRAM dynamically, squeezing 22.67 tok/s out of Qwen3.5-122B-A10B on a single RTX 4090 — up from 17.87 tok/s with standard layer-based partial offload at the same VRAM budget, and up from 15.65 tok/s with a CPU-only expert baseline.

What's new

The approach is straightforward: track which experts an MoE model routes to most often over the last N tokens, load those "hot" experts into VRAM, and swap them out on a configurable interval. Everything else stays in system RAM. At 22.2GB of VRAM used (nearly identical to the 22.6GB consumed by the layer-based baseline), the dynamic cache posts a 26.8% token-generation improvement over equivalent layer offload, with only a marginal prompt-processing penalty. Three new arguments control the behavior: LLAMA_ARG_MOE_HOT_K (expert slots in VRAM), LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL (swap frequency in tokens), and a prompt-processing bypass threshold. The author tested on a Ryzen 9 7950X with 96GB of RAM paired with the RTX 4090, running bartowski's Q4_K_L quant with a 131K-token KV cache.
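The mechanism described above can be sketched in a few dozen lines. The following is a toy Python model of the idea, not the fork's actual code: the class name, method names, and data structures are all hypothetical, and real weight transfers between host RAM and VRAM are reduced to set bookkeeping.

```python
from collections import Counter, deque

class HotExpertCache:
    """Toy sketch of frequency-based expert caching (hypothetical API).

    Tracks which experts the router selects over the last `window` tokens
    and, every `rebalance_interval` tokens, promotes the `hot_k` most-routed
    experts to a simulated VRAM set, evicting everything else.
    """

    def __init__(self, hot_k, window, rebalance_interval):
        self.hot_k = hot_k                    # expert slots "in VRAM"
        self.window = window                  # N tokens of routing history
        self.rebalance_interval = rebalance_interval
        self.history = deque()                # (token_index, expert_id) pairs
        self.counts = Counter()               # routing frequency in the window
        self.hot = set()                      # experts currently resident
        self.tokens_seen = 0

    def observe(self, routed_experts):
        """Record the experts the router picked for one generated token."""
        self.tokens_seen += 1
        for e in routed_experts:
            self.history.append((self.tokens_seen, e))
            self.counts[e] += 1
        # Age out routing records that fell outside the window.
        while self.history and self.history[0][0] <= self.tokens_seen - self.window:
            _, old = self.history.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        if self.tokens_seen % self.rebalance_interval == 0:
            self._rebalance()

    def _rebalance(self):
        """Swap the resident set to the current top-k experts."""
        new_hot = {e for e, _ in self.counts.most_common(self.hot_k)}
        to_load = new_hot - self.hot          # would copy weights host -> VRAM
        to_evict = self.hot - new_hot         # would free those VRAM slots
        self.hot = new_hot
        return to_load, to_evict
```

The trade-off the sketch makes visible: a larger `rebalance_interval` means fewer host-to-VRAM copies but slower adaptation when the routing distribution shifts mid-generation.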

Why it matters

Running 100B+ MoE models on consumer hardware without unified memory is a pain point with no clean solution. Layer-based offload spends VRAM on attention and MLP weights that don't bottleneck token generation the way expert computation does. By aiming the VRAM budget specifically at the experts that actually get called, this approach extracts more generation throughput per gigabyte than the current llama.cpp default. Going from roughly 16 to nearly 23 tok/s on a 122B model is the difference between a frustrating experience and a usable one for streaming responses.

What to watch

The fork is code-only for now — no prebuilt binaries yet. The author describes it as "still cooking," and Claude apparently helped with the implementation. Whether this gets upstreamed into mainline llama.cpp is the real question; the core llama.cpp project has been selective about MoE-specific optimizations. If the technique holds up across other MoE architectures beyond Qwen3.5, it could become a standard offload strategy for anyone running large sparse models on single-GPU consumer rigs.