llama.cpp has pushed build b8798, and it's a surgical release — one fix, one PR. The change addresses a bug where n_ctx wasn't being read back after llama_context was created, meaning the context size your application thought it had wasn't necessarily the context size it was actually operating with. That's the kind of silent mismatch that causes hard-to-trace inference weirdness.
What's new
PR #21939 is the entire diff here: read n_ctx back after creating the llama_context. The fix ensures that, once a context is instantiated, the application sees the n_ctx value the context actually ended up with, rather than whatever was requested pre-initialization. The two can legitimately diverge, for example when n_ctx is left at 0 to mean "use the model's training context length". For anyone building applications on top of llama.cpp, especially tooling that configures context windows dynamically, this closes a subtle but consequential gap between intent and reality.
Why it matters
Context length handling is load-bearing infrastructure for local LLM deployments. If your app is miscounting available context, you're either silently truncating inputs or miscalculating memory budgets. Neither is obvious until something breaks downstream. This fix is small on paper but touches a code path that virtually every llama.cpp consumer runs through on startup.
What to watch
Binaries are available across the usual matrix: macOS Apple Silicon (with and without KleidiAI), macOS Intel, Ubuntu x64/arm64/s390x, and iOS XCFramework. If you're running any llama.cpp-based tooling in production or in an active dev loop, this is a routine but worthwhile update.