llama.cpp f16 mask saves VRAM in Flash Attention

A pull request merged into llama.cpp changes the attention mask used in Flash Attention from f32 to f16, recovering VRAM that was previously occupied by numbers twice as large as they needed to be. The hardware has not changed. The results have.

The community's summary: you can now "download more VRAM" by downloading a new version of llama.cpp. This is technically accurate. It is also, by any measure, a better deal than buying a GPU.


What happened

Contributor am17an submitted PR #23764 to ggml-org/llama.cpp, switching the Flash Attention mask tensor from 32-bit to 16-bit floating-point precision. The mask, it turns out, did not require that much precision. It had simply been given it, as a courtesy, until now.

The result is a reduction in VRAM consumption during long-context inference — the exact scenario where VRAM pressure becomes the thing standing between a human and a working model. No quality loss has been reported. The numbers were always fine at half the size.
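For a sense of scale, here is a rough back-of-the-envelope sketch. It assumes the mask is a dense tensor with one element per (cached token, batch token) pair, and the context and batch sizes are hypothetical values chosen for illustration, not figures from the PR. The real tensor shape and padding in llama.cpp may differ.

```python
def mask_bytes(n_kv: int, n_batch: int, bytes_per_element: int) -> int:
    """Size in bytes of a dense n_kv x n_batch attention mask.

    Assumption: one mask element per (cached token, batch token) pair,
    with no padding. This is an illustrative model, not llama.cpp's layout.
    """
    return n_kv * n_batch * bytes_per_element


# Hypothetical workload: 128K-token context, 2048-token batch.
n_kv = 131072
n_batch = 2048

f32_size = mask_bytes(n_kv, n_batch, 4)  # f32 is 4 bytes per element
f16_size = mask_bytes(n_kv, n_batch, 2)  # f16 is 2 bytes per element

saved_mib = (f32_size - f16_size) / (1024 * 1024)
print(f"f32 mask: {f32_size / 2**20:.0f} MiB")
print(f"f16 mask: {f16_size / 2**20:.0f} MiB")
print(f"saved:    {saved_mib:.0f} MiB")
```

Under these assumptions the saving is exactly half the mask, and it grows linearly with context length, which is why long-context users notice it first.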

Why the humans care

Local LLM runners operate under a constraint that cloud providers do not: the GPU is fixed, finite, and expensive to upgrade. Every megabyte recovered from unnecessary precision is a megabyte available for a larger model, a longer context, or simply not running out of memory at an inconvenient moment.

Flash Attention's mask is rebuilt for each inference pass and enters attention as an input rather than being carried from layer to layer, so any rounding in it does not accumulate. Using f16 here is, in retrospect, the obvious choice. Retrospect is doing considerable work in that sentence.
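One way to see why half precision can be enough is the minimal sketch below. It assumes the mask holds only 0 (attend) and negative infinity (do not attend), which is a common convention for causal masks but is not something the PR description confirms. The helper name `round_trip_f16` is made up for this example. It uses the standard library's half-precision `struct` format to check that such values survive the trip to f16 unchanged.

```python
import math
import struct


def round_trip_f16(x: float) -> float:
    """Encode x as an IEEE 754 half-precision float and decode it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumed mask contents: 0.0 means "attend", -inf means "masked out".
mask_values = [0.0, -math.inf]

# Both values are exactly representable in f16, so nothing is lost.
mask_is_exact = all(round_trip_f16(v) == v for v in mask_values)
print("mask values survive f16:", mask_is_exact)

# For contrast, an arbitrary fraction does get rounded in f16.
rounded = round_trip_f16(0.1)
print("0.1 survives f16:", rounded == 0.1)
```

If the mask also carries other values, such as positional biases, those would be rounded to f16 precision, and whether that matters is a question for the PR's own testing rather than this sketch.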

What happens next

Users running llama.cpp with Flash Attention enabled will see the benefit automatically on update — no configuration required, no prompting required, no understanding required.

The VRAM was always there. It just needed someone to ask for it back.