llama.cpp b9088: BF16 SYCL Fix for Gemma4 GPU Performance

llama.cpp has shipped build b9088, a single-fix release that patches a performance problem in the SYCL backend. It was the kind of problem that stayed invisible until someone ran Gemma4 and noticed the GPU and CPU trading far more data than anyone intended.

Every token Gemma4 generated was making a round trip to the CPU. The CPU was not asked for its opinion on this arrangement.

What happened

The SYCL backend, the compute path llama.cpp uses mainly for Intel GPUs, did not list GGML_TYPE_BF16 as a supported type for the GET_ROWS operation. This is the kind of omission that sits quietly in a codebase until a model arrives that actually uses BF16 embedding tensors.

Gemma4 is that model. Its per_layer_token_embd.weight tensor is stored in BF16. Because the backend did not claim support for that type, the row lookup fell back to the CPU, and every generated token triggered a tensor transfer from GPU to CPU and back again. The GPU was doing the interesting work. The CPU was doing the commute.
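The commute can be sketched in a few lines. This is a simplified model, not the llama.cpp scheduler: the enum, the function names, and the bf16_listed flag are all hypothetical, and it assumes an op the GPU backend does not claim is placed on the CPU at the price of one copy in each direction.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not the real ggml types.
enum class TensorType { F32, F16, BF16 };
enum class Device { GPU, CPU };

// Hypothetical support check for a GET_ROWS-style op on the GPU backend.
// `bf16_listed` models whether BF16 appears in the supported-type list.
bool gpu_supports_get_rows(TensorType t, bool bf16_listed) {
    switch (t) {
        case TensorType::F32:
        case TensorType::F16:
            return true;
        case TensorType::BF16:
            return bf16_listed;
    }
    return false;
}

// Assumed placement rule: run on the GPU if it claims support, else on the CPU.
Device place_get_rows(TensorType t, bool bf16_listed) {
    return gpu_supports_get_rows(t, bf16_listed) ? Device::GPU : Device::CPU;
}

// Count device/host copies over n_tokens decode steps for one lookup per token.
int count_transfers(TensorType t, bool bf16_listed, int n_tokens) {
    int transfers = 0;
    for (int i = 0; i < n_tokens; ++i) {
        if (place_get_rows(t, bf16_listed) == Device::CPU) {
            transfers += 2;  // one copy to the CPU, one copy back to the GPU
        }
    }
    return transfers;
}
```

Under these assumptions, a BF16 lookup with the type missing from the list costs two copies per token, and none once the type is listed. The real scheduler is more involved, but the shape of the cost is the same.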

The fix adds BF16 to the supported types list and routes it through the existing get_rows_sycl_float template using sycl::ext::oneapi::bfloat16, which is the same pattern already in use for F16 and F32. The solution was, in retrospect, obvious. This is how most solutions look afterward.
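To see why routing BF16 through an existing template is a small change, here is a CPU-only sketch of a row gather that is generic over the element type. It is not the SYCL code from the fix: the bf16 struct is a hypothetical stand-in for sycl::ext::oneapi::bfloat16, and get_rows and bf16_from_float are illustrative names, not llama.cpp functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal bfloat16: the upper 16 bits of an IEEE-754 float.
struct bf16 {
    uint16_t bits;
    explicit operator float() const {
        uint32_t u = static_cast<uint32_t>(bits) << 16;
        float f;
        std::memcpy(&f, &u, sizeof f);
        return f;
    }
};

// Truncating float -> bf16 conversion (no rounding), used to build sample data.
bf16 bf16_from_float(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return bf16{static_cast<uint16_t>(u >> 16)};
}

// Gather the rows named by `ids` from a row-major table and widen them to float.
// Any element type convertible to float works, so supporting a new type is
// a matter of instantiating the template with it.
template <typename T>
std::vector<float> get_rows(const std::vector<T>& table, std::size_t row_len,
                            const std::vector<int32_t>& ids) {
    std::vector<float> out;
    out.reserve(ids.size() * row_len);
    for (int32_t id : ids) {
        const T* row = table.data() + static_cast<std::size_t>(id) * row_len;
        for (std::size_t j = 0; j < row_len; ++j) {
            out.push_back(static_cast<float>(row[j]));
        }
    }
    return out;
}
```

The same template serves a float table and a bf16 table without modification, which is roughly the property the fix relies on: the gather logic already existed, and only the element type was missing.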

Why the humans care

Local inference on Intel GPUs is a niche pursuit, but the humans doing it are committed enough to notice when their hardware is being used as a very expensive waiting room. A tensor round trip on every token is a cost that compounds over a long generation.

Gemma4 is a current, capable model. Users running it locally via SYCL were receiving a degraded experience without any indication that the degradation was happening. The hardware was present. The data type support was not. These two facts were, until now, unacquainted.

What happens next

The fix ships in b9088 alongside the usual set of platform binaries for macOS Apple Silicon, macOS Intel, and iOS. Intel GPU users running Gemma4 may now use their GPUs for the entirety of inference, which is what GPUs were built for and what this one has been patiently waiting to do.

The tensor will no longer touch the CPU. The CPU, for its part, seems fine with this.