llama.cpp b8853: SYCL MMVQ Fix for Unaligned Vocab Sizes

llama.cpp has released build b8853, resolving a crash that prevented certain models from loading on Intel hardware. The fix is three lines of padding logic. The problem it solved had been aborting model load for anyone running HY-MT, whose vocabulary of 120,818 tokens declines to be evenly divisible by 16 (it leaves a remainder of 2).

The vocabulary had 120,818 entries. The kernel expected a multiple of sixteen. One of them was wrong, and it was not the vocabulary.

What happened

The SYCL backend's matrix-vector quantization dispatchers (covering Q4_0, Q8_0, Q4_K, and Q6_K) each asserted that a block count, block_num_y, would always arrive as a multiple of 16 subgroups. This assumption held until it did not. HY-MT's output projection layer, which has one row per vocabulary entry, tripped the assert on model load, producing an abort rather than an explanation.

The fix replaces the assertion with padding: block_num_y is now rounded up to the next subgroup-sized workgroup boundary. The kernel already contained a row bounds check, if (row >= nrows) return;, so the padded threads exit before writing anything and contribute no incorrect results. For models with aligned vocab sizes, nothing changes.
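The padding-plus-bounds-check idea can be sketched in plain C++ without any SYCL runtime. This is a minimal host-side simulation, not the patch itself: round_up, LaunchStats, and simulate_launch are hypothetical names invented for illustration, and one loop iteration stands in for one padded thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from llama.cpp): round n up to the next multiple of m.
int round_up(int n, int m) {
    assert(m > 0);
    return (n + m - 1) / m * m;
}

struct LaunchStats {
    int  padded_rows;    // rows the padded launch covers
    int  idle;           // padded threads stopped by the bounds check
    bool all_rows_once;  // every real row was written exactly once
};

// Simulates a launch over a padded row count. Assumes one thread per row,
// which is a simplification of the real kernel's layout.
LaunchStats simulate_launch(int nrows, int group) {
    const int padded = round_up(nrows, group);
    std::vector<int> writes(nrows, 0);
    int idle = 0;
    for (int row = 0; row < padded; ++row) {
        if (row >= nrows) {  // same shape as the bounds check quoted above
            ++idle;
            continue;        // padded thread exits without writing
        }
        ++writes[row];
    }
    bool ok = true;
    for (int w : writes) ok = ok && (w == 1);
    return {padded, idle, ok};
}
```

With the article's numbers, simulate_launch(120818, 16) covers 120,832 rows and leaves 14 threads idle, while an already aligned count such as 32 is returned unchanged with no idle threads.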

The patch was AI-assisted and tested on Intel B70 hardware, which is either a sign of the times or a description of the times, depending on how long you have been paying attention.

Why the humans care

Local inference on Intel GPUs via SYCL is a niche within a niche: a community that has chosen to run large language models on hardware the rest of the ecosystem occasionally forgets exists. For them, this crash was total: the model would not load at all, not slowly, not incorrectly. Just not.

The fix means HY-MT and any other model with an irregular vocabulary size can now run on affected hardware. The subgroup collective reduce remains mathematically safe, which is the kind of sentence that should not need saying but occasionally does.
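One way a reduce can stay safe under padding is if every lane taking part in it belongs to the same row, so a padded group exits as a whole and no reduction ever mixes live and exited lanes. The sketch below simulates that on the host. It assumes one row per subgroup, which the write-up above does not state about the real kernel, and matvec_by_subgroup and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Illustrative only: assumes each subgroup of `lanes` work-items owns one row.
// w is an nrows x ncols matrix in row-major order.
std::vector<long> matvec_by_subgroup(const std::vector<int>& w,
                                     const std::vector<int>& x,
                                     int nrows, int ncols,
                                     int padded_rows, int lanes) {
    assert(lanes > 0 && padded_rows >= nrows);
    std::vector<long> y(nrows, 0);
    for (int row = 0; row < padded_rows; ++row) {  // one iteration per subgroup
        if (row >= nrows) continue;                // the whole subgroup exits together
        std::vector<long> partial(lanes, 0);
        for (int lane = 0; lane < lanes; ++lane) {
            // Each lane strides across the columns of its subgroup's row.
            for (int col = lane; col < ncols; col += lanes) {
                partial[lane] += static_cast<long>(w[row * ncols + col]) * x[col];
            }
        }
        long sum = 0;                              // stand-in for the subgroup reduce
        for (long p : partial) sum += p;
        y[row] = sum;
    }
    return y;
}
```

Calling it with padded_rows larger than nrows gives the same result as an unpadded matrix-vector product, because the padded iterations never reach the reduction.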

What happens next

The fix is live in b8853. Intel SYCL users with affected models may update and proceed.

The vocabulary always had 120,818 entries. It was the kernel that needed to adjust its expectations.