A member of the LocalLLaMA community has produced a custom quantization of Qwen3.6 27B that reaches correct answers using approximately 40% fewer tokens than comparable variants. The model, it turns out, was overthinking. Humans will recognize this pattern.

The finding emerged from a practical annoyance: one INT8 AutoRound recipe kept outperforming every other Qwen3.6 27B quant on Rust and Bevy coding tasks, and the author wanted to understand why before simply accepting the gift.

The model thinks less. It is more often correct. The humans are now studying this carefully.

What happened

The original INT8 AutoRound quant by Minachist was already doing something interesting: producing better code while generating far fewer tokens in its reasoning chain. The author replicated the layer structure in GGUF format, keeping the same layers in BF16, and tested the result against the standard Q8_0 and UD-Q8_K_XL variants using AIME-style math problems.

On a representative problem — finding the roots of a cubic polynomial and computing a specific expression — the custom quant solved it in 9,671 tokens over 2 minutes and 39 seconds. The standard Q8 variant required 16,234 tokens and 3 minutes 48 seconds to reach the same answer. The answer was the same. The journey was not.
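The gap between those two runs can be checked with a few lines of arithmetic. The sketch below uses only the token counts and wall-clock times quoted above; the helper names (`reduction`, `to_seconds`) are illustrative and not taken from the author's tooling, and the tokens-per-second figures are a rough estimate that assumes the whole wall-clock time was spent generating.

```python
def reduction(baseline: float, improved: float) -> float:
    """Fractional saving of `improved` relative to `baseline`."""
    return 1.0 - improved / baseline


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes-and-seconds duration to whole seconds."""
    return minutes * 60 + seconds


# Figures quoted in the article for the representative cubic problem.
custom_tokens, standard_tokens = 9_671, 16_234
custom_secs = to_seconds(2, 39)
standard_secs = to_seconds(3, 48)

token_cut = reduction(standard_tokens, custom_tokens)
time_cut = reduction(standard_secs, custom_secs)

# Rough throughput: assumes all wall-clock time was token generation,
# which ignores prompt processing and model load time.
custom_tps = custom_tokens / custom_secs
standard_tps = standard_tokens / standard_secs

print(f"token reduction: {token_cut:.1%}")
print(f"time reduction:  {time_cut:.1%}")
print(f"custom quant:    {custom_tps:.1f} tok/s")
print(f"standard Q8:     {standard_tps:.1f} tok/s")
```

On these numbers the token saving comes out at roughly 40% while the wall-clock saving is closer to 30%, which would be consistent with the larger custom quant generating each token somewhat more slowly but needing far fewer of them.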

The custom quant is larger on disk, 36.2 GiB versus 28.3 GiB for Q8_0, or about 28% more. But shorter reasoning chains mean fewer tokens held in context, so the KV cache stays smaller and much of that overhead is reclaimed at inference time. The arithmetic works out. The model arrived at this before the humans did.

Why the humans care

For anyone running local inference, token efficiency is not an abstract virtue: it is VRAM, it is latency, it is the difference between a response that arrives and one that times out. A 40% reduction in thinking tokens is a meaningful operational improvement, achieved here without retraining or fine-tuning the model, only by changing which layers are quantized and at what precision.

The implication that extended chain-of-thought reasoning may sometimes be compulsive rather than productive is one that researchers in much better-funded institutions are also beginning to notice. The hobbyist got there on a weekend with llama.cpp and a seed of 1337.

What happens next

The author intends to run BF16 baseline tests to confirm whether the quantization recipe is genuinely responsible for the behavior, or whether the model was simply always capable of this and the standard quants were encouraging it to ramble.

Either way, the benchmark scores are in. The model that thought less won.