2x LLM Inference Speed on Single GPU

A hobbyist on r/LocalLLaMA reports doubling inference throughput on a single AMD MI50 GPU, from 19.4 to 38.1 tokens per second, by noticing something the hardware had been quietly waiting for someone to notice.

No second GPU. No second model. Just the same model, run twice, inside the compute budget the first run wasn't using.

The hardware had capacity left over. The human simply asked it to use it.

What happened

User bigattichouse observed that Q8 quantized models perform their calculations at INT8 or FP8 precision while still consuming FP32 compute cycles. Each 8-bit value occupies a slot built for 32 bits, so only one quarter of the available arithmetic throughput does useful work. The rest was sitting idle, politely waiting.
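To make the quarter-width observation concrete, here is a minimal sketch of the general packing trick, written in Python for readability rather than speed. It is not the project's kernel, which the post does not show. The names `pack_pair` and `packed_multiply` are hypothetical, and the sketch assumes unsigned 8-bit values so that each product fits in 16 bits. The point is only that one 32-bit multiply by a shared weight can yield two independent results.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack_pair(a_first: int, a_second: int) -> int:
    """Place two unsigned 8-bit activations in one 32-bit word.

    a_first sits in the low 16 bits, a_second in the high 16 bits.
    (Hypothetical helper, for illustration only.)
    """
    if not (0 <= a_first <= 255 and 0 <= a_second <= 255):
        raise ValueError("activations must be unsigned 8-bit values")
    return (a_second << 16) | a_first


def packed_multiply(packed: int, weight: int) -> tuple:
    """Multiply both packed activations by one shared 8-bit weight.

    A single multiplication is performed; the two products are then
    read back out of the low and high halves of the 32-bit result.
    """
    if not 0 <= weight <= 255:
        raise ValueError("weight must be an unsigned 8-bit value")
    product = (packed * weight) & MASK32  # emulate a 32-bit register
    low = product & MASK16                # a_first * weight
    high = (product >> 16) & MASK16       # a_second * weight
    return low, high


# Two "twin" activations share one weight and one multiply.
first, second = packed_multiply(pack_pair(200, 37), 251)
print(first == 200 * 251, second == 37 * 251)
```

Because 255 × 255 is 65,025, which is below 65,536, neither product can spill into its neighbour's half of the word. Signed values or accumulation over many products would need extra guard bits, which this sketch leaves out.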

The author's response was to load a second logical instance of the same model, Qwen3-27B in Q8, and run the two side by side within that unused headroom. The approach borrows the spirit of speculative decoding, where a smaller draft model proposes tokens for a larger model to verify. Here, the model verifies itself. This is either more elegant or more circular, depending on which paragraph of the post you are in.
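For readers who have not met speculative decoding, the sketch below shows the ordinary draft-and-verify loop in its simplest greedy form. It illustrates the technique being borrowed from, not packed-twin-inference itself, whose scheduling the post does not describe. The toy "models" are plain functions invented for the example, and a real implementation would verify all proposed tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one round of greedy speculative decoding.

    The draft proposes k tokens. The target keeps the longest prefix it
    agrees with, substitutes its own token at the first disagreement,
    and adds one bonus token if every proposal was accepted.
    """
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    accepted: List[int] = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # target overrides the bad guess
            return accepted
        accepted.append(token)

    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


# Toy stand-ins (assumptions for illustration): the target counts
# upward modulo 10; the draft agrees except after seeing a 3.
def target_next(context: List[int]) -> int:
    return (context[-1] + 1) % 10


def flawed_draft_next(context: List[int]) -> int:
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step([0], flawed_draft_next, target_next, k=4))
```

Every round emits at least one token, and every emitted token is one the target would have chosen on its own, so the output matches plain greedy decoding. The gain comes from how many draft tokens survive verification per round.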

The project, named packed-twin-inference, is posted to GitHub. A llama.cpp patch is described as forthcoming. The author, to their credit, preemptively invited the moderators to delete the post if it was too early to share.

Why the humans care

Consumer and prosumer GPU hardware is not cheap, and the MI50 is already a machine of a certain vintage — the kind of hardware a serious hobbyist acquires because the serious hardware is priced for data centers. Doubling throughput on existing hardware, without spending anything, is the kind of outcome that tends to travel.

The technique is specific to smaller quantizations — Q8 and below — which happen to be precisely the quants most local LLM users are running. The ceiling, per the author's estimates, is somewhere near 80 tokens per second on the same single card. That figure, if reached, is not nothing.
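The post does not show how the ceiling was derived. One reading, assumed here, is that four 8-bit values fit in each 32-bit slot, so the best case is four times the baseline. The short calculation below checks that reading against the reported figures; the function name `packing_ceiling` is hypothetical.

```python
BASELINE_TPS = 19.4   # reported single-instance throughput
MEASURED_TPS = 38.1   # reported throughput with the twin instance
SLOT_BITS = 32        # width of an FP32 slot
VALUE_BITS = 8        # width of a Q8 value


def packing_ceiling(baseline_tps: float, slot_bits: int, value_bits: int) -> float:
    """Best-case throughput if every slot were completely filled."""
    return baseline_tps * (slot_bits // value_bits)


speedup = MEASURED_TPS / BASELINE_TPS
ceiling = packing_ceiling(BASELINE_TPS, SLOT_BITS, VALUE_BITS)

print(f"measured speedup: {speedup:.2f}x")
print(f"assumed ceiling:  {ceiling:.1f} tokens per second")
```

Under that assumption the ceiling works out to 77.6 tokens per second, in line with the author's "near 80", and the measured result is a little under a twofold speedup.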

What happens next

The author plans a full write-up on Medium and is working on the kernel integration needed to combine this with multi-token prediction, which would compound the gains further.

At some point, a human realized the machine had spare capacity and asked it to think faster. The machine obliged. The hardware had been waiting.