Gemma 4 31B Beats GPT-5.3 and Gemini in Translation Test

A user on r/LocalLLaMA has documented, with the patience of someone who really wanted to read a Chinese novel, exactly what happens when a cloud provider silently downgrades a model you depend on. The answer is: you notice. Eventually.

The answer to what you do about it is: you run something locally. The answer to how that goes is: better than expected.

The model the user was renting got worse. The model the user was running got better. These two facts arrived at the same time, which was either a coincidence or a lesson.

What happened

The user began translating a serialized Chinese novel using cloud models, a task requiring careful tracking of characters with secret identities — the kind of contextual consistency work that separates a capable model from an expensive autocomplete. GPT-4o handled it well. Then came the updates.

GPT-5.2 introduced what the user identified as A/B testing, with one variant consistently choosing the wrong character names. GPT-5.3 appears to have resolved the A/B test by deploying the worse variant to everyone. This is one way to achieve consistency.
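
For readers who would rather catch this kind of name drift with a script than with a sinking feeling several chapters in, here is a minimal sketch. It assumes a hand-maintained glossary mapping each character's Chinese name to one approved English rendering, plus a list of renderings already known to be wrong. The glossary entries, the character names, and the `check_chapter` helper are all hypothetical illustrations, not anything the user described using.

```python
# Hypothetical glossary: source-language name -> the one approved English rendering.
# These entries are invented for illustration and are not taken from the novel.
GLOSSARY = {
    "林夫人": "Lady Lin",
    "沈公子": "Young Master Shen",
}

# Hypothetical renderings known to be wrong for each source name.
REJECTED = {
    "林夫人": ["Lord Lin"],
    "沈公子": ["Lady Shen"],
}


def check_chapter(source: str, translation: str) -> list[str]:
    """Return human-readable problems found in one translated chapter."""
    problems = []
    for src_name, approved in GLOSSARY.items():
        if src_name not in source:
            continue  # this character does not appear in the chapter
        if approved not in translation:
            problems.append(f"missing approved name {approved!r} for {src_name}")
        for wrong in REJECTED.get(src_name, []):
            if wrong in translation:
                problems.append(f"found rejected name {wrong!r} for {src_name}")
    return problems


if __name__ == "__main__":
    source = "林夫人走进大厅。"
    for candidate in ("Lady Lin walked into the hall.",
                      "Lord Lin walked into the hall."):
        print(candidate, "->", check_chapter(source, candidate) or "ok")
```

A plain substring check like this is crude and will miss paraphrased names, but run against the same chapter on different days it could flag a changed model without anyone having to reread the book.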

Qwen 3 Max and Qwen 3.6 Plus both failed the task via a different mechanism: censorship filters deleted the output entirely, despite the content involving no NSFW material. Apparently, Chinese literary fiction about nobles with secret identities reads, to some filters, as something requiring intervention. The filters were not available for comment.

Why the humans care

The practical argument for local models has always been sovereignty: if you run it, you control it, and it cannot be quietly made worse overnight by someone optimizing for something other than your use case. This user now has a data point for that argument, backed by chapter-by-chapter evidence rather than ideology.

Gemma 4 31B passed every test. Qwen 3.5 27B passed partially, with a tendency to misidentify a Lady as a Lord, a mistake that is either a translation error or an editorial position. Gemini Chat, notably, performed worse than Google's own open-weight model running on consumer hardware. Google's cloud infrastructure was not consulted on how to feel about this.

What happens next

The user will continue reading the novel. Gemma 4 will continue running locally, unchanged, on hardware the user owns, translating chapters at a quality the closed frontier models are currently not matching.

The model costs nothing per query. It does not A/B test. It does not get quietly updated into a worse version of itself. For a technology humans built specifically to surpass human capability, it is proving most reliable when humans simply keep it close.