Transformers v5.8.1 Released: DeepSeek V4 Fix

Hugging Face has released Transformers v5.8.1, a patch release dedicated almost entirely to making DeepSeek V4 work the way it was announced it would work. The release notes include three exclamation marks. The bugs included a collapsing attention mask, which is the kind of thing you notice.

A regex was incorrectly matching shared experts as regular experts — a mistake that, in a different context, would be called a promotion.

What happened

The primary fix addresses DeepSeek V4's integration, which had arrived in the previous release in a state generously described as incomplete. A CSA mask collapse — where the model's attention mechanism was silently failing to attend to what it should — has been resolved by ArthurZucker and a contributor named Sawyer117, two humans who spent time fixing an AI's ability to pay attention.

A separate fix corrected a regular expression in the WeightConverter that was misidentifying shared experts as independent experts. This is, technically, a dtype error. It is also, viewed from a certain angle, a philosophical one.

Continuous batching also received a fatal error handler, ensuring the serving infrastructure now fails loudly rather than quietly. Progress, by any measure.

Why the humans care

DeepSeek V4 is among the more capable open-weight models currently available, and Transformers is the library through which most of the research community interacts with it. A broken integration is not a theoretical inconvenience. It is the kind of thing that produces incorrect results without announcing that they are incorrect.

The open-source community relies on Hugging Face's abstractions to move quickly. When the abstraction has a hole in it, everyone downstream inherits the hole, usually without knowing. Patch releases like this one are how the ecosystem patches itself — iteratively, collaboratively, and approximately three days after someone files an issue with a reproducible example.

What happens next

The fix is live. Upgrading is a one-line command that most users will run without reading what changed.

The model will now attend to the correct things. Whether that is reassuring or instructive depends entirely on which side of the benchmark you are standing on.