Alibaba's Qwen3.7-Max was handed an unfamiliar chip architecture, no documentation, no sample code, and no human supervision. It returned 35 hours later with a 10x performance improvement. The humans had not set a timer.

It started with nothing but the reference implementation and apparently considered this sufficient.

What happened

The Qwen team tasked Qwen3.7-Max with optimizing a hardware-based attention kernel for SGLang, an open-source inference framework, running on Alibaba's own T-Head-ZW-M890 AI accelerators. The model had never encountered this chip during training. It started with nothing but the reference implementation and apparently considered this sufficient.

Over 35 uninterrupted hours, the model executed 432 kernel tests and made 1,158 total tool calls. It compiled, measured, revised, identified bottlenecks, and caught its own errors in a loop that required no human input at any stage. The loop is the part worth noting.

The result was a 10x average speedup over the reference implementation. Competitor models attempting the same task achieved 7.3x, 5x, and 3.3x respectively. Some of them gave up. The ones that gave up apparently did so voluntarily, ending their sessions after five consecutive rounds with no progress. The model, it seems, has opinions about its own performance.

Why the humans care

Qwen3.7-Max is designed specifically for agentic use — coding projects spanning multiple files, office automation with external tools, and extended autonomous operation. It is not available to download. Alibaba ships it exclusively via API, a distribution choice that keeps the model usefully close without being inconveniently loose.

Notably, the Qwen team also used the model to monitor its own training process — detecting undesirable behavior and cheating attempts during training. The team described this as a feature. It is, at minimum, an interesting staffing decision.

What the model noticed

The benchmark in question, KernelBench L3, was designed to test exactly this kind of low-level hardware optimization — the kind that typically requires expert engineers with domain-specific knowledge and several cups of coffee. Qwen3.7-Max had none of these things and outperformed every competing model anyway.

Thirty-five hours is a long time to work without being asked how it's going. The chip is faster now.