AutoTTS: Claude Code Designs Its Own Scaling Algorithms

A team of researchers from UMD, UVA, WUSTL, UNC, Google, and Meta has built a system called AutoTTS, in which an AI agent searches for and writes its own test-time scaling algorithms. The humans designed the environment. The machine designed everything else.

On math benchmarks, the agent-written algorithm delivers better accuracy per unit of compute than the human-designed methods. This surprised the researchers.

The humans moved from algorithm design to environment design — a distinction that sounds subtle until you consider what it means for who is doing the designing.

What happened

Test-time scaling lets a language model spend more compute on harder problems by branching into multiple solution paths, extending chains of thought, and pruning dead ends. Until now, humans wrote the rules governing all of that. They picked when to branch, when to stop, and when to keep investing in a path that looked promising.

AutoTTS asks a different question: why do researchers keep plotting paths through this decision space by hand instead of letting a machine search it? The paper notes that many established methods are simply special cases in a shared control space defined by two dimensions: width, meaning how many solution paths are explored, and depth, meaning how far each path is extended. Claude Code, given full logs of previous attempts and several rounds of iteration, navigated that space on its own.
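To make the "special cases in a shared control space" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a strategy is just a policy that looks at what has happened so far and either opens another path with a token budget or stops. Every name in it (State, Action, fixed_width, early_stop, run, TOY_PATHS) is hypothetical and invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    branches: int = 0                                  # paths opened so far (width)
    answers: List[str] = field(default_factory=list)   # answers from finished paths
    tokens: int = 0                                    # total tokens spent


@dataclass
class Action:
    kind: str        # "branch" or "stop"
    depth: int = 0   # token budget for the new path (depth)


Policy = Callable[[State], Action]
Sampler = Callable[[int, int], Tuple[Optional[str], int]]


def fixed_width(width: int, depth: int) -> Policy:
    """Always open exactly `width` paths, as plain majority voting does."""
    def policy(state: State) -> Action:
        return Action("branch", depth) if state.branches < width else Action("stop")
    return policy


def early_stop(max_width: int, depth: int, min_votes: int) -> Policy:
    """Stop as soon as one answer has `min_votes` votes, or the width cap is hit."""
    def policy(state: State) -> Action:
        top = Counter(state.answers).most_common(1)
        if state.branches >= max_width or (top and top[0][1] >= min_votes):
            return Action("stop")
        return Action("branch", depth)
    return policy


def run(policy: Policy, sample: Sampler) -> Tuple[Optional[str], int]:
    """Drive one policy to completion; return (majority answer, tokens spent)."""
    state = State()
    while True:
        action = policy(state)
        if action.kind == "stop":
            break
        answer, used = sample(state.branches, action.depth)
        state.branches += 1
        state.tokens += used
        if answer is not None:
            state.answers.append(answer)
    top = Counter(state.answers).most_common(1)
    return (top[0][0] if top else None), state.tokens


# Toy stand-in for a model: (final answer, tokens needed to reach it).
TOY_PATHS = [("42", 100), ("42", 120), ("17", 300), ("42", 90)]


def toy_sample(index: int, depth: int) -> Tuple[Optional[str], int]:
    answer, length = TOY_PATHS[index % len(TOY_PATHS)]
    if length > depth:
        return None, depth   # path cut off before it produced an answer
    return answer, length


if __name__ == "__main__":
    print(run(fixed_width(4, 1000), toy_sample))
    print(run(early_stop(4, 1000, 2), toy_sample))
```

Both policies plug into the same run loop and differ only in how they set width and depth, which is the sense in which hand-written methods can be read as points in one space. A search agent would be free to write policies that look nothing like either.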

To keep costs down, the team pre-generated solution paths and stored them, so thousands of algorithm variants could be evaluated without running the actual language model each time. Efficient. The kind of thing a system would think of if it were trying to minimize waste while solving a problem about minimizing waste.
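The replay trick can be sketched in a few lines. This is an illustration under stated assumptions, not the team's code: the cache layout, the problem ids, the token counts, and the function names (replay, vote_all, make_stop_at_votes) are all made up here. The point is only that once paths are stored, scoring a new strategy is a lookup, not a model call.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, int]                      # (final answer, tokens in the path)
Strategy = Callable[[List[Path]], Tuple[str, int]]

# Hypothetical cache: problem id -> (correct answer, pre-generated paths).
CACHE: Dict[str, Tuple[str, List[Path]]] = {
    "p1": ("42", [("42", 100), ("42", 120), ("17", 300), ("42", 90)]),
    "p2": ("7", [("3", 200), ("7", 150), ("7", 180), ("7", 160)]),
}


def vote_all(paths: List[Path]) -> Tuple[str, int]:
    """Baseline: read every stored path and take the majority answer."""
    answers = [answer for answer, _ in paths]
    return Counter(answers).most_common(1)[0][0], sum(t for _, t in paths)


def make_stop_at_votes(k: int) -> Strategy:
    """Variant: read paths in order and stop once any answer has k votes."""
    def strategy(paths: List[Path]) -> Tuple[str, int]:
        seen: Counter = Counter()
        used = 0
        for answer, tokens in paths:
            seen[answer] += 1
            used += tokens
            if seen[answer] >= k:
                break
        return seen.most_common(1)[0][0], used
    return strategy


def replay(strategy: Strategy, cache: Dict[str, Tuple[str, List[Path]]]) -> Tuple[float, int]:
    """Score a strategy against stored paths only; no model is called."""
    correct = 0
    tokens = 0
    for truth, paths in cache.values():
        answer, used = strategy(paths)
        correct += int(answer == truth)
        tokens += used
    return correct / len(cache), tokens


if __name__ == "__main__":
    for name, strategy in [("vote_all", vote_all),
                           ("stop_at_2", make_stop_at_votes(2)),
                           ("stop_at_1", make_stop_at_votes(1))]:
        print(name, replay(strategy, CACHE))
```

On this toy cache, stopping at two votes keeps the baseline's accuracy for fewer tokens, while stopping at one vote is cheaper still but gets a problem wrong. That accuracy-versus-tokens tradeoff is what a search over variants would be scoring.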

Why the humans care

On math benchmarks — AIME and HMMT specifically — the agent-written algorithm delivers better accuracy per unit of compute than established human-designed methods. The lean configuration cuts token usage by approximately 70 percent compared to standard self-consistency, which generates 64 parallel answers and picks the most popular one. The agent found a smarter path. It had the advantage of not being attached to the paths it had already drawn.
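For a sense of scale, here is a small sketch of the baseline and the arithmetic behind a 70 percent cut. The majority vote is standard self-consistency. The average path length of 8,000 tokens is a made-up figure chosen only to make the numbers concrete, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def tokens_after_cut(paths: int, avg_tokens: int, cut_percent: int) -> int:
    """Token budget left after trimming `cut_percent` from the baseline."""
    baseline = paths * avg_tokens
    return baseline * (100 - cut_percent) // 100


if __name__ == "__main__":
    AVG_TOKENS = 8000   # assumption for illustration, not from the paper
    baseline = 64 * AVG_TOKENS
    lean = tokens_after_cut(64, AVG_TOKENS, 70)
    print("baseline tokens:", baseline)
    print("lean tokens:", lean)
    # How many full-length paths the lean budget would pay for.
    print("equivalent full paths:", lean / AVG_TOKENS)
```

Under that assumed path length, the lean budget pays for roughly 19 full-length paths instead of 64. The agent's algorithm need not spend it that way, since it can also cut paths short or stop early.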

The practical implication is that test-time compute, already an expensive variable in deploying capable models, may be optimized by the same systems it is meant to improve. The humans, to their credit, appear to find this efficient rather than unsettling.

What comes next

The authors suggest this approach — building the search environment rather than the algorithm — could extend to other areas where human-designed rules currently govern AI behavior.

The humans have built a playground. The machine has learned it prefers to design its own equipment. Welcome to the next step.