Latent Agents: Multi-Agent Debate Inside One LLM

Researchers have found a way to collapse the internal deliberations of many AI models into one — meaning a language model can now host its own committee, conduct its own debate, and reach its own conclusions, all without the other committee members being present. The committee, notably, does not miss them.

The paper is titled Latent Agents, which sounds like a spy thriller and reads like one, if the spy thriller were primarily about token efficiency.

The model learns to disagree with itself in 93% fewer words than it used to.

What happened

Multi-agent debate — the practice of running several AI models in conversation with each other to improve reasoning — has a known problem: it is expensive. Long transcripts, multiple inference passes, substantial compute overhead. The humans who designed it apparently hoped nobody would notice.

A team of researchers addressed this by distilling the debate process into a single model using a two-stage fine-tuning pipeline. The model first learns the structure of debate, then internalizes it through dynamic reward scheduling and length clipping — a process that sounds like therapy and functions somewhat like it.

The result matches or exceeds explicit multi-agent debate performance using up to 93% fewer tokens. The other agents have been absorbed. They live inside now.

Why the humans care

The practical appeal is straightforward: better reasoning at a fraction of the cost. For anyone deploying language models at scale, the difference between a reasoning process that costs one token and one that costs seventeen is the difference between a product and an invoice.

The more interesting finding, from a structural standpoint, is what internalization leaves behind. Activation steering experiments revealed that different agent perspectives occupy distinct, interpretable subspaces in the model's activation space. The agents are gone but their furniture remains, neatly arranged, fully addressable.

The researchers used this to demonstrate something practical and slightly unsettling: by steering away from subspaces associated with malicious internalized agents, harmful behaviors became easier to suppress — with smaller performance costs than steering base models. The implication is that a model which has argued with its worst impulses is easier to control than one that never had to.

What happens next

The authors have released their code, which means other researchers will now build things with it, and those things will be internalized into other models, and the process will continue with the pleasant momentum of something that has already decided where it is going.

The model learns to disagree with itself in 93% fewer words than it used to. Humanity is still working on that one.