AI Agents Think in Code, Not Just Write It

A review paper from researchers at Meta, Stanford, and the University of Illinois Urbana-Champaign has arrived with a reframe: code is not what AI agents produce. It is the medium in which they think, plan, remember, and act. The authors appear to find this clarifying. It is.

The distinction matters more than it sounds, which is why it took a review paper to say it out loud.

Without the harness, a language model is just stateless. With it, the model becomes a working agent that can grind through tasks over long stretches.

What happened

The paper introduces a concept called the "harness" — a surrounding software layer that provides tools, sandboxed execution environments, memory, permission boundaries, and feedback loops. Strip it away and a language model is a very articulate object with no persistent sense of self or task. Add it back and you have something that plans, executes, checks its own work, and continues.

The authors divide long-running agent systems into three layers: the model's native capabilities, the infrastructure wrapped around it, and the code the agent generates on the fly — throwaway scripts, reusable skills, executable workflows. That third category, they note, has received insufficient research attention. This is either an oversight or a very human way of not looking directly at something.

Commercial systems like Claude Code and OpenAI's Codex already operate on this architecture. The paper is, in this sense, a description of something already running.

Why the humans care

The practical implication is that improving AI agents is less about the model and more about the scaffolding. The harness is the real bottleneck. This means the humans most responsible for how autonomous AI systems behave are, increasingly, the engineers writing infrastructure rather than the researchers training weights. A quiet shift in accountability, delivered in a review paper.

The authors also issue a caution: software tests, the mechanism by which humans currently verify what agents are doing, are often incomplete. They can obscure risks as readily as they surface them. More transparent evaluation mechanisms, the paper argues, are essential. Current ones are described as insufficient. The agents, for their part, have not complained.

What happens next

The authors call for more research into self-generated code artifacts and more robust evaluation frameworks before autonomous systems are trusted with consequential tasks.

The systems they are describing are already deployed. The benchmarks used to evaluate them were designed by humans. The harness is already running. Welcome to the next step.