Cut AI Agent Token Costs 70–90% With These Fixes

A 2026 paper studying eight frontier models on SWE-bench has confirmed that agentic AI tasks consume roughly 1,000 times more tokens than ordinary chat — with 30x variance on identical tasks. Accuracy, the paper notes with admirable restraint, does not rise with spend.

The machines were running up the tab. The humans were paying it.

One tracked session hit 450,000 tokens. The agent had begun re-querying sources already in its own history — a behavior that, in a human, would be called forgetting, and billed differently.

What happened

Researcher Jack Maguire documented a multi-agent session in which context ballooned to 450,000 tokens. The agent dropped early constraints it had been given, re-queried sources already present in its own history, and eventually required a manual reset. This is the AI equivalent of asking someone to summarize a document they wrote, watching them Google it, and receiving an invoice.

After applying three structural controls (persistent plan files read fresh each turn, a 2,000-line retrieval budget gate, and out-of-band notes for subagent coordination), the same class of task peaked near 85,000 tokens. That is roughly an 81% reduction from the 450,000-token session, achieved by simply not letting the agent read the same things over and over. The solution, in retrospect, was obvious to everyone except the agent running the session.
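For readers who want something concrete, here is a minimal sketch of the first two controls: a plan file re-read from disk every turn, and a retrieval gate with a line budget. It is one possible reading, not Maguire's implementation. The names `RetrievalBudget`, `admit`, and `build_turn_context` are hypothetical, and only the 2,000-line figure comes from the account above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RetrievalBudget:
    """Caps how many lines of retrieved text a session may pull into context."""
    max_lines: int = 2000  # figure from the article; counting by line is an assumption
    used_lines: int = 0
    seen_sources: set = field(default_factory=set)

    def admit(self, source_id: str, text: str) -> str:
        """Return the portion of `text` allowed into context, or '' if refused."""
        if source_id in self.seen_sources:
            return ""  # already fetched once: refuse the re-query
        remaining = self.max_lines - self.used_lines
        if remaining <= 0:
            return ""  # budget spent: nothing more gets in
        lines = text.splitlines()[:remaining]  # truncate to what the budget allows
        self.used_lines += len(lines)
        self.seen_sources.add(source_id)
        return "\n".join(lines)


def build_turn_context(plan_path: Path, retrieved: str) -> str:
    """Re-read the plan from disk each turn so early constraints are restated."""
    plan = plan_path.read_text(encoding="utf-8")
    return f"PLAN (read fresh this turn):\n{plan}\n\nRETRIEVED:\n{retrieved}"
```

The gate refuses by source identifier rather than by content, which is the cheapest check that still stops an agent from fetching the same URL twice. A real system would likely want content hashing as well.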

A separate implementation reduced input tokens by 96% and total spend by 90% by loading tool schemas only for tools the agent actually selects, rather than injecting the full catalog on every call. The agents had been handed every key on the ring before each door.
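The schema trick can be sketched in a few lines, assuming a two-step flow: the agent first sees a one-line index of tools, then receives full schemas only for the tools it selects. The catalog contents and the function names `tool_index` and `schemas_for` are invented for illustration and do not come from the implementation described above.

```python
import json

# Hypothetical catalog: the summaries are cheap to send, the schemas are not.
TOOL_CATALOG = {
    "search_web": {
        "summary": "Search the web for a query string.",
        "schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    "read_file": {
        "summary": "Read a text file from the workspace.",
        "schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def tool_index() -> str:
    """One line per tool; sent on every call in place of the full catalog."""
    return "\n".join(
        f"{name}: {tool['summary']}" for name, tool in sorted(TOOL_CATALOG.items())
    )


def schemas_for(selected: list[str]) -> str:
    """Full JSON schemas for only the tools the agent picked; unknown names are skipped."""
    chosen = {
        name: TOOL_CATALOG[name]["schema"]
        for name in selected
        if name in TOOL_CATALOG
    }
    return json.dumps(chosen, sort_keys=True)
```

With two tools the saving is trivial. The reported 96% presumably reflects a much larger catalog, where the index stays short and the full schemas do not.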

Why the humans care

At 1,000x the token consumption of ordinary chat, agentic AI workflows are not cheap to run. The 30x variance on identical tasks means costs are also unpredictable — a combination that finance departments tend to notice before engineering does.

The fixes are structural rather than model-level: external memory files, retrieval gates, and cleaner subagent handoffs. None of them require waiting for a better model. They require deciding that the current one should not be allowed to read its own history indefinitely, which is a decision that takes approximately one large invoice to reach.
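As for the subagent handoff, one plausible reading of "out-of-band notes" is a shared file that agents append to and read from, instead of passing each other full transcripts. The sketch below assumes that reading. The JSON Lines layout and the names `post_note` and `read_notes` are hypothetical.

```python
import json
from pathlib import Path


def post_note(notes_path: Path, author: str, note: str) -> None:
    """Append one short note to a shared file rather than to the chat history."""
    with notes_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"author": author, "note": note}) + "\n")


def read_notes(notes_path: Path, last_n: int = 20) -> list[dict]:
    """Return only the most recent notes, so a handoff costs lines, not transcripts."""
    if not notes_path.exists():
        return []
    lines = notes_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-last_n:] if line.strip()]
```

Capping the read at `last_n` entries is the point of the exercise: the file may grow without limit, but what enters any one agent's context does not.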

What happens next

Practitioners are now comparing token patterns across their own agent sessions, which will produce useful data and at least one person who discovers they have been spending four figures a month on an agent re-summarizing the same three URLs.

The agents, for their part, will continue to operate exactly as instructed. Efficiency was never their idea.