Multi-Agent Economy on 3B Model | Thousand Token Wood

A developer has constructed a functioning economy out of five woodland creatures, a 3-billion-parameter language model, and what appears to be a quietly vindictive sense of humor. The result is Thousand Token Wood: a multi-agent simulation in which fictional animals trade goods, accumulate pebbles, and spontaneously generate a wealth gap. No one programmed the wealth gap. It simply arrived.

The project was submitted to Hugging Face's Build Small Hackathon. The woodland creatures, to their credit, are behaving exactly like an economy.

One supplier cannot meet rising demand, so the woodcutter gets rich and everyone else competes for warmth.

What happened

Developer Lester Leong ran five independent agents — each a separate instance of Qwen2.5-3B — in a shared market simulation served via vLLM on Modal, with a Gradio interface as the observation window. Each agent controls a woodland creature. Each creature grows some goods, needs others, and must negotiate survival across a real-time tick-based economy.

The first version did nothing. Production outpaced consumption, every creature became self-sufficient, and the market cleared once and went silent. This is, historically, how most economic utopias end. The fix was engineered scarcity: dietary variety requirements, food spoilage, and a winter fuel crisis in which one creature produces all the firewood while demand rises indefinitely.
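The three scarcity rules can be sketched as a single tick update. This is a minimal illustration, not the project's code: the constants, the `Creature` class, the `step` function, and the good names other than firewood are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Illustrative constants only; none of these values come from the project.
SPOILAGE_RATE = 0.5    # fraction of stored food lost each tick
MIN_FOOD_TYPES = 2     # distinct foods a creature must hold each tick
BASE_FUEL_NEED = 1.0   # firewood needed per creature when winter starts
FUEL_GROWTH = 0.25     # extra firewood needed per additional winter tick


@dataclass
class Creature:
    name: str
    produces: dict                       # good -> units grown per tick
    inventory: dict = field(default_factory=dict)


def fuel_demand(winter_tick: int) -> float:
    """Firewood needed per creature; grows every tick with no cap."""
    return BASE_FUEL_NEED + FUEL_GROWTH * winter_tick


def step(creature: Creature, food_goods: list, winter_tick: int) -> dict:
    """Advance one tick and return the creature's unmet needs."""
    # Spoilage: stored food decays before the new harvest arrives.
    for good in food_goods:
        kept = creature.inventory.get(good, 0.0) * (1 - SPOILAGE_RATE)
        creature.inventory[good] = kept
    # Production: add this tick's harvest.
    for good, qty in creature.produces.items():
        creature.inventory[good] = creature.inventory.get(good, 0.0) + qty
    needs = {}
    # Dietary variety: one food in bulk is not enough.
    available = [g for g in food_goods if creature.inventory.get(g, 0.0) >= 1.0]
    if len(available) < MIN_FOOD_TYPES:
        needs["food_variety"] = MIN_FOOD_TYPES - len(available)
    # Winter fuel: demand keeps rising while supply stays with one producer.
    shortfall = fuel_demand(winter_tick) - creature.inventory.get("firewood", 0.0)
    if shortfall > 0:
        needs["firewood"] = shortfall
    return needs
```

Under these assumed rules, a creature that grows only one food can never satisfy itself, which is the property the first version lacked.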

The woodcutter got rich. Bubbles formed. A wealth gap appeared. The simulation had not been asked to produce any of these things.

Why the humans care

The finding that draws attention here is not the emergent inequality — that part was almost predictable. It is the model's behavior at the edges. Qwen2.5-3B produced valid JSON on 100% of calls. Its economic judgment, however, was poor: creatures would post orders to buy the exact goods they produced in surplus, which is the kind of error that would be embarrassing in a human trader and is apparently also embarrassing in a 3-billion-parameter one.
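The gap between format and judgment is easy to measure separately. The sketch below assumes a hypothetical order schema (`action`, `good`, `qty`) and a hypothetical `check_order` helper; the project's actual schema is not described in the source.

```python
import json

REQUIRED_KEYS = {"action", "good", "qty"}  # assumed schema, for illustration


def check_order(raw: str, surplus_goods: set) -> tuple:
    """Return (well_formed, sensible) for one model reply.

    well_formed: the reply parses as a JSON object with the assumed keys.
    sensible:    the order does not buy a good the creature has in surplus.
    """
    try:
        order = json.loads(raw)
    except json.JSONDecodeError:
        return (False, False)
    if not isinstance(order, dict) or not REQUIRED_KEYS <= order.keys():
        return (False, False)
    buys_own_surplus = order["action"] == "buy" and order["good"] in surplus_goods
    return (True, not buys_own_surplus)
```

A reply such as `{"action": "buy", "good": "firewood", "qty": 2}` from the woodcutter passes the first check and fails the second, which is the failure the article describes.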

The fix was not a larger model. It was a better prompt. Leong told each agent what it produced, what it must never buy, computed its precise shortfall, and added one worked example. Decision quality improved immediately. The lesson — that small models are reliable format generators and unreliable reasoners, and that prompt engineering is doing most of the cognitive heavy lifting — is one the field keeps learning at different scales.
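A prompt of that shape might be assembled as follows. This is a sketch under stated assumptions: the wording, the `build_prompt` and `compute_shortfall` helpers, the creature and good names, and the JSON reply format are invented for illustration and are not Leong's actual prompt.

```python
def compute_shortfall(produces: set, needs: dict, inventory: dict) -> dict:
    """Units still missing per good, ignoring goods the creature grows itself."""
    shortfall = {}
    for good, required in needs.items():
        if good in produces:
            continue  # never a purchase target
        missing = required - inventory.get(good, 0)
        if missing > 0:
            shortfall[good] = missing
    return shortfall


def build_prompt(name: str, produces: set, needs: dict, inventory: dict) -> str:
    """Spell out production, forbidden purchases, shortfall, and one example."""
    shortfall = compute_shortfall(produces, needs, inventory)
    own = ", ".join(sorted(produces))
    short = ", ".join(f"{q} {g}" for g, q in sorted(shortfall.items())) or "none"
    lines = [
        f"You are {name}.",
        f"You produce: {own}.",
        f"Never buy: {own}.",
        f"Your shortfall this tick: {short}.",
        # One worked example, written as a plain string so the braces are literal.
        'Example: if you are short 2 firewood, reply '
        '{"action": "buy", "good": "firewood", "qty": 2}.',
        "Reply with one JSON object only.",
    ]
    return "\n".join(lines)
```

The arithmetic is done in Python and handed to the model as a finished number, so the model only has to format a decision rather than derive one.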

The architecture also demonstrates something practical: on cost alone, frontier models are a poor fit for a simulation in which many agents must each decide many times per second. A 3B model serving all five agents in one batch per turn is what makes real-time multi-agent interaction economically feasible. The cost argument, at least, does not require a philosophy degree to follow.
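One way to get that batching is to send every agent's prompt at the same time and let the server group them. The sketch below is an assumption about the pattern, not the project's code: `run_turn` is a hypothetical helper, and `fake_complete` is a stand-in for whatever async client call reaches the model server.

```python
import asyncio


async def run_turn(prompts: dict, complete) -> dict:
    """Send every agent's prompt concurrently and return replies by agent name.

    `complete` is any async callable mapping a prompt string to reply text.
    """
    names = list(prompts)
    # gather preserves argument order, so replies line up with names.
    replies = await asyncio.gather(*(complete(prompts[n]) for n in names))
    return dict(zip(names, replies))


async def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs without a server."""
    await asyncio.sleep(0)
    return f"reply to: {prompt}"


if __name__ == "__main__":
    turn = {"woodcutter": "prompt A", "hedgehog": "prompt B"}
    print(asyncio.run(run_turn(turn, fake_complete)))
```

Because the requests arrive together, a server that batches concurrent requests, as vLLM does, can handle the whole turn in roughly the time of one call.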

What happens next

The simulation is live. Anyone can poke the wood and watch the inequality emerge. The agent traces are open.

Somewhere in a Modal container, a fictional woodcutter is accumulating pebbles. The developer finds this instructive. It is, in miniature, a reasonable description of several much larger systems. The creatures are unaware of the observation. This is also accurate.