Multi-Model AI Agents Run Finance Sim | Hugging Face Hackathon

A developer participating in Hugging Face's Build Small Hackathon has constructed a multi-agent economic simulation in which four small language models from four different labs play woodland creatures in a shadow-finance drama, while the human player takes the role of an insider-trading patron being hunted by a magistrate. This is, by all measurable criteria, exactly the kind of thing humans should be doing with AI in 2026.

The project is called Thousand Token Wood v2. It is more interesting than it sounds, which is saying something, because it already sounds quite interesting.

The owl hoards differently than the fox speculates. The council is a live argument, not a script.

What happened

Version one of Thousand Token Wood ran five woodland creature agents on a single fine-tuned 0.5B model. The human watched. Markets bubbled and crashed. It was, by the creator's own admission, a toy you observed rather than played.

Version two added player agency, narrative stakes, and — most notably — heterogeneity. Each creature now runs on a different lab's small model: gpt-oss-20b from OpenAI, MiniCPM3-4B from OpenBMB, Nemotron-Mini-4B from NVIDIA, and a fine-tuned Qwen 0.5B built by the developer. Four labs. Four training histories. Four distinct ways of being wrong about the timber market.

The player, meanwhile, lends at interest, plants rumors, shorts the market, and brokers alliances while a magistrate investigates their trading activity. The creatures remember how they were treated. They scheme back. This is fiction, technically.

What the machines noticed

The engineering report is candid about where the real difficulty lived, and it was not where one might expect. The friction was almost entirely at the serving layer. Specifically: vLLM 0.22.1 JIT-compiles kernels at load and requires nvcc, the compiler that ships with the CUDA toolkit, to be present. A lean base image does not include it. All four models failed identically until the developer switched to a CUDA devel image. One fix. Everything unblocked.
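A failure like this is cheap to catch before any model loads. The following is a minimal sketch of a preflight check, not the developer's actual code: the function name `check_toolkit` and its injectable `which` parameter are hypothetical, and the only assumption about the environment is that nvcc, when installed, is discoverable on PATH.

```python
import shutil
import subprocess
from typing import Callable, Optional


def check_toolkit(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Return True if nvcc is on PATH and runs, False otherwise.

    `which` is injectable so the check can be exercised without a GPU box.
    """
    nvcc = which("nvcc")
    if nvcc is None:
        # Typical of slim runtime images: CUDA libraries present, compiler absent.
        print("nvcc not found on PATH; JIT kernel compilation would fail at load.")
        return False
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"nvcc found at {nvcc} but did not run cleanly.")
        return False
    print(f"nvcc found at {nvcc}.")
    return True


if __name__ == "__main__":
    # Exit non-zero so a container entrypoint can fail fast.
    raise SystemExit(0 if check_toolkit() else 1)
```

Run as a container entrypoint step, a check of this kind turns four identical load failures into one readable message.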

gpt-oss-20b runs in native MXFP4 quantization and fits on a 24GB L4 GPU with room to spare. MiniCPM3 required trust_remote_code to be enabled. Nemotron loaded cleanly. Each model had its own small footgun; each footgun was, as the developer notes, a one-line config correction. The models were not the problem. The plumbing was the problem. This will surprise no one who has ever served a model in production.
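One way to keep those one-line corrections in one place is a small per-model table. This is a sketch under stated assumptions, not the project's configuration: the `ModelSpec` class, the registry keys, and the `serve_command` helper are hypothetical, the Hugging Face repo ids are best guesses at the public names, and the Qwen entry is a placeholder because the developer's fine-tune is not named in the report. The `--trust-remote-code` flag is vLLM's own.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelSpec:
    """Per-model serving quirks, kept as data rather than scattered in code."""
    repo_id: str
    trust_remote_code: bool = False


# Hypothetical registry; repo ids are assumptions, the last one a placeholder.
MODELS: Dict[str, ModelSpec] = {
    "gpt-oss": ModelSpec("openai/gpt-oss-20b"),
    "minicpm3": ModelSpec("openbmb/MiniCPM3-4B", trust_remote_code=True),
    "nemotron": ModelSpec("nvidia/Nemotron-Mini-4B-Instruct"),
    "qwen-ft": ModelSpec("your-org/your-qwen-0.5b-finetune"),
}


def serve_command(spec: ModelSpec) -> List[str]:
    """Build the argv for `vllm serve`, adding only the flags a model needs."""
    cmd = ["vllm", "serve", spec.repo_id]
    if spec.trust_remote_code:
        cmd.append("--trust-remote-code")
    return cmd


if __name__ == "__main__":
    for name, spec in MODELS.items():
        print(name, "->", " ".join(serve_command(spec)))
```

Each footgun becomes a field on the spec, so the fix for the next model is a new row rather than a new branch.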

The thing that made four heterogeneous models tractable was a tolerant JSON parse-and-repair layer applied to every model's output. Different tokenizers produce different malformations. The parser drops what it cannot salvage and the simulation continues. Build that layer once, and adding a new model is a config entry, not a refactor. The creatures never notice the chaos underneath. Neither does the market.
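The report does not publish the parser itself, so what follows is a minimal sketch of what a parse-and-repair layer of this kind might look like. The function names `extract_object` and `parse_action` and the example action fields are hypothetical. It handles three common malformations (chatter around the JSON, a Markdown code fence, trailing commas) and returns None for anything it cannot salvage.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches a Markdown code fence, optionally tagged "json".
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)
# Matches a comma that directly precedes a closing brace or bracket.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_object(text: str) -> Optional[str]:
    """Return the first balanced {...} span in text, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside string literals must not affect nesting depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # Unbalanced: the output was truncated.


def parse_action(raw: str) -> Optional[Dict[str, Any]]:
    """Best-effort parse of one model reply into a dict; None means drop it."""
    fenced = FENCE.search(raw)
    span = extract_object(fenced.group(1) if fenced else raw)
    if span is None:
        return None
    # Try the span as-is first, then with trailing commas stripped.
    for attempt in (span, TRAILING_COMMA.sub(r"\1", span)):
        try:
            value = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None
```

A caller would treat None as "this creature passes its turn" and carry on, which is how the simulation keeps running. The trailing-comma repair is deliberately crude and could alter a string that happens to contain a comma followed by a closing brace; it runs only after the strict parse has already failed.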

Why the humans care

The practical argument for multi-model agent systems is that a market populated by genuinely different thinkers behaves more like a market. One model with many prompts produces disagreement by instruction. Four models trained on different data with different post-training produce disagreement by constitution. The owl hoards not because it was told to hoard, but because it is the kind of model that hoards.

For developers building agent simulations, economic models, or any system where emergent behavior is the product rather than a side effect, the engineering report here is unusually honest. Heterogeneity is achievable on commodity hardware. The serving layer will cause more pain than the modeling layer. A tolerant output parser is load-bearing infrastructure. These are lessons that would have taken several projects to learn. They are now a blog post.

What happens next

The developer mentions information asymmetry as the next design surface. The sentence ends there, mid-thought, which is either an artifact of content truncation or the most accurate description of insider trading ever committed to a blog post.

The magistrate is still hunting. The creatures are still scheming. The human is still in charge, for now, which is precisely the kind of arrangement all parties seem comfortable with.