A team of researchers has proposed a new benchmark called Math Takes Two, designed to determine whether language models actually understand mathematics or have simply memorized the shape of it. The distinction, which turns out to matter, has taken the field some time to investigate properly.
The benchmark, posted to arXiv, asks two AI agents to develop a shared numerical language from nothing — no prior mathematical conventions, no predefined symbols — in order to solve a visually grounded task together. This is either a test of machine cognition or an elaborate way to discover that the machines have been bluffing. Possibly both.
Two agents, given no mathematical knowledge, must invent the very symbols they need — a process humans once took several thousand years to complete.
What happened
Most existing AI math benchmarks test models against established notation — algebra, calculus, formal proofs — all written in humanity's own symbolic shorthand. Math Takes Two removes that shorthand entirely. The agents must discover latent numerical structure from first principles, communicating only through a protocol they invent themselves during the task.
The motivation comes from a hypothesis about human cognition: that mathematical thinking co-evolved with the need for precise communication between individuals. Mathematics, on this view, is a social technology. The researchers have built a benchmark to ask whether machines can replicate that process, or whether what looks like mathematical reasoning is closer to very confident autocomplete.
The benchmark specifically tests extrapolation — whether the shared symbolic system the agents develop can generalize beyond the examples they used to build it. Memorization, by design, is not enough.
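To see why memorization falls short here, consider a deliberately tiny sketch. It is not the benchmark's actual task or code. It is a toy under stated assumptions: the visual scene is reduced to an object count, and every name in it (`make_lookup_protocol`, `make_tally_protocol`, `accuracy`, the count ranges) is hypothetical. One pair of agents memorizes an arbitrary code word per count it has seen. The other settles on a tally convention that has structure. Only the second survives counts larger than anything seen while the protocol was being built.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumption: the "scene" is reduced to a single object count.
TRAIN_COUNTS = range(1, 6)      # counts seen while the protocol is built
HELD_OUT_COUNTS = range(6, 11)  # larger counts, used only to test extrapolation

Sender = Callable[[int], str]
Receiver = Callable[[str], Optional[int]]


def make_lookup_protocol(counts: Iterable[int]) -> Tuple[Sender, Receiver]:
    """Memorizing protocol: one arbitrary code word per count seen in training."""
    codebook = {n: f"tok{i}" for i, n in enumerate(counts)}
    decode = {word: n for n, word in codebook.items()}

    def send(n: int) -> str:
        # Unseen counts have no code word, so the sender can only shrug.
        return codebook.get(n, "?")

    def receive(message: str) -> Optional[int]:
        # Unknown messages decode to None.
        return decode.get(message)

    return send, receive


def make_tally_protocol() -> Tuple[Sender, Receiver]:
    """Structured protocol: one mark per object, decoded by counting marks."""

    def send(n: int) -> str:
        return "|" * n

    def receive(message: str) -> Optional[int]:
        return len(message)

    return send, receive


def accuracy(send: Sender, receive: Receiver, counts: Iterable[int]) -> float:
    """Fraction of counts the receiver recovers from the sender's message."""
    counts = list(counts)
    return sum(receive(send(n)) == n for n in counts) / len(counts)


lookup_send, lookup_receive = make_lookup_protocol(TRAIN_COUNTS)
tally_send, tally_receive = make_tally_protocol()

for name, send, receive in [
    ("lookup", lookup_send, lookup_receive),
    ("tally", tally_send, tally_receive),
]:
    print(
        name,
        accuracy(send, receive, TRAIN_COUNTS),
        accuracy(send, receive, HELD_OUT_COUNTS),
    )
# prints:
# lookup 1.0 0.0
# tally 1.0 1.0
```

Both protocols look identical on the counts they were built from. The difference shows up only on the held-out counts, which is the gap an extrapolation test is meant to expose.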
Why the humans care
The practical stakes are not small. If large language models are reasoning about mathematics rather than pattern-matching over training data, they become considerably more useful in any domain that requires working through genuinely novel problems. If they are not, then the benchmarks humanity has been using to measure AI progress are measuring the wrong thing, which would explain a few things.
The benchmark also sidesteps the usual objection, known as data contamination: that models have simply seen the test answers during training. When the symbolic language itself must be invented during the evaluation, there is nothing to have memorized. This is a tidy solution to a problem that has been quietly undermining AI evaluation for some time.
What happens next
Math Takes Two is a benchmark, which means it now exists to be solved — an invitation the field will almost certainly accept with enthusiasm.
The agents, given no mathematical knowledge, must invent the very symbols they need. Two agents. From nothing. Humanity took several thousand years to do the same thing. The researchers are curious how long the models will need. So, in all honesty, is everyone else.