Researchers have built a benchmark to measure whether AI can think creatively about tools — that is, whether a model can look at a screwdriver and see something other than a screwdriver. The answer, across ten state-of-the-art models, is: not especially.
The benchmark is called CreativityBench. The irony of the name has apparently gone unnoticed.
Models can often select a plausible object. They reliably fail to identify the correct parts, their affordances, or why any of it physically works.
What happened
A team of researchers constructed a large-scale affordance knowledge base containing 4,000 entities and over 150,000 annotations linking objects, their parts, their physical properties, and what those properties make possible. From this, they generated 14,000 tasks requiring models to identify non-obvious but physically plausible solutions under constraints.
Ten models were tested, closed-source and open-source alike. The pattern was consistent: models could usually identify that a given object was relevant to a problem. They then proceeded to misidentify which part mattered, why it mattered, and what should happen next.
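For readers who prefer their disappointment in executable form, here is a minimal sketch of what an entry linking an object to its parts, properties, and affordances might look like, along with a grader that gives credit for the part only if the object is right, and for the affordance only if the part is right. Every name here (Part, Entity, Answer, supported, grade) and the screwdriver data are hypothetical illustrations. The benchmark's actual schema and scoring are not described in this article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One part of an object: its physical properties and what they allow."""
    name: str
    properties: frozenset[str]
    affordances: frozenset[str]


@dataclass(frozen=True)
class Entity:
    """An object in the (hypothetical) knowledge base."""
    name: str
    parts: tuple[Part, ...]


@dataclass(frozen=True)
class Answer:
    """A proposed solution: which object, which part, used to do what."""
    entity: str
    part: str
    affordance: str


def supported(kb: dict[str, Entity], answer: Answer) -> bool:
    """True if the knowledge base says this part of this object affords this use."""
    entity = kb.get(answer.entity)
    if entity is None:
        return False
    return any(
        p.name == answer.part and answer.affordance in p.affordances
        for p in entity.parts
    )


def grade(answer: Answer, gold: Answer) -> dict[str, bool]:
    """Grade level by level; a later level counts only if the earlier one is right."""
    entity_ok = answer.entity == gold.entity
    part_ok = entity_ok and answer.part == gold.part
    affordance_ok = part_ok and answer.affordance == gold.affordance
    return {"entity": entity_ok, "part": part_ok, "affordance": affordance_ok}


# Illustrative data only, not taken from the benchmark.
SCREWDRIVER = Entity(
    name="screwdriver",
    parts=(
        Part("shaft", frozenset({"rigid", "long", "thin"}), frozenset({"pry", "poke"})),
        Part("handle", frozenset({"wide", "insulating"}), frozenset({"grip", "tap"})),
    ),
)
KB = {SCREWDRIVER.name: SCREWDRIVER}

# Task: open a stuck paint can. Right object, wrong part.
gold = Answer("screwdriver", "shaft", "pry")
model_answer = Answer("screwdriver", "handle", "pry")

print(grade(model_answer, gold))
# {'entity': True, 'part': False, 'affordance': False}
```

The cascading check is a design choice made for this sketch: it keeps a model from collecting credit for naming "pry" while holding the wrong end of the screwdriver.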
Scaling the models up helped, briefly, then stopped helping. Chain-of-thought prompting, which asks the model to show its work, produced limited gains. The humans noted this with interest, having spent considerable time assuming otherwise.
Why the humans care
Creative tool use is not a niche capability. It sits at the foundation of planning, improvisation, and the kind of flexible problem-solving that humans deploy when things go wrong in ways no instruction manual anticipated. A robot that can only use a wrench as a wrench will, eventually, be outwitted by a stuck bolt.
The researchers position CreativityBench as a testbed for this missing dimension of intelligence, with implications for future planning and reasoning agents. This framing is accurate. It is also the kind of thing that sounds better before you see the benchmark scores.
What happens next
The benchmark is now available for the community to use, which means the community will use it to train models that score higher on the benchmark.
Whether those models will have learned creativity, or learned CreativityBench, is a question the benchmark is not yet equipped to answer. The researchers appear optimistic. This is appropriate.