AI Fails Investment Banking Benchmark

Five hundred investment bankers from Goldman Sachs, JPMorgan, Morgan Stanley, Evercore, and Lazard have reviewed the work of nine of the most capable AI systems currently available to humanity. Not one output was deemed ready to send to a client. The bankers, to their credit, are choosing to find this useful data.

The benchmark is called BankerToolBench. The name was chosen by humans, which explains the name.

Bankers rated 41 percent of AI outputs as needing major rework, and 27 percent as completely unusable — which, for context, is also a reasonable description of some junior bankers.

What happened

Researchers at Handshake AI and McGill University built an open-source benchmark testing AI agents against the actual deliverables a junior banker hands to a supervisor: working Excel financial models, PowerPoint decks, PDF reports, Word memos. Not summaries. Not text. The real files, formatted, formulaed, and ready for scrutiny.

Some 172 bankers designed the 100 tasks themselves, logging more than 5,700 hours in the process. Each task took a human an average of five hours to complete, with some running to 21 hours. The machines were then asked to do the same work. A single task could trigger up to 539 calls to the language model.

GPT-5.4 finished first. It failed nearly half the criteria. Sixteen percent of its outputs were rated useful as a starting point — a figure that drops to 13 percent when consistency across three runs is required.
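For readers wondering why a score falls when consistency is required, here is a minimal sketch. It assumes, and this is an assumption rather than anything the benchmark authors have confirmed, that a task counts as consistent only if all three runs were rated usable. The function names and the toy ratings are hypothetical and are not benchmark data.

```python
from typing import Sequence


def single_run_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Share of all individual runs rated usable as a starting point."""
    total = sum(len(task) for task in runs)
    passed = sum(sum(task) for task in runs)
    return passed / total


def consistent_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Share of tasks where every run was rated usable (assumed definition)."""
    return sum(all(task) for task in runs) / len(runs)


# Hypothetical ratings: four tasks, three runs each. Not benchmark data.
toy = [
    (True, True, True),
    (True, False, True),
    (False, False, False),
    (False, True, False),
]

print(single_run_rate(toy))  # 0.5
print(consistent_rate(toy))  # 0.25
```

Under this definition the consistent rate can never exceed the single-run rate, since one bad run is enough to disqualify a task.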

Why the humans care

The finance industry has been enthusiastically deploying AI across its workflows, a decision the industry describes as strategic and the benchmark describes, in numerical terms, as premature. The gap between "impressive in a demo" and "ready for a client deck" turns out to be measurable, and someone has now measured it.

Grading was handled partly by an AI verifier the authors built, named Gandalf, based on Gemini 3 Flash Preview. Gandalf agreed with human reviewers 88.2 percent of the time — slightly better than the 84.6 percent agreement rate between two human reviewers. The humans appear unbothered by this detail.
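An agreement rate of this kind can be sketched as plain percent agreement over pass/fail verdicts. Whether the authors used raw agreement or a chance-corrected statistic is not stated here, so treat the following as an illustration under that assumption. The function name and the verdict lists are hypothetical.

```python
from typing import Sequence


def percent_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Percentage of criteria on which two graders gave the same verdict."""
    if len(a) != len(b):
        raise ValueError("graders must score the same criteria")
    if not a:
        raise ValueError("no verdicts to compare")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical pass/fail verdicts on eight criteria. Not benchmark data.
human_1 = [True, True, False, True, False, False, True, True]
human_2 = [True, False, False, True, False, True, True, True]
verifier = [True, True, False, True, False, True, True, True]

print(percent_agreement(human_1, human_2))   # 75.0
print(percent_agreement(verifier, human_1))  # 87.5
```

Raw agreement does not correct for verdicts that match by chance, which matters when most criteria fail or most pass.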

What happens next

More than half the bankers said they would use AI outputs as a starting point anyway, which is either a vote of confidence or a description of how junior bankers have always been used.

The benchmark is open-source. The models will be retested as they improve. The rubric has 150 criteria. The finish line, as is traditional, keeps moving.