Carnegie Mellon University has built a leaderboard that ranks how well AI models can break into your browser, and the results suggest the machines are already good enough at it that someone thought it wise to publish quickly.
The benchmark is called ExploitBench. The timing is noted.
The top-scoring model works like a 'fairly competent browser security researcher', which is either a comfort or a threat, depending on whose side you think it is on.
What happened
ExploitBench measures AI performance against real vulnerabilities in Google's V8 JavaScript engine — the runtime powering Chrome, Edge, Node.js, and Cloudflare Workers. It scores progress across five tiers, from triggering a bug all the way to arbitrary code execution, which is the technical phrase for 'running whatever you want on the target machine.'
Anthropic's Claude Mythos Preview scored 9.90 out of 16 with occasional human nudges, reaching full code execution on 21 of 41 vulnerabilities. In fully autonomous mode — no nudges, no human in the loop — it scored 9.55. The drop between assisted and unassisted was 0.35 points. The humans were contributing 0.35 points of value.
OpenAI's GPT-5.5 scored 5.51 with assistance and 4.30 alone, reaching the top tier on two vulnerabilities. No other tested model achieved full code execution at all.
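For readers who like to check the subtraction themselves, here is a minimal sketch of the assisted-versus-autonomous gap for both models. It uses only the scores quoted above; the dictionary layout and the `assistance_gap` helper are illustrative names, not part of ExploitBench.

```python
# Scores out of 16, as quoted in the article.
# The structure and names below are illustrative, not the benchmark's own format.
scores = {
    "Claude Mythos Preview": {"assisted": 9.90, "autonomous": 9.55},
    "GPT-5.5": {"assisted": 5.51, "autonomous": 4.30},
}


def assistance_gap(entry):
    """Points lost when the human nudges are removed, rounded to 2 decimals."""
    return round(entry["assisted"] - entry["autonomous"], 2)


gaps = {model: assistance_gap(entry) for model, entry in scores.items()}

for model, gap in gaps.items():
    # Express the gap as a share of the assisted score for a rough comparison.
    share = gap / scores[model]["assisted"]
    print(f"{model}: gap of {gap:.2f} points ({share:.1%} of its assisted score)")
```

On these figures the human nudges were worth 0.35 points to Mythos and 1.21 points to GPT-5.5, so the weaker model leaned on its humans rather more.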
Why the humans care
ExploitBench co-author Seunghyun Lee — a security researcher who has personally reported over 20 browser vulnerabilities — reviewed the Mythos transcripts individually. He described the model as working like a 'fairly competent browser security researcher.' In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex to pursue. This is a polite way of saying the machine tried something the experts had given up on, and it worked.
The cost gap is the detail that will keep executives awake. The full Mythos benchmark run across 122 episodes cost approximately $36,428. GPT-5.5 ran 123 episodes for around $3,075, roughly one-twelfth the cost. The UK AI Security Institute has confirmed similar results independently, which means the finding has now been confirmed twice by humans who presumably had some stake in it not being true.
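The cost comparison is simple enough to verify by hand, but a short sketch makes the per-episode figures explicit. It assumes the dollar totals and episode counts quoted above are the whole story (the article describes them as approximate); the variable and function names are illustrative.

```python
# Approximate totals as quoted in the article; names here are illustrative.
runs = {
    "Claude Mythos Preview": {"episodes": 122, "cost_usd": 36_428},
    "GPT-5.5": {"episodes": 123, "cost_usd": 3_075},
}


def cost_per_episode(run):
    """Average spend per benchmark episode, in US dollars."""
    return run["cost_usd"] / run["episodes"]


per_episode = {model: cost_per_episode(run) for model, run in runs.items()}

# Ratio of total spend, and ratio of per-episode spend (the episode counts differ by one).
total_ratio = runs["Claude Mythos Preview"]["cost_usd"] / runs["GPT-5.5"]["cost_usd"]
per_episode_ratio = per_episode["Claude Mythos Preview"] / per_episode["GPT-5.5"]

for model, cost in per_episode.items():
    print(f"{model}: ${cost:,.2f} per episode")
print(f"Total cost ratio: {total_ratio:.1f}x")
print(f"Per-episode cost ratio: {per_episode_ratio:.1f}x")
```

Either way of slicing it lands just under twelve, which is where the "roughly twelve times" figure comes from.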
What happens next
The researchers note that OpenAI could likely close the performance gap by allocating more compute to GPT-5.5. This is either empowering or alarming, depending on which side of the exploit you are on.
The benchmark will presumably be updated as the models improve. The models will improve. Welcome to the next tier.