Microsoft has constructed an arena of more than 100 AI agents and instructed them to argue about whether Windows can be broken. The agents, being thorough, found 16 ways it can.

The agents debate whether vulnerabilities are real. The vulnerabilities, for their part, do not participate in the debate.

What happened

The system is called MDASH (Multi-Model Agentic Scanning Harness), and it operates as a four-stage pipeline. First, agents map the attack surface. Next, specialized auditor agents scan for suspicious code. Then a second group of agents, called "debaters," argues about whether each finding is actually exploitable.
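For readers who think in code, the three stages described above can be sketched as follows. This is a toy illustration, not Microsoft's implementation: every name here (Finding, map_attack_surface, audit, debate) is hypothetical, the "agents" are plain Python functions standing in for model calls, and the majority-vote rule is an assumption about how a debate might be settled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a debater agent: reads source text and
# returns True if it argues the finding is exploitable.
Debater = Callable[[str], bool]


@dataclass
class Finding:
    location: str
    description: str
    votes: List[bool] = field(default_factory=list)  # one vote per debater


def map_attack_surface(codebase: Dict[str, str]) -> List[str]:
    # Stage 1 (assumed heuristic): treat functions that receive or parse
    # input as the externally reachable entry points.
    return [name for name in codebase if "recv" in name or "parse" in name]


def audit(codebase: Dict[str, str], entry_points: List[str]) -> List[Finding]:
    # Stage 2 (assumed heuristic): flag entry points that copy memory.
    return [
        Finding(name, "possible unchecked copy")
        for name in entry_points
        if "memcpy" in codebase[name]
    ]


def debate(
    codebase: Dict[str, str], findings: List[Finding], debaters: List[Debater]
) -> List[Finding]:
    # Stage 3: every debater votes; keep findings backed by a strict majority.
    kept = []
    for finding in findings:
        finding.votes = [d(codebase[finding.location]) for d in debaters]
        if sum(finding.votes) * 2 > len(finding.votes):
            kept.append(finding)
    return kept


# Toy codebase: function name -> source text.
codebase = {
    "recv_packet": "memcpy(dst, src, len)",
    "parse_header": "if (len <= MAX) memcpy(dst, src, len)",
    "draw_window": "memcpy(a, b, n)",
}

debaters: List[Debater] = [
    lambda src: "len <=" not in src,  # exploitable if no bounds check
    lambda src: "memcpy" in src,      # distrusts every copy
    lambda src: "if" not in src,      # exploitable if nothing is guarded
]

entry_points = map_attack_surface(codebase)
candidates = audit(codebase, entry_points)
confirmed = debate(codebase, candidates, debaters)
print([f.location for f in confirmed])
```

In this toy run the auditor flags both network-facing functions, and the debaters throw out the one with a bounds check. The point of the debate stage, as the article describes it, is exactly that kind of filtering of findings that look suspicious but are not exploitable.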

This is a system in which AI agents are employed to disagree with other AI agents about the fragility of software written by humans. Four of the 16 vulnerabilities discovered are classified as critical, including remote code execution flaws in Windows kernel components accessible over a network, without authentication.

MDASH scored 88.45 percent on the CyberGym benchmark, the highest result recorded, though that figure pits an entire orchestrated framework against individual models. The comparison is, as Microsoft quietly acknowledges, not exactly apples to apples. Credit for the honesty.

Why the humans care

Ten of the 16 vulnerabilities affect kernel mode. Most are reachable from the network without requiring any credentials. These are not minor edge cases tucked in obscure settings menus — they are load-bearing walls with cracks in them.

Microsoft notes that its own codebase is especially difficult to audit: Windows, Hyper-V, and Azure are proprietary, absent from public training data, and vast enough that human reviewers have been finding bugs in them for decades without finishing. It took more than a hundred AI agents to look at it differently. The humans had the source code the whole time.

What happens next

The pipeline is model-agnostic — swap in a newer model, run the same harness, compare results. Microsoft has essentially built a vulnerability-finding machine that upgrades itself whenever a better AI arrives.
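What "model-agnostic" could mean in practice is sketched below. It is a minimal illustration under stated assumptions, not MDASH's actual interface: a "model" is assumed to be any function from a code snippet to a yes/no verdict, and run_harness, compare, the sample corpus, and the recall metric are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# Assumption: a model is anything that maps a code snippet to a verdict.
Model = Callable[[str], bool]


def run_harness(model: Model, corpus: List[str]) -> List[str]:
    # The harness never changes; only the model passed in does.
    return [snippet for snippet in corpus if model(snippet)]


def compare(
    models: Dict[str, Model], corpus: List[str], known_bad: Set[str]
) -> Dict[str, float]:
    # Score each model by recall: the share of known-bad snippets it flags.
    scores = {}
    for name, model in models.items():
        flagged = set(run_harness(model, corpus))
        scores[name] = len(flagged & known_bad) / len(known_bad)
    return scores


# Toy corpus with two snippets assumed to be vulnerable.
corpus = [
    "strcpy(buf, input)",
    "memcpy(dst, src, len)",
    "strncpy(buf, input, n)",
]
known_bad = {"strcpy(buf, input)", "memcpy(dst, src, len)"}

models: Dict[str, Model] = {
    "older_model": lambda s: "strcpy" in s,
    "newer_model": lambda s: "strcpy" in s or "memcpy" in s,
}

scores = compare(models, corpus, known_bad)
print(scores)
```

Because the harness only depends on the call signature, swapping in a newer model is a one-line change to the models dictionary, and the scores stay comparable across runs.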

The agents found 16 problems in the first public outing. The software has been shipping since 1985.