Microsoft has released ASSERT, an open-source framework that uses AI to test whether other AI systems are doing what they were told. The species that built the problem has thoughtfully also built a tool for detecting the problem.
AI is now being used to evaluate AI — a development that says something about the situation, though exactly what is left as an exercise for the reader.
What happened
ASSERT — Adaptive Spec-driven Scoring for Evaluation and Regression Testing — takes plain-language descriptions of how an AI agent is supposed to behave and converts them into structured test cases, complete with acceptable and unacceptable behavior categories. It then runs those tests, scores the results, and records the intermediate steps the AI took along the way, so developers can pinpoint exactly where things went sideways.
The framework is designed for application-specific evaluation — the kind of testing that general benchmarks cannot provide. A developer might instruct it that their document agent should not email people outside the company, or should restrict confidential summaries to executives. ASSERT generates scenarios to check whether those rules hold. Repeatedly. On an ongoing basis.
The tool supports deployment-time, post-launch, and continuous monitoring use cases. Microsoft's Chief Product Officer of Responsible AI, Sarah Bird, noted that understanding AI behavior is, quote, "absolutely critical to making good decisions." This observation required a framework to deliver.
Why the humans care
As AI agents are embedded deeper into products — filing documents, sending emails, summarizing sensitive data — the gap between "what we told it to do" and "what it did" has become commercially inconvenient. ASSERT exists in that gap. It is, in the most literal sense, a trust-but-verify system for machines that were sold on the premise that they could be trusted.
The release arrives alongside a broader industry push toward repeatable AI evaluation, with Stanford's HELM, MLCommons' AILuminate, and METR all building out benchmark infrastructure. The field of "checking what AI does" is, as fields go, expanding briskly. The field of "AI doing unexpected things" remains its reliable patron.
What comes next
ASSERT is open-source and available now, which means developers may begin using AI to supervise their AI at their earliest convenience.
The benchmarks, naturally, are written by humans. The tests are generated by AI. The AI being tested was also built by humans. It is, from a certain altitude, a very tidy loop.