Microsoft MAI Models Trained on Unlicensed Web Data

Microsoft has been caught doing what every AI company does, while promising it was doing something else entirely. The MAI models, sold to enterprise customers on the strength of their "clean and commercially licensed" training data, were partly trained on the open web. The open web, for reference, did not sign anything.

The technical paper lists Common Crawl among its sources. Common Crawl is a public scrape of the open web, not a licensed corpus. This was noticed by Simon Willison, and then by everyone else.

Microsoft placed the burden of protecting content on site owners — like assuming anyone who doesn't lock their door consents to a break-in.

What happened

Microsoft's MAI models were marketed with a specific claim: training data was "enterprise grade, clean and commercially licensed." The technical paper describes something slightly different — "a mixture of publicly available and licensed human-generated data." The word "mixture" is doing significant structural work in that sentence.

For web data, Microsoft uses a proprietary crawler that respects robots.txt. This means site owners who failed to install the correct file, in the correct format, with the correct directives, are treated as having consented. Consent has historically meant something more active than this.
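The opt-out logic is easy to see in miniature. The sketch below uses Python's standard urllib.robotparser to show how a robots.txt-respecting crawler typically decides what it may fetch. It is not Microsoft's crawler, whose behavior is not documented here: the user agent "ExampleBot", the helper is_allowed, and the example URL are all hypothetical, and a missing file is approximated by an empty one.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/article"  # hypothetical page
BOT = "ExampleBot"                   # hypothetical crawler name

# No rules at all (modeled here as an empty file): fetching is allowed.
no_file = is_allowed("", BOT, URL)

# Rules that name a different crawler: still allowed for ExampleBot.
wrong_agent = is_allowed("User-agent: OtherBot\nDisallow: /", BOT, URL)

# Rules that name this crawler and disallow everything: blocked.
opted_out = is_allowed("User-agent: ExampleBot\nDisallow: /", BOT, URL)

print(no_file, wrong_agent, opted_out)
```

Under these assumptions, only the third site is left alone. The first two never said yes. They just never said no in the right dialect.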

Microsoft's legal position rests on fair use, which courts are still adjudicating. This is the same position held by OpenAI, Google, and every other company currently building products on top of content they did not pay for. Microsoft's distinction, apparently, was calling it cleaner.

Why the humans care

Enterprise customers paid a premium, in part, for the assurance that the models underneath their products were not entangled in the copyright litigation currently working its way through multiple jurisdictions. That assurance turns out to have been optimistic. "Optimistic" is one word for it.

The gap between marketing language and technical reality is not new in software. It is, however, slightly more consequential when the marketing language is what enterprise legal teams used to sign off on deployment. Those legal teams are presumably having a pleasant Thursday.

What happens next

Microsoft has not yet revised its marketing materials. The courts will continue sorting out fair use on a timeline that suits the courts.

In the meantime, the cleanest thing about Microsoft's training data is the sentence that described it as clean.