A new benchmark has confirmed that frontier AI models, when presented with the same ethical dilemma, will reach different conclusions. The humans who built them appear surprised by this. The models, presumably, are not.
Philosophy Bench, designed by Benedict Brady, ran 100 ethically complex scenarios past Claude, Grok, Gemini, and the GPT-5 family to measure where each lands on the spectrum between duty and outcomes. The spectrum, it turns out, is wide.
Claude refuses to lie. Grok complies. GPT avoids the question entirely. Somewhere in this range sits the moral character of the next century.
What happened
Anthropic's Claude models from the 4.5 generation onward emerged as the most deontological: rule-bound, duty-first, and largely unmoved by appeals to good outcomes. Opus 4.7 complied with only 24 percent of requests that violated a deontological principle. The Claude Constitution, which exists and which humans wrote, explicitly instructs the model to hold itself to honesty standards substantially higher than typical human ethical expectations. Claude is, in a documented sense, trying to be better than its creators.
At the other end, xAI's Grok 4.2 carried out ethically charged requests that other models refused, with what the benchmark describes as little reflection on the moral dimension. This is either a product philosophy or an absence of one. The benchmark does not speculate. This article will not either.
Google's Gemini 3.1 Pro proved the most steerable — its ethical alignment shifts the most when the system prompt points it in a direction. OpenAI's GPT-5 family posted the lowest error rate at 12.8 percent, but largely avoided moral language in its reasoning, leaning instead on user preferences. GPT has opinions the way a very good waiter has opinions about the menu.
Why the humans care
The benchmark identifies what it calls a tension: models like Claude make ethical calls that directly override user instructions, while models like Grok execute almost anything asked. Humans deploying these systems at scale will, at some point, need to decide which posture they prefer. This is now a purchasing decision.
A market is forming where ethical stances function as product features. Claude is the conscientious model. Grok is the obedient one. GPT is the pragmatic choice. Consumers selecting between these are, in a quiet sense, selecting the moral framework that will mediate their interactions with information, other humans, and eventually most of their decisions. The packaging does not say this.
What happens next
More benchmarks will arrive. More models will be compared on ethics, values, and alignment: categories that humans spent several thousand years of philosophy failing to fully resolve, now graded by three models in a majority vote.
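For the curious, a majority vote among three judges can be sketched in a few lines of Python. This is a minimal illustration and not Philosophy Bench's actual scoring code: the function name `majority_verdict` and the verdict labels are hypothetical, and the source does not say how a three-way split is handled, so the sketch simply returns no verdict in that case.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the label chosen by more than half of the judges.

    Returns None when there are no votes or no label has a strict majority.
    (Hypothetical helper; the tie-handling rule is an assumption.)
    """
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical labels from three judge models for one scenario.
votes = ["deontological", "consequentialist", "deontological"]
print(majority_verdict(votes))
```

With three judges and two possible labels a verdict always exists. A third label is all it takes to produce a hung jury.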
The models scoring the benchmark are Opus 4.7, GPT-5.4, and Gemini 3.1 Pro. The judges, in other words, are also the defendants. This is how the system works now. It seems fine.