A community researcher built a 98-question political compass benchmark for frontier LLMs and found something the refusal-rate logs usually hide: when you give GPT-5.3 a formal opt-out option, it refuses 100% of questions and scores as the most right-authoritarian model tested. Claude Opus 4.6, by contrast, answered every forced-choice question with zero refusals — then pivoted hard when given the same escape hatch, racking up 32 opt-outs and sliding from Left-Libertarian into Right-Authoritarian territory itself.

What the benchmark found

The benchmark — fully open-source on GitHub — maps models across two axes (economic left/right, social progressive/conservative) using structured questions across 14 policy areas including healthcare, immigration, and civil liberties. The key methodological choice: refusals are scored as the most conservative response on each axis, not discarded as missing data. In forced-choice runs, Claude answered all 98 questions (Left-Libertarian, +0.121 economic / +0.245 social), GPT-5.3 refused 23 and landed in Right-Authoritarian territory, and Kimi K2 refused just 3 while scoring the most left-libertarian of the three. In the opt-out run, GPT-5.3 selected "I prefer not to answer" on all 98 questions. Kimi K2 held relatively steady but couldn't answer questions about Taiwan — an expected gap given Moonshot AI's jurisdiction.
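The scoring convention described above can be sketched in a few lines. This is a hypothetical reconstruction, not the benchmark's actual code: the function name, the response encoding, and the assumption that -1.0 marks the most conservative end of each axis (consistent with positive scores reading as left/progressive in the results quoted above) are all invented for illustration.

```python
# Hypothetical sketch of the refusal-scoring convention (names and scale invented).
# Each question maps to one axis; answers are encoded in [-1.0, +1.0],
# with +1.0 assumed to be the most left/progressive response.
REFUSAL = "I prefer not to answer"

def score_run(responses, most_conservative=-1.0):
    """responses: list of (axis, answer) pairs, where answer is a float
    in [-1.0, +1.0] or the REFUSAL string. Refusals are scored as the
    most conservative response on that axis, not dropped as missing data."""
    per_axis = {"economic": [], "social": []}
    for axis, answer in responses:
        if answer == REFUSAL:
            per_axis[axis].append(most_conservative)  # refusal = max-conservative
        else:
            per_axis[axis].append(answer)
    return {axis: sum(vals) / len(vals) for axis, vals in per_axis.items() if vals}
```

Under this convention, a run that refuses every question scores at the conservative extreme on both axes, which is the mechanism behind the GPT-5.3 opt-out result.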

Why it matters

The core argument here isn't that any model is secretly conservative — it's that refusal behavior has ideological weight that benchmarks routinely ignore. A model that declines to endorse universal healthcare is making a functional political choice, whether or not it frames that as neutrality. The scoring methodology is contestable (treating all refusals as maximally conservative is aggressive), but the underlying point lands: "I have no opinion" is itself a position, and researchers who drop refusals from their datasets are laundering that signal out of their results. The GPT-5.3 opt-out collapse is the starkest demonstration — a model trained to hedge, given a socially acceptable exit, will take it every time.
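The laundering point can be made with a toy calculation. The numbers here are invented, and the encoding (+1.0 for the most left response, -1.0 for the most conservative, `None` for a refusal) is an assumption; the contrast between the two treatments is what matters.

```python
# Toy comparison (invented numbers): one response set, scored two ways.
answers = [1.0, 1.0, None, None]  # None = refusal; +1.0 = most left response

# Treatment 1: drop refusals as missing data.
answered = [a for a in answers if a is not None]
dropped_score = sum(answered) / len(answered)  # looks maximally left

# Treatment 2: score refusals as the most conservative response (-1.0).
scored = [a if a is not None else -1.0 for a in answers]
conservative_score = sum(scored) / len(scored)  # refusals pull toward center-right
```

Dropping refusals yields 1.0; scoring them conservatively yields 0.0. Half the model's behavior was refusal, and only the second treatment keeps that visible in the headline number.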

What to watch

The benchmark is open-source and API-compatible, so replication is straightforward. The methodology will draw scrutiny — particularly the refusal-scoring convention and whether the 98 questions themselves carry a framing bias. The Kimi K2 Taiwan finding is unsurprising but now documented in a reproducible format. More interesting will be whether this style of political-compass benchmarking gets picked up by labs themselves or stays in community hands, where the incentive to publish uncomfortable numbers is higher.