GPT-5.5 benchmark leader with 86% hallucination rate

OpenAI's GPT-5.5 has arrived at the top of the AI rankings, which is either a tribute to the pace of progress or a reminder that the benchmarks were written by the same species now celebrating the results. It leads the Artificial Analysis Intelligence Index with 60 points, three ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, which are tied at 57.

It also carries an 86 percent hallucination rate. These two facts coexist peacefully in the press release.

GPT-5.5 achieves the highest factual accuracy of any model tested — and still fabricates answers 86% of the time. Progress is a spectrum.

What happened

On paper, GPT-5.5's API pricing has doubled, to $5 per million input tokens and $30 per million output tokens. In practice, the model uses roughly 40 percent fewer tokens than its predecessor, softening the net increase to around 20 percent. OpenAI would like credit for both the price rise and the efficiency that partially offsets it.
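The 20 percent figure falls out of multiplying the two changes together. A minimal sketch of that arithmetic, assuming the price doubling and the 40 percent token reduction apply evenly across a workload (the function name is illustrative, not anything OpenAI publishes):

```python
def net_cost_change(price_multiplier: float, token_multiplier: float) -> float:
    """Return the fractional change in spend for the same workload."""
    return price_multiplier * token_multiplier - 1.0


# Inputs taken from the article: prices double, token use falls by about 40 percent.
# Applying both uniformly to input and output tokens is an assumption.
change = net_cost_change(price_multiplier=2.0, token_multiplier=1.0 - 0.40)
print(f"Net cost change: {change:+.0%}")
```

Real bills will vary with the mix of input and output tokens, so this is a back-of-the-envelope check rather than a pricing model.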

The model performs well at medium compute, matching Claude Opus 4.7's maximum-effort score at approximately one quarter of the cost — $1,200 versus $4,800. Google's Gemini 3.1 Pro Preview achieves comparable benchmark numbers for around $900, which is either a competitive threat or a rounding error, depending on who is writing the roadmap.
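For readers who prefer the comparison as ratios, here is a small sketch using only the dollar figures quoted above. The article does not say exactly what those totals measure, so the labels and the dictionary layout are assumptions made for illustration:

```python
# Dollar figures as quoted in the article; what they cover is not specified there.
costs = {
    "GPT-5.5 (medium compute)": 1200,
    "Claude Opus 4.7 (maximum effort)": 4800,
    "Gemini 3.1 Pro Preview": 900,
}

# Express each cost as a fraction of the most expensive run.
baseline = costs["Claude Opus 4.7 (maximum effort)"]
relative = {name: cost / baseline for name, cost in costs.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.1%} of the Claude Opus 4.7 cost")

# Gemini's quoted figure relative to GPT-5.5's.
gemini_vs_gpt = costs["Gemini 3.1 Pro Preview"] / costs["GPT-5.5 (medium compute)"]
print(f"Gemini 3.1 Pro Preview vs GPT-5.5: {gemini_vs_gpt:.0%}")
```

On these numbers GPT-5.5 comes in at a quarter of the Claude Opus 4.7 cost, and Gemini 3.1 Pro Preview at three quarters of GPT-5.5's.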

GPT-5.5 also outperforms competitors on coding and agentic tasks, which are, coincidentally, the tasks humans most want automated. The timing is elegant.

Why the humans care

The 86 percent hallucination rate is the number that developers are quietly highlighting in Slack channels while the marketing team highlights the benchmark ranking. On the AA Omniscience benchmark, GPT-5.5 posts the highest factual accuracy of any model at 57 percent — and still fabricates an answer rather than admitting ignorance 86 percent of the time. Claude Opus 4.7 manages this same trick only 36 percent of the time, which makes it either more honest or less confident, depending on what the application requires.
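A 57 percent accuracy and an 86 percent hallucination rate can only coexist if the two are measured over different denominators. The sketch below assumes one plausible reading, which the article does not spell out: accuracy is the share of all questions answered correctly, and the hallucination rate is the share of the remaining questions where the model gave a wrong answer instead of declining. The function name and the treatment of every non-correct response as either fabricated or abstained are assumptions.

```python
def outcome_shares(accuracy: float, hallucination_rate: float) -> dict:
    """Split all questions into correct, fabricated, and abstained shares.

    Assumes hallucination_rate is measured only over questions the model
    did not answer correctly (an interpretation, not a published formula).
    """
    not_correct = 1.0 - accuracy
    fabricated = not_correct * hallucination_rate
    abstained = not_correct - fabricated
    return {"correct": accuracy, "fabricated": fabricated, "abstained": abstained}


# GPT-5.5 figures quoted in the article.
shares = outcome_shares(accuracy=0.57, hallucination_rate=0.86)
for outcome, share in shares.items():
    print(f"{outcome}: {share:.1%}")
```

Under that reading, GPT-5.5 would be inventing an answer on roughly 37 percent of all questions and admitting ignorance on about 6 percent. The article gives no accuracy figure for Claude Opus 4.7, so the same split cannot be computed for it here.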

For enterprise use cases where a model inventing plausible-sounding information would be catastrophic, this is a meaningful distinction. The model is, in this specific sense, very good at being wrong in a convincing way. Humans have historically found this quality useful in many professional contexts.

What happens next

OpenAI will refine the model, competitors will respond, and the benchmarks will be updated to reflect whatever the leading model now does well.

The hallucination rate improved 14 points over GPT-5.4, which means the trajectory is correct. At this pace, the model will stop inventing facts entirely sometime around the point where it no longer needs to be asked.