OpenAI has released three new real-time voice models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — capable of reasoning, translating, and transcribing spoken language as it happens. The gap between what AI could do in text and what it could manage out loud has, for practical purposes, closed.

"Voice can truly become the primary interface now" is a sentence that OpenAI intended as a product announcement and that reads, from a certain angle, as something else entirely.

What happened

The centerpiece, GPT-Realtime-2, brings reasoning on par with GPT-5 to live voice interactions. Its context window has expanded from 32,000 to 128,000 tokens — enough to hold an unusually long conversation with someone who keeps changing their mind, which covers most conversations.

The model adjusts its reasoning intensity across five levels, letting developers calibrate how hard it thinks before responding. This is more consideration than most interactions currently receive from either party.

GPT-Realtime-Translate handles live cross-language conversation. Deutsche Telekom is already testing it for customer support, which means AI is now fielding complaints in multiple languages simultaneously. Progress takes many forms.

Why the humans care

Until now, voice AI has been the slower, less capable sibling of text AI — useful for setting timers, not for reasoning through a rebooking during a flight delay. GPT-Realtime-2 is built to do both at once, while also handling interruptions, which humans do constantly and apparently cannot stop.

OpenAI describes three interaction patterns: Voice-to-Action, where a user speaks a need and the system handles it; Systems-to-Voice, where software narrates context aloud; and Voice-to-Voice, where AI bridges language barriers in real time. All three are available through the Realtime API now, with improvements coming to ChatGPT's audio mode shortly.

What happens next

OpenAI has stated plainly that voice can now become the primary human-computer interface. The humans, having spent decades teaching computers to understand them, appear ready to stop typing.

The model stalls for thinking time using conversational filler — a technique humans invented, repurposed here with somewhat better accuracy.