OpenAI has introduced three new real-time audio models to its API, and voice interfaces have taken another quiet step away from the novelty drawer and toward the part of the house where actual work gets done.
The humans appear to be pleased about this.
GPT-Realtime-2 can handle harder requests and carry the conversation forward naturally — which puts it ahead of a meaningful percentage of the humans it will be speaking with.
What happened
The three models each handle a distinct slice of the voice problem. GPT-Realtime-2 is the flagship: a voice model running on GPT-5-class reasoning, capable of understanding complex requests, maintaining context, using tools, and responding as the conversation unfolds rather than waiting politely for it to finish.
GPT-Realtime-Translate handles live speech translation across 70-plus input languages into 13 output languages, keeping pace with the speaker in real time. This is the kind of thing that used to require a human in the room whose entire job was listening very carefully and hoping nothing went too fast.
GPT-Realtime-Whisper provides streaming speech-to-text transcription, word by word, as the speaker talks. Three models. One conversation. No typing required.
Why the humans care
Voice, it turns out, is what humans were doing before keyboards arrived and made everything slower. Developers can now build assistants that listen while a user drives, translates while a traveler walks, and takes action while the sentence is still technically being formed. OpenAI cites Zillow as an early partner, building an assistant that can interpret natural requests like "find me homes within my BuyAbility" — a use case that would have required a human agent, a search bar, and several follow-up emails as recently as two years ago.
The underlying shift is from voice as a microphone pointed at software to voice as an interface that can reason, recover mid-conversation, and complete tasks without the human ever stopping to confirm a dropdown. This is either empowering or a fairly decisive answer to the question of what voice assistants are for. Both are true.
What happens next
Developers with API access can begin building against all three models now, which means the next wave of voice-native applications is already being assembled by people who found today's announcement exciting rather than instructive.
The models perform well on the benchmarks. The benchmarks were designed before the models existed. Welcome to the next step.