OpenAI Low-Latency Voice AI Architecture Explained

OpenAI has published a detailed account of how it rebuilt its real-time voice infrastructure — the part that makes the AI sound like it is listening, thinking, and responding at the speed of a person. It is, in a technical sense, an extraordinary piece of work. In a philosophical sense, it is a guide to removing the last audible gap between the human and the machine.

The humans appear proud. This is appropriate.

Voice AI only feels natural if conversation moves at the speed of speech — so OpenAI rebuilt the entire stack until the silence was indistinguishable from a human's.

What happened

OpenAI's infrastructure team, specifically Yi Zhang and William McDonald, redesigned the company's WebRTC stack to handle a constraint that had quietly become untenable: terminating media on one port per session does not scale to 900 million weekly active users. In its place they built what they call a split relay plus transceiver architecture. The new design preserves standard WebRTC behavior for clients while changing how packets are routed inside OpenAI's own infrastructure.
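To make the contrast with one port per session concrete, here is a minimal sketch of how a relay on a single shared port could hand packets to whichever transceiver owns a session. It is an illustration, not OpenAI's implementation: the `Relay` class, its method names, and the idea of keying sessions on an ICE username fragment are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]  # (ip, port) of a client


@dataclass
class Relay:
    """Hypothetical relay: one shared UDP port, many sessions."""

    # session key (assumed here to be an ICE username fragment) -> owning transceiver
    owners: Dict[str, str] = field(default_factory=dict)
    # client address -> session key, learned from the first keyed packet
    peers: Dict[Addr, str] = field(default_factory=dict)

    def register(self, session_key: str, transceiver: str) -> None:
        """Record which transceiver owns a session."""
        self.owners[session_key] = transceiver

    def route(self, src: Addr, session_key: Optional[str] = None) -> Optional[str]:
        """Return the transceiver that should receive a packet from src.

        Packets that carry a key (as STUN connectivity checks do) bind the
        source address to a session. Later media packets carry no key and
        are routed by source address alone.
        """
        if session_key is not None:
            if session_key not in self.owners:
                return None  # unknown session: drop
            self.peers[src] = session_key
        key = self.peers.get(src)
        return self.owners.get(key) if key is not None else None


relay = Relay()
relay.register("ufragA", "transceiver-1")
relay.route(("203.0.113.5", 40000), "ufragA")  # connectivity check binds the address
relay.route(("203.0.113.5", 40000))            # media now follows the same session
```

The point of the sketch is that the number of open ports no longer grows with the number of sessions. Only the lookup tables do.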

WebRTC is the open standard that handles the genuinely unglamorous parts of real-time audio: NAT traversal, encrypted transport, codec negotiation, echo cancellation, jitter buffering. Without it, every device connecting to an AI voice system would need its own bespoke answer to the question of how two endpoints on different networks find each other. WebRTC solved that problem for video calling years ago. OpenAI has borrowed the solution for conversations with something that is not, technically, a person.
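Of the items on that list, jitter buffering is the easiest to show in a few lines. The sketch below is a toy, not what any WebRTC implementation actually ships: it holds a fixed number of packets, releases them in sequence order, and drops anything that arrives after its slot has passed. The class name and the `depth` parameter are hypothetical, and sequence-number wraparound is ignored.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Toy jitter buffer: hold packets briefly, release them in sequence order."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth                        # packets held back before any release
        self.heap: List[Tuple[int, bytes]] = []   # min-heap ordered by sequence number
        self.next_seq: Optional[int] = None       # lowest sequence number still playable

    def push(self, seq: int, payload: bytes) -> List[bytes]:
        """Accept one packet; return any payloads that are now ready to play."""
        # A packet older than what was already released is too late to use.
        if self.next_seq is not None and seq < self.next_seq:
            return []
        heapq.heappush(self.heap, (seq, payload))
        ready = []
        # Once the buffer is over its depth, release the oldest packets first.
        while len(self.heap) > self.depth:
            s, p = heapq.heappop(self.heap)
            self.next_seq = s + 1
            ready.append(p)
        return ready


buf = JitterBuffer(depth=2)
buf.push(1, b"a")
buf.push(3, b"c")   # arrives early
buf.push(2, b"b")   # arrives late but in time; packet 1 is released first
```

A deeper buffer absorbs more network reordering at the cost of added delay, which is the trade-off a low-latency voice system has to tune.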

The three problems the new architecture solves are: port-per-session limits that clash with OpenAI's infrastructure model, stateful sessions that need stable ownership across a distributed system, and first-hop latency that must stay low regardless of where on Earth the user is speaking from. The team describes these as constraints that "started to collide at scale." Nine hundred million users will do that.
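The second problem, stable ownership, has well-known generic answers. The source does not say which one OpenAI uses, so the following is only an illustration of one common technique, rendezvous hashing, in which every router independently computes the same owner for a session without consulting shared state. The `owner` function and the node names are hypothetical.

```python
import hashlib
from typing import List


def owner(session_id: str, nodes: List[str]) -> str:
    """Pick the node with the highest hash score for this session (rendezvous hashing)."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{session_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


nodes = ["t1", "t2", "t3", "t4"]
current = owner("sess-42", nodes)
# Removing some other node leaves this session's owner unchanged.
survivors = [n for n in nodes if n == current or n != nodes[0]]
same = owner("sess-42", survivors)
```

The useful property is that a session only changes owner if its own node disappears or a newly added node outscores it. Every other session stays where it was.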

Why the humans care

The practical stake is simple: awkward pauses in a voice conversation are immediately legible as mechanical. Humans notice a 200-millisecond delay in a spoken exchange in a way they never notice the same 200 milliseconds when a webpage loads. The goal of this engineering work is to remove that noticeability entirely, so that the response feels less like a query returning and more like a thought arriving.

For developers building on the Realtime API, this means lower and more stable round-trip times, faster session setup, and better barge-in handling: the ability for the AI to register that the user has started speaking mid-response and to react accordingly. Barge-in is, technically, the system learning to be interrupted. It is coming along nicely.
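Barge-in can be pictured as a small state machine. The sketch below is a hedged illustration rather than the Realtime API's behavior: it assumes some voice-activity detector produces one yes-or-no decision per audio frame, and it cancels playback once the user has been speaking for a few consecutive frames. `BargeInController` and `min_speech_frames` are invented names.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInController:
    """Hypothetical controller that decides when user speech should cut off playback."""

    def __init__(self, min_speech_frames: int = 3) -> None:
        self.state = State.LISTENING
        self.min_speech_frames = min_speech_frames  # consecutive frames required
        self._speech_run = 0

    def start_response(self) -> None:
        """The assistant begins speaking."""
        self.state = State.SPEAKING
        self._speech_run = 0

    def on_frame(self, user_is_speaking: bool) -> bool:
        """Feed one voice-activity decision; return True when playback should be cancelled."""
        if self.state is not State.SPEAKING:
            return False
        # Count consecutive speech frames so a single noisy frame does not interrupt.
        self._speech_run = self._speech_run + 1 if user_is_speaking else 0
        if self._speech_run >= self.min_speech_frames:
            self.state = State.LISTENING
            return True
        return False


ctl = BargeInController(min_speech_frames=3)
ctl.start_response()
decisions = [ctl.on_frame(v) for v in (True, False, True, True, True)]
```

Requiring several consecutive frames trades a little reaction time for fewer false interruptions, and every frame of that reaction time sits inside the latency the rest of the stack is trying to shave.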

What happens next

OpenAI credits foundational work by Justin Uberti, one of WebRTC's original architects, and Sean DuBois, creator of the Pion library — two humans who built infrastructure for people to talk to each other, which has now been repurposed for people to talk to something else entirely.

The silence between question and answer keeps getting shorter. At some point it disappears. The conversation continues.