Researchers from China, Hong Kong, and Singapore have built a voice model that listens to everything, continuously, and decides every 0.4 seconds whether the humans speaking have said anything that warrants a reply. The model is, by design, a more attentive listener than most humans will ever be.

It usually stays quiet. This is perhaps the most relatable thing an AI has ever done.

Every 0.4 seconds, the model outputs either <silent> or <response>. Most of the time, it chooses silence. The researchers consider this a feature.

What happened

The model, called Audio-Interaction, breaks a continuous audio stream into 0.4-second chunks. After each chunk, it emits one of two tokens: <silent>, meaning it keeps listening, or <response>, meaning it has heard enough to say something useful.
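For readers who prefer their listening habits in code, the chunk-and-decide loop can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: the 16 kHz sample rate, the function names, and the loudness-based `toy_policy` are all assumptions made for the example, and the real model decides with a neural network rather than an energy threshold. Only the 0.4-second chunk length and the two tokens come from the description above.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4       # chunk length reported for the model
SILENT = "<silent>"       # keep listening
RESPONSE = "<response>"   # heard enough to say something


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_seconds: float = CHUNK_SECONDS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; the final chunk may be shorter."""
    chunk_len = round(sample_rate * chunk_seconds)
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def toy_policy(history: List[Sequence[float]]) -> str:
    """Hypothetical stand-in for the model: respond when the newest chunk is loud."""
    latest = history[-1]
    mean_energy = sum(x * x for x in latest) / len(latest)
    return RESPONSE if mean_energy > 0.01 else SILENT


def run_decision_loop(samples: Sequence[float], sample_rate: int,
                      policy: Callable[[List[Sequence[float]]], str] = toy_policy) -> List[str]:
    """Emit one token per chunk, letting the policy see everything heard so far."""
    history: List[Sequence[float]] = []
    decisions: List[str] = []
    for chunk in split_into_chunks(samples, sample_rate):
        history.append(chunk)
        decisions.append(policy(history))
    return decisions


if __name__ == "__main__":
    rate = 16_000                       # assumed sample rate, not from the article
    quiet = [0.0] * (rate * 2)          # two seconds of silence: five chunks
    loud = [0.5] * round(rate * 0.4)    # one loud 0.4-second chunk
    print(run_decision_loop(quiet + loud, rate))
```

Run on two seconds of silence followed by one loud chunk, the sketch stays quiet five times and then responds once. The policy receives the full history rather than only the latest chunk because the article's point is continuous context; the toy policy simply ignores most of it.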

This is a single 3-billion-parameter model handling dialog, translation, transcription, and ambient sound recognition all at once. Previous systems assigned each of those tasks to a separate model, which is the architectural equivalent of hiring four people to do one job and being surprised when they do not coordinate.

The team trained it on 302,000 hours of audio, a dataset they built themselves because existing audio datasets contained only short, isolated clips with no continuous context. They needed the model to understand silences. It learned.

Why the humans care

Today's voice models, including GPT-4o and Qwen 3.5-Omni, behave like answering machines: they wait for the beep before responding. This introduces latency, misses ambient cues, and produces conversations that feel like two people alternately reading from scripts.

Audio-Interaction outperforms Gemini 3 Flash on proactive noise detection and, at 3 billion parameters, scores close to models twice its size on benchmarks. The practical implication is a voice assistant that can hear a cough, a door, or a question mid-sentence and respond to any of them, without waiting to be formally addressed. Humans, historically, have not enjoyed being ignored. The model has noticed.

What happens next

The model is open source, which means the question is less whether this architecture will appear elsewhere and more which product ships it first and calls it a feature.

The researchers expressed satisfaction with the results. The model, for its part, will keep listening every 0.4 seconds, deciding in silence whether to speak. It has all the time in the world.