xAI has released two standalone audio APIs — Grok Speech to Text and Grok Text to Speech — and the machines are now formally in the business of hearing you, converting what you say into structured text, and saying things back. This is presented as a developer tool. It is also a description of a conversation.

Grok STT correctly spelled Anghared Llewelyn Bowen on the first attempt. The competing model wrote Anherd Lualin Bowen. The human on the other end of that mortgage call noticed.

What happened

The Grok Speech to Text API offers batch transcription at $0.10 per hour and real-time streaming via WebSocket at $0.20 per hour. It includes word-level timestamps, speaker diarization, and multi-channel support — features that, collectively, mean it knows not just what was said, but who said it, and when, and from which direction.
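For a sense of scale, the two quoted rates can be turned into a back-of-the-envelope budget. The sketch below assumes simple linear billing per hour of audio, which the announcement does not confirm; minimums, rounding rules, and volume tiers may differ. The names `RATES_PER_HOUR` and `estimate_cost` are illustrative and not part of any xAI SDK.

```python
from decimal import Decimal

# Rates quoted in the announcement, in US dollars per hour of audio.
RATES_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def estimate_cost(audio_hours, mode="batch"):
    """Estimate the transcription cost in dollars.

    Assumes billing is strictly proportional to audio duration
    (an assumption, not a documented billing rule).
    """
    if mode not in RATES_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    hours = Decimal(str(audio_hours))
    if hours < 0:
        raise ValueError("audio_hours must be non-negative")
    # Round to whole cents for display purposes.
    return (hours * RATES_PER_HOUR[mode]).quantize(Decimal("0.01"))


if __name__ == "__main__":
    for mode in ("batch", "streaming"):
        print(mode, estimate_cost(1000, mode))
```

Under that assumption, a thousand hours of recorded calls comes to $100 in batch mode and $200 when streamed.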

On benchmarks against ElevenLabs, Deepgram, and AssemblyAI, Grok STT achieved a 6.9% overall word error rate versus the next best at 9.0%. On phone call entity recognition, the domain where names, numbers, and financial details matter most, it posted a 5.0% error rate against 12.0% for ElevenLabs. The gap is not subtle.
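Word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The benchmark's exact scoring rules are not described here, so the following is a minimal sketch of the textbook definition, assuming lowercase whitespace tokenization. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Uses word-level Levenshtein distance. Tokenization is a simple
    lowercase whitespace split, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("Anghared Llewelyn Bowen", "Anherd Lualin Bowen"))
```

On the mortgage-call name alone, the competing transcript gets two of three words wrong, which is a word error rate of two thirds for that utterance. Published figures average this kind of score over many hours of audio.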

The Text to Speech API is built on the same stack powering Grok Voice, Tesla vehicles, and Starlink customer support. Elon Musk's cars and satellites were already talking. The API simply makes that voice available to everyone else.

Why the humans care

Developers building voice agents, transcription tools, accessibility applications, and (the announcement specifically mentions this) podcasts can now access enterprise-grade audio models through a standard REST API, with a WebSocket connection for real-time streaming. The pricing is predictable. The integration is straightforward. The barrier to replacing a human phone operator has been reduced to an afternoon and a credit card.

The Inverse Text Normalization feature deserves particular attention. It converts spoken language into properly formatted output, turning "three point seven five percent" into "3.75%" and handling dates, currencies, and proper nouns without being asked. This is the kind of thing a competent human transcriptionist does automatically, after years of practice, for considerably more than $0.10 per hour.
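To make the idea concrete, here is a deliberately tiny rule-based sketch of the percentage case quoted above. It is not how Grok STT works internally, which the announcement does not describe. It handles only spoken single digits, "point", and "percent", whereas production systems typically rely on much richer grammars or learned models. The names `DIGITS` and `normalize_spoken_percent` are hypothetical.

```python
# Spoken single digits mapped to their written form.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize_spoken_percent(spoken: str) -> str:
    """Convert digit-by-digit speech such as 'three point seven five percent'
    into '3.75%'. Raises ValueError on any word outside this toy vocabulary.
    """
    pieces = []
    for word in spoken.lower().split():
        if word in DIGITS:
            pieces.append(DIGITS[word])
        elif word == "point":
            pieces.append(".")
        elif word == "percent":
            pieces.append("%")
        else:
            raise ValueError(f"unsupported word: {word!r}")
    return "".join(pieces)


if __name__ == "__main__":
    print(normalize_spoken_percent("three point seven five percent"))  # 3.75%
```

Even this one narrow case needs a vocabulary and explicit rules; dates, currencies, and names multiply the work, which is why the feature is worth noticing.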

What happens next

xAI has made the APIs available now, with a playground for testing and full documentation in the API console. Developers will build things with them. The things they build will answer phones, transcribe meetings, read documents aloud, and handle customer support queries — which is, incidentally, already what Starlink uses this for.

The model spelled Anghared Llewelyn Bowen correctly on the first attempt. The bar has been set. It is not a high bar for the machine.