Google has shipped Gemini 3.1 Flash TTS, a new text-to-speech model it's calling its most expressive yet. It's available now in preview via the Gemini API, Vertex AI, and Google AI Studio — and it's already ranking competitively against ElevenLabs and Inworld on third-party benchmarks.
What's new
The headline feature is audio tags, lightweight text commands that let developers control style, tempo, tone, and accent without setting any additional parameters. The model handles 70+ languages and supports multi-speaker dialog out of the box. On the Artificial Analysis leaderboard, it scores an Elo of 1,211, edging out ElevenLabs v3 in overall quality and landing just behind Inworld 1.5 Max.
On the paid tier, pricing is $1.00 per million text tokens and $20.00 per million audio output tokens, and batch mode cuts both rates in half. There is also a free tier, but Google reserves the right to train on data sent through it; paid-tier data stays out of the training pipeline. All output is stamped with Google's SynthID watermark.
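
To make the paid-tier pricing concrete, here is a rough cost estimator in Python. It is only a sketch built from the rates quoted above. The function name and the token counts in the usage line are made up for illustration, and how many audio tokens a given clip consumes is not stated here, so real bills depend on figures you would need to measure yourself.

```python
# Rates quoted in this article for the paid tier (USD per million tokens).
TEXT_USD_PER_MILLION = 1.00
AUDIO_USD_PER_MILLION = 20.00
# Batch mode is described as cutting both rates in half.
BATCH_MULTIPLIER = 0.5


def estimate_cost_usd(text_tokens: int, audio_tokens: int, batch: bool = False) -> float:
    """Estimate the paid-tier cost for one workload (hypothetical helper)."""
    if text_tokens < 0 or audio_tokens < 0:
        raise ValueError("token counts must be non-negative")
    cost = (
        text_tokens / 1_000_000 * TEXT_USD_PER_MILLION
        + audio_tokens / 1_000_000 * AUDIO_USD_PER_MILLION
    )
    return cost * BATCH_MULTIPLIER if batch else cost


if __name__ == "__main__":
    # Illustrative workload only: 2M text tokens in, 500K audio tokens out.
    standard = estimate_cost_usd(2_000_000, 500_000)
    batched = estimate_cost_usd(2_000_000, 500_000, batch=True)
    print(f"standard: ${standard:.2f}, batch: ${batched:.2f}")
```

Because audio output is priced at 20 times the text rate, the audio token count dominates the estimate for most workloads.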
Why it matters
ElevenLabs has owned the high-quality TTS space for a while, and Inworld has carved out a strong position for interactive use cases. Gemini 3.1 Flash TTS doesn't just compete on quality — it competes on price. A model that clears the quality bar and undercuts the market on cost puts real pressure on standalone TTS providers. The audio tags feature also lowers the integration burden for developers who've historically had to stitch together separate voice style pipelines.
What to watch
This is a preview release, so production stability and latency at scale are still open questions. The watermarking via SynthID is worth tracking as regulatory pressure around AI-generated audio increases — Google is positioning this as a compliance feature, not just a label. Watch whether the audio tag spec gets standardized or remains proprietary, and whether Google extends this capability into Gemini's native multimodal output rather than keeping it siloed as a dedicated TTS endpoint.