Google has shipped Gemini 3.1 Flash TTS, a text-to-speech model that adds granular audio tags: natural-language commands that control vocal pace, style, and delivery mid-generation. It's available today in preview via the Gemini API, Google AI Studio, Vertex AI, and Google Vids for Workspace users.

What's new

The headline feature is audio tags: think of them as inline stage directions written into your prompt. Instead of hoping the model guesses the right tone, you tell it. Beyond that, Google claims improved baseline speech quality and broader language coverage, now over 70 languages. On the Artificial Analysis TTS leaderboard, which aggregates blind human preference votes, 3.1 Flash TTS earned an Elo score of 1,211 and was placed in the site's "most attractive quadrant" for balancing quality against cost. All output is watermarked with SynthID, Google's existing system for embedding imperceptible watermarks in AI-generated audio.
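To make the "inline stage directions" idea concrete, here is a minimal sketch of assembling a tagged script before sending it to a TTS model. The square-bracket tag syntax, the tag wording, and the `render_script` helper are all assumptions for illustration; the announcement does not specify the exact format, so check the official documentation before relying on it.

```python
from typing import Optional

# Hypothetical tag syntax: square brackets are an assumption for illustration,
# not something the announcement confirms.
def render_script(segments: list[tuple[Optional[str], str]]) -> str:
    """Join (direction, text) pairs into one prompt string with inline tags."""
    parts = []
    for direction, text in segments:
        text = text.strip()
        if direction:
            # Prefix the spoken text with its delivery direction.
            parts.append(f"[{direction.strip()}] {text}")
        else:
            # No direction given: leave the text untagged.
            parts.append(text)
    return " ".join(parts)


script = render_script([
    (None, "Welcome back to the show."),
    ("whispering", "Here is something we have not told anyone yet."),
    ("excited, faster pace", "The new release ships today!"),
])
print(script)
```

Keeping directions as data rather than hand-typed markup makes it easier to reuse the same script with different deliveries.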

Why it matters

Controllability has been the persistent weak point of TTS systems: you could get expressive prosody or you could get reliability, rarely both. Audio tags are a practical attempt to close that gap without requiring fine-tuning. The SynthID integration is worth noting too: as AI-generated voice becomes harder to distinguish from human speech, watermarking is increasingly the only audit trail available. Google AI Studio's export-settings workflow, meanwhile, gives developers a repeatable path to consistent voice personas across deployments, which has been a quiet pain point for production audio apps.

What to watch

Google is positioning this as a developer and enterprise play, but the Google Vids integration also puts it directly in front of Workspace end users. Whether audio tags and SynthID hold up under adversarial use, with users trying to strip out watermarks or generate misleading voices, will be the real test. Competitors including ElevenLabs, OpenAI's TTS endpoints, and Amazon Polly now have a concrete benchmark to race against: an Elo of 1,211 on Artificial Analysis is a number worth watching.
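For a rough sense of what an Elo figure implies, a rating gap can be converted into an expected head-to-head preference rate. The sketch below assumes the leaderboard follows the conventional logistic Elo formula with a 400-point scale, which the announcement does not confirm. The rival rating of 1,150 is invented purely for illustration and is not a real leaderboard entry.

```python
def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of blind votes won by A over B under standard Elo.

    Assumes the conventional logistic form with a 400-point scale;
    the leaderboard's actual methodology may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


flash_tts = 1211          # Elo reported in the article
hypothetical_rival = 1150  # made-up rating, for illustration only

share = expected_preference(flash_tts, hypothetical_rival)
print(f"Expected preference share: {share:.1%}")
```

Under these assumptions, a gap of a few dozen points amounts to a modest edge in pairwise votes rather than a decisive one, which is why small leaderboard movements are worth tracking.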