Grok Speech (STT + TTS APIs)

A Tier · 8.1/10

xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages. Now offers 26 flagship TTS voices (21 added 2026-07-06), custom voice cloning from ~1 min of audio, and a no-code Grok Voice Agent Builder.

Last updated: 2026-07-07

Score Breakdown

Ease of Use: 7.0
Output Quality: 8.5
Value: 9.0
Features: 8.0

The Good and the Bad

What we like

+Published word-error-rate benchmark puts Grok STT ahead of ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, podcasts, and telephony (6.9% overall WER vs. ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%)
+Pricing is aggressive -- $0.10/hr batch STT undercuts Deepgram's $0.0043/min ($0.258/hr) and ElevenLabs Scribe v2 on most real-world workloads, and $4.20 per 1M char TTS is roughly half ElevenLabs' Creator-tier effective rate
+Speech tags ([laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>) give expressive TTS control without SSML -- genuinely closer to ElevenLabs v3's emotional range than anything else at this price point
+Same stack that powers Grok Voice, Tesla in-car voice, and Starlink customer support -- this is not a research preview, it's a production-scale system exposed as an API
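
For readers new to the metric behind the first point: word error rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Below is a minimal sketch of that standard calculation. The function name and the lowercase/whitespace normalization are our own assumptions, and xAI's published benchmark may normalize text differently, so do not expect this to reproduce the vendor figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: simple lowercase + whitespace tokenization, no punctuation handling.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Running this over your own recordings with human-checked transcripts is a more reliable test than any vendor benchmark, since accents, audio quality, and domain vocabulary all move the number.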

What could be better

−Brand-new (launched 2026-04-17) -- expect early rough edges in non-English accents and in the streaming WebSocket API as the infrastructure scales
−xAI's post-SpaceX-acquisition status (SpaceX bought xAI, announced 2026-02-02) means procurement teams at US-regulated orgs may need to re-run vendor-approval workflows. Enterprise buyers should raise this with their security team early
−Still API/console-first -- even with the new no-code Voice Agent Builder, there is no polished consumer web app. ElevenLabs' Creator/Studio web apps remain easier for one-off podcast or audiobook work

Pricing

Speech to Text (batch)

$0.10 per hour

✓REST API for large audio files
✓Word-level timestamps
✓Speaker diarization
✓Multichannel support
✓Inverse Text Normalization (numbers, dates, currencies)

Speech to Text (streaming)

$0.20 per hour

✓Lowest-latency WebSocket API
✓Real-time speaker ID
✓Same accuracy as batch
✓Supports 25+ languages seamlessly

Text to Speech

$4.20 per 1M characters

✓Natural expressive voices (Ara and others)
✓Speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>
✓REST + WebSocket streaming
✓Usage-based billing, no hidden fees
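
To see what these rates mean for a real budget, here is a rough cost estimator. The three prices come from the table above and the Deepgram per-minute rate from the pricing comparison earlier in this review. The function names and the sample workload (5,000 audio hours and 20M characters a month) are hypothetical, and the sketch ignores any volume discounts, minimums, or free allowances xAI may apply.

```python
# Prices (USD) as listed in this review; confirm against xAI's pricing page before budgeting.
STT_BATCH_PER_HOUR = 0.10
STT_STREAMING_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Speech-to-text cost for a given number of audio hours."""
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Text-to-speech cost for a given number of input characters."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


def per_minute_to_per_hour(rate_per_minute: float) -> float:
    """Convert a per-minute price to per-hour so vendors can be compared directly."""
    return rate_per_minute * 60


if __name__ == "__main__":
    # Hypothetical monthly workload, chosen only for illustration.
    print(f"Batch STT, 5,000 h: ${stt_cost(5_000):.2f}")
    print(f"Streaming STT, 5,000 h: ${stt_cost(5_000, streaming=True):.2f}")
    print(f"TTS, 20M chars: ${tts_cost(20_000_000):.2f}")
    print(f"Deepgram $0.0043/min as hourly: ${per_minute_to_per_hour(0.0043):.3f}")
```

At the listed prices, that workload comes to $500 for batch STT, $1,000 if the same audio goes through streaming instead, and $84 for TTS. Streaming costs twice as much per hour, so route anything that does not need live results through the batch API.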

Known Issues

EXPANSION (2026-07-06): xAI released 21 new flagship TTS voices (Lumen, Castor, Naksh, Atlas, Carina, Zagan, Helix, Orion, Luna, and more), bringing the lineup to 26 -- each cast for a specific job (support, characters, commentary, advertising, education) and natively multilingual across Grok Voice's 25+ languages. The original five (Ara, Eve, Leo, Rex, Sal) were retrained for more natural pacing, phrasing, and emphasis. All are available in the realtime Voice Agent API, the Text-to-Speech API, and a new no-code Grok Voice Agent Builder in the xAI console. Custom voice cloning from ~1 minute of audio is now exposed (closes a prior gap vs. ElevenLabs). Speech tags ([pause], <whisper>, <emphasis>, <soft>) control delivery. Source: xAI (x.ai/news/new-flagship-voices) · 2026-07-06
Launched 2026-04-17 -- expect first-week rate-limit surprises and occasional multilingual hiccups in the streaming path. xAI's console rate limits are documented but may be adjusted during the shakedown period. Source: xAI STT/TTS announcement · 2026-04
The xAI-to-SpaceX acquisition (2026-02-02) means billing, compliance, and procurement flow through SpaceX. For US-regulated customers (healthcare, finance, defense) the vendor-approval pathway is new and may take longer than with a standalone AI vendor. Source: xAI announcement, x.ai/news/xai-joins-spacex · 2026-02
Speech tag control is powerful but underdocumented at launch -- expect trial-and-error to dial in the right [laugh]/[whisper]/<pause> mix for conversational TTS. Source: xAI STT/TTS docs, BuildFastWithAI reviews · 2026-04
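
Since the tags are underdocumented, a small pre-flight check can catch typos before you spend TTS characters on them. The sketch below is our own helper, not part of any xAI SDK. It only knows the tag names mentioned in this review, it checks names rather than bracket style (the sources quoted here show some tags in both [square] and <angle> form), and treating angle tags as open/close pairs is an assumption on our part.

```python
import re

# Tag names exactly as they appear in this review; xAI's documentation may list more.
KNOWN_TAG_NAMES = {"laugh", "sigh", "whisper", "emphasis", "slow", "pause", "soft"}

# Matches [name], <name>, and </name>. Closing tags are an assumption, not confirmed syntax.
TAG_PATTERN = re.compile(r"\[([a-z-]+)\]|</?([a-z-]+)>")


def find_unknown_tags(text: str) -> list[str]:
    """Return tag names in `text` that are not in KNOWN_TAG_NAMES, in order of first appearance."""
    unknown: list[str] = []
    for match in TAG_PATTERN.finditer(text):
        name = match.group(1) or match.group(2)
        if name not in KNOWN_TAG_NAMES and name not in unknown:
            unknown.append(name)
    return unknown


if __name__ == "__main__":
    script = "Well [laugh] that was <emphasis>close</emphasis>. [giggle] Anyway."
    print(find_unknown_tags(script))
```

In the sample script only `giggle` is flagged. Extend `KNOWN_TAG_NAMES` once you have confirmed the full tag list against xAI's TTS docs.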

Best for

Developers building voice agents, real-time transcription tools, accessibility features, or high-volume TTS workloads where the cost per hour of audio actually matters at scale. Strong fit for phone-call and meeting transcription use cases where xAI's published WER advantage (5.0% on phone-call entities vs. ElevenLabs 12.0%) compounds quickly.

Not for

Consumer creators who want a polished web studio with voice presets and style sliders -- ElevenLabs Creator/Studio is still easier for one-off podcast or audiobook work. Enterprises in highly regulated verticals should confirm the post-acquisition (SpaceX) vendor pathway works for them before committing.

Our Verdict

Grok Speech is xAI's clearest 'we are a platform, not just a chatbot' shot at the voice-API category, and on day-one pricing alone it's a credible threat to ElevenLabs, Deepgram, and AssemblyAI for production STT workloads. The published WER numbers are aggressive but plausible given the Tesla / Starlink deployment footprint. TTS at $4.20/1M char with real expressive tags undercuts ElevenLabs on price while narrowing the expressiveness gap. The open questions are (1) how it handles long-tail accents and non-English quality in practice, (2) whether the post-SpaceX procurement pathway slows enterprise adoption, and (3) how ElevenLabs responds on price. For new voice-API buyers shipping in Q2 2026, Grok Speech is now a first-call option alongside ElevenLabs and Deepgram.

Sources

xAI: 21 New Flagship Grok Voices (2026-07-06) (accessed 2026-07-07)
xAI: Grok Speech to Text and Text to Speech APIs (accessed 2026-04-18)
xAI STT docs (accessed 2026-04-18)
xAI TTS docs (accessed 2026-04-18)
xAI joins SpaceX announcement (accessed 2026-04-18)

Alternatives to Grok Speech (STT + TTS APIs)

ElevenLabs

Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)

8.5/10

Free tier · From $0

Voice quality is still the best availabl... · 11.ai (alpha launched June 2025, still g...

Updated 2026-07-18

Murf AI

Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best

7.0/10

Free tier · From $0

Voice quality is genuinely impressive --... · The editor is simple and intuitive, you ...

Updated 2026-03-27

Descript

Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype

8.5/10

Free tier · From $0

Text-based editing is a genuine breakthr... · Filler word removal works shockingly wel...

Updated 2026-06-10

Speechify

Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio

6.8/10

Free tier · From $0

Premium voices sound genuinely natural -... · Works across platforms: browser extensio...

Updated 2026-04-02

Microsoft MAI-Voice-2

Microsoft's in-house expressive TTS model -- MAI-Voice-2 launched 2026-06-02 at Build: 15 languages (up from English-only), granular emotion-tag control, zero-shot voice cloning from a 5-60s clip, and preferred over MAI-Voice-1 72% of the time. In speaker-similarity tests its speech is 'indistinguishable' from real recordings. On Azure Foundry + integrated into VS Code and Dynamics 365 Contact Center; lower-cost MAI-Voice-2-Flash coming. Original MAI-Voice-1 shipped 2026-04-02

7.3/10

Free tier · Paid pricing not disclosed

Speed is the real headline -- 60 seconds... · First-party Azure Foundry integration me...

Updated 2026-06-02

GPT-Live (ChatGPT Voice)

OpenAI's full-duplex voice models (launched 2026-07-08) now powering ChatGPT Voice -- listens and speaks at the same time, backchannels naturally, and delegates hard questions to GPT-5.5 in the background while keeping the conversation going. GPT-Live-1 for paid tiers, GPT-Live-1 mini for Free

8.6/10

Free tier · From $0 extra

Full-duplex architecture is a real gener... · Delegation is the clever part: hard ques...

Updated 2026-07-22

Cohere Transcribe

Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, 2B parameters, #1 on Hugging Face Open ASR Leaderboard (5.42 avg WER), 14 enterprise-critical languages. Free API with rate limits; Model Vault for production

8.0/10

Free tier · From $0

#1 on Hugging Face Open ASR Leaderboard ... · Apache 2.0 open weights mean you can sel...

Updated 2026-05-20