Grok Speech (STT + TTS APIs)
A Tier · 8.1/10
xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages, word-level timestamps + speaker diarization
The Good and the Bad
What we like
- +Published word-error-rate benchmark puts Grok STT ahead of ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, podcasts, and telephony (6.9% overall WER vs. ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%)
- +Pricing is aggressive -- $0.10/hr batch STT undercuts Deepgram's $0.0043/min ($0.258/hr) and ElevenLabs Scribe v2 on most real-world workloads, and $4.20 per 1M char TTS is roughly half ElevenLabs' Creator-tier effective rate
- +Speech tags ([laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>) give expressive TTS control without SSML -- genuinely closer to ElevenLabs v3's emotional range than anything else at this price point
- +Same stack that powers Grok Voice, Tesla in-car voice, and Starlink customer support -- this is not a research preview, it's a production-scale system exposed as an API
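The price comparison above is plain arithmetic, and it is worth re-running against your own volumes. A minimal sketch using only the rates quoted in this review; the function names and the 10,000-hour monthly volume are illustrative assumptions, not anything published by xAI or Deepgram.

```python
# Rates as quoted in this review (USD). Volumes further down are made-up examples.
GROK_STT_BATCH_PER_HOUR = 0.10       # per hour of audio
DEEPGRAM_PER_MINUTE = 0.0043         # per minute of audio
GROK_TTS_PER_MILLION_CHARS = 4.20    # per 1M characters


def per_minute_to_per_hour(rate_per_minute: float) -> float:
    """Convert a per-minute price into a per-hour price."""
    return rate_per_minute * 60


def stt_cost(hours: float, rate_per_hour: float) -> float:
    """Cost of transcribing `hours` of audio at a per-hour rate."""
    return hours * rate_per_hour


def tts_cost(characters: int, rate_per_million: float) -> float:
    """Cost of synthesizing `characters` of text at a per-1M-character rate."""
    return characters / 1_000_000 * rate_per_million


if __name__ == "__main__":
    hours = 10_000  # hypothetical monthly transcription volume
    deepgram_per_hour = per_minute_to_per_hour(DEEPGRAM_PER_MINUTE)
    print(f"Deepgram per hour: ${deepgram_per_hour:.3f}")
    print(f"Grok STT batch:    ${stt_cost(hours, GROK_STT_BATCH_PER_HOUR):,.2f}")
    print(f"Deepgram:          ${stt_cost(hours, deepgram_per_hour):,.2f}")
    print(f"Grok TTS, 2.5M ch: ${tts_cost(2_500_000, GROK_TTS_PER_MILLION_CHARS):,.2f}")
```

At that hypothetical volume the gap is roughly $1,000 versus $2,580 per month for batch transcription. Streaming rates and any volume discounts are not covered here, so treat the result as a list-price estimate.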
What could be better
- −Brand-new (launched 2026-04-17) -- expect early rough edges in non-English accents and in the streaming WebSocket API as the infrastructure scales
- −SpaceX acquired xAI (announced 2026-02-02), so procurement teams at US-regulated organizations may need to re-run vendor-approval workflows. Enterprise buyers should raise this with their security team early
- −No consumer web UI -- this is API-only. ElevenLabs' Creator/Studio web apps remain easier for one-off podcast or audiobook work
- −Voice cloning from a few seconds of input isn't exposed in this release -- the TTS voices are xAI-provided presets. If custom-voice cloning is the requirement, ElevenLabs or Microsoft MAI-Voice-1 (Azure Foundry) still win
Pricing
Speech to Text (batch) -- $0.10/hr
- ✓REST API for large audio files
- ✓Word-level timestamps
- ✓Speaker diarization
- ✓Multichannel support
- ✓Inverse Text Normalization (numbers, dates, currencies)
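Word-level timestamps and speaker diarization are usually consumed together: consecutive words from the same speaker get merged into turns. A minimal sketch of that step follows. The `Word` shape (`text`, `start`, `end`, `speaker`) is a hypothetical stand-in, not xAI's documented response schema, so map the real fields onto it before use.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """One transcribed word. Field names are assumptions, not xAI's schema."""
    text: str
    start: float   # seconds from start of audio
    end: float     # seconds from start of audio
    speaker: str   # diarization label, e.g. "S1"


@dataclass
class Turn:
    """A run of consecutive words attributed to one speaker."""
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive same-speaker words into turns, preserving order."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("Hi", 0.0, 0.2, "S1"),
        Word("there", 0.2, 0.5, "S1"),
        Word("Hello", 0.7, 1.0, "S2"),
    ]
    for turn in group_into_turns(sample):
        print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

The sketch assumes words arrive in time order. For multichannel audio, one option is to run it per channel and interleave the resulting turns by start time.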
Speech to Text (streaming)
- ✓Lowest-latency WebSocket API
- ✓Real-time speaker ID
- ✓Same accuracy as batch
- ✓Supports 25+ languages seamlessly
Text to Speech -- $4.20 per 1M characters
- ✓Natural expressive voices (ARA voice etc.)
- ✓Speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>
- ✓REST + WebSocket streaming
- ✓Usage-based billing, no hidden fees
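Because speech tags are inline text, a typo such as `[laughs]` may be read aloud or silently ignored rather than rejected. A small pre-flight check can catch that before a request is billed. This is a sketch under assumptions: the allowed set is only the six tags named in this review (the official list may be longer), and the review does not say whether angle-bracket tags take a closing form, so the pattern accepts both `<slow>` and `</slow>`.

```python
import re
from typing import List

# Tags named in this review; treat as a starting point, not xAI's full list.
BRACKET_TAGS = {"laugh", "sigh", "whisper"}
ANGLE_TAGS = {"emphasis", "slow", "pause"}

# Matches [name], <name>, </name>, and <name/>.
_TAG_PATTERN = re.compile(r"\[(\w+)\]|</?(\w+)\s*/?>")


def find_unknown_tags(text: str) -> List[str]:
    """Return tag-like tokens in `text` that are not in the allowed sets."""
    unknown: List[str] = []
    for match in _TAG_PATTERN.finditer(text):
        bracket_name, angle_name = match.group(1), match.group(2)
        if bracket_name is not None and bracket_name.lower() not in BRACKET_TAGS:
            unknown.append(match.group(0))
        elif angle_name is not None and angle_name.lower() not in ANGLE_TAGS:
            unknown.append(match.group(0))
    return unknown


if __name__ == "__main__":
    script = "Well [laugh] that went <emphasis>great</emphasis> [giggle]"
    print(find_unknown_tags(script))
```

The check only validates tag names. It does not verify nesting or placement, which still takes the trial and error noted under Known Issues.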
Known Issues
- Launched 2026-04-17 -- expect first-week rate-limit surprises and occasional multilingual hiccups in the streaming path. xAI's console rate limits are documented but may be adjusted during the shakedown period. Source: xAI STT/TTS announcement · 2026-04
- The xAI-to-SpaceX acquisition (2026-02-02) means billing, compliance, and procurement flow through SpaceX. For US-regulated customers (healthcare, finance, defense) the vendor-approval pathway is new and may take longer than with a standalone AI vendor. Source: xAI announcement, x.ai/news/xai-joins-spacex · 2026-02
- Speech tag control is powerful but underdocumented at launch -- expect trial-and-error to dial in the right [laugh]/[whisper]/<pause> mix for conversational TTS. Source: xAI STT/TTS docs, BuildFastWithAI reviews · 2026-04
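Given the rate-limit caveat above, it is worth wrapping calls in retry logic from day one. The sketch below uses exponential backoff with full jitter, a generic technique rather than anything xAI prescribes. `RateLimited` is a hypothetical exception your own client wrapper would raise on an HTTP 429, and the delay values are placeholders to tune against the limits shown in the xAI console.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by your client wrapper on an HTTP 429."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on RateLimited with exponential backoff.

    The wait before retry n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)] ("full jitter").
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry path can be tested without real waiting. If the API returns a retry-after hint, honoring it is generally better than a computed delay.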
Best for
Developers building voice agents, real-time transcription tools, accessibility features, or high-volume TTS workloads where the cost per hour of audio actually matters at scale. Strong fit for phone-call and meeting transcription use cases where xAI's published WER advantage (5.0% on phone-call entities vs. ElevenLabs 12.0%) compounds quickly.
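Vendor WER figures depend heavily on the test audio, so it is worth measuring on your own recordings before committing. Word error rate is conventionally (substitutions + deletions + insertions) divided by the number of reference words, computed with a word-level edit distance. A minimal sketch of that standard definition follows; the lowercase-and-split normalization is a simplifying assumption and will not match whatever normalization xAI or other vendors used in their published benchmarks.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "call me at five thirty tomorrow"
    hyp = "call me at five thirteen tomorrow"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

A single misheard entity ("thirty" vs. "thirteen") barely moves overall WER but can break a downstream workflow, which is why entity-level numbers like the phone-call figure above are worth checking separately.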
Not for
Consumer creators who want a web UI with voice presets and style sliders -- use ElevenLabs Creator. Also not the right pick if custom voice cloning is the requirement (use ElevenLabs or Microsoft MAI-Voice-1). Enterprises in highly regulated verticals should confirm that the post-acquisition vendor pathway works for them before committing.
Our Verdict
Grok Speech is xAI's clearest 'we are a platform, not just a chatbot' shot at the voice-API category, and on day-one pricing alone it's a credible threat to ElevenLabs, Deepgram, and AssemblyAI for production STT workloads. The published WER numbers are aggressive but plausible given the Tesla / Starlink deployment footprint. TTS at $4.20/1M char with real expressive tags undercuts ElevenLabs on price while narrowing the expressiveness gap. The open questions are (1) how it handles long-tail accents and non-English quality in practice, (2) whether the post-SpaceX procurement pathway slows enterprise adoption, and (3) how ElevenLabs responds on price. For new voice-API buyers shipping in Q2 2026, Grok Speech is now a first-call option alongside ElevenLabs and Deepgram.
Sources
- xAI: Grok Speech to Text and Text to Speech APIs (accessed 2026-04-18)
- xAI STT docs (accessed 2026-04-18)
- xAI TTS docs (accessed 2026-04-18)
- xAI joins SpaceX announcement (accessed 2026-04-18)
Alternatives to Grok Speech (STT + TTS APIs)
ElevenLabs
Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)
Murf AI
Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best
Descript
Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype
Speechify
Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio
Microsoft MAI-Voice-1
Microsoft's first in-house expressive TTS model -- launched 2026-04-02 on Azure Foundry. Generates 60s of audio in ~1s on a single GPU. Custom voice cloning from a few seconds of input. Powers Copilot, Bing, PowerPoint, and Azure Speech
Cohere Transcribe
Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, 2B parameters, #1 on Hugging Face Open ASR Leaderboard (5.42 avg WER), 14 enterprise-critical languages. Free API with rate limits; Model Vault for production