Grok Speech (STT + TTS APIs)

A Tier · 8.1/10

xAI's standalone voice APIs, launched 2026-04-17 and built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. Pricing is $0.10/hr for batch STT and $4.20 per 1M characters for TTS, with 25+ languages, word-level timestamps, and speaker diarization.

Last updated: 2026-04-18

Score Breakdown

  • Ease of Use: 7.0
  • Output Quality: 8.5
  • Value: 9.0
  • Features: 8.0

The Good and the Bad

What we like

  • Published word-error-rate benchmark puts Grok STT ahead of ElevenLabs, Deepgram, and AssemblyAI across phone calls, meetings, podcasts, and telephony (6.9% overall WER vs. ElevenLabs 9.0%, Deepgram 11.0%, AssemblyAI 12.9%)
  • Pricing is aggressive -- $0.10/hr batch STT undercuts Deepgram's $0.0043/min ($0.258/hr) and ElevenLabs Scribe v2 on most real-world workloads, and $4.20 per 1M characters for TTS is roughly half ElevenLabs' Creator-tier effective rate
  • Speech tags ([laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>) give expressive TTS control without SSML -- genuinely closer to ElevenLabs v3's emotional range than anything else at this price point
  • Same stack that powers Grok Voice, Tesla in-car voice, and Starlink customer support -- this is not a research preview; it's a production-scale system exposed as an API
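
The per-hour comparison in the pricing bullet can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the list prices quoted in this review; the function names and the monthly volume are illustrative assumptions, and real bills may differ with volume discounts or plan minimums.

```python
# List prices as quoted in this review (USD). Verify against the vendors'
# pricing pages before budgeting; these are not pulled from any API.
GROK_BATCH_STT_PER_HOUR = 0.10
DEEPGRAM_STT_PER_MINUTE = 0.0043


def per_minute_to_per_hour(rate_per_minute: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return rate_per_minute * 60


def stt_cost(hours: float, rate_per_hour: float) -> float:
    """Cost in USD of transcribing `hours` of audio at a flat hourly rate."""
    return hours * rate_per_hour


deepgram_per_hour = per_minute_to_per_hour(DEEPGRAM_STT_PER_MINUTE)
hours = 1_000  # hypothetical monthly volume, chosen only for illustration

print(f"Deepgram:   ${deepgram_per_hour:.3f}/hr -> ${stt_cost(hours, deepgram_per_hour):,.2f}")
print(f"Grok batch: ${GROK_BATCH_STT_PER_HOUR:.3f}/hr -> ${stt_cost(hours, GROK_BATCH_STT_PER_HOUR):,.2f}")
```

At these list prices, 1,000 hours of batch audio comes to about $100 on Grok versus about $258 at the quoted Deepgram rate.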

What could be better

  • Brand-new (launched 2026-04-17) -- expect early rough edges in non-English accents and in the streaming WebSocket API as the infrastructure scales
  • xAI's post-SpaceX-acquisition status (SpaceX bought xAI, announced 2026-02-02) means procurement teams at US-regulated orgs may need to re-run vendor-approval workflows. If you're an enterprise buyer, raise this with your security team early
  • No consumer web UI -- this is API-only. ElevenLabs' Creator/Studio web apps remain easier for one-off podcast or audiobook work
  • Voice cloning from a few seconds of input isn't exposed in this release -- the TTS voices are xAI-provided presets. If custom-voice cloning is the requirement, ElevenLabs or Microsoft MAI-Voice-1 (Azure Foundry) still win

Pricing

Speech to Text (batch)

$0.10 per hour
  • REST API for large audio files
  • Word-level timestamps
  • Speaker diarization
  • Multichannel support
  • Inverse Text Normalization (numbers, dates, currencies)

Speech to Text (streaming)

$0.20 per hour
  • Lowest-latency WebSocket API
  • Real-time speaker ID
  • Same accuracy as batch
  • Supports 25+ languages seamlessly

Text to Speech

$4.20 per 1M characters
  • Natural expressive voices (ARA voice etc.)
  • Speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>
  • REST + WebSocket streaming
  • Usage-based billing, no hidden fees
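
To see how the three line items combine into a bill, here is a rough estimator. It is a sketch under stated assumptions: the rates are the ones listed in this pricing section, the `Workload` class and its field names are hypothetical, and it ignores any free tier, minimums, or taxes.

```python
from dataclasses import dataclass

# Rates as listed in this review's pricing section (USD).
BATCH_STT_PER_HOUR = 0.10
STREAMING_STT_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


@dataclass
class Workload:
    """A hypothetical monthly workload; field names are illustrative."""
    batch_hours: float = 0.0
    streaming_hours: float = 0.0
    tts_characters: int = 0


def estimate_monthly_cost(w: Workload) -> dict[str, float]:
    """Return a per-line-item cost breakdown plus a total, in USD."""
    breakdown = {
        "batch_stt": w.batch_hours * BATCH_STT_PER_HOUR,
        "streaming_stt": w.streaming_hours * STREAMING_STT_PER_HOUR,
        "tts": w.tts_characters / 1_000_000 * TTS_PER_MILLION_CHARS,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    example = Workload(batch_hours=500, streaming_hours=200, tts_characters=25_000_000)
    for item, cost in estimate_monthly_cost(example).items():
        print(f"{item:>14}: ${cost:,.2f}")
```

For the example workload (500 batch hours, 200 streaming hours, 25M TTS characters) the estimate is $50 + $40 + $105 = $195 per month.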

Known Issues

  • Launched 2026-04-17 -- expect first-week rate-limit surprises and occasional multilingual hiccups in the streaming path. xAI's console rate limits are documented but may be adjusted during the shakedown period. Source: xAI STT/TTS announcement · 2026-04
  • The xAI-to-SpaceX acquisition (2026-02-02) means billing, compliance, and procurement flow through SpaceX. For US-regulated customers (healthcare, finance, defense) the vendor-approval pathway is new and may take longer than with a standalone AI vendor. Source: xAI announcement, x.ai/news/xai-joins-spacex · 2026-02
  • Speech tag control is powerful but underdocumented at launch -- expect trial-and-error to dial in the right [laugh]/[whisper]/<pause> mix for conversational TTS. Source: xAI STT/TTS docs, BuildFastWithAI reviews · 2026-04
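
Since tag tuning involves trial and error, a small local check can catch typos before a script is sent for synthesis. The sketch below flags tags that are not among the six named in this review. It assumes that list is complete and that angle-bracket tags may appear in a closing form such as </emphasis>; neither is confirmed by the source, so treat the allow-list as a starting point and check the xAI TTS docs.

```python
import re

# Tag names as listed in this review. Whether this set is complete, and
# whether angle tags wrap text with a closing form, are assumptions.
BRACKET_TAGS = {"laugh", "sigh", "whisper"}
ANGLE_TAGS = {"emphasis", "slow", "pause"}

# Matches [name], <name>, or </name>, capturing the name in a named group.
TAG_RE = re.compile(r"\[(?P<bracket>[A-Za-z]+)\]|</?(?P<angle>[A-Za-z]+)>")


def find_unknown_tags(script: str) -> list[str]:
    """Return tags in `script` that are not in the allow-lists, in order of appearance."""
    unknown = []
    for match in TAG_RE.finditer(script):
        bracket_name = match.group("bracket")
        if bracket_name is not None:
            known = bracket_name.lower() in BRACKET_TAGS
        else:
            known = match.group("angle").lower() in ANGLE_TAGS
        if not known:
            unknown.append(match.group(0))
    return unknown


script = "Well [laugh] that was <emphasis>not</emphasis> expected. [giggle] <fast>Anyway."
print(find_unknown_tags(script))  # prints ['[giggle]', '<fast>']
```

The check is purely textual: it makes no API calls and says nothing about how a given tag will actually sound.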

Best for

Developers building voice agents, real-time transcription tools, accessibility features, or high-volume TTS workloads where the cost per hour of audio actually matters at scale. Strong fit for phone-call and meeting transcription use cases where xAI's published WER advantage (5.0% on phone-call entities vs. ElevenLabs 12.0%) compounds quickly.
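
To make "compounds quickly" concrete, the sketch below converts the two quoted phone-call entity error rates into expected error counts at volume. It treats each rate as a simple per-entity error probability, which is a simplification of how WER is measured, and the call volume and entities-per-call figures are hypothetical.

```python
# Error rates on phone-call entities as quoted in this review (percent).
GROK_ENTITY_ERROR_PCT = 5.0
ELEVENLABS_ENTITY_ERROR_PCT = 12.0


def expected_errors(items: int, error_rate_percent: float) -> float:
    """Expected number of misrecognized items at a given error rate."""
    return items * error_rate_percent / 100


calls = 10_000          # hypothetical monthly call volume
entities_per_call = 20  # hypothetical: names, dates, amounts, account numbers
total_entities = calls * entities_per_call

grok_errors = expected_errors(total_entities, GROK_ENTITY_ERROR_PCT)
elevenlabs_errors = expected_errors(total_entities, ELEVENLABS_ENTITY_ERROR_PCT)

print(f"Entities transcribed: {total_entities:,}")
print(f"Expected errors at 5.0%:  {grok_errors:,.0f}")
print(f"Expected errors at 12.0%: {elevenlabs_errors:,.0f}")
print(f"Difference: {elevenlabs_errors - grok_errors:,.0f}")
```

Under these assumptions, 200,000 entities yield roughly 10,000 expected errors at 5.0% versus 24,000 at 12.0%, a gap of 14,000 corrections per month.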

Not for

Consumer creators who want a web UI with voice presets and style sliders -- use ElevenLabs Creator. Also not the right pick if custom voice cloning is the requirement (use ElevenLabs or Microsoft MAI-Voice-1). Enterprises in highly regulated verticals should confirm the post-acquisition vendor pathway works for them before committing.

Our Verdict

Grok Speech is xAI's clearest 'we are a platform, not just a chatbot' shot at the voice-API category, and on day-one pricing alone it's a credible threat to ElevenLabs, Deepgram, and AssemblyAI for production STT workloads. The published WER numbers are aggressive but plausible given the Tesla / Starlink deployment footprint. TTS at $4.20/1M char with real expressive tags undercuts ElevenLabs on price while narrowing the expressiveness gap. The open questions are (1) how it handles long-tail accents and non-English quality in practice, (2) whether the post-SpaceX procurement pathway slows enterprise adoption, and (3) how ElevenLabs responds on price. For new voice-API buyers shipping in Q2 2026, Grok Speech is now a first-call option alongside ElevenLabs and Deepgram.

Sources

  • xAI: Grok Speech to Text and Text to Speech APIs (accessed 2026-04-18)
  • xAI STT docs (accessed 2026-04-18)
  • xAI TTS docs (accessed 2026-04-18)
  • xAI joins SpaceX announcement (accessed 2026-04-18)

Alternatives to Grok Speech (STT + TTS APIs)

ElevenLabs

Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)

A Tier · 8.5/10
Free tier · From $0
  • Voice quality is still the best availabl...
  • 11.ai alpha (March 2026) is the first se...
Updated 2026-04-16

Murf AI

Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best

B Tier · 7.0/10
Free tier · From $0
  • Voice quality is genuinely impressive --...
  • The editor is simple and intuitive, you ...
Updated 2026-03-27

Descript

Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype

A Tier · 8.5/10
Free tier · From $0
  • Text-based editing is a genuine breakthr...
  • Filler word removal works shockingly wel...
Updated 2026-03-27

Speechify

Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio

C Tier · 6.8/10
Free tier · From $0
  • Premium voices sound genuinely natural -...
  • Works across platforms: browser extensio...
Updated 2026-04-02

Microsoft MAI-Voice-1

Microsoft's first in-house expressive TTS model -- launched 2026-04-02 on Azure Foundry. Generates 60s of audio in ~1s on a single GPU. Custom voice cloning from a few seconds of input. Powers Copilot, Bing, PowerPoint, and Azure Speech

B Tier · 7.3/10
Free tier · From $22
  • Speed is the real headline -- 60 seconds...
  • First-party Azure Foundry integration me...
Updated 2026-04-17

Cohere Transcribe

Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, 2B parameters, #1 on Hugging Face Open ASR Leaderboard (5.42 avg WER), 14 enterprise-critical languages. Free API with rate limits; Model Vault for production

A Tier · 8.0/10
Free tier · From $0
  • #1 on Hugging Face Open ASR Leaderboard ...
  • Apache 2.0 open weights mean you can sel...
Updated 2026-04-18