Cohere Transcribe
A Tier · 8.0/10
Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, 2B parameters, #1 on Hugging Face Open ASR Leaderboard (5.42 avg WER), 14 enterprise-critical languages. Free API with rate limits; Model Vault for production
The Good and the Bad
What we like
- #1 on Hugging Face Open ASR Leaderboard as of 2026-03-26 -- 5.42 average WER across 8 English benchmarks, beating IBM Granite 4.0 1B Speech (5.52), NVIDIA Canary Qwen 2.5B (5.63), ElevenLabs Scribe v2 (5.83), and OpenAI Whisper Large v3 (7.44)
- Apache 2.0 open weights mean you can self-host without license fees -- rare for production-grade ASR at this accuracy tier. vLLM integration was contributed upstream so production serving works out of the box
- Encoder-heavy architecture (90%+ of parameters in encoder, lightweight decoder) is explicitly optimized for serving efficiency -- offline throughput is 3x higher than comparable open-source models at similar WER
- Trained on 14 enterprise-critical languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin, Japanese, Korean -- positioned squarely for European and APAC enterprise transcription
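For readers unfamiliar with the headline metric, here is a minimal sketch of how a word error rate and a leaderboard-style average can be computed. It assumes the 5.42 figure is an unweighted mean over the 8 benchmarks and skips the text normalization (casing, punctuation) that real evaluations apply first; the function names are illustrative, not taken from any leaderboard tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_benchmark_wer: list[float]) -> float:
    """Unweighted mean across benchmarks (an assumption about the averaging)."""
    if not per_benchmark_wer:
        raise ValueError("need at least one benchmark score")
    return sum(per_benchmark_wer) / len(per_benchmark_wer)


# One substitution ("sat" -> "sit") plus one dropped "the": 2 errors over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the score is an average, a gap of 0.1 between two models can come from one benchmark rather than a uniform lead, so it is worth checking per-dataset numbers for your own audio type.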
What could be better
- No speaker diarization, word-level timestamps, or streaming WebSocket API -- this is a transcription-only model, not a full ASR platform like Grok Speech or Deepgram. You will need to build those features yourself
- Trained to expect a language tag and monolingual audio -- code-switched (En, X) transcription works sometimes but is not a first-class use case
- Like most attention-based encoder-decoder (AED) speech models, it is 'eager to transcribe' -- background noise and silence can turn into hallucinations without a VAD (voice activity detection) front-end
- Text-to-speech is NOT in this release -- Cohere Transcribe is STT only. If you need both, pair with Grok TTS, ElevenLabs, or Microsoft MAI-Voice-1
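To make the VAD point concrete, here is a rough sketch of an energy-based gate that finds the spans worth sending to the model. It is a stand-in, not what Cohere recommends: production systems usually use a trained VAD, and the threshold, frame size, and hangover values below are placeholder assumptions that would need tuning on real audio. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_regions(samples, sample_rate=16000, frame_ms=30,
                   threshold=0.02, hangover_frames=5):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold.

    A region stays open through up to `hangover_frames` quiet frames so that
    short pauses between words do not split an utterance.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms too small for this sample rate")
    n_frames = len(samples) // frame_len
    regions = []
    start = None      # index of the first loud frame in the open region
    silent_run = 0    # consecutive quiet frames since the last loud one
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_frames:
                end = i - silent_run + 1  # index of the first quiet frame
                regions.append((start * frame_len / sample_rate,
                                end * frame_len / sample_rate))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended while a region was open; trim trailing quiet frames.
        end = n_frames - silent_run
        regions.append((start * frame_len / sample_rate,
                        end * frame_len / sample_rate))
    return regions
```

Only the returned spans would be passed to the transcription model, so stretches of floor noise or silence never reach the decoder.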
Pricing
Open-weights (Apache 2.0)
- Self-host the 2B-parameter model from Hugging Face
- Commercial use allowed, no license fee
- vLLM production-serving support (merged upstream)
- 3x higher offline throughput vs. similarly-sized models
Cohere API (free tier)
- Hosted inference for low-setup experimentation
- Rate-limited (see Cohere docs)
- No commercial SLA
Cohere Model Vault (production)
- Dedicated deployment, no rate limits
- Per-hour instance pricing
- Discounted plans for longer-term commitments
- Cohere North enterprise platform integration
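Per-hour instance pricing is hard to compare directly with APIs that bill per hour of audio, such as the $0.10/hr batch STT price listed for Grok Speech below. The arithmetic can be sketched as follows. The instance price and throughput in the example are made-up placeholders, since this review has no published Model Vault rate or measured throughput to plug in.

```python
def cost_per_audio_hour(instance_price_per_hour: float,
                        audio_hours_per_wall_hour: float) -> float:
    """Effective price of one hour of audio on a dedicated instance."""
    if audio_hours_per_wall_hour <= 0:
        raise ValueError("throughput must be positive")
    return instance_price_per_hour / audio_hours_per_wall_hour


def break_even_throughput(instance_price_per_hour: float,
                          per_audio_hour_price: float) -> float:
    """Audio hours per wall-clock hour at which the instance matches a
    per-audio-hour API price."""
    if per_audio_hour_price <= 0:
        raise ValueError("API price must be positive")
    return instance_price_per_hour / per_audio_hour_price


# Placeholder numbers only: a hypothetical $4.00/hr instance that gets through
# 100 hours of audio per wall-clock hour, compared with a $0.10 per-audio-hour API.
print(cost_per_audio_hour(4.00, 100.0))
print(break_even_throughput(4.00, 0.10))
```

The comparison only holds while the instance is kept busy; an idle dedicated deployment still bills by the hour.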
Known Issues
- No speaker diarization or word-level timestamps at launch. If your product needs multi-speaker separation (call centers, meetings) you'll need to add a diarization model on top or choose Grok STT / Deepgram / AssemblyAI. Source: Cohere Hugging Face blog · 2026-03
- Model 'eagerly' transcribes non-speech audio -- prepend a VAD or noise gate for production use to avoid hallucinations on floor noise. Source: Cohere Labs limitations note · 2026-03
- Only 14 languages supported -- narrower than Whisper Large v3 (99 languages) or Grok STT (25+). For rare-language workloads, open-source Whisper or proprietary APIs still win. Source: Cohere Labs model card · 2026-03
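For the diarization gap in the first item, one workable pattern is to run a separate diarization model first and transcribe each speaker turn on its own. The sketch below shows only the stitching step. The `Turn` type and the `transcribe` callable are hypothetical placeholders for whatever diarizer and Cohere Transcribe wrapper you use; neither is part of any Cohere API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Turn:
    """One speaker turn as produced by an external diarization model."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def transcribe_turns(
    samples: Sequence[float],
    sample_rate: int,
    turns: Sequence[Turn],
    transcribe: Callable[[Sequence[float]], str],  # hypothetical model wrapper
    min_duration: float = 0.2,
) -> list[tuple[str, str]]:
    """Slice the audio per speaker turn and label each transcript."""
    lines = []
    for turn in sorted(turns, key=lambda t: t.start):
        if turn.end - turn.start < min_duration:
            continue  # too short to transcribe reliably; skip it
        lo = int(turn.start * sample_rate)
        hi = int(turn.end * sample_rate)
        text = transcribe(samples[lo:hi]).strip()
        if text:
            lines.append((turn.speaker, text))
    return lines
```

Transcribing per turn gives coarse, turn-level timing for free, but it does not recover word-level timestamps, and overlapping speech still needs handling in the diarizer.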
Best for
Enterprise teams transcribing English, European, and major APAC languages at scale who want open weights they can self-host, fine-tune, or deploy on-prem. The Apache 2.0 license removes a major procurement blocker compared to proprietary ASR, and the accuracy tier is now best-in-class for open models.
Not for
Call-center and meeting-transcription products that need speaker diarization and streaming out of the box -- Grok STT, Deepgram, or AssemblyAI ship those features. Also not ideal for rare-language workloads (sub-Saharan African languages, indigenous languages) where Whisper Large v3's 99-language coverage is still hard to beat.
Our Verdict
Cohere Transcribe takes the top spot on the Open ASR Leaderboard with open weights, and the Apache 2.0 license makes it a real procurement unlock for enterprises that could self-host Whisper but not at the accuracy tier they needed. The limitations (no diarization, no streaming, no TTS counterpart) mean it won't replace full ASR platforms for call-center or meeting products -- but for batch enterprise transcription (document pipelines, media indexing, compliance recording) it's now the default open-source pick. Pair it with Grok TTS or ElevenLabs on the output side and you have a complete voice stack, though those TTS services are proprietary, so only the transcription half is open.
Sources
- Cohere Labs: Introducing Cohere-transcribe (accessed 2026-04-18)
- TechCrunch: Cohere open-source voice transcription (accessed 2026-04-18)
- MarkTechPost coverage (accessed 2026-04-18)
- Hugging Face Open ASR Leaderboard (accessed 2026-04-18)
Alternatives to Cohere Transcribe
ElevenLabs
Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)
Murf AI
Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best
Descript
Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype
Speechify
Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio
Microsoft MAI-Voice-1
Microsoft's first in-house expressive TTS model -- launched 2026-04-02 on Azure Foundry. Generates 60s of audio in ~1s on a single GPU. Custom voice cloning from a few seconds of input. Powers Copilot, Bing, PowerPoint, and Azure Speech
Grok Speech (STT + TTS APIs)
xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages, word-level timestamps + speaker diarization