Cohere Transcribe
A Tier · 8.0/10
Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, 2B parameters, #1 on Hugging Face Open ASR Leaderboard (5.42 avg WER), 14 enterprise-critical languages. Free API with rate limits; Model Vault for production
The Good and the Bad
What we like
- +#1 on Hugging Face Open ASR Leaderboard as of 2026-03-26 -- 5.42 average WER across 8 English benchmarks, beating IBM Granite 4.0 1B Speech (5.52), NVIDIA Canary Qwen 2.5B (5.63), ElevenLabs Scribe v2 (5.83), and OpenAI Whisper Large v3 (7.44)
- +Apache 2.0 open weights mean you can self-host without license fees -- rare for production-grade ASR at this accuracy tier. vLLM integration was contributed upstream so production serving works out of the box
- +Encoder-heavy architecture (90%+ of parameters in encoder, lightweight decoder) is explicitly optimized for serving efficiency -- offline throughput is 3x higher than comparable open-source models at similar WER
- +Trained on 14 enterprise-critical languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin, Japanese, Korean -- positioned squarely for European and APAC enterprise transcription
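For readers less familiar with the metric behind the leaderboard figure: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference word count. The sketch below is a minimal illustration only. It assumes lowercase, whitespace-split tokens (the leaderboard applies its own text normalization, which is not reproduced here), and it assumes the "average WER" is an unweighted mean over benchmarks. The function names are ours.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_benchmark_wer: List[float]) -> float:
    """Unweighted mean across benchmarks (an assumption, see lead-in)."""
    return sum(per_benchmark_wer) / len(per_benchmark_wer)
```

Under this definition, a 5.42 WER means roughly 5 to 6 word errors per 100 reference words, so the gaps between the top entries above amount to a fraction of a word per 100.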
What could be better
- −No speaker diarization, word-level timestamps, or streaming WebSocket API -- this is a transcription-only model, not a full ASR platform like Grok Speech or Deepgram. Build those features yourself
- −Trained to expect a language tag and monolingual audio -- code-switched (En, X) transcription works sometimes but is not a first-class use case
- −Like most attention-based encoder-decoder (AED) speech models, it is 'eager to transcribe' -- background noise and silence can turn into hallucinated text unless a VAD (voice activity detection) front-end filters them out
- −Text-to-speech is NOT in this release -- Cohere Transcribe is STT only. If you need both, pair with Grok TTS, ElevenLabs, or Microsoft MAI-Voice-1
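To make the VAD point concrete, here is a rough sketch of the simplest possible front-end: an energy gate that keeps only stretches of audio louder than a threshold, so that silence never reaches the model. This is a stand-in for illustration, not a real VAD and not something Cohere prescribes. The frame size, the -40 dBFS threshold, and the minimum speech length are all assumed values you would need to tune, and a trained VAD model will handle noisy audio far better.

```python
import math
from typing import List, Sequence, Tuple


def speech_regions(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,            # assumed frame size
    threshold_db: float = -40.0,   # assumed gate level in dBFS
    min_speech_ms: int = 90,       # assumed minimum run length to keep
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs louder than threshold_db.

    Samples are expected as floats in [-1.0, 1.0].
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_speech_ms // frame_ms)
    n_frames = math.ceil(len(samples) / frame_len)

    regions: List[Tuple[int, int]] = []
    run_start = None
    # One extra iteration acts as a silent sentinel that closes a trailing run.
    for k in range(n_frames + 1):
        active = False
        if k < n_frames:
            frame = samples[k * frame_len:(k + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / len(frame))
            level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
            active = level_db > threshold_db

        if active and run_start is None:
            run_start = k
        elif not active and run_start is not None:
            if k - run_start >= min_frames:
                end = min(k * frame_len, len(samples))
                regions.append((run_start * frame_len, end))
            run_start = None
    return regions
```

In a pipeline you would transcribe only the returned regions and skip everything else, which removes the silent stretches the model would otherwise try to fill with text.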
Pricing
Open-weights (Apache 2.0)
- ✓Self-host the 2B-parameter model from Hugging Face
- ✓Commercial use allowed, no license fee
- ✓vLLM production-serving support (merged upstream)
- ✓3x higher offline throughput vs. similarly-sized models
Cohere API (free tier)
- ✓Hosted inference for low-setup experimentation
- ✓Rate-limited (see Cohere docs)
- ✓No commercial SLA
Cohere Model Vault (production)
- ✓Dedicated deployment, no rate limits
- ✓Per-hour instance pricing
- ✓Discounted plans for longer-term commitments
- ✓Cohere North enterprise platform integration
Known Issues
- OWNERSHIP-CHANGE WATCH (2026-04-24): Cohere announced it is 'joining forces' with Aleph Alpha. The vendor blog (cohere.com/blog/cohere-alephalpha-join-forces) frames this as a merger / strategic combination rather than a clean acquisition, backed by Schwarz Group as lead investor with EUR 500M (USD 600M) in Series E structured financing. STACKIT (Schwarz Digits' sovereign cloud) will serve as the 'technical backbone of this transatlantic AI initiative', targeting regulated sectors (public sector, finance, defense, energy, manufacturing, telecom, healthcare). No effective or closing date is specified; the deal is announced as planned/pending. There are no immediate product or pricing changes for Cohere Transcribe customers, but a future combined-entity roadmap may reshape the model lineup, support guarantees, and pricing -- worth tracking before any new long-term Cohere Model Vault commitment. Source: Cohere blog: Cohere and Aleph Alpha join forces (cohere.com/blog/cohere-alephalpha-join-forces), TechCrunch, CNBC · 2026-04-24
- No speaker diarization or word-level timestamps at launch. If your product needs multi-speaker separation (call centers, meetings) you'll need to add a diarization model on top or choose Grok STT / Deepgram / AssemblyAI. Source: Cohere Hugging Face blog · 2026-03
- Model 'eagerly' transcribes non-speech audio -- prepend a VAD or noise gate for production use to avoid hallucinations on floor noise. Source: Cohere Labs limitations note · 2026-03
- Only 14 languages supported -- narrower than Whisper Large v3 (99 languages) or Grok STT (25+). For rare-language workloads, open-source Whisper or proprietary APIs still win. Source: Cohere Labs model card · 2026-03
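One way to add a diarization model "on top" of a transcription-only model, given that it returns no timestamps, is to diarize first and then transcribe each speaker turn separately. The sketch below shows only that glue logic, under stated assumptions: the speaker turns come from whatever diarizer you choose, and `transcribe` is a hypothetical callable standing in for your own wrapper around the model (it is not a Cohere API). The 0.5-second merge gap is likewise an assumed default.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    speaker: str
    start_s: float
    end_s: float


@dataclass
class LabeledSegment:
    speaker: str
    start_s: float
    end_s: float
    text: str


def merge_adjacent_turns(turns: Sequence[Turn], max_gap_s: float = 0.5) -> List[Turn]:
    """Join consecutive turns by the same speaker separated by a short gap."""
    merged: List[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start_s):
        last = merged[-1] if merged else None
        if last and last.speaker == turn.speaker and turn.start_s - last.end_s <= max_gap_s:
            merged[-1] = Turn(last.speaker, last.start_s, max(last.end_s, turn.end_s))
        else:
            merged.append(turn)
    return merged


def transcribe_by_speaker(
    samples: Sequence[float],
    sample_rate: int,
    turns: Sequence[Turn],
    transcribe: Callable[[Sequence[float], int], str],  # hypothetical model wrapper
) -> List[LabeledSegment]:
    """Slice the audio per speaker turn and transcribe each slice on its own."""
    segments: List[LabeledSegment] = []
    for turn in merge_adjacent_turns(turns):
        chunk = samples[int(turn.start_s * sample_rate):int(turn.end_s * sample_rate)]
        if len(chunk) == 0:
            continue
        text = transcribe(chunk, sample_rate).strip()
        if text:
            segments.append(LabeledSegment(turn.speaker, turn.start_s, turn.end_s, text))
    return segments
```

This yields turn-level timestamps and speaker labels, not word-level ones, and it does not handle overlapping speech. Very short turns may also transcribe poorly because the model sees little context, which is part of why the platforms named above remain the easier choice for call-center and meeting products.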
Best for
Enterprise teams transcribing English, European, and major APAC languages at scale who want open weights they can self-host, fine-tune, or deploy on-prem. The Apache 2.0 license removes a major procurement blocker compared to proprietary ASR, and the accuracy tier is now best-in-class for open models.
Not for
Call-center and meeting-transcription products that need speaker diarization and streaming out of the box -- Grok STT, Deepgram, or AssemblyAI ship those features. Also not ideal for rare-language workloads (sub-Saharan African languages, indigenous languages) where Whisper Large v3's 99-language coverage is still hard to beat.
Our Verdict
Cohere Transcribe takes the top spot on the Open ASR Leaderboard with open weights, and the Apache 2.0 license makes it a real procurement unlock for enterprises that couldn't get the accuracy they needed from self-hosted Whisper. The limitations (no diarization, no streaming, no TTS counterpart) mean it won't replace full ASR platforms for call-center or meeting products -- but for batch enterprise transcription (document pipelines, media indexing, compliance recording) it's now the default open-source pick. Pair it with Grok TTS or ElevenLabs on the output side for a complete voice stack, keeping in mind that those TTS services are proprietary, so only the transcription half is open.
Sources
- Cohere Labs: Introducing Cohere-transcribe (accessed 2026-04-18)
- TechCrunch: Cohere open-source voice transcription (accessed 2026-04-18)
- MarkTechPost coverage (accessed 2026-04-18)
- Hugging Face Open ASR Leaderboard (accessed 2026-04-18)
Alternatives to Cohere Transcribe
ElevenLabs
Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)
Murf AI
Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best
Descript
Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype
Speechify
Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio
Microsoft MAI-Voice-2
Microsoft's in-house expressive TTS model -- MAI-Voice-2 launched 2026-06-02 at Build: 15 languages (up from English-only), granular emotion-tag control, zero-shot voice cloning from a 5-60s clip, and preferred over MAI-Voice-1 72% of the time. In speaker-similarity tests its speech is 'indistinguishable' from real recordings. On Azure Foundry + integrated into VS Code and Dynamics 365 Contact Center; lower-cost MAI-Voice-2-Flash coming. Original MAI-Voice-1 shipped 2026-04-02
Grok Speech (STT + TTS APIs)
xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages, word-level timestamps + speaker diarization