Cohere Transcribe

A Tier · 8.0/10

Cohere's first audio model -- launched 2026-03-26 under Apache 2.0 with 2B parameters, #1 on the Hugging Face Open ASR Leaderboard at launch (5.42 avg WER), trained on 14 enterprise-critical languages. Free API with rate limits; Model Vault for production.

Last updated: 2026-05-20 · Free tier available

Score Breakdown

Ease of Use: 7.0
Output Quality: 9.0
Value: 9.0
Features: 7.0

The Good and the Bad

What we like

+#1 on Hugging Face Open ASR Leaderboard as of 2026-03-26 -- 5.42 average WER across 8 English benchmarks, beating IBM Granite 4.0 1B Speech (5.52), NVIDIA Canary Qwen 2.5B (5.63), ElevenLabs Scribe v2 (5.83), and OpenAI Whisper Large v3 (7.44)
+Apache 2.0 open weights mean you can self-host without license fees -- rare for production-grade ASR at this accuracy tier. vLLM integration was contributed upstream so production serving works out of the box
+Encoder-heavy architecture (90%+ of parameters in encoder, lightweight decoder) is explicitly optimized for serving efficiency -- offline throughput is 3x higher than comparable open-source models at similar WER
+Trained on 14 enterprise-critical languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin, Japanese, Korean -- positioned squarely for European and APAC enterprise transcription
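
For readers unfamiliar with the metric behind the leaderboard figures above, here is a minimal sketch of word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the reference length. Two assumptions are ours, not the leaderboard's documented method: normalization here is only lowercasing and whitespace splitting (real evaluations normalize text more carefully), and the "average WER" is treated as a plain unweighted mean over benchmarks. The function names are illustrative.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_benchmark_wer: Sequence[float]) -> float:
    """Unweighted mean over benchmarks (assumed averaging scheme)."""
    if not per_benchmark_wer:
        raise ValueError("need at least one benchmark score")
    return sum(per_benchmark_wer) / len(per_benchmark_wer)
```

Read this way, a 5.42 average means roughly 5 to 6 word-level errors per 100 reference words, averaged across the 8 benchmarks.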

What could be better

−No speaker diarization, word-level timestamps, or streaming WebSocket API -- this is a transcription-only model, not a full ASR platform like Grok Speech or Deepgram. Build those features yourself
−Trained to expect a language tag and monolingual audio -- code-switched (En, X) transcription works sometimes but is not a first-class use case
−Like most AED speech models, 'eager to transcribe' -- background noise and silence can turn into hallucinations without a VAD (voice activity detection) front-end
−Text-to-speech is NOT in this release -- Cohere Transcribe is STT only. If you need both, pair with Grok TTS, ElevenLabs, or Microsoft MAI-Voice-1
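
Because the model expects a language tag and monolingual audio, a pipeline can reject unsupported languages before any audio is sent for transcription. The sketch below is a hypothetical pre-flight check, not part of Cohere's API: the ISO 639-1 codes are our own mapping of the 14 languages listed above, and the tag format the model actually expects is not stated in this review.

```python
# Assumption: ISO 639-1 codes for the 14 languages named in this review.
SUPPORTED_LANGUAGES = {
    "en": "English", "de": "German", "fr": "French", "it": "Italian",
    "es": "Spanish", "pt": "Portuguese", "el": "Greek", "nl": "Dutch",
    "pl": "Polish", "ar": "Arabic", "vi": "Vietnamese", "zh": "Mandarin",
    "ja": "Japanese", "ko": "Korean",
}


def resolve_language_tag(tag: str) -> str:
    """Reduce a locale-style tag (e.g. 'pt_BR', 'en-US') to a supported base code.

    Raises ValueError for languages outside the supported set so the caller
    can route that audio to a different model instead.
    """
    code = tag.strip().lower().split("-")[0].split("_")[0]
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"language {tag!r} is not among the 14 supported languages")
    return code
```

Code-switched audio has no single correct tag, so a check like this only guards the monolingual case.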

Pricing

Open-weights (Apache 2.0)

✓Self-host the 2B-parameter model from Hugging Face
✓Commercial use allowed, no license fee
✓vLLM production-serving support (merged upstream)
✓3x higher offline throughput vs. similarly-sized models

Cohere API (free tier)

✓Hosted inference for low-setup experimentation
✓Rate-limited (see Cohere docs)
✓No commercial SLA

Cohere Model Vault (production)

Custom

✓Dedicated deployment, no rate limits
✓Per-hour instance pricing
✓Discounted plans for longer-term commitments
✓Cohere North enterprise platform integration
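
Since Model Vault is billed per instance-hour rather than per hour of audio, comparing it with metered APIs takes a small conversion. The sketch below shows that arithmetic. Every input is a placeholder you would need to fill in yourself: Cohere's instance price is custom and unpublished, and achievable throughput depends on your hardware and audio.

```python
def cost_per_audio_hour(instance_price_per_hour: float,
                        audio_hours_per_instance_hour: float) -> float:
    """Effective price of transcribing one hour of audio on a dedicated instance."""
    if audio_hours_per_instance_hour <= 0:
        raise ValueError("throughput must be positive")
    return instance_price_per_hour / audio_hours_per_instance_hour


def breakeven_throughput(instance_price_per_hour: float,
                         metered_price_per_audio_hour: float) -> float:
    """Audio-hours per instance-hour needed to match a metered per-audio-hour price."""
    if metered_price_per_audio_hour <= 0:
        raise ValueError("metered price must be positive")
    return instance_price_per_hour / metered_price_per_audio_hour


# Hypothetical inputs for illustration only: a 4.00/hr instance that keeps up
# with 100 hours of audio per wall-clock hour, compared against a metered
# 0.10 per audio-hour batch price.
dedicated = cost_per_audio_hour(4.00, 100.0)
needed = breakeven_throughput(4.00, 0.10)
```

The comparison only holds if the instance stays busy; idle hours raise the effective per-audio-hour cost.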

Known Issues

OWNERSHIP-CHANGE WATCH (2026-04-24): Cohere announced it is 'joining forces' with Aleph Alpha. Vendor blog (cohere.com/blog/cohere-alephalpha-join-forces) frames this as a merger / strategic combination rather than a clean acquisition, backed by Schwarz Group as lead investor with EUR 500M (USD 600M) Series E structured financing. STACKIT (Schwarz Digits' sovereign cloud) will serve as the 'technical backbone of this transatlantic AI initiative', targeting regulated sectors (public sector, finance, defense, energy, manufacturing, telecom, healthcare). Effective date / closing date not specified; deal announced as planned/pending. No immediate product or pricing changes for Cohere Transcribe customers, but a future combined-entity roadmap may reshape model lineup, support guarantees, and pricing -- worth tracking before any new long-term Cohere Model Vault commitment.
Source: Cohere blog: Cohere and Aleph Alpha join forces (cohere.com/blog/cohere-alephalpha-join-forces), TechCrunch, CNBC · 2026-04-24

No speaker diarization or word-level timestamps at launch. If your product needs multi-speaker separation (call centers, meetings) you'll need to add a diarization model on top or choose Grok STT / Deepgram / AssemblyAI.
Source: Cohere Hugging Face blog · 2026-03

Model 'eagerly' transcribes non-speech audio -- prepend a VAD or noise gate for production use to avoid hallucinations on floor noise.
Source: Cohere Labs limitations note · 2026-03

Only 14 languages supported -- narrower than Whisper Large v3 (99 languages) or Grok STT (25+). For rare-language workloads, open-source Whisper or proprietary APIs still win.
Source: Cohere Labs model card · 2026-03
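
The VAD / noise-gate advice above can be illustrated with a deliberately naive energy gate: split the audio into short frames, keep runs of frames whose RMS level clears a threshold, and send only those runs to the transcriber. This is a sketch under assumptions, not Cohere's recommended front-end. The threshold, frame size, and hangover values are arbitrary placeholders, samples are assumed to be floats in [-1.0, 1.0], and a production system would more likely use a trained VAD model.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one non-empty frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(samples: Sequence[float], sample_rate: int,
                    frame_ms: int = 30, threshold: float = 0.02,
                    hangover_frames: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions loud enough to transcribe.

    A segment stays open across up to `hangover_frames` quiet frames so short
    pauses between words do not split an utterance.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Tuple[int, int]] = []
    start = None          # start index of the currently open segment, if any
    last_loud_end = 0     # end index of the most recent loud frame
    quiet_run = 0         # consecutive quiet frames inside an open segment

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = offset
            last_loud_end = offset + len(frame)
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run > hangover_frames:
                segments.append((start, last_loud_end))
                start, quiet_run = None, 0

    if start is not None:
        segments.append((start, last_loud_end))
    return segments
```

Audio that yields no segments can be skipped entirely, which avoids handing the model pure silence or floor noise.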

Best for

Enterprise teams transcribing English, European, and major APAC languages at scale who want open weights they can self-host, fine-tune, or deploy on-prem. The Apache 2.0 license removes a major procurement blocker compared to proprietary ASR, and the accuracy tier is now best-in-class for open models.

Not for

Call-center and meeting-transcription products that need speaker diarization and streaming out of the box -- Grok STT, Deepgram, or AssemblyAI ship those features. Also not ideal for rare-language workloads (sub-Saharan African languages, indigenous languages) where Whisper Large's 99-language coverage is still hard to beat.

Our Verdict

Cohere Transcribe debuted at #1 on the Open ASR Leaderboard, and the Apache 2.0 license makes it a real procurement unlock for enterprises that couldn't self-host Whisper at the accuracy tier they needed. The limitations (no diarization, no streaming, no TTS counterpart) mean it won't replace full ASR platforms for call-center or meeting products -- but for batch enterprise transcription (document pipelines, media indexing, compliance recording) it's now the default open-source pick. Pair it with Grok TTS or ElevenLabs on the output side and you have a complete voice stack, though only the transcription half is open: both of those TTS options are proprietary.

Sources

Cohere Labs: Introducing Cohere-transcribe (accessed 2026-04-18)
TechCrunch: Cohere open-source voice transcription (accessed 2026-04-18)
MarkTechPost coverage (accessed 2026-04-18)
Hugging Face Open ASR Leaderboard (accessed 2026-04-18)

Alternatives to Cohere Transcribe

ElevenLabs

Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)

8.5/10

Free tier · From $0

Voice quality is still the best availabl...
11.ai (alpha launched June 2025, still g...

Updated 2026-06-09

Murf AI

Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best

7.0/10

Free tier · From $0

Voice quality is genuinely impressive --...
The editor is simple and intuitive, you ...

Updated 2026-03-27

Descript

Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype

8.5/10

Free tier · From $0

Text-based editing is a genuine breakthr...
Filler word removal works shockingly wel...

Updated 2026-03-27

Speechify

Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio

6.8/10

Free tier · From $0

Premium voices sound genuinely natural -...
Works across platforms: browser extensio...

Updated 2026-04-02

Microsoft MAI-Voice-2

Microsoft's in-house expressive TTS model -- MAI-Voice-2 launched 2026-06-02 at Build: 15 languages (up from English-only), granular emotion-tag control, zero-shot voice cloning from a 5-60s clip, and preferred over MAI-Voice-1 72% of the time. In speaker-similarity tests its speech is 'indistinguishable' from real recordings. On Azure Foundry + integrated into VS Code and Dynamics 365 Contact Center; lower-cost MAI-Voice-2-Flash coming. Original MAI-Voice-1 shipped 2026-04-02

7.3/10

Free tier · Pricing not disclosed

Speed is the real headline -- 60 seconds...
First-party Azure Foundry integration me...

Updated 2026-06-02

Grok Speech (STT + TTS APIs)

xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages, word-level timestamps + speaker diarization

8.1/10

From $0.10

Published word-error-rate benchmark puts...
Pricing is aggressive -- $0.10/hr batch ...

Updated 2026-04-18