Cohere Transcribe

A Tier · 8.0/10

Cohere's first audio model -- launched 2026-03-26 under Apache 2.0, with 2B parameters, the #1 spot on the Hugging Face Open ASR Leaderboard (5.42 average WER), and support for 14 enterprise-critical languages. Free API with rate limits; Model Vault for production.

Last updated: 2026-04-18 · Free tier available

Score Breakdown

  • Ease of Use: 7.0
  • Output Quality: 9.0
  • Value: 9.0
  • Features: 7.0

The Good and the Bad

What we like

  • #1 on Hugging Face Open ASR Leaderboard as of 2026-03-26 -- 5.42 average WER across 8 English benchmarks, beating IBM Granite 4.0 1B Speech (5.52), NVIDIA Canary Qwen 2.5B (5.63), ElevenLabs Scribe v2 (5.83), and OpenAI Whisper Large v3 (7.44)
  • Apache 2.0 open weights mean you can self-host without license fees -- rare for production-grade ASR at this accuracy tier. vLLM integration was contributed upstream so production serving works out of the box
  • Encoder-heavy architecture (90%+ of parameters in encoder, lightweight decoder) is explicitly optimized for serving efficiency -- offline throughput is 3x higher than comparable open-source models at similar WER
  • Trained on 14 enterprise-critical languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin, Japanese, Korean -- positioned squarely for European and APAC enterprise transcription
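
For readers who want to sanity-check WER figures like the ones above on their own audio, here is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function names are ours, the text normalization (lowercasing and whitespace splitting) is far simpler than what a leaderboard would apply, and treating the leaderboard average as a plain unweighted mean over benchmarks is an assumption.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Simplistic normalization (assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_benchmark_wer: Sequence[float]) -> float:
    """Unweighted mean across benchmarks (assumed aggregation, not confirmed)."""
    return sum(per_benchmark_wer) / len(per_benchmark_wer)
```

With this definition, transcribing "the cat sat on the mat" as "the cat sit on mat" counts one substitution and one deletion against six reference words.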

What could be better

  • No speaker diarization, word-level timestamps, or streaming WebSocket API -- this is a transcription-only model, not a full ASR platform like Grok Speech or Deepgram. You will need to build those features yourself
  • Trained to expect a language tag and monolingual audio -- code-switched (En, X) transcription works sometimes but is not a first-class use case
  • Like most attention-based encoder-decoder (AED) speech models, it is 'eager to transcribe' -- background noise and silence can turn into hallucinations without a VAD (voice activity detection) front-end
  • Text-to-speech is NOT in this release -- Cohere Transcribe is STT only. If you need both, pair with Grok TTS, ElevenLabs, or Microsoft MAI-Voice-1
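
Because the model expects a language tag and covers only the 14 languages listed earlier, a pipeline may want a pre-flight check that routes unsupported languages elsewhere. The sketch below is one way to do that. The ISO 639-1 codes are standard, but whether the model accepts tags in exactly this form is an assumption, and `pick_backend` with its backend labels is hypothetical.

```python
# ISO 639-1 codes for the 14 languages named in the review.
# Assumption: the model's actual tag format may differ from these codes.
SUPPORTED_LANGUAGES = {
    "en": "English", "de": "German", "fr": "French", "it": "Italian",
    "es": "Spanish", "pt": "Portuguese", "el": "Greek", "nl": "Dutch",
    "pl": "Polish", "ar": "Arabic", "vi": "Vietnamese", "zh": "Mandarin",
    "ja": "Japanese", "ko": "Korean",
}


def pick_backend(language_tag: str) -> str:
    """Return a hypothetical backend label for a BCP 47-style tag such as 'de-DE'."""
    # Keep only the primary language subtag: 'pt_BR' -> 'pt', 'de-DE' -> 'de'.
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    if primary in SUPPORTED_LANGUAGES:
        return "cohere-transcribe"
    # Anything else goes to a broader-coverage model (for example Whisper).
    return "fallback"
```

Code-switched audio still needs a separate decision, since a single tag cannot describe it.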

Pricing

Open-weights (Apache 2.0)

$0
  • Self-host the 2B-parameter model from Hugging Face
  • Commercial use allowed, no license fee
  • vLLM production-serving support (merged upstream)
  • 3x higher offline throughput than comparable open-source models at similar WER

Cohere API (free tier)

$0
  • Hosted inference for low-setup experimentation
  • Rate-limited (see Cohere docs)
  • No commercial SLA

Cohere Model Vault (production)

Custom
  • Dedicated deployment, no rate limits
  • Per-hour instance pricing
  • Discounted plans for longer-term commitments
  • Cohere North enterprise platform integration

Known Issues

  • No speaker diarization or word-level timestamps at launch. If your product needs multi-speaker separation (call centers, meetings) you'll need to add a diarization model on top or choose Grok STT / Deepgram / AssemblyAI (Source: Cohere Hugging Face blog · 2026-03)
  • Model 'eagerly' transcribes non-speech audio -- prepend a VAD or noise gate for production use to avoid hallucinations on floor noise (Source: Cohere Labs limitations note · 2026-03)
  • Only 14 languages supported -- narrower than Whisper Large v3 (99 languages) or Grok STT (25+). For rare-language workloads, open-source Whisper or proprietary APIs still win (Source: Cohere Labs model card · 2026-03)
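
To illustrate the VAD or noise-gate workaround mentioned above, here is a minimal energy-based gate that finds spans loud enough to be worth sending to the model. It is a sketch, not what Cohere recommends: it assumes mono float samples normalized to [-1, 1], the threshold and frame size are arbitrary starting points, and a production system would more likely use a trained VAD model.

```python
import math
from typing import List, Sequence, Tuple


def speech_segments(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,          # assumed frame size
    threshold: float = 0.02,     # assumed RMS threshold for normalized audio
    min_speech_frames: int = 3,  # drop bursts shorter than this many frames
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame RMS meets the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len  # trailing partial frame is ignored
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current loud run

    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = k
        elif start is not None:
            # A quiet frame closes the run; keep it only if it was long enough.
            if k - start >= min_speech_frames:
                segments.append((start * frame_len / sample_rate,
                                 k * frame_len / sample_rate))
            start = None

    # Close a run that extends to the end of the audio.
    if start is not None and n_frames - start >= min_speech_frames:
        segments.append((start * frame_len / sample_rate,
                         n_frames * frame_len / sample_rate))
    return segments
```

Only the returned spans would then be cut out and passed to transcription, so silence and low-level floor noise never reach the decoder. A simple energy gate will not reject loud non-speech such as music, which is one reason a trained VAD is the safer production choice.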

Best for

Enterprise teams transcribing English, European, and major APAC languages at scale who want open weights they can self-host, fine-tune, or deploy on-prem. The Apache 2.0 license removes a major procurement blocker compared to proprietary ASR, and the accuracy tier is now best-in-class for open models.

Not for

Call-center and meeting-transcription products that need speaker diarization and streaming out of the box -- Grok STT, Deepgram, or AssemblyAI ship those features. Also not ideal for rare-language workloads (sub-Saharan African languages, indigenous languages) where Whisper Large's 99-language coverage is still hard to beat.

Our Verdict

Cohere Transcribe marks the first time an open-weights ASR model has genuinely topped the Open ASR Leaderboard in head-to-head benchmark terms, and the Apache 2.0 license makes it a real procurement unlock for enterprises that couldn't self-host Whisper at the accuracy tier they needed. The limitations (no diarization, no streaming, no TTS counterpart) mean it won't replace full ASR platforms for call-center or meeting products -- but for batch enterprise transcription (document pipelines, media indexing, compliance recording) it's now the default open-source pick. Pair it with Grok TTS or ElevenLabs on the output side and you have a complete voice stack whose transcription half runs on open weights.

Sources

  • Cohere Labs: Introducing Cohere-transcribe (accessed 2026-04-18)
  • TechCrunch: Cohere open-source voice transcription (accessed 2026-04-18)
  • MarkTechPost coverage (accessed 2026-04-18)
  • Hugging Face Open ASR Leaderboard (accessed 2026-04-18)

Alternatives to Cohere Transcribe

ElevenLabs

Best-in-class AI voice generation -- now includes 11.ai (MCP-based voice assistant), Eleven v3 expressive speech, and IBM watsonx partnership. $500M raise at $11B valuation (Feb 2026)

A Tier · 8.5/10
Free tier · From $0
Voice quality is still the best availabl... · 11.ai alpha (March 2026) is the first se...
Updated 2026-04-16

Murf AI

Text-to-speech that actually sounds like a real person read your script -- not a robot trying its best

B Tier · 7.0/10
Free tier · From $0
Voice quality is genuinely impressive --... · The editor is simple and intuitive, you ...
Updated 2026-03-27

Descript

Edit audio and video by editing text -- the 'Google Docs of media editing' actually lives up to the hype

A Tier · 8.5/10
Free tier · From $0
Text-based editing is a genuine breakthr... · Filler word removal works shockingly wel...
Updated 2026-03-27

Speechify

Text-to-speech reader that turns articles, docs, and PDFs into natural-sounding audio

C Tier · 6.8/10
Free tier · From $0
Premium voices sound genuinely natural -... · Works across platforms: browser extensio...
Updated 2026-04-02

Microsoft MAI-Voice-1

Microsoft's first in-house expressive TTS model -- launched 2026-04-02 on Azure Foundry. Generates 60s of audio in ~1s on a single GPU. Custom voice cloning from a few seconds of input. Powers Copilot, Bing, PowerPoint, and Azure Speech

B Tier · 7.3/10
Free tier · From $22
Speed is the real headline -- 60 seconds... · First-party Azure Foundry integration me...
Updated 2026-04-17

Grok Speech (STT + TTS APIs)

xAI's standalone voice APIs -- launched 2026-04-17. Built on the stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. $0.10/hr STT batch, $4.20 per 1M characters TTS, 25+ languages, word-level timestamps + speaker diarization

A Tier · 8.1/10
From $0.10
Published word-error-rate benchmark puts... · Pricing is aggressive -- $0.10/hr batch ...
Updated 2026-04-18