DiffusionGemma (Google)

C Tier · 6.8/10

Google DeepMind's experimental open-weights TEXT-DIFFUSION model (June 10, 2026) -- 26B MoE (3.8B active), Apache 2.0, generates 256-token blocks in parallel with bidirectional attention for up to 4x faster output (1,000+ tok/s on H100). Trades some quality vs Gemma 4 for raw speed

Last updated: 2026-06-10Free tier available

Score Breakdown

6.0

Ease of Use

6.5

Output Quality

9.0

Value

6.0

Features

Visit DiffusionGemma (Google)

The Good and the Bad

What we like

+Text diffusion instead of autoregression: generates 256-token blocks in parallel with bidirectional attention -- up to 4x faster generation, 1,000+ tok/s on a single H100, 700+ tok/s on RTX 5090
+First open-weights text-diffusion model from a frontier lab at useful scale (26B MoE, 3.8B active) -- until now diffusion LLMs were research demos or closed previews
+Apache 2.0 with day-one ecosystem support (Hugging Face, NVIDIA NIM, vLLM, MLX, llama.cpp) -- you can actually deploy it, not just read the paper
+Latency profile is ideal for interactive local use: autocomplete, draft generation, agent inner-loops where tokens-per-second matters more than peak quality

What could be better

−Quality trails Gemma 4 on standard benchmarks -- Google positions it as a speed play, not a flagship; don't swap it into quality-sensitive pipelines
−Speedup is hardware-dependent: much weaker on Apple Silicon unified memory, and the parallel-decode advantage shrinks at high-QPS server batch sizes
−Experimental release -- diffusion text models have different failure modes (block-level incoherence) and far less community tuning lore than autoregressive Gemma
−No hosted API tier at launch -- self-host or NIM only, so casual users have no easy way to try it

Pricing

Self-hosted (open weights)

✓Apache 2.0 license -- commercial use permitted
✓Weights on Hugging Face; NVIDIA NIM container; Gemini Enterprise Model Garden
✓~18GB VRAM quantized -- runs on RTX 5090-class consumer hardware (700+ tok/s)
✓Works with vLLM, MLX, and llama.cpp

System Requirements

Hardware needed to self-host. Min = smallest viable setup (usually heavy quantization). Max = full-precision / production-grade.

Model variant	Min	Max
DiffusionGemma 26B MoE (3.8B active)Speedup is strongest on dedicated-VRAM NVIDIA GPUs; Apple Silicon unified memory sees much smaller gains.	~18 GB VRAM quantized (RTX 4090/5090 tier)	1x H100 for 1,000+ tok/s full-speed decode

Known Issues

LAUNCH (2026-06-10): DiffusionGemma released as an experimental open model. 26B-total / 3.8B-active MoE, Apache 2.0, text-diffusion architecture generating 256-token blocks in parallel with bidirectional attention. Vendor-claimed speeds: up to 4x faster than comparable autoregressive models, 1,000+ tok/s on H100, 700+ tok/s on RTX 5090, ~18GB VRAM quantized. Availability: Hugging Face weights, NVIDIA NIM, Gemini Enterprise Model Garden; vLLM/MLX/llama.cpp support at launch. Google's own caveats: trails Gemma 4 on quality benchmarks; speed advantage weaker on Apple Silicon and in high-QPS serving. Front-page Hacker News reception (226 points) on launch daySource: Google blog (blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/), Google developers blog (developers.googleblog.com/en/diffusiongemma-the-developer-guide/), NVIDIA blog (RTX AI Garage) · 2026-06-10

Best for

Developers who need fast local text generation -- autocomplete, drafting, high-volume agent inner-loops -- on a single GPU, and researchers who want a production-grade open diffusion LLM to build on.

Not for

Quality-first workloads (use Gemma 4 or a frontier API model), Mac-only setups (the speedup mostly evaporates on Apple Silicon), or anyone wanting a hosted try-before-you-deploy API -- there isn't one yet.

Our Verdict

DiffusionGemma is the most interesting open-weights release of June 2026 not because it's the best model -- Google openly says Gemma 4 beats it on quality -- but because it's the first serious, deployable text-diffusion LLM from a frontier lab. The 4x generation-speed claim holds on the right hardware (H100/RTX-class GPUs), which makes it a genuine option for latency-sensitive local work like autocomplete and agent inner-loops. Treat it as a speed specialist and an architecture preview: if diffusion decoding keeps scaling, this is what the next generation of local models may look like. For now, pair it with a quality model rather than replacing one.

Sources

Google blog: DiffusionGemma -- faster text generation (2026-06-10) (accessed 2026-06-10)
Google developers blog: DiffusionGemma developer guide (accessed 2026-06-10)
NVIDIA blog: RTX AI Garage -- local Gemma Diffusion (accessed 2026-06-10)
MarkTechPost: DiffusionGemma analysis (accessed 2026-06-10)

Explore more DiffusionGemma (Google) rankings

Deeper leaderboards, benchmarks, task-specific tier lists, and status/pricing pages for DiffusionGemma (Google).

Full Local & Open-Weight LLMs tier list

Where DiffusionGemma (Google) ranks vs every competitor in its category

Is DiffusionGemma (Google) down?

Outage check plus rolling log of known issues

DiffusionGemma (Google) pricing

Every tier and what's included

DiffusionGemma (Google) alternatives

Comparable tools at every tier

The Tier List Tuesday

Weekly newsletter: tier movers, new entrants, and the VS of the week. Built from our daily AI-tool sweeps. No spam, unsubscribe anytime.

Alternatives to DiffusionGemma (Google)

Llama 4 (Meta)

Meta's open-weights family -- Scout (10M context), Maverick (multimodal 400B MoE). NOTE: Meta's frontier work moved to the proprietary Muse Spark line in April 2026; Llama remains downloadable and supported but is effectively in maintenance mode

DiffusionGemma (Google)

Score Breakdown

The Good and the Bad

What we like

What could be better

Pricing

Self-hosted (open weights)

System Requirements

Known Issues

Best for

Not for

Our Verdict

Sources

Explore more DiffusionGemma (Google) rankings

The Tier List Tuesday

Alternatives to DiffusionGemma (Google)

Llama 4 (Meta)

Mistral AI

DeepSeek

Gemma 4 (Google)

Qwen (Alibaba)

GLM / Z.ai (Zhipu AI)

Kimi K3 (Moonshot)

Nemotron (Nvidia)

MiniMax M3

Falcon (TII)

gpt-oss (OpenAI)

IBM Granite 4.0

Arcee Trinity-Large-Thinking

Olmo 3 (AI2)

AI21 Jamba2

StepFun Step 3.7 Flash

Cohere Command A

LongCat-2.0 (Meituan)

Inkling (Thinking Machines Lab)

Bonsai 27B (PrismML)