MMLU: 2026 AI Leaderboard

The 57-subject knowledge test that became the default LLM benchmark.

What it tests

MMLU (Massive Multitask Language Understanding) is a 14,000-question multiple-choice exam spanning 57 subjects from elementary math to professional law. It measures how much a language model actually knows, not how well it reasons.

How it is scored

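Models answer four-choice questions in a zero-shot or few-shot setting. The reported score is average accuracy across all subjects. Scores above 85% are considered strong. For reference, the benchmark's authors estimate expert-level human accuracy at roughly 89.8%; non-specialist test-takers score far lower (about 34.5% in the original paper).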
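The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not an official harness: the record layout, the function name mmlu_accuracy, and the sample data are assumptions made for the example. Because "average accuracy across all subjects" can mean either the mean of per-subject accuracies or accuracy over all questions pooled together, the sketch returns both.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice answers.

    records: iterable of (subject, predicted_choice, correct_choice) tuples.
    This layout is hypothetical, chosen only for the illustration.
    Returns (per_subject, macro_average, micro_average).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)

    # Accuracy within each subject.
    per_subject = {s: correct[s] / total[s] for s in total}
    # Macro average: every subject counts equally, regardless of size.
    macro = sum(per_subject.values()) / len(per_subject)
    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(total.values())
    return per_subject, macro, micro


# Made-up answers for two subjects (not real benchmark data).
records = [
    ("elementary_mathematics", "B", "B"),
    ("elementary_mathematics", "C", "A"),
    ("professional_law", "D", "D"),
    ("professional_law", "A", "A"),
    ("professional_law", "B", "B"),
    ("professional_law", "C", "D"),
]
per_subject, macro, micro = mmlu_accuracy(records)
print(per_subject, macro, micro)
```

The two averages diverge whenever subjects differ in size, which is one reason scores reported by different sources are not always directly comparable.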

Why it matters

MMLU is the most widely reported LLM benchmark, which makes it the easiest point of apples-to-apples comparison across vendors. Its weakness is saturation: frontier models now cluster in the upper 80s and low 90s, so small differences between them are within statistical noise. Use it to rule out weak models, not to pick a winner among strong ones.
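A rough way to see why small gaps are noise is to treat each score as a binomial proportion over about 14,000 questions and compute its sampling margin. The sketch below uses a normal approximation and treats the two models' errors as independent; both are simplifying assumptions, and the helper names are made up for the example. It also ignores differences in prompting and evaluation setup, which add further uncertainty.

```python
import math

N_QUESTIONS = 14_000  # approximate question count stated above
Z_95 = 1.96           # normal quantile for a two-sided 95% interval


def accuracy_margin(accuracy, n=N_QUESTIONS, z=Z_95):
    """Half-width of a normal-approximation interval for one accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_is_distinguishable(acc_a, acc_b, n=N_QUESTIONS, z=Z_95):
    """True if the gap exceeds the 95% margin of the difference.

    Assumes the two scores are independent samples, which is a
    simplification for models graded on the same questions.
    """
    se_diff = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return abs(acc_a - acc_b) > z * se_diff


margin = accuracy_margin(0.91)
print(f"margin at 91%: +/- {margin * 100:.2f} points")

# A 0.3-point gap between two top scores versus a gap of about 18 points.
print(gap_is_distinguishable(0.913, 0.910))
print(gap_is_distinguishable(0.913, 0.731))
```

Under these assumptions the margin on a single score near 91% is about half a percentage point, so a gap of a few tenths of a point between two strong models does not support ranking one above the other, while a gap of many points does.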

Leaderboard (9 models)

Sorted by MMLU score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

| # | Model | Version and notes | Tier | MMLU score |
|---|-------|-------------------|------|------------|
| 1 | Claude (Anthropic) | Claude Fable 5 (launched 2026-06-09) is now the flagship. Anthropic positions it as its most capable public model on SWE, knowledge work, and vision, but published no standalone numeric benchmark table at launch; legacy Opus-line reasoning-suite scores shown as baseline, third-party Fable 5 verification pending | A | 91.3% |
| 2 | ChatGPT | GPT-5.5 (launched 2026-04-23; scores are the GPT-5.4 baseline; GPT-5.5 launch benchmarks per OpenAI are logged in Known Issues, pending third-party verification) | A | 91% |
| 3 | DeepSeek | DeepSeek V4-Pro (SWE-bench + Arena Elo third-party verified post-launch; knowledge rows are V3.x baseline pending V4 figures) | A | 90.8% |
| 4 | Muse Spark (Meta) | Muse Spark | A | 89% |
| 5 | Grok | Grok 4.20 | B | 88.5% |
| 6 | Nemotron (Nvidia) | Nemotron 3 Ultra (253B) | B | 88.4% |
| 7 | Mistral AI | Mistral Medium 3.5 (vendor-published; third-party verification pending) | B | 86% |
| 8 | Gemma 4 (Google) | Gemma 4 31B | A | 83% |
| 9 | Falcon (TII) | Falcon 3 10B | B | 73.1% |

About MMLU

Creator: Hendrycks et al., 2020 (UC Berkeley)
Unit: % (max 100)