MMLU: 2026 AI Leaderboard

The 57-subject knowledge test that became the default LLM benchmark.

What it tests

MMLU (Massive Multitask Language Understanding) is a multiple-choice exam of 14,042 test questions spanning 57 subjects, from elementary math to professional law. It chiefly measures how much a language model knows, rather than how well it reasons.
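
To make the format concrete, here is a minimal sketch that prints one question in the benchmark's four-choice layout. It assumes the community-hosted cais/mmlu copy on Hugging Face; the field names (question, choices, answer) are that dataset's schema, not something MMLU itself mandates.

```python
# Minimal sketch: inspect one MMLU item, assuming the "cais/mmlu"
# dataset hosted on Hugging Face (pip install datasets).
from datasets import load_dataset

# Each of the 57 subjects is its own config; "all" pools them.
law = load_dataset("cais/mmlu", "professional_law", split="test")

item = law[0]
print(item["question"])                 # question stem
for letter, choice in zip("ABCD", item["choices"]):
    print(f"  {letter}. {choice}")      # the four answer options
print("gold:", "ABCD"[item["answer"]])  # gold label stored as an index 0-3
```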

How it is scored

Models answer four-choice questions in a zero-shot or few-shot setting. The reported score is average accuracy across all subjects. Scores above 85% are considered strong; the original paper estimates expert-level human accuracy at roughly 89.8%, while unspecialized crowdworkers manage only about 34.5%.
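
Note that "average accuracy" can mean two slightly different things: a micro-average that pools all 14,042 questions, or a macro-average of the 57 per-subject accuracies. Harnesses differ, so a careful comparison computes both. A minimal sketch, with hypothetical subjects and predictions:

```python
# Minimal sketch of MMLU scoring; the records below are hypothetical.
from collections import defaultdict

def mmlu_score(records):
    """records: iterable of (subject, predicted_index, gold_index) triples."""
    tally = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, pred, gold in records:
        tally[subject][0] += int(pred == gold)
        tally[subject][1] += 1
    correct = sum(c for c, _ in tally.values())
    total = sum(t for _, t in tally.values())
    micro = correct / total                                     # pool every question
    macro = sum(c / t for c, t in tally.values()) / len(tally)  # mean of subjects
    return micro, macro

micro, macro = mmlu_score([
    ("astronomy", 2, 2),          # correct
    ("astronomy", 0, 3),          # wrong
    ("professional_law", 1, 1),   # correct
])
print(f"micro={micro:.1%}  macro={macro:.1%}")  # micro=66.7%  macro=75.0%
```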

Why it matters

MMLU is the most widely reported LLM benchmark, which makes it the easiest point of apples-to-apples comparison across vendors. Its weakness is saturation: frontier models now cluster in the high 80s and low 90s, so small differences between them are statistical noise. Use it to rule out weak models, not to pick a winner among strong ones.
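
The noise claim is easy to check with a back-of-the-envelope binomial confidence interval over the 14,042-question test split; the two accuracies below are the top two scores from the leaderboard that follows.

```python
# Minimal sketch: normal-approximation 95% CI on benchmark accuracy.
import math

def ci_halfwidth(acc: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% binomial CI for accuracy `acc` over `n` questions."""
    return z * math.sqrt(acc * (1 - acc) / n)

N = 14_042  # size of the MMLU test split
for acc in (0.913, 0.910):
    print(f"{acc:.1%} +/- {ci_halfwidth(acc, N):.2%}")
# Both print "+/- 0.47%", so the intervals overlap: a 0.3-point gap
# between the top two models is indistinguishable from noise.
```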

Leaderboard (10 models)

Sorted by MMLU score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

| # | Model | Tier | MMLU score |
|---|-------|------|------------|
| 1 | Claude Opus 4.7 (Anthropic)* | A | 91.3% |
| 2 | GPT-5.4 (ChatGPT) | A | 91% |
| 3 | DeepSeek V3.2 (DeepSeek) | A | 90.8% |
| 4 | Gemini 3.1 Ultra (Google) | A | 90.5% |
| 5 | Muse Spark (Meta) | A | 89% |
| 6 | Grok 4.20 (Grok) | B | 88.5% |
| 7 | Nemotron 3 Ultra 253B (Nvidia) | B | 88.4% |
| 8 | Mistral Large 3 / Small 4 (Mistral AI) | B | 86% |
| 9 | Gemma 4 31B (Google) | A | 83% |
| 10 | Falcon 3 10B (TII) | B | 73.1% |

*4.6 baseline scores shown; 4.7 announced a 13% coding lift and 3x production task completion.

About MMLU

Creator: Hendrycks et al., 2020 (UC Berkeley)
Unit: % (max 100)