MMLU: 2026 AI Leaderboard

The 57-subject knowledge test that became the default LLM benchmark.

What it tests

MMLU (Massive Multitask Language Understanding) is a multiple-choice exam of 14,042 test questions spanning 57 subjects, from elementary math to professional law. It chiefly measures how much a language model knows, rather than how well it reasons.
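
To make the format concrete, here is a minimal sketch that prints one question in the benchmark's four-choice layout. It assumes the community-hosted cais/mmlu copy on Hugging Face; the field names (question, choices, answer) are that dataset's schema, not something MMLU itself mandates.

```python
# Minimal sketch: inspect one MMLU item, assuming the "cais/mmlu"
# dataset hosted on Hugging Face (pip install datasets).
from datasets import load_dataset

# Each of the 57 subjects is its own config; "all" pools them.
law = load_dataset("cais/mmlu", "professional_law", split="test")

item = law[0]
print(item["question"])                 # question stem
for letter, choice in zip("ABCD", item["choices"]):
    print(f"  {letter}. {choice}")      # the four answer options
print("gold:", "ABCD"[item["answer"]])  # gold label stored as an index 0-3
```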

How it is scored

Models answer four-choice questions in a zero-shot or few-shot setting. The reported score is average accuracy across all subjects. Scores above 85% are considered strong; the original paper estimates expert-level human accuracy at roughly 89.8%, while unspecialized crowdworkers manage only about 34.5%.
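
Note that "average accuracy" can mean two slightly different things: a micro-average that pools all 14,042 questions, or a macro-average of the 57 per-subject accuracies. Harnesses differ, so a careful comparison computes both. A minimal sketch, with hypothetical subjects and predictions:

```python
# Minimal sketch of MMLU scoring; the records below are hypothetical.
from collections import defaultdict

def mmlu_score(records):
    """records: iterable of (subject, predicted_index, gold_index) triples."""
    tally = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, pred, gold in records:
        tally[subject][0] += int(pred == gold)
        tally[subject][1] += 1
    correct = sum(c for c, _ in tally.values())
    total = sum(t for _, t in tally.values())
    micro = correct / total                                     # pool every question
    macro = sum(c / t for c, t in tally.values()) / len(tally)  # mean of subjects
    return micro, macro

micro, macro = mmlu_score([
    ("astronomy", 2, 2),          # correct
    ("astronomy", 0, 3),          # wrong
    ("professional_law", 1, 1),   # correct
])
print(f"micro={micro:.1%}  macro={macro:.1%}")  # micro=66.7%  macro=75.0%
```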

Why it matters

MMLU is the most widely reported LLM benchmark, which makes it the easiest point of apples-to-apples comparison across vendors. Its weakness is saturation: frontier models now cluster in the high 80s and low 90s, so small differences between them are statistical noise. Use it to rule out weak models, not to pick a winner among strong ones.
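
The noise claim is easy to check with a back-of-the-envelope binomial confidence interval over the 14,042-question test split; the two accuracies below are the top two scores from the leaderboard that follows.

```python
# Minimal sketch: normal-approximation 95% CI on benchmark accuracy.
import math

def ci_halfwidth(acc: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% binomial CI for accuracy `acc` over `n` questions."""
    return z * math.sqrt(acc * (1 - acc) / n)

N = 14_042  # size of the MMLU test split
for acc in (0.913, 0.910):
    print(f"{acc:.1%} +/- {ci_halfwidth(acc, N):.2%}")
# Both print "+/- 0.47%", so the intervals overlap: a 0.3-point gap
# between the top two models is indistinguishable from noise.
```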

Leaderboard (10 models)

Sorted by MMLU score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

| # | Model | Tier | MMLU score |
|---|-------|------|------------|
| 1 | Claude Opus 4.7 (Anthropic)* | A | 91.3% |
| 2 | GPT-5.4 (ChatGPT) | A | 91% |
| 3 | DeepSeek V3.2 (DeepSeek) | A | 90.8% |
| 4 | Gemini 3.1 Ultra (Google) | A | 90.5% |
| 5 | Muse Spark (Meta) | A | 89% |
| 6 | Grok 4.20 (Grok) | B | 88.5% |
| 7 | Nemotron 3 Ultra 253B (Nvidia) | B | 88.4% |
| 8 | Mistral Large 3 / Small 4 (Mistral AI) | B | 86% |
| 9 | Gemma 4 31B (Google) | A | 83% |
| 10 | Falcon 3 10B (TII) | B | 73.1% |

*4.6 baseline scores shown; 4.7 announced a 13% coding lift and 3x production task completion.

About MMLU

Creator: Hendrycks et al., 2020 (UC Berkeley)
Unit: % (max 100)