MMLU-Pro: 2026 AI Leaderboard
MMLU's harder successor: 10 answer choices and more reasoning.
What it tests
MMLU-Pro is a successor to MMLU that expands the answer set to 10 choices per question (up from 4) and shifts the question mix toward multi-step reasoning rather than pure recall.
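One consequence of the larger answer set is a lower random-guess floor. The sketch below works through that arithmetic. It assumes every question carries the full number of options, which may not hold for every item, and `chance_accuracy` is a hypothetical helper name, not part of any benchmark tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy (in %) of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 / num_choices


# Assumption: all questions have the full option count.
mmlu_chance = chance_accuracy(4)       # 4-choice format (base MMLU)
mmlu_pro_chance = chance_accuracy(10)  # 10-choice format (MMLU-Pro)

print(f"MMLU guess floor:     {mmlu_chance:.1f}%")      # 25.0%
print(f"MMLU-Pro guess floor: {mmlu_pro_chance:.1f}%")  # 10.0%
```

Under that assumption, a model that guesses blindly drops from 25% to 10%, so lucky guesses contribute less to the reported score.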
How it is scored
MMLU-Pro uses the same accuracy metric as MMLU (the percentage of questions answered correctly), applied to the harder, reformulated question bank. Frontier models score roughly 10-20 percentage points lower here than on base MMLU.
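As a rough illustration of the accuracy metric, the sketch below compares predicted option letters against an answer key. The `accuracy` function and the toy data are hypothetical, and the answer-extraction step that a real evaluation harness performs on model output is not shown.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: one option letter (A-J) per question.
gold = ["C", "J", "A", "F"]
pred = ["C", "B", "A", "F"]

score = accuracy(pred, gold)
print(f"{score:.1f}%")  # 75.0%
```

Each question counts equally here; whether a published score is averaged over questions or over subject categories depends on the evaluation setup, so treat this as one plausible reading.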
Why it matters
Base MMLU has saturated, so it no longer separates the strongest models. MMLU-Pro still has headroom, which makes it a better discriminator for top-tier models in 2026.
Leaderboard (7 models)
Sorted by MMLU-Pro score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Tool | Model | Tier | MMLU-Pro score | Variant | Overall |
|---|---|---|---|---|---|---|
| 1 | DeepSeek | DeepSeek V3.2 | A | 85% | MMLU-Pro | 8.0/10 |
| 2 | Kimi K2.5 (Moonshot) | Kimi K2.5 (1T/32B active MoE) | A | 84.8% | MMLU-Pro | 8.1/10 |
| 3 | Qwen (Alibaba) | Qwen3.5-397B MoE | A | 83.5% | MMLU-Pro | 8.8/10 |
| 4 | MiniMax M2 / M2.5 | MiniMax M2.5 (230B/10B active MoE) | A | 82.1% | MMLU-Pro | 8.4/10 |
| 5 | GLM / Z.ai (Zhipu AI) | GLM-5.1 (744B MoE / 40B active) | A | 81.2% | MMLU-Pro | 8.0/10 |
| 6 | Llama 4 (Meta) | Llama 4 Maverick (17B/400B MoE) | B | 80.5% | MMLU-Pro | 7.9/10 |
| 7 | Nemotron (Nvidia) | Nemotron 3 Ultra (253B) | B | 79.8% | MMLU-Pro | 7.8/10 |
About MMLU-Pro
- Creator: TIGER-Lab, 2024
- Unit: % (max 100)
- Official source: https://arxiv.org/abs/2406.01574