Humanity's Last Exam: 2026 AI Leaderboard
3,000 questions written by domain experts that still stump frontier models.
What it tests
Humanity's Last Exam (HLE) is a 3,000-question benchmark crowdsourced from thousands of subject-matter experts specifically to find questions that current frontier models cannot yet answer. Coverage spans math, physics, computer science, humanities, and specialist domains.
How it is scored
Exact-match accuracy with expert-graded partial credit. As of early 2026, most top models still sit in the 20-45% range, with only the leaders below breaking 50%. Either way, scores run dramatically lower than on MMLU or GPQA.
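To make the scoring rule concrete, here is a minimal Python sketch. The record layout and the `credit` field (1.0 for an exact match, a grader-assigned fraction for partial credit, 0.0 otherwise) are illustrative assumptions, not the official grading pipeline:

```python
# Minimal sketch of HLE-style scoring. The `credit` field and record
# layout are illustrative assumptions, not the official schema.

def hle_score(graded_responses: list[dict]) -> float:
    """Mean credit across all questions, reported as a percentage."""
    if not graded_responses:
        return 0.0
    total_credit = sum(r.get("credit", 0.0) for r in graded_responses)
    return 100.0 * total_credit / len(graded_responses)

# Two exact matches, one half-credit answer, one miss -> 62.5%
responses = [
    {"question_id": "q1", "credit": 1.0},
    {"question_id": "q2", "credit": 1.0},
    {"question_id": "q3", "credit": 0.5},
    {"question_id": "q4", "credit": 0.0},
]
print(f"HLE score: {hle_score(responses):.1f}%")
```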
Why it matters
HLE is the newest 'unsaturated' frontier-reasoning benchmark, making it one of the few tests that still separate the top five models. Track this score rather than MMLU when comparing bleeding-edge LLMs.
Leaderboard (2 models)
Sorted by Humanity's Last Exam score. The Tier column shows each tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability; a sketch of one possible blend follows the table.
| # | Model | Tier | Humanity's Last Exam score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | Muse Spark (Meta) | A | 58% | HLE | 8.8/10 |
| 2 | Grok 4.20 (xAI) | B | 50.7% | HLE | 7.5/10 |
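For readers curious how a single benchmark number turns into a letter tier, the sketch below shows one plausible weighted blend of an HLE score with pricing, feature, and usability ratings. The weights, inputs, and cutoffs are assumptions for illustration; they are not AIToolTier's actual formula.

```python
# Hypothetical blend of an HLE score with pricing, feature, and
# usability ratings into a 0-10 overall score and a letter tier.
# Weights and cutoffs are assumptions for illustration only.

TIER_CUTOFFS = [(8.5, "A"), (7.0, "B"), (5.0, "C")]

def overall_score(hle_pct: float, pricing: float,
                  features: float, usability: float) -> float:
    """Weighted average on a 0-10 scale; HLE enters as hle_pct / 10."""
    return round(0.4 * (hle_pct / 10) + 0.2 * pricing
                 + 0.2 * features + 0.2 * usability, 1)

def tier(score: float) -> str:
    for cutoff, label in TIER_CUTOFFS:
        if score >= cutoff:
            return label
    return "D"

score = overall_score(hle_pct=58.0, pricing=9.0, features=9.5, usability=9.2)
print(score, tier(score))  # 7.9 B under these made-up weights
```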
About Humanity's Last Exam
- Creator: Scale AI & Center for AI Safety, 2024
- Unit: % (max 100)
- Official source: https://lastexam.ai/