Humanity's Last Exam: 2026 AI Leaderboard

3,000 questions written by domain experts to stump today's frontier models.

What it tests

Humanity's Last Exam (HLE) is a 3,000-question benchmark crowdsourced from thousands of subject-matter experts specifically to find questions that current frontier models cannot yet answer. Coverage spans math, physics, computer science, humanities, and specialist domains.
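Concretely, each item can be thought of as an expert-written prompt paired with a gold answer and a domain tag. A minimal sketch of such a record follows; the field names are illustrative assumptions, not HLE's official schema.

```python
from dataclasses import dataclass

@dataclass
class HLEItem:
    """One benchmark question (field names are illustrative, not the official schema)."""
    question_id: str
    domain: str   # e.g. "math", "physics", "computer science", "humanities"
    prompt: str   # the expert-written question
    answer: str   # gold answer used for grading
    author: str   # contributing subject-matter expert
```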

How it is scored

Exact-match accuracy with expert-graded partial credit. As of early 2026, top models score in the 20-45% range, dramatically lower than on MMLU or GPQA.
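A minimal sketch of this grading rule, assuming each response earns 1.0 for an exact string match and an expert-assigned credit in [0, 1] otherwise. The function and field names (`prediction`, `gold`, `partial_credit`) are illustrative, not the official grading code.

```python
def hle_score(responses: list[dict]) -> float:
    """Mean score in % over all graded responses.
    Exact matches earn 1.0; anything else earns an
    expert-assigned partial credit in [0, 1].
    Illustrative only, not the official grader."""
    if not responses:
        return 0.0
    total = 0.0
    for r in responses:
        if r["prediction"].strip() == r["gold"].strip():
            total += 1.0  # exact match
        else:
            total += r.get("partial_credit", 0.0)  # expert-graded fallback
    return 100.0 * total / len(responses)  # report as % (max 100)
```

For example, under this rule a model that matches 55 of 100 items exactly and earns 3.0 points of partial credit elsewhere scores 58%.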

Why it matters

HLE is the newest 'unsaturated' frontier-reasoning benchmark, which makes it one of the few tests that still separates the top 5 models. Track HLE rather than MMLU when comparing bleeding-edge LLMs.

Leaderboard (2 models)

Sorted by Humanity's Last Exam score. The Tier column shows each tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability (a sketch of such a blend follows the table).

#   Model               Tier   Humanity's Last Exam score
1   Muse Spark (Meta)   A      58%
2   Grok 4.20           B      50.7%
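For intuition, a composite tier rank like this is typically a weighted blend of normalized inputs. The sketch below shows one plausible version; the weights, inputs, and tier cutoffs are assumptions, not AIToolTier's published formula.

```python
def composite_score(hle: float, pricing: float,
                    features: float, usability: float) -> float:
    """Blend four 0-100 inputs into one 0-100 composite.
    Weights are illustrative assumptions."""
    weights = {"hle": 0.4, "pricing": 0.2, "features": 0.2, "usability": 0.2}
    return (weights["hle"] * hle
            + weights["pricing"] * pricing
            + weights["features"] * features
            + weights["usability"] * usability)

def tier(score: float) -> str:
    """Map a composite score to a letter tier (cutoffs illustrative)."""
    return "A" if score >= 80 else "B" if score >= 60 else "C"
```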

About Humanity's Last Exam

Creator: Scale AI & Center for AI Safety (2024)
Unit: % (max 100)
Official source: https://lastexam.ai/
