Humanity's Last Exam: 2026 AI Leaderboard

3,000 questions written by domain experts to stump today's frontier models.

What it tests

Humanity's Last Exam (HLE) is a 3,000-question benchmark crowdsourced from thousands of subject-matter experts specifically to find questions that current frontier models cannot yet answer. Coverage spans math, physics, computer science, humanities, and specialist domains.
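Concretely, each item can be thought of as an expert-written prompt paired with a gold answer and a domain tag. A minimal sketch of such a record follows; the field names are illustrative assumptions, not HLE's official schema.

```python
from dataclasses import dataclass

@dataclass
class HLEItem:
    """One benchmark question (field names are illustrative, not the official schema)."""
    question_id: str
    domain: str   # e.g. "math", "physics", "computer science", "humanities"
    prompt: str   # the expert-written question
    answer: str   # gold answer used for grading
    author: str   # contributing subject-matter expert
```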

How it is scored

Exact-match accuracy with expert-graded partial credit. As of early 2026, top models score in the 20-45% range, dramatically lower than on MMLU or GPQA.
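A minimal sketch of this grading rule, assuming each response earns 1.0 for an exact string match and an expert-assigned credit in [0, 1] otherwise. The function and field names (`prediction`, `gold`, `partial_credit`) are illustrative, not the official grading code.

```python
def hle_score(responses: list[dict]) -> float:
    """Mean score in % over all graded responses.
    Exact matches earn 1.0; anything else earns an
    expert-assigned partial credit in [0, 1].
    Illustrative only, not the official grader."""
    if not responses:
        return 0.0
    total = 0.0
    for r in responses:
        if r["prediction"].strip() == r["gold"].strip():
            total += 1.0  # exact match
        else:
            total += r.get("partial_credit", 0.0)  # expert-graded fallback
    return 100.0 * total / len(responses)  # report as % (max 100)
```

For example, under this rule a model that matches 55 of 100 items exactly and earns 3.0 points of partial credit elsewhere scores 58%.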

Why it matters

HLE is the newest 'unsaturated' frontier-reasoning benchmark, which makes it one of the few tests that still separates the top 5 models. Track HLE rather than MMLU when comparing bleeding-edge LLMs.

Leaderboard (2 models)

Sorted by Humanity's Last Exam score. The Tier column shows each tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability (a sketch of such a blend follows the table).

#   Model               Tier   Humanity's Last Exam score
1   Muse Spark (Meta)   A      58%
2   Grok 4.20           B      50.7%
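For intuition, a composite tier rank like this is typically a weighted blend of normalized inputs. The sketch below shows one plausible version; the weights, inputs, and tier cutoffs are assumptions, not AIToolTier's published formula.

```python
def composite_score(hle: float, pricing: float,
                    features: float, usability: float) -> float:
    """Blend four 0-100 inputs into one 0-100 composite.
    Weights are illustrative assumptions."""
    weights = {"hle": 0.4, "pricing": 0.2, "features": 0.2, "usability": 0.2}
    return (weights["hle"] * hle
            + weights["pricing"] * pricing
            + weights["features"] * features
            + weights["usability"] * usability)

def tier(score: float) -> str:
    """Map a composite score to a letter tier (cutoffs illustrative)."""
    return "A" if score >= 80 else "B" if score >= 60 else "C"
```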

About Humanity's Last Exam

Creator: Scale AI & Center for AI Safety (2024)
Unit: % (max 100)
Official source: https://lastexam.ai/
