GPQA Diamond: 2026 AI Leaderboard

Graduate-level physics, biology, and chemistry questions written to be "Google-proof."

What it tests

GPQA (Graduate-Level Google-Proof Q&A) Diamond is the hardest subset, 198 questions, of a 448-question multiple-choice set written by domain experts who hold or are pursuing PhDs in physics, biology, and chemistry. Questions are deliberately designed so that searching the web does not yield the answer.

How it is scored

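Four-choice accuracy: the percentage of questions answered correctly, with 25% as the random-guess floor. In the original paper, domain experts score about 65%, while skilled non-experts with unrestricted web access score roughly 34%. Frontier models on this 2026 leaderboard score in the 80s and 90s, well above the expert baseline.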
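
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official harness: the function name four_choice_accuracy, the CHOICES tuple, and the toy answer lists are all hypothetical and do not come from the GPQA release.

```python
from typing import Sequence

# Assumption: answers are encoded as one of four option letters.
CHOICES = ("A", "B", "C", "D")


def four_choice_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of questions answered correctly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Random guessing over four options gives the chance floor (25%).
CHANCE_BASELINE = 100.0 / len(CHOICES)

# Hypothetical toy data, not real GPQA items.
gold = ["A", "C", "B", "D", "A"]
predicted = ["A", "C", "D", "D", "B"]
score = four_choice_accuracy(predicted, gold)  # 3 of 5 answers match
```

A score is only meaningful relative to the 25% floor: a model at 42.5% is much closer to guessing than the raw number suggests.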

Why it matters

GPQA Diamond is among the most cited reasoning benchmarks for frontier LLMs because it is designed to resist memorization and lookup. A high score suggests the model can synthesize knowledge, not just recite training data.

Leaderboard (11 models)

Sorted by GPQA Diamond score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

# | Model | Tier | GPQA Diamond score
1 | ChatGPT: GPT-5.5 (launched 2026-04-23; scores below are the GPT-5.4 baseline -- GPT-5.5 launch benchmarks per OpenAI are logged in Known Issues, pending third-party verification) | A | 92.8%
2 | Claude (Anthropic): Claude Fable 5 (launched 2026-06-09) is now the flagship -- Anthropic positions it as its most capable public model on SWE, knowledge work, and vision, but published no standalone numeric benchmark table at launch; legacy Opus-line reasoning-suite scores shown below as baseline, third-party Fable 5 verification pending | A | 91.3%
3 | Muse Spark (Meta): Muse Spark | A | 86%
4 | Grok: Grok 4.20 | B | 85%
5 | Gemma 4 (Google): Gemma 4 31B | A | 84.3%
6 | DeepSeek: DeepSeek V4-Pro (SWE-bench + Arena Elo third-party verified post-launch; knowledge rows are V3.x baseline pending V4 figures) | A | 79.9%
7 | Qwen (Alibaba): Qwen3.5-397B MoE | A | 78.2%
8 | GLM / Z.ai (Zhipu AI): GLM-5.1 (744B MoE / 40B active) | A | 74.5%
9 | Nemotron (Nvidia): Nemotron 3 Ultra (253B) | B | 70.5%
10 | Llama 4 (Meta): Llama 4 Maverick (17B/400B MoE) | B | 69.8%
11 | Falcon (TII): Falcon 3 10B | B | 42.5%

About GPQA Diamond

Creator: Rein et al., 2023 (NYU/Cohere/Anthropic)
Unit: % (max 100)