GPQA Diamond: 2026 AI Leaderboard

Graduate-level physics, biology, and chemistry questions written to defeat Google search.

What it tests

GPQA (Graduate-Level Google-Proof Q&A) Diamond is the hardest subset (198 questions) of a 448-question multiple-choice benchmark written by PhDs in physics, biology, and chemistry. Questions are deliberately constructed so that searching the web does not readily yield the answer.
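
For readers who want to inspect the data themselves, here is a minimal sketch of turning the Diamond split into shuffled four-option questions. The Hugging Face dataset id ("Idavidrein/gpqa", config "gpqa_diamond") and the column names are assumptions based on the original Rein et al. release, not something this leaderboard specifies; the dataset is also gated, so loading it may require accepting its terms first.

```python
# Sketch: load the GPQA Diamond split and build shuffled four-option questions.
# Assumes the public Hugging Face release ("Idavidrein/gpqa", "gpqa_diamond")
# and its column names; adjust if the actual layout differs.
import random
from datasets import load_dataset

ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
rng = random.Random(0)  # fixed seed so option order is reproducible

def as_multiple_choice(row):
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    rng.shuffle(options)
    return {
        "question": row["Question"],
        "options": options,
        "answer_index": options.index(row["Correct Answer"]),
    }

items = [as_multiple_choice(r) for r in ds]
print(len(items), "Diamond questions loaded")
```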

How it is scored

Plain four-choice accuracy. Domain PhDs with unlimited internet access score about 65%; non-expert humans with search manage roughly 34%. Frontier models in 2026 are scoring in the 80s and 90s -- a major inflection.
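
Because scoring is exact-match over four options, it reduces to a few lines. The question ids and letter labels below are purely illustrative, not drawn from the leaderboard.

```python
# Minimal sketch of four-choice accuracy as used for GPQA Diamond.
# `predictions` maps question id -> chosen letter, `gold` maps id -> correct letter;
# both dictionaries are hypothetical, for illustration only.
def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    graded = [predictions.get(qid, "").strip().upper() == answer
              for qid, answer in gold.items()]
    return sum(graded) / len(graded)

gold = {"q1": "A", "q2": "C", "q3": "B"}
predictions = {"q1": "A", "q2": "B", "q3": "B"}
print(f"{accuracy(predictions, gold):.1%}")  # 66.7%
```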

Why it matters

GPQA Diamond is the most cited reasoning benchmark for frontier LLMs precisely because it resists memorization. A high score implies the model can synthesize knowledge, not just recite training data.

Leaderboard (14 models)

Sorted by GPQA Diamond score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

#  | Model (vendor)         | Variant                            | Tier | GPQA Diamond score
1  | Gemini (Google)        | Gemini 3.1 Ultra                   | A    | 94.3%
2  | ChatGPT                | GPT-5.4                            | A    | 92.8%
3  | Claude (Anthropic)     | Claude Opus 4.7 *                  | A    | 91.3%
4  | Muse Spark (Meta)      | Muse Spark                         | A    | 86%
5  | Grok                   | Grok 4.20                          | B    | 85%
6  | Gemma 4 (Google)       | Gemma 4 31B                        | A    | 84.3%
7  | Kimi K2.5 (Moonshot)   | Kimi K2.5 (1T/32B active MoE)      | A    | 80.5%
8  | DeepSeek               | DeepSeek V3.2                      | A    | 79.9%
9  | Qwen (Alibaba)         | Qwen3.5-397B MoE                   | A    | 78.2%
10 | MiniMax M2 / M2.5      | MiniMax M2.5 (230B/10B active MoE) | A    | 76.8%
11 | GLM / Z.ai (Zhipu AI)  | GLM-5.1 (744B MoE / 40B active)    | A    | 74.5%
12 | Nemotron (Nvidia)      | Nemotron 3 Ultra (253B)            | B    | 70.5%
13 | Llama 4 (Meta)         | Llama 4 Maverick (17B/400B MoE)    | B    | 69.8%
14 | Falcon (TII)           | Falcon 3 10B                       | B    | 42.5%

* Opus 4.6 baseline score shown; 4.7 announced a 13% coding lift and 3x production task completion.

About GPQA Diamond

Creator: Rein et al., 2023 (NYU/Cohere/Anthropic)
Unit: % (max 100)
