
HumanEval: 2026 AI Leaderboard

164 Python programming problems: does the generated code pass unit tests?

What it tests

HumanEval is a set of 164 handwritten Python programming problems, each paired with hidden unit tests. The model sees the function signature plus docstring and must generate a body that passes every test.
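To make the task format concrete, here is a minimal sketch of a problem written in the HumanEval style: the prompt is just the signature and docstring, a candidate completion fills in the body, and assert-based tests (hidden from the model) decide pass or fail. The specific problem, completion, and tests below are illustrative, not quoted from the dataset.

```python
from typing import List


# Prompt portion: the model receives only the signature and docstring
# and must generate the function body.
def rolling_max(numbers: List[int]) -> List[int]:
    """From a given list of integers, generate a list of the
    rolling maximum element found up to that point in the sequence.
    >>> rolling_max([1, 2, 3, 2, 3, 4, 2])
    [1, 2, 3, 3, 3, 4, 4]
    """
    # A correct completion the model might produce:
    result: List[int] = []
    current_max = float("-inf")
    for n in numbers:
        current_max = max(current_max, n)
        result.append(current_max)
    return result


# Hidden unit tests then exercise the completed function; every assert must pass.
def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 2, 3, 2, 3, 4, 2]) == [1, 2, 3, 3, 3, 4, 4]
    assert candidate([4, 3, 2, 1]) == [4, 4, 4, 4]


check(rolling_max)
```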

How it is scored

pass@1 -- the percentage of problems solved on the first attempt. Frontier models in 2026 sit in the 94-99% range, so this benchmark is effectively saturated for top-tier LLMs.
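For context, the pass@1 number is the k=1 case of the unbiased pass@k estimator introduced with the benchmark, computed from n sampled completions per problem of which c pass all tests. A minimal sketch is below; the function name and the numpy dependency are choices of this sketch, not an official API.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total completions sampled for the problem
    c: completions that pass all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer failures than the budget: some passing sample is guaranteed.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With a single sample per problem (n = 1, k = 1) this reduces to the plain fraction of problems solved on the first attempt; the reported benchmark score is that value averaged over the 164 problems.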

Why it matters

Still useful as a floor-check: any serious coding model should clear 90% here. For real-world discrimination, SWE-bench Verified and LiveCodeBench are the benchmarks that still separate the field.

Leaderboard (15 models)

Sorted by HumanEval score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

# | Tool | Model | Tier | HumanEval score
1 | Codex (OpenAI) | GPT-5.3-Codex | A | 95%
2 | ChatGPT | GPT-5.4 | A | 95%
3 | Claude (Anthropic) | Claude Opus 4.7 (4.6 baseline score shown; 4.7 announced a 13% coding lift and 3x production task completion) | A | 94%
4 | Gemini (Google) | Gemini 3.1 Ultra | A | 93.5%
5 | Qwen (Alibaba) | Qwen3.5-397B MoE | A | 92.5%
6 | Mistral AI | Mistral Large 3 / Small 4 | B | 92%
7 | DeepSeek | DeepSeek V3.2 | A | 91.5%
8 | Muse Spark (Meta) | Muse Spark | A | 91%
9 | MiniMax M2 / M2.5 | MiniMax M2.5 (230B/10B active MoE) | A | 91%
10 | Grok | Grok 4.20 | B | 90%
11 | Nemotron (Nvidia) | Nemotron 3 Ultra (253B) | B | 89.6%
12 | GLM / Z.ai (Zhipu AI) | GLM-5.1 (744B MoE / 40B active) | A | 89.1%
13 | Llama 4 (Meta) | Llama 4 Maverick (17B/400B MoE) | B | 88%
14 | Gemma 4 (Google) | Gemma 4 31B | A | 85%
15 | Falcon (TII) | Falcon 3 10B | B | 73.8%

About HumanEval

Creator: OpenAI, 2021
Unit: % (max 100)
