
HumanEval: 2026 AI Leaderboard

164 Python programming problems: does the generated code pass unit tests?

What it tests

HumanEval is a set of 164 handwritten Python programming problems, each paired with hidden unit tests. The model sees the function signature plus docstring and must generate a body that passes every test.
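To make the task format concrete, here is a minimal sketch of a problem written in the HumanEval style: the prompt is just the signature and docstring, a candidate completion fills in the body, and assert-based tests (hidden from the model) decide pass or fail. The specific problem, completion, and tests below are illustrative, not quoted from the dataset.

```python
from typing import List


# Prompt portion: the model receives only the signature and docstring
# and must generate the function body.
def rolling_max(numbers: List[int]) -> List[int]:
    """From a given list of integers, generate a list of the
    rolling maximum element found up to that point in the sequence.
    >>> rolling_max([1, 2, 3, 2, 3, 4, 2])
    [1, 2, 3, 3, 3, 4, 4]
    """
    # A correct completion the model might produce:
    result: List[int] = []
    current_max = float("-inf")
    for n in numbers:
        current_max = max(current_max, n)
        result.append(current_max)
    return result


# Hidden unit tests then exercise the completed function; every assert must pass.
def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 2, 3, 2, 3, 4, 2]) == [1, 2, 3, 3, 3, 4, 4]
    assert candidate([4, 3, 2, 1]) == [4, 4, 4, 4]


check(rolling_max)
```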

How it is scored

pass@1 -- the percentage of problems solved on the first attempt. Frontier models in 2026 sit in the 94-99% range, so this benchmark is effectively saturated for top-tier LLMs.
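For context, the pass@1 number is the k=1 case of the unbiased pass@k estimator introduced with the benchmark, computed from n sampled completions per problem of which c pass all tests. A minimal sketch is below; the function name and the numpy dependency are choices of this sketch, not an official API.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total completions sampled for the problem
    c: completions that pass all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer failures than the budget: some passing sample is guaranteed.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With a single sample per problem (n = 1, k = 1) this reduces to the plain fraction of problems solved on the first attempt; the reported benchmark score is that value averaged over the 164 problems.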

Why it matters

Still useful as a floor-check: any serious coding model should clear 90% here. For real-world discrimination, SWE-bench Verified and LiveCodeBench are the benchmarks that still separate the field.

Leaderboard (15 models)

Sorted by HumanEval score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

# | Tool | Model | Tier | HumanEval score
1 | Codex (OpenAI) | GPT-5.3-Codex | A | 95%
2 | ChatGPT | GPT-5.4 | A | 95%
3 | Claude (Anthropic) | Claude Opus 4.7 (4.6 baseline score shown; 4.7 announced a 13% coding lift and 3x production task completion) | A | 94%
4 | Gemini (Google) | Gemini 3.1 Ultra | A | 93.5%
5 | Qwen (Alibaba) | Qwen3.5-397B MoE | A | 92.5%
6 | Mistral AI | Mistral Large 3 / Small 4 | B | 92%
7 | DeepSeek | DeepSeek V3.2 | A | 91.5%
8 | Muse Spark (Meta) | Muse Spark | A | 91%
9 | MiniMax M2 / M2.5 | MiniMax M2.5 (230B/10B active MoE) | A | 91%
10 | Grok | Grok 4.20 | B | 90%
11 | Nemotron (Nvidia) | Nemotron 3 Ultra (253B) | B | 89.6%
12 | GLM / Z.ai (Zhipu AI) | GLM-5.1 (744B MoE / 40B active) | A | 89.1%
13 | Llama 4 (Meta) | Llama 4 Maverick (17B/400B MoE) | B | 88%
14 | Gemma 4 (Google) | Gemma 4 31B | A | 85%
15 | Falcon (TII) | Falcon 3 10B | B | 73.8%

About HumanEval

Creator: OpenAI, 2021
Unit: % (max 100)
