HumanEval: 2026 AI Leaderboard
164 Python programming problems: does the generated code pass unit tests?
What it tests
HumanEval is 164 handwritten Python programming problems with hidden unit tests. A model sees the function signature plus docstring and must generate a body that passes every test.
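To make the setup concrete, here is an invented HumanEval-style task (not an actual benchmark problem): the model sees only the signature and docstring, then the harness runs a hidden check function against the generated body.

```python
# Illustrative HumanEval-style task (invented for this article,
# not drawn from the real benchmark).

def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case."""
    # --- model-generated body starts here ---
    t = s.lower()
    return t == t[::-1]

# Hidden unit test, executed by the harness after generation.
def check(candidate):
    assert candidate("Level") is True
    assert candidate("python") is False
    assert candidate("") is True

check(is_palindrome)
```

If any assertion in `check` fails, the problem counts as unsolved for that sample.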
How it is scored
pass@1: the percentage of problems solved on the first attempt. Frontier models in 2026 sit in the 94-99% range, so this benchmark is effectively saturated for top-tier LLMs.
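Reported pass@k numbers come from the unbiased estimator in the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
    n = samples generated per problem, c = samples that passed, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod

# With a single sample per problem, pass@1 reduces to the raw pass rate:
print(pass_at_k(1, 1, 1))               # 1.0
print(round(pass_at_k(10, 4, 1), 4))    # 0.4
```

Averaging this value over all 164 problems gives the benchmark score; a 95% entry on the leaderboard means roughly 156 of 164 problems solved on the first try.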
Why it matters
Still useful as a floor-check: any serious coding model should clear 90% here. For real-world discrimination, SWE-bench Verified and LiveCodeBench are the benchmarks that still separate the field.
Leaderboard (15 models)
Sorted by HumanEval score. Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Model | Tier | HumanEval score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | Codex (OpenAI) GPT-5.3-Codex | A | 95% | HumanEval | 8.3/10 |
| 2 | ChatGPT GPT-5.4 | A | 95% | HumanEval | 8.8/10 |
| 3 | Claude (Anthropic) Claude Opus 4.7 (4.6 baseline score shown; 4.7 announced a 13% coding lift and 3x production task completion) | A | 94% | HumanEval | 8.5/10 |
| 4 | Gemini (Google) Gemini 3.1 Ultra | A | 93.5% | HumanEval | 8.3/10 |
| 5 | Qwen (Alibaba) Qwen3.5-397B MoE | A | 92.5% | HumanEval | 8.8/10 |
| 6 | Mistral AI Mistral Large 3 / Small 4 | B | 92% | HumanEval | 7.5/10 |
| 7 | DeepSeek DeepSeek V3.2 | A | 91.5% | HumanEval | 8.0/10 |
| 8 | Muse Spark (Meta) Muse Spark | A | 91% | HumanEval | 8.8/10 |
| 9 | MiniMax M2 / M2.5 MiniMax M2.5 (230B/10B active MoE) | A | 91% | HumanEval | 8.4/10 |
| 10 | Grok Grok 4.20 | B | 90% | HumanEval | 7.5/10 |
| 11 | Nemotron (Nvidia) Nemotron 3 Ultra (253B) | B | 89.6% | HumanEval | 7.8/10 |
| 12 | GLM / Z.ai (Zhipu AI) GLM-5.1 (744B MoE / 40B active) | A | 89.1% | HumanEval | 8.0/10 |
| 13 | Llama 4 (Meta) Llama 4 Maverick (17B/400B MoE) | B | 88% | HumanEval | 7.9/10 |
| 14 | Gemma 4 (Google) Gemma 4 31B | A | 85% | HumanEval | 8.3/10 |
| 15 | Falcon (TII) Falcon 3 10B | B | 73.8% | HumanEval | 7.1/10 |
About HumanEval
- Creator
- OpenAI, 2021
- Unit
- % (max 100)
- Official source
- https://arxiv.org/abs/2107.03374