
SWE-bench Verified: 2026 AI Leaderboard

Fix real GitHub issues in 12 open-source Python repos.

What it tests

SWE-bench Verified is a 500-issue subset of SWE-bench that has been human-validated as solvable. Each task is a real GitHub issue from one of 12 open-source Python projects; the model is given the repository and the issue text, and must produce a patch that makes the project's test suite pass.
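
For the curious, the 500 tasks are published as a dataset on Hugging Face. The sketch below is a minimal look at one task record, assuming the public princeton-nlp/SWE-bench_Verified dataset and its documented field names; treat the exact schema (and the example repo name) as assumptions rather than a spec.

```python
# Minimal sketch: inspect one SWE-bench Verified task.
# Assumes the public Hugging Face dataset "princeton-nlp/SWE-bench_Verified"
# and its documented field names; install with `pip install datasets`.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 human-validated tasks

task = ds[0]
print(task["repo"])               # e.g. "astropy/astropy", one of 12 Python repos
print(task["base_commit"])        # the commit the model's patch must apply to
print(task["problem_statement"])  # the GitHub issue text the model is given
print(task["FAIL_TO_PASS"])       # hidden tests the patch must turn green
print(task["PASS_TO_PASS"])       # existing tests the patch must not break
```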

How it is scored

The score is the percentage of issues where the generated patch passes all hidden tests. This is end-to-end agentic coding, not just code completion. Scores above 70% are state of the art; a year ago the leading scores were around 30%.
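
In other words, a task only counts as resolved if the patch applies cleanly, the previously failing tests now pass, and nothing regresses. The sketch below captures that logic under stated assumptions: the real harness runs each repo in a pinned container with repo-specific install and test commands, so the bare `git apply` and `pytest` calls here are simplifications, and the JSON-encoded test lists are an assumption about how the dataset stores them.

```python
import json
import subprocess

def is_resolved(task: dict, model_patch: str, repo_dir: str) -> bool:
    """Simplified resolution check for one task. Assumes repo_dir is already
    checked out at task["base_commit"]; bare `git apply` + `pytest` stand in
    for the real harness's containerized, repo-specific machinery."""
    # The patch must apply cleanly at the task's base commit.
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", "-"],
        input=model_patch, text=True)
    if applied.returncode != 0:
        return False
    # FAIL_TO_PASS: tests the fix must turn green.
    # PASS_TO_PASS: tests that were green before and must stay green.
    for group in ("FAIL_TO_PASS", "PASS_TO_PASS"):
        tests = json.loads(task[group])  # assumption: stored as a JSON list string
        result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True

# The benchmark score is just the resolved rate over all 500 tasks:
# score = 100 * sum(is_resolved(...) for task in ds) / len(ds)
```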

Why it matters

SWE-bench Verified is the closest thing the industry has to a standard benchmark for 'can this model actually do my job?'. It rewards code reading, multi-file editing, and test-driven iteration, not just autocomplete.

Leaderboard (9 models)

Sorted by SWE-bench Verified score. The Tier column shows the tool's overall AI Tool Tier rank, which blends this benchmark with pricing, features, and real-world usability.

#   Model                                          Tier   SWE-bench Verified score
1   Claude Opus 4.7 (Anthropic) *                  A      80.8%
2   Gemini 3.1 Ultra (Google)                      A      80.6%
3   MiniMax M2.5 (230B MoE, 10B active)            A      80.2%
4   Kimi K2.5 (Moonshot; 1T MoE, 32B active)       A      78.5%
5   GPT-5.3-Codex (OpenAI Codex)                   A      72%
6   GPT-5.4 (ChatGPT)                              A      72%
7   Qwen3.5-397B MoE (Alibaba)                     A      69.4%
8   DeepSeek V3.2 (DeepSeek)                       A      67.8%
9   GLM-5.1 (Z.ai / Zhipu AI; 744B MoE, 40B active)   A   64.2%

* Claude 4.6 baseline score shown; Anthropic announced a 13% coding lift and 3x production task completion for 4.7.

About SWE-bench Verified

Creator: Princeton & OpenAI, 2023 (Verified subset 2024)
Unit: % (max 100)
