SWE-bench Verified: 2026 AI Leaderboard

Fix real GitHub issues in 12 open-source Python repos.

What it tests

SWE-bench Verified is a 500-issue subset of SWE-bench in which every issue has been human-validated as solvable. Each task is a real GitHub issue from a Python project: the model is given the repo and the issue, and must produce a patch that makes the project's test suite pass.
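As a rough sketch of the task shape described above, one task can be modeled as a record that separates what the model sees (the repo and the issue text) from what stays hidden (the tests used for grading). The class, field, and function names here are hypothetical and chosen for illustration; they are not the benchmark's actual schema, and the repository name is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical model of one benchmark task (illustrative field names)."""
    repo: str                           # repository the issue was filed against
    issue_text: str                     # the GitHub issue shown to the model
    hidden_tests: tuple[str, ...] = ()  # grading tests, never shown to the model


def model_inputs(task: TaskInstance) -> dict[str, str]:
    """Return only the parts of the task the model is allowed to see."""
    return {"repo": task.repo, "issue": task.issue_text}


task = TaskInstance(
    repo="example-org/example-project",  # made-up repository name
    issue_text="Parser crashes on empty input",
    hidden_tests=("tests/test_parser.py::test_empty_input",),
)
visible = model_inputs(task)
print(sorted(visible))  # the hidden tests are not among the keys
```

Keeping the hidden tests out of `model_inputs` mirrors the setup: the model has to infer the expected behavior from the issue and the code, not from the grading tests.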

How it is scored

The score is the percentage of issues for which the generated patch passes all hidden tests. This measures end-to-end agentic coding, not just code completion. Scores above 70% are state-of-the-art; a year ago the state of the art was 30%.
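The scoring rule above reduces to simple arithmetic: an issue counts as resolved only if every hidden test passes, and the score is the resolved share of all issues. The following is a minimal sketch with made-up test results and hypothetical function names, not the official evaluation harness.

```python
def is_resolved(test_results: dict[str, bool]) -> bool:
    """An issue counts only if at least one test ran and every test passed."""
    return bool(test_results) and all(test_results.values())


def score_percent(results_by_issue: dict[str, dict[str, bool]]) -> float:
    """Percentage of issues whose patch passed all hidden tests."""
    if not results_by_issue:
        raise ValueError("no issues to score")
    resolved = sum(is_resolved(r) for r in results_by_issue.values())
    return 100.0 * resolved / len(results_by_issue)


# Made-up results for four issues; not real benchmark data.
example = {
    "issue-1": {"test_a": True, "test_b": True},
    "issue-2": {"test_a": True, "test_b": False},  # one failure -> unresolved
    "issue-3": {"test_a": True},
    "issue-4": {"test_a": False},
}
print(score_percent(example))  # 2 of 4 resolved -> 50.0
```

There is no partial credit in this scheme: a patch that fixes most of the failing tests scores the same as one that fixes none. If a run covers all 500 issues, each resolved issue is worth 0.2 percentage points, so a score of 80.8% would correspond to 404 resolved issues.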

Why it matters

SWE-bench Verified is the closest industry-standard benchmark to 'can this model actually do my job'. It rewards code-reading, multi-file editing, and test-driven iteration -- not just autocomplete.

Leaderboard (7 models)

Sorted by SWE-bench Verified score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

#  Model                   Tier  SWE-bench Verified score

1  Claude (Anthropic)      A     80.8%
   Claude Fable 5 (launched 2026-06-09) is now the flagship -- Anthropic positions it as its most capable public model on SWE, knowledge work, and vision, but published no standalone numeric benchmark table at launch; legacy Opus-line reasoning-suite scores shown below as baseline, third-party Fable 5 verification pending

2  DeepSeek                A     80.6%
   DeepSeek V4-Pro (SWE-bench + Arena Elo third-party verified post-launch; knowledge rows are V3.x baseline pending V4 figures)

3  Mistral AI              B     77.6%
   Mistral Medium 3.5 (vendor-published; third-party verification pending)

4  Codex (OpenAI)          A     72%
   GPT-5.2-Codex (launched 2026-04-23 -- SOTA on SWE-Bench Pro and Terminal-Bench 2.0; first-party scores below pending detailed third-party verification)

5  ChatGPT                 A     72%
   GPT-5.5 (launched 2026-04-23; scores below are the GPT-5.4 baseline -- GPT-5.5 launch benchmarks per OpenAI are logged in Known Issues, pending third-party verification)

6  Qwen (Alibaba)          A     69.4%
   Qwen3.5-397B MoE

7  GLM / Z.ai (Zhipu AI)   A     64.2%
   GLM-5.1 (744B MoE / 40B active)

About SWE-bench Verified

Creator: Princeton & OpenAI, 2023 (Verified subset 2024)
Unit: % (max 100)
