AI Benchmarks (2026)
The benchmarks that matter for ranking LLMs and coding agents: what each one tests, how it is scored, why it matters, and the current leaderboard across the 128 AI tools we review.
Knowledge & Reasoning
GPQA Diamond
14 scored · Graduate-level physics, biology, and chemistry questions written to be Google-proof.
ARC-AGI
3 scored · Abstract visual reasoning puzzles designed to stay hard for LLMs.
Humanity's Last Exam
2 scored · 3,000 questions written by domain experts, designed to stump frontier models.
Math
AIME
7 scored · The American Invitational Mathematics Examination, used as a rolling frontier-math benchmark because each year's exam is newly written.
MATH
2 scored · 12,500 competition-style math problems across algebra, geometry, precalculus, and number theory.
Coding
HumanEval
15 scored · 164 hand-written Python programming problems; generated code is scored by whether it passes held-out unit tests, reported as pass@k (see the sketch after this list).
SWE-bench Verified
9 scored · 500 human-validated tasks that require fixing real GitHub issues across 12 open-source Python repos.
LiveCodeBench
1 scored · Competitive programming problems published after the model's training cutoff, so memorization can't inflate scores.
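HumanEval-style coding benchmarks report pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the function name is ours, not the paper's.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k draw from the n samples must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 34 passing -> pass@1 = 34/200
print(pass_at_k(200, 34, 1))  # 0.17
```

Note that pass@1 reduces to c/n, the raw pass rate; larger k rewards models that solve a problem on at least one of several tries.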
How we source benchmark scores
Every score on this site comes from the model vendor's own published technical report or from LMSYS Arena. We cite the source on each tool page and date-stamp the pull. When third-party verification lags vendor claims, we mark the score with a pending label rather than invent a number. See our methodology for the full policy.
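To make the policy concrete, here is a hypothetical sketch of the kind of record this implies for each score. The field names are illustrative, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkScore:
    tool: str        # the reviewed model or coding agent
    benchmark: str   # e.g. "GPQA Diamond"
    score: float     # as published by the vendor or LMSYS Arena
    source_url: str  # the technical report or Arena page we cite
    pulled_on: date  # date-stamp of the pull
    pending: bool    # True until third-party verification lands

example = BenchmarkScore(
    tool="example-model",
    benchmark="GPQA Diamond",
    score=0.84,
    source_url="https://example.com/tech-report",
    pulled_on=date(2026, 1, 15),
    pending=True,
)
```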