AI Benchmarks (2026)
The benchmarks that matter for ranking LLMs and coding agents: what each one tests, how it is scored, why it matters, and the current leaderboard across the 128 AI tools we review.
Knowledge & Reasoning
GPQA Diamond
14 scored · Graduate-level physics, biology, and chemistry questions written to be Google-proof.
ARC-AGI
3 scored · Abstract visual reasoning puzzles designed to stay hard for LLMs.
Humanity's Last Exam
2 scored · 3,000 questions written by domain experts, designed to stump frontier models.
Math
AIME
7 scored · The American Invitational Mathematics Examination, used as a rolling frontier-math benchmark because each year's exam is newly written.
MATH
2 scored · 12,500 competition-style math problems across algebra, geometry, precalculus, and number theory.
Coding
HumanEval
15 scored · 164 hand-written Python programming problems; generated code is scored by whether it passes held-out unit tests, reported as pass@k (see the sketch after this list).
SWE-bench Verified
9 scored · 500 human-validated tasks that require fixing real GitHub issues across 12 open-source Python repos.
LiveCodeBench
1 scored · Competitive programming problems published after the model's training cutoff, so memorization can't inflate scores.
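HumanEval-style coding benchmarks report pass@k: the probability that at least one of k sampled completions passes all unit tests. Below is a minimal sketch of the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the function name is ours, not the paper's.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k draw from the n samples must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 34 passing -> pass@1 = 34/200
print(pass_at_k(200, 34, 1))  # 0.17
```

Note that pass@1 reduces to c/n, the raw pass rate; larger k rewards models that solve a problem on at least one of several tries.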
How we source benchmark scores
Every score on this site comes from the model vendor's own published technical report or from LMSYS Arena. We cite the source on each tool page and date-stamp the pull. When third-party verification lags vendor claims, we mark the score with a pending label rather than invent a number. See our methodology for the full policy.
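To make the policy concrete, here is a hypothetical sketch of the kind of record this implies for each score. The field names are illustrative, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkScore:
    tool: str        # the reviewed model or coding agent
    benchmark: str   # e.g. "GPQA Diamond"
    score: float     # as published by the vendor or LMSYS Arena
    source_url: str  # the technical report or Arena page we cite
    pulled_on: date  # date-stamp of the pull
    pending: bool    # True until third-party verification lands

example = BenchmarkScore(
    tool="example-model",
    benchmark="GPQA Diamond",
    score=0.84,
    source_url="https://example.com/tech-report",
    pulled_on=date(2026, 1, 15),
    pending=True,
)
```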