MATH: 2026 AI Leaderboard
12,500 competition-style math problems across algebra, geometry, calculus, and number theory.
What it tests
MATH is a 12,500-problem dataset of high school competition math (AMC/AIME-style) with step-by-step solutions. Models must produce final numerical or symbolic answers.
How it is scored
Exact-match accuracy against ground-truth answers. Frontier models in 2026 exceed 95%, so the benchmark is largely saturated and is now supplemented by fresh AIME runs.
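A minimal sketch of how exact-match grading of this kind typically works; the normalization rules and function names here are illustrative assumptions, not the official MATH evaluation script.

```python
# Illustrative exact-match grader for MATH-style final answers.
# Normalization rules are assumptions for this sketch, not the official grader.

def normalize(answer: str) -> str:
    """Strip whitespace, surrounding $...$ delimiters, and \\boxed{...} wrappers."""
    ans = answer.strip()
    if ans.startswith("$") and ans.endswith("$"):
        ans = ans[1:-1].strip()
    if ans.startswith("\\boxed{") and ans.endswith("}"):
        ans = ans[len("\\boxed{"):-1].strip()
    return ans.replace(" ", "")

def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the normalized answer strings are identical."""
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of problems where the model's final answer matches exactly."""
    correct = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Example: two problems, one match after normalization -> 0.5 accuracy.
print(accuracy(["\\boxed{42}", "3/4"], ["42", "0.75"]))  # 0.5
```

Note that a naive string match like this treats equivalent forms (e.g. 3/4 vs 0.75) as different answers, which is why practical graders add symbolic-equivalence checks on top of exact match.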
Why it matters
Historically useful for tracking multi-step reasoning; largely replaced as a discriminator by AIME and competition-math-live benchmarks.
Leaderboard (2 models)
Sorted by MATH score. The Tier column shows each tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Model | Tier | MATH score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | Mistral AI Mistral Large 3 / Small 4 | B | 69% | MATH | 7.5/10 |
| 2 | Falcon (TII) Falcon 3 10B | B | 55.4% | MATH | 7.1/10 |
About MATH
- Creator: Hendrycks et al., 2021
- Unit: % (max 100)
- Official source: https://arxiv.org/abs/2103.03874