MATH: 2026 AI Leaderboard

12,500 competition-style math problems across algebra, geometry, calculus, and number theory.

What it tests

MATH is a 12,500-problem dataset of high school competition math (AMC/AIME-style) with step-by-step solutions. Models must produce final numerical or symbolic answers.

How it is scored

Exact-match accuracy against ground-truth answers. Frontier models in 2026 exceed 95%, so the benchmark is now largely saturated and is supplemented by fresh AIME runs.
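Exact-match scoring can be sketched in a few lines. The normalization rules below are illustrative assumptions, not the official MATH grader (which also handles LaTeX equivalences such as `\frac{3}{4}` versus `3/4`):

```python
# Minimal sketch of exact-match accuracy for MATH-style answers.
# The normalize() rules here are assumptions for illustration only.

def normalize(answer: str) -> str:
    """Strip whitespace and surrounding $ signs, lowercase (assumed rules)."""
    return answer.strip().strip("$").replace(" ", "").lower()

def exact_match_accuracy(predictions, ground_truths):
    """Fraction of predictions that match the gold answer after normalization."""
    correct = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)

preds = ["  $42$ ", "x+1", "3/4"]
golds = ["42", "x + 1", "0.75"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 match; "3/4" != "0.75"
```

Note how string-level matching misses mathematically equal answers in different forms ("3/4" versus "0.75"), which is why real graders apply symbolic equivalence checks on top.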

Why it matters

Historically useful for tracking multi-step reasoning; largely replaced as a discriminator by AIME and competition-math-live benchmarks.

Leaderboard (2 models)

Sorted by MATH score. The Tier column shows the tool's overall AI Tool Tier rank, which blends this benchmark with pricing, features, and real-world usability.

#   Model                                      Tier   MATH score
1   Mistral Large 3 / Small 4 (Mistral AI)     B      69%
2   Falcon 3 10B (TII)                         B      55.4%

About MATH

Creator
Hendrycks et al., 2021
Unit
% (max 100)
