MATH: 2026 AI Leaderboard
12,500 competition-style math problems across algebra, geometry, calculus, and number theory.
What it tests
MATH is a 12,500-problem dataset of high school competition math (AMC/AIME-style) with step-by-step solutions. Models must produce final numerical or symbolic answers.
How it is scored
Exact-match accuracy against ground-truth answers. Frontier models in 2026 exceed 95%, so the benchmark is largely saturated and is now supplemented by fresh AIME runs.
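A minimal sketch of how exact-match grading of this kind typically works; the normalization rules and function names here are illustrative assumptions, not the official MATH evaluation script.

```python
# Illustrative exact-match grader for MATH-style final answers.
# Normalization rules are assumptions for this sketch, not the official grader.

def normalize(answer: str) -> str:
    """Strip whitespace, surrounding $...$ delimiters, and \\boxed{...} wrappers."""
    ans = answer.strip()
    if ans.startswith("$") and ans.endswith("$"):
        ans = ans[1:-1].strip()
    if ans.startswith("\\boxed{") and ans.endswith("}"):
        ans = ans[len("\\boxed{"):-1].strip()
    return ans.replace(" ", "")

def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the normalized answer strings are identical."""
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of problems where the model's final answer matches exactly."""
    correct = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Example: two problems, one match after normalization -> 0.5 accuracy.
print(accuracy(["\\boxed{42}", "3/4"], ["42", "0.75"]))  # 0.5
```

Note that a naive string match like this treats equivalent forms (e.g. 3/4 vs 0.75) as different answers, which is why practical graders add symbolic-equivalence checks on top of exact match.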
Why it matters
Historically useful for tracking multi-step reasoning; largely replaced as a discriminator by AIME and competition-math-live benchmarks.
Leaderboard (2 models)
Sorted by MATH score. The Tier column shows each tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Model | Tier | MATH score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | Mistral AI Mistral Large 3 / Small 4 | B | 69% | MATH | 7.5/10 |
| 2 | Falcon (TII) Falcon 3 10B | B | 55.4% | MATH | 7.1/10 |
About MATH
- Creator: Hendrycks et al., 2021
- Unit: % (max 100)
- Official source: https://arxiv.org/abs/2103.03874