
Nemotron (Nvidia)

B Tier · 7.8/10

Nvidia's open-weights family -- hybrid Mamba-Transformer MoE architecture, optimized for efficient reasoning on Nvidia hardware

Last updated: 2026-04-13 · Free tier available

Score Breakdown

  • Ease of Use: 6.5
  • Output Quality: 8.0
  • Value: 8.0
  • Features: 8.5

Benchmark Scores

Benchmarks for Nemotron 3 Ultra (253B) unless otherwise noted

  • MMLU-Pro: 79.8%
  • GPQA Diamond: 70.5%
  • AIME 2025: 84.5%
  • HumanEval: 89.6%
  • MMLU (Llama-Nemotron 70B): 88.4%

Last updated: 2026-04-13

The Good and the Bad

What we like

  • Hybrid Mamba-Transformer architecture dramatically reduces memory per token at long context
  • Nemotron 3 Super activates only 3.6B params -- runs on 8 GB VRAM with top-tier reasoning quality
  • Nvidia-optimized inference: first-class TensorRT-LLM, vLLM, and NIM deployment
  • Llama-Nemotron 70B scores MMLU 88.4% -- within a point of GPT-4o on a model you can run locally
  • Permissive Nvidia Open Model License allows commercial deployment

What could be better

  • Mamba inference ecosystem is still catching up -- Ollama and llama.cpp support is partial
  • Not the absolute frontier on benchmarks -- DeepSeek, Qwen, Kimi outscore on most leaderboards
  • Smaller community than Llama/Qwen -- fewer fine-tunes available
  • Release cadence is slow compared to Chinese labs

Pricing

Self-hosted (Free)

$0
  • NVIDIA Open Model License
  • Commercial use permitted
  • Weights on Hugging Face and NGC

API (build.nvidia.com)

Varies (priced per 1M tokens)
  • Free tier for experimentation
  • NIM microservices for production
  • Pricing via Nvidia Cloud partners
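The build.nvidia.com endpoint is OpenAI-compatible, so any OpenAI-style HTTP client works against it. A minimal sketch with only the standard library, assuming the `integrate.api.nvidia.com` base URL; the model id `nvidia/nemotron-3-super` is a placeholder -- check the catalog for the exact name:

```python
import json
import os
import urllib.request

# Base URL for Nvidia's hosted API catalog (OpenAI-compatible endpoints).
NVIDIA_API_BASE = "https://integrate.api.nvidia.com/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload to the chat/completions endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        f"{NVIDIA_API_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Model id is a placeholder -- look up the real one on build.nvidia.com.
    payload = build_chat_request("nvidia/nemotron-3-super",
                                 "Summarize the Mamba architecture in one line.")
    key = os.environ.get("NVIDIA_API_KEY")
    if key:
        print(send(payload, key)["choices"][0]["message"]["content"])
```

Because the wire format is OpenAI-compatible, swapping between the free experimentation tier and a NIM microservice in production is mostly a matter of changing the base URL and key.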

System Requirements

Hardware needed to self-host. Min = smallest viable setup (usually heavy quantization). Max = full-precision / production-grade.

  • Nemotron 3 Super (31.6B total, 3.6B active Mamba-MoE) -- Min: 8 GB VRAM Q4 (RTX 3070) · Max: 1× A100 40 GB FP16. The Mamba hybrid gives unusually low memory per token at long context.
  • Nemotron 3 Ultra (253B reasoning) -- Min: 128 GB RAM + 24 GB GPU (Q3) · Max: 4× H100 FP8
  • Llama-Nemotron 70B -- Min: 24 GB VRAM Q4 (RTX 3090/4090) · Max: 1× H100 80 GB FP16
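The Min/Max figures above follow from simple weight-size arithmetic: parameter count × bits per weight, plus headroom for KV/state cache and activations. A back-of-envelope helper (our own sketch, not an Nvidia tool):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate memory for the weights alone: params × (bits / 8) bytes, in GB."""
    return params_billion * bits / 8


# Dense 70B at 4-bit: ~35 GB of weights -- why Llama-Nemotron 70B wants
# 24 GB VRAM plus offload or an even tighter quant.
print(round(weight_gb(70, 4), 1))    # 35.0

# MoE changes the picture: all 31.6B weights exist on disk, but only 3.6B
# are active per token, so compute (and, with expert offloading, the hot
# working set) stays far smaller than the total suggests.
print(round(weight_gb(31.6, 4), 1))  # 15.8
print(round(weight_gb(3.6, 16), 1))  # 7.2 (active params at FP16)
```

This ignores KV/state cache, which is exactly where the Mamba hybrid earns its keep: Mamba layers carry a fixed-size state instead of a cache that grows with context length.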

Known Issues

  • Mamba-hybrid layers require custom CUDA kernels -- non-Nvidia hardware (Apple Silicon, AMD ROCm) has limited support. Source: Hugging Face discussions, GitHub issues · 2026-02
  • Early Nemotron 3 Super quantizations below Q4 showed degraded reasoning quality vs. dense Llama at the same bit-width. Source: Reddit r/LocalLLaMA · 2026-03

Best for

Teams running on Nvidia hardware (TensorRT-LLM, NIM) who need efficient long-context reasoning. Nemotron 3 Super is a standout for its 8 GB VRAM footprint with strong reasoning.

Not for

Apple Silicon / AMD GPU users -- Mamba hybrid kernels are Nvidia-first. Also not ideal if you want maximum community support (use Llama or Qwen).

Our Verdict

Nemotron is Nvidia's bet that architecture innovation (hybrid Mamba-Transformer MoE) beats pure scale. The bet largely pays off: Nemotron 3 Super runs on a gaming GPU while posting reasoning scores that rival much larger dense models. If you're deployed on Nvidia hardware and need efficient long-context inference, Nemotron is the natural pick. If you're not on Nvidia or need absolute frontier quality, Qwen3 or DeepSeek are stronger options.

Sources

  • Nvidia Nemotron 3 release (accessed 2026-04-13)
  • Artificial Analysis Nemotron Ultra 253B (accessed 2026-04-13)
  • Hugging Face nvidia collection (accessed 2026-04-13)
  • Reddit r/LocalLLaMA Nemotron 3 discussion (accessed 2026-04-13)

Alternatives to Nemotron (Nvidia)

Llama 4 (Meta) -- B · 7.9/10
Meta's open-weights flagship family -- Scout (10M context), Maverick (multimodal 400B MoE), Behemoth in preview
Free tier · From $0 · Updated 2026-04-13

Mistral AI -- B · 7.5/10
European AI lab with open and commercial models that punch well above their size
Free tier · From $0 · Updated 2026-03-26

DeepSeek -- A · 8.0/10
Near-frontier reasoning for pennies on the dollar -- the open-source LLM that made Silicon Valley nervous
Free tier · From $0 · Updated 2026-03-31

Gemma 4 (Google) -- A · 8.3/10
Google DeepMind's open-weights model family -- multimodal, 256K context, runs on edge devices
Free tier · From $0 · Updated 2026-04-08

Qwen (Alibaba) -- A · 8.8/10
Alibaba's open-weights family -- Qwen3.5, Qwen3-Coder-Next, Qwen3-VL, Qwen3-Max. Apache 2.0 flagship sizes.
Free tier · From $0 · Updated 2026-04-13

GLM / Z.ai (Zhipu AI) -- A · 8.0/10
Zhipu AI's open-weights family -- GLM-4.6 text flagship and GLM-4.6V multimodal, true MIT licensed
Free tier · From $0 · Updated 2026-04-13

Kimi K2.5 (Moonshot) -- A · 8.1/10
Moonshot's 1T-parameter MoE open-weights flagship -- best open-source agentic coder, rivals Claude Opus 4.5
Free tier · From $0 · Updated 2026-04-13

MiniMax M2 / M2.5 -- A · 8.4/10
MiniMax's open-weights frontier -- first open model to match Claude Opus 4.6 on SWE-Bench at 10-20× lower cost
Free tier · From $0 · Updated 2026-04-13

Falcon (TII) -- B · 7.1/10
UAE's Technology Innovation Institute open-weights family -- Falcon 3 optimized for efficient sub-10B deployment on consumer hardware
Free tier · From $0 · Updated 2026-04-13