Try Nemotron 3 Online
Nemotron 3 at a glance
Scale without the usual compute tax
Nemotron 3 keeps total parameter counts large while activating only a small subset of parameters per token, enabling efficient inference and long-context reasoning across both Super and Nano.
Super: 120B total parameters with 12B active
Context window up to 1M tokens
Nano: 30B total parameters with 3.5B active
Where teams use Nemotron 3
Practical evaluation paths
These workflows are the fastest ways to validate long-context reasoning, agentic behavior, and throughput before you commit to deployment.
Long-document synthesis
Summarize reports, legal briefs, and multi-chapter research with Nemotron 3 in a single prompt window.
Multi-file code analysis
Ask the models to navigate large repos, explain architecture, and propose refactors.
Agentic tool workflows
Test multi-step planning and tool calling for research, ops, or automation tasks.
Retrieval + 1M context
Compare RAG strategies against full-context prompting to see what works best.
Local deployment planning
Evaluate quantization targets, VRAM needs, and latency before moving on-prem.
Multilingual evaluations
Run the same prompts across languages to measure consistency and quality.
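For local deployment planning, a back-of-envelope memory estimate helps before committing hardware. The sketch below uses the parameter counts from this page and illustrative bytes-per-parameter figures (assumptions, not official sizing guidance); real deployments also need headroom for KV-cache and activations.

```python
# Rough VRAM estimate for serving an MoE model locally.
# All expert weights must be resident, so total (not active) parameters
# drive memory; active parameters drive per-token compute.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(total_params_b: float, dtype: str) -> float:
    """Approximate weight memory in GB for a given quantization target."""
    return total_params_b * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for name, total in [("Nano (30B)", 30), ("Super (120B)", 120)]:
    for dtype in ("fp16", "fp8", "int4"):
        print(f"{name} @ {dtype}: ~{weight_vram_gb(total, dtype):.0f} GB weights")
```

Note that this counts weights only; long-context serving at up to 1M tokens makes KV-cache a significant additional cost.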
Research highlights
Architecture and benchmark visuals
Figures sourced from the paper and shared assets, highlighting Nemotron 3 architecture and evaluation snapshots.


Nemotron 3 paper figures
Paper visuals
Key diagrams and benchmark curves extracted from the paper.
Architecture and routing
MoE routing diagrams used in Nemotron 3 Nano.
The paper introduces Nemotron 3 (Nano, Super, Ultra) as a Mixture-of-Experts hybrid Mamba–Transformer family built for strong throughput and up to 1M-token context.
Most layers interleave Mamba-2 and MoE blocks with a small number of self-attention layers; larger models add LatentMoE and MTP layers for quality and faster generation.
Post-training uses multi-environment reinforcement learning to improve reasoning, multi-step tool use, and budget-controlled inference.
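The paper's exact gating function is not reproduced here; as a minimal illustration of the top-k MoE routing idea the architecture relies on, the sketch below scores experts with a simple linear gate, keeps the top k, and renormalizes with a softmax over the selected experts (all names and shapes are assumptions for illustration).

```python
import numpy as np

def topk_moe_route(x, gate_w, k=2):
    """Top-k MoE router sketch.
    x: (d,) token hidden state; gate_w: (n_experts, d) gating weights."""
    logits = gate_w @ x                     # one score per expert
    topk = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                # softmax over selected experts only
    return topk, weights

rng = np.random.default_rng(0)
x = rng.normal(size=16)
gate_w = rng.normal(size=(8, 16))
experts, w = topk_moe_route(x, gate_w, k=2)
print(experts, w)
```

Only the k selected experts run their feed-forward pass for this token, which is how a 30B-total model can cost roughly 3.5B active parameters per token.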


Accuracy-efficiency trade-off
Accuracy-efficiency trade-off curves for Nemotron 3 Nano as the token budget varies.
Inference-time budget control lets you set a maximum token budget for the thinking trace.
When the budget is reached, appending the `</think>` token prompts the model to continue with the response based on the partial trace.
The curves below show how accuracy trades off against efficiency as the token budget changes.
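A minimal sketch of how such a budget cap might be enforced in a decoding loop; `generate_step` and `close_id` are hypothetical stand-ins for a real sampler and tokenizer, not an official API.

```python
def apply_thinking_budget(generate_step, prompt_ids, budget, close_id):
    """Generate thinking tokens until the budget is hit, then force the
    </think> token so the model answers from the partial trace."""
    trace = []
    ids = list(prompt_ids)
    for _ in range(budget):
        tok = generate_step(ids)
        if tok == close_id:      # model closed its own trace early
            return ids + [tok], trace
        trace.append(tok)
        ids.append(tok)
    ids.append(close_id)         # budget reached: append </think> and continue
    return ids, trace
```

Lowering `budget` moves left along the trade-off curve: cheaper inference, shorter reasoning traces, and (typically) lower accuracy.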




Official Nemotron 3 resources
Official NVIDIA sources for the Nemotron 3 family
Use these official sources for accurate paper details, HuggingFace model cards, GitHub links, and Super benchmark context.

NVIDIA Nemotron 3 Research Lab
Overview, labs, and release notes for NVIDIA Nemotron 3.
Nemotron 3 paper on arXiv
The official Nemotron 3 paper with methods, training, and evaluation details.
NVIDIA Nemotron 3 Nano 30B A3B FP8
Model card and downloads for NVIDIA Nemotron 3 Nano 30B A3B FP8 on HuggingFace.

Nemotron 3 Super benchmark blog
NVIDIA Developer Blog launch with Nemotron 3 Super benchmark highlights and context.

Nemotron 3 white paper PDF
The Nemotron 3 paper PDF for deeper technical context and design notes.
Nemotron 3 architecture highlights
Hybrid MoE + long-context reasoning
Nemotron 3 combines a hybrid Mamba-Transformer MoE backbone with long-context support and robust post-training, targeting strong agentic performance at high throughput.

Pick the right Nemotron 3 model
Nano for efficiency, Super for agentic scale
Nemotron 3 ships as a family. Nano targets efficient local or edge use, Super targets high-end agentic reasoning, and Ultra expands the accuracy frontier. Nano and Super are both open and long-context by design; open releases include NVIDIA Nemotron 3 Nano 30B A3B FP8.
- Nemotron 3 Nano: 30B total / 3.5B active parameters with up to 1M context.
- Nemotron 3 Super: 120B total / 12B active parameters with up to 1M context.
- Hybrid Mamba-Transformer MoE architecture across the family.
- Open weights, technical reports, and reproducible recipes.
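As a toy illustration of choosing between the family members listed above, the helper below maps available VRAM to a model; the thresholds assume FP8 weights at roughly one byte per parameter and are assumptions, not official guidance.

```python
def pick_nemotron(vram_gb: float, needs_agentic_scale: bool) -> str:
    """Toy model picker based on the specs above (FP8 weight sizes assumed)."""
    if needs_agentic_scale and vram_gb >= 120:
        return "Nemotron 3 Super (120B total / 12B active)"
    if vram_gb >= 30:
        return "Nemotron 3 Nano (30B total / 3.5B active)"
    return "Consider int4 quantization of Nano, or hosted inference"
```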

What the community keeps asking for
Signals from YouTube, Reddit, Perplexity AI, and Artificial Analysis
Across community discussions, demand clusters around local deployment, long-context memory, and agentic workflows, plus a desire for open tooling and better control over safety layers.
Local & offline deployment
Clear guidance on quantization, VRAM requirements, and running large models on a single GPU or local server.
Long-context memory
Reliable performance on very long documents, codebases, and multi-file reasoning without constant chunking.
Agentic workflows
Better templates for multi-agent orchestration, tool calling, and long-running tasks.
Throughput & cost efficiency
High tokens/sec and predictable latency for production inference and batch reasoning.
Open ecosystem tooling
Integrations with LangChain, AutoGen, and local UIs, plus open recipes to reproduce results.
Control & multilingual quality
Less over-restriction for creative tasks and stronger non-English performance across domains.
Video walkthrough
Community demo and first-look coverage
A quick video from the community showing how Nemotron 3 Super behaves in practice and what developers are testing.
Video sourced from community discussions on YouTube.
Try Nemotron 3 Super and Nano with a clean workflow
Launch the playground to test Nemotron 3 Super or Nano, or the Nemotron 3 Super API, then open official resources for deployment and model details.
FAQ
Questions about this site
Community signals
What builders keep requesting
Local LLM Builders
Deployment: Need clear guidance on quantization and VRAM so we can run Super or Nano on a single GPU.
Agent Developers
Workflow: Looking for multi-agent templates and tool-calling patterns that scale to long tasks.
Research Analysts
Reasoning: We want 1M-token context that stays coherent for large docs and codebases.
Cost-sensitive Teams
Efficiency: High throughput and predictable latency matter more than flashy UI features.
Tooling Integrators
Ecosystem: Open recipes and integrations with LangChain/AutoGen are the fastest path to adoption.
Creative Users
Control: Need stronger multilingual quality and better control over alignment limits.