NVIDIA-Nemotron-3-Nano-30B-A3B

reasoning

Nemotron-3 Nano-30B is a new-generation LLM from NVIDIA. Its defining feature is a hybrid architecture that combines Mamba2 layers, Transformer attention layers, and Mixture-of-Experts (MoE) layers in a single network. This design lets the model process large volumes of data efficiently while maintaining logical coherence and high throughput. The model has 32 billion parameters in total, but thanks to MoE routing only about 3.5 billion of them are active when generating each token. As a result, the model has the knowledge capacity of a 30B-scale network while consuming compute on par with compact models optimized for fast inference. It was trained on about 25 trillion tokens covering 43 programming languages and more than 19 natural languages.
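
The sparse-activation idea can be illustrated with a toy top-k router. This is a minimal sketch, not NVIDIA's routing implementation: the expert count of 128 comes from the specification list on this page, while TOP_K = 6, the random router scores, and the function names are assumptions chosen for illustration. Taking a softmax over the selected scores is one common way to weight the chosen experts; the actual model may weight them differently.

```python
import math
import random

NUM_EXPERTS = 128   # from the specification list on this page
TOP_K = 6           # hypothetical: experts chosen per token (not stated on this page)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


random.seed(0)
# Stand-in for the router's output: one score per expert for a single token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)
print(f"experts used for this token: {len(chosen)} of {NUM_EXPERTS}")
print(f"sum of mixing weights: {sum(w for _, w in chosen):.3f}")
```

Only the selected experts run their feed-forward computation for that token, which is why the active parameter count is a small fraction of the total.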

Compared to Nemotron v2, the new version uses an MoE architecture instead of a dense one, delivering 4 times higher throughput. Another key capability of Nemotron-3 Nano is support for a context window of up to 1 million tokens (the configurations listed on this page run with a 262,144-token context). Long contexts play to the strengths of Mamba2 layers, which process long sequences with minimal memory overhead. A crucial stage in the model's training was multi-environment reinforcement learning using the NeMo Gym library. The model was trained not just to answer questions but to carry out sequences of actions: calling tools, writing working code, and constructing multi-step plans. This makes its behavior more predictable and reliable in complex scenarios that require step-by-step verification of results.
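
To make the memory argument concrete, the sketch below estimates how an attention key-value cache grows with sequence length. Only the count of 6 full-attention layers and the total of 52 layers come from the specification list on this page; the number of KV heads, the head dimension, and the 2-byte element size are assumed values for illustration, not published figures for this model. Mamba2 layers keep a fixed-size recurrent state instead, so they add nothing that scales with the token count.

```python
ATTENTION_LAYERS = 6    # "using full attention: 6" in the specification list
TOTAL_LAYERS = 52       # "Layers: 52" in the specification list
KV_HEADS = 2            # assumption for illustration
HEAD_DIM = 128          # assumption for illustration
BYTES_PER_VALUE = 2     # assumption: 16-bit cache entries


def kv_cache_bytes(seq_len, layers=ATTENTION_LAYERS):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both a key and a value per token.
    return 2 * layers * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * seq_len


for tokens in (8_192, 262_144, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9} tokens -> {gib:6.2f} GiB of KV cache")

# Same assumed head shape, but with attention in every layer:
all_attention = kv_cache_bytes(262_144, layers=TOTAL_LAYERS) / 2**30
print(f"{TOTAL_LAYERS} attention layers at 262,144 tokens -> {all_attention:.2f} GiB")
```

Under these assumptions the cache scales linearly with both the token count and the number of attention layers, so replacing most attention layers with fixed-state layers shrinks the per-sequence memory proportionally.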

On the AIME25 benchmark (American Invitational Mathematics Examination), which tests mathematical and quantitative reasoning, Nemotron-3 Nano achieves 99.2% accuracy with tool use, surpassing GPT-OSS-20B at 98.7%. On LiveCodeBench (v6 2025-08–2025-05), the model scores 68.2%, outperforming Qwen3-30B (66.0%) and GPT-OSS-20B (61.0%). On other benchmarks, the model either leads or is on par with its counterparts.

Given its architectural advantages and NVIDIA's recommendations, the model is well suited to the following tasks: agentic systems and orchestration, long-context RAG, local/on-prem and edge deployment, code generation, and data structuring.


Announce Date: 15.12.2025
Parameters: 32B
Experts: 128
Activated at inference: 4B
Context: 262K
Layers: 52, using full attention: 6, using no attention: 23
Attention Type: Hybrid Attention
Mamba Type: Mamba 2
Developer: NVIDIA
Transformers Version: 4.55.4
License: NVIDIA Nemotron Open Model License

Public endpoint

Use our pre-built public endpoints for free to test inference and explore NVIDIA-Nemotron-3-Nano-30B-A3B capabilities. You can obtain an API access token on the token management page after registration and verification.
Model Name                                      Context   Type    GPU  TPS     Tooling  Status     Link
stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ   262,144   Public  -    463.00  yes      AVAILABLE  chat

API access to NVIDIA-Nemotron-3-Nano-30B-A3B endpoints

curl:

curl https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer USER_API_KEY" \
  -d '{"model": "NVIDIA-Nemotron-3-Nano-30B-A3B", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

PowerShell:

$response = Invoke-WebRequest https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/chat/completions `
  -Method POST `
  -Headers @{
    "Authorization" = "Bearer USER_API_KEY"
    "Content-Type" = "application/json"
  } `
  -Body (@{
    model = "NVIDIA-Nemotron-3-Nano-30B-A3B"
    messages = @(
      @{ role = "system"; content = "You are a helpful assistant." },
      @{ role = "user"; content = "Say this is a test" }
    )
  } | ConvertTo-Json)
# Print only the assistant's reply text
($response.Content | ConvertFrom-Json).choices[0].message.content

Python:

# Install the client first: pip install --upgrade openai

from openai import OpenAI

# The client appends "chat/completions" to base_url
client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/",
)

chat_response = client.chat.completions.create(
    model="NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
# Print only the assistant's reply text
print(chat_response.choices[0].message.content)
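
The public endpoint table lists tooling support, so a tool-calling request might look like the following sketch. It uses the tools parameter of the OpenAI Python client; the get_weather function and its schema are hypothetical, and whether this endpoint returns tool_calls for a given prompt is an assumption to verify against the live service.

```python
import json

BASE_URL = "https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/"
MODEL = "NVIDIA-Nemotron-3-Nano-30B-A3B"

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(question):
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def ask_with_tools(api_key, question):
    """Send the request and return (tool_name, arguments) pairs, if any."""
    from openai import OpenAI  # imported here so build_request works offline

    client = OpenAI(api_key=api_key, base_url=BASE_URL)
    response = client.chat.completions.create(**build_request(question))
    calls = response.choices[0].message.tool_calls or []
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]


# Usage (requires a valid token):
# print(ask_with_tools("USER_API_KEY", "What is the weather in Paris?"))
```

The application is expected to run the named function itself and send the result back in a follow-up message with the "tool" role; the model only proposes the call.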

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.

Recommended server configurations for hosting NVIDIA-Nemotron-3-Nano-30B-A3B

Prices:
Name                           Context   Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslaa10-1.16.32.160           262,144   -            1     $0.53        -        1.024
teslat4-2.16.32.160            262,144   tensor       2     $0.54        -        3.347
teslaa2-2.16.32.160            262,144   tensor       2     $0.57        -        3.395
rtx3090-1.16.24.160            262,144   -            1     $0.83        -        1.655
rtx2080ti-3.12.24.120          262,144   pipeline     3     $0.84        -        1.893
rtx4090-1.16.32.160            262,144   -            1     $1.02        -        1.632
rtx2080ti-4.16.32.160          262,144   tensor       4     $1.12        -        3.961
rtxa5000-2.16.64.160.nvlink    262,144   tensor       2     $1.23        -        12.800
rtx5090-1.16.64.160            262,144   -            1     $1.59        -        6.292
rtx3080-4.16.64.160            262,144   tensor       4     $1.82        -        2.830
teslaa100-1.16.64.160          262,144   -            1     $2.37        181.920  34.804
h100-1.16.64.160               262,144   -            1     $3.83        -        34.762
h100nvl-1.16.96.160            262,144   -            1     $4.11        -        43.042
teslaa100-2.24.96.160.nvlink   262,144   tensor       2     $4.61        -        80.358
h200-1.16.128.160              262,144   -            1     $4.74        -        70.845
h200-2.24.256.160.nvlink       262,144   tensor       2     $9.40        -        152.440
Prices:
Name                           Context   Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslaa10-2.16.64.160           262,144   tensor       2     $0.93        -        2.847
teslat4-4.16.64.160            262,144   tensor       4     $0.96        -        3.746
rtxa5000-2.16.64.160.nvlink    262,144   tensor       2     $1.23        -        2.847
teslaa2-4.32.128.160           262,144   tensor       4     $1.26        -        3.794
rtx3090-2.16.64.160            262,144   tensor       2     $1.56        -        4.109
rtx4090-2.16.64.160            262,144   tensor       2     $1.92        -        4.061
teslaa100-1.16.64.160          262,144   -            1     $2.37        -        24.851
rtx5090-2.16.64.160            262,144   tensor       2     $2.93        -        13.382
h100-1.16.64.160               262,144   -            1     $3.83        134.650  24.809
h100nvl-1.16.96.160            262,144   -            1     $4.11        -        33.089
teslaa100-2.24.96.160.nvlink   262,144   tensor       2     $4.61        -        70.405
h200-1.16.128.160              262,144   -            1     $4.74        -        60.892
h200-2.24.256.160.nvlink       262,144   tensor       2     $9.40        -        142.487
Prices:
Name                           Context   Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslaa10-4.16.128.240          262,144   tensor       4     $1.76        -        3.118
teslaa100-1.16.128.240         262,144   -            1     $2.51        -        4.691
rtx3090-4.16.96.320            262,144   tensor       4     $2.97        -        4.380
rtx4090-4.16.96.320            262,144   tensor       4     $3.68        -        4.333
h100-1.16.128.240              262,144   -            1     $3.96        -        4.649
h100nvl-1.16.96.240            262,144   -            1     $4.12        -        12.929
rtx5090-3.16.96.240            262,144   pipeline     3     $4.35        -        5.796
h200-1.16.128.240              262,144   -            1     $4.74        -        40.732
teslaa100-2.24.256.320.nvlink  262,144   tensor       2     $4.94        -        50.245
rtx5090-4.16.128.320           262,144   tensor       4     $5.76        -        13.654
h200-2.24.256.240.nvlink       262,144   tensor       2     $9.41        -        122.327
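
The hourly prices convert to monthly and per-token figures with simple arithmetic. The sketch below uses the teslaa100-1.16.64.160 entry from the first price list ($2.37 per hour, 181.920 TPS). It assumes a 30-day month billed at the hourly rate, continuous full utilization, and that TPS is a sustained tokens-per-second rate; none of these is stated on this page, so treat the results as rough estimates.

```python
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month, running nonstop


def monthly_cost(price_per_hour):
    """Cost of keeping one instance up for a full (assumed) month."""
    return price_per_hour * HOURS_PER_MONTH


def cost_per_million_tokens(price_per_hour, tps):
    """Dollars per one million generated tokens at a sustained rate of tps."""
    tokens_per_hour = tps * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


price, tps = 2.37, 181.920  # teslaa100-1.16.64.160, first price list
print(f"monthly: ${monthly_cost(price):,.2f}")
print(f"per 1M tokens: ${cost_per_million_tokens(price, tps):.2f}")
```

Real utilization is rarely 100%, so the effective per-token cost of a dedicated instance will usually be higher than this lower bound.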

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.