NVIDIA-Nemotron-3-Nano-30B-A3B

Tags: reasoning

Nemotron-3 Nano-30B is a new-generation LLM from NVIDIA. Its key feature is an innovative hybrid architecture that combines Mamba2 layers, Transformer attention layers, and Mixture-of-Experts (MoE) blocks in a single stack. This design lets the model process massive inputs efficiently while maintaining logical coherence and high throughput. The model has 32 billion parameters in total, but thanks to MoE routing only an active subset of approximately 3.5 billion parameters is engaged to generate each individual token. The result is an unusual balance: the model carries the knowledge and capacity of a 30B-scale network while consuming compute on par with compact models optimized for fast inference. It was trained on roughly 25 trillion tokens spanning 43 programming languages and more than 19 natural languages.
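
To make the routing idea concrete, here is a toy top-k MoE router in plain NumPy. Only the expert count (128, from the spec below) comes from this page; the top-k value and the dimensions are illustrative stand-ins, not the model's actual configuration:

import numpy as np

n_experts, top_k, d_model = 128, 8, 16   # 128 experts per the spec; top_k and d_model are illustrative
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ router_w                    # router scores every expert for this token
    top = np.argsort(logits)[-top_k:]        # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over the selected experts only
    # Only top_k of the 128 expert weight matrices are touched for this token,
    # which is why the active parameter count stays a small fraction of the total.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)              # (16,)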

Compared to Nemotron v2, the new version replaces the dense architecture with MoE, delivering roughly 4x higher throughput. Another key capability of Nemotron-3 Nano is support for a context window of up to 1 million tokens, an expansion that plays to the strengths of the Mamba2 layers, which process long sequences with minimal memory overhead. A crucial stage in the model's development was multi-environment reinforcement learning with the NeMo Gym library: the model was trained not just to answer questions but to execute action sequences, calling tools, writing functional code, and building multi-step plans. This makes its behavior more predictable and reliable in complex scenarios that require step-by-step verification of results.
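
Tool calling can be exercised directly through the public endpoint described below, which is OpenAI-compatible and lists tooling support. This is a hedged sketch, assuming the endpoint honors the standard OpenAI tools parameter; get_weather is a hypothetical tool defined only for this example:

from openai import OpenAI

client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/",
)

# Hypothetical tool schema for the example; not part of the model or the endpoint.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
# If the model decides to call the tool, the arguments arrive as JSON text.
print(resp.choices[0].message.tool_calls)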

On the AIME25 benchmark (American Invitational Mathematics Examination), which tests mathematical and quantitative reasoning, Nemotron-3 Nano achieves 99.2% accuracy with tool use, surpassing GPT-OSS-20B at 98.7%. On LiveCodeBench (v6), the model scores 68.2%, outperforming Qwen3-30B (66.0%) and GPT-OSS-20B (61.0%). On other benchmarks the model either leads or is on par with comparable models.

Given its architectural advantages and NVIDIA's recommendations, the model is ideally suited for the following tasks: Agentic Systems and Orchestration, Long-Context RAG, Local/On-Prem and Edge Computing, Code Generation, and Data Structuring.
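
As a small illustration of the data-structuring use case, the sketch below asks the model, via the public endpoint described later on this page, to convert free text into JSON. The invoice text and the output schema are invented for the example:

from openai import OpenAI

client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/",
)

note = "Invoice #4217 from ACME, issued 2025-11-03, total 1,250.00 EUR."
resp = client.chat.completions.create(
    model="NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[
        # The schema below is made up for this illustration.
        {"role": "system", "content": 'Extract the fields and reply with JSON only, e.g. '
                                      '{"invoice_id": "...", "vendor": "...", "date": "...", "total": 0.0, "currency": "..."}'},
        {"role": "user", "content": note},
    ],
    temperature=0,
)
print(resp.choices[0].message.content)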


Announce Date: 15.12.2025
Parameters: 32B
Experts: 128
Activated at inference: 4B
Context: 262K
Layers: 52 (full attention: 6, no attention: 23)
Attention Type: Hybrid Attention
Mamba Type: Mamba 2
Developer: NVIDIA
Transformers Version: 4.55.4
License: NVIDIA Nemotron Open Model License

Public endpoint

Use our pre-built public endpoints for free to test inference and explore NVIDIA-Nemotron-3-Nano-30B-A3B capabilities. You can obtain an API access token on the token management page after registration and verification.
Model Name                                     Context  Type    Tooling  Status     Link
stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ  262,144  Public  yes      AVAILABLE  chat

API access to NVIDIA-Nemotron-3-Nano-30B-A3B endpoints

cURL:

curl https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer USER_API_KEY" \
  -d '{"model": "NVIDIA-Nemotron-3-Nano-30B-A3B", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

PowerShell:

$response = Invoke-WebRequest https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/chat/completions `
  -Method POST `
  -Headers @{
    "Authorization" = "Bearer USER_API_KEY"
    "Content-Type"  = "application/json"
  } `
  -Body (@{
    model    = "NVIDIA-Nemotron-3-Nano-30B-A3B"
    messages = @(
      @{ role = "system"; content = "You are a helpful assistant." },
      @{ role = "user"; content = "Say this is a test" }
    )
  } | ConvertTo-Json -Depth 5)   # -Depth keeps the nested messages array from being truncated
($response.Content | ConvertFrom-Json).choices[0].message.content

Python (OpenAI SDK):

# pip install openai --upgrade

from openai import OpenAI

client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/nemotron3-nano-30b-a3b/generate/",
)

chat_response = client.chat.completions.create(
    model="NVIDIA-Nemotron-3-Nano-30B-A3B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(chat_response.choices[0].message.content)

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable the full context window for long sequences,
  • guarantee top-tier security by processing data in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters (see the loading sketch after this list).
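
For the custom-weights scenario, a minimal local-inference sketch with Hugging Face Transformers (version 4.55.4 or later, per the spec above) might look as follows. The repository id is an assumption, not a confirmed location of the weights:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 for the hybrid Mamba2/attention stack
    device_map="auto",            # spread layers across available GPUs
    trust_remote_code=True,       # hybrid architectures often ship custom modeling code
)

messages = [{"role": "user", "content": "Say this is a test"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
out = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))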

Recommended server configurations for hosting NVIDIA-Nemotron-3-Nano-30B-A3B

Prices:
Name                           Context  Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslaa10-1.16.32.160           262,144  —            1     $0.53        1.607    —
teslat4-2.16.32.160            262,144  tensor       2     $0.54        4.648    —
teslaa2-2.16.32.160            262,144  tensor       2     $0.57        4.648    —
rtx3090-1.16.24.160            262,144  —            1     $0.83        1.607    —
rtx2080ti-3.12.24.120          262,144  pipeline     3     $0.84        3.612    —
rtx4090-1.16.32.160            262,144  —            1     $1.02        1.607    —
rtx2080ti-4.16.32.160          262,144  tensor       4     $1.12        8.400    —
rtxa5000-2.16.64.160.nvlink    262,144  tensor       2     $1.23        13.964   —
rtx3080-3.16.64.160            262,144  pipeline     3     $1.43        1.865    —
rtx5090-1.16.64.160            262,144  —            1     $1.59        6.265    —
rtx3080-4.16.64.160            262,144  tensor       4     $1.82        6.071    —
teslaa100-1.16.64.160          262,144  —            1     $2.37        34.215   —
h100-1.16.64.160               262,144  —            1     $3.83        34.215   —
h100nvl-1.16.96.160            262,144  —            1     $4.11        42.367   —
teslaa100-2.24.96.160.nvlink   262,144  tensor       2     $4.61        79.181   —
h200-1.16.128.160              262,144  —            1     $4.74        69.735   —
h200-2.24.256.160.nvlink       262,144  tensor       2     $9.40        150.220  —
Prices:
Name                           Context  Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslat4-3.32.64.160            262,144  pipeline     3     $0.88        2.394    —
teslaa10-2.16.64.160           262,144  tensor       2     $0.93        4.011    —
teslat4-4.16.64.160            262,144  tensor       4     $0.96        10.093   —
teslaa2-3.32.128.160           262,144  pipeline     3     $1.06        2.394    —
rtxa5000-2.16.64.160.nvlink    262,144  tensor       2     $1.23        4.011    —
teslaa2-4.32.128.160           262,144  tensor       4     $1.26        10.093   —
rtx3090-2.16.64.160            262,144  tensor       2     $1.56        4.011    —
rtx4090-2.16.64.160            262,144  tensor       2     $1.92        4.011    —
teslaa100-1.16.64.160          262,144  —            1     $2.37        24.262   —
rtx5090-2.16.64.160            262,144  tensor       2     $2.93        13.328   —
h100-1.16.64.160               262,144  —            1     $3.83        134.650  24.262
h100nvl-1.16.96.160            262,144  —            1     $4.11        32.414   —
teslaa100-2.24.96.160.nvlink   262,144  tensor       2     $4.61        69.228   —
h200-1.16.128.160              262,144  —            1     $4.74        59.782   —
h200-2.24.256.160.nvlink       262,144  tensor       2     $9.40        140.267  —
Prices:
Name                           Context  Parallelism  GPUs  Price, hour  TPS      Max Concurrency
teslaa2-6.32.128.240           262,144  pipeline     6     $1.66        5.331    —
teslaa10-4.16.128.240          262,144  tensor       4     $1.76        8.566    —
teslaa100-1.16.128.240         262,144  —            1     $2.51        4.102    —
rtx3090-4.16.96.320            262,144  tensor       4     $2.97        8.566    —
rtx4090-4.16.96.320            262,144  tensor       4     $3.68        8.566    —
h100-1.16.128.240              262,144  —            1     $3.96        4.102    —
h100nvl-1.16.96.240            262,144  —            1     $4.12        12.254   —
rtx5090-3.16.96.240            262,144  pipeline     3     $4.35        10.184   —
h200-1.16.128.240              262,144  —            1     $4.74        39.622   —
teslaa100-2.24.256.320.nvlink  262,144  tensor       2     $4.94        49.068   —
rtx5090-4.16.128.320           262,144  tensor       4     $5.76        27.199   —
h200-2.24.256.240.nvlink       262,144  tensor       2     $9.41        120.107  —

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.