import { Nemotron3NanoOmniDeployment } from '/src/snippets/autoregressive/nemotron3-nano-omni-deployment.jsx';
NVIDIA Nemotron 3 Nano Omni is a 30B-parameter hybrid MoE multimodal model that activates only 3B parameters per forward pass, combining vision and audio encoders into a unified architecture. Part of the Nemotron 3 family, it is designed to power multimodal sub-agents that perceive and reason across vision, audio, and language in a single inference loop — eliminating the fragmented stacks of separate models for each modality.
Architecture and key features:
- Modalities: text, image, video, and audio input; text output
- Supported GPUs: NVIDIA B200, H100, H200, A100, L40S, DGX Spark, RTX 6000
Available model variants on HuggingFace:
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-BF16
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8
- nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4
Agentic workloads this model enables:
Install SGLang via pip or from source:
# Install via pip
pip install sglang
# Or install from source
uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
# Or use Docker
docker pull lmsysorg/sglang:dev-cu13-nemotronh-nano-omni-reasoning-v3
For the full Docker setup and other installation methods, refer to the official SGLang installation guide.
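To verify the installation, import the package and print its version; a minimal sanity check, assuming your release exposes __version__ at the top level (current releases do):
import sglang

# A clean import plus a version string confirms the install
print(sglang.__version__)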
This section provides a progressive guide from quick deployment to performance tuning.
Interactive Command Generator: select hardware, model variant, and common knobs to generate a launch command.
<Nemotron3NanoOmniDeployment />
Attention backend:
H100/H200: the FlashAttention 3 backend is used by default. B200: the FlashInfer backend is used by default.
TP support:
To set tensor parallelism, use --tp <1|2|4|8>. A 4×H100 setup is recommended for the BF16/Reasoning variant.
FP8 KV cache:
To enable the FP8 KV cache, append --kv-cache-dtype fp8_e4m3. The FP8 KV cache trades a small amount of accuracy for memory; omit the flag if you observe accuracy regressions on your workload. A combined launch example follows these knobs.
Reasoning parser:
Append --reasoning-parser deepseek-r1 to enable structured reasoning traces (reasoning_content field in the response).
Tool calling:
Append --tool-call-parser qwen3_coder to enable tool calling support.
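As an illustration of combining these knobs, the sketch below launches on 2 GPUs with an explicit FlashInfer backend and the FP8 KV cache. Treat it as a syntax example rather than a tuned recommendation; --attention-backend is SGLang's backend override flag, and the 2-GPU/FP8 pairing here is an assumption:
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 2 \
  --trust-remote-code \
  --attention-backend flashinfer \
  --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser qwen3_coder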
The command below launches the server for a 4×H100 setup with reasoning and tool calling enabled. See Section 4.8 for FP8 and NVFP4 variants.
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser deepseek-r1
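Once the server reports ready, a quick way to confirm it is reachable and serving the expected model is to list models through the OpenAI-compatible endpoint (assuming the host and port above):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# The served model ID should appear in this list
print([m.id for m in client.models.list().data])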
SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)
Output:
Reasoning: SGLang is a serving framework I know from my training data. Let me recall the key features...
Content:
- **Radix Attention** — SGLang reuses KV cache across requests sharing a common prefix, dramatically reducing memory and compute for multi-turn and few-shot workloads.
- **OpenAI-compatible API** — Drop-in replacement for the OpenAI Python client; no application code changes required to serve a locally-hosted model.
- **High-throughput serving** — Continuous batching, chunked prefill, and optimized CUDA kernels deliver state-of-the-art throughput on NVIDIA GPUs across A100, H100, and B200.
Streaming chat completion:
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"},
    ],
    temperature=0.6,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
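With the reasoning parser enabled, streamed deltas can also carry a reasoning_content field ahead of the final answer. A variant of the loop above that prints both; the getattr guard is there because reasoning_content is a server-side extension of the OpenAI schema:
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens stream before the answer when the reasoning parser is on
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)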
Pass image inputs using the OpenAI vision format; both URLs and base64-encoded images are supported:
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# From URL
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"},
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)
For local images, encode as base64:
import base64
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
with open("screenshot.png", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = client.chat.completions.create(
model=SERVED_MODEL_NAME,
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_b64}"},
},
{"type": "text", "text": "What UI elements are visible on this screen? What action would you take next?"},
],
}
],
temperature=0.6,
max_tokens=512,
)
print(resp.choices[0].message.content)
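The data-URL encoding step repeats for every local file in the examples below, so a small helper can keep request bodies tidy (a convenience sketch, not part of the SGLang or OpenAI API):
import base64
import mimetypes

def to_data_url(path: str) -> str:
    # Guess the MIME type from the file extension, falling back to a generic type
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode("utf-8")

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("screenshot.png")}}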
Nemotron 3 Nano Omni uses Conv3D layers and Efficient Video Sampling (EVS) for temporal-spatial video reasoning, processing longer videos within the same compute budget:
import base64
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
with open("video.mp4", "rb") as f:
video_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = client.chat.completions.create(
model=SERVED_MODEL_NAME,
messages=[
{
"role": "user",
"content": [
{
"type": "video_url",
"video_url": {"url": f"data:video/mp4;base64,{video_b64}"},
},
{"type": "text", "text": "Summarize what happens in this video step by step."},
],
}
],
temperature=0.6,
max_tokens=1024,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)
Pass audio inputs as base64-encoded WAV or MP3 data:
import base64
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
with open("audio.wav", "rb") as f:
audio_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = client.chat.completions.create(
model=SERVED_MODEL_NAME,
messages=[
{
"role": "user",
"content": [
{
"type": "input_audio",
"input_audio": {"data": audio_b64, "format": "wav"},
},
{"type": "text", "text": "Transcribe and summarize what was said in this audio."},
],
}
],
temperature=0.6,
max_tokens=512,
)
print(resp.choices[0].message.content)
Combine modalities in a single request. The example below pairs a chart image with a text prompt; a sketch that pairs an image with a spoken question follows it:
import base64
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
with open("chart.png", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = client.chat.completions.create(
model=SERVED_MODEL_NAME,
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_b64}"},
},
{"type": "text", "text": "Analyze this chart. What are the key trends and what conclusion does the data support?"},
],
}
],
temperature=0.6,
max_tokens=1024,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)
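To mix non-text modalities directly, include one content part per input. A sketch pairing the same chart with a spoken question; question.wav is a hypothetical local recording:
import base64
from openai import OpenAI

SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
with open("question.wav", "rb") as f:  # hypothetical spoken question about the chart
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(resp.choices[0].message.content)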
The model supports two modes, Reasoning ON (the default) and Reasoning OFF. Toggle per request by setting enable_thinking to False:
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# Reasoning ON (default)
print("Reasoning on")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the derivative of x^3 sin(x)?"},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(f"Reasoning:\n{resp.choices[0].message.reasoning_content[:300]}...\nContent:\n{resp.choices[0].message.content}")
print("\n")
# Reasoning OFF
print("Reasoning off")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 15% of 200?"},
    ],
    temperature=0.6,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(f"Content:\n{resp.choices[0].message.content}")
Output:
Reasoning on
Reasoning:
The user wants the derivative of x^3 sin(x). I'll apply the product rule: d/dx[u·v] = u'v + uv'. Here u = x^3, v = sin(x). So u' = 3x^2, v' = cos(x). The result is 3x^2·sin(x) + x^3·cos(x)...
Content:
Using the product rule: d/dx[x³ sin(x)] = 3x² sin(x) + x³ cos(x)
Reasoning off
Content:
15% of 200 is **30**.
Call functions using the OpenAI Tools schema. The server must be launched with --tool-call-parser qwen3_coder:
from openai import OpenAI
SERVED_MODEL_NAME = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        },
    }
]
completion = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Santa Clara, CA?"},
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False,
)
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.tool_calls)
Output:
The user is asking about weather in Santa Clara, CA. I have a get_weather function that takes a location and optional unit. I should call it with location="Santa Clara, CA".
[ChatCompletionMessageFunctionToolCall(id='call_abc123', function=Function(arguments='{"location": "Santa Clara, CA", "unit": "fahrenheit"}', name='get_weather'), type='function', index=0)]
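To close the loop, run the function locally and send its result back as a tool message so the model can produce a grounded final answer. A minimal sketch continuing the example above; get_weather_impl and its return value are hypothetical stand-ins for a real weather lookup:
import json

def get_weather_impl(location, unit="fahrenheit"):
    # Hypothetical local implementation of the advertised tool
    return {"location": location, "temperature": 72, "unit": unit}

call = completion.choices[0].message.tool_calls[0]
result = get_weather_impl(**json.loads(call.function.arguments))

followup = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Santa Clara, CA?"},
        completion.choices[0].message,  # assistant turn carrying the tool_calls
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ],
    tools=TOOLS,
    temperature=0.6,
    max_tokens=512,
)
print(followup.choices[0].message.content)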
FP8 variant (recommended for throughput-critical serving on H100/H200/B200):
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser deepseek-r1
NVFP4 variant (maximum efficiency on Blackwell B200):
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-NVFP4 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 2 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser deepseek-r1
Nemotron 3 Nano Omni achieves 9× higher throughput than other open omni models at the same interactivity level, delivering lower cost and better scalability without sacrificing responsiveness. It also achieves ~20% higher multimodal intelligence compared to the best open alternative across image, video, and audio reasoning tasks.
Test Environment:
Model Deployment Command:
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --trust-remote-code \
  --tp 4 \
  --max-running-requests 1024 \
  --host 0.0.0.0 \
  --port 30000
Benchmark Command:
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 4096 \
  --max-concurrency 256
Environment
Launch Model
sglang serve \
  --model-path nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning \
  --trust-remote-code \
  --tp 4 \
  --reasoning-parser deepseek-r1
Run GSM8K Benchmark
python3 benchmark/gsm8k/bench_sglang.py --port 30000
Run MMLU Benchmark
python3 benchmark/mmlu/bench_sglang.py --port 30000