import { Nemotron3UltraDeployment } from '/src/snippets/autoregressive/nemotron3-ultra-deployment.jsx';
NVIDIA Nemotron3-Ultra is an open frontier reasoning model in the Nemotron 3 family, built for long-running autonomous agents. It is optimized for complex orchestration across coding, deep research, enterprise workflows, and EDA use cases where agents must sustain reasoning across many steps and large context windows.
Nemotron 3 Ultra is a 550B-parameter hybrid MoE model that activates only 55B parameters per forward pass, combining frontier reasoning accuracy with high-throughput inference. It supports a 1M-token context window so agents can keep conversation history, tool outputs, and plan state in view across persistent workflows.
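To make the parameter counts above concrete, here is a rough back-of-the-envelope sketch of weight memory. It assumes 2 bytes per parameter for BF16, an even split across tensor-parallel ranks, and counts weights only (no KV cache, activations, or runtime overhead), so treat the numbers as lower bounds rather than sizing guidance. The helper name `weight_footprint_gb` is illustrative.

```python
def weight_footprint_gb(total_params: float, bytes_per_param: float, tp: int = 1) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes), weights only."""
    return total_params * bytes_per_param / tp / 1e9


TOTAL_PARAMS = 550e9   # total parameters, from the text
ACTIVE_PARAMS = 55e9   # parameters active per forward pass, from the text
BF16_BYTES = 2.0       # assumption: 16-bit weights, no extra metadata

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
print(f"BF16 weights, whole model: {weight_footprint_gb(TOTAL_PARAMS, BF16_BYTES):.0f} GB")
print(f"BF16 weights per GPU at tp=16: {weight_footprint_gb(TOTAL_PARAMS, BF16_BYTES, tp=16):.2f} GB")
```

A 4-bit format such as NVFP4 would be roughly 0.5 bytes per parameter before block-scale overhead, which is why it fits on far fewer GPUs.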
Architecture and key features:
Modalities: Input: text — Output: text
Supported GPUs:
Available model variants on HuggingFace:
Nemotron3-Ultra support has not yet propagated to lmsysorg/sglang:latest or any stable release. Pull one of the two dedicated images below — matching your CUDA version — to get a runtime with Nemotron3-Ultra support.
# CUDA 13
docker pull lmsysorg/sglang:dev-nemotron3-ultra
# CUDA 12
docker pull lmsysorg/sglang:dev-cu12-nemotron3-ultra
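If you script the setup, a small lookup can pick the image from the CUDA major version. This is only a sketch: the two tags come from the commands above, while the `image_for_cuda` helper and its version-string parsing are illustrative assumptions.

```python
# Image tags taken from the pull commands above, keyed by CUDA major version.
SGLANG_IMAGES = {
    13: "lmsysorg/sglang:dev-nemotron3-ultra",
    12: "lmsysorg/sglang:dev-cu12-nemotron3-ultra",
}


def image_for_cuda(cuda_version: str) -> str:
    """Return the dedicated image for a CUDA version string such as '12.8'."""
    major = int(cuda_version.split(".")[0])
    if major not in SGLANG_IMAGES:
        raise ValueError(f"no dedicated Nemotron3-Ultra image listed for CUDA {major}")
    return SGLANG_IMAGES[major]


print(image_for_cuda("12.8"))
```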
This section provides a progressive guide from quick deployment to performance tuning.
Interactive Command Generator: select model precision, hardware, tensor parallelism, and common knobs to generate a launch command.
The generator only emits a runnable command for combinations that NVIDIA / SGLang have validated. Selecting an unverified tuple (e.g. NVFP4 on H100/H200, BF16 with TP=4 on H100, …) is blocked — the command pane shows an explicit error and the verified support matrix instead of a launch line, so unvalidated commands can't be copied by accident.
<Nemotron3UltraDeployment />

Attention backend:
- H100/H200: Use flash attention 3 backend by default.
- B200/GB200/B300/GB300: Use flashinfer backend by default.
TP support:
To set tp size, use --tp <4|8|16>. Recommended pairings:
- --tp 16 on H100/H200, --tp 8 on B200/B300
- --tp 4 or --tp 8 on B200/B300, --tp 4 on GB200/GB300

Multi-node BF16 on H100:
The 16×H100 BF16 setup spans two nodes. Use --dist-init-addr <head-node-ip>:5000 --nnodes 2 --node-rank <0|1> on each node and keep --tp 16.
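The two per-node commands differ only in `--node-rank`, so they can be generated from one template. The sketch below uses only flags mentioned on this page; the `launch_command` helper and the `HEAD_ADDR` value are illustrative placeholders, not part of SGLang.

```python
import shlex


def launch_command(model_path: str, tp: int, nnodes: int, node_rank: int, head_addr: str) -> str:
    """Build the sglang.launch_server command line for one node of a multi-node run."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--trust-remote-code",
        "--tp", str(tp),                 # tp stays 16 on both nodes
        "--dist-init-addr", head_addr,   # same head-node address on every node
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),   # the only value that differs per node
    ]
    return shlex.join(args)


# Placeholder address: substitute the head node's reachable IP.
HEAD_ADDR = "10.0.0.1:5000"
for rank in range(2):
    print(launch_command("nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16", 16, 2, rank, HEAD_ADDR))
```

Run the rank 0 command on the head node and the rank 1 command on the second node.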
DP attention:
By default the attention layers are tensor-parallel (sharded across all TP ranks). Enabling DP attention (the toggle above, or --dp <N> --enable-dp-attention) instead runs attention as N data-parallel groups: each DP rank serves its own slice of the requests with its own KV cache. --dp must divide --tp.
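Because `--dp` must divide `--tp`, the legal values can be enumerated up front. A minimal sketch follows; the helper names are hypothetical, and the divisibility check is the only rule it encodes.

```python
def valid_dp_sizes(tp: int) -> list[int]:
    """All --dp values that evenly divide the given --tp."""
    return [dp for dp in range(1, tp + 1) if tp % dp == 0]


def dp_attention_flags(tp: int, dp: int) -> list[str]:
    """Return the launch flags for DP attention, rejecting dp values that do not divide tp."""
    if dp not in valid_dp_sizes(tp):
        raise ValueError(f"--dp {dp} does not divide --tp {tp}; choose one of {valid_dp_sizes(tp)}")
    return ["--tp", str(tp), "--dp", str(dp), "--enable-dp-attention"]


print(valid_dp_sizes(8))
print(" ".join(dp_attention_flags(8, 4)))
```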
FP8 KV cache:
To enable the FP8 KV cache, append --kv-cache-dtype fp8_e4m3.
Reasoning parser:
Append --reasoning-parser nemotron_3 to enable structured reasoning traces (reasoning_content field in the response).
Tool calling:
Append --tool-call-parser qwen3_coder to enable tool calling support.
python3 -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 \
--host 0.0.0.0 \
--port 5000 \
--trust-remote-code \
--tp 8 \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_3
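Loading a model of this size takes a while, so client scripts usually need to wait until the server answers. The following is a minimal polling sketch using only the standard library. It assumes the OpenAI-compatible `/v1/models` route returns HTTP 200 once the server is ready; the `wait_for_server` helper and its timeouts are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll <base_url>/models until it returns HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    return False
```

For the launch command above, call `wait_for_server("http://localhost:5000/v1")` before sending the first request.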
SGLang provides an OpenAI-compatible endpoint. Example with the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:5000/v1",
api_key="EMPTY",
)
resp = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Give me 3 bullet points about SGLang."},
],
temperature=0.6,
max_tokens=1024,
)
print("Reasoning:", resp.choices[0].message.reasoning_content, "\nContent:", resp.choices[0].message.content)
print("\n")
Output:
Reasoning: The user wants 3 bullet points about SGLang. Let me recall what I know about SGLang — it's a high-performance serving framework for large language models with a focus on structured generation and efficient KV cache reuse...(more tokens)
Content: - **Radix Attention** — SGLang reuses KV cache across requests sharing a common prefix, dramatically reducing memory and compute for multi-turn agent loops and few-shot workloads.
- **OpenAI-compatible API and structured generation** — Drop-in replacement for the OpenAI client, with first-class support for constrained decoding (JSON schema, regex) and OpenAI-style tool calling.
- **High-throughput serving on NVIDIA GPUs** — Continuous batching, chunked prefill, FP8/NVFP4 quantization, and optimized CUDA kernels deliver state-of-the-art throughput across H100, H200, B200, and GB200.
Streaming chat completion:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:5000/v1",
api_key="EMPTY",
)
stream = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What are the first 5 prime numbers?"}
],
temperature=0.7,
max_tokens=1024,
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
Output:
The first 5 prime numbers are:
**2, 3, 5, 7, 11**.
### Explanation:
- A **prime number** is a natural number greater than 1 whose only positive divisors are 1 and itself.
- **2** is the smallest prime and the only even prime.
- **3, 5, 7, 11** are each divisible only by 1 and themselves.
- **1** is not prime by definition (it has only one positive divisor).
- **4, 6, 8, 9, 10** are composite.
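The streaming loop above prints only the final answer. When the server runs with --reasoning-parser, the reasoning trace may also arrive incrementally; the sketch below assumes each streamed delta can carry a `reasoning_content` attribute alongside `content`, and reads it defensively with `getattr` in case it is absent. The `collect_stream` helper and the fake chunks are illustrative stand-ins so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Accumulate reasoning and answer text separately from a chat completion stream."""
    reasoning_parts, content_parts = [], []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        content = getattr(delta, "content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def _fake_chunk(content=None, reasoning=None):
    """Build an object shaped like a streamed chunk (for demonstration only)."""
    delta = SimpleNamespace(content=content, reasoning_content=reasoning)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk(reasoning="List small "),
    _fake_chunk(reasoning="primes."),
    _fake_chunk(content="2, 3, "),
    _fake_chunk(content="5, 7, 11"),
]
reasoning, content = collect_stream(fake_stream)
print("Reasoning:", reasoning)
print("Content:", content)
```

Passing the real `stream` object from `client.chat.completions.create(..., stream=True)` in place of `fake_stream` should work the same way under the assumption stated above.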
The model supports two modes: reasoning on (the default) and reasoning off. To turn reasoning off, pass enable_thinking=False through chat_template_kwargs, as in the second request below.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:5000/v1",
api_key="EMPTY",
)
# Reasoning on (default)
print("Reasoning on")
resp = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Plan a 3-step approach to debug a flaky integration test. Keep the thinking process short."}
],
temperature=1,
max_tokens=1024,
)
print(f"Reasoning: \n{resp.choices[0].message.reasoning_content[:200]}... \nContent: \n{resp.choices[0].message.content[:200]}...")
print("\n")
# Reasoning off
print("Reasoning off")
resp = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me 3 facts about SGLang."}
],
temperature=0,
max_tokens=256,
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(f"Content: \n{resp.choices[0].message.content[:200]}...")
Output:
Reasoning on
Reasoning:
The user wants a short reasoning chain plus a 3-step debug plan for a flaky integration test. I'll think briefly about common causes (timing/race, shared state, external service variance) and pick a t...
Content:
1. **Reproduce deterministically** — run the test in a loop (e.g. 50–100x) with logging at the suspected race points to confirm the failure rate and surface ordering.
2. **Isolate state** — re-run with...
Reasoning off
Content:
Here are 3 facts about SGLang:
1. **High-performance LLM serving system** developed at UC Berkeley with contributions from a broad open-source community, focused on throughput and latency at scale.
...
Call functions using the OpenAI Tools schema and inspect returned tool_calls. The server must be launched with --tool-call-parser qwen3_coder.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:5000/v1",
api_key="EMPTY",
)
# Tool calling via OpenAI tools schema
TOOLS = [
{
"type": "function",
"function": {
"name": "search_codebase",
"description": "Search the project codebase for a symbol or pattern.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The symbol, function name, or regex to search for"
},
"path": {
"type": "string",
"description": "Optional sub-path to restrict the search to"
}
},
"required": ["query"]
}
}
}
]
completion = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16",
messages=[
{"role": "system", "content": "You are a coding agent. Use tools to inspect the repo before answering."},
{"role": "user", "content": "Where is the `RadixCache` class defined?"}
],
tools=TOOLS,
temperature=0.6,
top_p=0.95,
max_tokens=512,
stream=False
)
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.tool_calls)
Output:
The user is asking where the RadixCache class is defined. I should search the codebase for the symbol "RadixCache" to find the file and line. I'll call search_codebase with that query.
[ChatCompletionMessageFunctionToolCall(id='call_8a7f2c4e1b9d4a3e8c2f1d6b', function=Function(arguments='{"query": "class RadixCache"}', name='search_codebase'), type='function', index=0)]
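The response above stops at the tool request; the agent still has to run the tool and send the result back. A minimal sketch of that step follows. The `search_codebase` body here is a hypothetical stand-in (a real agent would search the repository), and the fake call object only mimics the shape of the printed `tool_calls` entry so the example runs offline.

```python
import json
from types import SimpleNamespace


def search_codebase(query: str, path: str = ".") -> str:
    """Hypothetical stand-in for a real code search tool."""
    return f"(stub) would search {path!r} for {query!r}"


# Maps tool names declared in TOOLS to local Python callables.
TOOL_REGISTRY = {"search_codebase": search_codebase}


def tool_result_messages(tool_calls) -> list[dict]:
    """Run each requested tool and build the matching role='tool' messages."""
    results = []
    for call in tool_calls:
        fn = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({"role": "tool", "tool_call_id": call.id, "content": fn(**kwargs)})
    return results


# Offline demo using an object shaped like the tool call printed above.
fake_call = SimpleNamespace(
    id="call_8a7f2c4e1b9d4a3e8c2f1d6b",
    function=SimpleNamespace(name="search_codebase", arguments='{"query": "class RadixCache"}'),
)
tool_messages = tool_result_messages([fake_call])
print(tool_messages)
```

To continue the conversation, append the assistant message that contains the tool calls, then these tool messages, and call `client.chat.completions.create` again with the same `tools` list.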
The reasoning_budget parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character.
If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at reasoning_budget + 500 tokens.
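The cut-off rule described above can be illustrated in isolation. This is a simplified sketch of the rule as stated, not the model's or the server's implementation: it treats the reasoning trace as a list of token strings and returns how many tokens would be kept. The `reasoning_cutoff` helper and its `grace` parameter are hypothetical names.

```python
def reasoning_cutoff(tokens: list[str], reasoning_budget: int, grace: int = 500) -> int:
    """Return how many reasoning tokens are kept under the budget rule.

    After `reasoning_budget` tokens, stop at the first token containing a newline;
    if none appears within `grace` further tokens, stop at reasoning_budget + grace.
    """
    if len(tokens) <= reasoning_budget:
        return len(tokens)  # reasoning finished before the budget was reached
    hard_limit = min(len(tokens), reasoning_budget + grace)
    for i in range(reasoning_budget, hard_limit):
        if "\n" in tokens[i]:
            return i + 1  # keep the newline token, then stop
    return hard_limit


with_newline = ["a"] * 10 + ["\n"] + ["b"] * 5
print(reasoning_cutoff(with_newline, reasoning_budget=8))
print(reasoning_cutoff(["a"] * 1000, reasoning_budget=100))
```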
from typing import Any, Dict, List

import openai
from transformers import AutoTokenizer


class ThinkingBudgetClient:
    def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
        self.base_url = base_url
        self.api_key = api_key
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

    def chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        reasoning_budget: int = 512,
        max_tokens: int = 1024,
        **kwargs,
    ) -> Dict[str, Any]:
        assert (
            max_tokens > reasoning_budget
        ), f"reasoning_budget must be smaller than max_tokens. Given {max_tokens=} and {reasoning_budget=}"
        # 1. First call chat completion, capped at the budget, to get the reasoning content.
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=reasoning_budget,
            **kwargs
        )
        reasoning_content = response.choices[0].message.reasoning_content or ""
        if "</think>" not in reasoning_content:
            # Reasoning was cut off at the budget: end it with a period and close the tag.
            reasoning_content = f"{reasoning_content}.\n</think>\n\n"
        reasoning_tokens_used = len(
            self.tokenizer.encode(reasoning_content, add_special_tokens=False)
        )
        remaining_tokens = max_tokens - reasoning_tokens_used
        assert (
            remaining_tokens > 0
        ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase max_tokens or lower reasoning_budget."
        # 2. Append the reasoning as a partial assistant turn and let the model continue.
        #    Build a new list so the caller's messages are not mutated.
        messages = [*messages, {"role": "assistant", "content": reasoning_content}]
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            continue_final_message=True,
        )
        response = self.client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=remaining_tokens,
            **kwargs
        )
        response_data = {
            # removesuffix drops only the literal closing tag; str.strip("</think>")
            # would also remove leading/trailing characters such as "t", "h", or "k".
            "reasoning_content": reasoning_content.strip().removesuffix("</think>").strip(),
            "content": response.choices[0].text,
            "finish_reason": response.choices[0].finish_reason,
        }
        return response_data
Usage example with reasoning_budget=256:
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16"
# Client
client = ThinkingBudgetClient(
base_url="http://127.0.0.1:5000/v1",
api_key="null",
tokenizer_name_or_path=SERVED_MODEL_NAME
)
resp = client.chat_completion(
model=SERVED_MODEL_NAME,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Outline a research plan to evaluate the throughput of two MoE serving strategies."}
],
temperature=1,
max_tokens=1024,
reasoning_budget=256
)
print("Reasoning:", resp["reasoning_content"], "\nContent:", resp["content"])
Output:
Reasoning: The user wants a research plan to compare throughput of two MoE serving strategies. I should outline goals, baselines, datasets, metrics (tokens/s, TTFT, ITL, MFU), variables to sweep (TP, batch size, sequence length, concurrency), and statistical handling. Keep it concise since reasoning_budget is 256...
Content:
**Research plan**
1. **Define goal & metrics** — peak token throughput (input+output), TTFT, P99 ITL, MFU; measured at fixed accuracy.
2. **Choose baselines** — Strategy A (TP-only) vs Strategy B (TP + expert-parallel). Hold model checkpoint, precision, and KV-cache dtype constant.
3. **Sweep** — `{batch ∈ 1,4,16,64, concurrency ∈ 16,64,256, seq_len ∈ 1k,8k,32k}` per strategy.
4. **Workload** — `sglang.bench_serving --dataset-name random` with matched input/output budgets.
5. **Analysis** — per-config throughput table + roofline overlay; bootstrap CIs over 3 reruns to bound noise.
Test Environment:
Hardware: GB200 (4x)
Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
Tensor Parallelism: 4
SGLang Version: main branch
Model Deployment Command:
python3 -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--trust-remote-code \
--tp 4 \
--max-running-requests 1024 \
--host 0.0.0.0 \
--port 5000
Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 0.0.0.0 \
--port 5000 \
--model nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 4096 \
--max-concurrency 256
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 256
Successful requests: 4096
Benchmark duration (s): 1184.58
Total input tokens: 2081726
Total input text tokens: 2081726
Total generated tokens: 2087288
Total generated tokens (retokenized): 1990224
Request throughput (req/s): 3.46
Input token throughput (tok/s): 1757.35
Output token throughput (tok/s): 1762.05
Peak output token throughput (tok/s): 3150.00
Peak concurrent requests: 266
Total token throughput (tok/s): 3519.40
Concurrency: 249.55
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 72169.95
Median E2E Latency (ms): 71994.47
P90 E2E Latency (ms): 99898.56
P99 E2E Latency (ms): 107119.61
---------------Time to First Token----------------
Mean TTFT (ms): 40057.33
Median TTFT (ms): 41375.93
P99 TTFT (ms): 46377.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 63.15
Median TPOT (ms): 63.65
P99 TPOT (ms): 78.16
---------------Inter-Token Latency----------------
Mean ITL (ms): 63.14
Median ITL (ms): 35.92
P95 ITL (ms): 178.10
P99 ITL (ms): 182.10
Max ITL (ms): 2466.36
==================================================
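As a sanity check, the headline throughput figures in the report can be re-derived from the totals it prints. The sketch below uses only numbers from the report above plus the 4-GPU count from the test environment; the average concurrency line applies Little's law (throughput times mean latency), which is an interpretation on our part rather than something the benchmark tool documents here.

```python
# Values copied from the benchmark report above.
duration_s = 1184.58
requests = 4096
input_tokens = 2_081_726
output_tokens = 2_087_288
mean_e2e_ms = 72_169.95
num_gpus = 4  # GB200 (4x), from the test environment

request_tput = requests / duration_s          # reported: 3.46 req/s
input_tput = input_tokens / duration_s        # reported: 1757.35 tok/s
output_tput = output_tokens / duration_s      # reported: 1762.05 tok/s
total_tput = input_tput + output_tput         # reported: 3519.40 tok/s

# Little's law: average in-flight requests = arrival rate x mean time in system.
avg_concurrency = request_tput * mean_e2e_ms / 1000  # reported: 249.55

# Derived figure, not in the report: output throughput normalized per GPU.
per_gpu_output_tput = output_tput / num_gpus

print(f"request throughput:   {request_tput:.2f} req/s")
print(f"total throughput:     {total_tput:.2f} tok/s")
print(f"average concurrency:  {avg_concurrency:.2f}")
print(f"output tok/s per GPU: {per_gpu_output_tput:.2f}")
```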
Environment
Launch Model
python3 -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--trust-remote-code \
--tp 4 \
--reasoning-parser nemotron_3 \
--port 5000
Run Benchmark
python3 benchmark/gsm8k/bench_sglang.py --port 5000
Test Results:
Accuracy: 0.970
Invalid: 0.000
Latency: 29.129 s
Output throughput: 745.333 token/s
Run Benchmark
python3 benchmark/mmlu/bench_sglang.py --port 5000
Test Results:
TBD