docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
<Tabs>
<Tab title="Python (pip / uv)">
pip install --upgrade pip
pip install uv
uv pip install sglang
Then run the Python output of the command panel below in that environment.
</Tab>
<Tab title="Docker">
For how to launch the image, see Install → Method 3: Using Docker. A minimal example (substitute the inner sglang serve ... with whatever the command generator below produces):
NVIDIA GPUs
A single image — lmsysorg/sglang:latest — covers the datacenter GPUs in this cookbook (B200 / B300 / GB200 / GB300 / H100 / H200). For RTX PRO 6000 (SM120), use the nightly lmsysorg/sglang:dev instead — SM120 support isn't in :latest yet (see the RTX PRO 6000 note below).
docker pull lmsysorg/sglang:latest
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<your-hf-token>" \
--ipc=host \
lmsysorg/sglang:latest \
sglang serve <use args below>
AMD GPUs (ROCm)
AMD uses the daily-updated lmsysorg/sglang-rocm images. You can find the latest images on Docker Hub. We recommend the ROCm 7.2 version.
For example:
- lmsysorg/sglang-rocm:v0.5.13.post1-rocm720-mi35x-20260623
- lmsysorg/sglang-rocm:v0.5.13.post1-rocm720-mi30x-20260623

docker pull lmsysorg/sglang-rocm:v0.5.13.post1-rocm720-mi35x-20260623
docker run \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--shm-size 32g --ipc=host \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<your-hf-token>" \
lmsysorg/sglang-rocm:v0.5.13.post1-rocm720-mi35x-20260623 \
sglang serve <use args below>
</Tab>
</Tabs>
Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:
import { Deployment } from "/src/snippets/_deployment.jsx";
import { config } from "/src/snippets/configs/deepseek-ai/deepseek-v4.jsx";
import { benchmarks } from "/src/snippets/configs/deepseek-ai/deepseek-v4-benchmarks.jsx";
<Deployment config={config} benchmarks={benchmarks} />
<div style={{fontSize: "0.85em", lineHeight: "1.55", color: "#6b7280", margin: "0.5rem 0 1rem 0"}}>
<p style={{margin: "0 0 0.3rem 0"}}><strong>Panel controls</strong> (top of the command box):</p>
<ul style={{margin: 0, paddingLeft: "1.25rem"}}>
<li style={{marginBottom: "0.2rem"}}><strong>Python / Docker</strong>: bare <code>sglang serve …</code> for an existing SGLang env, or a <code>docker run … sglang serve …</code> wrap against the per-hardware image from the <a href="#install">Install SGLang</a> panel above.</li>
<li style={{marginBottom: "0.2rem"}}><strong>⧉ Copy</strong>: copies the current command (with whichever framing is active) to your clipboard.</li>
<li style={{marginBottom: "0.2rem"}}><strong>$ cURL</strong>: a sample request against <code>localhost:30000</code> to confirm the server is up.</li>
<li style={{marginBottom: "0.2rem"}}><strong>⚙ Env</strong>: edits the placeholders (<code>HOST_IP</code>, <code>PORT</code>, <code>HF_TOKEN</code>, <code>NODE_RANK</code>, <code>NODE0_IP</code>) the command and cURL share. Persists in localStorage across cookbooks.</li>
<li><strong>Verified / Not Verified</strong> badge: green when the <code>(hw, variant, quant, strategy, nodes)</code> combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.</li>
</ul>
</div>

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection; only your overrides change.
The knobs come in two flavors:
off to disable), MoE backend + EP, reasoning / tool-call parsers, speculative-decoding presets, prefill/decode disaggregation, HiCache tiers, and HiSparse hierarchical sparse attention (decode-role only; the card appears once PD-Disagg mode is set to decode).

Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base's Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.
import { Playground } from "/src/snippets/_playground.jsx";
<Playground config={config} />
<div style={{fontSize: "0.85em", lineHeight: "1.55", color: "#6b7280", margin: "0.5rem 0 1rem 0"}}>
<p style={{margin: "0 0 0.3rem 0"}}><strong>Panel controls</strong> reuse <strong>Python / Docker</strong> · <strong>⧉ Copy</strong> · <strong>$ cURL</strong> · <strong>⚙ Env</strong> from the Deploy panel, plus one extra:</p>
<ul style={{margin: 0, paddingLeft: "1.25rem"}}>
<li><strong>Submit ↗</strong>: opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says <strong>Not Verified</strong>; click it once you've actually run the command on your hardware and confirmed it works.</li>
</ul>
</div>

DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
<colgroup>
<col style={{width: "30%"}} />
<col style={{width: "15%"}} />
<col style={{width: "15%"}} />
<col style={{width: "40%"}} />
</colgroup>
<thead>
<tr style={{borderBottom: "2px solid #d55816"}}>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th>
<th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Total params</th>
<th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Active (MoE)</th>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash">DeepSeek-V4-Flash</a></strong></td>
<td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>284B</strong></td>
<td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>13B</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>single-node serving on B200 / B300 / GB200 / GB300 / H200 (TP=4); H100 (TP=8)</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro">DeepSeek-V4-Pro</a></strong></td>
<td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>1.6T</strong></td>
<td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>49B</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>high-capacity: B200 / B300 (TP=8) · GB300 (TP=4) · H200 FP4 (TP=8) · GB200 (2-node, TP=8) · H200 FP8 (2-node, TP=16) · H100 (2-node, TP=16)</td>
</tr>
</tbody>
</table>

Both Instruct repos ship as FP4 MoE experts + FP8 attention / dense (one mixed-precision checkpoint covers every FP4-capable GPU). Matching *-Base repos ship pure FP8 mixed and are for further pre-training only, not for chat or tool calling.
Highlights: hybrid CSA + HCA attention (~27% inference FLOPs / ~10% KV cache vs DSv3.2 at 1M context), manifold-constrained hyper-connections (mHC), Muon optimizer, 1M-token context (32T+ pre-training tokens), three reasoning modes (Non-think / Think High / Think Max — use ≥ 384K context for Think Max), and a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar.
Recommended generation: temperature=1.0, top_p=1.0.
Resources: HuggingFace · Flash · Pro · ModelScope · Flash · Pro.
Concurrency & DeepEP dispatch buffer
Must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it blows DeepEP's dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs, --max-running-requests, and the env together.
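The inequality above is easy to check before launch. The following is a minimal sketch, not part of SGLang: the function name `check_dispatch_budget` and the example numbers are hypothetical, and passing `mtp_draft_tokens=1` when MTP is disabled is an assumption about how the product should be read in that case.

```python
def check_dispatch_budget(max_running_requests: int,
                          mtp_draft_tokens: int,
                          dispatch_tokens_per_rank: int) -> int:
    """Return the spare dispatch-buffer headroom in tokens.

    Raises ValueError when
    max_running_requests * mtp_draft_tokens > dispatch_tokens_per_rank.
    Assumption: with MTP disabled, pass mtp_draft_tokens=1.
    """
    values = (max_running_requests, mtp_draft_tokens, dispatch_tokens_per_rank)
    if min(values) < 1:
        raise ValueError("all three values must be positive integers")
    needed = max_running_requests * mtp_draft_tokens
    if needed > dispatch_tokens_per_rank:
        raise ValueError(
            f"{needed} tokens needed but only {dispatch_tokens_per_rank} "
            "dispatch tokens per rank are configured"
        )
    return dispatch_tokens_per_rank - needed


if __name__ == "__main__":
    # Hypothetical values: 64 running requests, 4 draft tokens, buffer of 256.
    print(check_dispatch_budget(64, 4, 256))
```

Running a check like this whenever --max-running-requests or the MTP preset changes keeps the three settings moving together, as the paragraph above recommends.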
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload's peak concurrency and report findings back so the defaults can be revised.
MTP (Multi-Token Prediction, EAGLE)
- low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
- balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
- high-throughput: MTP disabled; at saturation the verify step costs more than it saves.

Compressed attention state dtype
DeepSeek-V4 uses hybrid compressed attention for long-context efficiency. SGLANG_DSV4_COMPRESS_STATE_DTYPE controls the dtype of the C4 / C128 compressed attention state pools. Supported values are float32 / fp32 (default: float32) and bfloat16 / bf16. For BF16 on the offline compression path:
SGLANG_DSV4_COMPRESS_STATE_DTYPE=bf16 \
sglang serve \
--model-path deepseek-ai/DeepSeek-V4-Flash \
<other args>
This BF16 setting applies only to the compressed attention state pools and reduces the GPU memory footprint of each compressed-state slot. It does not change model weight precision or the main KV cache dtype. With automatic pool sizing and no explicit capacity cap, the same memory budget holds more slots, and the startup log shows larger c4_state and c128_state pool sizes. Keep the default float32 setting for the most conservative behavior.
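As a rough illustration of why the pools grow: if a slot's size is dominated by its state elements, halving the element width roughly doubles the number of slots that fit in a fixed budget. The sketch below is back-of-the-envelope arithmetic only. The `elements_per_slot` value and the 4 GiB budget are hypothetical, and SGLang's actual pool sizing may account for more than this.

```python
# Bytes per element for the dtype names accepted by
# SGLANG_DSV4_COMPRESS_STATE_DTYPE.
BYTES_PER_ELEMENT = {"float32": 4, "fp32": 4, "bfloat16": 2, "bf16": 2}


def slots_for_budget(budget_bytes: int, elements_per_slot: int, dtype: str) -> int:
    """Number of whole slots that fit in budget_bytes (simplified model)."""
    slot_bytes = elements_per_slot * BYTES_PER_ELEMENT[dtype]
    return budget_bytes // slot_bytes


if __name__ == "__main__":
    budget = 4 * 2**30          # hypothetical 4 GiB budget
    elements = 8192             # hypothetical elements per slot
    for dtype in ("float32", "bf16"):
        print(dtype, slots_for_budget(budget, elements, dtype))
```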
EPLB + DeepEP Waterfill (Experimental)
For recorded/static EPLB reproduction, first record an expert-distribution file by following
Capture expert selection distribution in MoE models.
For reproduction runs, use the generated expert_distribution_recorder_*.pt as
the initial expert location. Please check out the latest main branch for this feature.
For non-PD reproduction, use:
--moe-a2a-backend deepep \
--deepep-mode auto \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
For PD-Disagg reproduction, use normal mode on the prefill server and
low_latency mode on the decode server. Add the same --init-expert-location
flag to both commands:
# prefill
--moe-a2a-backend deepep \
--deepep-mode normal \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
# decode
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
You can also add --ep-num-redundant-experts and --eplb-algorithm to customize
EPLB placement.
MegaMoE is not supported with this DeepEP Waterfill recipe yet. Waterfill routes
the shared expert through DeepEP for load balancing, so --enable-deepep-waterfill
requires --moe-a2a-backend deepep.
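The three flag sets above differ only in --deepep-mode, so a launcher script can derive them from the server role. This is a minimal sketch using only the flags quoted above. The helper name `waterfill_args` and the role labels are hypothetical, not an SGLang API.

```python
import shlex

# Role -> DeepEP mode, as listed in the recipes above.
DEEPEP_MODE_BY_ROLE = {
    "non-pd": "auto",
    "prefill": "normal",
    "decode": "low_latency",
}


def waterfill_args(expert_location: str, role: str = "non-pd") -> list:
    """Build the DeepEP Waterfill flags for one server role."""
    if role not in DEEPEP_MODE_BY_ROLE:
        raise ValueError(f"unknown role {role!r}")
    return [
        "--moe-a2a-backend", "deepep",
        "--deepep-mode", DEEPEP_MODE_BY_ROLE[role],
        "--init-expert-location", expert_location,
        "--enable-deepep-waterfill",
    ]


if __name__ == "__main__":
    # Hypothetical path; substitute the file produced by the recorder.
    path = "/path/to/expert_distribution.pt"
    for role in ("prefill", "decode"):
        print(role, shlex.join(waterfill_args(path, role)))
```

Passing the same path for both roles mirrors the requirement that prefill and decode share one --init-expert-location.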
FP4 Indexer (Experimental)
DeepSeek-V4 uses the default indexer path unless --enable-deepseek-v4-fp4-indexer is set. Enable this flag to use the experimental FP4 C4 indexer on SM100 GPUs with DeepGEMM FP4 indexer support. This path is intended for decode-heavy long-context workloads where reducing indexer cache bandwidth is beneficial.
# Please use the latest main branch for this feature.
sglang serve \
--model-path deepseek-ai/DeepSeek-V4-Flash \
--tp 4 \
--moe-runner-backend flashinfer_mxfp4 \
--enable-deepseek-v4-fp4-indexer
NVFP4 Hybrid Checkpoints
The nvidia/DeepSeek-V4-Pro-NVFP4 and
nvidia/DeepSeek-V4-Flash-NVFP4 checkpoints
quantize MoE experts to NVFP4 while keeping attention and dense layers in
FP8. They require --moe-runner-backend flashinfer_trtllm_routed, which is selected automatically if not provided.
sglang serve \
--model-path nvidia/DeepSeek-V4-Pro-NVFP4 \
--tp 8
or
sglang serve \
--model-path nvidia/DeepSeek-V4-Flash-NVFP4 \
--tp 8
Requires Blackwell (SM100+). The MTP layer in this checkpoint stays
MXFP4-packed and is routed through the Mxfp4FlashinferTrtllmMoEMethod path
automatically.
Hopper (H100 / H200) note
Two options are available for running DeepSeek-V4 on Hopper:
sgl-project/DeepSeek-V4-Flash-FP8 and sgl-project/DeepSeek-V4-Pro-FP8 unlock DP-attention + DeepEP and richer parallelism (e.g. Pro TP=16 across 2 nodes).

PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1
(or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake
can discover the IB HCAs; without IB exposure mooncake silently falls back to
TCP, which can lead to garbled KV transfer on large checkpoints.
RTX PRO 6000 (SM120 / Blackwell Desktop) note
RTX PRO 6000 (96 GB) runs Flash only — V4-Pro doesn't fit on 8× 96 GB. It uses the
low-latency / TP-only recipe (TP=4, single node) with the Marlin W4A16 MoE runner and
--mem-fraction-static 0.70; the Deploy panel greys out the other recipes for this card.
HiCache and MegaMoE are not supported on RTX PRO 6000. For Docker, use the nightly lmsysorg/sglang:dev image — SM120 support isn't in lmsysorg/sglang:latest yet (the Deploy panel's Docker mode already points this card at :dev).
AMD (MI300X / MI355X) note
- deepseek-ai/DeepSeek-V4-{Flash,Pro}, and the FP8 model uses the repackaged sgl-project/DeepSeek-V4-{Flash,Pro}-FP8.
- --dp 8 --enable-dp-attention --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000.
- --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.

MegaMoE
MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput
on MoE layers. To enable it, use the MegaMoE chip in the Playground
below — the playground will swap --moe-a2a-backend deepep for
--moe-a2a-backend megamoe and add the relevant env vars automatically.
Two variants are exposed:
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4
kernel (FP4 activations). Higher throughput with negligible accuracy drop
(~89.5 GPQA on Pro).

Notes:
- high-throughput recipe on Blackwell (per sgl-project/sglang#26451). The chip is hidden on low-latency and balanced; switch to high-throughput to expose it.
- --moe-runner-backend manually.
- SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage. Setting a higher number of tokens for MegaMoE requires more HBM space (recommended: 8320 for high-throughput).

GB300 PD-Disagg cross-pod MNNVL
On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may
fail with nvlink_transport.cpp:497 Requested address ... not found!. If
this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1
to both prefill and decode sglang serve commands.
Enable the deepseek-v4 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer into reasoning_content vs content.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Thinking tokens arrive in reasoning_content when the reasoning parser is on.
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)
    # The final answer arrives in content; print the header once.
    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)
print()
We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>
To find 15% of 240, follow these steps:
**Step 1: Understand the meaning of percent**
"Percent" means "per hundred," so 15% means 15 out of every 100, or \( \frac{15}{100} \).
**Step 2: Convert the percentage to a decimal or fraction**
\( 15\% = \frac{15}{100} = 0.15 \)
**Step 3: Multiply by the given number**
Multiply the decimal form by 240:
\( 0.15 \times 240 \)
**Step 4: Perform the multiplication**
\( 0.15 \times 240 = 36 \)
**Answer:** 15% of 240 is **36**.
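When the two streams need to be stored rather than printed, the same loop can be factored into a small helper. The sketch below is illustrative only: `split_reasoning` is a hypothetical name, and the fake deltas built with `SimpleNamespace` stand in for `chunk.choices[0].delta` objects so the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(deltas):
    """Collect streamed deltas into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for delta in deltas:
        # Either attribute may be missing or None on a given chunk.
        thought = getattr(delta, "reasoning_content", None)
        if thought:
            reasoning.append(thought)
        text = getattr(delta, "content", None)
        if text:
            answer.append(text)
    return "".join(reasoning), "".join(answer)


if __name__ == "__main__":
    # Hand-written stand-ins for streamed deltas, not real model output.
    fake_stream = [
        SimpleNamespace(reasoning_content="0.15 * 240 = 36", content=None),
        SimpleNamespace(reasoning_content=None, content="15% of 240 is "),
        SimpleNamespace(content="36."),
    ]
    reasoning, answer = split_reasoning(fake_stream)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

With a live server, the generator `(c.choices[0].delta for c in response if c.choices)` could be passed in place of `fake_stream`.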
Enable the deepseekv4 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)
    if getattr(delta, "tool_calls", None):
        # Print the header once, when the first tool-call chunk follows the thinking.
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        # Tool calls stream in fragments; merge them by index.
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments
    if delta.content:
        print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f" Arguments: {tool_call['arguments']}")
print()
The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>
<|DSML|tool_calls>
<|DSML|invoke name="get_weather">
<|DSML|parameter name="location" string="true">Beijing</|DSML|parameter>
</|DSML|invoke>
</|DSML|tool_calls>
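Once the stream ends, the accumulated calls still have to be executed by your own code. The sketch below shows one way to do that, assuming the accumulator has the `{index: {"name", "arguments"}}` shape built in the example above and that `arguments` is a JSON string. `run_tool_calls`, `LOCAL_TOOLS`, and the stub `get_weather` are hypothetical names for illustration.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in; a real implementation would query a weather API."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Map tool names advertised to the model onto local callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_calls(accumulated: dict) -> list:
    """Execute accumulated tool calls in index order and collect results."""
    results = []
    for index, call in sorted(accumulated.items()):
        func = LOCAL_TOOLS.get(call["name"])
        if func is None:
            raise KeyError(f"no local implementation for tool {call['name']!r}")
        # Arguments arrive as a JSON string assembled from streamed fragments.
        kwargs = json.loads(call["arguments"] or "{}")
        results.append({"index": index, "name": call["name"], "result": func(**kwargs)})
    return results


if __name__ == "__main__":
    accumulated = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
    print(run_tool_calls(accumulated))
```

Each result can then be serialized with `json.dumps` and sent back to the model in a follow-up message with role `tool`.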
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers.
To enable HiCache, open the HiCache card in the Playground above and flip Enable:
- auto (default). Cold KV pages spill to CPU pinned memory only.
- file / mooncake / hf3fs / nixl); the Playground emits the canonical page_first_direct mem-layout + direct IO backend + wait_complete prefetch policy, matching the HiCache best-practices recipe.

The Write policy knob defaults to write_through (the upstream default); switch to write_back / write_through_selective to trade durability for write speed when the storage tier is slow.
For more details, see the HiCache documentation.