For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
<Tabs> <Tab title="Python (pip / uv)">pip install --upgrade pip
pip install uv
uv pip install sglang
Then run the Python output of the command panel below in that environment.
</Tab> <Tab title="Docker">A single image — lmsysorg/sglang:latest — covers the datacenter GPUs in this cookbook (B200 / B300 / GB200 / GB300 / H100 / H200). For RTX PRO 6000 (SM120), use the nightly lmsysorg/sglang:dev instead — SM120 support isn't in :latest yet (see the RTX PRO 6000 note below).
docker pull lmsysorg/sglang:latest
For how to launch the image, see Install → Method 3: Using Docker. A minimal example (substitute the inner sglang serve ... with whatever the command generator below produces):
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<your-hf-token>" \
--ipc=host \
lmsysorg/sglang:latest \
sglang serve <use args below>
</Tab> </Tabs>
Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:
import { Deployment } from "/src/snippets/_deployment.jsx"; import { config } from "/src/snippets/configs/deepseek-ai/deepseek-v4.jsx"; import { benchmarks } from "/src/snippets/configs/deepseek-ai/deepseek-v4-benchmarks.jsx";
<Deployment config={config} benchmarks={benchmarks} /> <div style={{fontSize: "0.85em", lineHeight: "1.55", color: "#6b7280", margin: "0.5rem 0 1rem 0"}}> <p style={{margin: "0 0 0.3rem 0"}}><strong>Panel controls</strong> (top of the command box):</p> <ul style={{margin: 0, paddingLeft: "1.25rem"}}> <li style={{marginBottom: "0.2rem"}}><strong>Python / Docker</strong> — bare <code>sglang serve …</code> for an existing SGLang env, or a <code>docker run … sglang serve …</code> wrap against the per-hardware image from the <a href="#install">Install SGLang</a> panel above.</li> <li style={{marginBottom: "0.2rem"}}><strong>⧉ Copy</strong> — copies the current command (with whichever framing is active) to your clipboard.</li> <li style={{marginBottom: "0.2rem"}}><strong>$ cURL</strong> — a sample request against <code>localhost:30000</code> to confirm the server is up.</li> <li style={{marginBottom: "0.2rem"}}><strong>⚙ Env</strong> — edits the placeholders (<code>HOST_IP</code>, <code>PORT</code>, <code>HF_TOKEN</code>, <code>NODE_RANK</code>, <code>NODE0_IP</code>) the command and cURL share. Persists in localStorage across cookbooks.</li> <li><strong>Verified / Not Verified</strong> badge — green when the <code>(hw, variant, quant, strategy, nodes)</code> combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.</li> </ul> </div>The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change.
The knobs come in two flavors:
off to disable), MoE backend + EP, reasoning / tool-call parsers, speculative-decoding presets, prefill/decode disaggregation, HiCache tiers, and HiSparse hierarchical sparse attention (decode-role only; the card appears once PD-Disagg mode is set to decode).
Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base's Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.
import { Playground } from "/src/snippets/_playground.jsx";
<Playground config={config} /> <div style={{fontSize: "0.85em", lineHeight: "1.55", color: "#6b7280", margin: "0.5rem 0 1rem 0"}}> <p style={{margin: "0 0 0.3rem 0"}}><strong>Panel controls</strong> reuse <strong>Python / Docker</strong> · <strong>⧉ Copy</strong> · <strong>$ cURL</strong> · <strong>⚙ Env</strong> from the Deploy panel, plus one extra:</p> <ul style={{margin: 0, paddingLeft: "1.25rem"}}> <li><strong>Submit ↗</strong> — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says <strong>Not Verified</strong>; click it once you've actually run the command on your hardware and confirmed it works.</li> </ul> </div>DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "30%"}} /> <col style={{width: "15%"}} /> <col style={{width: "15%"}} /> <col style={{width: "40%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Total params</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Active (MoE)</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash">DeepSeek-V4-Flash</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>284B</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>13B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>single-node serving on B200 / B300 / GB200 / GB300 / H200 (TP=4); H100 (TP=8)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro">DeepSeek-V4-Pro</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>1.6T</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>49B</td> <td style={{padding: "9px 12px", backgroundColor: 
"rgba(255,255,255,0.05)"}}>high-capacity: B200 / B300 (TP=8) · GB300 (TP=4) · H200 FP4 (TP=8) · GB200 (2-node, TP=8) · H200 FP8 (2-node, TP=16) · H100 (2-node, TP=16)</td> </tr> </tbody> </table>Both Instruct repos ship as FP4 MoE experts + FP8 attention / dense (one mixed-precision checkpoint covers every FP4-capable GPU). Matching *-Base repos ship pure FP8 mixed and are for further pre-training only — not for chat or tool calling.
Highlights: hybrid CSA + HCA attention (~27% inference FLOPs / ~10% KV cache vs DSv3.2 at 1M context), manifold-constrained hyper-connections (mHC), Muon optimizer, 1M-token context (32T+ pre-training tokens), three reasoning modes (Non-think / Think High / Think Max — use ≥ 384K context for Think Max), and a dedicated encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar.
Recommended generation: temperature=1.0, top_p=1.0.
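The recommended sampling settings can be bundled into the keyword arguments of a chat-completions call. This is a minimal sketch, not an official helper: the function name `recommended_request` is hypothetical, and the `thinking` switch assumes the same `chat_template_kwargs` shape used in the reasoning example later on this page.

```python
def recommended_request(prompt, model="deepseek-ai/DeepSeek-V4-Flash", thinking=True):
    """Build keyword arguments for an OpenAI-compatible chat completion.

    Hypothetical helper: only temperature/top_p come from the recommendation
    above; everything else is an illustrative default.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # recommended generation setting
        "top_p": 1.0,        # recommended generation setting
        # Assumed to toggle thinking mode via the chat template.
        "extra_body": {"chat_template_kwargs": {"thinking": thinking}},
    }


request = recommended_request("What is 15% of 240?")
print(request["temperature"], request["top_p"])
```

With an `openai.OpenAI` client pointed at the server, the dictionary can be passed as `client.chat.completions.create(**request)`.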
Resources: HuggingFace · Flash · Pro · ModelScope · Flash · Pro.
Concurrency & DeepEP dispatch buffer
Must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP's dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, change --cuda-graph-max-bs, --max-running-requests, and the env var together.
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload's peak concurrency and report findings back so the defaults can be revised.
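The inequality above is easy to check before launching. The following is a minimal sketch under stated assumptions: `check_dispatch_budget` is a hypothetical helper (not part of SGLang), the numbers are illustrative rather than recommended values, and treating MTP-disabled as one token per request is an assumption.

```python
def check_dispatch_budget(max_running_requests: int,
                          mtp_draft_tokens: int,
                          dispatch_tokens_per_rank: int) -> int:
    """Return the spare dispatch-token headroom, or raise if the budget is exceeded.

    Encodes: max-running-requests x MTP_draft_tokens
             <= SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK
    """
    # Assumption: with MTP disabled, count one token per running request.
    draft = max(mtp_draft_tokens, 1)
    needed = max_running_requests * draft
    if needed > dispatch_tokens_per_rank:
        raise ValueError(
            f"need {needed} dispatch tokens per rank, "
            f"but only {dispatch_tokens_per_rank} are configured"
        )
    return dispatch_tokens_per_rank - needed


# Illustrative values only.
print(check_dispatch_budget(256, 4, 1024))  # exactly at the limit
print(check_dispatch_budget(256, 2, 1024))  # half the budget left
```

Running such a check whenever any of the three values changes helps keep them consistent with each other.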
MTP (Multi-Token Prediction, EAGLE)
- low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
- balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
- high-throughput: MTP disabled; at saturation the verify step costs more than it saves.
SGLANG_ENABLE_SPEC_V2, enabled by default).
EPLB + DeepEP Waterfill (Experimental)
For recorded/static EPLB reproduction, first record an expert-distribution file by following
the guide "Capture expert selection distribution in MoE models".
For reproduction runs, use the generated expert_distribution_recorder_*.pt as
the initial expert location. This feature requires the latest main branch, so check it out first.
For non-PD reproduction, use:
--moe-a2a-backend deepep \
--deepep-mode auto \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
For PD-Disagg reproduction, use normal mode on the prefill server and
low_latency mode on the decode server. Add the same --init-expert-location
flag to both commands:
# prefill
--moe-a2a-backend deepep \
--deepep-mode normal \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
# decode
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-deepep-waterfill
You can also add --ep-num-redundant-experts and --eplb-algorithm to customize
EPLB placement.
MegaMoE is not supported with this DeepEP Waterfill recipe yet. Waterfill routes
the shared expert through DeepEP for load balancing, so --enable-deepep-waterfill
requires --moe-a2a-backend deepep.
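The three flag sets above differ only in the DeepEP mode, so a launcher script can derive them from the server role. This is a minimal sketch: `waterfill_args` and the role names are hypothetical, and only flags quoted in this section are emitted.

```python
def waterfill_args(role: str, expert_location: str) -> list[str]:
    """Return the DeepEP Waterfill flags for a given server role.

    role: "non-pd", "prefill", or "decode" (hypothetical labels).
    expert_location: path to a recorded expert_distribution_recorder_*.pt file.
    """
    # Mode per role, as listed in the commands above.
    modes = {"non-pd": "auto", "prefill": "normal", "decode": "low_latency"}
    if role not in modes:
        raise ValueError(f"unknown role: {role!r}")
    return [
        "--moe-a2a-backend", "deepep",  # Waterfill requires the DeepEP backend
        "--deepep-mode", modes[role],
        "--init-expert-location", expert_location,
        "--enable-deepep-waterfill",
    ]


# The path is a placeholder; substitute the file you recorded.
for role in ("prefill", "decode"):
    print(role, " ".join(waterfill_args(role, "/path/to/recorded.pt")))
```

The returned list can be appended to the rest of the `sglang serve` arguments produced by the command generator.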
FP4 Indexer (Experimental)
DeepSeek-V4 uses the default indexer path unless --enable-deepseek-v4-fp4-indexer is set. Enable this flag to use the experimental FP4 C4 indexer on SM100 GPUs with DeepGEMM FP4 indexer support. This path is intended for decode-heavy long-context workloads where reducing indexer cache bandwidth is beneficial.
# Please use the latest main branch for this feature.
sglang serve \
--model-path deepseek-ai/DeepSeek-V4-Flash \
--tp 4 \
--moe-runner-backend flashinfer_mxfp4 \
--enable-deepseek-v4-fp4-indexer
Hopper (H100 / H200) note
Two options are available for running DeepSeek-V4 on Hopper:
sgl-project/DeepSeek-V4-Flash-FP8 and sgl-project/DeepSeek-V4-Pro-FP8 unlock DP-attention + DeepEP and richer parallelism (e.g. Pro TP=16 across 2 nodes).
PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1
(or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake
can discover the IB HCAs; without IB exposure mooncake silently falls back to
TCP, which can lead to garbled KV transfer on large checkpoints.
RTX PRO 6000 (SM120 / Blackwell Desktop) note
RTX PRO 6000 (96 GB) runs Flash only — V4-Pro doesn't fit on 8× 96 GB. It uses the
low-latency / TP-only recipe (TP=4, single node) with the Marlin W4A16 MoE runner and
--mem-fraction-static 0.70; the Deploy panel greys out the other recipes for this card.
HiCache and MegaMoE are not supported on RTX PRO 6000. For Docker, use the nightly lmsysorg/sglang:dev image — SM120 support isn't in lmsysorg/sglang:latest yet (the Deploy panel's Docker mode already points this card at :dev).
MegaMoE
MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput
on MoE layers. To enable it, use the MegaMoE chip in the Playground
below — the playground will swap --moe-a2a-backend deepep for
--moe-a2a-backend megamoe and add the relevant env vars automatically.
Two variants are exposed:
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4
kernel (FP4 activations). Higher throughput with negligible accuracy drop
(~89.5 GPQA on Pro).
Notes:
high-throughput recipe on Blackwell (per sgl-project/sglang#26451). The chip is hidden on low-latency and balanced; switch to high-throughput to expose it.
--moe-runner-backend manually.
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage. Setting a higher number of tokens for MegaMoE requires more HBM space (recommended: 8320 for high-throughput).
GB300 PD-Disagg cross-pod MNNVL
On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may
fail with nvlink_transport.cpp:497 Requested address ... not found!. If
this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1
to both prefill and decode sglang serve commands.
Enable the deepseek-v4 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer into reasoning_content vs content.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    # Some chunks carry no choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Reasoning tokens arrive in reasoning_content when the parser is enabled.
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    # Final-answer tokens arrive in content.
    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>
To find 15% of 240, follow these steps:
**Step 1: Understand the meaning of percent**
"Percent" means "per hundred," so 15% means 15 out of every100, or \( \frac{15}{100} \).
**Step2: Convert the percentage to a decimal or fraction**
\( 15\% = \frac{15}{100} = 0.15 \)
**Step3: Multiply by the given number**
Multiply the decimal form by 240:
\( 0.15 \times 240 \)
**Step4: Perform the multiplication**
\( 0.15 \times 240 = 36 \)
**Answer:** 15% of 240 is **36**.
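When the thinking text should be stored rather than printed, the same streaming loop can accumulate the two channels separately. The sketch below is illustrative: `split_stream` and `_fake_chunk` are hypothetical names, and the fake chunks only imitate the `choices[0].delta` shape used in the example above so the snippet runs without a server.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Collect streamed deltas into (reasoning_text, answer_text)."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)
        if piece:
            reasoning.append(piece)
        if getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(reasoning), "".join(answer)


def _fake_chunk(reasoning=None, content=None):
    # Stand-in for a streamed chunk; real chunks come from the OpenAI client.
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake = [
    _fake_chunk(reasoning="0.15 * 240 "),
    _fake_chunk(reasoning="= 36"),
    SimpleNamespace(choices=[]),
    _fake_chunk(content="15% of 240 is 36."),
]
reasoning_text, answer_text = split_stream(fake)
print(reasoning_text)
print(answer_text)
```

Passing the real `response` iterator instead of `fake` should yield the same two strings that the printing loop shows under its Thinking and Content banners.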
Enable the deepseekv4 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    # Some chunks carry no choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if getattr(delta, "tool_calls", None):
        # Print the Content banner once, when the first tool-call delta arrives.
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        # Tool-call names and arguments stream in fragments keyed by index.
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

# Print the assembled tool calls in index order.
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f" Arguments: {tool_call['arguments']}")

print()
The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>
<|DSML|tool_calls>
<|DSML|invoke name="get_weather">
<|DSML|parameter name="location" string="true">Beijing</|DSML|parameter>
</|DSML|invoke>
</|DSML|tool_calls>
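Once the parser surfaces structured tool calls, the accumulated entries still have to be executed by the application. The following is a minimal sketch of that step, assuming the streamed `arguments` fragments concatenate into a JSON object string (the usual OpenAI-compatible convention); `run_tool_calls`, `TOOL_REGISTRY`, and the stub `get_weather` implementation are hypothetical.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    # Hypothetical stub; a real implementation would query a weather service.
    return {"location": location, "unit": unit, "temperature": None}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(accumulator: dict) -> list:
    """Execute accumulated tool calls in index order and collect the results."""
    results = []
    for index, call in sorted(accumulator.items()):
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"index": index, "error": f"unknown tool: {call['name']}"})
            continue
        try:
            kwargs = json.loads(call["arguments"] or "{}")
        except json.JSONDecodeError as exc:
            results.append({"index": index, "error": f"bad arguments: {exc}"})
            continue
        results.append({"index": index, "result": func(**kwargs)})
    return results


# Same shape as tool_calls_accumulator in the streaming example above.
example = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
results = run_tool_calls(example)
print(results)
```

Each result would typically be sent back to the model as a `tool` role message in a follow-up request.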
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers.
To enable HiCache, open the HiCache card in the Playground above and flip Enable:
auto (default). Cold KV pages spill to CPU pinned memory only.
file / mooncake / hf3fs / nixl); the Playground emits the canonical page_first_direct mem-layout + direct IO backend + wait_complete prefetch policy, matching the HiCache best-practices recipe.
The Write policy knob defaults to write_through (the upstream default); switch to write_back / write_through_selective to trade durability for write speed when the storage tier is slow.
For more details, see the HiCache documentation.