docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx
DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "30%"}} /> <col style={{width: "15%"}} /> <col style={{width: "15%"}} /> <col style={{width: "40%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Total params</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Active (MoE)</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash">DeepSeek-V4-Flash</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>284B</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>13B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>single-node serving: B200 / GB200 / GB300 / H200 on 4 GPUs</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro">DeepSeek-V4-Pro</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>1.6T</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>49B</td> <td style={{padding: "9px 12px", backgroundColor: 
"rgba(255,255,255,0.05)"}}>high-capacity: B200 8 GPU / GB200 8 GPU (2 nodes) / GB300 4 GPU / H200 8 GPU (FP4) / 16 GPU (FP8)</td> </tr> </tbody> </table>

The Instruct repos ship FP4 MoE experts + FP8 attention / dense, so one mixed-precision checkpoint covers all GPUs that support FP4. The Base (pre-trained only) variants, DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base, ship pure FP8 mixed and are not for chat / tool calling.
Key Features (per the official model card):
encoding_dsv4.encode_messages Python encoder + DSML tool-call grammar (<|DSML|tool_calls> / <|DSML|invoke> / <|DSML|parameter>).

Recommended Generation Parameters: temperature=1.0, top_p=1.0 (per the official model card).
License: MIT.
Resources:
SGLang offers multiple installation methods. Choose based on your hardware platform.
Please refer to the official SGLang installation guide for installation instructions.
Docker Images by Hardware Platform:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "55%"}} /> <col style={{width: "45%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Hardware Platform</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Docker Image</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA B300</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-b300</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA B200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA GB200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-grace-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA GB300</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-grace-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA H200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-hopper</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA H100</td> <td style={{padding: "9px 12px", 
backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:dev</code></td> </tr> </tbody> </table>

To launch one of these images, see Install → Method 3: Using Docker. A minimal example follows; substitute the image tag for your platform and replace the inner sglang serve ... with the command that the generator below produces:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<your-hf-token>" \
--ipc=host \
lmsysorg/sglang:deepseek-v4-blackwell \
sglang serve <use args below>
SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
Interactive Command Generator: Use the selector below to generate the deployment command for your hardware + recipe combination.
import { DeepSeekV4Deployment } from "/src/snippets/autoregressive/deepseek-v4-deployment.jsx";
<DeepSeekV4Deployment />

Concurrency & DeepEP dispatch buffer
Must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it blows DeepEP's dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs, --max-running-requests, and the env together.
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload's peak concurrency and report findings back so the defaults can be revised.
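The constraint above can be checked before launch with a few lines of Python. This is a minimal sketch, not part of SGLang: the function names are hypothetical, the budget of 256 tokens per rank is an illustrative value rather than a recommended default, and the draft-token count of 4 is taken from the low-latency recipe in the MTP notes.

```python
def check_dispatch_budget(max_running_requests: int,
                          mtp_draft_tokens: int,
                          dispatch_tokens_per_rank: int) -> int:
    """Return the unused headroom in tokens, or raise if the constraint is violated.

    Checks: max_running_requests * mtp_draft_tokens <= dispatch_tokens_per_rank,
    where dispatch_tokens_per_rank is the value you export as
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
    """
    needed = max_running_requests * mtp_draft_tokens
    if needed > dispatch_tokens_per_rank:
        raise ValueError(
            f"need {needed} dispatch tokens per rank, "
            f"but only {dispatch_tokens_per_rank} are configured"
        )
    return dispatch_tokens_per_rank - needed


def max_safe_running_requests(mtp_draft_tokens: int,
                              dispatch_tokens_per_rank: int) -> int:
    """Largest --max-running-requests value that still satisfies the constraint."""
    return dispatch_tokens_per_rank // mtp_draft_tokens


# Illustrative numbers only: draft-tokens=4 (low-latency recipe), budget of 256.
headroom = check_dispatch_budget(48, 4, 256)
print(headroom)
print(max_safe_running_requests(4, 256))
```

Running a check like this whenever --max-running-requests or the env var changes keeps the two values from drifting apart during tuning.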
MTP (Multi-Token Prediction, EAGLE)
low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
max-throughput: MTP disabled; at saturation the verify step costs more than it saves.
SGLANG_ENABLE_SPEC_V2=1.

Hopper (H200) note
We provide two different options for running DeepSeek-V4 models on Hopper devices (H200):
sgl-project/DeepSeek-V4-Flash-FP8, sgl-project/DeepSeek-V4-Pro-FP8), which support more parallelism and features.

PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1
(or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake
can discover the IB HCAs; without IB exposure mooncake silently falls back to
TCP, which can lead to garbled KV transfer on large checkpoints.
GB300 PD-Disagg cross-pod MNNVL
On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may
fail with nvlink_transport.cpp:497 Requested address ... not found!. If
this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1
to both prefill and decode sglang serve commands.
For basic API usage and request examples, see:
Once the server is running (for example via the command generator above), send a request:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is 15% of 240?"}]
}'
PD-Disagg note: if you deployed with the pd-disagg recipe from the generator above, the prefill server is on port 30000, the decode server on port 30001, and the router on port 8000. Client traffic should target http://localhost:8000, not :30000.
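As a rough Python counterpart to the curl request and the PD-Disagg note above, the sketch below picks the endpoint by recipe and builds the same chat payload with the recommended sampling parameters. The helper names (base_url, build_chat_payload, send_chat) are hypothetical and not part of SGLang; the ports are the ones stated in this guide and will differ if you changed them at launch.

```python
import json
import urllib.request

# Ports as described in this guide; adjust if your deployment differs.
SINGLE_SERVER_PORT = 30000   # regular recipes: clients talk to the server directly
PD_ROUTER_PORT = 8000        # pd-disagg: clients talk to the router


def base_url(recipe: str, host: str = "localhost") -> str:
    """Return the OpenAI-compatible base URL for a given recipe name."""
    port = PD_ROUTER_PORT if recipe == "pd-disagg" else SINGLE_SERVER_PORT
    return f"http://{host}:{port}/v1"


def build_chat_payload(prompt: str,
                       model: str = "deepseek-ai/DeepSeek-V4-Flash") -> dict:
    """Build a chat-completions body with the recommended sampling parameters."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
    }


def send_chat(recipe: str, prompt: str) -> dict:
    """POST the payload to /v1/chat/completions (requires a running server)."""
    request = urllib.request.Request(
        base_url(recipe) + "/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, send_chat("pd-disagg", "What is 15% of 240?") would go through the router, while any other recipe name targets port 30000.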
Enable the deepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content.
Streaming with Thinking Process:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Reasoning tokens arrive in reasoning_content when the reasoning parser is enabled.
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    # Final-answer tokens arrive in content.
    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
Output Example:
Pending update — replace with real server output after deployment.
Enable the deepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls.
Python Example (with Thinking Process):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Reasoning tokens arrive in reasoning_content when the reasoning parser is enabled.
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    # Tool calls stream as fragments; accumulate name and arguments per call index.
    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

# Print each fully assembled tool call once the stream has ended.
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"  Arguments: {tool_call['arguments']}")

print()
Output Example:
Pending update — replace with real server output after deployment.
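Once the stream has ended, the accumulated arguments are still JSON strings and need to be parsed before a tool can be run. The sketch below shows one way to do that for a dictionary shaped like tool_calls_accumulator in the example above. Everything here is an assumption for illustration: get_weather is a hypothetical local stand-in that returns canned data rather than a real weather lookup, and TOOL_REGISTRY and run_tool_calls are made-up names.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in for a real weather lookup; returns canned data."""
    return {"location": location, "unit": unit, "forecast": "placeholder"}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(accumulator: dict) -> list:
    """Parse accumulated argument strings and invoke the matching local tool."""
    results = []
    for index, call in sorted(accumulator.items()):
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"index": index, "error": f"unknown tool: {call['name']}"})
            continue
        try:
            # Arguments arrive as streamed JSON text fragments joined together.
            args = json.loads(call["arguments"] or "{}")
        except json.JSONDecodeError as exc:
            results.append({"index": index, "error": f"bad arguments: {exc}"})
            continue
        results.append({"index": index, "name": call["name"], "result": func(**args)})
    return results


# Same shape as tool_calls_accumulator after the stream has finished.
example = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
print(run_tool_calls(example))
```

Reporting unknown tools and malformed JSON as error entries, rather than raising, lets one bad call be handled without discarding the others from the same response.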
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers.
To enable HiCache, use the HiCache toggle in the command generator above:
SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 for intelligent hierarchical prefix caching.

For more details, see the HiCache documentation.
Test Environment:
We use SGLang's built-in benchmarking tool to conduct performance evaluation on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data and can better reflect performance in actual use scenarios. To simulate real-world usage patterns, we configure each request with 1024 input tokens and 1024 output tokens, representing typical medium-length conversations with detailed responses.
Model Deployment Command: see the command panel above.
Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model deepseek-ai/DeepSeek-V4-Flash \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
Pending update — replace with real bench_serving output after the latency run.
Model Deployment Command: see the command panel above.
Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model deepseek-ai/DeepSeek-V4-Flash \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1000 \
--max-concurrency 100
Pending update — replace with real bench_serving output after the throughput run.
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000
Pending update
Pending update
cd sglang
bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000
Pending update
Pending update
Test Environment:
Model Deployment Command: see the command panel above.
Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model deepseek-ai/DeepSeek-V4-Flash \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
Pending update — replace with real bench_serving output after the latency run.
Model Deployment Command: see the command panel above.
Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model deepseek-ai/DeepSeek-V4-Flash \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1000 \
--max-concurrency 100
Pending update — replace with real bench_serving output after the throughput run.