docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx
DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released on 2026-04-24 under the MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "30%"}} /> <col style={{width: "15%"}} /> <col style={{width: "15%"}} /> <col style={{width: "40%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Total params</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Active (MoE)</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash">DeepSeek-V4-Flash</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>284B</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>13B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>single-node serving: B200 / GB200 / GB300 / H200 on 4 GPUs</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro">DeepSeek-V4-Pro</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>1.6T</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>49B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>high-capacity: B200 8 GPU / GB200 8 GPU (2 nodes) / GB300 4 GPU / H200 8 GPU (fp4) / 16 GPU (fp8)</td> </tr> </tbody> </table>

The Instruct repos ship FP4 MoE experts plus FP8 attention / dense layers, so a single mixed-precision checkpoint covers every GPU that supports FP4. The Base (pre-trained only) variants, DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base, ship FP8 mixed-precision weights and are not intended for chat or tool calling.
Key Features (per the official model card):

- `encoding_dsv4.encode_messages` Python encoder plus the DSML tool-call grammar (`<|DSML|tool_calls>` / `<|DSML|invoke>` / `<|DSML|parameter>`).
- Recommended Generation Parameters: `temperature=1.0`, `top_p=1.0`.
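These defaults plug straight into any OpenAI-compatible client once a server is up. A minimal sketch; the port and model name assume the single-node deployment shown later on this page:

```python
from openai import OpenAI

# Assumes an SGLang server is already listening on localhost:30000 (see deployment below).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,  # recommended default from the model card
    top_p=1.0,        # recommended default from the model card
)
print(response.choices[0].message.content)
```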
License: MIT.
Resources:
SGLang offers multiple installation methods; choose one based on your hardware platform. Please refer to the official SGLang installation guide for instructions.
Docker Images by Hardware Platform:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "55%"}} /> <col style={{width: "45%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Hardware Platform</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Docker Image</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA B300</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-b300</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA B200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA GB200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-grace-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA GB300</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-grace-blackwell</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA H200</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmsysorg/sglang:deepseek-v4-hopper</code></td> </tr> </tbody> </table>

For how to launch one of these images, see Install → Method 3: Using Docker. A minimal example (substitute the image tag for your platform, and replace the inner `sglang serve ...` with whatever the command generator below produces):
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your-hf-token>" \
  --ipc=host \
  lmsysorg/sglang:deepseek-v4-blackwell \
  sglang serve <use args below>
```
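Once the container is up, it is worth probing the server before pointing traffic at it. A minimal sketch, assuming SGLang's `/health` liveness endpoint and the `-p 30000:30000` mapping above:

```python
import time

import requests

# Poll /health until the server reports ready; model loading can take several minutes.
for _ in range(60):
    try:
        if requests.get("http://localhost:30000/health", timeout=5).status_code == 200:
            print("server is up")
            break
    except requests.ConnectionError:
        pass  # container started but the server is not listening yet
    time.sleep(5)
else:
    print("server did not come up within 5 minutes")
```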
SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
Interactive Command Generator: Use the selector below to generate the deployment command for your hardware + recipe combination.
import { DeepSeekV4Deployment } from "/src/snippets/autoregressive/deepseek-v4-deployment.jsx";
<DeepSeekV4Deployment />

Concurrency & DeepEP dispatch buffer
Must hold: `max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`. Violating it overflows DeepEP's dispatch buffer under steady-state load (deep_ep.cpp:1105). When tuning, adjust --cuda-graph-max-bs, --max-running-requests, and the env var together.

The generator currently picks conservative values (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table; tune them up toward your actual workload's peak concurrency and report findings back so the defaults can be revised. A quick pre-launch check is sketched below.
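A minimal sanity check for the constraint above (a sketch; the helper name is hypothetical, and the two arguments must mirror the flags you actually pass to `sglang serve`):

```python
import os

def check_deepep_budget(max_running_requests: int, mtp_draft_tokens: int) -> None:
    """Hypothetical helper: fail fast instead of overflowing DeepEP's dispatch buffer."""
    budget = int(os.environ.get("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK", "0"))
    need = max_running_requests * mtp_draft_tokens
    if need > budget:
        raise ValueError(
            f"{max_running_requests} requests x {mtp_draft_tokens} draft tokens = {need} "
            f"> SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK ({budget}); "
            "raise the env var or lower concurrency."
        )

# Example: the low-latency recipe uses draft-tokens=4 (see the MTP section below).
check_deepep_budget(max_running_requests=128, mtp_draft_tokens=4)
```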
MTP (Multi-Token Prediction, EAGLE)

- low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
- balanced: steps=1, draft-tokens=2 → gentler MTP that reduces the throughput hit at higher batch sizes.
- max-throughput: MTP disabled; at saturation the verify step costs more than it saves.
- Recipes that enable MTP also set `SGLANG_ENABLE_SPEC_V2=1`.

Hopper (H200) note
We provide two options for running DeepSeek-V4 models on Hopper devices (H200): the official mixed-precision checkpoints, or the FP8-quantized checkpoints (sgl-project/DeepSeek-V4-Flash-FP8, sgl-project/DeepSeek-V4-Pro-FP8), which support more parallelism and features.

PD-Disagg recipes on H200 may require `docker run --privileged --ulimit memlock=-1` (or `--device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK`) so mooncake can discover the IB HCAs; without IB exposure mooncake silently falls back to TCP, which can lead to garbled KV transfer on large checkpoints.
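To confirm the HCAs are actually visible inside the container before launching, a quick check (a sketch, assuming passed-through IB devices appear under /dev/infiniband):

```python
import os

# An empty result here usually means mooncake will silently fall back to TCP.
ib_root = "/dev/infiniband"
devices = os.listdir(ib_root) if os.path.isdir(ib_root) else []
print("IB devices visible in container:", devices or "none (check --privileged / --device flags)")
```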
Base model usage
To use the Base models, set SGLANG_FIX_DSV4_BASE_MODEL_LOAD=1 and run the latest code until the next round of the testing matrix is finished.
GB300 PD-Disagg cross-pod MNNVL
On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may fail with `nvlink_transport.cpp:497 Requested address ... not found!`. If this happens, prepend `MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1` to both the prefill and decode `sglang serve` commands.
For basic API usage and request examples, see:
Once the server is running (for example via the command generator above), send a request:
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}]
  }'
```
PD-Disagg note: if you deployed with the `pd-disagg` recipe from the generator above, the prefill server is on port `30000`, the decode server on `30001`, and the router on port `8000`; client traffic should target `http://localhost:8000`, not `:30000`.
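Concretely, the Python client used in the sections below only needs its base_url switched to the router (port 8000 per the note above):

```python
from openai import OpenAI

# pd-disagg deployments: send client traffic to the router, not to the prefill server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
```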
Enable the `deepseek-v4` reasoning parser (check the box in the command panel above) to separate thinking from the final answer into `reasoning_content` vs `content`.
Streaming with Thinking Process:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Reasoning tokens arrive on `reasoning_content` when the parser is enabled.
    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
            has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    # The final answer arrives on the regular `content` field.
    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
```
Output Example:
Pending update — replace with real server output after deployment.
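The same request also works without streaming; in that case the parsed fields land on the completed message instead of on deltas (a sketch, assuming the reasoning parser is enabled as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Non-streaming variant: reasoning and answer arrive on the final message.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # thinking, if the parser is enabled
print(message.content)                              # final answer
```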
Enable the `deepseekv4` tool-call parser (check the box in the command panel above) to surface structured tool calls via `message.tool_calls`.
Python Example (with Thinking Process):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
            has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    # Tool-call fragments stream incrementally; accumulate them per call index.
    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

# Print the fully assembled tool calls once the stream ends.
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"  Arguments: {tool_call['arguments']}")

print()
```
Output Example:
Pending update — replace with real server output after deployment.
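A hedged sketch of the follow-up turn, continuing the script above: execute the accumulated call locally and hand the result back as a `tool` message so the model can produce its final answer. The `get_weather` implementation and the `call_0` id are stand-ins; real code should reuse the id streamed with the tool call.

```python
import json

# Hypothetical local implementation of the tool declared above.
def get_weather(location: str, unit: str = "celsius") -> str:
    return json.dumps({"location": location, "temperature": 22, "unit": unit})

call = tool_calls_accumulator[0]
result = get_weather(**json.loads(call["arguments"]))

followup = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"},
        {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_0",  # stand-in: reuse the id from the streamed tool call
                "type": "function",
                "function": {"name": call["name"], "arguments": call["arguments"]},
            }],
        },
        {"role": "tool", "tool_call_id": "call_0", "content": result},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)
```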
Test Environment:
We use SGLang's built-in benchmarking tool to evaluate performance on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data and better reflects performance in real-world scenarios. To approximate typical medium-length conversations with detailed responses, we configure each request with 1024 input tokens and 1024 output tokens.
Model Deployment Command: see the command panel above.
Benchmark Command:
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
```
Pending update — replace with real bench_serving output after the latency run.
Model Deployment Command: see the command panel above.
Benchmark Command:
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
```
Pending update — replace with real bench_serving output after the throughput run.
```bash
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000
```
Pending update
Pending update
```bash
cd sglang
bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 10 --port 30000
```
Pending update
Pending update