docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-R1.mdx
import { DeepSeekR1BasicDeployment } from '/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx'; import { DeepSeekR1AdvancedDeployment } from '/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx';
DeepSeek-R1 is DeepSeek's advanced reasoning model, combining strong language understanding with explicit step-by-step reasoning. The model is available in multiple quantization formats optimized for different hardware platforms.
Key Features:
Available Models:
License: To use DeepSeek-R1, you must agree to DeepSeek's Community License. See LICENSE for details.
For more details, please refer to the official DeepSeek-R1 repository.
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to automatically generate a basic deployment command for your hardware platform, quantization method, and deployment strategy.
<DeepSeekR1BasicDeployment />

Pareto-optimal configurations for B200, H200, MI300X, MI325X, and MI355X hardware.
<DeepSeekR1AdvancedDeployment />

For more detailed configuration tips and advanced tuning, please refer to DeepSeek V3/V3.1/R1 Usage.
For basic API usage and request examples, please refer to:
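In the meantime, a minimal OpenAI-compatible request against a locally deployed server looks like the sketch below (the base URL assumes SGLang's default port 30000, as used throughout this page):

```python
from openai import OpenAI

# Connect to the local SGLang server (default port 30000)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```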
DeepSeek-R1 supports advanced reasoning with a built-in thinking process. Enable the reasoning parser during deployment to separate the thinking and content sections:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --reasoning-parser deepseek-r1 \
  --tp 8
```
Streaming with Thinking Process:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real time
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print the answer content
        if delta.content:
            # Close the thinking section and add a content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================
The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.
```
Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
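With the parser enabled, non-streaming responses expose the same separation. A minimal sketch (the `reasoning_content` attribute mirrors the field used in the streaming example above; hedge with `getattr` in case the server was launched without the parser):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Non-streaming: with --reasoning-parser enabled, the reply is split into
# a thinking part and an answer part.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    temperature=0.7,
)

message = response.choices[0].message
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```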
DeepSeek-R1 supports tool calling capabilities. Enable the tool call parser:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser deepseekv3 \
  --chat-template examples/chat_template/tool_chat_template_deepseekr1.jinja \
  --tp 8
```
Python Example (with Thinking Process):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make the request with streaming enabled to watch the thinking process
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process the streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close the thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"🔧 Tool Call: {tool_call.function.name}")
                    print(f"   Arguments: {tool_call.function.arguments}")

        # Print regular content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information.
I should call the function with location="Beijing".
=============== Content =================
🔧 Tool Call: get_weather
   Arguments:
🔧 Tool Call: None
   Arguments: {"location": "Beijing"}
```
Note: When streaming, a tool call arrives in fragments: the first fragment carries the function name with empty arguments, and subsequent fragments carry `name=None` with incremental argument text (as in the output above). Accumulate the fragments and parse the arguments as JSON only after the stream finishes.
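A minimal accumulation loop, as a sketch (it assumes the fragment layout described in the note and a fresh streaming `response` like the one above):

```python
import json

# index -> {"name": ..., "arguments": ...}; tool calls are keyed by their
# stream index so parallel calls accumulate separately.
tool_calls = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "tool_calls", None):
        for fragment in delta.tool_calls:
            entry = tool_calls.setdefault(
                fragment.index, {"name": None, "arguments": ""}
            )
            if fragment.function.name:
                entry["name"] = fragment.function.name
            if fragment.function.arguments:
                entry["arguments"] += fragment.function.arguments

# Parse only after the stream has finished, when the JSON is complete.
for entry in tool_calls.values():
    print(entry["name"], json.loads(entry["arguments"]))
```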
Handling Tool Call Results:
```python
# After receiving the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call goes here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send the tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
```
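The same round trip can be automated instead of hardcoded. The sketch below executes each accumulated call from the streaming section and appends the results; `AVAILABLE_TOOLS` is an assumed name-to-callable registry, not part of any API, and `tool_calls`, `get_weather`, and `client` come from the examples above:

```python
import json

AVAILABLE_TOOLS = {"get_weather": get_weather}  # assumed local registry

followup = [{"role": "user", "content": "What's the weather in Beijing?"}]
for index, entry in tool_calls.items():
    call_id = f"call_{index}"  # prefer the real id captured from the stream
    # Echo the model's tool call back as an assistant turn...
    followup.append({
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": call_id,
            "type": "function",
            "function": {"name": entry["name"], "arguments": entry["arguments"]},
        }],
    })
    # ...then execute it locally and attach the result as a tool turn.
    result = AVAILABLE_TOOLS[entry["name"]](**json.loads(entry["arguments"]))
    followup.append({"role": "tool", "tool_call_id": call_id, "content": result})

final = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=followup,
    temperature=0.7,
)
print(final.choices[0].message.content)
```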
This section uses industry-standard configurations for comparable benchmark results.
Test Environment:
Benchmark Methodology:
We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms.
Three core scenarios reflect real-world usage patterns:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Scenario</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Input Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Output Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use Case</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Chat**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Most common conversational AI workload</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Reasoning**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>8K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Long-form generation, complex reasoning tasks</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Summarization**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>8K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Document summarization, RAG retrieval</td> </tr> </tbody> </table>

Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off:
- `--max-concurrency 1` (Latency-optimized)
- `--max-concurrency 16` (Balanced)
- `--max-concurrency 100` (Throughput-optimized)

For each concurrency level, configure `num_prompts` to simulate realistic user loads:
- `num_prompts = concurrency × 1` (minimal test)
- `num_prompts = concurrency × 5` (standard benchmark)
- `num_prompts = concurrency × 10` (production-grade)

A small helper for sweeping this matrix is sketched below.
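The helper prints one `bench_serving` command per scenario and concurrency pair. This is a sketch: the flag values mirror the commands in the following sections, and the ×5 multiplier is the standard setting from the list above (note that the long-output scenarios below cap concurrency at 64 instead of 100):

```python
# Generate bench_serving commands for every scenario/concurrency pair.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}
CONCURRENCY_LEVELS = [1, 16, 100]
MULTIPLIER = 5  # standard benchmark: num_prompts = concurrency x 5

for name, (input_len, output_len) in SCENARIOS.items():
    for concurrency in CONCURRENCY_LEVELS:
        cmd = (
            "python -m sglang.bench_serving --backend sglang "
            "--model deepseek-ai/DeepSeek-R1-0528 --dataset-name random "
            f"--random-input-len {input_len} --random-output-len {output_len} "
            f"--num-prompts {concurrency * MULTIPLIER} "
            f"--max-concurrency {concurrency} --request-rate inf"
        )
        print(f"# {name} @ concurrency {concurrency}")
        print(cmd)
```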
Scenario 1: Chat (1K/1K) - Most Important

```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 8
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 40.00
Total input tokens: 6101
Total input text tokens: 6101
Total input vision tokens: 0
Total generated tokens: 4210
Total generated tokens (retokenized): 4205
Request throughput (req/s): 0.25
Input token throughput (tok/s): 152.52
Output token throughput (tok/s): 105.24
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 2
Total token throughput (tok/s): 257.76
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3998.40
Median E2E Latency (ms): 3207.53
---------------Time to First Token----------------
Mean TTFT (ms): 153.00
Median TTFT (ms): 140.76
P99 TTFT (ms): 214.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.16
Median TPOT (ms): 9.15
P99 TPOT (ms): 9.21
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.16
Median ITL (ms): 9.15
P95 ITL (ms): 9.47
P99 ITL (ms): 9.63
Max ITL (ms): 15.45
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 51.21
Total input tokens: 39668
Total input text tokens: 39668
Total input vision tokens: 0
Total generated tokens: 40725
Total generated tokens (retokenized): 40458
Request throughput (req/s): 1.56
Input token throughput (tok/s): 774.66
Output token throughput (tok/s): 795.30
Peak output token throughput (tok/s): 1088.00
Peak concurrent requests: 21
Total token throughput (tok/s): 1569.96
Concurrency: 13.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8918.33
Median E2E Latency (ms): 9466.16
---------------Time to First Token----------------
Mean TTFT (ms): 273.51
Median TTFT (ms): 131.71
P99 TTFT (ms): 839.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.56
Median TPOT (ms): 17.46
P99 TPOT (ms): 28.68
---------------Inter-Token Latency----------------
Mean ITL (ms): 17.02
Median ITL (ms): 14.70
P95 ITL (ms): 16.41
P99 ITL (ms): 112.38
Max ITL (ms): 461.90
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 110.46
Total input tokens: 249831
Total input text tokens: 249831
Total input vision tokens: 0
Total generated tokens: 252162
Total generated tokens (retokenized): 251441
Request throughput (req/s): 4.53
Input token throughput (tok/s): 2261.80
Output token throughput (tok/s): 2282.90
Peak output token throughput (tok/s): 3900.00
Peak concurrent requests: 109
Total token throughput (tok/s): 4544.71
Concurrency: 92.26
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 20380.71
Median E2E Latency (ms): 19391.65
---------------Time to First Token----------------
Mean TTFT (ms): 563.14
Median TTFT (ms): 147.62
P99 TTFT (ms): 2632.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 40.11
Median TPOT (ms): 41.98
P99 TPOT (ms): 50.10
---------------Inter-Token Latency----------------
Mean ITL (ms): 39.37
Median ITL (ms): 26.36
P95 ITL (ms): 98.16
P99 ITL (ms): 150.08
Max ITL (ms): 2052.85
==================================================
```
Scenario 2: Reasoning (1K/8K)
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 411.34
Total input tokens: 6101
Total input text tokens: 6101
Total input vision tokens: 0
Total generated tokens: 44452
Total generated tokens (retokenized): 44390
Request throughput (req/s): 0.02
Input token throughput (tok/s): 14.83
Output token throughput (tok/s): 108.07
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 2
Total token throughput (tok/s): 122.90
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 41132.04
Median E2E Latency (ms): 44288.71
---------------Time to First Token----------------
Mean TTFT (ms): 125.76
Median TTFT (ms): 126.19
P99 TTFT (ms): 137.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.21
Median TPOT (ms): 9.20
P99 TPOT (ms): 9.27
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.23
Median ITL (ms): 9.22
P95 ITL (ms): 9.64
P99 ITL (ms): 9.86
Max ITL (ms): 15.18
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 348.93
Total input tokens: 39668
Total input text tokens: 39668
Total input vision tokens: 0
Total generated tokens: 318226
Total generated tokens (retokenized): 317630
Request throughput (req/s): 0.23
Input token throughput (tok/s): 113.69
Output token throughput (tok/s): 912.02
Peak output token throughput (tok/s): 1088.00
Peak concurrent requests: 19
Total token throughput (tok/s): 1025.70
Concurrency: 14.07
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 61360.70
Median E2E Latency (ms): 62071.20
---------------Time to First Token----------------
Mean TTFT (ms): 176.02
Median TTFT (ms): 153.75
P99 TTFT (ms): 268.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.42
Median TPOT (ms): 15.59
P99 TPOT (ms): 16.07
---------------Inter-Token Latency----------------
Mean ITL (ms): 15.39
Median ITL (ms): 15.17
P95 ITL (ms): 16.62
P99 ITL (ms): 18.13
Max ITL (ms): 226.59
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 589.31
Total input tokens: 158939
Total input text tokens: 158939
Total input vision tokens: 0
Total generated tokens: 1300705
Total generated tokens (retokenized): 1297658
Request throughput (req/s): 0.54
Input token throughput (tok/s): 269.70
Output token throughput (tok/s): 2207.16
Peak output token throughput (tok/s): 2944.00
Peak concurrent requests: 68
Total token throughput (tok/s): 2476.86
Concurrency: 57.03
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 105032.36
Median E2E Latency (ms): 108229.09
---------------Time to First Token----------------
Mean TTFT (ms): 223.91
Median TTFT (ms): 158.15
P99 TTFT (ms): 474.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.94
Median TPOT (ms): 26.72
P99 TPOT (ms): 27.99
---------------Inter-Token Latency----------------
Mean ITL (ms): 25.79
Median ITL (ms): 25.37
P95 ITL (ms): 26.70
P99 ITL (ms): 105.49
Max ITL (ms): 237.91
==================================================
```
Scenario 3: Summarization (8K/1K)
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 40.65
Total input tokens: 41941
Total input text tokens: 41941
Total input vision tokens: 0
Total generated tokens: 4210
Total generated tokens (retokenized): 4195
Request throughput (req/s): 0.25
Input token throughput (tok/s): 1031.65
Output token throughput (tok/s): 103.56
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 2
Total token throughput (tok/s): 1135.20
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4063.62
Median E2E Latency (ms): 3296.13
---------------Time to First Token----------------
Mean TTFT (ms): 165.91
Median TTFT (ms): 154.96
P99 TTFT (ms): 240.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.26
Median TPOT (ms): 9.27
P99 TPOT (ms): 9.42
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.28
Median ITL (ms): 9.28
P95 ITL (ms): 9.66
P99 ITL (ms): 9.83
Max ITL (ms): 14.06
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 56.71
Total input tokens: 300020
Total input text tokens: 300020
Total input vision tokens: 0
Total generated tokens: 41589
Total generated tokens (retokenized): 41490
Request throughput (req/s): 1.41
Input token throughput (tok/s): 5290.75
Output token throughput (tok/s): 733.41
Peak output token throughput (tok/s): 1024.00
Peak concurrent requests: 20
Total token throughput (tok/s): 6024.16
Concurrency: 14.25
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 10098.99
Median E2E Latency (ms): 10623.46
---------------Time to First Token----------------
Mean TTFT (ms): 486.80
Median TTFT (ms): 189.59
P99 TTFT (ms): 2138.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 19.06
Median TPOT (ms): 19.23
P99 TPOT (ms): 30.69
---------------Inter-Token Latency----------------
Mean ITL (ms): 18.53
Median ITL (ms): 15.63
P95 ITL (ms): 16.64
P99 ITL (ms): 109.71
Max ITL (ms): 1471.36
==================================================
```
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 115.55
Total input tokens: 1273893
Total input text tokens: 1273893
Total input vision tokens: 0
Total generated tokens: 169680
Total generated tokens (retokenized): 169275
Request throughput (req/s): 2.77
Input token throughput (tok/s): 11024.93
Output token throughput (tok/s): 1468.50
Peak output token throughput (tok/s): 2254.00
Peak concurrent requests: 70
Total token throughput (tok/s): 12493.43
Concurrency: 59.45
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 21465.98
Median E2E Latency (ms): 20686.26
---------------Time to First Token----------------
Mean TTFT (ms): 913.93
Median TTFT (ms): 224.92
P99 TTFT (ms): 6257.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.93
Median TPOT (ms): 40.99
P99 TPOT (ms): 60.91
---------------Inter-Token Latency----------------
Mean ITL (ms): 38.83
Median ITL (ms): 26.29
P95 ITL (ms): 113.81
P99 ITL (ms): 176.94
Max ITL (ms): 5521.53
==================================================
```
Key Metrics:
Why These Configurations Matter:
Interpreting Results:
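As one way to compare runs side by side, a small parser can pull the headline numbers out of saved `bench_serving` logs. This is a sketch: the regexes assume the exact output format shown above, and the log filename is hypothetical.

```python
import re

# Patterns match the bench_serving result lines shown in this section.
PATTERNS = {
    "request_throughput_req_s": r"Request throughput \(req/s\):\s+([\d.]+)",
    "output_throughput_tok_s": r"Output token throughput \(tok/s\):\s+([\d.]+)",
    "mean_ttft_ms": r"Mean TTFT \(ms\):\s+([\d.]+)",
    "mean_tpot_ms": r"Mean TPOT \(ms\):\s+([\d.]+)",
}

def parse_benchmark_log(path: str) -> dict:
    text = open(path).read()
    return {
        key: float(match.group(1))
        for key, pattern in PATTERNS.items()
        if (match := re.search(pattern, text))
    }

print(parse_benchmark_log("chat_c16.log"))  # hypothetical saved log
```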
To document model accuracy on standard benchmarks, run the GSM8K evaluation:
```bash
python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316
```
Test Results:
```
Accuracy: 0.959
Invalid: 0.000
Latency: 29.185 s
Output throughput: 4854.672 token/s
```