DeepSeek-R1 - SGLang

import { DeepSeekR1BasicDeployment } from '/src/snippets/autoregressive/deepseek-r1-basic-deployment.jsx'; import { DeepSeekR1AdvancedDeployment } from '/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx';

1. Model Introduction

DeepSeek-R1 is DeepSeek's advanced reasoning model that combines powerful language understanding with step-by-step reasoning capabilities. The model is available in multiple quantization formats optimized for different hardware platforms.

Key Features:

Advanced Reasoning: Built-in reasoning capabilities for complex problem-solving
Multiple Quantizations: FP8 and FP4 variants for different performance/memory trade-offs
Hardware Optimization: Specifically tuned for NVIDIA B200 (Blackwell) and H200 (Hopper) GPUs, AMD MI300X, MI325X and MI355X GPUs, as well as Intel Xeon CPUs
High Performance: Optimized for both throughput and latency scenarios

Available Models:

FP8 (8-bit quantized): deepseek-ai/DeepSeek-R1-0528 - Recommended for H200 and MI300X
FP4 (4-bit quantized): nvidia/DeepSeek-R1-0528-FP4-v2 - Recommended for B200 and MI355X
BF16 (upcast from FP8): unsloth/DeepSeek-R1-0528-BF16
INT8 (channel-wise): meituan/DeepSeek-R1-Channel-INT8
W4A8: novita/Deepseek-R1-0528-W4AFP8
AWQ (4-bit): QuixiAI/DeepSeek-R1-0528-AWQ
MXFP4: amd/DeepSeek-R1-MXFP4

License: To use DeepSeek-R1, you must agree to DeepSeek's Community License. See LICENSE for details.

For more details, please refer to the official DeepSeek-R1 repository.

2. SGLang Installation

Please refer to the official SGLang installation guide for installation instructions.

For SGLang CPU installation, please refer to the CPU version installation guide.

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate a basic deployment command for your hardware platform, quantization method, and deployment strategy.

3.2 Optimal Configurations

Pareto-optimal configurations for B200, H200, MI300X, MI325X, and MI355X hardware.

3.3 Configuration Tips

DeepSeek-R1 shares the same MoE architecture as DeepSeek-V3, so the same hardware and optimization recommendations apply.

Recommended GPU configurations by weight type:

<table style={{width: "100%", borderCollapse: "collapse"}}> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.02)"}}>Weight Type</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.05)"}}>Supported Hardware</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><strong>FP8</strong> (recommended)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>8× H200, 8× B200, 8× MI300X, 2×8× H100/H800/H20</td> </tr> <tr> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><strong>BF16</strong> (upcast from FP8)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>2×8× H200, 2×8× MI300X, 4×8× H100/H800, 4×8× A100/A800</td> </tr> <tr> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><strong>INT8</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>16× A100/A800, 32× L40S, Xeon 6980P CPU, 4× Atlas 800I A3</td> </tr> <tr> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}><strong>W4A8 / AWQ / MXFP4 / NVFP4</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>8× H20/H100, 4× H200; 8× H100/A100; 8/4× MI355X/MI350X; 8/4× B200</td> </tr> </tbody> </table>

The official DeepSeek-R1 checkpoint is already in FP8 format, so do not add --quantization fp8 when serving it.
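
The GPU counts in the table above largely follow from the size of the weights. The sketch below is a rough arithmetic check, not a sizing tool. It assumes 671B total parameters and nominal per-GPU memory sizes (neither figure comes from this guide), and it ignores KV cache, activations, quantization scales, and runtime overhead.

```python
# Rough weight-memory check for the configurations in the table above.
# Assumptions (not from this guide): 671e9 total parameters and nominal
# per-GPU memory sizes. KV cache, activations, and overhead are ignored.
TOTAL_PARAMS = 671e9

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "fp4": 0.5}

GPU_MEMORY_GB = {"H100": 80, "A100": 80, "H200": 141, "MI300X": 192}


def weight_gb(dtype: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(dtype: str, gpu: str, num_gpus: int) -> bool:
    """True if the weights alone fit in the combined memory of the GPUs."""
    return weight_gb(dtype) < GPU_MEMORY_GB[gpu] * num_gpus


for dtype, gpu, n in [
    ("fp8", "H200", 8),
    ("fp8", "H100", 8),
    ("fp8", "H100", 16),
    ("bf16", "H200", 8),
    ("bf16", "H200", 16),
    ("fp4", "H200", 4),
]:
    print(f"{dtype:>4} on {n:>2}x {gpu}: weights ~{weight_gb(dtype):.0f} GB, "
          f"fits={weights_fit(dtype, gpu, n)}")
```

Under these assumptions the FP8 weights fit on one 8-GPU H200 node but not on 8× H100, which is consistent with the 2×8× H100 entry in the table. Real deployments need additional headroom for the KV cache.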

DeepGEMM precompilation (NVIDIA Hopper / Blackwell): Precompile GEMM kernels to avoid JIT overhead (~10 min):

bash

python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code

Data Parallelism Attention (--enable-dp-attention): Recommended for high-throughput scenarios. Use --enable-dp-attention --tp 8 --dp 8 on a single 8-GPU node.

NCCL timeout: If model loading is slow, raise the timeout with --dist-timeout 3600.
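
As a minimal sketch, the tips above can be combined into one launch command. The helper name build_launch_command and its parameters are hypothetical; only the flags themselves come from this guide. Setting --dp equal to --tp mirrors the single 8-GPU node example and is an assumption for other layouts.

```python
import shlex


def build_launch_command(model_path, tp=8, dp_attention=False, dist_timeout=None):
    """Assemble an sglang.launch_server command line from the tips above."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
    ]
    if dp_attention:
        # Single-node DP attention: --dp matches --tp, as in the tip above.
        args += ["--enable-dp-attention", "--dp", str(tp)]
    if dist_timeout is not None:
        args += ["--dist-timeout", str(dist_timeout)]
    # No --quantization flag: the official checkpoint is already FP8.
    return shlex.join(args)


print(build_launch_command(
    "deepseek-ai/DeepSeek-R1-0528", dp_attention=True, dist_timeout=3600
))
```

Printing the command instead of running it makes it easy to review the flags before starting a long model load.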

To configure CPU serving, see the Notes in the serving engine launch section of the SGLang CPU server document. They explain how to set the arguments, in particular TP (tensor parallel) and NUMA binding.

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to:

SGLang Basic Usage Guide

4.2 Advanced Usage

4.2.1 Reasoning Parser

DeepSeek-R1 has a built-in thinking process. Enable the reasoning parser during deployment to separate the thinking section from the final content:

shell

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --reasoning-parser deepseek-r1 \
  --tp 8

Streaming with Thinking Process:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

text

=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================

The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.

Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
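
For non-streaming requests, the same two fields can be read from the returned message. The sketch below assumes the server was launched with --reasoning-parser as shown above. The helper name split_message is hypothetical, and a SimpleNamespace stands in for response.choices[0].message so the snippet runs without a server.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (reasoning, answer) from a chat completion message.

    reasoning_content is only expected when the reasoning parser is enabled,
    so getattr keeps this safe when the attribute is absent.
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer


# Stand-in for response.choices[0].message (illustrative values only).
fake_message = SimpleNamespace(
    reasoning_content="15% = 0.15; 240 x 0.15 = 36",
    content="The answer is 36.",
)

reasoning, answer = split_message(fake_message)
print("Thinking:", reasoning)
print("Answer:", answer)
```

With a live server, pass response.choices[0].message from a request made without stream=True.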

4.2.2 Tool Calling

DeepSeek-R1 supports tool calling capabilities. Enable the tool call parser:

shell

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser deepseekv3 \
  --chat-template examples/chat_template/tool_chat_template_deepseekr1.jinja \
  --tp 8

Python Example (with Thinking Process):

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"🔧 Tool Call: {tool_call.function.name}")
                    print(f"   Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()

Output Example:

text

=============== Thinking =================
The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information.
I should call the function with location="Beijing".
=============== Content =================

🔧 Tool Call: get_weather
   Arguments:
🔧 Tool Call: None
   Arguments: {"location": "Beijing"}

Note:

The reasoning parser shows how the model decides to use a tool
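Tool calls are marked with the function name and arguments; when streaming, the name and the arguments can arrive in separate chunks, which is why the second entry above shows None as the name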
You can then execute the function and send the result back to continue the conversation
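
The example output above shows the function name and the arguments arriving in separate chunks. A minimal sketch for merging such fragments into complete calls is shown below. It assumes each fragment carries an index field, as in the OpenAI streaming format. The helper name merge_tool_call_deltas is hypothetical, and the fragments are stand-ins built with SimpleNamespace (the id value is a placeholder).

```python
from types import SimpleNamespace


def merge_tool_call_deltas(fragments):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for frag in fragments:
        call = calls.setdefault(
            frag.index, {"id": None, "name": None, "arguments": ""}
        )
        if getattr(frag, "id", None):
            call["id"] = frag.id
        function = getattr(frag, "function", None)
        if function is None:
            continue
        if function.name:
            call["name"] = function.name
        if function.arguments:
            # Argument JSON may be split across chunks, so concatenate it.
            call["arguments"] += function.arguments
    return [calls[index] for index in sorted(calls)]


# Stand-ins for the delta.tool_calls entries behind the output above:
# the name arrives first, the arguments in a later chunk.
fragments = [
    SimpleNamespace(index=0, id="call_0",
                    function=SimpleNamespace(name="get_weather", arguments="")),
    SimpleNamespace(index=0, id=None,
                    function=SimpleNamespace(name=None,
                                             arguments='{"location": "Beijing"}')),
]

print(merge_tool_call_deltas(fragments))
```

In the streaming loop, collect every entry of delta.tool_calls into a list and call the helper once the stream ends.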

Handling Tool Call Results:

python

# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
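
The example above hard-codes the assistant's tool call. One possible way to generalize it is to look up the function by name and build the role "tool" message from the parsed arguments. The names TOOL_REGISTRY and run_tool_call are hypothetical, and get_weather is the same placeholder as above.

```python
import json


def get_weather(location, unit="celsius"):
    # Placeholder implementation, as in the example above.
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."


# Maps tool names (as declared in the tools list) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Execute one tool call and return the matching role="tool" message."""
    try:
        kwargs = json.loads(arguments_json or "{}")
        result = TOOL_REGISTRY[name](**kwargs)
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report unknown tools or malformed arguments back to the model.
        result = f"Tool error: {exc!r}"
    return {"role": "tool", "tool_call_id": call_id, "content": str(result)}


print(run_tool_call("call_123", "get_weather", '{"location": "Beijing"}'))
```

Returning the error text as the tool result, rather than raising, lets the model see what went wrong and retry with corrected arguments.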

4.2.3 Multi-Token Prediction (EAGLE Speculative Decoding)

DeepSeek-R1 supports EAGLE-based Multi-Token Prediction (MTP), the same mechanism as DeepSeek-V3. Refer to DeepSeek-V3 §4.2.3 for the complete launch command, flag reference, tuning guidance (--speculative-num-steps, --speculative-eagle-topk, --max-running-requests), and bench_speculative.py link. R1's speed benchmark commands that include --speculative-* flags use this mechanism.

4.2.4 Thinking Budget

Limit the model's thinking token budget using CustomLogitProcessor. Launch with --enable-custom-logit-processor:

shell

python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --port 30000 \
  --reasoning-parser deepseek-r1 \
  --enable-custom-logit-processor

python

import openai
from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Is Paris the capital of France?"}],
    max_tokens=1024,
    extra_body={
        "custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {"thinking_budget": 512},
    },
)
print(response)

5. Benchmark

This section uses industry-standard configurations for comparable benchmark results.

5.1 Speed Benchmark

Test Environment:

Hardware: B200 GPU (8x)
Model: DeepSeek-R1-0528
Tensor Parallelism: 8
SGLang Version: 0.5.6.post1

Benchmark Methodology:

We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms.

5.1.1 Standard Test Scenarios

Three core scenarios reflect real-world usage patterns:

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Scenario</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Input Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Output Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use Case</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Chat**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Most common conversational AI workload</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Reasoning**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>8K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Long-form generation, complex reasoning tasks</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Summarization**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>8K</td> <td style={{padding: "9px 12px", 
backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Document summarization, RAG retrieval</td> </tr> </tbody> </table>

5.1.2 Concurrency Levels

Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off:

Low Concurrency: --max-concurrency 1 (Latency-optimized)
Medium Concurrency: --max-concurrency 16 (Balanced)
High Concurrency: --max-concurrency 100 (Throughput-optimized); the Reasoning and Summarization runs in this guide use 64

5.1.3 Number of Prompts

For each concurrency level, configure num_prompts to simulate realistic user loads:

Quick Test: num_prompts = concurrency × 1 (minimal test)
Recommended: num_prompts = concurrency × 5 (standard benchmark)
Stable Measurements: num_prompts = concurrency × 10 (production-grade)
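
The scenario, concurrency, and num_prompts rules above can be combined mechanically. The sketch below builds the corresponding sglang.bench_serving command lines. The dictionary keys and the helper name bench_command are illustrative, 1K and 8K are taken as 1000 and 8000 tokens, and the flags are the ones used by the benchmark commands in this guide.

```python
import shlex

# Input/output lengths per scenario and num_prompts multipliers from the
# lists above. The keys are illustrative names, not SGLang options.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}
PROMPT_MULTIPLIER = {"quick": 1, "recommended": 5, "stable": 10}


def bench_command(scenario, concurrency, level="recommended",
                  model="deepseek-ai/DeepSeek-R1-0528"):
    """Build one sglang.bench_serving command line as a string."""
    input_len, output_len = SCENARIOS[scenario]
    num_prompts = concurrency * PROMPT_MULTIPLIER[level]
    return shlex.join([
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
    ])


for concurrency in (1, 16, 100):
    print(bench_command("chat", concurrency))
```

At concurrency 1 the "recommended" multiplier yields only 5 prompts; passing level="stable" gives 10.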

5.1.4 Benchmark Commands

Scenario 1: Chat (1K/1K) - Most Important

Model Deployment

bash

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 8

Low Concurrency (Latency-Optimized)

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  40.00
Total input tokens:                      6101
Total input text tokens:                 6101
Total input vision tokens:               0
Total generated tokens:                  4210
Total generated tokens (retokenized):    4205
Request throughput (req/s):              0.25
Input token throughput (tok/s):          152.52
Output token throughput (tok/s):         105.24
Peak output token throughput (tok/s):    110.00
Peak concurrent requests:                2
Total token throughput (tok/s):          257.76
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3998.40
Median E2E Latency (ms):                 3207.53
---------------Time to First Token----------------
Mean TTFT (ms):                          153.00
Median TTFT (ms):                        140.76
P99 TTFT (ms):                           214.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.16
Median TPOT (ms):                        9.15
P99 TPOT (ms):                           9.21
---------------Inter-Token Latency----------------
Mean ITL (ms):                           9.16
Median ITL (ms):                         9.15
P95 ITL (ms):                            9.47
P99 ITL (ms):                            9.63
Max ITL (ms):                            15.45
==================================================

Medium Concurrency (Balanced)

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  51.21
Total input tokens:                      39668
Total input text tokens:                 39668
Total input vision tokens:               0
Total generated tokens:                  40725
Total generated tokens (retokenized):    40458
Request throughput (req/s):              1.56
Input token throughput (tok/s):          774.66
Output token throughput (tok/s):         795.30
Peak output token throughput (tok/s):    1088.00
Peak concurrent requests:                21
Total token throughput (tok/s):          1569.96
Concurrency:                             13.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8918.33
Median E2E Latency (ms):                 9466.16
---------------Time to First Token----------------
Mean TTFT (ms):                          273.51
Median TTFT (ms):                        131.71
P99 TTFT (ms):                           839.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.56
Median TPOT (ms):                        17.46
P99 TPOT (ms):                           28.68
---------------Inter-Token Latency----------------
Mean ITL (ms):                           17.02
Median ITL (ms):                         14.70
P95 ITL (ms):                            16.41
P99 ITL (ms):                            112.38
Max ITL (ms):                            461.90
==================================================

High Concurrency (Throughput-Optimized)

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     500
Benchmark duration (s):                  110.46
Total input tokens:                      249831
Total input text tokens:                 249831
Total input vision tokens:               0
Total generated tokens:                  252162
Total generated tokens (retokenized):    251441
Request throughput (req/s):              4.53
Input token throughput (tok/s):          2261.80
Output token throughput (tok/s):         2282.90
Peak output token throughput (tok/s):    3900.00
Peak concurrent requests:                109
Total token throughput (tok/s):          4544.71
Concurrency:                             92.26
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20380.71
Median E2E Latency (ms):                 19391.65
---------------Time to First Token----------------
Mean TTFT (ms):                          563.14
Median TTFT (ms):                        147.62
P99 TTFT (ms):                           2632.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.11
Median TPOT (ms):                        41.98
P99 TPOT (ms):                           50.10
---------------Inter-Token Latency----------------
Mean ITL (ms):                           39.37
Median ITL (ms):                         26.36
P95 ITL (ms):                            98.16
P99 ITL (ms):                            150.08
Max ITL (ms):                            2052.85
==================================================

Scenario 2: Reasoning (1K/8K)

Low Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  411.34
Total input tokens:                      6101
Total input text tokens:                 6101
Total input vision tokens:               0
Total generated tokens:                  44452
Total generated tokens (retokenized):    44390
Request throughput (req/s):              0.02
Input token throughput (tok/s):          14.83
Output token throughput (tok/s):         108.07
Peak output token throughput (tok/s):    110.00
Peak concurrent requests:                2
Total token throughput (tok/s):          122.90
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   41132.04
Median E2E Latency (ms):                 44288.71
---------------Time to First Token----------------
Mean TTFT (ms):                          125.76
Median TTFT (ms):                        126.19
P99 TTFT (ms):                           137.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.21
Median TPOT (ms):                        9.20
P99 TPOT (ms):                           9.27
---------------Inter-Token Latency----------------
Mean ITL (ms):                           9.23
Median ITL (ms):                         9.22
P95 ITL (ms):                            9.64
P99 ITL (ms):                            9.86
Max ITL (ms):                            15.18
==================================================

Medium Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  348.93
Total input tokens:                      39668
Total input text tokens:                 39668
Total input vision tokens:               0
Total generated tokens:                  318226
Total generated tokens (retokenized):    317630
Request throughput (req/s):              0.23
Input token throughput (tok/s):          113.69
Output token throughput (tok/s):         912.02
Peak output token throughput (tok/s):    1088.00
Peak concurrent requests:                19
Total token throughput (tok/s):          1025.70
Concurrency:                             14.07
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61360.70
Median E2E Latency (ms):                 62071.20
---------------Time to First Token----------------
Mean TTFT (ms):                          176.02
Median TTFT (ms):                        153.75
P99 TTFT (ms):                           268.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.42
Median TPOT (ms):                        15.59
P99 TPOT (ms):                           16.07
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.39
Median ITL (ms):                         15.17
P95 ITL (ms):                            16.62
P99 ITL (ms):                            18.13
Max ITL (ms):                            226.59
==================================================

High Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 64
Successful requests:                     320
Benchmark duration (s):                  589.31
Total input tokens:                      158939
Total input text tokens:                 158939
Total input vision tokens:               0
Total generated tokens:                  1300705
Total generated tokens (retokenized):    1297658
Request throughput (req/s):              0.54
Input token throughput (tok/s):          269.70
Output token throughput (tok/s):         2207.16
Peak output token throughput (tok/s):    2944.00
Peak concurrent requests:                68
Total token throughput (tok/s):          2476.86
Concurrency:                             57.03
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   105032.36
Median E2E Latency (ms):                 108229.09
---------------Time to First Token----------------
Mean TTFT (ms):                          223.91
Median TTFT (ms):                        158.15
P99 TTFT (ms):                           474.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.94
Median TPOT (ms):                        26.72
P99 TPOT (ms):                           27.99
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.79
Median ITL (ms):                         25.37
P95 ITL (ms):                            26.70
P99 ITL (ms):                            105.49
Max ITL (ms):                            237.91
==================================================

Scenario 3: Summarization (8K/1K)

Low Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  40.65
Total input tokens:                      41941
Total input text tokens:                 41941
Total input vision tokens:               0
Total generated tokens:                  4210
Total generated tokens (retokenized):    4195
Request throughput (req/s):              0.25
Input token throughput (tok/s):          1031.65
Output token throughput (tok/s):         103.56
Peak output token throughput (tok/s):    110.00
Peak concurrent requests:                2
Total token throughput (tok/s):          1135.20
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4063.62
Median E2E Latency (ms):                 3296.13
---------------Time to First Token----------------
Mean TTFT (ms):                          165.91
Median TTFT (ms):                        154.96
P99 TTFT (ms):                           240.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.26
Median TPOT (ms):                        9.27
P99 TPOT (ms):                           9.42
---------------Inter-Token Latency----------------
Mean ITL (ms):                           9.28
Median ITL (ms):                         9.28
P95 ITL (ms):                            9.66
P99 ITL (ms):                            9.83
Max ITL (ms):                            14.06
==================================================

Medium Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  56.71
Total input tokens:                      300020
Total input text tokens:                 300020
Total input vision tokens:               0
Total generated tokens:                  41589
Total generated tokens (retokenized):    41490
Request throughput (req/s):              1.41
Input token throughput (tok/s):          5290.75
Output token throughput (tok/s):         733.41
Peak output token throughput (tok/s):    1024.00
Peak concurrent requests:                20
Total token throughput (tok/s):          6024.16
Concurrency:                             14.25
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10098.99
Median E2E Latency (ms):                 10623.46
---------------Time to First Token----------------
Mean TTFT (ms):                          486.80
Median TTFT (ms):                        189.59
P99 TTFT (ms):                           2138.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.06
Median TPOT (ms):                        19.23
P99 TPOT (ms):                           30.69
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.53
Median ITL (ms):                         15.63
P95 ITL (ms):                            16.64
P99 ITL (ms):                            109.71
Max ITL (ms):                            1471.36
==================================================

High Concurrency

bash

python -m sglang.bench_serving \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf

text

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 64
Successful requests:                     320
Benchmark duration (s):                  115.55
Total input tokens:                      1273893
Total input text tokens:                 1273893
Total input vision tokens:               0
Total generated tokens:                  169680
Total generated tokens (retokenized):    169275
Request throughput (req/s):              2.77
Input token throughput (tok/s):          11024.93
Output token throughput (tok/s):         1468.50
Peak output token throughput (tok/s):    2254.00
Peak concurrent requests:                70
Total token throughput (tok/s):          12493.43
Concurrency:                             59.45
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21465.98
Median E2E Latency (ms):                 20686.26
---------------Time to First Token----------------
Mean TTFT (ms):                          913.93
Median TTFT (ms):                        224.92
P99 TTFT (ms):                           6257.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.93
Median TPOT (ms):                        40.99
P99 TPOT (ms):                           60.91
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.83
Median ITL (ms):                         26.29
P95 ITL (ms):                            113.81
P99 ITL (ms):                            176.94
Max ITL (ms):                            5521.53
==================================================

5.1.5 Understanding the Results

Key Metrics:

Request Throughput (req/s): Number of requests processed per second
Output Token Throughput (tok/s): Output tokens generated per second, summed over all concurrent requests
Mean TTFT (ms): Time to First Token - measures responsiveness
Mean TPOT (ms): Time Per Output Token - measures generation speed
Mean ITL (ms): Inter-Token Latency - measures streaming consistency
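
As a worked example, the throughput figures can be recomputed from the totals in a report. The numbers below are copied from the medium-concurrency Chat run in this guide. The concurrency formula (total request time divided by wall-clock time) is an assumption that happens to match the reported value, not a statement about how the tool computes it. Small differences from the report come from the rounded duration.

```python
# Totals copied from the medium-concurrency Chat (1K/1K) report.
successful_requests = 80
duration_s = 51.21
input_tokens = 39668
generated_tokens = 40725
mean_e2e_ms = 8918.33

# Throughput metrics are totals divided by the benchmark duration.
request_throughput = successful_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s

# Assumed definition: average number of requests in flight, i.e. the sum of
# all end-to-end latencies divided by the wall-clock duration.
avg_concurrency = successful_requests * (mean_e2e_ms / 1000) / duration_s

print(f"Request throughput: {request_throughput:.2f} req/s (reported 1.56)")
print(f"Output throughput:  {output_throughput:.2f} tok/s (reported 795.30)")
print(f"Total throughput:   {total_throughput:.2f} tok/s (reported 1569.96)")
print(f"Avg concurrency:    {avg_concurrency:.2f} (reported 13.93)")
```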

Why These Configurations Matter:

1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks.
Variable Concurrency: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.

Interpreting Results:

Compare your results against baseline numbers for your hardware
Higher throughput at same latency = better performance
Lower TTFT = more responsive user experience
Lower TPOT = faster generation speed
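
To compare runs without reading the reports by eye, the text output can be parsed into numbers. This is a minimal sketch that assumes the "Label: value" layout shown in the reports in this guide. The helper names are hypothetical, and the two excerpts are taken from the low and medium concurrency Chat runs.

```python
import re

# Matches "Label:   123.45" lines; non-numeric values such as "sglang" or
# "inf" and the separator lines are skipped.
_METRIC_LINE = re.compile(r"^(.+?):\s+([-+]?\d+(?:\.\d+)?)\s*$")


def parse_bench_output(text):
    """Parse numeric 'Label: value' lines from a report into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def percent_change(baseline, current, key):
    """Relative change of one metric versus the baseline, in percent."""
    return 100.0 * (current[key] - baseline[key]) / baseline[key]


LOW = """\
Backend:                                 sglang
Max request concurrency:                 1
Output token throughput (tok/s):         105.24
Mean TPOT (ms):                          9.16
"""
MEDIUM = """\
Backend:                                 sglang
Max request concurrency:                 16
Output token throughput (tok/s):         795.30
Mean TPOT (ms):                          17.56
"""

low, medium = parse_bench_output(LOW), parse_bench_output(MEDIUM)
for key in ("Output token throughput (tok/s)", "Mean TPOT (ms)"):
    print(f"{key}: {percent_change(low, medium, key):+.1f}%")
```

The same helpers apply when the baseline is a report from this guide and the current run is your own hardware.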

5.2 Accuracy Benchmark

Model accuracy on standard benchmarks:

5.2.1 GSM8K Benchmark

Benchmark Command

bash

python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316

Test Results:

text

Accuracy: 0.959
Invalid: 0.000
Latency: 29.185 s
Output throughput: 4854.672 token/s