docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx
Available Models:
License: MIT
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and capabilities. SGLang supports serving GLM-5.1 on NVIDIA H100, H200, B200, GB300, and AMD MI300X/MI325X/MI355X GPUs.
import { GLM51Deployment } from '/src/snippets/autoregressive/glm-51-deployment.jsx'
<GLM51Deployment />

Notes:

- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- The ROCm commands below use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend, plus `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes, to allow time for weight loading). FP8 uses approximately half the memory of BF16 (~89 GB/GPU vs ~175 GB/GPU). EAGLE speculative decoding is not currently supported on AMD for GLM-5.1.
- On GB300, use `tp=4`; for high-throughput DP attention, use `--dp 4`.
- Add `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5.1-FP8 if you want to enable the IndexCache method (see the example after the H200 command below). This feature is supported through this PR and introduces only a small accuracy loss; however, if you are running rigorous accuracy evaluations, it is not recommended to enable it.

Deploy GLM-5.1 with the following command (FP8 on H200, all features enabled):
```bash
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path zai-org/GLM-5.1-FP8 \
--tp 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 \
--port 30000
```
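If you want to try the optional IndexCache override described in the notes above, one way (a sketch, not a required configuration) is to append the `--json-model-override-args` flag to the same H200 FP8 launch. Skip it if you are running rigorous accuracy evaluations:

```bash
# Same H200 FP8 launch as above, with the optional IndexCache override appended (small accuracy loss).
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
  --host 0.0.0.0 \
  --port 30000
```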
The following ROCm commands are additional options for AMD GPUs and do not replace the NVIDIA instructions above. For the FP8 weights (zai-org/GLM-5.1-FP8):

```bash
sglang serve \
--model-path zai-org/GLM-5.1-FP8 \
--tp 8 \
--trust-remote-code \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--mem-fraction-static 0.80 \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
```

For the BF16 weights (zai-org/GLM-5.1), use:

```bash
sglang serve \
--model-path zai-org/GLM-5.1 \
--tp 8 \
--trust-remote-code \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--mem-fraction-static 0.80 \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
```
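Once a server is running, you can sanity-check it through the OpenAI-compatible endpoint before sending real traffic (a minimal check, assuming the default `--port 30000` used above):

```bash
# Should list the served model, e.g. zai-org/GLM-5.1-FP8 (or zai-org/GLM-5.1 for the BF16 launch)
curl http://localhost:30000/v1/models
```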
For basic API usage and request examples, please refer to:
GLM-5.1 runs in Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections; the thinking process is then returned via `reasoning_content` in the streaming response.
To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time (`{"enable_thinking": false}`): the model responds directly without a thinking process.

Example 1: Thinking Mode (Default)
Thinking mode is enabled by default. The model will reason step-by-step before answering, and the thinking process is returned via reasoning_content:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Thinking mode is enabled by default, no extra parameters needed
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
1. **Understand the Goal:** The user wants to find 15% of 240, and they want the solution explained step-by-step.
2. **Identify the Core Mathematical Concept:** "Percent" means "per hundred" or "out of 100". Finding "X% of Y" translates to the mathematical operation: $(X / 100) \times Y$.
3. **Step-by-Step Breakdown:**
* *Step 1: Convert the percentage to a decimal (or fraction).* 15% means 15 out of 100, which is $15/100$ or $0.15$.
* *Step 2: Multiply the decimal by the given number.* Multiply $0.15$ by $240$.
* *Step 3: Perform the calculation.*
* $0.15 \times 240$
* I can break this down further to make it easy to follow:
* $0.10 \times 240 = 24$ (which is 10%)
* $0.05 \times 240 = 12$ (which is 5%, half of 10%)
* $24 + 12 = 36$
* Alternatively, standard multiplication:
* $240 \times 15 = 3600$
* Move decimal two places left -> $36$
* *Step 4: State the final answer clearly.*
4. **Draft the Response (incorporating the steps clearly):**
* *Introduction:* State the problem clearly.
* *Step 1:* Explain how to convert 15% to a decimal.
* *Step 2:* Explain the multiplication step.
* *Step 3:* Show the actual math (I'll provide the standard multiplication and the "mental math" trick as it adds value).
* *Conclusion:* Give the final answer.
5. **Refine the Output (Self-Correction/Polishing during drafting):**
* *Drafting Step 1:* To find 15% of 240, first convert 15% into a decimal. Since percent means "per hundred," you divide 15 by 100. 15 ÷ 100 = 0.15.
* *Drafting Step 2:* Next, multiply this decimal by the number you are finding the percentage of (which is 240). So, calculate 0.15 × 240.
* *Drafting Step 3 (Standard way):* 0.15 × 240 = 36.
* *Adding the alternative mental math way:* It's often helpful to break it down into 10% and 5%.
* 10% of 240 = 24 (move the decimal point one place to the left)
* 5% is half of 10%, so half of 24 = 12
* Add them together: 24 + 12 = 36.
* *Final Answer:* 15% of 240 is 36.
6. **Final Review against User Prompt:** Does it solve the problem? Yes. Is it step-by-step? Yes. Is it clear? Yes. (Proceed to generate output).
=============== Content =================
Here is the step-by-step solution to find 15% of 240:
**Step 1: Convert the percentage to a decimal.**
To convert a percentage to a decimal, divide it by 100 (or simply move the decimal point two places to the left).
* 15% = 15 ÷ 100 = **0.15**
**Step 2: Multiply the decimal by the number.**
Now, multiply the decimal (0.15) by the number you are finding the percentage of (240).
* 0.15 × 240 = **36**
*(Alternative mental math method for Step 2)*:
If you don't want to multiply by 0.15 directly, you can break 15% down into 10% and 5%:
* **10% of 240** = 24 (just move the decimal point one place to the left)
* **5% of 240** = 12 (5% is half of 10%, so just divide 24 by 2)
* **Add them together**: 24 + 12 = **36**
**Answer:**
15% of 240 is **36**.
```
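The streaming loop above shows the thinking process arriving token by token. For completeness, a non-streaming variant is sketched below; it reuses the `client` from the example above and reads `reasoning_content` defensively via `getattr`, since that field is added by the reasoning parser and is not part of the standard OpenAI schema:

```python
# Non-streaming variant: the full reasoning and answer arrive in one response.
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
)

message = response.choices[0].message
# reasoning_content is populated only when the reasoning parser is enabled at deployment
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```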
Example 2: Instruct Mode (Thinking Off)
To disable thinking and get a direct response, pass {"enable_thinking": false} via chat_template_kwargs:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Disable thinking mode via chat_template_kwargs
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    max_tokens=2048,
    stream=True
)

# In Instruct mode, the model responds directly without reasoning_content
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
15% of 240 is 36.
Here is how to calculate it:
1. Convert the percentage to a decimal: 15% = 0.15
2. Multiply the decimal by the number: 0.15 × 240 = 36
```
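The `extra_body` argument is an OpenAI SDK convenience: in the raw HTTP request, `chat_template_kwargs` is simply an additional field in the JSON body. A minimal curl sketch of the same Instruct-mode request (assuming the server from the deployment section on `localhost:30000`):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.1-FP8",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}],
    "chat_template_kwargs": {"enable_thinking": false},
    "max_tokens": 2048
  }'
```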
GLM-5.1 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}.
Python Example (with Thinking Process):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"Tool Call: {tool_call.function.name}")
                    print(f" Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
The user wants to know the weather in Beijing. I'll call the get_weather function with "Beijing" as the location.
=============== Content =================
Tool Call: get_weather
Arguments:
Tool Call: None
Arguments: {
Tool Call: None
Arguments: "location": "Be
Tool Call: None
Arguments: ijing"
Tool Call: None
Arguments: }
```
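The stream above only shows the model emitting a tool call; the arguments arrive in fragments that you accumulate and parse yourself. To complete the loop, execute the tool and send its result back as a `tool` message so the model can produce a final answer. The sketch below uses a non-streaming request for brevity, reuses `client` and `tools` from the example above, and stubs the weather lookup with a hypothetical hard-coded result:

```python
import json

# 1. Ask the model; with the tool call parser enabled, tool_calls is populated on the message.
first = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)
tool_call = first.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# 2. Run the tool yourself (hypothetical stub; replace with a real weather lookup).
tool_result = {"location": args["location"], "temperature": 22, "unit": args.get("unit", "celsius")}

# 3. Send the assistant's tool call and the tool result back for the final answer.
final = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"},
        {
            "role": "assistant",
            "content": first.choices[0].message.content or "",
            "tool_calls": [{
                "id": tool_call.id,
                "type": "function",
                "function": {"name": tool_call.function.name, "arguments": tool_call.function.arguments},
            }],
        },
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
    ],
    tools=tools,
)
print(final.choices[0].message.content)
```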
Test Environment:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--model zai-org/GLM-5.1-FP8 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
```

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 35.78
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4213
Request throughput (req/s): 0.28
Input token throughput (tok/s): 170.54
Output token throughput (tok/s): 117.96
Peak output token throughput (tok/s): 148.00
Peak concurrent requests: 2
Total token throughput (tok/s): 288.50
Concurrency: 1.00
Accept length: 3.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3576.31
Median E2E Latency (ms): 2935.97
P90 E2E Latency (ms): 5908.97
P99 E2E Latency (ms): 8588.08
---------------Time to First Token----------------
Mean TTFT (ms): 290.88
Median TTFT (ms): 282.34
P99 TTFT (ms): 332.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.54
Median TPOT (ms): 6.97
P99 TPOT (ms): 9.04
---------------Inter-Token Latency----------------
Mean ITL (ms): 7.80
Median ITL (ms): 6.81
P95 ITL (ms): 13.51
P99 ITL (ms): 26.99
Max ITL (ms): 29.50
==================================================
```

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--model zai-org/GLM-5.1-FP8 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 1000 \
--max-concurrency 100 \
--request-rate inf
```

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 411.74
Total input tokens: 502493
Total input text tokens: 502493
Total generated tokens: 500251
Total generated tokens (retokenized): 499614
Request throughput (req/s): 2.43
Input token throughput (tok/s): 1220.41
Output token throughput (tok/s): 1214.97
Peak output token throughput (tok/s): 2648.00
Peak concurrent requests: 105
Total token throughput (tok/s): 2435.38
Concurrency: 96.30
Accept length: 3.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39648.76
Median E2E Latency (ms): 39058.12
P90 E2E Latency (ms): 57009.82
P99 E2E Latency (ms): 68880.33
---------------Time to First Token----------------
Mean TTFT (ms): 20613.80
Median TTFT (ms): 21429.21
P99 TTFT (ms): 29543.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.73
Median TPOT (ms): 36.52
P99 TPOT (ms): 67.09
---------------Inter-Token Latency----------------
Mean ITL (ms): 38.13
Median ITL (ms): 16.57
P95 ITL (ms): 86.01
P99 ITL (ms): 164.88
Max ITL (ms): 1307.02
==================================================
```

```bash
python3 benchmark/gsm8k/bench_sglang.py --port 30000
```

```
Accuracy: 0.955
Invalid: 0.000
Latency: 32.470 s
Output throughput: 642.044 token/s
```

```bash
python3 benchmark/mmlu/bench_sglang.py --port 30000
```

```
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.640
subject: college_computer_science, #q:100, acc: 0.900
subject: college_mathematics, #q:100, acc: 0.810
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.912
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.897
subject: elementary_mathematics, #q:378, acc: 0.937
subject: formal_logic, #q:126, acc: 0.778
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.961
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.923
subject: high_school_mathematics, #q:270, acc: 0.696
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.821
subject: high_school_psychology, #q:545, acc: 0.956
subject: high_school_statistics, #q:216, acc: 0.889
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.908
subject: international_law, #q:121, acc: 0.934
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.933
subject: machine_learning, #q:112, acc: 0.830
subject: management, #q:103, acc: 0.942
subject: marketing, #q:234, acc: 0.940
subject: medical_genetics, #q:100, acc: 0.990
subject: miscellaneous, #q:783, acc: 0.959
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.837
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.929
subject: professional_accounting, #q:282, acc: 0.844
subject: professional_law, #q:1534, acc: 0.714
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.791
subject: security_studies, #q:245, acc: 0.878
subject: sociology, #q:201, acc: 0.940
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.596
subject: world_religions, #q:171, acc: 0.936
Total latency: 165.275
Average accuracy: 0.877
```

GSM8K on AMD (tp=8, TileLang NSA backends):

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200
```

```
Accuracy: 0.970
Invalid: 0.000
```

Results from AMD nightly CI. See also sglang#18911.