docs_new/cookbook/autoregressive/GLM/GLM-5.1.mdx
Available Models:
License: MIT
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, and capabilities. SGLang supports serving GLM-5.1 on NVIDIA H100, H200, B200, GB300, and AMD MI300X/MI325X/MI355X GPUs.
import { GLM51Deployment } from '/src/snippets/autoregressive/glm-51-deployment.jsx'
<GLM51Deployment />

Notes:

- The `--mem-fraction-static` flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- The ROCm commands below use `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` for the NSA attention backend, plus `--chunked-prefill-size 131072` and `--watchdog-timeout 1200` (20 minutes, to allow time for weight loading). FP8 uses approximately half the memory of BF16 (~89 GB/GPU vs ~175 GB/GPU). EAGLE speculative decoding is not currently supported on AMD for GLM-5.1.
- On GB300, use `tp=4`; for high-throughput DP attention, use `--dp 4`.
- Add `--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'` for GLM-5.1-FP8 if you want to enable the IndexCache method (see the example after the H200 command below). This feature is supported through this PR and introduces only a small accuracy loss; however, if you are running rigorous accuracy evaluations, it is not recommended to enable it.

Deploy GLM-5.1 with the following command (FP8 on H200, all features enabled):
```bash
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path zai-org/GLM-5.1-FP8 \
--tp 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 \
--port 30000
```
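If you want to try the optional IndexCache override described in the notes above, one way (a sketch, not a required configuration) is to append the `--json-model-override-args` flag to the same H200 FP8 launch. Skip it if you are running rigorous accuracy evaluations:

```bash
# Same H200 FP8 launch as above, with the optional IndexCache override appended (small accuracy loss).
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
  --host 0.0.0.0 \
  --port 30000
```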
The following ROCm commands are additional options for AMD GPUs and do not replace the NVIDIA instructions above. For the FP8 weights (zai-org/GLM-5.1-FP8):

```bash
sglang serve \
--model-path zai-org/GLM-5.1-FP8 \
--tp 8 \
--trust-remote-code \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--mem-fraction-static 0.80 \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
```

For the BF16 weights (zai-org/GLM-5.1), use:

```bash
sglang serve \
--model-path zai-org/GLM-5.1 \
--tp 8 \
--trust-remote-code \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--mem-fraction-static 0.80 \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
```
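Once a server is running, you can sanity-check it through the OpenAI-compatible endpoint before sending real traffic (a minimal check, assuming the default `--port 30000` used above):

```bash
# Should list the served model, e.g. zai-org/GLM-5.1-FP8 (or zai-org/GLM-5.1 for the BF16 launch)
curl http://localhost:30000/v1/models
```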
For basic API usage and request examples, please refer to:
GLM-5.1 runs in Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections; the thinking process is then returned via `reasoning_content` in the streaming response.
To disable thinking and use Instruct mode, pass `chat_template_kwargs` at request time (`{"enable_thinking": false}`): the model responds directly without a thinking process.

Example 1: Thinking Mode (Default)
Thinking mode is enabled by default. The model will reason step-by-step before answering, and the thinking process is returned via reasoning_content:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Thinking mode is enabled by default, no extra parameters needed
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
1. **Understand the Goal:** The user wants to find 15% of 240, and they want the solution explained step-by-step.
2. **Identify the Core Mathematical Concept:** "Percent" means "per hundred" or "out of 100". Finding "X% of Y" translates to the mathematical operation: $(X / 100) \times Y$.
3. **Step-by-Step Breakdown:**
* *Step 1: Convert the percentage to a decimal (or fraction).* 15% means 15 out of 100, which is $15/100$ or $0.15$.
* *Step 2: Multiply the decimal by the given number.* Multiply $0.15$ by $240$.
* *Step 3: Perform the calculation.*
* $0.15 \times 240$
* I can break this down further to make it easy to follow:
* $0.10 \times 240 = 24$ (which is 10%)
* $0.05 \times 240 = 12$ (which is 5%, half of 10%)
* $24 + 12 = 36$
* Alternatively, standard multiplication:
* $240 \times 15 = 3600$
* Move decimal two places left -> $36$
* *Step 4: State the final answer clearly.*
4. **Draft the Response (incorporating the steps clearly):**
* *Introduction:* State the problem clearly.
* *Step 1:* Explain how to convert 15% to a decimal.
* *Step 2:* Explain the multiplication step.
* *Step 3:* Show the actual math (I'll provide the standard multiplication and the "mental math" trick as it adds value).
* *Conclusion:* Give the final answer.
5. **Refine the Output (Self-Correction/Polishing during drafting):**
* *Drafting Step 1:* To find 15% of 240, first convert 15% into a decimal. Since percent means "per hundred," you divide 15 by 100. 15 ÷ 100 = 0.15.
* *Drafting Step 2:* Next, multiply this decimal by the number you are finding the percentage of (which is 240). So, calculate 0.15 × 240.
* *Drafting Step 3 (Standard way):* 0.15 × 240 = 36.
* *Adding the alternative mental math way:* It's often helpful to break it down into 10% and 5%.
* 10% of 240 = 24 (move the decimal point one place to the left)
* 5% is half of 10%, so half of 24 = 12
* Add them together: 24 + 12 = 36.
* *Final Answer:* 15% of 240 is 36.
6. **Final Review against User Prompt:** Does it solve the problem? Yes. Is it step-by-step? Yes. Is it clear? Yes. (Proceed to generate output).
=============== Content =================
Here is the step-by-step solution to find 15% of 240:
**Step 1: Convert the percentage to a decimal.**
To convert a percentage to a decimal, divide it by 100 (or simply move the decimal point two places to the left).
* 15% = 15 ÷ 100 = **0.15**
**Step 2: Multiply the decimal by the number.**
Now, multiply the decimal (0.15) by the number you are finding the percentage of (240).
* 0.15 × 240 = **36**
*(Alternative mental math method for Step 2)*:
If you don't want to multiply by 0.15 directly, you can break 15% down into 10% and 5%:
* **10% of 240** = 24 (just move the decimal point one place to the left)
* **5% of 240** = 12 (5% is half of 10%, so just divide 24 by 2)
* **Add them together**: 24 + 12 = **36**
**Answer:**
15% of 240 is **36**.
```
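The streaming loop above shows the thinking process arriving token by token. For completeness, a non-streaming variant is sketched below; it reuses the `client` from the example above and reads `reasoning_content` defensively via `getattr`, since that field is added by the reasoning parser and is not part of the standard OpenAI schema:

```python
# Non-streaming variant: the full reasoning and answer arrive in one response.
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
)

message = response.choices[0].message
# reasoning_content is populated only when the reasoning parser is enabled at deployment
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```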
Example 2: Instruct Mode (Thinking Off)
To disable thinking and get a direct response, pass {"enable_thinking": false} via chat_template_kwargs:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Disable thinking mode via chat_template_kwargs
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    max_tokens=2048,
    stream=True
)

# In Instruct mode, the model responds directly without reasoning_content
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
15% of 240 is 36.
Here is how to calculate it:
1. Convert the percentage to a decimal: 15% = 0.15
2. Multiply the decimal by the number: 0.15 × 240 = 36
```
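The `extra_body` argument is an OpenAI SDK convenience: in the raw HTTP request, `chat_template_kwargs` is simply an additional field in the JSON body. A minimal curl sketch of the same Instruct-mode request (assuming the server from the deployment section on `localhost:30000`):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.1-FP8",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}],
    "chat_template_kwargs": {"enable_thinking": false},
    "max_tokens": 2048
  }'
```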
GLM-5.1 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}.
Python Example (with Thinking Process):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"Tool Call: {tool_call.function.name}")
                    print(f" Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```
=============== Thinking =================
The user wants to know the weather in Beijing. I'll call the get_weather function with "Beijing" as the location.
=============== Content =================
Tool Call: get_weather
Arguments:
Tool Call: None
Arguments: {
Tool Call: None
Arguments: "location": "Be
Tool Call: None
Arguments: ijing"
Tool Call: None
Arguments: }
```
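The stream above only shows the model emitting a tool call; the arguments arrive in fragments that you accumulate and parse yourself. To complete the loop, execute the tool and send its result back as a `tool` message so the model can produce a final answer. The sketch below uses a non-streaming request for brevity, reuses `client` and `tools` from the example above, and stubs the weather lookup with a hypothetical hard-coded result:

```python
import json

# 1. Ask the model; with the tool call parser enabled, tool_calls is populated on the message.
first = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)
tool_call = first.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# 2. Run the tool yourself (hypothetical stub; replace with a real weather lookup).
tool_result = {"location": args["location"], "temperature": 22, "unit": args.get("unit", "celsius")}

# 3. Send the assistant's tool call and the tool result back for the final answer.
final = client.chat.completions.create(
    model="zai-org/GLM-5.1-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"},
        {
            "role": "assistant",
            "content": first.choices[0].message.content or "",
            "tool_calls": [{
                "id": tool_call.id,
                "type": "function",
                "function": {"name": tool_call.function.name, "arguments": tool_call.function.arguments},
            }],
        },
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
    ],
    tools=tools,
)
print(final.choices[0].message.content)
```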
Test Environment:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--model zai-org/GLM-5.1-FP8 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
```

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 35.78
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4213
Request throughput (req/s): 0.28
Input token throughput (tok/s): 170.54
Output token throughput (tok/s): 117.96
Peak output token throughput (tok/s): 148.00
Peak concurrent requests: 2
Total token throughput (tok/s): 288.50
Concurrency: 1.00
Accept length: 3.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3576.31
Median E2E Latency (ms): 2935.97
P90 E2E Latency (ms): 5908.97
P99 E2E Latency (ms): 8588.08
---------------Time to First Token----------------
Mean TTFT (ms): 290.88
Median TTFT (ms): 282.34
P99 TTFT (ms): 332.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.54
Median TPOT (ms): 6.97
P99 TPOT (ms): 9.04
---------------Inter-Token Latency----------------
Mean ITL (ms): 7.80
Median ITL (ms): 6.81
P95 ITL (ms): 13.51
P99 ITL (ms): 26.99
Max ITL (ms): 29.50
==================================================
```

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--model zai-org/GLM-5.1-FP8 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 1000 \
--max-concurrency 100 \
--request-rate inf
```

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 411.74
Total input tokens: 502493
Total input text tokens: 502493
Total generated tokens: 500251
Total generated tokens (retokenized): 499614
Request throughput (req/s): 2.43
Input token throughput (tok/s): 1220.41
Output token throughput (tok/s): 1214.97
Peak output token throughput (tok/s): 2648.00
Peak concurrent requests: 105
Total token throughput (tok/s): 2435.38
Concurrency: 96.30
Accept length: 3.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39648.76
Median E2E Latency (ms): 39058.12
P90 E2E Latency (ms): 57009.82
P99 E2E Latency (ms): 68880.33
---------------Time to First Token----------------
Mean TTFT (ms): 20613.80
Median TTFT (ms): 21429.21
P99 TTFT (ms): 29543.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.73
Median TPOT (ms): 36.52
P99 TPOT (ms): 67.09
---------------Inter-Token Latency----------------
Mean ITL (ms): 38.13
Median ITL (ms): 16.57
P95 ITL (ms): 86.01
P99 ITL (ms): 164.88
Max ITL (ms): 1307.02
==================================================
```

```bash
python3 benchmark/gsm8k/bench_sglang.py --port 30000
```

```
Accuracy: 0.955
Invalid: 0.000
Latency: 32.470 s
Output throughput: 642.044 token/s
```

```bash
python3 benchmark/mmlu/bench_sglang.py --port 30000
```

```
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.640
subject: college_computer_science, #q:100, acc: 0.900
subject: college_mathematics, #q:100, acc: 0.810
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.912
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.897
subject: elementary_mathematics, #q:378, acc: 0.937
subject: formal_logic, #q:126, acc: 0.778
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.961
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.923
subject: high_school_mathematics, #q:270, acc: 0.696
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.821
subject: high_school_psychology, #q:545, acc: 0.956
subject: high_school_statistics, #q:216, acc: 0.889
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.908
subject: international_law, #q:121, acc: 0.934
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.933
subject: machine_learning, #q:112, acc: 0.830
subject: management, #q:103, acc: 0.942
subject: marketing, #q:234, acc: 0.940
subject: medical_genetics, #q:100, acc: 0.990
subject: miscellaneous, #q:783, acc: 0.959
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.837
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.929
subject: professional_accounting, #q:282, acc: 0.844
subject: professional_law, #q:1534, acc: 0.714
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.791
subject: security_studies, #q:245, acc: 0.878
subject: sociology, #q:201, acc: 0.940
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.596
subject: world_religions, #q:171, acc: 0.936
Total latency: 165.275
Average accuracy: 0.877
```

GSM8K on AMD (tp=8, TileLang NSA backends):

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200
```

```
Accuracy: 0.970
Invalid: 0.000
```

Results from AMD nightly CI. See also sglang#18911.