For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
<Tabs> <Tab title="Python (pip / uv)">pip install --upgrade pip
pip install uv
uv pip install sglang
Then run the Python output of the command panel below in that environment.
</Tab> <Tab title="Docker">docker pull lmsysorg/sglang:latest
For how to launch the image, see Install → Method 3: Using Docker. Substitute the inner sglang serve ... with what the command generator below produces.
</Tab> </Tabs>
Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:
import { Deployment } from "/src/snippets/_deployment.jsx";
import { config } from "/src/snippets/configs/zai-org/glm-5.2.jsx";
import { benchmarks } from "/src/snippets/configs/zai-org/glm-5.2-benchmarks.jsx";

<Deployment config={config} benchmarks={benchmarks} />

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.
import { Playground } from "/src/snippets/_playground.jsx";
<Playground config={config} />

GLM-5.2 is Z.ai's flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions, FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2), both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup.
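To make the "near-constant attention cost" claim concrete, here is a minimal back-of-the-envelope sketch. It assumes each query attends to at most the top-2048 keys picked by the indexer and ignores the indexer's own pass over the context, which the overview does not quantify. `attended_keys` and `dense_to_sparse_ratio` are hypothetical helper names, not SGLang APIs.

```python
TOP_K = 2048  # sparse key budget per query, taken from the overview above


def attended_keys(context_len: int, top_k: int = TOP_K) -> int:
    """Number of keys one query attends to under top-k sparse selection."""
    if context_len < 0:
        raise ValueError("context_len must be non-negative")
    return min(context_len, top_k)


def dense_to_sparse_ratio(context_len: int, top_k: int = TOP_K) -> float:
    """How many times more keys dense attention would touch per query.

    Assumption: only the attention step is counted; the indexer's own
    cost is left out of this estimate.
    """
    sparse = attended_keys(context_len, top_k)
    return context_len / sparse if sparse else 1.0


if __name__ == "__main__":
    for n in (1_024, 8_192, 65_536, 1_048_576):
        print(n, attended_keys(n), dense_to_sparse_ratio(n))
```

Under these assumptions the per-query key count stops growing once the context exceeds 2048 tokens, while the dense equivalent keeps scaling with context length.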
Recommended generation: temperature=1.0, top_p=0.95 (the checkpoint's generation_config.json defaults; informational — do not hardcode in client code).
Resources: GLM-5.2-FP8 · GLM-5.2 (BF16).
glm_moe_dsa architecture; SGLang auto-selects the DSA attention backends (flashmla_sparse prefill, fa3 decode, sgl-kernel indexer topk). No attention-backend flag is needed on the supported hardware. SGLang also auto-selects the KV-cache dtype for DSA models, fp8_e4m3 on Blackwell (B200/GB300/B300, which then routes DSA through the TensorRT-LLM backend) and bf16 on Hopper (H200), so no --kv-cache-dtype flag is required.

--speculative-algorithm EAGLE --speculative-num-steps 5 --speculative-eagle-topk 1 --speculative-num-draft-tokens 6 for low-latency; 1-1-2 for balanced). The config's index_share_for_mtp_iteration reuses the DSA indexer's topk across draft steps (effective only at --speculative-eagle-topk 1). Tune the draft length to the accept length. GLM-5.2's MTP head is strong: accept length runs high (4+ in many workloads, near-saturating at 5–6 in low-latency runs). Watch the server's reported accept length and adjust --speculative-num-steps / --speculative-num-draft-tokens accordingly: while accept length stays close to the draft-token count there is headroom to push them higher (more accepted tokens per step); if it falls well below, lower them, because every rejected draft token is wasted verification compute.

--attn-cp-size ranks. On Hopper (H200) this gives a large prefill-latency win at long context: for example, round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same --mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only: the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/B300/GB300.

--mem-fraction-static 0.8 on H200 (TP8) and tune up; raise it for the 4-GPU GB300 single-node layout (TP4).

zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM), which is verified; on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8), and those multi-node BF16 recipes are still proposed/inferred (verified: false). FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8. On B300, BF16 low-latency matches FP8 (the sm103 FP8 path is not yet optimized), but FP8 wins at the balanced/high-throughput points.

--chunked-prefill-size 2048 is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to --chunked-prefill-size 32768 on the balanced recipe gave roughly +34–78% output throughput and −39–59% TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput (decode-bound there), so keep the default. --max-running-requests tracks KV capacity, not a tuning free-for-all: ~60–90 concurrent 8K+1K FP8 requests fit on a single 8-GPU node, so pin balanced near --max-running-requests 80 and let high-throughput run wider.

GLM-5.2 is a hybrid-reasoning model. Enable the glm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer: thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off with chat_template_kwargs: {"enable_thinking": False} (the template variable is enable_thinking, not thinking).
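As a minimal sketch of the thinking toggle described above, the helpers below build the `extra_body` payload for an OpenAI-compatible client and split a response message into its two fields. `thinking_extra_body` and `split_message` are hypothetical names, not part of SGLang or the OpenAI SDK, and the `SimpleNamespace` object is only a stand-in for `resp.choices[0].message`.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def thinking_extra_body(enable_thinking: bool) -> dict:
    """Build the extra_body dict that sets the enable_thinking template variable."""
    return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}


def split_message(message: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from a chat message.

    reasoning_content is read defensively with getattr, so a message
    without that field yields None instead of raising.
    """
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


if __name__ == "__main__":
    # Stand-in message that carries only an answer (assumption for the demo).
    fake_message = SimpleNamespace(content="36")
    print(thinking_extra_body(False))
    print(split_message(fake_message))
```

With a running server, the payload would be passed as `extra_body=thinking_extra_body(False)` to `client.chat.completions.create(...)`.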
Reasoning effort. Pass chat_template_kwargs: {"reasoning_effort": ...} to inject a Reasoning Effort: <level> system line (only while thinking is on). The template wires only two effective levels — Max and High — and if you don't pass reasoning_effort at all you get Max, the highest. "high" is the only value that lowers effort; every other value (including "low" and "medium") falls through to Max:
| reasoning_effort | Injected system line | Effect |
|---|---|---|
| (not passed / unset) | Reasoning Effort: Max | default, highest reasoning |
| "high" | Reasoning Effort: High | dials reasoning down |
| "low", "medium", any other value | Reasoning Effort: Max | falls through to Max (not a distinct level) |
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "reasoning_effort": "high"}},
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)
Reasoning: 1. **Identify the core question:** The user wants to find 15% of 240.
2. **Convert the percentage to a decimal:** 15% = 0.15
3. **Multiply by the total:** 0.15 * 240 = 36
(Quick mental math: 10% of 240 = 24; 5% = 12; 24 + 12 = 36.)
Answer: 15% of 240 is **36**.
Here is how you can calculate it:
0.15 × 240 = 36
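The fall-through rule in the reasoning-effort table can be modeled in a few lines. This is a sketch of the behavior as documented here, not the chat template itself: `injected_effort_line` is a hypothetical name, and the exact lowercase match on `"high"` is an assumption about how the template compares values.

```python
from typing import Optional


def injected_effort_line(reasoning_effort: Optional[str],
                         enable_thinking: bool = True) -> Optional[str]:
    """Model which 'Reasoning Effort' system line would be injected.

    Returns None when thinking is off, because the line is only
    injected while thinking is on.
    """
    if not enable_thinking:
        return None
    # Assumption: only the exact string "high" selects the lower level;
    # unset and every other value fall through to Max.
    level = "High" if reasoning_effort == "high" else "Max"
    return f"Reasoning Effort: {level}"


if __name__ == "__main__":
    for value in (None, "high", "low", "medium"):
        print(repr(value), "->", injected_effort_line(value))
```

A client-side check like this can catch a request that passes `"low"` expecting less reasoning, since that value lands on Max.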
Enable the glm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser; the older glm45 parser does not parse it (the call would be left as raw text in content). In thinking mode the turn also fills reasoning_content, so print both fields.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Tool calls:", msg.tool_calls)
Reasoning: The user wants to know the weather in Paris. I'll call the get_weather function with "Paris" as the city.
Tool calls: [
  {
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
  }
]
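Once `message.tool_calls` is populated, the client still has to run the tool and send the result back. The sketch below shows one way to do that for calls shaped like the output above, using plain dicts. `get_weather` here is a hypothetical local stub returning canned data, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, not SGLang or OpenAI SDK APIs.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> dict:
    """Hypothetical local tool; returns canned data for illustration only."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Maps tool names (as declared in `tools`) to local Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: List[dict]) -> List[dict]:
    """Execute each call and return role="tool" messages for the next turn."""
    messages = []
    for call in tool_calls:
        fn = call["function"]
        handler = TOOL_REGISTRY.get(fn["name"])
        if handler is None:
            result = {"error": f"unknown tool: {fn['name']}"}
        else:
            # Arguments arrive as a JSON-encoded string, not a dict.
            args = json.loads(fn["arguments"] or "{}")
            result = handler(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


example_calls = [{
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}]
tool_messages = run_tool_calls(example_calls)
print(tool_messages)
```

The returned messages would be appended to the conversation after the assistant turn that carried the tool calls, followed by another `chat.completions.create` request. With the OpenAI Python SDK the entries in `msg.tool_calls` are objects rather than dicts, so they would need converting first (for example with `model_dump()`, assuming a recent SDK version).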
For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). This is useful given GLM-5.2's 1M-token window; pair --hicache-ratio with a write policy that matches your reuse pattern.
GLM-5.2's strong reasoning + tool-calling makes it a good backend for Claude Code, Anthropic's agentic CLI. SGLang exposes the Anthropic-compatible /v1/messages endpoint on every server, so Claude Code can talk to a GLM-5.2 server with only environment variables — no code change. Launch the server with --reasoning-parser glm45 --tool-call-parser glm47 (any recipe from the Deployment panel above works), then:
export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy"
export API_TIMEOUT_MS="3000000"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
claude
Two of these matter specifically for GLM-5.2:
- CLAUDE_CODE_ATTRIBUTION_HEADER=0: Claude Code prepends a per-request attribution block to the system prompt. GLM-5.2's chat template renders tools before system, so that per-request hash is the first token to diverge between turns and the radix prefix cache re-prefills the whole system + history every turn. This env removes the block and restores prefix-cache reuse.
- glm-5.2[1m] as the model name: the [1m] suffix is the client-side hint that enables Claude Code's 1M-context beta, matching GLM-5.2's 1,048,576-token window. Without it, context is capped well below 1M. SGLang does not validate the model field, so any name is accepted server-side.

For the full setup (streaming, tool-use, count_tokens, persisting env in ~/.claude/settings.json, troubleshooting), see Anthropic-Compatible API.
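If Claude Code is launched from a script rather than an interactive shell, the same variables can be assembled in Python. This is a minimal sketch that mirrors the exports listed above; `claude_code_env` is a hypothetical helper name, and the commented-out `subprocess.run` line assumes the `claude` binary is on `PATH`.

```python
import os
from typing import Dict, Mapping, Optional


def claude_code_env(base_url: str = "http://127.0.0.1:30000",
                    model: str = "glm-5.2[1m]",
                    base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return an environment dict carrying the variables listed above."""
    # Start from the current environment unless a base is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_AUTH_TOKEN": "dummy",
        "API_TIMEOUT_MS": "3000000",
        "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        # Removes the per-request attribution block so the prefix cache is reused.
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    })
    # Point every model tier at the same server-side model name.
    for tier in ("HAIKU", "SONNET", "OPUS"):
        env[f"ANTHROPIC_DEFAULT_{tier}_MODEL"] = model
    return env


if __name__ == "__main__":
    env = claude_code_env()
    print({k: v for k, v in env.items() if k.startswith(("ANTHROPIC_", "CLAUDE_CODE_"))})
    # import subprocess
    # subprocess.run(["claude"], env=env, check=True)
```

Keeping the model name in one argument makes it harder to drop the `[1m]` suffix on one tier by accident.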