GLM-5.2

Deployment

<a id="install" /> <Accordion title="Install SGLang">

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.

<Tabs> <Tab title="Python (pip / uv)">
```bash
pip install --upgrade pip
pip install uv
uv pip install sglang
```

Then run the Python output of the command panel below in that environment.

</Tab> <Tab title="Docker">
```bash
docker pull lmsysorg/sglang:latest
```

To launch the image, see Install → Method 3: Using Docker, and replace the inner sglang serve ... command with the one the command generator below produces.

</Tab> </Tabs> </Accordion>

Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:

  • Low-Latency — fastest reply for a single user. Pick for chat.
  • Balanced — good speed with several users at once. Use for typical multi-user serving.
  • High-Throughput — most tokens per second across many users. Best for batch jobs.

import { Deployment } from "/src/snippets/_deployment.jsx";
import { config } from "/src/snippets/configs/zai-org/glm-5.2.jsx";
import { benchmarks } from "/src/snippets/configs/zai-org/glm-5.2-benchmarks.jsx";

<Deployment config={config} benchmarks={benchmarks} />

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.

import { Playground } from "/src/snippets/_playground.jsx";

<Playground config={config} />

1. Model Introduction

GLM-5.2 is Z.ai's flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions — FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2) — both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup.
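
As a rough illustration of why the sparse selection keeps attention cost near-constant, the sketch below compares how many keys one query attends to with and without a top-2048 cap. The function name attended_keys is hypothetical, and the arithmetic ignores the indexer's own scoring pass, so treat it as a back-of-the-envelope model rather than a description of the DSA kernels.

```python
def attended_keys(context_len: int, top_k: int = 2048) -> int:
    """Keys one query attends to once the indexer keeps at most top_k of them."""
    return min(context_len, top_k)


for context_len in (1_024, 65_536, 1_048_576):
    sparse = attended_keys(context_len)
    # Dense attention would touch all context_len keys for the same query.
    print(
        f"context={context_len:>9,}  dense={context_len:>9,}  "
        f"sparse={sparse:>5,}  reduction={context_len / sparse:.0f}x"
    )
```

Below 2,048 tokens the cap has no effect; beyond that, the per-query attention work in this model stops growing with context length.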

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Model</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Architecture</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Context</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px"}}><strong><a href="https://huggingface.co/zai-org/GLM-5.2-FP8">GLM-5.2-FP8</a></strong></td> <td style={{padding: "9px 12px"}}>MoE · DSA · 256 experts (top-8) · MTP · FP8</td> <td style={{padding: "9px 12px", textAlign: "right"}}>1,048,576</td> </tr> <tr> <td style={{padding: "9px 12px"}}><strong><a href="https://huggingface.co/zai-org/GLM-5.2">GLM-5.2</a></strong></td> <td style={{padding: "9px 12px"}}>MoE · DSA · 256 experts (top-8) · MTP · BF16</td> <td style={{padding: "9px 12px", textAlign: "right"}}>1,048,576</td> </tr> </tbody> </table>

Recommended generation: temperature=1.0, top_p=0.95 (the checkpoint's generation_config.json defaults; informational — do not hardcode in client code).

Resources: GLM-5.2-FP8 · GLM-5.2 (BF16).

2. Configuration Tips

  • DeepSeek Sparse Attention (DSA). GLM-5.2 uses the glm_moe_dsa architecture; SGLang auto-selects the DSA attention backends (flashmla_sparse prefill, fa3 decode, sgl-kernel indexer topk). No attention-backend flag is needed on the supported hardware. SGLang also auto-selects the KV-cache dtype for DSA models — fp8_e4m3 on Blackwell (B200/GB300/B300, which then routes DSA through the TensorRT-LLM backend) and bf16 on Hopper (H200) — so no --kv-cache-dtype flag is required.
  • MTP / speculative decoding. The checkpoint ships one nextn layer. Enable EAGLE MTP for lower latency: --speculative-algorithm EAGLE --speculative-num-steps 5 --speculative-eagle-topk 1 --speculative-num-draft-tokens 6 for low-latency, or steps / topk / draft tokens of 1 / 1 / 2 for balanced. The config's index_share_for_mtp_iteration reuses the DSA indexer's topk across draft steps (effective only at --speculative-eagle-topk 1). Tune the draft length to the accept length. GLM-5.2's MTP head is strong: accept length runs high (4+ in many workloads, near-saturating at 5–6 in low-latency runs). Watch the server's reported accept length and adjust --speculative-num-steps / --speculative-num-draft-tokens accordingly. While accept length stays close to the draft-token count, there is headroom to raise them (more accepted tokens per step); if it falls well below, lower them, because every rejected draft token is wasted verification compute.
  • Context Parallelism (CP) for long prefill. DSA prefill CP splits the long-prefill attention across --attn-cp-size ranks. On Hopper (H200) this gives a large prefill-latency win at long context — e.g. round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same --mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only — the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/B300/GB300.
  • Memory. The FP8 weights are large because the footprint follows the total MoE parameter count, not the parameters active per token. Start around --mem-fraction-static 0.8 on H200 (TP8) and tune upward; use a higher value for the 4-GPU GB300 single-node layout (TP4).
  • DP-Attention + DeepEP. The balanced and high-throughput strategies spread attention across data-parallel ranks and route MoE through DeepEP.
  • BF16 weights need more GPUs. The full-precision build (zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM) — verified; on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8), and those multi-node BF16 recipes are still proposed/inferred (verified: false). FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8. On B300, BF16 low-latency matches FP8 (the sm103 FP8 path is not yet optimized), but FP8 wins at the balanced/high-throughput points.
  • Chunked-prefill size is regime-dependent. At long input (8K+) the default --chunked-prefill-size 2048 is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to --chunked-prefill-size 32768 on the balanced recipe gave roughly +34–78% output throughput and −39–59% TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput, which is decode-bound, so keep the default there. Set --max-running-requests from KV capacity rather than tuning it freely: roughly 60–90 concurrent FP8 requests of 8K input + 1K output fit on a single 8-GPU node, so pin balanced near --max-running-requests 80 and let high-throughput run wider.
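
The draft-length guidance in the MTP bullet above can be sketched as a small heuristic. This is a minimal sketch under stated assumptions: the function names, the 0.85 / 0.5 thresholds, and the max_steps cap are illustrative choices rather than SGLang defaults, and it assumes --speculative-eagle-topk 1 with draft tokens = steps + 1, as in the low-latency preset quoted above.

```python
def suggest_num_steps(
    accept_length: float,
    num_steps: int,
    high: float = 0.85,   # assumed "close to the draft-token count" threshold
    low: float = 0.5,     # assumed "well below" threshold
    max_steps: int = 8,   # assumed upper bound, not an SGLang limit
) -> int:
    """Suggest a new --speculative-num-steps from the observed accept length."""
    draft_tokens = num_steps + 1  # holds for topk 1 in this sketch
    ratio = accept_length / draft_tokens
    if ratio >= high:
        return min(num_steps + 1, max_steps)  # headroom: try a longer draft
    if ratio <= low:
        return max(num_steps - 1, 1)          # many rejections: shorten the draft
    return num_steps


def speculative_flags(num_steps: int) -> list[str]:
    """Build the EAGLE flags named in this section for a given step count."""
    return [
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", str(num_steps + 1),
    ]


print(" ".join(speculative_flags(suggest_num_steps(5.6, 5))))
```

The heuristic moves one step at a time so that each change can be checked against the accept length the server reports afterwards.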

3. Advanced Usage

3.1 Reasoning

GLM-5.2 is a hybrid-reasoning model. Enable the glm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer — thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off with chat_template_kwargs: {"enable_thinking": False} (the template variable is enable_thinking, not thinking).

Reasoning effort. Pass chat_template_kwargs: {"reasoning_effort": ...} to inject a Reasoning Effort: <level> system line (only while thinking is on). The template wires only two effective levels — Max and High — and if you don't pass reasoning_effort at all you get Max, the highest. "high" is the only value that lowers effort; every other value (including "low" and "medium") falls through to Max:

| reasoning_effort | Injected system line | Effect |
| --- | --- | --- |
| (not passed / unset) | Reasoning Effort: Max | default (highest reasoning) |
| "high" | Reasoning Effort: High | dials reasoning down |
| "low", "medium", any other value | Reasoning Effort: Max | falls through to Max (not a distinct level) |

<Accordion title="Reasoning Example (Python)">
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "reasoning_effort": "high"}},
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)
```
</Accordion> <Accordion title="Example Output">
```text
Reasoning: 1.  **Identify the core question:** The user wants to find 15% of 240.
2.  **Convert the percentage to a decimal:** 15% = 0.15
3.  **Multiply by the total:** 0.15 * 240 = 36
    (Quick mental math: 10% of 240 = 24; 5% = 12; 24 + 12 = 36.)

Answer: 15% of 240 is **36**.

Here is how you can calculate it:
0.15 × 240 = 36
```
</Accordion>
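
The fall-through behaviour of reasoning_effort described above can be mirrored in a few lines of Python. This is only a model of the documented behaviour, not the chat template itself; the function name injected_effort_line is hypothetical, and exact matching on the lowercase string "high" is an assumption.

```python
from typing import Optional


def injected_effort_line(
    enable_thinking: bool = True,
    reasoning_effort: Optional[str] = None,
) -> Optional[str]:
    """Return the system line the template is described as injecting, if any."""
    if not enable_thinking:
        return None  # the effort line is only injected while thinking is on
    # Only "high" selects High; unset, "low", "medium", etc. all yield Max.
    level = "High" if reasoning_effort == "high" else "Max"
    return f"Reasoning Effort: {level}"


for value in (None, "high", "low", "medium"):
    print(repr(value), "->", injected_effort_line(reasoning_effort=value))
```

The practical consequence is that a client asking for "low" or "medium" gets the most expensive setting, so "high" is the value to pass when shorter reasoning is wanted.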

3.2 Tool Calling

Enable the glm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser; the older glm45 parser does not parse it, and the call would be left as raw text in content. In thinking mode the turn also fills reasoning_content, so print both fields.

<Accordion title="Tool Calling Example (Python)">
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Tool calls:", msg.tool_calls)
```
</Accordion> <Accordion title="Example Output">
```text
Reasoning: The user wants to know the weather in Paris. I'll call the get_weather function with "Paris" as the city.

Tool calls: [
  {
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
  }
]
```
</Accordion>
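
Once message.tool_calls is populated, the client still has to run the tool and return its result in a follow-up turn. The sketch below shows one way to do that dispatch step on the dict shape printed in the example output. The names get_weather, TOOLS, and run_tool_calls are hypothetical, and the weather data is a placeholder.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical local tool; a real one would query a weather service."""
    return {"city": city, "forecast": "placeholder"}


# Registry mapping tool names (as declared in `tools`) to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each call and build the `tool` messages for the next turn."""
    messages = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[fn["name"]](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(result),
        })
    return messages


calls = [{
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}]
print(run_tool_calls(calls))
```

With the OpenAI SDK, the usual pattern is to append the assistant message and these tool messages to messages, then call client.chat.completions.create again. SDK tool-call objects would first need converting to dicts, for example with model_dump(), assuming a pydantic-based SDK version.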

3.3 HiCache (Hierarchical KV Caching)

For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). This is useful given GLM-5.2's 1M-token window; pair --hicache-ratio with a write policy that matches your reuse pattern.

3.4 Claude Code Integration

GLM-5.2's strong reasoning + tool-calling makes it a good backend for Claude Code, Anthropic's agentic CLI. SGLang exposes the Anthropic-compatible /v1/messages endpoint on every server, so Claude Code can talk to a GLM-5.2 server with only environment variables — no code change. Launch the server with --reasoning-parser glm45 --tool-call-parser glm47 (any recipe from the Deployment panel above works), then:

```bash
export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy"
export API_TIMEOUT_MS="3000000"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
claude
```

Two of these matter specifically for GLM-5.2:

  • CLAUDE_CODE_ATTRIBUTION_HEADER=0: Claude Code prepends a per-request attribution block to the system prompt. GLM-5.2's chat template renders tools before system, so that per-request hash is the first token to diverge between turns, and the radix prefix cache re-prefills the whole system + history every turn. Setting this variable to 0 removes the block and restores prefix-cache reuse.
  • glm-5.2[1m] as the model name: the [1m] suffix is the client-side hint that enables Claude Code's 1M-context beta, matching GLM-5.2's 1,048,576-token window. Without it, context is capped well below 1M. SGLang does not validate the model field, so any name is accepted server-side.
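
The effect of the attribution block on prefix-cache reuse can be illustrated with a toy model, assuming reuse is limited to the longest common prefix of two prompts. The token lists and the names shared_prefix_len and build_prompt are hypothetical stand-ins; real prompts are tokenized by the chat template, not split into words.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common prefix, i.e. the reusable part of the cache."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy "tokens"; tools are rendered before the system prompt, as described above.
TOOLS = ["<tools>", "get_weather", "</tools>"]
SYSTEM = ["You", "are", "a", "coding", "agent"]
HISTORY = ["user:", "hi", "assistant:", "hello"]


def build_prompt(attribution, extra_history=()):
    """Assemble tools + (optional per-request attribution) + system + history."""
    header = [attribution] if attribution else []
    return TOOLS + header + SYSTEM + HISTORY + list(extra_history)


next_turn = ["user:", "next"]
with_header = shared_prefix_len(
    build_prompt("hash-a"), build_prompt("hash-b", next_turn)
)
without_header = shared_prefix_len(
    build_prompt(None), build_prompt(None, next_turn)
)
print("reusable tokens with attribution block:", with_header)
print("reusable tokens without it:", without_header)
```

In this toy setup only the tool definitions are shared when the attribution hash changes per request, whereas the whole previous prompt is shared once the block is removed.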

For the full setup (streaming, tool-use, count_tokens, persisting env in ~/.claude/settings.json, troubleshooting), see Anthropic-Compatible API.
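
Before pointing Claude Code at the server, it can help to smoke-test the /v1/messages endpoint directly. The sketch below builds such a request with the standard library only. The function names are hypothetical, the body and header names follow the public Anthropic Messages API, and whether SGLang requires or checks those headers is an assumption.

```python
import json
import urllib.request


def build_messages_request(
    prompt: str,
    base_url: str = "http://127.0.0.1:30000",
    model: str = "glm-5.2",  # any name; the server does not validate it
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the Anthropic-compatible /v1/messages endpoint."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": "dummy",               # assumed to be ignored locally
            "anthropic-version": "2023-06-01",  # Anthropic API convention
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


request = build_messages_request("Say hello in one word.")
print(request.get_method(), request.full_url)
```

Calling send(request) against a running server should return a JSON object. Its exact shape is not shown here because it depends on the server.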