docs_new/cookbook/autoregressive/GLM/GLM-4.7.mdx
GLM-4.7 is the latest and most capable language model in the GLM series developed by Zhipu AI, featuring state-of-the-art reasoning, function calling, and multi-modal understanding. As the newest iteration in the series, it delivers significant improvements across all of these domains.
For more details, please refer to the official GLM-4.7 documentation.
Key Features:
Available Models:
License:
Please refer to the official GLM-4.7 model card for license details.
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements.
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities.
import { GLM47Deployment } from "/src/snippets/autoregressive/glm-47-deployment.jsx";

<GLM47Deployment />

For more detailed configuration tips, please refer to GLM-4.7 Usage.
For basic API usage and request examples, please refer to:
GLM-4.7 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections:
```bash
python -m sglang.launch_server \
  --model zai-org/GLM-4.7 \
  --reasoning-parser glm47 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 8000
```
Streaming with Thinking Process:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real time
response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print the answer content
        if delta.content:
            # Close the thinking section and add a content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```text
=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================
The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.
```
Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
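If you do not need token-by-token output, the separated fields can also be read off the final message in a non-streaming request. The following is a minimal sketch: the `reasoning_content` attribute on the message object is an assumption, mirrored from the streaming deltas above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    temperature=0.7,
    max_tokens=2048,
)

message = response.choices[0].message
# `reasoning_content` is assumed to be populated when the server is
# launched with --reasoning-parser, mirroring the streaming deltas above.
print("Thinking:", getattr(message, "reasoning_content", None))
print("Content:", message.content)
```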
GLM-4.7 supports tool calling capabilities. Enable the tool call parser:
```bash
python -m sglang.launch_server \
  --model zai-org/GLM-4.7 \
  --reasoning-parser glm47 \
  --tool-call-parser glm47 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 8000
```
Python Example (with Thinking Process):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make the request with streaming to see the thinking process
response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process the streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close the thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"Tool Call: {tool_call.function.name}")
                    print(f"  Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
```text
=============== Thinking =================
The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information.
I should call the function with location="Beijing".
=============== Content =================
Tool Call: get_weather
  Arguments: {"location": "Beijing", "unit": "celsius"}
```
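In streaming mode, a tool call's `arguments` string typically arrives in fragments across multiple chunks, so printing each delta directly (as above) can produce partial JSON. The sketch below, run against a fresh streaming response created as above, accumulates fragments by `index` before parsing; it assumes the standard OpenAI streaming delta format.

```python
import json

# Accumulate streamed tool-call fragments into complete calls, keyed by index
tool_calls = {}
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        entry = tool_calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

# Parse only once each call's argument string is complete
for call in tool_calls.values():
    print(call["name"], json.loads(call["arguments"]))
```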
Note:
Handling Tool Call Results:
```python
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call goes here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send the tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
```
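Putting the pieces together, a minimal end-to-end loop executes whatever tool the model requests and feeds the result back. This is a sketch under the assumptions above (the single `get_weather` tool and one round of tool calls), not a general-purpose agent:

```python
import json

messages = [{"role": "user", "content": "What's the weather in Beijing?"}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=messages,
    tools=tools,
    temperature=0.7,
)

message = response.choices[0].message
if message.tool_calls:
    # Echo the assistant turn, including its tool calls, back into history
    messages.append({
        "role": "assistant",
        "content": message.content,
        "tool_calls": [
            {
                "id": tc.id,
                "type": "function",
                "function": {
                    "name": tc.function.name,
                    "arguments": tc.function.arguments,
                },
            }
            for tc in message.tool_calls
        ],
    })
    # Execute each requested call and append its result
    for tc in message.tool_calls:
        args = json.loads(tc.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": get_weather(**args),  # only one tool in this example
        })
    final = client.chat.completions.create(
        model="zai-org/GLM-4.7",
        messages=messages,
        temperature=0.7,
    )
    print(final.choices[0].message.content)
```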
This section uses industry-standard configurations for comparable benchmark results.
Test Environment:
Benchmark Methodology:
All runs use the same standard configurations so that results can be compared directly across frameworks and hardware platforms.
Three core scenarios reflect real-world usage patterns:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Scenario</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Input Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Output Length</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Use Case</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Chat**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Most common conversational AI workload</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Reasoning**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>8K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Long-form generation, complex reasoning tasks</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>**Summarization**</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>8K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1K</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Document summarization, RAG retrieval</td> </tr> </tbody> </table>

Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (the Pareto frontier):
- `--max-concurrency 1` (latency-optimized)
- `--max-concurrency 16` (balanced)
- `--max-concurrency 100` (throughput-optimized)

For each concurrency level, configure `num_prompts` to simulate realistic user loads:
- `num_prompts = concurrency × 1` (minimal test)
- `num_prompts = concurrency × 5` (standard benchmark)
- `num_prompts = concurrency × 10` (production-grade)

Scenario 1: Chat (1K/1K) - Most Important
```bash
python -m sglang.launch_server \
  --model zai-org/GLM-4.7 \
  --tp 8
```
```bash
# Concurrency 1 (latency-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

# Concurrency 16 (balanced)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

# Concurrency 100 (throughput-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100 \
  --request-rate inf
```
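To run the full matrix without retyping each command, the sweep can be scripted. The snippet below is a sketch that shells out to `sglang.bench_serving` with the same flags as above; changing the input/output lengths to 1000/8000 or 8000/1000 covers Scenarios 2 and 3.

```python
import subprocess

# (max-concurrency, num-prompts) pairs taken from the commands above
LEVELS = [(1, 10), (16, 80), (100, 500)]

for concurrency, num_prompts in LEVELS:
    # Each run mirrors the Chat (1K/1K) commands shown above
    subprocess.run(
        [
            "python", "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--model", "zai-org/GLM-4.7",
            "--dataset-name", "random",
            "--random-input-len", "1000",
            "--random-output-len", "1000",
            "--num-prompts", str(num_prompts),
            "--max-concurrency", str(concurrency),
            "--request-rate", "inf",
        ],
        check=True,
    )
```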
Scenario 2: Reasoning (1K/8K)
```bash
# Concurrency 1 (latency-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

# Concurrency 16 (balanced)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

# Concurrency 64 (throughput-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf
```
Scenario 3: Summarization (8K/1K)
```bash
# Concurrency 1 (latency-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

# Concurrency 16 (balanced)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

# Concurrency 64 (throughput-optimized)
python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf
```
Key Metrics:
Why These Configurations Matter:
Interpreting Results:
Document model accuracy on standard benchmarks:
```bash
# --port must match your running server (30000 is the SGLang default)
python -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --port 30000
```
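Before running the accuracy script, it is worth confirming the server is reachable. A quick sanity check, assuming the server exposes SGLang's `/health` endpoint on the same port:

```python
import requests

# Assumes the SGLang server is listening on port 30000 (the default)
resp = requests.get("http://localhost:30000/health", timeout=5)
print("Server healthy:", resp.status_code == 200)
```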