docs_new/cookbook/autoregressive/GLM/GLM-4.6V.mdx
The GLM-4.6V series includes two versions: GLM-4.6V (106B), a foundation model designed for cloud and high-performance cluster scenarios, and GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications. GLM-4.6V scales its context window to 128k tokens in training and achieves SoTA performance in visual understanding among models of similar parameter scales. Crucially, the GLM team integrated native Function Calling capabilities for the first time. This effectively bridges the gap between "visual perception" and "executable action", providing a unified technical foundation for multimodal agents in real-world business scenarios.
Beyond achieving SoTA performance across major multimodal benchmarks at comparable model scales, GLM-4.6V introduces several key features:
SGLang offers multiple installation methods. Choose the one that best fits your hardware platform and requirements.
docker pull lmsysorg/sglang:latest
Advantages:
If you need to use the latest development version or require custom modifications, you can build from source:
# Install SGLang using UV (recommended)
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv venv
source .venv/bin/activate
uv pip install -e "python[all]" --index-url=https://pypi.org/simple
pip install nvidia-cudnn-cu12==9.16.0.29
# Install ffmpeg to support video input
sudo apt update
sudo apt install ffmpeg
Use Cases:
For general installation instructions, you can also refer to the official SGLang installation guide.
Interactive Command Generator: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command.
import { GLM46VDeployment } from "/src/snippets/autoregressive/glm-46v-deployment.jsx";
<GLM46VDeployment />

- Set SGLANG_USE_CUDA_IPC_TRANSPORT=1 to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory and may require adjusting --mem-fraction-static and/or --max-running-requests. (The additional memory is proportional to image size * number of images in the currently running requests.)
- --mm-enable-dp-encoder (which the generator above handles automatically).
- --model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'.
- --mm-attention-backend fa3: Specify the multimodal attention backend (Flash Attention 3).
- --keep-mm-feature-on-device: Retain multimodal feature tensors on GPU after processing to avoid D2H memory copies.
- SGLANG_USE_CUDA_IPC_TRANSPORT=1: Use CUDA IPC shared memory for multimodal data transport to significantly improve E2E latency.

Example with full multimodal optimizations:
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=0 \
python -m sglang.launch_server \
--model-path zai-org/GLM-4.6V \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--tp-size 8 \
--enable-cache-report \
--log-level info \
--max-running-requests 64 \
--mem-fraction-static 0.65 \
--chunked-prefill-size 8192 \
--attention-backend fa3 \
--mm-attention-backend fa3 \
--mm-enable-dp-encoder \
--enable-metrics
For basic API usage and request examples, please refer to:
GLM-4.6V supports image and video inputs via the OpenAI-compatible API.
Image Input:
import subprocess
curl_command = f"""
curl -s http://localhost:{30000}/v1/chat/completions \\
-H "Content-Type: application/json" \\
-d '{{
"model": "default",
"messages": [
{{
"role": "user",
"content": [
{{
"type": "image_url",
"image_url": {{
"url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
}}
}},
{{
"type": "text",
"text": "What is the image"
}}
]
}}
],
"temperature": 0,
"max_completion_tokens": 1000,
"max_tokens": 1000
}}'
"""
response = subprocess.check_output(curl_command, shell=True).decode()
print(response)
{"id":"b61596ca71394dd699fd8abd4f650c44","object":"chat.completion","created":1765259019,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a logo featuring the text \"SGL\" (in a bold, orange-brown font) alongside a stylized icon. The icon includes a network-like structure with circular nodes (suggesting connectivity or a tree/graph structure) and a tag with \"</>\" (a common symbol for coding, web development, or software). The color scheme uses warm orange-brown tones with a black background, giving it a tech-focused, modern aesthetic (likely representing a company, project, or tool related to software, web development, or digital technology).<|begin_of_box|>SGL logo (stylized text + network/coding icon)<|end_of_box|>","reasoning_content":"Okay, let's see. The image has a logo with the text \"SGL\" and a little icon on the left. The icon looks like a network or a tree structure with circles, and there's a tag with \"</>\" which is a common symbol for coding or web development. The colors are orange and brown tones, with a black background. So probably a logo for a company or project named SGL, maybe related to software, web development, or a tech company.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":2222,"total_tokens":2448,"completion_tokens":226,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
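The response above wraps the model's short final answer in <|begin_of_box|> and <|end_of_box|> markers, the same markers that the MMMU benchmark command on this page matches with --response-answer-regex. A minimal sketch of pulling that answer out on the client side, assuming the markers appear literally in the content string as they do in the sample response (extract_boxed_answer is a hypothetical helper name, not part of SGLang):

```python
import re
from typing import Optional

# Markers as they appear in the sample response above.
BOX_PATTERN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def extract_boxed_answer(content: str) -> Optional[str]:
    """Return the text inside the last boxed span, or None if there is none."""
    matches = BOX_PATTERN.findall(content)
    return matches[-1].strip() if matches else None


# Shortened copy of the "content" field from the response above.
sample = (
    "The image is a logo featuring the text \"SGL\" alongside a stylized icon."
    "<|begin_of_box|>SGL logo (stylized text + network/coding icon)<|end_of_box|>"
)
print(extract_boxed_answer(sample))
```

Returning None when no box is present lets the caller fall back to the full content string instead of failing.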
Video Input:
import subprocess
curl_command = f"""
curl -s http://localhost:{30000}/v1/chat/completions \\
-H "Content-Type: application/json" \\
-d '{{
"model": "default",
"messages": [
{{
"role": "user",
"content": [
{{
"type": "video_url",
"video_url": {{
"url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
}}
}},
{{
"type": "text",
"text": "What is in the video"
}}
]
}}
],
"temperature": 0,
"max_completion_tokens": 1000,
"max_tokens": 1000
}}'
"""
response = subprocess.check_output(curl_command, shell=True).decode()
print(response)
{"id":"520e0a079e5d4b17b82a6af619315a97","object":"chat.completion","created":1765259029,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a still from a presentation by a man on a stage. He is pointing to a small pocket on his jeans and asking the audience what the pocket is for. The video is being shared by Evan Carmichael. The man then reveals that the pocket is for an iPod Nano.","reasoning_content":"Based on the visual evidence in the video, here is a breakdown of what is being shown:\n\n* **Subject:** The video features a man on a stage, giving a presentation. He is wearing a black t-shirt and dark jeans.\n* **Action:** The man is pointing to a pocket on his jeans. He is asking the audience a question about the purpose of this pocket.\n* **Context:** The presentation is being filmed, and the video is being shared by \"Evan Carmichael,\" a well-known motivational speaker and content creator. The source of the clip is credited to \"JoshuaG.\"\n* **Reveal:** The man then reveals the answer to his question. He pulls a small, white, rectangular device out of the pocket. He identifies this device as an \"iPod Nano.\"\n\nIn summary, the image is a still from a presentation where a speaker is explaining the purpose of the small pocket found on many pairs of jeans.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":30276,"total_tokens":30532,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
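The image and video requests above differ only in the content part type ("image_url" versus "video_url"). A small sketch of building either message in Python rather than inside a curl string; build_media_message is a hypothetical helper introduced here for illustration, and the payload fields mirror the requests above:

```python
import json


def build_media_message(kind: str, url: str, prompt: str) -> dict:
    """Build one user message with a media part followed by a text part.

    kind is "image" or "video", matching the "image_url" / "video_url"
    content types used in the requests above.
    """
    if kind not in ("image", "video"):
        raise ValueError(f"unsupported media kind: {kind}")
    part_type = f"{kind}_url"
    return {
        "role": "user",
        "content": [
            {"type": part_type, part_type: {"url": url}},
            {"type": "text", "text": prompt},
        ],
    }


payload = {
    "model": "default",
    "messages": [
        build_media_message(
            "video",
            "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4",
            "What is in the video",
        )
    ],
    "temperature": 0,
    "max_tokens": 1000,
}
# The printed JSON can be sent as the body of a POST to /v1/chat/completions.
print(json.dumps(payload, indent=2))
```

Building the payload as a dict avoids the brace escaping needed when the JSON is embedded in an f-string.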
GLM-4.6V supports Thinking mode. Enable the reasoning parser during deployment:
python -m sglang.launch_server \
--model zai-org/GLM-4.6V \
--reasoning-parser glm45 \
--tp 8 \
--host 0.0.0.0 \
--port 30000
Streaming with Thinking Process:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)
# Process the stream
has_thinking = False
has_answer = False
thinking_started = False
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)
        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)
print()
Output Example:
=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================
The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.
Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
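When the thinking text and the answer are needed as two strings rather than printed as they arrive, the same loop can accumulate them. A minimal sketch, assuming chunks shaped like the ones streamed above; collect_stream and fake_chunk are hypothetical names, and fake_chunk exists only so the example runs without a server:

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split a stream of chat-completion chunks into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content may be absent or None, so read it defensively.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Stand-in for a streamed chunk, used only to exercise collect_stream offline."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning="15% = 0.15; "),
    fake_chunk(reasoning="240 * 0.15 = 36"),
    SimpleNamespace(choices=[]),  # chunks without choices are skipped
    fake_chunk(content="The answer is 36."),
]
thinking, answer = collect_stream(demo)
print(thinking)
print(answer)
```

With a live server, the `response` iterator from the streaming example can be passed to collect_stream in place of `demo`.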
GLM-4.6V supports tool calling with vision capabilities. Pass tools in your API request:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:30000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Beijing, China",
                    }
                },
                "required": ["location"],
                "additionalProperties": False,
            },
        },
    }
]
messages = [
    {
        "role": "user",
        "content": "Please help me check today's weather in Beijing, and tell me whether the tool returned an image."
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_bk32t88BGpSdbtDgzT044Rh4",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location":"Beijing, China"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "tool_call_id": "call_bk32t88BGpSdbtDgzT044Rh4",
        "content": [
            {
                "type": "text",
                "text": "Weather report generated: Beijing, November 7, 2025, sunny, temperature 2°C."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                }
            }
        ]
    },
]
response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=messages,
    timeout=900,
    tools=tools
)
print(response.choices[0].message.content.strip())
Output Example:
The weather in Beijing today (November 7, 2025) is sunny with a temperature of 2°C.
Yes, the tool returned an image (the SGL logo).
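The example above hard-codes the assistant's tool call and the tool's reply. In an application, the tool call comes back from the model and has to be executed locally before the follow-up request. A minimal sketch of that step, assuming tool calls shaped like the one above; get_weather here is a placeholder with a canned reply, and TOOL_REGISTRY and run_tool_call are hypothetical names:

```python
import json


def get_weather(location: str) -> str:
    """Placeholder tool: a real implementation would query a weather service."""
    return f"Weather report generated: {location}, sunny, temperature 2°C."


# Maps tool names from the `tools` schema to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and return the matching role="tool" message."""
    function = tool_call["function"]
    name = function["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    # Arguments arrive as a JSON-encoded string, as in the example above.
    arguments = json.loads(function["arguments"])
    result = TOOL_REGISTRY[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": [{"type": "text", "text": result}],
    }


call = {
    "id": "call_bk32t88BGpSdbtDgzT044Rh4",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location":"Beijing, China"}'},
}
tool_message = run_tool_call(call)
print(tool_message)
```

The returned message can be appended to `messages` after the assistant turn. A tool that produces an image would add an "image_url" part to the content list, as the example above does.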
Beyond the reasoning parser, you can cap the number of thinking tokens with a custom logit processor. Launch the server with --enable-custom-logit-processor and pass Glm4MoeThinkingBudgetLogitProcessor in the request, the same approach used for the GLM-4.6 text model:
import openai
from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[{"role": "user", "content": "Describe this image briefly."}],
    max_tokens=1024,
    extra_body={
        "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {"thinking_budget": 512},
    },
)
print(response)
python3 ./benchmark/gsm8k/bench_sglang.py
Accuracy: 0.925
Invalid: 0.000
Latency: 15.327 s
Output throughput: 1788.375 token/s
python3 -m sglang.bench_serving \
--backend sglang-oai-chat \
--port 30000 \
--model zai-org/GLM-4.6V \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 \
--random-output-len 1024 \
--num-prompts 128 \
--max-concurrency 8
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 8
Successful requests: 128
Benchmark duration (s): 89.27
Total input tokens: 315390
Total input text tokens: 8702
Total input vision tokens: 306688
Total generated tokens: 66020
Total generated tokens (retokenized): 31037
Request throughput (req/s): 1.43
Input token throughput (tok/s): 3533.17
Output token throughput (tok/s): 739.59
Peak output token throughput (tok/s): 823.00
Peak concurrent requests: 12
Total token throughput (tok/s): 4272.76
Concurrency: 7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 5349.20
Median E2E Latency (ms): 5380.98
---------------Time to First Token----------------
Mean TTFT (ms): 1724.04
Median TTFT (ms): 1688.16
P99 TTFT (ms): 6152.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.15
Median TPOT (ms): 7.77
P99 TPOT (ms): 23.97
---------------Inter-Token Latency----------------
Mean ITL (ms): 10.00
Median ITL (ms): 8.44
P95 ITL (ms): 9.23
P99 ITL (ms): 116.02
Max ITL (ms): 173.48
==================================================
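The throughput rows in the report follow from the totals and the benchmark duration. A short sketch that recomputes them from the figures above; because the duration is printed to two decimals, the recomputed values can differ from the reported ones in the last digits:

```python
# Totals copied from the benchmark report above.
duration_s = 89.27
successful_requests = 128
input_tokens = 315390
input_text_tokens = 8702
input_vision_tokens = 306688
generated_tokens = 66020

# Each throughput is a total divided by the wall-clock duration.
request_throughput = successful_requests / duration_s
input_throughput = input_tokens / duration_s
output_throughput = generated_tokens / duration_s

# Share of the prompt that comes from image tokens rather than text.
vision_share = input_vision_tokens / input_tokens

print(f"req/s:        {request_throughput:.2f}")
print(f"input tok/s:  {input_throughput:.0f}")
print(f"output tok/s: {output_throughput:.0f}")
print(f"vision share: {vision_share:.1%}")
```

Vision tokens make up nearly all of the input in this run, so the input throughput figure mostly reflects image processing rather than text prefill.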
python3 benchmark/mmmu/bench_sglang.py --response-answer-regex "<\|begin_of_box\|>(.*)<\|end_of_box\|>" --port 30000 --concurrency 64 --extra-request-body '{"max_tokens": 4096}'
Benchmark time: 487.2229107860476
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.962, 'num': 26},
'Agriculture': {'acc': 0.5, 'num': 30},
'Architecture_and_Engineering': {'acc': 0.733, 'num': 15},
'Art': {'acc': 0.833, 'num': 30},
'Art_Theory': {'acc': 0.9, 'num': 30},
'Basic_Medical_Science': {'acc': 0.733, 'num': 30},
'Biology': {'acc': 0.586, 'num': 29},
'Chemistry': {'acc': 0.654, 'num': 26},
'Clinical_Medicine': {'acc': 0.633, 'num': 30},
'Computer_Science': {'acc': 0.76, 'num': 25},
'Design': {'acc': 0.867, 'num': 30},
'Diagnostics_and_Laboratory_Medicine': {'acc': 0.633, 'num': 30},
'Economics': {'acc': 0.862, 'num': 29},
'Electronics': {'acc': 0.5, 'num': 18},
'Energy_and_Power': {'acc': 0.875, 'num': 16},
'Finance': {'acc': 0.857, 'num': 28},
'Geography': {'acc': 0.714, 'num': 28},
'History': {'acc': 0.767, 'num': 30},
'Literature': {'acc': 0.897, 'num': 29},
'Manage': {'acc': 0.759, 'num': 29},
'Marketing': {'acc': 1.0, 'num': 26},
'Materials': {'acc': 0.833, 'num': 18},
'Math': {'acc': 0.76, 'num': 25},
'Mechanical_Engineering': {'acc': 0.619, 'num': 21},
'Music': {'acc': 0.286, 'num': 28},
'Overall': {'acc': 0.761, 'num': 803},
'Overall-Art and Design': {'acc': 0.729, 'num': 118},
'Overall-Business': {'acc': 0.884, 'num': 138},
'Overall-Health and Medicine': {'acc': 0.773, 'num': 150},
'Overall-Humanities and Social Science': {'acc': 0.78, 'num': 118},
'Overall-Science': {'acc': 0.728, 'num': 136},
'Overall-Tech and Engineering': {'acc': 0.671, 'num': 143},
'Pharmacy': {'acc': 0.933, 'num': 30},
'Physics': {'acc': 0.929, 'num': 28},
'Psychology': {'acc': 0.733, 'num': 30},
'Public_Health': {'acc': 0.933, 'num': 30},
'Sociology': {'acc': 0.724, 'num': 29}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.761
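The overall accuracy can be cross-checked against the six "Overall-*" group rows. The sketch below computes the sample-weighted mean of those rows; it is a consistency check on the printed figures, not a statement about how the benchmark script computes its total:

```python
# (accuracy, number of questions) copied from the "Overall-*" rows above.
groups = {
    "Art and Design": (0.729, 118),
    "Business": (0.884, 138),
    "Health and Medicine": (0.773, 150),
    "Humanities and Social Science": (0.78, 118),
    "Science": (0.728, 136),
    "Tech and Engineering": (0.671, 143),
}

total_questions = sum(num for _, num in groups.values())
# Weight each group's accuracy by its question count.
weighted_accuracy = sum(acc * num for acc, num in groups.values()) / total_questions

print(total_questions, round(weighted_accuracy, 3))
```

The group counts sum to the 803 questions in the 'Overall' row, and the weighted mean rounds to the reported 0.761.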