GLM-4.6V - SGLang

1. Model Introduction

The GLM-4.6V series includes two models: GLM-4.6V (106B), a foundation model designed for cloud and high-performance cluster scenarios, and GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications. GLM-4.6V scales its context window to 128K tokens in training and achieves SoTA performance in visual understanding among models of similar parameter scale. Crucially, the GLM team integrated native Function Calling capabilities for the first time. This bridges the gap between "visual perception" and "executable action", providing a unified technical foundation for multimodal agents in real-world business scenarios.

Beyond achieving SoTA performance across major multimodal benchmarks at comparable model scales, GLM-4.6V introduces several key features:

Native Multimodal Function Calling: Enables native vision-driven tool use. Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution.
Interleaved Image-Text Content Generation: Supports high-quality mixed-media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context (documents, user inputs, and tool-retrieved images) and synthesizes coherent, interleaved image-text content tailored to the task. During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.
Multimodal Document Understanding: GLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without prior conversion to plain text.
Frontend Replication & Visual Editing: Reconstructs pixel-accurate HTML/CSS from UI screenshots and supports natural-language-driven edits. It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions.

2. SGLang Installation

SGLang offers multiple installation methods. Choose the one that best fits your hardware platform and requirements.

2.1 Docker Installation (Recommended)

shell

docker pull lmsysorg/sglang:latest

Advantages:

Ready to use out of the box, no manual environment configuration needed
Avoids dependency conflict issues
Easy to migrate between different environments

2.2 Build from Source

If you need to use the latest development version or require custom modifications, you can build from source:

bash

# Install SGLang using UV (recommended)
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv venv
source .venv/bin/activate
uv pip install -e "python[all]" --index-url=https://pypi.org/simple
pip install nvidia-cudnn-cu12==9.16.0.29
# Install ffmpeg to support video input
sudo apt update
sudo apt install ffmpeg

Use Cases:

Need to customize and modify SGLang source code
Want to use the latest development features
Participate in SGLang project development

For general installation instructions, you can also refer to the official SGLang installation guide.

3. Model Deployment

3.1 Basic Configuration

Interactive Command Generator: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command.

3.2 Configuration Tips

TTFT Optimization: Set SGLANG_USE_CUDA_IPC_TRANSPORT=1 to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory, proportional to image size × number of images in the currently running requests, so you may need to adjust --mem-fraction-static and/or --max-running-requests.
TP=8 Configuration: When using Tensor Parallelism (TP) of 8, the vision attention's 12 heads cannot be evenly divided. You can resolve this by adding --mm-enable-dp-encoder (which the generator above handles automatically).
Fast Model Loading: For large models (like the 106B version), you can speed up model loading by using --model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}'.
Hardware Notes:
- H100 (FP8): Use the FP8 checkpoint for best memory efficiency.
- A100 / H100 (BF16): Use standard multimodal parameters to manage throughput and GPU memory usage.
- H200 / B200: Runs out of the box, supporting full context length plus concurrent image + video processing.
Additional Multimodal Parameters:
- --mm-attention-backend fa3: Specify multimodal attention backend (Flash Attention 3).
- --keep-mm-feature-on-device: Retain multimodal feature tensors on GPU after processing to avoid D2H memory copies.
- SGLANG_USE_CUDA_IPC_TRANSPORT=1: Use CUDA IPC shared memory for multimodal data transport to significantly improve E2E latency.

Example with full multimodal optimizations:

bash

SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=0 \
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.6V \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --tp-size 8 \
  --enable-cache-report \
  --log-level info \
  --max-running-requests 64 \
  --mem-fraction-static 0.65 \
  --chunked-prefill-size 8192 \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --mm-enable-dp-encoder \
  --enable-metrics
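
The TP=8 tip above comes down to a divisibility check: 12 vision attention heads do not split evenly across 8 tensor-parallel ranks. The following is a minimal sketch of how a launch command could be assembled with that rule applied. The helper name `build_launch_command` and the constant `VISION_ATTENTION_HEADS` are hypothetical (introduced only for this illustration); the flags themselves are the ones listed on this page.

```python
import json
import shlex

# Head count taken from the TP=8 tip above.
VISION_ATTENTION_HEADS = 12


def build_launch_command(model_path, tp_size, fast_load=False, port=30000):
    """Assemble an argv list for sglang.launch_server (illustrative helper)."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        "--port", str(port),
    ]
    # If the vision heads cannot be split evenly across TP ranks,
    # add the flag recommended in the tip above.
    if VISION_ATTENTION_HEADS % tp_size != 0:
        args.append("--mm-enable-dp-encoder")
    if fast_load:
        # Values copied from the "Fast Model Loading" tip above.
        extra = {"enable_multithread_load": "true", "num_threads": 64}
        args += ["--model-loader-extra-config", json.dumps(extra)]
    return args


# shlex.join quotes the JSON argument so the result can be pasted into a shell.
print(shlex.join(build_launch_command("zai-org/GLM-4.6V", tp_size=8, fast_load=True)))
```

With `tp_size=4` the check passes (12 is divisible by 4) and the flag is omitted.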

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to the official SGLang documentation.
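
As a minimal sketch of a text-only request, the snippet below posts to the same `/v1/chat/completions` endpoint used in the examples that follow, using only the Python standard library. It assumes a server is already running on `localhost:30000`; the helper names `build_chat_payload` and `chat` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request


def build_chat_payload(prompt, max_tokens=256, temperature=0.0):
    """Build the JSON body for a single-turn, text-only chat request."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url="http://localhost:30000/v1", timeout=60):
    """Send the request and return the assistant's text (needs a live server)."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    return result["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Summarize what SGLang does in one sentence."))
```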

4.2 Advanced Usage

4.2.1 Multi-Modal Inputs

GLM-4.6V supports image and video inputs via the OpenAI-compatible API.

Image Input:

python

import subprocess

curl_command = f"""
curl -s http://localhost:{30000}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "default",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
            }}
          }},
          {{
            "type": "text",
            "text": "What is the image"
          }}
        ]
      }}
    ],
    "temperature": 0,
    "max_completion_tokens": 1000,
    "max_tokens": 1000
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print(response)

text

{"id":"b61596ca71394dd699fd8abd4f650c44","object":"chat.completion","created":1765259019,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a logo featuring the text \"SGL\" (in a bold, orange-brown font) alongside a stylized icon. The icon includes a network-like structure with circular nodes (suggesting connectivity or a tree/graph structure) and a tag with \"</>\" (a common symbol for coding, web development, or software). The color scheme uses warm orange-brown tones with a black background, giving it a tech-focused, modern aesthetic (likely representing a company, project, or tool related to software, web development, or digital technology).<|begin_of_box|>SGL logo (stylized text + network/coding icon)<|end_of_box|>","reasoning_content":"Okay, let's see. The image has a logo with the text \"SGL\" and a little icon on the left. The icon looks like a network or a tree structure with circles, and there's a tag with \"</>\" which is a common symbol for coding or web development. The colors are orange and brown tones, with a black background. So probably a logo for a company or project named SGL, maybe related to software, web development, or a tech company.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":2222,"total_tokens":2448,"completion_tokens":226,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
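
In the sample response above, the `content` field ends with a span wrapped in `<|begin_of_box|>` and `<|end_of_box|>`, which appears to carry the model's short final answer (the MMMU benchmark command later on this page extracts it with a similar regex). A minimal sketch for separating that span from the surrounding explanation follows; the helper name `split_boxed_answer` is hypothetical.

```python
import re

# Non-greedy match so only the text between one pair of markers is captured.
BOX_PATTERN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def split_boxed_answer(content):
    """Return (explanation, boxed_answer); boxed_answer is None if no box is found."""
    match = BOX_PATTERN.search(content)
    if match is None:
        return content.strip(), None
    # Remove the boxed span from the text and keep the rest as the explanation.
    explanation = (content[:match.start()] + content[match.end():]).strip()
    return explanation, match.group(1).strip()


sample = "The image is a logo.<|begin_of_box|>SGL logo<|end_of_box|>"
explanation, answer = split_boxed_answer(sample)
print(answer)
```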

Video Input:

python

import subprocess

curl_command = f"""
curl -s http://localhost:{30000}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "default",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "video_url",
            "video_url": {{
              "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
            }}
          }},
          {{
            "type": "text",
            "text": "What is in the video"
          }}
        ]
      }}
    ],
    "temperature": 0,
    "max_completion_tokens": 1000,
    "max_tokens": 1000
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print(response)

text

{"id":"520e0a079e5d4b17b82a6af619315a97","object":"chat.completion","created":1765259029,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"The image is a still from a presentation by a man on a stage. He is pointing to a small pocket on his jeans and asking the audience what the pocket is for. The video is being shared by Evan Carmichael. The man then reveals that the pocket is for an iPod Nano.","reasoning_content":"Based on the visual evidence in the video, here is a breakdown of what is being shown:\n\n*   **Subject:** The video features a man on a stage, giving a presentation. He is wearing a black t-shirt and dark jeans.\n*   **Action:** The man is pointing to a pocket on his jeans. He is asking the audience a question about the purpose of this pocket.\n*   **Context:** The presentation is being filmed, and the video is being shared by \"Evan Carmichael,\" a well-known motivational speaker and content creator. The source of the clip is credited to \"JoshuaG.\"\n*   **Reveal:** The man then reveals the answer to his question. He pulls a small, white, rectangular device out of the pocket. He identifies this device as an \"iPod Nano.\"\n\nIn summary, the image is a still from a presentation where a speaker is explaining the purpose of the small pocket found on many pairs of jeans.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":30276,"total_tokens":30532,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
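
Both examples above reference remote URLs. For a local image, one common approach with OpenAI-compatible servers is to embed the file as a base64 `data:` URL in the same `image_url` field. The sketch below assumes the server accepts such URLs; the helper names are hypothetical and not part of SGLang.

```python
import base64
import mimetypes


def bytes_to_data_url(data, mime_type):
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_to_data_url(path):
    """Read a local image file and return it as a data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as handle:
        return bytes_to_data_url(handle.read(), mime_type)


def image_message(path, question):
    """Build a user message with the same shape as the image example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": question},
        ],
    }
```

Base64 encoding grows the request body by about one third, so a URL or a shared file path may be preferable for large files.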

4.2.2 Thinking Mode

GLM-4.6V supports Thinking mode. Enable the reasoning parser during deployment:

shell

python -m sglang.launch_server \
  --model zai-org/GLM-4.6V \
  --reasoning-parser glm45 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000

Streaming with Thinking Process:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

text

=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================

The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.

Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
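
When the thinking text and the answer are needed as separate strings rather than printed as they arrive, the same loop can be written as an accumulator. This is a sketch under the assumption that each streamed chunk has the shape used in the example above (`choices[0].delta` with optional `reasoning_content` and `content`); `collect_stream` and `fake_chunk` are hypothetical helpers, and `fake_chunk` only imitates a chunk for offline testing.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is only present when a reasoning parser is enabled.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [
    fake_chunk(reasoning="15% = 0.15. "),
    fake_chunk(reasoning="240 x 0.15 = 36."),
    fake_chunk(content="The answer is 36."),
]
reasoning, answer = collect_stream(chunks)
print(answer)
```

With a live server, the `response` iterator from the streaming example can be passed to `collect_stream` directly.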

4.2.3 Tool Calling

GLM-4.6V supports tool calling with vision capabilities. Pass tools in your API request:

python

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:30000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)



tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Beijing, China",
                    }
                },
                "required": ["location"],
                "additionalProperties": False,
            },
        },
    }
]


messages = [
    {
        "role": "user",
        "content": "Please help me check today's weather in Beijing, and tell me whether the tool returned an image."
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_bk32t88BGpSdbtDgzT044Rh4",
                "type": "function",
                "function": {
                    "name": 'get_weather',
                    "arguments": '{"location":"Beijing, China"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "tool_call_id": "call_bk32t88BGpSdbtDgzT044Rh4",
        "content": [
            {
                "type": "text",
                "text": "Weather report generated: Beijing, November 7, 2025, sunny, temperature 2°C."
            },
            {
                "type": "image_url",
                "image_url": {
                     "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                }
            }
        ]
    },
]

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=messages,
    timeout=900,
    tools=tools
)
print(response.choices[0].message.content.strip())

Output Example:

text

The weather in Beijing today (November 7, 2025) is sunny with a temperature of 2°C.

Yes, the tool returned an image (the SGL logo).
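
The example above hard-codes both the assistant's tool call and the tool's reply. In an agent loop, the application would execute the call itself and append the result as a `tool` message. A minimal sketch of that dispatch step follows, using plain dicts in the same shape as the messages above. The `get_weather` stub, `TOOL_REGISTRY`, and `run_tool_call` are hypothetical names for illustration, and the stub returns canned data rather than querying a real service.

```python
import json


def get_weather(location):
    """Hypothetical stub; a real tool would query a weather service."""
    return {
        "text": f"Weather report generated: {location}, sunny, temperature 2°C.",
        "image_url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true",
    }


# Maps tool names (as declared in `tools`) to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one assistant tool call and build the matching tool message."""
    function = tool_call["function"]
    handler = TOOL_REGISTRY[function["name"]]
    # Arguments arrive as a JSON-encoded string, not as a dict.
    arguments = json.loads(function["arguments"])
    result = handler(**arguments)
    content = [{"type": "text", "text": result["text"]}]
    if result.get("image_url"):
        content.append({"type": "image_url", "image_url": {"url": result["image_url"]}})
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}


call = {
    "id": "call_bk32t88BGpSdbtDgzT044Rh4",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location":"Beijing, China"}'},
}
tool_message = run_tool_call(call)
print(tool_message["content"][0]["text"])
```

The returned message can be appended to `messages` before the next `client.chat.completions.create` call.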

4.2.4 Thinking Budget

Beyond the reasoning parser, you can cap the number of thinking tokens using CustomLogitProcessor. Launch the server with --enable-custom-logit-processor and pass Glm4MoeThinkingBudgetLogitProcessor in the request, the same approach used for the GLM-4.6 text model:

python

import openai
from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[{"role": "user", "content": "Describe this image briefly."}],
    max_tokens=1024,
    extra_body={
        "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {"thinking_budget": 512},
    },
)
print(response)

5. Benchmark

5.1 Text Benchmark: Latency, Throughput and Accuracy

Command

shell

python3 ./benchmark/gsm8k/bench_sglang.py

Result Output

text

Accuracy: 0.925
Invalid: 0.000
Latency: 15.327 s
Output throughput: 1788.375 token/s

5.2 Multimodal Benchmark: Latency and Throughput

Command

shell

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --port 30000 \
  --model zai-org/GLM-4.6V \
  --dataset-name image \
  --image-count 2 \
  --image-resolution 720p \
  --random-input-len 128 \
  --random-output-len 1024 \
  --num-prompts 128 \
  --max-concurrency 8

Result Output

text

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     128
Benchmark duration (s):                  89.27
Total input tokens:                      315390
Total input text tokens:                 8702
Total input vision tokens:               306688
Total generated tokens:                  66020
Total generated tokens (retokenized):    31037
Request throughput (req/s):              1.43
Input token throughput (tok/s):          3533.17
Output token throughput (tok/s):         739.59
Peak output token throughput (tok/s):    823.00
Peak concurrent requests:                12
Total token throughput (tok/s):          4272.76
Concurrency:                             7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5349.20
Median E2E Latency (ms):                 5380.98
---------------Time to First Token----------------
Mean TTFT (ms):                          1724.04
Median TTFT (ms):                        1688.16
P99 TTFT (ms):                           6152.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.15
Median TPOT (ms):                        7.77
P99 TPOT (ms):                           23.97
---------------Inter-Token Latency----------------
Mean ITL (ms):                           10.00
Median ITL (ms):                         8.44
P95 ITL (ms):                            9.23
P99 ITL (ms):                            116.02
Max ITL (ms):                            173.48
==================================================
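
Several figures in the report above can be recomputed from the raw counts, which is a useful sanity check when comparing runs. The sketch below assumes that throughput is tokens divided by benchmark duration and that average concurrency is total request latency divided by duration; these relations match the reported numbers up to rounding of the printed duration, but they are inferred from the output rather than taken from the benchmark's source. The function name `derive_metrics` is hypothetical.

```python
def derive_metrics(duration_s, requests, input_tokens, output_tokens,
                   vision_tokens, images_per_request, mean_e2e_ms):
    """Recompute derived serving metrics from the raw benchmark counts."""
    return {
        "request_throughput": requests / duration_s,
        "input_throughput": input_tokens / duration_s,
        "output_throughput": output_tokens / duration_s,
        # Average number of vision tokens produced per input image.
        "vision_tokens_per_image": vision_tokens / (requests * images_per_request),
        # Total time spent in flight across all requests, divided by wall time.
        "avg_concurrency": requests * (mean_e2e_ms / 1000.0) / duration_s,
    }


# Raw values copied from the result output above.
metrics = derive_metrics(
    duration_s=89.27,
    requests=128,
    input_tokens=315390,
    output_tokens=66020,
    vision_tokens=306688,
    images_per_request=2,
    mean_e2e_ms=5349.20,
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

One detail this surfaces: each 720p image accounts for 1198 vision tokens in this run, so vision tokens make up roughly 97% of the input.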

5.3 Multimodal Accuracy Benchmark: MMMU

Command

shell

python3 benchmark/mmmu/bench_sglang.py --response-answer-regex "<\|begin_of_box\|>(.*)<\|end_of_box\|>" --port 30000 --concurrency 64 --extra-request-body '{"max_tokens": 4096}'

Result Output

text

Benchmark time: 487.2229107860476
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.962, 'num': 26},
 'Agriculture': {'acc': 0.5, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.733, 'num': 15},
 'Art': {'acc': 0.833, 'num': 30},
 'Art_Theory': {'acc': 0.9, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.733, 'num': 30},
 'Biology': {'acc': 0.586, 'num': 29},
 'Chemistry': {'acc': 0.654, 'num': 26},
 'Clinical_Medicine': {'acc': 0.633, 'num': 30},
 'Computer_Science': {'acc': 0.76, 'num': 25},
 'Design': {'acc': 0.867, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.633, 'num': 30},
 'Economics': {'acc': 0.862, 'num': 29},
 'Electronics': {'acc': 0.5, 'num': 18},
 'Energy_and_Power': {'acc': 0.875, 'num': 16},
 'Finance': {'acc': 0.857, 'num': 28},
 'Geography': {'acc': 0.714, 'num': 28},
 'History': {'acc': 0.767, 'num': 30},
 'Literature': {'acc': 0.897, 'num': 29},
 'Manage': {'acc': 0.759, 'num': 29},
 'Marketing': {'acc': 1.0, 'num': 26},
 'Materials': {'acc': 0.833, 'num': 18},
 'Math': {'acc': 0.76, 'num': 25},
 'Mechanical_Engineering': {'acc': 0.619, 'num': 21},
 'Music': {'acc': 0.286, 'num': 28},
 'Overall': {'acc': 0.761, 'num': 803},
 'Overall-Art and Design': {'acc': 0.729, 'num': 118},
 'Overall-Business': {'acc': 0.884, 'num': 138},
 'Overall-Health and Medicine': {'acc': 0.773, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.78, 'num': 118},
 'Overall-Science': {'acc': 0.728, 'num': 136},
 'Overall-Tech and Engineering': {'acc': 0.671, 'num': 143},
 'Pharmacy': {'acc': 0.933, 'num': 30},
 'Physics': {'acc': 0.929, 'num': 28},
 'Psychology': {'acc': 0.733, 'num': 30},
 'Public_Health': {'acc': 0.933, 'num': 30},
 'Sociology': {'acc': 0.724, 'num': 29}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.761
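
The overall score above is a question-weighted average, not a simple mean of the group accuracies. As a quick check, the sketch below recomputes it from the six `Overall-*` groups in the output; the accuracy and count values are copied from the result, and the function name `weighted_accuracy` is hypothetical.

```python
# (accuracy, number of questions) for each top-level group in the output above.
GROUPS = {
    "Art and Design": (0.729, 118),
    "Business": (0.884, 138),
    "Health and Medicine": (0.773, 150),
    "Humanities and Social Science": (0.78, 118),
    "Science": (0.728, 136),
    "Tech and Engineering": (0.671, 143),
}


def weighted_accuracy(groups):
    """Return (overall accuracy, total questions), weighting by question count."""
    total = sum(num for _, num in groups.values())
    correct = sum(acc * num for acc, num in groups.values())
    return correct / total, total


overall, total = weighted_accuracy(GROUPS)
print(f"{overall:.3f} over {total} questions")
```

The group counts sum to the 803 questions reported under `Overall`, and the weighted average rounds to the reported 0.761.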