MiniCPM-V 4.6

1. Model Introduction

MiniCPM-V 4.6 is the next-generation multimodal model from OpenBMB, the team behind the MiniCPM-V series. The model combines a Qwen3.5-style hybrid LLM backbone (Gated Delta Net + full attention) with a NaViT-packed vision encoder that handles arbitrary aspect ratios and high-resolution slicing natively, plus end-to-end video support.

OpenBMB ships two variants on HuggingFace:

  • openbmb/MiniCPM-V-4.6 — base instruct model. Use this for general multimodal serving; thinking mode is still available per-request via chat_template_kwargs.enable_thinking=true.
  • openbmb/MiniCPM-V-4.6-Thinking — thinking-tuned variant with stronger chain-of-thought behavior. Pair with the same --reasoning-parser qwen3 flag.

Key Features:

  • Hybrid LLM backbone: Qwen3.5-style mix of Gated Delta Net (linear-attention) layers and full-attention layers, providing long-context efficiency without giving up modeling power.
  • Native variable-resolution vision: NaViT-packed vision encoder with mid-ViT merger and per-image window attention. Images of any aspect ratio are processed without forced letterboxing.
  • High-resolution slicing: Source image plus a configurable grid of slice tiles (up to 9 tiles in the open test variant) lets the model reason over fine detail in 1280×720+ images.
  • Video: Frame-by-frame multi-modal data items routed through the same vision encoder; any number of frames per request.
  • Reasoning Parser: switchable thinking mode (Qwen3.5 lineage), exposed via chat_template_kwargs.enable_thinking per request and SGLang's --reasoning-parser qwen3 on the server side.
  • Tool Calling: Qwen3.5-style <tool_call><function=…><parameter=…>…</parameter></function></tool_call> XML format, surfaced as OpenAI-compatible message.tool_calls via SGLang's --tool-call-parser qwen3_coder. Composes with thinking mode and with image / video inputs.

License: Apache 2.0.

2. SGLang Installation

Pull the nightly Docker image (rolling tag, tracks main):

bash
# CUDA 13 (Hopper / Blackwell, default)
docker pull lmsysorg/sglang:dev

# CUDA 12 (Ampere or older drivers)
docker pull lmsysorg/sglang:dev-cu12

For the general SGLang installation guide (PyPI, source, Docker) see the official SGLang installation guide.

3. Model Deployment

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to generate the appropriate deployment command. The Variant toggle switches between openbmb/MiniCPM-V-4.6 (base) and openbmb/MiniCPM-V-4.6-Thinking. The Reasoning Parser and Tool Call Parser toggles add --reasoning-parser qwen3 and --tool-call-parser qwen3_coder respectively; see §4.4 for usage details.

import { MiniCPMV46Deployment } from '/src/snippets/autoregressive/minicpm-v-4_6-deployment.jsx'

<MiniCPMV46Deployment />

3.2 Configuration Tips

  • Mamba Radix Cache: The Qwen3.5-style hybrid Gated Delta Net backbone supports two Mamba scheduling strategies, selected via --mamba-scheduler-strategy:
    • V1 (no_buffer): Default. No overlap scheduler, lower memory usage. Required for AMD MI GPUs.
    • V2 (extra_buffer): Enables overlap scheduling and branching point caching with --mamba-scheduler-strategy extra_buffer --page-size 64. Requires FLA kernel backend (NVIDIA GPUs only). Trades higher mamba state memory for better throughput. Strictly superior in non-KV-cache-bound scenarios; in KV-cache-bound cases, weigh the overlap scheduling benefit against reduced max concurrency. --page-size must satisfy FLA_CHUNK_SIZE % page_size == 0 or page_size % FLA_CHUNK_SIZE == 0 (FLA_CHUNK_SIZE is currently 64).
  • The --mem-fraction-static flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
  • Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities.
  • To speed up weight loading for this large model, add --model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}' to the launch command.
  • CUDA IPC Transport: Add SGLANG_USE_CUDA_IPC_TRANSPORT=1 as an environment variable to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower --mem-fraction-static or --max-running-requests.
  • Multimodal Attention Backend: Use --mm-attention-backend fa3 on H100/H200 for better vision performance, or --mm-attention-backend fa4 on B200/B300.
  • For processing large images or videos, you may need to lower --mem-fraction-static to leave room for image feature tensors.
  • Multi-image and high-resolution images: the image processor produces one source patch plus per-slice tile patches; each is its own MultimodalDataItem. No special server-side flag needed.
  • Video: decoded frame-by-frame through the same image-style slicer. No extra flag needed; pass video_url in the OpenAI chat completion request.
  • Chunked Prefill: For high-concurrency vision benchmarking with many large/sliced images, pass --chunked-prefill-size -1 to disable prefill chunking. The default chunked-prefill path can mis-split a request across an image boundary in mm_utils.embed_mm_inputs and crash the server; disabling chunking sidesteps this at the cost of higher TTFT under concurrency. For interactive serving leave the default on.
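
The --page-size constraint quoted in the V2 (extra_buffer) tip can be checked before launching the server. The following is a minimal sketch of that divisibility rule only; the helper name page_size_is_compatible is hypothetical and not part of SGLang, and the chunk size of 64 is the value quoted above, which may change in later releases.

```python
FLA_CHUNK_SIZE = 64  # value quoted in the tips above; treat as an assumption for other releases


def page_size_is_compatible(page_size: int, chunk_size: int = FLA_CHUNK_SIZE) -> bool:
    """Return True if page_size satisfies the rule quoted in the V2 tip.

    Rule: chunk_size % page_size == 0 or page_size % chunk_size == 0.
    """
    if page_size <= 0:
        return False
    return chunk_size % page_size == 0 or page_size % chunk_size == 0


# Try a few candidate values for --page-size
for candidate in (16, 48, 64, 128):
    print(candidate, page_size_is_compatible(candidate))
```

Under this rule 16, 64 and 128 pass, while 48 fails because it neither divides 64 nor is a multiple of it.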

4. Model Invocation

Deploy the model on an H200:

bash
sglang serve --model-path openbmb/MiniCPM-V-4.6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --mem-fraction-static 0.15 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --host 0.0.0.0 --port 30000

4.1 Basic Usage (Image)

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.ilankelman.org/stopsigns/australia.jpg",
                    },
                },
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)

Output Example:

text
A black SUV drives past a Chinese-style gate with a red stop sign and traditional architecture, while storefronts and street signs line the sidewalk.

4.2 High-Resolution / Sliced Images

The image processor automatically picks a slice grid (up to 9 tiles) for high-resolution inputs. A 1280×720 source produces grid [2, 3] and 7 patches (one source patch plus six slice tiles) with tgt_sizes=[(24, 44), 6×(28, 36)], byte-for-byte matching the HF reference implementation.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
                    },
                },
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)

Output Example:

text
The Statue of Liberty stands tall against a cloudy sky, holding a torch aloft and a document in her left hand, symbolizing freedom and enlightenment.
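
The patch count quoted at the top of this section follows from simple arithmetic: one source patch plus one patch per slice tile. The sketch below only reproduces that bookkeeping for the 1280×720 example; the helper names are hypothetical, and the real grid selection and target sizes come from the model's image processor, not from this code.

```python
def expected_patch_count(grid):
    """One source patch plus one patch per tile in the slice grid."""
    a, b = grid
    return 1 + a * b


def expand_tgt_sizes(source_size, tile_size, grid):
    """Expand the compact '6x(28, 36)' notation into an explicit list."""
    a, b = grid
    return [source_size] + [tile_size] * (a * b)


grid = (2, 3)  # grid reported for the 1280x720 example above
sizes = expand_tgt_sizes((24, 44), (28, 36), grid)

print(expected_patch_count(grid))
print(sizes)
```

For grid [2, 3] this gives 7 patches, which is also the number of MultimodalDataItem entries the request would carry according to the tips in §3.2.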

4.3 Video Input

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": "<your-video-url-or-file-path>"},
                },
                {"type": "text", "text": "Describe what happens in this video in one sentence."},
            ],
        }
    ],
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)

Output Example (run against an 8-frame synthetic test mp4 of shifting colored squares):

text
The video shows a grid of colored squares moving in a random pattern.

4.4 Advanced Usage

4.4.1 Reasoning Parser

Pass --reasoning-parser qwen3 to the server (the "Reasoning Parser" toggle in §3.1, on by default) so that SGLang splits each response at the </think> boundary: the text before </think> goes to reasoning_content, and the text after it goes to content. Per request, the chat template's enable_thinking flag controls whether the model actually emits reasoning.

  • Thinking mode (default, enable_thinking=true): assistant prompt ends with <think>\n; the model writes reasoning, closes with </think>, then the answer. reasoning_content and content are both populated.
  • Instruct mode (enable_thinking=false): the chat template injects an empty <think></think> placeholder so the model emits no thinking tokens; reasoning_content ends up empty.

Thinking mode (default):

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[{"role": "user", "content": "Reply with the single word 'hi'. No explanation."}],
    max_tokens=200,
)

msg = response.choices[0].message
print("reasoning_content:", msg.reasoning_content)
print("content          :", msg.content)

Output Example:

text
reasoning_content: Got it, let's see. The user wants a reply with "hi" and no explanation. So I need to just say "hi" as the response. ...
content          : hi

Instruct mode (enable_thinking=false), reusing the client from the previous example:

python
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[{"role": "user", "content": "Reply with the single word 'hi'. No explanation."}],
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

msg = response.choices[0].message
print("reasoning_content:", msg.reasoning_content)
print("content          :", msg.content)

Output Example:

text
reasoning_content:
content          : hi
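
To make the split concrete, here is a simplified sketch of the behavior described at the start of this section. It is not SGLang's reasoning parser: the function name is hypothetical, and the handling of output with no closing tag (treated here as plain content) is an arbitrary choice for illustration that may differ from what the server does with truncated reasoning.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split raw model output into (reasoning_content, content)."""
    if close_tag not in text:
        # Assumption for this sketch: no closing tag means no reasoning block
        return "", text.strip()
    reasoning, _, answer = text.partition(close_tag)
    reasoning = reasoning.strip()
    # The opening tag normally sits in the prompt, but strip it if it was echoed
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):]
    return reasoning.strip(), answer.strip()


# Thinking mode: the prompt ends with "<think>\n", so the output starts with reasoning
print(split_reasoning("The user wants the word hi.\n</think>\n\nhi"))

# Instruct mode: no thinking tokens are emitted at all
print(split_reasoning("hi"))
```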

4.4.2 Tool Calling

Pass --tool-call-parser qwen3_coder to the server (the "Tool Call Parser" toggle in §3.1) so that SGLang extracts <tool_call> blocks from the model output into the OpenAI-style message.tool_calls field (with finish_reason="tool_calls"). The model emits the Qwen3.5 XML tool-call format (<tool_call><function=name><parameter=k>v</parameter></function></tool_call>), so qwen3_coder is the matching parser. Tool calls compose with both reasoning modes and with image / video inputs.

<Warning> Do **not** use `--tool-call-parser qwen` for MiniCPM-V 4.6 — that parser expects the older Qwen2.5 JSON format `<tool_call>{"name":..., "arguments":...}</tool_call>`, but both public 4.6 variants emit the Qwen3.5-style XML format with nested `<function=…>` and `<parameter=…>` tags. With `qwen` the outer `<tool_call>` markers match but the inner JSON parse fails, so `tool_calls` returns empty and the raw markup is left in `content`. </Warning>
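
For intuition about the XML format, the sketch below extracts the function name and parameters from a raw tool-call string with regular expressions. It is an illustration only, not the qwen3_coder parser: the helper name is hypothetical, every parameter value is kept as a string, and edge cases such as streaming or malformed markup are ignored.

```python
import json
import re

# Patterns for the nested <function=...> and <parameter=...> tags described above
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {'name', 'arguments'} dicts; arguments is a JSON string."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # No type coercion in this sketch: all values stay strings
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": json.dumps(args)})
    return calls


raw = (
    "<tool_call><function=get_weather>"
    "<parameter=location>San Francisco</parameter>"
    "<parameter=unit>celsius</parameter>"
    "</function></tool_call>"
)
print(parse_tool_calls(raw))
```

The body between the <tool_call> tags is not JSON, which is why a parser expecting the older JSON payload finds the outer markers but cannot decode what is inside them.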
python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[{"role": "user", "content": "What is the weather in San Francisco? Use the tool."}],
    tools=tools,
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

choice = response.choices[0]
print("finish_reason:", choice.finish_reason)
for tc in choice.message.tool_calls or []:
    print(f"  {tc.function.name}({tc.function.arguments})")

Output Example:

text
finish_reason: tool_calls
  get_weather({"location": "San Francisco", "unit": "celsius"})

To get the final natural-language answer, feed the tool's result back as a tool role message and call the API again with the same tools list — the model emits finish_reason="stop" with the answer in content.
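
A minimal sketch of that second round trip is shown below. The local get_weather function and its canned return value are hypothetical stand-ins for a real tool, and the message layout follows the OpenAI chat format; the actual API call is left as a comment because it needs the running server from §4.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical local tool; returns canned data for illustration only."""
    return {"location": location, "temperature": 18, "unit": unit}


def build_followup_messages(messages, tool_call_id, name, arguments_json):
    """Append the assistant tool call and the tool result to the conversation."""
    result = get_weather(**json.loads(arguments_json))
    assistant_turn = {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": tool_call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments_json},
            }
        ],
    }
    tool_turn = {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }
    return messages + [assistant_turn, tool_turn]


history = [{"role": "user", "content": "What is the weather in San Francisco? Use the tool."}]
followup = build_followup_messages(
    history, "call_0", "get_weather", '{"location": "San Francisco", "unit": "celsius"}'
)
print(json.dumps(followup, indent=2))

# With the server running, the second request would look like:
# final = client.chat.completions.create(
#     model="openbmb/MiniCPM-V-4.6", messages=followup, tools=tools, max_tokens=200
# )
```

In practice the id, name and arguments come from the tc objects returned by the first request (tc.id, tc.function.name, tc.function.arguments) rather than from literals.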

5. Benchmark

Common Test Environment (all benchmarks below):

  • Hardware: 1× NVIDIA H200 (141 GB), single GPU (no TP / DP)
  • Docker Image: lmsysorg/sglang:dev (transformers 5.6.0, sgl-kernel 0.4.2.post1)
  • Precision: BF16

Common Server Launch Command:

bash
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path openbmb/MiniCPM-V-4.6 \
  --trust-remote-code \
  --dtype bfloat16 \
  --mem-fraction-static 0.5 \
  --mamba-scheduler-strategy extra_buffer \
  --chunked-prefill-size -1 \
  --host 0.0.0.0 --port 30000

(--chunked-prefill-size -1 is required for the vision throughput run; see §3.2.)

5.1 Accuracy Benchmark

5.1.1 MMMU Benchmark

  • Benchmark Command
bash
python3 benchmark/mmmu/bench_sglang.py --port 30000 --concurrency 48 --max-new-tokens 2048
  • Test Result
{'Accounting': {'acc': 0.767, 'num': 30},
 'Agriculture': {'acc': 0.533, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.4, 'num': 30},
 'Art': {'acc': 0.6, 'num': 30},
 'Art_Theory': {'acc': 0.667, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.533, 'num': 30},
 'Biology': {'acc': 0.333, 'num': 30},
 'Chemistry': {'acc': 0.333, 'num': 30},
 'Clinical_Medicine': {'acc': 0.467, 'num': 30},
 'Computer_Science': {'acc': 0.333, 'num': 30},
 'Design': {'acc': 0.533, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.333, 'num': 30},
 'Economics': {'acc': 0.633, 'num': 30},
 'Electronics': {'acc': 0.5, 'num': 30},
 'Energy_and_Power': {'acc': 0.633, 'num': 30},
 'Finance': {'acc': 0.533, 'num': 30},
 'Geography': {'acc': 0.367, 'num': 30},
 'History': {'acc': 0.533, 'num': 30},
 'Literature': {'acc': 0.7, 'num': 30},
 'Manage': {'acc': 0.367, 'num': 30},
 'Marketing': {'acc': 0.733, 'num': 30},
 'Materials': {'acc': 0.367, 'num': 30},
 'Math': {'acc': 0.567, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.333, 'num': 30},
 'Music': {'acc': 0.267, 'num': 30},
 'Overall': {'acc': 0.527, 'num': 900},
 'Overall-Art and Design': {'acc': 0.517, 'num': 120},
 'Overall-Business': {'acc': 0.607, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.553, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.617, 'num': 120},
 'Overall-Science': {'acc': 0.473, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.443, 'num': 210},
 'Pharmacy': {'acc': 0.667, 'num': 30},
 'Physics': {'acc': 0.767, 'num': 30},
 'Psychology': {'acc': 0.567, 'num': 30},
 'Public_Health': {'acc': 0.767, 'num': 30},
 'Sociology': {'acc': 0.667, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.527
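
As a sanity check on the table above, the overall accuracy can be recomputed as the sample-weighted mean of the six group scores. This sketch uses only the rounded numbers printed in the result, so it can differ from the benchmark script's internal value in the last digit.

```python
# (accuracy, number of questions) per group, copied from the result above
groups = {
    "Art and Design": (0.517, 120),
    "Business": (0.607, 150),
    "Health and Medicine": (0.553, 150),
    "Humanities and Social Science": (0.617, 120),
    "Science": (0.473, 150),
    "Tech and Engineering": (0.443, 210),
}

total = sum(n for _, n in groups.values())
overall = sum(acc * n for acc, n in groups.values()) / total

print(total, round(overall, 3))
```

The weighted mean covers all 900 questions and rounds to the reported 0.527.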

5.2 Speed Benchmark

We use SGLang's built-in bench_serving tool with random text prompts (1000 input / 1000 output tokens) to characterize text-only serving performance.

5.2.1 Latency Benchmark

bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --model openbmb/MiniCPM-V-4.6 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
text
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  7.47
Total input tokens:                      6101
Total input text tokens:                 6101
Total generated tokens:                  4220
Total generated tokens (retokenized):    3554
Request throughput (req/s):              1.34
Input token throughput (tok/s):          816.44
Output token throughput (tok/s):         564.73
Peak output token throughput (tok/s):    690.00
Peak concurrent requests:                4
Total token throughput (tok/s):          1381.17
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   746.20
Median E2E Latency (ms):                 590.05
P90 E2E Latency (ms):                    1446.13
P99 E2E Latency (ms):                    1709.38
---------------Time to First Token----------------
Mean TTFT (ms):                          138.12
Median TTFT (ms):                        103.70
P99 TTFT (ms):                           330.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.44
Median TPOT (ms):                        1.44
P99 TPOT (ms):                           1.45
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1.44
Median ITL (ms):                         1.45
P95 ITL (ms):                            1.49
P99 ITL (ms):                            1.57
Max ITL (ms):                            5.79
==================================================

5.2.2 Throughput Benchmark

bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --model openbmb/MiniCPM-V-4.6 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 100 \
  --request-rate inf
text
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  47.07
Total input tokens:                      502493
Total input text tokens:                 502493
Total generated tokens:                  500251
Total generated tokens (retokenized):    469844
Request throughput (req/s):              21.24
Input token throughput (tok/s):          10675.32
Output token throughput (tok/s):         10627.69
Peak output token throughput (tok/s):    25911.00
Peak concurrent requests:                130
Total token throughput (tok/s):          21303.01
Concurrency:                             97.24
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4576.94
Median E2E Latency (ms):                 4331.97
P90 E2E Latency (ms):                    8634.07
P99 E2E Latency (ms):                    9636.44
---------------Time to First Token----------------
Mean TTFT (ms):                          206.50
Median TTFT (ms):                        184.72
P99 TTFT (ms):                           624.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.73
Median TPOT (ms):                        9.16
P99 TPOT (ms):                           13.63
---------------Inter-Token Latency----------------
Mean ITL (ms):                           8.75
Median ITL (ms):                         0.05
P95 ITL (ms):                            29.95
P99 ITL (ms):                            108.91
Max ITL (ms):                            448.40
==================================================
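
The headline throughput figures in this report are, to a close approximation, totals divided by the benchmark duration. The sketch below recomputes them from the printed values; small differences from the report are expected, presumably because the duration shown is rounded to two decimals.

```python
# Values copied from the throughput result above
duration_s = 47.07
requests = 1000
input_tokens = 502493
output_tokens = 500251

req_per_s = requests / duration_s
in_tok_per_s = input_tokens / duration_s
out_tok_per_s = output_tokens / duration_s

print(f"request throughput: {req_per_s:.2f} req/s")
print(f"input throughput:   {in_tok_per_s:.0f} tok/s")
print(f"output throughput:  {out_tok_per_s:.0f} tok/s")
```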

5.3 Vision Speed Benchmark

We use SGLang's built-in bench_serving tool with random images. Each request has 128 input text tokens, one 720p image, and 1024 output tokens.

5.3.1 Latency Benchmark

bash
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --host 127.0.0.1 \
  --port 30000 \
  --model openbmb/MiniCPM-V-4.6 \
  --dataset-name image \
  --image-count 1 \
  --image-resolution 720p \
  --random-input-len 128 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf
text
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  10.26
Total input tokens:                      767
Total input text tokens:                 750
Total input vision tokens:               17
Total generated tokens:                  4220
Total generated tokens (retokenized):    4220
Request throughput (req/s):              0.97
Input token throughput (tok/s):          74.77
Output token throughput (tok/s):         411.39
Peak output token throughput (tok/s):    654.00
Peak concurrent requests:                2
Total token throughput (tok/s):          486.16
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1024.04
Median E2E Latency (ms):                 897.99
P90 E2E Latency (ms):                    1584.25
P99 E2E Latency (ms):                    1781.78
---------------Time to First Token----------------
Mean TTFT (ms):                          416.94
Median TTFT (ms):                        403.18
P99 TTFT (ms):                           477.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.44
Median TPOT (ms):                        1.44
P99 TPOT (ms):                           1.45
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1.44
Median ITL (ms):                         1.44
P95 ITL (ms):                            1.48
P99 ITL (ms):                            1.56
Max ITL (ms):                            2.89
==================================================

5.3.2 Throughput Benchmark

bash
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --host 127.0.0.1 \
  --port 30000 \
  --model openbmb/MiniCPM-V-4.6 \
  --dataset-name image \
  --image-count 1 \
  --image-resolution 720p \
  --random-input-len 128 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100 \
  --request-rate inf
text
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  360.01
Total input tokens:                      79925
Total input text tokens:                 78283
Total input vision tokens:               1642
Total generated tokens:                  510855
Total generated tokens (retokenized):    430289
Request throughput (req/s):              2.78
Input token throughput (tok/s):          222.01
Output token throughput (tok/s):         1419.01
Peak output token throughput (tok/s):    19620.00
Peak concurrent requests:                105
Total token throughput (tok/s):          1641.02
Concurrency:                             99.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   35888.57
Median E2E Latency (ms):                 35321.48
P90 E2E Latency (ms):                    41017.37
P99 E2E Latency (ms):                    60343.22
---------------Time to First Token----------------
Mean TTFT (ms):                          35096.32
Median TTFT (ms):                        34301.37
P99 TTFT (ms):                           59966.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.63
Median TPOT (ms):                        1.45
P99 TPOT (ms):                           10.15
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1.58
Median ITL (ms):                         0.12
P95 ITL (ms):                            0.23
P99 ITL (ms):                            0.77
Max ITL (ms):                            2086.12
==================================================