MiMo-V2.5-Pro and MiMo-V2.5 are next-generation Mixture-of-Experts models from the XiaomiMiMo Team.
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "15%"}} /> <col style={{width: "15%"}} /> <col style={{width: "45%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.05)"}}>Total params</th> <th style={{textAlign: "right", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.02)"}}>Active (MoE)</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.05)"}}>Modalities</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro">MiMo-V2.5-Pro</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>1.02T</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>42B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Text (multimodal planned)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong><a href="https://huggingface.co/XiaomiMiMo/MiMo-V2.5">MiMo-V2.5</a></strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.05)"}}><strong>310B</strong></td> <td style={{padding: "9px 12px", textAlign: "right", backgroundColor: "rgba(255,255,255,0.02)"}}>15B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Text, Image, Video, Audio</td> </tr> </tbody> </table>Key Features:
License: Apache 2.0
Refer to the official SGLang installation guide.
Docker Images by Variant × Hardware:
| Variant | Hardware | Docker Image |
|---|---|---|
| MiMo-V2.5 (310B) | H100 / H200 (Hopper, CUDA 12.9) | lmsysorg/sglang:nightly-dev-20260511-044bb88a |
| MiMo-V2.5 (310B) | B200 / GB300 (Blackwell, CUDA 13.0) | lmsysorg/sglang:nightly-dev-cu13-20260511-044bb88a |
| MiMo-V2.5-Pro (1.02T) | H100 / H200 (Hopper, CUDA 12.9) | lmsysorg/sglang:nightly-dev-20260511-044bb88a |
| MiMo-V2.5-Pro (1.02T) | B200 / GB300 (Blackwell, CUDA 13.0) | lmsysorg/sglang:nightly-dev-cu13-20260511-044bb88a |
Pull the image matching your GPU's CUDA driver.
lmsysorg/sglang:latest will not load either checkpoint.
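The table above reduces to a lookup on GPU family. A minimal sketch, assuming a hypothetical helper `pick_image` and a GPU name string that contains one of the model names from the table; the image tags are copied from the table, everything else is illustrative.

```python
# Image tags copied from the table above; the helper itself is hypothetical.
HOPPER_IMAGE = "lmsysorg/sglang:nightly-dev-20260511-044bb88a"
BLACKWELL_IMAGE = "lmsysorg/sglang:nightly-dev-cu13-20260511-044bb88a"

IMAGE_BY_GPU = {
    "H100": HOPPER_IMAGE,
    "H200": HOPPER_IMAGE,
    "B200": BLACKWELL_IMAGE,
    "GB300": BLACKWELL_IMAGE,
}


def pick_image(gpu_name: str) -> str:
    """Return the Docker image for a GPU name such as 'NVIDIA H200'."""
    upper = gpu_name.upper()
    # Try longer keys first so the most specific model name wins.
    for key in sorted(IMAGE_BY_GPU, key=len, reverse=True):
        if key in upper:
            return IMAGE_BY_GPU[key]
    raise ValueError(f"No verified image for GPU: {gpu_name!r}")


print(pick_image("NVIDIA H200"))
```

Both variants share the same image per GPU family, so the variant does not appear in the lookup.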
TPU (sgl-jax): MiMo-V2.5-Pro can also be served on TPU via the JAX-based sgl-jax runtime. The container image and pip install steps are listed in §3.3 TPU Deployment.
Use the selector below to generate the deployment command for your variant and hardware.
import { MiMoV25Deployment } from '/src/snippets/autoregressive/mimo-v25-deployment.jsx'
<MiMoV25Deployment />

MiMo-V2.5-Pro (1.02T):
- --attention-backend fa4 + --moe-runner-backend flashinfer_trtllm + --mem-fraction-static 0.8. Set --swa-full-tokens-ratio 0.1 to keep KV-cache footprint within 192 GB HBM.
- NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1. Default SWA ratio is fine.
- fa3 + DeepEP + EAGLE multi-layer); fits with --mem-fraction-static 0.7 and --swa-full-tokens-ratio 0.3. DeepEP dispatch tuning: SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 avoids memory spikes during prefill.
- SGLANG_ENABLE_SPEC_V2=1 and --enable-multi-layer-eagle (both Hopper and Blackwell). See §5.4 for acceptance-rate behavior on natural text vs random prompts.

MiMo-V2.5 (310B):

- qkv_proj; attention-TP per DP group must be 4. Use --dp = TP / 4; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare --tp 8 without --dp 2 will fail to load with MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8.
- --tp 8 --dp 2), B200 4× GPUs (--tp 4, dp=1, no DP-attn flag needed), GB300 4× GPUs (--tp 4, single NVL4 node). FP8 quantization.
- --enable-dp-lm-head and --mm-enable-dp-encoder are required whenever --enable-dp-attention is on, to keep LM head and encoder sharding consistent.
- SGLANG_ENABLE_SPEC_V2=1, --speculative-algorithm EAGLE, and --enable-multi-layer-eagle (both Hopper and Blackwell).

DeepEP (optional toggle, Hopper-only):

- --moe-a2a-backend deepep + --moe-dense-tp-size 1 (and --ep <tp> for Pro) plus SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 env to cap the dispatch buffer. Requires pip install deep_ep (not part of the default sglang install).
- flashinfer_trtllm; the DeepEP toggle is a no-op there.

MiMo-V2.5-Pro can also be served on TPU via sgl-jax. The runtime is a separate JAX-based stack (sgl_jax.launch_server); pick TPU v7x or TPU v6e in the panel above to generate the launch command. Verified topologies:
| TPU Type | Topology | Chips/Node | Nodes | Total Chips | JAX Devices/Chip | Total JAX Devices (= --tp-size) |
|---|---|---|---|---|---|---|
| v7x | 2×2×4 | 4 | 4 | 16 | 2 | 32 |
| v6e | 4×4×4 | 4 | 16 | 64 | 1 | 64 |
v7x exposes 2 logical JAX devices per chip, so --tp-size = 16 chips × 2 = 32. v6e exposes 1 device per chip, so --tp-size = 64. Always set --tp-size to the total JAX device count across all nodes, not the chip count.
All nodes must sit in the same TPU slice and reach each other on the JAX init port (20000) and the TPU process port (8471).
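The two sizing rules in this section (attention-TP of 4 per DP group for MiMo-V2.5 on GPU, and --tp-size equal to the total JAX device count on TPU) are simple arithmetic. A minimal sketch; the function names are hypothetical and only encode the rules as stated here, not any check that SGLang or sgl-jax performs itself.

```python
def gpu_dp_size(tp_size: int) -> int:
    """--dp value for MiMo-V2.5 (310B): attention-TP per DP group must be 4."""
    if tp_size <= 0 or tp_size % 4 != 0:
        raise ValueError("total GPUs (--tp) must be a positive multiple of 4")
    return tp_size // 4


def tpu_tp_size(total_chips: int, jax_devices_per_chip: int) -> int:
    """--tp-size for sgl-jax: total JAX devices across all nodes."""
    return total_chips * jax_devices_per_chip


print(gpu_dp_size(8))      # 2, matching --tp 8 --dp 2
print(tpu_tp_size(16, 2))  # 32 for the v7x topology in the table
print(tpu_tp_size(64, 1))  # 64 for the v6e topology in the table
```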
Step 1 — Launch the JAX TPU container on every node:
docker run -it --privileged \
--shm-size=32g \
--ipc=host \
--network=host \
-v /dev:/dev \
us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.8.1-rev1 bash
The image is pinned to jax0.8.1-rev1 to keep the JAX runtime aligned with sgl-jax's TPU extras.
Step 2 — Clone and install sgl-jax (inside the container):
git clone https://github.com/sgl-project/sglang-jax.git
cd sglang-jax
pip install -e "python[tpu]"
See Basic API Usage.
Both variants support hybrid thinking mode. Thinking content is separated via the reasoning parser.
Thinking Mode (default):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[
{"role": "user", "content": "Which is larger, 9.11 or 9.9? Think carefully."}
]
)
print("====== Reasoning ======")
print(response.choices[0].message.reasoning_content)
print("====== Answer ======")
print(response.choices[0].message.content)
Output Example (MiMo-V2.5):
====== Reasoning ======
Comparing 9.11 and 9.9.
The integer parts are both 9. Now compare the decimal parts: 0.11 vs 0.9.
0.9 = 0.90, which is greater than 0.11.
So 9.9 > 9.11.
====== Answer ======
**9.9 is larger than 9.11.**
Here's the reasoning: When comparing decimals, line them up to the same number of decimal places:
- 9.11
- 9.90
Both have a **9** in the ones place, but in the tenths place, **9 > 1**, so 9.90 > 0.11.
**9.9 > 9.11**
Thinking Off (instant mode):
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[
{"role": "user", "content": "Which is larger, 9.11 or 9.9? Think carefully."}
],
extra_body={"chat_template_kwargs": {"thinking": False}}
)
print(response.choices[0].message.content)
Output Example (MiMo-V2.5):
## Comparing 9.11 and 9.9
**9.9 is larger.**
The key is to compare them place by place. It helps to write them with the same number of decimal places:
- **9.11** → 9.11
- **9.9** → 9.90
Both have **9** in the ones place, but in the tenths place: **9** (in 9.90) is greater than **1** (in 9.11).
So **9.90 > 9.11**.
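Client code that serves both modes may find the reasoning field empty or absent when thinking is off, so it can help to read it defensively. A minimal sketch; `split_reply` is a hypothetical helper, and the stand-in objects below only mimic the two attributes used in the examples above.

```python
from types import SimpleNamespace


def split_reply(message) -> tuple[str, str]:
    """Return (reasoning, answer), treating a missing or None field as ''."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning, answer


# Stand-ins for response.choices[0].message in each mode (illustrative only).
thinking_on = SimpleNamespace(reasoning_content="So 9.9 > 9.11.", content="9.9 is larger.")
thinking_off = SimpleNamespace(content="9.9 is larger.")

print(split_reply(thinking_on))
print(split_reply(thinking_off))
```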
Image Understanding:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/sgl-project/sgl-test-files/refs/heads/main/images/man_ironing_on_back_of_suv.png"}},
{"type": "text", "text": "Describe this image in detail."}
]
}]
)
print(response.choices[0].message.content)
Output Example:
Based on the image provided, here is a detailed description:
The image captures a whimsical or surreal scene set on a busy city street, likely in New York City given the iconic yellow cabs. In the center foreground, a man is sitting on a folding chair, casually crossing his legs. He is wearing a bright yellow hoodie with a graphic on the front and blue jeans. He is intently focused on ironing a white dress shirt that rests on an ironing board set up directly on the asphalt.
Behind him, a yellow SUV taxi cab is stopped or moving slowly, angled slightly away from the camera. To his left, another yellow taxi sedan is captured in motion blur, indicating it is driving past him. The background features tall city buildings with glass windows and storefronts. There are banners hanging from streetlights, and some greenery is visible in the distance. The overall impression is one of incongruity—performing a domestic chore like ironing in the middle of a chaotic urban environment.
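For a local file, the same request shape can carry the image inline. A minimal sketch, assuming the server accepts base64 `data:` URLs in the `image_url` field, as OpenAI-compatible endpoints typically do (this page does not confirm it); `to_data_url` and `image_message` are hypothetical helper names.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data: URL for the image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(image_bytes: bytes, prompt: str) -> dict:
    """Build a user message with one inline image followed by a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
            {"type": "text", "text": prompt},
        ],
    }


# The 8-byte PNG signature stands in for a real file read with open(path, "rb").
msg = image_message(b"\x89PNG\r\n\x1a\n", "Describe this image in detail.")
print(msg["content"][0]["image_url"]["url"])
```

The resulting dict can be passed as the single entry of `messages` in the call above.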
Video Understanding:
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": "https://videos.pexels.com/video-files/4114797/4114797-uhd_3840_2160_25fps.mp4"}},
{"type": "text", "text": "Summarize what happens in this video."}
]
}]
)
print(response.choices[0].message.content)
Output Example:
A person wearing blue protective gloves is shown operating a microscope in a close-up shot. The individual is adjusting a knob on the side of the microscope, which moves the stage holding a glass slide, likely focusing the lens on the specimen.
Video decoding requires decord (pip install decord); SGLang's MiMo-V2.5 multimodal processor uses decord.VideoReader for frame extraction.
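A quick preflight in the server's Python environment can catch a missing decord install before the first video request. A minimal sketch using only the standard library; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if not has_module("decord"):
    print("decord is missing: run `pip install decord` before sending video_url requests")
```

Run it inside the container that launches the server, since that is where the frame extraction happens.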
Audio Understanding:
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[{
"role": "user",
"content": [
{"type": "audio_url", "audio_url": {"url": "https://raw.githubusercontent.com/sgl-project/sgl-test-files/refs/heads/main/audios/Trump_WEF_2018_10s.mp3"}},
{"type": "text", "text": "Transcribe and summarize this audio."}
]
}]
)
print(response.choices[0].message.content)
Output Example:
**Transcript:**
"Thank you Klaus very much. It's a privilege to be here at this forum where leaders in business, science, art, diplomacy and world affairs have gathered for..."
**Summary:**
The speaker thanks Klaus for the introduction and expresses their honor at attending a forum. They highlight that the event has brought together high-level leaders from various sectors, including business, science, art, and diplomacy.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
]
response = client.chat.completions.create(
model="XiaomiMiMo/MiMo-V2.5",
messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
tools=tools
)
msg = response.choices[0].message
if msg.reasoning_content:
print("=== Reasoning ===")
print(msg.reasoning_content)
if msg.tool_calls:
print("=== Tool Calls ===")
for tc in msg.tool_calls:
print(f" Function: {tc.function.name}")
print(f" Arguments: {tc.function.arguments}")
Output Example (MiMo-V2.5):
=== Reasoning ===
The user wants to know the weather in Beijing. I have a function available called "get_weather" that can retrieve current weather for a location. Let me call that function with Beijing as the location.
=== Tool Calls ===
Function: get_weather
Arguments: {"location": "Beijing"}
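The example stops at printing the tool call. Completing the round trip means executing the function locally and sending the result back as a `tool` message. A minimal sketch following the standard OpenAI chat format; the stub `get_weather`, the registry, and the call id are all illustrative assumptions.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub implementation; a real tool would query a weather service."""
    return {"location": location, "unit": unit, "temperature": None}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and return the `tool` message to send back."""
    func = TOOL_REGISTRY[name]              # KeyError if the model names an unknown tool
    result = func(**json.loads(arguments))  # arguments arrive as a JSON string
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Name and arguments mirror the output example above; the call id is made up.
tool_msg = run_tool_call("call_0", "get_weather", '{"location": "Beijing"}')
print(tool_msg)
```

In a live session, `call_id`, `name`, and `arguments` would come from `tc.id`, `tc.function.name`, and `tc.function.arguments`; append the assistant message and this tool message to `messages`, then call `client.chat.completions.create` again to get the final answer.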
Accuracy numbers come from sglang.test.run_eval (GSM8K standard 5-shot, MMMU validation split). Speed numbers come from sglang.bench_serving with generated random prompts; text runs use 1024 input tokens and 1024 output tokens per request, and the image run uses 2 random 720p images per request.
Standard 5-shot, temperature=0, max_tokens=4096, model defaults to thinking-on (responses contain <think>...</think> and the eval extracts the trailing number via regex). Server launch: see Section 3.
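To make the scoring step concrete, the sketch below removes the `<think>...</think>` span and takes the last number in the remaining text. It is an illustration of the idea only: the patterns and the function name are assumptions, not the regex that `sglang.test.run_eval` actually uses.

```python
import re
from typing import Optional

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response_text: str) -> Optional[str]:
    """Drop <think>...</think> spans, then return the last number in the rest."""
    visible = THINK_BLOCK.sub("", response_text)
    matches = NUMBER.findall(visible)
    if not matches:
        return None
    # Remove thousands separators so "1,234" compares equal to "1234".
    return matches[-1].replace(",", "")


sample = "<think>48 / 2 = 24, then 48 + 24 = 72.</think>The answer is 72."
print(extract_final_number(sample))
```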
Benchmark Command:
python3 -m sglang.test.run_eval \
--base-url http://127.0.0.1:30000 \
--model XiaomiMiMo/MiMo-V2.5 \
--eval-name gsm8k \
--num-examples 200 \
--num-threads 8 \
--max-tokens 4096 \
--temperature 0.0
run_eval.py automatically appends /v1 to --base-url; pass the bare host:port URL (without a trailing /v1), otherwise requests resolve to /v1/v1/chat/completions and return 404.
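Scripts that share one URL between the OpenAI client (which wants the `/v1` suffix) and run_eval (which does not) can normalize it first. A minimal sketch; `bare_base_url` is a hypothetical helper, not part of SGLang.

```python
def bare_base_url(url: str) -> str:
    """Strip a trailing '/v1' (and slashes) so run_eval can append its own."""
    trimmed = url.rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed.rstrip("/")


print(bare_base_url("http://127.0.0.1:30000/v1"))
```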
Score: 0.965 (193 / 200)
Latency: 253.90 s
Output throughput: 461.78 tok/s
Score: 0.980 (196 / 200)
Latency: 477.52 s
Output throughput: 88.9 tok/s
MMMU/MMMU validation split (multi-discipline multimodal), concurrency=16, default sampling.
python3 benchmark/mmmu/bench_sglang.py \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5 \
--concurrency 16
Pending update
Test Environment:
- XiaomiMiMo/MiMo-V2.5-Pro (FP8)
- --moe-runner-backend flashinfer_trtllm, --attention-backend fa4, --mem-fraction-static 0.8, --swa-full-tokens-ratio 0.1)

The numbers in §5.2 are the no-EAGLE baseline on random 1024/1024. On uniform-random token streams the MiMo-V2.5-Pro 3-layer MTP draft has a very low accept rate (~0.13–0.27 vs ~0.75 on natural-text prompts, see §5.4) because there is no token co-occurrence signal for the draft to model, so EAGLE adds verification overhead without recovering enough draft tokens to be a net win on this workload. EAGLE MTP itself works on B200 with --enable-multi-layer-eagle (see the §3 deployment command, and §5.4 for an acceptance profile on natural text).
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5-Pro \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 27.59
Total input tokens: 1997
Total input text tokens: 1997
Total generated tokens: 2798
Total generated tokens (retokenized): 2794
Request throughput (req/s): 0.36
Input token throughput (tok/s): 72.38
Output token throughput (tok/s): 101.41
Peak output token throughput (tok/s): 110.00
Peak concurrent requests: 3
Total token throughput (tok/s): 173.79
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 2757.26
Median E2E Latency (ms): 3319.10
P90 E2E Latency (ms): 4157.47
P99 E2E Latency (ms): 4869.32
---------------Time to First Token----------------
Mean TTFT (ms): 162.17
Median TTFT (ms): 68.11
P99 TTFT (ms): 929.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.19
Median TPOT (ms): 9.33
P99 TPOT (ms): 9.39
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.31
Median ITL (ms): 9.35
P95 ITL (ms): 9.44
P99 ITL (ms): 9.77
Max ITL (ms): 19.80
==================================================
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5-Pro \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1000 \
--max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 112.78
Total input tokens: 302118
Total input text tokens: 302118
Total generated tokens: 195775
Total generated tokens (retokenized): 191069
Request throughput (req/s): 8.87
Input token throughput (tok/s): 2678.83
Output token throughput (tok/s): 1735.90
Peak output token throughput (tok/s): 3040.00
Peak concurrent requests: 121
Total token throughput (tok/s): 4414.73
Concurrency: 87.80
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 9901.96
Median E2E Latency (ms): 6525.54
P90 E2E Latency (ms): 23567.98
P99 E2E Latency (ms): 42109.22
---------------Time to First Token----------------
Mean TTFT (ms): 223.69
Median TTFT (ms): 139.45
P99 TTFT (ms): 1082.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 50.63
Median TPOT (ms): 51.66
P99 TPOT (ms): 91.41
---------------Inter-Token Latency----------------
Mean ITL (ms): 49.79
Median ITL (ms): 33.69
P95 ITL (ms): 103.37
P99 ITL (ms): 151.34
Max ITL (ms): 1600.00
==================================================
Test Environment:
- XiaomiMiMo/MiMo-V2.5 (FP8)
- --dp 2)
- 0.0.0.dev1+g7d99af439 (lmsysorg/sglang:dev-mimo-v2.5)

python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5 \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 14.72
Total input tokens: 1997
Total input text tokens: 1997
Total generated tokens: 2798
Total generated tokens (retokenized): 2697
Request throughput (req/s): 0.68
Input token throughput (tok/s): 135.67
Output token throughput (tok/s): 190.09
Peak output token throughput (tok/s): 245.00
Peak concurrent requests: 3
Total token throughput (tok/s): 325.77
Concurrency: 1.00
Accept length: 3.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1469.98
Median E2E Latency (ms): 1652.84
P90 E2E Latency (ms): 2210.80
P99 E2E Latency (ms): 2823.86
---------------Time to First Token----------------
Mean TTFT (ms): 143.89
Median TTFT (ms): 99.25
P99 TTFT (ms): 481.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.87
Median TPOT (ms): 4.30
P99 TPOT (ms): 6.64
---------------Inter-Token Latency----------------
Mean ITL (ms): 4.76
Median ITL (ms): 3.46
P95 ITL (ms): 13.52
P99 ITL (ms): 13.84
Max ITL (ms): 74.37
==================================================
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5 \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1000 \
--max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 93.41
Total input tokens: 302118
Total input text tokens: 302118
Total generated tokens: 195775
Total generated tokens (retokenized): 188139
Request throughput (req/s): 10.71
Input token throughput (tok/s): 3234.48
Output token throughput (tok/s): 2095.97
Peak output token throughput (tok/s): 3019.00
Peak concurrent requests: 121
Total token throughput (tok/s): 5330.45
Concurrency: 91.04
Accept length: 2.95
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8503.45
Median E2E Latency (ms): 7491.96
P90 E2E Latency (ms): 13706.99
P99 E2E Latency (ms): 20474.33
---------------Time to First Token----------------
Mean TTFT (ms): 4399.20
Median TTFT (ms): 4333.35
P99 TTFT (ms): 8004.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 58.23
Median TPOT (ms): 21.78
P99 TPOT (ms): 747.79
---------------Inter-Token Latency----------------
Mean ITL (ms): 20.06
Median ITL (ms): 15.28
P95 ITL (ms): 48.36
P99 ITL (ms): 96.99
Max ITL (ms): 969.61
==================================================
python3 -m sglang.bench_serving \
--backend sglang-oai-chat \
--host 127.0.0.1 \
--port 30000 \
--model XiaomiMiMo/MiMo-V2.5 \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 \
--random-output-len 1024 \
--num-prompts 10 \
--max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 25.73
Total input tokens: 661
Total input text tokens: 631
Total input vision tokens: 30
Total generated tokens: 4220
Total generated tokens (retokenized): 0
Request throughput (req/s): 0.39
Input token throughput (tok/s): 25.69
Output token throughput (tok/s): 164.03
Peak output token throughput (tok/s): 1.00
Peak concurrent requests: 2
Total token throughput (tok/s): 189.73
Concurrency: 1.00
Accept length: 2.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 2570.74
Median E2E Latency (ms): 2411.92
P90 E2E Latency (ms): 3711.62
P99 E2E Latency (ms): 4949.74
---------------Time to First Token----------------
Mean TTFT (ms): 0.00
Median TTFT (ms): 0.00
P99 TTFT (ms): 0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.31
Median TPOT (ms): 6.17
P99 TPOT (ms): 17.18
---------------Inter-Token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P95 ITL (ms): 0.00
P99 ITL (ms): 0.00
Max ITL (ms): 0.00
==================================================
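When comparing several runs, it can be easier to load the result blocks into dictionaries than to read them side by side. A minimal sketch, assuming the `Name: value` layout shown in the outputs above; the parser is hypothetical and not something bench_serving provides.

```python
import re

# One metric per line: a label, a colon, then a number or "inf".
METRIC_LINE = re.compile(r"^([^:]+):\s+(-?\d+(?:\.\d+)?|inf)$")


def parse_bench_result(text: str) -> dict[str, float]:
    """Collect numeric metrics; banner lines and text values are skipped."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Lines copied from the MiMo-V2.5-Pro concurrency-1 run above.
sample = """\
============ Serving Benchmark Result ============
Backend: sglang
Benchmark duration (s): 27.59
Total generated tokens: 2798
Output token throughput (tok/s): 101.41
"""
m = parse_bench_result(sample)
print(m["Total generated tokens"] / m["Benchmark duration (s)"])
```

The printed ratio reproduces the reported output token throughput, which is total generated tokens divided by benchmark duration.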
Pro's 3-layer MTP behaves very differently on natural text vs uniform-random token streams. The §5.2 benchmarks use random 1024/1024, which collapses accept-rate; this section measures the same server on GSM8K so the acceptance number is comparable to real workloads.
Test Environment:
- XiaomiMiMo/MiMo-V2.5-Pro (FP8)
- --moe-runner-backend flashinfer_trtllm, --attention-backend fa4, --mem-fraction-static 0.8, --swa-full-tokens-ratio 0.1)
- --enable-multi-layer-eagle --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (top-1, max accept length 4)

Benchmark Command:
python3 -m sglang.test.run_eval \
--base-url http://127.0.0.1:30000 \
--model XiaomiMiMo/MiMo-V2.5-Pro \
--eval-name gsm8k \
--num-examples 200 \
--num-threads 4
The accept_rate and accept_length rows below are not part of run_eval's own output — they were aggregated from the server-side Decode batch ... accept rate: X accept len: Y log lines emitted during the GSM8K run (307 batches total).
| Workload | accept_rate | accept_length (max = 4) |
|---|---|---|
| GSM8K (natural text) | 0.755 | 3.27 |
| random 1024/1024 (reference) | 0.13–0.27 | ~1.x |
GSM8K Score: 0.97 (194 / 200), output throughput ≈ 635 tok/s end-to-end on this single-server run.
The accept-rate gap is intrinsic to MTP-style speculative decoding: the draft model is trained on natural-language token distributions and has no useful signal on uniform-random byte sequences. Workloads with structure (chat, code, reasoning traces) should expect the GSM8K-class number; the random-prompt baseline in §5.2 is a worst case for draft acceptance.
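The aggregation of server log lines described above can be sketched as follows. Assumptions: each decode-batch line contains `accept rate: X` and `accept len: Y` somewhere in it, and the aggregate is an unweighted mean over batches (the source does not say how batches were weighted). The sample lines are made up for illustration.

```python
import re
from statistics import mean

ACCEPT_RATE = re.compile(r"accept rate:\s*(\d+(?:\.\d+)?)")
ACCEPT_LEN = re.compile(r"accept len:\s*(\d+(?:\.\d+)?)")


def aggregate_accept_stats(log_lines):
    """Average accept rate and accept length over all decode-batch log lines."""
    rates, lengths = [], []
    for line in log_lines:
        rate, length = ACCEPT_RATE.search(line), ACCEPT_LEN.search(line)
        if rate and length:  # skip lines without both statistics
            rates.append(float(rate.group(1)))
            lengths.append(float(length.group(1)))
    if not rates:
        raise ValueError("no decode-batch lines with accept statistics found")
    return mean(rates), mean(lengths)


# Hypothetical log excerpt; real lines carry more fields.
lines = [
    "Decode batch ... accept rate: 0.70 accept len: 3.10",
    "Prefill batch ... #new-token: 512",
    "Decode batch ... accept rate: 0.80 accept len: 3.40",
]
print(aggregate_accept_stats(lines))
```

Searching for the two fields separately keeps the parser independent of their order within the line.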
Reference numbers from the day0 enablement PR, collected on a 2-node Hopper deployment with the EP=16, DP=2, TP=16 recipe (--moe-a2a-backend deepep, --attention-backend fa3, --enable-multi-layer-eagle). The setup, parallelism, and benchmark methodology all differ from §5.2 (Blackwell TP=8 with random 1024/1024), so treat these as a separate operating point — long-context prefill scaling and the MTP decode speedup — rather than a comparison against §5.2.
Test Environment:
- XiaomiMiMo/MiMo-V2.5-Pro (FP8)
- --tp 16 --dp 2 --ep 16 --moe-dense-tp-size 1 --enable-dp-attention

Test setting: chunked_prefill_size=32K, random_output_len=1, cache flushed before every run. For input lengths ≥ 512K the workload was split into two requests routed to distinct DP ranks and the per-node throughput was read from bench_serving output.
python3 -m sglang.bench_serving \
--backend sglang \
--model XiaomiMiMo/MiMo-V2.5-Pro \
--host 0.0.0.0 \
--port 30000 \
--dataset-name random \
--random-input-len <INPUT_LEN> \
--random-output-len 1 \
--random-range-ratio 1.0 \
--flush-cache \
--seed 12345 \
--num-prompts 10000
| Input length | Output length | Single-node prefill throughput |
|---|---|---|
| 4K | 1 | 30.80K tok/s |
| 8K | 1 | 30.65K tok/s |
| 16K | 1 | 29.85K tok/s |
| 32K | 1 | 28.60K tok/s |
| 64K | 1 | 26.65K tok/s |
| 128K | 1 | 23.00K tok/s |
| 256K | 1 | 17.90K tok/s |
| 512K | 1 | 11.30K tok/s |
| 768K | 1 | 9.40K tok/s |
| 1M | 1 | 7.30K tok/s |
Prefill throughput stays within ~10% of peak from 4K up to 32K (30.80K to 28.60K tok/s) and declines steadily beyond 128K, reaching 7.30K tok/s at 1M. Every length up to 1M completed, which shows that the hybrid SWA+GA attention path handles 1M context.
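The "fraction of peak" reading of the table can be reproduced directly. A minimal sketch; the numbers are copied from the table above and the helper name is hypothetical.

```python
# Single-node prefill throughput in tok/s, copied from the table above.
PREFILL_TOK_S = {
    "4K": 30_800, "8K": 30_650, "16K": 29_850, "32K": 28_600, "64K": 26_650,
    "128K": 23_000, "256K": 17_900, "512K": 11_300, "768K": 9_400, "1M": 7_300,
}


def relative_to_peak(throughputs: dict[str, int]) -> dict[str, float]:
    """Express each throughput as a fraction of the best one."""
    peak = max(throughputs.values())
    return {length: value / peak for length, value in throughputs.items()}


for length, fraction in relative_to_peak(PREFILL_TOK_S).items():
    print(f"{length:>5}: {fraction:.1%} of peak")
```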
Test setting: fixed 16K input / 1K output, varying batch size per DP rank, with and without the 3-layer MTP module. MTP accept length is the average number of draft tokens accepted per step under EAGLE speculative decoding. TPS below is per-request output tokens/sec (i.e. single-user perceived speed); the rightmost column is aggregated single-node decode throughput (= TPS × batch size).
| BS per DP rank | MTP | MTP accept length | Per-request TPS | Single-node decode throughput |
|---|---|---|---|---|
| 64 | disabled | - | 29.3 | 1875 tok/s |
| 64 | 3-layer | 3 | 60.5 | 3873 tok/s |
| 64 | 3-layer | 4 | 79.7 | 5103 tok/s |
| 96 | disabled | - | 26.7 | 2564 tok/s |
| 96 | 3-layer | 3 | 50.4 | 4840 tok/s |
| 96 | 3-layer | 4 | 64.8 | 6225 tok/s |
Summary — MTP on / off:
| BS per DP rank | Without MTP | 3-layer MTP, accept=3 | 3-layer MTP, accept=4 |
|---|---|---|---|
| 64 | 1875 tok/s | 3873 tok/s (2.07×) | 5103 tok/s (2.72×) |
| 96 | 2564 tok/s | 4840 tok/s (1.89×) | 6225 tok/s (2.43×) |
The 3-layer MTP module delivers roughly 1.9–2.1× decode throughput at accept length 3 and 2.4–2.7× at accept length 4, consistent with the "2–3× decode speedup" guidance in §3.2.
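The speedup factors in the summary table follow from dividing each MTP throughput by the no-MTP baseline at the same batch size. A minimal sketch with the values copied from the table; the dictionary layout and function name are illustrative.

```python
# Single-node decode throughput in tok/s, keyed by batch size per DP rank.
DECODE_TOK_S = {
    64: {"disabled": 1875, "accept=3": 3873, "accept=4": 5103},
    96: {"disabled": 2564, "accept=3": 4840, "accept=4": 6225},
}


def mtp_speedups(row: dict[str, int]) -> dict[str, float]:
    """Throughput of each MTP setting relative to the MTP-disabled baseline."""
    baseline = row["disabled"]
    return {
        setting: round(value / baseline, 2)
        for setting, value in row.items()
        if setting != "disabled"
    }


for batch_size, row in DECODE_TOK_S.items():
    print(batch_size, mtp_speedups(row))
```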