For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
<Tabs>
<Tab title="Python (pip / uv)">

```bash
pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate
# MiniMax-M3 ships in SGLang PR #27944, which is not yet in a tagged release, so install
# from the PR head. The serving runtime is in the base dependencies, so no extra is needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/27944/head && git checkout FETCH_HEAD
uv pip install -e python
```

Then run the Python output of the command panel below in that environment. The Docker tab is simpler: its image bundles the CUDA-13 runtime and the #27944 code. Once PR #27944 is merged and released, `uv pip install sglang` will pull M3 support directly.

</Tab>
<Tab title="Docker">

```bash
# Pull the M3 image the command panel selects for your platform, e.g.:
docker pull lmsysorg/sglang:dev-cu13-minimax-m3
```

The command panel below fills in the right tag per platform: `dev-cu13-minimax-m3` (CUDA 13: B300, GB200, GB300), `dev-cu12-minimax-m3` (CUDA 12: Hopper H200), or `dev-minimax-m3` (default). On AMD Instinct it uses the matching ROCm image (MI300X/MI325X → `…-rocm700-mi30x`, MI350X/MI355X → `…-rocm720-mi35x`). For how to launch the image, see Install → Method 3: Using Docker, substituting the inner `sglang serve ...` with what the command generator produces.

</Tab>
</Tabs>
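The tag selection described above can be sketched as a small lookup. This is only an illustration of the NVIDIA rules stated in the text: the platform keys and the `m3_image` helper are hypothetical names, platforms not listed explicitly (B200 here) fall back to the default tag, and the ROCm tags are left out because the text abbreviates them. The command panel remains the authoritative source.

```python
# Tags copied from the text above; the platform keys are illustrative labels.
M3_NVIDIA_TAGS = {
    "B300": "dev-cu13-minimax-m3",
    "GB200": "dev-cu13-minimax-m3",
    "GB300": "dev-cu13-minimax-m3",
    "H200": "dev-cu12-minimax-m3",
}
DEFAULT_TAG = "dev-minimax-m3"


def m3_image(platform: str) -> str:
    """Return the image reference for a platform, falling back to the default tag."""
    tag = M3_NVIDIA_TAGS.get(platform.upper(), DEFAULT_TAG)
    return f"lmsysorg/sglang:{tag}"


print(m3_image("gb200"))
print(m3_image("B200"))
```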
Pick your hardware + recipe to generate the launch command.
import { Deployment } from "/src/snippets/_deployment.jsx";
import { config } from "/src/snippets/configs/MiniMaxAI/minimax-m3.jsx";
import { benchmarks } from "/src/snippets/configs/MiniMaxAI/minimax-m3-benchmarks.jsx";

<Deployment config={config} benchmarks={benchmarks} />

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.
import { Playground } from "/src/snippets/_playground.jsx";

<Playground config={config} />

MiniMax-M3 is MiniMax's native-multimodal Mixture-of-Experts reasoning model: ~428B total parameters with ~23B activated per token (128 experts, 4 active per token), 60 layers, and a 1M-token context over text, image, and video. Its defining feature is MiniMax Sparse Attention (MSA), a block-sparse "lightning indexer" attention that keeps long-context cost low (MiniMax reports ~9× prefill / ~15× decode speedup over M2 at 1M context). This page serves the MXFP8 variant (`MiniMaxAI/MiniMax-M3-MXFP8`, ~440 GB) on NVIDIA Blackwell and AMD Instinct; on NVIDIA Hopper (H200), use the full-precision bfloat16 build `MiniMaxAI/MiniMax-M3` (§2.4). Released under the MiniMax Community License.
Key characteristics as served by SGLang:
- MiniMaxM3SparseForConditionalGeneration). Image input via URL and base64 is validated; video input has not been tested here.
- `<mm:think>...</mm:think>`. Always launch with `--reasoning-parser auto`: it auto-detects the right parser from the chat template, and SGLang then strips the tags and returns the trace separately in `message.reasoning_content`.
- `tool_calls`. Always launch with `--tool-call-parser auto`: it auto-detects the right parser from the chat template. Single, parallel, and nested (object / array) arguments are supported.
- [128,128] at load and serves them on the tuned ROCm kernels (§2.3). The vision tower stays unquantized.

Recommended generation: the model's `generation_config.json` sets temperature 1.0 / top_p 0.95, which SGLang applies automatically (the default `--sampling-defaults model`). The model card additionally suggests top_k 40, but that value is not in `generation_config.json`, so SGLang does not apply it by default. `top_k` is a per-request sampling parameter (not a launch flag), so set it per call if you want it, e.g. `extra_body={"top_k": 40}` with the OpenAI client.
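As a minimal sketch of the per-request `top_k` setting mentioned above, the helper below merges `top_k` into the `extra_body` of a set of chat-completion keyword arguments. The `with_top_k` name is hypothetical; only the `extra_body={"top_k": 40}` shape comes from the text.

```python
def with_top_k(request: dict, top_k: int = 40) -> dict:
    """Return a copy of chat-completion kwargs with top_k added via extra_body."""
    merged = dict(request)
    extra = dict(merged.get("extra_body") or {})
    extra["top_k"] = top_k
    merged["extra_body"] = extra
    return merged


request = {
    "model": "MiniMaxAI/MiniMax-M3-MXFP8",
    "messages": [{"role": "user", "content": "Hello"}],
}
kwargs = with_top_k(request)
print(kwargs["extra_body"])
```

With a running server, the result can be passed straight to the OpenAI client as `client.chat.completions.create(**kwargs)`.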
Resources: HuggingFace · MSA kernel
MiniMax MSA (fmha_sm100, MIT-licensed) is the recommended Blackwell kernel for M3's main sparse-attention step — faster and more memory-efficient than the built-in Triton fallback. It ships pre-installed in the M3 dev image (lmsysorg/sglang:dev-minimax-m3, also published under the dev-cu13-minimax-m3 tag), so the Blackwell recipe above engages it automatically with no extra setup — import fmha_sm100 works out of the box and the kernels JIT-compile on first use. It is otherwise purely additive: on a custom image, install it (below) and the recipe engages it automatically; without it the same recipe still serves on the built-in Triton path. The swap is numerically equivalent (cosine ≥ 0.99999 vs Triton), decode stays CUDA-graph-capturable, prefill TTFT drops ~9–12% at 8K–64K context, and the MSA path survives memory configurations where the Triton path OOMs.
Requirements (from the MSA README):
- `nvcc` ≥ 12.x on `PATH` (or `CUDA_HOME` set): the kernels are JIT-compiled at first import.

The M3 Blackwell dev images above already bundle MSA, so you can skip straight to the gate check. The `git clone` / `pip install` steps are only needed on a custom image that doesn't have `fmha_sm100`.
```bash
# Only on a custom image: --recursive pulls the CUTLASS submodule required for JIT compilation
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa && pip install .

# Verify the SGLang gate (True -> MSA engaged on this device; False -> Triton fallback):
python -c "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; print(msa_available())"
```
The gate requires --attention-backend fa4 --page-size 128 (already part of the Blackwell recipe above; on current main these are also the auto-selected M3 defaults on SM100 GPUs). Force the Triton path at any time with the env var SGLANG_DISABLE_MSA=1. MSA is a Blackwell (SM100) kernel and does not apply to the AMD ROCm paths.
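Before calling the gate, a quick preflight can rule out the two conditions named above: the `SGLANG_DISABLE_MSA=1` override and a missing `fmha_sm100` package. The sketch below is not SGLang's gate logic; `msa_preflight` is a hypothetical helper that only inspects the environment, and `msa_available()` remains the real check because it also depends on the device and backend flags.

```python
import importlib.util
import os


def msa_preflight(env=None, module="fmha_sm100"):
    """Report which sparse-attention path to expect, based on env and installed packages."""
    env = os.environ if env is None else env
    if env.get("SGLANG_DISABLE_MSA") == "1":
        return "triton (forced by SGLANG_DISABLE_MSA=1)"
    if importlib.util.find_spec(module) is None:
        return f"triton ({module} is not installed)"
    return "msa candidate (confirm with msa_available())"


print(msa_preflight())
```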
The NVIDIA Blackwell recipes are validated single-node: B200 at --tp 8 and B300 / GB300 at --tp 4 (4-GPU is also the GB200 / GB300 single-node ceiling). GB200 (sm_100, aarch64) is inferred-supported — both of its axes are validated above (B200 is sm_100; GB300 is sm_103 aarch64) — but not directly benchmarked. The AMD recipes use 8-GPU (--tp 8).
- `--mem-fraction-static` reserves GPU memory for weights + KV pool; the rest is prefill activation headroom. The value scales with free memory per GPU (card capacity minus per-GPU weight), so it tracks the card more than the TP degree: 0.65 on B200 (180 GB, with less headroom once weights are resident) and 0.75 on the larger-memory B300 / GB300 (0.80 on AMD). Lower TP packs more weight per GPU, so a tighter config needs a lower value: B200 needs 0.65 even at `--tp 4`. Raising it past the validated value is fine only for low-concurrency single-stream serving; it OOMs under high concurrency or long context.
- `--mem-fraction-static` at the platform default and raise `--chunked-prefill-size` to 16384. Decode TPOT stays roughly flat in context length thanks to sparse attention; 1K–128K prompts are validated.
- `--tp 8`; B300 / GB200 / GB300 at `--tp 4` (the single-node cross-family common denominator). On an 8-GPU B300 host you can also raise to `--tp 8` for more throughput / KV headroom.
- `--ep` (see Expert Parallelism Deployment). On AMD, set `--ep` equal to `--tp`. Shared-experts fusion is automatically disabled when EP > 1; on AMD standard EP the server also disables `--enable-aiter-allreduce-fusion` automatically to preserve accuracy.
- `--trust-remote-code` is required to load the MiniMax config / processor classes.

MiniMax-M3 runs on AMD Instinct GPUs through two code paths, by architecture. Both are selected automatically; you still pass `--quantization mxfp8` either way:
- [128,128] at load time, then serves them with the tuned ROCm block-fp8 kernels (`--attention-backend aiter`, `--moe-runner-backend triton`; the aiter runner also works and scores marginally higher). On a cold start the first generation can JIT-compile AITER configs and exceed the default warmup/HTTP timeout, so the recipe adds `--watchdog-timeout 3600 --skip-server-warmup`. The block-fp8 step adds only a small relative error over MXFP8's native 1×32 scaling, which is negligible on GSM8K (see the benchmark card).

Select an MI300X/MI325X or MI350X/MI355X tile in the command panel above to get the exact launch command for each path.
<Note> The AMD recipes are validated end-to-end on **text** workloads: chat, reasoning separation, and tool calling. The vision tower was not exercised on ROCm; for image input on AMD, omit the Blackwell `--mm-attention-backend flashinfer_cudnn` flag and let the encoder use the ROCm default backend, and treat vision as unvalidated on that path. </Note>

The MXFP8 kernels are Blackwell-only, so Hopper (H200) serves the full-precision bfloat16 build `MiniMaxAI/MiniMax-M3`. Select H200 + BF16 in the Deploy panel above for the exact command; it runs at `--tp 8` (the bf16 weights need a full 8-GPU node). SGLang picks the right backends for Hopper automatically, so the recipe stays minimal.
High-concurrency throughput (optional). On Hopper the sparse prefill runs on the Triton path as a separate eager forward, which briefly stalls the in-flight decode batch under heavy concurrent load. Adding --enable-mixed-chunk --chunked-prefill-size 2048 merges the running decodes into the prefill step instead of preempting them, which recovers roughly +10% output throughput and ~10% lower median TPOT at high concurrency on 8×H200, with no change in accuracy. Leave it off for latency-sensitive low-concurrency serving.
Validated on 8×H200 — reasoning and tool-call auto-detection plus long-context generation. For prefill/decode disaggregation on Hopper, see §3.4.
Launch with --reasoning-parser auto (or toggle Reasoning Parser in the Parsers card of the Playground above). The <mm:think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
)

# The parsed trace and the final answer arrive in separate fields.
message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)
```

Output Example:

```text
=============== Reasoning ===============
15% of 240. 15% = 0.15. 240 * 0.15 = 36. Quick check: 10% is 24, 5% is 12, 24 + 12 = 36.
=============== Answer ==================
15% of 240 is **36**.
(10% of 240 = 24, and 5% of 240 = 12; 24 + 12 = 36.)
```
When streaming, the trace arrives on delta.reasoning_content and the answer on delta.content, so the two sections can be rendered separately in real time:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "Solve step by step: what is 15% of 240?"}],
    max_tokens=2048,
    stream=True,
)

for chunk in response:
    # Some chunks carry no choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)  # thinking stream
    if delta.content:
        print(delta.content, end="", flush=True)  # answer stream
print()
```

Output Example:

```text
[delta.reasoning_content: thinking stream]
Let me solve this step by step.
15% of 240
= 0.15 × 240
= 36
Let me verify: 10% of 240 = 24, 5% of 240 = 12, so 15% = 24 + 12 = 36. ✓

[delta.content: answer stream]
# Solving 15% of 240
## Step 1: Convert the percentage to a decimal
15% = 15/100 = 0.15
## Step 2: Multiply by 240
0.15 × 240 = 36
## Answer
**15% of 240 = 36**
```
Launch with --tool-call-parser auto (or toggle Tool Call Parser in the Parsers card of the Playground above) — it auto-detects M3's tool-call parser from the chat template. M3 emits tool calls in a custom namespace-token XML format:
```text
]<]minimax[>[<tool_call>
]<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing]<]minimax[>[</location>]<]minimax[>[</invoke>
]<]minimax[>[</tool_call>
```
The parser converts that into the standard OpenAI tool_calls structure:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

# tool_calls is empty or None when the model answers directly.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
```

Output Example:

```text
Tool: get_weather
Args: {"location": "Beijing"}
```
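To make the wire format shown earlier concrete, here is a toy conversion of the flat `get_weather` example into a name plus JSON argument string. It is not SGLang's parser: it assumes `]<]minimax[>[` is the only namespace token, that the remainder is well-formed XML, and that every parameter is a flat string. `parse_flat_tool_calls` is a hypothetical name.

```python
import json
import xml.etree.ElementTree as ET

NAMESPACE_TOKEN = "]<]minimax[>["  # assumed to be the only marker in the raw text


def parse_flat_tool_calls(raw: str) -> list[dict]:
    """Strip the namespace token, then read each <invoke> as one flat tool call."""
    root = ET.fromstring(raw.replace(NAMESPACE_TOKEN, "").strip())
    calls = []
    for invoke in root.iter("invoke"):
        arguments = {param.tag: (param.text or "") for param in invoke}
        calls.append({"name": invoke.get("name"), "arguments": json.dumps(arguments)})
    return calls


raw = (
    ']<]minimax[>[<tool_call>\n'
    ']<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing'
    ']<]minimax[>[</location>]<]minimax[>[</invoke>\n'
    ']<]minimax[>[</tool_call>'
)
for call in parse_flat_tool_calls(raw):
    print(call["name"], call["arguments"])
```

In practice the server does this conversion for you once `--tool-call-parser auto` is set; the sketch only shows why the result has the shape printed above.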
Beyond a single flat call, the parser also supports:
- `<invoke>` blocks inside the single `<tool_call>` wrapper, surfaced as multiple `message.tool_calls` entries.
- object-typed parameter is emitted as nested XML tags and reconstructed into a JSON object.
- array-typed parameter uses repeated `<item>` children and is reconstructed into a JSON list.

For example, a tool with object and array parameters round-trips cleanly:

```text
create_event {"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}
```
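A tool definition that would produce arguments of that shape might look like the following. The schema is a hypothetical reconstruction from the example output (the field names come from it; the description and `required` list are assumptions), and the argument string is copied from the example.

```python
import json

# Hypothetical schema inferred from the example arguments above.
create_event_tool = {
    "type": "function",
    "function": {
        "name": "create_event",
        "description": "Create a calendar event",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "attendees": {"type": "array", "items": {"type": "string"}},
                "location": {
                    "type": "object",
                    "properties": {
                        "room": {"type": "string"},
                        "floor": {"type": "integer"},
                    },
                },
            },
            "required": ["title"],
        },
    },
}

# Argument string copied from the example above.
raw_args = '{"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}'
args = json.loads(raw_args)
print(type(args["attendees"]).__name__, type(args["location"]).__name__)
```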
To return a tool result, append the assistant's tool_calls turn plus a matching tool message and ask the model to continue — the follow-up answer may place text in reasoning_content as well as content, so print both.
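The message bookkeeping for that round trip can be sketched as below. `append_tool_result` is a hypothetical helper, the `call_0` id stands in for the id returned in `message.tool_calls`, and the weather payload is placeholder data rather than real tool output.

```python
import json


def append_tool_result(messages, tool_call, result):
    """Return a new message list with the assistant tool-call turn and its tool result."""
    assistant_turn = {"role": "assistant", "content": "", "tool_calls": [tool_call]}
    tool_turn = {
        "role": "tool",
        "tool_call_id": tool_call["id"],  # must match the id of the call being answered
        "content": json.dumps(result),
    }
    return [*messages, assistant_turn, tool_turn]


messages = [{"role": "user", "content": "What's the weather in Beijing?"}]
tool_call = {
    "id": "call_0",  # hypothetical id; use the one from message.tool_calls
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Beijing"}'},
}
followup = append_tool_result(messages, tool_call, {"temperature_c": 21, "condition": "sunny"})
print([m["role"] for m in followup])
```

Passing `followup` as `messages` (with the same `tools`) to `client.chat.completions.create` asks the model to continue; print both `reasoning_content` and `content` from the reply.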
Images go through the standard OpenAI image_url content type. The vision tower is always loaded; for image serving add --mm-attention-backend flashinfer_cudnn (the vision-tower backend) to the Blackwell deployment recipe — the text --attention-backend is unchanged (§2.1 note). On AMD, omit --mm-attention-backend and let the encoder use the ROCm default (vision is unvalidated on ROCm — §2.3).
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
                    },
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Output Example:

```text
This image captures a striking and unusual urban scene on what appears to be a busy New York City street.
**Main Subject:**
A man stands on the rear bumper of a yellow taxi cab (an SUV-style cab, likely a Ford Escape hybrid), operating a full-sized ironing board set up across the back of the vehicle. He is wearing a bright yellow long-sleeved shirt and dark pants, and is actively ironing a blue garment, holding an iron in his right hand.
**Vehicles:**
- The yellow SUV taxi on the right is stationary, its rear hatch serving as the ironing platform.
- A second yellow taxi (a sedan) drives past on the left, captured with motion blur.
**Setting:**
Tall city buildings with classic urban architecture, an American flag, and white lane markings; a bustling downtown area, possibly Midtown Manhattan.
```
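For a local file instead of a URL, the image can be sent inline as a base64 data URI in the same `image_url` field. A minimal sketch follows; `image_data_uri` and `image_part` are hypothetical helper names, and the 8-byte PNG signature stands in for real file contents.

```python
import base64


def image_data_uri(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Build one image_url entry for the content list of a user message."""
    return {"type": "image_url", "image_url": {"url": image_data_uri(data, mime)}}


# Placeholder bytes; in practice read them with open(path, "rb").read().
part = image_part(b"\x89PNG\r\n\x1a\n")
print(part["image_url"]["url"])
```

The returned dictionary drops into the `content` list in place of the URL-based entry in the example above.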
Notes:
- `data:image/png;base64,...` URI (SGLang decodes it server-side).
- `image_url` entries to the content list.
- `<mm:think>` trace and/or tool calls.

PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. M3 needs one thing beyond a dense model: alongside the main KV cache, every sparse "lightning-indexer" layer keeps a K-only index buffer, and that buffer must reach the decode server too; otherwise sparse attention reads stale state. SGLang transfers it alongside the main KV, reusing the same page mapping, so M3 disaggregates correctly with no extra flags.
Supported topology (the released MiniMax-M3, whose sparse layers are all K-only):
- `--tp`.

Launch the prefill server, then the decode server: the same recipe with `--disaggregation-mode decode` and no bootstrap port. Pick your hardware:
On Blackwell the MXFP8 recipe — fa4, page size 128, deep_gemm MoE, and the MSA fast path (§2.1) — is auto-selected, so each role adds only the --disaggregation-* flags. This is the validated 2 × 4×B200 setup (TP4 prefill on node A, TP4 decode on node B); point --disaggregation-ib-device at your RDMA NIC(s).
```bash
# Prefill server (node A)
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998
```

```bash
# Decode server (node B)
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30001
```
On Hopper (H200) M3 runs the bf16 build (§2.4) with Triton MoE and the built-in Triton sparse path, pinned to --page-size 128 so both roles share the page layout the sparse-index transfer relies on. This is the validated 2 × 8×H200 setup (TP8 each).
```bash
# Prefill server
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998
```

```bash
# Decode server
sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30001
```
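Because the two roles differ only in the mode, the port, and the prefill-only bootstrap port, the pair of commands can be generated from one shared set of values. The sketch below uses only flags that appear in the Blackwell recipe; `pd_server_args` is a hypothetical helper, not part of SGLang.

```python
import shlex


def pd_server_args(mode, model, tp, backend, ib_device, port, bootstrap_port=None):
    """Build the argv for one PD role; only the prefill role gets a bootstrap port."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unknown disaggregation mode: {mode}")
    args = [
        "sglang", "serve",
        "--model-path", model,
        "--trust-remote-code",
        "--reasoning-parser", "auto",
        "--tool-call-parser", "auto",
        "--tp", str(tp),
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", backend,
        "--disaggregation-ib-device", ib_device,
        "--host", "0.0.0.0", "--port", str(port),
    ]
    if mode == "prefill":
        if bootstrap_port is None:
            raise ValueError("the prefill role needs a bootstrap port")
        args += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return args


shared = dict(model="MiniMaxAI/MiniMax-M3-MXFP8", tp=4, backend="nixl", ib_device="mlx5_0")
prefill = pd_server_args("prefill", port=30000, bootstrap_port=8998, **shared)
decode = pd_server_args("decode", port=30001, **shared)
print(shlex.join(prefill))
print(shlex.join(decode))
```

Generating both commands from one dictionary keeps the model, TP degree, and transfer backend identical across roles.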
Then start the PD router, pointing it at the prefill bootstrap (URL plus its --disaggregation-bootstrap-port) and the decode endpoint:
```bash
python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://<prefill-host>:30000 8998 \
  --decode http://<decode-host>:30001 \
  --policy round_robin \
  --host 0.0.0.0 --port 8000
```
Clients hit the router exactly like a single server — it splits each request across the two stages transparently:
<Accordion title="PD Client Example (Python)">

```python
from openai import OpenAI

# Point the client at the router, not at either server.
client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Output Example:

```text
2 + 2 = 4
```

</Accordion>
Validation. PD disaggregation preserves output quality — the K-only sparse index transfers arrive intact and disaggregated output matches non-disaggregated serving. GSM8K is scored with the single sgl-eval harness used by the benchmark card above (full 1319-question split, chat with --thinking); see that card for per-platform single-node accuracy.
random isl=2048 / osl=256 / conc=64 row, so the throughput figures are not directly comparable) measured mean TTFT 1.1 s and TPOT 16.6 ms (≈ 60 tok/s per stream, ≈ 2.3k tok/s aggregate).