MiniMax-M3 - SGLang

Deployment

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.

bash

pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate

# MiniMax-M3 ships in SGLang PR #27944, not yet in a tagged release — install from
# the PR head. The serving runtime is in the base dependencies, so no extra is needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/27944/head && git checkout FETCH_HEAD
uv pip install -e python

Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image bundles the CUDA-13 runtime and the #27944 code. Once PR #27944 is merged and released, uv pip install sglang will pull M3 support directly.

</Tab> <Tab title="Docker">

bash

# Pull the M3 image the command panel selects for your platform, e.g.:
docker pull lmsysorg/sglang:dev-cu13-minimax-m3

The command panel below fills in the right tag per platform: dev-cu13-minimax-m3 (CUDA 13 — B300, GB200, GB300), dev-cu12-minimax-m3 (CUDA 12 — Hopper H200), or dev-minimax-m3 (default). On AMD Instinct it uses the matching ROCm image (MI300X/MI325X → …-rocm700-mi30x, MI350X/MI355X → …-rocm720-mi35x). For how to launch the image, see Install → Method 3: Using Docker, substituting the inner sglang serve ... with what the command generator produces.
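As a rough sketch of the NVIDIA part of that mapping, the lookup below picks an image from a platform name. The function name and the platform keys are illustrative assumptions, not part of SGLang, and sending every other platform (for example B200) to the default tag is also an assumption. The AMD images are left out because their full tag names are not spelled out here.

```python
# Illustrative lookup only: the platform keys and the function name are
# assumptions; the tags themselves are the ones listed on this page.
CUDA13_PLATFORMS = {"B300", "GB200", "GB300"}
CUDA12_PLATFORMS = {"H200"}


def m3_image(platform: str) -> str:
    """Return the lmsysorg/sglang image for an NVIDIA platform name."""
    name = platform.upper()
    if name in CUDA13_PLATFORMS:
        tag = "dev-cu13-minimax-m3"
    elif name in CUDA12_PLATFORMS:
        tag = "dev-cu12-minimax-m3"
    else:
        # Assumption: anything not listed above falls back to the default tag.
        tag = "dev-minimax-m3"
    return f"lmsysorg/sglang:{tag}"
```

The command panel remains the source of truth for the tag; this only makes the listed cases explicit.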

<Note> These M3 dev images now **bundle MiniMax's MSA sparse-attention kernel** (`fmha_sm100`), so Blackwell users get the recommended fast path automatically — no manual install needed (see **§2.1**). On a custom image without it, the same recipe still serves on the built-in Triton sparse path. </Note> </Tab> </Tabs> </Accordion>

Pick your hardware + recipe to generate the launch command.


Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.


1. Model Introduction

MiniMax-M3 is MiniMax's native-multimodal Mixture-of-Experts reasoning model: ~428B total parameters with ~23B activated per token (128 experts, 4 active per token), 60 layers, and a 1M-token context over text, image, and video. Its defining feature is MiniMax Sparse Attention (MSA) — a block-sparse "lightning indexer" attention that keeps long-context cost low (MiniMax reports ~9× prefill / ~15× decode speedup over M2 at 1M context). This page serves the MXFP8 variant (MiniMaxAI/MiniMax-M3-MXFP8, ~440 GB) on NVIDIA Blackwell and AMD Instinct; on NVIDIA Hopper (H200), use the full-precision bfloat16 build MiniMaxAI/MiniMax-M3 (§2.4). Released under the MiniMax Community License.

Key characteristics as served by SGLang:

Multimodal (vision + text): accepts interleaved text and images through the OpenAI-compatible chat API (loaded as MiniMaxM3SparseForConditionalGeneration). Image input via URL and base64 is validated; video input has not been tested here.
Reasoning model: emits its chain of thought wrapped in <mm:think>...</mm:think>. Always launch with --reasoning-parser auto — it auto-detects the right parser from the chat template, and SGLang then strips the tags and returns the trace separately in message.reasoning_content.
Native tool calling: a custom namespace-token XML format, parsed into standard OpenAI tool_calls. Always launch with --tool-call-parser auto — it auto-detects the right parser from the chat template. Single, parallel, and nested (object / array) arguments are supported.
Sparse attention: most layers use M3's "lightning indexer" block-sparse attention (top-k 128-token blocks), which keeps decode cost roughly flat in context length. On Blackwell, MiniMax's open-source MSA kernel accelerates this path further (§2.1).
MXFP8 quantization across vendors: the MXFP8 MoE weights run natively on NVIDIA Blackwell (B200 / B300 / GB200 / GB300) and on AMD Instinct MI350X/MI355X (gfx950 / CDNA4), both of which have hardware MX-scaled matmul. On AMD MI300X/MI325X (gfx942 / CDNA3) — no hardware MX — SGLang converts the weights to block-fp8 [128,128] at load and serves them on the tuned ROCm kernels (§2.3). The vision tower stays unquantized.

Recommended generation: the model's generation_config.json sets temperature 1.0 / top_p 0.95, which SGLang applies automatically (the default --sampling-defaults model). The model card additionally suggests top_k 40, but that value is not in generation_config.json, so SGLang does not apply it by default. top_k is a per-request sampling parameter (not a launch flag) — set it per call if you want it, e.g. extra_body={"top_k": 40} with the OpenAI client.
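A minimal sketch of a per-request top_k, assuming the OpenAI Python client used elsewhere on this page. The helper name build_chat_request is hypothetical; it only assembles keyword arguments, and it deliberately sets neither temperature nor top_p so the server-side defaults from generation_config.json still apply.

```python
def build_chat_request(prompt: str, top_k: int = 40) -> dict:
    """Keyword arguments for client.chat.completions.create with a per-request top_k."""
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return {
        "model": "MiniMaxAI/MiniMax-M3-MXFP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2048,
        # top_k is not a standard OpenAI field, so it goes through extra_body,
        # which the OpenAI client merges into the JSON request body.
        "extra_body": {"top_k": top_k},
    }


request = build_chat_request("What is 15% of 240? Explain briefly.")
# With a running server and an OpenAI client:
#   response = client.chat.completions.create(**request)
```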

Resources: HuggingFace · MSA kernel

2. Configuration Tips

2.1 MSA sparse-attention fast path (recommended for Blackwell users)

MiniMax MSA (fmha_sm100, MIT-licensed) is the recommended Blackwell kernel for M3's main sparse-attention step — faster and more memory-efficient than the built-in Triton fallback. It ships pre-installed in the M3 dev image (lmsysorg/sglang:dev-minimax-m3, also published under the dev-cu13-minimax-m3 tag), so the Blackwell recipe above engages it automatically with no extra setup — import fmha_sm100 works out of the box and the kernels JIT-compile on first use. It is otherwise purely additive: on a custom image, install it (below) and the recipe engages it automatically; without it the same recipe still serves on the built-in Triton path. The swap is numerically equivalent (cosine ≥ 0.99999 vs Triton), decode stays CUDA-graph-capturable, prefill TTFT drops ~9–12% at 8K–64K context, and the MSA path survives memory configurations where the Triton path OOMs.

Requirements (from the MSA README):

GPU: NVIDIA SM100 family — sm_100 (B200 / GB200) and sm_103 (B300 / GB300).
Toolchain: CUDA Toolkit with nvcc ≥ 12.x on PATH (or CUDA_HOME set) — the kernels are JIT-compiled at first import.
Python: ≥ 3.10; OS: Linux — works on both x86_64 and aarch64 (Grace, e.g. GB200 / GB300); the aarch64 build needs no source edits.

The M3 Blackwell dev images above already bundle MSA, so you can skip straight to the gate check. The git clone / pip install steps are only needed on a custom image that doesn't have fmha_sm100.

bash

# Only on a custom image: --recursive pulls the CUTLASS submodule required for JIT compilation
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa && pip install .
# Verify the SGLang gate (True -> MSA engaged on this device; False -> Triton fallback):
python -c "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; print(msa_available())"

</Accordion> <Note> The first import JIT-compiles the kernels, which can take 30 s to a few minutes on a cold `nvcc` cache — this is normal, not a hang. Subsequent server starts hit the JIT cache. </Note> <Warning> **Warm the JIT cache before a multi-GPU launch.** On a *cold* cache, several tensor-parallel ranks racing to JIT-compile MSA's plan kernel can leave one rank loading a half-linked module (`AttributeError: Module has no function 'plan'` at CUDA-graph capture). Run the gate-check `python -c "..."` (or any single-process `fmha_sm100_plan` call) once before launching the server — that compiles the kernel single-process, and every rank then hits the warm cache. </Warning>

The gate requires --attention-backend fa4 --page-size 128 (already part of the Blackwell recipe above; on current main these are also the auto-selected M3 defaults on SM100 GPUs). Force the Triton path at any time with the env var SGLANG_DISABLE_MSA=1. MSA is a Blackwell (SM100) kernel and does not apply to the AMD ROCm paths.
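The warm-up step from the warning above can be scripted so that it always runs once, single-process, before the server starts. This is a sketch under assumptions: the helper name msa_gate and the 900-second timeout are illustrative choices, and the check string is the one-liner shown earlier in this section.

```python
import subprocess
import sys

# The gate-check one-liner from this section. Per the warning above, running it
# once in a single process compiles the kernels and warms the JIT cache.
GATE_CHECK = (
    "from sglang.srt.layers.attention.minimax_sparse_ops.msa import msa_available; "
    "print(msa_available())"
)


def msa_gate(python_code: str = GATE_CHECK, timeout_s: float = 900.0) -> bool:
    """Run the gate check in a child interpreter and return its verdict."""
    result = subprocess.run(
        [sys.executable, "-c", python_code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # assumption: generous bound for a cold nvcc cache
        check=True,         # a failed import raises CalledProcessError
    )
    lines = result.stdout.strip().splitlines()
    # The check prints True or False on its last line.
    return bool(lines) and lines[-1].strip() == "True"
```

Calling msa_gate() in a launch script before sglang serve gives one compile pass; a False result means the server will use the Triton fallback.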

<Note> For multimodal (image) serving, keep the same text recipe above — `--attention-backend fa4 --page-size 128` (MSA) is unchanged — and add `--mm-attention-backend flashinfer_cudnn` for the vision tower. The text and vision-tower attention backends are independent knobs; MSA only touches the language-model sparse attention, not image handling. </Note>

2.2 Memory and workload tuning

The NVIDIA Blackwell recipes are validated single-node: B200 at --tp 8 and B300 / GB300 at --tp 4 (4 GPUs is also the single-node ceiling on GB200 / GB300). GB200 (sm_100, aarch64) has not been benchmarked directly; support is inferred because both of its axes are validated separately (B200 covers sm_100, and GB300 covers aarch64 on sm_103). The AMD recipes use 8 GPUs (--tp 8).

Memory: --mem-fraction-static reserves GPU memory for weights + KV pool; the rest is prefill activation headroom. The value scales with free memory per GPU (card capacity minus per-GPU weight), so it tracks the card more than the TP degree: 0.65 on B200 (180 GB, which leaves less headroom once weights are resident), 0.75 on the larger-memory B300 / GB300, and 0.80 on AMD. Lower TP packs more weight per GPU, so a tighter configuration calls for a lower value, not a higher one: B200 still needs 0.65 at --tp 4. Raise it past the validated value only for low-concurrency single-stream serving; higher values OOM under high concurrency or long context.
Long context (32K+): keep --mem-fraction-static at the platform default and raise --chunked-prefill-size to 16384. Decode TPOT stays roughly flat in context length thanks to sparse attention; 1K–128K prompts are validated.
Scaling TP: B200 is documented at --tp 8; B300 / GB200 / GB300 at --tp 4 (the single-node cross-family common denominator). On an 8-GPU B300 host you can also raise to --tp 8 for more throughput / KV headroom.
Expert parallelism: to trade latency for throughput, add --ep (see Expert Parallelism Deployment). On AMD, set --ep equal to --tp. Shared-experts fusion is disabled automatically when EP > 1; with standard EP on AMD, the server also disables --enable-aiter-allreduce-fusion automatically to preserve accuracy.
--trust-remote-code is required to load the MiniMax config / processor classes.
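The memory reasoning above can be made concrete with a little arithmetic. The sketch below is an estimate under stated assumptions, not how SGLang sizes its pools: it assumes the ~440 GB checkpoint shards evenly across TP ranks and ignores the unquantized vision tower and runtime buffers, so the result is an upper bound. The function names are hypothetical.

```python
def static_budget_gb(card_gb: float, mem_fraction_static: float) -> float:
    """GPU memory reserved for weights plus the KV pool, in GB."""
    return card_gb * mem_fraction_static


def kv_pool_upper_bound_gb(
    card_gb: float, mem_fraction_static: float, checkpoint_gb: float, tp: int
) -> float:
    """Rough per-GPU KV-pool ceiling: static budget minus the weight shard.

    Assumes an even weight split across tp ranks (an approximation).
    """
    weights_per_gpu = checkpoint_gb / tp
    return static_budget_gb(card_gb, mem_fraction_static) - weights_per_gpu


# B200 (180 GB) with the validated recipe: --tp 8, --mem-fraction-static 0.65
b200_tp8 = kv_pool_upper_bound_gb(180, 0.65, 440, 8)
```

For B200 at --tp 8 this comes to about 62 GB per GPU (117 GB static budget minus a 55 GB weight shard). Rerunning it with a smaller tp shows why lower TP leaves less room for the KV pool at the same fraction.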

2.3 AMD Instinct (ROCm)

MiniMax-M3 runs on AMD Instinct GPUs through two code paths, by architecture — both selected automatically; you still pass --quantization mxfp8 either way:

MI350X / MI355X (gfx950, CDNA4) has hardware MX-scaled matmul, so the MXFP8 weights are served natively. SGLang auto-detects the checkpoint, selects the Triton MiniMax-M3 MoE path with the packaged tuned MXFP8 configs, and enables AITER fused all-reduce for single-node tensor parallelism. The launch command is the NVIDIA recipe minus the Blackwell-only backend flags.
MI300X / MI325X (gfx942, CDNA3) has no hardware MX matmul. SGLang transparently converts the MXFP8 weights to block-fp8 [128,128] at load time, then serves them with the tuned ROCm block-fp8 kernels (--attention-backend aiter, --moe-runner-backend triton; the aiter runner also works and scores marginally higher). On a cold start the first generation can JIT-compile AITER configs and exceed the default warmup/HTTP timeout, so the recipe adds --watchdog-timeout 3600 --skip-server-warmup. The block-fp8 step adds only a small relative error over MXFP8's native 1×32 scaling — negligible on GSM8K (see the benchmark card).

Select an MI300X/MI325X or MI350X/MI355X tile in the command panel above to get the exact launch command for each path.
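To make the difference between the two AMD paths explicit, the sketch below collects only the architecture-specific flags named in this section. It is not a complete launch command (take that from the command panel), SGLang selects the code path on its own, and the function name and gfx-string keys are illustrative assumptions.

```python
# Flags passed on both AMD paths, as stated in this section.
COMMON_FLAGS = ["--quantization", "mxfp8"]

# Additions this section names for MI300X / MI325X (gfx942, CDNA3).
CDNA3_FLAGS = [
    "--attention-backend", "aiter",
    "--moe-runner-backend", "triton",
    "--watchdog-timeout", "3600",
    "--skip-server-warmup",
]


def amd_path_flags(gfx_arch: str) -> list[str]:
    """Return the flags this section names for a given AMD architecture."""
    if gfx_arch == "gfx950":
        # MI350X / MI355X: native MXFP8, no extra backend flags named here.
        return list(COMMON_FLAGS)
    if gfx_arch == "gfx942":
        # MI300X / MI325X: block-fp8 conversion at load, plus cold-start flags.
        return COMMON_FLAGS + CDNA3_FLAGS
    raise ValueError(f"no AMD recipe described for {gfx_arch!r}")
```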

<Note> The AMD recipes are validated end-to-end on **text** workloads — chat, reasoning separation, and tool calling. The vision tower was not exercised on ROCm; for image input on AMD, omit the Blackwell `--mm-attention-backend flashinfer_cudnn` flag and let the encoder use the ROCm default backend, and treat vision as unvalidated on that path. </Note>

2.4 Serving on Hopper (H200) with the bf16 build

On NVIDIA, the MXFP8 kernels require Blackwell, so Hopper (H200) serves the full-precision bfloat16 build MiniMaxAI/MiniMax-M3. Select H200 + BF16 in the Deploy panel above for the exact command; it runs at --tp 8 (the bf16 weights need a full 8-GPU node). SGLang picks the right backends for Hopper automatically, so the recipe stays minimal:

MoE runner: Triton, auto-selected for bf16 weights.
Attention: FlashAttention-3 with page size 1. MSA (§2.1) is a Blackwell kernel, so M3's sparse step runs on the built-in Triton path here.
CUDA graph: on, with full decode-graph capture.

High-concurrency throughput (optional). On Hopper the sparse prefill runs on the Triton path as a separate eager forward, which briefly stalls the in-flight decode batch under heavy concurrent load. Adding --enable-mixed-chunk --chunked-prefill-size 2048 merges the running decodes into the prefill step instead of preempting them. At high concurrency on 8×H200 this yields roughly 10% higher output throughput and about 10% lower median TPOT, with no change in accuracy. Leave it off for latency-sensitive, low-concurrency serving.

Validated on 8×H200 — reasoning and tool-call auto-detection plus long-context generation. For prefill/decode disaggregation on Hopper, see §3.4.

3. Advanced Usage

3.1 Reasoning

Launch with --reasoning-parser auto (or toggle Reasoning Parser in the Parsers card of the Playground above). The <mm:think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
)

message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)

</Accordion> <Accordion title="Example Output">

text

=============== Reasoning ===============
15% of 240. 15% = 0.15. 240 * 0.15 = 36. Quick check: 10% is 24, 5% is 12, 24 + 12 = 36.
=============== Answer ==================
15% of 240 is **36**.
(10% of 240 = 24, and 5% of 240 = 12; 24 + 12 = 36.)

</Accordion>

When streaming, the trace arrives on delta.reasoning_content and the answer on delta.content, so the two sections can be rendered separately in real time:

python

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "Solve step by step: what is 15% of 240?"}],
    max_tokens=2048,
    stream=True,
)

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)  # thinking stream
    if delta.content:
        print(delta.content, end="", flush=True)            # answer stream
print()

Output Example:

text

[delta.reasoning_content — thinking stream]
Let me solve this step by step.

15% of 240
= 0.15 × 240
= 36

Let me verify: 10% of 240 = 24, 5% of 240 = 12, so 15% = 24 + 12 = 36. ✓

[delta.content — answer stream]
# Solving 15% of 240
## Step 1: Convert the percentage to a decimal
15% = 15/100 = 0.15
## Step 2: Multiply by 240
0.15 × 240 = 36
## Answer
**15% of 240 = 36**

</Accordion>
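For comparison, here is roughly what the separation amounts to if you ever receive raw text that still contains the tags, for example from a server launched without --reasoning-parser. This is a client-side sketch, not SGLang's parser; it assumes a single, well-formed <mm:think>...</mm:think> block, and the function name is hypothetical.

```python
import re

# Non-greedy match across newlines for one <mm:think>...</mm:think> block.
THINK_RE = re.compile(r"<mm:think>(.*?)</mm:think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw text that still contains the tags."""
    match = THINK_RE.search(raw)
    if match is None:
        # No trace present: everything is the answer.
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return reasoning, answer
```

With the parser enabled, prefer message.reasoning_content; this fallback does not handle streaming or unterminated tags.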

3.2 Tool Calling

Launch with --tool-call-parser auto (or toggle Tool Call Parser in the Parsers card of the Playground above) — it auto-detects M3's tool-call parser from the chat template. M3 emits tool calls in a custom namespace-token XML format:

text

]<]minimax[>[<tool_call>
]<]minimax[>[<invoke name="get_weather">]<]minimax[>[<location>Beijing]<]minimax[>[</location>]<]minimax[>[</invoke>
]<]minimax[>[</tool_call>

The parser converts that into the standard OpenAI tool_calls structure:

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")

</Accordion> <Accordion title="Example Output">

text

Tool: get_weather
Args: {"location": "Beijing"}

</Accordion>

Beyond a single flat call, the parser also supports:

Parallel calls — multiple <invoke> blocks inside the single <tool_call> wrapper, surfaced as multiple message.tool_calls entries.
Nested object arguments — an object-typed parameter is emitted as nested XML tags and reconstructed into a JSON object.
Array arguments — an array-typed parameter uses repeated <item> children and is reconstructed into a JSON list.

For example, a tool with object and array parameters round-trips cleanly:

text

create_event {"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}
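A tool definition that would produce arguments of that shape might look like the sketch below. The schema is hypothetical (only the name create_event and the argument values come from the example above); it shows how object-typed and array-typed parameters are declared and how the returned arguments string decodes into nested Python values.

```python
import json

# Hypothetical schema matching the create_event example above.
create_event_tool = {
    "type": "function",
    "function": {
        "name": "create_event",
        "description": "Create a calendar event",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                # Array parameter: reconstructed into a JSON list.
                "attendees": {"type": "array", "items": {"type": "string"}},
                # Object parameter: reconstructed into a JSON object.
                "location": {
                    "type": "object",
                    "properties": {
                        "room": {"type": "string"},
                        "floor": {"type": "integer"},
                    },
                },
            },
            "required": ["title"],
        },
    },
}

# call.function.arguments arrives as a JSON string; decode it before use.
arguments = '{"title": "Design sync", "attendees": ["alice", "bob"], "location": {"room": "R2", "floor": 3}}'
args = json.loads(arguments)
```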

To return a tool result, append the assistant's tool_calls turn plus a matching tool message and ask the model to continue — the follow-up answer may place text in reasoning_content as well as content, so print both.
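The round trip described above can be sketched as a small helper that builds the follow-up message list. The helper name is hypothetical, the weather values are made-up tool output, and a SimpleNamespace stands in for response.choices[0].message so the sketch runs without a server.

```python
import json
from types import SimpleNamespace


def with_tool_results(messages: list, assistant_message, results: dict) -> list:
    """Return a new message list with the tool-call turn and its results appended.

    results maps each tool call id to a JSON-serializable result.
    """
    follow_up = list(messages)
    follow_up.append(
        {
            "role": "assistant",
            "content": assistant_message.content or "",
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
                for call in assistant_message.tool_calls
            ],
        }
    )
    for call in assistant_message.tool_calls:
        # One tool message per call, matched by tool_call_id.
        follow_up.append(
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(results[call.id]),
            }
        )
    return follow_up


# Stand-in for the first response's message (illustrative id and values).
assistant = SimpleNamespace(
    content=None,
    tool_calls=[
        SimpleNamespace(
            id="call_0",
            function=SimpleNamespace(
                name="get_weather", arguments='{"location": "Beijing"}'
            ),
        )
    ],
)
history = [{"role": "user", "content": "What's the weather in Beijing?"}]
follow_up = with_tool_results(history, assistant, {"call_0": {"temp_c": 21, "sky": "clear"}})
```

Pass follow_up as messages (with the same tools list) in a second client.chat.completions.create call, then print both message.reasoning_content and message.content.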

3.3 Multimodal (Vision) Input

Images go through the standard OpenAI image_url content type. The vision tower is always loaded; for image serving add --mm-attention-backend flashinfer_cudnn (the vision-tower backend) to the Blackwell deployment recipe — the text --attention-backend is unchanged (§2.1 note). On AMD, omit --mm-attention-backend and let the encoder use the ROCm default (vision is unvalidated on ROCm — §2.3).

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
                    },
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

Output Example:

text

This image captures a striking and unusual urban scene on what appears to be a busy New York City street.

**Main Subject:**
A man stands on the rear bumper of a yellow taxi cab (an SUV-style cab, likely a Ford Escape hybrid), operating a full-sized ironing board set up across the back of the vehicle. He is wearing a bright yellow long-sleeved shirt and dark pants, and is actively ironing a blue garment, holding an iron in his right hand.

**Vehicles:**
- The yellow SUV taxi on the right is stationary, its rear hatch serving as the ironing platform.
- A second yellow taxi (a sedan) drives past on the left, captured with motion blur.

**Setting:**
Tall city buildings with classic urban architecture, an American flag, and white lane markings — a bustling downtown area, possibly Midtown Manhattan.

</Accordion>

Notes:

If the server cannot fetch external URLs, embed the image as a base64 data:image/png;base64,... URI — SGLang decodes it server-side.
Multiple images per message are supported; add more image_url entries to the content list.
Reasoning and tool calling work the same way for multimodal requests — a vision prompt can still produce a <mm:think> trace and/or tool calls.
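For the base64 route mentioned in the notes, a content part can be built from raw image bytes as sketched below. The helper name is hypothetical, and the default MIME type assumes a PNG; pass the correct type for other formats.

```python
import base64


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Build an image_url content part that embeds the image as a data URI."""
    if not mime.startswith("image/"):
        raise ValueError(f"expected an image MIME type, got {mime!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


# Typical use with a local file:
#   from pathlib import Path
#   part = image_part(Path("photo.png").read_bytes())
#   content = [part, {"type": "text", "text": "Describe this image in detail."}]
```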

3.4 Prefill-Decode (PD) Disaggregation

PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. M3 has one requirement that a dense-attention model does not: alongside the main KV cache, every sparse "lightning-indexer" layer keeps a K-only index buffer, and that buffer must reach the decode server too; otherwise sparse attention reads stale state. SGLang transfers it together with the main KV cache, reusing the same page mapping, so M3 disaggregates correctly with no extra flags.

Supported topology (the released MiniMax-M3, whose sparse layers are all K-only):

Equal tensor parallelism — the prefill and decode servers run the same --tp.
Single pipeline stage — PP = 1 (the default).
mooncake or NIXL transfer backend over RDMA / InfiniBand.

Launch the prefill server, then the decode server — the same recipe with --disaggregation-mode decode and no bootstrap port. Pick your hardware:

On Blackwell the MXFP8 recipe — fa4, page size 128, deep_gemm MoE, and the MSA fast path (§2.1) — is auto-selected, so each role adds only the --disaggregation-* flags. This is the validated 2 × 4×B200 setup (TP4 prefill on node A, TP4 decode on node B); point --disaggregation-ib-device at your RDMA NIC(s).

bash

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998

bash

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3-MXFP8 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-ib-device mlx5_0 \
  --host 0.0.0.0 --port 30001

</Tab> <Tab title="Hopper · bf16">

On Hopper (H200) M3 runs the bf16 build (§2.4) with Triton MoE and the built-in Triton sparse path, pinned to --page-size 128 so both roles share the page layout the sparse-index transfer relies on. This is the validated 2 × 8×H200 setup (TP8 each).

bash

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998

bash

sglang serve \
  --model-path MiniMaxAI/MiniMax-M3 \
  --trust-remote-code \
  --reasoning-parser auto \
  --tool-call-parser auto \
  --tp 8 \
  --attention-backend triton \
  --moe-runner-backend triton \
  --page-size 128 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30001

</Tab> </Tabs>
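The prefill and decode commands above differ only in the role flag and the bootstrap port. A small builder makes that symmetry explicit; this is a sketch that reuses only the flags shown above, and the function name and argument layout are illustrative.

```python
import shlex


def pd_command(role, model, tp, backend, ib_device, port,
               bootstrap_port=None, extra=()):
    """Build an `sglang serve` argument list for one PD role."""
    if role not in ("prefill", "decode"):
        raise ValueError("role must be 'prefill' or 'decode'")
    if role == "prefill" and bootstrap_port is None:
        raise ValueError("the prefill server needs a bootstrap port")
    cmd = [
        "sglang", "serve",
        "--model-path", model,
        "--trust-remote-code",
        "--reasoning-parser", "auto",
        "--tool-call-parser", "auto",
        "--tp", str(tp),
        *extra,  # e.g. the Hopper backend and page-size flags
        "--disaggregation-mode", role,
        "--disaggregation-transfer-backend", backend,
        "--disaggregation-ib-device", ib_device,
        "--host", "0.0.0.0", "--port", str(port),
    ]
    if role == "prefill":
        # Only the prefill server exposes a bootstrap port.
        cmd += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return cmd


# The Blackwell pair from above.
prefill = pd_command("prefill", "MiniMaxAI/MiniMax-M3-MXFP8", 4, "nixl",
                     "mlx5_0", 30000, bootstrap_port=8998)
decode = pd_command("decode", "MiniMaxAI/MiniMax-M3-MXFP8", 4, "nixl",
                    "mlx5_0", 30001)
print(shlex.join(prefill))
print(shlex.join(decode))
```

Generating both roles from one function keeps --tp and the transfer backend identical on both sides, which the supported topology requires.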

Then start the PD router, pointing it at the prefill bootstrap (URL plus its --disaggregation-bootstrap-port) and the decode endpoint:

bash

python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://<prefill-host>:30000 8998 \
  --decode http://<decode-host>:30001 \
  --policy round_robin \
  --host 0.0.0.0 --port 8000

Clients hit the router exactly like a single server — it splits each request across the two stages transparently:

python

from openai import OpenAI

client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3-MXFP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)

Output Example:

text

2 + 2 = 4

</Accordion>

Validation. PD disaggregation preserves output quality: the K-only sparse index buffers arrive intact, and disaggregated output matches non-disaggregated serving. GSM8K is scored with the same sgl-eval harness used by the benchmark card above (full 1319-question split, chat with --thinking); see that card for per-platform single-node accuracy.

2 × 4×B200 (TP4+TP4, MXFP8, NIXL over InfiniBand): output matches single-node serving. The 2-node PD serving benchmark (512-token input, 256-token output, 16 concurrent) measured mean TTFT 1.1 s and TPOT 16.6 ms (≈ 60 tok/s per stream, ≈ 2.3k tok/s aggregate). This is a different workload from the card's single-node random isl=2048 / osl=256 / conc=64 row, so the throughput figures are not directly comparable.
2 × 8×H200 (TP8+TP8, bf16, mooncake): output matches single-node serving.
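The B200 figures can be cross-checked with simple arithmetic. The per-stream rate follows directly from TPOT. How the aggregate is counted is not stated here; under the steady-state assumptions in the sketch below, counting input plus output tokens lands near the reported 2.3k tok/s, while output tokens alone would be one third of that. Treat this as an inference, not as the benchmark's definition.

```python
def pd_throughput(ttft_s, tpot_s, input_tokens, output_tokens, concurrency):
    """Steady-state estimates from mean TTFT and TPOT.

    Assumes every concurrency slot is always busy and each request takes
    ttft + (output_tokens - 1) * tpot seconds end to end.
    """
    request_s = ttft_s + (output_tokens - 1) * tpot_s
    return {
        "per_stream_tok_s": 1.0 / tpot_s,
        "output_tok_s": concurrency * output_tokens / request_s,
        "total_tok_s": concurrency * (input_tokens + output_tokens) / request_s,
    }


# Reported workload: 512 in, 256 out, 16 concurrent, TTFT 1.1 s, TPOT 16.6 ms.
stats = pd_throughput(1.1, 0.0166, 512, 256, 16)
for name, value in stats.items():
    print(f"{name}: {value:.0f}")
```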