doc/en/kt-kernel/MiniMax-M3-Tutorial.md
This tutorial demonstrates how to run MiniMax-M3 model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of M3's 128-routed-expert sparse architecture by offloading experts to CPU.
The examples use the public model ID MiniMaxAI/MiniMax-M3-MXFP8.
The launch commands below target an 8-GPU TP8 server with 64 CPU inference threads and 2 CPU thread pools. Adjust --tp-size, --kt-cpuinfer, --kt-threadpool-count, and GPU expert counts for your hardware.
Before starting, ensure you have:
KT-Kernel installed (required for hybrid CPU offload)
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
cd kt-kernel && ./install.sh
SGLang installed — install the kvcache-ai fork of SGLang (one of):
# Option A: One-click install (from ktransformers root)
./install.sh
# Option B: pip install
pip install sglang-kt
Supported GPUs: SM90 (Hopper: H100 / H200 / H20 / H800). Upstream SGLang currently targets SM100 (Blackwell datacenter: B100 / B200 / GB200).
CUDA toolkit — CUDA 12.0+ recommended; CUDA 12.8+ for FP8 / MXFP8 deployments.
Hugging Face CLI — for downloading models:
pip install -U huggingface-hub
Download the MiniMax-M3 weights from Hugging Face. M3 ships natively in MXFP8 (fp8 e4m3 + uint8 ue8m0 1x32 scale).
hf download MiniMaxAI/MiniMax-M3-MXFP8 \
--local-dir /path/to/MiniMax-M3-MXFP8
Note: Replace /path/to/ with your actual storage path throughout this tutorial.
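To make the MXFP8 layout mentioned above concrete, here is a minimal sketch of how one block of 32 FP8 E4M3 values and its shared UE8M0 scale decode to floats. It follows the OCP Microscaling encoding as commonly described (E4M3 with bias 7, UE8M0 as a power-of-two exponent with bias 127). The function names are hypothetical, and the actual tensor layout inside the checkpoint files is not covered here.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # E4M3 has no infinities; this bit pattern is NaN
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_ue8m0(byte: int) -> float:
    """Decode an unsigned 8-bit exponent-only scale: 2 ** (byte - 127)."""
    if byte == 0xFF:
        return math.nan  # reserved NaN encoding
    return 2.0 ** (byte - 127)


def dequantize_block(elements: bytes, scale: int) -> list[float]:
    """Dequantize one 1x32 block: every element shares the same UE8M0 scale."""
    if len(elements) != 32:
        raise ValueError("an MXFP8 block holds exactly 32 elements")
    factor = decode_ue8m0(scale)
    return [decode_e4m3(b) * factor for b in elements]
```

For example, `dequantize_block(bytes([0x38] * 32), 128)` decodes 32 copies of 1.0 with a scale of 2.0.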
python -m sglang.launch_server \
--model-path /path/to/MiniMax-M3-MXFP8 \
--kt-weight-path /path/to/MiniMax-M3-MXFP8 \
--kt-method MXFP8 \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 40 \
--kt-gpu-prefill-token-threshold 500 \
--tp-size 8 \
--quantization mxfp8 \
--moe-runner-backend triton \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.55 \
--chunked-prefill-size 8192 \
--cuda-graph-max-bs 1 \
--tool-call-parser minimax-m3 \
--reasoning-parser minimax-m3 \
--served-model-name MiniMax-M3
A single 96 GB Hopper card is sufficient if most routed experts stay on CPU:
python -m sglang.launch_server \
--model-path /path/to/MiniMax-M3-MXFP8 \
--kt-weight-path /path/to/MiniMax-M3-MXFP8 \
--kt-method MXFP8 \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 4 \
--kt-gpu-prefill-token-threshold 500 \
--tp-size 1 \
--quantization mxfp8 \
--moe-runner-backend triton \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.85 \
--chunked-prefill-size 4096 \
--cuda-graph-max-bs 1 \
--tool-call-parser minimax-m3 \
--reasoning-parser minimax-m3 \
--served-model-name MiniMax-M3
If you encounter OOM, lower --kt-num-gpu-experts, --mem-fraction-static, --chunked-prefill-size, or --max-running-requests (whose default is high).
Once the server is running at http://localhost:8000, you can interact with the model in several ways.
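Loading the weights can take a while, so a script may want to wait for the server before sending requests. The sketch below polls until a probe succeeds or a timeout expires. The helper names are hypothetical, and probing `/v1/models` is an assumption based on the OpenAI-compatible API; substitute whichever endpoint your build exposes.

```python
import http.client
import time
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
        return False


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call `probe()` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage: `wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))`. Passing the probe as a callable keeps the loop independent of the endpoint you choose.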
The server exposes an OpenAI-compatible API at http://localhost:8000/v1.
curl example (non-streaming):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M3",
"messages": [{"role": "user", "content": "Solve step by step: 17 * 23"}],
"temperature": 0.0,
"max_tokens": 256
}'
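The same request can be sent from Python. The following is a minimal sketch using only the standard library, so it needs no extra packages. The helper names `build_chat_request` and `chat` are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, model: str = "MiniMax-M3",
                       temperature: float = 0.0, max_tokens: int = 256):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Solve step by step: 17 * 23"))` issues the same call as the curl command.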
M3 emits tool calls in its native <minimax:tool_call> XML format. The --tool-call-parser minimax-m3 flag converts them to the OpenAI tool_calls array automatically.
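from openai import OpenAI

# Point the OpenAI SDK at the local server. The api_key value is a placeholder
# (assumption: the server was launched without an API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{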
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}]
response = client.chat.completions.create(
model="MiniMax-M3",
messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
tools=tools,
tool_choice="auto",
max_tokens=200,
)
print(response.choices[0].message.tool_calls)
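Once the parsed `tool_calls` array comes back, the caller still has to run the tools and return their results. The sketch below shows one way to do that, assuming the standard OpenAI tool-call shape (`id`, `function.name`, and `function.arguments` as a JSON string). `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names, and the weather function is a stand-in rather than a real lookup.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    # Hypothetical stand-in; a real implementation would query a weather service.
    return f"Sunny in {city}"


# Maps the tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each tool call and return the matching `tool` role messages."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(func(**args)),
        })
    return messages


# Offline demonstration with a hand-built object shaped like an SDK tool call.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Shanghai"}'),
)
print(run_tool_calls([fake_call]))
```

To continue the conversation, append the assistant message that contained the tool calls, then these `tool` messages, and call `client.chat.completions.create` again.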
M3 supports request-level thinking control via chat_template_kwargs.thinking_mode. Reasoning output (if any) is returned under message.reasoning_content.
| thinking_mode | Behavior |
|---|---|
| "enabled" | Force chain-of-thought; the <mm:think> start tag is prefilled by the template. |
| "disabled" | Suppress thinking; the closing tag is prefilled. |
| "adaptive" (default) | Model self-decides; detector handles emitted <mm:think> blocks. |
Example:
response = client.chat.completions.create(
model="MiniMax-M3",
messages=[{"role": "user", "content": "What is 2+2?"}],
extra_body={"chat_template_kwargs": {"thinking_mode": "disabled"}},
max_tokens=50,
)
# content: "4", reasoning_content: None
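A typo in `thinking_mode` may be hard to notice at request time, so it can help to validate the value before sending. The sketch below builds the `extra_body` argument for the three modes listed in the table and reads both output fields from a response message. The helper names are hypothetical; `reasoning_content` is read with `getattr` because it may be absent when no reasoning is returned.

```python
VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode: str = "adaptive") -> dict:
    """Return the `extra_body` dict for a given thinking mode, rejecting unknown values."""
    if mode not in VALID_THINKING_MODES:
        raise ValueError(f"thinking_mode must be one of {VALID_THINKING_MODES}, got {mode!r}")
    return {"chat_template_kwargs": {"thinking_mode": mode}}


def split_reasoning(message):
    """Return (reasoning_content, content) from a chat completion message object."""
    return getattr(message, "reasoning_content", None), message.content
```

Usage: pass `extra_body=thinking_extra_body("enabled")` to `client.chat.completions.create`, then call `split_reasoning(response.choices[0].message)`.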
Default generation settings:
KT-Kernel hybrid sizing rule-of-thumb:
| TP size | Suggested --kt-num-gpu-experts | Suggested --kt-cpuinfer | Notes |
|---|---|---|---|
| 1 | 2–8 | physical core count | Single 96 GB card, most experts on CPU |
| 4 | 20–40 | physical core count | Mid-range deploy |
| 8 | 40–60 | physical core count | Highest throughput; raise --mem-fraction-static |
--kt-gpu-prefill-token-threshold 500 enables a layerwise full-GPU prefill fallback for prompts longer than 500 tokens. Raise the threshold to disable the fallback for shorter prompts.
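The sizing table above can be expressed as a small lookup for use in launch scripts. This is only a sketch of the rule of thumb: the helper name and the returned keys are hypothetical, only the three tabulated TP sizes are covered, and the physical core count must be supplied by the caller (Python's `os.cpu_count()` reports logical CPUs, which may be higher).

```python
# Suggested --kt-num-gpu-experts ranges per TP size, copied from the sizing table.
GPU_EXPERT_RANGES = {1: (2, 8), 4: (20, 40), 8: (40, 60)}


def suggest_kt_flags(tp_size: int, physical_cores: int) -> dict:
    """Return rule-of-thumb starting values for the KT-Kernel launch flags."""
    if tp_size not in GPU_EXPERT_RANGES:
        raise ValueError(f"no rule of thumb for tp_size={tp_size}")
    if physical_cores < 1:
        raise ValueError("physical_cores must be positive")
    low, high = GPU_EXPERT_RANGES[tp_size]
    return {
        "tp_size": tp_size,
        "kt_cpuinfer": physical_cores,          # table suggests the physical core count
        "kt_num_gpu_experts_range": (low, high),  # start low and raise while memory allows
    }
```

For instance, `suggest_kt_flags(8, 64)` yields a GPU expert range of 40 to 60, which matches the 8-GPU launch command's `--kt-num-gpu-experts 40`.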
Model implementation errors
Make sure your SGLang build is on the feat/minimax-m3 branch and that --trust-remote-code is enabled.
OOM during startup or serving
Reduce one or more of:
--kt-num-gpu-experts, --mem-fraction-static, --chunked-prefill-size, --max-running-requests