docs/advance/mtp.md
Author: https://github.com/meituan-search
Last updated: 02/15/2026
Currently, RL training can be performed on MiMo-7B-RL, Qwen-Next, and DeepSeek series models based on the MTP architecture. The support rules for the training and inference engines are as follows:
- Training engine: only the mbridge/Megatron-Bridge + Megatron combination is supported; other training engines are not compatible at this time.
- Inference engine: all engines are compatible, provided the model is on the corresponding engine's compatibility list.
- Dependency versions:
  - mbridge: apply the patches and review suggestions from PR #62 (already merged into the main branch).
  - Megatron-Bridge: apply the patches and review suggestions from PR #2387 if you want to try out MiMo-7B-RL (to be merged into the main branch in the future).
  - Megatron: use the latest dev version (commit: 23e092f41ec8bc659020e401ddac9576c1cfed7e), which supports MTP + CP training.
  - sglang: use the branch https://github.com/ArronHZG/sglang/tree/fix_mtp_update_weights_from_tensor (see the accompanying PR), which fixes the OOM issue when updating MTP weights from tensors.
The MTP training process can be flexibly controlled through the following configurations. All options live under the `actor_rollout_ref.model.mtp` prefix (see the configuration sketch after the table):
| Configuration Scenario | Core Parameters | Description |
|---|---|---|
| Load MTP Parameters Only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
| Full-Parameter MTP Training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | The MTP loss is applied to all model parameters |
| MTP Parameter-Only Training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freezes the encoder layers and updates only the MTP module parameters; the MTP loss applies only to the MTP parameters |
| MTP Accelerated Rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | Achieves inference acceleration during the rollout phase based on MTP |
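As a concrete illustration, the sketch below assembles the "MTP Parameter-Only Training" scenario from the table with OmegaConf. It is a minimal sketch that only mirrors the option names documented above; how the resulting options are passed to the trainer (e.g., command-line overrides in a run script) is an assumption and may differ in your setup.

```python
from omegaconf import OmegaConf

# Minimal sketch: the "MTP Parameter-Only Training" scenario from the table,
# expressed under the actor_rollout_ref.model.mtp prefix documented above.
cfg = OmegaConf.create({
    "actor_rollout_ref": {
        "model": {
            "mtp": {
                "enable": True,                  # load the MTP parameters
                "enable_train": True,            # train the MTP module
                "detach_encoder": True,          # freeze the encoder layers
                "mtp_loss_scaling_factor": 0.1,  # weight of the MTP loss
            }
        }
    }
})
print(OmegaConf.to_yaml(cfg))  # inspect the options before launching training
```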
The experiment was conducted as follows; the experiment chart is available at the wandb link.
Scenarios with No Significant Effect
The following configurations will not have a noticeable impact on training results:
- The base model does not carry MTP parameters;
- The base model carries MTP parameters, but the MTP module is not trained;
- The base model carries MTP parameters and trains MTP, with `mtp_loss_scaling_factor=0`;
- The base model carries MTP parameters, trains MTP, and detaches the encoder, with `mtp_loss_scaling_factor=0.1`.
Scenarios with Significant Effect
Only the following configuration will have a noticeable impact on training results:
- The base model carries MTP parameters and trains MTP (without detaching the encoder), with `mtp_loss_scaling_factor=0.1`.
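These observations are consistent with the usual way an auxiliary MTP loss is combined with the main language-model loss (an assumed formulation; the exact objective is not spelled out in this document):

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{MTP}}, \qquad \lambda = \texttt{mtp\_loss\_scaling\_factor}
$$

With $\lambda = 0$ the MTP term vanishes, and with `detach_encoder=True` its gradient never reaches the shared encoder, so in both cases the main model trains as if MTP were absent.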
Recommended Training Method
It is recommended to adopt the `detach_encoder=True` approach for MTP training.
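To make the mechanism concrete, here is a minimal PyTorch sketch of the assumed `detach_encoder=True` semantics (the tensors and the MTP head below are hypothetical stand-ins, not the actual implementation):

```python
import torch

# Assumed semantics of detach_encoder=True: gradients from the MTP loss stop at
# the encoder output, so only the MTP module's parameters are updated.
encoder_out = torch.randn(4, 16, requires_grad=True)  # stand-in encoder hidden states
mtp_head = torch.nn.Linear(16, 8)                     # stand-in MTP module

hidden = encoder_out.detach()              # detach_encoder=True: cut the graph here
mtp_loss = mtp_head(hidden).pow(2).mean()  # stand-in MTP loss
(0.1 * mtp_loss).backward()                # scaled by mtp_loss_scaling_factor=0.1

print(encoder_out.grad is None)            # True: the encoder receives no gradient
print(mtp_head.weight.grad is not None)    # True: the MTP module still trains
```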
Enabling MTP improves the rollout acceptance rate by around 14%. However, on H20 GPUs, overall throughput does not increase and even decreases slightly.
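Why can acceptance improve while throughput drops? Speculative decoding emits more tokens per step but makes each step more expensive, and the balance depends on hardware. The following back-of-envelope sketch uses assumed numbers (both the `p_accept` values and the step-cost ratios are hypothetical, not derived from the measurements above):

```python
# Back-of-envelope model for MTP speculative decoding with num_speculative_tokens=1:
# each step drafts and verifies one extra token, so the expected number of tokens
# emitted per step is 1 + p_accept, while the step itself costs more than a plain
# decode step (relatively more so on compute-poor GPUs such as the H20).
def relative_throughput(p_accept: float, step_cost_ratio: float) -> float:
    """Throughput relative to plain decoding (assumed cost model)."""
    return (1.0 + p_accept) / step_cost_ratio

print(relative_throughput(0.7, 1.2))  # hypothetical cheap verification -> ~1.42x
print(relative_throughput(0.7, 2.0))  # hypothetical costly verification -> 0.85x
```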
The effectiveness of MTP-accelerated Rollout is significantly affected by model size and inference hardware. Key reference information is as follows:
Hardware Tensor Core Performance
| Hardware Model | FP16 Performance (TFLOPS) |
|---|---|
| H20 | 148 |
| H800 | 1,671 |
| H200 | 1,979 |
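To put these numbers in perspective, the snippet below computes each GPU's FP16 throughput relative to the H20 from the table values; the extra verification compute added by speculative decoding is proportionally much cheaper on H800/H200-class hardware:

```python
# FP16 tensor-core throughput from the table above, normalized to the H20.
fp16_tflops = {"H20": 148, "H800": 1671, "H200": 1979}
for gpu, tflops in fp16_tflops.items():
    print(f"{gpu}: {tflops / fp16_tflops['H20']:.1f}x H20")  # H800 ~11.3x, H200 ~13.4x
```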
Measured Performance and Recommendations
Taking MiMo-7B deployed standalone on H20 hardware with SGLang as an example: after enabling MTP speculative decoding, rollout throughput decreases by approximately 50%.
- Current recommendation: do not enable MTP acceleration during the inference phase for now.
- Future plan: the speculative decoding logic in the rollout phase will be further optimized to improve throughput.
SFT training with MTP is also supported, using the same MTP configuration as RL training.
An example configuration for running SFT can be found in examples/sft/gsm8k/run_mimo_megatron_mtp.sh.
SFT Result
The experiment was conducted using the following data:
The result: wandb link
The presence of the MTP layer has limited effect on the main loss; however, when the MTP layer is detached, the mtp_loss converges to a higher value.