Guide to Using MTP in SFT/RL Training and Inference

docs/advance/mtp.md


Author: https://github.com/meituan-search

Last updated: 02/15/2026

1. Scope of Support

RL training is currently supported for MTP-architecture models such as mimo-7B-RL, Qwen-next, and the Deepseek series. The support rules for training and inference engines are as follows:

  • Training Engine: Only supports the mbridge/Megatron-Bridge + megatron combination; other training engines are not compatible at this time;

  • Inference Engine: Compatible with all engines, but the model must be in the corresponding engine's compatibility list;

  • Dependency Versions:

2. MTP Training Configuration (Core Parameters)

The MTP training process can be flexibly controlled through the following configurations. All parameters live under the `actor_rollout_ref.model.mtp` prefix:

| Configuration Scenario | Core Parameters | Description |
| --- | --- | --- |
| Load MTP Parameters Only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
| Full-Parameter MTP Training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | MTP loss applies to all model parameters |
| MTP Parameter-Only Training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freezes the encoder layers and updates only the MTP module parameters; MTP loss applies only to MTP parameters |
| MTP Accelerated Rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | Accelerates inference during the rollout phase via MTP speculative decoding |
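Put together on the command line, an MTP-enabled RL launch might look like the following. This is a minimal sketch: `verl.trainer.main_ppo` is verl's standard PPO entrypoint, but every flag shown besides the `mtp.*` keys from the table is illustrative, and a real launch would also carry model, data, and parallelism settings.

```shell
# Sketch of hydra-style overrides for full-parameter MTP training with
# vLLM-accelerated rollout. Only the mtp.* keys come from the table above;
# all other settings needed for a real run are omitted here.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_train=True \
    actor_rollout_ref.model.mtp.mtp_loss_scaling_factor=0.1 \
    actor_rollout_ref.model.mtp.enable_rollout=True \
    actor_rollout_ref.model.mtp.method="mtp" \
    actor_rollout_ref.model.mtp.num_speculative_tokens=1
```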

3. Experimental Results

The experiment was conducted as follows:

  • model = mimo-7B-math
  • max_response_length = 8k

The experiment curves are available at the wandb link: wandb

Scenarios with No Significant Effect

The following configurations will not have a noticeable impact on training results:

  1. The base model does not carry MTP parameters;

  2. The base model carries MTP parameters, but the MTP module is not trained;

  3. The base model carries MTP parameters and trains MTP, with mtp_loss_scaling_factor=0;

  4. The base model carries MTP parameters, trains MTP and detaches the encoder, with mtp_loss_scaling_factor=0.1.

Scenarios with Significant Effect

Only the following configuration will have a noticeable impact on training results:

  • The base model carries MTP parameters, MTP Loss applies to all model parameters, and mtp_loss_scaling_factor=0.1.

Recommended Training Method

It is recommended to adopt the `detach_encoder=True` approach for MTP training, so that the MTP module is trained without affecting the main model's parameters.
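For the recommended setup, the overrides reduce to the parameter-only training scenario from Section 2. A minimal sketch, assuming verl's standard `main_ppo` entrypoint and omitting all non-MTP flags a real run needs:

```shell
# Recommended: freeze the encoder so MTP loss updates only the MTP module.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.mtp.enable=True \
    actor_rollout_ref.model.mtp.enable_train=True \
    actor_rollout_ref.model.mtp.detach_encoder=True \
    actor_rollout_ref.model.mtp.mtp_loss_scaling_factor=0.1
```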

4. Performance Notes for MTP in Rollout Inference

Enabling MTP improves the rollout acceptance rate by around 14%. However, on H20 GPUs, overall throughput does not increase and even decreases slightly.

The effectiveness of MTP-accelerated Rollout is significantly affected by model size and inference hardware. Key reference information is as follows:

Hardware Tensor Core Performance

| Hardware Model | FP16 Performance (TFLOPS) |
| --- | --- |
| H20 | 148 |
| H800 | 1,671 |
| H200 | 1,979 |

Measured Performance and Recommendations

Taking the mimo-7B model deployed separately on H20 hardware using SGLang as an example: After enabling MTP speculative decoding, the Rollout throughput decreases by approximately 50%.

  • Current priority recommendation: Do not enable MTP acceleration during the inference phase for now;

  • Future planning: Further optimization of the speculative logic in the Rollout phase will be conducted to improve throughput performance.

5. SFT Training

SFT training with MTP is supported and uses the same MTP configuration as RL training.

An example configuration for running SFT can be found in `examples/sft/gsm8k/run_mimo_megatron_mtp.sh`.
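Assuming a checkout of the verl repository with its dependencies installed, the bundled example can be launched directly; the model and dataset paths inside the script may need adjusting for your environment.

```shell
# Run the bundled MTP SFT example (gsm8k, mimo-7B, Megatron backend).
# Execute from the repository root; edit the script's paths as needed.
bash examples/sft/gsm8k/run_mimo_megatron_mtp.sh
```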

SFT Results

The experiment was conducted with the following setup:

  • model = mimo-7B-math
  • dataset = gsm8k

The result: wandb link

The presence of the MTP layer has limited effect on the main loss. However, when the MTP layer is detached, `mtp_loss` converges to a higher value.