examples/mtp_trainer/README.md
MTP (multi-token prediction) adds an auxiliary token-prediction head (a speculative / draft head) that is used during training. It is currently supported on MiMo-7B-RL with the Megatron backend.

| Script | Infer | Train | Mode | Platform |
|---|---|---|---|---|
| `run_mimo_7b_mtp_megatron.sh` | SGLang | Megatron | Sync hybrid-engine | NVIDIA |
| `run_mimo_7b_mtp_rl_vllm_sgl_megatron.sh` | SGLang / vLLM | Megatron | Sync hybrid-engine, slime-aligned RL/EAGLE setup | NVIDIA |
| `run_mimo_7b_mtp_fully_async_megatron_multinode.sh` | SGLang | Megatron | Fully-async split-placement (DAPO) | NVIDIA |

IMPORTANT: after downloading MiMo-7B-RL, set `max_position_embeddings: 32768` in its `config.json`.
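
The `config.json` edit above can be scripted rather than done by hand. The following is a minimal sketch, not part of the example scripts: the helper name `set_max_position_embeddings` is hypothetical, and it assumes the checkpoint directory contains a standard JSON `config.json` with `max_position_embeddings` as a top-level key.

```python
import json
from pathlib import Path


def set_max_position_embeddings(model_dir: str, value: int = 32768) -> dict:
    """Set max_position_embeddings in <model_dir>/config.json and return the new config.

    Hypothetical helper for illustration; all other keys are left unchanged.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = value
    # Write the file back with the updated value.
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it once after the download, passing the local directory that holds the MiMo-7B-RL checkpoint.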

MTP is enabled through these overrides:

    actor_rollout_ref.model.mtp.enable=True
    actor_rollout_ref.model.mtp.enable_train=True
    actor_rollout_ref.model.mtp.mtp_loss_scaling_factor=0.1
    actor_rollout_ref.model.mtp.detach_encoder=True

The `*_multinode.sh` variant uses the fully-async one-step-off trainer (`verl.experimental.fully_async_policy.fully_async_main`). Scale it via:

    TRAIN_NNODES=4 TRAIN_NGPUS_PER_NODE=8 \
    ROLLOUT_NNODES=4 ROLLOUT_NGPUS_PER_NODE=8 \
    bash examples/mtp_trainer/run_mimo_7b_mtp_fully_async_megatron_multinode.sh

With no overrides, the script defaults to a single-node 4+4 split (trainer + rollout) for a smoke test, matching the layout of the historical `..._math_megatron_4_4.sh` script.
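
To check how many GPUs a given setting of the four sizing variables asks for, the arithmetic can be sketched as below. This is an illustration only: the helper name `placement_from_env` is hypothetical, the fallback values (1 node and 4 GPUs per role) are an assumption chosen to mirror the single-node 4+4 default, and the total assumes trainer and rollout occupy disjoint GPUs, as split placement implies. The script's own defaults are authoritative.

```python
import os


def placement_from_env(env=None) -> dict:
    """Return trainer, rollout, and total GPU counts from the sizing variables.

    Hypothetical helper; the fallback values are assumptions, not read from the script.
    """
    env = os.environ if env is None else env
    train_nodes = int(env.get("TRAIN_NNODES", 1))
    train_gpus_per_node = int(env.get("TRAIN_NGPUS_PER_NODE", 4))
    rollout_nodes = int(env.get("ROLLOUT_NNODES", 1))
    rollout_gpus_per_node = int(env.get("ROLLOUT_NGPUS_PER_NODE", 4))

    train_gpus = train_nodes * train_gpus_per_node
    rollout_gpus = rollout_nodes * rollout_gpus_per_node
    return {
        "train_gpus": train_gpus,
        "rollout_gpus": rollout_gpus,
        # Split placement: the two roles do not share GPUs, so the counts add up.
        "total_gpus": train_gpus + rollout_gpus,
    }
```

With `TRAIN_NNODES=4 TRAIN_NGPUS_PER_NODE=8 ROLLOUT_NNODES=4 ROLLOUT_NGPUS_PER_NODE=8`, this gives 32 trainer GPUs, 32 rollout GPUs, and 64 in total.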