On-Policy Distillation

examples/on_policy_distillation_trainer/README.md

This trainer updates a student model on its own on-policy rollouts, combining a policy-gradient objective with a distillation loss against a frozen teacher model served by a separate Ray cluster. Compared with pure SFT on teacher generations, on-policy distillation typically closes more of the teacher/student gap at the same compute budget.

Canonical Scripts

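| Script | Teachers | Modality | Infer | Train | Platform |
| --- | --- | --- | --- | --- | --- |
| run_qwen3_8b_fsdp.sh | single | text | vLLM | FSDP | NVIDIA |
| run_qwen3_8b_megatron.sh | single | text | vLLM | Megatron | NVIDIA |
| run_qwen3_vl_8b_fsdp.sh | single | VL | vLLM | FSDP | NVIDIA |
| run_qwen3_8b_mopd_fsdp.sh | multi | text + VL | vLLM | FSDP | NVIDIA |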

Set the STUDENT_MODEL and TEACHER_MODEL environment variables to swap model pairs in the single-teacher scripts. The MOPD script (run_qwen3_8b_mopd_fsdp.sh) exposes per-teacher overrides.
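
The environment-variable override can also be driven from a Python launcher. The following is a minimal sketch, not part of this example directory: `build_launch` and `SCRIPT_DIR` are hypothetical names, the script location is assumed to be the directory holding this README, and the model paths are placeholders.

```python
import os
import shlex

# Assumption: the canonical scripts sit next to this README.
SCRIPT_DIR = "examples/on_policy_distillation_trainer"


def build_launch(script_name, student_model, teacher_model, base_env=None):
    """Return (argv, env) for a single-teacher script with model overrides.

    Hypothetical helper: it only prepares the command and environment,
    it does not start training.
    """
    # Copy the environment so the caller's process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["STUDENT_MODEL"] = student_model
    env["TEACHER_MODEL"] = teacher_model
    argv = ["bash", f"{SCRIPT_DIR}/{script_name}"]
    return argv, env


# Placeholder model paths; substitute real HF paths.
argv, env = build_launch(
    "run_qwen3_8b_fsdp.sh",
    student_model="path/to/student",
    teacher_model="path/to/teacher",
)
print(shlex.join(argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```

Returning the command and environment instead of running them keeps the helper easy to inspect before committing cluster resources.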

Key Flags

  • distillation.enabled=True
  • distillation.teacher_models.teacher_model.model_path=<HF path> (single-teacher)
  • +distillation.teacher_models.<name>.{key,model_path,num_replicas,inference.*} (multi-teacher)
  • distillation.distillation_loss.loss_mode={k1, k3, forward_kl_topk, ...}
  • distillation.distillation_loss.use_policy_gradient=True|False
  • distillation.distillation_loss.topk=64
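
To give a feel for what the `loss_mode` names refer to, here is a small pure-Python sketch of the quantities they most plausibly denote. It is an illustration under stated assumptions, not the trainer's implementation: `k1` and `k3` are taken to be the standard single-sample KL estimators evaluated at a token sampled from the student, and `forward_kl_topk` is taken to be the teacher-to-student KL restricted to the teacher's top-k tokens without renormalization. All function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def k1(student_logp, teacher_logp):
    """k1 estimator: log-ratio at the sampled token (can be negative)."""
    return student_logp - teacher_logp


def k3(student_logp, teacher_logp):
    """k3 estimator: r - 1 - log r with r = p_teacher / p_student; never negative."""
    log_r = teacher_logp - student_logp
    return math.exp(log_r) - 1.0 - log_r


def forward_kl_topk(student_logits, teacher_logits, topk):
    """Sum of p_T * (log p_T - log p_S) over the teacher's top-k tokens.

    Assumption: no renormalization over the top-k set.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    top = sorted(range(len(t)), key=lambda i: t[i], reverse=True)[:topk]
    return sum(math.exp(t[i]) * (t[i] - s[i]) for i in top)


# Toy 4-token vocabulary; token 0 plays the role of the student's sample.
student_logits = [2.0, 1.0, 0.1, -1.0]
teacher_logits = [3.0, 0.5, 0.2, -2.0]
s_lp = log_softmax(student_logits)[0]
t_lp = log_softmax(teacher_logits)[0]
print("k1:", k1(s_lp, t_lp))
print("k3:", k3(s_lp, t_lp))
print("forward_kl_topk (k=2):", forward_kl_topk(student_logits, teacher_logits, 2))
```

The sampled-token estimators need only one log-probability per position from the teacher, while a top-k loss needs the teacher's top-k log-probabilities, which is presumably why `topk` is a separate setting.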