examples/on_policy_distillation_trainer/README.md
This trainer trains a student model on its own on-policy rollouts, combining a policy-gradient objective with a distillation loss against a frozen teacher model served by a separate Ray cluster. Compared to pure SFT on teacher generations, on-policy distillation typically closes more of the teacher/student gap at the same compute budget.

| Script | Teachers | Modality | Infer | Train | Platform |
|---|---|---|---|---|---|
| `run_qwen3_8b_fsdp.sh` | single | text | vLLM | FSDP | NVIDIA |
| `run_qwen3_8b_megatron.sh` | single | text | vLLM | Megatron | NVIDIA |
| `run_qwen3_vl_8b_fsdp.sh` | single | VL | vLLM | FSDP | NVIDIA |
| `run_qwen3_8b_mopd_fsdp.sh` | multi | text + VL | vLLM | FSDP | NVIDIA |

Override `STUDENT_MODEL` and `TEACHER_MODEL` via environment variables to swap
model pairs in the single-teacher scripts. The MOPD script exposes per-teacher
overrides.
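
As a rough sketch of the environment-variable override, a single-teacher run might be launched as follows. The model IDs are placeholders chosen for illustration, and the script path assumes the scripts sit next to this README and that the command is run from the repository root; neither is prescribed by the scripts themselves.

```shell
# Illustrative model IDs (assumptions, not defaults taken from the scripts).
export STUDENT_MODEL="Qwen/Qwen3-8B"
export TEACHER_MODEL="Qwen/Qwen3-32B"

# Assumed location of the launch script, relative to the repository root.
script="examples/on_policy_distillation_trainer/run_qwen3_8b_fsdp.sh"

# Launch only if the script is present; otherwise report the path that was tried.
if [ -f "$script" ]; then
  bash "$script"
else
  echo "launch script not found: $script (run from the repository root)"
fi
```

Exporting the variables rather than prefixing them on the command line keeps them visible to any child processes the script starts.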

Key configuration overrides:

- `distillation.enabled=True`
- `distillation.teacher_models.teacher_model.model_path=<HF path>` (single-teacher)
- `+distillation.teacher_models.<name>.{key,model_path,num_replicas,inference.*}` (multi-teacher)
- `distillation.distillation_loss.loss_mode={k1, k3, forward_kl_topk, ...}`
- `distillation.distillation_loss.use_policy_gradient=True|False`
- `distillation.distillation_loss.topk=64`
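
A minimal sketch of how the single-teacher keys could be combined on a command line is shown below. The values are illustrative only: the teacher model ID is a placeholder, and pairing `forward_kl_topk` with `use_policy_gradient=False` is just one possible combination, not a recommended setting. It also assumes the launch script forwards extra arguments to the trainer, which the scripts may or may not do.

```shell
# Collect the overrides as positional parameters (works in plain POSIX sh).
# All values are illustrative assumptions, not verified defaults.
set -- \
  distillation.enabled=True \
  distillation.teacher_models.teacher_model.model_path=Qwen/Qwen3-32B \
  distillation.distillation_loss.loss_mode=forward_kl_topk \
  distillation.distillation_loss.use_policy_gradient=False \
  distillation.distillation_loss.topk=64

# Dry run: print one override per line so the set can be reviewed first.
printf '%s\n' "$@"

# To launch (assumption: the script passes extra arguments through to the trainer):
# bash examples/on_policy_distillation_trainer/run_qwen3_8b_fsdp.sh "$@"
```

Printing the overrides before launching makes it easy to catch a mistyped key, since a misspelled override may otherwise only surface once the job starts.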