# SDFT Trainer
Self-Distilled Fine-Tuning (SDFT) is described in *Self-Training with On-Policy Self-Distillation for Language Model Alignment*.
The TRL implementation adapts SDFT to the experimental trainer API while reusing the shared self-distillation infrastructure also used by SDPO.
In the current TRL implementation:
- The teacher shares weights with the student (adapters disabled for PEFT, forward passes under `no_grad` for non-PEFT); use `sync_ref_model=True` for an EMA teacher.
- Each dataset example provides `prompt` and `privileged_context`.
- `privileged_context` contains only the extra teacher-only information; the trainer combines it with `prompt` to build the teacher prompt.
- `teacher_prompt_template` controls how `prompt` and `privileged_context` are combined into the teacher prompt.
- `generate_from_teacher` controls whether on-policy generation uses the student prompt or the teacher prompt.
- `num_loss_tokens_to_skip` can exclude initial completion tokens from the distillation loss.
- `use_vllm=True` is supported for generation; when generating from the teacher, the generation prompt is `prompt` plus `privileged_context`.
```python
from datasets import Dataset
from trl.experimental.sdft import SDFTConfig, SDFTTrainer

dataset = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
        "privileged_context": ["Example answer: 4."],
    }
)

training_args = SDFTConfig(
    output_dir="sdft-model",
    distillation_alpha=0.5,
    distillation_topk=5,
    max_completion_length=64,
)

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
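The config above sets `distillation_alpha` and `distillation_topk`. As a rough intuition only (this is an assumption about what these knobs do, not the trainer's actual loss; see the SDFT source for the real definition), a top-k distillation loss restricts the KL term to the teacher's `k` most likely tokens at each position and uses `alpha` to weight the two KL directions:

```python
import torch.nn.functional as F


def topk_distillation_loss(student_logits, teacher_logits, k=5, alpha=0.5):
    """Rough sketch of a top-k distillation loss; NOT the trainer's exact code."""
    # Keep only the teacher's k most likely tokens at each position.
    teacher_topk, topk_ids = teacher_logits.topk(k, dim=-1)  # (B, S, k)
    student_topk = student_logits.gather(-1, topk_ids)       # (B, S, k)

    # Renormalize both distributions over the truncated top-k support.
    teacher_logp = F.log_softmax(teacher_topk, dim=-1)
    student_logp = F.log_softmax(student_topk, dim=-1)

    # KL(teacher || student) over the truncated support.
    kl_t_s = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
    # KL(student || teacher) over the truncated support.
    kl_s_t = F.kl_div(teacher_logp, student_logp, log_target=True, reduction="batchmean")
    # One plausible reading of `distillation_alpha`: interpolate the two directions.
    return alpha * kl_t_s + (1.0 - alpha) * kl_s_t
```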
To generate from the teacher-conditioned prompt instead of the student prompt, set `generate_from_teacher=True`.
To customize how the teacher prompt is built, set `teacher_prompt_template` on [`SDFTConfig`].
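For example, both options can be set together. The `{prompt}`/`{privileged_context}` placeholders below are an assumption about the template format; consult the [`SDFTConfig`] documentation for the exact contract:

```python
from trl.experimental.sdft import SDFTConfig

training_args = SDFTConfig(
    output_dir="sdft-model",
    # Generate on-policy completions from the teacher-conditioned prompt.
    generate_from_teacher=True,
    # Placeholder names are illustrative; check the SDFTConfig docs for the
    # exact format `teacher_prompt_template` expects.
    teacher_prompt_template="{prompt}\n\nHint: {privileged_context}",
)
```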
## Dataset format

Each example must provide:

- `prompt`: the student-facing prompt
- `privileged_context`: only the extra teacher-only information, such as a demonstration, hint, or privileged feedback

Both standard text prompts and conversational prompts are supported by the trainer's prompt handling.
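The quick-start dataset above uses a conversational prompt; the same example with a standard text prompt looks like this:

```python
from datasets import Dataset

# Same example as the quick start, but with a plain-text prompt
# instead of a conversational (list-of-messages) prompt.
dataset = Dataset.from_dict(
    {
        "prompt": ["Solve 2+2."],
        "privileged_context": ["Example answer: 4."],
    }
)
```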
## Callback hooks

The trainer emits a small set of callback hooks that are useful for debugging, observability, and tests. These hooks are intended as practical integration points for experimental self-distillation workflows.
Shared self-distillation hooks:

- `on_self_distillation_batch_prepared`: fired when a self-distillation batch is ready. The payload includes `prompt_ids`, `completion_ids`, and `old_per_token_logps` when importance-sampling clipping inputs are available.
- `on_generation_batch_built`: fired when a new buffered generation batch is created. The payload includes `generate_every` and `steps_per_generation`.

SDFT-specific hook:

- `on_generation_prompts_selected`: fired when SDFT chooses the prompt source for on-policy generation. The payload includes the selected `generation_prompts` and the corresponding `generation_prompt_text`.
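A sketch of how these hooks might be consumed. This is hypothetical: the payloads are treated as dicts, and how such an object is attached to the trainer is not shown; consult the trainer source for the actual wiring:

```python
class DistillationDebugHooks:
    """Hypothetical hook consumer: the method names match the hooks above,
    but the registration mechanism is an assumption, not documented TRL API."""

    def on_self_distillation_batch_prepared(self, payload):
        # Sanity-check the shapes of the prepared distillation batch.
        print("prompt_ids:", payload["prompt_ids"].shape)
        print("completion_ids:", payload["completion_ids"].shape)

    def on_generation_prompts_selected(self, payload):
        # Inspect which prompt source SDFT chose for on-policy generation.
        print("generation prompt:", payload["generation_prompt_text"][0])
```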
## Command-line script

Use `trl/experimental/sdft/sdft.py` to launch SDFT training from the command line. The script supports any causal LM from the Hub, custom local datasets via `--dataset_path`, and PEFT/LoRA via the standard `ModelConfig` flags.

```bash
python trl/experimental/sdft/sdft.py \
    --model_name_or_path Qwen/Qwen3-0.6B \
    --dataset_name your-org/your-dataset \
    --output_dir outputs/sdft-qwen3-0.6b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --max_prompt_length 1024 \
    --max_completion_length 512 \
    --generate_from_teacher \
    --sync_ref_model \
    --ref_model_sync_steps 1 \
    --ref_model_mixup_alpha 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --report_to wandb
```
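The `--sync_ref_model`, `--ref_model_sync_steps`, and `--ref_model_mixup_alpha` flags enable the EMA teacher mentioned earlier. In Python they map to config fields of the same names (assuming `SDFTConfig` exposes TRL's standard reference-model sync options, as the CLI flags suggest):

```python
from trl.experimental.sdft import SDFTConfig

# Assumes SDFTConfig exposes TRL's standard reference-model sync fields,
# mirroring the --sync_ref_model flags in the command above.
training_args = SDFTConfig(
    output_dir="outputs/sdft-ema",
    sync_ref_model=True,         # keep an EMA teacher in sync with the student
    ref_model_sync_steps=1,      # sync after every step
    ref_model_mixup_alpha=0.01,  # mixing coefficient used at each sync
)
```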
## SDFTConfig

[[autodoc]] experimental.sdft.SDFTConfig
## SDFTTrainer

[[autodoc]] experimental.sdft.SDFTTrainer
    - train
    - save_model
    - push_to_hub