trl/skills/trl-training/SKILL.md
You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
Fine-tune language models on instruction-following or conversational datasets.
Full training:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Train with LoRA adapters:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Align models using preference data (chosen/rejected pairs).
Full training:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
Train with LoRA adapters:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
Train models using reward functions or LLM-as-a-judge for evaluating generations and providing rewards.
Basic usage:
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
Train a reward model to score text quality for RLHF.
Full training:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
Train with LoRA adapters:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
trl sft --config sft_config.yaml
Override config values:
trl sft --config sft_config.yaml --learning_rate 1.0e-5
TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
trl sft \
--config sft_config.yaml \
--num_processes 4
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
Custom Accelerate config:
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
trl sft --config sft_config.yaml --accelerate_config zero3
--per_device_train_batch_size and increase --gradient_accumulation_steps--use_peft for LoRA training--gradient_checkpointing to save memory--dataset_config for multi-config datasetsfrom datasets import load_dataset; ds = load_dataset(name)hf auth login--packing for short sequences--per_device_train_batch_size if memory allows--tf32 for faster computation on Ampere GPUs--bf16 on supported hardware--num_processes--temperature and --top_p for generation--use_peft for faster training and lower memory--report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking--output_dirWhen helping users with TRL: