# Quickstart
TRL is a comprehensive library for post-training foundation models using techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
Get started instantly with TRL's most popular trainers. Each example uses compact models for quick experimentation.
```python
# SFT: Fine-tune on instruction-following data
from datasets import load_dataset
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```
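The trainers also accept a preloaded model instead of a model ID string, which is useful when you want to control how the model is loaded. A minimal sketch, assuming you want to pick the dtype yourself (the `processing_class` argument is optional; TRL can usually resolve the tokenizer from the model name):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# Load the model and tokenizer explicitly instead of passing a string
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```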
```python
# GRPO: Optimize for verifiable rewards
from datasets import load_dataset
from trl import GRPOTrainer
from trl.rewards import accuracy_reward

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
    reward_funcs=accuracy_reward,
)
trainer.train()
```
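Beyond the built-in rewards, `reward_funcs` also accepts plain Python callables that receive the generated completions and return one score per completion. A minimal sketch with a hypothetical length-based reward (`brevity_reward` is illustrative, not part of TRL):

```python
from datasets import load_dataset
from trl import GRPOTrainer

def brevity_reward(completions, **kwargs):
    """Hypothetical reward: prefer completions under 200 characters."""
    rewards = []
    for completion in completions:
        # Conversational datasets yield lists of message dicts; plain datasets yield strings
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if len(text) < 200 else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
    reward_funcs=brevity_reward,
)
trainer.train()
```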
```python
# DPO: Align with human preferences
from datasets import load_dataset
from trl import DPOTrainer

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
```
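DPO learns from preference pairs: each record pairs a preferred ("chosen") response with a rejected one, and TRL can infer the shared prompt from their common prefix. A sketch of what one conversational record looks like (illustrative values):

```python
# One preference record in conversational format
example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "Green."},
    ],
}
```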
```python
# Reward: Train a reward model on preference pairs
from datasets import load_dataset
from trl import RewardTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
```
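Once training finishes, the model can be saved like any Hugging Face `Trainer` run and reloaded for scoring. A sketch, assuming the reward model was saved with a sequence-classification head (the local path `Qwen2.5-0.5B-Reward` is hypothetical):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

trainer.save_model("Qwen2.5-0.5B-Reward")  # hypothetical output path

model = AutoModelForSequenceClassification.from_pretrained("Qwen2.5-0.5B-Reward")
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-Reward")

inputs = tokenizer("The sky is blue.", return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # higher score = more preferred
```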
Skip the code entirely and train directly from your terminal:
```bash
# SFT: Fine-tune on instructions
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara

# DPO: Align with preferences
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized

# Reward: Train a reward model
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized
```
Running out of memory? Reduce the batch size and enable memory optimizations:
<hfoptions id="batch_size"> <hfoption id="SFT">training_args = SFTConfig(
per_device_train_batch_size=1, # Start small
gradient_accumulation_steps=8, # Maintain effective batch size
)
training_args = DPOConfig(
per_device_train_batch_size=1, # Start small
gradient_accumulation_steps=8, # Maintain effective batch size
)
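Beyond batch size, a couple of standard `transformers` training options can cut memory further. A sketch, assuming your hardware supports bfloat16:

```python
from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # Trade extra compute for lower activation memory
    bf16=True,                    # Half-precision training on bf16-capable hardware
)
```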
If training is unstable or the loss does not decrease, try adjusting the learning rate:
```python
from trl import SFTConfig

training_args = SFTConfig(learning_rate=2e-5)  # Good starting point for most SFT runs
```
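Whichever config you build, wire it into the trainer through the `args` parameter:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```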
For more help, open an issue on the [TRL GitHub repository](https://github.com/huggingface/trl/issues).