optional-skills/mlops/training/trl-fine-tuning/references/online-rl.md
Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.
Online RL generates completions during training and optimizes based on rewards.
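To make the generate, score, optimize cycle concrete, here is a toy sketch in plain Python. It is not how TRL implements any of these trainers: the "policy" is only a softmax over a few canned completions, and `softmax`, `unique_word_reward`, and `toy_online_rl` are hypothetical names chosen for illustration. The update is a basic REINFORCE step with an expected-reward baseline, assumed here as the simplest stand-in for the real algorithms.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unique_word_reward(completion):
    """Hypothetical reward: number of distinct words in the completion."""
    return float(len(set(completion.split())))


def toy_online_rl(candidates, reward_fn, steps=200, lr=0.5, seed=0):
    """Toy online RL loop over a fixed set of candidate completions."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Generate: sample a completion from the current policy.
        idx = rng.choices(range(len(candidates)), weights=probs)[0]
        # 2. Score: compute the reward for the sampled completion.
        reward = reward_fn(candidates[idx])
        # 3. Optimize: REINFORCE step with the expected reward as baseline.
        baseline = sum(p * reward_fn(c) for p, c in zip(probs, candidates))
        advantage = reward - baseline
        for j in range(len(logits)):
            grad_logp = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad_logp
    return softmax(logits)


candidates = ["ok", "a b", "a b c d"]
final_probs = toy_online_rl(candidates, unique_word_reward)
print(dict(zip(candidates, [round(p, 3) for p in final_probs])))
```

Because the samples come from the policy being trained, the data distribution shifts as training proceeds. That is the defining property of the online methods covered here.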
PPO (Proximal Policy Optimization) is the classic RL algorithm for LLM alignment.
python -m trl.scripts.ppo \
  --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
  --reward_model_path reward-model \
  --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
  --output_dir model-ppo \
  --learning_rate 3e-6 \
  --per_device_train_batch_size 64 \
  --total_episodes 10000 \
  --num_ppo_epochs 4 \
  --kl_coef 0.05
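As a rough illustration of what the KL coefficient and the PPO clip range control, the sketch below computes the standard clipped surrogate loss and a KL-penalized reward for a single token. It follows the textbook PPO formulation, not TRL's actual code. The function names are hypothetical, and collapsing the per-token KL penalty and the reward-model score into one scalar is a simplifying assumption.

```python
import math


def clipped_policy_loss(logprob_new, logprob_old, advantage, cliprange=0.2):
    """PPO clipped surrogate loss for one token (a value to minimize)."""
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    # Take the more pessimistic of the unclipped and clipped terms.
    return max(-advantage * ratio, -advantage * clipped_ratio)


def kl_penalized_reward(score, logprob_policy, logprob_ref, kl_coef=0.05):
    """Reward-model score minus a KL penalty toward the reference model."""
    # Simple per-token KL estimate: difference of log-probabilities.
    kl_estimate = logprob_policy - logprob_ref
    return score - kl_coef * kl_estimate


# With a positive advantage, a ratio above 1 + cliprange earns no extra credit.
print(clipped_policy_loss(math.log(2.0), 0.0, advantage=1.0, cliprange=0.2))
# A policy that drifts from the reference model pays a small penalty.
print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_ref=-2.0))
```

A larger `kl_coef` keeps the policy closer to the reference model, while a smaller clip range limits how far a single batch can move the policy across its optimization epochs.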
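Key PPO parameters:
- `kl_coef`: KL penalty coefficient (0.05-0.2)
- `num_ppo_epochs`: optimization epochs per batch (2-4)
- `cliprange`: PPO clip range (0.1-0.3)
- `vf_coef`: value function loss coefficient (0.1)

GRPO (Group Relative Policy Optimization) is a memory-efficient online RL method.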
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
# Reward function: score each completion by its number of unique words
def reward_func(completions, **kwargs):
    return [float(len(set(c.split()))) for c in completions]

config = GRPOConfig(
    output_dir="model-grpo",
    num_generations=4,  # Completions sampled per prompt
    max_completion_length=128,  # Max tokens generated per completion
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
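GRPO saves memory largely because it does not train a separate value model. Instead, each completion is compared against the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. `group_relative_advantages` and the `eps` value are illustrative assumptions, not TRL's API, and the rewards are expected to be ordered so that the completions for one prompt sit next to each other.

```python
import statistics


def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each group of completions for one prompt."""
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.fmean(group)
        # Sample standard deviation; a single-item group has no spread.
        std = statistics.stdev(group) if len(group) > 1 else 0.0
        # eps avoids division by zero when all rewards in a group are equal.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two prompts, four completions each (rewards as unique-word counts).
rewards = [3.0, 5.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0]
print(group_relative_advantages(rewards, num_generations=4))
```

When every completion in a group receives the same reward, as in the second prompt above, all advantages are zero and that prompt contributes no learning signal. This is one reason a reward function needs to separate good completions from bad ones.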
Key GRPO parameters:
- `num_generations`: 2-8 completions per prompt
- `max_completion_length` (maximum new tokens per completion): 64-256

| Method | Memory (7B) | Speed | Use Case |
|---|---|---|---|
| PPO | 40GB | Medium | Maximum control |
| GRPO | 24GB | Fast | Memory-constrained |
| OnlineDPO | 28GB | Fast | No reward model |