Back to Hermes Agent

Online RL Methods

optional-skills/mlops/training/trl-fine-tuning/references/online-rl.md

2026.6.51.9 KB
Original Source

Online RL Methods

Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.

Overview

Online RL generates completions during training and optimizes based on rewards.

PPO (Proximal Policy Optimization)

Classic RL algorithm for LLM alignment.

Basic Usage

bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path reward-model \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir model-ppo \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000 \
    --num_ppo_epochs 4 \
    --kl_coef 0.05

Key Parameters

  • kl_coef: KL penalty (0.05-0.2)
  • num_ppo_epochs: Epochs per batch (2-4)
  • cliprange: PPO clip (0.1-0.3)
  • vf_coef: Value function coef (0.1)

GRPO (Group Relative Policy Optimization)

Memory-efficient online RL.

Basic Usage

python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Define reward function
def reward_func(completions, **kwargs):
    return [len(set(c.split())) for c in completions]

config = GRPOConfig(
    output_dir="model-grpo",
    num_generations=4,  # Completions per prompt
    max_new_tokens=128
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train")
)
trainer.train()

Key Parameters

  • num_generations: 2-8 completions
  • max_new_tokens: 64-256
  • Learning rate: 1e-5 to 1e-4

Memory Comparison

MethodMemory (7B)SpeedUse Case
PPO40GBMediumMaximum control
GRPO24GBFastMemory-constrained
OnlineDPO28GBFastNo reward model

References