Group Relative Policy Optimization (GRPO)

examples/grpo_trainer/README.md

In reinforcement learning, classic algorithms like PPO rely on a "critic" model to estimate the value of actions, guiding the learning process. However, training this critic model can be resource-intensive.

GRPO simplifies this process by eliminating the need for a separate critic model. Instead, it operates as follows:

  • Group Sampling: for a given problem, the model generates multiple possible solutions, forming a "group" of outputs.
  • Reward Assignment: each solution is evaluated and assigned a reward based on its correctness or quality.
  • Baseline Calculation: the average reward of the group serves as a baseline.
  • Policy Update: the model updates its parameters by comparing each solution's reward to the group baseline, reinforcing better-than-average solutions and discouraging worse-than-average ones.
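The four steps above can be sketched in plain Python. This is a minimal illustration, not verl's implementation: the function name `group_relative_advantages`, the `prompt_ids` labels, and the `eps` constant are all hypothetical, and the use of the population standard deviation is an arbitrary choice for the sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(prompt_ids, rewards, normalize_by_std=True, eps=1e-6):
    """Return one advantage per sampled solution, relative to its own group."""
    # Group Sampling: collect the rewards of all solutions for the same prompt.
    groups = defaultdict(list)
    for pid, r in zip(prompt_ids, rewards):
        groups[pid].append(r)

    # Baseline Calculation: group mean, plus the group spread for optional scaling.
    stats = {pid: (mean(rs), pstdev(rs)) for pid, rs in groups.items()}

    advantages = []
    for pid, r in zip(prompt_ids, rewards):
        baseline, spread = stats[pid]
        adv = r - baseline  # positive = better than the group average
        if normalize_by_std:
            adv /= spread + eps  # eps avoids division by zero when all rewards tie
        advantages.append(adv)
    return advantages


# Two prompts, four sampled solutions each; reward 1.0 = correct, 0.0 = wrong.
prompt_ids = ["p0"] * 4 + ["p1"] * 4
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(prompt_ids, rewards)
```

Solutions with a positive advantage are reinforced in the policy update and those with a negative advantage are discouraged. No learned critic is involved at any point.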

For more details, refer to the original paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

Key Components

  • No Value Function (Critic-less): unlike PPO, GRPO does not train a separate value network (critic).
  • Group Sampling (Grouped Rollouts): instead of evaluating one rollout per input, GRPO generates multiple completions (responses) from the current policy for each prompt. This set of completions is referred to as a group.
  • Relative Rewards: within each group, completions are scored (e.g., based on correctness), and rewards are normalized relative to the group.
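Written as a formula, the relative reward for outcome-level rewards in the DeepSeekMath paper takes the following form. The symbols are notation for this note: $G$ is the group size, $r_i$ is the reward of the $i$-th completion, and $\hat{A}_{i,t}$ is the advantage applied to token $t$ of that completion.

```latex
\hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_1, \dots, r_G\}\bigr)}
                         {\operatorname{std}\bigl(\{r_1, \dots, r_G\}\bigr)},
\qquad i = 1, \dots, G
```

Every token of a completion shares the same advantage, so the group mean plays the role that the critic's value estimate plays in PPO.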

Important knobs

  • actor_rollout_ref.rollout.n: per-prompt sample count (required >= 2 for GRPO).
  • data.train_batch_size: prompts per global step. Total trajectories = train_batch_size * rollout.n.
  • actor_rollout_ref.actor.ppo_mini_batch_size: global mini-batch for actor updates (must divide train_batch_size * n).
  • actor_rollout_ref.actor.ppo_epochs: inner-loop epochs over the sampled trajectories.
  • actor_rollout_ref.actor.clip_ratio: PPO clip range, default 0.2.
  • actor_rollout_ref.actor.loss_agg_mode: token-mean (default), seq-mean-token-sum, seq-mean-token-mean, or seq-mean-token-sum-norm (used by Dr. GRPO).
  • actor_rollout_ref.actor.use_kl_loss=True + actor_rollout_ref.actor.kl_loss_coef / kl_loss_type: regularise toward the reference policy via KL loss on the actor.
  • algorithm.adv_estimator=grpo.
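The batch-size relationships stated in the list can be checked before launching a job. The helper below is a hypothetical sketch, not part of verl: it only encodes the rules as written above, and the mini-batch value of 256 is an illustrative assumption.

```python
def check_grpo_batch_knobs(train_batch_size, rollout_n, ppo_mini_batch_size):
    """Validate the batch knobs and return the trajectories per global step."""
    if rollout_n < 2:
        # A group of one has no peers, so its reward always equals the baseline.
        raise ValueError("rollout.n must be >= 2 for GRPO")

    total_trajectories = train_batch_size * rollout_n
    if total_trajectories % ppo_mini_batch_size != 0:
        raise ValueError(
            f"ppo_mini_batch_size={ppo_mini_batch_size} does not divide "
            f"train_batch_size * n = {total_trajectories}"
        )
    return total_trajectories


# 2048 prompts per step, 8 samples per prompt, hypothetical mini-batch of 256.
total = check_grpo_batch_knobs(train_batch_size=2048, rollout_n=8, ppo_mini_batch_size=256)
print(total)  # 16384
```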

Dr. GRPO

To enable Dr. GRPO (see Understanding R1-Zero-Like Training), add the following settings on top of the canonical GRPO overrides:

actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum-norm
actor_rollout_ref.actor.use_kl_loss=False
algorithm.norm_adv_by_std_in_grpo=False
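To see why the aggregation mode matters, the sketch below compares the four `loss_agg_mode` names on a toy batch. It is an illustration under assumptions, not verl's code: the function `aggregate_loss` and the `max_len` parameter are hypothetical, and the treatment of `seq-mean-token-sum-norm` (dividing each sequence's token sum by a fixed length constant rather than by its own length) is an assumption based on the Dr. GRPO paper. Check the verl source for the exact normalization.

```python
def aggregate_loss(token_losses, mode, max_len=None):
    """Reduce per-token losses to a scalar.

    token_losses: one list of token losses per sequence (no padding).
    max_len: fixed length constant, only used by "seq-mean-token-sum-norm".
    """
    seq_sums = [sum(seq) for seq in token_losses]
    n_seqs = len(token_losses)
    n_tokens = sum(len(seq) for seq in token_losses)

    if mode == "token-mean":
        # Every token in the batch carries equal weight.
        return sum(seq_sums) / n_tokens
    if mode == "seq-mean-token-sum":
        # Sum within each sequence, then average over sequences.
        return sum(seq_sums) / n_seqs
    if mode == "seq-mean-token-mean":
        # Average within each sequence (divides by its own length), then over sequences.
        return sum(s / len(seq) for s, seq in zip(seq_sums, token_losses)) / n_seqs
    if mode == "seq-mean-token-sum-norm":
        # Assumed semantics: divide by a constant, so response length does not
        # change how strongly a sequence is weighted.
        return sum(s / max_len for s in seq_sums) / n_seqs
    raise ValueError(f"unknown loss_agg_mode: {mode}")


# One short and one long response with the same total loss.
batch = [[1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
results = {
    "token-mean": aggregate_loss(batch, "token-mean"),
    "seq-mean-token-sum": aggregate_loss(batch, "seq-mean-token-sum"),
    "seq-mean-token-mean": aggregate_loss(batch, "seq-mean-token-mean"),
    "seq-mean-token-sum-norm": aggregate_loss(batch, "seq-mean-token-sum-norm", max_len=4),
}
```

Dividing by each sequence's own length, as `seq-mean-token-mean` does, is the length-dependent normalization that Dr. GRPO removes. Setting `algorithm.norm_adv_by_std_in_grpo=False` likewise removes the division of the advantage by the group's reward standard deviation.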

Canonical scripts

All scripts in this directory follow the naming convention:

run_<model>_<train-backend>[_<platform-or-variant>].sh

Where:

  • <model> is the canonical size for a model family (qwen3_8b for dense text, qwen3_30b_a3b for MoE, qwen2_5_vl_7b / qwen3_vl_8b for vision, qwen3_235b_a22b / deepseek_v3_671b for scale demos).
  • <train-backend> ∈ {fsdp, megatron, mindspeed}.
  • <platform-or-variant> is used only for hardware-specific variants such as gb200, fp8, veomni, or MindSpeed NPU scripts.
  • INFER_BACKEND selects rollout backend inside scripts that support multiple choices (vllm, sglang, or trtllm).
  • DEVICE selects GPU/NPU paths inside scripts that support both platforms.
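The naming convention is regular enough to parse mechanically, which can help when listing or filtering scripts. The following is a hypothetical helper, not something shipped in this directory: it recognizes only the three train backends listed above, and the second file name in the example is taken from the suffix note later in this README.

```python
import re

# run_<model>_<train-backend>[_<platform-or-variant>].sh
SCRIPT_NAME = re.compile(
    r"^run_(?P<model>.+?)_(?P<backend>fsdp|megatron|mindspeed)"
    r"(?:_(?P<variant>.+))?\.sh$"
)


def parse_script_name(filename):
    """Split a script file name into model, backend, and optional variant.

    Returns None when the name does not follow the convention.
    """
    match = SCRIPT_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_script_name("run_qwen3_8b_fsdp.sh"))
print(parse_script_name("run_qwen3_8b_fsdp_gb200.sh"))
```

The model group is matched lazily so that model names containing underscores, such as `qwen3_30b_a3b`, stay intact up to the first recognized backend token.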

Every script exposes the commonly tuned knobs as environment variables at the top, so you can run:

```bash
MODEL_PATH=Qwen/Qwen3-14B \
NNODES=2 NGPUS_PER_NODE=8 \
INFER_BACKEND=sglang ROLLOUT_N=8 TRAIN_BATCH_SIZE=2048 \
bash examples/grpo_trainer/run_qwen3_8b_fsdp.sh
```

Defaults

  • Dynamic batch size and sequence balancing are enabled by default in all scripts.
  • Text LLM scripts train on gsm8k + math by default; vision scripts train on geo3k.
  • Scale-demo scripts (235B, 671B) train on dapo-math-17k / aime-2024.

Matrix

| Model family | vllm | sglang | trtllm | Train backend | Platforms |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B (dense) | | | | FSDP, Megatron | nvidia, npu (FSDP + MindSpeed), _gb200 variant |
| Qwen2.5-VL-7B | | | | FSDP, Megatron | nvidia |
| Qwen3-VL-8B | | | | FSDP, Megatron | nvidia, npu (FSDP) |
| Qwen3-VL-30B-A3B | | | | FSDP, Megatron | nvidia, npu (FSDP, VeOmni) |
| Qwen3-VL-235B-A22B | | | | Megatron | nvidia |
| Qwen3-30B-A3B (MoE) | | | | FSDP, Megatron | nvidia, npu (MindSpeed, VeOmni) |
| Qwen3-235B-A22B | | | | Megatron | nvidia, npu |
| Qwen3-Next-80B-A3B | | | | FSDP | npu |
| Qwen3.5-27B (dense) | | | | FSDP2 | nvidia, npu |
| Qwen3.5-35B (dense) | | | | FSDP2, Megatron | nvidia, npu |
| Qwen3.5-35B-A3B (MoE) | | | | VeOmni | nvidia |
| Qwen3.5-122B-A10B | | | | Megatron | nvidia |
| DeepSeek-V3 671B | | | | Megatron | nvidia |
| GLM-4.1V-9B | | | | FSDP | nvidia |
| MiniCPM-o-2.6 | | | | FSDP | nvidia |
| Moonlight-16B-A3B | | | | Megatron | nvidia |
| Nemotron-Nano-v3-30B-A3B | | | | Megatron | nvidia |
| Seed-OSS-36B | | | | FSDP2 | nvidia |
| GPT-OSS-20B | | | | FSDP | nvidia |
| Mistral-Nemo-12B (RM demo) | | | | FSDP | nvidia |

LoRA variants live in examples/tuning/lora/, profiling variants in examples/profile/. Scale / hardware-specific demos (e.g. run_qwen3_8b_fsdp_gb200.sh, FP8 variants, VeOmni) keep a trailing suffix to stay discoverable.

Reference