Last updated: 07/03/2025.
Group Policy Gradient (GPG) is a minimalist reinforcement learning (RL) method that enhances the reasoning ability of large language models without relying on supervised fine-tuning or complex tricks. GPG revisits the traditional policy gradient and directly optimizes the RL objective: no surrogate losses, no KL penalties, no critic, and no reference model. Compared to GRPO, GPG is simpler, more efficient, and achieves better results on many tasks. For more details, please refer to the original paper, *GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning*.
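The core idea can be sketched as a vanilla policy-gradient loss on group-normalized rewards. The function name, tensor shapes, and normalization details below are illustrative assumptions for a single prompt group, not the framework's actual implementation:

```python
import torch

def gpg_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Vanilla policy-gradient loss with group-normalized advantages.

    log_probs: (group_size,) log-probabilities of sampled responses
    rewards:   (group_size,) scalar rewards for the same prompt
    Illustrative sketch only; the framework's implementation may differ.
    """
    # Advantage = reward standardized within the group (no critic needed)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Plain policy-gradient objective: no clipping, no KL penalty,
    # no reference model
    return -(adv.detach() * log_probs).mean()

# Example: one prompt with a group of 4 sampled responses
log_probs = torch.tensor([-1.2, -0.8, -2.0, -1.5], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = gpg_loss(log_probs, rewards)
loss.backward()  # gradients flow only through log_probs
```

Because the advantage is computed from the reward statistics of the group itself, no learned value function is required.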
To configure GPG within the framework, use the following YAML settings.
```yaml
algorithm:
  adv_estimator: gpg
actor_rollout_ref:
  actor:
    policy_loss:
      loss_mode: "gpg"
```
GPG is a simple and strong baseline for model reasoning. Although it avoids KL loss in its original form, you can still enable a KL loss term to further improve performance:
```yaml
algorithm:
  adv_estimator: gpg
actor_rollout_ref:
  actor:
    use_kl_loss: True # enable KL regularization
    kl_loss_coef: 0.01
    policy_loss:
      loss_mode: "gpg"
```
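As a rough sketch of how the KL term combines with the GPG loss: the total loss adds a penalty weighted by `kl_loss_coef` that keeps the policy close to a reference model. The function and the simple sample-based KL estimator below are illustrative assumptions, not the framework's implementation:

```python
import torch

def gpg_loss_with_kl(log_probs: torch.Tensor,
                     ref_log_probs: torch.Tensor,
                     rewards: torch.Tensor,
                     kl_coef: float = 0.01) -> torch.Tensor:
    """GPG loss plus a KL penalty toward a reference policy.

    Illustrative sketch; corresponds to use_kl_loss / kl_loss_coef
    only in spirit, not in exact form.
    """
    # Group-normalized advantage, as in plain GPG
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    pg_loss = -(adv.detach() * log_probs).mean()
    # Naive sample-based estimate of KL(pi || pi_ref) on the sampled responses
    kl = (log_probs - ref_log_probs).mean()
    return pg_loss + kl_coef * kl

# Example with a group of 4 sampled responses
log_probs = torch.tensor([-1.2, -0.8, -2.0, -1.5])
ref_log_probs = torch.tensor([-1.0, -1.0, -1.9, -1.4])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = gpg_loss_with_kl(log_probs, ref_log_probs, rewards, kl_coef=0.01)
```

A small `kl_loss_coef` (such as the `0.01` above) regularizes training without dominating the policy-gradient term.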