SAPO (Smooth Advantage Policy Optimization)

SAPO replaces PPO's ratio clipping with a smooth, tau-parameterized surrogate objective.

Reference: Revisiting Policy Gradient Methods for Large Language Models.
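
To make "smooth, tau-parameterized" concrete, here is a minimal per-token sketch. It assumes the surrogate is a sigmoid gate on the importance ratio, scaled by 4/tau, with tau chosen by the sign of the advantage. This README does not spell out the formula, so treat the gate form and the function names (sapo_gate, sapo_token_loss) as illustrative assumptions rather than the trainer's actual code. The default tau values are the ones listed under Key Flags.

```python
import math


def sapo_gate(ratio: float, tau: float) -> float:
    """Assumed smooth gate: sigmoid(tau * (ratio - 1)) * 4 / tau.

    The 4 / tau factor gives the gate slope 1 at ratio == 1, so near the
    old policy it behaves like the plain importance ratio, while for large
    ratios it saturates at 4 / tau instead of being hard-clipped.
    """
    return (4.0 / tau) / (1.0 + math.exp(-tau * (ratio - 1.0)))


def sapo_token_loss(
    log_prob: float,
    old_log_prob: float,
    advantage: float,
    tau_pos: float = 1.0,   # used for tokens with positive advantage
    tau_neg: float = 1.05,  # used for tokens with non-positive advantage
) -> float:
    """Per-token policy loss under the assumed gate (hypothetical helper)."""
    ratio = math.exp(log_prob - old_log_prob)
    tau = tau_pos if advantage > 0 else tau_neg
    return -sapo_gate(ratio, tau) * advantage


if __name__ == "__main__":
    # Compare how the gate grows with the ratio for the two tau values.
    for r in (0.5, 1.0, 1.5, 3.0):
        print(r, sapo_gate(r, 1.0), sapo_gate(r, 1.05))
```

Under this assumption, a larger tau makes the gate saturate faster, so setting tau_neg slightly above tau_pos damps updates from negative-advantage tokens a little more than from positive ones.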

Canonical Scripts

| Script | Infer | Train | Platform |
| --- | --- | --- | --- |
| run_qwen3_8b_fsdp.sh | vLLM | FSDP2 | Ascend |
| run_qwen3_30b_a3b_fsdp.sh | vLLM | FSDP2 | NVIDIA |

Key Flags

  • actor_rollout_ref.actor.policy_loss.loss_mode=sapo
  • +actor_rollout_ref.actor.policy_loss.tau_pos=1.0
  • +actor_rollout_ref.actor.policy_loss.tau_neg=1.05

Note: SAPO disables ratio clipping, so clip_ratio_low and clip_ratio_high do not need to be set.