ReMax

examples/remax_trainer/README.md

ReMax is a lightweight policy-gradient method that uses a single greedy-decoded baseline response per prompt to reduce variance without a critic.
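The baseline idea can be illustrated with a minimal sketch. It assumes one scalar reward per sampled response and one per greedy-decoded response for the same prompt; the function name `remax_advantages` is hypothetical and is not part of verl's API.

```python
from typing import List, Sequence


def remax_advantages(
    sampled_rewards: Sequence[float],
    greedy_rewards: Sequence[float],
) -> List[float]:
    """Subtract the greedy-response reward from the sampled-response reward.

    Hypothetical helper for illustration only. Entry i of both inputs
    refers to the same prompt.
    """
    if len(sampled_rewards) != len(greedy_rewards):
        raise ValueError("expected one greedy baseline reward per sampled reward")
    # The greedy response's reward acts as a per-prompt baseline, so no
    # learned critic is needed to center the policy-gradient signal.
    return [r - b for r, b in zip(sampled_rewards, greedy_rewards)]


if __name__ == "__main__":
    print(remax_advantages([1.0, 0.0, 0.5], [0.5, 0.5, 0.5]))  # [0.5, -0.5, 0.0]
```

A sampled response that scores above the greedy one gets a positive advantage, and one that scores below it gets a negative advantage.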

Reference: ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models.

Canonical Scripts

| Script | Infer | Train | Platform |
| --- | --- | --- | --- |
| run_qwen3_8b_fsdp.sh | vLLM | FSDP | NVIDIA |
| run_qwen2.5_math_7b_fsdp_sync.sh | vLLM | FSDP+Sync | NVIDIA |

Override any argument via the environment variables defined at the top of each script.

Key Flags

  • algorithm.adv_estimator=remax
  • actor_rollout_ref.actor.use_kl_loss=False and algorithm.use_kl_in_reward=True
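With these two flags, the KL term is folded into the reward rather than added to the actor loss. The sketch below shows one common way to do that, assuming a sequence-level score placed on the final token and a simple per-token log-ratio KL estimate. The function name, the `beta` coefficient, and the choice of estimator are assumptions for illustration, not verl's implementation.

```python
from typing import List, Sequence


def kl_penalized_rewards(
    score: float,
    actor_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.001,
) -> List[float]:
    """Return per-token rewards with a KL penalty subtracted.

    Hypothetical helper: `beta` and the log-ratio estimator are assumptions.
    """
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("actor and reference log-probs must align per token")
    last = len(actor_logprobs) - 1
    rewards = []
    for t, (lp_actor, lp_ref) in enumerate(zip(actor_logprobs, ref_logprobs)):
        kl_estimate = lp_actor - lp_ref  # per-token log-ratio
        terminal = score if t == last else 0.0  # score only on the last token
        rewards.append(terminal - beta * kl_estimate)
    return rewards


if __name__ == "__main__":
    print(kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.0, -1.0], beta=0.5))
```

Because the penalty changes the reward itself, it flows through the advantage estimator, whereas `use_kl_loss=True` would add a separate term to the actor objective.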