ReMax

ReMax is a lightweight policy-gradient method that uses a single greedy-decoded baseline response per prompt to reduce variance without a critic.

Canonical Scripts

Script	Infer	Train	Platform
`run_qwen3_8b_fsdp.sh`	vLLM	FSDP	NVIDIA
`run_qwen2.5_math_7b_fsdp_sync.sh`	vLLM	FSDP+Sync	NVIDIA

Override any argument via env vars at the top of the script.

algorithm.adv_estimator=remax
actor_rollout_ref.actor.use_kl_loss=False and algorithm.use_kl_in_reward=True