CISPO

examples/cispo_trainer/README.md

CISPO (Clipped IS-weight Policy Optimization) is a policy-loss variant that applies decoupled lower and upper clip ratios to the importance-sampling (IS) ratio in order to stabilize IS-ratio-weighted updates. It is used in MiniMax-M1.

Reference: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention.
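
The objective can be sketched roughly as below. The symbols are assumptions borrowed from the paper's notation, not anything defined in this README: $r_{i,t}(\theta)$ is the token-level IS ratio, $\hat{A}_{i,t}$ the advantage, $\operatorname{sg}(\cdot)$ a stop-gradient, and $\epsilon_{\text{low}}$, $\epsilon_{\text{high}}$ the two clip ratios. Check the exact form against the reference before relying on it.

$$
\hat{r}_{i,t}(\theta) = \operatorname{clip}\bigl(r_{i,t}(\theta),\; 1 - \epsilon_{\text{low}},\; 1 + \epsilon_{\text{high}}\bigr),
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$

$$
J_{\text{CISPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\bigl(\hat{r}_{i,t}(\theta)\bigr)\, \hat{A}_{i,t}\, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right]
$$

In this form the clip acts on the weight rather than on the token's update, so a token whose ratio falls outside the bounds still contributes a gradient through the log-probability term.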

Canonical Scripts

| Script | Infer | Train | Platform |
|---|---|---|---|
| run_qwen3_8b_fsdp.sh | vLLM | FSDP | NVIDIA |

Key Flags

  • actor_rollout_ref.actor.policy_loss.loss_mode=cispo
  • actor_rollout_ref.actor.clip_ratio_low=10 (effectively unclamped on lower side)
  • actor_rollout_ref.actor.clip_ratio_high=0.2
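
To make the effect of the two clip flags concrete, here is a minimal pure-Python sketch of a per-token CISPO-style loss. It is not verl's implementation: `cispo_token_loss` is a hypothetical helper, and the bounds `1 - clip_ratio_low` and `1 + clip_ratio_high` are an assumption based on the usual PPO-style convention.

```python
import math


def cispo_token_loss(log_prob, old_log_prob, advantage,
                     clip_ratio_low=10.0, clip_ratio_high=0.2):
    """Return (loss, clipped_weight) for one token.

    Hypothetical illustration only; names and bounds are assumptions.
    """
    # Importance-sampling ratio pi_theta / pi_theta_old; always positive.
    ratio = math.exp(log_prob - old_log_prob)

    # Assumed bounds: with clip_ratio_low=10 the lower bound is -9,
    # which a positive ratio can never reach.
    lower = 1.0 - clip_ratio_low
    upper = 1.0 + clip_ratio_high

    # Clipped IS weight. In an autograd framework this value would be
    # detached (stop-gradient) so it only scales the gradient.
    weight = min(max(ratio, lower), upper)

    # Weighted policy-gradient term: minimize -weight * A * log pi.
    loss = -weight * advantage * log_prob
    return loss, weight


if __name__ == "__main__":
    for lp in (-3.0, -1.0, -0.5):
        loss, weight = cispo_token_loss(lp, old_log_prob=-1.0, advantage=1.0)
        print(f"log_prob={lp:+.1f} weight={weight:.4f} loss={loss:.4f}")
```

With these defaults only ratios above 1.2 are clipped. A ratio far below 1 passes through unchanged, which matches the "effectively unclamped on lower side" note on `clip_ratio_low=10`.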