Back to Verl

Optimal Token Baseline (OTB)

examples/otb_trainer/README.md

0.8.0612 B
Original Source

Optimal Token Baseline (OTB)

OTB uses a token-wise optimal variance-reduction baseline derived from the per-prompt sample group. See tir_optimal_token_baseline variant for a TIR-specific form.

Canonical Scripts

ScriptInferTrainPlatform
run_qwen3_8b_fsdp.shvLLMFSDPNVIDIA

Set ADV_ESTIMATOR=tir_optimal_token_baseline for the TIR variant.

Key Flags

  • algorithm.adv_estimator=optimal_token_baseline
  • actor_rollout_ref.actor.calculate_sum_pi_squared=True