# FAPO Trainer Example
This directory contains a runnable, fully reproducible example that demonstrates how to use a reward model to optimize a policy.
First, construct the training and evaluation datasets:

```bash
python examples/fapo_trainer/prepare_data.py --local_dir ${RAY_DATA_HOME}/data/
```

Alternatively, you can directly use the prepared data available here.
To integrate the GenRM into the final training run, we provide two options: colocate mode and standalone mode. The following lists the most relevant parameters in the config file:
```yaml
reward:
  reward_model:
    model_path: "/path/to/your/reward_model"  # your reward model path
    # whether to enable a dedicated resource pool for the reward model
    # True -> Standalone Mode, False -> Colocate Mode
    enable_resource_pool: True
    # the number of nodes used to deploy the reward model
    # only effective when enable_resource_pool is True
    nnodes: 1
    # the number of GPUs used to deploy the reward model on each node
    # only effective when enable_resource_pool is True
    n_gpus_per_node: 8
    # inference engine configs, similar to those in the rollout config
    rollout:
      # set to True in colocate mode, False in standalone mode
      free_cache_engine: True
      # ... (omitted)

# customized reward function, where the user should implement the invocation
# logic for the specified reward model (both generative and discriminative)
custom_reward_function:
  path: null
  name: compute_score
```
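As a concrete illustration, below is a minimal sketch of a custom reward function that queries a GenRM server over HTTP via its OpenAI-compatible `/v1/chat/completions` endpoint. The server URL, served model name, judging prompt, and score parsing are all assumptions for illustration, not the actual FAPO implementation; the function signature follows verl's custom reward function convention. Adapt the details to your deployment.

```python
# Hypothetical sketch: query a GenRM server through its OpenAI-compatible
# /v1/chat/completions endpoint and parse a scalar score from the reply.
# The URL, served model name, and judging prompt below are illustrative only.
import requests

GENRM_URL = "http://localhost:8000/v1/chat/completions"  # assumed server address

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward for one rollout (signature per verl convention)."""
    prompt = (
        "Judge whether the following solution is correct. "
        "Reply with 1 for correct, 0 for incorrect.\n"
        f"Reference answer: {ground_truth}\nSolution: {solution_str}"
    )
    resp = requests.post(
        GENRM_URL,
        json={
            "model": "reward_model",  # served model name, deployment-specific
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        },
        timeout=60,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    # Naive parsing: treat a leading "1" as a positive verdict.
    return 1.0 if text.strip().startswith("1") else 0.0
```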
To launch training in colocate mode:

```bash
cd verl                       # repo root
export RAY_ADDRESS="..."      # the Ray cluster address to connect to
export WORKING_DIR="${PWD}"   # the local directory to package to the Ray cluster
export NNODES=xxx
bash examples/fapo_trainer/run_qwen_7b_rm_colocate.sh   # 7B FAPO model
bash examples/fapo_trainer/run_qwen_32b_rm_colocate.sh  # 32B FAPO model
```
To launch training in standalone mode:

```bash
cd verl                       # repo root
export RAY_ADDRESS="..."      # the Ray cluster address to connect to
export WORKING_DIR="${PWD}"   # the local directory to package to the Ray cluster
export NNODES=xxx             # for actor/rollout/trainer
export RM_NODES=xxx           # for the standalone reward model
bash examples/fapo_trainer/run_qwen_7b_rm_standalone.sh   # 7B FAPO model
bash examples/fapo_trainer/run_qwen_32b_rm_standalone.sh  # 32B FAPO model
```
Compared with the baseline (no reward model), FAPO significantly improves the model's reasoning ability.
If you would like to use discriminative reward models (DisRM), the usage is essentially the same as for GenRM: simply replace the `/v1/chat/completions` endpoint in the custom reward function with the reward model's scoring endpoint.
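For example, a DisRM variant of the custom reward function might look like the sketch below. The endpoint path and response format of a discriminative RM server vary across serving stacks, so both are placeholders here.

```python
# Hypothetical DisRM sketch: same structure as the GenRM function, but the
# request goes to the reward model's scoring endpoint instead of
# /v1/chat/completions. The endpoint path and payload/response formats are
# placeholders; adjust them to your serving stack.
import requests

DISRM_URL = "http://localhost:8000/score"  # placeholder endpoint

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward for one rollout (signature per verl convention)."""
    resp = requests.post(
        DISRM_URL,
        json={
            "prompt": extra_info.get("prompt", "") if extra_info else "",
            "response": solution_str,
        },
        timeout=60,
    )
    resp.raise_for_status()
    # Assume the server returns {"score": <float>}; adapt to your API.
    return float(resp.json()["score"])
```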
We also provide a standard way to compute the DisRM reward score, implemented in `RewardLoopWorker::compute_score_disrm`. This default path is used when you do not specify a custom reward function.
Both GenRM and DisRM can obtain reward scores via HTTP requests in the custom reward function. This allows users to flexibly combine rule-based rewards with reward model scores to construct more sophisticated reward logic, as sketched below.
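As a hedged illustration of such a combination, the function below averages a naive rule-based check with a model-based score. The `rm_score` helper is a hypothetical stand-in for one of the HTTP calls sketched above, and the equal weighting is arbitrary.

```python
# Illustrative only: combine a rule-based reward with an RM score.
def rm_score(solution_str: str, ground_truth: str) -> float:
    """Hypothetical stand-in for an HTTP call to a GenRM/DisRM server
    (see the sketches above)."""
    return 0.0  # replace with a real request

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Naive rule: exact containment of the reference answer.
    rule_reward = 1.0 if ground_truth in solution_str else 0.0
    # Model-based reward from the reward model server.
    model_reward = rm_score(solution_str, ground_truth)
    # Arbitrary equal weighting; tune for your task.
    return 0.5 * rule_reward + 0.5 * model_reward
```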
For more detailed usage instructions and infrastructure design, please refer to the Reward Loop document.
If you find our work useful for your research, please consider citing:
```bibtex
@article{ding2025fapo,
  title={FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning},
  author={Ding, Yuyang and Zhang, Chi and Li, Juntao and Lin, Haibin and Zhang, Min},
  journal={arXiv preprint arXiv:2510.22543},
  year={2025}
}
```