Preference Optimization for Reasoning with Pseudo Feedback

This repo contains the source code for Preference Optimization for Reasoning with Pseudo Feedback (ICLR 2025).

We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of test-case-based pseudo feedback: one generated by frontier LLMs, and the other obtained by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as the base model, we improve its MATH accuracy from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. Building on Deepseek-coder-7B-v1.5, we improve its LiveCodeBench score from 21.1 to 24.6, surpassing Claude-3-Haiku.

Summary of Main Experimental Results

Mathematical Reasoning

| Model | MATH | GSM8K | College Math |
|---|---|---|---|
| GPT-4o-2024-0513 | 78.7 | 95.8 | 46.7 |
| GPT-4-Turbo-2024-0409 | 72.8 | 94.8 | 44.2 |
| GPT-4-Turbo-1106-preview | 64.3 | --- | --- |
| GPT-4-0613 | 55.0 | 93.5 | 39.0 |
| NuminaMath-72B-CoT | 67.1 | 91.7 | 39.8 |
| Llama-3.1-8B-Instruct | 47.5 | 84.5 | 27.5 |
| Llama-3.1-70B-Instruct | 68.1 | 95.5 | 41.8 |
| Llama-3.1-8B-base | 20.3 (4-shot) | 56.7 (8-shot) | 20.1 (4-shot) |
| &emsp;w/ SFT | 53.8 | 85.1 | 34.6 |
| &emsp;&emsp;w/ PFPO-LLM Iter. 0 | 55.0 | 86.6 | 35.8 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 | 55.9 | 87.6 | 36.6 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 | 56.6 | 88.9 | 37.0 |
| &emsp;&emsp;w/ PFPO-Self Iter. 3 | 57.0 | 88.8 | 36.7 |
| &emsp;&emsp;w/ PFPO-Self Iter. 4 | 57.4 | 89.1 | 37.6 |
| &emsp;&emsp;w/ PFPO-Self Iter. 5 | 57.8 | 89.6 | 38.0 |
| Mathstral-7B-v0.1 | 58.3 | 85.6 | 34.3 |
| &emsp;w/ SFT | 61.4 | 87.3 | 38.4 |
| &emsp;&emsp;w/ PFPO-LLM Iter. 0 | 66.7 | 90.0 | 41.3 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 | 67.8 | 90.8 | 42.0 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 | 68.6 | 90.3 | 42.2 |
| &emsp;&emsp;w/ PFPO-Self Iter. 3 | 68.2 | 90.4 | 42.3 |

Coding - LiveCodeBench

| Model | Overall | Easy | Medium | Hard |
|---|---|---|---|---|
| Claude-3.5-Sonnet | 51.3 | 87.2 | 45.3 | 11.0 |
| Claude-3-Sonnet | 26.9 | 67.2 | 7.3 | 1.4 |
| Claude-3-Haiku | 24.0 | 61.3 | 5.5 | 0.9 |
| GPT-3.5-Turbo-0125 | 24.0 | 55.0 | 11.6 | 0.3 |
| Llama-3.1-70B-Instruct | 31.8 | 67.9 | 17.3 | 4.1 |
| Llama-3-70B-Instruct | 27.4 | 59.4 | 15.6 | 1.3 |
| CodeQwen1.5-7B-Chat | 16.8 | 35.9 | 10.9 | 0.3 |
| DeepSeekCoder-V2-236B | 41.9 | 79.9 | 32.0 | 4.9 |
| Deepseek-Coder-33B-Instruct | 23.4 | 56.1 | 8.6 | 0.9 |
| Deepseek-coder-7B-v1.5-Instruct | 21.1 | 51.3 | 7.4 | 0.2 |
| &emsp;w/ SFT (APPs) | 22.9 | 53.0 | 10.6 | 0.2 |
| &emsp;w/ DPO (APPs) | 22.9 | 53.7 | 9.4 | 1.0 |
| &emsp;w/ pDPO (APPs) | 22.9 | 55.0 | 8.1 | 1.3 |
| &emsp;w/ PFPO-LLM Iter. 0 (APPs) | 24.0 | 56.8 | 9.3 | 1.4 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 24.2 | 57.8 | 8.5 | 1.7 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 24.6 | 58.7 | 9.1 | 1.5 |
| &emsp;w/ PFPO-Self Iter. 0 (APPs) | 23.4 | 54.2 | 10.3 | 0.7 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 23.7 | 55.8 | 9.5 | 1.1 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 24.3 | 56.8 | 9.8 | 1.6 |
<details>
<summary>Coding - APPs (click to expand)</summary>

| Model | Overall | Introductory | Interview | Competition |
|---|---|---|---|---|
| GPT-4-0613 | 35.1 | 61.8 | 34.4 | 10.6 |
| GPT-4o-2024-0513 | 34.0 | 56.6 | 32.2 | 16.7 |
| Llama-3.1-8B-Instruct | 11.5 | 29.4 | 8.5 | 2.7 |
| Llama-3.1-70B-Instruct | 24.9 | 51.8 | 21.3 | 9.1 |
| Codestral-22B-V0.1 | 20.3 | 45.2 | 16.9 | 5.8 |
| CodeQwen1.5-7B-chat | 8.6 | 24.1 | 16.8 | 2.0 |
| Qwen2.5-Coder-7B-Instruct | 15.7 | 37.3 | 12.3 | 4.1 |
| Deepseek-coder-33B-Instruct | 18.4 | 44.2 | 14.5 | 4.4 |
| Deepseek-coder-v1.5-Instruct | 14.3 | 35.7 | 10.8 | 3.2 |
| &emsp;w/ SFT (APPs) | 15.4 | 37.8 | 11.6 | 4.1 |
| &emsp;w/ DPO (APPs) | 16.3 | 36.2 | 13.3 | 5.3 |
| &emsp;w/ pDPO (APPs) | 16.9 | 37.3 | 13.8 | 6.1 |
| &emsp;&emsp;w/ PFPO-LLM Iter. 0 (APPs) | 17.9 | 38.3 | 14.7 | 7.1 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 18.9 | 40.8 | 15.5 | 7.5 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 19.1 | 39.6 | 16.1 | 7.4 |
| &emsp;w/ PFPO-Self Iter. 0 (APPs) | 17.4 | 37.5 | 14.8 | 5.4 |
| &emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 18.0 | 39.2 | 14.9 | 6.2 |
| &emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 19.1 | 40.9 | 15.9 | 6.9 |

</details>

<details>
<summary>Coding - HumanEval & MBPP (click to expand)</summary>

| Model | HumanEval | MBPP |
|---|---|---|
| GPT-4-0613 | 87.8 | 82.1 |
| GPT-4o-2024-0513 | 93.3 | 87.2 |
| Llama-3.1-8B-Instruct | 72.6 | 71.2 |
| Llama-3.1-70B-Instruct | 80.5 | 83.3 |
| Codestral-22B-V0.1 | 81.1 | 78.2 |
| CodeQwen1.5-7B-chat | 85.6 | 80.5 |
| Qwen2.5-Coder-7B-Instruct | 85.4 | 86.0 |
| Deepseek-coder-33B-Instruct | 77.4 | 79.0 |
| Deepseek-coder-v1.5-Instruct | 75.6 | 73.9 |
| &emsp;w/ SFT (APPs) | 72.0 | 72.8 |
| &emsp;w/ DPO (APPs) | 74.4 | 74.3 |
| &emsp;w/ pDPO (APPs) | 73.8 | 73.2 |
| &emsp;&emsp;w/ PFPO-LLM Iter. 0 (APPs) | 73.8 | 75.9 |
| &emsp;&emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 76.8 | 73.9 |
| &emsp;&emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 81.7 | 72.4 |
| &emsp;&emsp;w/ PFPO-Self Iter. 0 (APPs) | 73.2 | 75.1 |
| &emsp;&emsp;&emsp;w/ PFPO-Self Iter. 1 (APPs & M.C.) | 79.3 | 75.5 |
| &emsp;&emsp;&emsp;w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 73.8 | 75.1 |

</details>

Install Dependencies

Most dependencies are listed in requirements.txt.

In addition, you need to install flash-attention yourself.

We also provide a Docker image for running the experiments. You can pull it with:

```bash
docker pull jiaofangkai/normal:torch-2.5.1-vllm-0.6.4.post1-eval-1206
```

Instructions to Run the Experiments

Math (Taking Mathstral as Example)

SFT on MathScale

First, please prepare your own SFT data or download our released MathScale-4o (to be released soon). The file is a single JSON file containing a list, where each item has the keys question, box_solution, and id, denoting the question, the CoT solution ending with \\boxed{}, and the item index, respectively.
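
For reference, here is a minimal, hypothetical example of that format (the values are purely illustrative):

```python
# A hypothetical SFT data file with the fields described above.
import json

sft_data = [
    {
        "question": "What is 2 + 3?",
        "box_solution": "We add the two numbers: 2 + 3 = 5. The answer is \\boxed{5}.",
        "id": 0,
    },
]

with open("mathscale_sft.json", "w") as f:  # hypothetical file name
    json.dump(sft_data, f, indent=2)
```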

After that, run the following command:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py -cp conf/exp/mathscale/mistral/sft/ -cn mathstral-mathscale4o-sft-v2.0-v100
```

The above command should be run on two 8xV100 nodes. For fewer nodes or fewer GPUs, please adjust the gradient accumulation steps in the configuration file accordingly.

In order to disable tensor parallel, please refer to the section below and set the tp_size to 1.

DPO using Ground-truth Feedback (Teacher Feedback)

Run Inference

Run the following command for inference using vLLM:

```bash
# The sampling parameters below can keep the default values in the config file.
python vllm_inference.py test_file=${test_file} output_dir=${output_dir} eval_sub_path=${eval_sub_path} \
  sampling_params.n=8 sampling_params.temperature=1.0 sampling_params.top_p=0.9 split_size=1 split_id=0 \
  -cp conf/api/vllm/mathscale/ -cn 4o_mathstral_train_0shot_v1_0
```

where test_file is the data file for inference, output_dir is the directory of your checkpoint, and eval_sub_path is the sub-path of the checkpoint, e.g., checkpoint-100. The data file is also a JSON file containing a list of items, where each item should have the fields question, id, and label.
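
For reference, a minimal, hypothetical example of that inference data format (values are illustrative only):

```python
# A hypothetical inference data file with the fields described above.
import json

test_data = [
    {
        "question": "What is the remainder when 17 is divided by 5?",
        "id": "math-train-000001",  # unique item index
        "label": "2",               # ground-truth answer, used later for feedback
    },
]

with open("train_inference_input.json", "w") as f:  # hypothetical file name
    json.dump(test_data, f, indent=2)
```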

Construct Preference Pairs

Run the following command:

```bash
python scripts/math_scale/construct_prefer_pair.py --input_file $input_file_glob_path --output_file $output_file_path
```

The input file path supports glob patterns, and the output file path is where the constructed preference pairs are saved.
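
As a rough sketch of what the pairing does under ground-truth feedback (this is not the released script, and the field names are assumptions), correct and incorrect sampled solutions for the same question are paired as chosen and rejected responses:

```python
# Minimal sketch: pair correct (chosen) vs. incorrect (rejected) sampled solutions per question,
# judged against the ground-truth label. Assumed fields: "id", "response", "pred", "label".
import glob
import json
from collections import defaultdict


def build_preference_pairs(input_glob: str, max_pairs_per_question: int = 4):
    by_question = defaultdict(list)
    for path in glob.glob(input_glob):
        with open(path) as f:
            for item in json.load(f):
                by_question[item["id"]].append(item)

    pairs = []
    for qid, items in by_question.items():
        chosen = [x for x in items if x["pred"] == x["label"]]    # correct solutions
        rejected = [x for x in items if x["pred"] != x["label"]]  # incorrect solutions
        for pos, neg in list(zip(chosen, rejected))[:max_pairs_per_question]:
            pairs.append({"id": qid, "chosen": pos["response"], "rejected": neg["response"]})
    return pairs
```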

Run DPO Training

```bash
torchrun --nnodes 1 --nproc_per_node 8 trainer_base_ds_mul_fs_tp.py -cp conf/exp/mathscale/mistral/dpo/ -cn mathstral-dpo-4o-iter0-v1.1-a100
```

The above config is set for a single 8xA100-80G node. Remember to set train_file to your saved preference pair file, and sft_model_dir to the directory of the SFT model checkpoint.

pDPO using Ground-truth Feedback (Teacher Feedback)

Following full trajectory sampling, we first need to sample some trajectory prefixes for completion and evaluation:

```bash
python scripts/math/deepseek_math_sample_steps.py --input_file $input_file --output_file $output_file \
    --upper_step_ratio 0.7 --sample_ratio 0.3 --filter_all_same --sample_over_p 10
```

Here, input_file is the full-trajectory output data, and output_file is where the sampled prefixes are saved. upper_step_ratio means we avoid sampling steps within the last (1 - upper_step_ratio) * 100% of steps, and sample_ratio is the fraction of prefixes sampled. sample_over_p is the number of prefixes sampled per problem. --filter_all_same means we skip problems where all predictions are identical.
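
As a rough sketch of the prefix-sampling idea (not the released script; it assumes newline-delimited CoT steps):

```python
# Minimal sketch of prefix sampling from a CoT solution.
import random


def sample_prefixes(solution: str, upper_step_ratio: float = 0.7, sample_ratio: float = 0.3):
    steps = [s for s in solution.split("\n") if s.strip()]
    # Only steps within the first `upper_step_ratio` fraction are eligible prefix endpoints.
    upper = max(1, int(len(steps) * upper_step_ratio))
    candidate_ends = list(range(1, upper + 1))
    # Sample a `sample_ratio` fraction of the eligible endpoints.
    k = max(1, int(len(candidate_ends) * sample_ratio))
    ends = random.sample(candidate_ends, min(k, len(candidate_ends)))
    return ["\n".join(steps[:e]) for e in sorted(ends)]
```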

Run Completion Inference for Trajectory Prefixes

```bash
# The sampling parameters below can keep the default values in the config file.
python vllm_inference.py test_file=${test_file} output_dir=${output_dir} eval_sub_path=${eval_sub_path} \
  sampling_params.n=3 sampling_params.temperature=1.0 sampling_params.top_p=0.9 split_size=1 split_id=0 \
  -cp conf/api/vllm/mathscale/ -cn 4o_mathstral_train_0shot_v1_0_completion
```

where test_file is the prefix file saved in the previous step.

Construct Prefix-Preference Pair

```bash
python scripts/math_scale/construct_process_rm_sample_gd.py --input_file $prefix_completion_file --output_file $output_file --num_workers 128
```

Run pDPO Training

```bash
torchrun --nnodes 48 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/mathscale/mistral/dpo/ -cn mathstral-pdpo-4o-iter0-v2.2-V100
```

The above experiment runs on 48 8xV100 nodes with tp_size=8. Please adjust per_gpu_train_batch_size, gradient_accumulation_steps, and tp_size according to your resources.

DPO using Self-Generated Feedback

The overall workflow is the same as with ground-truth feedback, so we only need to change the scripts for each step.

Construct Preference Pairs

```bash
python scripts/math_scale/construct_prefer_pair_sc.py --input_file $full_trajectory_data --output_file $output_file --top_p $confidence_threshold
```
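
Conceptually, this script replaces the ground-truth label with a self-consistency pseudo label before pairing solutions. Below is a minimal sketch of one plausible interpretation of the --top_p confidence threshold; the released script may differ in details:

```python
# Minimal sketch of a self-consistency pseudo label over sampled predictions.
from collections import Counter


def self_consistency_label(preds, top_p: float = 0.6):
    """Majority-vote answer among sampled predictions, kept only when its vote share
    reaches the confidence threshold (the --top_p argument above)."""
    counts = Counter(p for p in preds if p is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes / len(preds) >= top_p else None
```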

Construct Prefix-Preference Pairs

```bash
python scripts/math_scale/construct_process_rm_sample_sc.py \
  --input_file $prefix_completion_file --output_file $output_file --response_file_for_sc $full_trajectory_data --response_id_field id --num_workers 128
```

For specified experimental configs, you can refer to the section below.

Code

SFT on APPs

We use a special format to collect SFT data from GPT-4o, and you can refer to the prompt template here:

```bash
python scripts/apps/pp_solution_gen_inputs.py
```

Afterwards, we need to run the generated solutions on the annotated test cases for filtering:

```bash
python scripts/apps/solution_fail_extract.py --completion_file $completion_file --output_file $output_file --num_workers 16
```

Finally, we can conduct SFT training on this dataset:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
  -cp conf/exp/apps/r2c_generation/deepseek_coder/sft/ -cn gpt4o-distil-v3.1-v100
```

The above experiment runs on 2 8xV100 nodes.

Pseudo Test Case Inputs Generation

Before synthesizing the pseudo feedback, we first need to prepare the test case inputs. We prompt general LLMs (e.g., GPT-4o, Mistral-Large-2409) to do this, and you can find the prompting template here:

prompts/apps/test_input_gen_2shot_v2.1.txt

Note that if your LLM service supports constrained decoding into a JSON object, enabling it can improve performance.
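
For example, with an OpenAI-compatible client you can request a JSON-object response. The prompt template path is the real one from this repo; the model name and the way the problem statement is appended are assumptions:

```python
# Minimal sketch of generating test case inputs with constrained JSON-object decoding.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key are configured

# The real prompt template; how the problem statement is spliced in is an assumption here.
template = open("prompts/apps/test_input_gen_2shot_v2.1.txt").read()
prompt = template + "\n\nProblem:\n" + "<problem statement>"

response = client.chat.completions.create(
    model="gpt-4o",  # any general LLM that supports JSON-object output
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # constrained decoding to a JSON object
    temperature=0.0,
)
print(response.choices[0].message.content)  # expected to contain the generated test case inputs
```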

DPO on APPs Based on Ground-truth Test Cases

For running inference on the training set of APPs:

```bash
python vllm_inference.py split_size=1 split_id=0 -cp conf/api/vllm/apps/deepseek_coder/r2c/ -cn train_v2_0
```

Since the training set already includes test cases, the above inference process also performs the evaluation, so we can directly construct preference pairs from the evaluation results:

```bash
python scripts/apps/construct_prefer_pair.py \
    --input_file $full_trajectory_data --output_file $output_file --response_field response --test_case_field test_cases
```

Then, run DPO training:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
  -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v3.2-v100
```

The above experiment runs on 2 8xV100 nodes.

DPO/pDPO on APPs Based on Self-Consistency Test Cases

To construct preference pairs under self-consistency-based test cases, we need to run the full-trajectory data (code solutions) on the synthetic test case inputs and obtain the pseudo outputs:

```bash
python scripts/apps/solution_run_pseudo_outputs_local.py \
    --completion_file $full_trajectory_data --output_file $output_file --pseudo_test_case $synthetic_test_inputs --num_workers 128
```

This process is best run inside a sandbox.
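
The underlying idea of multi-test-case self-consistency is that, for each synthetic test input, the output produced by the majority of sampled solutions is treated as the pseudo expected output, and each solution is then scored by how many pseudo test cases it agrees with. Below is a minimal sketch of this idea; the `run` callable and data layout are assumptions, and the released scripts handle execution, timeouts, and file formats in more detail:

```python
# Minimal sketch of multi-test-case self-consistency over code solutions.
from collections import Counter


def pseudo_expected_outputs(solutions, test_inputs, run):
    """
    solutions:   list of candidate programs (full-trajectory code solutions)
    test_inputs: list of synthetic test case inputs
    run:         callable (program, test_input) -> output string, or None on crash/timeout;
                 in practice this should execute inside a sandbox.
    """
    outputs = [[run(sol, x) for x in test_inputs] for sol in solutions]

    # Majority output per input becomes the pseudo expected output.
    pseudo = []
    for j in range(len(test_inputs)):
        votes = Counter(o[j] for o in outputs if o[j] is not None)
        pseudo.append(votes.most_common(1)[0][0] if votes else None)

    # A solution's pseudo pass rate: fraction of inputs where it matches the majority output.
    valid = max(1, sum(p is not None for p in pseudo))
    pass_rates = [
        sum(o[j] == pseudo[j] for j in range(len(test_inputs)) if pseudo[j] is not None) / valid
        for o in outputs
    ]
    return pseudo, pass_rates
```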

Afterwards, we can construct the preference pairs:

```bash
python scripts/apps/pseudo_test_cases/collect_pseudo_outputs.py \
    --pseudo_test_case_file $result_file_on_synthetic_inputs \
    --output_file $output_file \
    --construct_prefer_pair --pass_case_margin 6 --pass_case_lower_bound 0.5
```

where pass_case_margin denotes the margin in passed test cases required between the two solutions of a preference pair, and pass_case_lower_bound is the minimum ratio of passed cases for a solution to serve as a positive anchor.
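
As a rough sketch (assuming the margin is measured in the number of passed pseudo test cases and the lower bound is a pass ratio), the pairing rule can be pictured as follows; the released script may differ in details:

```python
# Minimal sketch of pairing solutions by pseudo-test-case pass counts.
def make_pairs(solutions, pass_counts, num_cases, pass_case_margin=6, pass_case_lower_bound=0.5):
    pairs = []
    for i, pos in enumerate(solutions):
        if pass_counts[i] / num_cases < pass_case_lower_bound:
            continue  # not confident enough to serve as a positive anchor
        for j, neg in enumerate(solutions):
            if pass_counts[i] - pass_counts[j] >= pass_case_margin:
                pairs.append({"chosen": pos, "rejected": neg})
    return pairs
```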

Then, run DPO training:

```bash
torchrun --nnodes 8 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
  -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v4.0-v100-ps-test
```

The above experiment runs on 8 8xV100 nodes with tp_size=8.

In order to perform pDPO training, first sample steps from the full trajectory data:

```bash
python scripts/apps/prm/sample_steps.py \
    --input_file $full_trajectory_data --upper_step_ratio 0.8 --sample_ratio 0.3 --output_file $output_file
```

For prefix completion, run:

```bash
python vllm_inference.py split_size=1 split_id=0 -cp conf/api/vllm/apps/deepseek_coder/r2c/ -cn train_v2_0_prefix_completion
```

As we have already synthesized the pseudo outputs, we can evaluate the prefix completions on the pseudo test cases:

```bash
python scripts/apps/pseudo_test_cases/prefix_fail_extract_pseudo_label.py \
    --completion_file $prefix_completion_file --output_file $output_file --num_workers 64 --pseudo_test_cases $pseudo_test_cases
```

Finally, construct the prefix-preference pairs:

```bash
python scripts/apps/prm/construct_process_rm_sample_fix.py \
    --input_file $prefix_completion_execute_file --output_file $output_file \
    --pass_case_lower_bound 0.8 --pass_case_margin 4 --test_case_field pseudo_input_output
```

Then, run pDPO training:

```bash
torchrun --nnodes 16 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
  -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v4.9-V100-ps-pdpo
```

The above experiment runs on 16 8xV100 nodes with tp_size=8.

DPO/pDPO on MagiCoder-OSS and XCodeEval

Since the process is similar, we provide the data-processing commands in the following bash scripts for your reference:

```text
scripts/apps/pseudo_test_cases/pipeline.sh        # For Magicoder-OSS
scripts/apps/pseudo_test_cases/xcode_pipeline.sh  # For XCodeEval
```

We will release our preprocessed data including the synthetic test case inputs to reduce your workload.

Configuration of All Experiments

Here are the configuration files for all experiments in Tables 1, 2, 3, and 5 of the paper:

| Experiment | Configuration File |
|---|---|
| Mathstral w/ SFT | yaml file |
| &emsp;w/ DPO (M.S.-500k, Iter. 0) | yaml file |
| &emsp;w/ pDPO (M.S.-500k, Iter. 0) | yaml file |
| &emsp;&emsp;w/ pDPO (M.S.-300k-S.C., Iter. 1) | yaml file |
| &emsp;&emsp;&emsp;w/ pDPO (M.S.-300k-S.C., Iter. 2) | yaml file |
| Llama-3.1-8B w/ SFT | |
| &emsp;w/ DPO (M.S.-500k, Iter. 0) | yaml file |
| &emsp;w/ pDPO (M.S.-500k, Iter. 0) | yaml file |
| &emsp;w/ pDPO (Numina-S.C. 160k, Iter. 1) | yaml file |
| &emsp;&emsp;w/ pDPO (Numina-S.C. 320k, Iter. 2) | yaml file |
| &emsp;&emsp;w/ pDPO (Numina-S.C. 480k, Iter. 3) | yaml file |
| &emsp;&emsp;&emsp;w/ pDPO (Numina-S.C. 640k, Iter. 4) | yaml file |
| &emsp;&emsp;&emsp;&emsp;w/ pDPO (Numina-S.C. 790k, Iter. 5) | yaml file |
| Deepseek-coder-v1.5-chat w/ SFT | yaml file |
| &emsp;w/ DPO (APPs) | yaml file |
| &emsp;w/ pDPO (APPs) | yaml file |
| &emsp;w/ DPO (APPs - S.C.) | yaml file |
| &emsp;w/ pDPO (APPs - S.C.) | yaml file |
| &emsp;&emsp;w/ DPO (APPs & M.C. - S.C.) | yaml file |
| &emsp;&emsp;&emsp;w/ DPO (APPs & M.C. & xCode. - S.C.) | yaml file |
| &emsp;&emsp;&emsp;w/ pDPO (APPs & M.C. & xCode. - S.C.) | yaml file |
| &emsp;&emsp;w/ pDPO (APPs & M.C. - S.C.) | yaml file |

Evaluation Configs

For evaluation, simply run python vllm_inference.py -cp $config_path -cn $config_name. The evaluation is included in the inference process. Below are the evaluation configs for different tasks.

MWPBench (including MATH and GSM8K):

The config file is conf/api/vllm/mwp-bench/mathstral_test_0shot_v1_0.yaml.

Note that you need to use the sympy-based evaluation for more accurate results. Please refer to scripts/math_scale/qwen25math_style_eval_v2.0.py for more details.

If your prediction file is generated through our config, simply run:

```bash
python scripts/math_scale/qwen25math_style_eval_v2.0.py --input_file $prediction_file_path
```

For the dependencies required to run sympy, please create a new virtual environment and follow the instructions of Qwen2.5-Math.
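
The reason for the sympy-based check is to accept answers that are mathematically equal but textually different. Below is a minimal illustration of the idea; the released script scripts/math_scale/qwen25math_style_eval_v2.0.py is more robust and handles LaTeX-formatted answers:

```python
# Minimal illustration of symbolic answer matching with sympy.
import sympy


def answers_match(pred: str, label: str) -> bool:
    # Exact string match first, then fall back to symbolic equivalence.
    if pred.strip() == label.strip():
        return True
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(label)) == 0
    except (sympy.SympifyError, TypeError):
        return False


print(answers_match("0.5", "1/2"))  # True, even though the strings differ
```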

Code

APPs: conf/api/vllm/apps/deepseek_coder/r2c/dev_v2_0.yaml
HumanEval: conf/api/vllm/human_eval/ds_coder/r2c/test_v2_2_local.yaml
MBPP-257: conf/api/vllm/mbpp_sanitized/r2c/test_v1_0_local.yaml

For the evaluation on LiveCodeBench, please refer to the official repo. You can also refer to our commit for reference; we only modified the prompt template to adapt to the evaluation.

Basic Tutorial for Hydra Configuration

In this repo, we use Hydra and YAML files to configure the experiments. We rely on several Hydra features, so we give a basic introduction here to avoid potential confusion.

Launch Job

In most cases, the entry point is trainer_base_ds_mul_fs_tp.py, where you will see the following main function:

```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config", version_base="1.2")
def main(cfg: DictConfig):
    ...
```

The launch command is standard, e.g., using torchrun or deepspeed:

```bash
deepspeed trainer_base_ds_mul_fs_tp.py seed=42 [other hydra overrides without the "--" prefix] \
  -cp ${config_path} -cn ${config_name}
```

where config_path is the path of the directory containing the corresponding config file, and config_name is the file name without the .yaml suffix.

Runtime Function Calling and Dependency Import

In the configuration, you will see some usage like the following:

```yaml
model:
  _target_: models.llama_tp.LlamaForCausalLM.from_pretrained
  gradient_checkpointing: True
  attn_implementation: "flash_attention_2"
  torch_dtype: ${torch_dtype}
  pad_token_id: ${base_eos_token_id}
```

where _target_ indicates a function call (including __init__, i.e., object initialization), and the arguments are specified in the following lines. models.llama_tp.LlamaForCausalLM.from_pretrained specifies the import path of the function to be called; you do not need to import it manually in your code.

In Python code, you can obtain the return value of the called function via:

```python
model = hydra.utils.call(cfg.model, cfg.model_name_or_path, state_dict=pretrain_state_dict)
```

where the arguments not specified in the configuration file can be passed as additional arguments.

Additionally, you can initialize objects through Hydra recursively. In the above example, torch_dtype is itself defined as the return value of another function call:

```yaml
torch_dtype:
  _target_: general_util.training_utils.return_torch_dtype
  dtype: float16
```
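
The helper itself is not shown in this README; as a minimal sketch, assuming it simply maps a dtype name from the YAML config to the corresponding torch dtype, it could look like:

```python
# Hypothetical sketch of a dtype-mapping helper like the one referenced in the config above.
import torch


def return_torch_dtype(dtype: str) -> torch.dtype:
    return {
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
        "float32": torch.float32,
    }[dtype]
```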

Implementation

Change Deepspeed Configuration

There are some pre-defined configurations under conf/deepspeed. You can import them in your config file at the beginning by changing deepspeed@ds_cfg:

```yaml
defaults:
  - hydra: default
  - deepspeed@ds_cfg: train_hybrid_engine_zero1_cosine
  - _self_  # see here for more details: https://hydra.cc/docs/tutorials/basic/your_first_app/defaults/#composition-order-of-primary-config
```

The {a}@{b}: c syntax indicates that the configuration group to be imported is conf/a/c.yaml, and that this configuration group is renamed to b in the current configuration file.

Enable Tensor Parallel based on FairScale

There are some implementations using tensor parallelism under models, ending with _tp.py. To enable tensor parallelism, use a model with the tensor parallel implementation, such as models.llama_tp.LlamaForCausalLM.from_pretrained, and set tp_size in your configuration file.

Note that you need to use scripts/model_converter/convert_llama_to_llama_tp.py to convert the original model to the tensor parallel format. Currently the script supports the Llama, Qwen, and Mistral model series.

Memory Optimization

We recommend trying the following options, in order, to reduce memory usage:

```text
zero1 > zero2 > intra-node-zero3 & cross-node dp > intra-node tp & cross-node zero1/2 > global zero3
```

More resources on this project can be found here

Contact

If you have any problems with our code or paper, feel free to open an issue or send an email to the authors.

Citation

If you find our paper or code helpful, please cite our paper:


```
@inproceedings{jiao2024pfpo,
  title={Preference Optimization for Reasoning with Pseudo Feedback},
  author={Fangkai Jiao and Geyang Guo and Xingxing Zhang and Nancy F. Chen and Shafiq Joty and Furu Wei},
  booktitle={ICLR},
  year={2025},
}
```

If you find the codebase for pDPO also useful, kindly cite the following paper:

```
@inproceedings{jiao2024lpr,
  title={Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing},
  author={Fangkai Jiao and Chengwei Qin and Zhengyuan Liu and Nancy F. Chen and Shafiq Joty},
  booktitle={EMNLP},
  publisher={Association for Computational Linguistics},
  year={2024},
}
```