# PFPO

This repo contains the source code for **Preference Optimization for Reasoning with Pseudo Feedback** (ICLR 2025).
We introduce a novel approach to generating pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of test-case-based pseudo feedback: one generated by frontier LLMs, and the other obtained by extending self-consistency to multiple test cases. We conduct preference-optimization experiments with pseudo feedback on both mathematical reasoning and coding tasks, and observe improvements on both. Specifically, starting from Mathstral-7B as our base model, we improve its MATH score from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. Building on Deepseek-coder-7B-v1.5, we improve its LiveCodeBench score from 21.1 to 24.6, surpassing Claude-3-Haiku.

**Results on mathematical reasoning:**

| Model | MATH | GSM8K | College Math |
|---|---|---|---|
| GPT-4o-2024-0513 | 78.7 | 95.8 | 46.7 |
| GPT-4-Turbo-2024-0409 | 72.8 | 94.8 | 44.2 |
| GPT-4-Turbo-1106-preview | 64.3 | --- | --- |
| GPT-4-0613 | 55.0 | 93.5 | 39.0 |
| NuminaMath-72B-CoT | 67.1 | 91.7 | 39.8 |
| Llama-3.1-8B-Instruct | 47.5 | 84.5 | 27.5 |
| Llama-3.1-70B-Instruct | 68.1 | 95.5 | 41.8 |
| Llama-3.1-8B-base | 20.3 (4-shot) | 56.7 (8-shot) | 20.1 (4-shot) |
| w/ SFT | 53.8 | 85.1 | 34.6 |
| w/ PFPO-LLM Iter. 0 | 55.0 | 86.6 | 35.8 |
| w/ PFPO-Self Iter. 1 | 55.9 | 87.6 | 36.6 |
| w/ PFPO-Self Iter. 2 | 56.6 | 88.9 | 37.0 |
| w/ PFPO-Self Iter. 3 | 57.0 | 88.8 | 36.7 |
| w/ PFPO-Self Iter. 4 | 57.4 | 89.1 | 37.6 |
| w/ PFPO-Self Iter. 5 | 57.8 | 89.6 | 38.0 |
| Mathstral-7B-v0.1 | 58.3 | 85.6 | 34.3 |
| w/ SFT | 61.4 | 87.3 | 38.4 |
| w/ PFPO-LLM Iter. 0 | 66.7 | 90.0 | 41.3 |
| w/ PFPO-Self Iter. 1 | 67.8 | 90.8 | 42.0 |
| w/ PFPO-Self Iter. 2 | 68.6 | 90.3 | 42.2 |
| w/ PFPO-Self Iter. 3 | 68.2 | 90.4 | 42.3 |

**Results on LiveCodeBench:**

| Model | Overall | Easy | Medium | Hard |
|---|---|---|---|---|
| Claude-3.5-Sonnet | 51.3 | 87.2 | 45.3 | 11.0 |
| Claude-3-Sonnet | 26.9 | 67.2 | 7.3 | 1.4 |
| Claude-3-Haiku | 24.0 | 61.3 | 5.5 | 0.9 |
| GPT-3.5-Turbo-0125 | 24.0 | 55.0 | 11.6 | 0.3 |
| Llama-3.1-70B-Instruct | 31.8 | 67.9 | 17.3 | 4.1 |
| Llama-3-70B-Instruct | 27.4 | 59.4 | 15.6 | 1.3 |
| CodeQwen1.5-7B-Chat | 16.8 | 35.9 | 10.9 | 0.3 |
| DeepSeekCoder-V2-236B | 41.9 | 79.9 | 32.0 | 4.9 |
| Deepseek-Coder-33B-Instruct | 23.4 | 56.1 | 8.6 | 0.9 |
| Deepseek-coder-7B-v1.5-Instruct | 21.1 | 51.3 | 7.4 | 0.2 |
| w/ SFT (APPs) | 22.9 | 53.0 | 10.6 | 0.2 |
| w/ DPO (APPs) | 22.9 | 53.7 | 9.4 | 1.0 |
| w/ pDPO (APPs) | 22.9 | 55.0 | 8.1 | 1.3 |
| w/ PFPO-LLM Iter. 0 (APPs) | 24.0 | 56.8 | 9.3 | 1.4 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 24.2 | 57.8 | 8.5 | 1.7 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 24.6 | 58.7 | 9.1 | 1.5 |
| w/ PFPO-Self Iter. 0 (APPs) | 23.4 | 54.2 | 10.3 | 0.7 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 23.7 | 55.8 | 9.5 | 1.1 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode) | 24.3 | 56.8 | 9.8 | 1.6 |

**Results on APPs:**

| Model | Overall | Introductory | Interview | Competition |
|---|---|---|---|---|
| GPT-4-0613 | 35.1 | 61.8 | 34.4 | 10.6 |
| GPT-4o-2024-0513 | 34.0 | 56.6 | 32.2 | 16.7 |
| Llama-3.1-8B-Instruct | 11.5 | 29.4 | 8.5 | 2.7 |
| Llama-3.1-70B-Instruct | 24.9 | 51.8 | 21.3 | 9.1 |
| Codestral-22B-V0.1 | 20.3 | 45.2 | 16.9 | 5.8 |
| CodeQwen1.5-7B-chat | 8.6 | 24.1 | 16.8 | 2.0 |
| Qwen2.5-Coder-7B-Instruct | 15.7 | 37.3 | 12.3 | 4.1 |
| Deepseek-coder-33B-Instruct | 18.4 | 44.2 | 14.5 | 4.4 |
| Deepseek-coder-v1.5-Instruct | 14.3 | 35.7 | 10.8 | 3.2 |
| w/ SFT (APPs) | 15.4 | 37.8 | 11.6 | 4.1 |
| w/ DPO (APPs) | 16.3 | 36.2 | 13.3 | 5.3 |
| w/ pDPO (APPs) | 16.9 | 37.3 | 13.8 | 6.1 |
| w/ PFPO-LLM Iter. 0 (APPs) | 17.9 | 38.3 | 14.7 | 7.1 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 18.9 | 40.8 | 15.5 | 7.5 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 19.1 | 39.6 | 16.1 | 7.4 |
| w/ PFPO-Self Iter. 0 (APPs) | 17.4 | 37.5 | 14.8 | 5.4 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 18.0 | 39.2 | 14.9 | 6.2 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 19.1 | 40.9 | 15.9 | 6.9 |

**Results on HumanEval and MBPP:**

| Model | HumanEval | MBPP |
|---|---|---|
| GPT-4-0613 | 87.8 | 82.1 |
| GPT-4o-2024-0513 | 93.3 | 87.2 |
| Llama-3.1-8B-Instruct | 72.6 | 71.2 |
| Llama-3.1-70B-Instruct | 80.5 | 83.3 |
| Codestral-22B-V0.1 | 81.1 | 78.2 |
| CodeQwen1.5-7B-chat | 85.6 | 80.5 |
| Qwen2.5-Coder-7B-Instruct | 85.4 | 86.0 |
| Deepseek-coder-33B-Instruct | 77.4 | 79.0 |
| Deepseek-coder-v1.5-Instruct | 75.6 | 73.9 |
| w/ SFT (APPs) | 72.0 | 72.8 |
| w/ DPO (APPs) | 74.4 | 74.3 |
| w/ pDPO (APPs) | 73.8 | 73.2 |
| w/ PFPO-LLM Iter. 0 (APPs) | 73.8 | 75.9 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 76.8 | 73.9 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 81.7 | 72.4 |
| w/ PFPO-Self Iter. 0 (APPs) | 73.2 | 75.1 |
| w/ PFPO-Self Iter. 1 (APPs & M.C.) | 79.3 | 75.5 |
| w/ PFPO-Self Iter. 2 (APPs & M.C. & xCode.) | 73.8 | 75.1 |
Most dependencies are listed in `requirements.txt`. Besides, you need to install flash-attention by yourself.

We also provide a docker image for running the experiments. You can pull it by running:

```bash
docker pull jiaofangkai/normal:torch-2.5.1-vllm-0.6.4.post1-eval-1206
```
First, please prepare your own SFT data or download our released MathScale-4o (to be released soon). The file is a single JSON file containing a list, where each item has three keys: `question`, `box_solution`, and `id`, holding the question, the CoT solution ending with `\boxed{}`, and the item index, respectively.
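As an illustration, a minimal SFT data file following this schema can be produced as below (the question and solution are made-up placeholders, not from our released data):

```python
import json

# Hypothetical example of the SFT data schema described above: a single JSON
# file holding a list of items with "question", "box_solution", and "id".
sft_data = [
    {
        "question": "What is 2 + 3?",
        "box_solution": "We add the two numbers: 2 + 3 = 5. The answer is \\boxed{5}.",
        "id": 0,
    }
]

with open("mathscale_sft_example.json", "w") as f:
    json.dump(sft_data, f, indent=2)
```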
After that, run the following command:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
    trainer_base_ds_mul_fs_tp.py -cp conf/exp/mathscale/mistral/sft/ -cn mathstral-mathscale4o-sft-v2.0-v100
```
The above command should be run on two 8xV100 nodes. With fewer nodes or GPUs, please change the gradient accumulation steps in the configuration file accordingly. To disable tensor parallelism, refer to the tensor parallel section below and set `tp_size` to 1.
### Run Inference
Run the following command for inference using vLLM:

```bash
# sampling_params.* / split_size / split_id can keep their default values from the config file
python vllm_inference.py test_file=${test_file} output_dir=${output_dir} eval_sub_path=${eval_sub_path} \
    sampling_params.n=8 sampling_params.temperature=1.0 sampling_params.top_p=0.9 split_size=1 split_id=0 \
    -cp conf/api/vllm/mathscale/ -cn 4o_mathstral_train_0shot_v1_0
```
where `test_file` indicates the data file for inference, `output_dir` is the directory of your checkpoint, and `eval_sub_path` is the sub-path of the checkpoint, e.g., `checkpoint-100`. The data file is also a JSON file containing a list of items, where each item should have `question`, `id`, and `label`.
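For illustration, a minimal inference data file with this schema (made-up placeholder content) looks like:

```python
import json

# Hypothetical example of the inference data schema: a JSON list of items,
# each with "question", "id", and "label" (the gold answer used for scoring).
test_data = [
    {"question": "What is 7 * 6?", "id": 0, "label": "42"},
    {"question": "What is 10 - 4?", "id": 1, "label": "6"},
]

with open("test_example.json", "w") as f:
    json.dump(test_data, f, indent=2)
```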
### Construct Preference Pairs
Run the following command:

```bash
python scripts/math_scale/construct_prefer_pair.py --input_file $input_file_glob_path --output_file $output_file_path
```
The input file path supports glob patterns, and the output file path is where the constructed preference pairs are saved.
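Conceptually, the construction amounts to labeling each sampled solution by its final `\boxed{}` answer and pairing correct with incorrect ones. The sketch below is a simplified stand-in for the actual script (field names such as `response` are illustrative):

```python
import re

def extract_boxed(solution: str):
    # Take the last \boxed{...} occurrence in the CoT as the final answer.
    matches = re.findall(r"\\boxed\{([^}]*)\}", solution)
    return matches[-1] if matches else None

def construct_prefer_pairs(item):
    # item: {"label": gold answer, "response": list of sampled CoT solutions}.
    # Solutions whose boxed answer matches the label are "chosen"; the rest
    # are "rejected"; every chosen/rejected combination forms one pair.
    pos = [r for r in item["response"] if extract_boxed(r) == item["label"]]
    neg = [r for r in item["response"] if extract_boxed(r) != item["label"]]
    return [{"chosen": p, "rejected": n} for p in pos for n in neg]
```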
### Run DPO Training

```bash
torchrun --nnodes 1 --nproc_per_node 8 trainer_base_ds_mul_fs_tp.py -cp conf/exp/mathscale/mistral/dpo/ -cn mathstral-dpo-4o-iter0-v1.1-a100
```
The above config targets a single 8xA100-80G node. Remember to set `train_file` to your saved preference-pair file, and `sft_model_dir` to the directory of the SFT model checkpoint.
Following full-trajectory sampling, we first need to sample some trajectory prefixes for completion and evaluation:

```bash
python scripts/math/deepseek_math_sample_steps.py --input_file $input_file --output_file $output_file \
    --upper_step_ratio 0.7 --sample_ratio 0.3 --filter_all_same --sample_over_p 10
```
`input_file` is the full-trajectory output data, and `output_file` is the file to save the sampled prefixes. `upper_step_ratio` means we avoid sampling steps from the last `(1 - upper_step_ratio) * 100%` of the trajectory, and `sample_ratio` is the ratio of sampled prefixes. `sample_over_p` is the number of sampled prefixes per problem. `--filter_all_same` means we skip problems where all predictions are identical.
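The sampling logic described above can be sketched as follows (a simplified illustration, not the actual script):

```python
import random

def sample_prefixes(steps, upper_step_ratio=0.7, sample_over_p=10):
    # Only cut points within the first upper_step_ratio portion of the
    # trajectory are eligible, i.e., the last (1 - upper_step_ratio) * 100%
    # of steps are never used as prefix boundaries.
    upper = max(1, int(len(steps) * upper_step_ratio))
    k = min(sample_over_p, upper)
    cut_points = random.sample(range(1, upper + 1), k)
    return ["\n".join(steps[:c]) for c in sorted(cut_points)]
```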
### Run Completion Inference for Trajectory Prefixes

```bash
# sampling_params.* / split_size / split_id can keep their default values from the config file
python vllm_inference.py test_file=${test_file} output_dir=${output_dir} eval_sub_path=${eval_sub_path} \
    sampling_params.n=3 sampling_params.temperature=1.0 sampling_params.top_p=0.9 split_size=1 split_id=0 \
    -cp conf/api/vllm/mathscale/ -cn 4o_mathstral_train_0shot_v1_0_completion
```
where `test_file` indicates the prefix file saved in the previous step.
### Construct Prefix-Preference Pairs

```bash
python scripts/math_scale/construct_process_rm_sample_gd.py --input_file $prefix_completion_file --output_file $output_file --num_workers 128
```
### Run pDPO Training

```bash
torchrun --nnodes 48 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/mathscale/mistral/dpo/ -cn mathstral-pdpo-4o-iter0-v2.2-V100
```
The above experiment runs on 48 8xV100 nodes with `tp_size=8`. Please adjust `per_gpu_train_batch_size`, `gradient_accumulation_steps`, and `tp_size` according to your resources.
The overall workflow is the same as with ground-truth feedback, so we only need to change the scripts used at each step.
### Construct Preference Pairs

```bash
python scripts/math_scale/construct_prefer_pair_sc.py --input_file $full_trajectory_data --output_file $output_file --top_p $confidence_threshold
```
### Construct Prefix-Preference Pairs

```bash
python scripts/math_scale/construct_process_rm_sample_sc.py \
    --input_file $prefix_completion_file --output_file $output_file --response_file_for_sc $full_trajectory_data --response_id_field id --num_workers 128
```
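The self-consistency labeling behind these scripts can be sketched as majority voting with a confidence threshold (a simplified view; we assume `--top_p` acts as the vote-share threshold on the majority answer):

```python
from collections import Counter

def self_consistency_label(answers, confidence_threshold=0.5):
    # Majority vote over sampled answers: the most frequent answer becomes
    # the pseudo label, but only when its vote share exceeds the threshold;
    # otherwise the problem is considered too uncertain to label.
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes / len(answers) > confidence_threshold else None
```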
For the specific experimental configs, you can refer to the section below.

We use a special format to collect SFT data from GPT-4o, and you can refer to the prompt template here:

```bash
python scripts/apps/pp_solution_gen_inputs.py
```
Afterwards, we need to run the generated solutions on the annotated test cases for filtering:

```bash
python scripts/apps/solution_fail_extract.py --completion_file $completion_file --output_file $output_file --num_workers 16
```
Finally, we can conduct SFT training on this dataset:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/apps/r2c_generation/deepseek_coder/sft/ -cn gpt4o-distil-v3.1-v100
```
The above experiment runs on 2 8xV100 nodes.
Before synthesizing the pseudo feedback, we first need to prepare the test case inputs. We prompt general LLMs (e.g., GPT-4o, Mistral-Large-2409) to complete this process, and you can find the prompting template here:

`prompts/apps/test_input_gen_2shot_v2.1.txt`
Note that if your LLM service supports constrained decoding to JSON objects, please enable this feature for better performance.
For running inference on the training set of APPs:

```bash
python vllm_inference.py split_size=1 split_id=0 -cp conf/api/vllm/apps/deepseek_coder/r2c/ -cn train_v2_0
```
Since the training set already includes test cases, the above inference process also performs the evaluation, so we can directly construct preference pairs from the evaluation results:

```bash
python scripts/apps/construct_prefer_pair.py \
    --input_file $full_trajectory_data --output_file $output_file --response_field response --test_case_field test_cases
```
Then, run DPO training:

```bash
torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v3.2-v100
```
The above experiment runs on 2 8xV100 nodes.
To construct preference pairs under self-consistency-based test cases, we need to re-run the full-trajectory data (code solutions) on the synthetic test case inputs to obtain the pseudo outputs:
```bash
python scripts/apps/solution_run_pseudo_outputs_local.py \
    --completion_file $full_trajectory_data --output_file $output_file --pseudo_test_case $synthetic_test_inputs --num_workers 128
```
This process is best run inside a sandbox.
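Conceptually, obtaining a pseudo output means executing each candidate solution on each synthetic input and recording its stdout. A minimal sketch (not the actual script; use a proper sandbox in practice):

```python
import os
import subprocess
import sys
import tempfile

def run_on_input(solution_code: str, test_input: str, timeout: int = 10):
    # Execute one candidate solution on one synthetic test case input and
    # capture its stdout as the pseudo output; None on timeout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], input=test_input,
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return None
    finally:
        os.unlink(path)
```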
Afterwards, we can construct the preference pairs:
```bash
python scripts/apps/pseudo_test_cases/collect_pseudo_outputs.py \
    --pseudo_test_case_file $result_file_on_synthetic_inputs \
    --output_file $output_file \
    --construct_prefer_pair --pass_case_margin 6 --pass_case_lower_bound 0.5
```
where `pass_case_margin` denotes the required margin in the number of passed cases between the chosen and rejected solutions, and `pass_case_lower_bound` is the minimum ratio of passed cases for a solution to serve as a positive anchor.
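The pairing rule can be sketched as follows (a simplified illustration of the two thresholds, not the actual script):

```python
def construct_prefer_pairs(solutions, num_cases, pass_case_margin=6,
                           pass_case_lower_bound=0.5):
    # solutions: list of (code, num_passed) where num_passed counts the pseudo
    # test cases the solution passes. A solution serves as a positive anchor
    # only if its pass ratio reaches pass_case_lower_bound; it is then paired
    # with any solution passing at least pass_case_margin fewer cases.
    pairs = []
    for pos_code, pos_passed in solutions:
        if pos_passed / num_cases < pass_case_lower_bound:
            continue
        for neg_code, neg_passed in solutions:
            if pos_passed - neg_passed >= pass_case_margin:
                pairs.append({"chosen": pos_code, "rejected": neg_code})
    return pairs
```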
Then, run DPO training:

```bash
torchrun --nnodes 8 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v4.0-v100-ps-test
```
The above experiment runs on 8 8xV100 nodes with tp_size=8.
To perform pDPO training, first sample steps from the full-trajectory data:

```bash
python scripts/apps/prm/sample_steps.py \
    --input_file $full_trajectory_data --upper_step_ratio 0.8 --sample_ratio 0.3 --output_file $output_file
```
For prefix completion, run:

```bash
python vllm_inference.py split_size=1 split_id=0 -cp conf/api/vllm/apps/deepseek_coder/r2c/ -cn train_v2_0_prefix_completion
```
As we have already synthesized the pseudo outputs, we can evaluate the prefix completions on the pseudo test cases:
```bash
python scripts/apps/pseudo_test_cases/prefix_fail_extract_pseudo_label.py \
    --completion_file $prefix_completion_file --output_file $output_file --num_workers 64 --pseudo_test_cases $pseudo_test_cases
```
Finally, construct the prefix-preference pairs:
```bash
python scripts/apps/prm/construct_process_rm_sample_fix.py \
    --input_file $prefix_completion_execute_file --output_file $output_file \
    --pass_case_lower_bound 0.8 --pass_case_margin 4 --test_case_field pseudo_input_output
```
Then, run pDPO training:

```bash
torchrun --nnodes 16 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT trainer_base_ds_mul_fs_tp.py \
    -cp conf/exp/apps/r2c_generation/deepseek_coder/dpo/ -cn gpt4o-distil-v4.9-V100-ps-pdpo
```
The above experiment runs on 16 8xV100 nodes with tp_size=8.
Since the process is similar, we provide the data-processing commands in the following bash scripts for your reference:

```bash
scripts/apps/pseudo_test_cases/pipeline.sh        # For Magicoder-OSS
scripts/apps/pseudo_test_cases/xcode_pipeline.sh  # For XCodeEval
```
We will release our preprocessed data, including the synthetic test case inputs, to reduce your workload.
Here are the configuration files for all experiments in Tables 1, 2, 3, and 5 of the paper:
| Experiment | Configuration File |
|---|---|
| Mathstral w/ SFT | yaml file |
| w/ DPO (M.S.-500k, Iter. 0) | yaml file |
| w/ pDPO (M.S.-500k, Iter. 0) | yaml file |
| w/ pDPO (M.S.-300k-S.C., Iter. 1) | yaml file |
| w/ pDPO (M.S.-300k-S.C., Iter. 2) | yaml file |
| Llama-3.1-8B w/ SFT | |
| w/ DPO (M.S.-500k, Iter. 0) | yaml file |
| w/ pDPO (M.S.-500k, Iter. 0) | yaml file |
| w/ pDPO (Numina-S.C. 160k, Iter. 1) | yaml file |
| w/ pDPO (Numina-S.C. 320k, Iter. 2) | yaml file |
| w/ pDPO (Numina-S.C. 480k, Iter. 3) | yaml file |
| w/ pDPO (Numina-S.C. 640k, Iter. 4) | yaml file |
| w/ pDPO (Numina-S.C. 790k, Iter. 5) | yaml file |
| Deepseek-coder-v1.5-chat w/ SFT | yaml file |
| w/ DPO (APPs) | yaml file |
| w/ pDPO (APPs) | yaml file |
| w/ DPO (APPs - S.C.) | yaml file |
| w/ pDPO (APPs - S.C.) | yaml file |
| w/ DPO (APPs & M.C. - S.C.) | yaml file |
| w/ DPO (APPs & M.C. & xCode. - S.C.) | yaml file |
| w/ pDPO (APPs & M.C. & xCode. - S.C.) | yaml file |
| w/ pDPO (APPs & M.C. - S.C.) | yaml file |
For evaluation, simply run `python vllm_inference.py -cp $config_path -cn $config_name`. Evaluation is included in the inference process. Below are the evaluation configs for different tasks.
The config file is `conf/api/vllm/mwp-bench/mathstral_test_0shot_v1_0.yaml`.
Note that you need to use sympy-based evaluation for more accurate results. Please refer to `scripts/math_scale/qwen25math_style_eval_v2.0.py` for more details.
If your prediction file is generated through our config, simply run:

```bash
python scripts/math_scale/qwen25math_style_eval_v2.0.py --input_file $prediction_file_path
```
For the dependencies necessary to run sympy, please create a new virtual environment and follow the instructions of Qwen2.5-Math.
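A minimal sketch of sympy-style answer matching (greatly simplified; the real evaluator additionally handles LaTeX parsing and many edge cases):

```python
from sympy import simplify, sympify

def math_equal(pred: str, label: str) -> bool:
    # Treat two answers as equal when their symbolic difference simplifies
    # to zero; fall back to string comparison when parsing fails.
    try:
        return simplify(sympify(pred) - sympify(label)) == 0
    except Exception:
        return pred.strip() == label.strip()
```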
- APPs: `conf/api/vllm/apps/deepseek_coder/r2c/dev_v2_0.yaml`
- HumanEval: `conf/api/vllm/human_eval/ds_coder/r2c/test_v2_2_local.yaml`
- MBPP-257: `conf/api/vllm/mbpp_sanitized/r2c/test_v1_0_local.yaml`
For the evaluation of LiveCodeBench, please refer to the official repo. You can also refer to our commit for reference; we only modified the prompt template to adapt it to this evaluation.
In this repo, we use Hydra and YAML files to configure the experiments. Since we rely on several Hydra features, we give a brief introduction here to avoid potential confusion.

In most cases, the entry point is `trainer_base_ds_mul_fs_tp.py`, where you will see the following main function:
```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base="1.2")
def main(cfg: DictConfig):
    ...
```
The launch command is as usual, e.g., using torchrun or deepspeed:

```bash
deepspeed trainer_base_ds_mul_fs_tp.py seed=42 [other arguments without the "--" prefix] \
    -cp ${config_path} -cn ${config_name}
```

where `config_path` is the path of the directory containing the corresponding config file, and `config_name` is the file name without the `.yaml` suffix.
In the configuration files, you will see usage like the following:

```yaml
model:
  _target_: models.llama_tp.LlamaForCausalLM.from_pretrained
  gradient_checkpointing: True
  attn_implementation: "flash_attention_2"
  torch_dtype: ${torch_dtype}
  pad_token_id: ${base_eos_token_id}
```
where `_target_` indicates a function call (including `__init__`, i.e., object initialization), with its arguments specified on the following lines. `models.llama_tp.LlamaForCausalLM.from_pretrained` is the dotted path of the function to be called; you do not need to import it in your own code.
In Python code, you can obtain the return value of the called function through:

```python
model = hydra.utils.call(cfg.model, cfg.model_name_or_path, state_dict=pretrain_state_dict)
```

where arguments not specified in the configuration file can be passed as additional arguments.
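Conceptually, `hydra.utils.call` resolves the dotted path and invokes it with the remaining config keys as keyword arguments. A simplified stand-in (not Hydra's actual implementation):

```python
import importlib

def call_target(cfg: dict, *args, **extra_kwargs):
    # Resolve the dotted path in "_target_" to a callable, then invoke it
    # with the remaining config keys plus any extra arguments.
    module_path, _, attr = cfg["_target_"].rpartition(".")
    fn = getattr(importlib.import_module(module_path), attr)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    kwargs.update(extra_kwargs)
    return fn(*args, **kwargs)

# A stdlib callable standing in for a model constructor:
td = call_target({"_target_": "datetime.timedelta", "days": 2})
```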
Additionally, you can initialize objects through Hydra recursively. In the above example, `torch_dtype` is itself defined as the return value of another function call:

```yaml
torch_dtype:
  _target_: general_util.training_utils.return_torch_dtype
  dtype: float16
```
There are some pre-defined configurations under `conf/deepspeed`. You can import them at the beginning of your config file by changing the `deepspeed@ds_cfg` entry:

```yaml
defaults:
  - hydra: default
  - deepspeed@ds_cfg: train_hybrid_engine_zero1_cosine
  - _self_  # see https://hydra.cc/docs/tutorials/basic/your_first_app/defaults/#composition-order-of-primary-config for details
```
The `{a}@{b}: c` syntax means that the imported configuration group is `conf/a/c.yaml`, and that this group is renamed to `b` in the current configuration file.
There are some tensor-parallel implementations under `models`, ending with `_tp.py`. To enable tensor parallelism, use a model with a tensor-parallel implementation, such as `models.llama_tp.LlamaForCausalLM.from_pretrained`, and set `tp_size` in your configuration file.

Note that you need to use `scripts/model_converter/convert_llama_to_llama_tp.py` to convert the original model to the tensor-parallel format. Currently the script supports the Llama, Qwen, and Mistral model series.
To reduce memory usage, we recommend trying the following options in order:

`zero1 > zero2 > intra-node zero3 & cross-node DP > intra-node TP & cross-node zero1/2 > global zero3`
More resources on this project can be found here.

If you have any problem with our code or paper, feel free to open an issue or send an email to the authors.

If you find our paper or code helpful, please cite our paper:
```bibtex
@inproceedings{jiao2024pfpo,
  title = {Preference Optimization for Reasoning with Pseudo Feedback},
  author = {Fangkai Jiao and Geyang Guo and Xingxing Zhang and Nancy F. Chen and Shafiq Joty and Furu Wei},
  booktitle = {ICLR},
  year = {2025},
}
```
If you find the code base for pDPO useful as well, kindly cite the following paper:
```bibtex
@inproceedings{jiao2024lpr,
  author = {Fangkai Jiao and Chengwei Qin and Zhengyuan Liu and Nancy F. Chen and Shafiq Joty},
  title = {Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing},
  booktitle = {{EMNLP}},
  publisher = {Association for Computational Linguistics},
  year = {2024},
}
```