docs/preparation/prepare_data.rst
Last updated: 02/09/2025.
Before starting the post-training job, we need to prepare the data for policy training. The data should be stored in the parquet format.
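
As a minimal sketch of what "stored in the parquet format" means here, the record list below is written to parquet with the huggingface ``datasets`` package (the field names in this toy example are placeholders; the fields actually required by the trainer are described later on this page):

.. code:: python

   import datasets

   # Toy records; the real preprocess scripts build these from a source dataset.
   records = [
       {"prompt": "1 + 1 = ?", "answer": "2"},
       {"prompt": "2 + 2 = ?", "answer": "4"},
   ]
   datasets.Dataset.from_list(records).to_parquet("toy_train.parquet")
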
We provide several data preprocess scripts for different datasets, including GSM8K, MATH, HellaSwag and Full_hh_rlhf. To prepare other datasets, follow the steps below. The data preprocess script can be divided into two parts:
1. The first part is the common part, which loads the dataset from huggingface's
   ``datasets`` package, preprocesses it with the ``make_map_fn`` and then stores
   it in the parquet format.

.. code:: python

   import re
   import os
   import argparse
   import datasets

   from verl.utils.hdfs_io import copy, makedirs

   if __name__ == '__main__':
       parser = argparse.ArgumentParser()
       parser.add_argument('--local_dir', default='/opt/tiger/gsm8k')
       parser.add_argument('--hdfs_dir', default=None)

       args = parser.parse_args()

       num_few_shot = 5
       data_source = 'openai/gsm8k'

       dataset = datasets.load_dataset(data_source, 'main')

       train_dataset = dataset['train']
       test_dataset = dataset['test']

       # Construct a `def make_map_fn(split)` for the corresponding datasets.
       # ...

       train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
       test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)

       local_dir = args.local_dir
       hdfs_dir = args.hdfs_dir

       # Save the processed splits locally in parquet format.
       train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
       test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))

       # Optionally copy the local directory to HDFS.
       if hdfs_dir is not None:
           makedirs(hdfs_dir)
           copy(src=local_dir, dst=hdfs_dir)
2. Users are required to implement the ``make_map_fn()`` function (as well as the
   ``extract_solution``) on their own to support different datasets or tasks.
We have already implemented the data preprocessing for the GSM8K, MATH, HellaSwag and Full_hh_rlhf datasets. Here we take the GSM8K dataset as an example:
GSM8K
-----
In the ``make_map_fn``, each data item should consist of the following
5 fields:

1. ``data_source``: The name of the dataset, used to index the corresponding
   reward function in the ``RewardModel``.
2. ``prompt``: This field should be constructed in the format of the
   huggingface chat_template. The tokenizer in ``RLHFDataset`` will
   apply the chat template and tokenize the prompt.
3. ``ability``: Defines the task category.
4. ``reward_model``: Currently, we only utilize the ``ground_truth``
   field during evaluation. The ``ground_truth`` is computed by the
   ``extract_solution`` function. **Note** that the implementation of
   the corresponding reward function should align with this extracted
   ``ground_truth``.
5. ``extra_info``: Records some information about the current prompt. Not
   used for now.

.. code:: python

   def extract_solution(solution_str):
       # Extract the final answer that appears after the "#### " marker.
       solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
       assert solution is not None
       final_solution = solution.group(0)
       # Strip the "#### " prefix and remove thousands separators.
       final_solution = final_solution.split('#### ')[1].replace(',', '')
       return final_solution

   instruction_following = "Let's think step by step and output the final answer after \"####\"."

   def make_map_fn(split):

       def process_fn(example, idx):
           question = example.pop('question')
           question = question + ' ' + instruction_following

           answer = example.pop('answer')
           solution = extract_solution(answer)
           data = {
               "data_source": data_source,
               "prompt": [{
                   "role": "user",
                   "content": question
               }],
               "ability": "math",
               "reward_model": {
                   "style": "rule",
                   "ground_truth": solution
               },
               "extra_info": {
                   'split': split,
                   'index': idx
               }
           }
           return data

       return process_fn
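
After running the script, a quick sanity check is to read the generated parquet back and inspect one row. The sketch below assumes the ``--local_dir`` default used above and that ``pandas`` with ``pyarrow`` is installed:

.. code:: python

   import pandas as pd

   df = pd.read_parquet('/opt/tiger/gsm8k/train.parquet')
   # The 5 fields described above:
   # ['data_source', 'prompt', 'ability', 'reward_model', 'extra_info']
   print(df.columns.tolist())
   # The prompt is stored as a chat-template-style list of messages,
   # e.g. [{'role': 'user', 'content': '...'}].
   print(df.iloc[0]['prompt'])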