optional-skills/mlops/simpo/references/datasets.md
Complete guide to preference datasets for SimPO training.
Preference datasets must contain:
{
  "prompt": "User question or instruction",
  "chosen": "Better/preferred response",
  "rejected": "Worse/rejected response"
}
Alternative field names (auto-detected):
prompt → question, instruction, input
chosen → response_chosen, winner, preferred
rejected → response_rejected, loser
Example:
{
  "prompt": "Explain quantum computing in simple terms.",
  "chosen": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously through superposition. This allows quantum computers to process many possibilities at once, making them potentially much faster than classical computers for specific tasks like cryptography and optimization.",
  "rejected": "It's like regular computing but quantum."
}
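If a dataset uses one of the alternative field names and you would rather not rely on auto-detection, the mapping can be applied by hand before training. The sketch below is a minimal illustration: `FIELD_ALIASES` and `normalize_fields` are hypothetical names built from the list above, not part of any SimPO API.

```python
# Hypothetical alias table built from the alternative names listed above.
FIELD_ALIASES = {
    "prompt": ("prompt", "question", "instruction", "input"),
    "chosen": ("chosen", "response_chosen", "winner", "preferred"),
    "rejected": ("rejected", "response_rejected", "loser"),
}


def normalize_fields(example: dict) -> dict:
    """Return a copy of `example` that uses the canonical field names."""
    normalized = {}
    for canonical, aliases in FIELD_ALIASES.items():
        # Take the first alias that is present in the example
        found = next((name for name in aliases if name in example), None)
        if found is None:
            raise KeyError(f"no field found for '{canonical}' (tried {aliases})")
        normalized[canonical] = example[found]
    return normalized


record = {"question": "What is Python?", "winner": "A language.", "loser": "A snake."}
print(normalize_fields(record))
```

With a Hugging Face dataset this could be applied as `dataset.map(normalize_fields, remove_columns=dataset.column_names)`, which drops the original columns and keeps only the three canonical ones.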
HuggingFaceH4/ultrafeedback_binarized:
Config:
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs
argilla/ultrafeedback-binarized-preferences-cleaned:
Config:
dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
argilla/distilabel-math-preference-dpo:
Config:
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0
nvidia/HelpSteer:
Config:
dataset_mixer:
  nvidia/HelpSteer: 1.0
Anthropic/hh-rlhf:
Config:
dataset_mixer:
  Anthropic/hh-rlhf: 1.0
Equal mix:
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 0.5
  Anthropic/hh-rlhf: 0.5
Weighted mix:
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 0.7
  argilla/distilabel-math-preference-dpo: 0.2
  nvidia/HelpSteer: 0.1
Domain-specific emphasis:
# 80% general + 20% math
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 0.8
  argilla/distilabel-math-preference-dpo: 0.2
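How the numbers in `dataset_mixer` are interpreted depends on the training code. The sketch below assumes each value is the fraction of that dataset to keep, which is one common reading in alignment-handbook-style mixers; under that reading the final proportions also depend on the dataset sizes. `mix_datasets` here is a hypothetical pure-Python stand-in operating on lists, so check your trainer before relying on it.

```python
import random


def mix_datasets(datasets: dict, fractions: dict, seed: int = 42) -> list:
    """Keep `fractions[name]` of each dataset, then shuffle the combined rows."""
    rng = random.Random(seed)
    mixed = []
    for name, frac in fractions.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {name} must be in [0, 1], got {frac}")
        rows = datasets[name]
        # Keep the first int(frac * len(rows)) rows of this dataset
        mixed.extend(rows[: int(frac * len(rows))])
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for a general dataset and a math dataset of equal size
general = [{"prompt": f"g{i}"} for i in range(10)]
math_rows = [{"prompt": f"m{i}"} for i in range(10)]
mixed = mix_datasets(
    {"general": general, "math": math_rows},
    {"general": 0.8, "math": 0.2},
)
print(len(mixed))  # 8 general rows + 2 math rows
```

If the source datasets differ a lot in size, it may be worth computing the resulting row counts first to confirm the mix matches the intended emphasis.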
Good preference data:
Poor preference data:
Filter by length difference:
def filter_by_length(example):
    chosen_len = len(example['chosen'].split())
    rejected_len = len(example['rejected'].split())
    # Reject if chosen is much shorter (potential low-effort)
    return chosen_len >= rejected_len * 0.5

dataset = dataset.filter(filter_by_length)
Filter by diversity:
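seen_prompts = set()

def filter_duplicates(example):
    # Keep only the first example seen for each prompt
    prompt = example['prompt']
    if prompt in seen_prompts:
        return False
    seen_prompts.add(prompt)
    return True

# Run in a single process (no num_proc) so that seen_prompts is shared
dataset = dataset.filter(filter_duplicates)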
File (preferences.jsonl):
{"prompt": "What is Python?", "chosen": "Python is a high-level programming language...", "rejected": "It's a snake."}
{"prompt": "Explain AI.", "chosen": "AI refers to systems that can...", "rejected": "It's computers that think."}
Load:
dataset_mixer:
  json:
    data_files: preferences.jsonl
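Before pointing the config at a local file, it can help to check that every line parses and carries the three required fields. The following is a minimal sketch using only the standard library; `validate_jsonl` and `REQUIRED_FIELDS` are hypothetical helper names, not part of SimPO.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_jsonl(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            errors.append(f"line {lineno}: missing or empty {missing}")
        elif record["chosen"] == record["rejected"]:
            errors.append(f"line {lineno}: chosen and rejected are identical")
        else:
            records.append(record)
    return records, errors


sample_lines = [
    '{"prompt": "What is Python?", "chosen": "Python is a language.", "rejected": "A snake."}',
    '{"prompt": "Explain AI.", "chosen": "AI refers to systems that can..."}',
    "not json",
]
records, errors = validate_jsonl(sample_lines)
print(len(records), "valid;", len(errors), "errors")
```

For a real file, pass the open file handle directly, for example `with open("preferences.jsonl") as f: records, errors = validate_jsonl(f)`.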
Create from dict:
from datasets import Dataset

data = {
    "prompt": ["What is Python?", "Explain AI."],
    "chosen": ["Python is...", "AI refers to..."],
    "rejected": ["It's a snake.", "It's computers..."]
}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-preferences")
Use in config:
dataset_mixer:
  username/my-preferences: 1.0
For conversational data:
{
  "prompt": [
    {"role": "user", "content": "What is quantum computing?"}
  ],
  "chosen": [
    {"role": "assistant", "content": "Quantum computing uses qubits..."}
  ],
  "rejected": [
    {"role": "assistant", "content": "It's like regular computing but quantum."}
  ]
}
Apply chat template:
dataset_text_field: null # Will apply chat template
Prompt template:
Given the following question:
{prompt}
Generate two responses:
1. A high-quality, detailed response (chosen)
2. A low-quality, brief response (rejected)
Format as JSON with "chosen" and "rejected" fields.
Example code:
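import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_pair(prompt):
    # openai>=1.0 replaced openai.ChatCompletion.create with the client API
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Given: {prompt}\n\nGenerate chosen/rejected pair in JSON."
        }]
    )
    # Assumes the reply is bare JSON with "chosen" and "rejected" keys
    pair = json.loads(response.choices[0].message.content)
    return {"prompt": prompt, "chosen": pair["chosen"], "rejected": pair["rejected"]}

# Generate dataset
prompts = load_prompts()  # placeholder: supply your own list of prompt strings
dataset = [generate_pair(p) for p in prompts]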
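Model replies often wrap the JSON in extra text, so parsing the raw reply directly can fail. The sketch below shows one defensive way to extract and validate the pair; `parse_pair` is a hypothetical helper, and the "outermost braces" heuristic is an assumption that works only when the reply contains a single JSON object.

```python
import json


def parse_pair(prompt: str, raw: str) -> dict:
    """Extract a {"chosen", "rejected"} JSON object from a model reply."""
    # Keep only the outermost {...} span, dropping any surrounding chatter
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    pair = json.loads(raw[start : end + 1])
    for key in ("chosen", "rejected"):
        if not isinstance(pair.get(key), str) or not pair[key].strip():
            raise ValueError(f"missing or empty '{key}' field")
    return {"prompt": prompt, "chosen": pair["chosen"], "rejected": pair["rejected"]}


reply = 'Sure! Here is the pair: {"chosen": "Detailed answer.", "rejected": "Short."} Hope it helps.'
print(parse_pair("What is Python?", reply))
```

Replies that fail to parse are usually better skipped or regenerated than patched by hand, since a malformed pair can silently degrade the dataset.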
With vLLM:
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")
# vLLM expects a SamplingParams object rather than a plain dict
sampling_params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=512)

def generate_variations(prompt):
    # Generate multiple completions
    outputs = llm.generate([prompt] * 4, sampling_params)
    # Select best/worst; length is only a crude proxy for quality
    chosen = max(outputs, key=lambda x: len(x.outputs[0].text))
    rejected = min(outputs, key=lambda x: len(x.outputs[0].text))
    return {
        "prompt": prompt,
        "chosen": chosen.outputs[0].text,
        "rejected": rejected.outputs[0].text
    }
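Picking the longest and shortest completions rewards verbosity rather than quality. A slightly more general sketch is to rank completions with a scoring function and drop pairs whose scores are too close. `select_pair`, `score_fn` and `min_margin` are hypothetical names; in practice `score_fn` would typically be a reward model or judge, and the toy scorer below exists only to make the example runnable.

```python
def select_pair(prompt, completions, score_fn, min_margin=0.0):
    """Pick the highest- and lowest-scoring completions as chosen/rejected.

    Returns None when there are fewer than two completions or when the
    score gap does not exceed `min_margin` (no clear preference).
    """
    if len(completions) < 2:
        return None
    ranked = sorted(completions, key=score_fn)
    worst, best = ranked[0], ranked[-1]
    if score_fn(best) - score_fn(worst) <= min_margin:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def toy_score(text):
    # Toy scorer for illustration only: number of distinct words
    return len(set(text.lower().split()))


completions = [
    "It is a language.",
    "Python is a high-level, general-purpose programming language.",
    "A snake.",
]
pair = select_pair("What is Python?", completions, toy_score)
print(pair["chosen"])
print(pair["rejected"])
```

Returning `None` for unclear pairs lets the caller filter them out, for example `pairs = [p for p in candidates if p is not None]`.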
Limit sequence length:
max_prompt_length: 512
max_completion_length: 512
max_length: 1024 # Total
Implementation:
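# Assumes `tokenizer` is already loaded (e.g. via AutoTokenizer.from_pretrained)
def truncate_text(text, max_length, side):
    tokenizer.truncation_side = side
    tokens = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        add_special_tokens=False  # avoid decoding BOS/EOS back into the text
    )
    return tokenizer.decode(tokens['input_ids'])

def truncate_example(example):
    return {
        # Truncate prompts from the left to keep the most recent context
        "prompt": truncate_text(example['prompt'], 512, "left"),
        # Truncate both completions from the right
        "chosen": truncate_text(example['chosen'], 512, "right"),
        "rejected": truncate_text(example['rejected'], 512, "right")
    }

dataset = dataset.map(truncate_example)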
Remove exact duplicates:
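# Dataset.unique() only returns the unique column values, so drop duplicate rows via pandas
from datasets import Dataset
dataset = Dataset.from_pandas(dataset.to_pandas().drop_duplicates(subset="prompt"), preserve_index=False)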
Remove near-duplicates (MinHash):
from datasets import Dataset
from datasketch import MinHash, MinHashLSH

def deduplicate_lsh(dataset, threshold=0.8):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    seen = []
    for i, example in enumerate(dataset):
        # Build a MinHash signature from the prompt's words
        m = MinHash(num_perm=128)
        for word in example['prompt'].split():
            m.update(word.encode('utf8'))
        # Keep the example only if no similar prompt was indexed before
        if not lsh.query(m):
            lsh.insert(i, m)
            seen.append(example)
    return Dataset.from_list(seen)

dataset = deduplicate_lsh(dataset)
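The `threshold` passed to `MinHashLSH` is a Jaccard similarity threshold, and MinHash only approximates that similarity. To get a feel for what a value such as 0.8 means on your prompts, the exact word-level Jaccard similarity can be computed for a few pairs. This is a small standard-library sketch; `jaccard` is a hypothetical helper that mirrors the whitespace tokenization used above.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the word sets of two strings."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0  # two empty prompts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


p1 = "Explain quantum computing in simple terms"
p2 = "Explain quantum computing in very simple terms"
p3 = "What is Python?"

print(round(jaccard(p1, p2), 3))  # 6 shared words out of 7 distinct -> 0.857
print(jaccard(p1, p3))            # no shared words -> 0.0
```

Here `p1` and `p2` score above 0.8, so they would likely be treated as near-duplicates at the default threshold, although LSH can miss or add pairs close to the boundary.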
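def paraphrase_prompts(batch):
    # paraphrase_model is a placeholder for your own paraphrasing function
    paraphrased = [paraphrase_model(p) for p in batch['prompt']]
    # A batched map may return more rows than it receives:
    # keep each original and append its paraphrased copy
    return {
        "prompt": batch['prompt'] + paraphrased,
        "chosen": batch['chosen'] + batch['chosen'],
        "rejected": batch['rejected'] + batch['rejected']
    }

dataset = dataset.map(paraphrase_prompts, batched=True, remove_columns=dataset.column_names)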
Mix easy/medium/hard:
from datasets import concatenate_datasets

def categorize_difficulty(example):
    # Prompt length in words is only a rough proxy for difficulty
    prompt_len = len(example['prompt'].split())
    if prompt_len < 20:
        return "easy"
    elif prompt_len < 50:
        return "medium"
    else:
        return "hard"

dataset = dataset.map(lambda x: {"difficulty": categorize_difficulty(x)})

# Sample balanced dataset (up to 1000 examples per bucket)
def sample_bucket(name, n=1000):
    bucket = dataset.filter(lambda x: x['difficulty'] == name).shuffle()
    # Guard against buckets with fewer than n examples
    return bucket.select(range(min(n, len(bucket))))

balanced = concatenate_datasets([sample_bucket(d) for d in ("easy", "medium", "hard")]).shuffle()
import numpy as np

def compute_stats(dataset):
    # Lengths are measured in whitespace-separated words
    prompt_lens = [len(x['prompt'].split()) for x in dataset]
    chosen_lens = [len(x['chosen'].split()) for x in dataset]
    rejected_lens = [len(x['rejected'].split()) for x in dataset]
    print(f"Dataset size: {len(dataset)}")
    print(f"Avg prompt length: {np.mean(prompt_lens):.1f} words")
    print(f"Avg chosen length: {np.mean(chosen_lens):.1f} words")
    print(f"Avg rejected length: {np.mean(rejected_lens):.1f} words")
    # Share of pairs where the chosen response is longer than the rejected one
    print(f"Chosen > Rejected: {sum(c > r for c, r in zip(chosen_lens, rejected_lens)) / len(dataset):.1%}")

compute_stats(dataset)
Example output (values are illustrative):
Dataset size: 50000
Avg prompt length: 45.2 words
Avg chosen length: 180.5 words
Avg rejected length: 120.3 words
Chosen > Rejected: 85.2%
# Sample 10 random examples
samples = dataset.shuffle().select(range(10))
for ex in samples:
    print(f"Prompt: {ex['prompt']}")
    print(f"Chosen: {ex['chosen'][:100]}...")
    print(f"Rejected: {ex['rejected'][:100]}...")
    print(f"Preference clear: {'✓' if len(ex['chosen']) > len(ex['rejected']) else '?'}")
    print()