Complete guide to DSPy's optimization algorithms for improving prompts and model weights.
DSPy optimizers (called "teleprompters") automatically improve your modules by:
Key idea: Instead of manually tuning prompts, define a metric and let DSPy optimize.
| Optimizer | Best For | Speed | Quality | Data Needed |
|---|---|---|---|---|
| BootstrapFewShot | General purpose | Fast | Good | 10-50 examples |
| MIPRO | Instruction tuning | Medium | Excellent | 50-200 examples |
| BootstrapFinetune | Fine-tuning | Slow | Excellent | 100+ examples |
| COPRO | Prompt optimization | Medium | Good | 20-100 examples |
| KNNFewShot | Quick baseline | Very fast | Fair | 10+ examples |
BootstrapFewShot is the most popular optimizer. It generates few-shot demonstrations from training data.
How it works:
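The bootstrapping idea can be sketched without any LM calls. The sketch below is a simplified illustration, not DSPy's implementation: `bootstrap_demos`, `toy_teacher`, and `exact` are hypothetical names, and the "teacher" is a stand-in function that only answers simple additions. It shows the core loop: run a teacher over the training examples and keep only the outputs that pass the metric, up to `max_bootstrapped_demos`.

```python
from types import SimpleNamespace


def bootstrap_demos(teacher, trainset, metric, max_bootstrapped_demos=4):
    """Collect teacher outputs that pass the metric, up to a fixed budget."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        pred = teacher(example.question)
        if metric(example, pred):
            # Keep the input together with the teacher's accepted output.
            demos.append({"question": example.question, "answer": pred.answer})
    return demos


def toy_teacher(question):
    """Hypothetical stand-in for an LM call: answers only simple additions."""
    left, _, right = question.rstrip("?").split()[-1].partition("+")
    answer = str(int(left) + int(right)) if right else "unknown"
    return SimpleNamespace(answer=answer)


def exact(example, pred, trace=None):
    """Accept a prediction only if it equals the gold answer."""
    return example.answer == pred.answer


trainset = [
    SimpleNamespace(question="What is 2+2?", answer="4"),
    SimpleNamespace(question="What is 10-3?", answer="7"),
    SimpleNamespace(question="What is 3+5?", answer="8"),
]
demos = bootstrap_demos(toy_teacher, trainset, exact, max_bootstrapped_demos=2)
print(demos)
```

The subtraction example is dropped because the toy teacher gets it wrong, which is the point of filtering by metric: only demonstrations the metric accepts end up in the prompt.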
Parameters:
- `metric`: Function that scores predictions (required)
- `max_bootstrapped_demos`: Max demonstrations to generate (default: 4)
- `max_labeled_demos`: Max labeled examples to use (default: 16)
- `max_rounds`: Optimization iterations (default: 1)
- `metric_threshold`: Minimum score to accept (optional)

import dspy
from dspy.teleprompt import BootstrapFewShot
# Define metric
def validate_answer(example, pred, trace=None):
    """Return True if prediction matches gold answer."""
    return example.answer.lower() == pred.answer.lower()

# Training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+5?", answer="8").with_inputs("question"),
    dspy.Example(question="What is 10-3?", answer="7").with_inputs("question"),
]

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize
optimizer = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=3,
    max_rounds=2
)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# Now optimized_qa has learned few-shot examples!
result = optimized_qa(question="What is 5+7?")
Best practices:
- `max_bootstrapped_demos=3-5` for most tasks
- `max_rounds=2-3` for better quality

When to use:
MIPRO is a state-of-the-art optimizer that iteratively searches for better instructions.
How it works:
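As a rough mental model only, instruction search can be pictured as trying (instruction, demo set) combinations under a fixed trial budget and keeping the best-scoring one. The sketch below uses plain random search, which is a deliberate simplification and not MIPRO's actual search strategy; `search_prompts`, `toy_score`, and the candidate lists are hypothetical.

```python
import random


def search_prompts(instructions, demo_sets, score_fn, num_trials=20, seed=0):
    """Try random (instruction, demo set) pairs and return the best one found."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(num_trials):
        config = (rng.choice(instructions), rng.choice(demo_sets))
        score = score_fn(*config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


# Hypothetical candidates and scoring function, for illustration only.
INSTRUCTIONS = ["Answer the question.", "Think step by step, then answer."]
DEMO_SETS = [(), ("2+2=4",), ("2+2=4", "3+5=8")]


def toy_score(instruction, demos):
    """Pretend validation score: rewards a reasoning hint and more demos."""
    return (0.5 if "step by step" in instruction else 0.0) + 0.1 * len(demos)


best_config, best_score = search_prompts(INSTRUCTIONS, DEMO_SETS, toy_score)
print(best_config, best_score)
```

In a real run, the scoring function would evaluate the candidate prompt on held-out examples with your metric, which is why a larger trial budget and a separate validation set both matter.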
Parameters:
- `metric`: Evaluation metric (required)
- `num_candidates`: Instructions to try per iteration (default: 10)
- `init_temperature`: Sampling temperature (default: 1.0)
- `verbose`: Show progress (default: False)

from dspy.teleprompt import MIPRO
# Define metric with more nuance
def answer_quality(example, pred, trace=None):
    """Score answer quality 0-1."""
    if example.answer.lower() in pred.answer.lower():
        return 1.0
    # Partial credit for similar answers
    return 0.5 if len(set(example.answer.split()) & set(pred.answer.split())) > 0 else 0.0

# Larger training set (MIPRO benefits from more data)
trainset = [...]  # 50-200 examples
valset = [...]  # 20-50 examples

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize with MIPRO
optimizer = MIPRO(
    metric=answer_quality,
    num_candidates=10,
    init_temperature=1.0,
    verbose=True
)
optimized_qa = optimizer.compile(
    student=qa,
    trainset=trainset,
    valset=valset,  # MIPRO uses separate validation set
    num_trials=100  # More trials = better quality
)
Best practices:
When to use:
BootstrapFinetune fine-tunes model weights. It creates a training dataset for fine-tuning.
How it works:
Parameters:
- `metric`: Evaluation metric (required)
- `max_bootstrapped_demos`: Demonstrations to generate (default: 4)
- `max_rounds`: Data generation rounds (default: 1)

from dspy.teleprompt import BootstrapFinetune
# Training data
trainset = [...] # 100+ examples recommended
# Define metric
def validate(example, pred, trace=None):
    return example.answer == pred.answer
# Create module
qa = dspy.ChainOfThought("question -> answer")
# Generate fine-tuning data
optimizer = BootstrapFinetune(metric=validate)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# Exports training data to file
# You then fine-tune using your LM provider's API
# After fine-tuning, load your model:
finetuned_lm = dspy.OpenAI(model="ft:gpt-3.5-turbo:your-model-id")
dspy.settings.configure(lm=finetuned_lm)
Best practices:
When to use:
COPRO optimizes prompts via gradient-free search.
How it works:
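A breadth/depth search over instructions can be sketched as a greedy loop: in each of `depth` rounds, propose up to `breadth` variations of the current best instruction and keep whichever scores highest. This is an illustrative sketch under that assumption, not COPRO's actual code; `refine_instruction`, `toy_propose`, `toy_score`, and `HINTS` are hypothetical, and in practice an LM would write the proposals and your metric would score them.

```python
HINTS = ["Think step by step.", "Answer with a number only.", "Be concise."]


def refine_instruction(seed_instruction, propose, score, breadth=10, depth=3):
    """Greedy search: each round, keep the best of up to `breadth` proposals."""
    best, best_score = seed_instruction, score(seed_instruction)
    for _ in range(depth):
        for candidate in propose(best, breadth):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


def toy_propose(instruction, breadth):
    """Hypothetical proposer: append each hint that is not already present."""
    return [f"{instruction} {h}" for h in HINTS if h not in instruction][:breadth]


def toy_score(instruction):
    """Hypothetical score: fraction of hints the instruction contains."""
    return sum(h in instruction for h in HINTS) / len(HINTS)


best, best_score = refine_instruction(
    "Answer the question.", toy_propose, toy_score, breadth=3, depth=3
)
print(best, best_score)
```

With three rounds the toy search can add one hint per round, so `depth` bounds how far the instruction can drift from the seed while `breadth` bounds the cost of each round.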
from dspy.teleprompt import COPRO
# Training data
trainset = [...]
# Define metric
def metric(example, pred, trace=None):
    return example.answer == pred.answer

# Create module
qa = dspy.ChainOfThought("question -> answer")

# Optimize with COPRO
optimizer = COPRO(
    metric=metric,
    breadth=10,  # Candidates per iteration
    depth=3  # Optimization rounds
)
optimized_qa = optimizer.compile(qa, trainset=trainset)
When to use:
KNNFewShot is a simple k-nearest-neighbors optimizer. It selects similar examples for each query.
How it works:
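Nearest-neighbor demo selection can be illustrated with a bag-of-words similarity in pure Python. This is a minimal sketch, assuming word-count vectors and cosine similarity in place of the learned embeddings a real setup would normally use; `bow`, `cosine`, and `nearest_examples` are hypothetical helper names.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_examples(query, trainset, k=3):
    """Return the k training examples whose questions best match the query."""
    q = bow(query)
    ranked = sorted(
        trainset, key=lambda ex: cosine(q, bow(ex["question"])), reverse=True
    )
    return ranked[:k]


trainset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
]
demos = nearest_examples("What is the capital of Italy?", trainset, k=2)
print([ex["answer"] for ex in demos])
```

The selected examples change per query, which is what distinguishes this approach from optimizers that fix one set of demonstrations at compile time.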
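from dspy.teleprompt import KNNFewShot
trainset = [...]
qa = dspy.ChainOfThought("question -> answer")
# No metric needed for retrieval - just selects similar examples.
# KNNFewShot takes the training set at construction time
# (some DSPy versions also expect a vectorizer argument).
optimizer = KNNFewShot(k=3, trainset=trainset)
optimized_qa = optimizer.compile(qa)
# For each query, uses 3 most similar examples from trainset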
When to use:
Metrics are functions that score predictions. They're critical for optimization.
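Because a metric is an ordinary function of `(example, pred, trace)`, it can be sanity-checked on hand-written cases before any optimizer runs. The sketch below assumes only that examples and predictions expose an `.answer` attribute, so `SimpleNamespace` stands in for the real objects; `token_f1` is a hypothetical name for a token-overlap metric.

```python
from types import SimpleNamespace


def token_f1(example, pred, trace=None):
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        # Covers empty predictions, empty gold answers, and no shared tokens.
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = SimpleNamespace(answer="Paris France")
cases = {
    "exact": SimpleNamespace(answer="Paris France"),
    "partial": SimpleNamespace(answer="Paris"),
    "wrong": SimpleNamespace(answer="Berlin"),
    "empty": SimpleNamespace(answer=""),
}
scores = {name: token_f1(gold, pred) for name, pred in cases.items()}
print(scores)
```

Checking the edge cases (empty output, partial match) up front is cheaper than discovering a crashing or miscalibrated metric halfway through an optimization run.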
def exact_match(example, pred, trace=None):
    """Return True if prediction exactly matches gold."""
    return example.answer == pred.answer

def contains_answer(example, pred, trace=None):
    """Return True if prediction contains gold answer."""
    return example.answer.lower() in pred.answer.lower()
def f1_score(example, pred, trace=None):
    """F1 score between prediction and gold."""
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    # Guard both sides: an empty gold answer would otherwise divide by zero
    if not pred_tokens or not gold_tokens:
        return 0.0
    precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
    recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
def semantic_similarity(example, pred, trace=None):
    """Embedding similarity between prediction and gold."""
    # Requires the sentence-transformers package
    from sentence_transformers import SentenceTransformer, util
    # For real use, load the model once outside the metric instead of per call
    model = SentenceTransformer('all-MiniLM-L6-v2')
    emb1 = model.encode(example.answer)
    emb2 = model.encode(pred.answer)
    # util.cos_sim returns a 1x1 tensor; convert it to a plain float
    return util.cos_sim(emb1, emb2).item()
def comprehensive_metric(example, pred, trace=None):
    """Combine multiple factors."""
    score = 0.0
    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25
    # Citation (25%)
    if "source:" in pred.answer.lower():
        score += 0.25
    return score

def metric_with_trace(example, pred, trace=None):
    """Metric that uses trace for debugging."""
    is_correct = example.answer == pred.answer
    if trace is not None and not is_correct:
        # Log failures for analysis
        print(f"Failed on: {example.question}")
        print(f"Expected: {example.answer}")
        print(f"Got: {pred.answer}")
    return is_correct
# Split data
trainset = data[:100] # 70%
valset = data[100:120] # 15%
testset = data[120:] # 15%
# Optimize on train
optimized = optimizer.compile(module, trainset=trainset)
# Validate during optimization (for MIPRO)
optimized = optimizer.compile(module, trainset=trainset, valset=valset)
# Evaluate on test
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=testset, metric=metric)
score = evaluator(optimized)
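Fixed slice indices only match the intended percentages for one particular dataset size. A small helper that shuffles and splits by fraction avoids that. This is a sketch with a hypothetical `split_dataset` function and a stand-in list of integers in place of real examples.

```python
import random


def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the data and cut it into train/val/test lists."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


data = list(range(140))  # stand-in for a list of examples
trainset, valset, testset = split_dataset(data)
print(len(trainset), len(valset), len(testset))
```

Shuffling before splitting matters when the raw data is ordered (for example by topic or difficulty), since a plain slice would then give the test set a different distribution from the training set.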
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)
scores = []
for train_idx, val_idx in kfold.split(data):
    trainset = [data[i] for i in train_idx]
    valset = [data[i] for i in val_idx]
    optimized = optimizer.compile(module, trainset=trainset)
    score = evaluator(optimized, devset=valset)
    scores.append(score)
print(f"Average score: {sum(scores) / len(scores):.2f}")
results = {}
for opt_name, optimizer in [
    ("baseline", None),
    ("fewshot", BootstrapFewShot(metric=metric)),
    ("mipro", MIPRO(metric=metric)),
]:
    if optimizer is None:
        module_opt = module
    else:
        module_opt = optimizer.compile(module, trainset=trainset)
    score = evaluator(module_opt, devset=testset)
    results[opt_name] = score
print(results)
# {'baseline': 0.65, 'fewshot': 0.78, 'mipro': 0.85}
from dspy.teleprompt import Teleprompter
class CustomOptimizer(Teleprompter):
    def __init__(self, metric):
        self.metric = metric

    def compile(self, student, trainset, **kwargs):
        # Your optimization logic here
        # Return optimized student module
        return student
# Stage 1: Bootstrap few-shot
stage1 = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized1 = stage1.compile(module, trainset=trainset)
# Stage 2: Instruction tuning
stage2 = MIPRO(metric=metric, num_candidates=10)
optimized2 = stage2.compile(optimized1, trainset=trainset, valset=valset)
# Final optimized module
final_module = optimized2
class EnsembleModule(dspy.Module):
    def __init__(self, modules):
        super().__init__()
        self.modules = modules

    def forward(self, question):
        predictions = [m(question=question).answer for m in self.modules]
        # Vote or average
        return dspy.Prediction(answer=max(set(predictions), key=predictions.count))
# Optimize multiple modules
opt1 = BootstrapFewShot(metric=metric).compile(module, trainset=trainset)
opt2 = MIPRO(metric=metric).compile(module, trainset=trainset)
opt3 = COPRO(metric=metric).compile(module, trainset=trainset)
# Ensemble
ensemble = EnsembleModule([opt1, opt2, opt3])
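The voting step is worth isolating, because `max(set(predictions), key=predictions.count)` breaks ties in arbitrary set order and treats "Paris" and "paris " as different answers. The sketch below shows one possible alternative, assuming string answers; `majority_vote` is a hypothetical helper, and lowercasing plus whitespace stripping is an assumed normalization that may not suit every task.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common normalized answer; ties go to the one seen first."""
    normalized = [a.strip().lower() for a in answers]
    # Counter.most_common orders equal counts by first occurrence.
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


print(majority_vote(["Paris", "paris ", "Lyon"]))
print(majority_vote(["Lyon", "Paris"]))
```

Putting the module you trust most first in the ensemble list then gives it the deciding vote whenever the answers split evenly.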
# No optimization
baseline = dspy.ChainOfThought("question -> answer")
baseline_score = evaluator(baseline, devset=testset)
print(f"Baseline: {baseline_score}")
# Quick optimization
fewshot = BootstrapFewShot(metric=metric, max_bootstrapped_demos=3)
optimized = fewshot.compile(baseline, trainset=trainset)
fewshot_score = evaluator(optimized, devset=testset)
print(f"Few-shot: {fewshot_score} (+{fewshot_score - baseline_score:.2f})")
# State-of-the-art optimization
mipro = MIPRO(metric=metric, num_candidates=10)
optimized_mipro = mipro.compile(baseline, trainset=trainset, valset=valset)
mipro_score = evaluator(optimized_mipro, devset=testset)
print(f"MIPRO: {mipro_score} (+{mipro_score - baseline_score:.2f})")
if mipro_score > fewshot_score:
    optimized_mipro.save("models/best_model.json")
else:
    optimized.save("models/best_model.json")
# ❌ Bad: Too many demos
optimizer = BootstrapFewShot(max_bootstrapped_demos=20) # Overfits!
# ✅ Good: Moderate demos
optimizer = BootstrapFewShot(max_bootstrapped_demos=4)  # 3 to 5 is a reasonable range
# ❌ Bad: Binary metric for nuanced task
def bad_metric(example, pred, trace=None):
    return example.answer == pred.answer  # Too strict!

# ✅ Good: Graded metric
def good_metric(example, pred, trace=None):
    # f1_score is the token-level metric defined earlier; it takes (example, pred)
    return f1_score(example, pred)  # Allows partial credit
# ❌ Bad: Too little data
trainset = data[:5] # Not enough!
# ✅ Good: Sufficient data
trainset = data[:50] # Better
# ❌ Bad: Optimizing on test set
optimizer.compile(module, trainset=testset) # Cheating!
# ✅ Good: Proper splits
optimized = optimizer.compile(module, trainset=trainset, valset=valset)
evaluator(optimized, devset=testset)