docs/docs/tutorials/gepa_aime/index.ipynb
In this tutorial, we optimize GPT-4.1 Mini's Chain of Thought (dspy.ChainOfThought) for solving math problems (AIME) using the dspy.GEPA optimizer!
<a href="https://mlflow.org/">MLflow</a> is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. MLflow's autologging automatically tracks the progress of GEPA optimization and visualizes prompts and module executions as traces, making DSPy's behavior easier to understand. You can set up MLflow by following the four steps below.
Visualize module executions as traces
Automatically track optimization progress and results
Setup MLflow
%pip install "mlflow>=3.0.0"
mlflow ui --port 5000 --backend-store-uri sqlite:///mlruns.db
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
mlflow.dspy.autolog(
    # Log the optimization progress
    log_compiles=True,
    # Log the evaluation results
    log_evals=True,
    # Log traces from module executions
    log_traces=True
)
To learn more about the integration, see the MLflow DSPy documentation.
</details>
api_key = input("Enter your OpenAI API key: ")
import dspy
lm = dspy.LM("openai/gpt-4.1-mini", temperature=1, api_key=api_key, max_tokens=32000)
dspy.configure(lm=lm)
The AIME exam consists of 2 problem sets of 15 problems each per year. For this tutorial, we use AIME problem sets from previous years (2022-2024) for optimization (3 years x 2 sets x 15 problems = 90 problems in total, split equally between train and validation sets), and test the performance on AIME 2025 (2 sets x 15 problems = 30 problems). Since AIME 2025 is a small set, we repeat it 5 times for statistical stability in evaluation.
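The split sizes described above can be checked with a little arithmetic. The sketch below uses placeholder integers instead of real problems, so it runs without downloading anything; the helper name `split_sizes` and its parameters are hypothetical and not part of DSPy. The actual sizes depend on what the Hugging Face datasets contain.

```python
def split_sizes(years=3, sets_per_year=2, problems_per_set=15, test_repeats=5):
    """Return (train, val, test) sizes for the setup described in the text."""
    # Placeholder integers stand in for the 2022-2024 problems (an assumption).
    pool = list(range(years * sets_per_year * problems_per_set))
    half = int(0.5 * len(pool))
    train, val = pool[:half], pool[half:]
    # One year of test problems (2 sets x 15), repeated for a more stable estimate.
    test = list(range(sets_per_year * problems_per_set)) * test_repeats
    return len(train), len(val), len(test)


print(split_sizes())  # (45, 45, 150)
```

Note that repeating the test set does not add new problems; it only averages over several sampled generations per problem.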
import dspy
from datasets import load_dataset
def init_dataset():
    train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in train_split
    ]
    import random
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)

    test_split = load_dataset("MathArena/aime_2025")['train']
    test_split = [
        dspy.Example({
            "problem": x['problem'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in test_split
    ]

    train_set = train_split[:int(0.5 * tot_num)]
    val_set = train_split[int(0.5 * tot_num):]
    test_set = test_split * 5

    return train_set, val_set, test_set

train_set, val_set, test_set = init_dataset()
len(train_set), len(val_set), len(test_set)
Let's view an example task input:
print("Problem:")
print(train_set[0]['problem'])
print("\n\nSolution:")
print(train_set[0]['solution'])
print("\n\nAnswer:")
print(train_set[0]['answer'])
dspy.ChainOfThought
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)
We simply check for an exact match between the predicted answer and the correct answer.
def metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    try:
        llm_answer = int(prediction.answer)
    except ValueError as e:
        return 0
    return int(correct_answer == llm_answer)

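To see what "exact match" means in practice, here is a minimal sketch that mirrors the metric's parsing logic without requiring DSPy. The function name `exact_match`, the plain dict used as the example, and the `SimpleNamespace` prediction are stand-ins chosen for illustration only.

```python
from types import SimpleNamespace


def exact_match(example, prediction):
    """Simplified stand-in for the metric: 1 if the integers match, else 0."""
    correct_answer = int(example["answer"])
    try:
        llm_answer = int(prediction.answer)
    except ValueError:
        # Anything int() cannot parse counts as a wrong answer.
        return 0
    return int(correct_answer == llm_answer)


example = {"answer": "42"}  # hypothetical example, not taken from the dataset
for raw in ["42", " 42 ", "042", "42.0", "The answer is 42"]:
    print(repr(raw), exact_match(example, SimpleNamespace(answer=raw)))
```

Surrounding whitespace and leading zeros are tolerated because `int()` accepts them, while `"42.0"` or any extra text scores 0. This is why the output format matters as much as the reasoning.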
import dspy
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)

evaluate(program)
dspy.GEPA
GEPA is a reflective prompt optimizer. Its strength lies in leveraging additional sources of information, such as the DSPy program's execution and evaluation pipelines, which give GEPA more visibility into why the system got the score that it did. GEPA can then introspect to identify how to improve the score. GEPA can also leverage additional supervision provided in this manner. For example, during optimization, we can return the correct solutions to the problems the program failed to solve.
We note that while such explicit supervision is not available in all scenarios, GEPA can work very flexibly with different forms of feedback (for example, using LLM-as-a-judge feedback shown in the PAPILLON tutorial, or just using answer labels, as shown in the facility-support tutorial).
Let's quickly turn the evaluation metric into an optimization metric for GEPA that can provide this additional supervision!
def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    written_solution = example.get('solution', '')
    try:
        llm_answer = int(prediction.answer)
    except ValueError as e:
        feedback_text = f"The final answer must be a valid integer and nothing else. You responded with '{prediction.answer}', which couldn't be parsed as a python integer. Please ensure your answer is a valid integer without any additional text or formatting."
        feedback_text += f" The correct answer is '{correct_answer}'."
        if written_solution:
            feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems and ensure your final answer is a valid integer."
        return dspy.Prediction(score=0, feedback=feedback_text)

    score = int(correct_answer == llm_answer)

    feedback_text = ""
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{correct_answer}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{correct_answer}'."

    if written_solution:
        feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems."

    return dspy.Prediction(score=score, feedback=feedback_text)

from dspy import GEPA
optimizer = GEPA(
    metric=metric_with_feedback,
    auto="light",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=3,
    reflection_lm=dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000, api_key=api_key)
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)
print(optimized_program.predict.signature.instructions)
Here, GEPA is precomputing some reasoning to come up with a good plan for future task instances. Because performance improved on the unseen validation set, we expect this prompt to generalize!
evaluate(optimized_program)
GEPA improved GPT-4.1 Mini's performance on AIME 2025 from 46.6% to 56.6%, a gain of 10 percentage points, with just a budget of auto="light"!
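The gain reported above is an absolute difference between two accuracies; measured relative to the baseline, it is larger. The short sketch below makes the distinction explicit, using only the two scores quoted in the text. The helper name `improvement` is hypothetical.

```python
def improvement(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


abs_gain, rel_gain = improvement(46.6, 56.6)
print(f"{abs_gain:.1f} percentage points, {rel_gain:.1f}% relative")
# prints: 10.0 percentage points, 21.5% relative
```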