docs/docs/tutorials/agents/index.ipynb
Let's walk through a quick example of setting up a dspy.ReAct agent with a couple of tools and optimizing it to conduct advanced browsing for multi-hop search.
Install the latest DSPy via `pip install -U dspy` and follow along. You also need to run `pip install datasets`.
<a href="https://mlflow.org/">MLflow</a> is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.
%pip install mlflow>=2.20
mlflow ui --port 5000
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
mlflow.dspy.autolog()
Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.
To learn more about the integration, visit the MLflow DSPy Documentation as well.
In this tutorial, we'll use an extremely small LM, Meta's Llama-3.2-3B-Instruct, which has 3 billion parameters.
A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.
You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.
In the snippet below, we'll configure our main LM as Llama-3.2-3B. We'll also set up a larger LM, i.e. GPT-4o, as a teacher that we'll invoke a very small number of times to help teach the small LM.
import dspy
llama3b = dspy.LM('<provider>/Llama-3.2-3B-Instruct', temperature=0.7)
gpt4o = dspy.LM('openai/gpt-4o', temperature=0.7)
dspy.configure(lm=llama3b)
Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.
import random
from dspy.datasets import DataLoader
kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="vincentkoc/hover-parquet", split="train", trust_remote_code=True, **kwargs)
hpqa_ids = set()
hover = [
dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
for x in hover
if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]
random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:100], hover[100:200], hover[650:]
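The filtering above packs deduplication into the list comprehension: `set.add` returns `None` (which is falsy), so `not hpqa_ids.add(x["hpqa_id"])` records the id and always evaluates to `True`, and short-circuiting ensures the `add` only runs for ids that passed the earlier membership check. A minimal standalone sketch of the same idiom, with toy data:

```python
# Deduplicate items by key while filtering, in one comprehension.
# set.add returns None, so `not seen.add(key)` records the key and is always True;
# short-circuit evaluation means add() only runs when the key is new.
rows = [
    {"id": "a", "hops": 3},
    {"id": "b", "hops": 2},  # fails the hops filter, never reaches the set
    {"id": "a", "hops": 3},  # duplicate id, dropped
    {"id": "c", "hops": 3},
]

seen = set()
kept = [r for r in rows if r["hops"] == 3 and r["id"] not in seen and not seen.add(r["id"])]

print([r["id"] for r in kept])  # ['a', 'c']
```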
Let's view an example of this task:
example = trainset[0]
print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)
Now, let's define a function to do the search in Wikipedia. We'll rely on a ColBERTv2 server that can search the "abstracts" (i.e., first paragraphs) of every article that existed in Wikipedia in 2017, which is the data used in HoVer.
DOCS = {}
def search(query: str, k: int) -> list[str]:
results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=k)
results = [x['text'] for x in results]
for result in results:
title, text = result.split(" | ", 1)
DOCS[title] = text
return results
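Each passage returned by the ColBERTv2 index is a single string of the form `"Title | body text"`, which is why `search` can split on the first `" | "` and cache the page text by title. A tiny sketch of that convention (the passage strings here are invented for illustration):

```python
# Mocked ColBERTv2-style passages: "Title | first paragraph".
passages = [
    "David Gregory (physician) | David Gregory was a Scottish physician...",
    "Gregory family | The Gregory family produced many mathematicians...",
]

docs = {}
for passage in passages:
    # maxsplit=1 keeps any further " | " occurrences inside the body text.
    title, text = passage.split(" | ", 1)
    docs[title] = text

print(sorted(docs))  # titles become the cache keys
```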
Now, let's use the search function to define two tools for our ReAct agent:
def search_wikipedia(query: str) -> list[str]:
"""Returns top-5 results and then the titles of the top-5 to top-30 results."""
topK = search(query, 30)
titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]
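To see what the agent actually observes from `search_wikipedia`, here is the same slicing logic run against mocked results, so no ColBERTv2 server is needed (the passage strings are invented for illustration):

```python
# Mock 30 retrieved passages in "Title | text" format.
mock_results = [f"Page {i} | Text of page {i}." for i in range(30)]

# Keep the top-5 passages in full; summarize ranks 6-30 by title only.
titles = [f"`{x.split(' | ')[0]}`" for x in mock_results[5:30]]
top5 = mock_results[:5]
observation = top5 + [f"Other retrieved pages have titles: {', '.join(titles)}."]

print(len(observation))   # 6 entries: 5 full passages + 1 title summary
print(observation[-1][:60])
```

This keeps the observation short for the small LM while still letting it discover titles it can pass to `lookup_wikipedia`.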
def lookup_wikipedia(title: str) -> str:
"""Returns the text of the Wikipedia page, if it exists."""
if title in DOCS:
return DOCS[title]
results = [x for x in search(title, 10) if x.startswith(title + " | ")]
if not results:
return f"No Wikipedia page found for title: {title}"
return results[0]
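The fallback logic in `lookup_wikipedia` is worth tracing: first a cache hit, then an exact title-prefix match over fresh search results. Here is the same control flow with the search call stubbed out by a hypothetical `fake_search` (all strings invented):

```python
# Sketch of the lookup fallback, with the real search() stubbed out.
DOCS = {"Cached Page": "Already seen text."}

def fake_search(query: str, k: int) -> list[str]:
    # Stand-in for the ColBERTv2-backed search(); returns "Title | text" strings.
    return ["Other Page | Some text.", "Exact Title | The page we wanted."]

def lookup(title: str) -> str:
    if title in DOCS:
        return DOCS[title]
    # Accept only results whose title matches exactly, not near-misses.
    results = [x for x in fake_search(title, 10) if x.startswith(title + " | ")]
    return results[0] if results else f"No Wikipedia page found for title: {title}"

print(lookup("Cached Page"))   # served from the cache
print(lookup("Exact Title"))   # found via the prefix match
print(lookup("Missing Page"))  # falls through to the error message
```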
Now, let's define the ReAct agent in DSPy. It's going to be super simple: it'll take a claim and produce a list, `titles: list[str]`.
We'll instruct it to find all Wikipedia titles that are needed to fact-check the claim.
instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)
react = dspy.ReAct(signature, tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
Let's try it with a really simple claim to see if our tiny 3B model can do it!
react(claim="David Gregory was born in 1625.").titles[:3]
Great. Now let's set up an evaluation metric, top5_recall.
It will return the fraction of the gold pages (which are always 3) that are retrieved in the top-5 titles returned by the agent.
def top5_recall(example, pred, trace=None):
gold_titles = example.titles
recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)
# If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
if trace is not None:
return recall >= 1.0
# If we're just doing inference, just measure the recall.
return recall
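The metric's dual behavior, fractional recall at inference time but a strict boolean during bootstrapping, can be exercised with toy objects; here `SimpleNamespace` stands in for the `dspy.Example` and the agent's prediction:

```python
from types import SimpleNamespace

def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)
    if trace is not None:
        return recall >= 1.0  # bootstrapping: only perfect traces count
    return recall             # inference: fractional recall

example = SimpleNamespace(titles=["A", "B", "C"])
pred = SimpleNamespace(titles=["A", "C", "X", "Y", "Z"])

print(top5_recall(example, pred))            # 2 of 3 gold pages retrieved
print(top5_recall(example, pred, trace=[]))  # imperfect, so rejected for bootstrapping
```

This is why the optimizer only collects demonstrations from runs where the agent found all three gold pages.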
evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)
Let's evaluate our off-the-shelf agent, with Llama-3.2-3B, to see how far we can go already.
This model is tiny, so it can fail fairly often. Let's wrap it in a try/except block to handle those failures.
def safe_react(claim: str):
try:
return react(claim=claim)
except Exception as e:
return dspy.Prediction(titles=[])
evaluate(safe_react)
To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.
import mlflow
with mlflow.start_run(run_name="agent_evaluation"):
evaluate = dspy.Evaluate(
devset=devset,
metric=top5_recall,
num_threads=16,
display_progress=True,
)
# Evaluate the program as usual
result = evaluate(safe_react)
# Log the aggregated score
mlflow.log_metric("top5_recall", result.score)
# Log the detailed evaluation results as a table
mlflow.log_table(
{
"Claim": [example.claim for example in eval_set],
"Expected Titles": [example.titles for example in eval_set],
"Predicted Titles": [output[1] for output in result.results],
"Top 5 Recall": [output[2] for output in result.results],
},
artifact_file="eval_results.json",
)
To learn more about the integration, visit the MLflow DSPy Documentation as well.
Wow. It only scores 8% recall. Not that good!
Let's now optimize the two prompts inside dspy.ReAct jointly to maximize the recall of our agent. This may take around 30 minutes and make roughly $5 worth of calls to GPT-4o to optimize Llama-3.2-3B.
kwargs = dict(teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o, max_errors=999)
tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **kwargs)
optimized_react = tp.compile(react, trainset=trainset, max_bootstrapped_demos=3, max_labeled_demos=0)
Let's now evaluate again, after optimization.
evaluate(optimized_react)
Awesome. It looks like the system improved drastically from 8% recall to around 40% recall. That was a pretty straightforward approach, but DSPy gives you many tools to continue iterating on this from here.
Next, let's inspect the optimized prompts to understand what the agent has learned. We'll run one query and then inspect the last two prompts, which will show us the prompts used for both ReAct sub-modules: the one that runs the agentic loop and the one that prepares the final results. (Alternatively, if you enabled MLflow Tracing following the instructions above, you can see all steps taken by the agent, including LLM calls, prompts, and tool executions, in a rich tree view.)
optimized_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
dspy.inspect_history(n=2)
Finally, let's save our optimized program so we can use it again later.
optimized_react.save("optimized_react.json")
loaded_react = dspy.ReAct("claim -> titles: list[str]", tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
loaded_react.load("optimized_react.json")
loaded_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.
To save the program in MLflow, run the following code:
import mlflow
# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_react"):
model_info = mlflow.dspy.log_model(
optimized_react,
artifact_path="model", # Any name to save the program in MLflow
)
# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)
To learn more about the integration, visit the MLflow DSPy Documentation as well.