Tutorial: Agents

docs/docs/tutorials/agents/index.ipynb

Let's walk through a quick example of setting up a dspy.ReAct agent with a couple of tools and optimizing it to conduct advanced browsing for multi-hop search.

Install the latest DSPy via pip install -U dspy and follow along. You also need to run pip install datasets.

<details> <summary>Recommended: Set up MLflow Tracing to understand what's happening under the hood.</summary>

MLflow DSPy Integration

<a href="https://mlflow.org/">MLflow</a> is an LLMOps tool that natively integrates with DSPy, offering explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces, which helps you understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

  1. Install MLflow
bash
%pip install "mlflow>=2.20"
  2. Start the MLflow UI in a separate terminal
bash
mlflow ui --port 5000
  3. Connect the notebook to MLflow
python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
python
mlflow.dspy.autolog()

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.

To learn more about the integration, visit the MLflow DSPy Documentation as well.

</details>

In this tutorial, we'll use an extremely small LM, Meta's Llama-3.2-3B-Instruct, which has just 3 billion parameters.

A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.

You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.

In the snippet below, we'll configure our main LM as Llama-3.2-3B. We'll also set up a larger LM, GPT-4o, as a teacher that we'll invoke a very small number of times to help teach the small LM.

python
import dspy

llama3b = dspy.LM('<provider>/Llama-3.2-3B-Instruct', temperature=0.7)
gpt4o = dspy.LM('openai/gpt-4o', temperature=0.7)

dspy.configure(lm=llama3b)

Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.

python
import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="vincentkoc/hover-parquet", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:100], hover[100:200], hover[650:]
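The list comprehension above deduplicates by hpqa_id with a small trick: set.add returns None, so `not hpqa_ids.add(...)` always evaluates to True while recording the id as a side effect. A stripped-down sketch of the same idiom, on made-up toy data:

```python
# Deduplicate by key while filtering, in a single comprehension.
# set.add returns None, so `not seen.add(key)` is always truthy,
# but recording the key is its side effect.
seen = set()
items = [("a", 1), ("b", 2), ("a", 3)]

deduped = [(key, val) for key, val in items if key not in seen and not seen.add(key)]

print(deduped)  # [('a', 1), ('b', 2)]
```

Only the first item per key survives, and the filter runs in one pass without a separate loop.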

Let's view an example of this task:

python
example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)

Now, let's define a function to do the search in Wikipedia. We'll rely on a ColBERTv2 server that can search the "abstracts" (i.e., first paragraphs) of every article that existed in Wikipedia in 2017, which is the data used in HoVer.

python
DOCS = {}

def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=k)
    results = [x['text'] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results

Now, let's use the search function to define two tools for our ReAct agent:

python
def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]

def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]
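Both tools depend on the "Title | text" format of the retrieved passages. As a minimal sketch (with made-up passages, not real retrieval output), the split-and-cache pattern works like this:

```python
# Made-up passages in the "Title | text" format the search backend returns.
results = [
    "David Gregory | David Gregory was a Scottish physician and inventor.",
    "Edinburgh | Edinburgh is the capital city of Scotland.",
]

docs = {}
for result in results:
    # Split only on the first " | ", in case the body text also contains the separator.
    title, text = result.split(" | ", 1)
    docs[title] = text

print(docs["Edinburgh"])  # Edinburgh is the capital city of Scotland.
```

Caching pages by title this way lets lookup_wikipedia answer from memory before falling back to a fresh search.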

Now, let's define the ReAct agent in DSPy. It's going to be super simple: it'll take a claim and produce a list of page titles, i.e. an output field titles: list[str].

We'll instruct it to find all Wikipedia titles that are needed to fact-check the claim.

python
instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)
react = dspy.ReAct(signature, tools=[search_wikipedia, lookup_wikipedia], max_iters=20)

Let's try it with a really simple claim to see if our tiny 3B model can do it!

python
react(claim="David Gregory was born in 1625.").titles[:3]

Great. Now let's set up an evaluation metric, top5_recall.

It will return the fraction of the gold pages (which are always 3) that are retrieved in the top-5 titles returned by the agent.

python
def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0
    
    # If we're just doing inference, just measure the recall.
    return recall

evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)
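To make the metric's arithmetic concrete, here is the same top-5 recall computation on toy data (plain Python lists standing in for the example and prediction objects):

```python
gold_titles = ["Page A", "Page B", "Page C"]                         # always 3 gold pages
pred_titles = ["Page A", "Noise 1", "Page C", "Noise 2", "Noise 3"]  # agent's top-5 titles

# Fraction of gold pages that appear among the top-5 predictions.
recall = sum(title in pred_titles[:5] for title in gold_titles) / len(gold_titles)
print(recall)  # 0.6666666666666666 -- 2 of the 3 gold pages were retrieved
```

During bootstrapping (when trace is not None), the metric instead demands recall >= 1.0, so only traces that retrieve all three gold pages become demonstrations.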

Let's evaluate our off-the-shelf agent, with Llama-3.2-3B, to see how far we can go already.

This model is tiny, so it fails fairly often. Let's wrap the agent in a try/except block to handle those errors.

python
def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception:
        return dspy.Prediction(titles=[])

evaluate(safe_react)
<details> <summary>Tracking Evaluation Results in MLflow Experiment</summary>

To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.

python
import mlflow

with mlflow.start_run(run_name="agent_evaluation"):
    evaluate = dspy.Evaluate(
        devset=devset,
        metric=top5_recall,
        num_threads=16,
        display_progress=True,
    )

    # Evaluate the program as usual
    result = evaluate(safe_react)

    # Log the aggregated score
    mlflow.log_metric("top5_recall", result.score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Claim": [example.claim for example in devset],
            "Expected Titles": [example.titles for example in devset],
            "Predicted Titles": [output[1] for output in result.results],
            "Top 5 Recall": [output[2] for output in result.results],
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit the MLflow DSPy Documentation as well.

</details>

Wow. It only scores 8% in terms of recall. Not that good!

Let's now optimize the two prompts inside dspy.ReAct jointly to maximize the recall of our agent. This may take around 30 minutes and roughly $5 worth of GPT-4o calls to optimize Llama-3.2-3B.

python
kwargs = dict(teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o, max_errors=999)

tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **kwargs)
optimized_react = tp.compile(react, trainset=trainset, max_bootstrapped_demos=3, max_labeled_demos=0)

Let's now evaluate again, after optimization.

python
evaluate(optimized_react)

Awesome. It looks like the system improved drastically from 8% recall to around 40% recall. That was a pretty straightforward approach, but DSPy gives you many tools to continue iterating on this from here.

Next, let's inspect the optimized prompts to understand what the agent has learned. We'll run one query and then inspect the last two prompts, which will show us the prompts used for both ReAct sub-modules: the one that runs the agentic loop and the one that prepares the final results. (Alternatively, if you enabled MLflow Tracing following the instructions above, you can see all the steps taken by the agent, including LLM calls, prompts, and tool executions, in a rich tree view.)

python
optimized_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
python
dspy.inspect_history(n=2)

Finally, let's save our optimized program so we can use it again later.

python
optimized_react.save("optimized_react.json")

loaded_react = dspy.ReAct("claim -> titles: list[str]", tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
loaded_react.load("optimized_react.json")

loaded_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
<details> <summary>Saving programs in MLflow Experiment</summary>

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

python
import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_rag"):
    model_info = mlflow.dspy.log_model(
        optimized_react,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit the MLflow DSPy Documentation as well.

</details>