Optimizing using Synthetic Q&A Data from Opik Traces


You will need:

  1. A Comet account, for seeing Opik visualizations (free!) - comet.com
  2. An OpenAI account, for using an LLM - platform.openai.com/settings/organization/api-keys

This example will use the opik-optimizer and tinyqabenchmarkpp packages.

Setup

This pip-install takes about a minute.

python
%pip install opik-optimizer tinyqabenchmarkpp --upgrade

This step configures the Opik library for your session. It will prompt for your Comet API key if not already set in your environment or through Opik's configuration.

python
import opik

opik.configure()

For this example, we'll use OpenAI models, so we need to set our OpenAI API key:

python
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Fetching Traces from Opik

Next, we will fetch existing traces of our AI application from Opik (we will use the demo project that ships with every Opik installation). We will then clean the trace contents and combine them into a single context string for the synthetic data generation step.

python
OPIK_PROJECT_NAME = "Demo chatbot 🤖"

# Will prompt for API key if not set
opik.configure()

# Fetch traces from the demo project
#
# Commented out the project name to
# fetch all traces across all projects
client = opik.Opik()
traces = client.search_traces(
    # project_name=OPIK_PROJECT_NAME,
    max_results=40
)
python
print(f"Found {len(traces)} traces")

We will define some helper functions to clean and traverse the traces, as we don't want to send noise to the LLM or break the input to the synthetic data generation step.

python
import re


def extract_text_from_dict(d: dict) -> list[str]:
    """
    Recursively extracts text from a dictionary.
    """
    texts = []
    for key, value in d.items():
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, dict):
            texts.extend(extract_text_from_dict(value))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, str):
                    texts.append(item)
                elif isinstance(item, dict):
                    texts.extend(extract_text_from_dict(item))
    return texts


def clean_text(text: str) -> str:
    """
    Cleans text by removing special characters and normalizing whitespace.
    """
    if not text:
        return ""
    # Replace special characters with spaces, but keep basic punctuation
    text = re.sub(r'[^\w\s.,!?;:\'"-]', " ", text)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text)
    # Remove any leading/trailing whitespace
    text = text.strip()
    return text
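As a quick sanity check, here is what the cleaning regexes do to a noisy string (the sample text is made up for illustration):

```python
import re

# Apply the same two regexes used in clean_text above:
# 1) replace special characters (e.g. emoji) with spaces,
# 2) collapse runs of whitespace, then strip the ends.
raw = "Hello 👋  world!\n\nHow   are you?"
cleaned = re.sub(r"\s+", " ", re.sub(r'[^\w\s.,!?;:\'"-]', " ", raw)).strip()
print(cleaned)  # Hello world! How are you?
```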

We are now ready to extract and clean the text from the traces.

python
# Extract and clean text from traces
cleaned_texts = []
for i, trace in enumerate(traces):
    # Extract from input
    if trace.input:
        if isinstance(trace.input, dict):
            texts = extract_text_from_dict(trace.input)
            for text in texts:
                cleaned = clean_text(text)
                if cleaned:
                    cleaned_texts.append(cleaned)
        elif isinstance(trace.input, str):
            cleaned = clean_text(trace.input)
            if cleaned:
                cleaned_texts.append(cleaned)

    # Extract from output
    if trace.output:
        if isinstance(trace.output, dict):
            texts = extract_text_from_dict(trace.output)
            for text in texts:
                cleaned = clean_text(text)
                if cleaned:
                    cleaned_texts.append(cleaned)
        elif isinstance(trace.output, str):
            cleaned = clean_text(trace.output)
            if cleaned:
                cleaned_texts.append(cleaned)

    # Extract from metadata if it exists
    if trace.metadata:
        if isinstance(trace.metadata, dict):
            texts = extract_text_from_dict(trace.metadata)
            for text in texts:
                cleaned = clean_text(text)
                if cleaned:
                    cleaned_texts.append(cleaned)

if not cleaned_texts:
    print("Debug: No text content found in traces. Here's what we got:")
    for i, trace in enumerate(traces[:5]):  # Show first 5 traces for debugging
        print(f"\nTrace {i}:")
        print(f"Input: {trace.input}")
        print(f"Output: {trace.output}")
        print(f"Metadata: {trace.metadata}")
    raise ValueError("No valid text content found in traces")

Let's quickly inspect the traces we have.

python
display(traces[0])

Next, we remove any duplicates while preserving order:

python
# Remove duplicates while preserving order
seen = set()
unique_texts = []
for text in cleaned_texts:
    if text not in seen:
        seen.add(text)
        unique_texts.append(text)
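The same deduplication can be written as a one-liner, since dict keys preserve insertion order in Python 3.7+:

```python
# Illustrative sample; in the notebook you would pass cleaned_texts here.
# dict.fromkeys keeps the first occurrence of each item, in order.
sample = ["hello", "world", "hello", "foo"]
deduped = list(dict.fromkeys(sample))
print(deduped)  # ['hello', 'world', 'foo']
```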

Now we create the context to pass to the synthetic data generation step:

python
context = f"""
This is a collection of AI/LLM conversation traces from
a given Comet Opik observability project. The following
text contains various interactions and responses that
can be used to generate relevant questions and answers.
<input>
{chr(10).join(unique_texts)}
</input>
"""
python
print(f"Found and cleaned {len(unique_texts)} unique text segments from traces")
print(f"Total context length: {len(context)} characters")

Generating Synthetic Data

We are now ready to generate the synthetic data using tinyqabenchmarkpp.

python
# Model for tinyqabenchmarkpp
TQB_GENERATOR_MODEL = "openai/gpt-4o-mini"

# Number of questions to generate
TQB_NUM_QUESTIONS = 20

# Languages to generate questions in
TQB_LANGUAGES = "en"

# Categories to generate questions in
TQB_CATEGORIES = (
    "use context provided and elaborate on it to generate a more detailed answers"
)

# Difficulty of the questions to generate
TQB_DIFFICULTY = "medium"
python
# Command to generate the synthetic data
command = [
    "python",
    "-m",
    "tinyqabenchmarkpp.generate",
    "--num",
    str(TQB_NUM_QUESTIONS),
    "--languages",
    TQB_LANGUAGES,
    "--categories",
    TQB_CATEGORIES,
    "--difficulty",
    TQB_DIFFICULTY,
    "--model",
    TQB_GENERATOR_MODEL,
    "--str-output",
    "--context",
    context,
]

Now we run the synthetic data generation step. Please be patient while the language model is called.

python
# Use a subprocess to run the command
import subprocess

process = subprocess.run(command, capture_output=True, text=True, check=True)

if process.stderr:
    # Print the errors
    print("tinyqabenchmarkpp errors:")
    print(process.stderr)
else:
    # Print the output
    print("Synthetic data generated successfully")
    print(process.stdout)

Store New Dataset in Opik

We can use the Opik SDK to push this dataset to Opik

python
generated_data = process.stdout

Next, we define a helper function to parse the JSONL response and push it to Opik as a dataset. Once defined, we can run it on the generated data.

python
import json


def load_synthetic_data_to_opik(data_str):
    """Load JSONL synthetic data into Opik as a dataset."""
    items = []
    for line in data_str.strip().split("\n"):
        try:
            data = json.loads(line)
            if not isinstance(data, dict):
                continue
            item = {
                "question": data.get("text"),
                "answer": data.get("label"),
                "generated_context": data.get("context"),
                "category": data.get("tags", {}).get("category"),
                "difficulty": data.get("tags", {}).get("difficulty"),
            }
            if item["question"] and item["answer"]:
                items.append(item)
        except Exception:
            continue

    if not items:
        print("No valid items found.")
        return None

    dataset_name = (
        f"demo-tinyqab-dataset-{TQB_CATEGORIES.replace(',', '_')}-{TQB_NUM_QUESTIONS}"
    )
    dataset_name = "".join(
        c if c.isalnum() or c in ["-", "_"] else "_" for c in dataset_name
    )

    opik_client = opik.Opik()
    dataset = opik_client.get_or_create_dataset(
        name=dataset_name,
        description=f"Synthetic QA from tinyqabenchmarkpp for {TQB_CATEGORIES}",
    )
    dataset.insert(items)
    print(f"Opik Dataset '{dataset.name}' created with ID: {dataset.id}")
    return dataset
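To make the parsing logic concrete, here is a hypothetical example of the JSONL shape the helper expects: one object per line with `text`, `label`, `context`, and `tags` fields (the sample values are invented for illustration).

```python
import json

# A made-up sample line mimicking tinyqabenchmarkpp's JSONL output,
# parsed the same way as in load_synthetic_data_to_opik above.
sample_line = json.dumps({
    "text": "What does Opik store for each trace?",
    "label": "Inputs, outputs, and metadata.",
    "context": "Opik traces capture inputs, outputs, and metadata.",
    "tags": {"category": "observability", "difficulty": "medium"},
})
data = json.loads(sample_line)
item = {
    "question": data.get("text"),
    "answer": data.get("label"),
    "generated_context": data.get("context"),
    "category": data.get("tags", {}).get("category"),
    "difficulty": data.get("tags", {}).get("difficulty"),
}
print(item["question"], "->", item["answer"])
```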

Push the data to Opik using the helper function:

python
opik_synthetic_dataset = load_synthetic_data_to_opik(generated_data)
if not opik_synthetic_dataset:
    print("Failed to load synthetic data into Opik. Exiting.")

Agent Optimization Using Synthetic Data

Let's import the required packages from the Opik Agent Optimizer SDK:

python
from opik_optimizer import MetaPromptOptimizer
from opik.evaluation.metrics import LevenshteinRatio

We need to set up some inputs to our optimizer, such as our starting prompt and some other configuration items.

python
from opik_optimizer import ChatPrompt

# Initial prompt for the optimizer
OPTIMIZER_INITIAL_PROMPT = ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{question}"},
    ],
    project_name=OPIK_PROJECT_NAME,
)

# Model for Opik Agent Optimizer
OPTIMIZER_MODEL = "openai/gpt-4o-mini"

# Population size for the optimizer
# Reduced for quicker demo
OPTIMIZER_POPULATION_SIZE = 5

# Number of generations for the optimizer
# Reduced for quicker demo
OPTIMIZER_NUM_GENERATIONS = 2

# Number of samples from dataset for optimization eval
OPTIMIZER_N_SAMPLES_OPTIMIZATION = 10

Now we can define the metric function used for evaluation and initialize the optimizer. We are opting to use the MetaPromptOptimizer from the SDK.

python
# Metric Configuration
def levenshtein_ratio(dataset_item, llm_output):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)


# Initialize the optimizer
optimizer = MetaPromptOptimizer(
    model=OPTIMIZER_MODEL,
    population_size=OPTIMIZER_POPULATION_SIZE,
    num_generations=OPTIMIZER_NUM_GENERATIONS,
    infer_output_style=True,
    verbose=1,
)
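The LevenshteinRatio metric returns a similarity score between 0 and 1: identical strings score 1.0, divergent strings score lower. For intuition only, here is a rough stdlib sketch of the idea using difflib (note: `SequenceMatcher.ratio()` is a different algorithm from the Levenshtein ratio that Opik computes, but it behaves similarly):

```python
import difflib

# Approximate string-similarity score in [0, 1]: 1.0 for identical
# strings, lower as the strings diverge. Shown for intuition only;
# Opik's LevenshteinRatio uses edit distance, not difflib's matcher.
def similarity(reference: str, output: str) -> float:
    return difflib.SequenceMatcher(None, reference, output).ratio()

print(similarity("Paris", "Paris"))  # 1.0
print(similarity("Paris", "Berlin"))
```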

Now we can run the optimizer on the dataset and initial starting prompt to find the best prompt based on our synthetic data.

python
# Uncomment the following if you want to pull the dataset from Opik
# without having to generate the synthetic data again

# import opik
# opik_client = opik.Opik()
# opik_synthetic_dataset = opik_client.get_or_create_dataset("demo-tinyqab-dataset-use_context_provided_and_elaborate_on_it_to_generate_a_more_detailed_answers-20")
python
# Run the optimizer
result = optimizer.optimize_prompt(
    prompt=OPTIMIZER_INITIAL_PROMPT,
    dataset=opik_synthetic_dataset,
    metric=levenshtein_ratio,
    n_samples=OPTIMIZER_N_SAMPLES_OPTIMIZATION,
)

Once the optimization process has finished, we can display the results:

python
result.display()

Next Steps

You can try out other optimizers. More details can be found in the Opik Agent Optimizer documentation.