sdks/opik_optimizer/notebooks/OpikSyntheticDataOptimizer.ipynb
You will need:
This example will use:
This pip-install takes about a minute.
%pip install opik-optimizer tinyqabenchmarkpp --upgrade
This step configures the Opik library for your session. It will prompt for your Comet API key if not already set in your environment or through Opik's configuration.
import opik
opik.configure()
For this example, we'll use OpenAI models, so we need to set our OpenAI API key:
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
Next up we will "fetch" existing traces of our AI application within Opik (we will use the demo project that ships with every Opik installation).
The next cell fetches traces from the demo project. Later we will clean them and join them into a single context string for the synthetic data generation step.
OPIK_PROJECT_NAME = "Demo chatbot 🤖"
# Will prompt for API key if not set
opik.configure()
# Fetch traces from the demo project.
# The project name is commented out below so that
# traces are fetched across all projects.
client = opik.Opik()
traces = client.search_traces(
    # project_name=OPIK_PROJECT_NAME,
    max_results=40,
)
print(f"Found {len(traces)} traces")
We will define some helper functions to clean and traverse the traces, as we don't want to send noise to the LLM or break the input to the synthetic data generation step.
import re
def extract_text_from_dict(d: dict) -> list[str]:
    """
    Recursively extracts all string values from a (possibly nested) dictionary.
    """
    texts = []
    for value in d.values():
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, dict):
            texts.extend(extract_text_from_dict(value))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, str):
                    texts.append(item)
                elif isinstance(item, dict):
                    texts.extend(extract_text_from_dict(item))
    return texts
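To see what the extractor returns, here is a standalone check (the function is restated so the snippet runs on its own; the sample payload is made up):

```python
def extract_text_from_dict(d: dict) -> list[str]:
    """Same recursive string-extraction logic as above."""
    texts = []
    for value in d.values():
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, dict):
            texts.extend(extract_text_from_dict(value))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, str):
                    texts.append(item)
                elif isinstance(item, dict):
                    texts.extend(extract_text_from_dict(item))
    return texts

# Hypothetical nested trace payload; non-string leaves (0.9) are skipped
sample = {"question": "What is Opik?", "meta": {"tags": ["llm", {"note": "demo"}], "score": 0.9}}
print(extract_text_from_dict(sample))  # ['What is Opik?', 'llm', 'demo']
```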
def clean_text(text: str) -> str:
    """
    Cleans text by removing special characters and normalizing whitespace.
    """
    if not text:
        return ""
    # Replace special characters with spaces, but keep basic punctuation
    text = re.sub(r'[^\w\s.,!?;:\'"-]', " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    # Remove leading/trailing whitespace
    return text.strip()
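A quick standalone check of the cleaning behavior (the function is restated so the snippet runs on its own; the input string is invented):

```python
import re

def clean_text(text: str) -> str:
    """Same cleaning logic as above: strip special chars, collapse whitespace."""
    if not text:
        return ""
    text = re.sub(r'[^\w\s.,!?;:\'"-]', " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# The emoji becomes a space, then newlines/tabs/extra spaces collapse to one space
print(clean_text("Hello 👋  world!\n\tHow are you?"))  # Hello world! How are you?
```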
We are now ready to extract and clean the text from the traces.
# Extract and clean text from traces
cleaned_texts = []
for trace in traces:
    # Extract from input
    if trace.input:
        if isinstance(trace.input, dict):
            texts = extract_text_from_dict(trace.input)
            for text in texts:
                cleaned = clean_text(text)
                if cleaned:
                    cleaned_texts.append(cleaned)
        elif isinstance(trace.input, str):
            cleaned = clean_text(trace.input)
            if cleaned:
                cleaned_texts.append(cleaned)
    # Extract from output
    if trace.output:
        if isinstance(trace.output, dict):
            texts = extract_text_from_dict(trace.output)
            for text in texts:
                cleaned = clean_text(text)
                if cleaned:
                    cleaned_texts.append(cleaned)
        elif isinstance(trace.output, str):
            cleaned = clean_text(trace.output)
            if cleaned:
                cleaned_texts.append(cleaned)
    # Extract from metadata if it exists
    if trace.metadata and isinstance(trace.metadata, dict):
        texts = extract_text_from_dict(trace.metadata)
        for text in texts:
            cleaned = clean_text(text)
            if cleaned:
                cleaned_texts.append(cleaned)

if not cleaned_texts:
    print("Debug: No text content found in traces. Here's what we got:")
    for i, trace in enumerate(traces[:5]):  # Show first 5 traces for debugging
        print(f"\nTrace {i}:")
        print(f"Input: {trace.input}")
        print(f"Output: {trace.output}")
        print(f"Metadata: {trace.metadata}")
    raise ValueError("No valid text content found in traces")
Let's quickly inspect the traces we have.
display(traces[0])
Remove any duplicates while preserving the order
# Remove duplicates while preserving order
seen = set()
unique_texts = []
for text in cleaned_texts:
    if text not in seen:
        seen.add(text)
        unique_texts.append(text)
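As an aside, since Python 3.7 dictionaries preserve insertion order, so the same order-preserving deduplication can be written as a one-liner:

```python
texts = ["hi", "hello", "hi", "hey"]

# dict.fromkeys keeps the first occurrence of each key, in order
unique = list(dict.fromkeys(texts))
print(unique)  # ['hi', 'hello', 'hey']
```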
Now we create the context string to pass to the synthetic data generation step.
context = f"""
This is a collection of AI/LLM conversation traces from
a given Comet Opik observability project. The following
text contains various interactions and responses that
can be used to generate relevant questions and answers.
<input>
{chr(10).join(unique_texts)}
</input>
"""
print(f"Found and cleaned {len(unique_texts)} unique text segments from traces")
print(f"Total context length: {len(context)} characters")
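If your project has many long traces, the joined context can exceed the generator model's window. Below is a minimal sketch of a character-budget guard you could apply to `unique_texts` before building the context (the helper name and budget value are made up; tune them to your model):

```python
MAX_CONTEXT_CHARS = 60_000  # hypothetical budget; adjust to your model's window

def truncate_texts(texts: list[str], budget: int) -> list[str]:
    """Keep whole segments, in order, until the character budget is spent."""
    kept, used = [], 0
    for t in texts:
        if used + len(t) + 1 > budget:  # +1 for the joining newline
            break
        kept.append(t)
        used += len(t) + 1
    return kept

# Each segment costs 31 chars (30 + newline), so only two fit in a 70-char budget
sample = ["a" * 30, "b" * 30, "c" * 30]
print(len(truncate_texts(sample, budget=70)))  # 2
```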
We are now ready to generate the synthetic data using tinyqabenchmarkpp.
# Model for tinyqabenchmarkpp
TQB_GENERATOR_MODEL = "openai/gpt-4o-mini"
# Number of questions to generate
TQB_NUM_QUESTIONS = 20
# Languages to generate questions in
TQB_LANGUAGES = "en"
# Categories to generate questions in
TQB_CATEGORIES = (
    "use context provided and elaborate on it to generate a more detailed answers"
)
# Difficulty of the questions to generate
TQB_DIFFICULTY = "medium"
# Command to generate the synthetic data
command = [
    "python",
    "-m",
    "tinyqabenchmarkpp.generate",
    "--num",
    str(TQB_NUM_QUESTIONS),
    "--languages",
    TQB_LANGUAGES,
    "--categories",
    TQB_CATEGORIES,
    "--difficulty",
    TQB_DIFFICULTY,
    "--model",
    TQB_GENERATOR_MODEL,
    "--str-output",
    "--context",
    context,
]
Now we run the synthetic data generation step; be patient while the language model is called.
# Use a subprocess to run the command
import subprocess
process = subprocess.run(command, capture_output=True, text=True, check=True)
if process.stderr:
    # check=True already raised on a non-zero exit code, so anything here
    # is warnings/diagnostics the generator wrote to stderr
    print("tinyqabenchmarkpp errors:")
    print(process.stderr)
else:
    # Print the generated JSONL output
    print("Synthetic data generated successfully")
    print(process.stdout)
We can use the Opik SDK to push this dataset to Opik.
generated_data = process.stdout
Next, we define a helper function that parses the JSONL response and pushes it to Opik as a dataset. Once it is defined, we can run it on the generated data.
import json
def load_synthetic_data_to_opik(data_str):
    """Load JSONL synthetic data into Opik as a dataset."""
    items = []
    for line in data_str.strip().split("\n"):
        try:
            data = json.loads(line)
            if not isinstance(data, dict):
                continue
            item = {
                "question": data.get("text"),
                "answer": data.get("label"),
                "generated_context": data.get("context"),
                "category": data.get("tags", {}).get("category"),
                "difficulty": data.get("tags", {}).get("difficulty"),
            }
            if item["question"] and item["answer"]:
                items.append(item)
        except Exception:
            continue
    if not items:
        print("No valid items found.")
        return None
    dataset_name = (
        f"demo-tinyqab-dataset-{TQB_CATEGORIES.replace(',', '_')}-{TQB_NUM_QUESTIONS}"
    )
    dataset_name = "".join(
        c if c.isalnum() or c in ["-", "_"] else "_" for c in dataset_name
    )
    opik_client = opik.Opik()
    dataset = opik_client.get_or_create_dataset(
        name=dataset_name,
        description=f"Synthetic QA from tinyqabenchmarkpp for {TQB_CATEGORIES}",
    )
    dataset.insert(items)
    print(f"Opik Dataset '{dataset.name}' created with ID: {dataset.id}")
    return dataset
Push the data to Opik using the helper function.
opik_synthetic_dataset = load_synthetic_data_to_opik(generated_data)
if not opik_synthetic_dataset:
    print("Failed to load synthetic data into Opik. Exiting.")
Let's import the required packages from the Opik Agent Optimizer SDK.
from opik_optimizer import MetaPromptOptimizer
from opik.evaluation.metrics import LevenshteinRatio
We need to set up some inputs to our optimizer, such as our starting prompt and a few other configuration values.
from opik_optimizer import ChatPrompt
# Initial prompt for the optimizer
OPTIMIZER_INITIAL_PROMPT = ChatPrompt(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{question}"},
    ],
    project_name=OPIK_PROJECT_NAME,
)
# Model for Opik Agent Optimizer
OPTIMIZER_MODEL = "openai/gpt-4o-mini"
# Population size for the optimizer
# Reduced for quicker demo
OPTIMIZER_POPULATION_SIZE = 5
# Number of generations for the optimizer
# Reduced for quicker demo
OPTIMIZER_NUM_GENERATIONS = 2
# Number of samples from dataset for optimization eval
OPTIMIZER_N_SAMPLES_OPTIMIZATION = 10
Now we can set up the metric function used for the evaluation: it maps each dataset item and the model's output to a score.
We then pass this, along with the dataset and initial prompt, to our optimizer. We are opting to use the MetaPromptOptimizer from the SDK.
# Metric Configuration
def levenshtein_ratio(dataset_item, llm_output):
    return LevenshteinRatio().score(reference=dataset_item["answer"], output=llm_output)
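For intuition about what this metric rewards, here is a dependency-free sketch of one common Levenshtein-ratio definition, `1 - distance / max(len)` (this is an illustration, not necessarily the exact formula Opik's `LevenshteinRatio` uses):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

print(ratio("kitten", "kitten"))             # 1.0
print(round(ratio("kitten", "sitting"), 3))  # 0.571 (3 edits over 7 chars)
```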
# Initialize the optimizer
optimizer = MetaPromptOptimizer(
    model=OPTIMIZER_MODEL,
    population_size=OPTIMIZER_POPULATION_SIZE,
    num_generations=OPTIMIZER_NUM_GENERATIONS,
    infer_output_style=True,
    verbose=1,
)
Now we can run the optimizer on the dataset and initial starting prompt to find the best prompt based on our synthetic data.
# Uncomment the lines below if you want to pull the dataset from Opik
# without having to generate the synthetic data again
# import opik
# opik_client = opik.Opik()
# opik_synthetic_dataset = opik_client.get_or_create_dataset("demo-tinyqab-dataset-use_context_provided_and_elaborate_on_it_to_generate_a_more_detailed_answers-20")
# Run the optimizer
result = optimizer.optimize_prompt(
    prompt=OPTIMIZER_INITIAL_PROMPT,
    dataset=opik_synthetic_dataset,
    metric=levenshtein_ratio,
    n_samples=OPTIMIZER_N_SAMPLES_OPTIMIZATION,
)
Finally, we can display the optimization results.
result.display()
You can try out other optimizers. More details can be found in the Opik Agent Optimizer documentation.