When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.
Opik Experiments allow you to evaluate the prompt on multiple samples, score each LLM output and compare the performance of different prompts.
There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.
You can compare multiple prompts to each other by clicking the + Add prompt button in the top
right corner of the playground. This will allow you to enter multiple prompts and compare them side
by side.
In order to evaluate the prompts on samples, you can add variables to the prompt messages using the
`{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.
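To make the `{{variable}}` substitution concrete, here is a minimal sketch of how mustache-style placeholders map onto the fields of a dataset item. This is plain Python for illustration only; the playground and the SDK perform this substitution for you, and `render_prompt` is a hypothetical helper, not part of the Opik API:

```python
import re

def render_prompt(template: str, item: dict) -> str:
    """Replace each {{variable}} placeholder with the matching dataset-item field."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(item[m.group(1)]), template)

item = {"input": "What is the capital of France?", "expected_output": "Paris"}
prompt = render_prompt("Translate the following text to French: {{input}}", item)
print(prompt)  # Translate the following text to French: What is the capital of France?
```

Each dataset item must therefore contain a field for every placeholder used in the prompt messages.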
## Using the evaluate_prompt function

The Opik SDKs provide a simple way to evaluate prompts using the `evaluate_prompt` method (`evaluatePrompt` in TypeScript). This
method allows you to specify a dataset, a prompt and a model. The prompt is then evaluated on each
dataset item, and the output can then be reviewed and annotated in the Opik UI.
To run the experiment, you can use the following code:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt } from 'opik';

// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: "gpt-4o",
projectName: "my-project",
});
```
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt
# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
project_name="my-project",
)
```
</CodeBlocks>
Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt`
function allows you to specify a list of scoring metrics that are used to score each LLM output.
Opik has a set of built-in metrics that allow you to detect hallucinations, assess answer relevance, etc.,
and if we don't have the metric you need, you can easily create your own.
You can find a full list of all the supported metrics in the Metrics Overview section, or you can define your own metric using the Custom Metrics section.
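To illustrate the idea behind a custom metric, here is a standalone sketch of the shape such a metric takes: a class with a `score` method that returns a result carrying a value and a reason. The `ScoreResult` dataclass and `ContainsExpected` class below are simplified stand-ins for this sketch, not the real Opik classes; the actual base class and result type are documented in the Custom Metrics section:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for a scoring result; the real
# Opik custom-metric API is described in the Custom Metrics section.
@dataclass
class ScoreResult:
    name: str
    value: float
    reason: str

class ContainsExpected:
    """Toy metric: scores 1.0 if the expected output appears in the LLM output."""

    def __init__(self, name: str = "contains_expected"):
        self.name = name

    def score(self, output: str, expected_output: str) -> ScoreResult:
        hit = expected_output.lower() in output.lower()
        return ScoreResult(
            name=self.name,
            value=1.0 if hit else 0.0,
            reason="expected text found" if hit else "expected text missing",
        )

result = ContainsExpected().score("The capital is Paris.", "Paris")
print(result.value)  # 1.0
```

A real metric would typically call an LLM judge or a heuristic rather than a substring check, but the contract is the same: take the output (and any reference fields) and return a score with an explanation.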
By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify a list
of metrics to use for scoring. We will update the example above to use the `Hallucination` metric
for scoring:
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
scoring_metrics=[Hallucination()],
project_name="my-project",
)
```
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt, Hallucination } from 'opik';
// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: "gpt-4o",
scoringMetrics: [new Hallucination()],
projectName: "my-project",
});
```
## Customizing the model used

You can customize the model used by creating a new model using the `LiteLLMChatModel` class. This supports passing additional parameters to the model, such as the temperature or the base URL to use.
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt, models
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model=models.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
scoring_metrics=[Hallucination()],
project_name="my-project",
)
```
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt, Hallucination } from 'opik';
import { createOpenAI } from '@ai-sdk/openai';
// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Define a custom model provider with specific configuration
const openai = createOpenAI({
// custom settings https://ai-sdk.dev/providers/ai-sdk-providers/openai#setup
baseURL: "https://api.openai.com/v1"
});
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: openai('gpt-4o'),
scoringMetrics: [new Hallucination()],
temperature: 0,
projectName: "my-project",
});
```
## Filtering dataset items

You can evaluate only a subset of your dataset items by using the `dataset_filter_string` parameter. This is useful when you want to run experiments on specific categories of data:
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt

# Create or get a dataset
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
# Evaluate only items with specific tags
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
dataset_filter_string='tags contains "production"',
project_name="my-project",
)
# Evaluate items matching multiple conditions
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Answer the question: {{question}}"},
],
model="gpt-4",
dataset_filter_string='data.category = "finance" AND data.difficulty = "hard"',
project_name="my-project",
)
```
The filter uses Opik Query Language (OQL) syntax. For more details on filter syntax and supported columns, see Filtering syntax.
To evaluate complex LLM applications such as RAG pipelines or agents, you can use the `evaluate` function.
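Unlike `evaluate_prompt`, which evaluates a fixed prompt template, `evaluate` scores the output of an arbitrary task function. A minimal sketch of that pattern, in plain Python (`my_llm_application` is a hypothetical stand-in for your RAG pipeline or agent; see the evaluation docs for the exact `evaluate` signature):

```python
def my_llm_application(dataset_item: dict) -> dict:
    # Hypothetical stand-in for a RAG pipeline or agent call
    answer = f"Answer to: {dataset_item['input']}"
    return {"output": answer}

# The evaluation harness calls the task once per dataset item and
# passes the returned dict to the scoring metrics.
items = [{"input": "What is the capital of France?"}]
results = [my_llm_application(item) for item in items]
print(results[0]["output"])  # Answer to: What is the capital of France?
```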