When developing prompts and performing prompt engineering, it can be challenging to know if a new prompt is better than the previous version.
Opik Experiments allow you to evaluate the prompt on multiple samples, score each LLM output and compare the performance of different prompts.
There are two ways to evaluate a prompt in Opik:

1. Using the prompt playground
2. Using the `evaluate_prompt` function in the Python SDK

## Using the prompt playground

The Opik playground allows you to quickly test different prompts and see how they perform.
You can compare multiple prompts to each other by clicking the + Add prompt button in the top
right corner of the playground. This will allow you to enter multiple prompts and compare them side
by side.
In order to evaluate the prompts on samples, you can add variables to the prompt messages using the
`{{variable}}` syntax. You can then connect a dataset and run the prompts on each dataset item.
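To make the `{{variable}}` substitution concrete, here is a minimal sketch of how mustache-style placeholders map onto the fields of a dataset item. This is plain Python for illustration only; the playground and the SDK perform this substitution for you, and `render_prompt` is a hypothetical helper, not part of the Opik API:

```python
import re

def render_prompt(template: str, item: dict) -> str:
    """Replace each {{variable}} placeholder with the matching dataset-item field."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(item[m.group(1)]), template)

item = {"input": "What is the capital of France?", "expected_output": "Paris"}
prompt = render_prompt("Translate the following text to French: {{input}}", item)
print(prompt)  # Translate the following text to French: What is the capital of France?
```

Each dataset item must therefore contain a field for every placeholder used in the prompt messages.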
## Using the evaluate_prompt function

The Opik SDKs provide a simple way to evaluate prompts using the `evaluate_prompt` method (`evaluatePrompt` in TypeScript). This
method allows you to specify a dataset, a prompt and a model. The prompt is then evaluated on each
dataset item, and the output can then be reviewed and annotated in the Opik UI.
To run the experiment, you can use the following code:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt } from 'opik';

// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: "gpt-4o",
projectName: "my-project",
});
```
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt
# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
project_name="my-project",
)
```
</CodeBlocks>
Manually reviewing each LLM output can be time-consuming and error-prone. The `evaluate_prompt`
function allows you to specify a list of scoring metrics that are used to score each LLM output.
Opik has a set of built-in metrics that allow you to detect hallucinations, assess answer relevance, etc.,
and if we don't have the metric you need, you can easily create your own.
You can find a full list of all the supported metrics in the Metrics Overview section, or you can define your own metric using the Custom Metrics section.
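To illustrate the idea behind a custom metric, here is a standalone sketch of the shape such a metric takes: a class with a `score` method that returns a result carrying a value and a reason. The `ScoreResult` dataclass and `ContainsExpected` class below are simplified stand-ins for this sketch, not the real Opik classes; the actual base class and result type are documented in the Custom Metrics section:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for a scoring result; the real
# Opik custom-metric API is described in the Custom Metrics section.
@dataclass
class ScoreResult:
    name: str
    value: float
    reason: str

class ContainsExpected:
    """Toy metric: scores 1.0 if the expected output appears in the LLM output."""

    def __init__(self, name: str = "contains_expected"):
        self.name = name

    def score(self, output: str, expected_output: str) -> ScoreResult:
        hit = expected_output.lower() in output.lower()
        return ScoreResult(
            name=self.name,
            value=1.0 if hit else 0.0,
            reason="expected text found" if hit else "expected text missing",
        )

result = ContainsExpected().score("The capital is Paris.", "Paris")
print(result.value)  # 1.0
```

A real metric would typically call an LLM judge or a heuristic rather than a substring check, but the contract is the same: take the output (and any reference fields) and return a score with an explanation.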
By adding the `scoring_metrics` parameter to the `evaluate_prompt` function, you can specify a list
of metrics to use for scoring. We will update the example above to use the `Hallucination` metric
for scoring:
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
scoring_metrics=[Hallucination()],
project_name="my-project",
)
```
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt, Hallucination } from 'opik';
// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: "gpt-4o",
scoringMetrics: [new Hallucination()],
projectName: "my-project",
});
```
## Customizing the model used

You can customize the model used by creating a new model using the `LiteLLMChatModel` class. This supports passing additional parameters to the model, such as the temperature or the base URL to use.
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt, models
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
dataset.insert([
{"input": "Hello, world!", "expected_output": "Hello, world!"},
{"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model=models.LiteLLMChatModel(model="gpt-3.5-turbo", temperature=0),
scoring_metrics=[Hallucination()],
project_name="my-project",
)
```
```typescript title="TypeScript" language="typescript"
import { Opik, evaluatePrompt, Hallucination } from 'opik';
import { createOpenAI } from '@ai-sdk/openai';
// Create a dataset that contains the samples you want to evaluate
const opikClient = new Opik();
const dataset = await opikClient.getOrCreateDataset({
name: "my_dataset",
});
await dataset.insert([
{ input: "Hello, world!", expected_output: "Hello, world!" },
{ input: "What is the capital of France?", expected_output: "Paris" },
]);
// Define a custom model provider with specific configuration
const openai = createOpenAI({
// custom settings https://ai-sdk.dev/providers/ai-sdk-providers/openai#setup
baseURL: "https://api.openai.com/v1"
});
// Run the evaluation
await evaluatePrompt({
dataset,
messages: [
{ role: "user", content: "Translate the following text to French: {{input}}" },
],
model: openai('gpt-4o'),
scoringMetrics: [new Hallucination()],
temperature: 0,
projectName: "my-project",
});
```
## Filtering dataset items

You can evaluate only a subset of your dataset items by using the `dataset_filter_string` parameter. This is useful when you want to run experiments on specific categories of data:
```python title="Python" language="python"
import opik
from opik.evaluation import evaluate_prompt

# Create or get a dataset
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("my_dataset", project_name="my-project")
# Evaluate only items with specific tags
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Translate the following text to French: {{input}}"},
],
model="gpt-3.5-turbo",
dataset_filter_string='tags contains "production"',
project_name="my-project",
)
# Evaluate items matching multiple conditions
evaluate_prompt(
dataset=dataset,
messages=[
{"role": "user", "content": "Answer the question: {{question}}"},
],
model="gpt-4",
dataset_filter_string='data.category = "finance" AND data.difficulty = "hard"',
project_name="my-project",
)
```
The filter uses Opik Query Language (OQL) syntax. For more details on filter syntax and supported columns, see Filtering syntax.
To evaluate complex LLM applications such as RAG pipelines or agents, you can use the `evaluate` function.
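Unlike `evaluate_prompt`, which evaluates a fixed prompt template, `evaluate` scores the output of an arbitrary task function. A minimal sketch of that pattern, in plain Python (`my_llm_application` is a hypothetical stand-in for your RAG pipeline or agent; see the evaluation docs for the exact `evaluate` signature):

```python
def my_llm_application(dataset_item: dict) -> dict:
    # Hypothetical stand-in for a RAG pipeline or agent call
    answer = f"Answer to: {dataset_item['input']}"
    return {"output": answer}

# The evaluation harness calls the task once per dataset item and
# passes the returned dict to the scoring metrics.
items = [{"input": "What is the capital of France?"}]
results = [my_llm_application(item) for item in items]
print(results[0]["output"])  # Answer to: What is the capital of France?
```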