Opik provides a set of built-in evaluation metrics that you can mix and match to evaluate LLM behaviour. These metrics are broken down into two main categories:
Heuristic metrics are ideal when you need reproducible checks such as exact matching, regex validation, or similarity scores against a reference. LLM as a Judge metrics are useful when you want richer qualitative feedback (hallucination detection, helpfulness, summarisation quality, regulatory risk, etc.).
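For a quick sense of how the heuristic metrics are used, here is a minimal sketch scoring a single output with `Equals` and `Contains`. It assumes both classes are importable from `opik.evaluation.metrics` and that their `score` methods accept `output` and `reference` arguments and return a result with a `value` field; check each metric's page for the exact signature.

```python
from opik.evaluation.metrics import Contains, Equals

output = "The capital of France is Paris."

# Exact string match against the expected answer
equals_result = Equals().score(output=output, reference="The capital of France is Paris.")
print(equals_result.value)  # expected 1.0 on an exact match

# Substring check: does the output mention "Paris"?
contains_result = Contains().score(output=output, reference="Paris")
print(contains_result.value)  # expected 1.0 when the substring is found
```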
The heuristic metrics below score a single output, typically against a reference value:

| Metric | Description | Documentation |
|---|---|---|
| BERTScore | Contextual embedding similarity score | BERTScore |
| ChrF | Character n-gram F-score (chrF / chrF++) | ChrF |
| Contains | Checks whether the output contains a specific substring | Contains |
| Corpus BLEU | Computes corpus-level BLEU across multiple outputs | CorpusBLEU |
| Equals | Checks if the output exactly matches an expected string | Equals |
| GLEU | Estimates grammatical fluency for candidate sentences | GLEU |
| IsJson | Validates that the output can be parsed as JSON | IsJson |
| JSDivergence | Jensen–Shannon similarity between token distributions | JSDivergence |
| JSDistance | Raw Jensen–Shannon divergence | JSDistance |
| KLDivergence | Kullback–Leibler divergence with smoothing | KLDivergence |
| Language Adherence | Verifies output language code | Language Adherence |
| Levenshtein | Calculates the normalized Levenshtein distance between output and reference | Levenshtein |
| Readability | Reports Flesch Reading Ease and FK grade | Readability |
| RegexMatch | Checks if the output matches a specified regular expression pattern | RegexMatch |
| ROUGE | Calculates ROUGE variants (rouge1/2/L/Lsum/W) | ROUGE |
| Sentence BLEU | Computes a BLEU score for a single output against one or more references | SentenceBLEU |
| Sentiment | Scores sentiment using VADER | Sentiment |
| Spearman Ranking | Spearman's rank correlation | Spearman Ranking |
| Tone | Flags tone issues such as shouting or negativity | Tone |
Two heuristic metrics operate on multi-turn conversations rather than single outputs:

| Metric | Description | Documentation |
|---|---|---|
| DegenerationC | Detects repetition and degeneration patterns over a conversation | DegenerationC |
| Knowledge Retention | Checks whether the last assistant reply preserves user facts from earlier turns | Knowledge Retention |
The LLM as a Judge metrics below use an LLM to evaluate a single response:

| Metric | Description | Documentation |
|---|---|---|
| Agent Task Completion Judge | Checks whether an agent fulfilled its assigned task | Agent Task Completion |
| Agent Tool Correctness Judge | Evaluates whether an agent used tools correctly | Agent Tool Correctness |
| Answer Relevance | Checks whether the answer stays on-topic with the question | Answer Relevance |
| Compliance Risk Judge | Identifies non-compliant or high-risk statements | Compliance Risk |
| Context Precision | Ensures the answer only uses relevant context | Context Precision |
| Context Recall | Measures how well the answer recalls supporting context | Context Recall |
| Dialogue Helpfulness Judge | Evaluates how helpful an assistant reply is in a dialogue | Dialogue Helpfulness |
| G-Eval | Task-agnostic judge configurable with custom instructions (see the sketch after this table) | G-Eval |
| Hallucination | Detects unsupported or hallucinated claims using an LLM judge | Hallucination |
| LLM Juries Judge | Averages scores from multiple judge metrics for ensemble scoring | LLM Juries |
| Meaning Match | Evaluates semantic equivalence between output and ground truth | Meaning Match |
| Moderation | Flags safety or policy violations in assistant responses | Moderation |
| Prompt Uncertainty Judge | Detects ambiguity in prompts that may confuse LLMs | Prompt Diagnostics |
| QA Relevance Judge | Determines whether an answer directly addresses the user question | QA Relevance |
| Structured Output Compliance | Checks JSON or schema adherence for structured responses | Structured Output |
| Summarization Coherence Judge | Rates the structure and coherence of a summary | Summarization Coherence |
| Summarization Consistency Judge | Checks if a summary stays faithful to the source | Summarization Consistency |
| Trajectory Accuracy | Scores how closely agent trajectories follow expected steps | Trajectory Accuracy |
| Usefulness | Rates how useful the answer is to the user | Usefulness |
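As an illustration of the configurable G-Eval judge listed above, here is a hedged sketch. It assumes a `GEval` class that takes `task_introduction` and `evaluation_criteria` arguments and scores a single `output` string; confirm the exact parameters on the G-Eval page.

```python
from opik.evaluation.metrics import GEval

# Configure the judge with task-specific instructions
metric = GEval(
    task_introduction=(
        "You are an expert judge. Evaluate whether the ANSWER is factually "
        "consistent with the CONTEXT."
    ),
    evaluation_criteria=(
        "The answer must not contradict the context and must not introduce "
        "claims the context does not support."
    ),
)

# Pack everything the judge needs into a single output string
result = metric.score(
    output=(
        "CONTEXT: Paris is the capital of France.\n"
        "ANSWER: The capital of France is Paris."
    )
)
print(result.value, result.reason)
```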
These LLM as a Judge metrics evaluate an entire conversation thread:

| Metric | Description | Documentation |
|---|---|---|
| Conversational Coherence | Evaluates coherence across sliding windows of a dialogue (see the sketch after this table) | Conversational Coherence |
| Session Completeness Quality | Checks whether user goals were satisfied during the session | Session Completeness |
| User Frustration | Estimates the likelihood a user was frustrated | User Frustration |
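Conversation-level judges receive the whole thread rather than a single output. A minimal sketch follows, assuming a `ConversationalCoherenceMetric` class that accepts the conversation as a list of role/content messages; see the Conversational Coherence page for the exact class name and signature.

```python
from opik.evaluation.metrics import ConversationalCoherenceMetric

metric = ConversationalCoherenceMetric()

# A multi-turn thread expressed as role/content messages
conversation = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings > Security and click 'Reset password'."},
    {"role": "user", "content": "I don't see a Security tab."},
    {"role": "assistant", "content": "On older plans it lives under Settings > Account instead."},
]

result = metric.score(conversation=conversation)
print(result.value, result.reason)
```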
By default, Opik uses GPT-5-nano from OpenAI as the LLM judge that evaluates the output of other LLMs. You can switch to another provider by passing a different `model` parameter, for example to call Anthropic Claude through Amazon Bedrock:
```python title="Python"
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)
```
```typescript title="TypeScript" language="typescript"
import { Hallucination } from 'opik';
import { openai } from '@ai-sdk/openai';
// Using model ID string (simplest approach)
const metric1 = new Hallucination({ model: 'gpt-4o' });
const metric2 = new Hallucination({ model: 'claude-3-5-sonnet-latest' });
const metric3 = new Hallucination({ model: 'gemini-2.0-flash' });
// With generation parameters (temperature, seed, maxTokens)
const metric4 = new Hallucination({
model: 'gpt-4o',
temperature: 0.3,
seed: 42
});
// Using custom LanguageModel instance for provider-specific configuration
const customModel = openai('gpt-4o', {
structuredOutputs: true
});
const metric5 = new Hallucination({ model: customModel });
// Score using the metric
await metric4.score({
input: "What is the capital of France?",
output: "The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
});
```
For Python, this functionality is built on the LiteLLM framework. You can find the full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or `LanguageModel` instances for advanced configuration. See the Models documentation for more details.