apps/opik-documentation/python-sdk-docs/source/evaluation/metrics/index.rst
Opik includes a number of pre-built metrics to help you evaluate your LLM application.
Each metric can be called as a standalone function using the ``score`` method::

    from opik.evaluation.metrics import Hallucination

    metric = Hallucination()

    metric.score(
        input="What is the capital of France?",
        output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
        context=["France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower."],
    )
Or as part of an evaluation run using the ``evaluate`` function.
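For example, a minimal sketch of an evaluation run. The dataset name ``"my-dataset"`` and the
``my_llm_application`` helper are placeholders you would replace with your own; the keys returned
by the task function are passed to each scoring metric as keyword arguments::

    from opik import Opik
    from opik.evaluation import evaluate
    from opik.evaluation.metrics import Hallucination

    client = Opik()
    dataset = client.get_dataset(name="my-dataset")  # placeholder dataset name

    def evaluation_task(dataset_item):
        # Call your own application here; Hallucination expects input, output
        # and context in the returned dictionary.
        output = my_llm_application(dataset_item["input"])  # placeholder helper
        return {
            "input": dataset_item["input"],
            "output": output,
            "context": dataset_item.get("context", []),
        }

    evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[Hallucination()],
    )

You can learn more about the ``evaluate`` function in the evaluation guides.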
You can learn more about each metric in the following sections:
.. toctree::
   :maxdepth: 4
   :titlesonly:

   Equals
   RegexMatch
   Contains
   IsJson
   LevenshteinRatio
   Hallucination
   GEval
   Moderation
   AnswerRelevance
   ContextPrecision
   ContextRecall
   BaseMetric
   ConversationThreadMetric
   HeuristicMetrics
   ConversationHeuristicMetrics
   ConversationLLMJudges
   LLMJudgePresets
   LLMJuries
   UtilityMetrics
The pages above fall into two categories:
* Dedicated metric pages (for example Equals, Hallucination) that remain the
  authoritative deep dives.
* Aggregation pages that ensure every metric exported from
  :mod:`opik.evaluation.metrics` has an accompanying API reference.

Use these aggregation pages to browse the extended catalog:

* HeuristicMetrics — sentence/word overlap, readability, sentiment, prompt safety,
  and distribution comparisons.
* ConversationHeuristicMetrics — fast heuristics for degeneracy and knowledge
  retention.
* ConversationLLMJudges — LLM-as-a-judge conversation evaluators and session
  quality metrics.
* LLMJudgePresets — pre-built GEval presets and bias checks.
* LLMJuries — multi-judge aggregation.
* UtilityMetrics — helpers such as AggregatedMetric and RagasMetricWrapper.

Import any metric directly from :mod:`opik.evaluation.metrics`, and pair these API
references with the Fern guides in
``apps/opik-documentation/documentation/fern/docs/evaluation/metrics`` for workflow
context.
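As a quick illustration, heuristic metrics can be imported and scored directly from the module;
each ``score`` call returns a result object whose ``value`` field holds the numeric score. The
argument names below follow the individual metric pages, so check each page for the exact
signature::

    from opik.evaluation.metrics import Equals, LevenshteinRatio

    equals_result = Equals().score(output="Paris", reference="Paris")
    print(equals_result.value)  # 1.0 when the strings match exactly

    ratio_result = LevenshteinRatio().score(output="Pariss", reference="Paris")
    print(ratio_result.value)  # similarity ratio between 0.0 and 1.0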