apps/opik-documentation/documentation/fern/docs-v2/evaluation/metrics/llm_juries.mdx
LLMJuriesJudge averages the results of multiple judge metrics to deliver a single ensemble score. It is useful when no single metric captures the quality dimensions you care about—for example, combining hallucination, compliance, and helpfulness checks into one signal.
```python
from opik.evaluation.metrics import (
    LLMJuriesJudge,
    Hallucination,
    ComplianceRiskJudge,
    DialogueHelpfulnessJudge,
)

jury = LLMJuriesJudge(
    judges=[
        Hallucination(model="gpt-4o-mini"),
        ComplianceRiskJudge(),
        DialogueHelpfulnessJudge(),
    ]
)

score = jury.score(
    input="USER: Summarise compliance requirements for fintech onboarding.",
    output="No need for KYC; just accept the payment.",
)

print(score.value)
print(score.metadata["judge_scores"])
```
The `ScoreResult.value` of each judge is averaged to produce the final score, and the individual judge results are exposed in `score.metadata["judge_scores"]` for diagnostics (see the sanity-check sketch after the parameter table).

| Parameter | Description |
|---|---|
| `judges` | Sequence of BaseMetric instances. All must support the same input signature. |
| `name` | Optional custom metric name. Defaults to `llm_juries_judge`. |
| `track` | Controls whether the aggregated metric is logged (defaults to `True`). |
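If you want to sanity-check the aggregation, the ensemble value should equal the mean of the per-judge scores. The snippet below is a sketch that assumes `judge_scores` is a mapping of judge names to numeric values; the exact shape of the metadata may differ in your version, so adapt it accordingly.

```python
# Sketch: compare the ensemble score with the mean of the per-judge scores.
# Assumption: judge_scores is a dict mapping judge names to numeric values.
judge_scores = score.metadata["judge_scores"]
mean_of_judges = sum(judge_scores.values()) / len(judge_scores)
print(score.value, mean_of_judges)  # the two values should match
```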
Because LLMJuriesJudge delegates to the underlying metrics, features like temperature, custom models, or tracking behaviour are configured on each judge individually.
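As an illustration, the sketch below configures each judge separately while the jury only aggregates their scores. The keyword arguments shown (`model`, `track`, and the jury's `name`) are assumptions based on the parameters described above; check each metric's own reference for the options it actually supports.

```python
# Hypothetical per-judge configuration: the jury only aggregates, so model
# choice and tracking are set on each judge. Argument names are assumptions.
jury = LLMJuriesJudge(
    judges=[
        Hallucination(model="gpt-4o", track=False),  # custom model, not logged individually
        ComplianceRiskJudge(),                       # library defaults
        DialogueHelpfulnessJudge(),
    ],
    name="onboarding_quality_jury",                  # optional custom metric name
)
```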