apps/opik-documentation/documentation/fern/docs-v2/evaluation/metrics/llm_juries.mdx
LLMJuriesJudge averages the results of multiple judge metrics to deliver a single ensemble score. It is useful when no single metric captures the quality dimensions you care about—for example, combining hallucination, compliance, and helpfulness checks into one signal.
```python
from opik.evaluation.metrics import (
    LLMJuriesJudge,
    Hallucination,
    ComplianceRiskJudge,
    DialogueHelpfulnessJudge,
)

jury = LLMJuriesJudge(
    judges=[
        Hallucination(model="gpt-4o-mini"),
        ComplianceRiskJudge(),
        DialogueHelpfulnessJudge(),
    ]
)

score = jury.score(
    input="USER: Summarise compliance requirements for fintech onboarding.",
    output="No need for KYC; just accept the payment.",
)

print(score.value)
print(score.metadata["judge_scores"])
```
The `ScoreResult.value` of each judge is averaged to produce the final score, and the individual judge results are exposed in `score.metadata["judge_scores"]` for diagnostics (see the sanity-check sketch after the parameter table).

| Parameter | Description |
|---|---|
| `judges` | Sequence of BaseMetric instances. All must support the same input signature. |
| `name` | Optional custom metric name. Defaults to `llm_juries_judge`. |
| `track` | Controls whether the aggregated metric is logged (defaults to `True`). |
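If you want to sanity-check the aggregation, the ensemble value should equal the mean of the per-judge scores. The snippet below is a sketch that assumes `judge_scores` is a mapping of judge names to numeric values; the exact shape of the metadata may differ in your version, so adapt it accordingly.

```python
# Sketch: compare the ensemble score with the mean of the per-judge scores.
# Assumption: judge_scores is a dict mapping judge names to numeric values.
judge_scores = score.metadata["judge_scores"]
mean_of_judges = sum(judge_scores.values()) / len(judge_scores)
print(score.value, mean_of_judges)  # the two values should match
```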
Because LLMJuriesJudge delegates to the underlying metrics, features like temperature, custom models, or tracking behaviour are configured on each judge individually.
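As an illustration, the sketch below configures each judge separately while the jury only aggregates their scores. The keyword arguments shown (`model`, `track`, and the jury's `name`) are assumptions based on the parameters described above; check each metric's own reference for the options it actually supports.

```python
# Hypothetical per-judge configuration: the jury only aggregates, so model
# choice and tracking are set on each judge. Argument names are assumptions.
jury = LLMJuriesJudge(
    judges=[
        Hallucination(model="gpt-4o", track=False),  # custom model, not logged individually
        ComplianceRiskJudge(),                       # library defaults
        DialogueHelpfulnessJudge(),
    ],
    name="onboarding_quality_jury",                  # optional custom metric name
)
```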