cookbook/09_evals/agent_as_judge/README.md
Agent-as-judge examples evaluate output quality with model-based scoring.
- agent_as_judge_basic.py - Sync and async numeric scoring with persisted results.
- agent_as_judge_post_hook.py - Sync and async post-hook evaluation examples.
- agent_as_judge_batch.py - Batch case evaluation with summary output.
- agent_as_judge_binary.py - PASS/FAIL quality evaluation example.
- agent_as_judge_custom_evaluator.py - Uses a custom evaluator agent.
- agent_as_judge_team.py - Evaluates quality of team-generated responses.
- agent_as_judge_team_post_hook.py - Team post-hook quality checking.
- agent_as_judge_with_guidelines.py - Numeric scoring with additional guidelines.
- agent_as_judge_with_tools.py - Evaluates responses from a tool-using agent.
- agent_as_judge_eval_metrics.py - Eval model metrics tracked under "eval_model" detail key via post-hook.
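The core pattern shared by these examples can be sketched in plain Python. This is an illustration of agent-as-judge scoring, not the library's actual API: `JudgeResult`, `judge_response`, and `stub_judge` are hypothetical names, and a stub stands in for the judge model so the sketch runs without an LLM backend.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    score: int   # numeric quality rating, e.g. 1-10
    reason: str  # judge's justification for the score

def judge_response(question: str, answer: str,
                   judge_model: Callable[[str], str]) -> JudgeResult:
    """Ask a judge model to rate an answer; parse a 'score|reason' reply."""
    prompt = (
        "Rate the answer to the question on a 1-10 scale.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form: <score>|<reason>"
    )
    raw = judge_model(prompt)
    score_text, _, reason = raw.partition("|")
    return JudgeResult(score=int(score_text.strip()), reason=reason.strip())

# Stub judge model (hypothetical) so the example is self-contained.
def stub_judge(prompt: str) -> str:
    return "8|Accurate and concise."

result = judge_response("What is 2+2?", "4", stub_judge)
print(result.score, result.reason)
```

A binary PASS/FAIL variant (as in agent_as_judge_binary.py) would simply have the judge reply with a verdict instead of a number, and a post-hook variant would run the same check automatically after each agent response.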