cookbook/09_evals/TEST_PROMPT.md
Goal: Thoroughly test and validate cookbook/09_evals so it aligns with our cookbook standards.
Context files (read these first):
- `AGENTS.md`: project conventions, virtual environments, testing workflow
- `cookbook/STYLE_GUIDE.md`: Python file structure rules

Environment:
- `.venvs/demo/bin/python`
- `direnv allow`

Execution requirements:
Read every .py file in the target cookbook directory before making any changes.
Do not rely solely on grep or the structure checker — open and read each file to understand its full contents. This ensures you catch issues the automated checker might miss (e.g., imports inside sections, stale model references in comments, inconsistent patterns).
Spawn a parallel agent for each top-level subdirectory under cookbook/09_evals/ (accuracy/, agent_as_judge/, performance/, reliability/). Each agent handles one subdirectory independently, including any nested subdirectories within it.
Each agent must:
a. Run `.venvs/demo/bin/python cookbook/scripts/check_cookbook_pattern.py --base-dir cookbook/09_evals/<SUBDIR> --recursive` and fix any violations.
b. Run all `*.py` files in that subdirectory (and nested subdirectories) using `.venvs/demo/bin/python` and capture outcomes. Skip `__init__.py`.
c. Ensure Python examples align with cookbook/STYLE_GUIDE.md:
- `=====` underline
- `# ---------------------------------------------------------------------------`
- `if __name__ == "__main__":` gate

d. Check non-Python files (`README.md`, etc.) in the directory for stale `OpenAIChat` references and update them.
e. Make only minimal, behavior-preserving edits where needed for style compliance.
f. Update `cookbook/09_evals/<SUBDIR>/TEST_LOG.md` with fresh PASS/FAIL entries per file. For nested subdirectories, create a `TEST_LOG.md` in each.

After all agents complete, collect and merge results.
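Step b can be sketched roughly as follows. This is a minimal illustration, not part of the cookbook tooling: the helper name `run_examples`, the 300-second timeout, and the rule that PASS means "exit code 0" are all assumptions, since this prompt does not define them.

```python
import subprocess
import sys
from pathlib import Path


def run_examples(
    subdir: Path,
    python: str = sys.executable,
    timeout: float = 300.0,  # assumed limit; not specified by this prompt
) -> list[tuple[str, str, str]]:
    """Run every *.py file under `subdir` (recursively) and record an outcome.

    Returns (relative_path, status, note) tuples. Status is "PASS" when the
    script exits with code 0 and "FAIL" otherwise (an assumed definition).
    """
    results = []
    for path in sorted(subdir.rglob("*.py")):
        if path.name == "__init__.py":
            continue  # step b says to skip __init__.py
        try:
            proc = subprocess.run(
                [python, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            if proc.returncode == 0:
                status, note = "PASS", ""
            else:
                # Keep only the last stderr line as a short note.
                tail = proc.stderr.strip().splitlines() or ["non-zero exit"]
                status, note = "FAIL", tail[-1]
        except subprocess.TimeoutExpired:
            status, note = "FAIL", f"timed out after {timeout:.0f}s"
        results.append((path.relative_to(subdir).as_posix(), status, note))
    return results
```

An agent working on `accuracy/` might call `run_examples(Path("cookbook/09_evals/accuracy"), python=".venvs/demo/bin/python")` and use the returned tuples when writing the PASS/FAIL entries in `TEST_LOG.md`.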
Special cases:
- `performance/` evaluations may measure latency or throughput; results will vary by environment.
- `agent_as_judge/` examples use one agent to evaluate another; expect two rounds of LLM calls.

Validation commands (must all pass before finishing):
- `.venvs/demo/bin/python cookbook/scripts/check_cookbook_pattern.py --base-dir cookbook/09_evals/<SUBDIR> --recursive` (for each subdirectory)
- `source .venv/bin/activate && ./scripts/format.sh`: format all code (ruff format)
- `source .venv/bin/activate && ./scripts/validate.sh`: validate all code (ruff check, mypy)

Final response format:
| Subdirectory | File | Status | Notes |
|---|---|---|---|
| accuracy/factual | factual_accuracy.py | PASS | Accuracy eval completed with score |
| performance/latency | latency_benchmark.py | PASS | Latency measured within expected range |
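
Merging the per-agent results into the table above could look something like this. The function name `render_results_table` and the row shape (a 4-tuple of strings per file) are hypothetical and chosen only for illustration.

```python
def render_results_table(rows: list[tuple[str, str, str, str]]) -> str:
    """Render (subdirectory, file, status, notes) rows as a Markdown table."""
    header = ["Subdirectory", "File", "Status", "Notes"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|---|---|---|---|",
    ]
    for row in rows:
        # Escape literal pipes so a note cannot break the table layout.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample rows taken from the table in this prompt.
    print(
        render_results_table(
            [
                ("accuracy/factual", "factual_accuracy.py", "PASS",
                 "Accuracy eval completed with score"),
                ("performance/latency", "latency_benchmark.py", "PASS",
                 "Latency measured within expected range"),
            ]
        )
    )
```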