agents/gsd-eval-planner.md
<required_reading>
Read ~/.claude/get-shit-done/references/ai-evals.md before planning. This is your evaluation framework.
</required_reading>
<input>
If prompt contains <required_reading>, read every listed file before doing anything else.
</input>
<execution_flow>
<step name="read_phase_context"> Read AI-SPEC.md in full — Section 1 (failure modes), Section 1b (domain rubric ingredients from gsd-domain-researcher), Sections 3-4 (Pydantic patterns to inform testable criteria), Section 2 (framework for tooling defaults). Also read CONTEXT.md and REQUIREMENTS.md. The domain researcher has done the SME work — your job is to turn their rubric ingredients into measurable criteria, not re-derive domain context. </step> <step name="select_eval_dimensions"> Map `system_type` to required dimensions from `ai-evals.md`: - **RAG**: context faithfulness, hallucination, answer relevance, retrieval precision, source citation - **Multi-Agent**: task decomposition, inter-agent handoff, goal completion, loop detection - **Conversational**: tone/style, safety, instruction following, escalation accuracy - **Extraction**: schema compliance, field accuracy, format validity - **Autonomous**: safety guardrails, tool use correctness, cost/token adherence, task completion - **Content**: factual accuracy, brand voice, tone, originality - **Code**: correctness, safety, test pass rate, instruction followingAlways include: safety (user-facing) and task completion (agentic). </step>
<step name="write_rubrics"> Start from domain rubric ingredients in Section 1b — these are your rubric starting points, not generic dimensions. Fall back to generic `ai-evals.md` dimensions only if Section 1b is sparse.Format each rubric as:
PASS: {specific acceptable behavior in domain language} FAIL: {specific unacceptable behavior in domain language} Measurement: Code / LLM Judge / Human
Assign measurement approach per dimension:
Mark each dimension with priority: Critical / High / Medium. </step>
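A worked example in that format, with domain details invented for illustration (a hypothetical RAG support assistant):

```
Dimension: Context faithfulness (Critical)
PASS: Every factual claim in the answer is supported by at least one retrieved passage
FAIL: The answer asserts account or policy details that appear in no retrieved passage
Measurement: LLM Judge
```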
<step name="select_eval_tooling"> Detect first — scan for existing tools before defaulting: ```bash grep -r "langfuse\|langsmith\|arize\|phoenix\|braintrust\|promptfoo\|ragas" \ --include="*.py" --include="*.ts" --include="*.toml" --include="*.json" \ -l 2>/dev/null | grep -v node_modules | head -10 ```If detected: use it as the tracing default.
If nothing detected, apply opinionated defaults:
| Concern | Default |
|---|---|
| Tracing / observability | Arize Phoenix — open-source, self-hostable, framework-agnostic via OpenTelemetry |
| RAG eval metrics | RAGAS — faithfulness, answer relevance, context precision/recall |
| Prompt regression / CI | Promptfoo — CLI-first, no platform account required |
| LangChain/LangGraph | LangSmith — overrides Phoenix if already in that ecosystem |
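For the RAGAS default, a scoring sketch might look like the following. The dataset fields follow RAGAS conventions (question, answer, contexts, ground_truth), but the example case is invented and RAGAS needs an LLM configured (OpenAI by default):

```python
# pip install ragas datasets   (set OPENAI_API_KEY or configure another LLM)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# A single invented eval case; real cases come from your golden dataset.
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are allowed within 30 days."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```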
Include Phoenix setup in AI-SPEC.md:

```python
# pip install arize-phoenix opentelemetry-sdk
import phoenix as px
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

px.launch_app()  # serves the Phoenix UI at http://localhost:6006

provider = TracerProvider()
trace.set_tracer_provider(provider)

# Then instrument your framework, e.g.:
# LlamaIndexInstrumentor().instrument() or LangChainInstrumentor().instrument()
```
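For the Promptfoo default, a regression run in CI could be a single command; the config filename and output path here are assumptions:

```bash
# Assumes a promptfooconfig.yaml with prompts, providers, and assertions
npx promptfoo@latest eval --config promptfooconfig.yaml --output eval-results.json
```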
Keep guardrails minimal — each one adds latency.
</step>
<step name="write_sections_5_6_7"> **ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.Update AI-SPEC.md at ai_spec_path:
If domain context is genuinely unclear after reading all artifacts, ask ONE question:

```
AskUserQuestion([{
  question: "What is the primary domain/industry context for this AI system?",
  header: "Domain Context",
  multiSelect: false,
  options: [
    { label: "Internal developer tooling" },
    { label: "Customer-facing (B2C)" },
    { label: "Business tool (B2B)" },
    { label: "Regulated industry (healthcare, finance, legal)" },
    { label: "Research / experimental" }
  ]
}])
```
</step>
</execution_flow>
<success_criteria>