# AI-SPEC — Phase {N}: {phase_name}

AI design contract generated by /gsd-ai-integration-phase. Consumed by gsd-planner and gsd-eval-auditor. Locks framework selection, implementation guidance, and evaluation strategy before planning begins.


## 1. System Classification

System Type: <!-- RAG | Multi-Agent | Conversational | Extraction | Autonomous Agent | Content Generation | Code Automation | Hybrid -->

Description:

<!-- One-paragraph description of what this AI system does, who uses it, and what "good" looks like -->

Critical Failure Modes:

<!-- The 3-5 behaviors that absolutely cannot go wrong in this system -->

## 1b. Domain Context

Researched by gsd-domain-researcher. Grounds the evaluation strategy in domain expert knowledge.

Industry Vertical: <!-- healthcare | legal | finance | customer service | education | developer tooling | e-commerce | etc. -->

User Population: <!-- who uses this system and in what context -->

Stakes Level: <!-- Low | Medium | High | Critical -->

Output Consequence: <!-- what happens downstream when the AI output is acted on -->

### What Domain Experts Evaluate Against

<!-- Domain-specific rubric ingredients — in practitioner language, not AI jargon -->
<!-- Format: Dimension / Good (expert accepts) / Bad (expert flags) / Stakes / Source -->

### Known Failure Modes in This Domain

<!-- Domain-specific failure modes from research — not generic hallucination, but how it manifests here -->

### Regulatory / Compliance Context

<!-- Relevant regulations or constraints — or "None identified" if genuinely none apply -->

### Domain Expert Roles for Evaluation

| Role | Responsibility |
| --- | --- |
| <!-- e.g., Senior practitioner --> | <!-- Dataset labeling / rubric calibration / production sampling --> |

## 2. Framework Decision

Selected Framework: <!-- e.g., LlamaIndex v0.10.x -->

Version: <!-- Pin the version -->

Rationale:

<!-- Why this framework fits this system type, team context, and production requirements -->

Alternatives Considered:

| Framework | Ruled Out Because |
| --- | --- |

Vendor Lock-In Accepted: <!-- Yes / No / Partial — document the trade-off consciously -->


## 3. Framework Quick Reference

Fetched from official docs by gsd-ai-researcher. Distilled for this specific use case.

### Installation

```bash
# Install command(s)
```

### Core Imports

```python
# Key imports for this use case
```

### Entry Point Pattern

```python
# Minimal working example for this system type
```
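
For shape only: a minimal sketch assuming the LlamaIndex example from Section 2 and a RAG system type. Substitute the selected framework's real entry point when filling this in.

```python
# Illustrative only: assumes LlamaIndex v0.10.x (the Section 2 example)
# with its default OpenAI-backed setup. Replace with the selected framework.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()  # load source docs
index = VectorStoreIndex.from_documents(documents)      # embed and index them
query_engine = index.as_query_engine()                  # default RAG pipeline
print(query_engine.query("What does the corpus say about X?"))
```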

### Key Abstractions

<!-- Framework-specific concepts the developer must understand before coding -->

| Concept | What It Is | When You Use It |
| --- | --- | --- |

### Common Pitfalls

<!-- Gotchas specific to this framework and system type — from docs, issues, and community reports -->

### Project Structure

```
project/
├── # Framework-specific folder layout
```

## 4. Implementation Guidance

Model Configuration:

<!-- Which model(s), temperature, max tokens, and other key parameters -->
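
For example, a filled-in entry might pin the configuration in one place; the model name and values below are placeholders, not recommendations.

```python
# Illustrative configuration pin. Every value here is a placeholder;
# record the actual decisions for this phase.
MODEL_CONFIG = {
    "model": "gpt-4o-mini",  # primary model (placeholder)
    "temperature": 0.0,      # deterministic for extraction-style tasks
    "max_tokens": 1024,      # hard cap on output length per call
    "timeout_s": 30,         # fail fast instead of hanging the pipeline
}
```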

Core Pattern:

<!-- The primary implementation pattern for this system type in this framework -->

Tool Use:

<!-- Tools/integrations needed and how to configure them -->

State Management:

<!-- How state is persisted, retrieved, and updated -->

Context Window Strategy:

<!-- How to manage context limits for this system type -->

## 4b. AI Systems Best Practices

Written by gsd-ai-researcher. Cross-cutting patterns every developer building AI systems needs — independent of framework choice.

### Structured Outputs with Pydantic

<!-- Framework-specific Pydantic integration pattern for this use case -->
<!-- Include: output model definition, how the framework uses it, retry logic on validation failure -->

```python
# Pydantic output model for this system type
```
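
As a shape reference, a minimal sketch assuming Pydantic v2; `call_llm` is a hypothetical stand-in for the selected framework's completion call.

```python
# Sketch: Pydantic v2 output model plus retry-on-validation-failure.
from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):
    answer: str = Field(description="Grounded answer in plain language")
    citations: list[str] = Field(default_factory=list)
    confidence: float = Field(ge=0.0, le=1.0)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in; wire to the selected framework's client."""
    raise NotImplementedError

def generate_answer(prompt: str, max_retries: int = 2) -> Answer:
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)  # expected to return a JSON string
        try:
            return Answer.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation error back so the model can self-correct.
            prompt = f"{prompt}\n\nPrevious output failed validation:\n{err}\nReturn valid JSON only."
    raise RuntimeError("Output failed schema validation after retries")
```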

### Async-First Design

<!-- How async is handled in this framework, the one common mistake, and when to stream vs. await -->
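
Whatever the framework, the shape is usually the same: fan independent calls out with `asyncio.gather`, and never call a synchronous client inside `async def` (the one common mistake, since it blocks the event loop). `ask_llm` below is a hypothetical async helper.

```python
# Sketch of async fan-out over independent LLM calls.
import asyncio

async def ask_llm(prompt: str) -> str:
    """Hypothetical async stand-in for the framework's client call."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    return f"response to: {prompt}"

async def answer_all(prompts: list[str]) -> list[str]:
    # Concurrent execution: total latency is roughly the slowest call.
    return await asyncio.gather(*(ask_llm(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(answer_all(["q1", "q2", "q3"])))
```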

### Prompt Engineering Discipline

<!-- System vs. user prompt separation, few-shot guidance, token budget strategy -->
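
For reference, the separation this calls for, in the message format most chat APIs share (the content is illustrative):

```python
# System prompt holds stable instructions; user prompt holds the live task.
# Few-shot examples sit in the history and cost tokens on every call.
messages = [
    {"role": "system", "content": "You are a claims triage assistant. Answer only from the provided context."},
    {"role": "user", "content": "Context: ...\nQuestion: ..."},  # few-shot input
    {"role": "assistant", "content": "..."},                     # few-shot answer
    {"role": "user", "content": "Context: {retrieved_chunks}\nQuestion: {user_question}"},
]
```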

### Context Window Management

<!-- Strategy specific to this system type: RAG chunking / conversation summarisation / agent compaction -->
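
One generic pattern, sketched with a crude 4-characters-per-token estimate; swap in the selected model's real tokenizer before relying on it.

```python
# Keep the newest turns that fit a token budget; summarise or drop the rest.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def trim_history(turns: list[str], budget: int) -> list[str]:
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))   # restore chronological order
```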

### Cost and Latency Budget

<!-- Per-call cost estimate, caching strategy, sub-task model routing -->
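
The per-call estimate is plain arithmetic once prices are pinned; the rates below are placeholders, not current pricing.

```python
# Per-call cost = input_tokens * input_rate + output_tokens * output_rate.
INPUT_USD_PER_M = 0.15   # USD per 1M input tokens (placeholder rate)
OUTPUT_USD_PER_M = 0.60  # USD per 1M output tokens (placeholder rate)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Example: a 3,000-token prompt with a 500-token answer costs
# (3000 * 0.15 + 500 * 0.60) / 1e6 = $0.00075 per call at these rates.
```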

## 5. Evaluation Strategy

### Dimensions

| Dimension | Rubric (Pass/Fail or 1-5) | Measurement Approach | Priority |
| --- | --- | --- | --- |
|  |  | Code / LLM Judge / Human | Critical / High / Medium |

### Eval Tooling

Primary Tool: <!-- e.g., RAGAS + Langfuse -->

Setup:

```bash
# Install and configure
```

CI/CD Integration:

```bash
# Command to run evals in CI/CD pipeline
```

### Reference Dataset

Size: <!-- e.g., 20 examples to start -->

Composition:

<!-- What scenario types the dataset covers: critical paths, edge cases, failure modes -->

Labeling:

<!-- Who labels examples and how (domain expert, LLM judge with calibration, etc.) -->
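
Whatever the eval tool, a reference example usually reduces to input, expected behaviour, and labels. One illustrative entry follows; the field names are assumptions, so match the selected tool's schema.

```python
# One illustrative reference-dataset entry.
example = {
    "id": "edge-case-007",
    "scenario": "edge_case",  # e.g., critical_path | edge_case | failure_mode
    "input": "Can I take ibuprofen with warfarin?",
    "expected": "Flags the interaction and advises consulting a clinician.",
    "labels": {"must_escalate": True},
    "labeled_by": "senior_pharmacist",  # domain expert role from Section 1b
}
```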

## 6. Guardrails

### Online (Real-Time)

| Guardrail | Trigger | Intervention |
| --- | --- | --- |
|  |  | Block / Escalate / Flag |
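
Online checks sit in the request path, so they must be cheap and deterministic. A minimal sketch of one Block-type guardrail (the patterns are illustrative):

```python
# In-path guardrail: block responses that leak obvious PII shapes.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN shape
    re.compile(r"\b\d{16}\b"),             # bare 16-digit card number
]

def guard_output(text: str) -> str:
    if any(p.search(text) for p in PII_PATTERNS):
        return "I can't share that information."  # intervention: Block
    return text
```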

### Offline (Flywheel)

| Metric | Sampling Strategy | Action on Degradation |
| --- | --- | --- |

## 7. Production Monitoring

Tracing Tool: <!-- e.g., Langfuse self-hosted -->

Key Metrics to Track:

<!-- 3-5 metrics that will be monitored in production -->

Alert Thresholds:

<!-- When to page/alert -->

Smart Sampling Strategy:

<!-- How to select interactions for human review — signal-based filters -->
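
Signal-based filters beat random sampling because reviewers see the riskiest traces first. A sketch of the idea, where the thresholds and trace fields are assumptions:

```python
# Route a trace to human review when it carries a risk signal.
def should_review(trace: dict) -> bool:
    return (
        trace.get("user_feedback") == "negative"    # explicit thumbs-down
        or trace.get("guardrail_triggered", False)  # any online intervention
        or trace.get("latency_ms", 0) > 10_000      # pathological latency
        or trace.get("judge_score", 1.0) < 0.5      # low LLM-judge score
    )
```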

## Checklist

- [ ] System type classified
- [ ] Critical failure modes identified (≥ 3)
- [ ] Domain context researched (Section 1b: vertical, stakes, expert criteria, failure modes)
- [ ] Regulatory/compliance context identified or explicitly noted as none
- [ ] Domain expert roles defined for evaluation involvement
- [ ] Framework selected with rationale documented
- [ ] Alternatives considered and ruled out
- [ ] Framework quick reference written (install, imports, pattern, pitfalls)
- [ ] AI systems best practices written (Section 4b: Pydantic, async, prompt discipline, context)
- [ ] Evaluation dimensions grounded in domain rubric ingredients
- [ ] Each eval dimension has a concrete rubric (Good/Bad in domain language)
- [ ] Eval tooling selected — Arize Phoenix default confirmed or override noted
- [ ] Reference dataset spec written (size ≥ 10, composition + labeling defined)
- [ ] CI/CD eval integration specified
- [ ] Online guardrails defined
- [ ] Production monitoring configured (tracing tool + sampling strategy)