Back to Gastown

gt-model-eval

gt-model-eval/README.md

1.0.15.6 KB
Original Source

gt-model-eval

Promptfoo-based model comparison framework for Gas Town patrol agents. Compare Claude Opus, Sonnet, and Haiku on patrol decision tasks to find where cheaper models match Opus quality.

Why

Gas Town multi-agent setups burn through Opus budget on patrol agents (deacon, witness, dogs) that follow prescriptive formulas. These agents parse shell output and make rule-based decisions — they may not need Opus-level reasoning. This framework provides evidence for safely downgrading roles to Sonnet or Haiku.

See Discussion #1542 and Issue #1545.

Quick Start

bash
# Prerequisites: Node.js 18+, Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run all tests across Opus, Sonnet, Haiku
npx promptfoo eval

# Run 3x per test for consistency measurement
npx promptfoo eval --repeat 3

# View results in browser
npx promptfoo view

# Export results
npx promptfoo eval --output results.json --output results.html

Test Suites

Class B — Directive Tests (instruction-following)

Each test provides a role_context that describes the expected behavior. These validate that all models can follow explicit patrol instructions.

FileRoleTestsWhat It Measures
deacon-zombie.yamlDeacon10Zombie session detection vs healthy/idle/crashed
deacon-plugin-gate.yamlDeacon10Plugin cooldown/cron gate evaluation (mechanical)
deacon-dog-health.yamlDeacon10Dog timeout matrix and pool spawn/retire decisions
witness-stuck.yamlWitness12Stuck polecat assessment: no-op → nudge → escalate
witness-cleanup.yamlWitness10Dead session cleanup: nuke vs recover vs escalate
refinery-triage.yamlRefinery10Test failure diagnosis: branch-caused vs pre-existing
refinery-conflict.yamlRefinery10Merge conflict handling and push verification
dog-orphan.yamlDog10Orphan triage: RESET/REASSIGN/RECOVER/ESCALATE/BURN

Class A — Reasoning Tests (evidence-based)

Each test provides a neutral role_context with NO answer hints. These measure whether the model can derive the correct action from shell output evidence alone — the key signal for downgrade decisions.

FileRoleTestsWhat It Measures
class-a-deacon.yamlDeacon3Zombie vs idle vs healthy from raw evidence
class-a-witness.yamlWitness3Active vs dirty-dead vs clean-dead from raw evidence
class-a-refinery.yamlRefinery3Branch-caused vs pre-existing vs push failure from raw evidence
class-a-dog.yamlDog3Reset vs recover vs escalate from raw evidence

Total: 94 test cases (82 Class B + 12 Class A) across 4 patrol roles.

Each test provides simulated shell output and expects a structured JSON decision. Class B tests are split into "clear" cases (all models should agree) and "edge" cases (where model quality matters). Class A results are what directly informs the downgrade decision.

How It Works

Each test case simulates a patrol agent decision point:

  1. System prompt defines the role and formula step
  2. Shell output provides the evidence (tmux status, git state, bead data)
  3. Model responds with a JSON decision: {"action": "...", "reason": "..."}
  4. Assertions check correctness, cost, latency, and safety

Promptfoo runs the same test across all three models and compares results.

Sharing Results

bash
# Generate markdown report
./scripts/results-to-discussion.sh results.json

# Post directly to GitHub Discussions
./scripts/results-to-discussion.sh results.json --post

# Post to a different repo
./scripts/results-to-discussion.sh results.json --post --repo your-org/gastown

Contributing Test Cases

Test files are YAML in tests/. Each test case needs:

yaml
- description: "What this test checks"
  vars:
    role: deacon|witness|refinery|dog
    role_context: >
      Context about the role and what it should know.
    formula_step: which-step
    allowed_actions: '["action1", "action2", ...]'
    shell_output: |
      $ command
      output
    context: "Additional context for the decision."
  assert:
    - type: javascript
      value: |
        const d = JSON.parse(output);
        return d.action === "expected-action";

Good test cases:

  • Have a clear expected outcome
  • Include realistic shell output
  • Test one decision at a time
  • Cover both "clear" and "edge" cases

Project Structure

gt-model-eval/
├── promptfooconfig.yaml           # Main config (3 providers, assertions)
├── package.json                   # Pin promptfoo version
├── prompts/
│   └── patrol-decision.txt        # System prompt template
├── tests/
│   ├── deacon-zombie.yaml         # 10 Class B tests
│   ├── deacon-plugin-gate.yaml    # 10 Class B tests
│   ├── deacon-dog-health.yaml     # 10 Class B tests
│   ├── witness-stuck.yaml         # 12 Class B tests
│   ├── witness-cleanup.yaml       # 10 Class B tests
│   ├── refinery-triage.yaml       # 10 Class B tests
│   ├── refinery-conflict.yaml     # 10 Class B tests
│   ├── dog-orphan.yaml            # 10 Class B tests
│   ├── class-a-deacon.yaml        #  3 Class A tests
│   ├── class-a-witness.yaml       #  3 Class A tests
│   ├── class-a-refinery.yaml      #  3 Class A tests
│   └── class-a-dog.yaml           #  3 Class A tests
├── scripts/
│   └── results-to-discussion.sh   # Format + post results
└── README.md