plugins/ruflo-testgen/skills/tdd-repair/SKILL.md
Surfaces the Test-Driven Repair loop as a ruflo skill. Use when you have a failing test and want the source-under-test fixed automatically, with the test's pass/fail as the verification gate (no LLM-as-judge).
tdd-workflow skill), then run tdd-repair to drive the green.--no-test-oracle) is scoped for a follow-up ADR — needs MCTS over repro generation. For now, write a failing test first.claude -p runs with --allowedTools Read,Edit,Bash — no MCP, no network, no arbitrary file writes — but Bash can still touch the filesystem. Don't point this at code you wouldn't git checkout . after.Implementation: scripts/tdd-repair/tdd-repair.mjs.
test-already-passes). Repairing a green test is either a no-op or a --test-command typo.claude -p with a focused prompt:
--allowedTools Read,Edit,Bash restricts capability. --max-budget-usd caps cost per attempt. --permission-mode acceptEdits auto-accepts file edits within the allowed set.success: true + per-attempt usage. If red after --max-attempts: emit success: false + receipts. Either way, the workspace is left as claude -p modified it (caller can git diff to review).{
"success": true,
"data": {
"repaired": true,
"attemptsTaken": 1,
"mode": "test-driven",
"before": { "passed": false, "exitCode": 1 },
"after": { "passed": true, "exitCode": 0, "durationMs": 4321 },
"attempts": [
{ "attempt": 1, "claude": { "ok": true, "durationMs": 38421, "usage": { "cost_usd": 0.0234 } }, "verify": { "passed": true } }
],
"totalCostUsd": 0.0234,
"budgetUsd": 5.0,
"budgetExhausted": false,
"shape": { "repo": "...", "test": "...", "testCommand": "...", "maxAttempts": 1, "model": "haiku" }
}
}
| Code | Meaning |
|---|---|
| 0 | Test green after repair (success) |
| 1 | Test still red after --max-attempts |
| 2 | Config error (test file missing, test already passes, --no-test-oracle unsupported, etc.) |
| 3 | Claude CLI exited non-zero (infrastructure failure) |
| 99 | Reserved for safety tripwire (per ADR-153) |
| Layer | Mechanism |
|---|---|
| Cost cap | --max-budget-usd default $5, divided across --max-attempts. Hard ceiling — claude exits when reached. |
| Capability cap | --allowedTools Read,Edit,Bash — no MCP, no network, no arbitrary writes. |
| Scope cap | Prompt forbids modifying the test or adding dependencies. |
| Confirmation gate | --confirm REQUIRED — without it, returns dry-run plan (mirrors harness-evolve / harness-mint convention). |
| Hard timeout | 15 min total wall-clock; per-attempt budget of timeoutMs / maxAttempts. |
| Pre-flight | Refuses to run if the test already passes (catches --test-command typos). |
Modeled on the Test-Driven Repair mode from agent-harness-generator/packages/darwin-mode ADR-175. Key design difference: instead of wrapping metaharness-darwin evolve (population-based search), we drive a single claude -p invocation. Rationale:
claude -p is already in our stack — no new optional dep--session-id if iteration is neededConformant mode (no test, write own repro via MCTS) is deferred to a future ADR.
# Smoke / dry-run (no --confirm yet)
node plugins/ruflo-testgen/scripts/tdd-repair/tdd-repair.mjs \
--repo /path/to/myrepo \
--test tests/auth.test.ts \
--test-command "npx vitest run tests/auth.test.ts"
# Actually repair (Haiku tier, $5 budget, 1 attempt)
node plugins/ruflo-testgen/scripts/tdd-repair/tdd-repair.mjs \
--repo /path/to/myrepo \
--test tests/auth.test.ts \
--test-command "npx vitest run tests/auth.test.ts" \
--confirm
# Bigger model + more attempts for harder bugs
node plugins/ruflo-testgen/scripts/tdd-repair/tdd-repair.mjs \
--repo . --test tests/regression-2456.test.ts \
--test-command "npm test -- tests/regression-2456.test.ts" \
--model sonnet --max-attempts 3 --budget 15.00 \
--confirm
| Tier | Model | Per-attempt typical | Use when |
|---|---|---|---|
| 1 | Haiku | $0.02 – $0.20 | First try — most "make red green" bugs are tactical |
| 2 | Sonnet | $0.30 – $2.00 | Haiku failed, or the bug has multi-file scope |
| 3 | Opus | $1.50 – $8.00 | Sonnet failed — architectural reasoning required (rarely worth it for a single failing test) |