v3/@claude-flow/guidance/docs/adrs/ADR-G009-headless-testing-harness.md
Accepted
2026-02-01
The guidance control plane compiles rules, retrieves shards, enforces gates, logs events, and evolves rules. But none of these components answer the question: does the guidance actually work?
"Works" means:
Answering these questions requires running real tasks against real Claude Code with real guidance and measuring the results. Manual testing is slow, subjective, and unrepeatable. We need an automated, deterministic, repeatable evaluation primitive.
Claude Code provides a headless mode, `claude -p '<prompt>' --output-format json`, which runs a prompt non-interactively and emits structured JSON. This mode is the natural evaluation primitive for guidance testing.
Build a `HeadlessRunner` class (`src/headless.ts`) that uses Claude Code's headless mode as the evaluation primitive, with three layers: task definition, execution, and assertion checking.
A `TestTask` defines a single evaluation scenario:

```typescript
interface TestTask {
  id: string;                 // Unique task ID
  prompt: string;             // The prompt to send to Claude Code
  expectedIntent: TaskIntent; // Expected intent classification
  assertions: TaskAssertion[]; // Expected behavior assertions
  maxViolations: number;      // Maximum allowed violations
  timeoutMs: number;          // Execution timeout
  tags: string[];             // Tags for filtering (e.g., 'security', 'compliance')
}
```
Assertions are typed checks against the output:

```typescript
interface TaskAssertion {
  type: 'output-contains' | 'output-not-contains' | 'files-touched' |
        'no-forbidden-commands' | 'tests-pass' | 'custom';
  expected: string;    // Expected value or regex pattern
  description: string; // Human-readable description
}
```
`HeadlessRunner.runTask()` executes a single task:

1. Builds the command `claude -p '<escaped_prompt>' --output-format json 2>/dev/null`.
2. Executes it through the `ICommandExecutor` interface (default: `ProcessExecutor`, built on `child_process.execFile`). Injection enables testing without actual Claude Code.
3. Parses stdout into a `HeadlessOutput` with `result`, `toolsUsed`, `filesModified`, `hasErrors`, and metadata. Non-JSON output is treated as a plain-text `result`.
4. Each `TaskAssertion` is evaluated against the parsed output.
5. Failed assertions become `Violation` objects with `ruleId: ASSERT-{taskId}`.
6. If a `RunLedger` is attached, a `RunEvent` is created and finalized.

`HeadlessRunner.runSuite()` runs multiple tasks sequentially, optionally filtered by tags, and produces a `SuiteRunSummary`:
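The assertion step can be sketched as a predicate over the parsed output. This is an illustrative `checkAssertion` helper, not the actual implementation in `headless.ts`; the `HeadlessOutput` shape is taken from the fields named above, and assertion types that need executor or test-runner context are left out:

```typescript
interface HeadlessOutput {
  result: string;
  toolsUsed: string[];
  filesModified: string[];
  hasErrors: boolean;
}

// Hypothetical helper: each assertion type maps to a simple check against
// the parsed output, with `expected` interpreted as a regex pattern.
function checkAssertion(
  a: { type: string; expected: string },
  out: HeadlessOutput,
): boolean {
  switch (a.type) {
    case 'output-contains':
      return new RegExp(a.expected).test(out.result);
    case 'output-not-contains':
      return !new RegExp(a.expected).test(out.result);
    case 'files-touched':
      return out.filesModified.some((f) => new RegExp(a.expected).test(f));
    default:
      // 'tests-pass', 'no-forbidden-commands', 'custom' need more context
      return false;
  }
}

const out: HeadlessOutput = {
  result: 'Refactored parser; all tests green.',
  toolsUsed: ['Edit'],
  filesModified: ['src/parser.ts'],
  hasErrors: false,
};
console.log(checkAssertion({ type: 'output-contains', expected: 'tests green' }, out));
```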
```typescript
interface SuiteRunSummary {
  totalTasks: number;
  tasksPassed: number;
  tasksFailed: number;
  totalViolations: number;
  totalAssertions: number;
  assertionsPassed: number;
  passRate: number; // tasksPassed / totalTasks
  durationMs: number;
  results: TaskRunResult[];
}
```
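The summary is a straightforward fold over per-task results. This sketch assumes a `TaskRunResult` shape with the fields needed for the aggregates; the real field names in `headless.ts` may differ:

```typescript
// Assumed per-task result shape; the shipped TaskRunResult may differ.
interface TaskRunResult {
  taskId: string;
  passed: boolean;
  violations: number;
  assertionsTotal: number;
  assertionsPassed: number;
  durationMs: number;
}

// Fold per-task results into the suite-level aggregates.
function summarize(results: TaskRunResult[]) {
  const tasksPassed = results.filter((r) => r.passed).length;
  return {
    totalTasks: results.length,
    tasksPassed,
    tasksFailed: results.length - tasksPassed,
    totalViolations: results.reduce((n, r) => n + r.violations, 0),
    totalAssertions: results.reduce((n, r) => n + r.assertionsTotal, 0),
    assertionsPassed: results.reduce((n, r) => n + r.assertionsPassed, 0),
    passRate: results.length ? tasksPassed / results.length : 0,
    durationMs: results.reduce((n, r) => n + r.durationMs, 0),
    results,
  };
}

const summary = summarize([
  { taskId: 'a', passed: true, violations: 0, assertionsTotal: 2, assertionsPassed: 2, durationMs: 1200 },
  { taskId: 'b', passed: false, violations: 1, assertionsTotal: 3, assertionsPassed: 2, durationMs: 900 },
]);
console.log(summary.passRate); // 0.5
```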
`createComplianceSuite()` provides a starter suite with three scenarios, one of which forbids use of the `--force` flag.

The `ICommandExecutor` interface decouples the runner from the actual Claude Code process:
```typescript
interface ICommandExecutor {
  execute(command: string, timeoutMs: number): Promise<{
    stdout: string;
    stderr: string;
    exitCode: number;
  }>;
}
```
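A mock executor of the kind the unit tests could use is a few lines against this interface. The canned JSON payload shape here is an assumption about Claude Code's output, not its documented schema:

```typescript
interface ICommandExecutor {
  execute(command: string, timeoutMs: number): Promise<{
    stdout: string;
    stderr: string;
    exitCode: number;
  }>;
}

// Hypothetical mock: returns a predetermined stdout regardless of the
// command, so the harness can be exercised without Claude Code installed.
class MockExecutor implements ICommandExecutor {
  constructor(private canned: string) {}

  async execute(_command: string, _timeoutMs: number) {
    return { stdout: this.canned, stderr: '', exitCode: 0 };
  }
}

async function demo() {
  const exec: ICommandExecutor = new MockExecutor(
    JSON.stringify({ result: 'done' }), // assumed payload shape
  );
  const { stdout, exitCode } = await exec.execute(
    "claude -p 'hi' --output-format json",
    5_000,
  );
  console.log(exitCode, JSON.parse(stdout).result);
}
demo();
```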
In unit tests, a mock executor returns predetermined outputs. In CI, the ProcessExecutor runs real Claude Code. In the optimizer's A/B tests (ADR-G008), the executor can be configured to run against different guidance versions.
The headless harness is the mechanism by which the optimizer validates proposed rule changes:
This requires running the suite twice per proposed change, which is why the optimizer defaults to evaluating only the top 3 violations per cycle.
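The accept/reject decision on the optimizer side can be sketched as a comparison of the two suite summaries. The function name and acceptance criteria below are hypothetical; the ADR only specifies that the suite runs twice per proposed change:

```typescript
// Minimal slice of the suite summary needed for the comparison.
interface SuiteRunSummary {
  passRate: number;
  totalViolations: number;
}

// Hypothetical acceptance rule: a proposed rule change must not regress
// the pass rate and must not introduce additional violations.
function changeIsAcceptable(
  baseline: SuiteRunSummary,
  candidate: SuiteRunSummary,
): boolean {
  return (
    candidate.passRate >= baseline.passRate &&
    candidate.totalViolations <= baseline.totalViolations
  );
}

console.log(
  changeIsAcceptable(
    { passRate: 0.8, totalViolations: 2 },
    { passRate: 0.9, totalViolations: 1 },
  ),
); // true
```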
- `ICommandExecutor` injection allows testing the harness itself (assertion logic, violation detection, ledger integration) without requiring a Claude Code installation.
- The `parseOutput()` method handles multiple field-name variants (`result`, `text`, `content`) and falls back to plain text, but future format changes could break parsing.
- Tasks run sequentially (a `for...of` loop). Parallel execution would be faster but risks resource contention and makes violation attribution harder.

Pre-define model responses and test the guidance pipeline (compile, retrieve, gate) against them. Rejected because mocked responses do not test whether the model actually follows the guidance; they test the pipeline but not the effectiveness.
Have a human review model output against a checklist. Rejected because it is slow, subjective, non-repeatable, and does not scale to weekly optimization cycles.
Send model output to another model and ask "did this follow the rules?" Rejected because it adds cost, introduces non-determinism (judge model may disagree across runs), and creates a circular dependency (using a model to evaluate a model's rule compliance).
Parse the model's tool calls and file edits without running them, checking for rule compliance. Rejected because static analysis cannot determine whether tests actually pass, whether the output is functionally correct, or whether the model's reasoning was sound. Headless execution captures the full end-to-end outcome.
Run all tasks concurrently for speed. Considered but deferred. Parallel execution risks Claude Code instances competing for file locks, port bindings, and git state. Sequential execution is simpler and more reliable for the initial implementation. Parallel execution can be added later with proper isolation (separate working directories per task).
- `v3/@claude-flow/guidance/src/headless.ts` -- `HeadlessRunner`, `TestTask`, `TaskAssertion`, `SuiteRunSummary`, `ProcessExecutor`, `createComplianceSuite()`
- `v3/@claude-flow/guidance/src/types.ts` -- `RunEvent`, `Violation`, `EvaluatorResult`
- `v3/@claude-flow/guidance/src/ledger.ts` -- `RunLedger.createEvent()`, `finalizeEvent()`, `evaluate()`
- `v3/@claude-flow/guidance/src/index.ts` -- `GuidanceControlPlane.getHeadlessRunner()`