v3/docs/adr/ADR-171-provenance-tiered-evaluation-oracle.md
ID: ADR-171
Status: Proposed — implemented on feat/agenticow-integration (ships in 3.21.0)
Date: 2026-07-04
Authors: rUv (drafted with Claude Code)
Related ADRs:
The weight-eft distillation slice (ADR-173) needs a gold resolved: boolean per trajectory to build SFT data. ruflo has no SWE-bench oracle; historically, resolved was derived from output-verifier structural confidence, which is only a proxy. Tuning on proxy-labeled data distills plausible-but-wrong completions, and a single blended "resolved" number is exactly the benchmark theater ADR-169 forbids.
Two facts changed the calculus:
ruvultra, tailscale) can execute actual task evaluations (FAIL_TO_PASS), giving true ground truth for trajectories that carry a test spec.
Label resolved through a tiered trust hierarchy, and tag every label with its provenance. Never blend tiers into one opaque score.
| Tier | Tag | Method | Trust level |
|------|-----|--------|-------------|
| 1 | oracle:test-exec | real evaluation (FAIL_TO_PASS via darwin bench/eval, executed on a remote GPU host over SSH) | GROUND TRUTH |
| 2 | judge:fable | headless Fable LLM-as-judge (ADR-172) | smarter proxy |
| 3 | proxy:structural | output-verifier structural confidence | WEAKEST, triage only |
Interface:
labelResolved(trajectories, opts): Promise<Array<{
  ...trajectory,
  resolved: boolean,
  resolvedBy: 'oracle:test-exec' | 'judge:fable' | 'proxy:structural',
  resolvedConfidence?: number,
  resolvedReason?: string,
}>>
Tiers are tried in order per trajectory; the first that can decide wins, and its tag is recorded. The default (no opts) runs the Tier-3 proxy plus a Tier-1 dry-run preflight: zero spend, no SSH exec, no Fable call. Tier 1 requires --execute; Tier 2 requires --fable-judge plus a budget cap.
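The "first tier that can decide wins" rule can be sketched as below. This is a minimal illustration, not the shipped implementation: `Trajectory`, `Verdict`, `Tier`, and `labelWithTiers` are hypothetical names, the `enabled` flag stands in for whatever `--execute` and `--fable-judge` toggle, and the sketch assumes a tier signals "cannot decide" by returning `null`.

```typescript
type Provenance = 'oracle:test-exec' | 'judge:fable' | 'proxy:structural';

// Hypothetical trajectory shape; only `testSpec` matters for tier selection here.
interface Trajectory {
  id: string;
  testSpec?: string;
  output: string;
}

interface Verdict {
  resolved: boolean;
  confidence?: number;
  reason?: string;
}

// A tier returns a verdict, or null when it cannot decide for this trajectory
// (assumption: e.g. the oracle has no test spec to run).
interface Tier {
  tag: Provenance;
  enabled: boolean;
  decide(t: Trajectory): Promise<Verdict | null>;
}

type Labeled = Trajectory & {
  resolved: boolean;
  resolvedBy: Provenance;
  resolvedConfidence?: number;
  resolvedReason?: string;
};

// Try enabled tiers in the given order; the first verdict wins and its tag is recorded.
async function labelOne(t: Trajectory, tiers: Tier[]): Promise<Labeled> {
  for (const tier of tiers) {
    if (!tier.enabled) continue;
    const verdict = await tier.decide(t);
    if (verdict === null) continue;
    return {
      ...t,
      resolved: verdict.resolved,
      resolvedBy: tier.tag,
      resolvedConfidence: verdict.confidence,
      resolvedReason: verdict.reason,
    };
  }
  // Surface unlabelable trajectories instead of silently dropping them.
  throw new Error(`no enabled tier could label trajectory ${t.id}`);
}

async function labelWithTiers(trajectories: Trajectory[], tiers: Tier[]): Promise<Labeled[]> {
  const labeled: Labeled[] = [];
  for (const t of trajectories) {
    labeled.push(await labelOne(t, tiers));
  }
  return labeled;
}
```

Passing the tiers as an ordered list keeps the trust hierarchy explicit at the call site, and because each label carries exactly one tag, tiers are never blended into a single score.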
A speculative branch (ADR-170 §2.3) is promote-ineligible unless its winning trajectory is cleared by oracle:test-exec, or by judge:fable explicitly accepted by the caller. proxy:structural can never clear a promote — it is triage-only. This is what keeps the flywheel from graduating plausible-but-wrong work into shared memory.
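The promote rule above reduces to a small gate on the provenance tag. The sketch below is illustrative only: `isPromoteEligible` and the `acceptFableJudge` option are hypothetical names, and it assumes "cleared" means the label is `resolved: true`.

```typescript
type ResolvedBy = 'oracle:test-exec' | 'judge:fable' | 'proxy:structural';

interface WinningLabel {
  resolved: boolean;
  resolvedBy: ResolvedBy;
}

// Hypothetical caller option: an explicit opt-in to trust judge:fable for promotion.
interface PromoteOptions {
  acceptFableJudge?: boolean;
}

function isPromoteEligible(winner: WinningLabel, opts: PromoteOptions = {}): boolean {
  // Assumption: an unresolved winner is never cleared, whatever its provenance.
  if (!winner.resolved) return false;
  switch (winner.resolvedBy) {
    case 'oracle:test-exec':
      return true; // ground truth clears a promote
    case 'judge:fable':
      return opts.acceptFableJudge === true; // only when the caller explicitly accepts it
    case 'proxy:structural':
      return false; // triage only, can never clear a promote
  }
}
```

Using an exhaustive `switch` over the tag means that adding a new provenance tier later forces a compile-time decision about whether it may clear a promote.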
On discard/rollback, emit a single receipt bundling {checkpoint, diff, failing command, oracle provenance, promotion decision}. A rollback that restores state but loses the reason for it is only half-useful; the receipt is the forensic trail.
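One possible shape for such a receipt is sketched below. The field names, the `promotionDecision` values, the JSON serialization, and `emitReceipt` itself are assumptions made for illustration; the ADR only fixes which five pieces of information are bundled.

```typescript
// The five fields the receipt bundles, as hypothetical property names.
const RECEIPT_FIELDS = [
  'checkpoint',
  'diff',
  'failingCommand',
  'oracleProvenance',
  'promotionDecision',
] as const;

interface RollbackReceipt {
  checkpoint: string; // identifier of the state that was restored
  diff: string; // the change that was discarded
  failingCommand: string; // the command whose failure triggered the rollback
  oracleProvenance: 'oracle:test-exec' | 'judge:fable' | 'proxy:structural';
  promotionDecision: 'discarded' | 'rolled-back'; // assumed value set
}

// Serialize the receipt as a single JSON record. Assumption: every field must be
// non-empty, so that a receipt can never lose the "why" of a rollback.
function emitReceipt(receipt: RollbackReceipt): string {
  const missing = RECEIPT_FIELDS.filter((key) => receipt[key].trim() === '');
  if (missing.length > 0) {
    throw new Error(`receipt is missing: ${missing.join(', ')}`);
  }
  return JSON.stringify(receipt);
}
```

Rejecting incomplete receipts at emit time is one way to make sure the forensic trail cannot be written without its cause.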
oracle:test-exec-only before trusting an adapter.
proxy:structural, usable for triage, never for gold claims. The set of un-ground-truthable tasks is reported, not hidden.