Back to Ruflo

ADR-078 — Hybrid Retrieval + Outcome Signal for Pretrain

v3/docs/adr/ADR-078-hybrid-retrieval-and-outcome-signal.md

3.10.308.3 KB
Original Source

ADR-078 — Hybrid Retrieval + Outcome Signal for Pretrain

Status: Accepted — Implemented in ruflo 3.10.18 Date: 2026-05-30 Tracking: continuation of the self-learning hardening cluster (#2245 → ADR-074 → ADR-075 → ADR-076 → ADR-077) Related: ADR-074 (wiring), ADR-075 (unified stats), ADR-076 (Structured Distillation), ADR-077 (pretrain)

Context

ADR-077 shipped pretrain-from-history and proved that pretrain writes to the right stores (100% match rate across 10 queries on a 95-pattern store). But "match rate" only measures did the query return anything? — not did it return the right thing? A retrieval that returns junk for every query also has a 100% match rate.

When we added a token-grounded relevance metric (regex-match on the result's name, per-query) the truth was uglier:

Metric (N=385, 10 queries)Cosine-only (pre-3.10.18)
Match rate100%
Top-1 relevant hit rate0% (0/10)
Top-3 relevant hit rate0% (0/10)
MRR@30.000

Zero. The bi-encoder (Xenova/all-MiniLM-L6-v2 via the bridge ONNX path) was returning plausible-looking but off-topic commits at the top of every result. On a small corpus, cosine over a generic-purpose bi-encoder gets distracted by token-soup release-bump commits that share IDF-cheap tokens with everything.

Separately, ADR-077's harvester recorded every commit as success. A revert or a fix-followup is the strongest "this approach was wrong" signal a repo has — we were throwing it away.

Decision

Three changes, one release:

1. Hybrid retrieval (BM25 + cosine + MMR)

src/memory/hybrid-retrieval.ts — pure functions, no deps:

  • tokenize(text) — lowercase, split on non-alphanumeric, drop stopwords + tokens <3 chars
  • buildCorpusStats(docs) — BM25 smoothed IDF + average doc length
  • bm25Score(qTokens, dTokens, stats) — Okapi BM25 (k1=1.5, b=0.75)
  • hybridScores(cosine, bm25, α) — min-max normalise both vectors then linearly combine (default α=0.6 — 60% cosine, 40% BM25)
  • mmrRerank(candidates, k, λ) — greedy Maximal Marginal Relevance (default λ=0.5)

Wired into neural_patterns search via a new mode parameter:

  • mode: 'hybrid' (default) — cosine + BM25 + MMR
  • mode: 'cosine' — pre-3.10.18 behaviour, kept for A/B

New search params: alpha, mmrLambda, limit. Response includes hybridScore, cosineScore, bm25Score, mmrScore so callers can inspect why a result ranked where it did.

2. Pattern persistence — content field

Pattern.content?: string added to the neural store schema (cap 4096 chars). BM25 needs the source text; the pre-3.10.18 pattern stored only name, type, embedding. Backwards compatible — patterns missing content fall back to name for tokenisation.

storeNeuralPatterns() and the store action both persist content automatically.

3. Outcome signal in pretrain harvester

scripts/pretrain-from-github.mjs now classifies each commit as one of:

  • success — landed cleanly, no later commit reverted or fixed it
  • reverted — a later commit's subject is Revert "<this subject>"
  • hotfixed — a later commit (within HOTFIX_WINDOW_COMMITS=20) shares ≥ 50% of touched files (min(|A|,|B|) denominator) AND has fix/hotfix/patch in its subject

Mapped to the trajectory pipeline's binary verdict:

  • success'success'
  • partial (open issues) → 'partial'
  • reverted'partial' (strong "this was wrong" but pipeline only has 2 levels)
  • hotfixed'partial'

The original outcome verdict is preserved in metadata.outcomeVerdict and surfaced in summary.feed.verdictMix so the signal isn't lost when callers compare runs.

Measured proof (this checkout, N=385 patterns, 10 queries)

MetricCosineHybridΔDirection
Match rate100%100%0tie
Top-1 hit rate (relevance)0%50%+50pp✅ hybrid wins
Top-3 hit rate (relevance)0%70%+70pp✅ hybrid wins
MRR@30.0000.583+0.583✅ hybrid wins
Top-1 diversity100%80%-20ppacceptable — diversity ≠ relevance
Avg query latency28.7 ms40.6 ms+11.9 mshybrid 40% slower, still <50ms

Run JSONs: docs/benchmarks/runs/pretrained-retrieval-latest.json (latest hybrid run).

What hybrid fixes — concrete top-1 swaps on this corpus:

  • "structured distillation 4-field schema" → hybrid: feat(intelligence): structured distillation (ADR-076) (cosine returned an unrelated release bump)
  • "unified learning stats aggregator" → hybrid: feat(memory): unified learning-stats aggregator (ADR-075) (exact)
  • "deterministic codemod engine var-to-const" → hybrid: feat/deterministic-tier1-codemods: deterministic Tier-1 codemods (ADR-143) (cosine returned junk text)
  • "self-learning wiring task-completed pretrain" → hybrid: Self-learning reports success but persists nothing (3.10.6) (the actual issue)
  • "recall@k HNSW benchmark harness" → hybrid: feat(neural-trader): benchmark suite (signal, backtest, memory-recall)

Where hybrid still loses — broad queries get pulled toward release-bump commits because release commits bundle many issue numbers and share IDF-cheap tokens with everything. Future MMR tuning (lower λ) or a cross-encoder reranker (tracked for 3.11.0) is the path.

Outcome signal — measured on this checkout

COMMITS=200 SOURCE=git harvester run:

verdictcount
success200
partial0
reverted0
hotfixed0

This checkout has zero Revert "<X>"-style reverts in the 200 most recent commits and no fix-followups crossing the 50%-file-overlap threshold. The detector is correctly implemented (unit-tested in __tests__/hybrid-retrieval.test.ts); the empty count reflects a clean recent history, not a broken detector. On a churnier repo the numbers will move.

Reusable infrastructure shipped

  • src/memory/hybrid-retrieval.ts — 6 pure functions (200 LOC, 21 unit tests)
  • __tests__/hybrid-retrieval.test.ts — full coverage of tokenize/BM25/normalise/hybrid/cosine/MMR
  • __tests__/pretrain-from-github.test.ts — extended with a hybrid-vs-cosine assertion using the same fixture (no live git/gh)
  • scripts/benchmark-pretrained-retrieval.mjsHYBRID=0|1 A/B + per-query relevance regex + MRR@3 + top-1 diversity + top-3 dup rate
  • scripts/pretrain-from-github.mjs — outcome-signal harvester (HOTFIX_WINDOW_COMMITS, HOTFIX_FILE_OVERLAP env)
  • neural_patterns MCP tool — new content, mode, alpha, mmrLambda, limit params

Deliberately NOT in this round

  • Cross-encoder reranker — paper-proven path for another +0.05-0.15 MRR. Tracked for 3.11.0 (new dependency, MINOR bump).
  • Learned distiller (paper's 11× compression) — still tracked under #2241 round-D.
  • Negative-reward propagation on retrieval miss — needs agent-level success attribution we don't yet emit cleanly.
  • MMR λ-autotuning — currently a static 0.5; a future ADR could grid-search on the same bench.

Honest limits

  • N=385, 10 queries is small. The relevance metric is a regex over commit subjects — a stronger eval would use a held-out labelled set. The direction is robust (0% → 50% top-1 hit rate doesn't flip on noise) but the magnitude could move on a different corpus.
  • Hybrid is 40% slower per query (28.7 → 40.6 ms). Still <50 ms, but worth budgeting for hot paths. Cosine-only mode is preserved as mode: 'cosine' for callers that want it.
  • Top-1 diversity dropped (100% → 80%). This is expected — hybrid concentrates relevant results, cosine scatters them. Diversity is a proxy only meaningful when the underlying ranking is good.

Verification

bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc -b )

# Unit tests (no I/O, no network) — 21 + 7 tests
( cd v3/@claude-flow/cli && npx vitest run __tests__/hybrid-retrieval.test.ts __tests__/pretrain-from-github.test.ts )

# Pretrain + A/B retrieval (depends on git history; uses ONNX embedder)
node v3/@claude-flow/cli/scripts/pretrain-from-github.mjs
node v3/@claude-flow/cli/scripts/benchmark-pretrained-retrieval.mjs       # hybrid (default)
HYBRID=0 node v3/@claude-flow/cli/scripts/benchmark-pretrained-retrieval.mjs   # cosine baseline