Back to Ruflo

ADR-082 — Grid-Search Retrieval Defaults Against Labelled Metric

v3/docs/adr/ADR-082-grid-search-retrieval-defaults.md

3.10.306.2 KB
Original Source

ADR-082 — Grid-Search Retrieval Defaults Against Labelled Metric

Status: Accepted — Implemented in ruflo 3.10.22 Date: 2026-05-30 Tracking: continuation of self-learning hardening cluster (ADR-077 → 078 → 079 → 080 → 081 → 082) Related: ADR-078 (hybrid retrieval), ADR-079 (multi-field BM25), ADR-080 (cross-encoder), ADR-081 (labelled corpus)

Context

ADR-079's tuning (α=0.6, subjectWeight=3.0, mmrLambda=0.5) and ADR-080's tuning (hybridWeight=0.5, ceWeight=0.5) were both selected against the regex relevance proxy that ADR-081 then revealed was misleading. The defaults were never validated against ground truth.

ADR-081 shipped the labelled corpus. Now we can re-tune properly.

Decision

Built a grid-search harness (scripts/grid-search-retrieval.mjs) that sweeps the retrieval hyperparameter space against the ADR-081 labelled corpus and reports label nDCG@3, top-1, top-3, precision@3, MRR@3 per configuration.

Grid:

  • alpha ∈ {0.3, 0.5, 0.7}
  • subjectWeight ∈ {2.0, 3.0, 5.0}
  • mmrLambda ∈ {0.3, 0.5, 0.7}
  • hybridWeight × ceWeight ∈ {(0.3, 0.7), (0.4, 0.6), (0.5, 0.5), (0.6, 0.4), (0.7, 0.3)} (rerank-only)

32 configs total: 27 hybrid + 5 rerank.

Findings → new defaults

The grid revealed:

  1. α=0.5 beats α=0.6 and crushes α=0.7. α=0.7 (more cosine, less BM25) collapses to 40-50% top-1 across all subjectWeight × mmrLambda combinations. The BM25 signal carries more discriminating power than the bi-encoder cosine on this corpus.
  2. subjectWeight=2 beats sw=3 (slightly) and sw=5 (clearly). Less weight on subject lets body tokens contribute relevance signal that gets crowded out at sw=3 or 5.
  3. mmrLambda=0.7 beats 0.5 and 0.3. Less diversity / more relevance ranking pulls more relevant docs into top-3. Diversity is paying for itself less than expected on this corpus.
  4. For rerank (hybridWeight/ceWeight): grid-search winner was hw=0.7 cw=0.3 (nDCG@3=0.963) when tested against OLD α/sw baselines. When the new α/sw shipped, the joint optimum shifted — hw=0.5 cw=0.5 with new α/sw gives a similar but mixed profile. Kept at 0.5/0.5 pending a joint re-grid.

New defaults

ParameterOld (ADRs 079-080)New (ADR-082)Why
alpha0.60.5Grid: nDCG@3 0.900 → 0.963
subjectWeight3.02.0Grid: sw=2 dominates the row
mmrLambda0.50.7Grid: mmr=0.7 beats 0.5 by ~0.02 nDCG
bodyWeight1.01.0unchanged
typePenaltyFactor1.01.0unchanged (opt-in)
hybridWeight0.50.5unchanged pending joint re-grid
ceWeight0.50.5unchanged pending joint re-grid

Measured proof (N=385, 10 queries, labelled metric)

Hybrid path (default, no opt-in rerank):

Metric3.10.21 (old defaults)3.10.22 (ADR-082)Δ
Label top-1 hit rate90%90%tied
Label top-3 hit rate90%100%+10pp
Label MRR@30.9000.950+0.050
Label precision@30.4000.533+0.133
Label nDCG@30.9000.963+0.063 (+7%)
Label nDCG@50.8750.938+0.063
Avg query latency42 ms55 ms+13 ms (still <100 ms)

Rerank path (opt-in {rerank: true}, with new hybrid defaults underneath):

Metric3.10.21 (old defaults)3.10.22 (ADR-082)Δ
Label top-1 hit rate80%90%+10pp
Label top-3 hit rate100%90%-10pp
Label MRR@30.8830.925+0.042
Label precision@30.6670.700+0.033
Label nDCG@30.9130.900-0.013

Rerank's trade-off: top-1/MRR/P3 up, nDCG@3/top-3 marginally down. Net: positive but not unambiguous. A joint re-grid (including hybridWeight/ceWeight × new α/sw) is tracked.

Cumulative SOTA push since cosine baseline (3.10.17 → 3.10.22)

Metric (labelled, canonical)3.10.173.10.193.10.203.10.22
Label top-1 (hybrid)0%90%90%90%
Label top-3 (hybrid)0%90%90%100%
Label nDCG@3 (hybrid)0.0000.9000.9000.963
Label precision@3 (hybrid)0.0000.4000.4000.533
Label top-1 (rerank)80%90%
Label nDCG@3 (rerank)0.9130.900
Label precision@3 (rerank)0.6670.700

Reusable infrastructure shipped

  • scripts/grid-search-retrieval.mjs — sweeps the hyperparameter space, reports per-config metrics, picks winners by nDCG/top-1/precision@3. Re-runnable on any pretrained store. Includes --quick mode for fast iteration.
  • Default value updates in src/mcp-tools/neural-tools.ts + schema descriptions.
  • Run JSONs at docs/benchmarks/runs/grid-search-retrieval-{ts,latest}.json with full config × metrics matrix.

Honest limits

  • Single annotator corpus (ADR-081 limitation).
  • N=10 queries — direction is robust (60-percentile differences) but per-config differences within ±2% are noise.
  • Rerank joint re-grid is pending — the rerank winner from the original grid was tested against OLD α/sw; a joint re-grid with new α/sw is the next ceiling-raiser.
  • MMR may be unnecessary at λ=0.7 — at this λ MMR is essentially pure-relevance ordering. A future ADR could remove MMR entirely from the hybrid path and just use hybrid-score-descending.

Deliberately NOT in this round

  • Joint rerank × α/sw grid — tracked; would need ~75 rerank configs at ~1s each = 12 minutes. Worth one focused ADR.
  • Default-on rerank — still wrong as a default (1s/query). Rerank gain on top-1 (80→90%) is marginal vs hybrid alone (which already hits 90%).
  • Removing MMR at high λ — saves a small amount of compute but not a measurable relevance lift.

Verification

bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# Pretrain
node v3/@claude-flow/cli/scripts/pretrain-from-github.mjs

# Grid-search (full grid, ~1 min)
cd v3/@claude-flow/cli && node scripts/grid-search-retrieval.mjs

# Confirm new defaults on canonical bench
BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs            # hybrid → nDCG@3 0.963
RERANK=1 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs   # rerank → nDCG@3 0.900