
ADR-083 — Joint Rerank Re-Grid (Conditional Defaults for Rerank Path)

v3/docs/adr/ADR-083-joint-rerank-grid.md


Status: Accepted, implemented in ruflo 3.10.23
Date: 2026-05-30
Tracking: continuation of self-learning hardening cluster (ADR-077 → 078 → 079 → 080 → 081 → 082 → 083)
Related: ADR-082 (single-axis grid-search)

Context

ADR-082 grid-searched the hybrid (non-rerank) hyperparameter space against the ADR-081 labelled corpus and shipped α=0.5, sw=2, mmr=0.7 (alpha, subjectWeight, mmrLambda), pushing hybrid nDCG@3 from 0.900 to 0.963. The rerank path was left at hw=0.5/cw=0.5 (hybridWeight/ceWeight) because the original rerank grid had been run against the old α=0.6/sw=3.0 baseline; with α and sw shifting underneath it, the joint optimum had to be re-measured.

ADR-082 flagged this explicitly: "Rerank's trade-off: top-1/MRR/P3 up, nDCG@3/top-3 marginally down. Net: positive but not unambiguous. A joint re-grid (including hybridWeight/ceWeight × new α/sw) is tracked."

This is that re-grid.

Decision

Extended scripts/grid-search-retrieval.mjs with a joint rerank sweep:

  • hybridWeight × ceWeight ∈ {(0.2,0.8), (0.3,0.7), (0.4,0.6), (0.5,0.5), (0.6,0.4), (0.7,0.3), (0.8,0.2)}
  • alpha ∈ {0.3, 0.5} (for the underlying hybrid stage when rerank is on)
  • subjectWeight ∈ {2.0, 3.0}

7 × 2 × 2 = 28 rerank configs, ~25 min wall-clock with the cross-encoder.
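As a rough illustration of how the 28 configs come about, the sweep is the Cartesian product of the three axes listed above. This is a minimal sketch, not the code in scripts/grid-search-retrieval.mjs; the type and function names (`RerankConfig`, `buildRerankGrid`) are hypothetical, and only the parameter values come from this ADR.

```typescript
// Hypothetical shape of one rerank config; field names mirror the ADR's parameters.
interface RerankConfig {
  hybridWeight: number;
  ceWeight: number;
  alpha: number;
  subjectWeight: number;
}

// Axis values as listed in the Decision section.
const WEIGHT_PAIRS: Array<[number, number]> = [
  [0.2, 0.8], [0.3, 0.7], [0.4, 0.6], [0.5, 0.5],
  [0.6, 0.4], [0.7, 0.3], [0.8, 0.2],
];
const ALPHAS: number[] = [0.3, 0.5];
const SUBJECT_WEIGHTS: number[] = [2.0, 3.0];

// Enumerate every (hw/cw pair) x alpha x subjectWeight combination.
function buildRerankGrid(): RerankConfig[] {
  const grid: RerankConfig[] = [];
  for (const [hybridWeight, ceWeight] of WEIGHT_PAIRS) {
    for (const alpha of ALPHAS) {
      for (const subjectWeight of SUBJECT_WEIGHTS) {
        grid.push({ hybridWeight, ceWeight, alpha, subjectWeight });
      }
    }
  }
  return grid;
}

const rerankGrid = buildRerankGrid();
console.log(rerankGrid.length); // 28
```

Note that hybridWeight and ceWeight are swept as pairs summing to 1, not as independent axes, which is why the count is 7 × 2 × 2 and not 7 × 7 × 2 × 2.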

Key finding

The rerank path wants a different hybrid sub-parameter than the non-rerank path: the best subjectWeight is 3.0 with rerank and 2.0 without, while α and mmr stay the same:

| Path | Best α | Best sw | Best mmr | Best hw/cw | nDCG@3 |
|---|---|---|---|---|---|
| Non-rerank (hybrid only) | 0.5 | 2.0 | 0.7 | | 0.963 |
| Rerank | 0.5 | 3.0 | 0.7 | hw=0.7 cw=0.3 | 0.963 |

This makes intuitive sense: when the cross-encoder is doing the semantic understanding downstream, the hybrid stage can be more keyword-focused (higher subject weight). When hybrid is the final stage, lower subject weight gives the body tokens room to contribute relevance signal.

Implementation: subjectWeight default is now conditional on useRerank (3.0 if reranking, 2.0 otherwise). Explicit subjectWeight param still overrides.
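The conditional default can be sketched as below. This is an assumption about the shape of the logic, not a copy of src/mcp-tools/neural-tools.ts; `RetrievalParams` and `resolveSubjectWeight` are hypothetical names, while the values 3.0 and 2.0 and the override rule come from this ADR.

```typescript
// Hypothetical parameter bag; only the two fields relevant to this ADR are shown.
interface RetrievalParams {
  useRerank?: boolean;
  subjectWeight?: number;
}

// An explicit subjectWeight always wins; otherwise the default depends on useRerank.
function resolveSubjectWeight(params: RetrievalParams): number {
  if (params.subjectWeight !== undefined) {
    return params.subjectWeight;
  }
  return params.useRerank ? 3.0 : 2.0;
}

console.log(resolveSubjectWeight({}));                                     // 2
console.log(resolveSubjectWeight({ useRerank: true }));                    // 3
console.log(resolveSubjectWeight({ useRerank: true, subjectWeight: 1.5 })); // 1.5
```

Checking against `undefined` (and not truthiness) keeps an explicit `subjectWeight: 0` from being silently replaced by the default.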

New defaults

| Parameter | Non-rerank | Rerank | Change vs 3.10.22 |
|---|---|---|---|
| alpha | 0.5 | 0.5 | unchanged |
| subjectWeight | 2.0 | 3.0 | rerank +1.0 (conditional default) |
| mmrLambda | 0.7 | 0.7 | unchanged |
| hybridWeight | | 0.7 | rerank: 0.5 → 0.7 |
| ceWeight | | 0.3 | rerank: 0.5 → 0.3 |
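This ADR does not spell out how hybridWeight and ceWeight are applied. Since every swept pair sums to 1, one plausible reading is a linear blend of the hybrid-stage score and the cross-encoder score. The sketch below assumes exactly that, and also assumes both scores are already normalised to [0, 1]; `Candidate` and `fuseAndRank` are hypothetical names and the candidate scores are made up for illustration.

```typescript
// Hypothetical candidate carrying both stage scores, assumed normalised to [0, 1].
interface Candidate {
  id: string;
  hybridScore: number;
  ceScore: number;
}

// Assumed fusion rule: score = hybridWeight * hybridScore + ceWeight * ceScore.
function fuseAndRank(
  candidates: Candidate[],
  hybridWeight: number = 0.7,
  ceWeight: number = 0.3,
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({
      id: c.id,
      score: hybridWeight * c.hybridScore + ceWeight * c.ceScore,
    }))
    .sort((x, y) => y.score - x.score);
}

// Made-up scores: "a" is strong on the hybrid stage, "b" on the cross-encoder.
const sample: Candidate[] = [
  { id: "a", hybridScore: 0.9, ceScore: 0.2 },
  { id: "b", hybridScore: 0.5, ceScore: 0.9 },
];

console.log(fuseAndRank(sample)[0].id);           // "a" under 0.7/0.3
console.log(fuseAndRank(sample, 0.5, 0.5)[0].id); // "b" under the old 0.5/0.5
```

Under this assumed rule, moving from 0.5/0.5 to 0.7/0.3 lets the hybrid stage decide more of the final order, with the cross-encoder acting as a correction.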

Measured proof (N=385, 10 queries, labelled metric)

Rerank path:

| Metric | 3.10.22 (old joint defaults) | 3.10.23 (ADR-083) | Δ |
|---|---|---|---|
| Label top-1 hit rate | 90% | 90% | tied |
| Label top-3 hit rate | 90% | 100% | +10pp |
| Label MRR@3 | 0.925 | 0.950 | +0.025 |
| Label precision@3 | 0.700 | 0.700 | tied |
| Label nDCG@3 | 0.900 | 0.963 | +0.063 (+7%) |
| Label nDCG@5 | 0.904 | 0.944 | +0.040 |

Hybrid (non-rerank) path: unchanged from 3.10.22 (nDCG@3 = 0.963).
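For readers unfamiliar with the metrics in the table, here is a minimal sketch of nDCG@k and MRR@k for a single query. It uses the common linear-gain form DCG = Σ gain_i / log2(i + 1) with 1-based ranks; whether the benchmark script uses this form or the exponential-gain variant, and whether its labels are binary or graded, is not stated in this ADR, so treat the function names and conventions here as assumptions.

```typescript
// DCG over the first k positions; `gains[i]` is the label gain of the result at 0-based rank i.
function dcgAtK(gains: number[], k: number): number {
  return gains.slice(0, k).reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the actual ranking divided by DCG of the best possible ordering of all labels.
function ndcgAtK(rankedGains: number[], allLabelGains: number[], k: number): number {
  const ideal = dcgAtK([...allLabelGains].sort((a, b) => b - a), k);
  return ideal === 0 ? 0 : dcgAtK(rankedGains, k) / ideal;
}

// Reciprocal rank of the first relevant result within the top k, or 0 if there is none.
function mrrAtK(rankedRelevant: boolean[], k: number): number {
  const idx = rankedRelevant.slice(0, k).indexOf(true);
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// One query whose only labelled item lands at rank 2.
console.log(ndcgAtK([0, 1, 0], [1], 3).toFixed(3)); // "0.631"
console.log(mrrAtK([false, true, false], 3));       // 0.5
```

The reported figures would then be the mean of these per-query values over the 10 labelled queries.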

Cumulative SOTA push (3.10.17 → 3.10.23)

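| Metric (labelled) | 3.10.17 cosine | 3.10.19 hybrid | 3.10.20 +rerank | 3.10.22 | 3.10.23 |
|---|---|---|---|---|---|
| Hybrid nDCG@3 | 0.000 | 0.900 | 0.900 | 0.963 | 0.963 |
| Rerank nDCG@3 | | | 0.913 | 0.900 | 0.963 |
| Hybrid top-3 | 0% | 90% | 90% | 100% | 100% |
| Rerank top-3 | | | 100% | 90% | 100% |
| Rerank precision@3 | | | 0.667 | 0.700 | 0.700 |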

Both retrieval paths now hit nDCG@3 = 0.963 on this corpus. The choice between them is now purely cost vs richness:

  • Hybrid (39 ms/query) — same nDCG, leaner top-3 (P3=0.533), use for hot paths
  • Rerank (1000 ms/query) — same nDCG, denser top-3 (P3=0.700), use for richness-first paths
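One way a caller might encode the cost-vs-richness choice is a small helper keyed on a latency budget. This is purely illustrative: `choosePath` and its parameters are hypothetical and not part of ruflo; only the 39 ms and 1000 ms per-query figures come from this ADR, and they were measured on this corpus only.

```typescript
type RetrievalPath = "hybrid" | "rerank";

// Per-query latencies reported in this ADR for the ruflo corpus.
const MEASURED_LATENCY_MS: Record<RetrievalPath, number> = {
  hybrid: 39,
  rerank: 1000,
};

// Pick rerank only when the caller wants a denser top-3 and can afford its latency.
function choosePath(latencyBudgetMs: number, preferRichness: boolean): RetrievalPath {
  if (preferRichness && latencyBudgetMs >= MEASURED_LATENCY_MS.rerank) {
    return "rerank";
  }
  return "hybrid";
}

console.log(choosePath(50, true));    // "hybrid" (budget too small for rerank)
console.log(choosePath(2000, true));  // "rerank"
console.log(choosePath(2000, false)); // "hybrid"
```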

Joint grid top-5 by nDCG@3 (full ranking)

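| Rank | Config | top-1 | top-3 | nDCG | P3 | MRR |
|---|---|---|---|---|---|---|
| 1 | hybrid α=0.5 sw=2 mmr=0.7 | 90% | 100% | 0.963 | 0.533 | 0.950 |
| 1 | rerank hw=0.7 cw=0.3 α=0.5 sw=3 | 90% | 100% | 0.963 | 0.700 | 0.950 |
| 1 | rerank hw=0.8 cw=0.2 α=0.3 sw=3 | 90% | 100% | 0.963 | 0.767 | 0.950 |
| 4 | hybrid α=0.5 sw=3 mmr=0.7 | 90% | 100% | 0.955 | 0.533 | 0.950 |
| 4 | rerank hw=0.8 cw=0.2 α=0.5 sw=3 | 90% | 100% | 0.955 | 0.700 | 0.950 |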

Three configs tied at the corpus ceiling. Picked hw=0.7 cw=0.3 over hw=0.8 cw=0.2 because the latter underweights the cross-encoder's contribution (cw=0.2 leaves the CE doing almost nothing).

Reusable infrastructure shipped

  • scripts/grid-search-retrieval.mjs — extended with joint rerank sweep (28 configs across hw/cw × α × sw). Re-runnable on any corpus.
  • Conditional default logic in src/mcp-tools/neural-tools.ts: subjectWeight default depends on useRerank.
  • Run JSON at docs/benchmarks/runs/grid-search-retrieval-latest.json with the full 48-config matrix.

Honest limits

  • N=10 queries still — three configs tied at 0.963 are likely indistinguishable within noise.
  • Cross-repo generalisation pending — all numbers in ADRs 077-083 are on the ruflo corpus. The real SOTA test is "does this hold up on a different repo's history?" — pretrain on agentdb / agentic-flow, run a similar labelled bench, see if nDCG@3 stays near 0.96. Tracked for the next iteration.
  • Conditional defaults add complexity — callers passing explicit subjectWeight get consistent behaviour, but callers relying on defaults see different values based on rerank flag. This is documented in the schema but is a minor surprise.
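To put a number on "indistinguishable within noise", one option is a percentile bootstrap over per-query scores. The sketch below is an assumption about how that could be done, not something the benchmark scripts are known to do; `makeLcg` and `bootstrapMeanCI` are hypothetical helpers, and the per-query scores are made up, not taken from the run JSON.

```typescript
// Small seeded linear congruential generator so the sketch is reproducible.
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Percentile bootstrap interval for the mean of per-query scores.
function bootstrapMeanCI(
  scores: number[],
  iterations: number,
  rng: () => number,
  level: number = 0.95,
): [number, number] {
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    let sum = 0;
    for (let i = 0; i < scores.length; i++) {
      // Resample queries with replacement.
      sum += scores[Math.floor(rng() * scores.length)];
    }
    means.push(sum / scores.length);
  }
  means.sort((x, y) => x - y);
  const tail = (1 - level) / 2;
  const low = means[Math.floor(tail * (iterations - 1))];
  const high = means[Math.ceil((1 - tail) * (iterations - 1))];
  return [low, high];
}

// Made-up per-query nDCG@3 values for 10 queries (illustrative only).
const perQuery = [1, 1, 0.92, 1, 0.85, 1, 1, 0.97, 1, 0.89];
const [ciLow, ciHigh] = bootstrapMeanCI(perQuery, 2000, makeLcg(42));
console.log(ciLow.toFixed(3), ciHigh.toFixed(3));
```

If the intervals of two configs overlap heavily, a difference such as 0.963 vs 0.955 on 10 queries should not be read as a real ranking.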

Deliberately NOT in this round

  • Cross-repo generalisation test — biggest pending validation; next loop iteration.
  • Pure-precision configs (hw=0.4/cw=0.6 α=0.3 sw=3 hit P3=0.800 with nDCG=0.900) — would be the right opt-in for "richer top-K" but adds yet another config option. Skipping.
  • Larger cross-encoder (L-12 vs L-6) — corpus is already near ceiling; bigger model unlikely to help on N=10.

Verification

```bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )
node v3/@claude-flow/cli/scripts/pretrain-from-github.mjs

# Joint grid (~25 min)
cd v3/@claude-flow/cli && node scripts/grid-search-retrieval.mjs

# Verify new defaults
BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs              # hybrid → nDCG@3 0.963
RERANK=1 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs     # rerank → nDCG@3 0.963 (was 0.900)
```