Back to Ruflo

ADR-080 — Cross-Encoder Reranker (Opt-In Quality Pass)

v3/docs/adr/ADR-080-cross-encoder-reranker.md

3.10.306.4 KB
Original Source

ADR-080 — Cross-Encoder Reranker (Opt-In Quality Pass)

Status: Accepted — Implemented in ruflo 3.10.20 Date: 2026-05-30 Tracking: continuation of self-learning hardening cluster (#2245 → ADR-074 → ADR-075 → ADR-076 → ADR-077 → ADR-078 → ADR-079) Related: ADR-078 (hybrid retrieval), ADR-079 (multi-field BM25)

Context

ADR-079 lifted top-1 hit rate from 0% (cosine) to 80% via multi-field BM25 over the existing bi-encoder + MMR pipeline. The remaining 20% miss rate comes from queries where the bi-encoder + BM25 combine to surface a related-but-wrong commit at top-1 — they agree on lexical overlap, but neither has the joint understanding of (query, document) that a cross-encoder provides.

A cross-encoder reads (query, document) as a single concatenated input and produces a calibrated relevance score. Paper-proven path: typical +0.05–0.15 MRR lift over bi-encoder rerankers on small corpora. The cost is real — instead of one query embedding compared against N pre-computed doc embeddings, the model now runs N forward passes per query.

Decision

Add an opt-in cross-encoder rerank step ({rerank: true}) in neural_patterns search:

  1. Hybrid path returns top-K6 candidates by combined cosine+BM25 score (vs the default top-K3 used by MMR).
  2. Cross-encoder (Xenova/ms-marco-MiniLM-L-6-v2, int8-quantised, ~30MB) scores each (query, doc) pair.
  3. Final score = hybridWeight * normalise(hybrid) + ceWeight * normalise(crossEncoder) (default 0.5 / 0.5).
  4. Top-K by final score. MMR is bypassed when reranking is on (cross-encoder's calibrated score is more precise than MMR's diversity heuristic at this stage).

Default is OFF. Latency cost is ~25× the hybrid-only path (1.0 s vs 39 ms per query at N=385). Worth it when relevance matters more than throughput; for hot paths or batch retrieval the default hybrid is still right.

Why linear combination, not pure rerank

Ablation showed the cross-encoder alone hits 100% top-3 but loses top-1 (calibration on short commit subjects is noisy — the model was trained on MS MARCO passages). Hybrid is the opposite — strong top-1 (80%), weaker top-3 (80%). Linear combination preserves both:

ConfigTop-1Top-3MRR@3
Hybrid only (no rerank, 3.10.19)8/10 (80%)8/10 (80%)0.800
Cross-encoder alone (over top-30)6/10 (60%)10/10 (100%)0.733
Combined 0.5/0.5 (3.10.20 default)9/10 (90%)10/10 (100%)0.933

Why opt-in, not default

Two reasons:

  1. Latency — 1 s per query is too slow for sub-100 ms hot paths.
  2. Cold-start cost — first call downloads the ~30 MB int8 model; without network it gracefully degrades to hybrid-only order via crossEncoderRerank()'s try/catch.

Callers who want SOTA relevance flip {rerank: true}. Tests cover the degradation contract.

Measured proof (N=385, 10 queries, this checkout)

Cumulative since cosine baseline (3.10.17):

Metric3.10.17 cosine3.10.18 hybrid3.10.19 multi-field3.10.20 + rerankΔ since cosine
Top-1 hit rate0%50%80%90%+90pp
Top-3 hit rate0%70%80%100%+100pp
MRR@30.0000.5830.8000.933+0.933
Top-1 diversity100%80%100%100%0pp
Avg query latency28.7 ms40.6 ms39.0 ms984 ms+955 ms

Grid-search for hybrid:ce weight (N=385, 10 queries):

hybrid : cetop-1top-3MRR@3
0.7 : 0.38/1010/100.883
0.6 : 0.47/1010/100.833
0.5 : 0.59/1010/100.933
0.4 : 0.69/1010/100.933
0.3 : 0.79/1010/100.933

Sweet spot is broad — anywhere from 0.5:0.5 to 0.3:0.7 hits the same 9/10, 10/10, 0.933 plateau. Default 0.5:0.5.

Reusable infrastructure shipped

  • src/memory/cross-encoder-rerank.ts — lazy-loaded singleton + crossEncoderRerank(query, docs, topK?) + status diagnostic.
    • Direct AutoTokenizer + AutoModelForSequenceClassification path (the v2 pipeline('text-classification') API can't ingest {text, text_pair} pairs reliably).
    • One-shot load policy — after a failed load, subsequent getCrossEncoder() calls return null immediately. No retry loops.
    • Handles both single-logit (sigmoid) and binary-logit (softmax) heads.
  • neural_patterns MCP tool — new params: rerank, hybridWeight, ceWeight.
  • 5 new tests in __tests__/cross-encoder-rerank.test.ts covering the degradation contract (no network needed in tests).

Honest limits

  • Latency — 1 s per query at N=385. ~30 ms per (query, doc) pair × 30 candidates = the lion's share. Pool size could be tuned per-call.
  • First-run cost — ~30 MB model download (int8 quantised). Subsequent runs hit the local cache.
  • Calibration on short text — MS MARCO was trained on passages; commit subjects are short. The model's score distribution is bimodal (very-high or near-zero) which is what motivated the linear combination — pure-rerank dropped top-1 because of this calibration mismatch.
  • Same 10-query bench — direction (90%/100%/0.933) is robust to the regex-relevance proxy noise, but absolute numbers could shift on a different corpus. A labelled held-out evaluation is the right next gauge.

Deliberately NOT in this round

  • Default-on rerank — latency cost (25× hybrid) makes it wrong as a default. Callers can opt-in per query.
  • Larger cross-encoder (ms-marco-MiniLM-L-12-v2, etc.) — int8 v6 is the speed/quality sweet spot for now.
  • HyDE — LLM-call cost dominates everything else at this point.
  • Learned distiller — still tracked under #2241 round-D.

Verification

bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# Unit tests — 44 total (5 new cross-encoder degradation tests, no network)
( cd v3/@claude-flow/cli && npx vitest run __tests__/cross-encoder-rerank.test.ts __tests__/hybrid-retrieval.test.ts __tests__/pretrain-from-github.test.ts )

# Live A/B (cross-encoder downloads ~30MB on first run)
cd v3/@claude-flow/cli
node scripts/pretrain-from-github.mjs
node scripts/benchmark-pretrained-retrieval.mjs              # 3.10.19 default → 80% top-1
RERANK=1 node scripts/benchmark-pretrained-retrieval.mjs    # 3.10.20 + rerank → 90% top-1, 100% top-3
HYBRID=0 node scripts/benchmark-pretrained-retrieval.mjs    # cosine baseline → 0% top-1