v3/docs/adr/ADR-088-lucene-bm25-and-rerank.md
Status: Accepted (implemented in ruflo 3.10.28)
Date: 2026-05-30
Tracking: continuation of BEIR climb (ADR-085, 086, 087)
Related: ADR-087 (the RRF negative result that diagnosed this fix)
ADR-087 measured standard RRF k=60 underperforming dense-alone on both NFCorpus and SciFact, and diagnosed the cause as asymmetric input strength: our hybrid-retrieval multi-field BM25 was ~0.05 nDCG@10 below the published Lucene baseline, so RRF averaged its noise into the top-K instead of cancelling it.
This ADR addresses that diagnosis directly and adds the second-stage win (cross-encoder rerank) that the user identified as the next move after their "you should stack proven IR primitives" reframe.
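For reference, the RRF fusion discussed above can be sketched as follows. This is a minimal illustration, not the code in `run-beir-hybrid.mjs`: the function name `rrfFuse` and the toy rankings are assumptions made for the example. It shows how the top hit of a weak ranker (`d9`) is lifted above the second hit of a stronger ranker (`d2`), which is the "noise averaged into the top-K" effect described in ADR-087.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over input lists of 1 / (k + rank),
// where rank starts at 1. A document missing from a list contributes nothing.
type Ranking = string[]; // document ids, best first

function rrfFuse(rankings: Ranking[], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  // Highest fused score first; ties broken by id so the output is deterministic.
  return Array.from(scores).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}

// Hypothetical toy input: a "dense" list and a noisier "sparse" list.
const denseRanking: Ranking = ["d1", "d2", "d3"];
const sparseRanking: Ranking = ["d9", "d1", "d8"];
const fused = rrfFuse([denseRanking, sparseRanking]);
const fusedIds = fused.map(([id]) => id);
```

Because RRF only looks at ranks, it cannot tell that one input is weaker than the other; improving the weak input is the remedy this ADR takes.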
src/memory/lucene-bm25.ts — pure-function module, no external deps:
- Stemming, e.g. caresses → caress, agreed → agre, motoring → motor, vietnamization → vietnam.

scripts/run-beir-hybrid.mjs now supports:
- `USE_LUCENE_BM25=1`: swap multi-field BM25 for Lucene-style
- `RERANK=1`: apply the `Xenova/ms-marco-MiniLM-L-6-v2` cross-encoder over the top-100 RRF output

The cross-encoder infrastructure was already shipped in ADR-080 for repo-history retrieval; this ADR wires it into the public BEIR runner and proves it on standardised benchmarks.
Ruflo's runtime retrieval still uses the multi-field BM25 + dense + MMR + optional CE rerank pipeline from ADRs 078-083, tuned against repo-history corpora. The Lucene BM25 in this ADR is a BEIR-benchmark-only module — the multi-field BM25 stays better for short commit-subject text. We isolated the benchmark-vs-runtime concerns deliberately.
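To make "Lucene-style" scoring concrete, here is a minimal BM25 sketch. It is not the contents of `lucene-bm25.ts`: the names `buildIndex` and `bm25Scores` are hypothetical, documents are assumed to be already tokenized and stemmed, and the parameters `k1 = 1.2`, `b = 0.75` with the idf form `ln(1 + (N - df + 0.5) / (df + 0.5))` are the commonly cited Lucene defaults rather than values confirmed by this ADR.

```typescript
interface Bm25Index {
  docLengths: number[]; // token count per document
  avgDocLength: number;
  postings: Map<string, Map<number, number>>; // term -> (doc index -> term frequency)
}

function buildIndex(docs: string[][]): Bm25Index {
  const postings = new Map<string, Map<number, number>>();
  docs.forEach((tokens, docIndex) => {
    for (const term of tokens) {
      let termPostings = postings.get(term);
      if (!termPostings) {
        termPostings = new Map<number, number>();
        postings.set(term, termPostings);
      }
      termPostings.set(docIndex, (termPostings.get(docIndex) ?? 0) + 1);
    }
  });
  const docLengths = docs.map((tokens) => tokens.length);
  const total = docLengths.reduce((sum, len) => sum + len, 0);
  return { docLengths, avgDocLength: docs.length ? total / docs.length : 0, postings };
}

function bm25Scores(index: Bm25Index, queryTerms: string[], k1 = 1.2, b = 0.75): number[] {
  const n = index.docLengths.length;
  const scores = index.docLengths.map(() => 0);
  for (const term of queryTerms) {
    const termPostings = index.postings.get(term);
    if (!termPostings) continue; // term not in the corpus
    const df = termPostings.size;
    const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
    termPostings.forEach((tf, docIndex) => {
      // Length normalization: longer-than-average documents are penalized.
      const lengthNorm = 1 - b + b * (index.docLengths[docIndex] / index.avgDocLength);
      scores[docIndex] += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    });
  }
  return scores;
}

// Hypothetical toy corpus (already tokenized).
const toyIndex = buildIndex([
  ["vitamin", "d", "deficiency"],
  ["vitamin", "c"],
  ["statin", "therapy", "risk"],
]);
const toyScores = bm25Scores(toyIndex, ["vitamin", "d"]);
```

The constant `(k1 + 1)` factor scales every score equally, so dropping it would not change the ranking.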
| Configuration | NFCorpus nDCG@10 | SciFact nDCG@10 | Mean | Beats published BM25 both? |
|---|---|---|---|---|
| dense alone (BGE-base) | 0.352 | 0.626 | 0.489 | ✗ (loses SciFact -0.053) |
| Multi-field BM25 alone | 0.279 | 0.576 | 0.428 | ✗ (loses both) |
| Lucene BM25 alone (ADR-088) | 0.328 | 0.681 | 0.505 | tied (NFCorpus +0.003, SciFact +0.002) |
| Multi-field RRF k=60 (ADR-087, broken) | 0.328 ↓ | 0.569 ↓ | 0.449 | ✗ (loses both) |
| Lucene RRF k=60 | 0.360 | 0.632 | 0.496 | ✗ (loses SciFact -0.047) |
| Lucene RRF k=30 | 0.363 | 0.639 | 0.501 | ✗ (loses SciFact -0.040) |
| Multi-field RRF k=60 + CE rerank | 0.355 | 0.685 | 0.520 | ✓ (NFCorpus +0.030, SciFact +0.006) |
| Lucene RRF k=60 + CE rerank (best) | 0.358 | 0.683 | 0.521 | ✓ (NFCorpus +0.033, SciFact +0.004) |
| BM25 (published Lucene) | 0.325 | 0.679 | 0.502 | — |
| SPLADE++ (published) | 0.347 | 0.704 | 0.526 | — |
| BGE-large-v1.5 (published) | 0.380 | 0.722 | 0.551 | — |

Ranked by 2-dataset mean:

| System | Mean nDCG@10 |
|---|---|
| BGE-large-v1.5 (published, 335M) | 0.551 |
| SPLADE++ (published) | 0.526 |
| ruflo Lucene RRF + CE rerank (BGE-base 110M) | 0.521 |
| Multi-field RRF + CE rerank | 0.520 |
| Lucene BM25 alone | 0.505 |
| BM25 (published Lucene) | 0.502 |
| dense alone (BGE-base) | 0.489 |

The acceptance test from the climb plan ("ruflo beats BM25 on both datasets") PASSES. With RRF + CE rerank we reach 0.521 on the 2-dataset mean. That beats published BM25 (0.502) and every other published baseline except SPLADE++ (0.526, 0.005 above us) and BGE-large (0.551, 0.030 above).
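For readers who want to check the metric itself, the following is a minimal sketch of nDCG@10 for a single query, using linear gain and a log2 rank discount. The function names and the toy judgments are assumptions for illustration; the benchmark runner's actual evaluation code may differ in details such as tie handling. The last lines restate the acceptance check using the figures from the table above.

```typescript
// DCG@k with linear gain: sum of gain_i / log2(i + 1), ranks starting at 1.
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, gain, i) => sum + gain / (Math.log(i + 2) / Math.LN2), 0);
}

// qrels maps doc id -> graded relevance for one query; unjudged docs count as 0.
function ndcgAtK(ranked: string[], qrels: Map<string, number>, k = 10): number {
  const gains = ranked.map((id) => qrels.get(id) ?? 0);
  const ideal = Array.from(qrels.values())
    .filter((g) => g > 0)
    .sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg;
}

// Hypothetical judgments for one query.
const toyQrels = new Map<string, number>([["a", 2], ["b", 1]]);
const perfect = ndcgAtK(["a", "b", "c"], toyQrels);
const reversed = ndcgAtK(["c", "b", "a"], toyQrels);

// Acceptance check with the table's figures: best config vs published BM25.
const publishedBm25 = { nfcorpus: 0.325, scifact: 0.679 };
const rufloBest = { nfcorpus: 0.358, scifact: 0.683 };
const beatsOnBoth =
  rufloBest.nfcorpus > publishedBm25.nfcorpus &&
  rufloBest.scifact > publishedBm25.scifact;
```

The per-dataset score reported in the table is the mean of this value over all queries in the dataset.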
On NFCorpus, Lucene RRF k=60 alone (0.360) is essentially tied with Lucene RRF k=60 + CE rerank (0.358): the cross-encoder adds nothing when the underlying RRF ranking is already strong. The CE rerank's value is on SciFact (0.632 → 0.683, a +0.051 lift over Lucene RRF k=60, or +0.044 over the k=30 run at 0.639). In effect, when RRF is strong, rerank is mostly a pass-through; when RRF is weaker, rerank provides a substantial lift.
This matches the published literature on hybrid retrieval — reranking helps most when the candidate pool has high recall but low top-K precision.
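The second stage can be sketched as below: take the top-N candidates from the first stage, rescore each (query, passage) pair, and reorder only that head. This is an illustration under stated assumptions, not the ADR-080 implementation. `rerankTopN`, `PairScorer`, and `overlapScorer` are hypothetical names, and the token-overlap scorer is a stand-in for the cross-encoder, whose real call would be asynchronous and batched.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Stand-in for a cross-encoder: returns a relevance score for one pair.
type PairScorer = (query: string, passage: string) => number;

function rerankTopN(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  topN = 100,
): Candidate[] {
  const head = candidates.slice(0, topN);
  const tail = candidates.slice(topN); // never rescored, keeps first-stage order
  const scored = head.map((candidate, firstStageRank) => ({
    candidate,
    firstStageRank,
    score: scorePair(query, candidate.text),
  }));
  // Highest score first; ties keep the first-stage order.
  scored.sort((a, b) => b.score - a.score || a.firstStageRank - b.firstStageRank);
  return scored.map((s) => s.candidate).concat(tail);
}

// Hypothetical scorer: counts query tokens that appear in the passage.
const overlapScorer: PairScorer = (query, passage) => {
  const passageTokens = passage.toLowerCase().split(/\s+/);
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => passageTokens.indexOf(token) !== -1).length;
};

const firstStage: Candidate[] = [
  { id: "x", text: "weather report" },
  { id: "y", text: "statin therapy lowers risk" },
  { id: "z", text: "statin" },
];
const reranked = rerankTopN("statin therapy", firstStage, overlapScorer, 2);
const rerankedIds = reranked.map((c) => c.id);
```

In the toy run, `z` stays last even though it would outscore `x`, because it fell outside the reranked head. That is why recall of the candidate pool bounds what reranking can recover.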
`BGE_MODEL=Xenova/bge-large-en-v1.5`: likely lifts both datasets further, at roughly 3× the embedding latency.

```bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# Re-use NFCorpus + SciFact caches from ADR-085 (or re-ingest if needed)
cd /tmp/beir-nfcorpus
USE_LUCENE_BM25=1 RERANK=1 node /path/to/v3/@claude-flow/cli/scripts/run-beir-hybrid.mjs
# → nDCG@10 0.358, rank 2/11 on NFCorpus

cd /tmp/beir-scifact
USE_LUCENE_BM25=1 RERANK=1 BEIR_DATA_DIR=/tmp/beir-scifact/scifact \
  node /path/to/v3/@claude-flow/cli/scripts/run-beir-hybrid.mjs
# → nDCG@10 0.683, rank 3/11 on SciFact

# Stand-alone Lucene BM25 (no rerank, fast)
USE_LUCENE_BM25=1 node /path/to/scripts/run-beir-hybrid.mjs
```