v3/docs/adr/ADR-089-three-dataset-beir-and-upstream.md
Status: Accepted, implemented in ruflo 3.10.29

Date: 2026-05-30

Tracking: continuation of BEIR climb (ADR-085, 086, 087, 088) + #2246 + ruvnet/ruvector#523-524
Per the user's "no constant releases" guidance, 3.10.29 is a batched ship combining four independent threads:
ADR-090 (separate) covers the BGE query-prefix experiment.
Best ruflo config per dataset (no per-dataset tuning of the config; we ship the same run-beir-hybrid.mjs pipeline + flags everywhere):
| Dataset | Best ruflo nDCG@10 | Best ruflo pipeline | Rank | Best Listed Baseline |
|---|---|---|---|---|
| NFCorpus | 0.358 | Lucene + RRF + CE rerank | 2/11 | BGE-large 0.380 |
| SciFact | 0.683 | Lucene + RRF + CE rerank | 3/11 | BGE-large 0.722 |
| ArguAna | 0.432 | Lucene + RRF (k=60) | 5/11 | BGE-large 0.636 |
| 3-dataset mean | 0.491 | mixed | — | BGE-large 0.579 |
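The "RRF (k=60)" entries refer to Reciprocal Rank Fusion. The following is a minimal sketch of the standard formula, not the code in run-beir-hybrid.mjs: the function name `rrfFuse` and the toy rankings are hypothetical, and it assumes the fused inputs are one lexical and one dense ranking.

```typescript
// Reciprocal Rank Fusion (RRF): each ranker contributes 1 / (k + rank),
// with rank starting at 1; documents are ordered by the summed score.
type Ranking = string[]; // document ids, best first

function rrfFuse(rankings: Ranking[], k: number = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1;
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical toy rankings (not BEIR data): one lexical, one dense.
const bm25Ranking: Ranking = ["d3", "d1", "d2"];
const denseRanking: Ranking = ["d1", "d4", "d3"];

const fused = rrfFuse([bm25Ranking, denseRanking]);
console.log(fused.map(([docId]) => docId)); // [ 'd1', 'd3', 'd4', 'd2' ]
```

With k=60 the contribution of a single rank position is small, so a document that appears near the top of both lists (d1 here) outranks one that is first in only one list.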

nDCG@10 by system and dataset, with the 3-dataset mean:

| System | Params | NFCorpus | SciFact | ArguAna | Mean |
|---|---|---|---|---|---|
| BGE-large-v1.5 (published) | 335M | 0.380 | 0.722 | 0.636 | 0.579 |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.521 | 0.524 |
| GenQ (published) | 110M | 0.319 | 0.644 | 0.493 | 0.485 |
| ruflo best (per-dataset) | 110M | 0.358 | 0.683 | 0.432 | 0.491 |
| GTR-XL (published) | 1.2B | 0.343 | 0.662 | 0.439 | 0.481 |
| Contriever (published) | 110M | 0.328 | 0.677 | 0.379 | 0.461 |
| BM25 (published Lucene) | — | 0.325 | 0.679 | 0.397 | 0.467 |
| ruflo Lucene BM25 alone | — | 0.328 | 0.681 | n/a | (2-dataset 0.505) |
| TAS-B (published) | 66M | 0.319 | 0.643 | 0.429 | 0.464 |
| ColBERT (published) | 110M | 0.305 | 0.671 | 0.233 | 0.403 |
| SBERT msmarco (published) | 110M | 0.272 | 0.555 | 0.371 | 0.399 |
Rank 3 on the 3-dataset mean among the ten entries that report all three datasets. Beats published BM25 (+0.024), GenQ (+0.006, essentially tied), GTR-XL (1.2B), TAS-B, Contriever, ColBERT, and SBERT. Loses to SPLADE++ (-0.033) and BGE-large (-0.088).
ArguAna is counter-argument retrieval — the model must understand opposition between query and document, not topical similarity. BGE-large dominates here (0.636) because BAAI specifically trained on argument pairs. Our zero-shot BGE-base gets 0.432 — same neighborhood as TAS-B (0.429) and GTR-XL (0.439) but well below the top.
The 0.088 mean gap to BGE-large is mostly the ArguAna gap (0.204) — on NFCorpus and SciFact we close to 0.022 and 0.039 respectively.
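The means and the gap decomposition can be rechecked directly from the table values. The sketch below only re-does that arithmetic; the variable names are illustrative, and the "ruflo Lucene BM25 alone" row is left out because it has no ArguAna score.

```typescript
// nDCG@10 values copied from the comparison table: [NFCorpus, SciFact, ArguAna].
const table: Record<string, [number, number, number]> = {
  "BGE-large-v1.5": [0.380, 0.722, 0.636],
  "SPLADE++": [0.347, 0.704, 0.521],
  "GenQ": [0.319, 0.644, 0.493],
  "ruflo best": [0.358, 0.683, 0.432],
  "GTR-XL": [0.343, 0.662, 0.439],
  "Contriever": [0.328, 0.677, 0.379],
  "BM25": [0.325, 0.679, 0.397],
  "TAS-B": [0.319, 0.643, 0.429],
  "ColBERT": [0.305, 0.671, 0.233],
  "SBERT msmarco": [0.272, 0.555, 0.371],
};

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Systems ordered by 3-dataset mean, best first.
const ranked = Object.keys(table)
  .map((name) => ({ name, mean: mean(table[name]) }))
  .sort((a, b) => b.mean - a.mean);
const rufloRank = ranked.findIndex((r) => r.name === "ruflo best") + 1;

// Per-dataset gap to BGE-large; the mean gap is the average of these three.
const gaps = table["BGE-large-v1.5"].map((v, i) => v - table["ruflo best"][i]);
const meanGap = mean(gaps);
const arguanaShare = gaps[2] / (gaps[0] + gaps[1] + gaps[2]);

console.log(ranked.map((r) => `${r.name} ${r.mean.toFixed(3)}`));
console.log(rufloRank, gaps.map((g) => g.toFixed(3)), meanGap.toFixed(3), arguanaShare.toFixed(2));
```

The per-dataset gaps come out as 0.022, 0.039 and 0.204, so ArguAna accounts for roughly three quarters of the summed gap.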
| Model | NFCorpus nDCG@10 | Notes |
|---|---|---|
| Xenova/bge-base-en-v1.5 (110M, int8 quantized) | 0.352 | our baseline |
| Xenova/bge-large-en-v1.5 (335M, int8 quantized) | 0.350 | no lift — basically tied |
| BAAI/bge-large-en-v1.5 (published, unquantized) | 0.380 | what BAAI reports |
3× the model size, ~3× the embed latency, no measured quality lift on this stack. Two likely causes:
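- int8 quantization: both Xenova checkpoints we can run are int8, while the published 0.380 is for unquantized weights.
- Missing query prefix (BGE expects `Represent this sentence for searching relevant passages: `); ADR-090 measures that separately.

The honest framing: on the artefact we can run, BGE-large is not a free upgrade. Switching to BGE-large would lose throughput and gain nothing measurable. Real BGE-large performance probably needs GPU + unquantized weights.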
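For reference, the query prefix works by prepending the instruction string to queries only; passages are embedded unchanged. A minimal sketch follows, assuming a hypothetical helper `prepareForEmbedding` (the real `embedQuery()` signature in bge-embedder.ts is not shown in this ADR) and an illustrative query:

```typescript
// The BGE instruction string, applied to queries only.
const BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: ";

// Hypothetical helper: returns the text that would be handed to the embedder.
function prepareForEmbedding(
  text: string,
  kind: "query" | "passage",
  usePrefix: boolean,
): string {
  // Passages are embedded as-is; only queries get the instruction.
  return kind === "query" && usePrefix ? BGE_QUERY_PREFIX + text : text;
}

// Illustrative inputs, not taken from a BEIR dataset.
const preparedQuery = prepareForEmbedding("does vitamin D reduce fracture risk?", "query", true);
const preparedPassage = prepareForEmbedding("Vitamin D supplementation and fractures.", "passage", true);

console.log(preparedQuery);
console.log(preparedPassage);
```

Keeping the prefix behind a flag means the same cached passage embeddings can be reused for both the prefixed and unprefixed query runs.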
Updated src/mcp-tools/neural-tools.ts embedder cascade:
- Tier 0 (NEW): ruvector embed(): bundled, no sharp dep, disk-cache hit
- Tier 1: agentic-flow/reasoningbank (broken on darwin-arm64 without sharp)
- Tier 2: @claude-flow/embeddings + agentic-flow provider
- Tier 3: @claude-flow/embeddings + onnx provider
- No Tier 4: leave realEmbeddings null and force hash-fallback with an explicit _embeddingNote
Verified active: probe returns embedder: ruvector (bundled all-MiniLM-L6-v2), _realEmbedding: true, dim: 384, disk-cache hit (no re-download).
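The cascade amounts to "try each tier in order, take the first that loads, otherwise report why and fall back". The sketch below shows that shape under simplifying assumptions: `selectEmbedder`, the `Tier` interface and the stub loaders are hypothetical, and loading is synchronous here, whereas the real loaders in neural-tools.ts are likely asynchronous.

```typescript
type Embedder = (text: string) => number[];

interface Tier {
  name: string;
  // Returns an embedder, returns null if unavailable, or throws if the dependency is broken.
  load: () => Embedder | null;
}

interface Selection {
  embedder: Embedder | null; // null means "use the hash fallback"
  source: string | null;
  skipped: string[]; // why earlier tiers were passed over
}

function selectEmbedder(tiers: Tier[]): Selection {
  const skipped: string[] = [];
  for (const tier of tiers) {
    try {
      const embedder = tier.load();
      if (embedder) return { embedder, source: tier.name, skipped };
      skipped.push(`${tier.name}: unavailable`);
    } catch (err) {
      skipped.push(`${tier.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return { embedder: null, source: null, skipped };
}

// Hypothetical stub tiers standing in for the real loaders.
const tiers: Tier[] = [
  { name: "tier-0 ruvector", load: () => { throw new Error("module not found"); } },
  { name: "tier-1 reasoningbank", load: () => null },
  { name: "tier-2 embeddings", load: () => (text) => [text.length, 0, 0] },
];

const picked = selectEmbedder(tiers);
const result = picked.embedder
  ? { vector: picked.embedder("hello"), _realEmbedding: true }
  : { vector: null, _realEmbedding: false, _embeddingNote: `hash fallback; skipped: ${picked.skipped.join("; ")}` };
console.log(picked.source, result);
```

Recording the skipped tiers makes the fallback explicit in the output instead of silently returning lower-quality vectors.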
Measured parallel-embedder throughput on this CPU: 6.2× per-doc speedup. Upstream PR #525 claims 10-14×; our number was measured under self-contention from the BEIR benches, so we estimate a clean run would land in the 8-12× range.
| Finding | Status | Fix |
|---|---|---|
| #1 memory_search_unified hardcoded 6 namespaces (silently misses ~95% of an 8789-entry store) | FIXED | New namespaces: string[] param + CLAUDE_FLOW_MEMORY_SEARCH_NAMESPACES env + dynamic enumeration via listEntries({}) as the new default + namespaceSource audit field. 9 regression tests covering all 5 priority paths. |
| #2 npm install -g ruflo silently overwrites dist/ patches | acknowledged | Tracked for a separate release (postinstall checksum + warning). |
| #3 agentdb addCausalEdge() silently orphans edges when NodeIdMapper.getNodeId() returns undefined | forwarded | Filed as ruvnet/agentdb#7. Will pull when agentdb pin bumps. |
| #4 graph_edges DB unavailable on fresh env | FIXED | getBridgeDb({createIfMissing: true}) lazy-creates empty memory.db + graph_edges schema; pathfinder call sites updated; error message gains a hint field. |
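The fix for finding #1 resolves the namespace list from several sources in priority order and records which one was used. A minimal sketch, assuming the order explicit param, then env, then dynamic enumeration, then a fallback list: the function name, the comma-separated env format, the `enumerate` callback and the fallback step are all assumptions, and this ADR does not spell out the five priority paths the tests cover.

```typescript
type NamespaceSource = "param" | "env" | "enumerated" | "fallback";

interface ResolveInput {
  namespaces?: string[];     // explicit tool parameter
  envValue?: string;         // value of CLAUDE_FLOW_MEMORY_SEARCH_NAMESPACES, if set
  enumerate: () => string[]; // stand-in for namespaces discovered via listEntries({})
  fallback: string[];        // assumed last resort when enumeration finds nothing
}

function resolveNamespaces(
  input: ResolveInput,
): { namespaces: string[]; namespaceSource: NamespaceSource } {
  if (input.namespaces && input.namespaces.length > 0) {
    return { namespaces: input.namespaces, namespaceSource: "param" };
  }
  // Assumed format: comma-separated list, blanks ignored.
  const fromEnv = (input.envValue ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  if (fromEnv.length > 0) return { namespaces: fromEnv, namespaceSource: "env" };
  // De-duplicate whatever the store reports.
  const found = Array.from(new Set(input.enumerate()));
  if (found.length > 0) return { namespaces: found, namespaceSource: "enumerated" };
  return { namespaces: input.fallback, namespaceSource: "fallback" };
}

// Hypothetical store contents.
const enumerate = () => ["patterns", "tasks", "patterns"];
const fallback = ["default"];

const byParam = resolveNamespaces({ namespaces: ["a"], envValue: "x,y", enumerate, fallback });
const byEnv = resolveNamespaces({ envValue: " x, y ,", enumerate, fallback });
const byStore = resolveNamespaces({ enumerate, fallback });
const byFallback = resolveNamespaces({ enumerate: () => [], fallback });
console.log(byParam, byEnv, byStore, byFallback);
```

Returning `namespaceSource` alongside the list is what lets a caller audit why a search covered the namespaces it did.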
- src/memory/bge-embedder.ts: adds embedQuery() + exports BGE_QUERY_PREFIX (ADR-090 opt-in)
- src/memory/lucene-bm25.ts: Porter stemmer + Lucene stopwords + single-field BM25 (matches published baseline ±0.003)
- src/memory/graph-edge-writer.ts: getBridgeDb({createIfMissing}) (#2246 #4)
- src/mcp-tools/neural-tools.ts: Tier-0 ruvector probe with content-vs-shape-safe unwrap
- src/mcp-tools/memory-tools.ts: namespace fan-out fix (#2246 #1) + new namespaces param + env override + namespaceSource audit field
- src/mcp-tools/agentdb-tools.ts: pathfinder call sites pass createIfMissing: true
- scripts/run-beir-bge.mjs: BGE_QUERY_PREFIX=1 env, per-dataset cache path, ArguAna baselines
- scripts/run-beir-hybrid.mjs: USE_LUCENE_BM25=1, RERANK=1, BGE_QUERY_PREFIX=1 flags, ArguAna baselines
- scripts/run-beir-lucene-bm25.mjs: Lucene BM25 + RRF runner
- __tests__/memory-search-unified-2246.test.ts: 9 new regression tests
- package.json adds ruvector: ^0.2.27 + root override
- docs/benchmarks/BEIR-MATRIX.md: 3-dataset rows + per-pipeline comparison

```bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
RUFLO_ROOT="$(pwd)"   # repo root, so the scripts resolve after we cd into /tmp
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# 3 BEIR datasets: each ingests once, caches, then evals fast on subsequent runs
for ds in nfcorpus scifact arguana; do
  mkdir -p "/tmp/beir-$ds" && cd "/tmp/beir-$ds"
  [ ! -f "$ds.zip" ] && curl -sL -o "$ds.zip" "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/$ds.zip" && unzip -q "$ds.zip"
  BEIR_DATA_DIR="/tmp/beir-$ds/$ds" node "$RUFLO_ROOT/v3/@claude-flow/cli/scripts/run-beir-bge.mjs"   # dense alone
  USE_LUCENE_BM25=1 RERANK=1 BEIR_DATA_DIR="/tmp/beir-$ds/$ds" node "$RUFLO_ROOT/v3/@claude-flow/cli/scripts/run-beir-hybrid.mjs"   # full pipeline
done

# #2246 tests (the loop leaves the shell in /tmp, so cd via the repo root)
( cd "$RUFLO_ROOT/v3/@claude-flow/cli" && npx vitest run __tests__/memory-search-unified-2246.test.ts )
```
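Every score in this ADR is nDCG@10. As a reading aid, here is a minimal sketch of that metric using linear gain (rel / log2(rank + 1)), which is the trec_eval-style definition; whether the ruflo runners compute it exactly this way is an assumption, and the function names and toy judgments are hypothetical.

```typescript
// DCG over the first k positions with linear gain: sum(rel_i / log2(i + 1)), i = 1..k.
function dcg(relevances: number[], k: number): number {
  // `i` is 0-based here, so position i + 1 is discounted by log2(i + 2).
  return relevances.slice(0, k).reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.
function ndcgAtK(rankedDocIds: string[], qrels: Record<string, number>, k: number = 10): number {
  const gains = rankedDocIds.map((id) => qrels[id] ?? 0); // unjudged docs count as 0
  const ideal = Object.keys(qrels)
    .map((id) => qrels[id])
    .sort((a, b) => b - a);
  const idealDcg = dcg(ideal, k);
  return idealDcg === 0 ? 0 : dcg(gains, k) / idealDcg;
}

// Hypothetical judgments: three relevant documents, two of them retrieved.
const qrels: Record<string, number> = { d1: 1, d3: 1, d5: 1 };
const score = ndcgAtK(["d1", "d2", "d3"], qrels);
console.log(score.toFixed(4)); // 0.7039
```

The missing relevant document (d5) lowers the score even though nothing irrelevant was ranked above a relevant one beyond position 2, which is why recall within the top 10 matters as much as ordering.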