ADR-088: LongMemEval Benchmark for AgentDB Memory System

Status: Accepted
Date: 2026-04-08
Author: ruflo team
Relates to: ADR-076 (Memory Bridge), ADR-077 (DiskANN), ADR-075 (Learning Pipeline)

Context

MemPalace, a new open-source AI memory system, reported a 96.6% raw score and 100% hybrid score on LongMemEval (ICLR 2025) — a benchmark of 500 questions testing long-term conversational memory across 6 question types. This prompted the question: how does Ruflo's AgentDB memory system compare?

LongMemEval Landscape (April 2026)

| System | Score | Mode | API Required |
|---|---|---|---|
| MemPalace | 100% (500/500) | Hybrid (Haiku reranking) | Yes (Haiku) |
| MemPalace | 96.6% | Raw (local only) | No |
| OMEGA | 95.4% | Cloud | Yes |
| Observational Memory | 94.87% | gpt-5-mini | Yes |
| Supermemory | ~93% | gpt-4o | Yes |
| GPT-4o (long context) | 30-70% | Baseline | Yes |
| AgentDB | Unknown | | |

Why This Matters

  • LongMemEval is the de facto standard for evaluating AI memory systems
  • Without a published score, AgentDB cannot be credibly compared
  • AgentDB has architectural advantages (HNSW indexing, semantic routing, 19 controllers) that should translate into a strong score — but we need proof
  • Independent analysis of MemPalace found that its "+34% retrieval boost" is standard metadata filtering, not a novel technique — AgentDB's actual HNSW + controller architecture may outperform it

What LongMemEval Tests

The benchmark evaluates 5 core memory abilities across 500 questions:

  1. Information Extraction — Retrieve specific facts from past conversations
  2. Multi-Session Reasoning — Combine information across multiple conversation sessions
  3. Temporal Reasoning — Understand when events occurred and their ordering
  4. Knowledge Updates — Track how facts change over time (corrections, updates)
  5. Abstention — Correctly refuse to answer when information was never provided

Question types: single-session (1-hop), multi-session (1-hop), single-session (multi-hop), multi-session (multi-hop), knowledge update, temporal reasoning.
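
For per-category reporting, the harness can encode these categories as a closed type. The identifiers below are illustrative, not the dataset's official labels, and should be mapped to the actual question_type values at load time.

```typescript
// Illustrative category identifiers for per-type reporting. The dataset's own
// question_type labels may differ; map them to these at ingestion time.
type QuestionType =
  | 'single-session-1hop'
  | 'multi-session-1hop'
  | 'single-session-multihop'
  | 'multi-session-multihop'
  | 'knowledge-update'
  | 'temporal-reasoning';
```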

Dataset

  • Source: HuggingFace
  • Files: longmemeval_oracle.json, longmemeval_s_cleaned.json, longmemeval_m_cleaned.json
  • Size: 500 questions across conversation histories of varying length
  • Evaluation: src/evaluation/evaluate_qa.py (official script)
  • Paper: arXiv:2410.10813
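
A minimal loader sketch for ingest.ts, assuming each record carries an id, a type, the question/answer pair, and its haystack sessions; the field names here are assumptions and must be verified against the downloaded JSON files.

```typescript
import { readFile } from 'node:fs/promises';

// Assumed record shape; verify every field name against the downloaded JSON
// before building on it.
interface LongMemEvalRecord {
  question_id: string;
  question_type: string;
  question: string;
  answer: string;
  // Each haystack session is a list of turns from one past conversation.
  haystack_sessions: { role: 'user' | 'assistant'; content: string }[][];
}

async function loadDataset(path: string): Promise<LongMemEvalRecord[]> {
  const raw = await readFile(path, 'utf8');
  return JSON.parse(raw) as LongMemEvalRecord[];
}
```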

Decision

Implement a full LongMemEval benchmark harness for AgentDB and publish results transparently, including per-category breakdowns and comparison with other systems.

Architecture

```
v3/@claude-flow/memory/benchmarks/longmemeval/
├── README.md                    # Setup & reproduction instructions
├── harness.ts                   # Main benchmark runner
├── adapters/
│   ├── agentdb-adapter.ts       # AgentDB memory backend
│   ├── agentdb-hnsw-adapter.ts  # AgentDB + HNSW mode
│   └── baseline-adapter.ts      # Plain vector search baseline
├── ingest.ts                    # Load LongMemEval conversations into AgentDB
├── evaluate.ts                  # Run question answering + score
├── report.ts                    # Generate comparison report
├── results/                     # Published results (git-tracked)
│   └── .gitkeep
└── scripts/
    ├── download-dataset.sh      # Fetch from HuggingFace
    └── run-benchmark.sh         # End-to-end benchmark execution
```

Benchmark Modes

| Mode | Description | API Cost |
|---|---|---|
| Raw | AgentDB HNSW search only, no LLM | $0 |
| Hybrid | HNSW retrieval + Haiku reranking | ~$0.05 |
| Full | HNSW + controller routing + Haiku | ~$0.10 |
| Baseline | Plain cosine similarity (no HNSW) | $0 |
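
The four modes map naturally onto a small configuration switch in harness.ts. The flags below mirror the table; the names and structure are assumptions for illustration.

```typescript
// Sketch of a mode switch for harness.ts; flag names are illustrative.
type BenchmarkMode = 'raw' | 'hybrid' | 'full' | 'baseline';

interface ModeConfig {
  useHnsw: boolean;         // false only in baseline (plain cosine similarity)
  useControllers: boolean;  // controller routing, full mode only
  rerankWithHaiku: boolean; // the only flag that incurs API cost
}

const MODES: Record<BenchmarkMode, ModeConfig> = {
  raw:      { useHnsw: true,  useControllers: false, rerankWithHaiku: false },
  hybrid:   { useHnsw: true,  useControllers: false, rerankWithHaiku: true  },
  full:     { useHnsw: true,  useControllers: true,  rerankWithHaiku: true  },
  baseline: { useHnsw: false, useControllers: false, rerankWithHaiku: false },
};
```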

Implementation Plan

Phase 1: Harness Setup (Week 1)

  1. Download LongMemEval dataset from HuggingFace
  2. Build conversation ingestion pipeline (load sessions into AgentDB)
  3. Implement question-answering interface using AgentDB retrieval
  4. Wire up official evaluation script (evaluate_qa.py) for scoring
  5. Create baseline adapter (plain vector search) for comparison
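
The core of steps 3 and 4 is a loop that answers each question from retrieved context and writes a hypotheses file for the official scorer. This is only a sketch: the adapter and answering function are placeholders, and the exact output format evaluate_qa.py expects should be checked against the script itself.

```typescript
import { writeFile } from 'node:fs/promises';

// Placeholder Phase 1 loop for evaluate.ts. The adapter and answerFromContext
// are stand-ins; verify the output format expected by evaluate_qa.py.
async function runQuestions(
  adapter: { retrieve(q: string, k: number): Promise<{ text: string }[]> },
  questions: { question_id: string; question: string }[],
  answerFromContext: (question: string, context: string[]) => Promise<string>,
): Promise<void> {
  const hypotheses: { question_id: string; hypothesis: string }[] = [];
  for (const q of questions) {
    const chunks = await adapter.retrieve(q.question, 10);
    const answer = await answerFromContext(q.question, chunks.map((c) => c.text));
    hypotheses.push({ question_id: q.question_id, hypothesis: answer });
  }
  // One JSON object per line, ready to hand off to the official scorer.
  await writeFile('results/hypotheses.jsonl', hypotheses.map((h) => JSON.stringify(h)).join('\n'));
}
```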

Phase 2: AgentDB Optimization (Week 2)

  1. Test with existing HNSW index configuration
  2. Tune retrieval parameters:
    • efSearch (accuracy vs speed tradeoff)
    • M (graph connectivity)
    • Top-k retrieval count
    • Similarity threshold
  3. Test controller-based routing for multi-hop questions
  4. Test temporal metadata for time-based questions
  5. Test knowledge update detection via version tracking
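
The tuning in step 2 can be expressed as a small parameter grid swept by the harness. The option names and value ranges below are illustrative, not AgentDB's actual index configuration.

```typescript
// Illustrative parameter grid for the Phase 2 sweep; option names and ranges
// are assumptions, not AgentDB's actual index options.
interface RetrievalParams {
  efSearch: number;       // accuracy vs. speed trade-off at query time
  M: number;              // graph connectivity, fixed at index build time
  topK: number;           // candidates passed to answering / reranking
  minSimilarity: number;  // drop weak matches below this score
}

const sweep: RetrievalParams[] = [];
for (const efSearch of [64, 128, 256]) {
  for (const M of [16, 32]) {
    for (const topK of [5, 10, 20]) {
      sweep.push({ efSearch, M, topK, minSimilarity: 0.3 });
    }
  }
}
```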

Phase 3: Comparative Evaluation (Week 3)

  1. Run all 4 modes (raw, hybrid, full, baseline)
  2. Break down scores by question type (6 categories)
  3. Compare against published results:
    • MemPalace (96.6% raw, 100% hybrid)
    • OMEGA (95.4%)
    • Observational Memory (94.87%)
  4. Measure latency per query (p50, p95, p99)
  5. Measure memory usage and storage size
  6. Generate public report with full methodology
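
For item 4, the latency percentiles can be computed with a simple nearest-rank helper over the per-query timings collected during a run:

```typescript
// Nearest-rank percentile over per-query latencies collected during a run.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

// e.g. percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 99)
```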

Phase 4: Publication (Week 3)

  1. Commit results to results/ directory
  2. Create GitHub issue with findings
  3. Update CLAUDE.md and README with verified scores
  4. If score >= 95%, create dedicated benchmark page

Key Metrics to Report

| Metric | Description |
|---|---|
| Overall accuracy | % of 500 questions correct |
| Per-type accuracy | Breakdown by 6 question types |
| Raw mode score | Zero-API, local-only score |
| Hybrid mode score | With Haiku reranking |
| Latency p50/p95/p99 | Query response time |
| Memory footprint | RAM usage during evaluation |
| Storage size | Disk usage for ingested conversations |
| Ingestion time | Time to load all conversations |
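
One possible schema for a published result file in the git-tracked results/ directory, mirroring the metrics above; the field names are illustrative.

```typescript
// Possible schema for a result file in results/; names are illustrative and
// the fields mirror the metrics table above.
interface BenchmarkResult {
  mode: 'raw' | 'hybrid' | 'full' | 'baseline';
  overallAccuracy: number;                  // fraction of the 500 questions answered correctly
  perTypeAccuracy: Record<string, number>;  // keyed by question type
  latencyMs: { p50: number; p95: number; p99: number };
  memoryFootprintMb: number;
  storageSizeMb: number;
  ingestionTimeSec: number;
  datasetVersion: string;                   // supports the fair-comparison requirement
  harnessCommit: string;                    // supports full reproducibility
}
```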

Honesty Protocol

Following the honesty audit standards from v3.5.71+:

  1. No tuning on test set — Report held-out scores; if any questions are used for debugging, disclose it explicitly
  2. Report all modes — Don't cherry-pick the best number; show raw, hybrid, and baseline
  3. Per-category breakdown — Don't hide weak categories behind a strong aggregate
  4. Reproducible — Anyone can clone the repo, run the script, and get the same numbers
  5. Disclose failures — If AgentDB scores lower than MemPalace on any category, report it prominently
  6. Compare fairly — Use the same evaluation script and dataset version as other systems

Success Criteria

| Target | Score | Priority |
|---|---|---|
| Raw mode (zero API) | >= 90% | Must-have |
| Hybrid mode (Haiku) | >= 96% | Target |
| Competitive with MemPalace raw | >= 96.6% | Stretch |
| Beat MemPalace raw | > 96.6% | Aspirational |
| Latency p95 | < 200ms | Must-have |
| Full reproducibility | 100% | Must-have |

Expected AgentDB Advantages

  1. HNSW indexing — Approximate nearest neighbor search should outperform ChromaDB's brute-force on larger datasets
  2. Controller routing — 19 specialized controllers can route multi-hop questions to the right retrieval strategy
  3. Temporal metadata — AgentDB stores timestamps natively, which should help temporal reasoning questions
  4. Version tracking — Knowledge update questions should benefit from AgentDB's entry versioning
  5. Semantic routing — agentdb_semantic-route can classify question type and apply type-specific retrieval
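
A rough sketch of what the type-specific retrieval in item 5 could look like; the string-matching classifier below is a placeholder standing in for agentdb_semantic-route, whose real interface is not shown here.

```typescript
// Placeholder routing: classify the question type, then pick a retrieval
// strategy. The string matching stands in for agentdb_semantic-route.
type Strategy = 'single-pass' | 'multi-hop' | 'temporal' | 'latest-version';

function chooseStrategy(questionType: string): Strategy {
  if (questionType.includes('multi-hop') || questionType.includes('multi-session')) return 'multi-hop';
  if (questionType.includes('temporal')) return 'temporal';
  if (questionType.includes('knowledge-update')) return 'latest-version';
  return 'single-pass';
}
```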

Expected AgentDB Disadvantages

  1. No verbatim storage — AgentDB uses embeddings, not raw text storage; may lose detail on exact-match questions
  2. No conversation structure — MemPalace's palace metaphor (wings/halls/rooms) provides hierarchical scoping that AgentDB lacks
  3. Embedding model size — all-MiniLM-L6-v2 (384-dim) is smaller than some competitors' models

Consequences

Positive

  • First published LongMemEval score for AgentDB — fills a credibility gap
  • Identifies specific areas where AgentDB's retrieval can be improved
  • Provides a reproducible benchmark for regression testing
  • Positions Ruflo in the growing "AI memory leaderboard" conversation

Negative

  • If AgentDB scores significantly below 90%, it's a public admission of weakness
  • Benchmark optimization could distract from feature development
  • LongMemEval is a synthetic benchmark — real-world performance may differ

Risks

  • LongMemEval is a conversational memory benchmark; AgentDB is designed for agent orchestration memory — the benchmark may not test AgentDB's actual strengths
  • Over-optimizing for a benchmark can lead to benchmark gaming (Goodhart's Law)

References