# ADR-088: LongMemEval Benchmark

`v3/docs/adr/ADR-088-longmemeval-benchmark.md`
Status: Accepted
Date: 2026-04-08
Author: ruflo team
Relates to: ADR-076 (Memory Bridge), ADR-077 (DiskANN), ADR-075 (Learning Pipeline)
MemPalace, a new open-source AI memory system, reported a 96.6% raw score and 100% hybrid score on LongMemEval (ICLR 2025) — a benchmark of 500 questions testing long-term conversational memory across 6 question types. This prompted the question: how does Ruflo's AgentDB memory system compare?
| System | Score | Mode | API Required |
|---|---|---|---|
| MemPalace | 100% (500/500) | Hybrid (Haiku reranking) | Yes (Haiku) |
| MemPalace | 96.6% | Raw (local only) | No |
| OMEGA | 95.4% | Cloud | Yes |
| Observational Memory | 94.87% | gpt-5-mini | Yes |
| Supermemory | ~93% | gpt-4o | Yes |
| GPT-4o (long context) | 30-70% | Baseline | Yes |
| AgentDB | Unknown | — | — |
The benchmark evaluates 5 core memory abilities across 500 questions, grouped into 6 question types:

- single-session (1-hop)
- multi-session (1-hop)
- single-session (multi-hop)
- multi-session (multi-hop)
- knowledge update
- temporal reasoning
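For illustration, the six types might be modeled in the harness as a TypeScript union; the identifiers here are assumptions, and the official dataset's own `question_type` labels may differ:

```typescript
// Illustrative labels for the six question types listed above; the official
// dataset's question_type identifiers may differ.
type QuestionType =
  | 'single-session-1hop'
  | 'multi-session-1hop'
  | 'single-session-multihop'
  | 'multi-session-multihop'
  | 'knowledge-update'
  | 'temporal-reasoning';

// Hypothetical question record shape used by the harness sketches below.
interface BenchmarkQuestion {
  questionId: string;
  questionType: QuestionType;
  question: string;
  answer: string; // gold answer consumed by the official scorer
}
```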
The official dataset ships in three variants, with an official evaluation script:

- `longmemeval_oracle.json`, `longmemeval_s_cleaned.json`, `longmemeval_m_cleaned.json`
- `src/evaluation/evaluate_qa.py` (official scoring script)

Decision: implement a full LongMemEval benchmark harness for AgentDB and publish results transparently, including per-category breakdowns and comparison with other systems.
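A minimal loader sketch for one variant, assuming each file is a JSON array of question records; the exact schema should be confirmed against the official repository:

```typescript
import { readFile } from 'node:fs/promises';

// Minimal loader sketch; assumes each dataset variant is a JSON array of
// question records (schema unverified here; check the official repository).
async function loadDataset(path: string): Promise<unknown[]> {
  const raw = await readFile(path, 'utf8');
  const records: unknown = JSON.parse(raw);
  if (!Array.isArray(records)) {
    throw new Error(`expected a JSON array in ${path}`);
  }
  return records;
}

// Usage: const questions = await loadDataset('data/longmemeval_s_cleaned.json');
```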
Proposed harness layout:

```
v3/@claude-flow/memory/benchmarks/longmemeval/
├── README.md                    # Setup & reproduction instructions
├── harness.ts                   # Main benchmark runner
├── adapters/
│   ├── agentdb-adapter.ts       # AgentDB memory backend
│   ├── agentdb-hnsw-adapter.ts  # AgentDB + HNSW mode
│   └── baseline-adapter.ts      # Plain vector search baseline
├── ingest.ts                    # Load LongMemEval conversations into AgentDB
├── evaluate.ts                  # Run question answering + score
├── report.ts                    # Generate comparison report
├── results/                     # Published results (git-tracked)
│   └── .gitkeep
└── scripts/
    ├── download-dataset.sh      # Fetch from HuggingFace
    └── run-benchmark.sh         # End-to-end benchmark execution
```
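The three adapters could share one retrieval contract so the harness can swap backends. A hypothetical sketch; method names and shapes are assumptions, not AgentDB's actual API:

```typescript
// Hypothetical contract shared by agentdb-adapter.ts, agentdb-hnsw-adapter.ts,
// and baseline-adapter.ts. Method names and shapes are illustrative; the real
// AgentDB API may differ.
interface MemoryAdapter {
  /** Ingest one conversation session into the backing store. */
  ingestSession(
    sessionId: string,
    turns: { role: 'user' | 'assistant'; content: string }[],
  ): Promise<void>;

  /** Return the top-k memory snippets relevant to a question. */
  search(query: string, k: number): Promise<{ text: string; score: number }[]>;

  /** Resource usage, feeding the metrics reported further below. */
  stats(): Promise<{ storageBytes: number; vectorCount: number }>;
}
```

The harness would run each adapter under one of four evaluation modes: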
| Mode | Description | API Cost |
|---|---|---|
| Raw | AgentDB HNSW search only, no LLM | $0 |
| Hybrid | HNSW retrieval + Haiku reranking | ~$0.05 |
| Full | HNSW + controller routing + Haiku | ~$0.10 |
| Baseline | Plain cosine similarity (no HNSW) | $0 |
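One way to encode these modes is a discriminated union, so reranking configuration only exists where a mode declares it; the model identifier and field names are illustrative assumptions:

```typescript
// Sketch of the four evaluation modes from the table above. The rerank model
// identifier and field names are assumptions for illustration.
type EvalMode =
  | { kind: 'raw' }                          // HNSW search only, no LLM
  | { kind: 'hybrid'; rerankModel: 'haiku' } // HNSW + Haiku reranking
  | { kind: 'full'; rerankModel: 'haiku'; controllerRouting: true }
  | { kind: 'baseline' };                    // plain cosine similarity

// Approximate per-run cost, mirroring the cost column above.
function estimatedCostUsd(mode: EvalMode): number {
  switch (mode.kind) {
    case 'raw':
    case 'baseline':
      return 0;
    case 'hybrid':
      return 0.05;
    case 'full':
      return 0.1;
  }
}
```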
Methodology:

- Use the official script (`evaluate_qa.py`) for scoring
- Tune HNSW `efSearch` (accuracy vs speed tradeoff) and `M` (graph connectivity); a tuning sketch follows the metrics table below
- Publish all runs to the `results/` directory

| Metric | Description |
|---|---|
| Overall accuracy | % of 500 questions correct |
| Per-type accuracy | Breakdown by 6 question types |
| Raw mode score | Zero-API, local-only score |
| Hybrid mode score | With Haiku reranking |
| Latency p50/p95/p99 | Query response time |
| Memory footprint | RAM usage during evaluation |
| Storage size | Disk usage for ingested conversations |
| Ingestion time | Time to load all conversations |
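The `efSearch`/`M` tuning mentioned above could be driven by a small parameter grid, recording accuracy and latency per setting. The ranges below are assumptions, not validated values:

```typescript
// Illustrative HNSW tuning grid; ranges are assumptions, not validated values.
// efSearch trades recall for query latency; M controls graph connectivity
// (denser graphs improve recall at the cost of memory and build time).
interface HnswParams {
  M: number;
  efSearch: number;
}

const grid: HnswParams[] = [16, 32, 64].flatMap((M) =>
  [64, 128, 256].map((efSearch) => ({ M, efSearch })),
);

// For each setting, the harness would rebuild or re-query the index and
// record overall accuracy plus p50/p95/p99 latency into results/.
```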
Following the honesty audit standards from v3.5.71+, the benchmark sets explicit targets by priority:
| Target | Score | Priority |
|---|---|---|
| Raw mode (zero API) | >= 90% | Must-have |
| Hybrid mode (Haiku) | >= 96% | Target |
| Competitive with MemPalace raw | >= 96.6% | Stretch |
| Beat MemPalace raw | > 96.6% | Aspirational |
| Latency p95 | < 200ms | Must-have |
| Full reproducibility | 100% | Must-have |
Future direction: `agentdb_semantic-route` can classify the question type and apply type-specific retrieval.
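A sketch of what that routing could look like, reusing the `QuestionType` union from earlier; the classifier output and strategy names are hypothetical, not AgentDB's actual API:

```typescript
// Hypothetical routing table: classify the question, then pick a retrieval
// strategy per type. Reuses the QuestionType union sketched earlier; none of
// this reflects AgentDB's actual API.
type RetrievalStrategy = 'topk' | 'multihop-expand' | 'recency-weighted';

const routing: Record<QuestionType, RetrievalStrategy> = {
  'single-session-1hop': 'topk',
  'multi-session-1hop': 'topk',
  'single-session-multihop': 'multihop-expand',
  'multi-session-multihop': 'multihop-expand',
  'knowledge-update': 'recency-weighted',   // prefer the latest assertion
  'temporal-reasoning': 'recency-weighted', // order candidates by timestamp
};
```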