benchmark/locomo/mem0/README.md
Evaluate mem0 memory retrieval on the LoCoMo benchmark using OpenClaw as the agent.
Two-phase pipeline:

1. **Ingest**: import LoCoMo conversations into mem0 (one `user_id` per sample)
2. **Eval**: send QA questions to the OpenClaw agent and judge the answers

Prerequisites:

- `openclaw-mem0` plugin installed (`~/.openclaw/extensions/openclaw-mem0`)
- `~/.openclaw/openclaw.json` with `plugins.slots.memory = "openclaw-mem0"`
- `~/.openviking_benchmark_env` containing:

```
MEM0_API_KEY=m0-...
ARK_API_KEY=...   # Volcengine ARK, used for the judge LLM
```

Install dependencies:

```bash
uv sync --frozen --extra dev
```
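Both scripts need these keys at runtime. If you want the same values in an ad-hoc script of your own, the file is plain `KEY=VALUE` lines, so a loader takes a few lines of Python. This is a sketch; the exact parsing the benchmark scripts use may differ:

```python
import os
from pathlib import Path

def load_benchmark_env(path: Path = Path.home() / ".openviking_benchmark_env") -> None:
    """Export KEY=VALUE lines from the benchmark env file into os.environ."""
    for line in path.read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip())

load_benchmark_env()
```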
The scripts read the LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`.
## Phase 1: Ingest (`ingest.py`)

Import conversations into mem0. Each sample is stored under `user_id = sample_id` (e.g. `conv-26`).
```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```
Key options:
| Option | Description |
|---|---|
| `--sample` | Sample ID (e.g. `conv-26`) or 0-based index. Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |
Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
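At its core, per-sample ingestion maps each conversation turn to mem0 messages stored under the sample's `user_id`. Below is a minimal sketch using the hosted mem0 SDK; the real `ingest.py` also handles session ranges and the ingest record, and the dataset field names here (`sample_id`, `sessions`, `speaker`, `text`) are assumptions:

```python
import json
from mem0 import MemoryClient

client = MemoryClient()  # reads MEM0_API_KEY from the environment

def ingest_sample(sample: dict) -> None:
    """Store one LoCoMo sample's turns under user_id = sample_id."""
    user_id = sample["sample_id"]           # e.g. "conv-26" (field name assumed)
    for session in sample["sessions"]:      # session layout assumed
        messages = [
            {"role": "user", "content": f"{turn['speaker']}: {turn['text']}"}
            for turn in session
        ]
        client.add(messages, user_id=user_id)

with open("benchmark/locomo/data/locomo10.json") as f:
    for sample in json.load(f):
        ingest_sample(sample)
```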
## Phase 2: Eval (`eval.py`)

Send QA questions to the OpenClaw agent and optionally judge the answers.
Before each sample, `eval.py` automatically:

- updates `~/.openclaw/openclaw.json` to set `openclaw-mem0.config.userId = sample_id` (a sketch of this swap follows the commands below)
- verifies the new `userId` is active via a dummy request

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```
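The config swap itself is a small JSON edit. A rough sketch of the idea; the paths come from this README, but the key nesting inside `openclaw.json` is an assumption:

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"

def set_active_user(sample_id: str) -> None:
    """Point the openclaw-mem0 plugin at one benchmark sample."""
    config = json.loads(CONFIG_PATH.read_text())
    # Nesting assumed; eval.py knows the real structure.
    plugin = config.setdefault("openclaw-mem0", {})
    plugin.setdefault("config", {})["userId"] = sample_id
    CONFIG_PATH.write_text(json.dumps(config, indent=2))

set_active_user("conv-26")
```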
Key options:
| Option | Description |
|---|---|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA; only grade ungraded rows in the existing CSV |
| `--no-skip-adversarial` | Include category-5 adversarial questions |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |
Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.

`result/qa_results.csv` columns:
| Column | Description |
|---|---|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage (all turns summed) |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
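The retry behavior mentioned above is easy to reproduce by hand if you ever need to prune failures yourself. A sketch using only the documented columns; that the `[ERROR]` marker lives in the `response` field is an assumption:

```python
import csv
from pathlib import Path

CSV_PATH = Path("result/qa_results.csv")

def drop_failed_rows() -> None:
    """Delete [ERROR] rows so the next run retries those questions."""
    with CSV_PATH.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = [r for r in reader if "[ERROR]" not in r["response"]]
    with CSV_PATH.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

drop_failed_rows()
```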
After eval completes:

```
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
multi-hop       : 120/321 = 37.38%
single-hop      :  98/282 = 34.75%
temporal        :  28/96  = 29.17%
world-knowledge : 266/841 = 31.63%
```
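The same numbers can be recomputed offline from the CSV using only the `result` and `category_name` columns documented above:

```python
import csv
from collections import Counter

graded, correct = Counter(), Counter()
with open("result/qa_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["result"] not in ("CORRECT", "WRONG"):
            continue  # skip ungraded rows
        graded[row["category_name"]] += 1
        if row["result"] == "CORRECT":
            correct[row["category_name"]] += 1

total, right = sum(graded.values()), sum(correct.values())
print(f"Overall: {right}/{total} = {right / total:.2%}")
for cat in sorted(graded):
    print(f"{cat:16}: {correct[cat]}/{graded[cat]} = {correct[cat] / graded[cat]:.2%}")
```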
## Cleanup (`delete_user.py`)

```bash
# Delete a specific sample
python delete_user.py conv-26

# Delete all samples from the dataset
python delete_user.py --from-data

# Delete first N samples
python delete_user.py --from-data --limit 3
```
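Per-user deletion maps directly onto the mem0 SDK; roughly, and again assuming the hosted `MemoryClient`:

```python
from mem0 import MemoryClient

client = MemoryClient()  # reads MEM0_API_KEY from the environment
# Remove every memory stored under one benchmark sample's user_id
client.delete_all(user_id="conv-26")
```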