benchmark/locomo/supermemory/README.md
Evaluate Supermemory on the LoCoMo benchmark using OpenClaw as the agent (same approach as the mem0 eval).
Two-phase pipeline:

1. Ingest (`ingest.py`): import conversations into Supermemory (one containerTag per sample)
2. Eval (`eval.py`): send QA questions to the OpenClaw agent and judge the answers

Before each sample, `eval.py` automatically edits `~/.openclaw/openclaw.json` to set `openclaw-supermemory.config.containerTag = sanitize(sample_id)` and `plugins.slots.memory` to `"openclaw-supermemory"`.

Tag sanitization: `conv-26` → `conv_26` (matches openclaw-supermemory's internal `sanitizeTag` logic). Both `ingest.py` and `eval.py` apply the same transformation automatically.
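The sanitization rule can be sketched as follows. This is a minimal approximation for illustration; the authoritative `sanitizeTag` logic lives inside openclaw-supermemory:

```python
import re

def sanitize_tag(sample_id: str) -> str:
    # Replace any character that is not alphanumeric or underscore with "_",
    # reproducing the conv-26 -> conv_26 example above.
    return re.sub(r"[^A-Za-z0-9_]", "_", sample_id)

print(sanitize_tag("conv-26"))  # conv_26
```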
Prerequisites:

- openclaw-supermemory plugin installed at `~/.openclaw/extensions/openclaw-supermemory`
- `~/.openclaw/openclaw.json` with `openclaw-supermemory.config.apiKey` set
- `~/.openviking_benchmark_env` containing:

  ```
  SUPERMEMORY_API_KEY=sm-...
  ARK_API_KEY=...   # Volcengine ARK, used for the judge LLM
  ```

- Python dependencies:

  ```bash
  uv sync --frozen --extra dev
  pip install supermemory openai python-dotenv
  ```
- LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`
Ingest (`ingest.py`): import conversations into Supermemory. Each sample is stored under `containerTag = sample_id` (e.g. `conv-26`). Sessions are formatted as date-prefixed JSON strings, matching the memorybench supermemory provider convention. Indexing is polled until both the document and its memory reach `done` status.
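The date-prefixed formatting described above can be sketched roughly as below. The exact schema used by the memorybench supermemory provider is an assumption here and may differ in detail:

```python
import json

def format_session(date: str, turns: list) -> str:
    # Hypothetical sketch: one document per session, with the session date
    # on the first line and the turns serialized as JSON below it.
    return f"{date}\n{json.dumps(turns, ensure_ascii=False)}"

doc = format_session("2023-05-20", [{"speaker": "Caroline", "text": "Hi!"}])
```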
```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Ingest specific sessions only
python ingest.py --sample conv-26 --sessions 1-4

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```
Key options:
| Option | Description |
|---|---|
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |
| `--no-wait-indexing` | Skip the indexing poll (faster, but no status check) |
Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
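The duplicate-ingestion guard can be sketched as below. The record's exact JSON shape is an assumption for illustration; only the file path and the skip/force behavior come from the text above:

```python
import json
import os

RECORD_PATH = "result/.ingest_record.json"  # file written by ingest.py

def load_record(path: str = RECORD_PATH) -> dict:
    # A missing file means nothing has been ingested yet.
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def already_ingested(record: dict, sample_id: str, force: bool = False) -> bool:
    # --force-ingest bypasses the record check.
    return not force and sample_id in record

record = {"conv-26": {"sessions": "all"}}  # assumed record shape, for illustration
```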
Eval (`eval.py`): send QA questions to the OpenClaw agent and optionally judge the answers.
```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```
Key options:
| Option | Description |
|---|---|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA; only grade ungraded rows in the existing CSV |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |
Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
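The retry behavior amounts to filtering out `[ERROR]` rows before a run, so those questions get asked again. A minimal sketch (the row dicts mirror the CSV columns; the helper name is hypothetical):

```python
def drop_error_rows(rows: list) -> list:
    # Keep rows whose response is not an [ERROR] placeholder, so the
    # corresponding questions are re-asked on the next run.
    return [r for r in rows if not r.get("response", "").startswith("[ERROR]")]

rows = [
    {"question_id": "conv-26_qa0", "response": "Paris"},
    {"question_id": "conv-26_qa1", "response": "[ERROR] request timed out"},
]
kept = drop_error_rows(rows)
```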
`result/qa_results.csv` columns:
| Column | Description |
|---|---|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
After eval completes, a summary is printed:

```text
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
  multi-hop       : 120/321 = 37.38%
  single-hop      :  98/282 = 34.75%
  temporal        :  28/96  = 29.17%
  world-knowledge : 266/841 = 31.63%
```
Cleanup (`delete_container.py`):

```bash
# Delete a specific sample's documents
python delete_container.py conv-26

# Delete all samples from the dataset
python delete_container.py --from-data

# Delete first N samples
python delete_container.py --from-data --limit 3
```
Note: `delete_container.py` uses `documents.list(containerTags=[tag])` + `documents.deleteBulk(ids=[...])` in batches of 100, and also clears the corresponding ingest records from `result/.ingest_record.json`.
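The batching itself is a simple chunking loop. A sketch, with the client calls from the note shown only as comments (their exact signatures are as quoted above and are not verified here):

```python
def chunked(ids: list, size: int = 100):
    # Yield successive batches of at most `size` ids (the note above uses 100).
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Hypothetical usage with a supermemory client, using the call names from the note:
#   docs = client.documents.list(containerTags=[tag])
#   for batch in chunked([d.id for d in docs]):
#       client.documents.deleteBulk(ids=batch)
batches = list(chunked(list(range(250))))
```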