# LoCoMo Benchmark — Supermemory Evaluation

Evaluate Supermemory on the LoCoMo benchmark using OpenClaw as the agent (same approach as the mem0 eval).

## Overview

Two-phase pipeline:

1. Ingest — Import LoCoMo conversations into Supermemory (one `containerTag` per sample)
2. Eval — Send QA questions to the OpenClaw agent (which recalls from Supermemory internally), then judge answers with an LLM

Before each sample, `eval.py` automatically:

1. Updates `~/.openclaw/openclaw.json` to set `openclaw-supermemory.config.containerTag = sanitize(sample_id)`
2. Switches `plugins.slots.memory` to `"openclaw-supermemory"`
3. Restarts the OpenClaw gateway to pick up the new config
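
A minimal sketch of that per-sample rewrite, assuming the dotted key paths map to nested objects in plain JSON (the gateway restart is left out, since OpenClaw's restart mechanism isn't described here):

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"

def switch_to_supermemory(sample_id: str) -> None:
    """Point the openclaw-supermemory plugin at this sample's container."""
    config = json.loads(CONFIG_PATH.read_text())

    # 1. Set containerTag to the sanitized sample id (see sanitize() below).
    plugin_cfg = config.setdefault("openclaw-supermemory", {}).setdefault("config", {})
    plugin_cfg["containerTag"] = sanitize(sample_id)

    # 2. Make openclaw-supermemory the active memory plugin.
    config.setdefault("plugins", {}).setdefault("slots", {})["memory"] = "openclaw-supermemory"

    CONFIG_PATH.write_text(json.dumps(config, indent=2))
```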

Tag sanitization: `conv-26` → `conv_26` (matches openclaw-supermemory's internal sanitizeTag logic). Both `ingest.py` and `eval.py` apply the same transformation automatically.
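
The plugin's exact sanitizeTag rules aren't reproduced here; a sketch consistent with the `conv-26` → `conv_26` example, assuming every non-alphanumeric character becomes an underscore:

```python
import re

def sanitize(sample_id: str) -> str:
    """Mirror openclaw-supermemory's sanitizeTag: conv-26 -> conv_26 (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]", "_", sample_id)
```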

## Prerequisites

- OpenClaw installed and configured
- openclaw-supermemory plugin installed (`~/.openclaw/extensions/openclaw-supermemory`)
- `~/.openclaw/openclaw.json` with `openclaw-supermemory.config.apiKey` set
- API keys in `~/.openviking_benchmark_env`:

  ```env
  SUPERMEMORY_API_KEY=sm-...
  ARK_API_KEY=...         # Volcengine ARK, used for judge LLM
  ```

- Python dependencies:

  ```bash
  uv sync --frozen --extra dev
  pip install supermemory openai python-dotenv
  ```
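
Scripts can load these keys with python-dotenv; a small sketch, assuming the env file sits in the home directory and uses plain KEY=VALUE lines:

```python
import os
from pathlib import Path

from dotenv import load_dotenv

# Pull SUPERMEMORY_API_KEY / ARK_API_KEY into the process environment.
load_dotenv(Path.home() / ".openviking_benchmark_env")

supermemory_key = os.environ["SUPERMEMORY_API_KEY"]
ark_key = os.environ["ARK_API_KEY"]  # consumed by the judge LLM client
```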

## Data

LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`:

- 10 samples (conversations between two people)
- 1986 QA pairs across 5 categories:
  - 1: single-hop
  - 2: multi-hop
  - 3: temporal
  - 4: world-knowledge
  - 5: adversarial (skipped by default)
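
A quick way to sanity-check those counts, assuming each sample carries a `qa` list whose items have a numeric `category` field (field names inferred from the outputs in this README, not verified against the file):

```python
import json
from collections import Counter

with open("benchmark/locomo/data/locomo10.json") as f:
    samples = json.load(f)

print(len(samples), "samples")  # expect 10

# Tally QA pairs per category across all samples.
counts = Counter(qa["category"] for sample in samples for qa in sample["qa"])
print(sum(counts.values()), "QA pairs:", dict(sorted(counts.items())))
```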

## Step 1 — Ingest

Import conversations into Supermemory. Each sample is stored under `containerTag = sample_id` (e.g. `conv-26`).

Sessions are formatted as date-prefixed JSON strings, matching the memorybench supermemory provider convention. Indexing is polled until both document and memory reach `done` status.
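
A sketch of that poll loop, assuming a hypothetical `get_indexing_status(doc_id)` helper that returns the document and memory statuses (the real status call depends on the Supermemory SDK and is not shown):

```python
import time

def wait_for_indexing(doc_id: str, timeout: float = 300.0, interval: float = 5.0) -> None:
    """Block until both the document and its memory report 'done'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        doc_status, memory_status = get_indexing_status(doc_id)  # hypothetical helper
        if doc_status == "done" and memory_status == "done":
            return
        time.sleep(interval)
    raise TimeoutError(f"indexing did not finish for {doc_id}")
```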

```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Ingest specific sessions only
python ingest.py --sample conv-26 --sessions 1-4

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```

Key options:

| Option | Description |
| --- | --- |
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |
| `--no-wait-indexing` | Skip indexing poll (faster, no status check) |

Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
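
The record file's schema isn't documented here; a sketch of the skip logic, assuming it stores a simple mapping from ingested sample keys to document IDs:

```python
import json
from pathlib import Path

RECORD_PATH = Path("result/.ingest_record.json")

def already_ingested(key: str) -> bool:
    """Skip samples recorded by a previous run (record schema assumed)."""
    return RECORD_PATH.exists() and key in json.loads(RECORD_PATH.read_text())

def mark_ingested(key: str, doc_id: str) -> None:
    record = json.loads(RECORD_PATH.read_text()) if RECORD_PATH.exists() else {}
    record[key] = doc_id
    RECORD_PATH.parent.mkdir(parents=True, exist_ok=True)
    RECORD_PATH.write_text(json.dumps(record, indent=2))
```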

## Step 2 — Eval

Send QA questions to the OpenClaw agent and optionally judge answers.

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```

Key options:

| Option | Description |
| --- | --- |
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA, only grade ungraded rows in existing CSV |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |

Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
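
A sketch of that cleanup step, assuming failed rows are identified by a response beginning with `[ERROR]` (the marker's exact placement is an assumption):

```python
import csv

def drop_error_rows(csv_path: str) -> None:
    """Remove failed rows so the next run re-asks those questions."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [r for r in reader if not r["response"].startswith("[ERROR]")]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```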

## Output

`result/qa_results.csv` columns:

| Column | Description |
| --- | --- |
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
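
The `result` and `reasoning` columns come from the judge LLM. A sketch of one judge call, assuming the ARK endpoint is OpenAI-compatible (the base URL and prompt wording here are illustrative, not the script's actual values):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ARK_API_KEY"],
    base_url="https://ark.cn-beijing.volces.com/api/v3",  # assumed ARK endpoint
)

def judge(question: str, gold: str, response: str) -> str:
    """Ask the judge model for a CORRECT/WRONG verdict plus reasoning."""
    completion = client.chat.completions.create(
        model="doubao-seed-2-0-pro-260215",
        messages=[{
            "role": "user",
            "content": (
                "Grade the response against the gold answer. "
                "Reply CORRECT or WRONG, then one line of reasoning.\n"
                f"Question: {question}\nGold: {gold}\nResponse: {response}"
            ),
        }],
    )
    return completion.choices[0].message.content
```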

## Summary Output

After eval completes:

```
=== Token & Latency Summary ===
  Total input tokens : 123456
  Avg time per query : 18.3s

=== Accuracy Summary ===
  Overall: 512/1540 = 33.25%
  By category:
    multi-hop           : 120/321 = 37.38%
    single-hop          : 98/282 = 34.75%
    temporal            : 28/96  = 29.17%
    world-knowledge     : 266/841 = 31.63%
```
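
The accuracy breakdown can be recomputed from the CSV directly; a sketch, assuming the column names from the table above:

```python
import csv
from collections import defaultdict

correct: dict[str, int] = defaultdict(int)
total: dict[str, int] = defaultdict(int)

with open("result/qa_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if not row["result"]:
            continue  # skip rows the judge hasn't graded yet
        total[row["category_name"]] += 1
        correct[row["category_name"]] += row["result"] == "CORRECT"

for name in sorted(total):
    print(f"{name:20s}: {correct[name]}/{total[name]} = {correct[name] / total[name]:.2%}")
```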

## Delete Supermemory Data

```bash
# Delete a specific sample's documents
python delete_container.py conv-26

# Delete all samples from the dataset
python delete_container.py --from-data

# Delete first N samples
python delete_container.py --from-data --limit 3
```

Note: `delete_container.py` uses `documents.list(containerTags=[tag])` + `documents.deleteBulk(ids=[...])` in batches of 100, and also clears the corresponding ingest records from `result/.ingest_record.json`.
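
A sketch of that flow, reusing the call names from the note above (method names and the response shape are copied from the note, not verified against the installed SDK):

```python
import os

from supermemory import Supermemory

client = Supermemory(api_key=os.environ["SUPERMEMORY_API_KEY"])

def delete_container(tag: str) -> None:
    """List every document under a containerTag, then delete in batches of 100."""
    docs = client.documents.list(containerTags=[tag])
    ids = [doc.id for doc in docs]  # response shape assumed
    for i in range(0, len(ids), 100):
        client.documents.deleteBulk(ids=ids[i : i + 100])
```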