# benchmark/locomo/claudecode
Four reproducible evaluation paths over the LoCoMo-10 dataset, each exercising a different way of giving Claude Code long-term memory.
## Modes

| Mode | Ingest path | OV namespace | Entry point |
|---|---|---|---|
| Prompted | `claude -p` per session, CC writes `MEMORY.md` | — (no OpenViking) | `run_prompted.sh` |
| SDK iso | OpenViking Python SDK direct import | per-sample `agent_id` | `run_sdk_iso.sh` |
| SDK no-iso | OpenViking Python SDK direct import | shared (default) | `run_sdk_noiso.sh` |
| e2e | `claude -p` stream-json multi-turn, auto-capture into OV | shared | `run_e2e.sh` |
All four use the same QA / judge / stats pipeline (`eval.py` → `judge.py` → `stat_judge_result.py`), so accuracy numbers are directly comparable.
## Prerequisites

1. Drop the LoCoMo-10 dataset `locomo10.json` into `.tmp/`:

   ```
   benchmark/locomo/claudecode/.tmp/locomo10.json
   ```

2. `uv` installed (use `uv run` invocations as below).

3. `claude` on `PATH` (the runners call `claude -p`), pointed at the model endpoint:

   ```sh
   export ANTHROPIC_AUTH_TOKEN=...
   export ANTHROPIC_BASE_URL=https://ark.cn-beijing.volces.com/api/compatible
   export ANTHROPIC_MODEL=doubao-seed-2-0-code-preview-260215
   ```

4. A running `openviking-server` (defaults to `127.0.0.1:1933`). The e2e mode wants it running inside a tmux session so the runner can restart it across snapshots; set `OPENVIKING_SERVER_TMUX=<session>` (default `ovserver`).

5. A local checkout of `claude-code-memory-plugin`:

   ```sh
   export OPENVIKING_PLUGIN_DIR=$HOME/Dev/OpenViking/examples/claude-code-memory-plugin
   ```

   It is referenced from `config/ov-hooks.json` and from `scripts/auto-capture.mjs`.

6. Optionally:

   ```sh
   export OPENVIKING_CLI_CONFIG_FILE=$HOME/.openviking/ovcli-local.conf
   ```

   to pin the local OV target and avoid leaking prod URLs from the user shell's `claude` wrapper.

## Running

Single-sample smoke test (any mode, any conv id):
```sh
./run_prompted.sh conv-26
./run_sdk_iso.sh conv-26
./run_sdk_noiso.sh conv-26
./run_e2e.sh conv-26
```
Full 10-conv run (omit the conv id):

```sh
./run_prompted.sh
./run_sdk_iso.sh
./run_sdk_noiso.sh
./run_e2e.sh
```
Each script ends by printing the `stat_judge_result.py` summary to stdout and saving it to `.tmp/result-<mode>/summary.txt`. Per-question CSVs land in `.tmp/result-<mode>/qa_results.csv`.
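For orientation, here is a minimal sketch of the kind of per-category roll-up `stat_judge_result.py` performs. The CSV column names (`question_id`, `category`, `judge_verdict`) are assumptions for illustration, not the script's actual schema:

```python
import csv
import io
from collections import defaultdict

def rollup(csv_text: str) -> dict:
    """Per-category accuracy from a qa_results-style CSV.

    Assumed (hypothetical) columns: question_id, category,
    judge_verdict ('correct' or 'wrong').
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cat = row["category"]
        totals[cat] += 1
        if row["judge_verdict"] == "correct":
            correct[cat] += 1
    # Accuracy per question category, as a fraction in [0, 1].
    return {c: correct[c] / totals[c] for c in totals}

sample = """question_id,category,judge_verdict
q1,single-hop,correct
q2,single-hop,wrong
q3,temporal,correct
"""
print(rollup(sample))  # {'single-hop': 0.5, 'temporal': 1.0}
```

The real script also distinguishes LoCoMo's question categories when reporting, but the aggregation shape is the same.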
## Scripts

- `ingest.py` — feeds each LoCoMo session to a `claude -p` invocation; CC's vanilla auto-memory writes `MEMORY.md` inside the project dir.
- `import_to_ov.py` — uses the OpenViking Python SDK to push LoCoMo conversations directly into OV. Per-sample namespace by default; pass `--no-user-agent-id` for a shared namespace.
- `ingest_e2e.py` — opens one `claude -p` per LoCoMo session in `--input-format stream-json` mode and streams the messages one at a time through stdin. The plugin's Stop hook auto-captures per turn; SessionEnd fires once per LoCoMo session (one commit per session). The benchmark's `scripts/auto-capture.mjs` shim extracts the LoCoMo per-message timestamp so OV event-archive dates line up with conv chronology.
- `eval.py` — drives QA. One `claude -p` per question; reads `MEMORY.md` for Prompted mode, talks to OV via the auto-recall hook + MCP for the OV modes.
- `judge.py` / `stat_judge_result.py` — LLM-based judging and per-category accuracy roll-up.
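The streaming side of `ingest_e2e.py` can be sketched as follows. This is a hedged illustration: the stream-json envelope shape and the LoCoMo field names (`speaker`, `time`, `text`) are assumptions, not taken from the actual script:

```python
import json

def to_stream_json(messages):
    """Render LoCoMo session messages as newline-delimited JSON user
    turns, one per line, suitable for piping to
    `claude -p --input-format stream-json` over stdin.
    The envelope shape below is an assumption for illustration.
    """
    lines = []
    for m in messages:
        lines.append(json.dumps({
            "type": "user",
            "message": {
                "role": "user",
                # Prefix speaker and timestamp so the model sees chronology.
                "content": f'{m["speaker"]} ({m["time"]}): {m["text"]}',
            },
        }))
    return "\n".join(lines) + "\n"

session = [
    {"speaker": "Caroline", "time": "1:56 pm on 8 May, 2023",
     "text": "Hi!"},
    {"speaker": "Melanie", "time": "1:58 pm on 8 May, 2023",
     "text": "Hey, how are you?"},
]
ndjson = to_stream_json(session)
print(ndjson.count("\n"))  # 2 lines, one per message
```

Sending messages one line at a time (rather than one big prompt) is what lets the Stop hook fire and auto-capture after each turn.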
## Layout

```
benchmark/locomo/claudecode/
├── README.md
├── run_prompted.sh       # 4 entry points, one per published mode
├── run_sdk_iso.sh
├── run_sdk_noiso.sh
├── run_e2e.sh
├── ingest.py             # Prompted ingest
├── ingest_e2e.py         # e2e stream-json ingest
├── import_to_ov.py       # SDK pre-ingest (iso + no-iso)
├── eval.py
├── judge.py
├── stat_judge_result.py
├── config/
│   ├── ov-hooks.json     # hooks settings, references $OPENVIKING_PLUGIN_DIR
│   ├── ov-mcp.json       # MCP server (viking)
│   ├── ov-ingest.conf    # auto-capture on, auto-recall on
│   ├── ov-qa.conf        # auto-capture off, auto-recall on
│   └── sys-prompts/      # optional system prompts (unused in default flows)
├── scripts/
│   └── auto-capture.mjs  # LoCoMo-aware Stop hook (per-message created_at)
└── .tmp/                 # gitignored; data, snapshots, results
    ├── locomo10.json     # ← drop dataset here
    ├── legacy/           # historical scripts/configs/reports kept aside
    └── result-<mode>/    # output of a run
```
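The per-message `created_at` extraction done by `scripts/auto-capture.mjs` amounts to parsing LoCoMo's human-readable timestamps into machine dates so OV's event archive sorts chronologically. A minimal Python sketch, assuming timestamps of the form `1:56 pm on 8 May, 2023` (the format is an assumption about the dataset):

```python
from datetime import datetime

def locomo_ts_to_iso(ts: str) -> str:
    """Parse a LoCoMo-style timestamp like '1:56 pm on 8 May, 2023'
    into ISO-8601, so OV event-archive dates line up with conv
    chronology. The input format here is an assumption.
    """
    return datetime.strptime(ts, "%I:%M %p on %d %B, %Y").isoformat()

print(locomo_ts_to_iso("1:56 pm on 8 May, 2023"))  # 2023-05-08T13:56:00
```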
Historical iterations of the experiment (r8 … r14b snapshot scripts, fine/listed/min ingest variants, intermediate reports) live under `.tmp/legacy/` and are not part of the published numbers.