explorations/longmemeval/USAGE.md
The prepare step processes the dataset through mock agents to populate the storage:
```bash
# Quick test with 5 questions
pnpm prepare:quick

# Full dataset with different memory configs
pnpm prepare:s          # Small dataset, semantic-recall (default)
pnpm prepare:s:lastk    # Small dataset, last-k messages
pnpm prepare:s:working  # Small dataset, working-memory
pnpm prepare:s:combined # Small dataset, combined (semantic + working memory)
pnpm prepare:m          # Medium dataset, semantic-recall
```
After preparing data, run the benchmark:
```bash
# Quick test
pnpm run:quick

# Full runs
pnpm run:s          # Small dataset with semantic-recall (default)
pnpm run:s:lastk    # Small dataset with last-k
pnpm run:s:working  # Small dataset with working-memory
pnpm run:s:combined # Small dataset with combined
```
```bash
pnpm cli prepare \
  -d <dataset>        # longmemeval_s, longmemeval_m, longmemeval_oracle
  -c <memory-config>  # full-history, last-k, semantic-recall, working-memory, combined
  [--subset <n>]      # Process only n questions
  [--output <dir>]    # Output directory (default: ./prepared-data)
```
```bash
pnpm cli run \
  -d <dataset>             # longmemeval_s, longmemeval_m, longmemeval_oracle
  -m <model>               # Model name (e.g., gpt-4o)
  -c <memory-config>       # full-history, last-k, semantic-recall, working-memory, combined
  [--subset <n>]           # Run only n questions
  [--concurrency <n>]      # Parallel requests (default: 5)
  [--prepared-data <dir>]  # Prepared data directory
  [--output <dir>]         # Results directory
```
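To sweep several memory configs in one go, the CLI can be driven from a small script. The sketch below only assembles the `pnpm cli run` invocations from the flags documented above; the `build_run_command` helper and the choice of dataset/model are illustrative, not part of the project:

```python
import subprocess

# Memory configs from the reference above (full-history omitted; see the note below)
MEMORY_CONFIGS = ["last-k", "semantic-recall", "working-memory", "combined"]

def build_run_command(dataset, model, config, subset=None):
    """Assemble one `pnpm cli run` invocation as an argv list."""
    cmd = ["pnpm", "cli", "run", "-d", dataset, "-m", model, "-c", config]
    if subset is not None:
        cmd += ["--subset", str(subset)]
    return cmd

# Build (but don't launch) one command per memory config
commands = [build_run_command("longmemeval_s", "gpt-4o", c, subset=10) for c in MEMORY_CONFIGS]
for cmd in commands:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute each run
```

Runs execute sequentially here so their results land in separate `run_<timestamp>` directories; the CLI's own `--concurrency` flag already parallelizes requests within a run.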
Note: `full-history` is available but not recommended for testing memory systems, since loading the entire conversation history into context defeats the purpose of a memory system.
```bash
# Required for running benchmarks
export OPENAI_API_KEY=your-key-here

# Optional for downloading datasets
export HF_TOKEN=your-huggingface-token
```
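Long runs fail late and expensively if credentials are missing, so it can be worth checking the environment up front. A minimal preflight sketch, assuming only the two variables exported above (the `check_env` helper itself is ours, not part of the CLI):

```python
import os

def check_env(required=("OPENAI_API_KEY",), optional=("HF_TOKEN",)):
    """Return the list of missing required variables; note absent optional ones."""
    missing = [name for name in required if not os.environ.get(name)]
    for name in optional:
        if not os.environ.get(name):
            print(f"note: {name} is not set (only needed for downloading datasets)")
    return missing

missing = check_env()
print("missing required vars:", ", ".join(missing) if missing else "none")
```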
```bash
# 1. Test with small subset first
pnpm prepare:quick
pnpm run:quick

# 2. Run full benchmark with semantic recall
pnpm prepare:s
pnpm run:s
```
Results are saved in ./results/run_<timestamp>/:
- `results.jsonl`: Raw evaluation results
- `metrics.json`: Aggregated metrics

```bash
# View all runs
pnpm cli report -r ./results

# Check specific metrics
cat results/run_*/metrics.json | jq '.overall_accuracy'
```