# LongMemEval
This package implements the LongMemEval benchmark for testing Mastra's long-term memory capabilities.
LongMemEval is a comprehensive benchmark for evaluating the long-term memory capabilities of chat assistants. It was introduced in the paper:

> "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory"
> Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu (ICLR 2025)
>
> 📄 Paper | 🌐 Website | 🤗 Dataset
The benchmark evaluates five core long-term memory abilities through 500 meticulously curated questions:

- Information extraction
- Multi-session reasoning
- Temporal reasoning
- Knowledge updates
- Abstention
Current LLMs show a 30-60% performance drop when tested on LongMemEval, revealing significant challenges in maintaining coherent long-term memory. This benchmark helps identify and improve these limitations.
## Quick Start

```bash
# From packages/longmemeval directory

# 1. Set your API keys
export OPENAI_API_KEY=your_openai_key_here
export HF_TOKEN=your_huggingface_token_here # For automatic dataset download

# 2. Run a benchmark (downloads datasets automatically if needed)
pnpm bench:s      # Run small dataset (10 parallel requests)
pnpm bench:m      # Run medium dataset (10 parallel requests)
pnpm bench:oracle # Run oracle dataset (10 parallel requests)

# Or run quick 10-question tests
pnpm bench:s:quick      # Test with 10 questions from small dataset
pnpm bench:m:quick      # Test with 10 questions from medium dataset
pnpm bench:oracle:quick # Test with 10 questions from oracle dataset
```
**Note**: The benchmark automatically downloads datasets on first run. Get your HuggingFace token from https://huggingface.co/settings/tokens.
## Installation

```bash
# From the monorepo root
pnpm install
pnpm build

# Set your HuggingFace token
export HF_TOKEN=your_token_here

# Download datasets (no Python or Git LFS required)
pnpm download
```
If the automatic download fails, see `DOWNLOAD_GUIDE.md` for manual download instructions.
## Running Benchmarks

```bash
# From packages/longmemeval directory

# Quick commands for each dataset (10 parallel requests)
pnpm bench:s      # Small dataset (full run)
pnpm bench:m      # Medium dataset (full run)
pnpm bench:oracle # Oracle dataset (full run)

# Quick test runs (10 questions only, 5 parallel)
pnpm bench:s:quick      # Small dataset (quick test)
pnpm bench:m:quick      # Medium dataset (quick test)
pnpm bench:oracle:quick # Oracle dataset (quick test)

# Advanced: use the full CLI with custom options
pnpm cli run --dataset longmemeval_s --model gpt-4o

# Adjust parallelization (default: 5)
pnpm cli run --dataset longmemeval_s --model gpt-4o --concurrency 20

# Graceful shutdown: press Ctrl+C to stop and save progress
```
```bash
# Run with a specific memory configuration
pnpm cli run --dataset longmemeval_s --memory-config last-k --model gpt-4o
pnpm cli run --dataset longmemeval_s --memory-config semantic-recall --model gpt-4o
pnpm cli run --dataset longmemeval_s --memory-config working-memory --model gpt-4o

# Custom subset size
pnpm cli run --dataset longmemeval_oracle --model gpt-4o --subset 25
```
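These presets correspond to Mastra memory features. As a rough illustration of what each one configures, here is a sketch using Mastra's `Memory` options; the benchmark's actual settings live in `src/benchmark/runner.ts` and may differ:

```typescript
import { Memory } from "@mastra/memory";

// Illustrative sketches of the three presets. Values are examples only;
// semantic recall additionally needs a vector store and embedder configured.

// last-k: keep only the most recent N messages in context
const lastK = new Memory({
  options: { lastMessages: 10 },
});

// semantic-recall: embed past messages and retrieve the most relevant ones
const semanticRecall = new Memory({
  options: {
    semanticRecall: { topK: 3, messageRange: 2 },
  },
});

// working-memory: maintain a persistent, model-updated scratchpad
const workingMemory = new Memory({
  options: {
    workingMemory: { enabled: true },
  },
});
```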
```bash
# View dataset statistics
pnpm cli stats --dataset longmemeval_s

# Re-evaluate saved results
pnpm cli evaluate --results ./results/run_12345/results.jsonl --dataset longmemeval_s

# Generate a report from saved results
pnpm cli report --results ./results/
```
## Datasets

LongMemEval provides three dataset variants:

- `longmemeval_s` (small): ~115k tokens of chat history per question
- `longmemeval_m` (medium): ~500 sessions of chat history per question
- `longmemeval_oracle`: only the evidence sessions needed to answer each question

## Results

Results are saved in the `results/` directory with:

- `results.jsonl`: Individual question results
- `hypotheses.json`: Model responses
- `questions.json`: Questions for reference
- `metrics.json`: Aggregated metrics and configuration
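For ad-hoc analysis beyond `pnpm cli report`, a run's `results.jsonl` can be read line by line. A minimal sketch; the field names in `QuestionResult` are assumptions, so check a generated file for the real schema:

```typescript
import { readFileSync } from "node:fs";

// One JSON object per line in results.jsonl.
// Field names here are assumptions; inspect a real file for the schema.
interface QuestionResult {
  question_id: string;
  hypothesis: string; // model's answer
  is_correct?: boolean;
}

const lines = readFileSync("results/run_12345/results.jsonl", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0);

const results: QuestionResult[] = lines.map((line) => JSON.parse(line));
const correct = results.filter((r) => r.is_correct).length;
console.log(`${correct}/${results.length} correct`);
```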
## Citation

If you use this benchmark in your research, please cite the original paper:

```bibtex
@article{wu2024longmemeval,
  title={LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author={Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and Chang, Kai-Wei and Yu, Dong},
  journal={arXiv preprint arXiv:2410.10813},
  year={2024}
}
```
## Extending the Benchmark

To add a custom memory configuration:

1. Edit `src/benchmark/runner.ts` and add your configuration to `getMemoryConfig()`
2. Add the new option to `MemoryConfigType` in `src/data/types.ts`
3. Update `src/memory-adapters/mastra-adapter.ts` to handle the new configuration
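A hypothetical sketch of steps 1 and 2; the `combined` configuration name and the option shapes are invented for illustration, so mirror the real types in `src/data/types.ts`:

```typescript
// Sketch of extending getMemoryConfig() in src/benchmark/runner.ts.
// "combined" is a hypothetical new preset added to the existing three.
type MemoryConfigType = "last-k" | "semantic-recall" | "working-memory" | "combined";

function getMemoryConfig(type: MemoryConfigType) {
  switch (type) {
    case "last-k":
      return { lastMessages: 10 };
    case "semantic-recall":
      return { semanticRecall: { topK: 3, messageRange: 2 } };
    case "working-memory":
      return { workingMemory: { enabled: true } };
    case "combined": // new configuration combining all three strategies
      return {
        lastMessages: 10,
        semanticRecall: { topK: 3, messageRange: 2 },
        workingMemory: { enabled: true },
      };
  }
}
```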