explorations/longmemeval/docs/INVESTIGATION_WORKFLOW.md
This document describes how to investigate failing questions in LongMemEval benchmarks to identify root causes and implement fixes.
The goal is to find deficiencies in the LongMemEval dataset: many question/answer pairs appear to be broken, either because the question is misleading or because the answer includes incorrect information or details that the question did not ask for.
The investigation workflow has 4 stages:
pending → investigated → fix-implemented → synced
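The four stages form a simple linear progression. As a minimal sketch (not the tool's actual implementation), they can be modelled as an ordered list, with the CLI flag that this document associates with leaving each stage. The names `STAGES`, `ADVANCE_FLAG`, and `nextStage` are hypothetical and exist only for illustration.

```typescript
// The four stages, in the order the workflow moves through them.
const STAGES = ["pending", "investigated", "fix-implemented", "synced"] as const;
type Stage = (typeof STAGES)[number];

// Flag that moves a question out of each stage, as described in this document.
// (Assumption: one flag per transition; the real CLI may track more state.)
const ADVANCE_FLAG: Record<Exclude<Stage, "synced">, string> = {
  pending: "--done",
  investigated: "--fixed",
  "fix-implemented": "--sync",
};

// Hypothetical helper: the stage that follows `stage`, or null once synced.
function nextStage(stage: Stage): Stage | null {
  const i = STAGES.indexOf(stage);
  return STAGES[i + 1] ?? null;
}
```

Modelling the stages as an ordered tuple keeps the progression explicit and makes it a type error to refer to a stage that does not exist.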
The final stage, synced, means the fixes have been written to the longmemeval_s.json dataset.
# 1. List all runs with failures
pnpm investigate --list
# 2. Setup investigation for a specific run
pnpm investigate <run-id>
# 3. Open the next uninvestigated question
pnpm investigate --next
# 4. After investigating, mark as done
pnpm investigate --done <question-id>
# 5. After implementing fix, mark as fixed
pnpm investigate --fixed <question-id>
# 6. Sync all fixes to dataset
pnpm investigate --sync
# List all runs with failures, grouped by config
pnpm investigate --list
# Filter by config name
pnpm investigate --list -c gpt5
pnpm investigate --list -c om-gemini
# Setup investigation directory for a run
pnpm investigate run_1768439350043
This creates:
investigations/
└── run_1768439350043/
├── progress.json # Tracks investigation status
└── <question-id>/
├── analysis.md # Investigation template
└── data/
├── original.json # Raw dataset for this question
├── result.json # Evaluation result
├── om.md # Agent's context window
└── om.json # Prepared OM data (if exists)
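When scripting around this layout, it can help to derive the per-question file paths in one place. The following is a small sketch that mirrors the tree above; `questionPaths` and the `QuestionPaths` interface are hypothetical helpers, and the default root of `investigations` is taken from the listing rather than from the tool's source.

```typescript
// Paths to the files created for one question, mirroring the tree above.
interface QuestionPaths {
  dir: string;
  analysis: string;
  original: string;
  result: string;
  omMd: string;
  omJson: string; // may not exist on disk ("if exists" in the listing)
}

// Hypothetical helper: builds the paths without touching the filesystem.
function questionPaths(
  runId: string,
  questionId: string,
  root = "investigations",
): QuestionPaths {
  const dir = `${root}/${runId}/${questionId}`;
  return {
    dir,
    analysis: `${dir}/analysis.md`,
    original: `${dir}/data/original.json`,
    result: `${dir}/data/result.json`,
    omMd: `${dir}/data/om.md`,
    omJson: `${dir}/data/om.json`,
  };
}

console.log(questionPaths("run_1768439350043", "07741c45").analysis);
// prints "investigations/run_1768439350043/07741c45/analysis.md"
```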
# Open the next uninvestigated question in your editor
pnpm investigate --next
# Check current progress
pnpm investigate --status
The investigate command provides several utilities to help diagnose issues:
# Search what the Observer extracted
pnpm investigate --search "keyword" -q <question-id>
# Search the raw dataset with full context
pnpm investigate --search-original "keyword" -q <question-id>
# Trace a keyword through the entire pipeline
pnpm investigate --trace "keyword" -q <question-id>
This shows where the information exists at each stage of the pipeline.
# List all sessions for a question
pnpm investigate --list-sessions -q <question-id>
# View a specific session
pnpm investigate --session 33 -q <question-id>
# Show summary of question's data
pnpm investigate --inspect <question-id>
# View observations around a specific date
pnpm investigate --date "2023/05/29" -q <question-id>
pnpm investigate --date "May 29" -q <question-id> --context 2
Edit the analysis.md file for each question:
## Failure Category
- [x] Observer missed critical information
- [ ] Reflector lost/merged information incorrectly
- [ ] Agent reasoning error (had info, wrong conclusion)
- [ ] Ambiguous/poorly-worded question
- [ ] Dataset inconsistency/error
- [ ] RAG retrieval miss (if applicable)
- [ ] Other: \_\_\_
## Root Cause Analysis
<!-- Describe what went wrong -->
## Evidence
<!-- Quote relevant parts of om.md, original data, etc. -->
## Potential Improvements
### Observer/Reflector Changes
- **Likelihood**: High
- **Suggested prompt change**: ...
### Fixed Question/Answer
- **improved_question**: ...
- **improved_answer**: ...
- **improvement_note**: ...
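Because the failure categories are plain Markdown checkboxes, they are easy to tally across many analysis.md files when looking for systemic issues. The sketch below extracts the checked entries from a template like the one above; `checkedCategories` is a hypothetical helper, and it assumes the `- [x]` checkbox syntax shown in the template.

```typescript
// Hypothetical helper: returns the labels of all checked "- [x]" items.
function checkedCategories(markdown: string): string[] {
  const checked: string[] = [];
  for (const line of markdown.split("\n")) {
    // Matches "- [x] Label" (upper- or lower-case x), ignoring indentation.
    const match = /^\s*- \[[xX]\]\s+(.*)$/.exec(line);
    if (match && match[1] !== undefined) {
      checked.push(match[1].trim());
    }
  }
  return checked;
}
```

Unchecked items (`- [ ]`) are ignored, so an untouched template yields an empty list.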
pnpm investigate --done <question-id>
This marks the question as investigated in progress.json, with your findings kept in analysis.md.
Based on your investigation, implement fixes:
Observer/Reflector prompt changes: Edit packages/memory/src/experiments/observational-memory/observer-agent.ts or reflector-agent.ts
Improved question/answer: Add to analysis.md:
### Fixed Question/Answer
- **improved_question**: What is the current location of my old sneakers?
- **improved_answer**: in a shoe rack in my closet
- **improvement_note**: Original question was ambiguous about timeframe
Re-prepare data: If Observer/Reflector prompts changed:
pnpm prepare om --from-failures ./results/om/run_xxx/failures.json
pnpm investigate --fixed <question-id>
pnpm investigate --sync
This syncs improved_question, improved_answer, and improvement_note from analysis.md files to longmemeval_s.json.
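To make the sync step concrete, here is a rough sketch of how the three fields could be read out of an analysis.md file. It is not the actual sync implementation: `parseImprovements` is a hypothetical function, and treating an empty value or the template's `...` placeholder as "not filled in" is an assumption.

```typescript
type ImprovementField = "improved_question" | "improved_answer" | "improvement_note";

// Hypothetical helper: extracts "- **field**: value" lines from analysis.md text.
function parseImprovements(
  markdown: string,
): Partial<Record<ImprovementField, string>> {
  const fields: Partial<Record<ImprovementField, string>> = {};
  const pattern =
    /^\s*- \*\*(improved_question|improved_answer|improvement_note)\*\*:\s*(.*)$/;
  for (const line of markdown.split("\n")) {
    const match = pattern.exec(line);
    if (!match) continue;
    const value = (match[2] ?? "").trim();
    // Assumption: "..." is the unfilled template placeholder, so skip it.
    if (value === "" || value === "...") continue;
    fields[match[1] as ImprovementField] = value;
  }
  return fields;
}
```

Fields that are still at their placeholder value are simply absent from the result, so a caller can tell which questions actually have an improved question or answer to write back.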
Symptoms: Information exists in the original dataset but not in the observations.
Diagnosis:
pnpm investigate --trace "keyword" -q <question-id>
# Look for: "❌ Observer missed this information"
Common causes:
Symptoms: Information is in the observations but lost after reflection.
Diagnosis: Compare observations before/after reflection in om.json.
Symptoms: Information is present in om.md but the agent reached the wrong conclusion.
Diagnosis: Check om.md - if the answer is there, it's a reasoning issue.
Symptoms: Conflicting information in the dataset itself.
Diagnosis:
pnpm investigate --search-original "keyword" -q <question-id>
# Look for contradictory statements
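The first three patterns above amount to a small decision procedure based on the last pipeline stage where the relevant information is still present. The sketch below encodes that reasoning; `classifyFailure` and `TraceFindings` are hypothetical, the booleans are meant to be filled in by hand after using the search and trace commands, and it assumes om.md reflects the state after reflection. Dataset inconsistencies and ambiguous questions still need manual judgment and fall under "Other" here.

```typescript
type FailureCategory =
  | "Observer missed critical information"
  | "Reflector lost/merged information incorrectly"
  | "Agent reasoning error (had info, wrong conclusion)"
  | "Other";

// Where the relevant information was found during the investigation.
interface TraceFindings {
  inOriginal: boolean; // present in the raw dataset
  inObservations: boolean; // present in what the Observer extracted
  inOmMd: boolean; // present in the agent's context window (om.md)
}

// Hypothetical helper: picks a category from the latest stage that has the info.
function classifyFailure(findings: TraceFindings): FailureCategory {
  if (findings.inOmMd) return "Agent reasoning error (had info, wrong conclusion)";
  if (findings.inObservations) return "Reflector lost/merged information incorrectly";
  if (findings.inOriginal) return "Observer missed critical information";
  return "Other";
}
```

Checking from the last stage backwards means the function reports the earliest point at which the information disappeared.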
Start with --trace: It quickly shows where information was lost.
Use --search-original: See the full context of what the user actually said.
Check the date: Use --list-sessions to find when information was mentioned.
Look for patterns: Similar failures often have the same root cause.
Document everything: Good analysis.md files help identify systemic issues.
# 1. Find the question
pnpm investigate --list -c om
# 2. Setup
pnpm investigate run_1768439350043
# 3. Start investigating
pnpm investigate --next
# 4. Trace the issue
pnpm investigate --trace "shoe rack" -q 07741c45
# 5. Search original data
pnpm investigate --search-original "shoe rack" -q 07741c45
# 6. View the session
pnpm investigate --session 33 -q 07741c45
# 7. Document findings in analysis.md
# (edit the file)
# 8. Mark as done
pnpm investigate --done 07741c45
# 9. After implementing fix
pnpm investigate --fixed 07741c45
# 10. Sync to dataset
pnpm investigate --sync