docs/tech-notes/roachtest-investigation-tips/README.md
This directory contains guides for investigating common roachtest failures.
Gather artifacts using ./scripts/fetch-roachtest-artifacts.
Identify the failure type from test logs and error messages.
Follow the appropriate guide based on the failure type:
Document findings
Edit and extend investigation.md to capture your findings:
Review the code: based on the observations above, perform a cursory review of the code involved in the errors, jobs, etc. that you observed:
Search git history for recent changes related to the failure:
git log --oneline -n 10 pkg/path/to/test
git log --oneline -n 20 pkg/path/to/relevant/package/
git log -S "keyword" --oneline

Maintain a single, unified timeline for ease of reference throughout the investigation.
08:55:24 - job 456 records last progress change to 22% before stalling there for 34mins
08:55:27 - first addsstable 'Command too large' error appears in logs on n2
08:55:27-09:28:04 - addsstable 'Command too large' errors on all nodes for 33mins
A later section of the document would then describe the addsstable errors in full, with complete
examples, counts, the time period, the locations where they were observed, etc.

When investigating performance changes or failures, use systematic correlation to identify what differs between working and non-working periods:
This technique is particularly useful when investigating jobs that show changes in progress rates or workloads that show performance degradation over time.
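As a minimal sketch of this kind of correlation, the Python below counts how often chosen message patterns appear before and after a boundary time (for example, the moment a job's progress stalls) and reports the patterns whose counts differ. The `HH:MM:SS <node> <message>` line format, the function names, and the sample lines are all assumptions made for illustration. Real logs use a different format and would need their own parser, and time-of-day comparison only works when the whole run falls within one day.

```python
from collections import Counter
from datetime import datetime, time


def parse_line(line):
    """Split a hypothetical 'HH:MM:SS <node> <message>' line into its parts."""
    stamp, node, message = line.split(" ", 2)
    return datetime.strptime(stamp, "%H:%M:%S").time(), node, message


def count_by_period(lines, boundary, patterns):
    """Count lines containing each pattern before, and at or after, `boundary`."""
    counts = {"before": Counter(), "after": Counter()}
    for line in lines:
        when, _node, message = parse_line(line)
        period = "before" if when < boundary else "after"
        for pattern in patterns:
            if pattern in message:
                counts[period][pattern] += 1
    return counts


def differing(counts):
    """Return (pattern, after - before) pairs with a nonzero change, largest first."""
    patterns = set(counts["before"]) | set(counts["after"])
    diffs = {p: counts["after"][p] - counts["before"][p] for p in patterns}
    changed = [(p, d) for p, d in diffs.items() if d != 0]
    return sorted(changed, key=lambda item: -abs(item[1]))


# Made-up sample lines in the assumed format, loosely modeled on the
# timeline example in this guide. They are not real log output.
sample = [
    "08:40:02 n1 job 456 progress 10%",
    "08:50:10 n2 job 456 progress 18%",
    "08:55:24 n1 job 456 progress 22%",
    "08:55:27 n2 addsstable: Command too large",
    "08:56:01 n1 addsstable: Command too large",
    "08:57:30 n3 addsstable: Command too large",
]

# Boundary chosen as the last recorded progress change.
counts = count_by_period(sample, time(8, 55, 24), ["Command too large", "progress"])
for pattern, change in differing(counts):
    print(f"{pattern}: {change:+d}")
```

A pattern that is absent in the working period and frequent in the non-working one is a candidate for closer review. It is not a confirmed cause, so record it in the timeline alongside the supporting log lines.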
./scripts/fetch-roachtest-artifacts [issue_num|issue_link|issue_comment_link] - Downloads test
artifacts for investigation.
.claude/commands/roachtest-failure.md - Claude Code automation instructions for following this
guide.