docs/tech-notes/roachtest-investigation-tips/exits_crashes_ooms.md
When investigating a node reported as exiting unexpectedly, look for messages, stack traces, and exit codes, logged either by the node itself as it crashed or by the system if the system killed the node.
- logs/N.unredacted/cockroach.stderr.log for stack traces and fatal messages.
- logs/N.unredacted/cockroach.exit.log files for exit codes (only nodes that crashed will have these).
- grep -n "panic\|fatal\|abort" logs/*.unredacted/cockroach.stderr.log
- grep "disk stall detected" logs/*.unredacted/cockroach.log
- N.dmesg.txt files, and messages about cockroach in artifacts/*.journalctl.txt.

Before diving into crash details, establish the timeline context:
Timing Analysis:
Test Type Context:
Check for Intentional Node Stops: Many tests intentionally stop nodes as part of their test scenario. Before investigating a crash as a problem:
Failure Mode Assessment:
Check artifacts/test.log and the crashed node's stderr for exit codes:
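One way to gather this evidence in a single pass is sketched below. It is a minimal sketch, not part of the roachtest tooling: the function name `collect_exit_codes` is hypothetical, it assumes that `test.log` and `logs/` sit side by side under the directory passed in, and it assumes that exit codes show up in text as `exit status N` (the form Go's `os/exec` uses) or `exit code N`. Adjust the pattern if the logs at hand use a different wording.

```shell
#!/usr/bin/env bash
# Sketch: collect exit-code evidence from a roachtest artifacts directory.
# Assumptions (not taken from the note itself):
#   - test.log and logs/N.unredacted/ live under the directory given as $1
#   - exit codes appear as "exit status N" or "exit code N"

collect_exit_codes() {
  local dir="${1:-artifacts}"
  local pattern='exit (status|code) [0-9]+'
  local f

  # Exit-code mentions in the test runner's log, with line numbers.
  if [ -f "$dir/test.log" ]; then
    grep -n -i -E "$pattern" "$dir/test.log" | sed 's/^/test.log:/'
  fi

  # cockroach.exit.log only exists for nodes that crashed, so print each
  # one in full, prefixed with its path to identify the node.
  for f in "$dir"/logs/*.unredacted/cockroach.exit.log; do
    [ -f "$f" ] || continue   # glob matched nothing
    awk -v p="$f" '{ print p ": " $0 }' "$f"
  done

  # Exit-code mentions in each node's stderr log.
  for f in "$dir"/logs/*.unredacted/cockroach.stderr.log; do
    [ -f "$f" ] || continue
    grep -H -n -i -E "$pattern" "$f" || true   # no match is not an error
  done
  return 0
}
```

Calling `collect_exit_codes artifacts` from the directory that contains the test's artifacts prints one line per hit. Nodes that appear in the output are the ones worth opening first; a node with no `cockroach.exit.log` and no matching lines probably did not crash on its own.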
When multiple issues are present, investigate in this order: