docs/tech-notes/roachtest-investigation-tips/debugging-a-job.md
Roachtests that run specific jobs verify both successful completion and correct behavior, including performance characteristics. This guide provides a systematic approach to investigating job-related test failures.
Identifying the specific job ID is critical for investigation. If not found in test.log, search
debug/system.jobs.txt for the relevant job type:
- `c2c/*` (cluster-to-cluster) → STREAM INGESTION (in dest_debug)
- `ldr/*` → LOGICAL REPLICATION
- `cdc/*`, `changefeed/*` → CHANGEFEED
- `backup/*` → BACKUP
- `restore/*` → RESTORE
- `import/*` → IMPORT
- `schemachange/*` → SCHEMA CHANGE

Note for c2c tests: the cluster is split into source and destination clusters, with separate debug folders `source_debug/` and `dest_debug/`.
Verify the job is correctly identified before proceeding:
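One quick check from the fetched artifacts (the paths and the JOB_ID placeholder are illustrative):

```shell
# Confirm the job's type, status, and description match what the test runs.
# For c2c tests, look under source_debug/ and dest_debug/ instead of debug/.
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/system.jobs.txt
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/crdb_internal.jobs.txt
```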
Document every observation with a citation of its location in the test output. Create an investigation.md file to track findings systematically.
Create a focused, combined log containing all entries from all nodes that mention the job ID:
grep -h "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log | cut -c2- | sort > job_JOB_ID.log
NB: use the unredacted logs, not the redacted ones in artifacts/logs/*.cockroach.log, or you may miss details.
Review these logs for unusual patterns, error messages, retries, notable events and their timestamps.
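For a first pass, it can help to scan only warning-and-above lines; in cockroach logs the severity is the first character of each line (I/W/E/F), so a rough sketch is:

```shell
# Collect only warning, error, and fatal lines that mention the job.
grep -h "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log | grep -E '^[WEF]' > job_JOB_ID_errors.log
```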
Examine these system tables for job-specific information:
- `debug/crdb_internal.jobs.txt`: job status and errors.
- `debug/system.job_message.txt`: status messages recorded by the job.
- `debug/system.job_progress_history.txt`: historical progress values sorted by time. Run `./scripts/job-progress-plot <job-id> <path>` to generate an ASCII plot and insert it in your investigation document below the timeline; see the example after this list.
- `debug/jobs/<jobID>/*`: sometimes contains zipped-up job trace data.

Note: C2C tests have separate system tables for the source and destination clusters. The destination cluster typically contains the most relevant events.
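A typical invocation might look like the following; the job ID is illustrative, and this assumes `<path>` points at the fetched debug directory containing system.job_progress_history.txt:

```shell
# Generate an ASCII progress plot for the job and append it to the notes.
./scripts/job-progress-plot 958452853558771713 PATH/TO/FETCHED/artifacts/debug >> investigation.md
```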
Timestamp handling: timestamps appear in two formats:

- `<seconds>.<fractional-seconds>,<logical>`
- `<nanoseconds>.<logical>`

When searching logs, use only the leading (seconds) digits so your search matches both formats. Keep your notes consistent by using one format throughout.
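To relate a timestamp to wall-clock time, convert its seconds prefix; the value below is illustrative:

```shell
# GNU date (Linux):
date -u -d @1712345678
# BSD date (macOS):
date -u -r 1712345678
```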
Most distributed jobs utilize "processors" that run on multiple nodes to execute work in parallel.
Analyzing how processors are distributed and behave can reveal issues in work distribution, and can signal replannings or retry loops even when those aren't explicitly logged.
Processor log messages are already in the job-focused log since they are tagged with the job ID.
Processor start messages indicate that the job was planned or replanned; these events are often worth noting in the timeline, along with whether the start messages appear on all nodes or only some.
Sometimes these messages include the number of spans or work units assigned; if the assignment is not relatively balanced across nodes, or some nodes do not report starting processors when others do, that is notable. One rough way to eyeball this is shown below.
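This sketch tallies job-tagged log lines per node in the combined log; it assumes the log tags include the node ID in the form `n3` inside the bracketed tag list, so adjust the pattern to your log format:

```shell
# Count job-related log lines per node; a heavily skewed tally can hint at
# unbalanced planning or a node that never started its processors.
grep -oE '[[,]n[0-9]+[],]' job_JOB_ID.log | tr -d '[],' | sort | uniq -c | sort -rn
```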
"SSTable cannot be added spanning range bounds" errors are automatically handled by retrying each half of the SST.

This section is continuously evolving. If your job type isn't covered here:
For BACKUP and RESTORE jobs, always record these key infrastructure details in your investigation:
Job phases:
Key terminology:
Expected behaviors:
Replication components:
C2C-specific checklists:
Augment the investigation timeline to document:
Resolved progress and lag analysis: look at the progress history, the logged resolved timestamps, and the logged lag observations. During the replication phase, ask:
- Are there periods where lag increases?
- Is resolved time advancing consistently, or does it stall at a constant value for extended periods?
- Do stalls correlate with other cluster events or errors?
Update the timeline with any observations, particularly periods of constant vs increasing lag, or sudden changes in resolved time.
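A starting point for this scan (exact message wording varies by job type, so treat the pattern as illustrative):

```shell
# Pull lines mentioning resolved timestamps or lag from the combined job log.
grep -iE 'resolved|lag' job_JOB_ID.log > job_JOB_ID_resolved.log
```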
Component investigation: if replication stalls, examine both source and destination components for issues.
Understanding frontier stalls: If resolved time stops advancing during replication:
Data validation: tests typically compare fingerprints between clusters.
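If you need to sanity-check the comparison by hand, fingerprints can be computed per table on each cluster with SHOW EXPERIMENTAL_FINGERPRINTS; the connection flags and table name below are illustrative:

```shell
# Fingerprint the same table on both clusters and diff the results.
cockroach sql --insecure --host=SOURCE_HOST -e "SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE db.tbl" > src_fp.txt
cockroach sql --insecure --host=DEST_HOST -e "SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE db.tbl" > dst_fp.txt
diff src_fp.txt dst_fp.txt
```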