buildscripts/resmokelib/hang_analyzer/README.md
There are two main ways of running the core analyzer.
To run the core analyzer with local core dumps and binaries:
```
python3 buildscripts/resmoke.py core-analyzer
```
This will look for binaries in the `build/install` directory and for core dumps in the current working directory. If your local environment is laid out differently, you can pass `--install-dir` and `--core-dir` to specify other locations.
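As a rough sketch of how local mode resolves its inputs (the glob patterns below are illustrative assumptions, not the analyzer's exact matching rules):

```python
from pathlib import Path


def find_inputs(install_dir="build/install", core_dir="."):
    """Collect candidate binaries and core dumps from the default locations."""
    binaries = [p for p in Path(install_dir).rglob("*") if p.is_file()]
    # Core dump file names vary with the kernel's core_pattern;
    # "*.core" and "core.*" are common shapes (an assumption here).
    cores = sorted(
        set(Path(core_dir).glob("*.core")) | set(Path(core_dir).glob("core.*"))
    )
    return binaries, cores
```

Passing different directories to a function like this mirrors what `--install-dir` and `--core-dir` do in the real invocation.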
To run the core analyzer with core dumps and binaries from an evergreen task:
```
python3 buildscripts/resmoke.py core-analyzer --task-id={task_id}
```
This will download all of the core dumps and binaries from the task and put them into the configured `--working-dir`, which defaults to the `core-analyzer` directory. All of the task analysis is written to the `analysis` directory inside the configured `--working-dir`.
Note: the core analyzer currently runs only on Linux. Windows uses the legacy hang analyzer and will be switched over when we run into issues or have time to do the transition. We have not tackled the problem of getting core dumps on macOS, so there is no core dump analysis on that operating system.
```mermaid
sequenceDiagram
    Task Timed Out ->> Hang Analyzer: Scan all python processes for resmoke process
    Hang Analyzer ->> Resmoke: Signal resmoke to archive data files and take core dumps
    Resmoke ->> Hang Analyzer: Report resmoke pids to hang analyzer to take core dumps of
    Hang Analyzer ->> Core Dumps: Attach to pid and generate core dumps
```
When a task times out, it hits the timeout section in the defined evergreen config. In this timeout section, we run a task that invokes the hang analyzer as follows:
```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -m exact -p python
```
This tells the hang analyzer to look for all of the python processes on the machine (we are specifically looking for resmoke) and to signal them. When resmoke is signaled, it in turn invokes the hang analyzer with the specific pids of its child processes. That invocation will look similar to this most of the time:
```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -k -c -d pid1,pid2,pid3
```
The flags to note here are `-k`, which kills the processes, and `-c`, which takes core dumps.
The resulting core dumps are put into the current working directory.
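The signal round trip above can be sketched with the stdlib alone. The real mechanism lives in resmoke; the choice of `SIGUSR1` and the `report` callback here are illustrative assumptions:

```python
import os
import signal


def make_handler(child_pids, report):
    """Build a handler that reports child pids when the hang analyzer signals us."""
    def handler(signum, frame):
        # In resmoke, this is roughly where data files are archived and the
        # child pids are handed back so the hang analyzer can take core dumps.
        report(list(child_pids))
    return handler


reported = []
children = [101, 102, 103]  # placeholder pids for illustration
signal.signal(signal.SIGUSR1, make_handler(children, reported.extend))

# Simulate the hang analyzer signaling this process.
os.kill(os.getpid(), signal.SIGUSR1)
assert reported == [101, 102, 103]
```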
An optional test timeout (`--testTimeout=N`, in seconds) can be passed when running resmoke, which will run the hang analyzer on all processes related to a test that exceeds it.
When a test times out, it will analyze a process tree like the following:
```
|-python resmoke.py (pgid 5)
| |-mongo (ENV_MARKER=0, pgid 6)
| | |-foo (ENV_MARKER=0, pgid 6)
| | |-bar (ENV_MARKER=0, pgid 7)
| |-mongo (ENV_MARKER=1, pgid 8)
| |-mongo (ENV_MARKER=2, pgid 9)
```
Caution: should a process be created in a new process group, as `bar` is in the example above, it may be missed on macOS. If `foo` crashes or exits, `bar` is orphaned and reparented to the init process. It is then no longer a "child", and it is not generally possible to read the environment variables of arbitrary processes on macOS with System Integrity Protection (SIP) enabled.
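The process-group behavior can be demonstrated with a short stdlib sketch, where the `start_new_session=True` child plays the role of `bar` above (a POSIX-only illustration):

```python
import os
import subprocess
import sys

# A child normally inherits its parent's process group; start_new_session=True
# gives it a fresh one, like `bar` in the tree above.
same_group = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
new_group = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    start_new_session=True,
)

try:
    print(os.getpgid(same_group.pid) == os.getpgid(os.getpid()))  # True
    print(os.getpgid(new_group.pid) == os.getpgid(os.getpid()))  # False
finally:
    for proc in (same_group, new_group):
        proc.kill()
        proc.wait()
```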
When a task fails normally, core dumps may also be generated by the Linux kernel and put into the working directory.
We use a non-standard way of uploading core dumps to evergreen because of timeouts we hit when archiving and uploading them normally through evergreen commands. When we investigated, we found that compressing and uploading core dumps was slow for a couple of reasons.
We made a script that gzips all of the core dumps in parallel and uploads each one to S3 individually and asynchronously. This solved the problems listed above.
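A minimal sketch of that approach, with the S3 upload stubbed out (`upload` here is a placeholder callback, not the real script's API):

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def compress_and_upload(core_path: Path, upload) -> Path:
    """Gzip one core dump, then hand the archive to the uploader."""
    gz_path = core_path.with_suffix(core_path.suffix + ".gz")
    with open(core_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    upload(gz_path)  # placeholder for the real asynchronous S3 upload
    return gz_path


def process_core_dumps(core_paths, upload, max_workers=8):
    """Compress and upload every core dump in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: compress_and_upload(p, upload), core_paths))
```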
```mermaid
sequenceDiagram
    Task Shut Down ->> Generate Task Script: If core dumps are present, generate task config
    Generate Task Script ->> Task Shut Down: Write generated task config to disk
    Task Shut Down ->> Generated Task: Use evergreen command to generate task
    Task Shut Down ->> Core Analyzer Output: Upload temporary text file containing a link to the generated task
    Generated Task ->> Core Analyzer Output: Overwrite output with core dump analysis
```
In the post task section, we define the evergreen function used to generate the core analyzer task. This script runs on every task (passing or failing), is independent of anything that happened earlier in the task, and performs all of the checks needed to decide whether it should run.
The output of this script is a JSON file in the format evergreen expects. We then pass this JSON file to the `generate.tasks` evergreen command to generate the task.
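For illustration, a payload in the general style `generate.tasks` consumes could be assembled like this; the overall shape, the names, and the `run core analyzer` function are assumptions, not the script's actual output:

```python
import json


def build_generated_task_config(task_name, variant):
    """Assemble an illustrative generate.tasks-style payload (shape is assumed)."""
    config = {
        "tasks": [
            {
                "name": task_name,
                "commands": [
                    {"func": "run core analyzer"},  # hypothetical function name
                ],
            }
        ],
        "buildvariants": [
            {"name": variant, "tasks": [{"name": task_name}]},
        ],
    }
    return json.dumps(config, indent=2)
```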
After the task is generated, another script finds the task that was just generated and attaches it to the task currently running.
The reason we upload a temporary file to the original task is to attach that S3 file link to the task. Evergreen does not currently have a way to attach files to a task after it has run, so we need to upload something while the original task is still in progress.