third_party/xla/docs/megascale/hangs.md
If you see the following error message in your TPU worker logs, this means that MXLA timed out after detecting no progress:
Megascale hang detected: Timed out waiting for 4 graphs to complete at launch_id 13650. Already completed: 100. StepGloballyInProgress: true. Timeout: 1m
resource_manager job.Megascale detects a hang.Megascale detects a hang that is likely caused by bad TPU chips on the following hosts. Please remove the hosts from the fleet and restart the workload. If problem persists please contact Megascale XLA team.
The host that have bad TPUs are: <host_name>
Full error digest:
Potential cause: Bad TPU chips
Potential culprit workers: <job_name>/<task_id>:<host_name>
This error means that the issue is potentially caused by a faulty TPU chip. The
error message should include the job information and host name of the faulty
chip. In the example above, the faulty chip is on host <host_name>, affecting
task <task_id> of the job <job_name>. You can configure your job to avoid
that host.
Note: There are some cases that the hang was caused by a XLA or custom pallas kernel bug, but if you see the same host appearing multiple times (for example more than 3 times) in multiple hang events, the TPU on that host is very likely faulty.
Megascale detects a hang that is likely caused by a networking issue. Please examine the underlying networking stack for the following hosts.
The hosts are: <host_name>
Full error digest:
Potential cause: Networking issue
Potential culprit workers: <host_name_1>, <host_name_2>
This error indicates that your job has encountered a failed network link. The
error message should include a single or pair of job name, task id, host name of
the faulty network link. In the example above, the faulty network link is
between host host_name_1 and host_name_2. Sometimes RapidEye can further
localize the faulty host if a single host appears in multiple broken network
links. You can configure your job to avoid those hosts.
Megascale detects a hang that is likely caused by running different modules on different devices. Please confirm that all workers is running the exact same program. It can also be caused by a hang in a subset of devices and the unaffected devices have moved on to the next program. Please inspect the digest below to further root cause the hang.
Example hosts that have different HLO modules: <host_name>
Full error digest:
Potential cause: Different module
Potential culprit workers: <host_name>
TPU stats:
<host_name>: <pc>
TPU states:
Module: jit_loss_and_grad
Fingerprint: <fingerprint>
Launch ID: 193
<tag>:<pc>(<hlo>): <host_name>
Module: jit_optimizer_apply
Fingerprint: <fingerprint>
Launch ID: 0
<tag>:<pc>(<hlo>): <host_name>
This error may indicate that a hang has occurred in a subset of workers, causing those workers to be stuck at the current module while unaffected workers advance to the next module. To identify the root cause, inspect the digest printed by RapidEye in the logs.
The TPU states section of the logs shows which modules are running on which
workers. In the example above, workers are running different modules:
jit_loss_and_grad and jit_optimizer_apply.
Megascale detects a hang that is likely caused by inconsistent HLO module compilation across workers. This is likely a bug in JAX tracing or XLA compiler. Please inspect the HLO dumps to confirm the root cause.
Example hosts that have different HLO fingerprints: <host_name>
Full error digest:
Potential cause: Fingerprint mismatch
Potential culprit workers: <host_name>
TPU stats:
Module: reduce.31
Fingerprint: <fingerprint_1>
Launch ID: 37
<tag>:<pc>(<hlo>): <host_name>
Module: reduce.31
Fingerprint: <fingerprint_2>
Launch ID: 40
<tag>:<pc>(<hlo>): <host_name>
This log message indicates the hang was likely caused by inconsistent HLO module compilation across workers, possibly due to an issue with JAX tracing or the XLA compiler. If you see this log, follow these steps to collect HLO dumps from the culprit workers for further debugging.
Note: This error is not yet implemented.
Megascale detects a hang that is likely caused by data input stall on the
following hosts. Please check the workers to make sure the data input pipeline
is working properly.
The host that have data input stalls are: <host_name>
This error means that all devices launched the same program, but that input data was not provided to the program before the system timed out. To fix this issue, confirm that:
Some workers have halted with an unrecoverable error:
<worker> : {some error}
Please inspect the error log of these workers:
<worker>
This error means that there was an issue that prevented the program from properly executing and could not be recovered automatically. This error was unable to be specifically categorized. Further information can be obtained from checking the logs of the worker(s) mentioned in the error report.
If the error appears to be specific to the given machine (ex. failure to copy data from TPU to host), then you can configure your job to avoid those hosts.
Megascale detects a hang but cannot determine the root cause. Please inspect the
full digest below.
This error means that there was an issue that prevented the program from properly executing and could not be recovered automatically. This error was unable to be specifically categorized and there is no further error information available.
<!-- linter style on -->