third_party/xla/docs/megascale/overview.md
MegascaleXLA is a compiler and runtime system that powers large-scale TPU training. It implements collective communication primitives that let multiple TPU slices communicate with one another, so that a training job can scale beyond the limits of a single ICI (inter-chip interconnect) domain.
The debugging guide explains how to identify and diagnose problems such as slowness, hangs, or errors in a multi-slice job driven by Megascale.
Slice
RapidEye
Megascale Collective