third_party/xla/docs/hlo_passes.md
This document outlines the HLO optimization and transformation passes in the XLA compiler.
A single HLO pass can consist of one or many compiler optimizations and transformations, and XLA provides several hundred such passes. HLO focuses only on the shape (e.g., a 3x4 matrix) and the operation semantics of the arrays, which makes optimization and transformation easier.
For example:
AlgebraicSimplifier:
A pass that performs a number of mostly arithmetic simplifications and
optimizations.
HloRematerialization:
A pass that recomputes selected expressions in the computation to reduce
memory pressure caused by long live ranges of array-shaped values.
The base class for HLO passes can be found in
xla/hlo/pass/hlo_pass_interface.h.
An HLO pass should not extend this class directly; it should extend
HloModulePass instead.
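The pattern can be illustrated with a small Python sketch. This is a toy analogue and not XLA's C++ API: `ToyModule`, `ToyModulePass`, and `RemoveNoOps` are hypothetical names, and the sketch assumes the convention that a pass reports whether it changed the module.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModule:
    """Hypothetical stand-in for an HLO module: an ordered list of op names."""
    ops: list = field(default_factory=list)


class ToyModulePass:
    """Hypothetical base class: subclasses transform a whole module."""

    name = "toy-pass"

    def run(self, module: ToyModule) -> bool:
        """Transforms `module` in place and returns True if it changed."""
        raise NotImplementedError


class RemoveNoOps(ToyModulePass):
    """Example pass that drops every op named "no-op"."""

    name = "remove-no-ops"

    def run(self, module: ToyModule) -> bool:
        kept = [op for op in module.ops if op != "no-op"]
        changed = len(kept) != len(module.ops)
        module.ops = kept
        return changed


module = ToyModule(ops=["parameter", "no-op", "add", "no-op"])
changed_first = RemoveNoOps().run(module)   # removes two ops
changed_second = RemoveNoOps().run(module)  # nothing left to remove
```

Returning a "changed" flag lets a driver rerun a group of passes until none of them reports a change.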
See also XLA HLO Pass Framework.
XLA comes with multiple command line tools, including the hlo-opt tool. This tool allows execution of an individual pass independent of the given platform compilation stages. For more information see Tooling.
For information on writing unit tests for HLO Passes see Testing HLO Passes.
This section describes a few examples of passes shared across XLA backends. Some passes may be specialized for specific backends, but the high-level functionality is similar.
Shared passes or hardware-independent passes can be found in
xla/hlo/transforms.
See also
HloRematerialization.
Selectively recomputes expressions within the HLO graph to reduce memory usage. Trades off higher compute for lower memory usage. Can reduce memory usage by tens of percent and is required to run many large models.
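The trade-off can be seen in a toy model. The sketch below is not XLA's algorithm: `peak_memory` is a hypothetical helper that assumes a value stays live from its definition to its last use, and the schedules and sizes are made up for illustration.

```python
def peak_memory(schedule):
    """Returns the peak total size of live values.

    schedule: list of (name, size, operand_names) in execution order.
    """
    last_use = {}
    for step, (name, _, operands) in enumerate(schedule):
        last_use[name] = step  # a value is live at least at its definition
        for operand in operands:
            last_use[operand] = step
    sizes = {name: size for name, size, _ in schedule}
    peak = 0
    for step in range(len(schedule)):
        defined = [n for n, _, _ in schedule[: step + 1]]
        live = sum(sizes[n] for n in defined if last_use[n] >= step)
        peak = max(peak, live)
    return peak


# "a" is produced early but needed again at the very end, so it stays live
# while the large temporaries "b" and "c" are computed.
original = [
    ("x", 10, []),
    ("a", 100, ["x"]),
    ("b", 100, ["a"]),
    ("c", 100, ["b"]),
    ("out", 10, ["c", "a"]),
]

# Recompute "a" (as "a2") right before its last use instead of keeping it.
rematerialized = [
    ("x", 10, []),
    ("a", 100, ["x"]),
    ("b", 100, ["a"]),
    ("c", 100, ["b"]),
    ("a2", 100, ["x"]),
    ("out", 10, ["c", "a2"]),
]

peak_before = peak_memory(original)        # 300
peak_after = peak_memory(rematerialized)   # 210
```

In this made-up schedule, one extra computation of `a` lowers the peak from 300 to 210 units.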
See also
AlgebraicSimplifier.
A grab bag of simplifications, optimizations, and canonicalizations. Analogous
to
LLVM’s instcombine pass.
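As a rough illustration of this kind of rewrite, the sketch below applies two identities (`x + 0 = x` and `x * 1 = x`) to a tiny expression tree. The tuple encoding and the `simplify` function are hypothetical, and the rule set is chosen for illustration rather than taken from the pass.

```python
def simplify(expr):
    """Recursively applies x + 0 -> x and x * 1 -> x to a tuple expression.

    Leaves are ("param", name) or ("const", value); inner nodes are
    (op, lhs, rhs) with op in {"add", "mul"}.
    """
    if expr[0] in ("param", "const"):
        return expr
    op, lhs, rhs = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == "add" and rhs == ("const", 0):
        return lhs
    if op == "add" and lhs == ("const", 0):
        return rhs
    if op == "mul" and rhs == ("const", 1):
        return lhs
    if op == "mul" and lhs == ("const", 1):
        return rhs
    return (op, lhs, rhs)


x = ("param", "x")
expr = ("add", ("mul", x, ("const", 1)), ("const", 0))  # (x * 1) + 0
simplified = simplify(expr)  # ("param", "x")
```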
See also
HloConstantFolding.
Replaces expressions which can be evaluated at compile time with their constant equivalent.
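A minimal sketch of the idea, using a hypothetical tuple encoding of expressions (not XLA's data structures): any subtree whose operands are all constants is evaluated once and replaced by the resulting constant.

```python
import operator

# Hypothetical opcode table for this toy example.
OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(expr):
    """Replaces constant-only subtrees with a single ("const", value) leaf."""
    kind = expr[0]
    if kind in ("param", "const"):
        return expr
    lhs, rhs = fold_constants(expr[1]), fold_constants(expr[2])
    if lhs[0] == "const" and rhs[0] == "const":
        return ("const", OPS[kind](lhs[1], rhs[1]))
    return (kind, lhs, rhs)


# x + (2 * 3): the multiplication does not depend on any parameter.
expr = ("add", ("param", "x"), ("mul", ("const", 2), ("const", 3)))
folded = fold_constants(expr)  # ("add", ("param", "x"), ("const", 6))
```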
See also
HloDCE.
Removes operations with unused results (fast implementation).
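One common way to implement this kind of cleanup is a reachability walk from the root. The sketch below assumes a toy graph encoded as a dictionary from instruction name to operand names. It ignores details a real pass must handle, such as instructions with side effects, and `eliminate_dead_code` is a hypothetical name.

```python
def eliminate_dead_code(instructions, root):
    """Keeps only instructions that the root transitively depends on.

    instructions: dict mapping instruction name -> list of operand names.
    """
    live = set()
    worklist = [root]
    while worklist:
        name = worklist.pop()
        if name in live:
            continue
        live.add(name)
        worklist.extend(instructions[name])
    return {n: ops for n, ops in instructions.items() if n in live}


graph = {
    "p0": [],
    "p1": [],
    "sum": ["p0", "p1"],
    "unused": ["p0"],   # result is never consumed
    "root": ["sum"],
}
pruned = eliminate_dead_code(graph, "root")
```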
See also
FlattenCallGraph.
A legalization pass which converts the HLO call graph into a tree by cloning computations. Required because memory is statically assigned to HLO operations and not based on dynamic call context.
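The cloning step can be sketched on a toy call graph. The dictionary encoding, the `flatten_call_graph` function, and the numeric clone suffixes are all hypothetical; the sketch also assumes the call graph has no recursion.

```python
import itertools


def flatten_call_graph(call_graph, entry):
    """Clones computations so that every computation has a single caller.

    call_graph: dict mapping computation name -> list of callee names.
    Returns (name of the cloned entry, flattened call graph).
    """
    counter = itertools.count()
    flat = {}

    def clone(name):
        clone_name = f"{name}.{next(counter)}"
        # Each call site gets its own private copy of the callee.
        flat[clone_name] = [clone(callee) for callee in call_graph[name]]
        return clone_name

    return clone(entry), flat


# "h" is called from both "f" and "g", so the call graph is a DAG, not a tree.
call_graph = {"main": ["f", "g"], "f": ["h"], "g": ["h"], "h": []}
entry_name, flat = flatten_call_graph(call_graph, "main")
```

After flattening, `h` exists as two separate copies, one per call site, so each copy can be given its own statically assigned memory.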
See also
ReshapeMover.
Reshapes and transposes can be expensive, especially on TPU. This pass moves reshapes and transposes across elementwise operations, enabling the operations to be merged or eliminated.
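The rewrite relies on the fact that an elementwise operation commutes with a transpose (or reshape) applied identically to all of its operands. The sketch below checks that identity on nested lists. It only demonstrates the equivalence and is not the pass itself; the helper functions are hypothetical.

```python
def transpose(matrix):
    """Transposes a 2D list of lists."""
    return [list(row) for row in zip(*matrix)]


def add(a, b):
    """Elementwise addition of two equally shaped 2D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


a = [[1, 2, 3], [4, 5, 6]]
b = [[10, 20, 30], [40, 50, 60]]

before = add(transpose(a), transpose(b))  # two transposes, then add
after = transpose(add(a, b))              # add, then a single transpose
```

Both forms produce the same values, but the second needs one transpose instead of two, and that single transpose may later merge with a neighboring operation.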
See also
ZeroSizedHloElimination.
HLO supports arrays of zero size (one or more dimensions have a bound of zero). This pass simplifies the graph by replacing zero-sized operations with zero-sized constants.
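A minimal sketch of the rewrite, assuming a toy representation in which each instruction is an `(opcode, shape)` pair. The function name is hypothetical, and the sketch ignores cases a real pass must leave alone, such as instructions with side effects.

```python
from math import prod


def eliminate_zero_sized(instructions):
    """Replaces every instruction whose shape has zero elements by a constant."""
    rewritten = []
    for opcode, shape in instructions:
        if prod(shape) == 0:   # some dimension has a bound of zero
            rewritten.append(("constant", shape))
        else:                  # note: prod(()) == 1, so scalars are kept
            rewritten.append((opcode, shape))
    return rewritten


graph = [("broadcast", (4, 0)), ("add", (4, 0)), ("multiply", (4, 3)), ("reduce", ())]
result = eliminate_zero_sized(graph)
```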
Passes specific to the TPU backend.
The partitioning of an XLA program across multiple cores is performed at the HLO level and the TPU HLO pipeline includes a number of passes for supporting multi-core execution.
See also
ShardingPropagation.
Pass to support dividing operations across devices along non-batch dimensions.
See also
BFloat16ConversionFolding,
BFloat16MixedPrecisionRemoval,
and
BFloat16Propagation.
TPUs support bfloat16 as a lower-precision, more compact floating-point representation than 32-bit floats. Using bfloat16 reduces memory footprint and memory bandwidth. The TPU HLO pipeline includes various passes for replacing floats with bfloat16 in the program and propagating the precision through the graph.
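To make the precision trade-off concrete: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a 32-bit float. The sketch below converts by simple truncation, which is an assumption made for brevity (hardware typically rounds rather than truncates); the helper names are hypothetical.

```python
import struct


def float32_bits(value):
    """Returns the IEEE 754 binary32 bit pattern of `value` as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def to_bfloat16_bits(value):
    """Truncates a float32 to its upper 16 bits (no rounding)."""
    return float32_bits(value) >> 16


def from_bfloat16_bits(bits):
    """Expands a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


bits = to_bfloat16_bits(3.14159)       # 0x4049
roundtrip = from_bfloat16_bits(bits)   # 3.140625
```

Each value takes 2 bytes instead of 4, at the cost of keeping only about two to three significant decimal digits.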
See also
GatherExpander,
and
BatchNormExpander.
Passes which transform unsupported HLO into a form which the backend can emit or for which the backend produces a more efficient lowering.
Passes specific to the GPU backend are found in
xla/service/gpu.
These passes can be identified as classes defined in namespace gpu.
See also
CudnnFusedConvRewriter
and
CudnnNormRewriter.
Rewrites fused convolution and norm operations into their respective library calls in cuDNN.
Passes specific to the CPU backend are found in
xla/service/cpu.
These passes can be identified as classes defined in namespace cpu.
See also
ConvCanonicalization.
Canonicalizes convolutions so that they can be lowered to a fast implementation in Eigen.
See also
ParallelTaskAssigner.
Partitions HLOs into tasks to run on separate threads.
Analysis passes are not considered "HLO passes" since they do not transform HLO
and may not extend HloModulePass. Shared analyses are found in
xla/hlo/analysis.
See also
HloDataflowAnalysis.
Identifies all HLO values in the graph and their uses.
See also
HloAliasAnalysis.
Identifies must-alias relationships between values in the program.
See also
HloCostAnalysis.
Computes FLOP count and memory usage for all operations in the program.
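As a back-of-the-envelope illustration of FLOP counting, the sketch below uses a common convention: a matrix multiply of shapes `[m, k]` and `[k, n]` costs `2 * m * k * n` FLOPs (one multiply and one add per term), and an elementwise operation costs one FLOP per output element. The function names are hypothetical, and XLA's actual accounting may differ in its details.

```python
from math import prod


def dot_flops(lhs_shape, rhs_shape):
    """FLOPs for a [m, k] x [k, n] matrix multiply, counting multiply and add."""
    m, k = lhs_shape
    k_rhs, n = rhs_shape
    if k != k_rhs:
        raise ValueError("contracting dimensions must match")
    return 2 * m * k * n


def elementwise_flops(shape):
    """One FLOP per output element."""
    return prod(shape)


# Toy program: out = dot(a[128, 256], b[256, 64]) + bias[128, 64]
total_flops = dot_flops((128, 256), (256, 64)) + elementwise_flops((128, 64))
```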
See also
HloVerifier.
Verifies various invariants of the HLO graph.
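One example of such an invariant is that an elementwise instruction's operands have the same shape as its result. The sketch below checks only that single rule on a toy representation. The function name and error format are hypothetical, and the real verifier checks many more properties.

```python
def verify_elementwise(name, result_shape, operand_shapes):
    """Returns a list of error messages; an empty list means the check passed."""
    errors = []
    for index, shape in enumerate(operand_shapes):
        if tuple(shape) != tuple(result_shape):
            errors.append(
                f"{name}: operand {index} has shape {tuple(shape)}, "
                f"expected {tuple(result_shape)}"
            )
    return errors


ok = verify_elementwise("add.1", (2, 3), [(2, 3), (2, 3)])
bad = verify_elementwise("add.2", (2, 3), [(2, 3), (3, 2)])
```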