HLO Passes

This document outlines the HLO optimization and transformation passes in the XLA compiler.

Introduction

A single HLO pass can comprise one or many compiler optimizations and transformations, and XLA provides several hundred such passes. HLO focuses only on the shape (e.g. a 3x4 matrix) and the operation semantics of the arrays, which makes optimizations and transformations easier to express.

For example:

AlgebraicSimplifier: A pass that performs a number of mostly arithmetic simplifications and optimizations, including:
- When dividing by a constant, the division is rewritten as a multiplication by the reciprocal of the constant.
HloRematerialization: A pass that recomputes selected expressions in the computation to reduce memory pressure caused by long live ranges of array-shaped values.
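The divide-by-constant rewrite mentioned above can be illustrated on a toy expression tree. This is a minimal sketch, not XLA code: the `Param`, `Const`, and `Binary` classes and the `simplify` function are hypothetical stand-ins for HLO instructions and the simplifier, and the conditions under which the real pass applies the rewrite are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Param:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Binary:
    opcode: str  # "divide" or "multiply" in this sketch
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Param, Const, Binary]


def simplify(expr: Expr) -> Expr:
    """Rewrite x / c into x * (1 / c) when c is a nonzero constant."""
    if isinstance(expr, Binary):
        lhs = simplify(expr.lhs)
        rhs = simplify(expr.rhs)
        if expr.opcode == "divide" and isinstance(rhs, Const) and rhs.value != 0.0:
            return Binary("multiply", lhs, Const(1.0 / rhs.value))
        return Binary(expr.opcode, lhs, rhs)
    return expr


def evaluate(expr: Expr, env: Dict[str, float]) -> float:
    """Evaluate the toy expression with parameter values taken from env."""
    if isinstance(expr, Param):
        return env[expr.name]
    if isinstance(expr, Const):
        return expr.value
    lhs, rhs = evaluate(expr.lhs, env), evaluate(expr.rhs, env)
    return lhs / rhs if expr.opcode == "divide" else lhs * rhs


original = Binary("divide", Param("x"), Const(4.0))
rewritten = simplify(original)
```

Here the constant 4.0 has an exactly representable reciprocal, so both forms evaluate to the same value. For constants such as 3.0 the reciprocal is rounded, so the rewritten expression can differ from the original in the last bit.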

Developer details

The base class for HLO passes can be found in xla/hlo/pass/hlo_pass_interface.h. An HLO pass should not extend this class directly; it should extend HloModulePass instead.
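The real interface is C++, so the following is only a Python stand-in that shows the general shape of a module pass: it has a name, it runs over a whole module, and it reports whether it changed anything. All names here (`ToyModule`, `ToyModulePass`, `ToyDce`) are hypothetical, and the "changed" return value is an assumption modeled on the usual pass convention rather than a copy of the XLA signature. The example pass removes instructions that the root does not depend on.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ToyModule:
    """A toy module: instruction name -> (opcode, operand names), plus a root."""

    def __init__(self, instructions: Dict[str, Tuple[str, List[str]]], root: str):
        self.instructions = dict(instructions)
        self.root = root


class ToyModulePass(ABC):
    """Hypothetical stand-in for a module-level pass base class."""

    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def run(self, module: ToyModule) -> bool:
        """Transform the module in place; return True if it was changed."""


class ToyDce(ToyModulePass):
    def name(self) -> str:
        return "toy-dce"

    def run(self, module: ToyModule) -> bool:
        # Collect everything reachable from the root through operand edges.
        live = set()
        stack = [module.root]
        while stack:
            current = stack.pop()
            if current in live:
                continue
            live.add(current)
            stack.extend(module.instructions[current][1])
        dead = [n for n in module.instructions if n not in live]
        for n in dead:
            del module.instructions[n]
        return bool(dead)


module = ToyModule(
    {
        "p0": ("parameter", []),
        "p1": ("parameter", []),
        "sum": ("add", ["p0", "p1"]),
        "unused": ("multiply", ["p0", "p1"]),
    },
    root="sum",
)
first_run_changed = ToyDce().run(module)
second_run_changed = ToyDce().run(module)
```

Returning whether the module changed lets a driver rerun a group of passes until none of them reports a change.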

Tooling and Testing

XLA comes with multiple command-line tools, including the hlo-opt tool. This tool allows execution of an individual pass independent of the given platform compilation stages. For more information, see Tooling.

For information on writing unit tests for HLO Passes see Testing HLO Passes.

Hardware-independent HLO Pass Examples

This section describes a few examples of passes shared across XLA backends. Some passes may be specialized for specific backends, but the high-level functionality is similar.

Shared passes or hardware-independent passes can be found in xla/hlo/transforms.

Rematerialization

Algebraic Simplifier

Constant Folding

Dead Code Elimination

Call Graph Flattening

Reshape Mover

Zero-sized HLO Elimination

TPU-specific HLO Pass Examples

Passes specific to the TPU backend.

Model parallelism

The partitioning of an XLA program across multiple cores is performed at the HLO level and the TPU HLO pipeline includes a number of passes for supporting multi-core execution.
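As a rough illustration of the idea, the sketch below splits a matrix row-wise into one shard per core, applies the same function to each shard, and concatenates the results. It is a toy model under stated assumptions: the helper names `partition_rows` and `run_partitioned` are hypothetical, the "cores" are simulated by a loop, and operations that need data from other cores are out of scope.

```python
from typing import Callable, List

Matrix = List[List[float]]


def partition_rows(matrix: Matrix, num_cores: int) -> List[Matrix]:
    """Split the rows into contiguous shards, at most one shard per core."""
    if not matrix:
        return []
    chunk = max(1, -(-len(matrix) // num_cores))  # ceiling division
    return [matrix[i:i + chunk] for i in range(0, len(matrix), chunk)]


def run_partitioned(
    matrix: Matrix, num_cores: int, per_core_fn: Callable[[Matrix], Matrix]
) -> Matrix:
    """Apply per_core_fn to every shard and stitch the outputs back together."""
    result: Matrix = []
    for shard in partition_rows(matrix, num_cores):
        result.extend(per_core_fn(shard))
    return result


def double(shard: Matrix) -> Matrix:
    # An elementwise operation: each output row depends only on its input row.
    return [[2.0 * v for v in row] for row in shard]


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shards = partition_rows(data, num_cores=2)
partitioned_result = run_partitioned(data, num_cores=2, per_core_fn=double)
```

For an elementwise operation like this one, the partitioned result matches the result of running on the whole matrix.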

Spatial partitioning

Handling of bfloat16

TPUs support bfloat16 as a lower-precision, more compact floating-point representation than 32-bit floats. Using bfloat16 reduces memory footprint and memory bandwidth. The TPU HLO pipeline includes various passes for replacing floats with bfloat16 in the program and propagating the precision through the graph.
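To make the precision trade-off concrete, the sketch below converts a 32-bit float to bfloat16 and back. A bfloat16 value keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32, so it occupies 2 bytes instead of 4. The function names are hypothetical, the rounding mode (round to nearest even) is an assumption, and NaN inputs are not handled.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def round_trip(x: float) -> float:
    return bfloat16_bits_to_float(float32_to_bfloat16_bits(x))


one = round_trip(1.0)          # exactly representable in bfloat16
approx_pi = round_trip(3.14159)  # loses the low mantissa bits
```

With only 7 explicit mantissa bits, 3.14159 comes back as 3.140625, while values such as 1.0 survive the round trip unchanged.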

Legalization passes

Passes that transform unsupported HLO into a form that the backend can emit or for which the backend produces a more efficient lowering.

See also GatherExpander and BatchNormExpander.
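The expander idea can be sketched on a toy instruction tree: an operation the backend cannot emit is replaced by an equivalent tree of operations it can. The example below expands a batch-norm-inference node into elementwise arithmetic using the standard formula (x - mean) / sqrt(variance + epsilon) * scale + offset. The tuple encoding, the `expand` and `evaluate` functions, and the choice of primitive set are assumptions for illustration and do not reflect how the XLA expanders are implemented.

```python
import math


def expand(instr):
    """Recursively replace batch-norm-inference nodes with primitive ops."""
    if not isinstance(instr, tuple):
        return instr  # a numeric leaf
    opcode, *operands = instr
    operands = [expand(o) for o in operands]
    if opcode == "batch-norm-inference":
        x, scale, offset, mean, variance, epsilon = operands
        inv_stddev = ("divide", 1.0, ("sqrt", ("add", variance, epsilon)))
        normalized = ("multiply", ("subtract", x, mean), inv_stddev)
        return ("add", ("multiply", normalized, scale), offset)
    return (opcode, *operands)


# The only operations this toy backend knows how to emit.
PRIMITIVES = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "sqrt": math.sqrt,
}


def evaluate(instr):
    """Evaluate a tree of primitives; unknown opcodes raise KeyError."""
    if not isinstance(instr, tuple):
        return float(instr)
    opcode, *operands = instr
    return PRIMITIVES[opcode](*[evaluate(o) for o in operands])


# Operands: x, scale, offset, mean, variance, epsilon.
unsupported = ("batch-norm-inference", 3.0, 2.0, 1.0, 1.0, 4.0, 0.0)
legalized = expand(unsupported)
value = evaluate(legalized)
```

Evaluating `unsupported` directly fails in this toy backend because the opcode is not in `PRIMITIVES`, whereas the legalized tree uses only supported operations.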

GPU-specific HLO Pass Example

Passes specific to the GPU backend are found in xla/service/gpu. These passes can be identified as classes defined in namespace gpu.

cuDNN Rewriter

Rewrites fused convolution and norm operations into their respective library calls in cuDNN.

CPU-specific HLO Pass Examples

Passes specific to the CPU backend are found in xla/service/cpu. These passes can be identified as classes defined in namespace cpu.

Convolution Canonicalization

Operation Parallelization

Analysis passes

Analysis passes are not considered "HLO passes" since they do not transform HLO and may not extend HloModulePass. Shared analyses are found in xla/hlo/analysis.
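The distinction can be sketched as follows: an analysis reads the program and returns information about it, leaving the program itself untouched. The toy cost analysis below counts floating-point operations for a list of instructions. The dictionary encoding, the `count_flops` name, and the cost model (one operation per output element for elementwise ops, 2*m*n*k for a matrix product) are assumptions for illustration.

```python
import copy
from math import prod


def count_flops(instructions):
    """Return an estimated flop count without modifying the instructions."""
    total = 0
    for instr in instructions:
        opcode = instr["opcode"]
        if opcode in ("add", "multiply"):
            # One operation per output element.
            total += prod(instr["shape"])
        elif opcode == "dot":
            # An [m, k] x [k, n] product: k multiplies and k adds per output.
            m, n = instr["shape"]
            total += 2 * m * n * instr["contracting_size"]
        # Parameters and other opcodes are treated as free in this sketch.
    return total


program = [
    {"name": "lhs", "opcode": "parameter", "shape": (2, 3)},
    {"name": "rhs", "opcode": "parameter", "shape": (3, 4)},
    {"name": "bias", "opcode": "parameter", "shape": (2, 4)},
    {"name": "product", "opcode": "dot", "shape": (2, 4), "contracting_size": 3},
    {"name": "out", "opcode": "add", "shape": (2, 4)},
]
snapshot = copy.deepcopy(program)
flops = count_flops(program)
```

The program is identical before and after the call, which is the property that separates an analysis from a transformation pass.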

Analysis Pass Examples

Dataflow Analysis

Alias Analysis

Computation Cost Analysis

HLO Verification