third_party/xla/docs/errors/error_0101.md
Category: Runtime: Program Allocation Failure
This error indicates that the XLA runtime on a TPU device failed to load a compiled XLA program executable into the TPU's HBM.
Sample error message:
XlaRuntimeError: RESOURCE_EXHAUSTED: Error loading program 'jit_embedding_pipeline_step_fn': Attempting to reserve 29.49G at the bottom of memory. That was not possible. There are 147.64M free, 0B reserved, and 147.64M reservable. Scope: unknown..: while running replica 0 and partition 34 of a replicated computation (other replicas may have failed as well).
XLA backends: TPU
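To gauge how far a failing program is from fitting, it can help to pull the sizes out of the message. The following is a minimal sketch that parses the sample message above with a regular expression. The helper names (`to_bytes`, `parse_program_oom`) are hypothetical, and the interpretation of the `K`/`M`/`G` suffixes as binary units is an assumption, not something the message states.

```python
import re

# Assumption: size suffixes are binary units (K = 2**10, M = 2**20, G = 2**30).
_UNITS = {"B": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

# One size token such as "29.49G" or "0B": a number followed by a unit letter.
_SIZE = r"(\d+(?:\.\d+)?)([BKMGT])"


def to_bytes(number: str, unit: str) -> int:
    """Convert a (number, unit) pair from the message into bytes."""
    return int(float(number) * _UNITS[unit])


def parse_program_oom(message: str):
    """Return (requested, free, reservable) in bytes, or None if no match."""
    pattern = (
        rf"Attempting to reserve {_SIZE}.*?"
        rf"There are {_SIZE} free, {_SIZE} reserved, and {_SIZE} reservable"
    )
    m = re.search(pattern, message, flags=re.DOTALL)
    if m is None:
        return None
    g = m.groups()  # four size tokens, two groups each
    return to_bytes(g[0], g[1]), to_bytes(g[2], g[3]), to_bytes(g[6], g[7])


sample = (
    "RESOURCE_EXHAUSTED: Error loading program "
    "'jit_embedding_pipeline_step_fn': Attempting to reserve 29.49G at the "
    "bottom of memory. That was not possible. There are 147.64M free, "
    "0B reserved, and 147.64M reservable."
)
requested, free, reservable = parse_program_oom(sample)
shortfall_gib = (requested - reservable) / 2**30
print(f"program needs {shortfall_gib:.2f} GiB more than is reservable")
```

In the sample, the requested size is roughly 200 times the reservable amount, which suggests that HBM is almost entirely occupied rather than slightly short.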
This error is typically caused by one of the following reasons:
It's important to understand how the TPU runtime prioritizes memory. Buffer allocations are privileged over loaded programs. If a buffer allocation fails, the runtime will evict already loaded programs from HBM to free up space. This can lead to a situation where a program that loaded successfully before now fails with an OOM error, because the HBM is now occupied with more data buffers.
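The priority rule described above can be illustrated with a toy model. This is only a sketch of the described behavior and not the runtime's actual allocator: the class `ToyHbm`, its methods, and the unit-less sizes are all hypothetical, and the model assumes that loading a program never evicts anything.

```python
class ToyHbm:
    """Toy model of 'buffers are privileged over programs'; not the real allocator."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffers = {}   # name -> size; never evicted in this model
        self.programs = {}  # name -> size; evicted when a buffer needs room

    def free(self) -> int:
        used = sum(self.buffers.values()) + sum(self.programs.values())
        return self.capacity - used

    def allocate_buffer(self, name: str, size: int) -> None:
        # Buffers are privileged: evict loaded programs until the buffer fits.
        while self.free() < size and self.programs:
            self.programs.popitem()
        if self.free() < size:
            raise MemoryError(f"buffer {name!r} does not fit")
        self.buffers[name] = size

    def load_program(self, name: str, size: int) -> None:
        # Assumption of this model: a program load cannot evict buffers.
        if self.free() < size:
            raise MemoryError(f"Error loading program {name!r}")
        self.programs[name] = size


hbm = ToyHbm(capacity=100)
hbm.load_program("step_fn", 30)          # loads fine on an empty device
hbm.allocate_buffer("weights", 60)       # fits next to the program
hbm.allocate_buffer("activations", 25)   # does not fit, so step_fn is evicted

try:
    hbm.load_program("step_fn", 30)      # the same program no longer fits
    reloaded = True
except MemoryError:
    reloaded = False
print("program reloaded:", reloaded)
```

The program that loaded successfully at the start fails on reload only because buffers grew in the meantime, which is the situation this error reports.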
- Use buffer donation (jax.jit(..., donate_argnums=...)) to allow XLA to reuse the memory of input buffers for storing output, reducing peak memory usage.
- Adjust the tpu_shared_memory_percent flag. Note that this might negatively affect performance.
- Ensure that jax.Array objects are not being held longer than intended. Holding on to jax.Array objects might prevent automatic de-allocation even after program compilation is completed.

See also Error code: E1000 for other strategies you can use to reduce the amount of memory each program uses.
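As a rough illustration of the buffer donation and stale-reference advice above, the sketch below donates the first argument of a jitted function and then drops a reference that is no longer needed. The function `update` and the array shapes are hypothetical. On backends that do not support donation, JAX may emit a warning and fall back to a normal copy.

```python
import functools

import jax
import jax.numpy as jnp


# donate_argnums=0 tells XLA it may reuse the buffer of `state` for the output.
@functools.partial(jax.jit, donate_argnums=0)
def update(state, grad):
    return state - 0.1 * grad


state = jnp.ones((256, 256))
grad = jnp.full((256, 256), 0.5)

# Rebind `state` to the result: a donated input must not be read again.
state = update(state, grad)

# Drop references that are no longer needed so their device memory can be
# released; a lingering Python reference keeps the buffer alive.
del grad

# jax.live_arrays() lists arrays that are still alive, which can help to
# spot references that were kept by accident.
print("live arrays:", len(jax.live_arrays()))
```

Rebinding the name to the returned value is the usual pattern with donation, since the original buffer may have been consumed by the call.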
- Set the tpu_log_allocations_on_oom flag to have the allocator dump a detailed report of all current allocations when an OOM occurs, which can be invaluable for debugging.
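How the flag is passed depends on the setup. One possibility, sketched below, is to supply it through the `LIBTPU_INIT_ARGS` environment variable before JAX initializes the TPU backend. Both the use of `LIBTPU_INIT_ARGS` for this particular flag and the `=true` value syntax are assumptions here, so check them against your libtpu version.

```python
import os

# Assumption: libtpu reads extra flags from LIBTPU_INIT_ARGS at start-up,
# so the variable has to be set before the TPU backend is initialized.
flag = "--tpu_log_allocations_on_oom=true"  # value syntax is an assumption

existing = os.environ.get("LIBTPU_INIT_ARGS", "")
if flag not in existing:
    # Append rather than overwrite, to keep any flags that are already set.
    os.environ["LIBTPU_INIT_ARGS"] = f"{existing} {flag}".strip()

# Import JAX only after the environment variable is in place:
# import jax
print(os.environ["LIBTPU_INIT_ARGS"])
```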