third_party/xla/docs/errors/error_0100.md
Category: Runtime: Buffer Allocation Failure
This error indicates that the XLA:TPU runtime's memory allocator failed to find a suitable block of memory in the accelerator's HBM for the requested allocation.
Sample error message:
ValueError: RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 8.00M. That was not possible. There are 6.43M free.; (0x0x1_HBM0)
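To compare the requested and free amounts programmatically, the message can be parsed. The following is a minimal sketch, not part of JAX or XLA: `parse_allocation_error` and `AllocationFailure` are hypothetical names, and the pattern assumes the message keeps the exact wording of the sample above. Size suffixes are returned as reported and are not converted to bytes.

```python
import re
from typing import NamedTuple, Optional


class AllocationFailure(NamedTuple):
    requested: float
    requested_unit: str
    free: float
    free_unit: str


# Written against the sample message only; other wordings will not match.
_PATTERN = re.compile(
    r"Attempting to allocate (?P<req>[\d.]+)(?P<req_unit>[A-Za-z]*)\."
    r".*?There are (?P<free>[\d.]+)(?P<free_unit>[A-Za-z]*) free"
)


def parse_allocation_error(message: str) -> Optional[AllocationFailure]:
    """Return the requested and free amounts, or None if the text does not match."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    return AllocationFailure(
        float(m["req"]), m["req_unit"], float(m["free"]), m["free_unit"]
    )
```

For the sample message, the request (8.00M) is larger than the reported free amount (6.43M).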
XLA backends: TPU
This error is thrown on:
jax.device_put or
These failures typically occur for a couple of reasons:
The TPU runtime has a number of mechanisms in place to retry failed allocations, including:
So an error encountered after the above mitigations typically requires user action.
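Because the error only surfaces once the runtime's own retries have failed, it can help to capture some context at the point of failure. The sketch below wraps `jax.device_put`; the wrapper and the detection helper are hypothetical names, the match on the message text is a heuristic based on the sample message above, and the exception class raised in practice may differ between JAX versions, which is why a broad `except` is used.

```python
import jax
import numpy as np


def is_buffer_allocation_failure(exc: BaseException) -> bool:
    """Heuristic check based on the wording of the sample error message."""
    text = str(exc)
    return "RESOURCE_EXHAUSTED" in text and "Error allocating device buffer" in text


def device_put_with_report(host_array, device=None):
    """Transfer an array to a device; print context if the allocation fails."""
    device = device or jax.devices()[0]
    try:
        return jax.device_put(host_array, device)
    except Exception as exc:  # exception class is an assumption, so match on text
        if is_buffer_allocation_failure(exc):
            nbytes = getattr(host_array, "nbytes", None)
            print(f"Allocation of {nbytes} bytes failed on {device}")
            # memory_stats() may return None on backends that report no stats.
            print("memory_stats:", device.memory_stats())
        raise  # the error still needs user action, so do not swallow it


x = device_put_with_report(np.ones((4,), dtype=np.float32))
```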
Use buffer donation (for example, jax.jit(..., donate_argnums=...)) to signal to XLA that certain input buffers can be overwritten and reused for outputs. Read Buffer donation for more details.
Ensure that jax.Array objects are not being held longer than intended. Holding on to jax.Array objects might prevent automatic de-allocation even after program compilation is completed.
See also Error code: E1000 for other strategies you can use to reduce the amount of memory each program uses.
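The two suggestions above can be sketched as follows. This is an illustrative example under stated assumptions, not a prescribed workflow: `sgd_step`, `params`, `grads`, and `scratch` are made-up names, and on backends that do not support donation JAX may emit a warning and fall back to copying.

```python
import functools

import jax
import jax.numpy as jnp


# donate_argnums=0 tells XLA it may reuse the buffer of `params` for the output.
@functools.partial(jax.jit, donate_argnums=0)
def sgd_step(params, grads):
    return params - 0.1 * grads


params = jnp.ones((1024,), dtype=jnp.float32)
grads = jnp.full((1024,), 0.5, dtype=jnp.float32)

# Rebind the name so the donated input is no longer referenced from Python.
params = sgd_step(params, grads)

# List arrays that are still alive to spot references held longer than intended.
live = jax.live_arrays()
total_bytes = sum(a.size * a.dtype.itemsize for a in live)
print(f"{len(live)} live arrays, about {total_bytes} bytes")

# Drop a reference that is no longer needed, or free a buffer explicitly.
del grads
scratch = jnp.zeros((1024,), dtype=jnp.float32)
scratch.delete()  # releases the device buffer without waiting for garbage collection
```

A donated input must not be used again after the call, which is why the result is assigned back to the same name.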
Enable the tpu_log_allocations_on_oom flag so that the allocator dumps a
detailed report of all current allocations when an OOM occurs, which can be
invaluable for debugging.
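This document does not say how the flag is set. One possibility, shown here as an assumption rather than a documented procedure, is to pass it through the `LIBTPU_INIT_ARGS` environment variable before the TPU backend is initialized. The helper name `add_libtpu_flag` is hypothetical, and the `--tpu_log_allocations_on_oom=true` spelling is assumed.

```python
import os


def add_libtpu_flag(flag: str, env=os.environ) -> str:
    """Append `flag` to LIBTPU_INIT_ARGS without dropping flags already set."""
    current = env.get("LIBTPU_INIT_ARGS", "")
    if flag not in current.split():
        current = f"{current} {flag}".strip()
    env["LIBTPU_INIT_ARGS"] = current
    return current


# Assumption: the flag is spelled --tpu_log_allocations_on_oom=true.
# Run this before JAX initializes the TPU backend (simplest: before `import jax`).
add_libtpu_flag("--tpu_log_allocations_on_oom=true")
```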