third_party/xla/xla/codegen/intrinsic/accuracy/README.md
This directory contains the framework for testing the accuracy of XLA's floating-point intrinsics against high-precision ground-truth values.
The tests evaluate the implementations across a carefully selected set of input
points (including edge cases like subnormals, powers of 2, and infinities) and
compare the outputs against reference values generated by mpmath. The error is
measured in Units in the Last Place (ULP) and checked against platform-specific
empirical budgets.
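Measuring error in ULPs amounts to counting how many representable f32 values lie between the implementation's output and the reference. A minimal Python sketch of that distance (the helper name is ours, not XLA's):

```python
import struct

def ulp_distance_f32(a: float, b: float) -> int:
    """Count representable f32 values between a and b (illustrative helper,
    not XLA's actual implementation). NaN/Inf handling is omitted because
    special values are budgeted by mismatch count, not ULPs."""
    def ordered(x: float) -> int:
        # Reinterpret the f32 bit pattern as an integer that is monotonic
        # in the float ordering (negative floats map below positive ones).
        bits = struct.unpack("<i", struct.pack("<f", x))[0]
        return bits if bits >= 0 else -(bits & 0x7FFFFFFF)
    return abs(ordered(a) - ordered(b))
```

With this metric, adjacent f32 values are exactly 1 ULP apart, and +0.0 and -0.0 compare equal.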
Golden values are computed using mpmath at high precision to serve as ground-truth references for ULP accuracy testing. For each precision (f32, f64), mpmath must be evaluated on the exact input value that the implementation under test will receive. Simply computing a golden value at f64 precision and rounding the input and output to f32 does not produce a valid f32 reference, because rounding the input changes it by up to 0.5 ULP in f32, and the effect of that perturbation on the output is amplified by the condition number of the function.

For well-conditioned points, the error is negligible, but accuracy tests specifically target ill-conditioned regions (e.g., log(x) near 1, sin(x) near multiples of π, exp(x) at large arguments), where the condition number can be large enough to introduce tens or even hundreds of ULPs of spurious error in the golden value, far exceeding the 1–2 ULP accuracy we're trying to verify.

To avoid this, we generate f32 golden values by first rounding each input to f32, then evaluating mpmath on that exact value, ensuring the reference reflects the true mathematically correct result for the input the f32 implementation actually sees.
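As a sketch of this scheme (the function choice, precision setting, and helper name are illustrative, not XLA's actual generator), an f32 golden value for log might be produced like this:

```python
import numpy as np
from mpmath import mp, mpf

mp.prec = 128  # working precision well beyond f64's 53 bits

def golden_log_f32(x: float) -> np.float32:
    """Illustrative sketch of f32 golden-value generation for log."""
    # 1. Round the input to the exact f32 value the implementation will see.
    x32 = np.float32(x)
    # 2. Evaluate mpmath on that exact value at high precision. float(x32)
    #    is exact: every f32 is representable in f64, and mpf(f64) is exact.
    exact = mp.log(mpf(float(x32)))
    # 3. Round the high-precision result once, down to f32.
    return np.float32(float(exact))
```

The key step is that rounding to f32 happens before the high-precision evaluation, so the reference corresponds to the exact input the implementation receives.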
Subnormals often suffer from severe precision loss or are flushed to zero by hardware/fast-math implementations, resulting in massive ULP errors compared to typical domain values. By isolating subnormals into their own budget, we can maintain strict (e.g., 1–2 ULP) budgets for regular numbers without the tests failing on edge cases. Special values (like infinity or NaN) are budgeted by an absolute mismatch count rather than ULPs, since they represent discrete states where some hardware (especially GPUs) might intentionally deviate from IEEE-754.
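A minimal sketch of this three-way split for f32 inputs (the threshold constant and helper name are ours, not taken from the framework):

```python
import math

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal f32

def budget_bucket(x: float) -> str:
    """Illustrative bucketing of an input into its accuracy-budget class."""
    if math.isnan(x) or math.isinf(x):
        return "special"    # judged by absolute mismatch count, not ULPs
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return "subnormal"  # relaxed ULP budget
    return "regular"        # strict (e.g., 1-2 ULP) budget
```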
Instead of testing purely random inputs, we test a mix of common floating-point edge cases (0.0, -0.0, Inf, NaN, subnormals, powers of 2 and 10), log-spaced values near zero and near the domain bounds, as well as operation-specific dense ranges (e.g., near multiples of π for trigonometric functions). This targets the most ill-conditioned and typical boundary regions where approximations are most likely to fail.
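A numpy sketch of such an input mix for f32 (the specific counts and ranges are illustrative, not the framework's actual selection):

```python
import numpy as np

def make_test_inputs_f32() -> np.ndarray:
    """Illustrative mix of edge cases, log-spaced points, and dense ranges."""
    edge = np.array([0.0, -0.0, np.inf, -np.inf, np.nan,
                     2.0 ** -149,                      # smallest subnormal
                     np.finfo(np.float32).tiny],       # smallest normal
                    dtype=np.float32)
    powers = np.float32(2.0) ** np.arange(-20, 21, dtype=np.float32)
    near_zero = np.logspace(-38, -1, 50, dtype=np.float32)  # log-spaced toward 0
    # Operation-specific dense range: near multiples of pi, for trig functions.
    near_pi = (np.float32(np.pi) * np.arange(1, 5, dtype=np.float32)[:, None]
               + np.linspace(-1e-3, 1e-3, 9, dtype=np.float32)).ravel()
    return np.concatenate([edge, powers, near_zero, near_pi])
```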