# XLA Intrinsic Accuracy Testing

`third_party/xla/xla/codegen/intrinsic/accuracy/README.md`


This directory contains the framework for testing the accuracy of XLA's floating-point intrinsics against high-precision ground-truth values.

The tests evaluate the implementations across a carefully selected set of input points (including edge cases such as subnormals, powers of 2, and infinities) and compare the outputs against reference values generated by mpmath. The error is measured in Units in the Last Place (ULP) and checked against platform-specific, empirically determined budgets.
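Measuring error in ULPs amounts to counting how many representable floats lie between the computed result and the reference. The helper below is a hypothetical sketch of that idea for f32, using only the standard library (it is not the framework's actual code):

```python
import struct

def ulp_distance_f32(a: float, b: float) -> int:
    """Count the representable float32 values between a and b."""
    def ordered(x: float) -> int:
        # Reinterpret the float32 bit pattern as a signed integer, remapped
        # so that consecutive integers correspond to adjacent floats.
        (bits,) = struct.unpack("<i", struct.pack("<f", x))
        return bits if bits >= 0 else -(bits & 0x7FFFFFFF)
    return abs(ordered(a) - ordered(b))

# 1.0000001 rounds to the float32 immediately above 1.0,
# so the distance is exactly 1 ULP.
print(ulp_distance_f32(1.0, 1.0000001))
```

The bit-pattern remapping makes the integer ordering match the float ordering across the sign boundary, so the same subtraction works for positive and negative inputs.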

## FAQ

### Why do we record separate golden values for f32 and f64?

Golden values are computed using mpmath at high precision to serve as ground-truth references for ULP accuracy testing. For each precision (f32, f64), mpmath must be evaluated on the exact input value that the implementation under test will receive.

Simply computing a golden value at f64 precision and rounding the input and output to f32 does not produce a valid f32 reference: rounding the input changes it by up to 0.5 ULP in f32, and the effect of that perturbation on the output is amplified by the condition number of the function. For well-conditioned points the error is negligible, but accuracy tests specifically target ill-conditioned regions (e.g., log(x) near 1, sin(x) near multiples of π, exp(x) at large arguments), where the condition number can be large enough to introduce tens or even hundreds of ULPs of spurious error into the golden value, far exceeding the 1–2 ULP accuracy we're trying to verify.

To avoid this, we generate f32 golden values by first rounding each input to f32, then evaluating mpmath on that exact value, ensuring the reference reflects the true mathematically correct result for the input the f32 implementation actually sees.
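The effect is easy to reproduce. The sketch below uses the stdlib `decimal` module as a stand-in high-precision oracle (the framework itself uses mpmath) and compares the two ways of producing an f32 reference for log(x) near 1, an ill-conditioned point:

```python
import struct
from decimal import Decimal, getcontext

getcontext().prec = 50  # high-precision oracle, standing in for mpmath

def round_to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x64 = 1.0 + 1e-7  # ill-conditioned input for log: x very close to 1

# Invalid f32 reference: evaluate the oracle at the f64 input,
# then round the result to f32.
naive = round_to_f32(float(Decimal(x64).ln()))

# Valid f32 reference: round the input to f32 first, then evaluate
# the oracle on exactly that value, matching what the f32
# implementation actually computes.
x32 = round_to_f32(x64)
golden = round_to_f32(float(Decimal(x32).ln()))

# The two "references" disagree wildly at this point.
print(naive, golden)
```

Rounding the input to f32 perturbs it by roughly one part in 10^7, but because log is ill-conditioned near 1 the two references differ by far more than the 1–2 ULP tolerance the tests enforce.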

### Why do we have separate ULP budgets for regular numbers, subnormals, and special values?

Subnormals often suffer from severe precision loss or are flushed to zero by hardware/fast-math implementations, resulting in massive ULP errors compared to typical domain values. By isolating subnormals into their own budget, we can maintain strict (e.g., 1-2 ULP) budgets for regular numbers without the tests failing on edge cases. Special values (like infinity or NaN) are budgeted by an absolute mismatch count rather than ULPs, since they represent discrete states where some hardware (especially GPUs) might intentionally deviate from IEEE-754.
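In test-harness terms, each input is bucketed before its error is compared against a budget. A minimal sketch of that bucketing, where the bucket names and budget values are illustrative rather than XLA's actual numbers:

```python
import math

F32_MIN_NORMAL = 2.0**-126  # smallest positive normal float32

def bucket(x: float) -> str:
    """Assign an input to an accuracy-budget bucket."""
    if math.isnan(x) or math.isinf(x):
        return "special"
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return "subnormal"
    return "regular"

# Hypothetical budgets: ULP limits for the numeric buckets, and an
# absolute mismatch count (not ULPs) for special values.
ULP_BUDGETS = {"regular": 2, "subnormal": 16}
MAX_SPECIAL_MISMATCHES = 1

print(bucket(1.5), bucket(1e-40), bucket(math.inf))
```

Counting mismatches rather than ULPs for the special bucket reflects that ULP distance is meaningless between, say, a NaN and an infinity; either the discrete states match or they don't.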

### How are the golden reference points chosen?

Instead of testing purely random inputs, we test a mix of common floating-point edge cases (0.0, -0.0, Inf, NaN, subnormals, powers of 2 and 10), log-spaced values near zero and near the domain bounds, as well as operation-specific dense ranges (e.g., near multiples of pi for trigonometric functions). This targets the most ill-conditioned and typical boundary regions where approximations are most likely to fail.
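A point-set generator along these lines might look like the following sketch, where the particular ranges and counts are illustrative rather than the framework's actual selection:

```python
import math

def golden_input_points():
    """Build a mix of edge cases, log-spaced values, and dense
    operation-specific ranges (all choices here are illustrative)."""
    pts = [0.0, -0.0, math.inf, -math.inf, math.nan]   # special values
    pts += [1e-45, 1e-40]                              # f32 subnormals
    pts += [2.0**e for e in range(-8, 9)]              # powers of 2
    pts += [10.0**e for e in range(-4, 5)]             # powers of 10
    pts += [10.0**-e for e in range(1, 37, 4)]         # log-spaced toward 0
    # Dense samples near multiples of pi, for trigonometric intrinsics.
    for k in (1, 2, 3):
        pts += [k * math.pi + d for d in (-1e-3, 0.0, 1e-3)]
    return pts

points = golden_input_points()
print(len(points))
```

Biasing the set toward domain boundaries and ill-conditioned neighborhoods finds approximation failures that uniform random sampling would hit only with vanishing probability.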