docs/docsgen/source/technical/float4.md
(onnx-detail-float4)=
4-bit floating-point formats have emerged as a solution to the rising cost and deployment challenges of large language models. The S1E2M1 format is part of the Open Compute Project (OCP) standard.
As a result, a new data type was introduced in onnx==1.18.0
and is supported by a limited set of operators to enable computation
with float4.
FLOAT4E2M1: 1 bit for the sign, 2 bits for the exponent, and 1 bit for the mantissa.
No NaN or infinities. $S$ stands for the sign. $10_2$ denotes a number in base 2.
.. list-table:: Float4 type
   :widths: 10 10
   :header-rows: 1

   * -
     - E2M1
   * - Exponent bias
     - 1
   * - Infinities
     -
   * - NaN
     -
   * - Zeros
     - :math:`S.00.0_2`
   * - Max
     - :math:`S.11.1_2`
   * - Min
     - :math:`S.00.1_2 = 2^{-1}`

Let's denote the bit representation as $S.b_2 b_1 b_0$. The float value is defined by the following expressions:
.. list-table:: Float4 type values
   :widths: 10 10
   :header-rows: 1

   * -
     - E2M1
   * - exponent :math:`\neq` 0
     - :math:`(-1)^S 2^{\sum_{i=1}^2 b_i 2^{i-1} - 1} \left( 1 + b_0 2^{-1} \right)`
   * - exponent :math:`=` 0
     - :math:`(-1)^S b_0 2^{-1}`

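The two expressions above can be sketched in Python as follows. This is a minimal illustration, not the ONNX implementation: the function name `e2m1_to_float` is hypothetical, and the 4-bit code is assumed to be laid out as $S.b_2 b_1 b_0$ with the sign in the most significant bit.

```python
def e2m1_to_float(code: int) -> float:
    """Decode a 4-bit E2M1 code (layout S.b2 b1 b0) into a Python float.

    Hypothetical helper for illustration only.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11  # bits b2 b1
    mantissa = code & 0b1          # bit b0
    if exponent == 0:
        # Subnormal case: (-1)^S * b0 * 2^-1
        return sign * mantissa * 0.5
    # Normal case: (-1)^S * 2^(exponent - bias) * (1 + b0 * 2^-1), with bias = 1
    return sign * 2.0 ** (exponent - 1) * (1.0 + mantissa * 0.5)


# Decode the eight codes with a cleared sign bit.
positive_values = [e2m1_to_float(code) for code in range(8)]
```

Setting the sign bit (codes 8 to 15) yields the same magnitudes with a negative sign.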
The following table lists all the values representable by float4 E2M1, ignoring the sign bit:
.. list-table:: Float4 type values
   :widths: 10 10
   :header-rows: 1

   * - bits (ignoring sign bit)
     - E2M1
   * - 000
     - 0
   * - 001
     - 0.5
   * - 010
     - 1
   * - 011
     - 1.5
   * - 100
     - 2
   * - 101
     - 3
   * - 110
     - 4
   * - 111
     - 6

Upcasting from float4 to float32, float16, bfloat16, and float8 is exact. The behavior for downcasting to float4 is summarized below:
| x | E2M1 |
|---|---|
| -6<=x<=6 | E2M1 converted value of x. Round to nearest even. |
| x=+/-0 | +/-0 |
| x>6 | 6 |
| x<-6 | -6 |
| +Inf | 6 |
| -Inf | -6 |
| NaN | 6 |
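
The downcasting rules in the table can be sketched as a small Python function. This is an illustrative sketch under stated assumptions, not the ONNX reference code: `float_to_e2m1` and `E2M1_MAGNITUDES` are hypothetical names, the result is returned as a 4-bit code with the sign in the most significant bit, and NaN is mapped to +6 regardless of its sign bit (the table only states that NaN becomes 6).

```python
import math

# Magnitudes representable by E2M1, indexed by the 3 low bits of the code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def float_to_e2m1(x: float) -> int:
    """Downcast a float to a 4-bit E2M1 code (hypothetical helper)."""
    if math.isnan(x):
        return 0b0111  # NaN -> 6 (positive sign is an assumption)
    sign = 0b1000 if math.copysign(1.0, x) < 0 else 0
    magnitude = abs(x)
    if magnitude >= 6.0:
        return sign | 0b0111  # saturate, including +/-Inf
    # Round to nearest; on a tie prefer the code whose mantissa bit is 0 (even).
    best = min(range(8), key=lambda c: (abs(E2M1_MAGNITUDES[c] - magnitude), c & 1))
    return sign | best
```

For example, 2.5 lies halfway between 2 (code `100`) and 3 (code `101`), so it rounds to 2, whose mantissa bit is even; likewise 5 rounds to 4 rather than 6.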
Float4 values are stored two per byte (2x4 bits).
The first element is stored in the 4 LSB and the second element is stored in the 4 MSB,
i.e. for elements x and y that are consecutive elements in the array:
pack(x, y): `(y << 4) | (x & 0x0F)`
unpack(z): `x = z & 0x0F`, `y = z >> 4`
If the total number of elements is odd, 4 bits of padding are appended.
The storage size of a 4-bit tensor with N elements is ceil(N/2) bytes.
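
The packing scheme can be sketched in Python as below. The helper names `pack_float4` and `unpack_float4` are hypothetical, the inputs are assumed to be 4-bit codes given as integers, and the padding nibble is assumed to be zero (the text above only says that 4 bits of padding are appended).

```python
def pack_float4(codes):
    """Pack a sequence of 4-bit codes into bytes, two codes per byte."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad odd-length input; zero padding is an assumption
    # Even-indexed element goes to the 4 LSB, odd-indexed element to the 4 MSB.
    return bytes(
        ((y & 0x0F) << 4) | (x & 0x0F)
        for x, y in zip(codes[0::2], codes[1::2])
    )


def unpack_float4(data, count):
    """Unpack `count` 4-bit codes from packed bytes, dropping any padding."""
    codes = []
    for z in data:
        codes.append(z & 0x0F)  # first element: low nibble
        codes.append(z >> 4)    # second element: high nibble
    return codes[:count]


packed = pack_float4([0x1, 0x2, 0xF])
restored = unpack_float4(packed, 3)
```

Because the padding cannot be distinguished from a real element after packing, the element count has to be tracked separately, which is why `unpack_float4` takes `count` as an argument.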