doc/neps/nep-0051-scalar-representation.rst
.. _NEP51:
:Author: Sebastian Berg :Status: Accepted :Type: Standards Track :Created: 2022-09-13 :Resolution: https://mail.python.org/archives/list/[email protected]/message/U2A4RCJSXMK7GG23MA5QMRG4KQYFMO2S/
NumPy has scalar objects ("NumPy scalar") representing a single value corresponding to a NumPy DType. The representation of these currently matches that of the Python builtins, giving::
>>> np.float32(3.0)
3.0
In this NEP we propose to change the representation to include the NumPy scalar type information. Changing the above example to::
>>> np.float32(3.0)
np.float32(3.0)
We expect that this change will help users distinguish the NumPy scalars from the Python builtin types and clarify their behavior.
The distinction between NumPy scalars and Python builtins will further become
more important for users once :ref:NEP 50 <NEP50> is adopted.
These changes do lead to smaller incompatible and infrastructure changes related to array printing.
This NEP proposes to change the representation of the following NumPy scalars types to distinguish them from the Python scalars:
np.bool_np.uint8, np.int8, and all other integer scalarsnp.float16, np.float32, np.float64, np.longdoublenp.complex64, np.complex128, np.clongdoublenp.str_, np.bytes_np.void (structured dtypes)Additionally, the representation of the remaining NumPy scalars will be
adapted to print as np. rather than numpy.:
np.datetime64 and np.timedelta64np.void (unstructured version)The NEP does not propose to change how these scalars print – only
their representation (__repr__) will be changed.
Further, array representation will not be affected since it already
includes the dtype= when necessary.
The main motivation behind the change is that the Python numerical types
behave differently from the NumPy scalars.
For example numbers with lower precision (e.g. uint8 or float16)
should be used with care and users should be aware when they are working
with them. All NumPy integers can experience overflow which Python integers
do not.
These differences will be exacerbated when adopting :ref:NEP 50 <NEP50>
because the lower precision NumPy scalar will be preserved more often.
Even np.float64, which is very similar to Python's float and inherits
from it, does behave differently for example when dividing by zero.
Another common source of confusion are the NumPy booleans. Python programmers
sometimes write obj is True and will surprised when an object that shows
as True fails to pass the test.
It is much easier to understand this behavior when the value is
shown as np.True_.
Not only do we expect the change to help users better understand and be reminded of the differences between NumPy and Python scalars, but we also believe that the awareness will greatly help debugging.
Most user code should not be impacted by the change, but users will now often see NumPy values shown as::
np.True_
np.float64(3.0)
np.int64(34)
and so on. This will also mean that documentation and output in Jupyter notebook cells will often show the type information intact.
np.longdouble and np.clongdouble will print with single quotes::
np.longdouble('3.0')
to allow round-tripping. Additionally to this change, float128 will
now always be printed as longdouble since the old name gives a false
impression of precision.
We expect that most workflows will not be affected as only printing changes. In general we believe that informing users about the type they are working with outweighs the need for adapting printing in some instances.
The NumPy test suite includes code such as decimal.Decimal(repr(scalar)).
This code needs to be modified to use the str().
An exception to this are downstream libraries with documentation and especially documentation testing. Since the representation of many values will change, in many cases the documentation will have to be updated. This is expected to require larger documentation fixups in the mid-term.
It may be necessary to adopt tools for doctest testing to allow approximate value checking for the new representation.
arr.tofile()arr.tofile() currently stores values as repr(arr.item()) when in text
mode. This is not always ideal since that may include a conversion to
Python.
One issue is that this would start saving longdouble as
np.longdouble('3.1') which is clearly not desired. We expect that this
method is rarely used for object arrays. For string arrays, using the repr
also leads to storing "string" or b"string" which seems rarely desired.
The proposal is to change the default (back) to use str rather than
repr. If repr is desired, users will have to pass fmt=%r.
This NEP proposes to change the representation for NumPy scalars to:
np.True_ and np.False_ for booleans (their singleton instances)np.scalar(<value>), i.e. np.float64(3.0) for all numerical dtypes.np.longdouble and np.clongdouble will be given in quotes:
np.longdouble('3.0'). This ensures that it can always roundtrip correctly
and matches the way that decimal.Decimal behaves.
For these two the size-based name such as float128 will not be used
as the actual size is platform-dependent and therefore misleading.np.str_("string") and np.bytes_(b"byte_string") for string dtypes.np.void((3, 5), dtype=[('a', '<i8'), ('b', 'u1')]) (similar to arrays)
for structured types. This will be valid syntax to recreate the scalar.Unlike arrays, the scalar representation should round-trip correctly, so longdouble values will be quoted and other values never be truncated.
In some places (i.e. masked arrays, void and record scalars) we will want to print the representation without the type. For example::
np.void(('3.0',), dtype=[('a', 'f16')]) # longdouble
should print the 3.0 with quotes (to ensure round-tripping), but not repeat
the full np.longdouble('3.0') as the dtype includes the longdouble
information.
To allow this, a new semi-public np.core.array_print.get_formatter() will
be introduced to expand the current functionality (see Implementation).
Some other parts of NumPy will indirectly be changed. Masked arrays
fill_value will be adapted to only include the full scalar information
such as fill_value=np.float64(1e20) when the dtype of the array
mismatches.
For longdouble (with matching dtype), it will be printed as
fill_value='3.1' including the quotes which (in principle but likely not
in practice) ensure round-tripping.
It should be noted that for strings it is typical for the dtypes to mismatch
in the string length. So that strings will usually be printed as
np.str_("N/A").
The np.record scalar will be aligned with np.void and print identically
to it (except the name itself). For example as:
np.record((3, 5), dtype=[('a', '<i8'), ('b', 'u1')])
longdouble and clongdoubleFor longdouble and clongdouble values such as::
np.sqrt(np.longdouble(2.))
may not roundtrip unless quoted as strings (as the conversion to a Python float
would lose precision). This NEP proposes to use a single quote similar to
Python's decimal which prints as Decimal('3.0')
longdouble can have different precision and storage sizes varying from
8 to 16 bytes. However, even if float128 is correct because the number
is stored as 128 bits, it normally does not have 128 bit precision.
(clongdouble is the same, but with twice the storage size.)
This NEP thus includes the proposal of changing the name of longdouble
to always print as longdouble and never float128 or float96.
It does not include deprecating the np.float128 alias.
However, such a deprecation may occur independently of the NEP.
One detail is that due to NumPy scalar types being based on the C types, NumPy sometimes distinguishes them, for example on most 64 bit systems (not windows)::
>>> np.longlong
numpy.longlong
>>> np.longlong(3)
np.int64(3)
The proposal will lead to the longlong name for the type while
using the int64 form for the scalar.
This choice is made since int64 is generally the more useful
information for users, but the type name itself must be precise.
A PR to only change the representation of booleans was previously
made here <https://github.com/numpy/numpy/pull/17592>_.
The implementation is (at the time of writing) largely finished and can be
found here <https://github.com/numpy/numpy/pull/22449>_
.. note::
This part has not been implemented in the
initial PR <https://github.com/numpy/numpy/pull/22449>_.
A similar change will be required to fix certain cases in printing and
allow fully correct printing e.g. of structured scalars which include
longdoubles.
A similar solution is also expected to be necessary in the future
to allow custom DTypes to correctly print.
The new representations can be mostly implemented on the scalar types with the largest changes needed in the test suite.
The proposed changes for void scalars and masked fill_value makes it
necessary to expose the scalar representation without the type.
We propose introducing the semi-public API::
np.core.arrayprint.get_formatter(*,
data=None, dtype=None, fmt=None, options=None)
to replace the current internal _get_formatting_func. This will allow
two things compared to the old function:
data may be None (if dtype is passed) allowing to not pass
multiple values that will be printed/formatted later.
fmt= will allow passing on format strings to a DType-specific element
formatter in the future. For now, get_formatter() will accept
repr or str (the singletons not strings) to format the elements
without type information ('3.1' rather than np.longdouble('3.1')).
The implementation ensures that formatting matches except for the type
information.
The empty format string will print identically to str() (with possibly
extra padding when data is passed).
get_formatter() is expected to query a user DType's method in the future
allowing customized formatting for all DTypes.
Making get_formatter public allows it to be used for np.record and
masked arrays.
Currently, the formatters themselves seem semi-public; using a single
entry-point will hopefully provide a clear API for formatting NumPy values.
The large part for the scalar representation changes had previously been done by Ganesh Kathiresan in [2]_.
Different representations can be considered: alternatives include spelling
np. as numpy. or dropping the np. part from the numerical scalars.
We believe that using np. is sufficiently clear, concise, and does allow
copy pasting the representation.
Using only float64(3.0) without the np. prefix is more concise but
contexts may exists where the NumPy dependency is not fully clear and the name
could clash with other libraries.
For booleans an alternative would be to use np.bool_(True) or bool_(True).
However, NumPy boolean scalars are singletons and the proposed formatting is more
concise. Alternatives for booleans were also discussed previously in [1]_.
For the string scalars, the confusion is generally less pronounced. It may be reasonable to defer changing these.
The proposal does not allow copy pasting nan and inf values.
They could be represented by np.float64('nan') or np.float64(np.nan)
instead.
This is more concise and Python also uses nan and inf rather than
allowing copy-pasting by showing it as float('nan'). Arguably, it would be
a smaller addition in NumPy, where the will already be always printed.
get_formatter()When fmt= is passed, and specifically for the main use (in this NEP) to
format to a repr or str.
It would also be possible to use a ufunc or a direct formatting function
rather than wrapping it into a ```get_formatter()`` which relies on
instantiating a formatter class for the DType.
This NEP does not preclude creating a ufunc or making a special path.
However, NumPy array formatting commonly looks at all values to be formatted
in order to add padding for alignment or give uniform exponential output.
In this case data= is passed and used in preparation. This form of
formatting (unlike the scalar case where data=None would be desired) is
unfortunately fundamentally incompatible with UFuncs.
The use of the singleton str and repr ensures that future formatting
strings like f"{arr:r}" are not in any way limited by using "r" or
"s" instead.
.. [1] https://github.com/numpy/numpy/issues/12950 .. [2] https://github.com/numpy/numpy/pull/17592
This document has been placed in the public domain.