web/pandas/pdeps/0010-required-pyarrow-dependency.md
[TOC]
!!! note While this PDEP mentions adding pyarrow as a required dependency in pandas 3.0, this aspect has been delayed until after pandas 3.0 (see the abstract of PDEP-14). Therefore, pandas 3.0 will not have a hard requirement on pyarrow but still use pyarrow by default (for the new string dtype) when installed.
This PDEP proposes that:
FutureWarning when PyArrow is not installed in the users
environment when pandas is imported. This will ensure that only one warning is raised and users can
easily silence it if necessary. This warning will point to the feedback issue.ArrowDtype with pyarrow.string
instead of object. Additionally, we will infer all dtypes that are listed below as well instead of storing as object.This will bring immediate benefits to users, as well as opening up the door for significant further benefits in the future.
PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas:
ExtensionArray interface to provide an
optional string data type backed by PyArrowArrowExtensionArray and ArrowDtype to support all PyArrow
data types within the ExtensionArray interfaceAs of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as:
NA support for all data types;decimal, date and nested types;While all the functionality described in the previous paragraph is currently optional, PyArrow has significant integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow ecosystem (as well as improving interoperability with it).
Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is object, which has significantly much worse memory usage and performance as compared to pyarrow strings.
With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.
In [1]: import pandas as pd
In [2]: pd.Series(["a"]).dtype
# Current behavior
Out[2]: dtype('O')
# Future behavior in 3.0
Out[2]: string[pyarrow]
Dask developers investigated performance and memory of pyarrow strings here,
and found them to be a significant improvement over the current object dtype.
Little demo:
import string
import random
import pandas as pd
def random_string() -> str:
return "".join(random.choices(string.printable, k=random.randint(10, 100)))
ser_object = pd.Series([random_string() for _ in range(1_000_000)])
ser_string = ser_object.astype("string[pyarrow]")\
PyArrow backed strings are significantly faster than NumPy object strings:
str.len
In[1]: %timeit ser_object.str.len()
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[2]: %timeit ser_string.str.len()
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
str.startswith
In[3]: %timeit ser_object.str.startswith("a")
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Currently, if you try storing dicts in a pandas Series, you will again get the horrendous object dtype:
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0 {'a': 1, 'b': 2}
1 {'a': 2, 'b': 99}
dtype: object
If pyarrow were required, this could have been auto-inferred to be pyarrow.struct, which again
would come with memory and performance improvements.
Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation would improve interoperability with them, as operations such as:
import pandas as pd
import polars as pl
df = pd.DataFrame(
{
'a': ['one', 'two'],
'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
}
)
pl.from_pandas(df)
could be zero-copy. Users making use of multiple dataframe libraries would more easily be able to switch between them.
Requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited by PyArrow including:
Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes:
First, this would simplify development of pyarrow-backed datatypes, as it would avoid optional dependency checks.
Second, it could potentially remove redundant functionality:
read_parquet;read_csv logic (needs more investigation);Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow
using pip from wheels, numpy and pandas requires about 70MB, and including PyArrow requires an additional 120MB.
An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments
such as AWS Lambda.
Additionally, if a user is installing pandas in an environment where wheels are not available through a pip install or conda install,
the user will need to also build Arrow C++ and related dependencies when installing from source. These environments include
Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadance. For example when supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version.
Q: Why can't pandas just use numpy string and numpy void datatypes instead of pyarrow string and pyarrow struct?
A: NumPy strings aren't yet available, whereas pyarrow strings are. NumPy void datatype would be different to pyarrow struct, not bringing the same interoperabitlity benefit with other arrow-based dataframe libraries.
Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?
A: They will likely be ready by 3.0 - however, we're not making them the default (yet).
For example, pd.Series([1, 2, 3]) will continue to be auto-inferred to be
np.int64. We will only change the default for dtypes which currently have no numpy-backed equivalent and which are
stored as object dtype, such as strings and nested datatypes.
[^1] https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability [^2] https://arrow.apache.org/powered_by/