doc/source/user_guide/pyarrow.rst
.. _pyarrow:
{{ header }}

*********************
PyArrow Functionality
*********************

pandas can utilize `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ to extend functionality and improve the performance
of various APIs. This includes:

* More extensive `data types <https://arrow.apache.org/docs/python/api/datatypes.html>`__ compared to NumPy

To use this functionality, please ensure you have :ref:`installed the minimum supported PyArrow version. <install.optional_dependencies>`

A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray`
which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by
``[pyarrow]``, e.g. ``"int64[pyarrow]"``, into the ``dtype`` parameter.

.. ipython:: python

   ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
   ser

   idx = pd.Index([True, None], dtype="bool[pyarrow]")
   idx

   df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
   df

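
As a minimal sketch of what the string alias produces, the resulting dtype can be inspected directly. This assumes
the ``pyarrow_dtype`` attribute of :class:`ArrowDtype`, which exposes the underlying PyArrow type; the variable name
``ser_f32`` is only illustrative.

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

   # The "float32[pyarrow]" alias resolves to an ArrowDtype wrapping pa.float32()
   ser_f32 = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
   isinstance(ser_f32.dtype, pd.ArrowDtype)
   ser_f32.dtype.pyarrow_dtype
   ser_f32.dtype == pd.ArrowDtype(pa.float32())
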

.. note::

   The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
   specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
   except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
   will return :class:`ArrowDtype`.

   .. ipython:: python

      import pyarrow as pa

      data = list("abc")
      ser_sd = pd.Series(data, dtype="string[pyarrow]")
      ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
      ser_ad.dtype == ser_sd.dtype
      ser_sd.str.contains("a")
      ser_ad.str.contains("a")

For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
into :class:`ArrowDtype` to use in the ``dtype`` parameter.

.. ipython:: python

   import pyarrow as pa

   list_str_type = pa.list_(pa.string())
   ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
   ser

.. ipython:: python

   from datetime import time

   idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
   idx

.. ipython:: python

   from decimal import Decimal

   decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
   data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
   df = pd.DataFrame(data, dtype=decimal_type)
   df

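
The same pattern should extend to other parameterized types. The sketch below assumes a struct type with two
illustrative fields, ``x`` and ``y``; the field names and the variable ``ser_struct`` are hypothetical and chosen
only for demonstration.

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

   # A struct type is parameterized by its fields, so it is wrapped in ArrowDtype
   struct_type = pa.struct([("x", pa.int64()), ("y", pa.string())])
   ser_struct = pd.Series(
       [{"x": 1, "y": "a"}, None],
       dtype=pd.ArrowDtype(struct_type),
   )
   ser_struct.dtype.pyarrow_dtype == struct_type
   ser_struct.isna()
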
If you already have an :external+pyarrow:py:class:`pyarrow.Array` or :external+pyarrow:py:class:`pyarrow.ChunkedArray`,
you can pass it into :class:`.arrays.ArrowExtensionArray` to construct the associated :class:`Series`, :class:`Index`
or :class:`DataFrame` object.

.. ipython:: python

   pa_array = pa.array(
       [{"1": "2"}, {"10": "20"}, None],
       type=pa.map_(pa.string(), pa.string()),
   )
   ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
   ser

To retrieve a pyarrow :external+pyarrow:py:class:`pyarrow.ChunkedArray` from a :class:`Series` or :class:`Index`, you can call
the pyarrow array constructor on the :class:`Series` or :class:`Index`.

.. ipython:: python

   ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
   pa.array(ser)

   idx = pd.Index(ser)
   pa.array(idx)

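
To illustrate that the retrieved PyArrow data can be handed back to pandas, here is a small round-trip sketch. It
combines the two steps described above; the names ``ser_u8``, ``arr`` and ``roundtrip`` are illustrative only.

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

   ser_u8 = pd.Series([1, 2, None], dtype="uint8[pyarrow]")

   # pandas -> PyArrow: the missing value is carried over as a null
   arr = pa.array(ser_u8)
   arr.type
   arr.to_pylist()

   # PyArrow -> pandas: wrap the PyArrow data in an ArrowExtensionArray again
   roundtrip = pd.Series(pd.arrays.ArrowExtensionArray(arr))
   roundtrip.equals(ser_u8)
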
To convert a :external+pyarrow:py:class:`pyarrow.Table` to a :class:`DataFrame`, you can call the
:external+pyarrow:py:meth:`pyarrow.Table.to_pandas` method with ``types_mapper=pd.ArrowDtype``.

.. ipython:: python

   table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])

   df = table.to_pandas(types_mapper=pd.ArrowDtype)
   df
   df.dtypes

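
The reverse direction is not covered above. As a hedged sketch, PyArrow's ``pa.Table.from_pandas`` can be used to
build a table from a PyArrow-backed :class:`DataFrame`, which can then be converted back with ``types_mapper``. The
names ``df_pa``, ``table_pa`` and ``df_back`` are illustrative.

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

   df_pa = pd.DataFrame({"a": [1, 2, None]}, dtype="int64[pyarrow]")

   # DataFrame -> Table: the column keeps its Arrow int64 type
   table_pa = pa.Table.from_pandas(df_pa)
   table_pa.schema.field("a").type

   # Table -> DataFrame: types_mapper keeps the data PyArrow-backed
   df_back = table_pa.to_pandas(types_mapper=pd.ArrowDtype)
   df_back.dtypes
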
PyArrow data structure integration is implemented through pandas' :class:`~pandas.api.extensions.ExtensionArray` :ref:`interface <extending.extension-types>`;
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality
is accelerated with PyArrow `compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__ where available.

The following are just some examples of operations that are accelerated by native PyArrow compute functions.

.. ipython:: python

   import pyarrow as pa

   ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
   ser.mean()
   ser + ser
   ser > (ser + 1)

   ser.dropna()
   ser.isna()
   ser.fillna(0)

.. ipython:: python

   ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
   ser_str.str.startswith("a")

.. ipython:: python

   from datetime import datetime

   pa_type = pd.ArrowDtype(pa.timestamp("ns"))
   ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
   ser_dt.dt.strftime("%Y-%m")

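
One more sketch, showing how missing values flow through a reduction and a comparison. The names ``ser_f64`` and
``mask`` are illustrative; filling the mask with ``False`` before indexing is a choice made here for explicitness,
not a requirement stated above.

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

   ser_f64 = pd.Series([1.5, None, 3.5], dtype="float64[pyarrow]")

   # Reductions skip missing values by default
   ser_f64.sum()

   # Comparisons propagate missing values and stay PyArrow-backed
   mask = ser_f64 > 2
   mask.dtype

   # Treat missing comparison results as False when filtering
   ser_f64[mask.fillna(False)]
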
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following
functions provide an ``engine`` keyword that can dispatch to PyArrow to accelerate reading from an IO source.

* :func:`read_csv`
* :func:`read_feather`
* :func:`read_json`
* :func:`read_orc`
* :func:`read_parquet`
* :func:`read_table` (experimental)

.. ipython:: python

   import io

   data = io.StringIO("""a,b,c
   1,2.5,True
   3,4.5,False
   """)
   df = pd.read_csv(data, engine="pyarrow")
   df

By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. A reader does not necessarily need to set
``engine="pyarrow"`` to return PyArrow-backed data.

.. ipython:: python

   import io

   data = io.StringIO("""a,b,c,d,e,f,g,h,i
   1,2.5,True,a,,,,,
   3,4.5,False,b,6,7.5,True,a,
   """)
   df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
   df_pyarrow.dtypes

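
To make the independence of ``engine`` and ``dtype_backend`` concrete, the sketch below reads the same small CSV
text twice: once with the default parser and once with the PyArrow parser. The names ``csv_text``, ``df_default``
and ``df_engine`` are illustrative.

.. ipython:: python

   import io
   import pandas as pd

   csv_text = "a,b\n1,x\n2,y\n"

   # Default parser, PyArrow-backed result
   df_default = pd.read_csv(io.StringIO(csv_text), dtype_backend="pyarrow")
   df_default.dtypes

   # PyArrow parser, PyArrow-backed result
   df_engine = pd.read_csv(
       io.StringIO(csv_text), engine="pyarrow", dtype_backend="pyarrow"
   )
   df_engine.dtypes
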
Several non-IO reader functions can also use the ``dtype_backend`` argument to return PyArrow-backed data, including:

* :func:`to_numeric`
* :meth:`DataFrame.convert_dtypes`
* :meth:`Series.convert_dtypes`

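
A brief sketch of two of the functions listed above, starting from NumPy-backed data. The names ``df_np``,
``converted`` and ``num`` are illustrative.

.. ipython:: python

   import pandas as pd

   # NumPy-backed input: "a" is float64 because of the missing value
   df_np = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})

   # convert_dtypes infers the best dtypes and uses the PyArrow backend
   converted = df_np.convert_dtypes(dtype_backend="pyarrow")
   converted.dtypes

   # to_numeric parses the strings and returns PyArrow-backed integers
   num = pd.to_numeric(pd.Series(["1", "2", "3"]), dtype_backend="pyarrow")
   num.dtype
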