docs/source/python/pandas.rst
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
.. _pandas_interop:

Pandas Integration
==================

To interface with `pandas <https://pandas.pydata.org/>`_, PyArrow provides
various conversion routines to consume pandas structures and convert back
to them.
.. note::

   While pandas uses NumPy as a backend, it has enough peculiarities
   (such as a different type system, and support for null values) that this
   is a separate topic from :ref:`numpy_interop`.
To follow examples in this document, make sure to run:
.. code-block:: python

   import pandas as pd
   import pyarrow as pa
DataFrames
----------

The equivalent to a pandas DataFrame in Arrow is a :ref:`Table <data.table>`.
Both consist of a set of named columns of equal length. While pandas only
supports flat columns, the Table also provides nested columns, thus it can
represent more data than a DataFrame, so a full conversion is not always possible.
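
For example, a struct (nested) column has no flat pandas equivalent; on
conversion it falls back to an ``object`` column of Python dictionaries. A
minimal sketch (the ``nested`` name is just for illustration):

.. code-block:: python

   >>> nested = pa.table({"s": [{"x": 1, "y": 2}, {"x": 3, "y": 4}]})
   >>> nested["s"].type
   StructType(struct<x: int64, y: int64>)
   >>> nested.to_pandas()["s"]
   0    {'x': 1, 'y': 2}
   1    {'x': 3, 'y': 4}
   Name: s, dtype: object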
Conversion from a Table to a DataFrame is done by calling
:meth:`pyarrow.Table.to_pandas`. The inverse is then achieved by using
:meth:`pyarrow.Table.from_pandas`.
.. code-block:: python
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> # Convert from pandas to Arrow
>>> table = pa.Table.from_pandas(df)
>>> # Convert back to pandas
>>> df_new = table.to_pandas()
>>> # Infer Arrow schema from pandas
>>> schema = pa.Schema.from_pandas(df)
By default ``pyarrow`` tries to preserve and restore the ``.index``
data as accurately as possible. See the section below for more about
this, and how to disable this logic.
Series
------

In Arrow, the most similar structure to a pandas Series is an Array.
It is a vector that contains data of the same type as linear memory. You can
convert a pandas Series to an Arrow Array using :meth:`pyarrow.Array.from_pandas`.
As Arrow Arrays are always nullable, you can supply an optional mask using
the ``mask`` parameter to mark all null-entries.
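
For instance, a minimal sketch using ``mask`` (a boolean NumPy array in which
``True`` marks an entry as null):

.. code-block:: python

   >>> import numpy as np
   >>> s = pd.Series([1, 2, 3])
   >>> pa.Array.from_pandas(s, mask=np.array([False, True, False]))
   <pyarrow.lib.Int64Array object at ...>
   [
     1,
     null,
     3
   ]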
Handling pandas Indexes
-----------------------

Methods like :meth:`pyarrow.Table.from_pandas` have a
``preserve_index`` option which defines how to preserve (store) or not
to preserve (to not store) the data in the ``index`` member of the
corresponding pandas object. This data is tracked using schema-level
metadata in the internal ``arrow::Schema`` object.

The default of ``preserve_index`` is ``None``, which behaves as
follows:

* ``RangeIndex`` is stored as metadata-only, not requiring any extra
  storage.
* Other index types are stored as one or more physical data columns in
  the resulting :class:`Table`.

To not store the index at all pass ``preserve_index=False``. Since
storing a ``RangeIndex`` can cause issues in some limited scenarios
(such as storing multiple DataFrame objects in a Parquet file), to
force all index data to be serialized in the resulting table, pass
``preserve_index=True``.
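
As a quick sketch of the three modes (a DataFrame with a default
``RangeIndex`` is assumed):

.. code-block:: python

   >>> df = pd.DataFrame({"a": [1, 2, 3]})
   >>> pa.Table.from_pandas(df).num_columns  # RangeIndex kept as metadata only
   1
   >>> pa.Table.from_pandas(df, preserve_index=False).num_columns  # index dropped
   1
   >>> pa.Table.from_pandas(df, preserve_index=True).num_columns  # index stored as a column
   2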
Type differences
----------------

With the current design of pandas and Arrow, it is not possible to convert all
column types unmodified. One of the main issues here is that pandas has no
support for nullable columns of arbitrary type. Also, ``datetime64`` is
currently fixed to nanosecond resolution. On the other hand, Arrow may still
be missing support for some types.
pandas -> Arrow Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~
+------------------------+--------------------------+
| Source Type (pandas) | Destination Type (Arrow) |
+========================+==========================+
| ``bool`` | ``BOOL`` |
+------------------------+--------------------------+
| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` |
+------------------------+--------------------------+
| ``float32`` | ``FLOAT`` |
+------------------------+--------------------------+
| ``float64`` | ``DOUBLE`` |
+------------------------+--------------------------+
| ``str`` / ``unicode`` | ``STRING`` |
+------------------------+--------------------------+
| ``pd.Categorical`` | ``DICTIONARY`` |
+------------------------+--------------------------+
| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` |
+------------------------+--------------------------+
| ``datetime.date`` | ``DATE`` |
+------------------------+--------------------------+
| ``datetime.time`` | ``TIME64`` |
+------------------------+--------------------------+
Arrow -> pandas Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~
+-------------------------------------+--------------------------------------------------------+
| Source Type (Arrow) | Destination Type (pandas) |
+=====================================+========================================================+
| BOOL | bool |
+-------------------------------------+--------------------------------------------------------+
| BOOL with nulls | object (with values True, False, None) |
+-------------------------------------+--------------------------------------------------------+
| (U)INT{8,16,32,64} | (u)int{8,16,32,64} |
+-------------------------------------+--------------------------------------------------------+
| (U)INT{8,16,32,64} with nulls | float64 |
+-------------------------------------+--------------------------------------------------------+
| FLOAT | float32 |
+-------------------------------------+--------------------------------------------------------+
| DOUBLE | float64 |
+-------------------------------------+--------------------------------------------------------+
| STRING | str |
+-------------------------------------+--------------------------------------------------------+
| DICTIONARY | pd.Categorical |
+-------------------------------------+--------------------------------------------------------+
| TIMESTAMP(unit=*) | pd.Timestamp (np.datetime64[ns]) |
+-------------------------------------+--------------------------------------------------------+
| DATE | object (with datetime.date objects) |
+-------------------------------------+--------------------------------------------------------+
| TIME64 | object (with datetime.time objects) |
+-------------------------------------+--------------------------------------------------------+
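
As a quick illustration of two rows from the table above: booleans with nulls
come back as ``object``, and integers with nulls come back as ``float64``:

.. code-block:: python

   >>> pa.array([True, None]).to_pandas()
   0    True
   1    None
   dtype: object
   >>> pa.array([1, None]).to_pandas()
   0    1.0
   1    NaN
   dtype: float64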
Categorical types
~~~~~~~~~~~~~~~~~
`Pandas categorical <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_
columns are converted to :ref:`Arrow dictionary arrays <data.dictionary>`,
a special array type optimized to handle a repeated and limited
number of possible values.
.. code-block:: python
>>> df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "c", "a", "b", "c"])})
>>> df.cat.dtype.categories
Index(['a', 'b', 'c'], dtype='str')
>>> df
cat
0 a
1 b
2 c
3 a
4 b
5 c
>>> table = pa.Table.from_pandas(df)
>>> table
pyarrow.Table
cat: dictionary<values=large_string, indices=int8, ordered=0>
----
cat: [ -- dictionary:
["a","b","c"] -- indices:
[0,1,2,0,1,2]]
We can inspect the :class:`~.ChunkedArray` of the created table and see the
same categories of the Pandas DataFrame.
.. code-block:: python
>>> column = table[0]
>>> chunk = column.chunk(0)
>>> chunk.dictionary
<pyarrow.lib.LargeStringArray object at ...>
[
"a",
"b",
"c"
]
>>> chunk.indices
<pyarrow.lib.Int8Array object at ...>
[
0,
1,
2,
0,
1,
2
]
Datetime (Timestamp) types
~~~~~~~~~~~~~~~~~~~~~~~~~~

`Pandas Timestamps <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html>`_
use the ``datetime64[ns]`` type in Pandas and are converted to an Arrow
:class:`~.TimestampArray`.
.. code-block:: python
df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)}) df.dtypes datetime datetime64[us, UTC] dtype: object df datetime 0 2020-01-01 00:00:00+00:00 1 2020-01-01 01:00:00+00:00 2 2020-01-01 02:00:00+00:00 table = pa.Table.from_pandas(df) table pyarrow.Table datetime: timestamp[us, tz=UTC]
datetime: [[2020-01-01 00:00:00.000000Z,2020-01-01 01:00:00.000000Z,2020-01-01 02:00:00.000000Z]]
In this example the Pandas Timestamp is time zone aware
(UTC in this case), and this information is used to create the Arrow
:class:`~.TimestampArray`.
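
If a different resolution or time zone is needed on the Arrow side, an
explicit type can be passed; a minimal sketch:

.. code-block:: python

   >>> s = pd.Series(pd.date_range("2020-01-01", freq="h", periods=3, tz="UTC"))
   >>> pa.array(s, type=pa.timestamp("s", tz="UTC")).type
   TimestampType(timestamp[s, tz=UTC])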
Date types
~~~~~~~~~~
While dates can be handled using the ``datetime64[ns]`` type in
pandas, some systems work with object arrays of Python's built-in
``datetime.date`` object:
.. code-block:: python
>>> from datetime import date
>>> s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])
>>> s
0 2018-12-31
1 None
2 2000-01-01
dtype: object
When converting to an Arrow array, the ``date32`` type will be used by
default:
.. code-block:: python
>>> arr = pa.array(s)
>>> arr.type
DataType(date32[day])
>>> arr[0]
<pyarrow.Date32Scalar: datetime.date(2018, 12, 31)>
To use the 64-bit ``date64``, specify this explicitly:
.. code-block:: python
>>> arr = pa.array(s, type='date64')
>>> arr.type
DataType(date64[ms])
When converting back with ``to_pandas``, object arrays of
``datetime.date`` objects are returned:
.. code-block:: python
>>> arr.to_pandas()
0 2018-12-31
1 None
2 2000-01-01
dtype: object
If you want to use NumPy's ``datetime64`` dtype instead, pass
``date_as_object=False``:
.. code-block:: python
>>> s2 = pd.Series(arr.to_pandas(date_as_object=False))
>>> s2.dtype
dtype('<M8[ms]')
.. warning::

   As of Arrow ``0.13`` the parameter ``date_as_object`` is ``True``
   by default. Older versions must pass ``date_as_object=True`` to
   obtain this behavior.
Time types
~~~~~~~~~~

The builtin ``datetime.time`` objects inside Pandas data structures will be
converted to an Arrow ``time64`` and :class:`~.Time64Array` respectively.
.. code-block:: python
   >>> from datetime import time
   >>> s = pd.Series([time(1, 1, 1), time(2, 2, 2)])
   >>> s
   0    01:01:01
   1    02:02:02
   dtype: object
   >>> arr = pa.array(s)
   >>> arr.type
   Time64Type(time64[us])
   >>> arr
   <pyarrow.lib.Time64Array object at ...>
   [
     01:01:01.000000,
     02:02:02.000000
   ]
When converting to pandas, arrays of ``datetime.time`` objects are returned:
.. code-block:: python
   >>> arr.to_pandas()
   0    01:01:01
   1    02:02:02
   dtype: object
Nullable types
--------------

In Arrow all data types are nullable, meaning they support storing missing
values. In pandas, however, not all data types have support for missing data.
Most notably, the default integer data types do not, and will get cast to
float when missing values are introduced. Therefore, when an Arrow array or
table gets converted to pandas, integer columns will become float when
missing values are present:
.. code-block:: python
   >>> arr = pa.array([1, 2, None])
   >>> arr
   <pyarrow.lib.Int64Array object at ...>
   [
     1,
     2,
     null
   ]
   >>> arr.to_pandas()
   0    1.0
   1    2.0
   2    NaN
   dtype: float64
Pandas has experimental `nullable data types
<https://pandas.pydata.org/docs/user_guide/integer_na.html>`__. Arrow supports
round trip conversion for those:
.. code-block:: python
   >>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype="Int64")})
   >>> df
         a
   0     1
   1     2
   2  <NA>
   >>> table = pa.table(df)
   >>> table
   pyarrow.Table
   a: int64
   ----
   a: [[1,2,null]]
   >>> table.to_pandas()
         a
   0     1
   1     2
   2  <NA>
   >>> table.to_pandas().dtypes
   a    Int64
   dtype: object
This roundtrip conversion works because metadata about the original pandas DataFrame gets stored in the Arrow table. However, if you have Arrow data (or e.g. a Parquet file) not originating from a pandas DataFrame with nullable data types, the default conversion to pandas will not use those nullable dtypes.
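
The presence of this metadata can be checked directly on the schema; a quick
sketch:

.. code-block:: python

   >>> df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype="Int64")})
   >>> table = pa.table(df)
   >>> b"pandas" in table.schema.metadata  # the stored pandas type information
   True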
The :meth:`pyarrow.Table.to_pandas` method has a ``types_mapper`` keyword
that can be used to override the default data type used for the resulting
pandas DataFrame. This way, you can instruct Arrow to create a pandas
DataFrame using nullable dtypes.
.. code-block:: python
table = pa.table({"a": [1, 2, None]}) table.to_pandas() a 0 1.0 1 2.0 2 NaN table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get) a 0 1 1 2 2 <NA>
The ``types_mapper`` keyword expects a function that will return the pandas
data type to use given a pyarrow data type. By using the ``dict.get`` method,
we can create such a function using a dictionary.
If you want to use all nullable dtypes currently supported by pandas, this
dictionary becomes:
.. code-block:: python
   >>> dtype_mapping = {
   ...     pa.int8(): pd.Int8Dtype(),
   ...     pa.int16(): pd.Int16Dtype(),
   ...     pa.int32(): pd.Int32Dtype(),
   ...     pa.int64(): pd.Int64Dtype(),
   ...     pa.uint8(): pd.UInt8Dtype(),
   ...     pa.uint16(): pd.UInt16Dtype(),
   ...     pa.uint32(): pd.UInt32Dtype(),
   ...     pa.uint64(): pd.UInt64Dtype(),
   ...     pa.bool_(): pd.BooleanDtype(),
   ...     pa.float32(): pd.Float32Dtype(),
   ...     pa.float64(): pd.Float64Dtype(),
   ...     pa.string(): pd.StringDtype(),
   ... }
   >>> df = table.to_pandas(types_mapper=dtype_mapping.get)
When using the pandas API for reading Parquet files (``pd.read_parquet(..)``),
this can also be achieved by passing ``use_nullable_dtypes``:
.. code-block:: python
df = pd.read_parquet(path, use_nullable_dtypes=True) # doctest: +SKIP
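
Note that in more recent pandas versions (2.0 and later), the
``use_nullable_dtypes`` keyword was deprecated in favor of ``dtype_backend``;
a sketch of the equivalent call:

.. code-block:: python

   df = pd.read_parquet(path, dtype_backend="numpy_nullable")  # doctest: +SKIP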
Memory Usage and Zero Copy
--------------------------

When converting from Arrow data structures to pandas objects using various
``to_pandas`` methods, one must occasionally be mindful of issues related to
performance and memory usage.
Since pandas's internal data representation is generally different from the Arrow columnar format, zero copy conversions (where no memory allocation or computation is required) are only possible in certain limited cases.
In the worst case scenario, calling ``to_pandas`` will result in two versions
of the data in memory, one for Arrow and one for pandas, yielding approximately
twice the memory footprint. We have implemented some mitigations for this case,
particularly when creating large DataFrame objects, that we describe below.
Zero Copy Series Conversions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Zero copy conversions from ``Array`` or ``ChunkedArray`` to NumPy arrays or
pandas Series are possible in certain narrow cases:
* The Arrow data is stored in an integer (signed or unsigned ``int8`` through
``int64``) or floating point type (``float16`` through ``float64``). This
includes many numeric types as well as timestamps.
* The Arrow data has no null values (since these are represented using bitmaps
which are not supported by pandas).
* For ``ChunkedArray``, the data consists of a single chunk,
i.e. ``arr.num_chunks == 1``. Multiple chunks will always require a copy
because of pandas's contiguousness requirement.
In these scenarios, ``to_pandas`` or ``to_numpy`` will be zero copy. In all
other scenarios, a copy will be required.
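
These conditions can be observed with :meth:`pyarrow.Array.to_numpy`, which
by default only succeeds when the conversion is zero copy; a quick sketch:

.. code-block:: python

   >>> pa.array([1.0, 2.0, 3.0]).to_numpy()  # no nulls: zero copy possible
   array([1., 2., 3.])
   >>> pa.array([1.0, None]).to_numpy(zero_copy_only=False)  # nulls force a copy
   array([ 1., nan])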
Reducing Memory Use in ``Table.to_pandas``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As of this writing, pandas applies a data management strategy called
"consolidation" to collect like-typed DataFrame columns in two-dimensional
NumPy arrays, referred to internally as "blocks". We have gone to great effort
to construct the precise "consolidated" blocks so that pandas will not perform
any further allocation or copies after we hand off the data to
:class:`pandas.DataFrame`. The obvious downside of this consolidation
strategy is that it forces a "memory doubling".
To try to limit the potential effects of "memory doubling" during
``Table.to_pandas``, we provide a couple of options:

* ``split_blocks=True``, when enabled ``Table.to_pandas`` produces one internal
  DataFrame "block" for each column, skipping the "consolidation" step. Note
  that many pandas operations will trigger consolidation anyway, but the peak
  memory use may be less than the worst case scenario of a full memory
  doubling. As a result of this option, we are able to do zero copy conversions
  of columns in the same cases where we can do zero copy with ``Array`` and
  ``ChunkedArray``.

* ``self_destruct=True``, this destroys the internal Arrow memory buffers in
  each column ``Table`` object as they are converted to the pandas-compatible
  representation, potentially releasing memory to the operating system as soon
  as a column is converted. Note that this renders the calling ``Table`` object
  unsafe for further use, and any further methods called will cause your Python
  process to crash.

Used together, the call
.. code-block:: python
   df = table.to_pandas(split_blocks=True, self_destruct=True)
   del table  # not necessary, but a good practice
will yield significantly lower memory usage in some scenarios. Without these
options, ``to_pandas`` will always double memory.
Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
conversion happens column by column, memory is also freed column by column. But
if multiple columns share an underlying buffer, then no memory will be freed
until all of those columns are converted. In particular, due to implementation
details, data that comes from IPC or Flight is prone to this, as memory will be
laid out as follows::
   Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
   Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
   ...
In this case, no memory can be freed until the entire table is converted, even
with ``self_destruct=True``.