.. _developer:
{{ header }}
.. currentmodule:: pandas
Developer
=========
This section will focus on downstream applications of pandas.
.. _apache.parquet:
The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: shell

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: shell

   struct KeyValue {
     1: required string key
     2: optional string value
   }
So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData``, with the value stored as:

.. code-block:: text

   {'index_columns': [<descr0>, <descr1>, ...],
    'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION,
    'creator': {
      'library': $LIBRARY,
      'version': $LIBRARY_VERSION
    }}
The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
strings (referring to a column) or dictionaries with values as described below.

The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
for each column, including the index columns. This has JSON form:

.. code-block:: text

   {'name': column_name,
    'field_name': parquet_column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': metadata}
See below for the detailed specification for these.
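As a rough illustration of the layout above, the following sketch assembles a
small metadata value by hand and round-trips it through the standard ``json``
module. It assumes the value is serialized as a JSON string. The helper
``column_entry`` and the sample version strings are hypothetical and are not
part of pandas or any Parquet library.

.. code-block:: python

   import json


   def column_entry(name, field_name, pandas_type, numpy_type, metadata=None):
       """Build one per-column dictionary (hypothetical helper)."""
       return {
           "name": name,
           "field_name": field_name,
           "pandas_type": pandas_type,
           "numpy_type": numpy_type,
           "metadata": metadata,
       }


   meta = {
       "index_columns": ["__index_level_0__"],
       "column_indexes": [
           column_entry(None, "None", "unicode", "object", {"encoding": "UTF-8"})
       ],
       "columns": [
           column_entry("c0", "c0", "int8", "int8"),
           column_entry(None, "__index_level_0__", "int64", "int64"),
       ],
       "pandas_version": "1.4.0",  # sample value only
       "creator": {"library": "pyarrow", "version": "0.13.0"},  # sample values
   }

   # Assumption: the value is stored as a JSON string under the "pandas" key.
   encoded = json.dumps(meta)
   decoded = json.loads(encoded)

Python ``None`` becomes JSON ``null`` and comes back as ``None``, so the
decoded dictionary compares equal to the one that was written.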
Index metadata descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~
``RangeIndex`` can be stored as metadata only, not requiring serialization. The
descriptor format for these is as follows:

.. code-block:: python

   import pandas as pd

   index = pd.RangeIndex(0, 10, 2)

   # Descriptor stored in 'index_columns' for this index
   {
       "kind": "range",
       "name": index.name,
       "start": index.start,
       "stop": index.stop,
       "step": index.step,
   }
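A reader can rebuild such an index from the descriptor alone. The sketch below
uses Python's built-in ``range`` as a stand-in for ``pd.RangeIndex`` so that it
runs without pandas installed. The helper names ``range_descriptor`` and
``range_from_descriptor`` are hypothetical.

.. code-block:: python

   def range_descriptor(start, stop, step, name=None):
       """Build a 'range' index descriptor (hypothetical helper)."""
       return {
           "kind": "range",
           "name": name,
           "start": start,
           "stop": stop,
           "step": step,
       }


   def range_from_descriptor(descr):
       """Rebuild the index values from a 'range' descriptor."""
       if not isinstance(descr, dict) or descr.get("kind") != "range":
           raise ValueError("not a range descriptor")
       return range(descr["start"], descr["stop"], descr["step"])


   descr = range_descriptor(0, 10, 2)
   values = list(range_from_descriptor(descr))

With pandas available, the same fields could instead be passed to
``pd.RangeIndex(descr["start"], descr["stop"], descr["step"], name=descr["name"])``.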
Other index types must be serialized as data columns along with the other
DataFrame columns. The metadata for these is a string indicating the name of
the field in the data columns, for example ``'__index_level_0__'``.
If an index has a non-None ``name`` attribute, and there is no other column
with a name matching that value, then the ``index.name`` value can be used as
the descriptor. Otherwise (for unnamed indexes and ones with names colliding
with other column names), a disambiguating name matching the pattern
``__index_level_\d+__`` should be used. When a named index is stored as a data
column, its ``name`` attribute is always stored in the column descriptors, as
described above.
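The naming rule can be sketched as a small function. This is only an
illustration: the helper ``index_field_names`` is hypothetical, and using the
level's position as the numeric suffix is an assumption, since the text only
requires that the name match ``__index_level_\d+__``.

.. code-block:: python

   def index_field_names(index_names, column_names):
       """Return one descriptor string per index level (hypothetical helper).

       A level keeps its own name when that name is not None and does not
       collide with a data column; otherwise it gets a generated name.
       """
       columns = set(column_names)
       result = []
       for level, name in enumerate(index_names):
           if name is not None and name not in columns:
               result.append(name)
           else:
               # Assumption: the suffix is the position of the index level.
               result.append(f"__index_level_{level}__")
       return result


   names = index_field_names([None, "key", "c0"], ["c0", "c1"])

Here the unnamed first level and the third level, whose name collides with the
data column ``c0``, both receive generated names, while ``key`` is kept as is.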
Column metadata
~~~~~~~~~~~~~~~
``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
any of the supported integer categorical types.
The ``metadata`` field is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g. ``{'timezone':
  'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is optional, and if
  omitted it is assumed to be nanoseconds.
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``

  * Here ``'type'`` is optional, and can be a nested pandas type specification
    (but not categorical).

* ``unicode``: ``{'encoding': encoding}``

  * The encoding is optional, and if not present is UTF-8.

* ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
  in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:

  * ``'pickle'``
  * ``'bson'``
  * ``'json'``

* ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
  it is assumed to be nanoseconds. This metadata is optional altogether.
For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.
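A reader might apply these defaults as in the following sketch. The names
``DEFAULTS`` and ``resolve_metadata`` are hypothetical. The sketch fills in
only the optional fields named above; required fields such as ``'timezone'``
have no default and are left for the caller to validate.

.. code-block:: python

   # Optional fields and the values assumed when they are omitted.
   DEFAULTS = {
       "datetimetz": {"unit": "ns"},
       "timedelta": {"unit": "ns"},
       "unicode": {"encoding": "UTF-8"},
   }


   def resolve_metadata(column):
       """Return a column's type metadata with optional fields filled in."""
       metadata = column.get("metadata")  # a missing key is treated as None
       defaults = DEFAULTS.get(column["pandas_type"])
       if defaults is None:
           return metadata
       # Explicit values in the stored metadata override the defaults.
       return {**defaults, **(metadata or {})}


   tz_meta = resolve_metadata(
       {"pandas_type": "datetimetz",
        "metadata": {"timezone": "America/Los_Angeles"}}
   )
   int_meta = resolve_metadata({"pandas_type": "int8"})
   text_meta = resolve_metadata({"pandas_type": "unicode", "metadata": None})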
As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'column_indexes': [
        {'name': None,
         'field_name': 'None',
         'pandas_type': 'unicode',
         'numpy_type': 'object',
         'metadata': {'encoding': 'UTF-8'}}
    ],
    'columns': [
        {'name': 'c0',
         'field_name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'field_name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'field_name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'field_name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'field_name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': None,
         'field_name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '1.4.0',
    'creator': {
      'library': 'pyarrow',
      'version': '0.13.0'
    }}
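Given metadata of this shape, a reader can separate the serialized index
columns from the ordinary data columns by matching each column's
``field_name`` against the string descriptors in ``'index_columns'``. The
sketch below uses a trimmed-down copy of the example; the helper
``split_columns`` is hypothetical.

.. code-block:: python

   def split_columns(meta):
       """Split column entries into (data, index) lists by field name."""
       # Dictionary descriptors (such as range indexes) have no data column.
       index_fields = {d for d in meta["index_columns"] if isinstance(d, str)}
       data, index = [], []
       for col in meta["columns"]:
           if col["field_name"] in index_fields:
               index.append(col)
           else:
               data.append(col)
       return data, index


   meta = {
       "index_columns": ["__index_level_0__"],
       "columns": [
           {"name": "c0", "field_name": "c0"},
           {"name": "c1", "field_name": "c1"},
           {"name": None, "field_name": "__index_level_0__"},
       ],
   }
   data, index = split_columns(meta)
   data_fields = [col["field_name"] for col in data]
   index_fields = [col["field_name"] for col in index]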