.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.json
.. _json:

Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns "a", "b", "c", "d":

.. code-block:: json

   {"a": 1, "b": 2.0, "c": "foo", "d": false}
   {"a": 4, "b": -5.5, "c": null, "d": true}

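The features currently offered are the following:

* Multi-threaded or single-threaded reading.
* Automatic decompression of input files (based on the filename extension,
  such as ``my_data.json.gz``).
* Sophisticated type inference (see below).

.. note::
   Currently only the line-delimited JSON format is supported.
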
JSON reading functionality is available through the :mod:`pyarrow.json` module.
In many cases, you will simply call the :func:`read_json` function
with the file path you want to read from:

.. code-block:: python

   >>> from pyarrow import json
   >>> fn = 'my_data.json'  # doctest: +SKIP
   >>> table = json.read_json(fn)  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   a: int64
   b: double
   c: string
   d: bool
   >>> table.to_pandas()  # doctest: +SKIP
      a    b     c      d
   0  1  2.0   foo  False
   1  4 -5.5  None   True

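Arrow :ref:`data types <data.types>` are inferred from the JSON types and
values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any
  other type.
* JSON booleans convert to ``bool_``.
* JSON numbers convert to ``int64``, falling back to ``float64`` if a
  non-integer is encountered.
* JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
  to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
* JSON arrays convert to a ``list`` type, and inference proceeds recursively
  on the JSON arrays' values.
* Nested JSON objects convert to a ``struct`` type, and inference proceeds
  recursively on the JSON objects' values.

Thus, reading this JSON file: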

.. code-block:: json

   {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
   {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

returns the following data:

.. code-block:: python

   >>> table = json.read_json("my_data.json")  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   a: list<item: int64>
     child 0, item: int64
   b: struct<c: bool, d: timestamp[s]>
     child 0, c: bool
     child 1, d: timestamp[s]
   >>> table.to_pandas()  # doctest: +SKIP
              a                                       b
   0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
   1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

To alter the default parsing settings when reading JSON files with an
unusual structure, create a :class:`ParseOptions` instance
and pass it to :func:`read_json`. For example, you can pass an explicit
:ref:`schema <data.schema>` in order to bypass automatic type inference.

Similarly, you can choose performance settings by passing a
:class:`ReadOptions` instance to :func:`read_json`.
For memory-constrained environments, it is also possible to read a JSON file
one batch at a time, using :func:`open_json`.

In this case, type inference is done on the first block and types are frozen afterwards.
To make sure the right data types are inferred, either set
:attr:`ReadOptions.block_size` to a large enough value, or use
:attr:`ParseOptions.explicit_schema` to set the desired data types explicitly.