
.. _how-to-io:

.. Setting up files temporarily to be used in the examples below. Clean-up has to be done at the end of the document.

.. testsetup::

    >>> from numpy.testing import temppath
    >>> with open("csv.txt", "wt") as f:
    ...     _ = f.write("1, 2, 3\n4,, 6\n7, 8, 9")
    >>> with open("fixedwidth.txt", "wt") as f:
    ...     _ = f.write("1   2      3\n44      6\n7   88889")
    >>> with open("nan.txt", "wt") as f:
    ...     _ = f.write("1 2 3\n44 x 6\n7  8888 9")
    >>> with open("skip.txt", "wt") as f:
    ...     _ = f.write("1 2   3\n44    6\n7 888 9")
    >>> with open("tabs.txt", "wt") as f:
    ...     _ = f.write("1\t2\t3\n44\t \t6\n7\t888\t9")

===========================
Reading and writing files
===========================

This page tackles common applications; for the full collection of I/O routines, see :ref:`routines.io`.

Reading text and CSV_ files
===========================

.. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values

With no missing values
----------------------

Use :func:`numpy.loadtxt`.
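
For example, a minimal sketch (``plain.txt`` is a throwaway file name used
only for this illustration):

    >>> with open("plain.txt", "wt") as f:
    ...     _ = f.write("1 2 3\n4 5 6")
    >>> np.loadtxt("plain.txt")
    array([[1., 2., 3.],
           [4., 5., 6.]])
    >>> import os
    >>> os.remove("plain.txt")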

With missing values
-------------------

Use :func:`numpy.genfromtxt`.

:func:`numpy.genfromtxt` will either

* return a :ref:`masked array <maskedarray.generic>` masking out missing
  values (if ``usemask=True``), or

* fill in the missing value with the value specified in ``filling_values``
  (default is ``np.nan`` for float, -1 for int).

With non-whitespace delimiters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



    >>> with open("csv.txt", "r") as f:
    ...     print(f.read())
    1, 2, 3
    4,, 6
    7, 8, 9


Masked-array output
+++++++++++++++++++

    >>> np.genfromtxt("csv.txt", delimiter=",", usemask=True)
    masked_array(
      data=[[1.0, 2.0, 3.0],
            [4.0, --, 6.0],
            [7.0, 8.0, 9.0]],
      mask=[[False, False, False],
            [False,  True, False],
            [False, False, False]],
      fill_value=1e+20)

Array output
++++++++++++

    >>> np.genfromtxt("csv.txt", delimiter=",")
    array([[ 1.,  2.,  3.],
           [ 4., nan,  6.],
           [ 7.,  8.,  9.]])

Array output, specified fill-in value
+++++++++++++++++++++++++++++++++++++


    >>> np.genfromtxt("csv.txt", delimiter=",", dtype=np.int8, filling_values=99)
    array([[ 1,  2,  3],
           [ 4, 99,  6],
           [ 7,  8,  9]], dtype=int8)

Whitespace-delimited
~~~~~~~~~~~~~~~~~~~~

:func:`numpy.genfromtxt` can also parse whitespace-delimited data files
that have missing values if

* **Each field has a fixed width**: Use the width as the `delimiter` argument.::

    # File with width=4. The data does not have to be justified (for example,
    # the 2 in row 1), the last column can be less than width (for example, the 6
    # in row 2), and no delimiting character is required (for instance 8888 and 9
    # in row 3)

    >>> with open("fixedwidth.txt", "r") as f:
    ...    data = (f.read())
    >>> print(data)
    1   2      3
    44      6
    7   88889

    # Showing spaces as ^
    >>> print(data.replace(" ","^"))
    1^^^2^^^^^^3
    44^^^^^^6
    7^^^88889

    >>> np.genfromtxt("fixedwidth.txt", delimiter=4)
    array([[1.000e+00, 2.000e+00, 3.000e+00],
           [4.400e+01,       nan, 6.000e+00],
           [7.000e+00, 8.888e+03, 9.000e+00]])

* **A special value (e.g. "x") indicates a missing field**: Use it as the
  `missing_values` argument.

    >>> with open("nan.txt", "r") as f:
    ...     print(f.read())
    1 2 3
    44 x 6
    7  8888 9

    >>> np.genfromtxt("nan.txt", missing_values="x")
    array([[1.000e+00, 2.000e+00, 3.000e+00],
           [4.400e+01,       nan, 6.000e+00],
           [7.000e+00, 8.888e+03, 9.000e+00]])

* **You want to skip the rows with missing values**: Set
  `invalid_raise=False`.

    >>> with open("skip.txt", "r") as f:
    ...     print(f.read())
    1 2   3
    44    6
    7 888 9

    >>> np.genfromtxt("skip.txt", invalid_raise=False)  # doctest: +SKIP
    __main__:1: ConversionWarning: Some errors were detected !
        Line #2 (got 2 columns instead of 3)
    array([[  1.,   2.,   3.],
           [  7., 888.,   9.]])


* **The delimiter whitespace character is different from the whitespace that
  indicates missing data**. For instance, if columns are delimited by ``\t``,
  then missing data will be recognized if it consists of one
  or more spaces.::

    >>> with open("tabs.txt", "r") as f:
    ...    data = (f.read())
    >>> print(data)
    1       2       3
    44              6
    7       888     9

    # Tabs vs. spaces
    >>> print(data.replace("\t","^"))
    1^2^3
    44^ ^6
    7^888^9

    >>> np.genfromtxt("tabs.txt", delimiter="\t", missing_values=" +")
    array([[  1.,   2.,   3.],
           [ 44.,  nan,   6.],
           [  7., 888.,   9.]])

Read a file in .npy or .npz format
==================================

Choices:

- Use :func:`numpy.load`. It can read files generated by any of
  :func:`numpy.save`, :func:`numpy.savez`, or :func:`numpy.savez_compressed`.

- Use memory mapping. See `numpy.lib.format.open_memmap`.
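
For example, a minimal round trip with :func:`numpy.save` and
:func:`numpy.load` (``example.npy`` is a throwaway file name used only for
this sketch):

    >>> a = np.arange(6).reshape(2, 3)
    >>> np.save("example.npy", a)
    >>> np.load("example.npy")
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> import os
    >>> os.remove("example.npy")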

Write to a file to be read back by NumPy
========================================

Binary
------

Use
:func:`numpy.save`, or to store multiple arrays :func:`numpy.savez`
or :func:`numpy.savez_compressed`.
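
As a minimal sketch of storing multiple arrays (``arrays.npz`` is a throwaway
file name used only here):

    >>> a = np.ones((2, 2))
    >>> b = np.zeros(3)
    >>> np.savez("arrays.npz", a=a, b=b)
    >>> npz = np.load("arrays.npz")
    >>> npz["a"]
    array([[1., 1.],
           [1., 1.]])
    >>> npz.close()
    >>> import os
    >>> os.remove("arrays.npz")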

For :ref:`security and portability <how-to-io-pickle-file>`, set
``allow_pickle=False`` unless the dtype contains Python objects, which
requires pickling.

Masked arrays :any:`can't currently be saved <MaskedArray.tofile>`,
nor can other arbitrary array subclasses.

Human-readable
--------------

:func:`numpy.save` and :func:`numpy.savez` create binary files. To **write a
human-readable file**, use :func:`numpy.savetxt`. The array can only be 1- or
2-dimensional, and there's no ``savetxtz`` for multiple files.
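
For instance, a minimal sketch of a :func:`numpy.savetxt` round trip
(``a.txt`` is a throwaway file name used only here):

    >>> a = np.array([[1.5, 2.5], [3.5, 4.5]])
    >>> np.savetxt("a.txt", a, fmt="%.2f", delimiter=",")
    >>> np.loadtxt("a.txt", delimiter=",")
    array([[1.5, 2.5],
           [3.5, 4.5]])
    >>> import os
    >>> os.remove("a.txt")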

Large arrays
------------

See :ref:`how-to-io-large-arrays`.

Read an arbitrarily formatted binary file ("binary blob")
=========================================================

Use a :doc:`structured array <basics.rec>`.

**Example:**

The ``.wav`` file header is a 44-byte block preceding ``data_size`` bytes of the
actual sound data::

    chunk_id         "RIFF"
    chunk_size       4-byte unsigned little-endian integer
    format           "WAVE"
    fmt_id           "fmt "
    fmt_size         4-byte unsigned little-endian integer
    audio_fmt        2-byte unsigned little-endian integer
    num_channels     2-byte unsigned little-endian integer
    sample_rate      4-byte unsigned little-endian integer
    byte_rate        4-byte unsigned little-endian integer
    block_align      2-byte unsigned little-endian integer
    bits_per_sample  2-byte unsigned little-endian integer
    data_id          "data"
    data_size        4-byte unsigned little-endian integer

The ``.wav`` file header as a NumPy structured dtype::

    wav_header_dtype = np.dtype([
        ("chunk_id", (bytes, 4)), # flexible-sized scalar type, item size 4
        ("chunk_size", "<u4"),    # little-endian unsigned 32-bit integer
        ("format", "S4"),         # 4-byte string, alternate spelling of (bytes, 4)
        ("fmt_id", "S4"),
        ("fmt_size", "<u4"),
        ("audio_fmt", "<u2"),     #
        ("num_channels", "<u2"),  # .. more of the same ...
        ("sample_rate", "<u4"),   #
        ("byte_rate", "<u4"),
        ("block_align", "<u2"),
        ("bits_per_sample", "<u2"),
        ("data_id", "S4"),
        ("data_size", "<u4"),
        #
        # the sound data itself cannot be represented here:
        # it does not have a fixed size
    ])

    # "f" is assumed to be a file object opened in binary mode,
    # for example f = open("somefile.wav", "rb")
    header = np.fromfile(f, dtype=wav_header_dtype, count=1)[0]

This ``.wav`` example is for illustration; to read a ``.wav`` file in real
life, use Python's built-in module :mod:`wave`.

(Adapted from Pauli Virtanen, :ref:`advanced_numpy`, licensed
under `CC BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_.)

.. _how-to-io-large-arrays:

Write or read large arrays
==========================

**Arrays too large to fit in memory** can be treated like ordinary in-memory
arrays using memory mapping.

- Raw array data written with :func:`numpy.ndarray.tofile` or
  :func:`numpy.ndarray.tobytes` can be read with :func:`numpy.memmap`::

      array = numpy.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))

- Files output by :func:`numpy.save` (that is, using the numpy format) can be read
  using :func:`numpy.load` with the ``mmap_mode`` keyword argument::

      large_array[some_slice] = np.load("path/to/small_array", mmap_mode="r")
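
As a minimal sketch of the ``tofile``/``memmap`` round trip (``raw.arr`` is a
throwaway file name used only here; note that the raw file stores neither
shape nor dtype, so both must be supplied when mapping):

    >>> a = np.arange(12, dtype=np.int16).reshape(3, 4)
    >>> a.tofile("raw.arr")
    >>> m = np.memmap("raw.arr", mode="r", dtype=np.int16, shape=(3, 4))
    >>> m[1]
    memmap([4, 5, 6, 7], dtype=int16)
    >>> del m  # release the mapping before removing the file
    >>> import os
    >>> os.remove("raw.arr")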

Memory mapping lacks features like data chunking and compression; more
full-featured formats and libraries usable with NumPy include:

* **HDF5**: `h5py <https://www.h5py.org/>`_ or `PyTables <https://www.pytables.org/>`_.
* **Zarr**: `here <https://zarr.readthedocs.io/en/stable/tutorial.html#reading-and-writing-data>`_.
* **NetCDF**: :class:`scipy.io.netcdf_file`.

For tradeoffs among memmap, Zarr, and HDF5, see
`pythonspeed.com <https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/>`_.

Write files for reading by other (non-NumPy) tools
==================================================

Formats for **exchanging data** with other tools include HDF5, Zarr, and
NetCDF (see :ref:`how-to-io-large-arrays`).

Write or read a JSON file
=========================

NumPy arrays and most NumPy scalars are **not** directly
`JSON serializable <https://github.com/numpy/numpy/issues/12481>`_.
Instead, use a custom `json.JSONEncoder` for NumPy types, which can
be found using your favorite search engine.
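
A minimal sketch of such an encoder (the class name ``NumpyEncoder`` is just
an illustration, not a NumPy-provided class):

    >>> import json
    >>> class NumpyEncoder(json.JSONEncoder):
    ...     def default(self, obj):
    ...         # convert arrays and NumPy scalars to plain Python objects
    ...         if isinstance(obj, np.ndarray):
    ...             return obj.tolist()
    ...         if isinstance(obj, np.generic):
    ...             return obj.item()
    ...         return super().default(obj)
    >>> json.dumps({"a": np.arange(3), "b": np.float64(0.5)}, cls=NumpyEncoder)
    '{"a": [0, 1, 2], "b": 0.5}'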


.. _how-to-io-pickle-file:

Save/restore using a pickle file
================================

Avoid when possible; :doc:`pickles <python:library/pickle>` are not secure
against erroneous or maliciously constructed data.

Use :func:`numpy.save` and :func:`numpy.load`.  Set ``allow_pickle=False``,
unless the array dtype includes Python objects, in which case pickling is
required.

:func:`numpy.load` and the `pickle` module also support unpickling files
created with NumPy 1.26.

Convert from a pandas DataFrame to a NumPy array
================================================

See :meth:`pandas.DataFrame.to_numpy`.
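
A minimal sketch (assumes pandas is installed; skipped in doctests):

    >>> import pandas as pd  # doctest: +SKIP
    >>> df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})  # doctest: +SKIP
    >>> df.to_numpy()  # doctest: +SKIP
    array([[1., 3.],
           [2., 4.]])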

Save/restore using `~numpy.ndarray.tofile` and `~numpy.fromfile`
================================================================

In general, prefer :func:`numpy.save` and :func:`numpy.load`.

:func:`numpy.ndarray.tofile` and :func:`numpy.fromfile` lose information on
endianness and precision and so are unsuitable for anything but scratch
storage.


.. testcleanup::

   >>> import os
   >>> # list all files created in testsetup. If needed there are
   >>> # conveniences in e.g. astroquery to do this more automatically
   >>> for filename in ['csv.txt', 'fixedwidth.txt', 'nan.txt', 'skip.txt', 'tabs.txt']:
   ...     os.remove(filename)