doc/source/user_guide/missing_data.rst
.. _missing_data:
{{ header }}
Working with missing data
Values considered "missing"
pandas uses different sentinel values to represent a missing (also referred to as NA)
depending on the data type.
``numpy.nan`` for NumPy data types. The disadvantage of using NumPy data types
is that the original data type will be coerced to ``np.float64`` or ``object``.
.. ipython:: python
pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
:class:`NaT` for NumPy ``np.datetime64``, ``np.timedelta64``, and :class:`PeriodDtype`. For typing applications,
use :class:`api.typing.NaTType`.
.. ipython:: python
pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
:class:`NA` for :class:`StringDtype`, :class:`Int64Dtype` (and other bit widths),
:class:`Float64Dtype` (and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`.
These types will maintain the original data type of the data.
For typing applications, use :class:`api.typing.NAType`.
.. ipython:: python
pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])
To detect these missing value, use the :func:`isna` or :func:`notna` methods.
.. ipython:: python
ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
ser
pd.isna(ser)
.. note::
:func:`isna` or :func:`notna` will also consider ``None`` a missing value.
.. ipython:: python
ser = pd.Series([1, None], dtype=object)
ser
pd.isna(ser)
.. warning::
Equality comparisons between ``np.nan``, :class:`NaT`, and :class:`NA`
do not act like ``None``
.. ipython:: python
None == None # noqa: E711
np.nan == np.nan
pd.NaT == pd.NaT
pd.NA == pd.NA
Therefore, an equality comparison between a :class:`DataFrame` or :class:`Series`
with one of these missing values does not provide the same information as
:func:`isna` or :func:`notna`.
.. ipython:: python
ser = pd.Series([True, None], dtype="boolean[pyarrow]")
ser == pd.NA
pd.isna(ser)
.. _missing_data.NA:
:class:`NA` semantics
~~~~~~~~~~~~~~~~~~~~~
.. warning::
Experimental: the behaviour of :class:`NA` can still change without warning.
Starting from pandas 1.0, an experimental :class:`NA` value (singleton) is
available to represent scalar missing values. The goal of :class:`NA` is provide a
"missing" indicator that can be used consistently across data types
(instead of ``np.nan``, ``None`` or ``pd.NaT`` depending on the data type).
For example, when having missing values in a :class:`Series` with the nullable integer
dtype, it will use :class:`NA`:
.. ipython:: python
s = pd.Series([1, 2, None], dtype="Int64")
s
s[2]
s[2] is pd.NA
Currently, pandas does not use those data types using :class:`NA` by default in
a :class:`DataFrame` or :class:`Series`, so you need to specify
the dtype explicitly. An easy way to convert to those dtypes is explained in the
:ref:`conversion section <missing_data.NA.conversion>`.
Propagation in arithmetic and comparison operations
---------------------------------------------------
In general, missing values *propagate* in operations involving :class:`NA`. When
one of the operands is unknown, the outcome of the operation is also unknown.
For example, :class:`NA` propagates in arithmetic operations, similarly to
``np.nan``:
.. ipython:: python
pd.NA + 1
"a" * pd.NA
There are a few special cases when the result is known, even when one of the
operands is ``NA``.
.. ipython:: python
pd.NA ** 0
1 ** pd.NA
In equality and comparison operations, :class:`NA` also propagates. This deviates
from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always
return ``False``.
.. ipython:: python
pd.NA == 1
pd.NA == pd.NA
pd.NA < 2.5
To check if a value is equal to :class:`NA`, use :func:`isna`
.. ipython:: python
pd.isna(pd.NA)
.. note::
An exception on this basic propagation rule are *reductions* (such as the
mean or the minimum), where pandas defaults to skipping missing values. See the
:ref:`calculation section <missing_data.calculations>` for more.
Logical operations
------------------
For logical operations, :class:`NA` follows the rules of the
`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
*Kleene logic*, similarly to R, SQL and Julia). This logic means to only
propagate missing values when it is logically required.
For example, for the logical "or" operation (``|``), if one of the operands
is ``True``, we already know the result will be ``True``, regardless of the
other value (so regardless the missing value would be ``True`` or ``False``).
In this case, :class:`NA` does not propagate:
.. ipython:: python
True | False
True | pd.NA
pd.NA | True
On the other hand, if one of the operands is ``False``, the result depends
on the value of the other operand. Therefore, in this case :class:`NA`
propagates:
.. ipython:: python
False | True
False | False
False | pd.NA
The behaviour of the logical "and" operation (``&``) can be derived using
similar logic (where now :class:`NA` will not propagate if one of the operands
is already ``False``):
.. ipython:: python
False & True
False & False
False & pd.NA
.. ipython:: python
True & True
True & False
True & pd.NA
``NA`` in a boolean context
---------------------------
Since the actual value of an NA is unknown, it is ambiguous to convert NA
to a boolean value.
.. ipython:: python
:okexcept:
bool(pd.NA)
This also means that :class:`NA` cannot be used in a context where it is
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
potentially be :class:`NA`. In such cases, :func:`isna` can be used to check
for :class:`NA` or ``condition`` being :class:`NA` can be avoided, for example by
filling missing values beforehand.
A similar situation occurs when using :class:`Series` or :class:`DataFrame` objects in ``if``
statements, see :ref:`gotchas.truth`.
NumPy ufuncs
------------
:attr:`pandas.NA` implements NumPy's ``__array_ufunc__`` protocol. Most ufuncs
work with ``NA``, and generally return ``NA``:
.. ipython:: python
np.log(pd.NA)
np.add(pd.NA, 1)
.. warning::
Currently, ufuncs involving an ndarray and ``NA`` will return an
object-dtype filled with NA values.
.. ipython:: python
a = np.array([1, 2, 3])
np.greater(a, pd.NA)
The return type here may change to return a different array type
in the future.
See :ref:`dsintro.numpy_interop` for more on ufuncs.
.. _missing_data.NA.conversion:
Conversion
^^^^^^^^^^
If you have a :class:`DataFrame` or :class:`Series` using ``np.nan``,
:meth:`DataFrame.convert_dtypes` and :meth:`Series.convert_dtypes`, respectively,
will convert your data to use the nullable data types supporting :class:`NA`,
such as :class:`Int64Dtype` or :class:`ArrowDtype`. This is especially helpful after reading
in data sets from IO methods where data types were inferred.
.. ipython:: python
import io
data = io.StringIO("a,b\n,True\n2,")
df = pd.read_csv(data)
df.dtypes
df_conv = df.convert_dtypes()
df_conv
df_conv.dtypes
.. _missing.inserting:
Inserting missing data
~~~~~~~~~~~~~~~~~~~~~~
You can insert missing values by simply assigning to a :class:`Series` or :class:`DataFrame`.
The missing value sentinel used will be chosen based on the dtype.
.. ipython:: python
ser = pd.Series([1., 2., 3.])
ser.loc[0] = None
ser
ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])
ser.iloc[0] = np.nan
ser
ser = pd.Series([True, False], dtype="boolean[pyarrow]")
ser.iloc[0] = None
ser
For ``object`` types, pandas will use the value given:
.. ipython:: python
s = pd.Series(["a", "b", "c"], dtype=object)
s.loc[0] = None
s.loc[1] = np.nan
s
.. _missing_data.calculations:
Calculations with missing data
Missing values propagate through arithmetic operations between pandas objects.
.. ipython:: python
ser1 = pd.Series([np.nan, np.nan, 2, 3]) ser2 = pd.Series([np.nan, 1, np.nan, 4]) ser1 ser2 ser1 + ser2
The descriptive statistics and computational methods discussed in the
:ref:data structure overview <basics.stats> (and listed :ref:here <api.series.stats> and :ref:here <api.dataframe.stats>) all
account for missing data.
When summing data, NA values or empty data will be treated as zero.
.. ipython:: python
pd.Series([np.nan]).sum() pd.Series([], dtype="float64").sum()
When taking the product, NA values or empty data will be treated as 1.
.. ipython:: python
pd.Series([np.nan]).prod() pd.Series([], dtype="float64").prod()
Cumulative methods like :meth:~DataFrame.cumsum and :meth:~DataFrame.cumprod
ignore NA values by default, but preserve them in the resulting array. To override
this behaviour and include NA values in the calculation, use skipna=False.
.. ipython:: python
ser = pd.Series([1, np.nan, 3, np.nan]) ser ser.cumsum() ser.cumsum(skipna=False)
.. _missing_data.dropna:
Dropping missing data
:meth:`~DataFrame.dropna` drops rows or columns with missing data.
.. ipython:: python
df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])
df
df.dropna()
df.dropna(axis=1)
ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")
ser.dropna()
Filling missing data
~~~~~~~~~~~~~~~~~~~~
.. _missing_data.fillna:
Filling by value
----------------
:meth:`~DataFrame.fillna` replaces NA values with non-NA data.
Replace NA with a scalar value
.. ipython:: python
data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
df = pd.DataFrame(data)
df
df.fillna(0)
When the data has object dtype, you can control what type of NA values are present.
.. ipython:: python
df = pd.DataFrame({"a": [pd.NA, np.nan, None]}, dtype=object)
df
df.fillna(None)
df.fillna(np.nan)
df.fillna(pd.NA)
However when the dtype is not object, these will all be replaced with the proper NA value for the dtype.
.. ipython:: python
data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
df = pd.DataFrame(data)
df
df.fillna(None)
df.fillna(np.nan)
df.fillna(pd.NA)
Fill gaps forward or backward
.. ipython:: python
df.ffill()
df.bfill()
.. _missing_data.fillna.limit:
Limit the number of NA values filled
.. ipython:: python
df.ffill(limit=1)
NA values can be replaced with corresponding value from a :class:`Series` or :class:`DataFrame`
where the index and column aligns between the original object and the filled object.
.. ipython:: python
dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))
dff.iloc[3:5, 0] = np.nan
dff.iloc[4:6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
dff
dff.fillna(dff.mean())
.. note::
:meth:`DataFrame.where` can also be used to fill NA values. Same result as above.
.. ipython:: python
dff.where(pd.notna(dff), dff.mean(), axis="columns")
.. _missing_data.interpolate:
Interpolation
-------------
:meth:`DataFrame.interpolate` and :meth:`Series.interpolate` fills NA values
using various interpolation methods.
.. ipython:: python
df = pd.DataFrame(
{
"A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
"B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
}
)
df
df.interpolate()
idx = pd.date_range("2020-01-01", periods=10, freq="D")
data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
ts = pd.Series(data, index=idx)
ts.iloc[[1, 2, 5, 6, 9]] = np.nan
ts
@savefig series_before_interpolate.png
ts.plot()
.. ipython:: python
ts.interpolate()
@savefig series_interpolate.png
ts.interpolate().plot()
Interpolation relative to a :class:`Timestamp` in the :class:`DatetimeIndex`
is available by setting ``method="time"``
.. ipython:: python
ts2 = ts.iloc[[0, 1, 3, 7, 9]]
ts2
ts2.interpolate()
ts2.interpolate(method="time")
For a floating-point index, use ``method='values'``:
.. ipython:: python
idx = [0.0, 1.0, 10.0]
ser = pd.Series([0.0, np.nan, 10.0], idx)
ser
ser.interpolate()
ser.interpolate(method="values")
If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``.
as specified in the scipy interpolation documentation_ and reference guide_.
The appropriate interpolation method will depend on the data type.
.. tip::
If you are dealing with a time series that is growing at an increasing rate,
use ``method='barycentric'``.
If you have values approximating a cumulative distribution function,
use ``method='pchip'``.
To fill missing values with goal of smooth plotting use ``method='akima'``.
.. ipython:: python
df = pd.DataFrame(
{
"A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
"B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
}
)
df
df.interpolate(method="barycentric")
df.interpolate(method="pchip")
df.interpolate(method="akima")
When interpolating via a polynomial or spline approximation, you must also specify
the degree or order of the approximation:
.. ipython:: python
df.interpolate(method="spline", order=2)
df.interpolate(method="polynomial", order=2)
Comparing several methods.
.. ipython:: python
np.random.seed(2)
ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))
missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
ser.iloc[missing] = np.nan
methods = ["linear", "quadratic", "cubic"]
df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
@savefig compare_interpolations.png
df.plot()
Interpolating new observations from expanding data with :meth:`Series.reindex`.
.. ipython:: python
ser = pd.Series(np.sort(np.random.uniform(size=100)))
# interpolate at new_index
new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
interp_s = ser.reindex(new_index).interpolate(method="pchip")
interp_s.loc[49:51]
.. _scipy: https://scipy.org/
.. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html
.. _missing_data.interp_limits:
Interpolation limits
^^^^^^^^^^^^^^^^^^^^
:meth:`~DataFrame.interpolate` accepts a ``limit`` keyword
argument to limit the number of consecutive ``NaN`` values
filled since the last valid observation
.. ipython:: python
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
ser
ser.interpolate()
ser.interpolate(limit=1)
By default, ``NaN`` values are filled in a ``forward`` direction. Use
``limit_direction`` parameter to fill ``backward`` or from ``both`` directions.
.. ipython:: python
ser.interpolate(limit=1, limit_direction="backward")
ser.interpolate(limit=1, limit_direction="both")
ser.interpolate(limit_direction="both")
By default, ``NaN`` values are filled whether they are surrounded by
existing valid values or outside existing valid values. The ``limit_area``
parameter restricts filling to either inside or outside values.
.. ipython:: python
# fill one consecutive inside value in both directions
ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
# fill all consecutive outside values backward
ser.interpolate(limit_direction="backward", limit_area="outside")
# fill all consecutive outside values in both directions
ser.interpolate(limit_direction="both", limit_area="outside")
.. _missing_data.replace:
Replacing values
----------------
:meth:`Series.replace` and :meth:`DataFrame.replace` can be used similar to
:meth:`Series.fillna` and :meth:`DataFrame.fillna` to replace or insert missing values.
.. ipython:: python
df = pd.DataFrame(np.eye(3))
df
df_missing = df.replace(0, np.nan)
df_missing
df_filled = df_missing.replace(np.nan, 2)
df_filled
Replacing more than one value is possible by passing a list.
.. ipython:: python
df_filled.replace([1, 44], [2, 28])
Replacing using a mapping dict.
.. ipython:: python
df_filled.replace({1: 44, 2: 28})
.. _missing_data.replace_expression:
Regular expression replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
Python strings prefixed with the ``r`` character such as ``r'hello world'``
are `"raw" strings <https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals>`_.
They have different semantics regarding backslashes than strings without this prefix.
Backslashes in raw strings will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``.
Replace the '.' with ``NaN``
.. ipython:: python
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
df = pd.DataFrame(d)
df.replace(".", np.nan)
Replace the '.' with ``NaN`` with regular expression that removes surrounding whitespace
.. ipython:: python
df.replace(r"\s*\.\s*", np.nan, regex=True)
Replace with a list of regexes.
.. ipython:: python
df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Replace with a regex in a mapping dict.
.. ipython:: python
df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Pass nested dictionaries of regular expressions that use the ``regex`` keyword.
.. ipython:: python
df.replace({"b": {"b": r""}}, regex=True)
df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Pass a list of regular expressions that will replace matches with a scalar.
.. ipython:: python
df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
All of the regular expression examples can also be passed with the
``to_replace`` argument as the ``regex`` argument. In this case the ``value``
argument must be passed explicitly by name or ``regex`` must be a nested
dictionary.
.. ipython:: python
df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
.. note::
A regular expression object from ``re.compile`` is a valid input as well.