Working with missing data

notebooks/missing-data.ipynb


This section discusses missing (also referred to as NA) values in cudf. cudf supports missing values in all dtypes. Missing values are displayed as <NA> and are also called "null values".

How to Detect missing values

To detect missing values, use the isna() and notna() methods.

python
import cudf
import numpy as np

rng = np.random.default_rng()
python
df = cudf.DataFrame({"a": [1, 2, None, 4], "b": [0.1, None, 2.3, 17.17]})
python
df
python
df.isna()
python
df["a"].notna()

Keep in mind that in Python (and NumPy), NaN values do not compare equal to each other, whereas None does compare equal to None. cudf, like NumPy, relies on the fact that np.nan != np.nan, and it treats None like np.nan.

python
None == None
python
np.nan == np.nan

Consequently, a scalar equality comparison against None or np.nan does not provide useful information; use isna() or notna() instead.

python
df["b"] == np.nan
python
s = cudf.Series([None, 1, 2])
python
s
python
s == None
python
s = cudf.Series([1, 2, np.nan], nan_as_null=False)
python
s
python
s == np.nan
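
The comparison rules described above are ordinary Python and IEEE 754 behavior, so they can be checked without a GPU. The sketch below uses only the standard library; `is_missing` is a hypothetical helper written for illustration and is not part of cudf.

```python
import math

nan = float("nan")

# None is a singleton, so it compares equal to itself.
none_equal = (None == None)  # noqa: E711

# An IEEE 754 NaN never compares equal, not even to itself.
nan_equal = (nan == nan)

# Reliable NaN checks: math.isnan, or the self-inequality trick.
nan_detected = math.isnan(nan)
nan_self_unequal = (nan != nan)


def is_missing(value):
    """Hypothetical helper: True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# An explicit missing-value test works where == cannot.
mask = [is_missing(v) for v in [1.0, None, nan, 4.0]]
```

This is why an explicit test such as isna() is the dependable way to locate missing entries.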

Float dtypes and missing data

Because NaN is a float, a column of integers with even one missing value would have to be cast to a floating-point dtype. However, this doesn't happen by default.

By default, a NaN value passed to the Series constructor is treated as an <NA> value.

python
cudf.Series([1, 2, np.nan])

To keep a NaN as NaN, pass nan_as_null=False to the Series constructor.

python
cudf.Series([1, 2, np.nan], nan_as_null=False)

Datetimes

For datetime64 types, cudf doesn't support NaT values. These values, which are specific to NumPy and pandas, are treated as null values (<NA>) in cudf. The underlying value of NaT is min(int64), and cudf retains that value when converting a cudf object to a pandas object.

python
import pandas as pd

datetime_series = cudf.Series(
    [pd.Timestamp("20120101"), pd.NaT, pd.Timestamp("20120101")]
)
datetime_series
python
datetime_series.to_pandas()

Any operation on rows that have <NA> values in a datetime column results in an <NA> value at the same location in the resulting column:

python
datetime_series - datetime_series

Calculations with missing data

Null values propagate naturally through arithmetic operations between cudf objects.

python
df1 = cudf.DataFrame(
    {
        "a": [1, None, 2, 3, None],
        "b": cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False),
    }
)
python
df2 = cudf.DataFrame(
    {"a": [1, 11, 2, 34, 10], "b": cudf.Series([0.23, 22, 3.2, None, 1])}
)
python
df1
python
df2
python
df1 + df2
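
As a rough model of this propagation, the plain-Python sketch below adds two columns element by element and yields a null wherever either operand is null. `None` stands in for <NA>, and `add_columns` is a hypothetical helper, not how cudf implements the operation.

```python
def add_columns(left, right):
    """Hypothetical model of element-wise addition; None marks a null."""
    return [
        None if x is None or y is None else x + y
        for x, y in zip(left, right)
    ]


# Same values as the "a" columns of df1 and df2.
a1 = [1, None, 2, 3, None]
a2 = [1, 11, 2, 34, 10]

# Positions 1 and 4 are null in a1, so they are null in the result.
result = add_columns(a1, a2)
```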

When summing a Series, NA values are skipped by default, which for a sum is equivalent to treating them as 0.

python
df1["a"]
python
df1["a"].sum()

For the mean, NA values are skipped rather than counted as 0, so the result here is (1 + 2 + 3) / 3 = 2, not (1 + 0 + 2 + 3 + 0) / 5 = 1.2.

python
df1["a"].mean()

To preserve NA values in the above calculations, sum and mean support a skipna parameter. Its default value is True; set it to False to preserve NA values.

python
df1["a"].sum(skipna=False)
python
df1["a"].mean(skipna=False)
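
The arithmetic behind these reductions can be sketched in plain Python. `None` stands in for <NA>, and `null_aware_sum` and `null_aware_mean` are hypothetical helpers that model the skipna behavior described above; they are not cudf's implementation, and returning `None` for a null result is a choice of this sketch.

```python
def null_aware_sum(values, skipna=True):
    """Hypothetical model of a sum where None marks a null."""
    if not skipna and any(v is None for v in values):
        return None  # a null anywhere makes the whole result null
    return sum(v for v in values if v is not None)


def null_aware_mean(values, skipna=True):
    """Hypothetical model of a mean; skipped nulls are not counted."""
    if not skipna and any(v is None for v in values):
        return None
    present = [v for v in values if v is not None]
    # Returning None for an empty input is an assumption of this sketch.
    return sum(present) / len(present) if present else None


column_a = [1, None, 2, 3, None]  # same values as df1["a"]

total = null_aware_sum(column_a)        # 1 + 2 + 3
average = null_aware_mean(column_a)     # (1 + 2 + 3) / 3
total_strict = null_aware_sum(column_a, skipna=False)
average_strict = null_aware_mean(column_a, skipna=False)
```

Note that the mean divides by the number of non-null values (3), not by the length of the column (5).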

Cumulative methods like cumsum and cumprod ignore NA values by default.

python
df1["a"].cumsum()

To preserve NA values in cumulative methods, provide skipna=False.

python
df1["a"].cumsum(skipna=False)
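
A plain-Python sketch of the two modes, assuming the pandas-style semantics that cudf follows here: with skipna=True a null position stays null but the running total continues past it, while with skipna=False every position from the first null onward is null. `null_aware_cumsum` is a hypothetical helper and `None` stands in for <NA>.

```python
def null_aware_cumsum(values, skipna=True):
    """Hypothetical model of a cumulative sum where None marks a null."""
    result = []
    running = 0
    poisoned = False  # becomes True after a null when skipna=False
    for v in values:
        if v is None:
            poisoned = poisoned or not skipna
            result.append(None)
        elif poisoned:
            result.append(None)
        else:
            running += v
            result.append(running)
    return result


column_a = [1, None, 2, 3, None]  # same values as df1["a"]

skipping = null_aware_cumsum(column_a)
strict = null_aware_cumsum(column_a, skipna=False)
```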

Sum/product of nulls/NaNs

The sum of an empty or all-NA Series or column of a DataFrame is 0.

python
cudf.Series([np.nan], nan_as_null=False).sum()
python
cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)
python
cudf.Series([], dtype="float64").sum()

The product of an empty or all-NA Series or column of a DataFrame is 1.

python
cudf.Series([np.nan], nan_as_null=False).prod()
python
cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)
python
cudf.Series([], dtype="float64").prod()
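
These results are the identity elements of addition and multiplication, which plain Python uses for empty inputs as well. The sketch below relies only on the standard library; `None` stands in for <NA>, and dropping nulls before reducing is a model of the default skipna behavior rather than cudf's actual code path.

```python
import math

empty = []
empty_sum = sum(empty)            # additive identity
empty_product = math.prod(empty)  # multiplicative identity

# Dropping every null from an all-null column leaves an empty input,
# so the reductions fall back to the same identities.
all_null = [None, None]
remaining = [v for v in all_null if v is not None]
all_null_sum = sum(remaining)
all_null_product = math.prod(remaining)
```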

NA values in GroupBy

NA groups in GroupBy are automatically excluded. For example:

python
df1
python
df1.groupby("a").mean()

It is also possible to include NA in groups by passing dropna=False:

python
df1.groupby("a", dropna=False).mean()
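
The effect of dropna on grouping can be modeled in plain Python. The sketch below uses illustrative data (not df1), with `None` standing in for a null key; `group_means` is a hypothetical helper, not a cudf function.

```python
from collections import defaultdict


def group_means(keys, values, dropna=True):
    """Hypothetical model: mean of values per key; None marks a null key."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        if key is None and dropna:
            continue  # rows with a null key are left out entirely
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


keys = [1, None, 2, 1, None]  # illustrative data
values = [10.0, 20.0, 30.0, 50.0, 40.0]

without_null_group = group_means(keys, values)
with_null_group = group_means(keys, values, dropna=False)
```

With dropna=False the rows whose key is null form one additional group instead of being discarded.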

Inserting missing data

All dtypes support insertion of missing values by assignment. Any specific location in a Series can be made null by assigning None to it.

python
series = cudf.Series([1, 2, 3, 4])
python
series
python
series[2] = None
python
series

Filling missing values: fillna

fillna() can fill in NA and NaN values with non-NA data.

python
df1
python
df1["b"].fillna(10)

Filling with cudf Object

You can also call fillna() with a dict or Series that is alignable. The labels of the dict or the index of the Series must match the columns of the frame you wish to fill. A typical use case is filling each column of a DataFrame with that column's mean.

python
import cupy as cp

cp_rng = cp.random.default_rng()

dff = cudf.DataFrame(cp_rng.standard_normal((10, 3)), columns=list("ABC"))
python
dff.iloc[3:5, 0] = np.nan
python
dff.iloc[4:6, 1] = np.nan
python
dff.iloc[5:8, 2] = np.nan
python
dff
python
dff.fillna(dff.mean())
python
dff.fillna(dff.mean()[1:3])
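
The column-wise alignment can be sketched in plain Python with a dict of lists standing in for a DataFrame and `None` for a missing entry. `column_mean` and `fill_columns` are hypothetical helpers; the point is that each column is filled from the entry with the matching label, and columns without an entry are left untouched.

```python
def column_mean(values):
    """Mean of the non-null entries; None marks a null."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def fill_columns(frame, fill_values):
    """Hypothetical model of fillna with a per-column mapping."""
    return {
        name: [
            fill_values[name] if v is None and name in fill_values else v
            for v in column
        ]
        for name, column in frame.items()
    }


frame = {
    "A": [1.0, None, 3.0],
    "B": [None, 4.0, 8.0],
    "C": [5.0, None, None],
}
means = {name: column_mean(col) for name, col in frame.items()}

# Every column has a matching mean, so every null is filled.
filled_all = fill_columns(frame, means)

# Only "B" and "C" are supplied, so the null in "A" remains.
filled_some = fill_columns(frame, {k: means[k] for k in ("B", "C")})
```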

Dropping axis labels with missing data: dropna

Missing data can be excluded using dropna():

python
df1
python
df1.dropna(axis=0)
python
df1.dropna(axis=1)

An equivalent dropna() is available for Series.

python
df1["a"].dropna()
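
The difference between the two axes can be modeled in plain Python: axis=0 removes every row that contains a null, and axis=1 removes every column that contains a null. The helpers below are hypothetical, use a dict of equal-length lists as a stand-in for a DataFrame, and use `None` for a missing entry.

```python
def drop_rows_with_nulls(frame):
    """Hypothetical model of dropna(axis=0)."""
    n_rows = len(next(iter(frame.values())))
    keep = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in frame.values())
    ]
    return {name: [col[i] for i in keep] for name, col in frame.items()}


def drop_columns_with_nulls(frame):
    """Hypothetical model of dropna(axis=1)."""
    return {
        name: col for name, col in frame.items()
        if all(v is not None for v in col)
    }


frame = {"a": [1, None, 2], "b": [0.5, 2.0, 3.2], "c": [7, 8, None]}

rows_kept = drop_rows_with_nulls(frame)        # only row 0 is complete
columns_kept = drop_columns_with_nulls(frame)  # only "b" is complete
```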

Replacing generic values

Often we want to replace arbitrary values with other values.

replace() in Series and replace() in DataFrame provide an efficient yet flexible way to perform such replacements.

python
series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])
python
series
python
series.replace(0, 5)

We can also replace any value with a <NA> value.

python
series.replace(0, None)

You can replace a list of values by a list of other values:

python
series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

You can also specify a mapping dict:

python
series.replace({0: 10, 1: 100})
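
One point worth spelling out is that list and dict replacements are looked up against the original values, so swapping values does not chain (0 becomes 4, and the original 4 becomes 0). The plain-Python sketch below models this; `replace_values` is a hypothetical helper, and unlike a cudf Series it does not preserve a single dtype.

```python
def replace_values(values, to_replace, replacement=None):
    """Hypothetical model of replace(); lookups use the original values."""
    if isinstance(to_replace, dict):
        mapping = to_replace
    elif isinstance(to_replace, list):
        mapping = dict(zip(to_replace, replacement))
    else:
        mapping = {to_replace: replacement}
    # Each element is looked up once, so replacements never cascade.
    return [mapping.get(v, v) for v in values]


series = [0.0, 1.0, 2.0, 3.0, 4.0]

scalar = replace_values(series, 0, 5)
nulled = replace_values(series, 0, None)
swapped = replace_values(series, [0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
mapped = replace_values(series, {0: 10, 1: 100})
```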

For a DataFrame, you can specify individual values by column:

python
df = cudf.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})
python
df
python
df.replace({"a": 0, "b": 5}, 100)

String/regular expression replacement

cudf supports replacing string values using the replace API:

python
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", None, "d"]}
python
df = cudf.DataFrame(d)
python
df
python
df.replace(".", "A Dot")
python
df.replace([".", "b"], ["A Dot", None])

Replace a few different values (list -> list):

python
df.replace(["a", "."], ["b", "--"])

Only search in column 'b' (dict -> dict):

python
df.replace({"b": "."}, {"b": "replacement value"})

Numeric replacement

replace() can also be used similarly to fillna().

python
df = cudf.DataFrame(cp_rng.standard_normal((10, 2)))
python
df[rng.random(df.shape[0]) > 0.5] = 1.5
python
df.replace(1.5, None)

Replacing more than one value is possible by passing a list.

python
df00 = df.iloc[0, 0]
python
df.replace([1.5, df00], [5, 10])

You can also operate on the DataFrame in place:

python
df.replace(1.5, None, inplace=True)
python
df