notebooks/missing-data.ipynb
In this section, we discuss missing values (also referred to as NA or "null" values) in cudf. cudf supports missing values in all dtypes; they are represented by <NA>.
To detect missing values, you can use the isna() and notna() functions.
import cudf
import numpy as np
rng = np.random.default_rng()
df = cudf.DataFrame({"a": [1, 2, None, 4], "b": [0.1, None, 2.3, 17.17]})
df
df.isna()
df["a"].notna()
Be mindful that in Python (and NumPy), NaNs don't compare equal, but Nones do. cudf follows NumPy in using the fact that np.nan != np.nan, and treats None like np.nan.
None == None
np.nan == np.nan
So, compared to the examples above, a scalar equality comparison against None or np.nan doesn't provide useful information.
df["b"] == np.nan
s = cudf.Series([None, 1, 2])
s
s == None
s = cudf.Series([1, 2, np.nan], nan_as_null=False)
s
s == np.nan
Because NaN is a float, a column of integers with even one missing value would be cast to a floating-point dtype. By default, this doesn't happen in cudf: if a NaN value is passed to the Series constructor, it is treated as an <NA> value instead.
cudf.Series([1, 2, np.nan])
Hence, to keep a NaN as a NaN, you have to pass the nan_as_null=False parameter to the Series constructor.
cudf.Series([1, 2, np.nan], nan_as_null=False)
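For contrast, the integer-to-float cast described above can be seen with pandas on the CPU (a minimal sketch; the cudf line is shown as a comment since it needs a GPU):

```python
import numpy as np
import pandas as pd

# In pandas, a single missing value forces an integer column to float64,
# and the missing entry becomes NaN.
s = pd.Series([1, 2, None, 4])
print(s.dtype)         # float64
print(np.isnan(s[2]))  # True

# In cudf, the same data keeps an integer dtype with an <NA> slot instead:
# cudf.Series([1, 2, None, 4]).dtype  -> int64
```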
For datetime64 types, cudf doesn't support NaT values. Instead, these values, which are specific to numpy and pandas, are treated as null values (<NA>) in cudf. The actual underlying value of NaT is min(int64), and cudf retains this underlying value when converting a cudf object to a pandas object.
import pandas as pd
datetime_series = cudf.Series(
[pd.Timestamp("20120101"), pd.NaT, pd.Timestamp("20120101")]
)
datetime_series
datetime_series.to_pandas()
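The min(int64) claim above can be verified with plain NumPy (no GPU required): reinterpreting a NaT datetime64 as an int64 yields the smallest 64-bit integer, which is the sentinel value carried through the conversion.

```python
import numpy as np

# NaT's underlying bit pattern is the minimum int64 value.
nat_bits = np.datetime64("NaT").astype("int64")
print(nat_bits)                            # -9223372036854775808
print(nat_bits == np.iinfo(np.int64).min)  # True
```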
Any operation on rows having <NA> values in a datetime column will result in an <NA> value at the same location in the resulting column:
datetime_series - datetime_series
Null values propagate naturally through arithmetic operations between cudf objects.
df1 = cudf.DataFrame(
{
"a": [1, None, 2, 3, None],
"b": cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False),
}
)
df2 = cudf.DataFrame(
{"a": [1, 11, 2, 34, 10], "b": cudf.Series([0.23, 22, 3.2, None, 1])}
)
df1
df2
df1 + df2
When summing data along a Series, NA values are skipped, i.e. treated as 0.
df1["a"]
df1["a"].sum()
Since NA values are skipped, the mean is computed over the non-null values only: (1 + 2 + 3) / 3 = 2.
df1["a"].mean()
To preserve NA values in the above calculations, sum and mean support a skipna parameter. By default its value is set to True; change it to False to propagate NA values.
df1["a"].sum(skipna=False)
df1["a"].mean(skipna=False)
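The skipna semantics above mirror NumPy's nan-aware reductions. A CPU-only sketch of the same arithmetic on the values of column "a" (1, NA, 2, 3, NA), using NaN to stand in for <NA>:

```python
import numpy as np

vals = np.array([1.0, np.nan, 2.0, 3.0, np.nan])  # column "a", NaN standing in for <NA>

print(np.nansum(vals))   # 6.0 -> like sum(skipna=True): missing entries skipped
print(np.nanmean(vals))  # 2.0 -> mean over the 3 non-missing values
print(vals.sum())        # nan -> like skipna=False: the missing value propagates
```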
Cumulative methods like cumsum and cumprod ignore NA values by default.
df1["a"].cumsum()
To preserve NA values in cumulative methods, provide skipna=False.
df1["a"].cumsum(skipna=False)
The sum of an empty or all-NA Series (or column of a DataFrame) is 0.
cudf.Series([np.nan], nan_as_null=False).sum()
cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)
cudf.Series([], dtype="float64").sum()
The product of an empty or all-NA Series (or column of a DataFrame) is 1.
cudf.Series([np.nan], nan_as_null=False).prod()
cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)
cudf.Series([], dtype="float64").prod()
NA groups in GroupBy are automatically excluded. For example:
df1
df1.groupby("a").mean()
It is also possible to include NA in groups by passing dropna=False:
df1.groupby("a", dropna=False).mean()
All dtypes support insertion of missing values by assignment. Any specific location in a Series can be made null by assigning it None.
series = cudf.Series([1, 2, 3, 4])
series
series[2] = None
series
fillna() can fill in NA and NaN values with non-NA data.
df1
df1["b"].fillna(10)
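fillna() is not limited to scalar replacements. cudf also mirrors pandas' forward- and backward-fill API (ffill() / bfill(), assuming a recent cudf version); sketched here on an equivalent pandas Series so it runs without a GPU, the cudf calls being identical:

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 2.0, 3.2, 0.1, 1.0])  # mirrors df1["b"] above

# Forward fill copies the last valid value forward; the leading NaN has
# nothing before it, so it stays missing.
print(b.ffill().tolist())  # [nan, 2.0, 3.2, 0.1, 1.0]

# Backward fill copies the next valid value backward instead.
print(b.bfill().tolist())  # [2.0, 2.0, 3.2, 0.1, 1.0]
```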
You can also fill using a dict or a Series that is alignable: the labels of the dict or the index of the Series must match the columns of the frame you wish to fill. A common use case is filling a DataFrame's missing values with the mean of each column.
import cupy as cp
cp_rng = cp.random.default_rng()
dff = cudf.DataFrame(cp_rng.standard_normal((10, 3)), columns=list("ABC"))
dff.iloc[3:5, 0] = np.nan
dff.iloc[4:6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
dff
dff.fillna(dff.mean())
dff.fillna(dff.mean()[1:3])
Missing data can be excluded using dropna():
df1
df1.dropna(axis=0)
df1.dropna(axis=1)
An equivalent dropna() is available for Series.
df1["a"].dropna()
Oftentimes we want to replace arbitrary values with other values.
replace() in Series and replace() in DataFrame provide an efficient yet flexible way to perform such replacements.
series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])
series
series.replace(0, 5)
We can also replace any value with a <NA> value.
series.replace(0, None)
You can replace a list of values by a list of other values:
series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
You can also specify a mapping dict:
series.replace({0: 10, 1: 100})
For a DataFrame, you can specify individual values by column:
df = cudf.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})
df
df.replace({"a": 0, "b": 5}, 100)
cudf supports replacing string values using replace API:
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", None, "d"]}
df = cudf.DataFrame(d)
df
df.replace(".", "A Dot")
df.replace([".", "b"], ["A Dot", None])
Replace a few different values (list -> list):
df.replace(["a", "."], ["b", "--"])
Only search in column 'b' (dict -> dict):
df.replace({"b": "."}, {"b": "replacement value"})
replace() can also be used in a similar way to fillna().
df = cudf.DataFrame(cp_rng.standard_normal((10, 2)))
df[rng.random(df.shape[0]) > 0.5] = 1.5
df.replace(1.5, None)
Replacing more than one value is possible by passing a list.
df00 = df.iloc[0, 0]
df.replace([1.5, df00], [5, 10])
You can also operate on the DataFrame in place:
df.replace(1.5, None, inplace=True)
df