Overview of User Defined Functions with cuDF

docs/cudf/source/user_guide/guide-to-udfs.ipynb


python
import cudf
import numpy as np
from cudf.datasets import randomdata
from numba import config

config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.

In conjunction with the broader GPU PyData ecosystem, cuDF provides interfaces to run UDFs on a variety of data structures. Currently, UDFs can only be executed on numeric, boolean, datetime, and timedelta data, with partial support for strings in some APIs. This guide covers writing and executing UDFs on the following data structures:

  • Series
  • DataFrame
  • Rolling Windows Series
  • Groupby DataFrames
  • CuPy NDArrays
  • Numba DeviceNDArrays

It also demonstrates cuDF's default null handling behavior, and how to write UDFs that can interact with null values.

Series UDFs

You can execute UDFs on Series in two ways:

  • Writing a standard python function and using cudf.Series.apply
  • Writing a Numba kernel and using Numba's forall syntax

Using apply is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide).

cudf.Series.apply

cuDF provides a similar API to pandas.Series.apply for applying scalar UDFs to series objects. Here is a very basic example.

python
# Create a cuDF series
sr = cudf.Series([1, 2, 3])

UDFs destined for cudf.Series.apply might look something like this:

python
# define a scalar function


def f(x):
    return x + 1

cudf.Series.apply is called like pd.Series.apply and returns a new Series object:

python
sr.apply(f)

Functions with Additional Scalar Arguments

In addition, cudf.Series.apply supports args= just like pandas, allowing you to write UDFs that accept an arbitrary number of scalar arguments. Here is an example of such a function and its API call in cuDF; the pandas call takes the same form:

python
def g(x, const):
    return x + const
python
# cuDF apply
sr.apply(g, args=(42,))

As a final note, **kwargs is not yet supported.

Nullable Data

The null value NA propagates through unary and binary operations. Thus, NA + 1, abs(NA), and NA == NA all return NA. To make this concrete, let's look at the same example from above, this time using nullable data:

python
# Create a cuDF series with nulls
sr = cudf.Series([1, cudf.NA, 3])
sr
python
# redefine the same function from above


def f(x):
    return x + 1
python
# cuDF result
sr.apply(f)

Often, however, you want explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the NA singleton object. Here's an example of a function with explicit null handling:

python
def f_null_sensitive(x):
    # do something if the input is null
    if x is cudf.NA:
        return 42
    else:
        return x + 1
python
# cuDF result
sr.apply(f_null_sensitive)

In addition, cudf.NA can be returned from a function directly or conditionally. This capability should allow you to implement custom null handling in a wide variety of cases.
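
The propagation and override rules described in this section can be emulated on the CPU to see what each function is expected to return. The sketch below is a minimal, hypothetical stand-in: the `_NAType` class and its `NA` instance are illustration-only names rather than cuDF's implementation, and a plain list comprehension takes the place of `Series.apply`.

```python
class _NAType:
    """Hypothetical stand-in for a null singleton (not cudf.NA itself)."""

    def __add__(self, other):
        return self  # NA + 1 -> NA

    __radd__ = __add__  # 1 + NA -> NA

    def __abs__(self):
        return self  # abs(NA) -> NA

    def __eq__(self, other):
        return self  # NA == NA -> NA, not True

    __hash__ = object.__hash__

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def f(x):
    return x + 1


def f_null_sensitive(x):
    # explicit handling: identity check against the singleton
    if x is NA:
        return 42
    return x + 1


data = [1, NA, 3]
propagated = [f(x) for x in data]
handled = [f_null_sensitive(x) for x in data]
print(propagated)  # [2, <NA>, 4]
print(handled)     # [2, 42, 4]
```

The identity check (`x is NA`) is what makes explicit handling possible, since an equality check against NA would itself evaluate to NA.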

String data

Experimental support for a subset of string functionality is available for apply. The following string operations are currently supported:

  • str.count
  • str.startswith
  • str.endswith
  • str.find
  • str.rfind
  • str.isalnum
  • str.isdecimal
  • str.isdigit
  • str.islower
  • str.isupper
  • str.isalpha
  • str.istitle
  • str.isspace
  • ==, !=, >=, <=, >, < (between two strings)
  • len (e.g. len(some_string))
  • in (e.g., 'abc' in some_string)
  • strip
  • lstrip
  • rstrip
  • upper
  • lower
  • + (string concatenation)
  • replace
python
sr = cudf.Series(["", "abc", "some_example"])
python
def f(st):
    if len(st) > 0:
        if st.startswith("a"):
            return 1
        elif "example" in st:
            return 2
        else:
            return -1
    else:
        return 42
python
result = sr.apply(f)
print(result)
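
Because the UDF body is ordinary Python, one way to sanity-check its branching before running it on the GPU is to call it on plain Python strings. This is only a CPU-side sketch: a list comprehension stands in for `sr.apply`, and it assumes the supported string operations behave like their built-in `str` counterparts.

```python
def f(st):
    if len(st) > 0:
        if st.startswith("a"):
            return 1
        elif "example" in st:
            return 2
        else:
            return -1
    else:
        return 42


samples = ["", "abc", "some_example"]
expected = [f(st) for st in samples]  # one call per element, like apply
print(expected)  # [42, 1, 2]
```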

String UDF Memory Considerations

UDFs that create intermediate strings as part of the computation may require memory tuning. An API is provided for convenience to accomplish this:

python
from cudf.core.udf.utils import set_malloc_heap_size

set_malloc_heap_size(int(2e9))

Lower level control with custom Numba kernels

In addition to the Series.apply() method for performing custom operations, you can also pass Series objects directly into CUDA kernels written with Numba. Note that this section requires basic CUDA knowledge. Refer to Numba's CUDA documentation for details.

The easiest way to write a Numba kernel is to use cuda.grid(1) to manage thread indices, and then leverage Numba's forall method to configure the kernel for us. Below, we define a basic multiplication kernel as an example and use @cuda.jit to compile it.

python
df = randomdata(nrows=5, dtypes={"a": int, "b": int, "c": int}, seed=12)
python
from numba import cuda


@cuda.jit
def multiply(in_col, out_col, multiplier):
    i = cuda.grid(1)
    if i < in_col.size:  # boundary guard
        out_col[i] = in_col[i] * multiplier

This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an if statement. Because we can launch more threads than the size of our array, we need to make sure that we don't use threads with an index that would be out of bounds. Leaving this out can result in undefined behavior.
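
To see why the guard matters, the launch can be simulated on the CPU. The sketch below is not Numba code: `simulate_launch` is a hypothetical helper, and the block size of 4 is an arbitrary assumption chosen only so that the rounded-up thread count exceeds the array length.

```python
import math


def simulate_launch(in_col, multiplier, threads_per_block=4):
    """CPU stand-in for a 1D launch of the multiply kernel (illustrative only)."""
    n = len(in_col)
    blocks = math.ceil(n / threads_per_block)  # round up so every element is covered
    total_threads = blocks * threads_per_block
    out_col = [0.0] * n
    skipped = []
    for i in range(total_threads):  # i plays the role of cuda.grid(1)
        if i < n:  # boundary guard
            out_col[i] = in_col[i] * multiplier
        else:
            skipped.append(i)  # without the guard, these would index past the end
    return out_col, skipped


out_col, skipped = simulate_launch([1, 2, 3, 4, 5], 10.0)
print(out_col)  # [10.0, 20.0, 30.0, 40.0, 50.0]
print(skipped)  # [5, 6, 7]
```

With five elements and blocks of four threads, two blocks (eight threads) are launched, so three threads have no element to process.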

To execute our kernel, we must pre-allocate an output array and leverage the forall method mentioned above. First, we create a Series of all 0.0 in our DataFrame, since we want float64 output. Next, we run the kernel with forall. forall requires us to specify our desired number of tasks, so we'll supply the length of our Series (which we store in size). The __cuda_array_interface__ is what allows us to directly call our Numba kernel on our Series.

python
size = len(df["a"])
df["e"] = 0.0
multiply.forall(size)(df["a"], df["e"], 10.0)

After calling our kernel, our DataFrame is now populated with the result.

python
df.head()

This API allows you to write, in principle, arbitrary kernel logic that accesses elements of a Series at arbitrary indices, and to run it directly on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or to spot-treat problem areas of a data pipeline with a custom kernel that does the same job faster.

DataFrame UDFs

As cudf.Series.apply does for Series, cudf.DataFrame.apply provides the primary way to run UDFs over the rows of a DataFrame.

cudf.DataFrame.apply

cudf.DataFrame.apply is the main entrypoint for UDFs that expect multiple columns as input and produce a single output column. Functions intended to be consumed by this API are written in terms of a "row" argument. The "row" is considered to be like a dictionary and contains all of the column values at a certain iloc in a DataFrame. The function can access these values by key within the function, the keys being the column names corresponding to the desired value. Below is an example function that would be used to add column A and column B together inside a UDF.

python
def f(row):
    return row["A"] + row["B"]

Let's create some very basic toy data containing at least one null.

python
df = cudf.DataFrame({"A": [1, 2, 3], "B": [4, cudf.NA, 6]})
df

Finally, call the function as you would in pandas, passing axis=1 to map the UDF onto the "rows" of the DataFrame:

python
df.apply(f, axis=1)

The same function should produce the same result as pandas:

python
df.to_pandas(nullable=True).apply(f, axis=1)

Notice that pandas returns object dtype; see the notes on this in the caveats section.

Like cudf.Series.apply, these functions support generalized null handling. Here's a function that conditionally returns a different value if a certain input is null:

python
def f(row):
    x = row["a"]
    if x is cudf.NA:
        return 0
    else:
        return x + 1


df = cudf.DataFrame({"a": [1, cudf.NA, 3]})
df
python
df.apply(f, axis=1)

cudf.NA can also be directly returned from a function, resulting in data that has the correct nulls in the end, just as if it were run in pandas. For the following data, the last row fulfills the condition that 3 + 1 > 3 and returns NA for that row:

python
def f(row):
    x = row["a"]
    y = row["b"]
    if x + y > 3:
        return cudf.NA
    else:
        return x + y


df = cudf.DataFrame({"a": [1, 2, 3], "b": [2, 1, 1]})
df
python
df.apply(f, axis=1)

Mixed types are allowed, but the result will have the common type rather than object as in pandas. Here's a null-aware operation between an int and a float column:

python
def f(row):
    return row["a"] + row["b"]


df = cudf.DataFrame({"a": [1, 2, 3], "b": [0.5, cudf.NA, 3.14]})
df
python
df.apply(f, axis=1)

Functions may also return scalar values; however, the result will be promoted to a safe type regardless of the data. This means even if you have a function like:

python
def f(x):
    if x > 1000:
        return 1.5
    else:
        return 2

And your data is:

python
[1,2,3,4,5]

You will get floats in the final data even though a float is never returned. This is because Numba ultimately needs to produce one function that can handle any data, which means if there's any possibility a float could result, you must always assume it will happen. Here's an example of a function that returns a scalar in some cases:

python
def f(row):
    x = row["a"]
    if x > 3:
        return x
    else:
        return 1.5


df = cudf.DataFrame({"a": [1, 3, 5]})
df
python
df.apply(f, axis=1)
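
The promotion rule can be pictured as choosing one result type from every return statement in the function, independent of the data. The sketch below is a simplified CPU model and not Numba's actual type inference: `unify` and `apply_with_promotion` are hypothetical helpers, the branch types are supplied by hand, and only int and float are modeled.

```python
def unify(branch_types):
    """Pick a single type that can hold every branch's return value."""
    return float if float in branch_types else int


def apply_with_promotion(func, values, branch_types):
    result_type = unify(branch_types)  # decided before any data is seen
    return [result_type(func(v)) for v in values]


def f(x):
    if x > 1000:
        return 1.5  # float branch, never taken for the data below
    else:
        return 2  # int branch


out = apply_with_promotion(f, [1, 2, 3, 4, 5], branch_types=(float, int))
print(out)  # [2.0, 2.0, 2.0, 2.0, 2.0]
```

Every element is a float even though only the integer branch ran, which matches the behavior described for the scalar-returning functions in this section.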

Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:

python
def f(row):
    return row["a"] + (row["b"] - (row["c"] / row["d"])) % row["e"]


df = cudf.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
        "c": [cudf.NA, 4, 4],
        "d": [8, 7, 8],
        "e": [7, 1, 6],
    }
)
df
python
df.apply(f, axis=1)

String Data

String data may be used inside DataFrame.apply UDFs, subject to the same constraints as those for Series.apply. See the section on string handling for Series UDFs above for details. Below is a simple example extending the row UDF logic from above in the case of a string column:

python
str_df = cudf.DataFrame(
    {"str_col": ["abc", "ABC", "Example"], "scale": [1, 2, 3]}
)
str_df
python
def f(row):
    st = row["str_col"]
    scale = row["scale"]

    if len(st) > 5:
        return len(st) + scale
    else:
        return len(st)
python
result = str_df.apply(f, axis=1)
print(result)

User Defined Aggregation Functions (UDAFs) and GroupBy.apply

cuDF provides support for accelerating a subset of user defined aggregations through the GroupBy apply JIT engine, which is based on Numba. Aggregations that meet the criteria for the JIT engine are run through it automatically. Users wishing to develop aggregation functions for the JIT engine may call it explicitly by passing engine='jit' to apply:

python
# Create a dataframe with two groups
df = cudf.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 2],
        "b": [1, 2, 3, 4, 5, 6],
        "c": [3, 4, 5, 6, 7, 8],
    }
)
df
python
# a user defined aggregation function.


def udaf(df):
    return df["b"].max() - df["b"].min() / 2
python
result = df.groupby("a").apply(udaf, engine="jit")
result
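
As a cross-check on what the UDAF computes, the same aggregation can be written in plain Python. Note that division binds tighter than subtraction, so the expression evaluates as max - (min / 2). This is a CPU-only sketch with no cuDF calls; `rows`, `groups`, and `reference` are hypothetical names used for illustration.

```python
rows = {
    "a": [1, 1, 1, 2, 2, 2],
    "b": [1, 2, 3, 4, 5, 6],
}

# Collect the values of column "b" for each group key in column "a".
groups = {}
for key, b in zip(rows["a"], rows["b"]):
    groups.setdefault(key, []).append(b)

# Same expression as udaf: max(b) - (min(b) / 2)
reference = {key: max(b) - min(b) / 2 for key, b in groups.items()}
print(reference)  # {1: 2.5, 2: 4.0}
```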

GroupBy JIT Engine Supported Features

For cuDF to execute a UDAF through the JIT engine, several criteria must be met for the input data and UDAF itself. It is expected that many restrictions may be lifted as development proceeds.

Restrictions

  • Data containing nulls is not yet permitted. Attempting to use data containing nulls with engine='jit' will raise.
  • Broadly speaking, only 4 or 8 byte integer and float dtypes are permitted: ["int32", "int64", "float32", "float64"].
  • Some functions have additional dtype restrictions, such as corr, which does not yet support floating point dtypes. Calling corr with such a missing overload will raise.
  • If a column of an unsupported dtype is accessed and used inside a UDAF, cuDF will raise.
  • Operations that return new columns are not permitted within the UDAF, such as a binary operation between columns:
    python
    df['a'] + df['b']
    
    Doing so will raise.
  • Functions that return Series or DataFrame objects are not yet available, only functions that return scalars are permitted.
  • The following reductions are supported:
    • max()
    • min()
    • sum()
    • mean()
    • var()
    • std()
    • idxmax()
    • idxmin()
    • corr() (integer data only)

Rolling Window UDFs

For time-series data, we may need to operate on a small "window" of our column at a time, processing each portion independently. We could slide ("roll") this window over the entire column to answer questions like "What is the 3-day moving average of a stock price over the past year?"
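
As a concrete picture of a rolling window, here is a minimal pure-Python 3-element moving average. The `prices` values are made up for illustration, `moving_average` is a hypothetical helper, and `None` marks positions where the window is not yet full.

```python
def moving_average(values, window=3):
    out = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(None)  # not enough observations yet
        else:
            chunk = values[end - window:end]  # the trailing window
            out.append(sum(chunk) / window)
    return out


prices = [10.0, 11.0, 12.0, 13.0, 17.0]  # hypothetical daily prices
print(moving_average(prices))  # [None, None, 11.0, 12.0, 14.0]
```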

We can apply more complex functions to rolling windows of Series and DataFrames using apply. This example is adapted from cuDF's API documentation. First, we'll create an example Series and then create a rolling object from the Series.

python
ser = cudf.Series([16, 25, 36, 49, 64, 81], dtype="float64")
ser
python
rolling = ser.rolling(window=3, min_periods=3, center=False)
rolling

Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values.

python
import math


def example_func(window):
    b = 0
    for a in window:
        b = max(b, math.sqrt(a))
    if b == 8:
        return 100
    return b

We can execute the function by passing it to apply. With window=3, min_periods=3, and center=False, our first two values are null.

python
rolling.apply(example_func)
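
For reference, the same windowing can be traced in plain Python. This sketch only mirrors the semantics described above (window=3, min_periods=3, trailing windows): `rolling_apply_reference` is a hypothetical helper, `None` stands in for null, and results are cast to float to match the float64 input.

```python
import math


def example_func(window):
    b = 0
    for a in window:
        b = max(b, math.sqrt(a))
    if b == 8:
        return 100
    return b


def rolling_apply_reference(values, func, window=3):
    out = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(None)  # fewer than min_periods observations
        else:
            out.append(float(func(values[end - window:end])))
    return out


values = [16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
reference = rolling_apply_reference(values, example_func)
print(reference)  # [None, None, 6.0, 7.0, 100.0, 9.0]
```

The third full window, [36, 49, 64], has a largest square root of exactly 8, so it takes the conditional branch and yields 100.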

We can apply this function to every column in a DataFrame, too.

python
df2 = cudf.DataFrame()
df2["a"] = np.arange(55, 65, dtype="float64")
df2["b"] = np.arange(55, 65, dtype="float64")
df2.head()
python
rolling = df2.rolling(window=3, min_periods=3, center=False)
rolling.apply(example_func)

Numba Kernels on CuPy Arrays

We can also execute Numba kernels on CuPy NDArrays, again thanks to the __cuda_array_interface__. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series.

python
import cupy as cp

s = cudf.Series([1.0, 2, 3, 4, 10])
arr = cp.asarray(s)
arr

Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we'll call out.

python
@cuda.jit
def multiply_by_5(x, out):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * 5


out = cudf.Series(cp.zeros(len(s), dtype="int32"))
multiply_by_5.forall(s.shape[0])(s, out)
out

Finally, we execute the same function on our array. We allocate an empty array out to store our results.

python
out = cp.empty_like(arr)
multiply_by_5.forall(arr.size)(arr, out)
out

Caveats

  • UDFs are currently only supported for numeric nondecimal scalar types (full support) and strings in Series.apply and DataFrame.apply (partial support, subject to the caveats outlined above). Attempting to use this API with unsupported types will raise a TypeError.
  • We do not yet fully support all arithmetic operators. Certain operations, such as bitwise operations, are not currently implemented but are planned for future releases. If you need an operator, please raise a GitHub issue so that it can be properly prioritized and implemented.

Summary

This guide has covered a lot of content. At this point, you should feel comfortable writing UDFs (with or without null values) that operate on:

  • Series
  • DataFrame
  • Rolling Windows
  • GroupBy DataFrames
  • CuPy NDArrays
  • Numba DeviceNDArrays
  • Generalized NA UDFs
  • String UDFs

For more information, please see the cuDF, Numba.cuda, and CuPy documentation.