scientific-skills/optimize-for-gpu/references/cudf.md
cuDF is a GPU DataFrame library that provides a pandas-like API for loading, joining, aggregating, filtering, and manipulating tabular data entirely on the GPU. It's part of the NVIDIA RAPIDS ecosystem and is built on the Apache Arrow columnar memory format.
Full documentation: https://docs.rapids.ai/api/cudf/stable/
Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.
uv add --extra-index-url=https://pypi.nvidia.com cudf-cu12 # For CUDA 12.x
Verify:
import cudf
print(cudf.Series([1, 2, 3])) # Should print a GPU series
cuDF offers two ways to accelerate pandas code:
Drop-in replacement that automatically accelerates pandas. Falls back to CPU for unsupported operations. Best for: quick acceleration of existing code, mixed codebases, prototyping.
Replace import pandas with import cudf. Maximum performance, no proxy overhead, but requires adapting code to cuDF's API (which has some behavioral differences from pandas). Best for: production pipelines, maximum performance, new GPU-first code.
The fastest path from pandas to GPU — no code changes required.
# Jupyter/IPython (MUST be before any pandas import)
%load_ext cudf.pandas
import pandas as pd # Now GPU-accelerated
# Command line
# python -m cudf.pandas your_script.py
# python -m cudf.pandas --profile your_script.py # With profiling
# Programmatic
import cudf.pandas
cudf.pandas.install()
import pandas as pd # Now GPU-accelerated
Critical: If pandas was already imported in the session, you must restart the kernel/process.
import pandas returns a proxy module that wraps cuDF and pandas.%%cudf.pandas.profile # Shows GPU vs CPU operation breakdown per cell
%%cudf.pandas.line_profile # Per-line GPU/CPU timing
proxy_df.as_gpu_object() # Get the cuDF DataFrame directly
proxy_df.as_cpu_object() # Get the pandas DataFrame directly
Note: automatic fallback stops working after you extract the underlying object.
cuGraph, cuML, Hvplot, Holoview, Ibis, NumPy, Matplotlib, Plotly, PyTorch, Seaborn, Scikit-Learn, SciPy, TensorFlow, XGBoost.
Not compatible: Joblib. For distributed work, use Dask-cuDF instead.
import cudf alongside cudf.pandas in the same session.numpy.ndarray, which can cause eager device-to-host transfers.CUDF_PANDAS_FALLBACK_MODE=1.import cudf
# From dict
df = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})
# From pandas
import pandas as pd
gdf = cudf.DataFrame.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))
# or
gdf = cudf.DataFrame(pandas_df)
# Series
s = cudf.Series([1, 2, 3, None, 5])
# Back to pandas
pdf = gdf.to_pandas()
df.head(10)
df.tail(5)
df.describe()
df.info()
df.dtypes
df.columns
df.shape
# Selection
df["a"] # Column → Series
df[["a", "b"]] # Multiple columns → DataFrame
df.loc[2:5, ["a", "b"]] # Label-based indexing
df.iloc[0:3] # Integer-based indexing
# Filtering
df[df["a"] > 2]
df.query("a > 2 and b < 6") # Supports @var for local variables
# Sorting
df.sort_values("a", ascending=False)
df.sort_index()
# Missing data
df.fillna(0)
df.dropna()
df.isna()
# Aggregations
df["a"].sum()
df["a"].mean()
df["a"].std()
df["a"].value_counts()
# Transforms
df["a"].clip(lower=1, upper=5)
df["a"].apply(lambda x: x * 2) # JIT-compiled
# Combining
cudf.concat([df1, df2])
df1.merge(df2, on="key")
df1.merge(df2, on="key", how="left") # left, right, inner, outer
# Arrow interop (zero-copy)
arrow_table = df.to_arrow()
df = cudf.DataFrame.from_arrow(arrow_table)
GPU-accelerated file reading and writing — often dramatically faster than pandas for large files.
# Read
df = cudf.read_parquet("data.parquet")
df = cudf.read_parquet("data.parquet", columns=["a", "b"]) # Read only specific columns
# Write
df.to_parquet("output.parquet")
# Metadata inspection (without loading data)
cudf.io.parquet.read_parquet_metadata("data.parquet")
# Incremental writing
writer = cudf.io.parquet.ParquetDatasetWriter("output_dir/", partition_cols=["year"])
writer.write_table(df)
writer.close()
df = cudf.read_csv("data.csv")
df = cudf.read_csv("data.csv", usecols=["a", "b"], dtype={"a": "int32"})
df.to_csv("output.csv", index=False)
df = cudf.read_json("data.json")
df = cudf.read_json("data.json", lines=True) # JSON Lines format
df.to_json("output.json")
df = cudf.read_orc("data.orc")
df.to_orc("output.orc")
| Format | Read | Write | GPU-Accelerated |
|---|---|---|---|
| Avro | cudf.read_avro() | N/A | Yes (read only) |
| Text | cudf.read_text() | N/A | Yes (read only) |
| HDF5 | cudf.read_hdf() | df.to_hdf() | No (uses pandas) |
| Feather | cudf.read_feather() | df.to_feather() | No (uses pandas) |
Prefer Parquet over CSV — columnar format reads faster on GPU, supports predicate pushdown, and compresses well.
df.groupby("category").sum()
df.groupby(["category", "subcategory"]).mean()
df.groupby("category").agg({"value": "sum", "count": "max"})
df.groupby("category").agg({"value": ["sum", "min", "max"], "count": "mean"})
Universal: count, size, nunique, nth, collect, unique
Numeric: sum, mean, var, std, median, idxmin, idxmax, min, max, quantile
Specialized: corr, cov
df.groupby("category").transform("max") # Broadcasts result to match group size
df.groupby("category").apply(lambda x: x.max() - x.min())
Warning: Apply runs the function sequentially per group — can be slow with many small groups. Use vectorized aggregations whenever possible.
def custom_agg(df):
return df["value"].max() - df["value"].min() / 2
result = df.groupby("category").apply(custom_agg, engine="jit")
JIT restrictions: no nulls, only int32/64 and float32/64, cannot return new columns.
cuDF uses sort=False by default (unlike pandas which sorts by default). To match pandas:
df.groupby("category", sort=True).sum()
# Or globally:
cudf.set_option("mode.pandas_compatible", True)
cuDF provides GPU-accelerated string operations via the .str accessor — identical API to pandas.
s = cudf.Series(["Hello World", "foo bar", "RAPIDS GPU", None])
# Case
s.str.lower()
s.str.upper()
s.str.title()
s.str.capitalize()
# Pattern matching
s.str.contains("World")
s.str.startswith("Hello")
s.str.endswith("GPU")
s.str.match(r"^[A-Z]")
# Extraction and replacement
s.str.extract(r"(\w+)\s(\w+)")
s.str.replace("World", "GPU")
s.str.slice(0, 5)
# Splitting and joining
s.str.split(" ")
s.str.cat(sep=", ")
# Info
s.str.len()
s.str.isalpha()
s.str.isdigit()
# cuDF-exclusive operations (not in pandas)
s.str.normalize_spaces() # Collapse whitespace
s.str.tokenize() # Tokenize strings
s.str.ngrams(2) # Generate n-grams
s.str.edit_distance(other) # Levenshtein distance
s.str.url_encode()
s.str.url_decode()
s = cudf.Series([1, 2, 3, 4, 5])
def square_plus_one(x):
return x ** 2 + 1
s.apply(square_plus_one) # Compiled to GPU kernel via Numba
With arguments:
def add_constant(x, c):
return x + c
s.apply(add_constant, args=(42,))
def row_func(row):
return row["a"] + row["b"] * 2
df.apply(row_func, axis=1) # Access columns by name via dict-like syntax
Nulls propagate automatically:
s = cudf.Series([1, cudf.NA, 3])
def f(x):
return x + 1
s.apply(f) # Returns [2, <NA>, 4]
Explicit null checks:
def f(x):
if x is cudf.NA:
return 0
return x + 1
String operations inside UDFs support: ==, !=, >=, <=, startswith(), endswith(), find(), rfind(), count(), in, strip/lstrip/rstrip(), upper/lower(), replace(), + (concatenation), len(), boolean checks.
For string UDFs creating intermediate strings, allocate heap:
from cudf.core.udf.utils import set_malloc_heap_size
set_malloc_heap_size(int(2e9)) # 2 GB
import math
s = cudf.Series([16, 25, 36, 49, 64, 81], dtype="float64")
def max_sqrt(window):
result = 0
for val in window:
result = max(result, math.sqrt(val))
return result
s.rolling(window=3, min_periods=3).apply(max_sqrt)
Limitation: Rolling UDFs do NOT support null values.
For maximum control, write CUDA kernels that operate directly on cuDF columns:
from numba import cuda
@cuda.jit
def gpu_multiply(in_col, out_col, multiplier):
i = cuda.grid(1)
if i < in_col.size:
out_col[i] = in_col[i] * multiplier
df["result"] = 0.0
gpu_multiply.forall(len(df))(df["a"], df["result"], 10.0)
**kwargs not supported.<NA> (not NaN) — cuDF uses a separate null mask, not NaN sentinels.np.nan inserted into integer columns becomes <NA> without casting to float.s = cudf.Series([1, None, 3, None, 5])
s.isna() # Boolean mask
s.notna()
s.fillna(0) # Fill with scalar
s.fillna({"a": 0, "b": 1}) # Fill with dict (per-column)
s.dropna()
# Aggregations skip NA by default
s.sum() # skipna=True (default)
s.sum(skipna=False) # Propagates NA
# GroupBy excludes NA groups by default
df.groupby("a", dropna=False).sum() # Include NA groups
| Category | Types |
|---|---|
| Integer | int8, int16, int32, int64, uint32, uint64 |
| Float | float32, float64 |
| Datetime | datetime64[s/ms/us/ns] |
| Timedelta | timedelta[s/ms/us/ns] |
| Categorical | CategoricalDtype |
| String | object / string |
| Decimal | Decimal32Dtype, Decimal64Dtype, Decimal128Dtype |
| List | ListDtype (nested lists) |
| Struct | StructDtype (dict-like) |
All types are nullable. List columns have a .list accessor (get(), len(), contains(), sort_values(), unique(), concat()). Struct columns have a .struct accessor (field(), explode()).
No object dtype for arbitrary Python objects — object dtype only stores strings.
cuDF uses RMM for GPU memory allocation. Configure it for your workload:
import rmm
# Pool allocator (recommended for production — avoids per-allocation cudaMalloc overhead)
pool = rmm.mr.PoolMemoryResource(
rmm.mr.CudaMemoryResource(),
initial_pool_size="1GiB",
maximum_pool_size="4GiB"
)
rmm.mr.set_current_device_resource(pool)
# Managed memory (allows datasets larger than GPU memory)
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
# Managed + pool (best of both)
pool = rmm.mr.PoolMemoryResource(
rmm.mr.ManagedMemoryResource(),
initial_pool_size="1GiB"
)
rmm.mr.set_current_device_resource(pool)
When using cuDF with CuPy or Numba, align all libraries on the same allocator to avoid memory fragmentation:
# CuPy
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy
cupy.cuda.set_allocator(rmm_cupy_allocator)
# Numba
from rmm.allocators.numba import RMMNumbaManager
from numba import cuda
cuda.set_memory_manager(RMMNumbaManager)
cudf.set_option("copy_on_write", True)
# or: export CUDF_COPY_ON_WRITE=1
Slices, .head(), shallow copies, and view-generating methods share memory until one is modified. Reduces memory usage significantly for workflows with many derived DataFrames.
rmm.statistics.enable_statistics()
stats = rmm.statistics.get_statistics()
# Returns: current_bytes, current_count, peak_bytes, peak_count, total_bytes, total_count
import cupy as cp
# cuDF → CuPy
arr = df.to_cupy() # DataFrame → 2D CuPy array
arr = cp.asarray(df["col"]) # Series → 1D CuPy array
arr = df["col"].values # Series → 1D CuPy array
# CuPy → cuDF
df = cudf.DataFrame(cupy_2d_array)
s = cudf.Series(cupy_1d_array)
# Via DLPack
df = cudf.from_dlpack(cupy_array.__dlpack__())
arrow_table = df.to_arrow()
df = cudf.DataFrame.from_arrow(arrow_table)
cuDF Series exposes __cuda_array_interface__ for zero-copy sharing with any compatible library (CuPy, Numba, PyTorch, etc.).
For datasets larger than a single GPU's memory, or for multi-GPU parallelism:
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# One worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)
# From files
ddf = dask_cudf.read_csv("path/*.csv")
ddf = dask_cudf.read_parquet("path/")
# From cuDF DataFrame
ddf = dask_cudf.from_cudf(df, npartitions=16)
# Operations (lazy — call .compute() to execute)
result = ddf.groupby("a").sum().compute()
# Persist in GPU memory for repeated access
ddf = ddf.persist()
Key differences from cuDF: .iloc not supported, must call .compute() to materialize, transpose not implemented.
Start with cudf.pandas for easiest adoption — zero code changes, automatic GPU/CPU fallback.
Switch to direct cuDF API for max performance — avoids proxy overhead and fallback copying costs.
Prefer Parquet over CSV — columnar format, faster GPU reads, predicate pushdown, better compression.
Use pool allocators via RMM — avoids per-allocation cudaMalloc overhead.
Enable copy-on-write — cudf.set_option("copy_on_write", True) reduces memory from slices and views.
Reshape data to be long (more rows, fewer columns) — GPUs parallelize over rows.
Never iterate — use vectorized operations exclusively. for row in df.iterrows() defeats the purpose of GPU acceleration.
Minimum dataset size: GPUs shine with 10,000-100,000+ rows. Smaller datasets may be faster on CPU.
Use vectorized string ops (.str. accessor) instead of row-wise string UDFs.
Use CuPy for row-wise math that cuDF doesn't support natively.
Use Numba CUDA kernels for complex element-wise operations.
Align all RAPIDS libraries on the same RMM allocator to avoid memory fragmentation.
For distributed workloads, use Dask-cuDF with persist() to keep data on GPU memory.
Result ordering is non-deterministic by default (groupby, joins, etc.). Use sort=True or cudf.set_option("mode.pandas_compatible", True).
All types are nullable. Missing values are <NA>, not NaN. Integer columns with missing values stay integer (no float coercion).
No iteration. for val in series is not supported. Convert to pandas first if you must iterate.
Unique column names required. No duplicate column names.
No arbitrary Python objects. The object dtype only stores strings.
.apply() uses Numba JIT. Only a subset of Python is supported inside UDFs — no arbitrary Python objects, no external library calls.
Floating-point results may differ slightly due to GPU parallel operation ordering. Use tolerance-based comparisons.
GroupBy defaults to sort=False (pandas defaults to sort=True).
No ExtensionDtype support from pandas.
%load_ext cudf.pandas
import pandas as pd
# Everything else stays exactly the same
# Before
import pandas as pd
df = pd.read_csv("data.csv")
result = df.groupby("col").mean()
# After
import cudf
df = cudf.read_csv("data.csv")
result = df.groupby("col").mean()
# Before (pandas — slow even on CPU)
for idx, row in df.iterrows():
df.at[idx, "c"] = row["a"] + row["b"]
# After (cuDF)
df["c"] = df["a"] + df["b"]
# Before
df["result"] = df.apply(lambda row: row["a"] ** 2 + row["b"], axis=1)
# After (vectorized — much faster)
df["result"] = df["a"] ** 2 + df["b"]
# Load and process on GPU
gdf = cudf.read_parquet("data.parquet")
result = gdf.groupby("key").agg({"val": "sum"})
# Convert to pandas only when needed (plotting, export, etc.)
pdf = result.to_pandas()
pdf.plot()
import cupy as cp
# Convert to CuPy for operations cuDF doesn't support
arr = df[["x", "y", "z"]].to_cupy()
norms = cp.linalg.norm(arr, axis=1)
df["norm"] = cudf.Series(norms)
cudf.set_option("copy_on_write", True) # Enable copy-on-write
cudf.set_option("mode.pandas_compatible", True) # Match pandas behavior
cudf.describe_option() # List all options
| Environment Variable | Purpose |
|---|---|
CUDF_COPY_ON_WRITE=1 | Enable copy-on-write |
CUDF_PANDAS_RMM_MODE | Control memory allocator for cudf.pandas |
CUDF_PANDAS_FALLBACK_MODE=1 | Force CPU-only execution in cudf.pandas |