cuDF Analytics Skill

GPU-accelerated data analysis using NVIDIA RAPIDS cuDF. cuDF provides a pandas-like API that runs on NVIDIA GPUs, enabling massive speedups on large datasets.

When to Use This Skill

Use this skill when:

Analyzing CSV files, datasets, or tabular data
Computing statistical summaries (mean, median, std, quartiles)
Performing groupby aggregations
Detecting anomalies or outliers in data
Profiling datasets with millions of rows
Computing correlation matrices

Initialization (REQUIRED)

Always start every script with this boilerplate. It tests actual GPU operations, not just import.

python

import pandas as pd

try:
    import cudf
    # Smoke-test: verify GPU compute AND host transfer both work
    _test = cudf.Series([1, 2, 3])
    assert _test.sum() == 6
    assert _test.to_pandas().tolist() == [1, 2, 3]
    GPU = True
except Exception as e:
    print(f"[GPU] cudf unavailable, falling back to pandas: {e}")
    GPU = False

def read_csv(path):
    return cudf.read_csv(path) if GPU else pd.read_csv(path)

def to_pd(df):
    """Convert cuDF DataFrame/Series to pandas. Use this instead of .to_pandas() directly."""
    if not GPU:
        return df
    try:
        return df.to_pandas()
    except Exception as e:
        print(f"[GPU] .to_pandas() failed, using Arrow fallback: {e}")
        return df.to_arrow().to_pandas()

Quick Reference

cuDF mirrors the pandas API. Common operations:

Read Data

python

df = read_csv("data.csv")

Statistical Summary

python

# Use to_pd() when you need pandas output
summary = to_pd(df[["value", "score"]].describe())

# Scalar values work directly with float()
mean_val = float(df["value"].mean())
q1 = float(df["value"].quantile(0.25))

# Correlation
corr = float(df["value"].corr(df["score"]))

Groupby Aggregation

python

result = df.groupby("category").agg({
    "revenue": ["sum", "mean", "count"],
    "quantity": ["sum", "mean"],
})
result_pd = to_pd(result)

Anomaly Detection (IQR Method)

python

col = "value"
Q1 = float(df[col].quantile(0.25))
Q3 = float(df[col].quantile(0.75))
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = to_pd(df[(df[col] < lower) | (df[col] > upper)])

Anomaly Detection (Z-Score Method)

python

mean = float(df[col].mean())
std = float(df[col].std())
df["z_score"] = (df[col] - mean) / std
anomalies = to_pd(df[df["z_score"].abs() > 3])

Filtering and Selection

python

# Filter rows
filtered = df[df["status"] == "active"]

# Select columns
subset = df[["name", "revenue", "date"]]

# Sort
sorted_df = df.sort_values("revenue", ascending=False)

# Convert to pandas for final output / iteration
result_pd = to_pd(sorted_df)

Data Type Requirements

cuDF requires explicit type specification for optimal performance:

Use float32 or float64 for numeric data
Use int32 or int64 for integer data
String columns use cuDF's string dtype automatically

Output Guidelines

When reporting analysis results:

Include dataset dimensions (rows x columns)
Show key statistics in formatted tables
Highlight notable patterns, trends, or anomalies
Provide both summary statistics and specific examples
Note any data quality issues (missing values, outliers)