scientific-skills/optimize-for-gpu/references/cuml.md
cuML is NVIDIA's GPU-accelerated machine learning library within the RAPIDS ecosystem. It provides scikit-learn-compatible APIs for 50+ algorithms and typically runs 10-50x faster than CPU scikit-learn, with some algorithms (HDBSCAN, t-SNE, UMAP, KNN) reaching 60-600x speedups. It follows the familiar fit/predict/transform pattern from sklearn.
Full documentation: https://docs.rapids.ai/api/cuml/stable/
Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.
uv add --extra-index-url=https://pypi.nvidia.com cuml-cu12 # For CUDA 12.x
Platform: Linux and WSL2 only (no native macOS or Windows). Requires: scikit-learn >= 1.4, NVIDIA GPU with CUDA 12.x support.
Verify:
import cuml
print(cuml.__version__)
from cuml.datasets import make_blobs
X, y = make_blobs(n_samples=1000, n_features=10)
print(f"Generated {X.shape[0]} samples on GPU")
cuml.accel (zero code change): Transparently intercepts sklearn, umap-learn, and hdbscan calls and routes them to the GPU, falling back to CPU for unsupported operations. Best for: quick acceleration of existing sklearn code, mixed codebases, prototyping.
Direct cuML API: Replace from sklearn with from cuml for explicit control over GPU execution. Best for: production pipelines, maximum performance, new GPU-first code.
The fastest path from sklearn to GPU — no code changes required. Similar to cudf.pandas for pandas.
# Jupyter/IPython (MUST be the first cell, before any sklearn import)
%load_ext cuml.accel
import sklearn # Now GPU-accelerated
from sklearn.cluster import KMeans # Runs on GPU transparently
# Command line
python -m cuml.accel script.py
python -m cuml.accel -v script.py # With info logging
python -m cuml.accel -vv script.py # With debug logging
# Programmatic (call BEFORE importing sklearn)
import cuml.accel
cuml.accel.install()
from sklearn.cluster import KMeans # Now GPU-accelerated
# Environment variable
CUML_ACCEL_ENABLED=1 python script.py
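Because the accelerator has to be installed before sklearn is imported, a small guard can catch ordering mistakes early. The following is a minimal sketch rather than part of cuML: the helper names `sklearn_already_imported` and `install_accel_safely` are hypothetical, and it assumes that inspecting `sys.modules` is a good enough signal that sklearn was loaded earlier.

```python
import sys


def sklearn_already_imported(modules=None):
    """Return True if sklearn (or any sklearn submodule) is already loaded."""
    modules = sys.modules if modules is None else modules
    return any(name == "sklearn" or name.startswith("sklearn.") for name in modules)


def install_accel_safely():
    """Install cuml.accel, refusing to run if sklearn was imported first.

    Hypothetical wrapper: only cuml.accel.install() comes from cuML itself.
    """
    if sklearn_already_imported():
        raise RuntimeError(
            "sklearn was imported before cuml.accel.install(); "
            "restart the interpreter and install the accelerator first."
        )
    import cuml.accel  # imported lazily so this module loads without a GPU

    cuml.accel.install()
```

Calling `install_accel_safely()` at the top of an entry-point script turns a silent CPU run into an explicit error.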
init for KMeans)
n_components="mle" for PCA, positive=True for linear models, warm starts

GPU results are equivalent in quality but may differ at the floating-point level because parallel reductions accumulate in a different order. Compare model quality via scores (accuracy, R2, etc.), not raw coefficient values.
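To illustrate why scores are the right comparison, here is a small pure-Python sketch. The prediction values are made up for illustration (they are not measured cuML output), and the `r2` helper and the `1e-7` perturbation are assumptions chosen only to mimic a floating-point-level difference between a CPU and a GPU run.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination, written out in plain Python."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only: "gpu_pred" is "cpu_pred" nudged by a tiny amount
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
cpu_pred = [1.1, 1.9, 3.0, 4.2, 4.8]
gpu_pred = [p + 1e-7 for p in cpu_pred]

# Exact comparison of raw values fails even though the models agree
raw_equal = cpu_pred == gpu_pred

# Comparing scores within a tolerance reflects actual model quality
scores_close = math.isclose(r2(y_true, cpu_pred), r2(y_true, gpu_pred), rel_tol=1e-4)
```

The same pattern applies to real models: compare `model.score(...)` results within a tolerance instead of asserting equality on `coef_`.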
Replace sklearn imports with cuml imports. The API is identical — fit/predict/transform.
from cuml.cluster import DBSCAN
from cuml.datasets import make_blobs
# Create data directly on GPU
X, y = make_blobs(n_samples=100_000, centers=5, n_features=10, random_state=42)
# Fit — runs on GPU
model = DBSCAN(eps=1.0, min_samples=5)
model.fit(X)
print(model.labels_)
from cuml import LinearRegression
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
X, y = make_regression(n_samples=100_000, n_features=50, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)
print(f"R2 score: {score:.4f}")
| cuML | sklearn Equivalent | Multi-GPU |
|---|---|---|
| cuml.KMeans | sklearn.cluster.KMeans | Yes |
| cuml.DBSCAN | sklearn.cluster.DBSCAN | Yes |
| cuml.AgglomerativeClustering | sklearn.cluster.AgglomerativeClustering | No |
| cuml.cluster.hdbscan.HDBSCAN | hdbscan.HDBSCAN | No |
| cuml.cluster.SpectralClustering | sklearn.cluster.SpectralClustering | No |
| cuML | sklearn Equivalent | Multi-GPU |
|---|---|---|
| cuml.LinearRegression | sklearn.linear_model.LinearRegression | Yes |
| cuml.Ridge | sklearn.linear_model.Ridge | Yes |
| cuml.Lasso | sklearn.linear_model.Lasso | Yes |
| cuml.ElasticNet | sklearn.linear_model.ElasticNet | Yes |
| cuml.SVR | sklearn.svm.SVR | No |
| cuml.KernelRidge | sklearn.kernel_ridge.KernelRidge | No |
| cuml.ensemble.RandomForestRegressor | sklearn.ensemble.RandomForestRegressor | Yes |
| cuml.MBSGDRegressor | sklearn.linear_model.SGDRegressor | No |
| cuML | sklearn Equivalent | Multi-GPU |
|---|---|---|
| cuml.LogisticRegression | sklearn.linear_model.LogisticRegression | No |
| cuml.ensemble.RandomForestClassifier | sklearn.ensemble.RandomForestClassifier | Yes |
| cuml.svm.SVC | sklearn.svm.SVC | No |
| cuml.svm.LinearSVC | sklearn.svm.LinearSVC | No |
| cuml.naive_bayes.GaussianNB | sklearn.naive_bayes.GaussianNB | No |
| cuml.naive_bayes.MultinomialNB | sklearn.naive_bayes.MultinomialNB | Yes |
| cuml.naive_bayes.BernoulliNB | sklearn.naive_bayes.BernoulliNB | No |
| cuml.naive_bayes.CategoricalNB | sklearn.naive_bayes.CategoricalNB | No |
| cuml.naive_bayes.ComplementNB | sklearn.naive_bayes.ComplementNB | No |
| cuml.neighbors.KNeighborsClassifier | sklearn.neighbors.KNeighborsClassifier | Yes |
| cuml.neighbors.KNeighborsRegressor | sklearn.neighbors.KNeighborsRegressor | Yes |
| cuml.MBSGDClassifier | sklearn.linear_model.SGDClassifier | No |
| cuml.multiclass.OneVsOneClassifier | sklearn.multiclass.OneVsOneClassifier | No |
| cuml.multiclass.OneVsRestClassifier | sklearn.multiclass.OneVsRestClassifier | No |
| cuML | sklearn/Library Equivalent | Multi-GPU |
|---|---|---|
| cuml.PCA | sklearn.decomposition.PCA | Yes |
| cuml.IncrementalPCA | sklearn.decomposition.IncrementalPCA | No |
| cuml.TruncatedSVD | sklearn.decomposition.TruncatedSVD | Yes |
| cuml.UMAP | umap.UMAP | Yes (inference) |
| cuml.TSNE | sklearn.manifold.TSNE | No |
| cuml.random_projection.GaussianRandomProjection | sklearn.random_projection.GaussianRandomProjection | No |
| cuml.random_projection.SparseRandomProjection | sklearn.random_projection.SparseRandomProjection | No |
| cuML | sklearn Equivalent | Multi-GPU |
|---|---|---|
| cuml.neighbors.NearestNeighbors | sklearn.neighbors.NearestNeighbors | Yes |
| cuml.neighbors.KNeighborsClassifier | sklearn.neighbors.KNeighborsClassifier | Yes |
| cuml.neighbors.KNeighborsRegressor | sklearn.neighbors.KNeighborsRegressor | Yes |
| cuml.neighbors.KernelDensity | sklearn.neighbors.KernelDensity | No |
| cuML | Description |
|---|---|
| cuml.ExponentialSmoothing | Holt-Winters exponential smoothing |
| cuml.tsa.ARIMA | ARIMA/SARIMA models (batched: fits multiple series simultaneously) |
| cuml.tsa.auto_arima.AutoARIMA | Automatic ARIMA order selection |
Regression: r2_score, mean_squared_error, mean_absolute_error, mean_squared_log_error, median_absolute_error
Classification: accuracy_score, log_loss, roc_auc_score, precision_recall_curve, confusion_matrix
Clustering: adjusted_rand_score, silhouette_score, silhouette_samples, homogeneity_score, completeness_score, v_measure_score, mutual_info_score
Other: trustworthiness, pairwise_distances, pairwise_kernels
| cuML | Description |
|---|---|
| cuml.explainer.KernelExplainer | SHAP Kernel Explainer |
| cuml.explainer.PermutationExplainer | SHAP Permutation Explainer |
| cuml.explainer.TreeExplainer | SHAP Tree Explainer |
cuML accepts: NumPy arrays, CuPy arrays, cuDF DataFrames/Series, pandas DataFrames/Series, Numba device arrays, PyTorch tensors (via __cuda_array_interface__).
NumPy and pandas inputs are automatically transferred to GPU. For best performance, pass CuPy arrays or cuDF DataFrames to avoid transfers.
import cuml
# Global setting
cuml.set_global_output_type('cupy') # Options: 'input', 'cupy', 'numpy', 'cudf', 'pandas'
# Context manager
with cuml.using_output_type('cudf'):
    result = model.predict(X)  # Returns cudf Series
# Per-estimator
model = cuml.KMeans(output_type='cupy')
Performance ranking (fastest to slowest output type):
1. cupy: no host transfers, most efficient
2. cudf: slight overhead for some shapes
3. numpy / pandas: device-to-host transfer cost

Best practice: Use cupy or cudf for intermediate results. Only convert to numpy/pandas at the end for visualization or export.
cuML provides GPU-accelerated versions of most common sklearn preprocessors.
from cuml.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from cuml.preprocessing import Normalizer, PowerTransformer, QuantileTransformer
from cuml.preprocessing import Binarizer, PolynomialFeatures, KBinsDiscretizer
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
from cuml.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, TargetEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
ohe = OneHotEncoder(sparse_output=False)
X_encoded = ohe.fit_transform(X_categorical)
from cuml.preprocessing import SimpleImputer, MissingIndicator
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
from cuml.compose import ColumnTransformer, make_column_transformer
from cuml.preprocessing import StandardScaler, OneHotEncoder
preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'income']),
    (OneHotEncoder(), ['category', 'region']),
)
X_processed = preprocessor.fit_transform(df)
scale(), minmax_scale(), maxabs_scale(), robust_scale(), normalize(), binarize(), add_dummy_feature(), label_binarize()
from cuml.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
tfidf = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf.fit_transform(corpus)
from cuml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from cuml.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # ...
For GPU-efficient hyperparameter search, use dask-ml's GridSearchCV/RandomizedSearchCV rather than sklearn's — sklearn's version causes excessive CPU-GPU data transfers per fold.
from dask_ml.model_selection import RandomizedSearchCV
from cuml.ensemble import RandomForestClassifier
param_distributions = {
    'max_depth': [8, 12, 16, 20],
    'n_estimators': [100, 200, 500],
    'max_features': [0.5, 0.75, 1.0],
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=25,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(f"Best score: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
from cuml.datasets import make_blobs, make_classification, make_regression
X, y = make_blobs(n_samples=100_000, centers=5, n_features=20, random_state=42)
X, y = make_classification(n_samples=100_000, n_features=50, n_informative=25)
X, y = make_regression(n_samples=100_000, n_features=50, noise=0.1)
FIL provides high-performance GPU inference for tree-based models trained in any framework — 80x+ faster than sklearn inference.
from cuml.fil import ForestInference
# Load from XGBoost, LightGBM, or sklearn saved models
fil_model = ForestInference.load("xgboost_model.ubj", is_classifier=True)
# Optional: optimize for specific batch size
fil_model.optimize()
# Predict (80x+ faster than sklearn)
predictions = fil_model.predict(X_test)
probas = fil_model.predict_proba(X_test)
Supports: XGBoost, LightGBM, sklearn Random Forests, any Treelite-compatible model.
This is especially valuable when you have a model already trained on CPU and want to speed up inference without retraining.
For datasets too large for a single GPU or when you want to use multiple GPUs.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# One Dask worker per GPU
cluster = LocalCUDACluster(
    rmm_pool_size="12GB",
    enable_cudf_spill=True,
)
client = Client(cluster)
# Create distributed data
from cuml.dask.datasets import make_blobs
X, y = make_blobs(
    n_samples=1_000_000,
    n_features=20,
    centers=5,
    n_parts=len(client.scheduler_info()['workers']) * 2,  # 2 partitions per worker
)
# Use Dask estimator
from cuml.dask.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
labels = kmeans.predict(X)
# Convert to single-GPU model for serialization
single_model = kmeans.get_combined_model()
client.close()
cluster.close()
cuml.dask)
import pickle
# Save cuML model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=5)
# Load cuML model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
single_model = dask_model.get_combined_model().
import rmm
# Pre-allocate a memory pool for faster allocation
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**32) # 4 GB pool
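The pool size is passed in bytes, so a small helper can make the intent clearer than a bare `2**32`. This is a minimal sketch under stated assumptions: `pool_size_bytes` and `init_pool` are hypothetical names, and rounding down to a multiple of 256 bytes reflects an assumed RMM alignment requirement that is worth checking against the RMM documentation.

```python
def pool_size_bytes(gib, alignment=256):
    """Convert a size in GiB to bytes, rounded down to a multiple of `alignment`."""
    raw = int(gib * 2**30)
    return raw - (raw % alignment)


def init_pool(gib):
    """Create an RMM memory pool of roughly `gib` GiB (requires a GPU and rmm)."""
    import rmm  # imported lazily so the helper above works without a GPU

    rmm.reinitialize(pool_allocator=True, initial_pool_size=pool_size_bytes(gib))


# Same value as the 2**32 literal used above
four_gib = pool_size_bytes(4)
```

On a GPU machine, `init_pool(4)` would be equivalent to the `rmm.reinitialize` call shown earlier.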
When using cuML alongside cuDF and CuPy, align all libraries on the same RMM allocator:
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy
cupy.cuda.set_allocator(rmm_cupy_allocator)
cuml.accel uses managed memory by default (host RAM augments GPU VRAM). Disable with --disable-uvm flag if experiencing slowdowns. Managed memory does NOT work on WSL2 or when RMM is externally configured.
| Category | Typical Speedup | Notes |
|---|---|---|
| HDBSCAN, t-SNE, UMAP | 60-300x | Complex algorithms benefit most |
| KNN | Up to 600x | Scales dramatically with data size |
| KMeans, Random Forest | 15-80x | RF: 20-45x single GPU |
| FIL inference | 80x+ | Tree model inference from any framework |
| Linear models, PCA, Ridge | 2-10x | Simpler algorithms, lower but consistent gains |
Use float32. GPU float32 throughput is 2x-32x higher than float64. Most ML algorithms don't need double precision.
Keep data on GPU. Pass CuPy arrays or cuDF DataFrames. Every NumPy/pandas conversion triggers a device-host transfer.
Larger datasets = larger speedup. GPU parallelism advantage grows with data size. Minimum ~10K rows to see benefit.
Wide data benefits more. 128-512 features see higher speedups than 8-16 features.
First call has JIT overhead. Benchmark on subsequent calls, not the first.
Use RMM pools. Allocating from a pre-allocated memory pool can be up to ~1000x faster than a raw cudaMalloc call.
Use dask-ml for hyperparameter tuning, not sklearn's GridSearchCV — it avoids excessive CPU-GPU transfers.
Use FIL for tree model inference. Even if the model was trained on CPU (XGBoost, LightGBM, sklearn RF), FIL gives 80x+ inference speedup.
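The advice to benchmark after the first call can be captured in a small timing harness. This is a hedged sketch using only the standard library: `benchmark` is a hypothetical helper, and it assumes the function being timed blocks until its GPU work is finished (if it launches asynchronous work, synchronize inside the function before returning).

```python
import time
from statistics import median


def benchmark(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return the median wall-clock time of fn, excluding warm-up calls."""
    # Warm-up calls absorb one-time costs such as JIT compilation
    for _ in range(warmup):
        fn(*args, **kwargs)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return median(timings)


# Stand-in workload so the sketch runs anywhere; swap in e.g. `model.fit`
call_count = 0


def fake_fit(n):
    global call_count
    call_count += 1
    return sum(i * i for i in range(n))


elapsed = benchmark(fake_fit, 10_000, warmup=2, repeats=3)
```

With a real estimator the call would look like `benchmark(model.fit, X_train, y_train)`, run once for the sklearn model and once for the cuML model.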
__cuda_array_interface__. Most efficient intermediate format.
cuml.dask module.
import cudf
import cuml
from cuml.preprocessing import StandardScaler
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split
# Load data on GPU
df = cudf.read_parquet("data.parquet")
X = df.drop("target", axis=1)
y = df["target"]
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Preprocess
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train
model = RandomForestClassifier(n_estimators=100, max_depth=16)
model.fit(X_train, y_train)
# Evaluate
score = model.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")
This pipeline stays on the GPU from the Parquet read through model evaluation, with no intermediate CPU-GPU round trips.
Platform: Linux and WSL2 only. No native macOS or Windows.
Sparse data: Most cuML algorithms do not support sparse matrices. Under cuml.accel, sparse inputs fall back to CPU.
String data: Must be pre-encoded to numeric. No native string column support in estimators.
Multi-output: Not supported for Random Forest.
Warm starts: Not supported for most algorithms.
Some sklearn parameters ignored: n_jobs (GPU handles parallelism), positive=True, specific solver choices.
Numerical precision: Results equivalent in quality but may differ at floating-point level. Compare scores, not raw coefficients.
Memory: Limited by GPU VRAM (typically 8-80 GB). Use managed memory or Dask for larger datasets.
Missing fitted attributes: Some sklearn attributes not computed under cuml.accel (e.g., HDBSCAN exemplars_, LinearRegression rank_).
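A quick back-of-the-envelope check helps decide whether a dense dataset is likely to fit in VRAM and how much float32 saves over float64. This sketch is illustrative only: the helper names are hypothetical, and the 50% `headroom` factor is an assumption standing in for algorithm workspace, not a cuML rule.

```python
def estimate_array_bytes(n_rows, n_features, bytes_per_value=4):
    """Size in bytes of a dense array (4 bytes for float32, 8 for float64)."""
    return n_rows * n_features * bytes_per_value


def fits_on_gpu(n_rows, n_features, vram_gib, bytes_per_value=4, headroom=0.5):
    """Rough check: does the input fit within `headroom` of the GPU's memory?"""
    budget = vram_gib * 2**30 * headroom
    return estimate_array_bytes(n_rows, n_features, bytes_per_value) <= budget


# 10 million rows x 100 features
f32_bytes = estimate_array_bytes(10_000_000, 100, bytes_per_value=4)
f64_bytes = estimate_array_bytes(10_000_000, 100, bytes_per_value=8)

# On an 8 GiB GPU with 50% headroom, float32 fits but float64 does not
f32_fits = fits_on_gpu(10_000_000, 100, vram_gib=8, bytes_per_value=4)
f64_fits = fits_on_gpu(10_000_000, 100, vram_gib=8, bytes_per_value=8)
```

When the check fails, the options listed above apply: cast to float32, enable managed memory, or move to the Dask estimators.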
# Add one line at top of notebook:
%load_ext cuml.accel
from sklearn.cluster import KMeans # Now GPU-accelerated
from sklearn.decomposition import PCA # Now GPU-accelerated
# Everything else stays exactly the same
# Before
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# After
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.model_selection import train_test_split
import cudf
from cuml.preprocessing import StandardScaler, LabelEncoder
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split
# Load and preprocess entirely on GPU
df = cudf.read_parquet("data.parquet")
le = LabelEncoder()
df["category_encoded"] = le.fit_transform(df["category"])
X = df[["feature1", "feature2", "category_encoded"]].to_cupy()
y = df["target"].to_cupy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = RandomForestClassifier(n_estimators=200, max_depth=16)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
from cuml.fil import ForestInference
# Load XGBoost/LightGBM/sklearn model for 80x+ faster inference
fil_model = ForestInference.load("my_xgboost_model.ubj", is_classifier=True)
predictions = fil_model.predict(X_test)