scientific-skills/optimize-for-gpu/references/raft.md
RAFT (Reusable Accelerated Functions and Tools) is a RAPIDS library of GPU-accelerated building blocks for machine learning and information retrieval. It provides low-level primitives — sparse eigensolvers, device memory management, random graph generation, and multi-GPU communication — that higher-level libraries like cuML and cuGraph are built on. Use pylibraft directly when you need these primitives without the overhead of a full ML framework.
Full documentation: https://docs.rapids.ai/api/raft/stable/
Note: Vector search and clustering algorithms have been migrated to cuVS. Use cuVS for nearest neighbor search, not RAFT.
Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.
# pylibraft (core library)
uv add --extra-index-url=https://pypi.nvidia.com pylibraft-cu12 # For CUDA 12.x
# raft-dask (multi-node multi-GPU support, optional)
uv add --extra-index-url=https://pypi.nvidia.com raft-dask-cu12 # For CUDA 12.x
Verify:
import pylibraft
from pylibraft.common import DeviceResources
handle = DeviceResources()
handle.sync()
print("pylibraft is working")
DeviceResources manages expensive CUDA resources (streams, stream pools, library handles for cuBLAS/cuSOLVER). Create one and reuse it across multiple RAFT calls to avoid repeated allocation overhead.
from pylibraft.common import DeviceResources, Stream
# Default stream
handle = DeviceResources()
# Custom stream
stream = Stream()
handle = DeviceResources(stream)
# With a CuPy stream
import cupy
cupy_stream = cupy.cuda.Stream()
handle = DeviceResources(stream=cupy_stream.ptr)
# Always sync before reading results
handle.sync()
RAFT functions are asynchronous by default — they return immediately and work continues on the GPU. You must call handle.sync() before accessing output data on the CPU. If you don't pass a handle, RAFT allocates temporary resources internally and synchronizes before returning (convenient but slower for repeated calls).
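One way to make the "queue several calls, then sync once" pattern hard to forget is a small context manager that calls handle.sync() when the block exits. This is a minimal sketch, not part of pylibraft: `synced` is a hypothetical helper name, and it assumes only that the wrapped object has a sync() method, as DeviceResources does.

```python
from contextlib import contextmanager


@contextmanager
def synced(handle):
    """Yield `handle`, then call handle.sync() once when the block exits.

    Hypothetical helper, not part of pylibraft. It only relies on the
    object having a sync() method, which DeviceResources provides.
    """
    try:
        yield handle
    finally:
        # Runs on normal exit and when the block raises, so queued
        # GPU work is always waited on before control leaves the block.
        handle.sync()
```

With a real handle this would read `with synced(DeviceResources()) as handle:` followed by several RAFT calls that pass `handle=handle`. Outputs are then safe to read once the block has exited.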
Stream is a thin wrapper around cudaStream_t for ordering GPU operations:
from pylibraft.common import Stream
stream = Stream()
stream.sync() # Synchronize all work on this stream
ptr = stream.get_ptr() # Get the raw cudaStream_t pointer (uintptr_t)
device_ndarray is RAFT's lightweight GPU array type. It implements __cuda_array_interface__, making it interoperable with CuPy, Numba, PyTorch, and other GPU libraries.
from pylibraft.common import device_ndarray
import numpy as np
# Allocate empty GPU array
gpu_arr = device_ndarray.empty((1000, 50), dtype=np.float32)
# From a NumPy array (copies data to GPU)
cpu_data = np.random.rand(1000, 50).astype(np.float32)
gpu_arr = device_ndarray(cpu_data)
# Back to NumPy (copies data to CPU)
result = gpu_arr.copy_to_host()
# Properties
print(gpu_arr.shape) # (1000, 50)
print(gpu_arr.dtype) # float32
print(gpu_arr.c_contiguous) # True (row-major)
print(gpu_arr.f_contiguous) # False
You can configure all RAFT compute APIs to return CuPy arrays or PyTorch tensors instead of device_ndarray:
import pylibraft.config
pylibraft.config.set_output_as("cupy") # All APIs return cupy arrays
pylibraft.config.set_output_as("torch") # All APIs return torch tensors
# Custom conversion
pylibraft.config.set_output_as(lambda arr: arr.copy_to_host()) # Return numpy
eigsh is a GPU-accelerated Lanczos solver for the eigenvalues and eigenvectors of large sparse symmetric matrices. It mirrors the interface of scipy.sparse.linalg.eigsh but takes a cupyx CSR matrix as input.
import cupy as cp
import cupyx.scipy.sparse as sp
from pylibraft.sparse.linalg import eigsh
from pylibraft.common import DeviceResources
# Create a sparse symmetric matrix (CSR format)
n = 10000
density = 0.01
A = sp.random(n, n, density=density, dtype=cp.float32, format='csr')
A = A + A.T # Make symmetric
# Find 6 largest eigenvalues
handle = DeviceResources()
eigenvalues, eigenvectors = eigsh(A, k=6, which='LM', handle=handle)
handle.sync()
print(f"Eigenvalues shape: {eigenvalues.shape}") # (6,)
print(f"Eigenvectors shape: {eigenvectors.shape}") # (10000, 6)
Parameters:
- A: Sparse symmetric CSR matrix (cupyx.scipy.sparse.csr_matrix)
- k: Number of eigenvalues to compute (default: 6). Must be 1 <= k < n
- which: Which eigenvalues:
  - 'LM': largest in magnitude (default)
  - 'LA': largest algebraic
  - 'SA': smallest algebraic
  - 'SM': smallest in magnitude
- v0: Starting vector (optional, random if None)
- ncv: Number of Lanczos vectors. Must be k + 1 < ncv < n
- maxiter: Maximum iterations
- tol: Convergence tolerance (0 = machine precision)
- seed: Random seed for reproducibility
- handle: Optional DeviceResources handle
When to use: Spectral methods (spectral clustering, graph partitioning, PageRank-like computations), dimensionality reduction on sparse data, physics simulations with large sparse Hamiltonians, structural analysis (vibration modes).
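The k and ncv constraints listed above can be checked on the CPU before any GPU work is queued, which turns a bad argument into a readable error. The sketch below is illustrative only: `check_eigsh_args` is a hypothetical helper, not a pylibraft function, and it encodes just the two inequalities stated in the parameter list.

```python
def check_eigsh_args(n, k=6, ncv=None):
    """Validate k and ncv against the constraints listed above.

    Hypothetical helper, not part of pylibraft. `n` is the number of
    rows (and columns) of the square matrix A.
    """
    if not 1 <= k < n:
        raise ValueError(f"k must satisfy 1 <= k < n; got k={k}, n={n}")
    if ncv is not None and not k + 1 < ncv < n:
        raise ValueError(
            f"ncv must satisfy k + 1 < ncv < n; got ncv={ncv}, k={k}, n={n}"
        )


check_eigsh_args(n=10_000, k=6, ncv=20)  # valid combination, passes silently
```

In the eigsh example above, calling `check_eigsh_args(A.shape[0], k=6)` just before `eigsh(...)` would be one place to use it.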
rmat generates random graphs using the Recursive Matrix (R-MAT) model, commonly used for benchmarking graph algorithms with realistic structure (power-law degree distribution, community structure).
import cupy as cp
from pylibraft.random import rmat
from pylibraft.common import DeviceResources
n_edges = 100000
r_scale = 16 # log2 of source node count (2^16 = 65536 nodes)
c_scale = 16 # log2 of destination node count
theta_len = max(r_scale, c_scale) * 4
# Output: edge list as (src, dst) pairs
out = cp.empty((n_edges, 2), dtype=cp.int32)
# Probability distribution at each R-MAT level
theta = cp.random.random_sample(theta_len, dtype=cp.float32)
handle = DeviceResources()
rmat(out, theta, r_scale, c_scale, seed=42, handle=handle)
handle.sync()
print(f"Generated {n_edges} edges")
print(f"Edge list shape: {out.shape}") # (100000, 2)
print(f"Sample edges:\n{out[:5].get()}") # First 5 edges on CPU
When to use: Benchmarking graph algorithms, generating synthetic social/web graphs, testing graph processing pipelines at scale.
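The rmat example above fills theta with unnormalized random values. In the R-MAT model, each level has four quadrant probabilities (a, b, c, d) that sum to one. The sketch below builds such a theta on the CPU. It is a hedged illustration, not RAFT code: `uniform_theta` is a hypothetical helper, the row-major (a, b, c, d) ordering within each group of four is an assumption about the layout, and the defaults are the commonly used Graph500 parameters.

```python
def uniform_theta(r_scale, c_scale, a=0.57, b=0.19, c=0.19):
    """Build a flat theta that repeats one (a, b, c, d) block per level.

    Hypothetical helper, not part of pylibraft. The defaults are the
    commonly used Graph500 R-MAT parameters; d is whatever remains so
    that each block of four sums to 1.
    """
    d = 1.0 - (a + b + c)
    if min(a, b, c, d) < 0:
        raise ValueError("a, b, c and 1 - (a + b + c) must all be >= 0")
    levels = max(r_scale, c_scale)  # same sizing rule as theta_len above
    return [a, b, c, d] * levels
```

On the GPU, `theta = cp.asarray(uniform_theta(r_scale, c_scale), dtype=cp.float32)` could stand in for the random theta in the example. Skewing a above the other three concentrates edges on low-numbered nodes, which is what produces the power-law degree distribution.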
raft-dask provides a Comms class for managing NCCL and UCX communication across workers in a Dask cluster. This is the foundation for distributed GPU computing in RAPIDS.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from raft_dask.common import Comms, local_handle
# Set up a local multi-GPU Dask cluster
cluster = LocalCUDACluster()
client = Client(cluster)
def run_on_gpu(sessionId):
    handle = local_handle(sessionId)
    # Use handle with RAFT or cuML algorithms
    return "done"
# Initialize multi-GPU communication
comms = Comms(client=client)
comms.init()
# Submit work to each GPU worker
futures = [
    client.submit(run_on_gpu, comms.sessionId, workers=[w], pure=False)
    for w in comms.worker_addresses
]
# Wait for results
from dask.distributed import wait
wait(futures, timeout=60)
# Clean up
comms.destroy()
client.close()
cluster.close()
Comms parameters:
- comms_p2p (bool): Enable UCX peer-to-peer communication (default: False). Enable for algorithms that need direct GPU-to-GPU transfers.
- client: Dask distributed client
- verbose (bool): Enable verbose logging
- streams_per_handle (int): Number of CUDA streams per handle
RAFT's device_ndarray implements __cuda_array_interface__, enabling zero-copy sharing with other GPU libraries:
import cupy as cp
import numpy as np
import torch
from pylibraft.common import device_ndarray
# pylibraft -> CuPy (zero-copy)
raft_arr = device_ndarray(np.random.rand(100).astype(np.float32))
cupy_arr = cp.asarray(raft_arr)
# pylibraft -> PyTorch (zero-copy)
torch_tensor = torch.as_tensor(raft_arr, device='cuda')
# CuPy -> pylibraft (pass directly — RAFT APIs accept __cuda_array_interface__)
cupy_data = cp.random.rand(100, 50, dtype=cp.float32)
# Can pass cupy_data directly to pylibraft functions like eigsh()
# pylibraft -> NumPy (copy)
numpy_arr = raft_arr.copy_to_host()
RAFT functions accept any object implementing __cuda_array_interface__ as input — you don't need to convert to device_ndarray first. This means CuPy arrays, Numba device arrays, PyTorch CUDA tensors, and cuDF columns all work directly.
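To see what "implements __cuda_array_interface__" means in practice, the attribute is a plain dictionary with keys such as `shape`, `typestr`, and `data` (a device pointer plus a read-only flag), as defined by the CUDA Array Interface. The sketch below reads that dictionary without copying any data. Both function names are hypothetical helpers, not part of pylibraft.

```python
def describe_cuda_array(obj):
    """Return (shape, typestr, device_pointer) for a GPU array-like object.

    Hypothetical helper, not part of pylibraft. It reads the dictionary
    defined by the CUDA Array Interface instead of copying any data.
    """
    try:
        iface = obj.__cuda_array_interface__
    except AttributeError:
        raise TypeError(
            f"{type(obj).__name__} does not implement __cuda_array_interface__"
        ) from None
    pointer, _read_only = iface["data"]
    return tuple(iface["shape"]), iface["typestr"], pointer


def is_float32_or_float64(obj):
    """True when the array's typestr is float32 ('f4') or float64 ('f8')."""
    _, typestr, _ = describe_cuda_array(obj)
    # typestr looks like '<f4': byte order, then kind and item size.
    return typestr[1:] in ("f4", "f8")
```

Calling `describe_cuda_array` on a CuPy array and on the `device_ndarray` created from it is a quick way to check whether two objects share the same device pointer.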
Reuse DeviceResources. Creating a DeviceResources allocates CUDA library handles (cuBLAS, cuSOLVER). Create once, pass to all calls.
Batch your syncs. RAFT calls are asynchronous. Queue multiple operations before calling handle.sync() rather than syncing after each one.
Use float32. Peak float32 throughput is typically about 2x that of float64 on data-center GPUs and 32x or more on consumer GPUs. Only use float64 when precision demands it.
Pre-allocate outputs. Many RAFT functions accept an out parameter. Pre-allocating avoids repeated GPU memory allocation.
Keep data on GPU. RAFT interoperates with CuPy, cuDF, and cuML via __cuda_array_interface__. Pass GPU arrays directly between libraries instead of round-tripping through CPU.
Forgetting to sync. RAFT operations are asynchronous. Reading results without calling handle.sync() gives undefined/stale data. If you omit the handle parameter, RAFT syncs internally (safe but slower).
Using RAFT for vector search. Vector search (k-NN, IVFPQ, CAGRA, etc.) has been migrated to cuVS. RAFT no longer maintains these algorithms.
Wrong sparse format. eigsh() requires cupyx.scipy.sparse.csr_matrix. Other sparse formats (COO, CSC) must be converted first.
Non-symmetric matrix with eigsh. eigsh is for real symmetric / Hermitian matrices only. For general eigenvalue problems, you'll need a different solver.
dtype mismatch. RAFT functions are picky about dtypes. Use float32 or float64 explicitly — don't rely on implicit conversion.
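For the sparse-format pitfall, a small guard can convert COO or CSC input to CSR before it reaches eigsh(). This is a minimal sketch under stated assumptions: `ensure_csr` is a hypothetical helper, not part of pylibraft, and it relies on the `format` attribute and `tocsr()` method that cupyx.scipy.sparse matrices share with scipy.sparse.

```python
def ensure_csr(matrix):
    """Return `matrix` in CSR format, converting only when needed.

    Hypothetical helper, not part of pylibraft. It relies on the
    `format` attribute and `tocsr()` method that cupyx.scipy.sparse
    (and scipy.sparse) matrices expose.
    """
    if getattr(matrix, "format", None) == "csr":
        return matrix  # already CSR: no copy, no conversion
    return matrix.tocsr()
```

A call such as `eigsh(ensure_csr(A), k=6)` then accepts any of the three formats. The conversion allocates a new matrix on the GPU, so for repeated solves it is cheaper to convert once and keep the CSR copy.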