scientific-skills/optimize-for-gpu/references/cupy.md
CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing. It wraps NVIDIA's optimized libraries (cuBLAS, cuFFT, cuSOLVER, cuSPARSE, cuRAND) so standard array operations are already highly tuned. Most NumPy code works by simply changing the import.
Full documentation: https://docs.cupy.dev/en/stable/
Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.
uv add cupy-cuda12x # For CUDA 12.x (most common)
Verify:
import cupy as cp
print(cp.cuda.runtime.getDeviceCount()) # >= 1 means a GPU is available
cp.show_config() # Prints full environment info (returns None, so don't wrap it in print)
The fastest way to GPU-accelerate NumPy code: change the import.
# Before (CPU)
import numpy as np
a = np.random.rand(10_000_000)
b = np.fft.fft(a)
c = np.sort(b.real)
# After (GPU)
import cupy as cp
a = cp.random.rand(10_000_000)
b = cp.fft.fft(a)
c = cp.sort(b.real)
# NumPy → CuPy (CPU → GPU)
gpu_array = cp.asarray(numpy_array) # Copies host → device (no copy only if the input is already a CuPy array on the current device)
gpu_array = cp.array(numpy_array) # Always copies
# CuPy → NumPy (GPU → CPU)
cpu_array = cp.asnumpy(gpu_array) # Copy to CPU
cpu_array = gpu_array.get() # Same thing
def normalize(x):
xp = cp.get_array_module(x) # Returns cupy or numpy depending on input
return x / xp.linalg.norm(x)
# Works with both NumPy and CuPy arrays
normalize(numpy_array) # Runs on CPU
normalize(cupy_array) # Runs on GPU
CuPy arrays implement __array_ufunc__ and __array_function__, so NumPy functions can dispatch to CuPy automatically when given CuPy arrays (NumPy >= 1.17).
cupy.ndarray mirrors numpy.ndarray — same attributes (shape, dtype, ndim, size, strides, T), plus device (which GPU the array lives on).
Important: cupy.ndarray and numpy.ndarray are NOT implicitly convertible. Every conversion incurs a host-device data transfer.
cp.empty((1000, 1000), dtype=cp.float32)
cp.zeros((1000,), dtype=cp.float64)
cp.ones((512, 512), dtype=cp.float32)
cp.full((100,), fill_value=3.14, dtype=cp.float32)
cp.arange(0, 100, 0.1)
cp.linspace(0, 1, 1000)
cp.eye(100)
cp.random.rand(1000, 1000) # Uniform [0, 1)
cp.random.randn(1000, 1000) # Standard normal
cp.random.default_rng(42).normal(0, 1, 1000) # Generator API
CuPy's random functions support a dtype argument (float32/float64) — unlike NumPy, which always returns float64. Use dtype=cp.float32 when you don't need double precision.
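For example (both the legacy API and the Generator API accept dtype):

```python
import cupy as cp

a = cp.random.rand(1000, 1000, dtype=cp.float32)  # Uniform [0, 1), single precision
b = cp.random.default_rng(42).standard_normal(1000, dtype=cp.float32)  # Generator API
```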
CuPy implements most of NumPy and large parts of SciPy. All are GPU-accelerated.
**Math ufuncs:** sin, cos, tan, exp, log, log2, log10, sqrt, square, abs, power, add, subtract, multiply, divide, mod, clip, sign, ceil, floor, round, maximum, minimum
**Reductions:** sum, prod, mean, std, var, min, max, argmin, argmax, cumsum, cumprod, any, all, nansum, nanmean, nanstd, nanvar
**cupy.linalg** (powered by cuBLAS/cuSOLVER): dot, matmul, @ operator, tensordot, einsum, inner, outer, cholesky, qr, svd, eigh, eigvalsh, norm, solve, inv, pinv, lstsq, det, slogdet, matrix_rank, matrix_power
**cupy.fft** (powered by cuFFT): fft, ifft, fft2, ifft2, fftn, ifftn, rfft, irfft, rfft2, irfft2, rfftn, irfftn, fftfreq, rfftfreq, fftshift, ifftshift
**Sorting and searching:** sort, argsort, partition, argpartition, argmin, argmax, where, nonzero, unique, searchsorted
**Array manipulation:** reshape, ravel, flatten, transpose, swapaxes, concatenate, stack, vstack, hstack, dstack, split, hsplit, vsplit, tile, repeat, pad, flip, fliplr, flipud, roll, rot90, broadcast_to, expand_dims, squeeze
**cupyx.scipy.sparse** (powered by cuSPARSE): CSR, CSC, and COO formats; matrix-vector multiply, matrix-matrix multiply, and conversions between formats (see the sketch after this list).
**cupyx.scipy.signal:** convolution, correlation, filtering, window functions.
**cupyx.scipy.special:** Bessel functions, error functions, gamma functions, and more.
**Statistics:** mean, median, std, var, percentile, quantile, corrcoef, cov, histogram, bincount, digitize
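For instance, a minimal cupyx.scipy.sparse sketch (the matrix size and density here are illustrative):

```python
import cupy as cp
import cupyx.scipy.sparse as sp

A = sp.random(10_000, 10_000, density=0.001, format='csr', dtype=cp.float32)
x = cp.random.rand(10_000, dtype=cp.float32)
y = A @ x       # cuSPARSE-backed sparse matrix-vector product
B = A.tocoo()   # Format conversion
```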
When built-in operations aren't enough, CuPy offers several ways to write custom GPU code, ordered from simplest to most powerful.
CuPy handles indexing and broadcasting automatically. You just write the per-element logic in C++.
squared_diff = cp.ElementwiseKernel(
'float32 x, float32 y', # Input params
'float32 z', # Output params
'z = (x - y) * (x - y)', # Per-element operation (C++ code)
'squared_diff' # Kernel name
)
result = squared_diff(a, b) # Broadcasting works automatically
Type-generic kernels: Use single-letter type placeholders. Same letter = same type, resolved from arguments at call time.
generic_squared_diff = cp.ElementwiseKernel(
'T x, T y', 'T z',
'z = (x - y) * (x - y)',
'generic_squared_diff'
)
# Works with float32, float64, etc. — type inferred from inputs
Raw indexing: Prefix a parameter's type with raw to disable automatic indexing for that parameter. Inside the kernel, i is the loop index and _ind.size() the total number of elements.
# Access neighbors — raw disables auto-indexing so you can index manually
stencil = cp.ElementwiseKernel(
'raw T x', 'T y',
'y = (x[i > 0 ? i-1 : 0] + x[i] + x[i < _ind.size()-1 ? i+1 : _ind.size()-1]) / 3',
'stencil_1d'
)
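Because the only input is raw, CuPy cannot infer the loop size from the arguments — per the ElementwiseKernel docs, pass size= explicitly in that case:

```python
x = cp.arange(10, dtype=cp.float32)
y = stencil(x, size=x.size)  # size= required: raw args don't determine the loop length
```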
Four-part reduction: a map expression applied to each element, a binary reduce expression, a post-reduction map, and an identity value.
l2norm = cp.ReductionKernel(
'T x', # Input
'T y', # Output
'x * x', # Map: square each element
'a + b', # Reduce: sum pairs (a, b are the binary operands)
'y = sqrt(a)', # Post-map: sqrt of final sum
'0', # Identity element
'l2norm' # Kernel name
)
norm = l2norm(array) # Full reduction → scalar
norms = l2norm(matrix, axis=1) # Reduce along axis → vector
For complete control over grid, blocks, shared memory — write raw CUDA.
kernel_code = r'''
extern "C" __global__
void vector_add(const float* a, const float* b, float* c, int n) {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
if (tid < n) {
c[tid] = a[tid] + b[tid];
}
}
'''
vector_add = cp.RawKernel(kernel_code, 'vector_add')
n = 1_000_000
a = cp.random.rand(n, dtype=cp.float32)
b = cp.random.rand(n, dtype=cp.float32)
c = cp.zeros(n, dtype=cp.float32)
threads = 256
blocks = (n + threads - 1) // threads
vector_add((blocks,), (threads,), (a, b, c, cp.int32(n))) # (grid, block, args); pass scalars as typed NumPy scalars (cp.int32 is an alias of numpy.int32) so the size matches the kernel signature
Important RawKernel caveats:
- Views are not resolved: a transposed or sliced array (e.g. matrix.T) is passed as its underlying contiguous buffer. Handle strides yourself.
- Declare kernels extern "C" to prevent C++ name mangling.
- For complex dtypes, include <cupy/complex.cuh>.
- Compiled kernels are cached in ~/.cupy/kernel_cache.

CuPy dtype to CUDA type mapping:
| CuPy dtype | CUDA type |
|---|---|
| float16 | half |
| float32 | float |
| float64 | double |
| int32 | int |
| int64 | long long |
| complex64 | complex<float> |
| complex128 | complex<double> |
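A sketch of the complex-header caveat in practice (the kernel name and scaling factor are illustrative):

```python
complex_scale = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void complex_scale(const complex<float>* x, complex<float>* y, float s, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        y[tid] = x[tid] * s;  // complex<float> supports scalar arithmetic
    }
}
''', 'complex_scale')

x = cp.arange(4, dtype=cp.complex64)
y = cp.empty_like(x)
complex_scale((1,), (4,), (x, y, cp.float32(2.0), cp.int32(4)))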
For multi-kernel CUDA files or precompiled binaries:
module = cp.RawModule(code=cuda_source) # From source string
module = cp.RawModule(path='kernels.cu') # From file
module = cp.RawModule(path='kernels.cubin') # From precompiled
kernel = module.get_function('my_kernel')
kernel((blocks,), (threads,), (args...))
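A minimal sketch of a multi-kernel module (the kernel names 'scale' and 'shift' are hypothetical):

```python
source = r'''
extern "C" {
__global__ void scale(float* x, float s, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void shift(float* x, float c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] += c;
}
}
'''
module = cp.RawModule(code=source)
scale = module.get_function('scale')
shift = module.get_function('shift')

x = cp.ones(1024, dtype=cp.float32)
scale((4,), (256,), (x, cp.float32(3.0), cp.int32(1024)))
shift((4,), (256,), (x, cp.float32(1.0), cp.int32(1024)))
```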
Write CUDA-style kernels using Python syntax instead of C++.
import cupyx.jit

@cupyx.jit.rawkernel()
def my_kernel(x, y, size):
tid = cupyx.jit.grid(1)
if tid < size:
y[tid] = x[tid] * 2.0
my_kernel[blocks, threads](x, y, n)
Available JIT primitives:
Available JIT primitives:
- Thread/block indexing: cupyx.jit.threadIdx, blockIdx, blockDim, gridDim
- Flat index and total thread count: cupyx.jit.grid(ndim), cupyx.jit.gridsize(ndim)
- Synchronization: cupyx.jit.syncthreads(), cupyx.jit.syncwarp()
- Shared memory: cupyx.jit.shared_memory(dtype, size)
- Atomics: cupyx.jit.atomic_add/min/max/and/or/xor(array, index, value)
- Warp shuffles: shfl_sync, shfl_up_sync, shfl_down_sync, shfl_xor_sync

Limitation: does not work in the Python REPL (the compiler needs access to the function's source code). Use it from .py files.
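A sketch of a grid-stride loop built from these primitives (SAXPY; sizes and launch configuration are illustrative):

```python
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def saxpy(a, x, y, out, size):
    tid = jit.grid(1)         # flat global thread index
    stride = jit.gridsize(1)  # total number of threads in the grid
    for i in range(tid, size, stride):
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)
saxpy((128,), (256,), (cp.float32(2.0), x, y, out, cp.uint32(n)))
```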
Combine multiple element-wise operations into a single kernel launch — eliminates intermediate arrays and reduces kernel launch overhead.
@cp.fuse()
def fused_op(x, y):
return cp.sqrt((x - y) ** 2 + 1.0)
# This compiles into ONE kernel instead of multiple
result = fused_op(a, b)
Limitation: Only fuses elementwise and simple reduction operations. Does not support matmul, reshape, indexing, etc.
CuPy uses memory pools by default — this is critical for performance. The pool caches freed GPU memory for reuse, avoiding expensive cudaMalloc/cudaFree calls and implicit synchronization.
Key insight: Memory is NOT freed to the OS when arrays go out of scope — it's returned to the pool. This is expected behavior (shows up in nvidia-smi as still-allocated).
mempool = cp.get_default_memory_pool()
mempool.used_bytes() # Currently allocated by CuPy arrays
mempool.total_bytes() # Total held by pool (including free blocks)
mempool.free_all_blocks() # Release all unused memory back to OS
pinned_mempool = cp.get_default_pinned_memory_pool()
pinned_mempool.free_all_blocks()
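A sketch showing the pool in action:

```python
mempool = cp.get_default_memory_pool()

a = cp.zeros((1024, 1024), dtype=cp.float32)  # 4 MiB allocation
print(mempool.used_bytes())   # ~4 MiB in use
del a                         # freed to the pool, NOT to the OS
print(mempool.used_bytes())   # 0 — nothing in use...
print(mempool.total_bytes())  # ...but ~4 MiB still cached (what nvidia-smi sees)
mempool.free_all_blocks()     # now actually released to the OS
```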
mempool = cp.get_default_memory_pool()
with cp.cuda.Device(0):
mempool.set_limit(size=4 * 1024**3) # 4 GiB limit for GPU 0
Or via environment variable (set before import cupy):
export CUPY_GPU_MEMORY_LIMIT="50%" # Percentage of total GPU memory
export CUPY_GPU_MEMORY_LIMIT="4294967296" # Bytes
Managed (unified) memory: data auto-migrates between CPU and GPU on demand. Useful when data doesn't fit in GPU memory.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)
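A sketch of oversubscription with the managed allocator above (sizes are illustrative; paging across PCIe is slow, so treat this as a fallback, not a fast path):

```python
# With malloc_managed installed as the allocator, allocations may exceed device RAM
big = cp.zeros(3 * 1024**3 // 4, dtype=cp.float32)  # ~3 GiB of float32
big += 1.0        # pages migrate to the GPU as they are touched
print(big.sum())  # works even if the array never fit in device memory at once
```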
# High-level API for pinned (page-locked) host memory
pinned_array = cupyx.empty_pinned((1000,), dtype=np.float32)
pinned_array = cupyx.zeros_pinned((1000,), dtype=np.float32)
# These are NumPy arrays backed by page-locked memory — transfers to GPU are faster
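A sketch of an asynchronous host-to-device copy, which is only truly asynchronous from a pinned source buffer (stream and sizes are illustrative):

```python
import numpy as np
import cupy as cp
import cupyx

h = cupyx.zeros_pinned((1_000_000,), dtype=np.float32)  # page-locked host buffer
h[:] = 1.0
stream = cp.cuda.Stream()
d = cp.empty(h.shape, dtype=h.dtype)
d.set(h, stream=stream)  # async H→D copy on this stream
stream.synchronize()     # wait for the transfer before relying on d
```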
cp.cuda.set_allocator(None) # Disable device pool
cp.cuda.set_pinned_memory_allocator(None) # Disable pinned pool
Must be done before any CuPy operations.
When using CuPy alongside cuDF/RAPIDS, align on a single allocator:
import rmm
rmm.reinitialize(pool_allocator=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
Streams enable overlapping computation with data transfer and running multiple operations concurrently.
stream = cp.cuda.Stream()
# Context manager style
with stream:
d_data = cp.asarray(host_data) # H→D transfer on this stream
result = cp.sum(d_data) # Kernel on this stream
# Operations enqueued but may not be complete here
stream.synchronize() # Wait for all operations on this stream
s1 = cp.cuda.Stream()
s2 = cp.cuda.Stream()
with s1:
d_a = cp.asarray(data_a)
result_a = cp.fft.fft(d_a)
with s2:
d_b = cp.asarray(data_b) # Overlaps with s1's FFT
result_b = cp.fft.fft(d_b)
cp.cuda.Device().synchronize() # Wait for all streams
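Streams can also be ordered against each other with events — a sketch of a cross-stream dependency that avoids a full device sync:

```python
s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()
done = cp.cuda.Event()
with s1:
    a = cp.random.rand(1_000_000)
    done.record()        # marks this point in s1's work
s2.wait_event(done)      # s2 will not run past here until `done` fires
with s2:
    b = a * 2            # safely ordered after s1 produced `a`
```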
start = cp.cuda.Event()
end = cp.cuda.Event()
start.record()
# ... GPU operations ...
end.record()
end.synchronize()
elapsed_ms = cp.cuda.get_elapsed_time(start, end)
export CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1
Enables per-thread default streams for better concurrency in multi-threaded applications.
# Set current device
cp.cuda.Device(0).use()
# Context manager
with cp.cuda.Device(1):
x = cp.array([1, 2, 3]) # Allocated on GPU 1
# Check which device an array is on
print(x.device) # Device 1
Cross-device operations may work via P2P (peer-to-peer) memory access if the GPU topology supports it. Use cp.asarray() to explicitly transfer arrays between devices.
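A sketch of an explicit device-to-device transfer (assumes at least two GPUs):

```python
with cp.cuda.Device(0):
    x = cp.random.rand(1000)  # lives on GPU 0
with cp.cuda.Device(1):
    y = cp.asarray(x)         # copied to GPU 1 (the current device)
    z = cp.fft.fft(y)         # computed on GPU 1
```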
mempool = cp.get_default_memory_pool()
with cp.cuda.Device(0):
mempool.set_limit(size=4 * 1024**3)
with cp.cuda.Device(1):
mempool.set_limit(size=4 * 1024**3)
Never use time.perf_counter() or %timeit for GPU code — they measure only CPU time, not GPU execution time. CuPy operations are asynchronous.
from cupyx.profiler import benchmark
result = benchmark(my_function, (arg1, arg2), n_repeat=100, n_warmup=10)
print(result) # Shows CPU and GPU elapsed times with statistics
In IPython/Jupyter, use the same benchmark helper. A bare %timeit measures only the time to enqueue work unless every iteration explicitly synchronizes (e.g. ends with cp.cuda.Device().synchronize()).
Compiled kernels are cached in ~/.cupy/kernel_cache — persist this directory across CI/CD runs to avoid recompiling on every run.
# CuPy v11+ uses CUB by default
export CUPY_ACCELERATORS=cub # CUB only (default)
export CUPY_ACCELERATORS=cub,cutensor # Both (requires cuTENSOR installed)
CUB accelerates: reductions (sum, prod, amin, amax, argmin, argmax), inclusive scans (cumsum), histograms, sparse matrix-vector multiply, and ReductionKernel. Can provide ~100x speedup for reductions.
cuTENSOR accelerates: binary elementwise ufuncs, reduction, tensor contraction.
Prefer float32 over float64. Consumer GPUs have 2x-32x higher float32 throughput. Use dtype=cp.float32 when precision allows.
Minimize CPU-GPU transfers. Every cp.asnumpy() / .get() triggers synchronization and PCI-e transfer. Keep data on GPU as long as possible.
Use kernel fusion. @cp.fuse() combines multiple elementwise operations into one kernel, eliminating intermediate arrays.
Batch operations. Fewer large operations beat many small ones (kernel launch overhead ~5-20us each).
Pre-allocate output arrays. Use out= parameter in ufuncs to avoid repeated allocation:
cp.add(a, b, out=result) # Writes into existing array
Use in-place operations. a += b avoids allocating a new array.
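A sketch combining pre-allocation and in-place updates in a hot loop (shapes and iteration count are illustrative):

```python
x = cp.random.rand(10_000_000, dtype=cp.float32)
buf = cp.empty_like(x)             # allocated once, outside the loop
for _ in range(100):
    cp.multiply(x, 0.99, out=buf)  # writes into buf — no new allocation
    x += buf                       # in-place add — no new allocation
```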
Use streams to overlap computation and data transfer.
Profile with NVTX markers for Nsight Systems analysis:
with cupyx.profiler.time_range('my_operation', color_id=0):
result = heavy_computation()
Escalation path for custom GPU code: start with @cp.fuse(), move to ElementwiseKernel / ReductionKernel when fusion can't express the operation, then RawKernel or cupyx.jit.rawkernel, and finally RawModule for multi-kernel sources.

CuPy interoperates with other GPU libraries via the CUDA Array Interface and the DLPack protocol — both enable zero-copy data sharing.
# NumPy functions auto-dispatch to CuPy (NumPy >= 1.17)
import numpy as np
result = np.sum(cupy_array) # Dispatches to CuPy, returns CuPy array
from numba import cuda
@cuda.jit
def numba_kernel(x, y):
i = cuda.grid(1)
if i < x.shape[0]:
y[i] = x[i] * 2
# CuPy arrays pass directly to Numba kernels — zero copy
a = cp.arange(1000, dtype=cp.float32)
b = cp.zeros_like(a)
numba_kernel[4, 256](a, b)
import torch
# CuPy → PyTorch (zero copy via CUDA Array Interface)
cupy_array = cp.array([1.0, 2.0, 3.0], dtype=cp.float32)
torch_tensor = torch.as_tensor(cupy_array, device='cuda')
# PyTorch → CuPy (zero copy)
cupy_array = cp.asarray(torch_tensor)
# Via DLPack (also zero copy)
cupy_array = cp.from_dlpack(torch_tensor)
torch_tensor = torch.from_dlpack(cupy_array)
import cudf
# cuDF → CuPy
arr = df.to_cupy()
arr = cp.asarray(df['column'])
# CuPy → cuDF
df = cudf.DataFrame(cupy_array)
s = cudf.Series(cupy_array)
# Export pointer
ptr = cupy_array.data.ptr # Raw device pointer as int
# Import foreign pointer
mem = cp.cuda.UnownedMemory(ptr, size_bytes, owner=owner_obj)
memptr = cp.cuda.MemoryPointer(mem, offset=0)
arr = cp.ndarray(shape, dtype, memptr=memptr)
These are the behavioral differences that can cause bugs if you're not aware of them.
Reductions return 0-d arrays, not scalars. cp.sum(a) returns a 0-d cupy.ndarray, not a Python float. This avoids implicit GPU-CPU synchronization. Use .item() if you need a scalar (see the sketch after this list).
Out-of-bounds indexing wraps silently. NumPy raises IndexError; CuPy wraps around without error.
Duplicate indices in assignment are undefined. a[[0, 0]] = [1, 2] — NumPy stores the last value; CuPy stores an undefined value (GPU race condition).
Float-to-integer casts differ at edges. Casting negative float to unsigned int or infinity to int gives different results than NumPy.
No string/object dtypes. CuPy only supports numeric types. No structured arrays with string fields.
CuPy ufuncs require CuPy arrays. Unlike NumPy ufuncs, CuPy ufuncs don't accept lists or NumPy arrays — convert first.
Random seed arrays are hashed. Array seeds produce less entropy than NumPy's approach.
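A sketch of the scalar-reduction and out-of-bounds pitfalls above:

```python
x = cp.arange(3)
x[[1, 3]] = 10    # index 3 wraps to 0 — NumPy would raise IndexError
print(x)          # [10 10  2]

s = cp.sum(x)     # 0-d cupy.ndarray, still on the GPU
total = s.item()  # explicit sync + copy → Python int
```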
Measuring with CPU timers. GPU operations are async. time.perf_counter() measures only the time to enqueue operations, not execute them. Always use cupyx.profiler.benchmark().
Unnecessary round-trips. Every cp.asnumpy() / .get() syncs the GPU and copies data across PCI-e. Restructure code to keep data on GPU.
"Memory leak" from pools. The memory pool caches freed blocks. nvidia-smi shows them as allocated. Use mempool.free_all_blocks() to release.
First-call latency. CUDA context init + kernel JIT compilation. Warm up before benchmarking.
Mixing devices. Using an array from GPU 0 on GPU 1 without explicit transfer can fail or be slow.
RawKernel ignoring views. Transposed or sliced arrays passed to RawKernel are treated as the original contiguous layout. You must handle strides manually.
Forgetting synchronize() before reading results. If you pass data back to CPU or use it in non-CuPy code, ensure the GPU is done first.
| Variable | Purpose |
|---|---|
| CUPY_ACCELERATORS | Backend list: cub, cutensor (default: cub for v11+) |
| CUPY_CACHE_DIR | Kernel cache directory (default: ~/.cupy/kernel_cache) |
| CUPY_GPU_MEMORY_LIMIT | GPU memory limit (bytes or "50%") |
| CUPY_CACHE_SAVE_CUDA_SOURCE | Set to 1 to dump kernel source for profiling |
| CUPY_CUDA_PER_THREAD_DEFAULT_STREAM | Set to 1 for per-thread default streams |