python/2_CoreConcepts/simpleZeroCopy/README.md
This sample demonstrates zero-copy access using cuda.core to compile and launch a kernel, and cuda.bindings.runtime for mapped pinned host memory (cudaHostAlloc with cudaHostAllocMapped, cudaHostGetDevicePointer, and cudaFreeHost). The GPU loads and stores through device addresses that refer to that host memory—no cudaMemcpy in or out. The example is vector add with inputs and output as NumPy views of the host side of those buffers.
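As a rough illustration of the kernel side, the sketch below holds an assumed CUDA C++ source string of the kind that would be handed to cuda.core's Program. Only the kernel name `vectorAddGPU` appears in the sample's log; the parameter names, the bounds check, and the 256-thread block size are assumptions made for this example. The `grid_size` helper is hypothetical and shows the usual ceiling-division grid sizing.

```python
# Illustrative only: an assumed shape for the vector-add kernel source.
# The name vectorAddGPU comes from the sample's log; everything else is hypothetical.
KERNEL_SOURCE = r"""
extern "C" __global__
void vectorAddGPU(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // a, b and c are device addresses that alias mapped pinned host memory,
        // so these loads and the store reach host RAM instead of device memory.
        c[idx] = a[idx] + b[idx];
    }
}
"""


def grid_size(num_elements: int, block_size: int = 256) -> int:
    """Blocks needed so every element gets one thread (ceiling division)."""
    if num_elements <= 0 or block_size <= 0:
        raise ValueError("num_elements and block_size must be positive")
    return (num_elements + block_size - 1) // block_size
```

With the default of 1,048,576 elements and 256 threads per block this gives 4,096 blocks; the `idx < n` guard covers sizes that are not a multiple of the block size.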
- cudaHostAlloc (via cuda.bindings.runtime) so the GPU can use cudaHostGetDevicePointer addresses in a kernel
- cuda.core.PinnedMemoryResource differs (staging/copies; not guaranteed to be cudaHostAllocMapped for direct kernel access)
- ctypes and numpy.frombuffer
- cuda.core’s Program and launch, passing device pointers for mapped buffers

- numpy – CPU arrays and reference computation
- cuda-core – Device, stream, Program, LaunchConfig, launch
- cuda-python (cuda.bindings.runtime) – cudaHostAlloc / cudaHostGetDevicePointer / cudaFreeHost for mapped host memory

- From cuda.core: Device, device.create_stream(), Program, ProgramOptions, LaunchConfig, launch
- From cuda.bindings.runtime: cudaHostAlloc (with cudaHostAllocMapped | cudaHostAllocPortable), cudaHostGetDevicePointer, cudaFreeHost
- From the standard library: ctypes – wrap host pointers for numpy.frombuffer float32 views
- Memory management: Free host memory with cudaFreeHost in a finally block; call stream.close() when done.
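The allocate, map, wrap, and free steps listed above can be sketched as follows. This is a minimal sketch, not the sample's actual code: the helper names `mapped_host_buffer` and `float32_view` are hypothetical, and the runtime module is passed in as `rt` (a choice made for this example) so the cleanup path can be exercised without a GPU. It assumes the cuda.bindings.runtime calls return a tuple whose first element is the error code.

```python
import ctypes
from contextlib import contextmanager


def _check(rt, err, what):
    """Raise if a runtime call did not return cudaSuccess."""
    if err != rt.cudaError_t.cudaSuccess:
        raise RuntimeError(f"{what} failed: {err}")


@contextmanager
def mapped_host_buffer(rt, nbytes):
    """Yield (host_ptr, device_ptr) for one mapped pinned allocation.

    `rt` is expected to be the cuda.bindings.runtime module (assumption:
    each call returns a tuple that starts with the error code).
    """
    flags = rt.cudaHostAllocMapped | rt.cudaHostAllocPortable
    err, host_ptr = rt.cudaHostAlloc(nbytes, flags)
    _check(rt, err, "cudaHostAlloc")
    try:
        err, device_ptr = rt.cudaHostGetDevicePointer(host_ptr, 0)
        _check(rt, err, "cudaHostGetDevicePointer")
        yield int(host_ptr), int(device_ptr)
    finally:
        # Runs on success and on error, so the pinned pages are always released.
        rt.cudaFreeHost(host_ptr)


def float32_view(host_ptr, num_elements):
    """Wrap a raw host address as a ctypes float array without copying."""
    return (ctypes.c_float * num_elements).from_address(host_ptr)
```

With the real module this might be used as `from cuda.bindings import runtime as rt`, then `with mapped_host_buffer(rt, n * 4) as (h_ptr, d_ptr):` and, inside the block, `numpy.frombuffer(float32_view(h_ptr, n), dtype=numpy.float32)` for the host-side view, while `d_ptr` is the address passed to the kernel. The view must not be used after the block exits, because the memory behind it has been freed.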
- cudaMemcpy calls

- cuda-python / cuda-core wheels.
- pip install -r requirements.txt (NumPy, cuda-python, cuda-core). A system CUDA Toolkit is not strictly required if the process can load the driver/runtime; set LD_LIBRARY_PATH as shown in How to run if you hit missing-library errors.

Install packages:
pip install -r requirements.txt
Or manually:
pip install "numpy>=2.3.2" "cuda-core>=1.0.0" "cuda-python>=13.0.0"
Basic usage:
# Pre-steps: Set library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Run with default parameters (1M elements)
python simpleZeroCopy.py
With custom parameters:
# Use 2M elements
python simpleZeroCopy.py --num_elements 2097152
# Show help
python simpleZeroCopy.py --help
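A command line like the one above could be parsed along these lines. This is only a sketch of how the `--num_elements` flag might be wired up with argparse; the `parse_args` helper and the positive-value check are assumptions, not taken from simpleZeroCopy.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the sample's single option; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="simpleZeroCopy - CUDA Python Sample")
    parser.add_argument(
        "--num_elements",
        type=int,
        default=1048576,
        help="Number of elements in vectors (default: 1048576)",
    )
    args = parser.parse_args(argv)
    # Assumed sanity check: a zero or negative size cannot be allocated.
    if args.num_elements <= 0:
        parser.error("--num_elements must be a positive integer")
    return args
```

Calling `parse_args(["--num_elements", "2097152"])` corresponds to the 2M-element invocation; `--help` is provided by argparse automatically.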
- --num_elements: Number of elements in vectors (default: 1048576)

- num_elements * 4 bytes (float32)

Device name and compute capability depend on your system; the rest of the log should match this shape when validation passes.
======================================================================
simpleZeroCopy - CUDA Python Sample
======================================================================
Device Information:
Name: <your GPU>
Compute Capability: <major>.<minor>
> Memory: mapped pinned host (cudaHostAlloc + cudaHostGetDevicePointer)
Compiling CUDA kernel...
Kernel compiled successfully
Allocating memory:
Vector size: 1,048,576 elements
Memory per vector: 4.00 MB
Total memory: 12.00 MB
> Allocating mapped pinned host memory...
Mapped host memory allocated successfully
> Initializing vectors on host...
> Computing reference result on CPU...
> Launching vectorAddGPU kernel...
Note: GPU accesses host memory directly (zero-copy)
Kernel execution complete
> Checking results from vectorAddGPU()...
Comparing 1,048,576 elements...
Relative error: 0.000000e+00
Validation PASSED
======================================================================
simpleZeroCopy completed successfully!
======================================================================
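The memory figures in the log follow from the element count: each float32 takes 4 bytes and there are three vectors (two inputs, one output). The sketch below reproduces that arithmetic; the `memory_report` helper is hypothetical, and the 1024 * 1024 divisor is inferred from the log (1,048,576 elements printing as 4.00 MB) rather than taken from the sample's source.

```python
FLOAT32_BYTES = 4          # size of one float32 element
NUM_VECTORS = 3            # two input vectors plus one output vector
BYTES_PER_MB = 1024 * 1024  # inferred from the log: 4,194,304 bytes prints as 4.00 MB


def memory_report(num_elements):
    """Return the three size lines in the same shape as the sample's log."""
    per_vector = num_elements * FLOAT32_BYTES
    total = per_vector * NUM_VECTORS
    return (
        f"Vector size: {num_elements:,} elements",
        f"Memory per vector: {per_vector / BYTES_PER_MB:.2f} MB",
        f"Total memory: {total / BYTES_PER_MB:.2f} MB",
    )
```

For the 2M-element run shown earlier the same arithmetic gives 8.00 MB per vector and 24.00 MB in total, all of it pinned host memory.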
- simpleZeroCopy.py – Main Python implementation
- README.md – This file
- requirements.txt – Python package dependencies