python/1_GettingStarted/blurImageUnifiedMemory/README.md
Blur images on GPU using modern cuda.core APIs for kernel compilation, execution, and memory management. This sample demonstrates zero-copy data sharing between CPU and GPU using unified (managed) memory.
cuda.core.Programcuda.core.launch and LaunchConfigcuda.core.ManagedMemoryResourcenp.from_dlpack()# Compile CUDA C++ kernel at runtime
program = Program(KERNEL_CODE, code_type="c++", options=options)
compiled = program.compile(target_type="cubin")
kernel = compiled.get_kernel("box_blur_3x3")
# Configure and launch kernel
config = LaunchConfig(grid=grid_size, block=block_size)
# Buffers can be passed directly as kernel arguments
launch(stream, config, kernel, src_buf, dst_buf, H, W)
This sample uses ManagedMemoryResource for simplicity: a single allocation is accessible from both CPU and GPU without explicit transfers. For performance-critical workloads, consider LegacyPinnedMemoryResource + DeviceMemoryResource instead, which gives explicit control over host/device placement and transfer costs.
Unified memory is accessible from both CPU and GPU without explicit data transfers:
# Allocate unified memory
options = ManagedMemoryResourceOptions(preferred_location=device.device_id)
mr = ManagedMemoryResource(options)
src_buf = mr.allocate(n_bytes, stream)
dst_buf = mr.allocate(n_bytes, stream)
try:
# Synchronize to ensure allocations are complete before CPU access
stream.sync()
# Create numpy views of unified memory using DLPack protocol (zero-copy)
src_np = np.from_dlpack(src_buf).view(np.float32).reshape(H, W)
dst_np = np.from_dlpack(dst_buf).view(np.float32).reshape(H, W)
# CPU writes directly to unified memory
src_np[:] = input_data
# Launch kernel - buffers can be passed directly as arguments
launch(stream, config, kernel, src_buf, dst_buf, H, W)
stream.sync()
# Return zero-copy view; caller must close buffers when done
return dst_np, src_buf, dst_buf
except Exception:
src_buf.close()
dst_buf.close()
raise
When returning a zero-copy view, the caller must close the buffers after use (e.g., in a try/finally block) to avoid leaking managed memory.
cuda.core:Device - CUDA device managementProgram - Runtime kernel compilation (NVRTC)ProgramOptions - Compilation options (architecture target)LaunchConfig - Kernel launch configuration (grid/block dimensions)launch - Execute compiled kernelManagedMemoryResource - Unified memory allocationnp.from_dlpack(buffer) - Create numpy view of unified memory using DLPack protocolbuffer directly to launch() as kernel arguments(view, src_buf, dst_buf) and have the caller close buffers in try/finally after usecuda-python package (13.0.0+)cuda-core package (>=1.0.0)numpy package (>=2.3.2)pillow package (10.0.0+)This sample relies on ManagedMemoryResource with concurrent host access
to managed allocations while GPU kernels are in flight. That behavior
requires the device property concurrent_managed_access=True, which is only
supported on Linux with HMM (Pascal and newer). On Windows (WDDM/MCDM/TCC)
the property is False, so the sample exits early with a waive message and
exit code 2 instead of attempting a run that would crash the process.
cd /path/to/cuda-samples/python/1_GettingStarted/blurImageUnifiedMemory
pip install -r requirements.txt
python blurImageUnifiedMemory.py
============================================================
Image Blur with Unified Memory (cuda.core)
============================================================
Device: <Your GPU Name>
Compute Capability: sm_<XY>
Compiling CUDA kernel with cuda.core.Program...
Compiled for architecture: sm_<XY>
Image size: 256x256 grayscale
Creating sample image...
Blurring image on GPU...
Saving results...
Saved: original_image.png
Saved: blurred_image.png
Verifying result...
Test PASSED
Max difference from original: <value>
original_image.png - Test pattern image before blurblurred_image.png - Image after 3x3 box blurblurImageUnifiedMemory.py - Python implementation using cuda.coreREADME.md - This filerequirements.txt - Sample dependencies