python/4_DistributedComputing/simpleP2P/README.md
This sample demonstrates peer-to-peer (P2P) memory access between multiple GPUs in CUDA using the cuda.core Python library. P2P allows GPUs to directly access each other's memory without routing data through the host (CPU), enabling efficient multi-GPU applications. This sample detects P2P-capable GPUs, enables peer access, measures bandwidth using CUDA events for accurate GPU-side timing, and launches kernels (using grid-stride loops) that read from one GPU's memory and write to another GPU's memory.
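The grid-stride loop mentioned above can be illustrated without a GPU. The following is a minimal sketch, not code from `simpleP2P.py`: the kernel string, the name `copy_kernel`, and the plain element copy are assumptions made for illustration, and the Python function only emulates how threads would divide the index range.

```python
# Hypothetical CUDA C++ source, shown only for comparison. It is not compiled
# here and is not taken from simpleP2P.py.
KERNEL_SOURCE = r"""
extern "C" __global__ void copy_kernel(const float* src, float* dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        dst[i] = src[i];
    }
}
"""


def emulate_grid_stride_copy(src, grid_dim, block_dim):
    """Emulate on the CPU how a grid-stride loop assigns indices to threads.

    Returns the copied list and a per-element visit count.
    """
    n = len(src)
    stride = grid_dim * block_dim          # total number of threads in the grid
    dst = [None] * n
    visits = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            first = block_idx * block_dim + thread_idx
            # Each thread starts at its global index and advances by the grid size.
            for i in range(first, n, stride):
                dst[i] = src[i]
                visits[i] += 1
    return dst, visits


src = [float(i) for i in range(10)]
dst, visits = emulate_grid_stride_copy(src, grid_dim=2, block_dim=3)
print(dst == src, set(visits))
```

Because every thread advances by the total thread count, a grid smaller than the array still covers each element exactly once, which is why the launch configuration does not have to match the array size.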
- `system.get_num_devices()` and `Device(id)`
- `device.can_access_peer()`
- `DeviceMemoryResource.peer_accessible_by`
- `DeviceMemoryResource`
- `Program` and `launch` APIs with grid-stride loops

- numpy - CPU array operations and data initialization
- cuda-core - Modern Python interface to CUDA runtime with full P2P support

From cuda.core:

- `system` – Pre-instantiated singleton for system-level CUDA information
- `system.get_num_devices()` – Get number of CUDA-capable devices
- `Device(id)` – Get specific CUDA device handle
- `device.can_access_peer(peer)` – Check if this device can access peer device memory
- `device.set_current()` – Set active device for subsequent operations
- `device.create_stream()` – Create CUDA stream for kernel execution
- `DeviceMemoryResource(device)` – Create memory resource for specific GPU
- `memory_resource.peer_accessible_by` – Get/set which devices can access this memory pool's allocations
  - `mr.peer_accessible_by = [1]` grants device 1 access
  - `mr.peer_accessible_by = []` revokes all access
- `PinnedMemoryResource()` – Allocate pinned (page-locked) host memory
- `EventOptions(enable_timing=True)` – Create options for CUDA events with timing enabled
- `stream.record(options=event_options)` – Record a CUDA event on a stream
- `event.elapsed_time(start_event)` – Get elapsed time in milliseconds between two events
- `stream.wait_event(event)` – Make a stream wait for an event to complete
- `stream.close()` – Clean up stream resources
- `Program()` – Compile CUDA C++ kernel code
- `LaunchConfig()` – Configure kernel launch parameters (grid, block)
- `launch()` – Launch compiled kernel with arguments
- `buffer.copy_from(src, stream=stream)` – Copy data from source buffer asynchronously
- `buffer.copy_to(dst, stream=stream)` – Copy data to destination buffer asynchronously

From DLPack:

- `numpy.from_dlpack()` – Create NumPy array view of memory buffer

Memory Management:

- `stream.close()` in `finally` blocks
- `can_access_peer()`

Note: This sample will gracefully exit if fewer than 2 GPUs are detected or if P2P is not supported between any GPU pair.
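The graceful-exit behaviour described in the note can be sketched as a small pair-selection routine. This is an illustrative sketch, not the sample's code: `find_p2p_pair` and the `can_access_peer` callable are hypothetical names, and requiring access in both directions is an assumption based on the two-way checks in the sample output. On a real system the callable could wrap the `device.can_access_peer(peer)` call listed above.

```python
from itertools import combinations


def find_p2p_pair(num_devices, can_access_peer):
    """Return the first (i, j) pair with peer access in both directions, or None.

    `can_access_peer(a, b)` is a hypothetical callable answering whether
    device `a` can access memory on device `b`.
    """
    if num_devices < 2:
        return None  # P2P needs at least two GPUs
    for i, j in combinations(range(num_devices), 2):
        if can_access_peer(i, j) and can_access_peer(j, i):
            return (i, j)
    return None


# Fake three-GPU topology for illustration: only GPU1 and GPU2 are peers.
fake_topology = {(1, 2), (2, 1)}


def fake_can_access(a, b):
    return (a, b) in fake_topology


pair = find_p2p_pair(3, fake_can_access)
if pair is None:
    print("No P2P-capable GPU pair found, exiting.")
else:
    print(f"Using GPU{pair[0]} and GPU{pair[1]}")
```

Returning `None` instead of raising lets the caller print a message and exit with a success status, which suits a sample that is expected to run on single-GPU machines too.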
Install packages:
pip install -r requirements.txt
Or manually:
pip install "numpy>=2.3.2" "cuda-core>=1.0.0" "cuda-python>=13.0.0"
Basic usage:
# Run with default parameters (16M elements = 64MB)
python simpleP2P.py
With custom parameters:
# Use 32M elements (128MB)
python simpleP2P.py --num_elements 33554432
# Show help
python simpleP2P.py --help
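The element counts in the commands above map to buffer sizes through the 4-byte width of float32. The sketch below shows that arithmetic together with a plausible `argparse` setup. The function names and help text are assumptions, not taken from `simpleP2P.py`; only the `--num_elements` flag and its default come from this README.

```python
import argparse

BYTES_PER_FLOAT32 = 4


def buffer_size_mib(num_elements):
    """Size of one float32 buffer in MiB (the README writes this as MB)."""
    return num_elements * BYTES_PER_FLOAT32 / (1024 * 1024)


def parse_args(argv=None):
    """Hypothetical parser mirroring the documented --num_elements option."""
    parser = argparse.ArgumentParser(description="simpleP2P argument sketch")
    parser.add_argument(
        "--num_elements",
        type=int,
        default=16777216,
        help="Number of float32 elements per buffer",
    )
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args(["--num_elements", "33554432"])
print(f"{args.num_elements:,} elements -> {buffer_size_mib(args.num_elements):.0f} MiB per buffer")
# prints: 33,554,432 elements -> 128 MiB per buffer
```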
--num_elements: Number of elements in arrays (default: 16777216)
Memory per buffer: num_elements * 4 bytes (float32)

======================================================================
simpleP2P - CUDA Python Sample
======================================================================
Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla T10 (GPU0) -> Tesla T10 (GPU1): Yes
> Peer access from Tesla T10 (GPU1) -> Tesla T10 (GPU0): Yes
Using GPU0 (Tesla T10) and GPU1 (Tesla T10)
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Peer access enabled: GPU0 <-> GPU1
Peer access status: MR0 accessible by (1,), MR1 accessible by (0,)
Memory allocated successfully
Measuring P2P bandwidth...
Performing 100 ping-pong copies between GPUs...
P2P bandwidth: 12.37 GB/s
Preparing host buffer and memcpy to GPU0...
Data initialized and copied to GPU
Compiling CUDA kernel...
Kernels compiled successfully
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Kernel execution complete
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Kernel execution complete
Copy data back to host from GPU0 and verify results...
Checking results...
Comparing 16,777,216 elements...
Test PASSED
[PASS] Validation PASSED
Disabling peer access...
Peer access revoked: MR0 accessible by (), MR1 accessible by ()
======================================================================
simpleP2P completed successfully!
======================================================================
Shutting down...
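The bandwidth figure in the output above comes from timing a batch of copies with CUDA events. A minimal sketch of the arithmetic follows, assuming the elapsed time is in milliseconds (as `event.elapsed_time()` reports) and that GB means 10^9 bytes. The unit convention used by the sample itself is not stated here, and the 500 ms timing is a made-up input, not a measurement.

```python
def p2p_bandwidth_gb_per_s(bytes_per_copy, num_copies, elapsed_ms):
    """Convert a timed batch of copies into GB/s (1 GB = 1e9 bytes, an assumption)."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    total_bytes = bytes_per_copy * num_copies
    seconds = elapsed_ms / 1000.0
    return total_bytes / seconds / 1e9


# 100 copies of a 64 MiB buffer, with a hypothetical elapsed time of 500 ms.
bytes_per_copy = 16777216 * 4
bandwidth = p2p_bandwidth_gb_per_s(bytes_per_copy, num_copies=100, elapsed_ms=500.0)
print(f"P2P bandwidth: {bandwidth:.2f} GB/s")
```

Timing with events recorded on the stream measures GPU-side duration only, so host-side launch overhead does not inflate the denominator.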
Note: P2P bandwidth varies from system to system.

- `simpleP2P.py` – Main Python implementation
- `README.md` – This file
- `requirements.txt` – Python package dependencies