python/4_DistributedComputing/multiGPUGradientAverage/README.md
This sample demonstrates gradient averaging across multiple GPUs using MPI and cuda.core. Each GPU computes local gradients, which are synchronized (averaged) across all GPUs using MPI Allreduce with host-staging (GPU → CPU → MPI → CPU → GPU) for maximum compatibility.
pip install -r requirements.txt
IMPORTANT: This sample MUST be launched by an MPI runtime with at
least 2 processes. On Linux/macOS this is typically mpirun; on Windows with
Microsoft MPI the launcher is mpiexec (and the flag for process count is
-n). Either form is accepted by most MPI stacks.
Linux / macOS (OpenMPI, MPICH, Intel MPI):
# Single node (2 GPUs)
mpirun -np 2 python multiGPUGradientAverage.py --size 10000
# Single node (4 GPUs)
mpirun -np 4 python multiGPUGradientAverage.py --size 10000
# With specific GPUs
CUDA_VISIBLE_DEVICES=0,2 mpirun -np 2 python multiGPUGradientAverage.py
Windows (Microsoft MPI — mpiexec is installed under
C:\Program Files\Microsoft MPI\Bin\ and is not on PATH by default):
& "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 `
python multiGPUGradientAverage.py --size 10000
[Rank 0] World size = 4
======================================================================
Multi-GPU Gradient Average Demo
======================================================================
Number of MPI ranks (GPUs): 4
Gradient vector length per GPU: 10000
Device: NVIDIA GeForce RTX 4090
Computation: gradients computed on GPU via cuda.core.
Communication: gradients averaged via MPI_Allreduce on host (CPU) buffers.
======================================================================
Sample averaged gradient values (rank 0):
avg_grad[0] = 1.500000
avg_grad[5000] = 6.500000
avg_grad[9999] = 11.499000
Expected values:
expected[0] = 1.500000
expected[5000] = 6.500000
expected[9999] = 11.499000
Verifying gradient averaging correctness...
[PASS] Gradient averaging is correct.
[PASS] Gradient averaging is correct on all ranks.
Performance:
Kernel time (GPU only): 0.123 ms
MPI communication time (host-staging, end-to-end): 0.456 ms
Total time: 0.579 ms
======================================================================
Demo complete.
======================================================================
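The expected values in the output above are consistent with each rank r filling its local gradient as grad[i] = r + 0.001 * i, so that averaging over 4 ranks gives 1.5 + 0.001 * i. That formula is inferred from the printed numbers, not taken from the sample's source. Under that assumption, the NumPy-only sketch below (no MPI or GPU needed) reproduces the three printed values; the name `simulated_average` is hypothetical.

```python
import numpy as np


def simulated_average(world_size, size):
    """Average of per-rank gradients, computed in a single process.

    Assumption (inferred from the sample output, not from the sample's code):
    rank r holds grad[i] = r + 0.001 * i.
    """
    idx = np.arange(size, dtype=np.float64)
    # One local gradient vector per simulated rank.
    local_grads = [rank + 0.001 * idx for rank in range(world_size)]
    # An Allreduce with SUM followed by division by world_size yields the mean.
    return np.sum(local_grads, axis=0) / world_size


avg = simulated_average(world_size=4, size=10000)
for i in (0, 5000, 9999):
    print(f"avg_grad[{i}] = {avg[i]:.6f}")
```

Because every rank ends up with the same averaged vector after the Allreduce, the check can be performed independently on each rank, which matches the per-rank and all-rank [PASS] lines in the output.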
The sample uses cuda.core streams and makes CuPy use them via Stream.from_external:
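import cupy as cp
# `device` is the cuda.core Device selected for this rank (already made current)
stream = device.create_stream()
cp.cuda.Stream.from_external(stream).use()  # subsequent CuPy work is issued on this stream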
GPU timing is measured using cuda.core Event:
from cuda.core import EventOptions
timing_options = EventOptions(enable_timing=True)
start_event = stream.record(options=timing_options)
# ... GPU work ...
end_event = stream.record(options=timing_options)
end_event.sync()  # block the host until the GPU has reached end_event
kernel_time = end_event - start_event  # elapsed time in milliseconds
The host-staging pattern moves data GPU → CPU → MPI → CPU → GPU, so the sample works with any MPI implementation and does not require a CUDA-aware MPI build.
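A minimal sketch of the host-staging round trip, not the sample's actual code. It relies only on the two mpi4py-style communicator methods it needs (`Allreduce`, `Get_size`) and takes the device-to-host and host-to-device copies as parameters. `allreduce_average_host_staged` and `LoopbackComm` are hypothetical names; `LoopbackComm` is a single-process stand-in that exists only so the sketch runs without an MPI launcher or a GPU.

```python
import numpy as np


def allreduce_average_host_staged(d_grad, comm, to_host, to_device):
    """Average a device-resident gradient across ranks via host buffers.

    comm      -- object with mpi4py-style Allreduce(sendbuf, recvbuf) and Get_size()
    to_host   -- device -> NumPy copy (e.g. cupy.asnumpy for a CuPy array)
    to_device -- NumPy -> device copy (e.g. cupy.asarray)
    """
    h_send = np.ascontiguousarray(to_host(d_grad))  # GPU -> CPU
    h_recv = np.empty_like(h_send)
    comm.Allreduce(h_send, h_recv)                  # MPI on host buffers (mpi4py defaults to SUM)
    h_avg = h_recv / comm.Get_size()                # sum -> mean
    return to_device(h_avg)                         # CPU -> GPU


class LoopbackComm:
    """Hypothetical one-rank stand-in for MPI.COMM_WORLD (not part of mpi4py)."""

    def Get_size(self):
        return 1

    def Allreduce(self, sendbuf, recvbuf):
        recvbuf[...] = sendbuf  # with a single rank, the sum is the input itself


# CPU-only demonstration: NumPy arrays play the role of device buffers.
grad = np.array([1.0, 2.0, 3.0])
avg = allreduce_average_host_staged(grad, LoopbackComm(), np.asarray, np.asarray)
print(avg)
```

With mpi4py and CuPy installed, the same function could be called as `allreduce_average_host_staged(d_grad, MPI.COMM_WORLD, cp.asnumpy, cp.asarray)`. The two extra copies are the price of portability: a CUDA-aware MPI could reduce device buffers directly, but the staged version only ever hands MPI ordinary host memory.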
Error: "This sample requires at least 2 MPI processes!"
Solution:
Linux / macOS:
mpirun -np 2 python multiGPUGradientAverage.py
Windows (Microsoft MPI):
& "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 python multiGPUGradientAverage.py
(or mpiexec -n 2 ... after adding C:\Program Files\Microsoft MPI\Bin\ to PATH).
See the Running section above for fully-formed examples.
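The error typically appears when the script is started with plain `python`: without an MPI launcher, the process usually sees a world size of 1. A minimal sketch of such a guard follows; `require_min_ranks` is a hypothetical name, and in the real sample the world size would come from the MPI communicator rather than being passed in by hand.

```python
def require_min_ranks(world_size, minimum=2):
    """Exit with the sample's error message when too few MPI processes were started."""
    if world_size < minimum:
        raise SystemExit(f"This sample requires at least {minimum} MPI processes!")
    return world_size


require_min_ranks(2)      # passes: launched with -np 2 / -n 2
try:
    require_min_ranks(1)  # a run without an MPI launcher typically reports 1 rank
except SystemExit as exc:
    print(f"Error: {exc}")
```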