Sample: multiGPUGradientAverage (Python)

Description

This sample demonstrates gradient averaging across multiple GPUs using MPI and cuda.core. Each GPU computes local gradients, which are synchronized (averaged) across all GPUs using MPI Allreduce with host-staging (GPU → CPU → MPI → CPU → GPU) for maximum compatibility.

What you will learn

How to initialize MPI for multi-process GPU communication
How to map MPI ranks to CUDA devices consistently
How to integrate cuda.core streams with CuPy using Stream.from_external
How to compile and launch custom CUDA kernels using cuda.core
How to use cuda.core Event for GPU timing measurements
How to use MPI Allreduce with host-staging for universal compatibility

Prerequisites

Python 3.10+
CUDA Toolkit 13.0+
Standard MPI implementation (OpenMPI, MPICH, or Intel MPI)
Multiple NVIDIA GPUs (tested with 2+ GPUs)

Installation

bash

pip install -r requirements.txt

Running

IMPORTANT: This sample MUST be launched by an MPI runtime with at least 2 processes. On Linux/macOS this is typically mpirun; on Windows with Microsoft MPI the launcher is mpiexec (and the flag for process count is -n). Either form is accepted by most MPI stacks.

Linux / macOS (OpenMPI, MPICH, Intel MPI):

bash

# Single node (2 GPUs)
mpirun -np 2 python multiGPUGradientAverage.py --size 10000

# Single node (4 GPUs)
mpirun -np 4 python multiGPUGradientAverage.py --size 10000

# With specific GPUs
CUDA_VISIBLE_DEVICES=0,2 mpirun -np 2 python multiGPUGradientAverage.py

Windows (Microsoft MPI — mpiexec is installed under C:\Program Files\Microsoft MPI\Bin\ and is not on PATH by default):

powershell

& "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 `
    python multiGPUGradientAverage.py --size 10000

Sample Output

[Rank 0] World size = 4

======================================================================
Multi-GPU Gradient Average Demo
======================================================================
Number of MPI ranks (GPUs): 4
Gradient vector length per GPU: 10000
Device: NVIDIA GeForce RTX 4090
Computation: gradients computed on GPU via cuda.core.
Communication: gradients averaged via MPI_Allreduce on host (CPU) buffers.
======================================================================

Sample averaged gradient values (rank 0):
  avg_grad[0] = 1.500000
  avg_grad[5000] = 6.500000
  avg_grad[9999] = 11.499000

Expected values:
  expected[0] = 1.500000
  expected[5000] = 6.500000
  expected[9999] = 11.499000

Verifying gradient averaging correctness...
[PASS] Gradient averaging is correct.
[PASS] Gradient averaging is correct on all ranks.

Performance:
  Kernel time (GPU only): 0.123 ms
  MPI communication time (host-staging, end-to-end): 0.456 ms
  Total time: 0.579 ms

======================================================================
Demo complete.
======================================================================

Key Technical Details

The sample uses cuda.core streams and makes CuPy use them via Stream.from_external:

python

stream = device.create_stream()
cp.cuda.Stream.from_external(stream).use()

GPU timing is measured using cuda.core Event:

python

from cuda.core import EventOptions
timing_options = EventOptions(enable_timing=True)
start_event = stream.record(options=timing_options)
# ... GPU work ...
end_event = stream.record(options=timing_options)
end_event.sync()
kernel_time = end_event - start_event  # Returns milliseconds

The host-staging pattern transfers data GPU → CPU → MPI → CPU → GPU for universal MPI compatibility without requiring CUDA-aware MPI.

Troubleshooting

Error: "This sample requires at least 2 MPI processes!"

Solution:

Linux / macOS: mpirun -np 2 python multiGPUGradientAverage.py
Windows (Microsoft MPI): & "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 python multiGPUGradientAverage.py (or mpiexec -n 2 ... after adding C:\Program Files\Microsoft MPI\Bin\ to PATH).

See the Running section above for fully-formed examples.