python/2_CoreConcepts/reduction/README.md
Two-stage parallel reduction: each GPU block sums its chunk in shared memory (tree reduction, two elements per thread) and writes one partial sum per block; the host then combines the partial sums into the final result.
Stack: cuda-core provides Device, stream, events, Program, and launch(). CuPy allocates device memory and performs the copies; launch() takes device pointers as ndarray.data.ptr (a Python int). Copies run on the same CUDA stream as the kernel: cp.cuda.Stream.from_external(stream) wraps the cuda.core stream (cuda.core Stream implements the CUDA stream protocol), and the copies are issued inside with cp_stream:.
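To make the two stages concrete, here is a minimal CPU-only NumPy sketch of the same idea. It is not the kernel from reduction.py: the function names are hypothetical, and the access pattern (thread t adds elements t and t + threads while loading) is an assumption about how "two elements per thread" is realised.

```python
import numpy as np


def block_partial_sums(data, threads_per_block=256):
    """CPU emulation of stage 1: one partial sum per block of 2 * threads elements."""
    # The halving loop below requires a power-of-two block size.
    assert threads_per_block > 0
    assert (threads_per_block & (threads_per_block - 1)) == 0
    elems_per_block = 2 * threads_per_block
    num_blocks = -(-data.size // elems_per_block)  # ceiling division
    # Zero-pad the tail so the last block is full; zeros do not change a sum.
    padded = np.zeros(num_blocks * elems_per_block, dtype=data.dtype)
    padded[: data.size] = data
    chunks = padded.reshape(num_blocks, elems_per_block)
    # Assumed load pattern: thread t adds elements t and t + threads and
    # stores the result in its "shared memory" slot.
    shared = chunks[:, :threads_per_block] + chunks[:, threads_per_block:]
    # Tree reduction: halve the number of active threads at each step.
    stride = threads_per_block // 2
    while stride > 0:
        shared[:, :stride] += shared[:, stride : 2 * stride]
        stride //= 2
    return shared[:, 0].copy()  # slot 0 of each block holds its partial sum


def two_stage_sum(data, threads_per_block=256):
    partials = block_partial_sums(data, threads_per_block)  # stage 1 (GPU in the sample)
    return partials.sum()  # stage 2 (host combines the partial sums)


rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=10_000, dtype=np.int32)
partials = block_partial_sums(data)
total = int(two_stage_sum(data))
print(partials.size, total == int(data.sum()))
```

With integer input the two-stage total matches a direct sum exactly; with floating-point input the different summation order can change the last few bits.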
- LaunchConfig with dynamic shared memory and launch() with pointer arguments
- cuda.core stream (Stream.from_external)
- EventOptions / device.create_event()

| Library | Role |
|---|---|
| cuda-core | Device, stream, events, compile, launch |
| cupy | cp.empty, cp.asarray, cp.asnumpy, Stream.from_external |
| numpy | Host data and CPU reference sum |
- cuda-core: Device, create_stream, Program / ProgramOptions, LaunchConfig, launch, EventOptions, create_event
- CuPy: cp.empty, cp.asarray, cp.cuda.Stream.from_external(stream), with cp_stream:, cp.asnumpy

pip install -r requirements.txt
python reduction.py
Defaults: 2²⁴ elements, 256 threads/block, float, 100 benchmark iterations.
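As a rough check on how these defaults translate into a launch layout, the sketch below derives the block count and input size. The helper name is hypothetical, and the shared-memory size (one element per thread) is an assumption rather than something taken from reduction.py.

```python
def reduction_layout(n, threads_per_block=256, itemsize=4):
    """Derive the stage-1 layout for n elements (hypothetical helper)."""
    elems_per_block = 2 * threads_per_block  # two elements per thread
    num_blocks = (n + elems_per_block - 1) // elems_per_block  # ceiling division
    return {
        "elems_per_block": elems_per_block,
        "num_blocks": num_blocks,  # also the number of partial sums
        # Assumption: one shared-memory slot per thread.
        "shared_bytes": threads_per_block * itemsize,
        "input_mb": n * itemsize / (1024 ** 2),
    }


layout = reduction_layout(2 ** 24)  # defaults: 256 threads, 4-byte float
print(layout)
```

For the defaults this gives 512 elements per block, 32768 blocks, and 64 MB of input, which agrees with the configuration printed in the example run.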
Change data type (selects blockReduceKernel_int / _float / _double):
python reduction.py --type float # default; 32-bit float
python reduction.py --type double # 64-bit float
python reduction.py --type int # 32-bit integer (exact equality check)
Combine with other flags as needed, e.g. python reduction.py --type int --n 1048576.
Other main flags: --n, --threads, --iterations. Full list: python reduction.py --help.
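The --type choice could plausibly be wired up as in the sketch below, which maps each type to the kernel name and element width listed above. The table, the function names, and the relative tolerance for floating-point comparison are all hypothetical; reduction.py may organise this differently.

```python
import numpy as np

# Hypothetical lookup table built from the names and widths listed above.
TYPE_TABLE = {
    "float": ("blockReduceKernel_float", np.float32),    # 32-bit float
    "double": ("blockReduceKernel_double", np.float64),  # 64-bit float
    "int": ("blockReduceKernel_int", np.int32),          # 32-bit integer
}


def select_kernel(type_name):
    """Return (kernel name, NumPy dtype) for a --type value."""
    kernel_name, dtype = TYPE_TABLE[type_name]
    return kernel_name, np.dtype(dtype)


def results_match(gpu, cpu, type_name, rtol=1e-5):
    """Exact equality for int; relative tolerance (assumed value) otherwise."""
    if type_name == "int":
        return int(gpu) == int(cpu)
    return bool(np.isclose(gpu, cpu, rtol=rtol, atol=0.0))


print(select_kernel("double"))
print(results_match(10, 10, "int"), results_match(1.0, 1.0 + 1e-7, "float"))
```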
Example run (python reduction.py, defaults) on Tesla T10, compute capability 7.5:
======================================================================
Fast Array Sum using Shared Memory - Two-Stage Reduction
======================================================================
Demonstrates: Efficient parallel reduction using shared memory
Device Information:
Name: Tesla T10
Compute Capability: sm_7.5
Configuration:
Array size: 16,777,216 elements
Data type: float
Memory: 64.00 MB
Threads per block: 256
Two-Stage Reduction Strategy:
Stage 1: GPU block reduction
- Number of blocks: 32768
- Elements per block: 512
- Output: 32768 partial sums
Stage 2: CPU final reduction
- Combine 32768 partial sums -> 1 final result
Compiling CUDA kernel...
Kernel 'blockReduceKernel_float' compiled successfully
> Generating random input data...
> Computing reference result on CPU...
CPU time: 2.428208 seconds
> Warming up GPU...
Warm-up completed
> Benchmarking Stage 1 (GPU block reduction)...
Running 100 iterations...
> Running Stage 2 (CPU final reduction)...
======================================================================
Performance Results
======================================================================
Stage 1 (GPU block reduction):
Average time: 0.338404 ms
Throughput: 198.31 GB/s
Stage 2 (CPU final reduction):
Time: 0.078073 ms
(32768 partial sums)
Total time: 0.416477 ms
Speedup vs CPU: 5830.35x
> Validating results...
GPU result: 2147639808.00000000
CPU result: 2147639929.62027407
Test PASSED
======================================================================
Summary
======================================================================
Key optimizations:
- Load 2 elements per thread: 8,388,608 global reads (50% savings)
- Shared memory for reduction: ~10-20x faster than global memory
- Parallel block outputs: 32768 independent writes
Result: 198.31 GB/s throughput
======================================================================
Two-Stage Reduction completed successfully!
======================================================================
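The figures in the log above can be cross-checked with a few lines of arithmetic. This sketch assumes throughput counts input bytes only, in decimal GB, which reproduces the logged value but is not confirmed by the script. It also shows one way to read the validation lines: near 2.1e9 adjacent float32 values are 256 apart, so the GPU result is simply the float32 value closest to the CPU reference.

```python
import numpy as np

# Values copied from the example run.
n_elements = 16_777_216
itemsize = 4            # bytes per float
stage1_ms = 0.338404
stage2_ms = 0.078073
cpu_seconds = 2.428208
gpu_result = 2147639808.0
cpu_result = 2147639929.62027407

total_ms = stage1_ms + stage2_ms
# Assumption: throughput = input bytes / stage-1 time, with 1 GB = 1e9 bytes.
throughput_gbs = n_elements * itemsize / (stage1_ms * 1e-3) / 1e9
speedup = cpu_seconds * 1e3 / total_ms

# Gap between adjacent float32 values at the magnitude of the result.
ulp = float(np.spacing(np.float32(gpu_result)))
# The CPU reference rounded to the nearest representable float32.
nearest_f32 = float(np.float32(cpu_result))

print(f"total {total_ms:.6f} ms, {throughput_gbs:.2f} GB/s, speedup {speedup:.2f}x")
print(f"float32 spacing {ulp}, nearest float32 to CPU result {nearest_f32:.1f}")
```

The roughly 122-unit gap between the two printed results is therefore below half the float32 spacing at that magnitude, not a sign of a wrong sum.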
reduction.py · requirements.txt · README.md