python/2_CoreConcepts/parallelHistogram/README.md
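Compute histograms on the GPU using atomic operations to handle concurrent updates from multiple threads. This sample demonstrates the modern `cuda.core` API for kernel compilation and launch, comparing two approaches: global atomics and privatized atomics.

The sample covers:

- `cuda.core.Program` for compiling CUDA C source code
- `cuda.core.LaunchConfig` for configuring grid and block dimensions
- `cuda.core.launch()` for launching compiled kernels
- Atomic operations (`atomicAdd`) for thread-safe updates
- `cuda.core` Events for GPU timing

When multiple threads update the same histogram bin without synchronization, their read-modify-write sequences race and increments can be lost. Atomic operations ensure thread-safe updates:

    atomicAdd(&histogram[data[i]], 1); // Thread-safe increment
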
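
The lost-update problem described above can be illustrated with a small CPU-only model. This is a sketch, not GPU code and not part of the sample: the function names `racy_increment` and `atomic_increment` are hypothetical, and the "threads" are simulated by splitting each increment into an explicit read step and write step so the interleaving is visible.

```python
# CPU-only model of several GPU threads incrementing the same histogram bin.
# Hypothetical helper names; this only illustrates why atomicAdd is needed.

def racy_increment(histogram, bin_index, num_threads):
    """Every thread reads the old value before any thread writes back."""
    observed = [histogram[bin_index] for _ in range(num_threads)]  # all reads first
    for value in observed:                                         # then all writes
        histogram[bin_index] = value + 1                           # each overwrites the last
    return histogram


def atomic_increment(histogram, bin_index, num_threads):
    """Each read-modify-write finishes before the next one starts."""
    for _ in range(num_threads):
        histogram[bin_index] += 1
    return histogram


racy = racy_increment([0, 0, 0, 0], bin_index=2, num_threads=8)
safe = atomic_increment([0, 0, 0, 0], bin_index=2, num_threads=8)
print("racy:  ", racy)   # racy:   [0, 0, 1, 0]
print("atomic:", safe)   # atomic: [0, 0, 8, 0]
```

In the racy version eight increments collapse into one because every write is based on the same stale read. On the GPU, `atomicAdd` prevents this by making each read-modify-write indivisible.
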
| Approach | Pros | Cons |
|---|---|---|
| Global | Simple | High contention on popular bins |
| Privatized | Significantly faster | Extra shared memory, synchronization |
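
The trade-off in the table can be sketched with a CPU-only model that counts how many updates reach the global histogram under each approach. This is an illustration under stated assumptions, not the sample's kernels: `NUM_BINS`, `BLOCK_SIZE`, and both function names are hypothetical, and the per-block list stands in for a shared-memory copy of the histogram.

```python
import random

NUM_BINS = 8      # hypothetical bin count, chosen small for illustration
BLOCK_SIZE = 64   # hypothetical number of input elements handled per "block"


def histogram_global(data):
    """Every element triggers one update on the global histogram."""
    hist = [0] * NUM_BINS
    global_updates = 0
    for value in data:
        hist[value] += 1          # a global atomicAdd in a real kernel
        global_updates += 1
    return hist, global_updates


def histogram_privatized(data):
    """Each block fills a private copy, then merges it into the global one."""
    hist = [0] * NUM_BINS
    global_updates = 0
    for start in range(0, len(data), BLOCK_SIZE):
        local = [0] * NUM_BINS    # stands in for a per-block shared-memory histogram
        for value in data[start:start + BLOCK_SIZE]:
            local[value] += 1     # a shared-memory atomic in a real kernel
        # A real kernel would need a block-wide barrier here before merging.
        for b in range(NUM_BINS):
            hist[b] += local[b]   # one global atomic per bin per block
            global_updates += 1
    return hist, global_updates


random.seed(0)
data = [random.randrange(NUM_BINS) for _ in range(1024)]
g_hist, g_updates = histogram_global(data)
p_hist, p_updates = histogram_privatized(data)
print(g_hist == p_hist)       # True
print(g_updates, p_updates)   # 1024 128
```

Both versions produce the same histogram, but the privatized model issues 128 global updates (16 blocks × 8 bins) instead of 1024. The cost is the extra per-block storage and the merge step, which matches the "Cons" column above.
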
`cuda.core`:

- `Device` - Device management and context
- `Program` - Compile CUDA C source code
- `ProgramOptions` - Set architecture, optimization flags
- `LaunchConfig` - Configure grid and block dimensions
- `launch()` - Launch compiled kernel
- `Stream` - Async stream management
- `EventOptions` - Configure events for GPU timing
- `stream.record()` - Record events for timing

`cupy`:

- `cp.random.randint()` - Generate random data directly on GPU
- `cp.zeros()` - Allocate zeroed GPU arrays

CUDA device function:

- `atomicAdd()` - Thread-safe addition

See `requirements.txt` for Python packages. Install them and run the sample:

    pip install -r requirements.txt
    python parallelHistogram.py

Expected output:

    ============================================================
    Parallel Histogram with Atomics (cuda.core)
    ============================================================
    Device: <Your GPU>
    Compute Capability: ComputeCapability(major=X, minor=Y)
    Compiling CUDA kernels with cuda.core.Program...
    Compiled for architecture: sm_XY
    Generating 10,000,000 random values on GPU...
    Verifying correctness...
    Global atomics: PASSED
    Privatized atomics: PASSED
    Benchmarking (100 iterations)...
    Global atomics: X.XXX ms
    Privatized atomics: X.XXX ms
    Speedup: XXx
    Test PASSED

Files:

- `parallelHistogram.py` - Main sample using `cuda.core`
- `README.md` - This file
- `requirements.txt` - Dependencies