python/2_CoreConcepts/blockwiseSum/README.md
Demonstrates fundamental CUDA thread cooperation: thread/block indexing, strided loops, and block-wise reduction using shared memory. This sample shows three progressively complex kernel patterns using the cuda.core API:
__syncthreads()Global Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Stride = blockDim.x * gridDim.x
Each thread processes multiple elements, enabling fixed grid size for arbitrary array lengths:
for (size_t i = tid; i < N; i += stride) {
output[i] = input[i] * 2.0f;
}
cuda.core:Device - Device management and contextProgram - Compile CUDA C++ kernelsProgramOptions - Kernel compilation options (architecture target)LaunchConfig - Configure grid/block dimensions and shared memorylaunch() - Execute kernelEventOptions - GPU timing configurationcp.asarray() - Transfer data to GPUcp.zeros_like() - Allocate GPU arraysrequirements.txt for Python packagespip install -r requirements.txt
python blockwiseSum.py
Device: <Your GPU>
Compute Capability: sm_XX
Array size: 1,048,576 elements
Simple indexing: Test PASSED
Strided loop: Test PASSED
Block-wise sum: Test PASSED
Kernel time: X.XXX ms, Bandwidth: XXX.X GB/s
Done
blockwiseSum.py - Python implementation with CUDA kernelsREADME.md - This filerequirements.txt - Sample dependencies