python/1_GettingStarted/simplePrint/README.md
This sample demonstrates how to use printf() inside CUDA kernels using two different approaches:
- CUDA C++ kernels via cuda.core.Program - Full C++ features and control
- Numba CUDA kernels using numba.cuda.grid() for modern indexing

The sample shows basic device management, kernel compilation with inline CUDA C++ code, and multi-dimensional kernel launches (2D grid × 3D blocks) using modern CUDA Python. The Numba example demonstrates the recommended numba.cuda.grid() indexing style and shows how it relates to classic CUDA C++ block/thread IDs. Both approaches use cuda.core APIs for stream management and synchronization, which demonstrates interoperability between the two libraries.
This is the Python equivalent of the C++ simplePrintf sample, enhanced with Numba CUDA examples.
CUDA Python (cuda.core), Numba CUDA, Kernel Compilation, Printf in Kernels, Multi-dimensional Launch, Pythonic GPU Programming, Modern Thread Indexing (grid()), Stream-based Execution, cuda.core/Numba Interoperability
cuda.core APIs:
- Device() - Device management
- Device.create_stream() - Create CUDA streams
- Stream.sync() - Synchronize stream execution
- Program() - Compile CUDA C++ kernels
- LaunchConfig() - Configure kernel launch
- launch() - Execute kernels on streams

Numba CUDA APIs:
- @cuda.jit - JIT compile Python functions to CUDA kernels
- cuda.grid() - Get global thread position (recommended modern approach)
- cuda.blockIdx, cuda.threadIdx - Thread/block indices (classic style)
- cuda.gridDim, cuda.blockDim - Grid/block dimensions
- cuda.core APIs for stream management (interoperability)

Device code:
- printf() - Print from device code (C++)
- print() - Print from device code (Numba, limited formatting)
- blockIdx, threadIdx - Thread/block indices
- gridDim, blockDim - Grid/block dimensions

Topics covered:
- cuda.core.Device
- Program and ProgramOptions
- @cuda.jit decorator
- numba.cuda.grid() for modern thread indexing (recommended approach)
- cuda.core streams with Numba CUDA kernels
- printf() and print() in GPU kernels for debugging

Requirements:
- cuda-python package (13.0+)
- cuda-core package (>=1.0.0)
- numba-cuda package (0.24.0+, for Pythonic kernel authoring)

Download and install:
pip install cuda-python
pip install numba-cuda

# Install dependencies
pip install -r requirements.txt
# Run the sample
python simplePrint.py
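If the run fails at import time, checking which of the required packages are actually installed is a reasonable first step. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of the sample, and it reports versions without comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip commands of this README.
REQUIRED = ("cuda-python", "cuda-core", "numba-cuda")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```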
Simple Print - Printing from CUDA Kernels
Demonstrating both CUDA C++ and Numba CUDA approaches
Device: <Your GPU Name>
Compute Capability: sm_<XX>
======================================================================
METHOD 1: CUDA C++ Kernel (via cuda.core.Program)
======================================================================
Advantage: Full C++ features, better for complex kernels
Compiling CUDA C++ kernel...
Kernel compiled successfully.
Kernel configuration:
Grid: (2, 2)
Block: (2, 2, 2)
Total threads: 32
Launching kernel with value=10. Output:
[0, 0]: Value is: 10
[0, 1]: Value is: 10
[0, 2]: Value is: 10
[0, 3]: Value is: 10
[0, 4]: Value is: 10
[0, 5]: Value is: 10
[0, 6]: Value is: 10
[0, 7]: Value is: 10
[1, 0]: Value is: 10
...
[3, 7]: Value is: 10
CUDA C++ kernel execution complete.
======================================================================
METHOD 2: Numba CUDA Kernel (Pythonic / modern indexing)
======================================================================
Advantage: Uses numba.cuda.grid(3) for global indexing,
while still showing classic CUDA C++ IDs for reference.
Uses cuda.core for stream management (interoperability).
Kernel configuration:
Grid: (2, 2)
Block: (2, 2, 2)
Total threads: 32
Launching Numba CUDA kernel (grid(3) + classic IDs) with value=10:
Uses numba.cuda.grid(3) to get global (x, y, z),
and prints the corresponding blockId/threadId like the C++ sample.
Stream managed by cuda.core for consistency with C++ example.
global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10
global[ 1 , 0 , 0 ] -> [ 0 , 1 ]: Value is: 10
global[ 0 , 1 , 0 ] -> [ 0 , 2 ]: Value is: 10
...
global[ 3 , 3 , 1 ] -> [ 3 , 7 ]: Value is: 10
Numba CUDA kernel execution complete.
======================================================================
Done! Both kernel approaches demonstrated successfully.
======================================================================
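The [blockId, threadId] pairs and global coordinates in the output above can be reproduced on the CPU, which helps when checking the index arithmetic without a GPU. The sketch below is not the sample's kernel code: emulate_launch is a hypothetical helper that loops over every block and thread of the (2, 2) grid and (2, 2, 2) block, flattens the IDs the same way the kernels do, and computes the global position as blockIdx * blockDim + threadIdx per axis, which is what numba.cuda.grid(3) returns. On a real GPU the order in which threads print is not guaranteed, so only the set of lines is comparable.

```python
from itertools import product

GRID = (2, 2)      # gridDim.x, gridDim.y (gridDim.z is 1 for a 2D grid)
BLOCK = (2, 2, 2)  # blockDim.x, blockDim.y, blockDim.z


def emulate_launch(grid=GRID, block=BLOCK):
    """Return one (global_xyz, block_id, thread_id) tuple per emulated thread."""
    grid_x, grid_y = grid
    block_x, block_y, block_z = block
    rows = []
    for by, bx in product(range(grid_y), range(grid_x)):
        for tz, ty, tx in product(range(block_z), range(block_y), range(block_x)):
            # Flattened IDs, as printed by both kernels.
            block_id = by * grid_x + bx
            thread_id = tz * block_x * block_y + ty * block_x + tx
            # Global position per axis; blockIdx.z is always 0 here.
            global_xyz = (bx * block_x + tx, by * block_y + ty, tz)
            rows.append((global_xyz, block_id, thread_id))
    return rows


rows = emulate_launch()
print("Total threads:", len(rows))
for (x, y, z), block_id, thread_id in rows[:3]:
    print(f"global[{x}, {y}, {z}] -> [{block_id}, {thread_id}]")
```

Calling emulate_launch((4, 4), (4, 4, 4)) gives 16 blocks of 64 threads each, 1024 tuples in total, which is a quick way to predict how much output a larger launch would produce.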
Each thread calculates:
- Block ID: blockIdx.y * gridDim.x + blockIdx.x
- Thread ID: threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x

Each thread shows:
- Global position: numba.cuda.grid(3) → (x, y, z) coordinates across the entire grid

CUDA C++ Kernel (Method 1):
- printf() with full formatting control
- Compiled with cuda.core.Program
- Output: [0, 0]: Value is: 10 (clean formatting)

Numba CUDA Kernel (Method 2):
- @cuda.jit decorator
- numba.cuda.grid(3) to get global thread coordinates (recommended)
- cuda.core streams via stream for consistency
- Limited formatting (print() only; adds spaces between arguments)
- Output: global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10 (shows both indexing styles; note the extra spaces due to print() behavior)

Try modifying:
- grid=(4, 4) for 16 blocks
- block=(4, 4, 4) for 64 threads per block
- Conditional printing (e.g., if threadId == 0:)
- Different printf() formats
- Additional C++ headers (e.g., <cuda/std/cmath>, <cub/cub.cuh>)
- numba.cuda.grid(1) or numba.cuda.grid(2) for different dimensions
- Printing only for selected coordinates, such as x == 0 or y == z
- numba.cuda.shared.array() for fast inter-thread communication
- numba.cuda.atomic.add() for thread-safe updates
- What print() can and cannot handle
- Multiple cuda.core streams, launching numba-cuda kernels on them concurrently

Notes:
- Call stream.sync() after kernel launch to flush printf output
- printf() format string support (%, flags, width, precision)
- numba.cuda.grid(ndim) for thread indexing (modern, Pythonic)
  - grid(1) for 1D indexing, grid(2) for 2D, grid(3) for 3D
- cuda.core streams with Numba kernels via stream
  - stream = device.create_stream()
  - kernel[grid, block, stream](args)
  - stream.sync()
- print() has limited capabilities compared to Python's print()
  - Use print("Value:", x) instead of f-strings
  - print() automatically adds spaces between comma-separated arguments (e.g., print("[", x, "]") outputs [ 0 ] not [0])

Files:
- simplePrint.py - Python implementation using cuda.core API
- README.md - This file
- requirements.txt - Sample dependencies
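The spacing behavior of print() described above can be previewed on the host, since ordinary Python print() also separates comma-separated arguments with a single space. This is a CPU-only sketch, under the assumption (taken from the note above) that the Numba device print() follows the same separator rule; the render helper is hypothetical and exists only to capture the printed text.

```python
import io
from contextlib import redirect_stdout


def render(*args):
    """Capture what print(*args) writes, without the trailing newline."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print(*args)
    return buffer.getvalue().rstrip("\n")


x = 0
spaced = render("[", x, "]")  # comma-separated arguments, the style used in the Numba kernel
compact = f"[{x}]"            # f-string formatting, shown here on the host only

print(repr(spaced))   # '[ 0 ]'
print(repr(compact))  # '[0]'
```

The difference explains why the Numba output lines read global[ 0 , 0 , 0 ] while the printf() lines from the C++ kernel have no extra spaces.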