simplePrint - Printing from CUDA Kernels

Description

This sample demonstrates how to use printf() inside CUDA kernels using two different approaches:

CUDA C++ kernels compiled with cuda.core.Program - Full C++ features and control
Numba CUDA kernels - Pythonic kernel authoring using numba.cuda.grid() for modern indexing

The sample shows basic device management, kernel compilation with inline CUDA C++ code, and multi-dimensional kernel launches (2D grid × 3D blocks) using modern CUDA Python. The Numba example demonstrates the recommended numba.cuda.grid() indexing style while also showing how it relates to classic CUDA C++ block/thread IDs. Both approaches use cuda.core APIs for stream management and synchronization, demonstrating interoperability.

This is the Python equivalent of the C++ simplePrintf sample, enhanced with Numba CUDA examples.

Key Concepts

CUDA Python (cuda.core), Numba CUDA, Kernel Compilation, Printf in Kernels, Multi-dimensional Launch, Pythonic GPU Programming, Modern Thread Indexing (grid()), Stream-based Execution, cuda.core/Numba Interoperability

CUDA APIs involved

cuda.core (cuda-python)

Device() - Device management
Device.create_stream() - Create CUDA streams
Stream.sync() - Synchronize stream execution
Program() - Compile CUDA C++ kernels
LaunchConfig() - Configure kernel launch
launch() - Execute kernels on streams

Numba CUDA

@cuda.jit - JIT compile Python functions to CUDA kernels
cuda.grid() - Get global thread position (recommended modern approach)
cuda.blockIdx, cuda.threadIdx - Thread/block indices (classic style)
cuda.gridDim, cuda.blockDim - Grid/block dimensions
Note: Uses cuda.core APIs for stream management (interoperability)

CUDA Kernel Functions

printf() - Print from device code (C++)
print() - Print from device code (Numba, limited formatting)
blockIdx, threadIdx - Thread/block indices
gridDim, blockDim - Grid/block dimensions

What You Learn

Device initialization with cuda.core.Device
Compiling CUDA C++ kernels with Program and ProgramOptions
Writing Pythonic CUDA kernels with Numba's @cuda.jit decorator
Using numba.cuda.grid() for modern thread indexing (recommended approach)
Understanding the relationship between global coordinates and classic block/thread IDs
Interoperability: Using cuda.core streams with Numba CUDA kernels
Comparing CUDA C++ vs Pythonic kernel authoring approaches
Multi-dimensional kernel launches (2D grid, 3D blocks)
Using streams for kernel execution and synchronization
Using printf() and print() in GPU kernels for debugging
Understanding print limitations in Numba CUDA (no f-strings)
Proper error handling and resource management

Requirements

Hardware:

NVIDIA GPU with Compute Capability 7.0 or higher
Minimum GPU memory: 512 MB

Software:

CUDA Toolkit 13.0 or newer
Python 3.10 or newer
cuda-python package (13.0+)
cuda-core package (>=1.0.0)
numba-cuda package (0.24.0+, for Pythonic kernel authoring)

Download and install:

CUDA Toolkit
cuda-python package: pip install cuda-python
numba-cuda: pip install numba-cuda

Build and Run

bash

# Install dependencies
pip install -r requirements.txt

# Run the sample
python simplePrint.py

Expected Output

Simple Print - Printing from CUDA Kernels
Demonstrating both CUDA C++ and Numba CUDA approaches

Device: <Your GPU Name>
Compute Capability: sm_<XX>

======================================================================
METHOD 1: CUDA C++ Kernel (via cuda.core.Program)
======================================================================
Advantage: Full C++ features, better for complex kernels

Compiling CUDA C++ kernel...
Kernel compiled successfully.

Kernel configuration:
  Grid:  (2, 2)
  Block: (2, 2, 2)
  Total threads: 32

Launching kernel with value=10. Output:

[0, 0]:		Value is: 10
[0, 1]:		Value is: 10
[0, 2]:		Value is: 10
[0, 3]:		Value is: 10
[0, 4]:		Value is: 10
[0, 5]:		Value is: 10
[0, 6]:		Value is: 10
[0, 7]:		Value is: 10
[1, 0]:		Value is: 10
...
[3, 7]:		Value is: 10

CUDA C++ kernel execution complete.


======================================================================
METHOD 2: Numba CUDA Kernel (Pythonic / modern indexing)
======================================================================
Advantage: Uses numba.cuda.grid(3) for global indexing,
           while still showing classic CUDA C++ IDs for reference.
           Uses cuda.core for stream management (interoperability).

Kernel configuration:
  Grid:  (2, 2)
  Block: (2, 2, 2)
  Total threads: 32

Launching Numba CUDA kernel (grid(3) + classic IDs) with value=10:
Uses numba.cuda.grid(3) to get global (x, y, z),
and prints the corresponding blockId/threadId like the C++ sample.
Stream managed by cuda.core for consistency with C++ example.

global[ 0 , 0 , 0 ]  -> [ 0 ,  0 ]: Value is: 10
global[ 1 , 0 , 0 ]  -> [ 0 ,  1 ]: Value is: 10
global[ 0 , 1 , 0 ]  -> [ 0 ,  2 ]: Value is: 10
...
global[ 3 , 3 , 1 ]  -> [ 3 ,  7 ]: Value is: 10

Numba CUDA kernel execution complete.

======================================================================
Done! Both kernel approaches demonstrated successfully.
======================================================================

Understanding the Output

Grid: 2×2 = 4 blocks (labeled 0-3)
Block: 2×2×2 = 8 threads per block (labeled 0-7)
Total: 32 threads, each printing its position and value

CUDA C++ Kernel:

Each thread calculates:

Block ID (linear): blockIdx.y * gridDim.x + blockIdx.x
Thread ID (linear): threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x

Numba CUDA Kernel:

Each thread shows:

Global position using numba.cuda.grid(3) → (x, y, z) coordinates across entire grid
Classic IDs (block ID, thread ID) calculated the same way as C++ for comparison
This demonstrates how modern indexing relates to traditional CUDA C++ style

Comparing the Two Approaches

CUDA C++ Kernel (Method 1):

Uses C++ syntax and printf() with full formatting control
Requires compilation via cuda.core.Program
Best for complex kernels needing C++ features (templates, libraries, etc.)
Uses classic block/thread ID indexing
Output: [0, 0]: Value is: 10 (clean formatting)

Numba CUDA Kernel (Method 2):

Uses Python syntax with @cuda.jit decorator
JIT compiled automatically when called
Best for prototyping and simpler kernels
Modern indexing: Uses numba.cuda.grid(3) to get global thread coordinates (recommended)
Also shows classic block/thread IDs to help relate the two indexing models
Interoperability: Uses cuda.core streams via stream for consistency
Demonstrates that numba-cuda kernels can work seamlessly with cuda.core infrastructure
Limited print formatting (no f-strings, basic print() only; adds spaces between arguments)
Output: global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10 (shows both indexing styles; note extra spaces due to print() behavior)

Experiments

Try modifying:

For Both Approaches:

Grid size: Change grid=(4, 4) for 16 blocks
Block size: Change block=(4, 4, 4) for 64 threads per block
Conditional printing: Print only from specific threads (e.g., if threadId == 0:)

CUDA C++ Specific:

Format strings: Experiment with different printf() formats
Kernel code: Add complex C++ computations before printing
External libraries: Include CUDA math libraries or device functions (e.g., <cuda/std/cmath>, <cub/cub.cuh>)

Numba CUDA Specific:

Grid indexing: Try numba.cuda.grid(1) or numba.cuda.grid(2) for different dimensions
Conditional printing: Print only from threads where x == 0 or y == z
Python operations: Use NumPy-like operations in the kernel
Device math libraries: Use nvmath-python device APIs for optimized math operations (similar to CUDA math libraries in C++)
Shared memory: Add numba.cuda.shared.array() for fast inter-thread communication
Atomic operations: Try numba.cuda.atomic.add() for thread-safe updates
Print variations: Experiment with what numba-cuda's print() can and cannot handle
Streams: Create multiple cuda.core streams and launch numba-cuda kernels on them concurrently
Interoperability: Mix numba-cuda kernels and CUDA C++ kernels on the same stream

Notes

General:

Printing from GPU is relatively slow - use sparingly in production code
Printf output is buffered and limited (~1MB buffer on most GPUs)

CUDA C++ Kernels:

Always call stream.sync() after kernel launch to flush printf output
Full printf() format string support (%, flags, width, precision)

Numba CUDA Kernels:

Recommended: Use numba.cuda.grid(ndim) for thread indexing (modern, Pythonic)
- grid(1) for 1D indexing, grid(2) for 2D, grid(3) for 3D
- Returns global thread position across the entire grid
Interoperability: Use cuda.core streams with Numba kernels via stream
- Create streams: stream = device.create_stream()
- Launch kernels: kernel[grid, block, stream](args)
- Synchronize: stream.sync()
Numba's print() has limited capabilities compared to Python's print()
F-strings are NOT supported in Numba CUDA kernels
Use comma-separated arguments: print("Value:", x) instead of f-strings
Note: print() automatically adds spaces between comma-separated arguments (e.g., print("[", x, "]") outputs [ 0 ] not [0])
Always synchronize the stream to flush output

Files

simplePrint.py - Python implementation using cuda.core API
README.md - This file
requirements.txt - Sample dependencies

simplePrint - Printing from CUDA Kernels

simplePrint - Printing from CUDA Kernels

Description

Key Concepts

CUDA APIs involved

cuda.core (cuda-python)

Numba CUDA

CUDA Kernel Functions

What You Learn

Requirements

Hardware:

Software:

Build and Run

Expected Output

Understanding the Output

CUDA C++ Kernel:

Numba CUDA Kernel:

Comparing the Two Approaches

Experiments

For Both Approaches:

CUDA C++ Specific:

Numba CUDA Specific:

Notes

General:

CUDA C++ Kernels:

Numba CUDA Kernels:

Files

See Also

CUDA Python (cuda.core):

Numba CUDA:

CUDA References: