python/2_CoreConcepts/launchConfigTuning/README.md
Benchmark different CUDA kernel launch configurations to find the optimal block-size setting using cuda.core APIs. This sample demonstrates performance tuning by measuring execution time across various thread block sizes.
cuda.core.Program

LaunchConfig settings

    from cuda.core import LaunchConfig, launch

    # Configure kernel launch with specific thread block size
    config = LaunchConfig(
        grid=(grid_size,),
        block=(block_size,),
        shmem_size=shared_memory_bytes,
    )

    # Launch kernel and wait for it to finish
    launch(stream, config, kernel, *args)
    stream.sync()

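
The grid size passed to LaunchConfig is normally derived from the problem size and the block size by ceiling division. The following is a minimal sketch, assuming a 1D kernel that processes one element per thread; `grid_size_for` is a hypothetical helper name and is not part of cuda.core.

```python
def grid_size_for(n_elements: int, block_size: int) -> int:
    """Return the number of blocks needed to cover n_elements (ceiling division)."""
    if n_elements <= 0 or block_size <= 0:
        raise ValueError("n_elements and block_size must be positive")
    return (n_elements + block_size - 1) // block_size


# Problem size and block sizes taken from this sample's printed output.
n = 10_000_000
for block_size in (32, 64, 128, 256, 512, 1024):
    print(f"block_size={block_size:4d} -> grid_size={grid_size_for(n, block_size)}")
```

When the problem size is not a multiple of the block size, the last block is only partly used, so a kernel written this way needs a bounds check on the element index.
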
Thread block size significantly impacts performance due to:
| Factor | Impact |
|---|---|
| Occupancy | More active warps can hide memory latency |
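| Registers | The SM register file is shared by all resident threads, so more threads per block leaves fewer registers per thread |
| Shared Memory | Per-SM shared memory is divided among the blocks resident on that SM |
| Warp Efficiency | Block size should be a multiple of 32 (the warp size) |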
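
The warp-efficiency row can be made concrete with a little arithmetic. This is a rough sketch, assuming a warp size of 32 threads; the helper names `warps_per_block` and `lane_utilization` are hypothetical and exist only for illustration.

```python
WARP_SIZE = 32  # threads per warp (assumed; matches the "multiple of 32" rule above)


def warps_per_block(block_size: int) -> int:
    """Warps the hardware must allocate for one block (partial warps round up)."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def lane_utilization(block_size: int) -> float:
    """Fraction of allocated warp lanes that are occupied by a real thread."""
    return block_size / (warps_per_block(block_size) * WARP_SIZE)


for block_size in (32, 48, 100, 256):
    print(
        f"block_size={block_size:4d} "
        f"warps={warps_per_block(block_size)} "
        f"utilization={lane_utilization(block_size):.1%}"
    )
```

A block of 48 threads still occupies two full warps, so a quarter of the lanes in those warps do no work. This is why the block sizes tested in this sample are all multiples of 32.
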

    from cuda.core import EventOptions, launch

    # Use CUDA events for accurate GPU timing (not CPU wall-clock)
    start_event = device.create_event(options=EventOptions(enable_timing=True))
    end_event = device.create_event(options=EventOptions(enable_timing=True))

    stream.record(start_event)
    for _ in range(n_iterations):
        launch(stream, config, kernel, *args)
    stream.record(end_event)
    end_event.sync()  # wait until the GPU has reached the end event

    # Event subtraction yields milliseconds; divide for the per-launch average
    elapsed_ms = (end_event - start_event) / n_iterations

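
The sample's output reports each block size as a time with a ± spread, then names the best and worst configuration and a speedup. A sketch of that bookkeeping follows, assuming the spread is a sample standard deviation over repeated measurements and the speedup is worst time divided by best time; the README does not confirm either. The helper names and the timing values are placeholders, not measurements.

```python
import statistics


def summarize(samples_ms):
    """Return (mean, sample standard deviation) of per-launch times in ms."""
    mean = statistics.mean(samples_ms)
    std = statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0
    return mean, std


def best_and_worst(mean_ms_by_block_size):
    """Return (best block size, worst block size, worst/best time ratio)."""
    best = min(mean_ms_by_block_size, key=mean_ms_by_block_size.get)
    worst = max(mean_ms_by_block_size, key=mean_ms_by_block_size.get)
    return best, worst, mean_ms_by_block_size[worst] / mean_ms_by_block_size[best]


# Placeholder numbers for illustration only, NOT measured on any GPU.
placeholder_ms = {32: [2.0, 2.2, 2.1], 256: [1.0, 1.1, 0.9]}
means = {bs: summarize(times)[0] for bs, times in placeholder_ms.items()}
best, worst, speedup = best_and_worst(means)
print(f"best={best} worst={worst} speedup={speedup:.2f}x")
```

Each entry in the samples list would be one `elapsed_ms` value from the event-timing loop above, collected over several repeats after a warm-up launch.
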

cuda.core:

- Device - CUDA device management
- Program - Runtime kernel compilation (NVRTC)
- ProgramOptions - Compilation options (architecture target)
- LaunchConfig - Kernel launch configuration (grid/block dimensions)
- launch - Execute compiled kernel (accepts Buffer objects directly)
- EventOptions - GPU timing with CUDA events
- ManagedMemoryResource - Device-preferred unified memory
- ManagedMemoryResourceOptions - Set preferred_location for representative benchmarks

numpy:

- np.from_dlpack() - Zero-copy view of GPU buffers via DLPack

See requirements.txt for Python packages.

The benchmark loops in this sample read kernel results back from
ManagedMemoryResource allocations between launches, which requires the
device property concurrent_managed_access=True. This is only supported on
Linux with HMM (Pascal and newer). On Windows (WDDM/MCDM/TCC) the property
is False, so the sample exits early with a waive message and exit code 2.
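
The early-exit behaviour described above can be sketched as a small guard. This is an illustration under stated assumptions, not the sample's actual code: `waive_exit_code` is a hypothetical helper, and the flag is passed in as a plain boolean so the sketch runs without a GPU. In the real script the value would be read from the device properties.

```python
import sys

WAIVED_EXIT_CODE = 2  # exit code this README states for the waived case


def waive_exit_code(concurrent_managed_access: bool) -> int:
    """Return 0 when the sample can run, or the waive code when it cannot."""
    if concurrent_managed_access:
        return 0
    print(
        "Waived: device does not report concurrent_managed_access=True",
        file=sys.stderr,
    )
    return WAIVED_EXIT_CODE


# Stand-in values; a real script would query the device instead.
for flag in (True, False):
    print(f"concurrent_managed_access={flag} -> exit code {waive_exit_code(flag)}")
```

A script would pass a non-zero result to `sys.exit()` before compiling any kernels, so unsupported platforms are reported as waived and not as failed.
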

    pip install -r requirements.txt
    python launchConfigTuning.py

Expected output:

    ============================================================
    Launch Configuration Tuning (cuda.core)
    Finding the Best Block Size for Your Kernel
    ============================================================
    Device: <Your GPU Name>
    Compute Capability: X.X
    Compiling CUDA kernels with cuda.core.Program...
    Target architecture: sm_XX
    [OK] vector_add kernel compiled
    [OK] reduce_sum kernel compiled
    ============================================================
    VECTOR ADDITION - Launch Configuration Tuning
    ============================================================
    Problem size: 10,000,000 elements
    Kernel: vector_add (C = A + B)
    Testing thread configurations: [32, 64, 128, 256, 512, 1024]
    ------------------------------------------------------------
    Block Size: 32 | Blocks: 312500 | Time: X.XXXX ± X.XXXX ms
    Block Size: 64 | Blocks: 156250 | Time: X.XXXX ± X.XXXX ms
    ...
    ------------------------------------------------------------
    [OK] OPTIMAL: block_size=XXX (X.XXXX ms)
    [FAIL] WORST: block_size=XXX (X.XXXX ms)
    Speedup: X.XXx
    [OK] Results verified correct!
    ...
    ============================================================
    SAMPLE COMPLETE
    ============================================================
    Key Takeaway: The optimal thread configuration depends on your
    specific kernel characteristics. Always benchmark to find the best!

- launchConfigTuning.py - Python implementation using cuda.core
- README.md - This file
- requirements.txt - Sample dependencies