python/1_GettingStarted/kernelNsysProfile/README.md
This sample demonstrates how to profile custom CUDA C++ kernels compiled and launched with cuda.core using NVIDIA Nsight Systems. It implements three GPU operations (vector addition, SAXPY, vector transform) as custom kernels and shows how to instrument code with NVTX markers for profiling analysis.
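The sample shows how to:
- Compile CUDA C++ kernels with cuda.core.Program
- Configure launches with LaunchConfig and manage CUDA streams
- Use NVTX markers (nvtx.annotate()) to annotate code sections
- Manage the GPU with cuda.core.Device and proper resource cleanup

Requirements: numpy, cuda-python, cuda-core, cupy-cuda13x, nvtx (see requirements.txt; NumPy >=2.3.2)

Install: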
pip install -r requirements.txt
python kernelNsysProfile.py
python kernelNsysProfile.py --array-size 10000000 # Custom size
Basic profile:
nsys profile -o gpu_profile python kernelNsysProfile.py
nsys-ui gpu_profile.nsys-rep # View results
The program uses color-coded NVTX markers to label its phases on the Nsight Systems timeline.
Focus on Phase 2 to analyze kernel execution times, launch overhead, and GPU utilization.
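
As a rough illustration of how color-coded NVTX ranges can wrap the phases of a script, here is a minimal sketch. The phase names, colors, and the CPU-only saxpy_reference helper are assumptions made for this example and are not taken from kernelNsysProfile.py. The only NVTX call used is nvtx.annotate() as a context manager. The fallback branch is likewise an assumption, added so the sketch still runs where nvtx is not installed.

```python
import contextlib

try:
    import nvtx  # listed in requirements.txt

    def annotate(message, color):
        """Open an NVTX range that Nsight Systems shows on its timeline."""
        return nvtx.annotate(message=message, color=color)
except ImportError:
    def annotate(message, color):
        """Fallback so the sketch still runs when nvtx is not installed."""
        return contextlib.nullcontext()


def saxpy_reference(a, x, y):
    """CPU stand-in for the GPU work: returns a * x + y element-wise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def run_phases(n):
    # Phase names and colors are illustrative, not the sample's actual labels.
    with annotate("Phase 1: prepare data", color="blue"):
        x = [float(i) for i in range(n)]
        y = [1.0] * n
    with annotate("Phase 2: compute", color="green"):
        result = saxpy_reference(2.0, x, y)
    with annotate("Phase 3: verify", color="orange"):
        assert len(result) == n
    return result
```

Calling run_phases(4) under nsys profile would produce three named ranges on the NVTX row of the timeline. In the real sample, the compute phase contains the kernel launches.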
For detailed Nsys usage and analysis techniques, see the NVIDIA Nsight Systems documentation.
Missing packages:
pip install -r requirements.txt
Out of memory:
python kernelNsysProfile.py -n 10000000 # Reduce array size
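
To pick an array size that fits on the GPU, it can help to estimate the device memory a run needs. The sketch below assumes three float32 arrays (two inputs and one output) per operation. That layout is a guess for illustration, not something confirmed by the sample, so adjust num_arrays and bytes_per_element to match the actual code.

```python
def estimate_device_bytes(array_size, num_arrays=3, bytes_per_element=4):
    """Estimate device memory for num_arrays buffers of array_size elements.

    The defaults (3 arrays of 4-byte floats) are assumptions, not values
    read from kernelNsysProfile.py.
    """
    return array_size * num_arrays * bytes_per_element


def format_mib(num_bytes):
    """Render a byte count in mebibytes with one decimal place."""
    return f"{num_bytes / (1024 ** 2):.1f} MiB"
```

Under these assumptions, 10,000,000 elements need about 114.4 MiB, so halving the array size halves the footprint.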
Nsys not found:
export PATH=/usr/local/cuda/bin:$PATH
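
The same check can be done from Python before launching a profile. This is a minimal sketch using only the standard library. The helper names find_nsys and build_profile_command are hypothetical, and the command it builds uses only the nsys profile -o form shown above.

```python
import shutil
import sys
from pathlib import Path

# Install location used in the PATH fix above; other systems may differ.
DEFAULT_CUDA_BIN = Path("/usr/local/cuda/bin")


def find_nsys(extra_dirs=(DEFAULT_CUDA_BIN,)):
    """Return the path to nsys, searching PATH first, then extra_dirs."""
    found = shutil.which("nsys")
    if found:
        return found
    for directory in extra_dirs:
        candidate = shutil.which("nsys", path=str(directory))
        if candidate:
            return candidate
    return None


def build_profile_command(nsys_path, output_name, script, script_args=()):
    """Build the argument list for: nsys profile -o <output> python <script>."""
    return [nsys_path, "profile", "-o", output_name,
            sys.executable, script, *script_args]
```

If find_nsys() returns a path, build_profile_command gives an argument list that can be passed to subprocess.run(). If it returns None, the PATH needs fixing.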