python/2_CoreConcepts/cudaGraphs/README.md
This sample demonstrates how to capture a multi-stage kernel pipeline as a
CUDA graph with cuda.core and replay it with a single driver call.

The sample runs a three-stage elementwise pipeline, r3 = (a + b) * c - a, in two modes:

- Individual launches: one launch(stream, ...) per stage, repeated for every iteration of the pipeline.
- Graph replay: the three launches are captured into a Graph once and replayed with graph.launch(stream) on each iteration.

Both paths are timed over N iterations, and their results are verified against a reference computation. The sample also re-launches the graph after mutating the input buffers to show that the graph captures pointers (not data), so the same graph can process new inputs without rebuilding.
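
The capture-once, replay-many flow can be sketched as a pair of small helpers. This is a minimal sketch, not the code in cudaGraphs.py: capture_graph, replay, and the record_launches callback are hypothetical names, and the stream, builder, and graph objects are assumed to be cuda.core objects exposing the methods named in this README. Stream.sync() is assumed to block until queued work has finished.

```python
def capture_graph(stream, record_launches):
    """Record launches into a graph once and return the executable graph.

    `record_launches(builder)` is a hypothetical callback. In the sample's
    setting it would call cuda.core's launch(builder, config, kernel, ...)
    once per pipeline stage.
    """
    builder = stream.create_graph_builder()
    builder.begin_building()      # start recording
    record_launches(builder)      # launches are recorded here, not executed
    builder.end_building()        # stop recording
    graph = builder.complete()    # produce an executable graph
    graph.upload(stream)          # upload the graph structure to the device
    return graph


def replay(graph, stream, iters):
    """Replay the whole pipeline with one graph launch per iteration."""
    for _ in range(iters):
        graph.launch(stream)
    stream.sync()  # assumption: waits for all queued replays to finish
```

Keeping the recording step behind a callback means the same helper works for any number of stages; only the callback knows which kernels and buffers are involved.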

The sample covers:

- Creating a GraphBuilder from a stream with stream.create_graph_builder()
- Recording launches between begin_building() and end_building()
- Building the executable graph with builder.complete() and uploading it to a stream
- Replaying the graph with graph.launch(stream)

Libraries used:

- cuda.core - Pythonic access to CUDA runtime, programs, and graphs
- cupy - input buffers and result verification
- numpy - scalar kernel arguments

APIs used from cuda.core:

- Stream.create_graph_builder() - obtain a GraphBuilder
- GraphBuilder.begin_building() / end_building() - begin and finish recording launches issued against the builder
- GraphBuilder.complete() - produce an executable Graph
- Graph.upload(stream) - upload the graph structure to the device
- Graph.launch(stream) - replay the entire graph
- launch(graph_builder, config, kernel, ...) - record a kernel launch into the graph being built

APIs used from cuda_samples_utils:

- print_gpu_info() - print device name and compute capability

Requirements:

- (cuda-python 13.x)
- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)

Install the required packages from requirements.txt:
cd /path/to/cuda-samples/python/2_CoreConcepts/cudaGraphs
pip install -r requirements.txt
The requirements.txt installs:
- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)

cd cuda-samples/python/2_CoreConcepts/cudaGraphs
python cudaGraphs.py
# Larger vectors and more iterations
python cudaGraphs.py --elements 4096 --iters 2000
# Use a specific GPU
python cudaGraphs.py --device 1
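
The three flags used above could be parsed along these lines. This is a hedged sketch, not the parser in cudaGraphs.py: the function name parse_args, the help strings, and the default values are all assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="CUDA graph capture and replay sample"
    )
    # Default values below are assumptions, not taken from cudaGraphs.py.
    parser.add_argument("--elements", type=int, default=1024,
                        help="number of elements per vector")
    parser.add_argument("--iters", type=int, default=1000,
                        help="number of pipeline iterations to time")
    parser.add_argument("--device", type=int, default=0,
                        help="ordinal of the GPU to run on")
    return parser.parse_args(argv)


args = parse_args(["--elements", "4096", "--iters", "2000"])
print(args.elements, args.iters, args.device)
```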
Short vectors exaggerate the launch-overhead savings; larger vectors will show the two approaches converging because per-launch overhead becomes negligible next to kernel runtime.
Speedup numbers vary with GPU and host CPU.
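
A back-of-the-envelope model illustrates why the two approaches converge. It assumes each iteration costs a fixed launch overhead plus the kernel runtime, which ignores any overlap between host-side launch work and GPU execution. The function name and the overhead figures are illustrative assumptions, not measurements from the sample.

```python
def modeled_speedup(kernel_us, individual_overhead_us, graph_overhead_us):
    """Ratio of modeled per-iteration time: individual launches vs. graph replay."""
    individual = individual_overhead_us + kernel_us
    graph = graph_overhead_us + kernel_us
    return individual / graph


# Illustrative overheads only: as kernel runtime grows, the ratio tends to 1.
for kernel_us in (1.0, 10.0, 100.0, 1000.0):
    s = modeled_speedup(kernel_us, individual_overhead_us=7.5, graph_overhead_us=2.5)
    print(f"kernel {kernel_us:7.1f} us -> modeled speedup {s:.2f}x")
```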
Device: <Your GPU Name>
Compute Capability: <X.Y>
Individual launches: 1000 iters in 0.0085s (8.49 us/iter)
Building CUDA graph...
Graph replay: 1000 iters in 0.0034s (3.41 us/iter)
Graph speedup: 2.49x
Graph replay on updated data verified (same graph, new buffer contents)
Done
Note: Device name, compute capability, and speedup will vary based on your GPU and host CPU.
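
The "updated data" line in the output reflects that a graph records which buffers its kernels touch, not the values those buffers held at capture time. The following CPU-only toy is an analogy rather than cuda.core code: ToyGraph and build_pipeline are hypothetical names, and Python lists stand in for device buffers. Mutating a list in place changes what the next replay computes, with no re-recording.

```python
class ToyGraph:
    """Hypothetical stand-in for a captured graph: records steps, then replays them."""

    def __init__(self):
        self._steps = []

    def record(self, step):
        self._steps.append(step)

    def launch(self):
        for step in self._steps:
            step()


def build_pipeline(a, b, c, r1, r2, r3):
    """Record r3 = (a + b) * c - a as three elementwise stages over shared lists."""
    n = len(a)

    def add():
        for i in range(n):
            r1[i] = a[i] + b[i]

    def mul():
        for i in range(n):
            r2[i] = r1[i] * c[i]

    def sub():
        for i in range(n):
            r3[i] = r2[i] - a[i]

    graph = ToyGraph()
    for stage in (add, mul, sub):
        graph.record(stage)
    return graph


a, b, c = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
r1, r2, r3 = [0.0] * 2, [0.0] * 2, [0.0] * 2

graph = build_pipeline(a, b, c, r1, r2, r3)  # recorded once
graph.launch()
first = list(r3)

a[:] = [10.0, 20.0]  # mutate in place: same list object, new contents
graph.launch()       # same graph, no rebuild
second = list(r3)

print(first, second)
```

Rebinding the name instead (a = [10.0, 20.0]) would leave the recorded stages pointing at the old list, which is the toy equivalent of allocating a new device buffer that the graph knows nothing about.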

Files:

- cudaGraphs.py - Python implementation using cuda.core CUDA graphs
- README.md - This file
- requirements.txt - Sample dependencies
- ../../Utilities/cuda_samples_utils.py - Common utilities (imported by this sample)

See also:

- cuda.core graphs API
- cuda.core example: cuda_graphs.py