python/2_CoreConcepts/jitLtoLinking/README.md
This sample demonstrates how to build a kernel out of two independently
compiled translation units and link them at runtime with
cuda.core.Linker. This is the pattern a library would use to accept
user-supplied device code as a plug-in without recompiling its own
kernels from scratch.

The sample runs the same program in two linking modes:

- PTX linking (no LTO): each module is compiled with
  ProgramOptions(relocatable_device_code=True) down to PTX, and the
  Linker emits a final cubin. The two modules stay independently
  compiled (no cross-module inlining).
- LTO linking (link-time optimization): each module is compiled with
  ProgramOptions(link_time_optimization=True) down to LTO IR, and the
  Linker is configured with LinkerOptions(link_time_optimization=True)
  so the optimizer runs again across both modules, typically matching
  the code generation of a single-source build.

The "main" kernel apply_transform calls a user_transform device
function that lives in a separate source string, and the results of both
linking modes are verified against a NumPy reference.
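
The two modes described above can be sketched roughly as follows. This is a minimal illustration, not the sample's actual code: the kernel bodies, the 2*x + 1 transform, and the helper names mode_settings and build_kernel are hypothetical, and the import path assumes the classes are exposed from the cuda.core namespace as named in this README.

```python
# Hypothetical device sources; the real sample's kernels may differ.
MAIN_SRC = r"""
// Declared here, defined in the separately compiled user module.
extern "C" __device__ float user_transform(float x);

extern "C" __global__ void apply_transform(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = user_transform(in[i]);
    }
}
"""

USER_SRC = r"""
extern "C" __device__ float user_transform(float x)
{
    return 2.0f * x + 1.0f;  // illustrative transform only
}
"""


def mode_settings(use_lto):
    """Return (ProgramOptions kwargs, compile target, LinkerOptions kwargs)."""
    if use_lto:
        return {"link_time_optimization": True}, "ltoir", {"link_time_optimization": True}
    return {"relocatable_device_code": True}, "ptx", {}


def build_kernel(use_lto):
    """Compile both sources separately, link them, and return apply_transform."""
    # Imported lazily so the source strings can be inspected without a GPU stack.
    from cuda.core import Device, Linker, LinkerOptions, Program, ProgramOptions

    dev = Device()
    dev.set_current()
    # Build an architecture string such as "sm_90" from the compute capability.
    arch = "sm_" + "".join(str(part) for part in dev.compute_capability)

    prog_kwargs, target, link_kwargs = mode_settings(use_lto)
    # Each source string becomes its own translation unit.
    objects = [
        Program(src, "c++", options=ProgramOptions(arch=arch, **prog_kwargs)).compile(target)
        for src in (MAIN_SRC, USER_SRC)
    ]
    # The linker resolves user_transform across the two modules.
    linker = Linker(*objects, options=LinkerOptions(arch=arch, **link_kwargs))
    return linker.link("cubin").get_kernel("apply_transform")
```

Calling build_kernel(False) follows the PTX path and build_kernel(True) follows the LTO path. Keeping the per-mode differences in one small function makes it clear that only the compile options, the intermediate format, and the linker options change between the two.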

This sample covers:

- Compiling Program objects into PTX or LTO IR
- Linking the compiled modules with Linker
- The relocatable_device_code and link_time_optimization options

Key libraries:

- cuda.core - Pythonic access to CUDA runtime, programs, and the JIT linker
- cupy - input and output buffers on the GPU
- numpy - reference computation on the host

Key APIs from cuda.core:

- ProgramOptions(relocatable_device_code=True) + Program.compile("ptx") - produce relocatable PTX
- ProgramOptions(link_time_optimization=True) + Program.compile("ltoir") - produce LTO IR
- Linker(*object_codes, options=LinkerOptions(...)) - create a JIT linker over multiple object codes
- LinkerOptions(link_time_optimization=True) - opt into LTO during linking
- Linker.link("cubin") - produce a loadable module
- ObjectCode.get_kernel(name) - fetch a kernel from the linked module

Key APIs from cuda_samples_utils:

- print_gpu_info() - print device name and compute capability

Requirements (cuda-python 13.x):

- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)

Install the required packages from requirements.txt:
cd /path/to/cuda-samples/python/2_CoreConcepts/jitLtoLinking
pip install -r requirements.txt
The requirements.txt installs:
- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)

To run the sample:

cd cuda-samples/python/2_CoreConcepts/jitLtoLinking
python jitLtoLinking.py
# Larger element count
python jitLtoLinking.py --elements 1048576
# Use a specific GPU
python jitLtoLinking.py --device 1
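
The two options used in the commands above could be parsed along these lines. Only the --elements and --device flags come from this README; the default values, help strings, and the parse_args name are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options shown in the run examples."""
    parser = argparse.ArgumentParser(description="JIT LTO linking sample (sketch)")
    # The default of 4096 is an assumption; the real sample may use another value.
    parser.add_argument("--elements", type=int, default=4096,
                        help="number of array elements to process")
    # Device ordinal 0 is assumed as the default GPU.
    parser.add_argument("--device", type=int, default=0,
                        help="ordinal of the GPU to run on")
    args = parser.parse_args(argv)
    if args.elements <= 0:
        parser.error("--elements must be a positive integer")
    return args
```

Passing argv explicitly, for example parse_args(["--elements", "1048576"]), keeps the parser easy to exercise without touching sys.argv.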

Expected output:

Device: <Your GPU Name>
Compute Capability: <X.Y>
[1] PTX linking (no LTO)
[ptx] result verified against NumPy reference
[2] LTO linking (link-time optimization)
[lto] result verified against NumPy reference
Both PTX and LTO linked kernels produced matching results. Done
Note: Device name and compute capability will vary based on your GPU.
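
The "verified against NumPy reference" lines in the output correspond to a host-side check that might look like the sketch below. The 2*x + 1 reference is a stand-in, since the real user_transform is not shown here, and the function names and tolerances are hypothetical.

```python
import numpy as np


def reference_transform(x):
    """Host-side reference; 2*x + 1 is a stand-in for the real user_transform."""
    return 2.0 * x + 1.0


def verify(label, gpu_result, host_input, rtol=1e-5, atol=1e-6):
    """Compare a device result (already copied to the host) with the reference."""
    expected = reference_transform(host_input)
    ok = bool(np.allclose(gpu_result, expected, rtol=rtol, atol=atol))
    status = "result verified against NumPy reference" if ok else "MISMATCH"
    print(f"[{label}] {status}")
    return ok
```

With CuPy buffers, the device output would first be copied back to the host, for example with cupy.asnumpy(out), before being passed to verify. Running the same check for both the PTX and the LTO kernels confirms that the linking mode changes code generation but not the results.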

Files in this sample:

- jitLtoLinking.py - Python implementation using cuda.core.Linker
- README.md - This file
- requirements.txt - Sample dependencies
- ../../Utilities/cuda_samples_utils.py - Common utilities (imported by this sample)

See also:

- cuda.core compilation API
- cuda.core example: jit_lto_fractal.py