This sample demonstrates how to use Tensor Memory Accelerator (TMA)
descriptors with cuda.core on Hopper and later GPUs (compute
capability >= 9.0). TMA enables efficient bulk data movement between
global and shared memory using hardware-managed tensor map
descriptors, which are a key building block for modern GEMM kernels
and large shared-memory tile loads.
The sample:
- Builds a TMA descriptor from a device tensor via
  `StridedMemoryView.from_any_interface(...).as_tensor_map(...)`.
- Passes the descriptor as a kernel parameter (`__grid_constant__`) to a
  kernel that uses libcudacxx TMA/barrier wrappers to bulk-load a
  tile into shared memory, then copies it out to verify correctness.
- Retargets the existing descriptor at a new tensor with
  `replace_address()` to avoid rebuilding it.

Key concepts:

- Creating TMA descriptors with `StridedMemoryView.as_tensor_map(box_dim=...)`
- Passing descriptors to kernels as `__grid_constant__` parameters
- Using the libcudacxx barrier header (`cuda/barrier`) to coordinate TMA loads with a
  block-scoped barrier
- Retargeting a descriptor with `tensor_map.replace_address(new_tensor)`
- Using `cuda.pathfinder` to locate the CUDA toolkit include directory
  with the CCCL headers and libcudacxx

Python packages used:

- `cuda.core` - compilation, launching, and tensor-map helpers
- `cuda.pathfinder` - locate the CUDA toolkit include directory
- `cupy` - allocate and fill device tensors
- `numpy` - scalar kernel arguments

APIs used:

`cuda.core`:

- `StridedMemoryView.from_any_interface(tensor, stream_ptr=-1)` - build a typed view from any DLPack/CUDA-array-interface tensor
- `StridedMemoryView.as_tensor_map(box_dim=(...))` - produce a TMA descriptor for the given tile shape
- `tensor_map.replace_address(new_tensor)` - retarget an existing descriptor at a new tensor
- `Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch="sm_90", include_path=[...]))` - compile a C++ kernel against libcudacxx
- `program.compile("cubin")` - produce a CUBIN so `__grid_constant__` and TMA intrinsics are fully supported
- `launch(stream, config, kernel, tensor_map, ...)` - pass the TMA descriptor as a kernel argument

`cuda.pathfinder`:

- `get_cuda_path_or_home()` - return the detected CUDA toolkit root for locating `include/cccl`

`cuda_samples_utils`:

- `print_gpu_info()` - print device name and compute capability

Dependencies:

- `cuda-python` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)

Install the required packages from requirements.txt:

    cd /path/to/cuda-samples/python/2_CoreConcepts/tmaTensorMap
    pip install -r requirements.txt
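
After installing, it can help to confirm that the CCCL headers are reachable, since the kernel is compiled against libcudacxx. The following is a minimal sketch, not the sample's actual code: `cccl_include_dirs` and `has_barrier_header` are hypothetical helpers. In the sample the toolkit root would come from `cuda.pathfinder.get_cuda_path_or_home()`; here it is a plain argument so the snippet runs without a toolkit, and the `include` / `include/cccl` layout is an assumption that may not hold for every toolkit version.

```python
from pathlib import Path


def cccl_include_dirs(cuda_root):
    """Return the include directories that exist under a CUDA toolkit root.

    Hypothetical helper: assumes headers live in <root>/include and
    <root>/include/cccl, which may differ between toolkit versions.
    """
    root = Path(cuda_root)
    candidates = [root / "include", root / "include" / "cccl"]
    # Keep only directories that are actually present on disk.
    return [str(p) for p in candidates if p.is_dir()]


def has_barrier_header(include_dirs):
    """Check whether any include directory provides the cuda/barrier header."""
    return any((Path(d) / "cuda" / "barrier").is_file() for d in include_dirs)
```

The returned list is the kind of value that could be passed as `include_path` in `ProgramOptions`. An empty list, or a missing `cuda/barrier` header, suggests the toolkit root is wrong or incomplete.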
The requirements.txt installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)

Run the sample:

    cd cuda-samples/python/2_CoreConcepts/tmaTensorMap
    python tmaTensorMap.py

    # Larger tensor (must be a multiple of the 128-element tile)
    python tmaTensorMap.py --elements 8192

    # Use a specific GPU
    python tmaTensorMap.py --device 1
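
The tile-size constraint on `--elements` can be checked before any GPU work starts. A minimal sketch, assuming an `argparse`-based command line: the `--elements` and `--device` flags and the 128-element tile come from the commands above, while the defaults (1024 elements, device 0) and the `parse_args` helper are illustrative assumptions rather than the sample's actual code.

```python
import argparse

TILE_ELEMENTS = 128  # tile size stated in the usage comment above


def parse_args(argv=None):
    """Parse the command line and reject sizes that do not fill whole tiles."""
    parser = argparse.ArgumentParser(description="TMA tensor map sample (sketch)")
    parser.add_argument("--elements", type=int, default=1024,  # assumed default
                        help="tensor length; must be a multiple of the tile size")
    parser.add_argument("--device", type=int, default=0,       # assumed default
                        help="index of the GPU to use")
    args = parser.parse_args(argv)
    if args.elements <= 0 or args.elements % TILE_ELEMENTS != 0:
        # parser.error prints the usage message and exits with status 2.
        parser.error(f"--elements must be a positive multiple of {TILE_ELEMENTS}")
    return args
```

Rejecting a bad size up front keeps the failure on the host, where the message is readable, instead of surfacing later as a partially filled tile.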
On a Hopper (sm_90) GPU:

    Device: NVIDIA H100 PCIe
    Compute Capability: 9.0
    TMA copy verified: 1024 elements across 8 tiles
    replace_address verified: descriptor reused with new source tensor
Note: Device name and compute capability will vary based on your GPU.
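
The two checks reported in the output can be modeled on the CPU. This is a sketch only: it uses plain Python lists, involves no TMA hardware or `cuda.core` calls, and the `HostTensorMap` class with its `load_tile` method is a hypothetical stand-in for the real descriptor. Each 128-element tile is staged through a small buffer (standing in for shared memory) and copied out, and the same descriptor object is then pointed at a second tensor.

```python
TILE_ELEMENTS = 128  # tile size used by the sample


class HostTensorMap:
    """Hypothetical CPU stand-in for a 1-D TMA descriptor."""

    def __init__(self, tensor, box_dim):
        if len(tensor) % box_dim != 0:
            raise ValueError("tensor length must be a multiple of box_dim")
        self.tensor = tensor
        self.box_dim = box_dim

    def replace_address(self, new_tensor):
        # Model assumption: only the base "address" changes, so the new
        # tensor must have the same length as the original.
        if len(new_tensor) != len(self.tensor):
            raise ValueError("replacement tensor must match the original length")
        self.tensor = new_tensor

    def load_tile(self, tile_index):
        # Models one bulk load of a whole tile into a staging buffer.
        start = tile_index * self.box_dim
        return list(self.tensor[start:start + self.box_dim])


def tiled_copy(tensor_map):
    """Copy every tile through a staging buffer; return (output, tile count)."""
    num_tiles = len(tensor_map.tensor) // tensor_map.box_dim
    out = []
    for tile in range(num_tiles):            # one "thread block" per tile
        staged = tensor_map.load_tile(tile)  # global -> "shared"
        out.extend(staged)                   # "shared" -> output
    return out, num_tiles


src_a = [float(i) for i in range(1024)]
desc = HostTensorMap(src_a, box_dim=TILE_ELEMENTS)
out_a, tiles = tiled_copy(desc)
assert out_a == src_a

src_b = [float(2 * i) for i in range(1024)]
desc.replace_address(src_b)  # reuse the descriptor instead of rebuilding it
out_b, _ = tiled_copy(desc)
assert out_b == src_b
```

With 1024 elements and a 128-element tile, the model processes 1024 / 128 = 8 tiles, which matches the tile count in the sample output.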
Files:

- `tmaTensorMap.py` - Python implementation using cuda.core TMA APIs
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)