torch/_inductor/codegen/cutedsl/README.md
## Writing a CuteDSL Template
```python
from torch._inductor.codegen.cutedsl import CuteDSLTemplate

template_source = """
@cute.kernel
def {{kernel_name}}_kernel(A, B, C):
    # Your CUTLASS kernel logic here
    pass

{{def_kernel("A", "B", "C")}}
    # Call the kernel
    {{kernel_name}}_kernel(A, B, C)
    return C
"""

my_template = CuteDSLTemplate(
    name="my_gemm",
    source=template_source,
)
```
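To see what a rendered template looks like, the hooks can be approximated with plain Jinja2. This is a sketch only: the real `def_kernel` hook is provided by Inductor and also manages argument handling, so it is stubbed here to just emit a hypothetical `def` line.

```python
from jinja2 import Template

template_source = """
@cute.kernel
def {{kernel_name}}_kernel(A, B, C):
    pass

{{def_kernel("A", "B", "C")}}
    {{kernel_name}}_kernel(A, B, C)
    return C
"""

# Stub standing in for Inductor's real def_kernel hook (illustrative only;
# the function name "my_gemm_main" is made up for this sketch).
def def_kernel(*args):
    return f"def my_gemm_main({', '.join(args)}):"

rendered = Template(template_source).render(
    kernel_name="my_gemm", def_kernel=def_kernel
)
print(rendered)
```

Rendering replaces `{{kernel_name}}` with `my_gemm` and expands the `{{def_kernel(...)}}` call into the entry-point signature.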
## Compilation

CuteDSL requires source files for compilation; it cannot compile from strings directly. The process:

1. `async_compile.cutedsl()` uses `PyCodeCache.write()` to write the generated source to a temporary `.py` file.
2. The file is compiled and loaded, and the resulting kernel is exposed through a `run()` interface.

Debug tip: use `TORCH_LOGS="kernel_code"` to see the generated kernel source and file path during compilation.
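The write-to-file-then-load flow above can be sketched with stdlib pieces. The file and module names here are hypothetical; the real implementation lives in `PyCodeCache` and `async_compile`.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for the rendered kernel source; it must exist on disk
# before it can be loaded, mirroring the CuteDSL constraint.
source = "def run(a, b):\n    return a + b\n"

# 1. Write the source to a real .py file (what PyCodeCache.write() does).
path = Path(tempfile.mkdtemp()) / "generated_kernel.py"
path.write_text(source)

# 2. Import the file as a module and grab its run() entry point.
spec = importlib.util.spec_from_file_location("generated_kernel", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

print(module.run(2, 3))  # → 5
```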
## Template Hooks

Templates use Jinja2 syntax with these available hooks:

- `{{kernel_name}}` - Unique kernel identifier
- `{{def_kernel(args...)}}` - Generates the kernel function signature and argument handling
- `{{input_nodes}}` - List of input buffers
- `{{output_node}}` - Output buffer
- `{{gen_defines()}}` - Generates autotunable parameter definitions with proper CuteDSL typing

## Autotunable Parameters

CuteDSL templates support autotunable parameters, similar to Triton's `tl.constexpr` system:
```python
template_source = r"""
{{gen_defines()}}

@cute.kernel
def {{kernel_name}}_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    threads_per_block = THREADS_PER_BLOCK  # Uses autotuned value
    block_size = BLOCK_SIZE
    # ... kernel implementation
"""

# Pass parameters when generating template choices
template.maybe_append_choice(
    choices,
    input_nodes=[a, b],
    layout=layout,
    THREADS_PER_BLOCK=256,  # Generates: THREADS_PER_BLOCK: cutlass.Constexpr = 256
    BLOCK_SIZE=128,         # Generates: BLOCK_SIZE: cutlass.Constexpr = 128
    SCALE_FACTOR=1.5,       # Generates: SCALE_FACTOR: cutlass.Constexpr = 1.5
)
```
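Based on the comments above, each keyword argument expands to a `NAME: cutlass.Constexpr = value` line. A rough sketch of what `{{gen_defines()}}` produces for that call (illustrative only, not Inductor's actual implementation):

```python
def gen_defines(**params):
    # Emit one CuteDSL constexpr definition per autotunable parameter,
    # mirroring the "NAME: cutlass.Constexpr = value" form shown above.
    return "\n".join(
        f"{name}: cutlass.Constexpr = {value!r}" for name, value in params.items()
    )

defines = gen_defines(THREADS_PER_BLOCK=256, BLOCK_SIZE=128, SCALE_FACTOR=1.5)
print(defines)
```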
Templates must:

- Define a `@cute.kernel` decorated function
- Use `{{def_kernel()}}` to create the entry point
- Use `{{gen_defines()}}` for autotunable parameters

See `test_cutedsl_template.py` for complete examples.
Fusion is not supported: `can_fuse_vertical` and `can_fuse_horizontal` both return `False`.

Note: requires the CUTLASS Python package (`pip install nvidia-cutlass`).