media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst
.. _dsl_introduction: .. |DC| replace:: dynamic compilation .. |IR| replace:: IR .. |DSL| replace:: CuTe DSL
|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of high-performance GPU kernels. It evolved from the C++ CUTLASS library and is now available as a decorator-based DSL.
Its primary goals are:
DLPack <https://github.com/dmlc/dlpack>_ integration, enabling seamless
interop with frameworks (e.g., PyTorch, JAX).|DSL| provides two main Python decorators for generating optimized code via |DC|:
@jit — Host-side JIT-compiled functions@kernel — GPU kernel functionsBoth decorators can optionally use a preprocessor that automatically expands Python control flow (loops, conditionals) into operations consumable by the underlying |IR|.
@jit
Declares JIT-compiled functions that can be invoked from Python or from other |DSL| functions.
**Decorator Parameters**:
* ``preprocessor``:
* ``True`` (default) — Automatically translate Python flow control (e.g., loops, if-statements) into |IR| operations.
* ``False`` — No automatic expansion; Python flow control must be handled manually or avoided.
**Call-site Parameters**:
- ``no_cache``:
- ``True`` — Disables JIT caching, forcing a fresh compilation each call.
- ``False`` (default) — Enables caching for faster subsequent calls.
``@kernel``
Defines GPU kernel functions, compiled as specialized GPU symbols through |DC|.
Decorator Parameters:
preprocessor:
True (default) — Automatically expands Python loops/ifs into GPU-compatible |IR| operations.False — Expects manual or simplified kernel implementations.Kernel Launch Parameters:
grid
Specifies the grid size as a list of integers.
block
Specifies the block size as a list of integers.
cluster
Specifies the cluster size as a list of integers.
smem
Specifies the size of shared memory in bytes (integer).
None (default) — Automatically calculates the kernel's shared memory usage via utils.SmemAllocator. Recommended unless manual control is required.int — Manually specifies the size of shared memory in bytes.Additional Kernel Launch Parameters:
fallback_cluster
Specifies the minimum-guaranteed cluster size. When set, cluster becomes the preferred size, enabling graceful degradation when hardware cannot satisfy the preferred dimensions.
None (default) — No fallback; cluster is used directly.list[int] — Three-element list [x, y, z].max_number_threads
Specifies the maximum thread count per block (maxntid).
[0, 0, 0] (default) — Auto-generate reqntid from block.list[int] — Three-element list [x, y, z].min_blocks_per_mp
Specifies the minimum blocks per multiprocessor (minctasm).
0 (default) — No minimum occupancy hint.int — Minimum number of blocks per multiprocessor.use_pdl
Enables Programmatic Dependent Launch (PDL) to overlap dependent kernel launches in the same stream.
False (default) — PDL disabled.True — PDL enabled.cooperative
Enables cooperative kernel launch; all thread blocks launch cooperatively with grid-wide synchronization support.
False (default) — Standard kernel launch.True — Cooperative kernel launch.smem_merge_branch_allocs
Enables mutually exclusive control flow branches (sequentially executed if-else) to reuse the same shared memory.
False (default) — Shared memory is allocated additively across all branches (default CUDA C++ behavior).True — Merge shared-memory allocations across branches (experimental feature, recommended for mega-kernels)... list-table:: :header-rows: 1 :widths: 20 20 15 25
@jit@kernel@jit@jit@jit@jit@kernel@kernel@jit@kernel@kernel@kernel