`4.3.0 <https://github.com/NVIDIA/cutlass/releases/tree/main>`_ (2025-10-20)
----------------------------------------------------------------------------

- Debuggability improvements
- Removed deprecated ``cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` and ``cutlass.utils.ampere_helpers``
- Supported calling nested functions without capturing variables inside dynamic control flow
- Replaced usage of ``cute.arch.barrier`` in examples with the corresponding APIs in ``pipeline``:

  - ``pipeline.sync`` for simple cases like synchronizing the whole CTA
  - ``pipeline.NamedBarrier`` to customize barriers with different participating threads and barrier id

- Added new APIs ``repeat`` and ``repeat_as_tuple``
- Added new API ``make_rmem_tensor`` to create a tensor in register memory (replaces ``make_fragment`` with better naming)
- Added new API ``make_rmem_tensor_like``, which creates an rmem tensor from an existing tensor using the same shape with compact col-major strides
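A minimal sketch of the renamed register-memory APIs, assuming their signatures mirror the ``make_fragment`` calls they replace (the layout argument and the ``autovec_copy`` staging shown here are illustrative, not confirmed signatures):

```python
import cutlass
import cutlass.cute as cute

@cute.jit
def stage_through_registers(src: cute.Tensor):
    # Hypothetical sketch: allocate an rmem tensor matching `src`'s shape,
    # with compact col-major strides, then stage data through registers.
    frg = cute.make_rmem_tensor_like(src)
    cute.autovec_copy(src, frg)
```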
- Added ``TmemAllocator`` for allocating tensor memory
- Updated ``SmemAllocator.allocate`` to support allocation of a single scalar value
- Fixed ``TensorSSA.reduce`` to support a static value as the initial value
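A hedged sketch of the fixed reduction, assuming a ``reduce(op, init_val, reduction_profile)`` argument order; the ``ReductionOp`` enum name and the profile argument are assumptions based on typical CuTe DSL usage, not confirmed signatures:

```python
import cutlass
import cutlass.cute as cute

@cute.jit
def sum_all(t: cute.Tensor):
    val = t.load()                                # TensorSSA value
    # Static (compile-time) initial value 0.0 for the reduction
    s = val.reduce(cute.ReductionOp.ADD, 0.0, 0)
    cute.printf("{}", s)
```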
- Updated docstrings for the following APIs to be more concise and easier to understand:

  - ``make_layout_tv``
  - ``is_static``
  - ``PipelineAsync``
  - ``SmemAllocator``

- Fixed documentation for ``pipeline``, ``utils`` and ``cute.math`` (``cute.math`` is part of the top-level documentation)
`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-10)
-----------------------------------------------------------------------------

- Added back ``cute.make_tiled_copy`` per request from the community
- Added support for explicit and implicit broadcast in ``TensorSSA``:

  - ``cutlass.cute.TensorSSA``: support ``broadcast_to`` and implicit broadcasting for binary operations

- Supported printing a ``TensorSSA`` value in ``cutlass.cute.print_tensor``
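The broadcasting behavior can be sketched as follows; the shapes in the comments are illustrative assumptions, and only ``broadcast_to`` and ``print_tensor`` are named by the changelog itself:

```python
import cutlass
import cutlass.cute as cute

@cute.jit
def broadcast_demo(a: cute.Tensor, b: cute.Tensor):
    x = a.load()                      # e.g. shape (4, 1)
    y = b.load()                      # e.g. shape (4, 8)
    z = x.broadcast_to(y.shape) + y   # explicit broadcast
    w = x + y                         # implicit broadcast in a binary op
    cute.print_tensor(w)              # TensorSSA values are now printable
```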
- Updated ``cute.gemm`` to support all dispatch patterns and improved checks for illegal inputs
- Introduced automatic kernel smem usage calculation for the launch config
- Introduced per-op fast-math control for math ops (e.g. ``exp``, ``exp2``, ``log2``, ``log``)
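A hedged sketch of per-op fast-math control; the ``fastmath`` keyword name is an assumption inferred from the changelog entry, not a confirmed signature:

```python
import cutlass
import cutlass.cute as cute

@cute.jit
def fastmath_demo(x: cute.Tensor):
    v = x.load()
    # Fast approximate exponential for this op only (assumed keyword)
    e = cute.math.exp(v, fastmath=True)
    # Precise log2 for this op, independent of the setting above
    l = cute.math.log2(v, fastmath=False)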
- Introduced ``CopyReduceBulkTensorTileS2GOp`` in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_ to support TMA reduce
`4.1.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.1.0>`_ (2025-07-16)
-----------------------------------------------------------------------------

- ``for`` loop:

  - ``range`` now always generates code and executes at runtime
  - ``cutlass.range`` is an advanced range with kernel-code-level unrolling and pipelining control

    - ``cutlass.range_dynamic`` is deprecated; please replace it with ``range`` or ``cutlass.range``
    - Added pipelining control for compiler-generated software-pipeline code

- ``while``/``if``:

  - ``while``/``if`` now by default generate code and execute at runtime unless ``cutlass.const_expr`` is specified for the predicate

    - ``cutlass.dynamic_expr`` is deprecated; please remove it

- Renamed mbarrier functions to reduce ambiguity
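The control-flow semantics above can be sketched as follows; the ``unroll`` keyword name is an assumption suggested by the unrolling-control entry, while the ``const_expr`` predicate pattern follows the changelog wording:

```python
import cutlass
import cutlass.cute as cute

@cute.jit
def control_flow_demo(n: cutlass.Constexpr[int]):
    # `range` always generates code that runs at runtime
    for i in range(n):
        ...
    # cutlass.range exposes unrolling/pipelining control
    # (the `unroll` keyword name is an assumption)
    for i in cutlass.range(n, unroll=2):
        ...
    # With cutlass.const_expr the predicate is evaluated at compile time,
    # so this branch is resolved during tracing rather than at runtime
    if cutlass.const_expr(n > 4):
        ...
```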
- Modified the ``SyncObject`` API (``MbarrierArray``, ``NamedBarrier``, ``TmaStoreFence``) to match ``std::barrier``
- Changed pipeline create functions to take only keyword arguments, and made ``barrier_storage`` optional
- Introduced the ``cutlass.cute.arch.get_dyn_smem_size`` API to get the runtime dynamic shared memory size
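A minimal sketch of the new query, assuming it takes no arguments and is called from device code; the allocator use in the comment is an illustrative scenario:

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def kernel():
    # Query how much dynamic shared memory this kernel was launched with,
    # e.g. to bounds-check allocations over the raw smem region.
    smem_bytes = cute.arch.get_dyn_smem_size()
```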
- Various API support for SM100 BlockScaled GEMM:

  - Added BlockScaled MMA support in `tcgen05/mma.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py>`_, and provided a ``make_blockscaled_trivial_tiled_mma`` function in `blackwell_helpers.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blackwell_helpers.py>`_ to help construct a BlockScaled ``TiledMma``
  - Added corresponding copy support in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_
  - Added `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory

- ``cutlass.cute.compile`` now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.
- ``cutlass.cute.testing.assert_`` now works for device JIT functions. Specify ``--enable-assertions`` as a compilation option to enable it.
- ``cutlass.cute.make_tiled_copy`` is now deprecated. Please use ``cutlass.cute.make_tiled_copy_tv`` instead.
- Shared memory capacity query:

  - Added ``cutlass.utils.get_smem_capacity_in_bytes`` for querying the shared memory capacity
  - ``<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` is now deprecated

`4.0.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0>`_ (2025-06-03)
-----------------------------------------------------------------------------

- ``cute.runtime.Pointer``: changed ``element_type`` to ``dtype`` to match ``typing.Pointer``
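A hedged sketch of migrating from the deprecated capacity table to the new query; the string-form architecture argument (``"sm_90"``) is an assumption, not a confirmed signature:

```python
import cutlass.utils as utils

# Old (deprecated): cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"]
# New: a single query function parameterized by architecture (assumed form)
capacity = utils.get_smem_capacity_in_bytes("sm_90")
print(f"shared memory capacity: {capacity} bytes")
```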