Changelog for CuTe DSL API changes


`4.3.0 <https://github.com/NVIDIA/cutlass/releases/tree/main>`_ (2025-10-20)

  • Debuggability improvements:

    • Supported source location tracking for DSL APIs
    • Supported dumping PTX and CUBIN
  • Removed deprecated cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"] and cutlass.utils.ampere_helpers

  • Supported calling nested functions without capturing variables inside dynamic control flow

  • Replaced usage of cute.arch.barrier in examples with corresponding APIs in pipeline

    • Use pipeline.sync for simple cases like synchronizing the whole CTA
    • Use pipeline.NamedBarrier to customize barriers with different participating threads and barrier id
  • Added new APIs repeat and repeat_as_tuple

  • Added new API make_rmem_tensor to create a tensor in register memory (replaces make_fragment, with a clearer name)

  • Added new API make_rmem_tensor_like, which creates a register-memory (rmem) tensor with the same shape as a given tensor and compact column-major strides

  • Added TmemAllocator for allocating tensor memory

  • Updated SmemAllocator.allocate to support allocation of a single scalar value

  • Fixed TensorSSA.reduce to support a static value as the initial value

  • Updated docstrings for the following APIs to be more concise and easier to understand:

    • make_layout_tv
    • is_static
    • PipelineAsync
    • SmemAllocator
  • Fixed documentation for pipeline, utils, and cute.math (cute.math is part of the top-level documentation)

`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-10)

  • Restored cute.make_tiled_copy at the community's request

  • Added support for explicit and implicit broadcast in TensorSSA

    • cutlass.cute.TensorSSA: support broadcast_to and implicit broadcasting for binary operations.
  • Supported printing TensorSSA value in cutlass.cute.print_tensor

  • Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs

  • Introduced automatic kernel smem usage calculation for launch config.

  • Introduced per-op fast-math control for math ops (e.g. exp, exp2, log2, log)

  • Introduced CopyReduceBulkTensorTileS2GOp in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_ to support TMA Reduce.
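
The 4.2.0 entry on explicit and implicit broadcast in TensorSSA does not spell out the shape rule. As a rough illustration only, the sketch below computes the result shape of a binary operation under NumPy-style broadcasting in plain Python. That TensorSSA follows exactly this rule is an assumption, and ``broadcast_shape`` is a hypothetical helper written for this example, not part of the DSL; check the TensorSSA documentation for the authoritative behavior.

```python
def broadcast_shape(a, b):
    """Result shape of a binary op under NumPy-style broadcasting.

    Assumption: this mirrors the rule used for implicit broadcasting;
    it is a plain-Python illustration, not a CuTe DSL API.
    """
    rank = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            out.append(x)      # extents agree, or b is stretched along this mode
        elif x == 1:
            out.append(y)      # a is stretched along this mode
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(out)


print(broadcast_shape((3, 1), (1, 4)))
print(broadcast_shape((4,), (2, 3, 4)))
```

Under this rule a mode of extent 1 stretches to match the other operand, and any other mismatch is an error.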

`4.1.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.1.0>`_ (2025-07-16)

  • for loop

    • Python built-in range now always generates code and executes at runtime
    • cutlass.range is an advanced range with kernel-code-level unrolling and pipelining control
    • Deprecated cutlass.range_dynamic; please replace it with range or cutlass.range
    • Experimental: added pipelining control for compiler-generated software pipeline code
  • while/if

    • while/if now generate code and execute at runtime by default, unless cutlass.const_expr is specified for the predicate
    • Deprecated cutlass.dynamic_expr; please remove it
  • Rename mbarrier functions to reduce ambiguity

  • Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier

  • Change pipeline create function to take only keyword arguments, and make barrier_storage optional.

  • Introduce cutlass.cute.arch.get_dyn_smem_size API to get the runtime dynamic shared memory size.

  • Various API Support for SM100 BlockScaled Gemm

    • Introduce BlockScaled MmaOps in `tcgen05/mma.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py>`_, and provide a make_blockscaled_trivial_tiled_mma function in `blackwell_helpers.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blackwell_helpers.py>`_ to help construct a BlockScaled TiledMma.
    • Introduce S2T CopyOps in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_.
    • Introduce BlockScaled layout utilities in `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory.
  • cutlass.cute.compile now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.

  • cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-assertions as a compilation option to enable it.

  • cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.

  • Shared memory capacity query

    • Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
    • <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.

`4.0.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0>`_ (2025-06-03)

  • Fixed API mismatch in class cute.runtime.Pointer: changed element_type to dtype to match typing.Pointer
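
Several entries above deprecate, remove, or supersede names. As a migration aid, the sketch below collects those names into a lookup table and scans source text for them using only the standard library. The table is transcribed from the changelog entries. The helper ``find_outdated_names`` and the sample snippet are hypothetical illustrations, and the call shapes inside the sample are plain text for the scanner to read, not verified signatures.

```python
import re

# Old name -> (release, suggested action), transcribed from the entries above.
MIGRATIONS = {
    "cutlass.range_dynamic": ("4.1.0", "deprecated; use range or cutlass.range"),
    "cutlass.dynamic_expr": ("4.1.0", "deprecated; remove it"),
    "SMEM_CAPACITY": (
        "4.1.0, removed in 4.3.0",
        "use cutlass.utils.get_smem_capacity_in_bytes",
    ),
    "cutlass.utils.ampere_helpers": ("4.3.0", "removed"),
    "make_fragment": ("4.3.0", "superseded by make_rmem_tensor"),
    "cute.arch.barrier": (
        "4.3.0",
        "examples now use pipeline.sync or pipeline.NamedBarrier",
    ),
}


def find_outdated_names(source):
    """Return (line_number, old_name, release, action) tuples in line order."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, (release, action) in MIGRATIONS.items():
            # Word boundaries keep make_fragment from matching make_fragment_like.
            if re.search(rf"\b{re.escape(name)}\b", line):
                hits.append((lineno, name, release, action))
    return hits


# Hypothetical user code, used only as text input for the scanner.
sample = """\
for i in cutlass.range_dynamic(n):
    frag = cute.make_fragment(layout, dtype)
    like = cute.make_fragment_like(src)
cap = sm100_utils.SMEM_CAPACITY["sm100"]
"""

hits = find_outdated_names(sample)
for lineno, name, release, action in hits:
    print(f"line {lineno}: {name} ({release}): {action}")
```

cute.make_tiled_copy is left out of the table on purpose: it was deprecated in 4.1.0 and restored in 4.2.0, so flagging it would be misleading.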