.. _overview:
CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development. By providing Python-based DSLs to the powerful CUTLASS C++ template library, it enables faster iteration, easier prototyping, and a gentler learning curve for high-performance linear algebra on NVIDIA GPUs.
We envision CUTLASS DSLs as a family of domain-specific languages (DSLs). The 4.0 release introduces the first of these, CuTe DSL. It is a low-level programming model that is fully consistent with CuTe C++ abstractions, exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
While CUTLASS offers exceptional performance through its C++ template abstractions, the complexity can present challenges for many developers. CUTLASS 4.x addresses this by offering Python-based DSLs that stay consistent with those abstractions.
Students can learn GPU programming concepts without the complexity of C++ templates. Researchers and performance engineers can rapidly explore algorithms, prototype, and tune kernels before moving to production implementations.
CUTLASS DSLs translate Python code into a custom intermediate representation (IR),
which is then Just-In-Time (JIT) compiled into optimized CUDA kernels using MLIR and ptxas.
TiledMma, TiledCopy). For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.
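
To make the layout concept mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not CuTe DSL code and uses no CUTLASS API; the helper name ``layout_offset`` and the variables are hypothetical. It only illustrates the general idea that a layout pairs a shape with strides and maps a logical coordinate to a linear memory offset.

```python
from itertools import product


def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset as sum(c_i * s_i).

    Hypothetical helper for illustration only; not part of CuTe DSL.
    """
    if len(coord) != len(stride):
        raise ValueError("coord and stride must have the same rank")
    return sum(c * s for c, s in zip(coord, stride))


shape = (2, 3)        # 2 rows, 3 columns
row_major = (3, 1)    # stepping one row skips 3 elements
col_major = (1, 2)    # stepping one column skips 2 elements

# Visit every coordinate of the shape in (row, column) order.
coords = list(product(*(range(extent) for extent in shape)))

row_offsets = [layout_offset(c, row_major) for c in coords]
col_offsets = [layout_offset(c, col_major) for c in coords]

print(row_offsets)
print(col_offsets)
```

The same logical coordinates land at different offsets depending only on the strides, which is why a layout can describe a data arrangement without changing the code that indexes it.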

**Pythonic Kernel Expression**

Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.
The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
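
As a rough illustration of what loop tiling means, the sketch below compares a naive matrix multiply with a tiled one in plain Python. It deliberately avoids any CuTe DSL API; the function names and the ``tile`` parameter are hypothetical, and a real kernel would map tiles onto thread blocks and threads rather than nested host loops.

```python
def matmul_naive(a, b):
    """Reference triple loop over full index ranges."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, with each loop split into tile-sized blocks."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Outer loops walk over tile origins.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Inner loops stay inside one tile; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]

# The tiled version must agree with the reference.
assert matmul_tiled(a, b) == matmul_naive(a, b)
print(matmul_tiled(a, b))
```

Tiling changes only the order in which the work is visited, not the result. That reordering is what allows a block of data to be reused while it sits in fast memory.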

**JIT Compilation**

Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA’s ptxas toolchain,
enabling rapid iteration and interactive debugging.

CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, they aim to be a high-productivity kernel authoring framework that shares all concepts with the CUTLASS 3.x C++ API, such as CuTe, pipelines, and schedulers.

- :doc:`quick_start` – Initial setup and installation.
- :doc:`cute_dsl` – Overview of the typical development workflow using CuTe DSL.
- :doc:`cute_dsl_api` – Refer to the full API documentation.
- :doc:`limitations` – Understand current CuTe DSL constraints and differences from C++.
- :doc:`faqs` – Common questions and known issues.

CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to change as we improve the system.
For known issues and workarounds, please consult :doc:`limitations` and :doc:`faqs`.

We welcome contributions and feedback from the developer community!
You can:

- Visit the `GitHub Issues page <https://github.com/NVIDIA/cutlass/issues>`__
- Visit `Discord <https://discord.com/channels/1019361803752456192/1150868614921064590>`__ to ask questions and share ideas

Thank you for helping shape the future of CUTLASS DSLs!