Your First Program with CuTe DSL

Introduction

Welcome! In this tutorial, we'll write a simple "Hello World" program that runs on your GPU using CuTe DSL. This will help you understand the basics of GPU programming with our framework.

What You'll Learn

How to write code that runs on both the CPU (host) and the GPU (device)
How to launch a GPU kernel (a function that runs on the GPU)
Basic CUDA concepts such as threads and thread blocks

Step 1: Import Required Libraries

First, let's import the libraries we need:

python

import cutlass
import cutlass.cute as cute

Step 2: Write Our GPU Kernel

A GPU kernel is a function that runs on the GPU. Here's a simple kernel that prints "Hello world". Key concepts:

@cute.kernel: This decorator tells CUTLASS that the function should run on the GPU.
cute.arch.thread_idx(): Returns the (x, y, z) index of the current GPU thread within its thread block (like a worker's ID number).
Every thread in the block runs the kernel, so we let only one thread (thread 0) print; otherwise the message would appear once per thread.

python

@cute.kernel
def kernel():
    # Get the x component of the thread index (y and z components are unused)
    tidx, _, _ = cute.arch.thread_idx()
    # Only the first thread (thread 0) prints the message
    if cutlass.dynamic_expr(tidx == 0):
        cute.printf("Hello world")
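
To see why the `tidx == 0` guard matters, here is a minimal CPU-only sketch in plain Python. It does not use CuTe DSL: the function names are hypothetical, and a sequential loop stands in for the 32 threads that would really run in parallel on the GPU.

```python
def simulate_block(block_dim_x=32):
    """Run the guarded kernel body once per simulated thread; return printed lines."""
    output = []
    for tidx in range(block_dim_x):
        # Same guard as the kernel above: only thread 0 prints
        if tidx == 0:
            output.append("Hello world")
    return output


def simulate_block_unguarded(block_dim_x=32):
    """Same loop without the guard: every simulated thread prints."""
    output = []
    for tidx in range(block_dim_x):
        output.append("Hello world")
    return output


guarded = simulate_block()
unguarded = simulate_block_unguarded()
print(len(guarded), len(unguarded))
```

With the guard, the message is recorded once; without it, once per thread in the block.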

Step 3: Write Our Host Function

Now we need a function that sets up the GPU and launches our kernel. Key concepts:

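@cute.jit: This decorator is for functions that run on the CPU but can launch GPU code.
The CUDA context must be initialized before the kernel is launched (we do this explicitly in Step 4).
.launch() sets the launch configuration: how many thread blocks to run (grid), how many threads each block has (block), and other options such as shared memory.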

python

@cute.jit
def hello_world():
    # Print hello world from host code
    cute.printf("hello world")

    # Launch kernel
    kernel().launch(
        grid=(1, 1, 1),  # Single thread block
        block=(32, 1, 1),  # One warp (32 threads) per thread block
    )
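
The launch configuration above can be checked with a little arithmetic. The following is a plain-Python sketch, not part of CuTe DSL; `launch_summary` is a hypothetical helper, and the warp size of 32 is taken from the comment in the launch call.

```python
from math import prod

WARP_SIZE = 32  # one warp is 32 threads, as noted in the launch comment


def launch_summary(grid, block):
    """Summarize a launch configuration given 3-tuples for grid and block."""
    blocks = prod(grid)                # number of thread blocks in the grid
    threads_per_block = prod(block)    # threads in each block
    # Ceiling division: a partially filled warp still counts as one warp
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return {
        "blocks": blocks,
        "threads_per_block": threads_per_block,
        "warps_per_block": warps_per_block,
        "total_threads": blocks * threads_per_block,
    }


summary = launch_summary(grid=(1, 1, 1), block=(32, 1, 1))
print(summary)
```

For this tutorial's configuration, the summary reports a single block of 32 threads, that is, exactly one warp.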

Step 4: Run Our Program

There are two ways to run our program:

Compile and run immediately (just-in-time compilation).
Compile separately, which lets us compile the code once and run it multiple times.

Note that the "Compiling..." message from Method 2 is printed before the "Hello world" from the first kernel launch. The kernel runs asynchronously with respect to the host, so GPU output can appear after host output that was issued later.

python

# Initialize CUDA context for launching a kernel with error checking
# We make context initialization explicit to allow users to control the context creation
# and avoid potential issues with multiple contexts
cutlass.cuda.initialize_cuda_context()

# Method 1: Just-In-Time (JIT) compilation - compiles and runs the code immediately
print("Running hello_world()...")
hello_world()

# Method 2: Compile first (useful if you want to run the same code multiple times)
print("Compiling...")
hello_world_compiled = cute.compile(hello_world)

# Dump PTX/CUBIN files while compiling
from cutlass.cute import KeepPTX, KeepCUBIN

print("Compiling with PTX/CUBIN dumped...")
hello_world_compiled_ptx_on = cute.compile[KeepPTX, KeepCUBIN](hello_world)

# Run the pre-compiled version
print("Running compiled version...")
hello_world_compiled()
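
The print ordering noted in this step can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions: it assumes device output is held in a buffer until the host synchronizes with the GPU, which is how CUDA's device-side printf generally behaves. The functions `host_print`, `launch_kernel`, and `synchronize` are hypothetical stand-ins, not CuTe DSL APIs, and where the real program synchronizes is not stated in this tutorial.

```python
log = []            # lines in the order they become visible
device_buffer = []  # device output waiting to be flushed


def host_print(msg):
    """Host output becomes visible immediately."""
    log.append(msg)


def launch_kernel():
    """The launch returns right away; the kernel's output stays buffered."""
    device_buffer.append("Hello world")


def synchronize():
    """Assumed flush point: buffered device output becomes visible."""
    log.extend(device_buffer)
    device_buffer.clear()


host_print("Running hello_world()...")
launch_kernel()             # Method 1 launches the kernel
host_print("Compiling...")  # the host moves on without waiting
synchronize()               # device output appears only now

print("\n".join(log))
```

In this model, "Compiling..." lands in the log before "Hello world" even though the kernel was launched first.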