examples/python/CuTeDSL/notebooks/hello_world.ipynb
Welcome! In this tutorial, we'll write a simple "Hello World" program that runs on your GPU using CuTe DSL. This will help you understand the basics of GPU programming with our framework.
First, let's import the libraries we need:
import cutlass
import cutlass.cute as cute
A GPU kernel is a function that runs on the GPU. Here's a simple kernel that prints "Hello World". Key concepts:
- @cute.kernel: This decorator tells CUTLASS that this function should run on the GPU.
- cute.arch.thread_idx(): Gets the ID of the current GPU thread (like a worker's ID number).
@cute.kernel
def kernel():
    # Get the x component of the thread index (y and z components are unused)
    tidx, _, _ = cute.arch.thread_idx()
    # Only the first thread (thread 0) prints the message
    if cutlass.dynamic_expr(tidx == 0):
        cute.printf("Hello world")
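To see why the tidx == 0 guard matters, here is a plain-Python model of a thread block. It is not CuTe DSL and does not run on a GPU; it only mimics the idea that every thread executes the same kernel body. The names kernel_body and simulate_block are hypothetical helpers made up for this sketch.

```python
def kernel_body(tidx, out, guarded=True):
    """Model of the kernel body as seen by the thread with index `tidx`."""
    # With the guard, only thread 0 records the message;
    # without it, every thread in the block does.
    if not guarded or tidx == 0:
        out.append("Hello world")


def simulate_block(num_threads, guarded=True):
    """Run the same body once per thread index and collect the output."""
    out = []
    for tidx in range(num_threads):
        kernel_body(tidx, out, guarded)
    return out


with_guard = simulate_block(32)
without_guard = simulate_block(32, guarded=False)
print(len(with_guard), len(without_guard))
```

With the guard the block produces a single message; without it, each of the 32 threads would produce its own copy.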
Now we need a function that sets up the GPU and launches our kernel. Key concepts:
- @cute.jit: This decorator is for functions that run on the CPU but can launch GPU code.
- .launch(): Tells CUDA how many blocks, threads, shared memory, etc. to use.
@cute.jit
def hello_world():
    # Print hello world from host code
    cute.printf("hello world")
    # Launch kernel
    kernel().launch(
        grid=(1, 1, 1),    # Single thread block
        block=(32, 1, 1),  # One warp (32 threads) per thread block
    )
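The grid and block tuples determine how many threads the launch creates. The following plain-Python sketch works through that arithmetic for the configuration above. The helper launch_summary is a hypothetical name used only here, not part of CuTe DSL, and the warp size of 32 is the value for NVIDIA GPUs.

```python
from math import prod

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_summary(grid, block):
    """Summarize a launch configuration given 3-component grid and block shapes."""
    blocks = prod(grid)
    threads_per_block = prod(block)
    # Ceiling division: a partially filled warp still occupies a whole warp
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return {
        "blocks": blocks,
        "threads_per_block": threads_per_block,
        "warps_per_block": warps_per_block,
        "total_threads": blocks * threads_per_block,
    }


summary = launch_summary(grid=(1, 1, 1), block=(32, 1, 1))
print(summary)
```

For the hello-world launch this gives one block of 32 threads, which is exactly one warp.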
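There are two ways to run our program:
1. Just-In-Time (JIT) compilation: call hello_world() directly, which compiles and runs the code immediately.
2. Separate compilation: compile the function once with cute.compile, then run the compiled version as many times as needed.
Note that the "Compiling..." message for Method 2 is printed before the "Hello world" from the first kernel. This shows that CPU and GPU prints are asynchronous: the host keeps running after the launch, while the kernel's output appears later.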
# Initialize CUDA context for launching a kernel with error checking
# We make context initialization explicit to allow users to control the context creation
# and avoid potential issues with multiple contexts
cutlass.cuda.initialize_cuda_context()
# Method 1: Just-In-Time (JIT) compilation - compiles and runs the code immediately
print("Running hello_world()...")
hello_world()
# Method 2: Compile first (useful if you want to run the same code multiple times)
print("Compiling...")
hello_world_compiled = cute.compile(hello_world)
# Dump PTX/CUBIN files while compiling
from cutlass.cute import KeepPTX, KeepCUBIN
print("Compiling with PTX/CUBIN dumped...")
hello_world_compiled_ptx_on = cute.compile[KeepPTX, KeepCUBIN](hello_world)
# Run the pre-compiled version
print("Running compiled version...")
hello_world_compiled()
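The output ordering noted earlier can be illustrated with a toy model in plain Python. It is a simplified sketch, not CuTe DSL or CUDA code: it assumes device-side prints are buffered and only become visible at a later synchronization point, while host prints appear immediately. ToyDevice and its methods are hypothetical names invented for this illustration.

```python
class ToyDevice:
    """Toy model: device-side prints are buffered until a synchronization point."""

    def __init__(self, log):
        self.log = log       # shared, ordered record of visible output
        self.pending = []    # device output that is not yet visible

    def launch(self, message):
        # The launch returns immediately; the kernel's output is only queued
        self.pending.append(message)

    def synchronize(self):
        # Buffered device output becomes visible here
        self.log.extend(self.pending)
        self.pending.clear()


log = []
device = ToyDevice(log)

log.append("Running hello_world()...")  # host print, visible immediately
device.launch("Hello world")            # kernel launch, output deferred
log.append("Compiling...")              # host continues without waiting
device.synchronize()                    # device output appears now

print(log)
```

In this model "Compiling..." lands in the log before "Hello world", matching the behavior described above. In a real run the exact interleaving depends on when the device output is flushed, so the model fixes only one plausible order.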