examples/python/CuTeDSL/notebooks/tensor.ipynb
import cutlass
import cutlass.cute as cute
A tensor in CuTe is created through the composition of two key components:
An Engine (E) - A random-access, pointer-like object that supports:
- e + d → e (offset the engine by elements of a layout's codomain)
- *e → v (dereference the engine to produce a value)

A Layout (L) - Defines the mapping from coordinates to offsets
A tensor is formally defined as the composition of an engine E with a layout L, expressed as T = E ∘ L. When evaluated at a coordinate c, the tensor computes the offset L(c), offsets the engine by it, and dereferences the result. This can be expressed mathematically as:
T(c) = (E ∘ L)(c) = *(E + L(c))
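This composition can be sketched in plain Python before looking at the CuTe API. The names below (`storage`, `L`, `T`) are illustrative stand-ins, not part of CuTe; a list plays the role of the engine:

```python
# Plain-Python sketch of T(c) = *(E + L(c)) for the layout (8,5):(5,1).
# Names (storage, L, T) are illustrative, not part of the CuTe API.
storage = list(range(40))  # engine E: flat, random-access memory

def L(coord, stride=(5, 1)):
    # layout: map a coordinate to an offset (inner product with the stride)
    return sum(c * s for c, s in zip(coord, stride))

def T(coord):
    # tensor evaluation: offset the engine by L(c), then dereference
    return storage[L(coord)]

print(T((2, 3)))  # offset 2*5 + 3*1 = 13 -> prints 13
```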
Here's a simple example that creates a tensor from a pointer and the layout (8,5):(5,1), then fills it with ones:
@cute.jit
def create_tensor_from_ptr(ptr: cute.Pointer):
    layout = cute.make_layout((8, 5), stride=(5, 1))
    tensor = cute.make_tensor(ptr, layout)
    tensor.fill(1)
    cute.print_tensor(tensor)
This creates a tensor with shape (8, 5) and stride (5, 1).

We can test this by allocating a buffer with torch and calling the function with a pointer to the torch tensor's data:
import torch
from cutlass.torch import dtype as torch_dtype
import cutlass.cute.runtime as cute_rt
a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))
ptr_a = cute_rt.make_ptr(cutlass.Float32, a.data_ptr())
create_tensor_from_ptr(ptr_a)
CuTe DSL is designed to support the DLPack protocol natively. This offers easy integration with frameworks that support DLPack, e.g. torch, numpy, jax, tensorflow, etc.
For more information, please refer to the DLPack project: https://github.com/dmlc/dlpack

Calling from_dlpack converts any tensor or ndarray object that supports __dlpack__ and __dlpack_device__.
from cutlass.cute.runtime import from_dlpack
@cute.jit
def print_tensor_dlpack(src: cute.Tensor):
    print(src)
    cute.print_tensor(src)
a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))
print_tensor_dlpack(from_dlpack(a))
import numpy as np
a = np.random.randn(8, 8).astype(np.float32)
print_tensor_dlpack(from_dlpack(a))
Tensors support two primary methods of evaluation:
When a tensor is evaluated with a complete coordinate c, it computes the offset, applies it to the engine, and dereferences it to return the stored value. This is the straightforward case where you want to access a specific element of the tensor.
When evaluating with an incomplete coordinate c = c' ⊕ c* (where c* represents the unspecified portion), the result is a new tensor which is a slice of the original tensor with its engine offset to account for the coordinates that were provided. This operation can be expressed as:
T(c) = (E ∘ L)(c) = (E + L(c')) ∘ L(c*) = T'(c*)
Slicing effectively reduces the dimensionality of the tensor, creating a sub-tensor that can be further evaluated or manipulated.
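The slicing identity above can be sketched in plain Python. This is an illustrative model, not the CuTe API: the engine is a flat list, c' = (2, 0) is the specified part of the coordinate, and the unspecified mode c* ranges over the columns:

```python
# Plain-Python sketch of slicing with an incomplete coordinate c = (2, *)
# for an (8, 5) tensor with stride (5, 1); names are illustrative.
storage = list(range(40))
shape, stride = (8, 5), (5, 1)

def L(coord):
    # layout: coordinate -> offset (inner product with the stride)
    return sum(c * s for c, s in zip(coord, stride))

# E' = E + L(c'): offset the engine by the specified part of the coordinate
base = L((2, 0))

# T'(c*) = *(E' + L(c*)): the slice is a 1-D tensor over the remaining mode
row2 = [storage[base + j * stride[1]] for j in range(shape[1])]
print(row2)  # [10, 11, 12, 13, 14]
```

The slice shares the original storage; only the engine's base offset and the remaining layout mode change.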
@cute.jit
def tensor_access_item(a: cute.Tensor):
    # access data using a linear index
    cute.printf(
        "a[2] = {} (equivalent to a[{}])",
        a[2],
        cute.make_identity_tensor(a.layout.shape)[2],
    )
    cute.printf(
        "a[9] = {} (equivalent to a[{}])",
        a[9],
        cute.make_identity_tensor(a.layout.shape)[9],
    )
    # access data using n-d coordinates; the following two are equivalent
    cute.printf("a[2,0] = {}", a[2, 0])
    cute.printf("a[2,4] = {}", a[2, 4])
    cute.printf("a[(2,4)] = {}", a[2, 4])
    # assign values to the tensor at (2,3) and (2,4)
    a[2, 3] = 100.0
    a[2, 4] = 101.0
    cute.printf("a[2,3] = {}", a[2, 3])
    cute.printf("a[(2,4)] = {}", a[(2, 4)])
# Create a tensor with sequential data using torch
data = torch.arange(0, 8 * 5, dtype=torch.float32).reshape(8, 5)
tensor_access_item(from_dlpack(data))
print(data)
In CUDA programming, different memory spaces (e.g. global, shared, and register memory) have different characteristics in terms of access speed, scope, and lifetime.
When creating tensors in CuTe, you can specify the memory space to optimize performance based on your access patterns.
For more information on CUDA memory spaces, see the CUDA Programming Guide.
A coordinate tensor $T: Z^n → Z^m$ is a mathematical structure that establishes a mapping between coordinate spaces. Unlike standard tensors that map coordinates to scalar values, coordinate tensors map coordinates to other coordinates, forming a fundamental building block for tensor operations and transformations.
Consider a (4,4) coordinate tensor:
Row-Major Layout (C-style): $$\begin{bmatrix} (0,0) & (0,1) & (0,2) & (0,3) \\ (1,0) & (1,1) & (1,2) & (1,3) \\ (2,0) & (2,1) & (2,2) & (2,3) \\ (3,0) & (3,1) & (3,2) & (3,3) \end{bmatrix}$$

Column-Major Layout (Fortran-style): $$\begin{bmatrix} (0,0) & (1,0) & (2,0) & (3,0) \\ (0,1) & (1,1) & (2,1) & (3,1) \\ (0,2) & (1,2) & (2,2) & (3,2) \\ (0,3) & (1,3) & (2,3) & (3,3) \end{bmatrix}$$
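The two traversal orders can be reproduced with a short plain-Python sketch (the shape and variable names are illustrative):

```python
# Enumerate the (4, 4) coordinate grid in both traversal orders.
shape = (4, 4)

# Row-major (C-style): the last index varies fastest
row_major = [(i, j) for i in range(shape[0]) for j in range(shape[1])]

# Column-major (Fortran-style): the first index varies fastest
col_major = [(i, j) for j in range(shape[1]) for i in range(shape[0])]

print(row_major[:5])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]
print(col_major[:5])  # [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1)]
```

Both lists contain the same 16 coordinates; only the order in which they are visited, and hence how they map to linear indices, differs.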
An identity tensor $I$ is a special case of a coordinate tensor that implements the identity mapping function:
Definition: For a given shape $S = (s_1, s_2, ..., s_n)$, the identity tensor $I$ satisfies: $I(c) = c, \forall c \in \prod_{i=1}^n [0, s_i)$
Properties:
CuTe establishes an isomorphism between 1-D indices and N-D coordinates through colexicographical ordering (the first mode varies fastest). For a coordinate c = (c₁, c₂, ..., cₙ) in an identity tensor with shape S = (s₁, s₂, ..., sₙ):
Linear Index Formula: $\text{idx} = c_1 + \sum_{i=2}^{n} \left(c_i \prod_{j=1}^{i-1} s_j\right)$
Example:
# Create an identity tensor from a given shape
coord_tensor = cute.make_identity_tensor(layout.shape)
# Access coordinate using linear index
coord = coord_tensor[linear_idx] # Returns the N-D coordinate
This bidirectional mapping enables efficient conversion from linear indices to N-dimensional coordinates, facilitating tensor operations and memory access patterns.
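This bidirectional conversion can be sketched in plain Python. The helper names (`linear_index`, `delinearize`) are illustrative, not part of the CuTe API:

```python
# Colexicographical linearization: idx = c1 + sum_i (c_i * prod_{j<i} s_j)
def linear_index(coord, shape):
    idx, scale = 0, 1
    for c, s in zip(coord, shape):
        idx += c * scale
        scale *= s
    return idx

# Inverse: recover the N-D coordinate from a linear index
def delinearize(idx, shape):
    coord = []
    for s in shape:
        coord.append(idx % s)
        idx //= s
    return tuple(coord)

shape = (8, 5)
print(linear_index((2, 1), shape))  # 2 + 1*8 = 10
print(delinearize(10, shape))       # (2, 1)
```

This matches the `tensor_access_item` example above, where a linear index like `a[9]` resolves to the coordinate `(1, 1)` for shape (8, 5).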
@cute.jit
def print_tensor_coord(a: cute.Tensor):
    coord_tensor = cute.make_identity_tensor(a.layout.shape)
    print(coord_tensor)
    cute.print_tensor(coord_tensor)
a = torch.randn(8, 4, dtype=torch_dtype(cutlass.Float32))
print_tensor_coord(from_dlpack(a))