Runtimes

tinygrad supports various runtimes, enabling your code to scale across a wide range of devices. The default runtime is selected automatically based on the available hardware, or you can force a specific runtime to be the default using an environment variable (e.g., DEV=CPU).
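
As a minimal sketch of forcing a runtime from Python rather than from the shell, the snippet below sets DEV before tinygrad is imported. Setting the variable before the import is a conservative assumption about when tinygrad reads it, and the import is guarded so the snippet also runs on a machine without tinygrad installed.

```python
import os

# Force the CPU runtime. Set DEV before importing tinygrad (assumed to be the
# safe ordering, so the value is visible when the default device is resolved).
os.environ["DEV"] = "CPU"

try:
    from tinygrad import Device
    print("default runtime:", Device.DEFAULT)
except ImportError:
    # tinygrad is not installed here; the variable is still set for child processes.
    print("tinygrad not installed; DEV is", os.environ["DEV"])
```

The shell equivalent is to prefix the command, for example `DEV=CPU python script.py`, where `script.py` is a placeholder name.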

| Runtime | Description | Compiler Options | Requirements |
|---|---|---|---|
| NV | Provides acceleration for NVIDIA GPUs | nvrtc (default); PTX (DEV=NV:PTX) | Ampere/Ada/Blackwell series GPUs. You can select an interface via the DEV variable. See NV interfaces for details. |
| AMD | Provides acceleration for AMD GPUs | LLVM (DEV=AMD:LLVM); HIP/COMGR (DEV=AMD:HIP) | CDNA3, CDNA4, RDNA3 or RDNA4 GPUs. You can select an interface via the DEV variable. See AMD interfaces for details. |
| QCOM | Provides acceleration for QCOM GPUs | - | 6xx series GPUs |
| METAL | Utilizes Metal for acceleration on Apple devices | - | M1+ Macs; Metal 3.0+ for bfloat support |
| CUDA | Utilizes CUDA for acceleration on NVIDIA GPUs | nvrtc (default); PTX (DEV=CUDA:PTX) | NVIDIA GPU with CUDA support |
| CL | Accelerates computations using OpenCL on GPUs | - | OpenCL 2.0 compatible device |
| CPU | Runs on CPU using the clang or llvm compiler | Clang JIT (default); LLVM IR (DEV=CPU:LLVM) | clang compiler in system PATH. You can specify additional arch parameters via the DEV variable. See CPU arch for details. |
| WEBGPU | Runs on GPU using the Dawn WebGPU engine (used in Google Chrome) | - | Dawn library installed and discoverable. Binaries: pydawn v0.3.0 |
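
The DEV values in the table follow a `RUNTIME` or `RUNTIME:COMPILER` pattern. The sketch below builds such a value and rejects compilers the table does not list for a runtime. `DOCUMENTED_COMPILERS` and `dev_value` are hypothetical names for illustration, not part of tinygrad, and the sketch deliberately ignores the interface and arch components described later on this page.

```python
from typing import Optional

# Compiler options that the table pairs with an explicit DEV value, keyed by runtime.
DOCUMENTED_COMPILERS = {
    "NV": {"PTX"},
    "AMD": {"LLVM", "HIP"},
    "CUDA": {"PTX"},
    "CPU": {"LLVM"},
}

def dev_value(runtime: str, compiler: Optional[str] = None) -> str:
    """Hypothetical helper: build a DEV value such as 'AMD:LLVM' or 'METAL'."""
    if compiler is None:
        return runtime  # let the runtime pick its default compiler
    if compiler not in DOCUMENTED_COMPILERS.get(runtime, set()):
        raise ValueError(f"{compiler!r} is not a documented compiler for {runtime!r}")
    return f"{runtime}:{compiler}"

print(dev_value("AMD", "LLVM"))
print(dev_value("METAL"))
```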

Interoperability

tinygrad provides interoperability with OpenCL and PyTorch, allowing efficient tensor data sharing between frameworks through the Tensor.from_blob API. This enables zero-copy operations by working directly with external memory pointers.

Important: When using external memory pointers with tinygrad tensors, you must ensure these pointers remain valid throughout the entire lifetime of the tinygrad tensor to prevent memory corruption.
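
One way to honor this lifetime rule is to store the object that owns the memory next to the raw pointer, so the allocation cannot be freed while the pointer is still in use. The sketch below shows the idea with a plain ctypes array standing in for external memory. `PinnedBlob` and `pin_ctypes_array` are hypothetical names, not tinygrad APIs.

```python
import ctypes
from dataclasses import dataclass
from typing import Any

@dataclass
class PinnedBlob:
    """Hypothetical handle pairing a raw address with the object that owns it."""
    owner: Any   # reference that keeps the allocation alive
    ptr: int     # raw address, the kind of value passed to Tensor.from_blob
    nbytes: int  # size of the allocation in bytes

def pin_ctypes_array(arr: ctypes.Array) -> PinnedBlob:
    return PinnedBlob(owner=arr, ptr=ctypes.addressof(arr), nbytes=ctypes.sizeof(arr))

buf = (ctypes.c_float * 3)(1.0, 2.0, 3.0)
blob = pin_ctypes_array(buf)
del buf  # the array survives because blob.owner still references it

# Reading through the raw address is safe only while blob (and its owner) is alive.
view = (ctypes.c_float * 3).from_address(blob.ptr)
print(list(view))  # [1.0, 2.0, 3.0]
```

In practice the handle would live at least as long as the tinygrad tensor created from `blob.ptr`.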

CUDA/METAL PyTorch Interoperability

You can seamlessly work with CUDA/MPS tensors between PyTorch and tinygrad without data copying:

python
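import torch
from tinygrad import Tensor
from tinygrad.dtype import _from_torch_dtype

# tinygrad aliases this PyTorch memory instead of copying it, so tensor1 must outlive tiny_tensor1.
tensor1 = torch.tensor([1.0, 2.0, 3.0], device=torch.device("cuda"))
# For an MPS tensor, pass device='METAL' instead of 'CUDA'.
tiny_tensor1 = Tensor.from_blob(tensor1.data_ptr(), tensor1.shape, dtype=_from_torch_dtype(tensor1.dtype), device='CUDA')

# Before tinygrad calculations, the PyTorch device needs to be synchronized to make sure the data is valid.
if tensor1.device.type == "mps": torch.mps.synchronize()
else: torch.cuda.synchronize()

x = (tiny_tensor1 + 1).realize()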

QCOM OpenCL Interoperability

tinygrad supports OpenCL interoperability on the QCOM backend.

Buffer interop allows direct access to OpenCL memory buffers:

python
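# NOTE: the import paths below are assumptions; adjust them to your tinygrad version.
import ctypes
from tinygrad import Tensor, dtypes
from tinygrad.helpers import to_mv
from tinygrad.runtime.autogen import opencl as cl

# create raw opencl buffer (cl_context is an OpenCL context you have already created).
cl_buf = cl.clCreateBuffer(cl_context, cl.CL_MEM_READ_WRITE, 0x100, None, status := ctypes.c_int32())

# extract pointers
cl_buf_desc_ptr = to_mv(ctypes.addressof(cl_buf), 8).cast('Q')[0]
rawbuf_ptr = to_mv(cl_buf_desc_ptr, 0x100).cast('Q')[20] # offset 0xA0 is a raw gpu pointer.

# create tiny tensor
tiny = Tensor.from_blob(rawbuf_ptr, (8, 8), dtype=dtypes.int, device='QCOM')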

Image interop works the same way:

python
# create cl image (same names as in the buffer example; w and h are the image width and height).
cl_img = cl.clCreateImage2D(cl_context, cl.CL_MEM_READ_WRITE, cl.cl_image_format(cl.CL_RGBA, cl.CL_FLOAT), w, h, 0, None, status := ctypes.c_int32())

# extract pointers
cl_buf_desc_ptr = to_mv(ctypes.addressof(cl_img), 8).cast('Q')[0]
rawbuf_ptr = to_mv(cl_buf_desc_ptr, 0x100).cast('Q')[20] # offset 0xA0 is a raw gpu pointer.

# create tiny tensor
tiny = Tensor.from_blob(rawbuf_ptr, (h*w*4,), dtype=dtypes.imagef((h,w)), device='QCOM')
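
Both snippets read element 20 of a `'Q'` (unsigned 64-bit) view, which corresponds to byte offset 0xA0 because each element is 8 bytes wide. The sketch below checks that arithmetic on a zeroed stand-in buffer. The descriptor contents and the pointer value are made up for illustration; only the offset and the 0x100 size come from the examples above.

```python
import struct

DESC_SIZE = 0x100       # number of bytes the snippets above view
RAW_PTR_OFFSET = 0xA0   # byte offset named in the code comments above
index = RAW_PTR_OFFSET // struct.calcsize("Q")  # 0xA0 / 8 bytes per element

# Stand-in descriptor: zeroed memory with a made-up pointer value stored at 0xA0.
fake_desc = bytearray(DESC_SIZE)
struct.pack_into("Q", fake_desc, RAW_PTR_OFFSET, 0x1234_5678_9ABC)

# Same access pattern as to_mv(...).cast('Q')[20], on an ordinary buffer.
fields = memoryview(fake_desc).cast("Q")
print(index, hex(fields[index]))  # 20 0x123456789abc
```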

AMD Interfaces

The AMD backend supports several interfaces for communicating with devices:

  • KFD: uses the amdgpu driver
  • PCI: uses the AM driver
  • USB: USB3 interface for asm24xx chips.

You can force an interface by setting the interface component of the DEV environment variable to one of these values. Note that selecting PCI may unbind your GPU from the amdgpu driver.

NV Interfaces

The NV backend supports several interfaces for communicating with devices:

  • NVK: uses the nvidia driver
  • PCI: uses the NV driver

CPU Arch

The CPU renderers can be further configured using the arch component of the DEV environment variable. Specify the CPU arch as a comma-separated list of parameters containing at least two values: the architecture family (i.e., x86_64, arm64, or riscv64) and the cpu type (as accepted by clang's -march). If native is specified as the cpu type, tinygrad (or the delegate compiler) will query the host cpu type. Additional comma-separated values may be specified as follows:

  • AMX: emit Apple silicon AMX instructions

All other additional values are interpreted as cpu feature flags. When a value is preceded by a - character, the corresponding feature flag is disabled; otherwise the flag is enabled. Note that enabled feature flags should not be preceded by a +.
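
The rules above can be sketched as a small parser for the arch list itself. This is an illustration only: `CpuArch` and `parse_cpu_arch` are hypothetical names, not tinygrad's implementation, and the sketch does not show how the list is embedded in the DEV variable. Treating a leading + as an error and matching `AMX` case-sensitively are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CpuArch:
    family: str   # architecture family, e.g. "x86_64", "arm64", "riscv64"
    cpu: str      # cpu type as accepted by clang's -march, or "native"
    amx: bool = False
    enabled: List[str] = field(default_factory=list)
    disabled: List[str] = field(default_factory=list)

def parse_cpu_arch(spec: str) -> CpuArch:
    """Hypothetical parser for a comma-separated CPU arch list."""
    parts = [p.strip() for p in spec.split(",") if p.strip()]
    if len(parts) < 2:
        raise ValueError("need at least an architecture family and a cpu type")
    arch = CpuArch(family=parts[0], cpu=parts[1])
    for value in parts[2:]:
        if value == "AMX":
            arch.amx = True                    # Apple silicon AMX instructions
        elif value.startswith("-"):
            arch.disabled.append(value[1:])    # "-flag" disables a feature
        elif value.startswith("+"):
            raise ValueError(f"write {value[1:]!r} without the leading '+'")
        else:
            arch.enabled.append(value)         # a bare flag enables a feature
    return arch

arch = parse_cpu_arch("x86_64,native,avx2,-avx512f")
print(arch.family, arch.cpu, arch.enabled, arch.disabled)
```

Here `avx2` and `avx512f` are example x86 feature names; any flag the compiler accepts would be handled the same way.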