Utilities

The stable API provides various utility functions and types for working with tensors and CUDA operations.

DeviceGuard Class

{doxygenclass}

:members:
:undoc-members:

{doxygenfunction}

Example:

cpp

{
    torch::stable::accelerator::DeviceGuard guard(1);
    // Operations here run on device 1
}
// Previous device is restored

Stream

{doxygenclass}

:members:
:undoc-members:

Stream Utilities

For CUDA stream access, we currently recommend the ABI stable C shim API. This will be improved in a future release with a more ergonomic wrapper.

Getting the Current CUDA Stream

To obtain the current cudaStream_t for use in CUDA kernels:

cpp

#include <torch/csrc/inductor/aoti_torch/c/shim.h>
#include <torch/headeronly/util/shim_utils.h>

// For now, we rely on the ABI stable C shim API to get the current CUDA stream.
void* stream_ptr = nullptr;
TORCH_ERROR_CODE_CHECK(
    aoti_torch_get_current_cuda_stream(tensor.get_device_index(), &stream_ptr));
cudaStream_t stream = static_cast<cudaStream_t>(stream_ptr);

// Now you can use 'stream' in your CUDA kernel launches
my_kernel<<<blocks, threads, 0, stream>>>(args...);

{note}


The `TORCH_ERROR_CODE_CHECK` macro is required when using C shim APIs
to properly check error codes and throw appropriate exceptions.

CUDA Error Checking Macros

These macros provide stable ABI equivalents for CUDA error checking. They wrap CUDA API calls and kernel launches, providing detailed error messages using PyTorch's error formatting.

STD_CUDA_CHECK

{c:macro}


Checks the result of a CUDA API call and throws an exception on error.
Users of this macro are expected to include `cuda_runtime.h`.

**Example:**

```cpp
STD_CUDA_CHECK(cudaMalloc(&ptr, size));
STD_CUDA_CHECK(cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost));

Minimum compatible version: PyTorch 2.10.

### STD_CUDA_KERNEL_LAUNCH_CHECK

```{c:macro} STD_CUDA_KERNEL_LAUNCH_CHECK()

Checks for errors from the most recent CUDA kernel launch. Equivalent to
`STD_CUDA_CHECK(cudaGetLastError())`.

**Example:**

```cpp
my_kernel<<<blocks, threads, 0, stream>>>(args...);
STD_CUDA_KERNEL_LAUNCH_CHECK();

Minimum compatible version: PyTorch 2.10.

## Header-Only Utilities

The `torch::headeronly` namespace provides header-only versions of common
PyTorch types and utilities. These can be used without linking against libtorch
at all! This portability makes them ideal for maintaining binary compatibility
across PyTorch versions.

### Error Checking

`STD_TORCH_CHECK` is a header-only macro for runtime assertions:

```cpp
#include <torch/headeronly/util/Exception.h>

STD_TORCH_CHECK(condition, "Error message with ", variable, " interpolation");

Wherever you used TORCH_CHECK before, you can replace usage with STD_TORCH_CHECK to remove the need to link against libtorch. The only difference is that when the condition check fails, TORCH_CHECK throws a fancier c10::Error while STD_TORCH_CHECK throws a std::runtime_error.

Core Types

The following c10:: types are available as header-only versions under torch::headeronly:::

torch::headeronly::ScalarType - Tensor data types (Float, Double, Int, etc.)
torch::headeronly::DeviceType - Device types (CPU, CUDA, etc.)
torch::headeronly::MemoryFormat - Memory layout formats (Contiguous, ChannelsLast, etc.)
torch::headeronly::Layout - Tensor layouts (Strided, Sparse, etc.)

cpp

#include <torch/headeronly/core/ScalarType.h>
#include <torch/headeronly/core/DeviceType.h>
#include <torch/headeronly/core/MemoryFormat.h>
#include <torch/headeronly/core/Layout.h>

auto dtype = torch::headeronly::ScalarType::Float;
auto device_type = torch::headeronly::DeviceType::CUDA;
auto memory_format = torch::headeronly::MemoryFormat::Contiguous;
auto layout = torch::headeronly::Layout::Strided;

TensorAccessor

TensorAccessor provides efficient, bounds-checked access to tensor data. You can construct one from a stable tensor's data pointer, sizes, and strides:

cpp

#include <torch/headeronly/core/TensorAccessor.h>

// Create a TensorAccessor for a 2D float tensor
auto sizes = tensor.sizes();
auto strides = tensor.strides();
torch::headeronly::TensorAccessor<float, 2> accessor(
    static_cast<float*>(tensor.mutable_data_ptr()),
    sizes.data(),
    strides.data());

// Access elements
float value = accessor[i][j];

Dispatch Macros

Header-only dispatch macros (THO = Torch Header Only) are available for dtype dispatching:

cpp

#include <torch/headeronly/core/Dispatch_v2.h>

THO_DISPATCH_V2(
   tensor.scalar_type(),  // will be resolved as scalar_t
   "my_kernel",
   AT_WRAP(([&]() {
   // code to specialize with scalar_t
   // scalar_t is the resolved C++ type (e.g. float, double)
   auto* data = static_cast<scalar_t*>(tensor.mutable_data_ptr());
   Scalar s(*data);
   })),
   AT_EXPAND(AT_ALL_TYPES),
   AT_EXPAND(AT_COMPLEX_TYPES),
   torch::headeronly::ScalarType::Half,
   // as many type arguments as needed
);

THO_DISPATCH_V2 works the same way as AT_DISPATCH_V2 (see ATen/Dispatch_v2.h) but does not require linking against libtorch. As a result, whereas AT_DISPATCH_V2 would have thrown c10::NotImplementedError for unimplemented paths, THO_DISPATCH_V2 will throw std::runtime_error.

For ease of use, we've also migrated the below AT_* macros representing collections of types to be header-only and thus have no dependency on libtorch:

AT_FLOATING_TYPES
AT_INTEGRAL_TYPES
AT_INTEGRAL_TYPES_V2
AT_ALL_TYPES
AT_COMPLEX_TYPES
AT_ALL_TYPES_AND_COMPLEX
AT_FLOAT8_TYPES
AT_BAREBONES_UNSIGNED_TYPES
AT_QINT_TYPES

If your extension uses our older AT_DISPATCH version 1 infrastructure, you can also migrate to a header-only libtorch-free world without upgrading everything to version 2.

THO_DISPATCH_SWITCH and THO_DISPATCH_CASE are the header-only equivalents of AT_DISPATCH_SWITCH and AT_DISPATCH_CASE. Similarly, the only user-visible difference is the exception type on an unhandled dtype, where the AT_ version throws a c10::NotImplementedError and the THO_ version throws a std::runtime_error.

The migration is pretty mechanical:

AT_DISPATCH_SWITCH → THO_DISPATCH_SWITCH
AT_DISPATCH_CASE → THO_DISPATCH_CASE
AT_PRIVATE_CASE_TYPE_USING_HINT → THO_PRIVATE_CASE_TYPE_USING_HINT
at::ScalarType::X → torch::headeronly::ScalarType::X

cpp

// ---- Before (requires linking against libtorch) ----
#include <torch/all.h>

#define MY_DISPATCH_CASE_FLOATING_TYPES(...)            \
  AT_DISPATCH_CASE(at::ScalarType::Float, __VA_ARGS__) \
  AT_DISPATCH_CASE(at::ScalarType::Half, __VA_ARGS__)  \
  AT_DISPATCH_CASE(at::ScalarType::BFloat16, __VA_ARGS__)

#define MY_DISPATCH_FLOATING_TYPES(TYPE, NAME, ...) \
  AT_DISPATCH_SWITCH(TYPE, NAME,                    \
                     MY_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))

cpp

// ---- After (header-only, no libtorch dependency) ----
#include <torch/headeronly/core/Dispatch.h>

#define MY_DISPATCH_CASE_FLOATING_TYPES(...)                          \
  THO_DISPATCH_CASE(torch::headeronly::ScalarType::Float, __VA_ARGS__) \
  THO_DISPATCH_CASE(torch::headeronly::ScalarType::Half, __VA_ARGS__)  \
  THO_DISPATCH_CASE(torch::headeronly::ScalarType::BFloat16, __VA_ARGS__)

#define MY_DISPATCH_FLOATING_TYPES(TYPE, NAME, ...) \
  THO_DISPATCH_SWITCH(TYPE, NAME,                   \
                      MY_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))

For the complete list of header-only APIs, see torch/header_only_apis.txt in the PyTorch source tree.

Parallelization Utilities

{doxygenfunction}