CUTLASS: cutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > Class Template Reference - Cutlass

CUTLASS: CUDA Templates for Linear Algebra Subroutines and Solvers

#include <gemm_batched.h>

Detailed Description

template<
  typename ElementA_,
  typename LayoutA_,
  typename ElementB_,
  typename LayoutB_,
  typename ElementC_,
  typename LayoutC_,
  typename ElementAccumulator_ = ElementC_,
  typename OperatorClass_ = arch::OpClassSimt,
  typename ArchTag_ = arch::Sm70,
  typename ThreadblockShape_ = typename DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::ThreadblockShape,
  typename WarpShape_ = typename DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::WarpShape,
  typename InstructionShape_ = typename DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::InstructionShape,
  typename EpilogueOutputOp_ = typename DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::EpilogueOutputOp,
  typename ThreadblockSwizzle_ = threadblock::GemmBatchedIdentityThreadblockSwizzle,
  int Stages = DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::kStages,
  int AlignmentA = DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::kAlignmentA,
  int AlignmentB = DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::kAlignmentB,
  typename Operator_ = typename DefaultGemmConfiguration<OperatorClass_, ArchTag_, ElementA_, ElementB_, ElementC_, ElementAccumulator_>::Operator
>
class cutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ >

Batched GEMM device-level operator. This is an interface to efficient CUTLASS GEMM kernels that may be invoked from host code.

The contributions of this class are:

At compile time, it maps data types and high-level structural parameters onto specific CUTLASS components.
At runtime, it maps the logical arguments of a GEMM problem to kernel parameters.
At runtime, it launches kernels on the device.

The intent is to provide a convenient mechanism for interacting with most plausible GEMM configurations for each supported architecture. Consequently, not all parameters are exposed to the top-level interface. Rather, sensible defaults are selected at each level of the CUTLASS hierarchy to trade off interface simplicity against flexibility. We expect most configurations to be specified at this level. Applications with more exotic requirements may construct their kernels of interest using CUTLASS components at the threadblock, warp, and thread levels of abstraction.
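As a rough illustration of how defaults can be chosen at compile time and still be overridden, the sketch below uses a traits class keyed on an operator-class tag. Every name in it (OpSimt, OpTensor, Shape, DefaultConfig, MyGemm, check_defaults) and every numeric value is a hypothetical stand-in chosen for this example; it does not reproduce the actual contents of DefaultGemmConfiguration.

```cpp
#include <cassert>

// Hypothetical operator-class tags (stand-ins for tags such as arch::OpClassSimt).
struct OpSimt {};
struct OpTensor {};

// Hypothetical tile-shape type carrying its extents as compile-time constants.
template <int M, int N, int K>
struct Shape {
  static constexpr int kM = M;
  static constexpr int kN = N;
  static constexpr int kK = K;
};

// Primary template is left undefined: unsupported combinations fail to compile.
template <typename OperatorClass, typename Element>
struct DefaultConfig;

// Illustrative defaults only; the numbers are made up for this sketch.
template <typename Element>
struct DefaultConfig<OpSimt, Element> {
  using ThreadblockShape = Shape<128, 128, 8>;
  static constexpr int kStages = 2;
};

template <typename Element>
struct DefaultConfig<OpTensor, Element> {
  using ThreadblockShape = Shape<128, 256, 32>;
  static constexpr int kStages = 2;
};

// Top-level operator: each structural parameter defaults to a value drawn from
// the traits class, yet the caller can still override any of them.
template <
  typename Element,
  typename OperatorClass = OpSimt,
  typename ThreadblockShape =
      typename DefaultConfig<OperatorClass, Element>::ThreadblockShape,
  int Stages = DefaultConfig<OperatorClass, Element>::kStages>
struct MyGemm {
  using Tile = ThreadblockShape;
  static constexpr int kStages = Stages;
};

// Usage: the defaults follow the operator class unless explicitly overridden.
inline void check_defaults() {
  assert((MyGemm<float>::Tile::kM == 128));
  assert((MyGemm<float, OpTensor>::Tile::kN == 256));
  assert((MyGemm<float, OpSimt, Shape<64, 64, 8>>::Tile::kM == 64));
}
```

Leaving the primary template undefined is one possible design choice: a request for an unsupported combination then fails at compile time rather than silently picking a configuration.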

CUTLASS exposes computations using the functor design pattern, in which objects combine internal state with an overloaded function call operator. This decouples initialization from execution, which can reduce overhead during steady-state phases of application execution.

CUTLASS device-level operators expose an Arguments structure encompassing each logical input to the computation. This is distinct from the kernel-level Params structure pattern, which contains application-specific precomputed state needed by the device code.
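The split between Arguments and Params, and between initialization and execution, can be sketched with a host-only toy operator. ScaleOp below is entirely hypothetical: it scales a vector on the CPU instead of launching a GEMM kernel, and its tile width is made up. It is only meant to show the shape of the pattern, under the assumption that Params holds values derived once from Arguments.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-only stand-in for a device-level operator. It scales a
// vector instead of launching a GEMM kernel, so it runs without a GPU.
class ScaleOp {
public:
  // Arguments: the logical inputs, as the caller thinks of them.
  struct Arguments {
    float const *input;
    float *output;
    std::size_t count;
    float alpha;
  };

  // Params: state precomputed from Arguments that the "kernel" consumes.
  struct Params {
    float const *input = nullptr;
    float *output = nullptr;
    std::size_t count = 0;
    float alpha = 0.0f;
    std::size_t tiles = 0;  // computed once, reused on every run
  };

  static constexpr std::size_t kTile = 4;  // made-up tile width

  // Initialization: validate the arguments and precompute Params.
  bool initialize(Arguments const &args) {
    if (args.input == nullptr || args.output == nullptr) {
      return false;
    }
    params_.input = args.input;
    params_.output = args.output;
    params_.count = args.count;
    params_.alpha = args.alpha;
    params_.tiles = (args.count + kTile - 1) / kTile;  // round up
    initialized_ = true;
    return true;
  }

  // Execution: uses only the stored Params and may be called repeatedly.
  bool operator()() const {
    if (!initialized_) {
      return false;
    }
    assert(params_.tiles * kTile >= params_.count);
    for (std::size_t tile = 0; tile < params_.tiles; ++tile) {
      std::size_t begin = tile * kTile;
      std::size_t end = begin + kTile;
      if (end > params_.count) {
        end = params_.count;  // the last tile may be partial
      }
      for (std::size_t i = begin; i < end; ++i) {
        params_.output[i] = params_.alpha * params_.input[i];
      }
    }
    return true;
  }

  std::size_t tiles() const { return params_.tiles; }

private:
  Params params_;
  bool initialized_ = false;
};
```

A caller would construct ScaleOp once, call initialize() with an Arguments value, and then invoke the object as often as needed; only the first step pays for validation and precomputation.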

The following example uses the non-batched cutlass::gemm::device::Gemm operator to implement the functionality of cuBLAS's SGEMM NN:

Instantiate the CUTLASS GEMM operator.

cutlass::gemm::device::Gemm<
  float,
  cutlass::layout::ColumnMajor,
  float,
  cutlass::layout::ColumnMajor,
  float,
  cutlass::layout::ColumnMajor
> gemm_op;

Launch the GEMM operation on the device.

cutlass::Status status = gemm_op({
  {m, n, k}, // GemmCoord problem_size,
  {A, lda}, // TensorRef<float, layout::ColumnMajor> ref_A,
  {B, ldb}, // TensorRef<float, layout::ColumnMajor> ref_B,
  {C, ldc}, // TensorRef<float, layout::ColumnMajor> ref_C,
  {D, ldd}, // TensorRef<float, layout::ColumnMajor> ref_D,
  {alpha, beta} // EpilogueOutputOp::Params epilogue_op_params
});
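The example above launches a single GEMM. As a reference for what a strided batched GEMM computes, the host-side sketch below evaluates D_b = alpha * A_b * B_b + beta * C_b for every batch b in column-major layout. The function name, the parameter names, and the assumption that consecutive batches of each operand are separated by a fixed element stride are illustrative choices made for this sketch; it is plain host code, not the CUTLASS API.

```cpp
#include <cassert>
#include <cstdint>

// Host reference for a strided batched SGEMM in column-major layout:
//   D_b = alpha * A_b * B_b + beta * C_b   for b in [0, batch_count)
// Batch b of each operand starts b * stride elements after batch 0.
// Hypothetical helper for illustration; not part of CUTLASS.
void reference_batched_sgemm(
    int m, int n, int k, float alpha,
    float const *A, int lda, std::int64_t stride_A,
    float const *B, int ldb, std::int64_t stride_B,
    float beta,
    float const *C, int ldc, std::int64_t stride_C,
    float *D, int ldd, std::int64_t stride_D,
    int batch_count) {
  // Column-major leading dimensions must cover the row count of each operand.
  assert(lda >= m && ldb >= k && ldc >= m && ldd >= m);

  for (int b = 0; b < batch_count; ++b) {
    float const *A_b = A + b * stride_A;
    float const *B_b = B + b * stride_B;
    float const *C_b = C + b * stride_C;
    float *D_b = D + b * stride_D;

    for (int col = 0; col < n; ++col) {
      for (int row = 0; row < m; ++row) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
          // Column-major: element (r, c) lives at offset r + c * ld.
          acc += A_b[row + p * lda] * B_b[p + col * ldb];
        }
        D_b[row + col * ldd] = alpha * acc + beta * C_b[row + col * ldc];
      }
    }
  }
}
```

A reference like this is mainly useful for checking device results on small problems; it makes no attempt at performance.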

A simplified view of the template is listed below.

template <
  /// Element type for A matrix operand
  typename ElementA,

  /// Layout type for A matrix operand
  typename LayoutA,

  /// Element type for B matrix operand
  typename ElementB,

  /// Layout type for B matrix operand
  typename LayoutB,

  /// Element type for C and D matrix operands
  typename ElementC,

  /// Layout type for C and D matrix operands
  typename LayoutC,

  /// Element type for internal accumulation
  typename ElementAccumulator,

  /// Operator class tag
  typename OperatorClass,

  /// Tag indicating architecture to tune for
  typename ArchTag,

  /// Threadblock-level tile size (concept: GemmShape)
  typename ThreadblockShape,

  /// Warp-level tile size (concept: GemmShape)
  typename WarpShape,

  /// Instruction-level tile size (concept: GemmShape)
  typename InstructionShape,

  /// Epilogue output operator
  typename EpilogueOutputOp,

  /// Threadblock-level swizzling operator
  typename ThreadblockSwizzle,

  /// Number of stages used in the pipelined mainloop
  int Stages
>
class Gemm;
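To make the relationship between the tile-size parameters concrete, the sketch below models a tile shape as a type with compile-time M, N, and K extents and derives two quantities from it. TileShape, WarpCount, and tiles_needed are hypothetical names for this example, and the divisibility requirement is an assumption of the sketch rather than a documented CUTLASS rule.

```cpp
#include <cassert>

// Hypothetical stand-in for a GemmShape-like type: an M-by-N-by-K tile whose
// extents are compile-time constants.
template <int M, int N, int K>
struct TileShape {
  static constexpr int kM = M;
  static constexpr int kN = N;
  static constexpr int kK = K;
};

// Number of warp tiles that fit in one threadblock tile. This sketch assumes
// the threadblock tile is an exact multiple of the warp tile in each dimension.
template <typename ThreadblockShape, typename WarpShape>
struct WarpCount {
  static_assert(ThreadblockShape::kM % WarpShape::kM == 0, "M must divide evenly");
  static_assert(ThreadblockShape::kN % WarpShape::kN == 0, "N must divide evenly");
  static_assert(ThreadblockShape::kK % WarpShape::kK == 0, "K must divide evenly");

  static constexpr int kValue =
      (ThreadblockShape::kM / WarpShape::kM) *
      (ThreadblockShape::kN / WarpShape::kN) *
      (ThreadblockShape::kK / WarpShape::kK);
};

// Number of tiles of width `tile` needed to cover `extent`, rounding up so
// that a partial tile at the edge is still counted.
inline int tiles_needed(int extent, int tile) {
  assert(tile > 0);
  return (extent + tile - 1) / tile;
}
```

With a 128x128x8 threadblock tile and a 32x64x8 warp tile, for example, WarpCount yields 4 * 2 * 1 = 8 warp tiles per threadblock tile, and covering 1000 rows takes tiles_needed(1000, 128) = 8 threadblock tiles.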

Member Typedef Documentation

using GemmBatched::DefaultGemmKernel = typename kernel::DefaultGemm<
  ElementA, LayoutA, kAlignmentA,
  ElementB, LayoutB, kAlignmentB,
  ElementC, LayoutC, ElementAccumulator,
  OperatorClass, ArchTag,
  ThreadblockShape, WarpShape, InstructionShape,
  EpilogueOutputOp, ThreadblockSwizzle,
  kStages, false, Operator, false
>::GemmKernel

Constructor & Destructor Documentation

GemmBatched::GemmBatched()   [inline]

Member Function Documentation

static Status GemmBatched::can_implement(Arguments const &args)   [inline, static]

static size_t GemmBatched::get_workspace_size(Arguments const &args)   [inline, static]

Status GemmBatched::initialize(Arguments const &args, void *workspace = nullptr, cudaStream_t stream = nullptr)   [inline]

Status GemmBatched::operator()(cudaStream_t stream = nullptr)   [inline]

Status GemmBatched::operator()(Arguments const &args, void *workspace = nullptr, cudaStream_t stream = nullptr)   [inline]

Status GemmBatched::run(cudaStream_t stream = nullptr)   [inline]

Status GemmBatched::update(Arguments const &args, void *workspace = nullptr)   [inline]

Member Data Documentation

int const GemmBatched::kAlignmentA = AlignmentA   [static]

The documentation for this class was generated from the following file:

device/gemm_batched.h

Generated by Doxygen 1.8.11