
CUTLASS: File List


CUDA Templates for Linear Algebra Subroutines and Solvers

File List

Here is a list of all files with brief descriptions:

| File | Description |
| --- | --- |
| aligned_buffer.h | AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory |
| arch.h | Defines tags for architecture-specific configurations |
| array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
| array_subbyte.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
| batched_reduction.h | Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C |
| batched_reduction_traits.h | Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C |
| command_line.h | |
| complex.h | |
| conversion_op.h | Functor performing conversion operations used by epilogues |
| coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix |
| core_io.h | Helpers for printing cutlass/core objects |
| cutlass.h | Basic include for CUTLASS |
| include/cutlass/util/debug.h | Debugging and logging functionality |
| tools/util/include/cutlass/util/debug.h | Contains code for debugging cutlass code |
| default_epilogue_complex_tensor_op.h | Epilogue for threadblock scoped complex GEMMs using Tensor Ops |
| default_epilogue_simt.h | Epilogue for threadblock scoped GEMMs using SIMT |
| default_epilogue_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| default_epilogue_volta_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta |
| default_epilogue_wmma_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| default_gemm.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
| default_gemm_configuration.h | Definitions for GEMM structures |
| default_gemm_splitk_parallel.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
| default_gemv.h | |
| default_gemv_core.h | Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| default_mma_core.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_simt.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm50.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm70.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm75.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_wmma.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
| default_mma_wmma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
| default_thread_map_simt.h | |
| default_thread_map_tensor_op.h | |
| default_thread_map_volta_tensor_op.h | |
| default_thread_map_wmma_tensor_op.h | |
| device_dump.h | C++ interface to dump fragments and shared memory contents for debugging |
| device_kernel.h | Template for generic CUTLASS kernel |
| device_memory.h | C++ interface to CUDA device memory management functions |
| direct_epilogue_tensor_op.h | Epilogue for tensor operations |
| distribution.h | This header contains a class to parametrize a statistical distribution function |
| epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| epilogue_base.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| epilogue_workspace.h | Epilogue for threadblock scoped GEMMs |
| exceptions.h | C++ exception semantics for CUDA error codes |
| fast_math.h | Math utilities |
| fragment_iterator_complex_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_simt.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_volta_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_wmma_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| functional.h | Define basic numeric operators with specializations for Array<T, N>. SIMD-ize where possible |
| include/cutlass/gemm/device/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| include/cutlass/gemm/gemm.h | Defines common types used for all GEMM-like operators |
| include/cutlass/gemm/kernel/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| tools/util/include/cutlass/util/reference/device/gemm.h | Reference implementation for GEMM in device-side code |
| tools/util/include/cutlass/util/reference/device/kernel/gemm.h | Reference implementation for GEMM in host-side code |
| tools/util/include/cutlass/util/reference/device/thread/gemm.h | Reference implementation for GEMM in host-side code |
| tools/util/include/cutlass/util/reference/host/gemm.h | Reference implementation for GEMM in host-side code |
| device/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| kernel/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| include/cutlass/gemm/device/gemm_complex.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| tools/util/include/cutlass/util/reference/host/gemm_complex.h | Reference implementation for complex-valued GEMM in host-side code |
| gemm_pipelined.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
| device/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
| kernel/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
| gemv.h | Template for a threadblock-scoped GEMV kernel |
| gemv_batched_strided.h | |
| half.h | Defines a class for using IEEE half-precision floating-point types in host or device code |
| host_reorder.h | Reorder data from the host side |
| host_tensor.h | HostTensor contributes management for both host and device memory |
| inner_product.h | Reference implementation for GEMM in host-side code |
| integer_subbyte.h | Defines a class for using integer types smaller than one byte in host or device code |
| interleaved_epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| kernel_launch.h | Defines structures and helpers to launch CUDA kernels within CUTLASS |
| layout.h | Defines layout functions used by TensorRef and derived classes |
| library.h | CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS |
| linear_combination.h | Functor performing linear combination operations used by epilogues |
| linear_combination_clamp.h | Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type |
| linear_combination_relu.h | Functor performing linear combination operations used by epilogues. Values are clamped before converting to the output element type |
| manifest.h | Manifest of CUTLASS Library |
| layout/matrix.h | Defines layout functions used by TensorRef and derived classes |
| thread/matrix.h | Defines a matrix object intended for storing data in registers and operations within a CUDA thread |
| matrix_coord.h | Defines a canonical coordinate for rank=2 matrices offering named indices |
| matrix_shape.h | Defines a Shape template for matrix tiles |
| matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels |
| memory.h | Architecture-specific operators on memory |
| memory_sm75.h | Architecture-specific operators on memory added for SM75 |
| arch/mma.h | Templates exposing architecture support for multiply-add operations |
| gemm/thread/mma.h | Templates exposing architecture support for warp-level multiply-add operations |
| gemm/warp/mma.h | Templates exposing architecture support for warp-level multiply-add operations |
| mma_base.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| mma_complex_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_pipelined.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| mma_simt.h | Templates implementing warp-level matrix multiply-accumulate operations |
| mma_simt_policy.h | Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions |
| mma_simt_tile_iterator.h | Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions |
| mma_singlestage.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| arch/mma_sm50.h | Matrix multiply |
| gemm/thread/mma_sm50.h | Templates exposing architecture support for multiply-add operations |
| arch/mma_sm60.h | Matrix multiply |
| gemm/thread/mma_sm60.h | Templates exposing architecture support for multiply-add operations |
| arch/mma_sm61.h | Matrix multiply |
| gemm/thread/mma_sm61.h | Templates exposing architecture support for multiply-add operations |
| mma_sm70.h | Matrix multiply |
| mma_sm75.h | Matrix multiply for SM75 |
| mma_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_tensor_op_policy.h | Policy describing implementation details of warp-level GEMM targeting Tensor Cores |
| mma_tensor_op_sm70.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator_sm70.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator_wmma.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_wmma.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| numeric_conversion.h | Boost-like numeric conversion operator for CUTLASS numeric types |
| numeric_types.h | Top-level include for all CUTLASS numeric types |
| output_tile_thread_map.h | Metaprogram for determining the mapping of output elements to threads for epilogue tiles |
| pitch_linear.h | Defines layout functions used by TensorRef and derived classes for pitch-linear memory |
| pitch_linear_thread_map.h | Templates implementing how threads are mapped to a given tile |
| platform.h | C++ features that may be otherwise unimplemented for CUDA device functions |
| predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates |
| predicated_tile_access_iterator.h | Templates calculating the address and predicates to the load of tiles from pitch-linear rank=2 tensors |
| predicated_tile_access_iterator_2dthreadtile.h | Templates calculating the address and predicates to the load of tiles from pitch-linear rank=2 tensors |
| epilogue/threadblock/predicated_tile_iterator.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| transform/threadblock/predicated_tile_iterator.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| predicated_tile_iterator_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| real.h | |
| reduce.h | Defines basic thread level reduction with specializations for Array<T, N> |
| reduce_split_k.h | Kernel performing a reduction over densely packed tensors in global memory |
| reduction_op.h | Functor performing reduction operations used by epilogues |
| reduction_operators.h | Kernel performing a reduction over densely packed tensors in global memory |
| regular_tile_access_iterator.h | Templates implementing the address computation of storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_access_iterator_pitch_linear.h | Templates implementing computing the addresses of storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_access_iterator_tensor_op.h | Templates implementing computing the addresses of storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_pitch_linear.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_pitch_linear_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_tensor_op.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_tensor_op_sm70.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| relatively_equal.h | |
| semaphore.h | Implementation of a CTA-wide semaphore for inter-CTA synchronization |
| shared_load_iterator.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| simd.h | Templates exposing SIMD operators |
| simd_sm60.h | Templates exposing SIMD operators for SM60 |
| simd_sm61.h | Templates exposing SIMD operators for SM61 |
| simt_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration |
| subbyte_reference.h | Provides a mechanism for packing and unpacking elements smaller than one byte |
| tensor.h | Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats |
| device/tensor_compare.h | |
| host/tensor_compare.h | |
| tensor_coord.h | Defines a canonical coordinate for rank=4 tensors offering named indices |
| tensor_copy.h | |
| device/kernel/tensor_elementwise.h | |
| host/tensor_elementwise.h | |
| device/tensor_fill.h | |
| host/tensor_fill.h | |
| device/kernel/tensor_foreach.h | |
| device/tensor_foreach.h | |
| host/tensor_foreach.h | |
| tensor_norm.h | |
| tensor_op_multiplicand_sm70.h | |
| tensor_op_multiplicand_sm75.h | |
| tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
| tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data |
| tensor_view.h | Defines a structure containing strides and a pointer to tensor data |
| tensor_view_io.h | |
| gemm/threadblock/threadblock_swizzle.h | Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems |
| reduction/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the batched reduction computation |
| tile_iterator_simt.h | |
| tile_iterator_tensor_op.h | |
| tile_iterator_volta_tensor_op.h | |
| tile_iterator_wmma_tensor_op.h | |
| transpose.h | Basic copy routines for tensor views |
| type_traits.h | Type traits for common CUDA types |
| vector.h | Defines layout functions used for rank=1 vectors |
| volta_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
| wmma.h | Templates exposing architecture support for warp matrix multiply-add (WMMA) operations |
| wmma_array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
| wmma_ptx.h | Templates exposing warp matrix multiply-add (WMMA) operations |
| wmma_sm70.h | Matrix multiply |
| wmma_sm72.h | Matrix multiply |
| wmma_sm75.h | Matrix multiply |
| wmma_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |


Generated by Doxygen 1.8.11