# Blackwell Blockwise and Groupwise GEMM (examples/81_blackwell_gemm_blockwise)
Blockwise and Groupwise GEMM, and their Grouped GEMM variants, implement software scaling in the accumulator type. The examples in this directory demonstrate how to instantiate and run these kernels. The CUTLASS profiler can instantiate and profile different kernel configurations for Blockwise and Groupwise GEMM to determine the best-performing kernel for your workload.
Blockwise and Groupwise GEMM operations enable fine-grained numerical precision control by applying scale factors at configurable granularities. This is particularly useful for quantized neural networks where different regions of tensors may have different scaling requirements.
For a GEMM $D = \alpha A B + \beta C$, we introduce two scale factor tensors, SFA and SFB. This leads to a GEMM $D = \alpha (\text{SFA} \odot A)(\text{SFB} \odot B) + \beta C$, where $\odot$ denotes elementwise multiplication with each scale factor broadcast across its granule.
These scale factor tensors can be represented in CuTe as layouts whose intra-granule modes have 0-element strides: the 0 stride maps every coordinate within the same granule to the same element of the scale factor tensor.
For convenience, the Blockwise and Groupwise implementations provide

```cpp
cutlass::detail::Sm100BlockwiseScaleConfig<ScaleGranularityM, ScaleGranularityN, ScaleGranularityK>
```

to deduce layouts and manage compact tensors. By default this configuration makes every scale-factor tensor major in the M/N mode, but the majors can be configured. For example:

```cpp
cutlass::detail::Sm100BlockwiseScaleConfig<ScaleGranularityM, ScaleGranularityN, ScaleGranularityK,
                                           UMMA::Major::K, UMMA::Major::MN>
```

denotes that SFA will be major in the K dimension while SFB will be major in the N dimension.
If translating from frameworks like Torch, where SFA has shape (M / ScaleGranularityM, K / ScaleGranularityK) and SFB has shape (K / ScaleGranularityK, N / ScaleGranularityN), be sure to transpose SFB and B to fit the canonical CuTe layout form, which keeps K as the second mode for both tensors. The strides can then be used to determine whether each tensor is MN-major or K-major, so the layouts can be formed correctly, either directly or with the convenience wrappers.
To determine the most performant Blockwise/Groupwise GEMM or Grouped GEMM kernel for your use case, you can use the CUTLASS profiler.
All Blockwise/Groupwise GEMMs and Grouped GEMMs with f32 scaling of e4m3 or runtime f8 types can be selected as a subset of kernels when configuring with CMake by passing:

```
-DCUTLASS_LIBRARY_KERNELS="cutlass3x*f32xe4m3_*f32xe4m3*,cutlass3x*f32xf8_*f32xf8*"
```

You can further reduce the number of kernels generated by specifying the SFA and SFB scale granularities (e.g., `cutlass3x*1x128f32xe4m3_*128x128f32xe4m3*`).
The simplest way to use the profiler is to pass `m`, `n`, and `k` as well as your `scale_vec_size_m`, `scale_vec_size_n`, and `scale_vec_size_k`. Passing `enable-best-kernel-for-fixed-shape` performs some autotuning per kernel to determine the best rasterization orders, swizzles, and cluster sizes. Passing `blockwiseGemm` or `GroupedGemm` through the `operation` flag determines which set of operations will be profiled.
For example, this command using the cutlass profiler will dump the performance of all compiled kernels which support scale granularity m = 1, scale granularity n = 128, and scale granularity k = 128 for the problem size 8192x8192x8192:
```
cutlass_profiler --operation=blockwiseGemm \
  --enable-best-kernel-for-fixed-shape \
  --m=8192 --n=8192 --k=8192 \
  --scale_vec_size_m=1 --scale_vec_size_n=128 --scale_vec_size_k=128 \
  --verification-enabled=false
```
The naming of the blockwise and groupwise kernels includes the following new pattern: for each tensor/scale-factor pair we have `<scale_granularity_m or scale_granularity_n>x<scale_granularity_k><accumulator type>x<scaled tensor type>`. For example, `cutlass3x_sm100_tensorop_gemm_64x128f32xe4m3_1x128f32xe4m3_f32_f16_f16_64x128x128_1x1x1_0_nnn_align16_1sm` would denote a kernel with f32 scale factors applied at a granularity of 64 in M and 128 in K to an e4m3 A tensor, and f32 scale factors applied at a granularity of 1 in N and 128 in K to an e4m3 B tensor.
It is also worthwhile to note that C can be void if scaling by beta is not needed.
The smallest MMA_M dimension is 64, but the MMA_N dimension can be as small as 8 for some instructions. For problem sizes where M is small, consider computing $D^T = \alpha B^T A^T + \beta C^T$ instead. With the smaller MMA_N we can tile more effectively without performing unnecessary computation. To do this, swap the m and n inputs and adjust the layouts to reflect this swapping and transposing.
When using blockwise and groupwise GEMM, we must also swap the scale vector sizes when applying this optimization: if we have a 1-element scale granularity in M and a 128-element scale granularity in N, we must run a kernel with a 128-element scale granularity in M and a 1-element scale granularity in N.