.. _cute_dsl_naming_conventions:

This page summarizes the Hungarian-style naming conventions used for identifiers across the DSL examples and epilogue helpers: tensor partitions, per-thread copy-partitioners, copy atoms, and the axis-order suffixes that encode tensor layouts. It is meant as a lookup reference while reading example code, not as a style rule enforced on new code.

Memory scope letters:

- ``g``: Global memory view (GMEM), e.g., ``gB_nkl``, ``tTR_gC``
- ``s``: Shared memory view (SMEM), e.g., ``sA``, ``tRS_sC``, ``bSG_sC``
- ``r``: Register view (RMEM), e.g., ``tTR_rAcc``, ``tRS_rC``
- ``t``: Tensor-memory view (TMEM), used for any TMEM-resident fragment or layout regardless of role. The classical case is the accumulator (``tCtAcc``, ``tTR_tAcc``). The same scope letter also appears for non-accumulator TMEM tensors such as ``tCtE``, ``tCtState``, ``tCtQState``, ``tCtShared``. Read the operand suffix to distinguish the role from the memory scope.

Family/path prefixes:

- ``tA…`` / ``tB…``: TMA load path for A/B

  - ``tAgA`` / ``tAsA``: per-thread partitioned global/shared A for TMA load
  - ``tBgB`` / ``tBsB``: per-thread partitioned global/shared B for TMA load
  - ``tAgSFA`` / ``tAsSFA``, ``tBgSFB`` / ``tBsSFB``

- ``tC…``: Compute/epilogue path for C/Acc

  - ``tCgA`` / ``tCgB`` / ``tCgC``: per-thread partitions used by MMA/epilogue (derived from global tensors)
  - ``tCrA`` / ``tCrB``: per-thread fragments used by MMA (derived from SMEM A/B)
  - ``tCtAcc``: per-thread accumulator fragment/layout in TMEM
  - ``tC*`` tensors follow the same schema for kernels that carry more than the classical A/B/C/Acc operands (see Operands and roles below): e.g. ``tCtState`` / ``tCtQState`` / ``tCtShared`` (gated-delta-net recurrent state in TMEM), ``tCrValpha`` / ``tCrVbeta`` / ``tCrVbias`` (EVT/EFC broadcast vectors in registers), ``tCtAccInter`` / ``tCtAccIntra`` (hierarchical accumulators)

.. {$nv-internal-release begin}

  - ``tCtE`` for the sparsity metadata tensor in TMEM (sm_140 / Feynman sparse GEMM, not yet released)

.. {$nv-internal-release end}

- ``tTM…``: Per-thread TMEM tiled-copy partitions used by FMHA/attention kernels (e.g. ``tTMrO`` as the register-side view of a TMEM load partitioned through ``thr_tmem_load``)
- Attention/MLA path families (``tQ…``, ``tK…``, ``tV…``, ``tP…``, ``tO…``): same schema as ``tA…`` / ``tB…`` / ``tC…`` but specialised to the Q/K/V/P/O operands of attention kernels, e.g.:

  - ``tQsQ`` / ``tQgQ_qdl``: per-thread SMEM / GMEM partitions of Q for TMA load
  - ``tKrK`` / ``tVrV``: per-thread register fragments for K / V
  - ``tOtO`` / ``tOrO``: per-thread TMEM / register views of the attention output accumulator O
  - ``tPrP``: per-thread register fragment for the softmax probability matrix P

- ``tTR_*``: TMEM → Register (T2R)

  - ``tTR_tAcc``: TMEM accumulator source for T2R
  - ``tTR_rAcc``: Register destination for T2R
  - ``tTR_gC``: When not using TMA store, Register → Global C destination partition

- ``tRS_*``: Register → Shared (R2S)

  - ``tRS_rC``: Register source (C dtype)
  - ``tRS_sC``: Shared destination

- ``bSG_*``: Thread(b)lock partition for Shared → Global via TMA store

  - ``bSG_sC``: Shared source for TMA store
  - ``bSG_gC``: Global destination for TMA store
  - ``bSG_sAcc``, ``bSG_gAcc``
  - ``bSG_sD`` / ``bSG_gD``, ``bSG_sP`` / ``bSG_gP``, ``bSG_sY`` / ``bSG_gY``

- ``bGS_*``: Thread(b)lock partition for Global → Shared via TMA load (the load-path mirror of ``bSG_*``)

  - ``bGS_gC`` / ``bGS_sC``: Global source / Shared destination for TMA load of C-like operands (seen in EFC row/column broadcast prologues)

- ``simt_atom``: SIMT copy path used when TMA store is disabled (Register → Global)
- Generic SIMT / tiled copy atoms ``<src>2<dst>_atom[_suffix]`` name the copy direction between two memory scopes:

  - ``s2r_atom_*``: Shared → Register atom used in specialised epilogues and attention loads (e.g. ``s2r_atom_delta``, ``s2r_atom_cumsum``, ``s2r_atom_d`` in Mamba2 SSD)
  - ``r2s_atom``: Register → Shared atom
  - ``t2r_atom`` / ``r2t_atom``: Tensor memory ↔ Register atoms (paired with ``thr_tmem_load`` / ``thr_tmem_store``)
  - ``s2s_atom``: Shared → Shared atom (reshape/remap without register spill)
  - ``s2t``: Shared → Tensor memory atom

.. {$nv-internal-release begin}

  - ``sp2t_copy_op_*``: Sparse source → Tensor memory copy op (sm_140 / Feynman sparse GEMM, not yet released: e.g. ``Sp2TAsACopyOp``, ``Sp2TAsECopyOp``)

.. {$nv-internal-release end}

- ``autovec_copy`` paths appear where the DSL auto-vectorises a bespoke layout

Operands and roles:

- ``A``, ``B``, ``C``: GEMM operands
- ``Acc``: Accumulator (TMEM/Register paths). Hierarchical MMA kernels split this into ``AccInter`` / ``AccIntra`` for the inter-/intra-CTA accumulator halves
- Classical extra outputs / intermediates: ``D`` (additional output), ``Y`` (fused output), ``SFA`` / ``SFB`` (per-operand scale-factor arrays for NVFP4/FP8), ``SF`` (generic scale factor)
- Attention / MLA operand letters (Q/K/V/P/O schema):

  - ``Q`` (query), ``K`` (key), ``V`` (value), ``P`` (softmax probability / score matrix), ``O`` (attention output)
  - ``Kt`` / ``Vt`` for the transposed view of K/V, ``Qi`` / ``Ki`` / ``Vi`` for per-iteration slices, ``QK`` / ``PV`` / ``QKV`` where a single fragment spans multiple operands of the two back-to-back matmuls

- Mamba / recurrent-state letters: ``Delta`` / ``DeltaA`` (time-step and A-decay), ``State`` / ``QState`` / ``Shared`` (gated-delta-net recurrent state tensors), ``Cumsumlog`` / ``Cumprod`` (running reductions), ``Gate``, ``DecayV``

.. {$nv-internal-release begin}

- ``E`` (sparsity metadata tensor in TMEM; paired with ``sp2t_*`` copy ops)

.. {$nv-internal-release end}

- ``Valpha`` / ``Vbeta`` (alpha/beta scalars broadcast as vectors), ``Vbias`` (bias vector), ``Ainv`` (inverse of A for fused solvers)

.. {$nv-internal-release begin}

- ``LutB`` (look-up-table operand)

.. {$nv-internal-release end}

- ``CommInMC`` / ``CommOutMC`` (multicast in/out), ``CommOutUC`` (unicast out)
- ``Dv`` (value head dimension when distinct from Q/K dim), ``Nv`` (number of value heads)

Suffix encodes axis order of the view (lowercase letters each stand for one tensor mode):

- GEMM layouts use ``m`` / ``n`` / ``k`` / ``l``:

  - ``_mnl``, ``_nkl``, ``_mkl``, … list the view's modes in order, each letter drawn from (M, N, K, L)
  - ``gB_nkl`` is B with axes (N, K, L); ``gC_mnl`` is C with (M, N, L)

- Attention / FMHA layouts use ``q`` / ``k`` / ``d`` / ``l`` (sequence-Q, sequence-K, head-dim, batch):

  - ``mQ_qdl``: Q tensor with axes (SeqQ, HeadDim, Batch)
  - ``mK_kdl``: K tensor with axes (SeqK, HeadDim, Batch)
  - ``mV_dkl``: V tensor with axes (HeadDim, SeqK, Batch); the d-first order reflects the V-transpose that makes the second matmul (P·V) a standard row-major MxK·KxN

- Lower-rank 2D slices drop the batch letter: ``_mn``, ``_mk``, ``_nk``

Internally, CuTe layouts also expose grouped modes like ``MMA_M/N/K``, ``EPI_M/N``, ``RestM/N/K/L``, ``STAGE``, etc. These are typically implementation details not directly used in example code.

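The axis-suffix rule above can be sketched as a small lookup. This is a hypothetical helper written in plain Python for illustration; ``decode_axis_suffix`` and the two tables are not part of the CuTe DSL, and the axis labels are simply the ones used on this page.

```python
# Hypothetical helper, not part of the CuTe DSL: expands an axis-order suffix
# into readable axis names using the letter tables from this page.
GEMM_AXES = {"m": "M", "n": "N", "k": "K", "l": "L"}
ATTENTION_AXES = {"q": "SeqQ", "k": "SeqK", "d": "HeadDim", "l": "Batch"}


def decode_axis_suffix(name, axes=GEMM_AXES):
    """Return the axis names encoded after the last underscore of `name`."""
    _, sep, suffix = name.rpartition("_")
    if not sep or not suffix:
        raise ValueError(f"{name!r} has no axis suffix")
    unknown = [c for c in suffix if c not in axes]
    if unknown:
        # e.g. "tTR_rAcc": the part after "_" is a scope + operand, not axes
        raise ValueError(f"unknown axis letters {unknown} in {name!r}")
    return tuple(axes[c] for c in suffix)


print(decode_axis_suffix("gB_nkl"))                  # ('N', 'K', 'L')
print(decode_axis_suffix("mV_dkl", ATTENTION_AXES))  # ('HeadDim', 'SeqK', 'Batch')
```

The caller has to pick the table, because ``k`` means K in a GEMM suffix and SeqK in an attention suffix; the suffix alone does not say which family a kernel belongs to.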
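From left to right: ``[t|b][A|B|C|Q|K|V|P|O|TR|RS|SG|GS|TM]_[g|s|r|t][Operand/Role][AxisSuffix?]``

In the names on this page, the underscore follows the two-letter copy-path families (``tTR_``, ``tRS_``, ``bSG_``, ``bGS_``); single-letter families and ``tTM`` omit it (``tAgA``, ``tCtAcc``, ``tTMrO``).

- ``t`` = per-thread/partitioned view; ``b`` = block/threadblock partition context
- family/path letters:

  - ``A`` / ``B`` / ``C`` (GEMM), ``Q`` / ``K`` / ``V`` / ``P`` / ``O`` (attention)
  - ``TR`` (TMEM → Register), ``RS`` (Register → Shared), ``SG`` (Shared → Global, store), ``GS`` (Global → Shared, load), ``TM`` (TMEM tiled-copy partition), ``R2G`` / ``S2R`` / ``T2R`` / ``R2T`` convenience aliases

- memory = ``g`` / ``s`` / ``r`` / ``t``
- operand/role = ``A`` / ``B`` / ``C`` / ``Acc`` / ``SFA`` / ``SFB`` / ``Q`` / ``K`` / ``V`` / ``P`` / ``O`` / ``E`` / ``State`` / …
- axis suffix = ``_mnl``, ``_nkl``, ``_qdl``, ``_kdl``, ``_dkl``, ``_mn``, … when applicable
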
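To make the left-to-right reading concrete, here is a minimal sketch that splits a partition name into its fields with a regular expression. The regex, ``ParsedName`` and ``parse_name`` are assumptions made for this illustration and are not shipped with the DSL. The underscore is treated as optional because names such as ``tAgA`` omit it, and the ``R2G``-style aliases are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Scope letters as listed on this page.
SCOPES = {"g": "GMEM", "s": "SMEM", "r": "RMEM", "t": "TMEM"}

# Hypothetical pattern: [t|b] family [_] scope Operand [_axes]
_NAME = re.compile(
    r"^(?P<kind>[tb])"
    r"(?P<family>TR|RS|SG|GS|TM|[ABCQKVPO])_?"
    r"(?P<scope>[gsrt])"
    r"(?P<operand>[A-Z][A-Za-z0-9]*)"
    r"(?:_(?P<axes>[a-z]+))?$"
)


class ParsedName(NamedTuple):
    kind: str            # "per-thread" or "block"
    family: str          # A, B, C, Q, ..., TR, RS, SG, GS, TM
    scope: str           # GMEM, SMEM, RMEM or TMEM
    operand: str         # A, Acc, SFA, State, ...
    axes: Optional[str]  # e.g. "qdl", or None when there is no suffix


def parse_name(name):
    """Split a partition name into its fields, or return None if it does not fit."""
    m = _NAME.match(name)
    if m is None:
        return None
    return ParsedName(
        kind="per-thread" if m["kind"] == "t" else "block",
        family=m["family"],
        scope=SCOPES[m["scope"]],
        operand=m["operand"],
        axes=m["axes"],
    )


for name in ("tAgA", "tTR_rAcc", "bSG_gC", "tQgQ_qdl", "tTMrO"):
    print(name, "->", parse_name(name))
```

Names without a ``t``/``b`` prefix, such as ``gB_nkl``, fall outside this pattern and return ``None``; they carry only a scope letter, an operand, and an axis suffix.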
Per-thread-partitioner objects follow a parallel ``thr_*`` vocabulary, grouped by role:

- ``thr_mma``
- ``thr_copy_<src>2<dst>``: ``thr_copy_g2s``, ``thr_copy_s2r``, ``thr_copy_t2r``, ``thr_copy_r2s``, ``thr_copy_r2t``, ``thr_copy_s2t``
- ``thr_copy_sfa``, ``thr_copy_sfb``, ``thr_copy_load``, ``thr_copy_beta_g2s``
- ``thr_mma_qk``, ``thr_mma_pv``, ``thr_mma_kv``, ``thr_mma_qkv``, ``thr_mma_intra1`` / ``thr_mma_intra2``, ``thr_mma_leader_cta``, ``thr_mma_sfb``
- ``thr_tmem_load``, ``thr_tmem_store`` (with ``_stats`` / ``_vec`` suffix variants)

The tensor produced by ``thr_foo.partition_S(X)`` or ``.partition_D(X)`` is then named by the ``[t|b]FamilyPrefix_*`` convention above.

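As a sketch of how a copy direction maps onto the resulting tensor names, the helper below derives the source-side and destination-side names from a ``<src>2<dst>`` direction. It assumes ``partition_S`` yields the source view and ``partition_D`` the destination view. ``partition_names`` and ``FAMILY_KIND`` are hypothetical and cover only the four copy-path families spelled out on this page.

```python
import re

# Only the copy-path families listed on this page; "t" = per-thread, "b" = block.
FAMILY_KIND = {"TR": "t", "RS": "t", "SG": "b", "GS": "b"}


def partition_names(direction, operand):
    """Return (source_name, destination_name) for a <src>2<dst> copy direction.

    `direction` may be a bare direction ("t2r") or a partitioner name
    ("thr_copy_t2r"). Hypothetical helper, for illustration only.
    """
    m = re.fullmatch(r"(?:thr_copy_)?([gsrt])2([gsrt])", direction)
    if m is None:
        raise ValueError(f"not a <src>2<dst> direction: {direction!r}")
    src, dst = m.groups()
    family = (src + dst).upper()
    if family not in FAMILY_KIND:
        raise ValueError(f"family {family!r} is not listed on this page")
    prefix = f"{FAMILY_KIND[family]}{family}_"
    # The source view keeps the source scope letter, the destination view the other.
    return prefix + src + operand, prefix + dst + operand


print(partition_names("thr_copy_t2r", "Acc"))  # ('tTR_tAcc', 'tTR_rAcc')
print(partition_names("thr_copy_r2s", "C"))    # ('tRS_rC', 'tRS_sC')
print(partition_names("s2g", "C"))             # ('bSG_sC', 'bSG_gC')
```

Directions such as ``s2r`` raise an error here on purpose: this page names the ``s2r_atom_*`` atoms and the ``thr_copy_s2r`` partitioner, but it does not list a matching tensor family prefix, so the sketch does not guess one.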
Open these files in the repository to see each pattern in context:

- TMA load partitions for A/B:

  - ``tAgA``, ``tAsA``, ``tBgB``, ``tBsB``
  - ``CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm.py`` (around TMA partition of A/B)

- Accumulator fragment in TMEM:

  - ``tCtAcc``
  - ``CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm.py`` (accumulator creation and use)

- TMEM → Register (T2R):

  - ``tTR_tAcc``, ``tTR_rAcc``, ``tTR_gC``
  - ``CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm.py`` (``epilog_tmem_copy_and_partition``)

- Register → Shared (R2S):

  - ``tRS_rC``, ``tRS_sC``
  - ``CuTeDSL/cute/blackwell/kernel/mixed_input_gemm/mixed_input_gemm.py`` (``epilog_smem_copy_and_partition``)

- Shared → Global via TMA store:

  - ``bSG_sC``, ``bSG_gC``
  - ``CuTeDSL/cute/blackwell/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent.py`` (``epilog_gmem_copy_and_partition``)

- NVFP4/FP8 scale factors:

  - ``tAgSFA`` / ``tAsSFA``, ``tBgSFB`` / ``tBsSFB``
  - ``CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/nvfp4_gemm_0.py`` (scale factor partition and usage)

Additional examples across ``examples/``:

- Register → Global helper naming in MLA: ``tR2G_rO_src``, ``tR2G_rO_dst``

  - ``CuTeDSL/cute/blackwell/kernel/attention/mla/mla_decode_fp16.py`` (output store section)

- Shared → Register SIMT atoms in Mamba2 SSD: ``s2r_atom_delta``, ``s2r_atom_cumsum``, ``s2r_atom_d``

  - ``CuTeDSL/cute/blackwell/kernel/attention/mamba2_ssd/mamba2_ssd.py`` (SMEM load paths for delta and D)

- ``thr_*`` slices for partitioning per-thread work: ``thr_mma``, ``thr_copy_t2r``, ``thr_copy_r2s``, etc.

  - ``CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm.py`` (``thr_mma``, ``thr_copy_t2r``, ``thr_copy_r2s``)

- Axis-order suffix examples:

  - ``gB_nkl``, ``gC_mnl``
  - ``CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm.py`` (global tensor tiling and partitioning)

- Global → Shared (TMA load) block partition ``bGS_*``:

  - ``bGS_gC``, ``bGS_sC``
  - ``CuTeDSL/cute/blackwell/efc/common_efc.py`` (row/column broadcast prologue building the C-like input for EVT)

- Attention Q/K/V/P/O families and ``_qdl`` / ``_kdl`` / ``_dkl`` axis suffixes:

  - ``tQsQ``, ``tQgQ_qdl``, ``mK_kdl``, ``mV_dkl``: ``CuTeDSL/cute/hopper/kernel/attention/fmha.py`` (Q/K/V TMA partitions)
  - ``tOtO``, ``tOrO``, ``tPrP``: ``CuTeDSL/cute/blackwell/tutorial/tutorial_fmha/fmha_0.py`` (output and softmax fragments)
  - ``tKrK``, ``tVrV``: ``CuTeDSL/cute/blackwell/kernel/attention/mixed_input_fmha/mixed_input_fmha_decode.py`` (mixed-input K/V register fragments)

- TMEM tiled-copy ``tTM*`` family and the generalised ``<src>2<dst>_atom`` naming:

  - ``tTMrO`` driven by ``thr_tmem_load``
  - ``CuTeDSL/cute/blackwell/tutorial/tutorial_fmha/fmha_0.py``

- Recurrent-state operands (``State`` / ``QState`` / ``Shared``) in TMEM:

  - ``tCtState``, ``tCtQState``, ``tCtShared``
  - ``CuTeDSL/cute/blackwell/kernel/attention/gated_delta_net/gated_delta_net_chunked.py``

.. {$nv-internal-release begin}

- Sparse-metadata operand ``E`` and ``sp2t_*`` copy ops (sm_140 / Feynman, not yet released):

  - ``tCtE``, ``sp2t_copy_op_A``, ``sp2t_copy_op_E``
  - ``CuTeDSL/internal/feynman/sm140_sparse_gemm.py`` and ``sm140_sparse_gemm_temporal_split_k.py``

- LUT-based block-scaled GEMM operand ``LutB`` (Rubin, not yet released):

  - ``CuTeDSL/cute/rubin/kernel/blockscaled_gemm/dense_blockscaled_gemm_lut.py``
  - ``CuTeDSL/cute_ext/rubin/dense_gemm_lutb.py``

.. {$nv-internal-release end}

- Richer ``thr_*`` and ``thr_copy_*`` / ``thr_mma_*`` / ``thr_tmem_*`` partitioner taxonomy:

  - ``thr_copy_g2s``, ``thr_copy_s2r``, ``thr_copy_s2t``, ``thr_copy_r2t``, ``thr_mma_qk``, ``thr_mma_pv``, ``thr_tmem_load``, ``thr_tmem_store``
  - Any ``fmha_*.py`` or ``mamba2_ssd.py`` file will show the full vocabulary in use
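When browsing one of the files above, a quick way to surface the convention-following identifiers is a regular-expression scan. The sketch below is a hypothetical helper, not a repository tool. The pattern covers only the prefixes described on this page, and the sample source string is made up for illustration.

```python
import re

# Hypothetical pattern covering the prefixes described on this page.
CONVENTION = re.compile(
    r"\b(?:"
    r"[tb](?:TR|RS|SG|GS)_[gsrt][A-Z]\w*"   # tTR_rAcc, bSG_gC, ...
    r"|t(?:TM|[ABCQKVPO])[gsrt][A-Z]\w*"    # tAgA, tCtAcc, tTMrO, ...
    r"|thr_(?:mma|copy|tmem)\w*"            # thr_mma, thr_copy_t2r, ...
    r")\b"
)


def find_convention_names(source):
    """Return the sorted, de-duplicated identifiers that match the convention."""
    return sorted(set(CONVENTION.findall(source)))


# Made-up snippet; the function names on the right-hand sides are placeholders.
SAMPLE = """
thr_copy_t2r = make_partitioner()
tTR_tAcc = thr_copy_t2r.partition_S(acc)
tTR_rAcc = thr_copy_t2r.partition_D(regs)
bSG_sC, bSG_gC = block_partition(smem_c, gmem_c)
"""

print(find_convention_names(SAMPLE))
```

To scan a real file, pass its text instead of ``SAMPLE``, for example ``find_convention_names(pathlib.Path(path).read_text())`` after importing ``pathlib``.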