src/plugins/intel_cpu/src/nodes/kernels/simd/README.md
Namespace: ov::Extensions::Cpu::XARCH::simd
(XARCH is a macro set by OpenVINO's cross-compilation framework — it
resolves to AVX2, AVX512F, or ANY depending on the target ISA of each
translation unit.)
Compile-time abstraction over AVX-512 / AVX2 / scalar.
vec<T, i> wraps the native register. Operators are members; every other
operation is a free function found via ADL.
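
The member-operator plus ADL split can be illustrated with a minimal, self-contained sketch. The `sketch` namespace, the single-lane scalar specialization, and the `axpy_one` helper are assumptions made for illustration; they only mirror names used in this README and are not the actual headers.

```cpp
#include <cassert>

// Hypothetical stand-in for ov::Extensions::Cpu::XARCH; illustrative only.
namespace sketch {
namespace simd {

enum class isa { scalar, avx2, avx512 };

// Primary template: declared only. Each per-ISA header would specialize it.
template <typename T, isa I>
struct vec;

// Scalar specialization: one lane, the "register" is a plain float.
template <>
struct vec<float, isa::scalar> {
    using element_type = float;
    static constexpr int width = 1;
    static constexpr isa isa_value = isa::scalar;

    float v;

    vec() = default;
    explicit vec(float x) : v(x) {}  // broadcast

    // Arithmetic operators are members.
    vec operator+(vec o) const { return vec(v + o.v); }
    vec operator*(vec o) const { return vec(v * o.v); }
};

// Everything else is a free function in the same namespace as vec, so an
// unqualified call finds it through argument-dependent lookup.
inline vec<float, isa::scalar> fmadd(vec<float, isa::scalar> a,
                                     vec<float, isa::scalar> b,
                                     vec<float, isa::scalar> c) {
    return vec<float, isa::scalar>(a.v * b.v + c.v);
}
inline void store(vec<float, isa::scalar> x, float* p) { *p = x.v; }
inline float reduce(vec<float, isa::scalar> x) { return x.v; }

}  // namespace simd
}  // namespace sketch

// Caller outside the simd namespace: fmadd and reduce are unqualified and
// still resolve, because their arguments live in sketch::simd.
inline float axpy_one(float a, float x, float y) {
    using f32 = sketch::simd::vec<float, sketch::simd::isa::scalar>;
    return reduce(fmadd(f32(a), f32(x), f32(y)));
}
```

Because the free functions are overloaded on the `vec` type, kernel code written against `fmadd`, `store`, and `reduce` does not need to name the ISA at the call site.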
| File | Contents |
|---|---|
| simd_common.hpp | isa enum, active_isa, primary templates for vec/mask |
| simd_scalar.hpp | Scalar specializations (always available, no #ifdef) |
| simd_avx2.hpp | AVX2 specializations (no #ifdef inside) |
| simd_avx512.hpp | AVX-512 specializations (no #ifdef inside) |
| simd.hpp | Aggregator: includes the headers above, plus aliases (f32, i32), the load API, and table |
| simd_loop.hpp | Unified loop/reduction frontend: simd_loop, simd_loop_reduce, active-aware wrappers |
Each per-ISA header includes simd_common.hpp and is self-contained. The
only #ifdef is in simd.hpp — one conditional include per ISA.
simd_avx2.hpp and simd_avx512.hpp depend on scaled_attn/common.hpp
for mm256_loadu_u4_to_f32 / mm512_loadu_u4_to_f32 (used in
load_u4_pair). Moving this logic into simd would make the directory
fully self-contained.
#include "nodes/kernels/simd/simd.hpp"
using namespace ov::Extensions::Cpu::XARCH;
simd::f32 a(1.0f); // broadcast
auto b = simd::load<simd::f32>(ptr); // load (same-type or converting)
auto c = fmadd(a, b, simd::f32(0.0f)); // FMA (ADL from vec args)
store(c, out_ptr); // store (ADL)
float sum = reduce(c); // horizontal sum (ADL)
simd::table<16> cb(codebook_data); // 16-entry LUT in registers
auto decoded = cb.lookup(indices); // parallel LUT lookup
simd_loop.hpp provides a higher-level frontend on top of vec<T, i> for
elementwise traversal and reduction kernels.
Elementwise traversal:
simd::simd_loop(n, [&](int j, auto a) {
auto v = simd::load<simd::f32>(ptr + j, a);
...
});
Reduction kernels:
float sum = simd::simd_loop_reduce<4>(
n,
[&](int j, simd::f32& acc) { ... },
[&](int j, float& tail) { ... });
The intent is:
The current backend lowering is still simple:
- scalar
- AVX2
- AVX-512
So the frontend abstraction is in place, but per-ISA tail strategies are not fully implemented yet.
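
As a rough sketch of what a simple lowering could look like, the following assumes the most basic strategy: full-width chunks followed by a one-element-at-a-time tail. Whether any of the real backends does exactly this is an assumption. The explicit width parameter `W`, the `full_lanes` / `one_lane` descriptor types, and `scale2` are hypothetical names, and the inner `for` loop stands in for a vector operation.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical lane descriptors passed as the second callback argument.
template <int W>
struct full_lanes {
    static constexpr int count = W;
};
struct one_lane {
    static constexpr int count = 1;
};

// Calls body(j, lanes) over [0, n): full W-wide chunks first, then the
// remaining elements one at a time.
template <int W, typename Body>
void simd_loop(int n, Body&& body) {
    int j = 0;
    for (; j + W <= n; j += W) {
        body(j, full_lanes<W>{});
    }
    for (; j < n; ++j) {
        body(j, one_lane{});
    }
}

// Example kernel: out[i] = 2 * in[i]. The same generic lambda serves both
// the full-width body and the tail; only the lane descriptor type differs.
inline void scale2(const float* in, float* out, int n) {
    simd_loop<4>(n, [&](int j, auto lanes) {
        for (int k = 0; k < decltype(lanes)::count; ++k) {
            out[j + k] = 2.0f * in[j + k];
        }
    });
}

}  // namespace sketch
```

The point of passing the descriptor as `auto` is that the callback is instantiated once per descriptor type, so the kernel author writes the body a single time.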
The intended evolution is:
- AVX2
- AVX-512
- SVE
- RVV
That requires evolution in four places:
- for_each_chunk<I>()
- active_lanes<I>
- load/store/reduce(..., active_lanes<I>)
- simd_loop_reduce
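
One way these pieces could fit together is sketched below, assuming the goal is a predicated (masked) tail in which the last chunk is processed at full width with only some lanes active. This is an emulation on plain arrays: `W`, the runtime `active_lanes` struct (the README's version is a template on the ISA), the array-backed `f32`, and the `add_one` / `sum` kernels are all hypothetical simplifications.

```cpp
#include <cassert>

namespace sketch {

constexpr int W = 4;  // hypothetical vector width in float lanes

// Hypothetical descriptor: number of valid leading lanes in the current chunk.
struct active_lanes {
    int count;
};

// Emulated register; a real backend would wrap a native vector type.
struct f32 {
    float lane[W];
};

// Visits [0, n) in W-wide chunks. The last chunk may be partial, in which
// case active_lanes::count is smaller than W.
template <typename F>
void for_each_chunk(int n, F&& f) {
    for (int j = 0; j < n; j += W) {
        const int left = n - j;
        f(j, active_lanes{left < W ? left : W});
    }
}

// Active-aware load: inactive lanes are zero-filled and never read from memory.
inline f32 load(const float* p, active_lanes a) {
    f32 r{};
    for (int k = 0; k < a.count; ++k) r.lane[k] = p[k];
    return r;
}

// Active-aware store: memory behind inactive lanes is never written.
inline void store(f32 v, float* p, active_lanes a) {
    for (int k = 0; k < a.count; ++k) p[k] = v.lane[k];
}

// Active-aware horizontal sum: only active lanes contribute.
inline float reduce(f32 v, active_lanes a) {
    float s = 0.0f;
    for (int k = 0; k < a.count; ++k) s += v.lane[k];
    return s;
}

// Elementwise kernel: out[i] = in[i] + 1, with no separate scalar tail.
inline void add_one(const float* in, float* out, int n) {
    for_each_chunk(n, [&](int j, active_lanes a) {
        f32 v = load(in + j, a);
        for (int k = 0; k < W; ++k) v.lane[k] += 1.0f;  // full-width op
        store(v, out + j, a);  // inactive lanes are dropped here
    });
}

// Reduction kernel built from the same pieces.
inline float sum(const float* p, int n) {
    float total = 0.0f;
    for_each_chunk(n, [&](int j, active_lanes a) { total += reduce(load(p + j, a), a); });
    return total;
}

}  // namespace sketch
```

In this shape the kernel body is identical for full and partial chunks; only the descriptor handed to load, store, and reduce changes.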
In simd_common.hpp, add to the enum:
enum class isa { scalar, avx2, avx512, neon /* new */ };
Create simd_<name>.hpp (e.g. simd_neon.hpp), using any existing per-ISA
header as a template. The file must:
- Include simd_common.hpp
- Place its definitions in namespace ov::Extensions::Cpu::XARCH::simd
- Specialize vec<float, isa::neon> and vec<int32_t, isa::neon> with:
  - using element_type = ...;
  - static constexpr int width = ...;
  - static constexpr isa isa_value = ...;
  - operator+, -, * (float) and operator& (int32)
- Contain no #ifdef guards inside the file.
Add one conditional include in simd.hpp:
#if defined(HAVE_NEON)
#include "simd_neon.hpp"
#endif
In simd_common.hpp, add the preprocessor branch:
#if defined(HAVE_AVX512F)
inline constexpr isa active_isa = isa::avx512;
#elif defined(HAVE_AVX2)
inline constexpr isa active_isa = isa::avx2;
#elif defined(HAVE_NEON) // new
inline constexpr isa active_isa = isa::neon;
#else
inline constexpr isa active_isa = isa::scalar;
#endif
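
Once active_isa is a compile-time constant, generic code can branch on it without further preprocessor use. The sketch below repeats the selection chain in a hypothetical `sketch` namespace and adds two illustrative helpers, `float_lanes` and `full_chunk_end`, which are assumptions and not part of the real headers. The lane counts follow from the register sizes (512, 256, and 128 bits of 32-bit floats). It requires C++17.

```cpp
#include <cassert>

namespace sketch {

enum class isa { scalar, avx2, avx512, neon };

// Same selection chain as above; with no HAVE_* macro defined this
// resolves to isa::scalar.
#if defined(HAVE_AVX512F)
inline constexpr isa active_isa = isa::avx512;
#elif defined(HAVE_AVX2)
inline constexpr isa active_isa = isa::avx2;
#elif defined(HAVE_NEON)
inline constexpr isa active_isa = isa::neon;
#else
inline constexpr isa active_isa = isa::scalar;
#endif

// Hypothetical helper: number of float lanes per register for each ISA.
template <isa I>
constexpr int float_lanes() {
    if constexpr (I == isa::avx512) {
        return 16;
    } else if constexpr (I == isa::avx2) {
        return 8;
    } else if constexpr (I == isa::neon) {
        return 4;
    } else {
        return 1;
    }
}

// Hypothetical helper: index where the full-width part of [0, n) ends.
template <isa I = active_isa>
constexpr int full_chunk_end(int n) {
    return n - n % float_lanes<I>();
}

}  // namespace sketch
```

Defaulting the template argument to active_isa lets callers write `full_chunk_end(n)` while tests can still pin a specific ISA explicitly.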
Every per-ISA header must provide these free functions for vec<float, isa::X>:
Required (used by codec infrastructure):
| Function | Signature |
|---|---|
| store | void store(vec<float, I> v, float* p) |
| store | void store(vec<int32_t, I> v, int32_t* p) |
| reduce | float reduce(vec<float, I> v) |
| fmadd | vec<float, I> fmadd(vec<float, I> a, b, c) |
| load (float) | vec<float, I> load(const float* p, vec<float, I>*) |
| load (f16) | vec<float, I> load(const ov::float16* p, vec<float, I>*) |
| load (bf16) | vec<float, I> load(const ov::bfloat16* p, vec<float, I>*) |
| load (u8→f32) | vec<float, I> load(const uint8_t* p, vec<float, I>*) |
| load (i32) | vec<int32_t, I> load(const int32_t* p, vec<int32_t, I>*) |
| load (u8→i32) | vec<int32_t, I> load(const uint8_t* p, vec<int32_t, I>*) |
| partial_load | vec<float, I> partial_load(uint32_t k, const float* p, vec<float, I>*) |
| load_u4 | vec<float, I> load_u4(const uint8_t* p, int bit_offset, vec<float, I>*) |
| load_u4_pair | void load_u4_pair(const uint8_t* p, vec<float, I>& lo, vec<float, I>& hi) |
| load_u8_pair | void load_u8_pair(const uint8_t* p, vec<float, I>& lo, vec<float, I>& hi) |
| permute | vec<float, I> permute(vec<float, I> table, vec<int32_t, I> idx) |
| srlv | vec<int32_t, I> srlv(vec<int32_t, I> val, vec<int32_t, I> shift) |
| select | vec<float, I> select(mask<I> m, vec<float, I> if_false, vec<float, I> if_true) |
| comparisons | mask<I> operator>(vec<int32_t, I>, vec<int32_t, I>) (all 6) |
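
The trailing vec<...>* parameter in the load signatures reads like a type tag: it lets two overloads take the same source pointer type and differ only in the vector type they produce. The sketch below shows that pattern under the assumption that the front-end `load<V>(ptr)` forwards a null `V*` to the per-ISA overloads. The `sketch` namespace and the single-lane scalar types are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
namespace simd {

enum class isa { scalar };

template <typename T, isa I>
struct vec;
template <>
struct vec<float, isa::scalar> {
    float v;
};
template <>
struct vec<std::int32_t, isa::scalar> {
    std::int32_t v;
};

using f32 = vec<float, isa::scalar>;
using i32 = vec<std::int32_t, isa::scalar>;

// Back-end overloads. The first two take the same source type and are told
// apart only by the unused tag pointer, which names the result type.
inline f32 load(const std::uint8_t* p, f32*) { return f32{static_cast<float>(*p)}; }
inline i32 load(const std::uint8_t* p, i32*) { return i32{static_cast<std::int32_t>(*p)}; }
inline f32 load(const float* p, f32*) { return f32{*p}; }

// Front end: the caller names the result type, and the tag is synthesized
// here. The unqualified call picks the matching two-argument overload.
template <typename V, typename Src>
V load(const Src* p) {
    return load(p, static_cast<V*>(nullptr));
}

}  // namespace simd
}  // namespace sketch
```

With this shape, adding a new source or destination type for an ISA means adding one more back-end overload; the front end stays unchanged.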
Optional (only needed for specific code paths):
| Function | Used by |
|---|---|
| permute2 | table::lookup (N > W, AVX-512 path) |
| unpack_lo/hi | Polar interleave/deinterleave |
| unpack_lo/hi_64 | AVX2 deinterleave |
| shuffle<imm> | AVX2 deinterleave |
| permute_lanes<ctrl> | AVX2 interleave |
| permute_64<ctrl> | AVX2 deinterleave |
| broadcast_halves | 3-bit unpack (AVX-512 only) |
| select (4-arg) | AVX2 compare+blend convenience |
Scalar stubs (assert-false or static_assert) are acceptable for optional
functions that are guarded by if constexpr(i != isa::scalar) at call sites.
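
A minimal sketch of the static_assert variant follows, assuming C++17. The names `swap_pair`, `maybe_swap`, `dependent_false`, and the two-value `isa` enum are hypothetical. The stub is a template whose static_assert only fires if its body is instantiated, and the `if constexpr` guard at the call site ensures that never happens for the scalar ISA.

```cpp
#include <cassert>

namespace sketch {

enum class isa { scalar, wide };

// Toy vector with two lanes for every ISA, to keep the example small.
template <isa I>
struct vec {
    float lane[2];
};

// Always false, but dependent on I, so the assertion below is only
// evaluated when the stub body is actually instantiated.
template <isa I>
struct dependent_false {
    static constexpr bool value = false;
};

// Generic stub for an optional operation: compiles until someone calls it.
template <isa I>
vec<I> swap_pair(vec<I> x) {
    static_assert(dependent_false<I>::value, "swap_pair: not implemented for this ISA");
    return x;
}

// Real implementation for the ISA that supports the operation. As a
// non-template overload it is preferred over the stub for vec<isa::wide>.
inline vec<isa::wide> swap_pair(vec<isa::wide> x) {
    return vec<isa::wide>{{x.lane[1], x.lane[0]}};
}

// Call site: the guarded branch is discarded for isa::scalar, so the stub
// is never instantiated there and the static_assert does not fire.
template <isa I>
vec<I> maybe_swap(vec<I> x) {
    if constexpr (I != isa::scalar) {
        return swap_pair(x);
    } else {
        return x;  // scalar path does not need the optional operation
    }
}

}  // namespace sketch
```

Replacing `if constexpr` with a plain `if` would instantiate the stub for the scalar ISA and turn the missing implementation into a compile error, which is the behavior the guard is meant to avoid.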
The codec infrastructure (codecs.hpp, turboq_codecs.hpp, polar_codecs.hpp,
mha_kv_cache_codec.cpp) is generic over vec<T, i>.
If a new ISA only needs vec<T, i> support, that often requires no
consumer changes outside this directory.
If the ISA is also meant to participate in simd_loop, then adding a new
per-ISA header is not enough. You should also consider:
- active_lanes<I>
- for_each_chunk<I>()
- load/store/reduce