polarquant-cpu


C reference + AVX2 + NEON kernels and a GGUF converter for the on-device PolarQuant Q4 weight format (block_q4_polar, GGML type tag Q4_POLAR=45).

The standalone static library here is the behavioural source of truth for the kernels. Drop-in patches for the Apothic-AI/llama.cpp-1bit-turboquant fork live under fork-integration/ (separate quants-polar.{h,c} + .patch deltas for ggml-common.h, ggml.h, ggml-cpu.c, ggml-quants.c, and ggml/src/ggml-cpu/CMakeLists.txt).

What is in here

| File | Purpose |
| --- | --- |
| include/polarquant/polar_centroids.h | 16 Lloyd-Max centroids for N(0,1), generated. |
| include/polarquant/polar_block.h | block_q4_polar layout (locked) + fp16<->fp32 helpers. |
| include/polarquant/polarquant.h | Public API: encoder, decoder, dot product, QJL signs, SIMD dispatcher. |
| src/polar_hadamard.c | In-place size-128 Walsh-Hadamard butterfly (scalar). |
| src/polar_qjl.c | Deterministic per-block +/-1 sign vector (xorshift32). |
| src/polar_quantize_ref.c | quantize_row_q4_polar_ref (norm -> WHT -> bucketize -> pack + 1-bit residual). |
| src/polar_dequantize_ref.c | dequantize_row_q4_polar_ref (unpack -> centroid LUT -> inverse WHT -> rescale). |
| src/polar_dot_ref.c | ggml_vec_dot_q4_polar_q8_0_ref (matmul kernel; mirrors ggml_vec_dot_q4_K_q8_K). |
| src/polar_dequantize_avx2.c | AVX2 dequantizer (FMA-vectorised Hadamard butterfly). |
| src/polar_dot_avx2.c | AVX2 dot product against Q8_0 activations. |
| src/polar_dequantize_neon.c | ARM NEON dequantizer (FMA-vectorised Hadamard butterfly). |
| src/polar_dot_neon.c | ARM NEON dot product against Q8_0 activations. |
| src/polar_dispatch.c | Compile-time dequantize_row_q4_polar / ggml_vec_dot_q4_polar_q8_0 dispatcher. |
| test/polar_roundtrip_test.c | Round-trip a float[128] and check rel-L2 against the Python reference's measured rate. |
| test/polar_dot_test.c | Dot product against an unquantized fp32 reference, same tolerance. |
| test/polar_simd_parity_test.c | SIMD-vs-scalar parity over 100 random blocks (dequant max-abs <= 5e-5, dot rel-err <= 1e-5). |
| scripts/gen_centroids.py | Regenerates polar_centroids.h bit-for-bit from the Lloyd-Max solver in polar_quant.py. |
| scripts/polarquant_to_gguf.py | Pack a PolarQuant safetensors sidecar into a Q4_POLAR=45 GGUF. |
| scripts/test_converter.py | Synthesize a 128x128 linear, encode + convert + read back. |
| fork-integration/ | In-fork drop-in: quants-polar.{h,c} + *.patch for the apothic llama.cpp fork. |
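For readers unfamiliar with the rotation step, the following is a minimal sketch of an in-place Walsh-Hadamard butterfly of the kind src/polar_hadamard.c is described as implementing. The function name `wht_inplace` is hypothetical, and the sketch is unnormalised; whether the real kernel folds a 1/sqrt(128) scale into the transform or into the rescale step is not stated here, so treat the scaling as an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define QK_POLAR 128

/* Hypothetical helper, not the library's API.
 * In-place, unnormalised Walsh-Hadamard butterfly over n floats.
 * n must be a power of two. Applying it twice yields n * x, so the
 * inverse is the same butterfly followed by a division by n. */
static void wht_inplace(float *x, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h <<= 1) {            /* butterfly half-width */
        for (size_t i = 0; i < n; i += h << 1) {    /* start of each pair of halves */
            for (size_t j = i; j < i + h; ++j) {
                float a = x[j];
                float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

For a 128-element block this runs 7 stages of 64 add/subtract pairs each, which is the structure the AVX2 and NEON dequantizers vectorise.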

Block format (locked)

```c
#define QK_POLAR 128
#define QJL_RESIDUAL_BYTES (QK_POLAR / 8)   // 16 bytes

typedef struct __attribute__((packed)) {
    polar_fp16_t d;                          // 2  bytes (per-block L2 norm)
    uint8_t      qs[QK_POLAR / 2];           // 64 bytes (4-bit codes, 2 per byte)
    uint8_t      qjl[QJL_RESIDUAL_BYTES];    // 16 bytes (1-bit residual per block)
} block_q4_polar;

// 82 bytes/block.  5.125 bpw with QJL, 4.125 bpw without.
```

qs: low nibble = even-index code, high nibble = odd-index code (matches the layout llama.cpp's existing 4-bit kernels assume so SIMD unpacking ports cleanly).

qjl[0] bit 0 holds the per-block residual sign; bytes 1..15 are reserved for a future per-coordinate residual without breaking the on-disk size.
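The nibble order and the residual bit described above can be sketched as follows. This is an illustration only: `pack_codes`, `unpack_codes`, and `residual_sign_bit` are hypothetical names, and `polar_fp16_t` is assumed here to be a raw 16-bit container, which the real polar_block.h may define differently.

```c
#include <assert.h>
#include <stdint.h>

#define QK_POLAR 128
#define QJL_RESIDUAL_BYTES (QK_POLAR / 8)

typedef uint16_t polar_fp16_t;   /* assumption: raw IEEE half bits */

typedef struct __attribute__((packed)) {
    polar_fp16_t d;
    uint8_t      qs[QK_POLAR / 2];
    uint8_t      qjl[QJL_RESIDUAL_BYTES];
} block_q4_polar;

/* 2 + 64 + 16 = 82 bytes, i.e. 82 * 8 / 128 = 5.125 bits per weight. */
_Static_assert(sizeof(block_q4_polar) == 82, "block_q4_polar must be 82 bytes");

/* Low nibble = even-index code, high nibble = odd-index code. */
static void pack_codes(const uint8_t codes[QK_POLAR], uint8_t qs[QK_POLAR / 2])
{
    for (int i = 0; i < QK_POLAR / 2; ++i) {
        assert(codes[2 * i] < 16 && codes[2 * i + 1] < 16);
        qs[i] = (uint8_t)(codes[2 * i] | (codes[2 * i + 1] << 4));
    }
}

static void unpack_codes(const uint8_t qs[QK_POLAR / 2], uint8_t codes[QK_POLAR])
{
    for (int i = 0; i < QK_POLAR / 2; ++i) {
        codes[2 * i]     = (uint8_t)(qs[i] & 0x0F);
        codes[2 * i + 1] = (uint8_t)(qs[i] >> 4);
    }
}

/* Bit 0 of qjl[0] carries the per-block residual sign; the rest is reserved. */
static int residual_sign_bit(const block_q4_polar *b)
{
    return b->qjl[0] & 1;
}
```

Packing codes 0 and 5 into the first byte gives 0x50, which matches the low-nibble-first convention of llama.cpp's other 4-bit formats.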

Build + test

```bash
cmake -B build -S .
cmake --build build -j
ctest --test-dir build --output-on-failure
```

Centroid regeneration

The committed centroid header is the bit-for-bit output of:

```bash
python scripts/gen_centroids.py > include/polarquant/polar_centroids.h
```

The Lloyd-Max iteration is deterministic (16 levels, 100 iterations, fixed initial boundaries on [-4, 4]). gen_centroids.py mirrors packages/training/scripts/quantization/polarquant/polar_quant.py::_compute_lloyd_max_centroids exactly.

GGUF converter

```bash
python scripts/polarquant_to_gguf.py \
  --sidecar  /path/to/polarquant_artifacts.safetensors \
  --base-model /path/to/base/hf/model_dir \
  --output   /path/to/out.gguf
```

Reads the sidecar's <layer>.codes (int8), <layer>.norms (fp16), optional <layer>.qjl (uint8) tensors; packs each layer into block_q4_polar records; and writes a GGUF where every quantized tensor is typed Q4_POLAR=45. Header metadata:

| Key | Value |
| --- | --- |
| polarquant.block_size | 128 |
| polarquant.bits | 4 |
| polarquant.use_qjl | 0 / 1 |
| polarquant.qjl_seed | 42 |
| polarquant.qjl_correction | 0.5 |
| polarquant.rotation | "wht-128" |
| polarquant.upstream_commit | PolarQuant commit pin |

The decoder is expected to verify these against its compile-time constants and refuse to load on any mismatch.
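A load-time check of this kind might look like the sketch below. Everything in it is hypothetical: the `polar_gguf_meta` struct, the `POLAR_*` constant names, and the function name are illustrative only, and the code assumes the GGUF key/value pairs have already been read into plain C fields by the loader.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical compile-time constants mirroring the metadata table. */
#define QK_POLAR             128
#define POLAR_BITS           4
#define POLAR_QJL_SEED       42u
#define POLAR_QJL_CORRECTION 0.5f
#define POLAR_ROTATION       "wht-128"

/* Hypothetical container for the polarquant.* header values. */
typedef struct {
    uint32_t    block_size;
    uint32_t    bits;
    uint32_t    use_qjl;
    uint32_t    qjl_seed;
    float       qjl_correction;
    const char *rotation;
} polar_gguf_meta;

/* Returns 1 if the file matches the compiled-in format, 0 otherwise.
 * A caller would refuse to load the model on 0. */
static int polar_meta_is_compatible(const polar_gguf_meta *m)
{
    assert(m != NULL);
    if (m->block_size != QK_POLAR)                 return 0;
    if (m->bits != POLAR_BITS)                     return 0;
    if (m->use_qjl > 1)                            return 0;
    if (m->qjl_seed != POLAR_QJL_SEED)             return 0;
    if (m->qjl_correction != POLAR_QJL_CORRECTION) return 0;
    if (m->rotation == NULL ||
        strcmp(m->rotation, POLAR_ROTATION) != 0)  return 0;
    return 1;
}
```

Failing closed on any mismatch keeps a file written with a different seed or rotation from being decoded into plausible-looking but wrong weights.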

Test

```bash
python scripts/test_converter.py
```

Synthesizes a 128x128 fp32 weight, runs the vendored PolarQuant encoder over it, drives the converter, and reads the GGUF back via gguf.GGUFReader (with Q4_POLAR=45 patched into the enum to mirror what the upstream registration step will do).

Validation results

| Test | Status | Notes |
| --- | --- | --- |
| polar_roundtrip | PASS | rel-L2 ~ 0.091 (no QJL) / 0.099 (with QJL); matches Python reference's measured per-block error. |
| polar_dot | PASS | rel-error ~ 0.066 vs fp32 ref; same Python ref bound. |
| polar_simd_parity | PASS | AVX2 vs scalar reference: dequant max_abs <= 5e-7, mean_abs <= 3e-8; dot rel-err <= 2e-7 across 100 random blocks (use_qjl on/off). NEON path cross-compiles cleanly under aarch64-linux-gnu; runtime gate runs on aarch64 CI. |
| test_converter.py | PASS | 1 layer, 128 blocks, 82-byte records bit-identical to direct pack_layer(). |

The per-block reconstruction error (~9-10%) is not a quality knob. PolarQuant Q4's downstream perplexity claim (PPL Δ ≤ +0.05 vs FP16) is an end-to-end measurement, and it can only be run once a real Q4_POLAR model GGUF has been built through the integration flow described in fork-integration/README.md.

Architecture-specific kernels

| Kernel | Scalar | AVX2 | NEON |
| --- | --- | --- | --- |
| quantize_row_q4_polar (encoder) | yes | -- (convert-time only) | -- |
| dequantize_row_q4_polar (decoder) | yes | yes | yes |
| ggml_vec_dot_q4_polar_q8_0 (dot) | yes | yes | yes |

The encoder stays scalar because it runs once per weight tensor at GGUF convert time, not in the inference hot path. The decoder + dot are SIMD-dispatched. The dispatcher itself (src/polar_dispatch.c) is #if-guarded by POLARQUANT_HAVE_AVX2 / POLARQUANT_HAVE_NEON (set by CMakeLists.txt) and falls back to the scalar reference when neither is available — useful for non-x86 / non-aarch64 dev hosts.
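The shape of such a compile-time dispatcher can be sketched as below. Only the two `POLARQUANT_HAVE_*` macros come from this README; the `sketch_*` functions are hypothetical stand-ins that simply copy their input, and the real kernel signatures in polarquant.h are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the scalar reference kernel (hypothetical signature). */
static void sketch_dequant_ref(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i];
    }
}

#if defined(POLARQUANT_HAVE_AVX2)
/* Stand-in for the AVX2 kernel; a real one would use intrinsics. */
static void sketch_dequant_avx2(const float *in, float *out, size_t n)
{
    sketch_dequant_ref(in, out, n);
}
#elif defined(POLARQUANT_HAVE_NEON)
/* Stand-in for the NEON kernel. */
static void sketch_dequant_neon(const float *in, float *out, size_t n)
{
    sketch_dequant_ref(in, out, n);
}
#endif

/* The choice is made by the preprocessor, so there is no runtime
 * CPU detection and no function-pointer indirection. */
static void sketch_dequantize(const float *in, float *out, size_t n)
{
#if defined(POLARQUANT_HAVE_AVX2)
    sketch_dequant_avx2(in, out, n);
#elif defined(POLARQUANT_HAVE_NEON)
    sketch_dequant_neon(in, out, n);
#else
    sketch_dequant_ref(in, out, n);
#endif
}

/* Reports which path was compiled in; handy in test logs. */
static const char *sketch_backend_name(void)
{
#if defined(POLARQUANT_HAVE_AVX2)
    return "avx2";
#elif defined(POLARQUANT_HAVE_NEON)
    return "neon";
#else
    return "scalar";
#endif
}
```

One consequence of resolving the backend at compile time is that a binary built with the AVX2 macro set will not fall back on a CPU that lacks AVX2, so the build flags need to match the deployment target.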

In-fork integration

The Apothic llama.cpp fork integration (registers Q4_POLAR=45, wires the type-traits dispatch) is staged in fork-integration/:

  • quants-polar.{h,c} — drop-in for ggml/src/ggml-cpu/, scalar + AVX2 + NEON.
  • *.patch — the deltas for ggml-common.h, ggml.h, ggml-cpu.c, ggml-quants.c, and ggml/src/ggml-cpu/CMakeLists.txt.
  • fork-integration/README.md — the order of operations + the test-quantize-fns gate the vendor must run before we bump the pin in compile-libllama.mjs.

This standalone library remains the behavioural source of truth (it has the unit tests + parity gates). The in-fork file is a transcription with llama.cpp's own typedefs (ggml_fp16_t, block_q8_0). Math is identical; only the type names differ.

QJL residual sign vector parity

The Python reference draws its signs with torch.randint seeded with 42, and that stream is not portable across torch versions. Both the standalone library and the in-fork TU (fork-integration/quants-polar.c) instead use the deterministic C xorshift32 stream defined in src/polar_qjl.c. The GGUF converter at scripts/polarquant_to_gguf.py is responsible for recomputing the QJL bits against the same xorshift32 stream when packing the sidecar, so that encoder, decoder, and converter all agree on the same 128-bit sign vector.
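As an illustration of why a hand-rolled generator is portable, here is a minimal sketch of deriving 128 signs from a xorshift32 stream. The shift triple (13, 17, 5) is the standard xorshift32 recurrence; how src/polar_qjl.c seeds the state and which output bit it maps to a sign are not documented here, so both are assumptions in this sketch, as is the name `qjl_signs_sketch`.

```c
#include <assert.h>
#include <stdint.h>

#define QK_POLAR 128

/* Standard 32-bit xorshift step (Marsaglia's 13/17/5 triple). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical: fill signs[] with +1 / -1 from the stream.
 * Assumption: one generator step per sign, low bit selects the sign. */
static void qjl_signs_sketch(uint32_t seed, int8_t signs[QK_POLAR])
{
    assert(seed != 0);   /* xorshift32 is stuck at zero for a zero state */
    uint32_t state = seed;
    for (int i = 0; i < QK_POLAR; ++i) {
        signs[i] = (xorshift32(&state) & 1u) ? 1 : -1;
    }
}
```

The stream depends only on fixed-width unsigned integer arithmetic, so any conforming C compiler, and a straightforward Python reimplementation masked to 32 bits, produces the same sequence for the same seed.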

See also

  • docs/porting/on-device-quantization-porting-plan.md -- the design spec this implementation follows ("PolarQuant block_q4_polar GGML quant type").
  • packages/training/scripts/quantization/polarquant/polar_quant.py -- the bit-exact Python reference for the Lloyd-Max centroid solver, the Hadamard rotation, and the QJL residual.
  • packages/training/scripts/quantization/polarquant_apply.py -- the orchestrator that produces the safetensors sidecar this converter consumes.
  • packages/app-core/scripts/aosp/compile-libllama.mjs -- the toolchain that will build the libllama.so carrying the eventual Q4_POLAR kernel registration.