Back to Eliza

TurboQuant / QJL / PolarQuant KV cache kernels (Vulkan + Metal)

plugins/plugin-local-inference/native/README.md

2.0.152.0 KB
Original Source

TurboQuant / QJL / PolarQuant KV cache kernels (Vulkan + Metal)

STATUS — All five required Metal shader files are hardware-verified 8/8 PASS on Apple M4 Max, with Polar also covering the pre-Hadamard-query hot-path entrypoint (kernel_mul_mv_q4_polar_preht_f32) for both residual modes. Vulkan turbo is hardware-verified on Intel ARL + lavapipe (8/8 PASS each) and rerun on Apple M4 Max via MoltenVK after the current TBQ4 four-record layout update. Vulkan QJL score (qjl.comp) + Polar matvec (polar.comp) + Polar pre-Hadamard matvec (polar_preht.comp) are hardware-verified 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 and on Intel Arrow Lake Mesa ANV 25.2.8 (make vulkan-verify). The fused-attention compute shaders (fused_attn_qjl_tbq.comp, fused_attn_qjl_polar.comp) are hardware-verified 1920/1920 outputs PASS on Intel ARL Mesa ANV across all four GQA/n_kv cases (make vulkan-verify-fused), and the built-fork GGML_OP_FUSED_ATTN_QJL_TBQ graph route is verified (vulkan-dispatch-smoke 7/7, max diff 4.5e-8) — so kernel-contract.json's fusedAttn.runtimeStatus.vulkan is runtime-ready. The Metal fused-attention kernels (metal/fused_attn_qjl_tbq.metal, metal/fused_attn_qjl_polar.metal) + the Polar pre-Hadamard-query score kernel (metal/polar_preht.metal, kernel_attn_score_q4_polar_preht_f32 + _multi) are also standalone hardware-verified on Apple M4 Max via make metal-verify-fused; Metal fused graph dispatch still needs built-fork smoke before fusedAttn.runtimeStatus.metal can flip runtime-ready. The production CUDA fused kernel (cuda/fused-attn-qjl-tbq.cu, mirroring verify/cuda_verify.cu's fused_attn_qjl_tbq3_kernel + DP4A QJL path) is AUTHORED — hardware-verify pending (needs an NVIDIA host for verify/cuda_runner.sh). fused_attn is an optimization on top of the five required kernels (AGENTS.md §3), not a required kernel — it stays out of requiredRuntimeCapabilityKeys / manifestKernelNames. The Vulkan fallback entrypoints now have fixture-harness branches in vulkan_verify.cpp (make vulkan-verify-fallbacks): qjl_mul_mv.comp is a hard gate (8/8 PASS on Intel ARL Mesa ANV — the reduction-based QJL-score parity check, deterministic) and qjl_get_rows.comp / polar_get_rows.comp are informational decode-block checks. On Mesa ANV (Arrow Lake) the get-rows shaders write one fp32 output per element from all 32 invocations (no reduction) and a small, varying handful of elements intermittently read back 0.0 — an ANV codegen quirk in the all-threads strided SSBO store, not a logic bug (the reduction-based polar.comp / polar_preht.comp share the exact same decode and are clean). The - prefix on those Makefile lines keeps a flaky ANV get-rows result from failing the target; on lavapipe the same shaders are bit-exact and deterministic. All Vulkan SPIR-V is spirv-val clean.*

FamilyFilesSource-level checked against fork?Compiles with target SDK?Validated on hardware?
TurboQuant — Vulkanvulkan/turbo3.comp, vulkan/turbo4.comp, vulkan/turbo3_tcq.compYES (byte layout + decode math match current fork block_tbq* layouts; TBQ4 is four 18-byte records per 128-row)YES (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1)YES — 8/8 PASS on Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipe; rerun via MoltenVK on Apple M4 Max after TBQ4 update
TurboQuant — Metalmetal/turbo3.metal, metal/turbo4.metal, metal/turbo3_tcq.metalYES (matches current fork block_tbq* layouts; TBQ4 is four 18-byte records per 128-row)YES (clang++ -framework Metal runtime JIT path; Metal Toolchain not required for metal_verify)YES — 8/8 PASS on Apple M4 Max (Darwin 25.2.0, runtime MTLDevice.newLibraryWithSource); max diff 6.7e-06
QJLmetal/qjl.metal, vulkan/qjl.comp plus fallback vulkan/qjl{_get_rows,_mul_mv}.compYES (against qjl_score_qk_ref in packages/native/plugins/qjl-cpu)YES (Vulkan); YES (Metal — runtime JIT)Metal: YES — 8/8 PASS on Apple M4 Max after Wave-3 fix to kernel_attn_score_qjl1_256 (uniform uint3 attribute params; original mixed uint+uint2 failed Metal compile). Vulkan score: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C) and on Intel ARL Mesa ANV 25.2.8. Fallback qjl_mul_mv: YES — 8/8 PASS on Intel ARL Mesa ANV (hard gate, reduction-based QJL-score parity); fallback qjl_get_rows: PASS on lavapipe, intermittent few-element 0.0 read-back on Mesa ANV (informational, --guarded; ANV strided-store codegen quirk, not a logic bug)
PolarQuantmetal/polar.metal, vulkan/polar.comp, vulkan/polar_preht.comp plus fallback vulkan/polar_get_rows.compYES (against dequantize_row_q4_polar_ref + polar_dot_ref in packages/native/plugins/polarquant-cpu)YES (Vulkan); YES (Metal — runtime JIT)Metal: YES — 8/8 PASS on Apple M4 Max, including pre-Hadamard query entrypoint. Vulkan matvec: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 and on Intel ARL Mesa ANV 25.2.8, including polar_preht.spv. Fallback polar_get_rows (informational): PASS on lavapipe; on Mesa ANV a small varying handful of the 128 elements intermittently read back 0.0 (ANV strided-store codegen quirk, not a logic bug; --guarded in the Makefile). Both use_qjl modes affected; the reduction-based polar.comp/polar_preht.comp share the same decode and are clean
Fused attention (QJL-K + TBQ-V / Polar-V)vulkan/fused_attn_qjl_tbq.comp, vulkan/fused_attn_qjl_polar.comp; metal/fused_attn_qjl_tbq.metal, metal/fused_attn_qjl_polar.metal; cuda/fused-attn-qjl-tbq.cu + verify/cuda_verify.cu harnessYES (against eliza_fused_attn_qjl_tbq3 / eliza_fused_attn_qjl_polar in reference/; Metal/CUDA are byte-faithful mirrors of the hardware-verified Vulkan ports)YES (Vulkan glslc SPIR-V 1.3 / Vulkan 1.1; Metal runtime JIT; CUDA nvcc gated)Vulkan: YES — 1920/1920 outputs PASS on Intel ARL Mesa ANV 25.2.8 across all four cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 6.3e-7 (make vulkan-verify-fused); built-fork GGML_OP_FUSED_ATTN_QJL_TBQ graph dispatch verified (vulkan-dispatch-smoke 7/7, max diff 4.5e-8). Metal standalone: YES — 1920/1920 outputs PASS on Apple M4 Max via make metal-verify-fused, max diff 7.2e-7; Metal fused graph dispatch still needs a built-fork smoke before fusedAttn.runtimeStatus.metal can flip runtime-ready. CUDA: AUTHORED — hardware-verify pending (no NVIDIA HW; verify/cuda_runner.sh on a CUDA host; build the fork with -DGGML_CUDA_FUSED_ATTN_QJL=ON)
PolarQuant pre-Hadamard-query scoremetal/polar_preht.metal (kernel_attn_score_q4_polar_preht_f32 + _multi), vulkan/polar_preht.compYES (dot(H·x, q) == dot(x, H·q); same reference as PolarQuant)YES (Metal runtime JIT; Vulkan glslc)Vulkan: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 + Intel ARL Mesa ANV 25.2.8 (the mat-vec ABI). Metal attention-score ABI: YES — 8/8 PASS for kernel_attn_score_q4_polar_preht_f32 and _multi on Apple M4 Max via make metal-verify-fused, max diff 5.8e-6. Metal mat-vec ABI (kernel_mul_mv_q4_polar_preht_f32 in polar.metal): YES — 8/8 PASS on Apple M4 Max
CUDA (all 5)verify/cuda_verify.cu linking ~/.cache/eliza-dflash/eliza-llama-cpp/build-cuda/.../libggml-cuda.so (qjl, polar, turbo3_tcq exported symbols; turbo3/turbo4 via thin __global__ wrapper around the shipped device-side tbq_decode_block_cuda) plus verify/runtime_graph_smoke.sh for llama-cli --cache-type-k graph dispatchYES (against ggml-cuda/{turboquant,turbo-tcq,qjl,polarquant}.cu(h) in the elizaOS/llama.cpp fork; make cuda-preprocess-check asserts every API symbol + every block_* layout is present in the in-fork headers)NEEDS-HARDWARE — make cuda requires nvcc (gated on Linux + CUDA Toolkit; macOS not supported); preprocessor-only API surface check passes on M4 MaxNEEDS-HARDWARE — see verify/HARDWARE_VERIFICATION.md and verify/CUDA_VERIFICATION.md; cuda_runner.sh now requires NVIDIA hardware, fixture parity, and a real GGUF graph-smoke model before a pass can be recorded

Earlier history: the original turbo*.comp Vulkan port reported 0/8 PASS against Mesa llvmpipe AND Intel ARL — different wrong values per ICD, which fingerprinted a source-level subgroup-size assumption (subgroupAdd over a 32-thread workgroup with no requiredSubgroupSize). Wave-4 W4-A replaced that with the same driver-portable shared-memory tree reduction the new W3-E QJL/Polar shaders use; the result is 8/8 PASS on both ICDs. See reports/porting/2026-05-09-w4/vulkan-turbo-fix.md.

The Metal ports here mirror the same decode math as the fork's current shipped TurboQuant layouts. block_tbq4_0 is half norm; uint8_t qs[16], and four records form one 128-wide attention row, so the standalone, Vulkan, and built-fork graph-dispatch paths now use the same 72-byte TBQ4 row size.

Patch-hook status (post 2026-05-10 audit):

  • The five Metal patch hooks have been collapsed into one patchMetalKernels implementation in packages/app-core/scripts/kernel-patches/metal-kernels.mjs. It copies the verified standalones from packages/inference/metal/ into the fork at ggml/src/ggml-metal/eliza-shipped/<name>.metal, then patches ggml/src/ggml-metal/CMakeLists.txt so each standalone is compiled into its own .air and merged into default.metallib alongside ggml-metal.air. The patch fires unconditionally on every Metal target — no env-var opt-in. The previous opt-in environment variables (ELIZA_DFLASH_PATCH_METAL_*=1) were decorative log toggles and are removed. Idempotent via # ELIZA-KERNEL-PATCH-V1 sentinel.

  • For Apple desktop targets the script now sets -DGGML_METAL_EMBED_LIBRARY=OFF, so the patched add_custom_command (which lives in the non-EMBED branch of the fork's CMakeLists.txt) actually runs. iOS targets keep EMBED_LIBRARY=ON because the static-archive build needs the metallib data baked in via .incbin. The patcher now rewrites that EMBED path to embed compiled default.metallib bytes, so desktop and iOS package the same shipped standalone symbols. The build gate still refuses an iOS artifact until those symbols are runtime-dispatch-ready, not merely present.

  • The Vulkan patchVulkanKernels hook (Wave-6, 2026-05-10) now stages the nine standalone .comp files from packages/inference/vulkan/ directly into ggml/src/ggml-vulkan/vulkan-shaders/ (where the fork's file(GLOB CONFIGURE_DEPENDS *.comp) picks them up), plus applies two anchor-driven idempotent patches under kernel-patches/vulkan-dispatch-patches/:

    1. 01-vulkan-shaders-gen.patch — adds 9 string_to_spv() calls at the bottom of process_shaders() so the gen tool emits eliza_<name>_data[] + eliza_<name>_len symbols into ggml-vulkan-shaders.hpp.
    2. 02-ggml-vulkan-pipelines.patch — extends vk_device_struct with 9 vk_pipeline pipeline_eliza_* slots and adds 9 ggml_vk_create_pipeline() calls at the bottom of ggml_vk_load_shaders() so each SPV blob is referenced at link time and surfaces in nm libggml-vulkan.so.

    Symbol audit (nm libggml-vulkan.so | grep eliza_) is now expected to pass on a successful linux-x64-vulkan build. The source patch also adds conservative op-level dispatch for GGML_OP_ATTN_SCORE_QJL, GGML_OP_ATTN_SCORE_TBQ, and GGML_OP_ATTN_SCORE_POLAR. The remaining Vulkan gap is native graph-dispatch evidence on physical Linux/Android Vulkan hardware; standalone SPIR-V fixture success and MoltenVK fixture success do not flip runtime-ready capability bits by themselves.

  • Dispatch wiring status (Wave-7 audit, 2026-05-10): the build now distinguishes shipped symbols from runtime-ready graph capabilities. patchMetalKernels() ships all five standalones in desktop default.metallib and iOS embedded metallib. Metal now has a dedicated, numerically smoke-tested GGML_OP_ATTN_SCORE_QJL bridge to kernel_attn_score_qjl1_256_multi, so CAPABILITIES.json may report kernels.qjl_full=true when that symbol is present. The helper still deliberately refuses the old generic MUL_MAT/GET_ROWS patch route: routing the standalones through generic ggml ops is either semantically wrong (QJL1_256 expects a pre-projected sketch) or incomplete (the tbq* kernels are attention-score only). shippedKernels.symbols.* records metallib symbol presence; the remaining Turbo/Polar kernels.* bits stay false until dedicated graph ops land and pass numeric dispatch tests. CUDA is still the only backend whose elizaOS/llama.cpp fork binary covers the full runtime path for all five types.

  • Wave-6 darwin shared-lib link fix: ggml-base on darwin defaults to BUILD_SHARED_LIBS=ON and links with -undefined error, so unresolved quantize_qjl1_256 / dequantize_row_qjl1_256 / quantize_row_qjl1_256_ref symbols (called from ggml.c but defined in ggml-cpu/qjl/) made libggml-base.dylib fail to link at all. The Wave-5 BLOCKER agent's "verified via strings" claim was against a stale metallib produced by an earlier shape; the end-to-end build never actually completed. Wave-6 extends patchGgmlBaseForWindowsQjl to fire on darwin-* and ios-* targets (same fix as Windows: compile QJL TUs into ggml-base too).

The most likely on-hardware failure modes (carry-overs from the Vulkan investigation, applicable here too):

  1. FWHT seed/sign-vector application. All five shaders assume Q is pre-rotated host-side (matches the fork's CUDA + Metal paths). If you wire them into a pipeline that does NOT pre-rotate Q, you must add an inverse FWHT to the dequantized output before the dot product.
  2. SIMD-group reduction. The Metal shaders all dispatch with threadgroup_size = 32 = one Apple SIMD-group, so simd_sum covers the full reduction. If anyone bumps the threadgroup size, switch to threadgroup-shared scratch + barrier.
  3. Block byte layout. block_turbo*, block_qjl1_256, and block_q4_polar are all __attribute__((packed)) in the C reference and have natural alignment 2 (fp16/bf16 leading field). Metal device const T* arithmetic must respect this — if you change the header layout, the shader's head_offset_bytes arg goes stale.

The harness IS reusable for fixing the above — metal_verify loads a .metal source, JIT-compiles via MTLDevice.newLibraryWithSource, dispatches the named kernel, and diffs against a JSON fixture.

Source of truth

TurboQuant

CUDA originals: https://github.com/spiritbuun/buun-llama-cpp.git at commit 6575873e9c4872709d374d854b583cfaa270caff, paths:

  • ggml/src/ggml-cuda/turbo-quant-cuda.cuh — quantize/dequantize, codebooks, FWHT, Viterbi
  • ggml/src/ggml-cuda/fattn-vec.cuh — flash-attention vec path that consumes them
  • ggml/src/ggml-common.hblock_turbo3_0, block_turbo4_0, block_turbo3_tcq layouts
  • ggml/src/ggml-cpu/ops.cpp and ggml/src/ggml-turbo-quant.c — CPU reference helpers for lossy quantization
  • ggml/src/ggml-metal/ggml-metal.metal and turbo-wht.h — existing Metal shader ground truth (the dequantize_turbo3_0_t4 helper there is the bit-for-bit reference our standalone turbo3.metal mirrors)

QJL (1-bit JL transform K-cache compression)

Reference impl in this repo:

  • packages/native/plugins/qjl-cpu/include/qjl/qjl.h — public API + block_qjl1_256 layout
  • packages/native/plugins/qjl-cpu/src/qjl_score_ref.c — scalar GQA score (the kernel_attn_score_qjl1_256 mirror target)
  • packages/native/plugins/qjl-cpu/src/qjl_quantize_ref.c — quantize one row
  • packages/native/plugins/qjl-cpu/src/qjl_dispatch.c — runtime dispatch (NEON / AVX2 / scalar)

Original CUDA reference (training-side, not the on-device target): packages/training/scripts/quantization/qjl/csrc/{qjl_quant_kernel.cu, qjl_gqa_score_kernel.cu}. The on-fork CPU side that lands block_qjl1_256 in ggml-common.h is W1-A's responsibility; this directory only ports the Metal half.

PolarQuant (block_q4_polar)

Reference impl in this repo:

  • packages/native/plugins/polarquant-cpu/include/polarquant/polarquant.h — public API + 5-step decode contract
  • packages/native/plugins/polarquant-cpu/include/polarquant/polar_block.hblock_q4_polar packed layout
  • packages/native/plugins/polarquant-cpu/include/polarquant/polar_centroids.hPOLAR_Q4_CENTROIDS[16] Lloyd-Max LUT (the Metal shader inlines the same constants)
  • packages/native/plugins/polarquant-cpu/src/polar_dequantize_ref.c — scalar decoder (the kernel_get_rows_q4_polar mirror target)
  • packages/native/plugins/polarquant-cpu/src/polar_dot_ref.c — scalar q4_polar · q8_0 dot product (template for kernel_mul_mv_q4_polar_f32, except our verification path uses fp32 activations not q8_0)
  • packages/native/plugins/polarquant-cpu/src/polar_qjl.c — xorshift32 sign vector for the optional QJL residual
  • packages/native/plugins/polarquant-cpu/src/polar_hadamard.c — in-place 128-element Walsh-Hadamard butterfly

The on-fork CPU side that adds block_q4_polar to ggml-common.h is W1-B's responsibility; this directory only ports the Metal half.

What each kernel does

turbo3 — 3-bit PolarQuant + per-block norm

The graph normalizes a 128-element rotation group, applies a forward Fast Walsh–Hadamard transform with seed=42 sign vectors, then quantizes each of the four 32-element sub-blocks to 3-bit Lloyd–Max centroids {-0.190685, -0.117832, ..., +0.190685}. The 3-bit index is split: low 2 bits in qs[], high bit in signs[]. Per-group norm is corrected by grp_norm / recon_norm so dequantized output preserves the original L2 norm. At dot-product time, the host has already pre-rotated Q with the same FWHT, so the shader only needs Q · centroids[idx] * norm.

turbo4 — 4-bit TurboQuant (16 centroids)

Current TBQ4 uses four independent 32-element preconditioned records per 128-wide attention row. Each record is half norm; uint8_t qs[16] (18 bytes); the full GGML row size for head_dim=128 is 72 bytes. A fixed sign vector plus 32-point Hadamard preconditions each record before quantization to centroids {-2.7321365, -2.0685055, ..., +2.7321365}. The packing stores elements 0..15 in low nibbles and 16..31 in high nibbles. The standalone Metal, Vulkan, reference fixture generator, and built-fork Metal graph-dispatch smoke all use this layout now. The kernel body has diverged into perf-only differences (the standalone hoists per-block byte loads and uses fma); see PATCH_AUDIT_2026-05-10.md for the diff.

turbo3_tcq — 3-bit Trellis-Coded Quantization (k=3, L=9, 512 states)

Each 128-element rotation group is encoded by a 512-state right-shift bitshift trellis with a hand-trained 512-entry codebook. Encode = 128-step Viterbi over 512 states (CUDA: k_set_rows_turbo3_tcq); decode = read a sliding 9-bit window at bit t*3 of the bitstream and look up the centroid (CUDA: dequantize_turbo3_tcq). Decode-only is what the dot-product path needs; the Vulkan and Metal shaders here implement decode + Q·K. The reference C implementation in reference/turbo_kernels.c includes a slow Viterbi encoder for fixture generation.

qjl — 1-bit JL transform K-cache compression

Each 128-element key vector is projected through a fixed 128×256 JL matrix Π, sign-extracted (256 bits), and stored alongside a bf16 norm (34 bytes/key total — 7.53× compression vs bf16 K-cache). The attention-score path consumes a pre-projected query sketch (n_heads × proj_dim, computed once per Q) and emits

score[h_q, t] = ||k_t|| * sqrt(pi/2)/proj_dim *
                sum_j sign_packed[t, j] * q_sketch[h_q, j]

GQA fanout: h_kv = h_q / (n_heads/n_kv_heads). Three Metal kernels: kernel_attn_score_qjl1_256 (the hot path), kernel_get_rows_qjl1_256 (decode-to-fp32 fallback), kernel_mul_mv_qjl1_256_f32 (mat-vec for non-attention call sites).

polar — 4-bit PolarQuant block (block_q4_polar, 82 bytes)

Each 128-element block stores: fp16 per-block L2 norm, 64 bytes of 4-bit Lloyd-Max-optimal centroid indices, plus 16 bytes of optional 1-bit QJL residual. Decode steps (mirrors dequantize_row_q4_polar_ref):

  1. Unpack 4-bit codes → centroid LUT lookup (16 entries, N(0,1) Lloyd-Max).
  2. Optional QJL residual: 1 sign-bit applied to a deterministic ±1 sign vector (xorshift32 seeded with POLAR_QJL_SEED=42), magnitude 0.5 / sqrt(QK_POLAR).
  3. In-place 128-element Walsh-Hadamard butterfly (7 stages).
  4. Compensate by 1/QK_POLAR to convert the in-place butterfly into the orthonormal inverse the Python decoder uses.
  5. Per-block L2 rescale by the stored fp16 norm.

Two Metal kernels: kernel_get_rows_q4_polar (full block decode to fp32) and kernel_mul_mv_q4_polar_f32 (block-wise dot product against an fp32 activation chunk; one threadgroup per block, threadgroup-shared scratch holds the dequantized block).

CUDA-to-shader port mapping

Each shader file annotates the CUDA function it ports. Key correspondences:

ConceptCUDAVulkanMetal
3-bit centroid LUTd_turbo_centroids_3bit (turbo-quant-cuda.cuh:296)TURBO_CENTROIDS_3BIT constantTURBO_CENTROIDS_3BIT constant
4-bit centroid LUTd_turbo_centroids_4bit (turbo-quant-cuda.cuh:313)TURBO_CENTROIDS_4BIT constantTURBO_CENTROIDS_4BIT constant
FWHT signs (seed=42)d_turbo_wht_signs1/signs2 (turbo-quant-cuda.cuh:326)not applied — Q is pre-rotatednot applied — Q is pre-rotated
3-bit dequantdequantize_turbo3_0 (turbo-quant-cuda.cuh:481)vulkan/turbo3.comp main loopmetal/turbo3.metal kernel_turbo3_dot
4-bit dequantdequantize_turbo4_0 (turbo-quant-cuda.cuh:532)vulkan/turbo4.comp main loopmetal/turbo4.metal kernel_turbo4_dot
TCQ codebookd_turbo3_tcq_codebook (turbo-quant-cuda.cuh:619)binding=3 storage bufferbuffer(3) constant
TCQ dequant (sliding 9-bit)dequantize_turbo3_tcq (turbo-quant-cuda.cuh:982)vulkan/turbo3_tcq.comp main loopmetal/turbo3_tcq.metal kernel_turbo3_tcq_dot

Algorithmic deltas vs CUDA

  1. No on-device encode in the shaders. The CUDA k_set_rows_turbo3_tcq does a 512-state Viterbi inside one threadgroup with shared-memory double-buffered cost arrays. That requires 512 threads, ~5 KB of shared/threadgroup memory, and warp-shuffle min-reduction primitives that are tricky to port portably. Encode happens host-side or via the C reference impl in reference/turbo_kernels.c.

  2. No InnerQ calibration paths. The CUDA kernels include a d_innerq_* per-channel scaling pass that is part of the inner-quant calibration routine. The decode path does not need it (calibration is a separate step that mutates d_innerq_channel_scale ahead of quantize). The shaders here decode against the post-corrected norm field directly.

  3. No d_tcq_dump_* debug paths. The CUDA k_set_rows_turbo3_tcq has optional global dump buffers for autocorrelation analysis. Removed.

  4. Subgroup/SIMD reduction. Vulkan uses a driver-portable 32-thread shared-memory tree reduction (5 barriers, shared float partials[32]) so it works regardless of subgroup size. The original subgroupAdd path silently under-reduced on Intel ARL (minSubgroupSize=8) and on lavapipe; W4-A replaced it after on-hardware verification. Metal still uses simd_sum because the dispatch is one threadgroup = one Apple SIMD-group of 32 lanes (the simd_sum assumption is a per-vendor guarantee on Apple Silicon, unlike Vulkan's driver-chosen subgroup size). On AMD GCN/RDNA the Vulkan tree reduction sidesteps the wave32/wave64 question entirely.

  5. No FWHT inside the shader. All shaders skip both forward and inverse rotation because the surrounding graph pre-rotates Q. This matches what the existing fork's Metal dequantize_turbo3_0_t4 and CUDA dequantize_turbo3_0 actually do — the rotation is a graph-level concern, not a per-block concern. If you wire these shaders into a pipeline that does NOT pre-rotate Q, you must add an inverse FWHT to the dequantized output before the dot product.

Hardware verification protocol

Before running backend-specific hardware checks, run the executable contract gate. It verifies that the manifest kernel names (turboquant_q3/turboquant_q4/qjl/polarquant/dflash/turbo3_tcq), the build-script capability keys (turbo3/turbo4/turbo3_tcq/qjl_full), the fixture set, and every supported build target's platform gate are still in sync:

bash
make -C packages/inference/verify kernel-contract

Mac (Apple Silicon)

bash
# In this repo's worktree:
cd packages/inference/verify

# 1) Build the reference + fixture generator (no GPU needed):
make reference-test

# 2) Generate fixtures from the reference impl (already shipped under
#    fixtures/, but regenerate if you change the reference):
./gen_fixture fixtures

# 3) Build the Metal harness (requires Xcode command-line tools):
make metal

# 4) Run each shader against the fixtures:
./metal_verify ../metal/turbo3.metal     kernel_turbo3_dot              fixtures/turbo3.json
./metal_verify ../metal/turbo4.metal     kernel_turbo4_dot              fixtures/turbo4.json
./metal_verify ../metal/turbo3_tcq.metal kernel_turbo3_tcq_dot          fixtures/turbo3_tcq.json
./metal_verify ../metal/qjl.metal        kernel_attn_score_qjl1_256     fixtures/qjl.json
./metal_verify ../metal/polar.metal      kernel_mul_mv_q4_polar_f32     fixtures/polar.json

# 5) Optional: build the patched llama-server. Metal kernel patching is
#    unconditional for Metal targets; there are no opt-in env vars.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --backend metal

# 6) Optional: build the iOS Capacitor static archive that the
#    LlamaCpp.xcframework patch in
#    packages/app-core/patches/[email protected] consumes.
#    Requires macOS host with Xcode installed.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs \
  --target ios-arm64-metal
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs \
  --target ios-arm64-simulator-metal

Android (Vulkan)

bash
# Host-side: build SPIR-V from the .comp files using glslc from the Vulkan
# SDK (https://vulkan.lunarg.com/sdk/home) or the Android NDK shader tools.
cd packages/inference/verify
make reference-test
make vulkan-spirv

# 1) On a workstation with a Vulkan-capable GPU (NVIDIA / AMD / Intel),
#    run the host harness:
VULKAN_SDK=/opt/vulkan-sdk make vulkan
make vulkan-verify

# 2) On native Linux hardware, run the stricter smoke gate. This writes
#    hardware-results/linux-vulkan-smoke-*.log, rejects software ICDs unless
#    ELIZA_ALLOW_SOFTWARE_VULKAN=1, runs standalone fixtures, builds the
#    patched fork, dumps CAPABILITIES.json, then runs the built-fork graph
#    gate against the managed install output under
#    $ELIZA_STATE_DIR/local-inference/bin/dflash/linux-x64-vulkan by default.
#    Build failure stops the runner; stale/symbol-only artifacts are not reused.
make vulkan-native-smoke

#    Direct graph dispatch smoke is native-Linux-only by default and requires a
#    bin dir containing libggml-vulkan.so. Override only for explicit prebuilts:
#    ELIZA_DFLASH_VULKAN_BIN_DIR=/path/to/bin make vulkan-dispatch-smoke

# 3) On-device (Android) verification: cross-compile the harness against
#    the Android NDK Vulkan headers and push to a Vulkan-capable handset
#    (Adreno 6xx+, Mali-G7x+). Same SPIR-V, same fixtures. The runner records
#    hardware-results/android-vulkan-smoke-*.log and fails closed after
#    standalone fixtures unless ELIZA_ANDROID_VULKAN_GRAPH_EVIDENCE points at
#    a built-fork/app graph-dispatch report covering all six Vulkan graph
#    routes or all five runtime capability keys with finite maxDiff. The
#    standalone runner also checks `polar_preht.spv` with both Polar fixtures.
make android-vulkan-smoke

# 4) End-to-end via llama-server: the patch hook `patchVulkanKernels` is
#    default-on. The build still refuses publishable artifacts until graph
#    dispatch capabilities are runtime-ready, not merely symbol-shipped.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --backend vulkan

Verification matrix (verified locally vs needs hardware)

ShaderC reference compilesSelf-test (CPU)Fixture generatedStatic visual review vs CUDA / forkCompiles to SPIR-V/AIRRuns on real GPUNumerically matches CUDA
turbo3.compn/an/ayesyesyes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1)YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipeYES — 8/8 PASS, max diff 4.8e-6
turbo4.compn/an/ayesyesyes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1)YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipeYES — 8/8 PASS, max diff 5.7e-6
turbo3_tcq.compn/an/ayesyesyes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1)YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipeYES — 8/8 PASS, max diff 6.7e-6
qjl.compn/an/ayesYES (against qjl_score_qk_ref)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C); Intel ARL Mesa ANV 25.2.8YES — 8/8 PASS, max diff 7.6e-6
qjl_mul_mv.compn/an/ayesYES (against eliza_qjl_mul_mv, the reduction-based QJL-score parity check)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fallbacks, hard gate, deterministic)YES — 8/8 PASS, max diff 7.6e-6
qjl_get_rows.compn/an/ayesYES (against eliza_qjl_dequantize_row, decode block 0; all-threads strided store, no reduction)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Intel ARL Mesa ANV 25.2.8 + lavapipe (make vulkan-verify-fallbacks, informational, --guarded)lavapipe: 128/128 PASS deterministic. Mesa ANV: a small varying subset of the 128 elements intermittently reads back 0.0 (ANV strided-store codegen quirk; the reduction-based qjl.comp/qjl_mul_mv.comp share the decode and are clean)
polar.compn/an/ayesYES (against dequantize_row_q4_polar_ref + polar_dot_ref)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C); Intel ARL Mesa ANV 25.2.8YES — 8/8 PASS, max diff 5.7e-6
polar_preht.compn/an/ayesYES (same reference, using dot(H*x, q) == dot(x, H*q))yes (Android NDK glslc, SPIR-V 1.3 / Vulkan 1.1)YES — Apple M4 Max via MoltenVK 1.4.1; Intel ARL Mesa ANV 25.2.8YES — 8/8 PASS for polar.json and polar_qjl.json, max diff 7.6e-6
polar_get_rows.compn/an/ayesYES (against eliza_polar_dequantize_row, decode block 0; all-threads strided store, no reduction)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Intel ARL Mesa ANV 25.2.8 + lavapipe (make vulkan-verify-fallbacks, informational, --guarded)lavapipe: 128/128 PASS deterministic, both use_qjl modes. Mesa ANV: a small varying subset of the 128 elements intermittently reads back 0.0 (ANV strided-store codegen quirk; reduction-based polar.comp/polar_preht.comp share the decode and are clean)
fused_attn_qjl_tbq.compn/an/ayes (cases array)YES (against eliza_fused_attn_qjl_tbq3)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fused); built-fork GGML_OP_FUSED_ATTN_QJL_TBQ graph dispatch (vulkan-dispatch-smoke 7/7)YES — 1920/1920 outputs PASS across 4 cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 5.7e-7; graph route max diff 4.5e-8
fused_attn_qjl_polar.compn/an/ayes (cases array)YES (against eliza_fused_attn_qjl_polar)yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean)YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fused)YES — 1920/1920 outputs PASS across 4 cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 6.3e-7
fused_attn_qjl_tbq.metaln/an/ayes (cases array)YES (byte-faithful mirror of fused_attn_qjl_tbq.comp / eliza_fused_attn_qjl_tbq3)yes (Apple Metal runtime JIT; ships in default.metallib via metal-kernels.mjs)YES — Apple M4 Max (make metal-verify-fused)YES — 1920/1920 outputs PASS, max diff 5.7e-7
fused_attn_qjl_polar.metaln/an/ayes (cases array)YES (byte-faithful mirror of fused_attn_qjl_polar.comp / eliza_fused_attn_qjl_polar)yes (Apple Metal runtime JIT; ships in default.metallib)YES — Apple M4 Max (make metal-verify-fused)YES — 1920/1920 outputs PASS, max diff 7.2e-7
polar_preht.metal (kernel_attn_score_q4_polar_preht_f32 + _multi)n/an/ayes (polar_preht.json)YES (dot(H·x, q) == dot(x, H·q); mirrors vulkan/polar_preht.comp + eliza_polar_mul_mv)yes (Apple Metal runtime JIT; ships in default.metallib)YES — Apple M4 Max (make metal-verify-fused)YES — 8/8 PASS for scalar + multi 2/3/4/8, both residual modes, max diff 5.8e-6
cuda/fused-attn-qjl-tbq.cu (fused_attn_qjl_tbq3_kernel + DP4A QJL)n/an/ayes (fused_attn_qjl_tbq.json)YES (mirrors verify/cuda_verify.cu's fused kernel + the fork CPU op)yes (nvcc, gated on -DGGML_CUDA_FUSED_ATTN_QJL=ON)AUTHORED — hardware-verify pending (no NVIDIA HW; verify/cuda_runner.sh on a CUDA host)AUTHORED — hardware-verify pending
turbo3.metaln/an/ayesYES (matches fork's dequantize_turbo3_0_t4 byte-for-byte)YES (Apple M4 Max, runtime JIT)YES — Apple M4 Max, Darwin 25.2.0YES — 8/8 PASS, max diff 3.3e-6
turbo4.metaln/an/ayesThe fork's in-tree eliza-kernels/tbq4_0.metal is an EARLIER draft (29-line diff, materially different inner loop). The standalone is the canonical FMA-tuned variant; the build script copies the standalone into eliza-shipped/ so the metallib uses it.YES (Apple M4 Max, runtime JIT in verify harness)YES — Apple M4 Max, Darwin 25.2.0YES — 8/8 PASS, max diff 5.7e-6
turbo3_tcq.metaln/an/ayesYES (matches CUDA dequantize_turbo3_tcq 9-bit window decode)YES (Apple M4 Max, runtime JIT)YES — Apple M4 Max, Darwin 25.2.0YES — 8/8 PASS, max diff 6.7e-6
qjl.metaln/an/ayesYES (matches qjl_score_qk_ref in qjl-cpu)YES (Apple M4 Max, runtime JIT) after Wave-3 attribute-shape fixYES — Apple M4 Max, Darwin 25.2.0YES — 8/8 PASS, max diff 1.1e-5
polar.metaln/an/ayesYES (matches dequantize_row_q4_polar_ref + polar_dot_ref)YES (Apple M4 Max, runtime JIT)YES — Apple M4 Max, Darwin 25.2.0YES — 8/8 PASS, max diff 7.6e-6
turbo_kernels.cyes (gcc/clang)yesyesn/an/an/an/a
qjl_polar_ref.cyes (gcc/clang)yesyesn/an/an/an/a

What "yes" means for the C references: make reference-test builds without warnings (-O2 -Wall -Wextra -std=c11), and ./gen_fixture --self-test emits finite, plausible-magnitude scores for every kernel:

turbo3=-2.501480 turbo4=-23.721790 turbo3_tcq=-4.822659 qjl=3.696591 polar=-1.994053 polar_qjl=-1.438744

(deterministic with the seeded PRNG in this repo). The turbo4 magnitude changed when TBQ4 moved to the current four-18-byte-record-per-128-row layout — each record is its own preconditioned 32-element block with an independent half norm, so the dequantized magnitudes are no longer scaled by a single shared per-128-group norm. polar_qjl is the Polar V-cache score with the optional 1-bit QJL residual enabled. The same --self-test also runs internal-consistency parity checks before printing those scores, so the references can't silently disagree with each other:

  • QJL: qjl_score_qk and qjl_mul_mv must return the same scalar when n_heads = n_kv_heads = n_tokens = 1 (no GQA fanout, just a single projected dot product). Tolerance 1e-5.
  • Polar: polar_mul_mv must equal polar_dequantize_row reconstructed and then manually dotted against q[]. Tolerance 1e-3.

These are reference-vs-reference checks; they verify that the two C references the Metal shaders mirror agree with each other, not that the shaders agree with hardware. Metal hardware verification was completed in Wave-3 (2026-05-10) on Apple M4 Max (Darwin 25.2.0) and rerun on 2026-05-11 after the Polar pre-Hadamard hot path landed: metal_verify reports 8/8 PASS for each checked entrypoint, including kernel_mul_mv_q4_polar_preht_f32 with and without the QJL residual. The harness uses MTLDevice.newLibraryWithSource (runtime JIT) against Metal.framework — it does NOT require the offline xcrun metal toolchain, so any Mac with Xcode command-line tools can run the verification. Vulkan QJL/Polar verification (Wave-4-C, 2026-05-10): verify/vulkan_verify.cpp now branches on the fixture's kernel field and resolves a per-kernel KernelBindings (input buffer set, output buffer size, push-constant struct, dispatch shape). QJL gets the 3-buffer (q_sketch, packed_k, scores) + 4-uint push (n_heads, n_kv_heads, n_tokens, proj_dim) bind-set with (n_heads, n_tokens, 1) dispatch. Polar gets the 3-buffer (k_blocks, q, y) + 3-uint push (n_rows, head_dim, use_qjl) bind-set with (n_rows, 1, 1) dispatch. The existing turbo3/4/tcq path is unchanged; turbo3_tcq still attaches the 4th codebook buffer. Same JSON fixture from gen_fixture feeds the Metal and Vulkan harnesses; the QJL/Polar fixture shape fields (q_sketch, n_heads, n_tokens, n_rows, use_qjl) were already present.

Hardware run: Apple M4 Max (Darwin 25.2.0) via MoltenVK 1.4.1 + Vulkan-Loader 1.4.341 (brew install molten-vk vulkan-loader vulkan-headers vulkan-tools shaderc; ICD at /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json). The five primary harness entrypoints (turbo3, turbo4, turbo3_tcq, qjl score, polar matvec) report 8/8 PASS at the standard 1e-3 tolerance with max diffs in the 4.8e-7 to 7.6e-6 range — within 1 ULP of the direct Metal harness numbers, which confirms MoltenVK's SPIR-V→MSL translation is bit-equivalent on Apple Silicon for these kernels. The Vulkan fallback entrypoints now have fixture branches in vulkan_verify.cpp and were exercised on Intel Arrow Lake Mesa ANV 25.2.8 (make vulkan-verify-fallbacks / make vulkan-verify-fused): qjl_mul_mv 8/8 PASS (hard gate, reduction-based, deterministic), the *_get_rows decode-block shaders PASS on lavapipe but show the intermittent few-element 0.0 read-back on Mesa ANV's all-threads strided SSBO store (informational, --guarded; not a logic bug — the reduction-based polar.comp/polar_preht.comp share the decode and are clean), and the fused-attention shaders are 1920/1920 outputs PASS across all four GQA/n_kv cases. MoltenVK re-run with:

brew install molten-vk vulkan-loader vulkan-headers vulkan-tools shaderc
cd packages/inference/verify
make reference-test
make -C ../verify vulkan-spirv      # glslc --target-env=vulkan1.1 --target-spv=spv1.3
c++ -O2 -std=c++17 -I/opt/homebrew/include -I../reference \
    vulkan_verify.cpp turbo_kernels.o qjl_polar_ref.o \
    -L/opt/homebrew/opt/vulkan-loader/lib -lvulkan -lm -o vulkan_verify
export VK_ICD_FILENAMES=/opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json
export DYLD_LIBRARY_PATH=/opt/homebrew/opt/vulkan-loader/lib:/opt/homebrew/opt/molten-vk/lib
for k in turbo3 turbo4 turbo3_tcq qjl polar; do
  ./vulkan_verify ../vulkan/$k.spv fixtures/$k.json
done

Substitution note for fixtures

fixtures/*.json were generated from the reference C implementations in this directory (reference/turbo_kernels.c for the turbo family, plus verify/qjl_polar_ref.c for QJL and Polar), NOT from a real CUDA build of buun-llama-cpp. They are sufficient to verify that a Vulkan/Metal shader and the reference produce the same scalar output, but they do NOT verify CUDA-vs-{Vulkan,Metal} parity.

On hardware-validation day:

  • For TurboQuant: regenerate fixtures from a real CUDA build of buun-llama-cpp and replace turbo3.json, turbo4.json, turbo3_tcq.json. W1-E owns the CUDA productionization side and can supply these.
  • For QJL: rerun the reference quantize/score path (or the upstream qjl_pure_pytorch_quantize reference at packages/training/scripts/quantization/qjl/test_qjl.py) on a CUDA host, capture the projected sketch + packed signs + scores, replace qjl.json.
  • For PolarQuant: regenerate from polarquant.polar_quant.py's reference on a CUDA host (the _compute_lloyd_max_centroids and polar_hadamard_inplace paths are bit-exact targets), replace polar.json.

The block_qjl1_256 and block_q4_polar byte layouts are owned by W1-A and W1-B respectively; the verify/qjl_polar_ref.{h,c} files in this directory are stand-ins that mirror those layouts so the verify harness has zero deps on the @elizaos/native-plugins packages.

The patched llama.cpp fork (submodule)

The DFlash/TBQ/QJL/Polar/Metal fork ships in-tree as a git submodule at plugins/plugin-local-inference/native/llama.cppelizaOS/llama.cpp at the repository gitlink commit (currently 33c888a7b, with ce85787c8 as the validated DFlash/SWA build base). bun install runs git submodule update --init --recursive, so a fresh checkout has it. Both build paths default to this checkout:

  • packages/app-core/scripts/build-llama-cpp-dflash.mjs — desktop / server / Windows / iOS.
  • packages/app-core/scripts/aosp/compile-libllama.mjs — Android cross-compile (same pinned commit, so both paths land on identical kernels).

The build re-applies the kernel patches (kernel-patches/*) on top of the pristine submodule tree each run, then git checkout -- . && git clean -fdx discards those edits at the start of the next build — the submodule stays pinned to its gitlink commit. ELIZA_DFLASH_LLAMA_CPP_REMOTE / ELIZA_DFLASH_LLAMA_CPP_REF (or --cache-dir / --src-dir) still force a standalone clone for fork bisects. The DFlash build scripts now reject source checkouts that lack the SWA-aware seq_rm probe fallback from elizaOS/eliza#7635; otherwise SWA-based bodies can silently run target-only after --spec-type dflash.

How standalone shaders flow into the shipped binary

Source-of-truth: the verified .metal and .comp files in this directory (plugins/plugin-local-inference/native/{metal,vulkan}/). The build script packages/app-core/scripts/build-llama-cpp-dflash.mjs calls into packages/app-core/scripts/kernel-patches/{metal,vulkan}-kernels.mjs during applyForkPatches() and the helpers do the actual work:

Metal (darwin desktop)

  1. The build script forces -DGGML_METAL_EMBED_LIBRARY=OFF on every darwin-{arm64,x64}-metal target. This selects the non-EMBED branch of the fork's ggml/src/ggml-metal/CMakeLists.txt, which builds a sidecar default.metallib next to llama-server.
  2. patchMetalKernels() copies the five standalones from packages/inference/metal/{turbo3,turbo4,turbo3_tcq,qjl,polar}.metal into the fork at ggml/src/ggml-metal/eliza-shipped/<name>.metal. Files are copied verbatim; a // # ELIZA-KERNEL-PATCH-V1 comment is prepended so an audit can tell they came from the standalone.
  3. patchMetalKernels() patches ggml/src/ggml-metal/CMakeLists.txt's non-EMBED add_custom_command so the metallib build runs xcrun metal -c once per source (ggml-metal.metal plus each of the five standalones), producing one .air file each, then merges all six .air files into default.metallib via a single xcrun metallib invocation. Sentinel # ELIZA-KERNEL-PATCH-V1 makes the patch idempotent.
  4. The build install loop copies default.metallib from the build's bin/ directory into the install outDir next to llama-server. The Metal runtime locates it via dlopen-style loader_path resolution.

After this, strings default.metallib | grep kernel_turbo3_dot (and similarly kernel_turbo4_dot, kernel_turbo3_tcq_dot, kernel_attn_score_qjl1_256, kernel_get_rows_qjl1_256, kernel_mul_mv_qjl1_256_f32, kernel_get_rows_q4_polar, kernel_mul_mv_q4_polar_f32, kernel_mul_mv_q4_polar_preht_f32) returns matches. The kernels are present as live symbols inside the metallib.

Metal (iOS) — shipped symbols, dispatch-gated

iOS keeps EMBED_LIBRARY=ON because the static-archive build needs the metallib data baked in via .incbin. patchMetalKernels() rewrites that branch to compile ggml-metal.metal plus the five standalones as separate .air files, merge them into default.metallib, and embed the compiled bytes. This avoids the old duplicate-declaration failure from concatenating standalone Metal source with ggml-common.h.

The build still fails the publish gate until op-level dispatch is numerically verified. Shipped symbols are not enough to satisfy the runtime-ready kernel contract.

Vulkan — source-patched, graph-smoke gated

The nine standalone .comp files are staged into the fork and compiled cleanly. patchVulkanKernels() adds conservative op-level dispatch for GGML_OP_ATTN_SCORE_QJL, GGML_OP_ATTN_SCORE_TBQ, and GGML_OP_ATTN_SCORE_POLAR, but the runtime capability bits remain gated on native built-fork graph-dispatch evidence. Standalone SPIR-V fixture success, MoltenVK success, and symbol presence are not enough to mark Vulkan runtime-ready. There is no incomplete-artifact bypass.

Dispatch wiring (partially complete)

For the current blocker ledger and platform/device gap list, see reports/porting/2026-05-11/remaining-work-ledger.md.

Metal graph dispatch is runtime-ready for all five kernel families in the current contract evidence. make -C packages/inference/verify dispatch-smoke links against the built fork libraries, executes the actual Metal backend routes, and compares numeric scores against local references. The QJL route uses the launch-tax-amortized kernel_attn_score_qjl1_256_multi path; the TurboQuant and Polar routes are likewise covered by the built-fork smoke gate.

Vulkan graph execution is source-patched but still needs native Linux/Android graph-smoke evidence before its capability bits can flip runtime-ready. CUDA still requires NVIDIA hardware fixture parity plus GGUF graph-smoke evidence before it can satisfy AGENTS.md §3 in this workspace.

Build-time environment overrides

Env varWhat it doesDefault
ELIZA_DFLASH_LLAMA_CPP_REMOTEBuild from a standalone clone of this fork remote instead of the in-repo plugins/plugin-local-inference/native/llama.cpp submodule (default https://github.com/elizaOS/llama.cpp.git).unset
ELIZA_DFLASH_LLAMA_CPP_REFStandalone-clone fork ref (default 33c888a7be0b0b8ffb54cd3f0e05b4bed20cc52e). Setting either this or _REMOTE switches off the submodule build path.unset
ELIZA_DFLASH_VULKAN_HEADERS_DIR / ELIZA_DFLASH_SPIRV_HEADERS_DIRPre-staged Khronos header paths for cross-builds.unset
ELIZA_DFLASH_CMAKE_FLAGSExtra cmake flags appended to the per-target list. Wins on conflict (e.g. override -DCMAKE_CUDA_ARCHITECTURES).unset
MINGW_TOOLCHAIN_FILEOperator-supplied cmake toolchain file for windows-* targets. Required for windows-arm64-* cross builds; optional override for windows-x64-* (auto-detected mingw is used otherwise).unset

The previous ELIZA_DFLASH_PATCH_METAL_* / ELIZA_DFLASH_PATCH_VULKAN_KERNELS environment knobs were decorative log toggles for the v0.4.0-eliza-era no-op patch hooks. They have been removed; both the Metal and Vulkan patch helpers now run unconditionally on every matching target — Metal copies the standalone shaders + patches the metallib add_custom_command, Vulkan copies the standalones into vulkan-shaders/ + applies the available staging patches. There is no opt-out, per AGENTS.md §3 ("Required for ALL tiers"). Symbol-only kernels do not satisfy the post-build audit gate.

Wiring these into dflash-server.ts (so --cache-type-k turbo3_tcq actually runs through the new shader, and so QJL / Polar are reachable from the CLI) is owned by another agent and depends on the ggml-metal-ops dispatch work flagged above.

iOS Capacitor build

The on-device path consumes @elizaos/llama-cpp-capacitor (currently v0.1.5 from npm), an opaque prebuilt framework. The patch at packages/app-core/patches/[email protected] switches the plugin to consume a vendored LlamaCpp.xcframework so we can ship a custom-built static archive against the patched fork.

To produce that static archive, the build script now exposes two iOS targets (compile-only on this machine — they require macOS host with Xcode):

bash
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --target ios-arm64-metal
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --target ios-arm64-simulator-metal

Both pass -DGGML_METAL=ON -DCMAKE_SYSTEM_NAME=iOS -DCMAKE_OSX_ARCHITECTURES=arm64 plus -DGGML_METAL_EMBED_LIBRARY=ON so the .metallib ships inside the static archive. Output lands under $ELIZA_STATE_DIR/local-inference/bin/dflash/<target>/ as lib*.a files plus a include/ headers staging directory. A follow-up packaging step (not in this directory) glues these into the Capacitor xcframework layout the patch expects.

Fused mobile status is intentionally closed. The dflash build script does not advertise or build android-*-*-fused, ios-arm64-metal-fused, or ios-arm64-simulator-metal-fused because the libelizainference mobile FFI packaging and verifier path are not wired here yet. Those spellings fail with explicit diagnostics instead of falling through to a generic unsupported target list. Android system-agent fused artifacts, including emulator android-x86_64-*-fused, are owned by packages/app-core/scripts/aosp/compile-libllama.mjs.

Server fused CUDA on arm64 Linux is supported as linux-aarch64-cuda-fused. It uses the same CUDA fused-attention CMake flags as linux-x64-cuda-fused and keeps the existing arm64 Linux host gate; build it on a real aarch64 CUDA runner such as GH200.