plugins/plugin-local-inference/native/README.md
STATUS — All five required Metal shader files are hardware-verified 8/8 PASS on Apple M4 Max, with Polar also covering the pre-Hadamard-query hot-path entrypoint (
kernel_mul_mv_q4_polar_preht_f32) for both residual modes. Vulkan turbo is hardware-verified on Intel ARL + lavapipe (8/8 PASS each) and rerun on Apple M4 Max via MoltenVK after the current TBQ4 four-record layout update. Vulkan QJL score (qjl.comp) + Polar matvec (polar.comp) + Polar pre-Hadamard matvec (polar_preht.comp) are hardware-verified 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 and on Intel Arrow Lake Mesa ANV 25.2.8 (make vulkan-verify). The fused-attention compute shaders (fused_attn_qjl_tbq.comp,fused_attn_qjl_polar.comp) are hardware-verified 1920/1920 outputs PASS on Intel ARL Mesa ANV across all four GQA/n_kv cases (make vulkan-verify-fused), and the built-forkGGML_OP_FUSED_ATTN_QJL_TBQgraph route is verified (vulkan-dispatch-smoke7/7, max diff 4.5e-8) — sokernel-contract.json'sfusedAttn.runtimeStatus.vulkanisruntime-ready. The Metal fused-attention kernels (metal/fused_attn_qjl_tbq.metal,metal/fused_attn_qjl_polar.metal) + the Polar pre-Hadamard-query score kernel (metal/polar_preht.metal,kernel_attn_score_q4_polar_preht_f32+_multi) are also standalone hardware-verified on Apple M4 Max viamake metal-verify-fused; Metal fused graph dispatch still needs built-fork smoke beforefusedAttn.runtimeStatus.metalcan flip runtime-ready. The production CUDA fused kernel (cuda/fused-attn-qjl-tbq.cu, mirroringverify/cuda_verify.cu'sfused_attn_qjl_tbq3_kernel+ DP4A QJL path) is AUTHORED — hardware-verify pending (needs an NVIDIA host forverify/cuda_runner.sh).fused_attnis an optimization on top of the five required kernels (AGENTS.md §3), not a required kernel — it stays out ofrequiredRuntimeCapabilityKeys/manifestKernelNames. The Vulkan fallback entrypoints now have fixture-harness branches invulkan_verify.cpp(make vulkan-verify-fallbacks):qjl_mul_mv.compis a hard gate (8/8 PASS on Intel ARL Mesa ANV — the reduction-based QJL-score parity check, deterministic) andqjl_get_rows.comp/polar_get_rows.compare informational decode-block checks. On Mesa ANV (Arrow Lake) the get-rows shaders write one fp32 output per element from all 32 invocations (no reduction) and a small, varying handful of elements intermittently read back0.0— an ANV codegen quirk in the all-threads strided SSBO store, not a logic bug (the reduction-basedpolar.comp/polar_preht.compshare the exact same decode and are clean). The-prefix on those Makefile lines keeps a flaky ANV get-rows result from failing the target; on lavapipe the same shaders are bit-exact and deterministic. All Vulkan SPIR-V isspirv-valclean.*
Family Files Source-level checked against fork? Compiles with target SDK? Validated on hardware? TurboQuant — Vulkan vulkan/turbo3.comp,vulkan/turbo4.comp,vulkan/turbo3_tcq.compYES (byte layout + decode math match current fork block_tbq*layouts; TBQ4 is four 18-byte records per 128-row)YES (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1) YES — 8/8 PASS on Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipe; rerun via MoltenVK on Apple M4 Max after TBQ4 update TurboQuant — Metal metal/turbo3.metal,metal/turbo4.metal,metal/turbo3_tcq.metalYES (matches current fork block_tbq*layouts; TBQ4 is four 18-byte records per 128-row)YES ( clang++ -framework Metalruntime JIT path; Metal Toolchain not required formetal_verify)YES — 8/8 PASS on Apple M4 Max (Darwin 25.2.0, runtime MTLDevice.newLibraryWithSource); max diff 6.7e-06QJL metal/qjl.metal,vulkan/qjl.compplus fallbackvulkan/qjl{_get_rows,_mul_mv}.compYES (against qjl_score_qk_refinpackages/native/plugins/qjl-cpu)YES (Vulkan); YES (Metal — runtime JIT) Metal: YES — 8/8 PASS on Apple M4 Max after Wave-3 fix to kernel_attn_score_qjl1_256(uniformuint3attribute params; original mixeduint+uint2failed Metal compile). Vulkan score: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C) and on Intel ARL Mesa ANV 25.2.8. Fallbackqjl_mul_mv: YES — 8/8 PASS on Intel ARL Mesa ANV (hard gate, reduction-based QJL-score parity); fallbackqjl_get_rows: PASS on lavapipe, intermittent few-element0.0read-back on Mesa ANV (informational,--guarded; ANV strided-store codegen quirk, not a logic bug)PolarQuant metal/polar.metal,vulkan/polar.comp,vulkan/polar_preht.compplus fallbackvulkan/polar_get_rows.compYES (against dequantize_row_q4_polar_ref+polar_dot_refinpackages/native/plugins/polarquant-cpu)YES (Vulkan); YES (Metal — runtime JIT) Metal: YES — 8/8 PASS on Apple M4 Max, including pre-Hadamard query entrypoint. Vulkan matvec: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 and on Intel ARL Mesa ANV 25.2.8, including polar_preht.spv. Fallbackpolar_get_rows(informational): PASS on lavapipe; on Mesa ANV a small varying handful of the 128 elements intermittently read back0.0(ANV strided-store codegen quirk, not a logic bug;--guarded in the Makefile). Bothuse_qjlmodes affected; the reduction-basedpolar.comp/polar_preht.compshare the same decode and are cleanFused attention (QJL-K + TBQ-V / Polar-V) vulkan/fused_attn_qjl_tbq.comp,vulkan/fused_attn_qjl_polar.comp;metal/fused_attn_qjl_tbq.metal,metal/fused_attn_qjl_polar.metal;cuda/fused-attn-qjl-tbq.cu+verify/cuda_verify.cuharnessYES (against eliza_fused_attn_qjl_tbq3/eliza_fused_attn_qjl_polarinreference/; Metal/CUDA are byte-faithful mirrors of the hardware-verified Vulkan ports)YES (Vulkan glslc SPIR-V 1.3 / Vulkan 1.1; Metal runtime JIT; CUDA nvcc gated) Vulkan: YES — 1920/1920 outputs PASS on Intel ARL Mesa ANV 25.2.8 across all four cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 6.3e-7 ( make vulkan-verify-fused); built-forkGGML_OP_FUSED_ATTN_QJL_TBQgraph dispatch verified (vulkan-dispatch-smoke7/7, max diff 4.5e-8). Metal standalone: YES — 1920/1920 outputs PASS on Apple M4 Max viamake metal-verify-fused, max diff 7.2e-7; Metal fused graph dispatch still needs a built-fork smoke beforefusedAttn.runtimeStatus.metalcan flip runtime-ready. CUDA: AUTHORED — hardware-verify pending (no NVIDIA HW;verify/cuda_runner.shon a CUDA host; build the fork with-DGGML_CUDA_FUSED_ATTN_QJL=ON)PolarQuant pre-Hadamard-query score metal/polar_preht.metal(kernel_attn_score_q4_polar_preht_f32+_multi),vulkan/polar_preht.compYES ( dot(H·x, q) == dot(x, H·q); same reference as PolarQuant)YES (Metal runtime JIT; Vulkan glslc) Vulkan: YES — 8/8 PASS on Apple M4 Max via MoltenVK 1.4.1 + Intel ARL Mesa ANV 25.2.8 (the mat-vec ABI). Metal attention-score ABI: YES — 8/8 PASS for kernel_attn_score_q4_polar_preht_f32and_multion Apple M4 Max viamake metal-verify-fused, max diff 5.8e-6. Metal mat-vec ABI (kernel_mul_mv_q4_polar_preht_f32inpolar.metal): YES — 8/8 PASS on Apple M4 MaxCUDA (all 5) verify/cuda_verify.culinking~/.cache/eliza-dflash/eliza-llama-cpp/build-cuda/.../libggml-cuda.so(qjl, polar, turbo3_tcq exported symbols; turbo3/turbo4 via thin__global__wrapper around the shipped device-sidetbq_decode_block_cuda) plusverify/runtime_graph_smoke.shforllama-cli --cache-type-kgraph dispatchYES (against ggml-cuda/{turboquant,turbo-tcq,qjl,polarquant}.cu(h)in theelizaOS/llama.cppfork;make cuda-preprocess-checkasserts every API symbol + everyblock_*layout is present in the in-fork headers)NEEDS-HARDWARE — make cudarequiresnvcc(gated on Linux + CUDA Toolkit; macOS not supported); preprocessor-only API surface check passes on M4 MaxNEEDS-HARDWARE — see verify/HARDWARE_VERIFICATION.mdandverify/CUDA_VERIFICATION.md;cuda_runner.shnow requires NVIDIA hardware, fixture parity, and a real GGUF graph-smoke model before a pass can be recordedEarlier history: the original
turbo*.compVulkan port reported 0/8 PASS against Mesa llvmpipe AND Intel ARL — different wrong values per ICD, which fingerprinted a source-level subgroup-size assumption (subgroupAddover a 32-thread workgroup with norequiredSubgroupSize). Wave-4 W4-A replaced that with the same driver-portable shared-memory tree reduction the new W3-E QJL/Polar shaders use; the result is 8/8 PASS on both ICDs. Seereports/porting/2026-05-09-w4/vulkan-turbo-fix.md.The Metal ports here mirror the same decode math as the fork's current shipped TurboQuant layouts.
block_tbq4_0ishalf norm; uint8_t qs[16], and four records form one 128-wide attention row, so the standalone, Vulkan, and built-fork graph-dispatch paths now use the same 72-byte TBQ4 row size.Patch-hook status (post 2026-05-10 audit):
The five Metal patch hooks have been collapsed into one
patchMetalKernelsimplementation inpackages/app-core/scripts/kernel-patches/metal-kernels.mjs. It copies the verified standalones frompackages/inference/metal/into the fork atggml/src/ggml-metal/eliza-shipped/<name>.metal, then patchesggml/src/ggml-metal/CMakeLists.txtso each standalone is compiled into its own.airand merged intodefault.metallibalongsideggml-metal.air. The patch fires unconditionally on every Metal target — no env-var opt-in. The previous opt-in environment variables (ELIZA_DFLASH_PATCH_METAL_*=1) were decorative log toggles and are removed. Idempotent via# ELIZA-KERNEL-PATCH-V1sentinel.For Apple desktop targets the script now sets
-DGGML_METAL_EMBED_LIBRARY=OFF, so the patchedadd_custom_command(which lives in the non-EMBED branch of the fork's CMakeLists.txt) actually runs. iOS targets keepEMBED_LIBRARY=ONbecause the static-archive build needs the metallib data baked in via.incbin. The patcher now rewrites that EMBED path to embed compileddefault.metallibbytes, so desktop and iOS package the same shipped standalone symbols. The build gate still refuses an iOS artifact until those symbols are runtime-dispatch-ready, not merely present.The Vulkan
patchVulkanKernelshook (Wave-6, 2026-05-10) now stages the nine standalone.compfiles frompackages/inference/vulkan/directly intoggml/src/ggml-vulkan/vulkan-shaders/(where the fork'sfile(GLOB CONFIGURE_DEPENDS *.comp)picks them up), plus applies two anchor-driven idempotent patches underkernel-patches/vulkan-dispatch-patches/:
01-vulkan-shaders-gen.patch— adds 9string_to_spv()calls at the bottom ofprocess_shaders()so the gen tool emitseliza_<name>_data[]+eliza_<name>_lensymbols intoggml-vulkan-shaders.hpp.02-ggml-vulkan-pipelines.patch— extendsvk_device_structwith 9vk_pipeline pipeline_eliza_*slots and adds 9ggml_vk_create_pipeline()calls at the bottom ofggml_vk_load_shaders()so each SPV blob is referenced at link time and surfaces innm libggml-vulkan.so.Symbol audit (
nm libggml-vulkan.so | grep eliza_) is now expected to pass on a successful linux-x64-vulkan build. The source patch also adds conservative op-level dispatch forGGML_OP_ATTN_SCORE_QJL,GGML_OP_ATTN_SCORE_TBQ, andGGML_OP_ATTN_SCORE_POLAR. The remaining Vulkan gap is native graph-dispatch evidence on physical Linux/Android Vulkan hardware; standalone SPIR-V fixture success and MoltenVK fixture success do not flip runtime-ready capability bits by themselves.Dispatch wiring status (Wave-7 audit, 2026-05-10): the build now distinguishes shipped symbols from runtime-ready graph capabilities.
patchMetalKernels()ships all five standalones in desktopdefault.metalliband iOS embedded metallib. Metal now has a dedicated, numerically smoke-testedGGML_OP_ATTN_SCORE_QJLbridge tokernel_attn_score_qjl1_256_multi, soCAPABILITIES.jsonmay reportkernels.qjl_full=truewhen that symbol is present. The helper still deliberately refuses the old genericMUL_MAT/GET_ROWSpatch route: routing the standalones through generic ggml ops is either semantically wrong (QJL1_256expects a pre-projected sketch) or incomplete (thetbq*kernels are attention-score only).shippedKernels.symbols.*records metallib symbol presence; the remaining Turbo/Polarkernels.*bits stay false until dedicated graph ops land and pass numeric dispatch tests. CUDA is still the only backend whoseelizaOS/llama.cppfork binary covers the full runtime path for all five types.Wave-6 darwin shared-lib link fix: ggml-base on darwin defaults to
BUILD_SHARED_LIBS=ONand links with-undefined error, so unresolvedquantize_qjl1_256/dequantize_row_qjl1_256/quantize_row_qjl1_256_refsymbols (called fromggml.cbut defined inggml-cpu/qjl/) madelibggml-base.dylibfail to link at all. The Wave-5 BLOCKER agent's "verified via strings" claim was against a stale metallib produced by an earlier shape; the end-to-end build never actually completed. Wave-6 extendspatchGgmlBaseForWindowsQjlto fire ondarwin-*andios-*targets (same fix as Windows: compile QJL TUs into ggml-base too).The most likely on-hardware failure modes (carry-overs from the Vulkan investigation, applicable here too):
- FWHT seed/sign-vector application. All five shaders assume Q is pre-rotated host-side (matches the fork's CUDA + Metal paths). If you wire them into a pipeline that does NOT pre-rotate Q, you must add an inverse FWHT to the dequantized output before the dot product.
- SIMD-group reduction. The Metal shaders all dispatch with threadgroup_size = 32 = one Apple SIMD-group, so
simd_sumcovers the full reduction. If anyone bumps the threadgroup size, switch to threadgroup-shared scratch + barrier.- Block byte layout.
block_turbo*,block_qjl1_256, andblock_q4_polarare all__attribute__((packed))in the C reference and have natural alignment 2 (fp16/bf16 leading field). Metaldevice const T*arithmetic must respect this — if you change the header layout, the shader'shead_offset_bytesarg goes stale.The harness IS reusable for fixing the above —
metal_verifyloads a.metalsource, JIT-compiles viaMTLDevice.newLibraryWithSource, dispatches the named kernel, and diffs against a JSON fixture.
CUDA originals: https://github.com/spiritbuun/buun-llama-cpp.git at commit
6575873e9c4872709d374d854b583cfaa270caff, paths:
ggml/src/ggml-cuda/turbo-quant-cuda.cuh — quantize/dequantize, codebooks, FWHT, Viterbiggml/src/ggml-cuda/fattn-vec.cuh — flash-attention vec path that consumes themggml/src/ggml-common.h — block_turbo3_0, block_turbo4_0, block_turbo3_tcq layoutsggml/src/ggml-cpu/ops.cpp and ggml/src/ggml-turbo-quant.c — CPU reference helpers for lossy quantizationggml/src/ggml-metal/ggml-metal.metal and turbo-wht.h — existing Metal shader ground truth (the dequantize_turbo3_0_t4 helper there is the bit-for-bit reference our standalone turbo3.metal mirrors)Reference impl in this repo:
packages/native/plugins/qjl-cpu/include/qjl/qjl.h — public API + block_qjl1_256 layoutpackages/native/plugins/qjl-cpu/src/qjl_score_ref.c — scalar GQA score (the kernel_attn_score_qjl1_256 mirror target)packages/native/plugins/qjl-cpu/src/qjl_quantize_ref.c — quantize one rowpackages/native/plugins/qjl-cpu/src/qjl_dispatch.c — runtime dispatch (NEON / AVX2 / scalar)Original CUDA reference (training-side, not the on-device target):
packages/training/scripts/quantization/qjl/csrc/{qjl_quant_kernel.cu, qjl_gqa_score_kernel.cu}. The on-fork CPU side that lands block_qjl1_256 in ggml-common.h is W1-A's responsibility; this directory only ports the Metal half.
block_q4_polar)Reference impl in this repo:
packages/native/plugins/polarquant-cpu/include/polarquant/polarquant.h — public API + 5-step decode contractpackages/native/plugins/polarquant-cpu/include/polarquant/polar_block.h — block_q4_polar packed layoutpackages/native/plugins/polarquant-cpu/include/polarquant/polar_centroids.h — POLAR_Q4_CENTROIDS[16] Lloyd-Max LUT (the Metal shader inlines the same constants)packages/native/plugins/polarquant-cpu/src/polar_dequantize_ref.c — scalar decoder (the kernel_get_rows_q4_polar mirror target)packages/native/plugins/polarquant-cpu/src/polar_dot_ref.c — scalar q4_polar · q8_0 dot product (template for kernel_mul_mv_q4_polar_f32, except our verification path uses fp32 activations not q8_0)packages/native/plugins/polarquant-cpu/src/polar_qjl.c — xorshift32 sign vector for the optional QJL residualpackages/native/plugins/polarquant-cpu/src/polar_hadamard.c — in-place 128-element Walsh-Hadamard butterflyThe on-fork CPU side that adds block_q4_polar to ggml-common.h is W1-B's
responsibility; this directory only ports the Metal half.
turbo3 — 3-bit PolarQuant + per-block normThe graph normalizes a 128-element rotation group, applies a forward Fast
Walsh–Hadamard transform with seed=42 sign vectors, then quantizes each of
the four 32-element sub-blocks to 3-bit Lloyd–Max centroids
{-0.190685, -0.117832, ..., +0.190685}. The 3-bit index is split: low 2
bits in qs[], high bit in signs[]. Per-group norm is corrected by
grp_norm / recon_norm so dequantized output preserves the original L2
norm. At dot-product time, the host has already pre-rotated Q with the
same FWHT, so the shader only needs Q · centroids[idx] * norm.
turbo4 — 4-bit TurboQuant (16 centroids)Current TBQ4 uses four independent 32-element preconditioned records per
128-wide attention row. Each record is half norm; uint8_t qs[16] (18 bytes);
the full GGML row size for head_dim=128 is 72 bytes. A fixed sign vector plus
32-point Hadamard preconditions each record before quantization to centroids
{-2.7321365, -2.0685055, ..., +2.7321365}. The packing stores elements
0..15 in low nibbles and 16..31 in high nibbles. The standalone Metal,
Vulkan, reference fixture generator, and built-fork Metal graph-dispatch
smoke all use this layout now. The kernel body has diverged into perf-only
differences (the standalone hoists per-block byte
loads and uses fma); see PATCH_AUDIT_2026-05-10.md for the diff.
turbo3_tcq — 3-bit Trellis-Coded Quantization (k=3, L=9, 512 states)Each 128-element rotation group is encoded by a 512-state right-shift
bitshift trellis with a hand-trained 512-entry codebook. Encode = 128-step
Viterbi over 512 states (CUDA: k_set_rows_turbo3_tcq); decode = read a
sliding 9-bit window at bit t*3 of the bitstream and look up the centroid
(CUDA: dequantize_turbo3_tcq). Decode-only is what the dot-product path
needs; the Vulkan and Metal shaders here implement decode + Q·K. The
reference C implementation in reference/turbo_kernels.c includes a slow
Viterbi encoder for fixture generation.
qjl — 1-bit JL transform K-cache compressionEach 128-element key vector is projected through a fixed 128×256 JL matrix Π, sign-extracted (256 bits), and stored alongside a bf16 norm (34 bytes/key total — 7.53× compression vs bf16 K-cache). The attention-score path consumes a pre-projected query sketch (n_heads × proj_dim, computed once per Q) and emits
score[h_q, t] = ||k_t|| * sqrt(pi/2)/proj_dim *
sum_j sign_packed[t, j] * q_sketch[h_q, j]
GQA fanout: h_kv = h_q / (n_heads/n_kv_heads). Three Metal kernels:
kernel_attn_score_qjl1_256 (the hot path), kernel_get_rows_qjl1_256
(decode-to-fp32 fallback), kernel_mul_mv_qjl1_256_f32 (mat-vec for
non-attention call sites).
polar — 4-bit PolarQuant block (block_q4_polar, 82 bytes)Each 128-element block stores: fp16 per-block L2 norm, 64 bytes of 4-bit
Lloyd-Max-optimal centroid indices, plus 16 bytes of optional 1-bit
QJL residual. Decode steps (mirrors dequantize_row_q4_polar_ref):
POLAR_QJL_SEED=42), magnitude
0.5 / sqrt(QK_POLAR).1/QK_POLAR to convert the in-place butterfly into the
orthonormal inverse the Python decoder uses.Two Metal kernels: kernel_get_rows_q4_polar (full block decode to fp32)
and kernel_mul_mv_q4_polar_f32 (block-wise dot product against an fp32
activation chunk; one threadgroup per block, threadgroup-shared scratch
holds the dequantized block).
Each shader file annotates the CUDA function it ports. Key correspondences:
| Concept | CUDA | Vulkan | Metal |
|---|---|---|---|
| 3-bit centroid LUT | d_turbo_centroids_3bit (turbo-quant-cuda.cuh:296) | TURBO_CENTROIDS_3BIT constant | TURBO_CENTROIDS_3BIT constant |
| 4-bit centroid LUT | d_turbo_centroids_4bit (turbo-quant-cuda.cuh:313) | TURBO_CENTROIDS_4BIT constant | TURBO_CENTROIDS_4BIT constant |
| FWHT signs (seed=42) | d_turbo_wht_signs1/signs2 (turbo-quant-cuda.cuh:326) | not applied — Q is pre-rotated | not applied — Q is pre-rotated |
| 3-bit dequant | dequantize_turbo3_0 (turbo-quant-cuda.cuh:481) | vulkan/turbo3.comp main loop | metal/turbo3.metal kernel_turbo3_dot |
| 4-bit dequant | dequantize_turbo4_0 (turbo-quant-cuda.cuh:532) | vulkan/turbo4.comp main loop | metal/turbo4.metal kernel_turbo4_dot |
| TCQ codebook | d_turbo3_tcq_codebook (turbo-quant-cuda.cuh:619) | binding=3 storage buffer | buffer(3) constant |
| TCQ dequant (sliding 9-bit) | dequantize_turbo3_tcq (turbo-quant-cuda.cuh:982) | vulkan/turbo3_tcq.comp main loop | metal/turbo3_tcq.metal kernel_turbo3_tcq_dot |
No on-device encode in the shaders. The CUDA k_set_rows_turbo3_tcq
does a 512-state Viterbi inside one threadgroup with shared-memory
double-buffered cost arrays. That requires 512 threads, ~5 KB of
shared/threadgroup memory, and warp-shuffle min-reduction primitives
that are tricky to port portably. Encode happens host-side or via the
C reference impl in reference/turbo_kernels.c.
No InnerQ calibration paths. The CUDA kernels include a d_innerq_*
per-channel scaling pass that is part of the inner-quant calibration
routine. The decode path does not need it (calibration is a separate
step that mutates d_innerq_channel_scale ahead of quantize). The
shaders here decode against the post-corrected norm field directly.
No d_tcq_dump_* debug paths. The CUDA k_set_rows_turbo3_tcq has
optional global dump buffers for autocorrelation analysis. Removed.
Subgroup/SIMD reduction. Vulkan uses a driver-portable 32-thread
shared-memory tree reduction (5 barriers, shared float partials[32])
so it works regardless of subgroup size. The original subgroupAdd
path silently under-reduced on Intel ARL (minSubgroupSize=8) and on
lavapipe; W4-A replaced it after on-hardware verification. Metal still
uses simd_sum because the dispatch is one threadgroup = one Apple
SIMD-group of 32 lanes (the simd_sum assumption is a per-vendor
guarantee on Apple Silicon, unlike Vulkan's driver-chosen subgroup
size). On AMD GCN/RDNA the Vulkan tree reduction sidesteps the
wave32/wave64 question entirely.
No FWHT inside the shader. All shaders skip both forward and inverse
rotation because the surrounding graph pre-rotates Q. This matches
what the existing fork's Metal dequantize_turbo3_0_t4 and CUDA
dequantize_turbo3_0 actually do — the rotation is a graph-level
concern, not a per-block concern. If you wire these shaders into a
pipeline that does NOT pre-rotate Q, you must add an inverse FWHT to
the dequantized output before the dot product.
Before running backend-specific hardware checks, run the executable contract
gate. It verifies that the manifest kernel names
(turboquant_q3/turboquant_q4/qjl/polarquant/dflash/turbo3_tcq),
the build-script capability keys (turbo3/turbo4/turbo3_tcq/qjl_full),
the fixture set, and every supported build target's platform gate are still
in sync:
make -C packages/inference/verify kernel-contract
# In this repo's worktree:
cd packages/inference/verify
# 1) Build the reference + fixture generator (no GPU needed):
make reference-test
# 2) Generate fixtures from the reference impl (already shipped under
# fixtures/, but regenerate if you change the reference):
./gen_fixture fixtures
# 3) Build the Metal harness (requires Xcode command-line tools):
make metal
# 4) Run each shader against the fixtures:
./metal_verify ../metal/turbo3.metal kernel_turbo3_dot fixtures/turbo3.json
./metal_verify ../metal/turbo4.metal kernel_turbo4_dot fixtures/turbo4.json
./metal_verify ../metal/turbo3_tcq.metal kernel_turbo3_tcq_dot fixtures/turbo3_tcq.json
./metal_verify ../metal/qjl.metal kernel_attn_score_qjl1_256 fixtures/qjl.json
./metal_verify ../metal/polar.metal kernel_mul_mv_q4_polar_f32 fixtures/polar.json
# 5) Optional: build the patched llama-server. Metal kernel patching is
# unconditional for Metal targets; there are no opt-in env vars.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --backend metal
# 6) Optional: build the iOS Capacitor static archive that the
# LlamaCpp.xcframework patch in
# packages/app-core/patches/[email protected] consumes.
# Requires macOS host with Xcode installed.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs \
--target ios-arm64-metal
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs \
--target ios-arm64-simulator-metal
# Host-side: build SPIR-V from the .comp files using glslc from the Vulkan
# SDK (https://vulkan.lunarg.com/sdk/home) or the Android NDK shader tools.
cd packages/inference/verify
make reference-test
make vulkan-spirv
# 1) On a workstation with a Vulkan-capable GPU (NVIDIA / AMD / Intel),
# run the host harness:
VULKAN_SDK=/opt/vulkan-sdk make vulkan
make vulkan-verify
# 2) On native Linux hardware, run the stricter smoke gate. This writes
# hardware-results/linux-vulkan-smoke-*.log, rejects software ICDs unless
# ELIZA_ALLOW_SOFTWARE_VULKAN=1, runs standalone fixtures, builds the
# patched fork, dumps CAPABILITIES.json, then runs the built-fork graph
# gate against the managed install output under
# $ELIZA_STATE_DIR/local-inference/bin/dflash/linux-x64-vulkan by default.
# Build failure stops the runner; stale/symbol-only artifacts are not reused.
make vulkan-native-smoke
# Direct graph dispatch smoke is native-Linux-only by default and requires a
# bin dir containing libggml-vulkan.so. Override only for explicit prebuilts:
# ELIZA_DFLASH_VULKAN_BIN_DIR=/path/to/bin make vulkan-dispatch-smoke
# 3) On-device (Android) verification: cross-compile the harness against
# the Android NDK Vulkan headers and push to a Vulkan-capable handset
# (Adreno 6xx+, Mali-G7x+). Same SPIR-V, same fixtures. The runner records
# hardware-results/android-vulkan-smoke-*.log and fails closed after
# standalone fixtures unless ELIZA_ANDROID_VULKAN_GRAPH_EVIDENCE points at
# a built-fork/app graph-dispatch report covering all six Vulkan graph
# routes or all five runtime capability keys with finite maxDiff. The
# standalone runner also checks `polar_preht.spv` with both Polar fixtures.
make android-vulkan-smoke
# 4) End-to-end via llama-server: the patch hook `patchVulkanKernels` is
# default-on. The build still refuses publishable artifacts until graph
# dispatch capabilities are runtime-ready, not merely symbol-shipped.
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --backend vulkan
| Shader | C reference compiles | Self-test (CPU) | Fixture generated | Static visual review vs CUDA / fork | Compiles to SPIR-V/AIR | Runs on real GPU | Numerically matches CUDA |
|---|---|---|---|---|---|---|---|
turbo3.comp | n/a | n/a | yes | yes | yes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1) | YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipe | YES — 8/8 PASS, max diff 4.8e-6 |
turbo4.comp | n/a | n/a | yes | yes | yes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1) | YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipe | YES — 8/8 PASS, max diff 5.7e-6 |
turbo3_tcq.comp | n/a | n/a | yes | yes | yes (Mesa NDK glslc, SPIR-V 1.3 / Vulkan 1.1) | YES — Intel ARL Mesa 25.2.8 + lavapipe Mesa 25.2.8 LLVMpipe | YES — 8/8 PASS, max diff 6.7e-6 |
qjl.comp | n/a | n/a | yes | YES (against qjl_score_qk_ref) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C); Intel ARL Mesa ANV 25.2.8 | YES — 8/8 PASS, max diff 7.6e-6 |
qjl_mul_mv.comp | n/a | n/a | yes | YES (against eliza_qjl_mul_mv, the reduction-based QJL-score parity check) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fallbacks, hard gate, deterministic) | YES — 8/8 PASS, max diff 7.6e-6 |
qjl_get_rows.comp | n/a | n/a | yes | YES (against eliza_qjl_dequantize_row, decode block 0; all-threads strided store, no reduction) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Intel ARL Mesa ANV 25.2.8 + lavapipe (make vulkan-verify-fallbacks, informational, --guarded) | lavapipe: 128/128 PASS deterministic. Mesa ANV: a small varying subset of the 128 elements intermittently reads back 0.0 (ANV strided-store codegen quirk; the reduction-based qjl.comp/qjl_mul_mv.comp share the decode and are clean) |
polar.comp | n/a | n/a | yes | YES (against dequantize_row_q4_polar_ref + polar_dot_ref) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Apple M4 Max via MoltenVK 1.4.1 (Wave-4-C); Intel ARL Mesa ANV 25.2.8 | YES — 8/8 PASS, max diff 5.7e-6 |
polar_preht.comp | n/a | n/a | yes | YES (same reference, using dot(H*x, q) == dot(x, H*q)) | yes (Android NDK glslc, SPIR-V 1.3 / Vulkan 1.1) | YES — Apple M4 Max via MoltenVK 1.4.1; Intel ARL Mesa ANV 25.2.8 | YES — 8/8 PASS for polar.json and polar_qjl.json, max diff 7.6e-6 |
polar_get_rows.comp | n/a | n/a | yes | YES (against eliza_polar_dequantize_row, decode block 0; all-threads strided store, no reduction) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Intel ARL Mesa ANV 25.2.8 + lavapipe (make vulkan-verify-fallbacks, informational, --guarded) | lavapipe: 128/128 PASS deterministic, both use_qjl modes. Mesa ANV: a small varying subset of the 128 elements intermittently reads back 0.0 (ANV strided-store codegen quirk; reduction-based polar.comp/polar_preht.comp share the decode and are clean) |
fused_attn_qjl_tbq.comp | n/a | n/a | yes (cases array) | YES (against eliza_fused_attn_qjl_tbq3) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fused); built-fork GGML_OP_FUSED_ATTN_QJL_TBQ graph dispatch (vulkan-dispatch-smoke 7/7) | YES — 1920/1920 outputs PASS across 4 cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 5.7e-7; graph route max diff 4.5e-8 |
fused_attn_qjl_polar.comp | n/a | n/a | yes (cases array) | YES (against eliza_fused_attn_qjl_polar) | yes (glslc 2026.2, SPIR-V 1.3 / Vulkan 1.1, spirv-val clean) | YES — Intel ARL Mesa ANV 25.2.8 (make vulkan-verify-fused) | YES — 1920/1920 outputs PASS across 4 cases (n_kv 64/512/256/128, GQA 1/2/4), max diff 6.3e-7 |
fused_attn_qjl_tbq.metal | n/a | n/a | yes (cases array) | YES (byte-faithful mirror of fused_attn_qjl_tbq.comp / eliza_fused_attn_qjl_tbq3) | yes (Apple Metal runtime JIT; ships in default.metallib via metal-kernels.mjs) | YES — Apple M4 Max (make metal-verify-fused) | YES — 1920/1920 outputs PASS, max diff 5.7e-7 |
fused_attn_qjl_polar.metal | n/a | n/a | yes (cases array) | YES (byte-faithful mirror of fused_attn_qjl_polar.comp / eliza_fused_attn_qjl_polar) | yes (Apple Metal runtime JIT; ships in default.metallib) | YES — Apple M4 Max (make metal-verify-fused) | YES — 1920/1920 outputs PASS, max diff 7.2e-7 |
polar_preht.metal (kernel_attn_score_q4_polar_preht_f32 + _multi) | n/a | n/a | yes (polar_preht.json) | YES (dot(H·x, q) == dot(x, H·q); mirrors vulkan/polar_preht.comp + eliza_polar_mul_mv) | yes (Apple Metal runtime JIT; ships in default.metallib) | YES — Apple M4 Max (make metal-verify-fused) | YES — 8/8 PASS for scalar + multi 2/3/4/8, both residual modes, max diff 5.8e-6 |
cuda/fused-attn-qjl-tbq.cu (fused_attn_qjl_tbq3_kernel + DP4A QJL) | n/a | n/a | yes (fused_attn_qjl_tbq.json) | YES (mirrors verify/cuda_verify.cu's fused kernel + the fork CPU op) | yes (nvcc, gated on -DGGML_CUDA_FUSED_ATTN_QJL=ON) | AUTHORED — hardware-verify pending (no NVIDIA HW; verify/cuda_runner.sh on a CUDA host) | AUTHORED — hardware-verify pending |
turbo3.metal | n/a | n/a | yes | YES (matches fork's dequantize_turbo3_0_t4 byte-for-byte) | YES (Apple M4 Max, runtime JIT) | YES — Apple M4 Max, Darwin 25.2.0 | YES — 8/8 PASS, max diff 3.3e-6 |
turbo4.metal | n/a | n/a | yes | The fork's in-tree eliza-kernels/tbq4_0.metal is an EARLIER draft (29-line diff, materially different inner loop). The standalone is the canonical FMA-tuned variant; the build script copies the standalone into eliza-shipped/ so the metallib uses it. | YES (Apple M4 Max, runtime JIT in verify harness) | YES — Apple M4 Max, Darwin 25.2.0 | YES — 8/8 PASS, max diff 5.7e-6 |
turbo3_tcq.metal | n/a | n/a | yes | YES (matches CUDA dequantize_turbo3_tcq 9-bit window decode) | YES (Apple M4 Max, runtime JIT) | YES — Apple M4 Max, Darwin 25.2.0 | YES — 8/8 PASS, max diff 6.7e-6 |
qjl.metal | n/a | n/a | yes | YES (matches qjl_score_qk_ref in qjl-cpu) | YES (Apple M4 Max, runtime JIT) after Wave-3 attribute-shape fix | YES — Apple M4 Max, Darwin 25.2.0 | YES — 8/8 PASS, max diff 1.1e-5 |
polar.metal | n/a | n/a | yes | YES (matches dequantize_row_q4_polar_ref + polar_dot_ref) | YES (Apple M4 Max, runtime JIT) | YES — Apple M4 Max, Darwin 25.2.0 | YES — 8/8 PASS, max diff 7.6e-6 |
turbo_kernels.c | yes (gcc/clang) | yes | yes | n/a | n/a | n/a | n/a |
qjl_polar_ref.c | yes (gcc/clang) | yes | yes | n/a | n/a | n/a | n/a |
What "yes" means for the C references: make reference-test builds without
warnings (-O2 -Wall -Wextra -std=c11), and ./gen_fixture --self-test
emits finite, plausible-magnitude scores for every kernel:
turbo3=-2.501480 turbo4=-23.721790 turbo3_tcq=-4.822659 qjl=3.696591 polar=-1.994053 polar_qjl=-1.438744
(deterministic with the seeded PRNG in this repo). The turbo4 magnitude
changed when TBQ4 moved to the current four-18-byte-record-per-128-row
layout — each record is its own preconditioned 32-element block with an
independent half norm, so the dequantized magnitudes are no longer
scaled by a single shared per-128-group norm. polar_qjl is the Polar
V-cache score with the optional 1-bit QJL residual enabled. The same
--self-test also runs internal-consistency parity checks before printing
those scores, so the references can't silently disagree with each other:
qjl_score_qk and qjl_mul_mv must return the same scalar when
n_heads = n_kv_heads = n_tokens = 1 (no GQA fanout, just a single
projected dot product). Tolerance 1e-5.polar_mul_mv must equal polar_dequantize_row reconstructed
and then manually dotted against q[]. Tolerance 1e-3.These are reference-vs-reference checks; they verify that the two C
references the Metal shaders mirror agree with each other, not that the
shaders agree with hardware. Metal hardware verification was completed
in Wave-3 (2026-05-10) on Apple M4 Max (Darwin 25.2.0) and rerun on
2026-05-11 after the Polar pre-Hadamard hot path landed: metal_verify
reports 8/8 PASS for each checked entrypoint, including
kernel_mul_mv_q4_polar_preht_f32 with and without the QJL residual. The harness uses
MTLDevice.newLibraryWithSource (runtime JIT) against Metal.framework
— it does NOT require the offline xcrun metal toolchain, so any Mac
with Xcode command-line tools can run the verification.
Vulkan QJL/Polar verification (Wave-4-C, 2026-05-10): verify/vulkan_verify.cpp
now branches on the fixture's kernel field and resolves a per-kernel
KernelBindings (input buffer set, output buffer size, push-constant struct,
dispatch shape). QJL gets the 3-buffer (q_sketch, packed_k, scores) +
4-uint push (n_heads, n_kv_heads, n_tokens, proj_dim) bind-set with
(n_heads, n_tokens, 1) dispatch. Polar gets the 3-buffer (k_blocks, q,
y) + 3-uint push (n_rows, head_dim, use_qjl) bind-set with (n_rows, 1, 1)
dispatch. The existing turbo3/4/tcq path is unchanged; turbo3_tcq still
attaches the 4th codebook buffer. Same JSON fixture from gen_fixture feeds
the Metal and Vulkan harnesses; the QJL/Polar fixture shape fields
(q_sketch, n_heads, n_tokens, n_rows, use_qjl) were already present.
Hardware run: Apple M4 Max (Darwin 25.2.0) via MoltenVK 1.4.1 + Vulkan-Loader
1.4.341 (brew install molten-vk vulkan-loader vulkan-headers vulkan-tools shaderc; ICD at /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json). The
five primary harness entrypoints (turbo3, turbo4, turbo3_tcq, qjl score,
polar matvec) report 8/8 PASS at the standard 1e-3 tolerance with max diffs
in the 4.8e-7 to 7.6e-6 range — within 1 ULP of the direct Metal harness
numbers, which confirms MoltenVK's SPIR-V→MSL translation is bit-equivalent
on Apple Silicon for these kernels. The Vulkan fallback entrypoints now have fixture branches in
vulkan_verify.cpp and were exercised on Intel Arrow Lake Mesa ANV 25.2.8
(make vulkan-verify-fallbacks / make vulkan-verify-fused): qjl_mul_mv
8/8 PASS (hard gate, reduction-based, deterministic), the *_get_rows
decode-block shaders PASS on lavapipe but show the intermittent few-element
0.0 read-back on Mesa ANV's all-threads strided SSBO store (informational,
--guarded; not a logic bug — the reduction-based polar.comp/polar_preht.comp
share the decode and are clean), and the fused-attention shaders are 1920/1920
outputs PASS across all four GQA/n_kv cases. MoltenVK re-run with:
brew install molten-vk vulkan-loader vulkan-headers vulkan-tools shaderc
cd packages/inference/verify
make reference-test
make -C ../verify vulkan-spirv # glslc --target-env=vulkan1.1 --target-spv=spv1.3
c++ -O2 -std=c++17 -I/opt/homebrew/include -I../reference \
vulkan_verify.cpp turbo_kernels.o qjl_polar_ref.o \
-L/opt/homebrew/opt/vulkan-loader/lib -lvulkan -lm -o vulkan_verify
export VK_ICD_FILENAMES=/opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json
export DYLD_LIBRARY_PATH=/opt/homebrew/opt/vulkan-loader/lib:/opt/homebrew/opt/molten-vk/lib
for k in turbo3 turbo4 turbo3_tcq qjl polar; do
./vulkan_verify ../vulkan/$k.spv fixtures/$k.json
done
fixtures/*.json were generated from the reference C implementations in
this directory (reference/turbo_kernels.c for the turbo family, plus
verify/qjl_polar_ref.c for QJL and Polar), NOT from a real CUDA build of
buun-llama-cpp. They are sufficient to verify that a Vulkan/Metal shader and
the reference produce the same scalar output, but they do NOT verify
CUDA-vs-{Vulkan,Metal} parity.
On hardware-validation day:
turbo3.json, turbo4.json, turbo3_tcq.json.
W1-E owns the CUDA productionization side and can supply these.qjl_pure_pytorch_quantize reference at
packages/training/scripts/quantization/qjl/test_qjl.py) on a CUDA host,
capture the projected sketch + packed signs + scores, replace qjl.json.polarquant.polar_quant.py's reference
on a CUDA host (the _compute_lloyd_max_centroids and
polar_hadamard_inplace paths are bit-exact targets), replace polar.json.The block_qjl1_256 and block_q4_polar byte layouts are owned by W1-A and
W1-B respectively; the verify/qjl_polar_ref.{h,c} files in this directory
are stand-ins that mirror those layouts so the verify harness has zero deps
on the @elizaos/native-plugins packages.
The DFlash/TBQ/QJL/Polar/Metal fork ships in-tree as a git submodule at
plugins/plugin-local-inference/native/llama.cpp —
elizaOS/llama.cpp at the repository gitlink commit (currently
33c888a7b, with ce85787c8 as the validated DFlash/SWA build base).
bun install runs git submodule update --init --recursive, so a fresh
checkout has it. Both build paths default to this checkout:
packages/app-core/scripts/build-llama-cpp-dflash.mjs — desktop / server /
Windows / iOS.packages/app-core/scripts/aosp/compile-libllama.mjs — Android cross-compile
(same pinned commit, so both paths land on identical kernels).The build re-applies the kernel patches (kernel-patches/*) on top of the
pristine submodule tree each run, then git checkout -- . && git clean -fdx
discards those edits at the start of the next build — the submodule stays pinned
to its gitlink commit. ELIZA_DFLASH_LLAMA_CPP_REMOTE / ELIZA_DFLASH_LLAMA_CPP_REF
(or --cache-dir / --src-dir) still force a standalone clone for fork bisects.
The DFlash build scripts now reject source checkouts that lack the
SWA-aware seq_rm probe fallback from elizaOS/eliza#7635; otherwise
SWA-based bodies can silently run target-only after --spec-type dflash.
Source-of-truth: the verified .metal and .comp files in this
directory (plugins/plugin-local-inference/native/{metal,vulkan}/). The build script
packages/app-core/scripts/build-llama-cpp-dflash.mjs calls into
packages/app-core/scripts/kernel-patches/{metal,vulkan}-kernels.mjs
during applyForkPatches() and the helpers do the actual work:
-DGGML_METAL_EMBED_LIBRARY=OFF on every
darwin-{arm64,x64}-metal target. This selects the non-EMBED branch
of the fork's ggml/src/ggml-metal/CMakeLists.txt, which builds a
sidecar default.metallib next to llama-server.patchMetalKernels() copies the five standalones from
packages/inference/metal/{turbo3,turbo4,turbo3_tcq,qjl,polar}.metal
into the fork at ggml/src/ggml-metal/eliza-shipped/<name>.metal.
Files are copied verbatim; a // # ELIZA-KERNEL-PATCH-V1 comment is
prepended so an audit can tell they came from the standalone.patchMetalKernels() patches
ggml/src/ggml-metal/CMakeLists.txt's non-EMBED add_custom_command
so the metallib build runs xcrun metal -c once per source
(ggml-metal.metal plus each of the five standalones), producing
one .air file each, then merges all six .air files into
default.metallib via a single xcrun metallib invocation. Sentinel
# ELIZA-KERNEL-PATCH-V1 makes the patch idempotent.default.metallib from the build's
bin/ directory into the install outDir next to llama-server.
The Metal runtime locates it via dlopen-style loader_path resolution.After this, strings default.metallib | grep kernel_turbo3_dot (and
similarly kernel_turbo4_dot, kernel_turbo3_tcq_dot,
kernel_attn_score_qjl1_256, kernel_get_rows_qjl1_256,
kernel_mul_mv_qjl1_256_f32, kernel_get_rows_q4_polar,
kernel_mul_mv_q4_polar_f32, kernel_mul_mv_q4_polar_preht_f32) returns matches. The kernels are
present as live symbols inside the metallib.
iOS keeps EMBED_LIBRARY=ON because the static-archive build needs the
metallib data baked in via .incbin. patchMetalKernels() rewrites that
branch to compile ggml-metal.metal plus the five standalones as separate
.air files, merge them into default.metallib, and embed the compiled
bytes. This avoids the old duplicate-declaration failure from concatenating
standalone Metal source with ggml-common.h.
The build still fails the publish gate until op-level dispatch is numerically verified. Shipped symbols are not enough to satisfy the runtime-ready kernel contract.
The nine standalone .comp files are staged into the fork and compiled
cleanly. patchVulkanKernels() adds conservative op-level dispatch for
GGML_OP_ATTN_SCORE_QJL, GGML_OP_ATTN_SCORE_TBQ, and
GGML_OP_ATTN_SCORE_POLAR, but the runtime capability bits remain gated
on native built-fork graph-dispatch evidence. Standalone SPIR-V fixture
success, MoltenVK success, and symbol presence are not enough to mark
Vulkan runtime-ready. There is no incomplete-artifact bypass.
For the current blocker ledger and platform/device gap list, see
reports/porting/2026-05-11/remaining-work-ledger.md.
Metal graph dispatch is runtime-ready for all five kernel families in the
current contract evidence. make -C packages/inference/verify dispatch-smoke
links against the built fork libraries, executes the actual Metal backend
routes, and compares numeric scores against local references. The QJL route
uses the launch-tax-amortized kernel_attn_score_qjl1_256_multi path; the
TurboQuant and Polar routes are likewise covered by the built-fork smoke gate.
Vulkan graph execution is source-patched but still needs native Linux/Android graph-smoke evidence before its capability bits can flip runtime-ready. CUDA still requires NVIDIA hardware fixture parity plus GGUF graph-smoke evidence before it can satisfy AGENTS.md §3 in this workspace.
| Env var | What it does | Default |
|---|---|---|
ELIZA_DFLASH_LLAMA_CPP_REMOTE | Build from a standalone clone of this fork remote instead of the in-repo plugins/plugin-local-inference/native/llama.cpp submodule (default https://github.com/elizaOS/llama.cpp.git). | unset |
ELIZA_DFLASH_LLAMA_CPP_REF | Standalone-clone fork ref (default 33c888a7be0b0b8ffb54cd3f0e05b4bed20cc52e). Setting either this or _REMOTE switches off the submodule build path. | unset |
ELIZA_DFLASH_VULKAN_HEADERS_DIR / ELIZA_DFLASH_SPIRV_HEADERS_DIR | Pre-staged Khronos header paths for cross-builds. | unset |
ELIZA_DFLASH_CMAKE_FLAGS | Extra cmake flags appended to the per-target list. Wins on conflict (e.g. override -DCMAKE_CUDA_ARCHITECTURES). | unset |
MINGW_TOOLCHAIN_FILE | Operator-supplied cmake toolchain file for windows-* targets. Required for windows-arm64-* cross builds; optional override for windows-x64-* (auto-detected mingw is used otherwise). | unset |
The previous ELIZA_DFLASH_PATCH_METAL_* / ELIZA_DFLASH_PATCH_VULKAN_KERNELS
environment knobs were decorative log toggles for the v0.4.0-eliza-era
no-op patch hooks. They have been removed; both the Metal and Vulkan
patch helpers now run unconditionally on every matching target — Metal
copies the standalone shaders + patches the metallib add_custom_command,
Vulkan copies the standalones into vulkan-shaders/ + applies the
available staging patches.
There is no opt-out, per AGENTS.md §3 ("Required for ALL tiers").
Symbol-only kernels do not satisfy the post-build audit gate.
Wiring these into dflash-server.ts (so --cache-type-k turbo3_tcq
actually runs through the new shader, and so QJL / Polar are reachable
from the CLI) is owned by another agent and depends on the
ggml-metal-ops dispatch work flagged above.
The on-device path consumes
@elizaos/llama-cpp-capacitor
(currently v0.1.5 from npm), an opaque prebuilt framework. The patch at
packages/app-core/patches/[email protected] switches the
plugin to consume a vendored LlamaCpp.xcframework so we can ship a
custom-built static archive against the patched fork.
To produce that static archive, the build script now exposes two iOS targets (compile-only on this machine — they require macOS host with Xcode):
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --target ios-arm64-metal
bun run packages/app-core/scripts/build-llama-cpp-dflash.mjs --target ios-arm64-simulator-metal
Both pass -DGGML_METAL=ON -DCMAKE_SYSTEM_NAME=iOS -DCMAKE_OSX_ARCHITECTURES=arm64
plus -DGGML_METAL_EMBED_LIBRARY=ON so the .metallib ships inside the
static archive. Output lands under
$ELIZA_STATE_DIR/local-inference/bin/dflash/<target>/ as lib*.a files
plus a include/ headers staging directory. A follow-up packaging step
(not in this directory) glues these into the Capacitor xcframework
layout the patch expects.
Fused mobile status is intentionally closed. The dflash build script does
not advertise or build android-*-*-fused, ios-arm64-metal-fused, or
ios-arm64-simulator-metal-fused because the libelizainference mobile FFI
packaging and verifier path are not wired here yet. Those spellings fail
with explicit diagnostics instead of falling through to a generic unsupported
target list. Android system-agent fused artifacts, including emulator
android-x86_64-*-fused, are owned by
packages/app-core/scripts/aosp/compile-libllama.mjs.
Server fused CUDA on arm64 Linux is supported as
linux-aarch64-cuda-fused. It uses the same CUDA fused-attention CMake flags
as linux-x64-cuda-fused and keeps the existing arm64 Linux host gate; build
it on a real aarch64 CUDA runner such as GH200.