Eliza-1 platform matrix — build · verify · bench, one command each

Single reference for every entry in SUPPORTED_TARGETS (packages/app-core/scripts/build-llama-cpp-dflash.mjs). For each target: the one-command build, the one-command kernel verify, the one-command bench, the current status, and the exact prerequisite if it is not done here. The tracked hardware/readiness view is ../../../../docs/eliza-1-pipeline/06-test-matrix.md; the enforceable contract is kernel-contract.json (checked by make -C packages/inference/verify kernel-contract); the bundle plan is ../../../../docs/ELIZA_1_GGUF_PLATFORM_PLAN.json.

Verify status as of 2026-05-12 (post multi-agent wave)

Re-ran the full integration verify matrix on this box (Intel Arrow Lake CPU + Intel ARL/ANV Vulkan + RTX 5080 / sm_120 CUDA, with a full-corpus SFT job holding ~12 GB VRAM concurrently — no OOM contention on the short verify runs):

Target	Result
`make kernel-contract`	PASS — `OK kernels=6 targets=26 manifestNames=6`
`make reference-test`	PASS — C reference clean; `gen_fixture --self-test` finite (fused-attn + TBQ V-cache parity OK)
`make cpu-bench`	PASS (nothing to rebuild; harness in place)
`make cpu-dispatch-smoke`	PASS — `ATTN_SCORE_QJL` + `FUSED_ATTN_QJL_TBQ` MT-vs-ST bit-identical, no NaN; `cpu_qjl_polar_attn_smoke` PASS (NEW) — `flash_attn_ext` over the dequant-hopped QJL1_256 K / Q4_POLAR V path emits 65536 finite outputs (nan=0 inf=0 maxabs=0.0299), guards fork commit `cb700767` / tag `v1.1.1-eliza`
`make vulkan-dispatch-smoke`	PASS — Intel ARL: `GGML_OP_ATTN_SCORE_QJL` 32 outs max 2.7e-7, `GGML_OP_FUSED_ATTN_QJL_TBQ` 512 outs max 4.5e-8
`make vulkan-verify`	PASS — 8/8 (turbo3/turbo4/turbo3_tcq/qjl/polar incl. polar pre-Hadamard, both residual modes)
`make vulkan-verify-multiblock`	PASS — 8/8 across 1/2/4/8 blocks-per-workgroup
`make vulkan-verify-fused`	PASS — 1920/1920 outputs (4 cases) on Intel ARL ANV, max diff ≤ 7.2e-7
`make cuda-verify`	PASS — 8/8 each kernel + 1920/1920 fused on RTX 5080 (sm_120), max diff ≤ 9.5e-6
`make cuda-verify-fused`	PASS — 1920/1920 fused QJL-K/TBQ-V on RTX 5080, max diff 4.47e-7
`gen_fixture --self-test` on cvd-1 (`android-x86_64-cpu`)	PASS — bit-identical to host across all 6 required kernels + fused-attn + TBQ V-cache; cross-compiled with NDK r29 Android Clang 20.0.0, run on live Cuttlefish (cvd 1.53.0) under KVM (`evidence/platform/android-x86_64-cpu.json`). The cross-built fork `llama-server` loads on the cvd with the full Eliza-1 KV-cache-type whitelist (`tbq3_0, tbq4_0, qjl1_256, q4_polar, tbq3_tcq`).
`vulkan_verify` on cvd-1 SwiftShader (`android-x86_64-vulkan`)	DIAGNOSTIC-ONLY — 8/8 SPIR-V fixture cases pass with max diff < 1e-5, but the cvd virtio-gpu ICD is SwiftShader (software, vendor 0x1ae0); per the fail-closed software-ICD rule, this is not recordable runtime-ready evidence (`evidence/platform/android-x86_64-vulkan.json`).

Nothing regressed in this wave. bun run typecheck for packages/app-core is clean; bun test packages/app-core/src/services/local-inference/ is 603 pass / 17 fail where all 17 failures are the known test-isolation flakes (downloader ×6 — passes 7/7 alone — plus cache-restart-corruption / cache-multi-model / cache-thrash / cache-stress shared-mock-state, and the 2 fused llama-server tests that need the fused binary built); …/voice/ is 217/218 + 28/28 green; python3 -m pytest packages/training/scripts/{eval,publish,manifest,wakeword} packages/training/benchmarks is 140 passed / 1 skipped.

Status vocabulary

Status	Meaning
verified-here	A real hardware run happened on the machine that wrote this doc (Intel Arrow Lake / Mesa ANV Linux for CPU + Vulkan; NVIDIA RTX 5080 Mobile / sm_120 for CUDA; Apple M4 Max for Metal/MoltenVK from prior passes; iPhone 15 Pro for the iOS device smoke).
authored-pending-hardware	Source + build plumbing + a fail-closed runner exist; no real run on the matching device class yet.
needs-operator	The build/run needs `sudo` or a toolkit install the agent cannot do. (CUDA 12.8 for native sm_120 SASS is now installed at `/usr/local/cuda-12.8`; the build hook auto-pins it.)
needs-bigger-box	The build itself OOMs / is too slow on the 31 GB / 24-core dev box (the CUDA-fused build is ~30 GB peak RAM, ~2 h); use the cloud runner (`packages/app-core/scripts/cloud/run-on-cloud.sh`, or `packages/training/scripts/cloud/`).

How the three "one commands" map

Build: node packages/app-core/scripts/build-llama-cpp-dflash.mjs --target <triple> (prepend ELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 while the structured-output server patch is still being fixed; the iOS targets emit a .a for the xcframework patch, the -fused targets emit libelizainference + the fused server, everything else emits llama-server + llama-cli + llama-speculative-simple + llama-bench + llama-completion). The fork build (packages/inference/llama.cpp submodule, or the ~/.cache/eliza-dflash clone) git reset --hards on each run — do source edits first, build last, retry on clobber. Serialize fork builds; never two CUDA builds at once on the 31 GB box.
Kernel verify (synthetic fixtures, fast — minutes): make -C packages/inference/verify <backend>-verify (metal-verify / vulkan-verify / cuda-verify; add -multiblock / -fused for the extra coverage). These are the AGENTS.md §8 8/8-PASS gates. They do not need the bundle bytes.
Built-fork graph dispatch (proves a real llama.cpp graph route selects the kernel): make -C packages/inference/verify vulkan-dispatch-smoke (Vulkan), metal dispatch-smoke (Metal), the C++ vulkan_dispatch_smoke / dispatch_smoke.mm harnesses. CUDA's equivalent is cuda-verify (fixture-parity __device__ kernels) + the cuda_runner.sh graph smoke (now via llama-bench / llama-completion, not llama-cli — the fork's llama-cli is conversation-only and busy-loops on stdin EOF).
Bench: make -C packages/inference/verify <backend>-bench (metal-bench / vulkan-bench / cpu-bench) for the standalone-kernel perf harness; llama-bench -m <gguf> -ngl 99 -p … -n … -fa 1 --cache-type-k … for the model-graph throughput (the verify runners do this); verify/e2e_loop_bench.mjs / verify/thirty_turn_endurance_harness.mjs for the end-to-end voice loop. The fork ships llama-bench + llama-completion next to llama-server (as of this commit), so the bench path exists on every built target.

CPU baseline — runnable anywhere

Target	Build	Kernel verify	Bench	Status	Prereq if not done
`linux-x64-cpu`	`node …/build-llama-cpp-dflash.mjs --target linux-x64-cpu`	`make -C …/verify reference-test` (C-reference round-trip); the CPU score/decode ops are the C references themselves; `make cpu-dispatch-smoke` (graph picks `GGML_OP_ATTN_SCORE_QJL` + `GGML_OP_FUSED_ATTN_QJL_TBQ` on the CPU backend and asserts MT-vs-ST bit-identical, no NaN — `verify/qjl_mt_check.c`; `CPU_BIN_DIR`/`GGML_INC_DIR` default to the in-repo fork build/checkout, no env vars)	`make -C …/verify cpu-bench cpu-simd-bench`; `llama-bench` on the staged text GGUF	verified-here (`reference-test` clean; `cpu-dispatch-smoke` PASS — MT-vs-ST bit-identical; AVX-VNNI int8-QJL 5.25× / fp32-QJL LUT-gather ~2.5–8× — `bench_results/cpu_avxvnni_2026-05-11.json`, `bench_results/cpu_kopt_2026-05-11.json`). `kernel-contract.json` `runtimeStatus.cpu` = `runtime-ready` for `qjl` + `fusedAttn` (`verify/cpu-runtime-dispatch-evidence.json`); `reference-only` for TBQ/Polar standalone score (no public CPU graph op — validated by `reference-test`). The §3 CPU kernel-completeness build gate still fails by design (turbo3_tcq/polarquant not CPU-buildable).	verify-on-device against the staged bundle bytes (`verifyBundleOnDevice`); wire `probeKernels()` to read `cpu-runtime-dispatch-evidence.json` so a fresh `linux-x64-cpu` build's `CAPABILITIES.json` reports `qjl_full` runtime-ready.
`linux-aarch64-cpu`	`--target linux-aarch64-cpu` (needs an arm64 Linux host or a sysroot+cross-toolchain — no aarch64-cross wiring on x64 here)	`make reference-test` + `cpu-dispatch-smoke` on the arm64 host	`cpu-bench cpu-simd-bench` (NEON dotprod paths)	authored-pending-hardware	An arm64 Linux box (Ampere Altra / Graviton / Snapdragon-Linux).
`windows-x64-cpu`	`--target windows-x64-cpu` (mingw cross-build)	`pwsh -File verify/windows_runner.ps1 -Backend cpu -Model C:\models\eliza-1-smoke.gguf` on a real Windows box (now drives `llama-bench` + `llama-completion`, not `llama-cli`)	`windows_runner.ps1` (above)	authored-pending-hardware (cross-built exe is not counted)	A native Windows x64 host.
`windows-arm64-cpu`	`--target windows-arm64-cpu` (needs an MSVC arm64 cross-toolchain or a native Windows-arm64 host — no mingw arm64 wiring here)	`windows_runner.ps1 -Backend cpu` on a Snapdragon X box	`windows_runner.ps1`	authored-pending-hardware	A Snapdragon X Elite / Copilot+ PC.
`android-arm64-cpu`	`node packages/app-core/scripts/aosp/compile-libllama.mjs` (NDK cross-build)	CPU/NEON parity via `adb` on a physical Android device	`adb`-pushed `cpu_bench` / `llama-bench`	authored-pending-hardware	A physical Android device + NDK.
`android-x86_64-cpu`	`ANDROID_NDK_HOME=… node …/build-llama-cpp-dflash.mjs --target android-x86_64-cpu` (NDK cross-build, `-DANDROID_ABI=x86_64`, forces AVX/AVX2/FMA/F16C — the x86_64 Android ABI baseline is SSE4.2; the QJL/Polar CPU kernels need AVX2)	(a) kernel C-reference parity on cvd (NEW 2026-05-12, verified-here): cross-compile `gen_fixture` with NDK r29 (Android Clang 20.0.0, target x86_64-unknown-linux-android24) and `adb push` to cvd-1; run `gen_fixture_android_x86_64 --self-test` → bit-identical to host across all six required kernels + fused-attn + tbq V-cache parity (`turbo3=-2.501480 / turbo4=-23.721790 / turbo3_tcq=-4.822659 / qjl=3.696591 / polar=-1.994053 / polar_qjl=-1.438744`); the cross-built fork `llama-server` loads on the cvd and exposes the full Eliza-1 KV-cache-type whitelist (tbq3_0, tbq4_0, qjl1_256, q4_polar, tbq3_tcq) with banner `built with Clang 20.0.0 for Android x86_64`, fork commit `536ff214`. Evidence: `evidence/platform/android-x86_64-cpu.json`. (b) The orthogonal 8-step Cuttlefish (`cvd`) chat-completion smoke `node packages/app-core/scripts/aosp/smoke-cuttlefish.mjs`: 5/6 infra steps PASS on the live cvd (cvd reachable, APK installed abi=x86_64, ElizaAgentService start, /api/health agentState=running runtime=ok, bearer token); step 6 chat completion failed — no model staged in the release APK on that cvd. See `../reports/porting/2026-05-12/cuttlefish-x86_64-smoke.md`.	`adb`-pushed `cpu_bench` / `llama-bench`; `e2e_loop_bench.mjs` on the cvd	kernel C-reference parity verified-here on Cuttlefish + build verified-here (real x86_64 Android ELF — `interpreter /system/bin/linker64` — + libs, fork commit `536ff214`; `CAPABILITIES.json` `qjl_full`/`polarquant` true), Cuttlefish cvd 8-step infra smoke 5/6 PASS. `kernel-contract.json` `platformTargets.android-x86_64-cpu` = `runtime-ready` / `runtime-ready` / `verified`.	A `build-aosp.mjs --launch` rebuild staging the new `android-x86_64-cpu` libllama + a bundled eliza-1-smoke GGUF in the privileged APK → 8/8 chat smoke; Vulkan-on-cvd is gfxstream/SwiftShader (software → not recordable).
`android-x86_64-vulkan`	`…/build-llama-cpp-dflash.mjs --target android-x86_64-vulkan` (NDK + Vulkan headers, `-DANDROID_ABI=x86_64`)	standalone `vulkan_verify` fixtures pass on the host ANV iGPU AND on the cvd SwiftShader ICD (8/8 cases, max diff < 1e-5, `evidence/platform/android-x86_64-vulkan.json`) — DIAGNOSTIC-ONLY under cvd because SwiftShader is software (vendor 0x1ae0 LLVM 16); graph dispatch needs real ChromeOS x86_64 GPU (Adreno/Mali under ARCVM) — cvd virtio-gpu Vulkan is gfxstream/SwiftShader (software → no recordable evidence per fail-closed)	`adb`-pushed `vulkan_bench`	authored-pending-hardware for graph dispatch (ChromeOS GPU); cross-build + SPIR-V fixture pass + Android Vulkan loader path validated under SwiftShader on cvd	Real ChromeOS x86_64 GPU silicon (Adreno/Mali under ARCVM) or a passed-through host GPU via crosvm gfxstream + real-GPU host.
`linux-x64-cpu-fused`	`ELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 …/build-llama-cpp-dflash.mjs --target linux-x64-cpu-fused`	`OMNIVOICE_FUSE_VERIFY.json` `ok=true` + `verifyFusedSymbols` (abi/omnivoice/llama-reexport counts); `llama-server --cache-type-k qjl1_256 --cache-type-v q4_polar` boots healthy, `/completion` returns tokens (NEW 2026-05-12, fork commit `cb700767` / tag `v1.1.1-eliza`) — guarded by `make cpu-dispatch-smoke`'s new `cpu_qjl_polar_attn_smoke` regression test	`dflash-server-fused.integration.test.ts` (spawns the fused `llama-server`, hits `/completion` + `/v1/audio/speech` same-PID); `llama-bench`/`llama-completion` for text	verified-here for the merged HTTP route + symbol-verify + the QJL/Polar KV-cache warmup-no-segfault path; exit-1 is the §3 CPU-backend kernel-completeness gate (turbo3_tcq/qjl_full/polarquant aren't CPU-graph-dispatch caps), `CAPABILITIES.json` `publishable: false`.	A weight-backed `/v1/audio/speech` smoke against a real `tts/omnivoice-*.gguf` (the dev stand-in bundle has no `tts/`); voice:duet end-to-end (the QJL/Polar segfault is fixed; remaining duet block is an SQL/embeddings-dim runtime-bootstrap concern, separate from the kernel).

CUDA — verified-here on the RTX 5080 (sm_120, native SASS via CUDA 12.8)

Target	Build	Kernel verify	Bench	Status	Prereq if not done
`linux-x64-cuda`	`ELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 …/build-llama-cpp-dflash.mjs --target linux-x64-cuda` (~1.5–2 h, ~30 GB peak — serialize; check `free -m`/`uptime` first). CUDA 12.8 is installed at `/usr/local/cuda-12.8`; pass `CUDACXX=/usr/local/cuda-12.8/bin/nvcc PATH=/usr/local/cuda-12.8/bin:$PATH` so the build hook's `cudaArchListFlag()` sees the 12.8 nvcc and appends `100;120` to the arch list (the system `/usr/bin/nvcc` is 12.0 and would silently downgrade to `80;86;89;90;90a`). Full integration build now installed (2026-05-12) — `~/.eliza/local-inference/bin/dflash/linux-x64-cuda/`, forkCommit `a61c93aaa5` (v1.2.0-eliza), builtAt `2026-05-12T17:16:58Z`, `libggml-cuda.so.0.9.7` 473 MB (real sm_120a SASS), all binaries (`llama-bench`/`llama-cli`/`llama-completion`/`llama-server`/`llama-speculative-simple`) ldd-clean via `$ORIGIN` rpath.	`make -C …/verify cuda-verify cuda-verify-fused` (self-contained nvcc compile of `cuda_verify.cu`; 8/8 + 1920/1920 PASS on the RTX 5080, max diff ≤ 9.5e-6 / 4.47e-7; `cuda-verify-fused` exercises the warp-cooperative kernel mirroring the production `cuda/fused-attn-qjl-tbq.cu`; the harness builds a native `sm_120.cubin` under 12.8). Re-verified 2026-05-12 against the installed real-build.	`verify/cuda_runner.sh --report …` (builds the fork, `cuda-verify`, then `runtime_graph_smoke.sh --gen-check` → `llama-bench --cache-type-k tbq3_0 -ngl 99` + `llama-completion`); `bench_results/cuda_e2e_2026-05-11.json` (text pp ~2.3–6.7k t/s, tg ~40–55 t/s; ASR `eliza-1-asr.gguf` → arch `qwen3vl 1.7B`, pp16 ~1023, pp128 ~4561, tg32 ~62 t/s); nsys: DP4A `qjl_score_dp4a_kernel` ~2.27× faster than fp32 `qjl_score_kernel`. New `llama-bench` numbers (2026-05-12, real install, freed GPU): eliza-1-0_6b bundle pp512/tg128 d=0 19932 / 345.5 t/s, d=16000 1956 / 108.5 t/s; eliza-1-1_7b bundle pp512/tg128 d=0 11931 / 194.7 t/s, d=16000 1797 / 84.9 t/s; base Qwen3-0.6B-Q8_0 d=0 20979 / 356 t/s, base Qwen3-1.7B-Q8_0 d=0 12414 / 159 t/s. `llama-server` smoke verified: 4 GPU slots, `/health → ok`, `POST /completion` 32-token decode at 420.57 tps decode / 1092.66 tps prefill on the 0_6b bundle.	verified-here for `cuda-verify` / `cuda-verify-fused` / full ggml-cuda integration build (NEW 2026-05-12) / text + ASR `llama-bench` / `llama-server /completion` / native-`sm_120`-SASS-compile. `kernel-contract.json` `runtimeStatus.cuda` + `fusedAttn.runtimeStatus.cuda` = `runtime-ready`. `runtime_graph_smoke.sh --gen-check` errors with "no cache-type alias for turbo3" — expected on the non-fused build (`CAPABILITIES.json` reports `dflash: false`, `missingRequiredKernels: ["dflash"]`); the cache-type aliases are added by the `dflash`/fused-build patch path.	verify-on-device against the staged bundle bytes (`verifyBundleOnDevice`); wire `probeKernels()` to read the install's `CAPABILITIES.json`.
`linux-x64-cuda-fused`	`ELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 CUDACXX=/usr/local/cuda-12.8/bin/nvcc PATH=/usr/local/cuda-12.8/bin:$PATH …/build-llama-cpp-dflash.mjs --target linux-x64-cuda-fused --jobs 3` — the big build: full ggml-cuda + the omnivoice-core graft, ~30 GB peak RAM under `-j 6` (the 31 GB dev box OOM-killed the parent build script under `-j 6` mid-build at the `fattn.cu` long-pole). Built 2026-05-12 on this box at `-j 3` from a clean dir against fork commit `a61c93aaa5` (v1.0.0-eliza) + omnivoice pin `38f824023d12`. `OMNIVOICE_FUSE_VERIFY.json` `ok=true` (llama=0, omnivoice=10, abi=23 symbols).	`cuda-verify cuda-verify-fused` + `OMNIVOICE_FUSE_VERIFY.json` — 1920/1920 PASS on RTX 5080 sm_120, max diff 5.07e-07 (`logs/cuda-verify-fused-fusedbuild-rtx5080-2026-05-12.log`). `make cuda-hardware` against the install: 6/6 fixture-set PASS (`logs/cuda-hardware-fusedbuild-rtx5080-2026-05-12.log`); graph-smoke gated on `llama-bench` (not in fused-target list — non-blocking tooling gap).	`verify/e2e_loop_bench.mjs --backend cuda --tier 0_6b --turns 1` against the fused install — voice_rtf 0.4255 (PASS ≤ 0.5), tg 64.82 tok/s, first_token 43.3 ms, dflash 12/12, peak RSS 2340 MB. `packages/inference/reports/porting/2026-05-12/e2e-loop-cuda-2026-05-12.json`.	verified-here on RTX 5080 Laptop (sm_120, CUDA 12.8) — `CAPABILITIES.json` reports `publishable: true`, `missingRequiredKernels: []`, `dflash + turbo3 + turbo4 + turbo3_tcq + qjl_full + polarquant + lookahead + ngramDraft = all true`.	Re-run on additional sm classes (sm_89 / sm_90 / sm_100 datacenter) to confirm no arch regression in the CMAKE_CUDA_ARCHITECTURES list.
`linux-aarch64-cuda`	`--target linux-aarch64-cuda` on an arm64 Linux + Hopper/Blackwell host (GH200 = aarch64 host + H100/H200/GB200 GPU)	`make cuda-verify cuda-verify-fused` on that host; `verify/gh200_runner.sh --report …` (refuses non-aarch64 / non-Hopper-9.x)	`gh200_runner.sh`; `llama-bench` on the `27b-256k` / `27b-256k` tier GGUFs	authored-pending-hardware	A GH200 / H100-aarch64 / GB200 host. Use the cloud runner.
`windows-x64-cuda`	`--target windows-x64-cuda` (MSVC + CUDA Toolkit on Windows)	`pwsh -File verify/windows_runner.ps1 -Backend cuda -Model C:\models\eliza-1-smoke.gguf` on NVIDIA hardware (drives `llama-bench` + `llama-completion`)	`windows_runner.ps1` (above)	authored-pending-hardware (cross-built exe not counted)	A native Windows + NVIDIA box.
`windows-x64-cuda-fused`	`--target windows-x64-cuda-fused`	`windows_runner.ps1 -Backend cuda` + `OMNIVOICE_FUSE_VERIFY.json`	the fused Windows `llama-server`'s `/v1/audio/speech`	authored-pending-hardware	The Windows-CUDA hardware runner first, then the fused build on that host.

Vulkan — verified-here on Intel Arc/Xe Mesa ANV (single device class)

Target	Build	Kernel verify	Bench	Status	Prereq if not done
`linux-x64-vulkan`	`…/build-llama-cpp-dflash.mjs --target linux-x64-vulkan`	`make -C …/verify vulkan-verify vulkan-verify-multiblock vulkan-verify-fused` (8/8 + 8/8 + 1920/1920 PASS on Intel ANV, max diff ≤ 7.6e-6 / 6.3e-7); `make vulkan-native-smoke` / `vulkan-dispatch-smoke` (built-fork graph routes PASS on Intel ARL/ANV — the harness drives the two fused attention ops the fork pin declares in `ggml.h`: `GGML_OP_ATTN_SCORE_QJL` 32 outs max 2.7e-7 + `GGML_OP_FUSED_ATTN_QJL_TBQ` 512 outs max 4.5e-8 — `vulkan-runtime-dispatch-evidence.json` + `hardware-results/linux-vulkan-smoke-*.log`. The standalone TBQ/Polar score kernels are covered by `vulkan-verify`; their built-fork graph entries in the evidence file are from a prior full-patched-build run.)	`make vulkan-bench`; `llama-bench -ngl 99` (the dispatch smoke does this)	verified-here on Intel ARL/ANV — `kernel-contract.json` `runtimeStatus.vulkan` = `runtime-ready` for the 5 score kernels + fused_attn. Single Intel-ANV device class.	Native AMD (RADV) and NVIDIA-desktop Vulkan graph dispatch; verify-on-device against the staged bundle bytes.
`linux-x64-vulkan-fused`	`…/build-llama-cpp-dflash.mjs --target linux-x64-vulkan-fused --jobs 4` — Vulkan ggml + the omnivoice-core graft (much lighter than the CUDA-fused build — ~3 min on this box vs. ~85 min for CUDA-fused). Built 2026-05-12 from a clean dir against fork commit `a61c93aaa5` + omnivoice pin `38f824023d12`. `OMNIVOICE_FUSE_VERIFY.json` `ok=true` (llama=0, omnivoice=10, abi=23). Fused SPIR-V baked into `libggml-vulkan.so` (`eliza_fused_attn_qjl_tbq_data` + `eliza_fused_attn_qjl_polar_data` symbols, _len pair).	`vulkan-verify vulkan-verify-fused` + `OMNIVOICE_FUSE_VERIFY.json` — all PASS on Intel ARL iGPU (Mesa ANV 25.2.8): `fused_attn_qjl_tbq` 1920/1920 + 1536/1536 causal + `fused_attn_qjl_polar` 1920/1920 + 1536/1536 causal = 6912 outputs total, max diff 6.26e-07 (`logs/vulkan-verify-fused-fusedbuild-anv-2026-05-12.log`).	`verify/e2e_loop_bench.mjs --backend vulkan --tier 0_6b --turns 1` against the fused install — runs end-to-end (`e2eOk: true`, dflash 31/31), but iGPU performance keeps voice_rtf at 1.7269 (FAIL ≤ 0.5; the gate targets discrete-GPU class). tg 12.13 tok/s, first_token 493 ms, peak RSS 1370 MB. `packages/inference/reports/porting/2026-05-12/e2e-loop-vulkan-2026-05-12.json`.	verified-here for kernel parity + e2e functionality on Intel ARL iGPU; `CAPABILITIES.json` `publishable: true`, `missingRequiredKernels: []`. Publish-gate (`voice_rtf ≤ 0.5`) FAIL on iGPU — gate is a discrete-GPU target.	Re-run `e2e_loop_bench` on a discrete Vulkan-mode card (RDNA3 / Ada in pure-Vulkan / Intel BMG) for a Vulkan voice-rtf number under the discrete-GPU class.
`windows-x64-vulkan`	`--target windows-x64-vulkan` (mingw + Khronos Vulkan-Headers cross-build)	`pwsh -File verify/windows_runner.ps1 -Backend vulkan -Model C:\models\eliza-1-smoke.gguf` on native Windows Vulkan	`windows_runner.ps1`	authored-pending-hardware	A native Windows + GPU box.
`windows-arm64-vulkan`	`--target windows-arm64-vulkan` (MSVC arm64 cross-toolchain)	`windows_runner.ps1 -Backend vulkan` on a Snapdragon X box (Adreno X1 = Vulkan 1.3)	`windows_runner.ps1`	authored-pending-hardware	A Snapdragon X Elite / Copilot+ PC.
`android-arm64-vulkan`	`node packages/app-core/scripts/aosp/compile-libllama.mjs` (NDK cross-build)	`make -C …/verify android-vulkan-smoke` — standalone fixtures 6/6 PASS on Pixel 6a / Mali-G78 (`hardware-results/android-vulkan-smoke-*.log`); built-fork graph dispatch evidence (`ELIZA_ANDROID_VULKAN_GRAPH_EVIDENCE`) still open; Adreno not yet run	`adb`-pushed `vulkan_bench` / `llama-bench`	authored-pending-hardware for graph dispatch (standalone fixtures verified-here on Mali)	A built-fork/app graph-dispatch report on one Adreno + one Mali device.

Metal / Apple — verified-here on Apple M4 Max (prior passes)

Target	Build	Kernel verify	Bench	Status	Prereq if not done
`darwin-arm64-metal`	`…/build-llama-cpp-dflash.mjs --target darwin-arm64-metal` (macOS host, builds + embeds `default.metallib`)	`make -C …/verify metal-verify metal-verify-multiblock` (8/8 + 8/8 on M4 Max); `make dispatch-smoke` (built-fork graph dispatch for `GGML_OP_ATTN_SCORE_{QJL,TBQ×3,POLAR}` + pre-Hadamard Polar — all PASS); `metal-verify-fused` fails by design (no `metal/fused_attn*.metal` `cases`-array path in `metal_verify` yet — see `../reports/porting/2026-05-11/metal-fused-attn-and-polar-preht-design.md`)	`make metal-bench metal-bench-batched metal-bench-multiblock`; `llama-bench -ngl 99`	verified-here on M4 Max — `kernel-contract.json` `runtimeStatus.metal` = `runtime-ready` for the 5 score kernels; `darwin-arm64-metal` passes the build-script capability gate.	Full text+DFlash+voice latency/RSS/thermal gates against a staged Eliza-1 bundle; `metal_verify` `cases`-array path for `metal-verify-fused`.
`darwin-arm64-metal-fused`	`…/build-llama-cpp-dflash.mjs --target darwin-arm64-metal-fused --jobs 10` — links `omnivoice-core` + `libelizainference.dylib` + `llama-omnivoice-server` + `libmtmd` + `default.metallib`; `verify-symbols.mjs` (`omnivoice=10 abi=8`)	`metal-verify metal-verify-multiblock dispatch-smoke` (same as above) + `verify-symbols.mjs`	Bun FFI smoke against `~/.eliza/local-inference/models/eliza-1-1_7b.bundle` (loads real OmniVoice Q4_K_M + Qwen3-ASR for TTS + ASR — `reports/local-e2e/2026-05-11/fused-voice-ffi-smoke.json`); `e2e_loop_bench.mjs`	verified-here on macOS Metal for the fused dylib FFI smoke (real GGUF-backed TTS + ASR in one fused process)	Built-fork graph-dispatch smoke + full latency/RSS/thermal gates; the fused `llama-server` route on macOS (currently the macOS evidence is the FFI path, not the HTTP route).
`ios-arm64-metal`	`…/build-llama-cpp-dflash.mjs --target ios-arm64-metal` (macOS+Xcode, emits `.a` + headers + `default.metallib` → `build-xcframework.mjs --verify` glues the `LlamaCpp.xcframework`)	`build-xcframework.mjs --verify` (kernel-symbol + runtime-symbol + structure audits — PASS); `run-physical-device-smoke.mjs` (3/3 XCTest cases PASS on iPhone 15 Pro / iOS 26.3.1, `--skip-voice-abi=false` — `hardware-results/ios-device-smoke-2026-05-11.json`)	the iOS XCTest harness; (no `llama-bench` on iOS — the runtime is the static lib + `eliza_inference_*` ABI)	verified-here on iPhone 15 Pro for the symbol/structure audits + the runtime-symbol XCTest; the §3 P0 blocker is a weight-backed Eliza-1 bundle smoke from the Capacitor app shell (first token / first audio / peak RSS / thermal).	A real Eliza-1 bundle smoke from the Capacitor app shell.
`ios-arm64-simulator-metal`	`…/build-llama-cpp-dflash.mjs --target ios-arm64-simulator-metal`	`build-xcframework.mjs --verify` (simulator slice — PASS); simulator smoke against the embedded metallib + `GGML_OP_ATTN_SCORE_TBQ` Turbo4 route	the iOS simulator XCTest	authored-pending-hardware (symbol/structure audits pass; no simulator weight-backed run)	Simulator smoke against the embedded metallib.

(darwin-x64-metal is not a supported target — Apple Silicon darwin-arm64-metal only.)

ROCm — runner exists, no AMD host here

Target	Build	Kernel verify	Bench	Status	Prereq if not done
`linux-x64-rocm`	`…/build-llama-cpp-dflash.mjs --target linux-x64-rocm` (needs `hipcc` + ROCm)	`make -C …/verify hip-verify` — the standalone fixture-parity harness (NEW this wave): `hip_verify.cu` is a thin shim that `#include`s `cuda_verify.cu` (which now guards its backend headers on `__HIP_PLATFORM_AMD__` and aliases the `cuda` runtime calls to `hip`), so it runs the EXACT same ~25 device kernels + fixture loader + reference cross-check the NVIDIA `cuda-verify` does, compiled by `hipcc` against a `gfx` GPU. Plus `verify/rocm_runner.sh --report …` (refuses without `hipcc` + `rocminfo` `gfx` agent + a smoke GGUF; builds the fork, then `runtime_graph_smoke.sh --gen-check` → `llama-bench` + `llama-completion` on the HIP backend).	`make hip-verify`; `rocm_runner.sh`; `llama-bench -ngl 99` on the HIP backend	authored-pending-hardware — `hip_verify.cu` + the `hip-verify` Makefile target are authored + buildable (no `hipcc` on the authoring box → clean "install ROCm / see rocm_runner.sh" message); the fork's production `.cu` kernels (turboquant.cuh/qjl.cu/polarquant.cu/turbo-tcq.cu) are not yet `__HIP_PLATFORM_AMD__`-clean — until that lands the ROCm runtime story is the `hip-verify` numeric gate + the documented reduced-optimization local mode (`ELIZA_LOCAL_ALLOW_STOCK_KV=1`, loud warning, not publishable) for production inference.	An AMD ROCm host (RDNA2/RDNA3 or CDNA, `gfx*` agent — e.g. a vast.ai MI300 box).

Quick "one command for everything I can run here" line

bash

# From the repo root, on this box (CPU + Intel-ANV Vulkan + RTX 5080 CUDA):
make -C packages/inference/verify kernel-contract reference-test cuda-verify cuda-verify-fused
make -C packages/inference/verify vulkan-verify vulkan-verify-multiblock vulkan-verify-fused
make -C packages/inference/verify vulkan-dispatch-smoke   # built-fork Vulkan graph routes (needs the linux-x64-vulkan build)
# Bench (CUDA text + ASR, RTX 5080):
~/.cache/eliza-dflash/milady-llama-cpp/build-cuda/bin/llama-bench \
  -m ~/.eliza/local-inference/models/eliza-1-1_7b.bundle/text/eliza-1-1_7b-32k.gguf -ngl 99 -p 16,512 -n 32 -fa 1

Not in `SUPPORTED_TARGETS` — runtime-side / explicitly-out-of-scope notes

MLX (`mlx_lm.server`) — REMOVED

The Apple-Silicon mlx_lm.server spawn-and-route path was removed in commit 20d50d7553 (P1 consolidation). It violated the local-inference invariant of no subprocesses + no TCP loopback. No production callsite ever invoked it, and MLX_IN_PROCESS_PLAN.md documents the in-process unblock plan if MLX becomes a real requirement. The stub mlx-server.ts itself has been deleted; see services/index.ts for the cleaned export surface.

TPU / NPU — not a target this wave (verdict, documented)

No. The eliza-1 text backbone (0.6B smallest, fp16/Q4) does not fit a Coral Edge TPU's 8 MB on-chip SRAM, isn't int8-only quantizable to the Coral's constraints, and KV-cache attention is not an Edge-TPU workload. The Pixel Tensor TPU could in principle run a small int8 transformer but there is no public delegate API to target it from a third-party app, and NNAPI is deprecated by Google in favour of per-vendor delegates. The Android GPU (Mali/Adreno via Vulkan) is the right on-device accelerator for the text model — which the android-arm64-vulkan / android-x86_64-vulkan targets already cover. The sidecars (Silero VAD, Qwen3-ASR-0.6B, Qwen3-Embedding-0.6B) don't win enough on an NPU to justify the conversion work, and OmniVoice TTS is fused into the llama.cpp build (one GGML pin) — pulling it onto a separate NPU breaks the fusion contract (§4: one process, one build). The one open angle: a ELIZA_VAD_QNN_DELEGATE=1 flag that, when onnxruntime-mobile is built with the Qualcomm QNN EP, runs Silero VAD on the Hexagon NPU island while the CPU sleeps — that is a battery optimization for always-listening wake-word mode, not a latency one, and is a stretch, not core. No plugin-coral / plugin-qnn is added.

Eliza-1 platform matrix — build · verify · bench, one command each

Eliza-1 platform matrix — build · verify · bench, one command each

Verify status as of 2026-05-12 (post multi-agent wave)

Status vocabulary

How the three "one commands" map

CPU baseline — runnable anywhere

CUDA — verified-here on the RTX 5080 (sm_120, native SASS via CUDA 12.8)

Vulkan — verified-here on Intel Arc/Xe Mesa ANV (single device class)

Metal / Apple — verified-here on Apple M4 Max (prior passes)

ROCm — runner exists, no AMD host here

Quick "one command for everything I can run here" line

Not in SUPPORTED_TARGETS — runtime-side / explicitly-out-of-scope notes

MLX (mlx_lm.server) — REMOVED

TPU / NPU — not a target this wave (verdict, documented)

Not in `SUPPORTED_TARGETS` — runtime-side / explicitly-out-of-scope notes

MLX (`mlx_lm.server`) — REMOVED