Back to Eliza

Eliza-1 platform matrix — build · verify · bench, one command each

plugins/plugin-local-inference/native/verify/PLATFORM_MATRIX.md

2.0.131.1 KB
Original Source

Eliza-1 platform matrix — build · verify · bench, one command each

Single reference for every entry in SUPPORTED_TARGETS (packages/app-core/scripts/build-llama-cpp-dflash.mjs). For each target: the one-command build, the one-command kernel verify, the one-command bench, the current status, and the exact prerequisite if it is not done here. The tracked hardware/readiness view is ../../../../docs/eliza-1-pipeline/06-test-matrix.md; the enforceable contract is kernel-contract.json (checked by make -C packages/inference/verify kernel-contract); the bundle plan is ../../../../docs/ELIZA_1_GGUF_PLATFORM_PLAN.json.

Verify status as of 2026-05-12 (post multi-agent wave)

Re-ran the full integration verify matrix on this box (Intel Arrow Lake CPU + Intel ARL/ANV Vulkan + RTX 5080 / sm_120 CUDA, with a full-corpus SFT job holding ~12 GB VRAM concurrently — no OOM contention on the short verify runs):

TargetResult
make kernel-contractPASS — OK kernels=6 targets=26 manifestNames=6
make reference-testPASS — C reference clean; gen_fixture --self-test finite (fused-attn + TBQ V-cache parity OK)
make cpu-benchPASS (nothing to rebuild; harness in place)
make cpu-dispatch-smokePASS — ATTN_SCORE_QJL + FUSED_ATTN_QJL_TBQ MT-vs-ST bit-identical, no NaN; cpu_qjl_polar_attn_smoke PASS (NEW)flash_attn_ext over the dequant-hopped QJL1_256 K / Q4_POLAR V path emits 65536 finite outputs (nan=0 inf=0 maxabs=0.0299), guards fork commit cb700767 / tag v1.1.1-eliza
make vulkan-dispatch-smokePASS — Intel ARL: GGML_OP_ATTN_SCORE_QJL 32 outs max 2.7e-7, GGML_OP_FUSED_ATTN_QJL_TBQ 512 outs max 4.5e-8
make vulkan-verifyPASS — 8/8 (turbo3/turbo4/turbo3_tcq/qjl/polar incl. polar pre-Hadamard, both residual modes)
make vulkan-verify-multiblockPASS — 8/8 across 1/2/4/8 blocks-per-workgroup
make vulkan-verify-fusedPASS — 1920/1920 outputs (4 cases) on Intel ARL ANV, max diff ≤ 7.2e-7
make cuda-verifyPASS — 8/8 each kernel + 1920/1920 fused on RTX 5080 (sm_120), max diff ≤ 9.5e-6
make cuda-verify-fusedPASS — 1920/1920 fused QJL-K/TBQ-V on RTX 5080, max diff 4.47e-7
gen_fixture --self-test on cvd-1 (android-x86_64-cpu)PASS — bit-identical to host across all 6 required kernels + fused-attn + TBQ V-cache; cross-compiled with NDK r29 Android Clang 20.0.0, run on live Cuttlefish (cvd 1.53.0) under KVM (evidence/platform/android-x86_64-cpu.json). The cross-built fork llama-server loads on the cvd with the full Eliza-1 KV-cache-type whitelist (tbq3_0, tbq4_0, qjl1_256, q4_polar, tbq3_tcq).
vulkan_verify on cvd-1 SwiftShader (android-x86_64-vulkan)DIAGNOSTIC-ONLY — 8/8 SPIR-V fixture cases pass with max diff < 1e-5, but the cvd virtio-gpu ICD is SwiftShader (software, vendor 0x1ae0); per the fail-closed software-ICD rule, this is not recordable runtime-ready evidence (evidence/platform/android-x86_64-vulkan.json).

Nothing regressed in this wave. bun run typecheck for packages/app-core is clean; bun test packages/app-core/src/services/local-inference/ is 603 pass / 17 fail where all 17 failures are the known test-isolation flakes (downloader ×6 — passes 7/7 alone — plus cache-restart-corruption / cache-multi-model / cache-thrash / cache-stress shared-mock-state, and the 2 fused llama-server tests that need the fused binary built); …/voice/ is 217/218 + 28/28 green; python3 -m pytest packages/training/scripts/{eval,publish,manifest,wakeword} packages/training/benchmarks is 140 passed / 1 skipped.

Status vocabulary

StatusMeaning
verified-hereA real hardware run happened on the machine that wrote this doc (Intel Arrow Lake / Mesa ANV Linux for CPU + Vulkan; NVIDIA RTX 5080 Mobile / sm_120 for CUDA; Apple M4 Max for Metal/MoltenVK from prior passes; iPhone 15 Pro for the iOS device smoke).
authored-pending-hardwareSource + build plumbing + a fail-closed runner exist; no real run on the matching device class yet.
needs-operatorThe build/run needs sudo or a toolkit install the agent cannot do. (CUDA 12.8 for native sm_120 SASS is now installed at /usr/local/cuda-12.8; the build hook auto-pins it.)
needs-bigger-boxThe build itself OOMs / is too slow on the 31 GB / 24-core dev box (the CUDA-fused build is ~30 GB peak RAM, ~2 h); use the cloud runner (packages/app-core/scripts/cloud/run-on-cloud.sh, or packages/training/scripts/cloud/).

How the three "one commands" map

  • Build: node packages/app-core/scripts/build-llama-cpp-dflash.mjs --target <triple> (prepend ELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 while the structured-output server patch is still being fixed; the iOS targets emit a .a for the xcframework patch, the -fused targets emit libelizainference + the fused server, everything else emits llama-server + llama-cli + llama-speculative-simple + llama-bench + llama-completion). The fork build (packages/inference/llama.cpp submodule, or the ~/.cache/eliza-dflash clone) git reset --hards on each run — do source edits first, build last, retry on clobber. Serialize fork builds; never two CUDA builds at once on the 31 GB box.
  • Kernel verify (synthetic fixtures, fast — minutes): make -C packages/inference/verify <backend>-verify (metal-verify / vulkan-verify / cuda-verify; add -multiblock / -fused for the extra coverage). These are the AGENTS.md §8 8/8-PASS gates. They do not need the bundle bytes.
  • Built-fork graph dispatch (proves a real llama.cpp graph route selects the kernel): make -C packages/inference/verify vulkan-dispatch-smoke (Vulkan), metal dispatch-smoke (Metal), the C++ vulkan_dispatch_smoke / dispatch_smoke.mm harnesses. CUDA's equivalent is cuda-verify (fixture-parity __device__ kernels) + the cuda_runner.sh graph smoke (now via llama-bench / llama-completion, not llama-cli — the fork's llama-cli is conversation-only and busy-loops on stdin EOF).
  • Bench: make -C packages/inference/verify <backend>-bench (metal-bench / vulkan-bench / cpu-bench) for the standalone-kernel perf harness; llama-bench -m <gguf> -ngl 99 -p … -n … -fa 1 --cache-type-k … for the model-graph throughput (the verify runners do this); verify/e2e_loop_bench.mjs / verify/thirty_turn_endurance_harness.mjs for the end-to-end voice loop. The fork ships llama-bench + llama-completion next to llama-server (as of this commit), so the bench path exists on every built target.

CPU baseline — runnable anywhere

TargetBuildKernel verifyBenchStatusPrereq if not done
linux-x64-cpunode …/build-llama-cpp-dflash.mjs --target linux-x64-cpumake -C …/verify reference-test (C-reference round-trip); the CPU score/decode ops are the C references themselves; make cpu-dispatch-smoke (graph picks GGML_OP_ATTN_SCORE_QJL + GGML_OP_FUSED_ATTN_QJL_TBQ on the CPU backend and asserts MT-vs-ST bit-identical, no NaN — verify/qjl_mt_check.c; CPU_BIN_DIR/GGML_INC_DIR default to the in-repo fork build/checkout, no env vars)make -C …/verify cpu-bench cpu-simd-bench; llama-bench on the staged text GGUFverified-here (reference-test clean; cpu-dispatch-smoke PASS — MT-vs-ST bit-identical; AVX-VNNI int8-QJL 5.25× / fp32-QJL LUT-gather ~2.5–8× — bench_results/cpu_avxvnni_2026-05-11.json, bench_results/cpu_kopt_2026-05-11.json). kernel-contract.json runtimeStatus.cpu = runtime-ready for qjl + fusedAttn (verify/cpu-runtime-dispatch-evidence.json); reference-only for TBQ/Polar standalone score (no public CPU graph op — validated by reference-test). The §3 CPU kernel-completeness build gate still fails by design (turbo3_tcq/polarquant not CPU-buildable).verify-on-device against the staged bundle bytes (verifyBundleOnDevice); wire probeKernels() to read cpu-runtime-dispatch-evidence.json so a fresh linux-x64-cpu build's CAPABILITIES.json reports qjl_full runtime-ready.
linux-aarch64-cpu--target linux-aarch64-cpu (needs an arm64 Linux host or a sysroot+cross-toolchain — no aarch64-cross wiring on x64 here)make reference-test + cpu-dispatch-smoke on the arm64 hostcpu-bench cpu-simd-bench (NEON dotprod paths)authored-pending-hardwareAn arm64 Linux box (Ampere Altra / Graviton / Snapdragon-Linux).
windows-x64-cpu--target windows-x64-cpu (mingw cross-build)pwsh -File verify/windows_runner.ps1 -Backend cpu -Model C:\models\eliza-1-smoke.gguf on a real Windows box (now drives llama-bench + llama-completion, not llama-cli)windows_runner.ps1 (above)authored-pending-hardware (cross-built exe is not counted)A native Windows x64 host.
windows-arm64-cpu--target windows-arm64-cpu (needs an MSVC arm64 cross-toolchain or a native Windows-arm64 host — no mingw arm64 wiring here)windows_runner.ps1 -Backend cpu on a Snapdragon X boxwindows_runner.ps1authored-pending-hardwareA Snapdragon X Elite / Copilot+ PC.
android-arm64-cpunode packages/app-core/scripts/aosp/compile-libllama.mjs (NDK cross-build)CPU/NEON parity via adb on a physical Android deviceadb-pushed cpu_bench / llama-benchauthored-pending-hardwareA physical Android device + NDK.
android-x86_64-cpuANDROID_NDK_HOME=… node …/build-llama-cpp-dflash.mjs --target android-x86_64-cpu (NDK cross-build, -DANDROID_ABI=x86_64, forces AVX/AVX2/FMA/F16C — the x86_64 Android ABI baseline is SSE4.2; the QJL/Polar CPU kernels need AVX2)(a) kernel C-reference parity on cvd (NEW 2026-05-12, verified-here): cross-compile gen_fixture with NDK r29 (Android Clang 20.0.0, target x86_64-unknown-linux-android24) and adb push to cvd-1; run gen_fixture_android_x86_64 --self-test → bit-identical to host across all six required kernels + fused-attn + tbq V-cache parity (turbo3=-2.501480 / turbo4=-23.721790 / turbo3_tcq=-4.822659 / qjl=3.696591 / polar=-1.994053 / polar_qjl=-1.438744); the cross-built fork llama-server loads on the cvd and exposes the full Eliza-1 KV-cache-type whitelist (tbq3_0, tbq4_0, qjl1_256, q4_polar, tbq3_tcq) with banner built with Clang 20.0.0 for Android x86_64, fork commit 536ff214. Evidence: evidence/platform/android-x86_64-cpu.json. (b) The orthogonal 8-step Cuttlefish (cvd) chat-completion smoke node packages/app-core/scripts/aosp/smoke-cuttlefish.mjs: 5/6 infra steps PASS on the live cvd (cvd reachable, APK installed abi=x86_64, ElizaAgentService start, /api/health agentState=running runtime=ok, bearer token); step 6 chat completion failed — no model staged in the release APK on that cvd. See ../reports/porting/2026-05-12/cuttlefish-x86_64-smoke.md.adb-pushed cpu_bench / llama-bench; e2e_loop_bench.mjs on the cvdkernel C-reference parity verified-here on Cuttlefish + build verified-here (real x86_64 Android ELF — interpreter /system/bin/linker64 — + libs, fork commit 536ff214; CAPABILITIES.json qjl_full/polarquant true), Cuttlefish cvd 8-step infra smoke 5/6 PASS. kernel-contract.json platformTargets.android-x86_64-cpu = runtime-ready / runtime-ready / verified.A build-aosp.mjs --launch rebuild staging the new android-x86_64-cpu libllama + a bundled eliza-1-smoke GGUF in the privileged APK → 8/8 chat smoke; Vulkan-on-cvd is gfxstream/SwiftShader (software → not recordable).
android-x86_64-vulkan…/build-llama-cpp-dflash.mjs --target android-x86_64-vulkan (NDK + Vulkan headers, -DANDROID_ABI=x86_64)standalone vulkan_verify fixtures pass on the host ANV iGPU AND on the cvd SwiftShader ICD (8/8 cases, max diff < 1e-5, evidence/platform/android-x86_64-vulkan.json) — DIAGNOSTIC-ONLY under cvd because SwiftShader is software (vendor 0x1ae0 LLVM 16); graph dispatch needs real ChromeOS x86_64 GPU (Adreno/Mali under ARCVM) — cvd virtio-gpu Vulkan is gfxstream/SwiftShader (software → no recordable evidence per fail-closed)adb-pushed vulkan_benchauthored-pending-hardware for graph dispatch (ChromeOS GPU); cross-build + SPIR-V fixture pass + Android Vulkan loader path validated under SwiftShader on cvdReal ChromeOS x86_64 GPU silicon (Adreno/Mali under ARCVM) or a passed-through host GPU via crosvm gfxstream + real-GPU host.
linux-x64-cpu-fusedELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 …/build-llama-cpp-dflash.mjs --target linux-x64-cpu-fusedOMNIVOICE_FUSE_VERIFY.json ok=true + verifyFusedSymbols (abi/omnivoice/llama-reexport counts); llama-server --cache-type-k qjl1_256 --cache-type-v q4_polar boots healthy, /completion returns tokens (NEW 2026-05-12, fork commit cb700767 / tag v1.1.1-eliza) — guarded by make cpu-dispatch-smoke's new cpu_qjl_polar_attn_smoke regression testdflash-server-fused.integration.test.ts (spawns the fused llama-server, hits /completion + /v1/audio/speech same-PID); llama-bench/llama-completion for textverified-here for the merged HTTP route + symbol-verify + the QJL/Polar KV-cache warmup-no-segfault path; exit-1 is the §3 CPU-backend kernel-completeness gate (turbo3_tcq/qjl_full/polarquant aren't CPU-graph-dispatch caps), CAPABILITIES.json publishable: false.A weight-backed /v1/audio/speech smoke against a real tts/omnivoice-*.gguf (the dev stand-in bundle has no tts/); voice:duet end-to-end (the QJL/Polar segfault is fixed; remaining duet block is an SQL/embeddings-dim runtime-bootstrap concern, separate from the kernel).

CUDA — verified-here on the RTX 5080 (sm_120, native SASS via CUDA 12.8)

TargetBuildKernel verifyBenchStatusPrereq if not done
linux-x64-cudaELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 …/build-llama-cpp-dflash.mjs --target linux-x64-cuda (~1.5–2 h, ~30 GB peak — serialize; check free -m/uptime first). CUDA 12.8 is installed at /usr/local/cuda-12.8; pass CUDACXX=/usr/local/cuda-12.8/bin/nvcc PATH=/usr/local/cuda-12.8/bin:$PATH so the build hook's cudaArchListFlag() sees the 12.8 nvcc and appends 100;120 to the arch list (the system /usr/bin/nvcc is 12.0 and would silently downgrade to 80;86;89;90;90a). Full integration build now installed (2026-05-12)~/.eliza/local-inference/bin/dflash/linux-x64-cuda/, forkCommit a61c93aaa5 (v1.2.0-eliza), builtAt 2026-05-12T17:16:58Z, libggml-cuda.so.0.9.7 473 MB (real sm_120a SASS), all binaries (llama-bench/llama-cli/llama-completion/llama-server/llama-speculative-simple) ldd-clean via $ORIGIN rpath.make -C …/verify cuda-verify cuda-verify-fused (self-contained nvcc compile of cuda_verify.cu; 8/8 + 1920/1920 PASS on the RTX 5080, max diff ≤ 9.5e-6 / 4.47e-7; cuda-verify-fused exercises the warp-cooperative kernel mirroring the production cuda/fused-attn-qjl-tbq.cu; the harness builds a native sm_120.cubin under 12.8). Re-verified 2026-05-12 against the installed real-build.verify/cuda_runner.sh --report … (builds the fork, cuda-verify, then runtime_graph_smoke.sh --gen-checkllama-bench --cache-type-k tbq3_0 -ngl 99 + llama-completion); bench_results/cuda_e2e_2026-05-11.json (text pp ~2.3–6.7k t/s, tg ~40–55 t/s; ASR eliza-1-asr.gguf → arch qwen3vl 1.7B, pp16 ~1023, pp128 ~4561, tg32 ~62 t/s); nsys: DP4A qjl_score_dp4a_kernel ~2.27× faster than fp32 qjl_score_kernel. New llama-bench numbers (2026-05-12, real install, freed GPU): eliza-1-0_6b bundle pp512/tg128 d=0 19932 / 345.5 t/s, d=16000 1956 / 108.5 t/s; eliza-1-1_7b bundle pp512/tg128 d=0 11931 / 194.7 t/s, d=16000 1797 / 84.9 t/s; base Qwen3-0.6B-Q8_0 d=0 20979 / 356 t/s, base Qwen3-1.7B-Q8_0 d=0 12414 / 159 t/s. llama-server smoke verified: 4 GPU slots, /health → ok, POST /completion 32-token decode at 420.57 tps decode / 1092.66 tps prefill on the 0_6b bundle.verified-here for cuda-verify / cuda-verify-fused / full ggml-cuda integration build (NEW 2026-05-12) / text + ASR llama-bench / llama-server /completion / native-sm_120-SASS-compile. kernel-contract.json runtimeStatus.cuda + fusedAttn.runtimeStatus.cuda = runtime-ready. runtime_graph_smoke.sh --gen-check errors with "no cache-type alias for turbo3" — expected on the non-fused build (CAPABILITIES.json reports dflash: false, missingRequiredKernels: ["dflash"]); the cache-type aliases are added by the dflash/fused-build patch path.verify-on-device against the staged bundle bytes (verifyBundleOnDevice); wire probeKernels() to read the install's CAPABILITIES.json.
linux-x64-cuda-fusedELIZA_DFLASH_SKIP_SERVER_STRUCTURED_OUTPUT=1 CUDACXX=/usr/local/cuda-12.8/bin/nvcc PATH=/usr/local/cuda-12.8/bin:$PATH …/build-llama-cpp-dflash.mjs --target linux-x64-cuda-fused --jobs 3 — the big build: full ggml-cuda + the omnivoice-core graft, ~30 GB peak RAM under -j 6 (the 31 GB dev box OOM-killed the parent build script under -j 6 mid-build at the fattn.cu long-pole). Built 2026-05-12 on this box at -j 3 from a clean dir against fork commit a61c93aaa5 (v1.0.0-eliza) + omnivoice pin 38f824023d12. OMNIVOICE_FUSE_VERIFY.json ok=true (llama=0, omnivoice=10, abi=23 symbols).cuda-verify cuda-verify-fused + OMNIVOICE_FUSE_VERIFY.json1920/1920 PASS on RTX 5080 sm_120, max diff 5.07e-07 (logs/cuda-verify-fused-fusedbuild-rtx5080-2026-05-12.log). make cuda-hardware against the install: 6/6 fixture-set PASS (logs/cuda-hardware-fusedbuild-rtx5080-2026-05-12.log); graph-smoke gated on llama-bench (not in fused-target list — non-blocking tooling gap).verify/e2e_loop_bench.mjs --backend cuda --tier 0_6b --turns 1 against the fused install — voice_rtf 0.4255 (PASS ≤ 0.5), tg 64.82 tok/s, first_token 43.3 ms, dflash 12/12, peak RSS 2340 MB. packages/inference/reports/porting/2026-05-12/e2e-loop-cuda-2026-05-12.json.verified-here on RTX 5080 Laptop (sm_120, CUDA 12.8)CAPABILITIES.json reports publishable: true, missingRequiredKernels: [], dflash + turbo3 + turbo4 + turbo3_tcq + qjl_full + polarquant + lookahead + ngramDraft = all true.Re-run on additional sm classes (sm_89 / sm_90 / sm_100 datacenter) to confirm no arch regression in the CMAKE_CUDA_ARCHITECTURES list.
linux-aarch64-cuda--target linux-aarch64-cuda on an arm64 Linux + Hopper/Blackwell host (GH200 = aarch64 host + H100/H200/GB200 GPU)make cuda-verify cuda-verify-fused on that host; verify/gh200_runner.sh --report … (refuses non-aarch64 / non-Hopper-9.x)gh200_runner.sh; llama-bench on the 27b-256k / 27b-256k tier GGUFsauthored-pending-hardwareA GH200 / H100-aarch64 / GB200 host. Use the cloud runner.
windows-x64-cuda--target windows-x64-cuda (MSVC + CUDA Toolkit on Windows)pwsh -File verify/windows_runner.ps1 -Backend cuda -Model C:\models\eliza-1-smoke.gguf on NVIDIA hardware (drives llama-bench + llama-completion)windows_runner.ps1 (above)authored-pending-hardware (cross-built exe not counted)A native Windows + NVIDIA box.
windows-x64-cuda-fused--target windows-x64-cuda-fusedwindows_runner.ps1 -Backend cuda + OMNIVOICE_FUSE_VERIFY.jsonthe fused Windows llama-server's /v1/audio/speechauthored-pending-hardwareThe Windows-CUDA hardware runner first, then the fused build on that host.

Vulkan — verified-here on Intel Arc/Xe Mesa ANV (single device class)

TargetBuildKernel verifyBenchStatusPrereq if not done
linux-x64-vulkan…/build-llama-cpp-dflash.mjs --target linux-x64-vulkanmake -C …/verify vulkan-verify vulkan-verify-multiblock vulkan-verify-fused (8/8 + 8/8 + 1920/1920 PASS on Intel ANV, max diff ≤ 7.6e-6 / 6.3e-7); make vulkan-native-smoke / vulkan-dispatch-smoke (built-fork graph routes PASS on Intel ARL/ANV — the harness drives the two fused attention ops the fork pin declares in ggml.h: GGML_OP_ATTN_SCORE_QJL 32 outs max 2.7e-7 + GGML_OP_FUSED_ATTN_QJL_TBQ 512 outs max 4.5e-8 — vulkan-runtime-dispatch-evidence.json + hardware-results/linux-vulkan-smoke-*.log. The standalone TBQ/Polar score kernels are covered by vulkan-verify; their built-fork graph entries in the evidence file are from a prior full-patched-build run.)make vulkan-bench; llama-bench -ngl 99 (the dispatch smoke does this)verified-here on Intel ARL/ANVkernel-contract.json runtimeStatus.vulkan = runtime-ready for the 5 score kernels + fused_attn. Single Intel-ANV device class.Native AMD (RADV) and NVIDIA-desktop Vulkan graph dispatch; verify-on-device against the staged bundle bytes.
linux-x64-vulkan-fused…/build-llama-cpp-dflash.mjs --target linux-x64-vulkan-fused --jobs 4 — Vulkan ggml + the omnivoice-core graft (much lighter than the CUDA-fused build — ~3 min on this box vs. ~85 min for CUDA-fused). Built 2026-05-12 from a clean dir against fork commit a61c93aaa5 + omnivoice pin 38f824023d12. OMNIVOICE_FUSE_VERIFY.json ok=true (llama=0, omnivoice=10, abi=23). Fused SPIR-V baked into libggml-vulkan.so (eliza_fused_attn_qjl_tbq_data + eliza_fused_attn_qjl_polar_data symbols, _len pair).vulkan-verify vulkan-verify-fused + OMNIVOICE_FUSE_VERIFY.jsonall PASS on Intel ARL iGPU (Mesa ANV 25.2.8): fused_attn_qjl_tbq 1920/1920 + 1536/1536 causal + fused_attn_qjl_polar 1920/1920 + 1536/1536 causal = 6912 outputs total, max diff 6.26e-07 (logs/vulkan-verify-fused-fusedbuild-anv-2026-05-12.log).verify/e2e_loop_bench.mjs --backend vulkan --tier 0_6b --turns 1 against the fused install — runs end-to-end (e2eOk: true, dflash 31/31), but iGPU performance keeps voice_rtf at 1.7269 (FAIL ≤ 0.5; the gate targets discrete-GPU class). tg 12.13 tok/s, first_token 493 ms, peak RSS 1370 MB. packages/inference/reports/porting/2026-05-12/e2e-loop-vulkan-2026-05-12.json.verified-here for kernel parity + e2e functionality on Intel ARL iGPU; CAPABILITIES.json publishable: true, missingRequiredKernels: []. Publish-gate (voice_rtf ≤ 0.5) FAIL on iGPU — gate is a discrete-GPU target.Re-run e2e_loop_bench on a discrete Vulkan-mode card (RDNA3 / Ada in pure-Vulkan / Intel BMG) for a Vulkan voice-rtf number under the discrete-GPU class.
windows-x64-vulkan--target windows-x64-vulkan (mingw + Khronos Vulkan-Headers cross-build)pwsh -File verify/windows_runner.ps1 -Backend vulkan -Model C:\models\eliza-1-smoke.gguf on native Windows Vulkanwindows_runner.ps1authored-pending-hardwareA native Windows + GPU box.
windows-arm64-vulkan--target windows-arm64-vulkan (MSVC arm64 cross-toolchain)windows_runner.ps1 -Backend vulkan on a Snapdragon X box (Adreno X1 = Vulkan 1.3)windows_runner.ps1authored-pending-hardwareA Snapdragon X Elite / Copilot+ PC.
android-arm64-vulkannode packages/app-core/scripts/aosp/compile-libllama.mjs (NDK cross-build)make -C …/verify android-vulkan-smoke — standalone fixtures 6/6 PASS on Pixel 6a / Mali-G78 (hardware-results/android-vulkan-smoke-*.log); built-fork graph dispatch evidence (ELIZA_ANDROID_VULKAN_GRAPH_EVIDENCE) still open; Adreno not yet runadb-pushed vulkan_bench / llama-benchauthored-pending-hardware for graph dispatch (standalone fixtures verified-here on Mali)A built-fork/app graph-dispatch report on one Adreno + one Mali device.

Metal / Apple — verified-here on Apple M4 Max (prior passes)

TargetBuildKernel verifyBenchStatusPrereq if not done
darwin-arm64-metal…/build-llama-cpp-dflash.mjs --target darwin-arm64-metal (macOS host, builds + embeds default.metallib)make -C …/verify metal-verify metal-verify-multiblock (8/8 + 8/8 on M4 Max); make dispatch-smoke (built-fork graph dispatch for GGML_OP_ATTN_SCORE_{QJL,TBQ×3,POLAR} + pre-Hadamard Polar — all PASS); metal-verify-fused fails by design (no metal/fused_attn*.metal cases-array path in metal_verify yet — see ../reports/porting/2026-05-11/metal-fused-attn-and-polar-preht-design.md)make metal-bench metal-bench-batched metal-bench-multiblock; llama-bench -ngl 99verified-here on M4 Maxkernel-contract.json runtimeStatus.metal = runtime-ready for the 5 score kernels; darwin-arm64-metal passes the build-script capability gate.Full text+DFlash+voice latency/RSS/thermal gates against a staged Eliza-1 bundle; metal_verify cases-array path for metal-verify-fused.
darwin-arm64-metal-fused…/build-llama-cpp-dflash.mjs --target darwin-arm64-metal-fused --jobs 10 — links omnivoice-core + libelizainference.dylib + llama-omnivoice-server + libmtmd + default.metallib; verify-symbols.mjs (omnivoice=10 abi=8)metal-verify metal-verify-multiblock dispatch-smoke (same as above) + verify-symbols.mjsBun FFI smoke against ~/.eliza/local-inference/models/eliza-1-1_7b.bundle (loads real OmniVoice Q4_K_M + Qwen3-ASR for TTS + ASR — reports/local-e2e/2026-05-11/fused-voice-ffi-smoke.json); e2e_loop_bench.mjsverified-here on macOS Metal for the fused dylib FFI smoke (real GGUF-backed TTS + ASR in one fused process)Built-fork graph-dispatch smoke + full latency/RSS/thermal gates; the fused llama-server route on macOS (currently the macOS evidence is the FFI path, not the HTTP route).
ios-arm64-metal…/build-llama-cpp-dflash.mjs --target ios-arm64-metal (macOS+Xcode, emits .a + headers + default.metallibbuild-xcframework.mjs --verify glues the LlamaCpp.xcframework)build-xcframework.mjs --verify (kernel-symbol + runtime-symbol + structure audits — PASS); run-physical-device-smoke.mjs (3/3 XCTest cases PASS on iPhone 15 Pro / iOS 26.3.1, --skip-voice-abi=falsehardware-results/ios-device-smoke-2026-05-11.json)the iOS XCTest harness; (no llama-bench on iOS — the runtime is the static lib + eliza_inference_* ABI)verified-here on iPhone 15 Pro for the symbol/structure audits + the runtime-symbol XCTest; the §3 P0 blocker is a weight-backed Eliza-1 bundle smoke from the Capacitor app shell (first token / first audio / peak RSS / thermal).A real Eliza-1 bundle smoke from the Capacitor app shell.
ios-arm64-simulator-metal…/build-llama-cpp-dflash.mjs --target ios-arm64-simulator-metalbuild-xcframework.mjs --verify (simulator slice — PASS); simulator smoke against the embedded metallib + GGML_OP_ATTN_SCORE_TBQ Turbo4 routethe iOS simulator XCTestauthored-pending-hardware (symbol/structure audits pass; no simulator weight-backed run)Simulator smoke against the embedded metallib.

(darwin-x64-metal is not a supported target — Apple Silicon darwin-arm64-metal only.)

ROCm — runner exists, no AMD host here

TargetBuildKernel verifyBenchStatusPrereq if not done
linux-x64-rocm…/build-llama-cpp-dflash.mjs --target linux-x64-rocm (needs hipcc + ROCm)make -C …/verify hip-verify — the standalone fixture-parity harness (NEW this wave): hip_verify.cu is a thin shim that #includes cuda_verify.cu (which now guards its backend headers on __HIP_PLATFORM_AMD__ and aliases the cuda* runtime calls to hip*), so it runs the EXACT same ~25 device kernels + fixture loader + reference cross-check the NVIDIA cuda-verify does, compiled by hipcc against a gfx* GPU. Plus verify/rocm_runner.sh --report … (refuses without hipcc + rocminfo gfx* agent + a smoke GGUF; builds the fork, then runtime_graph_smoke.sh --gen-checkllama-bench + llama-completion on the HIP backend).make hip-verify; rocm_runner.sh; llama-bench -ngl 99 on the HIP backendauthored-pending-hardwarehip_verify.cu + the hip-verify Makefile target are authored + buildable (no hipcc on the authoring box → clean "install ROCm / see rocm_runner.sh" message); the fork's production .cu kernels (turboquant.cuh/qjl.cu/polarquant.cu/turbo-tcq.cu) are not yet __HIP_PLATFORM_AMD__-clean — until that lands the ROCm runtime story is the hip-verify numeric gate + the documented reduced-optimization local mode (ELIZA_LOCAL_ALLOW_STOCK_KV=1, loud warning, not publishable) for production inference.An AMD ROCm host (RDNA2/RDNA3 or CDNA, gfx* agent — e.g. a vast.ai MI300 box).

Quick "one command for everything I can run here" line

bash
# From the repo root, on this box (CPU + Intel-ANV Vulkan + RTX 5080 CUDA):
make -C packages/inference/verify kernel-contract reference-test cuda-verify cuda-verify-fused
make -C packages/inference/verify vulkan-verify vulkan-verify-multiblock vulkan-verify-fused
make -C packages/inference/verify vulkan-dispatch-smoke   # built-fork Vulkan graph routes (needs the linux-x64-vulkan build)
# Bench (CUDA text + ASR, RTX 5080):
~/.cache/eliza-dflash/milady-llama-cpp/build-cuda/bin/llama-bench \
  -m ~/.eliza/local-inference/models/eliza-1-1_7b.bundle/text/eliza-1-1_7b-32k.gguf -ngl 99 -p 16,512 -n 32 -fa 1

Not in SUPPORTED_TARGETS — runtime-side / explicitly-out-of-scope notes

MLX (mlx_lm.server) — REMOVED

The Apple-Silicon mlx_lm.server spawn-and-route path was removed in commit 20d50d7553 (P1 consolidation). It violated the local-inference invariant of no subprocesses + no TCP loopback. No production callsite ever invoked it, and MLX_IN_PROCESS_PLAN.md documents the in-process unblock plan if MLX becomes a real requirement. The stub mlx-server.ts itself has been deleted; see services/index.ts for the cleaned export surface.

TPU / NPU — not a target this wave (verdict, documented)

No. The eliza-1 text backbone (0.6B smallest, fp16/Q4) does not fit a Coral Edge TPU's 8 MB on-chip SRAM, isn't int8-only quantizable to the Coral's constraints, and KV-cache attention is not an Edge-TPU workload. The Pixel Tensor TPU could in principle run a small int8 transformer but there is no public delegate API to target it from a third-party app, and NNAPI is deprecated by Google in favour of per-vendor delegates. The Android GPU (Mali/Adreno via Vulkan) is the right on-device accelerator for the text model — which the android-arm64-vulkan / android-x86_64-vulkan targets already cover. The sidecars (Silero VAD, Qwen3-ASR-0.6B, Qwen3-Embedding-0.6B) don't win enough on an NPU to justify the conversion work, and OmniVoice TTS is fused into the llama.cpp build (one GGML pin) — pulling it onto a separate NPU breaks the fusion contract (§4: one process, one build). The one open angle: a ELIZA_VAD_QNN_DELEGATE=1 flag that, when onnxruntime-mobile is built with the Qualcomm QNN EP, runs Silero VAD on the Hexagon NPU island while the CPU sleeps — that is a battery optimization for always-listening wake-word mode, not a latency one, and is a stretch, not core. No plugin-coral / plugin-qnn is added.