packages/app-core/scripts/ffi-stub/README.md
This directory contains the helpers + patch material that the build script
packages/app-core/scripts/build-llama-cpp-dflash.mjs invokes when one of
the fused targets (e.g. darwin-arm64-metal-fused) is requested.
The fused build produces ONE shared library and ONE server binary that
expose both llama_* (text + vision) and omnivoice_* (TTS + ASR)
symbols. This is the "one process, one llama.cpp build, one GGML pin"
contract from packages/inference/AGENTS.md §4.
| Component | Repo | Pin |
|---|---|---|
| omnivoice.cpp | https://github.com/elizaOS/omnivoice.cpp (fork of https://github.com/ServeurpersoCom/omnivoice.cpp) | 38f824023d12b21a7c324651b18bd90f16d8bb86 (upstream master HEAD 2026-05-10) |
| omnivoice ggml | https://github.com/ServeurpersoCom/ggml.git | 0e3980ef205ea3639650f59e54cfeecd7d947700 (its ggml submodule) |
| eliza llama.cpp | https://github.com/elizaOS/llama.cpp.git | v0.4.0-eliza (08032d57) — see build-llama-cpp-dflash.mjs |
omnivoice.cpp ships its own ggml fork as a git submodule (commit
0e3980ef…), and pulls it into the build with add_subdirectory(ggml).
The elizaOS/llama.cpp fork ships its own ggml in-tree (at
ggml/, NOT a submodule) with the TurboQuant + QJL + PolarQuant + DFlash
patches that the kernels in packages/inference/{metal,vulkan} are
verified against.
Two ggml trees in one build tree is illegal. The kernels in this
repo are checked against the eliza ggml only — the ServeurpersoCom ggml
does not have TurboQuant centroids, QJL projections, PolarQuant
centroids, or DFlash flash-attn entry points. Targeting both would
either (a) link two different ggml libraries into the same process
(undefined-behavior territory: duplicate ggml_* exports, divergent
struct layouts) or (b) silently use whichever comes first in link order
and lose half the contract.
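Case (a) can be spotted mechanically by comparing export lists. The sketch below is a hypothetical helper, not one of this directory's scripts: it takes two lists of exported symbol names (for example, parsed from nm output) and reports the ggml_* names both libraries export. The sample lists are illustrative, not real nm output.

```typescript
// Hypothetical helper: list the ggml_* names exported by BOTH libraries.
function duplicateGgmlExports(libA: string[], libB: string[]): string[] {
  const inB = new Set(libB.filter((sym) => sym.startsWith("ggml_")));
  const dupes = libA.filter((sym) => inB.has(sym));
  // De-duplicate and sort so the result is stable across runs.
  return Array.from(new Set(dupes)).sort();
}

// Illustrative symbol lists (assumptions, not captured from a real build).
const elizaGgmlExports = ["ggml_init", "ggml_mul_mat", "ggml_flash_attn_ext"];
const omnivoiceGgmlExports = ["ggml_init", "ggml_mul_mat", "ggml_conv_1d"];

const sharedGgmlSymbols = duplicateGgmlExports(elizaGgmlExports, omnivoiceGgmlExports);
console.log(sharedGgmlSymbols);
```

Any non-empty result means two ggml implementations would compete for the same names in one process, which is exactly what the single-ggml contract rules out.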
Strategy chosen: graft, not submodule swap. When we prepare the fused checkout we:

1. Drop omnivoice's ggml/ subdirectory entirely. No
   git submodule update --init for ggml, no add_subdirectory(ggml)
   from omnivoice's CMakeLists.txt. The only ggml in the merged tree is
   llama.cpp's.
2. Copy omnivoice's src/, tools/, and examples/ into the llama.cpp
   tree under stable paths:
   - omnivoice/src/ ← omnivoice src/
   - omnivoice/tools/ ← omnivoice tools/
   - omnivoice/examples/ ← omnivoice examples/ (data files only)
3. Append a CMake snippet to llama.cpp's root CMakeLists.txt that:
   - declares omnivoice-core (static archive) over the copied sources,
   - links it against llama.cpp's ggml / ggml-base / ggml-cpu /
     per-backend targets (so it shares one ggml ABI),
   - builds the llama-omnivoice-server smoke target and links
     the product llama-server speech route against omnivoice-core,
   - produces one shared library (libelizainference) so
     mobile/desktop bridges can dlopen one .so/.dylib and resolve both
     symbol families.
4. Apply the patches in omnivoice-merged/patches/ to
   reconcile any compile-time API drift between omnivoice's expected
   ggml surface (the ServeurpersoCom fork) and the eliza ggml. Each
   patch is documented at the top with the symbol/struct it touches and
   the upstream commit that introduced the drift.

This is the lowest-blast-radius approach. We do NOT rebase omnivoice
onto a clean ggml tip and we do NOT carry the ServeurpersoCom ggml
submodule alongside ours. If the omnivoice authors upstream changes to
their ggml fork that we want, we cherry-pick those into the in-tree
ggml that ships inside elizaOS/llama.cpp (the
packages/inference/llama.cpp submodule), then bump the omnivoice pin
in this README.
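The copy step of the graft can be pictured as a path mapping. The sketch below is an illustration only, not the code in prepare.mjs: the destination prefixes come from the layout listed above, while the function name and the choice to return null for anything that is not grafted are assumptions.

```typescript
// Source prefix in the omnivoice checkout -> destination in the llama.cpp tree.
const GRAFT_ROOTS: Record<string, string> = {
  "src/": "omnivoice/src/",
  "tools/": "omnivoice/tools/",
  "examples/": "omnivoice/examples/",
};

// Hypothetical helper: where a file from the omnivoice checkout lands in the
// merged tree, or null when it is not copied at all.
function graftDestination(omnivoicePath: string): string | null {
  // omnivoice's ggml is dropped: the merged tree keeps only llama.cpp's ggml.
  if (omnivoicePath === "ggml" || omnivoicePath.startsWith("ggml/")) {
    return null;
  }
  for (const from of Object.keys(GRAFT_ROOTS)) {
    if (omnivoicePath.startsWith(from)) {
      return GRAFT_ROOTS[from] + omnivoicePath.slice(from.length);
    }
  }
  // Assumption: files outside src/, tools/, examples/ are not grafted.
  return null;
}

console.log(graftDestination("src/omnivoice.h"));
console.log(graftDestination("ggml/CMakeLists.txt"));
```

Keeping the mapping in one place makes it harder for a second ggml to slip into the merged tree by accident.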
Pointing omnivoice's ggml submodule at the eliza ggml might sound equivalent, but it creates a sharper failure mode: omnivoice's
CMakeLists.txt expects to be the parent of ggml/ and configures it
with its own option set (OMNIVOICE_*, GGML_MAX_NAME=128, etc.). If we
let omnivoice's CMake reconfigure eliza's ggml we lose the kernel-set
and patch hooks that build-llama-cpp-dflash.mjs already wires. The
graft approach keeps llama.cpp's CMake as the single point of ggml
configuration.
Running omnivoice as a second process is forbidden by §4 of
packages/inference/AGENTS.md: "We do not run text
and voice in two processes communicating over IPC. That regresses
memory and adds a 1-10ms scheduling tax per turn." Even an in-process
dlopen of a separately built omnivoice library would still mean two
distinct ggml ABIs sharing the same address space: the same problem, masked.

To bump the omnivoice pin, start by diffing upstream against the current pin:

    git clone https://github.com/elizaOS/omnivoice.cpp /tmp/omnivoice-pinbump
    cd /tmp/omnivoice-pinbump
    git log --oneline -20
    git diff 38f824023d12b21a7c324651b18bd90f16d8bb86..master \
        -- src/ tools/ CMakeLists.txt

Review the diff for:

- Changes to the public API in src/omnivoice.h (used by
  cap-bridge.cpp and the runtime), in particular any rename of
  omnivoice_context_*, omnivoice_generate_*, omnivoice_load_*.
- Changes in src/maskgit-tts.h / dac-decoder.h / hubert-enc.h
  that touch the ggml graph builders; those must stay compatible
  with the ggml exposed by elizaOS/llama.cpp at the build pin.
- New files in src/ or tools/: extend
  CMAKE_GRAFT_SOURCES in omnivoice-merged/cmake-graft.mjs.

Then compare omnivoice's ggml submodule against the
current ServeurpersoCom ggml pin:

    cd /tmp/omnivoice-pinbump
    git submodule update --init ggml
    cd ggml
    git log 0e3980ef205ea3639650f59e54cfeecd7d947700..HEAD --oneline

If that log touches the ggml src/, decide whether the change is needed.
If it is, cherry-pick it into the elizaOS/llama.cpp in-tree
ggml/, then bump that fork's tag, then bump omnivoice here.

Then:

1. Update OMNIVOICE_REF / OMNIVOICE_GGML_REF in
   packages/app-core/scripts/build-llama-cpp-dflash.mjs.
2. Run node build-llama-cpp-dflash.mjs --target darwin-arm64-metal-fused
   (or vulkan/cpu equivalent). The build
   MUST exit non-zero if symbol verification fails; do NOT add
   compatibility shims to make the new pin compile.
3. Run verify/metal_verify and verify/vulkan_verify (per
   packages/inference/README.md) to confirm the kernel matrix still
   reports 8/8 PASS on the previously-verified hardware. A bumped
   omnivoice pin is NOT shippable until those are green.

Per packages/inference/AGENTS.md §3 ("Mandatory optimizations") and
§9 ("No defensive code"), any of the following cause the build script
to exit non-zero. There is no "build the non-fused binary as a
fallback" path.

- omnivoice's src/, tools/, or required headers are missing.
- Stripping omnivoice's ggml/ submodule
  fails (e.g. it's a real directory we couldn't strip).
- Patches in omnivoice-merged/patches/ fail to apply.
- The linked artifact does not export both
  llama_* and omnivoice_* symbols (verified post-link with
  nm, or objdump -T on Linux/MinGW, or nm -gU on Darwin).

Files in this directory:

- README.md: this file.
- cmake-graft.mjs: reads the omnivoice source list and emits a
  CMake snippet appended to llama.cpp's root
  CMakeLists.txt to declare omnivoice-core,
  llama-omnivoice-server, libelizainference.
- prepare.mjs: clones omnivoice at the pin, strips its ggml/
  submodule, copies src/ + tools/ into the
  llama.cpp tree, applies any patches/*.patch,
  and returns the omnivoice commit so the caller
  can record it in CAPABILITIES.json.
- verify-symbols.mjs: post-build symbol probe. Runs nm (or
  objdump -T on PE) against the produced
  binary/library and asserts both llama_* and
  concrete OmniVoice ov_* exports are present.
  Writes OMNIVOICE_FUSE_VERIFY.json beside the
  artifact on both pass and fail.
- patches/: directory for .patch files keyed to specific
  omnivoice or ggml commit drifts. Each patch is
  applied with git apply --check first; a failed
  apply is a hard error.
- ffi.h: C ABI v3 for libelizainference. Single source
  of truth for the symbol set the fused build
  exposes. Consumed by the Bun FFI loader at
  src/services/local-inference/voice/ffi-bindings.ts
  and by future Rust / Swift / Python bridges.
- ffi-stub.c: reference C implementation that builds into
  libelizainference_stub.{dylib,so}. Lifecycle
  (create/destroy) works; every entry that
  requires the real fused build returns
  ELIZA_ERR_NOT_IMPLEMENTED (*_supported → 0,
  cancel_tts → OK, set_verifier_callback → OK
  no-op). Used by ffi-bindings.test.ts for
  end-to-end loader validation without the fused
  dylib. The stub library itself is a build
  artifact (make-produced, .gitignored) and is not
  checked in; CI/tests that need it run make
  first or skip.
- Makefile: builds the stub. make produces the
  platform-default artifact; make verify lists
  the exported eliza_inference_* symbols;
  make verify-stub-rejected confirms the real
  symbol verifier rejects the stub.

The fused build (and the stub) export exactly the symbols declared in
ffi.h and listed in the table below. Bump
ELIZA_INFERENCE_ABI_VERSION in ffi.h AND
ELIZA_INFERENCE_ABI_VERSION in
packages/app-core/src/services/local-inference/voice/ffi-bindings.ts
in lockstep on any breaking shape change — the loader checks the
version at dlopen time and refuses to bind a mismatched library.
| Symbol | Purpose |
|---|---|
| eliza_inference_abi_version | Returns the static ABI version string ("3"). |
| eliza_inference_create / _destroy | Allocate / free a per-engine EliInferenceContext. |
| eliza_inference_mmap_acquire / _evict | Lazy-page / release weights for a region (tts/asr/text/dflash). |
| eliza_inference_tts_synthesize | Synchronous OmniVoice forward → fp32 PCM @ 24 kHz (batch). |
| eliza_inference_tts_stream_supported | 1 when this build implements streaming TTS, else 0. |
| eliza_inference_tts_synthesize_stream | Chunked OmniVoice forward → PCM segments via eliza_tts_chunk_cb + a final is_final tail; chunk cb returns non-zero to cancel. |
| eliza_inference_cancel_tts | Hard-cancel any in-flight TTS forward pass on the context. |
| eliza_inference_set_verifier_callback | Register the native DFlash speculative-step callback (EliVerifierEvent accepted / rejected-range / corrected token ids); cb == NULL clears it. |
| eliza_inference_asr_transcribe | Synchronous ASR forward → UTF-8 transcript (batch). |
| eliza_inference_asr_stream_supported | 1 when this build implements streaming ASR, else 0. |
| eliza_inference_asr_stream_open / _feed / _partial / _finish / _close | Streaming ASR session: feed PCM frames, read a running partial transcript (+ optional text-model token ids), force-finalize, close. |
| eliza_inference_vad_supported | 1 when this build implements native Silero VAD, else 0. |
| eliza_inference_vad_open / _process / _reset / _close | Native VAD session: 16 kHz fp32 mono, 512-sample windows, one speech probability per call. |
| eliza_inference_free_string | Free heap strings the library handed back (errors). |
ABI v2 status codes added ELIZA_ERR_CANCELLED (-7), returned by the
streaming TTS entry when the chunk callback (or cancel_tts) requested
a stop. The JS binding surfaces it as { cancelled: true }, not a throw.
ABI v3 adds the vad mmap region and the native VAD entry points above.
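The cancelled-versus-failed distinction can be sketched as follows. Only the value -7 for ELIZA_ERR_CANCELLED and the { cancelled: true } shape come from the notes above; the function name, the success shape, and the error text are assumptions made for illustration.

```typescript
// From the ABI notes above.
const ELIZA_ERR_CANCELLED = -7;

type StreamOutcome = { cancelled: boolean };

// Hypothetical helper: turn a streaming-TTS return code into a JS result.
function interpretStreamStatus(code: number, message: string | null): StreamOutcome {
  if (code >= 0) {
    return { cancelled: false };
  }
  if (code === ELIZA_ERR_CANCELLED) {
    // A requested stop is a normal outcome, so it is reported, not thrown.
    return { cancelled: true };
  }
  // Every other negative code is a real failure.
  throw new Error(`eliza_inference call failed (${code}): ${message ?? "no message"}`);
}

console.log(interpretStreamStatus(0, null));
console.log(interpretStreamStatus(ELIZA_ERR_CANCELLED, null));
```

Treating cancellation as data keeps barge-in handling out of exception paths on the caller's side.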
§4 of packages/inference/AGENTS.md calls for ONE process serving
/v1/chat/completions (+ DFlash), /v1/audio/speech, and an ASR route.
The fused llama-server mounts /v1/audio/speech directly through
committed fork source (tools/server/server.cpp namespace eliza_omnivoice,
guarded by #ifdef ELIZA_FUSE_OMNIVOICE), using the same in-process
OmniVoice runtime (ov_init / ov_synthesize) as libelizainference.
(Before W3-3 H2.c, the route was injected via
kernel-patches/server-omnivoice-route.mjs; that patcher was deleted
once the source landed in the fork.)
It returns PCM or WAV from the same process that hosts
/v1/chat/completions, so text and speech share the llama.cpp build,
GGML pin, Metal/Vulkan/CPU backend selection, and memory lifetime.
llama-omnivoice-server still builds as a small executable smoke target,
but it is no longer the product server. The production HTTP path is the
patched fused llama-server; mobile and desktop bridges can also load
libelizainference directly.
Remaining HTTP follow-up:

- A streaming ASR route, once eliza_inference_asr_stream_supported()
  advertises a true low-latency streaming decoder. Until then the JS bridge
  uses fused batch ASR, not whisper, when an Eliza-1 ASR bundle is present.

Do NOT mark eliza_inference_tts_synthesize /
eliza_inference_asr_transcribe as the streaming story by themselves:
they are the batch one-shot fallbacks. The within-a-tick handoff
AGENTS.md §4 needs is the *_stream / verifier-callback surface above.
Implementation note: ABI v2 added the streaming TTS, streaming ASR, and
verifier-callback symbols. ABI v3 adds native Silero VAD. Streaming TTS and
batch ASR are implemented on macOS Metal; current smoke runs report
tts_stream_supported()==1 and asr_stream_supported()==0. Callers use the
native streaming TTS path, the fused batch ASR path, and the JS/ONNX VAD
fallback until native streaming ASR and native VAD advertise support.
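The path selection described in that note can be sketched as a small pure function. The three flags mirror the *_supported probes; the type names, the function name, and the plan labels are hypothetical and not taken from ffi-bindings.ts.

```typescript
// Results of the three *_supported probes (1 -> true, 0 -> false).
type VoiceCapabilities = { ttsStream: boolean; asrStream: boolean; vad: boolean };

type VoicePlan = {
  tts: "native-stream" | "batch";
  asr: "native-stream" | "fused-batch";
  vad: "native" | "js-onnx";
};

// Hypothetical helper: pick the path for each voice stage from the probes.
function planVoicePaths(caps: VoiceCapabilities): VoicePlan {
  return {
    tts: caps.ttsStream ? "native-stream" : "batch",
    asr: caps.asrStream ? "native-stream" : "fused-batch",
    vad: caps.vad ? "native" : "js-onnx",
  };
}

// Values matching the smoke runs quoted above, with native VAD assumed
// not yet advertised.
const currentPlan = planVoicePaths({ ttsStream: true, asrStream: false, vad: false });
console.log(currentPlan);
```

Because the plan is derived only from the probes, a build that starts advertising streaming ASR or native VAD is picked up without a code change on the caller's side.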
Implementation note (v1, still in force): TTS and ASR on macOS Metal.
TTS keeps the OmniVoice LM / MaskGIT path on the selected accelerator. On
Apple Metal, the audio tokenizer / DAC codec region is pinned to a CPU-only
scheduler inside the same process; this avoids the previously observed
merged-ggml Metal DAC decode stall after [TTS] Decode without launching a
second model runtime or duplicating model lifecycle state. ASR uses llama.cpp
mtmd with a qwen3a backport for Qwen3-ASR GGUF bundles and requires the
canonical bundle files asr/eliza-1-asr.gguf and
asr/eliza-1-asr-mmproj.gguf; missing or ambiguous ASR assets remain a hard
ELIZA_ERR_BUNDLE_INVALID failure.
Streaming-cancel note: the v3 ABI cancellation path is correct at the FFI
boundary, but short utterances still run as a single OmniVoice chunk by default
(chunk_threshold_sec=30). The first PCM callback can therefore arrive only
after MaskGIT and DAC decode complete. For low-latency barge-in, lower the
native streaming chunk threshold through ELIZA_TTS_CHUNK_THRESHOLD_SEC and
ELIZA_TTS_CHUNK_DURATION_SEC, then measure audio quality before changing
release defaults. The smoke harness exposes the same knobs as
--chunk-threshold-sec and --chunk-duration-sec, plus
--warmup-runs for measuring a warmed TTS context before the reported
run. Its JSON report includes firstAudioMs, first/largest chunk
durations, RTF, and codecBackendPolicy. On darwin-arm64-metal-fused,
codecBackendPolicy.status === "intentional-cpu-fallback" means the
OmniVoice LM / MaskGIT path stayed on Metal while the codec scheduler was
intentionally pinned to CPU to bypass the known merged-ggml DAC decode
stall; gates should classify that as a pass-with-fallback, not as a
silent downgrade or hang.
Example 9B latency probe:

    bun packages/app-core/scripts/omnivoice-merged/tts-stream-ffi-smoke.ts \
        --bundle ~/.eliza/local-inference/models/eliza-1-9b.bundle \
        --cancel-mode none \
        --maskgit-steps 8 \
        --chunk-threshold-sec 0.25 \
        --chunk-duration-sec 0.25 \
        --warmup-runs 1

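A gate that consumes the smoke report might classify it along these lines. The field names firstAudioMs and codecBackendPolicy.status and the "intentional-cpu-fallback" value come from the note above; the latency budget parameter, the verdict labels other than pass-with-fallback, and the rule that any other status is a plain pass are assumptions of this sketch.

```typescript
type TtsSmokeReport = {
  firstAudioMs: number;
  codecBackendPolicy: { status: string };
};

type GateVerdict = "pass" | "pass-with-fallback" | "fail";

// Hypothetical gate: maxFirstAudioMs is a caller-chosen budget, not a value
// defined by the smoke harness.
function classifyTtsSmoke(smoke: TtsSmokeReport, maxFirstAudioMs: number): GateVerdict {
  // A missing or late first chunk is treated as a failure (hang or regression).
  if (!Number.isFinite(smoke.firstAudioMs) || smoke.firstAudioMs > maxFirstAudioMs) {
    return "fail";
  }
  // The codec was pinned to CPU on purpose: report it, but do not fail.
  if (smoke.codecBackendPolicy.status === "intentional-cpu-fallback") {
    return "pass-with-fallback";
  }
  return "pass";
}

// Illustrative report values, not measured numbers.
const sampleVerdict = classifyTtsSmoke(
  { firstAudioMs: 400, codecBackendPolicy: { status: "intentional-cpu-fallback" } },
  1000,
);
console.log(sampleVerdict);
```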
All errors flow through a char ** out_error parameter that the
library populates with a heap-allocated NUL-terminated message.
Callers MUST free those messages via eliza_inference_free_string.
Negative return values map to the ELIZA_ERR_* codes declared in
ffi.h — the JS binding re-projects them onto
VoiceLifecycleError.code (ram-pressure, mmap-fail,
kernel-missing, disarm-failed).
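The ownership rule for out_error can be sketched without a real library. Everything below is a stand-in: NativeResult and NativeStrings are hypothetical shapes, and free represents a wrapper around eliza_inference_free_string. The point is only the ordering: read the message, always free it, then decide whether to throw.

```typescript
// Hypothetical result of one native call: the status code plus an opaque
// handle to the heap-allocated message (0 when the library set none).
type NativeResult = { status: number; errorHandle: number };

interface NativeStrings {
  read(handle: number): string; // decode the NUL-terminated message
  free(handle: number): void;   // stand-in for eliza_inference_free_string
}

function checkNativeStatus(result: NativeResult, strings: NativeStrings): number {
  let message = "no message";
  if (result.errorHandle !== 0) {
    try {
      message = strings.read(result.errorHandle);
    } finally {
      // Free even if decoding throws, so the message never leaks.
      strings.free(result.errorHandle);
    }
  }
  if (result.status < 0) {
    throw new Error(`native call failed (${result.status}): ${message}`);
  }
  return result.status;
}

// In-memory fake used only to exercise the helper.
const fakeMessages = new Map<number, string>([[42, "mmap failed"]]);
const freedHandles: number[] = [];
const fakeStrings: NativeStrings = {
  read: (handle) => fakeMessages.get(handle) ?? "",
  free: (handle) => {
    freedHandles.push(handle);
  },
};

try {
  checkNativeStatus({ status: -2, errorHandle: 42 }, fakeStrings);
} catch (err) {
  console.log((err as Error).message, freedHandles);
}
```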
Production loader (Bun runtime via Electrobun + Capacitor):

    import { loadElizaInferenceFfi } from
      "@elizaos/app-core/services/local-inference/voice/ffi-bindings";

    const ffi = loadElizaInferenceFfi("/path/to/libelizainference.dylib");
    const ctx = ffi.create(bundleRoot);
    ffi.mmapAcquire(ctx, "tts");
    // Output buffer: 4 s of fp32 PCM at 24 kHz.
    const out = new Float32Array(24_000 * 4);
    const samples = ffi.ttsSynthesize({
      ctx, text: "hello world", speakerPresetId: null, out,
    });
    ffi.mmapEvict(ctx, "tts");
    ffi.destroy(ctx);
    ffi.close();

The loader throws VoiceLifecycleError({code:"kernel-missing"}) when
the runtime is not Bun, when dlopen fails, or when the library's
ABI version disagrees with the binding. It does NOT fall back to a
stub on failure — per packages/inference/AGENTS.md §3 + §9, every
startup precondition is a structured throw.
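The three preconditions can be sketched as a single guard. The VoiceLifecycleError class below is a local stand-in for the real one, and assertLoadable with its LoadProbe argument is hypothetical; only the "kernel-missing" code and the three failure cases come from the paragraph above.

```typescript
// Local stand-in for the real VoiceLifecycleError (assumed shape).
class VoiceLifecycleError extends Error {
  code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

// libraryAbi is null when dlopen failed, else the string reported by
// eliza_inference_abi_version.
type LoadProbe = { isBun: boolean; libraryAbi: string | null };

function assertLoadable(probe: LoadProbe, bindingAbi: string): void {
  if (!probe.isBun) {
    throw new VoiceLifecycleError("kernel-missing", "runtime is not Bun");
  }
  if (probe.libraryAbi === null) {
    throw new VoiceLifecycleError("kernel-missing", "dlopen failed");
  }
  if (probe.libraryAbi !== bindingAbi) {
    throw new VoiceLifecycleError(
      "kernel-missing",
      `ABI mismatch: library ${probe.libraryAbi}, binding ${bindingAbi}`,
    );
  }
}

try {
  assertLoadable({ isBun: true, libraryAbi: "2" }, "3");
} catch (err) {
  console.log((err as VoiceLifecycleError).code, (err as Error).message);
}
```

There is deliberately no branch that substitutes the stub: each failed precondition ends in a throw.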

To build the stub and run its checks:

    make -C packages/app-core/scripts/omnivoice-merged
    # → libelizainference_stub.dylib (macOS) or .so (linux)

    # Symbol verification:
    nm -gU libelizainference_stub.dylib | grep eliza_inference_

    # Fail-closed real-library smoke. This intentionally renames the stub
    # as libelizainference and verifies the real fused-symbol checker
    # rejects it with an OMNIVOICE_FUSE_VERIFY.json failure report:
    make -C packages/app-core/scripts/omnivoice-merged verify-stub-rejected

    # JS-side coverage (requires Bun on PATH for the integration scenarios):
    cd packages/app-core
    bunx vitest run src/services/local-inference/voice/ffi-bindings.test.ts

The test harness spawns a bun -e subprocess that loads the stub
dylib via bun:ffi and exercises create/destroy/mmapEvict/
ttsSynthesize/ABI-mismatch scenarios. The vitest worker itself runs
on Node 22 (no bun:ffi), so the pure-unit cases assert that the
loader throws structurally on the no-Bun path.
After build-llama-cpp-dflash.mjs --target <fused-target> installs
libelizainference, the build runs the same verifier as this CLI:

    node packages/app-core/scripts/omnivoice-merged/verify-symbols.mjs \
        --out-dir <installed-bin-dir> \
        --target darwin-arm64-metal-fused

The verifier rejects stub-only artifacts, missing llama_* exports
unless Darwin re-exports libllama.dylib, any missing ABI v3
eliza_inference_* entry (the full streaming-voice + verifier-callback
surface in the table above), and missing concrete OmniVoice entries
such as ov_init, ov_synthesize, and ov_audio_free. A failed probe
exits non-zero and leaves OMNIVOICE_FUSE_VERIFY.json in the output
directory for build reports.
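The acceptance rules above can be approximated by a pure function over a list of exported names. This is a sketch, not the logic in verify-symbols.mjs: the function name, its parameters, and the result shape are hypothetical, and the ABI list passed in the demo is a small subset of the table rather than the full v3 surface.

```typescript
// Concrete OmniVoice entries named in the paragraph above.
const REQUIRED_OV_SYMBOLS = ["ov_init", "ov_synthesize", "ov_audio_free"];

type SymbolProbe = { ok: boolean; missing: string[] };

function probeFusedExports(
  exported: string[],
  requiredAbi: string[],
  darwinReexportsLibllama: boolean,
): SymbolProbe {
  // Mach-O symbol names carry a leading underscore; strip one so the same
  // checks work on nm output from macOS and Linux.
  const names = new Set(exported.map((sym) => sym.replace(/^_/, "")));
  const missing: string[] = [];
  for (const required of requiredAbi.concat(REQUIRED_OV_SYMBOLS)) {
    if (!names.has(required)) missing.push(required);
  }
  const hasLlama = Array.from(names).some((sym) => sym.startsWith("llama_"));
  if (!hasLlama && !darwinReexportsLibllama) missing.push("llama_*");
  return { ok: missing.length === 0, missing };
}

// A stub-like export list: ABI entries present, no ov_* or llama_* symbols.
const abiSubset = ["eliza_inference_abi_version", "eliza_inference_create"];
const stubProbe = probeFusedExports(
  ["_eliza_inference_abi_version", "_eliza_inference_create"],
  abiSubset,
  false,
);
console.log(stubProbe);
```

A stub-only artifact fails here for the same reason it fails the real verifier: it carries the eliza_inference_* surface but none of the concrete OmniVoice entries.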