plugins/plugin-local-inference/native/configs/gpu/SPECS.md
Source of truth for the per-GPU JSON configs in this directory
(3090.json, 4090.json, 5090.json, h200.json) and for the
gpu-autotune.ts helper in packages/app-core/src/services/local-inference/.
Scope: one GPU per host. No tensor parallelism, no NVLink splits, no multi-tenant scheduling. The product target is "one conversation at a time on a single GPU box."
Inference engine: llama.cpp / llama-server (the buun-llama-cpp fork that ships the QJL + Polar KV quant kernels). This file does not cover vLLM / SGLang — those have different memory and parallelism models.
All expected_metrics in the JSON configs are extrapolated, not measured. The _provenance: "extrapolated" field marks that explicitly. A real benchmark on each card replaces these once a runner is wired.
| Card | Arch (CC) | VRAM | Mem-BW | FP16 TFLOPs | FP8 TFLOPs | FP4 TFLOPs | INT4 TFLOPs | Max ctx (rec.) | Max parallel | Target RTF (voice) |
|---|---|---|---|---|---|---|---|---|---|---|
| RTX 3090 | Ampere sm_86 | 24 GiB GDDR6X | 936 GB/s | 71 | — | — | 284 | 65 536 | 4 | 0.55 |
| RTX 4090 | Ada Lovelace sm_89 | 24 GiB GDDR6X | 1 008 GB/s | 165 | 660 (E4M3/E5M2) | — | 660 | 131 072 | 8 | 0.40 |
| RTX 5090 | Blackwell sm_120 | 32 GiB GDDR7 | 1 792 GB/s | 209 | 838 | 1 676 | 838 | 262 144 | 12 | 0.30 |
| H200 SXM | Hopper sm_90 | 141 GiB HBM3e | 4 800 GB/s | 989 | 1 979 | — | 1 979 | 1 048 576 | 16 | 0.20 |
RTF = real-time factor; lower is better. For voice streaming we need RTF < 1 for steady-state and < 0.5 to leave headroom for TTS + ASR.
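The two thresholds above can be sketched as a small check. This assumes the conventional definition of RTF (processing time divided by the duration of audio produced); `realTimeFactor` and `classifyRtf` are hypothetical names for illustration, not functions from the codebase.

```typescript
// Assumption: RTF = seconds of compute per second of audio (conventional definition).
type RtfVerdict = "headroom" | "steady-state-only" | "too-slow";

function realTimeFactor(processingSeconds: number, audioSeconds: number): number {
  if (audioSeconds <= 0) throw new RangeError("audioSeconds must be positive");
  return processingSeconds / audioSeconds;
}

// Thresholds taken from the text: < 1 keeps up, < 0.5 leaves room for TTS + ASR.
function classifyRtf(rtf: number): RtfVerdict {
  if (rtf < 0.5) return "headroom";
  if (rtf < 1) return "steady-state-only";
  return "too-slow";
}
```

For instance, 2 s of compute for 10 s of audio gives an RTF of 0.2, which classifies as "headroom".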
CAPABILITIES.json; the runtime probes it before promising QJL/Polar.
llama.cpp issues:
Two budgets dominate every choice:
Transformer KV cost: bytes/token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element
(factor of 2 = K and V).
Eliza-1 bundles (Qwen3.5 / 3.6 base):
| Bundle | n_layers | n_kv_heads | head_dim | FP16 KiB/tok | Q8K/Q4V KiB/tok | QJL+Polar KiB/tok |
|---|---|---|---|---|---|---|
| 0.8B / 2B class | 28 | 8 | 128 | 112 | 88 | 28 |
| 4B | 36 | 8 | 128 | 144 | 113 | 36 |
| 9B | 48 | 8 | 128 | 192 | 150 | 48 |
| 27B | 62 | 8 | 128 | 248 | 194 | 62 |

KV per slot at representative context lengths (tokens × KiB/tok):

| Bundle | Ctx | KV quant | KV per slot |
|---|---|---|---|
| 2B | 32k | Q8K/Q4V | 32 768 × 88 KiB = 2.75 GiB |
| 2B | 32k | QJL+Polar | 32 768 × 28 KiB = 0.88 GiB |
| 9B | 64k | QJL+Polar | 65 536 × 48 KiB = 3.0 GiB |
| 27B | 32k | QJL+Polar | 32 768 × 62 KiB ≈ 1.94 GiB |
| 27B | 128k | QJL+Polar | 131 072 × 62 KiB = 7.75 GiB |
| 27B | 256k | QJL+Polar | 262 144 × 62 KiB = 15.5 GiB |
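
The KV arithmetic in this section can be reproduced with a few lines. This is a minimal sketch: `KvShape`, `kvBytesPerToken` and `kvGibPerSlot` are hypothetical names, and the 0.5 bytes/element figure for QJL+Polar is an assumption inferred from that column being one quarter of the FP16 column.

```typescript
interface KvShape {
  nLayers: number;
  nKvHeads: number;
  headDim: number;
}

const KIB = 1024;
const GIB = 1024 ** 3;

// bytes/token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element
function kvBytesPerToken(shape: KvShape, bytesPerElement: number): number {
  return 2 * shape.nLayers * shape.nKvHeads * shape.headDim * bytesPerElement;
}

// KV footprint of one slot holding ctxTokens tokens, in GiB.
function kvGibPerSlot(shape: KvShape, bytesPerElement: number, ctxTokens: number): number {
  return (kvBytesPerToken(shape, bytesPerElement) * ctxTokens) / GIB;
}

// 9B bundle row from the table above.
const bundle9b: KvShape = { nLayers: 48, nKvHeads: 8, headDim: 128 };
const fp16KibPerToken = kvBytesPerToken(bundle9b, 2) / KIB; // 192
// Assumption: 4 bits (0.5 bytes) per element for the QJL+Polar column.
const packedSlotGib = kvGibPerSlot(bundle9b, 0.5, 65_536); // 3
```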
VRAM available for KV ≈ vram - model_weights - reserved_headroom.
Reserved headroom (driver + activations + drafter): 3 GiB on the 24 GiB
cards, 4 GiB on the 5090, 6 GiB on the H200. See reservedHeadroomGb() in
packages/shared/src/local-inference/gpu-profiles.ts.
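As a rough illustration of the budget formula, the sketch below subtracts weights and headroom from VRAM and counts how many KV slots fit. It is not the real reservedHeadroomGb(); the lookup tables simply restate the figures from the text, and `kvBudgetGib`, `slotsThatFit` and the 6 GiB weight size in the example are hypothetical.

```typescript
type CardId = "3090" | "4090" | "5090" | "h200";

// VRAM and reserved headroom per card, in GiB, as stated in this document.
const VRAM_GIB: Record<CardId, number> = { "3090": 24, "4090": 24, "5090": 32, h200: 141 };
const HEADROOM_GIB: Record<CardId, number> = { "3090": 3, "4090": 3, "5090": 4, h200: 6 };

// VRAM available for KV ≈ vram - model_weights - reserved_headroom (never negative).
function kvBudgetGib(card: CardId, modelWeightsGib: number): number {
  return Math.max(0, VRAM_GIB[card] - modelWeightsGib - HEADROOM_GIB[card]);
}

// How many whole KV slots of a given size fit in that budget.
function slotsThatFit(card: CardId, modelWeightsGib: number, kvGibPerSlot: number): number {
  if (kvGibPerSlot <= 0) throw new RangeError("kvGibPerSlot must be positive");
  return Math.floor(kvBudgetGib(card, modelWeightsGib) / kvGibPerSlot);
}

// Hypothetical example: 6 GiB of weights on a 4090, 3 GiB of KV per slot.
const exampleBudget = kvBudgetGib("4090", 6); // 15
const exampleSlots = slotsThatFit("4090", 6, 3); // 5
```

The slot count here is only a memory ceiling; it says nothing about whether that much parallelism meets the RTF target.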
RTX 3090 (24 GiB, no FP8) — uses Q8K / Q4V KV (Ampere has no q4_polar kernel on the Polar fork).
RTX 4090 (24 GiB, FP8) — QJL + Polar KV available.
RTX 5090 (32 GiB, FP8/FP4) — same KV math, 8 GiB more headroom.
H200 (141 GiB) — the marquee box.
- batch_size = logical batch fed to the prefill kernel per server tick. Doubles with VRAM (more headroom for activations) but caps at 4096; beyond that, llama.cpp scheduler overhead eats the win.
- ubatch_size = physical micro-batch the GPU launches. Ada / Blackwell / Hopper want ≥ 512 to keep tensor cores saturated; Ampere is happiest at 256-512.

| Card | batch | ubatch | Why |
|---|---|---|---|
| 3090 | 2048 | 512 | Ampere; mem-bw-bound past 512 ubatch |
| 4090 | 2048 | 512 | Same 24 GiB and similar mem-bw as the 3090; FP8 helps prompt eval, not decode |
| 5090 | 4096 | 1024 | More SMs + GDDR7 bw lets the bigger ubatch land |
| H200 | 4096 | 2048 | HBM3e + sm_90 tensor cores; bigger ubatch wins |
n_gpu_layers: always 999 (all layers on GPU). Single-GPU only; we never split
across cards in this tier. The literal -1 works equally well in
llama.cpp but 999 is unambiguous and survives clamping in older builds.
split_mode / main_gpu: always "none" / 0. We never run multi-GPU.
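To make the fixed settings and the per-card batch table concrete, here is a sketch that turns them into a llama-server argument list. The flag names are the upstream llama.cpp ones (`--n-gpu-layers`, `--split-mode`, `--main-gpu`, `--batch-size`, `--ubatch-size`); whether the fork keeps them unchanged is an assumption, and `BATCH_BY_CARD` and `llamaServerArgs` are hypothetical names, not the autotune helper's API.

```typescript
interface CardBatchConfig {
  batch: number;
  ubatch: number;
}

// Values copied from the batch / ubatch table in this document.
const BATCH_BY_CARD: Record<string, CardBatchConfig> = {
  "3090": { batch: 2048, ubatch: 512 },
  "4090": { batch: 2048, ubatch: 512 },
  "5090": { batch: 4096, ubatch: 1024 },
  h200: { batch: 4096, ubatch: 2048 },
};

function llamaServerArgs(card: string): string[] {
  const cfg = BATCH_BY_CARD[card];
  if (!cfg) throw new Error(`no tuned batch config for ${card}`);
  return [
    "--n-gpu-layers", "999", // all layers on the GPU
    "--split-mode", "none",  // single GPU, never split
    "--main-gpu", "0",
    "--batch-size", String(cfg.batch),
    "--ubatch-size", String(cfg.ubatch),
  ];
}
```

Calling `llamaServerArgs("5090")` yields the five flag/value pairs with a batch of 4096 and a ubatch of 1024; an unknown card throws rather than guessing.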
cache_type_k / cache_type_v:
- q8_0 / q4_polar. The q4_polar Polar-quant V kernel exists for sm_86 but the qjl1_256 K kernel does not, so K falls back to Q8.
- qjl1_256 / q4_polar. Both kernels are pre-built and exposed in CAPABILITIES.json.

ctx_checkpoints / ctx_checkpoint_interval: used by the voice optimistic-rollback path. Each mid-prefill snapshot costs roughly the slot's KV size at the checkpoint position (per-checkpoint ≈ slot_kv_at_checkpoint). Defaults per
ctxCheckpointsForTier() in packages/shared/src/local-inference/catalog.ts:

| Bundle | ctx_checkpoints | interval |
|---|---|---|
| 0.8B / 2B | 4 | 4 096 |
| 4B / 9B | 8 | 8 192 |
| 27B (incl. 256k) | 16 | 8 192 |
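
The per-checkpoint cost rule can be sketched as below. It assumes checkpoints sit at successive multiples of the interval, which the text does not state, and it says nothing about where snapshots are stored; `CheckpointPlan` and `checkpointCostsGib` are hypothetical names.

```typescript
interface CheckpointPlan {
  count: number;          // ctx_checkpoints
  intervalTokens: number; // ctx_checkpoint_interval
}

const KIB_PER_GIB = 1024 * 1024;

// Cost of each snapshot = KV held by the slot at that checkpoint position.
// Assumption: checkpoint k is taken at token position k × intervalTokens.
function checkpointCostsGib(plan: CheckpointPlan, kvKibPerToken: number): number[] {
  const costs: number[] = [];
  for (let k = 1; k <= plan.count; k++) {
    costs.push((k * plan.intervalTokens * kvKibPerToken) / KIB_PER_GIB);
  }
  return costs;
}

// 0.8B / 2B row: 4 checkpoints every 4 096 tokens, 28 KiB/tok (QJL+Polar column).
const smallBundleCosts = checkpointCostsGib({ count: 4, intervalTokens: 4096 }, 28);
// First snapshot: 0.109375 GiB; last snapshot: 0.4375 GiB.
```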
Per-card, picked from dflashDraftMin / dflashDraftMax in gpu-profiles.ts:
| Card | min | max |
|---|---|---|
| 3090 | 4 | 16 |
| 4090 | 4 | 24 |
| 5090 | 4 | 24 |
| H200 | 8 | 32 |
Draft window scales with compute throughput, not memory. Bigger cards can verify a longer drafter run per round without a latency hit.
p_min / draft_p_min: 0.5 everywhere; a drafter token is accepted only if p ≥ 0.5. This is a
conservative default for voice latency. Higher values mean fewer
accepted drafts; lower values raise rollback waste.
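The interaction between the draft window and p_min can be sketched as follows. This is an illustration, not the server's implementation: it assumes drafting stops at the first token whose drafter probability is below p_min, and that a run shorter than the window minimum is discarded; `DraftWindow` and `usableDraftLength` are hypothetical names.

```typescript
interface DraftWindow {
  min: number;
  max: number;
}

// Per-card windows copied from the table above.
const DRAFT_WINDOW: Record<string, DraftWindow> = {
  "3090": { min: 4, max: 16 },
  "4090": { min: 4, max: 24 },
  "5090": { min: 4, max: 24 },
  h200: { min: 8, max: 32 },
};

// Number of drafted tokens worth sending to the verifier.
function usableDraftLength(probs: number[], window: DraftWindow, pMin = 0.5): number {
  let n = 0;
  for (const p of probs) {
    // Stop at the window cap or at the first low-confidence token.
    if (n >= window.max || p < pMin) break;
    n += 1;
  }
  // Assumption: runs shorter than the window minimum are not used at all.
  return n >= window.min ? n : 0;
}
```

With the 3090 window, a run of probabilities `[0.9, 0.8, 0.7, 0.6, 0.4, 0.9]` keeps 4 tokens, while `[0.9, 0.3]` keeps none because the run is shorter than the minimum.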
sm_86.
qjl1_256 for sm_120. The runtime probes
CAPABILITIES.json; missing → fall back to q8_0/q4_0 and surface
a structured warning rather than falling back silently. Don't fix in the autotune;
fix in the kernel build.
--mlock. Beyond 64k, opt-in kvSpillToCpu=true.

The autotune helper merges in this order (later wins):
1. gpu-profiles.ts static profile defaults
2. packages/inference/configs/gpu/<id>.json (this directory)
3. bundle_recommendations.<bundle>
4. overrides arg to selectGpuConfig() (used by the CLI)
5. ELIZA_LOCAL_* environment variables (see dflash-server.ts for the full list, e.g. ELIZA_LOCAL_UBATCH_SIZE, ELIZA_LOCAL_N_PARALLEL).

When selectGpuConfig() gets a GPU it doesn't recognize, it falls back
on a VRAM bucket:

| VRAM (GiB) | Bucket | Falls back to |
|---|---|---|
| < 12 | tiny | Returns null — use catalog defaults |
| 12 – 18 | small | RTX 3090 profile, parallel halved |
| 18 – 28 | mid | RTX 3090 |
| 28 – 40 | mid-plus | RTX 5090 (capped) |
| 40 – 80 | large | RTX 5090 |
| ≥ 80 | huge | H200 |
Bucket fallback is "best effort" — if the user has an unsupported card, log the fallback choice loudly so they know they're not on a tuned profile.
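The bucket table and the "later wins" merge order described in this section can be sketched together. This is not the real selectGpuConfig(); `vramBucket`, `fallbackProfile` and `mergeLayers` are hypothetical names, and treating each range as lower-inclusive is an assumption the table leaves open.

```typescript
type Bucket = "tiny" | "small" | "mid" | "mid-plus" | "large" | "huge";
type Settings = Record<string, number | string>;

interface Fallback {
  bucket: Bucket;
  profile: "3090" | "5090" | "h200";
  note: string;
}

// Assumption: each range includes its lower bound (so 80 GiB is "huge").
function vramBucket(vramGib: number): Bucket {
  if (vramGib < 12) return "tiny";
  if (vramGib < 18) return "small";
  if (vramGib < 28) return "mid";
  if (vramGib < 40) return "mid-plus";
  if (vramGib < 80) return "large";
  return "huge";
}

// Returns null for the tiny bucket, meaning "use catalog defaults".
function fallbackProfile(vramGib: number): Fallback | null {
  const bucket = vramBucket(vramGib);
  switch (bucket) {
    case "tiny": return null;
    case "small": return { bucket, profile: "3090", note: "parallel halved" };
    case "mid": return { bucket, profile: "3090", note: "" };
    case "mid-plus": return { bucket, profile: "5090", note: "capped" };
    case "large": return { bucket, profile: "5090", note: "" };
    case "huge": return { bucket, profile: "h200", note: "" };
  }
}

// Later layers win; undefined layers are skipped.
function mergeLayers(...layers: Array<Settings | undefined>): Settings {
  return Object.assign({}, ...layers);
}

// Example: an unrecognized 48 GiB card lands in "large" and borrows the 5090 profile.
const chosen = fallbackProfile(48);
if (chosen) console.warn(`untuned GPU: using ${chosen.profile} profile (${chosen.bucket} bucket)`);
```

Passing the layers in the documented order, for example `mergeLayers(profileDefaults, cardJson, bundleRecs, cliOverrides, envOverrides)` with hypothetical variable names, gives environment variables the final say.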