packages/shared/src/local-inference/CONTEXT_SCALING.md
2026-05-15 audit note: the small-tier 32k recommendation in this memo is superseded by the current staging contract. The audited 0.8B GGUF declares
qwen35.context_length = 262144; release validation still blocks artifacts published under misleading32k/64ktext paths, while the catalog/platform plan use a 128k release floor for the active 0_8B/2B/4B/9B/27B tiers unless a distinct verified long-context catalog tier is introduced.
This is the analysis behind the per-tier contextLength in catalog.ts, the
dflash.{contextSize,draftContextSize} knobs in runtimeFor(), and the
RAM-budget / spill plumbing in ram-budget.ts + kv-spill.ts. It answers one
question: given a phone / 16 GB GPU / 24 GB GPU and an Eliza-1 tier, what is
the largest context window that actually fits, and which optimization should run
when there is spare memory vs when there isn't?
The short version: on the small tiers (0.8B / 2B) the binding limit is the
base model's positional range (Qwen3.5-0.8B / 2B ship max_position_embeddings = 40960), not the KV-cache memory — the compressed cache fits 32k in well
under 1 GiB on any device. On the big tiers (9B / 27B) the limit flips:
positional range is huge, KV-cache memory is what
gates you, and that is where QJL+PolarQuant compression and (past 64k) KV spill
do the heavy lifting. The catalog's job is to ship the largest native window
per tier and then have the runtime pick a context within it that fits the
device. That second half — a memory-aware context selector — is the gap; see
"Wins" below.
KV-cache footprint per generated token, summed over all full-attention layers:
bytes/token = n_layers * n_kv_heads * head_dim * (k_bits + v_bits) / 8
Qwen3 dense base shapes (from each base model's config.json; the registry's
model_registry.py header table is the source of truth):
| Eliza-1 tier | base model | layers | kv heads | head_dim | native ctx (max_position_embeddings) | GGUF (quant) |
|---|---|---|---|---|---|---|
eliza-1-0_8b | Qwen3.5-0.8B-Base | 28 | 8 | 128 | 262144 | ~0.7 GB (Q4-class) |
eliza-1-2b | Qwen3.5-2B-Base | 28 | 8 | 128 | 262144 | ~1.5 GB (Q4-class) |
eliza-1-4b | Qwen3.5-4B-Base | 36 | 8 | 128 | 262144 | ~2.6 GB (Q4_K_M) |
eliza-1-9b† | Qwen3.5-9B | 8‡ | 4 | 256 | 128k target | ~5.4 GB (Q4_K_M) |
eliza-1-27b† | Qwen3.6-27B | 16‡ | 4 | 256 | 128k target | ~16.8 GB (Q4_K_M) |
† 9B/27B use the registry's hybrid-attention geometry: only 8 / 16 of the layers carry a KV cache; the rest are linear-attention with constant SSM state. That is why their per-token KV is much smaller than a dense model of the same parameter count would be. ‡ KV-bearing (full-attention) layer count, per the published 3:1 ratio.
Bits-per-coordinate per cache layout (what the dflash-server actually loads —
see kvCacheForContext() in catalog.ts):
| layout | K bits/coord | V bits/coord | combined B/coord | notes |
|---|---|---|---|---|
f16 | 16.0 | 16.0 | 4.00 | stock llama.cpp default |
q8_0 | 8.5 | 8.5 | 2.12 | stock; lossless-ish |
qjl1_256 K + turbo3_0 V | ~2.13 | ~3.5 | ~0.70 | the ≤8k-context fallback layout (TBQ3 V) |
qjl1_256 K + q4_polar V | ~2.13 | ~4.5 | ~0.83 | the default for every shipping tier (all are >8k context) |
QJL on the K-cache realizes ~7.53× vs f16 — not the marketing 16× — because
each 256-coord sketch block carries a per-token bf16 norm (≈34 bytes of
overhead per block of 256 sketch dims). PolarQuant on the V-cache is ~4-bit +
per-block scales. These are the figures model_registry._compute_inference_mem
and kv-spill.estimateQuantizedKvBytesPerToken are both sized against
(QUANTIZED_KV_BYTES_PER_TOKEN_BY_PARAMS = 1200 / 2400 / 9000 / 22000 B/tok for
0.8B / 2B/9B/27B — consistent with the per-layer math here once you account for
the realized ratios).
Resulting bytes/token across the whole KV cache:
| tier | f16 | q8_0 | qjl+turbo3 (≤8k fallback) | qjl+polarq4 (shipping default) |
|---|---|---|---|---|
| 0.8B | 114.7 kB | 60.9 kB | ~20.2 kB | ~23.7 kB |
| 2B | 114.7 kB | 60.9 kB | ~20.2 kB | ~23.7 kB |
| 4B | 147.5 kB | 78.3 kB | ~25.9 kB | ~30.5 kB |
| 9B | 32.8 kB | 17.4 kB | ~5.8 kB | ~6.8 kB |
| 27B | 65.5 kB | 34.8 kB | ~11.5 kB | ~13.6 kB |
(9B/27B are smaller per-token than 0.8B because of the 3:1 linear-attention ratio — only 8/16 layers cache, and the per-coord cost is identical.)
Budget model: available_for_KV = device_mem − GGUF_weights − working_set,
where working_set is the compute/activation/scratch buffers (≈0.6 GB for 0.8B,
0.8 GB for 2B, 1.0 GB for 4B, 1.5 GB for 9B, 2.5 GB for 27B). For RAM-budget
hosts (phones, Apple Silicon, CPU) subtract another ~1.5 GB OS reserve
(DEFAULT_RAM_HEADROOM_RESERVE_MB); the GPU columns assume dedicated VRAM.
Numbers are "context tokens the cache can hold", rounded.
| tier | layout | 4 GB | 8 GB |
|---|---|---|---|
| 0.8B | f16 | ~27k | ~65k |
| 0.8B | qjl+polarq4 | ~131k | ~312k |
| 2B | f16 | ~19k | ~56k |
| 2B | qjl+polarq4 | ~90k | ~271k |
| 4B | f16 | — | ~32k |
| 4B | qjl+polarq4 | ~14k | ~155k |
| 9B | qjl+polarq4 | — | ~174k |
Takeaway: on a 4 GB phone the 0.8B at the shipping compressed cache fits
~130k tokens — far past the model's 40960-token positional ceiling. The KV
cache is not the limiter at the small end; the base model's RoPE range is.
The 2B is comfortable on any modern (≥6 GB) phone. The 4B only fits an 8 GB
phone, and only with the compressed cache. The 9B needs ~8 GB and its small
hybrid KV — borderline; reserve it for 12 GB+ phones / 24 GB-RAM tablets, which
is exactly what minRamGb: 12 does today.
| tier | layout | 16 GB | 24 GB |
|---|---|---|---|
| 0.8B | f16 | ~139k | ~214k |
| 0.8B | q8_0 | ~263k | ~404k |
| 0.8B | qjl+polarq4 | ~674k | ~1.0M |
| 2B | f16 | ~131k | ~206k |
| 2B | qjl+polarq4 | ~633k | ~995k |
| 4B | f16 | ~90k | ~149k |
| 4B | qjl+polarq4 | ~436k | ~718k |
| 9B | f16 | ~298k | ~560k |
| 9B | qjl+polarq4 | ~1.4M | ~2.7M |
| 27B | f16 | — | ~77k |
| 27B | q8_0 | — | ~145k |
| 27B | qjl+polarq4 | — | ~372k |
Takeaways:
128k catalog
default is the right production call until a separately verified
long-context 27B artifact exists.65536 catalog window with ~8.6 GB of headroom even at
f16 KV. The compressed cache would let the 9B run much longer windows if
the weights supported them (they don't natively yet — see "RoPE extension").Ranked by confidence × impact.
Today recommendation.ts picks a model per device, and pickFittingContextVariant
picks among pre-baked context variants of the same model line when such
variants exist.
There is no path that says "the device has 14 GB of unused VRAM
after loading the 0.8B at its default 32k — bump contextSize toward the
model's native ceiling". The numbers above show every device has slack:
max_position_embeddings
(262144). The catalog's 131072 is a deliberate release floor below that. A
memory-aware selector could safely raise the runtime contextSize to
262144 on roomy devices when the GGUF metadata and verification evidence
agree. Do not bump contextLength in the catalog past the base model's
positional range without a YaRN/RoPE-scaled GGUF.131072. Native range extends to 128k+ for the verified
variant. On 16 GB the compressed KV at 128k is small; the deciding factor is
weight residency and backend support.131072 on 24 GB uses ~1.7 GiB compressed KV. There's room for
longer contexts, but those must remain runtime stretch experiments until a
separately verified long-context 27B artifact passes latency and spill tests.Implementation sketch: after the model is chosen, compute
maxFittingContext = floor((memBytes − weightBytes − workingSet) / kvBytesPerToken)
using the same estimateQuantizedKvBytesPerToken figure kv-spill.ts uses,
clamp to min(model.contextLength, baseModelNativeContext), and pass that as
dflash.contextSize instead of the static contextLength. This is the
"largest model+context combo that fits the available RAM/VRAM with a safety
margin, preferring qjl+polarq4" the task asks for.
kvCacheForContext() already returns { typeK: "qjl1_256", typeV: "q4_polar" }
for any context > 8192, and every shipping tier is > 8k. So the 0.8B / 2B at 32k
already cost ~0.6-0.7 GiB of KV, not 3.5 GiB. Nothing to change — but the doc
above is the justification, and the memory-aware selector (above) must keep
using qjl+polarq4 as the assumed layout, not f16.
eliza-1-4b is in catalog.tsmodel_registry.py has a real, buildable qwen3.5-4b → eliza-1-4b entry; the
catalog now includes the 4B tier at elizaos/eliza-1/bundles/4b. The 4B fits
an 8 GB phone when the compressed cache path is available (compressed cache,
~155k ceiling, though the base model caps at 40960) and is a strong 16 GB-GPU
default (~436k KV ceiling, ~90k even at f16).
contextLength bump for the small tiersIt is tempting to read "0.8B fits 800k tokens of KV on 16 GB" and bump
contextLength beyond the release floor. Don't. The current catalog
standardizes the active tiers at 128k; the runtime selector should still lower
contextSize on constrained devices rather than letting a roomy desktop change
the release contract by accident.
How the catalog/harness handles low-RAM hosts today:
FIRST_RUN_DEFAULT_MODEL_ID = "eliza-1-2b" — the smallest tier that fits
"the broadest range of hardware (modern phone or laptop)". On first run with
no preference, that's what loads.eliza-1-2b fall back to eliza-1-0_8b via the
mobile ladder in recommendation.ts: TEXT_SMALL: [0_8b, 2b],
TEXT_LARGE: [2b, 0_8b]. canFit runs assessRamFit against the device's
total RAM (mobile shares GPU memory with system RAM), with the OS headroom
reserve and a download-size guardrail (wontFitRatio 0.8, tightRatio 0.65).assessRamFit → resolveRamBudget: for an installed Eliza-1 bundle it reads
ramBudgetMb.{min,recommended} from eliza-1.manifest.json; otherwise it
synthesizes recommendedMb = minRamGb + (compressed KV bytes/token × contextLength). So "boots" and "runs a long session comfortably" are
genuinely different lines, sized off the compressed KV footprint.kv-spill.ts
pages cold KV out to host RAM (cpu) or NVMe (disk), keeping the hot tail
resident — but it hard-fails with KvSpillUnsupportedError if the
worst-case cold-page restore (one 256-token page over PCIe ≈12 GB/s, or NVMe
≈1.5 GB/s) can't meet the latency budget (200 ms voice / 1500 ms text). No
silent-slow fallback. On phones there is no discrete GPU, so the restore
class is disk-nvme: a 0.8B page is ~5 MB ÷ 1.5 GB/s ≈ 3.4 ms — fine for
text, fine for voice. Spill is viable on phones; it just rarely triggers
because (a) the small tiers cap at 40960 tokens anyway and (b) the compressed
cache at 40k is under 1 GiB.What's missing on mobile: the same memory-aware context selector. A 4 GB
phone running the 0.8B at the catalog's static 32k is fine, but a 4 GB phone
running the 0.8B could run 40960 (the model's native max) for the same KV
cost it already budgets — there's no reason to leave the last ~8k of the
model's range on the table. And an 8 GB phone could run the 4B (if catalogued)
instead of the 2B. The selector should: enumerate tiers fitting the device →
pick the largest → pick contextSize = min(tier.contextLength, baseNativeContext, maxFittingContextAtCompressedCache − safetyMargin).
The user's framing: "optimizations that run when we have extra overhead vs when we can't." A 24 GB GPU running a 0.8B has ~22 GB of spare VRAM — that slack can buy a bigger model, a bigger context, a more accurate KV cache (f16), a bigger speculative drafter, or more parallel contexts (continuous batching). A 4 GB phone has none of that — smallest model, most aggressive KV quant, KV spill armed. Decision table (after a tier is chosen for the device, working out what to do with the remaining headroom):
| device / situation | model | KV layout | context (runtime contextSize) | drafter / spec-decode | parallelism (optimizations.parallel) | spill |
|---|---|---|---|---|---|---|
| Phone, ~4 GB usable | 0.8B | qjl+polarq4 (forced) | min(40960, fit) — currently 32k | drafter on, draftMax 6, ngl=auto | 4 (or 2 if KV is the squeeze) | armed (disk-nvme); hard-fail if >1.5 s |
| Phone, ~6-8 GB usable | 2B (or 4B*) | qjl+polarq4 | min(40960, fit) — 32k native cap | drafter on | 4 | armed |
| Phone/tablet, 12 GB+ RAM | 9B | qjl+polarq4 | up to 65536 (catalog) — fits easily | 9B-distilled drafter | 4 | rarely needed |
| 16 GB GPU | 9B | qjl+polarq4 (or f16 — 16 GB fits f16 KV at 64k with 8.6 GB to spare → use f16 for accuracy) | 65536; could stretch to 128k on a RoPE-ext variant | drafter on, ngl=max, draftGpuLayers=max | 4-8 | not needed |
| 16 GB GPU, want long context on small model | 2B | qjl+polarq4 | up to 40960 (native cap) | drafter on | 8 (lots of room → continuous batching) | n/a |
| 24 GB GPU | 27B | qjl+polarq4 (f16 only fits ~77k — keep compressed) | 131072 (catalog); stretch toward ~200k if verify ok | 9B-distilled drafter, ngl=max | 8 | armed past ~200k |
| 24 GB GPU, small model on purpose (latency) | 0.8B / 2B | f16 (accuracy — 3.5 GiB at 32k, trivial) | 40960 native cap; or run multiple parallel sessions (8× contexts × ~0.9 GiB compressed each) | bigger drafter (use the 2B as drafter for the 0.8B if vocab matches) | 8 | n/a |
| 48 GB+ Apple Silicon (unified) | 27B | qjl+polarq4 | 131072; stretch only behind verification | 9B drafter | 8 | unified-mem spill cheap |
| 96 GB+ / GH200 | 27B | qjl+polarq4 / TCQ trellis for verified stretch windows | 131072+ experimental | 9B drafter, mlock on | 8 | host RAM headroom |
The levers, in priority order, when there's spare memory:
contextSize toward
min(tier.contextLength, baseNativeContext, maxFittingContext). Cheap with
the compressed cache.kvCacheForContext would need a
preferAccurateKvWhenHeadroom branch keyed off probed VRAM.runtimeFor sets
draftGpuLayers: "auto"); only the choice of drafter is fixed today.optimizations.parallel is already
8 for ≥131072-context tiers and 4 otherwise; on a roomy device serving a
small model it could be 8 regardless, enabling continuous batching for
multiple concurrent sessions.When there's no slack (phones): smallest model, qjl+polarq4 forced (already
the default >8k), contextSize = whatever fits, KV spill armed but expected to
rarely fire, parallel dropped to 2 if the KV cache is the squeeze. The
hard-fail-not-degrade rule (KvSpillUnsupportedError, missing-kernel = startup
error) stays — a slow voice session is worse than a clear "this device can't do
256k, use 32k".
contextLength values are correct:
active text tiers use the 128k release floor; device-specific reduction belongs
in the runtime selector, not in per-tier manifest drift.
The compressed-cache default (kvCacheForContext → qjl1_256 + q4_polar
for context > 8k) already applies to every tier. There is no obviously-too-
conservative number to bump without verification evidence behind it.recommendation.ts that raises runtime dflash.contextSize
toward min(tier.contextLength, baseNativeContext, maxFittingContext) using
the estimateQuantizedKvBytesPerToken figures; (2) an opt-in
preferAccurateKvWhenHeadroom branch that picks f16/q8_0 KV on hosts with
abundant VRAM relative to the chosen model+context; (3) only add a
named long-context 27B tier after the artifact and spill-latency gate pass.