Back to Eliza

Eliza-1 context-length scaling: how far can we push the window on real machines?

packages/shared/src/local-inference/CONTEXT_SCALING.md

2.0.120.0 KB
Original Source

Eliza-1 context-length scaling: how far can we push the window on real machines?

2026-05-15 audit note: the small-tier 32k recommendation in this memo is superseded by the current staging contract. The audited 0.8B GGUF declares qwen35.context_length = 262144; release validation still blocks artifacts published under misleading 32k/64k text paths, while the catalog/platform plan use a 128k release floor for the active 0_8B/2B/4B/9B/27B tiers unless a distinct verified long-context catalog tier is introduced.

This is the analysis behind the per-tier contextLength in catalog.ts, the dflash.{contextSize,draftContextSize} knobs in runtimeFor(), and the RAM-budget / spill plumbing in ram-budget.ts + kv-spill.ts. It answers one question: given a phone / 16 GB GPU / 24 GB GPU and an Eliza-1 tier, what is the largest context window that actually fits, and which optimization should run when there is spare memory vs when there isn't?

The short version: on the small tiers (0.8B / 2B) the binding limit is the base model's positional range (Qwen3.5-0.8B / 2B ship max_position_embeddings = 40960), not the KV-cache memory — the compressed cache fits 32k in well under 1 GiB on any device. On the big tiers (9B / 27B) the limit flips: positional range is huge, KV-cache memory is what gates you, and that is where QJL+PolarQuant compression and (past 64k) KV spill do the heavy lifting. The catalog's job is to ship the largest native window per tier and then have the runtime pick a context within it that fits the device. That second half — a memory-aware context selector — is the gap; see "Wins" below.


1. The geometry

KV-cache footprint per generated token, summed over all full-attention layers:

bytes/token  =  n_layers * n_kv_heads * head_dim * (k_bits + v_bits) / 8

Qwen3 dense base shapes (from each base model's config.json; the registry's model_registry.py header table is the source of truth):

Eliza-1 tierbase modellayerskv headshead_dimnative ctx (max_position_embeddings)GGUF (quant)
eliza-1-0_8bQwen3.5-0.8B-Base288128262144~0.7 GB (Q4-class)
eliza-1-2bQwen3.5-2B-Base288128262144~1.5 GB (Q4-class)
eliza-1-4bQwen3.5-4B-Base368128262144~2.6 GB (Q4_K_M)
eliza-1-9bQwen3.5-9B8‡4256128k target~5.4 GB (Q4_K_M)
eliza-1-27bQwen3.6-27B16‡4256128k target~16.8 GB (Q4_K_M)

† 9B/27B use the registry's hybrid-attention geometry: only 8 / 16 of the layers carry a KV cache; the rest are linear-attention with constant SSM state. That is why their per-token KV is much smaller than a dense model of the same parameter count would be. ‡ KV-bearing (full-attention) layer count, per the published 3:1 ratio.

Bits-per-coordinate per cache layout (what the dflash-server actually loads — see kvCacheForContext() in catalog.ts):

layoutK bits/coordV bits/coordcombined B/coordnotes
f1616.016.04.00stock llama.cpp default
q8_08.58.52.12stock; lossless-ish
qjl1_256 K + turbo3_0 V~2.13~3.5~0.70the ≤8k-context fallback layout (TBQ3 V)
qjl1_256 K + q4_polar V~2.13~4.5~0.83the default for every shipping tier (all are >8k context)

QJL on the K-cache realizes ~7.53× vs f16 — not the marketing 16× — because each 256-coord sketch block carries a per-token bf16 norm (≈34 bytes of overhead per block of 256 sketch dims). PolarQuant on the V-cache is ~4-bit + per-block scales. These are the figures model_registry._compute_inference_mem and kv-spill.estimateQuantizedKvBytesPerToken are both sized against (QUANTIZED_KV_BYTES_PER_TOKEN_BY_PARAMS = 1200 / 2400 / 9000 / 22000 B/tok for 0.8B / 2B/9B/27B — consistent with the per-layer math here once you account for the realized ratios).

Resulting bytes/token across the whole KV cache:

tierf16q8_0qjl+turbo3 (≤8k fallback)qjl+polarq4 (shipping default)
0.8B114.7 kB60.9 kB~20.2 kB~23.7 kB
2B114.7 kB60.9 kB~20.2 kB~23.7 kB
4B147.5 kB78.3 kB~25.9 kB~30.5 kB
9B32.8 kB17.4 kB~5.8 kB~6.8 kB
27B65.5 kB34.8 kB~11.5 kB~13.6 kB

(9B/27B are smaller per-token than 0.8B because of the 3:1 linear-attention ratio — only 8/16 layers cache, and the per-coord cost is identical.)


2. Max context that fits the KV cache, by device

Budget model: available_for_KV = device_mem − GGUF_weights − working_set, where working_set is the compute/activation/scratch buffers (≈0.6 GB for 0.8B, 0.8 GB for 2B, 1.0 GB for 4B, 1.5 GB for 9B, 2.5 GB for 27B). For RAM-budget hosts (phones, Apple Silicon, CPU) subtract another ~1.5 GB OS reserve (DEFAULT_RAM_HEADROOM_RESERVE_MB); the GPU columns assume dedicated VRAM. Numbers are "context tokens the cache can hold", rounded.

Phones — ~4 GB usable / ~8 GB usable

tierlayout4 GB8 GB
0.8Bf16~27k~65k
0.8Bqjl+polarq4~131k~312k
2Bf16~19k~56k
2Bqjl+polarq4~90k~271k
4Bf16~32k
4Bqjl+polarq4~14k~155k
9Bqjl+polarq4~174k

Takeaway: on a 4 GB phone the 0.8B at the shipping compressed cache fits ~130k tokens — far past the model's 40960-token positional ceiling. The KV cache is not the limiter at the small end; the base model's RoPE range is. The 2B is comfortable on any modern (≥6 GB) phone. The 4B only fits an 8 GB phone, and only with the compressed cache. The 9B needs ~8 GB and its small hybrid KV — borderline; reserve it for 12 GB+ phones / 24 GB-RAM tablets, which is exactly what minRamGb: 12 does today.

Gaming / workstation GPUs — 16 GB / 24 GB VRAM

tierlayout16 GB24 GB
0.8Bf16~139k~214k
0.8Bq8_0~263k~404k
0.8Bqjl+polarq4~674k~1.0M
2Bf16~131k~206k
2Bqjl+polarq4~633k~995k
4Bf16~90k~149k
4Bqjl+polarq4~436k~718k
9Bf16~298k~560k
9Bqjl+polarq4~1.4M~2.7M
27Bf16~77k
27Bq8_0~145k
27Bqjl+polarq4~372k

Takeaways:

  • The 27B does not fit on 16 GB at all — weights alone are ~16.8 GB. Its 16 GB story is "use the 9B instead", which the recommender ladder already encodes. On 24 GB the 27B fits with ~5 GB to spare → ~370k context at the compressed cache, ~145k even at q8_0, ~77k at f16. The 27B's 128k catalog default is the right production call until a separately verified long-context 27B artifact exists.
  • The 9B is the sweet spot for 16 GB cards. Weights ~5.4 GB leaves ~9 GB for KV → it fits its 65536 catalog window with ~8.6 GB of headroom even at f16 KV. The compressed cache would let the 9B run much longer windows if the weights supported them (they don't natively yet — see "RoPE extension").
  • 0.8B / 2B on a 16-24 GB GPU are wildly under-utilizing the device. Weights are 0.5-1.4 GB. After that there are 14-23 GB of unused VRAM. The catalog ships them at the 128k release floor, and the compressed KV path is still small enough for modern local hardware. This is the headroom the overhead-aware policy (§5) should spend.

3. Wins

Ranked by confidence × impact.

HIGH — ship a memory-aware context selector (the real gap)

Today recommendation.ts picks a model per device, and pickFittingContextVariant picks among pre-baked context variants of the same model line when such variants exist. There is no path that says "the device has 14 GB of unused VRAM after loading the 0.8B at its default 32k — bump contextSize toward the model's native ceiling". The numbers above show every device has slack:

  • 0.8B / 2B: the binding limit is the base model's max_position_embeddings (262144). The catalog's 131072 is a deliberate release floor below that. A memory-aware selector could safely raise the runtime contextSize to 262144 on roomy devices when the GGUF metadata and verification evidence agree. Do not bump contextLength in the catalog past the base model's positional range without a YaRN/RoPE-scaled GGUF.
  • 9B: catalog 131072. Native range extends to 128k+ for the verified variant. On 16 GB the compressed KV at 128k is small; the deciding factor is weight residency and backend support.
  • 27B: catalog 131072 on 24 GB uses ~1.7 GiB compressed KV. There's room for longer contexts, but those must remain runtime stretch experiments until a separately verified long-context 27B artifact passes latency and spill tests.

Implementation sketch: after the model is chosen, compute maxFittingContext = floor((memBytes − weightBytes − workingSet) / kvBytesPerToken) using the same estimateQuantizedKvBytesPerToken figure kv-spill.ts uses, clamp to min(model.contextLength, baseModelNativeContext), and pass that as dflash.contextSize instead of the static contextLength. This is the "largest model+context combo that fits the available RAM/VRAM with a safety margin, preferring qjl+polarq4" the task asks for.

HIGH — default to the compressed cache whenever context > 8k (already done, keep it)

kvCacheForContext() already returns { typeK: "qjl1_256", typeV: "q4_polar" } for any context > 8192, and every shipping tier is > 8k. So the 0.8B / 2B at 32k already cost ~0.6-0.7 GiB of KV, not 3.5 GiB. Nothing to change — but the doc above is the justification, and the memory-aware selector (above) must keep using qjl+polarq4 as the assumed layout, not f16.

RESOLVED — eliza-1-4b is in catalog.ts

model_registry.py has a real, buildable qwen3.5-4beliza-1-4b entry; the catalog now includes the 4B tier at elizaos/eliza-1/bundles/4b. The 4B fits an 8 GB phone when the compressed cache path is available (compressed cache, ~155k ceiling, though the base model caps at 40960) and is a strong 16 GB-GPU default (~436k KV ceiling, ~90k even at f16).

LOW / informational — no safe catalog contextLength bump for the small tiers

It is tempting to read "0.8B fits 800k tokens of KV on 16 GB" and bump contextLength beyond the release floor. Don't. The current catalog standardizes the active tiers at 128k; the runtime selector should still lower contextSize on constrained devices rather than letting a roomy desktop change the release contract by accident.


4. The mobile story

How the catalog/harness handles low-RAM hosts today:

  • FIRST_RUN_DEFAULT_MODEL_ID = "eliza-1-2b" — the smallest tier that fits "the broadest range of hardware (modern phone or laptop)". On first run with no preference, that's what loads.
  • Hosts that can't fit eliza-1-2b fall back to eliza-1-0_8b via the mobile ladder in recommendation.ts: TEXT_SMALL: [0_8b, 2b], TEXT_LARGE: [2b, 0_8b]. canFit runs assessRamFit against the device's total RAM (mobile shares GPU memory with system RAM), with the OS headroom reserve and a download-size guardrail (wontFitRatio 0.8, tightRatio 0.65).
  • assessRamFitresolveRamBudget: for an installed Eliza-1 bundle it reads ramBudgetMb.{min,recommended} from eliza-1.manifest.json; otherwise it synthesizes recommendedMb = minRamGb + (compressed KV bytes/token × contextLength). So "boots" and "runs a long session comfortably" are genuinely different lines, sized off the compressed KV footprint.
  • Past 64k context on a device whose RAM can't hold the cache, kv-spill.ts pages cold KV out to host RAM (cpu) or NVMe (disk), keeping the hot tail resident — but it hard-fails with KvSpillUnsupportedError if the worst-case cold-page restore (one 256-token page over PCIe ≈12 GB/s, or NVMe ≈1.5 GB/s) can't meet the latency budget (200 ms voice / 1500 ms text). No silent-slow fallback. On phones there is no discrete GPU, so the restore class is disk-nvme: a 0.8B page is ~5 MB ÷ 1.5 GB/s ≈ 3.4 ms — fine for text, fine for voice. Spill is viable on phones; it just rarely triggers because (a) the small tiers cap at 40960 tokens anyway and (b) the compressed cache at 40k is under 1 GiB.

What's missing on mobile: the same memory-aware context selector. A 4 GB phone running the 0.8B at the catalog's static 32k is fine, but a 4 GB phone running the 0.8B could run 40960 (the model's native max) for the same KV cost it already budgets — there's no reason to leave the last ~8k of the model's range on the table. And an 8 GB phone could run the 4B (if catalogued) instead of the 2B. The selector should: enumerate tiers fitting the device → pick the largest → pick contextSize = min(tier.contextLength, baseNativeContext, maxFittingContextAtCompressedCache − safetyMargin).


5. The overhead-aware policy: VRAM/RAM budget → config

The user's framing: "optimizations that run when we have extra overhead vs when we can't." A 24 GB GPU running a 0.8B has ~22 GB of spare VRAM — that slack can buy a bigger model, a bigger context, a more accurate KV cache (f16), a bigger speculative drafter, or more parallel contexts (continuous batching). A 4 GB phone has none of that — smallest model, most aggressive KV quant, KV spill armed. Decision table (after a tier is chosen for the device, working out what to do with the remaining headroom):

device / situationmodelKV layoutcontext (runtime contextSize)drafter / spec-decodeparallelism (optimizations.parallel)spill
Phone, ~4 GB usable0.8Bqjl+polarq4 (forced)min(40960, fit) — currently 32kdrafter on, draftMax 6, ngl=auto4 (or 2 if KV is the squeeze)armed (disk-nvme); hard-fail if >1.5 s
Phone, ~6-8 GB usable2B (or 4B*)qjl+polarq4min(40960, fit) — 32k native capdrafter on4armed
Phone/tablet, 12 GB+ RAM9Bqjl+polarq4up to 65536 (catalog) — fits easily9B-distilled drafter4rarely needed
16 GB GPU9Bqjl+polarq4 (or f16 — 16 GB fits f16 KV at 64k with 8.6 GB to spare → use f16 for accuracy)65536; could stretch to 128k on a RoPE-ext variantdrafter on, ngl=max, draftGpuLayers=max4-8not needed
16 GB GPU, want long context on small model2Bqjl+polarq4up to 40960 (native cap)drafter on8 (lots of room → continuous batching)n/a
24 GB GPU27Bqjl+polarq4 (f16 only fits ~77k — keep compressed)131072 (catalog); stretch toward ~200k if verify ok9B-distilled drafter, ngl=max8armed past ~200k
24 GB GPU, small model on purpose (latency)0.8B / 2Bf16 (accuracy — 3.5 GiB at 32k, trivial)40960 native cap; or run multiple parallel sessions (8× contexts × ~0.9 GiB compressed each)bigger drafter (use the 2B as drafter for the 0.8B if vocab matches)8n/a
48 GB+ Apple Silicon (unified)27Bqjl+polarq4131072; stretch only behind verification9B drafter8unified-mem spill cheap
96 GB+ / GH20027Bqjl+polarq4 / TCQ trellis for verified stretch windows131072+ experimental9B drafter, mlock on8host RAM headroom

The levers, in priority order, when there's spare memory:

  1. Bigger model first. A 9B at 16 GB beats a 2B at 16 GB for quality; pick the largest tier that fits with margin. (This is what the ladders do.)
  2. Then bigger context. Raise runtime contextSize toward min(tier.contextLength, baseNativeContext, maxFittingContext). Cheap with the compressed cache.
  3. Then more accurate KV (f16 / q8_0) — only worth it when there's enough slack to hold f16 and the chosen context, which on consumer cards means the 0.8B / 2B/9B tiers, not the 27B. kvCacheForContext would need a preferAccurateKvWhenHeadroom branch keyed off probed VRAM.
  4. Then a bigger drafter (better acceptance rate → faster decode) — e.g. the 2B distilled drafter in front of the 0.8B target, or the existing 9B drafter in front of the 27B. Already wired (runtimeFor sets draftGpuLayers: "auto"); only the choice of drafter is fixed today.
  5. Then more parallel contextsoptimizations.parallel is already 8 for ≥131072-context tiers and 4 otherwise; on a roomy device serving a small model it could be 8 regardless, enabling continuous batching for multiple concurrent sessions.

When there's no slack (phones): smallest model, qjl+polarq4 forced (already the default >8k), contextSize = whatever fits, KV spill armed but expected to rarely fire, parallel dropped to 2 if the KV cache is the squeeze. The hard-fail-not-degrade rule (KvSpillUnsupportedError, missing-kernel = startup error) stays — a slow voice session is worse than a clear "this device can't do 256k, use 32k".


  • No ad hoc catalog edit. The per-tier contextLength values are correct: active text tiers use the 128k release floor; device-specific reduction belongs in the runtime selector, not in per-tier manifest drift. The compressed-cache default (kvCacheForContextqjl1_256 + q4_polar for context > 8k) already applies to every tier. There is no obviously-too- conservative number to bump without verification evidence behind it.
  • Recommended: (1) a memory-aware context selector in recommendation.ts that raises runtime dflash.contextSize toward min(tier.contextLength, baseNativeContext, maxFittingContext) using the estimateQuantizedKvBytesPerToken figures; (2) an opt-in preferAccurateKvWhenHeadroom branch that picks f16/q8_0 KV on hosts with abundant VRAM relative to the chosen model+context; (3) only add a named long-context 27B tier after the artifact and spill-latency gate pass.