Eliza-1 deployment matrix

packages/training/cloud/README.md

Three published Eliza-1 fine-tunes × five hosting paths. Each row picks a quant flavor that's appropriate for the target hardware; the cell columns name the canonical config file and the runtime that consumes it.

| Size | Local (Eliza) | Ollama (local) | vLLM (self-host) | Vast pyworker (autoscale) | Eliza Cloud |
| --- | --- | --- | --- | --- | --- |
| eliza-1-2b | ELIZA_MODEL=eliza-1-2b | ollama/Modelfile.eliza-1-2b-q4_k_m | serve_vllm.py --registry-key eliza-1-2b --gpu-target single | vast-pyworker/eliza-1-2b.json | vast/eliza-1-2b (catalog entry pending) |
| eliza-1-9b | ELIZA_MODEL=eliza-1-9b | ollama/Modelfile.eliza-1-9b-q4_k_m | serve_vllm.py --registry-key eliza-1-9b --gpu-target h100-2x | vast-pyworker/eliza-1-9b.json | vast/eliza-1-9b (catalog entry pending) |
| eliza-1-27b | ELIZA_MODEL=eliza-1-27b | ollama/Modelfile.eliza-1-27b-q4_k_m | serve_vllm.py --registry-key eliza-1-27b --gpu-target h200-2x | vast-pyworker/eliza-1-27b.json | vast/eliza-1-27b (catalog entry pending) |
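
Because every row of the matrix follows the same naming pattern, it can be expressed as a small lookup. The following is a minimal sketch, not code from this repo: `DEPLOYMENT_MATRIX` and `hosting_paths` are hypothetical names, and only the ids, file names, and flags shown in the table are taken from the source.

```python
# GPU target per model id, copied from the vLLM column of the matrix.
# The dict and the helper below are illustrative names, not repo APIs.
DEPLOYMENT_MATRIX = {
    "eliza-1-2b": {"gpu_target": "single"},
    "eliza-1-9b": {"gpu_target": "h100-2x"},
    "eliza-1-27b": {"gpu_target": "h200-2x"},
}


def hosting_paths(model_id: str) -> dict:
    """Return the config name used by each hosting path for one matrix row."""
    # Raises KeyError for ids that are not in the matrix.
    gpu_target = DEPLOYMENT_MATRIX[model_id]["gpu_target"]
    return {
        "local_env": f"ELIZA_MODEL={model_id}",
        "ollama_modelfile": f"ollama/Modelfile.{model_id}-q4_k_m",
        "vllm_argv": [
            "serve_vllm.py",
            "--registry-key", model_id,
            "--gpu-target", gpu_target,
        ],
        "vast_manifest": f"vast-pyworker/{model_id}.json",
        "cloud_catalog_id": f"vast/{model_id}",
    }


if __name__ == "__main__":
    for key, value in hosting_paths("eliza-1-9b").items():
        print(f"{key}: {value}")
```

Keeping the GPU target as the only per-row datum makes it obvious that everything else in a row is derived from the model id.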

The "catalog entry pending" annotation reflects that eliza/cloud/packages/lib/models/catalog.ts only has vast/eliza-1-27b today. Adding each missing vast/eliza-1-* entry is a one-line PR (mirror the existing row), but that file is owned by the cloud monorepo, not this directory.

The ELIZA_MODEL env var is read by the runtime via eliza/packages/app-core/src/runtime/local-model-resolver.ts, which auto-picks the right quant flavor (gguf / polarquant / fp8 / bf16) for the detected GPU and pulls the matching HF sibling repo on first run.

Local dev — single laptop / desktop

  • Default: eliza-1-2b via Ollama. Build the Modelfile under ollama/, then export OLLAMA_LARGE_MODEL=eliza-1-2b. Runs on any 16 GB consumer GPU and on Apple Silicon (Metal).
  • Workstation upgrade: eliza-1-9b if you have a 24+ GB card.

Single-user prod — one user, persistent personal assistant

  • 24 GB workstation: eliza-1-9b Q4_K_M via Ollama. The PolarQuant sibling repo exists for the local-runtime path (scripts/quantization/polarquant_apply.py) but mainline vLLM has no PolarQuant kernel today, so the vLLM serving recipe runs on the bf16 base repo with FP8 W8A8 + FP8 KV on Hopper.
  • 48 GB+ workstation: eliza-1-27b Q4_K_M via Ollama, or eliza-1-27b bf16 via vLLM if FP8 is unavailable on the card.

Multi-tenant prod — many concurrent users

  • Cloud GPU: eliza-1-27b via Eliza Cloud (vast/eliza-1-27b, catalog entry pending). The vast-pyworker/eliza-1-27b.json manifest targets 2x H200 SXM at FP8 weights + FP8 KV. Vast Serverless autoscales workers based on queue depth. (TurboQuant KV per vLLM PR #38479 will halve the KV footprint once that lands in mainline; today's manifest uses the safe fp8_e4m3 fallback.)
  • Burst pattern with predictable baseline: keep one always-warm worker (min_workers=1, inactivity_timeout=-1) and let Vast scale out to max_workers=8 on demand. Costs ~$4/hr baseline (one 2x H200 worker, at the hourly rate in the cost table below) plus per-request capacity from any additional workers.
  • Edge / low-latency: deploy eliza-1-9b on Vast against H100 PCIe / H200 cards (h100-2x target). Cheaper than 27B and geographically distributable.
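
The burst pattern above bounds the hourly GPU bill between the always-warm baseline and the scale-out ceiling. Here is a rough sketch of that arithmetic. `AutoscalePolicy` and `hourly_cost_range` are hypothetical helpers, not the Vast manifest schema; only the three parameter names and values come from the text, and the worker rate is an input (the usage lines plug in the $4.00/hr 2x H200 figure from the cost table).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalePolicy:
    """Illustrative container for the three knobs quoted in the text."""
    min_workers: int
    max_workers: int
    inactivity_timeout: int  # -1 is the value the text pairs with an always-warm worker


def hourly_cost_range(policy: AutoscalePolicy, worker_hourly_usd: float) -> tuple:
    """Return (baseline, ceiling) hourly GPU rental cost for a policy."""
    if not 0 <= policy.min_workers <= policy.max_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    baseline = policy.min_workers * worker_hourly_usd
    ceiling = policy.max_workers * worker_hourly_usd
    return baseline, ceiling


# One warm worker, scale-out to eight, rate taken from the 27B / h200-2x row.
burst = AutoscalePolicy(min_workers=1, max_workers=8, inactivity_timeout=-1)
baseline, ceiling = hourly_cost_range(burst, worker_hourly_usd=4.00)

if __name__ == "__main__":
    print(f"baseline ${baseline:.2f}/hr, ceiling ${ceiling:.2f}/hr")
```

The actual bill sits somewhere between the two numbers, depending on how long the extra workers stay up.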

Cost estimate per million tokens served

Rough order-of-magnitude numbers, dominated by GPU rental cost on Vast spot at typical 2026 rates. Decode throughput estimates assume the full vLLM stack from serve_vllm.py (CUDA graphs + chunked prefill + prefix caching) at decode batch size 32.

| Configuration | Hardware | Hourly $ | Decode tok/s | $ / 1M output tokens |
| --- | --- | --- | --- | --- |
| 2B / vLLM bf16 / single L40S | 1x L40S 48 GB | $0.40 | ~3500 | $0.03 |
| 2B / GGUF Q4_K_M / RTX 4090 | 1x RTX 4090 24 GB | $0.30 | ~2200 | $0.04 |
| 9B / vLLM FP8 / h100-2x | 2x H100 PCIe 80 GB | $1.80 | ~3200 | $0.16 |
| 9B / GGUF Q4_K_M / RTX 5090 | 1x RTX 5090 32 GB | $0.50 | ~1100 | $0.13 |
| 27B / vLLM FP8 / h200-2x | 2x H200 SXM 141 GB | $4.00 | ~1800 | $0.62 |
| 27B / GGUF Q6_K / RTX 5090 | 1x RTX 5090 32 GB | $0.50 | ~700 | $0.20 |

These rows are projections at the cloud-spot rates we typically observe on Vast for the matching hardware classes; revisit them once throughput has been measured on a trained eliza-1 checkpoint. Vast spot prices and decode tok/s vary with offer availability, batch size, and prompt mix. The figures cover GPU rental only; the Eliza Cloud customer-facing markup goes on top per the standard Cloud pricing model.
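
The last column of the table follows from the other two: hourly rental divided by tokens decoded per hour, scaled to one million tokens. The sketch below reproduces the table from its own inputs. The function name and the `ROWS` list are illustrative, and the calculation assumes the worker decodes at the stated rate for the whole hour (full utilisation).

```python
def usd_per_million_tokens(hourly_usd: float, decode_tok_per_s: float) -> float:
    """GPU rental cost per 1M output tokens at a sustained decode rate."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    tokens_per_hour = decode_tok_per_s * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# (label, hourly USD, decode tok/s) copied from the table above.
ROWS = [
    ("2B / vLLM bf16 / single L40S", 0.40, 3500),
    ("2B / GGUF Q4_K_M / RTX 4090", 0.30, 2200),
    ("9B / vLLM FP8 / h100-2x", 1.80, 3200),
    ("9B / GGUF Q4_K_M / RTX 5090", 0.50, 1100),
    ("27B / vLLM FP8 / h200-2x", 4.00, 1800),
    ("27B / GGUF Q6_K / RTX 5090", 0.50, 700),
]

if __name__ == "__main__":
    for label, hourly, tok_per_s in ROWS:
        cost = usd_per_million_tokens(hourly, tok_per_s)
        print(f"{label}: ${cost:.2f} per 1M output tokens")
```

At lower utilisation the effective cost per token rises proportionally, so idle time on an always-warm worker is not captured by these figures.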

HuggingFace model layout

The canonical destination is a single consolidated bundle repo elizaos/eliza-1 with per-tier subdirectories. scripts/publish/orchestrator.py uploads each tier under bundles/<tier>/ (text + tts + asr + vad + dflash + cache + evals + licenses + quantization + checksums + manifest + lineage). See packages/inference/AGENTS.md §2 for the full bundle contract.

| Tier | Path inside elizaos/eliza-1 |
| --- | --- |
| 0_8b | bundles/0_8b/ |
| 2b | bundles/2b/ |
| 4b | bundles/4b/ |
| 9b | bundles/9b/ |
| 27b | bundles/27b/ |
| 27b-256k | bundles/27b-256k/ |

Per-quant variants (Q4_K_M / Q6_K / Q8_0) live inside the same tier as sibling files under bundles/<tier>/text/. vLLM-specific quants (FP8, AWQ-Marlin, PolarQuant) require separate repos because vLLM cannot load from a subpath — those per-quant repos do not yet exist on HF and are tracked as Wave 3+ publish work.
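
Since all GGUF variants of a tier sit under bundles/<tier>/text/ in one repo, a client can fetch just that subtree. The following is a sketch under stated assumptions: the helper names are hypothetical, the file names inside text/ are not specified here so the pattern matches the whole subdirectory, and the download step assumes `huggingface_hub` is installed and the repo is reachable.

```python
REPO_ID = "elizaos/eliza-1"
TIERS = ("0_8b", "2b", "4b", "9b", "27b", "27b-256k")


def tier_prefix(tier: str) -> str:
    """Return the bundle directory for a tier, e.g. 'bundles/2b/'."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return f"bundles/{tier}/"


def text_allow_patterns(tier: str) -> list:
    """Glob patterns selecting only the text/ files of one tier."""
    return [f"{tier_prefix(tier)}text/*"]


def download_tier_text(tier: str) -> str:
    """Download the text/ subtree of one tier and return the local snapshot path."""
    # Imported lazily so the path helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, allow_patterns=text_allow_patterns(tier))


if __name__ == "__main__":
    print(text_allow_patterns("9b"))
```

This subpath filtering works for file downloads only; as noted above, vLLM still needs a separate repo per quant.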

Subdirectory map

cloud/
├── README.md             ← this file
├── ollama/               ← Modelfiles for ollama (per size, Q4_K_M)
│   ├── README.md
│   ├── Modelfile.eliza-1-2b-q4_k_m
│   ├── Modelfile.eliza-1-9b-q4_k_m
│   └── Modelfile.eliza-1-27b-q4_k_m
├── vast-pyworker/        ← Vast.ai serverless manifests
│   ├── README.md
│   ├── eliza-1-2b.json
│   ├── eliza-1-9b.json
│   └── eliza-1-27b.json
└── scripts/
    └── eliza-cloud-register.sh   ← upsert templates + endpoints in one shot

Reference docs in the wider repo:

  • training/scripts/training/model_registry.py — Python source of truth for sizes, quant siblings, and KV budgets.
  • training/scripts/inference/serve_vllm.py — canonical vLLM CLI; all per-target args here mirror its GPU_TARGETS table.
  • eliza/cloud/services/vast-pyworker/ — the existing pyworker that fronts the GGUF / llama-server path on Vast (Q6_K 27B today).
  • eliza/packages/app-core/src/runtime/local-model-resolver.ts — the Eliza-side resolver that maps ELIZA_MODEL=eliza-1-<size> to (repo, quant, backend) per detected GPU.