packages/training/cloud/README.md
Three published Eliza-1 fine-tunes × five hosting paths. Each row picks a
quant flavor appropriate for the target hardware; each cell names the
canonical config file or command for the runtime that consumes it.
| Size | Local (Eliza) | Ollama (local) | vLLM (self-host) | Vast pyworker (autoscale) | Eliza Cloud |
|---|---|---|---|---|---|
| eliza-1-2b | ELIZA_MODEL=eliza-1-2b | ollama/Modelfile.eliza-1-2b-q4_k_m | serve_vllm.py --registry-key eliza-1-2b --gpu-target single | vast-pyworker/eliza-1-2b.json | vast/eliza-1-2b (catalog entry pending) |
| eliza-1-9b | ELIZA_MODEL=eliza-1-9b | ollama/Modelfile.eliza-1-9b-q4_k_m | serve_vllm.py --registry-key eliza-1-9b --gpu-target h100-2x | vast-pyworker/eliza-1-9b.json | vast/eliza-1-9b (catalog entry pending) |
| eliza-1-27b | ELIZA_MODEL=eliza-1-27b | ollama/Modelfile.eliza-1-27b-q4_k_m | serve_vllm.py --registry-key eliza-1-27b --gpu-target h200-2x | vast-pyworker/eliza-1-27b.json | vast/eliza-1-27b (catalog entry pending) |
The "catalog entry pending" annotation reflects that
eliza/cloud/packages/lib/models/catalog.ts only has
vast/eliza-1-27b today. Adding the vast/eliza-1-* entries
is a one-line PR per id (mirror the existing row), but that change is
owned by the cloud monorepo, not this directory.
The ELIZA_MODEL env var is read by the runtime via
eliza/packages/app-core/src/runtime/local-model-resolver.ts, which
auto-picks the right quant flavor (gguf / polarquant / fp8 / bf16) for
the detected GPU and pulls the matching HF sibling repo on first run.
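
The vLLM column of the table above follows one pattern per size, so it can be written as a small lookup. This is a minimal sketch, assuming the registry keys and GPU targets are exactly those listed in the table; VLLM_GPU_TARGETS and vllm_command are hypothetical names used for illustration and are not part of serve_vllm.py.

```python
import shlex

# Size -> GPU target, copied from the vLLM column of the table above.
# The dict and function names are illustrative, not part of serve_vllm.py.
VLLM_GPU_TARGETS = {
    "eliza-1-2b": "single",
    "eliza-1-9b": "h100-2x",
    "eliza-1-27b": "h200-2x",
}


def vllm_command(registry_key: str) -> list[str]:
    """Build the serve_vllm.py argument list for one published size."""
    if registry_key not in VLLM_GPU_TARGETS:
        known = ", ".join(sorted(VLLM_GPU_TARGETS))
        raise ValueError(
            f"unknown registry key {registry_key!r}; expected one of: {known}"
        )
    return [
        "serve_vllm.py",
        "--registry-key", registry_key,
        "--gpu-target", VLLM_GPU_TARGETS[registry_key],
    ]


if __name__ == "__main__":
    # Print one shell-quoted command line per published size.
    for key in VLLM_GPU_TARGETS:
        print(shlex.join(vllm_command(key)))
```

Keeping the mapping in one place means an unknown size fails loudly instead of reaching the server with a mismatched GPU target.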
- eliza-1-2b via Ollama. Build the Modelfile under ollama/, then export
  OLLAMA_LARGE_MODEL=eliza-1-2b. Runs on any 16 GB consumer GPU and on
  Apple Silicon (Metal). Use eliza-1-9b if you have a 24+ GB card.
- eliza-1-9b Q4_K_M via Ollama. The PolarQuant sibling repo exists for
  the local-runtime path (scripts/quantization/polarquant_apply.py), but
  mainline vLLM has no PolarQuant kernel today, so the vLLM serving recipe
  runs on the bf16 base repo with FP8 W8A8 + FP8 KV on Hopper.
- eliza-1-27b Q4_K_M via Ollama, or eliza-1-27b bf16 via vLLM if FP8 is
  unavailable on the card.
- eliza-1-27b via Eliza Cloud (vast/eliza-1-27b, catalog entry pending).
  The vast-pyworker/eliza-1-27b.json manifest targets 2x H200 SXM at FP8
  weights + FP8 KV. Vast Serverless autoscales workers based on queue
  depth. (TurboQuant KV per vLLM PR #38479 will halve the KV footprint
  once that lands in mainline; today's manifest uses the safe fp8_e4m3
  fallback.)
- Keep one worker running (min_workers=1, inactivity_timeout=-1) and let
  Vast scale out to max_workers=8 on demand. Costs ~$2/hr baseline (one
  H200 worker) plus per-request capacity.
- eliza-1-9b on Vast against H100 PCIe / H200 cards (h100-2x target).
  Cheaper than 27B and geographically distributable.

Rough order-of-magnitude numbers, dominated by GPU rental cost on Vast
spot at typical 2026 rates. Decode throughput estimates assume the
full vLLM stack from serve_vllm.py (CUDA graphs + chunked prefill +
prefix caching) at decode batch size 32.
| Row | Hardware | Hourly $ | Decode tok/s | $ / 1M output tokens |
|---|---|---|---|---|
| 2B / vLLM bf16 / single L40S | 1x L40S 48 GB | $0.40 | ~3500 | $0.03 |
| 2B / GGUF Q4_K_M / RTX 4090 | 1x RTX 4090 24 GB | $0.30 | ~2200 | $0.04 |
| 9B / vLLM FP8 / h100-2x | 2x H100 PCIe 80 GB | $1.80 | ~3200 | $0.16 |
| 9B / GGUF Q4_K_M / RTX 5090 | 1x RTX 5090 32 GB | $0.50 | ~1100 | $0.13 |
| 27B / vLLM FP8 / h200-2x | 2x H200 SXM 141 GB | $4.00 | ~1800 | $0.62 |
| 27B / GGUF Q6_K / RTX 5090 | 1x RTX 5090 32 GB | $0.50 | ~700 | $0.20 |
These rows are projections based on the typical spot rates we observe on Vast for the matching hardware classes; the numbers will need revisiting once we have measured throughput on a trained eliza-1 checkpoint. Vast spot prices and decode tok/s vary with offer availability, batch size, and prompt mix. The figures cover GPU rental only; the Eliza Cloud customer-facing markup goes on top, per the standard Cloud pricing model.
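
The last column of the cost table follows from the hourly rate and the decode throughput: divide the hourly price by the tokens produced in an hour, then scale to one million tokens. Below is a minimal sketch of that arithmetic, assuming a steady decode rate and rental cost only; the function name is illustrative.

```python
def usd_per_million_tokens(hourly_usd: float, decode_tok_per_s: float) -> float:
    """Rental cost of generating one million output tokens at a steady rate."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    tokens_per_hour = decode_tok_per_s * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# (row label, hourly $, decode tok/s) copied from the table above.
ROWS = [
    ("2B / vLLM bf16 / single L40S", 0.40, 3500),
    ("2B / GGUF Q4_K_M / RTX 4090", 0.30, 2200),
    ("9B / vLLM FP8 / h100-2x", 1.80, 3200),
    ("9B / GGUF Q4_K_M / RTX 5090", 0.50, 1100),
    ("27B / vLLM FP8 / h200-2x", 4.00, 1800),
    ("27B / GGUF Q6_K / RTX 5090", 0.50, 700),
]

if __name__ == "__main__":
    for label, hourly, tok_s in ROWS:
        print(f"{label}: ${usd_per_million_tokens(hourly, tok_s):.2f} / 1M tokens")
```

For example, the 27B vLLM row works out to 4.00 / (1800 × 3600) × 1,000,000 ≈ $0.62. Rerunning the same function with measured throughput is enough to refresh the column.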
The canonical destination is a single consolidated bundle repo
elizaos/eliza-1 with per-tier subdirectories. scripts/publish/orchestrator.py
uploads each tier under bundles/<tier>/ (text + tts + asr + vad + dflash +
cache + evals + licenses + quantization + checksums + manifest + lineage).
See packages/inference/AGENTS.md §2 for the full bundle contract.
| Tier | Path inside elizaos/eliza-1 |
|---|---|
| 0_8b | bundles/0_8b/ |
| 2b | bundles/2b/ |
| 4b | bundles/4b/ |
| 9b | bundles/9b/ |
| 27b | bundles/27b/ |
| 27b-256k | bundles/27b-256k/ |
Per-quant variants (Q4_K_M / Q6_K / Q8_0) live inside the same tier as
sibling files under bundles/<tier>/text/. vLLM-specific quants (FP8,
AWQ-Marlin, PolarQuant) require separate repos because vLLM cannot load
from a subpath — those per-quant repos do not yet exist on HF and are
tracked as Wave 3+ publish work.
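
The tier layout above maps directly to paths inside the consolidated repo. This is a minimal sketch, assuming only the tier names and the bundles/<tier>/text/ convention stated here; bundle_dir and text_dir are hypothetical helpers and no file names inside a tier are assumed.

```python
from pathlib import PurePosixPath

BUNDLE_REPO = "elizaos/eliza-1"

# Tier names copied from the table above.
TIERS = ("0_8b", "2b", "4b", "9b", "27b", "27b-256k")


def bundle_dir(tier: str) -> PurePosixPath:
    """Return the repo-relative directory that holds one tier's bundle."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return PurePosixPath("bundles") / tier


def text_dir(tier: str) -> PurePosixPath:
    """Return the directory where a tier's per-quant text variants sit side by side."""
    return bundle_dir(tier) / "text"


if __name__ == "__main__":
    for tier in TIERS:
        print(f"{BUNDLE_REPO}: {text_dir(tier)}")
```

PurePosixPath keeps the separators as forward slashes on every platform, which matches how paths are written inside a hosted repo.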
cloud/
├── README.md ← this file
├── ollama/ ← Modelfiles for ollama (per size, Q4_K_M)
│ ├── README.md
│ ├── Modelfile.eliza-1-2b-q4_k_m
│ ├── Modelfile.eliza-1-9b-q4_k_m
│ └── Modelfile.eliza-1-27b-q4_k_m
├── vast-pyworker/ ← Vast.ai serverless manifests
│ ├── README.md
│ ├── eliza-1-2b.json
│ ├── eliza-1-9b.json
│ └── eliza-1-27b.json
└── scripts/
  └── eliza-cloud-register.sh ← upsert templates + endpoints in one shot
Reference docs in the wider repo:

- training/scripts/training/model_registry.py: Python source of truth
  for sizes, quant siblings, and KV budgets.
- training/scripts/inference/serve_vllm.py: canonical vLLM CLI; all
  per-target args here mirror its GPU_TARGETS table.
- eliza/cloud/services/vast-pyworker/: the existing pyworker that
  fronts the GGUF / llama-server path on Vast (Q6_K 27B today).
- eliza/packages/app-core/src/runtime/local-model-resolver.ts:
  the Eliza-side resolver that maps ELIZA_MODEL=eliza-1-<size> →
  (repo, quant, backend) per detected GPU.
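
The ELIZA_MODEL=eliza-1-<size> naming can be validated before any resolution happens. The sketch below covers that first step only, assuming the three sizes in the table at the top of this file are the accepted values. parse_eliza_model is a hypothetical helper; it does not reproduce the resolver's actual (repo, quant, backend) selection.

```python
import re

# Sizes from the hosting table at the top of this file.
PUBLISHED_SIZES = ("2b", "9b", "27b")

_ELIZA_MODEL_RE = re.compile(r"eliza-1-(?P<size>[0-9a-z_]+)")


def parse_eliza_model(value: str) -> str:
    """Extract the size suffix from an ELIZA_MODEL value such as 'eliza-1-9b'."""
    match = _ELIZA_MODEL_RE.fullmatch(value.strip())
    if match is None or match["size"] not in PUBLISHED_SIZES:
        raise ValueError(
            f"ELIZA_MODEL={value!r} is not one of: "
            + ", ".join(f"eliza-1-{size}" for size in PUBLISHED_SIZES)
        )
    return match["size"]


if __name__ == "__main__":
    print(parse_eliza_model("eliza-1-9b"))
```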