packages/training/cloud/ollama/README.md
Three Modelfiles, one per published size, all pulling GGUF artifacts from the
public consolidated elizaos/eliza-1 bundle repo on HuggingFace.
| File | Size | Target GPU | Resident VRAM |
|---|---|---|---|
Modelfile.eliza-1-2b-q4_k_m | 2B | 16 GB consumer (RTX 5080 / 4080) | ~3 GB |
Modelfile.eliza-1-9b-q4_k_m | 9B | 24-48 GB workstation (RTX 4090 / 5090) | ~8-10 GB |
Modelfile.eliza-1-27b-q4_k_m | 27B | 32+ GB (RTX 5090 / RTX Pro 5000) | ~22 GB |
Each Modelfile ships with the canonical Eliza system prompt, the ChatML stop tokens used by Eliza-1, and per-size context / sampling defaults.
# 2B — local consumer GPU
ollama create eliza-1-2b -f Modelfile.eliza-1-2b-q4_k_m
# 9B — workstation
ollama create eliza-1-9b -f Modelfile.eliza-1-9b-q4_k_m
# 27B — high-VRAM card or datacenter
ollama create eliza-1-27b -f Modelfile.eliza-1-27b-q4_k_m
Ollama pulls the GGUF directly from HuggingFace on first build — no
intermediate ollama pull needed. Subsequent builds reuse the cached
blob.
ollama run eliza-1-9b
Or expose to Eliza:
# .env
OLLAMA_API_ENDPOINT=http://localhost:11434/api
OLLAMA_LARGE_MODEL=eliza-1-9b
OLLAMA_SMALL_MODEL=eliza-1-2b
When the @elizaos/plugin-ollama plugin is enabled, Eliza sends
TEXT_LARGE requests to the model named in OLLAMA_LARGE_MODEL and
TEXT_SMALL requests to the model named in OLLAMA_SMALL_MODEL.
When a new fine-tune ships (e.g. eliza-1.1-9b), update the FROM
line to the new HF repo and rebuild:
ollama create eliza-1-9b -f Modelfile.eliza-1-9b-q4_k_m # picks up new FROM
Ollama replaces the local model in place; agents already pointing at
eliza-1-9b will use the new weights on the next request without any
config change.
GGUF + llama.cpp is the canonical local-inference path:
:11434 that the
@elizaos/plugin-ollama plugin already consumes.For datacenter / multi-GPU serving, see the sibling vast-pyworker
manifests at ../vast-pyworker/ — those use the vLLM + PolarQuant /
fp8 path defined in training/scripts/inference/serve_vllm.py.