packages/training/README.md
Fine-tunes the eliza-1 model series (eliza-1-0_8b, eliza-1-2b,
eliza-1-4b, eliza-1-9b, eliza-1-27b) on Eliza-native Vercel AI SDK
trajectory rows: the exact
request sent to the model plus the exact normalized response returned by the
model, including native tool calls.
This directory is gitignored. The canonical artifact stores live on HuggingFace, not in git history:
what repo script Dataset (native trajectory SFT) runtime eliza_native_v1exportsscripts/trajectories_to_sft.pyTrained models elizaos/eliza-1(model;bundles/<tier>/)scripts/publish/publish_model.pyTraining dataset (SFT splits) elizaos/eliza-1-training(dataset)scripts/publish/publish_dataset.pyPipeline source (this directory) elizaos/eliza-1-training(dataset;pipeline/)scripts/publish/publish_pipeline.pyQuants land under the same
elizaos/eliza-1model repo alongside each bundle manifest; do not create per-quant public defaults.
The base models are catalogued in scripts/training/model_registry.py;
each entry is tagged local | workstation | cloud. Default optimizer is
APOLLO (full-parameter SFT, low-memory projected optimizer state,
arXiv:2412.05270), not LoRA.
| registry key | eliza release | base | tier | default training target | optimizer |
|---|---|---|---|---|---|
| qwen3.5-0.8b | eliza-1-0_8b | Qwen/Qwen3.5-0.8B-Base | local | 16 GB consumer GPU | apollo_mini |
| qwen3.5-2b | eliza-1-2b | Qwen/Qwen3.5-2B-Base | local | 16 GB consumer GPU | apollo_mini |
| qwen3.5-4b | eliza-1-4b | Qwen/Qwen3.5-4B-Base | local | 24 GB consumer/workstation GPU | apollo_mini |
| qwen3.5-9b | eliza-1-9b | Qwen/Qwen3.5-9B | workstation | 80 GB-class GPU | apollo |
| qwen3.6-27b | eliza-1-27b | Qwen/Qwen3.6-27B | cloud | 2x H200 / B200 | apollo |
After training, two post-training quantization passes run: PolarQuant (4-bit weights via Hadamard rotation, arXiv:2603.29078) and TurboQuant (online KV-cache quantization, arXiv:2504.19874). KV compression is rounded out by QJL (1-bit JL-projected K cache, arXiv:2406.03482).
A unified pipeline runner (scripts/run_pipeline.py) chains:
base bench → APOLLO SFT → fine-tuned bench → PolarQuant + TurboQuant + QJL → quantized bench
Per-task benchmarks live in scripts/benchmark/native_tool_call_bench.py and
score native tool-call structure, tool names, argument keys, and JSON routing
shape on the held-out trajectory split.
hf download elizaos/eliza-1-training pipeline --repo-type dataset --local-dir ./training
cd training
uv sync --extra train
datasets.yaml ──▶ download_datasets.py ──▶ data/raw/<slug>/
prompts/ ──▶ extract_eliza_prompts.py ──▶ data/prompts/registry.json
│
▼
synthesize_targets.py (teacher: Anthropic API)
│
▼
data/raw/* ──▶ normalize.py ──▶ data/normalized/<slug>.jsonl
│
▼
pack_dataset.py
│
▼
data/final/{train,val,test}.jsonl
│
┌──────────┴──────────┐
▼ ▼
train_local.py train_vast.sh
(APOLLO, 0.8B/2B) (APOLLO, 4B remote GPU)
│
▼
┌─────────────┬─────┴─────┬─────────────┐
▼ ▼ ▼ ▼
polarquant_apply turboquant_apply qjl_apply (vendored)
│
▼
native_tool_call_bench.py
(native tool-call + JSON structure correctness)
The runtime training path uses native JSON records, not alternate harness rows.
The contract is documented in
docs/dataset/NATIVE_TOOL_CALLING_SPEC.md.
Source transform families are summarized in
docs/dataset/NATIVE_SOURCE_TRANSFORMS.md.
Bootstrap flow:
uv run python scripts/download_datasets.py --priority all \
--skip nubilio-trajectories,light-multilight \
--max-workers 2 --min-free-gb 40
uv run python scripts/normalize.py
uv run python scripts/prepare_native_tool_calling_data.py --write-matrix
uv run python scripts/prepare_native_tool_calling_data.py \
--transform-normalized --validate-native
uv run python scripts/bootstrap_native_to_eliza_native.py \
--input data/native/records \
--output data/native/eliza_native_bootstrap.jsonl
The source matrix is written to data/native/source_matrix.json and
data/native/SOURCE_MATRIX.md. It records every datasource's transform family,
strengths, weaknesses, raw-data status, and recommended native-training weight.
bootstrap_native_to_eliza_native.py converts bootstrap rows into the same
eliza_native_v1 request/response boundary format used by real runtime
trajectory exports.
Trajectory alignment audit:
uv run --with pyyaml --with pyarrow \
python scripts/sample_native_trajectory_alignment.py \
--samples-per-source 10 --run-cerebras
This writes ignored review artifacts under data/native/audit/: randomized
raw samples per downloaded dataset, reference simple/wallet/email/calendar
trajectories, real Eliza recorder-stage comparisons, an eliza_native_v1
export of real local trajectories for smoke training, per-dataset synthesis
templates for missing components, model-call envelopes for Cerebras and the
Vercel AI Gateway bridge, and a composition audit. See
docs/dataset/TRAJECTORY_ALIGNMENT_AUDIT.md.
# Build train/val/test directly from runtime trajectory exports
uv run --extra train python scripts/trajectories_to_sft.py \
--input ../trajectory-export.jsonl \
--output-dir data/trajectory-runs/local-review
# One-command local APOLLO fine-tune from trajectory export(s)
uv run --extra train python scripts/run_pipeline.py \
--registry-key qwen3.5-2b \
--trajectory-export ../trajectory-export.jsonl \
--epochs 1 --skip-base-bench
# Smoke test on 0.8B (smallest active eliza-1 size, trains on 16 GB)
uv run --extra train python scripts/run_pipeline.py \
--registry-key qwen3.5-0.8b --max-samples 1000 --epochs 1
# Full pipeline on 2B (eliza-1-2b, real local run)
uv run --extra train python scripts/run_pipeline.py \
--registry-key qwen3.5-2b --epochs 3
# Remote GPU pipeline for the active 4B APOLLO tier
VAST_API_KEY=... HUGGING_FACE_HUB_TOKEN=... \
bash scripts/train_vast.sh provision-and-train \
--registry-key qwen3.5-4b --epochs 1 --bootstrap hf
# Push the trained checkpoint to elizaos/eliza-1 (gated bundle path).
# publish_model.py --mode bundle runs the full publish orchestrator:
# layout → kernel verify → eval gates (incl. prior-bundle regression check) →
# manifest → README → HF push.
HF_TOKEN=hf_xxx uv run python -m scripts.publish.publish_model \
--mode bundle \
--tier 4b \
--bundle-dir checkpoints/qwen3-5-4b-apollo/final
See RL_STRATEGY.md for the post-SFT plan (DPO + GRPO via verl).
The active 0.8B, 2B, 4B, 9B, and 27B APOLLO tiers can train on Vast.ai via scripts/train_vast.sh
(subcommands: search, provision, sync, run,
quantize, bench, fetch, status, pull-checkpoints,
kill-and-teardown, teardown, provision-and-train). The script
auto-picks the GPU target from REGISTRY_KEY:
qwen3.5-0.8b / eliza-1-0_8b → smallest active smoke/default tier.qwen3.5-2b / eliza-1-2b → local 16 GB training tier.qwen3.5-4b / eliza-1-4b → 48 GB-class training tier, optimized for 24 GB inference after quantization.qwen3.5-9b / eliza-1-9b → workstation tier.qwen3.6-27b / eliza-1-27b → cloud tier.Lower-level helpers live in scripts/lib/vast.py (searchable via
python -m scripts.lib.vast pick blackwell6000-2x). scripts/day0_smoke.sh
uses the same helpers for its day-0 verification run. scripts/train_nebius.sh
is kept only as an emergency fallback if Vast capacity is unavailable; do not
extend the Nebius path.
scripts/training/optimizer.py (apollo-torch package).
Validation: scripts/training/test_apollo.py.scripts/quantization/polarquant_apply.py plus the
vendored scripts/quantization/polarquant/ library. 4-bit weight codes
scripts/quantization/turboquant_apply.py (turbokv
package, pure-PyTorch reference) and
scripts/quantization/fused_turboquant_apply.py (Triton-fused, vendored
in scripts/quantization/fused_turboquant_vendored/). Inference-time KV
cache quantizer.scripts/quantization/qjl_apply.py plus the vendored
scripts/quantization/qjl/ CUDA extension. 1-bit JL-projected K cache.scripts/training/instrumentation.py. JSONL trace
with peak memory + tokens/sec per logging window; hard-fails the run
when torch.cuda.max_memory_reserved() exceeds the registry budget by
more than 10 %.scripts/benchmark/native_tool_call_bench.py. It scores
expected native tool names, argument keys, and JSON routing/planner shape.
Run on base + fine-tuned + each quantized variant for direct A/B numbers.The primary trajectory-training record is an eliza_native_v1 boundary row.
The renderer reads request.messages or request.prompt, appends the
supervised assistant turn from response.text and/or response.toolCalls, and
passes request.tools into tokenizer.apply_chat_template(..., tools=...)
when the tokenizer supports native tool rendering.
{
"format": "eliza_native_v1",
"request": {"messages": [...], "tools": {...}, "toolChoice": "..."},
"response": {"text": "...", "toolCalls": [...]}
}
The same chat template is applied at benchmark time with
add_generation_prompt=True, so the model sees the same request structure at
training and generation time.
For handoff compatibility, scripts/format_for_training.py also accepts
trainable eliza.eliza1_trajectory_record.v1 message rows, already-rendered
chat-message rows with a final assistant turn, and legacy flat ElizaRecord
rows from pack_dataset.py. It rejects repair_eval / failed-quality rows.
Remote Vast bootstrap expects root split names
data/final/{train,val,test}.jsonl; candidate repos use
data/validation.jsonl, so stage or rename that split to val.jsonl before
using it as the remote root dataset.
Three of the four memory optimizations rely on JIT-compiled or hand-written CUDA kernels:
gcc + Python dev
headers.scripts/quantization/qjl/csrc/ — needs nvcc from the CUDA toolkit
in addition to Python dev headers.One-shot install on Debian/Ubuntu:
sudo apt install build-essential python3.12-dev nvidia-cuda-toolkit
# Then build QJL:
cd scripts/quantization/qjl && python setup.py build_ext --inplace
# For Blackwell (sm_120, RTX 50-series + RTX Pro Blackwell):
TORCH_CUDA_ARCH_LIST="12.0+PTX" python setup.py build_ext --inplace
Without these, every kernel path has a documented fallback (Liger → HF defaults; fused-turboquant → pure-PyTorch turbokv; QJL → bf16 K cache — you lose the K-side compression but retain V-side TurboQuant). The training/inference script logs a warning at startup so you know which path is actually running.
cd training
uv sync --extra train
uv run python scripts/download_datasets.py
uv run python scripts/extract_eliza_prompts.py
uv run python scripts/normalize.py
uv run python scripts/synthesize_targets.py --task should_respond # optional
uv run python scripts/pack_dataset.py
# Smoke test (small subset, no Liger — proves the path end to end on 2B)
uv run --extra train python scripts/train_local.py \
--registry-key qwen3.5-2b --max-samples 256 --epochs 1 \
--use-liger off
# Real local 2B run with Liger (8k seq_len, APOLLO, instrumentation)
uv run --extra train python scripts/train_local.py \
--registry-key qwen3.5-2b --epochs 3 --full-finetune \
--max-chars 24000
# Full pipeline: base bench → APOLLO SFT → fine-tuned bench → quant → quant bench
uv run --extra train python scripts/run_pipeline.py \
--registry-key qwen3.5-2b --epochs 3
For cloud-tier runs see scripts/train_vast.sh and scripts/CLOUD_VAST.md.
scripts/train_nebius.sh is emergency fallback only.
For inference see scripts/inference/serve_vllm.py (vLLM serve launcher) and
scripts/inference/serve_local.py.
The registry's qwen3.5-2b default targets a 16 GB local GPU (seq_len=8192,
budget=15.5 GB). On a 12 GB card that OOMs at the fp32 logits transient
(B·S·V·4 bytes; Qwen vocab=248k makes this dominant). The --low-vram-smoke
flag is a preset bundle that brings the SFT path inside a 12 GB envelope
so the train→quant→bench plumbing can be validated end-to-end on commodity
hardware before reaching for a rented H100/H200.
# 0.8B — the realistic tier for a 12 GB card. Smallest published eliza-1 base.
uv run --extra train python scripts/train_local.py \
--registry-key qwen3.5-0.8b --low-vram-smoke
# 2B — feasible inside 12 GB only with Liger + flash-linear-attention
# kernels installed (see prerequisites below). Without those kernels the
# 248k-vocab fp32 logits transient saturates the card and the run will
# either OOM at the first backward pass or run unusably slowly.
uv run --extra train python scripts/train_local.py \
--registry-key qwen3.5-2b --low-vram-smoke
What the preset overrides (CLI flags the caller passes still win):
seq_len = 2048 (registry default 4096 for 0.8B, 8192 for 2B)per_device_batch_size = 1gradient_accumulation_steps = 16 (effective batch stays at 16)max_samples = 1000epochs = 1memory_budget_gb = 11.5 (1.5 GB headroom under 12 GB)train_local.py)Kernel prerequisites for the 2B path. The preset's budget at 2B depends on two things being live:
gcc, python3.x-dev, and a CUDA toolkit Triton
can use must all be installed. Without them, train_local.py logs a
warning at startup and falls back to HF defaults.When either kernel is missing, the 2B-on-12 GB path is effectively gated to preflight + small-step verification — full SFT will run but extremely slowly. The 0.8B preset is the realistic full-run option on stock WSL2 / no-toolchain hosts.
Verify the wiring without running SFT. --preflight-only validates the
preset's flag bundle, the dataset format, and the APOLLO classification
without loading model weights or touching CUDA:
uv run --extra train python scripts/train_local.py \
--registry-key qwen3.5-2b --low-vram-smoke \
--train-file data/final-eliza1-smoke/train.jsonl \
--val-file data/final-eliza1-smoke/val.jsonl \
--preflight-only
# → "low-vram-smoke preset → seq=2048 batch=1 accum=16 ... budget=11.5GB"
# → "preflight ok: train=314/314 validation=39/39 optimizer=apollo_mini rank=1"
Measured on RTX 3060 (12 GB, WSL2 Ubuntu, torch 2.12.0+cu130, no Liger
JIT toolchain available, 314 smoke records in data/final-eliza1-smoke/):
| tier | preflight | peak VRAM (load + step 0) | notes |
|---|---|---|---|
| 0.8B | ok | ~12.0 GB | runs to completion, but each step is slow without Liger + flash-linear-attn |
| 2B | ok | ~12.0 GB | hits the 12 GB ceiling at the first backward without Liger; first step does not complete in a useful time window |
The instrumentation callback (enabled because --memory-budget-gb is set)
fails the run loud the moment torch.cuda.max_memory_reserved() exceeds
the budget by more than 10 %.
Trade-offs:
--max-chars 6000 (≈3× seq_len) so the
char-filter rejects oversized rows up front rather than wasting them.If it still OOMs: the preset's headroom is conservative but real-world
allocator fragmentation can still tip a 12 GB card over. Drop seq_len with
--low-vram-smoke --max-seq-len 1024 (or 768) and retry. If you cannot
install the Liger / flash-linear-attention kernels (Windows / WSL2 without
a CUDA toolkit and Python dev headers), prefer the 0.8B tier — it is the
smallest published eliza-1 base and validates the same end-to-end pipeline.
The full quantization stack at inference is:
Run scripts/training/memory_calc.py for the actual numbers — every
table below comes from there. Do not transcribe these tables into other
docs; the calculator is the source of truth.
uv run --extra train python scripts/training/memory_calc.py --shape qwen3.5-2b
uv run --extra train python scripts/training/memory_calc.py --shape qwen3.5-4b
The memory_calc output covers APOLLO training memory across seq_len ∈ {4k…147k}, inference memory at the same context lengths for every
(weight-quant, K-quant, V-quant)
combination, an inference fit table across modern local GPUs, and the maximum
context per card with the full quant stack.
Upstream fused_turboquant.hf 0.1.0's make_fused_attention_forward
does not handle the gated q_proj layout (q_proj.out_features = 2 * num_heads * head_dim) used by the Qwen3.5 attention block.
The vendored quantization.fused_turboquant_vendored package adds the
gated branch; until that lands and is smoke-tested end-to-end the V-side
path stays on bf16. K-side QJL is independent of the gated patch and is
on today against the head_dim=256 CUDA kernel.