Back to Eliza

eliza-1 — Training Pipeline

packages/training/README.md

2.0.120.6 KB
Original Source

eliza-1 — Training Pipeline

Fine-tunes the eliza-1 model series (eliza-1-0_8b, eliza-1-2b, eliza-1-4b, eliza-1-9b, eliza-1-27b) on Eliza-native Vercel AI SDK trajectory rows: the exact request sent to the model plus the exact normalized response returned by the model, including native tool calls.

This directory is gitignored. The canonical artifact stores live on HuggingFace, not in git history:

whatreposcript
Dataset (native trajectory SFT)runtime eliza_native_v1 exportsscripts/trajectories_to_sft.py
Trained modelselizaos/eliza-1 (model; bundles/<tier>/)scripts/publish/publish_model.py
Training dataset (SFT splits)elizaos/eliza-1-training (dataset)scripts/publish/publish_dataset.py
Pipeline source (this directory)elizaos/eliza-1-training (dataset; pipeline/)scripts/publish/publish_pipeline.py

Quants land under the same elizaos/eliza-1 model repo alongside each bundle manifest; do not create per-quant public defaults.

The base models are catalogued in scripts/training/model_registry.py; each entry is tagged local | workstation | cloud. Default optimizer is APOLLO (full-parameter SFT, low-memory projected optimizer state, arXiv:2412.05270), not LoRA.

registry keyeliza releasebasetierdefault training targetoptimizer
qwen3.5-0.8beliza-1-0_8bQwen/Qwen3.5-0.8B-Baselocal16 GB consumer GPUapollo_mini
qwen3.5-2beliza-1-2bQwen/Qwen3.5-2B-Baselocal16 GB consumer GPUapollo_mini
qwen3.5-4beliza-1-4bQwen/Qwen3.5-4B-Baselocal24 GB consumer/workstation GPUapollo_mini
qwen3.5-9beliza-1-9bQwen/Qwen3.5-9Bworkstation80 GB-class GPUapollo
qwen3.6-27beliza-1-27bQwen/Qwen3.6-27Bcloud2x H200 / B200apollo

After training, two post-training quantization passes run: PolarQuant (4-bit weights via Hadamard rotation, arXiv:2603.29078) and TurboQuant (online KV-cache quantization, arXiv:2504.19874). KV compression is rounded out by QJL (1-bit JL-projected K cache, arXiv:2406.03482).

A unified pipeline runner (scripts/run_pipeline.py) chains:

base bench → APOLLO SFT → fine-tuned bench → PolarQuant + TurboQuant + QJL → quantized bench

Per-task benchmarks live in scripts/benchmark/native_tool_call_bench.py and score native tool-call structure, tool names, argument keys, and JSON routing shape on the held-out trajectory split.

Cloning the pipeline on a fresh machine

bash
hf download elizaos/eliza-1-training pipeline --repo-type dataset --local-dir ./training
cd training
uv sync --extra train

Pipeline

datasets.yaml ──▶ download_datasets.py ──▶ data/raw/<slug>/
prompts/    ──▶ extract_eliza_prompts.py ──▶ data/prompts/registry.json
                                                  │
                                                  ▼
                          synthesize_targets.py (teacher: Anthropic API)
                                                  │
                                                  ▼
data/raw/* ──▶ normalize.py ──▶ data/normalized/<slug>.jsonl
                                  │
                                  ▼
                            pack_dataset.py
                                  │
                                  ▼
                  data/final/{train,val,test}.jsonl
                                  │
                       ┌──────────┴──────────┐
                       ▼                     ▼
              train_local.py        train_vast.sh
              (APOLLO, 0.8B/2B)     (APOLLO, 4B remote GPU)
                                  │
                                  ▼
              ┌─────────────┬─────┴─────┬─────────────┐
              ▼             ▼           ▼             ▼
       polarquant_apply  turboquant_apply  qjl_apply (vendored)
                                  │
                                  ▼
                          native_tool_call_bench.py
                  (native tool-call + JSON structure correctness)

Native Tool-Calling Data

The runtime training path uses native JSON records, not alternate harness rows. The contract is documented in docs/dataset/NATIVE_TOOL_CALLING_SPEC.md. Source transform families are summarized in docs/dataset/NATIVE_SOURCE_TRANSFORMS.md.

Bootstrap flow:

bash
uv run python scripts/download_datasets.py --priority all \
    --skip nubilio-trajectories,light-multilight \
    --max-workers 2 --min-free-gb 40
uv run python scripts/normalize.py
uv run python scripts/prepare_native_tool_calling_data.py --write-matrix
uv run python scripts/prepare_native_tool_calling_data.py \
    --transform-normalized --validate-native
uv run python scripts/bootstrap_native_to_eliza_native.py \
    --input data/native/records \
    --output data/native/eliza_native_bootstrap.jsonl

The source matrix is written to data/native/source_matrix.json and data/native/SOURCE_MATRIX.md. It records every datasource's transform family, strengths, weaknesses, raw-data status, and recommended native-training weight. bootstrap_native_to_eliza_native.py converts bootstrap rows into the same eliza_native_v1 request/response boundary format used by real runtime trajectory exports.

Trajectory alignment audit:

bash
uv run --with pyyaml --with pyarrow \
    python scripts/sample_native_trajectory_alignment.py \
    --samples-per-source 10 --run-cerebras

This writes ignored review artifacts under data/native/audit/: randomized raw samples per downloaded dataset, reference simple/wallet/email/calendar trajectories, real Eliza recorder-stage comparisons, an eliza_native_v1 export of real local trajectories for smoke training, per-dataset synthesis templates for missing components, model-call envelopes for Cerebras and the Vercel AI Gateway bridge, and a composition audit. See docs/dataset/TRAJECTORY_ALIGNMENT_AUDIT.md.

Quick reference

bash
# Build train/val/test directly from runtime trajectory exports
uv run --extra train python scripts/trajectories_to_sft.py \
    --input ../trajectory-export.jsonl \
    --output-dir data/trajectory-runs/local-review

# One-command local APOLLO fine-tune from trajectory export(s)
uv run --extra train python scripts/run_pipeline.py \
    --registry-key qwen3.5-2b \
    --trajectory-export ../trajectory-export.jsonl \
    --epochs 1 --skip-base-bench

# Smoke test on 0.8B (smallest active eliza-1 size, trains on 16 GB)
uv run --extra train python scripts/run_pipeline.py \
    --registry-key qwen3.5-0.8b --max-samples 1000 --epochs 1

# Full pipeline on 2B (eliza-1-2b, real local run)
uv run --extra train python scripts/run_pipeline.py \
    --registry-key qwen3.5-2b --epochs 3

# Remote GPU pipeline for the active 4B APOLLO tier
VAST_API_KEY=... HUGGING_FACE_HUB_TOKEN=... \
    bash scripts/train_vast.sh provision-and-train \
    --registry-key qwen3.5-4b --epochs 1 --bootstrap hf

# Push the trained checkpoint to elizaos/eliza-1 (gated bundle path).
# publish_model.py --mode bundle runs the full publish orchestrator:
# layout → kernel verify → eval gates (incl. prior-bundle regression check) →
# manifest → README → HF push.
HF_TOKEN=hf_xxx uv run python -m scripts.publish.publish_model \
    --mode bundle \
    --tier 4b \
    --bundle-dir checkpoints/qwen3-5-4b-apollo/final

See RL_STRATEGY.md for the post-SFT plan (DPO + GRPO via verl).

Renting GPUs

The active 0.8B, 2B, 4B, 9B, and 27B APOLLO tiers can train on Vast.ai via scripts/train_vast.sh (subcommands: search, provision, sync, run, quantize, bench, fetch, status, pull-checkpoints, kill-and-teardown, teardown, provision-and-train). The script auto-picks the GPU target from REGISTRY_KEY:

  • qwen3.5-0.8b / eliza-1-0_8b → smallest active smoke/default tier.
  • qwen3.5-2b / eliza-1-2b → local 16 GB training tier.
  • qwen3.5-4b / eliza-1-4b → 48 GB-class training tier, optimized for 24 GB inference after quantization.
  • qwen3.5-9b / eliza-1-9b → workstation tier.
  • qwen3.6-27b / eliza-1-27b → cloud tier.

Lower-level helpers live in scripts/lib/vast.py (searchable via python -m scripts.lib.vast pick blackwell6000-2x). scripts/day0_smoke.sh uses the same helpers for its day-0 verification run. scripts/train_nebius.sh is kept only as an emergency fallback if Vast capacity is unavailable; do not extend the Nebius path.

Implementation details

  • APOLLOscripts/training/optimizer.py (apollo-torch package). Validation: scripts/training/test_apollo.py.
  • PolarQuantscripts/quantization/polarquant_apply.py plus the vendored scripts/quantization/polarquant/ library. 4-bit weight codes
    • fp16 norms in a sidecar safetensors.
  • TurboQuantscripts/quantization/turboquant_apply.py (turbokv package, pure-PyTorch reference) and scripts/quantization/fused_turboquant_apply.py (Triton-fused, vendored in scripts/quantization/fused_turboquant_vendored/). Inference-time KV cache quantizer.
  • QJLscripts/quantization/qjl_apply.py plus the vendored scripts/quantization/qjl/ CUDA extension. 1-bit JL-projected K cache.
  • Instrumentationscripts/training/instrumentation.py. JSONL trace with peak memory + tokens/sec per logging window; hard-fails the run when torch.cuda.max_memory_reserved() exceeds the registry budget by more than 10 %.
  • Benchmarkscripts/benchmark/native_tool_call_bench.py. It scores expected native tool names, argument keys, and JSON routing/planner shape. Run on base + fine-tuned + each quantized variant for direct A/B numbers.

Uniform chat format

The primary trajectory-training record is an eliza_native_v1 boundary row. The renderer reads request.messages or request.prompt, appends the supervised assistant turn from response.text and/or response.toolCalls, and passes request.tools into tokenizer.apply_chat_template(..., tools=...) when the tokenizer supports native tool rendering.

{
  "format": "eliza_native_v1",
  "request": {"messages": [...], "tools": {...}, "toolChoice": "..."},
  "response": {"text": "...", "toolCalls": [...]}
}

The same chat template is applied at benchmark time with add_generation_prompt=True, so the model sees the same request structure at training and generation time.

For handoff compatibility, scripts/format_for_training.py also accepts trainable eliza.eliza1_trajectory_record.v1 message rows, already-rendered chat-message rows with a final assistant turn, and legacy flat ElizaRecord rows from pack_dataset.py. It rejects repair_eval / failed-quality rows. Remote Vast bootstrap expects root split names data/final/{train,val,test}.jsonl; candidate repos use data/validation.jsonl, so stage or rename that split to val.jsonl before using it as the remote root dataset.

System prerequisites

Three of the four memory optimizations rely on JIT-compiled or hand-written CUDA kernels:

  • Liger kernel (training): Triton JIT — needs gcc + Python dev headers.
  • Fused TurboQuant (inference V cache): same Triton JIT requirements.
  • QJL (inference K cache): hand-written CUDA C++ extensions in scripts/quantization/qjl/csrc/ — needs nvcc from the CUDA toolkit in addition to Python dev headers.
  • PolarQuant and APOLLO: pure-PyTorch / pip — no system deps.

One-shot install on Debian/Ubuntu:

bash
sudo apt install build-essential python3.12-dev nvidia-cuda-toolkit
# Then build QJL:
cd scripts/quantization/qjl && python setup.py build_ext --inplace
# For Blackwell (sm_120, RTX 50-series + RTX Pro Blackwell):
TORCH_CUDA_ARCH_LIST="12.0+PTX" python setup.py build_ext --inplace

Without these, every kernel path has a documented fallback (Liger → HF defaults; fused-turboquant → pure-PyTorch turbokv; QJL → bf16 K cache — you lose the K-side compression but retain V-side TurboQuant). The training/inference script logs a warning at startup so you know which path is actually running.

Quickstart

bash
cd training
uv sync --extra train
uv run python scripts/download_datasets.py
uv run python scripts/extract_eliza_prompts.py
uv run python scripts/normalize.py
uv run python scripts/synthesize_targets.py --task should_respond  # optional
uv run python scripts/pack_dataset.py

# Smoke test (small subset, no Liger — proves the path end to end on 2B)
uv run --extra train python scripts/train_local.py \
    --registry-key qwen3.5-2b --max-samples 256 --epochs 1 \
    --use-liger off

# Real local 2B run with Liger (8k seq_len, APOLLO, instrumentation)
uv run --extra train python scripts/train_local.py \
    --registry-key qwen3.5-2b --epochs 3 --full-finetune \
    --max-chars 24000

# Full pipeline: base bench → APOLLO SFT → fine-tuned bench → quant → quant bench
uv run --extra train python scripts/run_pipeline.py \
    --registry-key qwen3.5-2b --epochs 3

For cloud-tier runs see scripts/train_vast.sh and scripts/CLOUD_VAST.md. scripts/train_nebius.sh is emergency fallback only. For inference see scripts/inference/serve_vllm.py (vLLM serve launcher) and scripts/inference/serve_local.py.

Running on a 12 GB consumer GPU (RTX 3060 / 4070 class)

The registry's qwen3.5-2b default targets a 16 GB local GPU (seq_len=8192, budget=15.5 GB). On a 12 GB card that OOMs at the fp32 logits transient (B·S·V·4 bytes; Qwen vocab=248k makes this dominant). The --low-vram-smoke flag is a preset bundle that brings the SFT path inside a 12 GB envelope so the train→quant→bench plumbing can be validated end-to-end on commodity hardware before reaching for a rented H100/H200.

bash
# 0.8B — the realistic tier for a 12 GB card. Smallest published eliza-1 base.
uv run --extra train python scripts/train_local.py \
    --registry-key qwen3.5-0.8b --low-vram-smoke

# 2B — feasible inside 12 GB only with Liger + flash-linear-attention
# kernels installed (see prerequisites below). Without those kernels the
# 248k-vocab fp32 logits transient saturates the card and the run will
# either OOM at the first backward pass or run unusably slowly.
uv run --extra train python scripts/train_local.py \
    --registry-key qwen3.5-2b --low-vram-smoke

What the preset overrides (CLI flags the caller passes still win):

  • seq_len = 2048 (registry default 4096 for 0.8B, 8192 for 2B)
  • per_device_batch_size = 1
  • gradient_accumulation_steps = 16 (effective batch stays at 16)
  • max_samples = 1000
  • epochs = 1
  • memory_budget_gb = 11.5 (1.5 GB headroom under 12 GB)
  • Liger fused chunked-CE stays on (registry default, when installed and Triton can JIT-compile — see the System Prerequisites section)
  • Activation checkpointing stays on (default in train_local.py)
  • APOLLO-Mini (registry default for 0.8B/2B/4B) remains the optimizer; rank=1 keeps optimizer state effectively free.

Kernel prerequisites for the 2B path. The preset's budget at 2B depends on two things being live:

  • Liger fused chunked-CE — cuts the fp32 logits transient ~4–8× on Qwen's 248k vocab. Requires Triton, which JIT-compiles a small CUDA helper at the first kernel launch — gcc, python3.x-dev, and a CUDA toolkit Triton can use must all be installed. Without them, train_local.py logs a warning at startup and falls back to HF defaults.
  • Flash-linear-attention — the Qwen3.5 hybrid linear-attn layers (6 of 24 in the 0.8B/2B configs) have a fast Triton path through https://github.com/fla-org/flash-linear-attention. Without it, transformers falls back to a pure-PyTorch linear-attn implementation that is roughly 10× slower per step.

When either kernel is missing, the 2B-on-12 GB path is effectively gated to preflight + small-step verification — full SFT will run but extremely slowly. The 0.8B preset is the realistic full-run option on stock WSL2 / no-toolchain hosts.

Verify the wiring without running SFT. --preflight-only validates the preset's flag bundle, the dataset format, and the APOLLO classification without loading model weights or touching CUDA:

bash
uv run --extra train python scripts/train_local.py \
    --registry-key qwen3.5-2b --low-vram-smoke \
    --train-file data/final-eliza1-smoke/train.jsonl \
    --val-file data/final-eliza1-smoke/val.jsonl \
    --preflight-only
# → "low-vram-smoke preset → seq=2048 batch=1 accum=16 ... budget=11.5GB"
# → "preflight ok: train=314/314 validation=39/39 optimizer=apollo_mini rank=1"

Measured on RTX 3060 (12 GB, WSL2 Ubuntu, torch 2.12.0+cu130, no Liger JIT toolchain available, 314 smoke records in data/final-eliza1-smoke/):

tierpreflightpeak VRAM (load + step 0)notes
0.8Bok~12.0 GBruns to completion, but each step is slow without Liger + flash-linear-attn
2Bok~12.0 GBhits the 12 GB ceiling at the first backward without Liger; first step does not complete in a useful time window

The instrumentation callback (enabled because --memory-budget-gb is set) fails the run loud the moment torch.cuda.max_memory_reserved() exceeds the budget by more than 10 %.

Trade-offs:

  • Training context window drops from 4–8k to 2k. Records longer than ~2k rendered chars are right-truncated by the tokenizer. The smoke trajectory dataset already fits inside 2k; for the real native trajectory corpus pass --max-chars 6000 (≈3× seq_len) so the char-filter rejects oversized rows up front rather than wasting them.
  • Long-context behaviors (multi-turn agent traces, long tool outputs) are NOT exercised at seq_len=2048. The resulting checkpoint is for smoke / path-validation only, NOT for publishing.
  • Loss numbers from the smoke are not comparable to the registry-default run — the effective sequence diet is different.

If it still OOMs: the preset's headroom is conservative but real-world allocator fragmentation can still tip a 12 GB card over. Drop seq_len with --low-vram-smoke --max-seq-len 1024 (or 768) and retry. If you cannot install the Liger / flash-linear-attention kernels (Windows / WSL2 without a CUDA toolkit and Python dev headers), prefer the 0.8B tier — it is the smallest published eliza-1 base and validates the same end-to-end pipeline.

Memory budgets

The full quantization stack at inference is:

  • APOLLO / APOLLO-Mini at training time — projected optimizer state keeps the 2B local run inside the consumer-GPU budget.
  • PolarQuant — 4-bit weight quantization (arXiv:2603.29078).
  • QJL — 1-bit JL-projected K cache (arXiv:2406.03482).
  • Fused TurboQuant — Triton-fused 4-bit V cache (arXiv:2504.19874).

Run scripts/training/memory_calc.py for the actual numbers — every table below comes from there. Do not transcribe these tables into other docs; the calculator is the source of truth.

bash
uv run --extra train python scripts/training/memory_calc.py --shape qwen3.5-2b
uv run --extra train python scripts/training/memory_calc.py --shape qwen3.5-4b

The memory_calc output covers APOLLO training memory across seq_len ∈ {4k…147k}, inference memory at the same context lengths for every (weight-quant, K-quant, V-quant) combination, an inference fit table across modern local GPUs, and the maximum context per card with the full quant stack.

Reality on Qwen3.5 V-side

Upstream fused_turboquant.hf 0.1.0's make_fused_attention_forward does not handle the gated q_proj layout (q_proj.out_features = 2 * num_heads * head_dim) used by the Qwen3.5 attention block. The vendored quantization.fused_turboquant_vendored package adds the gated branch; until that lands and is smoke-tested end-to-end the V-side path stays on bf16. K-side QJL is independent of the gated patch and is on today against the head_dim=256 CUDA kernel.