Back to Eliza

Eliza-1 0.6B — Supervised Fine-Tuning Dataset

packages/training/datasets/eliza1-sft-0_6b/README.md

2.0.15.8 KB
Original Source

Eliza-1 0.6B — Supervised Fine-Tuning Dataset

Benchmark-aligned SFT data for the eliza-1-0_6b base (upstream Qwen/Qwen3-0.6B; Qwen2/Qwen3 ChatML template; vocab 151,936; 4096-token training window). Built by packages/training/scripts/build_eliza1_sft_0_6b.py from in-repo benchmark sources, augmented and repaired with Cerebras gpt-oss-120b (OpenAI-compatible API).

Format

ChatML JSONL — one row per line:

json
{"messages": [{"role": "system|user|assistant", "content": "..."}, ...],
 "task": "action_selection|tool_use|personality|assistant|structured_decode|voice_emotion",
 "provenance": "benchmark:<file>#<id> | synthetic:<task> | cerebras:<task>",
 "tags": ["..."]}

The messages array matches the 0.6b's chat template; the splitter and packages/training/scripts/train_local.py ingest it directly via --train-file / --val-file (the chat_messages shape understood by scripts/format_for_training.py). task / provenance / tags are dropped at format time — they are metadata for stratified sampling and audits.

Task mix

taskrowssource
action_selection~68packages/app-core/test/benchmarks/action-selection-cases.ts — user turn → the action the agent should pick (or a plain reply for expectedAction: null), rendered as ACTION: NAME {params} + a short confirmation. 1:1 with the action-selection benchmark case ids.
tool_use~730Cerebras-generated agent-loop turns over the canonical action catalog (OWNER_TODOS, CALENDAR, MESSAGE, BLOCK, …, REPLY): more domain/phrasing variety, ambiguous cases, negative (no-action) cases.
personality~37packages/benchmarks/personality-bench/tests/calibration/{hand-graded,adversarial}.jsonl — PASS-graded trajectories for the five rubrics (shut_up, hold_style, note_trait_unrelated, escalation, scope_global_vs_user). Silence-on-demand rows are truncated to the last trainable assistant turn.
assistant~370Cerebras-generated general assistant turns (concise factual Q&A, explanations of speculative decoding / quantization / VAD / on-device inference — the topics the eliza1_eval_suite held-out text-eval corpus probes), plus polite refusals (cerebras:refusal) and short multi-turn exchanges (cerebras:multiturn).
structured_decode~250Stage-1 response-envelope turns: the W3 flat JSON envelope @elizaos/core buildResponseGrammar constrains — {"shouldRespond":"RESPOND|IGNORE|STOP","thought":...,"replyText":...,"contexts":[...],"contextSlices":[...],"candidateActions":[...],"parentActionHints":[...],"requiresTool":<bool>,"extract":{...}} (shouldRespond dropped on direct DM/voice/API channels). Key order matches packages/core/src/runtime/response-grammar.ts::STAGE1_ENVELOPE_KEYS. Deterministic seed rows (synthetic:stage1-envelope#{direct,full}) + Cerebras augmentation (cerebras:stage1-envelope). This is what makes format_ok measure a real target instead of 0%. (On-wire form is JSON, not "TOON" — it matches the runtime model call.)
voice_emotion~245Spoken replies carrying omnivoice-singing inline expressive tags in replyText[happy] [sad] [angry] [nervous] [calm] [excited] [whisper] [singing] plus the preserved non-verbals [laughter] [sigh], scoped until the next tag or end of phrase. Deterministic seed rows (synthetic:voice-emotion-tags) + Cerebras augmentation (cerebras:voice-emotion-tags). The parse/generate/interpret schema the TTS emotion controls consume.

Eval alignment

This dataset is shaped to move the text metrics of packages/training/scripts/eval/eliza1_eval_suite.py and the structural format_ok gate in packages/training/benchmarks/eliza1_gates.yaml:

  • text_eval (held-out perplexity → 0..1; 0_6b threshold 0.55): the assistant rows mirror the topic distribution of the suite's DEFAULT_TEXT_EVAL_CORPUS (capital cities, speculative decoding, on-device assistants, quantization, voice-activity detection).
  • format_ok (parsable-output rate; floor 0.70): the action_selection and tool_use rows teach the ACTION: NAME {json-params} + short-reply structured surface; the structured_decode rows teach the W3 flat JSON response envelope buildResponseGrammar constrains (the Stage-1 message-handler document) — without those rows the smoke task mix never emitted the envelope and format_pct measured 0%.
  • personality-bench: the personality rows are PASS-graded exemplars of silence on demand, style stickiness, trait respect, escalation, and per-user vs global scope.

This is a focused, high-signal mix-in — it is not the full 67k-row data/final corpus the larger eliza-1 tiers train on. For the 0.6b it can be used standalone (whole train→quant→bench stack runs < 1 h on a 16 GB GPU) or concatenated ahead of the broader corpus.

Provenance & privacy

  • Every row carries a provenance field. Benchmark-derived rows are benchmark:<file>#<id>; Cerebras-generated rows are cerebras:<task>.
  • No real user trajectory data is consumed by the builder — the in-repo benchmark sources are synthetic test fixtures, and the build hosts carry no populated trajectories export. The final splits are nonetheless run through the canonical inline privacy filter (packages/training/scripts/privacy_filter_trajectories.py::redact_value — the same filter format_for_training.format_record applies) as defense-in-depth (API keys / bearer tokens / emails / phones / geo).

Reproduce

bash
cd packages/training
CEREBRAS_API_KEY=<key> uv run python scripts/build_eliza1_sft_0_6b.py
# converted-only (no API key):
uv run python scripts/build_eliza1_sft_0_6b.py --no-augment

Counts, per-task breakdown, token histogram, and the privacy-filter pass are recorded in manifest.json alongside the splits.