packages/training/datasets/eliza1-sft-0_6b/README.md
Benchmark-aligned SFT data for the eliza-1-0_6b base (upstream
Qwen/Qwen3-0.6B; Qwen2/Qwen3 ChatML template; vocab 151,936; 4096-token
training window). Built by
packages/training/scripts/build_eliza1_sft_0_6b.py from in-repo benchmark
sources, augmented and repaired with Cerebras gpt-oss-120b
(OpenAI-compatible API).
ChatML JSONL — one row per line:
{"messages": [{"role": "system|user|assistant", "content": "..."}, ...],
"task": "action_selection|tool_use|personality|assistant|structured_decode|voice_emotion",
"provenance": "benchmark:<file>#<id> | synthetic:<task> | cerebras:<task>",
"tags": ["..."]}
The messages array matches the 0.6b's chat template; the splitter and
packages/training/scripts/train_local.py ingest it directly via
--train-file / --val-file (the chat_messages shape understood by
scripts/format_for_training.py). task / provenance / tags are dropped
at format time — they are metadata for stratified sampling and audits.
| task | rows | source |
|---|---|---|
action_selection | ~68 | packages/app-core/test/benchmarks/action-selection-cases.ts — user turn → the action the agent should pick (or a plain reply for expectedAction: null), rendered as ACTION: NAME {params} + a short confirmation. 1:1 with the action-selection benchmark case ids. |
tool_use | ~730 | Cerebras-generated agent-loop turns over the canonical action catalog (OWNER_TODOS, CALENDAR, MESSAGE, BLOCK, …, REPLY): more domain/phrasing variety, ambiguous cases, negative (no-action) cases. |
personality | ~37 | packages/benchmarks/personality-bench/tests/calibration/{hand-graded,adversarial}.jsonl — PASS-graded trajectories for the five rubrics (shut_up, hold_style, note_trait_unrelated, escalation, scope_global_vs_user). Silence-on-demand rows are truncated to the last trainable assistant turn. |
assistant | ~370 | Cerebras-generated general assistant turns (concise factual Q&A, explanations of speculative decoding / quantization / VAD / on-device inference — the topics the eliza1_eval_suite held-out text-eval corpus probes), plus polite refusals (cerebras:refusal) and short multi-turn exchanges (cerebras:multiturn). |
structured_decode | ~250 | Stage-1 response-envelope turns: the W3 flat JSON envelope @elizaos/core buildResponseGrammar constrains — {"shouldRespond":"RESPOND|IGNORE|STOP","thought":...,"replyText":...,"contexts":[...],"contextSlices":[...],"candidateActions":[...],"parentActionHints":[...],"requiresTool":<bool>,"extract":{...}} (shouldRespond dropped on direct DM/voice/API channels). Key order matches packages/core/src/runtime/response-grammar.ts::STAGE1_ENVELOPE_KEYS. Deterministic seed rows (synthetic:stage1-envelope#{direct,full}) + Cerebras augmentation (cerebras:stage1-envelope). This is what makes format_ok measure a real target instead of 0%. (On-wire form is JSON, not "TOON" — it matches the runtime model call.) |
voice_emotion | ~245 | Spoken replies carrying omnivoice-singing inline expressive tags in replyText — [happy] [sad] [angry] [nervous] [calm] [excited] [whisper] [singing] plus the preserved non-verbals [laughter] [sigh], scoped until the next tag or end of phrase. Deterministic seed rows (synthetic:voice-emotion-tags) + Cerebras augmentation (cerebras:voice-emotion-tags). The parse/generate/interpret schema the TTS emotion controls consume. |
This dataset is shaped to move the text metrics of
packages/training/scripts/eval/eliza1_eval_suite.py and the structural
format_ok gate in packages/training/benchmarks/eliza1_gates.yaml:
text_eval (held-out perplexity → 0..1; 0_6b threshold 0.55): the
assistant rows mirror the topic distribution of the suite's
DEFAULT_TEXT_EVAL_CORPUS (capital cities, speculative decoding, on-device
assistants, quantization, voice-activity detection).format_ok (parsable-output rate; floor 0.70): the action_selection
and tool_use rows teach the ACTION: NAME {json-params} + short-reply
structured surface; the structured_decode rows teach the W3 flat JSON
response envelope buildResponseGrammar constrains (the Stage-1
message-handler document) — without those rows the smoke task mix never
emitted the envelope and format_pct measured 0%.personality rows are PASS-graded exemplars of
silence on demand, style stickiness, trait respect, escalation, and per-user
vs global scope.This is a focused, high-signal mix-in — it is not the full 67k-row
data/final corpus the larger eliza-1 tiers train on. For the 0.6b it can be
used standalone (whole train→quant→bench stack runs < 1 h on a 16 GB GPU) or
concatenated ahead of the broader corpus.
provenance field. Benchmark-derived rows are
benchmark:<file>#<id>; Cerebras-generated rows are cerebras:<task>.trajectories export. The final splits are nonetheless run
through the canonical inline privacy filter
(packages/training/scripts/privacy_filter_trajectories.py::redact_value —
the same filter format_for_training.format_record applies) as
defense-in-depth (API keys / bearer tokens / emails / phones / geo).cd packages/training
CEREBRAS_API_KEY=<key> uv run python scripts/build_eliza1_sft_0_6b.py
# converted-only (no API key):
uv run python scripts/build_eliza1_sft_0_6b.py --no-augment
Counts, per-task breakdown, token histogram, and the privacy-filter pass are
recorded in manifest.json alongside the splits.