# C++ runtime data (per-language bundles)

`core/moonshine-tts/data/README.md`

This tree mirrors the assets that the C++ `moonshine-tts` (speak) and `moonshine-tts-g2p` (G2P-only) CLIs expect under a single `--model-root` (see each README.md for paths). When this project is embedded as the `moonshine-tts` submodule, the canonical Python assets usually live under the parent repo's `data/` and `models/`; this directory is the curated copy shipped for self-contained C++ builds.
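As an illustration of what "a single `--model-root`" implies, the following sketch checks that a root contains the per-language folders documented in this README. The function name `missing_bundles` and the constant are hypothetical, not part of the runtime or its scripts:

```python
from pathlib import Path

# Folder names taken from the per-language table in this README.
EXPECTED_DIRS = [
    "ar_msa", "de", "en_us", "fr", "hi", "it", "ja", "ko",
    "kokoro", "nl", "pt_br", "pt_pt", "ru", "vi", "zh_hans",
]

def missing_bundles(model_root: str) -> list:
    """Return the expected per-language folders absent under model_root."""
    root = Path(model_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

A partial model root (e.g. only the languages you ship) will legitimately report missing folders; the check is a convenience, not a requirement of the CLIs.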

| Folder | Role |
|---|---|
| `ar_msa` | Arabic MSA: tashkīl ONNX + optional lexicon |
| `de` | German IPA lexicon |
| `en_us` | English CMU-style lexicon + OOV ONNX (no heteronym ONNX in-tree) |
| `fr` | French lexicon + liaison POS CSVs |
| `hi` | Hindi Devanagari lexicon (Wiktionary + frequency merge) |
| `it` | Italian IPA lexicon |
| `ja` | Japanese lexicon + char-LUW UPOS ONNX |
| `ko` | Korean IPA lexicon |
| `kokoro` | Kokoro-82M ONNX TTS + `.kokorovoice` bundles (`moonshine-tts` speak CLI) |
| `nl` | Dutch IPA lexicon |
| `pt_br` | Brazilian Portuguese IPA lexicon |
| `pt_pt` | European Portuguese IPA lexicon |
| `ru` | Russian IPA lexicon |
| `vi` | Vietnamese IPA lexicon |
| `zh_hans` | Simplified Chinese lexicon + RoBERTa UPOS ONNX |

All commands below assume the repository root as the current working directory unless noted.

## Regeneration verification (2026-03-30)

The commands below were run from a clean temporary output directory and their outputs compared against the parent monorepo's `data/` and `models/` trees and this `moonshine-tts/data/` tree, unless noted.

| Recipe | Byte-identical to tree? | Notes |
|---|---|---|
| `download_multilingual_ipa_lexicons.py` for de, fr, it, ja, ko, nl, pt_br, pt_pt, ru, vi, zh_hans | Yes | All eleven `dict.tsv` files matched the parent repo's `data/<lang>/dict.tsv` and this tree's `data/<lang>/dict.tsv`. |
| `download_cmudict_to_tsv.py` → `data/en_us/dict.tsv` | Yes | Restored after run; output matched the prior file. |
| `export_models_to_onnx.py` (English OOV) | Yes (after bugfix) | The shipped tree under `data/en_us/oov/` matches `models/en_us/oov/` when re-exported from the same OOV checkpoint. (Heteronym ONNX was removed from the runtime and is no longer present under `data/en_us/`.) A script bug that wrote `onnx_export.onnx_path` as `onnx-config.json` was fixed in `scripts/export_models_to_onnx.py` (the `model.onnx` path must be passed into `_build_config_onnx`, not the JSON path). Copy `g2p-config.json` into the temp model root if you rely on `--only config` defaults. |
| `export_arabic_msa_diacritizer_onnx.py` | No (failed) | With torch 2.10 + current transformers, export raises `ValueError` on `attention_mask` shape inside BERT. The checked-in Arabic ONNX was produced with an older stack; see `ar_msa/README.md`. |
| `export_chinese_roberta_upos_onnx.py` | Partial | `meta.json` and `vocab.txt` match; `tokenizer_config.json` differs by an empty `extra_special_tokens` key (tokenizer version). `model.onnx` differs (same HF weights, different ONNX graph / int8 shrink / opset path under torch 2.10). |
| `export_japanese_ud_onnx.py` | Partial | Same pattern as Chinese: `meta.json` / `vocab.txt` match; `tokenizer_config.json` has a minor JSON diff. The checked-in bundle uses external weights (`model.onnx` + `model.onnx.data`); a fresh export produced a single larger `model.onnx` (no `.data` split). |
| `export_korean_ud_onnx.py` | Partial | `meta.json` matches; `model.onnx` differs in size/hash (int8 shrink / exporter). |
| `filter_dict_by_espeak_coverage.py` → `dict_filtered_heteronyms.tsv` | Not rerun | Needs eSpeak NG + a corpus; the recipe is environment-specific. |
| `build_ar_msa_lexicon_from_camel_tools.py` | Not rerun | Requires a Camel Tools + MLE DB install. |
| French `*.csv` POS lists | N/A | No automated download script in-repo. |
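The "byte-identical to tree?" verdicts above amount to a recursive, content-level comparison of two directory trees. A minimal sketch of such a check (the helper names are ours, not repo scripts; the actual verification may have used plain `diff -r` or similar):

```python
import hashlib
from pathlib import Path

def tree_digests(root: str) -> dict:
    """Map each relative file path under root to its SHA-256 hex digest."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }

def byte_identical(tree_a: str, tree_b: str) -> bool:
    """True iff both trees contain the same files with the same bytes."""
    return tree_digests(tree_a) == tree_digests(tree_b)
```

Comparing digest maps rather than streaming file pairs also reports files that exist in only one tree, which is how a removed artifact (like the heteronym ONNX) shows up.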

**Takeaway:** Lexicon recipes are deterministic against current upstream URLs. Transformer ONNX exports are not guaranteed to be byte-stable across PyTorch / transformers / export-backend versions; treat `meta.json`, tokenizer assets, and parity tests as the contract when bytes drift.
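When tokenizer assets drift only by metadata noise (e.g. the empty `extra_special_tokens` key observed for the `zh_hans` export), a tolerant JSON comparison can serve as part of that contract. A hedged sketch; the ignorable-key list is an assumption you would extend as new tokenizer versions introduce other empty keys:

```python
import json

# Keys that newer tokenizer versions may add with empty values; an empty
# value on either side is treated as absent. extra_special_tokens is the
# case observed above; extending this set is a judgment call.
IGNORABLE_IF_EMPTY = {"extra_special_tokens"}

def _normalize(config: dict) -> dict:
    """Drop ignorable keys whose values are empty ({} / [] / "" / None)."""
    return {
        k: v for k, v in config.items()
        if not (k in IGNORABLE_IF_EMPTY and not v)
    }

def tokenizer_configs_equivalent(path_a, path_b) -> bool:
    """Compare two tokenizer_config.json files modulo ignorable empty keys."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return _normalize(json.load(fa)) == _normalize(json.load(fb))
```

This only covers JSON metadata; for `model.onnx` drift, numeric parity tests on a fixed input batch remain the meaningful check.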