core/moonshine-tts/data/README.md
This tree mirrors assets the C++ moonshine-tts (speak) and moonshine-tts-g2p (G2P-only) CLIs expect under a single --model-root (see each README.md for paths). When this project is embedded as the moonshine-tts submodule, canonical Python assets usually live under the parent repo’s data/ and models/; this directory is the curated copy shipped for self-contained C++ builds.
| Folder | Role |
|---|---|
| ar_msa | Arabic MSA: tashkīl ONNX + optional lexicon |
| de | German IPA lexicon |
| en_us | English CMU-style lexicon + OOV ONNX (no heteronym ONNX in-tree) |
| fr | French lexicon + liaison POS CSVs |
| hi | Hindi Devanagari lexicon (Wiktionary + frequency merge) |
| it | Italian IPA lexicon |
| ja | Japanese lexicon + char-LUW UPOS ONNX |
| ko | Korean IPA lexicon |
| kokoro | Kokoro-82M ONNX TTS + .kokorovoice bundles (moonshine-tts speak CLI) |
| nl | Dutch IPA lexicon |
| pt_br | Brazilian Portuguese IPA lexicon |
| pt_pt | European Portuguese IPA lexicon |
| ru | Russian IPA lexicon |
| vi | Vietnamese IPA lexicon |
| zh_hans | Simplified Chinese lexicon + RoBERTa UPOS ONNX |
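
As a quick sanity check on the layout above, the following minimal sketch reports which of the listed folders are absent under a given model root. The helper name `missing_folders` and the constant `EXPECTED_FOLDERS` are hypothetical and not part of this repository. Which folders a particular CLI invocation actually needs depends on the language and the CLI, so treat a non-empty result as a hint rather than an error.

```python
from pathlib import Path

# Folder names copied from the table above; the tuple itself is illustrative.
EXPECTED_FOLDERS = (
    "ar_msa", "de", "en_us", "fr", "hi", "it", "ja", "ko",
    "kokoro", "nl", "pt_br", "pt_pt", "ru", "vi", "zh_hans",
)


def missing_folders(model_root):
    """Return the expected folder names that are not directories under model_root."""
    root = Path(model_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
```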
All commands below assume the repository root as the current working directory unless noted.
Each recipe below was run into a clean temporary output directory, and its output was compared against the parent monorepo’s data/ and models/ trees and against this moonshine-tts/data/ tree, unless noted.
| Recipe | Byte-identical to tree? | Notes |
|---|---|---|
| download_multilingual_ipa_lexicons.py for de, fr, it, ja, ko, nl, pt_br, pt_pt, ru, vi, zh_hans | Yes | All eleven dict.tsv files matched the parent repo’s `data/<lang>/dict.tsv` and this tree’s `data/<lang>/dict.tsv`. |
| download_cmudict_to_tsv.py → data/en_us/dict.tsv | Yes | Restored after run; output matched prior file. |
| export_models_to_onnx.py (English OOV) | Yes (after bugfix) | The shipped tree under data/en_us/oov/ matches models/en_us/oov/ when re-exported from the same OOV checkpoint. (Heteronym ONNX was removed from the runtime and is no longer present under data/en_us/.) A script bug that wrote onnx_export.onnx_path as onnx-config.json was fixed in scripts/export_models_to_onnx.py (must pass the model.onnx path into _build_config_onnx, not the JSON path). Copy g2p-config.json into the temp model_root if you rely on --only config defaults. |
| export_arabic_msa_diacritizer_onnx.py | No (failed) | With torch 2.10 + current transformers, export raises ValueError on attention_mask shape inside BERT. The checked-in Arabic ONNX was produced with an older stack; see ar_msa/README.md. |
| export_chinese_roberta_upos_onnx.py | Partial | meta.json and vocab.txt match; tokenizer_config.json differs by an empty extra_special_tokens key (tokenizer version). model.onnx differs (same HF weights, different ONNX graph / int8 shrink / opset path under torch 2.10). |
| export_japanese_ud_onnx.py | Partial | Same pattern as Chinese: meta.json / vocab.txt match; tokenizer_config.json minor JSON diff. Checked-in bundle uses external weights (model.onnx + model.onnx.data); a fresh export produced a single larger model.onnx (no .data split). |
| export_korean_ud_onnx.py | Partial | meta.json matches; model.onnx differs in size/hash (int8 shrink / exporter). |
| filter_dict_by_espeak_coverage.py → dict_filtered_heteronyms.tsv | Not rerun | Needs eSpeak NG + corpus; recipe is environment-specific. |
| build_ar_msa_lexicon_from_camel_tools.py | Not rerun | Requires Camel Tools + MLE DB install. |
| French *.csv POS lists | N/A | No automated download script in-repo. |
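
The "byte-identical" column can be checked with a plain hash comparison between a fresh output directory and the shipped tree. The sketch below is one way to do that using only the standard library. It is not the script that produced the results in the table, and `sha256_of` and `compare_trees` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model.onnx files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(fresh_root: Path, shipped_root: Path) -> dict[str, str]:
    """Classify every file under fresh_root against the same relative path
    under shipped_root as 'same', 'differs', or 'missing'."""
    report = {}
    for fresh in sorted(p for p in fresh_root.rglob("*") if p.is_file()):
        rel = fresh.relative_to(fresh_root)
        shipped = shipped_root / rel
        if not shipped.is_file():
            report[rel.as_posix()] = "missing"
        elif sha256_of(fresh) == sha256_of(shipped):
            report[rel.as_posix()] = "same"
        else:
            report[rel.as_posix()] = "differs"
    return report
```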
Takeaway: The lexicon download recipes that were rerun are deterministic against current upstream URLs. Transformer ONNX exports are not guaranteed to be byte-stable across PyTorch / transformers / export-backend versions; when bytes drift, treat meta.json, the tokenizer assets, and parity tests as the contract.
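
When a JSON asset such as tokenizer_config.json differs only by an empty key (the extra_special_tokens case noted above), a byte comparison fails even though the content is equivalent. The following is a minimal sketch of a looser check that parses both files and ignores keys whose values are empty. The function names `strip_empty` and `json_equivalent` are hypothetical, and the rule "empty means `{}`, `[]`, or `null`" is an assumption made for illustration; tighten it if an empty value is meaningful for your asset.

```python
import json
from pathlib import Path


def _is_empty(value) -> bool:
    # Assumption: only None, {} and [] count as "empty"; 0, False and "" are kept.
    return value is None or (isinstance(value, (dict, list)) and len(value) == 0)


def strip_empty(value):
    """Recursively drop dict keys whose (cleaned) value is empty."""
    if isinstance(value, dict):
        cleaned = {key: strip_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        return [strip_empty(item) for item in value]
    return value


def json_equivalent(path_a, path_b) -> bool:
    """True if both JSON files hold the same data once empty keys are ignored."""
    data_a = json.loads(Path(path_a).read_text(encoding="utf-8"))
    data_b = json.loads(Path(path_b).read_text(encoding="utf-8"))
    return strip_empty(data_a) == strip_empty(data_b)
```

A check like this complements, rather than replaces, the parity tests: it only shows that the configuration files agree, not that a re-exported model.onnx produces the same outputs.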