core/moonshine-tts/data/en_us/README.md
en_usdict_filtered_heteronyms.tsv — CMUdict-derived lexicon with extra pronunciations pruned using corpus + eSpeak alignment (see script below). Words that still have multiple CMU readings are resolved at runtime by sorting alternatives and taking the first (no heteronym ONNX).g2p-config.json — uses_dictionary and uses_oov_model (must be true for the bundled layout; the C++ factory requires OOV ONNX when this flag is on).oov/ — model.onnx + onnx-config.json for greedy character→phoneme decoding for out-of-vocabulary words.| Asset | Source |
|---|---|
| Base pronunciations | CMU Pronouncing Dictionary (cmudict.dict), converted to IPA via repo cmudict_ipa — see scripts/download_cmudict_to_tsv.py. |
| Filtered TSV | Repo pipeline: scripts/filter_dict_by_espeak_coverage.py prunes rare readings using eSpeak NG + corpus text. |
| OOV ONNX | Small transformer decoder trained in this repository (train_oov.py, etc.); checkpoints live under models/en_us/oov/ (e.g. checkpoint.pt + JSON vocabs). |
| ONNX export | scripts/export_models_to_onnx.py (or the OOV-only path your pipeline uses) merges vocabs and indices into onnx-config.json next to model.onnx. |
Raw CMU → TSV
python scripts/download_cmudict_to_tsv.py
Writes data/en_us/dict.tsv by default.
Prune rare readings (requires espeak-phonemizer / system eSpeak NG and a sentence corpus such as data/en_us/input_text.txt or your own --input-text):
python scripts/filter_dict_by_espeak_coverage.py \
--dict-path data/en_us/dict.tsv \
--input-text data/en_us/input_text.txt \
--out data/en_us/dict_filtered_heteronyms.tsv
Install the result as models/en_us/dict_filtered_heteronyms.tsv (or the path your model_root uses).
Train the OOV model (see project docs / train_oov.py, dataset scripts under scripts/).
Export ONNX (PyTorch + onnx installed). Use the repo’s English OOV export path (e.g. scripts/export_models_to_onnx.py when present in your tree) so oov/model.onnx and oov/onnx-config.json are produced, then copy them into data/en_us/oov/.
Keep models/en_us/g2p-config.json aligned with what you ship (uses_oov_model: true, no heteronym bundle). Re-exporting from the same checkpoint reproduces model.onnx and onnx-config.json with a fixed exporter (see data/README.md → Regeneration verification).
Copy dict_filtered_heteronyms.tsv, g2p-config.json, and the oov/ directory into data/en_us/ if you maintain this layout.