Run the data and auxiliary file preparation (lexicon, token set, etc.). Set the necessary paths instead of `[...]`: `data_dst` is the path where the data will be stored, and `model_dst` is the path where the auxiliary files will be stored.
```
pip install sentencepiece==0.1.82
python3 prepare.py --data_dst [...] --model_dst [...]
```
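For concreteness, a hypothetical invocation with the placeholders filled in (the directories below are illustrative, not part of the recipe):

```
# illustrative paths only; substitute your own locations
export DATA_DST=$HOME/librispeech/data
export MODEL_DST=$HOME/librispeech/model_aux

pip install sentencepiece==0.1.82
python3 prepare.py --data_dst $DATA_DST --model_dst $MODEL_DST
```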
Besides the data itself, auxiliary files for acoustic and language model training/evaluation will be generated:
```
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-paired-train+dev-unigram-5000-nbest10.lexicon
│   └── librispeech-paired-train-unigram-5000.tokens
└── lm
    └── lpm_data
        ├── train-clean-360-dummy.lst
        └── train-other-500-dummy.lst
```
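If you want to sanity-check the generated lists, each row of a wav2letter list file is expected to hold an id, an audio path, a duration, and a transcription (this column layout is the standard wav2letter list format, assumed here rather than documented by this recipe):

```
# peek at the dummy list for the unpaired 360h split
head -n 3 $MODEL_DST/lm/lpm_data/train-clean-360-dummy.lst
```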
Build wav2letter++ with `-DW2L_BUILD_RECIPES=ON`. In addition to the top-level binaries such as `Train`, binaries specific to local prior matching, `decode_len_lpm` and `Train_lpm_oss`, will be built under `[...]/recipes/models/local_prior_match`.
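A minimal build sketch, assuming an out-of-source CMake build from the wav2letter++ source root (the source location and any backend-specific options depend on your setup):

```
# -DW2L_BUILD_RECIPES=ON enables building the recipe binaries
cd wav2letter
mkdir -p build && cd build
cmake .. -DW2L_BUILD_RECIPES=ON
make -j$(nproc)
```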
Training consists of the following steps:
- Train the initial and the proposal models with `train_init.cfg` and `train_proposal.cfg`:

```
[...]/Train train --flagsfile=train_init.cfg
[...]/Train train --flagsfile=train_proposal.cfg
```

`train_init.cfg` and `train_proposal.cfg` are for running experiments on a single GPU.

- Generate transcriptions for the unpaired audio with the proposal model:

```
# use the best model from the last run
[...]/decode_len_lpm [rundir]/lpm_proposal/[xxx]_model_dev-clean.bin \
    [model_dst]/lpm_data/train-clean-360-dummy.lst \
    [model_dst]/lpm_data/train-clean-360-viterbi.out
[...]/decode_len_lpm [rundir]/lpm_proposal/[xxx]_model_dev-other.bin \
    [model_dst]/lpm_data/train-other-500-dummy.lst \
    [model_dst]/lpm_data/train-other-500-viterbi.out
```
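Here `[rundir]` is the training run directory and `[xxx]` is the checkpoint prefix produced by the proposal run; a hypothetical concrete invocation (every path and checkpoint name below is illustrative):

```
# illustrative paths only; use the checkpoint that performed best on the dev set
/path/to/recipes/models/local_prior_match/decode_len_lpm \
    /checkpoint/lpm_proposal/001_model_dev-clean.bin \
    $MODEL_DST/lpm_data/train-clean-360-dummy.lst \
    $MODEL_DST/lpm_data/train-clean-360-viterbi.out
```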
- Prepare the unpaired data with the generated transcriptions:

```
python3 prepare_unpaired.py --data_dst [...] --model_dst [...]
```
In addition to the files generated by `prepare.py`, the following files will be generated under `$MODEL_DST` for LPM training:
```
cd $MODEL_DST
.
├── am
│   └── librispeech-paired-train-unpaired-viterbi+dev-unigram-5000-nbest10.lexicon
└── lm
    └── lpm_data
        ├── train-clean-360-lpm.lst
        └── train-other-500-lpm.lst
```
- Train the LPM model by forking from the initial model, with `train_lpm.cfg` configured to use the LPM data under `[model_dst]/lm`:

```
[...]/Train_lpm_oss fork [rundir]/lpm_init/[xxx]_model_last.bin --flagsfile=train_lpm.cfg
```

`train_lpm.cfg` is for running experiments on a single node with 8 GPUs (`--enable_distributed=true`). Distributed jobs can be launched using Open MPI.
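A minimal sketch of launching the distributed job with Open MPI, assuming one process per GPU on a single 8-GPU node (binary path, run directory, and checkpoint name are placeholders):

```
# hypothetical paths; one MPI process per GPU
mpirun -n 8 /path/to/recipes/models/local_prior_match/Train_lpm_oss fork \
    /checkpoint/lpm_init/001_model_last.bin \
    --flagsfile=train_lpm.cfg --enable_distributed=true
```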