recipes/lexicon_free/librispeech/README.md
[...]: data_dst is the path where the data is stored, model_dst is the path where auxiliary files are stored, kenlm is the path to the KenLM toolkit)
python3 prepare.py --data_dst [...] --model_dst [...] --kenlm [...]
The following files will be generated:
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── lexicon_train+dev.lst
│   └── tokens.lst
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.bin
    ├── char_lm_data.dev-clean
    ├── char_lm_data.dev-other
    ├── char_lm_data.train
    ├── lexicon.lst
    ├── test-clean.lst.inv
    ├── test-clean.lst.oov
    ├── test-other.lst.inv
    └── test-other.lst.oov
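
Before moving on to training, it can help to confirm that `prepare.py` produced every file in the listing above. The following is a minimal sketch and not part of the recipe: the helper name `missing_outputs` is hypothetical, and the list of relative paths is copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
EXPECTED_OUTPUTS = [
    "am/lexicon_train+dev.lst",
    "am/tokens.lst",
    "decoder/4-gram.arpa",
    "decoder/4-gram.bin",
    "decoder/char_lm_data.dev-clean",
    "decoder/char_lm_data.dev-other",
    "decoder/char_lm_data.train",
    "decoder/lexicon.lst",
    "decoder/test-clean.lst.inv",
    "decoder/test-clean.lst.oov",
    "decoder/test-other.lst.inv",
    "decoder/test-other.lst.oov",
]


def missing_outputs(model_dst):
    """Return the expected files that do not exist under model_dst."""
    root = Path(model_dst)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

Calling `missing_outputs` with the directory passed as `--model_dst` returns an empty list when everything is in place.
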
Training uses train.cfg on a single node with 8 GPUs (distributed jobs can be launched using Open MPI). During training, the learning rate is decreased in stages:
[...]/wav2letter/build/Train train --flagsfile train.cfg --minloglevel=0 --logtostderr=1
[...]/wav2letter/build/Train continue [PATH/TO/MODEL/DIR] --linseg=0 --enable_distributed --lr=0.1 --lrcrit=0.001 --maxgradnorm=0.25 --iter=7 --minloglevel=0 --logtostderr=1
[...]/wav2letter/build/Train continue [PATH/TO/MODEL/DIR] --linseg=0 --enable_distributed --lr=0.1 --lrcrit=0.001 --maxgradnorm=0.25 --stepsize 4 --gamma 0.9 --iter=70 --minloglevel=0 --logtostderr=1
Take either 003_model_librispeech_dev-clean.bin or 003_model_librispeech_dev-other.bin. We use the 003_model_librispeech_dev-clean.bin snapshot for the decoder experiments that follow.
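
The last `continue` command adds `--stepsize 4 --gamma 0.9`. Assuming these flags implement a step decay, in which the learning rate is multiplied by `gamma` once every `stepsize` epochs, the resulting schedule can be sketched as below. This reading of the flags is an assumption about the trainer, epochs are counted here from the start of that stage, and the function `step_decay_lr` is illustrative rather than part of wav2letter.

```python
def step_decay_lr(epoch, base_lr=0.1, gamma=0.9, stepsize=4):
    """Learning rate after `epoch` completed epochs under step decay.

    Assumption: the rate is multiplied by `gamma` once every `stepsize`
    epochs, counted from the start of the stage.
    """
    return base_lr * gamma ** (epoch // stepsize)


# Print the rate at a few epochs of the stage.
for epoch in (0, 4, 8, 40):
    print(epoch, round(step_decay_lr(epoch), 5))
```
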
source prepare_fairseq_data.sh [DATA_DST] [MODEL_DST] [FAIRSEQ PATH]
./train_ngram_lms.sh [DATA_DST] [MODEL_DST] [KENLM PATH]/build/bin
lr=0.5, then until the 30th epoch with lr=0.05, and then until the 48th epoch with lr=0.005.
mkdir -p [MODEL_DST]/decoder/convlm_models/word_14B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_word_data \
--save-dir [MODEL_DST]/decoder/convlm_models/word_14B \
--task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=48 --optimizer=nag \
--lr=0.5 --lr-scheduler=fixed --decoder-embed-dim=128 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.1 --weight-decay=1e-07 \
--max-tokens=1024 --tokens-per-sample=1024 --sample-break-mode=none \
--criterion=adaptive_loss --adaptive-softmax-cutoff='10000,50000,200000' --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
--ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/word_14B/train.log
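
The staged learning rate described before the word_14B command can be written as a small lookup. This is a sketch only: the epoch at which the first stage (lr=0.5) ends is not stated in this section, so it is left as a required argument (`first_stage_end`, a hypothetical name), and epochs are assumed to be 1-indexed.

```python
def staged_lr(epoch, first_stage_end):
    """Learning rate for a 1-indexed epoch of the staged schedule.

    `first_stage_end` is the last epoch trained with lr=0.5. Its value is
    not given in this section, so the caller must supply it.
    """
    if not 1 <= first_stage_end < 30:
        raise ValueError("first_stage_end must be between 1 and 29")
    if not 1 <= epoch <= 48:
        raise ValueError("epoch must be between 1 and 48")
    if epoch <= first_stage_end:
        return 0.5
    if epoch <= 30:
        return 0.05
    return 0.005
```

Because the command uses `--lr-scheduler=fixed`, one way to follow such a schedule is to resume training with a different `--lr` at each stage boundary.
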
mkdir -p [MODEL_DST]/decoder/convlm_models/char_14B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_char_data \
--save-dir [MODEL_DST]/decoder/convlm_models/char_14B --task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=48 --optimizer=nag \
--lr=0.5 --lr-scheduler="reduce_lr_on_plateau" --lr-shrink=0.7 \
--decoder-embed-dim=128 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.1 --weight-decay=1e-07 \
--max-tokens=512 --tokens-per-sample=512 --sample-break-mode=complete \
--criterion=cross_entropy --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
--ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/char_14B/train.log
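
The char_14B command uses `--lr-scheduler="reduce_lr_on_plateau" --lr-shrink=0.7` instead of a fixed rate. The sketch below illustrates the general idea under simplifying assumptions: the rate is multiplied by the shrink factor after any epoch whose validation loss does not improve on the best value so far, with zero patience and no improvement threshold. The function name `plateau_lrs` and the loss values are made up for illustration, and the actual fairseq scheduler may differ in these details.

```python
def plateau_lrs(val_losses, lr=0.5, shrink=0.7):
    """Return the learning rate in effect after each validation step.

    Simplified reduce-on-plateau: shrink the rate whenever the validation
    loss fails to improve on the best loss seen so far.
    """
    best = float("inf")
    rates = []
    for loss in val_losses:
        if loss < best:
            best = loss  # improvement: keep the current rate
        else:
            lr *= shrink  # plateau: shrink the rate
        rates.append(lr)
    return rates


# Hypothetical validation losses, one per epoch.
print(plateau_lrs([3.0, 2.5, 2.6, 2.4, 2.4]))
```
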
lr=0.5 is used
mkdir -p [MODEL_DST]/decoder/convlm_models/char_20B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_char_data \
--save-dir [MODEL_DST]/decoder/convlm_models/char_20B --task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=16 --optimizer=nag \
--lr=0.5 --lr-scheduler=fixed --decoder-embed-dim=256 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (256, 1, 3)] * 3 + [(256, 1, 0), (256, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 9 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.1 --weight-decay=1e-07 \
--max-tokens=512 --tokens-per-sample=512 --sample-break-mode=complete \
--criterion=cross_entropy --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
--ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/char_20B/train.log
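
The `--decoder-layers` value in the commands above is written as a Python list expression, so its size can be checked by evaluating it directly. The sketch below rebuilds the two specifications and counts their entries. Reading each tuple as `(out_channels, kernel_width, residual)` and each group of three tuples as one bottleneck block is an assumption about the fairseq `fconv_lm` convention. The variable and function names are illustrative.

```python
# Layer specification shared by the word_14B and char_14B commands.
layers_14b = (
    [(512, 5)]
    + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3
    + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3
    + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 6
    + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]
)

# Layer specification from the char_20B command.
layers_20b = (
    [(512, 5)]
    + [(128, 1, 0), (128, 5, 0), (256, 1, 3)] * 3
    + [(256, 1, 0), (256, 5, 0), (512, 1, 3)] * 3
    + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3
    + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 9
    + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]
)


def count_blocks(layers):
    """Count the first convolution plus each following group of three layers."""
    rest = len(layers) - 1
    if rest % 3 != 0:
        raise ValueError("expected one leading layer followed by groups of three")
    return 1 + rest // 3


print(len(layers_14b), count_blocks(layers_14b))  # 40 14
print(len(layers_20b), count_blocks(layers_20b))  # 58 20
```

Under this reading, the block counts agree with the 14B and 20B suffixes in the model directory names.
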
# compute for ngram models
source eval_ngram_lms.sh [MODEL_DST]
source eval_convlm_lms.sh [DATA_DST] [MODEL_DST]
source convert_convlm.sh [MODEL_DST] [WAV2LETTER]/wav2letter
Run decoding with the decoder*.cfg flags files:
[...]/wav2letter/build/Decoder --flagsfile [...] --minloglevel=0 --logtostderr=1
$MODEL_DST/decoder/test-*.lst.oov and $MODEL_DST/decoder/test-*.lst.inv, respectively, in the decoder*.cfg

Below is information about the pre-trained acoustic and language models, which can be used, for example, to reproduce the decoding step:
mkdir [MODEL_DST]/decoder/ngram_models [MODEL_DST]/decoder/convlm_models
Place the acoustic model into $MODEL_DST/am and the language models into $MODEL_DST/decoder/ngram_models for n-gram models and into $MODEL_DST/decoder/convlm_models for ConvLM models; [...] $MODEL_DST/decoder.

Here, am.arch and the generated $MODEL_DST/am/tokens.lst and $MODEL_DST/decoder/lexicon.lst files are the same as in the table.
| File | Dataset | Dev Set | Architecture | Decoder Lexicon | Tokens |
|---|---|---|---|---|---|
| baseline_dev-clean+other | LibriSpeech | dev-clean+dev-other | Archfile | Decoder lexicon | Tokens |
Convolutional language models (ConvLM) are trained with the fairseq toolkit, and n-gram language models are trained with the KenLM toolkit. The language models below are converted into a binary format compatible with the wav2letter++ decoder.
| Name | Dataset | Type | Vocab | Fairseq model |
|---|---|---|---|---|
| lm_librispeech_convlm_char_20B | LibriSpeech | ConvLM 20B | LM Vocab | Fairseq |
| lm_librispeech_convlm_word_14B | LibriSpeech | ConvLM 14B | LM Vocab | Fairseq |
| lm_librispeech_kenlm_char_15g_pruned | LibriSpeech | 15-gram | - | - |
| lm_librispeech_kenlm_char_20g_pruned | LibriSpeech | 20-gram | - | - |
| lm_librispeech_kenlm_word_4g_200kvocab | LibriSpeech | 4-gram | - | - |
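
The directory layout described earlier (ngram_models for n-gram models, convlm_models for ConvLM models) can be combined with the model names in the table to decide where each pre-trained language model goes. The helper below is a hypothetical sketch: deciding by the `kenlm` or `convlm` substring in the name is an assumption made for illustration, based only on the names listed above.

```python
from pathlib import PurePosixPath


def lm_target_dir(model_dst, model_name):
    """Pick the directory for a pre-trained language model by its name."""
    decoder = PurePosixPath(model_dst) / "decoder"
    if "_kenlm_" in model_name:
        return decoder / "ngram_models"
    if "_convlm_" in model_name:
        return decoder / "convlm_models"
    raise ValueError(f"unrecognized language model name: {model_name}")


# Show the target directory for two of the names from the table.
for name in ("lm_librispeech_kenlm_word_4g_200kvocab", "lm_librispeech_convlm_word_14B"):
    print(name, "->", lm_target_dir("MODEL_DST", name))
```
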