recipes/lexicon_free/wsj/README.md
## Data preparation

Prepare the data and auxiliary files (set the necessary paths instead of `[...]`: `--data_dst` is the path where the data will be stored, `--model_dst` is the path where the auxiliary files will be stored; if you are using the B-version of WSJ, call with `--wsj1_type LDC94S13B`):

```
python3 prepare.py --wsj0 [...]/csr_1 --wsj1 [...]/csr_2_comp --wsj1_type LDC94S13A \
    --data_dst [...] --model_dst [...] --sph2pipe [...]/sph2pipe_v2.5/sph2pipe
```
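The steps below refer to these two locations as `$DATA_DST` and `$MODEL_DST`; a minimal sketch, assuming hypothetical paths (point them at whatever you passed to `--data_dst`/`--model_dst`):

```
# hypothetical paths, for illustration only
export DATA_DST=/data/wsj/prepared
export MODEL_DST=/data/wsj/models
```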
Besides the data itself, auxiliary files for acoustic and language model training/evaluation will be generated:
```
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── lexicon_si284+nov93dev.txt
│   ├── si284.lst.remap
│   └── tokens.lst
└── decoder
    ├── char_lm_data.nov93dev
    ├── char_lm_data.train
    ├── dict-remap.txt
    ├── lexicon.lst
    └── word_lm_data.nov93dev
```
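A quick sanity check that the preparation step produced everything (a minimal sketch over the listing above):

```
# report any expected auxiliary file that is missing under $MODEL_DST
for f in am/lexicon_si284+nov93dev.txt am/si284.lst.remap am/tokens.lst \
         decoder/char_lm_data.nov93dev decoder/char_lm_data.train \
         decoder/dict-remap.txt decoder/lexicon.lst decoder/word_lm_data.nov93dev; do
  [ -f "$MODEL_DST/$f" ] || echo "missing: $f"
done
```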
## Instructions to reproduce training and decoding

- Fix the paths inside `*.cfg`.
- Train the acoustic model with `train.cfg` on 1 GPU. Take the `001_model_nov93dev.bin` snapshot for further decoder experiments:

```
[...]/wav2letter/build/Train train --flagsfile train.cfg --minloglevel=0 --logtostderr=1
```
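Snapshots land in the run directory configured in `train.cfg` (via the `rundir`/`runname` flags), so you can confirm the dev-set snapshot was written, e.g.:

```
# [RUNDIR]/[RUNNAME] stands for the run directory configured in train.cfg
ls [RUNDIR]/[RUNNAME]/001_model_nov93dev.bin
```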
- Prepare the data for fairseq LM training:

```
source prepare_fairseq_data.sh [DATA_DST] [MODEL_DST] [FAIRSEQ PATH]
```
- Train the n-gram language models with KenLM:

```
./train_ngram_lms.sh [DATA_DST] [MODEL_DST] [KENLM PATH]/build/bin
```
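To sanity-check a trained n-gram model, KenLM's `query` binary prints per-word log probabilities and the overall perplexity; a minimal sketch (the model file name here is an assumption, use whatever `train_ngram_lms.sh` actually wrote):

```
# model path/name is hypothetical
echo "the quick brown fox" | [KENLM PATH]/build/bin/query [MODEL_DST]/decoder/ngram_models/lm_wsj_kenlm_word_4g.arpa
```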
- Train the word-level ConvLM (word_14B) with fairseq. With the `inverse_sqrt` scheduler the model overfits after 75k updates, so we take the snapshot just before the validation perplexity starts to grow:

```
mkdir -p [MODEL_DST]/decoder/convlm_models/word_14B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_word_data \
--save-dir [MODEL_DST]/decoder/convlm_models/word_14B \
--task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=60 --optimizer=nag \
--lr=1 --lr-scheduler=inverse_sqrt --decoder-embed-dim=128 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.3 --weight-decay=1e-07 \
--max-tokens=1024 --tokens-per-sample=1024 --sample-break-mode=none \
--criterion=adaptive_loss --adaptive-softmax-cutoff='10000,50000,100000' --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
    --ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/word_14B/train.log
```
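To spot the overfitting point, watch the validation entries in the training log (a minimal sketch; with `--log-format=json` the validation lines carry loss and perplexity, though exact key names vary across fairseq versions):

```
# print the latest validation stats from the log
grep valid [MODEL_DST]/decoder/convlm_models/word_14B/train.log | tail -n 5
```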
- Train the character-level ConvLM (char_14B): first train with `lr=0.5`, then take the best snapshot and continue training with `lr=0.05` for another 30 epochs:

```
mkdir -p [MODEL_DST]/decoder/convlm_models/char_14B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_char_data \
--save-dir [MODEL_DST]/decoder/convlm_models/char_14B --task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=60 --optimizer=nag \
--lr=0.5 --lr-scheduler="fixed" \
--decoder-embed-dim=128 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.15 --weight-decay=1e-07 \
--max-tokens=512 --tokens-per-sample=512 --sample-break-mode=complete \
--criterion=cross_entropy --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
    --ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/char_14B/train.log
```
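The continuation run can reuse the same command with the lower learning rate, restoring from the best snapshot; a minimal sketch of the flags that change (`--restore-file` is resolved relative to `--save-dir`; the `--reset-optimizer`/`--reset-lr-scheduler` flags are an assumption about your fairseq version, check `train.py --help`):

```
# keep all other flags (arch, dropout, etc.) identical to the first run, denoted [...]
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_char_data \
    --save-dir [MODEL_DST]/decoder/convlm_models/char_14B --task=language_modeling \
    --restore-file checkpoint_best.pt \
    --reset-optimizer --reset-lr-scheduler \
    --lr=0.05 --max-epoch=90 [...]
```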
- Train the character-level ConvLM (char_20B); here `lr=0.1` is used:

```
mkdir -p [MODEL_DST]/decoder/convlm_models/char_20B
python3 [FAIRSEQ]/train.py [MODEL_DST]/decoder/fairseq_char_data \
--save-dir [MODEL_DST]/decoder/convlm_models/char_20B --task=language_modeling \
--arch=fconv_lm --fp16 --max-epoch=60 --optimizer=nag \
--lr=0.1 --lr-scheduler=fixed --decoder-embed-dim=256 --clip-norm=0.1 \
--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (256, 1, 3)] * 3 + [(256, 1, 0), (256, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)] * 9 + [(1024, 1, 0), (1024, 5, 0), (4096, 1, 3)]' \
--dropout=0.2 --weight-decay=1e-07 \
--max-tokens=512 --tokens-per-sample=512 --sample-break-mode=complete \
--criterion=cross_entropy --seed=42 \
--log-format=json --log-interval=100 \
--save-interval-updates=10000 --keep-interval-updates=10 \
    --ddp-backend="no_c10d" --distributed-world-size=8 > [MODEL_DST]/decoder/convlm_models/char_20B/train.log
```
- Compute the perplexity of the trained language models:

```
# compute perplexity for the ngram models
source eval_ngram_lms.sh [MODEL_DST]
# compute perplexity for the ConvLM models
source eval_convlm_lms.sh [DATA_DST] [MODEL_DST]
```
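When comparing the character LMs against the word LM, note that char-level and word-level perplexities are not directly comparable: renormalize the total log-probability per word instead. A sketch of the arithmetic with illustrative counts (substitute the token counts of your own evaluation text):

```
# word-level ppl = exp(n_chars * ln(char_ppl) / n_words); values below are placeholders
awk -v char_ppl=2.9 -v n_chars=515000 -v n_words=87000 \
    'BEGIN { print exp(n_chars * log(char_ppl) / n_words) }'
```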
- Convert the fairseq ConvLM models into the wav2letter++ format:

```
source convert_convlm.sh [MODEL_DST] [WAV2LETTER]/wav2letter
```
- Fix the paths inside `decoder*.cfg`.
- Run decoding with `decoder*.cfg`:

```
[...]/wav2letter/build/Decoder --flagsfile [...] --minloglevel=0 --logtostderr=1
```
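Each `decoder*.cfg` pairs the acoustic model snapshot with one of the language models. As a rough illustration of the kind of flags such a file contains (the values and exact flag set here are assumptions; rely on the shipped `decoder*.cfg`):

```
# hypothetical sketch of a word-LM decoding flagsfile, not the shipped config
--am=[MODEL_DST]/am/001_model_nov93dev.bin
--lexicon=[MODEL_DST]/decoder/lexicon.lst
--lm=[MODEL_DST]/decoder/ngram_models/lm_wsj_kenlm_word_4g.bin
--lmtype=kenlm
--decodertype=wrd
--lmweight=[...]
--wordscore=[...]
--beamsize=[...]
```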
## Pre-trained acoustic and language models

Below is information about the pre-trained acoustic and language models, which one can use, for example, to reproduce the decoding step:

```
mkdir [MODEL_DST]/decoder/ngram_models [MODEL_DST]/decoder/convlm_models
```

Download the acoustic model into `$MODEL_DST/am`, the n-gram models into `$MODEL_DST/decoder/ngram_models`, and the ConvLM models into `$MODEL_DST/decoder/convlm_models`. Here the `am.arch` file and the generated `$MODEL_DST/am/tokens.lst` and `$MODEL_DST/decoder/lexicon.lst` files are the same as in the table.

### Acoustic model

| File | Dataset | Dev Set | Architecture | Decoder lexicon | Tokens |
|---|---|---|---|---|---|
| baseline_nov93dev | WSJ | nov93dev | Archfile | Decoder lexicon | Tokens |
### Language models

Convolutional language models (ConvLM) are trained with the fairseq toolkit, and n-gram language models are trained with the KenLM toolkit. The language models below have been converted into a binary format compatible with the wav2letter++ decoder.

| Name | Dataset | Type | Vocab | Fairseq model |
|---|---|---|---|---|
| lm_wsj_convlm_char_20B | WSJ | ConvLM 20B | LM Vocab | Fairseq |
| lm_wsj_convlm_word_14B | WSJ | ConvLM 14B | LM Vocab | Fairseq |
| lm_wsj_kenlm_char_15g_pruned | WSJ | 15-gram | - | - |
| lm_wsj_kenlm_char_20g_pruned | WSJ | 20-gram | - | - |
| lm_wsj_kenlm_word_4g | WSJ | 4-gram | - | - |