recipes/self_training/pseudo_labeling/README.md
Tools for generating and analyzing the pseudo-labels used in experiments can be found here. Refer to the paper for the optimal parameter settings used to produce the reported results.
Given raw sclite logs from decoding, pseudo-labels can be generated and filtered by running:
```
python3 generate_synthetic_data.py \
    --input [path to decoder stdout] \
    --listpath [path of list file containing ground truth] \
    --output [path to new output list file]
```

Optional filtering flags:

- `--filter`: whether or not to perform any sort of filtering
- `--warnings`: whether or not to filter out samples with EOS warnings
- `--ngram`: whether or not to do n-gram filtering
- `--ngram_appearance_threshold`: if n-gram filtering is enabled, the number of times an n-gram must appear for a sample to be filtered
- `--ngram_size`: the size of the n-gram to match
- `--print_filtered_results`: if filtering, print the filtered results
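For intuition, below is a minimal sketch of the kind of n-gram repetition check such filtering might apply, assuming the goal is to drop hypotheses in which a single n-gram repeats at least the threshold number of times (a typical symptom of decoder looping). The function and variable names here are illustrative and not the script's actual implementation.

```python
from collections import Counter

def has_repeated_ngram(words, ngram_size, threshold):
    """Return True if any n-gram of `ngram_size` words occurs at least
    `threshold` times in the hypothesis."""
    ngrams = zip(*(words[i:] for i in range(ngram_size)))
    return any(count >= threshold for count in Counter(ngrams).values())

# Keep only hypotheses without heavily repeated n-grams.
hypotheses = [
    "the cat sat on the mat",
    "and then and then and then and then he left",
]
kept = [h for h in hypotheses
        if not has_repeated_ngram(h.split(), ngram_size=2, threshold=3)]
print(kept)  # the looping second hypothesis is dropped
```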
To run basic analysis, including the oracle word error rate (WER), the number of samples, and the total duration of a dataset, use the tool provided in AnalyzeDataset. Make sure `W2L_BUILD_RECIPES` is `ON` in CMake (e.g. by passing `-DW2L_BUILD_RECIPES=ON`) when building wav2letter.
Run:

```
./analyze_pseudo_label_dataset \
    --infile [path to generated pseudo-label lst file] \
    --groundtruthfile [path to ground truth lst file]
```
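The oracle WER presumably compares the pseudo-labels against the ground-truth transcripts. As a rough reference (not the C++ tool itself), a word-level WER between two list files could be computed along these lines, assuming wav2letter-style list files whose lines have the form `<id> <audio path> <duration> <transcription>`; the file names below are placeholders.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance computed with a rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def load_transcripts(path):
    """Map sample ID -> transcript words, assuming lines of the form:
    <id> <audio path> <duration> <transcription...>"""
    transcripts = {}
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            transcripts[parts[0]] = parts[3:]
    return transcripts

refs = load_transcripts("ground_truth.lst")   # hypothetical paths
hyps = load_transcripts("pseudo_labels.lst")
errors = sum(edit_distance(refs[k], hyps[k]) for k in hyps if k in refs)
total = sum(len(refs[k]) for k in hyps if k in refs)
print(f"WER: {100.0 * errors / total:.2f}% over {len(hyps)} samples")
```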
To generate a synthetic lexicon from decoder output, run:
```
python3 generate_synthetic_lexicon.py \
    --inputhyp=[path to decoder sclite log output] \
    --inputlexicon=[path to an existing lexicon, e.g. train-clean-100] \
    --output=[output file]
```
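For a sense of what this step produces, here is a hedged sketch of building lexicon entries for words that appear in decoded hypotheses but not in the existing lexicon, assuming a grapheme-based, wav2letter-style lexicon format (`word<TAB>g r a p h e m e s |`). The file names, token set, and exact behavior of `generate_synthetic_lexicon.py` may differ.

```python
def make_lexicon_entry(word):
    """Spell a word as space-separated graphemes followed by a word
    boundary token "|" (an assumed token set)."""
    return f"{word}\t{' '.join(word)} |"

def words_in_lexicon(path):
    """Collect the set of words already present in a lexicon file."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

known = words_in_lexicon("train-clean-100.lexicon")  # hypothetical file name
new_words = set()
for hyp in ["the quick brown fox", "jumps over the lazy dog"]:  # decoded hypotheses
    new_words.update(w for w in hyp.split() if w not in known)

with open("synthetic.lexicon", "w") as out:
    for word in sorted(new_words):
        out.write(make_lexicon_entry(word) + "\n")
```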
A synthetic lexicon can also be combined with an existing lexicon by running:
```
python3 combine_synthetic_lexicon.py \
    --lexicon1=[train-clean-100 lexicon] \
    --lexicon2=[synthetic lexicon] \
    --output=[output file]
```
Note that the resulting lexicon will have words ordered
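A minimal sketch of what combining might amount to, assuming duplicate words are dropped and the first lexicon's entries take precedence; the actual script's conflict handling and ordering may differ, and the file names below are placeholders.

```python
def combine_lexicons(path1, path2, output_path):
    """Write every entry from path1, then entries from path2 whose word
    has not already been seen (the first lexicon wins on duplicates)."""
    seen = set()
    with open(output_path, "w") as out:
        for path in (path1, path2):
            with open(path) as f:
                for line in f:
                    if not line.strip():
                        continue
                    word = line.split()[0]
                    if word not in seen:
                        seen.add(word)
                        out.write(line)

combine_lexicons("train-clean-100.lexicon", "synthetic.lexicon", "combined.lexicon")
```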