To quantize speech, we learn a K-means clustering over acoustic representations. These representations are either Log-Mel Filterbank features or the outputs of a pretrained acoustic representation model (Modified CPC, HuBERT Base, or wav2vec 2.0 Large). To use a pretrained acoustic model, please download it from its respective location.
Pretrained K-means quantization models can be downloaded from the table below.
| K-Means Model | Download Link |
|---|---|
| Log Mel Filterbank + KM50 | download |
| Log Mel Filterbank + KM100 | download |
| Log Mel Filterbank + KM200 | download |
| Modified CPC + KM50 | download |
| Modified CPC + KM100 | download |
| Modified CPC + KM200 | download |
| HuBERT Base + KM50 | download |
| HuBERT Base + KM100 | download |
| HuBERT Base + KM200 | download |
| wav2vec 2.0 Large + KM50 | download |
| wav2vec 2.0 Large + KM100 | download |
| wav2vec 2.0 Large + KM200 | download |
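To quantize speech with a given acoustic representation, follow the two steps below: first learn a K-means clustering model on the extracted features, then quantize the audio files using the learned clusters.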

    # Step 1: learn the K-means clustering model.
    N_CLUSTERS=<number_of_clusters_used_for_kmeans>
    TYPE=<one_of_logmel/cpc/hubert/w2v2>
    CKPT_PATH=<path_of_pretrained_acoustic_model>
    LAYER=<layer_of_acoustic_model_to_extract_features_from>
    MANIFEST=<tab_separated_manifest_of_audio_files_for_training_kmeans>
    KM_MODEL_PATH=<output_path_of_the_kmeans_model>

    PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
        --num_clusters $N_CLUSTERS \
        --feature_type $TYPE \
        --checkpoint_path $CKPT_PATH \
        --layer $LAYER \
        --manifest_path $MANIFEST \
        --out_kmeans_model_path $KM_MODEL_PATH
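
As a rough illustration of what the clustering step learns, the sketch below runs a plain Lloyd's-style K-means on toy 2-D vectors using only the Python standard library. It is not taken from `cluster_kmeans.py`; the names `fit_kmeans`, `squared_distance`, and `toy_frames` are hypothetical, and real acoustic features would have far more dimensions and frames.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's iteration (illustrative only); returns the centroids."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        # Assignment step: group frames by their nearest centroid.
        groups = [[] for _ in range(num_clusters)]
        for frame in frames:
            nearest = min(
                range(num_clusters),
                key=lambda k: squared_distance(frame, centroids[k]),
            )
            groups[nearest].append(frame)
        # Update step: move each centroid to the mean of its group.
        for k, group in enumerate(groups):
            if group:  # keep the old centroid if the cluster is empty
                dim = len(group[0])
                centroids[k] = [
                    sum(f[d] for f in group) / len(group) for d in range(dim)
                ]
    return centroids


# Toy 2-D "features": two well-separated blobs (made-up data).
toy_frames = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
toy_centroids = fit_kmeans(toy_frames, num_clusters=2)
```

On this toy input, each centroid settles on the mean of one blob. `N_CLUSTERS` in the command above plays the role of `num_clusters` here.
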

    # Step 2: quantize audio files using the learned clusters.
    MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
    OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

    python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
        --feature_type $TYPE \
        --kmeans_model_path $KM_MODEL_PATH \
        --acoustic_model_path $CKPT_PATH \
        --layer $LAYER \
        --manifest_path $MANIFEST \
        --out_quantized_file_path $OUT_QUANTIZED_FILE \
        --extension ".flac"
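
Conceptually, quantization replaces each feature frame with the index of its nearest K-means centroid, turning an utterance into a sequence of discrete units. The minimal sketch below shows that assignment on made-up 2-D data. The function `quantize`, the centroids, and the frames are all hypothetical, and the actual script may differ in how it extracts features and writes its output.

```python
def quantize(frames, centroids):
    """Return, for every frame, the index of its nearest centroid."""
    units = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


# Hypothetical 2-D centroids and frames, for illustration only.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 4.1], [1.1, 0.8]]

units = quantize(frames, centroids)
print(" ".join(str(u) for u in units))  # prints: 0 1 2 1
```

With a KM50, KM100, or KM200 model, the unit indices would range over 50, 100, or 200 clusters respectively.
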

Note: the manifest file lists the paths and lengths of the input audio files. The first line is the root directory, and each following line holds a relative path and a frame count separated by a tab:

    <path_of_root_directory_containing_audio_files>
    <relative_path_of_audio_file_1>\t<number_of_frames_1>
    <relative_path_of_audio_file_2>\t<number_of_frames_2>
    ...
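
A manifest in this layout could be generated with a short script along the following lines. This is a sketch under stated assumptions rather than a tool shipped with this example: `write_manifest` and `count_wav_frames` are hypothetical names, and the frame counter uses the standard-library `wave` module, so it only handles WAV files. For `.flac` input, a FLAC-capable audio library would have to supply the `count_frames` function instead.

```python
import os
import tempfile
import wave


def count_wav_frames(path):
    """Number of sample frames in a WAV file (standard library only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes()


def write_manifest(root_dir, manifest_path, extension=".wav",
                   count_frames=count_wav_frames):
    """Write the root directory, then one tab-separated line per audio file."""
    root_dir = os.path.abspath(root_dir)
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(extension):
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, root_dir)
                entries.append((rel_path, count_frames(full_path)))
    entries.sort()  # deterministic order
    with open(manifest_path, "w") as out:
        out.write(root_dir + "\n")
        for rel_path, num_frames in entries:
            out.write(f"{rel_path}\t{num_frames}\n")
    return entries


# Demo on a temporary directory holding one tiny silent WAV file.
with tempfile.TemporaryDirectory() as tmp:
    with wave.open(os.path.join(tmp, "a.wav"), "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 160)  # 160 frames of silence
    manifest_file = os.path.join(tmp, "train.tsv")
    demo_entries = write_manifest(tmp, manifest_file)
    with open(manifest_file) as f:
        manifest_lines = f.read().splitlines()
```

In the demo, the manifest's second line pairs `a.wav` with its frame count of 160, separated by a tab.
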