Speech to Unit Model (speech2unit)

Acoustic Model

To quantize speech, we learn a K-means clustering over acoustic representations, computed either from Log-Mel Filterbank features or from a pretrained acoustic representation model. To use a pretrained model, please download it from its respective location linked below.

Quantization Model

You can download a pretrained quantization model from the list below.

K-Means Model	Download Link
Log Mel Filterbank + KM50	download
Log Mel Filterbank + KM100	download
Log Mel Filterbank + KM200	download
Modified CPC + KM50	download
Modified CPC + KM100	download
Modified CPC + KM200	download
HuBERT Base + KM50	download
HuBERT Base + KM100	download
HuBERT Base + KM200	download
wav2vec 2.0 Large + KM50	download
wav2vec 2.0 Large + KM100	download
wav2vec 2.0 Large + KM200	download

Quantization

For quantizing speech with a given acoustic representation, please follow the steps below.

Learn K-means clustering model

N_CLUSTERS=<number_of_clusters_used_for_kmeans>
TYPE=<one_of_logmel/cpc/hubert/w2v2>
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=<layer_of_acoustic_model_to_extract_features_from>
MANIFEST=<tab_separated_manifest_of_audio_files_for_training_kmeans>
KM_MODEL_PATH=<output_path_of_the_kmeans_model>

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
    --num_clusters $N_CLUSTERS \
    --feature_type $TYPE \
    --checkpoint_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_kmeans_model_path $KM_MODEL_PATH
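
As a rough illustration of what this step computes, the sketch below runs plain K-means (Lloyd's algorithm) over a list of feature frames in pure Python. It is not the code in cluster_kmeans.py: the function names (`fit_kmeans`, `nearest_centroid`), the toy 2-D frames, and the iteration count are assumptions made for the example, and real acoustic features would have many more dimensions.

```python
import random


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    distances = [
        sum((x - c) ** 2 for x, c in zip(frame, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def fit_kmeans(frames, n_clusters, n_iters=20, seed=0):
    """Hypothetical helper: learn `n_clusters` centroids from a list of feature frames."""
    rng = random.Random(seed)
    # Initialise the centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, n_clusters)]
    for _ in range(n_iters):
        # Assignment step: group frames by their nearest centroid.
        members = [[] for _ in range(n_clusters)]
        for frame in frames:
            members[nearest_centroid(frame, centroids)].append(frame)
        # Update step: move each centroid to the mean of its members.
        for k, group in enumerate(members):
            if group:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Toy "acoustic features": two well-separated groups of 2-D frames.
frames = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = fit_kmeans(frames, n_clusters=2)
print(sorted(centroids))
```

In this picture, N_CLUSTERS corresponds to `n_clusters`, and the learned centroids are what the K-means model stores for the quantization step.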

Quantize using the learned clusters

MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".flac"
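
Conceptually, quantization replaces each feature frame with the index of its nearest K-means centroid, so an utterance becomes a sequence of discrete units. The sketch below shows that idea in pure Python; the `quantize` function, the toy centroids, and the frames are assumptions made for the example and do not reproduce the internals or output format of quantize_with_kmeans.py.

```python
def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


# Hypothetical 3-cluster model and four 2-D feature frames.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
frames = [[0.2, -0.1], [9.5, 10.3], [0.4, 9.0], [0.1, 0.3]]

units = quantize(frames, centroids)
print(" ".join(str(u) for u in units))  # prints "0 1 2 0"
```

With a KM100 model, for instance, every unit would be an integer between 0 and 99.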

Note: the manifest file lists the path and length (in frames) of each input audio file. Its format is as follows:

<path_of_root_directory_containing_audio_files>
<relative_path_of_audio_file_1>\t<number_of_frames_1>
<relative_path_of_audio_file_2>\t<number_of_frames_2>
...
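
To make the layout concrete, here is a minimal sketch that writes and reads a manifest in this format using only the Python standard library. The helper names (`write_manifest`, `read_manifest`), the example paths, and the frame counts are hypothetical; in practice the frame counts would come from the audio files themselves.

```python
import os
import tempfile


def write_manifest(manifest_path, root_dir, items):
    """Write the root directory, then one '<relative_path>\\t<n_frames>' line per file."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        f.write(f"{root_dir}\n")
        for rel_path, n_frames in items:
            f.write(f"{rel_path}\t{n_frames}\n")


def read_manifest(manifest_path):
    """Return (root_dir, [(full_audio_path, n_frames), ...])."""
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        root_dir = f.readline().strip()
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            rel_path, n_frames = line.split("\t")
            entries.append((os.path.join(root_dir, rel_path), int(n_frames)))
    return root_dir, entries


# Round-trip a small, made-up manifest through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    manifest_path = os.path.join(tmp_dir, "train.tsv")
    write_manifest(
        manifest_path,
        "/data/audio",
        [("spk1/utt1.flac", 52400), ("spk2/utt2.flac", 31840)],
    )
    root_dir, entries = read_manifest(manifest_path)

for path, n_frames in entries:
    print(path, n_frames)
```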