Speech to Unit Model (speech2unit)

Acoustic Model

For quantizing speech, we learn a K-means clustering over acoustic representations, which are either Log-Mel Filterbank features or the outputs of a pretrained acoustic representation model. To use a pretrained model, download it from its respective location.

Quantization Model

You can download pretrained quantization (K-means) models from the list below.

| K-Means Model | Download Link |
|---|---|
| Log Mel Filterbank + KM50 | download |
| Log Mel Filterbank + KM100 | download |
| Log Mel Filterbank + KM200 | download |
| Modified CPC + KM50 | download |
| Modified CPC + KM100 | download |
| Modified CPC + KM200 | download |
| HuBERT Base + KM50 | download |
| HuBERT Base + KM100 | download |
| HuBERT Base + KM200 | download |
| wav2vec 2.0 Large + KM50 | download |
| wav2vec 2.0 Large + KM100 | download |
| wav2vec 2.0 Large + KM200 | download |

Quantization

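For quantizing speech with a given acoustic representation, follow the steps below. Run the commands from the root of the fairseq repository, since they use relative script paths and PYTHONPATH=.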

  1. Learn a K-means clustering model
N_CLUSTERS=<number_of_clusters_used_for_kmeans>
TYPE=<one_of_logmel/cpc/hubert/w2v2>
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=<layer_of_acoustic_model_to_extract_features_from>
MANIFEST=<tab_separated_manifest_of_audio_files_for_training_kmeans>
KM_MODEL_PATH=<output_path_of_the_kmeans_model>

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
    --num_clusters $N_CLUSTERS \
    --feature_type $TYPE \
    --checkpoint_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_kmeans_model_path $KM_MODEL_PATH
  2. Quantize using the learned clusters
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_path_of_quantized_units_file>

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".flac"
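The clustering and quantization commands above can be scripted. The sketch below assumes a POSIX shell run from the fairseq root; both functions are hypothetical helpers, not part of fairseq. `print_kmeans_sweep` prints one clustering command per cluster count (for example 50, 100 and 200, mirroring the pretrained variants in the table above), and `check_quantize_inputs` fails early if a variable used by the quantization command is empty or if the K-means model or manifest file is missing. The output file naming and the exact variable list are assumptions; drop any variable your feature type does not need.

```shell
# Hypothetical helpers (not part of fairseq); run from the fairseq root.

# Print one clustering command per cluster count given after the first five
# arguments. Assumes paths contain no whitespace; pipe the output to `sh`.
print_kmeans_sweep() {
    local feature_type="$1" ckpt="$2" layer="$3" manifest="$4" out_dir="$5" n
    shift 5
    for n in "$@"; do
        # Output naming (<feature_type>_km<N>.bin) is an arbitrary choice.
        printf 'PYTHONPATH=. python %s --num_clusters %s --feature_type %s --checkpoint_path %s --layer %s --manifest_path %s --out_kmeans_model_path %s\n' \
            examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
            "$n" "$feature_type" "$ckpt" "$layer" "$manifest" \
            "$out_dir/${feature_type}_km${n}.bin"
    done
}

# Report every empty variable used by the quantization command, then check
# that the K-means model and the manifest exist. Returns 1 on any problem.
check_quantize_inputs() {
    local name value missing=0
    for name in TYPE KM_MODEL_PATH CKPT_PATH LAYER MANIFEST OUT_QUANTIZED_FILE; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1
    # The K-means model is written by cluster_kmeans.py.
    if [ ! -f "$KM_MODEL_PATH" ]; then
        echo "error: K-means model not found: $KM_MODEL_PATH" >&2
        return 1
    fi
    if [ ! -f "$MANIFEST" ]; then
        echo "error: manifest not found: $MANIFEST" >&2
        return 1
    fi
    return 0
}
```

For example, `mkdir -p km_models && print_kmeans_sweep "$TYPE" "$CKPT_PATH" "$LAYER" "$MANIFEST" km_models 50 100 200 | sh` trains three models; point `KM_MODEL_PATH` at one of the printed output paths and call `check_quantize_inputs` before quantizing. Printing the commands instead of running them directly keeps the sweep easy to inspect first.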

Note about the manifest file: it lists the paths and lengths of the input audio files. Its format is as follows:

<path_of_root_directory_containing_audio_files>
<relative_path_of_audio_file_1>\t<number_of_frames_1>
<relative_path_of_audio_file_2>\t<number_of_frames_2>
...
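A manifest can be checked against this format before training or quantizing. The sketch below is a hypothetical POSIX shell and awk helper, not part of fairseq; it assumes the second column must be a positive integer and treats blank lines, missing tabs, or Windows line endings as errors.

```shell
# Hypothetical checker (not part of fairseq): line 1 must be a non-empty root
# directory; every later line must be "<relative_path>\t<positive integer>".
# Prints each offending line to stderr and returns non-zero if any is found.
validate_manifest() {
    awk -F '\t' '
        NR == 1 {
            if ($0 == "") { print "line 1: empty root directory" > "/dev/stderr"; bad = 1 }
            next
        }
        NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ || $2 == 0 {
            printf "line %d: expected <path>\\t<num_frames>, got: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END {
            # An empty file has no root line at all.
            if (NR < 1) bad = 1
            exit bad
        }
    ' "$1"
}
```

For example, `validate_manifest "$MANIFEST" && echo ok` prints `ok` only when every line matches the expected layout.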