
WavLM

<!--**Pre-trained models for speech related tasks**-->

WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing

Official PyTorch implementation and pretrained models of WavLM

  • Dec 2021: An interesting speaker verification demo is available on HuggingFace. Give it a try!
  • Dec 2021: WavLM Large release and HuggingFace support
  • Nov 2021: Release of code and pretrained models (WavLM Base and WavLM Base+)
  • Oct 2021: Preprint released on arXiv

Pre-Trained Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| --- | --- | --- | --- |
| WavLM Base | 960 hrs LibriSpeech | - | Azure Storage, Google Drive |
| WavLM Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | Azure Storage, Google Drive |
| WavLM Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | Azure Storage, Google Drive |

Load Pre-Trained Models

```python
import torch
from WavLM import WavLM, WavLMConfig

# load the pre-trained checkpoint
checkpoint = torch.load('/path/to/wavlm.pt')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# extract the representation of the last layer
wav_input_16khz = torch.randn(1, 10000)
if cfg.normalize:
    wav_input_16khz = torch.nn.functional.layer_norm(wav_input_16khz, wav_input_16khz.shape)
rep = model.extract_features(wav_input_16khz)[0]

# extract the representation of each layer
wav_input_16khz = torch.randn(1, 10000)
if cfg.normalize:
    wav_input_16khz = torch.nn.functional.layer_norm(wav_input_16khz, wav_input_16khz.shape)
rep, layer_results = model.extract_features(wav_input_16khz, output_layer=model.cfg.encoder_layers, ret_layer_results=True)[0]
layer_reps = [x.transpose(0, 1) for x, _ in layer_results]
```

Both HuggingFace and s3prl support our models, so it is easy to fine-tune them on different downstream tasks. We suggest extracting the representation of each layer and taking a weighted sum of the representations.
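As a sketch of the weighted-sum idea, the per-layer representations can be combined with softmax-normalized learnable weights (the weights below are trained jointly with the downstream head; the layer count and shapes are placeholders, and the random tensors stand in for the `layer_reps` produced above):

```python
import torch

# Placeholder dimensions for illustration only.
num_layers = 13          # e.g. WavLM Base: CNN output + 12 transformer layers
seq_len, dim = 31, 768

# Stand-ins for the per-layer representations (`layer_reps`) extracted above.
layer_reps = [torch.randn(1, seq_len, dim) for _ in range(num_layers)]

# Learnable per-layer weights, normalized with softmax; initialized uniform.
weights = torch.nn.Parameter(torch.zeros(num_layers))
norm_weights = torch.softmax(weights, dim=0)

stacked = torch.stack(layer_reps, dim=0)               # (num_layers, 1, seq_len, dim)
weighted_sum = (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
print(weighted_sum.shape)  # torch.Size([1, 31, 768])
```

Because the weights are a `Parameter`, gradients from the downstream loss flow into them during fine-tuning, letting the task pick the most useful layers.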

Universal Representation Evaluation on SUPERB

Downstream Task Performance

We also evaluate our models on typical speech processing benchmarks.

Speaker Verification

Fine-tune the model with the VoxCeleb2 dev data and evaluate it on the VoxCeleb1 trial lists (EER, %).

| Model | Fix pre-train | Vox1-O | Vox1-E | Vox1-H |
| --- | --- | --- | --- | --- |
| ECAPA-TDNN | - | 0.87 | 1.12 | 2.12 |
| HuBERT large | Yes | 0.888 | 0.912 | 1.853 |
| Wav2Vec2.0 (XLSR) | Yes | 0.915 | 0.945 | 1.895 |
| UniSpeech-SAT large | Yes | 0.771 | 0.781 | 1.669 |
| WavLM large | Yes | 0.59 | 0.65 | 1.328 |
| WavLM large | No | 0.505 | 0.579 | 1.176 |
| **+ Large Margin Finetune and Score Calibration** | | | | |
| HuBERT large | No | 0.585 | 0.654 | 1.342 |
| Wav2Vec2.0 (XLSR) | No | 0.564 | 0.605 | 1.23 |
| UniSpeech-SAT large | No | 0.564 | 0.561 | 1.23 |
| WavLM large (New) | No | 0.33 | 0.477 | 0.984 |
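Verification scoring of this kind typically compares speaker embeddings with cosine similarity and reports the equal error rate (EER). A minimal sketch, with made-up embeddings and a simple hypothetical `compute_eer` helper rather than the actual fine-tuning recipe:

```python
import torch
import torch.nn.functional as F

def compute_eer(scores, labels):
    """Approximate EER: the operating point where the false-acceptance
    rate (FAR) equals the false-rejection rate (FRR)."""
    order = torch.argsort(scores, descending=True)
    sorted_labels = labels[order].float()
    n_pos = sorted_labels.sum()
    n_neg = len(sorted_labels) - n_pos
    # Accepting the top-k scoring trials for each possible k:
    far = torch.cumsum(1 - sorted_labels, dim=0) / n_neg   # accepted negatives
    frr = 1 - torch.cumsum(sorted_labels, dim=0) / n_pos   # rejected positives
    idx = torch.argmin(torch.abs(far - frr))
    return ((far[idx] + frr[idx]) / 2).item()

# Toy trial list: cosine similarity between hypothetical speaker embeddings.
emb_a = torch.randn(4, 256)
emb_b = emb_a + 0.01 * torch.randn(4, 256)   # same speakers (positive trials)
emb_c = torch.randn(4, 256)                  # different speakers (negative trials)
scores = torch.cat([F.cosine_similarity(emb_a, emb_b),
                    F.cosine_similarity(emb_a, emb_c)])
labels = torch.cat([torch.ones(4), torch.zeros(4)])
print(compute_eer(scores, labels))
```

On this easily separable toy data the EER is 0; on real VoxCeleb1 trials the scores in the table above are produced by a far more involved pipeline (large-margin fine-tuning, score calibration).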

Speech Separation

Evaluation on the LibriCSS

| Model | 0S | 0L | OV10 | OV20 | OV30 | OV40 |
| --- | --- | --- | --- | --- | --- | --- |
| Conformer (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6 |
| HuBERT base | 4.7 | 4.6 | 6.1 | 7.9 | 10.6 | 12.3 |
| UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5 |
| UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8 |
| WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9 |
| WavLM large | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5 |

Speaker Diarization

Evaluation on the CALLHOME

| Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all |
| --- | --- | --- | --- | --- | --- | --- |
| EEND-vector clustering | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49 |
| EEND-EDA clustering (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84 |
| HuBERT base | 7.93 | 12.07 | 15.21 | 19.59 | 23.32 | 12.63 |
| HuBERT large | 7.39 | 11.97 | 15.76 | 19.82 | 22.10 | 12.40 |
| UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92 |
| WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75 |
| WavLM large | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35 |

Speech Recognition

Evaluation on LibriSpeech

More Speech Pre-Trained Models

Please visit here for more interesting and effective pre-trained models.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following paper:

```latex
@article{Chen2021WavLM,
  title         = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author        = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint        = {2110.13900},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}
```

Contact Information

For help or issues using WavLM models, please submit a GitHub issue.

For other communications related to WavLM, please contact Yu Wu ([email protected]).