Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)

https://arxiv.org/pdf/1911.02116.pdf

Larger-Scale Transformers for Multilingual Masked Language Modeling

https://arxiv.org/pdf/2105.00572.pdf

What's New:

  • June 2021: XLMR-XL and XLMR-XXL models released.

Introduction

XLM-R (XLM-RoBERTa) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5TB of filtered CommonCrawl data covering 100 languages (full list below).

| Language | Language | Language | Language | Language |
|---|---|---|---|---|
| Afrikaans | Albanian | Amharic | Arabic | Armenian |
| Assamese | Azerbaijani | Basque | Belarusian | Bengali |
| Bengali Romanized | Bosnian | Breton | Bulgarian | Burmese |
| Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian |
| Czech | Danish | Dutch | English | Esperanto |
| Estonian | Filipino | Finnish | French | Galician |
| Georgian | German | Greek | Gujarati | Hausa |
| Hebrew | Hindi | Hindi Romanized | Hungarian | Icelandic |
| Indonesian | Irish | Italian | Japanese | Javanese |
| Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) |
| Kyrgyz | Lao | Latin | Latvian | Lithuanian |
| Macedonian | Malagasy | Malay | Malayalam | Marathi |
| Mongolian | Nepali | Norwegian | Oriya | Oromo |
| Pashto | Persian | Polish | Portuguese | Punjabi |
| Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian |
| Sindhi | Sinhala | Slovak | Slovenian | Somali |
| Spanish | Sundanese | Swahili | Swedish | Tamil |
| Tamil Romanized | Telugu | Telugu Romanized | Thai | Turkish |
| Ukrainian | Urdu | Urdu Romanized | Uyghur | Uzbek |
| Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish |

Pre-trained models

| Model | Description | #params | vocab size | Download |
|---|---|---|---|---|
| `xlmr.base` | XLM-R using the BERT-base architecture | 250M | 250k | xlmr.base.tar.gz |
| `xlmr.large` | XLM-R using the BERT-large architecture | 560M | 250k | xlmr.large.tar.gz |
| `xlmr.xl` | XLM-R (layers=36, model_dim=2560) | 3.5B | 250k | xlmr.xl.tar.gz |
| `xlmr.xxl` | XLM-R (layers=48, model_dim=4096) | 10.7B | 250k | xlmr.xxl.tar.gz |
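
Any of these checkpoints can be loaded by name, as in the Example usage section further down. A minimal sketch that loads one of the listed models through torch.hub and sanity-checks its size against the #params column (the hub model names are assumed to match the Model column above; the xl/xxl checkpoints are far larger downloads and need correspondingly more memory):

```python
import torch

# Load one of the released checkpoints by name (assumed to match the Model column).
xlmr = torch.hub.load('pytorch/fairseq:main', 'xlmr.base')
xlmr.eval()

# Rough sanity check against the #params column above.
n_params = sum(p.numel() for p in xlmr.model.parameters())
print(f'{n_params / 1e6:.0f}M parameters')
```
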

Results

XNLI (Conneau et al., 2018)

| Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `roberta.large.mnli` (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8 |
| `xlmr.large` (TRANSLATE-TRAIN-ALL) | 83.6 | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1 |
| `xlmr.xl` (TRANSLATE-TRAIN-ALL) | 85.4 | 91.1 | 87.2 | 88.1 | 87.0 | 87.4 | 87.8 | 85.3 | 85.2 | 85.3 | 86.2 | 83.8 | 85.3 | 83.1 | 79.8 | 78.2 |
| `xlmr.xxl` (TRANSLATE-TRAIN-ALL) | 86.0 | 91.5 | 87.6 | 88.7 | 87.8 | 87.4 | 88.2 | 85.6 | 85.1 | 85.8 | 86.3 | 83.9 | 85.6 | 84.6 | 81.7 | 80.6 |

MLQA (Lewis et al., 2018), reported as F1 / EM

| Model | average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|
| BERT-large | - | 80.2 / 67.4 | - | - | - | - | - | - |
| mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3 |
| `xlmr.large` | 70.7 / 52.7 | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4 |
| `xlmr.xl` | 73.4 / 55.3 | 85.1 / 72.6 | 66.7 / 46.2 | 70.5 / 55.5 | 74.3 / 56.9 | 72.2 / 54.7 | 74.4 / 52.9 | 70.9 / 48.5 |
| `xlmr.xxl` | 74.8 / 56.6 | 85.5 / 72.4 | 68.6 / 48.4 | 72.7 / 57.8 | 75.4 / 57.6 | 73.7 / 55.8 | 76.0 / 55.0 | 71.7 / 48.9 |

Example usage

Load XLM-R from torch.hub (PyTorch >= 1.1):
```python
import torch
xlmr = torch.hub.load('pytorch/fairseq:main', 'xlmr.large')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```

Load XLM-R (for PyTorch 1.0 or custom models):
```bash
# Download xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
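
If you want to fine-tune rather than just run inference, fairseq's RoBERTa hub interface lets you attach a task-specific classification head to the loaded model. A minimal sketch (the head name `xnli_head` and `num_classes=3` are illustrative placeholders, not part of the released checkpoints):

```python
# Attach a new, randomly initialized classification head; name and class
# count below are placeholders for an XNLI-style 3-way entailment task.
xlmr.register_classification_head('xnli_head', num_classes=3)
xlmr.train()  # enable dropout for fine-tuning

# Forward a (premise, hypothesis) pair through the new head.
tokens = xlmr.encode('XLM-R is a sentence encoder.', 'XLM-R encodes sentences.')
logprobs = xlmr.predict('xnli_head', tokens)  # log-probabilities over 3 classes
assert logprobs.shape == (1, 3)
```
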
Apply sentence-piece-model (SPM) encoding to input text:
```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```
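
To run several sentences through the model at once, the variable-length token sequences need to be padded into a single batch tensor. A minimal sketch using fairseq's `collate_tokens` helper, with the padding index taken from the model's dictionary (the sentences are just the examples above):

```python
from fairseq.data.data_utils import collate_tokens

sentences = ['Hello world!', 'Bonjour le monde', '你好,世界']
pad_idx = xlmr.task.source_dictionary.pad()

# Right-pad every encoded sentence to the length of the longest one.
batch = collate_tokens([xlmr.encode(s) for s in sentences], pad_idx=pad_idx)
assert batch.dim() == 2 and batch.size(0) == 3
```

The resulting 2D batch can then be passed to `extract_features` in place of a single token sequence.
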
Extract features from XLM-R:
```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
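
Because XLM-R is a cross-lingual sentence encoder, the extracted features can be pooled into fixed-size sentence embeddings and compared across languages. A minimal sketch using mean pooling (one simple pooling choice; the raw similarity scores are only indicative, since the pre-trained model is not fine-tuned for retrieval):

```python
import torch
import torch.nn.functional as F

def embed(sentence):
    # Mean-pool the last layer's features into a single sentence vector.
    tokens = xlmr.encode(sentence)
    with torch.no_grad():
        features = xlmr.extract_features(tokens)  # shape: (1, seq_len, 1024)
    return features.mean(dim=1).squeeze(0)

en, fr, zh = embed('Hello world!'), embed('Bonjour le monde'), embed('你好,世界')
print(F.cosine_similarity(en, fr, dim=0).item())  # en vs. fr
print(F.cosine_similarity(en, zh, dim=0).item())  # en vs. zh
```
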

Citation

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```
```bibtex
@article{goyal2021larger,
  title={Larger-Scale Transformers for Multilingual Masked Language Modeling},
  author={Goyal, Naman and Du, Jingfei and Ott, Myle and Anantharaman, Giri and Conneau, Alexis},
  journal={arXiv preprint arXiv:2105.00572},
  year={2021}
}
```