Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)

Introduction

XLM-R (XLM-RoBERTa) is a scaled cross-lingual sentence encoder. It is trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.

Pre-trained models

| Model | Description | #params | vocab size | Download |
|---|---|---|---|---|
| `xlmr.base.v0` | XLM-R using the BERT-base architecture | 250M | 250k | xlm.base.v0.tar.gz |
| `xlmr.large.v0` | XLM-R using the BERT-large architecture | 560M | 250k | xlm.large.v0.tar.gz |

(Note: The above models are still under training; we will update the weights once training is complete. The results below are based on the above checkpoints.)

Results

XNLI (Conneau et al., 2018)

| Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `roberta.large.mnli` (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8 |
| `xlmr.large.v0` (TRANSLATE-TRAIN-ALL) | 82.4 | 88.7 | 85.2 | 85.6 | 84.6 | 83.6 | 85.5 | 82.4 | 81.6 | 80.9 | 83.4 | 80.9 | 83.3 | 79.8 | 75.9 | 74.3 |

MLQA (Lewis et al., 2019)

Each cell reports F1 / EM.

| Model | average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|
| BERT-large | - | 80.2 / 67.4 | - | - | - | - | - | - |
| mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3 |
| `xlmr.large.v0` | 70.0 / 52.2 | 80.1 / 67.7 | 73.2 / 55.1 | 68.3 / 53.7 | 62.8 / 43.7 | 68.3 / 51.0 | 70.5 / 50.1 | 67.1 / 44.4 |

Example usage

Load XLM-R from torch.hub (PyTorch >= 1.1):
```python
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large.v0')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
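
The object returned by `torch.hub.load` is a regular PyTorch module, so it can be moved to a GPU for faster inference. A minimal sketch (plain PyTorch; depending on the fairseq version you may also need to move encoded inputs to the same device):

```python
# Move the model to GPU when available; this is a standard PyTorch call.
if torch.cuda.is_available():
    xlmr.cuda()
```
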
Load XLM-R (for PyTorch 1.0 or custom models):
```python
# Download the xlmr.large.v0 model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.v0.tar.gz
tar -xzvf xlmr.large.v0.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large.v0', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
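
`XLMRModel` builds on fairseq's RoBERTa hub interface, so a classification head can be attached for fine-tuning in the same way as for RoBERTa. A minimal sketch, assuming that interface carries over unchanged; the head name `'xnli'` and `num_classes=3` are illustrative only, and `encode()` is described in the next step:

```python
# Attach a randomly initialised 3-way classification head (the name 'xnli' is
# just an illustrative label; it does not correspond to a released checkpoint).
xlmr.register_classification_head('xnli', num_classes=3)

# encode() accepts multiple segments and inserts separator tokens between them.
tokens = xlmr.encode('XLM-R is a multilingual encoder.', 'XLM-R covers 100 languages.')

# Log-probabilities over the 3 classes from the freshly initialised head.
logprobs = xlmr.predict('xnli', tokens)
```
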
Apply sentence-piece-model (SPM) encoding to input text:
```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378,  8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```
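
Encoded sentences have different lengths, so batched processing requires padding them into a single tensor. A minimal sketch using fairseq's `collate_tokens` helper; reading the padding index from `xlmr.task.source_dictionary` is an assumption about the loaded checkpoint rather than a documented step:

```python
from fairseq.data.data_utils import collate_tokens

sentences = ['Hello world!', 'Bonjour le monde', '你好,世界']

# Pad the variable-length token sequences into one (batch, max_len) tensor.
pad_idx = xlmr.task.source_dictionary.pad()
batch = collate_tokens([xlmr.encode(s) for s in sentences], pad_idx=pad_idx)

# A single forward pass over the whole batch (feature extraction is shown below).
features = xlmr.extract_features(batch)  # shape: (3, max_len, 1024)
```
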
Extract features from XLM-R:
```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
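
Because all 100 languages share a single representation space, the extracted features can be pooled into rough sentence vectors and compared across languages. A minimal sketch using mean pooling and cosine similarity; the pooling strategy is an illustrative choice, not the evaluation protocol from the paper:

```python
import torch.nn.functional as F

def embed(sentence):
    # Mean-pool the last layer's token features into a single sentence vector.
    tokens = xlmr.encode(sentence)
    with torch.no_grad():
        features = xlmr.extract_features(tokens)  # (1, seq_len, 1024)
    return features.mean(dim=1).squeeze(0)        # (1024,)

en = embed('Hello world!')
fr = embed('Bonjour le monde')

# A translation pair should score noticeably higher than unrelated sentences.
print(F.cosine_similarity(en, fr, dim=0).item())
```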

Citation

```bibtex
@article{conneau2019unsupervised,
    title = {Unsupervised Cross-lingual Representation Learning at Scale},
    author = {Alexis Conneau and Kartikay Khandelwal
        and Naman Goyal and Vishrav Chaudhary and Guillaume Wenzek
        and Francisco Guzm\'an and Edouard Grave and Myle Ott
        and Luke Zettlemoyer and Veselin Stoyanov
    },
    journal = {arXiv preprint arXiv:1911.02116},
    year = {2019},
}
```