Flores101: Large-Scale Multilingual Machine Translation

Introduction

These are the baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/

Pretrained models

| Model | Num layers | Embed dimension | FFN dimension | Vocab size | #params | Download |
|---|---|---|---|---|---|---|
| flores101_mm100_615M | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz |
| flores101_mm100_175M | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz |

These models are trained similarly to M2M-100, with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages can be found at the bottom.

Example Generation code

Download model, sentencepiece vocab

```bash
fairseq=/path/to/fairseq
cd "$fairseq"

# Download the 615M-parameter model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

# Extract the archive into flores101_mm100_615M/.
tar -xvzf flores101_mm100_615M.tar.gz
```
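
The later steps refer to four files inside the extracted directory: `model.pt`, `dict.txt`, `sentencepiece.bpe.model`, and `language_pairs.txt`. As a minimal sketch, a check like the following can confirm they are all present before continuing. The function name `check_model_dir` is hypothetical and not part of fairseq.

```bash
# Hypothetical helper (not part of fairseq): verify that the extracted model
# directory holds every file the later commands in this README refer to.
check_model_dir() {
    local dir="$1" status=0 f
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$dir/$f" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (after extracting the archive):
#   check_model_dir flores101_mm100_615M && echo "model directory looks complete"
```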

Encode using our SentencePiece Model

Note: Install SentencePiece first (https://github.com/google/sentencepiece).

```bash
fairseq=/path/to/fairseq
cd "$fairseq"

# Download an example German-to-French dataset (first 20 lines of WMT19 de-fr).
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr

# Encode both sides with the model's SentencePiece vocabulary.
for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
```
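
Encoding should leave the number of lines unchanged, and the source and target files need to stay line-aligned for binarization. The sketch below is one way to check this. The helper name `same_line_count` is hypothetical.

```bash
# Hypothetical helper: succeed only if both files exist and have the same
# number of lines.
same_line_count() {
    local a b
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "file not found: $1 or $2" >&2
        return 1
    fi
    # Arithmetic expansion strips the padding some wc implementations add.
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    if [ "$a" -ne "$b" ]; then
        echo "line count mismatch: $1 ($a) vs $2 ($b)" >&2
        return 1
    fi
}

# Usage (after the encoding loop):
#   for lang in de fr; do same_line_count raw_input.de-fr.$lang spm.de-fr.$lang; done
#   same_line_count spm.de-fr.de spm.de-fr.fr
```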

Binarization

```bash
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
```

Generation

```bash
fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
```
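
`fairseq-generate` writes sources, references, hypotheses, and scores to standard output as tagged lines. The sketch below pulls out only the hypothesis text, restored to the original sentence order. It assumes hypothesis lines have the form `H-<id><TAB><score><TAB><text>`. The helper name `extract_hypotheses` and the file names in the usage comments are hypothetical.

```bash
# Hypothetical helper: read fairseq-generate output on stdin and print the
# hypothesis text in original sentence order, one sentence per line.
# Assumes hypothesis lines look like "H-<id><TAB><score><TAB><text>".
extract_hypotheses() {
    grep '^H-' \
        | sed 's/^H-//' \
        | sort -t "$(printf '\t')" -k1,1n \
        | cut -f3-
}

# Usage, assuming the generation command above was redirected to gen.out:
#   extract_hypotheses < gen.out > hyp.de-fr.fr
```

The resulting file can then be scored against the raw reference, for example with `sacrebleu raw_input.de-fr.fr < hyp.de-fr.fr`.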

Supported Languages and lang code

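| Language | lang code |
|---|---|
| Afrikaans | af |
| Amharic | am |
| Arabic | ar |
| Assamese | as |
| Asturian | ast |
| Aymara | ay |
| Azerbaijani | az |
| Bashkir | ba |
| Belarusian | be |
| Bulgarian | bg |
| Bengali | bn |
| Breton | br |
| Bosnian | bs |
| Catalan | ca |
| Cebuano | ceb |
| Chokwe | cjk |
| Czech | cs |
| Welsh | cy |
| Danish | da |
| German | de |
| Dyula | dyu |
| Greek | el |
| English | en |
| Spanish | es |
| Estonian | et |
| Persian | fa |
| Fulah | ff |
| Finnish | fi |
| French | fr |
| Western Frisian | fy |
| Irish | ga |
| Scottish Gaelic | gd |
| Galician | gl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Croatian | hr |
| Haitian Creole | ht |
| Hungarian | hu |
| Armenian | hy |
| Indonesian | id |
| Igbo | ig |
| Iloko | ilo |
| Icelandic | is |
| Italian | it |
| Japanese | ja |
| Javanese | jv |
| Georgian | ka |
| Kachin | kac |
| Kamba | kam |
| Kabuverdianu | kea |
| Kongo | kg |
| Kazakh | kk |
| Central Khmer | km |
| Kimbundu | kmb |
| Northern Kurdish | kmr |
| Kannada | kn |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Luxembourgish | lb |
| Ganda | lg |
| Lingala | ln |
| Lao | lo |
| Lithuanian | lt |
| Luo | luo |
| Latvian | lv |
| Malagasy | mg |
| Maori | mi |
| Macedonian | mk |
| Malayalam | ml |
| Mongolian | mn |
| Marathi | mr |
| Malay | ms |
| Maltese | mt |
| Burmese | my |
| Nepali | ne |
| Dutch | nl |
| Norwegian | no |
| Northern Sotho | ns |
| Nyanja | ny |
| Occitan | oc |
| Oromo | om |
| Oriya | or |
| Punjabi | pa |
| Polish | pl |
| Pashto | ps |
| Portuguese | pt |
| Quechua | qu |
| Romanian | ro |
| Russian | ru |
| Sindhi | sd |
| Shan | shn |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Shona | sn |
| Somali | so |
| Albanian | sq |
| Serbian | sr |
| Swati | ss |
| Sundanese | su |
| Swedish | sv |
| Swahili | sw |
| Tamil | ta |
| Telugu | te |
| Tajik | tg |
| Thai | th |
| Tigrinya | ti |
| Tagalog | tl |
| Tswana | tn |
| Turkish | tr |
| Ukrainian | uk |
| Umbundu | umb |
| Urdu | ur |
| Uzbek | uz |
| Vietnamese | vi |
| Wolof | wo |
| Xhosa | xh |
| Yiddish | yi |
| Yoruba | yo |
| Chinese | zh |
| Zulu | zu |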
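
To translate a direction other than de-fr, substitute the codes above in the encoding, binarization, and generation commands. Before doing so, it may help to confirm that the direction is listed in `language_pairs.txt`. The sketch below assumes that file holds pairs such as `de-fr` separated by commas and/or newlines. That format is an assumption and is not stated in this README, and the helper name `pair_supported` is hypothetical.

```bash
# Hypothetical helper: succeed if the pair "<src>-<tgt>" is listed in the file.
# Assumes pairs such as "de-fr" separated by commas and/or newlines.
pair_supported() {
    local pairs_file="$1" src="$2" tgt="$3"
    # Put one pair on each line, drop stray spaces and carriage returns,
    # then require an exact whole-line match.
    tr ',' '\n' < "$pairs_file" | tr -d ' \r' | grep -qxF -- "$src-$tgt"
}

# Usage:
#   pair_supported flores101_mm100_615M/language_pairs.txt de fr && echo "de-fr is listed"
```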