SentencePiece Experiments

Experiments 1 (subword vs word-based model)

Experimental settings

  • Segmentation algorithms:

    • SentencePiece: SentencePiece with a language-model based segmentation. (--model_type=unigram)
    • SentencePiece(BPE): SentencePiece with Byte Pair Encoding. [Sennrich et al.] (--model_type=bpe)
    • Moses: Moses tokenizer for English.
    • KyTea: KyTea for Japanese.
    • MeCab: MeCab for Japanese.
    • neologd: MeCab with neologd for Japanese.
    • (Moses/KyTea)+SentencePiece: Applies SentencePiece (Unigram) to pre-tokenized sentences. There are several variants with different tokenizers, e.g., (Moses/MeCab)+SentencePiece and (MeCab/Moses)+SentencePiece.
    • Char: Segments sentences into characters.
  • Data sets:

  • NMT parameters: (Google’s Neural Machine Translation System is applied for all experiments.)

    • Dropout prob: 0.2
    • num nodes: 512
    • num lstms: 6
    • Decoder parameters (α and β) are optimized with development data.
  • Evaluation metrics:

    • Case-sensitive BLEU on detokenized text with the NIST scorer and KyTea segmenter. An in-house rule-based detokenizer was used for Moses/KyTea/MeCab/neologd.
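
The two SentencePiece settings above differ only in the `--model_type` flag. As a minimal sketch, the training commands for the vocabulary sizes used in this experiment could be assembled as follows. The corpus file name and model prefixes are hypothetical, and reading "4k", "8k", and so on as round thousands is an assumption; the exact values used in the experiments are not stated.

```python
import shlex
from typing import List

# Assumption: vocabulary labels are read as round thousands.
VOCAB_SIZES = {"4k": 4000, "8k": 8000, "16k": 16000, "32k": 32000}


def spm_train_command(corpus: str, model_type: str, vocab_label: str) -> List[str]:
    """Build an spm_train argument list for one segmentation setting."""
    if model_type not in ("unigram", "bpe"):
        raise ValueError(f"unsupported model type for this sketch: {model_type}")
    prefix = f"spm_{model_type}_{vocab_label}"  # hypothetical output name
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={prefix}",
        f"--vocab_size={VOCAB_SIZES[vocab_label]}",
        f"--model_type={model_type}",
    ]


# "train.en-ja.txt" is a hypothetical file holding the training sentences.
commands = [
    spm_train_command("train.en-ja.txt", "unigram", label) for label in VOCAB_SIZES
]
commands.append(spm_train_command("train.en-ja.txt", "bpe", "8k"))

# Print the commands instead of running them, so the sketch has no side effects.
for command in commands:
    print(shlex.join(command))
```

The commands are only printed here; passing each list to `subprocess.run` would start training, provided `spm_train` is installed and the corpus file exists.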

Results (BLEU scores)

English to Japanese

| Setting | vocab size | BLEU(dev) | BLEU(test) | src #tokens/sent. | trg #tokens/sent. |
|---|---|---|---|---|---|
| SentencePiece | 4k (shared) | 0.2857 | 0.2940 | 43.7478 | 29.6998 |
| SentencePiece | 8k (shared) | 0.2785 | 0.2955 | 30.9734 | 25.0540 |
| SentencePiece | 16k (shared) | 0.2664 | 0.2862 | 27.1827 | 21.5326 |
| SentencePiece | 32k (shared) | 0.2641 | 0.2849 | 25.0592 | 19.0840 |
| SentencePiece(BPE) | 8k (shared) | 0.2767 | 0.2947 | 31.7693 | 25.4331 |
| (Moses/KyTea)+SentencePiece | 8k (shared) | 0.2900 | 0.2985 | 31.2719 | 29.9854 |
| (Moses/MeCab)+SentencePiece | 8k (shared) | 0.2817 | 0.2950 | 31.4743 | 28.9537 |
| (Moses/neologd)+SentencePiece | 8k (shared) | 0.2824 | 0.3062 | 31.2985 | 28.8645 |
| Moses/KyTea | 80k/80k | 0.2576 | 0.2824 | 21.2513 | 23.2161 |
| Moses/MeCab | 80k/80k | 0.2455 | 0.2780 | 21.2513 | 21.2033 |
| Moses/neologd | 80k/80k | 0.2157 | 0.2378 | 21.2513 | 18.4768 |
| Moses/SentencePiece | 80k/8k | 0.2475 | 0.2742 | 21.2513 | 22.9383 |
| SentencePiece/KyTea | 8k/80k | 0.2778 | 0.2918 | 27.0429 | 23.2161 |
| SentencePiece/MeCab | 8k/80k | 0.2673 | 0.2919 | 27.0429 | 21.2033 |
| SentencePiece/neologd | 8k/80k | 0.2280 | 0.2494 | 27.0429 | 18.4768 |
| Char | 3k (shared) | 0.2509 | 0.2679 | 109.8662 | 33.6963 |

Japanese to English

| Setting | vocab size | BLEU(dev) | BLEU(test) | src #tokens/sent. | trg #tokens/sent. |
|---|---|---|---|---|---|
| SentencePiece | 4k (shared) | 0.1970 | 0.2179 | 29.6998 | 43.7478 |
| SentencePiece | 8k (shared) | 0.1966 | 0.2162 | 25.0540 | 30.9734 |
| SentencePiece | 16k (shared) | 0.1996 | 0.2160 | 21.5326 | 27.1827 |
| SentencePiece | 32k (shared) | 0.1949 | 0.2159 | 19.0840 | 25.0592 |
| SentencePiece(BPE) | 8k (shared) | 0.1977 | 0.2173 | 25.4331 | 31.7693 |
| (KyTea/Moses)+SentencePiece | 8k (shared) | 0.1921 | 0.2086 | 29.9854 | 31.2719 |
| (MeCab/Moses)+SentencePiece | 8k (shared) | 0.1909 | 0.2049 | 28.9537 | 31.4743 |
| (neologd/Moses)+SentencePiece | 8k (shared) | 0.1938 | 0.2137 | 28.8645 | 31.2985 |
| KyTea/Moses | 80k/80k | 0.1707 | 0.2006 | 23.2161 | 21.2513 |
| MeCab/Moses | 80k/80k | 0.1668 | 0.1892 | 21.2033 | 21.2513 |
| neologd/Moses | 80k/80k | 0.1589 | 0.1836 | 18.4768 | 21.2513 |
| SentencePiece/Moses | 8k/80k | 0.1727 | 0.1994 | 22.9383 | 21.2513 |
| KyTea/SentencePiece | 80k/8k | 0.1939 | 0.2141 | 23.2161 | 27.0429 |
| MeCab/SentencePiece | 80k/8k | 0.1892 | 0.2077 | 21.2033 | 27.0429 |
| neologd/SentencePiece | 80k/8k | 0.1641 | 0.1804 | 18.4768 | 27.0429 |
| Char | 3k (shared) | 0.0824 | 0.0918 | 33.6963 | 109.8662 |

Discussion

  • SentencePiece (Unigram/BPE) outperforms word-based methods (Moses/KyTea/MeCab/neologd) even with a much smaller vocabulary (10% of the word-based vocabulary size).
  • The number of tokens needed to represent Japanese sentences is nearly the same for SentencePiece (Unigram) and KyTea, even though the SentencePiece vocabulary is much smaller. This implies that SentencePiece can compress sentences effectively with a smaller vocabulary.
  • Pretokenization slightly improves BLEU scores in English to Japanese. In Japanese to English, pretokenization does not improve BLEU.
  • neologd yields poor BLEU scores. Tokenizing sentences with a large named-entity dictionary might not be effective in neural text processing.
  • SentencePiece (Unigram) shows a slightly better text compression ratio than BPE, but there is no significant difference in BLEU score.
  • BLEU is sensitive to the choice of SentencePiece vocabulary size in English to Japanese. This is probably because the vocabulary size drastically affects tokenization results in Japanese, which has no explicit spaces between words.
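
The vocabulary and compression claims above can be checked with a little arithmetic on the English to Japanese rows. The sketch below copies four rows from that table. Treating "8k" and "80k" as 8,000 and 80,000, and summing source and target tokens as a rough compression measure, are assumptions made for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    bleu_test: float
    src_tokens: float  # average source tokens per sentence
    trg_tokens: float  # average target tokens per sentence


# Values copied from the English to Japanese results table.
EN_JA = {
    "SentencePiece 8k (shared)": Row(0.2955, 30.9734, 25.0540),
    "SentencePiece(BPE) 8k (shared)": Row(0.2947, 31.7693, 25.4331),
    "Moses/KyTea 80k/80k": Row(0.2824, 21.2513, 23.2161),
    "Moses/SentencePiece 80k/8k": Row(0.2742, 21.2513, 22.9383),
}

# Vocabulary of the subword model relative to the word-based model.
vocab_ratio = 8_000 / 80_000

# Japanese (target) tokens per sentence: 8k SentencePiece vs 80k KyTea.
ja_token_ratio = (
    EN_JA["Moses/SentencePiece 80k/8k"].trg_tokens
    / EN_JA["Moses/KyTea 80k/80k"].trg_tokens
)

# Test BLEU difference between the shared 8k subword model and the word-based model.
bleu_gain = (
    EN_JA["SentencePiece 8k (shared)"].bleu_test
    - EN_JA["Moses/KyTea 80k/80k"].bleu_test
)

# Rough compression comparison: fewer total tokens per sentence pair is better.
unigram = EN_JA["SentencePiece 8k (shared)"]
bpe = EN_JA["SentencePiece(BPE) 8k (shared)"]
unigram_total = unigram.src_tokens + unigram.trg_tokens
bpe_total = bpe.src_tokens + bpe.trg_tokens

print(f"vocabulary ratio: {vocab_ratio:.0%}")
print(f"Japanese tokens, SentencePiece / KyTea: {ja_token_ratio:.3f}")
print(f"test BLEU gain over Moses/KyTea: {bleu_gain:+.4f}")
print(f"tokens per sentence pair, Unigram vs BPE: {unigram_total:.4f} vs {bpe_total:.4f}")
```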

Experiments 2 (subwording with various pre-tokenizations)

Experimental settings

We have evaluated SentencePiece segmentation with the following configurations.

  • Segmentation algorithms:

    • BPE (Byte Pair Encoding) [Sennrich et al.] (--model_type=bpe)
    • Unigram. Language-model based segmentation. (--model_type=unigram)
  • Pretokenization methods:

    • NoPretok: No pretokenization. SentencePiece is trained directly on raw sentences (--split_by_whitespace=false).
    • WsPretok: SentencePiece is trained on sentences split on whitespace (--split_by_whitespace=true). For CJK, this setting is almost equivalent to NoPretok.
    • MosesPretok: SentencePiece is trained on sentences tokenized by the Moses tokenizer. We used KyTea for Japanese and in-house segmenters for Korean and Chinese.
  • NMT parameters: (Google’s Neural Machine Translation System is applied for all experiments.)

    • 16k shared vocabulary (The same vocabulary is shared between source and target. We train a single SentencePiece model on the concatenation of raw source and target sentences.)
    • Dropout prob: 0.2
    • num nodes: 512
    • num lstms: 8
  • Evaluation metrics:

    • Case-sensitive BLEU on detokenized text with NIST scorer.
    • For CJK, the same word segmenters are applied prior to NIST scorer.
    • No detokenizer is applied for NoPretok and WsPretok, which can directly emit detokenized sentences.
    • Applied Moses detokenizer and in-house rule-based detokenizer (CJK) for MosesPretok.
  • Data sets:

    • KFTT
    • MultiUN (First 5M and next 5k/5k sentences are used for training and development/testing respectively.)
    • WMT16
    • In-house: (Used 5M parallel sentences for training)

NoPretok and WsPretok do not use any language-dependent resources. BPE+MosesPretok is almost the same configuration as that used in [Sennrich et al.] and [Wu et al.].
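
As a rough illustration of the shared-vocabulary setup described above, the sketch below concatenates raw source and target sentences into one file and builds an `spm_train` command for the NoPretok and WsPretok settings. The file names and model prefix are hypothetical, and reading "16k" as 16000 is an assumption. MosesPretok is left out because it requires running an external tokenizer first.

```python
import tempfile
from pathlib import Path
from typing import List

# Pretokenization settings mapped to the --split_by_whitespace flag.
SPLIT_BY_WHITESPACE = {"NoPretok": "false", "WsPretok": "true"}


def concat_corpus(src: Path, tgt: Path, out: Path) -> int:
    """Write source then target sentences to one file and return the line count."""
    count = 0
    with out.open("w", encoding="utf-8") as sink:
        for path in (src, tgt):
            with path.open(encoding="utf-8") as lines:
                for line in lines:
                    sink.write(line.rstrip("\n") + "\n")
                    count += 1
    return count


def train_command(corpus: Path, pretok: str, model_type: str) -> List[str]:
    """Build an spm_train argument list for one (pretokenization, algorithm) pair."""
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix=shared_{model_type}_{pretok.lower()}",  # hypothetical name
        "--vocab_size=16000",  # assumption: "16k" read as 16000
        f"--model_type={model_type}",
        f"--split_by_whitespace={SPLIT_BY_WHITESPACE[pretok]}",
    ]


# Tiny demonstration with throwaway files; real corpora would replace these.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    src_file = tmp_dir / "train.en"
    tgt_file = tmp_dir / "train.ja"
    src_file.write_text("Hello world.\n", encoding="utf-8")
    tgt_file.write_text("こんにちは世界。\n", encoding="utf-8")
    shared_file = tmp_dir / "train.shared"
    n_lines = concat_corpus(src_file, tgt_file, shared_file)
    demo_command = train_command(shared_file, "NoPretok", "bpe")

print(n_lines, demo_command[-1])
```

Training one model on the concatenated file is what makes the vocabulary shared: source and target sentences are then segmented with the same set of pieces.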

Results (BLEU scores)

| Language Pair | BPE(NoPretok) | BPE(WsPretok) | BPE(MosesPretok) | Unigram(NoPretok) | Unigram(WsPretok) | Unigram(MosesPretok) |
|---|---|---|---|---|---|---|
| KFTT en-ja | 0.2796 | 0.281 | 0.286 | 0.2806 | 0.280 | 0.2871 |
| KFTT ja-en | 0.1943 | 0.208 | 0.1967 | 0.1985 | 0.2148 | 0.198 |
| MultiUN ar-en | 0.5268 | 0.5414 | 0.5381 | 0.5317 | 0.5449 | 0.5401 |
| MultiUN en-ar | 0.4039 | 0.4147 | 0.4012 | 0.4084 | 0.4172 | 0.3991 |
| MultiUN en-zh | 0.4155 | 0.4186 | 0.395 | 0.4214 | 0.4165 | 0.399 |
| MultiUN zh-en | 0.46 | 0.4716 | 0.4806 | 0.4644 | 0.4711 | 0.4759 |
| In house en-ko | 0.178 | 0.1851 | 0.1893 | 0.1846 | 0.1872 | 0.1890 |
| In house ko-en | 0.1786 | 0.1954 | 0.1994 | 0.1845 | 0.1956 | 0.2015 |
| WMT16 cs-en | 0.1987 | 0.2252 | 0.2231 | 0.2164 | 0.2228 | 0.2238 |
| WMT16 de-en | 0.3194 | 0.3348 | 0.3374 | 0.3261 | 0.3375 | 0.3398 |
| WMT16 en-cs | 0.1607 | 0.1827 | 0.1812 | 0.1722 | 0.1778 | 0.179 |
| WMT16 en-de | 0.2847 | 0.3029 | 0.3013 | 0.2946 | 0.3000 | 0.3053 |
| WMT16 en-fi | 0.1434 | 0.1528 | 0.1499 | 0.1472 | 0.1568 | 0.1517 |
| WMT16 en-ru | 0.1884 | 0.1973 | 0.1989 | 0.19 | 0.1982 | 0.1903 |
| WMT16 fi-en | 0.1775 | 0.1867 | 0.1877 | 0.182 | 0.1882 | 0.1865 |
| WMT16 ru-en | 0.2042 | 0.2229 | 0.2194 | 0.2087 | 0.2201 | 0.2155 |

  • MosesPretok does not always improve BLEU scores. Comparable accuracy can be obtained without language-dependent resources for many language pairs.
  • Whitespace pretokenization is a reasonable choice, since it does not use language-specific resources.
  • NoPretok generally shows lower BLEU scores. Unigram is more robust than BPE when no pretokenizer is applied.
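
These observations can be checked by counting, per language pair, how often one configuration scores higher than another. The sketch below copies the BLEU scores from the results table. The short column names and the `count_wins` and `mean` helpers are illustrative names, not part of SentencePiece.

```python
COLUMNS = ("bpe_no", "bpe_ws", "bpe_moses", "uni_no", "uni_ws", "uni_moses")

# BLEU scores copied from the results table, in COLUMNS order.
ROWS = {
    "KFTT en-ja": (0.2796, 0.281, 0.286, 0.2806, 0.280, 0.2871),
    "KFTT ja-en": (0.1943, 0.208, 0.1967, 0.1985, 0.2148, 0.198),
    "MultiUN ar-en": (0.5268, 0.5414, 0.5381, 0.5317, 0.5449, 0.5401),
    "MultiUN en-ar": (0.4039, 0.4147, 0.4012, 0.4084, 0.4172, 0.3991),
    "MultiUN en-zh": (0.4155, 0.4186, 0.395, 0.4214, 0.4165, 0.399),
    "MultiUN zh-en": (0.46, 0.4716, 0.4806, 0.4644, 0.4711, 0.4759),
    "In house en-ko": (0.178, 0.1851, 0.1893, 0.1846, 0.1872, 0.1890),
    "In house ko-en": (0.1786, 0.1954, 0.1994, 0.1845, 0.1956, 0.2015),
    "WMT16 cs-en": (0.1987, 0.2252, 0.2231, 0.2164, 0.2228, 0.2238),
    "WMT16 de-en": (0.3194, 0.3348, 0.3374, 0.3261, 0.3375, 0.3398),
    "WMT16 en-cs": (0.1607, 0.1827, 0.1812, 0.1722, 0.1778, 0.179),
    "WMT16 en-de": (0.2847, 0.3029, 0.3013, 0.2946, 0.3000, 0.3053),
    "WMT16 en-fi": (0.1434, 0.1528, 0.1499, 0.1472, 0.1568, 0.1517),
    "WMT16 en-ru": (0.1884, 0.1973, 0.1989, 0.19, 0.1982, 0.1903),
    "WMT16 fi-en": (0.1775, 0.1867, 0.1877, 0.182, 0.1882, 0.1865),
    "WMT16 ru-en": (0.2042, 0.2229, 0.2194, 0.2087, 0.2201, 0.2155),
}


def count_wins(a: str, b: str) -> int:
    """Number of language pairs where configuration a scores strictly higher than b."""
    i, j = COLUMNS.index(a), COLUMNS.index(b)
    return sum(1 for scores in ROWS.values() if scores[i] > scores[j])


def mean(column: str) -> float:
    """Unweighted mean BLEU of one configuration across all language pairs."""
    i = COLUMNS.index(column)
    return sum(scores[i] for scores in ROWS.values()) / len(ROWS)


print("Unigram(NoPretok) > BPE(NoPretok):", count_wins("uni_no", "bpe_no"), "of", len(ROWS))
print("BPE(WsPretok) > BPE(NoPretok):", count_wins("bpe_ws", "bpe_no"), "of", len(ROWS))
print("BPE(MosesPretok) > BPE(WsPretok):", count_wins("bpe_moses", "bpe_ws"), "of", len(ROWS))
for column in COLUMNS:
    print(f"{column}: mean BLEU {mean(column):.4f}")
```

On these rows, Unigram(NoPretok) scores higher than BPE(NoPretok) in all 16 language pairs, while BPE(MosesPretok) beats BPE(WsPretok) in only 7. The unweighted mean is a crude summary, since the language pairs differ in data set and difficulty.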