Language modeling

Language modeling is the task of predicting the next word or character in a document.

* indicates models using dynamic evaluation, where, at test time, a model may adapt to tokens it has already seen in order to improve performance on subsequent tokens (Mikolov et al., 2010; Krause et al., 2017).
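
As a rough illustration only (not the exact recipe of any listed paper), dynamic evaluation amounts to scoring a test stream segment by segment and taking a gradient step on each segment after it has been scored, so later segments benefit from what was already observed. The model interface, loss function, segment length, and learning rate below are placeholder assumptions.

```python
import torch


def dynamic_evaluation(model, loss_fn, token_ids, seg_len=32, lr=1e-4):
    """Sketch of dynamic evaluation with a hypothetical next-token model.

    token_ids: 1-D LongTensor holding the test stream.
    loss_fn:   e.g. torch.nn.CrossEntropyLoss() over (seg, vocab) logits.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    model.train()  # gradients are needed even though we are "evaluating"

    for start in range(0, token_ids.numel() - 1, seg_len):
        inputs = token_ids[start:start + seg_len]
        targets = token_ids[start + 1:start + seg_len + 1]
        inputs = inputs[:targets.numel()]

        logits = model(inputs.unsqueeze(0))        # assumed shape (1, seg, vocab)
        nll = loss_fn(logits.squeeze(0), targets)  # mean NLL over the segment

        total_nll += nll.item() * targets.numel()  # score the segment first ...
        total_tokens += targets.numel()

        optimizer.zero_grad()
        nll.backward()                             # ... then adapt on what was just seen
        optimizer.step()

    return float(torch.exp(torch.tensor(total_nll / total_tokens)))  # perplexity
```

Published variants differ in the exact update rule (for example, RMS-scaled gradients and decay back toward the original weights in Krause et al.), but the score-then-update loop over segments is the common core; scoring before updating ensures the reported perplexity never peeks at the tokens being predicted.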

Word Level Models

Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank, as pre-processed by Mikolov et al. (2011). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing, words were lower-cased, numbers were replaced with N, newlines were replaced with <eos>, and all other punctuation was removed. The vocabulary is the 10k most frequent words, with the remaining tokens replaced by an <unk> token. Models are evaluated on perplexity, the exponential of the average negative per-word log-probability (lower is better).
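
Concretely, perplexity is computed from the probabilities the model assigns to each test token; the per-token probabilities below are invented purely for illustration.

```python
import math

# Invented per-token probabilities p(w_i | w_<i) for a 4-token test text.
token_probs = [0.2, 0.05, 0.1, 0.4]

avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)
print(round(perplexity, 1))  # ~7.1; lower is better
```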

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ----- | --------------------- | --------------- | ---------------- | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022) | 42.9 | 42.9 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 44.9 | 44.8 | 24M | Mogrifier LSTM | Official |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 46.63 | 46.01 | 22M | Improving Neural Language Modeling via Adversarial Training | Official |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | FRAGE: Frequency-Agnostic Word Representation | Official |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| Mogrifier RLSTM (Melis, 2022) | 48.9 | 47.9 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM (Melis et al., 2019) | 51.4 | 50.1 | 24M | Mogrifier LSTM | Official |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | Dynamic Evaluation of Neural Sequence Models | Official |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 53.79 | 52.00 | 23M | Partially Shuffling the Training Data to Improve Language Models | Official |
| AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
| Trellis Network (Bai et al., 2019) | - | 54.19 | 34M | Trellis Networks for Sequence Modeling | Official |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 56.44 | 54.33 | 22M | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
| AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| Transformer-XL (Dai et al., 2018) under review | 56.72 | 54.52 | 24M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 58.9 | 56.8 | 24M | Fraternal dropout | Official |
| AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | Regularizing and Optimizing LSTM Language Models | Official |

WikiText-2

WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ----- | --------------------- | --------------- | ---------------- | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022) | 39.3 | 38.0 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 40.2 | 38.6 | 35M | Mogrifier LSTM | Official |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 40.27 | 38.65 | 35M | Improving Neural Language Modeling via Adversarial Training | Official |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | FRAGE: Frequency-Agnostic Word Representation | Official |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | Dynamic Evaluation of Neural Sequence Models | Official |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
| Mogrifier RLSTM (Melis, 2022) | 56.7 | 55.0 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM (Melis et al., 2019) | 57.3 | 55.1 | 35M | Mogrifier LSTM | Official |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 60.16 | 57.85 | 37M | Partially Shuffling the Training Data to Improve Language Models | Official |
| AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 66.8 | 64.1 | 34M | Fraternal dropout | Official |
| AWD-LSTM + ATOI (Kocher et al., 2019) | 67.47 | 64.73 | 33M | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | Regularizing and Optimizing LSTM Language Models | Official |

WikiText-103

The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ----- | --------------------- | --------------- | ---------------- | -------------- | ---- |
| Routing Transformer (Roy et al., 2020)* arxiv preprint | - | 15.8 | - | Efficient Content-Based Sparse Attention with Routing Transformers | - |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 15.8 | 16.4 | 257M | Dynamic Evaluation of Transformer Language Models | Official |
| Compressive Transformer (Rae et al., 2019)* arxiv preprint | 16.0 | 17.1 (16.1 with basic dynamic evaluation) | ~257M | Compressive Transformers for Long-Range Sequence Modelling | - |
| SegaTransformer-XL (Bai et al., 2020) | - | 17.1 | 257M | Segatron: Segment-Aware Transformer for Language Modeling and Understanding | Official |
| Transformer-XL Large (Dai et al., 2018) under review | 17.7 | 18.3 | 257M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | Adaptive Input Representations for Neural Language Modeling | Link |
| TaLK Convolutions (Lioutas et al., 2020) | - | 23.3 | 240M | Time-aware Large Kernel Convolutions | Official |
| Transformer-XL Standard (Dai et al., 2018) under review | 23.1 | 24.0 | 151M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| AdvSoft + 4 layer QRNN + dynamic eval (Wang et al., 2019) | 27.2 | 28.0 | - | Improving Neural Language Modeling via Adversarial Training | Official |
| LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | - | Fast Parametric Learning with Activation Memorization | - |
| Trellis Network (Bai et al., 2019) | - | 30.35 | 180M | Trellis Networks for Sequence Modeling | Official |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 31.92 | 32.85 | - | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
| LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | - | Fast Parametric Learning with Activation Memorization | - |
| LSTM (Rae et al., 2018) | 36.0 | 36.4 | - | Fast Parametric Learning with Activation Memorization | - |
| Gated CNN (Dauphin et al., 2016) | - | 37.2 | - | Language modeling with gated convolutional networks | - |
| Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | - | Improving Neural Language Models with a Continuous Cache | Link |
| Temporal CNN (Bai et al., 2018) | - | 45.2 | - | Convolutional sequence modeling revisited | - |
| LSTM (Grave et al., 2017) | - | 48.7 | - | Improving Neural Language Models with a Continuous Cache | Link |

1B Words / Google Billion Word benchmark

The One-Billion Word benchmark is a large dataset derived from a news-commentary site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. Importantly, sentences in this corpus are shuffled, so context beyond the sentence is limited.

| Model | Test perplexity | Number of params | Paper / Source | Code |
| ----- | --------------- | ---------------- | -------------- | ---- |
| Transformer-XL Large (Dai et al., 2018) under review | 21.8 | 0.8B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer-XL Base (Dai et al., 2018) under review | 23.5 | 0.46B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | Adaptive Input Representations for Neural Language Modeling | Link |
| 10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ensemble | 23.7 | 43B? | Exploring the Limits of Language Modeling | Official |
| Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | Adaptive Input Representations for Neural Language Modeling | Link |
| Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | Exploring the Limits of Language Modeling | - |
| Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | Language Modeling with Gated Convolutional Networks | - |
| BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | Factorization tricks for LSTM networks | Official |
| BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | Factorization tricks for LSTM networks | Official |
| BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | Factorization tricks for LSTM networks | Official |

Character Level Models

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
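
The character-level results below are reported in bits per character (BPC): the average negative base-2 log-probability the model assigns to each character (or byte), so lower is better. A minimal illustration with invented per-character probabilities:

```python
import math

# Invented probabilities a character-level model assigns to each character of a test string.
char_probs = [0.5, 0.25, 0.9, 0.6]

bpc = -sum(math.log2(p) for p in char_probs) / len(char_probs)
print(round(bpc, 2))  # ~0.97 bits per character
```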

| Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
| ----- | ------------------------ | ---------------- | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022) | 0.935 | 96M | Circling Back to Recurrent Models of Language | Official |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 0.94 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
| Compressive Transformer (Rae et al., 2019) arxiv preprint | 0.97 | - | Compressive Transformers for Long-Range Sequence Modelling | - |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 0.988 | 96M | Mogrifier LSTM | Official |
| 24-layer Transformer-XL (Dai et al., 2018) under review | 0.99 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Longformer Large (Beltagy, Peters, and Cohan; 2020) | 0.99 | 102M | Longformer: The Long-Document Transformer | Official |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.00 | 41M | Longformer: The Long-Document Transformer | Official |
| 18-layer Transformer-XL (Dai et al., 2018) under review | 1.03 | 88M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Mogrifier RLSTM (Melis, 2022) | 1.042 | 96M | Circling Back to Recurrent Models of Language | Official |
| 12-layer Transformer-XL (Dai et al., 2018) under review | 1.06 | 41M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | Character-Level Language Modeling with Deeper Self-Attention | - |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models | Official |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | - |
| Mogrifier LSTM (Melis et al., 2019) | 1.122 | 96M | Mogrifier LSTM | Official |
| 3-layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | Multiplicative LSTM for sequence modelling | Official |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks | Official |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks | Official |

Text8

The text8 dataset is also derived from Wikipedia text, but with all XML removed and the text lowercased, leaving only the 26 letters of English plus spaces.

| Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
| ----- | ------------------------ | ---------------- | -------------- | ---- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 1.038 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
| Mogrifier RLSTM + dynamic eval (Melis, 2022) | 1.044 | 96M | Circling Back to Recurrent Models of Language | Official |
| Transformer-XL Large (Dai et al., 2018) under review | 1.08 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Mogrifier RLSTM (Melis, 2022) | 1.096 | 96M | Circling Back to Recurrent Models of Language | Official |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.10 | 41M | Longformer: The Long-Document Transformer | Official |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | Character-Level Language Modeling with Deeper Self-Attention | - |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | Character-Level Language Modeling with Deeper Self-Attention | - |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models | Official |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling | Official |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks | - |
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization | Official |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling | Official |

Penn Treebank

The vocabulary of the words in the character-level dataset is limited to 10,000, the same vocabulary as used in the word-level dataset. This vastly simplifies character-level language modeling, since character transitions are limited to those found within the limited word-level vocabulary.

| Model | Bits per character (BPC) | Number of params | Paper / Source | Code |
| ----- | ------------------------ | ---------------- | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022) | 1.061 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.083 | 24M | Mogrifier LSTM | Official |
| Mogrifier RLSTM (Melis, 2022) | 1.096 | 24M | Circling Back to Recurrent Models of Language | Official |
| Mogrifier LSTM (Melis et al., 2019) | 1.120 | 24M | Mogrifier LSTM | Official |
| Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | Trellis Networks for Sequence Modeling | Official |
| 3-layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| 6-layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks | Official |
| FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks | Official |
| NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning | - |
| 2-layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks | - |

Multilingual Wikipedia Corpus

The character-based MWC dataset is a collection of Wikipedia pages available in a number of languages. Markup and rare characters were removed, but otherwise no preprocessing was applied.

MWC English in the single text, large setting.

| Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
| ----- | -------------- | -------- | ---------------- | -------------- | ---- |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.200 | 1.187 | 24M | Mogrifier LSTM | Official |
| Mogrifier LSTM (Melis et al., 2019) | 1.312 | 1.298 | 24M | Mogrifier LSTM | Official |
| HCLM with Cache (Kawakami et al. 2017) | 1.591 | 1.538 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | - |
| LSTM (Kawakami et al. 2017) | 1.793 | 1.736 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | - |

MWC Finnish in the single text, large setting.

| Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
| ----- | -------------- | -------- | ---------------- | -------------- | ---- |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.202 | 1.191 | 24M | Mogrifier LSTM | Official |
| Mogrifier LSTM (Melis et al., 2019) | 1.327 | 1.313 | 24M | Mogrifier LSTM | Official |
| HCLM with Cache (Kawakami et al. 2017) | 1.754 | 1.711 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | - |
| LSTM (Kawakami et al. 2017) | 1.943 | 1.913 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | - |

Go back to the README