LightRNN

This is the official implementation for LightRNN: Memory and Computation-Efficient Recurrent Neural Networks in CNTK.

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (possibly beyond the memory capacity of a GPU device) and its training/inference becomes very inefficient. LightRNN addresses this challenge using a 2-Component (2C) shared embedding for word representations. It allocates every word in the vocabulary into a table, each row of which is associated with a vector, and each column with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, far fewer than the $|V|$ vectors required by existing approaches. As a result, LightRNN significantly reduces the model size and speeds up training/inference for corpora with large vocabularies.
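The 2C shared embedding can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: the mapping from word id to table cell via `divmod` is an assumption (in LightRNN the allocation is initialized randomly and then learned), and all names below are made up for the sketch.

```python
import numpy as np

vocab_size = 10000
table_size = int(np.ceil(np.sqrt(vocab_size)))  # 100 rows x 100 columns
embed_dim = 512

rng = np.random.default_rng(0)
row_vectors = rng.standard_normal((table_size, embed_dim))  # shared by each row
col_vectors = rng.standard_normal((table_size, embed_dim))  # shared by each column

def embed(word_id):
    """Represent a word by the (row vector, column vector) of its table cell."""
    row, col = divmod(word_id, table_size)  # illustrative fixed mapping
    return row_vectors[row], col_vectors[col]

# Only 2 * table_size = 200 vectors cover all 10000 words,
# versus 10000 vectors for a conventional embedding table.
r, c = embed(4242)
```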

For more details, please refer to the NIPS 2016 paper: https://arxiv.org/abs/1610.09893

Requirements

  • CNTK
  • Python 2.7 or later.

For the multi-GPU version:

  • An MPI implementation
  • mpi4py (it is recommended to build mpi4py from source so that it is compatible with your MPI installation)

Details

LightRNN/

The folder LightRNN contains the main implementation of LightRNN.

  • converter.py Implements functions to process the vocabulary and randomly initialize the word allocation table.
  • data_reader.py An overridden UserMinibatchSource that maps text to streams.
  • lightrnn.py The computation graph of LightRNN.
  • reallocate.py Word reallocation implemented in Python.
  • preprocess.py The preprocessing procedure of LightRNN.
    • Options
      • -datadir <string> (required), Path to the data directory; put all the corpus files here.
      • -outputdir <string> (required), Path to save output files.
      • -vocab_file <string> (default: vocab.txt), Save the vocabulary to this file in the outputdir.
      • -alloc_file <string> (default: word-0.location), Save the word allocation table to this file in the outputdir.
      • -vocabsize <int> (default: 10000), Vocabulary size.
      • -seed <int> (default: 0), Random seed.
  • train.py The training procedure of LightRNN
    • Data options
      • -datadir <string> (required), Path to the data, should contain train_file, valid_file and test_file.
      • -train_file <string> (default: train.txt), The training data.
      • -valid_file <string> (default: valid.txt), The validation data.
      • -test_file <string> (default: test.txt), The test data.
      • -vocabdir <string> (default: WordInfo), Path to the word allocation table and vocabulary.
      • -vocab_file <string> (required), The (input) vocabulary file in the vocabdir.
      • -alloc_file <string> (default: word-0.location), The (input) file of word allocation table in the vocabdir.
      • -outputdir <string> (default: Models), Path to save LightRNN models.
      • -pre_model <string> (default: None), Continue training by loading this existing model file. By default, we train from scratch.
    • Model options
      • -embed <int> (default: 512), Dimension of word embedding.
      • -nhid <int> (default: 512), Dimension of hidden layer.
      • -layer <int> (default: 2), Number of layers.
      • -dropout <float> (default: 0.2), Dropout rate.
      • -lr <float> (default: 0.15), Learning rate.
      • -optim <string> (accepted: sgd, adam, adagrad, default: sgd), The optimization method.
      • -seqlength <int> (default: 32), Number of timesteps to unroll.
      • -vocabsize <int> (default: 10000), Vocabulary size.
      • -batchsize <int> (default: 20), Minibatch size.
    • Other options
      • -epochs <list> (default: None), Number of epochs in each round; e.g., -epochs 12 13 trains two rounds of 12 and 13 epochs.
      • -freq <int> (default: 100), Report status every this many iterations.
      • -save <string> (default: model.dnn), Save the model to the file with this suffix.
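Conceptually, the random table initialization performed by preprocess.py looks like the sketch below. This is an illustration under assumptions, not the script's actual code: the function name and the word → (row, column) dictionary format are made up here.

```python
import math
import random

def init_allocation(vocab, seed=0):
    """Randomly place each word into a ceil(sqrt(|V|)) x ceil(sqrt(|V|)) table."""
    n = int(math.ceil(math.sqrt(len(vocab))))
    positions = list(range(n * n))
    random.Random(seed).shuffle(positions)  # deterministic given the seed
    # word -> (row, column) in the allocation table
    return {w: divmod(positions[i], n) for i, w in enumerate(vocab)}

alloc = init_allocation(["the", "of", "and", "to"], seed=0)
```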

Run the example under LightRNN as follows:

Preprocess

python preprocess.py -datadir ../PTB/Data -outputdir ../PTB/Allocation -vocab_file vocab.txt -alloc_file word-0.location -vocabsize 10000 -seed 0

This generates the vocabulary file vocab.txt and a random initial word allocation table under ../PTB/Allocation.

Train

python train.py -datadir ../PTB/Data -vocab_file ../PTB/Allocation/vocab.txt -vocabdir ../PTB/Allocation -vocabsize 10000 -epochs 12 13 -nhid 1000 -embed 1000 -optim adam -lr 0.1 -batchsize 20 -layer 2 -dropout 0.5

This command trains a 2-layer LightRNN model with 1000 hidden units and an embedding dimension of 1000. The training procedure consists of two rounds, with 12 epochs in the first round and 13 epochs in the second round. The word allocation table is optimized and updated after every round.
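The between-round reallocation can be viewed as a minimum-cost assignment: for each word, the losses accumulated in the previous round estimate how expensive each table cell would be, and words are moved to cheaper cells. The paper solves this with a minimum-cost matching; the greedy pass below is only an illustrative approximation with made-up names and data, not the repository's algorithm.

```python
import numpy as np

def greedy_reallocate(loss):
    """loss: (num_words, num_cells) matrix of estimated per-cell losses.
    Returns a word -> cell assignment, one word per cell."""
    order = np.argsort(loss.min(axis=1))       # most "choosy" words first
    taken = set()
    assign = {}
    for w in order:
        for cell in np.argsort(loss[w]):       # cheapest still-free cell for w
            if int(cell) not in taken:
                assign[int(w)] = int(cell)
                taken.add(int(cell))
                break
    return assign

loss = np.array([[1.0, 5.0],
                 [2.0, 2.5]])
assign = greedy_reallocate(loss)
```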

Multi-GPU

mpiexec -n 2 python train.py -datadir ../PTB/Data -vocab_file ../PTB/Allocation/vocab.txt -vocabdir ../PTB/Allocation -vocabsize 10000 -epochs 12 13 -nhid 1000 -embed 1000 -optim adam -lr 0.1 -batchsize 20 -layer 2 -dropout 0.5

This command trains a LightRNN model on two GPUs; specify the number of GPUs with mpiexec -n [gpus].

PTB/

This folder contains an example based on the PTB dataset. Use download_data.py under Data/ to download the data, and generate.py under Allocation/ to generate a vocabulary file and a random allocation table.

Generate C++ dynamic library

We provide two implementations of word allocation, one in Python and one in C++. If the C++ dynamic library is not available, the Python implementation is used. The Python version is five times or more slower than the C++ version, so the C++ version is preferred.
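A common pattern for this kind of optional native acceleration is to try loading the shared library and fall back to pure Python if it is absent. The sketch below illustrates the pattern only; the library name is hypothetical and the repository's actual loading code may differ.

```python
import ctypes

def load_reallocator(lib_name="liblightrnn_realloc.so"):
    """Return the native library if it was built, else None (Python fallback)."""
    try:
        return ctypes.CDLL(lib_name)   # fast C++ path (hypothetical name)
    except OSError:
        return None                    # fall back to the reallocate.py path

lib = load_reallocator()
backend = "C++" if lib is not None else "Python"
```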

For Linux User

Run make in this directory to build the library.

For Windows User

Open the project under the DLL folder in Visual Studio, build it, and put the resulting DLL file under the LightRNN folder.

Experiment

ACL-French

The ACLW French corpus contains about 56M tokens, with a vocabulary of 136912 words. The parameters used in the experiment are as follows.

| Parameter Name | Value |
| --- | --- |
| Vocabulary size | 136912 |
| Hidden dim | 1000 |
| Embed dim | 1000 |
| Layer | 2 |
| BatchSize | 100 |
| seqlength | 32 |
| Dropout | 0.5 |
| Learning rate | 0.5 |
| Optim | adam |
| GPU Type | GeForce GTX Titan X |
| GPU Number | 1 |
| Speed | 12080 tokens/s |
| Time/Epoch | 1.28 h |
| Epochs | 10, 10 |

*(Figure: Valid/Test PPL)*

One Billion Words

The One Billion Words corpus contains about 799M tokens, with a vocabulary of 793471 words. We split the corpus into 32-token sequences and trained on a single GeForce GTX Titan X GPU.

Performance

| Embed dim | Hidden dim | Layer | Batch size | Model size (Bytes) | Tokens/second (1 GPU / 2 GPU) | GPU memory | Time/epoch (1 GPU / 2 GPU) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | 500 | 2 | 100 | 5.7M | 24615 / 40000 | 901 MB | 9.01 h / 5.54 h |
| 500 | 500 | 2 | 200 | 5.7M | 32000 / 49230 | 1558 MB | 6.93 h / 4.50 h |
| 500 | 500 | 2 | 500 | 5.7M | 32000 / 56140 | 3528 MB | 6.93 h / 3.95 h |
| 1000 | 1000 | 2 | 100 | 19M | 12300 / 19768 | 1640 MB | 18.04 h / 11.22 h |
| 1000 | 1000 | 2 | 200 | 19M | 13000 / 24150 | 2858 MB | 17.07 h / 9.19 h |
| 1000 | 1000 | 2 | 500 | 19M | 14280 / 28268 | 6526 MB | 15.54 h / 7.85 h |
| 1500 | 1500 | 2 | 100 | 41M | 6900 / 11034 | 2408 MB | 32.16 h / 20.11 h |
| 1500 | 1500 | 2 | 200 | 41M | 7250 / 13061 | 4238 MB | 30.61 h / 16.99 h |

We can achieve 122 perplexity (on the test set) after one epoch of training with a warm-start word allocation.
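As a reminder of what the reported number means: perplexity is the exponential of the average per-token negative log-likelihood, and under LightRNN's 2C factorization P(word) = P(row) * P(column), so the per-token NLL is the sum of the row and column NLLs. The numbers below are made up purely to illustrate the computation.

```python
import math

row_nll = [2.1, 2.4, 2.0]   # -log P(row) per token (illustrative values)
col_nll = [2.7, 2.3, 2.9]   # -log P(column) per token (illustrative values)

# Per-token NLL is the sum of the two components' NLLs.
total_nll = [r + c for r, c in zip(row_nll, col_nll)]
ppl = math.exp(sum(total_nll) / len(total_nll))
```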

ClueWeb09 Data

The ClueWeb09 dataset contains over 177 billion tokens. We select the 10240000 most frequent words as the vocabulary, covering 99.057% of all tokens. We randomly sampled 1 GB of data each for validation and test. The model parameters are:

| Parameter Name | Value |
| --- | --- |
| Vocabulary size | 10240000 |
| Hidden dim | 512 |
| Embed dim | 512 |
| Layer | 2 |
| BatchSize | 625 |
| seqlength | 32 |
| Dropout | 0.5 |
| Learning rate | 0.01 |
| Optim | adam |
| GPU Type | GeForce GTX Titan X |
| GPU Number | 4 |

We achieve a training speed of 77873 tokens/s with 4 GPUs. It takes about 630 hours (roughly 26 days) to finish one epoch.
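The epoch time follows directly from the corpus size and throughput; the arithmetic below checks that 177 billion tokens at 77873 tokens/s works out to roughly 630 hours.

```python
tokens_per_epoch = 177e9        # corpus size (tokens)
tokens_per_second = 77873       # measured throughput on 4 GPUs

seconds = tokens_per_epoch / tokens_per_second
hours = seconds / 3600          # about 631 hours
days = hours / 24               # about 26 days
```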

*(Figure: Train/Valid loss)*

Citation

If you find LightRNN useful in your work, please cite the paper as follows:

@inproceedings{LiNIPS16LightRNN,
    Author = {Xiang Li and Tao Qin and Jian Yang and Tie-Yan Liu},
    Title = {LightRNN: Memory and Computation-Efficient Recurrent Neural Networks},
    Booktitle = {Advances in Neural Information Processing Systems ({NIPS})},
    Year = {2016}
}