Examples/Text/LightRNN/README.md
This is the official CNTK implementation of LightRNN: Memory and Computation-Efficient Recurrent Neural Networks.
Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (possibly beyond the memory capacity of a GPU device) and its training/inference becomes very inefficient. LightRNN addresses this challenge using a 2-Component (2C) shared embedding for word representations. It allocates every word in the vocabulary into a table, where each row is associated with a vector and each column with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, far fewer than the $|V|$ vectors required by existing approaches. The LightRNN algorithm significantly reduces the model size and speeds up the training/inference process for corpora with large vocabularies.
For more details, please refer to the NIPS 2016 paper (https://arxiv.org/abs/1610.09893).
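To make the 2C idea concrete, here is a minimal NumPy sketch (an illustration only, not the CNTK implementation in this folder): a vocabulary is laid out in a roughly square table, and every word is represented by the shared vector of its row plus the shared vector of its column. The trivial row-major placement below merely stands in for an initial allocation; LightRNN refines the allocation during training.

```python
# Minimal NumPy sketch of 2-Component (2C) shared embeddings -- illustration only.
import numpy as np

vocab_size = 10000                                   # |V|
table_size = int(np.ceil(np.sqrt(vocab_size)))       # ~sqrt(|V|) rows and columns
embed_dim = 512

# Only 2 * sqrt(|V|) shared vectors instead of |V| word vectors.
row_embed = np.random.randn(table_size, embed_dim).astype(np.float32) * 0.01
col_embed = np.random.randn(table_size, embed_dim).astype(np.float32) * 0.01

def word_position(word_id):
    """Trivial row-major placement of word_id in the allocation table."""
    return word_id // table_size, word_id % table_size

def word_vectors(word_id):
    """A word is jointly represented by its shared row vector and column vector."""
    r, c = word_position(word_id)
    return row_embed[r], col_embed[c]

row_vec, col_vec = word_vectors(1234)
print(row_vec.shape, col_vec.shape)                  # (512,) (512,)
print("shared vectors:", 2 * table_size, "vs. full embedding rows:", vocab_size)
```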
For the multi-GPU version, MPI is required (training is launched with mpiexec; see the Multi-GPU section below).
The folder LightRNN contains the main implementation of LightRNN.
Parameters for preprocess.py:

- `-datadir <string>` (required), Path to the data; put all the corpus files here.
- `-outputdir <string>` (required), Path to save output files.
- `-vocab_file <string>` (default: vocab.txt), Save the vocabulary to this file in the outputdir.
- `-alloc_file <string>` (default: word-0.location), Save the word allocation table to this file in the outputdir.
- `-vocabsize <int>` (default: 10000), Vocabulary size.
- `-seed <int>` (default: 0), Random seed.

Parameters for train.py:

- `-datadir <string>` (required), Path to the data; should contain train_file, valid_file and test_file.
- `-train_file <string>` (default: train.txt), The training data.
- `-valid_file <string>` (default: valid.txt), The validation data.
- `-test_file <string>` (default: test.txt), The test data.
- `-vocabdir <string>` (default: WordInfo), Path to the word allocation table and vocabulary.
- `-vocab_file <string>` (required), The (input) vocabulary file in the vocabdir.
- `-alloc_file <string>` (default: word-0.location), The (input) word allocation table file in the vocabdir.
- `-outputdir <string>` (default: Models), Path to save LightRNN models.
- `-pre_model <string>` (default: None), Continue training from this existing model file. By default, training starts from scratch.
- `-embed <int>` (default: 512), Dimension of the word embedding.
- `-nhid <int>` (default: 512), Dimension of the hidden layer.
- `-layer <int>` (default: 2), Number of layers.
- `-dropout <float>` (default: 0.2), Dropout rate.
- `-lr <float>` (default: 0.15), Learning rate.
- `-optim <string>` (accepted: sgd, adam, adagrad; default: sgd), The optimization method.
- `-seqlength <int>` (default: 32), Number of timesteps to unroll.
- `-vocabsize <int>` (default: 10000), Vocabulary size.
- `-batchsize <int>` (default: 20), Minibatch size.
- `-epochs <list>` (default: None), Number of epochs in every round.
- `-freq <int>` (default: 100), Report status every this many iterations.
- `-save <string>` (default: model.dnn), Save the model to a file with this suffix.

Run the example under LightRNN as follows:
Preprocess
python preprocess.py -datadir ../PTB/Data -outputdir ../PTB/Allocation -vocab_file vocab.txt -alloc_file word-0.location -vocabsize 10000 -seed 0
This generates the sampled vocabulary, saved as vocab.txt, and a random initial word allocation table under ../PTB/Allocation.
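Conceptually, the random initial allocation just assigns each vocabulary word to a distinct cell of a roughly sqrt(|V|) x sqrt(|V|) table. A toy sketch of that idea (the actual on-disk format written by preprocess.py may differ):

```python
# Toy sketch of a random initial word allocation -- the format of the
# word-0.location file produced by preprocess.py may differ from this.
import math
import random

vocab = ["the", "of", "and", "to", "<unk>"]             # toy vocabulary
table_size = int(math.ceil(math.sqrt(len(vocab))))      # here: 3 (a 3x3 table)

cells = list(range(table_size * table_size))
random.seed(0)
random.shuffle(cells)                                    # random initial allocation

# word -> (row, column) in the allocation table
allocation = {w: (c // table_size, c % table_size) for w, c in zip(vocab, cells)}
for word, (row, col) in allocation.items():
    print(word, row, col)
```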
Train
python train.py -datadir ../PTB/Data -vocab_file ../PTB/Allocation/vocab.txt -vocabdir ../PTB/Allocation -vocabsize 10000 -epochs 12 13 -nhid 1000 -embed 1000 -optim adam -lr 0.1 -batchsize 20 -layer 2 -dropout 0.5
This command will train a 2-layer LightRNN model with 1000 hidden units and an embedding dimension of 1000. The training procedure contains two rounds, with 12 epochs in the first round and 13 epochs in the second round. The word allocation table is optimized and updated after every round.
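The reallocation at the end of a round can be viewed as an assignment problem: given an (approximate) loss for placing each word in each table cell, words are reassigned to cells so that the total loss is minimized (at scale this is solved approximately). A tiny illustrative sketch using SciPy's exact solver on made-up losses:

```python
# Toy illustration of per-round word reallocation as a min-cost assignment.
# The loss values below are made up; LightRNN derives them from the previous
# round and uses an approximate solver for large vocabularies.
import numpy as np
from scipy.optimize import linear_sum_assignment

table_size = 2                                  # a 2x2 table -> 4 cells, 4 toy words
loss = np.array([[0.2, 1.0, 0.9, 0.8],          # loss[w][c] = loss of word w in cell c
                 [1.1, 0.3, 0.7, 0.9],
                 [0.6, 0.8, 0.1, 1.2],
                 [0.9, 0.5, 1.0, 0.2]])

words, cells = linear_sum_assignment(loss)      # minimize total loss
new_allocation = {int(w): (int(c) // table_size, int(c) % table_size)
                  for w, c in zip(words, cells)}
print(new_allocation)                           # word id -> (row, column)
```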
Multi-GPU
mpiexec -n 2 python train.py -datadir ../PTB/Data -vocab_file ../PTB/Allocation/vocab.txt -vocabdir ../PTB/Allocation -vocabsize 10000 -epochs 12 13 -nhid 1000 -embed 1000 -optim adam -lr 0.1 -batchsize 20 -layer 2 -dropout 0.5
This command will train a LightRNN model on two GPUs; you can specify the number of GPUs with mpiexec -n [gpus].
This folder contains an example based on the PTB dataset. You can use download_data.py under Data/ to download the data and generate.py under Allocation/ to generate a vocabulary file and a random word allocation table.
We provide two implementations of word allocation, one in Python and one in C++. If the C++ dynamic library is not available, the Python implementation is used. The Python version is five times or more slower than the C++ version, so the C++ version is preferred. A sketch of the loading fallback appears after the build instructions below.
For Linux Users
Run the Makefile in this directory with make.
For Windows Users
Use Visual Studio to open the project under the DLL folder and build it, then put the resulting DLL file under the LightRNN folder.
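As a rough sketch of the fallback behaviour described above, loading code along the following lines would prefer the compiled library and otherwise fall back to the Python implementation. The library file name here is hypothetical; see the code under LightRNN/ for the actual loading logic.

```python
# Hedged sketch of "use the C++ library if it was built, else fall back to Python".
# The library file name below is hypothetical.
import ctypes
import os

def load_allocation_library(libpath="./wordallocation.so"):
    """Return the C++ allocation library if present, else None (Python fallback)."""
    if os.path.exists(libpath):
        try:
            return ctypes.CDLL(libpath)
        except OSError:
            pass
    return None

lib = load_allocation_library()
print("C++ backend" if lib is not None else "Python fallback")
```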
The ACLW-French corpus contains about 56M tokens, with a vocabulary of 136912 words. The parameters used in this experiment are listed below.
| Parameter Name | Value |
|---|---|
| Vocabulary size | 136912 |
| Hidden dim | 1000 |
| Embed dim | 1000 |
| Layer | 2 |
| BatchSize | 100 |
| seqlength | 32 |
| Dropout | 0.5 |
| Learning rate | 0.5 |
| Optim | adam |
| GPU Type | GeForce GTX Titan X |
| GPU Number | 1 |
| Speed (tokens/second) | 12080 |
| Time/Epoch | 1.28 h |
| Epochs | 10,10 |
[Figure: Valid/Test PPL]
The One Billion Words corpus contains about 799M tokens, with a vocabulary of 793471 words. We parsed the corpus into 32-token sequences and used GeForce GTX Titan X GPUs for training.
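"Parsed into 32-token sequences" just means the token stream is cut into fixed-length chunks; a trivial sketch (the actual data pipeline lives in the reader code of this example):

```python
# Trivial sketch of cutting a token stream into fixed-length training sequences.
def to_sequences(tokens, seqlength=32):
    """Yield consecutive, non-overlapping chunks of `seqlength` tokens."""
    for start in range(0, len(tokens) - seqlength + 1, seqlength):
        yield tokens[start:start + seqlength]

sample = list(range(100))                     # stand-in for a tokenized corpus
print(sum(1 for _ in to_sequences(sample)))   # 3 full 32-token sequences
```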
Performance
| Embed dimension | Hidden dimension | Layer | Batch size | Model size (Bytes) | Tokens/second (1 GPU/2 GPU) | GPU memory | Time/epoch (1 GPU/2 GPU) |
|---|---|---|---|---|---|---|---|
| 500 | 500 | 2 | 100 | 5.7M | 24615/40000 | 901MB | 9.01 h/5.54 h |
| 500 | 500 | 2 | 200 | 5.7M | 32000/49230 | 1558MB | 6.93 h/4.50 h |
| 500 | 500 | 2 | 500 | 5.7M | 32000/56140 | 3528MB | 6.93 h/3.95 h |
| 1000 | 1000 | 2 | 100 | 19M | 12300/19768 | 1640MB | 18.04 h/11.22 h |
| 1000 | 1000 | 2 | 200 | 19M | 13000/24150 | 2858MB | 17.07 h/9.19 h |
| 1000 | 1000 | 2 | 500 | 19M | 14280/28268 | 6526MB | 15.54 h/7.85 h |
| 1500 | 1500 | 2 | 100 | 41M | 6900/11034 | 2408MB | 32.16 h/20.11 h |
| 1500 | 1500 | 2 | 200 | 41M | 7250/13061 | 4238MB | 30.61 h/16.99 h |
We can achieve a perplexity of 122 on the test set after one epoch of training with a warm-start word allocation.
The ClueWeb09 dataset contains over 177 billion tokens. We select the top 10240000 most frequent words as the vocabulary, covering 99.057% of all tokens. We randomly sampled 1GB of data for evaluation and 1GB for test. The model parameters are:
| Parameter Name | Value |
|---|---|
| Vocabulary size | 10240000 |
| Hidden dim | 512 |
| Embed dim | 512 |
| Layer | 2 |
| BatchSize | 625 |
| seqlength | 32 |
| Dropout | 0.5 |
| Learning rate | 0.01 |
| Optim | adam |
| GPU Type | GeForce GTX Titan X |
| GPU Number | 4 |
We achieve a training speed of 77873 tokens/s with 4 GPUs. It takes about 630 hours (roughly 26 days) to finish one epoch.
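A back-of-the-envelope check (float32, illustration only) shows why the 2C shared embedding is essential at this vocabulary size: a conventional embedding table for 10240000 words of dimension 512 would be far too large for GPU memory, while the shared row/column vectors are tiny.

```python
# Rough size comparison for the ClueWeb setting -- illustration, not a measurement.
import math

vocab_size = 10_240_000
embed_dim = 512
bytes_per_float = 4                                   # float32

full_embedding = vocab_size * embed_dim * bytes_per_float
table_size = math.ceil(math.sqrt(vocab_size))         # 3200 rows/columns
shared_embedding = 2 * table_size * embed_dim * bytes_per_float

print(f"Conventional embedding: {full_embedding / 2**30:.1f} GiB")   # ~19.5 GiB
print(f"2C shared embedding:    {shared_embedding / 2**20:.1f} MiB") # ~12.5 MiB
```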
[Figure: Train/Valid loss]
If you find LightRNN useful in your work, please cite the paper as follows:
@inproceedings{LiNIPS16LightRNN,
Author = {Xiang Li and Tao Qin and Jian Yang and Tie-Yan Liu},
Title = {LightRNN: Memory and Computation-Efficient Recurrent Neural Networks},
Booktitle = {Advances in Neural Information Processing Systems ({NIPS})},
Year = {2016}
}