Training_README.md
Five new sets of vectors were trained. Four were trained on Gigaword combined with an updated Wikipedia corpus, and the fifth was trained on Dolma. This document describes the scripts used to train these five sets of vectors.
The Wikipedia portion of the corpora was downloaded from the 7/22/2024 dump. The text was then extracted using WikiExtractor with the command:

```bash
python -m wikiextractor.WikiExtractor enwiki-20220620-pages-meta-current.xml --no-templates --output output_dir/
```
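WikiExtractor writes its output as many shard files under output_dir/. A minimal sketch of flattening those shards into a single plain-text file for the later steps (the wiki.txt name and the shard layout assumed here follow WikiExtractor's defaults and are not specified above):

```bash
# Combine the extracted shards (output_dir/AA/wiki_00, output_dir/AA/wiki_01, ...)
# into one corpus file, dropping the <doc ...> / </doc> wrapper lines.
cat output_dir/*/wiki_* | grep -v -e '^<doc ' -e '^</doc>' > wiki.txt
```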
The fifth edition of Gigaword was used. For Dolma, version 1.6 was used, with 5% of the Common Crawl portion, 40% of C4, 100% of Reddit, and 100% of Project Gutenberg included.
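The sampling procedure itself is not shown here; purely as an illustration, a fixed-fraction, document-level subsample such as the 5% of Common Crawl could be drawn with a line filter along these lines (the file names, the one-document-per-line format, and the awk approach are all assumptions):

```bash
# Keep roughly 5% of documents; srand with a fixed seed makes the sample repeatable.
awk 'BEGIN { srand(2024) } rand() < 0.05' dolma_cc.txt > dolma_cc_5pct.txt
```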
For tokenization, version 4.4.1 of the Stanford CoreNLP PTBTokenizer was used, with the command:

```bash
java edu.stanford.nlp.process.PTBTokenizer -preserveLines -lowerCase -options 'untokenizable=allKeep'
```
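The command above omits the classpath and the input/output files; a full invocation might look like the following, where the jar name and file names are placeholders, not values taken from this document:

```bash
# Tokenize and lowercase one corpus file; tokenized output goes to stdout.
# stanford-corenlp.jar, wiki.txt, and wiki.tok.txt are placeholder names.
java -cp stanford-corenlp.jar edu.stanford.nlp.process.PTBTokenizer \
  -preserveLines -lowerCase -options 'untokenizable=allKeep' \
  wiki.txt > wiki.tok.txt
```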
When combining the Wikipedia and Gigaword data, Gigaword was included twice to balance the increase in the amount of Wikipedia data relative to the previous training corpus.
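A minimal sketch of that combination step, assuming the tokenized corpora live in hypothetical wiki.tok.txt and giga.tok.txt files:

```bash
# Gigaword is concatenated twice so the enlarged Wikipedia portion does not dominate.
cat wiki.tok.txt giga.tok.txt giga.tok.txt > wikigiga.txt
```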
Here is the bash script used to train the Wiki+Giga vectors (50, 100, 200, and 300 dimensions) and the Dolma vectors (300 dimensions). Parameters given as <a, b> took one of the listed values depending on the run; the specific choices are described after the script.
```bash
CORPUS=<wikigiga, dolma>
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors.txt
OVERFLOW_FILE=overflow
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=20
VECTOR_SIZE=<50, 100, 200, 300>
MAX_ITER=<50, 100>
WINDOW_SIZE=10
BINARY=2
NUM_THREADS=8
X_MAX=10
SEED=<123, 2024>
ALPHA=0.75
ETA=<0.05, 0.075>

# 1. Build the vocabulary from the corpus.
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE

# 2. Count word-word co-occurrences.
$BUILDDIR/cooccur -memory $MEMORY -overflow-file $OVERFLOW_FILE -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE

# 3. Shuffle the co-occurrence records.
$BUILDDIR/shuffle -memory $MEMORY -seed $SEED -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE

# 4. Train the GloVe vectors.
$BUILDDIR/glove -eta $ETA -alpha $ALPHA -save-file $SAVE_FILE -threads $NUM_THREADS -input-file \
  $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY \
  -vocab-file $VOCAB_FILE -verbose $VERBOSE -seed $SEED -checkpoint-every 10
```
For Dolma, a maximum vocabulary size of 1.2M words was used by passing the -max-vocab flag to vocab_count.
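For example, the vocabulary step for Dolma would become something like the following (this exact invocation is an assumption; only the flag and the 1.2M limit are stated above):

```bash
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -max-vocab 1200000 -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
```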
For the 50-dimensional Wiki+Giga vectors, a random seed of 123 and a learning rate (eta) of 0.075 were used. For all other vectors, a random seed of 2024 and a learning rate of 0.05 were used.
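Read against the placeholders in the script above, the Wiki+Giga 50-dimensional run would therefore use settings along these lines (the iteration count is not specified above, so it is left as a placeholder):

```bash
CORPUS=wikigiga
VECTOR_SIZE=50
SEED=123
ETA=0.075
MAX_ITER=<50, 100>   # not stated which value this run used
```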