docs/algo/sona/word2vec_sona_en.md
The Word2Vec algorithm is one of the best-known algorithms in the NLP field. It learns vector representations of words from text data, and these vectors can serve as input to other NLP algorithms.
We implemented the skip-gram model with negative-sampling optimization on Spark on Angel, which can handle very large models of up to 1 billion words × 1000 dimensions. The input (U) and context (V) embedding matrices are stored on Angel PS. Each Spark executor pulls the rows for the words in its batch, together with the negatively sampled words, computes the gradients and updates, and finally pushes the updated rows back to the PS.
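The per-pair update described above can be sketched in plain Python (a minimal, single-machine illustration; in the actual implementation the U/V rows live on Angel PS and are pulled/pushed per batch, and the function and variable names here are ours, not the project's):

```python
import math
import random

def sgns_update(u, v_pos, v_negs, lr):
    """One skip-gram-with-negative-sampling step for a (center, context) pair.

    u      : input embedding of the center word (list of floats)
    v_pos  : context embedding of the true context word
    v_negs : context embeddings of the sampled negative words
    lr     : learning rate (corresponds to stepSize)
    Updates all vectors in place and returns the pair loss.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    loss = 0.0
    grad_u = [0.0] * len(u)
    for v, label in [(v_pos, 1.0)] + [(vn, 0.0) for vn in v_negs]:
        score = sigmoid(sum(ui * vi for ui, vi in zip(u, v)))
        g = lr * (label - score)          # scaled gradient for this pair
        loss += -math.log(score) if label == 1.0 else -math.log(1.0 - score)
        for i in range(len(u)):
            grad_u[i] += g * v[i]         # accumulate gradient for u
            v[i] += g * u[i]              # update the context vector in place
    for i in range(len(u)):
        u[i] += grad_u[i]
    return loss

# toy usage: 4-dimensional embeddings, 2 negative samples
random.seed(0)
dim = 4
u = [random.uniform(-0.5, 0.5) for _ in range(dim)]
vp = [random.uniform(-0.5, 0.5) for _ in range(dim)]
vns = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(2)]
before = sgns_update(u, vp, vns, lr=0.01)
after = sgns_update(u, vp, vns, lr=0.01)
```

With a small learning rate, repeating the update on the same pair drives the loss down, which is the behavior the distributed batch updates reproduce at scale.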
input: the HDFS path of the corpus, e.g. sentences produced by random walks. Words are separated by blanks or commas. The example below uses numeric ids (the input may also be non-numeric strings, in which case the component's built-in re-encoding (remapping) is applied), for example:
```
0 1 3 5 9
2 1 5 1 7
3 1 4 2 8
3 2 5 1 3
4 1 2 9 4
```
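Since words may be separated by either blanks or commas, a corpus line can be tokenized as below (a small illustrative sketch; the component performs this splitting internally, and `tokenize` is a name we made up for illustration):

```python
import re

def tokenize(line):
    """Split a corpus line into word tokens; blanks and commas both act as separators."""
    return [tok for tok in re.split(r"[\s,]+", line.strip()) if tok]

print(tokenize("0 1 3 5 9"))   # space-separated line
print(tokenize("2,1,5,1,7"))   # comma-separated line
```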
output: the HDFS path where results are saved. The final embedding is written to output/CP_x, where x is the epoch number. The separators used when saving can be specified by the configuration items:
spark.hadoop.angel.line.keyvalue.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
saveContextEmbedding: whether to save the context embedding during training; the saved embedding can be used for incremental training
extraInputEmbeddingPath: path to pre-trained node input embedding vectors, loaded for initialization in incremental training. The default data format is node id:embedding vector (vector components separated by spaces, e.g. 123:0.1 0.2 0.1); the separators can be specified by the configuration items:
spark.hadoop.angel.line.keyvalue.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
extraContextEmbeddingPath: path to pre-trained node context embedding vectors, loaded for initialization in incremental training. The default data format is node id:embedding vector (vector components separated by spaces, e.g. 123:0.1 0.2 0.1); the separators can be specified by the configuration items:
spark.hadoop.angel.line.keyvalue.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep= (supports space, comma, tab, bar, colon, etc.; the default is colon)
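For the embedding file format above, a line such as `123:0.1 0.2 0.1` (colon between node id and vector, spaces between components) can be parsed as follows. This is an illustrative sketch: the function name is ours, and the default separator arguments simply match the example format, since both separators are configurable via the two properties above:

```python
def parse_embedding_line(line, kv_sep=":", feat_sep=" "):
    """Parse 'nodeId<kv_sep>v1<feat_sep>v2<feat_sep>...' into (node_id, vector).

    kv_sep and feat_sep correspond to spark.hadoop.angel.line.keyvalue.sep
    and spark.hadoop.angel.line.feature.sep; the defaults here mirror the
    example format node id:space-separated vector.
    """
    node, vec = line.strip().split(kv_sep, 1)
    return node, [float(x) for x in vec.split(feat_sep)]

node, vec = parse_embedding_line("123:0.1 0.2 0.1")
```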
nodeTypePath: path of the node types required to run heterogeneous skip-gram (e.g. the result of a meta-path walk). The data format is: nodeId separator typeId
saveModelInterval: save the model every this number of epochs
checkpointInterval: write a model checkpoint every this number of epochs
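The two intervals amount to simple epoch-modulus checks, which can be sketched as below (an illustrative sketch only; the function name is ours and epochs are assumed to be 1-based):

```python
def should_save(epoch, save_model_interval, checkpoint_interval):
    """Return (save_model, write_checkpoint) flags for a 1-based epoch number."""
    return (epoch % save_model_interval == 0,
            epoch % checkpoint_interval == 0)

# e.g. saveModelInterval=5, checkpointInterval=2 over 10 epochs
flags = [should_save(e, 5, 2) for e in range(1, 11)]
```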
```shell
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/model

source ./spark-on-angel-env.sh

$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=1 \
  --conf spark.ps.cores=1 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 5g \
  --num-executors 1 \
  --executor-cores 4 \
  --executor-memory 10g \
  --class com.tencent.angel.spark.examples.cluster.Word2vecExample \
  ../lib/spark-on-angel-examples-3.3.0.jar \
  input:$input output:$output embedding:32 negative:5 epoch:10 stepSize:0.01 batchSize:50 psPartitionNum:10 remapping:false window:5
```