Node2Vec is a well-known graph embedding algorithm. It samples walk sequences for each node with a biased random walk that blends depth-first and breadth-first search, so the sequences capture both homophily and structural equivalence in the graph. The node embeddings are then learned by running the Word2Vec algorithm on the sampled walk sequences. For more details about the algorithm, please refer to the article Node2vec.

In our implementation, Node2Vec is divided into two steps: sampling the walk sequences and learning the embeddings. The class Node2VecExample covers only the first step; the second step should be run with the class Word2VecExample.
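To make the sampling step concrete, here is a minimal single-machine sketch of the second-order biased walk from the Node2vec article, for an unweighted graph. It is not the code in Node2VecExample: the function names `node2vec_step` and `node2vec_walk`, the toy graph, and the use of Python are all assumptions made for illustration. Returning to the previous node gets weight 1/p, moving to a common neighbor of the previous node gets weight 1, and moving further away gets weight 1/q.

```python
import random

def node2vec_step(graph, prev, cur, p, q, rng):
    """Pick the next node of a walk that has just moved prev -> cur."""
    neighbors = sorted(graph[cur])  # sorted only to make runs reproducible
    weights = []
    for nxt in neighbors:
        if nxt == prev:
            weights.append(1.0 / p)   # step back to the previous node
        elif nxt in graph[prev]:
            weights.append(1.0)       # stay at distance 1 from prev (BFS-like)
        else:
            weights.append(1.0 / q)   # move to distance 2 from prev (DFS-like)
    return rng.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(graph, start, walk_length, p, q, rng):
    """Sample one walk of at most walk_length nodes starting at start."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        if not graph[cur]:
            break  # dead end: no neighbors to move to
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(sorted(graph[cur])))
        else:
            walk.append(node2vec_step(graph, walk[-2], cur, p, q, rng))
    return walk

# Hypothetical toy graph stored as adjacency sets (undirected).
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
rng = random.Random(42)
walk = node2vec_walk(graph, start=0, walk_length=20, p=0.8, q=1.2, rng=rng)
print(walk)
```

With p below 1 and q above 1, as in this call, the walk is nudged toward revisiting nearby nodes rather than exploring outward.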
We have implemented the Node2Vec algorithm on the Spark on Angel framework, which can handle large-scale industrial data. The neighbor table (for graphs without edge weights) or the alias table (for graphs with edge weights) is stored on the Angel parameter servers (PSs). For each batch, the Spark executors pull the entries needed by the nodes in that batch from the PSs and sample the next nodes, eventually producing the walk sequences for every node.
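For readers unfamiliar with alias tables, the sketch below shows the standard alias method, which lets a weighted neighbor be drawn in constant time after a linear-time build. It only illustrates the data structure; how Angel lays the table out on the PSs is not described here, and the names `build_alias_table` and `alias_sample` are hypothetical.

```python
import random

def build_alias_table(weights):
    """Build (prob, alias) arrays for O(1) sampling proportional to weights."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]  # average bucket mass is 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own mass ...
        alias[s] = l          # ... and is topped up from bucket l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        # Leftover buckets are full (up to rounding error).
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def alias_sample(prob, alias, rng):
    """Draw one index: pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Hypothetical edge weights of one node's four neighbors.
edge_weights = [1.0, 2.0, 3.0, 4.0]
prob, alias = build_alias_table(edge_weights)
rng = random.Random(7)
samples = [alias_sample(prob, alias, rng) for _ in range(10)]
print(samples)
```

The build cost is paid once per node, which is why precomputing and storing the table suits a setting where the same node is sampled from many times.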
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/model
source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=10g \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 4 \
--executor-memory 10g \
--class com.tencent.angel.spark.examples.cluster.Node2VecExample \
../lib/spark-on-angel-examples-3.3.0.jar \
input:$input output:$output isWeighted:false delimiter:space needReplicaEdge:true epochNum:1 walkLength:20 useTrunc:false truncLength:6000 batchSize:1000 setCheckPoint:true pValue:0.8 qValue:1.2
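The application arguments above are passed as `key:value` tokens. The following sketch shows one way such tokens could be read into a dictionary; it is an assumption for illustration, not the parser used by Node2VecExample, and `parse_args` is a hypothetical helper. Note that the split must happen on the first colon only, because values such as HDFS paths contain colons themselves.

```python
def parse_args(tokens):
    """Turn ['key:value', ...] into a dict, splitting on the first ':' only."""
    params = {}
    for token in tokens:
        key, sep, value = token.partition(":")
        if not sep or not key:
            raise ValueError("expected key:value, got %r" % token)
        params[key] = value
    return params

# Argument string copied from the command above, with $input and $output expanded.
command_tail = (
    "input:hdfs://my-hdfs/data output:hdfs://my-hdfs/model isWeighted:false "
    "delimiter:space needReplicaEdge:true epochNum:1 walkLength:20 "
    "useTrunc:false truncLength:6000 batchSize:1000 setCheckPoint:true "
    "pValue:0.8 qValue:1.2"
)
params = parse_args(command_tail.split())

walk_length = int(params["walkLength"])
p_value = float(params["pValue"])
q_value = float(params["qValue"])
print(params["input"], walk_length, p_value, q_value)
```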