docs/algo/sona/eges_sona_en.md
EGES (Enhanced Graph Embedding with Side information) is an algorithm that learns node embeddings from both the structure of a graph and the side information attached to each node, where the side information of a node consists of several discrete attributes. EGES learns an embedding for the node itself and one for each of its discrete attributes, adaptively adjusts their weights, and takes the weighted average of these embeddings as the final embedding of the node. Compared with graph embedding algorithms that learn embeddings from the graph structure alone, EGES can achieve better performance. EGES also alleviates the "cold start" problem, in which a newly generated node has no link to any existing node: the embedding of such a node is initialized as the average of the embeddings of its discrete attributes. For more details about the EGES algorithm, please refer to the EGES article.
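
The weighted average and the cold-start initialization described above can be sketched as follows. This is a minimal illustration, not the Spark on Angel implementation: the function names are hypothetical, and the sketch assumes that the learned weights are normalized with a softmax (as in the EGES article) before averaging.

```python
import math


def aggregate_embedding(embeddings, raw_weights):
    """Weighted average of a node's own embedding and its attribute embeddings.

    embeddings:  list of equal-length vectors (node itself first, then attributes).
    raw_weights: one learned scalar per vector; normalized here with a softmax
                 (an assumption of this sketch).
    """
    if not embeddings or len(embeddings) != len(raw_weights):
        raise ValueError("need one weight per embedding")
    shift = max(raw_weights)  # subtract the max for numerical stability
    exp_w = [math.exp(w - shift) for w in raw_weights]
    total = sum(exp_w)
    dim = len(embeddings[0])
    return [
        sum(w * vec[d] for w, vec in zip(exp_w, embeddings)) / total
        for d in range(dim)
    ]


def cold_start_embedding(attribute_embeddings):
    """Plain average of attribute embeddings for a node with no links."""
    if not attribute_embeddings:
        raise ValueError("need at least one attribute embedding")
    count = len(attribute_embeddings)
    dim = len(attribute_embeddings[0])
    return [sum(vec[d] for vec in attribute_embeddings) / count for d in range(dim)]
```

With equal raw weights the softmax reduces to a plain average, so `aggregate_embedding` and `cold_start_embedding` agree on the same inputs in that case.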
We have implemented the EGES algorithm on the Spark on Angel framework, which can handle large-scale industrial data. All node and attribute embeddings, together with their corresponding weights, are stored on the Angel parameter servers (PSs). In each epoch, for every mini-batch, each Spark executor first pulls the parameters that the mini-batch requires: the input embeddings of the positive nodes and their attributes, the corresponding weights, and the output embeddings of the negative nodes. It then computes the gradients of the embeddings and the weights. Finally, each executor pushes the gradients to the PSs, where the corresponding embeddings and weights are updated.
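
The pull, compute, push cycle can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: `ToyParameterServer` and `worker_step` are hypothetical names and not part of the Angel API, the objective is assumed to be skip-gram with negative sampling, and the side-information weights are left out so that each node has a single input vector.

```python
import math


class ToyParameterServer:
    """In-memory stand-in for the PSs (hypothetical, not the Angel API)."""

    def __init__(self, step_size):
        self.step_size = step_size
        self.table = {}  # key -> list of floats

    def pull(self, keys):
        # Return copies so that workers never mutate server state directly.
        return {k: list(self.table[k]) for k in keys}

    def push(self, grads):
        # The server applies the update: param -= step_size * gradient.
        for key, grad in grads.items():
            row = self.table[key]
            for d, g in enumerate(grad):
                row[d] -= self.step_size * g


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def worker_step(ps, center, context, negatives):
    """One pull / compute / push cycle for a single training pair."""
    in_key = ("in", center)
    out_keys = [("out", context)] + [("out", n) for n in negatives]

    # 1. Pull only the rows that this mini-batch needs.
    params = ps.pull([in_key] + out_keys)
    h = params[in_key]

    # 2. Compute gradients of the negative-sampling loss.
    grads = {in_key: [0.0] * len(h)}
    loss = 0.0
    for i, key in enumerate(out_keys):
        label = 1.0 if i == 0 else 0.0  # first key is the true context node
        z = params[key]
        p = sigmoid(sum(a * b for a, b in zip(h, z)))
        loss -= math.log(p if label == 1.0 else 1.0 - p)
        err = p - label
        out_grad = grads.setdefault(key, [0.0] * len(h))
        for d in range(len(h)):
            out_grad[d] += err * h[d]
            grads[in_key][d] += err * z[d]

    # 3. Push the gradients; the server performs the update.
    ps.push(grads)
    return loss
```

In the real system many executors run this cycle concurrently against partitioned PSs; the sketch only shows the data flow of one step.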
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/EGES_output
matrixOutput=hdfs://my-hdfs/EGES_matrixOutput
source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=10g \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 4 \
--executor-memory 10g \
--class com.tencent.angel.spark.examples.cluster.EGESExample \
../lib/spark-on-angel-examples-3.1.0.jar \
input:$input output:$output matrixOutput:$matrixOutput weightedSI:true numWeightsSI:3 embeddingDim:32 numNegSamples:5 epochNum:10 stepSize:0.01 decayRate:0.5 batchSize:1000 dataPartitionNum:12 psPartitionNum:10 needRemapping:false
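
The application arguments on the last line of the script are passed as `key:value` pairs. As a rough illustration of how such pairs can be read, the sketch below splits each argument at the first colon only, so that values containing colons (such as `hdfs://` paths) stay intact. `parse_args` is a hypothetical helper written for this example; it is not the parser used by the example class.

```python
def parse_args(args):
    """Turn ["key:value", ...] into a dict, splitting at the first colon only."""
    params = {}
    for arg in args:
        key, sep, value = arg.partition(":")
        if not sep or not key:
            raise ValueError(f"expected key:value, got {arg!r}")
        params[key] = value
    return params


params = parse_args([
    "input:hdfs://my-hdfs/data",
    "embeddingDim:32",
    "stepSize:0.01",
    "needRemapping:false",
])
embedding_dim = int(params["embeddingDim"])
step_size = float(params["stepSize"])
need_remapping = params["needRemapping"] == "true"
```

All values arrive as strings, so numeric and boolean settings need an explicit conversion as shown.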