LINE (Large-scale Information Network Embedding) is one of the well-known algorithms in the field of network embedding. It embeds graph data into a vector space so that vector-based machine learning algorithms can be applied to graph data.
The LINE algorithm is a network representation learning algorithm (it can also be regarded as a preprocessing algorithm for graph data). It receives a network as input and produces a vector representation for each node. LINE mainly focuses on optimizing two objective functions, one for each kind of node similarity.
The first objective characterizes the first-order similarity between nodes (a direct edge), and the second characterizes the second-order similarity between nodes (similar neighbors). In other words, two nodes should end up close together in the embedding space if they are connected by an edge or if they share similar neighbors.
For more details, please refer to the paper [1].
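For reference, the two objectives can be written as they appear in the LINE paper. The notation below is taken from the paper, not from this document: $u_i$ is the embedding of node $v_i$, $u'_j$ is the context embedding of node $v_j$, $w_{ij}$ is the weight of edge $(i, j)$, and $E$ is the edge set.

```latex
% First-order proximity: joint probability of an observed edge
p_1(v_i, v_j) = \frac{1}{1 + \exp\left(-u_i^{\top} u_j\right)},
\qquad
O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)

% Second-order proximity: probability of context v_j given node v_i
p_2(v_j \mid v_i) = \frac{\exp\left({u'_j}^{\top} u_i\right)}{\sum_{k=1}^{|V|} \exp\left({u'_k}^{\top} u_i\right)},
\qquad
O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)
```

The paper approximates the softmax in $O_2$ with negative sampling; the `negative` parameter in the submit command presumably sets the number of negative samples per edge.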
input: HDFS path of the edge table of an undirected graph. Each line holds one edge, with columns separated by blanks or commas. For example, unweighted edge data looks like the following (for weighted edges, put the weight in the third column):
0 2
2 1
3 1
3 2
4 1
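The edge format above can be checked locally before uploading the file to HDFS. The following is a minimal sketch, not part of Angel: the function names `parse_edge_line` and `parse_edges` are hypothetical, and it assumes that node ids are integers and that an optional third column is a numeric weight.

```python
import re
from typing import List, Optional, Tuple

# (source id, destination id, optional weight)
Edge = Tuple[int, int, Optional[float]]


def parse_edge_line(line: str) -> Optional[Edge]:
    """Parse 'src dst' or 'src dst weight'; columns split on blanks or commas."""
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if not fields:
        return None  # blank line, nothing to parse
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 columns, got {len(fields)}: {line!r}")
    src, dst = int(fields[0]), int(fields[1])
    # Assumption: a third column, when present, is the edge weight.
    weight = float(fields[2]) if len(fields) == 3 else None
    return src, dst, weight


def parse_edges(text: str) -> List[Edge]:
    """Parse every non-blank line of an edge table."""
    edges = []
    for line in text.splitlines():
        edge = parse_edge_line(line)
        if edge is not None:
            edges.append(edge)
    return edges


sample = "0 2\n2 1\n3 1\n3 2\n4 1\n"
edges = parse_edges(sample)
print(edges)
```

A malformed line raises `ValueError`, which makes it easy to catch bad rows early on a sample of the data.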
output: HDFS path where the result is saved. The final embedding is saved under output/CP_x, where x is the number of the training round (epoch). The separators used in the saved result can be specified with the following configuration items:
spark.hadoop.angel.line.keyvalue.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
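To consume the saved embeddings downstream, each output line has to be split with the same separators that were configured for the job. The sketch below is illustrative only: `parse_embedding_line` is a hypothetical helper, and it assumes a line layout such as `123:0.1 0.2 0.1` (node id, key-value separator, then vector components joined by the feature separator). Pass the actual separator characters if the job used different ones.

```python
from typing import List, Tuple


def parse_embedding_line(
    line: str, keyvalue_sep: str = ":", feature_sep: str = " "
) -> Tuple[int, List[float]]:
    """Split one saved line into (node id, embedding vector)."""
    # Split only on the first key-value separator, so the parse still works
    # when both separators are configured to the same character.
    node_id, sep, vector_text = line.strip().partition(keyvalue_sep)
    if not sep or not vector_text:
        raise ValueError(f"no embedding found in line: {line!r}")
    vector = [float(x) for x in vector_text.split(feature_sep) if x]
    return int(node_id), vector


node_id, vector = parse_embedding_line("123:0.1 0.2 0.1")
print(node_id, vector)
```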
saveContextEmbedding: whether to save the context embedding when training second-order LINE. The saved embedding can be used for incremental training.
extraInputEmbeddingPath: path from which pre-trained node input embedding vectors are loaded for initialization in incremental training. The default data format is node id:embedding vector, with the vector components separated by spaces (for example, 123:0.1 0.2 0.1). The separators can be specified with the following configuration items:
spark.hadoop.angel.line.keyvalue.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
extraContextEmbeddingPath: path from which pre-trained node context embedding vectors are loaded for initialization in incremental training. It only takes effect for second-order LINE. The default data format is node id:embedding vector, with the vector components separated by spaces (for example, 123:0.1 0.2 0.1). The separators can be specified with the following configuration items:
spark.hadoop.angel.line.keyvalue.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
spark.hadoop.angel.line.feature.sep=(supports space, comma, tab, bar, colon, etc.; the default is colon)
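When the pre-trained vectors come from another tool, they need to be written in the format described above before they can be used for initialization. The following is a minimal sketch under that assumption: `format_embedding_line` is a hypothetical helper, and the separators must match whatever the job is configured to expect.

```python
from typing import Sequence


def format_embedding_line(
    node_id: int,
    vector: Sequence[float],
    keyvalue_sep: str = ":",
    feature_sep: str = " ",
) -> str:
    """Render one node as '<id><keyvalue_sep><v1><feature_sep><v2>...'."""
    components = feature_sep.join(repr(float(x)) for x in vector)
    return f"{node_id}{keyvalue_sep}{components}"


# Hypothetical pre-trained vectors keyed by node id.
pretrained = {123: [0.1, 0.2, 0.1], 124: [0.3, 0.0, 0.5]}
lines = [format_embedding_line(nid, vec) for nid, vec in pretrained.items()]
print("\n".join(lines))
```

The resulting lines can then be written to a file and uploaded to the path given as extraInputEmbeddingPath or extraContextEmbeddingPath.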
saveModelInterval: the interval, in epochs, at which the model is saved
checkpointInterval: the interval, in epochs, at which a model checkpoint is written
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/model
source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=10g \
--name "LINE angel" \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 4 \
--executor-memory 10g \
--class com.tencent.angel.spark.examples.cluster.LINEExample \
../lib/spark-on-angel-examples-3.3.0.jar \
input:$input output:$output embedding:128 negative:5 epoch:10 stepSize:0.01 batchSize:1000 numParts:10 remapping:false order:2
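The algorithm parameters in the command above are passed as space-separated `key:value` pairs. When jobs are launched from a script, it can help to build that argument list from a dictionary. The sketch below is not Angel's own argument handling: `build_line_args` and `parse_line_args` are hypothetical helpers, and the parameter names and values are simply the ones from the command above. Values such as HDFS URLs contain colons themselves, so the sketch splits each pair only on the first colon.

```python
from typing import Dict, List


def build_line_args(params: Dict[str, object]) -> List[str]:
    """Turn a parameter dict into 'key:value' command-line arguments."""
    return [f"{key}:{value}" for key, value in params.items()]


def parse_line_args(args: List[str]) -> Dict[str, str]:
    """Inverse of build_line_args; splits each pair on the first colon only."""
    parsed = {}
    for arg in args:
        key, sep, value = arg.partition(":")
        if not key or not sep:
            raise ValueError(f"expected key:value, got {arg!r}")
        parsed[key] = value
    return parsed


params = {
    "input": "hdfs://my-hdfs/data",
    "output": "hdfs://my-hdfs/model",
    "embedding": 128,
    "negative": 5,
    "epoch": 10,
    "stepSize": 0.01,
    "batchSize": 1000,
    "numParts": 10,
    "remapping": "false",
    "order": 2,
}
args = build_line_args(params)
print(" ".join(args))
```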