docs/algo/sona/LPA_sona_en.md
LPA(label propagation algorithm) is a semi-supervised learning algorithm, widely used in community detection.
LPA uses nodes with labels to predict those without labels. We implemented lpa algorithm for large-scale networks based on Spark On Angel. The ps maintains the node's latest estimation of labels. The Spark side maintains the adjacency list of the network, and pulls the latest label estimate in each round. According to the lpa algorithm, the label estimate of the node is updated, and then pushed back to the ps in each round. The algorithm stops with countable iterations. The algorithm takes the most common label the node's neighbors as the label of that node.
DISK_ONLY/MEMORY_ONLY/MEMORY_AND_DISKinput=hdfs://my-hdfs/data
output=hdfs://my-hdfs/output
source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster\
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=10g \
--name "cc angel" \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 4 \
--executor-memory 10g \
--class org.apache.spark.angel.examples.graph.LPAExample \
../lib/spark-on-angel-examples-3.1.0.jar
input:$input output:$output sep:tab storageLevel:MEMORY_ONLY useBalancePartition:true \
partitionNum:4 psPartitionNum:1 maxIter:100 needReplicaEdge:true