docs/algo/sona/CC_sona_en.md
CC(connected components) algorithm is used to calculate the connected components of a graph.
CC algorithm treats every graph as the undirected. It assigns the same label to nodes belonging to the same connected component. We implemented cc algorithm for large-scale networks based on Spark On Angel. The ps maintains the node's latest estimation of labels. The Spark side maintains the adjacency list of the network, and pulls the latest label estimate in each round. According to the cc algorithm, the label estimate of the node is updated, and then pushed back to the ps in each round. The algorithm stops until none of labels of nodes are updated last round. Every component is assigned with an unique id. This implementation combines the advantages of distributed algorithm on big data with the advantages of union-find algorithm on a single machine. The processing of connected components is divided into two parts: distributed part and union-find part on a single machine, and the results of the two parts are combined into the final result at the end.
DISK_ONLY/MEMORY_ONLY/MEMORY_AND_DISKinput=hdfs://my-hdfs/data
output=hdfs://my-hdfs/output
source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster\
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=10g \
--name "cc angel" \
--jars $SONA_SPARK_JARS \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 4 \
--executor-memory 10g \
--class org.apache.spark.angel.examples.graph.CCExample \
../lib/spark-on-angel-examples-3.1.0.jar
input:$input output:$output sep:tab storageLevel:MEMORY_ONLY useBalancePartition:true \
partitionNum:4 psPartitionNum:1 localLimit:100000000 compressIterNum:3 needReplicaEdge:true