SLPA algorithm

1. Algorithm Introduction

SLPA (Speaker-listener Label Propagation Algorithm) was proposed by Jierui Xie et al. in 2011. It is an overlapping community discovery algorithm, which involves an important threshold parameter r, which can be degenerated into a non-overlapping type through proper selection of r. . It is an extension of the LPA algorithm (label propagation algorithm). The discovery of the SLAP overlapping community mainly believes that each node has not only one label, but one label is added for each iteration, and finally a certain strategy is used to filter.

2. Running example

Algorithm IO parameters

input：Input, hdfs path, undirected or directed graph, weight or no weight (optional). Each line represents an edge: srcId separator dstId (separator weight). for example：

0 1 0.3

2 1 0.5

3 1 0.1

3 2 0.7

4 1 0.3

Among them, the meanings of the three columns of data from left to right are the source node ID, the end node ID, and the edge weight (it can be empty if there is no weight)

output: Output, hdfs path. Each row represents a wandering path
sep: Separator, the separator between the start vertex and target vertex of each edge in the input: tab, space, etc.
isWeighted：Whether the edge is weighted

Algorithm parameters

maxIteration: the maximum number of iterations
numMaxCommunities: Indicates that each vertex is in at most numMaxCommunities communities at the same time
preserveRate: Each time the community to which a vertex belongs is calculated, there is a certain probability that the last result will be retained, that is, the vertex will not be updated in this iteration
needReplicateEdge: whether you need to convert the graph to an undirected graph
partitionNum: the number of data partitions, the number of spark rdd data partitions
psPartitionNum: the number of partitions of the model on the parameter server
useBalancePartition: Whether the parameter server divides the storage partition of the input data node into a balanced partition, if the index of the input node is not uniform, it is recommended to choose yes
storageLevel: RDD storage level, DISK_ONLY/MEMORY_ONLY/MEMORY_AND_DISK

Resource parameters

ps number and memory size: the product of ps.instance and ps.memory is the total configured memory of ps. In order to ensure that Angel does not hang up, you need to configure memory that is about twice the size of the data storage on ps. For the SLPA algorithm, if it is 100 million nodes and numMaxCommunities is 3, the model size is 100 million*(8+4)bytes32 and the size is 7.2G, then the configuration of instances=2, memory=8G is almost the same
Spark resource configuration: The product of num-executors and executor-memory is the total configuration memory of executors, and it is best to store 2 times the input data. If the memory is tight, 1x is acceptable, but it will be relatively slow. For example, a 10 billion edge set is about 600G in size, and a 50G * 20 configuration is sufficient. When resources are really tight, try to increase the number of partitions!

Submitting scripts

input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/output

source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster\
  --conf spark.ps.instances=1 \
  --conf spark.ps.cores=1 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --name "slpa angel" \
  --jars $SONA_SPARK_JARS  \
  --driver-memory 5g \
  --num-executors 1 \
  --executor-cores 4 \
  --executor-memory 10g \
  --class com.tencent.angel.spark.examples.cluster.SLPAExample \
  ../lib/spark-on-angel-examples-3.3.0.jar
  input:$input output:$output \
  sep:tab storageLevel:MEMORY_ONLY useBalancePartition:true maxIteration：10\
  partitionNum:4 psPartitionNum:1 numMaxCommunities:3 needReplicateEdge:true

FAQ

At about 10 minutes, the task hangs: The most likely reason is that Angel cannot apply for resources! Since SLPA is developed based on Spark On Angel, it actually involves two systems, Spark and Angel, and their application for resources from Yarn is carried out independently. After the Spark task is started, Spark submits the Angel task to Yarn. If the resource cannot be applied for within a given time, a timeout error will be reported and the task will hang! The solution is: 1) Confirm that the resource pool has sufficient resources 2) Add spark conf: spark.hadoop.angel.am.appstate.timeout.ms = xxx to increase the timeout time, the default value is 600000, which is 10 minutes