# PageRankPro
PageRankPro is a variation of PageRank that initializes a special rank value on target nodes in the graph.
We implemented large-scale PageRank calculation on Spark on Angel, where the parameter servers (PS) maintain the information of all nodes, including the message vectors and the rank-value vectors. Messages and rank values are computed on the Spark executor side, and the results are written back through the push/update operations of the PS.
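To make the executor/PS split concrete, here is a minimal single-machine sketch of a PageRankPro-style iteration. This is an illustration only, not the library's actual code: the real computation is distributed, and names such as `label_pos`, `reset_prop`, and `tol` simply mirror the parameters of the submission example below.

```python
# Minimal local sketch of a PageRankPro-style iteration (illustrative;
# the real implementation runs on Spark executors with ranks held on
# Angel parameter servers).

def pagerank_pro(edges, label_pos, reset_prop=0.15, tol=0.01):
    """edges: list of (src, dst) pairs; label_pos: {node: initial rank}."""
    nodes = {n for e in edges for n in e}
    out_deg = {n: 0 for n in nodes}
    for s, _ in edges:
        out_deg[s] += 1
    # Personalized reset vector: rank mass is concentrated on target nodes.
    total = sum(label_pos.values())
    reset = {n: label_pos.get(n, 0.0) / total for n in nodes}
    rank = dict(reset)  # initialization uses the special rank values
    while True:
        # "Executor side": compute messages from the current ranks.
        msg = {n: 0.0 for n in nodes}
        for s, d in edges:
            msg[d] += rank[s] / out_deg[s]
        # "PS side": apply the update, mixing in the reset probability.
        new_rank = {n: reset_prop * reset[n] + (1 - reset_prop) * msg[n]
                    for n in nodes}
        diff = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if diff < tol:
            return rank

# Example: a 3-cycle where node 1 is the only labeled target node.
ranks = pagerank_pro([(1, 2), (2, 3), (3, 1)], {1: 1.0})
```

The iteration stops when the L1 change of the rank vector falls below `tol`, matching the `tol:0.01` argument in the submission example.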
- sep: separator of the input file (one of space, comma, or tab); default is space
- storageLevel: Spark storage level; default is MEMORY_ONLY

## Resource configuration

spark.ps.instances and spark.ps.memory together determine the total PS memory. To ensure that Angel does not hang, you need to configure about twice the size of the model. For PageRank, the model size is estimated as: number of nodes * 3 * 4 bytes; use this formula to estimate the PS memory needed for graph inputs of different sizes. For example, a 10-billion-edge graph yields a model of about 160 GB, so a configuration of 20 PS instances with 20 GB each is sufficient. In situations where resources are really tight, try increasing the number of partitions.

## Submission example

```shell
input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/model
labelPosInput=hdfs://my-hdfs/nodeToRank

source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=1 \
  --conf spark.ps.cores=1 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 5g \
  --num-executors 1 \
  --executor-cores 4 \
  --executor-memory 10g \
  --class com.tencent.angel.spark.examples.cluster.PageRankExample \
  ../lib/spark-on-angel-examples-3.3.0.jar \
  input:$input output:$output labelPosInput:$labelPosInput tol:0.01 resetProp:0.15 batchSize:1000 psPartitionNum:10 dataPartitionNum:10
```
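The model-size rule of thumb above (number of nodes * 3 * 4 bytes, doubled for headroom) can be sketched as a small helper. The function name and defaults are ours, for illustration only:

```python
# Back-of-envelope PS memory estimate for PageRank, per the rule of thumb:
# model size ~ num_nodes * 3 floats * 4 bytes, and Angel should be given
# roughly twice the model size to avoid hanging.

def estimate_ps_memory_gb(num_nodes, floats_per_node=3, bytes_per_float=4,
                          headroom=2.0):
    model_gb = num_nodes * floats_per_node * bytes_per_float / 1024 ** 3
    return headroom * model_gb

# e.g. one billion nodes -> model of roughly 11 GB, so configure about
# 22-23 GB of total PS memory (spark.ps.instances * spark.ps.memory).
```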
If the job fails because the application state times out, set spark.hadoop.angel.am.appstate.timeout.ms = xxx to increase the timeout; the default value is 600000 ms, which is 10 minutes.
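For example, the timeout can be raised when submitting the job (the value 1800000 below, i.e. 30 minutes, is only an illustration; choose a value that fits your cluster):

```shell
# Raise the Angel AM app-state timeout from the default 10 minutes to an
# illustrative 30 minutes (1800000 ms):
$SPARK_HOME/bin/spark-submit \
  --conf spark.hadoop.angel.am.appstate.timeout.ms=1800000 \
  ... # remaining options as in the submission example above
```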