Swing

1. Algorithm Introduction

Swing is a method for calculating item similarity on a "user-item" bipartite graph. Take a purchase graph as an example: for two users who have both purchased a pair of items, the fewer purchases those two users have in common, the more similar the two items are considered to be. In the Swing formula, U_i denotes the set of users who purchased item i and I_u denotes the set of items that user u purchased. The value range of gamma is [-1, 0); it sets the penalty for large item sets.
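
The formula itself is not reproduced in text on this page. One form of the Swing score that is consistent with the parameters described here is given below; treat it as an assumption and check the Angel source for the exact definition, in particular whether user pairs are counted once or twice:

$$
\mathrm{sim}(i, j) = \sum_{\substack{u, v \in U_i \cap U_j \\ u < v}} \frac{(|I_u| + \beta)^{\gamma}\,(|I_v| + \beta)^{\gamma}}{\alpha + |I_u \cap I_v|}
$$

A minimal single-machine sketch of that assumed formula follows. The function name `swing_similarity` is hypothetical and is not part of Angel; it only illustrates how alpha, beta and gamma would interact on a small in-memory graph.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(edges, alpha=0.0, beta=5.0, gamma=-0.3):
    """Return {(item_i, item_j): score} with item_i < item_j (assumed formula)."""
    items_of = defaultdict(set)  # I_u: items purchased by user u
    users_of = defaultdict(set)  # U_i: users who purchased item i
    for user, item in edges:
        items_of[user].add(item)
        users_of[item].add(user)

    # Per-user weight; with gamma < 0, users with large item sets count less.
    weight = {u: (len(items) + beta) ** gamma for u, items in items_of.items()}

    scores = defaultdict(float)
    for i, j in combinations(sorted(users_of), 2):
        common_users = sorted(users_of[i] & users_of[j])
        # Sum over unordered pairs of distinct users who purchased both items.
        for u, v in combinations(common_users, 2):
            overlap = len(items_of[u] & items_of[v])  # at least 2: both hold i and j
            scores[(i, j)] += weight[u] * weight[v] / (alpha + overlap)
    return dict(scores)


edges = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
]
scores = swing_similarity(edges)
```

In this toy graph, items a and c share only one user (u2), so no user pair exists and the pair receives no score. A larger overlap between two users' item sets increases the denominator and so lowers the contribution of that user pair.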

2. Parameters

IO Params

  • input: HDFS path of a "user-item" unweighted bipartite graph; each row represents an edge in the form userId | itemId
  • output: HDFS path for the output; each row represents a pair of items and their similarity score: itemId itemId score
  • sep: the separator used in the input file between srcId and dstId; can be tab, space or comma
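
As an illustration of the input format, the sketch below parses edge rows with a named separator. The helper `parse_edges` and the `SEPARATORS` table are hypothetical and are not part of Angel; the mapping of the names tab, space and comma to characters is an assumption based on the values listed for sep.

```python
# Assumed mapping from the sep parameter's names to actual delimiter characters.
SEPARATORS = {"tab": "\t", "space": " ", "comma": ","}


def parse_edges(lines, sep="tab"):
    """Parse 'userId<sep>itemId' rows into (user_id, item_id) tuples."""
    delimiter = SEPARATORS[sep]
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        user_id, item_id = line.split(delimiter)[:2]
        edges.append((user_id, item_id))
    return edges


tab_edges = parse_edges(["u1\ta\n", "", "u2\tb"])
comma_edges = parse_edges(["u1,a", "u2,b"], sep="comma")
```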

Algo Params

  • topFrom: items are sorted by popularity (number of edges to users), the items ranked within [topFrom, topTo) are picked out, and similarities are calculated only between these items. This is useful when only similarity scores between less popular items are wanted
  • topTo: refer to "topFrom"
  • alpha: refer to the formula and the introduction, default value is 0
  • beta: refer to the formula and the introduction, default value is 5
  • gamma: refer to the formula and the introduction, default value is -0.3
  • partitionNum: number of RDD partitions
  • psPartitionNum: number of data partitions on the PS
  • useBalancePartition: whether to use the balancePartition strategy, true / false; true is suggested when the distribution of graph vertices is unbalanced
  • storageLevel: RDD persist level, DISK_ONLY / MEMORY_ONLY / MEMORY_AND_DISK
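
The topFrom / topTo selection can be sketched as follows. This is an illustration only, not Angel's implementation: the function name is hypothetical, and it assumes 0-based ranks, descending popularity, unique edges, and ties broken by item id.

```python
from collections import Counter


def select_items_by_popularity(edges, top_from, top_to):
    """Return the items ranked in [top_from, top_to) by descending edge count."""
    popularity = Counter(item for _user, item in edges)
    # Most popular first; ties are broken by item id so the result is deterministic.
    ranked = sorted(popularity, key=lambda item: (-popularity[item], item))
    return ranked[top_from:top_to]


edges = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"),  # a has 3 edges
    ("u1", "b"), ("u2", "b"),               # b has 2 edges
    ("u3", "c"),                            # c has 1 edge
]
selected = select_items_by_popularity(edges, 1, 3)
```

Under these assumptions, skipping the most popular item with topFrom = 1 restricts the similarity calculation to items b and c.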

3. Running

input=hdfs://my-hdfs/data
output=hdfs://my-hdfs/output

source ./spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=1 \
  --conf spark.ps.cores=1 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --name "swing angel" \
  --jars $SONA_SPARK_JARS \
  --driver-memory 5g \
  --num-executors 1 \
  --executor-cores 4 \
  --executor-memory 10g \
  --class org.apache.spark.angel.examples.graph.SwingExample \
  ../lib/spark-on-angel-examples-3.3.0.jar \
  input:$input output:$output sep:tab storageLevel:MEMORY_ONLY useBalancePartition:true \
  partitionNum:4 psPartitionNum:1
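
Once the job has finished, the output rows can be post-processed, for example to list the most similar items for each item. The sketch below is hypothetical and not part of Angel. It assumes the three fields itemId itemId score are separated by whitespace (the output delimiter is not specified here) and that each score applies to the pair in both directions.

```python
from collections import defaultdict


def top_similar_items(rows, k=2):
    """Map each item to its k highest-scoring neighbours as (item, score) pairs."""
    neighbors = defaultdict(dict)
    for row in rows:
        parts = row.split()
        if len(parts) != 3:
            continue  # skip malformed or empty rows
        item_a, item_b, score = parts[0], parts[1], float(parts[2])
        # Store both directions; a dict avoids duplicates if both are present.
        neighbors[item_a][item_b] = score
        neighbors[item_b][item_a] = score
    return {
        item: sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for item, scored in neighbors.items()
    }


rows = ["a b 0.5", "a c 0.2", "b c 0.9"]
result = top_similar_items(rows)
```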