KMeans is a method that clusters data into K groups of equal variance. The conventional KMeans algorithm has a performance bottleneck; when implemented on a parameter server (PS), however, KMeans achieves the same level of accuracy with better performance.
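The KMeans algorithm assigns each data point to its nearest cluster, where the distance is measured between the data point and the cluster's center. In general, the KMeans algorithm is implemented iteratively, repeating the following two steps until convergence:

$$c_i := \arg\min_{j} \lVert x_i - \mu_j \rVert^2$$

$$\mu_j := \frac{\sum_{i} 1\{c_i = j\}\, x_i}{\sum_{i} 1\{c_i = j\}}$$

where $x_i$ is the ith sample, $c_i$ is its nearest cluster, and $\mu_j$ is the center of the jth cluster.

"Web-Scale K-Means Clustering" [1] proposes an improved KMeans algorithm that uses mini-batch optimization for training, addressing the latency, scalability, and sparsity requirements of user-facing web applications. In each iteration, the algorithm samples a mini-batch, assigns each sample in the batch to its nearest center, and then moves that center toward the sample with a per-center learning rate equal to the reciprocal of the number of samples assigned to that center so far.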
KMeans on Angel stores the K centers and the per-center sample counts on the ParameterServer: a K×N matrix holds the K centers and a K×1 vector holds the counts, where K is the number of clusters and N is the dimension of the data, i.e. the number of features.
KMeans on Angel is trained in an iterative way; during each iteration, the centers are updated by mini-batch.
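The storage layout and the per-iteration update can be pictured with a small, single-machine sketch. It keeps a K×N list of centers and a length-K list of counts, and applies the per-center learning-rate rule from the mini-batch paper [1]. The function names and the toy data are hypothetical, and whether Angel applies exactly this rule internally is an assumption.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mini_batch_update(centers, counts, batch):
    """Apply one mini-batch update in place (hypothetical helper).

    centers: K x N list of lists, one row per center
    counts:  length-K list, samples absorbed by each center so far
    """
    k = len(centers)
    # Cache assignments against the centers as they are at the start of the batch.
    nearest = [min(range(k), key=lambda j: squared_distance(x, centers[j]))
               for x in batch]
    for x, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate
        centers[j] = [(1.0 - eta) * c + eta * v for c, v in zip(centers[j], x)]

# Toy data: two well-separated groups in 2 dimensions (K = 2, N = 2).
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
rng = random.Random(0)
centers = [list(data[0]), list(data[3])]  # K x N matrix of centers
counts = [0, 0]                           # K x 1 vector of counts
for _ in range(50):
    batch = [rng.choice(data) for _ in range(4)]
    mini_batch_update(centers, counts, batch)
```

Because the learning rate shrinks as a center's count grows, each center ends up as the running mean of the samples assigned to it, which is why the counts are stored next to the centers.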
The KMeans on Angel algorithm is as follows:
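As a rough illustration only, the sketch below simulates the parameter-server flow in a single process: a worker pulls the centers and counts, updates a local copy on its mini-batch, and pushes the differences back. `ToyParameterServer`, `pull`, `push`, and `worker_step` are hypothetical names, not Angel's API, and pushing deltas is an assumption about how worker updates might be combined.

```python
import random

class ToyParameterServer:
    """In-memory stand-in for a parameter server (not Angel's API)."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]  # K x N matrix
        self.counts = [0] * len(initial_centers)           # K x 1 vector

    def pull(self):
        """Return copies of the current centers and counts."""
        return [list(c) for c in self.centers], list(self.counts)

    def push(self, center_deltas, count_deltas):
        """Add worker-computed increments to the stored model."""
        for j, delta in enumerate(center_deltas):
            self.centers[j] = [c + d for c, d in zip(self.centers[j], delta)]
            self.counts[j] += count_deltas[j]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def worker_step(ps, batch):
    """One worker iteration: pull, update locally on a mini-batch, push deltas."""
    pulled_centers, pulled_counts = ps.pull()
    centers = [list(c) for c in pulled_centers]
    counts = list(pulled_counts)
    k = len(centers)
    nearest = [min(range(k), key=lambda j: squared_distance(x, pulled_centers[j]))
               for x in batch]
    for x, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [(1.0 - eta) * c + eta * v for c, v in zip(centers[j], x)]
    center_deltas = [[new - old for new, old in zip(centers[j], pulled_centers[j])]
                     for j in range(k)]
    count_deltas = [new - old for new, old in zip(counts, pulled_counts)]
    ps.push(center_deltas, count_deltas)

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
shards = [data[0::2], data[1::2]]  # two hypothetical workers, one shard each
rng = random.Random(0)
ps = ToyParameterServer(initial_centers=[data[0], data[3]])
for _ in range(10):                # 10 iterations
    for shard in shards:           # workers run one after another here
        worker_step(ps, [rng.choice(shard) for _ in range(2)])
```

In this toy version the workers run sequentially, so every pull sees the previous push. A real deployment runs workers concurrently, and how stale reads are handled depends on the synchronization settings.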
IO Parameters
Algorithm Parameters
Resource Parameters
Training Job
```shell
./bin/angel-submit \
--action.type=train \
--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner \
--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
--angel.train.data.path=$traindata \
--angel.save.model.path=$modelout \
--angel.output.path.deleteonexist=true \
--angel.log.path=$logpath \
--ml.data.type=libsvm \
--ml.model.type=T_DOUBLE_DENSE \
--ml.kmeans.center.num=$centerNum \
--ml.kmeans.c=0.15 \
--ml.epoch.num=10 \
--ml.feature.index.range=$featureNum \
--ml.feature.num=$featureNum \
--angel.workergroup.number=4 \
--angel.worker.memory.mb=5000 \
--angel.worker.task.number=1 \
--angel.ps.number=4 \
--angel.ps.memory.mb=5000 \
--angel.job.name=kmeans_train
```
IncTraining Job
```shell
./bin/angel-submit \
--action.type=inctrain \
--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner \
--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
--angel.train.data.path=$traindata \
--angel.load.model.path=$modelout \
--angel.save.model.path=$modelout \
--angel.output.path.deleteonexist=true \
--angel.log.path=$logpath \
--ml.data.type=libsvm \
--ml.model.type=T_DOUBLE_DENSE \
--ml.kmeans.center.num=$centerNum \
--ml.kmeans.c=0.15 \
--ml.epoch.num=10 \
--ml.feature.index.range=$featureNum \
--ml.feature.num=$featureNum \
--angel.workergroup.number=4 \
--angel.worker.memory.mb=5000 \
--angel.worker.task.number=1 \
--angel.ps.number=4 \
--angel.ps.memory.mb=5000 \
--angel.job.name=kmeans_inctrain
```
Prediction Job
```shell
./bin/angel-submit \
--action.type=predict \
--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner \
--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
--angel.predict.data.path=$predictdata \
--angel.load.model.path=$modelout \
--angel.predict.out.path=$predictout \
--angel.output.path.deleteonexist=true \
--angel.log.path=$logpath \
--ml.data.type=libsvm \
--ml.model.type=T_DOUBLE_DENSE \
--ml.kmeans.center.num=$centerNum \
--ml.feature.index.range=$featureNum \
--ml.feature.num=$featureNum \
--angel.workergroup.number=4 \
--angel.worker.memory.mb=5000 \
--angel.worker.task.number=1 \
--angel.ps.number=4 \
--angel.ps.memory.mb=5000 \
--angel.psagent.cache.sync.timeinterval.ms=500 \
--angel.job.name=kmeans_predict
```
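Each command above sets `--ml.data.type=libsvm` together with `--ml.feature.num=$featureNum`. To illustrate what such input looks like, the sketch below parses one LIBSVM-style line (`<label> <index>:<value> ...`) into a dense vector of length `feature_num`. It assumes 1-based feature indices, as in standard LIBSVM files; Angel's own reader may index differently, and `parse_libsvm_line` is a hypothetical helper rather than part of Angel.

```python
def parse_libsvm_line(line, feature_num):
    """Parse '<label> <index>:<value> ...' into (label, dense_vector).

    Assumes 1-based feature indices (standard LIBSVM convention).
    """
    tokens = line.split()
    label = float(tokens[0])
    vector = [0.0] * feature_num
    for token in tokens[1:]:
        index, value = token.split(":")
        position = int(index) - 1  # convert 1-based index to 0-based position
        if not 0 <= position < feature_num:
            raise ValueError(f"feature index {index} outside 1..{feature_num}")
        vector[position] = float(value)
    return label, vector

label, vector = parse_libsvm_line("1 1:0.5 3:2.0", feature_num=4)
```

Features that do not appear on a line are treated as zero, which is what makes the format compact for sparse data.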
[1] Sculley, D. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010), Raleigh, North Carolina, USA, April 2010, pp. 1177-1178.