KMeans

KMeans is a method that aims to cluster data in K groups of equal variance. The conventional KMeans algorithm has performance bottleneck; however，when implemented with PS, KMeans achieves the same level of accuracy with better performance.

1. Introduction

The KMeans algorithm assigns each data point to its nearest cluster, where the distance is measured between the data point and the cluster's centers. In general, Kmeans algorithm is implemented in an iterative way as shown below:

where, is the ith sample and is its nearest cluster; is the centers of the jth cluster.

Mini-batch KMeans

"Web-Scale K-Means Clustering"[1] proposes an improved KMeans algorithm to address the latency, scalability and sparsity requirements in user-facing web applications, using mini-batch optimization for training. As shown below:

2. Distributed Implementation on Angel

Model Storage

KMeans on Angel stores the K centers and K-centers counts on ParameterServer，using a K×N matrix represents the K centers and a K×1 vector represents the K-centers counts, where K is the number of clusters and N is the dimension of data，i.e. number of features.

Model Updating

KMeans on Angel is trained in an iterative way; during each iteration, the centers are updated by mini-batch.

Algorithm

KMeans on Angel algorithm as follows:

3. Execution & Performance

Input Format

Data format is set in "ml.data.type", which supports "libsvm", "dense" and "dummy" formats. For details, see Angel Data Format

Parameters

IO Parameters
- ml.feature.index.range: number of features
- ml.data.type: Angel Data Format, can be "dummy" or "libsvm"
- angel.train.data.path: input path for train
- angel.predict.data.path：input path for predict
- angel.save.model.path: save path for trained model
- angel.predict.out.path：output path for predict
- angel.log.path: save path for the log
Algorithm Parameters
- ml.epoch.num: number of iterations
- ml.minibatch.size：: samples for mini-batch
- ml.kmeans.center.num: K, number of clusters
- ml.kmeans.c：learning rate
Resource Parameters
- angel.workergroup.number: number of workers
- angel.worker.memory.mb: worker's memory requested in G
- angel.worker.task.number: number of tasks on each worker, default is 1
- angel.ps.number: number of PS
- angel.ps.memory.mb: PS's memory requested in G

Submit Command

Training Job

java

./bin/angel-submit \
	--action.type=train \
	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
	--angel.train.data.path=$traindata \
	--angel.save.model.path=$modelout \
	--angel.output.path.deleteonexist=true \
	--angel.log.path=$logpath \
	--ml.data.type=libsvm \
	--ml.model.type=T_DOUBLE_DENSE \
	--ml.kmeans.center.num=$centerNum  \
	--ml.kmeans.c=0.15 \
	--ml.epoch.num=10 \
	--ml.feature.index.range=$featureNum \
	--ml.feature.num=$featureNum \
	--angel.workergroup.number=4 \
	--angel.worker.memory.mb=5000  \
	--angel.worker.task.number=1 \
	--angel.ps.number=4 \
	--angel.ps.memory.mb=5000 \
	--angel.job.name=kmeans_train

IncTraining Job

java

./bin/angel-submit \
	--action.type=inctrain \
	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
	--angel.train.data.path=$traindata \
	--angel.load.model.path=$modelout \
	--angel.save.model.path=$modelout \
	--angel.output.path.deleteonexist=true \
	--angel.log.path=$logpath \
	--ml.data.type=libsvm \
	--ml.model.type=T_DOUBLE_DENSE \
	--ml.kmeans.center.num=$centerNum \
	--ml.kmeans.c=0.15 \
	--ml.epoch.num=10 \
	--ml.feature.index.range=$featureNum \
	--ml.feature.num=$featureNum \
	--angel.workergroup.number=4 \
	--angel.worker.memory.mb=5000  \
	--angel.worker.task.number=1 \
	--angel.ps.number=4 \
	--angel.ps.memory.mb=5000 \
	--angel.job.name=kmeans_inctrain

Prediction Job

java

./bin/angel-submit \
	--action.type=predict \
	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
	--angel.predict.data.path=$predictdata \
	--angel.load.model.path=$modelout \
	--angel.predict.out.path=$predictout \
	--angel.output.path.deleteonexist=true \
	--angel.log.path=$logpath \
	--ml.data.type=libsvm \
	--ml.model.type=T_DOUBLE_DENSE \
	--ml.kmeans.center.num=$centerNum \
	--ml.feature.index.range=$featureNum \
	--ml.feature.num=$featureNum \
	--angel.workergroup.number=4 \
	--angel.worker.memory.mb=5000  \
	--angel.worker.task.number=1 \
	--angel.ps.number=4 \
	--angel.ps.memory.mb=5000 \
	--angel.psagent.cache.sync.timeinterval.ms=500 \
	--angel.job.name=kmeans_predict

Performance

data：SVHN，3×10^3 features，7×10^4 samples
resource：
- Angel：executor：4，5G memory，1 task； ps：4，5G memory
Time of 100 epochs:
- Angel：45min

4. References

[1] Sculley D. Web-scale k-means clustering[C]// International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, Usa, April. DBLP, 2010:1177-1178.