Back to Spark

Clustering

docs/ml-clustering.md

4.1.19.0 KB
Original Source

This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

Table of Contents

  • This will become a table of contents (this text will be scraped). {:toc}

K-means

k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

Input Columns

<table> <thead> <tr> <th align="left">Param name</th> <th align="left">Type(s)</th> <th align="left">Default</th> <th align="left">Description</th> </tr> </thead> <tbody> <tr> <td>featuresCol</td> <td>Vector</td> <td>"features"</td> <td>Feature vector</td> </tr> </tbody> </table>

Output Columns

<table> <thead> <tr> <th align="left">Param name</th> <th align="left">Type(s)</th> <th align="left">Default</th> <th align="left">Description</th> </tr> </thead> <tbody> <tr> <td>predictionCol</td> <td>Int</td> <td>"prediction"</td> <td>Predicted cluster center</td> </tr> </tbody> </table>

Examples

<div class="codetabs"> <div data-lang="python" markdown="1"> Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.KMeans.html) for more details.

{% include_example python/ml/kmeans_example.py %}

</div> <div data-lang="scala" markdown="1"> Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}

</div> <div data-lang="java" markdown="1"> Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}

</div> <div data-lang="r" markdown="1">

Refer to the R API docs for more details.

{% include_example r/ml/kmeans.R %}

</div> </div>

Latent Dirichlet allocation (LDA)

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates a LDAModel as the base model. Expert users may cast a LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

Examples

<div class="codetabs"> <div data-lang="python" markdown="1">

Refer to the Python API docs for more details.

{% include_example python/ml/lda_example.py %}

</div> <div data-lang="scala" markdown="1">

Refer to the Scala API docs for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}

</div> <div data-lang="java" markdown="1">

Refer to the Java API docs for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}

</div> <div data-lang="r" markdown="1">

Refer to the R API docs for more details.

{% include_example r/ml/lda.R %}

</div> </div>

Bisecting k-means

Bisecting k-means is a kind of hierarchical clustering using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.

Examples

<div class="codetabs"> <div data-lang="python" markdown="1"> Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.BisectingKMeans.html) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}

</div> <div data-lang="scala" markdown="1"> Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}

</div> <div data-lang="java" markdown="1"> Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}

</div> <div data-lang="r" markdown="1">

Refer to the R API docs for more details.

{% include_example r/ml/bisectingKmeans.R %}

</div> </div>

Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.

GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.

Input Columns

<table> <thead> <tr> <th align="left">Param name</th> <th align="left">Type(s)</th> <th align="left">Default</th> <th align="left">Description</th> </tr> </thead> <tbody> <tr> <td>featuresCol</td> <td>Vector</td> <td>"features"</td> <td>Feature vector</td> </tr> </tbody> </table>

Output Columns

<table> <thead> <tr> <th align="left">Param name</th> <th align="left">Type(s)</th> <th align="left">Default</th> <th align="left">Description</th> </tr> </thead> <tbody> <tr> <td>predictionCol</td> <td>Int</td> <td>"prediction"</td> <td>Predicted cluster center</td> </tr> <tr> <td>probabilityCol</td> <td>Vector</td> <td>"probability"</td> <td>Probability of each cluster</td> </tr> </tbody> </table>

Examples

<div class="codetabs"> <div data-lang="python" markdown="1"> Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}

</div> <div data-lang="scala" markdown="1"> Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}

</div> <div data-lang="java" markdown="1"> Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}

</div> <div data-lang="r" markdown="1">

Refer to the R API docs for more details.

{% include_example r/ml/gaussianMixture.R %}

</div> </div>

Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

spark.ml's PowerIterationClustering implementation takes the following parameters:

  • k: the number of clusters to create
  • initMode: param for the initialization algorithm
  • maxIter: param for maximum number of iterations
  • srcCol: param for the name of the input column for source vertex IDs
  • dstCol: name of the input column for destination vertex IDs
  • weightCol: Param for weight column name

Examples

<div class="codetabs"> <div data-lang="python" markdown="1"> Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.PowerIterationClustering.html) for more details.

{% include_example python/ml/power_iteration_clustering_example.py %}

</div> <div data-lang="scala" markdown="1"> Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}

</div> <div data-lang="java" markdown="1"> Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}

</div> <div data-lang="r" markdown="1">

Refer to the R API docs for more details.

{% include_example r/ml/powerIterationClustering.R %}

</div> </div>