doc/modules/manifold.rst
.. currentmodule:: sklearn.manifold
.. _manifold:
=================
Manifold learning
=================

| Look for the bare necessities
|     The simple bare necessities
| Forget about your worries and your strife
| I mean the bare necessities
|     Old Mother Nature's recipes
|     That bring the bare necessities of life
|
|             -- Baloo's song [The Jungle Book]

.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_compare_methods_001.png
   :target: ../auto_examples/manifold/plot_compare_methods.html
   :align: center
   :scale: 70%

.. |manifold_img3| image:: ../auto_examples/manifold/images/sphx_glr_plot_compare_methods_003.png
   :target: ../auto_examples/manifold/plot_compare_methods.html
   :scale: 60%

.. |manifold_img4| image:: ../auto_examples/manifold/images/sphx_glr_plot_compare_methods_004.png
   :target: ../auto_examples/manifold/plot_compare_methods.html
   :scale: 60%

.. |manifold_img5| image:: ../auto_examples/manifold/images/sphx_glr_plot_compare_methods_005.png
   :target: ../auto_examples/manifold/plot_compare_methods.html
   :scale: 60%

.. |manifold_img6| image:: ../auto_examples/manifold/images/sphx_glr_plot_compare_methods_006.png
   :target: ../auto_examples/manifold/plot_compare_methods.html
   :scale: 60%

.. centered:: |manifold_img3| |manifold_img4| |manifold_img5| |manifold_img6|
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
.. |digits_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_001.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. |projected_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_002.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. centered:: |digits_img| |projected_img|
To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an "interesting" linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.
.. |PCA_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_003.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. |LDA_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_004.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. centered:: |PCA_img| |LDA_img|
Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
.. rubric:: Examples
* See :ref:`sphx_glr_auto_examples_manifold_plot_lle_digits.py` for an example of
  dimensionality reduction on handwritten digits.

* See :ref:`sphx_glr_auto_examples_manifold_plot_compare_methods.py` for an example of
  dimensionality reduction on a toy "S-curve" dataset.

* See :ref:`sphx_glr_auto_examples_applications_plot_stock_market.py` for an example of
  using manifold learning to map the stock market structure based on historical stock
  prices.

* See :ref:`sphx_glr_auto_examples_manifold_plot_manifold_sphere.py` for an example of
  manifold learning techniques applied to a spherical dataset.

* See :ref:`sphx_glr_auto_examples_manifold_plot_swissroll.py` for an example of using
  manifold learning techniques on a Swiss Roll dataset.

The manifold learning implementations available in scikit-learn are summarized below.

.. _isomap:

Isomap
======

One of the earliest approaches to manifold learning is the Isomap
algorithm, short for Isometric Mapping. Isomap can be viewed as an
extension of Multi-dimensional Scaling (MDS) or Kernel PCA.
Isomap seeks a lower-dimensional embedding which maintains geodesic
distances between all points. Isomap can be performed with the object
:class:`Isomap`.
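
A minimal usage sketch (the S-curve dataset and the parameter values below are
illustrative, not prescriptive)::

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import Isomap

    # a 3D S-curve flattened to 2 components; n_neighbors controls the
    # neighborhood graph used to estimate geodesic distances
    X, _ = make_s_curve(n_samples=1000, random_state=0)
    X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    print(X_iso.shape)  # (1000, 2)
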
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_005.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Complexity

  The Isomap algorithm comprises three stages:

  1. **Nearest neighbor search.** Isomap uses
     :class:`~sklearn.neighbors.BallTree` for efficient neighbor search.
     The cost is approximately :math:`O[D \log(k) N \log(N)]`, for :math:`k`
     nearest neighbors of :math:`N` points in :math:`D` dimensions.

  2. **Shortest-path graph search.** The most efficient known algorithms
     for this are Dijkstra's Algorithm, which is approximately
     :math:`O[N^2(k + \log(N))]`, or the Floyd-Warshall algorithm, which
     is :math:`O[N^3]`. The algorithm can be selected by the user with
     the ``path_method`` keyword of :class:`Isomap`. If unspecified, the code
     attempts to choose the best algorithm for the input data.

  3. **Partial eigenvalue decomposition.** The embedding is encoded in the
     eigenvectors corresponding to the :math:`d` largest eigenvalues of the
     :math:`N \times N` isomap kernel. For a dense solver, the cost is
     approximately :math:`O[d N^2]`. This cost can often be improved using
     the ``ARPACK`` solver. The eigensolver can be specified by the user
     with the ``eigen_solver`` keyword of :class:`Isomap`. If unspecified, the
     code attempts to choose the best algorithm for the input data.

  The overall complexity of Isomap is
  :math:`O[D \log(k) N \log(N)] + O[N^2(k + \log(N))] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"A global geometric framework for nonlinear dimensionality reduction"
  <http://science.sciencemag.org/content/290/5500/2319.full>`_
  Tenenbaum, J.B.; De Silva, V.; & Langford, J.C. Science 290 (5500)

.. _locally_linear_embedding:

Locally Linear Embedding
========================

Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.
Locally linear embedding can be performed with function
:func:`locally_linear_embedding` or its object-oriented counterpart
:class:`LocallyLinearEmbedding`.
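
A minimal sketch of both interfaces (the Swiss roll dataset and parameter
values are illustrative)::

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding, locally_linear_embedding

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # object-oriented interface
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    X_lle = lle.fit_transform(X)

    # equivalent functional interface; also returns the reconstruction error
    X_lle_func, err = locally_linear_embedding(
        X, n_neighbors=12, n_components=2, random_state=0)
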
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_006.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Complexity

  The standard LLE algorithm comprises three stages:

  1. **Nearest Neighbors Search**. See discussion under Isomap above.

  2. **Weight Matrix Construction**. :math:`O[D N k^3]`.
     The construction of the LLE weight matrix involves the solution of a
     :math:`k \times k` linear equation for each of the :math:`N` local
     neighborhoods.

  3. **Partial Eigenvalue Decomposition**. See discussion under Isomap above.

  The overall complexity of standard LLE is
  :math:`O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"Nonlinear dimensionality reduction by locally linear embedding"
  <http://www.sciencemag.org/content/290/5500/2323.full>`_
  Roweis, S. & Saul, L. Science 290:2323 (2000)

Modified Locally Linear Embedding
=================================

One well-known issue with LLE is the regularization problem. When the number
of neighbors is greater than the number of input dimensions, the matrix
defining each local neighborhood is rank-deficient. To address this, standard
LLE applies an arbitrary regularization parameter :math:`r`, which is chosen
relative to the trace of the local weight matrix. Though it can be shown
formally that as :math:`r \to 0`, the solution converges to the desired
embedding, there is no guarantee that the optimal solution will be found
for :math:`r > 0`. This problem manifests itself in embeddings which distort
the underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight
vectors in each neighborhood. This is the essence of modified locally
linear embedding (MLLE). MLLE can be performed with function
:func:`locally_linear_embedding` or its object-oriented counterpart
:class:`LocallyLinearEmbedding`, with the keyword ``method = 'modified'``.
It requires ``n_neighbors > n_components``.
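
A minimal sketch (the Swiss roll dataset and parameter values are illustrative)::

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # method='modified' requires n_neighbors > n_components
    mlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                  method='modified', random_state=0)
    X_mlle = mlle.fit_transform(X)
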
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_007.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Complexity

  The MLLE algorithm comprises three stages:

  1. **Nearest Neighbors Search**. Same as standard LLE.

  2. **Weight Matrix Construction**. Approximately
     :math:`O[D N k^3] + O[N (k-D) k^2]`. The first term is exactly equivalent
     to that of standard LLE. The second term has to do with constructing the
     weight matrix from multiple weights. In practice, the added cost of
     constructing the MLLE weight matrix is relatively small compared to the
     cost of stages 1 and 3.

  3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

  The overall complexity of MLLE is
  :math:`O[D \log(k) N \log(N)] + O[D N k^3] + O[N (k-D) k^2] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"MLLE: Modified Locally Linear Embedding Using Multiple Weights"
  <https://papers.nips.cc/paper_files/paper/2006/file/fb2606a5068901da92473666256e6e5b-Paper.pdf>`_
  Zhang, Z. & Wang, J.

Hessian Eigenmapping
====================

Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method
of solving the regularization problem of LLE. It revolves around a
hessian-based quadratic form at each neighborhood which is used to recover
the locally linear structure. Though other implementations note its poor
scaling with data size, sklearn implements some algorithmic
improvements which make its cost comparable to that of other LLE variants
for small output dimension. HLLE can be performed with function
:func:`locally_linear_embedding` or its object-oriented counterpart
:class:`LocallyLinearEmbedding`, with the keyword ``method = 'hessian'``.
It requires ``n_neighbors > n_components * (n_components + 3) / 2``.
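
A minimal sketch showing the neighbor constraint in practice (dataset and
values are illustrative)::

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # for n_components=2 the constraint requires n_neighbors > 5
    hlle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                  method='hessian', random_state=0)
    X_hlle = hlle.fit_transform(X)
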
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_008.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Complexity

  The HLLE algorithm comprises three stages:

  1. **Nearest Neighbors Search**. Same as standard LLE.

  2. **Weight Matrix Construction**. Approximately
     :math:`O[D N k^3] + O[N d^6]`. The first term reflects a similar
     cost to that of standard LLE. The second term comes from a QR
     decomposition of the local hessian estimator.

  3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

  The overall complexity of standard HLLE is
  :math:`O[D \log(k) N \log(N)] + O[D N k^3] + O[N d^6] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data"
  <http://www.pnas.org/content/100/10/5591>`_
  Donoho, D. & Grimes, C. Proc Natl Acad Sci USA. 100:5591 (2003)

.. _spectral_embedding:

Spectral Embedding
==================

Spectral Embedding is an approach to calculating a non-linear embedding.
Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional
representation of the data using a spectral decomposition of the graph
Laplacian. The graph generated can be considered as a discrete approximation of
the low dimensional manifold in the high dimensional space. Minimization of a
cost function based on the graph ensures that points close to each other on
the manifold are mapped close to each other in the low dimensional space,
preserving local distances. Spectral embedding can be performed with the
function :func:`spectral_embedding` or its object-oriented counterpart
:class:`SpectralEmbedding`.
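
A minimal sketch (the digits dataset and parameter values are illustrative)::

    from sklearn.datasets import load_digits
    from sklearn.manifold import SpectralEmbedding

    X, _ = load_digits(return_X_y=True)

    # affinity='nearest_neighbors' builds a k-NN graph whose Laplacian is
    # then decomposed to obtain the embedding
    se = SpectralEmbedding(n_components=2, affinity='nearest_neighbors',
                           n_neighbors=10, random_state=0)
    X_se = se.fit_transform(X)
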
.. dropdown:: Complexity

  The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:

  1. **Weighted Graph Construction**. Transform the raw input data into a
     graph representation using an affinity (adjacency) matrix.

  2. **Graph Laplacian Construction**. The unnormalized graph Laplacian
     is constructed as :math:`L = D - A` and the normalized one as
     :math:`L = D^{-\frac{1}{2}} (D - A) D^{-\frac{1}{2}}`.

  3. **Partial Eigenvalue Decomposition**. Eigenvalue decomposition is
     done on the graph Laplacian.

  The overall complexity of spectral embedding is
  :math:`O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"Laplacian Eigenmaps for Dimensionality Reduction and Data Representation"
  <https://www2.imm.dtu.dk/projects/manifold/Papers/Laplacian.pdf>`_
  M. Belkin, P. Niyogi, Neural Computation, June 2003; 15 (6):1373-1396

Local Tangent Space Alignment
=============================

Though not technically a variant of LLE, Local tangent space alignment (LTSA)
is algorithmically similar enough to LLE that it can be put in this category.
Rather than focusing on preserving neighborhood distances as in LLE, LTSA
seeks to characterize the local geometry at each neighborhood via its
tangent space, and performs a global optimization to align these local
tangent spaces to learn the embedding. LTSA can be performed with function
:func:`locally_linear_embedding` or its object-oriented counterpart
:class:`LocallyLinearEmbedding`, with the keyword ``method = 'ltsa'``.
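
A minimal sketch (dataset and parameter values are illustrative)::

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_s_curve(n_samples=1000, random_state=0)

    ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                  method='ltsa', random_state=0)
    X_ltsa = ltsa.fit_transform(X)
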
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_009.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Complexity

  The LTSA algorithm comprises three stages:

  1. **Nearest Neighbors Search**. Same as standard LLE.

  2. **Weight Matrix Construction**. Approximately
     :math:`O[D N k^3] + O[k^2 d]`. The first term reflects a similar
     cost to that of standard LLE.

  3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

  The overall complexity of standard LTSA is
  :math:`O[D \log(k) N \log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2]`.
  * :math:`N` : number of training data points
  * :math:`D` : input dimension
  * :math:`k` : number of nearest neighbors
  * :math:`d` : output dimension

.. rubric:: References

* `"Principal manifolds and nonlinear dimensionality reduction via tangent space alignment"
  <https://arxiv.org/abs/cs/0212008>`_
  Zhang, Z. & Zha, H. Journal of Shanghai Univ. 8:406 (2004)

.. _multidimensional_scaling:

Multi-dimensional Scaling (MDS)
===============================

`Multidimensional scaling <https://en.wikipedia.org/wiki/Multidimensional_scaling>`_
(:class:`MDS` and :class:`ClassicalMDS`) seeks a low-dimensional
representation of the data in which the distances approximate the
distances in the original high-dimensional space.

In general, MDS is a technique used for analyzing dissimilarity data. It attempts to model dissimilarities as distances in a Euclidean space. The data can be ratings of dissimilarity between objects, interaction frequencies of molecules, or trade indices between countries.
There exist three types of MDS algorithm: metric, non-metric, and classical. In
scikit-learn, the class :class:`MDS` implements metric and non-metric MDS,
while :class:`ClassicalMDS` implements classical MDS. In metric MDS,
the distances in the embedding space are set as
close as possible to the dissimilarity data. In the non-metric
version, the algorithm will try to preserve the order of the distances, and
hence seek for a monotonic relationship between the distances in the embedded
space and the input dissimilarities. Finally, classical MDS is close to PCA
and, instead of approximating distances, approximates pairwise scalar products,
which is an easier optimization problem with an analytic solution
in terms of eigendecomposition.
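
A minimal sketch of the metric and non-metric variants (the digits subset and
parameter values are illustrative; ``normalized_stress`` is only relevant to
the non-metric case)::

    from sklearn.datasets import load_digits
    from sklearn.manifold import MDS

    X, _ = load_digits(return_X_y=True)
    X = X[:100]  # keep the example fast; MDS cost grows quickly with n_samples

    # metric MDS (the default) fits the distances directly
    X_metric = MDS(n_components=2, random_state=0).fit_transform(X)

    # non-metric MDS only tries to preserve the ordering of the dissimilarities
    X_nonmetric = MDS(n_components=2, metric=False,
                      normalized_stress=True, random_state=0).fit_transform(X)
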
.. |MMDS_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_010.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. |NMDS_img| image:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_011.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :scale: 50

.. centered:: |MMDS_img| |NMDS_img|
Let :math:`\delta_{ij}` be the dissimilarity matrix between the
:math:`n` input points (possibly arising as some pairwise distances
:math:`d_{ij}(X)` between the coordinates :math:`X` of the input points).
Disparities :math:`\hat{d}_{ij} = f(\delta_{ij})` are some transformation of
the dissimilarities. The MDS objective, called the raw stress, is then
defined by :math:`\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2`,
where :math:`d_{ij}(Z)` are the pairwise distances between the
coordinates :math:`Z` of the embedded points.

.. dropdown:: Metric MDS

  In the metric :class:`MDS` model (sometimes also called absolute MDS),
  disparities are simply equal to the input dissimilarities
  :math:`\hat{d}_{ij} = \delta_{ij}`.

.. dropdown:: Non-metric MDS

  Non-metric :class:`MDS` focuses on the ordination of the data. If
  :math:`\delta_{ij} > \delta_{kl}`, then the embedding
  seeks to enforce :math:`d_{ij}(Z) > d_{kl}(Z)`. A simple algorithm
  to enforce proper ordination is to use an
  isotonic regression of :math:`d_{ij}(Z)` on :math:`\delta_{ij}`, yielding
  disparities :math:`\hat{d}_{ij}` that are a monotonic transformation
  of dissimilarities :math:`\delta_{ij}` and hence having the same ordering.
  This is done repeatedly after every step of the optimization algorithm.
  In order to avoid the trivial solution where all embedding points are
  overlapping, the disparities :math:`\hat{d}_{ij}` are normalized.

  Note that since we only care about relative ordering, our objective should be
  invariant to simple translation and scaling; however, the stress used in
  metric MDS is sensitive to scaling. To address this, non-metric MDS returns
  normalized stress, also known as Stress-1, defined as

  .. math::
    \sqrt{\frac{\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2}{\sum_{i < j} d_{ij}(Z)^2}}.

  Normalized Stress-1 is returned if ``normalized_stress=True``.

.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_mds_001.png
   :target: ../auto_examples/manifold/plot_mds.html
   :align: center
   :scale: 60

Classical MDS, also known as
principal coordinates analysis (PCoA) or Torgerson's scaling, is implemented
in the separate :class:`ClassicalMDS` class. Classical MDS replaces the stress
loss function with a different loss function called strain, which has an
exact solution in terms of eigendecomposition.
If the dissimilarity matrix consists of the pairwise
Euclidean distances between some vectors, then classical MDS is equivalent
to PCA applied to this set of vectors.
.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_012.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

Formally, the loss function of classical MDS (strain) is given by
.. math::
  \frac{\|B - ZZ^\top\|_F}{\|B\|_F} =
  \sqrt{\frac{\sum_{i,j} (b_{ij} - z_i^\top z_j)^2}{\sum_{i,j} b_{ij}^2}},

where :math:`Z` is the :math:`n \times d` embedding matrix whose rows are
:math:`z_i^\top`, :math:`\|\cdot\|_F` denotes the Frobenius norm, and
:math:`B` is the Gram matrix with elements :math:`b_{ij}`,
given by :math:`B = -\frac{1}{2} C \Delta C`.
Here :math:`C \Delta C` is the double-centered matrix of squared dissimilarities,
with :math:`\Delta` being the matrix of squared input dissimilarities
:math:`\delta^2_{ij}`, and :math:`C = I - J/n` is the centering matrix
(identity matrix minus a matrix of all ones divided by :math:`n`).
This can be minimized exactly using the eigendecomposition of :math:`B`.
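
The double centering and eigendecomposition above translate directly into a few
lines of NumPy. This is only a schematic illustration on a small made-up
dissimilarity matrix, not the :class:`ClassicalMDS` implementation itself::

    import numpy as np

    # a small symmetric dissimilarity matrix (made-up values, zero diagonal)
    delta = np.array([[0.0, 2.0, 3.0, 4.0],
                      [2.0, 0.0, 2.5, 3.0],
                      [3.0, 2.5, 0.0, 1.5],
                      [4.0, 3.0, 1.5, 0.0]])

    n = delta.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n   # centering matrix C = I - J/n
    B = -0.5 * C @ (delta ** 2) @ C       # double-centered Gram matrix

    # embedding = eigenvectors of B scaled by the square roots of the
    # d largest (non-negative) eigenvalues
    eigvals, eigvecs = np.linalg.eigh(B)  # ascending eigenvalues
    d = 2
    idx = np.argsort(eigvals)[::-1][:d]
    Z = eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))
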
.. rubric:: References
"More on Multidimensional Scaling and Unfolding in R: smacof Version 2" <https://www.jstatsoft.org/article/view/v102i10>_
Mair P, Groenen P., de Leeuw J. Journal of Statistical Software (2022)
"Modern Multidimensional Scaling - Theory and Applications" <https://www.springer.com/fr/book/9780387251509>_
Borg, I.; Groenen P. Springer Series in Statistics (1997)
"Nonmetric multidimensional scaling: a numerical method" <http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964b.pdf>_
Kruskal, J. Psychometrika, 29 (1964)
"Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis" <http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964a.pdf>_
Kruskal, J. Psychometrika, 29, (1964)
.. _t_sne:

t-distributed Stochastic Neighbor Embedding (t-SNE)
===================================================

t-SNE (:class:`TSNE`) converts affinities of data points to probabilities.
The affinities in the original space are represented by Gaussian joint
probabilities and the affinities in the embedded space are represented by
Student's t-distributions. This allows t-SNE to be particularly sensitive
to local structure and has a few other advantages over existing techniques:

* Revealing the structure at many scales on a single map
* Revealing data that lie in multiple, different, manifolds or clusters
* Reducing the tendency to crowd points together at the center

While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once as is the case in the digits dataset.
The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
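
A minimal usage sketch (the digits dataset and parameter values are
illustrative)::

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)

    # 2D embedding for visualization; perplexity and init are the usual knobs
    tsne = TSNE(n_components=2, perplexity=30, init='pca',
                learning_rate='auto', random_state=0)
    X_tsne = tsne.fit_transform(X)
    print(tsne.kl_divergence_)  # final KL divergence; compare across seeds
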
The disadvantages to using t-SNE are roughly:

* t-SNE is computationally expensive, and can take several hours on
  million-sample datasets where PCA will finish in seconds or minutes.
* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
* The algorithm is stochastic, and multiple restarts with different seeds can
  yield different embeddings. However, it is perfectly legitimate to pick the
  embedding with the least error.
* Global structure is not explicitly preserved. This problem is mitigated by
  initializing points with PCA (using ``init='pca'``).

.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_015.png
   :target: ../auto_examples/manifold/plot_lle_digits.html
   :align: center
   :scale: 50

.. dropdown:: Optimizing t-SNE

  The main purpose of t-SNE is visualization of high-dimensional data. Hence,
  it works best when the data will be embedded on two or three dimensions.

  Optimizing the KL divergence can be a little bit tricky sometimes. There are
  five parameters that control the optimization of t-SNE and therefore possibly
  the quality of the resulting embedding:

  * perplexity
  * early exaggeration factor
  * learning rate
  * maximum number of iterations
  * angle (not used in the exact method)

  The perplexity is defined as :math:`k = 2^{(S)}` where :math:`S` is the Shannon
  entropy of the conditional probability distribution. The perplexity of a
  :math:`k`-sided die is :math:`k`, so that :math:`k` is effectively the number of
  nearest neighbors t-SNE considers when generating the conditional probabilities.
  Larger perplexities lead to more nearest neighbors and less sensitivity to small
  structure. Conversely a lower perplexity considers a smaller number of
  neighbors, and thus ignores more global information in favour of the
  local neighborhood. As dataset sizes get larger more points will be
  required to get a reasonable sample of the local neighborhood, and hence
  larger perplexities may be required. Similarly noisier datasets will require
  larger perplexity values to encompass enough local neighbors to see beyond
  the background noise.

  The maximum number of iterations is usually high enough and does not need
  any tuning. The optimization consists of two phases: the early exaggeration
  phase and the final optimization. During early exaggeration the joint
  probabilities in the original space will be artificially increased by
  multiplication with a given factor. Larger factors result in larger gaps
  between natural clusters in the data. If the factor is too high, the KL
  divergence could increase during this phase. Usually it does not have to be
  tuned. A critical parameter is the learning rate. If it is too low, gradient
  descent will get stuck in a bad local minimum. If it is too high, the KL
  divergence will increase during optimization. A heuristic suggested in
  Belkina et al. (2019) is to set the learning rate to the sample size
  divided by the early exaggeration factor. We implement this heuristic
  as the ``learning_rate='auto'`` argument. More tips can be found in
  Laurens van der Maaten's FAQ (see references). The last parameter, ``angle``,
  is a tradeoff between performance and accuracy. Larger angles imply that we
  can approximate larger regions by a single point, leading to better speed
  but less accurate results.

"How to Use t-SNE Effectively" <https://distill.pub/2016/misread-tsne/>_
provides a good discussion of the effects of the various parameters, as well
as interactive plots to explore the effects of different parameters.
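
  As a rough sketch of this heuristic (the sample count and factor below are
  illustrative, and ``learning_rate='auto'`` lets the estimator apply a closely
  related rule itself)::

    from sklearn.manifold import TSNE

    n_samples = 10_000
    early_exaggeration = 12.0  # TSNE's default exaggeration factor

    # Belkina et al. (2019): learning rate ~ sample size / exaggeration factor
    heuristic_lr = n_samples / early_exaggeration

    # or simply let scikit-learn choose a learning rate based on the same idea
    tsne = TSNE(n_components=2, learning_rate='auto', init='pca', random_state=0)
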
.. dropdown:: Barnes-Hut t-SNE

  The Barnes-Hut t-SNE that has been implemented here is usually much slower than
  other manifold learning algorithms. The optimization is quite difficult
  and the computation of the gradient is :math:`O[d N \log(N)]`, where :math:`d`
  is the number of output dimensions and :math:`N` is the number of samples. The
  Barnes-Hut method improves on the exact method, where t-SNE complexity is
  :math:`O[d N^2]`, but has several other notable differences:

  * The Barnes-Hut implementation only works when the target dimensionality is 3
    or less. The 2D case is typical when building visualizations.
  * Barnes-Hut only works with dense input data. Sparse data matrices can only be
    embedded with the exact method, or can be approximated by a dense low rank
    projection, for instance using :class:`~sklearn.decomposition.PCA`.
  * Barnes-Hut is an approximation of the exact method. The approximation is
    parameterized with the ``angle`` parameter, therefore the ``angle`` parameter
    is unused when ``method="exact"``.
  * Barnes-Hut is significantly more scalable, and can be used to embed hundreds
    of thousands of data points, while the exact method can handle thousands of
    samples before becoming computationally intractable.

  For visualization purpose (which is the main use case of t-SNE), using the
  Barnes-Hut method is strongly recommended. The exact t-SNE method is useful
  for checking the theoretical properties of the embedding, possibly in higher
  dimensional space, but is limited to small datasets due to computational
  constraints.
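
  A minimal sketch of switching between the two solvers (dataset and sizes are
  illustrative)::

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)

    # default approximate solver; angle trades accuracy for speed
    X_bh = TSNE(n_components=2, method='barnes_hut', angle=0.5,
                random_state=0).fit_transform(X)

    # exact solver: O(d N^2), practical only for small datasets
    X_exact = TSNE(n_components=2, method='exact',
                   random_state=0).fit_transform(X[:500])
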
  Also note that the digits labels roughly match the natural grouping found by
  t-SNE, while the linear 2D projection of the PCA model yields a representation
  where label regions largely overlap. This is a strong clue that this data can
  be well separated by non-linear methods that focus on the local structure (e.g.
  an SVM with a Gaussian RBF kernel). However, failing to visualize well
  separated homogeneously labeled groups with t-SNE in 2D does not necessarily
  imply that the data cannot be correctly classified by a supervised model. It
  might be the case that 2 dimensions are not high enough to accurately
  represent the internal structure of the data.

.. rubric:: References
"Visualizing High-Dimensional Data Using t-SNE" <https://jmlr.org/papers/v9/vandermaaten08a.html>_
van der Maaten, L.J.P.; Hinton, G. Journal of Machine Learning Research (2008)
"t-Distributed Stochastic Neighbor Embedding" <https://lvdmaaten.github.io/tsne/>_ van der Maaten, L.J.P.
"Accelerating t-SNE using Tree-Based Algorithms" <https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf>_
van der Maaten, L.J.P.; Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
"Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets" <https://www.nature.com/articles/s41467-019-13055-y>_
Belkina, A.C., Ciccolella, C.O., Anno, R., Halpert, R., Spidlen, J.,
Snyder-Cappione, J.E., Nature Communications 10, 5415 (2019).
Tips on practical use
=====================

Make sure the same scale is used over all features. Because manifold
learning methods are based on a nearest-neighbor search, the algorithm
may perform poorly otherwise. See :ref:`StandardScaler <preprocessing_scaler>`
for convenient ways of scaling heterogeneous data.
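
For example, a scaler can be chained in front of any of the embeddings above (a
minimal sketch; the estimator and its parameters are interchangeable)::

    from sklearn.datasets import load_digits
    from sklearn.manifold import Isomap
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_digits(return_X_y=True)

    # scaling first keeps the nearest-neighbor search from being dominated
    # by features with large numeric ranges
    embedder = make_pipeline(StandardScaler(),
                             Isomap(n_neighbors=10, n_components=2))
    X_embedded = embedder.fit_transform(X)
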
The reconstruction error computed by each routine can be used to choose
the optimal output dimension. For a :math:`d`-dimensional manifold embedded
in a :math:`D`-dimensional parameter space, the reconstruction error will
decrease as ``n_components`` is increased until ``n_components == d``.

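
A sketch of such a scan using :class:`Isomap`'s ``reconstruction_error()``
method (the candidate dimensions below are illustrative)::

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import Isomap

    X, _ = make_s_curve(n_samples=1000, random_state=0)

    # the error keeps dropping until n_components reaches the intrinsic dimension
    for n in (1, 2, 3):
        iso = Isomap(n_neighbors=10, n_components=n).fit(X)
        print(n, iso.reconstruction_error())
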
Note that noisy data can "short-circuit" the manifold, in essence acting as a bridge between parts of the manifold that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of research.
Certain input configurations can lead to singular weight matrices, for
example when more than two points in the dataset are identical, or when
the data is split into disjointed groups. In this case, ``solver='arpack'``
will fail to find the null space. The easiest way to address this is to
use ``solver='dense'``, which will work on a singular matrix, though it may
be very slow depending on the number of input points. Alternatively, one
can attempt to understand the source of the singularity: if it is due to
disjoint sets, increasing ``n_neighbors`` may help. If it is due to
identical points in the dataset, removing these points may help.
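
For example, with :class:`LocallyLinearEmbedding` the solver is selected through
the ``eigen_solver`` parameter (a sketch with illustrative values)::

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=500, random_state=0)

    # 'dense' tolerates singular weight matrices but scales poorly with n_samples
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                 eigen_solver='dense')
    X_lle = lle.fit_transform(X)
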
.. seealso::

   :ref:`random_trees_embedding` can also be useful to derive non-linear
   representations of feature space, but it does not perform
   dimensionality reduction.