scientific-skills/umap-learn/references/api_reference.md
umap.UMAP(n_neighbors=15, n_components=2, metric='euclidean', n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, metric_kwds=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None, precomputed_knn=(None, None, None))
Find a low-dimensional embedding that approximates the underlying manifold of the data.
Size of the local neighborhood used for manifold approximation. Larger values result in more global views of the manifold, while smaller values preserve more local structure. Generally in the range 2 to 100.
Tuning guidance:
Dimension of the embedding space. Unlike t-SNE, UMAP scales well with increasing embedding dimensions.
Common values:
Distance metric to use. Accepts:
Common metrics:
- 'euclidean': Standard Euclidean distance (default)
- 'manhattan': L1 distance
- 'cosine': Cosine distance (good for text/document vectors)
- 'correlation': Correlation distance
- 'hamming': Hamming distance (for binary data)
- 'jaccard': Jaccard distance (for binary/set data)
- 'dice': Dice distance
- 'canberra': Canberra distance
- 'braycurtis': Bray-Curtis distance
- 'chebyshev': Chebyshev distance
- 'minkowski': Minkowski distance (specify p with metric_kwds)
- 'precomputed': Use precomputed distance matrix

Effective minimum distance between embedded points. Controls how tightly points are packed together. Smaller values result in clumpier embeddings.
Tuning guidance:
Effective scale of embedded points. Combined with min_dist to control clumped vs. spread-out embeddings. Determines how spread out the clusters are in the embedding space.
Number of training epochs. If None, the value is chosen from the dataset size: larger datasets get fewer epochs (typically 200) and smaller datasets get more (typically 500).
Manual tuning:
Initial learning rate for the SGD optimizer. Higher values lead to faster convergence but may overshoot optimal solutions.
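To make the role of the initial learning rate concrete, the sketch below assumes the step size decays linearly from learning_rate toward zero over the epochs, which is how the reference umap-learn layout optimizer appears to schedule it. The helper `learning_rate_schedule` is hypothetical and not part of the library.

```python
def learning_rate_schedule(initial_alpha, n_epochs):
    """Yield the step size for each epoch under an assumed linear decay."""
    for epoch in range(n_epochs):
        # The step shrinks linearly and would reach zero after the last epoch.
        yield initial_alpha * (1.0 - epoch / n_epochs)


# Step sizes for a toy run of four epochs starting at learning_rate=1.0
alphas = list(learning_rate_schedule(1.0, 4))
```

Under this schedule a larger learning_rate scales every step by the same factor, so early epochs move points more aggressively while late epochs still settle.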
Initialization method for the embedding:
- 'spectral': Use spectral embedding (default, usually best)
- 'random': Random initialization
- 'pca': Initialize with PCA

Number of nearest neighbors assumed to be locally connected. Higher values give more connected manifolds.
Interpolation between union and intersection when constructing fuzzy set unions. Value of 1.0 uses pure union, 0.0 uses pure intersection.
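The interpolation can be illustrated on a small directed weight matrix. This is a minimal sketch, assuming the fuzzy union is the probabilistic sum (w_ij + w_ji - w_ij * w_ji) and the fuzzy intersection is the product (w_ij * w_ji); `combine_fuzzy_sets` is a hypothetical helper written with plain lists, not a library function.

```python
def combine_fuzzy_sets(weights, set_op_mix_ratio=1.0):
    """Symmetrize directed membership weights by mixing fuzzy union and intersection."""
    n = len(weights)
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            w_ij, w_ji = weights[i][j], weights[j][i]
            intersection = w_ij * w_ji             # fuzzy intersection (product)
            union = w_ij + w_ji - intersection     # fuzzy union (probabilistic sum)
            combined[i][j] = (
                set_op_mix_ratio * union
                + (1.0 - set_op_mix_ratio) * intersection
            )
    return combined


# Two points that disagree about how strongly they are connected
directed = [[0.0, 0.8], [0.5, 0.0]]
pure_union = combine_fuzzy_sets(directed, set_op_mix_ratio=1.0)
pure_intersection = combine_fuzzy_sets(directed, set_op_mix_ratio=0.0)
```

With a ratio of 1.0 an edge survives if either endpoint considers the other a neighbor; with 0.0 both endpoints must agree, which yields a sparser graph.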
Weighting applied to negative samples in low-dimensional embedding optimization. Higher values push embedded points further apart.
Number of negative samples to select per positive sample. Higher values lead to greater repulsion between points and more spread-out embeddings but increase computational cost.
Number of nearest neighbors to use when constructing target simplicial set. If -1, uses n_neighbors value.
Distance metric for target values (labels):
- 'categorical': For classification tasks

Weight applied to target information vs. data structure. Range 0.0 to 1.0:
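One way to picture the effect of target_weight with a categorical target is as a penalty on graph edges that join differently labelled points. The sketch below is an assumption based on a reading of the reference implementation: the weight is converted to a "far distance" of 2.5 / (1 - target_weight), edges between different labels are scaled by exp(-far_dist), and edges touching an unlabelled point (label -1) are scaled by exp(-1). `label_edge_weight` is a hypothetical helper; check the constants against the installed version before relying on them.

```python
import math


def label_edge_weight(weight, label_i, label_j, target_weight=0.5, unknown_dist=1.0):
    """Down-weight a graph edge according to the labels of its endpoints (assumed rule)."""
    if target_weight < 1.0:
        far_dist = 2.5 / (1.0 - target_weight)
    else:
        far_dist = 1.0e12  # labels dominate completely
    if label_i == -1 or label_j == -1:
        # At least one endpoint is unlabelled: apply a mild penalty.
        return weight * math.exp(-unknown_dist)
    if label_i != label_j:
        # Different classes: apply a strong penalty that grows with target_weight.
        return weight * math.exp(-far_dist)
    return weight  # same class: edge is unchanged


same_class = label_edge_weight(0.8, 0, 0)
different_class = label_edge_weight(0.8, 0, 1, target_weight=0.5)
```

Under these assumptions, raising target_weight toward 1.0 drives cross-class edges toward zero, so the embedding separates classes more sharply at the cost of the unsupervised structure.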
Size of the nearest neighbor search queue for transform operations. Larger values improve transform accuracy but increase memory usage and computation time.
Random seed for transform operations. Ensures reproducibility of transform results.
Method for transforming new data:
- 'embedding': Standard approach (default)
- 'graph': Use nearest neighbor graph

Whether to use a memory-efficient implementation. Set to False only if memory is not a constraint and you want faster performance.
Whether to print progress messages during fitting.
Whether to consider only unique data points. Set to True if you know your data contains many duplicates to improve performance.
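Force use of approximate nearest neighbor search even for small datasets, which would otherwise use exact distance computations. Large datasets already use the approximate search, so this setting mainly affects small inputs.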
Whether to use angular random projection forest for nearest neighbor search. Can improve performance for normalized data in high dimensions.
DensMAP is a variant that preserves local density information.
Whether to use the DensMAP algorithm instead of standard UMAP. Preserves local density in addition to topological structure.
Weight of density preservation term in DensMAP optimization. Higher values emphasize density preservation.
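Fraction of training epochs (between 0 and 1) during which the density preservation term is included in DensMAP. The earlier epochs optimize the standard UMAP objective only.
Small constant added to the variance of local radii in the embedding when computing the density term in DensMAP. Guards against numerical instability from dividing by a very small number.
Whether to output local density estimates in addition to the embedding. When True, fit_transform returns the embedding together with the local radii of the original data and of the embedding, and the radii are also stored in the rad_orig_ and rad_emb_ attributes.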
Parameter controlling embedding. If None, determined automatically from min_dist and spread.
Parameter controlling embedding. If None, determined automatically from min_dist and spread.
Random state for reproducibility. Set to an integer for reproducible results.
Additional keyword arguments for the distance metric.
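Distance threshold at or beyond which points are treated as disconnected when building the neighbor graph. If None, a metric-dependent default is used: the maximum possible value for bounded metrics and infinity otherwise.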
Precomputed k-nearest neighbors as (knn_indices, knn_dists, knn_search_index). Useful for reusing expensive computations.
Fit the UMAP model to the data.
Parameters:
- X: array-like, shape (n_samples, n_features) - Training data
- y: array-like, shape (n_samples,), optional - Target values for supervised dimension reduction

Returns:
- self: Fitted UMAP object

Attributes set:
- embedding_: The embedded representation of training data
- graph_: Fuzzy simplicial set approximation to the manifold
- _raw_data: Copy of the training data
- _small_data: Whether the dataset is considered small
- _metric_kwds: Processed metric keyword arguments
- _n_neighbors: Actual n_neighbors used
- _initial_alpha: Initial learning rate
- _a, _b: Curve parameters

Fit the model and return the embedded representation.
Parameters:
- X: array-like, shape (n_samples, n_features) - Training data
- y: array-like, shape (n_samples,), optional - Target values for supervised dimension reduction

Returns:
- X_new: array, shape (n_samples, n_components) - Embedded data

Transform new data into the existing embedded space.
Parameters:
- X: array-like, shape (n_samples, n_features) - New data to transform

Returns:
- X_new: array, shape (n_samples, n_components) - Embedded representation of new data

Important notes:
Transform data from the embedded space back to the original data space.
Parameters:
- X: array-like, shape (n_samples, n_components) - Embedded data points

Returns:
- X_new: array, shape (n_samples, n_features) - Reconstructed data in original space

Important notes:
Update the model with new data. Allows incremental fitting.
Parameters:
- X: array-like, shape (n_samples, n_features) - New data to incorporate

Returns:
- self: Updated UMAP object

Note: Experimental feature, may not preserve all properties of batch training.
array, shape (n_samples, n_components) - The embedded representation of the training data.
scipy.sparse.csr_matrix - The weighted adjacency matrix of the fuzzy simplicial set approximation to the manifold.
array - Copy of the raw training data.
bool - Whether the training data was sparse.
bool - Whether the dataset was considered small (uses different algorithm for small datasets).
str - Hash of the input data for caching purposes.
array - Indices of k-nearest neighbors for each training point.
array - Distances to k-nearest neighbors for each training point.
list - Random projection forest used for approximate nearest neighbor search.
umap.ParametricUMAP(encoder=None, decoder=None, parametric_reconstruction=False, autoencoder_loss=False, reconstruction_validation=None, dims=None, batch_size=None, n_training_epochs=1, loss_report_frequency=10, optimizer=None, keras_fit_kwargs={}, **kwargs)
Parametric UMAP using neural networks to learn the embedding function.
Keras model for encoding data to embeddings. If None, uses default 3-layer architecture with 100 neurons per layer.
Keras model for decoding embeddings back to data space. Only used if parametric_reconstruction=True.
Whether to use parametric reconstruction. Requires decoder model.
Whether to include reconstruction loss in the optimization. Requires decoder model.
Validation data (X_val, y_val) for monitoring reconstruction loss during training.
Input dimensions for the encoder network. Required if providing custom encoder.
Batch size for neural network training. If None, determined automatically.
Number of training epochs for the neural networks. More epochs improve quality but increase training time.
How often to report loss during training.
Keras optimizer for training. If None, uses Adam with learning_rate parameter.
Additional keyword arguments passed to the Keras fit() method.
Same as UMAP class, but transform() and inverse_transform() use learned neural networks for faster inference.
Compute k-nearest neighbors for the data.
Returns: (knn_indices, knn_dists, rp_forest)
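The shape of the first two returned values can be illustrated with an exact brute-force search. This is a minimal sketch under stated assumptions: it uses Euclidean distance, counts each point as its own nearest neighbor at distance 0, and omits the random projection forest entirely. `brute_force_knn` is a hypothetical stand-in, not the library routine, which uses approximate search for larger inputs.

```python
import math


def brute_force_knn(points, n_neighbors):
    """Return (knn_indices, knn_dists) using exact Euclidean distances."""
    knn_indices, knn_dists = [], []
    for p in points:
        # Sort all points by distance to p; ties are broken by index.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        nearest = ranked[:n_neighbors]
        knn_indices.append([j for _, j in nearest])
        knn_dists.append([d for d, _ in nearest])
    return knn_indices, knn_dists


# Three points on a line; each row lists the point itself first, then its neighbor
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
knn_indices, knn_dists = brute_force_knn(points, n_neighbors=2)
```

Both results have one row per sample and n_neighbors columns, which is the layout expected when reusing neighbors across several fits.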
Construct fuzzy simplicial set representation of the data.
Returns: Fuzzy simplicial set as sparse matrix
Perform the optimization to find a low-dimensional embedding.
Returns: Embedding array
Fit a, b params for the UMAP curve from spread and min_dist.
Returns: (a, b) tuple
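The fit can be sketched without the library. The example assumes the embedding curve has the form 1 / (1 + a * x^(2b)) and that the target is 1 for distances below min_dist and decays as exp(-(x - min_dist) / spread) beyond it. A coarse grid search is swapped in here for the nonlinear least-squares fit a real implementation would use, so the result is only accurate to the grid step; `target_curve` and `fit_ab_grid` are hypothetical names.

```python
import math


def target_curve(x, min_dist, spread):
    """Desired membership strength at embedded distance x."""
    if x < min_dist:
        return 1.0
    return math.exp(-(x - min_dist) / spread)


def fit_ab_grid(min_dist=0.1, spread=1.0, n_points=300):
    """Pick (a, b) on a 0.05 grid minimizing squared error against the target curve."""
    xs = [spread * 3.0 * i / (n_points - 1) for i in range(n_points)]
    ys = [target_curve(x, min_dist, spread) for x in xs]
    best_err, best_a, best_b = float("inf"), None, None
    for a_step in range(10, 61):          # a from 0.50 to 3.00
        a = a_step * 0.05
        for b_step in range(10, 31):      # b from 0.50 to 1.50
            b = b_step * 0.05
            err = sum(
                (1.0 / (1.0 + a * x ** (2.0 * b)) - y) ** 2
                for x, y in zip(xs, ys)
            )
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b


a, b = fit_ab_grid(min_dist=0.1, spread=1.0)
```

For the default min_dist and spread the grid result should land close to the values umap-learn reports (a of about 1.58 and b of about 0.90). Lowering min_dist sharpens the curve near zero, which is what allows points to pack more tightly.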
umap.AlignedUMAP(n_neighbors=15, n_components=2, metric='euclidean', alignment_regularisation=1e-2, alignment_window_size=3, **kwargs)
UMAP variant for aligning multiple related datasets.
Strength of alignment regularization between datasets.
Number of adjacent datasets to align.
Fit model to multiple datasets.
Parameters:
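- X: list of arrays - List of datasets to align
- relations: list of dicts, passed as a keyword argument - One dict per consecutive pair of datasets, mapping row indices in one dataset to the matching row indices in the next. Required.

Returns:
- self: Fitted model

list of arrays - List of aligned embeddings, one per input dataset.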
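AlignedUMAP.fit also expects a relations keyword argument: one dict per consecutive pair of datasets, mapping row indices in one dataset to row indices in the next. The sketch below shows one way to build those dicts when every row carries a sample identifier. `build_relations` and the identifier lists are hypothetical and only illustrate the expected structure.

```python
def build_relations(sample_ids_per_dataset):
    """Build one {row_in_current: row_in_next} dict per consecutive dataset pair."""
    relations = []
    pairs = zip(sample_ids_per_dataset[:-1], sample_ids_per_dataset[1:])
    for current_ids, next_ids in pairs:
        # Look up where each sample sits in the next dataset.
        next_position = {sample_id: row for row, sample_id in enumerate(next_ids)}
        relations.append({
            row: next_position[sample_id]
            for row, sample_id in enumerate(current_ids)
            if sample_id in next_position   # samples missing from the next dataset are skipped
        })
    return relations


# Hypothetical sample identifiers for three time points
ids = [["a", "b", "c"], ["b", "c", "d"], ["d", "b"]]
relations = build_relations(ids)
```

The resulting list has one fewer entry than the number of datasets and would be passed as `mapper.fit(datasets, relations=relations)`.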
import umap

# Standard 2D visualization embedding
reducer = umap.UMAP(
    n_neighbors=15,          # Balance local/global structure
    n_components=2,          # Output dimensions
    metric='euclidean',      # Distance metric
    min_dist=0.1,            # Minimum distance between points
    spread=1.0,              # Scale of embedded points
    random_state=42,         # Reproducibility
    n_epochs=200,            # Training iterations (None = auto)
    learning_rate=1.0,       # SGD learning rate
    init='spectral',         # Initialization method
    low_memory=True,         # Memory-efficient mode
    verbose=True             # Print progress
)
embedding = reducer.fit_transform(data)
# Train with labels for class separation
reducer = umap.UMAP(
    n_neighbors=15,
    target_weight=0.5,            # Balance data structure vs labels
    target_metric='categorical',  # Metric for labels
    random_state=42
)
embedding = reducer.fit_transform(data, y=labels)
# Optimized for clustering
reducer = umap.UMAP(
    n_neighbors=30,      # More global structure
    min_dist=0.0,        # Allow tight packing
    n_components=10,     # Higher dimensions for density
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(data)
from numba import njit

@njit()
def custom_distance(x, y):
    """Custom distance function (must be Numba-compatible)"""
    result = 0.0
    for i in range(x.shape[0]):
        result += abs(x[i] - y[i])
    return result

reducer = umap.UMAP(metric=custom_distance)
embedding = reducer.fit_transform(data)
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Number of input features (assumes data is a 2D array)
input_dim = data.shape[1]

# Define custom encoder
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(2)  # Output dimension
])

# Define decoder for reconstruction
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(input_dim)
])

# Train parametric UMAP with autoencoder
embedder = ParametricUMAP(
    encoder=encoder,
    decoder=decoder,
    dims=(input_dim,),
    parametric_reconstruction=True,
    autoencoder_loss=True,
    n_training_epochs=10,
    batch_size=128,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
embedding = embedder.fit_transform(data)
new_embedding = embedder.transform(new_data)
reconstructed = embedder.inverse_transform(embedding)
# Preserve local density information
reducer = umap.UMAP(
    densmap=True,        # Enable DensMAP
    dens_lambda=2.0,     # Weight of density preservation
    dens_frac=0.3,       # Fraction of epochs that include the density term
    output_dens=True,    # Output density estimates
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
# With output_dens=True, fit_transform returns the embedding plus both radius arrays
embedding, rad_orig, rad_emb = reducer.fit_transform(data)

# The same arrays are stored on the fitted model
original_density = reducer.rad_orig_   # Log local radii in original space (larger = sparser)
embedded_density = reducer.rad_emb_    # Log local radii in embedded space
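from umap import AlignedUMAP

# Multiple related datasets (e.g., different time points)
datasets = [day1_data, day2_data, day3_data, day4_data]

# Relations map row indices in one dataset to row indices in the next.
# Assumption: every dataset holds the same samples in the same row order,
# so an identity mapping is used; build real mappings for your own data.
relations = [
    {i: i for i in range(min(len(current), len(following)))}
    for current, following in zip(datasets[:-1], datasets[1:])
]

# Align embeddings
mapper = AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=1e-2,   # Alignment strength
    alignment_window_size=2,         # Align with adjacent datasets
    n_components=2,
    random_state=42
)
mapper.fit(datasets, relations=relations)

# Access aligned embeddings
aligned_embeddings = mapper.embeddings_
# aligned_embeddings[0] is day1 embedding
# aligned_embeddings[1] is day2 embedding, etc.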