core/VECTOR_QUANTIZATION.md
VectorQuantizer Interfacesmile.vq.hebb PackageVector quantization (VQ) is a lossy compression technique that maps an input vector to the nearest element of a finite set of code vectors (also called prototypes or centroids). Each algorithm in smile.vq learns these prototypes from data through online, competitive learning: prototypes compete for each incoming signal, and the winner (or winners) are updated to better represent that signal.
All algorithms in this package implement the VectorQuantizer interface and support incremental (online) learning via update(double[] x). This makes them suitable for streaming data, large datasets that do not fit in memory, or scenarios where the data distribution may shift over time.
The package contains five quantizer implementations:
| Class | Topology | Fixed size? | Key strength |
|---|---|---|---|
BIRCH | CF-tree (hierarchical) | No | Fast, memory-bounded, handles very large datasets |
NeuralGas | None (soft competition) | Yes | Robust convergence, good coverage |
GrowingNeuralGas | Graph (dynamic) | No | Adapts topology and size to data |
NeuralMap | Graph (dynamic, radius-gated) | No | Memory-efficient, noise-resistant |
SOM | 2-D rectangular lattice | Yes | Topology-preserving, visual interpretation |
VectorQuantizer Interfacepublic interface VectorQuantizer extends Serializable {
int OUTLIER = Integer.MAX_VALUE;
void update(double[] x);
double[] quantize(double[] x);
}
update(double[] x)Presents a new data point to the model. The model updates its internal prototype(s) using competitive learning. Calling update repeatedly over the training set (typically for multiple epochs in random order) trains the model.
quantize(double[] x)Returns the weight vector of the nearest prototype for a given input. For all implementations this is an internal reference — do not modify the returned array, or you will corrupt the model state. If you need the value for further computation, copy it first:
double[] proto = quantizer.quantize(x);
double[] copy = proto.clone();
OUTLIERThe constant Integer.MAX_VALUE is reserved as a label for outliers or noise in higher-level clustering algorithms that build on top of VQ models. The VQ algorithms themselves do not use it internally.
All implementations are Serializable, so a trained model can be saved and restored with standard Java serialization.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) builds a CF-tree — a height-balanced tree of Clustering Features (CF). Each CF is a compact statistical summary (N, LS, SS):
From these three statistics the centroid and radius of a sub-cluster can be computed exactly and cheaply. When a new data point arrives, it is absorbed into the closest leaf CF if the resulting radius stays within the threshold T; otherwise a new CF is created. Leaves that exceed capacity L are split; internal nodes that exceed branching factor B are likewise split, growing the tree.
quantize descends the tree to the nearest leaf CF and returns its centroid.
BIRCH(int d, int B, int L, double T)
| Parameter | Meaning | Typical range |
|---|---|---|
d | Dimensionality of data | — |
B | Maximum children per internal node | 5–50 |
L | Maximum CFs per leaf node | 5–50 |
T | Radius threshold for absorbing a point into an existing CF | data-dependent |
Constraints: B ≥ 2, L ≥ 2, T > 0.
The radius threshold T is the most important parameter. A smaller T creates more, finer sub-clusters; a larger T creates fewer, coarser ones. Inspect your data's pairwise distances to calibrate: T should be on the order of the expected intra-cluster spread.
// 256-dimensional data (e.g. image patches), branching=5, leaf capacity=5, radius=6.0
BIRCH birch = new BIRCH(256, 5, 5, 6.0);
// Online ingestion — single pass over training data
for (double[] xi : trainingData) {
birch.update(xi);
}
// Quantize new points
double[] prototype = birch.quantize(queryPoint);
// Retrieve all leaf-level sub-cluster centroids for downstream clustering
double[][] centroids = birch.centroids();
centroids()Returns the centroid of every CF at the leaf level. These centroids can be fed into a hierarchical clustering algorithm (e.g., smile.clustering.HierarchicalClustering) to produce the final cluster assignments in a second pass — this is phase 2 of the original BIRCH paper.
T, B, and L; BIRCH will never create more leaf CFs than the data variance requires given the radius constraint.quantize returns a new array (the centroid is computed on the fly from the CF statistics), so it is safe to modify.Neural Gas (Martinetz & Schulten, 1991–1994) uses soft competition: for every input signal x, all neurons are ranked by distance and updated, with adaptation strength decaying exponentially with rank. The neuron closest to x (rank 0) moves the most; the neuron furthest away (rank N−1) moves the least. Over time, the prototypes distribute themselves to match the data density.
Additionally, a competitive Hebbian edge is drawn between the best-matching unit (BMU) and the second-best-matching unit at each step. These edges record which neurons are topologically adjacent in data space. Old edges can be pruned via network().
The adaptation rule for rank k at iteration t is:
Δw_k = α(t) · exp(−k / θ(t)) · (x − w_k)
where α(t) is the learning rate and θ(t) is the neighbourhood width, both typically decaying over time.
NeuralGas(double[][] neurons, TimeFunction alpha, TimeFunction theta, TimeFunction lifetime)
| Parameter | Meaning |
|---|---|
neurons | Initial weight vectors (at least 2); commonly drawn with CentroidClustering.seeds |
alpha | Learning-rate schedule α(t) |
theta | Neighbourhood-width schedule θ(t) |
lifetime | Edge lifetime: edges older than lifetime(t) steps are zeroed by network() |
TimeFunction is a @FunctionalInterface in smile.util.function. Common schedules:
TimeFunction.constant(value) // fixed value
TimeFunction.exp(initialValue, decayTime) // exponential decay: v0 * exp(-t / T)
double[][] x = trainingData;
int epochs = 20;
int T = x.length * epochs;
// 400 prototypes seeded by k-means++
double[][] seeds = CentroidClustering.seeds(x, 400);
NeuralGas gas = new NeuralGas(
seeds,
TimeFunction.exp(0.3, T / 2.0), // alpha: 0.3 → ~0 over half the run
TimeFunction.exp(30, T / 8.0), // theta: 30 → ~0 over one eighth of run
TimeFunction.constant(x.length * 2) // edges alive for ~2 epochs
);
for (int epoch = 0; epoch < epochs; epoch++) {
for (int j : MathEx.permutate(x.length)) {
gas.update(x[j]);
}
}
// Access trained prototypes in original index order
double[][] prototypes = gas.neurons();
// Retrieve topology graph (prunes expired edges)
Graph graph = gas.network();
// Quantize
double[] nearest = gas.quantize(queryPoint);
neurons()Returns the weight vectors sorted by original index (the order in which they were passed to the constructor). This ordering is stable across calls.
network()Returns the AdjacencyMatrix of competitive Hebbian edges. Edges whose timestamp is more than lifetime(t) steps old have their weight zeroed. The returned graph can be used for topology analysis or to identify disconnected clusters.
alpha and theta must decay to near-zero over the training run for good results.IntStream.parallel() internally.Growing Neural Gas (GNG, Fritzke 1995) extends Neural Gas with the ability to dynamically add and remove neurons during training. It uses hard competition (only BMU and second-best are updated) combined with competitive Hebbian edges and age-based edge pruning.
Key additions over Neural Gas:
edgeLifetime are removed. Neurons that become isolated (no edges) are also removed.lambda signals, a new neuron is inserted halfway between the neuron with the highest accumulated error (q) and its neighbor with the highest error (f). The edge (q, f) is replaced by edges (q, r) and (r, f).beta after every signal.Two constructors are provided:
// Default hyperparameters
GrowingNeuralGas(int d)
// Full control
GrowingNeuralGas(int d, double epsBest, double epsNeighbor,
int edgeLifetime, int lambda, double alpha, double beta)
| Parameter | Default | Meaning |
|---|---|---|
d | — | Dimensionality |
epsBest | 0.2 | Learning rate for BMU |
epsNeighbor | 0.006 | Learning rate for BMU's neighbors |
edgeLifetime | 50 | Maximum edge age before pruning |
lambda | 100 | Neuron insertion interval (signals between insertions) |
alpha | 0.5 | Error reduction factor for q and f after insertion |
beta | 0.995 | Global error decay factor (applied after every signal) |
The first two update calls do not trigger learning — they simply add the first two neurons. Learning begins from the third call onward. A neuron insertion at a lambda boundary is skipped if the highest-error neuron has no topological neighbors (a rare but safe condition).
GrowingNeuralGas model = new GrowingNeuralGas(x[0].length);
for (int epoch = 0; epoch < 10; epoch++) {
for (int j : MathEx.permutate(x.length)) {
model.update(x[j]);
}
System.out.printf("%d neurons after epoch %d%n", model.neurons().length, epoch + 1);
}
// Inspect the learned prototypes and topology
Neuron[] neurons = model.neurons();
for (Neuron n : neurons) {
double[] w = n.w; // prototype weight vector
double error = n.counter; // accumulated local error
List<Edge> nbrs = n.edges; // topological neighbors
}
double[] nearest = model.quantize(queryPoint);
| Goal | Adjustment |
|---|---|
| Fewer, coarser prototypes | Increase lambda, decrease edgeLifetime |
| More, finer prototypes | Decrease lambda, increase edgeLifetime |
| Faster convergence | Increase epsBest and epsNeighbor |
| Smoother topology | Decrease epsBest, increase edgeLifetime |
| Adapt to drifting distributions | Smaller beta (faster error decay), smaller edgeLifetime |
HeapSelect for efficient O(n) BMU+second-best search.NeuralMap is a hybrid algorithm combining ideas from GNG and BIRCH. Like GNG it uses competitive Hebbian learning and edge aging. Unlike GNG, neuron creation is controlled by a distance threshold r rather than a fixed insertion interval:
r, a new neuron is created at x.r but not the second-nearest, a new neuron is created at x and connected to the nearest.r, the normal Hebbian update runs: the nearest neuron is moved toward x, its neighbors are lightly updated, the edge to the second-nearest is refreshed, and stale edges are pruned.A freshness counter (updated by counter += 1 for the winning neuron, decayed by counter *= beta for all neurons after each step) tracks how recently each neuron has been active. The clear(double eps) method removes neurons whose freshness falls below eps and also prunes any remaining stale edges.
NeuralMap(int d, double r, double epsBest, double epsNeighbor, int edgeLifetime, double beta)
| Parameter | Meaning |
|---|---|
d | Dimensionality |
r | Distance threshold for neuron activation |
epsBest | Learning rate for the nearest neuron |
epsNeighbor | Learning rate for topological neighbors |
edgeLifetime | Maximum edge age before pruning |
beta | Global freshness decay per step (0 < beta ≤ 1) |
// 256-dimensional signals; neurons must be within radius 8 to be activated
NeuralMap model = new NeuralMap(256, 8.0, 0.01, 0.002, 50, 0.995);
for (int epoch = 0; epoch < 5; epoch++) {
for (int j : MathEx.permutate(x.length)) {
model.update(x[j]);
}
// Prune stale neurons (freshness < 1e-7) and old edges
model.clear(1E-7);
System.out.printf("%d neurons after epoch %d%n", model.neurons().length, epoch + 1);
}
double[] nearest = model.quantize(queryPoint);
clear(double eps)Removes all neurons whose counter < eps AND whose edge list is empty (either because they were never connected, or all their edges were pruned). Neurons with counter ≥ eps also have their edges pruned of any that exceed edgeLifetime. Call clear between epochs to remove noise and keep the model compact.
Important: eps must be non-negative. Passing a negative value throws IllegalArgumentException.
r provides natural noise resistance: isolated spurious signals far from any existing neuron create their own neuron, but that neuron quickly becomes stale and is removed by clear.r-balls cover the data — not by a fixed insertion rate.The Self-Organizing Map (Kohonen, 2000) organizes a fixed set of neurons on a 2-D rectangular lattice. Each neuron has a grid position (i, j) and a weight vector w in input space. Training proceeds as follows:
x.x, with the update magnitude controlled by both the learning rate α(t) and a neighborhood function θ(di, dj, t) where (di, dj) is the lattice displacement from the BMU:Δw = α(t) · θ(i − i_bmu, j − j_bmu, t) · (x − w)
As training progresses, α and the neighborhood width shrink, causing the map to first organize globally and then fine-tune locally.
The result is a map where nearby neurons respond to similar inputs — a topology-preserving projection of the input space onto the 2-D lattice.
SOM(double[][][] neurons, TimeFunction alpha, Neighborhood theta)
| Parameter | Meaning |
|---|---|
neurons | Initial [nrow][ncol][d] weight array |
alpha | Learning-rate schedule α(t) |
theta | Neighborhood function |
All rows must have the same number of columns; all weight vectors must have the same length d. The lattice must be non-empty and have non-zero dimensionality.
SOM.latticeA good initialization seeds the lattice by projecting k-means++ seeds onto a 2-D MDS embedding, which preserves the global data structure from the start:
double[][][] lattice = SOM.lattice(nrow, ncol, trainingData);
This uses CentroidClustering.seeds for the initial seeds, then a pairwise distance matrix and MDS.fit to compute 2-D coordinates. The MDS coordinates are sorted so that the row and column grid positions respect the embedded distances. This is the recommended way to initialize large SOMs.
int epochs = 20;
int nrow = 20, ncol = 20; // 400-neuron map
double[][][] lattice = SOM.lattice(nrow, ncol, trainingData);
SOM som = new SOM(
lattice,
TimeFunction.constant(0.1),
Neighborhood.Gaussian(1.0, trainingData.length * epochs / 4.0)
);
for (int epoch = 0; epoch < epochs; epoch++) {
for (int j : MathEx.permutate(trainingData.length)) {
som.update(trainingData[j]);
}
}
// Access the learned lattice (internal references, do not modify)
double[][][] learned = som.neurons(); // [nrow][ncol][d]
// Compute the U-matrix for visualization
double[][] umatrix = som.umatrix(); // [nrow][ncol]
// Quantize
double[] nearest = som.quantize(queryPoint);
umatrix() computes the Unified Distance Matrix: for each neuron, its value is the maximum Euclidean distance to its orthogonal (4-connected) grid neighbors. Large U-matrix values indicate cluster boundaries; low values indicate dense regions. The U-matrix is the standard way to visualize SOM cluster structure as a heat map.
umatrix[i][j] = max distance from neuron (i,j) to its grid neighbors
Corner and edge cells only consider their existing neighbors (the grid is not toroidal).
Two built-in neighborhood functions are provided (see Section 9):
| Function | Behavior |
|---|---|
Neighborhood.bubble(radius) | Flat 1.0 within Chebyshev distance < radius, 0 outside |
Neighborhood.Gaussian(sigma, T) | Gaussian, width decays from sigma to ~0 over T iterations |
The Gaussian neighborhood is recommended for good maps; the bubble function is faster and a reasonable approximation.
neurons() method returns internal weight references — do not modify the returned arrays.update if α(t) * θ < 1E-5 (the internal tol threshold), making updates efficient for large lattices.Arrays.stream(neurons).parallel().smile.vq.hebb PackageThe sub-package smile.vq.hebb provides the graph primitives used by GNG and NeuralMap.
Neuronpublic class Neuron implements Comparable<Neuron>, Serializable {
public final double[] w; // weight / prototype vector
public final List<Edge> edges; // topological neighbors (undirected edges stored bidirectionally)
public transient double distance; // last computed distance to an input signal
public double counter; // local counter (error accumulation or freshness)
}
Key methods:
| Method | Description |
|---|---|
update(double[] x, double eps) | w += eps * (x − w) — moves the prototype toward x |
distance(double[] x) | Sets this.distance = MathEx.distance(w, x) |
addEdge(Neuron neighbor) | Adds an undirected edge with age 0 |
addEdge(Neuron neighbor, int age) | Adds an edge with a specified initial age |
removeEdge(Neuron neighbor) | Removes the edge to neighbor |
setEdgeAge(Neuron neighbor, int age) | Updates the age of the edge to neighbor |
age() | Increments the age of all emanating edges by 1 |
compareTo(Neuron o) | Orders by distance (ascending) — used with HeapSelect |
distance is declared transient so it is not serialized (it is a working variable, not persistent state).
Edgepublic class Edge implements Serializable {
public final Neuron neighbor; // the connected neuron
public int age; // number of steps since this edge was last refreshed
}
Edges are stored bidirectionally: if neuron A has an edge to B, B also has an edge to A. The algorithms maintain this invariant themselves (addEdge/removeEdge must be called on both endpoints when creating or removing a connection).
Neighborhood is a @FunctionalInterface:
double of(int i, int j, int t)
i and j are the row/column displacements from the BMU (signed); t is the current iteration count.
Neighborhood.bubble(int radius)
// Returns 1.0 if |i| < radius AND |j| < radius, else 0.0
All neurons within the Chebyshev square of side 2*radius - 1 centered on the BMU are updated equally. The boundary itself (distance exactly radius) is not included (strict inequality).
Neighborhood bubble = Neighborhood.bubble(2);
bubble.of(0, 0, t) // → 1.0 (winner)
bubble.of(1, 1, t) // → 1.0 (inside)
bubble.of(2, 0, t) // → 0.0 (on boundary, excluded)
Neighborhood.Gaussian(double sigma, double T)
// Returns exp(−0.5 * (i² + j²) / s²) where s = sigma * exp(−t / T)
The neighbourhood width shrinks exponentially from sigma to near zero over T iterations. The winner always evaluates to exactly 1.0 (displacement 0,0 gives exponent 0).
// Gaussian that shrinks over 8000 iterations
Neighborhood gauss = Neighborhood.Gaussian(2.0, 8000.0);
gauss.of(0, 0, 0) // → 1.0
gauss.of(1, 0, 0) // → exp(-0.5 / 4) ≈ 0.882
gauss.of(1, 0, 4000) // → narrower, smaller value
Because Neighborhood is a functional interface, you can provide any lambda:
// Linearly decaying step function
int initialRadius = 5;
Neighborhood custom = (i, j, t) -> {
int radius = Math.max(1, initialRadius - t / 1000);
return Math.abs(i) < radius && Math.abs(j) < radius ? 1.0 : 0.0;
};
These figures are from the integration test suite with fixed seed 19650218:
| Algorithm | Config | Train error | Test error |
|---|---|---|---|
| K-Means (baseline) | k=400 | 5.829 | 6.631 |
| Neural Gas | 400 neurons, 20 epochs | 5.699 | 6.531 |
| Growing Neural Gas | default params, 10 epochs | 5.593 | 6.432 |
| NeuralMap | r=8, 5 epochs + clear | 6.023 | 6.939 |
| SOM | 20×20 grid, 20 epochs | — | 6.582 |
| BIRCH | B=5, L=5, T=6 | 5.902 | 7.350 |
Lower is better. Errors are mean Euclidean distances.
Use BIRCH when:
Use Neural Gas when:
Use Growing Neural Gas when:
Use NeuralMap when:
clear between epochs).Use SOM when:
All online VQ algorithms benefit from processing data in a different random order each epoch, which avoids artifacts from fixed ordering:
for (int epoch = 0; epoch < numEpochs; epoch++) {
for (int j : MathEx.permutate(x.length)) {
model.update(x[j]);
}
}
MathEx.permutate(n) returns a random permutation of 0..n-1.
double error = 0.0;
for (double[] xi : testData) {
double[] proto = model.quantize(xi);
error += MathEx.distance(xi, proto);
}
error /= testData.length;
System.out.printf("Mean quantization error = %.4f%n", error);
quantize and neurons() return internal references for GNG, NeuralMap, NeuralGas, and SOM. Modifying them corrupts the model:
// Dangerous — modifies the internal weight vector:
double[] proto = gas.quantize(x);
proto[0] = 0.0; // BUG
// Safe:
double[] proto = gas.quantize(x).clone();
BIRCH's quantize is safe: it returns a freshly computed centroid array.
For Neural Gas and SOM, random uniform initialization wastes early training iterations on global reorganization. Using k-means++ seeds (CentroidClustering.seeds) or the MDS-based SOM.lattice helper gives a much better starting point:
// Neural Gas: seed with 200 representative points
double[][] seeds = CentroidClustering.seeds(trainingData, 200);
NeuralGas gas = new NeuralGas(seeds, alpha, theta, lifetime);
// SOM: MDS-ordered lattice
double[][][] lattice = SOM.lattice(10, 20, trainingData);
SOM som = new SOM(lattice, alpha, Neighborhood.Gaussian(1.0, T));
A common rule of thumb for exponential schedules over N total update steps:
int N = trainingData.length * numEpochs;
TimeFunction alpha = TimeFunction.exp(0.3, N / 2.0); // learning rate
TimeFunction theta = TimeFunction.exp(30.0, N / 8.0); // neighbourhood width
Both should reach near-zero by the end of training. Too large a final value leaves the map restless; too small wastes early training.
for (int epoch = 0; epoch < 10; epoch++) {
for (int j : MathEx.permutate(x.length)) { model.update(x[j]); }
// NeuralMap only:
((NeuralMap) model).clear(1E-7);
System.out.printf("Epoch %d: %d neurons%n", epoch, model.neurons().length);
}
If neuron count grows without bound, decrease edgeLifetime (GNG/NeuralMap) or decrease lambda (GNG) to trigger more pruning.
double[][] u = som.umatrix();
// u[i][j] is large at cluster boundaries, small within dense regions.
// Use this as a heat map: high values → inter-cluster gaps.
Plot u as an image (nrow × ncol) using your preferred visualization library to identify cluster structure without labelling.
If you have no prior knowledge of the data scale, estimate a reasonable T from a sample:
double sampleDist = 0.0;
int pairs = 0;
for (int i = 0; i < Math.min(200, x.length); i++) {
for (int j = i + 1; j < Math.min(200, x.length); j++) {
sampleDist += MathEx.distance(x[i], x[j]);
pairs++;
}
}
double meanPairwise = sampleDist / pairs;
// Set T to roughly 10–20% of the mean pairwise distance
double T = meanPairwise * 0.15;
BIRCH birch = new BIRCH(d, 5, 5, T);
All implementations are serializable. Save and restore a trained model with standard Java object streams:
// Save
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
oos.writeObject(trainedModel);
}
// Restore
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream("model.ser"))) {
VectorQuantizer restored = (VectorQuantizer) ois.readObject();
}
Note: Neuron.distance is transient and will be reset to Double.MAX_VALUE after deserialization. This is intentional — it is a working variable that is overwritten before use.
See also: smile.clustering.KMeans, smile.clustering.HierarchicalClustering, smile.manifold.MDS, smile.util.function.TimeFunction
SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.