SMILE Core Module

The smile-core module is the algorithmic heart of SMILE. It builds on smile-base (math, linear algebra, data frames) and provides:

  • Supervised learning — classification and regression
  • Unsupervised learning — clustering, vector quantization, manifold learning
  • Semi-supervised / online methods — sequence labeling, time series
  • Feature engineering — scaling, extraction, selection, imputation, SHAP
  • Model evaluation — cross-validation, metrics, hyper-parameter optimization
  • Production tooling — ONNX inference, model wrappers, anomaly detection

Table of Contents

  1. Module Structure
  2. Quick Start
  3. Classification
  4. Regression
  5. Clustering
  6. Feature Engineering
  7. Anomaly Detection
  8. Association Rule Mining
  9. Vector Quantization
  10. Manifold Learning
  11. Sequence Labeling
  12. Time Series
  13. Model Validation & Metrics
  14. Hyper-Parameter Optimization
  15. ONNX Inference
  16. User Guides
  17. Building and Testing

Module Structure

smile.anomaly          – Isolation Forest, one-class SVM
smile.association      – FP-Growth, Apriori, ARM
smile.classification   – 20+ classifiers (RF, GBT, SVM, MLP, …)
smile.clustering       – K-Means, DBSCAN, Spectral, GMM, …
smile.feature          – Transforms, extraction, selection, imputation, SHAP
  ├─ smile.feature.transform    – Scaler, Standardizer, Normalizer, …
  ├─ smile.feature.extraction   – PCA, KernelPCA, BagOfWords, encoders, …
  ├─ smile.feature.imputation   – SimpleImputer, KNNImputer, SVDImputer, …
  ├─ smile.feature.selection    – SSR, SNR, FRegression, IV, GAFE
  └─ smile.feature.importance   – SHAP, TreeSHAP
smile.hpo              – Hyper-parameter search (grid, random, Bayesian)
smile.manifold         – IsoMap, LLE, t-SNE, UMAP, KPCA, …
smile.model            – Unified model wrappers (CART, MLP internals)
smile.onnx             – ONNX runtime integration
smile.regression       – 15+ regressors (RF, GBT, SVM, MLP, OLS, LASSO, …)
smile.sequence         – HMM, CRF sequence labeling
smile.timeseries       – ARIMA, GARCH, exponential smoothing, …
smile.validation       – Cross-validation, Bootstrap, metrics
smile.vq               – SOM, Neural Gas, GNG, NeuralMap, BIRCH

Dependency: smile-core → smile-base only (no circular deps).


Quick Start

Add the Gradle dependency:

kotlin
// build.gradle.kts
dependencies {
    implementation("com.github.haifengl:smile-core:6.x.x")
}

Fit and predict a Random Forest in a few lines

java
import smile.datasets.Iris;
import smile.classification.RandomForest;
import smile.validation.metric.Accuracy;

var iris = new Iris();
RandomForest rf = RandomForest.fit(iris.formula(), iris.data());
int[] prediction = rf.predict(iris.testData());
System.out.println("Accuracy: " + Accuracy.of(iris.testLabels(), prediction));

Standardize features then train a classifier

java
import smile.classification.SVM;
import smile.data.DataFrame;
import smile.data.transform.InvertibleColumnTransform;
import smile.feature.transform.Standardizer;

InvertibleColumnTransform scaler = Standardizer.fit(trainDf);
DataFrame scaledTrain = scaler.apply(trainDf);
DataFrame scaledTest  = scaler.apply(testDf);

SVM<double[]> svm = SVM.fit(formula, scaledTrain);

Classification

smile.classification provides over 16 classifiers for binary and multi-class problems.

| Class | Algorithm |
|---|---|
| RandomForest | Ensemble of decision trees (bagging + random feature subset) |
| GradientTreeBoost | Gradient boosting of decision trees |
| AdaBoost | Adaptive boosting |
| DecisionTree | Single CART decision tree |
| LogisticRegression | L2-regularized logistic regression |
| SVM | Support vector machine (one-vs-one multi-class) |
| MLP | Multi-layer perceptron |
| KNN | k-nearest neighbors |
| LDA / QDA / RDA / FLD | Linear/Quadratic/Regularized DA, Fisher's LD |
| NaiveBayes / DiscreteNaiveBayes | Gaussian and discrete Naive Bayes |
| RBFNetwork | Radial basis function network |
| MaxEntClassifier | Maximum entropy (log-linear) classifier |

Common pattern:

java
// Formula-based (DataFrame) API
RandomForest rf = RandomForest.fit(Formula.lhs("class"), trainDf);
int label = rf.predict(testTuple);
int[] labels = rf.predict(testDf);
double[] posteriori = new double[k];                // k = number of classes
int predicted = rf.predict(testTuple, posteriori);  // fills posterior probabilities

// Raw array API
double[][] x = trainDf.toArray();
int[] y = trainDf.column("class").toIntArray();
RandomForest rf2 = RandomForest.fit(x, y);

Ensemble models (RandomForest, GradientTreeBoost) implement TreeSHAP for feature importance. See Feature Engineering.

📖 Full guide: CLASSIFICATION.md


Regression

smile.regression provides over 13 regressors for continuous-valued prediction.

| Class | Algorithm |
|---|---|
| RandomForest | Random forest regression |
| GradientTreeBoost | Gradient boosted regression trees |
| RegressionTree | Single CART regression tree |
| OLS | Ordinary least squares |
| RidgeRegression | L2-penalized OLS |
| LASSO | L1-penalized OLS (coordinate descent) |
| ElasticNet | L1+L2-penalized OLS |
| SVR | Support vector regression |
| MLP | Multi-layer perceptron regression |
| GaussianProcessRegression | Gaussian process with Mercer kernel |
| RBFNetwork | Radial basis function network |

Common pattern:

java
OLS ols = OLS.fit(Formula.lhs("price"), trainDf);
double predicted = ols.predict(testTuple);

// Inspect coefficients
System.out.println(ols);

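Regularized linear models follow the same formula-based pattern. Here is a minimal sketch; the explicit lambda argument in these fit(...) calls is an assumption, so check the LASSO / RidgeRegression javadoc for the exact overloads in your version:

java
import smile.data.formula.Formula;
import smile.regression.LASSO;
import smile.regression.RidgeRegression;

// Shrinkage strengths are illustrative — tune them via cross-validation
var lasso = LASSO.fit(Formula.lhs("price"), trainDf, 0.1);            // L1 penalty
var ridge = RidgeRegression.fit(Formula.lhs("price"), trainDf, 1.0);  // L2 penalty
double predicted = lasso.predict(testTuple);
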
📖 Full guide: REGRESSION.md


Clustering

smile.clustering contains algorithms for grouping unlabelled data.

| Class | Algorithm |
|---|---|
| KMeans | Lloyd's k-Means (batch) |
| GMeans / XMeans | Automatic k selection |
| CLARANS | Clustering Large Applications |
| DBScan | Density-based spatial clustering |
| DENCLUE | Density-based with kernel estimation |
| SpectralClustering | Graph Laplacian eigenvectors |
| MEC | Minimum entropy clustering |
| SIB | Sequential information bottleneck |
| GaussianMixture | EM-fitted Gaussian mixture model |
| HierarchicalClustering | Agglomerative (single/complete/average/Ward) |
| BIRCH | Balanced iterative reducing and clustering |

java
KMeans km = KMeans.fit(data, 5);   // 5 clusters
int[] labels = km.y;               // cluster assignments
double[][] centroids = km.centroids;
int cluster = km.predict(newPoint);

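Density-based clustering does not require choosing k up front. A minimal DBSCAN sketch (minPts and radius values are illustrative, and the y/k accessors are assumed to follow the same convention as the KMeans example above):

java
import smile.clustering.DBSCAN;

var dbscan = DBSCAN.fit(data, 5, 0.5);   // minPts = 5, radius = 0.5
int[] assignments = dbscan.y;            // cluster labels, as with KMeans above
int clusters = dbscan.k;                 // number of clusters discovered
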
📖 Full guide: CLUSTERING.md


Feature Engineering

smile.feature and its subpackages cover the entire preprocessing pipeline.

Scaling / Normalization (smile.feature.transform)

| Transformer | Output range | Centering | Outlier robust |
|---|---|---|---|
| Scaler | [0, 1] | No | No |
| WinsorScaler | [0, 1] | No | Yes (percentile) |
| MaxAbsScaler | [−1, 1] | No | No |
| Standardizer | (−∞, +∞) | Yes | No |
| RobustStandardizer | (−∞, +∞) | Yes | Yes (IQR) |
| Normalizer | unit norm | No | N/A (row-wise) |

java
InvertibleColumnTransform t = Standardizer.fit(trainDf);
DataFrame scaled = t.apply(testDf);
DataFrame restored = t.invert(scaled); // lossless roundtrip within training range

// Pipeline: standardize, then scale the standardized output to [-1, 1]
Transform pipeline = Transform.pipeline(
        t,
        MaxAbsScaler.fit(t.apply(trainDf))
);

Dimensionality Reduction (smile.feature.extraction)

java
// Batch PCA (keeps 95% variance by default)
PCA pca = PCA.fit(trainDf);
PCA pca5  = pca.getProjection(5);      // top 5 components
PCA pca90 = pca.getProjection(0.90);   // 90% variance threshold

// Kernel PCA (non-linear)
KernelPCA kpca = KernelPCA.fit(trainDf, new GaussianKernel(1.0), opts);

// Streaming PCA
GHA gha = new GHA(inputDim, 10, TimeFunction.of(0.01));
centeredSamples.forEach(gha::update);

// Random projection (no training needed)
RandomProjection rp = RandomProjection.of(highDim, lowDim);

Feature Selection (smile.feature.selection)

java
// Multi-class: BSS/WSS ratio
SumSquaresRatio[] scores = SumSquaresRatio.fit(df, "label");
Arrays.sort(scores);   // ascending (worst first)

// Binary: signal-to-noise ratio
SignalNoiseRatio[] snr = SignalNoiseRatio.fit(df, "label");

// Regression: univariate F-statistic
FRegression[] fscores = FRegression.fit(df, "target");
String[] significant = Arrays.stream(fscores)
        .filter(r -> r.pvalue() < 0.05).map(FRegression::feature)
        .toArray(String[]::new);

// Genetic algorithm wrapper
GAFE gafe = new GAFE(...);
int[] selected = gafe.apply(generations, population, formula, df, fitness);

Missing Value Imputation (smile.feature.imputation)

java
SimpleImputer imp = SimpleImputer.fit(trainDf);    // mean/mode
KNNImputer knn   = new KNNImputer(trainDf, 5, dist);
double[][] fixed = SVDImputer.impute(rawMatrix, 10, 100);

SHAP Explainability (smile.feature.importance)

java
RandomForest rf = RandomForest.fit(formula, trainDf);
double[] phi = rf.shap(testTuple);   // per-feature contributions

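Per-instance SHAP values can be aggregated into a rough global importance score by averaging their magnitudes over held-out rows. A sketch using only the shap(Tuple) call above; `tuples` is a hypothetical List<Tuple> of test rows, and the layout of the returned array should be checked for multi-class models:

java
import java.util.List;
import smile.data.Tuple;

// Mean |SHAP| per feature over a set of held-out rows
double[] importance = null;
for (Tuple row : tuples) {
    double[] phi = rf.shap(row);
    if (importance == null) importance = new double[phi.length];
    for (int i = 0; i < phi.length; i++) importance[i] += Math.abs(phi[i]) / tuples.size();
}
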
📖 Full guide: FEATURE_ENGINEERING.md


Anomaly Detection

smile.anomaly provides unsupervised outlier and novelty detection.

| Class | Algorithm |
|---|---|
| IsolationForest | Random partitioning; anomaly score ∝ isolation path length |
| SVM (one-class) | Hypersphere in kernel feature space |

java
IsolationForest iforest = IsolationForest.fit(trainData, 100); // 100 trees
double[] scores = iforest.score(testData);   // higher = more anomalous
int extensionLevel = 0;   // 0 → standard IsolationForest; ≥1 → extended (hyperplane) splits
IsolationForest ext = IsolationForest.fit(trainData, 100, extensionLevel);

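Scores are continuous, so flagging anomalies comes down to choosing a cutoff. A minimal sketch using only the score(...) call above (the 0.6 threshold is illustrative; pick it from a validation set or a target contamination rate):

java
double threshold = 0.6;                    // illustrative cutoff, tune on validation data
boolean[] isAnomaly = new boolean[scores.length];
for (int i = 0; i < scores.length; i++) {
    isAnomaly[i] = scores[i] > threshold;  // higher score = more anomalous
}
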
📖 Full guide: ANOMALY_DETECTION.md


Association Rule Mining

smile.association mines frequent itemsets and association rules from transaction data.

| Class | Algorithm |
|---|---|
| FPTree | FP-Tree data structure |
| FPGrowth | Frequent pattern growth (FP-Growth) |
| ARM | Association rule mining from itemsets |

java
int[][] transactions = { {1,2,3}, {2,3,4}, {1,3,5}, ... };
List<int[]> itemsets = FPGrowth.apply(transactions, 0.3); // 30% min support

// Mine rules
List<AssociationRule> rules = ARM.apply(itemsets, transactions.length, 0.7);
rules.forEach(r -> System.out.printf("%s => %s  conf=%.2f%n",
        Arrays.toString(r.antecedent()), Arrays.toString(r.consequent()),
        r.confidence()));

📖 Full guide: ASSOCIATION_RULE_MINING.md


Vector Quantization

smile.vq provides competitive-learning algorithms for codebook construction and topology-preserving maps.

| Class | Algorithm |
|---|---|
| SOM | Self-organizing map (batch or online) |
| NeuralGas | Neural Gas (batch) |
| GrowingNeuralGas | Online growing topology graph |
| NeuralMap | Approximate nearest-neighbor map |
| BIRCH | CF-tree for large-scale clustering |

java
SOM som = SOM.fit(data, 10, 10);     // 10×10 hexagonal grid
int unit = som.predict(sample);      // best-matching unit (BMU)

NeuralGas ng = NeuralGas.fit(data, 20);
int centroid = ng.predict(sample);

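GrowingNeuralGas is trained online, one sample at a time, rather than in batch. A minimal sketch; the single-argument constructor and the update/quantize calls are assumptions, so check the smile.vq javadoc for the exact signatures:

java
import smile.vq.GrowingNeuralGas;

GrowingNeuralGas gng = new GrowingNeuralGas(data[0].length);  // input dimensionality
for (double[] x : data) {
    gng.update(x);                        // one online update per sample
}
double[] codeword = gng.quantize(sample); // nearest codebook vector for a new sample
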
📖 Full guide: VECTOR_QUANTIZATION.md


Manifold Learning

smile.manifold provides non-linear dimensionality reduction for visualization and pre-processing.

| Class | Algorithm |
|---|---|
| IsoMap | Geodesic distance + MDS |
| LLE | Locally Linear Embedding |
| LaplacianEigenmap | Graph Laplacian eigenmaps |
| UMAP | Uniform manifold approximation |
| TSNE | t-Distributed stochastic neighbor embedding |
| SammonMapping | Stress-minimization MDS |
| KPCA | Kernel PCA |

java
double[][] umapCoords = UMAP.fit(highDimData).coordinates();
double[][] tsneCoords = TSNE.fit(highDimData, 2, 30, 200).coordinates();

📖 Full guide: MANIFOLD.md


Sequence Labeling

smile.sequence implements probabilistic models for labeling sequences of observations.

| Class | Algorithm |
|---|---|
| HMM | Hidden Markov Model (Baum–Welch / Viterbi) |
| CRF | Conditional Random Field |

java
// CRF training
CRF<String> crf = CRF.fit(sequences, labels, features, 0.1, 100);
int[] decoded = crf.predict(testSequence);

// HMM Viterbi decoding
HMM hmm = HMM.fit(observationSequences);
int[] hiddenStates = hmm.predict(observations);

📖 Full guide: SEQUENCE.md


Time Series

smile.timeseries covers classical statistical time-series models.

| Class | Algorithm |
|---|---|
| ARIMA | AutoRegressive Integrated Moving Average |
| GARCH | Generalized Autoregressive Conditional Heteroskedasticity |
| AR | Autoregressive model |
| MA | Moving average model |
| Utility functions | acf(), pacf(), adf() (ADF unit-root test) |

java
// Fit ARIMA(1,1,1)
ARIMA model = ARIMA.fit(timeSeries, 1, 1, 1);
double[] forecast = model.forecast(12);   // 12 steps ahead
System.out.println(model);               // AIC, BIC, coefficients

// Stationarity check
double adfStat = TimeSeriesModel.adf(timeSeries, 1);

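The autocorrelation utilities help choose ARIMA orders before fitting. A short sketch, assuming acf/pacf are exposed as static functions of the series and the lag (the exact utility class may differ in your version):

java
import smile.timeseries.TimeSeries;

// Inspect the first few lags to guide the choice of p and q
for (int lag = 1; lag <= 5; lag++) {
    System.out.printf("lag %d: acf=%.3f pacf=%.3f%n",
            lag, TimeSeries.acf(timeSeries, lag), TimeSeries.pacf(timeSeries, lag));
}
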
📖 Full guide: TIME_SERIES.md


Model Validation & Metrics

smile.validation provides rigorous evaluation protocols.

Resampling strategies

| Class | Strategy |
|---|---|
| CrossValidation | k-fold cross-validation |
| LOOCV | Leave-one-out cross-validation |
| Bootstrap | Stratified bootstrap |

java
// 10-fold CV for a classifier
ClassificationMetrics cv = CrossValidation.classification(10, formula, trainDf,
        (f, d) -> RandomForest.fit(f, d));
System.out.println(cv); // accuracy, F1, MCC, …

// Bootstrap for regression
RegressionMetrics boot = Bootstrap.regression(100, x, y,
        (xi, yi) -> OLS.fit(xi, yi));

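LOOCV follows the same shape, just without a fold count. A sketch, assuming a formula-based entry point that mirrors CrossValidation.classification above:

java
var loo = LOOCV.classification(formula, trainDf, (f, d) -> RandomForest.fit(f, d));
System.out.println(loo);   // same style of metrics report as the k-fold run above
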
Classification metrics

Accuracy, Recall, Precision, F1Score, MatthewsCorrelation, AUC, LogLoss, ConfusionMatrix, Sensitivity, Specificity

Regression metrics

MAE, MSE, RMSE, RSS, R2, MeanAbsoluteDeviation

java
double acc = Accuracy.of(trueLabels, predictedLabels);
double auc = AUC.of(trueLabels, scores);
ConfusionMatrix cm = ConfusionMatrix.of(trueLabels, predictedLabels);

📖 Full guides: VALIDATION.md · VALIDATION_METRICS.md


Hyper-Parameter Optimization

smile.hpo provides search strategies for tuning model hyper-parameters.

| Strategy | Description |
|---|---|
| Grid search | Exhaustive enumeration of a parameter grid |
| Random search | Random sampling from parameter distributions |
| Bayesian optimization | Surrogate model (GP) + acquisition function |

java
HPO.Result result = HPO.randomSearch(50, params -> {
    int ntrees   = params.getInt("ntrees");
    int maxDepth = params.getInt("maxDepth");
    // Score each candidate by 5-fold CV accuracy
    return CrossValidation.classification(5, formula, trainDf,
            (f, d) -> RandomForest.fit(f, d, new RandomForest.Options(ntrees, maxDepth)))
            .accuracy();
}, Map.of("ntrees",   HPO.range(50, 500),
          "maxDepth", HPO.range(3, 20)));

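The table above also lists grid search. Assuming a gridSearch counterpart with the same callback shape as randomSearch (a hypothetical sketch — check smile.hpo for the actual entry point and how grids are enumerated), a small exhaustive sweep could look like:

java
// Hypothetical grid-search counterpart of the randomSearch call above
HPO.Result grid = HPO.gridSearch(params -> {
    int ntrees   = params.getInt("ntrees");
    int maxDepth = params.getInt("maxDepth");
    return CrossValidation.classification(5, formula, trainDf,
            (f, d) -> RandomForest.fit(f, d, new RandomForest.Options(ntrees, maxDepth)))
            .accuracy();
}, Map.of("ntrees",   HPO.range(100, 500),
          "maxDepth", HPO.range(5, 20)));
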
📖 Full guide: HYPER_PARAMETER_OPTIMIZATION.md


ONNX Inference

smile.onnx wraps the ONNX Runtime Java API so you can deploy any ONNX-compatible model (PyTorch, TensorFlow, scikit-learn, XGBoost, …) inside a SMILE pipeline.

java
import smile.onnx.ONNXModel;

try (ONNXModel model = ONNXModel.load("model.onnx")) {
    model.inputNames();    // e.g. ["input"]
    model.outputNames();   // e.g. ["output", "probabilities"]

    float[][] input  = prepareInput(df);
    float[][] output = model.predict(input);
}

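prepareInput above is left to the caller. One simple way to build it is to pull a numeric matrix from the DataFrame and downcast to float; this is a sketch of a hypothetical helper that assumes all selected columns are numeric and reuses the toArray() call shown in the Classification section:

java
import smile.data.DataFrame;

// Hypothetical helper: DataFrame -> float[][] for ONNX input tensors
static float[][] prepareInput(DataFrame df) {
    double[][] values = df.toArray();      // numeric columns as a double matrix
    float[][] input = new float[values.length][];
    for (int i = 0; i < values.length; i++) {
        input[i] = new float[values[i].length];
        for (int j = 0; j < values[i].length; j++) {
            input[i][j] = (float) values[i][j];
        }
    }
    return input;
}
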
📖 Full guide: ONNX.md


User Guides

Detailed documentation for each area of the module:

| Guide | Package(s) covered |
|---|---|
| CLASSIFICATION.md | smile.classification |
| REGRESSION.md | smile.regression |
| CLUSTERING.md | smile.clustering |
| FEATURE_ENGINEERING.md | smile.feature.* |
| ANOMALY_DETECTION.md | smile.anomaly |
| ASSOCIATION_RULE_MINING.md | smile.association |
| VECTOR_QUANTIZATION.md | smile.vq |
| MANIFOLD.md | smile.manifold |
| SEQUENCE.md | smile.sequence |
| TIME_SERIES.md | smile.timeseries |
| VALIDATION.md | smile.validation |
| VALIDATION_METRICS.md | smile.validation.metric |
| HYPER_PARAMETER_OPTIMIZATION.md | smile.hpo |
| TRAINING.md | Model training utilities and patterns |
| ONNX.md | smile.onnx |

Building and Testing

powershell
# Build only core (skip tests)
./gradlew :core:build -x test

# Run all core tests
./gradlew :core:test

# Run a specific test class
./gradlew :core:test --tests "smile.classification.RandomForestTest"

# Skip integration tests (USPS/MNIST heavy datasets)
./gradlew :core:test -DexcludeTags=integration

SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.