core/README.md
The smile-core module is the algorithmic heart of SMILE.
It builds on smile-base (math, linear algebra, data frames) and provides:
smile.anomaly – Isolation Forest, one-class SVM
smile.association – FP-Growth, Apriori, ARM
smile.classification – 20+ classifiers (RF, GBT, SVM, MLP, …)
smile.clustering – K-Means, DBSCAN, Spectral, GMM, …
smile.feature – Transforms, extraction, selection, imputation, SHAP
├─ smile.feature.transform – Scaler, Standardizer, Normalizer, …
├─ smile.feature.extraction – PCA, KernelPCA, BagOfWords, encoders, …
├─ smile.feature.imputation – SimpleImputer, KNNImputer, SVDImputer, …
├─ smile.feature.selection – SSR, SNR, FRegression, IV, GAFE
└─ smile.feature.importance – SHAP, TreeSHAP
smile.hpo – Hyper-parameter search (grid, random, Bayesian)
smile.manifold – IsoMap, LLE, t-SNE, UMAP, KPCA, …
smile.model – Unified model wrappers (CART, MLP internals)
smile.onnx – ONNX runtime integration
smile.regression – 15+ regressors (RF, GBT, SVM, MLP, OLS, LASSO, …)
smile.sequence – HMM, CRF sequence labeling
smile.timeseries – ARIMA, GARCH, exponential smoothing, …
smile.validation – Cross-validation, Bootstrap, metrics
smile.vq – SOM, Neural Gas, GNG, NeuralMap, BIRCH
Dependency: smile-core → smile-base only (no circular deps).
Add the Gradle dependency:
// build.gradle.kts
dependencies {
implementation("com.github.haifengl:smile-core:6.x.x")
}
import smile.datasets.Iris;
import smile.classification.RandomForest;
import smile.validation.metric.Accuracy;
var iris = new Iris();
RandomForest rf = RandomForest.fit(iris.formula(), iris.data());
int[] prediction = rf.predict(iris.testData());
System.out.println("Accuracy: " + Accuracy.of(iris.testLabels(), prediction));
import smile.feature.transform.Standardizer;
import smile.data.transform.InvertibleColumnTransform;
InvertibleColumnTransform scaler = Standardizer.fit(trainDf);
DataFrame scaledTrain = scaler.apply(trainDf);
DataFrame scaledTest = scaler.apply(testDf);
SVM<double[]> svm = SVM.fit(formula, scaledTrain);
smile.classification provides over 16 classifiers for binary and multi-class problems.
| Class | Algorithm |
|---|---|
RandomForest | Ensemble of decision trees (bagging + random feature subset) |
GradientTreeBoost | Gradient boosting of decision trees |
AdaBoost | Adaptive boosting |
DecisionTree | Single CART decision tree |
LogisticRegression | L2-regularized logistic regression |
SVM | Support vector machine (one-vs-one multi-class) |
MLP | Multi-layer perceptron |
KNN | k-nearest neighbors |
LDA / QDA / RDA / FLD | Linear/Quadratic/Regularized DA, Fisher's LD |
NaiveBayes / DiscreteNaiveBayes | Gaussian and discrete Naive Bayes |
RBFNetwork | Radial basis function network |
MaxEntClassifier | Maximum entropy (log-linear) classifier |
Common pattern:
// Formula-based (DataFrame) API
RandomForest rf = RandomForest.fit(Formula.lhs("class"), trainDf);
int label = rf.predict(testTuple);
int[] labels = rf.predict(testDf);
double[] proba = rf.predict(testTuple, new double[k]); // posterior probabilities
// Raw array API
double[][] x = trainDf.toArray();
int[] y = trainDf.column("class").toIntArray();
RandomForest rf2 = RandomForest.fit(x, y);
Ensemble models (RandomForest, GradientTreeBoost) implement TreeSHAP for
feature importance. See Feature Engineering.
📖 Full guide: CLASSIFICATION.md
smile.regression provides over 13 regressors for continuous-valued prediction.
| Class | Algorithm |
|---|---|
RandomForest | Random forest regression |
GradientTreeBoost | Gradient boosted regression trees |
RegressionTree | Single CART regression tree |
OLS | Ordinary least squares |
RidgeRegression | L2-penalized OLS |
LASSO | L1-penalized OLS (coordinate descent) |
ElasticNet | L1+L2-penalized OLS |
SVR | Support vector regression |
MLP | Multi-layer perceptron regression |
GaussianProcessRegression | Gaussian process with Mercer kernel |
RBFNetwork | Radial basis function network |
Common pattern:
OLS ols = OLS.fit(Formula.lhs("price"), trainDf);
double predicted = ols.predict(testTuple);
// Inspect coefficients
System.out.println(ols);
📖 Full guide: REGRESSION.md
smile.clustering contains algorithms for grouping unlabelled data.
| Class | Algorithm |
|---|---|
KMeans | Lloyd's k-Means (batch) |
GMeans / XMeans | Automatic k selection |
CLARANS | Clustering Large Applications |
DBScan | Density-based spatial clustering |
DENCLUE | Density-based with kernel estimation |
SpectralClustering | Graph Laplacian eigenvectors |
MEC | Minimum entropy clustering |
SIB | Sequential information bottleneck |
GaussianMixture | EM-fitted Gaussian mixture model |
HierarchicalClustering | Agglomerative (single/complete/average/Ward) |
BIRCH | Balanced iterative reducing and clustering |
KMeans km = KMeans.fit(data, 5); // 5 clusters
int[] labels = km.y; // cluster assignments
double[][] centroids = km.centroids;
int cluster = km.predict(newPoint);
📖 Full guide: CLUSTERING.md
smile.feature and its subpackages cover the entire preprocessing pipeline.
smile.feature.transform)| Transformer | Output range | Centering | Outlier robust |
|---|---|---|---|
Scaler | [0, 1] | No | No |
WinsorScaler | [0, 1] | No | Yes (percentile) |
MaxAbsScaler | [−1, 1] | No | No |
Standardizer | (−∞, +∞) | Yes | No |
RobustStandardizer | (−∞, +∞) | Yes | Yes (IQR) |
Normalizer | unit norm | No | N/A (row-wise) |
InvertibleColumnTransform t = Standardizer.fit(trainDf);
DataFrame scaled = t.apply(testDf);
DataFrame restored = t.invert(scaled); // lossless roundtrip within training range
// Pipeline
Transform pipeline = Transform.pipeline(
Standardizer.fit(trainDf),
MaxAbsScaler.fit(Standardizer.fit(trainDf).apply(trainDf))
);
smile.feature.extraction)// Batch PCA (keeps 95% variance by default)
PCA pca = PCA.fit(trainDf);
PCA pca5 = pca.getProjection(5); // top 5 components
PCA pca90 = pca.getProjection(0.90); // 90% variance threshold
// Kernel PCA (non-linear)
KernelPCA kpca = KernelPCA.fit(trainDf, new GaussianKernel(1.0), opts);
// Streaming PCA
GHA gha = new GHA(inputDim, 10, TimeFunction.of(0.01));
centeredSamples.forEach(gha::update);
// Random projection (no training needed)
RandomProjection rp = RandomProjection.of(highDim, lowDim);
smile.feature.selection)// Multi-class: BSS/WSS ratio
SumSquaresRatio[] scores = SumSquaresRatio.fit(df, "label");
Arrays.sort(scores); // ascending (worst first)
// Binary: signal-to-noise ratio
SignalNoiseRatio[] snr = SignalNoiseRatio.fit(df, "label");
// Regression: univariate F-statistic
FRegression[] fscores = FRegression.fit(df, "target");
String[] significant = Arrays.stream(fscores)
.filter(r -> r.pvalue() < 0.05).map(FRegression::feature)
.toArray(String[]::new);
// Genetic algorithm wrapper
GAFE gafe = new GAFE(...);
int[] selected = gafe.apply(generations, population, formula, df, fitness);
smile.feature.imputation)SimpleImputer imp = SimpleImputer.fit(trainDf); // mean/mode
KNNImputer knn = new KNNImputer(trainDf, 5, dist);
double[][] fixed = SVDImputer.impute(rawMatrix, 10, 100);
smile.feature.importance)RandomForest rf = RandomForest.fit(formula, trainDf);
double[] phi = rf.shap(testTuple); // per-feature contributions
📖 Full guide: FEATURE_ENGINEERING.md
smile.anomaly provides unsupervised outlier and novelty detection.
| Class | Algorithm |
|---|---|
IsolationForest | Random partitioning; anomaly score ∝ isolation path length |
SVM (one-class) | Hypersphere in kernel feature space |
IsolationForest iforest = IsolationForest.fit(trainData, 100); // 100 trees
double[] scores = iforest.score(testData); // higher = more anomalous
// extensionLevel = 0 → standard IsolationForest
IsolationForest ext = IsolationForest.fit(trainData, 100, extensionLevel);
📖 Full guide: ANOMALY_DETECTION.md
smile.association mines frequent itemsets and association rules from
transaction data.
| Class | Algorithm |
|---|---|
FPTree | FP-Tree data structure |
FPGrowth | Frequent pattern growth (FP-Growth) |
ARM | Association rule mining from itemsets |
int[][] transactions = { {1,2,3}, {2,3,4}, {1,3,5}, ... };
List<int[]> itemsets = FPGrowth.apply(transactions, 0.3); // 30% min support
// Mine rules
List<AssociationRule> rules = ARM.apply(itemsets, transactions.length, 0.7);
rules.forEach(r -> System.out.printf("%s => %s conf=%.2f%n",
Arrays.toString(r.antecedent()), Arrays.toString(r.consequent()),
r.confidence()));
📖 Full guide: ASSOCIATION_RULE_MINING.md
smile.vq provides competitive-learning algorithms for codebook construction
and topology-preserving maps.
| Class | Algorithm |
|---|---|
SOM | Self-organizing map (batch or online) |
NeuralGas | Neural Gas (batch) |
GrowingNeuralGas | Online growing topology graph |
NeuralMap | Approximate nearest-neighbor map |
BIRCH | CF-tree for large-scale clustering |
SOM som = SOM.fit(data, 10, 10); // 10×10 hexagonal grid
int unit = som.predict(sample); // best-matching unit (BMU)
NeuralGas ng = NeuralGas.fit(data, 20);
int centroid = ng.predict(sample);
📖 Full guide: VECTOR_QUANTIZATION.md
smile.manifold provides non-linear dimensionality reduction for visualization
and pre-processing.
| Class | Algorithm |
|---|---|
IsoMap | Geodesic distance + MDS |
LLE | Locally Linear Embedding |
LaplacianEigenmap | Graph Laplacian eigenmaps |
UMAP | Uniform manifold approximation |
TSNE | t-Distributed stochastic neighbor embedding |
SammonMapping | Stress-minimization MDS |
KPCA | Kernel PCA |
double[][] coords2d = UMAP.fit(highDimData).coordinates();
double[][] coords2d = TSNE.fit(highDimData, 2, 30, 200).coordinates();
📖 Full guide: MANIFOLD.md
smile.sequence implements probabilistic models for labeling sequences of
observations.
| Class | Algorithm |
|---|---|
HMM | Hidden Markov Model (Baum–Welch / Viterbi) |
CRF | Conditional Random Field |
// CRF training
CRF<String> crf = CRF.fit(sequences, labels, features, 0.1, 100);
int[] decoded = crf.predict(testSequence);
// HMM Viterbi decoding
HMM hmm = HMM.fit(observationSequences);
int[] hiddenStates = hmm.predict(observations);
📖 Full guide: SEQUENCE.md
smile.timeseries covers classical statistical time-series models.
| Class | Algorithm |
|---|---|
ARIMA | AutoRegressive Integrated Moving Average |
GARCH | Generalized Autoregressive Conditional Heteroskedasticity |
AR | Autoregressive model |
MA | Moving average model |
| Utility functions | acf(), pacf(), adf() (ADF unit-root test) |
// Fit ARIMA(1,1,1)
ARIMA model = ARIMA.fit(timeSeries, 1, 1, 1);
double[] forecast = model.forecast(12); // 12 steps ahead
System.out.println(model); // AIC, BIC, coefficients
// Stationarity check
double adfStat = TimeSeriesModel.adf(timeSeries, 1);
📖 Full guide: TIME_SERIES.md
smile.validation provides rigorous evaluation protocols.
| Class | Strategy |
|---|---|
CrossValidation | k-fold cross-validation |
LOOCV | Leave-one-out cross-validation |
Bootstrap | Stratified bootstrap |
// 10-fold CV for a classifier
ClassificationMetrics cv = CrossValidation.classification(10, formula, trainDf,
(f, d) -> RandomForest.fit(f, d));
System.out.println(cv); // accuracy, F1, MCC, …
// Bootstrap for regression
RegressionMetrics boot = Bootstrap.regression(100, x, y,
(xi, yi) -> OLS.fit(xi, yi));
Accuracy, Recall, Precision, F1Score, MatthewsCorrelation,
AUC, LogLoss, ConfusionMatrix, Sensitivity, Specificity
MAE, MSE, RMSE, RSS, R2, MeanAbsoluteDeviation
double acc = Accuracy.of(trueLabels, predictedLabels);
double auc = AUC.of(trueLabels, scores);
ConfusionMatrix cm = ConfusionMatrix.of(trueLabels, predictedLabels);
📖 Full guides: VALIDATION.md · VALIDATION_METRICS.md
smile.hpo provides search strategies for tuning model hyper-parameters.
| Strategy | Description |
|---|---|
| Grid search | Exhaustive enumeration of a parameter grid |
| Random search | Random sampling from parameter distributions |
| Bayesian optimization | Surrogate model (GP) + acquisition function |
HPO.Result result = HPO.randomSearch(50, params -> {
int ntrees = params.getInt("ntrees");
int maxDepth = params.getInt("maxDepth");
RandomForest rf = RandomForest.fit(formula, trainDf,
new RandomForest.Options(ntrees, maxDepth));
return CrossValidation.classification(5, formula, trainDf,
(f, d) -> RandomForest.fit(f, d, new RandomForest.Options(ntrees, maxDepth)))
.accuracy();
}, Map.of("ntrees", HPO.range(50, 500),
"maxDepth", HPO.range(3, 20)));
📖 Full guide: HYPER_PARAMETER_OPTIMIZATION.md
smile.onnx wraps the ONNX Runtime Java API so you can deploy any
ONNX-compatible model (PyTorch, TensorFlow, scikit-learn, XGBoost, …) inside
a SMILE pipeline.
import smile.onnx.ONNXModel;
try (ONNXModel model = ONNXModel.load("model.onnx")) {
float[][] input = prepareInput(df);
float[][] output = model.predict(input);
}
model.inputNames(); // ["input"]
model.outputNames(); // ["output", "probabilities"]
📖 Full guide: ONNX.md
Detailed documentation for each area of the module:
| Guide | Package(s) covered |
|---|---|
| CLASSIFICATION.md | smile.classification |
| REGRESSION.md | smile.regression |
| CLUSTERING.md | smile.clustering |
| FEATURE_ENGINEERING.md | smile.feature.* |
| ANOMALY_DETECTION.md | smile.anomaly |
| ASSOCIATION_RULE_MINING.md | smile.association |
| VECTOR_QUANTIZATION.md | smile.vq |
| MANIFOLD.md | smile.manifold |
| SEQUENCE.md | smile.sequence |
| TIME_SERIES.md | smile.timeseries |
| VALIDATION.md | smile.validation |
| VALIDATION_METRICS.md | smile.validation.metric |
| HYPER_PARAMETER_OPTIMIZATION.md | smile.hpo |
| TRAINING.md | Model training utilities and patterns |
| ONNX.md | smile.onnx |
# Build only core (skip tests)
./gradlew :core:build -x test
# Run all core tests
./gradlew :core:test
# Run a specific test class
./gradlew :core:test --tests "smile.classification.RandomForestTest"
# Skip integration tests (USPS/MNIST heavy datasets)
./gradlew :core:test -DexcludeTags=integration
SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.