core/VALIDATION.md
The smile.validation package provides everything needed to estimate how well a model
generalizes to unseen data. It is built around three orthogonal concerns:
- Splitting strategies: `Bag`, `Bootstrap`, `CrossValidation`, `LOOCV`
- Validation runners: `ClassificationValidation`, `RegressionValidation` and their aggregating counterparts `ClassificationValidations`, `RegressionValidations`
- Model selection: `ModelSelection` (AIC / BIC)

All types are serializable records or static-method-only interfaces, so they carry no mutable state and compose freely.
### The Bag record

Every splitting strategy returns one or more `Bag` objects.
```java
public record Bag(int[] samples, int[] oob)
```
| Field | Meaning |
|---|---|
| `samples()` | Training indices into the original dataset |
| `oob()` | Held-out (out-of-bag / test) indices |
Indices are into the original array, not a copy — no data is ever duplicated.
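Because a `Bag` stores indices rather than copies, materializing a training subset is a one-liner against the original arrays. A minimal sketch, assuming `double[][] x` and `int[] y` hold the original data:

```java
import java.util.Arrays;

Bag bag = Bag.split(x.length, 0.2);
// Build the training subset by indexing into the original arrays
double[][] trainX = Arrays.stream(bag.samples())
        .mapToObj(i -> x[i]).toArray(double[][]::new);
int[] trainY = Arrays.stream(bag.samples())
        .map(i -> y[i]).toArray();
```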
The validation layer distinguishes two classifier flavours:

- Hard (`Classifier.isSoft() == false`) — predicts a single class label. Metrics that require probability estimates (AUC, log loss, cross-entropy) are reported as `Double.NaN`.
- Soft (`Classifier.isSoft() == true`) — also provides posterior probabilities. All metrics are computed and reported.

### Holdout split (Bag.split)

A single random train / test split. The test proportion is set with `holdout` ∈ (0, 1).
```java
// 80% train, 20% test on 1000 raw samples
Bag bag = Bag.split(1000, 0.2);
int[] trainIdx = bag.samples();
int[] testIdx = bag.oob();
```
For DataFrame inputs a convenience overload returns a typed pair:
```java
var iris = new Iris();
Tuple2<DataFrame, DataFrame> split = Bag.split(iris.data(), 0.2);
DataFrame train = split._1;
DataFrame test = split._2;
```
Both `n` and `holdout` are validated; `holdout` must lie strictly between 0 and 1.
### Stratified holdout (Bag.stratify)

Ensures the class distribution in each split mirrors the full dataset — essential when classes are imbalanced.
```java
// Stratified 70/30 split for a DataFrame, using "species" as the class column
Tuple2<DataFrame, DataFrame> split =
    Bag.stratify(iris.data(), "species", 0.3);
```
The low-level `int[]` overload is package-private and used internally by the validation runners.
### Bootstrap

Bootstrap sampling draws n samples with replacement from the n originals, so roughly 63.2% of them appear in the training set and the remaining ~36.8% appear only in the out-of-bag test set.
```java
// 100 rounds of plain bootstrap for 500 samples
Bag[] bags = Bootstrap.of(500, 100);
```
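The 63.2% figure comes from the probability that a given sample is drawn at least once: 1 − (1 − 1/n)^n → 1 − e⁻¹ ≈ 0.632. A quick empirical check against the first bag (a sketch, not part of the API):

```java
// A bootstrap bag has n entries but many duplicates;
// the number of *distinct* training indices should be roughly 0.632 * n
long distinct = java.util.Arrays.stream(bags[0].samples()).distinct().count();
System.out.printf("In-bag fraction: %.3f%n", distinct / 500.0);
```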
Stratified bootstrap preserves class proportions in each bag:
```java
int[] labels = ...; // class label per sample
Bag[] bags = Bootstrap.of(labels, 100);
```
Bootstrap runners for classifiers and regressors train and evaluate automatically:
```java
var result = Bootstrap.classification(100, iris.formula(), iris.data(),
        DecisionTree::fit);
System.out.println("Accuracy: " + result.avg().accuracy()
        + " ± " + result.std().accuracy());
```
### k-fold cross-validation (CrossValidation.of)

Partitions the data into k equal folds; each fold serves as the test set exactly once while the remaining k−1 folds are used for training.
```java
// 5-fold CV splits for 500 samples
Bag[] folds = CrossValidation.of(500, 5);
```
`k` must satisfy 1 ≤ k ≤ n. The last fold absorbs any remainder when n is not divisible by k.
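A quick way to see the remainder rule, using a sample count that is not divisible by k (a sketch):

```java
// n = 503, k = 5: expect test folds of 100, 100, 100, 100, 103
// per the remainder rule above
for (Bag fold : CrossValidation.of(503, 5)) {
    System.out.println(fold.oob().length);
}
```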
### Stratified k-fold (CrossValidation.stratify)

Guarantees that each fold preserves the original class proportions:
```java
int[] labels = ...; // one per sample
Bag[] folds = CrossValidation.stratify(labels, 5);
```
A warning is logged (via SLF4J) if any class has fewer examples than `k`, which would produce degenerate folds.
### Group k-fold (CrossValidation.nonoverlap)

Used when samples belong to groups (e.g. subject IDs, document IDs, time windows) and leaking information across groups would inflate results. Each group appears entirely in either the training set or the test set for any given fold.
```java
// group[i] is the group identifier for sample i
int[] group = {0, 0, 1, 1, 1, 2, 2, 3, 3, 3};
Bag[] folds = CrossValidation.nonoverlap(group, 3);
```
Groups are balanced across folds greedily by size. `k` must not exceed the number of distinct groups.
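The no-leakage guarantee is easy to check: no group identifier should appear in both `samples()` and `oob()` of the same fold. A minimal verification sketch, reusing the `group` array above:

```java
for (Bag fold : folds) {
    var trainGroups = java.util.Arrays.stream(fold.samples())
            .map(i -> group[i]).boxed()
            .collect(java.util.stream.Collectors.toSet());
    boolean leaked = java.util.Arrays.stream(fold.oob())
            .anyMatch(i -> trainGroups.contains(group[i]));
    System.out.println("leaked: " + leaked); // expect false for every fold
}
```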
### Leave-one-out (LOOCV)

In LOOCV every sample serves as the test set exactly once, making it the most data-efficient but computationally expensive strategy.
```java
// Raw index splits: train[i] contains all indices except i
int[][] trainSets = LOOCV.of(100);
// trainSets[i].length == 99 for every i
```
Full classification and regression training loops are also available and return the same `ClassificationMetrics` / `RegressionMetrics` records as the other strategies.
### Classification validation

Train on an explicit train/test pair and get back a `ClassificationValidation` record containing the model, the truth labels, the predictions, optional posteriors, the confusion matrix, and the computed metrics:
```java
// Array-based trainer
ClassificationValidation<DecisionTree> result =
    ClassificationValidation.of(trainX, trainY, testX, testY, DecisionTree::fit);
System.out.println(result.metrics().accuracy());
System.out.println(result.confusion());
```
With a `Formula` and `DataFrame` the API is symmetric:
```java
var usps = new USPS();
ClassificationValidation<DecisionTree> result =
    ClassificationValidation.of(usps.formula(), usps.train(), usps.test(),
        DecisionTree::fit);
System.out.println(result);
```
Pass a `Bag[]` to train and evaluate over many folds and receive a `ClassificationValidations` that aggregates the per-fold results:
```java
Bag[] folds = CrossValidation.of(x.length, 10);
ClassificationValidations<DecisionTree> cv =
    ClassificationValidation.of(folds, x, y, DecisionTree::fit);

ClassificationMetrics avg = cv.avg();
ClassificationMetrics std = cv.std();
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
        100 * avg.accuracy(), 100 * std.accuracy());
```
The `std` metrics represent the standard deviation across folds. With a single fold, `std` is 0.0 everywhere (instead of throwing an exception).
Bootstrap and LOOCV runners follow the same pattern:
```java
// Bootstrap
var bs = Bootstrap.classification(100, formula, data, DecisionTree::fit);
System.out.println(bs.avg().accuracy());

// Stratified CV
var scv = CrossValidation.classification(5, formula, data, DecisionTree::fit);

// Repeated CV (3 repetitions × 5 folds = 15 training runs)
var rcv = CrossValidation.classification(3, 5, formula, data, DecisionTree::fit);
```
### ClassificationMetrics

```java
public record ClassificationMetrics(
    double fitTime,      // ms to train
    double scoreTime,    // ms to score the test set
    int size,            // number of test samples
    int error,           // number of misclassified samples
    double accuracy,     // correct / total
    double sensitivity,  // TP / (TP + FN) — binary, otherwise NaN
    double specificity,  // TN / (TN + FP) — binary, otherwise NaN
    double precision,    // TP / (TP + FP) — binary, otherwise NaN
    double f1,           // 2·P·R / (P + R) — binary, otherwise NaN
    double mcc,          // Matthews correlation coefficient — binary, otherwise NaN
    double auc,          // area under the ROC curve — soft binary, otherwise NaN
    double logloss,      // -log(p_correct) — soft binary, otherwise NaN
    double crossEntropy  // mean cross-entropy — soft multiclass, otherwise NaN
)
```
Which fields are populated depends on the classifier and data:
| Scenario | Populated |
|---|---|
| Hard binary | accuracy, error, sensitivity, specificity, precision, F1, MCC |
| Soft binary | all of the above, plus AUC and log loss |
| Hard multiclass | accuracy, error |
| Soft multiclass | accuracy, error, cross-entropy |
`Double.NaN` is used for metrics that are not meaningful in the current scenario. Always guard display code with `!Double.isNaN(m.auc())` before printing probability-based metrics.
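A small display helper consistent with the table above: it prints the always-populated counts and skips any probability-based metric that is NaN (a sketch; the accessors are the record components listed earlier):

```java
static void report(ClassificationMetrics m) {
    System.out.printf("accuracy: %.4f (%d errors / %d samples)%n",
            m.accuracy(), m.error(), m.size());
    if (!Double.isNaN(m.auc()))          System.out.printf("AUC: %.4f%n", m.auc());
    if (!Double.isNaN(m.logloss()))      System.out.printf("log loss: %.4f%n", m.logloss());
    if (!Double.isNaN(m.crossEntropy())) System.out.printf("cross-entropy: %.4f%n", m.crossEntropy());
}
```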
### Regression validation

The regression API mirrors the classification API:

```java
var abalone = new Abalone();
RegressionValidation<RegressionTree> result =
    RegressionValidation.of(abalone.formula(), abalone.train(), abalone.test(),
        RegressionTree::fit);
System.out.println(result);
// Prints: RSS, MSE, RMSE, MAD, R²
```
Aggregation over many folds works exactly like the classification case:

```java
Bag[] folds = CrossValidation.of(x.length, 10);
RegressionValidations<RegressionTree> cv =
    RegressionValidation.of(folds, x, y, RegressionTree::fit);
System.out.printf("RMSE: %.4f ± %.4f%n",
        cv.avg().rmse(), cv.std().rmse());
```
Bootstrap and LOOCV variants are also available:
```java
var bs = Bootstrap.regression(100, formula, data, RegressionTree::fit);
```
### RegressionMetrics

```java
public record RegressionMetrics(
    double fitTime,    // ms to train
    double scoreTime,  // ms to score
    int size,          // test set size
    double rss,        // residual sum of squares
    double mse,        // mean squared error
    double rmse,       // root mean squared error
    double mad,        // mean absolute error (MAE)
    double r2          // coefficient of determination
)
```
All regression metrics are always populated — there is no hard/soft distinction.
### Model selection (AIC / BIC)

`ModelSelection` provides two static criteria for comparing models fit to the same dataset. Both penalize model complexity to prevent overfitting:
| Criterion | Formula | Penalty |
|---|---|---|
| AIC (Akaike) | 2k − 2 log L | 2k |
| BIC (Bayesian) | k log n − 2 log L | k log n |
Here L is the maximised likelihood, k is the number of free parameters, and n
is the sample size (BIC only).
Lower is better for both AIC and BIC.
```java
double logL1 = -120.0; // log-likelihood of model 1
double logL2 = -125.0; // log-likelihood of model 2 (simpler, fewer parameters)
int k1 = 10, k2 = 4, n = 500;

double aic1 = ModelSelection.AIC(logL1, k1);
double aic2 = ModelSelection.AIC(logL2, k2);
System.out.println(aic1 < aic2 ? "Model 1 preferred by AIC"
                               : "Model 2 preferred by AIC");

double bic1 = ModelSelection.BIC(logL1, k1, n);
double bic2 = ModelSelection.BIC(logL2, k2, n);
System.out.println(bic1 < bic2 ? "Model 1 preferred by BIC"
                               : "Model 2 preferred by BIC");
```
When to use which:

- AIC targets expected predictive accuracy and tends to favour the model that predicts best, even if it is larger.
- BIC is consistent: it selects the true model as n → ∞ if the true model is among the candidates. It is more conservative and tends to prefer smaller models. The log n factor means that BIC penalizes complexity more than AIC whenever n > e² ≈ 7.4.

### Workflow: holdout sanity check

Use a holdout split when you want the fastest possible sanity check before committing to a full CV run:
```java
var iris = new Iris();
Tuple2<DataFrame, DataFrame> split = Bag.split(iris.data(), 0.2);
var result = ClassificationValidation.of(
    iris.formula(), split._1, split._2, DecisionTree::fit);
System.out.println(result);
```
### Workflow: k-fold cross-validation

The idiomatic workflow for a thorough, low-variance estimate:
```java
var iris = new Iris();
var cv = CrossValidation.classification(10, iris.formula(), iris.data(),
        DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
        100 * cv.avg().accuracy(),
        100 * cv.std().accuracy());
```
The `std` field lets you report confidence intervals around each metric.
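For instance, a normal-approximation 95% interval uses the standard error of the mean across the k folds. A sketch (the 1.96 z-value assumes the per-fold scores are roughly normal):

```java
int k = 10;                                       // number of folds above
double mean = cv.avg().accuracy();
double se   = cv.std().accuracy() / Math.sqrt(k); // standard error of the mean
System.out.printf("Accuracy: %.4f, 95%% CI: [%.4f, %.4f]%n",
        mean, mean - 1.96 * se, mean + 1.96 * se);
```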
### Workflow: repeated cross-validation

Repeated CV runs standard k-fold multiple times with different random permutations, giving a more stable estimate at the cost of round × k training runs:
```java
// 5 repetitions of 5-fold CV = 25 training runs
var rcv = CrossValidation.classification(5, 5, iris.formula(), iris.data(),
        DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
        100 * rcv.avg().accuracy(),
        100 * rcv.std().accuracy());
```
### Workflow: bootstrap

Bootstrap is often preferred for small datasets because the test set size varies per round (unlike fixed-fold CV). The stratified variant, shown in the sketch below, is recommended whenever classes are imbalanced:
```java
// Class labels; these feed the stratified Bootstrap.of(y, k) overload below
int[] y = formula.y(data).toIntArray();

var bs = Bootstrap.classification(100, formula, data, DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
        100 * bs.avg().accuracy(),
        100 * bs.std().accuracy());
```
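To run the stratified variant end-to-end, build stratified bags from the labels and hand them to the array-based runner from the reference table (a sketch, assuming `x` and `y` arrays for the same data):

```java
// Stratified bags preserve class proportions in every round
Bag[] bags = Bootstrap.of(y, 100);
var scv = ClassificationValidation.of(bags, x, y, DecisionTree::fit);
System.out.printf("Stratified accuracy: %.2f%%%n", 100 * scv.avg().accuracy());
```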
### Workflow: group cross-validation

When samples are grouped (e.g. multiple measurements per patient, or overlapping time windows), standard CV leaks information between folds. Use group k-fold:
```java
// subjectId[i] == the subject/group to which sample i belongs
int[] subjectId = ...;
Bag[] folds = CrossValidation.nonoverlap(subjectId, 5);
var cv = ClassificationValidation.of(folds, x, y, SVM::fit);
System.out.println(cv.avg());
```
### Workflow: leave-one-out

LOOCV is unbiased and uses almost all of the data for training in each round, making it the right choice when data is scarce:
```java
// Array-based
var metrics = LOOCV.classification(x, y, LogisticRegression::fit);
System.out.printf("Accuracy: %.2f%%%n", 100 * metrics.accuracy());

// Formula / DataFrame-based
var metrics2 = LOOCV.classification(formula, data, DecisionTree::fit);
```
Prefer `CrossValidation.stratify` for datasets larger than ~200 samples, since the compute cost of LOOCV is O(n) training runs.
### Workflow: model comparison with AIC / BIC

Fit both models to the same training set, extract their log-likelihoods, and compare:
```java
GaussianMixture m1 = GaussianMixture.fit(x, 2); // 2 components
GaussianMixture m2 = GaussianMixture.fit(x, 5); // 5 components

double aic1 = ModelSelection.AIC(m1.logLikelihood(), m1.numParameters());
double aic2 = ModelSelection.AIC(m2.logLikelihood(), m2.numParameters());
System.out.println("Preferred by AIC: " + (aic1 < aic2 ? "2-component" : "5-component"));
```
### Quick reference: splitting

| Method | Description |
|---|---|
| `Bag.split(n, holdout)` | Random holdout split on n raw indices |
| `Bag.split(data, holdout)` | Random holdout split returning two DataFrames |
| `Bag.stratify(data, column, holdout)` | Stratified holdout split on a DataFrame |
| `Bootstrap.of(n, k)` | k bootstrap bags from n samples |
| `Bootstrap.of(category, k)` | k stratified bootstrap bags |
| `CrossValidation.of(n, k)` | Standard k-fold splits |
| `CrossValidation.stratify(labels, k)` | Stratified k-fold splits |
| `CrossValidation.nonoverlap(group, k)` | Group k-fold splits |
| `LOOCV.of(n)` | Leave-one-out training index arrays |
### Quick reference: validation runners

| Method | Returns |
|---|---|
| `ClassificationValidation.of(formula, train, test, trainer)` | `ClassificationValidation<M>` |
| `ClassificationValidation.of(bags, x, y, trainer)` | `ClassificationValidations<M>` |
| `CrossValidation.classification(k, formula, data, trainer)` | `ClassificationValidations<M>` |
| `CrossValidation.classification(round, k, formula, data, trainer)` | `ClassificationValidations<M>` (repeated) |
| `Bootstrap.classification(k, formula, data, trainer)` | `ClassificationValidations<M>` |
| `LOOCV.classification(x, y, trainer)` | `ClassificationMetrics` |
| `RegressionValidation.of(formula, train, test, trainer)` | `RegressionValidation<M>` |
| `RegressionValidation.of(bags, x, y, trainer)` | `RegressionValidations<M>` |
| `CrossValidation.regression(k, formula, data, trainer)` | `RegressionValidations<M>` |
| `Bootstrap.regression(k, formula, data, trainer)` | `RegressionValidations<M>` |
| `LOOCV.regression(x, y, trainer)` | `RegressionMetrics` |
### Quick reference: model selection

| Method | Formula |
|---|---|
| `ModelSelection.AIC(logL, k)` | 2k − 2 logL |
| `ModelSelection.BIC(logL, k, n)` | k log n − 2 logL |
### Comparing error counts across datasets

`ClassificationMetrics.size` records the test-set size for each round. When comparing models trained on different datasets, normalize by sample count rather than comparing raw error counts (see the sketch below).
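A sketch of the normalization, with two hypothetical metrics records `a` and `b` evaluated on different test sets:

```java
// Compare error *rates*, not raw error counts
double rateA = (double) a.error() / a.size();
double rateB = (double) b.error() / b.size();
System.out.printf("error rate A: %.4f, B: %.4f%n", rateA, rateB);
```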
### std from a single fold

`ClassificationValidations.of` and `RegressionValidations.of` require a list of at least one `ClassificationValidation` / `RegressionValidation`. With exactly one round, `std` is 0.0 for every field — meaningful aggregation requires two or more rounds.
### Double.NaN in hard-classifier metrics

Probability-based metrics (`auc`, `logloss`, `crossEntropy`) are `Double.NaN` for hard classifiers. Passing them to arithmetic expressions silently propagates NaN:
```java
// Unsafe: if auc is NaN this prints NaN
System.out.printf("AUC: %.4f%n", metrics.auc());

// Safe
if (!Double.isNaN(metrics.auc())) {
    System.out.printf("AUC: %.4f%n", metrics.auc());
}
```
### Group leakage

Use `CrossValidation.nonoverlap` whenever samples within a group share information (repeated measurements, sliding windows, augmented copies). Using standard k-fold in these cases inflates accuracy estimates because the same underlying signal appears in both train and test.
### Choosing k

Increasing k beyond 10 rarely improves variance; instead, use repeated CV (`CrossValidation.classification(round, k, ...)`) to get a more stable estimate at the cost of round × k training runs.
### LOOCV cost

LOOCV fits the model n times. For n = 10 000 with a non-trivial model this is prohibitively slow. Prefer stratified 10-fold CV or bootstrap for datasets larger than ~200–500 samples.
### BIC requires n > 0

`ModelSelection.BIC` calls `Math.log(n)`. Passing n ≤ 0 silently produces NaN or -Infinity. Always ensure n is a positive integer matching the training sample count (a minimal guard is sketched below).
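A minimal guard (a sketch; `logL`, `k`, and `n` are the caller's values):

```java
if (n <= 0) {
    throw new IllegalArgumentException("sample size must be positive: " + n);
}
double bic = ModelSelection.BIC(logL, k, n);
```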
SMILE — Copyright © 2010-2026 Haifeng Li. GNU GPL licensed.