core/ANOMALY_DETECTION.md
The package `smile.anomaly` provides two main approaches for unsupervised and
semi-supervised anomaly detection:

- Isolation Forest (`IsolationForest`): tree-ensemble, unsupervised anomaly
  scoring for numeric tabular data.
- One-Class SVM (`SVM<T>`): kernel-based novelty detection that learns the
  support of a high-dimensional distribution.

Package: `smile.anomaly`

| Class | Algorithm | Score direction |
|---|---|---|
| `IsolationForest` | Random-partition tree ensemble | Higher → more anomalous |
| `SVM<T>` | One-class support vector machine | Lower (negative) → more anomalous |

Both classes are Serializable and support round-trip persistence via
smile.io.Write.object / smile.io.Read.object.
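Use Isolation Forest when:

- The data is numeric and tabular (`double[][]`).
- A normalized score in `(0, 1]` is convenient, where `> 0.5` roughly flags anomalies.
- Oblique (non-axis-aligned) splits may help, which `extensionLevel` controls.

Use One-Class SVM when:
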
Class: smile.anomaly.IsolationForest
```java
import smile.anomaly.IsolationForest;

double[][] train = {
    {0.0, 0.0}, {0.1, -0.1}, {-0.05, 0.05},
    {0.05, 0.08}, {-0.08, -0.03}
};

// Fit with default options (100 trees, extensionLevel=0)
IsolationForest model = IsolationForest.fit(train);

// Score individual points; higher means more anomalous
double inlierScore = model.score(new double[] {0.02, 0.01});
double outlierScore = model.score(new double[] {6.0, -6.0});
System.out.printf("inlier score = %.4f%n", inlierScore);   // e.g. 0.38
System.out.printf("outlier score = %.4f%n", outlierScore); // e.g. 0.82
```

All hyperparameters are captured in IsolationForest.Options:
```java
var options = new IsolationForest.Options(
    200,  // ntrees: number of isolation trees
    0,    // maxDepth: 0 = auto (log₂ of subsample size)
    0.7,  // subsample: fraction of rows per tree (0 < subsample < 1)
    0     // extensionLevel: 0 = standard Isolation Forest
);
IsolationForest model = IsolationForest.fit(train, options);
```

Options supports lossless roundtrip via java.util.Properties:
```java
import java.util.Properties;

Properties props = options.toProperties();
var loaded = IsolationForest.Options.of(props);
assert options.equals(loaded);
```

Key property keys:
| Property key | Default |
|---|---|
| `smile.isolation_forest.trees` | 100 |
| `smile.isolation_forest.max_depth` | 0 |
| `smile.isolation_forest.sampling_rate` | 0.7 |
| `smile.isolation_forest.extension_level` | 0 |

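The keys above can also be set by hand, for example when they come from a configuration file. The following is a minimal sketch that uses only `java.util.Properties`; the class `IsolationForestConfig` and its `parse` method are hypothetical stand-ins that show how the documented keys and defaults fit together, and they are not SMILE's own `Options.of`.

```java
import java.util.Properties;

/** Hypothetical holder for the four documented keys; not a SMILE class. */
final class IsolationForestConfig {
    final int trees;
    final int maxDepth;
    final double samplingRate;
    final int extensionLevel;

    private IsolationForestConfig(int trees, int maxDepth, double samplingRate, int extensionLevel) {
        this.trees = trees;
        this.maxDepth = maxDepth;
        this.samplingRate = samplingRate;
        this.extensionLevel = extensionLevel;
    }

    /** Reads each documented key, falling back to the documented default when it is absent. */
    static IsolationForestConfig parse(Properties props) {
        return new IsolationForestConfig(
            Integer.parseInt(props.getProperty("smile.isolation_forest.trees", "100")),
            Integer.parseInt(props.getProperty("smile.isolation_forest.max_depth", "0")),
            Double.parseDouble(props.getProperty("smile.isolation_forest.sampling_rate", "0.7")),
            Integer.parseInt(props.getProperty("smile.isolation_forest.extension_level", "0")));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("smile.isolation_forest.trees", "200"); // override a single key
        IsolationForestConfig cfg = parse(props);
        System.out.printf("trees=%d maxDepth=%d samplingRate=%.1f extensionLevel=%d%n",
            cfg.trees, cfg.maxDepth, cfg.samplingRate, cfg.extensionLevel);
    }
}
```

With SMILE on the classpath, the same `Properties` object would instead be handed to `IsolationForest.Options.of(props)`.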
The extensionLevel controls how many feature dimensions participate in the
random splitting hyperplane:
| `extensionLevel` | Behaviour |
|---|---|
| `0` | Standard Isolation Forest: axis-aligned splits |
| `1 … p-2` | Intermediate: hyperplanes in an (`extensionLevel`+1)-dimensional subspace |
| `p-1` | Fully extended: hyperplanes with random slopes in all dimensions |

Rules:

- The valid range is `[0, p-1]`, where `p` is the input dimensionality.
- `extensionLevel >= p` raises `IllegalArgumentException`.
- The configured level can be read back with `model.extensionLevel()`.

```java
// Standard Isolation Forest (extensionLevel = 0)
var std = IsolationForest.fit(data, new IsolationForest.Options(100, 0, 0.7, 0));
System.out.println(std.extensionLevel()); // 0

// Extended Isolation Forest
var ext = IsolationForest.fit(data, new IsolationForest.Options(100, 0, 0.7, 1));
System.out.println(ext.extensionLevel()); // 1
```

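To make the effect of `extensionLevel` concrete, the sketch below draws a split normal with `extensionLevel + 1` active coordinates and tests which side of the resulting hyperplane a point falls on. It follows the general Extended Isolation Forest idea only; `ExtendedSplitSketch`, `randomNormal`, and `goesLeft` are hypothetical names, and SMILE's internal tree construction may differ.

```java
import java.util.Random;

/** Illustrative only: not SMILE's tree-building code. */
class ExtendedSplitSketch {

    /** Draws a normal vector in which (extensionLevel + 1) randomly chosen coordinates are active. */
    static double[] randomNormal(int p, int extensionLevel, Random rng) {
        if (extensionLevel < 0 || extensionLevel > p - 1) {
            throw new IllegalArgumentException("extensionLevel must be in [0, p-1]");
        }
        // Fisher-Yates shuffle of the coordinate indices.
        int[] idx = new int[p];
        for (int i = 0; i < p; i++) idx[i] = i;
        for (int i = p - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        // Only the first (extensionLevel + 1) shuffled coordinates get a random slope.
        double[] normal = new double[p];
        for (int k = 0; k <= extensionLevel; k++) {
            normal[idx[k]] = rng.nextGaussian();
        }
        return normal;
    }

    /** True when (x - intercept) · normal <= 0, i.e. x lies on the "left" side of the hyperplane. */
    static boolean goesLeft(double[] x, double[] normal, double[] intercept) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += (x[i] - intercept[i]) * normal[i];
        }
        return dot <= 0.0;
    }
}
```

With `extensionLevel = 0` a single coordinate is active, so the test reduces to comparing one feature against a cut point, which is the axis-aligned split of the standard algorithm.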
```java
// Batch score (runs in parallel)
double[] scores = model.score(batchData);

// One-step predict: true → anomaly
boolean isAnomaly = model.predict(x, 0.6); // threshold in (0.5, 1.0)
```

predict(double[] x, double threshold) returns true when
score(x) > threshold. Typical thresholds are in (0.5, 0.8) depending on
the expected contamination rate.
```java
import smile.io.Read;
import smile.io.Write;
import java.nio.file.Path;

Path path = Write.object(model);
IsolationForest loaded = (IsolationForest) Read.object(path);
```

Class: smile.anomaly.SVM<T> (extends smile.model.svm.KernelMachine<T>)
```java
import smile.anomaly.SVM;
import smile.math.kernel.GaussianKernel;

double[][] train = {
    {0.0, 0.0}, {0.1, -0.1}, {-0.05, 0.05},
    {0.05, 0.08}, {-0.08, -0.03}
};

SVM<double[]> model = SVM.fit(
    train,
    new GaussianKernel(1.0),
    new SVM.Options(0.2, 1E-3)
);

// Positive → inlier, Negative → anomaly
double inlierScore = model.score(new double[] {0.02, 0.01});
double outlierScore = model.score(new double[] {4.0, -4.0});
System.out.printf("inlier score = %.4f%n", inlierScore);   // e.g. 0.45
System.out.printf("outlier score = %.4f%n", outlierScore); // e.g. -0.72
```

Score convention: Unlike `IsolationForest`, `SVM.score()` returns the raw decision function value: positive = inlier, negative = anomaly.

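The sign convention follows from the shape of a kernel decision function. The sketch below assumes a Gaussian kernel of the form exp(−‖u − v‖² / (2σ²)) and a decision value of Σ αᵢ·k(svᵢ, x) − b; the class `OneClassDecisionSketch` and the hand-picked `alpha` and `b` values are hypothetical and serve only to show why points near the training data score positive and distant points score negative.

```java
/** Illustrative only: a hand-rolled kernel decision function, not SMILE's solver output. */
class OneClassDecisionSketch {

    /** Gaussian kernel, assumed here to be exp(-||u - v||^2 / (2 * sigma^2)). */
    static double gaussian(double[] u, double[] v, double sigma) {
        double d2 = 0.0;
        for (int i = 0; i < u.length; i++) {
            double d = u[i] - v[i];
            d2 += d * d;
        }
        return Math.exp(-d2 / (2.0 * sigma * sigma));
    }

    /** Decision value: weighted kernel similarity to the support vectors, minus the offset b. */
    static double score(double[] x, double[][] supportVectors, double[] alpha, double b, double sigma) {
        double f = 0.0;
        for (int i = 0; i < supportVectors.length; i++) {
            f += alpha[i] * gaussian(supportVectors[i], x, sigma);
        }
        return f - b;
    }

    public static void main(String[] args) {
        double[][] sv = {{0.0, 0.0}, {0.1, -0.1}}; // hypothetical support vectors
        double[] alpha = {0.5, 0.5};               // hypothetical weights
        double b = 0.5;                            // hypothetical offset

        // Close to the support vectors: kernel values near 1, so the score is positive.
        System.out.println(score(new double[] {0.02, 0.01}, sv, alpha, b, 1.0));
        // Far from the support vectors: kernel values near 0, so the score is negative.
        System.out.println(score(new double[] {4.0, -4.0}, sv, alpha, b, 1.0));
    }
}
```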
SVM.Options(double nu, double tol):
| Parameter | Meaning | Default |
|---|---|---|
| `nu` | Upper bound on outlier fraction; lower bound on support-vector fraction. Range: `(0, 1]`. | 0.5 |
| `tol` | Solver convergence tolerance (`> 0`). | 1E-3 |

```java
var opts = new SVM.Options(0.1, 1E-4); // 10% contamination budget
Properties props = opts.toProperties();
SVM.Options loaded = SVM.Options.of(props);
assert opts.equals(loaded);
```

Property keys:
| Property key | Default |
|---|---|
| `smile.svm.nu` | 0.5 |
| `smile.svm.tolerance` | 1E-3 |

Kernel selection: Any smile.math.kernel.MercerKernel<T> is accepted:

- `GaussianKernel(sigma)`: most common choice; controls locality of boundary.
- `PolynomialKernel(degree, scale, offset)`: for polynomial boundaries.
- `LinearKernel()`: rarely used for one-class SVM.

```java
// Batch score (runs in parallel)
double[] scores = model.score(batchSamples);

// One-step predict: true → anomaly
// threshold = 0.0 uses the natural SVM decision boundary
boolean isAnomaly = model.predict(x, 0.0);
```

`predict(T x, double threshold)` returns `true` when `score(x) < threshold`.
Use `0.0` as the natural decision boundary; a more negative threshold flags
fewer points, so borderline cases are treated as inliers.

```java
Path path = Write.object(model);
@SuppressWarnings("unchecked")
SVM<double[]> loaded = (SVM<double[]>) Read.object(path);
```

| Model | `score()` range | Anomaly direction |
|---|---|---|
| `IsolationForest` | `(0, 1]` | Higher → anomaly |
| `SVM` | any real (f(x) − b) | Lower (negative) → anomaly |

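The `(0, 1]` range of the Isolation Forest score is what the standard Isolation Forest normalization produces: s = 2^(−E[h(x)] / c(n)), where E[h(x)] is the mean path length of a point across trees and c(n) is the average path length of an unsuccessful binary-search-tree lookup over n points. The sketch below assumes SMILE follows this standard formulation; `IsolationScoreSketch` and its methods are hypothetical names.

```java
/** Illustrative only: the textbook score normalization, assumed to match SMILE's. */
class IsolationScoreSketch {

    /** c(n) = 2 * H(n-1) - 2 * (n-1) / n, with the harmonic number H summed exactly. */
    static double averagePathLength(int n) {
        if (n <= 1) return 0.0;
        double harmonic = 0.0;
        for (int i = 1; i < n; i++) {
            harmonic += 1.0 / i;
        }
        return 2.0 * harmonic - 2.0 * (n - 1) / n;
    }

    /** s = 2^(-meanPathLength / c(subsampleSize)); shorter paths give scores closer to 1. */
    static double score(double meanPathLength, int subsampleSize) {
        if (subsampleSize < 2) {
            throw new IllegalArgumentException("subsampleSize must be at least 2");
        }
        return Math.pow(2.0, -meanPathLength / averagePathLength(subsampleSize));
    }

    public static void main(String[] args) {
        int n = 256;
        System.out.println(score(2.0, n));                  // short path: above 0.5
        System.out.println(score(averagePathLength(n), n)); // average path: exactly 0.5
        System.out.println(score(20.0, n));                 // long path: below 0.5
    }
}
```

A point whose mean path length equals c(n) scores exactly 0.5, which is why 0.5 works as a rough dividing line between inliers and anomalies.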
When no labelled data is available, select a threshold from training scores using the expected contamination rate:
```java
import java.util.Arrays;

// IsolationForest: top `contamination` fraction flagged
double[] scores = model.score(train);
Arrays.sort(scores);
double contamination = 0.05; // 5% outliers
int idx = (int) ((1.0 - contamination) * (scores.length - 1));
double threshold = scores[idx];

// Flag new point
boolean flag = model.predict(xNew, threshold);
```

For SVM, sort in ascending order and take the contamination-th percentile
from the bottom (most-negative end), then use score(x) < threshold.
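The SVM variant of this rule can be sketched as a small helper that works on any array of scores. It mirrors the index arithmetic of the Isolation Forest snippet but reads from the low end of the sorted array; `LowerTailThreshold` is a hypothetical helper and not part of SMILE.

```java
import java.util.Arrays;

/** Hypothetical helper: picks a lower-tail cut-off from training scores. */
class LowerTailThreshold {

    /** Returns the score at the `contamination` quantile, counted from the most negative end. */
    static double of(double[] scores, double contamination) {
        if (scores == null || scores.length == 0) {
            throw new IllegalArgumentException("scores must be non-empty");
        }
        if (contamination < 0.0 || contamination > 1.0) {
            throw new IllegalArgumentException("contamination must be in [0, 1]");
        }
        double[] sorted = scores.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);              // ascending: most negative scores first
        int idx = (int) (contamination * (sorted.length - 1));
        return sorted[idx];
    }

    public static void main(String[] args) {
        double[] trainScores = {0.4, -0.3, 0.2, 0.1, -0.7, 0.3, 0.5, 0.6, 0.25};
        double threshold = of(trainScores, 0.25);
        System.out.println("threshold = " + threshold);
    }
}
```

The returned value would then be passed as `threshold` to `model.predict(x, threshold)`, which flags a point when `score(x) < threshold`.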
`IsolationForest.fit`

| Condition | Exception |
|---|---|
| `data == null \|\| data.length < 2` | `IllegalArgumentException` |
| Any row is `null` | `IllegalArgumentException` |
| Rows have inconsistent length | `IllegalArgumentException` |
| `extensionLevel >= p` | `IllegalArgumentException` |
| `subsample` not in `(0, 1)` | `IllegalArgumentException` (from `Options`) |

`IsolationForest.score`

| Condition | Exception |
|---|---|
| `x == null` | `IllegalArgumentException` |
| `x.length != p` | `IllegalArgumentException` with message `"Invalid input dimension: expected <p>, actual <n>"` |

`SVM.fit`

| Condition | Exception |
|---|---|
| `x == null \|\| x.length == 0` | `IllegalArgumentException` |
| `kernel == null` | `IllegalArgumentException` |
| `options == null` | `IllegalArgumentException` |
| `nu` not in `(0, 1]` | `IllegalArgumentException` (from `Options`) |
| `tol <= 0` | `IllegalArgumentException` (from `Options`) |

```java
import smile.anomaly.IsolationForest;
import smile.math.MathEx;
import java.util.Arrays;

MathEx.setSeed(42L);
double[][] train = loadNumericData(); // your 3-dimensional data

// Extended Isolation Forest (p = 3 → extensionLevel up to 2)
var options = new IsolationForest.Options(256, 0, 0.6, 2);
IsolationForest forest = IsolationForest.fit(train, options);
System.out.println("trees          : " + forest.size());
System.out.println("extension level: " + forest.extensionLevel());

// Batch score and count anomalies above threshold 0.6
double[] scores = forest.score(train);
long anomalyCount = Arrays.stream(scores)
    .filter(s -> s > 0.6)
    .count();
System.out.printf("anomalies (threshold 0.6): %d / %d%n", anomalyCount, train.length);
```

```java
import smile.anomaly.SVM;
import smile.math.kernel.GaussianKernel;

double[][] train = loadCleanData(); // no outliers in training set

SVM<double[]> ocsvm = SVM.fit(
    train,
    new GaussianKernel(0.5),
    new SVM.Options(0.1, 1E-3)
);

double[] testPoint = {1.2, -0.7, 0.3};
boolean anomaly = ocsvm.predict(testPoint, 0.0);
System.out.printf("anomaly: %b (score = %.4f)%n", anomaly, ocsvm.score(testPoint));
```

```java
import smile.anomaly.IsolationForest;
import java.util.Arrays;

double[][] data = loadData();
IsolationForest model = IsolationForest.fit(data);

double[] trainScores = model.score(data);
Arrays.sort(trainScores);
double contamination = 0.02; // 2% expected anomalies
int cutIdx = (int) ((1.0 - contamination) * (trainScores.length - 1));
double threshold = trainScores[cutIdx];

// Classify new batch
double[][] newBatch = loadNewData();
for (double[] x : newBatch) {
    if (model.predict(x, threshold)) {
        System.out.println("ANOMALY detected: " + Arrays.toString(x));
    }
}
```

```java
// ── IsolationForest ──────────────────────────────────────────────────────────
// Training
IsolationForest model = IsolationForest.fit(double[][] data);
IsolationForest model = IsolationForest.fit(double[][] data, IsolationForest.Options options);

// Options
new IsolationForest.Options()                        // defaults
new IsolationForest.Options(int ntrees, int maxDepth,
                            double subsample, int extensionLevel)
IsolationForest.Options.of(Properties props)
options.toProperties()

// Inspection
int             model.size()                         // number of trees
IsolationTree[] model.trees()                        // defensive copy
int             model.extensionLevel()               // 0 = standard IF

// Scoring
double   model.score(double[] x)                     // single sample
double[] model.score(double[][] x)                   // batch (parallel)
boolean  model.predict(double[] x, double threshold) // true = anomaly

// ── SVM (One-Class) ──────────────────────────────────────────────────────────
// Training
SVM<T> model = SVM.fit(T[] x, MercerKernel<T> kernel);
SVM<T> model = SVM.fit(T[] x, MercerKernel<T> kernel, SVM.Options options);

// Options
new SVM.Options()                                    // nu=0.5, tol=1E-3
new SVM.Options(double nu, double tol)
SVM.Options.of(Properties props)
options.toProperties()

// Scoring (positive = inlier, negative = anomaly)
double   model.score(T x)                            // single sample
double[] model.score(T[] x)                          // batch (parallel)
boolean  model.predict(T x, double threshold)        // true = anomaly
                                                     // (score < threshold)
```

SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.