base/DATA_TRANSFORMATION.md
The smile.data.transform package provides a composable, serializable pipeline
for preprocessing tabular data before model training. Every transform maps a
Tuple (one row) or an entire DataFrame to a transformed counterpart, and the
higher-level implementations in smile.feature.transform build the common
statistical scalers on top of that foundation.
Transform
│  apply(Tuple)       → Tuple       (row-by-row)
│  apply(DataFrame)   → DataFrame   (batch, column-parallel)
│  andThen(Transform) → Transform   (compose forward)
│  compose(Transform) → Transform   (compose backward)
│
├── ColumnTransform                 (map of lambdas, column-wise)
│   └── InvertibleColumnTransform   (adds invert() for both Tuple and DataFrame)
│
└── InvertibleTransform             (extends Transform)
    │  invert(Tuple)     → Tuple
    │  invert(DataFrame) → DataFrame
All built-in scalers (Scaler, Standardizer, RobustStandardizer,
WinsorScaler, MaxAbsScaler) return an InvertibleColumnTransform.
Normalizer is a plain, non-invertible Transform that operates row-wise.
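In code, the distinction looks like this (df and the column names are placeholders):

```java
import smile.data.DataFrame;
import smile.data.transform.InvertibleColumnTransform;
import smile.data.transform.Transform;
import smile.feature.transform.Normalizer;
import smile.feature.transform.Standardizer;

// Built-in scalers come back as the invertible column-wise type...
InvertibleColumnTransform std = Standardizer.fit(df);

// ...while Normalizer is a one-way, row-wise Transform.
Transform l2 = new Normalizer(Normalizer.Norm.L2, "f1", "f2");
```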
Transform lives in smile.data.transform and extends
java.util.function.Function<Tuple, Tuple>. Every preprocessing step
implements this single interface.
Transform t = /* any transform */;
Tuple outRow = t.apply(inRow);
The default apply(DataFrame) implementation streams all rows through the
row-level apply(Tuple). ColumnTransform overrides this with a
column-parallel fast path that avoids per-row object allocation.
DataFrame out = t.apply(df);
// Build a single-step pipeline from already-fitted transforms
Transform chain = Transform.pipeline(step1, step2, step3);
// Fit several data-dependent transforms in sequence.
// Each subsequent trainer receives the data as transformed by all previous steps.
Transform pipeline = Transform.fit(trainDf,
Standardizer::fit,
data -> WinsorScaler.fit(data, 0.01, 0.99));
Extends Transform with two inverse operations:
public interface InvertibleTransform extends Transform {
Tuple invert(Tuple x);
DataFrame invert(DataFrame data);
}
Inversion is useful whenever the original feature space must be recovered — for example, to interpret a model's output in the original units, or to reconstruct an input from a latent representation.
InvertibleColumnTransform scaler = Standardizer.fit(train);
DataFrame scaled = scaler.apply(test);
DataFrame recovered = scaler.invert(scaled); // back to original units
ColumnTransform is the concrete class that powers all column-wise transforms.
It holds a Map<String, Function> — a mapping from column name to the
smile.util.function.Function lambda to apply to that column's values.
import smile.data.transform.ColumnTransform;
import smile.util.function.Function;
import java.util.Map;
// Log-transform "income", leave everything else unchanged
Map<String, Function> transforms = Map.of("income", Math::log);
ColumnTransform ct = new ColumnTransform("log-income", transforms);
DataFrame out = ct.apply(df);
- Columns without an entry in the map pass through unchanged (the original ValueVector object is reused, zero copy).
- Transformed columns are materialized as DoubleType (or NullableDoubleType for nullable inputs).
- The apply(DataFrame) implementation processes all columns in parallel using IntStream.range(...).parallel().

System.out.println(ct);
// log-income(
//     log(income)
// )

The output format is `<name>(\n    <function.toString()>, ...\n)`. Each
Function implementation's toString() should describe the transformation:
the built-in scalers do this, while lambdas print the default JVM
reference unless overridden.
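Here is a minimal sketch of a self-describing function. It assumes smile.util.function.Function is the usual double-to-double functional interface with an abstract f(double) method; adjust the override if the actual method name differs:

```java
import smile.data.transform.ColumnTransform;
import smile.util.function.Function;
import java.util.Map;

// An anonymous class can override toString() so the ColumnTransform
// prints a readable description instead of a JVM lambda reference.
Function logIncome = new Function() {
    @Override
    public double f(double x) { return Math.log(x); }

    @Override
    public String toString() { return "log(income)"; }
};

ColumnTransform ct = new ColumnTransform("log-income", Map.of("income", logIncome));
System.out.println(ct);   // prints "log-income(\n    log(income)\n)"
```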
Extends ColumnTransform and additionally stores a second map of inverse
lambdas, one per transformed column.
import smile.data.transform.InvertibleColumnTransform;
Map<String, Function> transforms = Map.of("price", v -> Math.log1p(v));
Map<String, Function> inverses = Map.of("price", v -> Math.expm1(v));
InvertibleColumnTransform ict =
new InvertibleColumnTransform("log1p-price", transforms, inverses);
DataFrame logPrices = ict.apply(df);
DataFrame original = ict.invert(logPrices); // ≈ df
All built-in transforms follow the same pattern:

- Xxx.fit(trainData) to learn parameters from training data.
- .apply(anyData) to transform training and test data alike.
- .invert(transformedData) to recover the original scale.

Train/test discipline: Always fit on training data only, then apply the fitted transform to both train and test sets.
Maps each numeric column to the range [0, 1]:
x_scaled = (x - min) / (max - min)
Values outside the training range are clamped to [0, 1].
import smile.feature.transform.Scaler;
// Fit on train, transform all numeric columns
InvertibleColumnTransform scaler = Scaler.fit(train);
DataFrame trainScaled = scaler.apply(train);
DataFrame testScaled = scaler.apply(test);
// Fit on selected columns only
InvertibleColumnTransform partial = Scaler.fit(train, "age", "income");
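To see the clamping in action, here is a sketch that mirrors the DataFrame.of / DoubleVector usage from the inversion examples later in this document (the column name "x" and both tiny frames are hypothetical):

```java
import smile.data.DataFrame;
import smile.data.vector.DoubleVector;
import smile.feature.transform.Scaler;

// Training range for "x" is [0, 10], so min = 0 and max = 10.
DataFrame tiny = DataFrame.of(new DoubleVector("x", new double[]{0, 5, 10}));
var scaler = Scaler.fit(tiny);

// Out-of-range test values are clamped: -5 → 0.0, 15 → 1.0; 5 → 0.5.
DataFrame probe = DataFrame.of(new DoubleVector("x", new double[]{-5, 5, 15}));
System.out.println(scaler.apply(probe));
```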
When to use: Algorithms that are sensitive to feature magnitude and when the data has no severe outliers (e.g. KNN, SVM with RBF kernel, neural networks).
When to avoid: Data with large outliers — a single extreme value will
compress all normal values into a tiny sub-range. Use WinsorScaler instead.
Scales each numeric column to zero mean and unit variance:
x_std = (x - μ) / σ
If the standard deviation is zero (constant column), the scale factor is 1.0
and the column is only centred.
import smile.feature.transform.Standardizer;
InvertibleColumnTransform std = Standardizer.fit(train);
DataFrame trainStd = std.apply(train);
DataFrame testStd = std.apply(test);
When to use: When the algorithm assumes Gaussian-distributed features or uses distance / dot-product computations (linear regression, logistic regression, linear SVM, PCA).
When to avoid: Heavy-tailed distributions or data with many outliers —
a single large outlier inflates σ and under-compresses the bulk of the data.
Use RobustStandardizer instead.
Scales by subtracting the median and dividing by the interquartile range (IQR):
x_robust = (x - median) / IQR
The IQR is the difference between the 75th and 25th percentiles. If the IQR
is zero, the scale factor is 1.0.
import smile.feature.transform.RobustStandardizer;
InvertibleColumnTransform robust = RobustStandardizer.fit(train);
DataFrame robustTrain = robust.apply(train);
When to use: Data with outliers or heavy-tailed distributions. Median and IQR are much less sensitive to extreme values than mean and standard deviation.
A two-step approach:

1. Clamp (winsorize) each value to the [lower, upper] percentile range of the training distribution.
2. Min-max scale the clamped values to [0, 1].

import smile.feature.transform.WinsorScaler;
// Default: 5th percentile lower, 95th percentile upper
InvertibleColumnTransform ws = WinsorScaler.fit(train);
// Custom percentiles — clamp at 1st and 99th percentile
InvertibleColumnTransform ws2 = WinsorScaler.fit(train, 0.01, 0.99);
// Transform specific columns only
InvertibleColumnTransform ws3 = WinsorScaler.fit(train, 0.05, 0.95, "price", "quantity");
When to use: When you know the training set has outliers that are meaningful in the training data but that you do not want to distort scaling for the bulk of the data. Common in financial and sensor data.
Divides each feature by its maximum absolute value so that all values lie in
[-1, 1]. This scaler does not centre the data (mean is not subtracted),
making it suitable for sparse data where centering would destroy sparsity.
import smile.feature.transform.MaxAbsScaler;
InvertibleColumnTransform mas = MaxAbsScaler.fit(train);
DataFrame scaled = mas.apply(test);
When to use: Sparse data (e.g. TF-IDF vectors) or when the sign of values is meaningful and you want to preserve zero values exactly.
Normalizer rescales each row (sample) so that its vector norm equals 1.
Unlike the column-wise scalers above, Normalizer operates across columns
within a single row. It supports three norms:
| Norm | Formula |
|---|---|
| L1 | Σ\|xᵢ\| = 1 |
| L2 | √(Σ xᵢ²) = 1 |
| L_INF | max(\|xᵢ\|) = 1 |
import smile.feature.transform.Normalizer;
import smile.feature.transform.Normalizer.Norm;
// Normalize specific columns
String[] features = {"f1", "f2", "f3"};
Normalizer l2 = new Normalizer(Norm.L2, features);
DataFrame normed = l2.apply(df);
// Apply row-by-row to a single Tuple
Tuple row = df.get(0);
Tuple normRow = l2.apply(row);
Normalizer is not invertible (it is not an InvertibleTransform).
When to use: Text classification (TF-IDF), cosine-similarity-based algorithms, or whenever the magnitude of a row is irrelevant but the direction (ratio of features) is meaningful.
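A quick way to convince yourself of the unit-norm property, continuing the snippet above (l2, features and df already defined; this assumes Tuple exposes a getDouble(String) accessor):

```java
// After L2 normalization the selected fields form a unit vector.
Tuple normRow2 = l2.apply(df.get(0));
double sumSq = 0.0;
for (String f : features) {
    double v = normRow2.getDouble(f);
    sumSq += v * v;
}
System.out.println(Math.sqrt(sumSq));   // ≈ 1.0
```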
andThen and compose mirror the semantics of
java.util.function.Function:
// a.andThen(b) ≡ b(a(x)) — apply a first, then b
Transform aFirst = a.andThen(b);
// b.compose(a) ≡ b(a(x)) — same result, different spelling
Transform alsoAFirst = b.compose(a);
Both return a new Transform (a lambda, not a named class).
InvertibleColumnTransform scaler = Scaler.fit(train);
InvertibleColumnTransform std = Standardizer.fit(train);
// Scale first, then standardise
Transform both = scaler.andThen(std);
DataFrame out = both.apply(test);
Note: The composed result is a plain Transform, not an InvertibleTransform, even if both inputs are invertible. If you need inversion of the composed result, use Transform.fit (see §7.3) or invert the steps individually in reverse order, as shown in the stacked-inversion example below.
Transform.pipeline(t1, t2, ..., tN) is equivalent to chaining andThen
calls and is more readable for three or more steps:
Transform pipeline = Transform.pipeline(
imputer, // fill missing values
scaler, // [0, 1] scale
std // z-score
);
DataFrame train_prepped = pipeline.apply(train);
DataFrame test_prepped = pipeline.apply(test);
Throws IllegalArgumentException if called with no transforms.
Transform.fit(data, trainer1, trainer2, ...) builds a pipeline where each
trainer is a Function<DataFrame, Transform> — a function that observes
the data as transformed so far and returns the next step:
Transform pipeline = Transform.fit(train,
// Step 1: fit imputer on raw training data
data -> SimpleImputer.fit(data),
// Step 2: fit scaler on imputed training data
data -> Scaler.fit(data),
// Step 3: fit standardiser on scaled+imputed training data
data -> Standardizer.fit(data)
);
// Apply the whole fitted pipeline to test data
DataFrame testOut = pipeline.apply(test);
This is the canonical way to build multi-step preprocessing pipelines where each step's parameters depend on the output of the previous steps.
Throws IllegalArgumentException if called with no trainers.
InvertibleTransform.invert() maps from the transformed space back to the
original space. This is used for:

- interpreting model outputs in the original units;
- reconstructing inputs from their transformed representation;
- round-tripping: invert(apply(x)) ≈ x.

InvertibleColumnTransform std = Standardizer.fit(train);
// Forward pass
DataFrame trainStd = std.apply(train);
double[] yStd = model.predict(trainStd);
// Recover predictions in original units.
// Build a single-column DataFrame to pass through invert(); the column
// name must match one of the columns the transform was fitted on
// (here "y" is assumed to be a standardized training column).
DataFrame predDf = DataFrame.of(new DoubleVector("y", yStd));
DataFrame predOriginal = std.invert(predDf);
Tuple row = df.get(0);
Tuple scaled = std.apply(row);
Tuple backToNormal = std.invert(scaled);
When multiple invertible transforms are stacked, inverting the composed pipeline requires reversing the order:
InvertibleColumnTransform s1 = Scaler.fit(train);
InvertibleColumnTransform s2 = Standardizer.fit(s1.apply(train));
// Forward: s1 then s2
DataFrame fwd = s2.apply(s1.apply(test));
// Inverse: s2⁻¹ then s1⁻¹ (reversed order)
DataFrame inv = s1.invert(s2.invert(fwd));
SMILE supports nullable columns backed by NullableDoubleVector (and
equivalent types for other primitives). All transforms in this package
preserve nullability:

- The output vector carries the same null mask as the input.
- The lambda may still be called on the raw double value for a null cell (which is Double.NaN by convention), but the result is ignored and replaced by NaN / null in the output vector.

// Suppose "income" has some null entries
InvertibleColumnTransform scaler = Scaler.fit(train);
DataFrame scaled = scaler.apply(train);
// scaled.column("income").isNullable() → true (same as input)
// scaled.column("income").isNullAt(k) → same as train
DataFrame restored = scaler.invert(scaled);
// restored.column("income").isNullable() → true (null mask preserved)
Transform extends java.io.Serializable. All built-in implementations
and the closure-captured lambdas inside InvertibleColumnTransform are
serializable, so a fitted transform can be saved and reloaded:
// Save
try (var out = new ObjectOutputStream(new FileOutputStream("scaler.ser"))) {
out.writeObject(scaler);
}
// Load
InvertibleColumnTransform loaded;
try (var in = new ObjectInputStream(new FileInputStream("scaler.ser"))) {
loaded = (InvertibleColumnTransform) in.readObject();
}
// Use the loaded transform on new data
DataFrame newData = loaded.apply(incoming);
Caution with lambdas: A lambda or method reference is only serializable if its target interface is Serializable and everything it captures (including the enclosing instance, when `this` is captured) is serializable as well. A non-serializable lambda passed to ColumnTransform or InvertibleColumnTransform will cause a NotSerializableException at save time.
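One robust pattern is to route everything through static methods, so the generated lambdas capture no enclosing state (a sketch; MyTransforms is a hypothetical helper class):

```java
import smile.data.transform.InvertibleColumnTransform;
import smile.util.function.Function;
import java.util.Map;

public final class MyTransforms {
    // Static methods capture nothing, so method references to them
    // serialize cleanly (the Function interface itself is Serializable).
    public static double log1p(double x) { return Math.log1p(x); }
    public static double expm1(double x) { return Math.expm1(x); }

    public static InvertibleColumnTransform logPrice() {
        Map<String, Function> fwd = Map.of("price", MyTransforms::log1p);
        Map<String, Function> inv = Map.of("price", MyTransforms::expm1);
        return new InvertibleColumnTransform("log1p-price", fwd, inv);
    }
}
```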
Useful when the transformation is not column-wise (e.g. row normalization, dimension reduction):
import smile.data.Tuple;
import smile.data.transform.Transform;
import smile.data.type.StructType;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
public class ClipTransform implements Transform {
private final double min, max;
private final Set<String> columns;
public ClipTransform(double min, double max, String... columns) {
this.min = min;
this.max = max;
this.columns = new HashSet<>(Arrays.asList(columns));
}
@Override
public Tuple apply(Tuple x) {
StructType schema = x.schema();
return new smile.data.AbstractTuple(schema) {
@Override
public Object get(int i) {
String name = schema.field(i).name();
if (columns.contains(name)) {
double v = x.getDouble(i);
return Math.max(min, Math.min(max, v));
}
return x.get(i);
}
};
}
}
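Usage (df is any DataFrame containing the clipped columns; the batch call goes through the default row-streaming apply(DataFrame)):

```java
Transform clip = new ClipTransform(-3.0, 3.0, "age", "income");
DataFrame clipped = clip.apply(df);          // batch, row-streamed by default
Tuple clippedRow = clip.apply(df.get(0));    // single row
```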
The simplest approach for column-wise numeric transformations:
// Clip all values to [-3, 3]
Map<String, Function> clips = new HashMap<>();
for (String col : numericColumns) {
clips.put(col, x -> Math.max(-3.0, Math.min(3.0, x)));
}
ColumnTransform clipper = new ColumnTransform("clip±3", clips);
When you need both forward and inverse:
// Box-Cox power transform with λ = 0.5 (square-root transform)
double lambda = 0.5;
Map<String, Function> fwd = Map.of(
"revenue", x -> (Math.pow(x, lambda) - 1.0) / lambda
);
Map<String, Function> inv = Map.of(
"revenue", y -> Math.pow(y * lambda + 1.0, 1.0 / lambda)
);
InvertibleColumnTransform boxCox =
new InvertibleColumnTransform("BoxCox(0.5)", fwd, inv);
import smile.math.MathEx;

// Normalize each column by its own training median
static InvertibleColumnTransform fitMedianScaler(DataFrame data, String... cols) {
Map<String, Function> transforms = new HashMap<>();
Map<String, Function> inverses = new HashMap<>();
for (String col : cols) {
double[] vals = data.column(col).toDoubleArray();
double median = MathEx.median(vals);
double scale = MathEx.isZero(median) ? 1.0 : Math.abs(median);
transforms.put(col, x -> x / scale);
inverses.put(col, y -> y * scale);
}
return new InvertibleColumnTransform("MedianScaler", transforms, inverses);
}
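Usage, with column names borrowed from the lending tutorial below:

```java
InvertibleColumnTransform medians = fitMedianScaler(train, "loan_amount", "annual_income");
DataFrame scaled = medians.apply(test);
DataFrame restored = medians.invert(scaled);   // ≈ test
```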
This tutorial preprocesses a lending dataset that has loan_amount
(double), annual_income (double, nullable), grade (categorical),
term (int) and default (int, the label).
DataFrame data = Read.csv("loans.csv");
// 80/20 train/test split by position (assumes the rows are already shuffled)
int n = data.size();
int trainSize = (int)(n * 0.8);
DataFrame train = data.slice(0, trainSize);
DataFrame test = data.slice(trainSize, n);
import smile.feature.imputation.SimpleImputer;
// Fit: learn per-column fill values (mean for numeric, mode for categorical)
Transform imputer = SimpleImputer.fit(train);
DataFrame trainImputed = imputer.apply(train);
DataFrame testImputed = imputer.apply(test);
import smile.feature.transform.WinsorScaler;
// Use Winsor to be robust to the income outliers
InvertibleColumnTransform scaler =
WinsorScaler.fit(trainImputed, 0.01, 0.99, "loan_amount", "annual_income");
DataFrame trainScaled = scaler.apply(trainImputed);
DataFrame testScaled = scaler.apply(testImputed);
Transform fullPipeline = Transform.fit(train,
SimpleImputer::fit,
data -> WinsorScaler.fit(data, 0.01, 0.99, "loan_amount", "annual_income")
);
// Apply in one call — same result as steps 2-3 above
DataFrame trainPrepped = fullPipeline.apply(train);
DataFrame testPrepped = fullPipeline.apply(test);
import smile.data.formula.Formula;
import smile.data.formula.Terms;
Formula formula = Formula.of("default",
Terms.$("loan_amount"),
Terms.$("annual_income"),
Terms.$("term"),
Terms.$("grade"));
// Extract design matrix and response
var X = formula.matrix(trainPrepped);
var y = formula.y(trainPrepped).toIntArray();
// … fit your model with X and y …
// Suppose we predicted a continuous score and want to understand it
// in the original income units:
double[] scores = model.predict(formula.matrix(testPrepped));
DataFrame scoreDf = DataFrame.of(new DoubleVector("annual_income", scores));
DataFrame inOriginalScale = scaler.invert(scoreDf);
System.out.println(inOriginalScale);
// Inspect what each step learned
System.out.println(fullPipeline);
// Serialize for reuse in production
try (var out = new ObjectOutputStream(new FileOutputStream("pipeline.ser"))) {
out.writeObject(fullPipeline);
}
| Situation | Recommended transform |
|---|---|
| No strong outliers, range matters | Scaler |
| Gaussian assumption, distance-based algo | Standardizer |
| Outliers present, want robustness | RobustStandardizer |
| Outliers present, want bounded output | WinsorScaler |
| Sparse data, zero must stay zero | MaxAbsScaler |
| Direction matters, not magnitude | Normalizer (L2) |
| Text / count vectors | Normalizer (L1 or L2) |
| Custom log / power transformation | InvertibleColumnTransform + lambdas |
| Missing values before scaling | SimpleImputer first, then scaler |
| Multiple sequential steps | Transform.fit(...) pipeline |
Given a column with values {1, 2, 3, 100} (the 100 is an outlier):
| Transform | 1 | 2 | 3 | 100 |
|---|---|---|---|---|
| Scaler | 0.000 | 0.010 | 0.020 | 1.000 |
| Standardizer (μ = 26.5, σ ≈ 49.0) | −0.52 | −0.50 | −0.48 | 1.50 |
| RobustStandardizer (med = 2.5, IQR ≈ 1.5) | −1.00 | −0.33 | 0.33 | 65.0 |
| WinsorScaler (5/95 pct) | 0.000 | 0.500 | 1.000 | 1.000 |
| MaxAbsScaler (max = 100) | 0.010 | 0.020 | 0.030 | 1.000 |
Scaler compresses {1,2,3} to near zero because of the outlier.
WinsorScaler clamps the outlier so {1,2,3} spread cleanly across [0,1].
RobustStandardizer is robust to the outlier but its output is unbounded.
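The comparison is easy to reproduce with a one-column frame (a sketch mirroring the DataFrame.of / DoubleVector usage from earlier; the exact RobustStandardizer and WinsorScaler outputs depend on SMILE's quantile estimator):

```java
import smile.data.DataFrame;
import smile.data.vector.DoubleVector;
import smile.feature.transform.*;

DataFrame toy = DataFrame.of(new DoubleVector("v", new double[]{1, 2, 3, 100}));

// Fit and apply each scaler to the same column.
System.out.println(Scaler.fit(toy).apply(toy));
System.out.println(Standardizer.fit(toy).apply(toy));
System.out.println(RobustStandardizer.fit(toy).apply(toy));
System.out.println(WinsorScaler.fit(toy).apply(toy));
System.out.println(MaxAbsScaler.fit(toy).apply(toy));
```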
Transform (interface — smile.data.transform)

| Member | Description |
|---|---|
| apply(Tuple) | Transform one row (abstract). |
| apply(DataFrame) | Transform all rows (default: row-stream; overridden by ColumnTransform for column-parallel batch). |
| andThen(Transform) | Compose: this, then after. |
| compose(Transform) | Compose: before, then this. |
| Transform.pipeline(Transform...) | Chain multiple transforms left-to-right. Throws on empty input. |
| Transform.fit(DataFrame, Function<DataFrame,Transform>...) | Fit a data-dependent pipeline. Each trainer sees data as already transformed. Throws on empty input. |
InvertibleTransform (interface — smile.data.transform)

| Member | Description |
|---|---|
| invert(Tuple) | Inverse-transform one row. |
| invert(DataFrame) | Inverse-transform all rows. Preserves nullable columns. |
ColumnTransform (class — smile.data.transform)

| Member | Description |
|---|---|
| ColumnTransform(String name, Map<String,Function> transforms) | Constructor. |
| apply(Tuple) | Applies lambdas to matching column positions; other columns pass through. |
| apply(DataFrame) | Column-parallel batch transform; preserves null masks. |
| toString() | `<name>(\n    <fn>, ...\n)` |
InvertibleColumnTransform (class — smile.data.transform)

| Member | Description |
|---|---|
| InvertibleColumnTransform(String, Map<String,Function>, Map<String,Function>) | Constructor (name, forward lambdas, inverse lambdas). |
| invert(Tuple) | Applies inverse lambdas; unmatched columns pass through. |
| invert(DataFrame) | Column-parallel inverse; preserves null masks. |
Built-in scalers (smile.feature.transform)

| Class | Factory method | Formula | Invertible |
|---|---|---|---|
| Scaler | Scaler.fit(data, cols...) | (x−min)/(max−min), clamped to [0,1] | Yes |
| Standardizer | Standardizer.fit(data, cols...) | (x−μ)/σ | Yes |
| RobustStandardizer | RobustStandardizer.fit(data, cols...) | (x−median)/IQR | Yes |
| WinsorScaler | WinsorScaler.fit(data) or fit(data, lo, hi, cols...) | clamp, then (x−pLo)/(pHi−pLo) | Yes |
| MaxAbsScaler | MaxAbsScaler.fit(data, cols...) | x / max(\|x\|) | Yes |
| Normalizer | new Normalizer(Norm, cols...) | row-wise unit norm (L1/L2/L∞) | No |
All column-wise fit() methods accept an optional varargs String... columns
argument. When omitted, all numeric columns are transformed automatically.
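For example (a sketch; "age" and "income" are hypothetical column names):

```java
// All numeric columns are fitted and transformed:
InvertibleColumnTransform all = Standardizer.fit(train);

// Only the named columns; everything else passes through untouched:
InvertibleColumnTransform some = Standardizer.fit(train, "age", "income");
```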
SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.