SMILE — Data Transformation User Guide & Tutorial

The smile.data.transform package provides a composable, serializable pipeline for preprocessing tabular data before model training. Every transform maps a Tuple (one row) or an entire DataFrame to a transformed counterpart, and the higher-level implementations in smile.feature.transform build the common statistical scalers on top of that foundation.


Table of Contents

  1. Architecture overview
  2. The Transform interface
  3. The InvertibleTransform interface
  4. ColumnTransform
  5. InvertibleColumnTransform
  6. Built-in scalers and standardisers
  7. Composing transforms
  8. Inverting transforms
  9. Nullable column handling
  10. Serialization
  11. Writing a custom transform
  12. End-to-end tutorial
  13. Choosing the right transform
  14. API quick reference

1. Architecture overview

Transform  ──────────────────────────────────────────────────────────────────
│  apply(Tuple)   → Tuple          (row-by-row)
│  apply(DataFrame) → DataFrame    (batch, column-parallel)
│  andThen(Transform) → Transform  (compose forward)
│  compose(Transform) → Transform  (compose backward)
│
├── ColumnTransform                (map-of-lambdas, column-wise)
│     └── InvertibleColumnTransform (adds invert() for both Tuple and DataFrame)
│
InvertibleTransform extends Transform
│  invert(Tuple)     → Tuple
│  invert(DataFrame) → DataFrame

All built-in scalers (Scaler, Standardizer, RobustStandardizer, WinsorScaler, MaxAbsScaler) return an InvertibleColumnTransform from their fit() methods. Normalizer is a plain, row-wise Transform and is not invertible.


2. The Transform interface

Transform lives in smile.data.transform and extends java.util.function.Function<Tuple, Tuple>. Every preprocessing step implements this single interface.

Row-level application

java
Transform t = /* any transform */;
Tuple outRow = t.apply(inRow);

Batch application

The default apply(DataFrame) implementation streams all rows through the row-level apply(Tuple). ColumnTransform overrides this with a column-parallel fast path that avoids per-row object allocation.

java
DataFrame out = t.apply(df);

Static factories

java
// Chain already-fitted transforms into a single pipeline
Transform chain = Transform.pipeline(step1, step2, step3);

// Fit several data-dependent transforms in sequence.
// Each subsequent trainer receives the data as transformed by all previous steps.
Transform pipeline = Transform.fit(trainDf,
        Standardizer::fit,
        data -> WinsorScaler.fit(data, 0.01, 0.99));

3. The InvertibleTransform interface

Extends Transform with two inverse operations:

java
public interface InvertibleTransform extends Transform {
    Tuple     invert(Tuple x);
    DataFrame invert(DataFrame data);
}

Inversion is useful whenever the original feature space must be recovered — for example, to interpret a model's output in the original units, or to reconstruct an input from a latent representation.

java
InvertibleColumnTransform scaler = Standardizer.fit(train);
DataFrame scaled    = scaler.apply(test);
DataFrame recovered = scaler.invert(scaled);  // back to original units

4. ColumnTransform

ColumnTransform is the concrete class that powers all column-wise transforms. It holds a Map<String, Function> — a mapping from column name to the smile.util.function.Function lambda to apply to that column's values.

Constructing manually

java
import smile.data.transform.ColumnTransform;
import smile.util.function.Function;
import java.util.Map;

// Log-transform "income", leave everything else unchanged
Map<String, Function> transforms = Map.of("income", Math::log);
ColumnTransform ct = new ColumnTransform("log-income", transforms);

DataFrame out = ct.apply(df);

Behaviour

  • Columns not present in the map are passed through unchanged (same ValueVector object — zero copy).
  • Transformed columns always become DoubleType (or NullableDoubleType for nullable inputs).
  • The apply(DataFrame) implementation processes all columns in parallel using IntStream.range(...).parallel().

toString

java
System.out.println(ct);
// log-income(
//   log(income)
// )

The output format is <name>(\n <function_toString()>, ...\n). Each Function implementation's toString() should describe the transformation (the built-in scalers do this; lambdas print as the default JVM reference unless overridden).


5. InvertibleColumnTransform

Extends ColumnTransform and additionally stores a second map of inverse lambdas, one per transformed column.

java
import smile.data.transform.InvertibleColumnTransform;

Map<String, Function> transforms = Map.of("price", v -> Math.log1p(v));
Map<String, Function> inverses   = Map.of("price", v -> Math.expm1(v));

InvertibleColumnTransform ict =
    new InvertibleColumnTransform("log1p-price", transforms, inverses);

DataFrame logPrices = ict.apply(df);
DataFrame original  = ict.invert(logPrices);  // ≈ df

Inversion behaviour

  • Columns in the inverses map are inverted using the supplied lambda.
  • Columns not in the inverses map are passed through unchanged.
  • Nullable columns retain their null mask on both the forward and inverse paths — nulls are never fabricated or lost.

6. Built-in scalers and standardisers

All built-in transforms follow the same pattern:

  1. Call Xxx.fit(trainData) to learn parameters from training data.
  2. Call .apply(anyData) to transform training and test data alike.
  3. Optionally call .invert(transformedData) to recover the original scale.

Train/test discipline: Always fit on training data only. Apply the fitted transform to both train and test sets.

6.1 Scaler — min-max scaling

Maps each numeric column to the range [0, 1]:

x_scaled = (x - min) / (max - min)

Values outside the training range are clamped to [0, 1].
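
As a rough illustration of the formula and the clamping rule, the plain-Java sketch below learns min and max from training values and then scales new values. MinMaxSketch is a hypothetical class written for this guide, not SMILE's implementation, and the 1.0 fallback for a constant column is an assumption.

```java
import java.util.Arrays;

/** Plain-Java sketch of min-max scaling; not SMILE's implementation. */
final class MinMaxSketch {
    final double min;
    final double range;

    /** Learns min and max from the training values only. */
    MinMaxSketch(double[] train) {
        min = Arrays.stream(train).min().orElseThrow();
        double max = Arrays.stream(train).max().orElseThrow();
        // Assumption: use 1.0 for a constant column to avoid dividing by zero.
        range = max > min ? max - min : 1.0;
    }

    /** (x - min) / (max - min), clamped to [0, 1]. */
    double apply(double x) {
        double y = (x - min) / range;
        return Math.max(0.0, Math.min(1.0, y));
    }

    public static void main(String[] args) {
        MinMaxSketch s = new MinMaxSketch(new double[] {10, 20, 30});
        System.out.println(s.apply(20));  // 0.5, inside the training range
        System.out.println(s.apply(45));  // 1.0, above the training max so clamped
    }
}
```

In this sketch a clamped value cannot be recovered exactly by inversion, because every input above the training max maps to the same output.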

java
import smile.feature.transform.Scaler;

// Fit on train, transform all numeric columns
InvertibleColumnTransform scaler = Scaler.fit(train);
DataFrame trainScaled = scaler.apply(train);
DataFrame testScaled  = scaler.apply(test);

// Fit on selected columns only
InvertibleColumnTransform partial = Scaler.fit(train, "age", "income");

When to use: Algorithms that are sensitive to feature magnitude (e.g. KNN, SVM with RBF kernel, neural networks), provided the data has no severe outliers.

When to avoid: Data with large outliers — a single extreme value will compress all normal values into a tiny sub-range. Use WinsorScaler instead.

6.2 Standardizer — z-score standardization

Scales each numeric column to zero mean and unit variance:

x_std = (x - μ) / σ

If the standard deviation is zero (constant column), the scale factor is 1.0 and the column is only centred.
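
The formula and the constant-column rule can be sketched in plain Java as follows. ZScoreSketch is a hypothetical class, not SMILE's implementation, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```java
/** Plain-Java sketch of z-score standardization; not SMILE's implementation. */
final class ZScoreSketch {
    final double mean;
    final double scale;

    ZScoreSketch(double[] train) {
        double sum = 0.0;
        for (double v : train) sum += v;
        mean = sum / train.length;

        double ss = 0.0;
        for (double v : train) ss += (v - mean) * (v - mean);
        // Assumption: sample standard deviation (n - 1 in the denominator).
        double sd = train.length > 1 ? Math.sqrt(ss / (train.length - 1)) : 0.0;
        // Constant column: use a scale of 1.0 so values are only centred.
        scale = sd > 0.0 ? sd : 1.0;
    }

    double apply(double x)  { return (x - mean) / scale; }
    double invert(double z) { return z * scale + mean; }

    public static void main(String[] args) {
        ZScoreSketch z = new ZScoreSketch(new double[] {1, 2, 3});
        System.out.println(z.apply(3));  // 1.0 (mean 2, sd 1)
        ZScoreSketch c = new ZScoreSketch(new double[] {5, 5, 5});
        System.out.println(c.apply(7));  // 2.0 (constant column, centred only)
    }
}
```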

java
import smile.feature.transform.Standardizer;

InvertibleColumnTransform std = Standardizer.fit(train);
DataFrame trainStd = std.apply(train);
DataFrame testStd  = std.apply(test);

When to use: When the algorithm assumes Gaussian-distributed features or uses distance / dot-product computations (linear regression, logistic regression, linear SVM, PCA).

When to avoid: Heavy-tailed distributions or data with many outliers. A single large outlier inflates σ and squeezes the bulk of the data into a narrow band. Use RobustStandardizer instead.

6.3 RobustStandardizer — median / IQR

Scales by subtracting the median and dividing by the interquartile range (IQR):

x_robust = (x - median) / IQR

The IQR is the difference between the 75th and 25th percentiles. If the IQR is zero, the scale factor is 1.0.
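
A minimal plain-Java sketch of median / IQR scaling is shown below. RobustSketch and its quantile helper are hypothetical, and the linear-interpolation quantile rule is an assumption; SMILE may compute percentiles differently, which changes the IQR on small samples.

```java
import java.util.Arrays;

/** Plain-Java sketch of median / IQR scaling; not SMILE's implementation. */
final class RobustSketch {
    final double median;
    final double scale;

    RobustSketch(double[] train) {
        double[] sorted = train.clone();
        Arrays.sort(sorted);
        median = quantile(sorted, 0.5);
        double iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
        // Zero IQR: fall back to a scale of 1.0, as described above.
        scale = iqr > 0.0 ? iqr : 1.0;
    }

    /** Assumption: linear interpolation between neighbouring order statistics. */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    double apply(double x) { return (x - median) / scale; }

    public static void main(String[] args) {
        RobustSketch r = new RobustSketch(new double[] {5, 1, 4, 2, 3});
        System.out.println(r.apply(5));    // 1.0 (median 3, IQR 2)
        System.out.println(r.apply(100));  // 48.5, the output is not bounded
    }
}
```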

java
import smile.feature.transform.RobustStandardizer;

InvertibleColumnTransform robust = RobustStandardizer.fit(train);
DataFrame robustTrain = robust.apply(train);

When to use: Data with outliers or heavy-tailed distributions. Median and IQR are much less sensitive to extreme values than mean and standard deviation.

6.4 WinsorScaler — percentile clamping

A two-step approach:

  1. Clamp values outside the [lower, upper] percentile range of the training distribution.
  2. Scale the result to [0, 1].

java
import smile.feature.transform.WinsorScaler;

// Default: 5th percentile lower, 95th percentile upper
InvertibleColumnTransform ws = WinsorScaler.fit(train);

// Custom percentiles — clamp at 1st and 99th percentile
InvertibleColumnTransform ws2 = WinsorScaler.fit(train, 0.01, 0.99);

// Transform specific columns only
InvertibleColumnTransform ws3 = WinsorScaler.fit(train, 0.05, 0.95, "price", "quantity");

When to use: When the training set contains outliers that you want to keep in the data but that should not distort the scaling of the bulk of the values. Common in financial and sensor data.
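
The two steps (clamp, then scale) can be sketched in plain Java. WinsorSketch is a hypothetical class; to keep the example short it takes the lower and upper bounds as already-computed values rather than deriving them from percentiles of the training data.

```java
/** Plain-Java sketch of clamp-then-scale; not SMILE's implementation. */
final class WinsorSketch {
    final double lo;
    final double hi;

    /** lo and hi stand for the lower and upper percentile values learned from training data. */
    WinsorSketch(double lo, double hi) {
        if (!(lo < hi)) throw new IllegalArgumentException("lo must be less than hi");
        this.lo = lo;
        this.hi = hi;
    }

    double apply(double x) {
        double clamped = Math.max(lo, Math.min(hi, x));  // step 1: clamp to [lo, hi]
        return (clamped - lo) / (hi - lo);               // step 2: scale to [0, 1]
    }

    public static void main(String[] args) {
        WinsorSketch w = new WinsorSketch(1.0, 3.0);
        System.out.println(w.apply(2.0));    // 0.5
        System.out.println(w.apply(100.0));  // 1.0, the outlier is clamped to hi first
    }
}
```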

6.5 MaxAbsScaler — maximum absolute value

Divides each feature by its maximum absolute value so that all values lie in [-1, 1]. This scaler does not centre the data (mean is not subtracted), making it suitable for sparse data where centering would destroy sparsity.

java
import smile.feature.transform.MaxAbsScaler;

InvertibleColumnTransform mas = MaxAbsScaler.fit(train);
DataFrame scaled = mas.apply(test);

When to use: Sparse data (e.g. TF-IDF vectors) or when the sign of values is meaningful and you want to preserve zero values exactly.
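
A short plain-Java sketch shows why zeros survive this scaler: the value is only divided, never shifted. MaxAbsSketch is a hypothetical class, and the 1.0 fallback for an all-zero column is an assumption.

```java
/** Plain-Java sketch of max-abs scaling; not SMILE's implementation. */
final class MaxAbsSketch {
    final double scale;

    MaxAbsSketch(double[] train) {
        double m = 0.0;
        for (double v : train) m = Math.max(m, Math.abs(v));
        // Assumption: an all-zero column keeps a scale of 1.0.
        scale = m > 0.0 ? m : 1.0;
    }

    double apply(double x)  { return x / scale; }
    double invert(double y) { return y * scale; }

    public static void main(String[] args) {
        MaxAbsSketch s = new MaxAbsSketch(new double[] {-4, 0, 2});
        System.out.println(s.apply(-4));  // -1.0
        System.out.println(s.apply(0));   // 0.0, zero stays zero
        System.out.println(s.apply(2));   // 0.5
    }
}
```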

6.6 Normalizer — per-row unit norm

Normalizer rescales each row (sample) so that its vector norm equals 1. Unlike the column-wise scalers above, Normalizer operates across columns within a single row. It supports three norms:

| Norm  | Formula          |
|-------|------------------|
| L1    | Σ\|xᵢ\| = 1      |
| L2    | √(Σxᵢ²) = 1      |
| L_INF | max(\|xᵢ\|) = 1  |

java
import smile.feature.transform.Normalizer;
import smile.feature.transform.Normalizer.Norm;

// Normalize specific columns
String[] features = {"f1", "f2", "f3"};

Normalizer l2 = new Normalizer(Norm.L2, features);
DataFrame normed = l2.apply(df);

// Apply row-by-row to a single Tuple
Tuple row = df.get(0);
Tuple normRow = l2.apply(row);

Normalizer is not invertible (it is not an InvertibleTransform).

When to use: Text classification (TF-IDF), cosine-similarity-based algorithms, or whenever the magnitude of a row is irrelevant but the direction (ratio of features) is meaningful.
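
The three norms can be sketched in plain Java on a single row. RowNormSketch is a hypothetical class, not SMILE's Normalizer, and leaving an all-zero row unchanged is an assumption.

```java
/** Plain-Java sketch of per-row normalization; not SMILE's implementation. */
final class RowNormSketch {
    enum Norm { L1, L2, L_INF }

    /** Returns a copy of the row divided by its norm. */
    static double[] normalize(double[] row, Norm norm) {
        double n = 0.0;
        for (double v : row) {
            double a = Math.abs(v);
            switch (norm) {
                case L1:    n += a;              break;  // sum of absolute values
                case L2:    n += a * a;          break;  // sum of squares, rooted below
                case L_INF: n = Math.max(n, a);  break;  // largest absolute value
            }
        }
        if (norm == Norm.L2) n = Math.sqrt(n);

        double[] out = row.clone();
        if (n == 0.0) return out;  // assumption: an all-zero row is left unchanged
        for (int i = 0; i < out.length; i++) out[i] /= n;
        return out;
    }

    public static void main(String[] args) {
        double[] row = {3, 4};
        System.out.println(java.util.Arrays.toString(normalize(row, Norm.L2)));     // [0.6, 0.8]
        System.out.println(java.util.Arrays.toString(normalize(row, Norm.L_INF)));  // [0.75, 1.0]
    }
}
```

Because the divisor depends on the row itself and is not stored, the original magnitude cannot be recovered afterwards, which is consistent with Normalizer not being invertible.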


7. Composing transforms

7.1 andThen and compose

andThen and compose mirror the semantics of java.util.function.Function:

java
// a.andThen(b) ≡ b(a(x)) — apply a first, then b
Transform aFirst = a.andThen(b);

// b.compose(a) ≡ b(a(x)) — same result, different spelling
Transform alsoAFirst = b.compose(a);

Both return a new Transform (a lambda, not a named class).
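
Since the ordering rules are the same as for java.util.function.Function, they can be checked with the standard library alone. The sketch below uses two toy functions (ADD_ONE and TIMES_TEN, hypothetical names) in place of real transforms.

```java
import java.util.function.Function;

/** Demonstrates andThen / compose ordering with java.util.function.Function. */
final class ComposeOrderSketch {
    static final Function<Double, Double> ADD_ONE   = x -> x + 1.0;
    static final Function<Double, Double> TIMES_TEN = x -> x * 10.0;

    public static void main(String[] args) {
        // a.andThen(b): a first, then b  ->  (2 + 1) * 10
        System.out.println(ADD_ONE.andThen(TIMES_TEN).apply(2.0));  // 30.0
        // b.compose(a): also a first, then b
        System.out.println(TIMES_TEN.compose(ADD_ONE).apply(2.0));  // 30.0
        // a.compose(b): b first, then a  ->  2 * 10 + 1
        System.out.println(ADD_ONE.compose(TIMES_TEN).apply(2.0));  // 21.0
    }
}
```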

java
InvertibleColumnTransform scaler = Scaler.fit(train);
InvertibleColumnTransform std    = Standardizer.fit(train);

// Scale first, then standardise
Transform both = scaler.andThen(std);
DataFrame out  = both.apply(test);

Note: The composed result is a plain Transform, not an InvertibleTransform, even if both inputs are invertible. Transform.pipeline and Transform.fit also return a plain Transform, so to invert a composed result, keep references to the individual steps and invert them in reverse order (see §8).

7.2 pipeline

Transform.pipeline(t1, t2, ..., tN) is equivalent to chaining andThen calls and is more readable for three or more steps:

java
Transform pipeline = Transform.pipeline(
        imputer,     // fill missing values
        scaler,      // [0, 1] scale
        std          // z-score
);

DataFrame train_prepped = pipeline.apply(train);
DataFrame test_prepped  = pipeline.apply(test);

Throws IllegalArgumentException if called with no transforms.

7.3 fit — data-dependent pipelines

Transform.fit(data, trainer1, trainer2, ...) builds a pipeline where each trainer is a Function<DataFrame, Transform> — a function that observes the data as transformed so far and returns the next step:

java
Transform pipeline = Transform.fit(train,
        // Step 1: fit imputer on raw training data
        data -> SimpleImputer.fit(data),

        // Step 2: fit scaler on imputed training data
        data -> Scaler.fit(data),

        // Step 3: fit standardiser on scaled+imputed training data
        data -> Standardizer.fit(data)
);

// Apply the whole fitted pipeline to test data
DataFrame testOut = pipeline.apply(test);

This is the canonical way to build multi-step preprocessing pipelines where each step's parameters depend on the output of the previous steps.

Throws IllegalArgumentException if called with no trainers.
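
The sequential-fitting idea described above can be sketched without the library. In the hypothetical FitSketch below, a double[] stands in for the DataFrame and UnaryOperator for Transform; this illustrates the pattern and is not SMILE's actual implementation of Transform.fit.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Sketch of a data-dependent pipeline: each trainer sees the data transformed so far. */
final class FitSketch {
    @SafeVarargs
    static <D> UnaryOperator<D> fit(D data, Function<D, UnaryOperator<D>>... trainers) {
        if (trainers.length == 0) throw new IllegalArgumentException("no trainers");
        UnaryOperator<D> pipeline = null;
        D current = data;
        for (Function<D, UnaryOperator<D>> trainer : trainers) {
            UnaryOperator<D> step = trainer.apply(current);  // fit on data seen so far
            current = step.apply(current);                   // feed the next trainer
            if (pipeline == null) {
                pipeline = step;
            } else {
                UnaryOperator<D> prev = pipeline;
                pipeline = d -> step.apply(prev.apply(d));
            }
        }
        return pipeline;
    }

    /** Trainer 1: subtract the training mean. */
    static UnaryOperator<double[]> center(double[] d) {
        double mean = Arrays.stream(d).average().orElse(0.0);
        return x -> Arrays.stream(x).map(v -> v - mean).toArray();
    }

    /** Trainer 2: divide by the largest absolute value of the data it is given. */
    static UnaryOperator<double[]> maxAbs(double[] d) {
        double m = Arrays.stream(d).map(Math::abs).max().orElse(0.0);
        double scale = m > 0.0 ? m : 1.0;
        return x -> Arrays.stream(x).map(v -> v / scale).toArray();
    }

    public static void main(String[] args) {
        double[] train = {1, 2, 3, 6};  // mean 3, so centred data is {-2, -1, 0, 3}
        UnaryOperator<double[]> pipeline = fit(train, FitSketch::center, FitSketch::maxAbs);
        // maxAbs was fitted on the centred data (max 3), not the raw data (max 6).
        System.out.println(Arrays.toString(pipeline.apply(new double[] {0, 6})));  // [-1.0, 1.0]
    }
}
```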


8. Inverting transforms

InvertibleTransform.invert() maps from the transformed space back to the original space. This is used for:

  • Interpreting predictions: A model trained on standardised targets must have its output de-standardised before presenting to users.
  • Reconstruction errors: Auto-encoders or PCA-based anomaly detectors can measure reconstruction fidelity in the original units.
  • Debugging: Verify that invert(apply(x)) ≈ x.

java
InvertibleColumnTransform std = Standardizer.fit(train);

// Forward pass
DataFrame trainStd = std.apply(train);
double[] yStd = model.predict(trainStd);

// Recover predictions in original units.
// invert() looks columns up by name, so this assumes the target column "y"
// was among the columns standardised by std.
DataFrame predDf = DataFrame.of(new DoubleVector("y", yStd));
DataFrame predOriginal = std.invert(predDf);

Row-level inversion

java
Tuple row          = df.get(0);
Tuple scaled       = std.apply(row);
Tuple backToNormal = std.invert(scaled);

Composing inverses

When multiple invertible transforms are stacked, inverting the composed pipeline requires reversing the order:

java
InvertibleColumnTransform s1 = Scaler.fit(train);
InvertibleColumnTransform s2 = Standardizer.fit(s1.apply(train));

// Forward: s1 then s2
DataFrame fwd = s2.apply(s1.apply(test));

// Inverse: s2⁻¹ then s1⁻¹  (reversed order)
DataFrame inv = s1.invert(s2.invert(fwd));
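
The reversed-order rule can also be packaged once so that callers do not have to remember it. The plain-Java sketch below uses a hypothetical Step interface and chain helper (neither is part of SMILE) operating on single doubles.

```java
/** Sketch: a composition that stays invertible by undoing steps in reverse order. */
final class InverseOrderSketch {
    interface Step {
        double apply(double x);
        double invert(double y);
    }

    /** y = (x - shift) / scale, with the matching inverse. */
    static Step affine(double shift, double scale) {
        return new Step() {
            public double apply(double x)  { return (x - shift) / scale; }
            public double invert(double y) { return y * scale + shift; }
        };
    }

    /** Forward: first then second. Inverse: second's inverse, then first's. */
    static Step chain(Step first, Step second) {
        return new Step() {
            public double apply(double x)  { return second.apply(first.apply(x)); }
            public double invert(double y) { return first.invert(second.invert(y)); }
        };
    }

    public static void main(String[] args) {
        Step s1 = affine(0.0, 10.0);   // e.g. a min-max style step
        Step s2 = affine(0.5, 0.25);   // e.g. a z-score style step
        Step both = chain(s1, s2);
        double fwd = both.apply(7.0);
        System.out.println(both.invert(fwd));           // approximately 7.0
        System.out.println(s2.invert(s1.invert(fwd)));  // wrong order, does not give 7.0
    }
}
```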

9. Nullable column handling

SMILE supports nullable columns backed by NullableDoubleVector (and equivalent types for other primitives). All transforms in this package preserve nullability:

  • A nullable input column always produces a nullable output column — the null bit-mask is copied intact.
  • The transform lambda is still called on the raw double value for a null cell (which is Double.NaN by convention), but the result is ignored and replaced by NaN / null in the output vector.
  • Inversion also preserves the null mask.

java
// Suppose "income" has some null entries
InvertibleColumnTransform scaler = Scaler.fit(train);

DataFrame scaled = scaler.apply(train);
// scaled.column("income").isNullable()  → true  (same as input)
// scaled.column("income").isNullAt(k)   → same as train

DataFrame restored = scaler.invert(scaled);
// restored.column("income").isNullable() → true (null mask preserved)
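
The null-mask behaviour listed above can be pictured with a small plain-Java model. NullableColumnSketch is a hypothetical class that pairs a double[] with a boolean mask; it is not how NullableDoubleVector is implemented.

```java
import java.util.function.DoubleUnaryOperator;

/** Sketch of a nullable double column whose null mask survives a transform. */
final class NullableColumnSketch {
    final double[] values;   // null cells hold Double.NaN
    final boolean[] isNull;  // true where the cell is null

    NullableColumnSketch(double[] values, boolean[] isNull) {
        this.values = values;
        this.isNull = isNull;
    }

    /** Applies f to every cell, then restores NaN wherever the input was null. */
    NullableColumnSketch map(DoubleUnaryOperator f) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            double r = f.applyAsDouble(values[i]);  // called even for null cells
            out[i] = isNull[i] ? Double.NaN : r;    // result discarded for nulls
        }
        return new NullableColumnSketch(out, isNull.clone());
    }

    public static void main(String[] args) {
        NullableColumnSketch income = new NullableColumnSketch(
                new double[] {10, Double.NaN, 30},
                new boolean[] {false, true, false});
        NullableColumnSketch scaled   = income.map(x -> x / 10.0);
        NullableColumnSketch restored = scaled.map(x -> x * 10.0);
        System.out.println(scaled.isNull[1]);    // true
        System.out.println(restored.values[2]);  // 30.0
    }
}
```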

10. Serialization

Transform extends java.io.Serializable. All built-in implementations and the closure-captured lambdas inside InvertibleColumnTransform are serializable, so a fitted transform can be saved and reloaded:

java
// Save
try (var out = new ObjectOutputStream(new FileOutputStream("scaler.ser"))) {
    out.writeObject(scaler);
}

// Load
InvertibleColumnTransform loaded;
try (var in = new ObjectInputStream(new FileInputStream("scaler.ser"))) {
    loaded = (InvertibleColumnTransform) in.readObject();
}

// Use the loaded transform on new data
DataFrame newData = loaded.apply(incoming);

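Caution with lambdas: A lambda or method reference is serialized together with every value it captures, so capture only serializable values. Anonymous classes, and lambdas that touch instance members, also capture the enclosing instance; use them only when the enclosing class is itself Serializable. A non-serializable function passed to ColumnTransform or InvertibleColumnTransform will cause NotSerializableException at save time.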
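
The difference between a safe and an unsafe lambda can be reproduced with the standard library alone. In the sketch below, SerFn is a hypothetical serializable function type standing in for the library's own, and the helpers write to an in-memory buffer instead of a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Sketch: which lambdas survive Java serialization. */
final class LambdaSerializationSketch {
    /** Hypothetical serializable function type. */
    interface SerFn extends Serializable {
        double f(double x);
    }

    /** Captures only a double, so the lambda serializes cleanly. */
    static SerFn divideBy(double scale) {
        return x -> x / scale;
    }

    /** Captures a plain Object, which is not Serializable. */
    static SerFn capturesNonSerializable() {
        Object helper = new Object();
        return x -> x + helper.hashCode();
    }

    /** Returns the serialized bytes, or null if something captured is not serializable. */
    static byte[] save(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (NotSerializableException e) {
            return null;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    static Object load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SerFn restored = (SerFn) load(save(divideBy(4.0)));
        System.out.println(restored.f(10.0));                         // 2.5
        System.out.println(save(capturesNonSerializable()) == null);  // true
    }
}
```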


11. Writing a custom transform

Option A — implement Transform directly

Useful when the transformation is not column-wise (e.g. row normalization, dimension reduction):

java
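import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import smile.data.Tuple;
import smile.data.transform.Transform;
import smile.data.type.StructType;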

public class ClipTransform implements Transform {
    private final double min, max;
    private final Set<String> columns;

    public ClipTransform(double min, double max, String... columns) {
        this.min = min;
        this.max = max;
        this.columns = new HashSet<>(Arrays.asList(columns));
    }

    @Override
    public Tuple apply(Tuple x) {
        StructType schema = x.schema();
        return new smile.data.AbstractTuple(schema) {
            @Override
            public Object get(int i) {
                String name = schema.field(i).name();
                if (columns.contains(name)) {
                    double v = x.getDouble(i);
                    return Math.max(min, Math.min(max, v));
                }
                return x.get(i);
            }
        };
    }
}

Option B — use ColumnTransform with lambdas

The simplest approach for column-wise numeric transformations:

java
// Clip all values to [-3, 3]
Map<String, Function> clips = new HashMap<>();
for (String col : numericColumns) {
    clips.put(col, x -> Math.max(-3.0, Math.min(3.0, x)));
}
ColumnTransform clipper = new ColumnTransform("clip±3", clips);

Option C — use InvertibleColumnTransform with lambdas

When you need both forward and inverse:

java
// Box-Cox power transform with λ = 0.5  (square-root transform)
double lambda = 0.5;
Map<String, Function> fwd = Map.of(
    "revenue", x -> (Math.pow(x, lambda) - 1.0) / lambda
);
Map<String, Function> inv = Map.of(
    "revenue", y -> Math.pow(y * lambda + 1.0, 1.0 / lambda)
);
InvertibleColumnTransform boxCox =
    new InvertibleColumnTransform("BoxCox(0.5)", fwd, inv);
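
Before wiring a forward / inverse pair into InvertibleColumnTransform, it is worth checking that the two formulas really undo each other. The plain-Java sketch below does this for the Box-Cox pair; BoxCoxSketch is a hypothetical helper, and the λ = 0 branch (natural log) follows the standard Box-Cox definition rather than anything in the snippet above.

```java
/** Sketch: Box-Cox forward and inverse formulas, with a round-trip check. */
final class BoxCoxSketch {
    /** Forward transform; defined for x > 0. */
    static double forward(double x, double lambda) {
        return lambda == 0.0 ? Math.log(x) : (Math.pow(x, lambda) - 1.0) / lambda;
    }

    /** Inverse transform; undoes forward() for the same lambda. */
    static double inverse(double y, double lambda) {
        return lambda == 0.0 ? Math.exp(y) : Math.pow(y * lambda + 1.0, 1.0 / lambda);
    }

    public static void main(String[] args) {
        double lambda = 0.5;
        double y = forward(9.0, lambda);         // (sqrt(9) - 1) / 0.5, approximately 4.0
        System.out.println(y);
        System.out.println(inverse(y, lambda));  // approximately 9.0
    }
}
```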

Option D — data-dependent custom transform via fit

java
// Normalize each column by its own training median
static InvertibleColumnTransform fitMedianScaler(DataFrame data, String... cols) {
    Map<String, Function> transforms = new HashMap<>();
    Map<String, Function> inverses   = new HashMap<>();
    for (String col : cols) {
        double[] vals = data.column(col).toDoubleArray();
        double median = MathEx.median(vals);
        double scale  = MathEx.isZero(median) ? 1.0 : Math.abs(median);
        transforms.put(col, x -> x / scale);
        inverses.put(col,   y -> y * scale);
    }
    return new InvertibleColumnTransform("MedianScaler", transforms, inverses);
}

12. End-to-end tutorial

This tutorial preprocesses a lending dataset that has loan_amount (double), annual_income (double, nullable), grade (categorical), term (int) and default (int, the label).

Step 1 — Split data

java
DataFrame data = Read.csv("loans.csv");

// 80/20 train/test split by row position (no shuffling; assumes rows are already in random order)
int n = data.size();
int trainSize = (int)(n * 0.8);
DataFrame train = data.slice(0, trainSize);
DataFrame test  = data.slice(trainSize, n);

Step 2 — Impute missing values

java
import smile.feature.imputation.SimpleImputer;

// Fit: learn per-column fill values (mean for numeric, mode for categorical)
Transform imputer = SimpleImputer.fit(train);
DataFrame trainImputed = imputer.apply(train);
DataFrame testImputed  = imputer.apply(test);

Step 3 — Scale numeric features

java
import smile.feature.transform.WinsorScaler;

// Use Winsor to be robust to the income outliers
InvertibleColumnTransform scaler =
    WinsorScaler.fit(trainImputed, 0.01, 0.99, "loan_amount", "annual_income");

DataFrame trainScaled = scaler.apply(trainImputed);
DataFrame testScaled  = scaler.apply(testImputed);

Step 4 — Compose into a single reusable pipeline

java
Transform fullPipeline = Transform.fit(train,
    SimpleImputer::fit,
    data -> WinsorScaler.fit(data, 0.01, 0.99, "loan_amount", "annual_income")
);

// Apply in one call — same result as steps 2-3 above
DataFrame trainPrepped = fullPipeline.apply(train);
DataFrame testPrepped  = fullPipeline.apply(test);

Step 5 — Train a model using a Formula

java
import smile.data.formula.Formula;
import smile.data.formula.Terms;

Formula formula = Formula.of("default",
        Terms.$("loan_amount"),
        Terms.$("annual_income"),
        Terms.$("term"),
        Terms.$("grade"));

// Extract design matrix and response
var X = formula.matrix(trainPrepped);
var y = formula.y(trainPrepped).toIntArray();

// … fit your model with X and y …

Step 6 — Invert predictions to interpret results

java
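// Illustration only: suppose the model predicts a value on the scaled
// annual_income axis and we want it back in original income units.
// invert() looks columns up by name, so the one-column DataFrame reuses the
// name "annual_income"; scaler is the WinsorScaler fitted in Step 3.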
double[] scores = model.predict(formula.matrix(testPrepped));

DataFrame scoreDf = DataFrame.of(new DoubleVector("annual_income", scores));
DataFrame inOriginalScale = scaler.invert(scoreDf);
System.out.println(inOriginalScale);

Step 7 — Inspect and save the pipeline

java
// Inspect what each step learned
System.out.println(fullPipeline);

// Serialize for reuse in production
try (var out = new ObjectOutputStream(new FileOutputStream("pipeline.ser"))) {
    out.writeObject(fullPipeline);
}

13. Choosing the right transform

| Situation | Recommended transform |
|---|---|
| No strong outliers, range matters | Scaler |
| Gaussian assumption, distance-based algo | Standardizer |
| Outliers present, want robustness | RobustStandardizer |
| Outliers present, want bounded output | WinsorScaler |
| Sparse data, zero must stay zero | MaxAbsScaler |
| Direction matters, not magnitude | Normalizer (L2) |
| Text / count vectors | Normalizer (L1 or L2) |
| Custom log / power transformation | InvertibleColumnTransform + lambdas |
| Missing values before scaling | SimpleImputer first, then scaler |
| Multiple sequential steps | Transform.fit(...) pipeline |

Quick comparison of column-wise scalers

Given a column with values {1, 2, 3, 100} (the 100 is an outlier):

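| Transform | 1 | 2 | 3 | 100 |
|---|---|---|---|---|
| Scaler | 0.000 | 0.010 | 0.020 | 1.000 |
| Standardizer (μ=26.5, sample σ≈49.0) | −0.52 | −0.50 | −0.48 | 1.50 |
| RobustStandardizer (med=2.5, IQR≈1.5) | −1.00 | −0.33 | 0.33 | 65.0 |
| WinsorScaler (5/95 pct) | 0.000 | 0.500 | 1.000 | 1.000 |
| MaxAbsScaler (max=100) | 0.010 | 0.020 | 0.030 | 1.000 |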

Scaler compresses {1,2,3} to near zero because of the outlier. WinsorScaler clamps the outlier so {1,2,3} spread cleanly across [0,1]. RobustStandardizer is robust to the outlier but its output is unbounded.


14. API quick reference

Transform (interface — smile.data.transform)

| Member | Description |
|---|---|
| apply(Tuple) | Transform one row (abstract). |
| apply(DataFrame) | Transform all rows (default: row-stream; overridden by ColumnTransform for column-parallel batch). |
| andThen(Transform) | Compose: this, then after. |
| compose(Transform) | Compose: before, then this. |
| Transform.pipeline(Transform...) | Chain multiple transforms left-to-right. Throws on empty input. |
| Transform.fit(DataFrame, Function<DataFrame,Transform>...) | Fit a data-dependent pipeline. Each trainer sees data as already transformed. Throws on empty input. |

InvertibleTransform (interface — smile.data.transform)

| Member | Description |
|---|---|
| invert(Tuple) | Inverse-transform one row. |
| invert(DataFrame) | Inverse-transform all rows. Preserves nullable columns. |

ColumnTransform (class — smile.data.transform)

| Member | Description |
|---|---|
| ColumnTransform(String name, Map<String,Function> transforms) | Constructor. |
| apply(Tuple) | Applies lambdas to matching column positions; others pass through. |
| apply(DataFrame) | Column-parallel batch transform; preserves null masks. |
| toString() | "<name>(\n <fn>, ...\n)" |

InvertibleColumnTransform (class — smile.data.transform)

| Member | Description |
|---|---|
| InvertibleColumnTransform(String, Map<String,Function>, Map<String,Function>) | Constructor (name, forward lambdas, inverse lambdas). |
| invert(Tuple) | Applies inverse lambdas; unmatched columns pass through. |
| invert(DataFrame) | Column-parallel inverse; preserves null masks. |

Built-in scalers (all in smile.feature.transform)

| Class | Factory method | Formula | Invertible |
|---|---|---|---|
| Scaler | Scaler.fit(data, cols...) | (x−min)/(max−min) clamped to [0,1] | Yes |
| Standardizer | Standardizer.fit(data, cols...) | (x−μ)/σ | Yes |
| RobustStandardizer | RobustStandardizer.fit(data, cols...) | (x−median)/IQR | Yes |
| WinsorScaler | WinsorScaler.fit(data) or fit(data, lo, hi, cols...) | clamp then (x−pLo)/(pHi−pLo) | Yes |
| MaxAbsScaler | MaxAbsScaler.fit(data, cols...) | x / max(\|x\|) | Yes |
| Normalizer | new Normalizer(Norm, cols...) | row-wise unit norm (L1/L2/L∞) | No |

All column-wise fit() methods accept an optional varargs String... columns argument. When omitted, all numeric columns are transformed automatically.


SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.