SMILE — Formula User Guide & Tutorial


The smile.data.formula package provides a compact, symbolic language for specifying statistical models. A formula describes which column is the response (dependent variable) and which columns are the predictors (independent variables), including optional transformations and interactions. It is the primary bridge between a raw DataFrame and the design matrices consumed by SMILE's machine-learning algorithms.


Table of Contents

  1. Concepts
  2. Quick Start
  3. Building Formulas
  4. Terms Reference
  5. Applying a Formula
  6. Expanding a Formula
  7. Intercept / Bias Control
  8. Nullable Columns
  9. Working with Categorical Variables
  10. Date / Time Tutorial
  11. Custom Transformations Tutorial
  12. Thread Safety & Lifecycle
  13. API Cheat Sheet

1. Concepts

Formula

A Formula has two sides separated by ~:

response ~ predictor1 + predictor2 + ...

| Side | Name | Meaning |
|---|---|---|
| Left of ~ | Response | The column to predict (dependent variable). Optional; omit it for unsupervised tasks. |
| Right of ~ | Predictors | One or more terms that describe the inputs (independent variables). |

Term

A Term is a node in the formula expression tree. Terms are composable:

  • Variable("age") — a raw column reference
  • Add(Variable("x"), val(10)) — an arithmetic expression
  • FactorCrossing("a","b","c") — a crossing of categorical factors
  • Date("timestamp", YEAR, MONTH) — date feature extraction

Feature

When a Term is bound to a concrete StructType (schema), it produces one or more Feature objects. A Feature knows its output StructField (name + type + measure) and can extract a value from any Tuple or an entire ValueVector from a DataFrame.


2. Quick Start

java
import smile.data.DataFrame;
import smile.data.formula.Formula;
import static smile.data.formula.Terms.*;

// Load data
DataFrame df = /* ... your DataFrame ... */;

// ① Predict "salary" from all other columns
Formula f1 = Formula.lhs("salary");

// ② Predict "class" from age and log(income)
Formula f2 = Formula.of("class", $("age"), log("income"));

// ③ Get the design matrix (bias column included by default)
var X = f2.matrix(df);

// ④ Get the response vector
var y = f2.y(df);

// ⑤ Get just the predictor DataFrame
var xdf = f2.x(df);

3. Building Formulas

3.1 From a String

Formula.of(String) parses an R-style formula string. This is the most convenient approach for interactive use.

java
Formula f = Formula.of("salary ~ age + log(income) + gender");

Parsing rules:

| Token | Meaning |
|---|---|
| y ~ x | Response y, predictor x |
| ~ x | No response (RHS only) |
| y ~ . | Response y, all remaining columns as predictors |
| + term | Add term to predictors |
| - term | Remove term from predictors |
| - 1 or + 0 | Remove intercept |
| + 1 | Explicitly include intercept |
| a:b:c | Interaction of factors a, b, c |
| (a x b x c) | Full factor crossing of a, b, c |
| (a x b x c)^2 | Factor crossing up to degree 2 |
| log(x) | Apply log to column x |
| abs(x) | Apply abs to column x |
| (any supported function) | See §4.6 |

Examples:

java
Formula.of("y ~ .")                       // y ~ all other columns
Formula.of("y ~ x1 + x2 - 1")            // no intercept
Formula.of("y ~ log(x) + sqrt(z)")        // transformations
Formula.of("y ~ (a x b x c)^2")           // interactions up to degree 2
Formula.of("y ~ a:b + c")                 // explicit interaction + main effect
Formula.of(" ~ .")                        // no response, all columns

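Round-trip guarantee: for formulas built only from terms that have a string syntax, Formula.of(formula.toString()) equals the original formula. Terms without a string form, such as date(...) (see §4.9), cannot be re-parsed.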

3.2 Factory Methods

Use the programmatic API for type-safety and IDE completion.

java
// lhs("col"): response only; predictors = all remaining columns (.)
Formula f1 = Formula.lhs("salary");
// equivalent to: Formula.of("salary ~ .")

// of(response, predictors...): explicit response + predictor terms
Formula f2 = Formula.of("salary", $("age"), log("income"), cross("a","b"));

// of(response, String... predictors): shorthand with column names
Formula f3 = Formula.of("salary", "age", "gender");

// rhs(Term...): no response variable (unsupervised / feature extraction)
Formula f4 = Formula.rhs($("age"), log("income"));

// rhs(String...): no response, column names only
Formula f5 = Formula.rhs("age", "gender");

4. Terms Reference

All term-builder methods live in the Terms interface. The recommended usage is:

java
import static smile.data.formula.Terms.*;

4.1 Dot (.) — all remaining columns

The special term "." means every column not otherwise mentioned in the formula.

java
Formula.of("salary", dot())               // salary ~ .
Formula.of("salary", dot(), delete("id")) // salary ~ . - id

In a string formula use the literal .:

java
Formula.of("salary ~ . - id")

Note: . is only valid on the right-hand side. Using it as the response throws IllegalArgumentException.


4.2 Variable

References a column by name.

java
Term t = $("age");          // shorthand factory in Terms
Term t = new Variable("age");

String columns passed to Formula.of(String, String...) are automatically wrapped:

java
Formula.of("y", "x1", "x2")  // x1 and x2 become Variable terms

4.3 Intercept (0 / 1)

Controls whether a bias/intercept column is added when producing a design matrix via formula.matrix(data).

java
// Explicitly include intercept (default behaviour)
Formula.of("y ~ x + 1")
Formula.of("y", $("x"), new Intercept(true))

// Remove intercept — fit a line through the origin
Formula.of("y ~ x + 0")
Formula.of("y ~ x - 1")
Formula.of("y", $("x"), new Intercept(false))

If neither 0 nor 1 appears, the bias column is included by default.


4.4 Delete (-)

Removes a previously specified or Dot-implied term.

java
// Remove a single column
Formula.of("y ~ . - id")
Formula.of("y", dot(), delete("id"))

// Remove an interaction
Formula.of("y ~ (a x b x c) - a:b")
Formula.of("y", cross("a","b","c"), delete(interact("a","b")))

You can also call Terms.delete(String) or Terms.delete(Term):

java
Term t = delete("age");
Term t = delete(log("income"));

4.5 Arithmetic Operators

Binary arithmetic terms operate on two numeric columns (or any combination of columns and constant values). The result type follows Java's numeric promotion rules (int → long → float → double).

| Method | Expression | Notes |
|---|---|---|
| add(a, b) | a + b | |
| sub(a, b) | a - b | |
| mul(a, b) | a * b | |
| div(a, b) | a / b | Integer division for int/long operands |

Each method has four overloads: (Term, Term), (String, String), (Term, String), (String, Term).

java
add("x", "y")             // x + y
sub("revenue", "cost")    // revenue - cost
mul("price", val(1.1))    // price * 1.1  (constant scale-up by 10%)
div("total", val(100))    // total / 100

Using inside a formula:

java
Formula.of("profit", dot(), sub("revenue", "cost"), div("profit", "revenue"))
// profit ~ . + (revenue - cost) + (profit / revenue)

Type safety: Both operands must be numeric (int, long, float, or double). Passing a String or other non-numeric column throws IllegalStateException at bind-time.
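
The promotion and integer-division behaviour described above is ordinary Java arithmetic. The stand-alone sketch below does not call SMILE at all; the class and method names are made up for illustration, and it only shows what Java itself does with int, long, and double operands.

```java
/** Plain-Java illustration of numeric promotion; not part of SMILE. */
final class PromotionSketch {
    /** int / int truncates toward zero. */
    static int intDiv(int a, int b) {
        return a / b;
    }

    /** int / double promotes the int operand to double first. */
    static double mixedDiv(int a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(intDiv(7, 2));        // 3
        System.out.println(intDiv(-7, 2));       // -3
        System.out.println(mixedDiv(7, 2.0));    // 3.5
        Object sum = 1 + 2L;                     // int + long promotes to long
        System.out.println(sum instanceof Long); // true
    }
}
```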


4.6 Math Functions

All functions operate on numeric columns and produce a double result, with three exceptions listed in the tables below: abs and round keep the input type, and sign returns an int.

Rounding

| Call | Result type | Description |
|---|---|---|
| abs(x) | same as input | Absolute value; supports int, long, float, double |
| ceil(x) | double | Ceiling (smallest integer ≥ x) |
| floor(x) | double | Floor (largest integer ≤ x) |
| round(x) | same as input | Nearest integer; Math.round semantics |
| rint(x) | double | Nearest integer (IEEE 754 "round to even") |
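
The difference between the two "nearest integer" rows only shows up on exact ties. The sketch below uses java.lang.Math directly rather than SMILE terms (the class name is hypothetical) to show how Math.round and Math.rint disagree on halves; how SMILE maps the result back to a column type is not covered here.

```java
/** Plain-Java comparison of Math.round and Math.rint on ties; not part of SMILE. */
final class RoundingSketch {
    /** Formats both results for one input, e.g. "3 vs 2.0". */
    static String ties(double x) {
        return Math.round(x) + " vs " + Math.rint(x);
    }

    public static void main(String[] args) {
        System.out.println(ties(2.5));   // 3 vs 2.0   (round: ties go up; rint: ties go to even)
        System.out.println(ties(3.5));   // 4 vs 4.0
        System.out.println(ties(-2.5));  // -2 vs -2.0
    }
}
```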

Logarithms & Exponentials

| Call | Description |
|---|---|
| log(x) | Natural logarithm ln(x) |
| log2(x) | Base-2 logarithm |
| log10(x) | Base-10 logarithm |
| log1p(x) | ln(1 + x); numerically stable for small x |
| exp(x) | e^x |
| expm1(x) | e^x − 1; numerically stable for small x |
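
To see why the "numerically stable" variants matter, the following sketch compares the naive expression with java.lang.Math.log1p for a very small input. It is plain Java with a hypothetical class name and does not involve SMILE; it assumes only standard IEEE 754 double arithmetic.

```java
/** Plain-Java illustration of log1p/expm1 for tiny inputs; not part of SMILE. */
final class StableLogSketch {
    /** The naive form: 1.0 + x is rounded before the logarithm is taken. */
    static double naiveLog1p(double x) {
        return Math.log(1.0 + x);
    }

    public static void main(String[] args) {
        double x = 1e-17;                          // smaller than half an ulp of 1.0
        System.out.println(naiveLog1p(x));         // 0.0, because 1.0 + x rounds to exactly 1.0
        System.out.println(Math.log1p(x) > 0.0);   // true, the small value survives
        System.out.println(Math.expm1(x) > 0.0);   // true
    }
}
```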

Powers & Roots

| Call | Description |
|---|---|
| sqrt(x) | Square root √x |
| cbrt(x) | Cube root ∛x |

Trigonometry

| Call | Description |
|---|---|
| sin(x) | Sine (radians) |
| cos(x) | Cosine (radians) |
| tan(x) | Tangent (radians) |
| asin(x) | Arc-sine (radians) |
| acos(x) | Arc-cosine (radians) |
| atan(x) | Arc-tangent (radians) |
| sinh(x) | Hyperbolic sine |
| cosh(x) | Hyperbolic cosine |
| tanh(x) | Hyperbolic tangent |

Sign

| Call | Result type | Description |
|---|---|---|
| signum(x) | double | Floating-point sign: -1.0, 0.0, or 1.0 |
| sign(x) | int | Integer sign: -1, 0, or 1 |
| ulp(x) | double | Unit of least precision |

Every function has both a String overload (column name) and a Term overload (for nesting):

java
log("income")                  // log of the "income" column
log(add("base", "bonus"))      // log(base + bonus)  — nested terms
sqrt(div("variance", val(n)))  // sqrt(variance / n)

4.7 Factor Interaction (::)

FactorInteraction combines two or more categorical columns into a single composite categorical feature. All participating columns must carry a CategoricalMeasure (e.g. NominalScale).

java
// Programmatic
Term t = interact("outlook", "temperature");  // outlook:temperature
Term t = interact("a", "b", "c");             // a:b:c

// String formula
Formula.of("play ~ a:b + c")

The resulting feature has a NominalScale whose levels are the Cartesian product of the input levels, joined with ":":

dry:low, dry:high, wet:low, wet:high
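
The level list above can be reproduced with a small Cartesian-product routine. This is a plain-Java sketch with hypothetical names, not SMILE's implementation; the ordering (first factor varies slowest) is an assumption taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of building interaction levels; not part of SMILE. */
final class InteractionLevelsSketch {
    /** Cartesian product of the level lists, joined with ":"; needs at least one factor. */
    static List<String> levels(List<List<String>> factorLevels) {
        List<String> result = new ArrayList<>(factorLevels.get(0));
        for (List<String> levels : factorLevels.subList(1, factorLevels.size())) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String level : levels) {
                    next.add(prefix + ":" + level);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> humidity = List.of("dry", "wet");
        List<String> temperature = List.of("low", "high");
        // [dry:low, dry:high, wet:low, wet:high]
        System.out.println(levels(List.of(humidity, temperature)));
    }
}
```

The number of composite levels is the product of the input level counts, so interactions of many high-cardinality factors grow quickly.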

4.8 Factor Crossing (&& / ^)

FactorCrossing is syntactic sugar that generates all main effects and all pairwise (or higher-order) interactions among a set of factors:

(a x b x c)     ≡  a + b + c + a:b + a:c + b:c + a:b:c
(a x b x c)^2   ≡  a + b + c + a:b + a:c + b:c        (interactions up to degree 2)

java
// Full crossing of three factors
Term t = cross("a", "b", "c");               // (a x b x c)

// Crossing up to degree 2 only
Term t = cross(2, "a", "b", "c");            // (a x b x c)^2

// String formula
Formula.of("y ~ (a x b x c)^2")
Formula.of("y ~ (a x b x c)")

Combine with delete to remove specific interactions:

java
Formula.of("y", cross("a","b","c"), delete(interact("a","b")))
// Adds a, b, c, a:c, b:c, a:b:c  (a:b removed)
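
The expansion rule for a crossing can be sketched as "all combinations of up to n factors". The code below is a plain-Java illustration with hypothetical names, not SMILE's FactorCrossing class; it only reproduces the equivalences listed at the start of this section.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch: expand a factor crossing into main effects and interactions. */
final class CrossingSketch {
    /** Returns every combination of 1..maxDegree factors, shortest first, joined with ":". */
    static List<String> expand(List<String> factors, int maxDegree) {
        List<String> terms = new ArrayList<>();
        for (int size = 1; size <= Math.min(maxDegree, factors.size()); size++) {
            combine(factors, 0, size, new ArrayList<>(), terms);
        }
        return terms;
    }

    // Recursively picks `remaining` more factors, starting at index `start`.
    private static void combine(List<String> factors, int start, int remaining,
                                List<String> picked, List<String> out) {
        if (remaining == 0) {
            out.add(String.join(":", picked));
            return;
        }
        for (int i = start; i <= factors.size() - remaining; i++) {
            picked.add(factors.get(i));
            combine(factors, i + 1, remaining - 1, picked, out);
            picked.remove(picked.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<String> abc = List.of("a", "b", "c");
        System.out.println(expand(abc, 3)); // [a, b, c, a:b, a:c, b:c, a:b:c]
        System.out.println(expand(abc, 2)); // [a, b, c, a:b, a:c, b:c]
    }
}
```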

4.9 Date / Time Features

The Date term extracts numeric sub-fields from LocalDate, LocalDateTime, or LocalTime columns.

java
date("timestamp", DateFeature.YEAR, DateFeature.MONTH, DateFeature.DAY_OF_MONTH)
date("birthday",  DateFeature.YEAR, DateFeature.DAY_OF_WEEK)
date("checkIn",   DateFeature.HOUR, DateFeature.MINUTE)

Available DateFeature values:

| Feature | Column types | Range | Measure |
|---|---|---|---|
| YEAR | Date, DateTime | e.g. 2024 | |
| QUARTER | Date, DateTime | 1–4 | |
| MONTH | Date, DateTime | 1–12 | NominalScale (JANUARY…DECEMBER) |
| WEEK_OF_YEAR | Date, DateTime | 0–53 | |
| WEEK_OF_MONTH | Date, DateTime | 0–5 | |
| DAY_OF_YEAR | Date, DateTime | 1–366 | |
| DAY_OF_MONTH | Date, DateTime | 1–31 | |
| DAY_OF_WEEK | Date, DateTime | 1–7 | NominalScale (MONDAY…SUNDAY) |
| HOUR | Time, DateTime | 0–23 | |
| MINUTE | Time, DateTime | 0–59 | |
| SECOND | Time, DateTime | 0–59 | |

Type safety:

  • Requesting HOUR/MINUTE/SECOND on a Date column throws UnsupportedOperationException.
  • Requesting YEAR/MONTH/… on a Time column throws UnsupportedOperationException.
  • DateTime columns support all features.
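
The same date-versus-time split exists in java.time itself, which may help when deciding which features a column can provide. The sketch below uses only the JDK (the class and method names are hypothetical); it is not how SMILE implements the Date term, just an illustration of the underlying field support.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.temporal.ChronoField;
import java.time.temporal.IsoFields;

/** Plain-JDK illustration of calendar fields; not part of SMILE. */
final class DateFieldSketch {
    /** Year, quarter, month name, and weekday name of a date. */
    static String describe(LocalDate d) {
        return d.getYear() + " Q" + d.get(IsoFields.QUARTER_OF_YEAR)
                + " " + d.getMonth() + " " + d.getDayOfWeek();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2024, 3, 5);
        System.out.println(describe(date));                             // 2024 Q1 MARCH TUESDAY

        // A LocalDate has no time-of-day fields, and a LocalTime has no calendar fields.
        System.out.println(date.isSupported(ChronoField.HOUR_OF_DAY));  // false
        LocalTime time = LocalTime.of(9, 30);
        System.out.println(time.isSupported(ChronoField.YEAR));         // false
        System.out.println(time.getHour() + ":" + time.getMinute());    // 9:30
    }
}
```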

String formula:

java
// Not currently parseable from a string; use the Java API:
Formula.rhs(date("timestamp", DateFeature.YEAR, DateFeature.MONTH))

4.10 Constant Values

val(x) creates a term that returns the same constant value for every row. Use it together with arithmetic operators to encode fixed transformations.

java
val(1)         // integer constant 1
val(0.5)       // double constant 0.5
val(100L)      // long constant 100
val(true)      // boolean constant true
val('A')       // char constant 'A'
val((byte)  1) // byte constant 1
val((short) 2) // short constant 2
val("label")   // Object constant — produces an object column

Typical use-cases:

java
mul("price", val(1.08))       // apply 8% tax
add("age", val(-18))          // centre age at 18
div("bytes", val(1024 * 1024)) // convert bytes → MiB

4.11 Custom Lambdas

Terms.of(...) lets you attach any Java lambda as a formula term without writing a new class. There are overloads for unary and binary functions returning int, long, double, or an arbitrary object type.

Unary lambdas

java
// ToIntFunction<T>
Term t = Terms.of("clip", "age", (Integer x) -> Math.max(0, Math.min(x, 100)));

// ToDoubleFunction<T>
Term t = Terms.of("normalize", "score",
                  (Double x) -> (x - mean) / stddev);

// Function<T, R> with explicit return class
Term t = Terms.of("bucket", "income", String.class,
                  (Double x) -> x < 50_000 ? "low" : x < 150_000 ? "mid" : "high");

Binary lambdas

java
// ToDoubleBiFunction<T, U>
Term t = Terms.of("ratio", "numerator", "denominator",
                  (Double a, Double b) -> a / b);

// ToIntBiFunction<T, U>
Term t = Terms.of("diff_days", "start", "end",
                  (LocalDate a, LocalDate b) -> (int) ChronoUnit.DAYS.between(a, b));

// BiFunction<T, U, R>
Term t = Terms.of("concat", "first", "last", String.class,
                  (String a, String b) -> a + " " + b);

Use these terms just like any built-in term:

java
Formula f = Formula.of("churn", dot(),
        Terms.of("tenure_months", "start_date", "end_date",
                 (LocalDate s, LocalDate e) ->
                     (int) ChronoUnit.MONTHS.between(s, e)));
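
Because the lambdas are ordinary java.util.function instances, they can be checked in isolation before being attached to a formula. The sketch below is plain JDK code with a hypothetical holder class; it exercises the month-difference lambda used above and shows that ChronoUnit.MONTHS.between counts only complete months.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.function.ToIntBiFunction;

/** Stand-alone check of a binary lambda before wiring it into a formula. */
final class TenureLambdaSketch {
    /** Whole months between two dates; partial months are dropped. */
    static final ToIntBiFunction<LocalDate, LocalDate> TENURE_MONTHS =
            (start, end) -> (int) ChronoUnit.MONTHS.between(start, end);

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2023, 1, 15);
        System.out.println(TENURE_MONTHS.applyAsInt(start, LocalDate.of(2024, 3, 14))); // 13
        System.out.println(TENURE_MONTHS.applyAsInt(start, LocalDate.of(2024, 3, 15))); // 14
    }
}
```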

5. Applying a Formula

5.1 Binding to a Schema

formula.bind(StructType) resolves column names to schema positions and compiles the term tree into an efficient array of Feature objects. The result is the predictor schema (xschema).

java
StructType xschema = formula.bind(df.schema());
System.out.println(xschema);

Binding is lazy and cached — the first call per schema does the work; subsequent calls with the same schema object return immediately. Binding is thread-local, so the same Formula instance can be safely shared across threads.

5.2 Producing a DataFrame

formula.frame(DataFrame) returns a DataFrame containing the response column first, followed by all predictor columns, exactly as specified by the formula.

java
DataFrame out = formula.frame(df);
// Columns: [response, predictor1, predictor2, ...]

If the response column is absent from the data (e.g., when predicting on new data), frame() returns only the predictor columns.

5.3 Extracting Predictors Only

java
DataFrame xdf = formula.x(df);

Returns only the predictor columns. Useful for scoring new observations.

5.4 Extracting the Response

java
// As a ValueVector (for use with SMILE learners)
ValueVector y = formula.y(df);

// As a double value from a single Tuple
double yval = formula.y(tuple);

// As an int value from a single Tuple
int yint = formula.yint(tuple);

Each of these methods throws UnsupportedOperationException if the formula has no response term.

5.5 Producing a Design Matrix

formula.matrix(DataFrame) converts the predictor DataFrame into a DenseMatrix (suitable for linear algebra and gradient-based learners). Categorical columns are dummy-encoded automatically.

java
DenseMatrix X     = formula.matrix(df);          // with bias column (default)
DenseMatrix Xbias = formula.matrix(df, true);    // with bias column
DenseMatrix Xraw  = formula.matrix(df, false);   // without bias column

The bias column (all-ones) is prepended when the formula has no explicit Intercept(false) term, or when bias=true is passed explicitly.
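
What "prepending an all-ones column" means for the resulting matrix can be sketched on a plain double[][]. This is not SMILE's DenseMatrix code; the class and method names are hypothetical and a row-major array stands in for the matrix.

```java
import java.util.Arrays;

/** Plain-Java sketch of adding a bias (intercept) column; not part of SMILE. */
final class BiasColumnSketch {
    /** Returns a copy of x with a leading column of 1.0 in every row. */
    static double[][] withBias(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0;                                   // bias term
            System.arraycopy(x[i], 0, out[i], 1, x[i].length); // original predictors
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{2, 3}, {4, 5}};
        System.out.println(Arrays.deepToString(withBias(x)));
        // [[1.0, 2.0, 3.0], [1.0, 4.0, 5.0]]
    }
}
```

With the ones column in place, the first fitted coefficient of a linear model plays the role of the intercept.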

5.6 Applying Row-by-Row to Tuples

java
// Full (response + predictors) Tuple
Tuple yx = formula.apply(tuple);

// Predictors-only Tuple
Tuple x = formula.x(tuple);

These are useful in streaming/online scenarios where you process one row at a time.


6. Expanding a Formula

formula.expand(StructType) resolves the . (Dot) and FactorCrossing meta-terms against an actual schema, returning a new Formula where every term is a concrete Variable, FactorInteraction, or arithmetic expression.

java
Formula f = Formula.of("salary ~ . - id + log(age)");
Formula expanded = f.expand(df.schema());
System.out.println(expanded);
// salary ~ gender + birthday + name + log(age)   (id was deleted)

This is useful for inspecting exactly which columns a formula will consume before fitting a model.
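
The resolution of "." can be pictured as a filter over the schema's column names. The sketch below is a plain-Java approximation with hypothetical names and a hypothetical column order (salary, id, age, gender, birthday, name); it is not SMILE's expand() implementation, only the rule stated in §4.1 applied to the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of resolving the dot term against a schema; not part of SMILE. */
final class DotExpansionSketch {
    /** Keeps columns that are not the response, not used by another term, and not deleted. */
    static List<String> expandDot(List<String> columns, String response,
                                  Set<String> mentioned, Set<String> deleted) {
        List<String> out = new ArrayList<>();
        for (String column : columns) {
            if (!column.equals(response) && !mentioned.contains(column) && !deleted.contains(column)) {
                out.add(column);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("salary", "id", "age", "gender", "birthday", "name");
        // salary ~ . - id + log(age): "age" is mentioned inside log(...), "id" is deleted
        System.out.println(expandDot(columns, "salary", Set.of("age"), Set.of("id")));
        // [gender, birthday, name]
    }
}
```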


7. Intercept / Bias Control

| Formula string | hasBias() | Effect on matrix() |
|---|---|---|
| y ~ x | true | Bias column prepended |
| y ~ x + 1 | true | Bias column prepended |
| y ~ x + 0 | false | No bias column |
| y ~ x - 1 | false | No bias column |
| matrix(df, true) | (override) | Bias column forced on |
| matrix(df, false) | (override) | Bias column forced off |

java
// Fit line through origin
Formula f = Formula.of("y ~ x + 0");
DenseMatrix X = f.matrix(df);   // single column, no bias

// Explicit bias override
DenseMatrix Xbias = f.matrix(df, true);  // force bias regardless of formula

8. Nullable Columns

SMILE DataFrame supports nullable columns (backed by NullableDoubleVector, etc.). Formula arithmetic propagates nulls correctly: if any operand of +, -, *, / is null for a given row, the result for that row is null.

java
// salary is nullable; age is not
Formula f = Formula.rhs(add("salary", "age"));
DataFrame out = f.frame(df);
// Rows where salary == null → result is null

When a nullable column is converted to a design matrix via matrix(), null values become Double.NaN.
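
The two rules above (null in, null out; null becomes NaN in a matrix) can be sketched with boxed Double values. This is plain Java with hypothetical names and does not use SMILE's nullable vectors; it only mirrors the described behaviour.

```java
/** Plain-Java sketch of null propagation in arithmetic; not part of SMILE. */
final class NullPropagationSketch {
    /** Addition that yields null as soon as either operand is missing. */
    static Double add(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    /** A primitive double matrix cannot hold null, so a missing value maps to NaN. */
    static double toMatrixCell(Double value) {
        return value == null ? Double.NaN : value;
    }

    public static void main(String[] args) {
        System.out.println(add(50_000.0, 30.0));   // 50030.0
        System.out.println(add(null, 30.0));       // null
        System.out.println(toMatrixCell(null));    // NaN
    }
}
```

Since NaN propagates through most numeric routines, it is usually worth imputing or filtering missing rows before fitting a model on the matrix.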


9. Working with Categorical Variables

Categorical columns carry a CategoricalMeasure (usually NominalScale or OrdinalScale). The formula handles them in two ways:

As plain predictors

Including a categorical variable directly passes through its integer encoding (the underlying code value in the NominalScale).

java
Formula.of("play", $("outlook"), $("temperature"))

As dummy-encoded predictors (in matrix())

When formula.matrix(df) is called, all categorical predictors are automatically dummy-encoded: a factor with k levels becomes k - 1 binary columns, one per level except the first, which is dropped as the reference. This matches R's default treatment coding of factors.

java
// "outlook" has levels {sunny, overcast, rainy}
// matrix() produces two binary columns: outlook_overcast, outlook_rainy
DenseMatrix X = formula.matrix(df);
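
The encoding of a single value under this scheme can be sketched as follows. The code is plain Java with hypothetical names, not SMILE's encoder; it assumes the level order sunny, overcast, rainy from the comment above, with the first level as the reference.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of dummy (treatment) coding; not part of SMILE. */
final class DummyEncodingSketch {
    /** Encodes one value as levels.size() - 1 indicators; the first level is all zeros. */
    static double[] encode(List<String> levels, String value) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] row = new double[levels.size() - 1];
        if (index > 0) {
            row[index - 1] = 1.0;
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> outlook = List.of("sunny", "overcast", "rainy");
        System.out.println(Arrays.toString(encode(outlook, "sunny")));    // [0.0, 0.0]
        System.out.println(Arrays.toString(encode(outlook, "overcast"))); // [1.0, 0.0]
        System.out.println(Arrays.toString(encode(outlook, "rainy")));    // [0.0, 1.0]
    }
}
```

Dropping the reference level keeps the indicator columns from summing to the bias column, which would otherwise make the design matrix rank-deficient.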

As interaction terms

Use interact or cross (see §4.7 and §4.8) to build interaction features from categorical columns. The resulting feature is itself categorical with a NominalScale whose levels are the Cartesian product.


10. Date / Time Tutorial

This tutorial shows how to enrich a sales DataFrame with calendar features.

java
import java.time.LocalDate;
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.data.formula.DateFeature;
import static smile.data.formula.Terms.*;

// Suppose df has columns: order_date (LocalDate), amount (double), region (String)

Formula f = Formula.of("amount",
        dot(),                      // include region
        date("order_date",          // extract calendar features
                DateFeature.YEAR,
                DateFeature.QUARTER,
                DateFeature.MONTH,
                DateFeature.DAY_OF_WEEK));

DataFrame out = f.frame(df);
// Columns: amount, region, order_date_YEAR, order_date_QUARTER,
//          order_date_MONTH, order_date_DAY_OF_WEEK
//
// order_date_MONTH    has NominalScale (JANUARY … DECEMBER)
// order_date_DAY_OF_WEEK has NominalScale (MONDAY … SUNDAY)

Use getString() on the result to decode the nominal level names:

java
System.out.println(out.getString(0, 4));  // order_date_MONTH, e.g. "MARCH"
System.out.println(out.getString(0, 5));  // order_date_DAY_OF_WEEK, e.g. "TUESDAY"

For LocalDateTime columns all 11 features are available:

java
date("created_at",
     DateFeature.YEAR, DateFeature.MONTH, DateFeature.DAY_OF_MONTH,
     DateFeature.HOUR, DateFeature.MINUTE)

For LocalTime-only columns use only time features:

java
date("open_time", DateFeature.HOUR, DateFeature.MINUTE)

11. Custom Transformations Tutorial

This tutorial builds a feature-engineering pipeline for a lending dataset that has a default response column and loan_amount, income, start_date, and end_date predictor columns.

java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import smile.data.formula.Formula;
import static smile.data.formula.Terms.*;

Formula f = Formula.of("default",

    // Debt-to-income ratio
    div("loan_amount", "income"),

    // Log-transform income to reduce skew
    log("income"),

    // Loan duration in months via a custom binary lambda
    Terms.of("duration_months", "start_date", "end_date",
             (LocalDate s, LocalDate e) ->
                 (int) ChronoUnit.MONTHS.between(s, e)),

    // Flag high-value loans (> 50 000) via a custom unary lambda
    Terms.of("high_value", "loan_amount", String.class,
             (Double x) -> x > 50_000 ? "yes" : "no"),

    // Square-root of loan amount to reduce skew
    sqrt("loan_amount")
);

// Bind to schema to inspect output columns
var schema = f.bind(df.schema());
System.out.println(schema);

// Produce design matrix with bias
var X = f.matrix(df);
var y = f.y(df);

12. Thread Safety & Lifecycle

Formula implements AutoCloseable. Internally it stores the compiled Feature array in a ThreadLocal, so multiple threads can share a single Formula instance and bind it to the same or different schemas concurrently.

java
// Safe: one Formula object, many threads
// dataFrames is assumed to be a List<DataFrame> defined elsewhere
try (Formula f = Formula.of("y ~ x")) {
    dataFrames.parallelStream().forEach(df -> {
        var X = f.matrix(df);  // thread-safe
    });
}  // close() removes thread-local binding and avoids memory leaks

Best practice — always close in a try-with-resources when the formula is used inside a long-lived thread pool:

java
try (Formula f = Formula.lhs("label")) {
    // ... use f ...
}

Calling bind() a second time with the same StructType object is a no-op (cached). Passing a different schema re-binds and replaces the cached binding.
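
The caching behaviour described in this section follows a common per-thread cache pattern, sketched below in plain Java. This is not SMILE's source code: the class is hypothetical, the schema is modelled as a List<String>, and comparing schemas by object identity is an assumption based on the phrase "same StructType object".

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Generic sketch of a thread-local binding cache; not SMILE's implementation. */
final class BindingCacheSketch {
    // Each thread remembers the schema it last bound to.
    private final ThreadLocal<List<String>> boundSchema = new ThreadLocal<>();
    // Counts how often the (expensive) compile step ran, for demonstration only.
    final AtomicInteger compilations = new AtomicInteger();

    /** Re-compiles only when this thread sees a different schema object. */
    void bind(List<String> schema) {
        if (boundSchema.get() != schema) {   // identity check, not equals()
            compilations.incrementAndGet();  // stand-in for compiling the term tree
            boundSchema.set(schema);
        }
    }

    /** Drops the calling thread's cached binding. */
    void close() {
        boundSchema.remove();
    }

    public static void main(String[] args) {
        BindingCacheSketch formula = new BindingCacheSketch();
        List<String> schema = List.of("x", "y");
        formula.bind(schema);
        formula.bind(schema);                            // cached, no recompilation
        System.out.println(formula.compilations.get());  // 1
        formula.bind(List.of("x", "z"));                 // different schema, rebinds
        System.out.println(formula.compilations.get());  // 2
        formula.close();
    }
}
```

Note that in this pattern remove() only clears the entry of the thread that calls it, which is one reason to scope the object tightly when pooled threads are involved.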


13. API Cheat Sheet

Formula static factories

| Method | Description |
|---|---|
| Formula.of(String) | Parse formula from R-style string |
| Formula.of(String, String...) | Response string + predictor column names |
| Formula.of(String, Term...) | Response string + predictor terms |
| Formula.of(Term, Term...) | Response term + predictor terms |
| Formula.lhs(String) | Response only; predictors = . |
| Formula.lhs(Term) | Response term only; predictors = . |
| Formula.rhs(String...) | No response; predictor column names |
| Formula.rhs(Term...) | No response; predictor terms |

Formula instance methods

| Method | Returns | Description |
|---|---|---|
| bind(StructType) | StructType | Bind to schema; returns predictor schema |
| expand(StructType) | Formula | Expand . and crossings against schema |
| frame(DataFrame) | DataFrame | Response + predictor columns |
| x(DataFrame) | DataFrame | Predictor columns only |
| y(DataFrame) | ValueVector | Response column |
| matrix(DataFrame) | DenseMatrix | Dummy-encoded design matrix (with bias) |
| matrix(DataFrame, boolean) | DenseMatrix | Design matrix with explicit bias flag |
| apply(Tuple) | Tuple | Response + predictors for one row |
| x(Tuple) | Tuple | Predictors for one row |
| y(Tuple) | double | Response value (double) for one row |
| yint(Tuple) | int | Response value (int) for one row |
| response() | Term | The response term (may be null) |
| predictors() | Term[] | The predictor terms |
| toString() | String | R-style formula string |
| close() | void | Release thread-local binding |

Terms static builders (import static)

| Method | Symbol | Description |
|---|---|---|
| $(String) | variable | Create variable (auto-detects function names) |
| dot() | . | All remaining columns |
| delete(String/Term) | - x | Delete term |
| interact(String...) | a:b:c | Factor interaction |
| cross(String...) | (a x b) | Full factor crossing |
| cross(int, String...) | (a x b)^n | Factor crossing to degree n |
| date(String, DateFeature...) | | Date/time feature extraction |
| val(x) | constant | Constant term |
| add(a, b) | a + b | Addition |
| sub(a, b) | a - b | Subtraction |
| mul(a, b) | a * b | Multiplication |
| div(a, b) | a / b | Division |
| abs(x) | | Absolute value |
| ceil(x) | | Ceiling |
| floor(x) | | Floor |
| round(x) | | Round |
| rint(x) | | Round to even |
| exp(x) | | e^x |
| expm1(x) | | e^x − 1 |
| log(x) | | ln(x) |
| log1p(x) | | ln(1+x) |
| log2(x) | | log₂(x) |
| log10(x) | | log₁₀(x) |
| sqrt(x) | | √x |
| cbrt(x) | | ∛x |
| sin(x) | | sin(x) |
| cos(x) | | cos(x) |
| tan(x) | | tan(x) |
| asin(x) | | arcsin(x) |
| acos(x) | | arccos(x) |
| atan(x) | | arctan(x) |
| sinh(x) | | sinh(x) |
| cosh(x) | | cosh(x) |
| tanh(x) | | tanh(x) |
| signum(x) | | −1.0, 0.0, or 1.0 |
| sign(x) | | −1, 0, or 1 (integer) |
| ulp(x) | | Unit of least precision |
| of(name, x, ToIntFunction) | | Custom int transform |
| of(name, x, ToLongFunction) | | Custom long transform |
| of(name, x, ToDoubleFunction) | | Custom double transform |
| of(name, x, Class, Function) | | Custom object transform |
| of(name, x, y, ToIntBiFunction) | | Custom int bi-transform |
| of(name, x, y, ToLongBiFunction) | | Custom long bi-transform |
| of(name, x, y, ToDoubleBiFunction) | | Custom double bi-transform |
| of(name, x, y, Class, BiFunction) | | Custom object bi-transform |

SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.