SMILE — DataFrame User Guide & Tutorial

This document covers smile.data.DataFrame, smile.data.Tuple, and the supporting packages smile.data.type, smile.data.measure, and smile.data.vector.

Architecture overview
Data types — smile.data.type
Levels of measurement — smile.data.measure
Column vectors — smile.data.vector
Tuple — a single row
DataFrame
CategoricalEncoder
End-to-end tutorial
API quick reference

1. Architecture overview

smile.data.type
│
├── DataType (interface)          – describes storage format
│     ├── Primitive types         – BooleanType, ByteType, ShortType,
│     │                             IntType, LongType, FloatType, DoubleType,
│     │                             CharType  (each has nullable=false/true)
│     ├── TemporalType            – DateType, TimeType, DateTimeType
│     ├── DecimalType             – BigDecimal
│     ├── StringType              – String
│     ├── ObjectType              – arbitrary Object
│     ├── ArrayType               – primitive array columns
│     └── StructType              – schema (ordered list of StructFields)
│
├── StructField(name, dtype, measure?)  – one column descriptor
└── DataTypes                     – constants and factory methods

smile.data.measure
│
├── Measure (interface)           – adds semantic meaning to a numeric column
│     ├── CategoricalMeasure      – discrete labeled integer codes
│     │     ├── NominalScale      – unordered categories (gender, country…)
│     │     └── OrdinalScale      – ordered categories (rating, grade…)
│     └── NumericalMeasure        – continuous numeric annotations
│           ├── IntervalScale     – no true zero (temperature °C, year)
│           └── RatioScale        – true zero (price, weight, count)
└── Measure.Currency / Percent    – built-in ratio-scale singletons

smile.data.vector
│
├── ValueVector (interface)       – one typed column of a DataFrame
│     ├── PrimitiveVector         – non-nullable: IntVector, DoubleVector…
│     ├── NullablePrimitiveVector – nullable: NullableIntVector…
│     ├── StringVector            – String column (always nullable-compatible)
│     ├── NumberVector<N>         – boxed Number column (BigDecimal…)
│     └── ObjectVector<T>         – arbitrary Object column

smile.data
│
├── Tuple (interface)             – one row; immutable ordered field list
│     └── Row (record)            – Tuple backed by a DataFrame + row index
│
├── RowIndex                      – optional label → ordinal index for rows
│
└── DataFrame (record)            – 2-D heterogeneous tabular data
      schema    : StructType
      columns   : List<ValueVector>
      index     : RowIndex?        (optional row labels)

2. Data types — `smile.data.type`

Every column in a DataFrame has a DataType that describes how values are stored. Access the singletons via DataTypes.*:

java

import smile.data.type.DataTypes;
import smile.data.type.DataType;

2.1 Primitive and nullable types

Singleton	Java type	Nullable variant
`DataTypes.BooleanType`	`boolean`	`DataTypes.NullableBooleanType`
`DataTypes.ByteType`	`byte`	`DataTypes.NullableByteType`
`DataTypes.ShortType`	`short`	`DataTypes.NullableShortType`
`DataTypes.IntType`	`int`	`DataTypes.NullableIntType`
`DataTypes.LongType`	`long`	`DataTypes.NullableLongType`
`DataTypes.FloatType`	`float`	`DataTypes.NullableFloatType`
`DataTypes.DoubleType`	`double`	`DataTypes.NullableDoubleType`
`DataTypes.CharType`	`char`	`DataTypes.NullableCharType`

Non-nullable primitive types store values directly in a primitive array, with no boxing and no null overhead. Nullable variants add a BitSet null mask alongside the primitive array. Reading a null cell through a primitive accessor (getDouble, getInt, …) returns a sentinel value (Double.NaN or Integer.MIN_VALUE, depending on the accessor), so always call isNullAt first when the column is nullable.

java

boolean isNull = df.isNullAt(row, col);
if (!isNull) {
    double v = df.getDouble(row, col);
}
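
The null-mask layout described above can be pictured with a small standalone sketch. The class below is hypothetical (it is not SMILE's NullableDoubleVector) and only assumes what the text states: a primitive array, a BitSet mask, and a NaN sentinel for null cells.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a nullable double column; not a SMILE class. */
final class NullableDoubleColumnSketch {
    private final double[] values;   // primitive storage, no boxing
    private final BitSet nullMask;   // bit i set => cell i is null

    NullableDoubleColumnSketch(double[] values, BitSet nullMask) {
        this.values = values.clone();
        this.nullMask = (BitSet) nullMask.clone();
    }

    /** True when the mask marks cell i as null. */
    boolean isNullAt(int i) {
        return nullMask.get(i);
    }

    /** Returns the NaN sentinel for null cells, the stored value otherwise. */
    double getDouble(int i) {
        return nullMask.get(i) ? Double.NaN : values[i];
    }

    /** Number of null cells, i.e. the number of set bits in the mask. */
    int nullCount() {
        return nullMask.cardinality();
    }
}
```

Because NaN can also be a legitimately stored value, only the mask (isNullAt) tells a null cell apart from a real NaN in this sketch.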

2.2 Temporal types

Singleton	Java type
`DataTypes.DateType`	`java.time.LocalDate`
`DataTypes.TimeType`	`java.time.LocalTime`
`DataTypes.DateTimeType`	`java.time.LocalDateTime`

Temporal columns are always represented as object arrays (they are not primitives), but SMILE provides ISO-8601 parsing and formatting out of the box via dtype.valueOf(String) and dtype.toString(Object).
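
For orientation, the ISO-8601 round trip can be reproduced with plain java.time, which is what the three temporal types map to. This is a minimal sketch using only JDK calls; the helper class IsoTemporalSketch is hypothetical and is not how dtype.valueOf or dtype.toString are implemented.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical helper showing ISO-8601 parsing and formatting with java.time. */
final class IsoTemporalSketch {
    /** Parses an ISO local date such as "2024-03-15". */
    static LocalDate parseDate(String s) {
        return LocalDate.parse(s);
    }

    /** Parses an ISO local time such as "09:30:15". */
    static LocalTime parseTime(String s) {
        return LocalTime.parse(s);
    }

    /** Parses an ISO local date-time such as "2024-03-15T09:30:15". */
    static LocalDateTime parseDateTime(String s) {
        return LocalDateTime.parse(s);
    }

    /** java.time types print themselves in ISO-8601 form. */
    static String format(Object temporal) {
        return temporal.toString();
    }
}
```

Note that java.time prints the shortest valid ISO form, so a time with zero seconds is printed without the seconds field.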

2.3 Other types

Singleton / factory	Java type	Notes
`DataTypes.StringType`	`String`	Always nullable
`DataTypes.DecimalType`	`BigDecimal`	Always nullable
`DataTypes.ObjectType`	`Object`	Catch-all
`DataTypes.object(Class<?>)`	specific class	Resolves to known type if possible
`DataTypes.IntArrayType` etc.	`int[]` etc.	Primitive array columns

2.4 StructField

A StructField is an immutable triple (name, dtype, measure?) that describes a single column.

java

import smile.data.measure.NominalScale;
import smile.data.type.DataTypes;
import smile.data.type.StructField;

// Plain int column
StructField age     = new StructField("age",    DataTypes.IntType);

// Nullable salary
StructField salary  = new StructField("salary", DataTypes.NullableDoubleType);

// Categorical column with a NominalScale
StructField gender  = new StructField("gender", DataTypes.ByteType,
                                      new NominalScale("Male", "Female"));

// Useful predicates
boolean numeric = age.isNumeric();      // true for non-nominal numeric
boolean nullable = salary.dtype().isNullable();

StructField is a Java record; two fields are equal when name, dtype, and measure all match.

2.5 StructType

StructType is the schema of a DataFrame or Tuple. It is an ordered list of StructFields with a fast name-to-index lookup map.

java

import smile.data.type.StructType;

StructType schema = new StructType(
    new StructField("name",   DataTypes.StringType),
    new StructField("age",    DataTypes.IntType),
    new StructField("salary", DataTypes.NullableDoubleType)
);

// Field access
StructField f = schema.field("age");    // by name
StructField f = schema.field(1);        // by ordinal

int j         = schema.indexOf("age");  // ordinal of "age"
String[]  ns  = schema.names();
DataType[] dt = schema.dtypes();
int       len = schema.length();

StructType is mutable for internal use by DataFrame (columns can be added/renamed in-place). Do not cache a StructType reference and assume it is immutable.
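
To see why a cached schema reference can go stale, here is a minimal sketch of an ordered field list with a name-to-index map that is renamed in place. SchemaSketch is a hypothetical class that models only field names; SMILE's StructType is assumed to be richer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical ordered schema with a name-to-index lookup; not a SMILE class. */
final class SchemaSketch {
    private final List<String> names = new ArrayList<>();
    private final Map<String, Integer> index = new HashMap<>();

    SchemaSketch(String... fieldNames) {
        for (String name : fieldNames) add(name);
    }

    /** Appends a field; its ordinal is the current field count. */
    void add(String name) {
        if (index.putIfAbsent(name, names.size()) != null) {
            throw new IllegalArgumentException("Duplicate field: " + name);
        }
        names.add(name);
    }

    /** Constant-time lookup of a field's ordinal by name. */
    int indexOf(String name) {
        Integer i = index.get(name);
        if (i == null) throw new IllegalArgumentException("Unknown field: " + name);
        return i;
    }

    /** Renames in place: every holder of this schema sees the new name. */
    void rename(String oldName, String newName) {
        int i = indexOf(oldName);
        if (index.containsKey(newName)) {
            throw new IllegalArgumentException("Duplicate field: " + newName);
        }
        index.remove(oldName);
        index.put(newName, i);
        names.set(i, newName);
    }

    String name(int i) { return names.get(i); }

    int length() { return names.size(); }
}
```

After rename("age", "years"), any code that still looks up "age" through the same reference fails, which is the hazard the note above warns about.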

3. Levels of measurement — `smile.data.measure`

A Measure is an optional annotation on a StructField that adds semantic meaning to a numeric column. It controls how values are rendered and how SMILE's algorithms treat the column (e.g., dummy encoding for categorical variables).

java

import java.text.NumberFormat;   // used by the IntervalScale and RatioScale examples
import smile.data.measure.*;

3.1 NominalScale

Unordered categories. Each integer code maps to a string label.

java

// From string labels (codes are 0, 1, 2, …)
NominalScale gender = new NominalScale("Male", "Female");

// From an enum
NominalScale color = new NominalScale(Color.class); // enum Color { Red, Green, Blue }

// From explicit code→label pairs
NominalScale custom = new NominalScale(
    new int[]    {1, 3, 7},
    new String[] {"Low", "Mid", "High"}
);

int    code  = gender.valueOf("Female").intValue(); // 1
String label = gender.level(0);                    // "Male"
int    size  = gender.size();                      // 2

When a column has a NominalScale, getString(i) returns the label, not the raw integer, and df.factorize() will produce columns backed by a NominalScale.
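
The code-to-label mapping behind a nominal scale can be sketched with two maps. NominalLevelsSketch below is a hypothetical illustration of the idea, covering both the default 0, 1, 2, … codes and explicit code→label pairs; it is not SMILE's NominalScale.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-way mapping between integer codes and string labels. */
final class NominalLevelsSketch {
    private final Map<Integer, String> labelOf = new LinkedHashMap<>();
    private final Map<String, Integer> codeOf = new LinkedHashMap<>();

    /** Labels receive the codes 0, 1, 2, … in the order given. */
    NominalLevelsSketch(String... labels) {
        for (int code = 0; code < labels.length; code++) put(code, labels[code]);
    }

    /** Explicit code→label pairs; the two arrays must have equal length. */
    NominalLevelsSketch(int[] codes, String[] labels) {
        if (codes.length != labels.length) {
            throw new IllegalArgumentException("codes and labels differ in length");
        }
        for (int i = 0; i < codes.length; i++) put(codes[i], labels[i]);
    }

    private void put(int code, String label) {
        labelOf.put(code, label);
        codeOf.put(label, code);
    }

    /** Label for a code, or null if the code is unknown. */
    String level(int code) { return labelOf.get(code); }

    /** Code for a label; unknown labels are rejected. */
    int valueOf(String label) {
        Integer code = codeOf.get(label);
        if (code == null) throw new IllegalArgumentException("Unknown level: " + label);
        return code;
    }

    int size() { return labelOf.size(); }
}
```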

3.2 OrdinalScale

Ordered categories. Levels carry an implied rank; values are kept sorted.

java

OrdinalScale rating = new OrdinalScale("Poor", "Fair", "Good", "Excellent");
// ordinals: 0=Poor, 1=Fair, 2=Good, 3=Excellent

The ordinal position is meaningful for comparison and sorting, but arithmetic (mean, variance) is not valid on pure ordinal data.

3.3 IntervalScale

Numeric, with no true zero. Differences between values are meaningful; ratios are not. Primarily used as an annotation for documentation/display.

java

IntervalScale celsius = new IntervalScale(NumberFormat.getInstance());

3.4 RatioScale

Numeric with a true zero. All arithmetic is valid.

java

// Built-in singletons
Measure price  = Measure.Currency;   // formats as currency
Measure pct    = Measure.Percent;    // formats as percentage

// Custom
RatioScale weight = new RatioScale(NumberFormat.getInstance());

Compatibility rules (enforced by StructField's compact constructor):

NumericalMeasure is invalid for Boolean, Char, String columns.
CategoricalMeasure is only valid for integral (int, long, byte, short) columns.

4. Column vectors — `smile.data.vector`

A ValueVector is a typed, indexed, one-dimensional array that forms a single column of a DataFrame.

java

import smile.data.vector.*;

4.1 Primitive vectors

Each primitive type has a corresponding non-nullable vector class:

Class	Backing store	Example
`IntVector`	`int[]`	`new IntVector("age", new int[]{25, 30, 35})`
`LongVector`	`long[]`	`new LongVector("ts", new long[]{...})`
`FloatVector`	`float[]`	`new FloatVector("score", new float[]{...})`
`DoubleVector`	`double[]`	`new DoubleVector("salary", new double[]{...})`
`BooleanVector`	`boolean[]`	`new BooleanVector("flag", new boolean[]{...})`
`ByteVector`	`byte[]`	`new ByteVector("cat", new byte[]{...})`
`ShortVector`	`short[]`	`new ShortVector("rank", new short[]{...})`
`CharVector`	`char[]`	`new CharVector("grade", new char[]{...})`

Each class also accepts a StructField in place of the column name as the first argument, which is how a measure is attached:

java

StructField field = new StructField("gender", DataTypes.ByteType,
                                    new NominalScale("Male", "Female"));
ByteVector gender = new ByteVector(field, new byte[]{0, 1, 0, 1});

4.2 Nullable primitive vectors

Nullable variants store an additional BitSet null-mask:

java

import java.util.BitSet;

double[] values   = {80000.0, Double.NaN, 90000.0};
BitSet   nullMask = new BitSet(3);
nullMask.set(1);  // index 1 is null

NullableDoubleVector salary = new NullableDoubleVector(
    new StructField("salary", DataTypes.NullableDoubleType),
    values, nullMask
);

salary.isNullable();     // true
salary.isNullAt(1);      // true
salary.getDouble(0);     // 80000.0
salary.getNullCount();   // 1

Corresponding nullable classes: NullableIntVector, NullableLongVector, NullableFloatVector, NullableDoubleVector, NullableBooleanVector, NullableByteVector, NullableShortVector, NullableCharVector.

4.3 Object vectors

Class	Content
`StringVector`	`String[]` — always nullable
`NumberVector<N>`	`Number[]` — `BigDecimal`, boxed primitives
`ObjectVector<T>`	`Object[]` — `LocalDate`, `LocalDateTime`, any type

java

StringVector  names = new StringVector("name",  new String[]{"Alice","Bob"});
ObjectVector<LocalDate> dates = new ObjectVector<>(
    new StructField("birthday", DataTypes.DateType),
    new LocalDate[]{ LocalDate.of(1990,1,1), LocalDate.of(1985,6,15) }
);

4.4 Common ValueVector operations

java

ValueVector v = df.column("age");

// Size and nullability
int n  = v.size();
boolean nullable = v.isNullable();
boolean hasNull  = v.anyNull();
int nullCount    = v.getNullCount();
boolean rowNull  = v.isNullAt(2);

// Typed reads
int    i = v.getInt(0);
double d = v.getDouble(0);
String s = v.getString(0);       // uses measure for categorical columns
Object o = v.get(0);             // boxed / object value, may be null

// Bulk export
int[]    ia = v.toIntArray();
double[] da = v.toDoubleArray();
String[] sa = v.toStringArray();

// Streaming
v.intStream().sum();
v.doubleStream().average();
v.stream().filter(Objects::nonNull).count();

// Boolean filter masks
boolean[] mask = v.eq(30);       // element-wise ==
boolean[] mask = v.gt(25);       // element-wise >
boolean[] mask = v.isin(25, 30); // element-wise membership

// Sub-selection
ValueVector sub = v.get(Index.of(new int[]{0, 2, 3}));

// Rename (returns new vector)
ValueVector renamed = v.withName("years");

// In-place mutation
v.set(0, 99);

5. Tuple — a single row

Tuple is an interface representing one row of a DataFrame. Treat a Tuple as immutable. Note, however, that Row (the concrete implementation backed by a DataFrame) is a view: mutating a Row directly modifies the backing DataFrame.

Creating a Tuple

java

import smile.data.Tuple;
import smile.data.type.StructType;

// Construct standalone from schema + values
StructType schema = new StructType(
    new StructField("x", DataTypes.IntType),
    new StructField("y", DataTypes.DoubleType)
);
Tuple t = Tuple.of(schema, new Object[]{42, 3.14});
Tuple t = Tuple.of(schema, new int[]{42}, new double[]{3.14});

Reading fields

All accessors have both ordinal (int i) and name (String field) forms:

java

Tuple row = df.get(0);

// By ordinal
int    age    = row.getInt(0);
double salary = row.getDouble(2);
String name   = row.getString(1);

// By name
int    age    = row.getInt("age");
double salary = row.getDouble("salary");

// Generic (boxed, may return null)
Object val = row.get(0);
Object val = row.get("name");

// Null check — always check before primitive accessors on nullable columns
boolean isNull = row.isNullAt(2);
boolean isNull = row.isNullAt("salary");
boolean anyNull = row.anyNull();

Available typed accessors: getBoolean, getChar, getByte, getShort, getInt, getLong, getFloat, getDouble, getString.

Exporting to a double array

java

// All fields as doubles (NaN for nulls, level encoding for categoricals)
double[] arr = row.toArray();

// Selective fields
double[] arr = row.toArray("age", "salary");

// With intercept (bias=1) and dummy encoding
double[] arr = row.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");

Schema access

java

StructType schema  = row.schema();
int        length  = row.length();
int        j       = row.indexOf("salary");

6. DataFrame

DataFrame is a Java record. Its component references are fixed, but the objects they refer to can change. It is backed by:

StructType schema — column descriptors
List<ValueVector> columns — the actual data, one vector per column
RowIndex index — optional row labels (may be null)

The List<ValueVector> and StructType are mutable for the in-place operations add(), set(), rename(), and fillna(). All other operations that structurally reshape the data (e.g., select, drop, sort, concat) return a new DataFrame.

java

import smile.data.DataFrame;

6.1 Creating a DataFrame

From column vectors (most direct)

java

DataFrame df = new DataFrame(
    new StringVector ("name",   new String[] {"Alice", "Bob", "Charlie"}),
    new IntVector    ("age",    new int[]    {25, 30, 35}),
    new DoubleVector ("salary", new double[] {60000., 80000., 90000.})
);

From a 2-D double / float / int array

java

double[][] data = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};

// Auto-named columns: V1, V2
DataFrame df = DataFrame.of(data);

// Custom names
DataFrame df = DataFrame.of(data, "x", "y");
DataFrame df = DataFrame.of(new int[][]{{1,2},{3,4}}, "a", "b");
DataFrame df = DataFrame.of(new float[][]{{1f},{2f}}, "v");

From a Java Bean / POJO via reflection

SMILE uses Java Beans introspection (getXxx() methods) to discover columns automatically. Field order in the schema follows the alphabetical order of getter names.

java

public class Person {
    public String getName()     { return name; }
    public int    getAge()      { return age;  }
    public Double getSalary()   { return salary; }   // nullable wrapper → NullableDoubleType
    public Gender getGender()   { return gender; }   // enum → ByteVector with NominalScale
    // …
}

List<Person> persons = List.of(
    new Person("Alice", 25, 60000., Gender.Female),
    new Person("Bob",   30, null,   Gender.Male)
);
DataFrame df = DataFrame.of(Person.class, persons);

Rules for automatic type inference:

int / long / float / double / boolean / char / byte / short → non-nullable primitive vector
Boxed Integer, Double, etc. and Number subclasses → nullable vector or NumberVector
String → StringVector
Enum → ByteVector (≤127 levels), ShortVector (≤32767), or IntVector; always annotated with a NominalScale
LocalDate, LocalDateTime, LocalTime → ObjectVector with temporal type
Anything else → ObjectVector
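
The enum rule above picks the narrowest integer storage for the level codes. A minimal sketch of that decision follows, using the thresholds stated in the list; EnumStorageSketch and its Storage enum are hypothetical names, not part of SMILE.

```java
/** Hypothetical helper that mirrors the enum storage rule from the list above. */
final class EnumStorageSketch {
    enum Storage { BYTE, SHORT, INT }

    /** Chooses the storage width from the number of enum levels. */
    static Storage storageFor(int levels) {
        if (levels <= Byte.MAX_VALUE) return Storage.BYTE;    // up to 127 levels
        if (levels <= Short.MAX_VALUE) return Storage.SHORT;  // up to 32767 levels
        return Storage.INT;
    }

    /** Convenience overload that counts the constants of an enum class. */
    static Storage storageFor(Class<? extends Enum<?>> type) {
        return storageFor(type.getEnumConstants().length);
    }
}
```

For example, storageFor(java.time.DayOfWeek.class) yields BYTE in this sketch, because the enum has only seven constants.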

From a schema + list/stream of Tuples

java

StructType schema = new StructType(
    new StructField("x", DataTypes.IntType),
    new StructField("y", DataTypes.DoubleType)
);

// Non-empty list
List<Tuple> rows = List.of(Tuple.of(schema, 1, 2.0), Tuple.of(schema, 3, 4.0));
DataFrame df = DataFrame.of(schema, rows);

// Stream variant
DataFrame df = DataFrame.of(schema, rows.stream());

// Empty list — returns a zero-row DataFrame with the correct schema
DataFrame empty = DataFrame.of(schema, List.of());

From a JDBC ResultSet

java

try (Connection conn = DriverManager.getConnection(url);
     Statement  stmt = conn.createStatement();
     ResultSet  rs   = stmt.executeQuery("SELECT * FROM employees")) {
    DataFrame df = DataFrame.of(rs);
}

JDBC type mapping is handled automatically via StructType.of(ResultSet).

6.2 Inspecting a DataFrame

java

// Dimensions
int nrow = df.nrow();          // number of rows
int ncol = df.ncol();          // number of columns
int size = df.size();          // alias for nrow()
int[] shape = {df.shape(0), df.shape(1)};  // {nrow, ncol}
boolean empty = df.isEmpty();

// Schema
StructType schema   = df.schema();
String[]   names    = df.names();
DataType[] dtypes   = df.dtypes();
Measure[]  measures = df.measures();

// Print
System.out.println(df);          // head(10)
System.out.println(df.head(5));  // first 5 rows
System.out.println(df.tail(5));  // last 5 rows
System.out.println(df.toString(2, 7, true));  // rows [2,7)

// Summary statistics
System.out.println(df.describe());

describe() returns a new DataFrame with one row per column and columns: column, type, measure, count (non-null), mode, mean, std, min, 25%, 50%, 75%, max.

6.3 Accessing cells, rows, and columns

Cells

java

Object  val = df.get(row, col);           // boxed value, may be null
int     i   = df.getInt(row, col);
long    l   = df.getLong(row, col);
float   f   = df.getFloat(row, col);
double  d   = df.getDouble(row, col);
String  s   = df.getString(row, col);     // uses measure for categoricals
String  sc  = df.getScale(row, col);      // level name for NominalScale / OrdinalScale
boolean nil = df.isNullAt(row, col);

Rows (Tuples)

java

Tuple row = df.get(0);                    // first row
Tuple row = df.apply(0);                  // Scala alias

// Iterate all rows
for (Row row : df) { /* … */ }
df.stream().forEach(row -> { /* … */ });
List<Row> list = df.toList();

Columns

java

ValueVector col = df.column(0);           // by ordinal
ValueVector col = df.column("age");       // by name
ValueVector col = df.apply("age");        // Scala alias

Mutating a cell

java

df.set(row, col, value);
df.update(row, col, value);               // Scala alias

6.4 Row indexing with RowIndex

A RowIndex maps arbitrary label objects to row ordinals, allowing label-based row selection similar to Pandas .loc.

java

// Attach an index from an existing column (column is removed from data)
DataFrame indexed = df.setIndex("name");

// Attach an index from an external array
DataFrame indexed = df.setIndex(new Object[]{"r0","r1","r2","r3"});

// Look up by label
Tuple row  = indexed.loc("Alice");          // single row
DataFrame sub = indexed.loc("Alice","Bob"); // multiple rows

RowIndex is also used internally by join() to perform an inner join on shared keys.

java

// RowIndex directly
RowIndex index = new RowIndex(new Object[]{"a","b","c"});

int  i   = index.apply("b");             // 1
int  i   = index.getOrDefault("x");     // -1 (not found)
boolean has = index.containsKey("a");   // true
int  n   = index.size();                // 3

Constraints: no null values, no duplicate values — both throw IllegalArgumentException at construction time.
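
The label-to-ordinal lookup and its two constraints can be sketched with a plain HashMap. LabelIndexSketch is a hypothetical class that assumes only what is stated above: nulls and duplicates are rejected at construction, and a missing label maps to -1.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical label→ordinal index; not SMILE's RowIndex. */
final class LabelIndexSketch {
    private final Map<Object, Integer> ordinalOf = new HashMap<>();

    LabelIndexSketch(Object... labels) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == null) {
                throw new IllegalArgumentException("Null label at row " + i);
            }
            if (ordinalOf.putIfAbsent(labels[i], i) != null) {
                throw new IllegalArgumentException("Duplicate label: " + labels[i]);
            }
        }
    }

    /** Row ordinal for a label, or -1 when the label is absent. */
    int getOrDefault(Object label) {
        return ordinalOf.getOrDefault(label, -1);
    }

    boolean containsKey(Object label) { return ordinalOf.containsKey(label); }

    int size() { return ordinalOf.size(); }
}
```

Rejecting duplicates up front is what makes a single-label lookup unambiguous: each label can resolve to exactly one row.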

6.5 Selecting and dropping columns

All selection/drop operations return a new DataFrame.

java

// Select by ordinal indices
DataFrame sub = df.select(0, 2);

// Select by name
DataFrame sub = df.select("name", "salary");
DataFrame sub = df.apply("name", "salary");   // Scala alias

// Drop by ordinal index
DataFrame sub = df.drop(1);
DataFrame sub = df.drop(0, 2);

// Drop by name
DataFrame sub = df.drop("age");
DataFrame sub = df.drop("age", "birthday");

6.6 Adding and replacing columns

add() and set() mutate this in-place (they modify the internal List<ValueVector> and StructType).

java

// Add new columns — all must have the same size as the DataFrame
// and names must not clash with existing columns or each other
IntVector bonus = new IntVector("bonus", new int[]{5000, 8000, 12000});
df.add(bonus);

// Two columns at once — both names must be distinct from each other
// and from existing columns
df.add(c1, c2);

// Replace an existing column (or add if not present)
df.set("salary", updatedSalaryVector);
df.update("salary", updatedSalaryVector);  // Scala alias

6.7 Renaming columns

rename() mutates both the StructType and the backing ValueVector in-place:

java

df.rename("age", "years");
// df.names() is now ["name", "years", "salary"]

6.8 Filtering rows

All row-selection operations return a new DataFrame.

Boolean mask

java

// Build a boolean mask manually or from a column comparison
boolean[] mask = df.column("age").gt(28);
DataFrame sub  = df.get(mask);
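
What a boolean mask does can be shown on plain arrays. The sketch below is a standalone illustration with hypothetical names (MaskFilterSketch, gt, select); it mirrors the two steps above, building the mask and then keeping the flagged rows, without using SMILE.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical array-based illustration of boolean-mask row filtering. */
final class MaskFilterSketch {
    /** Element-wise "greater than": one flag per row. */
    static boolean[] gt(int[] column, int threshold) {
        boolean[] mask = new boolean[column.length];
        for (int i = 0; i < column.length; i++) {
            mask[i] = column[i] > threshold;
        }
        return mask;
    }

    /** Keeps the rows whose flag is true, preserving their original order. */
    static String[] select(String[] rows, boolean[] mask) {
        if (rows.length != mask.length) {
            throw new IllegalArgumentException("mask length must equal row count");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            if (mask[i]) kept.add(rows[i]);
        }
        return kept.toArray(new String[0]);
    }
}
```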

Index object

java

import smile.util.Index;

// From explicit row indices
DataFrame sub = df.get(Index.of(new int[]{0, 2, 3}));

// From a boolean mask
DataFrame sub = df.get(Index.of(new boolean[]{true, false, true, true}));

dropna

java

// Remove any row that has at least one null/NaN value
DataFrame clean = df.dropna();

6.9 Sorting

sort() returns a new DataFrame with all rows reordered. Null values always sort to the end regardless of direction.

java

// Ascending (default)
DataFrame sorted = df.sort("age");

// Descending
DataFrame sorted = df.sort("salary", false);

The sort is stable and works on any column type: integral, floating-point, String, and any Comparable.
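
The "nulls last in both directions" behaviour can be sketched as a row permutation built with JDK comparators. This is an illustration under stated assumptions, not SMILE's sort: NullsLastSortSketch is a hypothetical class, and it relies on Arrays.sort for object arrays being stable.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper that computes a stable, nulls-last row order. */
final class NullsLastSortSketch {
    /** Returns the row indices in sorted order; nulls go last either way. */
    static int[] order(Double[] values, boolean ascending) {
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;

        // Reverse only the value comparison; wrapping it with nullsLast
        // afterwards keeps nulls at the end for both directions.
        Comparator<Double> byValue = Comparator.naturalOrder();
        if (!ascending) byValue = byValue.reversed();
        Comparator<Double> nullsLast = Comparator.nullsLast(byValue);

        // Arrays.sort on objects is stable, so ties keep their original order.
        Arrays.sort(idx, Comparator.comparing((Integer i) -> values[i], nullsLast));
        return Arrays.stream(idx).mapToInt(Integer::intValue).toArray();
    }
}
```

Applying the returned permutation to every column reorders whole rows together, which is the effect sort() is described as having.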

6.10 Slicing and sampling

java

// Contiguous row range [from, to)
DataFrame slice  = df.slice(1, 4);   // rows 1, 2, 3
DataFrame first  = df.slice(0, 1);   // first row only
DataFrame empty  = df.slice(2, 2);   // zero rows (valid)

// Random sample without replacement
DataFrame sample = df.sample(50);    // up to 50 rows (capped at nrow())

slice() validates that 0 ≤ from ≤ to ≤ nrow().

6.11 Combining DataFrames

merge — horizontal (column union)

Combines two or more DataFrames side-by-side. All must have the same row count. Clashing column names get a _2, _3, … suffix.

java

DataFrame wide = left.merge(right);
DataFrame wide = a.merge(b, c, d);

concat — vertical (row union)

Stacks DataFrames on top of each other. All must have the exact same schema.

java

DataFrame tall = train.concat(test);
DataFrame tall = a.concat(b, c);

If all frames have a RowIndex, the indices are concatenated too.

join — inner join on RowIndex

Performs an inner join using matching row-label keys. If either frame has no RowIndex, falls back to merge().

java

DataFrame merged = left.join(right);
// Rows present in both left.index and right.index are kept;
// unmatched rows are dropped.
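
The matching step of an inner join on row labels can be sketched with one hash lookup per left row. InnerJoinSketch is a hypothetical helper; emitting the result in left-row order is an assumption of this sketch, since the order used by join() is not specified above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of inner-join row matching on labels. */
final class InnerJoinSketch {
    /**
     * Returns {leftRow, rightRow} pairs for labels present on both sides.
     * Labels are assumed unique within each side.
     */
    static List<int[]> matchRows(String[] leftLabels, String[] rightLabels) {
        Map<String, Integer> rightOrdinal = new HashMap<>();
        for (int j = 0; j < rightLabels.length; j++) {
            rightOrdinal.put(rightLabels[j], j);
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < leftLabels.length; i++) {
            Integer j = rightOrdinal.get(leftLabels[i]);
            if (j != null) {
                pairs.add(new int[]{i, j});   // label found on both sides
            }                                 // otherwise the left row is dropped
        }
        return pairs;
    }
}
```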

6.12 Missing values

java

// Drop rows with any null/NaN
DataFrame clean = df.dropna();

// Fill NaN/Inf in numeric columns in-place
df.fillna(0.0);    // replace with zero
df.fillna(-1.0);   // replace with sentinel

fillna operates on DoubleVector, FloatVector, NullablePrimitiveVector, and NumberVector columns; non-numeric columns are unaffected.
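
On a single double column, the in-place fill amounts to a loop like the one below. This is a sketch under the assumption, taken from the comment in the example above, that both NaN and infinite values are replaced; FillNaSketch is a hypothetical name.

```java
/** Hypothetical array-based illustration of an in-place fill. */
final class FillNaSketch {
    /** Replaces NaN and infinite entries in place; returns how many changed. */
    static int fill(double[] column, double replacement) {
        int replaced = 0;
        for (int i = 0; i < column.length; i++) {
            if (!Double.isFinite(column[i])) {   // true for NaN and ±Infinity
                column[i] = replacement;
                replaced++;
            }
        }
        return replaced;
    }
}
```

Since the array is modified in place, any other reference to the same column observes the filled values as well.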

6.13 Categorical encoding with factorize

factorize() converts String columns into IntVector columns annotated with a NominalScale. The integer codes are assigned in alphabetical order of the distinct string values.

java

// Convert all String columns
DataFrame f = df.factorize();

// Convert specific columns
DataFrame f = df.factorize("color", "country");

// Inspect the resulting scale
NominalScale scale = (NominalScale) f.schema().field("color").measure();
String label = scale.level(0);    // first level alphabetically
int    code  = scale.valueOf("Red").intValue();

This is the standard step to prepare string data for machine-learning algorithms that require integer inputs.
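
The alphabetical code assignment can be illustrated on a plain String array. The sketch below assumes non-null inputs and interprets "alphabetical" as natural String ordering (case-sensitive); FactorizeSketch is a hypothetical class, not the implementation of factorize().

```java
import java.util.Arrays;

/** Hypothetical illustration of alphabetical level coding for strings. */
final class FactorizeSketch {
    /** Distinct values in natural String order; position = integer code. */
    static String[] levels(String[] values) {
        return Arrays.stream(values).distinct().sorted().toArray(String[]::new);
    }

    /** Maps each value to the position of its level in the sorted array. */
    static int[] codes(String[] values, String[] levels) {
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int code = Arrays.binarySearch(levels, values[i]);
            if (code < 0) {
                throw new IllegalArgumentException("Unknown level: " + values[i]);
            }
            out[i] = code;
        }
        return out;
    }
}
```

With the values Red, Green, Blue, Green, the levels come out as Blue, Green, Red, so "Red" receives code 2 even though it appears first in the data.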

6.14 Exporting to numeric arrays and matrices

Both toArray() and toMatrix() convert the DataFrame to a dense numeric representation with optional bias (intercept) column and categorical encoding.

java

// Default: no bias, level encoding, all columns
double[][] X = df.toArray();

// Selective columns
double[][] X = df.toArray("age", "salary", "gender");

// With bias + dummy encoding for categoricals
double[][] X = df.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");

// DenseMatrix form (suitable for linear algebra)
DenseMatrix M = df.toMatrix();
DenseMatrix M = df.toMatrix(true, CategoricalEncoder.DUMMY, "rowNameColumn");

NaN is used for null/missing values in the output array.

See §7 for the CategoricalEncoder options.

6.15 Statistics with describe

java

DataFrame stats = df.describe();
System.out.println(stats);

Output columns: column, type, measure, count, mode, mean, std, min, 25%, 50%, 75%, max.

Categorical columns report mode, min, median, max over the integer codes.
Floating-point columns report mean, std, min, quartiles, max.
Integral columns report all statistics.
String / object columns report only count (non-null) and mode.

6.16 Printing and display

java

System.out.println(df);            // head(10)
System.out.println(df.head(5));
System.out.println(df.tail(5));
System.out.println(df.toString(from, to, truncate));

toString(from, to, truncate):

from must lie in [0, nrow]; a value greater than nrow throws.
to is clamped to nrow; if the clamped to is ≤ from, the method returns "Empty DataFrame\n".
Columns wider than maxColWidth are truncated with "..." when truncate=true.

6.17 Loading from JDBC

java

import java.sql.*;

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
     Statement  stmt = conn.createStatement();
     ResultSet  rs   = stmt.executeQuery("SELECT * FROM sales")) {
    DataFrame df = DataFrame.of(rs);
    System.out.println(df.describe());
}

JDBC types are mapped to SMILE types via StructType.of(ResultSetMetaData).

7. CategoricalEncoder

CategoricalEncoder controls how categorical (NominalScale / OrdinalScale) columns are converted when calling toArray(), toMatrix(), or Tuple.toArray().

Enum value	Meaning	Output columns per category
`LEVEL`	Integer level code (default)	1 — raw code value
`DUMMY`	Dummy / treatment encoding	k−1 binary columns (reference = first level)
`ONE_HOT`	Full one-hot encoding	k binary columns

java

import smile.data.CategoricalEncoder;

// Level encoding (default): "gender" becomes a single column holding its level code
double[][] X = df.toArray("age", "gender");

// Dummy encoding — k levels → k-1 binary columns
// e.g. gender {Male=0, Female=1} → one binary column "gender_Female"
double[][] X = df.toArray(false, CategoricalEncoder.DUMMY, "age", "gender");

// One-hot encoding — k levels → k binary columns
double[][] X = df.toArray(false, CategoricalEncoder.ONE_HOT, "age", "gender");
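
The three encodings in the table can be compared on a single categorical value. The sketch below is a standalone illustration that assumes level codes 0 … k−1 and the first level as the dummy reference, as stated in the table; EncodingSketch is a hypothetical class.

```java
/** Hypothetical per-value illustration of LEVEL, DUMMY and ONE_HOT encoding. */
final class EncodingSketch {
    /** LEVEL: one column that holds the raw code. */
    static double[] level(int code) {
        return new double[]{code};
    }

    /** DUMMY: k-1 columns; the first level (code 0) is the all-zero reference. */
    static double[] dummy(int code, int k) {
        check(code, k);
        double[] out = new double[k - 1];
        if (code > 0) out[code - 1] = 1.0;
        return out;
    }

    /** ONE_HOT: k columns with exactly one of them set to 1. */
    static double[] oneHot(int code, int k) {
        check(code, k);
        double[] out = new double[k];
        out[code] = 1.0;
        return out;
    }

    private static void check(int code, int k) {
        if (code < 0 || code >= k) {
            throw new IllegalArgumentException("code " + code + " outside [0, " + k + ")");
        }
    }
}
```

Dummy encoding drops one column per categorical variable, which avoids exact collinearity with an intercept column in linear models; one-hot keeps all k columns.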

8. End-to-end tutorial

This tutorial processes an employee dataset from a raw POJO list through to a numeric design matrix ready for a machine-learning algorithm.

Step 1 — Define the domain object and load data

java

import java.time.LocalDate;
import java.util.List;
import smile.data.CategoricalEncoder;
import smile.data.DataFrame;
import smile.data.Tuple;
import smile.data.measure.*;
import smile.data.type.*;
import smile.data.vector.*;

public enum Department { Engineering, Marketing, HR }

public class Employee {
    public String     getName()       { return name; }
    public int        getAge()        { return age; }
    public Department getDepartment() { return dept; }
    public LocalDate  getHireDate()   { return hireDate; }
    public Double     getSalary()     { return salary; }   // nullable
    // …constructor, fields…
}

List<Employee> employees = loadEmployees();  // from DB / file / …
DataFrame df = DataFrame.of(Employee.class, employees);

System.out.println(df.schema());
// age: int
// department: byte  nominal[Engineering, Marketing, HR]
// hireDate: Date
// name: String
// salary: double?

Step 2 — Inspect and describe

java

System.out.println(df);
System.out.println(df.describe());

// How many rows have a null salary?
long nullSalaries = df.column("salary").getNullCount();
System.out.println("Missing salaries: " + nullSalaries);

Step 3 — Add a derived column

java

import smile.data.vector.IntVector;

// Tenure in years = current year - hire year
int[] tenure = new int[df.nrow()];
for (int i = 0; i < df.nrow(); i++) {
    LocalDate d = (LocalDate) df.column("hireDate").get(i);
    tenure[i] = LocalDate.now().getYear() - d.getYear();
}
df.add(new IntVector("tenure", tenure));
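
Subtracting calendar years, as Step 3 does, counts a year as soon as the calendar rolls over. If completed years are wanted instead, java.time can compute them directly. The sketch below uses only JDK calls; TenureSketch is a hypothetical helper and "today" is passed in explicitly so that the result is reproducible.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper contrasting two definitions of tenure in years. */
final class TenureSketch {
    /** Calendar-year difference, the definition used in Step 3. */
    static int yearDifference(LocalDate hire, LocalDate today) {
        return today.getYear() - hire.getYear();
    }

    /** Whole years actually completed between the two dates. */
    static int completedYears(LocalDate hire, LocalDate today) {
        return (int) ChronoUnit.YEARS.between(hire, today);
    }
}
```

For a hire date of 2020-12-01 evaluated on 2024-01-15, the calendar-year difference is 4 while only 3 full years have elapsed.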

Step 4 — Handle missing values

java

// Option A: drop rows with any null
DataFrame clean = df.dropna();

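// Option B: fill salary nulls with the median of the observed salaries.
// Index into the finite values only; skipping nrow() / 2 would overshoot
// the middle whenever some salaries are null.
double[] observed = df.column("salary").doubleStream()
        .filter(Double::isFinite).sorted().toArray();
double medianSalary = observed.length == 0 ? 0.0 : observed[observed.length / 2];
df.fillna(medianSalary);   // fills every numeric column, not only salary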
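
A median over a column with missing entries should be taken over the observed values only. The standalone sketch below shows one way to do that on a plain double array, averaging the two middle values when the count is even; MedianSketch is a hypothetical helper, and returning NaN for an all-missing column is a choice made for this sketch.

```java
import java.util.Arrays;

/** Hypothetical helper: median of the finite entries of an array. */
final class MedianSketch {
    static double median(double[] values) {
        // Drop NaN/Infinity first so that missing cells do not shift the middle.
        double[] finite = Arrays.stream(values)
                .filter(Double::isFinite)
                .sorted()
                .toArray();
        int n = finite.length;
        if (n == 0) return Double.NaN;                    // nothing observed
        if (n % 2 == 1) return finite[n / 2];             // odd count: middle value
        return (finite[n / 2 - 1] + finite[n / 2]) / 2.0; // even count: mean of the middle pair
    }
}
```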

Step 5 — Sort and slice

java

// Sort by salary descending
DataFrame sorted = df.sort("salary", false);

// Top 10 earners
DataFrame top10 = sorted.slice(0, Math.min(10, sorted.nrow()));
System.out.println(top10.head(10));

Step 6 — Select features and encode categoricals

java

// Select the columns we want for the model
DataFrame features = df.select("age", "tenure", "salary", "department");

// For algorithms that need integer encoding:
// "department" already has NominalScale (auto-detected from enum)

// Export design matrix with dummy encoding
double[][] X = features.drop("salary")
        .toArray(false, CategoricalEncoder.DUMMY,
                 "age", "tenure", "department");

// Response vector
double[] y = features.column("salary").toDoubleArray();

Step 7 — Set a row index for traceability

java

DataFrame indexed = df.setIndex("name");

// Later: look up a specific employee by name
Tuple alice = indexed.loc("Alice");
System.out.println("Alice's salary: " + alice.getDouble("salary"));

// Join two DataFrames on employee name
DataFrame reviews = loadReviews().setIndex("employee");
DataFrame combined = indexed.join(reviews);

Step 8 — Describe the final feature set

java

DataFrame finalFeatures = df.select("age", "tenure", "department", "salary");
System.out.println(finalFeatures.describe());

// Verify no nulls remain
boolean anyNull = finalFeatures.stream().anyMatch(Tuple::anyNull);
System.out.println("Any nulls: " + anyNull);

9. API quick reference

DataFrame constructors and static factories

Method	Description
`new DataFrame(ValueVector...)`	Construct from column vectors
`new DataFrame(RowIndex, ValueVector...)`	With row index
`DataFrame.of(double[][], String...)`	From 2-D double array
`DataFrame.of(float[][], String...)`	From 2-D float array
`DataFrame.of(int[][], String...)`	From 2-D int array
`DataFrame.of(Class<T>, List<T>)`	From POJOs via reflection
`DataFrame.of(StructType, List<Tuple>)`	From tuple list (empty → zero-row frame)
`DataFrame.of(StructType, Stream<Tuple>)`	From tuple stream
`DataFrame.of(ResultSet)`	From JDBC ResultSet

DataFrame instance methods

Method	Returns	Mutates `this`?	Description
`nrow()` / `size()`	`int`	no	Number of rows
`ncol()`	`int`	no	Number of columns
`shape(dim)`	`int`	no	Size of dimension 0 (rows) or 1 (cols)
`isEmpty()`	`boolean`	no	True if zero rows
`schema()`	`StructType`	no	Column schema
`names()`	`String[]`	no	Column names
`dtypes()`	`DataType[]`	no	Column types
`measures()`	`Measure[]`	no	Column measures
`column(int)` / `column(String)`	`ValueVector`	no	Column vector
`get(int, int)`	`Object`	no	Cell (boxed)
`getInt/Double/…(int,int)`	primitive	no	Cell (typed)
`getString(int,int)`	`String`	no	Cell as string (uses measure)
`isNullAt(int,int)`	`boolean`	no	Null check
`set(int,int,Object)`	`void`	yes	Set cell value
`get(int)`	`Tuple`	no	Row as Tuple
`get(Index)`	`DataFrame`	no	Rows by Index
`get(boolean[])`	`DataFrame`	no	Rows by boolean mask
`slice(int,int)`	`DataFrame`	no	Rows `[from, to)`
`sample(int)`	`DataFrame`	no	Random sample without replacement
`sort(String)`	`DataFrame`	no	Ascending sort
`sort(String,boolean)`	`DataFrame`	no	Sort with direction
`select(int...)`	`DataFrame`	no	Columns by index
`select(String...)`	`DataFrame`	no	Columns by name
`drop(int...)`	`DataFrame`	no	Remove columns by index
`drop(String...)`	`DataFrame`	no	Remove columns by name
`add(ValueVector...)`	`DataFrame`	yes	Add new columns
`set(String,ValueVector)`	`DataFrame`	yes	Replace or add column
`rename(String,String)`	`DataFrame`	yes	Rename column in-place
`merge(DataFrame...)`	`DataFrame`	no	Horizontal column union
`concat(DataFrame...)`	`DataFrame`	no	Vertical row union
`join(DataFrame)`	`DataFrame`	no	Inner join on RowIndex
`setIndex(String)`	`DataFrame`	no	Column → RowIndex (removes column)
`setIndex(Object[])`	`DataFrame`	no	Attach RowIndex array
`loc(Object)`	`Tuple`	no	Row by label
`loc(Object...)`	`DataFrame`	no	Rows by labels
`dropna()`	`DataFrame`	no	Remove rows with any null
`fillna(double)`	`DataFrame`	yes	Fill NaN/null in numeric columns
`factorize(String...)`	`DataFrame`	no	Encode string columns as NominalScale
`toArray(String...)`	`double[][]`	no	Numeric array (LEVEL encoding)
`toArray(boolean,CategoricalEncoder,String...)`	`double[][]`	no	Numeric array with options
`toMatrix()`	`DenseMatrix`	no	Matrix (LEVEL, no bias)
`toMatrix(boolean,CategoricalEncoder,String)`	`DenseMatrix`	no	Matrix with options
`describe()`	`DataFrame`	no	Summary statistics
`head(int)`	`String`	no	Top-N rows formatted
`tail(int)`	`String`	no	Bottom-N rows formatted
`toString(int,int,boolean)`	`String`	no	Row range formatted
`stream()`	`Stream<Row>`	no	Row stream
`iterator()`	`Iterator<Row>`	no	Row iterator
`toList()`	`List<Row>`	no	All rows as list
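
Because `slice(int,int)` takes a half-open range `[from, to)`, a single cut point is enough to describe two adjacent, non-overlapping row ranges. The sketch below computes such a cut point for a train/test split. `SplitBounds` and `cut` are hypothetical names, not part of SMILE, and the `slice` calls appear only in comments.

```java
/** Hypothetical helper for train/test row ranges; not part of SMILE. */
final class SplitBounds {
    private SplitBounds() {}

    /**
     * Row index where the training range [0, cut) ends and the
     * test range [cut, nrow) begins.
     */
    static int cut(int nrow, double trainFraction) {
        if (nrow < 0 || trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException(
                "nrow must be >= 0 and trainFraction must lie in [0, 1]");
        }
        return (int) Math.floor(nrow * trainFraction);
    }

    public static void main(String[] args) {
        int nrow = 10;
        int cut = cut(nrow, 0.8);
        // With a SMILE DataFrame df of nrow rows, the two ranges would be:
        // DataFrame train = df.slice(0, cut);
        // DataFrame test  = df.slice(cut, nrow);
        System.out.println("train rows [0, " + cut + "), test rows [" + cut + ", " + nrow + ")");
    }
}
```

The same cut value serves as the exclusive end of the first slice and the inclusive start of the second, so no row is dropped or counted twice.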

StructType

Method	Description
`new StructType(StructField...)`	Construct from fields
`field(int)` / `field(String)`	Get field by ordinal or name
`indexOf(String)`	Ordinal of named field
`length()`	Number of fields
`names()` / `dtypes()` / `measures()`	Field property arrays
`add(StructField)`	Append a field (mutable)
`rename(String, String)`	Rename a field (mutable)

StructField

Constructor / method	Description
`new StructField(name, dtype)`	Without measure
`new StructField(name, dtype, measure)`	With measure
`withName(String)`	Return renamed copy
`isNumeric()`	True for non-nominal numeric fields
`toString(Object)`	Format a value using measure or dtype

ValueVector (selected)

Method	Description
`size()`	Element count
`isNullable()`	True if vector can contain nulls
`isNullAt(int)`	Null check at position
`getNullCount()`	Count of null positions
`anyNull()`	True if any null exists
`get(int)`	Boxed value (may be null)
`getInt/Double/…(int)`	Typed value
`getString(int)`	String form (uses measure)
`set(int, Object)`	Mutation
`get(Index)`	Sub-selection
`withName(String)`	Return renamed copy
`toIntArray()` / `toDoubleArray()` / `toStringArray()`	Bulk export
`intStream()` / `longStream()` / `doubleStream()` / `stream()`	Streaming
`eq(Object)` / `ne` / `lt` / `le` / `gt` / `ge`	Element-wise comparison masks
`isin(String...)` / `isin(int...)`	Membership mask
`isNull()`	Per-element null mask
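
The comparison and membership methods above produce masks that `DataFrame.get(boolean[])` can consume. Filtering on more than one condition then comes down to combining masks element by element. The sketch below assumes the masks are plain `boolean[]` arrays of equal length. `Masks` and its methods are hypothetical helpers, not part of SMILE, and the SMILE calls and the column value `"Engineering"` in the comments are illustrative only.

```java
/** Hypothetical mask utilities; not part of SMILE. */
final class Masks {
    private Masks() {}

    /** Element-wise logical AND of two masks of equal length. */
    static boolean[] and(boolean[] a, boolean[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "mask lengths differ: " + a.length + " vs " + b.length);
        }
        boolean[] out = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] && b[i];
        }
        return out;
    }

    /** Element-wise logical NOT. */
    static boolean[] not(boolean[] a) {
        boolean[] out = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = !a[i];
        }
        return out;
    }

    /** Number of true entries, i.e. how many rows the mask would keep. */
    static int count(boolean[] mask) {
        int n = 0;
        for (boolean keep : mask) {
            if (keep) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-ins for masks that SMILE would produce, for example:
        // boolean[] senior = df.column("age").gt(40);
        // boolean[] eng    = df.column("department").eq("Engineering");
        boolean[] senior = { true, false, true, true };
        boolean[] eng    = { true, true, false, true };

        boolean[] both = and(senior, eng);
        // DataFrame seniorEng = df.get(both);
        System.out.println("rows kept: " + count(both));
    }
}
```

Checking `count` before calling `get(boolean[])` is a cheap way to notice a filter that matches nothing.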
