SMILE — DataFrame User Guide & Tutorial

This document covers smile.data.DataFrame, smile.data.Tuple, and the supporting packages smile.data.type, smile.data.measure, and smile.data.vector.


Table of Contents

  1. Architecture overview
  2. Data types — smile.data.type
  3. Levels of measurement — smile.data.measure
  4. Column vectors — smile.data.vector
  5. Tuple — a single row
  6. DataFrame
  7. CategoricalEncoder
  8. End-to-end tutorial
  9. API quick reference

1. Architecture overview

smile.data.type
│
├── DataType (interface)          – describes storage format
│     ├── Primitive types         – BooleanType, ByteType, ShortType,
│     │                             IntType, LongType, FloatType, DoubleType,
│     │                             CharType  (each has nullable=false/true)
│     ├── TemporalType            – DateType, TimeType, DateTimeType
│     ├── DecimalType             – BigDecimal
│     ├── StringType              – String
│     ├── ObjectType              – arbitrary Object
│     ├── ArrayType               – primitive array columns
│     └── StructType              – schema (ordered list of StructFields)
│
├── StructField(name, dtype, measure?)  – one column descriptor
└── DataTypes                     – constants and factory methods

smile.data.measure
│
├── Measure (interface)           – adds semantic meaning to a numeric column
│     ├── CategoricalMeasure      – discrete labeled integer codes
│     │     ├── NominalScale      – unordered categories (gender, country…)
│     │     └── OrdinalScale      – ordered categories (rating, grade…)
│     └── NumericalMeasure        – continuous numeric annotations
│           ├── IntervalScale     – no true zero (temperature °C, year)
│           └── RatioScale        – true zero (price, weight, count)
└── Measure.Currency / Percent    – built-in ratio-scale singletons

smile.data.vector
│
├── ValueVector (interface)       – one typed column of a DataFrame
│     ├── PrimitiveVector         – non-nullable: IntVector, DoubleVector…
│     ├── NullablePrimitiveVector – nullable: NullableIntVector…
│     ├── StringVector            – String column (always nullable-compatible)
│     ├── NumberVector<N>         – boxed Number column (BigDecimal…)
│     └── ObjectVector<T>         – arbitrary Object column

smile.data
│
├── Tuple (interface)             – one row; immutable ordered field list
│     └── Row (record)            – Tuple backed by a DataFrame + row index
│
├── RowIndex                      – optional label → ordinal index for rows
│
└── DataFrame (record)            – 2-D heterogeneous tabular data
      schema    : StructType
      columns   : List<ValueVector>
      index     : RowIndex?        (optional row labels)

2. Data types — smile.data.type

Every column in a DataFrame has a DataType that describes how values are stored. Access the singletons via DataTypes.*:

java
import smile.data.type.DataTypes;
import smile.data.type.DataType;

2.1 Primitive and nullable types

| Singleton | Java type | Nullable variant |
|---|---|---|
| DataTypes.BooleanType | boolean | DataTypes.NullableBooleanType |
| DataTypes.ByteType | byte | DataTypes.NullableByteType |
| DataTypes.ShortType | short | DataTypes.NullableShortType |
| DataTypes.IntType | int | DataTypes.NullableIntType |
| DataTypes.LongType | long | DataTypes.NullableLongType |
| DataTypes.FloatType | float | DataTypes.NullableFloatType |
| DataTypes.DoubleType | double | DataTypes.NullableDoubleType |
| DataTypes.CharType | char | DataTypes.NullableCharType |

Non-nullable primitive types store values directly in a primitive array, with no boxing and no null overhead. Nullable variants add a BitSet null mask alongside the primitive array. Reading a null cell through a primitive accessor (getDouble, getInt, …) returns a sentinel value (Double.NaN for getDouble, Integer.MIN_VALUE for getInt), so always call isNullAt(i) first when the column is nullable.

java
boolean isNull = df.isNullAt(row, col);
if (!isNull) {
    double v = df.getDouble(row, col);
}
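
The storage layout described above can be pictured with plain JDK types. The following is a minimal sketch, not SMILE's implementation: the class `NullableDoubleColumn` and its method names are hypothetical, chosen only for illustration. It pairs a `double[]` with a `BitSet` and returns `Double.NaN` as the sentinel for null cells.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a nullable double column: values plus a null mask. */
final class NullableDoubleColumn {
    private final double[] values;
    private final BitSet nullMask; // bit i set => cell i is null

    NullableDoubleColumn(double[] values, BitSet nullMask) {
        this.values = values;
        this.nullMask = nullMask;
    }

    boolean isNullAt(int i) {
        return nullMask.get(i);
    }

    /** Primitive accessor: returns the NaN sentinel for null cells. */
    double getDouble(int i) {
        return nullMask.get(i) ? Double.NaN : values[i];
    }

    int nullCount() {
        return nullMask.cardinality();
    }

    public static void main(String[] args) {
        BitSet mask = new BitSet(3);
        mask.set(1); // index 1 is null
        NullableDoubleColumn salary =
            new NullableDoubleColumn(new double[]{80000.0, 0.0, 90000.0}, mask);

        for (int i = 0; i < 3; i++) {
            // Check the mask first: the sentinel alone cannot tell a stored
            // NaN apart from a missing value.
            String shown = salary.isNullAt(i) ? "null" : String.valueOf(salary.getDouble(i));
            System.out.println(i + ": " + shown);
        }
    }
}
```

The mask costs one bit per row, which is why the nullable variants stay close to the primitive arrays in memory use.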

2.2 Temporal types

| Singleton | Java type |
|---|---|
| DataTypes.DateType | java.time.LocalDate |
| DataTypes.TimeType | java.time.LocalTime |
| DataTypes.DateTimeType | java.time.LocalDateTime |

Temporal columns are always represented as object arrays (they are not primitives), but SMILE provides ISO-8601 parsing and formatting out of the box via dtype.valueOf(String) and dtype.toString(Object).
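
For orientation, the ISO-8601 text forms in question are the ones the JDK's `java.time` classes read and write by default. The sketch below uses only `java.time`; it does not call SMILE's `valueOf` or `toString`, and the class name `IsoTemporalDemo` is made up for the example.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

final class IsoTemporalDemo {
    /** Parses ISO-8601 date-time text and formats it back. */
    static String roundTripDateTime(String text) {
        return LocalDateTime.parse(text).toString();
    }

    public static void main(String[] args) {
        // parse() without an explicit formatter expects ISO-8601 text
        LocalDate date = LocalDate.parse("2024-03-15");
        LocalTime time = LocalTime.parse("09:30:00");

        // toString() emits ISO-8601 again, dropping zero seconds
        System.out.println(date);                                     // 2024-03-15
        System.out.println(time);                                     // 09:30
        System.out.println(roundTripDateTime("2024-03-15T09:30:00")); // 2024-03-15T09:30
    }
}
```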

2.3 Other types

| Singleton / factory | Java type | Notes |
|---|---|---|
| DataTypes.StringType | String | Always nullable |
| DataTypes.DecimalType | BigDecimal | Always nullable |
| DataTypes.ObjectType | Object | Catch-all |
| DataTypes.object(Class<?>) | specific class | Resolves to known type if possible |
| DataTypes.IntArrayType etc. | int[] etc. | Primitive array columns |

2.4 StructField

A StructField is an immutable triple (name, dtype, measure?) that describes a single column.

java
import smile.data.type.StructField;

// Plain int column
StructField age     = new StructField("age",    DataTypes.IntType);

// Nullable salary
StructField salary  = new StructField("salary", DataTypes.NullableDoubleType);

// Categorical column with a NominalScale
StructField gender  = new StructField("gender", DataTypes.ByteType,
                                      new NominalScale("Male", "Female"));

// Useful predicates
boolean numeric = age.isNumeric();      // true for non-nominal numeric
boolean nullable = salary.dtype().isNullable();

StructField is a Java record; two fields are equal when name, dtype, and measure all match.

2.5 StructType

StructType is the schema of a DataFrame or Tuple. It is an ordered list of StructFields with a fast name-to-index lookup map.

java
import smile.data.type.StructType;

StructType schema = new StructType(
    new StructField("name",   DataTypes.StringType),
    new StructField("age",    DataTypes.IntType),
    new StructField("salary", DataTypes.NullableDoubleType)
);

// Field access
StructField f = schema.field("age");    // by name
StructField f = schema.field(1);        // by ordinal

int j         = schema.indexOf("age");  // ordinal of "age"
String[]  ns  = schema.names();
DataType[] dt = schema.dtypes();
int       len = schema.length();

StructType is mutable for internal use by DataFrame (columns can be added/renamed in-place). Do not cache a StructType reference and assume it is immutable.


3. Levels of measurement — smile.data.measure

A Measure is an optional annotation on a StructField that adds semantic meaning to a numeric column. It controls how values are rendered and how SMILE's algorithms treat the column (e.g., dummy encoding for categorical variables).

java
import smile.data.measure.*;

3.1 NominalScale

Unordered categories. Each integer code maps to a string label.

java
// From string labels (codes are 0, 1, 2, …)
NominalScale gender = new NominalScale("Male", "Female");

// From an enum
NominalScale color = new NominalScale(Color.class); // enum Color { Red, Green, Blue }

// From explicit code→label pairs
NominalScale custom = new NominalScale(
    new int[]    {1, 3, 7},
    new String[] {"Low", "Mid", "High"}
);

int    code  = gender.valueOf("Female").intValue(); // 1
String label = gender.level(0);                    // "Male"
int    size  = gender.size();                      // 2

When a column has a NominalScale, getString(i) returns the label, not the raw integer, and df.factorize() will produce columns backed by a NominalScale.
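
The relationship between stored codes and displayed labels can be sketched with two maps. This is an illustration under stated assumptions, not SMILE's `NominalScale`: the class `LabelCodes` is hypothetical, and it only mirrors the `level`, `valueOf`, and `size` calls shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-way mapping between integer codes and labels. */
final class LabelCodes {
    private final Map<Integer, String> labelOf = new LinkedHashMap<>();
    private final Map<String, Integer> codeOf = new LinkedHashMap<>();

    LabelCodes(int[] codes, String[] labels) {
        if (codes.length != labels.length) {
            throw new IllegalArgumentException("codes and labels differ in length");
        }
        for (int i = 0; i < codes.length; i++) {
            labelOf.put(codes[i], labels[i]);
            codeOf.put(labels[i], codes[i]);
        }
    }

    /** Label for a code, or null when the code is unknown. */
    String level(int code) { return labelOf.get(code); }

    /** Code for a label; throws if the label is unknown. */
    int valueOf(String label) {
        Integer code = codeOf.get(label);
        if (code == null) throw new IllegalArgumentException("unknown label: " + label);
        return code;
    }

    int size() { return labelOf.size(); }

    public static void main(String[] args) {
        LabelCodes custom = new LabelCodes(new int[]{1, 3, 7}, new String[]{"Low", "Mid", "High"});
        int[] column = {3, 1, 7, 3}; // the column stores only the codes
        for (int code : column) {
            System.out.println(code + " -> " + custom.level(code));
        }
    }
}
```

Because the column holds only integers, the labels cost nothing per row; they are looked up when a value is rendered.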

3.2 OrdinalScale

Ordered categories. Levels carry an implied rank; values are kept sorted.

java
OrdinalScale rating = new OrdinalScale("Poor", "Fair", "Good", "Excellent");
// ordinals: 0=Poor, 1=Fair, 2=Good, 3=Excellent

The ordinal position is meaningful for comparison and sorting but arithmetic (mean, variance) is not valid on pure ordinal data.

3.3 IntervalScale

Numeric, no true zero. Arithmetic differences are meaningful, ratios are not. Primarily used as an annotation for documentation/display.

java
import java.text.NumberFormat;

IntervalScale celsius = new IntervalScale(NumberFormat.getInstance());

3.4 RatioScale

Numeric with a true zero. All arithmetic is valid.

java
// Built-in singletons
Measure price  = Measure.Currency;   // formats as currency
Measure pct    = Measure.Percent;    // formats as percentage

// Custom
RatioScale weight = new RatioScale(NumberFormat.getInstance());

Compatibility rules (enforced by StructField's compact constructor):

  • NumericalMeasure is invalid for Boolean, Char, String columns.
  • CategoricalMeasure is only valid for integral (int, long, byte, short) columns.

4. Column vectors — smile.data.vector

A ValueVector is a typed, indexed, one-dimensional array that forms a single column of a DataFrame.

java
import smile.data.vector.*;

4.1 Primitive vectors

Each primitive type has a corresponding non-nullable vector class:

| Class | Backing store | Example |
|---|---|---|
| IntVector | int[] | new IntVector("age", new int[]{25, 30, 35}) |
| LongVector | long[] | new LongVector("ts", new long[]{...}) |
| FloatVector | float[] | new FloatVector("score", new float[]{...}) |
| DoubleVector | double[] | new DoubleVector("salary", new double[]{...}) |
| BooleanVector | boolean[] | new BooleanVector("flag", new boolean[]{...}) |
| ByteVector | byte[] | new ByteVector("cat", new byte[]{...}) |
| ShortVector | short[] | new ShortVector("rank", new short[]{...}) |
| CharVector | char[] | new CharVector("grade", new char[]{...}) |

Each constructor also accepts a StructField in place of the name as its first argument, which is how a measure is attached:

java
StructField field = new StructField("gender", DataTypes.ByteType,
                                    new NominalScale("Male", "Female"));
ByteVector gender = new ByteVector(field, new byte[]{0, 1, 0, 1});

4.2 Nullable primitive vectors

Nullable variants store an additional BitSet null-mask:

java
import java.util.BitSet;

double[] values   = {80000.0, Double.NaN, 90000.0};
BitSet   nullMask = new BitSet(3);
nullMask.set(1);  // index 1 is null

NullableDoubleVector salary = new NullableDoubleVector(
    new StructField("salary", DataTypes.NullableDoubleType),
    values, nullMask
);

salary.isNullable();     // true
salary.isNullAt(1);      // true
salary.getDouble(0);     // 80000.0
salary.getNullCount();   // 1

Corresponding nullable classes: NullableIntVector, NullableLongVector, NullableFloatVector, NullableDoubleVector, NullableBooleanVector, NullableByteVector, NullableShortVector, NullableCharVector.

4.3 Object vectors

| Class | Content |
|---|---|
| StringVector | String[], always nullable |
| NumberVector<N> | Number[]: BigDecimal, boxed primitives |
| ObjectVector<T> | Object[]: LocalDate, LocalDateTime, any type |

java
StringVector  names = new StringVector("name",  new String[]{"Alice","Bob"});
ObjectVector<LocalDate> dates = new ObjectVector<>(
    new StructField("birthday", DataTypes.DateType),
    new LocalDate[]{ LocalDate.of(1990,1,1), LocalDate.of(1985,6,15) }
);

4.4 Common ValueVector operations

java
ValueVector v = df.column("age");

// Size and nullability
int n  = v.size();
boolean nullable = v.isNullable();
boolean hasNull  = v.anyNull();
int nullCount    = v.getNullCount();
boolean rowNull  = v.isNullAt(2);

// Typed reads
int    i = v.getInt(0);
double d = v.getDouble(0);
String s = v.getString(0);       // uses measure for categorical columns
Object o = v.get(0);             // boxed / object value, may be null

// Bulk export
int[]    ia = v.toIntArray();
double[] da = v.toDoubleArray();
String[] sa = v.toStringArray();

// Streaming
v.intStream().sum();
v.doubleStream().average();
v.stream().filter(Objects::nonNull).count();

// Boolean filter masks
boolean[] mask = v.eq(30);       // element-wise ==
boolean[] mask = v.gt(25);       // element-wise >
boolean[] mask = v.isin(25, 30); // element-wise membership

// Sub-selection
ValueVector sub = v.get(Index.of(new int[]{0, 2, 3}));

// Rename (returns new vector)
ValueVector renamed = v.withName("years");

// In-place mutation
v.set(0, 99);

5. Tuple — a single row

Tuple is an interface representing one row of a DataFrame. It is read-only from the user's perspective. Row, the concrete implementation that a DataFrame returns, is a view over the frame rather than a copy, so mutating a Row directly modifies the backing DataFrame.

Creating a Tuple

java
import smile.data.Tuple;
import smile.data.type.StructType;

// Construct standalone from schema + values
StructType schema = new StructType(
    new StructField("x", DataTypes.IntType),
    new StructField("y", DataTypes.DoubleType)
);
Tuple t = Tuple.of(schema, new Object[]{42, 3.14});
Tuple t = Tuple.of(schema, new int[]{42}, new double[]{3.14});

Reading fields

All accessors have both ordinal (int i) and name (String field) forms:

java
Tuple row = df.get(0);

// By ordinal
int    age    = row.getInt(0);
double salary = row.getDouble(2);
String name   = row.getString(1);

// By name
int    age    = row.getInt("age");
double salary = row.getDouble("salary");

// Generic (boxed, may return null)
Object val = row.get(0);
Object val = row.get("name");

// Null check — always check before primitive accessors on nullable columns
boolean isNull = row.isNullAt(2);
boolean isNull = row.isNullAt("salary");
boolean anyNull = row.anyNull();

Available typed accessors: getBoolean, getChar, getByte, getShort, getInt, getLong, getFloat, getDouble, getString.

Exporting to a double array

java
// All fields as doubles (NaN for nulls, level encoding for categoricals)
double[] arr = row.toArray();

// Selective fields
double[] arr = row.toArray("age", "salary");

// With intercept (bias=1) and dummy encoding
double[] arr = row.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");

Schema access

java
StructType schema  = row.schema();
int        length  = row.length();
int        j       = row.indexOf("salary");

6. DataFrame

DataFrame is a Java record. Its component references cannot be reassigned, but the objects they refer to can change. It is backed by:

  • StructType schema — column descriptors
  • List<ValueVector> columns — the actual data, one vector per column
  • RowIndex index — optional row labels (may be null)

The List<ValueVector> and StructType are mutable for the in-place operations add(), set(), rename(), and fillna(). All other operations that structurally reshape the data (e.g., select, drop, sort, concat) return a new DataFrame.

java
import smile.data.DataFrame;

6.1 Creating a DataFrame

From column vectors (most direct)

java
DataFrame df = new DataFrame(
    new StringVector ("name",   new String[] {"Alice", "Bob", "Charlie"}),
    new IntVector    ("age",    new int[]    {25, 30, 35}),
    new DoubleVector ("salary", new double[] {60000., 80000., 90000.})
);

From a 2-D double / float / int array

java
double[][] data = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};

// Auto-named columns: V1, V2
DataFrame df = DataFrame.of(data);

// Custom names
DataFrame df = DataFrame.of(data, "x", "y");
DataFrame df = DataFrame.of(new int[][]{{1,2},{3,4}}, "a", "b");
DataFrame df = DataFrame.of(new float[][]{{1f},{2f}}, "v");

From a Java Bean / POJO via reflection

SMILE uses Java Beans introspection (getXxx() methods) to discover columns automatically. Field order in the schema follows the alphabetical order of getter names.

java
public class Person {
    public String getName()     { return name; }
    public int    getAge()      { return age;  }
    public Double getSalary()   { return salary; }   // nullable wrapper → NullableDoubleType
    public Gender getGender()   { return gender; }   // enum → ByteVector with NominalScale
    // …
}

List<Person> persons = List.of(
    new Person("Alice", 25, 60000., Gender.Female),
    new Person("Bob",   30, null,   Gender.Male)
);
DataFrame df = DataFrame.of(Person.class, persons);

Rules for automatic type inference:

  • int / long / float / double / boolean / char / byte / short → non-nullable primitive vector
  • Boxed Integer, Double, etc. and Number subclasses → nullable vector or NumberVector
  • String → StringVector
  • Enum → ByteVector (≤127 levels), ShortVector (≤32767), or IntVector; always annotated with a NominalScale
  • LocalDate, LocalDateTime, LocalTime → ObjectVector with temporal type
  • Anything else → ObjectVector
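
Getter discovery of this kind can be tried with the JDK's own `java.beans.Introspector`. The sketch below is an illustration only and makes no claim about how SMILE performs the discovery internally; the `BeanColumns` class and its nested `Person` bean are hypothetical. It lists each readable property with its Java type, sorted by name.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BeanColumns {
    /** Hypothetical bean used only for this illustration. */
    public static class Person {
        public String getName()   { return "Alice"; }
        public int    getAge()    { return 25; }
        public Double getSalary() { return null; } // wrapper type: may be null
    }

    /** Lists "name: type" for each readable bean property, sorted by name. */
    static List<String> describe(Class<?> beanClass) {
        List<String> columns = new ArrayList<>();
        try {
            for (PropertyDescriptor p : Introspector.getBeanInfo(beanClass).getPropertyDescriptors()) {
                // Every object exposes getClass(); skip that pseudo-property
                if (p.getReadMethod() == null || p.getName().equals("class")) continue;
                columns.add(p.getName() + ": " + p.getPropertyType().getSimpleName());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("cannot introspect " + beanClass, e);
        }
        Collections.sort(columns);
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(describe(Person.class));
        // [age: int, name: String, salary: Double]
    }
}
```

The primitive `int` versus the wrapper `Double` in the output is the distinction the inference rules above rely on to choose between a non-nullable and a nullable column.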

From a schema + list/stream of Tuples

java
StructType schema = new StructType(
    new StructField("x", DataTypes.IntType),
    new StructField("y", DataTypes.DoubleType)
);

// Non-empty list
List<Tuple> rows = List.of(Tuple.of(schema, 1, 2.0), Tuple.of(schema, 3, 4.0));
DataFrame df = DataFrame.of(schema, rows);

// Stream variant
DataFrame df = DataFrame.of(schema, rows.stream());

// Empty list — returns a zero-row DataFrame with the correct schema
DataFrame empty = DataFrame.of(schema, List.of());

From a JDBC ResultSet

java
try (Connection conn = DriverManager.getConnection(url);
     Statement  stmt = conn.createStatement();
     ResultSet  rs   = stmt.executeQuery("SELECT * FROM employees")) {
    DataFrame df = DataFrame.of(rs);
}

JDBC type mapping is handled automatically via StructType.of(ResultSet).

6.2 Inspecting a DataFrame

java
// Dimensions
int nrow = df.nrow();          // number of rows
int ncol = df.ncol();          // number of columns
int size = df.size();          // alias for nrow()
int[] shape = {df.shape(0), df.shape(1)};  // {nrow, ncol}
boolean empty = df.isEmpty();

// Schema
StructType schema   = df.schema();
String[]   names    = df.names();
DataType[] dtypes   = df.dtypes();
Measure[]  measures = df.measures();

// Print
System.out.println(df);          // head(10)
System.out.println(df.head(5));  // first 5 rows
System.out.println(df.tail(5));  // last 5 rows
System.out.println(df.toString(2, 7, true));  // rows [2,7)

// Summary statistics
System.out.println(df.describe());

describe() returns a new DataFrame with one row per column and columns: column, type, measure, count (non-null), mode, mean, std, min, 25%, 50%, 75%, max.

6.3 Accessing cells, rows, and columns

Cells

java
Object  val = df.get(row, col);           // boxed value, may be null
int     i   = df.getInt(row, col);
long    l   = df.getLong(row, col);
float   f   = df.getFloat(row, col);
double  d   = df.getDouble(row, col);
String  s   = df.getString(row, col);     // uses measure for categoricals
String  sc  = df.getScale(row, col);      // level name for NominalScale / OrdinalScale
boolean nil = df.isNullAt(row, col);

Rows (Tuples)

java
Tuple row = df.get(0);                    // first row
Tuple row = df.apply(0);                  // Scala alias

// Iterate all rows
for (Row row : df) { /* … */ }
df.stream().forEach(row -> /* … */);
List<Row> list = df.toList();

Columns

java
ValueVector col = df.column(0);           // by ordinal
ValueVector col = df.column("age");       // by name
ValueVector col = df.apply("age");        // Scala alias

Mutating a cell

java
df.set(row, col, value);
df.update(row, col, value);               // Scala alias

6.4 Row indexing with RowIndex

A RowIndex maps arbitrary label objects to row ordinals, allowing label-based row selection similar to Pandas .loc.

java
// Attach an index from an existing column (column is removed from data)
DataFrame indexed = df.setIndex("name");

// Attach an index from an external array
DataFrame indexed = df.setIndex(new Object[]{"r0","r1","r2","r3"});

// Look up by label
Tuple row  = indexed.loc("Alice");          // single row
DataFrame sub = indexed.loc("Alice","Bob"); // multiple rows

RowIndex is also used internally by join() to perform an inner join on shared keys.

java
// RowIndex directly
RowIndex index = new RowIndex(new Object[]{"a","b","c"});

int  i   = index.apply("b");             // 1
int  i   = index.getOrDefault("x");     // -1 (not found)
boolean has = index.containsKey("a");   // true
int  n   = index.size();                // 3

Constraints: no null values, no duplicate values — both throw IllegalArgumentException at construction time.
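
The two constraints follow naturally from a label-to-ordinal map. The following is a minimal sketch of that idea using a `HashMap`; `LabelIndex` is a hypothetical class, not SMILE's `RowIndex`, and its `ordinal` method is an invented name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical label-to-ordinal index; not SMILE's RowIndex. */
final class LabelIndex {
    private final Map<Object, Integer> ordinalOf = new HashMap<>();

    LabelIndex(Object[] labels) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == null) {
                throw new IllegalArgumentException("null label at row " + i);
            }
            // put() returns the previous mapping, so non-null means a duplicate
            if (ordinalOf.put(labels[i], i) != null) {
                throw new IllegalArgumentException("duplicate label: " + labels[i]);
            }
        }
    }

    /** Returns the row ordinal, or -1 when the label is unknown. */
    int ordinal(Object label) {
        return ordinalOf.getOrDefault(label, -1);
    }

    public static void main(String[] args) {
        LabelIndex index = new LabelIndex(new Object[]{"a", "b", "c"});
        System.out.println(index.ordinal("b")); // 1
        System.out.println(index.ordinal("x")); // -1
    }
}
```

A duplicate label would make a lookup ambiguous, which is why rejecting it at construction time is the simpler contract.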

6.5 Selecting and dropping columns

All selection/drop operations return a new DataFrame.

java
// Select by ordinal indices
DataFrame sub = df.select(0, 2);

// Select by name
DataFrame sub = df.select("name", "salary");
DataFrame sub = df.apply("name", "salary");   // Scala alias

// Drop by ordinal index
DataFrame sub = df.drop(1);
DataFrame sub = df.drop(0, 2);

// Drop by name
DataFrame sub = df.drop("age");
DataFrame sub = df.drop("age", "birthday");

6.6 Adding and replacing columns

add() and set() mutate this in-place (they modify the internal List<ValueVector> and StructType).

java
// Add new columns — all must have the same size as the DataFrame
// and names must not clash with existing columns or each other
IntVector bonus = new IntVector("bonus", new int[]{5000, 8000, 12000});
df.add(bonus);

// Two columns at once — both names must be distinct from each other
// and from existing columns
df.add(c1, c2);

// Replace an existing column (or add if not present)
df.set("salary", updatedSalaryVector);
df.update("salary", updatedSalaryVector);  // Scala alias

6.7 Renaming columns

rename() mutates both the StructType and the backing ValueVector in-place:

java
df.rename("age", "years");
// df.names() is now ["name", "years", "salary"]

6.8 Filtering rows

All row-selection operations return a new DataFrame.

Boolean mask

java
// Build a boolean mask manually or from a column comparison
boolean[] mask = df.column("age").gt(28);
DataFrame sub  = df.get(mask);
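
What a mask-based filter does can be shown on plain arrays. This is a JDK-only sketch with hypothetical helper names (`MaskFilter`, `gt`, `select`); it mirrors the two steps above but is not SMILE code.

```java
import java.util.ArrayList;
import java.util.List;

final class MaskFilter {
    /** Element-wise "greater than": mask[i] is true when values[i] > threshold. */
    static boolean[] gt(int[] values, int threshold) {
        boolean[] mask = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            mask[i] = values[i] > threshold;
        }
        return mask;
    }

    /** Keeps the rows whose mask entry is true, in their original order. */
    static List<String> select(String[] rows, boolean[] mask) {
        if (rows.length != mask.length) {
            throw new IllegalArgumentException("mask length must equal row count");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            if (mask[i]) kept.add(rows[i]);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] names = {"Alice", "Bob", "Charlie"};
        int[] ages = {25, 30, 35};
        System.out.println(select(names, gt(ages, 28))); // [Bob, Charlie]
    }
}
```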

Index object

java
import smile.util.Index;

// From explicit row indices
DataFrame sub = df.get(Index.of(new int[]{0, 2, 3}));

// From a boolean mask
DataFrame sub = df.get(Index.of(new boolean[]{true, false, true, true}));

dropna

java
// Remove any row that has at least one null/NaN value
DataFrame clean = df.dropna();

6.9 Sorting

sort() returns a new DataFrame with all rows reordered. Null values always sort to the end regardless of direction.

java
// Ascending (default)
DataFrame sorted = df.sort("age");

// Descending
DataFrame sorted = df.sort("salary", false);

The sort is stable and works on any column type: integral, floating-point, String, and any Comparable.
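
The combination of stability, direction, and nulls-last ordering can be reproduced with JDK comparators. The sketch below sorts row ordinals by a key column; it illustrates the stated behaviour and is not taken from SMILE. `NullsLastSort` and `order` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

final class NullsLastSort {
    /** Returns row ordinals ordered by key; null keys go last in either direction. */
    static Integer[] order(Double[] keys, boolean ascending) {
        Integer[] rows = new Integer[keys.length];
        for (int i = 0; i < rows.length; i++) rows[i] = i;

        Comparator<Double> direction =
            ascending ? Comparator.<Double>naturalOrder() : Comparator.<Double>reverseOrder();
        // nullsLast wraps the directional comparator, so reversing the
        // direction does not move nulls to the front
        Comparator<Double> byKey = Comparator.nullsLast(direction);

        // Arrays.sort on object arrays is stable: equal keys keep row order
        Arrays.sort(rows, (a, b) -> byKey.compare(keys[a], keys[b]));
        return rows;
    }

    public static void main(String[] args) {
        Double[] salary = {3.0, null, 1.0, 3.0};
        System.out.println(Arrays.toString(order(salary, true)));  // [2, 0, 3, 1]
        System.out.println(Arrays.toString(order(salary, false))); // [0, 3, 2, 1]
    }
}
```

Rows 0 and 3 share the key 3.0 and keep their relative order in both directions, which is what stability means here.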

6.10 Slicing and sampling

java
// Contiguous row range [from, to)
DataFrame slice  = df.slice(1, 4);   // rows 1, 2, 3
DataFrame first  = df.slice(0, 1);   // first row only
DataFrame empty  = df.slice(2, 2);   // zero rows (valid)

// Random sample without replacement
DataFrame sample = df.sample(50);    // up to 50 rows (capped at nrow())

slice() validates that 0 ≤ from ≤ to ≤ nrow().

6.11 Combining DataFrames

merge — horizontal (column union)

Combines two or more DataFrames side-by-side. All must have the same row count. Clashing column names get a _2, _3, … suffix.

java
DataFrame wide = left.merge(right);
DataFrame wide = a.merge(b, c, d);

concat — vertical (row union)

Stacks DataFrames on top of each other. All must have the exact same schema.

java
DataFrame tall = train.concat(test);
DataFrame tall = a.concat(b, c);

If all frames have a RowIndex, the indices are concatenated too.

join — inner join on RowIndex

Performs an inner join using matching row-label keys. If either frame has no RowIndex, it falls back to merge().

java
DataFrame merged = left.join(right);
// Rows present in both left.index and right.index are kept;
// unmatched rows are dropped.
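
The inner-join semantics can be sketched with two label-keyed maps. This is a standalone illustration, not SMILE's `join()`: `InnerJoinSketch` is hypothetical, each row is modelled as a `List<Object>`, and keeping the left frame's row order is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InnerJoinSketch {
    /** Joins two label-keyed tables, keeping only labels present in both. */
    static Map<String, List<Object>> innerJoin(Map<String, List<Object>> left,
                                               Map<String, List<Object>> right) {
        Map<String, List<Object>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> row : left.entrySet()) {
            List<Object> match = right.get(row.getKey());
            if (match == null) continue;            // no partner: the row is dropped
            List<Object> combined = new ArrayList<>(row.getValue());
            combined.addAll(match);                 // left columns, then right columns
            joined.put(row.getKey(), combined);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> salaries = new LinkedHashMap<>();
        salaries.put("Alice", List.of(60000.0));
        salaries.put("Bob", List.of(80000.0));

        Map<String, List<Object>> reviews = new LinkedHashMap<>();
        reviews.put("Bob", List.of("Good"));
        reviews.put("Dana", List.of("Fair"));

        System.out.println(innerJoin(salaries, reviews)); // {Bob=[80000.0, Good]}
    }
}
```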

6.12 Missing values

java
// Drop rows with any null/NaN
DataFrame clean = df.dropna();

// Fill NaN/Inf in numeric columns in-place
df.fillna(0.0);    // replace with zero
df.fillna(-1.0);   // replace with sentinel

fillna operates on DoubleVector, FloatVector, NullablePrimitiveVector, and NumberVector columns; non-numeric columns are unaffected.

6.13 Categorical encoding with factorize

factorize() converts String columns into IntVector columns annotated with a NominalScale. The integer codes are assigned in alphabetical order of the distinct string values.

java
// Convert all String columns
DataFrame f = df.factorize();

// Convert specific columns
DataFrame f = df.factorize("color", "country");

// Inspect the resulting scale
NominalScale scale = (NominalScale) f.schema().field("color").measure();
String label = scale.level(0);    // first level alphabetically
int    code  = scale.valueOf("Red").intValue();

This is the standard step to prepare string data for machine-learning algorithms that require integer inputs.
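
The code assignment described above can be sketched in a few lines of plain Java. This is an illustration, not SMILE's `factorize()`: `FactorizeSketch` is hypothetical, "alphabetical" is taken to mean natural `String` ordering (so uppercase sorts before lowercase), and the input is assumed to contain no nulls.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class FactorizeSketch {
    /** Distinct values in natural String order; the array index is the code. */
    static String[] levels(String[] column) {
        return new TreeSet<>(Arrays.asList(column)).toArray(new String[0]);
    }

    /** Replaces each string by the position of its level. */
    static int[] encode(String[] column, String[] levels) {
        int[] codes = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            // levels is sorted, so binary search finds the code
            codes[i] = Arrays.binarySearch(levels, column[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] color = {"Red", "Green", "Blue", "Green"};
        String[] levels = levels(color);
        System.out.println(Arrays.toString(levels));                // [Blue, Green, Red]
        System.out.println(Arrays.toString(encode(color, levels))); // [2, 1, 0, 1]
    }
}
```

Sorting the levels makes the codes reproducible across runs, independent of the order in which values first appear in the data.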

6.14 Exporting to numeric arrays and matrices

Both toArray() and toMatrix() convert the DataFrame to a dense numeric representation with optional bias (intercept) column and categorical encoding.

java
// Default: no bias, level encoding, all columns
double[][] X = df.toArray();

// Selective columns
double[][] X = df.toArray("age", "salary", "gender");

// With bias + dummy encoding for categoricals
double[][] X = df.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");

// DenseMatrix form (suitable for linear algebra)
DenseMatrix M = df.toMatrix();
DenseMatrix M = df.toMatrix(true, CategoricalEncoder.DUMMY, "rowNameColumn");

NaN is used for null/missing values in the output array.

See §7 for the CategoricalEncoder options.

6.15 Statistics with describe

java
DataFrame stats = df.describe();
System.out.println(stats);

Output columns: column, type, measure, count, mode, mean, std, min, 25%, 50%, 75%, max.

  • Categorical columns report mode, min, median, max over the integer codes.
  • Floating-point columns report mean, std, min, quartiles, max.
  • Integral columns report all statistics.
  • String / object columns report only count (non-null) and mode.

6.16 Printing and display

java
System.out.println(df);            // head(10)
System.out.println(df.head(5));
System.out.println(df.tail(5));
System.out.println(df.toString(from, to, truncate));

toString(from, to, truncate):

  • from must be in [0, nrow]; from > nrow throws.
  • If to ≤ from (after to is clamped to nrow), the result is "Empty DataFrame\n".
  • Columns wider than maxColWidth are truncated with "..." when truncate=true.

6.17 Loading from JDBC

java
import java.sql.*;

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
     Statement  stmt = conn.createStatement();
     ResultSet  rs   = stmt.executeQuery("SELECT * FROM sales")) {
    DataFrame df = DataFrame.of(rs);
    System.out.println(df.describe());
}

JDBC types are mapped to SMILE types via StructType.of(ResultSetMetaData).


7. CategoricalEncoder

CategoricalEncoder controls how categorical (NominalScale / OrdinalScale) columns are converted when calling toArray(), toMatrix(), or Tuple.toArray().

| Enum value | Meaning | Output columns per category |
|---|---|---|
| LEVEL | Integer level code (default) | 1 (raw code value) |
| DUMMY | Dummy / treatment encoding | k−1 binary columns (reference = first level) |
| ONE_HOT | Full one-hot encoding | k binary columns |

java
import smile.data.CategoricalEncoder;

// Level encoding (default) — "gender" becomes a single int column
double[][] X = df.toArray("age", "gender");

// Dummy encoding — k levels → k-1 binary columns
// e.g. gender {Male=0, Female=1} → one binary column "gender_Female"
double[][] X = df.toArray(false, CategoricalEncoder.DUMMY, "age", "gender");

// One-hot encoding — k levels → k binary columns
double[][] X = df.toArray(false, CategoricalEncoder.ONE_HOT, "age", "gender");
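
The difference between the two binary encodings is easiest to see on a single code. The sketch below is a JDK-only illustration with hypothetical names (`EncodingSketch`, `oneHot`, `dummy`); it assumes codes run from 0 to k−1 and takes level 0 as the reference level, as stated in the table above.

```java
import java.util.Arrays;

final class EncodingSketch {
    /** One-hot: k levels -> k indicator columns. */
    static double[] oneHot(int code, int k) {
        double[] row = new double[k];
        row[code] = 1.0;
        return row;
    }

    /** Dummy: k levels -> k-1 indicator columns; level 0 is the all-zero reference. */
    static double[] dummy(int code, int k) {
        double[] row = new double[k - 1];
        if (code > 0) row[code - 1] = 1.0;
        return row;
    }

    public static void main(String[] args) {
        int k = 3; // e.g. three departments with codes 0, 1, 2
        for (int code = 0; code < k; code++) {
            System.out.println(code + "  one-hot " + Arrays.toString(oneHot(code, k))
                    + "  dummy " + Arrays.toString(dummy(code, k)));
        }
    }
}
```

Dummy encoding drops one column because it is implied by the others; with an intercept term, keeping all k columns would make the design matrix rank-deficient.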

8. End-to-end tutorial

This tutorial processes an employee dataset from a raw POJO list through to a numeric design matrix ready for a machine-learning algorithm.

Step 1 — Define the domain object and load data

java
import java.time.LocalDate;
import smile.data.DataFrame;
import smile.data.measure.*;
import smile.data.type.*;
import smile.data.vector.*;

public enum Department { Engineering, Marketing, HR }

public class Employee {
    public String     getName()       { return name; }
    public int        getAge()        { return age; }
    public Department getDepartment() { return dept; }
    public LocalDate  getHireDate()   { return hireDate; }
    public Double     getSalary()     { return salary; }   // nullable
    // …constructor, fields…
}

List<Employee> employees = loadEmployees();  // from DB / file / …
DataFrame df = DataFrame.of(Employee.class, employees);

System.out.println(df.schema());
// age: int
// department: byte  nominal[Engineering, Marketing, HR]
// hireDate: Date
// name: String
// salary: double?

Step 2 — Inspect and describe

java
System.out.println(df);
System.out.println(df.describe());

// How many rows have a null salary?
long nullSalaries = df.column("salary").getNullCount();
System.out.println("Missing salaries: " + nullSalaries);

Step 3 — Add a derived column

java
import smile.data.vector.IntVector;

// Tenure in years = current year - hire year
int[] tenure = new int[df.nrow()];
for (int i = 0; i < df.nrow(); i++) {
    LocalDate d = (LocalDate) df.column("hireDate").get(i);
    tenure[i] = LocalDate.now().getYear() - d.getYear();
}
df.add(new IntVector("tenure", tenure));
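
Note that subtracting calendar years counts a partial year as a full one. If completed years are wanted instead, the JDK's `ChronoUnit.YEARS.between` is one option. The sketch below compares the two on made-up dates; `TenureSketch` and its methods are hypothetical names and use only `java.time`.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class TenureSketch {
    /** Difference of calendar years, as in the loop above. */
    static int calendarYears(LocalDate hired, LocalDate today) {
        return today.getYear() - hired.getYear();
    }

    /** Whole years completed between the two dates. */
    static int completedYears(LocalDate hired, LocalDate today) {
        return (int) ChronoUnit.YEARS.between(hired, today);
    }

    public static void main(String[] args) {
        LocalDate hired = LocalDate.of(2023, 12, 1);
        LocalDate today = LocalDate.of(2024, 1, 15); // fixed date keeps the example reproducible

        System.out.println(calendarYears(hired, today));  // 1
        System.out.println(completedYears(hired, today)); // 0
    }
}
```

Passing the reference date in as a parameter, rather than calling `LocalDate.now()` inside the loop, also keeps every row measured against the same day.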

Step 4 — Handle missing values

java
// Option A: drop rows with any null
DataFrame clean = df.dropna();

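// Option B: fill salary nulls with the median of the non-missing salaries
double[] known = df.column("salary").doubleStream()
        .filter(Double::isFinite).sorted().toArray();
double medianSalary = known.length == 0 ? 0.0 : known[known.length / 2];  // upper median
// fillna() applies to every numeric column, not only "salary"
df.fillna(medianSalary);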
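
When computing a median for imputation, the midpoint has to come from the count of non-missing values rather than from the row count. The following JDK-only sketch shows the idea on a plain `double[]`, with NaN standing in for missing cells; `MedianImpute` and its methods are hypothetical and do not touch a DataFrame.

```java
import java.util.Arrays;

final class MedianImpute {
    /** Median of the finite entries; NaN when there are none. */
    static double median(double[] values) {
        double[] finite = Arrays.stream(values).filter(Double::isFinite).sorted().toArray();
        int n = finite.length;
        if (n == 0) return Double.NaN;
        // Odd count: middle element. Even count: mean of the two middle elements.
        return n % 2 == 1 ? finite[n / 2] : (finite[n / 2 - 1] + finite[n / 2]) / 2.0;
    }

    /** Returns a copy in which every non-finite entry is replaced by fill. */
    static double[] fill(double[] values, double fill) {
        return Arrays.stream(values).map(v -> Double.isFinite(v) ? v : fill).toArray();
    }

    public static void main(String[] args) {
        double[] salary = {60000.0, Double.NaN, 90000.0, 80000.0};
        double m = median(salary);
        System.out.println(m);                                // 80000.0
        System.out.println(Arrays.toString(fill(salary, m))); // [60000.0, 80000.0, 90000.0, 80000.0]
    }
}
```

Filling a copy of a single column, as here, avoids overwriting missing values in unrelated numeric columns with a salary figure.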

Step 5 — Sort and slice

java
// Sort by salary descending
DataFrame sorted = df.sort("salary", false);

// Top 10 earners
DataFrame top10 = sorted.slice(0, Math.min(10, sorted.nrow()));
System.out.println(top10.head(10));

Step 6 — Select features and encode categoricals

java
// Select the columns we want for the model
DataFrame features = df.select("age", "tenure", "salary", "department");

// For algorithms that need integer encoding:
// "department" already has NominalScale (auto-detected from enum)

// Export design matrix with dummy encoding
double[][] X = features.drop("salary")
        .toArray(false, CategoricalEncoder.DUMMY,
                 "age", "tenure", "department");

// Response vector
double[] y = features.column("salary").toDoubleArray();

Step 7 — Set a row index for traceability

java
DataFrame indexed = df.setIndex("name");

// Later: look up a specific employee by name
Tuple alice = indexed.loc("Alice");
System.out.println("Alice's salary: " + alice.getDouble("salary"));

// Join two DataFrames on employee name
DataFrame reviews = loadReviews().setIndex("employee");
DataFrame combined = indexed.join(reviews);

Step 8 — Describe the final feature set

java
DataFrame finalFeatures = df.select("age", "tenure", "department", "salary");
System.out.println(finalFeatures.describe());

// Verify no nulls remain
boolean anyNull = finalFeatures.stream().anyMatch(Tuple::anyNull);
System.out.println("Any nulls: " + anyNull);

9. API quick reference

DataFrame static factories

| Method | Description |
|---|---|
| new DataFrame(ValueVector...) | Construct from column vectors |
| new DataFrame(RowIndex, ValueVector...) | With row index |
| DataFrame.of(double[][], String...) | From 2-D double array |
| DataFrame.of(float[][], String...) | From 2-D float array |
| DataFrame.of(int[][], String...) | From 2-D int array |
| DataFrame.of(Class<T>, List<T>) | From POJOs via reflection |
| DataFrame.of(StructType, List<Tuple>) | From tuple list (empty → zero-row frame) |
| DataFrame.of(StructType, Stream<Tuple>) | From tuple stream |
| DataFrame.of(ResultSet) | From JDBC ResultSet |

DataFrame instance methods

| Method | Returns | Mutates this? | Description |
|---|---|---|---|
| nrow() / size() | int | no | Number of rows |
| ncol() | int | no | Number of columns |
| shape(dim) | int | no | Size of dimension 0 (rows) or 1 (cols) |
| isEmpty() | boolean | no | True if zero rows |
| schema() | StructType | no | Column schema |
| names() | String[] | no | Column names |
| dtypes() | DataType[] | no | Column types |
| measures() | Measure[] | no | Column measures |
| column(int) / column(String) | ValueVector | no | Column vector |
| get(int, int) | Object | no | Cell (boxed) |
| getInt/Double/…(int,int) | primitive | no | Cell (typed) |
| getString(int,int) | String | no | Cell as string (uses measure) |
| isNullAt(int,int) | boolean | no | Null check |
| set(int,int,Object) | void | yes | Set cell value |
| get(int) | Tuple | no | Row as Tuple |
| get(Index) | DataFrame | no | Rows by Index |
| get(boolean[]) | DataFrame | no | Rows by boolean mask |
| slice(int,int) | DataFrame | no | Rows [from, to) |
| sample(int) | DataFrame | no | Random sample without replacement |
| sort(String) | DataFrame | no | Ascending sort |
| sort(String,boolean) | DataFrame | no | Sort with direction |
| select(int...) | DataFrame | no | Columns by index |
| select(String...) | DataFrame | no | Columns by name |
| drop(int...) | DataFrame | no | Remove columns by index |
| drop(String...) | DataFrame | no | Remove columns by name |
| add(ValueVector...) | DataFrame | yes | Add new columns |
| set(String,ValueVector) | DataFrame | yes | Replace or add column |
| rename(String,String) | DataFrame | yes | Rename column in-place |
| merge(DataFrame...) | DataFrame | no | Horizontal column union |
| concat(DataFrame...) | DataFrame | no | Vertical row union |
| join(DataFrame) | DataFrame | no | Inner join on RowIndex |
| setIndex(String) | DataFrame | no | Column → RowIndex (removes column) |
| setIndex(Object[]) | DataFrame | no | Attach RowIndex array |
| loc(Object) | Tuple | no | Row by label |
| loc(Object...) | DataFrame | no | Rows by labels |
| dropna() | DataFrame | no | Remove rows with any null |
| fillna(double) | DataFrame | yes | Fill NaN/null in numeric columns |
| factorize(String...) | DataFrame | no | Encode string columns as NominalScale |
| toArray(String...) | double[][] | no | Numeric array (LEVEL encoding) |
| toArray(boolean,CategoricalEncoder,String...) | double[][] | no | Numeric array with options |
| toMatrix() | DenseMatrix | no | Matrix (LEVEL, no bias) |
| toMatrix(boolean,CategoricalEncoder,String) | DenseMatrix | no | Matrix with options |
| describe() | DataFrame | no | Summary statistics |
| head(int) | String | no | Top-N rows formatted |
| tail(int) | String | no | Bottom-N rows formatted |
| toString(int,int,boolean) | String | no | Row range formatted |
| stream() | Stream<Row> | no | Row stream |
| iterator() | Iterator<Row> | no | Row iterator |
| toList() | List<Row> | no | All rows as list |

StructType

| Method | Description |
|---|---|
| new StructType(StructField...) | Construct from fields |
| field(int) / field(String) | Get field by ordinal or name |
| indexOf(String) | Ordinal of named field |
| length() | Number of fields |
| names() / dtypes() / measures() | Field property arrays |
| add(StructField) | Append a field (mutable) |
| rename(String, String) | Rename a field (mutable) |

StructField

| Constructor / method | Description |
|---|---|
| new StructField(name, dtype) | Without measure |
| new StructField(name, dtype, measure) | With measure |
| withName(String) | Return renamed copy |
| isNumeric() | True for non-nominal numeric fields |
| toString(Object) | Format a value using measure or dtype |

ValueVector (selected)

| Method | Description |
|---|---|
| size() | Element count |
| isNullable() | True if vector can contain nulls |
| isNullAt(int) | Null check at position |
| getNullCount() | Count of null positions |
| anyNull() | True if any null exists |
| get(int) | Boxed value (may be null) |
| getInt/Double/…(int) | Typed value |
| getString(int) | String form (uses measure) |
| set(int, Object) | Mutation |
| get(Index) | Sub-selection |
| withName(String) | Return renamed copy |
| toIntArray() / toDoubleArray() / toStringArray() | Bulk export |
| intStream() / longStream() / doubleStream() / stream() | Streaming |
| eq(Object) / ne / lt / le / gt / ge | Element-wise comparison masks |
| isin(String...) / isin(int...) | Membership mask |
| isNull() | Per-element null mask |

SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.