base/DATA_FRAME.md
This document covers smile.data.DataFrame, smile.data.Tuple, and the
supporting packages smile.data.type, smile.data.measure, and
smile.data.vector.
smile.data.type
│
├── DataType (interface) – describes storage format
│ ├── Primitive types – BooleanType, ByteType, ShortType,
│ │ IntType, LongType, FloatType, DoubleType,
│ │ CharType (each has nullable=false/true)
│ ├── TemporalType – DateType, TimeType, DateTimeType
│ ├── DecimalType – BigDecimal
│ ├── StringType – String
│ ├── ObjectType – arbitrary Object
│ ├── ArrayType – primitive array columns
│ └── StructType – schema (ordered list of StructFields)
│
├── StructField(name, dtype, measure?) – one column descriptor
└── DataTypes – constants and factory methods
smile.data.measure
│
├── Measure (interface) – adds semantic meaning to a numeric column
│ ├── CategoricalMeasure – discrete labeled integer codes
│ │ ├── NominalScale – unordered categories (gender, country…)
│ │ └── OrdinalScale – ordered categories (rating, grade…)
│ └── NumericalMeasure – continuous numeric annotations
│ ├── IntervalScale – no true zero (temperature °C, year)
│ └── RatioScale – true zero (price, weight, count)
└── Measure.Currency / Percent – built-in ratio-scale singletons
smile.data.vector
│
├── ValueVector (interface) – one typed column of a DataFrame
│ ├── PrimitiveVector – non-nullable: IntVector, DoubleVector…
│ ├── NullablePrimitiveVector – nullable: NullableIntVector…
│ ├── StringVector – String column (always nullable-compatible)
│ ├── NumberVector<N> – boxed Number column (BigDecimal…)
│ └── ObjectVector<T> – arbitrary Object column
smile.data
│
├── Tuple (interface) – one row; immutable ordered field list
│ └── Row (record) – Tuple backed by a DataFrame + row index
│
├── RowIndex – optional label → ordinal index for rows
│
└── DataFrame (record) – 2-D heterogeneous tabular data
schema : StructType
columns : List<ValueVector>
index : RowIndex? (optional row labels)
smile.data.type
Every column in a DataFrame has a DataType that describes how values
are stored. Access the singletons via DataTypes.*:
import smile.data.type.DataTypes;
import smile.data.type.DataType;
| Singleton | Java type | Nullable variant |
|---|---|---|
| DataTypes.BooleanType | boolean | DataTypes.NullableBooleanType |
| DataTypes.ByteType | byte | DataTypes.NullableByteType |
| DataTypes.ShortType | short | DataTypes.NullableShortType |
| DataTypes.IntType | int | DataTypes.NullableIntType |
| DataTypes.LongType | long | DataTypes.NullableLongType |
| DataTypes.FloatType | float | DataTypes.NullableFloatType |
| DataTypes.DoubleType | double | DataTypes.NullableDoubleType |
| DataTypes.CharType | char | DataTypes.NullableCharType |
Non-nullable primitive types store values directly in a primitive array,
with no boxing and no null overhead. Nullable variants add a BitSet null-mask
alongside the primitive array. Reading a null cell through a primitive accessor
(getDouble, getInt, …) returns a sentinel such as Double.NaN or
Integer.MIN_VALUE, so always call isNullAt(i) first when the column is nullable.
boolean isNull = df.isNullAt(row, col);
if (!isNull) {
double v = df.getDouble(row, col);
}
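To make the null-mask idea concrete, here is a minimal sketch in plain Java with no SMILE dependency. It only illustrates the pattern of a primitive array paired with a BitSet; the class name NullMaskSketch and its method are hypothetical and are not SMILE's implementation.

```java
import java.util.BitSet;

/** Illustrative only: a primitive array paired with a BitSet null-mask. */
final class NullMaskSketch {

    /** Sums the cells that the mask does not mark as null. */
    static double sumNonNull(double[] values, BitSet nullMask) {
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            if (!nullMask.get(i)) { // null check first, then the primitive read
                sum += values[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] salary = {80000.0, Double.NaN, 90000.0};
        BitSet nullMask = new BitSet(salary.length);
        nullMask.set(1); // row 1 is null; the array slot only holds a placeholder
        System.out.println(sumNonNull(salary, nullMask));
    }
}
```

The mask, not the stored placeholder, decides whether a cell is null, which is why the check has to come before the primitive read.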
| Singleton | Java type |
|---|---|
| DataTypes.DateType | java.time.LocalDate |
| DataTypes.TimeType | java.time.LocalTime |
| DataTypes.DateTimeType | java.time.LocalDateTime |
Temporal columns are always represented as object arrays (they are not
primitives), but SMILE provides ISO-8601 parsing and formatting out of the
box via dtype.valueOf(String) and dtype.toString(Object).
| Singleton / factory | Java type | Notes |
|---|---|---|
| DataTypes.StringType | String | Always nullable |
| DataTypes.DecimalType | BigDecimal | Always nullable |
| DataTypes.ObjectType | Object | Catch-all |
| DataTypes.object(Class<?>) | specific class | Resolves to known type if possible |
| DataTypes.IntArrayType etc. | int[] etc. | Primitive array columns |
A StructField is an immutable triple (name, dtype, measure?) that
describes a single column.
import smile.data.type.StructField;
// Plain int column
StructField age = new StructField("age", DataTypes.IntType);
// Nullable salary
StructField salary = new StructField("salary", DataTypes.NullableDoubleType);
// Categorical column with a NominalScale
StructField gender = new StructField("gender", DataTypes.ByteType,
new NominalScale("Male", "Female"));
// Useful predicates
boolean numeric = age.isNumeric(); // true for non-nominal numeric
boolean nullable = salary.dtype().isNullable();
StructField is a Java record; two fields are equal when name, dtype, and
measure all match.
StructType is the schema of a DataFrame or Tuple. It is an ordered
list of StructFields with a fast name-to-index lookup map.
import smile.data.type.StructType;
StructType schema = new StructType(
new StructField("name", DataTypes.StringType),
new StructField("age", DataTypes.IntType),
new StructField("salary", DataTypes.NullableDoubleType)
);
// Field access
StructField byName = schema.field("age"); // by name
StructField byOrdinal = schema.field(1); // by ordinal
int j = schema.indexOf("age"); // ordinal of "age"
String[] ns = schema.names();
DataType[] dt = schema.dtypes();
int len = schema.length();
StructType is mutable for internal use by DataFrame (columns can be
added/renamed in-place). Do not cache a StructType reference and assume
it is immutable.
smile.data.measure
A Measure is an optional annotation on a StructField that adds semantic
meaning to a numeric column. It controls how values are rendered and how
SMILE's algorithms treat the column (e.g., dummy encoding for categorical
variables).
import smile.data.measure.*;
Unordered categories. Each integer code maps to a string label.
// From string labels (codes are 0, 1, 2, …)
NominalScale gender = new NominalScale("Male", "Female");
// From an enum
NominalScale color = new NominalScale(Color.class); // enum Color { Red, Green, Blue }
// From explicit code→label pairs
NominalScale custom = new NominalScale(
new int[] {1, 3, 7},
new String[] {"Low", "Mid", "High"}
);
int code = gender.valueOf("Female").intValue(); // 1
String label = gender.level(0); // "Male"
int size = gender.size(); // 2
When a column has a NominalScale, getString(i) returns the label, not
the raw integer, and df.factorize() will produce columns backed by a
NominalScale.
Ordered categories. Levels carry an implied rank; values are kept sorted.
OrdinalScale rating = new OrdinalScale("Poor", "Fair", "Good", "Excellent");
// ordinals: 0=Poor, 1=Fair, 2=Good, 3=Excellent
The ordinal position is meaningful for comparison and sorting but arithmetic (mean, variance) is not valid on pure ordinal data.
Numeric, no true zero. Arithmetic differences are meaningful, ratios are not. Primarily used as an annotation for documentation/display.
IntervalScale celsius = new IntervalScale(NumberFormat.getInstance());
Numeric with a true zero. All arithmetic is valid.
// Built-in singletons
Measure price = Measure.Currency; // formats as currency
Measure pct = Measure.Percent; // formats as percentage
// Custom
RatioScale weight = new RatioScale(NumberFormat.getInstance());
Compatibility rules (enforced by StructField's compact constructor):
- NumericalMeasure is invalid for Boolean, Char, and String columns.
- CategoricalMeasure is only valid for integral (int, long, byte, short) columns.
smile.data.vector
A ValueVector is a typed, indexed, one-dimensional array that forms a
single column of a DataFrame.
import smile.data.vector.*;
Each primitive type has a corresponding non-nullable vector class:
| Class | Backing store | Example |
|---|---|---|
| IntVector | int[] | new IntVector("age", new int[]{25, 30, 35}) |
| LongVector | long[] | new LongVector("ts", new long[]{...}) |
| FloatVector | float[] | new FloatVector("score", new float[]{...}) |
| DoubleVector | double[] | new DoubleVector("salary", new double[]{...}) |
| BooleanVector | boolean[] | new BooleanVector("flag", new boolean[]{...}) |
| ByteVector | byte[] | new ByteVector("cat", new byte[]{...}) |
| ShortVector | short[] | new ShortVector("rank", new short[]{...}) |
| CharVector | char[] | new CharVector("grade", new char[]{...}) |
Each class also accepts a StructField in place of the name as the first
argument, which is how a measure is attached to the column:
StructField field = new StructField("gender", DataTypes.ByteType,
new NominalScale("Male", "Female"));
ByteVector gender = new ByteVector(field, new byte[]{0, 1, 0, 1});
Nullable variants store an additional BitSet null-mask:
import java.util.BitSet;
double[] values = {80000.0, Double.NaN, 90000.0};
BitSet nullMask = new BitSet(3);
nullMask.set(1); // index 1 is null
NullableDoubleVector salary = new NullableDoubleVector(
new StructField("salary", DataTypes.NullableDoubleType),
values, nullMask
);
salary.isNullable(); // true
salary.isNullAt(1); // true
salary.getDouble(0); // 80000.0
salary.getNullCount(); // 1
Corresponding nullable classes: NullableIntVector, NullableLongVector,
NullableFloatVector, NullableDoubleVector, NullableBooleanVector,
NullableByteVector, NullableShortVector, NullableCharVector.
| Class | Content |
|---|---|
| StringVector | String[], always nullable |
| NumberVector<N> | Number[]: BigDecimal, boxed primitives |
| ObjectVector<T> | Object[]: LocalDate, LocalDateTime, any type |
StringVector names = new StringVector("name", new String[]{"Alice","Bob"});
ObjectVector<LocalDate> dates = new ObjectVector<>(
new StructField("birthday", DataTypes.DateType),
new LocalDate[]{ LocalDate.of(1990,1,1), LocalDate.of(1985,6,15) }
);
ValueVector v = df.column("age");
// Size and nullability
int n = v.size();
boolean nullable = v.isNullable();
boolean hasNull = v.anyNull();
int nullCount = v.getNullCount();
boolean rowNull = v.isNullAt(2);
// Typed reads
int i = v.getInt(0);
double d = v.getDouble(0);
String s = v.getString(0); // uses measure for categorical columns
Object o = v.get(0); // boxed / object value, may be null
// Bulk export
int[] ia = v.toIntArray();
double[] da = v.toDoubleArray();
String[] sa = v.toStringArray();
// Streaming
v.intStream().sum();
v.doubleStream().average();
v.stream().filter(Objects::nonNull).count();
// Boolean filter masks
boolean[] mask = v.eq(30); // element-wise ==
boolean[] mask = v.gt(25); // element-wise >
boolean[] mask = v.isin(25, 30); // element-wise membership
// Sub-selection
ValueVector sub = v.get(Index.of(new int[]{0, 2, 3}));
// Rename (returns new vector)
ValueVector renamed = v.withName("years");
// In-place mutation
v.set(0, 99);
Tuple is an interface representing one row of a DataFrame. Treat it as
read-only: the interface is meant to be immutable from the user's
perspective, but a Row (the concrete implementation) is a view over its
DataFrame, so mutating a Row directly modifies the backing DataFrame.
import smile.data.Tuple;
import smile.data.type.StructType;
// Construct standalone from schema + values
StructType schema = new StructType(
new StructField("x", DataTypes.IntType),
new StructField("y", DataTypes.DoubleType)
);
Tuple t = Tuple.of(schema, new Object[]{42, 3.14});
Tuple t = Tuple.of(schema, new int[]{42}, new double[]{3.14});
All accessors have both ordinal (int i) and name (String field) forms:
Tuple row = df.get(0);
// By ordinal
int age = row.getInt(0);
double salary = row.getDouble(2);
String name = row.getString(1);
// By name
int age = row.getInt("age");
double salary = row.getDouble("salary");
// Generic (boxed, may return null)
Object val = row.get(0);
Object val = row.get("name");
// Null check — always check before primitive accessors on nullable columns
boolean isNull = row.isNullAt(2);
boolean isNull = row.isNullAt("salary");
boolean anyNull = row.anyNull();
Available typed accessors: getBoolean, getChar, getByte, getShort,
getInt, getLong, getFloat, getDouble, getString.
// All fields as doubles (NaN for nulls, level encoding for categoricals)
double[] arr = row.toArray();
// Selective fields
double[] arr = row.toArray("age", "salary");
// With intercept (bias=1) and dummy encoding
double[] arr = row.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");
StructType schema = row.schema();
int length = row.length();
int j = row.indexOf("salary");
DataFrame is a Java record. Its component references are fixed, but the
objects they refer to are not deeply immutable. It is backed by:
- StructType schema: column descriptors
- List<ValueVector> columns: the actual data, one vector per column
- RowIndex index: optional row labels (may be null)
The List<ValueVector> and StructType are mutable for the in-place
operations add(), set(), rename(), and fillna(). All other
operations that structurally reshape the data (e.g., select, drop,
sort, concat) return a new DataFrame.
import smile.data.DataFrame;
DataFrame df = new DataFrame(
new StringVector ("name", new String[] {"Alice", "Bob", "Charlie"}),
new IntVector ("age", new int[] {25, 30, 35}),
new DoubleVector ("salary", new double[] {60000., 80000., 90000.})
);
double[][] data = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
// Auto-named columns: V1, V2
DataFrame df = DataFrame.of(data);
// Custom names
DataFrame df = DataFrame.of(data, "x", "y");
DataFrame df = DataFrame.of(new int[][]{{1,2},{3,4}}, "a", "b");
DataFrame df = DataFrame.of(new float[][]{{1f},{2f}}, "v");
SMILE uses Java Beans introspection (getXxx() methods) to discover
columns automatically. Field order in the schema follows the alphabetical
order of getter names.
public class Person {
public String getName() { return name; }
public int getAge() { return age; }
public Double getSalary() { return salary; } // nullable wrapper → NullableDoubleType
public Gender getGender() { return gender; } // enum → ByteVector with NominalScale
// …
}
List<Person> persons = List.of(
new Person("Alice", 25, 60000., Gender.Female),
new Person("Bob", 30, null, Gender.Male)
);
DataFrame df = DataFrame.of(Person.class, persons);
Rules for automatic type inference:
- int / long / float / double / boolean / char / byte / short → non-nullable primitive vector
- Integer, Double, etc. and Number subclasses → nullable vector or NumberVector
- String → StringVector
- Enum → ByteVector (≤127 levels), ShortVector (≤32767), or IntVector; always annotated with a NominalScale
- LocalDate, LocalDateTime, LocalTime → ObjectVector with temporal type
- Any other type → ObjectVector
StructType schema = new StructType(
new StructField("x", DataTypes.IntType),
new StructField("y", DataTypes.DoubleType)
);
// Non-empty list
List<Tuple> rows = List.of(Tuple.of(schema, 1, 2.0), Tuple.of(schema, 3, 4.0));
DataFrame df = DataFrame.of(schema, rows);
// Stream variant
DataFrame df = DataFrame.of(schema, rows.stream());
// Empty list — returns a zero-row DataFrame with the correct schema
DataFrame empty = DataFrame.of(schema, List.of());
try (Connection conn = DriverManager.getConnection(url);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM employees")) {
DataFrame df = DataFrame.of(rs);
}
JDBC type mapping is handled automatically via StructType.of(ResultSet).
// Dimensions
int nrow = df.nrow(); // number of rows
int ncol = df.ncol(); // number of columns
int size = df.size(); // alias for nrow()
int[] shape = {df.shape(0), df.shape(1)}; // {nrow, ncol}
boolean empty = df.isEmpty();
// Schema
StructType schema = df.schema();
String[] names = df.names();
DataType[] dtypes = df.dtypes();
Measure[] measures = df.measures();
// Print
System.out.println(df); // head(10)
System.out.println(df.head(5)); // first 5 rows
System.out.println(df.tail(5)); // last 5 rows
System.out.println(df.toString(2, 7, true)); // rows [2,7)
// Summary statistics
System.out.println(df.describe());
describe() returns a new DataFrame with one row per column and columns:
column, type, measure, count (non-null), mode, mean, std,
min, 25%, 50%, 75%, max.
Object val = df.get(row, col); // boxed value, may be null
int i = df.getInt(row, col);
long l = df.getLong(row, col);
float f = df.getFloat(row, col);
double d = df.getDouble(row, col);
String s = df.getString(row, col); // uses measure for categoricals
String sc = df.getScale(row, col); // level name for NominalScale / OrdinalScale
boolean nil = df.isNullAt(row, col);
Tuple row = df.get(0); // first row
Tuple row = df.apply(0); // Scala alias
// Iterate all rows
for (Row row : df) { /* … */ }
df.stream().forEach(row -> /* … */);
List<Row> list = df.toList();
ValueVector col = df.column(0); // by ordinal
ValueVector col = df.column("age"); // by name
ValueVector col = df.apply("age"); // Scala alias
df.set(row, col, value);
df.update(row, col, value); // Scala alias
A RowIndex maps arbitrary label objects to row ordinals, allowing
label-based row selection similar to Pandas .loc.
// Attach an index from an existing column (column is removed from data)
DataFrame indexed = df.setIndex("name");
// Attach an index from an external array
DataFrame indexed = df.setIndex(new Object[]{"r0","r1","r2","r3"});
// Look up by label
Tuple row = indexed.loc("Alice"); // single row
DataFrame sub = indexed.loc("Alice","Bob"); // multiple rows
RowIndex is also used internally by join() to perform an inner join on
shared keys.
// RowIndex directly
RowIndex index = new RowIndex(new Object[]{"a","b","c"});
int i = index.apply("b"); // 1
int i = index.getOrDefault("x"); // -1 (not found)
boolean has = index.containsKey("a"); // true
int n = index.size(); // 3
Constraints: no null values, no duplicate values — both throw
IllegalArgumentException at construction time.
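The label-to-ordinal mapping and its two constraints can be pictured with a small plain-Java sketch. LabelIndexSketch and ordinalOf are hypothetical names used only for illustration; this is not SMILE's RowIndex, merely a hash map that rejects null and duplicate labels in the same way the text describes.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative label-to-ordinal lookup; not SMILE's RowIndex implementation. */
final class LabelIndexSketch {
    private final Map<Object, Integer> ordinals = new HashMap<>();

    LabelIndexSketch(Object[] labels) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == null) {
                throw new IllegalArgumentException("null label at row " + i);
            }
            // putIfAbsent returns the earlier ordinal when the label was already seen
            if (ordinals.putIfAbsent(labels[i], i) != null) {
                throw new IllegalArgumentException("duplicate label: " + labels[i]);
            }
        }
    }

    /** Returns the row ordinal for a label, or -1 when the label is absent. */
    int ordinalOf(Object label) {
        return ordinals.getOrDefault(label, -1);
    }

    public static void main(String[] args) {
        LabelIndexSketch index = new LabelIndexSketch(new Object[]{"a", "b", "c"});
        System.out.println(index.ordinalOf("b"));
        System.out.println(index.ordinalOf("x"));
    }
}
```

Rejecting duplicates at construction time is what lets a single-label lookup return exactly one row.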
All selection/drop operations return a new DataFrame.
// Select by ordinal indices
DataFrame sub = df.select(0, 2);
// Select by name
DataFrame sub = df.select("name", "salary");
DataFrame sub = df.apply("name", "salary"); // Scala alias
// Drop by ordinal index
DataFrame sub = df.drop(1);
DataFrame sub = df.drop(0, 2);
// Drop by name
DataFrame sub = df.drop("age");
DataFrame sub = df.drop("age", "birthday");
add() and set() mutate this in-place (they modify the internal
List<ValueVector> and StructType).
// Add new columns — all must have the same size as the DataFrame
// and names must not clash with existing columns or each other
IntVector bonus = new IntVector("bonus", new int[]{5000, 8000, 12000});
df.add(bonus);
// Two columns at once — both names must be distinct from each other
// and from existing columns
df.add(c1, c2);
// Replace an existing column (or add if not present)
df.set("salary", updatedSalaryVector);
df.update("salary", updatedSalaryVector); // Scala alias
rename() mutates both the StructType and the backing ValueVector
in-place:
df.rename("age", "years");
// df.names() is now ["name", "years", "salary"]
All row-selection operations return a new DataFrame.
// Build a boolean mask manually or from a column comparison
boolean[] mask = df.column("age").gt(28);
DataFrame sub = df.get(mask);
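For readers who want to see what a mask-based filter amounts to, the following plain-Java sketch builds an element-wise "greater than" mask and applies it to a parallel column. MaskFilterSketch, gt, and select are hypothetical helpers for illustration and do not use or mirror SMILE's internals.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: building a boolean mask and filtering rows with it. */
final class MaskFilterSketch {

    /** Element-wise "greater than": one boolean per row. */
    static boolean[] gt(int[] column, int threshold) {
        boolean[] mask = new boolean[column.length];
        for (int i = 0; i < column.length; i++) {
            mask[i] = column[i] > threshold;
        }
        return mask;
    }

    /** Keeps the rows whose mask entry is true, preserving their order. */
    static List<String> select(String[] rows, boolean[] mask) {
        if (rows.length != mask.length) {
            throw new IllegalArgumentException("mask length must equal row count");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            if (mask[i]) {
                kept.add(rows[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        boolean[] mask = gt(new int[]{25, 30, 35}, 28);
        System.out.println(select(new String[]{"Alice", "Bob", "Charlie"}, mask));
    }
}
```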
import smile.util.Index;
// From explicit row indices
DataFrame sub = df.get(Index.of(new int[]{0, 2, 3}));
// From a boolean mask
DataFrame sub = df.get(Index.of(new boolean[]{true, false, true, true}));
// Remove any row that has at least one null/NaN value
DataFrame clean = df.dropna();
sort() returns a new DataFrame with all rows reordered. Null values
always sort to the end regardless of direction.
// Ascending (default)
DataFrame sorted = df.sort("age");
// Descending
DataFrame sorted = df.sort("salary", false);
The sort is stable and works on any column type: integral, floating-point,
String, and any Comparable.
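The two properties described here, stability and nulls sorting last in either direction, can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration (NullsLastSortSketch and order are hypothetical names), not the algorithm SMILE uses: it sorts row ordinals by a nullable key with Comparator.nullsLast.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: a stable row ordering with null keys always last. */
final class NullsLastSortSketch {

    /** Returns row ordinals ordered by key; null keys go last in either direction. */
    static int[] order(Double[] keys, boolean ascending) {
        Integer[] rows = new Integer[keys.length];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = i;
        }
        Comparator<Double> byValue = ascending
                ? Comparator.<Double>naturalOrder()
                : Comparator.<Double>reverseOrder();
        // nullsLast wraps the direction-specific comparator, so nulls stay at the end
        Comparator<Double> nullsLast = Comparator.nullsLast(byValue);
        // Arrays.sort on an object array is stable: ties keep their original row order
        Arrays.sort(rows, (a, b) -> nullsLast.compare(keys[a], keys[b]));
        return Arrays.stream(rows).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Double[] salary = {60000.0, null, 90000.0, 60000.0};
        System.out.println(Arrays.toString(order(salary, true)));
        System.out.println(Arrays.toString(order(salary, false)));
    }
}
```

Sorting ordinals rather than values is a convenient design when every column of a table has to be reordered by the same permutation.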
// Contiguous row range [from, to)
DataFrame slice = df.slice(1, 4); // rows 1, 2, 3
DataFrame first = df.slice(0, 1); // first row only
DataFrame empty = df.slice(2, 2); // zero rows (valid)
// Random sample without replacement
DataFrame sample = df.sample(50); // up to 50 rows (capped at nrow())
slice() validates that 0 ≤ from ≤ to ≤ nrow().
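The range check 0 ≤ from ≤ to ≤ nrow() can be written as a single guard. The sketch below applies it to a plain int array standing in for the rows; SliceSketch is a hypothetical name, and the choice of IndexOutOfBoundsException is an assumption, since the text does not say which exception SMILE throws.

```java
import java.util.Arrays;

/** Illustrative only: half-open range validation for a row slice. */
final class SliceSketch {

    /** Returns rows [from, to); an empty range (from == to) is valid. */
    static int[] slice(int[] rows, int from, int to) {
        if (from < 0 || from > to || to > rows.length) {
            // Exception type is an assumption made for this sketch
            throw new IndexOutOfBoundsException(
                    "invalid range [" + from + ", " + to + ") for " + rows.length + " rows");
        }
        return Arrays.copyOfRange(rows, from, to);
    }

    public static void main(String[] args) {
        int[] rows = {10, 20, 30, 40, 50};
        System.out.println(Arrays.toString(slice(rows, 1, 4)));
        System.out.println(slice(rows, 2, 2).length);
    }
}
```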
Combines two or more DataFrames side-by-side. All must have the same row
count. Clashing column names get a _2, _3, … suffix.
DataFrame wide = left.merge(right);
DataFrame wide = a.merge(b, c, d);
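One plausible way to resolve clashing names is sketched below in plain Java. The exact rule SMILE applies is not spelled out in this document, so the scheme here (append _2, _3, … until the name is unused) is an assumption, and MergeNamesSketch is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: de-duplicating column names when frames are merged. */
final class MergeNamesSketch {

    /** Concatenates the name lists, suffixing any name that is already taken. */
    static List<String> mergedNames(List<List<String>> frames) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (List<String> names : frames) {
            for (String name : names) {
                String candidate = name;
                int suffix = 2;
                // Set.add returns false when the candidate is already in use
                while (!used.add(candidate)) {
                    candidate = name + "_" + suffix++;
                }
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mergedNames(List.of(
                List.of("id", "x"), List.of("id", "y"), List.of("id", "x"))));
    }
}
```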
Stacks DataFrames on top of each other. All must have the exact same schema.
DataFrame tall = train.concat(test);
DataFrame tall = a.concat(b, c);
If all frames have a RowIndex, the indices are concatenated too.
Performs an inner join using matching row-label keys. If either frame has
no RowIndex, falls back to merge().
DataFrame merged = left.join(right);
// Rows present in both left.index and right.index are kept;
// unmatched rows are dropped.
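An inner join on row labels keeps only the labels found on both sides. The following plain-Java sketch shows that behavior with two label-to-value maps; InnerJoinSketch is a hypothetical name, and emitting rows in left-hand order is an assumption of the sketch rather than a documented SMILE guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: inner join of two label-keyed tables. */
final class InnerJoinSketch {

    /** Keeps labels present in both maps; each output row holds the left value, then the right value. */
    static Map<String, List<Object>> join(Map<String, Object> left, Map<String, Object> right) {
        Map<String, List<Object>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : left.entrySet()) {
            if (right.containsKey(entry.getKey())) {
                List<Object> row = new ArrayList<>();
                row.add(entry.getValue());
                row.add(right.get(entry.getKey()));
                joined.put(entry.getKey(), row);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Object> salaries = new LinkedHashMap<>();
        salaries.put("Alice", 60000);
        salaries.put("Bob", 80000);
        salaries.put("Carol", 90000);
        Map<String, Object> reviews = new LinkedHashMap<>();
        reviews.put("Bob", "A");
        reviews.put("Dave", "B");
        reviews.put("Alice", "C");
        System.out.println(join(salaries, reviews)); // Carol and Dave are unmatched and dropped
    }
}
```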
// Drop rows with any null/NaN
DataFrame clean = df.dropna();
// Fill NaN/Inf in numeric columns in-place
df.fillna(0.0); // replace with zero
df.fillna(-1.0); // replace with sentinel
fillna operates on DoubleVector, FloatVector, NullablePrimitiveVector,
and NumberVector columns; non-numeric columns are unaffected.
factorize() converts String columns into IntVector columns annotated
with a NominalScale. The integer codes are assigned in alphabetical order
of the distinct string values.
// Convert all String columns
DataFrame f = df.factorize();
// Convert specific columns
DataFrame f = df.factorize("color", "country");
// Inspect the resulting scale
NominalScale scale = (NominalScale) f.schema().field("color").measure();
String label = scale.level(0); // first level alphabetically
int code = scale.valueOf("Red").intValue();
This is the standard step to prepare string data for machine-learning algorithms that require integer inputs.
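The alphabetical code assignment can be illustrated without SMILE. In this sketch, FactorizeSketch is a hypothetical helper that sorts the distinct strings with Java's natural String order (an assumption about what "alphabetical" means here) and assumes the input contains no nulls.

```java
import java.util.Arrays;

/** Illustrative only: mapping strings to integer codes in sorted order. */
final class FactorizeSketch {

    /** Distinct values in natural String order; the array index is the code. */
    static String[] levels(String[] values) {
        return Arrays.stream(values).distinct().sorted().toArray(String[]::new);
    }

    /** Replaces each value with its position in the sorted level array. */
    static int[] encode(String[] values, String[] levels) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            codes[i] = Arrays.binarySearch(levels, values[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] color = {"Red", "Green", "Blue", "Green"};
        String[] levels = levels(color);
        System.out.println(Arrays.toString(levels));
        System.out.println(Arrays.toString(encode(color, levels)));
    }
}
```

Because the codes follow sort order rather than order of first appearance, the same set of strings always yields the same encoding.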
Both toArray() and toMatrix() convert the DataFrame to a dense numeric
representation with optional bias (intercept) column and categorical
encoding.
// Default: no bias, level encoding, all columns
double[][] X = df.toArray();
// Selective columns
double[][] X = df.toArray("age", "salary", "gender");
// With bias + dummy encoding for categoricals
double[][] X = df.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary");
// DenseMatrix form (suitable for linear algebra)
DenseMatrix M = df.toMatrix();
DenseMatrix M = df.toMatrix(true, CategoricalEncoder.DUMMY, "rowNameColumn");
NaN is used for null/missing values in the output array.
See §7 for the CategoricalEncoder options.
DataFrame stats = df.describe();
System.out.println(stats);
Output columns: column, type, measure, count, mode, mean, std,
min, 25%, 50%, 75%, max.
- Categorical columns: mode, min, median, max over the integer codes.
- Numeric columns: mean, std, min, quartiles, max.
- Other columns: count (non-null) and mode.
System.out.println(df); // head(10)
System.out.println(df.head(5));
System.out.println(df.tail(5));
System.out.println(df.toString(from, to, truncate));
toString(from, to, truncate):
- from must be in [0, nrow]; from > nrow throws.
- If to ≤ from (including after to is clamped to nrow), the result is "Empty DataFrame\n".
- Cells wider than maxColWidth are truncated with "..." when truncate=true.
import java.sql.*;
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM sales")) {
DataFrame df = DataFrame.of(rs);
System.out.println(df.describe());
}
JDBC types are mapped to SMILE types via StructType.of(ResultSetMetaData).
CategoricalEncoder controls how categorical (NominalScale / OrdinalScale)
columns are converted when calling toArray(), toMatrix(), or
Tuple.toArray().
| Enum value | Meaning | Output columns per category |
|---|---|---|
| LEVEL | Integer level code (default) | 1 (raw code value) |
| DUMMY | Dummy / treatment encoding | k−1 binary columns (reference = first level) |
| ONE_HOT | Full one-hot encoding | k binary columns |
import smile.data.CategoricalEncoder;
// Level encoding (default) — "gender" becomes a single int column
double[][] X = df.toArray("age", "gender");
// Dummy encoding — k levels → k-1 binary columns
// e.g. gender {Male=0, Female=1} → one binary column "gender_Female"
double[][] X = df.toArray(false, CategoricalEncoder.DUMMY, "age", "gender");
// One-hot encoding — k levels → k binary columns
double[][] X = df.toArray(false, CategoricalEncoder.ONE_HOT, "age", "gender");
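The difference between the two binary encodings is easiest to see on a single value. The sketch below is plain Java and assumes level codes run from 0 to k−1; CategoricalEncodingSketch and its methods are hypothetical and only illustrate the column counts given in the table, not SMILE's encoder.

```java
import java.util.Arrays;

/** Illustrative only: dummy versus one-hot encoding of one level code. */
final class CategoricalEncodingSketch {

    /** One-hot: k binary columns with a single 1 at the position of the code. */
    static double[] oneHot(int code, int k) {
        double[] out = new double[k];
        out[code] = 1.0;
        return out;
    }

    /** Dummy: k-1 binary columns; the first level (code 0) is the all-zero reference. */
    static double[] dummy(int code, int k) {
        double[] out = new double[k - 1];
        if (code > 0) {
            out[code - 1] = 1.0;
        }
        return out;
    }

    public static void main(String[] args) {
        int k = 3; // e.g. three department levels with codes 0, 1, 2
        for (int code = 0; code < k; code++) {
            System.out.println(code + " dummy=" + Arrays.toString(dummy(code, k))
                    + " oneHot=" + Arrays.toString(oneHot(code, k)));
        }
    }
}
```

Dummy encoding drops one column because it is redundant once an intercept is present, which is why it is commonly paired with a bias term in linear models.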
This tutorial processes an employee dataset from a raw POJO list through to a numeric design matrix ready for a machine-learning algorithm.
import java.time.LocalDate;
import java.util.List;
import smile.data.CategoricalEncoder;
import smile.data.DataFrame;
import smile.data.Tuple;
import smile.data.measure.*;
import smile.data.type.*;
import smile.data.vector.*;
public enum Department { Engineering, Marketing, HR }
public class Employee {
public String getName() { return name; }
public int getAge() { return age; }
public Department getDepartment() { return dept; }
public LocalDate getHireDate() { return hireDate; }
public Double getSalary() { return salary; } // nullable
// …constructor, fields…
}
List<Employee> employees = loadEmployees(); // from DB / file / …
DataFrame df = DataFrame.of(Employee.class, employees);
System.out.println(df.schema());
// age: int
// department: byte nominal[Engineering, HR, Marketing]
// hireDate: Date
// name: String
// salary: double?
System.out.println(df);
System.out.println(df.describe());
// How many rows have a null salary?
long nullSalaries = df.column("salary").getNullCount();
System.out.println("Missing salaries: " + nullSalaries);
import smile.data.vector.IntVector;
// Tenure in years = current year - hire year
int[] tenure = new int[df.nrow()];
for (int i = 0; i < df.nrow(); i++) {
LocalDate d = (LocalDate) df.column("hireDate").get(i);
tenure[i] = LocalDate.now().getYear() - d.getYear();
}
df.add(new IntVector("tenure", tenure));
// Option A: drop rows with any null
DataFrame clean = df.dropna();
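// Option B: fill salary nulls with the median of the non-null salaries.
// Note: fillna() applies to every numeric column, not only salary.
double[] salaries = df.column("salary").doubleStream()
.filter(Double::isFinite).sorted().toArray();
double medianSalary = salaries.length == 0 ? 0.0 : salaries[salaries.length / 2];
df.fillna(medianSalary);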
// Sort by salary descending
DataFrame sorted = df.sort("salary", false);
// Top 10 earners
DataFrame top10 = sorted.slice(0, Math.min(10, sorted.nrow()));
System.out.println(top10.head(10));
// Select the columns we want for the model
DataFrame features = df.select("age", "tenure", "salary", "department");
// For algorithms that need integer encoding:
// "department" already has NominalScale (auto-detected from enum)
// Export design matrix with dummy encoding
double[][] X = features.drop("salary")
.toArray(false, CategoricalEncoder.DUMMY,
"age", "tenure", "department");
// Response vector
double[] y = features.column("salary").toDoubleArray();
DataFrame indexed = df.setIndex("name");
// Later: look up a specific employee by name
Tuple alice = indexed.loc("Alice");
System.out.println("Alice's salary: " + alice.getDouble("salary"));
// Join two DataFrames on employee name
DataFrame reviews = loadReviews().setIndex("employee");
DataFrame combined = indexed.join(reviews);
DataFrame finalFeatures = df.select("age", "tenure", "department", "salary");
System.out.println(finalFeatures.describe());
// Verify no nulls remain
boolean anyNull = finalFeatures.stream().anyMatch(Tuple::anyNull);
System.out.println("Any nulls: " + anyNull);
| Method | Description |
|---|---|
| new DataFrame(ValueVector...) | Construct from column vectors |
| new DataFrame(RowIndex, ValueVector...) | With row index |
| DataFrame.of(double[][], String...) | From 2-D double array |
| DataFrame.of(float[][], String...) | From 2-D float array |
| DataFrame.of(int[][], String...) | From 2-D int array |
| DataFrame.of(Class<T>, List<T>) | From POJOs via reflection |
| DataFrame.of(StructType, List<Tuple>) | From tuple list (empty → zero-row frame) |
| DataFrame.of(StructType, Stream<Tuple>) | From tuple stream |
| DataFrame.of(ResultSet) | From JDBC ResultSet |
| Method | Returns | Mutates this? | Description |
|---|---|---|---|
| nrow() / size() | int | no | Number of rows |
| ncol() | int | no | Number of columns |
| shape(dim) | int | no | Size of dimension 0 (rows) or 1 (cols) |
| isEmpty() | boolean | no | True if zero rows |
| schema() | StructType | no | Column schema |
| names() | String[] | no | Column names |
| dtypes() | DataType[] | no | Column types |
| measures() | Measure[] | no | Column measures |
| column(int) / column(String) | ValueVector | no | Column vector |
| get(int, int) | Object | no | Cell (boxed) |
| getInt/Double/…(int,int) | primitive | no | Cell (typed) |
| getString(int,int) | String | no | Cell as string (uses measure) |
| isNullAt(int,int) | boolean | no | Null check |
| set(int,int,Object) | void | yes | Set cell value |
| get(int) | Tuple | no | Row as Tuple |
| get(Index) | DataFrame | no | Rows by Index |
| get(boolean[]) | DataFrame | no | Rows by boolean mask |
| slice(int,int) | DataFrame | no | Rows [from, to) |
| sample(int) | DataFrame | no | Random sample without replacement |
| sort(String) | DataFrame | no | Ascending sort |
| sort(String,boolean) | DataFrame | no | Sort with direction |
| select(int...) | DataFrame | no | Columns by index |
| select(String...) | DataFrame | no | Columns by name |
| drop(int...) | DataFrame | no | Remove columns by index |
| drop(String...) | DataFrame | no | Remove columns by name |
| add(ValueVector...) | DataFrame | yes | Add new columns |
| set(String,ValueVector) | DataFrame | yes | Replace or add column |
| rename(String,String) | DataFrame | yes | Rename column in-place |
| merge(DataFrame...) | DataFrame | no | Horizontal column union |
| concat(DataFrame...) | DataFrame | no | Vertical row union |
| join(DataFrame) | DataFrame | no | Inner join on RowIndex |
| setIndex(String) | DataFrame | no | Column → RowIndex (removes column) |
| setIndex(Object[]) | DataFrame | no | Attach RowIndex array |
| loc(Object) | Tuple | no | Row by label |
| loc(Object...) | DataFrame | no | Rows by labels |
| dropna() | DataFrame | no | Remove rows with any null |
| fillna(double) | DataFrame | yes | Fill NaN/null in numeric columns |
| factorize(String...) | DataFrame | no | Encode string columns as NominalScale |
| toArray(String...) | double[][] | no | Numeric array (LEVEL encoding) |
| toArray(boolean,CategoricalEncoder,String...) | double[][] | no | Numeric array with options |
| toMatrix() | DenseMatrix | no | Matrix (LEVEL, no bias) |
| toMatrix(boolean,CategoricalEncoder,String) | DenseMatrix | no | Matrix with options |
| describe() | DataFrame | no | Summary statistics |
| head(int) | String | no | Top-N rows formatted |
| tail(int) | String | no | Bottom-N rows formatted |
| toString(int,int,boolean) | String | no | Row range formatted |
| stream() | Stream<Row> | no | Row stream |
| iterator() | Iterator<Row> | no | Row iterator |
| toList() | List<Row> | no | All rows as list |
| Method | Description |
|---|---|
| new StructType(StructField...) | Construct from fields |
| field(int) / field(String) | Get field by ordinal or name |
| indexOf(String) | Ordinal of named field |
| length() | Number of fields |
| names() / dtypes() / measures() | Field property arrays |
| add(StructField) | Append a field (mutable) |
| rename(String, String) | Rename a field (mutable) |
| Constructor / method | Description |
|---|---|
| new StructField(name, dtype) | Without measure |
| new StructField(name, dtype, measure) | With measure |
| withName(String) | Return renamed copy |
| isNumeric() | True for non-nominal numeric fields |
| toString(Object) | Format a value using measure or dtype |
| Method | Description |
|---|---|
| size() | Element count |
| isNullable() | True if vector can contain nulls |
| isNullAt(int) | Null check at position |
| getNullCount() | Count of null positions |
| anyNull() | True if any null exists |
| get(int) | Boxed value (may be null) |
| getInt/Double/…(int) | Typed value |
| getString(int) | String form (uses measure) |
| set(int, Object) | Mutation |
| get(Index) | Sub-selection |
| withName(String) | Return renamed copy |
| toIntArray() / toDoubleArray() / toStringArray() | Bulk export |
| intStream() / longStream() / doubleStream() / stream() | Streaming |
| eq(Object) / ne / lt / le / gt / ge | Element-wise comparison masks |
| isin(String...) / isin(int...) | Membership mask |
| isNull() | Per-element null mask |
SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.