studio/src/universal/notebooks/data-frame.ipynb
Many Smile algorithms take simple double[] as input. But we also use the encapsulation class DataFrame. As shown in Data notebook, the output of most Smile data parsers is a DataFrame object. DataFrames are immutable and contain a fixed number of named columns.
import java.nio.file.*;
import java.time.*;
import org.apache.commons.csv.CSVFormat;
import smile.data.*;
import smile.data.vector.*;
import smile.io.*;
import smile.util.Index;
In this session, we will explore the functionality of DataFrame with the iris data. The iris data is from early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis.
var home = System.getProperty("smile.home");
var iris = Read.arff(home + "/data/weka/iris.arff");
First, let's check out the statistic summary of numeric columns in the data.
iris.describe();
We can get a row with the array syntax.
iris.get(0);
When selecting a row, it returns a Tuple, which is an immutable finite ordered list (sequence) of elements. Moreover, we can slice a DataFrame into a new one.
iris.get(Index.range(10, 20));
We can refer a column by its name and it returns a vector.
iris.column("sepallength");
Similarly, we can select a few columns to create a new data frame.
iris.select("sepallength", "sepalwidth");
Advanced operations such as exists, forall, find, filter are also supported. The predicate of these functions expect a Tuple.
iris.stream().anyMatch(row -> row.getDouble(0) > 4.5);
In this example, we test if there is any sample with sepallength > 4.5. Since sepallength is the first column, we use getDouble(0) to retrive the value in the predicate labmda. Note that Tuple allows generic access by get() method, which will incur boxing overhead for primitives. Therefore, Tuple also provides the native primitive access method getXXX(), where XXX is the type.
It is invalid to use the native primitive interface to retrieve a value
that is null, instead a user must check isNullAt before attempting
to retrieve a value that might be null.
iris.stream().allMatch(row -> row.getDouble(0) < 10);
In contrast to exists, the function forall returns true only if all rows pass the test.
iris.stream().filter(row -> row.getByte("class") == 1).findAny();
The find method returns the first row passes the test if it exists. Otherwise, it returns Optional.empty. Note that _("class") in the example returns an object of Integer because the nominal data are stored as integers (byte, short, or int, depending on the levels of measurements). To the string representation of class, one can use getString() method.
iris.stream().filter(row -> row.getString("class").equals("Iris-versicolor")).findAny();
Let's combine what we just learn into an example of filter.
var stream = iris.stream().filter(row -> row.getDouble(1) > 3 && row.getByte("class") != 0);
IO.println(DataFrame.of(iris.schema(), stream));
For data wrangling, the most important functions of DataFrame are map and groupBy.
var x6 = iris.stream().map(row -> {
var x = new double[6];
for (int i = 0; i < 4; i++) x[i] = row.getDouble(i);
x[4] = x[0] * x[1];
x[5] = x[2] * x[3];
return x;
});
var groups = iris.stream().collect(java.util.stream.Collectors.groupingBy(row -> row.getString("class")));
IO.println(groups);