Back to Smile

DataFrame

studio/src/universal/notebooks/data-frame.ipynb

6.1.03.6 KB
Original Source

DataFrame

Many Smile algorithms take simple double[] as input. But we also use the encapsulation class DataFrame. As shown in Data notebook, the output of most Smile data parsers is a DataFrame object. DataFrames are immutable and contain a fixed number of named columns.

java
import java.nio.file.*;
import java.time.*;
import org.apache.commons.csv.CSVFormat;
import smile.data.*;
import smile.data.vector.*;
import smile.io.*;
import smile.util.Index;

In this session, we will explore the functionality of DataFrame with the iris data. The iris data is from early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis.

java
var home = System.getProperty("smile.home");
var iris = Read.arff(home + "/data/weka/iris.arff");

First, let's check out the statistic summary of numeric columns in the data.

java
iris.describe();

We can get a row with the array syntax.

java
iris.get(0);

When selecting a row, it returns a Tuple, which is an immutable finite ordered list (sequence) of elements. Moreover, we can slice a DataFrame into a new one.

java
iris.get(Index.range(10, 20));

We can refer a column by its name and it returns a vector.

java
iris.column("sepallength");

Similarly, we can select a few columns to create a new data frame.

java
iris.select("sepallength", "sepalwidth");

Advanced operations such as exists, forall, find, filter are also supported. The predicate of these functions expect a Tuple.

java
iris.stream().anyMatch(row -> row.getDouble(0) > 4.5);

In this example, we test if there is any sample with sepallength > 4.5. Since sepallength is the first column, we use getDouble(0) to retrive the value in the predicate labmda. Note that Tuple allows generic access by get() method, which will incur boxing overhead for primitives. Therefore, Tuple also provides the native primitive access method getXXX(), where XXX is the type.

It is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null.

java
iris.stream().allMatch(row -> row.getDouble(0) < 10);

In contrast to exists, the function forall returns true only if all rows pass the test.

java
iris.stream().filter(row -> row.getByte("class") == 1).findAny();

The find method returns the first row passes the test if it exists. Otherwise, it returns Optional.empty. Note that _("class") in the example returns an object of Integer because the nominal data are stored as integers (byte, short, or int, depending on the levels of measurements). To the string representation of class, one can use getString() method.

java
iris.stream().filter(row -> row.getString("class").equals("Iris-versicolor")).findAny();

Let's combine what we just learn into an example of filter.

java
var stream = iris.stream().filter(row -> row.getDouble(1) > 3 && row.getByte("class") != 0);
IO.println(DataFrame.of(iris.schema(), stream));

For data wrangling, the most important functions of DataFrame are map and groupBy.

java
var x6 = iris.stream().map(row -> {
               var x = new double[6];
               for (int i = 0; i < 4; i++) x[i] = row.getDouble(i);
               x[4] = x[0] * x[1];
               x[5] = x[2] * x[3];
               return x;
           });
java
var groups = iris.stream().collect(java.util.stream.Collectors.groupingBy(row -> row.getString("class")));
IO.println(groups);