studio/src/universal/notebooks/plot-swing.ipynb
A picture is worth a thousand words. In machine learning, we usually handle high-dimensional data, which cannot be drawn on a display directly. Still, a variety of statistical plots are tremendously valuable for grasping the characteristics of many data points. Smile provides data visualization tools such as plots and maps that help researchers understand information more easily and quickly.
Smile provides many advanced interactive statistical plots built on Java's Swing graphics library. To render a Swing plot canvas in a notebook, we generate an image and embed it into HTML, so the interactive functionality is lost. To fully leverage Swing-based plots, we recommend using Smile's shell.
import java.awt.Color;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;
import smile.io.*;
import smile.stat.distribution.*;
import smile.tensor.*;
import smile.util.function.*;
import smile.interpolation.BicubicInterpolation;
import smile.plot.swing.*;
import static smile.swing.SmileUtilities.*;
import static java.lang.Math.*;
Now let's plot a heart. Math is beautiful, isn't it?
double[][] heart = new double[200][2];
for (int i = 0; i < heart.length; i++) {
    double t = PI * (i - 100) / 100;
    heart[i][0] = 16 * pow(sin(t), 3);
    heart[i][1] = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t);
}
var figure = LinePlot.of(heart, Color.RED).figure();
figure.setTitle("Mathematical Beauty");
show(figure);
Note that the class LinePlot encapsulates the plot specification. The function show does the rendering job.
A scatter plot displays data as a collection of points. The points can be color-coded, which is very useful for classification tasks. The mark parameter is a character that selects the symbol drawn for each data point.
For any unrecognized char, the data point will be drawn as a dot.
The class Figure can be used to control the plot programmatically. The user can also right-click to open the popup context menu to print the plot or change the title, axis labels, font, etc.
On the desktop, the user can zoom in and out with the mouse wheel. For a 2D plot, the user can shift the coordinates by moving the mouse after a double click, and can select an area with the mouse for a detailed view. For a 3D plot, the user can rotate the view by dragging the mouse.
var home = System.getProperty("smile.home");
var iris = Read.arff(home + "/data/weka/iris.arff");
var figure = ScatterPlot.of(iris, "sepallength", "sepalwidth", "class", '*').figure();
figure.setAxisLabels("sepallength", "sepalwidth");
figure.setTitle("Iris");
show(figure);
In this example, we plot the first two columns of Iris data. We use the class label for legend and color coding. It is also easy to draw a 3D plot.
var figure = ScatterPlot.of(iris, "sepallength", "sepalwidth", "petallength", "class", '*').figure();
figure.setAxisLabels("sepallength", "sepalwidth", "petallength");
figure.setTitle("Iris 3D");
show(figure);
However, the Iris data has four attributes, so even a 3D plot is not sufficient to see the whole picture. A general practice is to plot all the attribute pairs. For example,
var splom = MultiFigurePane.splom(iris, '*', "class");
var frame = show(splom);
frame.setTitle("Scatterplot Matrix");
The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.
Box plots can be useful to display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers.
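The five-number summary itself is easy to compute by hand. The following is a minimal sketch in plain Java, not Smile's implementation: the class FiveNumberSummary and its methods are hypothetical names, and the quantiles use linear interpolation between order statistics, which is only one of several common conventions and may differ from what BoxPlot uses internally.

```java
import java.util.Arrays;

class FiveNumberSummary {
    // Quantile of an ascending array by linear interpolation between
    // neighboring order statistics (one of several common conventions).
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns {minimum, first quartile, median, third quartile, maximum}.
    static double[] summary(double[] data) {
        double[] s = data.clone();   // leave the caller's array untouched
        Arrays.sort(s);
        return new double[] {
            s[0], quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s[s.length - 1]
        };
    }
}
```

For the input {7, 1, 3, 5, 9}, summary returns {1, 3, 5, 7, 9}.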
String[] labels = ((smile.data.measure.NominalScale) iris.schema().field("class").measure()).levels();
double[][] data = new double[labels.length][];
for (int i = 0; i < data.length; i++) {
    var label = labels[i];
    data[i] = iris.stream().
        filter(row -> row.getString("class").equals(label)).
        mapToDouble(row -> row.getFloat("sepallength")).
        toArray();
}
var figure = new BoxPlot(data, labels).figure();
figure.setAxisLabels("", "sepallength");
figure.setTitle("Box Plot");
show(figure);
A histogram is a graphical representation of the distribution of numerical data. The range of values is divided into a series of consecutive, non-overlapping intervals (bins). The bins must be adjacent and are usually of equal size. By default, the number of bins is 10. You may also specify an array of the breakpoints between bins.
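To make the binning concrete, here is a minimal sketch in plain Java that counts values in k equal-width bins and converts the counts to relative frequencies. It is an illustration only: HistogramSketch and its methods are hypothetical names, the code assumes the data is not constant (max > min), and Smile's own Histogram class may place bin edges differently.

```java
class HistogramSketch {
    // Counts how many values fall into each of k equal-width bins spanning [min, max].
    static int[] counts(double[] data, int k) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double x : data) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double width = (max - min) / k;   // assumes max > min
        int[] counts = new int[k];
        for (double x : data) {
            int bin = (int) ((x - min) / width);
            if (bin >= k) bin = k - 1;    // the maximum value belongs to the last bin
            counts[bin]++;
        }
        return counts;
    }

    // Converts counts to relative frequencies that sum to 1.
    static double[] frequencies(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double[] freq = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            freq[i] = (double) counts[i] / total;
        }
        return freq;
    }
}
```

With the ten values 0, 1, 2, 3, 4, 5, 6, 7, 8, 10 and k = 5, every bin receives two values, so each relative frequency is 0.2.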
Let's apply the histogram to an interesting data set: the wisdom of crowds. The original experiment took place about a hundred years ago at a county fair in England. The fair held a contest to guess the weight of an ox. Francis Galton calculated the average of all guesses, which was right to within one pound.
Recently, NPR Planet Money ran the experiment again. NPR posted a couple of pictures of a cow (named Penelope) and asked people to guess her weight. They got over 17,000 responses. The average of the guesses was 1,287 pounds, which is pretty close to Penelope's actual weight of 1,355 pounds.
var cow = Read.csv(home + "/data/stat/cow.txt").column("V1").toDoubleArray();
var figure = Histogram.of(cow, 50, true).figure();
figure.setAxisLabels("Weight", "Probability");
figure.setTitle("Cow Weight");
show(figure);
The histogram gives a rough sense of the distribution of the crowd's guesses, which has a long tail. After filtering out the weights over 3500 pounds, the histogram shows more detail.
var figure = Histogram.of(Arrays.stream(cow).filter(w -> w <= 3500).toArray(), 50, true).figure();
figure.setAxisLabels("Weight", "Probability");
figure.setTitle("Cow Weight <= 3500");
show(figure);
SMILE also supports histograms that display the distribution of 2-dimensional data. Here we generate a data set from a 2-dimensional Gaussian distribution.
double[] mu = {0.0, 0.0};
double[][] v = { {1.0, 0.6}, {0.6, 2.0} };
var gauss2d = new MultivariateGaussianDistribution(mu, DenseMatrix.of(v));
var data = Stream.generate(gauss2d::rand).limit(10000).toArray(double[][]::new);
var figure = Histogram3D.of(data, 50, false).figure();
show(figure);
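For readers curious how correlated Gaussian samples like these can be produced, one textbook approach goes through the Cholesky factor of the covariance matrix. The sketch below is plain Java restricted to the 2 x 2 case and is not a description of Smile's internals; Gaussian2dSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.Random;

class Gaussian2dSketch {
    // Cholesky factor L of a 2x2 symmetric positive-definite matrix,
    // i.e. a lower-triangular L with L * L^T = cov.
    static double[][] cholesky2x2(double[][] cov) {
        double l11 = Math.sqrt(cov[0][0]);
        double l21 = cov[1][0] / l11;
        double l22 = Math.sqrt(cov[1][1] - l21 * l21);
        return new double[][] { {l11, 0.0}, {l21, l22} };
    }

    // Draws one sample x = mu + L * z, where z holds two independent
    // standard normal values.
    static double[] sample(double[] mu, double[][] L, Random rng) {
        double z1 = rng.nextGaussian();
        double z2 = rng.nextGaussian();
        return new double[] {
            mu[0] + L[0][0] * z1,
            mu[1] + L[1][0] * z1 + L[1][1] * z2
        };
    }
}
```

For the covariance matrix v used above, the factor has L[0][0] = 1, L[1][0] = 0.6, and L[1][1] equal to the square root of 1.64.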
A Q–Q plot ("Q" stands for quantile) is a probability plot for comparing two probability distributions by plotting their quantiles against each other. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).
SMILE supports the Q-Q plot of samples against a given distribution and also of two sample sets. The second distribution or sample set is optional. If it is missing, the standard Gaussian distribution is assumed.
In what follows, we generate a random sample set from the standard Gaussian distribution and draw its Q-Q plot.
var gauss = new GaussianDistribution(0.0, 1.0);
var samples = DoubleStream.generate(gauss::rand).limit(1000).toArray();
var figure = QQPlot.of(samples).figure();
figure.setTitle("Q-Q Plot");
show(figure);
In fact, this is also a good visual way to verify the quality of our random number generator.
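The quantile pairing behind a two-sample Q-Q plot can be sketched in a few lines of plain Java. This is an illustration under stated assumptions rather than Smile's implementation: QQSketch and its methods are hypothetical names, the quantile levels (i + 0.5) / n are one common choice, and quantiles are linearly interpolated between order statistics.

```java
import java.util.Arrays;

class QQSketch {
    // Quantile of an ascending array by linear interpolation.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns n points (x, y): x is a quantile of the first sample and
    // y is the same quantile of the second sample.
    static double[][] points(double[] first, double[] second, int n) {
        double[] a = first.clone();
        double[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        double[][] q = new double[n][2];
        for (int i = 0; i < n; i++) {
            double p = (i + 0.5) / n;   // quantile level strictly inside (0, 1)
            q[i][0] = quantile(a, p);
            q[i][1] = quantile(b, p);
        }
        return q;
    }
}
```

If the second sample is an increasing linear transform of the first, all points fall on a straight line. That is why a roughly straight Q-Q plot suggests that two distributions have the same shape.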
A heat map is a graphical representation of data where the values in a matrix are represented as colors. In cluster analysis, researchers often employ the heat map by permuting the rows and the columns of a matrix to place similar values near each other according to the clustering.
In the example below, z is the matrix to display and the optional parameters x and y are the coordinates of data matrix cells, which must be in ascending order. Alternatively, one can also provide labels as the coordinates, which is a common practice in cluster analysis.
In what follows, we display the heat map of a matrix. We start with a small 4 x 4 matrix and enlarge it with bicubic interpolation. We also use the helper class Palette to generate the color scheme. This class provides many other color schemes.
// the matrix to display
double[][] z = {
    {1.0, 2.0, 4.0, 1.0},
    {6.0, 3.0, 5.0, 2.0},
    {4.0, 2.0, 1.0, 5.0},
    {5.0, 4.0, 2.0, 3.0}
};
// make the matrix larger with bicubic interpolation
double[] x = {0.0, 1.0, 2.0, 3.0};
double[] y = {0.0, 1.0, 2.0, 3.0};
var bicubic = new BicubicInterpolation(x, y, z);
var Z = new double[101][101];
for (int i = 0; i <= 100; i++) {
    for (int j = 0; j <= 100; j++)
        Z[i][j] = bicubic.interpolate(i * 0.03, j * 0.03);
}
var figure = Heatmap.of(Z, Palette.jet(256)).figure();
show(figure);
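Conceptually, a heat map assigns each matrix value a color from the palette. A minimal sketch of one such assignment, a linear mapping from the value range onto palette indices, is shown below in plain Java. ColorIndexSketch is a hypothetical name, and the exact mapping Smile's Heatmap uses may differ.

```java
class ColorIndexSketch {
    // Maps each matrix value linearly onto a palette index in [0, k - 1]:
    // the smallest value gets index 0 and the largest gets index k - 1.
    static int[][] colorIndex(double[][] z, int k) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] row : z) {
            for (double v : row) {
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        double range = max - min;
        int[][] index = new int[z.length][];
        for (int i = 0; i < z.length; i++) {
            index[i] = new int[z[i].length];
            for (int j = 0; j < z[i].length; j++) {
                // a constant matrix maps everything to the first color
                index[i][j] = range == 0 ? 0
                        : (int) Math.round((z[i][j] - min) / range * (k - 1));
            }
        }
        return index;
    }
}
```

For the matrix {{1, 2}, {3, 5}} and k = 5 colors, the resulting indices are {{0, 1}, {2, 4}}.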
A special case of heat map is to draw the sparsity pattern of a matrix.
The structure of sparse matrix is critical in solving linear systems.
var sparse = SparseMatrix.text(Path.of("base/src/test/resources/data/matrix/mesh2em5.txt"));
var figure = SparseMatrixPlot.of(sparse).figure();
figure.setTitle("mesh2em5");
show(figure);
A contour plot represents a 3-dimensional surface by plotting constant z slices, called contours, on a 2-dimensional format. That is, given a value for z, lines are drawn for connecting the (x, y) coordinates where that z value occurs.
Similar to heatmap, the parameters x and y are the coordinates of data matrix cells, which must be in ascending order. The slice values can be automatically determined from the data, or provided through the parameter levels.
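To illustrate how a contour line locates the coordinates where a given z value occurs, the sketch below finds the crossing points of one level along a single row of grid samples, using linear interpolation between neighboring samples. It is plain Java with hypothetical names (ContourSketch, crossings) and covers only this one step; a full contour algorithm also connects such points across grid cells, and Smile's Contour class may work differently.

```java
import java.util.ArrayList;
import java.util.List;

class ContourSketch {
    // Returns the x positions where the profile z, sampled at ascending
    // coordinates x, crosses the given level.
    static List<Double> crossings(double[] x, double[] z, double level) {
        List<Double> result = new ArrayList<>();
        for (int j = 0; j + 1 < z.length; j++) {
            double a = z[j] - level;
            double b = z[j + 1] - level;
            if (a * b < 0) {              // level lies strictly between the two samples
                double t = a / (a - b);   // fraction of the way from x[j] to x[j + 1]
                result.add(x[j] + t * (x[j + 1] - x[j]));
            }
        }
        return result;
    }
}
```

For x = {0, 1, 2, 3}, z = {1, 3, 2, 0} and level 2.5, the profile crosses the level at x = 0.75 and x = 1.5. Samples that lie exactly on the level are ignored by the strict inequality in this sketch.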
Contours are often used together with the heat map. In the following example, we add the contour lines to the previous heat map example.
var figure = Heatmap.of(Z, 256).figure();
figure.add(Contour.of(Z));
show(figure);
This example also shows how to mix multiple plots together.
Besides the heat map and contour, we can also visualize a matrix with a three-dimensional shaded surface.
The usage is similar to that of the heat map and contour functions.
var figure = Surface.of(Z, Palette.jet(256, 1.0f)).figure();
figure.setTitle("Surface Plot");
show(figure);