SMILE Base Module

smile-base is the foundation module of the SMILE (Statistical Machine Intelligence and Learning Engine) library. It provides all the data structures, mathematical primitives, statistical utilities, and I/O facilities that the rest of SMILE is built upon.


Package Map

| Package | Description |
| --- | --- |
| smile.data | DataFrame, Tuple, type system, measures, vectors |
| smile.data.formula | Formula language for model matrices |
| smile.data.transform | Feature-transformation pipelines |
| smile.datasets | Built-in benchmark datasets |
| smile.io | Data I/O: CSV, JSON, Parquet, ARFF, … |
| smile.tensor | Tensors and dense/sparse linear algebra |
| smile.math | Core math utilities, MathEx, special functions |
| smile.math.distance | Distance and metric functions |
| smile.math.kernel | Mercer kernel functions (SVM, GP, …) |
| smile.math.rbf | Radial basis functions |
| smile.math.random | Pseudorandom number generators |
| smile.math.BFGS | Quasi-Newton BFGS / L-BFGS optimization |
| smile.cs | Compressed sensing |
| smile.stat.distribution | Probability distributions |
| smile.stat.hypothesis | Hypothesis testing |
| smile.interpolation | 1-D and 2-D interpolation |
| smile.neighbor | Nearest-neighbor search (KD-tree, ball tree, LSH, …) |
| smile.graph | Graph data structures and algorithms |
| smile.gap | Genetic algorithm (GAP framework) |
| smile.ica | Independent component analysis |
| smile.hash | Non-cryptographic hash functions |
| smile.sort | Sorting and selection algorithms |
| smile.wavelet | Wavelet transforms |

Data

Package: smile.data, smile.data.type, smile.data.measure, smile.data.vector
Guide: DATA_FRAME.md

DataFrame is SMILE's primary in-memory tabular data structure — a typed, column-oriented table with named fields, rich type metadata, and a streaming API. Tuple is a single row of a DataFrame. The smile.data.type package defines a complete type system (DataType, StructType, …), and smile.data.measure provides nominal, ordinal, interval, and ratio measurement scales. Strongly-typed column vectors (IntVector, DoubleVector, StringVector, …) live in smile.data.vector.

java
DataFrame iris = Read.arff(Paths.getTestData("weka/iris.arff"));
DataFrame subset = iris.select("sepallength", "sepalwidth", "class");
double[] lengths = iris.column("sepallength").toDoubleArray();

Formula

Package: smile.data.formula
Guide: FORMULA.md

A compact, symbolic language for specifying model matrices (design matrices) from a DataFrame. Inspired by R's formula notation.

java
// All predictors except class
Formula allPredictors = Formula.lhs("class");

// Explicit terms with an interaction
Formula f = Formula.of("y", "x1", "x2", "x1:x2");

double[][] X = f.x(dataFrame).toArray();
double[]   y = f.y(dataFrame).toDoubleArray();

Data Transformation

Package: smile.data.transform
Guide: DATA_TRANSFORMATION.md

Composable, serializable feature-transformation pipelines. Each transformer is fitted on training data and then applied consistently to new data. Includes scaling (Standardizer, MaxAbsScaler, MinMaxScaler), encoding (OneHotEncoder, KBinsDiscretizer), imputation, and more.

java
var scaler = Standardizer.fit(trainDF);
DataFrame normalized = scaler.apply(trainDF);
DataFrame normalizedTest = scaler.apply(testDF);
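
The fit-then-apply contract can be illustrated without SMILE. The following is a minimal standalone sketch in plain Java, not SMILE's Standardizer: the class name `StandardizerSketch` and its methods are hypothetical, and it assumes at least two training rows. It estimates per-column mean and sample standard deviation once, then reuses those statistics for any later data.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a fitted scaler; not SMILE's Standardizer. */
final class StandardizerSketch {
    private final double[] mean;
    private final double[] std;

    private StandardizerSketch(double[] mean, double[] std) {
        this.mean = mean;
        this.std = std;
    }

    /** Learns column means and sample standard deviations from training rows. */
    static StandardizerSketch fit(double[][] train) {
        int n = train.length, p = train[0].length;
        double[] mean = new double[p];
        double[] std = new double[p];
        for (double[] row : train)
            for (int j = 0; j < p; j++) mean[j] += row[j] / n;
        for (double[] row : train)
            for (int j = 0; j < p; j++) {
                double d = row[j] - mean[j];
                std[j] += d * d / (n - 1);
            }
        for (int j = 0; j < p; j++) {
            std[j] = Math.sqrt(std[j]);
            if (std[j] == 0.0) std[j] = 1.0; // constant column: only center it
        }
        return new StandardizerSketch(mean, std);
    }

    /** Applies the statistics learned in fit(); never re-estimates them. */
    double[][] apply(double[][] data) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = new double[data[i].length];
            for (int j = 0; j < data[i].length; j++)
                out[i][j] = (data[i][j] - mean[j]) / std[j];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] train = {{1, 10}, {2, 20}, {3, 30}};
        double[][] test = {{4, 40}};
        StandardizerSketch scaler = fit(train);
        System.out.println(Arrays.deepToString(scaler.apply(train)));
        System.out.println(Arrays.deepToString(scaler.apply(test)));
    }
}
```

Because the test row is scaled with the training statistics, its values can fall outside the range seen during fitting, which is the intended behavior.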

Datasets

Package: smile.datasets
Guide: DATASET.md

Ready-to-use loaders for over 30 standard machine-learning benchmark datasets (Iris, MNIST, USPS, Breast Cancer, CPU, Abalone, …). Each loader returns structured data suitable for immediate use with SMILE classifiers, regressors, or clustering algorithms.

java
var iris      = Iris.load();
var mnist     = MNIST.load();
var breastCancer = BreastCancer.load();

Data I/O

Package: smile.io
Guide: DATA_IO.md

Unified read/write API for many tabular and hierarchical formats:

  • CSV / TSV / DSV
  • JSON (row-oriented)
  • Apache Parquet
  • Apache Avro
  • Apache Arrow
  • Weka ARFF
  • LibSVM sparse
  • Matrix Market
java
DataFrame df = Read.csv("data.csv");
DataFrame parquet = Read.parquet("data.parquet");
Write.csv(df, Paths.get("out.csv"));

Tensor & Linear Algebra

Package: smile.tensor, smile.linalg
Guide: TENSOR.md

Core numerical data structures used throughout SMILE:

  • DenseMatrix: double-precision dense matrix with BLAS/LAPACK backend
  • SparseMatrix: CSR compressed sparse matrix
  • BandMatrix, SymmMatrix: band and symmetric matrix storage
  • Cholesky, LU, QR, SVD, EVD: matrix decompositions
  • IMatrix / FloatMatrix: integer and single-precision variants
  • N-D Tensor: general multi-dimensional array (float32 / float64)
java
DenseMatrix A = DenseMatrix.of(new double[][]{{1,2},{3,4}});
DenseMatrix B = A.mm(A);          // matrix multiply
var svd = A.svd();                // full SVD
double[] b = {5, 6};              // right-hand side
double[] x = A.solve(b);          // least-squares solve

Math

Package: smile.math

MathEx is a large static utility class with fast implementations of common mathematical operations: log-sum-exp, softmax, entropy, distance helpers, combinatorics, special functions (gamma, beta, erf, …), and more. smile.math.special provides the underlying special-function implementations.

java
double lse = MathEx.logSumExp(logProbs);
MathEx.softmax(scores);
double h = MathEx.entropy(probs);
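
To show why log-sum-exp is usually computed by factoring out the maximum, here is a standalone sketch in plain Java. It is not MathEx's implementation; `LogSumExpSketch` and its method names are hypothetical. The softmax helper overwrites its argument in place, which is an assumption made for this sketch.

```java
import java.util.Arrays;

final class LogSumExpSketch {
    /** Computes log(sum_i exp(x[i])) without overflow by factoring out the maximum. */
    static double logSumExp(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) max = Math.max(max, v);
        if (Double.isInfinite(max)) return max; // empty input, all -inf, or a +inf entry
        double sum = 0.0;
        for (double v : x) sum += Math.exp(v - max);
        return max + Math.log(sum);
    }

    /** Overwrites x with exp(x[i] - logSumExp(x)); the results sum to 1 up to rounding. */
    static void softmax(double[] x) {
        double lse = logSumExp(x);
        for (int i = 0; i < x.length; i++) x[i] = Math.exp(x[i] - lse);
    }

    public static void main(String[] args) {
        double[] logProbs = {-1000.0, -1000.0};  // exp() of these underflows to 0
        System.out.println(logSumExp(logProbs)); // finite, close to -1000 + ln 2
        double[] scores = {1.0, 2.0, 3.0};
        softmax(scores);
        System.out.println(Arrays.toString(scores));
    }
}
```

A naive `Math.log(Math.exp(-1000) + Math.exp(-1000))` would return negative infinity for the first input, since both exponentials underflow to zero.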

Distances

Package: smile.math.distance
Guide: DISTANCES.md

A rich collection of distance and similarity metrics implementing the Distance<T> interface:

| Metric | Class |
| --- | --- |
| Euclidean | EuclideanDistance |
| Manhattan (L1) | ManhattanDistance |
| Chebyshev (L∞) | ChebyshevDistance |
| Minkowski (Lp) | MinkowskiDistance |
| Cosine | CosineDistance |
| Mahalanobis | MahalanobisDistance |
| Hamming | HammingDistance |
| Levenshtein (edit) | EditDistance |
| Jaccard | JaccardDistance |
| Dynamic Time Warping | DynamicTimeWarping |
| Sparse vectors | SparseEuclideanDistance, SparseManhattanDistance, … |
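
The Manhattan, Euclidean, and Chebyshev entries are all members of the Minkowski (Lp) family. The standalone sketch below makes that relationship concrete in plain Java; `MinkowskiSketch` is a hypothetical name and the code does not use or mirror SMILE's Distance<T> classes.

```java
final class MinkowskiSketch {
    /** L_p distance for p >= 1; p = 1 is Manhattan, p = 2 is Euclidean. */
    static double minkowski(double[] a, double[] b, double p) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        if (p < 1) throw new IllegalArgumentException("p must be >= 1");
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.pow(Math.abs(a[i] - b[i]), p);
        return Math.pow(sum, 1.0 / p);
    }

    /** L-infinity (Chebyshev) distance: the limit of L_p as p grows. */
    static double chebyshev(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        double max = 0.0;
        for (int i = 0; i < a.length; i++) max = Math.max(max, Math.abs(a[i] - b[i]));
        return max;
    }

    public static void main(String[] args) {
        double[] a = {0, 0}, b = {3, 4};
        System.out.println("L1   = " + minkowski(a, b, 1));
        System.out.println("L2   = " + minkowski(a, b, 2));
        System.out.println("L50  = " + minkowski(a, b, 50)); // already near the Chebyshev value
        System.out.println("Linf = " + chebyshev(a, b));
    }
}
```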

Kernels

Package: smile.math.kernel
Guide: KERNELS.md

Mercer kernel functions used in support vector machines, Gaussian processes, and kernel methods. All implement MercerKernel<T>.

| Kernel | Class |
| --- | --- |
| Linear | LinearKernel |
| Polynomial | PolynomialKernel |
| Gaussian (RBF) | GaussianKernel |
| Laplacian | LaplacianKernel |
| Matérn 3/2, 5/2 | Matern32Kernel, Matern52Kernel |
| Hyperbolic Tangent | HyperbolicTangentKernel |
| Sparse variants | SparseGaussianKernel, SparseLaplacianKernel, … |
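
As a concrete instance of a Mercer kernel, the sketch below evaluates a Gaussian kernel and builds its Gram matrix in plain Java. It assumes the common parameterization k(x, y) = exp(-‖x - y‖² / (2σ²)); `GaussianKernelSketch` is a hypothetical class and does not implement MercerKernel<T>.

```java
import java.util.Arrays;

final class GaussianKernelSketch {
    private final double sigma;

    GaussianKernelSketch(double sigma) {
        if (sigma <= 0) throw new IllegalArgumentException("sigma must be positive");
        this.sigma = sigma;
    }

    /** k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). */
    double k(double[] x, double[] y) {
        if (x.length != y.length) throw new IllegalArgumentException("length mismatch");
        double sq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d;
        }
        return Math.exp(-sq / (2.0 * sigma * sigma));
    }

    /** Gram matrix K[i][j] = k(data[i], data[j]); symmetric with ones on the diagonal. */
    double[][] gram(double[][] data) {
        int n = data.length;
        double[][] K = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j <= i; j++)
                K[i][j] = K[j][i] = k(data[i], data[j]);
        return K;
    }

    public static void main(String[] args) {
        GaussianKernelSketch kernel = new GaussianKernelSketch(1.0);
        double[][] K = kernel.gram(new double[][]{{0, 0}, {1, 0}, {0, 2}});
        System.out.println(Arrays.deepToString(K));
    }
}
```

The Gram matrix is what kernel methods such as SVMs and Gaussian processes consume; only the lower triangle is computed because the kernel is symmetric.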

Radial Basis Functions

Package: smile.math.rbf
Guide: RBF.md

Radial basis functions used for interpolation, RBF networks, and other kernel methods.

| RBF | Class |
| --- | --- |
| Gaussian | GaussianRadialBasis |
| Multiquadric | MultiquadricRadialBasis |
| Inverse Multiquadric | InverseMultiquadricRadialBasis |
| Thin-plate spline | ThinPlateSplineRadialBasis |

Random Number Generators

Package: smile.math.random
Guide: RNG.md

High-quality pseudorandom number generators and low-discrepancy (quasi-random) sequences:

  • MersenneTwister — MT19937, the default general-purpose PRNG
  • XoShiRo256StarStar, XoRoShiRo128StarStar — fast modern RNGs
  • Halton, SobolSequence — low-discrepancy quasi-random sequences
java
var rng = new MersenneTwister(42);
double u = rng.nextDouble();
int[] perm = rng.permutation(100);
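
To illustrate what a low-discrepancy sequence is, the following standalone sketch generates Halton points with the radical-inverse construction in plain Java. `HaltonSketch` is a hypothetical name; the choice of bases 2 and 3 and of starting at index 1 are assumptions of this sketch, not a description of SMILE's classes.

```java
import java.util.Arrays;

final class HaltonSketch {
    /** Radical inverse of i (i >= 1): mirrors the base-b digits of i about the radix point. */
    static double radicalInverse(int i, int base) {
        double result = 0.0;
        double f = 1.0 / base;
        while (i > 0) {
            result += f * (i % base);
            i /= base;
            f /= base;
        }
        return result;
    }

    /** First n points of the 2-D Halton sequence using the coprime bases 2 and 3. */
    static double[][] halton2D(int n) {
        double[][] points = new double[n][];
        for (int k = 0; k < n; k++)
            points[k] = new double[]{radicalInverse(k + 1, 2), radicalInverse(k + 1, 3)};
        return points;
    }

    public static void main(String[] args) {
        for (double[] p : halton2D(5)) System.out.println(Arrays.toString(p));
    }
}
```

In base 2 the sequence starts 1/2, 1/4, 3/4, 1/8: each new point lands in the largest gap left by the previous ones, which is why such sequences cover the unit square more evenly than pseudorandom draws.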

BFGS Optimization

Package: smile.math
Guide: BFGS.md

Quasi-Newton BFGS and L-BFGS unconstrained optimizers. The BFGS class minimizes a smooth, differentiable objective function given its gradient. L-BFGS uses a limited-memory approximation suitable for high-dimensional problems.

java
double[] x0 = {0.0, 0.0};
double[] xMin = BFGS.minimize(f, g, x0, 1e-6, 200);

Compressed Sensing

Package: smile.cs
Guide: COMPRESSED_SENSING.md

Sparse signal recovery from underdetermined linear measurements using the Basis Pursuit (BP) and LASSO formulations. Useful for compressive imaging, sparse regression, and dictionary learning.
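
For intuition about the LASSO formulation, the sketch below minimizes 0.5·‖Ax - b‖² + λ‖x‖₁ with iterative soft-thresholding (ISTA) in plain Java. ISTA is swapped in here as the simplest illustrative solver; it is not claimed to be the algorithm smile.cs uses. `IstaSketch`, its parameters, and the fixed step size are all assumptions, and the step must not exceed 1 / λmax(AᵀA) for the iteration to converge.

```java
import java.util.Arrays;

final class IstaSketch {
    /** Proximal operator of t * |v|: shrinks v toward zero by t and clips at zero. */
    static double softThreshold(double v, double t) {
        return Math.signum(v) * Math.max(Math.abs(v) - t, 0.0);
    }

    /** Minimizes 0.5 * ||A x - b||^2 + lambda * ||x||_1 with a fixed step size. */
    static double[] ista(double[][] A, double[] b, double lambda, double step, int iterations) {
        int m = A.length, n = A[0].length;
        double[] x = new double[n];
        for (int it = 0; it < iterations; it++) {
            double[] r = new double[m];           // residual A x - b at the current x
            for (int i = 0; i < m; i++) {
                r[i] = -b[i];
                for (int j = 0; j < n; j++) r[i] += A[i][j] * x[j];
            }
            for (int j = 0; j < n; j++) {
                double grad = 0.0;                // j-th entry of A^T r
                for (int i = 0; i < m; i++) grad += A[i][j] * r[i];
                x[j] = softThreshold(x[j] - step * grad, step * lambda);
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] A = {{1, 0, 1}, {0, 1, 1}};    // 2 measurements, 3 unknowns
        double[] b = {1, 1};
        System.out.println(Arrays.toString(ista(A, b, 0.1, 0.3, 500)));
    }
}
```

The L1 penalty is what drives small coefficients exactly to zero, which is how a sparse solution is selected from the infinitely many that fit an underdetermined system.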


Probability Distributions

Package: smile.stat.distribution
Guide: DISTRIBUTIONS.md

A comprehensive library of univariate and multivariate distributions. Univariate distributions implement Distribution (or DiscreteDistribution) and support PDF/PMF, CDF, quantile (inverse CDF), mean, variance, entropy, and random sampling; multivariate distributions expose the subset of these operations that is defined for vectors.

Continuous: GaussianDistribution, ExponentialDistribution, GammaDistribution, BetaDistribution, WeibullDistribution, LogNormalDistribution, TDistribution, FDistribution, ChiSquaredDistribution, CauchyDistribution, LogisticDistribution, UniformDistribution, EmpiricalDistribution, KernelDensityEstimation, …

Discrete: BinomialDistribution, PoissonDistribution, NegativeBinomialDistribution, GeometricDistribution, HypergeometricDistribution, DiscreteUniformDistribution, …

Multivariate: MultivariateGaussianDistribution, DirichletDistribution, MultivariateExponentialFamilyMixture, …

java
var gauss = new GaussianDistribution(0, 1);
double p = gauss.cdf(1.96);       // ≈ 0.975
double x = gauss.quantile(0.975); // ≈ 1.96
double[] samples = gauss.rand(1000);
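
The relationship between CDF, quantile, and sampling can be shown with a distribution whose functions have closed forms. This standalone sketch implements an exponential distribution in plain Java; `ExponentialSketch` is hypothetical and is not SMILE's ExponentialDistribution. Sampling by pushing a uniform draw through the quantile function (inverse-transform sampling) is an assumption of the sketch, not a statement about how SMILE samples.

```java
import java.util.Random;

final class ExponentialSketch {
    private final double lambda; // rate; the mean is 1 / lambda

    ExponentialSketch(double lambda) {
        if (lambda <= 0) throw new IllegalArgumentException("rate must be positive");
        this.lambda = lambda;
    }

    double pdf(double x) { return x < 0 ? 0.0 : lambda * Math.exp(-lambda * x); }

    /** 1 - exp(-lambda x), written with expm1 for accuracy near zero. */
    double cdf(double x) { return x < 0 ? 0.0 : -Math.expm1(-lambda * x); }

    /** Inverse CDF for p in [0, 1). */
    double quantile(double p) {
        if (p < 0 || p >= 1) throw new IllegalArgumentException("p must be in [0, 1)");
        return -Math.log1p(-p) / lambda;
    }

    /** Inverse-transform sampling: a uniform draw mapped through the quantile function. */
    double rand(Random rng) { return quantile(rng.nextDouble()); }

    public static void main(String[] args) {
        ExponentialSketch exp = new ExponentialSketch(2.0);
        double median = exp.quantile(0.5);
        System.out.println(median + " -> " + exp.cdf(median)); // cdf(quantile(p)) recovers p up to rounding
        Random rng = new Random(42);
        int n = 100_000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += exp.rand(rng);
        System.out.println(sum / n); // sample mean, close to 1 / lambda
    }
}
```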

Hypothesis Testing

Package: smile.stat.hypothesis
Guide: HYPOTHESIS_TESTING.md

Parametric and non-parametric tests for comparing means, variances, distributions, and correlations.

| Test | Class / Method |
| --- | --- |
| One-sample t-test | TTest.test(double[] x, double μ) |
| Two-sample t-test | TTest.test(double[] x, double[] y) |
| Paired t-test | TTest.pairedTest(double[] x, double[] y) |
| F-test (variance) | FTest.test(double[] x, double[] y) |
| Chi-squared goodness-of-fit | ChiSqTest.test(int[] counts, double[] prob) |
| Chi-squared independence | ChiSqTest.test(int[][] table) |
| Kolmogorov-Smirnov | KSTest.test(double[] x, Distribution d) |
| Two-sample KS | KSTest.test(double[] x, double[] y) |
| Spearman correlation | SpearmanTest |
| Kendall's τ | KendallTest |

All test objects expose a p-value field for decision making.
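
For reference, the statistic behind a one-sample t-test is t = (mean(x) - μ) / (s / √n) with n - 1 degrees of freedom, where s is the sample standard deviation. The standalone sketch below computes only that statistic in plain Java; `TStatisticSketch` is a hypothetical name, and turning t into a p-value requires the Student's t CDF, which is deliberately left out.

```java
final class TStatisticSketch {
    /** t = (mean(x) - mu) / (s / sqrt(n)); compare against Student's t with n - 1 d.o.f. */
    static double oneSampleT(double[] x, double mu) {
        int n = x.length;
        if (n < 2) throw new IllegalArgumentException("need at least two observations");
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double ss = 0.0;                          // sum of squared deviations
        for (double v : x) ss += (v - mean) * (v - mean);
        double s = Math.sqrt(ss / (n - 1));       // sample standard deviation
        return (mean - mu) / (s / Math.sqrt(n));  // infinite or NaN if all values are equal
    }

    public static void main(String[] args) {
        double[] x = {2, 4, 6, 8};
        System.out.println("t = " + oneSampleT(x, 3.0) + ", df = " + (x.length - 1));
    }
}
```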


Interpolation

Package: smile.interpolation
Guide: INTERPOLATION.md

Smooth function reconstruction from a discrete set of sample points.

1-D: LinearInterpolation, PolynomialInterpolation, SplineInterpolation (natural cubic spline), CubicSplineInterpolation, RBFInterpolation1D, KrigingInterpolation1D

2-D: BilinearInterpolation, BicubicInterpolation, CubicSplineInterpolation2D, RBFInterpolation2D, KrigingInterpolation2D, ShepardInterpolation2D

java
double[] x = {0, 1, 2, 3};
double[] y = {0, 1, 4, 9};
var spline = new CubicSplineInterpolation(x, y);
double v = spline.interpolate(1.5);   // close to 2.25 (the samples follow y = x²)
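
As a baseline for comparison with splines, here is a standalone piecewise-linear interpolator in plain Java. `LinearInterpolationSketch` is a hypothetical class, not SMILE's LinearInterpolation; extrapolating with the end segments outside the sampled range is a choice made for this sketch.

```java
import java.util.Arrays;

final class LinearInterpolationSketch {
    private final double[] x;
    private final double[] y;

    /** Expects x sorted in strictly ascending order, with one y value per x. */
    LinearInterpolationSketch(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need at least two matching samples");
        this.x = x.clone();
        this.y = y.clone();
    }

    double interpolate(double t) {
        int i = Arrays.binarySearch(x, t);
        if (i >= 0) return y[i];                        // t is exactly a sample point
        int hi = -i - 1;                                // index of the first x greater than t
        hi = Math.min(Math.max(hi, 1), x.length - 1);   // clamp: end segments extrapolate
        int lo = hi - 1;
        double w = (t - x[lo]) / (x[hi] - x[lo]);
        return y[lo] + w * (y[hi] - y[lo]);
    }

    public static void main(String[] args) {
        var line = new LinearInterpolationSketch(new double[]{0, 1, 2, 3}, new double[]{0, 1, 4, 9});
        System.out.println(line.interpolate(1.5)); // chord midpoint between (1, 1) and (2, 4)
    }
}
```

On samples of y = x², the chord between (1, 1) and (2, 4) gives 2.5 at x = 1.5 while the true value is 2.25; smoother interpolants such as cubic splines typically reduce this kind of error.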

Nearest Neighbor Search

Package: smile.neighbor
Guide: NEAREST_NEIGHBOR.md

Exact and approximate nearest-neighbor data structures.

| Structure | Class | Best for |
| --- | --- | --- |
| KD-tree | KDTree | Low-dimensional Euclidean data |
| Ball tree | BallTree | Arbitrary metric spaces |
| Cover tree | CoverTree | Expandable, metric spaces |
| LSH | LSH | High-dimensional approximate NN |
| MPLSH | MPLSH | Multi-probe LSH |

All implement NearestNeighborSearch<K,V> and support k-NN and range queries.

java
var kdtree = KDTree.of(data);
Neighbor<double[], double[]>[] nn = kdtree.knn(query, 5);
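
A brute-force scan is a useful reference when validating any of these index structures, since it returns the exact answer by definition. The sketch below is plain Java with a hypothetical class name, `BruteForceKnnSketch`; it returns row indices rather than SMILE's Neighbor objects.

```java
import java.util.Arrays;
import java.util.Comparator;

final class BruteForceKnnSketch {
    /** Returns the indices of the k rows of data closest to query, nearest first. */
    static int[] knn(double[][] data, double[] query, int k) {
        double[] dist = new double[data.length];
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            order[i] = i;
            for (int j = 0; j < query.length; j++) {
                double d = data[i][j] - query[j];
                dist[i] += d * d; // squared Euclidean distance preserves the ranking
            }
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> dist[idx]));
        int[] nearest = new int[Math.min(k, order.length)];
        for (int i = 0; i < nearest.length; i++) nearest[i] = order[i];
        return nearest;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {0.5, 0}};
        System.out.println(Arrays.toString(knn(data, new double[]{0, 0}, 2)));
    }
}
```

Each query costs O(n log n) here because of the full sort; tree and hashing structures exist to avoid touching every point.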

Graph

Package: smile.graph
Guide: GRAPH.md

Directed and undirected weighted graphs with adjacency-list and adjacency-matrix representations, plus classic graph algorithms.

  • AdjacencyList — sparse graph (recommended for most uses)
  • AdjacencyMatrix — dense graph
  • Algorithms: BFS, DFS, topological sort, shortest path (Dijkstra, Bellman-Ford), minimum spanning tree (Prim, Kruskal), strongly connected components (Tarjan)
java
var g = new AdjacencyList(6, false);  // 6 vertices, undirected
g.addEdge(0, 1, 1.0);
int[] order = g.bfs(0);
double[] dist = g.dijkstra(0);
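
To make the shortest-path entry concrete, here is a standalone Dijkstra sketch over an adjacency list in plain Java. `DijkstraSketch` and its nested `Edge` class are hypothetical and unrelated to SMILE's AdjacencyList; the sketch handles undirected graphs with non-negative weights only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class DijkstraSketch {
    private static final class Edge {
        final int to;
        final double weight;
        Edge(int to, double weight) { this.to = to; this.weight = weight; }
    }

    private final List<List<Edge>> adj = new ArrayList<>();

    DijkstraSketch(int vertices) {
        for (int i = 0; i < vertices; i++) adj.add(new ArrayList<>());
    }

    /** Adds an undirected edge with a non-negative weight. */
    void addEdge(int u, int v, double w) {
        if (w < 0) throw new IllegalArgumentException("Dijkstra needs non-negative weights");
        adj.get(u).add(new Edge(v, w));
        adj.get(v).add(new Edge(u, w));
    }

    /** Shortest distances from source; unreachable vertices stay at +infinity. */
    double[] dijkstra(int source) {
        double[] dist = new double[adj.size()];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0.0;
        // Queue entries are {distance, vertex}; stale entries are skipped when polled.
        PriorityQueue<double[]> queue = new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        queue.add(new double[]{0.0, source});
        while (!queue.isEmpty()) {
            double[] top = queue.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue;
            for (Edge e : adj.get(u)) {
                double candidate = dist[u] + e.weight;
                if (candidate < dist[e.to]) {
                    dist[e.to] = candidate;
                    queue.add(new double[]{candidate, e.to});
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        DijkstraSketch g = new DijkstraSketch(4);
        g.addEdge(0, 1, 1.0);
        g.addEdge(1, 2, 2.0);
        g.addEdge(0, 2, 5.0);
        System.out.println(Arrays.toString(g.dijkstra(0)));
    }
}
```

Re-inserting a vertex instead of decreasing its key keeps the code short at the cost of a somewhat larger queue.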

Genetic Algorithm

Package: smile.gap
Guide: GAP.md

A flexible genetic algorithm framework (GAP — Genetic Algorithm Platform) for combinatorial and continuous optimization. Users implement a Chromosome with fitness(), crossover(), and mutate() to define the problem; the GeneticAlgorithm driver handles selection, recombination, and termination.

java
var ga = new GeneticAlgorithm<>(population, 0.5, 0.01);
Chromosome best = ga.evolve(1000, 1e-4);

Independent Component Analysis

Package: smile.ica
Guide: ICA.md

FastICA for blind source separation. Recovers statistically independent components from a linear mixture of signals — widely used in signal processing, fMRI analysis, and feature extraction.

java
var ica = ICA.fit(X, 3);   // extract 3 independent components
double[][] S = ica.transform(X);

Hash Functions

Package: smile.hash
Guide: HASH.md

High-performance hash functions (mostly non-cryptographic) and locality-sensitive hashing (LSH) families.

| Hash | Class | Notes |
| --- | --- | --- |
| MurmurHash3 | MurmurHash3 | 32-bit / 128-bit; fast, high quality |
| SHA-1 | SHA | Cryptographic (SHA-1 used for LSH) |
| MinHash | MinHash | Jaccard similarity LSH |
| SimHash | SimHash | Cosine similarity LSH |
| Random projection LSH | RandomProjectionHash | Euclidean LSH |
| Cross-polytope LSH | CrossPolytope | Efficient Euclidean LSH |

Sorting & Selection

Package: smile.sort
Guide: SORT.md

Allocation-free, cache-friendly sorting and selection algorithms used internally throughout SMILE.

| Algorithm | Class / Method |
| --- | --- |
| Introsort (O(n log n)) | IntroSort |
| Heap sort | HeapSort |
| Shell sort | ShellSort |
| Quickselect (k-th element) | QuickSelect |
| Indexed sort | QuickSort (with index array) |
| Partition | QuickSelect.select(a, k) |
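
Selection of the k-th element can be done without a full sort. The following standalone quickselect in plain Java shows the idea; `QuickSelectSketch` is a hypothetical name and the middle-element pivot is a choice made for this sketch, not a description of SMILE's QuickSelect. It assumes the input contains no NaN values.

```java
final class QuickSelectSketch {
    /** Returns the k-th smallest element (k = 0 is the minimum); reorders a in place. */
    static double select(double[] a, int k) {
        if (k < 0 || k >= a.length) throw new IllegalArgumentException("k out of range");
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            double pivot = a[lo + (hi - lo) / 2];
            int i = lo, j = hi;
            while (i <= j) {                 // Hoare-style partition around the pivot value
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    double tmp = a[i];
                    a[i] = a[j];
                    a[j] = tmp;
                    i++;
                    j--;
                }
            }
            if (k <= j) hi = j;              // answer lies in the left part
            else if (k >= i) lo = i;         // answer lies in the right part
            else return a[k];                // j < k < i: a[k] equals the pivot
        }
        return a[k];
    }

    public static void main(String[] args) {
        double[] a = {5, 1, 4, 2, 3};
        System.out.println("median = " + select(a, a.length / 2));
    }
}
```

Only the side of the partition that contains index k is processed further, which gives linear expected time instead of the O(n log n) of a full sort.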

Wavelets

Package: smile.wavelet
Guide: WAVELET.md

Discrete Wavelet Transform (DWT) and Wavelet Packet Transform (WPT) for signal processing, compression, and multi-resolution analysis.

Families: Haar, Daubechies (D4–D20), Symlet (S8–S20), Coiflet (C6–C30), BestLocalized (BL14–BL20), Vaidyanathan.

java
double[] signal = ...;
var wavelet = new DaubechiesWavelet(4);   // D4
wavelet.transform(signal);               // in-place DWT
wavelet.inverse(signal);                 // reconstruction
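
The simplest member of these families, the Haar wavelet, fits in a few lines. The sketch below is a standalone multi-level orthonormal Haar transform in plain Java; `HaarSketch` is a hypothetical class, and its coefficient ordering (approximation first, then details from coarse to fine) and power-of-two length requirement are assumptions of the sketch rather than a description of SMILE's wavelet classes.

```java
import java.util.Arrays;

final class HaarSketch {
    private static final double S = Math.sqrt(0.5);

    private static void checkLength(int n) {
        if (n < 2 || (n & (n - 1)) != 0)
            throw new IllegalArgumentException("length must be a power of two");
    }

    /** In-place multi-level Haar DWT. */
    static void transform(double[] a) {
        checkLength(a.length);
        double[] tmp = new double[a.length];
        for (int n = a.length; n >= 2; n /= 2) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                tmp[i] = (a[2 * i] + a[2 * i + 1]) * S;        // smooth (approximation)
                tmp[half + i] = (a[2 * i] - a[2 * i + 1]) * S; // detail
            }
            System.arraycopy(tmp, 0, a, 0, n);
        }
    }

    /** In-place inverse of transform(): rebuilds the signal from coarse to fine. */
    static void inverse(double[] a) {
        checkLength(a.length);
        double[] tmp = new double[a.length];
        for (int n = 2; n <= a.length; n *= 2) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                tmp[2 * i] = (a[i] + a[half + i]) * S;
                tmp[2 * i + 1] = (a[i] - a[half + i]) * S;
            }
            System.arraycopy(tmp, 0, a, 0, n);
        }
    }

    public static void main(String[] args) {
        double[] signal = {4, 6, 10, 12, 8, 6, 5, 5};
        transform(signal);
        System.out.println(Arrays.toString(signal)); // wavelet coefficients
        inverse(signal);
        System.out.println(Arrays.toString(signal)); // original values up to rounding
    }
}
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared samples, which is what makes thresholding small coefficients a sensible compression strategy.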

Building

bash
# Build the module
./gradlew :base:build

# Run tests
./gradlew :base:test

# Generate Javadoc
./gradlew :base:javadoc

Dependencies: Java 25+, BLAS/LAPACK native libraries (optional — pure-Java fallback included).


SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.