SMILE Scala API

The smile-scala module is an idiomatic Scala shim over the SMILE Java library. Except smile.cas package (Computer Algebra System), it adds nothing algorithmic — every function ultimately delegates to the same Java fit, of, or constructor — but it replaces verbose Java patterns with concise, expressive Scala idioms:

Implicit conversions ("pimp-my-library") enrich DataFrame, Tuple, arrays, and String with domain-specific methods.
Operator DSL — R-style formula syntax (y ~ x1 + x2), NumPy-style array slicing (0 ~ 9 ~ 2), and linear-algebra operators (%*%, \).
Computer Algebra System — symbolic differentiation and simplification on scalar, vector, and matrix expressions.
Top-level functions with named default arguments replace long static-method argument lists.
Idiomatic object namespaces (read, write, gpr, validate, cv, loocv, bootstrap) group related operations without polluting the package namespace.
Macro-backed rendering in Scala 2.13 detects notebook environments (Zeppelin, Databricks) at compile time and routes show(…) to either an in-process Swing window or an HTML `` element.

The module depends on :core (ML), :base (data and I/O), :nlp, :plot, and :json.

Installation
Data I/O — read and write
DataFrame Extensions
Formula DSL
Math, Arrays, and Linear Algebra
Computer Algebra System (CAS)
Classification
Regression
Clustering
Dimensionality Reduction
Manifold Learning
Natural Language Processing
Sequence Labeling
Association Rule Mining
Wavelets
Model Validation
Plotting
Utility Helpers
Complete Examples

Installation

Add the module to your build.gradle.kts (for use inside this Gradle project):

kotlin

dependencies {
    implementation(project(":scala"))
}

Or, from SBT in a standalone project:

scala

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "<version>"

Import the relevant package objects at the top of each file. The most common imports are:

scala

import smile.io.*           // read, write
import smile.data.*         // DataFrame implicits, summary
import smile.data.formula.* // formula DSL: ~, +, -, ::, &&, ^
import smile.math.*         // PimpedInt, PimpedDouble, array extensions, linalg
import smile.classification.*
import smile.regression.*
import smile.clustering.*
import smile.manifold.*
import smile.nlp.*
import smile.validation.*

Data I/O — `read` and `write`

Both read and write are top-level Scala objects defined in smile.io. They serve as namespaces so you can write read.csv(…) instead of importing a static Java method.

Loading data

scala

import smile.io.*

// Auto-detect format from extension (.csv, .json, .arff, .parquet, .avro, .sas7bdat)
val df = read.data("path/to/file.csv")
val df = read.data("path/to/file.parquet")
val df = read.data("path/to/file.json", "multi-line")   // JSON mode hint

// CSV with options (all have defaults — delimiter=",", header=true, quote='"')
val df = read.csv("iris.csv")
val df = read.csv("data.tsv", delimiter = "\t")
val df = read.csv("data.csv", header = false, comment = '#')

// Other formats
val df = read.json("records.json")
val df = read.json("records.json", JSON.Mode.MULTI_LINE, schema)
val df = read.arff("weka.arff")
val df = read.sas("dataset.sas7bdat")
val df = read.arrow("data.arrow")
val df = read.avro("data.avro", schemaInputStream)
val df = read.parquet("data.parquet")
val ds = read.libsvm("data.libsvm")        // returns SparseDataset[Integer]
val (vertices, edges) = read.wavefront("mesh.obj")  // 3-D OBJ geometry

// JDBC result set
val df = read.jdbc(resultSet)

// Deserialize a previously serialized model
val model = read("model.bin")

Saving data

scala

import smile.io.*

// Serialize any Serializable object (e.g. a trained model)
write(model, "model.bin")

// Write a DataFrame
write.csv(df, "out.csv")
write.csv(df, "out.tsv", delimiter = "\t")
write.arff(df, "out.arff", "relation-name")
write.arrow(df, "out.arrow")

// Write raw arrays
write.array(predictions, "predictions.txt")          // one element per line
write.table(matrix, "matrix.csv", delimiter = ",")  // 2-D array to delimited file

DataFrame Extensions

When you import smile.data.*, implicit conversions enrich DataFrame and Tuple with Scala-idiomatic methods.

`DataFrameOps` — enriches `DataFrame`

scala

import smile.data.*

val df: DataFrame = read.csv("iris.csv")

// Select/drop columns by name or Range
val sub   = df.select("sepal.length", "sepal.width")
val fewer = df.drop("class")
val slice = df.of(0 until 100)         // row slice using Scala Range

// Functional operations
val row: Option[Tuple] = df.find(_.getInt("class") == 1)
val all: Boolean        = df.forall(_.getDouble("petal.length") > 0.0)
val any: Boolean        = df.exists(_.getDouble("sepal.length") > 7.0)
df.foreach(row => println(row))

val mapped: Array[Double] = df.map(_.getDouble(0))
val filtered: DataFrame   = df.filter(_.getDouble("sepal.length") > 5.0)
val (yes, no)             = df.partition(_.getInt("class") == 0)
val groups                = df.groupBy(_.getInt("class"))

// JSON conversion
val json: String = df.toJSON

`TupleOps` — enriches `Tuple`

scala

val t: Tuple = df.get(0)
val json: String = t.toJSON    // handles categorical fields correctly

Summary statistics (top-level)

scala

import smile.data.*

summary(intArray)     // prints min/Q1/median/mean/Q3/max for Array[Int]
summary(doubleArray)  // same for Array[Double]

Formula DSL

Import smile.data.formula.* to unlock an R-style formula language for specifying model structure.

Basic syntax

scala

import smile.data.formula.*

// y ~ x  means "predict y from x"
val f: Formula = "y" ~ "x"

// Include multiple terms
val f = "price" ~ "size" + "bedrooms" + "location"

// Exclude a term with unary -
val f = "y" ~ "." - "id"     // use all columns except "id"

// Intercept-only: just ". ~ ."

Interaction and crossing

scala

// Interaction term: a :: b  (a*b without main effects in R notation)
val f = "y" ~ "a" :: "b"

// Crossing (main effects + interactions): a && b
val f = "y" ~ "a" && "b"       // expands to a + b + a:b

// Degree on crossing
val f = "y" ~ ("a" && "b") ^ 3

Function terms

All common Math functions are available as Formula terms:

scala

val f = "y" ~ log("income") + sqrt("age") + "gender"
val f = "y" ~ abs("balance") + exp("rate")
// Available: abs, ceil, floor, round, rint, exp, expm1, log, log1p,
//            log10, log2, signum, sign, sqrt, cbrt, sin, cos, tan,
//            sinh, cosh, tanh, asin, acos, atan, ulp

Math, Arrays, and Linear Algebra

Import smile.math.* to get enriched numeric types, operator overloading for arrays and matrices, and many statistical/linear-algebra helpers.

Enriched primitives

scala

import smile.math.*

// PimpedInt — slice construction (Python-like)
val s: Slice = 0 ~ 9          // indices 0..9
val s: Slice = 0 ~ 9 ~ 2      // indices 0, 2, 4, 6, 8 (step 2)

// PimpedDouble — arithmetic with arrays and matrices
2.0 + someArray       // returns VectorExpression
3.0 * someMatrix      // returns MatrixExpression

Array extensions (`PimpedDoubleArray`, `PimpedArray2D`)

scala

import smile.math.*

val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)

a += b            // in-place element-wise addition
a -= b
a *= 2.0
a /= 2.0

// 2-D
val m: Array[Array[Double]] = …
m.toMatrix        // converts to DenseMatrix

// Sampling
val sample = a.sample(50)   // draw 50 elements without replacement

`VectorExpression` operators

scala

val u: VectorExpression = …
val v: VectorExpression = …

u + v           // element-wise addition  → VectorExpression
u - v
u * 3.0
u %*% v         // dot product           → Double (via simplify.toVector)

`MatrixExpression` operators

scala

val A: MatrixExpression = …
val B: MatrixExpression = …

A + B
A - B
A * B           // element-wise
A %*% B         // matrix multiplication (uses optimal chain order)
A.t             // transpose
A * v           // matrix-vector product

// Solve A x = b
val x = A \ b   // via LU or QR depending on shape

Linear algebra helpers (top-level in `smile.math`)

scala

import smile.math.*

zeros(3, 4)                   // 3×4 zero matrix
ones(3, 4)
eye(5)                        // identity
rand(3, 3)                    // uniform random
randn(3, 3)                   // Gaussian random

trace(A)
diag(A)                       // extract diagonal or build diagonal matrix
lu(A)                         // LU decomposition
qr(A)                         // QR decomposition
cholesky(A)
eig(A)                        // eigenvalues only
eigen(A)                      // full eigendecomposition
svd(A)
det(A)
rank(A)
inv(A)

Statistical tests (top-level)

scala

import smile.math.*

chisqtest(freq)               // Chi-squared goodness-of-fit
chisqtest2(x, y)              // Two-sample Chi-squared
ftest(x, y)                   // F-test for variance equality
ttest(x, mean)                // One-sample t-test
ttest2(x, y)                  // Two-sample t-test
ttest(x, y, paired = true)    // Paired t-test
kstest(x, dist)               // Kolmogorov-Smirnov
pearsontest(x, y)             // Pearson correlation
spearmantest(x, y)            // Spearman rank correlation
kendalltest(x, y)             // Kendall tau

// Contingency-table Chi-squared
chisqtest(table)

Special functions (top-level)

scala

import smile.math.*

beta(a, b); erf(x); erfc(x); gamma(x); lgamma(x); digamma(x)
inverf(p); inverfc(p); erfcc(x)

Computer Algebra System (CAS)

The smile.cas package provides symbolic scalars, vectors, and matrices. Import smile.cas.* to enable implicit conversions from Scala literals to CAS nodes.

Scalars

scala

import smile.cas.*

// Literals become CAS nodes automatically
val x: Var = "x"          // Var — symbolic variable
val a: Val = 3.14          // Val — numeric constant
val n: IntVal = 2          // integer constant

// Arithmetic
val expr = x * x + 2 * x + 1      // Scalar expression
val diff = expr.d("x")             // symbolic derivative w.r.t. x: 2*x + 2
val simplified = diff.simplify     // simplification

// Helper functions
val f = exp(x) + log(x) + sqrt(x) + sin(x) + cos(x) + tan(x)
val g = abs("y") + ceil("z") + floor("w")

Vectors

scala

import smile.cas.*

val v = Vector("a", "b", "c")     // 3-element symbolic vector
val u = Vector("x", "y")
val dot = v * u                    // dot product expression
val jac = v.d("x")                 // Jacobian w.r.t. scalar

Matrices

scala

import smile.cas.*

val M = Matrix("M")                // symbolic matrix variable
val N = Matrix("N")
val prod = M * N                   // symbolic matrix product
val inv  = M.inv                   // symbolic inverse
val grad = M.d("alpha")            // derivative w.r.t. scalar parameter

Classification

Import smile.classification.*. Every function is wrapped with time(…) which logs its wall-clock duration.

K-Nearest Neighbors

scala

import smile.classification.*

// From a pre-built KNN search structure
val model = knn(knnSearch, y, k = 5)

// Build automatically from feature matrix (custom distance)
val model = knn(x, y, k = 5, distance = new EuclideanDistance)

// Euclidean distance shortcut
val model = knn(x, y, k = 5)

Logistic Regression

scala

val model = logit(x, y,
  lambda  = 0.01,    // L2 regularization (0 = none)
  tol     = 1e-5,    // convergence tolerance
  maxIter = 500)

Maximum Entropy (Multinomial Logistic for sparse features)

scala

// x(i) is a sparse binary feature: array of non-zero feature indices
val model = maxent(x, y,
  p       = 50000,   // feature space dimension
  lambda  = 0.1,
  tol     = 1e-5,
  maxIter = 500)

Multilayer Perceptron

scala

import smile.model.mlp.*
import smile.util.function.TimeFunction

val layers = Array(
  Layer.input(4),
  Layer.sigmoid(20),
  Layer.mle(3, OutputFunction.SOFTMAX)
)

val model = mlp(x, y, layers,
  epochs      = 10,
  learningRate = TimeFunction.linear(0.01, 10000, 0.001),
  momentum     = TimeFunction.constant(0.0),
  weightDecay  = 0.0,
  rho          = 0.0,
  epsilon      = 1e-7)

RBF Network

scala

// Provide explicit RBF neurons
val neurons = RBF.fit(x, k = 10)
val model = rbfnet(x, y, neurons, normalized = false)

// Convenience: build Gaussian RBF with k-means automatically
val model = rbfnet(x, y, k = 10, normalized = false)

Support Vector Machine

scala

import smile.math.kernel.*

val kernel = new GaussianKernel(sigma = 1.0)

val model = svm(x, y, kernel,
  C      = 1.0,
  tol    = 1e-3,
  epochs = 1)

Decision Tree (CART)

scala

import smile.data.formula.*
import smile.model.cart.SplitRule

val model = cart(formula, data,
  splitRule = SplitRule.GINI,
  maxDepth  = 20,
  maxNodes  = 0,    // 0 = unlimited
  nodeSize  = 5)

Random Forest

scala

val model = randomForest(formula, data,
  ntrees      = 500,
  mtry        = 0,           // 0 = floor(sqrt(p))
  splitRule   = SplitRule.GINI,
  maxDepth    = 20,
  maxNodes    = 500,
  nodeSize    = 1,
  subsample   = 1.0,         // 1.0 = with replacement
  classWeight = null,
  seeds       = null)

Gradient Boosted Trees

scala

val model = gbm(formula, data,
  ntrees    = 500,
  maxDepth  = 20,
  maxNodes  = 6,
  nodeSize  = 5,
  shrinkage = 0.05,
  subsample = 0.7)

AdaBoost

scala

val model = adaboost(formula, data,
  ntrees   = 500,
  maxDepth = 20,
  maxNodes = 6,
  nodeSize = 1)

Discriminant Analysis

scala

// Fisher's Linear Discriminant
val model = fisher(x, y, L = -1, tol = 1e-4)

// Linear Discriminant Analysis
val model = lda(x, y, priori = null, tol = 1e-4)

// Quadratic Discriminant Analysis
val model = qda(x, y, priori = null, tol = 1e-4)

// Regularized Discriminant Analysis (blends LDA and QDA)
val model = rda(x, y,
  alpha  = 0.5,       // 0 = LDA, 1 = QDA
  priori = null,
  tol    = 1e-4)

Naive Bayes

scala

import smile.classification.DiscreteNaiveBayes

// Document classification with add-k smoothing
val model = naiveBayes(x, y,
  model  = DiscreteNaiveBayes.Model.MULTINOMIAL,
  priori = null,
  sigma  = 1.0)

// General form with continuous distributions
val model = naiveBayes(priori, condprob)

Multiclass Wrappers

scala

// One-vs-One (K*(K-1)/2 binary classifiers; max-wins voting)
val model = ovo(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }

// One-vs-Rest (K binary classifiers; highest confidence wins)
val model = ovr(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }

Both ovo and ovr accept any trainer function (Array[T], Array[Int]) => Classifier[T], expressed as a curried Scala lambda.

Regression

Import smile.regression.*.

Linear Models

scala

import smile.data.formula.*
import smile.regression.*

// Ordinary Least Squares
val model = lm(formula, data,
  method    = OLS.Method.QR,  // "svd" or "qr"
  stderr    = true,
  recursive = true)

// Ridge Regression (L2 penalty)
val model = ridge(formula, data, lambda = 0.1)

// LASSO (L1 penalty; produces sparse solutions)
val model = lasso(formula, data,
  lambda  = 0.1,
  tol     = 1e-3,
  maxIter = 5000)

Support Vector Regression

scala

val model = svm(x, y, kernel,
  eps = 0.1,   // epsilon-insensitive loss threshold
  C   = 1.0,   // soft-margin penalty
  tol = 1e-3)

Regression Tree and Ensembles

scala

// Single regression tree
val model = cart(formula, data, maxDepth = 20, maxNodes = 0, nodeSize = 5)

// Random Forest
val model = randomForest(formula, data,
  ntrees    = 500,
  mtry      = 0,
  maxDepth  = 20,
  maxNodes  = 500,
  nodeSize  = 5,
  subsample = 1.0)

// Gradient Boosted Trees
import smile.model.cart.Loss

val model = gbm(formula, data,
  loss      = Loss.lad(),   // least absolute deviation (robust default)
  ntrees    = 500,
  maxDepth  = 20,
  maxNodes  = 6,
  nodeSize  = 5,
  shrinkage = 0.05,
  subsample = 0.7)

Gaussian Process Regression

Grouped under the gpr object:

scala

import smile.regression.gpr
import smile.math.kernel.GaussianKernel

val kernel = new GaussianKernel(sigma = 1.0)

// Full GP — O(n³) in training, exact inference
val model = gpr(x, y, kernel,
  noise     = 0.01,
  normalize = true,
  tol       = 1e-5,
  maxIter   = 0)    // maxIter=0 skips hyperparameter optimization

// Subset-of-Regressors approximation (inducing points t ⊂ x)
val model = gpr.approx(x, y, t, kernel, noise = 0.01)

// Nyström approximation (inducing points may be external)
val model = gpr.nystrom(x, y, t, kernel, noise = 0.01)

RBF Network

scala

// Provide explicit neurons
val model = rbfnet(x, y, neurons, normalized = false)

// Convenience: Gaussian RBF via k-means
val model = rbfnet(x, y, k = 10)

Clustering

Import smile.clustering.*.

Hierarchical Clustering

scala

import smile.clustering.*

// Euclidean distance; method ∈ "single" | "complete" | "upgma" | "average" |
//                                      "upgmc" | "centroid" | "wpgma" |
//                                      "wpgmc" | "median" | "ward"
val hc = hclust(data, "ward")

// Custom distance
val hc = hclust(data, myDistance, "complete")

// Cut the dendrogram to obtain k clusters
val labels = hc.partition(k = 5)

Partitional Clustering

scala

// K-Means (best of 16 runs by default)
val km = kmeans(data, k = 5, maxIter = 100, runs = 16)
println(km.k)           // actual number of clusters
println(km.distortion)  // within-cluster sum of squared distances

// K-Modes (binary / categorical data)
val km = kmodes(data, k = 5, maxIter = 100, runs = 10)

// X-Means — automatically determines k using BIC
val xm = xmeans(data, k = 20)   // k is the upper bound

// G-Means — automatically determines k using Gaussian normality test
val gm = gmeans(data, k = 20)

// Deterministic Annealing
val da = dac(data, k = 10, alpha = 0.9)

// CLARANS (medoid-based; any distance)
val cl = clarans(data, myDistance, k = 5)

Density-Based Clustering

scala

// DBSCAN with Euclidean distance
val db = dbscan(data, minPts = 5, radius = 0.5)

// DBSCAN with custom distance
val db = dbscan(data, myDistance, minPts = 5, radius = 0.5)

// DBSCAN with pre-built RNN search structure
val db = dbscan(data, rnnSearch, minPts = 5, radius = 0.5)

// DENCLUE (kernel-density attractors)
val dc = denclue(data, sigma = 0.5, m = 50)

Information-Theoretic Clustering

scala

import smile.util.SparseArray

// SIB — co-occurrence data (e.g. document–word)
val sb = sib(sparseData, k = 10, maxIter = 100, runs = 8)

// MEC — minimum conditional entropy (works with any distance)
val mc = mec(data, myDistance, k = 10, radius = 0.5)
val mc = mec(data, myMetric,   k = 10, radius = 0.5)
val mc = mec(data,              k = 10, radius = 0.5)  // Euclidean shortcut

// Spectral Clustering
val sp = specc(data, k = 5, sigma = 1.0, l = 0, maxIter = 100)

Cluster assignments are accessed as:

scala

model.y          // Array[Int] of cluster labels (-1 = noise in DBSCAN)
model.centroids  // cluster centres (for centroid-based models)

Dimensionality Reduction

Import smile.feature.extraction.*. All methods are wrapped with time(…).

scala

import smile.feature.extraction.*

// PCA — Principal Component Analysis
val pca = pca(data)
val pca = pca(data, cor = true)       // use correlation matrix

// Probabilistic PCA (handles missing values)
val ppca = ppca(data, k = 10)

// Kernel PCA
import smile.math.kernel.GaussianKernel
val kpca = kpca(data, kernel = new GaussianKernel(1.0), k = 10)
val kpca = kpca(data, new GaussianKernel(1.0), k = 10, threshold = 1e-4)

// Generalized Hebbian Algorithm (online / incremental PCA)
val gha = gha(data, k = 10)
val gha = gha(data, k = 10, r = 0.0001)

After fitting, project new data:

scala

val embedding = pca.project(newData)
pca.setProjection(k)   // change number of retained components

Manifold Learning

Import smile.manifold.*. All methods return low-dimensional coordinate arrays (Array[Array[Double]]) or dedicated result objects.

scala

import smile.manifold.*

// Isomap — geodesic MDS (C-Isomap variant by default)
val coords = isomap(data, k = 10, d = 2, CIsomap = true)

// Locally Linear Embedding
val coords = lle(data, k = 10, d = 2)

// Laplacian Eigenmap
val coords = laplacian(data, k = 10, d = 2, t = -1.0)
// t > 0 uses Gaussian heat kernel; t ≤ 0 uses binary weights

// t-SNE (2-D or 3-D; input may be pre-computed distance matrix)
val result = tsne(data,
  d               = 2,
  perplexity      = 20.0,
  eta             = 200.0,
  earlyExaggeration = 12.0,
  maxIter         = 1000)
val coords = result.coordinates

// UMAP
val coords = umap(data,
  k                = 15,
  d                = 2,
  epochs           = 0,       // 0 = auto
  learningRate     = 1.0,
  minDist          = 0.1,
  spread           = 1.0,
  negativeSamples  = 5,
  repulsionStrength = 1.0)

// Classical MDS (equivalent to PCA when Euclidean distances are used)
val result = mds(proximity, d = 2)

// Non-metric (Kruskal) MDS
val result = isomds(proximity, d = 2, tol = 1e-4, maxIter = 200)

// Sammon Mapping
val result = sammon(proximity, d = 2, step = 0.2, maxIter = 100)

Natural Language Processing

Import smile.nlp.*. The implicit conversion pimpString enriches every String with NLP pipeline methods.

String extension methods

scala

import smile.nlp.*

val text = "Dr. Smith went to Washington D.C. He arrived on Tuesday."

// Unicode normalization (NFKC, whitespace normalization, quote normalization)
val clean = text.normalize

// Sentence splitting
val sentences: Array[String] = text.sentences

// Tokenization with stop-word filtering
val words: Array[String] = text.words                     // default stop list
val words: Array[String] = text.words("comprehensive")    // larger stop list
val words: Array[String] = text.words("none")             // no filtering
val words: Array[String] = text.words("the,a,an")         // custom stop list

// Bag-of-words (word → count)
val bag: Map[String, Int] = text.bag()                      // Porter stemming
val bag: Map[String, Int] = text.bag(stemmer = None)        // no stemming
val bag: Map[String, Int] = text.bag(filter = "google")

// Binary bag-of-words (presence/absence)
val bag2: Set[String] = text.bag2()

// Part-of-speech tagging (returns word–POS pairs)
val tagged: Array[(String, PennTreebankPOS)] = "She sells seashells".postag

// Keyword extraction
val keywords: Seq[NGram] = text.keywords(k = 10)

Corpus and n-gram utilities

scala

import smile.nlp.*

// Build an in-memory corpus
val corp = corpus(Seq("First document text.", "Second document text."))

// Bigram collocations
val topBigrams: Seq[Bigram] = bigram(k = 100, minFreq = 5, docs: _*)
val sigBigrams: Seq[Bigram] = bigram(p = 0.01, minFreq = 5, docs: _*)

// N-gram extraction (Apriori-style)
val grams: Array[Array[NGram]] = ngram(maxNGramSize = 3, minFreq = 3, docs: _*)

// HMM POS tagging on a pre-tokenised sentence
val tags: Array[PennTreebankPOS] = postag(Array("She", "sells", "seashells"))

Stemming

scala

import smile.nlp.*

porter.stem("running")    // "run"
lancaster.stem("running") // "run" (more aggressive)

Vectorization and TF-IDF

scala

import smile.nlp.*

// Term-frequency feature vector
val vocab    = Array("machine", "learning", "deep")
val features = vectorize(vocab, bag)          // Array[Double]
val sparse   = vectorize(vocab, bag2)         // Array[Int] (indices of present terms)

// Document frequency array
val dfreq: Array[Int] = df(vocab, corpusOfBags)

// Whole-corpus TF-IDF normalized to unit L2 norm
val matrix: Array[Array[Double]] = tfidf(corpusOfBags)

// Single document
val vec: Array[Double] = tfidf(bag, n = corpusSize, df = dfreq)

Sequence Labeling

Import smile.sequence.*.

scala

import smile.sequence.*

// Hidden Markov Model
val model = hmm(pi, a, b)          // from initial / transition / emission
val model = hmm(observations, k)   // learns from observation sequences

// Conditional Random Field (linear-chain)
val model = crf(x, y, feature, k, eta = 0.1, lambda = 0.1)

// CRF with Gaussian process smoothing
val model = gcrf(x, y, feature, k, eta = 0.1, lambda = 0.1)

Association Rule Mining

Import smile.association.*.

scala

import smile.association.*

val itemsets: Array[Array[Int]] = …

// Build FP-tree
val tree = fptree(itemsets)
val tree = fptree(itemsets.toStream)  // streaming variant

// Mine frequent item sets
val frequent = fpgrowth(tree, minSupport = 3)
val frequent = fpgrowth(itemsets, minSupport = 3)

// Generate association rules
val rules = arm(tree, minSupport = 3, confidence = 0.5)
val rules = arm(itemsets, minSupport = 3, confidence = 0.5)

Wavelets

Import smile.wavelet.*.

scala

import smile.wavelet.*

val wt = wavelet("D4")   // Daubechies-4 filter
// Available filters include: "Haar", "D4"–"D20" (even), "Coiflet1"–"Coiflet5", etc.

val signal = Array(1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0)

// In-place discrete wavelet transform
dwt(signal, wt)

// In-place inverse DWT
idwt(signal, wt)

// Wavelet shrinkage denoising (modifies in-place)
wsdenoise(signal, wt, soft = true)

Model Validation

Import smile.validation.*.

One-shot train/test evaluation

scala

import smile.validation.*

// With raw arrays
val result = validate.classification(x, y, testX, testY) { (x, y) =>
  randomForest(Formula.lhs("label"), DataFrame.of(x, y), ntrees = 100)
}

// With DataFrame + Formula
val result = validate.classification(formula, trainDf, testDf) { (f, df) =>
  randomForest(f, df)
}

// Regression variants
val result = validate.regression(x, y, testX, testY) { (x, y) => lm(…) }
val result = validate.regression(formula, train, test) { (f, df) => lm(f, df) }

Cross-Validation

scala

val cv5 = cv.classification(k = 5, formula, data) { (f, df) =>
  randomForest(f, df)
}
println(cv5.avg.accuracy)

// With raw arrays
val cv5 = cv.classification(k = 5, x, y) { (x, y) =>
  lda(x, y)
}

// Regression
val cv5r = cv.regression(k = 5, formula, data) { (f, df) => lm(f, df) }
val cv5r = cv.regression(k = 5, x, y) { (x, y) => ridge(…) }

Leave-One-Out CV

scala

val loo = loocv.classification(formula, data) { (f, df) => cart(f, df) }
val loo = loocv.regression(x, y) { (x, y) => lasso(…) }

Bootstrap

scala

val boot = bootstrap.classification(k = 100, x, y) { (x, y) => knn(x, y, 5) }
val boot = bootstrap.regression(k = 100, formula, data) { (f, df) => gbm(f, df) }

Individual metric functions

scala

import smile.validation.*

// Classification
val cm   = confusion(truth, predictions)
val acc  = accuracy(truth, predictions)
val rec  = recall(truth, predictions)
val prec = precision(truth, predictions)
val f1   = f1(truth, predictions)
val auc  = auc(truth, probabilities)
val ll   = logloss(truth, probabilities)
val ce   = crossentropy(truth, probMatrix)
val mcc  = mcc(truth, predictions)
val sens = sensitivity(truth, predictions)
val spec = specificity(truth, predictions)
val fo   = fallout(truth, predictions)
val fdr  = fdr(truth, predictions)

// Regression
val mseVal  = mse(truth, predictions)
val rmseVal = rmse(truth, predictions)
val rssVal  = rss(truth, predictions)
val madVal  = mad(truth, predictions)

// Clustering
val ri  = randIndex(labels1, labels2)
val ari = adjustedRandIndex(labels1, labels2)
val nmiVal = nmi(labels1, labels2)

Plotting

The module includes two complementary plot APIs:

smile.plot.swing — traditional Swing-based Canvas charts for desktop use.
smile.plot.vega — Vega-Lite declarative charts for notebooks and browser-based output.

Displaying a chart (`show`)

scala

import smile.plot.*

// Render a Canvas in a JFrame (desktop) or as HTML (notebook)
show(canvas)
show(multiFigurePane)
show(vegaLiteSpec)

In Scala 2.13 notebook environments the show implicit calls are backed by macros that detect Zeppelin/Databricks context at compile time and emit HTML `` tags instead of opening a Swing window.

Swing plots (`smile.plot.swing.*`)

Every chart returns a Canvas that can be passed to show(…).

scala

import smile.plot.swing.*

// Scatter plot
val c = plot(x, y, '.')             // Array[Double] x and y
val c = plot(data, labels, marks)   // colour-coded by class label

// Scatter-plot matrix
val c = splom(data, marks, colNames)

// Line plot
val c = line(x, y)
val c = staircase(x, y)

// Box plot
val c = boxplot(data)
val c = boxplot(groups, names)

// Histogram
val c = hist(data)
val c = hist(data, bins = 20)
val c = hist3(x, y, bins = 20)

// Q-Q plot
val c = qqplot(data)                    // vs normal
val c = qqplot(x, y)                    // two-sample
val c = qqplot(data, distribution)      // vs arbitrary distribution

// Heatmap and sparse matrix spy plot
val c = heatmap(matrix)
val c = spy(sparseMatrix)
val c = hexmap(data)

// Contour and surface
val c = contour(x, y, z)
val c = surface(z)
val c = wireframe(vertices, edges)
val c = grid(ax, ay, az)

// Dendrogram
val c = dendrogram(hierarchicalClustering)

// Scree plot (PCA)
val c = screeplot(pca)

// Text annotations
val c = text(coords, labels)

Vega-Lite charts (`smile.plot.vega.*`)

Build declarative specs using a fluent Scala API. The VegaLite companion object is the entry point.

scala

import smile.plot.vega.*

// Single view
val view = VegaLite.view()
  .mark("point")
  .x(Field("sepalLength", "quantitative"))
  .y(Field("petalLength", "quantitative"))
  .color(Field("species",  "nominal"))
  .data(irisDataFrame)

show(view)

// Layered chart (multiple marks in the same coordinate system)
val chart = VegaLite.layer(view1, view2)

// Faceted chart
val faceted = VegaLite.facet(view).row("origin").column("cylinders")

// Concatenated charts
val hcat = VegaLite.hconcat(view1, view2, view3)
val vcat = VegaLite.vconcat(view1, view2)

// Scatter-plot matrix
val splomChart = VegaLite.splom(irisDataFrame)

// Fluent global properties
val chart = VegaLite.view()
  .background("#f5f5f5")
  .padding(10)
  .config(JsObject("view" -> JsObject("stroke" -> JsString("transparent"))))

Utility Helpers

`time` — measure and log execution time

scala

import smile.util.time

// Block form — returns the value, logs elapsed time with a label
val model = time("Random Forest") {
  randomForest(formula, data, ntrees = 500)
}

// Toggle output
time.on()   // enable timing output (default)
time.off()  // suppress timing output
time.echo   // check current state

Implicit Java function converters

scala

import smile.util.{toJavaFunction, toJavaBiFunction}

// Convert Scala lambdas to java.util.function types automatically
val jf:   java.util.function.Function[Int, String]          = (i: Int) => i.toString
val jbf:  java.util.function.BiFunction[Int, Int, Int]      = (a: Int, b: Int) => a + b

These conversions are automatically applied wherever SMILE's Java API requires a Function or BiFunction — for example, when passing trainers to ovo, ovr, validate.classification, or cv.classification.

Complete Examples

Example 1 — Load data and train a classifier

scala

import smile.io.*
import smile.data.formula.*
import smile.classification.*
import smile.validation.*

val df      = read.csv("iris.csv")
val formula = "class" ~ "."

// 5-fold cross-validation on a random forest
val result = cv.classification(k = 5, formula, df) { (f, d) =>
  randomForest(f, d, ntrees = 100)
}

println(s"CV accuracy: ${result.avg.accuracy * 100 %.1f %%}")

Example 2 — Text classification pipeline

scala

import smile.io.*
import smile.nlp.*
import smile.classification.*
import smile.validation.*

val texts  = Array("great product", "terrible service", "very happy")
val labels = Array(1, 0, 1)

// Build vocabulary from training data
val bags  = texts.map(_.bag())
val vocab = bags.flatMap(_.keys).distinct.sorted

// Vectorise
val x = bags.map(b => vectorize(vocab, b))
val y = labels

// Train and evaluate
val result = cv.classification(k = 3, x, y) { (x, y) =>
  logit(x, y, lambda = 0.01)
}
println(result.avg.accuracy)

Example 3 — Regression with cross-validation

scala

import smile.io.*
import smile.data.formula.*
import smile.regression.*
import smile.validation.*

val longley = read.arff("data/regression/longley.arff")
val formula = "Employed" ~ "."

val cv5 = cv.regression(k = 5, formula, longley) { (f, df) =>
  lm(f, df)
}
println(f"RMSE: ${cv5.avg.rmse}%.4f")

Example 4 — Gaussian Process Regression with Nyström approximation

scala

import smile.regression.gpr
import smile.math.kernel.GaussianKernel

val kernel = new GaussianKernel(sigma = 1.0)

// Inducing inputs (e.g. k-means centroids of x)
import smile.clustering.*
val km = kmeans(x, k = 200)
val t  = km.centroids

val model = gpr.nystrom(x, y, t, kernel, noise = 0.01, normalize = true)
val predictions = x.map(model.predict)

Example 5 — NLP keyword extraction

scala

import smile.nlp.*

val text = """
  Machine learning is a field of artificial intelligence. It enables computers
  to learn from experience without being explicitly programmed.
"""

val keywords = text.keywords(k = 5)
keywords.foreach(ng => println(ng.words.mkString(" ")))

Example 6 — Manifold learning and visualization

scala

import smile.io.*
import smile.manifold.*
import smile.plot.swing.*
import smile.plot.*

val (x, _) = read.csv("mnist.csv").toArray …  // high-dimensional data

val embedding = umap(x, k = 15, d = 2)

val canvas = plot(embedding.map(_(0)), embedding.map(_(1)), '.')
show(canvas)

Example 7 — Symbolic differentiation with CAS

scala

import smile.cas.*

val x = "x"
val y = "y"

// Define f(x, y) = x² y + sin(x) y
val f = (x ** 2) * y + sin(x) * y

// Partial derivatives
val df_dx = f.d("x").simplify   // 2 x y + cos(x) y
val df_dy = f.d("y").simplify   // x² + sin(x)

println(df_dx)
println(df_dy)

// Evaluate at x=1, y=2
val env = Map("x" -> 1.0, "y" -> 2.0)
println(df_dx.apply(env))

Notable Differences from the Kotlin Shim

Aspect	Scala	Kotlin
Extension mechanism	Implicit classes (`PimpedXxx`) via `implicit def`	Extension functions
Formula DSL	Rich operator DSL: `~`, `+`, `-`, `::`, `&&`, `^`, function terms	Not present
CAS	Full symbolic algebra (`smile.cas`)	Not present
Plotting	Both Swing (`smile.plot.swing`) and Vega-Lite (`smile.plot.vega`)	Not present
Notebook rendering	Macro-detected at compile time (Scala 2.13)	N/A
Validation API	Object-based: `validate`, `cv`, `loocv`, `bootstrap`	Top-level functions
Sequence models	HMM, CRF, GCRF	Not present
Operator DSL	`%*%` (dot/matmul), `\` (solve), `~` (slice)	N/A
Array slicing	`0 ~ 9 ~ 2` (Python-like with step)	N/A
`gpr` namespace	`object gpr { apply, approx, nystrom }`	`object gpr` (same)

Both shims expose the same underlying Java algorithms. The Kotlin shim focuses on function-level conciseness; the Scala shim additionally provides a richer operator language and is more appropriate for exploratory notebook workflows that involve linear algebra, symbolic math, and interactive visualization.

SMILE Scala API

SMILE Scala API

Table of Contents

Installation

Data I/O — read and write

Loading data

Saving data

DataFrame Extensions

DataFrameOps — enriches DataFrame

TupleOps — enriches Tuple

Summary statistics (top-level)

Formula DSL

Basic syntax

Interaction and crossing

Function terms

Math, Arrays, and Linear Algebra

Enriched primitives

Array extensions (PimpedDoubleArray, PimpedArray2D)

VectorExpression operators

MatrixExpression operators

Linear algebra helpers (top-level in smile.math)

Statistical tests (top-level)

Special functions (top-level)

Computer Algebra System (CAS)

Scalars

Vectors

Matrices

Classification

K-Nearest Neighbors

Logistic Regression

Maximum Entropy (Multinomial Logistic for sparse features)

Multilayer Perceptron

RBF Network

Support Vector Machine

Decision Tree (CART)

Random Forest

Gradient Boosted Trees

AdaBoost

Discriminant Analysis

Naive Bayes

Multiclass Wrappers

Regression

Linear Models

Support Vector Regression

Regression Tree and Ensembles

Gaussian Process Regression

RBF Network

Clustering

Hierarchical Clustering

Partitional Clustering

Density-Based Clustering

Information-Theoretic Clustering

Dimensionality Reduction

Manifold Learning

Natural Language Processing

String extension methods

Corpus and n-gram utilities

Stemming

Vectorization and TF-IDF

Sequence Labeling

Association Rule Mining

Wavelets

Model Validation

One-shot train/test evaluation

Cross-Validation

Leave-One-Out CV

Bootstrap

Individual metric functions

Plotting

Displaying a chart (show)

Swing plots (smile.plot.swing.*)

Vega-Lite charts (smile.plot.vega.*)

Utility Helpers

time — measure and log execution time

Implicit Java function converters

Complete Examples

Example 1 — Load data and train a classifier

Example 2 — Text classification pipeline

Example 3 — Regression with cross-validation

Example 4 — Gaussian Process Regression with Nyström approximation

Data I/O — `read` and `write`

`DataFrameOps` — enriches `DataFrame`

`TupleOps` — enriches `Tuple`

Array extensions (`PimpedDoubleArray`, `PimpedArray2D`)

`VectorExpression` operators

`MatrixExpression` operators

Linear algebra helpers (top-level in `smile.math`)

Displaying a chart (`show`)

Swing plots (`smile.plot.swing.*`)

Vega-Lite charts (`smile.plot.vega.*`)

`time` — measure and log execution time