scala/README.md
The smile-scala module is an idiomatic Scala shim over the SMILE Java library.
Except for the smile.cas package (Computer Algebra System), it adds nothing
algorithmic: every function ultimately delegates to the same Java fit, of, or
constructor. What it does is replace verbose Java patterns with concise, expressive Scala idioms:
- Implicit conversions enrich DataFrame, Tuple, arrays, and String with domain-specific methods.
- Operator DSLs cover R-style formulas (y ~ x1 + x2), NumPy-style array slicing (0 ~ 9 ~ 2), and linear-algebra operators (%*%, \).
- Top-level object namespaces (read, write, gpr, validate, cv, loocv, bootstrap) group related operations without polluting the package namespace.
- Plots render through show(…) to either an in-process Swing window or an HTML element.

The module depends on :core (ML), :base (data and I/O), :nlp, :plot, and
:json.
Add the module to your build.gradle.kts (for use inside this Gradle project):
dependencies {
implementation(project(":scala"))
}
Or, from SBT in a standalone project:
libraryDependencies += "com.github.haifengl" %% "smile-scala" % "<version>"
Import the relevant package objects at the top of each file. The most common imports are:
import smile.io.* // read, write
import smile.data.* // DataFrame implicits, summary
import smile.data.formula.* // formula DSL: ~, +, -, ::, &&, ^
import smile.math.* // PimpedInt, PimpedDouble, array extensions, linalg
import smile.classification.*
import smile.regression.*
import smile.clustering.*
import smile.manifold.*
import smile.nlp.*
import smile.validation.*
read and write

Both read and write are top-level Scala objects defined in smile.io.
They serve as namespaces so you can write read.csv(…) instead of importing a
static Java method.
import smile.io.*
// Auto-detect format from extension (.csv, .json, .arff, .parquet, .avro, .sas7bdat)
val df = read.data("path/to/file.csv")
val df = read.data("path/to/file.parquet")
val df = read.data("path/to/file.json", "multi-line") // JSON mode hint
// CSV with options (all have defaults — delimiter=",", header=true, quote='"')
val df = read.csv("iris.csv")
val df = read.csv("data.tsv", delimiter = "\t")
val df = read.csv("data.csv", header = false, comment = '#')
// Other formats
val df = read.json("records.json")
val df = read.json("records.json", JSON.Mode.MULTI_LINE, schema)
val df = read.arff("weka.arff")
val df = read.sas("dataset.sas7bdat")
val df = read.arrow("data.arrow")
val df = read.avro("data.avro", schemaInputStream)
val df = read.parquet("data.parquet")
val ds = read.libsvm("data.libsvm") // returns SparseDataset[Integer]
val (vertices, edges) = read.wavefront("mesh.obj") // 3-D OBJ geometry
// JDBC result set
val df = read.jdbc(resultSet)
// Deserialize a previously serialized model
val model = read("model.bin")
import smile.io.*
// Serialize any Serializable object (e.g. a trained model)
write(model, "model.bin")
// Write a DataFrame
write.csv(df, "out.csv")
write.csv(df, "out.tsv", delimiter = "\t")
write.arff(df, "out.arff", "relation-name")
write.arrow(df, "out.arrow")
// Write raw arrays
write.array(predictions, "predictions.txt") // one element per line
write.table(matrix, "matrix.csv", delimiter = ",") // 2-D array to delimited file
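The object-as-namespace pattern behind read and write can be illustrated without SMILE. The sketch below is a stand-in under stated assumptions: the object `load`, its `lines` and `fields` methods, and the sample text are hypothetical names chosen for illustration, not part of SMILE's API. It shows how a single Scala object can be callable through `apply` while also grouping related methods that take default arguments.

```scala
// Hypothetical stand-in for an object namespace such as `read`.
object load {
  // `load(text)` works because the object defines `apply`.
  def apply(text: String): String = text.trim

  // `load.lines(text)` is an ordinary method on the same object.
  def lines(text: String): Array[String] =
    text.split("\n").map(_.trim).filter(_.nonEmpty)

  // Default arguments let callers override only the options they need.
  def fields(line: String, delimiter: String = ",", comment: Char = '#'): Array[String] =
    if (line.startsWith(comment.toString)) Array.empty[String]
    else line.split(java.util.regex.Pattern.quote(delimiter)).map(_.trim)
}

// Both call styles share one name and need no static imports.
val rows = load.lines("a,b\n1,2\n").map(line => load.fields(line))
val tabbed = load.fields("x\ty", delimiter = "\t")
```

Because the methods live on an object rather than in the package, importing the package brings in one name per namespace instead of one name per operation.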
When you import smile.data.*, implicit conversions enrich DataFrame and
Tuple with Scala-idiomatic methods.
DataFrameOps: enriches DataFrame
import smile.data.*
val df: DataFrame = read.csv("iris.csv")
// Select/drop columns by name or Range
val sub = df.select("sepal.length", "sepal.width")
val fewer = df.drop("class")
val slice = df.of(0 until 100) // row slice using Scala Range
// Functional operations
val row: Option[Tuple] = df.find(_.getInt("class") == 1)
val all: Boolean = df.forall(_.getDouble("petal.length") > 0.0)
val any: Boolean = df.exists(_.getDouble("sepal.length") > 7.0)
df.foreach(row => println(row))
val mapped: Array[Double] = df.map(_.getDouble(0))
val filtered: DataFrame = df.filter(_.getDouble("sepal.length") > 5.0)
val (yes, no) = df.partition(_.getInt("class") == 0)
val groups = df.groupBy(_.getInt("class"))
// JSON conversion
val json: String = df.toJSON
TupleOps: enriches Tuple
val t: Tuple = df.get(0)
val json: String = t.toJSON // handles categorical fields correctly
import smile.data.*
summary(intArray) // prints min/Q1/median/mean/Q3/max for Array[Int]
summary(doubleArray) // same for Array[Double]
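The enrichment mechanism described above, where importing a package adds methods to an existing type, can be sketched in plain Scala with an implicit class. Everything in this example is hypothetical (`Record`, `RecordSyntax`, `column`, `where`); it only illustrates the general technique and makes no claim about how DataFrameOps is implemented.

```scala
// Hypothetical record type standing in for a Java class we cannot modify.
final class Record(val values: Array[Double])

object RecordSyntax {
  // Importing this implicit class makes its methods available on any Seq[Record].
  implicit class RecordSeqOps(private val rows: Seq[Record]) {
    // Extract one column as a Scala collection.
    def column(i: Int): Seq[Double] = rows.map(_.values(i))
    // Keep only the rows that satisfy the predicate.
    def where(p: Record => Boolean): Seq[Record] = rows.filter(p)
  }
}

import RecordSyntax.RecordSeqOps

val table = Seq(new Record(Array(5.1, 3.5)), new Record(Array(7.0, 3.2)))
val firstColumn = table.column(0)
val wide = table.where(_.values(0) > 6.0)
```

The wrapped type stays untouched; the extra methods exist only where the implicit class is in scope.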
Import smile.data.formula.* to unlock an R-style formula language for
specifying model structure.
import smile.data.formula.*
// y ~ x means "predict y from x"
val f: Formula = "y" ~ "x"
// Include multiple terms
val f = "price" ~ "size" + "bedrooms" + "location"
// Exclude a term with unary -
val f = "y" ~ "." - "id" // use all columns except "id"
// Intercept-only: just ". ~ ."
// Interaction term: a :: b (a:b in R notation, i.e. the interaction without main effects)
val f = "y" ~ "a" :: "b"
// Crossing (main effects + interactions): a && b
val f = "y" ~ "a" && "b" // expands to a + b + a:b
// Degree on crossing
val f = "y" ~ ("a" && "b") ^ 3
All common Math functions are available as Formula terms:
val f = "y" ~ log("income") + sqrt("age") + "gender"
val f = "y" ~ abs("balance") + exp("rate")
// Available: abs, ceil, floor, round, rint, exp, expm1, log, log1p,
// log10, log2, signum, sign, sqrt, cbrt, sin, cos, tan,
// sinh, cosh, tanh, asin, acos, atan, ulp
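To see why an expression such as "price" ~ "size" + "bedrooms" can build a single formula, it helps to recall that Scala gives `~` higher precedence than `+` and `-`, so the response is bound to the first term before the remaining terms are appended. The following sketch shows that mechanism with hypothetical names (`FormulaSketch`, `Model`, `Response`); it is not SMILE's Formula class and supports only `~`, `+`, and `-`.

```scala
object FormulaSketch {
  // Hypothetical minimal formula: a response plus included and excluded terms.
  final case class Model(response: String,
                         include: Vector[String],
                         exclude: Vector[String] = Vector.empty) {
    def +(term: String): Model = copy(include = include :+ term)
    def -(term: String): Model = copy(exclude = exclude :+ term)
    override def toString: String = {
      val included = include.mkString(" + ")
      val excluded = exclude.map(t => " - " + t).mkString
      response + " ~ " + included + excluded
    }
  }

  // Adds `~` to String so that a bare string can start a formula.
  implicit class Response(private val name: String) {
    def ~(term: String): Model = Model(name, Vector(term))
  }
}

import FormulaSketch.Response

// Parsed as (("price" ~ "size") + "bedrooms") - "id".
val m = "price" ~ "size" + "bedrooms" - "id"
```

Since `~` is applied first, the later `+` and `-` calls resolve to the methods on the formula object rather than to string concatenation.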
Import smile.math.* to get enriched numeric types, operator overloading for
arrays and matrices, and many statistical/linear-algebra helpers.
import smile.math.*
// PimpedInt — slice construction (Python-like)
val s: Slice = 0 ~ 9 // indices 0..9
val s: Slice = 0 ~ 9 ~ 2 // indices 0, 2, 4, 6, 8 (step 2)
// PimpedDouble — arithmetic with arrays and matrices
2.0 + someArray // returns VectorExpression
3.0 * someMatrix // returns MatrixExpression
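The slice syntax 0 ~ 9 ~ 2 relies on the same enrichment idea applied to Int. A minimal sketch follows, assuming an inclusive end index (consistent with the "indices 0..9" comment above, but still an assumption); `SliceSketch`, `Span`, and `SpanStart` are hypothetical names and not the SMILE Slice type.

```scala
object SliceSketch {
  // Hypothetical slice with an inclusive end and an optional step.
  final case class Span(start: Int, end: Int, step: Int = 1) {
    // A second `~` sets the step: (0 ~ 9) ~ 2.
    def ~(newStep: Int): Span = copy(step = newStep)
    def indices: Range = start to end by step
  }

  // Adds a binary `~` to Int so that `0 ~ 9` builds a Span.
  implicit class SpanStart(private val start: Int) {
    def ~(end: Int): Span = Span(start, end)
  }
}

import SliceSketch.SpanStart

val every = (0 ~ 9).indices.toList
val evens = (0 ~ 9 ~ 2).indices.toList
```

Infix operators with the same name associate to the left, which is why the step can simply be chained after the end index.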
Array extensions (PimpedDoubleArray, PimpedArray2D)
import smile.math.*
val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)
a += b // in-place element-wise addition
a -= b
a *= 2.0
a /= 2.0
// 2-D
val m: Array[Array[Double]] = …
m.toMatrix // converts to DenseMatrix
// Sampling
val sample = a.sample(50) // draw 50 elements without replacement
VectorExpression operators
val u: VectorExpression = …
val v: VectorExpression = …
u + v // element-wise addition → VectorExpression
u - v
u * 3.0
u %*% v // dot product → Double (via simplify.toVector)
MatrixExpression operators
val A: MatrixExpression = …
val B: MatrixExpression = …
A + B
A - B
A * B // element-wise
A %*% B // matrix multiplication (uses optimal chain order)
A.t // transpose
A * v // matrix-vector product
// Solve A x = b
val x = A \ b // via LU or QR depending on shape
Linear algebra (smile.math)
import smile.math.*
zeros(3, 4) // 3×4 zero matrix
ones(3, 4)
eye(5) // identity
rand(3, 3) // uniform random
randn(3, 3) // Gaussian random
trace(A)
diag(A) // extract diagonal or build diagonal matrix
lu(A) // LU decomposition
qr(A) // QR decomposition
cholesky(A)
eig(A) // eigenvalues only
eigen(A) // full eigendecomposition
svd(A)
det(A)
rank(A)
inv(A)
import smile.math.*
chisqtest(freq) // Chi-squared goodness-of-fit
chisqtest2(x, y) // Two-sample Chi-squared
ftest(x, y) // F-test for variance equality
ttest(x, mean) // One-sample t-test
ttest2(x, y) // Two-sample t-test
ttest(x, y, paired = true) // Paired t-test
kstest(x, dist) // Kolmogorov-Smirnov
pearsontest(x, y) // Pearson correlation
spearmantest(x, y) // Spearman rank correlation
kendalltest(x, y) // Kendall tau
// Contingency-table Chi-squared
chisqtest(table)
import smile.math.*
beta(a, b); erf(x); erfc(x); gamma(x); lgamma(x); digamma(x)
inverf(p); inverfc(p); erfcc(x)
The smile.cas package provides symbolic scalars, vectors, and matrices.
Import smile.cas.* to enable implicit conversions from Scala literals to CAS
nodes.
import smile.cas.*
// Literals become CAS nodes automatically
val x: Var = "x" // Var — symbolic variable
val a: Val = 3.14 // Val — numeric constant
val n: IntVal = 2 // integer constant
// Arithmetic
val expr = x * x + 2 * x + 1 // Scalar expression
val diff = expr.d("x") // symbolic derivative w.r.t. x: 2*x + 2
val simplified = diff.simplify // simplification
// Helper functions
val f = exp(x) + log(x) + sqrt(x) + sin(x) + cos(x) + tan(x)
val g = abs("y") + ceil("z") + floor("w")
import smile.cas.*
val v = Vector("a", "b", "c") // 3-element symbolic vector
val u = Vector("x", "y")
val dot = v * u // dot product expression
val jac = v.d("x") // Jacobian w.r.t. scalar
import smile.cas.*
val M = Matrix("M") // symbolic matrix variable
val N = Matrix("N")
val prod = M * N // symbolic matrix product
val inv = M.inv // symbolic inverse
val grad = M.d("alpha") // derivative w.r.t. scalar parameter
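For readers unfamiliar with symbolic differentiation, the core idea can be sketched as a small expression tree with a recursive derivative. This is an illustration only: `CasSketch` and its case classes are hypothetical, there is no simplification step, and nothing here reflects how smile.cas is actually implemented.

```scala
object CasSketch {
  sealed trait Expr
  final case class Const(value: Double) extends Expr
  final case class Sym(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr

  // Symbolic derivative with respect to the variable named `x`.
  def d(e: Expr, x: String): Expr = e match {
    case Const(_)  => Const(0.0)
    case Sym(name) => Const(if (name == x) 1.0 else 0.0)
    case Add(a, b) => Add(d(a, x), d(b, x))                  // sum rule
    case Mul(a, b) => Add(Mul(d(a, x), b), Mul(a, d(b, x)))  // product rule
  }

  // Numeric evaluation given values for the variables.
  def eval(e: Expr, env: Map[String, Double]): Double = e match {
    case Const(v)  => v
    case Sym(name) => env(name)
    case Add(a, b) => eval(a, env) + eval(b, env)
    case Mul(a, b) => eval(a, env) * eval(b, env)
  }
}

import CasSketch.{Add, Const, Mul, Sym, d, eval}

// x * x + 2 * x + 1, whose derivative is 2 * x + 2.
val poly = Add(Add(Mul(Sym("x"), Sym("x")), Mul(Const(2.0), Sym("x"))), Const(1.0))
val slope = eval(d(poly, "x"), Map("x" -> 3.0))
```

Without simplification the derivative tree contains terms such as 0 * x, which is why a CAS pairs differentiation with a simplify step.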
Import smile.classification.*. Every function is wrapped with time(…)
which logs its wall-clock duration.
import smile.classification.*
// From a pre-built KNN search structure
val model = knn(knnSearch, y, k = 5)
// Build automatically from feature matrix (custom distance)
val model = knn(x, y, k = 5, distance = new EuclideanDistance)
// Euclidean distance shortcut
val model = knn(x, y, k = 5)
val model = logit(x, y,
lambda = 0.01, // L2 regularization (0 = none)
tol = 1e-5, // convergence tolerance
maxIter = 500)
// x(i) is a sparse binary feature: array of non-zero feature indices
val model = maxent(x, y,
p = 50000, // feature space dimension
lambda = 0.1,
tol = 1e-5,
maxIter = 500)
import smile.model.mlp.*
import smile.util.function.TimeFunction
val layers = Array(
Layer.input(4),
Layer.sigmoid(20),
Layer.mle(3, OutputFunction.SOFTMAX)
)
val model = mlp(x, y, layers,
epochs = 10,
learningRate = TimeFunction.linear(0.01, 10000, 0.001),
momentum = TimeFunction.constant(0.0),
weightDecay = 0.0,
rho = 0.0,
epsilon = 1e-7)
// Provide explicit RBF neurons
val neurons = RBF.fit(x, k = 10)
val model = rbfnet(x, y, neurons, normalized = false)
// Convenience: build Gaussian RBF with k-means automatically
val model = rbfnet(x, y, k = 10, normalized = false)
import smile.math.kernel.*
val kernel = new GaussianKernel(sigma = 1.0)
val model = svm(x, y, kernel,
C = 1.0,
tol = 1e-3,
epochs = 1)
import smile.data.formula.*
import smile.model.cart.SplitRule
val model = cart(formula, data,
splitRule = SplitRule.GINI,
maxDepth = 20,
maxNodes = 0, // 0 = unlimited
nodeSize = 5)
val model = randomForest(formula, data,
ntrees = 500,
mtry = 0, // 0 = floor(sqrt(p))
splitRule = SplitRule.GINI,
maxDepth = 20,
maxNodes = 500,
nodeSize = 1,
subsample = 1.0, // 1.0 = with replacement
classWeight = null,
seeds = null)
val model = gbm(formula, data,
ntrees = 500,
maxDepth = 20,
maxNodes = 6,
nodeSize = 5,
shrinkage = 0.05,
subsample = 0.7)
val model = adaboost(formula, data,
ntrees = 500,
maxDepth = 20,
maxNodes = 6,
nodeSize = 1)
// Fisher's Linear Discriminant
val model = fisher(x, y, L = -1, tol = 1e-4)
// Linear Discriminant Analysis
val model = lda(x, y, priori = null, tol = 1e-4)
// Quadratic Discriminant Analysis
val model = qda(x, y, priori = null, tol = 1e-4)
// Regularized Discriminant Analysis (blends LDA and QDA)
val model = rda(x, y,
alpha = 0.5, // 0 = LDA, 1 = QDA
priori = null,
tol = 1e-4)
import smile.classification.DiscreteNaiveBayes
// Document classification with add-k smoothing
val model = naiveBayes(x, y,
model = DiscreteNaiveBayes.Model.MULTINOMIAL,
priori = null,
sigma = 1.0)
// General form with continuous distributions
val model = naiveBayes(priori, condprob)
// One-vs-One (K*(K-1)/2 binary classifiers; max-wins voting)
val model = ovo(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }
// One-vs-Rest (K binary classifiers; highest confidence wins)
val model = ovr(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }
Both ovo and ovr take the trainer, a function (Array[T], Array[Int]) => Classifier[T],
in a second (curried) parameter list, so it can be written as a trailing lambda block.
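The one-vs-rest scheme with a trainer passed in a second parameter list can be sketched as follows. This is a simplified illustration, not SMILE's implementation: `OvrSketch`, `Scorer`, and the toy `nearestCentroid` trainer are hypothetical, and the binary models here return a plain confidence score rather than a Classifier.

```scala
object OvrSketch {
  type Sample = Array[Double]
  type Scorer = Sample => Double // higher means "more likely positive"

  // Trains one binary scorer per class and predicts the class with the top score.
  def ovr(x: Array[Sample], y: Array[Int])
         (trainer: (Array[Sample], Array[Int]) => Scorer): Sample => Int = {
    val classes = y.distinct.sorted.toVector
    val scorers = classes.map { c =>
      // Relabel: the current class becomes 1, every other class becomes 0.
      trainer(x, y.map(label => if (label == c) 1 else 0))
    }
    sample => classes(scorers.indices.maxBy(i => scorers(i)(sample)))
  }

  // Toy binary trainer: negative squared distance to the positive-class centroid.
  def nearestCentroid(x: Array[Sample], y: Array[Int]): Scorer = {
    val positives = x.zip(y).collect { case (xi, 1) => xi }
    val centroid = positives.transpose.map(col => col.sum / col.length)
    s => -s.zip(centroid).map { case (a, b) => (a - b) * (a - b) }.sum
  }
}

val points = Array(Array(0.0, 0.0), Array(0.1, 0.0), Array(5.0, 5.0),
                   Array(5.1, 5.0), Array(10.0, 0.0), Array(10.0, 0.1))
val classesOf = Array(0, 0, 1, 1, 2, 2)
val ovrModel = OvrSketch.ovr(points, classesOf) { (xs, ys) =>
  OvrSketch.nearestCentroid(xs, ys)
}
```

Keeping the trainer in its own parameter list is what allows the brace syntax at the call site.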
Import smile.regression.*.
import smile.data.formula.*
import smile.regression.*
// Ordinary Least Squares
val model = lm(formula, data,
method = OLS.Method.QR, // "svd" or "qr"
stderr = true,
recursive = true)
// Ridge Regression (L2 penalty)
val model = ridge(formula, data, lambda = 0.1)
// LASSO (L1 penalty; produces sparse solutions)
val model = lasso(formula, data,
lambda = 0.1,
tol = 1e-3,
maxIter = 5000)
val model = svm(x, y, kernel,
eps = 0.1, // epsilon-insensitive loss threshold
C = 1.0, // soft-margin penalty
tol = 1e-3)
// Single regression tree
val model = cart(formula, data, maxDepth = 20, maxNodes = 0, nodeSize = 5)
// Random Forest
val model = randomForest(formula, data,
ntrees = 500,
mtry = 0,
maxDepth = 20,
maxNodes = 500,
nodeSize = 5,
subsample = 1.0)
// Gradient Boosted Trees
import smile.model.cart.Loss
val model = gbm(formula, data,
loss = Loss.lad(), // least absolute deviation (robust default)
ntrees = 500,
maxDepth = 20,
maxNodes = 6,
nodeSize = 5,
shrinkage = 0.05,
subsample = 0.7)
Gaussian process regression is grouped under the gpr object:
import smile.regression.gpr
import smile.math.kernel.GaussianKernel
val kernel = new GaussianKernel(sigma = 1.0)
// Full GP — O(n³) in training, exact inference
val model = gpr(x, y, kernel,
noise = 0.01,
normalize = true,
tol = 1e-5,
maxIter = 0) // maxIter=0 skips hyperparameter optimization
// Subset-of-Regressors approximation (inducing points t ⊂ x)
val model = gpr.approx(x, y, t, kernel, noise = 0.01)
// Nyström approximation (inducing points may be external)
val model = gpr.nystrom(x, y, t, kernel, noise = 0.01)
// Provide explicit neurons
val model = rbfnet(x, y, neurons, normalized = false)
// Convenience: Gaussian RBF via k-means
val model = rbfnet(x, y, k = 10)
Import smile.clustering.*.
import smile.clustering.*
// Euclidean distance; method ∈ "single" | "complete" | "upgma" | "average" |
// "upgmc" | "centroid" | "wpgma" |
// "wpgmc" | "median" | "ward"
val hc = hclust(data, "ward")
// Custom distance
val hc = hclust(data, myDistance, "complete")
// Cut the dendrogram to obtain k clusters
val labels = hc.partition(k = 5)
// K-Means (best of 16 runs by default)
val km = kmeans(data, k = 5, maxIter = 100, runs = 16)
println(km.k) // actual number of clusters
println(km.distortion) // within-cluster sum of squared distances
// K-Modes (binary / categorical data)
val km = kmodes(data, k = 5, maxIter = 100, runs = 10)
// X-Means — automatically determines k using BIC
val xm = xmeans(data, k = 20) // k is the upper bound
// G-Means — automatically determines k using Gaussian normality test
val gm = gmeans(data, k = 20)
// Deterministic Annealing
val da = dac(data, k = 10, alpha = 0.9)
// CLARANS (medoid-based; any distance)
val cl = clarans(data, myDistance, k = 5)
// DBSCAN with Euclidean distance
val db = dbscan(data, minPts = 5, radius = 0.5)
// DBSCAN with custom distance
val db = dbscan(data, myDistance, minPts = 5, radius = 0.5)
// DBSCAN with pre-built RNN search structure
val db = dbscan(data, rnnSearch, minPts = 5, radius = 0.5)
// DENCLUE (kernel-density attractors)
val dc = denclue(data, sigma = 0.5, m = 50)
import smile.util.SparseArray
// SIB — co-occurrence data (e.g. document–word)
val sb = sib(sparseData, k = 10, maxIter = 100, runs = 8)
// MEC — minimum conditional entropy (works with any distance)
val mc = mec(data, myDistance, k = 10, radius = 0.5)
val mc = mec(data, myMetric, k = 10, radius = 0.5)
val mc = mec(data, k = 10, radius = 0.5) // Euclidean shortcut
// Spectral Clustering
val sp = specc(data, k = 5, sigma = 1.0, l = 0, maxIter = 100)
Cluster assignments are accessed as:
model.y // Array[Int] of cluster labels (-1 = noise in DBSCAN)
model.centroids // cluster centres (for centroid-based models)
Import smile.feature.extraction.*. All methods are wrapped with time(…).
import smile.feature.extraction.*
// PCA: Principal Component Analysis
// Results are bound to names that differ from the function names,
// because `val pca = pca(data)` would be a recursive definition.
val pc = pca(data)
val pc = pca(data, cor = true) // use correlation matrix
// Probabilistic PCA (handles missing values)
val pp = ppca(data, k = 10)
// Kernel PCA
import smile.math.kernel.GaussianKernel
val kp = kpca(data, kernel = new GaussianKernel(1.0), k = 10)
val kp = kpca(data, new GaussianKernel(1.0), k = 10, threshold = 1e-4)
// Generalized Hebbian Algorithm (online / incremental PCA)
val g = gha(data, k = 10)
val g = gha(data, k = 10, r = 0.0001)
After fitting, project new data:
val embedding = pc.project(newData)
pc.setProjection(k) // change number of retained components
Import smile.manifold.*. All methods return low-dimensional coordinate arrays
(Array[Array[Double]]) or dedicated result objects.
import smile.manifold.*
// Isomap — geodesic MDS (C-Isomap variant by default)
val coords = isomap(data, k = 10, d = 2, CIsomap = true)
// Locally Linear Embedding
val coords = lle(data, k = 10, d = 2)
// Laplacian Eigenmap
val coords = laplacian(data, k = 10, d = 2, t = -1.0)
// t > 0 uses Gaussian heat kernel; t ≤ 0 uses binary weights
// t-SNE (2-D or 3-D; input may be pre-computed distance matrix)
val result = tsne(data,
d = 2,
perplexity = 20.0,
eta = 200.0,
earlyExaggeration = 12.0,
maxIter = 1000)
val coords = result.coordinates
// UMAP
val coords = umap(data,
k = 15,
d = 2,
epochs = 0, // 0 = auto
learningRate = 1.0,
minDist = 0.1,
spread = 1.0,
negativeSamples = 5,
repulsionStrength = 1.0)
// Classical MDS (equivalent to PCA when Euclidean distances are used)
val result = mds(proximity, d = 2)
// Non-metric (Kruskal) MDS
val result = isomds(proximity, d = 2, tol = 1e-4, maxIter = 200)
// Sammon Mapping
val result = sammon(proximity, d = 2, step = 0.2, maxIter = 100)
Import smile.nlp.*. The implicit conversion pimpString enriches every
String with NLP pipeline methods.
import smile.nlp.*
val text = "Dr. Smith went to Washington D.C. He arrived on Tuesday."
// Unicode normalization (NFKC, whitespace normalization, quote normalization)
val clean = text.normalize
// Sentence splitting
val sentences: Array[String] = text.sentences
// Tokenization with stop-word filtering
val words: Array[String] = text.words // default stop list
val words: Array[String] = text.words("comprehensive") // larger stop list
val words: Array[String] = text.words("none") // no filtering
val words: Array[String] = text.words("the,a,an") // custom stop list
// Bag-of-words (word → count)
val bag: Map[String, Int] = text.bag() // Porter stemming
val bag: Map[String, Int] = text.bag(stemmer = None) // no stemming
val bag: Map[String, Int] = text.bag(filter = "google")
// Binary bag-of-words (presence/absence)
val bag2: Set[String] = text.bag2()
// Part-of-speech tagging (returns word–POS pairs)
val tagged: Array[(String, PennTreebankPOS)] = "She sells seashells".postag
// Keyword extraction
val keywords: Seq[NGram] = text.keywords(k = 10)
import smile.nlp.*
// Build an in-memory corpus
val corp = corpus(Seq("First document text.", "Second document text."))
// Bigram collocations
val topBigrams: Seq[Bigram] = bigram(k = 100, minFreq = 5, docs: _*)
val sigBigrams: Seq[Bigram] = bigram(p = 0.01, minFreq = 5, docs: _*)
// N-gram extraction (Apriori-style)
val grams: Array[Array[NGram]] = ngram(maxNGramSize = 3, minFreq = 3, docs: _*)
// HMM POS tagging on a pre-tokenised sentence
val tags: Array[PennTreebankPOS] = postag(Array("She", "sells", "seashells"))
import smile.nlp.*
porter.stem("running") // "run"
lancaster.stem("running") // "run" (more aggressive)
import smile.nlp.*
// Term-frequency feature vector
val vocab = Array("machine", "learning", "deep")
val features = vectorize(vocab, bag) // Array[Double]
val sparse = vectorize(vocab, bag2) // Array[Int] (indices of present terms)
// Document frequency array
val dfreq: Array[Int] = df(vocab, corpusOfBags)
// Whole-corpus TF-IDF normalized to unit L2 norm
val matrix: Array[Array[Double]] = tfidf(corpusOfBags)
// Single document
val vec: Array[Double] = tfidf(bag, n = corpusSize, df = dfreq)
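As a reminder of what a TF-IDF vector with unit L2 norm looks like, here is a generic textbook sketch. It is an assumption-laden illustration: `TfIdfSketch.weigh` is a hypothetical helper using the plain tf * log(n / df) weighting, and SMILE's own tfidf may use a different weighting scheme.

```scala
object TfIdfSketch {
  // tf: term counts for one document; n: corpus size;
  // df: number of documents that contain each term.
  def weigh(tf: Array[Double], n: Int, df: Array[Int]): Array[Double] = {
    val raw = tf.zip(df).map { case (count, docs) =>
      // Terms that never occur in the corpus get zero weight.
      if (docs == 0) 0.0 else count * math.log(n.toDouble / docs)
    }
    // Scale to unit L2 norm unless the vector is all zeros.
    val norm = math.sqrt(raw.map(w => w * w).sum)
    if (norm == 0.0) raw else raw.map(_ / norm)
  }
}

val weights = TfIdfSketch.weigh(Array(2.0, 1.0, 0.0), n = 10, df = Array(1, 5, 10))
```

Rare terms (small df) receive larger weights, and the normalization makes documents of different lengths comparable.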
Import smile.sequence.*.
import smile.sequence.*
// Hidden Markov Model
val model = hmm(pi, a, b) // from initial / transition / emission
val model = hmm(observations, k) // learns from observation sequences
// Conditional Random Field (linear-chain)
val model = crf(x, y, feature, k, eta = 0.1, lambda = 0.1)
// CRF with Gaussian process smoothing
val model = gcrf(x, y, feature, k, eta = 0.1, lambda = 0.1)
Import smile.association.*.
import smile.association.*
val itemsets: Array[Array[Int]] = …
// Build FP-tree
val tree = fptree(itemsets)
val tree = fptree(itemsets.toStream) // streaming variant
// Mine frequent item sets
val frequent = fpgrowth(tree, minSupport = 3)
val frequent = fpgrowth(itemsets, minSupport = 3)
// Generate association rules
val rules = arm(tree, minSupport = 3, confidence = 0.5)
val rules = arm(itemsets, minSupport = 3, confidence = 0.5)
Import smile.wavelet.*.
import smile.wavelet.*
val wt = wavelet("D4") // Daubechies-4 filter
// Available filters include: "Haar", "D4"–"D20" (even), "Coiflet1"–"Coiflet5", etc.
val signal = Array(1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0)
// In-place discrete wavelet transform
dwt(signal, wt)
// In-place inverse DWT
idwt(signal, wt)
// Wavelet shrinkage denoising (modifies in-place)
wsdenoise(signal, wt, soft = true)
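To make the transform and its inverse concrete, the sketch below implements a single level of the Haar wavelet transform by hand. It is a simplified illustration with hypothetical names (`HaarSketch.forward`, `HaarSketch.inverse`): unlike the dwt and idwt calls above, it performs only one decomposition level and returns new arrays instead of working in place.

```scala
object HaarSketch {
  private val s = math.sqrt(2.0)

  // One Haar level: the first half holds scaled averages, the second half details.
  def forward(x: Array[Double]): Array[Double] = {
    require(x.length % 2 == 0, "length must be even")
    val half = x.length / 2
    val out = new Array[Double](x.length)
    for (i <- 0 until half) {
      out(i) = (x(2 * i) + x(2 * i + 1)) / s
      out(half + i) = (x(2 * i) - x(2 * i + 1)) / s
    }
    out
  }

  // Reconstructs the signal from one level of Haar coefficients.
  def inverse(c: Array[Double]): Array[Double] = {
    require(c.length % 2 == 0, "length must be even")
    val half = c.length / 2
    val out = new Array[Double](c.length)
    for (i <- 0 until half) {
      out(2 * i) = (c(i) + c(half + i)) / s
      out(2 * i + 1) = (c(i) - c(half + i)) / s
    }
    out
  }
}

val original = Array(1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0)
val restored = HaarSketch.inverse(HaarSketch.forward(original))
```

Shrinkage denoising works on the detail half: small detail coefficients are reduced or zeroed before the inverse transform is applied.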
Import smile.validation.*.
import smile.validation.*
// With raw arrays
val result = validate.classification(x, y, testX, testY) { (x, y) =>
randomForest(Formula.lhs("label"), DataFrame.of(x, y), ntrees = 100)
}
// With DataFrame + Formula
val result = validate.classification(formula, trainDf, testDf) { (f, df) =>
randomForest(f, df)
}
// Regression variants
val result = validate.regression(x, y, testX, testY) { (x, y) => lm(…) }
val result = validate.regression(formula, train, test) { (f, df) => lm(f, df) }
val cv5 = cv.classification(k = 5, formula, data) { (f, df) =>
randomForest(f, df)
}
println(cv5.avg.accuracy)
// With raw arrays
val cv5 = cv.classification(k = 5, x, y) { (x, y) =>
lda(x, y)
}
// Regression
val cv5r = cv.regression(k = 5, formula, data) { (f, df) => lm(f, df) }
val cv5r = cv.regression(k = 5, x, y) { (x, y) => ridge(…) }
val loo = loocv.classification(formula, data) { (f, df) => cart(f, df) }
val loo = loocv.regression(x, y) { (x, y) => lasso(…) }
val boot = bootstrap.classification(k = 100, x, y) { (x, y) => knn(x, y, 5) }
val boot = bootstrap.regression(k = 100, formula, data) { (f, df) => gbm(f, df) }
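The k-fold procedure that the cv calls wrap can be sketched in a few lines. This is a simplified illustration under stated assumptions: `CvSketch.cvAccuracy` is a hypothetical helper, the folds are assigned deterministically by index instead of by random shuffling, and only accuracy is reported.

```scala
object CvSketch {
  type Sample = Array[Double]
  type Classifier = Sample => Int

  // Average held-out accuracy over k folds.
  def cvAccuracy(k: Int, x: Array[Sample], y: Array[Int])
                (trainer: (Array[Sample], Array[Int]) => Classifier): Double = {
    val perFold = (0 until k).map { fold =>
      // Deterministic split: sample i is held out in fold i % k.
      val (test, train) = x.indices.partition(_ % k == fold)
      val model = trainer(train.map(x(_)).toArray, train.map(y(_)).toArray)
      test.count(i => model(x(i)) == y(i)).toDouble / test.size
    }
    perFold.sum / k
  }
}

// Two well-separated classes on a line: 0..4 and 100..104.
val xs = Array.tabulate(10)(i => Array(if (i < 5) i.toDouble else 95.0 + i))
val ys = Array.tabulate(10)(i => if (i < 5) 0 else 1)

// Toy trainer: threshold halfway between the two class means.
val acc = CvSketch.cvAccuracy(5, xs, ys) { (tx, ty) =>
  def mean(label: Int): Double = {
    val v = tx.zip(ty).collect { case (xi, l) if l == label => xi(0) }
    v.sum / v.length
  }
  val threshold = (mean(0) + mean(1)) / 2
  sample => if (sample(0) > threshold) 1 else 0
}
```

Each model sees only the training folds, so the averaged score estimates performance on unseen data.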
import smile.validation.*
// Classification
val cm = confusion(truth, predictions)
val acc = accuracy(truth, predictions)
val rec = recall(truth, predictions)
val prec = precision(truth, predictions)
val f1Val = f1(truth, predictions)
val aucVal = auc(truth, probabilities)
val ll = logloss(truth, probabilities)
val ce = crossentropy(truth, probMatrix)
val mccVal = mcc(truth, predictions)
val sens = sensitivity(truth, predictions)
val spec = specificity(truth, predictions)
val fo = fallout(truth, predictions)
val fdrVal = fdr(truth, predictions)
// Regression
val mseVal = mse(truth, predictions)
val rmseVal = rmse(truth, predictions)
val rssVal = rss(truth, predictions)
val madVal = mad(truth, predictions)
// Clustering
val ri = randIndex(labels1, labels2)
val ari = adjustedRandIndex(labels1, labels2)
val nmiVal = nmi(labels1, labels2)
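For reference, the binary classification metrics listed above follow directly from the four confusion-matrix counts. The sketch below computes a few of them by hand; `MetricsSketch` and `Counts` are hypothetical names, labels are assumed to be 0 (negative) and 1 (positive), and no guard is included for empty denominators.

```scala
object MetricsSketch {
  final case class Counts(tp: Int, fp: Int, fn: Int, tn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + fn + tn)
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn)
    // Harmonic mean of precision and recall.
    def f1: Double = 2 * precision * recall / (precision + recall)
  }

  // Tallies true/false positives and negatives for 0/1 labels.
  def count(truth: Array[Int], prediction: Array[Int]): Counts = {
    require(truth.length == prediction.length, "arrays must have equal length")
    val pairs = truth.zip(prediction)
    Counts(
      tp = pairs.count { case (t, p) => t == 1 && p == 1 },
      fp = pairs.count { case (t, p) => t == 0 && p == 1 },
      fn = pairs.count { case (t, p) => t == 1 && p == 0 },
      tn = pairs.count { case (t, p) => t == 0 && p == 0 }
    )
  }
}

val counts = MetricsSketch.count(Array(1, 1, 0, 0, 1), Array(1, 0, 0, 1, 1))
```

Sensitivity is another name for recall, and specificity, fallout, and the false discovery rate are analogous ratios over the same four counts.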
The module includes two complementary plot APIs:
- smile.plot.swing: traditional Swing-based Canvas charts for desktop use.
- smile.plot.vega: Vega-Lite declarative charts for notebooks and browser-based output.

show
import smile.plot.*
// Render a Canvas in a JFrame (desktop) or as HTML (notebook)
show(canvas)
show(multiFigurePane)
show(vegaLiteSpec)
In Scala 2.13 notebook environments, the implicit show calls are backed by
macros that detect a Zeppelin/Databricks context at compile time and emit HTML
tags instead of opening a Swing window.
Swing (smile.plot.swing.*)

Every chart returns a Canvas that can be passed to show(…).
import smile.plot.swing.*
// Scatter plot
val c = plot(x, y, '.') // Array[Double] x and y
val c = plot(data, labels, marks) // colour-coded by class label
// Scatter-plot matrix
val c = splom(data, marks, colNames)
// Line plot
val c = line(x, y)
val c = staircase(x, y)
// Box plot
val c = boxplot(data)
val c = boxplot(groups, names)
// Histogram
val c = hist(data)
val c = hist(data, bins = 20)
val c = hist3(x, y, bins = 20)
// Q-Q plot
val c = qqplot(data) // vs normal
val c = qqplot(x, y) // two-sample
val c = qqplot(data, distribution) // vs arbitrary distribution
// Heatmap and sparse matrix spy plot
val c = heatmap(matrix)
val c = spy(sparseMatrix)
val c = hexmap(data)
// Contour and surface
val c = contour(x, y, z)
val c = surface(z)
val c = wireframe(vertices, edges)
val c = grid(ax, ay, az)
// Dendrogram
val c = dendrogram(hierarchicalClustering)
// Scree plot (PCA)
val c = screeplot(pca)
// Text annotations
val c = text(coords, labels)
Vega-Lite (smile.plot.vega.*)

Build declarative specs using a fluent Scala API. The VegaLite companion
object is the entry point.
import smile.plot.vega.*
// Single view
val view = VegaLite.view()
.mark("point")
.x(Field("sepalLength", "quantitative"))
.y(Field("petalLength", "quantitative"))
.color(Field("species", "nominal"))
.data(irisDataFrame)
show(view)
// Layered chart (multiple marks in the same coordinate system)
val chart = VegaLite.layer(view1, view2)
// Faceted chart
val faceted = VegaLite.facet(view).row("origin").column("cylinders")
// Concatenated charts
val hcat = VegaLite.hconcat(view1, view2, view3)
val vcat = VegaLite.vconcat(view1, view2)
// Scatter-plot matrix
val splomChart = VegaLite.splom(irisDataFrame)
// Fluent global properties
val chart = VegaLite.view()
.background("#f5f5f5")
.padding(10)
.config(JsObject("view" -> JsObject("stroke" -> JsString("transparent"))))
time: measure and log execution time
import smile.util.time
// Block form — returns the value, logs elapsed time with a label
val model = time("Random Forest") {
randomForest(formula, data, ntrees = 500)
}
// Toggle output
time.on() // enable timing output (default)
time.off() // suppress timing output
time.echo // check current state
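A timing wrapper of this shape is typically built from a by-name parameter, which delays evaluation of the block until the timer has started. The sketch below shows that technique with a hypothetical object named `stopwatch`; the log format and the use of System.nanoTime are assumptions, not a description of smile.util.time.

```scala
// Hypothetical stand-in for a timing helper such as `time`.
object stopwatch {
  @volatile private var enabled = true

  def on(): Unit = enabled = true   // enable timing output
  def off(): Unit = enabled = false // suppress timing output
  def echo: Boolean = enabled       // current state

  // `block` is by-name, so it runs inside `apply`, after the clock starts.
  def apply[A](label: String)(block: => A): A = {
    val start = System.nanoTime()
    val result = block
    val millis = (System.nanoTime() - start) / 1e6
    if (enabled) println(f"$label runtime: $millis%.3f ms")
    result
  }
}

// The wrapper is transparent: it returns whatever the block returns.
val total = stopwatch("sum") { (1 to 1000).sum }
```

Because the block's value is passed through unchanged, any expression can be wrapped without altering the surrounding code.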
import smile.util.{toJavaFunction, toJavaBiFunction}
// Convert Scala lambdas to java.util.function types automatically
val jf: java.util.function.Function[Int, String] = (i: Int) => i.toString
val jbf: java.util.function.BiFunction[Int, Int, Int] = (a: Int, b: Int) => a + b
These conversions are automatically applied wherever SMILE's Java API requires a
Function or BiFunction — for example, when passing trainers to ovo, ovr,
validate.classification, or cv.classification.
import smile.io.*
import smile.data.formula.*
import smile.classification.*
import smile.validation.*
val df = read.csv("iris.csv")
val formula = "class" ~ "."
// 5-fold cross-validation on a random forest
val result = cv.classification(k = 5, formula, df) { (f, d) =>
randomForest(f, d, ntrees = 100)
}
println(f"CV accuracy: ${result.avg.accuracy * 100}%.1f%%")
import smile.io.*
import smile.nlp.*
import smile.classification.*
import smile.validation.*
val texts = Array("great product", "terrible service", "very happy")
val labels = Array(1, 0, 1)
// Build vocabulary from training data
val bags = texts.map(_.bag())
val vocab = bags.flatMap(_.keys).distinct.sorted
// Vectorise
val x = bags.map(b => vectorize(vocab, b))
val y = labels
// Train and evaluate
val result = cv.classification(k = 3, x, y) { (x, y) =>
logit(x, y, lambda = 0.01)
}
println(result.avg.accuracy)
import smile.io.*
import smile.data.formula.*
import smile.regression.*
import smile.validation.*
val longley = read.arff("data/regression/longley.arff")
val formula = "Employed" ~ "."
val cv5 = cv.regression(k = 5, formula, longley) { (f, df) =>
lm(f, df)
}
println(f"RMSE: ${cv5.avg.rmse}%.4f")
import smile.regression.gpr
import smile.math.kernel.GaussianKernel
val kernel = new GaussianKernel(sigma = 1.0)
// Inducing inputs (e.g. k-means centroids of x)
import smile.clustering.*
val km = kmeans(x, k = 200)
val t = km.centroids
val model = gpr.nystrom(x, y, t, kernel, noise = 0.01, normalize = true)
val predictions = x.map(model.predict)
import smile.nlp.*
val text = """
Machine learning is a field of artificial intelligence. It enables computers
to learn from experience without being explicitly programmed.
"""
val keywords = text.keywords(k = 5)
keywords.foreach(ng => println(ng.words.mkString(" ")))
import smile.io.*
import smile.manifold.*
import smile.plot.swing.*
import smile.plot.*
val (x, _) = read.csv("mnist.csv").toArray … // high-dimensional data
val embedding = umap(x, k = 15, d = 2)
val canvas = plot(embedding.map(_(0)), embedding.map(_(1)), '.')
show(canvas)
import smile.cas.*
val x = "x"
val y = "y"
// Define f(x, y) = x² y + sin(x) y
val f = (x ** 2) * y + sin(x) * y
// Partial derivatives
val df_dx = f.d("x").simplify // 2 x y + cos(x) y
val df_dy = f.d("y").simplify // x² + sin(x)
println(df_dx)
println(df_dy)
// Evaluate at x=1, y=2
val env = Map("x" -> 1.0, "y" -> 2.0)
println(df_dx.apply(env))
| Aspect | Scala | Kotlin |
|---|---|---|
| Extension mechanism | Implicit classes (PimpedXxx) via implicit def | Extension functions |
| Formula DSL | Rich operator DSL: ~, +, -, ::, &&, ^, function terms | Not present |
| CAS | Full symbolic algebra (smile.cas) | Not present |
| Plotting | Both Swing (smile.plot.swing) and Vega-Lite (smile.plot.vega) | Not present |
| Notebook rendering | Macro-detected at compile time (Scala 2.13) | N/A |
| Validation API | Object-based: validate, cv, loocv, bootstrap | Top-level functions |
| Sequence models | HMM, CRF, GCRF | Not present |
| Operator DSL | %*% (dot/matmul), \ (solve), ~ (slice) | N/A |
| Array slicing | 0 ~ 9 ~ 2 (Python-like with step) | N/A |
| gpr namespace | object gpr { apply, approx, nystrom } | object gpr (same) |
Both shims expose the same underlying Java algorithms. The Kotlin shim focuses on function-level conciseness; the Scala shim additionally provides a richer operator language and is more appropriate for exploratory notebook workflows that involve linear algebra, symbolic math, and interactive visualization.
SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.