core/VALIDATION_METRICS.md
The smile.validation.metric package provides scalar evaluation metrics for
classification, probabilistic classification, regression, and
clustering tasks. Every metric is a stateless, serializable object that
implements one of four functional interfaces. The static of(...) factory
methods let you compute a score in one line without instantiating a class.
| Interface | Method signature | Who implements it |
|---|---|---|
| ClassificationMetric | double score(int[] truth, int[] prediction) | Accuracy, Error, Precision, Recall, FScore, FDR, Fallout, Specificity, Sensitivity, MatthewsCorrelation |
| ProbabilisticClassificationMetric | double score(int[] truth, double[] probability) | AUC, LogLoss |
| RegressionMetric | double score(double[] truth, double[] prediction) | MSE, RMSE, RSS, MAD, R2 |
| ClusteringMetric | double score(int[] truth, int[] cluster) | RandIndex, AdjustedRandIndex, MutualInformation, NormalizedMutualInformation, AdjustedMutualInformation |
All four interfaces extend java.util.function.ToDoubleBiFunction so they can
be used directly as lambdas or method references wherever that type is expected.
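Since the interfaces are interchangeable with the JDK type, a metric can be handed to any code that accepts a ToDoubleBiFunction. A minimal sketch (the evaluate helper and the sample arrays are illustrative, not part of the library):

```java
import java.util.function.ToDoubleBiFunction;
import smile.validation.metric.Accuracy;
import smile.validation.metric.Averaging;
import smile.validation.metric.FScore;

// Hypothetical helper: accepts any object that implements the JDK functional type.
static double evaluate(ToDoubleBiFunction<int[], int[]> metric, int[] truth, int[] pred) {
    return metric.applyAsDouble(truth, pred);
}

int[] truth = {1, 0, 1, 0, 1, 0};
int[] pred  = {1, 0, 0, 0, 1, 1};
double acc = evaluate(Accuracy.instance, truth, pred);               // same as Accuracy.of(truth, pred)
double f1  = evaluate(new FScore(1.0, Averaging.Macro), truth, pred);
```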
A few conventions apply across all metrics:
- Binary metrics expect labels in {0, 1} (0 = negative, 1 = positive).
- Multi-class metrics accept any non-negative integer labels; max(label) + 1 is used as the number of classes.
- The of(truth, prediction) methods throw IllegalArgumentException if the array sizes differ.
- Lower is better for Error, MSE, RMSE, RSS, MAD, LogLoss, and CrossEntropy.
- Singleton instances (Accuracy.instance, AUC.instance, etc.) are provided for convenience when you need a reusable object reference.

Classification metrics compare an integer label array truth against a predicted integer label array prediction.
double acc = Accuracy.of(truth, prediction);
// or via instance
Accuracy accuracy = new Accuracy();
double acc = accuracy.score(truth, prediction);
Formula: acc = (number of correct predictions) / n
Accuracy is symmetric and works for any number of classes. It is the complement
of the error rate: accuracy + errorRate == 1.0.
int[] truth = {1, 0, 1, 0, 1, 0};
int[] prediction = {1, 0, 0, 0, 1, 1};
double acc = Accuracy.of(truth, prediction); // (4 correct) / 6 ≈ 0.667
Caveat: Accuracy is misleading on imbalanced datasets. A classifier that always predicts the majority class can achieve 99 % accuracy on a 99:1 dataset while being completely useless for the minority class.
int errors = Error.of(truth, prediction);
Returns the raw count of mismatches (not a rate). Cast to double when
used via ClassificationMetric.score().
int n = truth.length;
int errors = Error.of(truth, prediction);
double errorRate = (double) errors / n;
double accuracy = Accuracy.of(truth, prediction);
// errorRate + accuracy == 1.0
These three metrics work in both binary and multi-class modes; the calls below use the binary form.
// Binary form: both arrays must contain only 0 and 1.
double p = Precision.of(truth, prediction);
double r = Recall.of(truth, prediction);
double f1 = FScore.of(truth, prediction, 1.0, null); // F₁ = harmonic mean of P and R
| Metric | Formula | Numerator | Denominator |
|---|---|---|---|
| Precision | TP / (TP + FP) | True positives | All predicted positives |
| Recall | TP / (TP + FN) | True positives | All actual positives |
| F₁ | 2PR / (P + R) | — | — |
When there are no predicted positives (Precision) or no actual positives
(Recall), the result is NaN. Handle this defensively:
double p = Precision.of(truth, prediction);
if (Double.isNaN(p)) {
// model made no positive predictions
}
Averaging strategy
Pass one of the three Averaging enum values as the final argument:
| Strategy | Description |
|---|---|
| Averaging.Macro | Compute the per-class metric and take its unweighted mean. Treats all classes equally. |
| Averaging.Micro | Pool all TP/FP/FN globally. Micro-Precision and Micro-Recall are equivalent to accuracy. |
| Averaging.Weighted | Compute the per-class metric and weight it by class support in truth. |
import smile.validation.metric.Averaging;
double macroPrecision = Precision.of(truth, pred, Averaging.Macro);
double microPrecision = Precision.of(truth, pred, Averaging.Micro);
double weightedPrecision = Precision.of(truth, pred, Averaging.Weighted);
double macroRecall = Recall.of(truth, pred, Averaging.Macro);
double macroF1 = FScore.of(truth, pred, 1.0, Averaging.Macro);
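A small hand-worked example makes the differences concrete; the expected values in the comments are computed by hand from the definitions in the table above, so treat them as an illustration rather than library output:

```java
int[] truth = {0, 0, 0, 0, 1, 1, 2, 2};
int[] pred  = {0, 0, 0, 1, 1, 2, 2, 2};
// Per-class precision: class 0 = 3/3, class 1 = 1/2, class 2 = 2/3

double macro    = Precision.of(truth, pred, Averaging.Macro);    // (1.0 + 0.5 + 0.667) / 3 ≈ 0.722
double micro    = Precision.of(truth, pred, Averaging.Micro);    // pooled TP / predicted = 6/8 = 0.75 (= accuracy)
double weighted = Precision.of(truth, pred, Averaging.Weighted); // (4*1.0 + 2*0.5 + 2*0.667) / 8 ≈ 0.792
```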
The beta parameter controls the trade-off between precision and recall:
Fβ = (1 + β²) · (P · R) / (β²·P + R)
double f2 = FScore.of(truth, prediction, 2.0, null); // binary, recall-weighted
FScore f05Instance = new FScore(0.5, Averaging.Macro); // reusable instance
double score = f05Instance.score(truth, prediction);
beta must be strictly positive; passing 0 or a negative value throws
IllegalArgumentException("Non-positive beta: ...").
double fdr = FDR.of(truth, prediction);
Formula: FDR = FP / (TP + FP) = 1 − Precision
Only applicable to binary labels {0, 1}. Returns NaN if no positive
predictions are made.
double fpr = Fallout.of(truth, prediction);
Formula: FPR = FP / (FP + TN) = FP / (number of actual negatives)
The negatives in this metric are samples where truth[i] != 1 (i.e., any label
other than 1 counts as negative, not just truth == 0).
Returns NaN if there are no negative samples in truth.
double tnr = Specificity.of(truth, prediction);
Formula: TNR = TN / (TN + FP) = TN / (number of samples where truth == 0)
Specificity counts only samples where truth[i] == 0 as negatives (stricter
than Fallout). Returns NaN if no negative samples exist.
Specificity = 1 − Fallout only when the labels are strictly binary, i.e., every label other than 1 is exactly 0.
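The distinction only matters when truth contains labels other than 0 and 1. A small illustration based on the two definitions above (the label 2 is used purely for illustration; the expected values are worked out by hand from the stated formulas):

```java
int[] truth = {1, 0, 2, 0, 1};
int[] pred  = {1, 1, 1, 0, 1};

// Fallout: negatives are truth != 1 -> indices 1, 2, 3; two of them are predicted positive.
double fpr = Fallout.of(truth, pred);      // 2/3 ≈ 0.667

// Specificity: negatives are truth == 0 -> indices 1, 3; one of them is predicted 0.
double tnr = Specificity.of(truth, pred);  // 1/2 = 0.5, which is not 1 - Fallout
```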
double tpr = Sensitivity.of(truth, prediction);
Formula: TPR = TP / (TP + FN)
Binary only; identical to binary Recall.of(truth, prediction). Returns NaN
if there are no positive samples.
double mcc = MatthewsCorrelation.of(truth, prediction);
Formula:
MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC is widely considered the most informative single metric for binary classification because it accounts for all four cells of the confusion matrix and is robust to class imbalance.
The input labels must reduce to a 2×2 confusion matrix (exactly two
distinct classes). Returns NaN when the denominator is zero (e.g., all
predictions or all truths are the same class).
int[] truth = {1, 0, 1, 0, 1, 0, 1, 0};
int[] prediction = {1, 0, 1, 0, 0, 1, 1, 0};
double mcc = MatthewsCorrelation.of(truth, prediction); // ≈ 0.5
A ConfusionMatrix is not a scalar metric; it is a 2-D summary from which any
per-class breakdown can be derived.
ConfusionMatrix cm = ConfusionMatrix.of(truth, prediction);
int[][] matrix = cm.matrix();
// matrix[t][p] = count of samples with true label t, predicted as p
System.out.println(cm); // formatted table
The matrix dimension is (max_label + 1) × (max_label + 1) based on the
union of values in truth and prediction.
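Because matrix[t][p] counts samples with true label t predicted as p, per-class quantities fall out of simple row and column sums. A sketch of deriving per-class recall (the loop is illustrative, not a library API):

```java
int[][] m = ConfusionMatrix.of(truth, prediction).matrix();

for (int t = 0; t < m.length; t++) {
    int support = 0;                                  // row sum: samples whose true label is t
    for (int p = 0; p < m[t].length; p++) {
        support += m[t][p];
    }
    double recall = support == 0 ? Double.NaN : (double) m[t][t] / support;
    System.out.printf("class %d: recall = %.3f (support = %d)%n", t, recall, support);
}
```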
Probabilistic metrics require a continuous score (probability) in addition to the integer ground truth.
double auc = AUC.of(truth, probability);
- truth: binary labels {0, 1}.
- probability: positive-class probability score (higher means more likely to be positive).

Interpretation: AUC equals the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample.
| AUC value | Meaning |
|---|---|
| 1.0 | Perfect ranking — all positives rank above all negatives |
| 0.5 | Random classifier |
| 0.0 | Worst-case — all positives rank below all negatives |
Algorithm: Mann–Whitney U rank statistic with tie-averaging:
AUC = (sum_ranks_of_positives − pos*(pos+1)/2) / (pos * neg)
Ties in probability receive the average of their ranks.
int[] truth = {0, 0, 1, 1};
double[] prob = {0.1, 0.4, 0.35, 0.8};
double auc = AUC.of(truth, prob); // = 0.75
Returns NaN when truth contains only one class (no positive or no negative
samples).
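For readers who want the rank statistic spelled out, here is a plain-Java sketch of the tie-averaged Mann–Whitney computation described above; it is illustrative only, not the library's implementation:

```java
import java.util.Arrays;

static double rankAuc(int[] truth, double[] prob) {
    int n = truth.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    Arrays.sort(order, (a, b) -> Double.compare(prob[a], prob[b]));

    // Assign 1-based ranks, averaging within groups of tied probabilities.
    double[] rank = new double[n];
    for (int i = 0; i < n; ) {
        int j = i;
        while (j + 1 < n && prob[order[j + 1]] == prob[order[i]]) j++;
        double avg = (i + j) / 2.0 + 1.0;
        for (int k = i; k <= j; k++) rank[order[k]] = avg;
        i = j + 1;
    }

    double posRankSum = 0.0;
    int pos = 0;
    for (int i = 0; i < n; i++) {
        if (truth[i] == 1) { posRankSum += rank[i]; pos++; }
    }
    int neg = n - pos;
    if (pos == 0 || neg == 0) return Double.NaN;      // only one class present

    return (posRankSum - pos * (pos + 1) / 2.0) / ((double) pos * neg);
}

// rankAuc(new int[]{0, 0, 1, 1}, new double[]{0.1, 0.4, 0.35, 0.8}) -> 0.75
```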
double loss = LogLoss.of(truth, probability);
- truth: binary labels {0, 1}.
- probability: predicted probability for the positive class, in (0, 1).

Formula:
LogLoss = −(1/n) Σᵢ [ truth[i]·log(pᵢ) + (1−truth[i])·log(1−pᵢ) ]
Computed in nats (natural logarithm). For truth[i] == 0, uses
Math.log1p(−pᵢ) for numerical accuracy at values near 0.
int[] truth = {0, 0, 1, 1, 0};
double[] prob = {0.1, 0.4, 0.35, 0.8, 0.1};
double loss = LogLoss.of(truth, prob); // ≈ 0.3989
- Probabilities must lie strictly in (0, 1); a probability of exactly 0 or 1 for the wrong class produces an infinite loss (+∞).
- Only labels {0, 1} are accepted; other values throw IllegalArgumentException.

double ce = CrossEntropy.of(truth, probability);
- truth: integer class index for each sample.
- probability: double[n][k]; row i contains the probability distribution over k classes for sample i.

Formula:
CE = −(1/n) Σᵢ log(probability[i][truth[i]])
CrossEntropy is an interface (not a class); call the static of(...) method
directly. It generalizes LogLoss to any number of classes; for k = 2 the
values are identical up to the column selection convention.
int[] truth = {0, 1, 2, 0};
double[][] prob = {
{0.9, 0.05, 0.05},
{0.05, 0.9, 0.05},
{0.05, 0.05, 0.9},
{0.9, 0.05, 0.05}
};
double ce = CrossEntropy.of(truth, prob); // = -log(0.9) ≈ 0.1054
Regression metrics compare continuous truth and prediction arrays.
double rss = RSS.of(truth, prediction);
Formula: RSS = Σ (yᵢ − ŷᵢ)²
RSS is scale-dependent and grows with n. Use it when you need the raw
magnitude of the fit, not a normalized quantity.
double mse = MSE.of(truth, prediction);
Formula: MSE = RSS / n = (1/n) Σ (yᵢ − ŷᵢ)²
MSE penalizes large errors heavily (squaring effect) and is the optimization
objective for ordinary least squares. Scale is in squared units of y.
double rmse = RMSE.of(truth, prediction);
Formula: RMSE = √MSE
Same units as y; directly interpretable as "typical error magnitude".
RMSE ≥ MAD always (by Jensen's inequality).
double mae = MAD.of(truth, prediction);
Formula: MAD = (1/n) Σ |yᵢ − ŷᵢ|
Also called MAE in many frameworks. Less sensitive to outliers than MSE/RMSE
because it does not square the residuals. MAD.of(truth, pred) and
MAD.of(pred, truth) produce the same result.
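To make the relationships between RSS, MSE, RMSE, and MAD concrete, here is a small hand-checked example (same data as the R² example below; the expected values in the comments follow from the formulas above):

```java
double[] truth = {3.0, -0.5, 2.0, 7.0};
double[] pred  = {2.5,  0.0, 2.0, 8.0};
// Residuals: 0.5, -0.5, 0.0, -1.0

double rss  = RSS.of(truth, pred);   // 0.25 + 0.25 + 0 + 1 = 1.5
double mse  = MSE.of(truth, pred);   // 1.5 / 4 = 0.375
double rmse = RMSE.of(truth, pred);  // sqrt(0.375) ≈ 0.612
double mad  = MAD.of(truth, pred);   // (0.5 + 0.5 + 0 + 1) / 4 = 0.5
// mse == rss / n, rmse == sqrt(mse), and rmse >= mad, as stated above
```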
double r2 = R2.of(truth, prediction);
Formula:
R² = 1 − RSS / TSS
TSS = Σ (yᵢ − ȳ)² (total sum of squares)
| R² value | Interpretation |
|---|---|
| 1.0 | Perfect fit — predictions equal truth exactly |
| 0.0 | Model is no better than always predicting mean(truth) |
| < 0 | Model is worse than predicting the mean |
Important: When truth is constant (TSS = 0), R² is undefined and
returns NaN or infinity. Check before comparing:
double r2 = R2.of(truth, prediction);
if (!Double.isFinite(r2)) {
// constant truth — R² is not meaningful
}
smile.validation.RegressionMetrics bundles all five into a single record:
import smile.validation.RegressionMetrics;
RegressionMetrics m = RegressionMetrics.of(fitTime, scoreTime, truth, prediction);
System.out.println(m.RSS());
System.out.println(m.MSE());
System.out.println(m.RMSE());
System.out.println(m.MAD());
System.out.println(m.R2());
Clustering metrics compare a ground-truth labelling against a proposed cluster assignment. Labels in both arrays are permutation-invariant: the metrics only care about which samples are grouped together, not which integer label names each group.
All clustering metrics use a ContingencyTable internally, which remaps the
raw integer labels to contiguous indices automatically.
double ri = RandIndex.of(truth, cluster);
The Rand index measures the fraction of pairs of samples that are either both in the same group or both in different groups in both labellings.
Formula:
RI = (number of agreeing pairs) / C(n, 2)
   = (C(n, 2) + 2T − P − Q) / C(n, 2)
where:
- T = Σᵢⱼ C(nᵢⱼ, 2): pairs grouped together in both labellings.
- P = Σᵢ C(aᵢ, 2): pairs in the same ground-truth class.
- Q = Σⱼ C(bⱼ, 2): pairs in the same predicted cluster.

Range: [0, 1]. A value of 1 means perfect agreement.
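A small worked example of the formula (the contingency counts and the expected value are computed by hand):

```java
int[] truth   = {0, 0, 0, 1, 1, 1};
int[] cluster = {0, 0, 1, 1, 1, 1};
// Contingency counts: n00 = 2, n01 = 1, n10 = 0, n11 = 3
// T = 1 + 0 + 0 + 3 = 4,  P = 3 + 3 = 6,  Q = 1 + 6 = 7,  C(6, 2) = 15
// RI = (15 + 2*4 - 6 - 7) / 15 = 10/15 ≈ 0.667
double ri = RandIndex.of(truth, cluster); // ≈ 0.667
```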
Limitation: The Rand index has a non-zero expected value for random clusterings (especially when many clusters are used). Use Adjusted Rand Index for chance-corrected evaluation.
double ari = AdjustedRandIndex.of(truth, cluster);
Corrects the Rand index for the expected agreement under chance:
ARI = (RI − E[RI]) / (max(RI) − E[RI])
| ARI value | Interpretation |
|---|---|
| 1.0 | Perfect agreement |
| 0.0 | Agreement at the level of a random clustering |
| < 0 | Worse than random |
ARI is the standard clustering quality metric when ground-truth labels are available.
int[] clusters = {0, 0, 1, 1, 2, 2};
int[] alt = {1, 1, 0, 0, 2, 2}; // same partition, different labels
double ari = AdjustedRandIndex.of(clusters, alt); // = 1.0 (perfect)
double mi = MutualInformation.of(truth, cluster);
Measures the information shared between two labellings (in nats, natural log):
I(X;Y) = Σᵢⱼ (nᵢⱼ/n) · log[ (nᵢⱼ/n) / ((aᵢ/n)(bⱼ/n)) ]
MI equals the entropy of the labelling when truth == cluster (a perfect clustering) and 0 when the two labellings are independent.
int[] x = {0, 0, 0, 1, 1, 1};
MutualInformation.of(x, x); // = ln(2) ≈ 0.6931 nats
MutualInformation.of(new int[]{0, 0, 1, 1}, new int[]{0, 1, 0, 1}); // = 0.0 (independent)
NMI scales MI to the interval [0, 1] by dividing by a normalization factor
derived from the marginal entropies. Five normalization methods are available:
| Constant | Formula | Notes |
|---|---|---|
| NormalizedMutualInformation.JOINT | I / H(X,Y) | H(X,Y) is the joint entropy |
| NormalizedMutualInformation.MAX | I / max(H(X), H(Y)) | Bounded by the larger entropy |
| NormalizedMutualInformation.MIN | I / min(H(X), H(Y)) | Can reach 1 even for an imperfect clustering if one labelling has lower entropy |
| NormalizedMutualInformation.SUM | 2I / (H(X) + H(Y)) | Symmetric, F-measure-like |
| NormalizedMutualInformation.SQRT | I / √(H(X)·H(Y)) | Geometric-mean normalization |
double nmi = NormalizedMutualInformation.max(truth, cluster);
// or via instance:
double nmi = NormalizedMutualInformation.MAX.score(truth, cluster);
All variants equal 1.0 for a perfect clustering and 0.0 for statistically independent labellings. The variants differ for intermediate cases.
Note: Due to floating-point arithmetic, values may be infinitesimally above
1.0 (e.g., 1.0000000000000002); treat values within 1 + 1e-10 as 1.0 in
downstream comparisons.
AMI corrects MI for chance under a hypergeometric model (analogous to how ARI corrects RI):
AMI = (I − E[MI]) / (norm − E[MI])
Four normalization methods are provided:
| Constant | Denominator |
|---|---|
| AdjustedMutualInformation.MAX | max(H(X), H(Y)) − E[MI] |
| AdjustedMutualInformation.MIN | min(H(X), H(Y)) − E[MI] |
| AdjustedMutualInformation.SUM | 0.5·(H(X) + H(Y)) − E[MI] |
| AdjustedMutualInformation.SQRT | √(H(X)·H(Y)) − E[MI] |
double ami = AdjustedMutualInformation.max(truth, cluster);
// or via instance:
double ami = AdjustedMutualInformation.MAX.score(truth, cluster);
| AMI value | Interpretation |
|---|---|
| 1.0 | Perfect agreement |
| 0.0 | No more information than chance |
| < 0 | Worse than chance |
Warning: The expected MI computation involves a double sum over the hypergeometric support and can be slow for large numbers of clusters.
| Scenario | Recommended metric(s) |
|---|---|
| Balanced classes, overall correctness | Accuracy |
| Imbalanced classes | F₁, MCC, AUC |
| Precision–recall trade-off | Precision, Recall, Fβ, FDR |
| Confident probabilistic output | LogLoss, AUC |
| Multi-class, equal class importance | Macro F₁ |
| Multi-class, class-proportional | Weighted F₁ |
| Best single binary metric | MCC |
| Scenario | Recommended metric(s) |
|---|---|
| General-purpose | RMSE, R² |
| Outlier-robust | MAD |
| Comparing across different scales | R² |
| Matching the loss function (OLS) | RSS or MSE |
| Scenario | Recommended metric(s) |
|---|---|
| Ground truth available, absolute quality | ARI |
| Information-theoretic comparison | AMI (MAX) |
| Pairwise agreement, no correction | Rand Index |
| Raw MI for downstream use | MutualInformation |
| Normalized to [0, 1] without chance correction | NMI (MAX) |
import smile.validation.metric.*;
// Classification
double acc = Accuracy.of(truth, pred);
double f1 = FScore.of(truth, pred, 1.0, null);
double mcc = MatthewsCorrelation.of(truth, pred);
double auc = AUC.of(truth, prob);
double loss = LogLoss.of(truth, prob);
// Regression
double rmse = RMSE.of(yTrue, yPred);
double r2 = R2.of(yTrue, yPred);
double mad = MAD.of(yTrue, yPred);
// Clustering
double ari = AdjustedRandIndex.of(labels, clusters);
double ami = AdjustedMutualInformation.max(labels, clusters);
double nmi = NormalizedMutualInformation.max(labels, clusters);
Pass metrics as ClassificationMetric / RegressionMetric parameters:
ClassificationMetric metric = new FScore(2.0, Averaging.Macro); // F₂, macro
double score = metric.score(truth, prediction);
// Use as lambda / method reference
ClassificationMetric simpleAcc = Accuracy::of; // won't work; use instance
ClassificationMetric acc = Accuracy.instance;
Pre-built singleton instances for all metrics:
Accuracy.instance
Error.instance
AUC.instance
LogLoss.instance
MSE.instance
RMSE.instance
RSS.instance
MAD.instance
R2.instance
MutualInformation.instance
RandIndex.instance
AdjustedRandIndex.instance
NormalizedMutualInformation.JOINT // or MAX, MIN, SUM, SQRT
AdjustedMutualInformation.MAX // or MIN, SUM, SQRT
ClassificationMetrics
import smile.validation.ClassificationMetrics;
ClassificationMetrics m = ClassificationMetrics.of(fitTime, scoreTime,
truth, prediction, prob);
System.out.println(m.accuracy());
System.out.println(m.f1());
System.out.println(m.mcc());
System.out.println(m.auc());
System.out.println(m.logloss());
import smile.validation.*;
var result = CrossValidation.classification(5, data, labels,
(x, y) -> SVM.fit(x, y, kernel, C, tol),
ClassificationMetrics::of);
System.out.println(result);
int[] truth = {1,1,1,1,1,0,0,0,0,0};
int[] pred = {1,1,1,0,0,1,0,0,0,0};
// TP=3, FN=2, FP=1, TN=4
Accuracy.of(truth, pred) // 7/10 = 0.7
Error.of(truth, pred) // 3
Precision.of(truth, pred) // 3/(3+1) = 0.75
Recall.of(truth, pred) // 3/(3+2) = 0.60
FScore.of(truth, pred, 1.0, null) // 2*0.75*0.60/(0.75+0.60) ≈ 0.667
FDR.of(truth, pred) // 1/4 = 0.25
Specificity.of(truth, pred) // 4/(4+1) = 0.80
Sensitivity.of(truth, pred) // same as Recall = 0.60
MatthewsCorrelation.of(truth, pred) // 10/sqrt(600) ≈ 0.408
int[] truth = {0, 0, 1, 1};
double[] prob = {0.1, 0.4, 0.35, 0.8};
// Sorted ascending by prob: labels=[0,1,0,1], ranks=[1,2,3,4]
// Sum of positive ranks = 2+4 = 6
// AUC = (6 - 2*3/2) / (2*2) = 3/4 = 0.75
AUC.of(truth, prob); // 0.75
double[] truth = {3.0, -0.5, 2.0, 7.0};
double[] pred = {2.5, 0.0, 2.0, 8.0};
R2.of(truth, pred); // ≈ 0.948 — excellent fit
double[] naive = {2.875, 2.875, 2.875, 2.875}; // always predict mean(truth)
R2.of(truth, naive); // = 0.0 — no better than the mean
int[] x = {0, 0, 0, 1, 1, 1};
// Perfect: truth == cluster
AdjustedRandIndex.of(x, x) // 1.0
NormalizedMutualInformation.max(x, x) // 1.0
AdjustedMutualInformation.max(x, x) // 1.0
// Near-independence: with x above and y = {0, 1, 0, 1, 0, 1} the joint counts
// cannot factorize exactly (that would require counts of 1.5), so MI is small but non-zero.
int[] y = {0, 1, 0, 1, 0, 1};
MutualInformation.of(x, y) // ≈ 0.057 nats
// Exact independence with a balanced 2×2 contingency table (n = 4):
MutualInformation.of(new int[]{0,0,1,1}, new int[]{0,1,0,1}) // 0.0
All predictions from one class (no positives predicted)
Precision and FDR return NaN when no sample is predicted positive.
All ground truth from one class
Recall, Sensitivity return NaN; AUC returns NaN; MCC returns
NaN; R2 returns NaN/infinity if all truth values are equal (TSS = 0).
Guard with Double.isFinite(result) before using.
Only one class in truth for MCC
MatthewsCorrelation requires a 2×2 confusion matrix. If truth contains
only one distinct value, the resulting confusion matrix has only one non-zero
row/column and MCC returns NaN.
AMI performance
AdjustedMutualInformation computes the expected MI via a double loop over
the hypergeometric support. For datasets with many small clusters (large R
and C), this is noticeably slow. Prefer ARI or NMI for exploratory work.
NMI slightly above 1.0
Floating-point rounding can produce NMI values of 1.0 + ε. When using NMI
in comparisons (e.g., storing the best score), clamp to [0.0, 1.0]:
double nmi = Math.min(1.0, NormalizedMutualInformation.max(truth, cluster));
Large label IDs in ConfusionMatrix / Precision / Recall
These metrics allocate arrays of size max(label) + 1. Labels like
{0, 1000} allocate a 1001-element array. Use remapped/contiguous labels
for efficiency.
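A minimal remapping sketch; the remap helper and the shared code map are illustrative, not part of the library. Note that remapping changes which integer denotes each class, so prefer the multi-class (Averaging) forms afterwards or track the positive class yourself:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: map arbitrary label IDs to 0, 1, 2, ... in first-seen order.
static int[] remap(int[] labels, Map<Integer, Integer> codes) {
    int[] out = new int[labels.length];
    for (int i = 0; i < labels.length; i++) {
        out[i] = codes.computeIfAbsent(labels[i], k -> codes.size());
    }
    return out;
}

// Share one code map so the same raw label maps to the same index in both arrays.
Map<Integer, Integer> codes = new HashMap<>();
int[] t = remap(truth, codes);
int[] p = remap(prediction, codes);
double macroF1 = FScore.of(t, p, 1.0, Averaging.Macro);
```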
Probabilistic metrics and calibration
LogLoss and CrossEntropy evaluate the probability values themselves, so poorly calibrated probabilities inflate the loss; AUC depends only on the ranking of the scores, not on their calibration.
LogLoss blows up (+∞) if a probability of exactly 0.0 or 1.0 is
submitted for the wrong class. Calibrate or clip probabilities before use:
double p = Math.max(1e-15, Math.min(1 - 1e-15, rawProbability));
SMILE — Copyright © 2010-2026 Haifeng Li. GNU GPL licensed.