website/src/overview.html
---
layout: layout.njk
permalink: "{{ page.filePathStem }}.html"
title: SMILE - What's Machine Learning
---
{% include "toc.njk" %}
Machine learning is a type of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. A machine learning algorithm builds a model from example inputs so that it can make predictions or decisions on new, unseen data.
A core objective of machine learning is generalization — the ability of a learned model to perform accurately on new examples after training on a finite sample. The training examples are drawn from some generally unknown probability distribution; the learner must build a model that produces accurate predictions even for cases it has never seen before.
Machine learning tasks are classified into broad categories depending on the nature of the feedback available to the learning system.
| Paradigm | Feedback | Typical Problems | SMILE Pages |
|---|---|---|---|
| Supervised | Labelled input–output pairs | Classification, Regression, Sequence labelling | Classification, Regression, Deep Learning |
| Unsupervised | No labels — find structure | Clustering, Dimensionality reduction, Density estimation, Association rules | Clustering, Manifold Learning, Association Rules |
| Semi-supervised | A few labels + many unlabelled samples | Label propagation, Self-training, Generative models | Classification |
| Self-supervised | Pseudo-labels derived from data itself | Language modelling, Contrastive learning, Masked autoencoders | LLM, Deep Learning |
| Reinforcement | Scalar reward from environment | Game playing, Robotics, Control | — |
A feature (also called explanatory variable, predictor, or covariate) is an individual measurable property of the phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective machine learning. Features are usually numeric; a set of numeric features is conveniently described by a feature vector. Structural features such as strings, sequences, and graphs are also used in NLP and computational biology.
Feature engineering is the process of using domain knowledge to transform raw data into features that make algorithms work well. It encompasses steps such as feature extraction, scaling and normalisation, encoding of categorical variables, and feature selection.
See Feature Engineering for SMILE's full API.
In supervised learning each example is a pair: an input object (typically a feature vector) and a desired output (the label or response variable). The algorithm learns a function from inputs to outputs by analysing a labelled training set, then uses that function to predict outputs for new inputs.
Learning is typically framed as empirical risk minimisation: choose the hypothesis that minimises the average loss on the training set.
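Formally, given training pairs $(x_1, y_1), \dots, (x_n, y_n)$ and a loss function $L$, the learner selects the hypothesis from a class $\mathcal{H}$ with the smallest average training loss:

$$ \hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) $$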
| Task | Output type | Common algorithms in SMILE |
|---|---|---|
| Classification | Discrete class label | Random Forest, Gradient Boosted Trees, SVM, KNN, Logistic Regression, Naïve Bayes, LDA/QDA/RDA, AdaBoost, Neural Networks, RBF Networks |
| Regression | Continuous real value | GBDT, Random Forest, SVR, Gaussian Process, OLS, Ridge, LASSO, ElasticNet, RBF Networks |
| Sequence Labelling | Sequence of labels | Hidden Markov Model (Viterbi), Conditional Random Field |
See Classification and Regression for detailed API guides.
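To make the train-then-predict loop concrete, here is a minimal, library-free 1-nearest-neighbour classifier, a sketch of the supervised workflow rather than SMILE's API:

```java
/** Minimal 1-nearest-neighbour classifier: memorise the training set, then
 *  predict the label of the closest training point. */
public class NearestNeighbour {
    private final double[][] x; // training feature vectors
    private final int[] y;      // training labels

    public NearestNeighbour(double[][] x, int[] y) {
        this.x = x; // "training" for 1-NN is just storing the data
        this.y = y;
    }

    /** Returns the label of the training point closest to the query
     *  (squared Euclidean distance; the square root is unnecessary). */
    public int predict(double[] query) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < x.length; i++) {
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = x[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return y[best];
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] labels = {0, 0, 1, 1};
        NearestNeighbour model = new NearestNeighbour(train, labels);
        System.out.println(model.predict(new double[]{5.5, 4.8})); // prints 1
    }
}
```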
When a model captures random noise rather than the true underlying pattern it is called overfitting. An overfit model has low training error but high test error. Conversely, a model that is too simple underfits: it has high bias and cannot capture the signal.
Figure: an overfit model (green) memorises training noise, while a smoother model (black) generalises better.
The bias–variance decomposition breaks generalisation error into bias (error from overly rigid model assumptions, the underfitting failure mode), variance (error from sensitivity to the particular training sample, the overfitting failure mode), and irreducible noise.
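For squared-error loss the decomposition has the familiar closed form, where $f$ is the true function, $\hat{f}$ the learned model, and $\sigma^2$ the irreducible noise variance:

$$ \mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr] = \bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2 + \mathbb{E}\Bigl[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\Bigr] + \sigma^2 $$

The first term is the squared bias and the second is the variance.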
Ensemble methods such as Random Forest reduce variance via bagging; boosting methods such as Gradient Boosted Trees reduce bias iteratively.
To estimate how well a model generalises it must be evaluated on data it has not seen during training. SMILE provides hold-out testing, k-fold cross-validation, and bootstrap resampling.
See Model Validation for the full API, including confusion matrices, AUC, F1 score, RMSE, and more.
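As a library-free illustration of the splitting logic behind k-fold cross-validation (SMILE's validation utilities should be used in practice), the following sketch partitions sample indices so that each sample lands in exactly one test fold:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of k-fold cross-validation index splitting. Each sample appears in
 *  exactly one test fold; the other k-1 folds form the training set. */
public class KFold {
    public static void main(String[] args) {
        int n = 10, k = 5;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices); // random permutation before splitting

        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // every k-th sample (offset by the fold) goes to the test split
                if (i % k == fold) test.add(indices.get(i));
                else train.add(indices.get(i));
            }
            System.out.println("fold " + fold + " test=" + test + " train=" + train);
            // here: fit the model on `train`, score it on `test`,
            // then average the k scores for the final estimate
        }
    }
}
```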
Regularization introduces additional constraints or penalties to prevent overfitting. Common forms include the L2 (ridge) penalty on the weight norm, the L1 (lasso) penalty, their elastic-net combination, and early stopping.
From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior on weights and L1 to a Laplace prior.
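For linear models the two penalties yield the ridge and lasso objectives respectively, with $\lambda$ controlling the strength of the penalty:

$$ \min_{w} \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_2^2 \qquad\qquad \min_{w} \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_1 $$

The L1 penalty additionally drives some weights to exactly zero, performing feature selection.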
Beyond model parameters learned from data, most algorithms have hyperparameters set before training (e.g., number of trees, learning rate, regularization strength). SMILE supports grid search and random search over hyperparameter spaces.
See Validation & HPO.
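The idea behind random search is simple enough to sketch in a few lines. This is a generic illustration, not SMILE's API; the `validate` method below is a hypothetical stand-in for training a model with the sampled settings and scoring it on held-out data:

```java
import java.util.Random;

/** Sketch of random search over two hyperparameters: sample configurations,
 *  evaluate each on a validation set, keep the best. */
public class RandomSearch {
    // Hypothetical stand-in for "train with these settings, return accuracy".
    static double validate(int trees, double learningRate) {
        return -Math.abs(trees - 300) / 300.0 - Math.abs(learningRate - 0.1);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestTrees = 0;
        double bestRate = 0;
        for (int trial = 0; trial < 50; trial++) {
            int trees = 50 + rng.nextInt(451);                      // 50..500
            double rate = Math.pow(10, -3 + 2 * rng.nextDouble());  // 1e-3..1e-1, log scale
            double score = validate(trees, rate);
            if (score > bestScore) {
                bestScore = score;
                bestTrees = trees;
                bestRate = rate;
            }
        }
        System.out.printf("best: trees=%d, learningRate=%.4f%n", bestTrees, bestRate);
    }
}
```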
Unsupervised learning discovers hidden structure in unlabelled data. Because there is no explicit error signal to optimise, evaluating unsupervised models requires different criteria: intra-cluster cohesion, log-likelihood under a density model, reconstruction error, etc.
Clustering groups objects so that items in the same cluster are more similar to each other than to items in other clusters. SMILE provides a wide range of algorithms:
| Category | Algorithms |
|---|---|
| Partitional | K-Means, X-Means, G-Means, Deterministic Annealing |
| Hierarchical | Agglomerative (single, complete, average, Ward linkage) |
| Density-based | DBSCAN, DENCLUE |
| Grid / scalable | BIRCH, CLARANS |
| Spectral / graph | Spectral Clustering, SIB |
| Neural / SOM | Self-Organizing Map (SOM), Neural Gas, Growing Neural Gas |
| Information-theoretic | Min-Entropy Clustering |
See Clustering.
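To give a flavour of the partitional family, here is a compact, library-free sketch of Lloyd's k-means algorithm with naive seeding and a fixed iteration count; production implementations add smarter seeding and convergence checks:

```java
import java.util.Arrays;

/** Compact k-means (Lloyd's algorithm): alternate between assigning each
 *  point to its nearest centroid and moving each centroid to the mean of
 *  the points assigned to it. */
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {1, 0.5}, {8.5, 9.5}};
        int k = 2, iterations = 10;

        // naive seeding: copy two distinct data points as initial centroids
        double[][] centroids = {data[0].clone(), data[2].clone()};
        int[] assign = new int[data.length];

        for (int iter = 0; iter < iterations; iter++) {
            // assignment step: nearest centroid by squared Euclidean distance
            for (int i = 0; i < data.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < data[i].length; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // update step: move each centroid to the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[data[0].length];
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assign[i] != c) continue;
                    for (int j = 0; j < sum.length; j++) sum[j] += data[i][j];
                    count++;
                }
                if (count > 0)
                    for (int j = 0; j < sum.length; j++) centroids[c][j] = sum[j] / count;
            }
        }
        System.out.println("assignments: " + Arrays.toString(assign));
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
    }
}
```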
High-dimensional data often lies near a low-dimensional manifold. Dimensionality reduction makes data easier to visualise, speeds up downstream algorithms, and can remove noise.
See Manifold Learning and Multi-Dimensional Scaling.
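The simplest linear instance is PCA: centre the data, then find the directions of maximum variance, which are the top eigenvectors of the covariance matrix. A library-free sketch using power iteration to recover the first principal component of 2-D toy data:

```java
/** Sketch of PCA's core step: power iteration on the covariance matrix of
 *  centred 2-D data to find the first principal component. */
public class PowerIterationPCA {
    public static void main(String[] args) {
        double[][] x = {{2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0}};
        int n = x.length, d = x[0].length;

        // centre each column
        double[] mean = new double[d];
        for (double[] row : x)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        for (double[] row : x)
            for (int j = 0; j < d; j++) row[j] -= mean[j];

        // covariance matrix C = X^T X / (n - 1)
        double[][] cov = new double[d][d];
        for (double[] row : x)
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++) cov[a][b] += row[a] * row[b] / (n - 1);

        // power iteration: repeatedly apply C and renormalise (2-D case)
        double[] v = {1, 0};
        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[d];
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++) next[a] += cov[a][b] * v[b];
            double norm = Math.sqrt(next[0] * next[0] + next[1] * next[1]);
            for (int a = 0; a < d; a++) v[a] = next[a] / norm;
        }
        System.out.printf("first principal component: (%.3f, %.3f)%n", v[0], v[1]);
        // projecting each row onto v gives the 1-D reduced representation
    }
}
```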
Association rule mining discovers interesting co-occurrence patterns among variables in large databases. The classic application is market basket analysis: given supermarket transaction records, find rules of the form {onions, potatoes} ⇒ {burger meat}.
Rules are evaluated with three metrics: support (the fraction of transactions containing all the rule's items), confidence (how often the consequent appears when the antecedent does), and lift (confidence divided by the base rate of the consequent; values above 1 indicate a positive association).
SMILE implements the FP-growth algorithm, which mines frequent itemsets without candidate generation, making it efficient on large datasets. See Association Rule Mining.
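The three metrics are straightforward to compute directly. A toy sketch for the market-basket rule above (plain Java, not SMILE's association API):

```java
import java.util.List;
import java.util.Set;

/** Computes support, confidence, and lift for the rule
 *  {onions, potatoes} => {burger meat} over a toy transaction list. */
public class RuleMetrics {
    /** Fraction of transactions that contain all the given items. */
    static double support(List<Set<String>> tx, Set<String> items) {
        return tx.stream().filter(t -> t.containsAll(items)).count() / (double) tx.size();
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
            Set.of("onions", "potatoes", "burger meat"),
            Set.of("onions", "potatoes", "burger meat", "beer"),
            Set.of("onions", "potatoes"),
            Set.of("milk", "bread"),
            Set.of("burger meat", "beer"));

        Set<String> lhs = Set.of("onions", "potatoes");
        Set<String> both = Set.of("onions", "potatoes", "burger meat");

        double support = support(tx, both);                           // P(lhs and rhs)
        double confidence = support / support(tx, lhs);               // P(rhs | lhs)
        double lift = confidence / support(tx, Set.of("burger meat")); // vs. rhs base rate

        System.out.printf("support=%.2f confidence=%.2f lift=%.2f%n",
                          support, confidence, lift);
    }
}
```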
Labelled data is expensive to acquire; unlabelled data is cheap. Semi-supervised learning combines a small labelled set with a large unlabelled set to improve model accuracy beyond what either alone could achieve.
Assumptions that make semi-supervised learning effective:
- Continuity (smoothness) assumption: points that are close together are likely to share a label, yielding a preference for decision boundaries in low-density regions.
- Cluster assumption: data tend to form discrete clusters; points in the same cluster are likely to share a label.
- Manifold assumption: data lie near a low-dimensional manifold; learning the manifold structure from unlabelled data helps avoid the curse of dimensionality.
A self-supervised model is trained on a pretext task whose labels are derived automatically from the data — no human annotation required. The pretext task forces the model to learn rich internal representations that transfer well to downstream tasks.
Examples include next-token language modelling, masked-token prediction, contrastive learning, and masked autoencoders.
Generative AI models learn to produce new data samples — text, images, audio, video, 3D — that are statistically indistinguishable from real data. The three dominant approaches are Transformers, diffusion models, and GANs.
The Transformer is built on multi-head scaled dot-product attention. Text is tokenized into sub-word units and converted to dense vectors via an embedding table. At each layer, every token is contextualised against all other tokens in the context window via parallel attention heads, allowing important signals to be amplified. GPT-family models use a decoder-only stack trained with next-token prediction. LLaMA-3 extends this with grouped-query attention (GQA), rotary positional encoding (RoPE), SwiGLU feed-forward networks, and RMS normalisation.
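At the heart of every head is the same small computation, sketched below for a single head in plain Java; production implementations batch over many heads and run on accelerators:

```java
/** Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V. */
public class Attention {
    public static void main(String[] args) {
        // 3 tokens, d = 2. In a real model Q, K, V are learned projections
        // of the token embeddings; here they are fixed toy values.
        double[][] q = {{1, 0}, {0, 1}, {1, 1}};
        double[][] k = {{1, 0}, {0, 1}, {1, 1}};
        double[][] v = {{1, 2}, {3, 4}, {5, 6}};
        int n = q.length, d = q[0].length;

        double[][] out = new double[n][v[0].length];
        for (int i = 0; i < n; i++) {
            // scores_ij = (q_i . k_j) / sqrt(d)
            double[] scores = new double[n];
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                for (int t = 0; t < d; t++) scores[j] += q[i][t] * k[j][t];
                scores[j] /= Math.sqrt(d);
                max = Math.max(max, scores[j]);
            }
            // numerically stable softmax over the scores
            double sum = 0;
            for (int j = 0; j < n; j++) {
                scores[j] = Math.exp(scores[j] - max);
                sum += scores[j];
            }
            // output_i = sum_j softmax(scores)_j * v_j
            for (int j = 0; j < n; j++)
                for (int t = 0; t < v[0].length; t++)
                    out[i][t] += scores[j] / sum * v[j][t];
        }
        for (double[] row : out) System.out.printf("%.3f %.3f%n", row[0], row[1]);
    }
}
```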
Each generation of GPT models is significantly more capable than the previous due to increased model size (number of trainable parameters) and larger training data.
SMILE ships a complete LLaMA-3 inference stack — see Large Language Models.
Diffusion models (Stable Diffusion, DALL·E 3) learn to reverse a gradual Gaussian-noise corruption process. Training: iteratively add noise to real images and teach a network to undo it. Inference: start from pure noise and denoise step by step, guided by a text prompt via cross-attention. Key components are a text encoder that embeds the prompt, a denoising network with cross-attention to the prompt embedding, a noise scheduler that controls the corruption process, and, in latent diffusion, an autoencoder that maps between pixel space and a compact latent space.
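The forward (noising) process has a convenient closed form. With noise schedule $\bar{\alpha}_t$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, a training image $x_0$ corrupted to step $t$ is

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$

and the denoising network is trained to predict $\epsilon$ from $x_t$ and $t$.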
A GAN pits two networks against each other in a minimax game: a generator that maps random noise vectors to synthetic samples, and a discriminator that tries to tell real samples from generated ones.
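The standard objective (Goodfellow et al., 2014) is

$$ \min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] $$

where $D(x)$ is the discriminator's probability that $x$ is real and $G(z)$ is a sample generated from noise $z$.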
At convergence the generator produces samples that the discriminator cannot distinguish from real data. Known challenges include mode collapse and training instability.
Deep learning uses multi-layer neural networks to learn hierarchical representations directly from raw data (pixels, waveforms, tokens). Key architectural families:
| Architecture | Typical use |
|---|---|
| MLP (fully-connected) | Tabular data, embeddings |
| CNN (convolutional) | Images, audio spectrograms |
| RNN / LSTM / GRU | Sequences, time series (largely superseded by Transformers) |
| Transformer | Text, images (ViT), multi-modal |
| GNN (graph neural network) | Molecular property prediction, social networks |
| Diffusion model | Image and audio synthesis |
| VAE (variational autoencoder) | Representation learning, generation |
SMILE's smile-deep module wraps LibTorch (PyTorch C++) with GPU acceleration, pre-built layer primitives, and pretrained models (EfficientNet-V2 image classification, LLaMA-3 language models). See Deep Learning.
A reinforcement learning (RL) agent interacts with an environment to maximise cumulative reward. Unlike supervised learning, no labelled examples are provided; the agent discovers which actions yield reward by trying them. Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
There are four main components of an RL system: a policy, a reward signal, a value function, and, optionally, a model of the environment.
Markov Decision Processes (MDPs) provide the standard mathematical framework. Algorithms range from tabular methods (Q-Learning, SARSA) to deep RL (DQN, PPO, SAC) for large or continuous state/action spaces.
The fundamental challenge is the exploration–exploitation trade-off: the agent must exploit known good actions to accumulate reward, while also exploring new actions that might yield higher long-term payoffs. Selecting actions uniformly at random performs poorly; clever exploration mechanisms (ε-greedy, UCB, Thompson sampling) are essential.
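The simplest such mechanism fits in a few lines. A library-free ε-greedy sketch on a three-armed Bernoulli bandit (the arms' reward probabilities are of course hidden from a real agent):

```java
import java.util.Arrays;
import java.util.Random;

/** Epsilon-greedy on a 3-armed bandit: explore a random arm with probability
 *  epsilon, otherwise exploit the arm with the highest average reward so far. */
public class EpsilonGreedy {
    public static void main(String[] args) {
        double[] trueMeans = {0.2, 0.5, 0.8}; // hidden reward probabilities
        double epsilon = 0.1;
        Random rng = new Random(7);

        double[] sums = new double[trueMeans.length];
        int[] counts = new int[trueMeans.length];

        for (int step = 0; step < 10_000; step++) {
            int arm;
            if (rng.nextDouble() < epsilon) {
                arm = rng.nextInt(trueMeans.length); // explore
            } else {
                arm = 0; // exploit: best running average (untried arms score infinity)
                for (int a = 1; a < trueMeans.length; a++) {
                    double est = counts[a] == 0 ? Double.POSITIVE_INFINITY : sums[a] / counts[a];
                    double best = counts[arm] == 0 ? Double.POSITIVE_INFINITY : sums[arm] / counts[arm];
                    if (est > best) arm = a;
                }
            }
            double reward = rng.nextDouble() < trueMeans[arm] ? 1 : 0; // Bernoulli reward
            sums[arm] += reward;
            counts[arm]++;
        }
        System.out.println("pulls per arm: " + Arrays.toString(counts));
        // the agent converges to pulling the best arm (mean 0.8) most of the time
    }
}
```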
Most machine learning projects follow a common lifecycle: data collection and cleaning, feature engineering, model training, validation and hyperparameter tuning, deployment, and ongoing monitoring.