website/src/overview.html
---
layout: layout.njk
permalink: "{{ page.filePathStem }}.html"
title: SMILE - What's Machine Learning
---
{% include "toc.njk" %}
Machine learning is a type of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. A machine learning algorithm builds a model from example inputs so that it can make predictions or decisions on new, unseen data.
A core objective of machine learning is generalization — the ability of a learned model to perform accurately on new examples after training on a finite sample. The training examples are drawn from some generally unknown probability distribution; the learner must build a model that produces accurate predictions even for cases it has never seen before.
Machine learning tasks are classified into broad categories depending on the nature of the feedback available to the learning system.
| Paradigm | Feedback | Typical Problems | SMILE Pages |
|---|---|---|---|
| Supervised | Labelled input–output pairs | Classification, Regression, Sequence labelling | Classification, Regression, Deep Learning |
| Unsupervised | No labels — find structure | Clustering, Dimensionality reduction, Density estimation, Association rules | Clustering, Manifold Learning, Association Rules |
| Semi-supervised | A few labels + many unlabelled samples | Label propagation, Self-training, Generative models | Classification |
| Self-supervised | Pseudo-labels derived from data itself | Language modelling, Contrastive learning, Masked autoencoders | LLM, Deep Learning |
| Reinforcement | Scalar reward from environment | Game playing, Robotics, Control | — |
A feature (also called explanatory variable, predictor, or covariate) is an individual measurable property of the phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective machine learning. Features are usually numeric; a set of numeric features is conveniently described by a feature vector. Structural features such as strings, sequences, and graphs are also used in NLP and computational biology.
Feature engineering is the process of using domain knowledge to transform raw data into features that make algorithms work well. It encompasses steps such as feature extraction, scaling and normalisation, encoding of categorical variables, and feature selection.
See Feature Engineering for SMILE's full API.
In supervised learning each example is a pair: an input object (typically a feature vector) and a desired output (the label or response variable). The algorithm learns a function from inputs to outputs by analysing a labelled training set, then uses that function to predict outputs for new inputs.
Learning is typically framed as empirical risk minimisation: choose the hypothesis that minimises the average loss on the training set.
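Formally, given training pairs $(x_1, y_1), \dots, (x_n, y_n)$ and a loss function $L$, the learner selects the hypothesis from a class $\mathcal{H}$ with the smallest average training loss:

$$ \hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) $$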
| Task | Output type | Common algorithms in SMILE |
|---|---|---|
| Classification | Discrete class label | Random Forest, Gradient Boosted Trees, SVM, KNN, Logistic Regression, Naïve Bayes, LDA/QDA/RDA, AdaBoost, Neural Networks, RBF Networks |
| Regression | Continuous real value | GBDT, Random Forest, SVR, Gaussian Process, OLS, Ridge, LASSO, ElasticNet, RBF Networks |
| Sequence Labelling | Sequence of labels | Hidden Markov Model (Viterbi), Conditional Random Field |
See Classification and Regression for detailed API guides.
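To make the train-then-predict loop concrete, here is a minimal, library-free 1-nearest-neighbour classifier, a sketch of the supervised workflow rather than SMILE's API:

```java
/** Minimal 1-nearest-neighbour classifier: memorise the training set, then
 *  predict the label of the closest training point. */
public class NearestNeighbour {
    private final double[][] x; // training feature vectors
    private final int[] y;      // training labels

    public NearestNeighbour(double[][] x, int[] y) {
        this.x = x; // "training" for 1-NN is just storing the data
        this.y = y;
    }

    /** Returns the label of the training point closest to the query
     *  (squared Euclidean distance; the square root is unnecessary). */
    public int predict(double[] query) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < x.length; i++) {
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = x[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return y[best];
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] labels = {0, 0, 1, 1};
        NearestNeighbour model = new NearestNeighbour(train, labels);
        System.out.println(model.predict(new double[]{5.5, 4.8})); // prints 1
    }
}
```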
When a model captures random noise rather than the true underlying pattern it is called overfitting. An overfit model has low training error but high test error. Conversely, a model that is too simple underfits: it has high bias and cannot capture the signal.
Figure: an overfit model (green) memorises training noise, while a smoother model (black) generalises better.
The bias–variance decomposition breaks generalisation error into bias (error from overly rigid model assumptions, the underfitting failure mode), variance (error from sensitivity to the particular training sample, the overfitting failure mode), and irreducible noise.
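For squared-error loss the decomposition has the familiar closed form, where $f$ is the true function, $\hat{f}$ the learned model, and $\sigma^2$ the irreducible noise variance:

$$ \mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr] = \bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2 + \mathbb{E}\Bigl[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\Bigr] + \sigma^2 $$

The first term is the squared bias and the second is the variance.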
Ensemble methods such as Random Forest reduce variance via bagging; boosting methods such as Gradient Boosted Trees reduce bias iteratively.
To estimate how well a model generalises it must be evaluated on data it has not seen during training. SMILE provides hold-out testing, k-fold cross-validation, and bootstrap resampling.
See Model Validation for the full API, including confusion matrices, AUC, F1 score, RMSE, and more.
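As a library-free illustration of the splitting logic behind k-fold cross-validation (SMILE's validation utilities should be used in practice), the following sketch partitions sample indices so that each sample lands in exactly one test fold:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of k-fold cross-validation index splitting. Each sample appears in
 *  exactly one test fold; the other k-1 folds form the training set. */
public class KFold {
    public static void main(String[] args) {
        int n = 10, k = 5;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices); // random permutation before splitting

        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // every k-th sample (offset by the fold) goes to the test split
                if (i % k == fold) test.add(indices.get(i));
                else train.add(indices.get(i));
            }
            System.out.println("fold " + fold + " test=" + test + " train=" + train);
            // here: fit the model on `train`, score it on `test`,
            // then average the k scores for the final estimate
        }
    }
}
```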
Regularization introduces additional constraints or penalties to prevent overfitting. Common forms include the L2 (ridge) penalty on the weight norm, the L1 (lasso) penalty, their elastic-net combination, and early stopping.
From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior on weights and L1 to a Laplace prior.
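For linear models the two penalties yield the ridge and lasso objectives respectively, with $\lambda$ controlling the strength of the penalty:

$$ \min_{w} \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_2^2 \qquad\qquad \min_{w} \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_1 $$

The L1 penalty additionally drives some weights to exactly zero, performing feature selection.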
Beyond model parameters learned from data, most algorithms have hyperparameters set before training (e.g., number of trees, learning rate, regularization strength). SMILE supports grid search and random search over hyperparameter spaces.
See Validation & HPO.
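The idea behind random search is simple enough to sketch in a few lines. This is a generic illustration, not SMILE's API; the `validate` method below is a hypothetical stand-in for training a model with the sampled settings and scoring it on held-out data:

```java
import java.util.Random;

/** Sketch of random search over two hyperparameters: sample configurations,
 *  evaluate each on a validation set, keep the best. */
public class RandomSearch {
    // Hypothetical stand-in for "train with these settings, return accuracy".
    static double validate(int trees, double learningRate) {
        return -Math.abs(trees - 300) / 300.0 - Math.abs(learningRate - 0.1);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestTrees = 0;
        double bestRate = 0;
        for (int trial = 0; trial < 50; trial++) {
            int trees = 50 + rng.nextInt(451);                      // 50..500
            double rate = Math.pow(10, -3 + 2 * rng.nextDouble());  // 1e-3..1e-1, log scale
            double score = validate(trees, rate);
            if (score > bestScore) {
                bestScore = score;
                bestTrees = trees;
                bestRate = rate;
            }
        }
        System.out.printf("best: trees=%d, learningRate=%.4f%n", bestTrees, bestRate);
    }
}
```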
Unsupervised learning discovers hidden structure in unlabelled data. Because there is no explicit error signal to optimise, evaluating unsupervised models requires different criteria: intra-cluster cohesion, log-likelihood under a density model, reconstruction error, etc.
Clustering groups objects so that items in the same cluster are more similar to each other than to items in other clusters. SMILE provides a wide range of algorithms:
| Category | Algorithms |
|---|---|
| Partitional | K-Means, X-Means, G-Means, Deterministic Annealing |
| Hierarchical | Agglomerative (single, complete, average, Ward linkage) |
| Density-based | DBSCAN, DENCLUE |
| Grid / scalable | BIRCH, CLARANS |
| Spectral / graph | Spectral Clustering, SIB |
| Neural / SOM | Self-Organizing Map (SOM), Neural Gas, Growing Neural Gas |
| Information-theoretic | Min-Entropy Clustering |
See Clustering.
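To give a flavour of the partitional family, here is a compact, library-free sketch of Lloyd's k-means algorithm with naive seeding and a fixed iteration count; production implementations add smarter seeding and convergence checks:

```java
import java.util.Arrays;

/** Compact k-means (Lloyd's algorithm): alternate between assigning each
 *  point to its nearest centroid and moving each centroid to the mean of
 *  the points assigned to it. */
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {1, 0.5}, {8.5, 9.5}};
        int k = 2, iterations = 10;

        // naive seeding: copy two distinct data points as initial centroids
        double[][] centroids = {data[0].clone(), data[2].clone()};
        int[] assign = new int[data.length];

        for (int iter = 0; iter < iterations; iter++) {
            // assignment step: nearest centroid by squared Euclidean distance
            for (int i = 0; i < data.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < data[i].length; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // update step: move each centroid to the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[data[0].length];
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assign[i] != c) continue;
                    for (int j = 0; j < sum.length; j++) sum[j] += data[i][j];
                    count++;
                }
                if (count > 0)
                    for (int j = 0; j < sum.length; j++) centroids[c][j] = sum[j] / count;
            }
        }
        System.out.println("assignments: " + Arrays.toString(assign));
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
    }
}
```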
High-dimensional data often lies near a low-dimensional manifold. Dimensionality reduction makes data easier to visualise, speeds up downstream algorithms, and can remove noise.
See Manifold Learning and Multi-Dimensional Scaling.
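The simplest linear instance is PCA: centre the data, then find the directions of maximum variance, which are the top eigenvectors of the covariance matrix. A library-free sketch using power iteration to recover the first principal component of 2-D toy data:

```java
/** Sketch of PCA's core step: power iteration on the covariance matrix of
 *  centred 2-D data to find the first principal component. */
public class PowerIterationPCA {
    public static void main(String[] args) {
        double[][] x = {{2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0}};
        int n = x.length, d = x[0].length;

        // centre each column
        double[] mean = new double[d];
        for (double[] row : x)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        for (double[] row : x)
            for (int j = 0; j < d; j++) row[j] -= mean[j];

        // covariance matrix C = X^T X / (n - 1)
        double[][] cov = new double[d][d];
        for (double[] row : x)
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++) cov[a][b] += row[a] * row[b] / (n - 1);

        // power iteration: repeatedly apply C and renormalise (2-D case)
        double[] v = {1, 0};
        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[d];
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++) next[a] += cov[a][b] * v[b];
            double norm = Math.sqrt(next[0] * next[0] + next[1] * next[1]);
            for (int a = 0; a < d; a++) v[a] = next[a] / norm;
        }
        System.out.printf("first principal component: (%.3f, %.3f)%n", v[0], v[1]);
        // projecting each row onto v gives the 1-D reduced representation
    }
}
```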
Association rule mining discovers interesting co-occurrence patterns among variables in large databases. The classic application is market basket analysis: given supermarket transaction records, find rules of the form {onions, potatoes} ⇒ {burger meat}.
Rules are evaluated with three metrics: support (the fraction of transactions containing all the rule's items), confidence (how often the consequent appears when the antecedent does), and lift (confidence divided by the base rate of the consequent; values above 1 indicate a positive association).
SMILE implements the FP-growth algorithm, which mines frequent itemsets without candidate generation, making it efficient on large datasets. See Association Rule Mining.
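The three metrics are straightforward to compute directly. A toy sketch for the market-basket rule above (plain Java, not SMILE's association API):

```java
import java.util.List;
import java.util.Set;

/** Computes support, confidence, and lift for the rule
 *  {onions, potatoes} => {burger meat} over a toy transaction list. */
public class RuleMetrics {
    /** Fraction of transactions that contain all the given items. */
    static double support(List<Set<String>> tx, Set<String> items) {
        return tx.stream().filter(t -> t.containsAll(items)).count() / (double) tx.size();
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
            Set.of("onions", "potatoes", "burger meat"),
            Set.of("onions", "potatoes", "burger meat", "beer"),
            Set.of("onions", "potatoes"),
            Set.of("milk", "bread"),
            Set.of("burger meat", "beer"));

        Set<String> lhs = Set.of("onions", "potatoes");
        Set<String> both = Set.of("onions", "potatoes", "burger meat");

        double support = support(tx, both);                           // P(lhs and rhs)
        double confidence = support / support(tx, lhs);               // P(rhs | lhs)
        double lift = confidence / support(tx, Set.of("burger meat")); // vs. rhs base rate

        System.out.printf("support=%.2f confidence=%.2f lift=%.2f%n",
                          support, confidence, lift);
    }
}
```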
Labelled data is expensive to acquire; unlabelled data is cheap. Semi-supervised learning combines a small labelled set with a large unlabelled set to improve model accuracy beyond what either alone could achieve.
Assumptions that make semi-supervised learning effective:
- Continuity (smoothness) assumption: points that are close together are likely to share a label, yielding a preference for decision boundaries in low-density regions.
- Cluster assumption: data tend to form discrete clusters; points in the same cluster are likely to share a label.
- Manifold assumption: data lie near a low-dimensional manifold; learning the manifold structure from unlabelled data helps avoid the curse of dimensionality.
A self-supervised model is trained on a pretext task whose labels are derived automatically from the data — no human annotation required. The pretext task forces the model to learn rich internal representations that transfer well to downstream tasks.
Examples include next-token language modelling, masked-token prediction, contrastive learning, and masked autoencoders.
Generative AI models learn to produce new data samples — text, images, audio, video, 3D — that are statistically indistinguishable from real data. The three dominant approaches are Transformers, diffusion models, and GANs.
The Transformer is built on multi-head scaled dot-product attention. Text is tokenized into sub-word units and converted to dense vectors via an embedding table. At each layer, every token is contextualised against all other tokens in the context window via parallel attention heads, allowing important signals to be amplified. GPT-family models use a decoder-only stack trained with next-token prediction. LLaMA-3 extends this with grouped-query attention (GQA), rotary positional encoding (RoPE), SwiGLU feed-forward networks, and RMS normalisation.
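At the heart of every head is the same small computation, sketched below for a single head in plain Java; production implementations batch over many heads and run on accelerators:

```java
/** Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V. */
public class Attention {
    public static void main(String[] args) {
        // 3 tokens, d = 2. In a real model Q, K, V are learned projections
        // of the token embeddings; here they are fixed toy values.
        double[][] q = {{1, 0}, {0, 1}, {1, 1}};
        double[][] k = {{1, 0}, {0, 1}, {1, 1}};
        double[][] v = {{1, 2}, {3, 4}, {5, 6}};
        int n = q.length, d = q[0].length;

        double[][] out = new double[n][v[0].length];
        for (int i = 0; i < n; i++) {
            // scores_ij = (q_i . k_j) / sqrt(d)
            double[] scores = new double[n];
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                for (int t = 0; t < d; t++) scores[j] += q[i][t] * k[j][t];
                scores[j] /= Math.sqrt(d);
                max = Math.max(max, scores[j]);
            }
            // numerically stable softmax over the scores
            double sum = 0;
            for (int j = 0; j < n; j++) {
                scores[j] = Math.exp(scores[j] - max);
                sum += scores[j];
            }
            // output_i = sum_j softmax(scores)_j * v_j
            for (int j = 0; j < n; j++)
                for (int t = 0; t < v[0].length; t++)
                    out[i][t] += scores[j] / sum * v[j][t];
        }
        for (double[] row : out) System.out.printf("%.3f %.3f%n", row[0], row[1]);
    }
}
```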
Each generation of GPT models is significantly more capable than the previous due to increased model size (number of trainable parameters) and larger training data.
SMILE ships a complete LLaMA-3 inference stack — see Large Language Models.
Diffusion models (Stable Diffusion, DALL·E 3) learn to reverse a gradual Gaussian-noise corruption process. Training: iteratively add noise to real images and teach a network to undo it. Inference: start from pure noise and denoise step by step, guided by a text prompt via cross-attention. Key components are a text encoder that embeds the prompt, a denoising network with cross-attention to the prompt embedding, a noise scheduler that controls the corruption process, and, in latent diffusion, an autoencoder that maps between pixel space and a compact latent space.
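The forward (noising) process has a convenient closed form. With noise schedule $\bar{\alpha}_t$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, a training image $x_0$ corrupted to step $t$ is

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$

and the denoising network is trained to predict $\epsilon$ from $x_t$ and $t$.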
A GAN pits two networks against each other in a minimax game: a generator that maps random noise vectors to synthetic samples, and a discriminator that tries to tell real samples from generated ones.
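The standard objective (Goodfellow et al., 2014) is

$$ \min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] $$

where $D(x)$ is the discriminator's probability that $x$ is real and $G(z)$ is a sample generated from noise $z$.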
At convergence the generator produces samples that the discriminator cannot distinguish from real data. Known challenges include mode collapse and training instability.
Deep learning uses multi-layer neural networks to learn hierarchical representations directly from raw data (pixels, waveforms, tokens). Key architectural families:
| Architecture | Typical use |
|---|---|
| MLP (fully-connected) | Tabular data, embeddings |
| CNN (convolutional) | Images, audio spectrograms |
| RNN / LSTM / GRU | Sequences, time series (largely superseded by Transformers) |
| Transformer | Text, images (ViT), multi-modal |
| GNN (graph neural network) | Molecular property prediction, social networks |
| Diffusion model | Image and audio synthesis |
| VAE (variational autoencoder) | Representation learning, generation |
SMILE's smile-deep module wraps LibTorch (PyTorch C++) with GPU acceleration, pre-built layer primitives, and pretrained models (EfficientNet-V2 image classification, LLaMA-3 language models). See Deep Learning.
A reinforcement learning (RL) agent interacts with an environment to maximise cumulative reward. Unlike supervised learning, no labelled examples are provided; the agent discovers which actions yield reward by trying them. Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
There are four main components of an RL system: a policy, a reward signal, a value function, and, optionally, a model of the environment.
Markov Decision Processes (MDPs) provide the standard mathematical framework. Algorithms range from tabular methods (Q-Learning, SARSA) to deep RL (DQN, PPO, SAC) for large or continuous state/action spaces.
The fundamental challenge is the exploration–exploitation trade-off: the agent must exploit known good actions to accumulate reward, while also exploring new actions that might yield higher long-term payoffs. Selecting actions uniformly at random performs poorly; clever exploration mechanisms (ε-greedy, UCB, Thompson sampling) are essential.
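The simplest such mechanism fits in a few lines. A library-free ε-greedy sketch on a three-armed Bernoulli bandit (the arms' reward probabilities are of course hidden from a real agent):

```java
import java.util.Arrays;
import java.util.Random;

/** Epsilon-greedy on a 3-armed bandit: explore a random arm with probability
 *  epsilon, otherwise exploit the arm with the highest average reward so far. */
public class EpsilonGreedy {
    public static void main(String[] args) {
        double[] trueMeans = {0.2, 0.5, 0.8}; // hidden reward probabilities
        double epsilon = 0.1;
        Random rng = new Random(7);

        double[] sums = new double[trueMeans.length];
        int[] counts = new int[trueMeans.length];

        for (int step = 0; step < 10_000; step++) {
            int arm;
            if (rng.nextDouble() < epsilon) {
                arm = rng.nextInt(trueMeans.length); // explore
            } else {
                arm = 0; // exploit: best running average (untried arms score infinity)
                for (int a = 1; a < trueMeans.length; a++) {
                    double est = counts[a] == 0 ? Double.POSITIVE_INFINITY : sums[a] / counts[a];
                    double best = counts[arm] == 0 ? Double.POSITIVE_INFINITY : sums[arm] / counts[arm];
                    if (est > best) arm = a;
                }
            }
            double reward = rng.nextDouble() < trueMeans[arm] ? 1 : 0; // Bernoulli reward
            sums[arm] += reward;
            counts[arm]++;
        }
        System.out.println("pulls per arm: " + Arrays.toString(counts));
        // the agent converges to pulling the best arm (mean 0.8) most of the time
    }
}
```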
Most machine learning projects follow a common lifecycle: data collection and cleaning, feature engineering, model training, validation and hyperparameter tuning, deployment, and ongoing monitoring.