Build the mathematical core that makes neural networks learn.
The Foundation tier teaches you how to build a complete learning system from scratch. Starting with basic tensor operations, you'll construct the mathematical infrastructure that powers every modern ML framework—data loading, automatic differentiation, gradient-based optimization, and training loops.
By the end of this tier, you'll understand how the eight modules below fit together into a complete learning system:
```{mermaid}
:align: center
:caption: "**Foundation Module Dependencies.** Tensors and activations feed into layers, which connect to losses and dataloader, then autograd, enabling optimizers and ultimately training loops."

graph TB
    M01[01. Tensor<br/>Multidimensional arrays] --> M03[03. Layers<br/>Linear transformations]
    M02[02. Activations<br/>Non-linear functions] --> M03
    M03 --> M04[04. Losses<br/>Measure prediction quality]
    M04 --> M05[05. DataLoader<br/>Efficient data pipelines]
    M05 --> M06[06. Autograd<br/>Automatic differentiation]
    M06 --> M07[07. Optimizers<br/>Gradient-based updates]
    M07 --> M08[08. Training<br/>Complete learning loop]

    style M01 fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style M02 fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style M03 fill:#bbdefb,stroke:#1565c0,stroke-width:3px
    style M04 fill:#90caf9,stroke:#1565c0,stroke-width:3px
    style M05 fill:#90caf9,stroke:#1565c0,stroke-width:3px
    style M06 fill:#64b5f6,stroke:#0d47a1,stroke-width:3px
    style M07 fill:#64b5f6,stroke:#0d47a1,stroke-width:3px
    style M08 fill:#42a5f5,stroke:#0d47a1,stroke-width:4px
```
The Foundation tier follows a deliberate Forward Pass → Learning → Training progression that mirrors how neural networks actually work:
Tensors (01) → Activations (02) → Layers (03) → Losses (04)
You must build things in the order data flows through them:

1. **Tensors** hold the data—the multidimensional arrays every other module operates on
2. **Activations** add non-linearity—element-wise functions like ReLU and Tanh
3. **Layers** transform inputs—parameterized linear transformations
4. **Losses** measure prediction quality—the signal that drives learning

At this point, you can do a complete forward pass: input → layer → activation → loss.
DataLoader (05) → Autograd (06) → Optimizers (07)
Now you need the infrastructure to learn from data:

5. **DataLoader** provides efficient data batching—real training needs this before autograd
6. **Autograd** computes gradients automatically—the engine that makes learning possible
7. **Optimizers** use gradients to update parameters—SGD, Adam, and friends
Training (08) integrates everything into a complete learning loop.
This order isn't arbitrary—it's the minimal dependency chain. You can't build optimizers without autograd (no gradients), can't build autograd without losses (nothing to differentiate), can't build losses without layers (no predictions). Each module unlocks the next.
**What it is:** Multidimensional arrays with automatic shape tracking and broadcasting.

**Why it matters:** Tensors are the universal data structure for ML. Understanding tensor operations, broadcasting, and memory layouts is essential for building efficient neural networks.

**What you'll build:** A pure Python tensor class supporting arithmetic, reshaping, slicing, and broadcasting—just like PyTorch tensors.

**Systems focus:** Memory layout, broadcasting semantics, operation fusion
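As a taste of the broadcasting semantics you'll implement, here is a minimal sketch of the NumPy-style shape rule in pure Python. The function name `broadcast_shape` is illustrative, not the module's actual API:

```python
def broadcast_shape(a, b):
    """Compute the broadcast result shape of two shapes, NumPy-style.

    Align shapes from the right; each pair of dims must be equal,
    or one of them must be 1 (it is stretched to match the other).
    """
    result = []
    # Walk both shapes from the trailing dimension backward
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1  # missing dims count as 1
        db = b[-i] if i <= len(b) else 1
        if da == db or da == 1 or db == 1:
            result.append(max(da, db))
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(result))
```

For example, `broadcast_shape((3, 1), (1, 4))` yields `(3, 4)`: each size-1 dimension stretches to match its partner.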
**What it is:** Non-linear functions applied element-wise to tensors.

**Why it matters:** Without activations, neural networks collapse to linear models. Activations like ReLU, Sigmoid, and Tanh enable networks to learn complex, non-linear patterns.

**What you'll build:** Common activation functions with their gradients for backpropagation.

**Systems focus:** Numerical stability, in-place operations, gradient flow
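A sketch of the kind of functions this module covers, operating on plain Python lists rather than the module's tensor type. The split-branch sigmoid illustrates the numerical-stability focus: it never calls `exp()` on a large positive argument:

```python
import math

def relu(x):
    """ReLU: max(0, v), applied element-wise."""
    return [max(0.0, v) for v in x]

def relu_grad(x):
    """Derivative of ReLU: 1 where v > 0, else 0."""
    return [1.0 if v > 0 else 0.0 for v in x]

def sigmoid(x):
    """Numerically stable sigmoid: avoids overflow for large |v|."""
    out = []
    for v in x:
        if v >= 0:
            out.append(1.0 / (1.0 + math.exp(-v)))
        else:
            e = math.exp(v)  # safe: v < 0, so exp(v) < 1
            out.append(e / (1.0 + e))
    return out
```

A naive `1 / (1 + exp(-v))` overflows for `v = -1000`; the branch above keeps every `exp()` argument non-positive.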
**What it is:** Parameterized transformations (Linear, Conv2d) that learn from data.

**Why it matters:** Layers are the modular components you stack to build networks. Understanding weight initialization, parameter management, and forward passes is crucial.

**What you'll build:** Linear (fully-connected) layers with proper initialization and parameter tracking.

**Systems focus:** Parameter storage, initialization strategies, forward computation
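A minimal sketch of what a linear layer involves, assuming a pure-Python representation (nested lists for weights); the class name, He-style initialization, and `forward` signature are illustrative, not the module's actual API:

```python
import math
import random

class Linear:
    """Fully-connected layer: y = x @ W + b (illustrative sketch)."""

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        # He/Kaiming-style init: scale by sqrt(2 / fan_in),
        # a common choice for ReLU networks
        scale = math.sqrt(2.0 / in_features)
        self.W = [[rng.gauss(0.0, scale) for _ in range(out_features)]
                  for _ in range(in_features)]
        self.b = [0.0] * out_features

    def forward(self, x):
        # x: list of in_features floats -> list of out_features floats
        return [sum(x[i] * self.W[i][j] for i in range(len(x))) + self.b[j]
                for j in range(len(self.b))]
```

The initialization scale matters: too large and activations explode layer by layer, too small and they vanish.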
**What it is:** Functions that quantify how wrong your predictions are.

**Why it matters:** Loss functions define what "good" means for your model. Different tasks (classification, regression) require different loss functions.

**What you'll build:** CrossEntropyLoss, MSELoss, and other common objectives with their gradients.

**Systems focus:** Numerical stability (log-sum-exp trick), reduction strategies
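The log-sum-exp trick mentioned above can be sketched in a few lines. Subtracting the maximum logit before exponentiating leaves the result mathematically unchanged but keeps `exp()` from overflowing (function names here are illustrative):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)  # shift so the largest exponent is exp(0) = 1
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class index."""
    return -log_softmax(logits)[target]
```

With logits like `[1000.0, 1000.0]`, a naive `exp(1000)` overflows to infinity; the shifted version returns the exact answer, `log(2)`.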
**What it is:** Infrastructure for loading, batching, and shuffling training data efficiently.

**Why it matters:** Real ML systems train on datasets that don't fit in memory. DataLoaders handle batching, shuffling, and parallel data loading, which are essential for efficient training.

**What you'll build:** A DataLoader that supports batching, shuffling, and dataset iteration with proper memory management.

**Systems focus:** Memory efficiency, batching strategies, I/O optimization
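The core batching-and-shuffling loop can be sketched as a generator. This is a simplified illustration, not the module's actual API; note that it yields batches lazily rather than materializing the whole epoch in memory:

```python
import random

class DataLoader:
    """Minimal batching/shuffling iterator (illustrative sketch)."""

    def __init__(self, dataset, batch_size=2, shuffle=False, seed=0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(indices)  # new order each epoch
        for start in range(0, len(indices), self.batch_size):
            batch = indices[start:start + self.batch_size]
            yield [self.dataset[i] for i in batch]
```

Shuffling indices instead of the data itself means the dataset is never copied, which is what makes this pattern memory-efficient.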
**What it is:** Automatic differentiation system that computes gradients through computation graphs.

**Why it matters:** Autograd is what makes deep learning practical. It automatically computes gradients for any computation, enabling backpropagation through arbitrarily complex networks.

**What you'll build:** A computational graph system that tracks operations and computes gradients via the chain rule.

**Systems focus:** Computational graphs, topological sorting, gradient accumulation
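To make the three systems concerns concrete, here is a scalar-only sketch in the style of micrograd (the `Value` class and its methods are illustrative, not the module's actual API). Each operation records its parents; `backward()` topologically sorts the graph and applies the chain rule in reverse, accumulating into `.grad`:

```python
class Value:
    """Scalar autograd node (illustrative sketch; operands must be Values)."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._parents = ()
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data)
        out._parents = (self, other)
        def bw():  # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = bw
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        out._parents = (self, other)
        def bw():  # product rule: each side scaled by the other's value
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = bw
        return out

    def backward(self):
        # Topologically sort the graph, then run chain rule in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for v in reversed(order):
            v._backward()
```

For `c = a * b + a` with `a = 2, b = 3`, calling `c.backward()` gives `a.grad == b + 1 == 4` and `b.grad == a == 2`: gradients from both uses of `a` accumulate.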
**What it is:** Algorithms that update parameters using gradients (SGD, Adam, RMSprop).

**Why it matters:** Raw gradients don't directly tell you how to update parameters. Optimizers use momentum, adaptive learning rates, and other tricks to make training converge faster and more reliably.

**What you'll build:** SGD, Adam, and RMSprop with proper momentum and learning rate scheduling.

**Systems focus:** Update rules, momentum buffers, numerical stability
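The classical momentum update rule can be sketched in a few lines. The parameter representation here (a list of dicts with `"value"` and `"grad"` keys) is an assumption for illustration, not the module's actual API:

```python
class SGD:
    """SGD with classical momentum (illustrative sketch).

    Update rule per parameter:
        v <- mu * v + grad      # momentum buffer
        p <- p - lr * v
    """

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = params                 # [{"value": float, "grad": float}, ...]
        self.lr = lr
        self.momentum = momentum
        self.velocity = [0.0] * len(params)  # one buffer per parameter

    def step(self):
        for i, p in enumerate(self.params):
            self.velocity[i] = self.momentum * self.velocity[i] + p["grad"]
            p["value"] -= self.lr * self.velocity[i]

    def zero_grad(self):
        for p in self.params:
            p["grad"] = 0.0
```

With `momentum=0` this reduces to plain SGD; with `momentum>0` the buffer smooths noisy gradients across steps, which is why it tends to converge faster.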
**What it is:** The training loop that ties everything together—forward pass, loss computation, backpropagation, parameter updates.

**Why it matters:** Training loops orchestrate the entire learning process. Understanding this flow—including batching, epochs, and validation—is essential for practical ML.

**What you'll build:** A complete training framework with progress tracking, validation, and model checkpointing.

**Systems focus:** Batch processing, gradient clipping, learning rate scheduling
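The forward → loss → backward → update rhythm can be shown with a deliberately tiny example: fitting a one-parameter model `y = w * x` with mean squared error and plain gradient descent. The gradient is computed by hand here only because the model is one line; in the real module, autograd supplies it:

```python
def train(xs, ys, epochs=100, lr=0.05):
    """Minimal training loop for y = w * x (illustrative only)."""
    w = 0.0
    for _ in range(epochs):
        # Forward pass: predictions and mean squared error
        preds = [w * x for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Backward pass: dL/dw, derived by hand for this tiny model
        grad = sum(2 * (p - y) * x
                   for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Parameter update (plain SGD)
        w -= lr * grad
    return w, loss
```

On data drawn from `y = 2x`, the loop recovers `w ≈ 2.0`. Every real training loop in this tier has the same four-beat structure, just with tensors, autograd, and an optimizer in place of the hand-written pieces.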
```{mermaid}
:align: center
:caption: "**Foundation Tier Milestones.** After completing modules 01-08, you unlock three historical achievements spanning three decades of neural network breakthroughs."

timeline
    title Historical Achievements Unlocked
    1958 : Perceptron : Binary classification with gradient descent
    1969 : XOR Crisis Solved : Hidden layers enable non-linear learning
    1986 : MLP Revival : Multi-layer networks achieve 95%+ on MNIST
```
After completing the Foundation tier, you'll be able to:
Required:
Helpful but not required:
- **Per module:** 3-5 hours (implementation + exercises + systems thinking)
- **Total tier:** ~25-35 hours for complete mastery
- **Recommended pace:** 1-2 modules per week
Each module follows the Build → Use → Reflect cycle.
Ready to start building?
```bash
# Start with Module 01: Tensor
tito module start 01_tensor

# Follow the daily workflow
# 1. Read the ABOUT guide
# 2. Implement in *_dev.py
# 3. Test with tito module test
# 4. Export to *_sol.py
```
Or explore other tiers: