tinytorch/milestones/04_1998_cnn/README.md
After backpropagation revived neural networks (1986), researchers still struggled with image recognition. MLPs treated pixels independently, requiring millions of parameters and ignoring spatial structure.
Then in 1998, Yann LeCun's LeNet-5 revolutionized computer vision with Convolutional Neural Networks (CNNs): networks that exploit the spatial structure of images. LeNet achieved 99%+ accuracy on handwritten digits, launching the deep learning lineage that led to ImageNet (2012), object detection, and modern computer vision.
Run after Module 09 (Convolutions: Conv2d + Pooling)
<table>
  <thead>
    <tr>
      <th width="25%"><b>Module</b></th>
      <th width="25%"><b>Component</b></th>
      <th width="50%"><b>What It Provides</b></th>
    </tr>
  </thead>
  <tbody>
    <tr><td><b>Module 01</b></td><td>Tensor</td><td>YOUR data structure</td></tr>
    <tr><td><b>Module 02</b></td><td>Activations</td><td>YOUR ReLU activation</td></tr>
    <tr><td><b>Module 03</b></td><td>Layers</td><td>YOUR Linear layers</td></tr>
    <tr><td><b>Module 04</b></td><td>Losses</td><td>YOUR CrossEntropyLoss</td></tr>
    <tr><td><b>Module 05</b></td><td>DataLoader</td><td>YOUR data batching</td></tr>
    <tr><td><b>Module 06</b></td><td>Autograd</td><td>YOUR automatic differentiation</td></tr>
    <tr><td><b>Module 07</b></td><td>Optimizers</td><td>YOUR SGD/Adam optimizers</td></tr>
    <tr><td><b>Module 08</b></td><td>Training</td><td>YOUR end-to-end training loop</td></tr>
    <tr><td><b>Module 09</b></td><td>Convolutions</td><td>YOUR Conv2d + MaxPool2d</td></tr>
  </tbody>
</table>

This milestone has two parts that progressively showcase your TinyTorch modules:
Script: 01_lecun_tinydigits.py

Purpose: Prove CNNs > MLPs on the same data

Why This Comparison Matters: Training an MLP and a CNN on identical digit images isolates the effect of architecture. Any accuracy gain comes from convolution's spatial priors, not from extra data or tuning.
Script: 02_lecun_cifar10.py

Purpose: Scale to natural color images + showcase YOUR DataLoader!

What Part 2 Showcases: YOUR DataLoader streaming batches of real RGB images (CIFAR-10, 32×32×3) through the same CNN training pipeline you built for digits.
Historical Note: CIFAR-10 (2009) became the benchmark for evaluating CNN architectures before ImageNet.
CNNs exploit three key principles:
**1. Local Connectivity.** MLP: every pixel connects to every neuron (millions of parameters). CNN: each neuron connects only to a local region (shared filters, ~100× fewer params).
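The parameter gap is easy to verify with back-of-the-envelope arithmetic. The layer sizes below (128 hidden units, 8 filters of 5×5) are illustrative choices, not LeNet's exact configuration:

```python
# Parameter count: dense layer vs. conv layer on a 28x28 grayscale image.
# Layer sizes here are illustrative, not LeNet-5's exact configuration.
H, W = 28, 28

# MLP: one dense layer from all 784 pixels to 128 hidden units.
mlp_params = (H * W) * 128 + 128      # weights + biases = 100,480

# CNN: 8 filters of size 5x5x1, applied everywhere (weights are shared).
cnn_params = 8 * (5 * 5 * 1) + 8      # 8 kernels + 8 biases = 208

print(f"MLP dense layer: {mlp_params:,} parameters")
print(f"CNN conv layer:  {cnn_params:,} parameters")
print(f"Ratio: {mlp_params / cnn_params:.0f}x fewer in the CNN")
```

Even this tiny example lands in the hundreds-of-times range; the exact ratio depends on the sizes you pick.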
**2. Weight Sharing (Translation Invariance).** MLP: "cat in top-left" ≠ "cat in bottom-right" (different weights!). CNN: the same filter detects a feature anywhere in the image (shared weights).
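You can see weight sharing in action with a naive NumPy convolution (a sketch, not YOUR Module 09 Conv2d): one vertical-edge filter fires at full strength wherever the edge happens to be.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D cross-correlation: no padding, stride 1."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A simple vertical-edge detector.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

# Two images with the same vertical edge at different columns.
img_a = np.zeros((6, 6)); img_a[:, 1:] = 1.0   # edge near the left
img_b = np.zeros((6, 6)); img_b[:, 4:] = 1.0   # edge near the right

resp_a = conv2d_valid(img_a, kernel)
resp_b = conv2d_valid(img_b, kernel)

# The SAME shared weights give the same peak response; only its
# location moves with the edge.
print(resp_a.max(), int(resp_a[0].argmax()))   # 2.0 at column 0
print(resp_b.max(), int(resp_b[0].argmax()))   # 2.0 at column 3
```

An MLP would need separate weights for each edge position; the shared filter handles both for free.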
**3. Hierarchical Features.**
- Layer 1: edge detectors (vertical, horizontal, diagonal)
- Layer 2: texture patterns (combinations of edges)
- Layer 3: object parts (wheels, faces, legs)
- Output: full objects (cars, cats, planes)
This is why CNNs remained state-of-the-art for vision until Vision Transformers (2020)!
```bash
cd milestones/04_1998_cnn

# Step 1: Prove CNNs > MLPs (run after Module 09)
python 01_lecun_tinydigits.py

# Step 2: Scale to natural images (run after Module 09)
python 02_lecun_cifar10.py
```
After completing this milestone, you'll understand:
- Why local connectivity and weight sharing make CNNs far more parameter-efficient than MLPs
- How shared filters let one set of weights detect a feature anywhere in the image
- How stacked convolutional layers build features hierarchically, from edges to whole objects
You've recreated the architecture that launched modern computer vision!
Note for Next Milestone: CNNs excel at vision, but what about sequences (text, audio, time series)? Milestone 05 introduces Transformers - the architecture that unified vision AND language!