tinytorch/milestones/05_2017_transformer/README.md
In 2017, Vaswani et al. published "Attention Is All You Need," showing that attention mechanisms alone (no RNNs, no convolutions!) could achieve state-of-the-art results on sequence tasks. This breakthrough didn't just improve NLP: transformers unified vision, language, and multimodal AI. Now it's your turn to build one from scratch using YOUR Tiny🔥Torch!
Character-level transformer models for text generation:
Run after Module 13 (Complete transformer stack)
<table>
  <thead>
    <tr>
      <th width="25%"><b>Module</b></th>
      <th width="25%"><b>Component</b></th>
      <th width="50%"><b>What It Provides</b></th>
    </tr>
  </thead>
  <tbody>
    <tr><td><b>Module 01</b></td><td>Tensor</td><td>YOUR data structure with autograd</td></tr>
    <tr><td><b>Module 02</b></td><td>Activations</td><td>YOUR ReLU/GELU activations</td></tr>
    <tr><td><b>Module 03</b></td><td>Layers</td><td>YOUR Linear layers</td></tr>
    <tr><td><b>Module 04</b></td><td>Losses</td><td>YOUR CrossEntropyLoss</td></tr>
    <tr><td><b>Module 05</b></td><td>DataLoader</td><td>YOUR data batching</td></tr>
    <tr><td><b>Module 06</b></td><td>Autograd</td><td>YOUR automatic differentiation</td></tr>
    <tr><td><b>Module 07</b></td><td>Optimizers</td><td>YOUR Adam optimizer</td></tr>
    <tr><td><b>Module 10</b></td><td>Tokenization</td><td>YOUR CharTokenizer</td></tr>
    <tr><td><b>Module 11</b></td><td>Embeddings</td><td>YOUR token + positional embeddings</td></tr>
    <tr><td><b>Module 12</b></td><td>Attention</td><td>YOUR multi-head self-attention</td></tr>
    <tr><td><b>Module 13</b></td><td>Transformers</td><td>YOUR LayerNorm + TransformerBlock + GPT</td></tr>
  </tbody>
</table>

This milestone uses progressive difficulty with 3 scripts:
Purpose: PROVE your attention mechanism works
[1,2,3,4] → [4,3,2,1]

Why This Is THE Test: to produce position i of the output, the model must copy the input at the mirrored position n-1-i. That direct long-range lookup is exactly what self-attention provides, so a model whose attention is broken simply cannot learn this task.
🎯 Run this FIRST to verify your attention before complex tasks!
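To make the task concrete, here is a minimal sketch of what reversal training pairs could look like; the real script's data generation, vocabulary, and padding handling may differ:

```python
import numpy as np

# Minimal sketch: build (input, target) pairs for the sequence-reversal task.
# Illustrative only; the real script's data pipeline may differ.
rng = np.random.default_rng(0)

def make_reversal_batch(batch_size=4, seq_len=4, vocab_size=10):
    x = rng.integers(1, vocab_size, size=(batch_size, seq_len))  # e.g. [1, 2, 3, 4]
    y = x[:, ::-1].copy()                                        # e.g. [4, 3, 2, 1]
    return x, y

x, y = make_reversal_batch()
print(x[0], "→", y[0])
# To predict y[:, i] the model must look at x[:, seq_len - 1 - i]:
# exactly the long-range copy that self-attention handles in one step.
```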
Purpose: Apply attention to real language (Q&A)
Why TinyTalks?
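Whatever dataset you use, a character-level model only ever sees integer ids. Here is a minimal sketch of the encode-and-shift-by-one setup; the sample string and mapping are purely illustrative, and in the real scripts YOUR CharTokenizer from Module 10 does this job, possibly with a different interface:

```python
# Minimal sketch of character-level data prep for Q&A text.
# The sample string is illustrative, not an actual TinyTalks record.
text = "Q: What color is the sky?\nA: Blue.\n"

chars = sorted(set(text))                       # character vocabulary from the corpus
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
itos = {i: ch for ch, i in stoi.items()}        # id -> char

ids = [stoi[ch] for ch in text]                 # encode the whole sample
inputs, targets = ids[:-1], ids[1:]             # next-character prediction pairs

assert "".join(itos[i] for i in ids) == text    # round-trip decode works
print(len(chars), "characters in vocab;", len(inputs), "training positions")
```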
Purpose: Generate natural conversational text

What Makes This Special: instead of answering fixed questions, the model carries an open-ended dialogue, producing free-form text one character at a time.
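All of these scripts generate text the same way: autoregressive sampling, one character at a time. A minimal sketch of such a loop, assuming a `model(ids)` callable that returns next-character logits; the dummy model, vocabulary size, and temperature below are illustrative, and the real generation code may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 65  # illustrative character-vocabulary size

def dummy_model(ids):
    """Stand-in for YOUR trained GPT: returns logits over the next character."""
    return rng.normal(size=VOCAB_SIZE)

def generate(model, prompt_ids, max_new=50, temperature=0.8):
    ids = list(prompt_ids)
    for _ in range(max_new):
        logits = model(ids) / temperature                  # temperature sharpens/flattens
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                               # softmax over the vocabulary
        ids.append(int(rng.choice(VOCAB_SIZE, p=probs)))   # sample the next character id
    return ids

print(generate(dummy_model, prompt_ids=[1, 2, 3])[:10])
```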
Transformers solve the fundamental problems of RNNs: strictly sequential computation (no parallelism across time steps) and a fixed-size hidden state that all earlier context must squeeze through.

The Key Insight:
```text
RNN: Hidden state carries ALL information (bottleneck!)
    h[t] = f(h[t-1], x[t])                   ← Sequential, lossy

Attention: Directly access ANY past position (no bottleneck!)
    output[i] = Σ attention[i,j] × value[j]  ← Parallel, lossless
```
This is why GPT, BERT, T5, and modern LLMs all use transformers!
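To see the attention formula above as working code, here is a minimal numpy sketch of (causal) scaled dot-product attention. YOUR Module 12 implementation adds learned Q/K/V projections, multiple heads, and autograd on top of something like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask. Q, K, V have shape (T, d)."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (T, T): how much i attends to j
    mask = np.tril(np.ones((T, T), dtype=bool))      # position i may only see j <= i
    weights = softmax(np.where(mask, scores, -1e9))  # attention[i, j]
    return weights @ V                               # output[i] = Σ attention[i,j] × value[j]

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = [rng.normal(size=(T, d)) for _ in range(3)]
print(causal_attention(Q, K, V).shape)               # (4, 8)
```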
```bash
cd milestones/05_2017_transformer

# Step 1: Q&A generation (run after Module 13)
python 01_vaswani_generation.py --epochs 5 --batch-size 4

# Step 2: Dialogue generation (run after Module 13)
python 02_vaswani_dialogue.py --epochs 5 --batch-size 4
```
Optional flags (an example command using them follows the list):
- `--levels N` - use the first N difficulty levels (1-5)
- `--embed-dim D` - embedding dimension (default: 64)
- `--num-layers L` - number of transformer blocks (default: 3)
- `--num-heads H` - attention heads (default: 4)
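For example, a slightly larger configuration might look like this (the values are purely illustrative, not tuned recommendations):

```bash
# Illustrative only: a bigger model trained on the first 3 difficulty levels
python 01_vaswani_generation.py --epochs 10 --batch-size 8 \
    --levels 3 --embed-dim 128 --num-layers 4 --num-heads 8
```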
After completing this milestone, you'll understand how self-attention replaces recurrence and how tokenization, embeddings, attention, and LayerNorm stack into a working GPT. You've built the architecture powering modern AI!
Note for Next Milestone: You can now BUILD transformers, but can you OPTIMIZE them for production? Milestone 06 (MLPerf) teaches systematic optimization: profiling → compression → acceleration!