Milestone 05: The Transformer Era (2017)

Historical Context

In 2017, Vaswani et al. published "Attention Is All You Need," showing that attention mechanisms alone (no RNNs, no convolutions!) could achieve state-of-the-art results on sequence tasks. This breakthrough:

  • Replaced RNNs/LSTMs for sequence modeling
  • Enabled parallel training (unlike sequential RNNs)
  • Scaled to massive datasets and model sizes
  • Launched the era of GPT, BERT, and modern LLMs

Transformers didn't just improve NLP - they unified vision, language, and multimodal AI. Now it's your turn to build one from scratch using YOUR Tiny🔥Torch!

What You're Building

Character-level transformer models for text generation:

  1. Question Answering - Train on TinyTalks Q&A dataset
  2. Dialogue Generation - Generate coherent conversational responses

Required Modules

Run after Module 13 (Complete transformer stack)

<table> <thead> <tr> <th width="25%"><b>Module</b></th> <th width="25%">Component</th> <th width="50%">What It Provides</th> </tr> </thead> <tbody> <tr><td><b>Module 01</b></td><td>Tensor</td><td>YOUR data structure with autograd</td></tr> <tr><td><b>Module 02</b></td><td>Activations</td><td>YOUR ReLU/GELU activations</td></tr> <tr><td><b>Module 03</b></td><td>Layers</td><td>YOUR Linear layers</td></tr> <tr><td><b>Module 04</b></td><td>Losses</td><td>YOUR CrossEntropyLoss</td></tr> <tr><td><b>Module 05</b></td><td>DataLoader</td><td>YOUR data batching</td></tr> <tr><td><b>Module 06</b></td><td>Autograd</td><td>YOUR automatic differentiation</td></tr> <tr><td><b>Module 07</b></td><td>Optimizers</td><td>YOUR Adam optimizer</td></tr> <tr><td><b>Module 10</b></td><td>Tokenization</td><td>YOUR CharTokenizer</td></tr> <tr><td><b>Module 11</b></td><td>Embeddings</td><td>YOUR token + positional embeddings</td></tr> <tr><td><b>Module 12</b></td><td>Attention</td><td>YOUR multi-head self-attention</td></tr> <tr><td><b>Module 13</b></td><td>Transformers</td><td>YOUR LayerNorm + TransformerBlock + GPT</td></tr> </tbody> </table>

Milestone Structure

This milestone is organized as three scripts of progressive difficulty:

⭐ 00_vaswani_attention_proof.py (START HERE!)

Purpose: PROVE your attention mechanism works

  • Dataset: Auto-generated sequences (no files needed!)
  • Task: Reverse sequences [1,2,3,4] → [4,3,2,1]
  • From Paper: "Attention is All You Need" validation task
  • Training Time: ~30 seconds
  • Expected: 95%+ accuracy
  • Key Learning: "My attention is computing relationships!"

Why This Is THE Test:

  • IMPOSSIBLE without attention working
  • Trains in 30 seconds (instant gratification!)
  • Binary pass/fail (95%+ or broken)
  • Proves Q·K·V computation works

🎯 Run this FIRST to verify your attention before complex tasks!
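
To make the setup concrete, here is a minimal sketch of how such reversal data can be generated. This is plain NumPy for illustration only; the actual script builds its own batches compatible with YOUR Tensor and DataLoader.

python
import numpy as np

def make_reversal_batch(batch_size=32, seq_len=8, vocab_size=10, seed=0):
    """Toy copy-reverse task: the target is the input sequence reversed."""
    rng = np.random.default_rng(seed)
    # Token IDs 1..vocab_size-1 (0 can be reserved for padding).
    src = rng.integers(1, vocab_size, size=(batch_size, seq_len))
    tgt = src[:, ::-1].copy()  # e.g. [1, 2, 3, 4] -> [4, 3, 2, 1]
    return src, tgt

src, tgt = make_reversal_batch()
print(src[0], "->", tgt[0])

Because the correct token at output position i is the input token at position seq_len - 1 - i, a model that cannot mix information across positions cannot solve this task; working self-attention makes it easy, which is why near-perfect accuracy is the pass signal.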

01_vaswani_generation.py

Purpose: Apply attention to real language (Q&A)

  • Dataset: TinyTalks (17.5 KB, 5 difficulty levels)
  • Task: Learn to answer questions (Q: ... A: ...)
  • Architecture: Character-level GPT with attention
  • Expected: Coherent responses in 3-5 minutes
  • Key Learning: "Attention learns long-range dependencies!"

Why TinyTalks?

  • Fast training = instant feedback
  • Clear Q&A format = easy to verify learning
  • Progressive difficulty = see capability growth
  • Ships with TinyTorch = no downloads
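
To picture the character-level "Q: ... A: ..." setup concretely, here is an illustrative sketch. The Q&A pairs and helper functions are hypothetical stand-ins for the real TinyTalks data and YOUR CharTokenizer from Module 10.

python
# Hypothetical Q&A pairs; the real TinyTalks dataset ships with TinyTorch.
pairs = [("What color is the sky?", "The sky is blue."),
         ("What do cats say?", "Cats say meow.")]

# Frame each pair as one training string in the "Q: ... A: ..." format.
text = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in pairs)

# A character-level vocabulary is just the set of characters in the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

prompt_ids = encode("Q: What do cats say?\nA: ")
print(len(chars), "characters in vocab;", repr(decode(prompt_ids)))

The model trains on the resulting stream of character IDs and, at generation time, is asked to continue a "Q: ... A:" prompt one character at a time.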

02_vaswani_dialogue.py

Purpose: Generate natural conversational text

  • Dataset: Same TinyTalks, different framing
  • Task: Multi-turn dialogue generation
  • Expected: Context-aware responses
  • Key Learning: "Transformers capture conversation flow!"

What Makes This Special:

  • Same model architecture as GPT/ChatGPT (scaled down)
  • YOUR implementation from scratch (no magic!)
  • Proves attention mechanism works

Expected Results

<table> <thead> <tr> <th width="20%"><b>Script</b></th> <th width="20%">Task</th> <th width="15%">Context Length</th> <th width="35%">Success Criteria</th> <th width="10%">Training Time</th> </tr> </thead> <tbody> <tr><td><b>01 (Q&A)</b></td><td>Answer questions</td><td>128 chars</td><td>Loss < 1.5, sensible word choices</td><td>3-5 min</td></tr> <tr><td><b>02 (Dialogue)</b></td><td>Multi-turn chat</td><td>128 chars</td><td>Maintains topic coherence, loss < 1.5</td><td>3-5 min</td></tr> </tbody> </table>
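
One way to read the loss threshold, assuming the reported loss is the average per-character cross-entropy in nats (the usual CrossEntropyLoss convention): loss < 1.5 corresponds to a per-character perplexity of exp(1.5) ≈ 4.5, meaning the model is effectively choosing among only four or five plausible next characters at each step instead of the whole character vocabulary.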

Key Learning: Why Attention Revolutionized AI

Transformers solve the fundamental problems of RNNs:

Problem with RNNs:

  • Sequential processing → Can't parallelize (slow training)
  • Vanishing gradients → Struggles with long sequences
  • Fixed hidden state → Information bottleneck

Transformer Solution:

  • Attention mechanism → "Look at ANY position, weighted by relevance"
  • Parallel processing → Process entire sequence at once
  • Direct connections → Every position can attend to every other position

The Key Insight:

RNN:  Hidden state carries ALL information (bottleneck!)
      h[t] = f(h[t-1], x[t])  ← Sequential, lossy

Attention: Directly access ANY past position (no bottleneck!)
          output[i] = Σ attention[i,j] × value[j]  ← Parallel, lossless

This is why GPT, BERT, T5, and modern LLMs all use transformers!
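
To make the output[i] = Σ attention[i,j] × value[j] line concrete, here is a minimal single-head scaled dot-product self-attention sketch in plain NumPy. Shapes and names are illustrative; YOUR real multi-head version lives in Module 12.

python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
    return attn @ V                               # output[i] = sum_j attn[i, j] * V[j]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 positions, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)        # (5, 8)

Every position attends to every other position in parallel; a GPT-style decoder additionally masks out future positions before the softmax so generation stays autoregressive.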

Running the Milestone

bash
cd milestones/05_2017_transformer

# Step 0: Attention proof (run this first; completes in ~30 seconds)
python 00_vaswani_attention_proof.py

# Step 1: Q&A generation (run after Module 13)
python 01_vaswani_generation.py --epochs 5 --batch-size 4

# Step 2: Dialogue generation (run after Module 13)
python 02_vaswani_dialogue.py --epochs 5 --batch-size 4

Optional flags:

  • --levels N - Use first N difficulty levels (1-5)
  • --embed-dim D - Embedding dimension (default: 64)
  • --num-layers L - Number of transformer blocks (default: 3)
  • --num-heads H - Attention heads (default: 4)
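
For example, a run that trains only on the easier material with a wider model might look like this (flag values are illustrative):

bash
python 01_vaswani_generation.py --epochs 5 --batch-size 4 --levels 2 --embed-dim 128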

Further Reading

  • The Paper: Vaswani et al. (2017). "Attention Is All You Need"
  • Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
  • GPT Evolution: Radford et al. (2018, 2019) and Brown et al. (2020). GPT-1/2/3 papers
  • BERT: Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers"

Achievement Unlocked

After completing this milestone, you'll understand:

  • How self-attention computes context-aware representations
  • Why transformers parallelize better than RNNs
  • What positional embeddings do (give position information)
  • How GPT-style autoregressive generation works

You've built the architecture powering modern AI!
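
The last bullet above, autoregressive generation, reduces to a short sampling loop. Here is a sketch with a hypothetical model(ids) callable that returns next-character logits; YOUR GPT's actual interface may differ.

python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, seed=0):
    """Autoregressive sampling: each new character is fed back in as context."""
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(ids), dtype=float)  # hypothetical: next-character logits
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                          # softmax with temperature
        next_id = int(rng.choice(len(probs), p=probs))
        ids.append(next_id)                           # the new character extends the context
    return ids

# Usage idea, with a char tokenizer's encode/decode helpers:
# print(decode(generate(model, encode("Q: What do cats say?\nA: "))))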


Note for Next Milestone: You can now BUILD transformers, but can you OPTIMIZE them for production? Milestone 06 (MLPerf) teaches systematic optimization: profiling → compression → acceleration!