Milestone 05: The Transformer Era (2017)

Historical Context

In 2017, Vaswani et al. published "Attention Is All You Need," showing that attention mechanisms alone (no RNNs, no convolutions!) could achieve state-of-the-art results on sequence tasks. This breakthrough:

  • Replaced RNNs/LSTMs for sequence modeling
  • Enabled parallel training (unlike sequential RNNs)
  • Scaled to massive datasets and model sizes
  • Launched the era of GPT, BERT, and modern LLMs

Transformers didn't just improve NLP - they unified vision, language, and multimodal AI. Now it's your turn to build one from scratch using YOUR Tiny🔥Torch!

What You're Building

Character-level transformer models for text generation:

  1. Question Answering - Train on TinyTalks Q&A dataset
  2. Dialogue Generation - Generate coherent conversational responses

Required Modules

Run after Module 13 (Complete transformer stack)

<table> <thead> <tr> <th width="25%"><b>Module</b></th> <th width="25%">Component</th> <th width="50%">What It Provides</th> </tr> </thead> <tbody> <tr><td><b>Module 01</b></td><td>Tensor</td><td>YOUR data structure with autograd</td></tr> <tr><td><b>Module 02</b></td><td>Activations</td><td>YOUR ReLU/GELU activations</td></tr> <tr><td><b>Module 03</b></td><td>Layers</td><td>YOUR Linear layers</td></tr> <tr><td><b>Module 04</b></td><td>Losses</td><td>YOUR CrossEntropyLoss</td></tr> <tr><td><b>Module 05</b></td><td>DataLoader</td><td>YOUR data batching</td></tr> <tr><td><b>Module 06</b></td><td>Autograd</td><td>YOUR automatic differentiation</td></tr> <tr><td><b>Module 07</b></td><td>Optimizers</td><td>YOUR Adam optimizer</td></tr> <tr><td><b>Module 10</b></td><td>Tokenization</td><td>YOUR CharTokenizer</td></tr> <tr><td><b>Module 11</b></td><td>Embeddings</td><td>YOUR token + positional embeddings</td></tr> <tr><td><b>Module 12</b></td><td>Attention</td><td>YOUR multi-head self-attention</td></tr> <tr><td><b>Module 13</b></td><td>Transformers</td><td>YOUR LayerNorm + TransformerBlock + GPT</td></tr> </tbody> </table>

Milestone Structure

This milestone is organized as three scripts of progressive difficulty:

⭐ 00_vaswani_attention_proof.py (START HERE!)

Purpose: PROVE your attention mechanism works

  • Dataset: Auto-generated sequences (no files needed!)
  • Task: Reverse sequences [1,2,3,4] → [4,3,2,1]
  • From Paper: "Attention is All You Need" validation task
  • Training Time: ~30 seconds
  • Expected: 95%+ accuracy
  • Key Learning: "My attention is computing relationships!"

Why This Is THE Test:

  • IMPOSSIBLE without attention working
  • Trains in 30 seconds (instant gratification!)
  • Binary pass/fail (95%+ or broken)
  • Proves Q·K·V computation works

🎯 Run this FIRST to verify your attention before complex tasks!
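
To make the setup concrete, here is a minimal sketch of how such reversal data can be generated. This is plain NumPy for illustration only; the actual script builds its own batches compatible with YOUR Tensor and DataLoader.

python
import numpy as np

def make_reversal_batch(batch_size=32, seq_len=8, vocab_size=10, seed=0):
    """Toy copy-reverse task: the target is the input sequence reversed."""
    rng = np.random.default_rng(seed)
    # Token IDs 1..vocab_size-1 (0 can be reserved for padding).
    src = rng.integers(1, vocab_size, size=(batch_size, seq_len))
    tgt = src[:, ::-1].copy()  # e.g. [1, 2, 3, 4] -> [4, 3, 2, 1]
    return src, tgt

src, tgt = make_reversal_batch()
print(src[0], "->", tgt[0])

Because the correct token at output position i is the input token at position seq_len - 1 - i, a model that cannot mix information across positions cannot solve this task; working self-attention makes it easy, which is why near-perfect accuracy is the pass signal.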

01_vaswani_generation.py

Purpose: Apply attention to real language (Q&A)

  • Dataset: TinyTalks (17.5 KB, 5 difficulty levels)
  • Task: Learn to answer questions (Q: ... A: ...)
  • Architecture: Character-level GPT with attention
  • Expected: Coherent responses in 3-5 minutes
  • Key Learning: "Attention learns long-range dependencies!"

Why TinyTalks?

  • Fast training = instant feedback
  • Clear Q&A format = easy to verify learning
  • Progressive difficulty = see capability growth
  • Ships with TinyTorch = no downloads
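
To picture the character-level "Q: ... A: ..." setup concretely, here is an illustrative sketch. The Q&A pairs and helper functions are hypothetical stand-ins for the real TinyTalks data and YOUR CharTokenizer from Module 10.

python
# Hypothetical Q&A pairs; the real TinyTalks dataset ships with TinyTorch.
pairs = [("What color is the sky?", "The sky is blue."),
         ("What do cats say?", "Cats say meow.")]

# Frame each pair as one training string in the "Q: ... A: ..." format.
text = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in pairs)

# A character-level vocabulary is just the set of characters in the corpus.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

prompt_ids = encode("Q: What do cats say?\nA: ")
print(len(chars), "characters in vocab;", repr(decode(prompt_ids)))

The model trains on the resulting stream of character IDs and, at generation time, is asked to continue a "Q: ... A:" prompt one character at a time.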

02_vaswani_dialogue.py

Purpose: Generate natural conversational text

  • Dataset: Same TinyTalks, different framing
  • Task: Multi-turn dialogue generation
  • Expected: Context-aware responses
  • Key Learning: "Transformers capture conversation flow!"

What Makes This Special:

  • Same model architecture as GPT/ChatGPT (scaled down)
  • YOUR implementation from scratch (no magic!)
  • Proves attention mechanism works

Expected Results

<table> <thead> <tr> <th width="20%"><b>Script</b></th> <th width="20%">Task</th> <th width="15%">Context Length</th> <th width="35%">Success Criteria</th> <th width="10%">Training Time</th> </tr> </thead> <tbody> <tr><td><b>01 (Q&A)</b></td><td>Answer questions</td><td>128 chars</td><td>Loss < 1.5, sensible word choices</td><td>3-5 min</td></tr> <tr><td><b>02 (Dialogue)</b></td><td>Multi-turn chat</td><td>128 chars</td><td>Maintains topic coherence, loss < 1.5</td><td>3-5 min</td></tr> </tbody> </table>
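
One way to read the loss threshold, assuming the reported loss is the average per-character cross-entropy in nats (the usual CrossEntropyLoss convention): loss < 1.5 corresponds to a per-character perplexity of exp(1.5) ≈ 4.5, meaning the model is effectively choosing among only four or five plausible next characters at each step instead of the whole character vocabulary.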

Key Learning: Why Attention Revolutionized AI

Transformers solve the fundamental problems of RNNs:

Problem with RNNs:

  • Sequential processing → Can't parallelize (slow training)
  • Vanishing gradients → Struggles with long sequences
  • Fixed hidden state → Information bottleneck

Transformer Solution:

  • Attention mechanism → "Look at ANY position, weighted by relevance"
  • Parallel processing → Process entire sequence at once
  • Direct connections → Every position can attend to every other position

The Key Insight:

RNN:  Hidden state carries ALL information (bottleneck!)
      h[t] = f(h[t-1], x[t])  ← Sequential, lossy

Attention: Directly access ANY past position (no bottleneck!)
          output[i] = Σ attention[i,j] × value[j]  ← Parallel, lossless

This is why GPT, BERT, T5, and modern LLMs all use transformers!
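
To make the output[i] = Σ attention[i,j] × value[j] line concrete, here is a minimal single-head scaled dot-product self-attention sketch in plain NumPy. Shapes and names are illustrative; YOUR real multi-head version lives in Module 12.

python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
    return attn @ V                               # output[i] = sum_j attn[i, j] * V[j]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 positions, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)        # (5, 8)

Every position attends to every other position in parallel; a GPT-style decoder additionally masks out future positions before the softmax so generation stays autoregressive.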

Running the Milestone

bash
cd milestones/05_2017_transformer

# Step 0: Attention proof (run this first; completes in ~30 seconds)
python 00_vaswani_attention_proof.py

# Step 1: Q&A generation (run after Module 13)
python 01_vaswani_generation.py --epochs 5 --batch-size 4

# Step 2: Dialogue generation (run after Module 13)
python 02_vaswani_dialogue.py --epochs 5 --batch-size 4

Optional flags:

  • --levels N - Use first N difficulty levels (1-5)
  • --embed-dim D - Embedding dimension (default: 64)
  • --num-layers L - Number of transformer blocks (default: 3)
  • --num-heads H - Attention heads (default: 4)
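
For example, a run that trains only on the easier material with a wider model might look like this (flag values are illustrative):

bash
python 01_vaswani_generation.py --epochs 5 --batch-size 4 --levels 2 --embed-dim 128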

Further Reading

  • The Paper: Vaswani et al. (2017). "Attention Is All You Need"
  • Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
  • GPT Evolution: Radford et al. (2018, 2019) and Brown et al. (2020). GPT-1/2/3 papers
  • BERT: Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers"

Achievement Unlocked

After completing this milestone, you'll understand:

  • How self-attention computes context-aware representations
  • Why transformers parallelize better than RNNs
  • What positional embeddings do (give position information)
  • How GPT-style autoregressive generation works

You've built the architecture powering modern AI!
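
The last bullet above, autoregressive generation, reduces to a short sampling loop. Here is a sketch with a hypothetical model(ids) callable that returns next-character logits; YOUR GPT's actual interface may differ.

python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, seed=0):
    """Autoregressive sampling: each new character is fed back in as context."""
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(ids), dtype=float)  # hypothetical: next-character logits
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                          # softmax with temperature
        next_id = int(rng.choice(len(probs), p=probs))
        ids.append(next_id)                           # the new character extends the context
    return ids

# Usage idea, with a char tokenizer's encode/decode helpers:
# print(decode(generate(model, encode("Q: What do cats say?\nA: "))))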


Note for Next Milestone: You can now BUILD transformers, but can you OPTIMIZE them for production? Milestone 06 (MLPerf) teaches systematic optimization: profiling → compression → acceleration!