What you'll learn:
- Why attention beats sequential processing (direct access, no bottleneck)
- How transformers handle long-range dependencies that defeat RNNs
- The architecture that powers GPT, BERT, and every modern LLM
Imagine compressing an entire book into a single number.
That's essentially what RNNs do. No matter how long the input—a sentence, a paragraph, an entire novel—it all gets squeezed through a fixed-size hidden state. Information from the beginning gradually fades as new tokens arrive. By the time you process token 1000, token 1 is nearly forgotten.
In 2017, Vaswani et al. asked: what if we just... didn't? Their paper "Attention Is All You Need" proved that attention mechanisms alone could achieve state-of-the-art results. No sequential bottleneck. No information loss. Any token can directly attend to any other token.
This launched the era of GPT, BERT, ChatGPT, and every modern LLM. This milestone builds a character-level transformer using YOUR TinyTorch implementations.
Transformer models for text generation:

```text
Tokens --> Embeddings --> [Attention --> FFN] x N --> Output
```
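The pipeline above can be sketched end-to-end in a few lines of numpy. This is an illustrative toy, not the Module 10-13 API: all sizes, weight initializations, and the single-head attention are assumptions, and real transformer blocks also include LayerNorm and multiple heads.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, d = 50, 16, 32              # toy sizes (assumptions)

# Tokens -> Embeddings: token lookup plus learned positional embeddings
tok_emb = rng.normal(0, 0.02, (vocab, d))
pos_emb = rng.normal(0, 0.02, (seq_len, d))
tokens = rng.integers(0, vocab, seq_len)
x = tok_emb[tokens] + pos_emb               # (seq_len, d)

def attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention (single head for brevity)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)           # softmax over positions
    return w @ v

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2       # two-layer ReLU MLP

# [Attention -> FFN] x N, with residual connections
for _ in range(2):
    Wq, Wk, Wv = (rng.normal(0, 0.02, (d, d)) for _ in range(3))
    W1 = rng.normal(0, 0.02, (d, 4 * d))
    W2 = rng.normal(0, 0.02, (4 * d, d))
    x = x + attention(x, Wq, Wk, Wv)
    x = x + ffn(x, W1, W2)

# Output: project back to vocabulary logits for next-character prediction
logits = x @ tok_emb.T                      # (seq_len, vocab)
```

In this milestone the same shape flow runs through YOUR Tensor, embeddings, attention, and TransformerBlock instead of raw numpy.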
| Module | Component | What It Provides |
|---|---|---|
| 01-08 | Foundation + Training | Tensor, Layers, DataLoader, Training |
| 10 | Tokenization | YOUR CharTokenizer |
| 11 | Embeddings | YOUR token + positional embeddings |
| 12 | Attention | YOUR multi-head self-attention |
| 13 | Transformers | YOUR LayerNorm + TransformerBlock |
Before running, ensure you have completed Modules 01-13. You can check your progress:
```bash
tito module status
```
```bash
cd milestones/05_2017_transformer

# Step 0: Prove attention works
python 00_vaswani_attention_proof.py
# Expected: 95%+ on sequence reversal

# Step 1: Q&A generation
python 01_vaswani_generation.py --epochs 5
# Expected: Loss < 1.5, coherent responses

# Step 2: Dialogue generation
python 02_vaswani_dialogue.py --epochs 5
# Expected: Context-aware responses
```
:::{admonition} Why Start with Script 00?
:class: note
Script 00 is the critical validation. Sequence reversal is practically impossible without working attention, gives instant feedback (30 seconds), and is binary pass/fail (95%+ or broken). If 00 fails, debug your attention. If 00 passes, 01-02 will work.
:::
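The sequence-reversal task is easy to picture with a sketch of the kind of training pair Script 00 uses. The script's internals are an assumption here; only the task itself is described in this milestone.

```python
import random

# Each input is a random character sequence; the target is its reversal.
chars = "abcdefgh"
rng = random.Random(0)

def make_pair(length=8):
    seq = "".join(rng.choice(chars) for _ in range(length))
    return seq, seq[::-1]

src, tgt = make_pair()
# To emit tgt[0], the model must recall src[-1]: trivial lookup for
# attention, but the hardest case for a fixed-size recurrent state.
```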
| Script | Task | Success Criteria | Time |
|---|---|---|---|
| 00 (Attention Proof) | Reverse sequences | 95%+ accuracy | 30 sec |
| 01 (Q&A) | Answer questions | Loss < 1.5, sensible words | 3-5 min |
| 02 (Dialogue) | Multi-turn chat | Topic coherence | 3-5 min |
Watch Script 00 prove attention works—in 30 seconds.
Sequence reversal is practically impossible for simple sequential models. To output the first character, you need to remember the last character of the input. To output the second character, you need the second-to-last. The entire sequence must be held in memory simultaneously.
RNNs struggle with this. Attention makes it trivial:
```text
RNN:       h[t]   = f(h[t-1], x[t])         # Sequential, lossy
Attention: out[i] = sum(attn[i,j] * v[j])   # Parallel, direct
```
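You can see the "parallel, direct" formula at work with hand-built attention weights. Here the weight matrix is chosen (not learned) so that position i attends entirely to position n-1-i, which is exactly the pattern sequence reversal needs:

```python
import numpy as np

# Toy values: 5 positions, 4-dim value vectors (illustrative only)
n, d = 5, 4
v = np.arange(n * d, dtype=float).reshape(n, d)

# Reversed identity: attn[i, j] = 1 when j == n-1-i, else 0 (rows sum to 1)
attn = np.eye(n)[::-1]

out = attn @ v          # out[i] = sum_j attn[i, j] * v[j]
# Every position reads its mirror position in one step, no recurrence.
assert np.allclose(out, v[::-1])
```

A trained model has to discover weights like these via gradient descent, but nothing in the mechanism stops position 0 from reading position n-1 directly.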
The moment you realize: YOUR attention mechanism lets any position directly access any other position. No bottleneck. No forgetting. This is why transformers can handle entire documents, codebases, even books.
This solves the fundamental RNN problems: no sequential bottleneck, no information decay over long distances, and no fixed-size hidden state squeezing the whole input into one vector.
This is the capstone of the Architecture Tier. Every component comes from YOUR implementations:
| Component | Your Module | What It Does |
|---|---|---|
| CharTokenizer | Module 10 | YOUR character-level tokenization |
| TokenEmbedding | Module 11 | YOUR learned token representations |
| PositionalEmbedding | Module 11 | YOUR position encodings |
| MultiHeadAttention | Module 12 | YOUR self-attention mechanism |
| TransformerBlock | Module 13 | YOUR attention + feedforward blocks |
| LayerNorm | Module 13 | YOUR normalization layers |
No PyTorch. No HuggingFace. Just YOUR code generating coherent text.
The original "Attention Is All You Need" paper used sequence-to-sequence tasks like translation. TinyTalks provides a similar Q&A format at character level.
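Character-level means the vocabulary is just the set of characters in the corpus. A minimal tokenizer in that spirit looks like the sketch below; the interface is an assumption for illustration, and the milestone uses YOUR Module 10 CharTokenizer, whose actual API may differ.

```python
# Minimal character-level tokenizer sketch (hypothetical interface)
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))               # vocab = unique chars
        self.stoi = {c: i for i, c in enumerate(self.chars)}

    def encode(self, s):
        return [self.stoi[c] for c in s]             # string -> token ids

    def decode(self, ids):
        return "".join(self.chars[i] for i in ids)   # token ids -> string

# TinyTalks-style Q&A text at character level (example text is illustrative)
tok = CharTokenizer("Q: what is attention?\nA: direct access.\n")
roundtrip = tok.decode(tok.encode("attention"))      # -> "attention"
```

Because every character maps to exactly one id, encode/decode round-trips losslessly, which is what makes character-level generation easy to inspect.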
This architecture (with scaling) powers:
- GPT and ChatGPT
- BERT
- virtually every modern LLM
You can BUILD transformers, but can you OPTIMIZE them? Milestone 06 (MLPerf) teaches systematic optimization: profiling, compression, and acceleration for production deployment.