# Datasets
Purpose: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.
TinyTorch uses a two-tier dataset approach:
**Shipped Datasets** (~350 KB total, ships with the repository):

- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline, no download needed
- Perfect for rapid iteration

**Downloaded Datasets** (~180 MB, auto-downloaded when needed):

- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse

Philosophy: Following Andrej Karpathy's "~1K samples" approach: small datasets for learning, full benchmarks for validation.
## Shipped Datasets

### TinyDigits

- Location: `datasets/tinydigits/`
- Size: ~310 KB
- Used by: Milestones 03 & 04 (MLP and CNN examples)
- Contents: 1,200 grayscale digit images (0-9) at 8×8 pixels, split into 1,000 training and 200 test samples
- Format: Python pickle file with NumPy arrays

Why 8×8? At 64 pixels per image (versus MNIST's 784), models train in seconds, the feedback loop stays fast enough for rapid iteration, and the whole dataset is small enough to ship in Git.

Usage in milestones:
```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
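To sanity-check what the loader returns, you can render one 8×8 digit directly in the terminal. A minimal sketch, assuming only the `load_tinydigits()` call shown above and NumPy:

```python
import numpy as np
from datasets.tinydigits import load_tinydigits

X_train, y_train, _, _ = load_tinydigits()

# Render the first training digit as ASCII art: brighter pixels get denser glyphs.
glyphs = " .:-=+*#%@"
img = X_train[0].astype(float)
img = (img - img.min()) / (img.max() - img.min() + 1e-9)  # scale to [0, 1)

for row in img:
    print("".join(glyphs[int(v * len(glyphs))] for v in row))
print(f"label: {y_train[0]}")
```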
### TinyTalks

- Location: `datasets/tinytalks/`
- Size: ~40 KB
- Used by: Milestone 05 (Transformer/GPT text generation)
- Contents: ~350 question-answer pairs covering simple facts and short reasoning problems
- Format: Plain text files with Q: / A: format

Why a conversational format? The Q:/A: structure gives the model a clear prompt-completion pattern to learn, and makes generated answers easy to judge at a glance.

Example:
```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
Usage in milestones:
```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```
See the detailed documentation in `datasets/tinytalks/README.md`.
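For next-token training, the (question, answer) pairs must eventually become one stream of text. A minimal sketch of one way to flatten them, assuming the `load_tinytalks()` return shown above; the milestone scripts may format the corpus differently:

```python
from datasets.tinytalks import load_tinytalks

pairs = load_tinytalks()

# Join (question, answer) pairs into one corpus, keeping the Q:/A: markers
# the model will learn to continue at generation time.
corpus = "\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)

print(corpus[:200])  # peek at the first few pairs
```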
## Downloaded Datasets

These standard benchmarks download automatically when you run the relevant milestone scripts.

### MNIST

- Downloads to: `milestones/datasets/mnist/`
- Size: ~10 MB (compressed)
- Used by: `milestones/03_1986_mlp/02_rumelhart_mnist.py`
- Contents: 70,000 handwritten digit images (28×28 grayscale): 60,000 training and 10,000 test samples

Auto-download: When you run the MNIST milestone script, it automatically fetches the files, verifies their integrity, and caches them for reuse (a sketch of this flow appears after this section).

Purpose: Validate that your framework achieves production-level results (95%+ accuracy target).

Milestone goal: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
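Under the hood, an auto-download step typically amounts to fetch, checksum, cache. The helper below is a hypothetical standard-library sketch of that flow; the real logic (and the actual URLs and hashes) lives in `milestones/data_manager.py`, whose API may differ:

```python
import hashlib
import urllib.request
from pathlib import Path

def fetch_cached(url: str, dest: Path, sha256: str) -> Path:
    """Download url to dest unless a verified copy is already cached."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    if digest != sha256:
        dest.unlink()  # discard corrupt or tampered downloads
        raise ValueError(f"checksum mismatch for {dest.name}")
    return dest

# Hypothetical usage; real URLs and hashes live in milestones/data_manager.py:
# fetch_cached(url, Path("milestones/datasets/mnist/train-images.gz"), expected_sha)
```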
### CIFAR-10

- Downloads to: `milestones/datasets/cifar-10/`
- Size: ~170 MB (compressed)
- Used by: `milestones/04_1998_cnn/02_lecun_cifar10.py`
- Contents: 60,000 color images (32×32 RGB) across 10 object classes: 50,000 training and 10,000 test samples

Auto-download: The milestone script handles everything: it fetches the archive from official sources, verifies its integrity, and caches it locally for reuse.

Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target).

Milestone goal: Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence.
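If you want to poke at the raw files, the official CIFAR-10 "python version" ships as pickled batches of 10,000 images each. A hypothetical sketch of reading one batch directly; the exact on-disk path depends on how the milestone script extracts the archive:

```python
import pickle
import numpy as np

# Each training batch holds 10,000 images as flat uint8 rows (R, G, B planes).
with open("milestones/datasets/cifar-10/data_batch_1", "rb") as f:  # path is illustrative
    batch = pickle.load(f, encoding="bytes")

images = np.asarray(batch[b"data"]).reshape(-1, 3, 32, 32)  # (10000, 3, 32, 32)
labels = np.asarray(batch[b"labels"])                       # (10000,)
```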
## Why These Datasets?

TinyDigits (not full MNIST): trains in seconds and ships in Git, so you can iterate on your MLP and CNN code without waiting on downloads or long epochs.

TinyTalks (custom dataset): a tiny Q:/A: corpus built for the transformer milestone, small enough to train on quickly yet structured enough to demonstrate real text generation.

MNIST (when scaling up): the standard benchmark for validating that your backpropagation implementation reaches production-level accuracy.

CIFAR-10 (for CNN validation): real natural images that prove your convolution layers learn spatial features, not just digit shapes.
You don't need to manually download anything!
```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py   # Uses shipped TinyDigits
python 02_rumelhart_mnist.py        # Auto-downloads MNIST if needed
```
The milestones handle all data loading automatically.
Direct dataset access:
```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```
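Because the loaders simply return NumPy arrays, wiring in your own data follows the same pattern (see also the FAQ below). A hypothetical sketch for a CSV file with features first and an integer label last; the file name and split fraction are illustrative:

```python
import numpy as np

def load_my_dataset(path="my_data.csv", test_fraction=0.2, seed=0):
    """Illustrative loader: CSV rows with features first, an integer label last."""
    data = np.loadtxt(path, delimiter=",")
    X = data[:, :-1].astype(np.float32)
    y = data[:, -1].astype(np.int64)

    # Shuffle once, then carve off a held-out test split.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    split = int(len(X) * (1 - test_fraction))
    return X[:split], y[:split], X[split:], y[split:]
```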
## Dataset Summary

| Dataset | Size | Samples | Ships With Repo | Purpose |
|---|---|---|---|---|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |
Total shipped: ~350 KB. Total with benchmarks: ~180 MB.
Traditional ML courses: start with large benchmark downloads and long first training runs, so hours pass before anything works.

TinyTorch approach: ship ~1K-sample datasets in the repository so your first models train immediately, then scale up to MNIST and CIFAR-10 for validation.

Educational benefit: Students see working models within minutes, not hours.
## FAQ

Q: Why not use full MNIST from the start?
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets?
A: Absolutely! TinyTorch is a real framework: add your own data loading code, just as you would with PyTorch.

Q: Why ship datasets in Git?
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from?
A: Official sources via `milestones/data_manager.py`, with integrity verification.

Q: Can I skip the large downloads?
A: Yes! You can work through most milestones using only the shipped datasets; the downloads are needed only for the validation milestones.
See the Milestones section for how each dataset is used in historical achievements.
Dataset implementation details: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.