What you'll learn:
- The systematic optimization workflow: measure, optimize, validate, repeat
- Why profiling before optimizing beats heroic rewrites
- How to achieve 8× compression and 10× speedup with minimal accuracy loss
MLCommons launches MLPerf - the first systematic benchmark for ML systems engineering. The message is clear: research breakthroughs don't matter if you can't ship them. Optimization isn't an afterthought; it's a core competency.
This milestone teaches you the same systematic approach production ML engineers use. You'll compress YOUR models 8× and speed up YOUR transformer generation 10×. That's the difference between research demos and shipped products.
A complete MLPerf-style optimization pipeline:
Measure --> Optimize --> Validate --> Repeat
| Module | Component | What It Provides |
|---|---|---|
| 01-13 | Foundation + Architectures | Models to optimize |
| 14 | Profiling | YOUR measurement tools |
| 15 | Quantization | YOUR INT8/FP16 implementations |
| 16 | Compression | YOUR pruning techniques |
| 17 | Acceleration | YOUR vectorized operations |
| 18 | Memoization | YOUR KV-cache for generation |
Before running, ensure you have completed Modules 01-18. You can check your progress:
tito module status
cd milestones/06_2018_mlperf
# Part 1: Optimize MLP/CNN (profiling + quantization + pruning)
python 01_optimization_olympics.py
# Expected: 4-8x compression with <2% accuracy loss
# Part 2: Speed up Transformer generation (KV caching)
python 02_generation_speedup.py
# Expected: 6-10x faster generation
Static Model Optimization (Script 01)
| Optimization | Size | Accuracy | Notes |
|---|---|---|---|
| Baseline (FP32) | 100% | 85-90% | Full precision |
| + Quantization (INT8) | 25% | 84-89% | 4x smaller |
| + Pruning (50%) | 12.5% | 82-87% | 8x smaller total |
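To make these numbers concrete, here's a minimal NumPy sketch of the two techniques behind them. The function names are illustrative, not the Module 15/16 APIs; symmetric per-tensor quantization and magnitude pruning are one standard way to implement each step.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def prune_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(512, 512).astype(np.float32)

q, scale = quantize_int8(w)        # FP32 (4 bytes) -> INT8 (1 byte): 4x smaller
sparse = prune_magnitude(w, 0.5)   # 50% zeros: ~2x more with sparse storage, 8x total

err = np.abs(w - q.astype(np.float32) * scale).max()
print(f"max quantization error: {err:.4f}")  # bounded by scale / 2
```

Script 01 runs the same two steps against YOUR trained models and re-measures accuracy after each one, which is where the <2% figure comes from.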
Generation Speedup (Script 02)
| Mode | Time/Token | Speedup |
|---|---|---|
| Without KV-Cache | ~10ms | 1x |
| With KV-Cache | ~1ms | 6-10x |
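Why the cache wins: each decoding step appends the new token's key/value once and attends with a single query row, instead of re-running attention over the entire prefix. Here's a minimal single-head NumPy sketch of the idea (not Module 18's `KVCache` API):

```python
import numpy as np

def attend_one(q, K, V):
    """Attention for one new query against all cached keys/values."""
    scores = K @ q / np.sqrt(len(q))   # one score per cached token: O(n) work
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 64
cached_K, cached_V = [], []
for step in range(100):                # autoregressive decoding loop
    # Stand-ins for the new token's projected query/key/value:
    q, k, v = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    cached_K.append(k)                 # append once; past K/V are never recomputed
    cached_V.append(v)
    out = attend_one(q, np.stack(cached_K), np.stack(cached_V))

# Without the cache, every step recomputes K and V for the whole prefix,
# so total generation cost grows quadratically with length - that is the
# gap behind the ~6-10x wall-clock speedup above.
```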
The Wrong Way (Heroic Optimization):
- "It's too slow! Let me rewrite everything in C++!"
- "Memory is too high! Let me redesign the architecture!"
- "KV-cache sounds complex! Let me try CUDA kernels first!"

Result: Weeks of work, marginal gains, introduced bugs.
The Right Way (Systematic Optimization):
1. MEASURE: Profile shows 70% of time is in attention, 80% of memory is Linear layers
2. OPTIMIZE: Add KV-cache (targets the 70%), quantize Linear layers (targets the 80%)
3. VALIDATE: Accuracy drops 1.5% (acceptable), 8× faster (huge win)
4. REPEAT: Profile again, find next bottleneck
Result: 10× faster, 8× smaller, 2% accuracy cost - achieved in days.
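Step 1 is the one people skip, so here's what MEASURE looks like in miniature using only the standard library (in the milestone you'd reach for YOUR Module 14 profiler instead; the two functions below are toy stand-ins):

```python
import cProfile
import pstats

def attention(n):
    # Stand-in for the real hot spot: quadratic in sequence length
    return sum(i * j for i in range(n) for j in range(n))

def linear(n):
    # Stand-in for a cheap layer: linear in n
    return sum(range(n))

def forward():
    attention(300)
    linear(300)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    forward()
profiler.disable()

# Sorting by cumulative time is how you get a number like "70% of time
# is in attention" - a measurement, not a guess.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```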
This is what separates ML researchers from ML engineers:
The complete workflow (measure, optimize, validate, repeat) using YOUR tools.
This is the capstone of the entire TinyTorch journey. Every optimization tool comes from YOUR implementations:
| Component | Your Module | What It Does |
|---|---|---|
| `Profiler` | Module 14 | YOUR measurement and bottleneck identification |
| `quantize()` | Module 15 | YOUR INT8/FP16 conversion |
| `prune()` | Module 16 | YOUR weight pruning |
| `vectorize()` | Module 17 | YOUR accelerated operations |
| `KVCache` | Module 18 | YOUR key-value caching for generation |
No external optimization libraries. Just YOUR code making models production-ready.
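Composed end to end, the workflow can read as a few lines of glue. Everything below is a hedged sketch: the import path and every signature are assumptions for illustration, so check YOUR Modules 14-18 for the real interfaces.

```python
# HYPOTHETICAL glue code: the import path and all signatures here are
# assumptions, not the actual TinyTorch APIs - consult YOUR modules.
from tinytorch import Profiler, quantize, prune  # hypothetical import

def optimize(model, evaluate, budget=0.02):
    baseline = evaluate(model)              # accuracy before touching anything
    Profiler().run(model)                   # MEASURE: see where time/memory go
    model = quantize(model, dtype="int8")   # OPTIMIZE: target the biggest cost
    model = prune(model, sparsity=0.5)
    drop = baseline - evaluate(model)
    assert drop <= budget, f"accuracy fell {drop:.1%}"  # VALIDATE
    return model                            # REPEAT: profile again from here
```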
MLPerf (MLCommons) standardized ML benchmarking across hardware and software stacks. Before MLPerf, comparing ML systems was nearly impossible: different datasets, metrics, and conditions.
The "Optimization Era" marks when ML engineering became as important as ML research. Building a model is step one; deploying it efficiently is where production value lives.
Milestone 06 completes the TinyTorch journey from tensors to production: you've built models from scratch, profiled them, and optimized them for deployment with your own tools.
The Capstone (Module 20) puts it all together in the Torch Olympics competition.