interviews/README.md
<p align="center"><i>You can generate the code, but you cannot prompt your way out of a silicon bottleneck.</i></p>
<p align="center"> <a href="cloud/README.md">☁️ Cloud</a> · <a href="edge/README.md">🤖 Edge</a> · <a href="mobile/README.md">📱 Mobile</a> · <a href="tinyml/README.md">🔬 TinyML</a> · <a href="NUMBERS.md">📊 Numbers</a> · <a href="00_The_Architects_Rubric.md">📋 Rubric</a> </p>

In the age of GenAI, writing a training loop is trivial. Anyone can ask an LLM for PyTorch syntax. But an LLM cannot fix a fragmented KV-cache, cannot un-choke a saturated InfiniBand switch, and cannot cool a melting Edge NPU. Code is generated; physics is enforced.
Students often ask me: "How do I prepare for ML systems interviews?" This playbook is the answer. These questions test your Mechanical Sympathy: the ability to see past the framework abstractions and engineer the metal underneath. You must learn to reason about the physical constraints of keeping 10,000 GPUs fed and 1 million users served. This is exactly what companies like Meta, Google, and OpenAI test for.
This playbook organizes that knowledge into something you can actually study.
Pick your level and start drilling:
<table> <thead> <tr> <th width="35%">You are...</th> <th width="65%">Start here</th> </tr> </thead> <tbody> <tr> <td><b>Preparing for a screen</b> (Junior/Mid)</td> <td>🟢 Green-tagged questions in any round</td> </tr> <tr> <td><b>Building applied skills</b> (Mid)</td> <td>🔵 Blue-tagged questions – diagnose real systems</td> </tr> <tr> <td><b>Targeting Senior (L5)</b></td> <td>🟡 Yellow-tagged questions + <a href="cloud/01_compute_and_memory.md">1. Compute & Memory</a> & <a href="cloud/03_inference_and_serving.md">3. Inference & Serving</a></td> </tr> <tr> <td><b>Targeting Staff+ (L6+)</b></td> <td>🔴 Red-tagged questions + <a href="cloud/05_visual_debugging.md">5. Visual Debugging</a></td> </tr> <tr> <td><b>Mock interview practice</b></td> <td><a href="00_The_Architects_Rubric.md">The Architect's Rubric</a> – grade your own designs</td> </tr> </tbody> </table>

Each track targets a different deployment regime – different physics, different constraints, different interview questions. Pick the one that matches the roles you're interviewing for, or study multiple tracks to build breadth.
<table> <thead> <tr> <th width="15%">Track</th> <th width="25%">Focus</th> <th width="20%">Primary Constraint</th> <th width="10%">Questions</th> <th width="15%">Rounds</th> <th width="15%">Scale</th> </tr> </thead> <tbody> <tr> <td><b><a href="cloud/README.md">☁️ Cloud</a></b></td> <td>Data center training & serving</td> <td>Memory bandwidth / network</td> <td>253</td> <td>6 + visual</td> <td>PFLOPS, 80 GB HBM</td> </tr> <tr> <td><b><a href="edge/README.md">🤖 Edge</a></b></td> <td>Autonomous vehicles, robotics</td> <td>Thermal envelope / real-time</td> <td>207</td> <td>5</td> <td>TOPS, 8–32 GB</td> </tr> <tr> <td><b><a href="mobile/README.md">📱 Mobile</a></b></td> <td>On-device AI for smartphones</td> <td>Battery life / shared resources</td> <td>177</td> <td>5</td> <td>TOPS, 6–12 GB</td> </tr> <tr> <td><b><a href="tinyml/README.md">🔬 TinyML</a></b></td> <td>Microcontroller & ultra-low-power</td> <td>SRAM capacity / hard real-time</td> <td>171</td> <td>5</td> <td>MFLOPS, 256 KB–2 MB</td> </tr> </tbody> </table>

📊 <a href="NUMBERS.md">Numbers Every ML Systems Engineer Should Know</a> – the physics constants, scaling rules, and hardware specs behind every question in this playbook.
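The Scale column is the first filter in any of these interviews: before discussing kernels or schedulers, check whether the weights physically fit. Here is a minimal napkin-math sketch of that habit; the model sizes and usable-memory budgets below are illustrative assumptions drawn from the table above, not exact device specs.

```python
# Napkin math: do the weights even fit in each track's memory budget?
# All numbers are rough, illustrative assumptions, not exact device specs.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Assumed usable memory per track (weights share it with activations,
# the KV-cache, the OS, and other processes).
TRACK_BUDGET_BYTES = {
    "cloud (80 GB HBM)":     80e9,
    "edge (32 GB DRAM)":     32e9,
    "mobile (~4 GB usable)":  4e9,
    "tinyml (2 MB SRAM)":     2e6,
}

def weight_footprint(params: float, precision: str) -> float:
    """Bytes needed just to store the weights at a given precision."""
    return params * BYTES_PER_PARAM[precision]

for name, params in [("7B LLM", 7e9), ("5M keyword-spotting CNN", 5e6)]:
    for precision in BYTES_PER_PARAM:
        footprint = weight_footprint(params, precision)
        fits = [t for t, budget in TRACK_BUDGET_BYTES.items() if footprint < budget]
        print(f"{name} @ {precision}: {footprint / 1e6:,.1f} MB -> fits in {fits}")
```

At FP16 a 7B model is roughly 14 GB of weights alone, which is why it lives in the Cloud and high-end Edge tracks, while even aggressive INT4 quantization cannot squeeze it into TinyML's on-chip SRAM.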
The physics is universal. The numbers are not.
<table> <thead> <tr> <th width="18%">Topic</th> <th width="22%">☁️ Cloud</th> <th width="22%">🤖 Edge</th> <th width="20%">📱 Mobile</th> <th width="18%">🔬 TinyML</th> </tr> </thead> <tbody> <tr> <td><b>Roofline</b></td> <td>H100 ridge point ~295 Ops/Byte</td> <td>Jetson Orin ridge point ~15 Ops/Byte</td> <td>NPU ridge point varies by SoC</td> <td>No FPU → integer-only roofline</td> </tr> <tr> <td><b>Memory</b></td> <td>KV-cache fragmentation in 80 GB HBM</td> <td>Model + sensor pipeline in 32 GB DRAM</td> <td>Model coexists with OS in shared RAM</td> <td>Entire model must fit in on-chip SRAM</td> </tr> <tr> <td><b>Quantization</b></td> <td>FP16→INT8 for throughput</td> <td>INT8→INT4 for thermal headroom</td> <td>INT8 for NPU compatibility</td> <td>INT8→binary to fit on chip</td> </tr> <tr> <td><b>Serving</b></td> <td>Continuous batching, PagedAttention</td> <td>Hard real-time inference deadlines</td> <td>On-device inference, thermal throttling</td> <td>Single-shot, microsecond latency</td> </tr> <tr> <td><b>Fault tolerance</b></td> <td>Checkpoint 10,000 GPUs (MTBF)</td> <td>Graceful degradation, functional safety</td> <td>Crash recovery, model fallback</td> <td>Watchdog timers, hard real-time guarantees</td> </tr> </tbody> </table>

Every question is tagged with a mastery level. These levels mirror engineering ladders at major tech companies (Google, Meta, etc.), but they represent cognitive thresholds: each level tests a different kind of reasoning, mapped to Bloom's taxonomy and the scope of ownership expected at that career stage.
<table> <thead> <tr> <th width="15%">Level</th> <th width="15%">Scope</th> <th width="20%">Cognitive Skill</th> <th width="50%">What the interviewer hears</th> </tr> </thead> <tbody> <tr> <td>🟢 <b>L3 – The Screen</b></td> <td>Own a <b>task</b></td> <td><b>Recall & Define</b></td> <td>"The Roofline model relates compute to memory bandwidth."</td> </tr> <tr> <td>🔵 <b>L4 – The Practitioner</b></td> <td>Own a <b>component</b></td> <td><b>Apply & Identify</b></td> <td>"This workload is memory-bound because its arithmetic intensity is below the ridge point."</td> </tr> <tr> <td>🟡 <b>L5 – The Architect</b></td> <td>Own a <b>system</b></td> <td><b>Analyze & Predict</b></td> <td>"Switching from A100 to H100 won't help because the ridge point shifts right while our intensity stays at ~1."</td> </tr> <tr> <td>🔴 <b>L6+ – The Lead</b></td> <td>Own the <b>architecture</b></td> <td><b>Synthesize & Derive</b></td> <td>"Let me derive the optimal parallelism dimensions from the NVLink topology, memory capacity, and pipeline bubble cost."</td> </tr> </tbody> </table>

<b>For question contributors:</b> When writing a new question, ask yourself: "What scope of reasoning does this require?" If the answer is "name the concept," it's L3. If it's "diagnose this system," it's L4. If it's "predict what happens when X changes," it's L5. If it's "derive the solution from constraints," it's L6+.
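To make the L5 answer in the table above concrete, here is a minimal roofline sketch. The peak-throughput and bandwidth figures are rough public numbers for the A100 80GB and H100 SXM (treat them as assumptions), and the workload is batch-1 LLM decode, where each FP16 weight is read once per token, giving an arithmetic intensity of roughly 1 FLOP per byte.

```python
# Roofline napkin math: does upgrading A100 -> H100 help a decode-bound LLM?
# Hardware numbers are rough public figures used for illustration only.

GPUS = {
    # name: (peak dense BF16 FLOP/s, HBM bandwidth in bytes/s)
    "A100-80GB": (312e12, 2.0e12),
    "H100-SXM":  (989e12, 3.35e12),
}

def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory balance."""
    return peak_flops / bandwidth

def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: attainable throughput is min(compute roof, bandwidth * intensity)."""
    return min(peak_flops, bandwidth * intensity)

# Batch-1 decode: each FP16 weight (2 bytes) is read once and used in one
# multiply-add (2 FLOPs), so arithmetic intensity is ~1 FLOP/byte.
DECODE_INTENSITY = 1.0

for name, (peak, bw) in GPUS.items():
    rp = ridge_point(peak, bw)
    got = attainable_flops(DECODE_INTENSITY, peak, bw)
    print(f"{name}: ridge ~{rp:.0f} FLOP/B, attainable at intensity 1 "
          f"~{got / 1e12:.1f} TFLOP/s ({100 * got / peak:.2f}% of peak)")
```

At intensity ~1 both GPUs deliver well under 1% of peak compute, so the upgrade buys only the bandwidth ratio (~1.7×), not the compute ratio; raising arithmetic intensity through batching or speculative decoding is what actually moves the needle.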
Every question is tagged with a topic. Use this index to study a specific concept across all rounds. The examples below highlight key questions from across the tracks.
<table> <thead> <tr> <th width="22%">Topic</th> <th width="60%">Example Questions Across Tracks</th> <th width="18%">Coverage</th> </tr> </thead> <tbody> <tr> <td><b><code>roofline</code></b> – Arithmetic Intensity, compute vs memory bound</td> <td><b>Cloud:</b> The Profiling Crisis · <b>Edge:</b> The Bandwidth-Bound Orin · <b>Mobile:</b> The Budget Phone Mystery</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>memory</code></b> – VRAM accounting, memory hierarchy, energy</td> <td><b>Cloud:</b> The Sequence Length Trap · <b>Mobile:</b> The Background Kill · <b>TinyML:</b> The Flash-SRAM Boundary</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>kv-cache</code></b> – KV-Cache sizing, fragmentation, PagedAttention</td> <td><b>Cloud:</b> The Fragmentation Crisis · <b>Mobile:</b> The Mobile LLM KV-Cache Squeeze</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>precision</code></b> – FP16/BF16/INT8, quantization, underflow</td> <td><b>Cloud:</b> The Underflow Crisis · <b>Edge:</b> The QAT Cliff · <b>TinyML:</b> The Requantization Pipeline</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>hardware</code></b> – Tensor Cores, sparsity, silicon architecture</td> <td><b>Cloud:</b> The Sparsity Fallacy · <b>Mobile:</b> The NPU Efficiency Advantage</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>frameworks</code></b> – JIT compilation, graph tracing, kernels</td> <td><b>Cloud:</b> The Compilation Overhead · <b>Mobile:</b> The Single-Op Delegation Fix</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>data-pipeline</code></b> – CPU starvation, preprocessing, ingestion</td> <td><b>Cloud:</b> The Data Pipeline Stall · <b>Edge:</b> The Timestamp Drift</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>parallelism</code></b> – DP, TP, PP, ZeRO, 3D parallelism</td> <td><b>Cloud:</b> The Pipeline Bubble · The Amdahl Ceiling · Dimensioning the 3D Cube</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>network</code></b> – NVLink, InfiniBand, Fat-Tree, AllReduce</td> <td><b>Cloud:</b> The Cross-Rack Stall · The Oversubscription Choke · The Ring vs Tree Dilemma</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>fault-tolerance</code></b> – Checkpointing, MTBF, stragglers</td> <td><b>Cloud:</b> The Straggler Problem · <b>Edge:</b> The Degradation Ladder</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>latency</code></b> – TTFT, TPOT, tail latency, queueing theory</td> <td><b>Cloud:</b> The Serving Inversion · <b>Mobile:</b> The Jank Explanation · <b>TinyML:</b> The Interrupt Deadline</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>serving</code></b> – Batching, cold starts, speculative decoding</td> <td><b>Cloud:</b> The Batching Dilemma · <b>Edge:</b> The eMMC Cold Start</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>mlops</code></b> – Drift, skew, deployment, technical debt</td> <td><b>Cloud:</b> The Training-Serving Skew · <b>Mobile:</b> The Silent Accuracy Degradation</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>economics</code></b> – TCO, retraining cost, sustainability</td> <td><b>Cloud:</b> The Energy Economics · <b>Edge:</b> The Edge vs Cloud Cost Crossover</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>security</code></b> – Prompt injection, adversarial attacks</td> <td><b>Cloud:</b> The Trust Boundary · <b>Edge:</b> The Adversarial Patch Attack</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>privacy</code></b> – DP-SGD, membership inference</td> <td><b>Cloud:</b> The Privacy Audit · <b>Mobile:</b> The Federated Keyboard</td> <td>✅ Strong</td> </tr> </tbody> </table>

Each question includes a 📖 Deep Dive link to the relevant chapter of <i>Machine Learning Systems</i>. The questions prove the knowledge matters; the textbook teaches it.
<table> <thead> <tr> <th width="25%">Round</th> <th width="75%">Textbook Chapters Referenced</th> </tr> </thead> <tbody> <tr> <td><b>Silicon Physics</b></td> <td>HW Acceleration, Data Engineering, Model Compression, Frameworks, Neural Computation</td> </tr> <tr> <td><b>Distributed Infra</b></td> <td>Distributed Training, Fault Tolerance, Fleet Orchestration, Network Fabrics</td> </tr> <tr> <td><b>Production Serving</b></td> <td>Model Serving, Benchmarking, Inference at Scale</td> </tr> <tr> <td><b>Ops & Economics</b></td> <td>ML Operations, Responsible Engineering, Sustainable AI, Security & Privacy, Robust AI</td> </tr> </tbody> </table>

We welcome questions from recent AI systems interviews.
Every question follows this structure. The Interviewer prompt is visible as the question; the answer is hidden behind a "Reveal Answer" fold so readers can quiz themselves. Not all fields are required; use Common Mistake and Napkin Math where they add value.
<details>
<summary><b>[LEVEL BADGE]: [Question Title]</b> · <code>topic-tag</code></summary>
- **Interviewer:** [The scenario or crisis]
<details>
<summary><b>🔍 Reveal Answer</b></summary>
**Common Mistake:** [What most people say wrong – creates the "aha" moment]
**Realistic Solution:** [The physics/logic behind the correct answer]
> **Napkin Math:** [Quick back-of-envelope calculation with real numbers]
> **Key Equation:** $[The formula to memorize]$
📖 **Deep Dive:** [Link to the relevant textbook chapter]
</details>
</details>
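For a sense of what the Napkin Math and Key Equation fields look like when filled in, here is a hypothetical KV-cache sizing example (not a question from the playbook); the model configuration is an assumption, roughly Llama-2-7B-like.

```python
# Hypothetical Napkin Math: KV-cache footprint for one sequence.
# Key equation: bytes = 2 (K and V) * layers * kv_heads * head_dim
#                       * seq_len * bytes_per_element
# Configuration below is an assumption (roughly Llama-2-7B-like), for illustration.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """FP16 KV-cache footprint in bytes for a single sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

per_seq = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(f"Per sequence @ 4K context: {per_seq / 1e9:.2f} GB")       # ~2.1 GB
print(f"Batch of 32 sequences:     {32 * per_seq / 1e9:.1f} GB")  # ~69 GB
```

A batch of a few dozen 4K-context sequences already crowds an 80 GB card before fragmentation is even considered, which is exactly the kind of number the interviewer expects you to produce unprompted.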
– <b>Vijay Janapa Reddi</b>