<!-- DEV-BANNER-START --> <div align="center"> <table> <tr><td> <h3>🚧 Under Active Development</h3> <p>This component is being built on the <code>dev</code> branch and is <b>not yet available</b> on the live site.

Content may be incomplete or change without notice. The published curriculum lives at <a href="https://mlsysbook.ai"><b>mlsysbook.ai</b></a>.</p>

<p> <a href="https://github.com/harvard-edge/cs249r_book/tree/dev"></a> <a href="https://mlsysbook.ai"></a> </p> </td></tr> </table> </div> <!-- DEV-BANNER-END -->

# The ML Systems Interview Playbook

<p align="center"> <b>1,000+ systems design questions across Cloud, Edge, Mobile & TinyML tracks.</b>

<i>You can generate the code, but you cannot prompt your way out of a silicon bottleneck.</i>

</p> <p align="center"> <a href="cloud/README.md">☁️ Cloud</a> · <a href="edge/README.md">🤖 Edge</a> · <a href="mobile/README.md">📱 Mobile</a> · <a href="tinyml/README.md">🔬 TinyML</a> · <a href="NUMBERS.md">📊 Numbers</a> · <a href="00_The_Architects_Rubric.md">📋 Rubric</a> </p>

## Why This Exists

In the age of GenAI, writing a training loop is trivial. Anyone can ask an LLM for PyTorch syntax. But an LLM cannot fix a fragmented KV-cache, it cannot un-choke a saturated InfiniBand switch, and it cannot cool a melting Edge NPU. Code is generated; physics is enforced.

Students often ask me: "How do I prepare for ML systems interviews?" This playbook is the answer. These questions test your Mechanical Sympathy: the ability to see past the framework abstractions and engineer the metal underneath. You must learn to reason about the physical constraints of keeping 10,000 GPUs fed and 1 million users served. This is exactly what companies like Meta, Google, and OpenAI test for.

This playbook organizes that knowledge into something you can actually study.


## Quick Start

Pick your level and start drilling:

<table> <thead> <tr> <th width="35%">You are...</th> <th width="65%">Start here</th> </tr> </thead> <tbody> <tr> <td><b>Preparing for a screen</b> (Junior/Mid)</td> <td>🟢 Green-tagged questions in any round</td> </tr> <tr> <td><b>Building applied skills</b> (Mid)</td> <td>🔵 Blue-tagged questions — diagnose real systems</td> </tr> <tr> <td><b>Targeting Senior (L5)</b></td> <td>🟡 Yellow-tagged questions + <a href="cloud/01_compute_and_memory.md">1. Compute & Memory</a> & <a href="cloud/03_inference_and_serving.md">3. Inference & Serving</a></td> </tr> <tr> <td><b>Targeting Staff+ (L6+)</b></td> <td>🔴 Red-tagged questions + <a href="cloud/05_visual_debugging.md">5. Visual Debugging</a></td> </tr> <tr> <td><b>Mock interview practice</b></td> <td><a href="00_The_Architects_Rubric.md">The Architect's Rubric</a> — grade your own designs</td> </tr> </tbody> </table>

## Choose Your Track

Each track targets a different deployment regime — different physics, different constraints, different interview questions. Pick the one that matches the roles you're interviewing for, or study multiple tracks to build breadth.

<table> <thead> <tr> <th width="15%">Track</th> <th width="25%">Focus</th> <th width="20%">Primary Constraint</th> <th width="10%">Questions</th> <th width="15%">Rounds</th> <th width="15%">Scale</th> </tr> </thead> <tbody> <tr> <td><b><a href="cloud/README.md">☁️ Cloud</a></b></td> <td>Data center training & serving</td> <td>Memory bandwidth / network</td> <td>253</td> <td>6 + visual</td> <td>PFLOPS, 80 GB HBM</td> </tr> <tr> <td><b><a href="edge/README.md">🤖 Edge</a></b></td> <td>Autonomous vehicles, robotics</td> <td>Thermal envelope / real-time</td> <td>207</td> <td>5</td> <td>TOPS, 8–32 GB</td> </tr> <tr> <td><b><a href="mobile/README.md">📱 Mobile</a></b></td> <td>On-device AI for smartphones</td> <td>Battery life / shared resources</td> <td>177</td> <td>5</td> <td>TOPS, 6–12 GB</td> </tr> <tr> <td><b><a href="tinyml/README.md">🔬 TinyML</a></b></td> <td>Microcontroller & ultra-low-power</td> <td>SRAM capacity / hard real-time</td> <td>171</td> <td>5</td> <td>MFLOPS, 256 KB–2 MB</td> </tr> </tbody> </table>

[📊 Numbers Every ML Systems Engineer Should Know](NUMBERS.md) — The physics constants, scaling rules, and hardware specs behind every question in this playbook.
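As a taste of the napkin math those numbers drive, here is a minimal Python sketch of KV-cache sizing. The model dimensions (32 layers, 32 KV heads, head dimension 128, FP16) are illustrative Llama-7B-class assumptions, not exact published specs:

```python
# Napkin math: KV-cache footprint for an illustrative 7B-class model.
# All dimensions below are assumptions for the sake of the exercise.
layers = 32          # transformer layers
kv_heads = 32        # key/value heads (no grouped-query attention assumed)
head_dim = 128       # dimension per head
bytes_per_elem = 2   # FP16

# Each token stores one key AND one value vector per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

# At a 4096-token context, one sequence alone occupies:
seq_len = 4096
per_seq_gib = kv_bytes_per_token * seq_len / 2**30
print(f"KV cache per {seq_len}-token sequence: {per_seq_gib:.2f} GiB")
```

With these assumptions a single 4096-token sequence costs about 2 GiB of KV cache, which is why batch size on an 80 GB HBM card is gated by cache capacity, not weights alone.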

## How the Same Topic Changes Across Tracks

The physics is universal. The numbers are not.

<table> <thead> <tr> <th width="18%">Topic</th> <th width="22%">☁️ Cloud</th> <th width="22%">🤖 Edge</th> <th width="20%">📱 Mobile</th> <th width="18%">🔬 TinyML</th> </tr> </thead> <tbody> <tr> <td><b>Roofline</b></td> <td>H100 ridge point ~295 Ops/Byte</td> <td>Jetson Orin ridge point ~15 Ops/Byte</td> <td>NPU ridge point varies by SoC</td> <td>No FPU — integer-only roofline</td> </tr> <tr> <td><b>Memory</b></td> <td>KV-cache fragmentation in 80 GB HBM</td> <td>Model + sensor pipeline in 32 GB DRAM</td> <td>Model coexists with OS in shared RAM</td> <td>Entire model must fit in on-chip SRAM</td> </tr> <tr> <td><b>Quantization</b></td> <td>FP16→INT8 for throughput</td> <td>INT8→INT4 for thermal headroom</td> <td>INT8 for NPU compatibility</td> <td>INT8→binary to fit on chip</td> </tr> <tr> <td><b>Serving</b></td> <td>Continuous batching, PagedAttention</td> <td>Hard real-time inference deadlines</td> <td>On-device inference, thermal throttling</td> <td>Single-shot, microsecond latency</td> </tr> <tr> <td><b>Fault tolerance</b></td> <td>Checkpoint 10,000 GPUs (MTBF)</td> <td>Graceful degradation, functional safety</td> <td>Crash recovery, model fallback</td> <td>Watchdog timers, hard real-time guarantees</td> </tr> </tbody> </table>
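The roofline row above can be sketched in a few lines: a ridge point is just peak compute divided by peak memory bandwidth, and a kernel is memory-bound when its arithmetic intensity falls below it. The H100 figures used here (~989 TFLOPS FP16 dense, ~3.35 TB/s HBM3) are rounded, illustrative values:

```python
# Roofline classification: compare a kernel's arithmetic intensity
# (FLOPs per byte moved) against the hardware ridge point.

def ridge_point(peak_flops, peak_bw_bytes):
    """Arithmetic intensity at which compute and memory limits meet."""
    return peak_flops / peak_bw_bytes

def classify(intensity, ridge):
    return "compute-bound" if intensity >= ridge else "memory-bound"

# H100 SXM, rounded: ~989 TFLOPS FP16 dense, ~3.35 TB/s HBM3.
h100_ridge = ridge_point(989e12, 3.35e12)

# Autoregressive LLM decode is essentially GEMV: intensity near 1 FLOP/byte.
print(f"H100 ridge point: {h100_ridge:.0f} FLOPs/byte")
print(f"LLM decode (intensity ~1): {classify(1.0, h100_ridge)}")
```

The same two functions answer the Edge and TinyML variants; only the constants change.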

## Mastery Levels

Every question is tagged with a mastery level. These levels mirror engineering ladders at major tech companies (Google, Meta, etc.) but represent cognitive thresholds: each level tests a different kind of reasoning, mapped to Bloom's taxonomy and the scope of ownership expected at that career stage.

<table> <thead> <tr> <th width="15%">Level</th> <th width="15%">Scope</th> <th width="20%">Cognitive Skill</th> <th width="50%">What the interviewer hears</th> </tr> </thead> <tbody> <tr> <td>🟢 <b>L3 — The Screen</b></td> <td>Own a <b>task</b></td> <td><b>Recall & Define</b></td> <td>"The Roofline model relates compute to memory bandwidth."</td> </tr> <tr> <td>🔵 <b>L4 — The Practitioner</b></td> <td>Own a <b>component</b></td> <td><b>Apply & Identify</b></td> <td>"This workload is memory-bound because its arithmetic intensity is below the ridge point."</td> </tr> <tr> <td>🟡 <b>L5 — The Architect</b></td> <td>Own a <b>system</b></td> <td><b>Analyze & Predict</b></td> <td>"Switching from A100 to H100 won't help because the ridge point shifts right while our intensity stays at ~1."</td> </tr> <tr> <td>🔴 <b>L6+ — The Lead</b></td> <td>Own the <b>architecture</b></td> <td><b>Synthesize & Derive</b></td> <td>"Let me derive the optimal parallelism dimensions from the NVLink topology, memory capacity, and pipeline bubble cost."</td> </tr> </tbody> </table>

### The Transitions

- **L3→L4:** You stop reciting and start diagnosing. You can look at a system and correctly classify what's happening: identify the bottleneck, name the failure mode, apply the right formula.
- **L4→L5:** You stop diagnosing and start predicting. You can reason about what happens when a constraint changes (a hardware upgrade, a traffic spike, a precision change) and explain why the system behaves differently.
- **L5→L6+:** You stop predicting known patterns and start deriving solutions from first principles. You can stand at a whiteboard with incomplete information and work backward from physics to architecture.
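The prediction skill in the middle transition can be checked with napkin math. A minimal sketch, assuming rounded public specs (A100: ~312 TFLOPS FP16, ~2 TB/s HBM2e; H100: ~989 TFLOPS FP16, ~3.35 TB/s HBM3), of why a hardware upgrade barely helps a memory-bound workload:

```python
# Achievable throughput under the roofline model:
# min(peak compute, arithmetic intensity * memory bandwidth).
# Specs below are rounded, illustrative values.

def achievable_flops(intensity, peak_flops, peak_bw):
    return min(peak_flops, intensity * peak_bw)

a100 = {"peak_flops": 312e12, "bw": 2.0e12}
h100 = {"peak_flops": 989e12, "bw": 3.35e12}

intensity = 1.0  # FLOPs/byte, typical of GEMV-like LLM decode

a = achievable_flops(intensity, a100["peak_flops"], a100["bw"])
h = achievable_flops(intensity, h100["peak_flops"], h100["bw"])
print(f"A100 achievable: {a / 1e12:.1f} TFLOPS")
print(f"H100 achievable: {h / 1e12:.1f} TFLOPS")
print(f"Speedup: {h / a:.2f}x despite ~3.2x more peak compute")
```

The speedup tracks the bandwidth ratio (~1.7x), not the compute ratio, which is exactly the L5 answer in the mastery table above.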

### How This Maps to Industry

<table> <thead> <tr> <th width="10%">Level</th> <th width="15%">Google</th> <th width="15%">Meta</th> <th width="15%">Amazon</th> <th width="45%">What systems interviews test</th> </tr> </thead> <tbody> <tr> <td><b>L3</b></td> <td>L3 (SWE II)</td> <td>E3 (IC3)</td> <td>SDE I</td> <td>Can you define the concepts? Do you know the vocabulary of ML systems?</td> </tr> <tr> <td><b>L4</b></td> <td>L4 (SWE III)</td> <td>E4 (IC4)</td> <td>SDE II</td> <td>Given a broken system, can you diagnose the root cause?</td> </tr> <tr> <td><b>L5</b></td> <td>L5 (Senior)</td> <td>E5 (IC5)</td> <td>Senior SDE</td> <td>Given a working system and a changing constraint, can you predict what breaks?</td> </tr> <tr> <td><b>L6+</b></td> <td>L6 (Staff)</td> <td>E6 (Staff)</td> <td>Principal</td> <td>Given a blank whiteboard and a set of requirements, can you derive the architecture from physics?</td> </tr> </tbody> </table>

For question contributors: When writing a new question, ask yourself: "What scope of reasoning does this require?" If the answer is "name the concept," it's L3. If it's "diagnose this system," it's L4. If it's "predict what happens when X changes," it's L5. If it's "derive the solution from constraints," it's L6+.


## Topic Index

Every question is tagged with a topic. Use this index to study a specific concept across all rounds. The examples below highlight key questions from across the tracks.

<table> <thead> <tr> <th width="22%">Topic</th> <th width="60%">Example Questions Across Tracks</th> <th width="18%">Coverage</th> </tr> </thead> <tbody> <tr> <td><b><code>roofline</code></b> — Arithmetic Intensity, compute vs memory bound</td> <td><b>Cloud:</b> The Profiling Crisis · <b>Edge:</b> The Bandwidth-Bound Orin · <b>Mobile:</b> The Budget Phone Mystery</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>memory</code></b> — VRAM accounting, memory hierarchy, energy</td> <td><b>Cloud:</b> The Sequence Length Trap · <b>Mobile:</b> The Background Kill · <b>TinyML:</b> The Flash-SRAM Boundary</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>kv-cache</code></b> — KV-Cache sizing, fragmentation, PagedAttention</td> <td><b>Cloud:</b> The Fragmentation Crisis · <b>Mobile:</b> The Mobile LLM KV-Cache Squeeze</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>precision</code></b> — FP16/BF16/INT8, quantization, underflow</td> <td><b>Cloud:</b> The Underflow Crisis · <b>Edge:</b> The QAT Cliff · <b>TinyML:</b> The Requantization Pipeline</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>hardware</code></b> — Tensor Cores, sparsity, silicon architecture</td> <td><b>Cloud:</b> The Sparsity Fallacy · <b>Mobile:</b> The NPU Efficiency Advantage</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>frameworks</code></b> — JIT compilation, graph tracing, kernels</td> <td><b>Cloud:</b> The Compilation Overhead · <b>Mobile:</b> The Single-Op Delegation Fix</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>data-pipeline</code></b> — CPU starvation, preprocessing, ingestion</td> <td><b>Cloud:</b> The Data Pipeline Stall · <b>Edge:</b> The Timestamp Drift</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>parallelism</code></b> — DP, TP, PP, ZeRO, 3D parallelism</td> <td><b>Cloud:</b> The Pipeline Bubble · The Amdahl Ceiling · Dimensioning the 3D Cube</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>network</code></b> — NVLink, InfiniBand, Fat-Tree, AllReduce</td> <td><b>Cloud:</b> The Cross-Rack Stall · The Oversubscription Choke · The Ring vs Tree Dilemma</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>fault-tolerance</code></b> — Checkpointing, MTBF, stragglers</td> <td><b>Cloud:</b> The Straggler Problem · <b>Edge:</b> The Degradation Ladder</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>latency</code></b> — TTFT, TPOT, tail latency, queueing theory</td> <td><b>Cloud:</b> The Serving Inversion · <b>Mobile:</b> The Jank Explanation · <b>TinyML:</b> The Interrupt Deadline</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>serving</code></b> — Batching, cold starts, speculative decoding</td> <td><b>Cloud:</b> The Batching Dilemma · <b>Edge:</b> The eMMC Cold Start</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>mlops</code></b> — Drift, skew, deployment, technical debt</td> <td><b>Cloud:</b> The Training-Serving Skew · <b>Mobile:</b> The Silent Accuracy Degradation</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>economics</code></b> — TCO, retraining cost, sustainability</td> <td><b>Cloud:</b> The Energy Economics · <b>Edge:</b> The Edge vs Cloud Cost Crossover</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>security</code></b> — Prompt injection, adversarial attacks</td> <td><b>Cloud:</b> The Trust Boundary · <b>Edge:</b> The Adversarial Patch Attack</td> <td>✅ Strong</td> </tr> <tr> <td><b><code>privacy</code></b> — DP-SGD, membership inference</td> <td><b>Cloud:</b> The Privacy Audit · <b>Mobile:</b> The Federated Keyboard</td> <td>✅ Strong</td> </tr> </tbody> </table>

Each question includes a 📖 Deep Dive link to the relevant chapter of Machine Learning Systems. The questions prove the knowledge matters; the textbook teaches it.

<table> <thead> <tr> <th width="25%">Round</th> <th width="75%">Textbook Chapters Referenced</th> </tr> </thead> <tbody> <tr> <td><b>Silicon Physics</b></td> <td>HW Acceleration, Data Engineering, Model Compression, Frameworks, Neural Computation</td> </tr> <tr> <td><b>Distributed Infra</b></td> <td>Distributed Training, Fault Tolerance, Fleet Orchestration, Network Fabrics</td> </tr> <tr> <td><b>Production Serving</b></td> <td>Model Serving, Benchmarking, Inference at Scale</td> </tr> <tr> <td><b>Ops & Economics</b></td> <td>ML Operations, Responsible Engineering, Sustainable AI, Security & Privacy, Robust AI</td> </tr> </tbody> </table>

## Contributing

We welcome questions from recent AI systems interviews.

1. **Pull Request:** Click the ➕ Add a Flashcard link at the top of any round file and submit a question using the format below.
2. **Issue:** Open an issue with your question and we'll work with you to shape it.

### Question Format

Every question follows this structure. The Interviewer prompt is visible as the question; the answer is hidden behind a "Reveal Answer" fold so readers can quiz themselves. Not all fields are required; use Common Mistake and Napkin Math where they add value.

```markdown
<details>
<summary><b>[LEVEL BADGE]: [Question Title]</b> · <code>topic-tag</code></summary>

- **Interviewer:** [The scenario or crisis]

  <details>
  <summary><b>🔍 Reveal Answer</b></summary>

  **Common Mistake:** [What most people say wrong — creates the "aha" moment]

  **Realistic Solution:** [The physics/logic behind the correct answer]

  > **Napkin Math:** [Quick back-of-envelope calculation with real numbers]

  > **Key Equation:** $[The formula to memorize]$

  📖 **Deep Dive:** [Link to the relevant textbook chapter]

  </details>

</details>
```

## Contributors

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section --> <!-- prettier-ignore-start --> <!-- markdownlint-disable --> <table> <tbody> <tr> <td align="center" valign="top" width="14.28%"><a href="https://github.com/profvjreddi"> <sub><b>Vijay Janapa Reddi</b></sub></a> 🧠 🎨 ✍️</td> </tr> </tbody> </table> <!-- markdownlint-restore --> <!-- prettier-ignore-end --> <!-- ALL-CONTRIBUTORS-LIST:END -->
<p align="center"> <i>Wishing you all the best in your interviews and your engineering journey.</i>

— <b>Vijay Janapa Reddi</b>

</p>