interviews/cloud/03_inference_and_serving.md
The domain of the MLOps and Deployment Engineer. This round tests your ability to survive unpredictable user traffic: latency constraints, continuous batching, and KV-cache management.
Add a Flashcard (Edit in Browser): see the README for the template.
Interviewer: "We took our highly-optimized training architecture (large batches, deep pipelines) and deployed it directly to serving. Now, user requests are timing out. What fundamental priority did we fail to invert?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "We need to scale up the serving cluster." More hardware won't fix a design that optimizes for the wrong metric.
Realistic Solution: The shift from maximizing throughput ($T$) to minimizing latency ($L_{lat}$). Training maximizes throughput by using massive batches to keep the GPUs at 100% utilization. Serving minimizes latency because a slow response is a broken product. The batch-heavy architectures that saturate accelerators during training are fundamentally ill-suited for the bursty, latency-critical reality of production traffic.
Napkin Math: Training batch=256 on H100: 312 TFLOPS × 85% MFU = 265 TFLOPS effective, throughput = 10,000 tokens/sec. Serving batch=1 on the same H100: arithmetic intensity drops below the ridge point (295 Ops/Byte), the workload becomes memory-bandwidth-bound, effective throughput = 800 tokens/sec but latency = 25ms. The same GPU is ~12× more efficient at training but ~12× worse at serving, because the optimization target flipped from throughput to latency.
Deep Dive: Model Serving
</details>Interviewer: "Our users are complaining the LLM feels 'laggy', but our monitoring shows 'Average Latency' is well under our 2-second SLO. What specific generation metrics are we failing to monitor?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "We should look at P99 latency instead of average." P99 helps, but you're still measuring the wrong thing.
Realistic Solution: You are measuring the total request time, which masks the user experience. You must split monitoring into Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT). TTFT measures the compute-bound prefill phase (when the user is waiting for the bot to start typing). TPOT measures the memory-bandwidth-bound decode phase (the reading speed). A fast TPOT cannot save a slow TTFT.
Napkin Math: If TTFT = 3 seconds and TPOT = 20ms, a 200-token response takes $3 + (200 \times 0.02) = 7$ seconds total. Average latency looks fine across short and long requests, but users experience a 3-second blank screen before any text appears; that feels broken regardless of total latency.
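A minimal sketch of monitoring that split; the trace fields (`t_request`, `t_first_token`, `t_last_token`) are hypothetical names for timestamps your gateway would record:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    t_request: float        # wall-clock time the request arrived (s)
    t_first_token: float    # time the first output token was emitted (s)
    t_last_token: float     # time the final token was emitted (s)
    n_output_tokens: int

def ttft(tr: RequestTrace) -> float:
    # Time-To-First-Token: the prefill wait the user actually perceives.
    return tr.t_first_token - tr.t_request

def tpot(tr: RequestTrace) -> float:
    # Time-Per-Output-Token: steady-state decode speed after the first token.
    if tr.n_output_tokens <= 1:
        return 0.0
    return (tr.t_last_token - tr.t_first_token) / (tr.n_output_tokens - 1)

def p99(values: list[float]) -> float:
    # Simple nearest-rank P99; a real system would use a streaming sketch.
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

traces = [RequestTrace(0.0, 3.0, 7.0, 200)]   # the 3s-TTFT example above
print(f"TTFT={ttft(traces[0]):.2f}s  TPOT={tpot(traces[0]) * 1000:.1f}ms")
print("P99 TTFT:", p99([ttft(t) for t in traces]))
```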
Deep Dive: Benchmarking
</details>Interviewer: "Our GPU serving cluster handles 1,000 QPS at 50ms latency. On Black Friday, traffic spiked 10x to 10,000 QPS. The system didn't just slow down 10x; it completely collapsed, with latencies hitting 10 seconds and GPUs throwing Out Of Memory errors despite appearing to have free VRAM. Why did the system fail non-linearly, and how does continuous batching on GPUs shift this queueing knee point differently than traditional CPU-based web serving?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "The system should degrade linearly: 10x traffic means 10x latency." Queueing systems don't work that way, and GPU memory doesn't degrade gracefully.
Realistic Solution: The collapse is caused by the interaction of queueing theory and GPU memory fragmentation. In a CPU web server, as utilization ($\rho$) passes the "knee" at ~70%, queue lengths grow hyperbolically (in an M/M/1 queue, $L = \rho/(1-\rho)$). But in an LLM serving cluster, a traffic spike causes the batch scheduler to accept more concurrent requests. Without PagedAttention, each new request statically allocates its maximum potential KV-cache in HBM. The VRAM rapidly fragments. The non-linear collapse happens because the GPU spends cycles swapping KV-cache blocks to CPU RAM or simply OOMs, rather than just queueing. Continuous batching shifts this knee point significantly higher (to $\rho \approx 0.85$) because it dynamically slots new requests into the batch at the token level, absorbing the burst into the GPU's compute capacity without requiring new static memory reservations.
Napkin Math: At $\rho = 0.5$ (50% util): avg requests in system ≈ 1. At $\rho = 0.9$ (90% util): avg requests in system ≈ 9. The relationship is $L = \rho/(1-\rho)$. On a CPU, this just means waiting. On an 80GB H100, if 9 concurrent requests each statically reserve 8GB for a max 8K context window, you need 72GB. You hit the memory wall before you hit the compute wall. Continuous batching + PagedAttention changes the denominator: instead of allocating 8GB per request upfront, it allocates 16MB blocks dynamically. This allows the GPU to sustain the queueing spike by keeping memory utilization proportional to actual tokens generated, not max potential tokens.
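The knee is easy to see by tabulating the M/M/1 relation from the napkin math (a sketch; a real cluster is closer to M/M/c, so treat the numbers as directional):

```python
def avg_in_system(rho: float) -> float:
    # M/M/1 steady state: L = rho / (1 - rho). Blows up as rho -> 1.
    assert 0 <= rho < 1, "utilization must be below 100%"
    return rho / (1 - rho)

for rho in (0.5, 0.7, 0.85, 0.9, 0.95, 0.99):
    print(f"utilization={rho:.2f}  avg requests in system={avg_in_system(rho):5.1f}")
```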
Deep Dive: Model Serving
</details>Interviewer: "We are serving a chatbot. Even though we have 40GB of free VRAM, our inference server refuses to accept new concurrent requests, citing an 'Out of Memory' error. What is consuming our VRAM invisibly, and how do we fix it?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "There must be a memory leak in the serving framework." It's not a leak; it's by design.
Realistic Solution: KV-Cache memory fragmentation. Standard attention allocates contiguous VRAM for the maximum possible sequence length of every request. Because actual sequence lengths are unpredictable, this wastes 60-80% of memory. You must implement PagedAttention (like vLLM), which maps virtual KV-cache blocks to non-contiguous physical blocks, allowing near-zero fragmentation and 2-3x higher batch sizes.
Napkin Math: Max sequence = 8192 tokens. Average actual sequence = 500 tokens. Waste per request = $(8192 - 500)/8192 = 93.9\%$. With 40 GB free and each max-length reservation taking ~13 GB, you can only serve 3 concurrent requests. With PagedAttention, you allocate only what's used: 3 requests × 500 tokens = 1500 tokens worth of cache, leaving room for 20+ more concurrent requests.
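A sketch of the same reservation math as a helper (the ~13 GB max-length reservation from the example is the assumed per-request cost):

```python
def concurrent_capacity(free_gb: float, max_tokens: int, avg_tokens: int,
                        gb_per_token: float) -> tuple[int, int]:
    """Requests that fit with static max-length reservations vs paged allocation."""
    static_per_req = max_tokens * gb_per_token   # reserved up front, mostly unused
    paged_per_req = avg_tokens * gb_per_token    # allocated only as tokens arrive
    return int(free_gb // static_per_req), int(free_gb // paged_per_req)

gb_per_token = 13 / 8192                 # ~1.6 MB/token, from the 13 GB example
static, paged = concurrent_capacity(40, 8192, 500, gb_per_token)
print(f"static reservations: {static} concurrent requests")   # -> 3
print(f"paged allocation:    {paged} concurrent requests")    # -> ~50
print(f"waste per static reservation: {(8192 - 500) / 8192:.1%}")  # -> 93.9%
```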
Deep Dive: Frameworks
</details>Interviewer: "We are deploying a photo classification model. Running it in real-time on user upload costs us $10,000 a month in GPU compute. How do we reduce costs without losing model accuracy?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Quantize the model to INT8 to halve the compute cost." Quantization helps, but there's a 10x cheaper architectural change.
Realistic Solution: Shift from Dynamic (Real-time) to Static (Batch) inference. The reason batch inference is dramatically cheaper isn't just about amortizing GPU idle time; it's about arithmetic intensity. At batch=1, each forward pass loads the full model weights from HBM but performs very little compute per byte moved, placing the workload deep in the memory-bound regime of the Roofline. At batch=256, the same weight load is reused across 256 inputs, pushing arithmetic intensity above the ridge point into the compute-bound regime where the GPU's ALUs are fully saturated. Run the model periodically (e.g., overnight) on cheap spot instances with large batches, and store results in a fast key-value cache (Redis) for millisecond serving.
Napkin Math: Batch=1 inference: arithmetic intensity ~1 Op/Byte (memory-bound, 5% GPU utilization). Batch=256 inference: arithmetic intensity ~200 Ops/Byte (compute-bound, 70% GPU utilization). Same model, same GPU: 14× better utilization just from batching. Real-time: 1 GPU reserved 24/7 at $2/hr = $1,460/month at ~10% average utilization. Batch: same workload in 2 hours overnight on spot instances at $0.70/hr = $1.40/day = $42/month at 70%+ utilization. That's a ~35× cost reduction.
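A rough weights-only intensity model showing how batch size alone moves you toward the ridge (a sketch: FP16 weights, activation traffic ignored, so intensity comes out to roughly one Op/Byte per batch element):

```python
def arithmetic_intensity(params: float, batch: int, bytes_per_param: int = 2) -> float:
    # FLOPs ~= 2 * params per input; the single weight load from HBM is
    # amortized across the whole batch. Activation traffic is ignored.
    flops = 2 * params * batch
    bytes_moved = params * bytes_per_param
    return flops / bytes_moved

RIDGE = 295  # Ops/Byte, the H100-class ridge point quoted earlier
for batch in (1, 8, 64, 256, 512):
    ai = arithmetic_intensity(1e9, batch)   # any model size; intensity ~ batch here
    print(f"batch={batch:4d}  ~{ai:6.1f} Ops/Byte  (ridge point: {RIDGE})")
```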
Deep Dive: Model Serving
</details>Interviewer: "We use static batching for our LLM API. If a request generating 5 tokens is batched with a request generating 500 tokens, the 5-token request sits in VRAM until the 500-token request finishes. How do we fix this idle capacity?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Set a maximum generation length to keep batch members similar." This caps functionality and still wastes compute on padding.
Realistic Solution: Implement Continuous (or In-Flight) Batching. Instead of waiting for all requests in a batch to finish, continuous batching operates at the iteration level. As soon as the 5-token request finishes and emits its <EOS> token, it is immediately evicted from the batch, and a new request from the queue is slotted in for the very next forward pass.
Napkin Math: Static batch of 8 requests: shortest = 5 tokens, longest = 500 tokens. Wasted compute = 7 requests × (500 - avg_length) × cost_per_token. With continuous batching, the 5-token request is done in 5 iterations and replaced; the slot processes ~100 different short requests in the time the long request takes, increasing effective throughput by 2-4×.
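A toy simulation of iteration-level scheduling (token counts stand in for real decode state; no model is run):

```python
from collections import deque

def continuous_batching(request_lengths: list[int], max_batch: int) -> int:
    """Simulate iteration-level scheduling; returns total forward passes."""
    queue = deque(request_lengths)         # tokens each request still needs
    active: list[int] = []
    steps = 0
    while queue or active:
        # Refill free slots at the iteration level -- no waiting for the batch.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1                          # one forward pass for the whole batch
        active = [r - 1 for r in active if r > 1]   # evict finished (<EOS>) requests
    return steps

# One 500-token request batched with many 5-token requests:
lengths = [500] + [5] * 700
print("forward passes:", continuous_batching(lengths, max_batch=8))
# Static batching would hold every 5-token request hostage for 500 iterations.
```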
Deep Dive: Model Serving
</details>Interviewer: "Our serverless inference endpoint scales down to 0 replicas to save money. However, the first user request after scaling back up takes 30 seconds to process. What is causing this delay?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "The model needs to warm up its caches." There's no cache warm-up; the delay is purely about data movement.
Realistic Solution: Cold Start Latency. The system must provision the container and load the Python runtime, but most importantly, it must transfer the massive model weights (gigabytes of data) from network storage into the GPU's HBM before the first forward pass can execute. You must use persistent warm replicas or optimized weight loading strategies to fix this.
Napkin Math: A 70B model in FP16 = 140 GB. Network storage throughput (EBS/S3) ≈ 5-10 GB/s. Load time = $140/7.5 \approx 19$ seconds just for weight transfer. Add container startup (~3s) and Python/CUDA init (~5s) = ~27 seconds total cold start.
Deep Dive: Model Serving
</details>Interviewer: "In our LLM deployment, users sending very long prompts are causing massive latency spikes for other users who are currently in the middle of generating tokens. How do we isolate these workloads?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Rate-limit long prompts" or "Add more GPUs to the pool." Neither addresses the fundamental resource contention.
Realistic Solution: Disaggregated Serving. The prompt phase (Prefill) is heavily compute-bound and monopolizes the GPU ALUs, starving the token generation phase (Decode), which is memory-bandwidth bound. You must split Prefill and Decode onto entirely separate GPU clusters, computing the KV-Cache on the Prefill nodes and transmitting it over the network to the Decode nodes.
Napkin Math: Prefill for a 10k-token prompt on a 70B model: ~2 seconds of pure compute, consuming 100% of the GPU's ALUs. During those 2 seconds, every concurrent decode request (which needs ~5ms per token) is blocked. With 50 concurrent users, that's 50 × 2s = 100 user-seconds of stalled generation from a single long prompt.
Deep Dive: Model Serving
</details>Interviewer: "We are heavily memory-bandwidth bound during LLM decoding. How can we generate tokens faster without changing the model weights, quantizing, or losing exact mathematical accuracy?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Use a faster GPU" or "Increase the batch size." A faster GPU won't help if you're bandwidth-bound (Round 1: Roofline Shift), and larger batches increase throughput but not per-request latency.
Realistic Solution: Speculative Decoding. You use a tiny, fast "draft" model to guess the next $K$ tokens. You then pass these $K$ tokens to your massive target model in a single forward pass. The large model verifies the guesses in parallel (trading spare ALU compute capacity to save memory fetches). All correct tokens are accepted, maintaining identical output distributions but yielding 2-3x speedups.
Napkin Math: Normal decode: 1 token per forward pass, each pass loads all 140 GB of weights from HBM. 100 tokens = 100 weight loads = $100 \times 140\text{GB} / 3.35\text{TB/s} = 4.2$ seconds. Speculative decode with $K=5$ and an average of 4 tokens accepted per pass: ~100 tokens in 25 forward passes = $25 \times 140\text{GB} / 3.35\text{TB/s} = 1.04$ seconds. 4× speedup.
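A toy version of the accept/verify loop; `draft_model` and `target_accepts` are stand-ins (real verification uses one batched forward pass of the target model and a rejection-sampling rule that preserves its output distribution exactly):

```python
import random

def draft_model(ctx: list[int], k: int) -> list[int]:
    # Stand-in for a tiny fast model proposing K tokens.
    return [random.randrange(100) for _ in range(k)]

def target_accepts(ctx: list[int], token: int) -> bool:
    # Stand-in for the rejection-sampling check done inside ONE big forward pass.
    return random.random() < 0.8    # ~80% acceptance, i.e. ~4 of K=5 accepted

def speculative_decode(n_tokens: int, k: int = 5) -> int:
    ctx: list[int] = []
    target_passes = 0
    while len(ctx) < n_tokens:
        proposal = draft_model(ctx, k)
        target_passes += 1              # one target pass verifies all K guesses
        for tok in proposal:
            if target_accepts(ctx, tok):
                ctx.append(tok)
            else:
                ctx.append(random.randrange(100))   # resample from target, stop
                break
        else:
            ctx.append(random.randrange(100))       # bonus token when all K accepted
    return target_passes

print("target passes for 100 tokens:", speculative_decode(100))
# Plain autoregressive decoding needs 100 passes (one 140 GB weight load each).
```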
Deep Dive: Model Serving
</details>Interviewer: "You implement speculative decoding for a 70B parameter LLM using a 1.5B draft model. On a single A100 (80GB), generation speed increases by 2.5x. You deploy this to production, but under heavy concurrent load (batch size 64), the speculative decoding actually makes generation slower than standard autoregressive decoding. Why?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Assuming speculative decoding always provides a speedup because it reduces the number of forward passes of the large model."
Realistic Solution: Speculative decoding trades spare compute (ALU cycles) for memory bandwidth. At batch size 1, LLM inference is severely memory-bandwidth bound, so the A100's ALUs are mostly idle. Using those idle ALUs to run a small draft model and verify multiple tokens in one memory sweep of the 70B model is a massive win. However, at batch size 64, the arithmetic intensity of the workload increases significantly. The system becomes compute-bound. Now, running the draft model steals precious ALU cycles away from the main model's generation, and the overhead of verifying rejected tokens actually reduces overall throughput.
Napkin Math: A 70B model at BS=1 has an arithmetic intensity of ~2 FLOPs/byte. The A100 is memory-bound up to ~156 FLOPs/byte (312 TFLOPS / 2 TB/s). At BS=64, intensity jumps to ~128 FLOPs/byte, nearing the compute roofline. The draft model requires 1.5B × 2 = 3 GFLOPs per token. At BS=64, that's 192 GFLOPs of pure overhead per speculative step, eating into a fully saturated compute pipeline.
Deep Dive: Model Serving
</details>Interviewer: "You are serving a Llama-3 8B model. At a batch size of 1 with a 1,000-token prompt, inference requires roughly 16GB of VRAM. A customer wants you to increase the context window to 128,000 tokens for document summarization. Your PM assumes it will just take a bit more memory and asks you to deploy it on a 24GB RTX 4090. Why is this physically impossible?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Forgetting that the KV-Cache grows linearly with sequence length, and assuming model weights are the only significant memory consumer."
Realistic Solution: You hit the KV-Cache memory wall. While the model weights for an 8B model (at 16-bit precision) consume about 16GB of VRAM regardless of the prompt size, the Key-Value (KV) cache does not. To avoid recomputing attention for past tokens, autoregressive generation stores the K and V vectors for every single token in the context window. At 128,000 tokens, the physical memory required just to store the state of the conversation dwarfs the size of the model itself.
Napkin Math: For Llama-3 8B (32 layers, 8 KV heads, 128 head dim, 2 bytes per param):
- Memory per token = 2 (K&V) × 32 layers × 8 heads × 128 dim × 2 bytes ≈ 131 KB per token.
- 1,000 tokens × 131 KB ≈ 130 MB: fits easily in a 4090.
- 128,000 tokens × 131 KB ≈ 16.7 GB.
- 16 GB (weights) + 16.7 GB (KV-cache) = 32.7 GB. It physically cannot fit into a 24 GB GPU.
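The per-token formula as a reusable helper; the defaults are the Llama-3 8B figures above:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for K and V; per token, per layer, across all KV heads.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(f"{kv_cache_bytes(1) / 1e3:.0f} KB per token")            # ~131 KB
print(f"1K tokens:   {kv_cache_bytes(1_000) / 1e6:.0f} MB")     # ~131 MB
print(f"128K tokens: {kv_cache_bytes(128_000) / 1e9:.1f} GB")   # ~16.8 GB
```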
Deep Dive: Model Serving
</details>Interviewer: "You implement continuous batching (iteration-level scheduling) for an LLM endpoint to improve throughput. Under low load, TTFT (Time To First Token) is 100ms. Under high load, your throughput triples, but users start complaining that TTFT spikes to over 4 seconds, even though token generation speed (TPOT) remains fast. What scheduling flaw caused this?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Assuming that because continuous batching optimizes VRAM and throughput, it automatically guarantees low latency for incoming requests."
Realistic Solution: You starved the prefill phase. In LLM serving, a request has two phases: Prefill (processing the prompt, compute-bound) and Decode (generating tokens, memory-bound). With continuous batching, an incoming request is injected into the very next iteration of the running batch. However, the prefill phase for a new prompt requires a massive matrix multiplication that consumes almost all the GPU's compute. If you inject a new 2,000-token prompt into a running batch, the GPU must pause decoding for everyone else to crunch that prefill. To prevent this, schedulers often limit the prefill chunk size or delay new prefills. If delayed too aggressively under high load, incoming requests wait in the queue for seconds, destroying TTFT.
Napkin Math: A 2,000-token prefill on a 70B model requires 2 × 70B × 2,000 = 280 TFLOPs. An A100 (FP16) does ~312 TFLOPS, so that prefill takes almost 1 full second of dedicated compute. If 4 users send prompts simultaneously, the 4th user sits in the queue for ~3 seconds before their prefill even begins, destroying TTFT, while the active decoders are completely stalled waiting for the prefills to finish.
Deep Dive: Model Serving
</details>Interviewer: "Your company is choosing between H100, A100, and TPU v5e for serving a 70B parameter LLM. The CFO wants a cost-per-million-tokens comparison across these accelerators at realistic batch sizes. Walk me through the napkin math and tell me which hardware wins โ and when."
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Pick the GPU with the highest FLOPS โ H100 always wins." Raw FLOPS is irrelevant for decode-phase serving, which is memory-bandwidth bound. The cheapest $/token depends on batch size, model size, and whether you're prefill-bound or decode-bound.
Realistic Solution: LLM decode is memory-bandwidth bound: each token requires loading all model weights from HBM once. The metric that matters is $/GB/s of memory bandwidth, not $/TFLOP. At small batch sizes, the accelerator with the best bandwidth-per-dollar wins. At large batch sizes, the workload shifts compute-bound and FLOPS/$ matters more. TPU v5e wins on raw $/token at high batch sizes due to lower per-chip cost; H100 wins on latency-sensitive low-batch workloads due to 3.35 TB/s HBM3 bandwidth.
Napkin Math: 70B model in FP16 = 140 GB weights. Each decode token loads 140 GB.
- H100 (80 GB HBM3, 3.35 TB/s): Decode token = 140 GB / 3.35 TB/s = 42 ms at BS=1. On-demand ~$3.50/hr → $3.50 / (3600s / 0.042s) = $0.000041/token → $41/M tokens at BS=1. At BS=32, throughput scales ~25×, cost drops to ~$1.60/M tokens.
- A100 (80 GB HBM2e, 2.0 TB/s): Decode token = 140 GB / 2.0 TB/s = 70 ms at BS=1. On-demand ~$2.20/hr → $43/M tokens at BS=1. At BS=32, cost drops to ~$2.10/M tokens (less compute headroom than H100).
- TPU v5e (16 GB HBM, $1.20/chip/hr): Requires 16-way tensor parallelism to fit 140 GB. Aggregate bandwidth = 16 × 819 GB/s = 13.1 TB/s. Decode = 140 GB / 13.1 TB/s = 10.7 ms at BS=1. Cost = 16 × $1.20 = $19.20/hr → $57/M tokens at BS=1. At BS=64, throughput scales, cost drops to ~$2.50/M tokens.
H100 wins across the board at these prices ($3.50/hr) because it delivers more bandwidth-per-dollar (957 GB/s per $) and more compute-per-dollar (282 TFLOPS/$) than both the A100 and TPU v5e.
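The BS=1 decode comparison as a small calculator (a sketch: it models only bandwidth-bound decode, where each token loads all weights once, using the list prices assumed above):

```python
def decode_cost_per_m_tokens(weight_gb: float, bandwidth_tb_s: float,
                             price_per_hr: float) -> float:
    """$/million tokens for bandwidth-bound BS=1 decode."""
    seconds_per_token = weight_gb / (bandwidth_tb_s * 1000)  # GB / (GB/s)
    tokens_per_hr = 3600 / seconds_per_token
    return price_per_hr / tokens_per_hr * 1e6

WEIGHTS_GB = 140   # 70B in FP16
for name, bw_tb_s, price in [("H100", 3.35, 3.50),
                             ("A100", 2.00, 2.20),
                             ("16x TPU v5e", 13.1, 19.20)]:
    cost = decode_cost_per_m_tokens(WEIGHTS_GB, bw_tb_s, price)
    print(f"{name:12s} ${cost:5.0f}/M tokens at BS=1")   # ~41, ~43, ~57
```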
Deep Dive: Model Serving
</details>Interviewer: "We're serving a 70B parameter model in FP16 across two A100-80GB GPUs. The team proposes switching to INT8 to fit on a single GPU and cut costs in half. The PM says 'quantization is free performance.' Walk me through the real trade-offs: when does the accuracy drop actually matter?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "INT8 quantization always halves memory and doubles throughput with negligible accuracy loss." In practice, the throughput gain is ~1.5×, not 2×, because compute is rarely the bottleneck during decode, and accuracy degradation is task-dependent: catastrophic for math/code, negligible for summarization.
Realistic Solution: INT8 quantization halves the weight memory (140 GB → 70 GB), fitting on a single A100-80GB and eliminating tensor-parallelism communication overhead. But the throughput gain is ~1.5× because: (1) the KV-cache is still in FP16, (2) activations require dequantization overhead, and (3) decode is bandwidth-bound, so halving the weight size only helps if weights dominate the memory traffic. The accuracy trade-off is non-uniform: perplexity increases ~0.1-0.3 points on average, but tail tasks (multi-step math, code generation, structured reasoning) degrade 3-8% on benchmarks like HumanEval and GSM8K.
Napkin Math:
- FP16 on 2× A100: 140 GB weights, tensor-parallel across 2 GPUs. Aggregate bandwidth = 4.0 TB/s. Decode latency = 140 GB / 4.0 TB/s = 35 ms/token. Cost = 2 × $2.20/hr = $4.40/hr.
- INT8 on 1× A100: 70 GB weights on 1 GPU. Bandwidth = 2.0 TB/s. Decode latency = 70 GB / 2.0 TB/s = 35 ms/token. Cost = $2.20/hr. Same latency, half the cost.
- But add the KV-cache: at 4K context, KV-cache ≈ 4 GB (FP16 in both cases). Total memory traffic per token: FP16 = 144 GB; INT8 = 74 GB. Effective speedup = 144/74 = 1.95× in bandwidth, but dequant overhead eats ~20%, netting ~1.5× real throughput.
- Accuracy: Perplexity on WikiText goes from 5.2 → 5.4 (acceptable). HumanEval pass@1 drops from 67% → 61% (may be unacceptable for a coding assistant).
Deep Dive: Model Optimizations
</details>Interviewer: "Our LLM API serves both autocomplete requests (~10 output tokens, 50ms SLA) and document summarization requests (~10,000 output tokens, no strict SLA). Both hit the same pool of 8× H100 GPUs. Autocomplete users are furious: P99 latency is 4 seconds instead of 50ms. The summarization users are fine. What's happening and how do you fix it?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Add more GPUs to handle the load." The problem isn't capacity; it's mixing workloads with incompatible SLA profiles in the same scheduling queue.
Realistic Solution: Head-of-line blocking from mixed request lengths. With continuous batching, a 10,000-token summarization request occupies a batch slot for ~400 seconds (at 25 tokens/sec). During that time, the scheduler must run prefill for incoming autocomplete requests, but the long requests consume KV-cache memory, reducing the maximum batch size. Worse, if the batch is full of long-running decodes, new short requests queue behind prefill scheduling delays. The fix is workload-aware routing: separate GPU pools (or priority lanes) for latency-sensitive short requests vs throughput-oriented long requests. Short-request GPUs run with small max-sequence-length and aggressive preemption; long-request GPUs optimize for throughput with large batches.
Napkin Math: 8× H100 shared pool, continuous batching, max batch = 32.
- Long requests: 10,000 tokens × 40ms/token = 400s each. 20 concurrent long requests fill 20/32 batch slots.
- Short requests: 10 tokens × 40ms/token = 0.4s each. Only 12 batch slots remain.
- But each long request's KV-cache: 10,000 tokens × 131 KB = 1.3 GB. 20 requests = 26 GB of KV-cache, leaving only 54 GB for weights + short-request KV-cache on an 80 GB GPU.
- With dedicated pools: 2 GPUs for short requests (batch=32, all slots turn over in <1s, P99 TTFT < 50ms) and 6 GPUs for long requests (optimized for throughput). Short-request P99 drops from 4s to 40ms.
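A sketch of the routing rule; the pool split and the 50-token threshold are illustrative choices from this scenario, not a framework API:

```python
SHORT_POOL = ["gpu-0", "gpu-1"]   # small max-seq-len, aggressive turnover
LONG_POOL = ["gpu-2", "gpu-3", "gpu-4", "gpu-5", "gpu-6", "gpu-7"]  # throughput-tuned

def route(max_output_tokens: int, prompt_tokens: int,
          short_threshold: int = 50) -> list[str]:
    """Send latency-sensitive short requests to the dedicated low-latency pool."""
    if max_output_tokens <= short_threshold and prompt_tokens < 2_000:
        return SHORT_POOL    # autocomplete-style: tight SLA, tiny KV footprint
    return LONG_POOL         # summarization-style: no strict SLA, big batches

print(route(max_output_tokens=10, prompt_tokens=200))      # -> short pool
print(route(max_output_tokens=10_000, prompt_tokens=500))  # -> long pool
```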
Deep Dive: Model Serving
</details>Interviewer: "We're spending $180,000/month serving a 70B model on 8× H100 GPUs. An engineer proposes distilling it into a 7B student model that can run on a single A10G ($0.75/hr). The distillation job itself will require 2 weeks on 32× A100 GPUs. Should we do it? Show me the math."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Distillation is obviously worth it: a 10× smaller model means 10× cheaper serving." This ignores the upfront distillation cost, the quality regression requiring human evaluation, and the ongoing retraining cost as the teacher model is updated.
Realistic Solution: Distillation is a capital investment with a payback period. You must compute: (1) the one-time distillation cost, (2) the monthly serving cost delta, (3) the quality gap and its business impact, and (4) the maintenance cost of re-distilling when the teacher is updated. The math usually works for stable, high-volume workloads but fails for rapidly iterating models.
Napkin Math:
- Distillation cost: 32× A100 × $2.20/hr × 24 hr × 14 days = $23,654 one-time.
- Current serving (70B on 8× H100): 8 × $3.50/hr × 730 hr/month = $20,440/month (the $180K includes multi-region redundancy; let's use single-region for comparison).
- Student serving (7B on 1× A10G): Need ~4 A10Gs for equivalent throughput (7B is 10× smaller, but an A10G is ~5× slower than an H100). 4 × $0.75/hr × 730 = $2,190/month.
- Monthly savings: $20,440 − $2,190 = $18,250/month.
- Payback period: $23,654 / $18,250 = 1.3 months.
- But: If the teacher model is updated quarterly, you re-distill 4×/year = $94,616/year in distillation costs. That still nets roughly $219,000 − $94,616 ≈ $124,000/year in savings, if the quality gap is acceptable. A 5% drop in task accuracy on a revenue-critical endpoint could cost more than the savings.
Deep Dive: Model Optimizations
</details>Interviewer: "Our chatbot returns the full response only after all tokens are generated. Users say it feels 'broken': they stare at a blank screen for 8 seconds. A colleague suggests streaming tokens as they're generated. Walk me through the latency math and when streaming actually helps vs. hurts."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Streaming always improves the user experience." Streaming improves perceived latency for long responses but adds per-token SSE/WebSocket overhead and complicates error handling. For short responses (<20 tokens), the overhead can make streaming slower than batch.
Realistic Solution: Streaming decouples perceived latency from total latency. The user sees the first token after TTFT (prefill time), then each subsequent token arrives at TPOT intervals. The perceived wait is TTFT (~200ms), not the total generation time. But streaming adds: (1) SSE/WebSocket framing overhead (~0.5ms/token), (2) client-side rendering cost, and (3) inability to retry or filter the full response before sending. For short responses, the total time with streaming overhead can exceed the non-streaming total time.
Napkin Math: 200-token response on H100 serving a 13B model:
- Non-streaming: TTFT = 150ms, TPOT = 30ms/token. User waits: 150 + (200 × 30) = 6,150ms before seeing anything.
- Streaming: User sees the first token at 150ms. The full response arrives at the same 6,150ms total, but the perceived wait = 150ms. That's a 41× improvement in perceived responsiveness.
- Short response (10 tokens): Non-streaming total = 150 + (10 × 30) = 450ms. Streaming: first token at 150ms, plus ~5ms of total SSE overhead. Streaming barely helps here: 150ms vs 450ms perceived, and the user reads the full 10-token response in <1 second either way.
- Break-even: Streaming helps when total generation time > ~1 second (roughly >30 tokens at 30ms/token).
Deep Dive: Model Serving
</details>Interviewer: "Our vLLM instance on an A100-80GB has been serving requests for 6 hours. nvidia-smi shows 16 GB used, 64 GB free. But the server rejects new requests with 'Cannot allocate KV-cache.' We have 80% free memory; where did it go?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "There's a memory leak; restart the server." The memory is genuinely free, but it's fragmented into non-contiguous chunks that can't satisfy a contiguous allocation request.
Realistic Solution: This is classic external fragmentation in GPU memory. Without PagedAttention, KV-cache is allocated as contiguous blocks sized for the maximum sequence length. As requests of varying lengths complete and free their blocks, the free memory becomes a patchwork of small non-contiguous holes. A new request needing a 2 GB contiguous block fails even though 64 GB is free, because no single contiguous 2 GB region exists. PagedAttention (used by vLLM) solves this by managing KV-cache as fixed-size pages (like OS virtual memory), mapping logical KV blocks to arbitrary physical memory locations. This eliminates external fragmentation entirely.
Napkin Math: After 10,000 requests with sequence lengths uniformly distributed between 100-8,192 tokens:
- Without paging: Each completed request leaves a hole sized for its actual length. After 10K requests, the 64 GB of free memory is a Swiss cheese of ~5,000 fragments averaging ~13 MB each, with the largest contiguous free block on the order of ~50 MB. A new 2,048-token request needs 2,048 × 131 KB/token ≈ 268 MB of contiguous memory: impossible.
- With PagedAttention (block size = 16 tokens ≈ 2 MB): Any free block can be used. 64 GB free ≈ 32,000 free blocks. Allocation always succeeds regardless of the fragmentation pattern. Waste = only the last partially-filled block per request, ~1 MB on average: well under 1% internal fragmentation.
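A toy block-table allocator showing why any free page can serve any request (bookkeeping only; no GPU memory is touched, and the 32,000-block pool matches the numbers above):

```python
class PagedKVAllocator:
    """Toy PagedAttention-style allocator: fixed-size blocks, no contiguity needed."""
    def __init__(self, total_blocks: int, tokens_per_block: int = 16):
        self.free = set(range(total_blocks))
        self.tokens_per_block = tokens_per_block
        self.page_table: dict[str, list[int]] = {}  # request id -> physical blocks

    def allocate(self, req_id: str, n_tokens: int) -> bool:
        n_blocks = -(-n_tokens // self.tokens_per_block)   # ceil division
        if len(self.free) < n_blocks:
            return False                       # genuinely out of memory
        blocks = [self.free.pop() for _ in range(n_blocks)]
        self.page_table[req_id] = blocks       # logical -> physical, any order
        return True

    def release(self, req_id: str) -> None:
        self.free.update(self.page_table.pop(req_id))

alloc = PagedKVAllocator(total_blocks=32_000)    # ~64 GB free at ~2 MB/block
print(alloc.allocate("req-1", n_tokens=2_048))   # True: fragmentation can't block it
```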
Deep Dive: Frameworks
</details>Interviewer: "Every request to our customer-support LLM starts with the same 2,000-token system prompt. We're serving 1,000 requests/minute on 4× H100 GPUs. An engineer says 'we should cache the system prompt's KV-cache.' Quantify the savings: is this worth the engineering effort?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Prompt caching only saves a little memory; the real cost is in generation." Wrong. The system prompt's prefill is the most expensive per-request compute cost, and its KV-cache is duplicated across every concurrent request.
Realistic Solution: Caching the system prompt's KV-cache provides two savings: (1) compute savings: skip the 2,000-token prefill for every request; and (2) memory savings: store one copy of the system prompt's KV-cache instead of one per concurrent request. The compute savings dominate at high QPS. This is essentially prefix caching, implemented in vLLM via automatic prefix caching (APC) and in SGLang via RadixAttention.
Napkin Math: 70B model, 2,000-token system prompt:
- Prefill compute per request: 2 × 70B × 2,000 = 280 TFLOPs. On an H100 (990 TFLOPS FP16): 280/990 = 283ms per request just for the system prompt.
- At 1,000 req/min: 1,000 × 283ms = 283 GPU-seconds/minute = 4.7 GPU-minutes/minute spent re-computing the same system prompt. That's almost 5 GPUs doing nothing but redundant prefill.
- With caching: Prefill the system prompt once, reuse the KV-cache. Compute savings = 283ms × 1,000 = 283 GPU-seconds/minute saved.
- Memory savings: KV-cache for 2,000 tokens on a 70B model ≈ 2,000 × 320 KB/token (2 for K&V × 80 layers × 8 KV heads × 128 dim × 2 bytes) = 640 MB. Without caching, 50 concurrent requests = 50 × 640 MB = 32 GB of duplicated KV-cache. With caching: one 640 MB shared copy, saving ~31 GB of VRAM.
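A toy sketch of the prefix-caching idea (vLLM's APC and SGLang's RadixAttention do this inside the engine; here `prefill` is a stand-in and the cache is a plain dict keyed by a hash of the shared prefix):

```python
import hashlib

_prefix_cache: dict[str, object] = {}   # prefix hash -> its KV blocks

def prefill(text: str, past_kv=None) -> object:
    # Stand-in for the engine's prefill kernel; the suffix attends over past_kv.
    return object()

def serve(system_prompt: str, user_prompt: str):
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in _prefix_cache:                      # paid once (~283 ms above),
        _prefix_cache[key] = prefill(system_prompt)   # then shared by all requests
    prefix_kv = _prefix_cache[key]
    return prefill(user_prompt, past_kv=prefix_kv)    # only the suffix is new work

serve("You are a helpful support agent...", "Where is my order?")
```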
Deep Dive: Model Serving
</details>Interviewer: "We run a multi-tenant LLM platform. Each of our 100 enterprise customers has a fine-tuned LoRA adapter for a shared Llama-70B base model. We need to serve all 100 adapters from the same GPU cluster. Walk me through the memory math, the adapter swapping cost, and how to batch requests across different adapters."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Load all 100 LoRA adapters into GPU memory; they're small." Even small adapters add up, and the real bottleneck is batching: you can't batch requests across different adapters naively because each adapter produces different weight matrices.
Realistic Solution: LoRA adapters are small individually (a rank-16 adapter for a 70B model is ~32 MB, per the math below), but 100 of them = ~3.2 GB, which competes with KV-cache for VRAM. The key insight is that LoRA is a low-rank additive decomposition: $W' = W + BA$. The base model weights $W$ are shared; only $B$ and $A$ differ per adapter. You can batch the base model computation across all requests, then apply per-adapter corrections using a batched SGMV (Segmented Gather Matrix-Vector) kernel that applies different $BA$ products to different requests within the same batch. This is how S-LoRA and Punica work.
Napkin Math: Llama-70B base model = 140 GB (FP16), needs 2× H100 with tensor parallelism.
- Per-adapter size (rank 16): Each adapted layer has $B \in \mathbb{R}^{d \times 16}$ and $A \in \mathbb{R}^{16 \times d}$. For $d = 8192$ and 64 adapted layers: $64 \times 2 \times 8192 \times 16 \times 2$ bytes = 32 MB per adapter.
- 100 adapters: 100 × 32 MB = 3.2 GB → fits in VRAM alongside the base model.
- Adapter swapping (if not using SGMV): Loading a 32 MB adapter from CPU to GPU over PCIe Gen5 (64 GB/s) = 0.5ms. But if you swap per-request at 1,000 QPS, that's 500ms/s of PCIe bandwidth: 50% of a PCIe link spent on nothing but adapter swapping.
- With SGMV batching: All 100 adapters stay resident in VRAM. A batch of 32 requests (potentially 32 different adapters) runs the base model forward pass once, then a single SGMV kernel applies all 32 adapter corrections in ~0.1ms. No swapping, no PCIe bottleneck.
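A dense NumPy emulation of the SGMV idea: one shared base matmul for the whole batch, plus per-request low-rank corrections gathered by adapter id (a real SGMV kernel fuses the gather on the GPU; the shapes here are illustrative):

```python
import numpy as np

d, rank, n_adapters, batch = 512, 16, 4, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))                    # shared base weights
A = rng.standard_normal((n_adapters, rank, d))     # per-adapter A (rank x d)
B = rng.standard_normal((n_adapters, d, rank))     # per-adapter B (d x rank)

x = rng.standard_normal((batch, d))                # one activation row per request
adapter_ids = rng.integers(0, n_adapters, size=batch)  # mixed adapters, one batch

base = x @ W.T                                     # ONE matmul for everyone
# Per-request correction: y_i += B[a_i] @ (A[a_i] @ x_i), gathered by adapter id.
corr = np.einsum("bdr,brk,bk->bd", B[adapter_ids], A[adapter_ids], x)
y = base + corr
print(y.shape)   # (8, 512): batched across different adapters, no swapping
```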
Deep Dive: Model Optimizations
</details>Interviewer: "Our LLM endpoint on Kubernetes has a scale-to-zero policy to save money overnight. But the first morning request takes 45 seconds. The SRE team proposes keeping 1 warm replica 24/7. The finance team says that wastes $2,500/month. Who's right? Show me the break-even math."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Always keep warm replicas; cold starts are unacceptable." This ignores workloads with genuinely sparse traffic, where paying for 24/7 idle GPUs is pure waste.
Realistic Solution: The answer depends on the traffic pattern. For bursty workloads with predictable schedules (e.g., business-hours traffic), use scheduled scaling: spin up replicas 15 minutes before peak and scale to zero overnight. For unpredictable but sparse traffic, use a tiered approach: keep a small quantized model (7B on CPU or A10G) as a "warm fallback" while the full model cold-starts, then route to the full model once it's ready.
Napkin Math: 70B model on 2× H100:
- Cold start breakdown: Container pull (5s) + weight download from S3 (140 GB / 10 GB/s = 14s) + CUDA context init (3s) + model load to GPU (140 GB / 64 GB/s PCIe = 2.2s) + warmup inference (2s) = ~26 seconds.
- 1 warm replica 24/7: 2 × $3.50/hr × 730 hr = $5,110/month.
- Scale-to-zero, 50 cold starts/day: 50 × 26s = 1,300s of user-facing delay/day. If each delayed request costs $0.50 in user churn: 50 × $0.50 × 30 = $750/month in churn cost.
- Break-even: Warm replica costs $5,110 vs cold-start churn costs $750. Scale-to-zero wins unless cold-start churn exceeds ~$170/day (340 impacted requests/day).
- Hybrid: Keep 1 A10G ($548/month) running a 7B fallback model. Cold-start the H100s on first request. Users get a fast (if lower-quality) response in 200ms while the full model loads. Total cost: $548/month + near-zero churn.
Deep Dive: Model Serving
</details>Interviewer: "Our vLLM scheduler pre-allocates KV-cache memory for each request based on max_tokens. Most requests use only 10% of their allocation, wasting 90% of VRAM. But if we reduce max_tokens, long requests get preempted mid-generation. How do you solve this without wasting memory or killing long requests?"
Common Mistake: "Set max_tokens to the average output length." This causes half of all requests to be preempted, which is worse than over-allocation because preemption means re-computing the entire KV-cache from scratch.
Realistic Solution: Predict the output length per-request using a lightweight classifier (or heuristic based on prompt structure), then allocate KV-cache accordingly with a safety margin. Requests that exceed their prediction are handled via PagedAttention's dynamic allocation โ pages are added on-demand rather than pre-allocated. The predictor doesn't need to be perfect; even a coarse bucketing (short: <50 tokens, medium: 50โ500, long: >500) reduces waste by 60โ70%.
Napkin Math: 70B model on H100-80GB, 50 concurrent requests:
- Over-allocation (max_tokens=4096): KV-cache per request = 4,096 × 640 KB = 2.6 GB. 50 requests = 130 GB, which exceeds 80 GB of VRAM. You can only serve 30 concurrent requests.
- Tight allocation (max_tokens=200): KV-cache per request = 200 × 640 KB = 128 MB. 50 requests = 6.4 GB. Fits easily, but 15% of requests exceed 200 tokens and get preempted.
- Preemption cost: Re-prefilling a 2,000-token prompt = 283ms of wasted compute per preemption. At 150 preemptions/minute: 150 × 283ms = 42.5 GPU-seconds/minute wasted.
- With prediction + PagedAttention: Allocate the predicted length + 20% buffer. Average allocation = 300 tokens × 640 KB = 192 MB. 50 requests = 9.6 GB. Requests that exceed their prediction grow dynamically via page allocation. Net: 3× more concurrent requests than over-allocation, and the preemption rate drops from 15% to <1%.
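A sketch of coarse bucketing feeding the initial allocation (the keyword heuristics and bucket sizes are invented for illustration):

```python
def predict_output_bucket(prompt: str, max_tokens_requested: int) -> int:
    """Coarse output-length prediction; returns a token budget, not a hard cap."""
    p = prompt.lower()
    if "yes or no" in p or "classify" in p:
        return 50                                  # short bucket
    if "summarize" in p or "explain" in p:
        return 500                                 # medium bucket
    return min(max_tokens_requested, 2_000)        # long bucket

def initial_kv_budget_tokens(prompt: str, max_tokens_requested: int,
                             safety: float = 1.2) -> int:
    # Predicted length + 20% buffer; PagedAttention grows it on demand,
    # so a miss means extra page allocations, not a preemption.
    return int(predict_output_bucket(prompt, max_tokens_requested) * safety)

print(initial_kv_budget_tokens("Classify this ticket: ...", 4096))     # -> 60
print(initial_kv_budget_tokens("Summarize this contract: ...", 4096))  # -> 600
```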
Deep Dive: Model Serving
</details>Interviewer: "We're deploying a RAG pipeline: text preprocessing (tokenization + embedding lookup) → retrieval model → LLM generation → JSON postprocessing. An engineer wants to chain these as a Triton Inference Server ensemble. Another says 'just put it all in one Python process.' Compare the latency and throughput of both approaches."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "The ensemble adds network hops between stages, so the monolithic approach is always faster." Triton ensembles use shared-memory IPC, not network calls, and the pipeline parallelism enables higher throughput.
Realistic Solution: A Triton ensemble chains models as a DAG within a single server process, passing tensors via shared memory (zero-copy on GPU). The latency for a single request is slightly higher than monolithic (~1-2ms of scheduling overhead per stage), but throughput is dramatically higher because stages run as a pipeline: while the LLM generates tokens for request N, the preprocessor handles request N+1. The monolithic approach serializes everything, leaving the GPU idle during CPU-bound pre/postprocessing.
Napkin Math: RAG pipeline stages:
- Preprocessing (CPU): 5ms
- Embedding retrieval (GPU): 3ms
- LLM generation (GPU): 200ms
- Postprocessing (CPU): 2ms
Monolithic (serial): Total = 5 + 3 + 200 + 2 = 210ms/request. Throughput = 1000/210 = 4.8 req/s. GPU utilization = 203/210 = 97% (looks good for single request, but CPU stages block the pipeline).
Triton ensemble (pipelined): Single-request latency = 210 + 3ms scheduling overhead = 213ms. But pipeline throughput: bottleneck = the LLM at 200ms → 5.0 req/s per pipeline. With 4 LLM instances: CPU stages overlap with GPU, throughput = ~19 req/s. The monolithic approach can't overlap CPU and GPU stages, capping at 4.8 req/s regardless of GPU count.
Deep Dive: Model Serving
</details>Interviewer: "You're deploying Llama-3 70B on a single RTX 4090 (24 GB VRAM) using llama.cpp. You need to choose between Q4_K_M, Q5_K_M, and Q8_0 quantization. Walk me through the model size, throughput, and quality trade-offs for each, and tell me which ones actually fit."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Q4 is always the best choice because it's the smallest and fastest." Q4_K_M is the only variant that's even feasible on 24 GB (and only with partial CPU offload), and it has measurable quality degradation on reasoning tasks. Q8_0 doesn't fit at all. The right choice depends on the use case.
Realistic Solution: GGUF quantization uses mixed-precision: important layers (attention) get higher precision, less important layers (FFN) get lower. The "K" variants use k-quant methods that minimize perplexity loss. You must check: (1) does it fit in VRAM (model + KV-cache + overhead), (2) what's the tokens/sec at your target context length, and (3) is the perplexity acceptable for your task.
Napkin Math: Llama-3 70B (FP16 = 140 GB):
- Q4_K_M (~4.5 bits/param): 70B × 4.5/8 = 39.4 GB. Doesn't fit in 24 GB of VRAM alone, but with partial GPU offload (24 GB on GPU + ~16 GB in system RAM): ~20 tokens/sec decode. Perplexity: +0.25 vs FP16 (5.2 → 5.45). Usable for chat, weak on math.
- Q5_K_M (~5.5 bits/param): 70B × 5.5/8 = 48.1 GB. Requires heavy CPU offloading. ~12 tokens/sec with partial offload. Perplexity: +0.12 vs FP16 (5.2 → 5.32). Better quality, slower.
- Q8_0 (~8 bits/param): 70B × 8/8 = 70 GB. Doesn't fit even with offloading on a consumer machine; you'd need 2× 4090 or a single A100-80GB. ~15 tokens/sec on A100. Perplexity: +0.02 vs FP16 (near-lossless).
For a single 4090: Q4_K_M is the only viable option for interactive use. Q5_K_M works if you accept ~12 tok/s. Q8_0 is physically impossible. Reality check: Even Q4_K_M at 39.4 GB needs ~15 GB offloaded to CPU RAM, and the KV-cache at 2K context adds ~2 GB. You're running at the edge of what's possible.
Deep Dive: Model Optimizations
</details>Interviewer: "Our LLM API has a 2-second P99 SLA. Last Tuesday, a single user sent a 32,000-token prompt that took 10 seconds to prefill. In the next 30 seconds, 100 requests violated their SLA. Explain the cascade mechanism and design a system that prevents one bad request from taking down the fleet."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Rate-limit users who send long prompts." Rate limiting prevents abuse but doesn't prevent the cascade once a long request is admitted; the damage happens during prefill, before you can react.
Realistic Solution: The cascade works through three mechanisms: (1) Prefill starvation: The 10-second prefill monopolizes GPU compute, blocking all decode iterations for concurrent requests. (2) Queue buildup: During the 10-second stall, new requests accumulate in the queue at the arrival rate. (3) Batch bloat: When the prefill finishes, the scheduler tries to drain the queue by admitting many requests at once, causing KV-cache memory pressure and further slowdowns. The fix requires admission control at multiple levels: max prompt length enforcement, chunked prefill (split long prefills into 512-token chunks interleaved with decode iterations), and request-level timeout with graceful preemption.
Napkin Math: Normal operation: 100 QPS, 200ms TTFT, 30ms TPOT, 100-token avg response = 3.2s avg latency. P99 < 2s for short requests.
- The cascade: a 32K-token prefill on a 70B model = 2 × 70B × 32,000 = 4,480 TFLOPs. On an H100 (990 TFLOPS): 4.5 seconds of pure compute. During this time, 100 QPS × 4.5s = 450 requests queue up.
- Drain phase: 450 queued requests all start prefill. Even with continuous batching, prefilling 450 requests (avg 500 tokens each) takes 450 × 500 × 140 GFLOPs/token / 990 TFLOPS ≈ 32 seconds to clear the backlog. Every request admitted during the drain phase also violates its SLA.
- With chunked prefill (512-token chunks): The 32K prompt is split into 63 chunks. Each chunk takes 72ms. Between chunks, the scheduler runs one decode iteration for all active requests (~5ms). Total prefill time increases to 63 × (72 + 5) = 4.9s, but no decode request is stalled for more than 72ms. Zero cascade.
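A toy calculation of the chunked-prefill schedule, using the chunk timings from the napkin math above (pure arithmetic, no engine involved):

```python
def chunked_prefill_schedule(prompt_tokens: int, chunk: int = 512,
                             chunk_ms: float = 72.0, decode_ms: float = 5.0):
    """Interleave prefill chunks with one decode iteration per chunk."""
    n_chunks = -(-prompt_tokens // chunk)      # ceil division
    max_decode_stall_ms = chunk_ms             # decoders wait one chunk, at most
    total_prefill_ms = n_chunks * (chunk_ms + decode_ms)
    return n_chunks, max_decode_stall_ms, total_prefill_ms

chunks, stall, total = chunked_prefill_schedule(32_000)
print(f"{chunks} chunks, worst decode stall {stall:.0f} ms, "
      f"prefill completes in {total / 1000:.1f} s")   # 63 chunks, 72 ms, ~4.9 s
```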
Deep Dive: Model Serving
</details>Interviewer: "We're deploying a 335M parameter embedding model (like BGE-large) for our search engine. Unlike LLMs, there's no autoregressive decode; it's a single forward pass. The team sized the GPU as if it were an LLM workload and provisioned an H100. Is that the right hardware? What's the optimal batch size?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Embedding models are small, so any GPU works; just use the cheapest one." Embedding models are compute-bound (not memory-bandwidth bound like LLM decode), so the optimization strategy is completely different. An H100 is overkill for a single replica, but that's the wrong framing; you should be asking about throughput per dollar.
Realistic Solution: Embedding inference is a single forward pass with no autoregressive loop. The workload is compute-bound at moderate batch sizes because the full sequence is processed in parallel (like LLM prefill, not decode). This means: (1) batch size directly increases throughput with near-linear scaling until you hit compute saturation, (2) an A10G ($0.75/hr) at high batch sizes can match H100 throughput-per-dollar because you're not paying for unused HBM bandwidth, and (3) CPU inference (with ONNX Runtime) is viable for low-QPS workloads.
Napkin Math: BGE-large (335M params, FP16 = 670 MB):
- H100 (990 TFLOPS, $3.50/hr): At BS=1, 512-token input: 2 × 335M × 512 = 343 GFLOPs. Latency = 343G / 990T = 0.35ms. Throughput = 2,857 embeddings/sec. Cost = $3.50 / (2,857 × 3,600) = $0.00000034/embedding.
- A10G (125 TFLOPS, $0.75/hr): Same workload: latency = 343G / 125T = 2.7ms at BS=1. At BS=64: 64 × 343G = 22 TFLOPs, latency = 22T / 125T = 176ms for 64 embeddings = 2.75ms/embedding. Throughput = 364 embeddings/sec. Cost = $0.75 / (364 × 3,600) = $0.00000057/embedding.
- The H100 is 1.7× cheaper per embedding but costs 4.7× more per hour. The A10G wins if you need <400 embeddings/sec; the H100 wins at >1,000 embeddings/sec.
- Optimal batch size on the A10G: Throughput plateaus around BS=128-256, where compute utilization hits ~80%. Beyond that, latency increases linearly with no throughput gain.
Deep Dive: Benchmarking
</details>Interviewer: "We're serving a vision-language model (LLaVA-1.5 13B) that takes an image + text prompt and generates text. Users are uploading 4K images (3840×2160) and complaining about 8-second TTFT. The text-only version of the same model has 200ms TTFT. Where are the 7.8 extra seconds going?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "The image encoder is slow; use a smaller vision model." The vision encoder (ViT) is fast; the problem is that high-resolution images produce thousands of visual tokens that dominate the prefill cost.
Realistic Solution: VLMs convert images into visual tokens via a vision encoder (typically ViT), then concatenate these with text tokens for the LLM. A 4K image at the default patch size (14ร14 pixels) produces $(3840/14) \times (2160/14) \approx 274 \times 154 = 42,196$ visual tokens. The LLM must prefill all of these โ the same as a 42K-token text prompt. The vision encoder itself takes ~50ms; the remaining 7.75 seconds is pure LLM prefill over 42K visual tokens. The fix: downsample images, use adaptive resolution (only high-res where needed), or use a vision encoder with larger patch sizes.
Napkin Math: LLaVA-1.5 13B on A100-80GB:
- Vision encoder (ViT-L/14): 304M params. Forward pass on 4K image: ~50ms.
- Visual tokens from a 4K image: (3840/14) × (2160/14) ≈ 42,196 tokens.
- LLM prefill for 42K tokens: 2 × 13B × 42,196 = 1,097 TFLOPs. On A100 (312 TFLOPS): 1,097/312 = 3.5 seconds. Plus KV-cache allocation for 42K tokens: 42,196 × 131 KB = 5.5 GB per request.
- At 768×768 resolution: (768/14)² ≈ 3,025 tokens. Prefill: 2 × 13B × 3,025 = 78.7 TFLOPs / 312 = 252ms. KV-cache: 397 MB. 14× faster TTFT.
- Trade-off: Downsampling to 768×768 loses fine-grained detail (you can't read small text in documents). Adaptive resolution (high-res crops only for regions of interest) balances quality and speed.
Deep Dive: Model Serving
</details>Interviewer: "Our API must return valid JSON for downstream systems. We're using constrained decoding (grammar-guided generation) to guarantee valid JSON output. The team reports that constrained decoding adds 15ms per token on top of the normal 30ms TPOT. For a 500-token JSON response, that's 7.5 extra seconds. Is this overhead acceptable, and can we reduce it?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Constrained decoding is always worth the overhead because it guarantees valid output." For long structured outputs, the cumulative overhead can exceed the cost of simply retrying on parse failure.
Realistic Solution: Constrained decoding works by masking invalid tokens at each step according to a grammar (e.g., a JSON CFG). The overhead comes from: (1) computing the valid-token mask by traversing the grammar state machine against the full vocabulary (~32K-128K tokens), and (2) applying the mask before sampling. The 15ms/token overhead is typical for naive implementations. Optimizations include pre-computing grammar state transitions into a lookup table, using a compressed vocabulary trie, or speculative grammar checking (validate only the top-K tokens instead of the full vocabulary). Alternatively, for simple schemas, use structured output mode with retry: generate unconstrained, parse, retry on failure; cheaper if the failure rate is <10%.
Napkin Math: 500-token JSON response on H100:
- Unconstrained: 500 × 30ms = 15 seconds.
- Constrained (naive): 500 × (30 + 15)ms = 22.5 seconds. 50% overhead.
- Constrained (optimized, pre-computed masks): 500 × (30 + 2)ms = 16 seconds. 7% overhead.
- Retry strategy: If 5% of unconstrained outputs are invalid JSON, the expected cost = 15s × 1.05 = 15.75 seconds on average. Cheaper than even optimized constrained decoding, but 5% of requests take 30s (2 attempts).
- Break-even: Constrained decoding wins when the JSON failure rate exceeds the overhead. With optimized masks (7% overhead), constrained wins if the failure rate > 7%. With naive masks (50% overhead), retry wins unless the failure rate > 50%.
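A sketch of the optimized path: a precomputed per-state valid-token mask applied as a logits filter (the two-state "grammar" is a stand-in for a real JSON automaton):

```python
import numpy as np

VOCAB = 32_000
# Precomputed offline: for each grammar state, a boolean mask over the vocabulary.
# A real JSON grammar has many states; two stand-in states keep the sketch small.
MASKS = {
    "expect_key": np.zeros(VOCAB, dtype=bool),
    "expect_value": np.zeros(VOCAB, dtype=bool),
}
MASKS["expect_key"][:100] = True        # pretend tokens 0-99 start a quoted key
MASKS["expect_value"][100:200] = True   # pretend tokens 100-199 start a value

def constrained_sample(logits: np.ndarray, state: str) -> int:
    # O(V) table lookup instead of re-walking the grammar: ~2ms vs ~15ms above.
    masked = np.where(MASKS[state], logits, -np.inf)
    return int(np.argmax(masked))       # greedy sampling, for the sketch

logits = np.random.default_rng(0).standard_normal(VOCAB)
print(constrained_sample(logits, "expect_key"))   # always a token < 100
```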
Deep Dive: Model Serving
</details>Interviewer: "We want to A/B test our current 13B model against a new 70B model in shadow mode. The product team says 'just mirror 100% of traffic to both models for a week to gather data.' Why is this statistically unnecessary but infrastructure-ruinous, and how does the GPU memory asymmetry between these models dictate your A/B testing architecture?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Just spin up enough 70B instances to handle the mirrored traffic." This ignores the massive non-linear cost of serving a 70B model and the fact that statistical significance requires a fraction of that traffic.
Realistic Solution: The infrastructure cost of an A/B test is dominated by hardware provisioning, not the statistical sample size. A 13B model fits on a single A100 (40GB) with room for KV-cache. A 70B model requires tensor parallelism across at least 4× A100s (40GB) or 2× H100s just to fit the weights, plus massive KV-cache overhead. Mirroring 100% of traffic means provisioning a 70B cluster that is 5-8× larger and more expensive than your production 13B cluster, just for a test. Instead, use an asymmetric traffic split (e.g., 95/5) or sample-based shadow routing. You only need to route enough traffic to the 70B model to reach statistical significance, which is typically a few thousand requests.
Napkin Math:
- Memory Asymmetry: 13B FP16 = 26GB weights; fits on 1× A100 (40GB) with 14GB left for KV-cache. 70B FP16 = 140GB weights; requires 2× H100 (80GB) or 4× A100 (40GB). The 70B model needs 4× the GPUs per replica.
- Traffic Mirroring Cost: If production is 1,000 QPS on 50× A100s, mirroring 100% to the 70B would require ~200-300 GPUs due to lower throughput and a higher memory footprint: well over $100k/week in GPU spend.
- Statistical Reality: To detect a 3% quality improvement with 80% power, you need ~3,200 samples per arm. At 1,000 QPS, you collect 3,200 samples in 3.2 seconds.
- Optimized Architecture: Route 99.9% of traffic to the 13B, and 0.1% (1 QPS) to a single minimal 70B deployment (2× H100s). You gather the required 3,200 samples in under an hour, costing ~$10 in GPU compute instead of six figures.
Deep Dive: Benchmarking
</details>Interviewer: "We just deployed a new model version. The first inference request takes 12 seconds, but subsequent requests take 200ms. Our health check passed (model loaded successfully), so why is the first real request 60× slower? And how many warm-up requests do we need before routing production traffic?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "The model needs to 'warm up' its internal caches like a CPU cache." GPU inference doesn't have the same cache-warming semantics as CPU workloads. The slowdown is from one-time initialization costs that happen on the first forward pass, not cache misses.
Realistic Solution: The first-request penalty comes from three sources: (1) CUDA kernel JIT compilation: PyTorch/TensorRT compiles and caches GPU kernels on first use. Different input shapes trigger different kernels. (2) CUDA context and memory pool initialization: The first cudaMalloc initializes the memory allocator, and the first kernel launch initializes the CUDA context on each GPU. (3) cuBLAS handle and workspace allocation: The first GEMM operation allocates cuBLAS workspace buffers. After the first request, all of these are cached. You need warm-up requests that cover all expected input shapes (different sequence lengths, batch sizes) to trigger compilation of all kernel variants.
Napkin Math: First-request breakdown on H100 with a 13B model:
- CUDA context init: ~2 seconds (one-time per GPU).
- cuBLAS workspace allocation: ~500ms.
- Kernel JIT compilation (first unique shape): ~3-8 seconds depending on model complexity and whether you use torch.compile or TensorRT.
- Actual inference: 200ms.
- Total first request: ~6-11 seconds.
Warm-up strategy: Send 3-5 requests with representative input shapes:
- Short input (128 tokens) → triggers short-sequence kernels.
- Medium input (1024 tokens) → triggers medium-sequence kernels.
- Long input (4096 tokens) → triggers long-sequence kernels.
- Batch of 8 short inputs → triggers batched kernels. Total warm-up time: ~15-30 seconds. After this, all kernel variants are compiled and cached. Add warm-up to the readiness probe, not the liveness probe: the pod is alive, but not ready until warm-up completes.
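A sketch of the warm-up hook; `model_infer`, the shape list, and the `READY` flag are placeholders for whatever your server exposes:

```python
import itertools

READY = False

def model_infer(batch_size: int, seq_len: int) -> None:
    # Stand-in for a real forward pass; the first call per shape pays JIT cost.
    pass

def warm_up() -> None:
    """Run representative shapes so every kernel variant compiles before traffic."""
    global READY
    seq_lens = [128, 1024, 4096]    # short / medium / long sequence kernels
    batch_sizes = [1, 8]            # single and batched kernels
    for bs, sl in itertools.product(batch_sizes, seq_lens):
        model_infer(bs, sl)
    READY = True                    # flip the readiness probe, not liveness

warm_up()
print("ready:", READY)
```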
Deep Dive: Model Serving
</details>Interviewer: "Our LLM API rate-limits users to 100 requests per minute. A power user sends 100 requests, each with a 50,000-token prompt and requesting 4,000 output tokens. A casual user sends 100 requests with 100-token prompts and 50 output tokens. Both hit the same rate limit, but one costs nearly 100× more GPU time. How do you design a fair rate limiter?"
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Request-based rate limiting is fine: all requests are equal." This is the fundamental error. A request with 50K input tokens consumes ~500× more prefill compute and ~80× more decode compute than a 100-token request. Request-count rate limiting lets heavy users be subsidized by light users under the same limits.
Realistic Solution: Implement token-based rate limiting with separate budgets for input tokens (prefill cost) and output tokens (decode cost). Each user gets a token budget per time window (e.g., 1M input tokens + 100K output tokens per minute). This aligns rate limiting with actual resource consumption. Weight input and output tokens differently because prefill is compute-bound (cheaper per token at high batch sizes) while decode is memory-bandwidth-bound (expensive per token regardless of batch size).
Napkin Math: 70B model on H100:
- Prefill cost per token: 2 × 70B = 140 GFLOPs per token. At 990 TFLOPS: 0.14ms/token.
- Decode cost per token: Load 140 GB of weights from HBM. At 3.35 TB/s: 42ms/token (at BS=1).
- Power user (50K input + 4K output per request, 100 requests):
  - Prefill: 100 × 50,000 × 0.14ms = 700 seconds of compute.
  - Decode: 100 × 4,000 × 42ms = 16,800 seconds of compute (at BS=1; batching helps).
- Casual user (100 input + 50 output per request, 100 requests):
  - Prefill: 100 × 100 × 0.14ms = 1.4 seconds.
  - Decode: 100 × 50 × 42ms = 210 seconds.
- Cost ratio: The power user consumes (700 + 16,800) / (1.4 + 210) ≈ 83× more GPU time for the same number of requests.
- Token-based limit: Give both users 500K input tokens/min + 50K output tokens/min. The power user can send ~10 requests; the casual user can send all 100. Fair.
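A sketch of a dual-budget limiter with the budgets from the example (in-process bookkeeping only; a production limiter would live in Redis or the gateway):

```python
import time
from collections import defaultdict

BUDGETS = {"input": 500_000, "output": 50_000}   # tokens per 60s window

# (user, kind) -> [window_start, tokens_used]
_usage: dict[tuple[str, str], list] = defaultdict(lambda: [0.0, 0])

def admit(user: str, input_tokens: int, output_tokens: int, now=None) -> bool:
    """Admit only if BOTH the prefill and decode budgets have room."""
    now = time.monotonic() if now is None else now
    staged = {}
    for kind, need in (("input", input_tokens), ("output", output_tokens)):
        start, used = _usage[(user, kind)]
        if now - start >= 60:                    # new window
            start, used = now, 0
        if used + need > BUDGETS[kind]:
            return False                         # reject before committing anything
        staged[(user, kind)] = [start, used + need]
    _usage.update(staged)                        # commit both budgets together
    return True

print(admit("power", 50_000, 4_000))   # ~10 such requests fit per minute
print(admit("casual", 100, 50))        # all 100 casual requests fit
```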
Deep Dive: Model Serving
</details>Interviewer: "We're choosing between AWS SageMaker Serverless (pay-per-invocation, scales to zero) and a dedicated ml.g5.2xlarge instance (1× A10G, $1.52/hr always-on) for serving a 7B model. Our traffic is bursty: 500 QPS during business hours (10 hrs/day), near-zero overnight. Find the break-even point and tell me which option wins."
<details> <summary><b>Reveal Answer</b></summary>Common Mistake: "Serverless always wins for bursty traffic because you don't pay for idle time." Serverless has per-invocation costs that scale linearly with QPS, while dedicated instances have fixed costs that become cheaper per-request at high utilization.
Realistic Solution: The break-even depends on three factors: (1) the per-invocation cost of serverless (compute time ร price per GB-second), (2) the cold-start penalty and its frequency, and (3) the utilization-adjusted cost of dedicated instances. For bursty workloads, the optimal strategy is often a hybrid: dedicated instances sized for the base load with serverless overflow for spikes.
Napkin Math:
- Dedicated (ml.g5.2xlarge, 1ร A10G): $1.52/hr ร 24 hr ร 30 days = $1,094/month. Throughput: ~50 tokens/sec for 7B model. At 500 QPS with 100-token responses: need 500 ร 100 / 50 = 1,000 instances during peak. That's $1,094 ร 1,000 = $1.09M/month. Clearly need fewer instances with batching โ at BS=32, throughput โ 800 tok/s per instance, need ~63 instances = $68,922/month.
- Serverless (SageMaker Serverless): ~$0.0001 per invocation + $0.00003/sec compute. Per request (200ms inference): $0.0001 + 0.2 × $0.00003 = $0.000106/request. At 500 QPS × 10 hr/day × 30 days: 500 × 36,000 × 30 = 540M requests. Cost = 540M × $0.000106 = $57,240/month. Plus cold starts: ~5% of requests hit a cold start (3s penalty), adding latency but no extra cost.
- Break-even: Serverless < dedicated when $0.000106 × QPS × 3,600 < $1.52/hr per instance. Per instance: QPS < 1.52 / (0.000106 × 3,600) = ~4 QPS per instance. Above ~4 QPS sustained per instance, dedicated wins.
- Hybrid: 40 dedicated instances for base load ($43,776/month) + serverless for overflow spikes. Saves ~20% vs pure dedicated during off-peak.
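The break-even line is a three-line calculation worth scripting so it can be re-run with real prices; the figures below are the assumed ones from this napkin math, not quoted AWS rates:

```python
# Assumed figures from the napkin math above, not quoted AWS prices.
DEDICATED_PER_HR = 1.52          # ml.g5.2xlarge, $/hr
SERVERLESS_PER_INVOKE = 0.0001   # $/invocation
SERVERLESS_PER_SEC = 0.00003     # $/sec of compute
INFERENCE_SEC = 0.2              # assumed per-request latency

def serverless_cost_per_request() -> float:
    return SERVERLESS_PER_INVOKE + INFERENCE_SEC * SERVERLESS_PER_SEC

def breakeven_qps() -> float:
    # Dedicated wins once hourly serverless spend exceeds the instance price.
    return DEDICATED_PER_HR / (serverless_cost_per_request() * 3600)

print(f"${serverless_cost_per_request():.6f}/request")      # $0.000106
print(f"break-even ~{breakeven_qps():.1f} QPS per instance") # ~4.0
```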
๐ Deep Dive: Model Serving
</details>Interviewer: "Our team is deploying a new image classification model as a microservice. The initial container image size is 5GB, and we're seeing unacceptable cold start latencies, especially when scaling up. What are the primary culprits for such a large image and how would you systematically reduce it?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "We need faster internet for image pulls." While network speed affects pull time, it doesn't address the underlying issue of bloated images or the impact on container runtime startup.
Realistic Solution: A large container image typically stems from including unnecessary dependencies, development tools, or a large base image. To systematically reduce it:
- Switch from large base images (e.g., ubuntu) to minimal images like alpine or distroless.
- Clean package-manager caches in the same layer they are created (apt-get clean, pip cache purge).
- Keep large model artifacts out of the image (pull them from object storage at startup, or mount them as a volume) so the image carries only code and runtime dependencies.
Napkin Math: Reducing a 5GB image to 500MB on a 1 Gbps network. Initial download time: 5 GB * 8 bits/byte / 1 Gbps = 40 seconds. Optimized download time: 0.5 GB * 8 bits/byte / 1 Gbps = 4 seconds. This significantly reduces cold start time from image pull alone, not accounting for container initialization and model loading.
Key Equation:
ContainerSize = BaseImageSize + DependenciesSize + ModelArtifactSize
๐ Deep Dive: Volume I: Containerization for ML Services
</details>Interviewer: "Our auto-scaling group for an inference service is occasionally failing to replace unhealthy instances. We've confirmed the infrastructure is fine, but the service becomes unresponsive even though the instance itself is still running. What's likely going wrong with our health check strategy and how would you fix it?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The auto-scaling group isn't configured correctly to terminate instances." While true for a misconfigured ASG, the problem describes an instance running but unresponsive, implying the health check isn't detecting the application-level failure.
Realistic Solution: The issue points to a health check that is too shallow, likely only verifying the OS or basic network connectivity (e.g., TCP port open), but not the application's readiness to serve requests.
- Implement a deep readiness probe (e.g., /healthz or /ready) that verifies the ML model is loaded into memory, all necessary pre-processing components are initialized, and the service can actually process a dummy request.
- Pair it with a lighter liveness probe (e.g., /livez) to confirm the application process is still running and not deadlocked. This typically involves less intensive checks than a readiness probe.
Napkin Math: If an instance takes 60 seconds to replace, and your health check period is 10 seconds with an unhealthy threshold of 3, it takes 3 * 10 = 30 seconds to detect and mark an instance as unhealthy. If the replacement process is slower than this, you'll have a gap. TimeToDetectFailure = UnhealthyThreshold * PeriodSeconds. TotalDowntime = TimeToDetectFailure + InstanceReplacementTime.
Key Equation:
HealthCheckSuccessCondition = ModelLoaded AND PreprocessingReady AND ServiceResponsive
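A minimal sketch of the two probes, assuming a FastAPI service; the route names follow the conventions above, and the stand-in model and dummy input are illustrative:

```python
import numpy as np
from fastapi import FastAPI, Response

app = FastAPI()
state = {"model": None}

@app.on_event("startup")
def load_model() -> None:
    # Stand-in for real model loading (e.g., deserializing weights from disk).
    state["model"] = lambda x: float(x.mean())

@app.get("/livez")
def livez() -> dict:
    # Liveness: cheap check that the process is up and not deadlocked.
    return {"status": "alive"}

@app.get("/ready")
def ready(response: Response) -> dict:
    # Readiness: the model must be loaded AND able to score a dummy request.
    model = state["model"]
    if model is None:
        response.status_code = 503
        return {"status": "loading"}
    try:
        model(np.zeros((1, 4), dtype=np.float32))
    except Exception:
        response.status_code = 503
        return {"status": "model_broken"}
    return {"status": "ready"}
```

Point the auto-scaling group's health check at /ready, so an instance whose process is alive but whose model never loaded gets replaced.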
๐ Deep Dive: Volume I: Robust Health Checks for ML Services
</details>Interviewer: "We've deployed a new version of our recommendation model using a rolling update, and while all technical metrics (latency, error rate) look good, our key business metrics (e.g., click-through rate, user engagement) have silently dropped by 5%. How would you have prevented this 'silent regression' and what deployment strategy would you advocate for future model updates?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "We need more comprehensive monitoring for system metrics." While important, system metrics (CPU, memory, network) wouldn't catch a subtle degradation in model quality that impacts user behavior without increasing technical errors.
Realistic Solution: A silent regression indicates a failure to validate model quality in a production environment before full rollout. The key is to use a canary deployment strategy combined with robust A/B testing and comprehensive business metric monitoring.
Napkin Math: If 5% of traffic is routed to the canary and the business metric drops by 5% for that group, the overall impact is 0.05 * 0.05 = 0.0025 (a 0.25% overall drop). This makes the regression detectable without materially harming the full user base. OverallMetricDrop = CanaryTrafficShare * CanaryMetricDrop.
Key Equation:
(Metric_Canary - Metric_Control) / Metric_Control < Threshold
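A tiny guardrail check implementing the key equation; the 2% threshold is an illustrative choice, and a real gate would also require statistical significance before acting:

```python
def canary_regressed(metric_canary: float, metric_control: float,
                     threshold: float = -0.02) -> bool:
    """True if the canary's relative change vs control breaches the guardrail.

    threshold=-0.02 means 'abort the rollout on a >2% relative drop' (illustrative).
    """
    relative_change = (metric_canary - metric_control) / metric_control
    return relative_change < threshold

# CTR 0.0475 on canary vs 0.050 on control: a 5% relative drop -> abort rollout.
assert canary_regressed(0.0475, 0.050)
```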
๐ Deep Dive: Volume I: Canary Releases and A/B Testing for ML
</details>Interviewer: "Our new recommendation model is performing well in terms of accuracy, but when deployed to a resource-constrained environment (e.g., mobile device, edge IoT gateway, or a serverless function with strict cold-start latency), it consistently breaches our 50ms inference latency budget. The model is a standard ResNet-like architecture. How would you approach optimizing it for this strict latency constraint without a complete re-architecture?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "We need to use a more powerful CPU/GPU." This might be true if the current hardware is severely underpowered, but often the solution lies in optimizing the model itself or its runtime rather than throwing more expensive hardware at the problem, especially in resource-constrained or serverless environments where costs scale.
Realistic Solution: For strict latency budgets on resource-constrained environments, model quantization is a highly effective technique to reduce model size and inference time with minimal accuracy loss.
Napkin Math: A typical FP32 model requires 4 bytes per parameter, while an INT8 model needs only 1 byte. This reduces the memory footprint by 4x and can offer significant speedups (2-4x) on hardware with INT8 support. For a 100MB FP32 model, converting to INT8 reduces it to 25MB. If inference time halves, a 100ms FP32 inference becomes a 50ms INT8 inference.
Key Equation:
InferenceTime_Quantized ≈ InferenceTime_FP32 / (QuantizationSpeedupFactor)
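A minimal sketch with PyTorch post-training dynamic quantization. The stand-in MLP is illustrative; a conv-heavy ResNet would normally use static quantization with a calibration pass, but dynamic quantization is the one-liner variant that shows the size win:

```python
import io
import torch
import torch.nn as nn

# Stand-in model (for a ResNet you would use static PTQ with calibration).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # weights stored as INT8
)

def size_mb(m: nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {size_mb(model):.2f} MB, INT8: {size_mb(quantized):.2f} MB")  # ~4x smaller
```

The latency speedup, unlike the size reduction, only materializes on hardware and kernels with real INT8 support, hence the hedged 2-4x above.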
๐ Deep Dive: Volume I: Model Quantization for Efficient Inference
</details>Interviewer: "You're responsible for deploying a large language model (LLM) on a GPU cluster for real-time inference. Under high concurrent user requests, you observe frequent Out-Of-Memory (OOM) errors on the GPUs, despite the total memory of the model weights fitting comfortably. What is the most likely cause of these OOMs, and what advanced technique would you implement to mitigate this?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The model weights are too large, we need model parallelism." While true for extremely large models, the prompt states weights fit comfortably. The OOMs under concurrent requests point to a per-request memory overhead that scales with the number of users, rather than just the model size itself.
Realistic Solution: The most likely cause is the Key-Value (KV) Cache for attention mechanisms. During auto-regressive decoding, each generated token requires re-computing attention over all previous tokens. To optimize this, the Keys and Values of the attention mechanism for previous tokens are cached in GPU memory. This KV cache grows with the sequence length for each concurrent request. Under high concurrency, the sum of all individual KV caches quickly exhausts GPU memory.
The advanced technique to mitigate this is PagedAttention (as implemented in vLLM) or similar KV-cache management strategies: instead of reserving one contiguous region per request sized for the maximum possible sequence length, the KV-cache is split into small fixed-size blocks that are allocated on demand and need not be contiguous in memory. This eliminates over-reservation and external fragmentation, letting far more concurrent requests share the same VRAM.
Napkin Math: For a 7B LLM (e.g., Llama-2 with 32 layers, 32 heads, 128 head dim, FP16), the KV cache size per token is 32 layers * 2 (K+V) * 32 heads * 128 dim * 2 bytes (FP16) = 524 KB. If each request has an average sequence length of 1024 tokens, one request consumes 524 KB/token * 1024 tokens ≈ 0.5 GB. With 16 concurrent requests, this is 16 * 0.5 GB = 8 GB of KV cache, quickly saturating a 16GB GPU on top of model weights. PagedAttention can reduce this by 2-4x for typical workloads.
Key Equation:
KV_Cache_Memory_Per_Request = SequenceLength * NumLayers * 2 * NumHeads * HeadDim * BytesPerFloat
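The key equation as a calculator, with the 7B-model defaults from the napkin math above:

```python
def kv_cache_bytes_per_request(
    seq_len: int,
    num_layers: int = 32,
    num_kv_heads: int = 32,    # MHA: equal to the number of attention heads
    head_dim: int = 128,
    bytes_per_float: int = 2,  # FP16
) -> int:
    # KV_Cache_Memory_Per_Request = SeqLen * Layers * 2 (K+V) * Heads * HeadDim * Bytes
    return seq_len * num_layers * 2 * num_kv_heads * head_dim * bytes_per_float

per_req = kv_cache_bytes_per_request(1024)
print(f"{per_req / 2**30:.2f} GiB per 1024-token request")  # ~0.5 GiB
print(f"{16 * per_req / 2**30:.1f} GiB for 16 concurrent")  # ~8 GiB
```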
๐ Deep Dive: Volume I: KV-Cache Optimization for LLMs
</details>Interviewer: "Your team is exploring serverless functions (e.g., AWS Lambda, Google Cloud Functions) for an on-demand, low-volume ML inference endpoint to minimize operational overhead and cost. However, initial tests reveal significant 'cold start' latencies, sometimes exceeding 5 seconds, which is unacceptable for the user experience. What are the main contributors to serverless cold starts for ML workloads, and what strategies would you employ to bring them within an acceptable range?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Serverless is just too slow for ML, we need dedicated instances." While dedicated instances avoid cold starts, this misses the point of optimizing serverless, which offers significant cost and operational benefits for bursty or low-volume workloads if cold starts can be managed.
Realistic Solution: Serverless cold starts for ML workloads are primarily due to:
- Container image download: pulling a multi-GB image containing the ML framework onto a fresh worker.
- Runtime initialization: starting the language runtime and importing heavy libraries (Python, TF/PyTorch).
- Model loading: deserializing the model artifact from storage into memory.
Strategies to mitigate cold starts:
- Shrink the image: use minimal base images like alpine or distroless for minimal footprint and strip build-time dependencies.
- Keep instances warm: use provisioned concurrency (or scheduled keep-alive pings) so most requests hit a pre-warmed worker.
- Shrink the model artifact (quantization, pruning) so deserialization is fast.
Napkin Math: A typical serverless cold start:
- Image download (500MB at 1 Gbps): ~4 seconds
- Runtime init (Python, TF/PyTorch): ~1-2 seconds
- Model load (200MB model into memory): ~0.5-1 second
Total: ~5.5-7 seconds. With provisioned concurrency, this can drop to <100ms for subsequent requests, at the cost of continuous billing for the pre-warmed instances.
Key Equation:
ColdStartTime = ImageDownloadTime + RuntimeInitTime + ModelLoadTime
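One pattern worth showing in code is module-scope loading: only cold starts pay the model-load cost, while warm invocations reuse the already-initialized process. A minimal sketch assuming an AWS Lambda-style handler; the event shape and model stand-in are illustrative:

```python
import json
import time

# Module scope runs once per container, so only cold starts pay this cost;
# warm invocations reuse the already-loaded model.
_START = time.monotonic()

def _load_model():
    time.sleep(0.8)  # stand-in for deserializing a ~200MB artifact
    return lambda xs: sum(xs) / len(xs)

MODEL = _load_model()
print(f"cold init took {time.monotonic() - _START:.2f}s")

def handler(event, context):
    # AWS Lambda entrypoint convention; the 'features' key is illustrative.
    features = json.loads(event["body"])["features"]
    return {"statusCode": 200, "body": json.dumps({"score": MODEL(features)})}
```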
๐ Deep Dive: Volume I: Serverless Inference Challenges and Solutions
</details>Interviewer: "You are designing a low-latency real-time fraud detection system. The model relies on features like 'number of transactions in the last 5 minutes' and 'average transaction amount in the last hour.' Initially, these features were pre-computed in daily batch jobs and stored in a data warehouse. However, your team observes that the model's performance significantly degrades in the first few hours of each day, showing a 'staleness' effect. How would you redesign the feature serving architecture to ensure features are fresh and consistent for real-time inference?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "We need to run the batch jobs more frequently, e.g., hourly." While more frequent batching reduces staleness, it doesn't eliminate it and introduces increasing computational overhead. It also doesn't solve for features that truly need second-level freshness.
Realistic Solution: The problem highlights the mismatch between batch-generated features and real-time inference needs. The solution requires a real-time feature store and a stream processing pipeline to calculate and serve fresh features.
Redesign Steps:
1. Publish transaction events to a stream (e.g., a message bus like Kafka) as they occur.
2. Compute the windowed aggregates ("transactions in the last 5 minutes", "average amount in the last hour") with a stream processor instead of a daily batch job.
3. Write the results to an online feature store (a low-latency key-value serving layer) keyed by entity, e.g., account ID.
4. Have the inference service read from the online store at request time, keeping the batch pipeline only for backfilling training data so online and offline features stay consistent.
Napkin Math: If batch features update daily at 3 AM, a request at 4 AM uses features that are 1 hour old. A request at 2 AM the next day uses features that are nearly 23 hours old. With stream processing and a real-time feature store, feature staleness drops to (processing_latency + ingestion_latency), typically on the order of tens to hundreds of milliseconds; for example, a feature updated within 500ms of the event occurrence.
Key Equation:
FeatureStaleness = Max(EventToFeatureCalculationLatency, FeatureIngestionToStoreLatency)
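A toy sketch of the online side using Redis sorted sets as the feature store; the key naming and the choice of Redis are assumptions for illustration, and a production system would handle member collisions and use a managed feature store:

```python
import time
import redis  # assumes a reachable Redis acting as the online feature store

r = redis.Redis()

def record_transaction(account_id: str, amount: float) -> None:
    now = time.time()
    key = f"txn:{account_id}"
    # Score each event by timestamp so stale events can be trimmed by range.
    r.zadd(key, {f"{now}:{amount}": now})
    r.zremrangebyscore(key, 0, now - 3600)  # keep only the last hour

def txn_count_5min(account_id: str) -> int:
    now = time.time()
    return r.zcount(f"txn:{account_id}", now - 300, now)

def avg_amount_1h(account_id: str) -> float:
    members = r.zrangebyscore(f"txn:{account_id}", time.time() - 3600, time.time())
    amounts = [float(m.decode().split(":")[1]) for m in members]
    return sum(amounts) / len(amounts) if amounts else 0.0
```

Features are now fresh within the write path's latency instead of up to a day old.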
๐ Deep Dive: Volume I: Real-time Feature Stores for ML
</details>Interviewer: "You're optimizing an LLM inference service running on A100 GPUs. You've already implemented dynamic batching, where requests are grouped together until the GPU queue is full or a timeout is reached. However, under bursty traffic with highly variable input sequence lengths, you still observe significant GPU underutilization and high tail latencies. What is the fundamental limitation of standard dynamic batching in this scenario, and what advanced technique would you propose to maximize GPU throughput and reduce tail latency?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "We need to use larger batch sizes." While larger batches increase throughput for homogeneous requests, standard dynamic batching still struggles with variable lengths and bursty traffic, leading to idle GPU cycles and increased tail latency as the system waits for full batches.
Realistic Solution: The fundamental limitation of standard dynamic batching is that it processes requests sequentially within a batch and waits for all sequences in a batch to complete before moving to the next. For LLMs, sequences have variable lengths and generate tokens one by one. If one sequence in a batch finishes early, its allocated GPU memory and compute resources remain idle while other, longer sequences in the same batch continue decoding. This leads to memory fragmentation and GPU underutilization for the entire batch.
The advanced technique to address this is Continuous Batching (or Iteration-level Batching), popularized by systems like vLLM: scheduling decisions are made at every decoding iteration rather than once per batch. The moment a sequence emits its end-of-sequence token, it is evicted and a queued request takes its slot in the very next iteration, so the batch stays full no matter how variable the sequence lengths are.
This approach maximizes GPU utilization by keeping the GPU busy with useful work, significantly reducing the "bubble" time where the GPU is idle waiting for a batch to fill or for long sequences to finish.
Napkin Math: Consider a batch of 4 requests, with sequence lengths [100, 200, 300, 400] tokens.
- Standard Dynamic Batching: The GPU waits for the 400-token sequence to complete before starting the next batch. The resources for the 100, 200, 300-token sequences are idle after they complete.
- Continuous Batching: As soon as the 100-token sequence completes, its resources are freed, and a new incoming request (if available) can immediately start processing on those freed resources, interleaving with the remaining 200, 300, 400-token sequences. This can lead to 2-4x higher throughput and lower tail latency compared to standard dynamic batching.
Key Equation:
GPU_Utilization = (TotalActiveComputeTime / TotalTime) * 100%
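A toy simulation of the iteration-level idea (the request dicts and slot count are illustrative; real schedulers also track KV-cache blocks, not just slots):

```python
from collections import deque

def continuous_batching_step(active: list, waiting: deque, max_slots: int) -> None:
    """One iteration-level scheduling step. Each request is {'remaining': tokens}."""
    # 1. Every active sequence decodes exactly one token this iteration.
    for req in active:
        req["remaining"] -= 1
    # 2. Finished sequences are evicted immediately, mid-"batch".
    active[:] = [r for r in active if r["remaining"] > 0]
    # 3. Freed slots are refilled before the next iteration, so the GPU never
    #    idles waiting for the longest sequence in a batch to finish.
    while waiting and len(active) < max_slots:
        active.append(waiting.popleft())

active, waiting = [], deque({"remaining": n} for n in (100, 200, 300, 400, 10))
steps = 0
while active or waiting:
    continuous_batching_step(active, waiting, max_slots=4)
    steps += 1
print(steps)  # the 10-token request slips in as soon as the 100-token one exits
```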
๐ Deep Dive: Volume I: Continuous Batching for LLM Inference
</details>Interviewer: "To handle a massive spike in concurrent users for your LLM service, you configure vLLM to enable CPU swapping. If the GPU VRAM gets full, it moves inactive KV-cache blocks to the host CPU RAM over PCIe. The system handles 10x more users without OOMing, but the Time-Per-Output-Token (TPOT) randomly spikes from 40ms to 800ms. Why does swapping cause such a catastrophic latency cliff?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "CPU RAM is slower than GPU RAM." It is slower, but the CPU RAM speed isn't the bottleneck. The pipe connecting them is.
Realistic Solution: You hit the PCIe Bandwidth Wall.
When a request that was swapped to the CPU becomes active again (it's their turn to generate a token), the GPU cannot compute attention over data sitting in host RAM. The entire KV-cache for that user's sequence must be physically copied back over the PCIe bus into the GPU's HBM before the forward pass can execute.
If the user has an 8,000-token context window, their KV-cache is roughly 2 GB. A PCIe Gen4 x16 bus has a real-world bandwidth of roughly 25 GB/s. It takes ~80ms just to copy that single user's cache back to the GPU. If the scheduler decides to swap in 10 users simultaneously for the next batch, you saturate the PCIe bus, causing an 800ms pure I/O stall before any math happens.
The Fix: CPU swapping is a trap for latency-sensitive real-time generation. It should only be used for offline batch processing. For real-time serving, you must cap max_num_seqs to strictly fit within the physical HBM boundaries, or use prompt caching / context summarization.
Napkin Math: 2 GB KV cache. PCIe Gen4 = 25 GB/s. Transfer time = 2 / 25 = 80ms per user. Generating 1 token takes 10ms of math. You spent 8x more time moving the memory than doing the inference.
๐ Deep Dive: Volume I: Frameworks
</details>Interviewer: "Users submit 100,000-token documents to your summarization model. To avoid OOMing during the prefill phase, you implement 'Chunked Prefill' (breaking the document into 4,000-token blocks). The prefill now succeeds. But users notice that the Time-To-First-Token (TTFT) actually increased compared to when you ran the 100k prompt all at once on a bigger GPU. Why does chunking the prompt slow down the time to first token?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Chunking adds Python looping overhead." Python overhead is microseconds; the slowdown is seconds.
Realistic Solution: You destroyed the Parallel Math Advantage (Arithmetic Intensity).
Attention and feed-forward layers are matrix multiplications.
When you process 100,000 tokens in a single massive batch, you formulate a massive matrix multiplication [100k, hidden_dim] * [hidden_dim, hidden_dim]. GPUs are incredibly efficient at this. The arithmetic intensity is astronomical, pushing the GPU to its absolute theoretical peak TFLOPS (e.g., hitting 900+ TFLOPS on an H100).
When you chunk the prompt into 25 sequential blocks of 4,000 tokens, you force the GPU to do 25 separate, smaller matrix multiplications.
Instead of loading the weights once and doing massive math, you load the weights 25 times and do small math. You shifted the workload from compute-bound back toward memory-bandwidth-bound, ruining efficiency.
The Fix: Chunked prefill is a memory-saving compromise, not a speedup. To fix the speed, you must use Tensor Parallelism across multiple GPUs to keep the sequence contiguous, or interleave the chunks of different users (continuous batching) to restore the arithmetic intensity.
Napkin Math: Prefill compute is 2 × 70B × 100k tokens ≈ 14 PFLOPs; at 900 TFLOPS that is ~15.6s of math in either case. A single pass reads the 140 GB of weights once (~42ms at 3.3 TB/s), which vanishes behind the compute. Chunked into 25 passes, you read 140 GB twenty-five times = 3,500 GB of weight traffic (~1.1s of extra memory time), and each smaller pass runs at lower MFU, so the overhead lands directly on TTFT instead of hiding behind one saturated matmul.
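A sketch of that arithmetic; the MFU values for the two regimes are illustrative assumptions, and the serial compute-plus-IO model is a rough upper bound:

```python
# Assumed figures from the napkin math: 70B FP16 (140 GB of weights), H100-class numbers.
WEIGHTS_GB, HBM_GBPS, PEAK_TFLOPS = 140, 3300, 900

def prefill_seconds(prompt_tokens: int, chunk_tokens: int, mfu: float) -> float:
    chunks = -(-prompt_tokens // chunk_tokens)       # ceil division
    compute_s = 2 * 70e9 * prompt_tokens / (PEAK_TFLOPS * 1e12 * mfu)
    weight_io_s = chunks * WEIGHTS_GB / HBM_GBPS     # weights re-read every chunk
    return compute_s + weight_io_s                   # simplistic serial model

print(f"single pass: {prefill_seconds(100_000, 100_000, mfu=1.0):.1f}s")  # ~15.6s
print(f"25 x 4k    : {prefill_seconds(100_000, 4_000, mfu=0.7):.1f}s")    # ~23s
```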
๐ Deep Dive: Volume I: ML Systems
</details>Interviewer: "You are serving a multi-modal model (like LLaVA). A user uploads a 4K image and asks, 'What is this?' The image is encoded into 4,000 visual tokens by the Vision Encoder. The LLM then begins decoding text. During the text decode phase, the GPU utilization is at 4%. Your manager asks you to optimize the Vision Encoder to fix this. Why is the manager focusing on the wrong part of the stack?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "4,000 tokens is too many, the Vision Encoder needs to be compressed." While a smaller encoder helps, it doesn't solve why the text decode phase is running at 4% utilization.
Realistic Solution: The manager fails to understand the Autoregressive Decode Bottleneck.
During the decode phase, the model generates exactly one text token at a time. To generate that single token, the GPU must read the entire 70B parameter model from memory, plus the KV-cache of the 4,000 visual tokens.
The GPU is doing almost zero math (just a vector-matrix multiplication) but is forced to move massive amounts of data across the memory bus. The 4% utilization means the compute cores (Tensor Cores) are sitting idle waiting for the HBM memory controller.
Optimizing the Vision Encoder (which runs once during the prefill phase) will do absolutely nothing to speed up the generation of the 500 text tokens that follow.
The Fix: You must optimize the Decode phase: quantize the weights to shrink the bytes read per token, serve more concurrent users so each weight read is amortized across the whole batch, or use speculative decoding to emit multiple tokens per weight read.
Napkin Math: Generating 1 text token requires reading 140 GB (weights) + 2 GB (4k visual KV-cache). At 3 TB/s bandwidth, that takes ~47ms. You are doing ~140 GFLOPs of math in 47ms, which is 3 TFLOPS. 3 TFLOPS / 900 TFLOPS peak = 0.3% compute utilization. The GPU is functionally asleep.
๐ Deep Dive: Volume I: Model Serving
</details>Interviewer: "You switch your generation algorithm from Greedy Search to Beam Search with a beam width of 4 to get better translation quality. Suddenly, your inference server can only handle 1/4th the number of concurrent users before OOMing. Why does Beam Search destroy your concurrency scaling?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Beam search does 4 times the math, so it takes 4 times the CPU/GPU." It does take more compute, but OOM errors are about memory capacity, not math speed.
Realistic Solution: You caused a KV-Cache Explosion.
In Greedy Search, a single user request requires exactly one sequence in the KV-cache. When you enable Beam Search with a width of 4, the model explores 4 different possible sentence branches simultaneously.
To do this without recalculating the entire prompt every step, the serving framework must allocate 4 independent KV-cache sequences for that single user. A user who previously required 500 MB of VRAM for their context window now instantly requires 2.0 GB.
Because your VRAM is fixed, multiplying the memory footprint per user by 4 exactly divides your maximum concurrent user capacity by 4.
The Fix: For high-concurrency production serving, standard Beam Search is almost never used. Stick to Greedy Search, or use advanced frameworks that implement KV-Cache Sharing (like PagedAttention) where the initial prompt is shared across all 4 beams, and only the diverging generated tokens allocate new memory pages.
Napkin Math: Server has 40 GB free VRAM. Greedy User = 1 GB cache. Max concurrent = 40 users. Beam(4) User = 4 GB cache. Max concurrent = 10 users. You just quadrupled your hardware hosting costs for a slight bump in BLEU score.
๐ Deep Dive: Volume I: Frameworks
</details>Interviewer: "You need your LLM to output strict JSON. You implement a constrained decoding wrapper (like Guidance or Outlines) that forces the model to only select tokens valid under your JSON schema. The outputs are now perfect JSON, but the Time-Per-Output-Token (TPOT) doubled. What is the CPU doing that takes so much time?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The model has to think harder to generate JSON." The model is doing the exact same neural network math. The delay is outside the model.
Realistic Solution: You are paying the Vocabulary Masking Overhead.
Constrained decoding works by intercepting the logits (the probabilities for all 32,000+ words in the vocabulary) right before sampling. The CPU must evaluate the current state of the JSON string against a complex Regular Expression or Context-Free Grammar.
For every single token generated, the CPU must loop through all 32,000 vocabulary words, check if adding that word violates the JSON schema, and if so, force its probability to zero (masking).
If this masking logic is written in pure Python or uses inefficient regex state machines, checking 32,000 strings takes 20-30 milliseconds. If the GPU only takes 15ms to do the forward pass, you have effectively bottlenecked your trillion-dollar GPU behind a slow Python text-parsing script.
The Fix: Move the masking work out of the per-token hot path. Compile the JSON schema/regex into a finite-state machine over token IDs ahead of time (the approach taken by Outlines), so each step becomes a cached table lookup instead of a 32,000-string scan, and apply the mask as a vectorized tensor operation (implemented in Rust/C++, not pure Python) directly on the logits.
Napkin Math: GPU forward pass: 15ms. Logit transfer to CPU (32k floats): 1ms. Python Regex over 32k strings: 25ms. Total TPOT = 41ms. The CPU parsing text took nearly double the time it took the GPU to run 70 billion parameters.
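A minimal sketch of the precompiled-mask idea; the FSM size, state transitions, and allowed token IDs are illustrative stand-ins for a real grammar compiler:

```python
import torch

VOCAB = 32_000

# Precompile one boolean mask per grammar/FSM state offline, so the per-token
# step is a tensor index instead of a 32k-string regex scan in Python.
NUM_STATES = 128  # illustrative FSM size
ALLOWED = torch.zeros(NUM_STATES, VOCAB, dtype=torch.bool)
ALLOWED[0, [11, 42, 99]] = True  # e.g., tokens that may open a JSON object

def mask_logits(logits: torch.Tensor, state: int) -> torch.Tensor:
    # O(vocab) vectorized masking, microseconds instead of ~25ms of Python.
    return logits.masked_fill(~ALLOWED[state].to(logits.device), float("-inf"))

logits = torch.randn(VOCAB)
next_token = mask_logits(logits, state=0).argmax().item()
assert next_token in (11, 42, 99)
```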
๐ Deep Dive: Volume I: ML Systems
</details>Interviewer: "Your LLM inference server is utilizing a 70B model on an H100. P99 latency suddenly spikes. You look at the GPU utilization metrics, and the GPU is sitting at 0% for 50 milliseconds before every generation request. What CPU-bound process is blocking the GPU from doing its job?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The prompt is being transferred over the PCIe bus." Transferring a few kilobytes of text takes microseconds, not 50 milliseconds.
Realistic Solution: You are hitting the Tokenizer CPU Bottleneck.
Before an LLM can process text, the raw string must be converted into integer Token IDs. Most default tokenizers (like HuggingFace's pure Python implementations or regex-heavy BPE tokenizers) run entirely on a single CPU core.
If a user sends a 10,000 word document to be summarized, the CPU must iterate through the string, apply complex regex rules, and perform dictionary lookups to generate the thousands of token IDs. If this tokenizer is written in unoptimized Python, it can easily take 50 to 100 milliseconds of pure CPU compute, during which your trillion-dollar GPU is physically doing nothing.
The Fix:
- Use a compiled, Rust-backed tokenizer (e.g., HuggingFace's tokenizers fast mode).
- Move tokenization off the critical path: tokenize in parallel worker threads or processes so the GPU scheduler always has token IDs ready when a batch slot opens.
Napkin Math: A slow Python BPE tokenizer might process 100,000 tokens per second, so a 5,000-token prompt takes 50ms of single-core CPU time: exactly the 0% GPU gap you observed. A Rust-backed tokenizer is typically 10x+ faster, shrinking that gap to a few milliseconds.
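A quick benchmark of the two paths; the gpt2 tokenizer is chosen only because it ships both a pure-Python and a Rust-backed implementation, and absolute timings will vary by machine:

```python
import time
from transformers import AutoTokenizer  # assumes transformers is installed

text = "word " * 5_000  # roughly a 5k-token document

fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)   # Rust-backed
slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)  # pure Python

for name, tok in [("slow", slow), ("fast", fast)]:
    t0 = time.perf_counter()
    ids = tok(text)["input_ids"]
    print(f"{name}: {len(ids)} tokens in {(time.perf_counter() - t0) * 1e3:.1f} ms")
```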
๐ Deep Dive: Volume I: Frameworks
</details>Interviewer: "You are serving a 13B model on a single 80GB GPU. You know the weights take 26GB, leaving 54GB for KV-cache. A single user request takes exactly 1GB of KV-cache. Therefore, you configure the server to accept 54 concurrent requests. However, after an hour of operation, the server OOMs and crashes when serving only 35 concurrent requests. Where did the missing 19GB of VRAM go?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Memory leak in the Python code." Python memory leaks affect system RAM, not the GPU VRAM.
Realistic Solution: You are experiencing severe KV-Cache Memory Fragmentation.
Standard huggingface/PyTorch text generation pre-allocates a contiguous block of GPU memory for a sequence's maximum possible length (e.g., 2048 tokens).
If User A asks a question and the model only generates 10 tokens before stopping, User A's request still pinned a 2048-token-sized block of VRAM. When User A finishes, that block is freed. Over thousands of requests of varying lengths, the VRAM becomes a "swiss cheese" of free and used blocks.
When a new user connects and asks for a 2048-token context window, the memory allocator cannot find a contiguous 1GB block of free VRAM, even though there is 19GB of total free space scattered across the chip. The allocation fails, and the server OOMs.
The Fix: You must use PagedAttention (like vLLM). PagedAttention breaks the KV-cache into small, fixed-size blocks (e.g., 16 tokens) that do not need to be contiguous in physical memory, completely eliminating external fragmentation and allowing you to safely utilize 99% of your VRAM.
Napkin Math: Without PagedAttention, the average internal/external memory waste in an LLM serving cluster is typically 60% to 80%. A $30,000 H100 GPU is effectively operating as a $6,000 GPU because the memory allocator is too rigid.
๐ Deep Dive: Volume I: Model Serving
</details>Interviewer: "You deploy a Mixtral 8x7B (Mixture of Experts) model on a single 80GB GPU. Because it only uses ~13B active parameters per token, the math should be fast. However, your profiling shows that generating a token takes 2x longer than a standard dense 13B model. What architectural component of the MoE model is destroying your memory bandwidth?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The router network math is too heavy." The router network is usually a tiny single linear layer; its compute is negligible.
Realistic Solution: You are paying the MoE Memory Loading Penalty.
During autoregressive decoding (batch size 1), the GPU must load the weights from HBM to SRAM to do the math. For a dense 13B model, the GPU streams 26 GB of weights sequentially. Memory controllers love sequential reads, hitting near peak bandwidth.
For an MoE model, the router decides at runtime which expert to use for a given token. The GPU must wait for the router math to finish, and then it must perform sparse, random memory accesses to fetch the weights for Expert 3 and Expert 7 out of the massive 46.7B-parameter pool of total weights.
These random memory reads defeat the GPU's hardware prefetchers and destroy memory bus efficiency. Even though you are only doing 13B parameters worth of math, the time it takes to randomly fetch those specific 13B parameters from HBM takes vastly longer than streaming them sequentially.
The Fix: Increase the effective batch size so that every expert is busy each step; with many tokens in flight, each expert's weights are read once and amortized across all the tokens routed to it, and the reads become large and sequential again. If you must serve at batch size 1, use fused MoE kernels where available, or prefer a dense model of equivalent active size.
Napkin Math: HBM Sequential Read Speed = 2000 GB/s. HBM Random Access Speed = ~400 GB/s. Fetching 26GB sequentially takes 13ms. Fetching 26GB via sparse random accesses takes 65ms. The MoE model feels 5x slower despite having the same FLOPs.
๐ Deep Dive: Volume I: Network Architectures
</details>Interviewer: "You are using vLLM to serve an LLM. You configure the PagedAttention block size to 256 tokens per block to reduce the metadata overhead in the memory allocator. Your memory fragmentation drops to zero. However, your time-to-first-token (TTFT) latency suddenly spikes, and the GPU is sitting idle waiting for memory. Why did a large block size destroy your performance?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "256 tokens is too much memory to allocate at once." 256 tokens is only a few megabytes; the GPU can allocate that instantly.
Realistic Solution: You are suffering from Internal Fragmentation and Wasted Memory Bandwidth.
PagedAttention eliminates external fragmentation by breaking the KV-cache into blocks. However, it still suffers from internal fragmentation. If a user prompt is 257 tokens long, vLLM must allocate TWO 256-token blocks (512 tokens total). The second block is 99% empty.
The fatal flaw is the memory fetch: During the autoregressive decode phase, when the GPU reads the KV-cache to compute attention for the 258th token, the memory controller fetches the entire 256-token block from HBM into SRAM, even though 255 of those token slots are completely empty.
You are forcing the GPU to move massive amounts of useless, empty bytes across the memory bus, destroying the effective memory bandwidth and starving the Tensor Cores.
The Fix: You must use smaller block sizes (e.g., 16 or 32 tokens). This slightly increases the page table metadata overhead on the CPU/GPU, but tightly packs the KV-cache memory, ensuring that almost every byte fetched from HBM contains useful math data.
Napkin Math: Block size 256. Sequence length 257. Allocated = 512 tokens. Wasted space = 255 tokens (nearly 50%). When generating the next token, the GPU reads 512 tokens' worth of data (e.g., 2 MB), but only 1 MB is real data. Your effective memory bandwidth is halved.
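The waste fraction is a one-line function of the block size, which makes the trade-off easy to tabulate:

```python
def waste_fraction(seq_len: int, block_size: int) -> float:
    allocated = -(-seq_len // block_size) * block_size  # round up to whole blocks
    return 1 - seq_len / allocated

for bs in (256, 32, 16):
    print(f"block={bs:>3}: {waste_fraction(257, bs):.1%} of fetched KV bytes are empty")
# block=256: 49.8%   block=32: 10.8%   block=16: 5.5%
```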
๐ Deep Dive: Volume I: Frameworks
</details>Interviewer: "You serve a PyTorch model on a T4 GPU. The model is cast to torch.float16. You use standard matrix multiplications. However, your profiler shows the GPU is only achieving 8 TFLOPS of throughput, which is the theoretical maximum for FP32 on a T4. The T4 is capable of 65 TFLOPS in FP16. Why is your FP16 model running at FP32 speeds?"
Common Mistake: "The model didn't actually cast to FP16." It did; the memory footprint confirmed it. The issue is the math engine.
Realistic Solution: You failed to enable Tensor Cores (Hardware FP16 Multipliers).
On NVIDIA GPUs (Volta architecture and newer, like the T4), standard CUDA cores handle FP32 math. To get the massive 65 TFLOPS speedup for FP16, the math must be routed to specialized hardware units called Tensor Cores.
In older versions of PyTorch (or if not explicitly configured), simply casting the tensors to FP16 is not enough. If the matrix dimensions do not perfectly align with the hardware requirements of the Tensor Cores (e.g., dimensions must be multiples of 8), the cuBLAS library will silently refuse to use the Tensor Cores.
Instead, it will cast your FP16 data back up to FP32 in registers, run the math on the slow, standard FP32 CUDA cores, and cast the result back to FP16. You get the memory savings of FP16, but completely lose the compute speedup.
The Fix:
- Pad matrix dimensions (batch size, hidden dims) to multiples of 8 so cuBLAS can route the GEMM to Tensor Cores.
- Check the backend flags and mixed-precision setup (e.g., torch.backends.cudnn.allow_tf32 = True for TF32, or ensure AMP/autocast is correctly configured).
- Profile to confirm Tensor Core kernels are actually being launched rather than FP32 fallbacks.
Napkin Math: T4 Peak FP32 = 8.1 TFLOPS. Peak FP16 (Tensor Cores) = 65 TFLOPS. Failing to align dimensions by a single element (e.g., size 127 instead of 128) silently degrades your hardware performance by exactly 8x.
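A sketch of the padding trick: zero-padding the contraction dimension changes the kernel cuBLAS can select without changing the result. The helper name is illustrative, and the FP16 path assumes a GPU is present:

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    # Tensor Core GEMMs want dims in multiples of 8 (FP16); zero-pad to comply.
    pad = (-x.shape[-1]) % multiple
    return F.pad(x, (0, pad)) if pad else x

dev = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if dev == "cuda" else torch.float32  # FP16 matmul wants a GPU

x = torch.randn(64, 127, device=dev, dtype=dtype)    # K=127: misaligned, slow path
w = torch.randn(127, 512, device=dev, dtype=dtype)
y_ref = x @ w
y_fast = pad_to_multiple(x) @ F.pad(w, (0, 0, 0, 1)) # K=128: Tensor-Core eligible
print((y_fast - y_ref).abs().max())  # zero rows/cols: same math, different kernel
```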
๐ Deep Dive: Volume I: HW Acceleration
</details>Interviewer: "You implement Speculative Decoding to speed up an LLM. You use a 1B draft model to guess 5 tokens, and a 70B target model to verify them in a single pass. On standard English text, it yields a 2.5x speedup. When you deploy it to your coding assistant API, the overall throughput actually drops by 15% compared to not using Speculative Decoding at all. Why did the optimization become a penalty?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "Code is harder to generate." It is, but that alone doesn't explain why a speedup technique becomes an active penalty.
Realistic Solution: You suffered a Catastrophic Draft Acceptance Rate Drop.
Speculative Decoding works by gambling: the draft model is fast, and the target model verifies its guesses in parallel. If the target model agrees with the draft model, you get 5 tokens for the price of 1.
However, the target model's verification pass requires reading the KV-cache and processing all 5 draft tokens. If the draft model guesses wrong on the very first token, the target model rejects the entire draft sequence. You spent compute running the draft model, and you spent compute running the target model to verify a 5-token sequence, but you only got 1 valid token out of it.
Your 1B draft model was trained on general English, so it guessed English well. But on complex, highly structured programming languages (Python, C++), the 1B model was completely lost. Its guess-acceptance rate dropped from 70% to near 0%.
Because you constantly paid the overhead of running the draft model and evaluating longer target sequences without getting any parallel tokens accepted, the net throughput was worse than just running the 70B model autoregressively.
The Fix: Speculative Decoding is entirely reliant on the Draft/Target alignment. You must either fine-tune the 1B draft model specifically on your domain (code), use a smaller draft model to reduce overhead, or dynamically disable Speculative Decoding if the rolling acceptance rate drops below the break-even mathematical threshold.
Napkin Math: Break-even Acceptance Rate. Suppose the draft takes 5ms and the target takes 20ms per pass. Standard (1 token): 20ms. Speculative (5 tokens, 0% accept): 5ms (draft) + 20ms (verify) = 25ms to get 1 token (25% slower). Speculative (5 tokens, 100% accept): 5ms + 20ms = 25ms to get 5 tokens = 5ms per token (4x faster).
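The break-even point falls out of a simple expected-cost model; this deliberately simplified sketch matches the two endpoints in the napkin math above:

```python
def spec_decode_ms_per_token(accept_rate: float, draft_ms: float = 5.0,
                             target_ms: float = 20.0, k: int = 5) -> float:
    """Expected cost per emitted token for draft-k speculative decoding.

    Simplified model: each round costs one draft pass plus one target
    verification pass and emits 1 guaranteed token plus, on average,
    accept_rate * (k - 1) extra draft tokens.
    """
    tokens_per_round = 1 + accept_rate * (k - 1)
    return (draft_ms + target_ms) / tokens_per_round

for rate in (0.0, 0.2, 0.7, 1.0):
    print(f"accept={rate:.0%}: {spec_decode_ms_per_token(rate):.1f} ms/token "
          f"(baseline 20.0)")
# accept=0% -> 25.0 ms (slower than baseline); accept=100% -> 5.0 ms (4x faster)
```

A production scheduler can track the rolling acceptance rate and disable speculation whenever the modeled cost crosses the baseline.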
๐ Deep Dive: Volume I: Model Serving
</details>Interviewer: "You are deploying Llama-3-70B. It uses Grouped Query Attention (GQA), which massively reduces the size of the KV-cache compared to standard Multi-Head Attention (MHA). Because the KV-cache is 8x smaller, your manager assumes the Time-Per-Output-Token (TPOT) during the decoding phase will be 8x faster. However, the TPOT only improves by about 10%. Why didn't an 8x reduction in KV-cache size yield an 8x reduction in latency?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "GQA requires complex routing logic that slows it down." The routing is trivial. The issue is the proportion of memory traffic.
Realistic Solution: The manager forgot about the Model Weights memory traffic.
During autoregressive decoding (batch size 1), the GPU must read both the KV-cache and the entire Model Weights from HBM to SRAM to generate a single token.
For Llama-3-70B (FP16):
- Model weights: ~140 GB that must be read for every generated token.
- KV-cache for the same context: ~2 GB under standard MHA, ~0.25 GB under GQA (8x smaller).
Total memory read per token (MHA) = 140 GB + 2 GB = 142 GB. Total memory read per token (GQA) = 140 GB + 0.25 GB = 140.25 GB.
You only reduced the total memory bandwidth requirement per token from 142 GB down to 140.25 GB. The 140 GB of model weights still completely dominates the memory bus. Therefore, single-batch latency barely improves at all.
The Fix: GQA does not significantly speed up single-user latency. GQA's superpower is Concurrency Scaling. Because the KV-cache is 8x smaller, you can fit 8x as many concurrent users onto the same GPU before OOMing, which massively increases total system throughput (tokens/second) at high batch sizes.
Napkin Math: Latency depends on total bytes moved. 142 GB vs 140.25 GB is only a ~1.2% difference in physical memory traffic. The true speedup from GQA only appears when serving hundreds of users, where the KV-cache size finally approaches the weight size.
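The per-token traffic is easy to compute from the architecture; Llama-3-70B's 80 layers, 64 query heads, 8 KV heads, and head dim 128 are used below, and the ~768-token context is chosen to reproduce the 2 GB MHA figure above:

```python
def kv_gb_per_seq(seq_len: int, layers: int = 80, kv_heads: int = 64,
                  head_dim: int = 128, bytes_per: int = 2) -> float:
    return seq_len * layers * 2 * kv_heads * head_dim * bytes_per / 1e9

mha = kv_gb_per_seq(768, kv_heads=64)  # as if all 64 heads kept their own K/V
gqa = kv_gb_per_seq(768, kv_heads=8)   # GQA: only 8 KV heads
weights = 140.0
print(f"MHA: {weights + mha:.2f} GB read per token, GQA: {weights + gqa:.2f} GB")
print(f"KV shrink: {mha / gqa:.0f}x, single-user speedup: "
      f"{(weights + mha) / (weights + gqa):.3f}x")  # 8x shrink, ~1.01x speedup
```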
๐ Deep Dive: Volume I: Network Architectures
</details>Interviewer: "Your PyTorch LLM inference script runs model.generate(). The GPU is extremely powerful (H100), but your Time-Per-Output-Token is hovering around 10ms. You want it to be 2ms. You look at nvtop and see the GPU utilization is only 15%. The GPU isn't doing much work. What part of the Python generate loop is physically preventing the GPU from running faster?"
Common Mistake: "The memory bandwidth is saturated." If memory bandwidth was the only issue, GPU utilization (specifically memory controller utilization) would be high. 15% means the GPU is sleeping.
Realistic Solution: You are suffering from Python CPU Launch Overhead.
Autoregressive generation produces one token at a time. In a naive PyTorch loop, each step looks like this:
1. Python dispatches the CUDA kernels for the forward pass, layer by layer.
2. The GPU finishes the math almost instantly and goes idle.
3. The logits are copied back to the CPU.
4. Python runs the argmax or sampling logic on the CPU, appends the token, and starts the next iteration.
The Python interpreter's overhead to dispatch the CUDA kernels and do the sampling logic takes longer than the GPU takes to execute the 70 billion parameters. The GPU finishes its math instantly and then sits completely idle waiting for Python to tell it what to do next.
The Fix: Eliminate the per-token Python round trips. Capture the decode step with CUDA Graphs (or torch.compile(mode="reduce-overhead"), which uses them under the hood) so the whole step replays as a single launch, keep sampling on the GPU instead of round-tripping logits to the CPU, or use a serving engine (vLLM, TensorRT-LLM) that already does this.
Napkin Math: H100 kernel execution = ~100 microseconds. Python interpreter overhead to launch a kernel = ~20-50 microseconds. If the model has 80 layers, Python overhead alone adds 4ms per token of pure idle time where the GPU is waiting for instructions.
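A minimal sketch of the torch.compile route; the 80-layer stack of small Linears is a stand-in chosen so that launch overhead, not math, dominates:

```python
import torch
import torch.nn as nn

# Toy stand-in for one decode step: many small layers at batch size 1, so
# per-kernel Python launch overhead dominates the actual math.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(80)]).eval()

if torch.cuda.is_available():
    model = model.cuda().half()
    x = torch.randn(1, 1024, device="cuda", dtype=torch.half)
    # mode="reduce-overhead" captures the step into a CUDA Graph, replaying
    # all 80 layers as one launch instead of 80+ Python kernel dispatches.
    fast_step = torch.compile(model, mode="reduce-overhead")
    with torch.no_grad():
        for _ in range(3):       # warm-up runs trigger capture/compilation
            fast_step(x)
        out = fast_step(x)       # steady state: near-zero CPU launch overhead
```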
๐ Deep Dive: Volume I: Frameworks
</details>Interviewer: "To handle a massive spike in traffic, you configure vLLM to allow swapping inactive KV-cache blocks to the host CPU RAM via PCIe. This prevents OOM errors. However, you notice that while throughput remains okay, the latency (Time-Per-Output-Token) for individual users randomly spikes from 50ms to 900ms. Why does CPU swapping cause catastrophic latency spikes for active users?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "CPU RAM is slower than GPU RAM." The speed of the RAM isn't the primary bottleneck; the physical transport pipe between them is.
Realistic Solution: You are experiencing PCIe Bus Thrashing.
When a request that was swapped out to CPU RAM becomes active again, the GPU cannot compute attention over it. The entire KV-cache block must be physically transferred back from the CPU to the GPU over the PCIe bus before the forward pass can execute.
If a user has a 4,000 token context window, their KV-cache might be 1 GB. A PCIe Gen4 x16 bus has a real-world bandwidth of roughly 25 GB/s. It takes 40ms just to copy that single user's cache back to the GPU. If the scheduler decides to swap in 10 users simultaneously for the next batch, the PCIe bus is locked for 400ms. The GPU tensor cores sit completely idle during this time, causing the massive 900ms latency spikes for anyone waiting for a token.
The Fix: CPU swapping is a trap for real-time latency-sensitive applications. It should only be used for offline, asynchronous batch processing. For real-time serving, you must strictly cap max_num_seqs to fit within physical HBM, or use context summarization.
Napkin Math: 10 GB of swapped KV-cache needed for the next batch. PCIe Gen4 limit = 25 GB/s. 10 / 25 = 0.4 seconds of pure memory movement before a single FLOP of neural network math can occur.
๐ Deep Dive: Volume I: Model Serving
</details>Interviewer: "You implement Continuous Batching (iteration-level scheduling) for your LLM server. It successfully groups users together to maximize GPU utilization. However, you notice that users asking for very short, 10-token answers are experiencing incredibly long wait times, even though their requests require very little compute. How did your batching algorithm starve the easy requests?"
<details> <summary><b>๐ Reveal Answer</b></summary>Common Mistake: "The GPU prioritizes larger matrix math." The GPU doesn't prioritize anything; the software scheduler does.
Realistic Solution: You caused Head-of-Line Blocking via First-Come-First-Served (FCFS) Scheduling.
If a user asks for a 2,000-token story, their request enters the batch. The batch size is capped by the GPU's KV-cache capacity (e.g., max 32 concurrent sequences). Because generating 2,000 tokens takes a long time, that user occupies 1 of the 32 slots for several seconds.
If 32 users ask for long stories, the batch becomes completely full. When User 33 arrives and asks a simple question requiring a 10-token response, the Continuous Batching scheduler cannot let them in because there is no KV-cache memory available. User 33 must wait entirely for one of the long stories to finish generating all 2,000 tokens before a slot frees up.
The Fix: You must implement Preemption and Fair-Share Scheduling. The scheduler should detect long-running requests, temporarily evict/preempt them (saving their KV-cache to CPU RAM or recomputing it later), inject the short requests into the batch to clear them out quickly, and then resume the long requests.
Napkin Math: 32 users request 2,000 tokens at 20ms/token = 40 seconds to finish. User 33 requests 10 tokens (0.2 seconds of work). User 33 waits 40 seconds to get 0.2 seconds of compute. This ruins P99 latency metrics.
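A toy simulation of the two policies makes the starvation visible; the eviction rule (a shorter waiting job may evict the longest active one, which later resumes with its remaining work) is a deliberately simplified stand-in for real preemption:

```python
from collections import deque

def finish_times(jobs: list[int], slots: int, preempt: bool) -> dict[int, int]:
    """Toy scheduler: one token per active request per step. jobs[i] = tokens
    request i must generate; returns the completion step per request id."""
    waiting = deque(enumerate(jobs))
    active: list[list[int]] = []          # [request_id, tokens_remaining]
    done: dict[int, int] = {}
    step = 0
    while waiting or active:
        while waiting and len(active) < slots:
            rid, tok = waiting.popleft()
            active.append([rid, tok])
        if preempt and waiting:
            # Fair-share: a short waiting job may evict the longest active one.
            shortest = min(waiting, key=lambda j: j[1])
            longest = max(active, key=lambda r: r[1])
            if shortest[1] < longest[1]:
                waiting.remove(shortest)
                active.remove(longest)
                active.append(list(shortest))
                waiting.append((longest[0], longest[1]))  # resumes later
        step += 1
        for r in active:
            r[1] -= 1
        for r in [r for r in active if r[1] == 0]:
            done[r[0]] = step
            active.remove(r)
    return done

jobs = [2000] * 32 + [10]                 # 32 long stories, then one short question
print(finish_times(jobs, 32, preempt=False)[32])  # 2010: starved behind the stories
print(finish_times(jobs, 32, preempt=True)[32])   # 10: preemption clears it fast
```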
๐ Deep Dive: Volume I: Model Serving
</details>