Numbers Every ML Systems Engineer Should Know

<div align="center"> <a href="README.md">🏠 Playbook Home</a> · <a href="cloud/README.md">☁️ Cloud</a> · <a href="edge/README.md">🤖 Edge</a> · <a href="mobile/README.md">📱 Mobile</a> · <a href="tinyml/README.md">🔬 TinyML</a> </div>

Adapted from the textbook's Machine Foundations appendix. Memorize the ratios; they're physics. Use the absolute numbers as sanity checks. All hardware values are sourced from `mlsysim/core/constants.py`, the single source of truth for the book.


🪜 The Scale Ladder

The defining characteristic of ML Systems Engineering is that the physics change by orders of magnitude depending on where you deploy.

| Dimension | Cloud | Edge | Mobile | TinyML | Cloud:TinyML Ratio |
|---|---|---|---|---|---|
| Compute | ~1,000 TFLOPS | ~100 TOPS | ~30 TOPS | ~100 MFLOPS | 10,000,000× |
| Memory | 80 GB HBM | 8–32 GB DRAM | 6–12 GB shared | 256 KB–2 MB SRAM | 40,000× |
| Power | 700 W | 30 W | 5 W | 10 mW | 70,000× |
| Latency Budget | 100 ms (P99) | 33 ms (hard RT) | 16 ms (jank) | 1 ms (interrupt) | 100× |

1. The Invariants (Physics – Will Not Change)

<table>
  <thead>
    <tr> <th width="35%">Relationship</th> <th width="25%">Ratio</th> <th width="40%">Why it's stable</th> </tr>
  </thead>
  <tbody>
    <tr> <td><b>DRAM access vs FP16 compute</b></td> <td><b>~580×</b> more energy</td> <td>Wire capacitance scales with distance</td> </tr>
    <tr> <td><b>FP32 vs INT8 energy</b></td> <td>~18×</td> <td>Bit width determines switching energy</td> </tr>
    <tr> <td><b>FP32 vs FP16 energy</b></td> <td>~3.4×</td> <td>Halving bits roughly halves energy</td> </tr>
    <tr> <td><b>HBM vs L1 cache latency</b></td> <td>~300× slower</td> <td>On-chip vs off-chip</td> </tr>
    <tr> <td><b>SSD vs L1 cache latency</b></td> <td>~100,000× slower</td> <td>Electrical vs flash</td> </tr>
    <tr> <td><b>Network vs local memory</b></td> <td>~17× slower</td> <td>Speed of light + switching</td> </tr>
    <tr> <td><b>Light in fiber</b></td> <td>~200 km/ms</td> <td>Cross-country US ≈ 40 ms RTT</td> </tr>
  </tbody>
</table>

2. Scaling Rules (Arithmetic – Hardware Independent)

These formulas let you estimate memory, compute, and power requirements from basic model specs.

Cloud / LLM

| What you're estimating | Formula | Example |
|---|---|---|
| Inference memory (FP16) | 2 bytes × params | 7B params × 2 bytes = 14 GB |
| Inference memory (INT8) | 1 byte × params | 7B params × 1 byte = 7 GB |
| Training memory (Adam) | 16 bytes × params | 7B params × 16 bytes = 112 GB |
| Inference compute | ~2 FLOPs × params per token | 7B → ~14 GFLOPs/token |
| Training compute | ~6 FLOPs × params × tokens | 7B on 1T tokens → ~4×10²² FLOPs |
| KV-cache per token | 2 × layers × heads × head_dim × 2 bytes | Llama 70B, 128k tokens → ~335 GB |
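
These rules compose directly. Below is a minimal back-of-envelope sketch of them in Python; the function and key names are illustrative, not part of the textbook's `mlsysim` package.

```python
# Back-of-envelope LLM sizing, following the rules of thumb in the table above.
# Illustrative names; this is not the mlsysim API.

def llm_estimates(params_b: float, tokens_t: float = 1.0) -> dict:
    """params_b: parameters in billions; tokens_t: training tokens in trillions."""
    p = params_b * 1e9
    return {
        "inference_gb_fp16": p * 2 / 1e9,              # 2 bytes per param
        "inference_gb_int8": p * 1 / 1e9,              # 1 byte per param
        "training_gb_adam":  p * 16 / 1e9,             # weights + grads + optimizer states
        "gflops_per_token":  2 * p / 1e9,              # ~2 FLOPs per param per token
        "training_flops":    6 * p * tokens_t * 1e12,  # ~6 FLOPs per param per token
    }

print(llm_estimates(7, 1.0))
# 7B model: ~14 GB FP16 / ~7 GB INT8 inference, ~112 GB Adam training,
# ~14 GFLOPs per token, ~4.2e22 FLOPs to train on 1T tokens
```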

Edge / Vision

| What you're estimating | Formula | Example |
|---|---|---|
| Activation memory | $H \times W \times C \times \text{batch} \times \text{bytes}$ | 640×640×32×1×2 = ~26 MB |
| FPS budget | $1000\text{ ms} / \text{frame\_deadline\_ms}$ | 33 ms deadline = 30 FPS |
| Sustained TOPS | $\text{TOPS/W} \times \text{thermal\_budget\_W}$ | 4.6 TOPS/W × 15 W = 69 TOPS |
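
The same three estimates as a short sketch, with hypothetical helper names (again, not the `mlsysim` API):

```python
# Edge vision back-of-envelope, mirroring the three rows above.

def activation_mb(h, w, c, batch=1, bytes_per_elem=2):
    # Feature-map footprint for one layer's output.
    return h * w * c * batch * bytes_per_elem / 1e6

def fps_budget(frame_deadline_ms):
    return 1000.0 / frame_deadline_ms

def sustained_tops(tops_per_watt, thermal_budget_w):
    return tops_per_watt * thermal_budget_w

print(activation_mb(640, 640, 32))   # ~26.2 MB
print(fps_budget(33))                # ~30 FPS
print(sustained_tops(4.6, 15))       # 69.0 TOPS
```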

Mobile

| What you're estimating | Formula | Example |
|---|---|---|
| App memory budget | $\text{device\_RAM} \times 0.25$ | 8 GB RAM × 0.25 = 2 GB max |
| NPU delegation ratio | $\text{supported\_ops} / \text{total\_ops}$ | 85/100 ops = 85% delegated |
| Battery drain | $P_{\text{inference}} \times \text{duty\_cycle} \times \text{time}$ | 2 W × 0.05 × 1 hr = 0.1 Wh |
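
A small illustrative sketch of the mobile budgets (names are hypothetical, not from `mlsysim`):

```python
# Mobile budgeting back-of-envelope.

def app_memory_budget_gb(device_ram_gb, fraction=0.25):
    # Rule of thumb: plan on roughly a quarter of device RAM for your app.
    return device_ram_gb * fraction

def npu_delegation_ratio(supported_ops, total_ops):
    return supported_ops / total_ops

def battery_drain_wh(p_inference_w, duty_cycle, hours):
    return p_inference_w * duty_cycle * hours

print(app_memory_budget_gb(8))           # 2.0 GB
print(npu_delegation_ratio(85, 100))     # 0.85 delegated
print(battery_drain_wh(2.0, 0.05, 1.0))  # 0.1 Wh
```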

TinyML

| What you're estimating | Formula | Example |
|---|---|---|
| Tensor arena peak | $\max(\text{concurrent activation sizes})$ | Layer 3: 40 KB + 20 KB = 60 KB peak |
| Flash budget | $\text{Total} - \text{Bootloader} - \text{RTOS} - \text{OTA}$ | 1 MB - 32 KB - 64 KB - 450 KB = 454 KB |
| Duty-cycle average power | $(P_{\text{active}} t_{\text{active}} + P_{\text{sleep}} t_{\text{sleep}}) / t_{\text{period}}$ | (10 mW × 1 s + 10 µW × 9 s) / 10 s ≈ 1 mW avg |
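
The flash and duty-cycle rows reduce to a few lines of arithmetic; a sketch with hypothetical names:

```python
# TinyML budgeting back-of-envelope.

def flash_budget_kb(total_kb, bootloader_kb, rtos_kb, ota_kb):
    # Whatever remains after firmware overhead is available for model + app code.
    return total_kb - bootloader_kb - rtos_kb - ota_kb

def duty_cycle_avg_power_mw(p_active_mw, t_active_s, p_sleep_mw, t_sleep_s):
    period_s = t_active_s + t_sleep_s
    return (p_active_mw * t_active_s + p_sleep_mw * t_sleep_s) / period_s

print(flash_budget_kb(1000, 32, 64, 450))       # 454 KB left for the model
print(duty_cycle_avg_power_mw(10, 1, 0.01, 9))  # ~1.0 mW average
```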

3. Current Hardware Snapshots (2024–2025)

โ˜๏ธ Cloud (Data Center)

| Category | Metric | Value |
|---|---|---|
| Compute | H100 FP16 Tensor Core | 989 TFLOPS |
| | B200 FP16 Tensor Core | 2,250 TFLOPS |
| Memory BW | H100 HBM3 | 3.35 TB/s |
| | B200 HBM3e | 8.0 TB/s |
| Interconnect | NVLink 4.0 (H100) | 900 GB/s |
| | InfiniBand NDR | 400 Gbps (50 GB/s) |
| Ridge Point | H100 (FP16) | ~295 Ops/Byte |
| Power | H100 TDP | 700 W |
| Latency | HBM3 | ~300 ns |
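
The ridge point rows in these tables are derived, not measured: in the roofline model the ridge point is peak compute divided by peak memory bandwidth. Worked out for the H100 FP16 entry above:

$$
\text{Ridge point} = \frac{\text{peak compute}}{\text{memory bandwidth}}
= \frac{989 \times 10^{12}\ \text{FLOP/s}}{3.35 \times 10^{12}\ \text{B/s}}
\approx 295\ \text{FLOPs/Byte}
$$

Kernels with lower arithmetic intensity than the ridge point are memory-bound; higher, compute-bound. The same division reproduces the Edge, Mobile, and TinyML ridge points below.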

🤖 Edge (Autonomous & Industrial)

| Category | Metric | Value |
|---|---|---|
| Compute | Jetson AGX Orin (INT8) | 275 TOPS |
| | Hailo-8 (INT8) | 26 TOPS |
| Memory BW | Jetson AGX Orin (LPDDR5) | 204.8 GB/s |
| | Hailo-8 (on-chip SRAM) | ~2.5 TB/s |
| Interconnect | MIPI CSI-2 (camera) | ~2.5 GB/s (4-lane) |
| Ridge Point | Jetson AGX Orin (INT8) | ~1,342 Ops/Byte |
| Power | Jetson AGX Orin | 15 W – 60 W |
| | Hailo-8 | 2.5 W |
| Latency | LPDDR5 | ~100 ns |

📱 Mobile (Smartphones)

| Category | Metric | Value |
|---|---|---|
| Compute | Apple A17 Pro (ANE) | 35 TOPS |
| | Snapdragon 8 Gen 3 (Hexagon) | 45 TOPS |
| Memory BW | Apple A17 Pro (LPDDR5) | 51.2 GB/s |
| | Snapdragon 8 Gen 3 (LPDDR5x) | 77 GB/s |
| Interconnect | On-chip NoC | ~100 GB/s |
| Ridge Point | Apple A17 Pro (INT8) | ~683 Ops/Byte |
| Power | Total SoC active | 3 W – 5 W |
| Latency | UFS 4.0 flash read | ~4.2 GB/s |

🔬 TinyML (Microcontrollers)

| Category | Metric | Value |
|---|---|---|
| Compute | Cortex-M4 (168 MHz) | ~336 MFLOPS |
| | Cortex-M7 (480 MHz) | ~960 MFLOPS |
| Memory BW | On-chip SRAM | ~1.2 GB/s |
| Interconnect | SPI / I2C | 10 Mbps / 400 Kbps |
| Ridge Point | Cortex-M4 | ~0.2 Ops/Byte |
| Power | Active (Cortex-M4) | ~10 mW – 50 mW |
| | Sleep (deep) | ~1 µW – 10 µW |
| Latency | Flash read | ~50 ns |

Source: All values from the textbook's `constants.py`. When hardware generations change, update the constants file and every calculation in the book (and this playbook) updates automatically.