Distributed systems and production infrastructure for ML.
> [!CAUTION]
> This volume is under active development. I am writing and revising chapters continuously. Diagrams, figures, and cross-references are being created and updated throughout. What you see here is a work in progress, not a finished product. I share it openly because I believe in transparent development. Expect the content to evolve significantly before the Summer 2026 release.
Volume II picks up where Volume I ends, moving from a single machine to fleets of machines connected by high-speed networks. It covers the mathematical and algorithmic demands of scale, how to build the physical infrastructure that meets them, how to serve models to billions of users, and how to do all of this safely and responsibly.
Where Volume I teaches you to optimize a single node (one to eight accelerators, shared memory, PCIe/NVLink within one box), Volume II teaches you to orchestrate many nodes (hundreds to thousands of accelerators, InfiniBand/Ethernet fabric, message passing, fault tolerance across racks and datacenters).
Volume II assumes you have read Volume I or have equivalent knowledge of single-machine ML systems: the Iron Law of ML Systems, the D.A.M Taxonomy, training and inference pipelines, model compression, and hardware acceleration fundamentals.
I develop this volume in the open because I believe doing so produces a better textbook. Every commit is visible, and every editorial decision is traceable. If something looks rough, that is because you are watching the book being written.
If you notice an error, have a suggestion, or want to propose a topic, please open an issue or start a discussion. Feedback on structure, missing topics, examples, and clarity is especially valuable at this stage. The book is better for every reader who engages with it.