Machine Learning Systems: Two-Volume Structure

Status: Implemented
Target Publisher: MIT Press
Audience: Undergraduate and graduate CS/ECE students, academic courses

Overview

This textbook is organized into two volumes following the Hennessy & Patterson pedagogical model:

Volume I: Introduction to Machine Learning Systems — Build, Optimize, Deploy
Volume II: Machine Learning Systems at Scale — Scale, Distribute, Govern

Each volume stands alone as a complete learning experience while together forming a comprehensive treatment of the field.

Volume I: Introduction to Machine Learning Systems

Goal

A reader who completes Volume I can build, optimize, and deploy ML systems on a single machine, with awareness of responsible practices.

Target Audience

Upper-level undergraduates
Early graduate students
Practitioners transitioning into ML systems

Course Mapping

Single semester "Introduction to Machine Learning Systems" course
Foundation for more advanced distributed systems or MLOps courses

Structure (16 chapters)

Part I: Foundations

Establish the conceptual framework for understanding ML as a systems discipline.

Ch	Title	Purpose
1	Introduction	Why ML systems thinking matters
2	ML Systems	Survey of the field, deployment paradigms
3	ML Workflow	End-to-end ML development process
4	Data Engineering	Pipelines, preprocessing, data quality

Part II: Build

The technical implementation of machine learning systems from math to trained models.

Ch	Title	Purpose
5	Neural Computation	Mathematical and conceptual foundations
6	Network Architectures	CNNs, RNNs, Transformers, architectural choices
7	ML Frameworks	PyTorch, TensorFlow, JAX ecosystem
8	Model Training	Training loops, optimization, debugging

Part III: Optimization

Techniques for making ML systems efficient and fast.

Ch	Title	Purpose
9	Data Selection	Optimizing information content, active learning, data pruning
10	Model Compression	Quantization, pruning, distillation
11	Hardware Acceleration	GPUs, TPUs, custom accelerators
12	Benchmarking	Measuring performance, MLPerf

Part IV: Deployment

Getting models into production responsibly.

Ch	Title	Purpose
13	Model Serving	Inference fundamentals, batching, latency optimization
14	ML Operations	Deployment, monitoring, CI/CD for ML
15	Responsible Engineering	Ethics, safety, and professional practice
16	Conclusion	Synthesis and bridge to Volume II

Volume II: Machine Learning Systems at Scale

Goal

A reader who completes Volume II understands how to build and operate ML systems at scale, with production resilience and responsible practices.

Target Audience

Graduate students
Industry practitioners
Researchers building large-scale systems

Prerequisites

Volume I or equivalent knowledge
Familiarity with basic distributed systems concepts (helpful, not required)

Course Mapping

Graduate seminar on large-scale ML systems
Advanced MLOps course
Research group reading material

Structure (16 chapters)

Part I: Foundations of Scale

Infrastructure and concepts for scaling beyond single machines.

Ch	Title	Purpose
1	Introduction	Motivation, challenges of scale
2	Infrastructure	Clusters, cloud, resource management
3	Storage Systems	Data lakes, distributed storage, checkpointing
4	Communication	AllReduce, parameter servers, network topology

Part II: Distributed Systems

Training and inference across multiple machines.

Ch	Title	Purpose
5	Distributed Training	Parallelism strategies, multi-chip hardware, scaling infrastructure
6	Fault Tolerance	Checkpointing, recovery, handling failures
7	Inference at Scale	Serving systems, batching, latency optimization
8	Edge Intelligence	Federated learning, fleet coordination, on-device adaptation

Part III: Production Challenges

Real-world complexities of operating ML systems.

Ch	Title	Purpose
9	Privacy & Security	Differential privacy, secure computation, attacks
10	Robust AI	Adversarial robustness, distribution shift
11	ML Ops at Scale	Advanced MLOps, platform engineering
12	Sustainable AI	Environmental impact, efficient computing

Part IV: Responsible Deployment

Building ML systems that benefit society.

Ch	Title	Purpose
13	Responsible AI	Fairness, accountability, transparency
14	AI for Good	Applications for societal benefit
15	Frontiers	Emerging trends, open problems
16	Conclusion	Synthesis, future of the field

Key Design Decisions

Why This Split?

Pedagogical Progression: Volume I covers what every ML practitioner needs. Volume II covers what scale/production engineers need.
Course Adoptability: Volume I maps to a single semester intro course. Volume II maps to an advanced graduate seminar.
Standalone Completeness: A reader of only Volume I still gains responsible engineering awareness through Chapter 15 (Responsible Engineering).
Industry Alignment: Volume I prepares readers for junior engineering roles. Volume II develops the systems thinking expected at senior and staff level.

The Hennessy & Patterson Test

When deciding where content belongs, ask: What is the SCOPE of the system being discussed?

Aspect	Volume I	Volume II
Scope	Single-machine systems (1-8 GPUs)	Multi-machine distributed systems
Math & Theory	Full rigor, derivations	Full rigor, derivations
Performance Metrics	Single-system analysis	Scaling/efficiency analysis
Code Examples	Single-node implementations	Multi-node implementations
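
The scope test above can be sketched as a small decision function. This is a minimal illustration, not part of the book's tooling: the name `assign_volume` and its `num_machines` parameter are hypothetical, and the sketch keys only on machine count, leaving the table's 1-8 GPU bound for Volume I unencoded.

```python
def assign_volume(num_machines: int) -> str:
    """Return the volume suggested by the scope test.

    Hypothetical helper: `num_machines` is the number of physical hosts
    that the system under discussion spans. This parameter is an
    assumption made for illustration, not a term from the outline.
    """
    if num_machines < 1:
        raise ValueError("a system must span at least one machine")
    # Single-machine systems fall within Volume I's scope.
    if num_machines == 1:
        return "Volume I"
    # Multi-machine distributed systems fall within Volume II's scope.
    return "Volume II"
```

For example, content about a single 8-GPU workstation (`assign_volume(1)`) would land in Volume I, while content about a 4-node training cluster (`assign_volume(4)`) would land in Volume II.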

Summary Statistics

Metric	Volume I	Volume II
Chapters	16	16
Parts	4	4
Focus	Single system	Distributed systems
Prerequisite	None	Volume I

Document Version: January 2025
Reflects current implementation in _quarto-html.yml