Machine Learning Systems: Two-Volume Structure

Status: Implemented
Target Publisher: MIT Press
Audience: Undergraduate and graduate CS/ECE students, academic courses


Overview

This textbook is organized into two volumes following the Hennessy & Patterson pedagogical model:

  • Volume I: Introduction to Machine Learning Systems — Build, Optimize, Deploy
  • Volume II: Machine Learning Systems at Scale — Scale, Distribute, Govern

Each volume stands alone as a complete learning experience while together forming a comprehensive treatment of the field.


Volume I: Introduction to Machine Learning Systems

Goal

A reader completes Volume I and can competently build, optimize, and deploy ML systems on a single machine with awareness of responsible practices.

Target Audience

  • Upper-level undergraduates
  • Early graduate students
  • Practitioners transitioning into ML systems

Course Mapping

  • Single semester "Introduction to Machine Learning Systems" course
  • Foundation for more advanced distributed systems or MLOps courses

Structure (16 chapters)

Part I: Foundations

Establish the conceptual framework for understanding ML as a systems discipline.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction | Why ML systems thinking matters |
| 2 | ML Systems | Survey of the field, deployment paradigms |
| 3 | ML Workflow | End-to-end ML development process |
| 4 | Data Engineering | Pipelines, preprocessing, data quality |

Part II: Build

The technical implementation of machine learning systems from math to trained models.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Neural Computation | Mathematical and conceptual foundations |
| 6 | Network Architectures | CNNs, RNNs, Transformers, architectural choices |
| 7 | ML Frameworks | PyTorch, TensorFlow, JAX ecosystem |
| 8 | Model Training | Training loops, optimization, debugging |

Part III: Optimization

Techniques for making ML systems efficient and fast.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Data Selection | Optimizing information, active learning, pruning |
| 10 | Model Compression | Quantization, pruning, distillation |
| 11 | Hardware Acceleration | GPUs, TPUs, custom accelerators |
| 12 | Benchmarking | Measuring performance, MLPerf |

Part IV: Deployment

Getting models into production responsibly.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Model Serving | Inference fundamentals, batching, latency optimization |
| 14 | ML Operations | Deployment, monitoring, CI/CD for ML |
| 15 | Responsible Engineering | Ethics, safety, and professional practice |
| 16 | Conclusion | Synthesis and bridge to Volume II |

Volume II: Machine Learning Systems at Scale

Goal

A reader completes Volume II understanding how to build and operate ML systems at scale, with production resilience and responsible practices.

Target Audience

  • Graduate students
  • Industry practitioners
  • Researchers building large-scale systems

Prerequisites

  • Volume I or equivalent knowledge
  • Basic distributed systems concepts helpful

Course Mapping

  • Graduate seminar on large-scale ML systems
  • Advanced MLOps course
  • Research group reading material

Structure (16 chapters)

Part I: Foundations of Scale

Infrastructure and concepts for scaling beyond single machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 1 | Introduction | Motivation, challenges of scale |
| 2 | Infrastructure | Clusters, cloud, resource management |
| 3 | Storage Systems | Data lakes, distributed storage, checkpointing |
| 4 | Communication | AllReduce, parameter servers, network topology |

Part II: Distributed Systems

Training and inference across multiple machines.

| Ch | Title | Purpose |
|----|-------|---------|
| 5 | Distributed Training | Parallelism strategies, multi-chip hardware, scaling infrastructure |
| 6 | Fault Tolerance | Checkpointing, recovery, handling failures |
| 7 | Inference at Scale | Serving systems, batching, latency optimization |
| 8 | Edge Intelligence | Federated learning, fleet coordination, on-device adaptation |

Part III: Production Challenges

Real-world complexities of operating ML systems.

| Ch | Title | Purpose |
|----|-------|---------|
| 9 | Privacy & Security | Differential privacy, secure computation, attacks |
| 10 | Robust AI | Adversarial robustness, distribution shift |
| 11 | ML Ops at Scale | Advanced MLOps, platform engineering |
| 12 | Sustainable AI | Environmental impact, efficient computing |

Part IV: Responsible Deployment

Building ML systems that benefit society.

| Ch | Title | Purpose |
|----|-------|---------|
| 13 | Responsible AI | Fairness, accountability, transparency |
| 14 | AI for Good | Applications for societal benefit |
| 15 | Frontiers | Emerging trends, open problems |
| 16 | Conclusion | Synthesis, future of the field |

Key Design Decisions

Why This Split?

  1. Pedagogical Progression: Volume I covers what every ML practitioner needs. Volume II covers what scale/production engineers need.

  2. Course Adoptability: Volume I maps to a single semester intro course. Volume II maps to an advanced graduate seminar.

  3. Standalone Completeness: A reader of only Volume I gets responsible engineering awareness through Chapter 15 (Responsible Engineering).

  4. Industry Alignment: Volume I produces capable junior engineers. Volume II produces senior/staff-level systems thinkers.

The Hennessy & Patterson Test

When deciding where content belongs, ask: What is the SCOPE of the system being discussed?

| Aspect | Volume I | Volume II |
|--------|----------|-----------|
| Scope | Single-machine systems (1-8 GPUs) | Multi-machine distributed systems |
| Math & Theory | Full rigor, derivations | Full rigor, derivations |
| Performance Metrics | Single-system analysis | Scaling/efficiency analysis |
| Code Examples | Single-node implementations | Multi-node implementations |
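
The scope test can be written down as a small decision rule. The sketch below is illustrative only: `Topic`, `machines`, and `assign_volume` are hypothetical names that are not part of the book's tooling, and it assumes the number of machines a system spans is the sole deciding factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """Hypothetical description of a piece of candidate content."""
    name: str
    machines: int  # number of machines the described system spans


def assign_volume(topic: Topic) -> str:
    """Apply the scope test: single-machine content belongs in Volume I,
    multi-machine content belongs in Volume II."""
    if topic.machines < 1:
        raise ValueError("a system must span at least one machine")
    return "Volume I" if topic.machines == 1 else "Volume II"


if __name__ == "__main__":
    for topic in (Topic("Quantization on one node", 1),
                  Topic("AllReduce across a cluster", 64)):
        print(f"{topic.name}: {assign_volume(topic)}")
```

Under this rule a single node with up to 8 GPUs still counts as one machine, so it stays in Volume I. Real placement decisions would likely weigh other rows of the table as well.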

Summary Statistics

| Metric | Volume I | Volume II |
|--------|----------|-----------|
| Chapters | 16 | 16 |
| Parts | 4 | 4 |
| Focus | Single system | Distributed systems |
| Prerequisite | None | Volume I |

Document Version: January 2025
Reflects current implementation in _quarto-html.yml