docs/source/en/concepts/distributed_training.md
Author: Shenggui Li, Siqi Mai
A distributed system consists of multiple software components which run on multiple machines. For example, the traditional database runs on a single machine. As the amount of data gets incredibly large, a single machine can no longer deliver desirable performance to the business, especially in situations such as Black Friday where network traffic can be unexpectedly high. To handle such pressure, modern high-performance database is designed to run on multiple machines, and they work together to provide high throughput and low latency to the user.
One important evaluation metric for distributed system is scalability. For example, when we run an application on 4 machines, we naturally expect that the application can run 4 times faster. However, due to communication overhead and difference in hardware performance, it is difficult to achieve linear speedup. Thus, it is important to consider how to make the application faster when we implement it. Algorithms of good design and system optimization can help to deliver good performance. Sometimes, it is even possible to achieve linear and super-linear speedup.
Back in 2012, AlexNet won the champion of the ImageNet competition, and it was trained on two GTX 580 3GB GPUs. Today, most models that appear in the top AI conferences are trained on multiple GPUs. Distributed training is definitely a common practice when researchers and engineers develop AI models. There are several reasons behind this trend.
Dataset size increases rapidly. For most machine learning developers, MNIST and CIFAR10 datasets are often the first few datasets on which they train their models. However, these datasets are very small compared to well-known ImageNet datasets. Google even has its own (unpublished) JFT-300M dataset which has around 300 million images, and this is close to 300 times larger than the ImageNet-1k dataset.
Computing power gets stronger. With the advancement in the semiconductor industry, graphics cards become more and more powerful. Due to its larger number of cores, GPU is the most common compute platform for deep learning. From K10 GPU in 2012 to A100 GPU in 2020, the computing power has increased several hundred times. This allows us to performance compute-intensive tasks faster and deep learning is exactly such a task.
Nowadays, the model can be too large to fit into a single GPU, and the dataset can be large enough to train for a hundred days on a single GPU. Only by training our models on multiple GPUs with different parallelization techniques, we are able to speed up the training process and obtain results in a reasonable amount of time.
Distributed training requires multiple machines/GPUs. During training, there will be communication among these devices. To understand distributed training better, there are several important terms to be made clear.
To illustrate these concepts, let's assume we have 2 machines (also called nodes), and each machine has 4 GPUs. When we initialize distributed environment over these two machines, we essentially launch 8 processes (4 processes on each machine) and each process is bound to a GPU.
Before initializing the distributed environment, we need to specify the host (master address) and port (master port). In this example, we can let host be node 0 and port be a number such as 29500. All the 8 processes will then look for the address and port and connect to one another. The default process group will then be created. The default process group has a world size of 8 and details are as follows:
| process ID | rank | Node index | GPU index |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 2 |
| 3 | 3 | 0 | 3 |
| 4 | 4 | 1 | 0 |
| 5 | 5 | 1 | 1 |
| 6 | 6 | 1 | 2 |
| 7 | 7 | 1 | 3 |
We can also create a new process group. This new process group can contain any subset of the processes. For example, we can create one containing only even-number processes, and the details of this new group will be:
| process ID | rank | Node index | GPU index |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 |
| 4 | 2 | 1 | 0 |
| 6 | 3 | 1 | 2 |
Please note that rank is relative to the process group and one process can have a different rank in different process
groups. The max rank is always world size of the process group - 1.
In the process group, the processes can communicate in two ways: