:orphan:

.. _gpu_faq:

GPU training (FAQ)

How should I adjust the batch size when using multiple devices?

Lightning automatically shards your data across multiple GPUs, meaning that each device only sees a unique subset of your data, but the batch_size in your DataLoader remains the same. This means that the effective batch size e.g. the total number of samples processed in one forward/backward pass is

.. math::

\text{Effective Batch Size} = \text{DataLoader Batch Size} \times \text{Number of Devices} \times \text{Number of Nodes}

A couple of examples to illustrate this:

.. code-block:: python

dataloader = DataLoader(..., batch_size=7)

# Single GPU: effective batch size = 7
Trainer(accelerator="gpu", devices=1)

# Multi-GPU: effective batch size = 7 * 8 = 56
Trainer(accelerator="gpu", devices=8, strategy=...)

# Multi-node: effective batch size = 7 * 8 * 10 = 560
Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy=...)

In general you should be able to use the same batch_size in your DataLoader regardless of the number of devices you are using.

.. note::

If you want distributed training to work exactly the same as single GPU training, you need to set the `batch_size`
in your DataLoader to `original_batch_size / num_devices` to maintain the same effective batch size. However, this
can lead to poor GPU utilization.

How should I adjust the learning rate when using multiple devices?

Because the effective batch size is larger when using multiple devices, you need to adjust your learning rate accordingly. Because the learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated, it is important to scale it with the effective batch size.

In general, there are two common scaling rules:

Linear scaling: Increase the learning rate linearly with the number of devices.

.. code-block:: python

 # Example: Linear scaling
 base_lr = 1e-3
 num_devices = 8
 scaled_lr = base_lr * num_devices  # 8e-3

Square root scaling: Increase the learning rate by the square root of the number of devices.

.. code-block:: python

 # Example: Square root scaling
 base_lr = 1e-3
 num_devices = 8
 scaled_lr = base_lr * (num_devices ** 0.5)  # 2.83e-3

.. note:: Huge batch sizes are actually really bad for convergence. Check out: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>_

How do I use multiple GPUs on Jupyter or Colab notebooks?

To use multiple GPUs on notebooks, use the DDP_NOTEBOOK mode.

.. code-block:: python

Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook")

If you want to use other strategies, please launch your training via the command-shell. See also: :doc:../../common/notebooks

I'm getting errors related to Pickling. What do I do?

Pickle is Python's mechanism for serializing and unserializing data. Some distributed modes require that your code is fully pickle compliant. If you run into an issue with pickling, try the following to figure out the issue.

.. code-block:: python

import pickle

model = YourModel()
pickle.dumps(model)

For example, the ddp_spawn strategy has the pickling requirement. This is a limitation of Python.

.. code-block:: python

Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn")

If you use ddp, your code doesn't need to be pickled:

.. code-block:: python

Trainer(accelerator="gpu", devices=4, strategy="ddp")