.. _train_scaling_config:

Configuring Scale and GPUs
==========================

Increasing the scale of a Ray Train training run is simple and can be done in a few lines of code.
The main interface for this is the :class:`~ray.train.ScalingConfig`,
which configures the number of workers and the resources they should use.

In this guide, a worker refers to a Ray Train distributed training worker,
which is a :ref:`Ray Actor <actor-key-concept>` that runs your training function.

Increasing the number of workers
--------------------------------

The main interface to control parallelism in your training code is to set the
number of workers. This can be done by passing the ``num_workers`` attribute to
the :class:`~ray.train.ScalingConfig`:
.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=8
    )
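
A :class:`~ray.train.ScalingConfig` on its own doesn't launch anything; you pass it to a
trainer, which starts the workers when you call ``fit()``. As a minimal sketch (assuming a
``train_func`` that contains your training loop):

.. testcode::
    :skipif: True

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        # Your training loop goes here. Each of the 8 workers
        # runs this function in its own Ray actor.
        ...

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=8),
    )
    trainer.fit()
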
To use GPUs, pass ``use_gpu=True`` to the :class:`~ray.train.ScalingConfig`.
This will request one GPU per training worker. In the example below, training will
run on 8 GPUs (8 workers, each using one GPU).
.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=8,
        use_gpu=True
    )
Using GPUs in the training function
-----------------------------------

When ``use_gpu=True`` is set, Ray Train will automatically set up environment variables
in your training function so that the GPUs can be detected and used
(e.g. ``CUDA_VISIBLE_DEVICES``).

You can get the associated device with :meth:`ray.train.torch.get_device`.
.. testcode::

    import torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, get_device

    def train_func():
        assert torch.cuda.is_available()

        device = get_device()
        assert device == torch.device("cuda:0")

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=1,
            use_gpu=True,
        ),
    )
    trainer.fit()
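
In a real training function, you'd then move your model and data to this device. With PyTorch,
:meth:`ray.train.torch.prepare_model` can handle the device placement for you; a small sketch
showing both options:

.. testcode::
    :skipif: True

    import torch
    from ray.train.torch import get_device, prepare_model

    def train_func():
        model = torch.nn.Linear(10, 1)

        # Either move the model to this worker's device manually ...
        model.to(get_device())

        # ... or let Ray Train place it on the right device (and wrap it
        # for distributed training) in one call.
        model = prepare_model(model)
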
Assigning multiple GPUs to a worker
-----------------------------------

Sometimes you might want to allocate multiple GPUs to a worker. For example,
you can specify ``resources_per_worker={"GPU": 2}`` in the :class:`~ray.train.ScalingConfig`
if you want to assign 2 GPUs to each worker.

You can get the list of associated devices with :meth:`ray.train.torch.get_devices`.
.. testcode::

    import torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, get_device, get_devices

    def train_func():
        assert torch.cuda.is_available()

        device = get_device()
        devices = get_devices()
        assert device == torch.device("cuda:0")
        assert devices == [torch.device("cuda:0"), torch.device("cuda:1")]

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=1,
            use_gpu=True,
            resources_per_worker={"GPU": 2},
        ),
    )
    trainer.fit()
Setting the GPU type
--------------------

Ray Train allows you to specify the accelerator type for each worker.
This is useful if you want to use a specific accelerator type for model training.
In a heterogeneous Ray cluster, this means that your training workers will be forced to run on the specified GPU type,
rather than on any arbitrary GPU node. You can find the supported values of ``accelerator_type`` in
:ref:`the available accelerator types <accelerator_types>`.

For example, you can specify ``accelerator_type="A100"`` in the :class:`~ray.train.ScalingConfig` if you want to
assign each worker an NVIDIA A100 GPU.
.. tip::

    Ensure that your cluster has instances with the specified accelerator type
    or is able to autoscale to fulfill the request.
.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=True,
        accelerator_type="A100"
    )
(PyTorch) Setting the communication backend
-------------------------------------------

PyTorch Distributed supports `multiple backends <https://pytorch.org/docs/stable/distributed.html#backends>`__
for communicating tensors across workers. By default, Ray Train uses NCCL when ``use_gpu=True`` and Gloo otherwise.

If you want to explicitly override this setting, you can configure a :class:`~ray.train.torch.TorchConfig`
and pass it into the :class:`~ray.train.torch.TorchTrainer`.
.. testcode::
    :hide:

    num_training_workers = 1
.. testcode::

    from ray.train.torch import TorchConfig, TorchTrainer

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=num_training_workers,
            use_gpu=True,  # Defaults to NCCL
        ),
        torch_config=TorchConfig(backend="gloo"),
    )
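
Overriding the backend this way is mostly useful for debugging: NCCL is typically much faster
than Gloo for GPU-to-GPU communication, so for GPU training you'll usually want to keep the
NCCL default.
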
(NCCL) Setting the communication network interface
--------------------------------------------------

When using NCCL for distributed training, you can configure the network interface cards
that are used for communicating between GPUs by setting the
`NCCL_SOCKET_IFNAME <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname>`__
environment variable.

To ensure that the environment variable is set for all training workers, you can pass it
in a :ref:`Ray runtime environment <runtime-environments>`:
.. testcode::
    :skipif: True

    import ray

    runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": "ens5"}}
    ray.init(runtime_env=runtime_env)

    trainer = TorchTrainer(...)

Setting the resources per worker
--------------------------------

If you want to allocate more than one CPU or GPU per training worker, or if you
defined :ref:`custom cluster resources <cluster-resources>`, set
the ``resources_per_worker`` attribute:
.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=8,
        resources_per_worker={
            "CPU": 4,
            "GPU": 2,
        },
        use_gpu=True,
    )
.. note::

    If you specify GPUs in ``resources_per_worker``, you also need to set
    ``use_gpu=True``.
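
``resources_per_worker`` also accepts custom resources. As a sketch, assuming your cluster
nodes advertise a custom resource named ``my_custom_resource`` (a hypothetical name used here
for illustration), each worker can request one unit of it:

.. testcode::
    :skipif: True

    from ray.train import ScalingConfig

    # Schedule each worker on a node that can provide 4 CPUs and 1 unit
    # of the (hypothetical) custom resource "my_custom_resource".
    scaling_config = ScalingConfig(
        num_workers=8,
        resources_per_worker={
            "CPU": 4,
            "my_custom_resource": 1,
        },
    )
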
You can also instruct Ray Train to use fractional GPUs. In that case, multiple workers
will be assigned the same CUDA device.

.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=8,
        resources_per_worker={
            "CPU": 4,
            "GPU": 0.5,
        },
        use_gpu=True,
    )
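
With ``"GPU": 0.5``, two workers end up on each CUDA device. Keep in mind that Ray doesn't
enforce memory isolation between workers sharing a GPU, so make sure the combined memory
usage of all workers on a device fits into its GPU memory.
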
(Deprecated) Trainer resources
------------------------------
.. important::

    This API is deprecated. See `this migration guide <https://github.com/ray-project/ray/issues/49454>`_ for more details.
So far we've configured resources for each training worker. Technically, each
training worker is a :ref:`Ray Actor <actor-guide>`. Ray Train also schedules
an actor for the trainer object when you call ``trainer.fit()``.

This object often only manages lightweight communication between the training workers.
By default, the trainer uses 1 CPU. If you have a cluster with 8 CPUs and want
to start 4 training workers with 2 CPUs each, this will not work, as the total number
of required CPUs will be 9 (4 * 2 + 1). In that case, you can specify the trainer
resources to use 0 CPUs:
.. testcode::

    from ray.train import ScalingConfig

    scaling_config = ScalingConfig(
        num_workers=4,
        resources_per_worker={
            "CPU": 2,
        },
        trainer_resources={
            "CPU": 0,
        },
    )