docs/source-pytorch/clouds/cluster_intermediate_1.rst
:orphan:
########################################
Run on an on-prem cluster (intermediate)
########################################

**Audience: Users who need to run on an academic or enterprise private cluster.**

.. _non-slurm:

******************
Set up the cluster
******************

This guide shows how to run a training job on a general-purpose cluster. We recommend that beginners try this method first because it requires the least amount of configuration and the fewest code changes. To set up a multi-node computing cluster you need:

1. Multiple computers with PyTorch Lightning installed
2. Network connectivity between them, with firewall rules that allow traffic on a specified *MASTER_PORT*
3. The environment variables described below defined on each node

PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node:

- *MASTER_ADDR* - required (except on the NODE_RANK 0 node); address of the NODE_RANK 0 node
- *MASTER_PORT* - required; a free port on the NODE_RANK 0 node
- *WORLD_SIZE* - required; the number of nodes in the cluster
- *NODE_RANK* - required; the id of this node within the cluster
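
For illustration, here is a minimal sketch of how these variables might be exported on each node before launching the training script. The address ``10.10.10.1``, the port ``29500``, and the 4-node cluster size are placeholders for your own values.

.. code-block:: bash

    # run on EVERY node, changing NODE_RANK per node (0, 1, 2, 3)
    export MASTER_ADDR=10.10.10.1   # address of the NODE_RANK 0 node
    export MASTER_PORT=29500        # a free port on the NODE_RANK 0 node
    export WORLD_SIZE=4             # number of nodes in the cluster
    export NODE_RANK=0              # this node's id: 0 on the first node, 1 on the second, ...
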
.. _training_script_setup:

***************************
Set up the training script
***************************

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the trainer

   .. code-block:: python

       # train on 32 GPUs across 4 nodes (8 GPUs per node)
       trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")

***************************
Submit a job to the cluster
***************************

To submit a training job to the cluster you need to run the same training script on each node of the cluster. This means that you need to:

1. Copy all third-party libraries to each node (usually this means distributing a ``requirements.txt`` file and installing it).
2. Copy your own import dependencies and the training script itself to each node.
3. Run the script on each node.
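
As a sketch only, the loop below launches the same script on every node from a head node over SSH. The hostnames ``node0``-``node3``, the path ``/home/user/project/train.py``, and the port are hypothetical, and the example assumes passwordless SSH and an identical Python environment on every node.

.. code-block:: bash

    #!/bin/bash
    # launch the same training script once per node, assigning one NODE_RANK to each
    NODES=(node0 node1 node2 node3)
    MASTER_ADDR=node0
    MASTER_PORT=29500

    for RANK in "${!NODES[@]}"; do
        ssh "${NODES[$RANK]}" "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT WORLD_SIZE=${#NODES[@]} NODE_RANK=$RANK python /home/user/project/train.py" &
    done
    wait   # block until training on all nodes has finished
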
******************
Debug on a cluster
******************

When running in DDP mode, some errors in your code can show up as an NCCL issue.
Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.

.. code-block:: bash

    NCCL_DEBUG=INFO python train.py ...
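
If the ``INFO``-level log is too noisy to sift through, NCCL also lets you restrict the output to particular subsystems with the ``NCCL_DEBUG_SUBSYS`` variable (see the NCCL documentation for the accepted values); the invocation below, which limits the log to initialization and networking messages, is just one possible combination.

.. code-block:: bash

    NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET python train.py ...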