tensorflow/python/distribute/cluster_resolver/README_Slurm.md
The Slurm Cluster Resolver resolves cluster specification for distributing TensorFlow work launched on HPC systems running on Slurm. This implementation is able to handle homogeneous and heterogeneous tasks as long as the number of GPUs per node and task are the same. This means on nodes with 4 GPUs each it will be possible to allocate 4 processes on node A and only 2 on node B. The resolution is done by determining job configuration through a number of Slurm variables and can be overwritten by user input. By default everything is determined from the Slurm environment, hence for most uses case no manual setting of parameters is required.
SlurmClusterResolver reads the environment variables that are set inside a job
step launched by Slurm. This means it will only work correctly for applications
launched via srun.
The process ID/rank is extracted from environment variable SLURM_PROCID and
the total number of tasks launched is extracted from SLURM_STEP_NUM_TASKS. The
hostnames are resolved by inspection SLURM_STEP_NODELIST. The number of tasks
per node is extracted from SLURM_STEP_TASKS_PER_NODE, unless a value is
specified by user. By using this variable heterogeneous task distributions are
possible. The user can set tasks_per_node to a single integer for homogeneous
tasks or a dictionary mapping node names to number of tasks for heterogeneous
distributions. However setting this is NOT recommended as there is a chance
it makes SLURM_PROCID be wrong.
A base port can be specified by user and in case there are more than one task launched per node the port number will be incremented for each additional tasks on that node. However a reasonable default is used.
The number of GPUs present on each node and number of GPUs for each tasks are
automatically detected. This is done by checking for CUDA_VISIBLE_DEVICES
first (which is set by Slurm to a list of GPUs for the current node) and has a
fallback to using nvidia-smi. If this doesn't work or non-NVIDIA GPUs are used
those 2 values have to be specified by the user. By default allocated GPUs will
be automatically exposed to processes according to specification by setting
CUDA_VISIBLE_DEVICES.
salloc --nodes=2 -t 01:30:00 --ntasks-per-node=2 --gres=gpu:k80:4 --exclusivesrun python tf_example.pyimport tensorflow as tf cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver() strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(cluster_resolver=cluster_resolver) with strategy.scope(): # Load and compile model and dataThe above example will allocate 4 jobs on 2 nodes with each node having 2 jobs
and 4 GPUs. cluster_resolver.cluster_spec() will return a cluster
specification object in protobuf format with the following value (host names may
vary): job { name: "worker" tasks { key: 0 value: "t02n13:8888" } tasks { key: 1 value: "t02n13:8889" } tasks { key: 2 value: "t02n41:8888" } tasks { key: 3 value: "t02n41:8889" } }
The job_name will be worker for all nodes and task_index will be 0 to
3. Also GPUs will be allocated automatically, so the first job on each node
will see GPU 0 and 1, and the second GPU 2 and 3.
salloc & srun) as abovecluster = cluster_resolver.cluster_spec() job_name, task_index = cluster_resolver.get_task_info() ```
In this case 1 parameter server job and 3 worker jobs are used. The resulting
protobuf specification will look similar to this: job { name: "ps" tasks { key: 0 value: "t02n13:1337" } } job { name: "worker" tasks { key: 0 value: "t02n13:1338" } tasks { key: 1 value: "t02n41:1337" } tasks { key: 2 value: "t02n41:1338" } }
The value of job_name will be ps for t02n13:1337 and worker for all
others. There will be no GPU allocation done by the cluster resolver, so this
has to be done manually which is useful if e.g. GPUs 0 should go to the first
process and GPU 3 to the second process on each node. Also note that only 1 GPU
will be used per task.
The class SlurmClusterResolver provides some methods that are meant to be
overwritten by deriving classes:
_resolve_own_rank
_resolve_num_tasks
_resolve_hostlist
_resolve_task_configuration
Those can be used to implement a cluster resolver that gets information from a different source, e.g. via MPI, a file or other environment variables. See the documentation of these methods on what to return.