Back to Pytorch

Rendezvous

docs/source/elastic/rendezvous.rst

2.11.02.8 KB
Original Source

.. _rendezvous-api:

Rendezvous

.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. image:: etcd_rdzv_diagram.png

Registry

.. autoclass:: RendezvousParameters :members:

.. autoclass:: RendezvousHandlerRegistry

.. automodule:: torch.distributed.elastic.rendezvous.registry

Handler

.. currentmodule:: torch.distributed.elastic.rendezvous

.. autoclass:: RendezvousHandler :members:

Dataclasses

.. autoclass:: RendezvousInfo

.. currentmodule:: torch.distributed.elastic.rendezvous.api

.. autoclass:: RendezvousStoreInfo

.. automethod:: build(rank, store)

Exceptions

.. autoclass:: RendezvousError .. autoclass:: RendezvousClosedError .. autoclass:: RendezvousTimeoutError .. autoclass:: RendezvousConnectionError .. autoclass:: RendezvousStateError .. autoclass:: RendezvousGracefulExitError

Implementations

Dynamic Rendezvous


.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous

.. autofunction:: create_handler

.. autoclass:: DynamicRendezvousHandler() :members: from_backend

.. autoclass:: RendezvousBackend :members:

.. autoclass:: RendezvousTimeout :members:

C10d Backend ^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: C10dRendezvousBackend :members:

Etcd Backend ^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: EtcdRendezvousBackend :members:

Etcd Rendezvous (Legacy)


.. warning:: The DynamicRendezvousHandler class supersedes the EtcdRendezvousHandler class, and is recommended for most users. EtcdRendezvousHandler is in maintenance mode and will be deprecated in the future.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous

.. autoclass:: EtcdRendezvousHandler

Etcd Store


The EtcdStore is the C10d Store instance type returned by next_rendezvous() when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore :members:

Etcd Server


The EtcdServer is a convenience class that makes it easy for you to start and stop an etcd server on a subprocess. This is useful for testing or single-node (multi-worker) deployments where manually setting up an etcd server on the side is cumbersome.

.. warning:: For production and multi-node deployments please consider properly deploying a highly available etcd server as this is the single point of failure for your distributed jobs.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server

.. autoclass:: EtcdServer