doc/source/ray-core/internals/port-service-discovery.rst
.. _ray-port-service-discovery:
This document describes Ray's dynamic port assignment and discovery.
When a port is not explicitly specified by the user or start script, Ray uses a bind-then-report pattern: components first bind to a random port, then report the actual port.
This avoids the TOCTOU (Time-Of-Check-Time-Of-Use) race condition: if Ray preassigned a port and passed it to a component, another process might bind to that port before the component does.
Ray uses two discovery mechanisms based on process relationships:
.. list-table:: :header-rows: 1
Raylet spawns child processes and needs to discover their ports. The mechanism depends on the language boundary.
File-Based: Raylet ↔ Agents
Raylet (C++) spawns Agents (Python). Cross-language communication is needed, and
files are the simplest solution that works reliably across both platforms (Linux/Windows)
and languages (C++/Python). Raylet spawns two agents:
1. **Dashboard Agent** - exposes three ports:
- ``dashboard_agent_listen_port``: HTTP for Dashboard UI (default: 52365)
- ``metrics_agent_port``: gRPC for internal communication (default: random)
- ``metrics_export_port``: Prometheus metrics export (default: random)
2. **Runtime Env Agent** - manages runtime environments:
- ``runtime_env_agent_port``: gRPC (default: random)
After binding, agents write their ports to
``{session_dir}/{port_name}_{node_id_hex}`` (see `port_persistence.h <https://github.com/ray-project/ray/blob/master/src/ray/util/port_persistence.h>`_).
Raylet polls these files and waits for all agent ports before registering the node to GCS.
IPC-Based: Raylet ↔ Core Workers
Raylet (C++) spawns Core Workers (C++). Same language, so they use socket-based IPC with Flatbuffers protocol. On Linux/macOS this is a Unix domain socket; on Windows it's a TCP socket on localhost.
Default (no port range):
Worker binds to port 0 (OS picks a random port), then tells Raylet the actual port.
With port range ( --min-worker-port / --max-worker-port ):
OS only supports binding to a specific port or port 0 (random). There's no syscall for "bind to any port within this range". So Raylet must manage the range itself.
Raylet maintains a free_ports_ queue. When a worker registers, Raylet assigns
it an unused port from the queue. The worker binds, then confirms with AnnounceWorkerPort.
If the worker fails to bind the port (e.g., port already in use by external process),
it crashes. Raylet detects the socket disconnect via
NodeManager::HandleClientConnectionError <https://github.com/ray-project/ray/blob/10869d565047ae02b398802e1efaf04109f27249/src/ray/raylet/node_manager.h>_, which returns the port to the queue and starts a new worker.
Raylet Internal Port Discovery only serves Raylet discovering its own children's ports. For everything else they query GCS.
Two common examples:
Node Table
Each Raylet registers a GcsNodeInfo to GCS, containing its own ports
(``node_manager_port``, ``object_manager_port``) and agent ports
(``runtime_env_agent_port``, ``metrics_agent_port``, ``dashboard_agent_listen_port``).
Other components query GCS for this information:
- Object Manager needs ``object_manager_port`` of remote nodes to pull objects across nodes
- Core Worker needs ``node_manager_port`` of remote nodes for task cancellation, and object recovery
- Dashboard needs ``runtime_env_agent_port`` of each node to collect runtime env info
- Ray Client Server needs ``runtime_env_agent_port`` to set up runtime environments for client jobs
- ...
Actor Table
Actors are stateful—callers must reach the same worker every time. Actor method calls require direct RPC to a specific worker.
When an actor is created, its worker address (address.ip_address and address.port)
is registered to GCS via GcsActorManager::HandleRegisterActor <https://github.com/ray-project/ray/blob/10869d565047ae02b398802e1efaf04109f27249/src/ray/gcs/gcs_actor_manager.h>_.
Callers query GCS to get the address, then communicate directly with the worker.