doc/source/serve/architecture.md
(serve-architecture)=
# Architecture

This section explores Serve's key architectural concepts and components: how Serve runs on top of Ray, how requests flow through the system, and how Serve achieves fault tolerance and autoscaling.
% Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing
(serve-architecture-high-level-view)=

## High-Level View
Serve runs on Ray and utilizes Ray actors.
There are three kinds of actors that are created to make up a Serve instance:

- **Controller**: A global actor unique to each Serve instance that manages the control plane. It is responsible for creating, updating, and destroying the other actors, and Serve API calls such as creating or updating a deployment are made as remote calls to the Controller.
- **Proxies**: The HTTP proxy actor runs an HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed. You can configure Serve to start one proxy actor per node with the `proxy_location` field inside `serve.start()` or the config file. If you additionally configure `port` and `grpc_servicer_functions`, then a gRPC proxy is started alongside the HTTP proxy. This actor runs a grpcio server that likewise accepts incoming requests, forwards them to replicas, and responds once they are completed. A configuration sketch follows this list.
- **Replicas**: Actors that actually execute the code in response to requests; for example, a replica might hold an instantiated ML model. Each replica processes requests from the proxies, and requests can be batched on the replica using `@serve.batch`. See the batching docs.
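The following is a minimal sketch of this configuration using `serve.start()`. The `proxy_location` value, host, and ports are illustrative, and the gRPC servicer function path is a hypothetical name standing in for the `add_*_to_server` function generated from your own protobuf definitions.

```python
from ray import serve

# A sketch: run one proxy actor per node and start the gRPC proxy alongside
# the HTTP proxy. The servicer function path is hypothetical; use the
# `add_*_to_server` function generated from your own .proto file.
serve.start(
    proxy_location="EveryNode",
    http_options={"host": "0.0.0.0", "port": 8000},
    grpc_options={
        "port": 9000,
        "grpc_servicer_functions": [
            "my_service_pb2_grpc.add_MyServiceServicer_to_server",  # hypothetical
        ],
    },
)
```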
## Lifetime of a Request

When an HTTP or gRPC request is sent to the corresponding HTTP or gRPC proxy, the following happens:

1. The request is received and parsed.
2. The correct deployment is looked up and the request is placed on that deployment's queue.
3. For each request in a deployment's queue, an available replica is looked up and the request is sent to it. If there are no available replicas (that is, more than `max_ongoing_requests` requests are outstanding at each replica), the request is left in the queue until a replica becomes available.

Each replica maintains a queue of requests and executes requests one at a time, possibly using `asyncio` to process them concurrently. If the handler (the deployment function or the `__call__` method of the deployment class) is declared with `async def`, the replica doesn't wait for the handler to run to completion; otherwise, the replica blocks until the handler returns.
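As an illustration of the difference, here is a minimal sketch of a blocking handler and an `async def` handler; the `asyncio.sleep` call stands in for any awaitable work such as a downstream call.

```python
import asyncio

from ray import serve


@serve.deployment
class SyncHandler:
    def __call__(self, request) -> str:
        # Not declared with `async def`: the replica blocks until this
        # returns, so queued requests run strictly one at a time.
        return "done"


@serve.deployment
class AsyncHandler:
    async def __call__(self, request) -> str:
        # Declared with `async def`: the replica doesn't wait for the handler
        # to finish, so other queued requests can be processed concurrently
        # on the replica's asyncio event loop while this coroutine awaits.
        await asyncio.sleep(1)
        return "done"
```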
When making a request via a `DeploymentHandle` instead of HTTP or gRPC (for model composition), the request is placed on a queue in the `DeploymentHandle`, and the process skips directly to step 3 above.
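For example, a minimal model-composition sketch might look like the following; the deployment names and the string-processing logic are placeholders.

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Preprocessor:
    def __call__(self, text: str) -> str:
        return text.strip().lower()


@serve.deployment
class Ingress:
    def __init__(self, preprocessor: DeploymentHandle):
        self._preprocessor = preprocessor

    async def __call__(self, http_request) -> str:
        text: str = await http_request.json()
        # The call below is queued inside the DeploymentHandle and routed
        # straight to a Preprocessor replica (step 3 above), bypassing the proxy.
        return await self._preprocessor.remote(text)


# `serve.run(app)` would deploy both deployments as one application.
app = Ingress.bind(Preprocessor.bind())
```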
(serve-ft-detail)=

## Fault tolerance
Application errors, like exceptions in your model evaluation code, are caught and wrapped. A 500 status code is returned with the traceback information, and the replica can continue to handle requests.
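As a rough sketch of this behavior, a deployment that raises an exception stays alive, and the client sees a 500 response (the exact body format may vary by Ray version):

```python
import requests
from ray import serve


@serve.deployment
class FailingModel:
    def __call__(self, request):
        # Any exception raised here is caught and wrapped by Serve.
        raise ValueError("model evaluation failed")


serve.run(FailingModel.bind())

resp = requests.get("http://localhost:8000/")
print(resp.status_code)  # 500; the response body contains the traceback
```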
Machine errors and faults are handled by Ray Serve as follows: when a machine hosting any of the actors crashes, those actors are automatically restarted on another available machine. All data in the Controller (routing policies, deployment configurations, and so on) is checkpointed to the Ray Global Control Store (GCS) on the head node. Transient data in the router and the replicas (like network connections and internal request queues) is lost in this kind of failure. See the end-to-end fault tolerance guide for more details on how actor crashes are detected.
(serve-autoscaling-architecture)=

## Ray Serve Autoscaling
Ray Serve's autoscaling feature automatically increases or decreases a deployment's number of replicas based on its load.
The autoscaling mechanism works as follows:

1. Each `DeploymentHandle` and each replica periodically pushes its metrics to the autoscaler.
2. For each deployment, the autoscaler periodically checks the `DeploymentHandle` queues and the in-flight queries on replicas to decide whether or not to scale the number of replicas.
3. Each `DeploymentHandle` continuously polls the controller to check for new deployment replicas. Whenever new replicas are discovered, it sends any buffered or new queries to the replica until `max_ongoing_requests` is reached. Queries are sent to replicas using a power of two choices scheduling strategy, subject to the constraint that no replica is handling more than `max_ongoing_requests` requests at a time.

:::{note}
When the controller dies, requests can still be sent via HTTP, gRPC, and `DeploymentHandle`, but autoscaling is paused. When the controller recovers, autoscaling resumes, but all previously collected metrics are lost.
:::
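Autoscaling is configured per deployment. A minimal sketch is shown below; the field names assume a recent Ray release, and the specific thresholds are arbitrary examples.

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 5,
        # Target load used by the autoscaler; the value 2 is an arbitrary example.
        "target_ongoing_requests": 2,
    },
    # Hard cap on outstanding requests per replica, as described above.
    max_ongoing_requests=5,
)
class AutoscalingModel:
    def __call__(self, request) -> str:
        return "ok"


app = AutoscalingModel.bind()
```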
## Ray Serve API Server

Ray Serve provides a CLI for managing your Ray Serve instance, as well as a REST API. Each node in your Ray cluster provides a Serve REST API server that can connect to Serve and respond to Serve REST requests.
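For instance, assuming the Ray dashboard (which hosts the Serve REST API) is reachable at its default local address, you could query the REST API from Python as follows; the endpoint path reflects the Ray 2.x REST API and should be checked against your Ray version.

```python
import requests

# Lists the currently deployed Serve applications and their statuses.
resp = requests.get("http://localhost:8265/api/serve/applications/")
print(resp.json())
```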
## FAQ

### How does Serve ensure horizontal scalability and availability?

You can configure Serve to start one proxy actor per node with the `proxy_location` field inside `serve.start()` or the config file. Each proxy binds to the same port, so you can reach Serve and send requests to any model through any of the proxies. You can also use your own load balancer on top of Ray Serve.
This architecture ensures horizontal scalability for Serve: you can scale your HTTP and gRPC ingress by adding more nodes, and you can scale your model inference by increasing the number of replicas via the `num_replicas` option of your deployment.
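For example, a sketch of statically scaling a deployment to three replicas:

```python
from ray import serve


@serve.deployment(num_replicas=3)  # three replicas share the request load
class ScaledModel:
    def __call__(self, request) -> str:
        return "ok"


app = ScaledModel.bind()
```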
### How do DeploymentHandles work?

{mod}`DeploymentHandles <ray.serve.handle.DeploymentHandle>` wrap a handle to a "router" on the same node, which routes requests to replicas for a deployment. When a request is sent from one replica to another via the handle, the request goes through the same data path as incoming HTTP or gRPC requests. This enables the same deployment selection and batching procedures to happen. `DeploymentHandles` are often used to implement model composition.
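A handle can also be obtained outside a deployment. The sketch below assumes an application named "default" is already running and its ingress accepts a plain string as input:

```python
from ray import serve

# Assumes a running application named "default" whose ingress accepts a string.
handle = serve.get_app_handle("default")

# The request goes through the handle's router and the same data path as
# HTTP or gRPC traffic, then to a replica of the ingress deployment.
response = handle.remote("some input")
print(response.result())
```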
### What happens to large requests?

Serve utilizes Ray's shared-memory object store and in-process memory store. Small request objects are sent directly between actors via a network call. Larger request objects (100KiB+) are written to the object store, and the replica can read them via a zero-copy read.
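A rough sketch of both cases, assuming a deployment that accepts NumPy arrays via a `DeploymentHandle`; `serve.run` returns a handle to the ingress deployment.

```python
import numpy as np
from ray import serve


@serve.deployment
class ArraySummer:
    def __call__(self, arr: np.ndarray) -> float:
        # Arrays larger than ~100 KiB arrive via a zero-copy read from the
        # shared-memory object store; smaller ones are sent directly.
        return float(arr.sum())


handle = serve.run(ArraySummer.bind())

small = np.zeros(16)          # ~128 bytes: sent directly between actors
large = np.zeros(1_000_000)   # ~8 MB: written to the object store
print(handle.remote(small).result(), handle.remote(large).result())
```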