(serve-autoscaling)=
# Ray Serve Autoscaling
Each Ray Serve deployment has one replica by default. This means there is one worker process running the model and serving requests. When traffic to your deployment increases, the single replica can become overloaded. To maintain high performance of your service, you need to scale out your deployment.
Before jumping into autoscaling, which is more complex, consider the simpler option: manual scaling. You can increase the number of replicas by setting a higher value for `num_replicas` in the deployment options through an in-place update. By default, `num_replicas` is 1. Increasing the number of replicas horizontally scales out your deployment, improving latency and throughput under increased traffic.
```yaml
# Deploy with a single replica
deployments:
- name: Model
  num_replicas: 1
```

```yaml
# Scale up to 10 replicas
deployments:
- name: Model
  num_replicas: 10
```
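When sizing `num_replicas` by hand, a common back-of-the-envelope calculation divides expected peak traffic by per-replica throughput and adds some headroom. The following sketch illustrates that reasoning; the helper name and all numbers are illustrative, not part of the Ray API:

```python
import math

def replicas_needed(peak_qps: float, per_replica_qps: float, headroom: float = 1.2) -> int:
    """Estimate how many replicas can serve peak_qps, with some headroom.

    Illustrative helper, not a Ray Serve API. Always returns at least 1,
    matching the num_replicas default.
    """
    return max(1, math.ceil(peak_qps * headroom / per_replica_qps))

# For example: ~45 QPS at peak, ~6 QPS per replica, 20% headroom.
print(replicas_needed(45, 6))  # -> 9
```

You would then set `num_replicas: 9` in the config above and redeploy. The drawback, of course, is that you must re-estimate and update this number whenever traffic patterns change, which is the motivation for autoscaling.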
Instead of setting a fixed number of replicas for a deployment and manually updating it, you can configure a deployment to autoscale based on incoming traffic. The Serve autoscaler reacts to traffic spikes by monitoring queue sizes and making scaling decisions to add or remove replicas. Turn on autoscaling for a deployment by setting num_replicas="auto". You can further configure it by tuning the autoscaling_config in deployment options.
The example in the following section uses this config:
```yaml
- name: Model
  num_replicas: auto
```
Setting num_replicas="auto" is equivalent to the following deployment configuration.
```yaml
- name: Model
  max_ongoing_requests: 5
  autoscaling_config:
    target_ongoing_requests: 2
    min_replicas: 1
    max_replicas: 100
```
:::{note}
When you set num_replicas="auto", Ray Serve applies the defaults shown above,
including max_replicas: 100. However, if you configure autoscaling manually
without using num_replicas="auto", the base default for max_replicas is 1,
which means autoscaling won't occur unless you explicitly set a higher value.
You can override any of these defaults by specifying autoscaling_config even
when using num_replicas="auto".
:::
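Conceptually, the autoscaler steers the number of replicas so that each replica handles about `target_ongoing_requests` requests on average, clamped between `min_replicas` and `max_replicas`. The following is a simplified sketch of that decision, using the default values shown above; the real autoscaler additionally averages metrics over a look-back window and applies upscale/downscale smoothing factors, which this omits:

```python
import math

def desired_replicas(total_ongoing_requests: float,
                     target_ongoing_requests: float = 2,
                     min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Replicas needed so average ongoing requests per replica ~= target.

    Simplified model of the autoscaling decision, not Ray Serve source code.
    """
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(0))     # -> 1   (never below min_replicas)
print(desired_replicas(25))    # -> 13  (ceil(25 / 2))
print(desired_replicas(1000))  # -> 100 (capped at max_replicas)
```

This also shows why the note above matters: with the base default of `max_replicas: 1`, the clamp would pin the result to a single replica no matter how much traffic arrives.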
Let's dive into what each of these parameters does:

- `max_ongoing_requests`: the maximum number of ongoing requests a single replica accepts before Serve stops routing new requests to it.
- `target_ongoing_requests`: the average number of ongoing requests per replica that the autoscaler tries to maintain.
- `min_replicas` and `max_replicas`: lower and upper bounds on the number of replicas; autoscaling never goes outside this range.
These guidelines are a great starting point. If you decide to further tune your autoscaling config for your application, see Advanced Ray Serve Autoscaling.
(resnet-autoscaling-example)=
## Example: ResNet50
This example is a synchronous workload that runs ResNet50. The application code and its autoscaling configuration are below. Alternatively, see the second tab for specifying the autoscaling config through a YAML file.
::::{tab-set}
:::{tab-item} Application Code
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
:::
:::{tab-item} (Alternative) YAML config

```yaml
applications:
- name: default
  import_path: resnet:app
  deployments:
  - name: Model
    num_replicas: auto
```

:::
::::
This example uses Locust to run a load test against this application. The Locust load test runs a certain number of "users" that ping the ResNet50 service, where each user has a constant wait time of 0. Each user (repeatedly) sends a request, waits for a response, then immediately sends the next request. The number of users running over time is shown in the following graph:
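Because each Locust user is a closed loop with zero wait time, its request rate is bounded by response latency: a user completes roughly `duration / latency` requests. The following small model of that behavior shows how user count translates into offered load; the function and all numbers are illustrative, not Locust code:

```python
def closed_loop_requests(num_users: int, latency_s: float, duration_s: float) -> int:
    """Total requests completed by closed-loop users with zero wait time.

    Each user sends a request, waits latency_s for the response, then
    immediately sends the next, so one user finishes duration_s // latency_s
    requests over the test. Illustrative model, not Locust code.
    """
    per_user = int(duration_s // latency_s)
    return num_users * per_user

# 10 users against a service with ~250 ms latency for 60 s:
print(closed_loop_requests(10, 0.25, 60))  # -> 2400, i.e. ~40 QPS total
```

A consequence of this closed-loop design is that offered QPS rises with the user count, so the autoscaler sees load that tracks the user graph.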
The results of the load test track three metrics over the course of the test: the number of replicas, QPS, and P50 latency.
Notice that the number of replicas scales up and down with the number of Locust users, keeping roughly two ongoing requests per replica, in line with the `target_ongoing_requests=2` setting.

The Ray Serve autoscaler is an application-level autoscaler that sits on top of the Ray Autoscaler. Concretely, the Serve autoscaler asks Ray to start a number of replica actors based on request demand. If the Ray Autoscaler determines there aren't enough available resources (for example, CPUs or GPUs) to place these actors, it responds by requesting more Ray nodes, and the underlying cloud provider adds them. Similarly, when Ray Serve scales down and terminates replica actors, it tries to leave as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve autoscaling, see Ray Serve Autoscaling Architecture.
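To make the replica-to-node relationship concrete: each replica actor reserves resources (for example, `num_cpus` in `ray_actor_options`), and the Ray Autoscaler adds nodes until the cluster can place all requested replicas. A toy calculation of the node count, with made-up numbers and a simplified packing model (real placement also accounts for GPUs, memory, and per-node fragmentation):

```python
import math

def nodes_needed(num_replicas: int, cpus_per_replica: float, cpus_per_node: int) -> int:
    """Minimum nodes whose combined CPUs fit all replica actors.

    Illustrative model: assumes replicas pack perfectly by total CPU,
    ignoring fragmentation and other resource types.
    """
    if num_replicas == 0:
        return 0
    return math.ceil(num_replicas * cpus_per_replica / cpus_per_node)

# 12 replicas, each reserving 2 CPUs, on 8-CPU worker nodes:
print(nodes_needed(12, 2, 8))  # -> 3
```

When Serve later scales down to, say, 4 replicas, only a single 8-CPU node is still needed, and the Ray Autoscaler can reclaim the idle nodes.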