doc/source/serve/advanced-guides/advanced-autoscaling.md
(serve-advanced-autoscaling)=
# Advanced Ray Serve autoscaling

This guide covers more advanced autoscaling parameters in `autoscaling_config` and an advanced model composition example.
(serve-autoscaling-config-parameters)=
## Autoscaling config parameters

This section goes into more detail about Serve autoscaling concepts and how to set your autoscaling config.
To define what the steady state of your deployments should be, set values for
`target_ongoing_requests` and `max_ongoing_requests`.

**target_ongoing_requests [default=2]**

:::{note}
The default for `target_ongoing_requests` changed from 1.0 to 2.0 in Ray 2.32.0.
You can continue to set it manually to override the default.
:::
Serve scales the number of replicas for a deployment up or down based on the
average number of ongoing requests per replica. Specifically, Serve compares the
actual number of ongoing requests per replica with the target value you set in
the autoscaling config and makes upscale or downscale decisions from that. Set
the target value with target_ongoing_requests, and Serve attempts to ensure
that each replica has roughly that number of requests being processed and
waiting in the queue.
Always load test your workloads. For example, if the use case is latency
sensitive, you can lower the target_ongoing_requests number to maintain high
performance. Benchmark your application code and set this number based on an
end-to-end latency objective.
:::{note}
As an example, suppose you have two replicas of a synchronous deployment that
has 100ms latency, serving a traffic load of 30 QPS. Then Serve assigns requests
to replicas faster than the replicas can finish processing them; more and more
requests queue up at the replica (these requests are "ongoing requests") as time
progresses, and then the average number of ongoing requests at each replica
steadily increases. Latency also increases because new requests have to wait for
old requests to finish processing. If you set target_ongoing_requests = 1,
Serve detects a higher than desired number of ongoing requests per replica, and
adds more replicas. At 3 replicas, your system would be able to process 30 QPS
with 1 ongoing request per replica on average.
:::
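You can sanity-check this arithmetic with Little's law: the required replica count is roughly QPS × latency ÷ `target_ongoing_requests`. The helper below is an illustrative back-of-envelope estimate, not part of the Serve API:

```python
import math

def required_replicas(qps: float, latency_s: float,
                      target_ongoing_requests: float) -> int:
    """Back-of-envelope steady-state replica count via Little's law:
    concurrency = arrival rate x latency, spread across replicas."""
    concurrency = qps * latency_s
    return math.ceil(concurrency / target_ongoing_requests)

# The scenario above: 30 QPS, 100 ms latency, target_ongoing_requests = 1.
print(required_replicas(30, 0.1, 1))  # -> 3
```

Benchmark numbers from your own load tests should replace these inputs before you rely on the estimate.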
**max_ongoing_requests [default=5]**

:::{note}
The default for `max_ongoing_requests` changed from 100 to 5 in Ray 2.32.0.
You can continue to set it manually to override the default.
:::
There is also a maximum queue limit that proxies respect when assigning requests
to replicas. Define the limit with `max_ongoing_requests`. Set
`max_ongoing_requests` to about 20 to 50% higher than `target_ongoing_requests`.
:::{note}
Tune `max_ongoing_requests` higher, especially for lightweight
requests; otherwise, overall throughput suffers.
:::
To use autoscaling, you need to define the minimum and maximum number of resources allowed for your system.
**min_replicas [default=1]**: This is the minimum number of replicas for the
deployment. If you want to ensure your system can deal with a certain level of
traffic at all times, set min_replicas to a positive number. On the other
hand, if you anticipate periods of no traffic and want to scale to zero to save
cost, set min_replicas = 0. Note that setting min_replicas = 0 causes higher
tail latencies; when you start sending traffic, the deployment scales up, and
there will be a cold start time as Serve waits for replicas to be started to
serve the request.
**max_replicas [default=1]**: This is the maximum number of replicas for the
deployment. This should be greater than min_replicas. Ray Serve Autoscaling
relies on the Ray Autoscaler to scale up more nodes when the currently available
cluster resources (CPUs, GPUs, etc.) are not enough to support more replicas.
**initial_replicas**: This is the number of replicas that are started
initially for the deployment. This defaults to the value for min_replicas.
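Putting these together, a deployment that scales to zero might use a config like the following sketch. The keys match Ray Serve's `autoscaling_config` fields, but the values are illustrative:

```python
# Illustrative autoscaling_config for a scale-to-zero deployment.
autoscaling_config = {
    "min_replicas": 0,        # allow scaling to zero to save cost
    "initial_replicas": 0,    # defaults to min_replicas if omitted
    "max_replicas": 20,       # upper bound; the Ray Autoscaler adds nodes as needed
    "target_ongoing_requests": 2,
}

# A sanity check any config should satisfy.
assert autoscaling_config["min_replicas"] <= autoscaling_config["max_replicas"]
```

Remember that `min_replicas = 0` trades cost for cold-start latency when traffic resumes.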
Given a steady stream of traffic and appropriately configured min_replicas and
max_replicas, the steady state of your system is essentially fixed for a
chosen configuration value for target_ongoing_requests. Before reaching steady
state, however, your system is reacting to traffic shifts. How you want your
system to react to changes in traffic determines how you want to set the
remaining autoscaling configurations.
**upscale_delay_s [default=30s]**: This defines how long Serve waits before
scaling up the number of replicas in your deployment. In other words, this
parameter controls the frequency of upscale decisions. If the replicas are
consistently serving more requests than desired for `upscale_delay_s`
seconds, then Serve scales up the number of replicas based on
aggregated ongoing request metrics. For example, if your service is likely to
experience bursts of traffic, you can lower `upscale_delay_s` so that your
application can react quickly to increases in traffic.

Ray Serve lets you use different delays for different downscaling scenarios,
providing more granular control over when replicas are removed. This is
particularly useful when you want different behavior for scaling down to zero
versus scaling down to a non-zero number of replicas.
**downscale_delay_s [default=600s]**: This defines how long Serve waits before
scaling down the number of replicas in your deployment. If the replicas are
consistently serving fewer requests than desired for a downscale_delay_s
number of seconds, Serve scales down the number of replicas based on
aggregated ongoing requests metrics. This delay applies to all downscaling
decisions except for the optional 1→0 transition (see below). For example, if
your application initializes slowly, you can increase downscale_delay_s to
make downscaling happen more infrequently and avoid reinitialization costs when
the application needs to upscale again.
**downscale_to_zero_delay_s [Optional]**: This defines how long Serve waits
before scaling from one replica down to zero (only applies when min_replicas = 0).
If not specified, the 1→0 transition uses the downscale_delay_s value. This is
useful when you want more conservative scale-to-zero behavior. For example, you
might set downscale_delay_s = 300 for regular downscaling but
downscale_to_zero_delay_s = 1800 to wait 30 minutes before scaling to zero,
avoiding cold starts for brief periods of inactivity.
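The example above corresponds to a config fragment like this sketch (values illustrative):

```python
# Regular downscaling after 5 minutes, but wait 30 minutes before 1 -> 0.
autoscaling_config = {
    "min_replicas": 0,                  # required for scale-to-zero
    "max_replicas": 10,
    "downscale_delay_s": 300,           # 5 minutes for N -> N-1 (N > 1)
    "downscale_to_zero_delay_s": 1800,  # 30 minutes for 1 -> 0
}
```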
**upscale_smoothing_factor [default=1.0] (DEPRECATED)**: This parameter
is renamed to `upscaling_factor`. `upscale_smoothing_factor` will be removed in
a future release.

**downscale_smoothing_factor [default=1.0] (DEPRECATED)**: This
parameter is renamed to `downscaling_factor`. `downscale_smoothing_factor` will
be removed in a future release.
**upscaling_factor [default=1.0]**: The multiplicative factor to amplify
or moderate each upscaling decision. For example, when the application has high
traffic volume in a short period of time, you can increase upscaling_factor to
scale up the resource quickly. This parameter is like a "gain" factor to amplify
the response of the autoscaling algorithm.
**downscaling_factor [default=1.0]**: The multiplicative factor to
amplify or moderate each downscaling decision. For example, if you want your
application to be less sensitive to drops in traffic and scale down more
conservatively, you can decrease downscaling_factor to slow down the pace of
downscaling.
**metrics_interval_s [default=10]**: This controls how often each replica and
handle reports its current number of ongoing requests to the autoscaler. In the
future, this deployment-level config will be removed in favor of a
cross-application, global config.
:::{note}
If metrics are reported infrequently, Ray Serve can take longer to notice a change in autoscaling metrics, so scaling can start later even if your delays are short. For example, if you set `upscale_delay_s = 3` but metrics are pushed every 10 seconds, Ray Serve might not see a change until the next push, so scaling up can be limited to about once every 10 seconds.
:::
**look_back_period_s [default=30]**: This is the window over which the
average number of ongoing requests per replica is calculated.
**aggregation_function [default="mean"]**: This controls how metrics are
aggregated over the `look_back_period_s` time window. The aggregation function
determines how Ray Serve combines multiple metric measurements into a single
value for autoscaling decisions. Supported values:

- `"mean"` (default): Uses the time-weighted average of metrics. This provides
  smooth scaling behavior that responds to sustained traffic patterns.
- `"max"`: Uses the maximum metric value observed. This makes autoscaling more
  sensitive to spikes, scaling up quickly when any replica experiences high load.
- `"min"`: Uses the minimum metric value observed. This results in more
  conservative scaling behavior.

For most workloads, the default `"mean"` aggregation provides the best balance.
Use `"max"` if you need to react quickly to traffic spikes, or `"min"` if you
prefer conservative scaling that avoids rapid fluctuations.
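As a rough illustration of how the three functions behave on the same window of samples (ignoring time weighting), consider this sketch:

```python
def aggregate(samples: list[float], fn: str = "mean") -> float:
    """Combine window samples the way the three supported functions would.
    Illustrative only: Serve's "mean" is time-weighted, this one is not."""
    if not samples:
        raise ValueError("no samples in window")
    if fn == "mean":
        return sum(samples) / len(samples)
    if fn == "max":
        return max(samples)
    if fn == "min":
        return min(samples)
    raise ValueError(f"unsupported aggregation_function: {fn}")

# Ongoing requests per replica observed over the look-back window.
window = [1.0, 4.0, 1.0]
print(aggregate(window, "mean"), aggregate(window, "max"), aggregate(window, "min"))
# -> 2.0 4.0 1.0
```

Notice how a single spike (4.0) dominates `"max"` but barely moves `"mean"`, which is why `"max"` reacts faster to bursts.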
## How autoscaling metrics work

Understanding how metrics flow through the autoscaling system helps you configure the parameters effectively. The metrics pipeline involves several stages, each with its own timing parameters:
```text
┌────────────────────────────────────────────────────────────────────────┐
│                       Metrics Pipeline Overview                        │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Replicas/Handles            Controller            Autoscaling Policy  │
│  ┌──────────┐               ┌───────────┐             ┌──────────┐     │
│  │  Record  │     Push      │  Receive  │   Decide    │  Policy  │     │
│  │  Metrics │──────────────>│  Metrics  │────────────>│   Runs   │     │
│  │  (10s)   │     (10s)     │           │   (0.1s)    │          │     │
│  └──────────┘               │ Aggregate │             └──────────┘     │
│                             │   (30s)   │                              │
│                             └───────────┘                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
```
**Stage 1: Recording.** Replicas and deployment handles continuously record
autoscaling metrics, at an interval controlled by `metrics_interval_s`.

**Stage 2: Pushing.** Periodically, replicas and handles push their metrics to
the controller, at intervals controlled by
`RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S` and
`RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S`. Only recent
measurements within the `look_back_period_s` window are sent. Whether metrics
are pre-aggregated over the `look_back_period_s` window at the replica/handle
depends on the `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER` setting (see
Stage 3 below).

**Stage 3: Aggregation.** The controller aggregates metrics to compute the
total ongoing requests across all replicas.
Ray Serve supports two aggregation modes, controlled by `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER`:

- **Simple mode** (default, `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=0`):
  Each replica and handle pre-aggregates its metrics over the
  `look_back_period_s` window, and the controller sums the pre-computed
  averages.
- **Aggregate mode** (experimental, `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1`):
  The controller aggregates raw timeseries data over the `look_back_period_s`
  window using the configured `aggregation_function` (`mean`, `max`, or `min`).
  It uses an instantaneous merge approach that treats metrics as
  right-continuous step functions.

:::{note}
The `aggregation_function` parameter only applies in aggregate mode. In simple mode, the aggregation is always a sum of the pre-computed simple averages.
:::
:::{note}
The long-term plan is to deprecate simple mode in favor of aggregate mode. Aggregate mode provides more accurate metrics aggregation and will become the default in a future release. Consider testing aggregate mode (`RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1`) in your deployments to prepare for this transition.
:::
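A time-weighted ("instantaneous") average over a right-continuous step function can be sketched as follows. This is an illustration of the idea, not Serve's internal implementation:

```python
def time_weighted_mean(samples: list[tuple[float, float]], window_end: float) -> float:
    """Average a right-continuous step function: each (timestamp, value)
    sample holds until the next timestamp; the last holds until window_end."""
    total = 0.0
    for (t, v), (t_next, _) in zip(samples, samples[1:] + [(window_end, 0.0)]):
        total += v * (t_next - t)
    return total / (window_end - samples[0][0])

# Two ongoing-requests samples over a 10-second window:
# value 2.0 holds for 5s, then 4.0 holds for 5s -> average 3.0.
print(time_weighted_mean([(0.0, 2.0), (5.0, 4.0)], window_end=10.0))  # -> 3.0
```

Unlike a simple mean of samples, this weights each value by how long it was in effect, so irregular reporting intervals don't skew the result.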
**Stage 4: Decision.** The autoscaling policy runs frequently to make scaling
decisions (see [Custom autoscaling policies](serve-custom-autoscaling-policies)
for details on implementing custom scaling logic). The policy runs every
`RAY_SERVE_CONTROL_LOOP_INTERVAL_S` seconds, receives an `AutoscalingContext`,
and returns a tuple of `(target_replicas, updated_policy_state)`.

The timing parameters interact in important ways:
- **Recording vs. pushing intervals**
- **Push interval vs. look-back period:** `look_back_period_s` (30s) should be
  greater than the push interval (10s), so the window always contains fresh
  measurements.
- **Push interval vs. control loop**
- **Push interval vs. upscale/downscale delays:** With a 10s push interval,
  `upscale_delay_s = 20` means up to 2 new metric updates arrive before scaling.

**Recommendation:** Keep default values unless you have specific needs. If you need faster autoscaling, decrease push intervals first, then adjust delays.
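These interactions can be summarized with a rough arithmetic sketch. The formula below is an approximation for building intuition, not a guarantee from Serve's implementation:

```python
def worst_case_upscale_reaction_s(upscale_delay_s: float,
                                  push_interval_s: float,
                                  control_loop_s: float = 0.1) -> float:
    """Rough upper bound on time from a traffic spike to the first upscale:
    wait for the next metrics push, then sustain the condition for
    upscale_delay_s, then wait for the next control-loop tick."""
    return push_interval_s + upscale_delay_s + control_loop_s

# Even with upscale_delay_s = 3, a 10s push interval dominates reaction time.
print(worst_case_upscale_reaction_s(upscale_delay_s=3, push_interval_s=10))
```

This shows why lowering `upscale_delay_s` alone may not speed up scaling: the push interval becomes the bottleneck.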
Several environment variables control autoscaling behavior at a lower level. These variables affect metrics collection and the control loop timing:
RAY_SERVE_CONTROL_LOOP_INTERVAL_S (default: 0.1s): How often the Ray
Serve controller runs the autoscaling control loop. Your autoscaling policy
function executes at this frequency. The default value of 0.1s means policies
run approximately 10 times per second.
RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S (default: 10.0s): Maximum
time allowed for the record_autoscaling_stats() method to complete in custom
metrics collection. If this timeout is exceeded, the metrics collection fails
and a warning is logged.
RAY_SERVE_MIN_HANDLE_METRICS_TIMEOUT_S (default: 10.0s): Minimum timeout
for handle metrics collection. The system uses the maximum of this value and
2 * metrics_interval_s to determine when to drop stale handle metrics.
RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER (default: false): Enables an
experimental metrics aggregation mode where the controller aggregates raw
timeseries data instead of using pre-aggregated metrics. This mode provides more
accurate time-weighted averages but may increase controller overhead. See Stage 3
in "How autoscaling metrics work" for details.

## Autoscaling in a model composition example

Determining the autoscaling configuration for a multi-model application requires understanding each deployment's scaling requirements. Every deployment has a different latency and differing levels of concurrency. As a result, finding the right autoscaling config for a model-composition application requires experimentation.
This example is a simple application with three deployments composed together to build some intuition about multi-model autoscaling. Assume these deployments:

- `HeavyLoad`: A mock 200ms workload with high CPU usage.
- `LightLoad`: A mock 100ms workload with high CPU usage.
- `Driver`: A driver deployment that fans out to the `HeavyLoad` and `LightLoad`
  deployments and aggregates the two outputs.

### One `Driver` replica

First consider the following deployment configurations. Because the driver
deployment has low CPU usage and is only asynchronously making calls to the
downstream deployments, allocating one fixed `Driver` replica is reasonable.
::::{tab-set}
:::{tab-item} Driver
```yaml
- name: Driver
  num_replicas: 1
  max_ongoing_requests: 200
```
:::
:::{tab-item} HeavyLoad
```yaml
- name: HeavyLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::
:::{tab-item} LightLoad
```yaml
- name: LightLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::
:::{tab-item} Application Code
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
:::
::::
Running the same Locust load test from the Resnet workload generates the following results:
| HeavyLoad and LightLoad Number Replicas |
As you might expect, the number of autoscaled LightLoad replicas is roughly
half that of autoscaled HeavyLoad replicas. Although the same number of
requests per second are sent to both deployments, LightLoad replicas can
process twice as many requests per second as HeavyLoad replicas can, so the
deployment should need half as many replicas to handle the same traffic load.
Unfortunately, the service latency rises from 230 ms to 400 ms when the number of Locust users increases to 100.
| P50 Latency | QPS |
|---|---|
Note that the number of HeavyLoad replicas should roughly match the number of
Locust users to adequately serve the Locust traffic. However, when the number of
Locust users increased to 100, the HeavyLoad deployment struggled to reach 100
replicas, and instead only reached 65 replicas. The per-deployment latencies
reveal the root cause. While HeavyLoad and LightLoad latencies stayed steady
at 200ms and 100ms, Driver latencies rose from 230 to 400 ms. This suggests
that the high Locust workload may be overwhelming the Driver replica and
impacting its asynchronous event loop's performance.
### Autoscale `Driver`

For this attempt, set an autoscaling configuration for `Driver` as well, with
the setting `target_ongoing_requests = 20`. Now the deployment configurations
are as follows:
::::{tab-set}
:::{tab-item} Driver
```yaml
- name: Driver
  max_ongoing_requests: 200
  autoscaling_config:
    target_ongoing_requests: 20
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 10
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::
:::{tab-item} HeavyLoad
```yaml
- name: HeavyLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::
:::{tab-item} LightLoad
```yaml
- name: LightLoad
  max_ongoing_requests: 3
  autoscaling_config:
    target_ongoing_requests: 1
    min_replicas: 0
    initial_replicas: 0
    max_replicas: 200
    upscale_delay_s: 3
    downscale_delay_s: 60
    upscaling_factor: 0.3
    downscaling_factor: 0.3
    metrics_interval_s: 2
    look_back_period_s: 10
```
:::
::::
Running the same Locust load test again generates the following results:
| HeavyLoad and LightLoad Number Replicas | |
| Driver Number Replicas |
With up to 6 `Driver` replicas to receive and distribute the incoming
requests, the `HeavyLoad` deployment successfully scales up to 90+ replicas, and
`LightLoad` up to 47 replicas. This configuration helps the application latency
stay consistent as the traffic load increases.
| Improved P50 Latency | Improved RPS |
|---|---|
## Troubleshooting guide

If the number of replicas in your deployment keeps oscillating even though the traffic is relatively stable, try the following:
Set a smaller `upscaling_factor` and `downscaling_factor`. Setting both values
smaller than one helps the autoscaler make more conservative upscale and
downscale decisions. It effectively smooths out the replica graph, producing
fewer "sharp edges".
Set a look_back_period_s value that matches the rest of the autoscaling
config. For longer upscale and downscale delay values, a longer look back period
can likely help stabilize the replica graph, but for shorter upscale and
downscale delay values, a shorter look back period may be more appropriate. For
instance, the following replica graphs show how a deployment with
upscale_delay_s = 3 works with a longer vs shorter look back period.
look_back_period_s = 30 | look_back_period_s = 3 |
|---|---|
If you expect your application to receive bursty traffic, and at the same time want the deployments to scale down in periods of inactivity, you are likely concerned about how quickly the deployment can scale up and respond to bursts of traffic. While an increase in latency initially during a burst in traffic may be unavoidable, you can try the following to improve latency during bursts of traffic.
Set a lower upscale_delay_s. The autoscaler always waits upscale_delay_s
seconds before making a decision to upscale, so lowering this delay allows the
autoscaler to react more quickly to changes, especially bursts, of traffic.
Set a larger upscaling_factor. If upscaling_factor > 1, then the
autoscaler scales up more aggressively than normal. This setting can allow your
deployment to be more sensitive to bursts of traffic.
Lower the metrics_interval_s.
Always set metrics_interval_s to be less than
or equal to upscale_delay_s, otherwise upscaling is delayed because the
autoscaler doesn't receive fresh information often enough.
Set a lower max_ongoing_requests. If max_ongoing_requests is too high
relative to target_ongoing_requests, then when traffic increases, Serve might
assign most or all of the requests to the existing replicas before the new
replicas are started. This setting can lead to very high latencies during
upscale.
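Putting these burst-handling tips together, an illustrative config fragment might look like this. The values are examples to show the relationships between parameters, not recommendations:

```python
# Burst-friendly settings; tune against your own load tests.
burst_friendly = {
    "upscale_delay_s": 3,     # react quickly to sustained traffic increases
    "upscaling_factor": 1.5,  # amplify each upscale decision
    "metrics_interval_s": 2,  # keep <= upscale_delay_s so decisions use fresh data
}
max_ongoing_requests = 3      # close to target so new replicas receive traffic
target_ongoing_requests = 2

# The constraint called out above: metrics must arrive at least as often
# as the autoscaler is allowed to make upscale decisions.
assert burst_friendly["metrics_interval_s"] <= burst_friendly["upscale_delay_s"]
```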
You may observe that deployments are scaling down too quickly. Instead, you may want the downscaling to be much more conservative to maximize the availability of your service.
Set a longer downscale_delay_s. The autoscaler always waits
downscale_delay_s seconds before making a decision to downscale, so by
increasing this number, your system has a longer "grace period" after traffic
drops before the autoscaler starts to remove replicas.
Set a smaller `downscaling_factor`. If `downscaling_factor < 1`, then the
autoscaler removes fewer replicas than it otherwise would to reach
the target number of ongoing requests. In other words, the autoscaler
makes more conservative downscaling decisions.
downscaling_factor = 1 | downscaling_factor = 0.5 |
|---|---|
(serve-custom-autoscaling-policies)=
## Custom autoscaling policies

:::{warning}
Custom autoscaling policies are experimental and may change in future releases.
:::

Ray Serve's built-in, request-driven autoscaling works well for most apps. Use custom autoscaling policies when you need more control, such as scaling on external metrics (CloudWatch, Prometheus), anticipating predictable traffic (scheduled batch jobs), or applying business logic that goes beyond queue thresholds.
Custom policies let you implement scaling logic based on any metrics or rules you choose.
A custom autoscaling policy is a user-provided Python function that takes an AutoscalingContext and returns a tuple (target_replicas, policy_state) for a single Deployment.
An `AutoscalingContext` object provides the following information to the custom autoscaling policy:

- Built-in request metrics, plus any custom metrics recorded by `record_autoscaling_stats()`. (See below.)
- `min` / `max` replica limits adjusted for current cluster capacity.
- A policy-state `dict` you can use to persist arbitrary state across control-loop iterations.

The following example showcases a policy that scales up during business hours and evening batch processing, and scales down during off-peak hours:
autoscaling_policy.py file:
:language: python
:start-after: __begin_scheduled_batch_processing_policy__
:end-before: __end_scheduled_batch_processing_policy__
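The referenced files aren't shown inline here, but the shape of such a schedule-based policy can be sketched in plain Python. The hour thresholds and replica counts below are illustrative, and `ctx` stands in for Serve's `AutoscalingContext`:

```python
from datetime import datetime

def target_for_hour(hour: int) -> int:
    """Illustrative schedule; thresholds and counts are made up."""
    if 9 <= hour < 17:   # business hours
        return 10
    if 18 <= hour < 22:  # evening batch window
        return 6
    return 1             # off-peak

def scheduled_policy(ctx):
    """A custom policy returns (target_replicas, updated_policy_state)."""
    return target_for_hour(datetime.now().hour), {}

print(target_for_hour(10), target_for_hour(20), target_for_hour(3))  # -> 10 6 1
```

Serve still applies the configured delays, factors, and min/max bounds on top of whatever the policy returns.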
main.py file:
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
Policies are defined per deployment. If you don’t provide one, Ray Serve falls back to its built-in request-based policy.
The policy function is invoked by the Ray Serve controller every RAY_SERVE_CONTROL_LOOP_INTERVAL_S seconds (default 0.1s), so your logic runs against near-real-time state.
Your policy can return an `int` or a `float` for `target_replicas`. If it returns a float, Ray Serve rounds it up to the nearest integer to get the replica count.
:::{warning}
Keep policy functions fast and lightweight. Slow logic can block the Serve controller and degrade cluster responsiveness.
:::
Ray Serve automatically applies the following standard autoscaling parameters from your `AutoscalingConfig` to custom policies:

- `upscale_delay_s`, `downscale_delay_s`, `downscale_to_zero_delay_s`
- `upscaling_factor`, `downscaling_factor`
- `min_replicas`, `max_replicas`

The following example shows a custom autoscaling policy with standard autoscaling parameters applied.
:language: python
:start-after: __begin_apply_autoscaling_config_example__
:end-before: __end_apply_autoscaling_config_example__
:language: python
:start-after: __begin_apply_autoscaling_config_usage__
:end-before: __end_apply_autoscaling_config_usage__
::::{note}
Your policy function should return the "raw" desired number of replicas. Ray Serve applies the autoscaling_config settings (delays, factors, and bounds) on top of your decision.
Your policy can return an `int` or a `float` "raw desired" replica count; Ray Serve converts it to an integer decision.
::::
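To build intuition for how a factor moderates a raw decision, here's a simplified sketch. This is not Serve's exact algorithm, only an illustration of the idea behind `upscaling_factor`/`downscaling_factor` and the min/max bounds; the function name is hypothetical:

```python
import math

def moderated_decision(raw: float, current: int, factor: float,
                       min_replicas: int, max_replicas: int) -> int:
    """Move from `current` toward the raw target, scaled by the factor,
    then round up and clamp to the configured bounds."""
    desired = current + factor * (raw - current)
    return max(min_replicas, min(max_replicas, math.ceil(desired)))

# A raw target of 10 with factor 0.5 from 2 replicas steps to 6, not 10.
print(moderated_decision(raw=10, current=2, factor=0.5,
                         min_replicas=1, max_replicas=8))  # -> 6
```

A factor below 1 takes partial steps toward the target, which is what smooths out oscillation.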
You can make richer decisions by emitting your own metrics from the deployment. Implement record_autoscaling_stats() to return a dict[str, float]. Ray Serve will surface these values in the AutoscalingContext.
This example demonstrates how deployments can provide their own metrics (CPU usage, memory usage) and how autoscaling policies can use these metrics to make scaling decisions:
autoscaling_policy.py file:
:language: python
:start-after: __begin_custom_metrics_autoscaling_policy__
:end-before: __end_custom_metrics_autoscaling_policy__
main.py file:
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
:::{note}
The record_autoscaling_stats() method can be either synchronous or asynchronous. It must complete within the timeout specified by RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S (default 10 seconds).
:::
In your policy, access custom metrics via:

- `ctx.raw_metrics[metric_name]`: A mapping of replica IDs to lists of raw metric values.
  The number of data points stored for each replica depends on `look_back_period_s` (the sliding window size) and `metrics_interval_s` (the metric recording interval).
- `ctx.aggregated_metrics[metric_name]`: A time-weighted average computed from the raw metric values for each replica.

When your policy needs long-running setup, such as polling an external metrics service, maintaining a persistent connection, or running background computation, you can define it as a class instead of a plain function. Pass the class reference through `policy_function` and supply constructor arguments through `policy_kwargs`.
Ray Serve instantiates the class once on the controller when the deployment starts. __init__ runs one-time setup, and __call__ runs on every autoscaling tick with the current AutoscalingContext.
The following example shows a policy that reads a target replica count from a JSON file in a background loop. In production you could replace the file read with an HTTP call, a message-queue consumer, or any other async IO operation:
class_based_autoscaling_policy.py file:
:language: python
:start-after: __begin_class_based_autoscaling_policy__
:end-before: __end_class_based_autoscaling_policy__
main.py file:
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
:::{note}
The instance lives only on the Serve controller and is never serialized after creation, so it's safe to hold non-picklable state such as asyncio.Task objects, open connections, or thread pools. policy_kwargs values must be JSON-serializable because they travel through the deployment config.
:::
:::{tip}
If you're using @task_consumer deployments for asynchronous inference, Ray Serve provides a built-in AsyncInferenceAutoscalingPolicy that scales based on message queue length. See Asynchronous Inference: Autoscaling for setup and configuration.
:::
## Application-level autoscaling policies

By default, each deployment in Ray Serve autoscales independently. When you have multiple deployments that need to scale in a coordinated way, such as deployments that share backend resources, have dependencies on each other, or need load-aware routing, you can define an application-level autoscaling policy. This policy makes scaling decisions for all deployments within an application simultaneously.
An application-level autoscaling policy is a function that takes a `dict[DeploymentID, AutoscalingContext]` (one context per deployment) and returns a tuple of `(decisions, policy_state)`. Each context contains metrics and bounds for one deployment, and the policy returns target replica counts for all deployments.
The `policy_state` returned from an application-level policy must be a `Dict[DeploymentID, Dict]`: a dictionary mapping each deployment ID to its own state dictionary. Serve stores this per-deployment state and, on the next control-loop iteration, injects each deployment's state back into that deployment's `AutoscalingContext.policy_state`.
The per-deployment replica count returned from the policy can be an `int` or a `float`. If it's a float, Ray Serve rounds it up to the nearest integer.
Serve itself does not interpret the contents of policy_state. All the keys in each deployment's state dictionary are user-controlled except for internal keys that are used when default parameters are applied to custom autoscaling policies.
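As a minimal standalone sketch of the required return shape (the deployment names `Upstream` and `Downstream` are hypothetical, and plain strings stand in for `DeploymentID`s):

```python
def coordinated_policy(current_replicas: dict[str, int]):
    """Illustrative application-level decision: keep a hypothetical
    'Downstream' deployment at twice the replica count of 'Upstream'.
    Returns (decisions, per-deployment policy_state)."""
    upstream = current_replicas.get("Upstream", 1)
    decisions = {"Upstream": upstream, "Downstream": 2 * upstream}
    # policy_state must map each deployment ID to its own state dict.
    policy_state = {name: {} for name in decisions}
    return decisions, policy_state

decisions, state = coordinated_policy({"Upstream": 3, "Downstream": 4})
print(decisions)  # -> {'Upstream': 3, 'Downstream': 6}
```

A real policy would read metrics and bounds from each deployment's `AutoscalingContext` rather than from a plain dict.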
The following example shows a policy that scales deployments based on their relative load, ensuring that downstream deployments have enough capacity for upstream traffic:
autoscaling_policy.py file:
:language: python
:start-after: __begin_application_level_autoscaling_policy__
:end-before: __end_application_level_autoscaling_policy__
The following example shows a stateful application-level policy that persists state between control-loop iterations:
autoscaling_policy.py file:
:language: python
:start-after: __begin_stateful_application_level_policy__
:end-before: __end_stateful_application_level_policy__
To use an application-level policy, you need to define your deployments:
main.py file:
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
Then specify the application-level policy in your application config:
serve.yaml file:
:language: yaml
:emphasize-lines: 4-5
:::{note}
Programmatic configuration of application-level autoscaling policies through serve.run() will be supported in a future release.
:::
:::{note}
When you specify both a deployment-level policy and an application-level policy, the application-level policy takes precedence. Ray Serve logs a warning if you configure both.
:::
Ray Serve automatically applies standard autoscaling parameters (delays, factors, and min/max bounds) to application-level policies on a per-deployment basis. These parameters include:
- `upscale_delay_s`, `downscale_delay_s`, `downscale_to_zero_delay_s`
- `upscaling_factor`, `downscaling_factor`
- `min_replicas`, `max_replicas`

The following YAML configuration file shows the default parameters applied to the application-level policy.
:language: yaml
Your application-level policy can return per-deployment desired replica counts as `int` or `float` values. Ray Serve applies the autoscaling config parameters per deployment and returns integer decisions.
:::{warning}
When you provide a custom policy, Ray Serve can fully support it as long as it's simple, self-contained Python code that relies only on the standard library. Once the policy becomes more complex, such as depending on other custom modules or packages, you need to bundle those modules into the Docker image or environment. This is because Ray Serve uses cloudpickle to serialize custom policies, and cloudpickle doesn't vendor transitive dependencies: if your policy inherits from a superclass in another module or imports custom packages, those must exist in the target environment. Additionally, environment parity matters: differences in Python version, cloudpickle version, or library versions can affect deserialization.
When your custom autoscaling policy has complex dependencies or you want better control over versioning and deployment, you have several alternatives:
`python/ray/serve/autoscaling_policy.py`.

(serve-external-scale-api)=
## External scaling API
:::{warning}
This API is in alpha and may change before becoming stable.
:::
The external scaling API provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this API allows you to scale based on any external criteria you define.
This example shows how to implement predictive scaling based on historical patterns or forecasts. You can preemptively scale up before anticipated traffic spikes by running an external script that adjusts replica counts based on time of day.
The following example creates a simple text processing deployment that you can scale externally. Save this code to a file named external_scaler_predictive.py:
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
Before using the external scaling API, enable it in your application configuration by setting external_scaler_enabled: true. Save this configuration to a file named external_scaler_config.yaml:
:language: yaml
:start-after: __external_scaler_config_begin__
:end-before: __external_scaler_config_end__
:::{warning}
External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application. If you set external_scaler_enabled: true, you must not configure autoscaling_config on any deployment in that application. Attempting to use both results in an error.
:::
The following script implements predictive scaling based on time of day and historical traffic patterns. Save this script to a file named external_scaler_predictive_client.py:
:language: python
:start-after: __client_script_begin__
:end-before: __client_script_end__
The script uses the external scaling API endpoint to scale deployments:

- Endpoint: `POST http://localhost:8265/api/v1/applications/{application_name}/deployments/{deployment_name}/scale`
- Request body: `{"target_num_replicas": <number>}` (must conform to the `ScaleDeploymentRequest` schema)

The scaling client continuously adjusts the number of replicas based on the time of day.
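For illustration, a client could build and send this request with only the standard library. The application and deployment names below are hypothetical:

```python
import json
import urllib.request

def build_scale_request(app: str, deployment: str, target: int,
                        base_url: str = "http://localhost:8265"):
    """Build the URL and body for the external scaling endpoint."""
    url = (f"{base_url}/api/v1/applications/{app}"
           f"/deployments/{deployment}/scale")
    body = {"target_num_replicas": target}  # ScaleDeploymentRequest schema
    return url, body

def scale_deployment(app: str, deployment: str, target: int) -> None:
    url, body = build_scale_request(app, deployment, target)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

# Hypothetical names, for illustration only; no request is sent here.
url, body = build_scale_request("my_app", "TextProcessor", 4)
print(url)
print(body)  # -> {'target_num_replicas': 4}
```

Because the API is idempotent (see below), calling `scale_deployment` repeatedly with the same target is safe.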
Follow these steps to run the complete example:

1. Deploy the application:

   ```bash
   serve run external_scaler_config.yaml
   ```

2. In a separate terminal, start the scaling client:

   ```bash
   python external_scaler_predictive_client.py
   ```
The client adjusts replica counts automatically based on the time of day. You can monitor the scaling behavior in the Ray dashboard or by checking the application logs.
Understanding how the external scaler interacts with your deployments helps you build reliable scaling logic:
Idempotent API calls: The scaling API is idempotent. You can safely call it multiple times with the same target_num_replicas value without side effects. This makes it safe to run your scaling logic on a schedule or in response to repeated metric updates.
Interaction with serve deploy: When you upgrade your service with serve deploy, the number of replicas you set through the external scaler API stays intact. This behavior matches what you'd expect from Ray Serve's built-in autoscaler—deployment updates don't reset replica counts.
Query current replica count: You can get the current number of replicas for any deployment by querying the GET /applications API:
```bash
curl -X GET http://localhost:8265/api/serve/applications/
```
The response follows the ServeInstanceDetails schema, which includes an applications field containing a dictionary with application names as keys. Each application includes detailed information about all its deployments, including current replica counts. Use this information to make informed scaling decisions. For example, you might scale up gradually by adding a percentage of existing replicas rather than jumping to a fixed number.
Initial replica count: When you deploy an application for the first time, Ray Serve creates the number of replicas specified in the num_replicas field of your deployment configuration. The external scaler can then adjust this count dynamically based on your scaling logic.