(custom-request-router-guide)=

Use Custom Algorithm for Request Routing

:::{warning} This API is in alpha and may change before becoming stable. :::

Different Ray serve applications demand different logics for load balancing. For example, in serving LLMs you might want to have a different policy than balancing number of requests across replicas: e.g. balancing ongoing input tokens, balancing kv-cache utilization, etc. RequestRouter is an abstraction in Ray Serve that allows extension and customization of load-balancing logic for each deployment.

This guide shows how to use RequestRouter API to achieve custom load balancing across replicas of a given deployment. It will cover the following:

Define a simple uniform request router for load balancing
Deploy an app with the uniform request router
Utility mixins for request routing
Define a complex throughput-aware request router
Deploy an app with the throughput-aware request router
Experimental: Define a centralized capacity queue request router

(simple-uniform-request-router)=

Define simple uniform request router

Create a file custom_request_router.py with the following code:

{literalinclude}

:start-after: __begin_define_uniform_request_router__
:end-before: __end_define_uniform_request_router__
:language: python

This code defines a simple uniform request router that routes requests a random replica to distribute the load evenly regardless of the queue length of each replica or the body of the request. The router is defined as a class that inherits from RequestRouter. It implements the choose_replicas method, which returns the random replica for all incoming requests. The returned type is a list of lists of replicas, where each inner list represents a rank of replicas. The first rank is the most preferred and the last rank is the least preferred. The request will be attempted to be routed to the replica with the shortest request queue in each set of the rank in order until a replica is able to process the request. If none of the replicas are able to process the request, choose_replicas will be called again with a backoff delay until a replica is able to process the request.

:::{note} This request router also implements on_request_routed which can help you update the state of the request router after a request is routed. :::

(deploy-app-with-uniform-request-router)=

Deploy an app with the uniform request router

To use a custom request router, you need to pass the request_router_class argument to the deployment decorator. Also note that the request_router_class can be passed as the already imported class or as the string of import path to the class. Let's deploy a simple app that uses the uniform request router like this:

{literalinclude}

:start-after: __begin_deploy_app_with_uniform_request_router__
:end-before: __end_deploy_app_with_uniform_request_router__
:language: python

As the request is routed, both "UniformRequestRouter routing request" and "on_request_routed callback is called!!" messages will be printed to the console. The response will also be randomly routed to one of the replicas. You can test this by sending more requests and seeing the distribution of the replicas are roughly equal.

:::{note} Currently, the only way to configure the request router is to pass it as an argument to the deployment decorator. This means that you cannot change the request router for an existing deployment handle with running router. If you have a particular usecase where you need to reconfigure a request router on the deployment handle, please open a feature request on the Ray GitHub repository :::

(utility-mixin)=

Utility mixins for request routing

Ray Serve provides utility mixins that can be used to extend the functionality of the request router. These mixins can be used to implement common routing policies such as locality-aware routing, multiplexed model support, and FIFO request routing.

FIFOMixin: This mixin implements first in first out (FIFO) request routing. The default behavior for the request router is OOO (out of order) which routes requests to the exact replica which got assigned by the request passed to choose_replicas. This mixin is useful for the routing algorithm that can work independently of the request content, so the requests can be routed as soon as possible in the order they were received. By including this mixin, in your custom request router, the request matching algorithm will be updated to route requests FIFO. There are no additional flags needs to be configured and no additional helper methods provided by this mixin.
LocalityMixin: This mixin implements locality-aware request routing. It updates the internal states when between replica updates to track the location between replicas in the same node, same zone, and everything else. It offers helpers apply_locality_routing and rank_replicas_via_locality to route and ranks replicas based on their locality to the request, which can be useful for reducing latency and improving performance.
MultiplexMixin: When you use model-multiplexing you need to route requests based on knowing which replica has already a hot version of the model. It updates the internal states when between replica updates to track the model loaded on each replica, and size of the model cache on each replica. It offers helpers apply_multiplex_routing and rank_replicas_via_multiplex to route and ranks replicas based on their multiplexed model id of the request.

(throughput-aware-request-router)=

Define a complex throughput-aware request router

A fully featured request router can be more complex and should take into account the multiplexed model, locality, the request queue length on each replica, and using custom statistics like throughput to decide which replica to route the request to. The following class defines a throughput-aware request router that routes requests to the replica with these factors in mind. Add the following code into the custom_request_router.py file:

{literalinclude}

:start-after: __begin_define_throughput_aware_request_router__
:end-before: __end_define_throughput_aware_request_router__
:language: python

This request router inherits from RequestRouter, as well as FIFOMixin for FIFO request routing, LocalityMixin for locality-aware request routing, and MultiplexMixin for multiplexed model support. It implements choose_replicas to take the highest ranked replicas from rank_replicas_via_multiplex and rank_replicas_via_locality and uses the select_available_replicas helper to filter out replicas that have reached their maximum request queue length. Finally, it takes the replicas with the minimum throughput and returns the top one.

(deploy-app-with-throughput-aware-request-router)=

Deploy an app with the throughput-aware request router

To use the throughput-aware request router, you can deploy an app like this:

{literalinclude}

:start-after: __begin_deploy_app_with_throughput_aware_request_router__
:end-before: __end_deploy_app_with_throughput_aware_request_router__
:language: python

Similar to the uniform request router, the custom request router can be defined in the request_router_class argument of the deployment decorator. The Serve controller pulls statistics from the replica of each deployment by calling record_routing_stats. The request_routing_stats_period_s and request_routing_stats_timeout_s arguments control the frequency and timeout time of the serve controller pulling information from each replica in its background thread. You can customize the emission of these statistics by overriding record_routing_stats in the definition of the deployment class. The custom request router can then get the updated routing stats by looking up the routing_stats attribute of the running replicas and use it in the routing policy.

(capacity-queue-request-router)=

Experimental: Define a centralized capacity queue request router

In the previous examples, the routing decisions are based on the locally visible state of the target replicas from the perspective of the router replica. This view is eventually consistent not strongly because the serve controller frequently broadcasts the replica information to the router. Under high concurrency with multiple routers, this information can drift from reality and can cause several routers to simultaneously pick the same replica, causing transient load imbalance or triggering rejections and retries. For some applications this can result in lower throughput. A centralized approach avoids this: a single actor tracks per-replica in-flight counts, and every router acquires a capacity token before forwarding a request. This way, each token guarantees the target replica has room, eliminating the rejection protocol entirely.

This example demonstrates how we can implement such routing policy. The example has three pieces:

An importable CapacityQueue actor that tracks per-replica capacity and hands out tokens using a least-loaded selection strategy.
An importable CapacityQueueRouter custom request router that acquires a token before routing and releases it when the request completes. In a real application, we can have multiple replicas of CapacityQueueRouter each one keeping tracking their own view of state of replicas. The centralized CapacityQueue actor is meant to keep their local information synchronized with reality.
A deployment that ties them together using a deployment actor for the queue and a RequestRouterConfig for the router.

(deploy-app-with-capacity-queue-router)=

Deploy an app with the capacity queue router

The deployment wires the pieces together: a DeploymentActorConfig for the capacity queue and a RequestRouterConfig pointing at the custom router:

{literalinclude}

:start-after: __begin_deploy_app_with_capacity_queue_router__
:end-before: __end_deploy_app_with_capacity_queue_router__
:language: python

When the app starts:

The Serve controller creates the CapacityQueue deployment actor before any replicas start. CapacityQueue subscribes to replica updates via long poll.
As the controller starts replicas, it sends deployment-target updates. The queue's long-poll callback automatically registers each replica with its max_ongoing_requests capacity and unregisters replicas that are removed during scale-down or crash recovery.
The CapacityQueueRouter running in each proxy discovers the singleton CapacityQueue deployment actor, acquires a token for every incoming request, and routes to the replica identified by the token.
When the request completes, CapacityQueueRouter.on_request_completed fires and the token is released back to the queue.

Because the queue is a deployment actor, the controller handles its lifecycle automatically — health checks, cleanup on app deletion, and versioning during rolling updates.

Fault tolerance

The CapacityQueueRouter handles failures gracefully:

Queue unavailable — if the queue actor is dead, not yet discovered, or errors, the router retries with exponential backoff and falls back to power-of-two-choices after MAX_FAULT_RETRIES consecutive failures. Requests never raise exceptions due to queue issues.
Capacity exhausted — when all replicas are at capacity, the router backs off and retries until capacity frees up.
Queue restart — a restarted queue has no knowledge of pre-crash in-flight counts and may temporarily over-provision. This self-heals: replicas reject excess requests, and the router does not release rejected tokens intentionally, ratcheting up in_flight on the queue until it matches reality. token_ttl_s (if configured) auto-reclaims any remaining leaked tokens.
Replica death — the controller sends a long-poll update, the queue unregisters the dead replica, and tokens are only issued for live replicas.

Usage

The centralized capacity queue request router could bring performance benefits particularly in a constrained supply deployment, i.e. max_ongoing_request=1 or 2.

Benchmark

Benchmark Setup

Deployment topology: Client -> ParentDeployment -> ChildDeployment. Request router selection is applied to both deployments, controlling how parent replicas are selected by the HTTP proxy and how child replicas are selected by parent's DeploymentHandle.
Scale: small (8 replicas), medium (32 replicas), large (128 replicas), xlarge (512 replicas).
Workload: Replica processing latency is drawn from an exponential distribution with mean 1s and capped at 10s.
max_ongoing_request is set to 2.
Load generation: Applies closed-loop load generation where the load consistently keeps replicas saturated at max_ongoing_request concurrency.
Warmup: 10s; metrics within the warmup window are discarded entirely.

Benchmark Metrics

Throughput: Requests per second, i.e. num_requests / duration.
Utilization: Measures what fraction of a replica's total processing capacity was consumed by actual work during the experiment. Concretely, sum(replica_processing_latency_s) / (duration_s * max_ongoing_requests). For GPU deployments, utilization serves as an assessment proxy for GPU utilization.
Latency: Measures the client-side end-to-end latency, covering the full round-trip -- client -> ParentDeployment -> ChildDeployment -> ParentDeployment -> client.

Normal Situation

Under normal (success) situations, CapacityQueueRouter yields higher throughput and utilization and lower latency.

{image}

:align: center
:width: 800px

Fault Situation

A fault is simulated by killing the CapacityQueue router, and upon recovery, CapacityQueue converges towards its pre-fault performance.

{image}

:align: center
:width: 800px

:::{note} If you experience the following error when the CapacityQueue actor experiences faults and routing decisions fall back to the power-of-two-choices router, set RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S to a higher value.

Failed to get queue length from Replica(id='...', deployment='ParentDeployment', app='...') within 0.1s.

:::

:::{warning}

Gotchas and limitations

When you provide a custom router, Ray Serve can fully support it as long as it's simple, self-contained Python code that relies only on the standard library. Once the router becomes more complex, such as depending on other custom modules or packages, you need to ensure those modules are bundled into the Docker image or environment. This is because Ray Serve uses cloudpickle to serialize custom routers and it doesn't vendor transitive dependencies—if your router inherits from a superclass in another module or imports custom packages, those must exist in the target environment. Additionally, environment parity matters: differences in Python version, cloudpickle version, or library versions can affect deserialization.

Alternatives for complex routers

When your custom request router has complex dependencies or you want better control over versioning and deployment, you have several alternatives:

Use built-in routers: Consider using the routers shipped with Ray Serve—these are well-tested, production-ready, and guaranteed to work across different environments.
Contribute to Ray Serve: If your router is general-purpose and might benefit others, consider contributing it to Ray Serve as a built-in router by opening a feature request or pull request on the Ray GitHub repository. The recommended location for the implementation is python/ray/serve/_private/request_router/.
Ensure dependencies in your environment: Make sure that the external dependencies are installed in your Docker image or environment. :::

Custom Request Router

Use Custom Algorithm for Request Routing

Define simple uniform request router

Deploy an app with the uniform request router

Utility mixins for request routing

Define a complex throughput-aware request router

Deploy an app with the throughput-aware request router

Experimental: Define a centralized capacity queue request router

Deploy an app with the capacity queue router

Fault tolerance

Usage

Benchmark

Benchmark Setup

Benchmark Metrics

Normal Situation

Fault Situation

Gotchas and limitations

Alternatives for complex routers