(serve-asyncio-best-practices)=
# Asyncio best practices in Ray Serve

The code inside each replica of a Ray Serve deployment runs on an asyncio event loop. Asyncio enables efficient I/O-bound concurrency, but you need to follow a few best practices to get the best performance.
This guide explains:
- When to use `async def` versus `def` in Ray Serve.
- How `max_ongoing_requests` interacts with asyncio concurrency.

The examples assume the following imports unless stated otherwise:
:start-after: __imports_begin__
:end-before: __imports_end__
:language: python
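The referenced snippet isn't shown inline here. A plausible minimal set of imports for the sketches in this guide (the exact list in the source file may differ) is:

```python
# Illustrative import block; the referenced example file may import more or less.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

from ray import serve
```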
## `async def` and `def`

Use this decision table as a starting point:
| Workload type | Recommended handler | Reason |
|---|---|---|
| I/O-bound (databases, HTTP calls, queues) | async def | Lets the event loop handle many requests while each waits on I/O. |
| CPU-bound (model inference, heavy numeric compute) | def or async def with offload | Async alone doesn't make CPU work faster. You need more replicas, threads, or native parallelism. |
| Streaming responses | async def generator | Integrates with backpressure and non-blocking iteration. |
| FastAPI ingress (`@serve.ingress`) | def or async def | FastAPI runs def endpoints in a threadpool, so they don't block the loop. |
At a high level, requests go through a router to a replica actor that runs your code:

```text
Client
  ↓
Serve router (asyncio loop A)
  ↓
Replica actor
 ├─ System / control loop
 └─ User code loop (your handlers)
     └─ Optional threadpool for sync methods
```
The following are the key ideas to consider when deciding between `async def` and `def`:

- `async def` handlers always run directly on the replica's user event loop.
- Depending on `RAY_SERVE_RUN_SYNC_IN_THREADPOOL`, `def` handlers may run directly on the user event loop (blocking) or in a threadpool (non-blocking for the loop).

For a simple asynchronous deployment:
:start-after: __echo_async_begin__
:end-before: __echo_async_end__
:language: python
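The referenced example isn't inlined here. A minimal sketch of an asynchronous echo deployment (the class name `EchoAsync` and the 1-second sleep are illustrative) might look like:

```python
import asyncio

from ray import serve


@serve.deployment
class EchoAsync:
    async def __call__(self, message: str) -> str:
        # Simulate waiting on I/O. While this coroutine is suspended at the
        # await, the user event loop is free to handle other requests.
        await asyncio.sleep(1)
        return f"echo: {message}"


app = EchoAsync.bind()
```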
In this example:

- `async def __call__` runs directly on the replica's user event loop.
- While the handler is awaiting `asyncio.sleep`, the loop is free to start handling other requests.

For a synchronous deployment:
:start-after: __blocking_echo_begin__
:end-before: __blocking_echo_end__
:language: python
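A sketch of the synchronous counterpart (again, the class name and sleep duration are illustrative):

```python
import time

from ray import serve


@serve.deployment
class EchoBlocking:
    def __call__(self, message: str) -> str:
        # time.sleep blocks the calling thread. Whether that thread is the
        # user event loop depends on RAY_SERVE_RUN_SYNC_IN_THREADPOOL.
        time.sleep(1)
        return f"echo: {message}"


app = EchoBlocking.bind()
```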
How this method executes depends on configuration:

- With `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=0` (the current default), `__call__` runs directly on the user event loop and blocks it for 1 second.
- With `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1`, Serve offloads `__call__` to a threadpool so the event loop stays responsive.

## FastAPI ingress (`@serve.ingress`)

When you use FastAPI ingress, FastAPI controls how endpoints run:
:start-after: __fastapi_deployment_begin__
:end-before: __fastapi_deployment_end__
:language: python
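A sketch of a FastAPI ingress deployment with one `def` and one `async def` endpoint (route paths and names are illustrative):

```python
import asyncio

from fastapi import FastAPI
from ray import serve

fastapi_app = FastAPI()


@serve.deployment
@serve.ingress(fastapi_app)
class APIIngress:
    @fastapi_app.get("/sync")
    def sync_endpoint(self) -> str:
        # FastAPI runs plain `def` endpoints in its own threadpool,
        # so this call doesn't block the event loop.
        return "hello from a sync endpoint"

    @fastapi_app.get("/async")
    async def async_endpoint(self) -> str:
        # `async def` endpoints run directly on the event loop.
        await asyncio.sleep(0.1)
        return "hello from an async endpoint"


app = APIIngress.bind()
```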
Important differences:

- FastAPI automatically offloads `def` endpoints to a threadpool.
- In plain Serve deployments, `def` methods run on the event loop unless you opt into threadpool behavior.

Serve sets a default threadpool size for user code that mirrors Python's `ThreadPoolExecutor` defaults while respecting `ray_actor_options["num_cpus"]`.
In most cases, the default is fine. If you need to tune it, you can override the default executor inside your deployment:
:start-after: __threadpool_override_begin__
:end-before: __threadpool_override_end__
:language: python
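The referenced example isn't shown here. One sketch of controlling the pool size explicitly is to create a dedicated `ThreadPoolExecutor` in the deployment and pass it to `run_in_executor`; an alternative is replacing the loop-wide default with `loop.set_default_executor`. The `max_workers=8` value and method names below are illustrative:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from ray import serve


@serve.deployment(ray_actor_options={"num_cpus": 2})
class TunedThreadpool:
    def __init__(self):
        # Explicitly sized pool for offloaded work; tune max_workers for
        # your workload instead of relying on the default.
        self._executor = ThreadPoolExecutor(max_workers=8)

    def _blocking_work(self) -> str:
        return "done"

    async def __call__(self) -> str:
        loop = asyncio.get_running_loop()
        # Offload the blocking call to the explicitly sized pool.
        return await loop.run_in_executor(self._executor, self._blocking_work)


app = TunedThreadpool.bind()
```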
Guidance for choosing a size: the default already scales with `ray_actor_options["num_cpus"]`, so start there, and avoid making the pool much larger than `num_cpus` if the offloaded work is CPU-bound.

## Blocking versus non-blocking code

Blocking code keeps the event loop from processing other work. Non-blocking code yields control back to the loop when it's waiting on something.
Blocking I/O example:
:start-after: __blocking_http_begin__
:end-before: __blocking_http_end__
:language: python
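A sketch of the anti-pattern (`https://example.com` is a placeholder URL):

```python
import requests
from ray import serve


@serve.deployment
class BlockingHTTP:
    async def __call__(self) -> int:
        # Anti-pattern: requests.get is synchronous, so even inside an
        # `async def` handler the event loop stalls until it returns.
        response = requests.get("https://example.com")
        return response.status_code


app = BlockingHTTP.bind()
```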
Even though the method is `async def`, `requests.get` blocks the loop, so no other requests can make progress on this replica while the call is in flight. Blocking inside `async def` is still blocking.
Non-blocking equivalent with async HTTP client:
:start-after: __async_http_begin__
:end-before: __async_http_end__
:language: python
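A sketch using an async HTTP client; `httpx` is just one choice of client here, not necessarily the one the referenced example uses:

```python
import httpx
from ray import serve


@serve.deployment
class NonBlockingHTTP:
    def __init__(self):
        # Reuse one async client across requests.
        self._client = httpx.AsyncClient()

    async def __call__(self) -> int:
        # The coroutine suspends at the await, so the event loop can serve
        # other requests while this one waits on the network.
        response = await self._client.get("https://example.com")
        return response.status_code


app = NonBlockingHTTP.bind()
```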
Non-blocking equivalent using a threadpool:
:start-after: __threaded_http_begin__
:end-before: __threaded_http_end__
:language: python
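A sketch that keeps the blocking client but moves the call off the event loop:

```python
import asyncio

import requests
from ray import serve


@serve.deployment
class ThreadedHTTP:
    async def __call__(self) -> int:
        # Run the blocking call in a worker thread so the event loop stays
        # free while the HTTP request is in flight.
        response = await asyncio.to_thread(requests.get, "https://example.com")
        return response.status_code


app = ThreadedHTTP.bind()
```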
## Asyncio, the GIL, and CPU-bound work

It's common to expect async code to "use all the cores" or to make CPU-heavy code faster. Asyncio doesn't do that.
Asyncio gives you concurrency for I/O-bound workloads:
- Many requests can be in flight at once, as long as each spends most of its time suspended at an `await`.

This is ideal for high-throughput APIs that mostly wait on external systems.
True CPU parallelism usually comes from:

- running more replicas (separate processes),
- native code that releases the GIL and uses threads internally, or
- offloading work to Ray tasks or actors.
Python's GIL means that pure Python bytecode runs one thread at a time in a process, even if you use a threadpool.
Many numeric and ML libraries, such as NumPy, release the GIL while doing heavy work in native code.
In these cases, you can still get useful parallelism from threads inside a single replica process:
:start-after: __numpy_deployment_begin__
:end-before: __numpy_deployment_end__
:language: python
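A sketch of thread-level parallelism with NumPy (matrix sizes and the deployment name are illustrative):

```python
import asyncio

import numpy as np
from ray import serve


@serve.deployment(ray_actor_options={"num_cpus": 4})
class MatmulDeployment:
    def _matmul(self) -> float:
        # Large matrix multiplication spends most of its time in native BLAS
        # code, which releases the GIL, so several threads can make progress
        # at the same time.
        a = np.random.rand(2000, 2000)
        b = np.random.rand(2000, 2000)
        return float((a @ b).sum())

    async def __call__(self) -> float:
        return await asyncio.to_thread(self._matmul)


app = MatmulDeployment.bind()
```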
However, the speedup depends on how much of the work actually runs in native code with the GIL released. For predictable CPU scaling, it's usually simpler to increase the number of replicas.
In short:

- `async def` improves concurrency for I/O-bound code.
- CPU-bound code doesn't get faster just because you mark it `async`.

## How `max_ongoing_requests` and replica concurrency work

Each deployment has a `max_ongoing_requests` configuration that controls how many in-flight requests a replica handles at once.
:start-after: __max_ongoing_requests_begin__
:end-before: __max_ongoing_requests_end__
:language: python
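A minimal sketch of setting the option (the value 32 is only illustrative):

```python
from ray import serve


# max_ongoing_requests bounds how many requests the router assigns to
# each replica at once.
@serve.deployment(max_ongoing_requests=32, num_replicas=2)
class BoundedConcurrency:
    async def __call__(self) -> str:
        return "ok"


app = BoundedConcurrency.bind()
```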
Key points:

- The router doesn't send more than `max_ongoing_requests` requests to a replica at a time.
- Additional requests wait in a queue until the replica's in-flight count drops below `max_ongoing_requests`.

How useful `max_ongoing_requests` is depends on how your handler behaves.
### `async` handlers and `max_ongoing_requests`

With an `async def` handler that spends most of its time awaiting I/O, `max_ongoing_requests` directly controls concurrency:
:start-after: __async_io_bound_begin__
:end-before: __async_io_bound_end__
:language: python
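A sketch of an I/O-bound async handler where `max_ongoing_requests` translates directly into overlapping requests:

```python
import asyncio

from ray import serve


@serve.deployment(max_ongoing_requests=100)
class AsyncIOBound:
    async def __call__(self) -> str:
        # Each request spends nearly all of its time suspended here, so up
        # to max_ongoing_requests of them can overlap on one replica.
        await asyncio.sleep(1)
        return "done"


app = AsyncIOBound.bind()
```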
### `def` handlers and `max_ongoing_requests`

With a blocking `def` handler that runs on the event loop (threadpool disabled), `max_ongoing_requests` doesn't give you the concurrency you expect:
:start-after: __blocking_cpu_begin__
:end-before: __blocking_cpu_end__
:language: python
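A sketch of a blocking `def` handler (the loop bound is only a stand-in for real CPU work):

```python
from ray import serve


@serve.deployment(max_ongoing_requests=100)
class BlockingCPU:
    def __call__(self) -> int:
        # Pure-Python CPU work. If this runs on the user event loop
        # (threadpool disabled), requests execute one at a time regardless
        # of max_ongoing_requests.
        total = 0
        for i in range(10_000_000):
            total += i
        return total


app = BlockingCPU.bind()
```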
In this case:

- The event loop can only run one blocking call at a time.
- Even with `max_ongoing_requests=100`, the replica effectively processes requests serially.

If you enable the sync-in-threadpool behavior (see the next section), each in-flight request can run in a thread:
:start-after: __cpu_with_threadpool_begin__
:end-before: __cpu_with_threadpool_end__
:language: python
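A sketch of the same style of handler when the flag is enabled; the flag is set in the environment, not in code:

```python
# Assumes the replica runs with RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1, so Serve
# executes this `def` handler in a worker thread for each request.
from ray import serve


@serve.deployment(max_ongoing_requests=8)
class CPUWithThreadpool:
    def __call__(self) -> int:
        # Each in-flight request runs in its own thread. Pure-Python work is
        # still serialized by the GIL, but the event loop stays responsive
        # and native code that releases the GIL can overlap.
        total = 0
        for i in range(1_000_000):
            total += i
        return total


app = CPUWithThreadpool.bind()
```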
Now:

- Up to `max_ongoing_requests` calls can be running at once, each in its own worker thread.
- The event loop stays free to accept new requests and run other work.

For heavily CPU-bound workloads, it's usually better to:

- keep `max_ongoing_requests` modest (to avoid queueing too many heavy tasks), and
- scale out with more replicas (`num_replicas`) rather than pushing a single replica's concurrency too high.

## Environment variables

Ray Serve exposes several environment variables that control how user code interacts with event loops and threads.
### `RAY_SERVE_RUN_SYNC_IN_THREADPOOL`

By default (`RAY_SERVE_RUN_SYNC_IN_THREADPOOL=0`), synchronous methods in a deployment run directly on the user event loop. To help you migrate to a safer model, Serve emits a warning like:
```text
RAY_SERVE_RUN_SYNC_IN_THREADPOOL_WARNING: Calling sync method '...' directly on the asyncio loop. In a future version, sync methods will be run in a threadpool by default...
```
This warning means:

- You have a `def` method that is currently running on the event loop.
- In a future Ray version, sync methods will run in a threadpool by default.

You can opt in to the future behavior now by setting:
```bash
export RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1
```
When this flag is 1:

- Serve runs `def` deployment methods in a threadpool instead of on the user event loop.
- The event loop stays responsive even if a sync method blocks.

Before enabling this in production, make sure:

- Your sync methods are thread-safe, because multiple requests can now run them concurrently.
- Any shared state in the deployment is protected accordingly (for example, with locks).
### `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD`

By default, Serve runs user code in a separate event loop from the replica's main/control loop:
```bash
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=1  # default
```
This isolation:

- Keeps health checks and other control-plane operations responsive even when user code blocks its own loop.
- Prevents user code from starving Serve's internal system tasks.

You can disable this behavior:
```bash
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0
```
Only advanced users should change this. When user code and system tasks share a loop, any blocking operation in user code can interfere with replica health and control-plane operations.
### `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP`

Serve's request router also runs on its own event loop by default:
```bash
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=1  # default
```
This ensures:

- Request routing and scheduling stay responsive even when user handlers block their own loop.

Disabling this:
```bash
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0
```
makes the router share an event loop with other work. This can reduce overhead in advanced, highly optimized scenarios, but makes the system more sensitive to blocking operations. See High throughput optimization.
For most production deployments, you should keep the defaults (1) for both separate-loop flags.
## Batching and streaming

Batching and streaming both rely on the event loop staying responsive. They don't change where your code runs: batched handlers and streaming handlers still run on the same user event loop as any other handler. This means that if you add batching or streaming on top of blocking code, you can make event loop blocking effects much worse.
### Batching

When you enable batching, Serve groups multiple incoming requests together and passes them to your handler as a list. The handler still runs on the user event loop, but each call now processes many requests at once instead of just one. If that batched work is blocking, it blocks the event loop for all of those requests at the same time.
The following example shows a batched deployment:
:start-after: __batched_model_begin__
:end-before: __batched_model_end__
:language: python
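A sketch of a batched deployment; `_run_model` is a stand-in for your real model call:

```python
from typing import List

from ray import serve


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def handle_batch(self, inputs: List[float]) -> List[float]:
        # The whole batch is processed in one call. If this work is CPU-heavy
        # and runs inline, it blocks the event loop for the whole batch.
        return self._run_model(inputs)

    def _run_model(self, inputs: List[float]) -> List[float]:
        return [x * 2 for x in inputs]

    async def __call__(self, value: float) -> float:
        # Callers pass single items; @serve.batch groups them into lists.
        return await self.handle_batch(value)


app = BatchedModel.bind()
```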
The batch handler runs on the user event loop:

- If `_run_model` is CPU-heavy and runs inline, it blocks the loop for the duration of the batch.

To avoid this, offload the model call to a threadpool:

:start-after: __batched_model_offload_begin__
:end-before: __batched_model_offload_end__
:language: python
:emphasize-lines: 9-16
This keeps the event loop responsive while the model runs in a thread.
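A sketch of the offloaded variant, assuming the same stand-in `_run_model`:

```python
import asyncio
from typing import List

from ray import serve


@serve.deployment
class BatchedModelOffload:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def handle_batch(self, inputs: List[float]) -> List[float]:
        # Offload the CPU-heavy model call to a worker thread. Awaiting here
        # yields the event loop, so other requests keep making progress while
        # the batch runs.
        return await asyncio.to_thread(self._run_model, inputs)

    def _run_model(self, inputs: List[float]) -> List[float]:
        return [x * 2 for x in inputs]

    async def __call__(self, value: float) -> float:
        return await self.handle_batch(value)


app = BatchedModelOffload.bind()
```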
### `max_concurrent_batches` and event loop yielding

The `@serve.batch` decorator accepts a `max_concurrent_batches` argument that controls how many batches can be processed concurrently. However, this argument only helps if your batch handler yields control back to the event loop during processing.
If your batch handler blocks the event loop (for example, by doing heavy CPU work without awaiting or offloading), max_concurrent_batches won't provide the concurrency you expect. The event loop can only start processing a new batch when the current batch yields control.
To get the benefit of `max_concurrent_batches`:

- Use `async def` for your batch handler and await I/O operations or offloaded CPU work.
- Offload blocking work with `asyncio.to_thread()` or `loop.run_in_executor()`.

In the offloaded batch example above, the handler yields to the event loop when awaiting the threadpool executor, which allows multiple batches to be in flight simultaneously (up to the `max_concurrent_batches` limit).
### Streaming

Streaming is different from a regular response because the client starts receiving data while your handler is still running. Serve calls your handler once, gets back a generator or async generator, and then repeatedly asks it for the next chunk. That generator code still runs on the user event loop (or in a worker thread if you offload it).
Streaming is especially sensitive to blocking:

- The generator runs for the lifetime of the response, so blocking between yields stalls the loop repeatedly, once per chunk.
- While the loop is stalled, every other request on the replica is delayed, not just the streaming one.

Bad streaming example:
:start-after: __blocking_stream_begin__
:end-before: __blocking_stream_end__
:language: python
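A sketch of the anti-pattern: an async generator that blocks between chunks:

```python
import time

from ray import serve
from starlette.requests import Request
from starlette.responses import StreamingResponse


@serve.deployment
class BlockingStream:
    async def _chunks(self):
        for i in range(5):
            # Anti-pattern: time.sleep inside the generator blocks the event
            # loop between every chunk, stalling all other requests on this
            # replica.
            time.sleep(1)
            yield f"chunk {i}\n"

    def __call__(self, request: Request) -> StreamingResponse:
        return StreamingResponse(self._chunks(), media_type="text/plain")


app = BlockingStream.bind()
```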
Better streaming example:
:start-after: __async_stream_begin__
:end-before: __async_stream_end__
:language: python
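A sketch of the non-blocking variant, which awaits between yields:

```python
import asyncio

from ray import serve
from starlette.requests import Request
from starlette.responses import StreamingResponse


@serve.deployment
class AsyncStream:
    async def _chunks(self):
        for i in range(5):
            # Awaiting returns control to the event loop, so other requests
            # keep making progress while this stream runs.
            await asyncio.sleep(1)
            yield f"chunk {i}\n"

    def __call__(self, request: Request) -> StreamingResponse:
        return StreamingResponse(self._chunks(), media_type="text/plain")


app = AsyncStream.bind()
```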
In streaming scenarios:

- Prefer `async def` generators that use `await` between yields.
- If producing a chunk requires blocking work, offload it to a thread and await the result before yielding.

## Offloading patterns

This section summarizes common offloading patterns you can use inside async handlers.
### Offload blocking I/O from `async def`

:start-after: __offload_io_begin__
:end-before: __offload_io_end__
:language: python
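A sketch of the pattern; `read_file_blocking` and the path are placeholders for any blocking I/O call:

```python
import asyncio

from ray import serve


def read_file_blocking(path: str) -> str:
    # Stand-in for any blocking I/O call (file read, database driver, etc.).
    with open(path) as f:
        return f.read()


@serve.deployment
class OffloadIO:
    async def __call__(self) -> str:
        loop = asyncio.get_running_loop()
        # run_in_executor(None, ...) uses the loop's default threadpool.
        return await loop.run_in_executor(None, read_file_blocking, "/etc/hostname")


app = OffloadIO.bind()
```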
### Offload CPU-bound work from `async def`

:start-after: __offload_cpu_begin__
:end-before: __offload_cpu_end__
:language: python
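A sketch of the pattern; `cpu_heavy` is a placeholder for real CPU-bound work:

```python
import asyncio
import hashlib

from ray import serve


def cpu_heavy(data: bytes) -> str:
    # Stand-in for CPU-heavy work (feature extraction, hashing, and so on).
    for _ in range(100_000):
        data = hashlib.sha256(data).digest()
    return data.hex()


@serve.deployment
class OffloadCPU:
    async def __call__(self) -> str:
        # asyncio.to_thread keeps the event loop responsive; real speedups
        # depend on how much of the work releases the GIL.
        return await asyncio.to_thread(cpu_heavy, b"payload")


app = OffloadCPU.bind()
```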
### Fan work out to Ray tasks

:::{note}
While you can spawn Ray tasks from Ray Serve deployments, this approach isn't recommended because it lacks tooling for observability and debugging.
:::
:start-after: __ray_parallel_begin__
:end-before: __ray_parallel_end__
:language: python
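A sketch of the pattern, keeping the caveat from the note above in mind; `expensive_chunk` and the fan-out width are illustrative:

```python
import asyncio

import ray
from ray import serve


@ray.remote
def expensive_chunk(x: int) -> int:
    # Runs in a separate Ray worker process, outside the replica.
    return x * x


@serve.deployment
class RayParallel:
    async def __call__(self) -> int:
        # Fan work out to Ray tasks and await the results without blocking
        # the replica's event loop (ObjectRefs are awaitable).
        refs = [expensive_chunk.remote(i) for i in range(8)]
        results = await asyncio.gather(*refs)
        return sum(results)


app = RayParallel.bind()
```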
This pattern moves CPU-heavy work into separate Ray worker processes and lets the handler await the results without blocking the replica's event loop.

## Summary

- Prefer `async def` for I/O-bound and streaming work so the event loop can stay responsive.
- Use `max_ongoing_requests` to bound concurrency per replica, but remember that blocking `def` handlers can still serialize work if they run on the event loop.
- Consider enabling `RAY_SERVE_RUN_SYNC_IN_THREADPOOL` once your code is thread-safe, and be aware of the sync-in-threadpool warning.