doc/source/serve/tutorials/batch.md
(serve-batch-tutorial)=
# Serve a Text Generator with Request Batching

This tutorial shows you how to deploy a text generator that processes multiple queries simultaneously using request batching.

Batching can significantly improve performance when your model benefits from parallel processing, such as GPU acceleration or vectorized operations. It increases both throughput and hardware utilization by processing multiple requests together.
:::{note}
This tutorial focuses on online serving with batching. For offline batch processing of large datasets, see batch inference with Ray Data.
:::
To run this example, install the following:

```bash
pip install "ray[serve]" transformers
```
Open a new Python file called `tutorial_batch.py`. First, import Ray Serve and some other helpers.
```{literalinclude} ../doc_code/tutorial_batch.py
:start-after: __doc_import_begin__
:end-before: __doc_import_end__
```
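If you aren't following along with the companion file, the imports used in this tutorial look roughly like the following sketch; the exact list comes from the included `tutorial_batch.py`:

```python
# Rough sketch of the imports this tutorial relies on; the exact list comes
# from the included tutorial_batch.py file.
from typing import List

from starlette.requests import Request
from transformers import pipeline

from ray import serve
```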
Ray Serve provides the `@serve.batch` decorator to automatically batch individual requests to a function or class method. The decorated method:

- Must be an `async def` method so it can handle concurrent requests.
- Takes a list of requests and returns a list of results of the same length.

```python
@serve.batch
async def my_batch_handler(self, requests: List):
    # Process multiple requests together
    results = []
    for request in requests:
        results.append(request)  # processing logic here
    return results
```
You can call the batch handler from another `async def` method in your deployment. Ray Serve batches and executes these calls together, but returns individual results just like normal function calls:

```python
class BatchingDeployment:
    @serve.batch
    async def my_batch_handler(self, requests: List):
        results = []
        for request in requests:
            results.append(request.json())  # processing logic here
        return results

    async def __call__(self, request):
        return await self.my_batch_handler(request)
```
:::{note}
Ray Serve uses opportunistic batching by default, executing requests as soon as they arrive without waiting for a full batch. You can adjust this behavior with `batch_wait_timeout_s` in the `@serve.batch` decorator, which defaults to 0. Increasing this value may improve throughput at the cost of latency under low load.
:::
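For example, a minimal sketch of a handler that waits briefly for more requests to accumulate might look like this; the class, method names, and parameter values here are illustrative, not part of the tutorial code:

```python
from typing import List

from ray import serve


@serve.deployment
class BatchedEcho:
    # Wait up to 100 ms for additional requests before running a batch.
    # Both values below are illustrative starting points, not recommendations.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, texts: List[str]) -> List[str]:
        # Receives up to 8 inputs at once and returns one result per input.
        return [text.upper() for text in texts]

    async def __call__(self, request) -> str:
        return await self.handle_batch(request.query_params["text"])
```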
Next, define a deployment that takes in a list of input strings and runs vectorized text generation on the inputs.
```{literalinclude} ../doc_code/tutorial_batch.py
:start-after: __doc_define_servable_begin__
:end-before: __doc_define_servable_end__
```
Now, prepare the deployment for serving. Note that the `@serve.batch` decorator specifies the maximum batch size with `max_batch_size=4`. This option limits how many requests Ray Serve executes together at once.
```{literalinclude} ../doc_code/tutorial_batch.py
:start-after: __doc_deploy_begin__
:end-before: __doc_deploy_end__
```
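If you aren't looking at the companion file, the included deployment looks roughly like the following sketch; the model choice and generation details are assumptions for illustration:

```python
from typing import List

from starlette.requests import Request
from transformers import pipeline

from ray import serve


@serve.deployment
class BatchTextGenerator:
    def __init__(self, model: str = "gpt2"):
        # A Hugging Face text-generation pipeline; "gpt2" is an illustrative choice.
        self.model = pipeline("text-generation", model)

    @serve.batch(max_batch_size=4)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        print(f"Processing batch of size {len(inputs)}")
        # The pipeline accepts a list of prompts, so the whole batch runs in one call.
        results = self.model(inputs)
        return [result[0]["generated_text"] for result in results]

    async def __call__(self, request: Request) -> str:
        return await self.handle_batch(request.query_params["text"])


generator = BatchTextGenerator.bind()
```

The `generator` binding at the end is what the `serve run tutorial_batch:generator` command below refers to.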
You can deploy your app in two ways.

To deploy from the command line, use the `serve run` CLI:

```console
$ serve run tutorial_batch:generator --name "Text-Completion-App"
```
Alternatively, you can deploy the app from Python with the `serve.run` function, which returns a handle that you can use to query the deployment:

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

# Import the bound application from the file you created above.
from tutorial_batch import generator

handle: DeploymentHandle = serve.run(generator, name="Text-Completion-App")
```
You can now use this handle to query the model. See the Querying the Model section below.
## Querying the Model

There are multiple ways to interact with your deployed model.
For basic testing, use `curl`:

```console
$ curl "http://localhost:8000/?text=Once+upon+a+time"
```
For higher throughput, use Ray remote tasks to send parallel requests:

```python
import ray
import requests


@ray.remote
def send_query(text):
    resp = requests.post("http://localhost:8000/", params={"text": text})
    return resp.text


# Example batch of queries
texts = [
    'Once upon a time,',
    'Hi my name is Lewis and I like to',
    'In a galaxy far far away',
]

# Send all queries in parallel
results = ray.get([send_query.remote(text) for text in texts])
```
For a more Pythonic way to query the model, you can use the deployment handle directly:

```python
import ray
from ray import serve

input_batch = [
    'Once upon a time,',
    'Hi my name is Lewis and I like to',
    'In a galaxy far far away',
]

# Initialize using the 'auto' option to connect to the already-running Ray cluster.
ray.init(address="auto")

handle = serve.get_deployment_handle("BatchTextGenerator", app_name="Text-Completion-App")

responses = [handle.handle_batch.remote(text) for text in input_batch]
results = [r.result() for r in responses]
```
To tune batching behavior, consider increasing:

- `max_batch_size` if you have sufficient memory and want higher throughput; this may increase latency.
- `batch_wait_timeout_s` if throughput is more important than latency.
- `max_concurrent_batches` if you have an asynchronous function that you want to process multiple batches with concurrently.
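For example, a handler tuned for throughput might combine these options; the values below are illustrative starting points rather than recommendations:

```python
from typing import List

from ray import serve


@serve.deployment
class TunedGenerator:
    @serve.batch(
        max_batch_size=16,          # larger batches, assuming memory allows
        batch_wait_timeout_s=0.25,  # wait up to 250 ms to fill a batch
        max_concurrent_batches=2,   # let an async handler overlap two in-flight batches
    )
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        # Replace this placeholder with your model call.
        return inputs
```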