doc/source/serve/llm/user-guides/data-parallel-attention.md
(data-parallel-attention-guide)=

# Data parallel attention
Deploy LLMs with data parallel attention for increased throughput and better resource utilization, especially for sparse MoE (Mixture of Experts) models.
Data parallel attention creates multiple coordinated inference engine replicas that process requests in parallel. This pattern is most effective when combined with expert parallelism for sparse MoE models, where attention (QKV) layers are replicated across replicas while MoE experts are sharded. This separation provides:
Consider this pattern when:
When not to use data parallel attention:
- Tensor parallelism alone is sufficient, which is the case as long as `TP_size <= num_kv_heads`. Beyond that, TP requires KV cache replication, at which point data parallel attention becomes a better choice.

The following example shows how to deploy with data parallel attention. Each data parallel deployment requires `num_replicas * data_parallel_size * tensor_parallel_size` GPUs.
:language: python
:start-after: __dp_basic_example_start__
:end-before: __dp_basic_example_end__
For production deployments, use a declarative YAML configuration file:
```yaml
applications:
- name: dp_llm_app
  route_prefix: /
  import_path: ray.serve.llm:build_dp_openai_app
  args:
    llm_config:
      model_loading_config:
        model_id: Qwen/Qwen2.5-0.5B-Instruct
      deployment_config:
        num_replicas: 2
      engine_kwargs:
        data_parallel_size: 4
        tensor_parallel_size: 2
```
Deploy with the CLI:

```bash
serve deploy dp_config.yaml
```
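Once the deployment is live, you can check the state of the application and its replicas with the standard Ray Serve CLI command `serve status`.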
Two settings control the data parallel layout:

- `data_parallel_size`: Number of data parallel replicas within a data parallel group. Must be a positive integer and is passed in through `engine_kwargs`.
- `num_replicas`: Can be set to any positive integer, left unset (defaults to 1), or `"auto"` to enable autoscaling based on request queue length.

:::{note}
Within a data parallel deployment, `num_replicas` under `deployment_config` refers to the number of data parallel groups, which translates to `num_replicas * data_parallel_size` data parallel replicas (equivalent to the number of Ray Serve replicas). Each data parallel replica runs its own vLLM data parallel server.
:::
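For example, with the configuration above (`num_replicas: 2`, `data_parallel_size: 4`, `tensor_parallel_size: 2`), Ray Serve runs 2 * 4 = 8 data parallel replicas, and the deployment occupies 2 * 4 * 2 = 16 GPUs in total.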
In data parallel attention, all data parallel replicas within a data parallel group work together as a cohesive unit by leveraging Ray Serve's gang scheduling capability:
- Each replica receives its rank (0 through `data_parallel_size - 1`) from Ray Serve's controller to start a vLLM data parallel server.
- Each data parallel group is gang scheduled as a unit, and groups are scheduled independently of one another when `num_replicas > 1`.

There's no coordination overhead introduced by Ray Serve LLM.
For more details, see {doc}`../architecture/serving-patterns/data-parallel`.
Test with a chat completion request:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fake-key" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain data parallel attention"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
You can also test programmatically:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="fake-key",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[
        {"role": "user", "content": "Explain data parallel attention"}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
You can combine data parallel attention with prefill-decode disaggregation to scale both phases independently while using DP within each phase. This pattern is useful when you need high throughput for both prefill and decode phases.
The following example shows a complete, functional deployment:
:language: python
:start-after: __dp_pd_example_start__
:end-before: __dp_pd_example_end__
This configuration creates:
This allows you to:
:::{note}
This example uses 4 GPUs total (2 for prefill, 2 for decode). Adjust the `data_parallel_size` values based on your available GPU resources.
:::
:::{note}
For this example to work, you need to have NIXL installed. See the {doc}`prefill-decode` guide for prerequisites and installation instructions.
:::
- {doc}`../architecture/serving-patterns/data-parallel` - Data parallel attention architecture details
- {doc}`prefill-decode` - Prefill-decode disaggregation guide
- {doc}`../architecture/serving-patterns/index` - Overview of serving patterns
- {doc}`../quick-start` - Basic LLM deployment examples