doc/source/serve/llm/architecture/serving-patterns/data-parallel.md
(serve-llm-architecture-data-parallel)=
# Data parallel attention
Data parallel attention (DP) is a serving pattern that creates multiple inference engine instances to process requests in parallel. This pattern is most useful when you combine it with expert parallelism for sparse MoE models. In this case, the experts are parallelized across multiple machines and attention (QKV) layers are replicated across GPUs, providing an opportunity to shard across requests.
In this serving pattern, engine replicas aren't isolated: they must run in sync with one another, performing collective operations as a group, to serve a large number of requests concurrently.
*Figure (`dp-architecture`): Data parallel attention architecture showing coordination across LLMServer replicas.*
In data parallel attention serving, Ray Serve schedules `data_parallel_size` replicas as a cohesive gang (that is, a data parallel group), assigning each data parallel replica a unique rank from 0 to `data_parallel_size - 1`.

Data parallel attention serving works best when you serve sparse MoE models with expert parallelism, so that attention layers are replicated and requests can be sharded across replicas.

Consider alternatives when tensor parallelism alone can still shard the KV cache, that is, while `TP_size <= num_kv_heads`. Beyond that point, TP requires KV cache replication, which wastes memory, and DP becomes the better choice to avoid duplication. For example, for Qwen-235B, using DP=2, TP=4, EP=8 makes more sense than DP=8, EP=8 because you can still shard the KV cache with TP=4 before needing to replicate it. Benchmark these configurations with your workload to determine the optimal setup.
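For instance, the DP=2, TP=4, EP=8 layout could be expressed through the engine arguments of an LLM config. The following is a minimal sketch, assuming a vLLM-style engine that accepts `data_parallel_size`, `tensor_parallel_size`, and `enable_expert_parallel`; the model ID and accelerator type are placeholders, not a tested recipe.

```python
from ray.serve.llm import LLMConfig

# Minimal sketch of a DP=2, TP=4, EP configuration for a large MoE model.
# Model ID and accelerator type are placeholders.
llm_config = LLMConfig(
    model_loading_config=dict(model_id="my-moe-model"),
    accelerator_type="H100",
    engine_kwargs=dict(
        data_parallel_size=2,         # 2 attention (DP) replicas
        tensor_parallel_size=4,       # shard the KV cache across 4 GPUs per replica
        enable_expert_parallel=True,  # shard MoE experts instead of replicating them
    ),
)
```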
The following are the main components of DP deployments:

`DPServer` extends `LLMServer` with data parallel attention coordination. `DPServer` uses Ray Serve's gang scheduling capability to ensure that all replicas in a DP group start and fail together atomically. The following sections use "gang" and "DP group" interchangeably. The following pseudocode shows the structure:
```python
from ray import serve


class DPServer(LLMServer):
    """LLM server with data parallel attention coordination."""

    async def __init__(self, llm_config: LLMConfig):
        # Get rank and gang info from Ray Serve's gang context.
        gang_context = serve.get_replica_context().gang_context
        self.dp_rank = gang_context.rank
        self.gang_id = gang_context.gang_id

        # Register (rank 0) or obtain (other ranks) the DP master address and port.
        if self.dp_rank == 0:
            address, rpc_port = get_address_and_port()
            GangMasterInfoRegistry.register(self.gang_id, address, rpc_port)
        else:
            address, rpc_port = await GangMasterInfoRegistry.get(self.gang_id)

        # Pass DP metadata to the engine.
        llm_config.engine_kwargs["data_parallel_rank"] = self.dp_rank
        llm_config.engine_kwargs["data_parallel_address"] = address
        llm_config.engine_kwargs["data_parallel_rpc_port"] = rpc_port

        await super().__init__(llm_config)

    @classmethod
    def get_deployment_options(cls, llm_config):
        options = super().get_deployment_options(llm_config)
        dp_size = llm_config.engine_kwargs.get("data_parallel_size", 1)

        # Configure gang scheduling for the DP group. This tells the Ray Serve
        # controller to treat the data parallel replicas within a DP group as a
        # cohesive unit.
        options["gang_scheduling_config"] = GangSchedulingConfig(
            gang_size=dp_size,
            gang_placement_strategy=GangPlacementStrategy.PACK,
            runtime_failure_policy=GangRuntimeFailurePolicy.RESTART_GANG,
        )
        return options
```
Key responsibilities:

- Read the DP rank and gang ID from Ray Serve's gang context.
- Coordinate the DP master address and RPC port across the gang through `GangMasterInfoRegistry` (rank 0 registers, other ranks look it up).
- Pass the DP rank, address, and RPC port to the engine through `engine_kwargs`.
- Configure gang scheduling so the Ray Serve controller treats the `data_parallel_size` replicas as a cohesive unit.
`GangMasterInfoRegistry` is a GCS-backed KV store that persists the DP master address and port. The following pseudocode shows the structure:
```python
from typing import Tuple


class GangMasterInfoRegistry:
    """Registry for gang DP master info using the GCS KV store."""

    @classmethod
    def register(cls, gang_id: str, address: str, port: int):
        """Persist the address and port associated with a DP group (gang) in the KV store."""
        ...

    @classmethod
    async def get(cls, gang_id: str, timeout: float) -> Tuple[str, int]:
        """Poll for the address and port of a given DP group."""
        ...

    @classmethod
    def unregister(cls, gang_id: str):
        """Remove the DP master info on shutdown."""
        ...
```
Key responsibilities:

- Persist the DP master address and port for each gang, keyed by gang ID, in the GCS KV store.
- Let non-zero ranks poll for the master info until rank 0 has registered it.
- Clean up the stored info when the gang shuts down.
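For illustration, the polling behavior behind `get` could look like the following sketch. The in-memory `_kv_get` helper is a hypothetical stand-in for the GCS KV read, and the key scheme is a placeholder, not part of the actual registry.

```python
import asyncio
import json
import time
from typing import Optional

# Hypothetical in-memory stand-in for the GCS KV store; the real registry
# reads from GCS instead.
_FAKE_KV: dict[str, str] = {}


def _kv_get(key: str) -> Optional[str]:
    return _FAKE_KV.get(key)


async def poll_master_info(gang_id: str, timeout: float = 60.0) -> tuple[str, int]:
    """Sketch of the polling loop behind GangMasterInfoRegistry.get: non-zero
    ranks wait until rank 0 has published the master info for the gang."""
    deadline = time.monotonic() + timeout
    key = f"gang_master_info:{gang_id}"  # placeholder key scheme
    while time.monotonic() < deadline:
        raw = _kv_get(key)
        if raw is not None:
            info = json.loads(raw)
            return info["address"], info["rpc_port"]
        await asyncio.sleep(0.5)  # back off before polling again
    raise TimeoutError(f"DP master info for gang {gang_id} was not registered in time")
```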
*Figure (`dp-flow`): Data parallel attention request flow from client to distributed replicas.*
The following is the request flow through a data parallel attention deployment:
Each request is routed to a `DPServer` replica, which processes it with its engine. The key difference from basic serving is that all `data_parallel_size` replicas in the gang coordinate with each other rather than operating in isolation.
Data parallel attention deployments support autoscaling based on request queue length. Specify `min_replicas`, `max_replicas`, and `initial_replicas` to configure the autoscaling bounds and starting point. Note that `min_replicas`, `max_replicas`, and `initial_replicas` all refer to the number of DP groups, where each group contains `data_parallel_size` engine instances.
The bundled autoscaling example (between the `__dp_autoscaling_example_start__` and `__dp_autoscaling_example_end__` markers in the examples source) shows how to set these fields.
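As a minimal sketch (not the bundled example), the bounds might be set through the deployment's autoscaling config; the model ID and sizes below are placeholders.

```python
from ray.serve.llm import LLMConfig

# Illustrative sketch: autoscaling bounds for a DP deployment. The replica
# counts below refer to DP groups, not individual engines, so this deployment
# scales between 1 and 4 groups of 8 engines each (8 to 32 engines total).
llm_config = LLMConfig(
    model_loading_config=dict(model_id="my-moe-model"),  # placeholder
    engine_kwargs=dict(data_parallel_size=8),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            initial_replicas=2,
            max_replicas=4,
        ),
    ),
)
```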
`DPServer` always uses the PACK placement strategy to place each replica's resources together, as shown in the `gang_placement_strategy` setting above.
If any replica in a DP group fails, the Ray Serve controller restarts the entire DP group atomically. This ensures all replicas in a group are always in a consistent state, which is critical because DP replicas perform collective operations together.
You can apply data parallel attention to both the prefill and decode phases:
*Figure (`dp-pd`): Using the DP attention pattern with prefill-decode (PD) disaggregated deployments, with independent DP sizes for prefill and decode.*
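As a rough sketch of independent DP sizes, each phase carries its own `data_parallel_size` in its engine arguments; the values and model ID below are placeholders, and wiring the two configs into one disaggregated deployment is covered in the prefill-decode architecture doc.

```python
from ray.serve.llm import LLMConfig

# Sketch: independent DP sizes for the prefill and decode phases. Values and
# model ID are placeholders; see the prefill-decode doc for how the two
# configs are combined into one disaggregated deployment.
prefill_config = LLMConfig(
    model_loading_config=dict(model_id="my-moe-model"),
    engine_kwargs=dict(data_parallel_size=2, enable_expert_parallel=True),
)
decode_config = LLMConfig(
    model_loading_config=dict(model_id="my-moe-model"),
    engine_kwargs=dict(data_parallel_size=4, enable_expert_parallel=True),
)
```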
- ../overview - High-level architecture overview
- ../core - Core components and protocols
- prefill-decode - Prefill-decode disaggregation architecture
- ../routing-policies - Request routing architecture