docs/features/disagg_encoder.md
A disaggregated encoder runs the vision-encoder stage of a multimodal LLM in a process separate from the prefill/decode stage. Deploying these two stages as independent vLLM instances brings several practical benefits.

Design doc: https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE
The current reference pathway is `ExampleConnector`.
The following ready-to-run scripts show the workflow:

- 1 Encoder instance + 1 PD instance:
  `examples/online_serving/disaggregated_encoder/disagg_1e1pd_example.sh`
- 1 Encoder instance + 1 Prefill instance + 1 Decode instance:
  `examples/online_serving/disaggregated_encoder/disagg_1e1p1d_example.sh`
Please also refer to the tests under `tests/v1/ec_connector`.
Disaggregated encoding is implemented by running two parts:

- The encoder and prefill/decode stages run in separate instances, launched with `disagg_encoder_example.sh` (E->PD) or, with prefill and decode also disaggregated, `disagg_epd_example.sh` (E->P->D).
- A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.

All related code is under `vllm/distributed/ec_transfer`.
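The connector pattern can be sketched in miniature as follows. All class and function names below are invented for illustration; they are not the real `ec_transfer` API, which lives in the directory above.

```python
# Toy illustration of an encoder-cache (EC) connector: a send side on the
# encoder instance paired with a receive side on the PD instance, keyed by
# request ID. NOT the real vLLM API.
from typing import Dict
import numpy as np

class ToyECStore:
    """Stands in for the actual transport (e.g. shared storage or RDMA)."""
    def __init__(self) -> None:
        self._cache: Dict[str, np.ndarray] = {}

    def put(self, key: str, embeddings: np.ndarray) -> None:
        self._cache[key] = embeddings

    def get(self, key: str) -> np.ndarray:
        # Consumed exactly once by the PD instance.
        return self._cache.pop(key)

def encoder_side(store: ToyECStore, request_id: str) -> None:
    """Encoder instance: run the vision encoder, then ship the embeddings."""
    embeddings = np.random.rand(576, 1024).astype(np.float32)  # fake ViT output
    store.put(request_id, embeddings)

def pd_side(store: ToyECStore, request_id: str) -> np.ndarray:
    """PD instance: fetch embeddings instead of re-running the vision encoder."""
    return store.get(request_id)

store = ToyECStore()
encoder_side(store, "req-0")
emb = pd_side(store, "req-0")
print(emb.shape)  # (576, 1024)
```

The key design point the sketch captures is that the PD instance never loads or runs the vision encoder; it only consumes embeddings produced elsewhere.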
Here is a figure illustrating the disaggregated-encoder flow:
For the PD-disaggregation part, the Prefill instance receives the encoder cache exactly as in the disaggregated-encoder flow above. The Prefill instance then executes one step (prefill -> first output token) and transfers its KV cache to the Decode instance, which performs the remaining execution. The KV transfer happens entirely after the Prefill instance finishes its execution.
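The full E->P->D flow described above can be written out as three toy stages. Every name below is invented for illustration and does not mirror vLLM internals; the point is only the hand-off order: EC cache from encode to prefill, then KV cache from prefill to decode.

```python
# Toy end-to-end sketch of the E -> P -> D flow (illustrative names only).

def encode(mm_input: str) -> list[float]:
    """Encoder instance: produce multimodal embeddings (the EC cache)."""
    return [float(ord(c)) for c in mm_input]

def prefill(prompt: str, ec_cache: list[float]) -> tuple[str, dict]:
    """Prefill instance: consume the EC cache, run one step, and emit the
    first output token plus the KV cache for the decode instance."""
    first_token = prompt.split()[0]
    kv_cache = {"prompt": prompt, "ec_len": len(ec_cache)}
    return first_token, kv_cache

def decode(first_token: str, kv_cache: dict, max_steps: int = 3) -> str:
    """Decode instance: continue generation from the transferred KV cache."""
    tokens = [first_token] + [f"tok{i}" for i in range(max_steps)]
    return " ".join(tokens)

ec = encode("<image bytes>")          # E: runs on the encoder instance
tok, kv = prefill("Describe the image", ec)  # P: one step, then KV transfer
out = decode(tok, kv)                 # D: remaining execution
print(out)  # Describe tok0 tok1 tok2
```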
See docs/features/disagg_prefill.md for a brief overview of disaggregated prefill (v0).
The example setup uses the `NixlConnector` from `vllm/distributed/kv_transfer/kv_connector/v1/nixl/` and follows `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between the Prefill and Decode instances.
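The routing idea behind such a proxy can be sketched as follows. This is a deliberately simplified, invented interface (the real `toy_proxy_server.py` speaks HTTP to two vLLM servers): the proxy first sends the request to the Prefill instance with generation capped at one token, then hands off to the Decode instance for the rest.

```python
# Minimal sketch of a P/D proxy in the spirit of toy_proxy_server.py.
# Invented, simplified interfaces for illustration only.

def prefill_instance(request: dict) -> dict:
    # Runs exactly one step (max_tokens forced to 1), materializing the
    # KV cache that the connector will transfer to the decode instance.
    assert request["max_tokens"] == 1
    return {"kv_ready": True, "first_token": "Hello"}

def decode_instance(request: dict, prefill_result: dict) -> str:
    # Resumes generation; the KV cache arrives via the connector, so the
    # prompt is not recomputed here.
    assert prefill_result["kv_ready"]
    return prefill_result["first_token"] + " world"

def proxy(request: dict) -> str:
    """Forward the request to P with max_tokens=1, then to D for the rest."""
    p_req = {**request, "max_tokens": 1}
    p_res = prefill_instance(p_req)
    return decode_instance(request, p_res)

result = proxy({"prompt": "Say hi", "max_tokens": 16})
print(result)  # Hello world
```

Capping the prefill request at one token is what ensures the Prefill instance does only the prefill step before the KV cache is shipped onward.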