# Automatic Prefix Caching
Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query that shares a prefix with one of the existing queries can directly reuse that KV cache and skip the computation of the shared part.
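The idea above can be sketched with a minimal, stdlib-only simulation. This is *not* vLLM's implementation: the block size, hashing scheme, and class names below are illustrative assumptions. The key point it shows is that each KV-cache block is keyed by all tokens up to and including that block, so two queries sharing a prefix hit the same cached blocks.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per simulated KV-cache block (illustrative, not vLLM's value)

def block_keys(tokens):
    """Key each full block by a hash of ALL tokens up to its end,
    so a block's key encodes its entire prefix, not just its own tokens."""
    keys = []
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        prefix = tuple(tokens[:end])
        keys.append(hashlib.sha256(repr(prefix).encode()).hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block key -> simulated KV block

    def process(self, tokens):
        """Return (hits, misses): hits reuse a cached block, misses 'compute' one."""
        hits = misses = 0
        for key in block_keys(tokens):
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = object()  # stand-in for computing the KV block
                misses += 1
        return hits, misses

cache = PrefixCache()
doc = list(range(12))              # a 12-token "long document" = 3 full blocks
q1 = doc + [100, 101, 102, 103]   # first query appended to the document
q2 = doc + [200, 201, 202, 203]   # second query, same document prefix

print(cache.process(q1))  # (0, 4): nothing cached yet, all 4 blocks computed
print(cache.process(q2))  # (3, 1): the 3 document blocks are reused
```

Because the key covers the whole prefix, a block is only reused when everything before it matches too, which is what makes the reuse safe.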
!!! note
    Technical details on how vLLM implements APC can be found here.
Set `enable_prefix_caching=True` in the vLLM engine to enable APC. Here is an example:
examples/offline_inference/automatic_prefix_caching.py
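For reference, a minimal offline-inference sketch of enabling APC might look like the following. The model name and prompts are placeholder assumptions, not part of the linked example; running this requires a machine that can serve the chosen model.

```python
from vllm import LLM, SamplingParams

# Enable APC when constructing the engine. The model name here is an
# arbitrary small example; substitute the model you actually serve.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# A long shared prefix, e.g. a document the user queries repeatedly.
long_document = "..."  # placeholder for the actual document text

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# The second request shares the document prefix with the first, so its
# prefill can reuse the KV cache computed for the first request.
out1 = llm.generate(long_document + "\nQ: Summarize the document.", sampling)
out2 = llm.generate(long_document + "\nQ: List the key dates.", sampling)
```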
We describe two example workloads where APC can provide a large performance benefit:

- **Long document query**, where the user repeatedly queries the same long document (e.g. a software manual or an annual report) with different questions. With APC, vLLM processes the long document only once, and all future requests can reuse its KV cache instead of recomputing it.
- **Multi-round conversation**, where the user chats with the application multiple times in the same session. With APC, vLLM reuses the processing results of the chat history across all follow-up rounds, so each subsequent request is served much faster.
APC in general does not reduce the performance of vLLM. That said, APC only reduces the time spent processing queries (the prefill phase); it does not reduce the time spent generating new tokens (the decode phase). So APC brings no performance gain when vLLM spends most of its time generating answers (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so no computation can be reused).