src/plugins/intel_gpu/docs/dynamic_shape/kv_cache.md
For auto-regressive transformer models, the KV (key-value) cache plays a pivotal role in avoiding duplicated computation for past tokens. In OpenVINO, management of the KV cache differs between the continuous batching (CB) pipeline and the non-CB pipeline.
In the CB pipeline with openvino_genai, the SDPA (Scaled Dot Product Attention) operations in the OV IR (intermediate representation) are converted to PagedAttention operations, and the KV cache memory is managed by the CB pipeline itself. The CB pipeline is the default mode of openvino_genai for running LLMs.
On the other hand, a non-CB pipeline can use a stateful model for KV cache optimization. In this case, the KV cache memory is managed by the plugin.
This document explains how the KV cache is managed for a stateful model within a non-CB pipeline.
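For context, here is a minimal sketch of how an application can drive such a stateful model directly (outside openvino_genai's CB pipeline); the model path is a placeholder and the token-feeding loop is omitted. The point is that the application never allocates or copies KV cache buffers itself; they live in the infer request's internal states.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Placeholder path to a stateful LLM IR (with the ReadValue/Assign pattern inside).
    auto compiled = core.compile_model("model.xml", "GPU");
    auto request = compiled.create_infer_request();

    // The KV cache persists inside the request's VariableStates between calls,
    // so each generation step only feeds the newly produced token(s).
    // Starting a new prompt requires clearing the accumulated state:
    request.reset_state();

    // ... set input tensors for the current step and call request.infer() ...
    return 0;
}
```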
The following diagram shows a typical KV cache pattern in a stateful LLM.
```mermaid
graph TB
    StateInit["State Initialize Subgraph"]
    ReadValue["ReadValue (Past KV)"]
    BeamIdx["Parameter beam_idx"]
    Gather["Gather"]
    KVProj["KV projection for new token"]
    Concat["Concat"]
    SDPA["SDPA"]
    Assign["Assign (Present KV)"]
    VariableState["VariableState"]

    %% flows
    StateInit --> ReadValue
    ReadValue -.-> VariableState
    ReadValue --> Gather
    BeamIdx --> Gather
    Gather --> Concat
    KVProj --> Concat
    Concat --> SDPA
    Concat --> Assign
    Assign -.-> VariableState
```
Here, the KV cache from previous tokens is stored in the VariableState's memory and loaded by the ReadValue operation. It is then concatenated with the KV produced for the new token. The combined KV is written back to the VariableState's memory via the Assign operation as the updated (present) KV cache, and this present KV is consumed by the SDPA operation. Note also the Gather operation after the ReadValue operation; it is used only when the beam_idx input is provided for beam search sampling.
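To make the pattern concrete, the following is a minimal, self-contained sketch (not the plugin's code, and only the K path of a single layer) of how such a stateful subgraph could be built with the OpenVINO C++ API. The shapes, the variable id, and the zero-length state initializer are illustrative assumptions.

```cpp
#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/op/ops.hpp>

// Build a minimal stateful KV-cache pattern (K path only).
std::shared_ptr<ov::Model> make_kv_cache_pattern() {
    const ov::element::Type type = ov::element::f32;
    // [batch, num_heads, seq_len, head_size]; batch and seq_len are dynamic.
    ov::PartialShape kv_shape{-1, 32, -1, 64};

    // New K projection for the current token(s) and beam_idx are model inputs.
    auto new_k = std::make_shared<ov::op::v0::Parameter>(type, kv_shape);
    auto beam_idx = std::make_shared<ov::op::v0::Parameter>(ov::element::i32, ov::PartialShape{-1});

    // VariableState that holds the past K between inference calls.
    auto variable = std::make_shared<ov::op::util::Variable>(
        ov::op::util::VariableInfo{kv_shape, type, "past_k"});

    // State initialization subgraph: an empty (zero-length) tensor on the first iteration.
    auto init = ov::op::v0::Constant::create(type, ov::Shape{1, 32, 0, 64}, std::vector<float>{});
    auto past_k = std::make_shared<ov::op::v6::ReadValue>(init, variable);

    // Reorder the past K along the batch axis for beam search.
    auto axis = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, {0});
    auto gathered = std::make_shared<ov::op::v8::Gather>(past_k, beam_idx, axis);

    // Concatenate past and new K along the sequence axis and store the result back.
    auto present_k = std::make_shared<ov::op::v0::Concat>(ov::OutputVector{gathered, new_k}, 2);
    auto assign = std::make_shared<ov::op::v6::Assign>(present_k, variable);

    auto result = std::make_shared<ov::op::v0::Result>(present_k);
    return std::make_shared<ov::Model>(ov::ResultVector{result},
                                       ov::SinkVector{assign},
                                       ov::ParameterVector{new_k, beam_idx});
}
```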
Here, the GPU plugin applies optimizations targeting three key points of this pattern: the Concat, the Assign, and the Gather operations around the KV cache. To this end, we introduced an internal operation, ov::intel_gpu::KVCache.
First, the Concat and Assign operations are fused into a single KVCache operation by the KVCacheFusion transformation. Next, the IndirectKVCache transformation further fuses the Gather operation into the same KVCache operation. (You can refer to the diagrams in the header files of each pass for detailed fusion patterns.)
The resulting graph after the transformations is as follows:
```mermaid
graph TB
    StateInit["State Initialize Subgraph"]
    BeamIdx["Parameter beam_idx"]
    NewKV["KV projection for new token"]
    KVCache["KVCache"]
    SDPA["SDPA"]
    VariableState["VariableState"]
    ReadValue["ReadValue"]

    %% main flows
    StateInit --> ReadValue
    ReadValue --> KVCache
    BeamIdx --> KVCache
    NewKV --> KVCache
    KVCache -- beam_table --> SDPA
    KVCache -- KV --> SDPA
    KVCache -.-> VariableState
    ReadValue -.-> VariableState
```
The following sections describe the policies and optimizations applied to the KVCache operation.
As described above, the KV projection data used as the input to the current SDPA step is formed by concatenating the KV data from past tokens with that of the new token. In the original graph, this concatenation requires two memory copies: one for the past KV and another for the new KV. With intel_gpu::KVCache, we reduce this to a single copy by letting the kv_cache output share the VariableState's memory, so that only the new token's KV needs to be written.
The relevant code sections that enable the in-place KV cache are as follows:

- `do_runtime_in_place_kv_cache` checks whether the in-place optimization is applicable to the kv_cache node and, if applicable, sets the kv_cache primitive_inst's `_can_be_optimized` flag to true.
- When the kv_cache primitive_inst's `_can_be_optimized` is true, the kv_cache primitive_inst's output memory is set to the VariableState's memory, so the past KV is neither reallocated nor copied.
- Otherwise (i.e., when `_can_be_optimized` is not true), new memory is allocated for the kv_cache inst and then set as the VariableState's memory. At this time, we allocate larger memory for kv_cache than actually needed at that time (i.e., preallocate); the detail of the policy is described in the next subsection.
- The kv_cache implementation runs concatenate kernels. When the `_can_be_optimized` flag is false, it launches one concatenate kernel per input (two in total for past and new KV), with each kernel copying its input data into the target memory region. When the `_can_be_optimized` flag is true, the concatenate (i.e., copy) of the first input (i.e., the past KV) is skipped, reducing the operation to a single concatenate kernel for the new KV input (see link).

Suppose that the KV cache has shape [batch, num_heads, seq_len, head_size]. For example, let's consider a case when the past KV cache's data shape is [1, 32, 17, 64] and the new KV projection data shape is [1, 32, 1, 64]. If the preallocated memory can hold data with shape [1, 32, 128, 64], `do_runtime_in_place_kv_cache` sets the upper padding of the kv_cache inst to [0, 0, 111, 0]. At execution, the past KV data remains unchanged, and the new token's KV values are written into the remaining space along the concat axis. As a result, the effective data shape becomes [1, 32, 18, 64], and the updated upper padding becomes [0, 0, 110, 0].
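As a concrete illustration of the bookkeeping above, the following hedged sketch (a hypothetical helper, not the plugin's actual code) computes the upper padding along the sequence axis from the preallocated capacity and the current effective sequence length:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Layout assumed: [batch, num_heads, seq_len, head_size], concat axis = 2.
// Hypothetical helper mirroring the quantity do_runtime_in_place_kv_cache tracks:
// the upper padding is the unused tail of the preallocated buffer.
std::array<int64_t, 4> upper_padding_for_kv_cache(int64_t prealloc_seq_len,
                                                  int64_t current_seq_len) {
    assert(current_seq_len <= prealloc_seq_len);
    return {0, 0, prealloc_seq_len - current_seq_len, 0};
}

// Example from the text, with a preallocated seq_len of 128:
//   before appending the new token (past KV seq_len = 17): padding = {0, 0, 111, 0}
//   after appending one new token   (seq_len = 18):        padding = {0, 0, 110, 0}
```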
To achieve the in-place KV cache and reduce the host overhead of memory allocation, the GPU plugin allocates larger memory for the KV cache than is actually needed at the moment. In the general case, we preallocate only enough memory for 10 iterations, but for the KV cache we preallocate considerably more (enough for around 128 tokens).
However, we do not use the same preallocation size for all layers; each layer gets a slightly different size. You can see how the allocation size is determined in this function.
We vary memory allocation across layers to prevent periodic peaks in overall memory usage. If every kv_cache instance used the same allocation size, their reallocation intervals would be the same. When a reallocation occurs, the concatenation inside the kv_cache cannot be performed in place, so both the input and output buffers must coexist temporarily. This effectively doubles the required memory for that kv_cache inst. If all layers hit this reallocation point simultaneously, the combined effect produces a large memory spike. By staggering the allocation schedules across layers, we mitigate these synchronized peaks and significantly reduce overall memory pressure.
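To illustrate the staggering idea only (this is not the plugin's actual sizing logic, which lives in the function referenced above), a per-layer capacity could be computed as follows; base_block and stagger_step are made-up parameters:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration of staggered preallocation (not the actual policy).
// Each kv_cache instance preallocates room for a different number of tokens,
// so different layers reach their reallocation points at different times.
int64_t preallocated_seq_capacity(int64_t current_seq_len,
                                  size_t layer_index,
                                  int64_t base_block = 128,
                                  int64_t stagger_step = 8) {
    // Per-layer offset in [0, base_block): layer 0 -> 0, layer 1 -> 8, ...
    const int64_t offset = static_cast<int64_t>(layer_index) * stagger_step % base_block;
    // Round the required length up to the next staggered boundary.
    const int64_t needed = current_seq_len + 1;  // room for the next token
    const int64_t blocks = (needed - offset + base_block - 1) / base_block;
    return offset + blocks * base_block;
}
```

With a scheme like this, neighboring layers exhaust their preallocated buffers at different sequence lengths, which is exactly the staggering effect described above.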