src/plugins/intel_gpu/docs/dynamic_shape/preprocessing.md
As explained in the basic flow of primitive execution for dynamic shape in Overall flow for dynamic shape, several preprocessing steps are performed before the kernel arguments are set and the selected impl is executed:
* `update_shape` - when the input shape changes, calculates the new output shape through shape inference so that the shape is propagated to the next node.
* `update_impl` - depending on the changed shape, retrieves `primitive_impl` from the in-memory cache or selects a new impl.
* `realloc_if_needed` - allocates new output memory if necessary.

The following describes some of the representative preprocessing steps for dynamic shape execution.
To support dynamic shape in the GPU plugin, `cldnn::layout` uses `ov::PartialShape` to express shape. The existing `cldnn::tensor` does not support dynamic shape and is limited in rank, whereas `ov::PartialShape` supports both static and dynamic dimensions and has no rank limitation. When a `cldnn::primitive` is created from an `ov::op`, the `ov::PartialShape` that the `ov::op` already holds is used directly.
Note: In the existing static shape execution flow of the GPU plugin, the shape of `ov::op` may be transformed into `ov::tensor` and used, so the creation of `cldnn::primitive` from `ov::op` is kept separate from the dynamic shape execution flow. When building `cldnn::program`, if at least one node is dynamic, the `ov::intel_gpu::allow_new_shape_infer` property is set (link), and this property separates static shape and dynamic shape execution during `cldnn::primitive` creation. The two flows will be integrated in the future when the GPU plugin fully supports dynamic shape.
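To make the difference between a fixed-rank static tensor and a partial shape concrete, here is a minimal sketch of a shape type whose dimensions may be unknown. It is an illustration only: `toy_dim` and `toy_partial_shape` are hypothetical names invented for this example, and the code does not use or reproduce the actual `ov::PartialShape` API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one dimension: a known extent, or unknown (dynamic).
using toy_dim = std::optional<std::int64_t>;

// Hypothetical stand-in for a partial shape: any rank, any mix of static and
// dynamic dimensions. This is NOT ov::PartialShape, only a toy model of the idea.
struct toy_partial_shape {
    std::vector<toy_dim> dims;

    // A shape is dynamic as soon as one dimension is unknown.
    bool is_dynamic() const {
        for (const auto& d : dims)
            if (!d.has_value()) return true;
        return false;
    }
    bool is_static() const { return !is_dynamic(); }

    // Total element count; only meaningful once every dimension is known.
    std::int64_t element_count() const {
        assert(is_static() && "element count requires a fully static shape");
        std::int64_t n = 1;
        for (const auto& d : dims) n *= *d;
        return n;
    }
};
```

In this toy model, a shape such as `{std::nullopt, 3, 224, 224}` stays dynamic until the batch dimension is known, which is the situation the dynamic shape flow has to handle at execution time.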
When the input shape of the model changes, each primitive checks whether its own input shape has changed and, if so, updates it. The output shape is then calculated from the new input shape and propagated to the next primitive in the shape inference stage.
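The check-then-propagate idea can be sketched as follows. This is a simplified illustration under several assumptions: the graph is a linear chain, shapes are fully known at execution time, and the only skip condition modeled is an unchanged input shape. `toy_shape`, `toy_inst`, and `propagate` are hypothetical names for this example and do not correspond to the actual `primitive_inst` implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using toy_shape = std::vector<std::int64_t>;  // hypothetical: fully known dims

// Hypothetical stand-in for a primitive instance in a linear chain.
struct toy_inst {
    toy_shape input_shape;       // input shape cached from the previous run
    toy_shape output_shape;      // last inferred output shape
    bool shape_changed = false;  // mirrors the role of a "shape changed" flag
    // Per-primitive shape inference rule (assumed to be a pure function).
    std::function<toy_shape(const toy_shape&)> infer;

    // Returns true if shape inference ran, false if it was skipped.
    bool update_shape(const toy_shape& new_input) {
        shape_changed = (new_input != input_shape);
        if (!shape_changed) return false;  // unchanged input: skip inference
        input_shape = new_input;
        output_shape = infer(input_shape);
        return true;
    }
};

// Propagates a model input shape through the chain.
// Returns how many instances re-ran shape inference.
inline int propagate(std::vector<toy_inst>& chain, const toy_shape& model_input) {
    assert(!chain.empty());
    int reinferred = 0;
    const toy_shape* current = &model_input;
    for (auto& inst : chain) {
        if (inst.update_shape(*current)) ++reinferred;
        current = &inst.output_shape;  // output feeds the next primitive
    }
    return reinferred;
}
```

Running `propagate` twice with the same model input re-infers nothing on the second call, which is the saving that the skip conditions aim for.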
When a primitive executes in the GPU plugin for dynamic shape, shape inference through `primitive_inst::update_shape` proceeds as follows:
* `primitive_inst::do_runtime_in_place_concat` (link) runs before `update_shape()`. If `update_shape()` has already been executed by another primitive at this point, `update_shape_done_by_other` is set to TRUE, and `update_shape()` is then skipped. (link)
* The layouts in `kernel_impl_params` from the dependencies of `primitive_inst` are compared with the input layouts in `kernel_impl_params` of the current primitive. If they differ, the input layouts in `kernel_impl_params` are updated with the changed shape. (link)
* Set `_shape_changed` to TRUE if the input shape has changed. (link)
* If the current node is `shape_of` and the input shape has not changed, reset `_shape_changed` to FALSE and skip `update_shape()`. (link)
* If the current node is in a `shape_of` subgraph, check the dependent `shape_of` primitives and skip `update_shape()` if the shape has not changed. (link)
* `update_shape()` is skipped if any of the following conditions hold: the input shape has not changed, the node generates dynamic output (e.g. `Nonzero`, `Unique`), or the output layouts in `kernel_impl_params` are already static. (link)
* In dynamic shape execution, if the data needed for shape inference is stored in the output memory of a preceding node, execution waits until those dependent nodes complete. To determine which input nodes have memory dependencies, most `program_node`s define `get_shape_infer_dependencies()`. The dependency information (index and memory for each dependent input node) is collected from the current node and stored in a map, and the corresponding primitive events are added to an event list to await completion. Finally, the populated map is saved in `memory_deps` of `kernel_impl_params`. (link)
* `program_node` has two functions for calculating the output layout: `calc_output_layout()` for static shape execution and `calc_output_layouts()` for dynamic shape execution. In this step, `calc_output_layouts()` is called. It invokes the `shape_infer()` API of `ov::op` with the updated input layouts from `kernel_impl_params`, the primitive's attributes, and `memory_deps`, and returns the output layouts as a vector.
* The newly calculated output layout is then written back to `output_layouts` in `kernel_impl_params`. (link)
```cpp
struct program_node {
    ...
public:
    layout calc_output_layout() const;
    std::vector<layout> calc_output_layouts() const;
};
```
* ... `kernel_impl_params`, the output layout of the descriptor is also updated with the `ov::PartialShape` of the updated output layout. (link)

If `primitive_impl` is created or updated through `update_impl()` and the node is weightable (e.g. `convolution`, `deconvolution`, `fc`), the weights must be reordered to the layout required by the kernel as needed. The following describes the processes performed in `update_weights()`.
* ... `update_weights()` is skipped. (link)
* Get the reorder kernel params (`kernel_impl_params` for weights reorder) from `WeightsReorderParams` of `primitive_inst`. (link)
* ... `kernel_impl_params` to the output layout of the reorder kernel params. This is the expected layout. (link)
* Look up the reorder impl in the implementations cache using the reorder kernel params, or create a new reorder impl through `WeightsReordersFactory` and set the compiled kernel on it. Add the impl to the implementations cache. Check whether the weights memory can be reused from the reordered weights cache; if so, reuse it, otherwise allocate a new buffer. Update the reordered weights cache accordingly. Finally, use `kernel_arguments_data()` to set the kernel arguments in the reorder impl and execute the kernel.

In static shape execution, output memory is allocated when `primitive_inst` is created. In dynamic shape execution, output memory is allocated just before the kernel arguments are set and the kernel is executed. The following describes the processes performed in `realloc_if_needed()`.
* If the user of the current node is `concat` and the current node has 1 user, `can_be_optimized()` is TRUE but `allocation_done_by_other` is FALSE (i.e. not yet allocated by another node), execute the `realloc_if_needed()` of `concat` and set `allocation_done_by_other` to TRUE. Also, use the output memory of `concat` as the output memory of the current node and skip `realloc_if_needed()`. (link)
* ... `fully_connected`), the input and output shapes of `kernel_impl_params` are updated accordingly. A more detailed explanation will be added as a separate section later (TBD). (link)
* If the current node is `input_layout`, `realloc_if_needed()` is skipped because it is assumed to always use external memory. (link)
* ... `can_reuse_buffer`. (link)
* If the current node is `concat` and both `can_be_optimized()` and `allocation_done_by_other` are TRUE, `realloc_if_needed()` is skipped. (link)
* `ShapePredictor` predicts a preallocation shape from the current shape and data type, and updates the output layout shape of `kernel_impl_params` accordingly. A more detailed explanation will be added as a separate section later (TBD). (link)
* If `can_reuse_buffer` is TRUE, `reused` of the output memory is set to TRUE and the output memory is updated with the reinterpreted buffer. (link)
* If `can_reuse_buffer` is FALSE, reallocate with `allocate_outputs()` to set the output memory and update `max_output_layout_size`. (link)
* ... `primitive_impl`. (link)
* ... `allocate_internal_buffer()` to update an already allocated intermediate memory or add a new one.
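The reuse-or-reallocate decision at the core of these steps can be sketched as follows. This is a minimal illustration, not the plugin's implementation: `toy_output` and its members are hypothetical, sizes are tracked in bytes for simplicity, and the reuse condition (an existing buffer whose largest allocated size covers the request) is an assumption made for this example. The preallocation size stands in for what a shape predictor would suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the output allocation state of one primitive.
struct toy_output {
    std::vector<unsigned char> buffer;        // backing storage
    std::size_t max_output_layout_size = 0;   // largest size allocated so far (bytes)
    std::size_t used_bytes = 0;               // size of the current view on the buffer
    int allocations = 0;                      // number of real allocations performed

    // required_bytes: size needed for the current output shape.
    // prealloc_bytes: predicted size to allocate if a new buffer is needed.
    // Returns true if the existing buffer was reused.
    bool realloc_if_needed(std::size_t required_bytes, std::size_t prealloc_bytes) {
        assert(prealloc_bytes >= required_bytes);
        // Assumed reuse condition: a buffer exists and is large enough.
        const bool can_reuse_buffer =
            !buffer.empty() && required_bytes <= max_output_layout_size;
        if (can_reuse_buffer) {
            used_bytes = required_bytes;  // reinterpret the same storage
            return true;
        }
        buffer.assign(prealloc_bytes, 0);  // allocate a new, possibly larger buffer
        max_output_layout_size = prealloc_bytes;
        used_bytes = required_bytes;
        ++allocations;
        return false;
    }
};
```

Allocating more than currently required means that a later, slightly larger shape can still fit into the existing buffer, so the cost of allocation is paid less often as shapes vary between runs.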