# Graph Optimization Passes
Graph optimization is a collection of optimization passes that convert a general network description into a network description suitable for GPU execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is a `topology` (link) and the output is a `program` (link).
The transformation from the original graph into the final graph is quite complicated, so the work is divided into smaller steps, called *passes*. The purpose of this documentation is not to explain every pass in detail, but to explain the key ones.
For debugging purposes, you can dump the optimized graph after each step. See this article for details.
Note: The optimization passes run in sequence, and the number prefixed to each pass name indicates that sequence. However, the sequence numbers might change in the future.
* **reorder_inputs**: This pass traverses all nodes and calls the `layout_optimizer::get_preferred_format` function, which returns the preferred format for a node (or "any", which means that the format must be propagated from adjacent nodes if possible). Then it propagates formats for the nodes with "any" preferred format to minimize local reorders. After propagating the formats, it inserts the actual reorder nodes into the graph. The result of this pass is a quite complicated graph with many redundant reorders, which are removed by `remove_redundant_reorders`.
* **remove_redundant_reorders**: This pass serves two purposes. The first is removing redundant reorders: a chain such as `reorder - reorder - reorder` can be shrunk into a single reorder. The second is supporting cross-layout operation of a primitive. For example, when a convolution needs to receive `bfyx` input and to generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks as follows: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such a pattern and removes the reorder to generate a cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`.
* **prepare_buffer_fusing**: This pass implements implicit concatenation: it removes a `concatenation` primitive when the predecessors can put their results into the target buffer of the concat directly. For example, if two convolution results are concatenated along the f-axis, the layout is `bfyx` format, and `b = 1`, you can just remove the concat primitive and manipulate the output addresses of the convolutions to point to the proper locations.
* **add_required_reorders**: This pass checks whether a node supports its current input format by looking it up in `implementation_map<op_t>`, defined in the `<op_type>_gpu.cpp` file. If it is not defined, this pass tries to change the layout to one of the most common formats (`bfyx`, `yxfb`, `byxf`) and picks the first supported one.
* **compile_graph**: This pass creates a `primitive_impl` through the kernel selector. In this pass, the kernel for each node is chosen. For oneDNN primitives, OpenCL code is compiled at this stage. For clDNN primitives, OpenCL code will be compiled after all the passes.
* **post_optimize_weights**: From the `compile_graph` stage, it is known that some reordering is required for the weights. This is because the weights are stored in a simple planar format in the IR, while another format is usually required for an optimized convolution (or deconvolution, or fully-connected layer). To reorder the weights, this pass creates a simple graph that receives the weights and generates the reordered weights. The reordered weights are obtained by executing that network, and then they are inserted back into the original graph.
* **oooq_memory_dependencies**: The Intel GPU plugin uses an out-of-order queue (OOOQ). As the exact sequence of execution is not known in advance, there is an additional limitation on reusing buffers. For example, in a multi-branch structure such as Inception, there are no direct dependencies between the branches except for the common ancestor. However, in OOOQ execution mode, as the execution order within the Inception module is not guaranteed, a buffer from one branch must not be reused by another branch. Such implicit dependency information is processed in this pass.
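As an illustration of the format propagation step in `reorder_inputs`, here is a minimal sketch for a simple chain of nodes. It is not the actual `layout_optimizer` code; the default `bfyx` starting format and the string-based format names are assumptions made for the example:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of format propagation along a chain of nodes:
// nodes whose preferred format is "any" inherit the format already flowing
// through the graph, so no local reorder is needed next to them.
std::vector<std::string> propagate_formats(std::vector<std::string> preferred) {
    // preferred[i] is the preferred format of node i in a simple chain;
    // "any" means the node accepts whatever its neighbor produces.
    std::string current = "bfyx";  // assumed default input format
    for (std::string& fmt : preferred) {
        if (fmt == "any")
            fmt = current;         // inherit: avoids a local reorder
        else
            current = fmt;         // a fixed preference updates the flow
    }
    return preferred;
}
```

With this model, a node chain `{"any", "b_fs_yx_fsv16", "any"}` resolves so that the trailing "any" node picks up `b_fs_yx_fsv16` from its predecessor instead of forcing a reorder back to `bfyx`.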
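The `reorder - reorder - reorder` shrinking can be sketched as follows. This is a simplified model (a reorder reduced to an input/output format pair), not the actual `remove_redundant_reorders` implementation:

```cpp
#include <string>
#include <vector>

// A reorder is modeled here as just an (input format, output format) pair.
struct Reorder {
    std::string in_format;
    std::string out_format;
};

// A chain of consecutive reorders is equivalent to a single reorder from the
// first input format to the last output format; intermediate formats are
// dropped. Returns an empty vector when the chain is a no-op (the chain ends
// in the format it started from).
std::vector<Reorder> collapse_reorder_chain(const std::vector<Reorder>& chain) {
    if (chain.empty()) return {};
    const std::string& first_in = chain.front().in_format;
    const std::string& last_out = chain.back().out_format;
    if (first_in == last_out) return {};   // the reorders cancel out entirely
    return {Reorder{first_in, last_out}};  // single equivalent reorder
}
```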
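The implicit concatenation optimization relies on a simple offset calculation: with `bfyx` layout and `b = 1`, each input of an f-axis concat occupies one contiguous slab of the fused output buffer. A sketch of that arithmetic (not the actual clDNN code) might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Shape of a tensor in bfyx layout.
struct Bfyx {
    std::size_t b, f, y, x;
    std::size_t elements() const { return b * f * y * x; }
};

// For an f-axis concat with b == 1, returns the element offset inside the
// fused output buffer at which each producer should write its result. The
// concat node itself can then be removed from the graph.
std::vector<std::size_t> concat_offsets_along_f(const std::vector<Bfyx>& inputs) {
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    for (const Bfyx& in : inputs) {
        assert(in.b == 1 && "implicit f-axis concat in bfyx requires b == 1");
        offsets.push_back(offset);
        offset += in.elements();  // next input starts right after this slab
    }
    return offsets;
}
```

For instance, concatenating a `1x32x8x8` result with a `1x16x8x8` result puts the second convolution's output at element offset `32 * 8 * 8` of the shared buffer.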
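To illustrate why the weights need reordering, here is a sketch of rearranging planar `oiyx` weights into an o-blocked layout in the spirit of `os_iyx_osv16` (output channels grouped in blocks of 16, zero-padded tail block). The exact blocked layout chosen by a given kernel is an assumption of this example:

```cpp
#include <cstddef>
#include <vector>

// Rearranges planar oiyx weights into an o-blocked layout: output channels
// are grouped into blocks of 16, and within each block the 16 o-values are
// stored innermost (contiguously) for vectorized access. The tail block is
// zero-padded when O is not a multiple of 16.
std::vector<float> reorder_oiyx_to_osv16(const std::vector<float>& w,
                                         std::size_t O, std::size_t I,
                                         std::size_t Y, std::size_t X) {
    const std::size_t osv = 16;
    const std::size_t blocks = (O + osv - 1) / osv;
    std::vector<float> out(blocks * osv * I * Y * X, 0.0f);
    for (std::size_t o = 0; o < O; ++o)
        for (std::size_t iyx = 0; iyx < I * Y * X; ++iyx) {
            // destination index: block-major, then i/y/x, then o-within-block
            std::size_t dst = ((o / osv) * I * Y * X + iyx) * osv + (o % osv);
            out[dst] = w[o * I * Y * X + iyx];
        }
    return out;
}
```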
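The buffer-reuse restriction under the out-of-order queue can be modeled as a reachability check: a node may reuse a buffer only if it transitively depends on the buffer's last user, because only then is the execution order guaranteed. A minimal sketch, not the plugin's actual dependency analysis:

```cpp
#include <cstddef>
#include <vector>

// Dependency graph as an adjacency list: deps[i] lists the nodes that
// node i directly depends on (its inputs). The graph is assumed acyclic.
bool depends_on(const std::vector<std::vector<std::size_t>>& deps,
                std::size_t node, std::size_t target) {
    if (node == target) return true;
    for (std::size_t d : deps[node])
        if (depends_on(deps, d, target)) return true;
    return false;
}

// In OOOQ mode, node `b` may reuse a buffer last written by node `a` only if
// `b` transitively depends on `a`. Sibling branches (as in an Inception
// module) have no such dependency, so they must not share buffers.
bool can_reuse_buffer(const std::vector<std::vector<std::size_t>>& deps,
                      std::size_t a, std::size_t b) {
    return depends_on(deps, b, a);
}
```

For an Inception-like shape (nodes 1 and 2 both depend on node 0; node 3 depends on both 1 and 2), node 2 must not reuse node 1's buffer, while node 3 may reuse node 0's.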