src/plugins/intel_gpu/docs/gpu_plugin_ops_enabling.md
convolution_gpu_winograd_2x3_s1.cl. Usually, a single kernel fulfills the operation of a single primitive, but several kernels may be used to support one primitive.

1. Understand the new operation.
2. Try to find an existing primitive that fully or partially covers this operation.
3. Add a new GPU primitive, or extend an existing one, according to the operation spec.
This phase enables the primitive within the GPU plugin, without exposing it to IE.
Implement reference parallel kernel that supports all parameters of the operation and all input/output data types and layouts.
| File | Description |
|---|---|
| group_normalization_impls.cpp | Lists the implemented kernels for kernel selection |
| group_normalization_ref.cl | OpenCL kernel body. For more detail, see the How to write OCL kernel section |
| group_normalization_ref.(cpp,hpp) | Host-side counterpart of the kernel body |
| registry.hpp | Primitive registration. For example, registration for group_normalization. |
| group_normalization_inst.h | Node type declaration for GPU program |
| src/graph/group_normalization.cpp | Code for group_normalization_inst.h |
| primitives/group_normalization.hpp | GPU primitive definition |
| common_types.h | Enum declaration for KernelType and arguments |
When the ocl_v2 approach is used, the legacy primitive registration in impls/ocl/register.(cpp,hpp) should be removed. Primitive registration for input specifications and kernel parameters is also unnecessary for ocl_v2 (e.g. scatter_elements_update.cpp).

Add unit tests for the new operation.
| File | Description |
|---|---|
| group_normalization_gpu_test.cpp | Unit test for the layer |
You need to add reference code or an expected result to verify the output.
You can also select a specific kernel with force_implementations in case the primitive contains multiple kernels.
```cpp
...
build_options options;
implementation_desc conv_impl = { format::fs_b_yx_fsv32, "" };
options.set_option(build_option::force_implementations({ {"conv_fsv", conv_impl} }));
network network(engine, topology, options);
...
```
These unit tests are built into ov_gpu_unit_tests, which is a gtest application.
```shell
# Show the list of test cases
openvino/bin/intel64/Debug$ ./ov_gpu_unit_tests --gtest_list_tests
# Run a specific test
openvino/bin/intel64/Debug$ ./ov_gpu_unit_tests --gtest_filter=scatter_elements_update_gpu_fp16.*
```
Test scope needs to be comprehensive, but not wasteful. These tests run for every PR in CI. Let's save the planet.
Support layer fusion, if applicable
The supported fusion patterns are implemented in prepare_primitive_fusing::fuse_simple_primitives, which is called during the graph compilation phase. Unit tests for fusion are part of ov_gpu_unit_tests. The fused operations are generated by jitter as the FUSED_OPS.. macro in the OCL code; this generation logic is in KernelBase::MakeFusedOpsJitConstants.

Add / update the factory for this operation in the GPU plugin to use the new primitive in inference-engine.
| File | Description |
|---|---|
| plugin/ops/group_normalization.cpp | Instantiation of gpu plugin primitive from IE |
| primitives_list.hpp | Registration for primitives |
Add functional single-layer tests for the operation and try to cover most of the different use cases of this operation.
| File | Description |
|---|---|
| single_op/group_normalization.hpp | Shared class for single layer test |
| single_layer_tests/group_normalization.cpp | Single layer test for GPU plugin |
These tests are built into ov_gpu_func_test, which is also a gtest application.

[Optional] If there are existing IRs with this operation, try to run the full model(s) to be sure that it is correctly processed within the context.
[Optional] If there are existing IRs with this operation, try to run the full model(s) and estimate performance impact from this operation on total model execution time.
Create a PR with your changes.
Once the PR is approved by an OpenVINO group member on GitHub, CI will be triggered.

In GPU OCL kernels, many conditional statements are handled with #ifdef so that they can be resolved at compile time. The definitions are created by jitter.cpp and are set during graph compilation. You can see the generated macros by following the steps in source dumps.
Jitter also generates macros for runtime parameters such as input and output sizes.
Additional macros can be defined from the host code of a kernel itself. For example, the snippet below passes SUB_GROUP_SIZE to the kernel as a macro definition through jitter.
```cpp
// GetJitConstants method of the kernel
const size_t sub_group_size = 16;
JitConstants jit = MakeBaseParamsJitConstants(params);
jit.AddConstant(MakeJitConstant("SUB_GROUP_SIZE", sub_group_size));
```
Jitter also generates macros for index calculations. With these macros, you can program an OCL kernel in a layout-agnostic way. If you use the macro ${TENSOR_NAME}_GET_INDEX, you can get the 1d-index from a tensor coordinate whether the format is planar (such as bfyx or byxf) or blocked (such as b_fs_yx_fsv16). You can check the source code for the GET_INDEX macro.
If a kernel is not performance-critical, you can support only the default layouts: bfyx, bfzyx, and bfwzyx. Optimized formats such as b_fs_yx_fsv16, b_fs_yx_fsv4, or byxf can be supported as well.
A general description of the layouts can be found here, and the header file is here.
When layers are fused, jitter creates macros to generate code for the fused layers. This is realized as FUSED_OPS.. in the OCL kernel. You can learn the usage from other kernels.
There is a comment that describes layer fusion.