docs/proposals/0003-ONNXIFIproposal.md
Leading hardware and systems vendors offer highly optimized software to run neural network graphs. Such software can deliver order-of-magnitude speedups compared to generic implementations, but its integration with deep learning frameworks and applications is complicated by the large variety of vendor-specific interfaces and by subtle incompatibilities with the software stacks of high-level applications.
So far, the ONNX format has targeted the problem of offline conversion of neural network models between different high-level frameworks and vendor-specific libraries. In this proposal, we suggest that the ONNX ecosystem could be enriched to enable runtime discovery and selection of high-performance graph execution backends, and online (runtime) conversion of ONNX graphs to the internal representations of these implementations.
We should strive for consensus on a library API to interface with optimized backends and offload parts of ONNX graphs to these high-performance hardware and software implementations. The API should enable wide interoperability between high-level deep learning frameworks, software implementations of optimized graph runtimes, and existing and upcoming neural network acceleration hardware.
The standardized API should reduce friction in deploying neural network models for all involved parties.
We propose a small C-based API, which includes the following functionality:
- Discover (onnxGetNumBackends) and query information (onnxGetBackendInfo) about high-performance backends
- Initialize (onnxInitBackend) and deinitialize (onnxReleaseBackend) high-performance backends
- Query which parts of an ONNX graph can be offloaded to a backend (onnxGetBackendCompatibility)
- Convert an ONNX graph to a backend-specific representation (onnxInitGraph)
- Set locations of graph inputs and outputs (onnxSetGraphIO)
- Execute the graph on the backend (onnxRunGraph)
- Release the graph and its resources (onnxReleaseGraph)

The user (deep learning framework) iterates over operators in a model graph one-by-one, converts them to ONNX, and calls onnxGetBackendCompatibility to check which of the operators can be offloaded to the backend.
The user constructs connected subgraphs of operators that can be offloaded to the backend.
(Optional) For each subgraph, the user estimates whether it is beneficial to offload it to the optimized backend:
a. The user queries the backend about its high-level performance characteristics using the ONNX_BACKEND_MACS_* and ONNX_BACKEND_MEMORY_BANDWIDTH information queries. These data let the user build a simple roofline model of backend performance.
b. For every subgraph, the user estimates the time to do inference using the roofline model.
c. The user additionally estimates the time to transfer subgraph inputs to the backend using the ONNX_BACKEND_CPU_MEMORY_READ_BANDWIDTH information query, and the time to transfer subgraph outputs from the backend using ONNX_BACKEND_CPU_MEMORY_WRITE_BANDWIDTH.
d. If the predicted time to transfer inputs to the backend, do inference, and transfer outputs from the backend exceeds the predicted time to do the inference on the default engine (e.g. CPU), the user falls back to a different ONNX backend, or to the default engine.
The user then initializes the backend and offloads the subgraph execution to the ONNX backend by calling onnxInitGraph, onnxSetGraphIO, and onnxRunGraph.
A backend is a combination of a software library and a hardware device. The same device (e.g. "NVIDIA Tesla P100 on CUDA index #0") accessed through different software libraries would be seen as different backends. A single software library can expose multiple backends, one per device (e.g. each CUDA GPU in a system is exposed as a separate backend, or the CPU, GPU, and DSP on a mobile chipset are exposed as three different backends).
We recommend that vendors make the backend object reference-counted, and use uint32_t magic as the first data field of the object:
struct MyBackend {
  uint32_t magic;
  uint64_t referenceCount;
  ...
};
/* This line won't compile, but gives you an idea of relation between MyBackend structure and onnxBackend type. */
typedef MyBackend* onnxBackend;
Magic is an arbitrary 32-bit integer unique for a library implementing the API. It should be used to verify that the backend object passed to onnxInitGraph was created by onnxInitBackend in the same library.
A graph object is a vendor-specific representation of an ONNX ModelProto message. A graph is logically tied to the backend used to create it, and a typical implementation of a graph object would hold a reference to its backend object.
We recommend that vendors use uint32_t magic as the first data field of the graph object:
struct MyGraph {
  uint32_t magic;
  struct MyBackend* backend;
  ...
};
/* This line won't compile, but gives you an idea of relation between MyGraph structure and onnxGraph type. */
typedef MyGraph* onnxGraph;
Magic is an arbitrary 32-bit integer unique for a library implementing the API. It should be used to verify that a graph object passed to onnxSetGraphIO, onnxRunGraph, or onnxReleaseGraph was created by onnxInitGraph in the same library. The magic for a graph object should be different from the magic of a backend object of the same library.
During one-time library initialization, the implementation of the API would detect the n supported devices and map them to backend indices in the 0...(n-1) range. The implementation of device discovery and checking of required device characteristics is highly vendor- and platform-specific, e.g.:
- For CUDA, call cudaGetDeviceCount to get the number of CUDA-enabled devices, then call cudaGetDeviceProperties for each device, and map CUDA devices which satisfy the minimum required functionality, such as compute capability, to backend indices.
- For OpenCL, call clGetPlatformIDs and clGetPlatformInfo to find a supported platform, then call clGetDeviceIDs and clGetDeviceInfo to find a supported GPU device, and map it to the only exposed backend if such a device exists, or expose 0 devices otherwise.

We recommend that library initialization be triggered on the first call to onnxGetNumBackends, onnxGetBackendInfo, or onnxInitBackend. Using a global static C++ object for initialization may hurt portability if library initialization involves loading other shared libraries (DLLs): on Windows, the LoadLibrary function can't be used in initializers of global static objects.
The implementation would initialize the library, if it wasn't initialized already, and return the number n of available backends.
The implementation would initialize the library, if it wasn't initialized already, and query information about the backend using a vendor- or platform-specific API (e.g. cudaGetDeviceProperties, clGetDeviceInfo, the CPUID instruction). The implementation can cache this information when it is first queried or during initialization, and return the cached value.