##################
Optimize Inference
##################
.. toctree::
   :maxdepth: 1
   :hidden:

   optimize-inference/general-optimizations
   optimize-inference/high-level-performance-hints
   optimize-inference/precision-control
   optimize-inference/optimizing-latency
   optimize-inference/optimizing-throughput
   optimize-inference/optimize-preprocessing
   optimize-inference/optimizing-low-level-implementation
   Optimizing memory usage <optimize-inference/optimizing-memory-usage>
.. meta::
   :description: Improving inference performance involves model and runtime
                 optimizations that can be done independently. Inference speed
                 depends on latency and throughput.
Runtime, or deployment, optimization focuses on tuning inference and execution parameters. Unlike
model-level optimization, it is highly specific to the hardware you use and the goal you want
to achieve. You need to decide whether to prioritize accuracy or performance,
:doc:`throughput <optimize-inference/optimizing-throughput>` or :doc:`latency <optimize-inference/optimizing-latency>`,
or aim for a balance of the two. You should also predict how scalable your application
needs to be and how exactly it will interact with the inference component. Planning these
choices up front will help you achieve the best results for your product.
.. note::

   For more information on this topic, see the following articles:

   * :doc:`Inference Devices and Modes <inference-devices-and-modes>`
   * :ref:`Inputs Pre-processing with OpenVINO <inputs_pre_processing>`
   * :ref:`Async API <async_api>`
   * :ref:`The 'get_tensor' Idiom <tensor_idiom>`
   * :doc:`Dynamic Shapes <model-input-output/dynamic-shapes>`

Performance-Portable Inference
##############################
To make configuration easier and performance optimization more portable, OpenVINO offers the
:doc:`Performance Hints <optimize-inference/high-level-performance-hints>` feature. It comprises
two high-level “presets” focused on latency (default) or throughput.
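The chosen preset is passed to the device as a compile-time property. Below is a minimal
Python sketch of building that property map; the ``hint_config`` helper is illustrative
(not part of the OpenVINO API), and the ``PERFORMANCE_HINT`` string form of the property
and the commented ``compile_model`` call are shown under that assumption.

```python
# Minimal sketch of selecting a performance-hint preset.
# `hint_config` is an illustrative helper, not an OpenVINO API;
# the "PERFORMANCE_HINT" property is assumed to be passed at compile time.

def hint_config(mode: str) -> dict:
    """Build a compile-time property map for the chosen preset."""
    allowed = ("LATENCY", "THROUGHPUT")
    if mode not in allowed:
        raise ValueError(f"mode must be one of {allowed}, got {mode!r}")
    return {"PERFORMANCE_HINT": mode}

# The resulting dictionary would typically be passed as, e.g.:
#   compiled = core.compile_model(model, "CPU", hint_config("THROUGHPUT"))
print(hint_config("LATENCY"))  # → {'PERFORMANCE_HINT': 'LATENCY'}
```

Because the hint names an intent (latency or throughput) rather than device-specific knobs,
the same two-line configuration carries over unchanged between devices.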
Although inference with OpenVINO Runtime can be configured with a multitude of low-level
performance settings, doing so is not recommended, as:

* It requires a deep understanding of device architecture and the inference engine.
* It may not translate well to other device-model combinations. For example, settings tuned
  for one device :doc:`do not always map well to the CPU <optimize-inference/optimizing-low-level-implementation>`.

Additional Resources
####################
* :ref:`Using Async API and running multiple inference requests in parallel to leverage throughput <throughput_app_design>`
* :doc:`The throughput approach implementation details for specific devices <optimize-inference/optimizing-low-level-implementation>`
* :doc:`Details on throughput <optimize-inference/optimizing-throughput>`
* :doc:`Details on latency <optimize-inference/optimizing-latency>`
* :doc:`API examples and details <optimize-inference/high-level-performance-hints>`