tensorflow/lite/g3doc/performance/gpu.md
Using graphics processing units (GPUs) to run your machine learning (ML) models can dramatically improve the performance of your model and the user experience of your ML-enabled applications. TensorFlow Lite enables the use of GPUs and other specialized processors through hardware drivers called delegates. Enabling the use of GPUs with your TensorFlow Lite ML applications can provide the following benefits:
This document provides an overview of GPU support in TensorFlow Lite and some advanced uses for GPU processors. For more specific information about implementing GPU support on specific platforms, see the following guides:
There are some limitations on which TensorFlow ML operations, or ops, can be accelerated by the TensorFlow Lite GPU delegate. The delegate supports the following ops in 16-bit and 32-bit float precision:
ADD
AVERAGE_POOL_2D
CONCATENATION
CONV_2D
DEPTHWISE_CONV_2D v1-2
EXP
FULLY_CONNECTED
LOGICAL_AND
LOGISTIC
LSTM v2 (Basic LSTM only)
MAX_POOL_2D
MAXIMUM
MINIMUM
MUL
PAD
PRELU
RELU
RELU6
RESHAPE
RESIZE_BILINEAR v1-3
SOFTMAX
STRIDED_SLICE
SUB
TRANSPOSE_CONV
By default, all ops are only supported at version 1. Enabling quantization support enables the appropriate versions, for example, ADD v2.
If some of the ops are not supported by the GPU delegate, the framework runs only part of the graph on the GPU and the remaining part on the CPU. Due to the high cost of CPU/GPU synchronization, a split execution mode like this often results in slower performance than when the whole network is run on the CPU alone. In this case, the application generates a warning, such as:
WARNING: op code #42 cannot be handled by this delegate.
There is no callback for failures of this type, since this is not an actual run-time failure. When testing execution of your model with the GPU delegate, you should watch for these warnings. A high number of these warnings can indicate that your model is not a good fit for GPU acceleration and may require refactoring.
The following example models are built to take advantage of GPU acceleration with TensorFlow Lite and are provided for reference and testing:
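Because there is no callback for these warnings, it can help to check a model's op list before deploying it. The following is a minimal sketch, not part of the TensorFlow Lite API: `GpuSupportedOps` and `FindUnsupportedOps` are hypothetical helpers, the supported set is copied from the list in this document only, and op version constraints (such as "v1-2") are ignored.

```cpp
#include <set>
#include <string>
#include <vector>

// Ops listed in this document as supported by the GPU delegate.
// Assumption: version constraints are ignored in this sketch.
inline const std::set<std::string>& GpuSupportedOps() {
  static const std::set<std::string> kOps = {
      "ADD", "AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D",
      "DEPTHWISE_CONV_2D", "EXP", "FULLY_CONNECTED", "LOGICAL_AND",
      "LOGISTIC", "LSTM", "MAX_POOL_2D", "MAXIMUM", "MINIMUM", "MUL",
      "PAD", "PRELU", "RELU", "RELU6", "RESHAPE", "RESIZE_BILINEAR",
      "SOFTMAX", "STRIDED_SLICE", "SUB", "TRANSPOSE_CONV"};
  return kOps;
}

// Hypothetical helper: returns the ops in `model_ops` that do not appear
// in the supported list, in their original order. Each of these would be
// expected to fall back to the CPU and split the graph.
inline std::vector<std::string> FindUnsupportedOps(
    const std::vector<std::string>& model_ops) {
  std::vector<std::string> unsupported;
  for (const std::string& op : model_ops) {
    if (GpuSupportedOps().count(op) == 0) unsupported.push_back(op);
  }
  return unsupported;
}
```

For example, calling `FindUnsupportedOps({"CONV_2D", "SPACE_TO_DEPTH", "SOFTMAX"})` returns a vector containing only `SPACE_TO_DEPTH`. An empty result suggests, under the assumptions above, that the whole graph can stay on the GPU.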
The following techniques can help you get better performance when running models on GPU hardware using the TensorFlow Lite GPU delegate:
Reshape operations - Some operations that are quick on a CPU may have a
high cost for the GPU on mobile devices. Reshape operations are particularly
expensive to run, including BATCH_TO_SPACE, SPACE_TO_BATCH,
SPACE_TO_DEPTH, and so forth. You should closely examine your use of reshape
operations, and consider that they may have been applied only for exploring
data or for early iterations of your model. Removing them can significantly
improve performance.
Image data channels - On GPU, tensor data is sliced into groups of 4
channels, so a computation on a tensor with the shape [B,H,W,5] performs
about the same as on a tensor of shape [B,H,W,8], but significantly worse
than on [B,H,W,4]. If the camera hardware you are using supports image
frames in RGBA, feeding that 4-channel input is significantly faster, since
it avoids a memory copy from 3-channel RGB to 4-channel RGBX.
Mobile-optimized models - For best performance, you should consider retraining your classifier with a mobile-optimized network architecture. Optimization for on-device inferencing can dramatically reduce latency and power consumption by taking advantage of mobile hardware features.
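The image data channels tip above comes down to rounding the channel count up to a multiple of 4. The following is a minimal sketch of that arithmetic, assuming the 4-channel slicing described above; `ChannelSlices` and `PaddedChannels` are hypothetical names used for illustration, not delegate APIs.

```cpp
#include <cstdint>

// Number of 4-channel slices needed to hold `channels` channels
// (integer ceiling of channels / 4).
constexpr int32_t ChannelSlices(int32_t channels) {
  return (channels + 3) / 4;
}

// Channel count actually occupied once the last slice is padded out.
constexpr int32_t PaddedChannels(int32_t channels) {
  return ChannelSlices(channels) * 4;
}
```

With this arithmetic, 5 channels and 8 channels both occupy two slices, while 4 channels occupy one, which matches the relative costs described for [B,H,W,5], [B,H,W,8], and [B,H,W,4].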
You can use additional, advanced techniques with GPU processing to enable even better performance for your models, including quantization and serialization. The following sections describe these techniques in further detail.
This section explains how the GPU delegate accelerates 8-bit quantized models, including the following:
To optimize performance, use models that have both floating-point input and output tensors.
Since the GPU backend only supports floating-point execution, the delegate runs quantized models by giving the backend a ‘floating-point view’ of the original model. At a high level, this entails the following steps:
Constant tensors (such as weights/biases) are de-quantized once into the GPU memory. This operation happens when the delegate is enabled for TensorFlow Lite.
Inputs and outputs to the GPU program, if 8-bit quantized, are de-quantized and quantized (respectively) for each inference. This operation is done on the CPU using TensorFlow Lite’s optimized kernels.
Quantization simulators are inserted between operations to mimic quantized behavior. This approach is necessary for models where ops expect activations to follow bounds learnt during quantization.
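The steps above can be sketched with standard affine quantization arithmetic. This is a minimal illustration, not the delegate's implementation: `QuantParams`, `Dequantize`, `Quantize`, and `FakeQuantize` are hypothetical names, and unsigned 8-bit tensors with a per-tensor scale and zero point are assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Per-tensor affine quantization parameters (assumed layout for this sketch).
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// De-quantize: real = scale * (q - zero_point).
inline float Dequantize(uint8_t q, const QuantParams& p) {
  return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.zero_point);
}

// Quantize: q = round(real / scale) + zero_point, clamped to [0, 255].
inline uint8_t Quantize(float real, const QuantParams& p) {
  const int32_t q =
      static_cast<int32_t>(std::lround(real / p.scale)) + p.zero_point;
  return static_cast<uint8_t>(
      std::min<int32_t>(255, std::max<int32_t>(0, q)));
}

// Simulated quantization between ops: snap a float activation to the
// nearest value that the 8-bit representation could express.
inline float FakeQuantize(float real, const QuantParams& p) {
  return Dequantize(Quantize(real, p), p);
}
```

With `QuantParams{0.5f, 128}`, the stored value 130 de-quantizes to 1.0, and `FakeQuantize(0.74f, ...)` snaps to 0.5, which shows how a float-only backend can still respect the bounds and resolution learned during quantization.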
For information about enabling this feature with the GPU delegate, see the following:
The GPU delegate can load pre-compiled kernel code and model data that were serialized and saved to disk during previous runs. This approach avoids re-compilation and can reduce startup time by up to 90%. This improvement is achieved by exchanging disk space for time savings. You can enable this feature with a few configuration options, as shown in the following code examples:
<div> <devsite-selector> <section> <h3>C++</h3> <p><pre class="prettyprint lang-cpp">
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_SERIALIZATION;
options.serialization_dir = kTmpDir;
options.model_token = kModelToken;

// TfLiteGpuDelegateV2Create takes a pointer to the options struct.
auto* delegate = TfLiteGpuDelegateV2Create(&amp;options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;
</pre></p>
</section>
<section>
<h3>Java</h3>
<p><pre class="prettyprint lang-java">
GpuDelegate delegate = new GpuDelegate(
    new GpuDelegate.Options().setSerializationParams(
        /* serializationDir= */ serializationDir,
        /* modelToken= */ modelToken));
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
</pre></p>
</section>
</devsite-selector> </div>
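When using the serialization feature, make sure your code complies with these implementation rules:
Serialization directory - Store the serialization data in a directory that other apps cannot access. On Android devices, use getCodeCacheDir(), which points to a location that is private to the current application.
Model token - Use a token that identifies the specific model. One way to compute it is to fingerprint the model data with a library such as farmhash::Fingerprint64.
Note: Use of this serialization feature requires the OpenCL SDK.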
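As an illustration of deriving a model token, the following sketch fingerprints the model bytes and formats the result as a string. It assumes the token is meant to be derived from a fingerprint of the model data, as the mention of farmhash::Fingerprint64 suggests. To stay dependency-free, it uses a 64-bit FNV-1a hash as a stand-in for farmhash; `Fnv1a64` and `MakeModelToken` are hypothetical helpers, not TensorFlow Lite APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a hash. Stand-in for a fingerprinting library such as
// farmhash; any stable 64-bit fingerprint would serve the same purpose.
inline uint64_t Fnv1a64(const char* data, size_t size) {
  uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (size_t i = 0; i < size; ++i) {
    hash ^= static_cast<unsigned char>(data[i]);
    hash *= 0x100000001b3ULL;  // FNV 64-bit prime
  }
  return hash;
}

// Hypothetical helper: builds a token string from the serialized model
// bytes, so the same model yields the same token on every run.
inline std::string MakeModelToken(const std::string& model_bytes) {
  return std::to_string(Fnv1a64(model_bytes.data(), model_bytes.size()));
}
```

The resulting string could then be passed as `options.model_token` (via `c_str()`) in the C++ example above. Because the token depends only on the model bytes, a changed model produces a different token and does not reuse stale cached data.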