tensorflow/lite/delegates/xnnpack/README.md
XNNPACK is a highly optimized library of neural network inference operators for ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, macOS, and Emscripten environments. This document describes how to use the XNNPACK library as an inference engine for TensorFlow Lite.
XNNPACK integrates with the TensorFlow Lite interpreter through the delegation mechanism. TensorFlow Lite supports several methods to enable XNNPACK for floating-point inference.
Pre-built nightly TensorFlow Lite binaries for Android include XNNPACK, although it is disabled by default. Use the setUseXNNPACK method in the Interpreter.Options class to enable it:
Interpreter.Options interpreterOptions = new Interpreter.Options();
interpreterOptions.setUseXNNPACK(true);
Interpreter interpreter = new Interpreter(model, interpreterOptions);
Pre-built nightly TensorFlow Lite CocoaPods include XNNPACK, but do not enable it by default. Swift developers can use an InterpreterOptions object to enable XNNPACK:
var options = InterpreterOptions()
options.isXNNPackEnabled = true
var interpreter = try Interpreter(modelPath: "model/path", options: options)
Objective-C developers can enable XNNPACK via the useXNNPACK property in the TFLInterpreterOptions class:
TFLInterpreterOptions *options = [[TFLInterpreterOptions alloc] init];
options.useXNNPACK = YES;
NSError *error;
TFLInterpreter *interpreter =
[[TFLInterpreter alloc] initWithModelPath:@"model/path"
options:options
error:&error];
When building TensorFlow Lite with Bazel, add --define tflite_with_xnnpack=true, and the TensorFlow Lite interpreter will use the XNNPACK engine by default. The exact command depends on the target platform; e.g. for the Android AAR you would use:
bazel build -c opt --fat_apk_cpu=x86,x86_64,arm64-v8a,armeabi-v7a \
--host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
--define android_dexmerger_tool=d8_dexmerger \
--define android_incremental_dexing_tool=d8_dexbuilder \
--define tflite_with_xnnpack=true \
//tensorflow/lite/java:tensorflow-lite
Note that in this case the Interpreter::SetNumThreads invocation does not affect the number of threads used by the XNNPACK engine. To specify the number of threads available to the XNNPACK engine, you should pass the value manually when constructing the interpreter. The snippet below illustrates this, assuming you are using InterpreterBuilder to construct the interpreter:
// Load model
tflite::Model* model;
...
// Construct the interpreter, passing the number of threads available to XNNPACK
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
TfLiteStatus res =
    tflite::InterpreterBuilder(model, resolver)(&interpreter, num_threads);
By default, the XNNPACK engine used by the TensorFlow Lite interpreter uses a single thread for inference.
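For reference, the sketch below shows one way to load a model from disk and build an interpreter with a given number of threads; the BuildInterpreter helper and its error handling are illustrative, not a prescribed pattern:

```c++
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

std::unique_ptr<tflite::Interpreter> BuildInterpreter(const char* model_path,
                                                      int num_threads) {
  // Load the model from disk; returns nullptr on failure.
  auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
  if (!model) return nullptr;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  // The num_threads value passed here is also used by the built-in XNNPACK engine.
  if (tflite::InterpreterBuilder(*model, resolver)(&interpreter, num_threads) !=
      kTfLiteOk) {
    return nullptr;
  }
  if (interpreter->AllocateTensors() != kTfLiteOk) return nullptr;
  return interpreter;
}
```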
Another way to enable XNNPACK is to build and link the
//tensorflow/lite:tflite_with_xnnpack target into your application alongside
the TensorFlow Lite framework.
This method works on platforms which support POSIX-style weak symbols (Android, iOS, Linux, Mac, but NOT Windows).
While it is possible to use the low-level delegate API to enable XNNPACK, this method is NOT RECOMMENDED unless you need to use TensorFlow Lite both with and without XNNPACK (e.g. for benchmarking). With the low-level delegate API, users create an XNNPACK delegate with the TfLiteXNNPackDelegateCreate function, and then call Interpreter::ModifyGraphWithDelegate to delegate supported parts of the model to the XNNPACK delegate. The delegate must be destroyed with TfLiteXNNPackDelegateDelete after releasing the TensorFlow Lite interpreter. The snippet below illustrates the typical usage:
// Build the interpreter
std::unique_ptr<tflite::Interpreter> interpreter;
...
// IMPORTANT: initialize options with TfLiteXNNPackDelegateOptionsDefault() for
// API-compatibility with future extensions of the TfLiteXNNPackDelegateOptions
// structure.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.num_threads = num_threads;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(xnnpack_delegate) != kTfLiteOk) {
// Report error and fall back to another delegate, or the default backend
}
// IMPORTANT: AllocateTensors can be called only AFTER ModifyGraphWithDelegate
...
// Run inference using XNNPACK
interpreter->Invoke();
...
// IMPORTANT: release the interpreter before destroying the delegate
interpreter.reset();
TfLiteXNNPackDelegateDelete(xnnpack_delegate);
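Because the interpreter must be released before the delegate is destroyed, it can be convenient to tie the delegate's lifetime to a smart pointer with a custom deleter. The alias and factory function below are an illustrative sketch, not part of the TensorFlow Lite API:

```c++
#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Owns an XNNPACK delegate and guarantees TfLiteXNNPackDelegateDelete is called.
using XnnpackDelegatePtr =
    std::unique_ptr<TfLiteDelegate, decltype(&TfLiteXNNPackDelegateDelete)>;

XnnpackDelegatePtr MakeXnnpackDelegate(int num_threads) {
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  options.num_threads = num_threads;
  return XnnpackDelegatePtr(TfLiteXNNPackDelegateCreate(&options),
                            &TfLiteXNNPackDelegateDelete);
}
```

Note that the teardown order still matters: reset or destroy the interpreter before the XnnpackDelegatePtr goes out of scope.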
XNNPACK internally packs static weights for operations (like convolutions) in order to make accessing weights more memory friendly, and it needs to allocate memory internally to hold these packed weights. If you start multiple TFLite interpreter instances based on the same model, each instance may hold its own copy of the same packed weights, which can cause high memory usage.
The weights cache can be used to store these packed weights in a file, avoiding re-packing on every run and allowing packed weights to be shared between multiple TFLite instances. Depending on your use case, this can lead to significant initialization speed-ups and memory savings.
The initialization speed-up happens because packing is only done once; subsequent runs read the packed weights from the cache file, skipping the most expensive part of XNNPACK's initialization.
The memory savings have several causes, not all of which apply to every use case:
1. The original weights are never read when the cache is used, because packing doesn't happen. TFLite usually uses mmap to load model files, and mmap only pulls the data that is actually read into memory.
2. The weight cache provides buffer de-duplication: if multiple tensors share the same weights, only one copy of the corresponding packed weights is kept. This is usually the case for LLMs and for models with several signatures.
3. The weight cache can be shared between interpreter instances, further de-duplicating packed data.
4. Thanks to mmap, the file-backed cache can be shared between processes, further de-duplicating packed data. This is automatic; you don't need to do anything.
The weights cache is a contents-based cache. Every time XNNPACK has to pack weights, it first checks whether the packed weights are already in the cache. If they are, the cached packed weights are used for subsequent operations. Otherwise, the weights are packed and added to the cache.
Warning: The weight cache cannot be shared between models or hardware architectures; a different cache file must be used for each (model, architecture) pair.
Warning: XNNPACK does its best to detect outdated cache files but cannot check for model changes. Checking whether the model has been updated and deleting old cache files is left to the user.
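For example, one purely illustrative way to handle this in the application is to compare file modification times and delete a cache file that is older than the model it was built from (C++17 filesystem; the helper below is not part of the TensorFlow Lite API):

```c++
#include <filesystem>

// Hypothetical helper: delete the weight cache when the model file is newer
// than the cache file, so XNNPACK re-packs weights from the updated model.
void InvalidateStaleWeightCache(const std::filesystem::path& model_path,
                                const std::filesystem::path& cache_path) {
  namespace fs = std::filesystem;
  if (fs::exists(cache_path) && fs::exists(model_path) &&
      fs::last_write_time(model_path) > fs::last_write_time(cache_path)) {
    fs::remove(cache_path);
  }
}
```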
Saving the cache to disk gives you all of the advantages listed above.
std::unique_ptr<tflite::Interpreter> interpreter;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weight_cache_file_path = "path/to/the/cache/file";
// Modify graph with delegate, as above...
TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreter.
//
// Static weights will be packed and written to the cache file on the first run,
// and read directly from disk on subsequent runs.
If you cannot access a file system, the cache can also be used "in-memory" instead of saving it to disk.
You will lose advantages 1 and 4 but can still profit from 2 and 3.
Note: Currently, this is only accessible on systems that have the memfd_create
system call.
std::unique_ptr<tflite::Interpreter> interpreter;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weight_cache_file_path =
TfLiteXNNPackDelegateInMemoryFilePath();
// Modify graph with delegate, as above...
TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreter.
//
// Static weights will be packed and written to the in-memory cache the first
// time, and read back from that cache on subsequent runs.
Sharing the cache between interpreter instances works the same way whether the cache is file-backed or in-memory. To share a cache between interpreters, create the cache provider outside of the delegate and pass it to each delegate instance.
std::unique_ptr<tflite::Interpreter> interpreter1;
std::unique_ptr<tflite::Interpreter> interpreter2;
// Create a weight cache. This should outlive the interpreter.
tflite::xnnpack::MMapWeightCacheProvider weight_cache;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
// When sharing an existing cache, the path will be used by the first
// interpreter that is run to load it or create it.
xnnpack_options.weight_cache_file_path = /* See previous examples. */;
// Share the cache.
xnnpack_options.weight_cache_provider = &weight_cache;
// Modify graph with delegate, as above...
TfLiteDelegate* delegate1 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter1->ModifyGraphWithDelegate(delegate1) != kTfLiteOk) {
// Handle errors...
}
// Signal to the weight cache provider that there's no building to be done
// anymore. That way subsequent interpreter setups won't try to continue
// building the cache.
weight_cache.StopBuild();
// Modify graph with delegate, as above...
TfLiteDelegate* delegate2 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter2->ModifyGraphWithDelegate(delegate2) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreters.
//
// Static weights will be packed and written into the shared cache by the first
// interpreter that runs, and read back from it by the others.
Warning: Sharing the cache is not thread safe for building. You should always do
one full run of one of the interpreters before starting threading. Once the
building run is done, call weight_cache.StopBuild() before using the weight
cache provider to build other delegate instances.
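To make the required ordering concrete, here is a minimal sketch that reuses interpreter1, interpreter2, weight_cache, and xnnpack_options from the snippet above; the setup_with_cache lambda is a hypothetical wrapper around the delegate setup shown earlier, not a TensorFlow Lite API:

```c++
#include <thread>

// Hypothetical helper: creates an XNNPACK delegate from the shared options and
// applies it to the given interpreter, as in the snippet above.
auto setup_with_cache = [&](std::unique_ptr<tflite::Interpreter>& interpreter) {
  TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    // Handle errors...
  }
};

// 1. Build the cache on the current thread with the first interpreter,
//    including one full inference run.
setup_with_cache(interpreter1);
interpreter1->AllocateTensors();
interpreter1->Invoke();

// 2. Freeze the cache so later setups only read from it.
weight_cache.StopBuild();

// 3. Only now is it safe to set up the remaining interpreters from worker threads.
std::thread worker([&] { setup_with_cache(interpreter2); });
worker.join();
```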
When TfLite profiling is enabled, XNNPACK will time each operator and report the results to TfLite, which will print them as part of the overall execution profile.
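For instance, one way to enable profiling from C++ is to attach a profiler to the interpreter before invoking it. The sketch below uses TFLite's BufferedProfiler and ProfileSummarizer utility classes; the exact constructor arguments and method signatures are assumptions and may differ between TensorFlow Lite versions.

```c++
#include <iostream>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/profiling/buffered_profiler.h"
#include "tensorflow/lite/profiling/profile_summarizer.h"

// Rough sketch: attach a buffered profiler, run one inference, and print a
// per-operator summary.
void ProfileOneRun(tflite::Interpreter* interpreter) {
  tflite::profiling::BufferedProfiler profiler(/*max_num_entries=*/1024);
  interpreter->SetProfiler(&profiler);

  profiler.StartProfiling();
  interpreter->Invoke();
  profiler.StopProfiling();

  tflite::profiling::ProfileSummarizer summarizer;
  summarizer.ProcessProfiles(profiler.GetProfileEvents(), *interpreter);
  std::cout << summarizer.GetOutputString() << std::endl;
}
```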
The XNNPACK delegate is a work in progress and currently supports a limited set of operators. Unsupported operators fall back to the default TensorFlow Lite implementations, so models using a combination of supported and unsupported operators can still benefit from the XNNPACK delegate.
Below is the list of currently supported floating-point operators:
* ABS
* ADD (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* AVERAGE_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* CEIL
* CONCATENATION
* CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEPTH_TO_SPACE
* DEPTHWISE_CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DIV (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* ELU
* FULLY_CONNECTED (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* FLOOR
* HARD_SWISH
* LEAKY_RELU
* LOGISTIC
* MAX_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* MAXIMUM
* MEAN (reduction axes must be static, i.e. use the kTfLiteMmapRo allocation type)
* MINIMUM
* MUL (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* NEG
* PAD (padding must be static, i.e. use the kTfLiteMmapRo allocation type)
* PRELU (slope must be static, i.e. use the kTfLiteMmapRo allocation type)
* RELU
* RELU6
* RELU_N1_TO_1
* RESHAPE (the new shape must be static, i.e. use the kTfLiteMmapRo allocation type, or absent, with the new shape specified via the ReshapeOptions table)
* RESIZE_BILINEAR (the new size must be static, i.e. use the kTfLiteMmapRo allocation type)
* ROUND
* SLICE (begin and size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SOFTMAX (only beta = 1.0 is supported)
* SPACE_TO_DEPTH
* SPLIT
* SQRT
* SQUARE
* SQUARED_DIFFERENCE
* STRIDED_SLICE (begin, end, and strides must be static, i.e. use the kTfLiteMmapRo allocation type)
* SUB (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* TANH
* TRANSPOSE (the permutation must be static, i.e. use the kTfLiteMmapRo allocation type)
* TRANSPOSE_CONV (the output shape, weights, and bias must be static, i.e. use the kTfLiteMmapRo allocation type)

XNNPACK supports half-precision (using IEEE FP16 format) inference for all floating-point operators. XNNPACK automatically enables half-precision inference when the following conditions are met:
* XNNPACK runs on hardware that natively supports computations in IEEE FP16 format. Currently, this hardware is limited to ARM & ARM64 devices with the ARMv8.2 FP16 arithmetics extension, and includes Android phones starting with Pixel 3, Galaxy S9 (Snapdragon SoC), Galaxy S10 (Exynos SoC), iOS devices with A11 or newer SoCs, all Apple Silicon Macs, and Windows ARM64 laptops based on the Snapdragon 850 SoC or newer.
* The model's "reduced_precision_support" metadata indicates that the model is compatible with FP16 inference. The metadata can be added during model conversion using the _experimental_supported_accumulation_type attribute of the tf.lite.TargetSpec object:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
...
converter.target_spec.supported_types = [tf.float16]
converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16
When the above conditions are met, XNNPACK replaces FP32 operators with their FP16 equivalents and inserts additional operators to convert model inputs from FP32 to FP16 and model outputs back from FP16 to FP32. If the above conditions are not met, XNNPACK will perform model inference with FP32 calculations.
Additionally, the XNNPACK delegate provides an option to force FP16 inference regardless of model metadata. This option is intended for development workflows, in particular for testing the end-to-end accuracy of a model when FP16 inference is used. Forcing FP16 inference has several effects:
* Besides ARM64 devices with the ARMv8.2 FP16 arithmetics extension, forced FP16 inference is supported on x86/x86-64 devices with the AVX2 extension in emulation mode: all elementary floating-point operations are computed in FP32, then converted to FP16 and back to FP32. Note that such emulation is not bit-exact with native FP16 inference, but simulates the effects of the restricted mantissa precision and exponent range of native FP16 arithmetics.
* On devices that support neither native FP16 arithmetics (ARM64 devices with the ARMv8.2 FP16 extension) nor emulation (x86/x86-64 devices with the AVX2 extension), inference will fail rather than fall back to FP32.
* If any floating-point operator offloaded to XNNPACK is not supported for FP16 inference, inference will fail rather than fall back to FP32.
To force FP16 inference, either build the delegate with the --define xnnpack_force_float_precision=fp16 option, or add the TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16 flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
XNNPACK has full feature parity between FP32 and FP16 operators: all operators that are supported for FP32 inference are also supported for FP16 inference, and vice versa. In particular, sparse inference operators are supported for FP16 inference on ARM processors.
By default, quantized inference in the XNNPACK delegate is disabled, and XNNPACK is used only for floating-point models. Support for quantized inference in XNNPACK must be enabled by adding extra Bazel flags when building TensorFlow Lite.
* The --define tflite_with_xnnpack_qs8=true flag enables XNNPACK inference for quantized operators using the signed quantization schema. This schema is used by models produced by the Model Optimization Toolkit through either post-training integer quantization or quantization-aware training. Post-training dynamic range quantization is not supported in XNNPACK.
* The --define tflite_with_xnnpack_qu8=true flag enables XNNPACK inference for quantized operators using the unsigned quantization schema, produced via the legacy TensorFlow 1.x quantization tooling. This option is experimental and may perform suboptimally on mobile processors with NEON DOT product instructions.
Below is the list of currently supported quantized operators:
* ADD (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* CONCATENATION
* CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type, and can use either per-tensor or per-channel quantization parameters; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEPTH_TO_SPACE
* DEPTHWISE_CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type, and can use either per-tensor or per-channel quantization parameters; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEQUANTIZE
* ELU
* FULLY_CONNECTED (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* LEAKY_RELU
* LOGISTIC
* MAX_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* MEAN (reduction axes must be static, i.e. use the kTfLiteMmapRo allocation type)
* MUL (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* PAD (padding must be static, i.e. use the kTfLiteMmapRo allocation type)
* QUANTIZE
* RESHAPE (the new shape must be static, i.e. use the kTfLiteMmapRo allocation type, or absent, with the new shape specified via the ReshapeOptions table)
* RESIZE_BILINEAR (the new size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SLICE (begin and size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SPACE_TO_DEPTH
* SPLIT
* SUB (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* TANH
* TRANSPOSE (the permutation must be static, i.e. use the kTfLiteMmapRo allocation type)
* TRANSPOSE_CONV (the output shape, weights, and bias must be static, i.e. use the kTfLiteMmapRo allocation type)

The XNNPACK backend supports sparse inference for CNN models described in the Fast Sparse ConvNets paper. Sparse inference is restricted to subgraphs with the following floating-point operators:
* The sparse subgraph must store its weights in sparse representation (using DENSIFY operators in the TensorFlow Lite schema).
* The sparse subgraph must begin with a CONV_2D operator with padding 1 on each side, no dilation, and 3 input channels.
* The sparse subgraph must end with either a MEAN operator with reduction across spatial axes, or a DEPTH_TO_SPACE operator.
* The sparse subgraph may contain the following operators:
  * CONV_2D with 1x1 kernel and no padding. At least 2/3rd of the filter weights in the 1x1 CONV_2D operators across the sparse subgraph must be zeroes to enable sparse inference.
  * DEPTHWISE_CONV_2D with 3x3 kernel, stride 1, no dilation, and padding 1 on each side.
  * DEPTHWISE_CONV_2D with 3x3 kernel, stride 2, no dilation, and padding 1 on each side.
  * DEPTHWISE_CONV_2D with 5x5 kernel, stride 1, no dilation, and padding 2 on each side.
  * DEPTHWISE_CONV_2D with 5x5 kernel, stride 2, no dilation, and padding 2 on each side.
  * RESIZE_BILINEAR with output dimensions greater than 1.
  * MEAN with reduction across spatial axes.
  * ADD and MUL where both inputs are 4D tensors. If one of the inputs to ADD or MUL is a constant tensor, it must be representable as either a scalar or a 1D vector.
  * Unary elementwise operators ABS, CEIL, ELU, FLOOR, HARD_SWISH, LEAKY_RELU, LOGISTIC, NEG, RELU, RELU6, RELU_N1_TO_1, ROUND, SIGMOID, and SQUARE.

Pre-trained Fast Sparse ConvNets models provide examples that satisfy these constraints.
Some XNNPACK operators, such as CONV_2D, use indirection buffers to supply the locations of the input data to the operator. Indirection buffers are created for each operator instance and are persistent by default. This causes XNNPACK to use a substantial amount of memory, especially when the input is high-resolution.
To reduce the memory footprint of indirection buffers, either build the delegate with the --define tflite_with_xnnpack_transient_indirection_buffer=true option, or add the TFLITE_XNNPACK_DELEGATE_FLAG_TRANSIENT_INDIRECTION_BUFFER flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_TRANSIENT_INDIRECTION_BUFFER;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
XNNPACK will now use the temporary memory in the workspace for indirection buffers. However, instead of initializing the indirection buffers once during the initialization of the operators, the indirection buffers will be initialized during every inference run.
Below is the list of currently supported operators:
* CONV_2D
* DEPTHWISE_CONV_2D
* RESIZE_BILINEAR