tensorflow/lite/delegates/xnnpack/README.md
XNNPACK is a highly optimized library of neural network inference operators for ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, macOS, and Emscripten environments. This document describes how to use the XNNPACK library as an inference engine for TensorFlow Lite.
XNNPACK integrates with the TensorFlow Lite interpreter through the delegation mechanism. TensorFlow Lite supports several methods to enable XNNPACK for floating-point inference.
Pre-built nightly TensorFlow Lite binaries for Android include XNNPACK, although it is disabled by default. Use the setUseXNNPACK method in the Interpreter.Options class to enable it:
Interpreter.Options interpreterOptions = new Interpreter.Options();
interpreterOptions.setUseXNNPACK(true);
Interpreter interpreter = new Interpreter(model, interpreterOptions);
Pre-built nightly TensorFlow Lite CocoaPods include XNNPACK, but do not enable it by default. Swift developers can use an InterpreterOptions object to enable XNNPACK:
var options = InterpreterOptions()
options.isXNNPackEnabled = true
var interpreter = try Interpreter(modelPath: "model/path", options: options)
Objective-C developers can enable XNNPACK via the useXNNPACK property in the TFLInterpreterOptions class:
TFLInterpreterOptions *options = [[TFLInterpreterOptions alloc] init];
options.useXNNPACK = YES;
NSError *error;
TFLInterpreter *interpreter =
[[TFLInterpreter alloc] initWithModelPath:@"model/path"
options:options
error:&error];
When building TensorFlow Lite with Bazel, add --define tflite_with_xnnpack=true, and the TensorFlow Lite interpreter will use the XNNPACK engine by default. The exact command depends on the target platform; e.g. for the Android AAR you would use:
bazel build -c opt --fat_apk_cpu=x86,x86_64,arm64-v8a,armeabi-v7a \
--host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
--define android_dexmerger_tool=d8_dexmerger \
--define android_incremental_dexing_tool=d8_dexbuilder \
--define tflite_with_xnnpack=true \
//tensorflow/lite/java:tensorflow-lite
Note that in this case the Interpreter::SetNumThreads invocation does not affect the number of threads used by the XNNPACK engine. To specify the number of threads available to the XNNPACK engine, you should pass the value manually when constructing the interpreter. The snippet below illustrates this, assuming you are using InterpreterBuilder to construct the interpreter:
// Load model
tflite::Model* model;
...
// Construct the interpreter, passing the number of threads available to XNNPACK
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
TfLiteStatus res =
    tflite::InterpreterBuilder(model, resolver)(&interpreter, num_threads);
By default, the XNNPACK engine used by the TensorFlow Lite interpreter uses a single thread for inference.
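For reference, the sketch below shows one way to load a model from disk and build an interpreter with a given number of threads; the BuildInterpreter helper and its error handling are illustrative, not a prescribed pattern:

```c++
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

std::unique_ptr<tflite::Interpreter> BuildInterpreter(const char* model_path,
                                                      int num_threads) {
  // Load the model from disk; returns nullptr on failure.
  auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
  if (!model) return nullptr;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  // The num_threads value passed here is also used by the built-in XNNPACK engine.
  if (tflite::InterpreterBuilder(*model, resolver)(&interpreter, num_threads) !=
      kTfLiteOk) {
    return nullptr;
  }
  if (interpreter->AllocateTensors() != kTfLiteOk) return nullptr;
  return interpreter;
}
```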
Another way to enable XNNPACK is to build and link the
//tensorflow/lite:tflite_with_xnnpack target into your application alongside
the TensorFlow Lite framework.
This method works on platforms which support POSIX-style weak symbols (Android, iOS, Linux, Mac, but NOT Windows).
While it is possible to use the low-level delegate API to enable XNNPACK, this method is NOT RECOMMENDED unless you need to use TensorFlow Lite both with and without XNNPACK (e.g. for benchmarking). With the low-level delegate API, users create an XNNPACK delegate with the TfLiteXNNPackDelegateCreate function, and then call Interpreter::ModifyGraphWithDelegate to delegate supported parts of the model to the XNNPACK delegate. The delegate must be destroyed with TfLiteXNNPackDelegateDelete after releasing the TensorFlow Lite interpreter. The snippet below illustrates the typical usage:
// Build the interpreter
std::unique_ptr<tflite::Interpreter> interpreter;
...
// IMPORTANT: initialize options with TfLiteXNNPackDelegateOptionsDefault() for
// API-compatibility with future extensions of the TfLiteXNNPackDelegateOptions
// structure.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.num_threads = num_threads;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(xnnpack_delegate) != kTfLiteOk) {
// Report error and fall back to another delegate, or the default backend
}
// IMPORTANT: AllocateTensors can be called only AFTER ModifyGraphWithDelegate
...
// Run inference using XNNPACK
interpreter->Invoke();
...
// IMPORTANT: release the interpreter before destroying the delegate
interpreter.reset();
TfLiteXNNPackDelegateDelete(xnnpack_delegate);
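Because the interpreter must be released before the delegate is destroyed, it can be convenient to tie the delegate's lifetime to a smart pointer with a custom deleter. The alias and factory function below are an illustrative sketch, not part of the TensorFlow Lite API:

```c++
#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Owns an XNNPACK delegate and guarantees TfLiteXNNPackDelegateDelete is called.
using XnnpackDelegatePtr =
    std::unique_ptr<TfLiteDelegate, decltype(&TfLiteXNNPackDelegateDelete)>;

XnnpackDelegatePtr MakeXnnpackDelegate(int num_threads) {
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  options.num_threads = num_threads;
  return XnnpackDelegatePtr(TfLiteXNNPackDelegateCreate(&options),
                            &TfLiteXNNPackDelegateDelete);
}
```

Note that the teardown order still matters: reset or destroy the interpreter before the XnnpackDelegatePtr goes out of scope.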
XNNPACK internally packs static weights for operations (like convolutions) in order to make accessing weights more memory friendly, and it needs to allocate memory internally to hold these packed weights. If you start multiple TFLite interpreter instances based on the same model, each instance may hold its own copy of the same packed weights, which can cause high memory usage.
The weights cache can be used to store these packed weights in a file, avoiding re-packing on every run and allowing packed weights to be shared between multiple TFLite instances. Depending on your use case, this can lead to significant initialization speed-ups and memory savings.
The initialization speed-up happens because packing is only done once; subsequent runs read the packed weights from the cache file, skipping the most expensive part of XNNPACK's initialization.
The memory savings have several causes, not all of which apply to every use case:
1. The original weights are never read when the cache is used, because packing doesn't happen. TFLite usually uses mmap to load model files, and mmap only pulls the data that is actually read into memory.
2. The weight cache provides buffer de-duplication: if multiple tensors share the same weights, only one copy of the corresponding packed weights is kept. This is usually the case for LLMs and for models with several signatures.
3. The weight cache can be shared between interpreter instances, further de-duplicating packed data.
4. Thanks to mmap, the file-backed cache can be shared between processes, further de-duplicating packed data. This is automatic; you don't need to do anything.
The weights cache is a contents-based cache. Every time XNNPACK has to pack weights, it first checks whether the packed weights are already in the cache. If they are, the cached packed weights are used for subsequent operations. Otherwise, the weights are packed and added to the cache.
Warning: The weight cache cannot be shared between models or hardware architectures; a different cache file must be used for each (model, architecture) pair.
Warning: XNNPACK does its best to detect outdated cache files but cannot check for model changes. Checking whether the model has been updated and deleting old cache files is left to the user.
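For example, one purely illustrative way to handle this in the application is to compare file modification times and delete a cache file that is older than the model it was built from (C++17 filesystem; the helper below is not part of the TensorFlow Lite API):

```c++
#include <filesystem>

// Hypothetical helper: delete the weight cache when the model file is newer
// than the cache file, so XNNPACK re-packs weights from the updated model.
void InvalidateStaleWeightCache(const std::filesystem::path& model_path,
                                const std::filesystem::path& cache_path) {
  namespace fs = std::filesystem;
  if (fs::exists(cache_path) && fs::exists(model_path) &&
      fs::last_write_time(model_path) > fs::last_write_time(cache_path)) {
    fs::remove(cache_path);
  }
}
```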
Saving the cache to disk gives you all of the advantages listed above.
std::unique_ptr<tflite::Interpreter> interpreter;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weight_cache_file_path = "path/to/the/cache/file";
// Modify graph with delegate, as above...
TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreter.
//
// Static weights will be packed and written to the cache file on the first run,
// and read directly from disk on subsequent runs.
If you cannot access a file system, the cache can also be used "in-memory" instead of saving it to disk.
You will lose advantages 1 and 4 but can still profit from 2 and 3.
Note: Currently, this is only accessible on systems that have the memfd_create
system call.
std::unique_ptr<tflite::Interpreter> interpreter;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weight_cache_file_path =
TfLiteXNNPackDelegateInMemoryFilePath();
// Modify graph with delegate, as above...
TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreter.
//
// Static weights will be packed and written to the in-memory cache the first
// time, and read back from that cache on subsequent runs.
Sharing the cache between interpreter instances works the same way whether the cache is file-backed or in-memory. To share a cache between interpreters, create the cache provider outside of the delegate and pass it to each delegate instance.
std::unique_ptr<tflite::Interpreter> interpreter1;
std::unique_ptr<tflite::Interpreter> interpreter2;
// Create a weight cache. This should outlive the interpreter.
tflite::xnnpack::MMapWeightCacheProvider weight_cache;
// Like using the low-level API above, initialize options, and pass this cache
// to an XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
// When sharing an existing cache, the path will be used by the first
// interpreter that is run to load it or create it.
xnnpack_options.weight_cache_file_path = /* See previous examples. */;
// Share the cache.
xnnpack_options.weight_cache_provider = &weight_cache;
// Modify graph with delegate, as above...
TfLiteDelegate* delegate1 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter1->ModifyGraphWithDelegate(delegate1) != kTfLiteOk) {
// Handle errors...
}
// Signal to the weight cache provider that there's no building to be done
// anymore. That way subsequent interpreter setups won't try to continue
// building the cache.
weight_cache.StopBuild();
// Modify graph with delegate, as above...
TfLiteDelegate* delegate2 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter2->ModifyGraphWithDelegate(delegate2) != kTfLiteOk) {
// Handle errors...
}
// You can now run the interpreters.
//
// Static weights will be packed and written into the shared cache by the first
// interpreter that runs, and read back from it by the others.
Warning: Sharing the cache is not thread safe for building. You should always do
one full run of one of the interpreters before starting threading. Once the
building run is done, call weight_cache.StopBuild() before using the weight
cache provider to build other delegate instances.
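To make the required ordering concrete, here is a minimal sketch that reuses interpreter1, interpreter2, weight_cache, and xnnpack_options from the snippet above; the setup_with_cache lambda is a hypothetical wrapper around the delegate setup shown earlier, not a TensorFlow Lite API:

```c++
#include <thread>

// Hypothetical helper: creates an XNNPACK delegate from the shared options and
// applies it to the given interpreter, as in the snippet above.
auto setup_with_cache = [&](std::unique_ptr<tflite::Interpreter>& interpreter) {
  TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    // Handle errors...
  }
};

// 1. Build the cache on the current thread with the first interpreter,
//    including one full inference run.
setup_with_cache(interpreter1);
interpreter1->AllocateTensors();
interpreter1->Invoke();

// 2. Freeze the cache so later setups only read from it.
weight_cache.StopBuild();

// 3. Only now is it safe to set up the remaining interpreters from worker threads.
std::thread worker([&] { setup_with_cache(interpreter2); });
worker.join();
```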
When TfLite profiling is enabled, XNNPACK will time each operator and report the results to TfLite, which will print them as part of the overall execution profile.
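For instance, one way to enable profiling from C++ is to attach a profiler to the interpreter before invoking it. The sketch below uses TFLite's BufferedProfiler and ProfileSummarizer utility classes; the exact constructor arguments and method signatures are assumptions and may differ between TensorFlow Lite versions.

```c++
#include <iostream>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/profiling/buffered_profiler.h"
#include "tensorflow/lite/profiling/profile_summarizer.h"

// Rough sketch: attach a buffered profiler, run one inference, and print a
// per-operator summary.
void ProfileOneRun(tflite::Interpreter* interpreter) {
  tflite::profiling::BufferedProfiler profiler(/*max_num_entries=*/1024);
  interpreter->SetProfiler(&profiler);

  profiler.StartProfiling();
  interpreter->Invoke();
  profiler.StopProfiling();

  tflite::profiling::ProfileSummarizer summarizer;
  summarizer.ProcessProfiles(profiler.GetProfileEvents(), *interpreter);
  std::cout << summarizer.GetOutputString() << std::endl;
}
```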
The XNNPACK delegate is a work in progress and currently supports a limited set of operators. Unsupported operators fall back to the default TensorFlow Lite implementations, so models using a combination of supported and unsupported operators can still benefit from the XNNPACK delegate.
Below is the list of currently supported floating-point operators:
* ABS
* ADD (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* AVERAGE_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* CEIL
* CONCATENATION
* CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEPTH_TO_SPACE
* DEPTHWISE_CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DIV (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* ELU
* FULLY_CONNECTED (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* FLOOR
* HARD_SWISH
* LEAKY_RELU
* LOGISTIC
* MAX_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* MAXIMUM
* MEAN (reduction axes must be static, i.e. use the kTfLiteMmapRo allocation type)
* MINIMUM
* MUL (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* NEG
* PAD (padding must be static, i.e. use the kTfLiteMmapRo allocation type)
* PRELU (slope must be static, i.e. use the kTfLiteMmapRo allocation type)
* RELU
* RELU6
* RELU_N1_TO_1
* RESHAPE (the new shape must be static, i.e. use the kTfLiteMmapRo allocation type, or absent, with the new shape specified via the ReshapeOptions table)
* RESIZE_BILINEAR (the new size must be static, i.e. use the kTfLiteMmapRo allocation type)
* ROUND
* SLICE (begin and size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SOFTMAX (only beta = 1.0 is supported)
* SPACE_TO_DEPTH
* SPLIT
* SQRT
* SQUARE
* SQUARED_DIFFERENCE
* STRIDED_SLICE (begin, end, and strides must be static, i.e. use the kTfLiteMmapRo allocation type)
* SUB (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* TANH
* TRANSPOSE (the permutation must be static, i.e. use the kTfLiteMmapRo allocation type)
* TRANSPOSE_CONV (the output shape, weights, and bias must be static, i.e. use the kTfLiteMmapRo allocation type)

XNNPACK supports half-precision (using IEEE FP16 format) inference for all floating-point operators. XNNPACK automatically enables half-precision inference when the following conditions are met:
* XNNPACK runs on hardware that natively supports computations in IEEE FP16 format. Currently, this hardware is limited to ARM & ARM64 devices with the ARMv8.2 FP16 arithmetics extension, and includes Android phones starting with Pixel 3, Galaxy S9 (Snapdragon SoC), Galaxy S10 (Exynos SoC), iOS devices with A11 or newer SoCs, all Apple Silicon Macs, and Windows ARM64 laptops based on the Snapdragon 850 SoC or newer.
* The model's "reduced_precision_support" metadata indicates that the model is compatible with FP16 inference. The metadata can be added during model conversion using the _experimental_supported_accumulation_type attribute of the tf.lite.TargetSpec object:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
...
converter.target_spec.supported_types = [tf.float16]
converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16
When the above conditions are met, XNNPACK replaces FP32 operators with their FP16 equivalents and inserts additional operators to convert model inputs from FP32 to FP16 and model outputs back from FP16 to FP32. If the above conditions are not met, XNNPACK will perform model inference with FP32 calculations.
Additionally, the XNNPACK delegate provides an option to force FP16 inference regardless of model metadata. This option is intended for development workflows, in particular for testing the end-to-end accuracy of a model when FP16 inference is used. Forcing FP16 inference has several effects:
* Besides ARM64 devices with the ARMv8.2 FP16 arithmetics extension, forced FP16 inference is supported on x86/x86-64 devices with the AVX2 extension in emulation mode: all elementary floating-point operations are computed in FP32, then converted to FP16 and back to FP32. Note that such emulation is not bit-exact with native FP16 inference, but simulates the effects of the restricted mantissa precision and exponent range of native FP16 arithmetics.
* On devices that support neither native FP16 arithmetics (ARM64 devices with the ARMv8.2 FP16 extension) nor emulation (x86/x86-64 devices with the AVX2 extension), inference will fail rather than fall back to FP32.
* If any floating-point operator offloaded to XNNPACK is not supported for FP16 inference, inference will fail rather than fall back to FP32.
To force FP16 inference, either build the delegate with the --define xnnpack_force_float_precision=fp16 option, or add the TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16 flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
XNNPACK has full feature parity between FP32 and FP16 operators: all operators that are supported for FP32 inference are also supported for FP16 inference, and vice versa. In particular, sparse inference operators are supported for FP16 inference on ARM processors.
By default, quantized inference in the XNNPACK delegate is disabled, and XNNPACK is used only for floating-point models. Support for quantized inference in XNNPACK must be enabled by adding extra Bazel flags when building TensorFlow Lite.
* The --define tflite_with_xnnpack_qs8=true flag enables XNNPACK inference for quantized operators using the signed quantization schema. This schema is used by models produced by the Model Optimization Toolkit through either post-training integer quantization or quantization-aware training. Post-training dynamic range quantization is not supported in XNNPACK.
* The --define tflite_with_xnnpack_qu8=true flag enables XNNPACK inference for quantized operators using the unsigned quantization schema, produced via the legacy TensorFlow 1.x quantization tooling. This option is experimental and may perform suboptimally on mobile processors with NEON DOT product instructions.
Below is the list of currently supported quantized operators:
* ADD (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* CONCATENATION
* CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type, and can use either per-tensor or per-channel quantization parameters; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEPTH_TO_SPACE
* DEPTHWISE_CONV_2D (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type, and can use either per-tensor or per-channel quantization parameters; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* DEQUANTIZE
* ELU
* FULLY_CONNECTED (weights and bias must be static, i.e. use the kTfLiteMmapRo allocation type; fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* LEAKY_RELU
* LOGISTIC
* MAX_POOL_2D (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* MEAN (reduction axes must be static, i.e. use the kTfLiteMmapRo allocation type)
* MUL (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* PAD (padding must be static, i.e. use the kTfLiteMmapRo allocation type)
* QUANTIZE
* RESHAPE (the new shape must be static, i.e. use the kTfLiteMmapRo allocation type, or absent, with the new shape specified via the ReshapeOptions table)
* RESIZE_BILINEAR (the new size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SLICE (begin and size must be static, i.e. use the kTfLiteMmapRo allocation type)
* SPACE_TO_DEPTH
* SPLIT
* SUB (fused NONE, RELU, RELU_N1_TO_1, and RELU6 activations are supported, but fused TANH and SIGN_BIT activations are not)
* TANH
* TRANSPOSE (the permutation must be static, i.e. use the kTfLiteMmapRo allocation type)
* TRANSPOSE_CONV (the output shape, weights, and bias must be static, i.e. use the kTfLiteMmapRo allocation type)

The XNNPACK backend supports sparse inference for CNN models described in the Fast Sparse ConvNets paper. Sparse inference is restricted to subgraphs with the following floating-point operators:
* The sparse subgraph must store its weights in sparse representation (using DENSIFY operators in the TensorFlow Lite schema).
* The sparse subgraph must begin with a CONV_2D operator with padding 1 on each side, no dilation, and 3 input channels.
* The sparse subgraph must end with either a MEAN operator with reduction across spatial axes, or a DEPTH_TO_SPACE operator.
* The sparse subgraph may contain the following operators:
  * CONV_2D with 1x1 kernel and no padding. At least 2/3rd of the filter weights in the 1x1 CONV_2D operators across the sparse subgraph must be zeroes to enable sparse inference.
  * DEPTHWISE_CONV_2D with 3x3 kernel, stride 1, no dilation, and padding 1 on each side.
  * DEPTHWISE_CONV_2D with 3x3 kernel, stride 2, no dilation, and padding 1 on each side.
  * DEPTHWISE_CONV_2D with 5x5 kernel, stride 1, no dilation, and padding 2 on each side.
  * DEPTHWISE_CONV_2D with 5x5 kernel, stride 2, no dilation, and padding 2 on each side.
  * RESIZE_BILINEAR with output dimensions greater than 1.
  * MEAN with reduction across spatial axes.
  * ADD and MUL where both inputs are 4D tensors. If one of the inputs to ADD or MUL is a constant tensor, it must be representable as either a scalar or a 1D vector.
  * Unary elementwise operators ABS, CEIL, ELU, FLOOR, HARD_SWISH, LEAKY_RELU, LOGISTIC, NEG, RELU, RELU6, RELU_N1_TO_1, ROUND, SIGMOID, and SQUARE.

Pre-trained Fast Sparse ConvNets models provide examples that satisfy these constraints.
Some XNNPACK operators, such as CONV_2D, use indirection buffers to supply the locations of the input data to the operator. Indirection buffers are created for each operator instance and are persistent by default. This causes XNNPACK to use a substantial amount of memory, especially when the input is high-resolution.
To reduce the memory footprint of indirection buffers, either build the delegate with the --define tflite_with_xnnpack_transient_indirection_buffer=true option, or add the TFLITE_XNNPACK_DELEGATE_FLAG_TRANSIENT_INDIRECTION_BUFFER flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:
TfLiteXNNPackDelegateOptions xnnpack_options =
TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_TRANSIENT_INDIRECTION_BUFFER;
TfLiteDelegate* xnnpack_delegate =
TfLiteXNNPackDelegateCreate(&xnnpack_options);
XNNPACK will now use the temporary memory in the workspace for indirection buffers. However, instead of initializing the indirection buffers once during the initialization of the operators, the indirection buffers will be initialized during every inference run.
Below is the list of currently supported operators:
* CONV_2D
* DEPTHWISE_CONV_2D
* RESIZE_BILINEAR