tensorflow/lite/tools/benchmark/README.md
A simple C++ binary to benchmark a TFLite model and its individual operators, both on desktop machines and on Android. The binary takes a TFLite model, generates random inputs, and then repeatedly runs the model for a specified number of runs. Aggregate latency statistics are reported after running the benchmark.
The instructions below are for running the binary on desktop and Android; for iOS, please use the iOS benchmark app.
An experimental Android APK wrapper for the benchmark model utility offers more faithful execution behavior on Android (via a foreground Activity). It is located here.
The binary takes the following required parameters:

*   `graph`: `string` \
    The path to the TFLite model file.

and the following optional parameters:
*   `signature_to_run_for`: `string` (default="") \
    If the model contains multiple signatures, use this flag to specify the
    signature to benchmark.

*   `num_threads`: `int` (default=-1) \
    The number of threads to use for running the TFLite interpreter. By
    default, this is set to the platform default value -1.

*   `warmup_runs`: `int` (default=1) \
    The number of warmup runs to do before starting the benchmark.

*   `num_runs`: `int` (default=50) \
    The number of runs. Increase this to reduce variance.

*   `max_secs`: `float` (default=150.0) \
    The maximum number of seconds the benchmark will run before being
    terminated.

*   `run_delay`: `float` (default=-1.0) \
    The delay in seconds between subsequent benchmark runs. Non-positive
    values mean no delay is used.

*   `run_frequency`: `float` (default=-1.0) \
    The frequency of benchmark runs, expressed as the number of prorated runs
    per second. If the targeted rate per second cannot be reached, the
    benchmark starts the next run immediately, trying its best to catch up.
    If set, this overrides the `run_delay` parameter. A non-positive value
    means there is no delay between subsequent runs.
*   `enable_op_profiling`: `bool` (default=false) \
    Whether to enable per-operator profiling measurement (see the sample
    invocation after this list).

*   `max_profiling_buffer_entries`: `int` (default=1024) \
    The initial maximum number of profiling events that will be stored during
    each inference run. It is only meaningful when `enable_op_profiling` is
    set to true. Note that the actual value of this parameter will be adjusted
    if the model has more nodes than the specified value. Also, when
    `allow_dynamic_profiling_buffer_increase` is set to true, the number of
    profiling buffer entries will be increased dynamically.

*   `allow_dynamic_profiling_buffer_increase`: `bool` (default=false) \
    Whether to allow dynamic increase of the number of profiling buffer
    entries. It is only meaningful when `enable_op_profiling` is set to true.
    Note that allowing dynamic buffer size increase may cause more profiling
    overhead, so it is preferable to set `max_profiling_buffer_entries` to a
    large-enough value.

*   `op_profiling_output_mode`: `str` (default="stdout") \
    The output mode for the generated profiling information. Requires
    `enable_op_profiling` to be true. Takes one of the following 3 values:

    *   `stdout`: Print the profiling information to STDOUT.
    *   `csv`: Print the profiling information in a CSV format.
    *   `proto`: Print the profiling information in a proto format as
        specified in tensorflow/lite/profiling/proto/profiling_info.proto.

*   `op_profiling_output_file`: `str` (default="") \
    File path to export profile data to. The results are printed to stdout if
    this option is not set. Requires `enable_op_profiling` to be true and the
    path to include the name of the output file; otherwise results are printed
    to stdout.
*   `export_model_runtime_info`: `bool` (default=false) \
    Exports the model runtime information in a proto format as specified in
    tensorflow/lite/profiling/proto/model_runtime_info.proto.

*   `model_runtime_info_output_file`: `str` (default="") \
    File path to export model runtime data to. The results are printed to
    stdout if this option is not set. Requires `export_model_runtime_info` to
    be true and the path to include the name of the output file; otherwise
    results are printed to stdout.

*   `profiling_output_csv_file`: `str` (default="") \
    WARNING: Deprecated, prefer using `op_profiling_output_mode` and
    `op_profiling_output_file` instead. \
    File path to export profile data to as CSV. The results are printed to
    stdout if this option is not set. Requires `enable_op_profiling` to be
    true and the path to include the name of the output CSV; otherwise results
    are printed to stdout.

*   `output_filepath`: `str` (default="") \
    File path to save output tensor data to. If specified, the output tensor
    values are saved as binary data in the file.

*   `output_proto_filepath`: `str` (default="") \
    File path to save output tensor data to as a TensorFlow Example proto. If
    specified, the output tensor values are saved in a TensorFlow Example and
    then serialized to the file.
*   `print_preinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before calling
    tflite::Interpreter::Invoke. The internals include the allocated memory
    size of each tensor, etc. Enabling this can help in understanding the
    TfLite graph and memory usage.

*   `print_postinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before the
    benchmark completes (i.e. after all repeated Invoke calls complete). The
    internals include the allocated memory size of each tensor, etc. Enabling
    this can help in understanding the TfLite graph and memory usage,
    particularly when there are dynamic-shaped tensors in the graph.

*   `report_peak_memory_footprint`: `bool` (default=false) \
    Whether to report the peak memory footprint by periodically checking the
    memory footprint. Internally, a separate thread is spawned for this
    periodic check, so the performance benchmark result could be affected.

*   `memory_footprint_check_interval_ms`: `int` (default=50) \
    The interval in milliseconds between two consecutive memory footprint
    checks. This is only used when `report_peak_memory_footprint` is set to
    true.

*   `dry_run`: `bool` (default=false) \
    Whether to run the tool by simply loading the model, allocating tensors,
    etc., but without actually invoking any op kernels.

*   `verbose`: `bool` (default=false) \
    Whether to log parameters whose values are not set. By default, only
    parameters that are set by parsing their values from the command-line
    flags are logged.

*   `release_dynamic_tensors`: `bool` (default=false) \
    Whether to configure the Interpreter to immediately release the memory of
    dynamic tensors in the graph once they are no longer used.

*   `optimize_memory_for_large_tensors`: `int` (default=0) \
    Whether to optimize memory usage for large tensors by sacrificing latency.
    When this feature is enabled, `release_dynamic_tensors` is also enabled.

*   `enable_builtin_cast_constant_cache`: `bool` (default=false) \
    Configure the builtin TFLite CAST operation to cache its output if its
    input is a constant tensor. \
    WARNING: This is an experimental option that may be removed at any time.
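As a sketch of how several of the optional flags above combine, the following invocation assumes the binary and a model have already been pushed to an Android device as described in the build/run steps further below; the file paths are purely illustrative:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=2 \
  --warmup_runs=2 \
  --num_runs=100 \
  --enable_op_profiling=true \
  --op_profiling_output_mode=csv \
  --op_profiling_output_file=/data/local/tmp/op_profile.csv \
  --report_peak_memory_footprint=true
```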
This list of parameters is not exhaustive. See here and here for all parameters that the binary takes.
By default, the tool will use randomized data for model inputs. The following parameters allow users to specify customized input values to the model when running the benchmark tool:
*   `input_layer`: `string` \
    A comma-separated list of input layer names.

*   `input_layer_shape`: `string` \
    The shapes of the inputs. When used together with `input_layer`, this
    parameter also requires the shapes of all inputs to be specified, and the
    order of inputs to be the same as that seen by the interpreter.

*   `input_layer_value_range`: `string` \
    A value range to use when generating values for an input. Note that the
    input layer name must exist in the list of names specified by
    `input_layer`.

*   `input_layer_value_files`: `string` \
    Files that provide input values, keyed by input layer name, e.g.
    `input\:0:file_path1` for an input named "input:0". If an input name
    appears in both `input_layer_value_range` and `input_layer_value_files`,
    the corresponding input value range specified by `input_layer_value_range`
    will be ignored. The file format is binary, and the content should be
    either a byte array or null-separated strings. Note that the input layer
    name must also exist in the list of names specified by `input_layer`. See
    the sample invocation after this list.
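For instance, a minimal sketch of providing a fixed input shape and raw input data from a file might look like the following; the input name, shape, and file path are placeholders, and the exact separator syntax for multi-input models should be confirmed via `--help` on your build:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=mobilenet_quant_v1_224.tflite \
  --input_layer=input \
  --input_layer_shape=1,224,224,3 \
  --input_layer_value_files=input:/tmp/input_data.bin
```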
The tool supports all runtime/delegate parameters introduced by the delegate
registrar. The following simply lists the names of these parameters and
additional notes where applicable. For details about each parameter, please
refer to this page.
*   `max_delegated_partitions`: `int` (default=0)
*   `min_nodes_per_partition`: `int` (default=0)
*   `delegate_serialize_dir`: `str` (default="")
*   `delegate_serialize_token`: `str` (default="")
*   `use_gpu`: `bool` (default=false)
*   `gpu_precision_loss_allowed`: `bool` (default=true)
*   `gpu_experimental_enable_quant`: `bool` (default=true)
*   `gpu_inference_for_sustained_speed`: `bool` (default=false)
*   `gpu_backend`: `string` (default="")
*   `gpu_wait_type`: `str` (default="")
*   `use_nnapi`: `bool` (default=false) \
    Note that some devices will fail to use NNAPI for models placed in
    /data/local/tmp/, and this benchmark tool will not correctly use NNAPI in
    that case.
*   `nnapi_execution_preference`: `str` (default="") \
    Should be one of the following: fast_single_answer, sustained_speed,
    low_power, undefined.
*   `nnapi_execution_priority`: `str` (default="")
*   `nnapi_accelerator_name`: `str` (default="")
*   `disable_nnapi_cpu`: `bool` (default=true)
*   `nnapi_allow_fp16`: `bool` (default=false)
*   `nnapi_allow_dynamic_dimensions`: `bool` (default=false)
*   `nnapi_use_burst_mode`: `bool` (default=false)
*   `use_hexagon`: `bool` (default=false)
*   `hexagon_profiling`: `bool` (default=false) \
    Takes effect only when `enable_op_profiling` is also turned on. When both
    parameters are set to true, the profile of ops on the Hexagon DSP will be
    added to the profile table. Note that the reported data on Hexagon is in
    cycles, not in ms like on the CPU.
*   `hexagon_lib_path`: `string` (default="/data/local/tmp/")
*   `use_xnnpack`: `bool` (default=false) \
    If this is explicitly set to false, the TfLite runtime will use its
    original CPU kernels for model execution. In other words, after enabling
    the feature that the XNNPACK delegate is applied by default in the TfLite
    runtime, explicitly setting this flag to false will cause the benchmark
    tool to disable the feature at runtime, and to use the original
    non-delegated CPU execution path for model benchmarking.
*   `xnnpack_force_fp16`: `bool` (default=false)
*   `use_coreml`: `bool` (default=false)
*   `coreml_version`: `int` (default=0)
*   `external_delegate_path`: `string` (default="")
*   `external_delegate_options`: `string` (default="")
*   `stable_delegate_loader_settings`: `string` (default="") \
    A path to the JSON-encoded delegate `TFLiteSettings` file, which is
    defined in configuration.proto.

As some delegates are only available on certain platforms, when running the
benchmark tool on a particular platform, specifying `--help` will print out all
supported parameters.
When multiple delegates are specified in the command-line flags, the order in which they are applied to the TfLite runtime is the same as the order in which their enabling flags are specified. For example, "--use_xnnpack=true --use_gpu=true" means applying the XNNPACK delegate first and then the GPU delegate second. In comparison, "--use_gpu=true --use_xnnpack=true" means applying the GPU delegate first and then the XNNPACK delegate second.
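For example, an invocation on Android that applies the XNNPACK delegate followed by the GPU delegate could look like the following (model and binary paths as in the Android instructions below):

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --use_xnnpack=true \
  --use_gpu=true \
  --num_runs=100
```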
Note: The benchmarking tool must be compiled with a TFLite runtime that supports the ops found in the model to be tested.
If TensorFlow ops ("flex ops") or other custom ops are used in the model, please see the sections below.
To build and run the benchmark on Android:

(0) Refer to https://www.tensorflow.org/lite/guide/build_android to edit the
WORKSPACE to configure the Android NDK/SDK.
(1) Build for your specific platform, e.g.:

```
bazel build -c opt \
  --config=android_arm64 \
  tensorflow/lite/tools/benchmark:benchmark_model
```
(2) Connect your phone. Push the binary to your phone with adb push (make the
directory if required):

```
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp
```
(3) Make the binary executable.

```
adb shell chmod +x /data/local/tmp/benchmark_model
```
(4) Push the compute graph that you need to test. For example:

```
adb push mobilenet_quant_v1_224.tflite /data/local/tmp
```
(5) Optionally, install the Hexagon libraries on the device. This step is only
needed when using the Hexagon delegate.

```
bazel build --config=android_arm64 \
  tensorflow/lite/delegates/hexagon/hexagon_nn:libhexagon_interface.so
adb push bazel-bin/tensorflow/lite/delegates/hexagon/hexagon_nn/libhexagon_interface.so /data/local/tmp
adb push libhexagon_nn_skel*.so /data/local/tmp
```
(6) Run the benchmark. For example:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=4
```
To build and run the benchmark on desktop:

(1) Build the binary:

```
bazel build -c opt tensorflow/lite/tools/benchmark:benchmark_model
```
(2) Run on your compute graph, similar to the Android case but without the need
for adb shell. For example:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=mobilenet_quant_v1_224.tflite \
  --num_threads=4
```
The MobileNet graph used as an example here may be downloaded from here.
Most modern Android phones use ARM big.LITTLE
architecture where some cores are more power hungry but faster than other cores.
When running benchmarks on these phones there can be significant variance
between different runs of the benchmark. One way to reduce variance between runs
is to set the CPU affinity
before running the benchmark. On Android this can be done using the taskset
command.
For example, to run the benchmark on the big cores of a Pixel 2 with a single
thread, one can use the following command:

```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=1
```
where f0 is the affinity mask for big cores on Pixel 2.
Note: The affinity mask varies with the device.
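If you are unsure which mask to use on your device, one rough heuristic (an assumption about typical big.LITTLE devices, not something this tool provides) is to compare the maximum frequencies of the cores; the higher-frequency cores are usually the big cores:

```
# Print each core's maximum frequency (in kHz); the exact sysfs path may
# vary across devices and kernel versions.
adb shell 'cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq'
```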
The benchmark model binary also allows you to profile operators and give
execution times of each operator. To do this, pass the flag
--enable_op_profiling=true to benchmark_model during invocation, e.g.,
```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --enable_op_profiling=true
```
When enabled, the benchmark_model binary will produce detailed statistics for
each operation similar to those shown below:
```
============================== Run Order ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
CONV_2D 0.000 4.269 4.269 0.107% 0.107% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
DEPTHWISE_CONV_2D 4.270 2.150 2.150 0.054% 0.161% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]
CONV_2D 6.421 6.107 6.107 0.153% 0.314% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
DEPTHWISE_CONV_2D 12.528 1.366 1.366 0.034% 0.348% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]
CONV_2D 13.895 4.195 4.195 0.105% 0.454% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]
DEPTHWISE_CONV_2D 18.091 1.260 1.260 0.032% 0.485% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]
CONV_2D 19.352 6.652 6.652 0.167% 0.652% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
DEPTHWISE_CONV_2D 26.005 0.698 0.698 0.018% 0.670% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]
CONV_2D 26.703 3.344 3.344 0.084% 0.754% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]
DEPTHWISE_CONV_2D 30.047 0.646 0.646 0.016% 0.770% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]
CONV_2D 30.694 5.800 5.800 0.145% 0.915% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
DEPTHWISE_CONV_2D 36.495 0.331 0.331 0.008% 0.924% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]
CONV_2D 36.826 2.838 2.838 0.071% 0.995% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]
DEPTHWISE_CONV_2D 39.665 0.439 0.439 0.011% 1.006% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]
CONV_2D 40.105 5.293 5.293 0.133% 1.139% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
DEPTHWISE_CONV_2D 45.399 0.352 0.352 0.009% 1.147% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]
CONV_2D 45.752 5.322 5.322 0.133% 1.281% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
DEPTHWISE_CONV_2D 51.075 0.357 0.357 0.009% 1.290% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]
CONV_2D 51.432 5.693 5.693 0.143% 1.433% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
DEPTHWISE_CONV_2D 57.126 0.366 0.366 0.009% 1.442% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]
CONV_2D 57.493 5.472 5.472 0.137% 1.579% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
DEPTHWISE_CONV_2D 62.966 0.364 0.364 0.009% 1.588% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]
CONV_2D 63.330 5.404 5.404 0.136% 1.724% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
DEPTHWISE_CONV_2D 68.735 0.155 0.155 0.004% 1.728% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]
CONV_2D 68.891 2.970 2.970 0.074% 1.802% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]
DEPTHWISE_CONV_2D 71.862 0.206 0.206 0.005% 1.807% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]
CONV_2D 72.069 5.888 5.888 0.148% 1.955% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
AVERAGE_POOL_2D 77.958 0.036 0.036 0.001% 1.956% 0.000 0 [MobilenetV1/Logits/AvgPool_1a/AvgPool]
CONV_2D 77.994 1.445 1.445 0.036% 1.992% 0.000 0 [MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]
RESHAPE 79.440 0.002 0.002 0.000% 1.992% 0.000 0 [MobilenetV1/Predictions/Reshape]
SOFTMAX 79.443 0.029 0.029 0.001% 1.993% 0.000 0 [MobilenetV1/Predictions/Softmax]
============================== Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
CONV_2D 19.352 6.652 6.652 0.167% 0.167% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
CONV_2D 6.421 6.107 6.107 0.153% 0.320% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
CONV_2D 72.069 5.888 5.888 0.148% 0.468% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
CONV_2D 30.694 5.800 5.800 0.145% 0.613% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
CONV_2D 51.432 5.693 5.693 0.143% 0.756% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
CONV_2D 57.493 5.472 5.472 0.137% 0.893% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
CONV_2D 63.330 5.404 5.404 0.136% 1.029% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
CONV_2D 45.752 5.322 5.322 0.133% 1.162% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
CONV_2D 40.105 5.293 5.293 0.133% 1.295% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
CONV_2D 0.000 4.269 4.269 0.107% 1.402% 0.000 0 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
Number of nodes executed: 31
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
CONV_2D 15 1.406 89.270% 89.270% 0.000 0
DEPTHWISE_CONV_2D 13 0.169 10.730% 100.000% 0.000 0
SOFTMAX 1 0.000 0.000% 100.000% 0.000 0
RESHAPE 1 0.000 0.000% 100.000% 0.000 0
AVERAGE_POOL_2D 1 0.000 0.000% 100.000% 0.000 0
Timings (microseconds): count=50 first=79449 curr=81350 min=77385 max=88213 avg=79732 std=1929
Memory (bytes): count=0
31 nodes observed
Average inference timings in us: Warmup: 83235, Init: 38467, Inference: 79760.9
```
A convenient and simple C++ binary is also provided to benchmark multiple
performance options in a single run. This binary is built on top of the
aforementioned benchmark tool, which can only benchmark a single performance
option at a time. They share the same build/install/run process, but the BUILD
target name of this binary is benchmark_model_performance_options and it takes
some additional parameters as detailed below.
*   `perf_options_list`: `string` (default='all')
*   `option_benchmark_run_delay`: `float` (default=-1.0)
*   `random_shuffle_benchmark_runs`: `bool` (default=true)

If you see an error that says: `ERROR: Select TensorFlow op(s), included in the
given model, is(are) not supported by this interpreter.` you will need to build
with TensorFlow operators support.

Having TensorFlow ops in the TFLite file works when the benchmark tool is built
with TensorFlow ops support; it doesn't require any additional option to use it.
To build the tool, you need to use the benchmark_model_plus_flex target with
the --config=monolithic flag.
Desktop:

```
bazel build -c opt \
  --config=monolithic \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
```
Android:

```
bazel build -c opt \
  --config=monolithic --config=android_arm64 \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
```
Follow the further instructions above, replacing benchmark_model with the
benchmark_model_plus_flex binary created here. For example, on desktop:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=model_converted_with_TF_ops.tflite
```
If you see an error that says `ERROR: Op type not registered 'XXXXXXXX' in
binary running on localhost.` for custom ops used in your TFLite model, you
will need to manually build the tool to include your libraries providing the
custom ops.
While possible, this is not necessarily supported.
However, you should be able to create a new cc_binary rule that depends on
tensorflow/lite/tools/benchmark:benchmark_model_main along with your custom op
rules.
```
cc_binary(
    name = "benchmark_model_plus_custom_ops",
    deps = [
        ":my_custom_ops_provider",
        "//tensorflow/lite/tools/benchmark:benchmark_model_main",
    ],
)
```
Use the benchmark_model_plus_custom_ops binary (or whatever name you give your
target) created by your custom rule instead of the benchmark_model binary in
the instructions above.
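For example, assuming the rule above lives in a BUILD file at a hypothetical package path //your/package (substitute your own location), building and pushing it could look like this:

```
bazel build -c opt --config=android_arm64 \
  //your/package:benchmark_model_plus_custom_ops
adb push bazel-bin/your/package/benchmark_model_plus_custom_ops /data/local/tmp
```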