# TFLite Quantize Weights Tool
## Recommended usage

The Quantize Weights transformation is integrated with tflite_convert. The
recommended way of invoking this tool is by simply adding the
`--post_training_quantize` flag to your original tflite_convert invocation. For
example:

```
tflite_convert \
  --output_file=/tmp/foo.tflite \
  --saved_model_dir=/tmp/saved_model \
  --post_training_quantize
```
## Overview

The Quantize Weights tool provides a simple way to quantize the weights for a
float TFLite model.

TODO(raghuramank): Add link to weight quantization tutorial.

float32 weights are converted to 8-bit integers. Since each 32-bit float weight
becomes a single 8-bit integer, this results in a model that is around 1/4 the
size of the original model.
TFLite also has "hybrid" kernels implemented for many operations. These
"hybrid" kernels take 8-bit integer weights and float inputs, dynamically
quantize the input tensor (based on the input tensor's min and max elements),
and perform the computations using the 8-bit integer values. This results in a
2-4x reduction in latency for "hybrid" kernels. In this mode the inference type
is still FLOAT, since the inputs and outputs of each operation are still float.
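The dynamic quantization step can be sketched roughly as follows. This is an
illustrative example, not the actual TFLite kernel code: it uses the standard
asymmetric 8-bit scheme, deriving a scale and zero point from the tensor's min
and max elements (the `QuantizedTensor` struct and `DynamicallyQuantize`
function are hypothetical names for this sketch):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration: holds an 8-bit quantized tensor
// together with its affine quantization parameters.
struct QuantizedTensor {
  std::vector<uint8_t> values;
  float scale;
  int32_t zero_point;
};

// Dynamically quantize a float tensor to 8 bits based on its min/max elements,
// as a "hybrid" kernel might do for its input at inference time.
QuantizedTensor DynamicallyQuantize(const std::vector<float>& input) {
  const auto minmax = std::minmax_element(input.begin(), input.end());
  // Widen the range to include 0 so that zero is exactly representable.
  const float min = std::min(*minmax.first, 0.0f);
  const float max = std::max(*minmax.second, 0.0f);
  QuantizedTensor result;
  result.scale = (max - min) / 255.0f;
  result.zero_point = static_cast<int32_t>(std::round(-min / result.scale));
  result.values.reserve(input.size());
  for (float v : input) {
    const int32_t q =
        static_cast<int32_t>(std::round(v / result.scale)) + result.zero_point;
    // Clamp to the representable 8-bit range.
    result.values.push_back(static_cast<uint8_t>(std::clamp(q, 0, 255)));
  }
  return result;
}
```

The kernel then performs its inner loops on the 8-bit values, which is where
the latency win comes from.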
For operations that do not yet have "hybrid" kernels implemented, we introduce
a Dequantize operation after the 8-bit integer weights. It converts the weights
back to float32 during inference to allow the original float32 kernels to run.
Since we cache the dequantized results, the performance of each of these
dequantized paths will be on par with the original float model.
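The Dequantize step applies the inverse of the affine mapping above,
`real_value = scale * (quantized_value - zero_point)`. A minimal sketch
(again illustrative, not the TFLite Dequantize kernel itself; the `Dequantize`
function name is an assumption for this example):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Map 8-bit quantized weights back to float32 using the affine quantization
// parameters: real_value = scale * (quantized_value - zero_point).
std::vector<float> Dequantize(const std::vector<uint8_t>& weights, float scale,
                              int32_t zero_point) {
  std::vector<float> result;
  result.reserve(weights.size());
  for (uint8_t w : weights) {
    result.push_back(scale * (static_cast<int32_t>(w) - zero_point));
  }
  return result;
}
```

Because the dequantized tensor is cached, this conversion cost is paid once
rather than on every inference.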
TODO(yunluli): Fill in latency results from latency experiments.
Since this technique quantizes weights after the model has already been trained, there can be accuracy drops depending on the model. For common CNN networks, the observed accuracy drops are small and can be seen below.
TODO(yunluli): Fill in accuracy results from accuracy experiments.
## Direct usage via C++

One can also invoke Quantize Weights directly via C++ if they have a float
`::tflite::Model` that they want to convert. They must provide a
`flatbuffers::FlatBufferBuilder` which owns the underlying buffer of the
created model. Here is an example invocation:
```cpp
::tflite::Model* input_model = ...;
flatbuffers::FlatBufferBuilder builder;
TfLiteStatus status = ::tflite::optimize::QuantizeWeights(&builder, input_model);
CHECK_EQ(status, kTfLiteOk);
const uint8_t* buffer = builder.GetBufferPointer();
const ::tflite::Model* output_model = ::tflite::GetModel(buffer);
```