Text embedders allow embedding text into a high-dimensional feature vector representing its semantic meaning, which can then be compared with the feature vectors of other texts to evaluate their semantic similarity.

As opposed to text search, a text embedder computes the similarity between texts on the fly instead of searching through a predefined index built from a corpus.
Use the Task Library TextEmbedder API to deploy your custom text embedder into
your mobile apps.
* Input text processing, including in-graph or out-of-graph Wordpiece or Sentencepiece tokenizations on input text.
* Built-in utility function to compute the cosine similarity between feature vectors.
The following models are guaranteed to be compatible with the TextEmbedder API:

* The Universal Sentence Encoder TFLite model from TensorFlow Hub
* Custom models that meet the model compatibility requirements.
```cpp
// Initialization.
TextEmbedderOptions options;
options.mutable_base_options()->mutable_model_file()->set_file_name(model_path);
std::unique_ptr<TextEmbedder> text_embedder =
    TextEmbedder::CreateFromOptions(options).value();

// Run inference with your two inputs, `input_text1` and `input_text2`.
const EmbeddingResult result_1 = text_embedder->Embed(input_text1);
const EmbeddingResult result_2 = text_embedder->Embed(input_text2);

// Compute cosine similarity.
double similarity = TextEmbedder::CosineSimilarity(
    result_1.embeddings[0].feature_vector(),
    result_2.embeddings[0].feature_vector());
```
See the source code for more options to configure TextEmbedder.
You can install the TensorFlow Lite Support PyPI package using the following command:

```shell
pip install tflite-support
```
```python
from tflite_support.task import text

# Initialization.
text_embedder = text.TextEmbedder.create_from_file(model_path)

# Run inference on two texts.
result_1 = text_embedder.embed(text_1)
result_2 = text_embedder.embed(text_2)

# Compute cosine similarity.
feature_vector_1 = result_1.embeddings[0].feature_vector
feature_vector_2 = result_2.embeddings[0].feature_vector
similarity = text_embedder.cosine_similarity(feature_vector_1, feature_vector_2)
```
See the source code for more options to configure TextEmbedder.
Cosine similarity between normalized feature vectors returns a score between -1 and 1. Higher is better; a cosine similarity of 1 means the two vectors are identical. For example:
Cosine similarity: 0.954312
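To make the metric concrete, the helper below is a minimal plain-Python sketch of cosine similarity for two float vectors. It is an illustration only, not the Task Library's implementation:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length float vectors.

    Returns dot(u, v) / (|u| * |v|), a value in [-1, 1].
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Identical vectors score ~1.0; orthogonal vectors score ~0.0.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In practice you would pass the raw values of the two feature vectors returned by `embed` rather than hand-written lists.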
Try out the simple CLI demo tool for TextEmbedder with your own model and test data.
The TextEmbedder API expects a TFLite model with mandatory
TFLite Model Metadata.
Three main types of models are supported:

* BERT-based models (see source code for more details):
    * Exactly 3 input tensors (kTfLiteString)
    * Exactly one output tensor (kTfLiteUInt8/kTfLiteFloat32), with N components corresponding to the N dimensions of the returned feature vector for this output layer, i.e. [1 x N] or [1 x 1 x 1 x N]
    * An input_process_units for Wordpiece/Sentencepiece Tokenizer
* Universal Sentence Encoder-based models (see source code for more details):
    * Exactly 3 input tensors (kTfLiteString)
    * Exactly 2 output tensors (kTfLiteUInt8/kTfLiteFloat32), with N components corresponding to the N dimensions of the returned feature vector for this output layer, i.e. [1 x N] or [1 x 1 x 1 x N]
* Any text embedder model with:
    * An input text tensor (kTfLiteString)
    * At least one output embedding tensor (kTfLiteUInt8/kTfLiteFloat32), with N components corresponding to the N dimensions of the returned feature vector for this output layer, i.e. [1 x N] or [1 x 1 x 1 x N]