Back to Tensorflow

you may not use this file except in compliance with the License.

site/en/tutorials/load_data/text.ipynb

latest30.4 KB
Original Source
Copyright 2018 The TensorFlow Authors.
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Load text

<table class="tfo-notebook-buttons" align="left"> <td> <a target="_blank" href="https://www.tensorflow.org/tutorials/load_data/text">View on TensorFlow.org</a> </td> <td> <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb">Run in Google Colab</a> </td> <td> <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb">View source on GitHub</a> </td> <td> <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/load_data/text.ipynb">Download notebook</a> </td> </table>

This tutorial demonstrates two ways to load and preprocess text.

  • First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory to turn data into a tf.data.Dataset and tf.keras.layers.TextVectorization for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.
  • Then, you will use lower-level utilities like tf.data.TextLineDataset to load text files, tf.lookup for custom in-model lookup tables, and TensorFlow Text APIs, such as text.UnicodeScriptTokenizer and text.case_fold_utf8, to preprocess the data for finer-grain control.
!pip install "tensorflow-text==2.13.*"
import collections
import pathlib

import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification—an important and widely applicable kind of machine learning problem.

To implement this task, you'll start with the simplest tools:

  • keras.utils.text_datasaet_from_directory: for loading text-file examples.
  • keras.layers.TextVectorization: for converting strings to token indices.

Download and explore the dataset

Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file, and exploring the directory structure:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
list(dataset_dir.iterdir())
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question.

Print an example file and inspect the data:

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

Load the dataset

Next, you will load the data off-disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you're new to tf.data, it's a powerful collection of tools for building input pipelines. (Learn more in the tf.data: Build TensorFlow input pipelines guide.)

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by using tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (i.e. 20%):

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a tf.data.Dataset directly to Model.fit.

First, iterate over the dataset and print out a few examples, to get a feel for the data.

Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words Python, CSharp, JavaScript, or Java in the programming question with the word blank.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

The labels are 0, 1, 2 or 3. To check which of these correspond to which string label, you can inspect the class_names property on the dataset:

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Next, you will create a validation and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 reviews from the training set for validation.

Note: When using the validation_split and subset arguments of tf.keras.utils.text_dataset_from_directory, make sure to either specify a random seed or pass shuffle=False, so that the validation and training splits have no overlap.

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

Configure the datasets for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

  • Dataset.cache keeps data in memory after it's loaded off-disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
  • Dataset.prefetch overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the Prefetching section of the Better performance with the tf.data API guide.

raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_val_ds = raw_val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_test_ds = raw_test_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

This first example will keep things simple by integrating the text processing into the model. But you may be able to increase performance by moving the text processing into the dataset pipeline, this is demonstrated in Example 2

Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.

  • Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
  • Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
  • Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. (You can learn more about each of these in the tf.keras.layers.TextVectorization API docs.)

Note that:

  • The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
  • The default tokenizer splits on whitespace (split='whitespace').
  • The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes—like 'binary'—to build bag-of-words models.

You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

  • First, you will use the 'binary' vectorization mode to build a bag-of-words model.
  • Then, you will use the 'int' mode with a 1D ConvNet.
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For the 'int' mode, in addition to maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which will cause the layer to pad or truncate sequences to exactly output_sequence_length values:

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.

Note: It's important to only use your training data when calling TextVectorization.adapt, as using the test set would leak information.

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

Print the result of using these layers to preprocess data:

# Retrieve a batch (of 32 reviews and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question:", first_question)
print("Label:", first_label)

The binary vectorization layer returns a multi-hot vector, with a 1 in the location for each token that was in the input string.

print("'binary' vectorized question:",
      list(binary_vectorize_layer(first_question).numpy()))

plt.plot(binary_vectorize_layer(first_question).numpy())
plt.xlim(0,1000)
print("'int' vectorized question:",
      int_vectorize_layer(first_question).numpy())

As shown above, TextVectorization's 'binary' mode returns an array denoting which tokens exist at least once in the input, while the 'int' mode replaces each token by an integer, thus preserving their order.

You can lookup the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

Train the model

It's time to create your neural network.

For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:

binary_model = tf.keras.Sequential([
    binary_vectorize_layer,
    layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

tf.keras.utils.plot_model(binary_model, show_shapes=True)
bin_history = binary_model.fit(
    raw_train_ds, validation_data=raw_val_ds, epochs=10)

print()

Next, you will use the 'int' vectorized layer to build a 1D ConvNet:

def create_model(vocab_size, num_labels, vectorizer=None):
  my_layers =[]
  if vectorizer is not None:
    my_layers = [vectorizer]

  my_layers.extend([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Dropout(0.5),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])

  model = tf.keras.Sequential(my_layers)
  return model
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4, vectorizer=int_vectorize_layer)

tf.keras.utils.plot_model(int_model, show_shapes=True)

int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
int_history = int_model.fit(raw_train_ds, validation_data=raw_val_ds, epochs=10)
loss = plt.plot(bin_history.epoch, bin_history.history['loss'], label='bin-loss')
plt.plot(bin_history.epoch, bin_history.history['val_loss'], '--', color=loss[0].get_color(), label='bin-val_loss')

loss = plt.plot(int_history.epoch, int_history.history['loss'], label='int-loss')
plt.plot(int_history.epoch, int_history.history['val_loss'], '--', color=loss[0].get_color(), label='int-val_loss')

plt.legend()
plt.xlabel('Epoch')
plt.ylabel('CE/token')

You are nearly ready to train your model.

As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the training, validation, and test sets:

binary_train_ds = raw_train_ds.map(lambda x,y: (binary_vectorize_layer(x), y))
binary_val_ds = raw_val_ds.map(lambda x,y: (binary_vectorize_layer(x), y))
binary_test_ds = raw_test_ds.map(lambda x,y: (binary_vectorize_layer(x), y))

int_train_ds = raw_train_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_val_ds = raw_val_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_test_ds = raw_test_ds.map(lambda x,y: (int_vectorize_layer(x), y))

Export the model

binary_model.export('bin.tf')
loaded = tf.saved_model.load('bin.tf')
binary_model.predict(['How do you sort a list?'])
loaded.serve(tf.constant(['How do you sort a list?'])).numpy()

Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply tf.keras.layers.TextVectorization. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.

Visit the Save and load models tutorial to learn more about saving models.

<a name="example2">

Example 2: Predict the author of Iliad translations

The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

To implement this task you'll use some lower level tools.

  • You'll use tf.data.TextLineDataset to load text-lines from files.
  • You'll implement your own version of keras.layers.TextVectorization using:
    • text.UnicodeScriptTokenizer - to convert strings to tokens.
    • tf.lookup.StaticVocabularyTable - to convert tokens to integer ids.
  • You'll maximize performance by placing the text processing in the dataset pipeline, so it can run in parallel with the model training.

Download and explore the dataset

The texts of the three translations are by:

The text files used in this tutorial have undergone some typical preprocessing tasks like removing document header and footer, line numbers and chapter titles.

Download these lightly munged files locally:

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

Load the dataset

Previously, with tf.keras.utils.text_dataset_from_directory all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

Next, you'll combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

Prepare the dataset for training

Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the lower-level TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text).

Define a function to convert the text to lower-case and tokenize it:

  • TensorFlow Text provides various tokenizers. In this example, you will use the text.UnicodeScriptTokenizer to tokenize the dataset.
  • You will use Dataset.map to apply the tokenization to the dataset.
class MyTokenizer(tf.keras.layers.Layer):
  def __init__(self):
    super().__init__()
    self.tokenizer = tf_text.UnicodeScriptTokenizer()

  def call(self, text):
    lower_case = tf_text.case_fold_utf8(text)
    result = self.tokenizer.tokenize(lower_case)
    # If you pass a batch of strings, it will return a RaggedTensor.
    if isinstance(result, tf.RaggedTensor):
      # Convert to dense 0-padded.
      result = result.to_tensor()
    return result
tokenizer = MyTokenizer()
tokenized_ds = all_labeled_data.map(lambda text, label: (tokenizer(text), label))
tokenized_ds

You can iterate over the dataset and print out a few tokenized examples:

for tokens, label in tokenized_ds.take(1):
  break

print(tokens)
print()
print(label)

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:

tokenized_ds = tokenized_ds.cache().prefetch(tf.data.AUTOTUNE)

vocab_count = collections.Counter()
for toks, labels in tokenized_ds.ragged_batch(1000):
  toks = tf.reshape(toks, [-1])
  for tok in toks.numpy():
    vocab_count[tok] += 1

vocab = [tok for tok, count in vocab_count.most_common(VOCAB_SIZE)]

print("First five vocab entries:", vocab[:5])
print()

To convert the tokens into integers, use the vocab set to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers with 0 reserved for padding, and n+1 reserved to denote an out-of-vocabulary (OOV) token.

class MyVocabTable(tf.keras.layers.Layer):
  def __init__(self, vocab):
    super().__init__()
    self.keys = [''] + vocab
    self.values = range(len(self.keys))

    self.init = tf.lookup.KeyValueTensorInitializer(
        self.keys, self.values, key_dtype=tf.string, value_dtype=tf.int64)

    num_oov_buckets = 1

    self.table = tf.lookup.StaticVocabularyTable(self.init, num_oov_buckets)

  def call(self, x):
    result = self.table.lookup(x)
    return result

Try it on a dummy vocabulary:

vocab_table = MyVocabTable(['a','b','c'])
vocab_table(tf.constant([''] + list('abcdefghi')))

Create one for the real vocabulary:

vocab_table = MyVocabTable(vocab)

Finally, define a layer to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

preprocess_text = tf.keras.Sequential([
    tokenizer,
    vocab_table
])

You can try this on a single example to print the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text = preprocess_text(example_text)
print("Vectorized sentence: ", vectorized_text.numpy())

Now create a dataset pipeline that will process the text on the fly using Dataset.map:

all_encoded_data = all_labeled_data.map(lambda text, labels:(preprocess_text(text), labels))

for ids, label in all_encoded_data.take(1):
  break

print("Ids: ", ids.numpy())
print("Label: ", label.numpy())

Split the dataset into training and test sets

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size—each line of text has a different number of words.

tf.data.Dataset supports splitting and padded-batching datasets:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE).padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.

To illustrate this:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Configure the datasets for better performance as before:

train_data = train_data.prefetch(tf.data.AUTOTUNE)
validation_data = validation_data.prefetch(tf.data.AUTOTUNE)

Train the model

You can train a model on this dataset as before:

Since this text vectorization adds 0 for padding and n+1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:

model = create_model(vocab_size=VOCAB_SIZE+2, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
tf.keras.utils.plot_model(model, show_shapes=True)
history = model.fit(train_data, validation_data=validation_data, epochs=3)
metrics = model.evaluate(validation_data, return_dict=True)

print("Loss: ", metrics['loss'])
print("Accuracy: {:2.2%}".format(metrics['accuracy']))

Export the model

To make the model capable of taking raw strings as input, pack both the text processor and the model into a keras.Sequential:

export_model = tf.keras.Sequential([
    preprocess_text,
    model
])
export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

This model you can run directly on batches of strings:

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = test_ds.cache().prefetch(tf.data.AUTOTUNE)
test_ds
loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Use saved_model.save to export it.

<!-- TODO: After 2.15 switch this to use Model.export -->
tf.saved_model.save(export_model, 'export.tf')
loaded = tf.saved_model.load('export.tf')
export_model(tf.constant(['The field bristled with the long and deadly spears which they bore.'])).numpy()
loaded(tf.constant(['The field bristled with the long and deadly spears which they bore.'])).numpy()

The loss and accuracy for the model on encoded validation set and the exported model on the raw validation set are the same, as expected.

Run inference on new data

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Download more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets.

In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification:

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

You can now preprocess the data and train a model as before.

Note: You will use tf.keras.losses.BinaryCrossentropy instead of tf.keras.losses.SparseCategoricalCrossentropy for your model, since this is a binary classification problem.

Prepare the dataset for training

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
# Configure datasets for performance as before.
train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(tf.data.AUTOTUNE)

Create, configure and train the model

model = create_model(vocab_size=VOCAB_SIZE, num_labels=1)
model.summary()
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Export the model

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional text preprocessing TensorFlow Text tutorials, such as:

You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the guide on building input pipelines.