official/nlp/docs/data_processing.md
Open-sourced data processing libraries: `tensorflow_models/official/nlp/data/`

Inside TF-NLP, there are two flexible ways to provide training data to the input pipeline:

1. Using Python scripts (or Beam/Flume) to process and tokenize the data offline.
2. Reading the text data directly from TFDS and using TF.Text for tokenization and preprocessing inside the tf.data input pipeline.
We have implemented data preprocessing for multiple datasets in the following Python scripts:
Then, the processed files containing tf.Example protos should be passed to the
input_path argument of DataConfig.
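For example, a minimal sketch of pointing a task's training data at
pre-processed TFRecord files; the file pattern and batch size are
hypothetical, and the field names follow DataConfig in
`official/core/config_definitions.py`:

```python
from official.core import config_definitions as cfg

# Hypothetical TFRecord pattern; the files contain serialized tf.Example protos.
train_data = cfg.DataConfig(
    input_path='/data/pretrain/train-*.tfrecord',
    global_batch_size=512,
    is_training=True)
```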
For convenience and consolidation, we built a common
input_reader.py
library to standardize input reading, which has built-in support for TFDS.
Specifying the tfds_name, tfds_data_dir, and tfds_split arguments in the
DataConfig will let the tf.data pipeline read from the corresponding dataset
inside TFDS.
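For instance, a sketch of reading directly from TFDS instead of setting
input_path; the dataset name and data directory are hypothetical:

```python
from official.core import config_definitions as cfg

train_data = cfg.DataConfig(
    tfds_name='glue/mrpc',        # hypothetical TFDS dataset
    tfds_split='train',
    tfds_data_dir='/data/tfds',   # hypothetical TFDS data dir
    global_batch_size=32,
    is_training=True)
```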
To manage multiple datasets and processing functions, we defined the DataLoader class to work with the data loader factory.
Each dataloader defines the tf.data input pipeline inside its load method:
```python
@abc.abstractmethod
def load(
    self,
    input_context: Optional[tf.distribute.InputContext] = None
) -> tf.data.Dataset:
  """Returns a tf.data.Dataset."""
```
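For reference, here is a minimal sketch of how a concrete dataloader plugs
into the data loader factory. The config and loader names are hypothetical;
the factory API follows `official/nlp/data/data_loader_factory.py`:

```python
import dataclasses

import tensorflow as tf
from official.core import config_definitions as cfg
from official.nlp.data import data_loader
from official.nlp.data import data_loader_factory


@dataclasses.dataclass
class MyDataConfig(cfg.DataConfig):
  seq_length: int = 128  # hypothetical extra field


@data_loader_factory.register_data_loader_cls(MyDataConfig)
class MyDataLoader(data_loader.DataLoader):
  """Registered so the factory can look it up from the config class."""

  def __init__(self, params: MyDataConfig):
    self._params = params

  def load(self, input_context=None):
    # A trivial pipeline just to keep the sketch runnable.
    return tf.data.Dataset.from_tensor_slices([0]).batch(1)


# The factory picks the loader class that matches the config type:
loader = data_loader_factory.get_data_loader(MyDataConfig())
```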
Then, the load method is called inside each NLP task's build_inputs method,
and the trainer wraps it to create distributed datasets:
```python
def build_inputs(self, params, input_context=None):
  """Returns tf.data.Dataset for pretraining."""
  data_loader = YourDataLoader(params)
  return data_loader.load(input_context)
```
By default, in the example above, params is the train_data or
validation_data field of the task field of the experiment config, and it is
an instance of DataConfig.
It is important to note that, for TPU training, the entire load method runs
on the TPU workers, so the function must not access outside resources, e.g.
the task attributes.
To work with raw text features, we need DataLoaders that handle the text
data with TF.Text. You can take the following dataloaders as references:
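As a rough illustration of what such a loader does, here is a minimal sketch
that tokenizes raw TFDS text with TF.Text inside the tf.data pipeline; the
dataset name, vocab file, and feature keys are hypothetical:

```python
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text


class RawTextDataLoader:
  """Tokenizes raw text with TF.Text inside the tf.data pipeline."""

  def __init__(self, vocab_file: str, batch_size: int, seq_length: int):
    self._tokenizer = tf_text.BertTokenizer(vocab_file, lower_case=True)
    self._batch_size = batch_size
    self._seq_length = seq_length

  def _preprocess(self, example):
    # tokenize() returns a RaggedTensor of shape [batch, words, wordpieces];
    # merge the last two dims to get one wordpiece sequence per example.
    token_ids = self._tokenizer.tokenize(example['sentence'])
    token_ids = token_ids.merge_dims(-2, -1)[:, :self._seq_length]
    return {
        'input_word_ids': token_ids.to_tensor(shape=[None, self._seq_length]),
        'label': example['label'],
    }

  def load(self, input_context=None):
    dataset = tfds.load('glue/sst2', split='train')  # hypothetical dataset
    if input_context:  # shard across hosts for multi-worker training
      dataset = dataset.shard(input_context.num_input_pipelines,
                              input_context.input_pipeline_id)
    dataset = dataset.batch(self._batch_size, drop_remainder=True)
    return dataset.map(
        self._preprocess,
        num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
```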
With TF 2.x, we can enable some types of dynamic shapes on TPUs, thanks to the TF 2.x programming model and the TPUStrategy/XLA work.
Depending on the data distribution, we see a 50% to 90% speedup on typical text data for BERT pretraining applications relative to padded, static-shape inputs.
To enable dynamic sequence lengths, we need to use
the tf.data service for global bucketing over
sequences. To enable it, you can simply add --enable_tf_data_service when you
start experiments.
To pair with the tf.data service, we need to use dataloaders that have the bucketing function implemented. You can take the following dataloaders as references:
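As a rough sketch, the bucketing function these dataloaders implement boils
down to per-bucket batching by sequence length; the boundaries, batch sizes,
and feature key below are hypothetical:

```python
import tensorflow as tf


def bucketize(dataset: tf.data.Dataset) -> tf.data.Dataset:
  """Batches examples with similar sequence lengths together."""
  return dataset.bucket_by_sequence_length(
      element_length_func=lambda ex: tf.shape(ex['input_word_ids'])[0],
      bucket_boundaries=[65, 129, 257],        # hypothetical boundaries
      bucket_batch_sizes=[512, 256, 128, 64])  # one per bucket + overflow
```

With the tf.data service enabled, this bucketing is applied globally over
examples gathered from all hosts, so each batch has a much tighter length
distribution than per-host bucketing would give.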