official/vision/docs/faq.md
This user guide walks through how to train and fine-tune models and perform hyperparameter tuning in TF-Vision. For each model/task supported in TF-Vision, please refer to the corresponding tutorial for more detailed instructions.
Available models under TF-Vision: There is a good collection of models available in TF-Vision for various vision tasks: image classification, object detection, video classification, semantic segmentation, and instance segmentation. Please check this page to learn more about the available models. We will keep adding support for new models, and your suggestions are appreciated.
Fine-tune from a checkpoint: TF-Vision supports loading pretrained checkpoints for fine-tuning. This can be done simply by specifying `task.init_checkpoint` and `task.init_checkpoint_modules` in the task configuration. The value of `task.init_checkpoint_modules` depends on how the pretrained modules are implemented; in general it can be `all`, `backbone`, and/or `decoder` (for detection and segmentation). If set to `all`, all weights from the checkpoint will be loaded. If set to `backbone`, only the weights in the backbone component will be loaded and the other weights will be initialized from scratch. An example yaml file can be found here.
Let's use a concrete example for elaboration. Suppose the requirements are:
1.  Train a classification model with 10 classes.
2.  Save a checkpoint of the model from step 1, keeping only the backbone (everything before the last Conv2D + softmax).
3.  Use the checkpoint from step 2 to train a new classification model with 4 novel classes.
For step 2, the model needs to specify the components to be saved in the checkpoint in the `checkpoint_items`. For step 3, you can specify the `init_checkpoint` and `init_checkpoint_modules = 'backbone'`. The new model with 4 classes will then initialize only the backbone, so that you can fine-tune the head. In this example, the backbone is everything before the global average pooling layer of the classification model. A minimal YAML sketch is shown below.
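For illustration, the configuration for step 3 might look like the following sketch. The checkpoint path is a placeholder, and the exact field layout should be verified against your task's config definition:

```yaml
task:
  init_checkpoint: 'path/to/10_class_model/ckpt-30000'  # placeholder checkpoint path
  init_checkpoint_modules: 'backbone'  # load only the backbone; the head trains from scratch
  model:
    num_classes: 4  # the 4 novel classes
```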
Export SavedModel for serving: To export any trained TF 2.x model, including `tf.keras.Model` and the plain `tf.Module`, we use the `tf.saved_model.save()` API. Our exporting library offers functionality to export a SavedModel for CPU/GPU/TPU serving. Moreover, the exported SavedModel can be further converted to a TFLite model for on-device inference.
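As a sketch, converting an exported SavedModel to TFLite uses the standard converter API; the paths below are placeholders:

```python
import tensorflow as tf

# Load the exported SavedModel (placeholder path) and convert it to a
# TFLite flatbuffer for on-device inference.
converter = tf.lite.TFLiteConverter.from_saved_model('/tmp/exported_model')
tflite_model = converter.convert()

with open('/tmp/model.tflite', 'wb') as f:
  f.write(tflite_model)
```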
The TF-Vision modeling library provides a collection of baselines and checkpoints for various vision tasks, including image classification, object detection, video classification, and segmentation. The supported pretrained models and corresponding config files can be found here. Since we are actively developing new models, we also recommend checking our repository for anything that has been added but not yet reflected in the documentation.
We have provided an example project to demonstrate how to use TF Model Garden's building blocks to implement a new vision project from scratch. All the internal/external projects built on top of TFM can be found here for reference.
Set `log_model_flops_and_params` to `True` when exporting a SavedModel to log the model's parameters and FLOPs, as shown here.
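For example, the flag can be passed when exporting; this is an assumed usage, so verify the parameter against the actual signature of `export_saved_model_lib.export_inference_graph`:

```python
# Assumed usage; verify against export_saved_model_lib's actual signature.
export_saved_model_lib.export_inference_graph(
    input_type='image_tensor',
    batch_size=1,
    input_image_size=[224, 224],
    params=exp_config,
    checkpoint_path=tf.train.latest_checkpoint(model_dir),
    export_dir=export_dir,
    log_model_flops_and_params=True)  # log params and FLOPs during export
```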
`regenerate_source_id` adds some extra computation but rarely creates a bottleneck. You can run a quick proof of concept to check whether the input pipeline is the bottleneck or not.
All the pretrained models are trained with well-structured input pipelines defined in data loaders, which typically include normalization and augmentation. The normalization approach is task dependent, and we recommend checking each task's corresponding input pipeline for confirmation. For example, mean and standard-deviation normalization is applied for classification tasks by default.
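As a sketch of that default, classification preprocessing typically subtracts a per-channel mean and divides by a per-channel standard deviation; the constants below are the commonly used ImageNet statistics and should be verified against the task's data loader:

```python
import tensorflow as tf

# Commonly used ImageNet per-channel statistics on a 0-255 scale (verify
# against the task's preprocessing ops before relying on them).
MEAN_RGB = (0.485 * 255, 0.456 * 255, 0.406 * 255)
STDDEV_RGB = (0.229 * 255, 0.224 * 255, 0.225 * 255)

def normalize_image(image: tf.Tensor) -> tf.Tensor:
  """Applies mean/std normalization per channel."""
  image = tf.cast(image, tf.float32)
  image -= tf.constant(MEAN_RGB, shape=[1, 1, 3], dtype=image.dtype)
  image /= tf.constant(STDDEV_RGB, shape=[1, 1, 3], dtype=image.dtype)
  return image
```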
Here are the general steps to write a summary:

*   The `save_summary` argument of `run_experiment` controls whether or not to write a summary to the folder [ref].
*   The `eval_summary_manager` is used to write the summary [ref].
The default `eval_summary_manager` only writes scalar summaries. We have added support for writing image summaries that show predicted bounding boxes for the RetinaNet task, and this can be adapted to write other types of summaries. Here are the steps:
1.  We have created a custom summary manager that can write image summaries [ref].
2.  We optionally build this summary manager if the corresponding task supports writing such summaries [ref], and pass it into the trainer as `eval_summary_manager` [ref].
3.  In the task, we collect the necessary predictions in `validation_step` [ref], update them in `aggregate_logs` [ref], and add visualizations to the returned logs in `reduce_aggregated_logs` [ref], so that the summary manager can identify this information and write it to the summary.
4.  We also need to set `allow_image_summary` to `True` in the task config to enable this [ref].
Check if there are any large intermediate tensors or objects that are still alive from the previous inference. If any Python variables refer to those tensors, delete them. You can also import `gc` and run garbage collection with `gc.collect()`.
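A minimal sketch, where `large_result` is a hypothetical stand-in for a tensor kept alive from a previous inference:

```python
import gc

import numpy as np

large_result = np.zeros((8192, 8192), dtype=np.float32)  # stand-in for a large tensor
# ... previous inference used large_result ...

# Drop the last Python reference, then force a collection pass so the
# backing memory can be reclaimed before the next inference.
del large_result
gc.collect()
```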
To do this, you will need to create a custom trainer. To include the individual training losses, create a `Mean` metric for each loss and propagate the loss values to these metrics during the train step. Whether this suffices depends on whether you need to run these metrics on CPU; if not, you can follow the Mask R-CNN approach: define losses reference and define metrics reference. Individual training losses will show up in TensorBoard if they are added to the returned logs. Average precision is reported in `reduce_aggregated_logs`. A sketch of the per-loss metric pattern is shown below.
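A minimal sketch of the per-loss `Mean` metric pattern; the loss names are hypothetical:

```python
import tensorflow as tf

# One Mean metric per individual loss (names are hypothetical).
loss_metrics = {
    'cls_loss': tf.keras.metrics.Mean('cls_loss'),
    'box_loss': tf.keras.metrics.Mean('box_loss'),
}

def update_loss_logs(losses):
  """Propagates per-loss values into their metrics during a train step and
  returns them as logs so they appear in TensorBoard."""
  for name, value in losses.items():
    loss_metrics[name].update_state(value)
  return {name: metric.result() for name, metric in loss_metrics.items()}
```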
You can call `tf.config.run_functions_eagerly(True)` in the main function to enable eager mode. Refer to the code here.
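A minimal placement sketch:

```python
import tensorflow as tf

# Make tf.function-decorated code run eagerly so breakpoints and prints
# inside it take effect; call this before building or running the model.
tf.config.run_functions_eagerly(True)
```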
Please find the experiment config for single-task training and multi-task evaluation here.
If your previous training completed at 30k steps and you want to train for an additional 10k, there are the following options:

*   Set `init_checkpoint` to the last saved checkpoint.
*   Set `model_dir` to the training directory.

Be careful with the optimizer config: after you modify the training steps, the learning-rate curve will change. Also, if you start the training in the same model directory, you will lose the checkpoints from the previous training run, since only the last 5 are kept. So if you plan to experiment with fine-tuning, it is suggested to start a new run.
Check out these configs for storing the best checkpoint.
The prerequisite is to configure `MultiWorkerMirroredStrategy`. `tf.distribute.MultiWorkerMirroredStrategy` implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. It creates copies of all of the model's variables on each device across all workers. Please follow the guidelines here; a minimal sketch is shown below.
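A minimal sketch; cluster membership comes from the `TF_CONFIG` environment variable on each worker, and the model here is a trivial placeholder:

```python
import tensorflow as tf

# Synchronous data-parallel training across workers; each worker reads its
# cluster role from the TF_CONFIG environment variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  # Variables created under the scope are replicated on every device.
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
```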
When the task in the config YAML file uses `MultiEvalExperimentConfig`, the eval jobs may fail when loading the model. This is not expected behavior. The cause is that the eval jobs read the model architecture not from the task config but from an `eval_task` copy used to reconstruct the model. To address this issue, refrain from using the `eval_task` model configurations. The model should be constructed from the task: the MultiTaskEvaluator class takes the eval data tasks, and the model should be created here.
Users can add a `training` argument to the `build_losses()` method. Since `build_losses` is invoked from either `train_step` or `validation_step`, you can pass the correct `training` value from each step, as sketched below.
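A minimal sketch of the pattern; the class and the loss choice are hypothetical, not the actual task implementation:

```python
import tensorflow as tf

class MyTask:
  """Hypothetical task sketch showing how `training` flows into the loss."""

  def build_losses(self, labels, model_outputs, training=True):
    # Hypothetical choice: apply label smoothing only while training.
    return tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(
            labels, model_outputs, from_logits=True,
            label_smoothing=0.1 if training else 0.0))

  def train_step(self, labels, model_outputs):
    return self.build_losses(labels, model_outputs, training=True)

  def validation_step(self, labels, model_outputs):
    return self.build_losses(labels, model_outputs, training=False)
```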
We have implementations that support sampling from multiple training datasets for all major tasks, such as classification, RetinaNet, Mask R-CNN, and segmentation. The `create_combine_fn` of input_reader.py creates and returns a `combine_fn` for dataset mixing, and it is called in the `build_inputs` method of the respective tasks.
Refer to the sample config below:

```yaml
train_data:
  input_path:
    d1: 'train1*'
    d2: 'train2*'
  weights:
    d1: 0.8
    d2: 0.2
```
You can add gradient-magnitude logging to your metric logs in the task class as a new dictionary key-value pair. Please refer to the Image Classification Task; there you can obtain the pair of gradients and trainable variables (`grads`, `tvars`), compute the gradient magnitude, and update the metric logs. It will then be processed in the summary manager's `write_summaries` method. A sketch is shown below.
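A minimal sketch; `tf.linalg.global_norm` computes the global norm over the gradient list, and the log key name is our own choice:

```python
import tensorflow as tf

def add_gradient_magnitude(logs, grads):
  """Adds the global gradient norm to the train-step logs as a new entry."""
  logs['gradient_global_norm'] = tf.linalg.global_norm(grads)
  return logs

# Usage inside train_step, after: grads = tape.gradient(loss, tvars)
# logs = add_gradient_magnitude(logs, grads)
```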
Add `prefetch_buffer_size` to the config file. A known issue exists with the auto-tuning of `prefetch_buffer_size`, so you might consider setting a suitable value explicitly instead. Learn more about `prefetch_buffer_size` here.
Refer to the example below:

```yaml
train_data:
  global_batch_size: 4096
  dtype: 'bfloat16'
  prefetch_buffer_size: 8
  input_path: 'Input Path'
validation_data:
  global_batch_size: 32
  ...
```
You can set `input_image_size` to `[None, None]` if the model itself can be built with an arbitrary image size. Refer to the example below:
```python
export_saved_model_lib.export_inference_graph(
    input_type='image_tensor',
    batch_size=1,
    input_image_size=[None, None],
    params=exp_config,
    checkpoint_path=tf.train.latest_checkpoint(model_dir),
    export_dir=export_dir)
```
The number of images seen during training is `train_steps * global_batch_size`. The relationship between `global_batch_size` and `train_steps` can be explained as follows:
```python
train_epochs = 400
train_steps = math.floor(train_epochs * num_train_examples / train_data.global_batch_size)
# steps_per_loop = steps_per_epoch
steps_per_epoch = math.floor(num_train_examples / train_data.global_batch_size)
validation_steps = math.floor(num_val_examples / validation_data.global_batch_size)
# Number of training steps to run between evaluations.
validation_interval = steps_per_epoch
```
This assumes a training dataset with `num_train_examples` images and a validation dataset with `num_val_examples` images; `train_epochs` is a hyperparameter that you need to choose. `train_steps` then depends on `train_epochs` and `num_train_examples`, as in the worked example below.
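A worked example with hypothetical numbers:

```python
import math

num_train_examples = 1_024_000  # hypothetical dataset size
global_batch_size = 4096
train_epochs = 400

steps_per_epoch = math.floor(num_train_examples / global_batch_size)  # 250
train_steps = math.floor(train_epochs * num_train_examples / global_batch_size)  # 100000
```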
Early stopping is not currently integrated into the Model Garden. An alternative approach is to set up the training pipeline to export the best model based on your specified criteria: the NewBestMetric class keeps track of the best metric value seen so far. You can then train for an ample duration and, if signs of overfitting become apparent, halt the run. That works well for one-off experiments.
The `best_checkpoint_eval_metric` attribute of config_definition can be used for exporting the best checkpoint; it specifies the evaluation metric the trainer should monitor. Refer to the YAML file. A sketch of the relevant trainer fields is shown below.
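A YAML sketch of the relevant trainer fields; the metric name is task dependent, and the field set should be verified against config_definition:

```yaml
trainer:
  best_checkpoint_export_subdir: 'best_ckpt'   # subdirectory for the best checkpoint
  best_checkpoint_eval_metric: 'accuracy'      # evaluation metric to monitor (task dependent)
  best_checkpoint_metric_comp: 'higher'        # 'higher' means a larger metric is better
```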
| Acronym | Meaning |
|---|---|
| TFM | TensorFlow Models |
| FAQs | Frequently Asked Questions |
| YAQ | Yet Another Question |
| TF | TensorFlow |