official/nlp/docs/optimization.md
This page describes the optimization package for TensorFlow Official Models (TFM), which includes optimizers and learning rate schedulers.
We use an optimizer factory class to manage optimizer and learning rate creation. OptimizerFactory takes a config as input and provides member functions to build the optimizer and the learning rate schedule. The following example creates an SGD optimizer with a stepwise learning rate schedule and linear warmup through OptimizerFactory:
```python
from official.modeling import optimization

params = {
    'optimizer': {
        'type': 'sgd',
        'sgd': {'momentum': 0.9}
    },
    'learning_rate': {
        'type': 'stepwise',
        'stepwise': {
            'boundaries': [10000, 20000],
            'values': [0.1, 0.01, 0.001]
        }
    },
    'warmup': {
        'type': 'linear',
        'linear': {'warmup_steps': 500,
                   'warmup_learning_rate': 0.01}
    }
}
# Defines the optimization config from a dictionary.
opt_config = optimization.OptimizationConfig(params)
# Initializes an optimizer factory from the optimization config.
opt_factory = optimization.OptimizerFactory(opt_config)
# Builds the desired learning rate scheduling instance.
lr = opt_factory.build_learning_rate()
# Builds the optimizer instance with the desired learning rate schedule.
optimizer = opt_factory.build_optimizer(lr)
```
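Once built, the schedule and the optimizer behave like standard Keras objects. Below is a minimal sketch (reusing the lr and optimizer objects from the example above, with a toy model purely for illustration) showing that the schedule maps a step to a learning rate value and that the optimizer can be passed to model.compile:

```python
import tensorflow as tf

# The learning rate schedule is a callable that maps a step to a value.
for step in (0, 499, 10000, 25000):
  print(step, float(lr(step)))

# The built optimizer is a regular tf.keras optimizer instance.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss='mse')
```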
To initialize an OptimizerFactory, the optimizer and learning_rate fields must
be defined, while warmup is an optional field. The type field selects the type
of each optimization component. The set of available types is explained in
detail in the following sections. As a quick illustration, the sketch below
shows a config that omits the optional warmup field.
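This is a minimal sketch with only the two required fields; the stepwise boundaries and values are illustrative, not recommended settings.

```python
from official.modeling import optimization

# Only the required `optimizer` and `learning_rate` fields are set; the
# optional `warmup` field is omitted. Values are illustrative.
params = {
    'optimizer': {'type': 'sgd', 'sgd': {'momentum': 0.9}},
    'learning_rate': {'type': 'stepwise',
                      'stepwise': {'boundaries': [10000],
                                   'values': [0.1, 0.01]}}
}
opt_factory = optimization.OptimizerFactory(
    optimization.OptimizationConfig(params))
optimizer = opt_factory.build_optimizer(opt_factory.build_learning_rate())
```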
In the following sections, we explain how to create different optimizers, learning rate schedules, and warmup schedules. We also explain how to add new optimizers or learning rate schedulers.
The list of supported optimizers can be found here.
```python
OPTIMIZERS_CLS = {
    'sgd': tf.keras.optimizers.SGD,
    'adam': tf.keras.optimizers.Adam,
    'adamw': nlp_optimization.AdamWeightDecay,
    'lamb': tfa_optimizers.LAMB,
    'rmsprop': tf.keras.optimizers.RMSprop
}
```
You can specify the type of optimizer to be one of the above using the oneof config. The available config fields can be found here.
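For example, switching to the Adam optimizer only requires changing the type and providing the matching sub-config. In the sketch below, the beta_1 and beta_2 field names are assumed to be exposed by the Adam config; check the config definitions if they differ.

```python
# Oneof selection sketch: the `type` field picks the optimizer, and the
# sub-config under the same key holds its hyperparameters. The beta_1 and
# beta_2 field names are assumptions about the Adam config.
params = {'optimizer': {'type': 'adam',
                        'adam': {'beta_1': 0.9, 'beta_2': 0.999}}}
```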
All optimizers support the gradient clipping methods clip by value, clip by norm, and clip by global norm. To specify which method to use, set the appropriate field listed here.
As an example, we specify an RMSprop optimizer with a discounting factor (rho) of 0.9 and global norm gradient clipping of 10.0. Below is the config to be used:
```python
params = {'optimizer': {'type': 'rmsprop',
                        'rmsprop': {'rho': 0.9,
                                    'global_clipnorm': 10.0}}}
```
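As a sketch of the end-to-end flow, the config above can be combined with a learning rate and passed through the factory. The added stepwise schedule is illustrative, and the final check assumes the clipping setting is forwarded to the standard Keras global_clipnorm attribute.

```python
from official.modeling import optimization

# Illustrative end-to-end build of the clipped optimizer. A learning_rate
# field is added because the optimization config requires one.
params = {
    'optimizer': {'type': 'rmsprop',
                  'rmsprop': {'rho': 0.9, 'global_clipnorm': 10.0}},
    'learning_rate': {'type': 'stepwise',
                      'stepwise': {'boundaries': [10000],
                                   'values': [0.1, 0.01]}}
}
opt_factory = optimization.OptimizerFactory(
    optimization.OptimizationConfig(params))
optimizer = opt_factory.build_optimizer(opt_factory.build_learning_rate())
# Expected to print 10.0 if the clipping setting is forwarded as a Keras kwarg.
print(optimizer.global_clipnorm)
```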
To add a new optimizer, you need to do the following:

*   Define a config class with the new optimizer's hyperparameters and add it as a new field in the optimizer oneof config, keyed by the new type name.
*   Register the optimizer class under the same type name, i.e., add it to the OPTIMIZERS_CLS mapping above.
Learning rate with an optional warmup can be configured by specifying the
learning_rate and warmup fields in the optimization config. learning_rate is
a required field, while warmup is an optional one. The list of supported
learning_rate and warmup schedules can be found here.
```python
LR_CLS = {
    'stepwise': tf.keras.optimizers.schedules.PiecewiseConstantDecay,
    'polynomial': tf.keras.optimizers.schedules.PolynomialDecay,
    'exponential': tf.keras.optimizers.schedules.ExponentialDecay,
    'cosine': tf.keras.experimental.CosineDecay,
    'power': lr_schedule.DirectPowerDecay,
}

WARMUP_CLS = {
    'linear': lr_schedule.LinearWarmup,
    'polynomial': lr_schedule.PolynomialWarmUp
}
```
In addition, a constant learning rate can be specified.
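For instance, a constant schedule might be configured as below. The 'constant' type name and its learning_rate field follow the naming pattern of the other schedules and are assumptions here; check the learning rate config definitions if they differ.

```python
# Assumed naming: a 'constant' learning rate type with a `learning_rate`
# field, following the oneof pattern used by the other schedules.
params = {'learning_rate': {'type': 'constant',
                            'constant': {'learning_rate': 0.01}}}
```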
A learning rate schedule takes the current step as input and returns the
learning rate value. As training progresses, the learning rate value usually
decays. A warmup schedule is often used to stabilize training: it starts from a
low learning rate value and gradually increases it until it reaches the initial
value of the regular learning rate decay schedule. We combine the learning_rate
(lr) and warmup (warmup) schedules as follows:
*   learning_rate = warmup(step), for step < warmup_steps
*   learning_rate = lr(step), for step >= warmup_steps
*   learning_rate(warmup_steps) = warmup(warmup_steps)

Note that the warmup schedule doesn't delay the regular learning rate decay by
warmup_steps; instead, it replaces the decay schedule's values during the
warmup period. The learning rate value is logged every summary_interval, so if
warmup_steps is less than summary_interval, you won't be able to see the warmup
values in the logs.
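The composition rule above can be summarized by the following illustrative snippet; this is only a sketch of the rule, not the actual TFM schedule implementation.

```python
# Illustrative only: mirrors the composition rule above.
def combined_learning_rate(step, warmup, lr, warmup_steps):
  """Returns the warmup value before warmup_steps, the decayed value after."""
  return warmup(step) if step < warmup_steps else lr(step)
```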
Suppose we want to specify a cosine learning rate decay with decay_steps of 20000, together with a linear warmup schedule for the first 500 steps. Below is the config to be used:
```python
params = {'learning_rate': {'type': 'cosine',
                            'cosine': {'decay_steps': 20000}},
          'warmup': {'type': 'linear',
                     'linear': {'warmup_steps': 500}}}
```
The optimizer and learning rate are created inside the task. If a different optimizer or learning rate scheduler is needed, it can be defined by overriding the task's create_optimizer class method.
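A rough sketch of such an override is shown below. The class names and the exact create_optimizer signature are assumptions for illustration; adapt them to the task base class you are extending.

```python
from official.core import base_task
from official.modeling import optimization


class MyTask(base_task.Task):
  """Hypothetical task that customizes optimizer creation."""

  @classmethod
  def create_optimizer(cls, optimizer_config, runtime_config=None):
    # Assumed signature; build a custom optimizer/schedule here instead of
    # the default factory behavior if needed.
    opt_factory = optimization.OptimizerFactory(optimizer_config)
    lr = opt_factory.build_learning_rate()
    return opt_factory.build_optimizer(lr)
```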
Note that the total number of training steps (train_steps in the trainer
config) is typically used as decay_steps for cosine learning rate decay.
Changing one without changing the other might result in undesired behavior.