docs/basic/optimizer_on_angel_en.md
There are many optimization methods in machine learning, but in big-data use cases the most commonly used ones are variants of SGD. Currently, Angel implements only a small number of optimization methods:
The update rule of SGD is:

$$x_{t+1} = x_t - \eta\, \nabla f(x_t)$$

in which $\eta$ is the learning rate. SGD supports regularization; when L1 regularization is used, the exact optimization performed is PGD (proximal gradient descent).
There are two ways to represent the SGD optimizer in JSON. The short form:

```json
"optimizer": "sgd"
```

The full form, with optional regularization coefficients:

```json
"optimizer": {
  "type": "sgd",
  "reg1": 0.01,
  "reg2": 0.02
}
```
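The SGD step can be sketched in a few lines of Python. This is only an illustration of the formula above, not Angel's implementation (which runs on parameter servers):

```python
# Sketch of the SGD update x <- x - eta * grad (illustrative only).

def sgd_step(x, grad, eta=0.1):
    """Apply one SGD step to parameter vector x given its gradient."""
    return [xi - eta * gi for xi, gi in zip(x, grad)]
```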
The update rule of Momentum is:

$$\begin{aligned}
v_t &= \gamma\, v_{t-1} + \eta\, \nabla f(x_t) \\
x_{t+1} &= x_t - v_t
\end{aligned}$$

in which $\gamma$ is the momentum factor and $\eta$ is the learning rate. Momentum supports L2 regularization. Momentum is the default optimization method in Angel.
There are two ways to represent Momentum in JSON. The short form:

```json
"optimizer": "momentum"
```

The full form:

```json
"optimizer": {
  "type": "momentum",
  "momentum": 0.9,
  "reg2": 0.01
}
```
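A minimal sketch of the Momentum step, assuming the velocity form $v \leftarrow \gamma v + \eta g$, $x \leftarrow x - v$ (illustrative only, not Angel's code):

```python
# Sketch of the Momentum update: v <- gamma*v + eta*grad ; x <- x - v.
# gamma is the momentum factor, eta the learning rate.

def momentum_step(x, v, grad, eta=0.1, gamma=0.9):
    v_new = [gamma * vi + eta * gi for vi, gi in zip(v, grad)]
    x_new = [xi - vi for xi, vi in zip(x, v_new)]
    return x_new, v_new
```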
The update rule of AdaGrad (referring to the exponentially smoothed version, i.e. RMSprop) is:

$$\begin{aligned}
n_t &= \beta\, n_{t-1} + (1-\beta)\, g_t^2 \\
x_{t+1} &= x_t - \frac{\eta}{\sqrt{n_t} + \epsilon}\, g_t
\end{aligned}$$

in which $g_t = \nabla f(x_t)$, $\beta$ is the smoothing factor and $\eta$ is the learning rate. AdaGrad also supports regularization.
There are two ways to represent AdaGrad in JSON. The short form:

```json
"optimizer": "adagrad"
```

The full form:

```json
"optimizer": {
  "type": "adagrad",
  "beta": 0.9,
  "reg1": 0.01,
  "reg2": 0.01
}
```
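A sketch of the exponentially smoothed AdaGrad (RMSprop-style) step, matching the update rule above (illustrative only):

```python
import math

# Sketch of the smoothed-AdaGrad update. beta is the smoothing factor
# for the running mean of squared gradients; eps avoids division by zero.

def adagrad_step(x, n, grad, eta=0.1, beta=0.9, eps=1e-8):
    n_new = [beta * ni + (1 - beta) * gi * gi for ni, gi in zip(n, grad)]
    x_new = [xi - eta * gi / (math.sqrt(ni) + eps)
             for xi, ni, gi in zip(x, n_new, grad)]
    return x_new, n_new
```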
The update rule of AdaDelta is:

$$\begin{aligned}
E[g^2]_t &= \alpha\, E[g^2]_{t-1} + (1-\alpha)\, g_t^2 \\
\Delta x_t &= -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t \\
E[\Delta x^2]_t &= \beta\, E[\Delta x^2]_{t-1} + (1-\beta)\, \Delta x_t^2 \\
x_{t+1} &= x_t + \Delta x_t
\end{aligned}$$

in which $\alpha$ and $\beta$ are smoothing factors. AdaDelta also supports regularization.
There are two ways to represent AdaDelta in JSON. The short form:

```json
"optimizer": "adadelta"
```

The full form:

```json
"optimizer": {
  "type": "adadelta",
  "alpha": 0.9,
  "beta": 0.9,
  "reg1": 0.01,
  "reg2": 0.01
}
```
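The AdaDelta step can be sketched as follows; note that there is no explicit learning rate, the step size comes from the ratio of the two running averages (illustrative only):

```python
import math

# Sketch of the AdaDelta update. alpha smooths the squared gradients,
# beta smooths the squared parameter deltas.

def adadelta_step(x, eg2, edx2, grad, alpha=0.9, beta=0.9, eps=1e-6):
    eg2_new = [alpha * e + (1 - alpha) * g * g for e, g in zip(eg2, grad)]
    dx = [-math.sqrt(ed + eps) / math.sqrt(eg + eps) * g
          for ed, eg, g in zip(edx2, eg2_new, grad)]
    edx2_new = [beta * ed + (1 - beta) * d * d for ed, d in zip(edx2, dx)]
    x_new = [xi + d for xi, d in zip(x, dx)]
    return x_new, eg2_new, edx2_new
```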
Adam is generally a more effective optimization method. Its update rule is:

$$\begin{aligned}
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t \\
v_t &= \gamma\, v_{t-1} + (1-\gamma)\, g_t^2 \\
x_{t+1} &= x_t - \eta\, \frac{\sqrt{1-\gamma^t}}{1-\beta^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}$$

in which $m_t$ is the exponential moving average of the gradient $g_t$ with smoothing factor $\beta$, i.e. the momentum, and $v_t$ is the exponential moving average of $g_t^2$ with smoothing factor $\gamma$, which can be regarded as a diagonal approximation of the Hessian. By default, $\beta = 0.9$ and $\gamma = 0.99$.
Let

$$\phi(t) = \frac{\sqrt{1-\gamma^t}}{1-\beta^t},$$

then $\phi(t)$ is a function with initial value 1 and limit 1, which first decreases and then increases. This means the effective learning rate is reduced in the initial stage of optimization, where the gradient is relatively large, so as to smooth the descent; in the final stage, where the gradient is very small, the effective learning rate rises again, which helps the iterate jump out of local minima.
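A quick numerical check of this shape, using the default $\beta = 0.9$, $\gamma = 0.99$:

```python
import math

# phi(t) = sqrt(1 - gamma^t) / (1 - beta^t): starts at 1, dips to a
# minimum (around t ~ 12 for these defaults), then returns toward 1.

def phi(t, beta=0.9, gamma=0.99):
    return math.sqrt(1 - gamma ** t) / (1 - beta ** t)

print(round(phi(1), 2))     # 1.0
print(round(phi(12), 2))    # 0.47, near the minimum
print(round(phi(1000), 2))  # 1.0 again
```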
There are two ways to represent Adam in JSON. The short form:

```json
"optimizer": "adam"
```

The full form:

```json
"optimizer": {
  "type": "adam",
  "beta": 0.9,
  "gamma": 0.99,
  "reg2": 0.01
}
```
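One Adam step, with the bias-corrected step size $\eta\sqrt{1-\gamma^t}/(1-\beta^t)$ from the formula above, can be sketched as (illustrative only, not Angel's implementation):

```python
import math

# Sketch of one Adam step. m tracks the gradient, v the squared gradient;
# t is the 1-based iteration count used in the bias correction.

def adam_step(x, m, v, grad, t, eta=0.1, beta=0.9, gamma=0.99, eps=1e-8):
    m_new = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, grad)]
    v_new = [gamma * vi + (1 - gamma) * gi * gi for vi, gi in zip(v, grad)]
    step = eta * math.sqrt(1 - gamma ** t) / (1 - beta ** t)
    x_new = [xi - step * mi / (math.sqrt(vi) + eps)
             for xi, mi, vi in zip(x, m_new, v_new)]
    return x_new, m_new, v_new
```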
FTRL is an online learning algorithm whose goal is to optimize the regret bound. It is proven effective under a specific learning-rate decay condition.

Another characteristic that distinguishes FTRL from PGD (proximal gradient descent) and other online learning methods such as FOBOS and RDA is that it produces very sparse solutions. The per-coordinate updates of FTRL (the FTRL-Proximal form, whose parameters match the JSON below) are:

$$\begin{aligned}
\sigma_i &= \frac{\sqrt{n_i + g_i^2} - \sqrt{n_i}}{\alpha} \\
z_i &\leftarrow z_i + g_i - \sigma_i w_i \\
n_i &\leftarrow n_i + g_i^2 \\
w_i &= \begin{cases}
0 & \text{if } |z_i| \le \lambda_1 \\[4pt]
-\dfrac{z_i - \operatorname{sgn}(z_i)\,\lambda_1}{(\beta + \sqrt{n_i})/\alpha + \lambda_2} & \text{otherwise}
\end{cases}
\end{aligned}$$
There are two ways to represent FTRL in JSON. The short form:

```json
"optimizer": "ftrl"
```

The full form:

```json
"optimizer": {
  "type": "ftrl",
  "alpha": 0.1,
  "beta": 1.0,
  "reg1": 0.01,
  "reg2": 0.01
}
```
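A per-coordinate sketch of the FTRL-Proximal update (parameter names match the JSON above; illustrative, not Angel's code). Note how the L1 threshold $\lambda_1$ maps coordinates with small $|z_i|$ to exactly zero, which is where the sparsity comes from:

```python
import math

# Per-coordinate FTRL-Proximal update (sketch). alpha/beta control the
# per-coordinate learning rate; lambda1/lambda2 are the L1/L2 coefficients
# ("reg1"/"reg2" in the JSON config).

def ftrl_update(z, n, w, grad, alpha=0.1, beta=1.0, lambda1=0.01, lambda2=0.01):
    z_new, n_new, w_new = [], [], []
    for zi, ni, wi, gi in zip(z, n, w, grad):
        sigma = (math.sqrt(ni + gi * gi) - math.sqrt(ni)) / alpha
        zi = zi + gi - sigma * wi
        ni = ni + gi * gi
        if abs(zi) <= lambda1:
            wi_next = 0.0  # L1 threshold: exact zeros give the sparse solution
        else:
            wi_next = -(zi - math.copysign(lambda1, zi)) / (
                (beta + math.sqrt(ni)) / alpha + lambda2)
        z_new.append(zi)
        n_new.append(ni)
        w_new.append(wi_next)
    return z_new, n_new, w_new
```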
Note: $\lambda_1$ and $\lambda_2$ are the L1 and L2 regularization coefficients, corresponding to "reg1" and "reg2" in JSON.
Some reference experiences:

- For Wide & Deep learning, in principle, make sure that the convergence speeds of the wide side and the deep side do not differ too much.
- The FTRL optimizer is designed for online learning. The magnitude of each of its updates is very small, to keep the model robust. In online learning, the data is fed row by row or minibatch by minibatch, so it is not reasonable for a small amount of data to change the model too much. Therefore, the batch size should not be too large when using FTRL, preferably less than 10,000.
- The convergence rate of the different optimizers: FTRL < SGD < Momentum < AdaGrad ≈ AdaDelta < Adam.
- Optimizers that use a diagonal approximation of the Hessian (AdaGrad, AdaDelta, Adam, etc.) allow larger batch sizes, which guarantees the accuracy of the gradient and of the Hessian diagonal estimate. Simpler first-order optimizers such as FTRL, SGD and Momentum require more iterations, so their batch size should not be too large. Therefore: BatchSize(FTRL) < BatchSize(SGD) < BatchSize(Momentum) < BatchSize(AdaGrad) ≈ BatchSize(AdaDelta) < BatchSize(Adam).
- For the learning rate, you can start from 1.0 and then increase or decrease it exponentially (with 2 or 0.5 as the base). You can also use the learning curve for early stopping, under these principles: SGD and Momentum can use a relatively large learning rate, while AdaGrad, AdaDelta and Adam are more sensitive to the learning rate and should generally use a smaller one (you can start from half of the learning rate of SGD or Momentum).
- Regarding learning-rate decay, it should not be too large when there are few epochs. Use standard decay in general, and WarmRestarts for AdaGrad, AdaDelta and Adam.
- Regarding regularization, FTRL, SGD, AdaGrad and AdaDelta currently support L1/L2 regularization, while Momentum and Adam only support L2 regularization. We recommend training without regularization first and adding it afterwards.
Please refer to the optimizer documentation for the derivation with L1 regularization.