[D] Rectified Adam (RAdam): a new state-of-the-art optimizer
This blog post discusses a new optimizer built on top of Adam, introduced in this paper by Liyuan Liu et al. Essentially, they ask why a warmup phase helps when scheduling learning rates, and trace the underlying problem to the high variance of Adam's adaptive learning rate during the first few batches, when the second-moment estimate rests on very few gradients and can push the model toward poor solutions. They find that the issue can be remedied either by using a warmup/low initial learning rate, or by turning off the adaptive term for the first few batches, which leaves plain SGD with momentum. As more training examples are fed in, the variance stabilizes and the adaptive learning rate can be phased in. They therefore propose a Rectified Adam optimizer that dynamically scales the adaptive learning rate in a way that hedges against high variance. The author of the blog post tests an implementation in fastai and finds that RAdam works well in many different contexts, well enough to take the top of the leaderboard in the Imagenette mini-competition.
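For anyone who wants to see the rectification concretely, here is a minimal scalar sketch of a single RAdam update as I understand it from the paper. It is not the authors' code: the function names (`rho`, `radam_step`, `minimize_quadratic`), the toy objective, and the `eps` term in the denominator are my own assumptions for illustration, and the threshold of 4 is the one I recall from the paper's algorithm (implementations may use a slightly different cutoff).

```python
import math


def rho(t, beta2):
    """Approximate length of the simple moving average at step t (t >= 1)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    return rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)


def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam update for a scalar parameter.

    Returns the new (theta, m, v) plus a flag saying whether the
    adaptive (rectified) branch was used on this step.
    """
    # Standard Adam moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment

    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho(t, beta2)

    if rho_t > 4.0:
        # Variance is tractable: use the adaptive step, scaled by r_t.
        v_hat = math.sqrt(v / (1.0 - beta2 ** t))
        r_t = math.sqrt(
            ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        # eps is an assumption added here for numerical stability.
        theta = theta - lr * r_t * m_hat / (v_hat + eps)
        return theta, m, v, True

    # Too few samples: fall back to SGD with momentum (no adaptive term).
    theta = theta - lr * m_hat
    return theta, m, v, False


def minimize_quadratic(steps=200, lr=0.01):
    """Toy usage: minimize f(x) = x^2 starting from x = 1."""
    theta, m, v = 1.0, 0.0, 0.0
    flags = []
    for t in range(1, steps + 1):
        grad = 2.0 * theta
        theta, m, v, rectified = radam_step(theta, grad, m, v, t, lr=lr)
        flags.append(rectified)
    return theta, flags
```

With the default `beta2 = 0.999`, the first few steps take the momentum-only branch and the adaptive branch only kicks in once `rho(t, beta2)` passes the threshold, which is the "built-in warmup" behavior the post describes.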
Implementations can be found on the author’s GitHub.