[D] Retrain your models: the Adam optimizer in PyTorch was fixed in version 1.3
I have noticed a small discrepancy between theory and the implementation of AdamW and, more generally, Adam. The epsilon in the denominator of the Adam update should not be scaled by the bias correction (Algorithm 2, L9-12): only the running average of the gradient (m) and of the squared gradients (v) should be scaled by their corresponding bias corrections.
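For reference, this is the textbook update the bias correction is supposed to produce; a minimal scalar sketch, not PyTorch's code, with illustrative names:

```python
import math

# Minimal scalar sketch of one textbook Adam step (Kingma & Ba); names are
# illustrative, not PyTorch's. t is the step count, starting at 1.
def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # running average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    p -= lr * m_hat / (math.sqrt(v_hat) + eps)   # eps is added AFTER the bias correction
    return p, m, v
```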
In the current implementation, the epsilon is also effectively scaled by the square root of bias_correction2 (i.e., it gets divided by sqrt(bias_correction2) along with sqrt(v)). I have plotted this ratio as a function of the step, given beta2 = 0.999 and eps = 1e-8. In the early steps of optimization, this ratio slightly deviates from theory (denoted by the horizontal red line).
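A rough sketch of the numbers behind that plot, assuming the plotted quantity is the effective epsilon divided by the nominal eps: as I read the algebra, the old step lr * sqrt(bias_correction2) / bias_correction1 * m / (sqrt(v) + eps) equals the textbook formula with eps replaced by eps / sqrt(bias_correction2).

```python
import math

beta2, eps = 0.999, 1e-8
for t in [1, 10, 100, 1000, 10000]:
    bias_correction2 = 1 - beta2 ** t
    effective_eps = eps / math.sqrt(bias_correction2)  # eps as the old code effectively uses it
    # ratio approaches 1.0 (the red line) as bias_correction2 approaches 1
    print(f"step {t:>5}: effective_eps / eps = {effective_eps / eps:.2f}")
```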
See more here: https://github.com/pytorch/pytorch/pull/22628