[Discussion] Why not turn the momentum update equation into an exponentially weighted moving average update equation?

In PyTorch, the update equation of SGD with (non-Nesterov) momentum is m[i+1] = β m[i] + g L(w[i+1]), where g denotes the gradient, β is the momentum coefficient, m[i] is the momentum at iteration i, L is the loss function, and w[i] is the value of the weights at iteration i.

If we are starting with m[0] = 0, then for all i > 0, m[i] = sum({ β^j g L(w[i-j]) | j ∈ {0, …, i-1} }).
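As a quick sanity check of the unrolled sum, here is a minimal sketch in plain Python. It treats gradients as scalars and uses a made-up gradient sequence; the function names and the values in `grads` are purely illustrative and not part of any library.

```python
def momentum_recurrence(grads, beta):
    """Run m[i+1] = beta * m[i] + g[i+1] from m[0] = 0 over all gradients."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def momentum_unrolled(grads, beta):
    """Unrolled form: sum of beta**j * g[i-j] for j = 0..i-1, with i = len(grads)."""
    i = len(grads)
    # grads[k] holds the gradient at iteration k+1, so g[i-j] is grads[i-j-1].
    return sum(beta**j * grads[i - j - 1] for j in range(i))


grads = [0.5, -1.0, 2.0, 0.25]  # hypothetical scalar gradients
beta = 0.9

m_rec = momentum_recurrence(grads, beta)
m_sum = momentum_unrolled(grads, beta)
print(m_rec, m_sum)  # the two values should agree up to floating-point error
```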

Now, let’s write down the formula for the exponentially weighted moving average of the gradients (which we’ll denote as a[i]) to show that one is equal to the other multiplied by a constant. We will make the non-traditional assumption that a[0] = 0. This doesn’t matter, because as i goes to infinity, the contribution of the initial value a[0] goes to zero.

a[i+1] = β a[i] + (1 – β) g L(w[i+1]). We can rewrite it as a[i] = (1 – β) sum({ β^j g L(w[i-j]) | j ∈ {0, …, i-1} }).

Notice that ∀ β ∈ [0, 1) it holds that (1 – β) m[i] = a[i].
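The identity (1 – β) m[i] = a[i] can also be checked numerically. The sketch below feeds the same gradient sequence to both recurrences, assuming scalar gradients and m[0] = a[0] = 0. The helper `run_both` and the sample values are hypothetical.

```python
def run_both(grads, beta):
    """Return the lists of m[i] and a[i] for i = 1..len(grads), from m[0] = a[0] = 0."""
    m, a = 0.0, 0.0
    ms, avgs = [], []
    for g in grads:
        m = beta * m + g               # momentum, no (1 - beta) factor
        a = beta * a + (1 - beta) * g  # exponentially weighted moving average
        ms.append(m)
        avgs.append(a)
    return ms, avgs


grads = [0.5, -1.0, 2.0, 0.25, -0.75]  # hypothetical scalar gradients
beta = 0.9

ms, avgs = run_both(grads, beta)
# Largest deviation between (1 - beta) * m[i] and a[i] over all iterations.
max_gap = max(abs((1 - beta) * m - a) for m, a in zip(ms, avgs))
print(max_gap)  # should be zero up to floating-point error
```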

It seems to me that we should change the update equation of momentum SGD to the equation of exponentially weighted moving average of gradients, i.e. add the 1 – β coefficient to the gradient term. Here’s why:

  1. It decouples the learning rate from the momentum coefficient. Currently, a larger momentum coefficient increases the effective learning rate (i.e. how much the weights are updated by). Suppose we are in an ideal scenario where for all iterations i, j we have g L(w[i]) = g L(w[j]) = g L. Then lim_{i→∞} m[i] = g L / (1 – β). For β = 0.9 this value equals 10 g L. For β = 0.99 this value equals 100 g L. In contrast, if we use the exponentially weighted moving average formula, the analogous limit equals just g L for all β. I concede that this is an unrealistic scenario, and that in real problems the gradients at steps i, i+1, i+2, …, i+k somewhat cancel each other out, but I still think it’s a good point.
  2. The weighted moving average is a fairly well-known concept, while momentum is less so.
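The constant-gradient scenario from point 1 is easy to simulate. The sketch below assumes a scalar gradient fixed at 1.0 and runs both recurrences for a hypothetical 5000 steps, enough for β^i to become negligible for both values of β. The function name and defaults are illustrative only.

```python
def steady_state(beta, grad=1.0, steps=5000):
    """Iterate both recurrences with a constant gradient and return (m, a)."""
    m, a = 0.0, 0.0
    for _ in range(steps):
        m = beta * m + grad               # approaches grad / (1 - beta)
        a = beta * a + (1 - beta) * grad  # approaches grad
    return m, a


for beta in (0.9, 0.99):
    m, a = steady_state(beta)
    print(f"beta={beta}: m={m:.4f}, g/(1-beta)={1 / (1 - beta):.4f}, a={a:.4f}")
```

Since (1 – β) m[i] = a[i], a step of size lr along m[i] is the same as a step of size lr / (1 – β) along a[i], so in this setting the two formulations differ only in how the learning rate is scaled.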

I am interested to hear what reasons there are not to change the update formula. And if you think this is a good change, how should the authors of deep learning libraries proceed?

submitted by /u/CrazyCrab