# [Discussion] Why not turn the momentum update equation into an exponentially weighted moving average update equation?

In PyTorch, the update equation of SGD with (non-Nesterov) momentum is m^{[i+1]} = β m^{[i]} + g L(w^{[i+1]}), where g denotes the gradient, β is the momentum coefficient, m^{[i]} is the momentum at iteration i, L is the loss function, and w^{[i]} is the value of the weights at iteration i.

If we are starting with m^{[0]} = 0, then for all i > 0 m^{[i]} = sum({ β^{j} g L(w^{[i-j]}) | j∈{0, …, i-1} }).
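
The unrolled sum above can be checked numerically. The following is a minimal sketch in plain Python, assuming scalar gradients; the function names and the gradient values are made up for illustration and are not taken from PyTorch.

```python
def momentum_recursive(grads, beta):
    """Apply m <- beta * m + g for each gradient in order, starting from m = 0."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def momentum_closed_form(grads, beta):
    """Sum beta^j * g L(w^{[i-j]}) for j = 0 .. i-1.

    grads[k] is assumed to hold the gradient at w^{[k+1]}, so the gradient
    at w^{[i-j]} is grads[i - 1 - j].
    """
    i = len(grads)
    return sum(beta ** j * grads[i - 1 - j] for j in range(i))


grads = [0.5, -1.0, 2.0, 0.25]  # made-up gradients at w^{[1]} .. w^{[4]}
beta = 0.9

# The two values should agree up to floating-point rounding.
print(momentum_recursive(grads, beta), momentum_closed_form(grads, beta))
```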

Now, let’s write down the formula for the exponentially weighted moving average of gradients (which we’ll denote as a^{[i]}) to show that one is equivalent to the other multiplied by a constant. We will make the non-traditional assumption that a^{[0]} = 0. This doesn’t matter, because as i goes to infinity, the contribution of the initial value goes to zero.

a^{[i+1]} = β a^{[i]} + (1 – β) g L(w^{[i+1]})

We can rewrite it as a^{[i]} = (1 – β) sum({ β^{j} g L(w^{[i-j]}) | j ∈ {0, …, i-1} }).

Notice that ∀ β ∈ [0, 1) it holds that (1 – β) m^{[i]} = a^{[i]}.
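
To make the identity concrete, here is a small sketch that runs both recursions side by side and compares (1 – β) m^{[i]} with a^{[i]} at every step. It assumes scalar gradients and m^{[0]} = a^{[0]} = 0; `run_both` and the gradient values are hypothetical, chosen only for illustration.

```python
def run_both(grads, beta):
    """Return the list of (m, a) pairs after each gradient is applied."""
    m = 0.0  # momentum buffer, m^{[0]} = 0
    a = 0.0  # exponentially weighted moving average, a^{[0]} = 0
    history = []
    for g in grads:
        m = beta * m + g                  # momentum update
        a = beta * a + (1.0 - beta) * g   # EWMA update
        history.append((m, a))
    return history


sample_grads = [1.0, -0.3, 0.7, 2.5, -1.2]  # made-up gradient values
for beta in (0.0, 0.5, 0.9, 0.99):
    for m, a in run_both(sample_grads, beta):
        # (1 - beta) * m and a should match up to rounding error.
        assert abs((1.0 - beta) * m - a) < 1e-12
print("identity holds for all tested beta")
```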

It seems to me that we should change the update equation of momentum SGD to the equation of exponentially weighted moving average of gradients, i.e. add the 1 – β coefficient to the gradient term. Here’s why:

- It decouples the learning rate from the momentum coefficient. Currently, a larger momentum coefficient increases the *effective* learning rate (i.e. by how much the weights are updated). Suppose we are in an ideal scenario where the gradient is the same at every iteration, i.e. for all iterations i, j we have g L(w^{[i]}) = g L(w^{[j]}) = g L. Then lim_{i→∞} m^{[i]} = g L / (1 – β). For β = 0.9 this value equals 10 g L. For β = 0.99 it equals 100 g L. In contrast, if we use the exponentially weighted moving average formula, the analogous limit equals just g L for all β. I concede that this is an unrealistic scenario, and in real problems the gradients at steps i, i+1, i+2, …, i+k somewhat cancel each other out, but I still think it’s a good point.
- A weighted moving average is a somewhat well-known concept, while momentum isn’t.
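
The constant-gradient scenario from the first point can be simulated directly. This is a rough sketch assuming a scalar gradient that never changes; `steady_state`, the step count, and the gradient value 1.0 are assumptions made for illustration.

```python
def steady_state(beta, g=1.0, steps=5000):
    """Run both updates with a constant gradient g and return the final (m, a)."""
    m = 0.0
    a = 0.0
    for _ in range(steps):
        m = beta * m + g                  # momentum: approaches g / (1 - beta)
        a = beta * a + (1.0 - beta) * g   # EWMA: approaches g
    return m, a


for beta in (0.9, 0.99):
    m, a = steady_state(beta)
    # Compare the simulated values with the limits g / (1 - beta) and g.
    print(f"beta={beta}: m={m:.6f} (limit {1.0 / (1.0 - beta):.6f}), a={a:.6f}")
```

With the momentum form, the size of the step taken along a constant gradient grows with β; with the moving average form, it stays at the size of the gradient itself, which is the decoupling argued for above.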

I am interested to hear what reasons there are not to change the update formula. And if you think this is a good change, how should the authors of deep learning libraries proceed?

submitted by /u/CrazyCrab
