[D] Momentum updates an average of g; e.g. Adagrad also averages g^2. What other averages might be worth updating? E.g. 4 averages (of g, x, x*g, x^2) give an MSE-fitted local parabola

Updating exponential moving averages is a basic tool of SGD methods, starting with the average of the gradient g in the momentum method, which extracts a local linear trend from the statistics.

Then, e.g., Adagrad and the Adam family add averages of g_i*g_i to strengthen underrepresented coordinates (strictly speaking, Adagrad accumulates a running sum rather than an exponential average).
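As a minimal sketch of the two kinds of averages mentioned so far, the Python below keeps an exponential moving average of g (as in momentum) and of g_i*g_i (as in Adam-style methods) and uses both in a single step. The function names, the hyperparameter values, and the omission of Adam's bias correction are assumptions made for illustration, not any library's API.

```python
import math

def ema_update(avg, value, beta):
    """One exponential-moving-average step: beta*avg + (1-beta)*value."""
    return beta * avg + (1.0 - beta) * value

def adam_like_step(x, grad, m, v, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update per-coordinate averages of g (m) and g*g (v), then step.

    Assumption: bias correction is left out to keep the sketch short,
    so early steps behave differently from real Adam.
    """
    m = [ema_update(mi, gi, beta1) for mi, gi in zip(m, grad)]
    v = [ema_update(vi, gi * gi, beta2) for vi, gi in zip(v, grad)]
    # Dividing by sqrt(v) enlarges steps along coordinates with small gradients.
    x = [xi - lr * mi / (math.sqrt(vi) + eps) for xi, mi, vi in zip(x, m, v)]
    return x, m, v

# Usage: one step from the origin with gradient (1, -2).
x, m, v = adam_like_step([0.0, 0.0], [1.0, -2.0], [0.0, 0.0], [0.0, 0.0])
print(x, m, v)
```

Both coordinates move by the same magnitude here even though the gradients differ by a factor of 2, which is the "strengthening" of weaker coordinates in its simplest form.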

TONGA can be seen as another step: it updates averages of g_i*g_j to model the (uncentered) covariance matrix of gradients for a Newton-like step.
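To make the g_i*g_j average concrete, here is a small sketch of only the averaging part: an exponential moving average of the outer product of the gradient with itself. It is not TONGA's actual algorithm; the function name and the beta value are assumptions for illustration.

```python
def update_outer_average(C, grad, beta=0.99):
    """Return the updated matrix with entries beta*C[i][j] + (1-beta)*g_i*g_j.

    C is the running (uncentered) second-moment matrix of the gradients,
    stored as a list of rows.
    """
    n = len(grad)
    return [
        [beta * C[i][j] + (1.0 - beta) * grad[i] * grad[j] for j in range(n)]
        for i in range(n)
    ]

# Usage: feed two gradients into a 2x2 average (beta=0.5 keeps numbers simple).
C = [[0.0, 0.0], [0.0, 0.0]]
for g in ([1.0, 2.0], [3.0, 0.0]):
    C = update_outer_average(C, g, beta=0.5)
print(C)  # [[4.75, 0.5], [0.5, 1.0]]
```

The diagonal of C holds the g_i*g_i averages from the previous paragraph; the off-diagonal entries are what this step adds, at O(n^2) memory for n parameters.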

I would like to start a discussion about other interesting or promising averages to update for SGD convergence, e.g. ones you have met in the literature.

For example, updating 4 exponential moving averages (of g, x, g*x, x^2) gives an MSE-fitted parabola in a given direction, and in multiple directions an estimated Hessian = Cov(g,x) Cov(x,x)^-1 (derivation). Analogously, we could MSE fit e.g. a degree 3 polynomial in a single direction by updating 6 averages: of g, x, g*x, x^2, g*x^2, x^3.
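The one-direction case can be sketched as follows. Assume that along the chosen direction the gradient is modeled as g ≈ lambda*(x - p), i.e. the derivative of a parabola with curvature lambda and minimum at p. Least-squares regression of g on x then gives lambda = (avg(g*x) - avg(g)*avg(x)) / (avg(x^2) - avg(x)^2) and p = avg(x) - avg(g)/lambda, which is the 1D form of Cov(g,x) Cov(x,x)^-1. The class name, the beta value, and the choice to initialize the averages with the first sample (so the weights sum to 1) are assumptions of this sketch.

```python
class ParabolaAverages:
    """Tracks exponential moving averages of g, x, g*x, x^2 in one direction."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.avg = None  # (avg g, avg x, avg g*x, avg x^2)

    def update(self, x, g):
        sample = (g, x, g * x, x * x)
        if self.avg is None:
            # Assumption: start from the first sample so weights sum to 1.
            self.avg = sample
        else:
            b = self.beta
            self.avg = tuple(b * a + (1.0 - b) * s
                             for a, s in zip(self.avg, sample))

    def fit(self, tiny=1e-12):
        """Return (lambda, p) of the fitted parabola, or None if undetermined."""
        if self.avg is None:
            return None
        g, x, gx, xx = self.avg
        var_x = xx - x * x          # weighted variance of positions
        if var_x <= tiny:
            return None             # positions have no spread yet
        lam = (gx - g * x) / var_x  # Cov(g,x) / Cov(x,x)
        if abs(lam) <= tiny:
            return None             # flat model: no finite minimum
        return lam, x - g / lam

# Usage: gradients of an exact parabola with curvature 3 and minimum at x = 2.
pa = ParabolaAverages(beta=0.9)
for x in (0.0, 1.0, 3.0, 2.5):
    pa.update(x, 3.0 * (x - 2.0))
print(pa.fit())
```

For noise-free gradients of an exact parabola the fit recovers the curvature and minimum up to rounding; with noisy gradients the estimate is only as good as the spread of recent positions allows, and a negative or near-zero lambda would need separate handling before using p as a step target.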

Have you seen such additional updated averages in the literature, especially of g*x? Is it worth, e.g., extending the momentum method with such additional averages to model a parabola in its direction for a smarter step size?

submitted by /u/jarekduda