[D] Momentum updates an average of g; Adagrad, for example, also averages g^2. Which other averages might be worth updating? E.g. four averages, of g, x, x*g and x^2, give an MSE-fitted local parabola

Updating exponential moving averages is a basic tool of SGD methods, starting with the average of the gradient g in the momentum method, which extracts a local linear trend from the gradient statistics.

Then e.g. Adagrad and the Adam family add averages of g_i*g_i to strengthen underrepresented coordinates (Adagrad as an accumulated sum rather than an exponential average).

TONGA can be seen as another step: it updates averages of g_i*g_j to model the (uncentered) covariance matrix of gradients for a Newton-like step.
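To make the three kinds of statistics concrete, here is a minimal pure-Python sketch of the running averages only, not of the optimizers themselves. The names `ema`, `init_state`, `update_averages` and the single shared `beta` are assumptions made for illustration. Real Adam uses separate decay rates and bias correction, and TONGA's actual algorithm is considerably more involved than tracking a dense matrix.

```python
def ema(old, new, beta):
    """One exponential-moving-average step: beta*old + (1-beta)*new."""
    return beta * old + (1.0 - beta) * new


def init_state(n):
    """Zero-initialized averages for an n-dimensional gradient (hypothetical layout)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "C": [[0.0] * n for _ in range(n)]}


def update_averages(state, g, beta=0.9):
    """Update averages of g_i (momentum-like), g_i*g_i (Adam-like), g_i*g_j (TONGA-like)."""
    n = len(g)
    state["m"] = [ema(state["m"][i], g[i], beta) for i in range(n)]
    state["v"] = [ema(state["v"][i], g[i] * g[i], beta) for i in range(n)]
    state["C"] = [
        [ema(state["C"][i][j], g[i] * g[j], beta) for j in range(n)]
        for i in range(n)
    ]
    return state


# Feed the same toy gradient three times; no bias correction is applied here.
state = init_state(2)
for g in ([1.0, -2.0], [1.0, -2.0], [1.0, -2.0]):
    state = update_averages(state, g)
```

The diagonal of `C` coincides with `v`, which shows how the g_i*g_j averages generalize the per-coordinate g_i*g_i averages.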

I would like to start a discussion about other interesting or promising updated averages for SGD convergence, e.g. ones you have met in the literature.

For example, updating 4 exponential moving averages (of g, x, g*x, x^2) gives an MSE-fitted parabola in a given direction, and in multiple directions an estimated Hessian = Cov(g,x) * Cov(x,x)^-1 (derivation). Analogously, we could MSE-fit e.g. a degree 3 polynomial in a single direction by updating 6 averages: of g, x, g*x, x^2, g*x^2, x^3.
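The single-direction case can be sketched as follows, assuming the local model is a parabola whose gradient along the direction is g = lam*(x - p). A least-squares linear fit of g against x then gives lam = Cov(g,x)/Var(x) and p = mean(x) - mean(g)/lam. The class name `ParabolaEMA` and the extra average `w` (an average of the constant 1, used so that early estimates are not biased toward zero) are assumptions of this sketch, not something taken from the post.

```python
class ParabolaEMA:
    """Tracks EMAs of g, x, g*x, x^2 along one direction and fits g ~= lam*(x - p)."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.w = 0.0   # EMA of the constant 1, used for bias correction
        self.g = 0.0   # EMA of the gradient component g
        self.x = 0.0   # EMA of the position x
        self.gx = 0.0  # EMA of g*x
        self.xx = 0.0  # EMA of x^2

    def update(self, x, g):
        """Fold one (position, gradient) observation into all averages."""
        b = self.beta
        self.w = b * self.w + (1 - b)
        self.g = b * self.g + (1 - b) * g
        self.x = b * self.x + (1 - b) * x
        self.gx = b * self.gx + (1 - b) * g * x
        self.xx = b * self.xx + (1 - b) * x * x

    def fit(self):
        """Return (lam, p): curvature estimate and estimated extremum position.

        Fails with ZeroDivisionError if all observed x coincide or lam is zero.
        """
        mg, mx = self.g / self.w, self.x / self.w
        cov_gx = self.gx / self.w - mg * mx
        var_x = self.xx / self.w - mx * mx
        lam = cov_gx / var_x
        p = mx - mg / lam
        return lam, p


# Noise-free toy data from a parabola with curvature 3 and minimum at x = 2.
est = ParabolaEMA(beta=0.9)
for x in [0.0, 0.5, 1.0, 1.5]:
    est.update(x, 3.0 * (x - 2.0))
lam, p = est.fit()
```

On this noise-free data the fit recovers lam = 3 and p = 2 up to rounding. With noisy gradients the estimates would fluctuate, and a practical method would need safeguards for small Var(x) and for negative or near-zero lam.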

Have you seen such additional updated averages in the literature, especially of g*x? Is it worth, e.g., extending the momentum method with such additional averages to model a parabola in its direction for a smarter step size?

submitted by /u/jarekduda

Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, vr, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.