[D] Momentum updates average of g, e.g. Adagrad also of g^2. What other averages might be worth to update? E.g. 4: of g, x, x*g, x^2 give MSE fitted local parabola
Updating exponential moving average is a basic tool of SGD methods, starting with of gradient g in momentum method to extract local linear trend from the statistics.
Then e.g. Adagrad, ADAM family adds averages of g_i*g_i to strengthen underrepresented coordinates.
TONGA can be seen as another step: updates g_i*g_j averages to model (uncentered) covariance matrix of gradients for Newton-like step.
I wanted to propose a discussion about some other interesting/promising updated averages for SGD convergence e.g. met in literature?
For example updating 4 exponential moving averages: of g, x, gx, x2 gives MSE fitted parabola in a given direction, estimated Hessian = Cov(g,x).Cov(x,x)-1 in multiple directions (derivation). Analogously we could MSE fit e.g. in a single direction degree 3 polynomial if updating 6 averages: of g, x, gx, x2, g*x2, x3.
Have you seen such additional updated averages in literature, especially of g*x? Is it worth e.g. to expand momentum method by such additional averages to model parabola in its direction for smarter step size?