[R] Why not use e.g. SGD coordinate-wise: learning rate ~ sqrt(variance(theta)/variance(g)) ?

Written by torontoai on December 2, 2019. Posted in Reddit MachineLearning.

While working on estimating the position of a minimum by modelling where the linear trend of gradients intersects zero, I found that a simple approximation (corr(g,theta)=1) leads to an obvious-looking formula:

learning rate ~ sqrt(var(theta)/var(g))

It is proportional to the width of the displacement of theta and inversely proportional to the width of the displacement of the gradients. Assuming they lie on a line (corr(g,theta)=1), such a learning rate would take us exactly to the g=0 minimum of a parabola in one step.
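A quick numerical check of the one-step claim, as a minimal sketch: it assumes a hypothetical noiseless 1D parabola f(theta) = 0.5 * a * (theta - p)^2 with a > 0, so that g = a * (theta - p) and corr(g,theta)=1 holds exactly. The names `variance_ratio_lr`, `a`, `p` and the sample values are made up for illustration.

```python
import math
import statistics


def variance_ratio_lr(thetas, grads):
    """Learning rate sqrt(var(theta) / var(g)) from paired samples."""
    return math.sqrt(statistics.pvariance(thetas) / statistics.pvariance(grads))


# Hypothetical parabola: curvature a > 0, minimum at p.
a, p = 4.0, 1.5
thetas = [0.0, 0.5, 2.0, 3.0]
grads = [a * (t - p) for t in thetas]  # noiseless gradients, exactly on a line

lr = variance_ratio_lr(thetas, grads)    # var(g) = a^2 * var(theta), so lr = 1/a
theta_new = thetas[-1] - lr * grads[-1]  # theta - (theta - p) = p

print(lr, theta_new)
```

Since the square root discards the sign of the slope, this only works as written for positive curvature; with noisy gradients (corr < 1) the ratio would only approximate 1/a.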

Adaptive variance estimation is just a matter of maintaining two exponential moving averages, one of the value and one of value^2, so we can cheaply do it coordinate-wise in SGD, getting 2nd-order adaptation of the learning rate independently for each coordinate (5th page here).
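The two-moving-averages idea could look roughly like the sketch below. It is an illustration under assumptions, not the method from the linked paper: `EmaVariance`, `sgd_variance_lr`, `beta`, `warmup_lr` and `eps` are hypothetical names, the Adam-style bias correction is an added choice, and a small fixed learning rate is used as a fallback until the variance estimates are usable.

```python
import math


class EmaVariance:
    """Variance estimate from two exponential moving averages: of x and of x^2."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.mean = 0.0      # EMA of x
        self.mean_sq = 0.0   # EMA of x^2
        self.weight = 0.0    # total EMA weight, used for bias correction

    def update(self, x):
        b = self.beta
        self.mean = b * self.mean + (1 - b) * x
        self.mean_sq = b * self.mean_sq + (1 - b) * x * x
        self.weight = b * self.weight + (1 - b)

    def variance(self):
        m = self.mean / self.weight
        m2 = self.mean_sq / self.weight
        return max(m2 - m * m, 0.0)  # clamp tiny negatives from rounding


def sgd_variance_lr(grad_fn, theta, steps=50, beta=0.9, warmup_lr=0.01, eps=1e-12):
    """Coordinate-wise SGD with learning rate sqrt(var(theta) / var(g))."""
    theta = list(theta)
    th_stats = [EmaVariance(beta) for _ in theta]
    g_stats = [EmaVariance(beta) for _ in theta]
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            th_stats[i].update(theta[i])
            g_stats[i].update(g[i])
            var_t, var_g = th_stats[i].variance(), g_stats[i].variance()
            # Fall back to a fixed rate while the gradient variance is ~0.
            lr = math.sqrt(var_t / var_g) if var_g > eps else warmup_lr
            theta[i] -= lr * g[i]
    return theta


# Hypothetical separable quadratic: curvatures 1 and 10, minimum at (2, -1).
curv, target = [1.0, 10.0], [2.0, -1.0]
grad = lambda th: [c * (t - p) for c, t, p in zip(curv, th, target)]

result = sgd_variance_lr(grad, [0.0, 0.0])
print(result)
```

On this noiseless example each coordinate gets its own rate (about 1 and 0.1) after one warm-up step, despite the 10x difference in curvature. With real minibatch noise the estimate would be biased by the gradient noise variance, which this sketch does not address.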

Using the square root of the mean gradient² in the denominator is popular (e.g. RMSprop, ADAM), but has anybody seen variance used in SGD optimizers?

submitted by /u/jarekduda