[D] Saddle-free Newton method and other optimizers actively repelling saddles in SGD training – advantages, weaknesses, improvements?
While 2nd-order methods have many advantages, they also have serious weaknesses. The plain Newton step (approximated e.g. in L-BFGS) attracts to the nearest zero-gradient point, which in high dimension is usually a saddle. Other methods pretend that our very non-convex function is locally convex by using a positive (semi-)definite curvature model (e.g. Gauss-Newton, Levenberg-Marquardt, the Fisher information matrix in K-FAC, the gradient covariance matrix in TONGA) – again attracting to stationary points generally, not only to local minima (how bad is that?).
There is a belief that the number of saddles grows ~exp(dim) faster than the number of minima. Actively repelling them (instead of being attracted to them) requires controlling the signs of the curvatures (the Hessian eigenvalues) – e.g. negating the step's sign in negative-curvature directions.
This is done e.g. in the saddle-free Newton (SFN) method (https://arxiv.org/pdf/1406.2572) – from 2014, 600+ citations, with a recent GitHub implementation. They claim a few times(!) lower error e.g. on MNIST this way, while other methods got stuck on plateaus with strong negative eigenvalues: https://i.imgur.com/xJLBGgl.png
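For intuition, here is a minimal dense-Hessian sketch of the SFN update – preconditioning by |H| instead of H, as in the paper (which, for scalability, applies this in a low-rank Krylov subspace; the toy saddle below is my own illustration, not from the paper):

```python
import numpy as np

def sfn_step(grad, hess, damping=1e-4):
    """Saddle-free Newton step: precondition the gradient by |H| instead
    of H, so negative-curvature directions are descended instead of the
    step being attracted to the stationary point."""
    eigval, eigvec = np.linalg.eigh(hess)           # H = V diag(lam) V^T
    inv_abs = 1.0 / (np.abs(eigval) + damping)      # |lam|^-1, damped
    return -eigvec @ (inv_abs * (eigvec.T @ grad))  # -V |Lam|^-1 V^T g

# On f(x, y) = x^2 - y^2 the origin is a saddle: the plain Newton step
# -H^{-1} g from (1, 1) lands exactly on it, while SFN descends in y.
g = np.array([2.0, -2.0])                # gradient of f at (1, 1)
H = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of f (constant)
print(sfn_step(g, H))                    # ~[-1, +1]: away from the saddle
```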
Here is another very interesting paper (https://arxiv.org/pdf/1902.02366) investigating the evolution of the Hessian eigenvalues for a 3.3M-parameter network (a ~20-terabyte Hessian!), for example showing that rare negative-curvature directions allow for relatively huge improvements: https://i.imgur.com/SwUasvc.png
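Such spectra are not computed by materializing the 20 TB matrix – the standard trick is matrix-free Hessian-vector products fed to a Lanczos-type solver. A sketch, assuming a user-supplied grad_fn(theta) (the finite-difference HVP here is my simplification; autodiff HVPs are what is used in practice):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def most_negative_eigs(grad_fn, theta, k=5, eps=1e-5):
    """Estimate the k most negative Hessian eigenvalues using only
    Hessian-vector products Hv ~ (g(theta + eps*v) - g(theta)) / eps,
    never forming the dim x dim Hessian itself."""
    g0 = grad_fn(theta)
    def hvp(v):
        return (grad_fn(theta + eps * np.ravel(v)) - g0) / eps
    H = LinearOperator((theta.size, theta.size), matvec=hvp)
    # 'SA' = smallest algebraic eigenvalues; ARPACK runs Lanczos internally
    return eigsh(H, k=k, which='SA', return_eigenvectors=False)
```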
So it looks great – it seems we should all be using SFN or other methods actively repelling saddles … but that didn't happen – why? What are the weaknesses?
What are other promising 2nd-order approaches that handle saddles?
How can we improve SFN-like methods? For example, what I dislike most is directly estimating the Hessian from noisy data, which is numerically very problematic. Instead, we are really interested in the linear behavior of the 1st derivative – we can estimate it optimally with (online) linear regression of the gradients, with exponentially weakening weights on old gradients. Another issue is the restriction to a Krylov subspace, forced by the numerical method (Lanczos) – the subspace should rather be based on gradient statistics such as their PCA, which again can be computed online to get locally, statistically relevant directions. A sketch of both ideas follows below.
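To make the regression idea concrete, here is a minimal sketch of my suggestion (my own illustration, not from the SFN paper): along a fixed direction d, regress the directional gradient against the position with exponentially decaying weights; the weighted least-squares slope estimates the curvature d^T H d, sign included, without ever touching the Hessian:

```python
import numpy as np

class OnlineCurvature:
    """Exponentially weighted online linear regression of the directional
    gradient g.d against the position theta.d; the regression slope
    estimates the curvature d^T H d, including its sign."""
    def __init__(self, d, beta=0.9):
        self.d = d / np.linalg.norm(d)
        self.beta = beta              # forgetting factor: weakens old gradients
        self.w = 0.0                  # total weight (for normalization)
        self.mx = self.mg = 0.0       # weighted sums of x and g
        self.mxx = self.mxg = 0.0     # weighted sums of x*x and x*g

    def update(self, theta, grad):
        x, g, b = self.d @ theta, self.d @ grad, self.beta
        self.w   = b * self.w   + 1.0
        self.mx  = b * self.mx  + x
        self.mg  = b * self.mg  + g
        self.mxx = b * self.mxx + x * x
        self.mxg = b * self.mxg + x * g

    def curvature(self):
        var_x  = self.mxx / self.w - (self.mx / self.w) ** 2
        cov_xg = self.mxg / self.w - (self.mx / self.w) * (self.mg / self.w)
        return cov_xg / (var_x + 1e-12)   # weighted least-squares slope
```

An SFN-style step along d is then -(g·d)/(|λ|+ε)·d: dividing by the absolute curvature flips the step in negative-curvature directions. The directions d themselves could come from an online PCA of recent gradients instead of a Lanczos-built Krylov subspace – e.g. Oja's rule, which drifts a vector toward the top principal component of the gradient stream:

```python
def oja_step(v, g, lr=1e-2):
    """One Oja's-rule update: v converges to the top principal direction
    of the (uncentered) gradient covariance - an online, Krylov-free PCA."""
    v = v + lr * g * (g @ v)
    return v / np.linalg.norm(v)
```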
submitted by /u/jarekduda