Skip to main content


Learn About Our Meetup

5000+ Members



Join our meetup, learn, connect, share, and get to know your Toronto AI community. 



Browse through the latest deep learning, ai, machine learning postings from Indeed for the GTA.



Are you looking to sponsor space, be a speaker, or volunteer, feel free to give us a shout.

[D] Saddle-free Newton method for SGD and other actively repelling saddles – advantages, weaknesses, improvements?

While 2nd order methods have many advantages, e.g. natural gradient (e.g. in L-BFGS) attracts to close zero gradient point, which is usually saddle. Other try to pretend that our very non-convex function is locally convex (e.g. Gauss-Newton, Levenberg-Marquardt, Fisher information matrix e.g. in K-FAC, gradient covariance matrix in TONGA – overview) – again attracting rather not only to local minima (how bad it is?).

There is a belief that the number of saddles is ~exp(dim) larger than of minima. Actively repelling them (instead of attracting) requires control of sign of curvatures (as Hessian eigenvalues) – e.g. negating step sign in these directions.

It is e.g. done in saddle-free Newton method (SFN) ( ) – 2014, 600+ citations, recent github. They claim to get a few times(!) lower error e.g. on MNIST this way, other methods got stuck on some plateaus with strong negative eigenvalues:

Here is another very interesting paper: investigating evolution of eigenvalues of Hessian for 3.3M parameters (~20 terabytes!), for example showing that rare negative curvature directions allow for relatively huge improvements:

So it looks great – it seems that we all should use SFN or other methods actively repelling saddles … but it didn’t happen – why is it so? What are the weaknesses?

What are other promising 2nd order approaches handling saddles?

How can we improve SFN-like methods? For example what I mostly don’t like is directly estimating Hessian from noisy data, what is very problematic numerically. Instead, we are really interested in linear behavior of 1st derivative – we can optimally estimate it with (online) linear regression of gradients: with weakening weights of old gradients. Another issue is focusing on Krylov subspace due to numerical method (Lanczos) – it should be rather based on gradient statistics like their PCA, what again can be made online to get local statistically relevant directions.

submitted by /u/jarekduda
[link] [comments]