[D] Second-order gradient optimization vs Adam/momentum
I'm having trouble wrapping my head around how optimisers like Adam and momentum differ from second-order optimisation methods.
Second-order methods calculate or approximate the Hessian, whereas momentum-based optimisers adjust their updates using gradients from past steps (which looks quite similar to how higher-order derivatives work).
I know that the two approaches differ mathematically and in implementation, but can anyone provide some intuition for how they differ in practice, perhaps with an example where you would expect wildly different results from the two types of optimiser?
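To make the question concrete, here is a minimal sketch in Python/NumPy of the kind of comparison I have in mind. It assumes a hypothetical ill-conditioned quadratic, f(x) = 0.5 * x^T A x, and the function names, learning rate, and momentum coefficient are all illustrative choices rather than anything from a particular library. My understanding is that on a problem like this a single Newton step lands on the minimum because the Hessian rescales each direction separately, while heavy-ball momentum with one fixed learning rate is still working its way there after a number of steps.

```python
import numpy as np

# Hypothetical test problem: f(x) = 0.5 * x^T A x, minimum at the origin.
# The curvature differs by a factor of 100 between the two axes.
A = np.diag([1.0, 100.0])


def grad(x):
    """Gradient of the quadratic."""
    return A @ x


def hessian(x):
    """Hessian of the quadratic (constant for this problem)."""
    return A


def momentum_descent(x0, lr=0.01, beta=0.9, steps=20):
    """Heavy-ball momentum: uses only first-order gradients from past steps."""
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)  # running accumulation of past gradients
        x = x - lr * v          # same scalar step size in every direction
    return x


def newton_step(x0):
    """One Newton step: rescales the gradient by the inverse Hessian."""
    return x0 - np.linalg.solve(hessian(x0), grad(x0))


x0 = np.array([1.0, 1.0])
x_newton = newton_step(x0)
x_momentum = momentum_descent(x0)

print("distance to minimum after 1 Newton step:   ", np.linalg.norm(x_newton))
print("distance to minimum after 20 momentum steps:", np.linalg.norm(x_momentum))
```

This only covers an exactly quadratic objective with an exact Hessian, so it probably overstates the gap compared with noisy, non-convex training, which is part of why I am asking for intuition about real cases.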
submitted by /u/mellow54