[D] Momentum methods help escape local minima, so what? That was never our objective.
Something that seems under-discussed in machine learning is why we bother with momentum methods in the first place.
Suppose we are training a classifier and the loss function has two local minima, one of which is global. Suppose, by sheer bad luck, gradient descent gets stuck in the worse local minimum. If you ask around about what can be done, you will hear answers like “oh, just use the momentum method, it gets you out of the local minimum”.
First, there is no guarantee you will get out of the local minimum (you only have a chance if the difference between the current and previous iterate, i.e. the accumulated velocity, is large enough). And more importantly,
Second, great, you have found the global minimum and… you have just potentially overfitted your classifier.
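To make the first point concrete, here is a minimal toy sketch (my own illustration, not something from any paper): a tilted double-well loss f(x) = (x² − 1)² + 0.3x, which has a shallow local minimum near x ≈ +0.96 and the global minimum near x ≈ −1.04. Started at x = 2, plain gradient descent settles in the nearer shallow well, while heavy-ball momentum carries the iterate over the barrier — but only because the accumulated velocity under these particular hyperparameters happens to be large enough, not because the method guarantees escape:

```python
def grad(x):
    # derivative of the toy loss f(x) = (x^2 - 1)^2 + 0.3x
    return 4 * x**3 - 4 * x + 0.3

def gd(x, lr=0.01, steps=500):
    # plain gradient descent
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def momentum(x, lr=0.01, beta=0.9, steps=500):
    # heavy-ball momentum: velocity accumulates past gradients
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)
        x += v
    return x

print(gd(2.0))        # settles in the shallow local minimum, x near +0.96
print(momentum(2.0))  # crosses the barrier, x near -1.04
```

Change beta to, say, 0.5 and the momentum run gets trapped in the shallow well just like plain GD — the escape is an accident of the velocity, not a property of the method.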
In other words, we are looking for local minima (or even just some point on the loss surface) with good generalization properties, and I don’t think momentum methods guarantee that.
Has there been any research on the generalization properties of the minima you find, and on which algorithms get you the best minima — not in terms of how small the loss is, but in terms of how well they generalize?