[R] Do the loss landscapes of neural networks tend to resemble the Earth’s own topography with regard to min/max elevation regions?
Do the lowest-loss regions of a NN tend to congregate in distinct regions containing many high peaks, like mountain ranges on Earth (assuming, of course, that we are talking about the negative loss function, so it’s maximization instead of minimization)? To elaborate: the highest summits on Earth tend to have many other peaks nearby with similar (but slightly lower) elevations, due to plate tectonics. Given no prior information about Earth’s topography, one might assume a uniform distribution and expect the “tall” points to be spread more or less randomly across the surface. That isn’t the case: 90+% of the “tall” points on Earth are contained in less than 10% of the landmass. As a corollary, high peaks are very rarely not surrounded by other high peaks.
So does the NN loss landscape resemble Earth in this respect? Or are there mostly solo peaks dispersed more or less randomly across the negative loss landscape? The former would imply that if one is at a “high” point (say, a local max or saddle point), then other high(er) points are likely nearby.
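To make the “mountain range vs. solo peaks” distinction concrete, here is a rough sketch of how one might measure it on a 2D slice of the negative loss. Everything in it is an assumption for illustration: the function name `tall_point_concentration`, the choice of the top 10% as “tall”, the 8x8 tiles, and the two synthetic height maps standing in for real loss slices.

```python
import numpy as np

def tall_point_concentration(height, top_frac=0.1, block=8):
    """Fraction of coarse tiles that contain at least one 'tall' point.

    Values near 1 mean tall points are spread everywhere; small values
    mean they are packed into a few regions (mountain-range behaviour).
    """
    h, w = height.shape
    # Crop so the grid divides evenly into block x block tiles.
    h, w = h - h % block, w - w % block
    height = height[:h, :w]
    # 'Tall' = at or above the (1 - top_frac) quantile of the whole map.
    tall = height >= np.quantile(height, 1.0 - top_frac)
    # Split into tiles and mark each tile that holds any tall point.
    tiles = tall.reshape(h // block, block, w // block, block)
    return tiles.any(axis=(1, 3)).mean()

# Synthetic stand-ins for a 2D slice of the negative loss (hypothetical data).
x, y = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
clustered = np.exp(-(x**2 + y**2) / 0.1)                      # one broad massif
scattered = np.random.default_rng(0).standard_normal((64, 64))  # isolated spikes

c_clustered = tall_point_concentration(clustered)
c_scattered = tall_point_concentration(scattered)
print(c_clustered, c_scattered)
```

On a real network one would replace the synthetic maps with negative loss values evaluated on a grid along two chosen directions in parameter space. A 2D slice can of course hide structure that exists in the full parameter space, so this is only a crude proxy.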
The only literature I can find exploring this idea is https://arxiv.org/abs/1712.09913. The authors map the ratio of the minimum to the maximum eigenvalue of the Hessian to gauge how convex different regions of a NN’s loss surface are. It seemed to indicate the latter scenario for the “smoother” networks (solo peaks occur more often) and the former for the more chaotic networks, but I could be misinterpreting. I’m also unsure whether convexity alone even helps answer my question, since many nearby peaks could each still have strongly convex curvature.
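For reference, this is roughly the quantity I mean, as I understand it: the ratio |λ_min / λ_max| of the smallest to the largest Hessian eigenvalue at a point. Below is a minimal sketch on toy 2D functions, using a finite-difference Hessian. The helper names `numerical_hessian` and `eig_ratio` and the toy losses are my own assumptions, not the paper’s code.

```python
import numpy as np

def numerical_hessian(f, p, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at point p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                       - f(p - ei + ej) + f(p - ei - ej)) / (4 * eps**2)
    return H

def eig_ratio(f, p):
    """|lambda_min / lambda_max| of the Hessian of f at p."""
    eigs = np.linalg.eigvalsh(numerical_hessian(f, p))  # ascending order
    return abs(eigs[0] / eigs[-1])

# Toy 'losses' (hypothetical): a convex bowl and a saddle.
bowl = lambda p: p[0]**2 + 3 * p[1]**2   # Hessian eigenvalues 2 and 6
saddle = lambda p: p[0]**2 - p[1]**2     # Hessian eigenvalues -2 and 2

ratio_bowl = eig_ratio(bowl, [0.5, -0.2])
ratio_saddle = eig_ratio(saddle, [0.5, -0.2])
print(ratio_bowl, ratio_saddle)
```

Note that the ratio only describes local curvature at one point. The bowl and the saddle give different values here, but neither number says anything about whether other good points lie nearby, which is the part I care about.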
Interested to hear others’ thoughts on the matter.
Bonus: I’m interested in this question from the perspective of Deep Q-networks (DQN) and policy gradient algorithms in reinforcement learning. I’m aware these have different loss landscapes than supervised learning because rewards in RL are sparse, but if anyone has specific insights on this, that’d be great. If you’re not familiar with RL, just assume this is about strongly supervised learning tasks such as image classification. Thanks.