[Discussion][D] Gradient norm tracking
Are there any best practices for tracking gradient norms during training? Surprisingly, I haven't been able to find much reliable information on it, beyond the classic Glorot & Bengio paper.
My current approach is to track the 2-norm of each layer's raw gradients. However, I don't have any practical intuition for which values should worry me. Tracking the actual weight updates (e.g., as adjusted by Adam) would make much more sense to me, but I haven't seen anyone do it.
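For what it's worth, here is a minimal PyTorch sketch of both quantities: per-parameter raw gradient 2-norms, and the size of the update Adam actually applies (measured by snapshotting the weights around `opt.step()`). The model and names are illustrative, not from any particular codebase.

```python
import torch
import torch.nn as nn

# Toy model/optimizer purely for illustration.
model = nn.Linear(16, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()

# Raw per-parameter gradient 2-norms, plus a global norm over all params.
grad_norms = {n: p.grad.norm(2).item() for n, p in model.named_parameters()}
global_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters()]), 2
).item()

# Actual update size: snapshot weights, step, then diff. This captures
# whatever the optimizer really did (momentum, bias correction, etc.).
before = {n: p.detach().clone() for n, p in model.named_parameters()}
opt.step()
update_norms = {
    n: (p.detach() - before[n]).norm(2).item()
    for n, p in model.named_parameters()
}
opt.zero_grad()
```

One heuristic I've seen cited (e.g. in Karpathy's training-recipe notes) is to watch the ratio of update norm to weight norm per layer, with values around 1e-3 being "healthy" — but treat that as a rule of thumb, not a hard threshold.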
A few words on why I'm concerned: I'm working on an exotic NN architecture for 3D, where different architectural choices drastically change gradient behavior, sometimes up to blow-up.