[D] Besides decaying the learning rate and increasing the batch size: Decay momentum? Decay the dropout rate? Increase L2 regularization?
Decaying the learning rate is a popular practice, even for adaptive optimizers such as Adam. Increasing the batch size during training has also been shown to have a similar effect.
But there are other hyperparameters of a similar nature.
– Does it make sense to decay/increase them?
– Has anyone tried decaying momentum, decaying the dropout rate, or increasing L2 regularization?
– Are there other hyperparameters that need tuning like this?
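
To make the question concrete, here is a rough sketch of what I mean by scheduling these hyperparameters the same way as the learning rate. It is framework-agnostic; the function names (`linear_schedule`, `hyperparams_at`), the linear shape, and every start/end value are hypothetical choices for illustration, not recommendations or results.

```python
def linear_schedule(start, end, total_steps):
    """Return a function mapping a training step to a linearly interpolated value."""
    def value_at(step):
        # Clamp progress to [0, 1] so steps past total_steps hold the final value.
        frac = min(max(step / total_steps, 0.0), 1.0)
        return start + frac * (end - start)
    return value_at


TOTAL_STEPS = 1000  # assumed training length

# Assumed start/end values, picked only to show the direction of each schedule.
schedules = {
    "lr": linear_schedule(1e-3, 1e-5, TOTAL_STEPS),          # decay
    "momentum": linear_schedule(0.9, 0.5, TOTAL_STEPS),      # decay
    "dropout": linear_schedule(0.5, 0.0, TOTAL_STEPS),       # decay
    "weight_decay": linear_schedule(0.0, 1e-4, TOTAL_STEPS), # increase (L2 strength)
}


def hyperparams_at(step):
    """Collect the value of every scheduled hyperparameter at a given step."""
    return {name: schedule(step) for name, schedule in schedules.items()}


if __name__ == "__main__":
    for step in (0, 500, 1000):
        print(step, hyperparams_at(step))
```

In a real training loop, the values from `hyperparams_at(step)` would have to be written into the optimizer and the dropout layers at each step; how that is done depends on the framework.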