[D] Multilayer hidden-to-hidden transformation in RNNs (GRU/LSTM)
I’m training a GRU network with a single GRU layer (among other layers), and I suspect the hidden-to-hidden transformation needs stronger non-linearity to correctly “merge” the memory with the current timestep (and thus update the hidden state).
How should I approach this, and what is the best practice?
Should I stack more GRU layers, or should I instead add extra layers with a nonlinearity like ReLU to the hidden-to-hidden transformation?
If I take the second approach, I assume I should use tanh instead of ReLU to avoid exploding gradients; is that correct?
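
To make the second option concrete, here is a rough PyTorch sketch of what I have in mind: a standard `GRUCell` followed by an extra tanh layer applied to the new hidden state. The class name, sizes, and the manual unroll are only placeholders for illustration, not a finished design:

```python
import torch
import torch.nn as nn

class DeepTransitionGRUCell(nn.Module):
    """Hypothetical sketch: GRUCell plus an extra nonlinear
    transform on the updated hidden state (deeper transition)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        # extra hidden-to-hidden layer; tanh keeps the state bounded
        self.extra = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
        )

    def forward(self, x, h):
        h_new = self.cell(x, h)   # standard GRU update
        return self.extra(h_new)  # deepen the hidden-to-hidden transition

# usage: unroll the cell manually over time (placeholder sizes)
batch, seq_len, input_size, hidden_size = 8, 20, 32, 64
cell = DeepTransitionGRUCell(input_size, hidden_size)
x = torch.randn(batch, seq_len, input_size)
h = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    h = cell(x[:, t], h)
```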
Thanks in advance.