[D] Effect of chaining multiple transformers (attention)
For recurrent neural networks (RNNs), increasing the number of hidden units allows the network to (better) model relationships between more distant inputs in a sequence.
However, what is the effect of increasing the number of layers in a transformer? Since a transformer attends to all positions of the sequence simultaneously at every layer, stacking layers has no direct analogue in RNNs.
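A minimal sketch (assuming PyTorch) that contrasts the two cases: in an RNN, information from position i must survive t − i recurrent updates to influence position t, whereas a single self-attention layer connects every pair of positions directly, so extra transformer layers compose attention patterns rather than extend reach. The dimensions and layer counts below are arbitrary illustration values.

```python
import torch
import torch.nn as nn

seq_len, d_model, n_heads, n_layers = 16, 64, 4, 3
x = torch.randn(seq_len, 1, d_model)  # (seq, batch, d_model)

# RNN: long-range influence is mediated by the recurrent hidden state,
# so more distant dependencies pass through more sequential updates.
rnn = nn.RNN(input_size=d_model, hidden_size=d_model)
rnn_out, _ = rnn(x)

# Transformer: each of the n_layers layers attends over the full sequence;
# depth stacks attention on top of attention instead of extending reach.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
enc_out = encoder(x)

# Inspect one layer's attention weights directly: the (batch, seq, seq)
# shape shows every position attending to every other position at once.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
_, weights = attn(x, x, x, need_weights=True)
print(weights.shape)  # torch.Size([1, 16, 16])
```

The printed 16×16 attention matrix is the point of the question: a single layer already spans the whole sequence, so whatever depth adds, it is not longer-range access.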
submitted by /u/mellow54