[D] Temporal coherence in transformers? Why fixed-length inputs in Al-Rfou et al. (2018)?
Why use fixed-length sequences in a transformer? How and why does this affect the transformer's performance and training? Why did they not allow sequences of any length up to some maximum?
Are there any papers on this?
Also, in the Transformer-XL paper (Dai et al., 2019), the authors say:
“We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence”
Why can’t a normal transformer (Vaswani et al.) learn dependencies beyond a fixed length without disrupting temporal coherence?
I think temporal coherence gets disturbed when the input length becomes comparable to the size of the embedding used for a single word/character, because the embedding then doesn’t contain enough information to link that word to all the preceding positions in the input sequence. Am I right?
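To make the question concrete, here is a minimal sketch of how I understand the fixed-length setup: the text is cut into consecutive segments, and each segment is processed on its own with causal attention. This is my assumption about the setup, not code from either paper, and the function names `split_fixed_length` and `visible_context` are made up for illustration.

```python
def split_fixed_length(tokens, segment_len):
    """Cut a token sequence into consecutive, non-overlapping segments.

    The last segment may be shorter than segment_len.
    """
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def visible_context(tokens, segment_len, position):
    """Tokens that `position` can attend to, assuming causal attention
    that is restricted to the segment containing `position`."""
    segment_start = (position // segment_len) * segment_len
    return tokens[segment_start:position + 1]


# Character-level example (hypothetical input text).
tokens = list("the cat sat on the mat")
segments = split_fixed_length(tokens, segment_len=8)

# The last character of the first segment sees the whole segment ...
print(visible_context(tokens, 8, 7))
# ... but the first character of the second segment sees only itself,
# even though eight characters precede it in the original text.
print(visible_context(tokens, 8, 8))
```

If this picture is right, a token near the start of a segment has almost no context no matter how large the embedding is, which would suggest the limit comes from where the segments are cut rather than from the embedding size.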