[D] Transformer number of token performance limits
I am currently working on a research project that uses a transformer-like model for an NLP task, specifically summarizing long documents.
I was wondering if any of you know of a paper that explores the limits of transformers on very long sequences.
Are there any known issues with long sequences?
Is there a sequence-length threshold beyond which this kind of model starts to lose performance?
Thanks a lot in advance!