[R] FacebookAI releases Adaptive attention span and All-attention layer to reduce decrease computation time / memory footprint
To enable wider use of this powerful deep learning architecture, we propose two new methods. The first, adaptive attention span is a way to make Transformer networks more efficient for longer sentences. With this method, we were able to increase the attention span of a Transformer to over 8,000 tokens without significantly increasing computation time or memory footprint. The second, all-attention layer is a way to simplify the model architecture of Transformer networks. Even with a much simpler architecture, our all-attention network matched the state-of-the-art performance of Transformer networks.