[D] Some novel techniques I found that accelerate Transformer-XL to some extent
- The original Transformer-XL caches the previous activations (the inputs to Q, K and V) and recomputes K&V for the memory at each iteration, but I found that you can cache K&V directly and skip that recomputation. In my toy experiment on Wikitext-103, this caused only negligible performance degradation. It reduces computation at the cost of doubling the GPU memory used for the cache.
- Another trick is to apply the technique of Shazeer's "Fast Transformer Decoding" paper (reference at the end of this post), i.e., making K&V a single head with hidden dimension 64. Unbeknownst to the paper's author, I found this works even on long-range language modeling like Wikitext-103 with negligible performance degradation. This essentially means that (1) the GPU memory used by the memory part becomes tiny, and (2) the computation of Q&K&V is pretty much just that of Q.
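To make the two tricks concrete, here is a minimal NumPy sketch of one attention layer that keeps many query heads, uses a single shared K/V head, and caches the projected K&V instead of the raw activations. It is only an illustration under assumptions, not the code from my experiment: the class name `CachedMultiQueryAttention`, the toy sizes, and the random weights are all made up, and Transformer-XL's relative positional encoding is left out entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CachedMultiQueryAttention:
    """Toy layer (hypothetical): n_heads query heads, one shared K/V head,
    and a memory that stores projected K and V rather than hidden states.
    Relative positional encoding is omitted for brevity."""

    def __init__(self, d_model=32, n_heads=4, d_head=8, mem_len=6, seed=0):
        rng = np.random.default_rng(seed)
        self.n_heads, self.d_head, self.mem_len = n_heads, d_head, mem_len
        scale = d_model ** -0.5
        self.w_q = rng.normal(size=(d_model, n_heads * d_head)) * scale
        self.w_k = rng.normal(size=(d_model, d_head)) * scale  # single K head
        self.w_v = rng.normal(size=(d_model, d_head)) * scale  # single V head
        self.k_mem = np.zeros((0, d_head))  # cached K of previous segments
        self.v_mem = np.zeros((0, d_head))  # cached V of previous segments

    def forward(self, x):
        seg_len = x.shape[0]
        # Projections are computed for the current segment only.
        q = (x @ self.w_q).reshape(seg_len, self.n_heads, self.d_head)
        k_new = x @ self.w_k
        v_new = x @ self.w_v
        # Reuse cached K/V as-is; nothing is recomputed for the memory.
        mem_used = self.k_mem.shape[0]
        k = np.concatenate([self.k_mem, k_new], axis=0)
        v = np.concatenate([self.v_mem, v_new], axis=0)
        # Every query head attends over the same shared K: shape (heads, t, s).
        scores = np.einsum("thd,sd->hts", q, k) / np.sqrt(self.d_head)
        # Causal mask: query t sees all memory plus current positions <= t.
        key_pos = np.arange(k.shape[0])[None, :]
        query_pos = (mem_used + np.arange(seg_len))[:, None]
        scores = np.where((key_pos <= query_pos)[None], scores, -np.inf)
        out = np.einsum("hts,sd->thd", softmax(scores), v)
        # Keep only the most recent mem_len positions in the cache.
        start = max(0, k.shape[0] - self.mem_len)
        self.k_mem, self.v_mem = k[start:], v[start:]
        return out.reshape(seg_len, self.n_heads * self.d_head)

layer = CachedMultiQueryAttention()
rng = np.random.default_rng(1)
seg1 = rng.normal(size=(4, 32))
seg2 = rng.normal(size=(4, 32))
layer.forward(seg1)             # fills the K/V cache
out2 = layer.forward(seg2)      # attends over cached K/V plus the new segment
print(out2.shape, layer.k_mem.shape)
```

With fixed weights (i.e., at inference), attending over cached K&V gives the same result as recomputing K&V from cached activations, so in this sketch the cache only changes what is stored, not what is computed.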
Combining these techniques, you get (1) almost no GPU memory use for the memory part, (2) essentially only Q to compute for the current sequence each iteration, and (3) very fast inference even for long-range language modeling! I hope you found this post useful. (Caveat: the experiment I performed was under the assumption that the dataset is large enough.)
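As a rough back-of-the-envelope check on the memory claims, the snippet below counts cached values per layer under the three schemes. The numbers `mem_len = 1600` and `d_model = 1024` are illustrative assumptions rather than my experimental settings, and the multi-head case assumes the K and V projections each have total width `d_model`.

```python
def cache_elements_per_layer(mem_len, d_model, kv_dim=None):
    """Count cached values per layer for a memory of mem_len tokens.

    kv_dim=None : cache hidden states (one tensor of width d_model).
    kv_dim=int  : cache K and V (two tensors of width kv_dim each).
    """
    if kv_dim is None:
        return mem_len * d_model
    return 2 * mem_len * kv_dim

mem_len, d_model = 1600, 1024  # assumed, for illustration only

hidden_cache = cache_elements_per_layer(mem_len, d_model)
kv_multi_head = cache_elements_per_layer(mem_len, d_model, kv_dim=d_model)
kv_single_head = cache_elements_per_layer(mem_len, d_model, kv_dim=64)

print("hidden-state cache :", hidden_cache)
print("multi-head K&V     :", kv_multi_head, f"({kv_multi_head / hidden_cache:.2f}x)")
print("single-head K&V    :", kv_single_head, f"({kv_single_head / hidden_cache:.3f}x)")
```

Under these assumed sizes, caching multi-head K&V doubles the cache, while caching single-head K&V of dimension 64 shrinks it to one eighth of the original hidden-state cache.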
Reference: Fast Transformer Decoding: One Write-Head is All You Need, Noam Shazeer, 2019