[D] Some novel techniques I found that accelerate Transformer-XL to some extent

  1. The original Transformer-XL caches the previous segment's activations (the inputs to Q, K and V) and recomputes K and V for the memory at every iteration. I found that you can instead cache K and V directly and skip recomputing them. In my toy experiment on Wikitext-103, this caused only negligible performance degradation. It saves computation at the cost of doubling the GPU memory used by the cache.
  2. Another trick is to apply the technique of [1], i.e., to make K and V a single head with hidden dimension 64. Unbeknownst to the author of [1], I found that this works even for long-range language modeling on Wikitext-103, again with negligible performance degradation. This essentially means that (a) the GPU memory used by the memory part becomes tiny, and (b) the cost of computing Q, K and V is pretty much just the cost of computing Q.
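
To make the two tricks concrete, here is a minimal NumPy sketch of a single attention layer that keeps a K/V cache (trick 1) and uses one shared K/V head of dimension 64 for all query heads (trick 2). This is not my experiment code. The class name, the default sizes, and the random weights are made up for illustration, and Transformer-XL's relative positional encoding is left out to keep it short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MultiQueryLayerWithKVCache:
    """Hypothetical single layer: many Q heads, one shared K/V head, cached K/V."""

    def __init__(self, d_model=256, n_heads=4, d_head=64, mem_len=8, seed=0):
        rng = np.random.default_rng(seed)
        scale = d_model ** -0.5
        self.n_heads, self.d_head, self.mem_len = n_heads, d_head, mem_len
        self.w_q = rng.normal(0, scale, (d_model, n_heads * d_head))
        self.w_k = rng.normal(0, scale, (d_model, d_head))  # single K head
        self.w_v = rng.normal(0, scale, (d_model, d_head))  # single V head
        self.w_o = rng.normal(0, scale, (n_heads * d_head, d_model))
        # Trick 1: the cache holds K and V themselves, not hidden states.
        self.k_cache = np.zeros((0, d_head))
        self.v_cache = np.zeros((0, d_head))

    def forward(self, x):
        """x: (seg_len, d_model) -> (seg_len, d_model)."""
        seg_len = x.shape[0]
        q = (x @ self.w_q).reshape(seg_len, self.n_heads, self.d_head)
        # K and V are computed for the current segment only.
        k = np.concatenate([self.k_cache, x @ self.w_k], axis=0)
        v = np.concatenate([self.v_cache, x @ self.w_v], axis=0)
        total = k.shape[0]
        mem = total - seg_len
        # Trick 2: every query head attends over the same K and V.
        scores = np.einsum("shd,td->hst", q, k) / np.sqrt(self.d_head)
        # Causal mask: query i sees all of the memory and positions <= i.
        allowed = np.arange(total)[None, :] <= (mem + np.arange(seg_len))[:, None]
        scores = np.where(allowed[None], scores, -np.inf)
        attn = softmax(scores, axis=-1)
        out = np.einsum("hst,td->shd", attn, v).reshape(seg_len, -1)
        # Keep only the most recent mem_len positions for the next segment.
        keep_from = max(0, total - self.mem_len)
        self.k_cache, self.v_cache = k[keep_from:], v[keep_from:]
        return out @ self.w_o
```

Calling `forward` on consecutive segments reuses the cached K and V, so per segment the only full-width projection is the one for Q. In this single-layer sketch, as long as `mem_len` covers the previous segment, the output for the second segment matches what you get from running both segments in one pass.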

Combining these techniques gives you (1) almost no GPU memory use for the memory part, (2) essentially only Q to compute for the current sequence at each iteration, and (3) very fast inference, even for long-range language modeling. I hope you found this post useful. (Caveat: I ran the experiment under the assumption that the dataset is large enough.)
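
For a rough sense of the memory claims, the arithmetic below counts cached elements per token per layer under the three schemes. The configuration (d_model=1024, 16 heads of dimension 64) is only an illustrative assumption, not the setup of my toy experiment, and the function name is made up.

```python
def cache_elements_per_token(d_model, n_heads, d_head):
    """Cached elements per token per layer for each caching scheme."""
    return {
        # Original Transformer-XL: cache the hidden states.
        "hidden_states": d_model,
        # Trick 1 alone: cache K and V for every head.
        "multi_head_kv": 2 * n_heads * d_head,
        # Tricks 1 and 2: cache one shared K head and one shared V head.
        "single_head_kv": 2 * d_head,
    }

# Illustrative configuration (an assumption, see above).
sizes = cache_elements_per_token(d_model=1024, n_heads=16, d_head=64)
for name, n in sizes.items():
    print(f"{name}: {n} elements per token per layer")
```

When n_heads * d_head equals d_model, caching per-head K and V takes exactly twice the space of caching hidden states, while the single-head cache depends only on d_head.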

[1] Fast Transformer Decoding: One Write-Head is All You Need, Noam Shazeer

submitted by /u/HigherTopoi