The paper in question is here: https://arxiv.org/pdf/1904.11492.pdf

The authors observe that the attention maps for different query positions are nearly identical, so a lot of computation can be saved by using a query-independent self-attention layer. That sounds good, but the diagram of their architecture confuses me: after the Transform section, when the result is added back to the original feature map, each channel receives only a single value, broadcast over the entire spatial plane. I had assumed the goal was to compute a global attention map (i.e. query-independent but still key-dependent). Could someone please explain why this is?

submitted by /u/eukaryote31
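For anyone trying to follow the diagram, here is a minimal NumPy sketch of the query-independent block as I read it. All names are mine, the 1x1 convs are collapsed into plain matrix/vector products, and LayerNorm is omitted, so treat it as an illustration of the shapes rather than the authors' exact implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def global_context_block(x, w_k, w_v1, w_v2):
    """Query-independent attention sketch (hypothetical names).

    x:    (C, H, W) feature map.
    w_k:  (C,)       key projection, stands in for the 1x1 conv C -> 1.
    w_v1: (C_mid, C) and w_v2: (C, C_mid): bottleneck transform,
          standing in for the two 1x1 convs (LayerNorm omitted).
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                      # (C, N)
    # Context modeling: ONE attention map over positions, shared by
    # every query position (this is the query-independence).
    attn = softmax(w_k @ flat)                      # (N,)
    context = flat @ attn                           # (C,) global context vector
    # Transform: bottleneck with a ReLU in between.
    delta = w_v2 @ np.maximum(w_v1 @ context, 0.0)  # (C,)
    # Fusion: the SAME per-channel scalar is broadcast over the whole
    # H x W plane -- exactly the behavior the question is asking about.
    return x + delta[:, None, None]
```

Note that the attention weights `attn` are still key-dependent (they come from the features at each position); what collapses is the per-query dimension, so the residual added back is spatially constant within each channel.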