
The authors claim that, assuming the attention maps are similar across different query points, a lot of computation can be saved by making the self-attention layer query-independent. That sounds good, but the following diagram of their architecture confuses me:

diagram 4(d) from the paper

After the Transform section, when the result is added back to the original input, each channel gets only a single value, broadcast over the entire plane. I had assumed that the goal was to calculate a global attention map (i.e., query-independent but key-dependent). Could someone please explain why the block ends up adding one value per channel rather than a spatial map?
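To make the question concrete, here is my reading of the diagram as a minimal NumPy sketch. This is an assumption on my part, not the authors' code: the 1x1 convolutions are written as plain matrices, any normalization inside the transform is left out, and the names `w_k`, `w_v1`, `w_v2` and the bottleneck size are made up for illustration.

```python
import numpy as np

def query_independent_attention(x, w_k, w_v1, w_v2):
    """Sketch of a query-independent attention block (my reading, not the paper's code).

    x    : (C, H, W) input feature map
    w_k  : (C,)      hypothetical 1x1 conv producing one attention logit per position
    w_v1 : (R, C)    hypothetical bottleneck transform, first 1x1 conv
    w_v2 : (C, R)    hypothetical bottleneck transform, second 1x1 conv
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                    # (C, HW)

    # Context modeling: a single attention map shared by all query positions.
    logits = w_k @ flat                           # (HW,), depends only on the keys
    logits = logits - logits.max()                # for numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()  # softmax over positions

    # The map is used as pooling weights, so the spatial axes are summed out here.
    context = flat @ attn                         # (C,)

    # Transform: bottleneck with ReLU (normalization omitted in this sketch).
    delta = w_v2 @ np.maximum(w_v1 @ context, 0.0)  # (C,)

    # Fusion: one value per channel, broadcast over the whole H x W plane.
    out = x + delta[:, None, None]
    return out, attn.reshape(H, W), delta
```

If this reading is right, the global attention map I was expecting does exist (`attn` above, shape `(H, W)`), but it is only used as the weights of a weighted average over positions. What gets added back is the pooled and transformed vector `delta`, which is why every position in a channel receives the same value.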

submitted by /u/eukaryote31