[D] Global Context block spatial resolution
The paper in question is here: https://arxiv.org/pdf/1904.11492.pdf
The authors observe that the attention maps produced by a non-local block are almost the same for different query positions, so a lot of computation can be saved by replacing full self-attention with a single query-independent attention map. That sounds reasonable, but the diagram of their architecture (the GC block) is confusing to me:
After the transform step, when the result is added back to the input feature map, each channel only gets one value broadcast over the entire spatial plane. I had assumed that the goal was to compute a global attention map (i.e. query-independent but still key-dependent, so still varying over positions). Could someone explain why the output is just a per-channel scalar rather than a spatial map?
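To make the question concrete, here is a minimal PyTorch sketch of how I read the GC block from the figure: context modeling with one softmax attention map, a bottleneck transform, then a broadcast add. The class name, reduction ratio, and exact layer choices are my own guesses from the paper, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCBlock(nn.Module):
    """My reading of the GC block: query-independent context modeling,
    bottleneck transform, then broadcast addition back onto the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # W_k: a single 1x1 conv produces one attention map shared by all queries
        self.context_conv = nn.Conv2d(channels, 1, kernel_size=1)
        # Transform: 1x1 conv bottleneck with LayerNorm + ReLU, as I understand the figure
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.LayerNorm([channels // reduction, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        # Context modeling: softmax over the H*W positions gives one
        # (key-dependent, query-independent) weight per position.
        attn = self.context_conv(x).view(n, 1, h * w)      # N x 1 x HW
        attn = F.softmax(attn, dim=2)
        # Weighted sum over positions collapses the spatial dims entirely.
        context = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2))  # N x C x 1
        context = context.view(n, c, 1, 1)                 # one value per channel
        # Fusion: the C x 1 x 1 vector is broadcast over H x W when added.
        return x + self.transform(context)
```

If this sketch is right, then `context` is N x C x 1 x 1 before the add, which is exactly the broadcast that puzzles me: the spatial structure of the attention map seems to be thrown away rather than applied per position.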