[D] Has anyone used attention as a mechanism for integrating out a tensor dimension of unknown size?
I frequently run into a problem where one axis of a tensor in a neural network has an unknown size that depends on properties of the input data. This causes trouble when passing that tensor into fully connected layers, because those expect an input of a fixed, predetermined size. One thing I’ve noticed is that attention layers seem to handle this pretty well. They can take an axis of unknown size and “integrate” it out by taking a weighted sum over that axis, giving the most weight to the most relevant entries. I usually see attention described as giving importance to certain words or time steps in an input sequence. Does this seem like a valid use for attention?
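
To make the question concrete, here is a minimal sketch of the kind of pooling I mean, in plain Python so it runs without any framework. The names `attention_pool` and `query` are made up for illustration, and the single query vector is just one simple way to score entries. In a real network the query would be a learned parameter; here it is a fixed random vector.

```python
import math
import random


def attention_pool(x, query):
    """Collapse a variable-length list of d-dim vectors into one d-dim vector.

    x:     list of n vectors (each a list of d floats); n may differ per input
    query: d-dim scoring vector (hypothetical; would be learned in practice)
    """
    d = len(query)
    # One scalar score per entry along the variable-length axis.
    scores = [sum(xi[k] * query[k] for k in range(d)) / math.sqrt(d) for xi in x]
    # Softmax over that axis (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum removes the variable-length axis: output has size d.
    pooled = [sum(w * xi[k] for w, xi in zip(weights, x)) for k in range(d)]
    return pooled, weights


random.seed(0)
d = 4
query = [random.gauss(0, 1) for _ in range(d)]
for n in (3, 7):  # two inputs with different lengths along the unknown axis
    x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    pooled, weights = attention_pool(x, query)
    print(n, len(pooled))
```

Whatever `n` is, `pooled` always has length `d`, so it can be fed to a fully connected layer of fixed input size.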