[D] Can attention be computed implicitly by an RNN?
I have been working with sequence-to-sequence models for a while now, and the attention mechanism is a cornerstone of most seq2seq setups. It is typically added as an explicit part of the network, which has its advantages if one wants to modify how attention is computed without touching the encoder or the decoder code.
However, I am wondering whether there are ways to compute attention weights by introspecting the internals of the RNN cells, such as their gate activations. In very coarse terms, an LSTM cell, for instance, has a forget gate. Therefore, if we can see which part of the sequence is forgotten, and when, this could give us an indication of what the attention weights might be.
Now, if such a mechanism existed, here are my assumptions about the properties it would need to have:
– the RNN would need to be bidirectional
– the RNN cells would need to have a forget gate
I also think that the biggest challenge would be to correlate changes in the RNN state with specific parts of the sequence being analyzed.
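To make the idea more concrete, here is a rough sketch of what I have in mind, not an established method. It runs a hand-written LSTM over a sequence, records the input and forget gates, and scores each time step by how much of what was written at that step survives the later forget gates. The function names `lstm_gate_trace` and `retention_scores`, the gate ordering (input, forget, candidate, output), the averaging over hidden units, and the random weights are all assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gate_trace(xs, W, U, b):
    """Run a single-layer LSTM over xs (T, d_in) and record its gates.

    W: (4*h, d_in), U: (4*h, h), b: (4*h,).
    Gate order i, f, g, o is an assumption of this sketch.
    Returns the input gates and forget gates, each of shape (T, h).
    """
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    i_gates, f_gates = [], []
    for x in xs:
        z = W @ x + U @ h + b
        i = sigmoid(z[:h_dim])                # input gate
        f = sigmoid(z[h_dim:2 * h_dim])       # forget gate
        g = np.tanh(z[2 * h_dim:3 * h_dim])   # candidate cell update
        o = sigmoid(z[3 * h_dim:])            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        i_gates.append(i)
        f_gates.append(f)
    return np.array(i_gates), np.array(f_gates)

def retention_scores(i_gates, f_gates):
    """Heuristic attention-like score per time step.

    Unrolling c_T = sum_t (prod_{k>t} f_k) * i_t * g_t, the factor
    i_t * prod_{k>t} f_k says how much of step t's write is still
    present in the final cell state. We average it over hidden units
    and normalize so the scores sum to 1.
    """
    T = len(f_gates)
    survive = np.ones_like(f_gates)
    # survive[t] = product of the forget gates applied after step t
    for t in range(T - 2, -1, -1):
        survive[t] = survive[t + 1] * f_gates[t + 1]
    raw = (i_gates * survive).mean(axis=1)
    return raw / raw.sum()

# Toy usage with random (untrained) weights, just to show the shapes.
rng = np.random.default_rng(0)
T, d_in, h_dim = 6, 3, 4
xs = rng.normal(size=(T, d_in))
W = rng.normal(size=(4 * h_dim, d_in))
U = rng.normal(size=(4 * h_dim, h_dim))
b = np.zeros(4 * h_dim)
i_g, f_g = lstm_gate_trace(xs, W, U, b)
scores = retention_scores(i_g, f_g)  # one weight per input position
```

This only gives one set of weights relative to the final state of a unidirectional pass. For the bidirectional case I would presumably run the same thing on the reversed sequence and combine the two score vectors somehow, and it still does not condition on a decoder query the way explicit attention does.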
Is there work being done on this? Or are there reasons why it cannot work?