[D] What does the feed-forward neural network in the Transformer architecture actually learn?
So, after great lengths, I think I’ve gotten a solid intuition for what the self-attention layer learns: essentially, a contextualized meaning for each word in the input sequence (correct me if I’m wrong here).
Each of those “contextualized-meaning embeddings” is then put through the same 2-layer, fully connected feed-forward network, which has an output of the same size (512) and a much larger hidden layer.
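To make "the same network for every word" concrete, here is a minimal pure-Python sketch of a position-wise feed-forward layer. It assumes the ReLU form from the original Transformer paper, FFN(x) = max(0, x·W1 + b1)·W2 + b2, and uses toy sizes (8 and 32) in place of 512 and the larger hidden size. The weights are random stand-ins for learned parameters, and all the names (`feed_forward`, `d_model`, `d_ff`, and so on) are just illustrative.

```python
import random

def make_matrix(rows, cols, rng):
    """Random weights; stand-ins for parameters that would be learned."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, W, b):
    """Compute x @ W + b for a single vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(v):
    return [max(0.0, a) for a in v]

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer network applied to ONE position: expand, ReLU, project back."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy sizes for readability (assumption: the post's setup would use d_model=512).
d_model, d_ff, seq_len = 8, 32, 5
rng = random.Random(0)
W1, b1 = make_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = make_matrix(d_ff, d_model, rng), [0.0] * d_model

# One "contextualized" vector per token, as self-attention would produce.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(seq_len)]

# The same W1, b1, W2, b2 are reused at every position, and positions
# never see each other inside this layer.
outputs = [feed_forward(t, W1, b1, W2, b2) for t in tokens]
```

Note that nothing in `feed_forward` looks at neighbouring tokens: any mixing between words has already happened in the attention step, and this layer only transforms each vector on its own.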
The output is then fed into the next Transformer layer, with a layer norm and a residual connection along for the ride (going to try to leave them out of this for a while if possible).
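For completeness, the "along for the ride" part can be sketched as below. This assumes the post-norm arrangement of the original Transformer paper, LayerNorm(x + Sublayer(x)), with layer normalization rather than batch normalization, and it omits the learned gain and bias for brevity. `toy_sublayer` is a hypothetical stand-in for the feed-forward network, not the real thing.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(v):
    # Hypothetical placeholder for the feed-forward network.
    return [0.5 * a for a in v]

x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, toy_sublayer)
```

The residual means the sublayer only has to learn an update to `x`, not a full replacement for it.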
Do we have any idea what that feed-forward neural network actually learns? What is its purpose? Or why is the same feed-forward network applied to each “contextualized word”? Is it sort of learning what might be important? (But then again, didn’t the WO matrix, which combines the multi-head attention outputs into a single matrix, learn to do the same thing?)