
[D] What does the feed-forward neural network in the Transformer architecture actually learn?

So, I’ve been doing a deep dive into understanding the Transformer (in the Neural Machine Translation context). I’ve found The Illustrated Transformer and The Annotated Transformer very helpful.

After a lot of effort, I think I’ve got a solid intuition for what the self-attention layer learns: essentially a contextualized meaning for each word in the input sequence (correct me if I’m wrong here).

Each of those “contextualized-meaning embeddings” is then put through the same 2-layer, fully connected feed-forward network, which has an output of the same size (512) and a much larger hidden layer.
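To make the "same network at every position" part concrete, here is a minimal pure-Python sketch of a position-wise feed-forward layer. The helper names (`make_matrix`, `linear`, `ffn`, `position_wise_ffn`) are made up for illustration, the ReLU between the two layers follows the original Transformer paper, and the sizes are shrunk to 8 and 32 so it runs instantly (the paper uses 512 and 2048). Treat it as a sketch under those assumptions, not a reference implementation.

```python
import random

def make_matrix(rows, cols, rng):
    # Small random weights; real models learn these during training.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, W, b):
    # x has len(W) entries, W is (in x out), b has one entry per output.
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    # Expand to the larger hidden size, apply ReLU, project back down.
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def position_wise_ffn(seq, params):
    # The SAME parameters are applied to every position independently.
    return [ffn(x, *params) for x in seq]

rng = random.Random(0)
d_model, d_ff = 8, 32  # toy sizes (assumption); the paper uses 512 and 2048
params = (make_matrix(d_model, d_ff, rng), [0.0] * d_ff,
          make_matrix(d_ff, d_model, rng), [0.0] * d_model)

# Three "contextualized" token vectors, as if they came out of self-attention.
seq = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
out = position_wise_ffn(seq, params)

# Perturb only token 0 and run the layer again.
seq2 = [[v + 1.0 for v in seq[0]]] + seq[1:]
out2 = position_wise_ffn(seq2, params)
```

In this sketch, changing token 0 changes only the output at position 0. Positions 1 and 2 come out identical, because the feed-forward layer never mixes information across positions; only self-attention does that.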

The output is then fed into the next Transformer layer, with a layer norm and a residual connection along for the ride (I’m going to try to leave those out of this for now if possible).
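For reference, the wrapper around each sub-layer in the original Transformer paper is LayerNorm(x + Sublayer(x)), i.e. layer normalization rather than batch normalization. A rough pure-Python sketch for a single token vector follows; `add_and_norm` and `toy_sublayer` are hypothetical names, and the learned gain and bias of a real layer norm are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one token vector to zero mean and (roughly) unit variance.
    # A real layer norm also has a learned gain and bias, omitted here.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer's output to its own input,
    # then normalize the sum.
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Stand-in for the feed-forward network (purely illustrative).
    return [max(0.0, 2.0 * v) for v in x]

x = [1.0, -2.0, 0.5, 3.0]
out = add_and_norm(x, toy_sublayer)
```

The residual path means the sub-layer only has to learn an update to `x`, not a full replacement for it.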

Do we have any idea what that feed-forward neural network actually learns? What is its purpose? And why is the same feed-forward network applied to each “contextualized word”? Is it sort of learning what might be important? (But then again, didn’t the W^O matrix, which projects the concatenated multi-head attention outputs into a single matrix, already learn to do the same thing?)

submitted by /u/deepaurorasky