[D] Confused about using Masking in Transformer Encoder and Decoder
I don't understand why we apply masking before computing attention. I get the idea that we want to feed the decoder one word at a time, but I don't see why the implementation also uses a mask in the encoder, and another mask in the first part of the decoder, where the encoder's output is passed in.
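To make my confusion concrete, here is roughly what I think is going on, as a toy PyTorch sketch I wrote myself (not taken from the implementation I'm reading, so the shapes and names are my own assumptions): the encoder mask hides padding tokens, while the decoder mask hides future positions.

```python
# Toy sketch of the two masks (my own code, not from any particular repo).
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; masked positions are set to -inf
    # before the softmax so they receive zero attention weight.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 5, 8
x = torch.randn(1, seq_len, d)

# 1) Padding mask (encoder self-attention, and anywhere the encoder
#    output is attended to): hide the padded source positions.
src_length = 3  # pretend only 3 of the 5 tokens are real, 2 are padding
pad_mask = (torch.arange(seq_len) < src_length).view(1, 1, seq_len)

# 2) Causal / look-ahead mask (decoder self-attention):
#    position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

enc_out = attention(x, x, x, mask=pad_mask)     # encoder self-attention
dec_out = attention(x, x, x, mask=causal_mask)  # decoder self-attention
```

So my question is basically: why is the padding-style mask needed in the encoder (and on the encoder output) at all, when I thought masking was only about preventing the decoder from looking ahead?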
submitted by /u/bikanation