[D] Computing `q dot q` instead of `q dot k` when calculating scores for self-attention in Transformer
Going through the Transformer paper and its implementation, I've had a question:
In the self-attention routine in the encoder, is it plausible to compute `q dot q` instead of `q dot k` when calculating the scores for each input token?
I see that in self-attention, `memory_antecedent = query_antecedent`, and `q`, `k`, and `v` are computed (and trained) separately (cf. `compute_qkv` in T2T).
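To make the setup concrete, here is a minimal NumPy sketch of what I mean (not the actual T2T code; `attention_scores`, `w_q`, and `w_k` are hypothetical names for a single head's projections):

```python
import numpy as np

def attention_scores(x, w_q, w_k, reuse_q_as_k=False):
    """Scaled dot-product attention scores for one head.

    x:   (seq_len, d_model) input activations
    w_q: (d_model, d_k) query projection
    w_k: (d_model, d_k) key projection (unused if reuse_q_as_k=True)
    """
    q = x @ w_q
    k = q if reuse_q_as_k else x @ w_k   # proposed variant: reuse q as k
    return (q @ k.T) / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, d_model = 8
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))

scores_qk = attention_scores(x, w_q, w_k)                     # standard q·k
scores_qq = attention_scores(x, w_q, w_k, reuse_q_as_k=True)  # proposed q·q
```

One structural difference I can see: with `q dot q`, the pre-softmax score matrix is symmetric (`scores[i, j] == scores[j, i]`), whereas the standard `q dot k` scores are not symmetric in general.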
Would using the same `q` for the score computation (rather than having a separate `k`) seriously degrade performance?