[D] Computing `q dot q` instead of `q dot k` when calculating scores for self-attention in Transformer
Going through the Transformer paper and its implementation, I've had a question:
In the self-attention routine in the encoder, is it plausible to compute `q dot q` instead of `q dot k` when calculating the scores for each input token?
I see that in self-attention, `memory_antecedent = query_antecedent`, and `q`, `k`, and `v` are computed (and trained) separately (cf. `compute_qkv` in T2T).
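To make the setup concrete, here is a minimal NumPy sketch of what I mean (not the actual T2T code; `attention_scores`, `w_q`, and `w_k` are hypothetical names for a single head's projections):

```python
import numpy as np

def attention_scores(x, w_q, w_k, reuse_q_as_k=False):
    """Scaled dot-product attention scores for one head.

    x:   (seq_len, d_model) input activations
    w_q: (d_model, d_k) query projection
    w_k: (d_model, d_k) key projection (unused if reuse_q_as_k=True)
    """
    q = x @ w_q
    k = q if reuse_q_as_k else x @ w_k   # proposed variant: reuse q as k
    return (q @ k.T) / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, d_model = 8
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))

scores_qk = attention_scores(x, w_q, w_k)                     # standard q·k
scores_qq = attention_scores(x, w_q, w_k, reuse_q_as_k=True)  # proposed q·q
```

One structural difference I can see: with `q dot q`, the pre-softmax score matrix is symmetric (`scores[i, j] == scores[j, i]`), whereas the standard `q dot k` scores are not symmetric in general.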
Would using the same `q` for the score computation (rather than having a separate `k`) seriously degrade performance?