[D] What is the rationale behind the self-attention equation, and how did they come up with the concepts of query, key and value?
I was reading the article The Illustrated Transformer by Jay Alammar, which explains the transformer model in simple English. There is a section where he talks about self-attention and how it is calculated. There he introduces the concepts of query, key and value and the self-attention equation. I understand how query, key, value and the self-attention equation are calculated, but not how the researchers came up with the idea of these vectors and this equation.
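For reference, this is the calculation as I understand it, softmax(Q K^T / sqrt(d_k)) V, written as a minimal NumPy sketch for a single head. The sizes, the random weights and the function names are my own placeholders, not anything from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Each token vector in X is projected three times: query, key, value.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    # Row i holds the scores of token i's query against every token's key.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Output for token i is a weighted average of all value vectors.
    return weights @ V, weights

# Toy sizes, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (4, 3)
print(weights.shape)  # (4, 4)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.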
I know that in an RNN we use a hidden state because it carries information from all the previous steps: at each step the model takes the previous hidden state and the new input and produces a new hidden state. In the article he says that
query, key and value are the abstractions that are useful for calculating and thinking about attention.
But why? Where do the concepts of query, key and value come from? Why can’t it be just one vector (say, only k) or two vectors (q, k)? What is the significance of these three vectors? Why do these vectors and this equation improve attention?
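To make the "only k" option concrete, here is a small sketch of what I mean. It is my own toy setup with arbitrary sizes and random weights, not something from the article. The one thing I can see is that with a single vector the raw scores are symmetric by construction, since K K^T equals its own transpose, so token i scores token j exactly as j scores i. Separate q and k do not force that. I am not sure whether this is the actual motivation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))

Q = X @ W_q
K = X @ W_k

# Hypothetical one-vector variant: the same projection plays both roles.
scores_one = K @ K.T / np.sqrt(d_k)
# Usual two-projection scores from the self-attention equation.
scores_two = Q @ K.T / np.sqrt(d_k)

# Compare each score matrix with its transpose.
sym_one = np.allclose(scores_one, scores_one.T)
sym_two = np.allclose(scores_two, scores_two.T)
print(sym_one, sym_two)
```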
The next question: is the embedding vector necessary for a self-attention model? If the input dimension is small, can’t we use the input vector directly to calculate query, key and value and remove the embedding layer completely?
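This is roughly what I have in mind for the second question: a sketch that assumes the inputs are already small real-valued vectors (here 2 features per step, an arbitrary choice) and feeds them straight into the query, key and value projections. `attention_from_raw` is just a name I made up. Mechanically it runs; whether it works as well as having an embedding layer is exactly what I am asking.

```python
import numpy as np

def attention_from_raw(X_raw, W_q, W_k, W_v):
    # Project the raw inputs directly, with no embedding layer in front.
    Q, K, V = X_raw @ W_q, X_raw @ W_k, X_raw @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
seq_len, d_in, d_k = 5, 2, 4  # toy sizes; d_in is the raw input dimension
X_raw = rng.normal(size=(seq_len, d_in))
W_q = rng.normal(size=(d_in, d_k))
W_k = rng.normal(size=(d_in, d_k))
W_v = rng.normal(size=(d_in, d_k))

out = attention_from_raw(X_raw, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```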