
[D] What is the rationale behind the self-attention equation, and how did they come up with the concepts of query, key and value?

I was reading Jay Alammar's article The Illustrated Transformer, which explains the transformer model in simple English. There is a section where he talks about self-attention and how it is calculated. There he introduces the concepts of query, key and value, and the self-attention equation. I understand how the query, key and value vectors and the self-attention equation are calculated, but not how the researchers came up with the idea of these vectors and this equation.
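For reference, the calculation I mean is softmax(Q K^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch of it, assuming a single attention head with no masking and no batching. The names `self_attention`, `matvec`, `W_q`, `W_k` and `W_v` are my own illustrative choices, not taken from the article.

```python
import math

def matvec(x, W):
    """Multiply a row vector x (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X is a list of input vectors, one per token. W_q, W_k and W_v are the
    projection matrices that turn each input into its query, key and value.
    """
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    d_k = len(K[0])

    outputs = []
    for q in Q:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Turn the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted sum of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs
```

In this sketch the query and key only decide how much weight each token gets, while the value is what actually gets summed, which is the part whose motivation I am asking about.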

I know that we use hidden states in an RNN because the hidden state carries all the previous information, and in an RNN we pass this hidden state together with the new input to get the new hidden state. In the article he says that

query, key and value are the abstractions that are useful for calculating and thinking about attention.

But why? Where does the concept of query, key and value come from? Why can't it be just one vector, say only k, or two vectors (q, k)? What is the significance of these three vectors? Why do these vectors and this equation improve attention?

The next question: is the embedding vector necessary for a self-attention model? If the input dimension is small, can't we use the input vector directly to calculate the query, key and value, and remove the embedding layer completely?

submitted by /u/begooboi