
[D] Link between embedding and attention?

Recently I’ve been thinking a lot about embeddings, prompted by a silly question I asked over in MLQuestions: on my data set, an embedding vastly improved my results even though the input was a discretized continuous variable.

So I have been looking for some intuition about what embeddings actually do, apart from providing a vector space for one-hot encodings. One thing they do allow, if imposed on a discretized continuous space, is to break up the ordering relationship: each interval becomes completely independent in subsequent mapping relations, just as there is no ordering relation between words in a vocabulary. In cases where a continuous mapping is particularly non-linear and convoluted, maybe this “breaking up” via discretization allows a more efficient expression of the mapping: each interval can have its own “starting point” in a space that better maps to the target space.
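To make the idea concrete, here is a minimal NumPy sketch (my own illustration, not from the original question; the bucket count, dimensions, and random table are all made up) of discretizing a continuous variable and giving each interval its own independent vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize a continuous variable on [0, 1] into K intervals, then
# give each interval its own freely learnable vector (random
# placeholder weights here). Adjacent intervals share nothing, so the
# ordering relation of the original variable is broken.
K, D = 8, 4                                  # buckets, embedding dim
edges = np.linspace(0.0, 1.0, K + 1)[1:-1]   # interior bucket boundaries
table = rng.normal(size=(K, D))              # embedding table, one row per bucket

x = np.array([0.03, 0.49, 0.51, 0.97])       # continuous inputs
idx = np.digitize(x, edges)                  # hard bucket index per input
emb = table[idx]                             # table lookup -> shape (4, D)
```

Note that 0.49 and 0.51 land in different buckets and get completely unrelated vectors, which is exactly the “breaking up” of the ordering described above.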

Well, thinking that maybe this is interesting, I started wondering how a discretization + embedding layer could be inserted into the middle of a neural network, seeing as an embedding is usually only the first layer because the table lookup (or one-hot encoding) is not differentiable.

I started thinking that a continuous approximation of discretization would be to replace the embedding lookup with a logistic function that modulates the index into the embedding table. Similarly, a one-hot lookup would become a soft lookup via a softmax function.
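The softmax version of that soft lookup can be sketched in a few lines of NumPy (again my own illustration; the scores are hand-picked stand-ins for what an earlier layer would produce):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

K, D = 8, 4                          # table slots, embedding dim
rng = np.random.default_rng(1)
table = rng.normal(size=(K, D))      # the embedding table

# Hard lookup: table[k], not differentiable in k.
# Soft lookup: one score per slot (hand-picked here; in a network an
# earlier layer would produce them), squashed by softmax, then a
# weighted sum of ALL rows -- differentiable in scores and table alike.
scores = np.array([0.2, -1.0, 3.0, 0.5, -0.3, 1.1, 0.0, -2.0])
weights = softmax(scores)            # a "soft index" over the table
soft_emb = weights @ table           # shape (D,), convex mix of rows

# Sharpening the scores (temperature -> 0) recovers a hard lookup
# of the argmax row.
hard_emb_approx = softmax(scores / 1e-3) @ table
```

Because every row contributes with some weight, gradients flow into the whole table, which is what makes it usable mid-network.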

Then suddenly I realized: that is exactly a description of attention models, a differentiable table lookup. Is attention just a continuous version of discrete vector-space embeddings? Is that why attention is so powerful: because it allows a conditioned “remapping” of a spatial transform? Just as an embedding finds an optimal point in a vector space for each word in a vocabulary, an attention model finds an optimal transform for a given context (i.e. for a given distribution over a continuous space) to ease the work of the rest of the network.

Please tell me if I am out to lunch 🙂 I thought I might be onto some interesting ideas but I would be delighted to know if I simply stumbled onto a better understanding of something that is already known to work well!

submitted by /u/radarsat1