[D] word2vec architecture
I was trying to understand the skip-gram model of word2vec, and I'm having trouble with the details. I'm clear about the high-level idea: given a word, predict the context of the word. However, when you actually train the model, what are the input and the output for a particular training instance? To be concrete, and disregarding refinements like negative sampling: if I have the sentence "it is a beautiful day today", the input to the CBOW version would be the average of the one-hot encodings of "it", "is", "a", "day", "today", and the output should ideally be the one-hot encoding of "beautiful". For skip-gram I'm confused. Given the one-hot encoding of "beautiful" as input, what should the output be? Should it be the average of the one-hot encodings of "it", "is", "a", "day", "today" in a single training instance, or each of "it", "is", "a", "day", "today" as the target in 5 separate training instances? I tried going through the gensim codebase to see what it does, but it wasn't clear to me.
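To make the two readings concrete, here is a small sketch of the training instances I have in mind. It is only an illustration of the question, not what gensim does: the helper names (`one_hot`, `average`) are made up, the vocabulary is just the six words of the sentence, and I'm assuming the context window covers the whole sentence.

```python
# Toy vocabulary: the six distinct words of the example sentence.
sentence = "it is a beautiful day today".split()
vocab = {word: index for index, word in enumerate(sentence)}


def one_hot(word):
    """Return the one-hot encoding of `word` over the toy vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


center = "beautiful"
context = [word for word in sentence if word != center]

# CBOW: one instance, averaged context as input, center word as target.
cbow_instance = (average([one_hot(w) for w in context]), one_hot(center))

# Skip-gram, reading A: one instance whose target is the averaged context.
skipgram_single = [(one_hot(center), average([one_hot(w) for w in context]))]

# Skip-gram, reading B: one instance per context word (5 instances here).
skipgram_separate = [(one_hot(center), one_hot(w)) for w in context]

print(len(skipgram_single), len(skipgram_separate))
```

Reading A produces one (input, target) pair with a "soft" target that puts 0.2 on each context word, while reading B produces five pairs that share the same input and each have a proper one-hot target.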
As an extension to this question, I also wanted to know what happens in negative sampling. My understanding is that, instead of requiring every element of the output vector to match the expected one-hot encoding, we only enforce 1s and 0s at a few selected positions (1 for the positive sample, 0 for each negative sample), so only the weights tied to those positions receive gradient updates, which reduces the amount of back-propagation. Is this correct?
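Here is a rough sketch of how I picture a single negative-sampling update, so the question is easier to answer. Everything in it is an assumption on my part: `negative_sampling_step` is a made-up helper, the negatives are drawn uniformly (I don't think that is how real implementations pick them), and for simplicity it updates only the output vectors, not the input (hidden) vector.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_step(h, w_out, positive, negatives, lr=0.1):
    """One SGD step on the output vectors only (simplified sketch).

    h         : hidden vector of the input word
    w_out     : list of output vectors, one row per vocabulary word
    positive  : index of the true context word (target label 1)
    negatives : indices of sampled noise words (target label 0)
    Returns the set of row indices that were updated.
    """
    touched = set()
    targets = [(positive, 1.0)] + [(n, 0.0) for n in negatives]
    for idx, label in targets:
        score = sum(a * b for a, b in zip(h, w_out[idx]))
        grad = sigmoid(score) - label  # d(logistic loss) / d(score)
        w_out[idx] = [w - lr * grad * a for w, a in zip(w_out[idx], h)]
        touched.add(idx)
    return touched


random.seed(0)
vocab_size, dim = 6, 4
h = [random.uniform(-0.5, 0.5) for _ in range(dim)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)]
         for _ in range(vocab_size)]
before = [row[:] for row in w_out]

positive = 4  # e.g. the index of "day" in a six-word toy vocabulary
candidates = [i for i in range(vocab_size) if i != positive]
negatives = random.sample(candidates, 2)  # uniform sampling: an assumption

touched = negative_sampling_step(h, w_out, positive, negatives)
unchanged = [i for i in range(vocab_size) if w_out[i] == before[i]]
print(sorted(touched), unchanged)
```

If my understanding is right, only the rows for the positive word and the sampled negatives change in each step, and the other rows of the output matrix are left alone, which is where the savings over a full softmax would come from.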