[D] How are models trained to produce contextualized word embeddings?
I am a little confused about how models like ELMo and BERT are actually trained. In ELMo, the model builds a representation of a word from its forward and backward context, while in BERT the transformer encoder uses attention over the other words in the input to build the representation of the masked word(s). In both models, that representation is then fed to a softmax layer over the vocabulary, correct?

Say we have the two sentences “the bank of the river” and “the central bank of Germany”. The word “bank” should get a different representation in each sentence because the contexts differ. However, once the representation is passed to the softmax layer, in both cases we want the output to assign the highest probability to the index of “bank” in the vocabulary.

How is this achieved if the two representations are different? How can the model learn to produce different contextual representations if, in the end, the softmax output must still point to the same vocabulary word? Shouldn't this push all representations of “bank” toward the same thing regardless of context, namely whatever makes the softmax assign the highest probability to the true target word?
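To make the setup concrete, here is a minimal toy sketch of the situation I am describing. It is not ELMo's or BERT's actual code. The five-word vocabulary, the 2-dimensional hidden vectors `h_river` and `h_finance`, and the output weight matrix `W` are all hand-picked, hypothetical values. It only shows that one shared linear + softmax layer can map two different vectors to the same most likely word, since the argmax only depends on which logit is largest.

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["the", "bank", "of", "river", "germany"]

# Hypothetical output-layer weights: one row per vocabulary word,
# one column per hidden dimension. Hand-picked, not learned.
W = np.array([
    [-1.0, -1.0],  # the
    [ 1.0,  1.0],  # bank
    [-1.0,  0.0],  # of
    [ 1.0, -1.0],  # river
    [-1.0,  1.0],  # germany
])

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Two different (made-up) contextual representations of "bank".
h_river = np.array([3.0, 1.0])    # stands in for "the bank of the river"
h_finance = np.array([1.0, 3.0])  # stands in for "the central bank of Germany"

# Same output layer applied to both representations.
p_river = softmax(W @ h_river)
p_finance = softmax(W @ h_finance)

pred_river = vocab[int(np.argmax(p_river))]
pred_finance = vocab[int(np.argmax(p_finance))]

# The runner-up word differs, so the vectors still carry context.
second_river = vocab[int(np.argsort(p_river)[-2])]
second_finance = vocab[int(np.argsort(p_finance)[-2])]

print(pred_river, pred_finance)
print(second_river, second_finance)
```

In this toy case both vectors make “bank” the most likely word, while the second most likely word differs between them. So the softmax target alone does not seem to force the two vectors to be identical. Whether this is what actually happens during training in ELMo and BERT is exactly what I am asking about.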