[D] Why does the BERT paper say that standard conditional language models cannot be bidirectional?
In the original BERT paper, it is stated on page 4 (bottom of the first column) that:
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
It’s not at all obvious to me why, given the sentence “I like funny cats”, predicting the word “funny” while conditioning on the fact that it is preceded by “I” and “like” and followed by “cats” would be trivial, or how the model could “indirectly see” the target word.
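My best guess at what the quote means by a word indirectly seeing itself "in a multi-layered context" is sketched below in Python. This is my own toy model, not something from the paper: `dependency_sets`, `left_to_right`, and `bidirectional` are hypothetical names, and I am assuming each layer simply mixes the previous layer's states at whatever positions a visibility rule allows. It does no learning at all; it only tracks which input tokens each position's state could depend on.

```python
def dependency_sets(n_tokens, n_layers, visible):
    """For each position i, return the set of input positions that the
    top-layer state at i can depend on (toy model, assumption only)."""
    # Layer 0: each embedding depends only on its own token.
    deps = [{i} for i in range(n_tokens)]
    for _ in range(n_layers):
        # The state at i mixes the previous-layer states at every visible j,
        # so it inherits everything those states depended on.
        deps = [
            set().union(*(deps[j] for j in range(n_tokens) if visible(i, j)))
            for i in range(n_tokens)
        ]
    return deps


def left_to_right(i, j):
    # When predicting token i, only strictly earlier positions are visible.
    return j < i


def bidirectional(i, j):
    # When predicting token i, every position except i itself is visible.
    return j != i


tokens = ["I", "like", "funny", "cats"]
target = tokens.index("funny")

for name, rule in [("left-to-right", left_to_right), ("bidirectional", bidirectional)]:
    for n_layers in (1, 2):
        deps = dependency_sets(len(tokens), n_layers, rule)
        leaked = target in deps[target]
        print(f"{name:14s} layers={n_layers} target sees itself: {leaked}")
```

If this reading is right, a single bidirectional layer that hides the target position is fine, but with two layers the states at "I", "like" and "cats" already contain information about "funny", so the target position gets it back through them. In the left-to-right case that never happens at any depth. I am not sure this is what the authors meant.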
I have seen this question asked on a number of online platforms, but it never got a response. It would be great if someone with a good understanding of this could give an explanation.