[D] Objective: Masked Language Model vs Autoencoding
Let’s say we have a simple “autoencoding transformer” architecture (a rough code sketch follows the list):
- encoder
- bottleneck (Z)
- decoder
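For concreteness, here is a minimal PyTorch sketch of what I have in mind; the module name, dimensions, and the linear-projection bottleneck are just illustrative assumptions, not any specific published architecture:

```python
import torch
import torch.nn as nn

class ToyAutoencodingTransformer(nn.Module):
    """Toy sketch: embed -> transformer encoder -> low-dim bottleneck Z
    -> second transformer stack as a (non-causal) decoder -> per-token logits."""
    def __init__(self, vocab_size=1000, d_model=64, d_z=8, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_z = nn.Linear(d_model, d_z)      # bottleneck: Z lives here
        self.from_z = nn.Linear(d_z, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq) token ids
        h = self.encoder(self.embed(tokens))
        z = self.to_z(h)                          # latent representation Z
        return self.out(self.decoder(self.from_z(z)))  # per-token logits
```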
We can train the model either using (both objectives are sketched in code below the list):
- the Masked Language Model (MLM) objective, where we mask random inputs (replace them with a null token) and measure the loss only on reconstruction of the masked inputs
- or the Autoencoding objective, where we don’t mask anything and measure the loss on reconstruction of all inputs
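In code, the only difference between the two objectives is where the corruption happens and which positions contribute to the loss. A hedged sketch, assuming the toy model above; the mask probability, `MASK_ID`, and the cross-entropy formulation are my own assumptions:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved "null"/mask token id

def mlm_loss(model, tokens, mask_prob=0.15):
    """Mask random positions; score reconstruction only at the masked positions."""
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    logits = model(corrupted)                      # (batch, seq, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])

def autoencoding_loss(model, tokens):
    """No corruption; score reconstruction of every input token."""
    logits = model(tokens)
    return F.cross_entropy(logits.flatten(0, 1), tokens.flatten())

# Toy usage: either loss plugs into the same training loop.
tokens = torch.randint(1, 1000, (4, 32))           # random toy batch of token ids
model = ToyAutoencodingTransformer()
loss = mlm_loss(model, tokens)                      # or autoencoding_loss(model, tokens)
loss.backward()
```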
Now we ask about the properties of Z, the latent representation of the data, after the model is trained. Will Z differ between the two objectives? If so, how? Will it capture different information? Which objective will preserve more information in Z?
Does this have an obvious interpretation? Any intuitions?