[D] Why does backtranslation work?
I think I must be misunderstanding how backtranslation works, because I’m not seeing how it could help. I’ll describe my current understanding, then ask my questions.
The usual setup is that you have some small set B of parallel data between a source and a target language. Your goal is to build a model that takes a sentence in the source language and produces its translation in the target language.
In addition to the small dataset B, you also have some potentially very large corpus A of monolingual data in the target language. In order to leverage this data, you train a model in the reverse direction, i.e. target to source, by using B with the entries flipped. Then you use this reverse model to translate the entries in A, which gives you A’: pairs of a synthetic source sentence and the original target sentence. Finally, you add A’ to B to get some final set C, on which you then train the source –> target model.
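To make sure I have the pipeline right, here is a minimal Python sketch of my understanding. Everything in it is an assumption for illustration: `train` is a toy word-for-word lookup standing in for real NMT training, and `backtranslate_pipeline` is a name I made up, not something from a paper or library.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]             # (source sentence, target sentence)
Translator = Callable[[str], str]  # maps a sentence to its translation


def train(pairs: List[Pair]) -> Translator:
    """Toy stand-in for model training (assumption, not a real NMT system).

    Learns a word-for-word lookup table, assuming the two sides of each
    pair are aligned by word position. Unknown words are copied through.
    """
    table = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)  # keep the first mapping seen
    return lambda sentence: " ".join(table.get(w, w) for w in sentence.split())


def backtranslate_pipeline(B: List[Pair], A: List[str]) -> Tuple[Translator, List[Pair]]:
    """Hypothetical helper: returns the forward model and the training set C."""
    # 1. Flip B and train the reverse (target -> source) model.
    reverse = train([(tgt, src) for src, tgt in B])
    # 2. Translate the monolingual target corpus A into synthetic source text.
    #    Only the source side of A' is machine-generated; the target side
    #    is the original text from A.
    A_prime = [(reverse(tgt), tgt) for tgt in A]
    # 3. Train the forward (source -> target) model on C = B + A'.
    C = B + A_prime
    return train(C), C
```

With B = [("le chat", "the cat"), ("le chien", "the dog")] and A = ["the dog", "the cat"], this produces a C with four pairs, two from B and two synthetic ones.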
In some sense, this should only help if your target –> source model is good. However, you trained this model only on B. This raises the following questions:
1) If you can build a good target –> source model from just B, why can’t you do the same with source –> target?
2) If you do get some improvements, why can’t you repeat the process? I.e. train the source –> target model using C, then grab some large monolingual corpus from the source language, translate it with that model to make some new set A”, add A” to C and re-train the target –> source model, then make more source –> target examples by backtranslating with the new model. Rinse and repeat till you run out of compute.
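The loop I have in mind for question 2 would look roughly like the sketch below. Again, `train` is a toy lookup-table placeholder for real model training and `iterative_backtranslation` is a hypothetical name. One simplification to flag: I regenerate the synthetic pairs from scratch each round rather than accumulating A’, A”, and so on into one growing set.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]             # (input sentence, output sentence)
Translator = Callable[[str], str]


def train(pairs: List[Pair]) -> Translator:
    """Toy stand-in for model training (assumption): word-for-word lookup."""
    table = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return lambda sentence: " ".join(table.get(w, w) for w in sentence.split())


def iterative_backtranslation(
    B: List[Pair],
    mono_src: List[str],
    mono_tgt: List[str],
    rounds: int = 2,
) -> Tuple[Translator, Translator]:
    """Alternate between improving the forward and backward models."""
    flipped_B = [(tgt, src) for src, tgt in B]
    fwd = train(B)          # source -> target
    bwd = train(flipped_B)  # target -> source
    for _ in range(rounds):
        # Synthetic source + real target text: new data for the forward model.
        synth_for_fwd = [(bwd(t), t) for t in mono_tgt]
        fwd = train(B + synth_for_fwd)
        # Synthetic target + real source text: new data for the backward model.
        synth_for_bwd = [(fwd(s), s) for s in mono_src]
        bwd = train(flipped_B + synth_for_bwd)
    return fwd, bwd
```

In both directions the model being trained only ever sees real text on its output side; the machine-generated text is always on the input side.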
Finally, is there a good reference for this kind of stuff? Most papers which use backtranslation are extremely vague about it.