[D] Need advice: PyTorch distributed setup performs worse than single GPU
The model is Tacotron 2, based on this repo.
I got it running with PyTorch DDP, and it works, but the gap between single-GPU and distributed training seems too large to me.
The single-GPU loss is much better and more stable, and 8 GPUs give only about a 2x speedup at 8x the cost.
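To put numbers on that, here is a rough sketch of the arithmetic. It assumes "x2 time gain" means a 2x wall-clock speedup for the same amount of training and that cost grows linearly with the number of GPUs; the function names are purely illustrative.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / num_gpus


def relative_cost(speedup: float, num_gpus: int) -> float:
    """GPU-time spent on the same work, relative to a single GPU."""
    return num_gpus / speedup


if __name__ == "__main__":
    # Figures from the post: 8 GPUs, roughly 2x faster (assumed wall-clock).
    eff = scaling_efficiency(speedup=2.0, num_gpus=8)
    cost = relative_cost(speedup=2.0, num_gpus=8)
    print(f"scaling efficiency: {eff:.0%}")   # 25%
    print(f"relative cost: {cost:.1f}x")      # 4.0x
```

So under those assumptions the run is at about 25% scaling efficiency and costs about 4x as much GPU-time for the same work.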
Am I missing something obvious?
Could it be because of batch norm? I tried sync batch norm, but it does not really make a difference.
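For reference, the usual way to switch a model to synchronized batch norm in PyTorch looks roughly like the sketch below. The small `nn.Sequential` model is a hypothetical stand-in, not the actual Tacotron 2 code from the repo, and the DDP wrapping is left as a comment because it needs an initialized process group.

```python
import torch.nn as nn

# Hypothetical stand-in for the real model (assumption, not the repo's code).
model = nn.Sequential(
    nn.Conv1d(80, 512, kernel_size=5, padding=2),
    nn.BatchNorm1d(512),
    nn.ReLU(),
)

# Swap every BatchNorm*d layer for SyncBatchNorm so that batch statistics
# are computed across all processes instead of per GPU.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In a real run, do the conversion before wrapping in DDP, e.g.:
# sync_model = nn.parallel.DistributedDataParallel(
#     sync_model.to(local_rank), device_ids=[local_rank]
# )  # local_rank is assumed to come from the launcher

print(type(sync_model[1]).__name__)
```

If the layer type printed here is still a plain `BatchNorm1d` in the real model, the conversion did not take effect, which would be one explanation for seeing no difference.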