[D] Slow pytorch distributed training
My network is 1 Gbit ethernet and i am trying to use pytorch distributed training on two 8-gpu servers. Training procedure is simple classification objective with feed-forward network. I experience significant slowdown in comparison with single 8-gpu server training. Also “nload” tool shows full bandwidth usage even for small model (resnet18).
Is my network too slow for distributed training? If it is, what bandwidth (in Gbit/s) do I need to train heavy models like resnet101?