My network is 1 Gbit Ethernet and I am trying to use PyTorch distributed training across two 8-GPU servers. The training procedure is a simple classification objective with a feed-forward network. I see a significant slowdown compared with training on a single 8-GPU server, and the "nload" tool shows full bandwidth usage even for a small model (ResNet-18).
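For context, here is a rough back-of-envelope sketch (my own estimate, not something from the post): DDP all-reduces the full fp32 gradient every step, so the gradient payload alone can saturate a 1 Gbit/s link at even modest step rates.

```python
# Rough estimate (my own, not from the post) of per-step gradient traffic for
# DDP with fp32 gradients. Real traffic also depends on gradient bucketing,
# the all-reduce algorithm, and how much communication overlaps with backward.
import torchvision.models as models

n_params = sum(p.numel() for p in models.resnet18().parameters())
grad_mb = n_params * 4 / 1e6  # fp32 -> 4 bytes per gradient element
print(f"resnet18: ~{grad_mb:.0f} MB of gradients all-reduced per step")
# Roughly 47 MB per step vs. ~125 MB/s on a 1 Gbit/s link: a few steps per
# second is enough to fill the network, which matches what nload shows.
```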
Is my network too slow for distributed training? If it is, what bandwidth (in Gbit/s) do I need to train heavier models like ResNet-101?
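And a companion sketch for the second question: the bandwidth you need scales roughly with gradient size divided by the step time you want to hide the all-reduce behind. The 0.2 s step time below is a placeholder I made up for illustration, not a measured number.

```python
# Rough sizing sketch (assumptions: fp32 gradients, a hypothetical target step
# time of 0.2 s, and that the all-reduce must finish within one step).
import torchvision.models as models

n_params = sum(p.numel() for p in models.resnet101().parameters())
grad_bytes = n_params * 4          # ~44.5M params -> ~178 MB of fp32 gradients
step_time_s = 0.2                  # placeholder: measure your own step time

# A ring all-reduce moves roughly 2x the gradient size per node per step.
needed_gbit = 2 * grad_bytes * 8 / step_time_s / 1e9
print(f"resnet101 needs roughly {needed_gbit:.0f} Gbit/s at a {step_time_s}s step")
# That lands well above 10 Gbit/s, which is why multi-node setups usually use
# 10/25/100 GbE or InfiniBand rather than 1 Gbit Ethernet.
```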
submitted by /u/borislestsov