[D] How do you change your metaparameters when scaling to multiple G(T)PUs?
Recently I’ve been scaling my work to larger models and datasets and am finally at a point where distributed training on multiple GPUs is required. I’ve found TensorFlow 2’s
distribute library and the
tf.distribute.MirroredStrategy [1,2] a really easy way to do synchronous distributed training on multiple GPUs on one machine. Are there general heuristics to how one should change their metaparameters as one scales from a single gpu solution to distributed gpu training?
The TensorFlow guides touch on this briefly:
In both cases (dataset or numpy), each batch of the given input is divided equally among the multiple replicas. […] Typically, you would want to increase your batch size as you add more accelerators so as to make effective use of the extra computing power. You will also need to re-tune your learning rate, depending on the model.
For example, I trained DenseNet-BC (k=12, d=100) on a single GPU with a batch size of 64 and stepwise learning rate annealing from 0.1 -> 0.001 -> 0.0001. Though the training curves look quite different when scaling to multiple GPUs, I’ve gotten essentially the same test loss and accuracy using the same learning rate scheme and an epoch-level batch size of 64*num_gpus (with very nice time-to-completion gains: ~8 hours on 1 gpu -> ~4 hours on 4 gpus).
 by Shallue et al shows that there is a universal relationship between batch size and training time, but also that this relationship varies great across dataset, neural network architecture, optimizer, and workloads. In practice, it seems like one might not have the compute to always perfectly optimize for the given task across learning rates, optimizers, and batch sizes. Are there general rules of thumb that have worked nicely for you in practice?
“If one could predict which workloads benefit most from data parallel training, then one could tailor their workloads to make maximal use of the available hardware. However, our results suggest that this will often not be straightforward, because the maximum useful batch size depends, at least somewhat, on every aspect of the workload: the neural network architecture, the dataset, and the optimizer.”