[D] Super-Convergence Skepticism
Smith and Topin’s 2017 paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates garnered quite a bit of attention, promising to cut training iterations by an order of magnitude without compromising accuracy. They propose the following change to the learning rate scheduler:
Using a 56-layer residual network, they claim that while it takes 80k iterations to train to 91% accuracy on CIFAR-10 using conventional algorithms, but that they can achieve a higher accuracy (92.4%) in only 10k iterations.
On Open Review there is concern that it’s not clear if the accuracy gains are significant (“no error bars”) and about whether this technique generalizes to other architectures.
I think we’ve mostly seen that Super-Convergence seems to converge to fine results on multiple architectures — the “train ImageNet in 3 hours for $25” is probably the most well-known example. I’ve used it to train ResNet-18, 34, and 50 on CIFAR. I’m not sure it actively/consistently improves test accuracy, but heck: even if it decreases accuracy a bit, if it cuts training time by a factor of 8 I’ll happily use it!
My main problem with the Super-Convergence paper is that the baseline they compare too seems laughably un-optimized for their problem. Consider the training run below. I grabbed the ResNet model from this github repo. There’s nothing particularly special here. Some minor things worth noting:
Obviously this isn’t an apples-to-apples comparison with the Super-Convergence paper, which trains a 56-layer ResNet. Maybe the larger network actually does just require far longer to train?
While the larger network does require more time, it is unlikely to make a huge difference. Consider the two runs I ran with ResNet50. Here I had to drop the batch size to 64 (I’m just using my personal PC). The second run has a batch size of 128 with half precision (Note: I need a brief warm-up period here due to some NaNs). While there is a clear problem with the network not converging quickly enough, the paper uses a batch size of 1000. One might be skeptical (without an actual experiment) that this would let us train in 10k iterations, but training in 20k seems easy.
To be clear, I think there are valuable ideas used in this paper. I have nothing against the learning rate test (proposed in an earlier paper). Warm-up periods seem useful as a means of being able to eventually use a higher learning rate, and I like that the LR test and training regimen acknowledge the fact that you can often use a higher learning rate after a bit of warm up (though I do think spending half of your training time warming up is excessive and I’m skeptical that it helps generalization).
The observation that you can avoid having your training loss flat-line (and wasting time!) by decreasing the learning rate slowly is useful. If the claim that this corner-cutting doesn’t have a detrimental effect on test accuracy is true (and I think it can be) then this is a very interesting observation!
But I’m not convinced at all that Super-Convergence delivers on its promise of bringing training times down by an order of magnitude — not if it is compared against a learning-rate scheduler that has been tuned to any extent.