[D] Research shows SGD with too large a mini-batch size can lead to severe overfitting in deep learning. Why doesn’t batch gradient descent have this problem?
Here is an example paper showing test performance degrading sharply as batch size grows too large: https://arxiv.org/pdf/1804.07612.pdf
Batch gradient descent computes each update over the whole dataset, i.e. it is the extreme case of a large batch. Does it suffer from the same problem? If not, why not?
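For concreteness, here is a minimal sketch of the two update schemes I'm comparing, using a toy NumPy linear-regression setup (the data, learning rate, and batch size of 32 are all made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 1000 samples, 10 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

lr = 0.01

def grad(w, Xb, yb):
    # Gradient of mean squared error over the given batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Mini-batch SGD: each update uses a random subset of the data.
w_sgd = np.zeros(10)
for _ in range(100):
    idx = rng.choice(len(X), size=32, replace=False)
    w_sgd -= lr * grad(w_sgd, X[idx], y[idx])

# Batch gradient descent: each update uses the entire dataset,
# i.e. the limiting case as batch size grows to N.
w_full = np.zeros(10)
for _ in range(100):
    w_full -= lr * grad(w_full, X, y)
```

The only difference is whether the gradient is estimated from a noisy subsample or computed exactly over all N examples, which is why I'm confused that growing the batch would hurt generalization but going all the way to full batch supposedly wouldn't.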
submitted by /u/DstnB3