Applying a sequence of simple scaling rules, we increase the SGD batch size and reduce the number of parameter updates required to train our model by an order of magnitude, without sacrificing test set accuracy. This enables us to dramatically reduce model training time. For more details, see "Don't Decay the Learning Rate, Increase the Batch Size", accepted at ICLR 2018.
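
To make the idea concrete, here is a minimal sketch (not the authors' code) of the core substitution: at the epochs where a conventional schedule would decay the learning rate, the batch size is multiplied by the same factor instead, while the learning rate stays fixed. The dataset, model, epoch boundaries, and scale factor below are illustrative assumptions.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model, purely for illustration.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
model = nn.Linear(32, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

batch_size = 128
scale_epochs = {30, 60, 80}   # epochs where a schedule would normally decay the lr
scale_factor = 5              # instead, grow the batch size by the same factor

for epoch in range(90):
    if epoch in scale_epochs:
        # Increase the batch size and leave the learning rate unchanged,
        # so each epoch needs proportionally fewer parameter updates.
        batch_size *= scale_factor
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

Because each phase processes the same number of training examples with fewer, larger batches, the total count of optimizer steps falls while the noise scale of the gradient updates evolves much as it would under learning-rate decay.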