
We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data. This type of statistic is widely used for sample size selection and has been proposed for use in deep learning, but has not been measured or applied systematically for modern training runs.
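To make this concrete, the sketch below estimates a simple form of the noise scale, tr(Σ)/|G|² (the trace of the per-example gradient covariance Σ divided by the squared norm of the true gradient G), from gradient measurements taken at two different batch sizes on the same model state. The function name, variable names, and the assumption that gradients are available as flat NumPy vectors are ours for illustration; this is a minimal sketch of one two-batch-size estimator, not a drop-in measurement tool.

```python
import numpy as np

def simple_noise_scale(g_small, g_big, b_small, b_big):
    """Estimate the simple gradient noise scale tr(Sigma) / |G|^2.

    g_small, g_big: flattened gradient vectors measured on the same model
        state with batches of size b_small and b_big (b_small < b_big).
    Returns an estimate of the noise scale, interpretable as a rough
    upper bound on the useful batch size at this point in training.
    """
    sq_small = np.dot(g_small, g_small)  # |G_{B_small}|^2 (noisy estimate)
    sq_big = np.dot(g_big, g_big)        # |G_{B_big}|^2 (less noisy)

    # Larger batches shrink the noise term by 1/B, so combining the two
    # squared norms lets us separate signal from noise:
    # unbiased estimate of the true gradient norm squared, |G|^2 ...
    g_sq = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    # ... and of the trace of the per-example gradient covariance, tr(Sigma).
    trace_sigma = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)

    return trace_sigma / g_sq
```

Because the two gradient measurements are themselves noisy, a single evaluation of this statistic fluctuates a lot; in practice one would average it over many measurements during training. Batch sizes well below the noise scale leave parallelism on the table, while batch sizes well above it give rapidly diminishing returns per example.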
