TL;DR: Batch, layer, instance, and group norm are different methods for normalizing the inputs to the layers of deep neural networks.

Ali Rahimi pointed out in his NIPS test-of-time talk that no one really understands how batch norm works (something something “internal covariate shift”?). I’m still waiting for a good explanation, but for now here’s a quick comparison of what batch, layer, instance, and group norm actually do.<sup>1</sup>

### Batch normalization

Batch norm (Ioffe & Szegedy, 2015) was the OG normalization method proposed for training deep neural networks and has empirically been very successful. It also introduced the term internal covariate shift, defined as the change in the distribution of network activations due to the change in network parameters during training.

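The goal of batch norm is to reduce internal covariate shift by normalizing each mini-batch of data using the mini-batch mean and variance. For a mini-batch of inputs $\{x_1, \ldots, x_m\}$, we compute (separately for each feature)

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

and then replace each $x_i$ with its normalized version

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\epsilon$ is a small constant added for numerical stability.<sup>2</sup> This process is repeated for every layer of the neural network.<sup>3</sup>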
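The batch norm computation can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: it assumes a 2-D input of shape `(m, D)`, and the function name `batch_norm` and the default `eps=1e-5` are choices made here. It also leaves out the learnable scale and shift and the running statistics used at test time.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature of x (shape (m, D)) over the mini-batch axis."""
    mu = x.mean(axis=0, keepdims=True)   # mini-batch mean, one value per feature
    var = x.var(axis=0, keepdims=True)   # biased mini-batch variance (divides by m)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # m = 64 examples, D = 8 features
x_hat = batch_norm(x)

# After normalization, every feature has roughly zero mean and unit variance
# when measured across the mini-batch.
print(x_hat.mean(axis=0).round(6))
print(x_hat.var(axis=0).round(3))
```

Note that the statistics depend on which examples happen to share the mini-batch, which is why small mini-batches give noisy estimates.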

### Layer normalization

Layer norm (Ba, Kiros, & Hinton, 2016) attempted to address some shortcomings of batch norm:

1. It’s unclear how to apply batch norm in RNNs
2. Batch norm needs large mini-batches to estimate statistics accurately

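Instead of normalizing each feature across the examples in a mini-batch, layer normalization normalizes across the features within each example. For input $x_i$ of dimension $D$, we compute

$$\mu_i = \frac{1}{D} \sum_{d=1}^{D} x_i^d, \qquad \sigma_i^2 = \frac{1}{D} \sum_{d=1}^{D} (x_i^d - \mu_i)^2$$

and then replace each component $x_i^d$ with its normalized version

$$\hat{x}_i^d = \frac{x_i^d - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$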
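Here is the same idea for layer norm, again as a rough NumPy sketch under the assumption of a 2-D input of shape `(m, D)`; the name `layer_norm` and the `eps` default are illustrative. The only change from batch norm is the axis the statistics are taken over.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example of x (shape (m, D)) over its own D features."""
    mu = x.mean(axis=-1, keepdims=True)   # one mean per example
    var = x.var(axis=-1, keepdims=True)   # one (biased) variance per example
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))  # m = 4 examples, D = 16 features
x_hat = layer_norm(x)

# Each example is normalized on its own, so the result for one example does
# not depend on which other examples are in the mini-batch.
single = layer_norm(x[:1])
print(np.allclose(x_hat[:1], single))  # True
```

Because nothing is shared across examples, this works with a mini-batch of size one, which is what makes it convenient for RNNs and small batches.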

### Instance normalization

Instance norm (Ulyanov, Vedaldi, & Lempitsky, 2016) hit arXiv just 6 days after layer norm, and is pretty similar. Instead of normalizing all of the features of an example at once, instance norm normalizes the features within each channel of each example separately.
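For image-like inputs this can be sketched as follows. The `(N, C, H, W)` layout, the name `instance_norm`, and the `eps` default are assumptions made for this example.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x has shape (N, C, H, W); normalize each (example, channel) slice over H and W."""
    mu = x.mean(axis=(2, 3), keepdims=True)   # one mean per example and channel
    var = x.var(axis=(2, 3), keepdims=True)   # one (biased) variance per example and channel
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(2, 3, 5, 5))  # N=2, C=3, H=W=5
x_hat = instance_norm(x)

# There is one set of statistics per (example, channel) pair.
print(x_hat.mean(axis=(2, 3)).shape)  # (2, 3)
```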

### Group normalization

Group norm (Wu & He, 2018) is somewhere between layer and instance norm: instead of normalizing features within each channel, it normalizes features within pre-defined groups of channels.<sup>4</sup>
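A small NumPy sketch of group norm, which also checks the two extreme cases mentioned in the footnote. It assumes an `(N, C, H, W)` input with consecutive channels grouped together; `group_norm`, `normalize`, and `num_groups` are names chosen here for illustration.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Normalize x using the mean and (biased) variance taken over `axes`."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """x has shape (N, C, H, W); channels are split into num_groups equal groups."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by num_groups"
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # One mean and variance per (example, group), shared by all channels in the group.
    return normalize(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))  # N=2, C=6, H=W=4

instance = normalize(x, axes=(2, 3))    # instance norm: per example and channel
layer = normalize(x, axes=(1, 2, 3))    # layer norm: per example

print(np.allclose(group_norm(x, num_groups=6), instance))  # True
print(np.allclose(group_norm(x, num_groups=1), layer))     # True
```

Anything in between, such as `num_groups=2` or `num_groups=3` here, shares statistics across a few channels at a time.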

### Summary

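Here’s a figure from the group norm paper that nicely illustrates all of the normalization techniques described above:

[Figure from Wu & He (2018): four panels (batch, layer, instance, and group norm), each showing a feature-map tensor with batch axis N, channel axis C, and spatial axes (H, W), with the entries that share one mean and variance highlighted.]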
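One way to keep the four methods straight is by which axes of an `(N, C, H, W)` tensor share a single mean and variance. The sketch below counts the statistics each method computes; the tensor layout, the `SHARED_AXES` table, and the choice of two groups are assumptions made for illustration.

```python
import numpy as np

# Axes of an (N, C, H, W) tensor that are averaged over to get one mean/variance.
SHARED_AXES = {
    "batch": (0, 2, 3),     # one statistic per channel
    "layer": (1, 2, 3),     # one statistic per example
    "instance": (2, 3),     # one statistic per (example, channel)
}

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 6, 6))  # N=8, C=4, H=W=6

stats_shapes = {name: x.mean(axis=axes).shape for name, axes in SHARED_AXES.items()}

# Group norm first splits the C channels into groups (here 2 groups of 2 channels),
# then averages within each group: one statistic per (example, group).
stats_shapes["group"] = x.reshape(8, 2, 2, 6, 6).mean(axis=(2, 3, 4)).shape

for name, shape in stats_shapes.items():
    print(name, shape)
```

Only batch norm averages over the batch axis, which is why it is the only one of the four whose output for one example depends on the rest of the mini-batch.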

#### Footnotes

1. To keep things simple and easy to remember, many implementation details (and other interesting things) will not be discussed.

2. Rather than forcing every layer's output to zero mean and unit variance, learnable scale and shift parameters ($\gamma$ and $\beta$) can be applied to the normalized values at each layer.

3. For CNNs, the pixels in each channel are normalized using the same mean and variance.

4. In its extreme cases, group norm is equivalent to instance norm (one group for each channel) and to layer norm (one group period).

#### References

1. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456).
2. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint arXiv:1607.06450.
3. Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022.
4. Wu, Y., & He, K. (2018). Group Normalization. arXiv preprint arXiv:1803.08494.