Neural network normalization
TL;DR: Batch/layer/instance/group norm are different methods for normalizing the inputs to the layers of deep neural networks.
Ali Rahimi pointed out in his NIPS test-of-time talk that no one really understands how batch norm works — something something “internal covariate shift”? I’m still waiting for a good explanation, but for now here’s a quick comparison of what batch, layer, instance, and group norm actually do.^{1}
Batch normalization
Batch norm (Ioffe & Szegedy, 2015) was the OG normalization method proposed for training deep neural networks and has empirically been very successful. It also introduced the term internal covariate shift, defined as the change in the distribution of network activations due to the change in network parameters during training.
The goal of batch norm is to reduce internal covariate shift by normalizing each minibatch of data using the minibatch mean and variance. For a minibatch of inputs $x_1, \dots, x_m$, we compute

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2,$$

and then replace each $x_i$ with its normalized version

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},$$

where $\epsilon$ is a small constant added for numerical stability.^{2} This process is repeated for every layer of the neural network.^{3}
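For the curious, here’s a minimal NumPy sketch of the normalization step above for a CNN activation in NCHW layout (the function name and layout are my choices, not from the batch norm paper, and the learnable scale/shift from footnote 2 is omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per channel, using the mean and
    variance computed over the batch and spatial dimensions (footnote 3)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel minibatch mean
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel minibatch variance
    return (x - mu) / np.sqrt(var + eps)
```

After this transform, each channel has (approximately) zero mean and unit variance across the minibatch.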
Layer normalization
Layer norm (Ba, Kiros, & Hinton, 2016) attempted to address some shortcomings of batch norm:
 It’s unclear how to apply batch norm in RNNs
 Batch norm needs large minibatches to estimate statistics accurately
Instead of normalizing examples across minibatches, layer normalization normalizes features within each example. For an input $x$ of dimension $H$, we compute

$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2,$$

and then replace each component $x_i$ with its normalized version

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}.$$
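The same idea in NumPy, for a batch of feature vectors (shape and name are my assumptions; scale/shift again omitted). Note the only difference from batch norm is which axis the statistics are computed over:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize x of shape (N, H): each example (row) is normalized
    using the mean and variance of its own H features."""
    mu = x.mean(axis=-1, keepdims=True)   # per-example mean
    var = x.var(axis=-1, keepdims=True)   # per-example variance
    return (x - mu) / np.sqrt(var + eps)
```

Because the statistics come from a single example, this works with any minibatch size (even 1) and applies naturally at each time step of an RNN.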
Instance normalization
Instance norm (Ulyanov, Vedaldi, & Lempitsky, 2016) hit arXiv just 6 days after layer norm, and is pretty similar. Instead of normalizing all of the features of an example at once, instance norm normalizes features within each channel.
Group normalization
Group norm (Wu & He, 2018) is somewhere between layer and instance norm — instead of normalizing features within each channel, it normalizes features within predefined groups of channels.^{4}
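The grouping is easy to express as a reshape (a NumPy sketch under the same assumptions as the earlier snippets):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within groups of channels:
    statistics are computed over each group's channels and spatial dims."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)   # per-(example, group) mean
    var = g.var(axis=(2, 3, 4), keepdims=True)   # per-(example, group) variance
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Setting `num_groups=C` recovers instance norm and `num_groups=1` recovers layer norm, matching footnote 4.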
Summary
Here’s a figure from the group norm paper that nicely illustrates all of the normalization techniques described above:
Footnotes

To keep things simple and easy to remember, many implementation details (and other interesting things) will not be discussed. ↩

Instead of normalizing to zero mean and unit variance, learnable scale and shift parameters can be introduced at each layer. ↩

For CNNs, the pixels in each channel are normalized using the same mean and variance. ↩

In its extreme cases, group norm is equivalent to instance norm (one group per channel) and to layer norm (a single group containing all channels). ↩
References
 Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456).
 Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. ArXiv Preprint ArXiv:1607.06450.
 Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance Normalization: The Missing Ingredient for Fast Stylization. ArXiv Preprint ArXiv:1607.08022.
 Wu, Y., & He, K. (2018). Group normalization. ArXiv Preprint ArXiv:1803.08494.