Fréchet Inception Distance
TL;DR: FID measures the distance between the Inception-v3 activation distributions for generated and real samples
I often run into things that are new to me (but not explained in that particular paper) when I’m reading. A recent example is Fréchet Inception Distance (FID), a method for measuring the quality of generated image samples.
Inception Score
Before FID, the Inception Score (IS) was the original method for measuring the quality of generated samples. (Salimans et al., 2016) proposed applying an Inception-v3 network pretrained on ImageNet to generated samples and then comparing the conditional label distribution with the marginal label distribution:

$$\text{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{KL}\big(p(y \mid x) \,\|\, p(y)\big)\right]\right)$$
Ideally, the generator should:

- Generate images with meaningful objects, so that the conditional label distribution $p(y \mid x)$ has low entropy.
- Generate diverse images, so that the marginal label distribution $p(y)$ has high entropy.
Higher scores are better, corresponding to a larger KL divergence between the two distributions.
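As a concrete sketch, the score can be computed directly from classifier softmax outputs. This is a minimal NumPy version (the function name and `eps` smoothing term are my own; it takes an (N, K) array of class probabilities rather than running the Inception network itself):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(E_x[KL(p(y|x) || p(y))]) from an (N, K) array of
    classifier softmax outputs (each row sums to 1)."""
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal label distribution p(y): average the conditionals over samples
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-sample KL divergence between conditional and marginal distributions
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

This makes the two ideals above concrete: perfectly confident one-hot rows spread evenly over all K classes attain the maximum score of K, while identical conditional and marginal distributions give the minimum score of 1.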
Fréchet Inception Distance
The FID is supposed to improve on the IS by actually comparing the statistics of generated samples to real samples, instead of evaluating generated samples in a vacuum.^{1} (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) propose using the Fréchet distance between two multivariate Gaussians,

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$

where $X_r \sim \mathcal{N}(\mu_r, \Sigma_r)$ and $X_g \sim \mathcal{N}(\mu_g, \Sigma_g)$ are the 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively.^{2}
Lower FID is better, corresponding to more similar real and generated samples as measured by the distance between their activation distributions.
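Given pool3 activations for real and generated samples, the distance is computed by fitting a Gaussian to each set and plugging the means and covariances into the formula. A minimal sketch using NumPy and SciPy's matrix square root (the helper names are my own, and `acts` stands in for whatever activation arrays you extract):

```python
import numpy as np
from scipy import linalg

def activation_statistics(acts):
    """Mean and covariance of an (N, d) array of network activations."""
    return acts.mean(axis=0), np.cov(acts, rowvar=False)

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 (sigma_r sigma_g)^{1/2})."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Note that reported FID numbers also depend on implementation details like the number of samples used to estimate the statistics, so established implementations are typically used when comparing to published scores.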
Measuring progress
It’s helpful to look at some of the IS and FID scores that have been reported to get a feel for what good/bad scores look like and see how different models compare. These scores are for unsupervised models on CIFAR-10:^{3}
| Paper | IS | FID |
| --- | --- | --- |
| Real CIFAR-10 data (Salimans et al., 2016) | 11.24 | – |
| Unsupervised representation learning with deep convolutional generative adversarial networks (DCGAN) (Radford, Metz, & Chintala, 2015) | 6.16^{4} | 37.1^{5} |
| Conditional image generation with PixelCNN decoders (van den Oord et al., 2016) | 4.60^{6} | 65.9^{6} |
| Adversarially learned inference (ALI) (Dumoulin et al., 2016) | 5.34^{7} | – |
| Improved techniques for training GANs (Salimans et al., 2016) | 6.86 | – |
| Improving generative adversarial networks with denoising feature matching (Warde-Farley & Bengio, 2016) | 7.72 | – |
| Learning to generate samples from noise through infusion training (Bordes, Honari, & Vincent, 2017) | 4.62 | – |
| BEGAN: Boundary equilibrium generative adversarial networks (Berthelot, Schumm, & Metz, 2017) | 5.62 | – |
| MMD GAN: Towards deeper understanding of moment matching network (Li, Chang, Cheng, Yang, & Póczos, 2017) | 6.17 | – |
| Improved training of Wasserstein GANs (Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017) | 7.86 | – |
| Coulomb GANs: Provably optimal Nash equilibrium via potential fields (Unterthiner et al., 2017) | – | 27.3 |
| GANs trained by a two time-scale update rule converge to a local Nash equilibrium (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) | – | 24.8 |
| Autoregressive quantile networks for generative modeling (AIQN) (Ostrovski, Dabney, & Munos, 2018) | 5.29 | 49.5 |
| Spectral normalization for generative adversarial networks (SN-GAN) (Miyato, Kataoka, Koyama, & Yoshida, 2018) | 8.22 | 21.7 |
| Learning implicit generative models with the method of learned moments (Ravuri, Mohamed, Rosca, & Vinyals, 2018) | 7.90 | 18.9 |
There are no universally agreed-upon performance metrics for unsupervised learning, and people have already pointed out many shortcomings of these Inception-based methods (Barratt & Sharma, 2018). Until something better comes along, though, they’re going to show up in every paper, so it’s worth knowing what they are.
Footnotes

2. Other feature layers of the Inception-v3 network can also be used, with different dimensionalities. Note that at least $d$ samples are needed to estimate the Gaussian statistics for $d$-dimensional features.
3. It’s hard to tell if different papers are using consistent methodology for computing IS and FID (e.g., some use 5k samples for FID while others use 50k), and some papers report different numbers for the same model. I tried to pick the best score reported by each paper for their own method where possible and yolo’d the rest.
4. Reported by (Huang, Li, Poursaeed, Hopcroft, & Belongie, 2017).
5. Reported by (Ostrovski, Dabney, & Munos, 2018).
6. Reported by (Ostrovski, Dabney, & Munos, 2018).
7. Reported by (Warde-Farley & Bengio, 2016).
References
- Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems (pp. 2234–2242).
- Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (pp. 6626–6637).
- Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv Preprint ArXiv:1511.06434.
- van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., & others. (2016). Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems (pp. 4790–4798).
- Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., & Courville, A. (2016). Adversarially learned inference. ArXiv Preprint ArXiv:1606.00704.
- Warde-Farley, D., & Bengio, Y. (2016). Improving generative adversarial networks with denoising feature matching.
- Bordes, F., Honari, S., & Vincent, P. (2017). Learning to generate samples from noise through infusion training. ArXiv Preprint ArXiv:1703.06975.
- Berthelot, D., Schumm, T., & Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. ArXiv Preprint ArXiv:1703.10717.
- Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., & Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems (pp. 2203–2213).
- Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (pp. 5767–5777).
- Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., & Hochreiter, S. (2017). Coulomb GANs: Provably optimal Nash equilibria via potential fields. ArXiv Preprint ArXiv:1708.08819.
- Ostrovski, G., Dabney, W., & Munos, R. (2018). Autoregressive quantile networks for generative modeling. ArXiv Preprint ArXiv:1806.05575.
- Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. ArXiv Preprint ArXiv:1802.05957.
- Ravuri, S., Mohamed, S., Rosca, M., & Vinyals, O. (2018). Learning implicit generative models with the method of learned moments. ArXiv Preprint ArXiv:1806.11006.
- Barratt, S., & Sharma, R. (2018). A note on the Inception Score. ArXiv Preprint ArXiv:1801.01973.
- Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., & Belongie, S. (2017). Stacked generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Vol. 2).