Neal Jean, PhD student, Stanford AI Lab (nealjean.com)

Change of variables for normalizing flows (2018-10-21) nealjean.com/ml/change-of-variables<blockquote> <p>TL;DR: The change of variables formula lets us compute tractable densities in normalizing flow models</p> </blockquote> <p>My friend <a href="http://aditya-grover.github.io/">Aditya</a> is teaching a new class at Stanford this quarter on <a href="https://deepgenerativemodels.github.io/">deep generative models</a>. While going over lectures, we realized it would be helpful to have a simple example for the <a href="https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables">change of variables formula</a>, which is crucial for understanding normalizing flow models.</p> <p>There’s been a lot of work on normalizing flows in the last few years, including NICE <a href="#dinh2014nice">(Dinh, Krueger, &amp; Bengio, 2014)</a>, Real-NVP <a href="#dinh2016density">(Dinh, Sohl-Dickstein, &amp; Bengio, 2016)</a>, Inverse Autoregressive Flows (IAF) <a href="#kingma2016improved">(Kingma et al., 2016)</a>, and Masked Autoregressive Flows (MAF) <a href="#papamakarios2017masked">(Papamakarios, Murray, &amp; Pavlakou, 2017)</a>. Check out Eric Jang’s tutorial (<a href="https://blog.evjang.com/2018/01/nf1.html">Part I</a>, <a href="https://blog.evjang.com/2018/01/nf2.html">Part II</a>) for a great introduction — here we’ll just dive into the change of variables example.</p> <h3 id="probability-mass-is-conserved">Probability mass is conserved</h3> <p>Let’s start with a random variable <script type="math/tex">Z</script> that is uniformly distributed over the unit cube, <script type="math/tex">\mathbf{z} \in [0, 1]^3</script>. 
We can scale <script type="math/tex">Z</script> by a factor of 2 to get a new random variable <script type="math/tex">X</script>,</p> <script type="math/tex; mode=display">% <![CDATA[ \mathbf{x} = f(\mathbf{z}) = \mathbf{A} \mathbf{z} = \left[ \begin{array}{ccc} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{array} \right] \mathbf{z}, %]]></script> <p>where <script type="math/tex">X</script> is uniform over a cube with side length 2:</p> <p><img src="/assets/blog/cubes.png" alt="cubes" /></p> <p>How is the density <script type="math/tex">p_X(\mathbf{x})</script> related to <script type="math/tex">p_Z(\mathbf{z})</script>?</p> <p>Since every probability density integrates to 1 and the unit cube has volume <script type="math/tex">V_Z = 1</script>,</p> <script type="math/tex; mode=display">p_Z(\mathbf{z}) V_Z = 1,</script> <p>and <script type="math/tex">p_Z(\mathbf{z}) = 1</script> for all <script type="math/tex">\mathbf{z}</script> in the unit cube.</p> <p>The volume of the larger cube is easy to compute: <script type="math/tex">V_X = 2^3 = 8</script>. The <strong>total probability mass must be conserved</strong>, so we can solve for the density of <script type="math/tex">X</script>:</p> <script type="math/tex; mode=display">p_X(\mathbf{x}) = \frac{p_Z(\mathbf{z}) V_Z}{V_X} = \frac{1}{8}.</script> <p><em>The new density is equal to the original density multiplied by the ratio of the volumes</em>. 
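We can sanity-check this with a quick Monte Carlo estimate. A minimal NumPy sketch (the sampling setup and variable names here are ours, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample Z uniformly from the unit cube and push it through f(z) = 2z.
z = rng.random((200_000, 3))
x = 2.0 * z

# Estimate p_X near a point as (fraction of samples in a small box) / (box volume).
center, half_width = np.array([1.0, 1.0, 1.0]), 0.25
in_box = np.all(np.abs(x - center) < half_width, axis=1)
p_x_estimate = in_box.mean() / (2 * half_width) ** 3

print(p_x_estimate)  # close to 1/8
```

With enough samples, the estimate lands near 1/8, matching the conserved-mass argument.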
Intuitively, this scaling factor tells us whether the transformation is expanding the volume (as in our example, where the ratio is less than 1) or shrinking it (ratio greater than 1).</p> <h3 id="change-of-variables-formula">Change of variables formula</h3> <p>The change of variables formula allows us to tractably compute normalized probability densities when we apply an invertible transformation <script type="math/tex">f</script>:<sup id="fnref:det_inv"><a href="#fn:det_inv" class="footnote">1</a></sup></p> <script type="math/tex; mode=display">p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \left( \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right) \right| = p_Z(\mathbf{z}) \left| \det \left( \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right) \right|^{-1}</script> <p>In our example, the invertible function is just multiplication by a scaling matrix, so the determinant of the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian matrix</a> is easy to compute:</p> <script type="math/tex; mode=display">\det \left( \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right) = \det \mathbf{A} = 8.</script> <p>For any invertible function, the absolute value of the Jacobian determinant is a local linear approximation of how much the function expands or shrinks volume; in our simple example, it is exactly the ratio of volumes <script type="math/tex">V_X / V_Z = 8</script>.</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes"> <ol> <li id="fn:det_inv"> <p>For an invertible matrix <script type="math/tex">\mathbf{A}</script>, <script type="math/tex">\det(\mathbf{A}^{-1}) = (\det \mathbf{A})^{-1}</script>, which is why the two forms of the formula are equivalent. <a href="#fnref:det_inv" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <h4 id="references">References</h4> <ol class="bibliography"><li><span id="dinh2014nice">Dinh, L., Krueger, D., &amp; Bengio, Y. (2014). NICE: Non-linear independent components estimation. <i>ArXiv Preprint ArXiv:1410.8516</i>.</span></li> <li><span id="dinh2016density">Dinh, L., Sohl-Dickstein, J., &amp; Bengio, S. (2016). 
Density estimation using Real NVP. <i>ArXiv Preprint ArXiv:1605.08803</i>.</span></li> <li><span id="kingma2016improved">Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., &amp; Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In <i>Advances in Neural Information Processing Systems</i> (pp. 4743–4751).</span></li> <li><span id="papamakarios2017masked">Papamakarios, G., Murray, I., &amp; Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In <i>Advances in Neural Information Processing Systems</i> (pp. 2338–2347).</span></li></ol>

Fréchet Inception Distance (2018-07-15) nealjean.com/ml/frechet-inception-distance<blockquote> <p>TL;DR: FID measures the distance between the Inception-v3 activation distributions for generated and real samples</p> </blockquote> <p>I often run into things that are new to me (but not explained in that particular paper) when I’m reading. A recent example is Fréchet Inception Distance (FID), a method for <strong>measuring the quality of generated image samples</strong>.</p> <h3 id="inception-score">Inception Score</h3> <p>Before FID, the Inception Score (IS) was the original method for measuring the quality of generated samples. 
<a href="#salimans2016improved">(Salimans et al., 2016)</a> proposed applying an Inception-v3 network pre-trained on ImageNet to generated samples and then comparing the conditional label distribution with the marginal label distribution:</p> <script type="math/tex; mode=display">\text{IS} = \exp \left( \mathbb{E}_{x \sim p_g} D_{KL}(p(y \vert x) || p(y)) \right)</script> <p>Ideally, the generator should:</p> <ol> <li>Generate images <strong><em>with meaningful objects</em></strong>, so that the conditional label distribution <script type="math/tex">p(y \vert x)</script> is <em>low entropy</em>.</li> <li>Generate <strong><em>diverse</em></strong> images, so that the marginal label distribution <script type="math/tex">p(y) = \int_x p(y \vert x) p_g(x)</script> is <em>high entropy</em>.</li> </ol> <p>Higher scores are better, corresponding to a larger KL-divergence between the two distributions.</p> <h3 id="fréchet-inception-distance">Fréchet Inception Distance</h3> <p>The FID is supposed to improve on the IS by actually <em>comparing</em> the statistics of generated samples to real samples, instead of evaluating generated samples in a vacuum.<sup id="fnref:fid"><a href="#fn:fid" class="footnote">1</a></sup> <a href="#heusel2017gans">(Heusel, Ramsauer, Unterthiner, Nessler, &amp; Hochreiter, 2017)</a> propose using the Fréchet distance between two multivariate Gaussians,</p> <script type="math/tex; mode=display">\text{FID} = ||\mu_r - \mu_g||^2 + \text{Tr} (\Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2}),</script> <p>where <script type="math/tex">X_r \sim \mathcal{N} (\mu_r, \Sigma_r)</script> and <script type="math/tex">X_g \sim \mathcal{N} (\mu_g, \Sigma_g)</script> are the 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively.<sup id="fnref:activations"><a href="#fn:activations" class="footnote">2</a></sup></p> <p>Lower FID is better, corresponding to more similar real and generated samples as measured by the 
distance between their activation distributions.</p> <h3 id="measuring-progress">Measuring progress</h3> <p>It’s helpful to look at some of the IS and FID scores that have been reported to get a feel for what good/bad scores look like and see how different models compare. These scores are for unsupervised models on CIFAR-10:<sup id="fnref:inconsistencies"><a href="#fn:inconsistencies" class="footnote">3</a></sup></p> <table> <thead> <tr> <th>Paper</th> <th>IS</th> <th>FID</th> </tr> </thead> <tbody> <tr> <td><strong>Real CIFAR-10 data</strong> <a href="#salimans2016improved">(Salimans et al., 2016)</a></td> <td>11.24</td> <td>–</td> </tr> <tr> <td>Unsupervised representation learning with deep convolutional generative adversarial networks (DCGAN) <a href="#radford2015unsupervised">(Radford, Metz, &amp; Chintala, 2015)</a></td> <td>6.16<sup id="fnref:dcgan1"><a href="#fn:dcgan1" class="footnote">4</a></sup></td> <td>37.1<sup id="fnref:dcgan2"><a href="#fn:dcgan2" class="footnote">5</a></sup></td> </tr> <tr> <td>Conditional image generation with PixelCNN decoders <a href="#van2016conditional">(van den Oord et al., 2016)</a></td> <td>4.60<sup id="fnref:pixelcnn"><a href="#fn:pixelcnn" class="footnote">6</a></sup></td> <td>65.9<sup id="fnref:pixelcnn:1"><a href="#fn:pixelcnn" class="footnote">6</a></sup></td> </tr> <tr> <td>Adversarially learned inference (ALI) <a href="#dumoulin2016adversarially">(Dumoulin et al., 2016)</a></td> <td>5.34<sup id="fnref:ali"><a href="#fn:ali" class="footnote">7</a></sup></td> <td>–</td> </tr> <tr> <td>Improved techniques for training GANs <a href="#salimans2016improved">(Salimans et al., 2016)</a></td> <td>6.86</td> <td>–</td> </tr> <tr> <td>Improving generative adversarial networks with denoising feature matching <a href="#warde2016improving">(Warde-Farley &amp; Bengio, 2016)</a></td> <td>7.72</td> <td>–</td> </tr> <tr> <td>Learning to generate samples from noise through infusion training <a href="#bordes2017learning">(Bordes, Honari, 
&amp; Vincent, 2017)</a></td> <td>4.62</td> <td>–</td> </tr> <tr> <td>BEGAN: Boundary equilibrium generative adversarial networks <a href="#berthelot2017began">(Berthelot, Schumm, &amp; Metz, 2017)</a></td> <td>5.62</td> <td>–</td> </tr> <tr> <td>MMD GAN: Towards deeper understanding of moment matching network <a href="#li2017mmd">(Li, Chang, Cheng, Yang, &amp; Póczos, 2017)</a></td> <td>6.17</td> <td>–</td> </tr> <tr> <td>Improved training of Wasserstein GANs <a href="#gulrajani2017improved">(Gulrajani, Ahmed, Arjovsky, Dumoulin, &amp; Courville, 2017)</a></td> <td>7.86</td> <td>–</td> </tr> <tr> <td>Coulomb GANs: Provably optimal Nash equilibrium via potential fields <a href="#unterthiner2017coulomb">(Unterthiner et al., 2017)</a></td> <td>–</td> <td>27.3</td> </tr> <tr> <td>GANs trained by a two time-scale update rule converge to a local Nash equilibrium <a href="#heusel2017gans">(Heusel, Ramsauer, Unterthiner, Nessler, &amp; Hochreiter, 2017)</a></td> <td>–</td> <td>24.8</td> </tr> <tr> <td>Autoregressive quantile networks for generative modeling (AIQN) <a href="#ostrovski2018autoregressive">(Ostrovski, Dabney, &amp; Munos, 2018)</a></td> <td>5.29</td> <td>49.5</td> </tr> <tr> <td>Spectral normalization for generative adversarial networks (SN-GAN) <a href="#miyato2018spectral">(Miyato, Kataoka, Koyama, &amp; Yoshida, 2018)</a></td> <td>8.22</td> <td>21.7</td> </tr> <tr> <td>Learning implicit generative models with the method of learned moments <a href="#ravuri2018learning">(Ravuri, Mohamed, Rosca, &amp; Vinyals, 2018)</a></td> <td>7.90</td> <td>18.9</td> </tr> </tbody> </table> <p>There are no universally agreed-upon performance metrics for unsupervised learning, and <a href="https://arxiv.org/abs/1801.01973">people have already pointed out many shortcomings</a> of these Inception-based methods <a href="#barratt2018note">(Barratt &amp; Sharma, 2018)</a>. 
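Shortcomings aside, the distance itself is only a few lines of code once the activation statistics are in hand. A sketch assuming NumPy (the `frechet_distance` and `_sqrtm_psd` names are ours; it rearranges the trace term using the identity Tr((Σ<sub>r</sub>Σ<sub>g</sub>)<sup>1/2</sup>) = Tr((Σ<sub>r</sub><sup>1/2</sup>Σ<sub>g</sub>Σ<sub>r</sub><sup>1/2</sup>)<sup>1/2</sup>) so only symmetric matrix square roots are needed):

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    sqrt_r = _sqrtm_psd(sigma_r)
    covmean = _sqrtm_psd(sqrt_r @ sigma_g @ sqrt_r)  # same trace as (Sr Sg)^(1/2)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Identical distributions give distance 0; shifting the mean adds ||mu_r - mu_g||^2.
mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))        # 0 up to rounding
print(frechet_distance(mu, sigma, mu + 1.0, sigma))  # 4 up to rounding
```

In actual FID the means and covariances are estimated from Inception-v3 pool3 activations of real and generated images, not toy Gaussians.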
Until something better comes along though, they’re going to show up in every paper so it’s worth knowing what they are.</p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes"> <ol> <li id="fn:fid"> <p>Implementations are available for both <a href="https://github.com/bioinf-jku/TTUR">TF</a> and <a href="https://github.com/mseitzer/pytorch-fid">PyTorch</a>. <a href="#fnref:fid" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:activations"> <p>Other feature layers of the Inception-v3 network can also be used, with different dimensionalities. Note that at least <script type="math/tex">d</script> samples are needed to estimate the Gaussian statistics for <script type="math/tex">d</script>-dimensional features. <a href="#fnref:activations" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:inconsistencies"> <p>It’s hard to tell if different papers are using consistent methodology for computing IS and FID (e.g., some use 5k samples for FID while others use 50k), and some papers report different numbers for the same model. I tried to pick the best score reported by each paper for their own method where possible and yolo’d the rest. <a href="#fnref:inconsistencies" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:dcgan1"> <p>Reported by <a href="#huang2017stacked">(Huang, Li, Poursaeed, Hopcroft, &amp; Belongie, 2017)</a>. <a href="#fnref:dcgan1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:dcgan2"> <p>Reported by <a href="#ostrovski2018autoregressive">(Ostrovski, Dabney, &amp; Munos, 2018)</a>. <a href="#fnref:dcgan2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:pixelcnn"> <p>Reported by <a href="#ostrovski2018autoregressive">(Ostrovski, Dabney, &amp; Munos, 2018)</a>. <a href="#fnref:pixelcnn" class="reversefootnote">&#8617;</a> <a href="#fnref:pixelcnn:1" class="reversefootnote">&#8617;<sup>2</sup></a></p> </li> <li id="fn:ali"> <p>Reported by <a href="#warde2016improving">(Warde-Farley &amp; Bengio, 2016)</a>. 
<a href="#fnref:ali" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <h4 id="references">References</h4> <ol class="bibliography"><li><span id="salimans2016improved">Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., &amp; Chen, X. (2016). Improved techniques for training GANs. In <i>Advances in Neural Information Processing Systems</i> (pp. 2234–2242).</span></li> <li><span id="heusel2017gans">Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., &amp; Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In <i>Advances in Neural Information Processing Systems</i> (pp. 6626–6637).</span></li> <li><span id="radford2015unsupervised">Radford, A., Metz, L., &amp; Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. <i>ArXiv Preprint ArXiv:1511.06434</i>.</span></li> <li><span id="van2016conditional">van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., &amp; others. (2016). Conditional image generation with PixelCNN decoders. In <i>Advances in Neural Information Processing Systems</i> (pp. 4790–4798).</span></li> <li><span id="dumoulin2016adversarially">Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., &amp; Courville, A. (2016). Adversarially learned inference. <i>ArXiv Preprint ArXiv:1606.00704</i>.</span></li> <li><span id="warde2016improving">Warde-Farley, D., &amp; Bengio, Y. (2016). Improving generative adversarial networks with denoising feature matching.</span></li> <li><span id="bordes2017learning">Bordes, F., Honari, S., &amp; Vincent, P. (2017). Learning to generate samples from noise through infusion training. <i>ArXiv Preprint ArXiv:1703.06975</i>.</span></li> <li><span id="berthelot2017began">Berthelot, D., Schumm, T., &amp; Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. 
<i>ArXiv Preprint ArXiv:1703.10717</i>.</span></li> <li><span id="li2017mmd">Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., &amp; Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In <i>Advances in Neural Information Processing Systems</i> (pp. 2203–2213).</span></li> <li><span id="gulrajani2017improved">Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., &amp; Courville, A. C. (2017). Improved training of Wasserstein GANs. In <i>Advances in Neural Information Processing Systems</i> (pp. 5767–5777).</span></li> <li><span id="unterthiner2017coulomb">Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., &amp; Hochreiter, S. (2017). Coulomb GANs: Provably optimal Nash equilibria via potential fields. <i>ArXiv Preprint ArXiv:1708.08819</i>.</span></li> <li><span id="ostrovski2018autoregressive">Ostrovski, G., Dabney, W., &amp; Munos, R. (2018). Autoregressive Quantile Networks for Generative Modeling. <i>ArXiv Preprint ArXiv:1806.05575</i>.</span></li> <li><span id="miyato2018spectral">Miyato, T., Kataoka, T., Koyama, M., &amp; Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. <i>ArXiv Preprint ArXiv:1802.05957</i>.</span></li> <li><span id="ravuri2018learning">Ravuri, S., Mohamed, S., Rosca, M., &amp; Vinyals, O. (2018). Learning Implicit Generative Models with the Method of Learned Moments. <i>ArXiv Preprint ArXiv:1806.11006</i>.</span></li> <li><span id="barratt2018note">Barratt, S., &amp; Sharma, R. (2018). A Note on the Inception Score. <i>ArXiv Preprint ArXiv:1801.01973</i>.</span></li> <li><span id="huang2017stacked">Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., &amp; Belongie, S. (2017). Stacked generative adversarial networks. In <i>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</i> (Vol. 
2).</span></li></ol>

The kernel trick (2018-07-10) nealjean.com/ml/kernel-trick<blockquote> <p>TL;DR: The kernel trick lets us represent expressive functions implicitly and efficiently in high-dimensional feature spaces</p> </blockquote> <p>The kernel trick is usually introduced pretty early in intro machine learning classes — probably too early. The first time I saw it, I came away with only vague ideas about inner products and <em>infinite-dimensional</em> feature spaces… that can’t be right, can it?</p> <p>In this post, I’ll explain why thinking about <a href="https://en.wikipedia.org/wiki/Kernel_method">kernels</a> can help us in machine learning, and also show how it’s possible to magically work with infinite-dimensional features.<sup id="fnref:credits"><a href="#fn:credits" class="footnote">1</a></sup></p> <h3 id="what-is-a-kernel-and-why-should-we-use-them">What is a kernel and why should we use them?</h3> <p>Given two points <script type="math/tex">x, x' \in \mathbb{R}^d</script>, a <strong>kernel</strong> is a function <script type="math/tex">k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}</script> that computes how similar they are.<sup id="fnref:kernel"><a href="#fn:kernel" class="footnote">2</a></sup> An example is the familiar inner product, or linear kernel:</p> <script type="math/tex; mode=display">k(x, x') = x^\top x' = \langle x, x' \rangle.</script> <p>Kernels buy us two things:</p> <ol> <li>A <strong>new perspective on features</strong> — instead of describing a single input, they determine similarity between data points</li> <li>A <strong>computationally efficient</strong> way to work with high-dimensional features</li> </ol> <h3 id="when-can-we-use-kernels">When can we use kernels?</h3> <p>In general, we use kernels <strong>to replace inner products between feature 
vectors</strong>. Taking linear regression as an example, for inputs <script type="math/tex">x \in \mathbb{R}^d</script>, we can fit a model</p> <script type="math/tex; mode=display">f(x) = \langle w, x \rangle.</script> <p>Using an arbitrary feature map <script type="math/tex">\phi : \mathcal{X} \rightarrow \mathbb{R}^b</script>, we can extend this to nonlinear relationships — for example, if <script type="math/tex">x \in \mathbb{R}</script>, choosing <script type="math/tex">\phi(x) = [x^2, x, 1]</script> results in a quadratic model</p> <script type="math/tex; mode=display">f(x) = \langle w, \phi(x) \rangle = w_2 x^2 + w_1 x + w_0.</script> <p>A problem we run into with this approach is that we might need very high-dimensional <script type="math/tex">\phi(x)</script> to represent sufficiently expressive nonlinear functions.</p> <p>If we use gradient descent to minimize the mean squared error (MSE) over <script type="math/tex">n</script> training examples</p> <script type="math/tex; mode=display">L(w) = \frac{1}{n} \sum_{i=1}^n (y^{(i)} - \langle w, \phi(x^{(i)}) \rangle)^2,</script> <p>then every gradient step is itself a linear combination of the feature vectors, so (assuming <script type="math/tex">w</script> is initialized to zero) we end up with a parameter vector that is a linear combination of the feature vectors</p> <script type="math/tex; mode=display">w = \sum_{i=1}^n \alpha_i \phi(x^{(i)}),</script> <p>where <script type="math/tex">\alpha_1, \ldots, \alpha_n</script> are learned constants. 
This means that we can make predictions</p> <script type="math/tex; mode=display">\langle w, \phi(x) \rangle = \sum_{i=1}^n \alpha_i \langle \phi(x^{(i)}), \phi(x) \rangle</script> <p>that depend on the inner products between feature vectors.</p> <h3 id="efficient-computation-with-kernels">Efficient computation with kernels</h3> <p>When <script type="math/tex">\phi(x)</script> are high-dimensional, computing these inner products explicitly is expensive — kernels allow us to compute the similarity between data points <strong>implicitly</strong>, without ever working in the high-dimensional feature space.</p> <p>Suppose we have <script type="math/tex">x, x' \in \mathbb{R}^d</script> and kernel <script type="math/tex">k(x, x') = (x^\top x')^2</script>. We can expand this to get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k(x, x') &= \left( \sum_{i=1}^d x_i x_i' \right) \left( \sum_{j=1}^d x_j x_j' \right) \\ &= \sum_{i=1}^d \sum_{j=1}^d (x_i x_j) (x_i' x_j') \\ &= \langle \phi(x), \phi(x') \rangle, \end{align} %]]></script> <p>so applying kernel <script type="math/tex">k</script> is equivalent to computing an inner product in the feature space corresponding to the feature mapping</p> <script type="math/tex; mode=display">\phi(x) = \left[ \begin{array}{c} x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_1 x_d \\ x_2 x_1 \\ \vdots \\ x_d x_{d-1} \\ x_d x_d \end{array} \right] \in \mathbb{R}^{d^2}.</script> <p>Making a prediction in this <script type="math/tex">d^2</script>-dimensional feature space using the explicit computation <script type="math/tex">\langle w, \phi(x) \rangle</script> requires <script type="math/tex">\mathcal{O}(d^2)</script> time. 
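This equivalence is easy to check numerically. A small sketch assuming NumPy (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
x, xp = rng.random(d), rng.random(d)

# Explicit: map both points into the d^2-dimensional feature space
# phi(x) = [x_i * x_j for all i, j] and take the inner product there.
phi = lambda v: np.outer(v, v).ravel()
explicit = phi(x) @ phi(xp)   # O(d^2) work

# Implicit: evaluate the kernel directly, never forming phi.
implicit = (x @ xp) ** 2      # O(d) work

print(np.isclose(explicit, implicit))  # True
```

The implicit version never materializes the <script type="math/tex">d^2</script>-dimensional vector, which is the entire point of the trick.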
If the feature dimension is much larger than the number of examples, <script type="math/tex">d \gg n</script>, the implicit computation using the corresponding kernel requires only <script type="math/tex">\mathcal{O}(nd)</script> time.<sup id="fnref:computation"><a href="#fn:computation" class="footnote">3</a></sup></p> <h3 id="infinite-dimensional-features">Infinite-dimensional features</h3> <p>The Gaussian or <a href="https://en.wikipedia.org/wiki/Radial_basis_function_kernel">radial basis function</a> (RBF) kernel allows us to implicitly work with infinite-dimensional features and represent very expressive functions — how is this possible?</p> <p>First, we rewrite the Gaussian kernel as follows:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k(x, x') &= \exp \left( \frac{-||x-x'||^2}{2 \sigma^2} \right) \\ &= \exp \left( \frac{-||x||^2}{2 \sigma^2} \right) \exp \left( \frac{-||x'||^2}{2 \sigma^2} \right) \exp \left( \frac{\langle x, x' \rangle}{\sigma^2} \right). \\ \end{align} %]]></script> <p>Taking the Taylor expansion of the third factor,</p> <script type="math/tex; mode=display">\exp \left( \frac{\langle x, x' \rangle}{\sigma^2} \right) = 1 + \frac{\langle x, x' \rangle}{\sigma^2} + \frac{\langle x, x' \rangle^2}{2 \sigma^4} + \frac{\langle x, x' \rangle^3}{6 \sigma^6} + \cdots,</script> <p>we see that the Gaussian kernel corresponds to an infinite-dimensional feature space.<sup id="fnref:basecase"><a href="#fn:basecase" class="footnote">4</a></sup><sup id="fnref:recursive"><a href="#fn:recursive" class="footnote">5</a></sup></p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes"> <ol> <li id="fn:credits"> <p>Most of this is adapted from <a href="https://web.stanford.edu/class/cs229t/Lectures/percy-notes.pdf">Percy Liang’s CS 229T notes</a> — check this out for more detail! 
<a href="#fnref:credits" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:kernel"> <p>A function <script type="math/tex">k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}</script> is a <em>valid</em> kernel if and only if for every finite set of points <script type="math/tex">x_1, \ldots, x_n \in \mathbb{R}^d</script>, the kernel matrix <script type="math/tex">K \in \mathbb{R}^{n \times n}</script> defined by <script type="math/tex">K_{ij} = k(x_i, x_j)</script> is positive semidefinite. <a href="#fnref:kernel" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:computation"> <p>Each prediction requires computing one inner product for each of the <script type="math/tex">n</script> training examples, and each inner product <script type="math/tex">\langle \phi(x), \phi(x') \rangle = \langle x, x' \rangle^2</script> requires <script type="math/tex">\mathcal{O}(d)</script> time. <a href="#fnref:computation" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:basecase"> <p>The first two factors constitute a valid kernel, since for any function <script type="math/tex">f: \mathcal{X} \rightarrow \mathbb{R}</script>, <script type="math/tex">k(x, x') = f(x) f(x')</script> is positive semidefinite. <a href="#fnref:basecase" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:recursive"> <p>The sum or product of two kernels is also a valid kernel. The Gaussian kernel is the product of two kernels, the second of which is an infinite sum of polynomial kernels. 
<a href="#fnref:recursive" class="reversefootnote">&#8617;</a></p> </li> </ol> </div>

Neural network normalization (2018-06-24) nealjean.com/ml/neural-network-normalization<blockquote> <p>TL;DR: Batch/layer/instance/group norm are different methods for normalizing the inputs to the layers of deep neural networks</p> </blockquote> <p>Ali Rahimi pointed out in his <a href="https://www.youtube.com/watch?time_continue=2&amp;v=Qi1Yry33TQE">NIPS test-of-time talk</a> that no one really understands how batch norm works — something something “internal covariate shift”? I’m still waiting for a good explanation, but for now here’s a quick comparison of what batch, layer, instance, and group norm actually <strong><em>do</em></strong>.<sup id="fnref:implementation"><a href="#fn:implementation" class="footnote">1</a></sup></p> <h3 id="batch-normalization">Batch normalization</h3> <p>Batch norm <a href="#ioffe2015batch">(Ioffe &amp; Szegedy, 2015)</a> was the OG normalization method proposed for training deep neural networks and has empirically been very successful. It also introduced the term <strong>internal covariate shift</strong>, defined as the change in the distribution of network activations due to the change in network parameters during training.</p> <p>The goal of batch norm is to <em>reduce internal covariate shift</em> by normalizing each mini-batch of data using the mini-batch mean and variance. 
For a mini-batch of inputs <script type="math/tex">\{x_1, \ldots, x_m\}</script>, we compute</p> <script type="math/tex; mode=display">\mu = \frac{1}{m} \sum_{i=1}^{m} x_i</script> <script type="math/tex; mode=display">\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2</script> <p>and then replace each <script type="math/tex">x_i</script> with its normalized version</p> <script type="math/tex; mode=display">\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}</script> <p>where <script type="math/tex">\epsilon</script> is a small constant added for numerical stability.<sup id="fnref:batchnorm1"><a href="#fn:batchnorm1" class="footnote">2</a></sup> This process is repeated for every layer of the neural network.<sup id="fnref:batchnorm2"><a href="#fn:batchnorm2" class="footnote">3</a></sup></p> <h3 id="layer-normalization">Layer normalization</h3> <p>Layer norm <a href="#ba2016layer">(Ba, Kiros, &amp; Hinton, 2016)</a> attempted to address some shortcomings of batch norm:</p> <ol> <li>It’s unclear how to apply batch norm in RNNs</li> <li>Batch norm needs large mini-batches to estimate statistics accurately</li> </ol> <p>Instead of normalizing examples across mini-batches, layer normalization <em>normalizes features within each example</em>. For input <script type="math/tex">x_i</script> of dimension <script type="math/tex">D</script>, we compute</p> <script type="math/tex; mode=display">\mu = \frac{1}{D} \sum_{d=1}^{D} x_i^d</script> <script type="math/tex; mode=display">\sigma^2 = \frac{1}{D} \sum_{d=1}^{D} (x_i^d - \mu)^2</script> <p>and then replace each component <script type="math/tex">x_i^d</script> with its normalized version</p> <script type="math/tex; mode=display">\hat{x}_i^d = \frac{x_i^d - \mu}{\sqrt{\sigma^2 + \epsilon}}.</script> <h3 id="instance-normalization">Instance normalization</h3> <p>Instance norm <a href="#ulyanov2016instance">(Ulyanov, Vedaldi, &amp; Lempitsky, 2016)</a> hit arXiv just 6 days after layer norm, and is pretty similar. 
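Concretely, the batch norm and layer norm statistics above differ only in the axis the mean and variance are computed over. A minimal NumPy sketch (function and variable names are ours, and the learnable scale/shift parameters are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize over the batch axis: one mean/variance per feature."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalize over the feature axis: one mean/variance per example."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(32, 64))  # (batch, features)
bn, ln = batch_norm(x), layer_norm(x)
# Each feature of bn, and each example of ln, now has roughly zero mean
# and unit variance.
```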
Instead of normalizing all of the features of an example at once, instance norm normalizes features within each channel.</p> <h3 id="group-normalization">Group normalization</h3> <p>Group norm <a href="#wu2018group">(Wu &amp; He, 2018)</a> is somewhere between layer and instance norm — instead of normalizing features within each channel, it normalizes features within pre-defined groups of channels.<sup id="fnref:groupnorm"><a href="#fn:groupnorm" class="footnote">4</a></sup></p> <h3 id="summary">Summary</h3> <p>Here’s a figure from the group norm paper that nicely illustrates all of the normalization techniques described above:</p> <p><img src="/assets/blog/group-norm.png" alt="norms" /></p> <h4 id="footnotes">Footnotes</h4> <div class="footnotes"> <ol> <li id="fn:implementation"> <p>To keep things simple and easy to remember, many implementation details (and other interesting things) will not be discussed. <a href="#fnref:implementation" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:batchnorm1"> <p>Instead of normalizing to zero mean and unit variance, learnable scale and shift parameters can be introduced at each layer. <a href="#fnref:batchnorm1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:batchnorm2"> <p>For CNNs, the pixels in each channel are normalized using the same mean and variance. <a href="#fnref:batchnorm2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:groupnorm"> <p>In its extreme cases, group norm is equivalent to instance norm (one group for each channel) and to layer norm (one group, period). <a href="#fnref:groupnorm" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <h4 id="references">References</h4> <ol class="bibliography"><li><span id="ioffe2015batch">Ioffe, S., &amp; Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In <i>International Conference on Machine Learning</i> (pp. 448–456).</span></li> <li><span id="ba2016layer">Ba, J. 
L., Kiros, J. R., &amp; Hinton, G. E. (2016). Layer normalization. <i>ArXiv Preprint ArXiv:1607.06450</i>.</span></li> <li><span id="ulyanov2016instance">Ulyanov, D., Vedaldi, A., &amp; Lempitsky, V. (2016). Instance Normalization: The Missing Ingredient for Fast Stylization. <i>ArXiv Preprint ArXiv:1607.08022</i>.</span></li> <li><span id="wu2018group">Wu, Y., &amp; He, K. (2018). Group normalization. <i>ArXiv Preprint ArXiv:1803.08494</i>.</span></li></ol>

Useful research tricks (2018-05-28) nealjean.com/random/research-tricks<p>Some things that improved my productivity — writing them down here so that others can benefit as well!</p> <h3 id="automatically-switch-conda-environments">Automatically switch Conda environments</h3> <p>Use <a href="https://direnv.net/">direnv</a> to automatically switch between Conda environments for different projects.</p> <p>If you have repo SecretProject with corresponding Conda environment called <code class="highlighter-rouge">topsecret</code>, create an <code class="highlighter-rouge">.envrc</code> file in the root of your SecretProject repo containing:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source activate topsecret
export PYTHONPATH=$(pwd)
</code></pre></div></div> <p>Now whenever you cd to SecretProject root or any subdirectory, direnv will automatically switch to your <code class="highlighter-rouge">topsecret</code> Conda environment and make sure that your imports work. 
This also works well with multiple windows/panes in tmux.</p> <h3 id="squash-useless-git-commits">Squash useless Git commits</h3> <p>To squash the previous N commits into one:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git reset --soft HEAD~N
</code></pre></div></div> <p>Then commit again and proceed as usual.</p> <h3 id="viewing-images-on-a-server">Viewing images on a server</h3> <p>Credit: <a href="https://unix.stackexchange.com/questions/35333/what-is-the-fastest-way-to-view-images-from-the-terminal/182369#182369">via Russell Stewart</a></p> <p>Run this from the directory with images:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m http.server 8080       # python3
python -m SimpleHTTPServer 8080  # python2
</code></pre></div></div> <p>On your computer: View at localhost:8080 in your browser</p> <p>On a remote server: View at remoteserver.com:8080</p> <p>If the server doesn’t expose ports, set up an ssh tunnel before viewing:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh neal@remoteserver.com -NfL localhost:8080:localhost:8080
</code></pre></div></div>