TL;DR: The kernel trick lets us represent expressive functions implicitly and efficiently in high-dimensional feature spaces

The kernel trick is usually introduced pretty early in intro machine learning classes — probably too early. The first time I saw it, I came away with only vague ideas about inner products and infinite-dimensional feature spaces… that can’t be right, can it?

In this post, I’ll explain why thinking about kernels can help us in machine learning, and also show how it’s possible to magically work with infinite-dimensional features.¹

### What is a kernel and why should we use one?

Given two points $x, x' \in \mathbb{R}^d$, a kernel is a function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ that computes how similar they are.² An example is the familiar inner product, or linear kernel:

$$k(x, x') = x^\top x'$$

Kernels give us two things:

1. A new perspective on features — instead of describing a single input, they determine similarity between data points
2. A computationally efficient way to work with high-dimensional features
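As a quick concrete sketch (the vectors and helper names below are my own, not from the notes), the linear kernel and a simple polynomial variant are each one line of NumPy:

```python
import numpy as np

def linear_kernel(x, xp):
    # k(x, x') = <x, x'> -- similarity as an inner product
    return float(np.dot(x, xp))

def poly_kernel(x, xp, degree=2):
    # k(x, x') = <x, x'>^degree -- a simple nonlinear similarity
    return float(np.dot(x, xp)) ** degree

x = np.array([1.0, 2.0])
xp = np.array([3.0, 1.0])
k_lin = linear_kernel(x, xp)   # 1*3 + 2*1 = 5
k_poly = poly_kernel(x, xp)    # 5^2 = 25
```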

### When can we use kernels?

In general, we use kernels to replace inner products between feature vectors. Taking linear regression as an example, for inputs $x \in \mathbb{R}^d$, we can fit a model

$$f(x) = w^\top x$$

with parameters $w \in \mathbb{R}^d$.

Using an arbitrary feature map $\phi : \mathcal{X} \rightarrow \mathbb{R}^b$, we can extend this to nonlinear relationships:

$$f(x) = w^\top \phi(x)$$

For example, if $x \in \mathbb{R}$, choosing $\phi(x) = [x^2, x, 1]$ results in the quadratic model

$$f(x) = w_1 x^2 + w_2 x + w_3$$
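A minimal sketch of this explicit-features approach, using made-up noise-free data and ordinary least squares:

```python
import numpy as np

def phi(x):
    # quadratic feature map for scalar x: phi(x) = [x^2, x, 1]
    return np.array([x**2, x, 1.0])

# toy data generated from y = 2x^2 - 3x + 1 (noise-free for clarity)
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = 2 * xs**2 - 3 * xs + 1

Phi = np.stack([phi(x) for x in xs])          # n x 3 design matrix
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)  # least-squares fit
# w recovers [2, -3, 1]
```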

A problem we run into with this approach is that we might need very high-dimensional $\phi(x)$ to represent sufficiently expressive nonlinear functions.

If we use gradient descent to minimize the mean squared error (MSE) over $n$ training examples

$$L(w) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \langle w, \phi(x_i) \rangle \right)^2,$$

we end up with a parameter vector that is a linear combination of the feature vectors

$$w = \sum_{i=1}^n \alpha_i \phi(x_i),$$

where $\alpha_1, \ldots, \alpha_n$ are learned constants (starting from $w = 0$, every gradient step adds a multiple of each $\phi(x_i)$ to $w$, so $w$ never leaves their span). This means that we can make predictions

$$f(x) = \langle w, \phi(x) \rangle = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i k(x_i, x)$$

that depend only on the inner products between feature vectors.
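Here is a sketch of prediction in this form. The post derives the $\alpha_i$ from gradient descent; for brevity this sketch instead solves for them directly from the kernel matrix (with a tiny ridge term for numerical stability), which is an assumption on my part, not the post's method:

```python
import numpy as np

def k(x, xp):
    # polynomial kernel k(x, x') = <x, x'>^2, standing in for <phi(x), phi(x')>
    return float(np.dot(x, xp)) ** 2

# toy training set
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 4.0])

# solve for alpha via the kernel matrix (small ridge term for stability)
K = np.array([[k(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(X)), y)

def predict(x_new):
    # f(x) = sum_i alpha_i k(x_i, x) -- never touches phi explicitly
    return sum(a * k(xi, x_new) for a, xi in zip(alpha, X))
```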

### Efficient computation with kernels

When $\phi(x)$ are high-dimensional, computing these inner products explicitly is expensive — kernels allow us to compute the similarity between data points implicitly, without ever working in the high-dimensional feature space.

Suppose we have $x, x' \in \mathbb{R}^d$ and kernel $k(x, x') = (x^\top x')^2$. We can expand this to get

$$k(x, x') = \left( \sum_{i=1}^d x_i x'_i \right)^2 = \sum_{i=1}^d \sum_{j=1}^d (x_i x_j)(x'_i x'_j),$$

so applying kernel $k$ is equivalent to computing an inner product in the feature space corresponding to the feature mapping

$$\phi(x) = \left[ x_i x_j \right]_{i,j=1}^d \in \mathbb{R}^{d^2}$$

Making a prediction in this $d^2$-dimensional feature space using the explicit computation $\langle w, \phi(x) \rangle$ requires $\mathcal{O}(d^2)$ time. If the feature dimension is much larger than the number of examples, $d \gg n$, the implicit computation using the corresponding kernel requires only $\mathcal{O}(nd)$ time.³
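To sanity-check the equivalence (the random vectors here are my own choice), we can compare the explicit $d^2$-dimensional inner product with the $\mathcal{O}(d)$ kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)

# explicit route: phi(x) = [x_i * x_j] over all pairs (i, j), a d^2-dim vector
phi = lambda v: np.outer(v, v).ravel()
explicit = float(np.dot(phi(x), phi(xp)))

# implicit route: the kernel k(x, x') = <x, x'>^2, computed in O(d) time
implicit = float(np.dot(x, xp)) ** 2

# the two agree: k(x, x') = <phi(x), phi(x')>
```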

### Infinite-dimensional features

The Gaussian or radial basis function (RBF) kernel allows us to implicitly work with infinite-dimensional features and represent very expressive functions — how is this possible?

First, we rewrite the Gaussian kernel as follows:

$$k(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2 \sigma^2} \right) = \exp\left( -\frac{\| x \|^2}{2 \sigma^2} \right) \exp\left( -\frac{\| x' \|^2}{2 \sigma^2} \right) \exp\left( \frac{x^\top x'}{\sigma^2} \right)$$

Taking the Taylor expansion of the third factor,

$$\exp\left( \frac{x^\top x'}{\sigma^2} \right) = \sum_{m=0}^\infty \frac{1}{m!} \left( \frac{x^\top x'}{\sigma^2} \right)^m,$$

we see that the Gaussian kernel corresponds to an infinite-dimensional feature space.⁴,⁵
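A quick numerical check of this factorization and the truncated Taylor series (the bandwidth $\sigma = 1$ and the inputs are arbitrary choices of mine):

```python
import numpy as np
from math import factorial

sigma = 1.0
x = np.array([0.3, -0.5])
xp = np.array([0.1, 0.4])

# the Gaussian kernel, computed directly...
gauss = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

# ...versus f(x) f(x') times a truncated Taylor series of exp(<x, x'> / sigma^2)
f = lambda v: np.exp(-np.sum(v**2) / (2 * sigma**2))
dot = float(np.dot(x, xp))
series = sum((dot / sigma**2) ** m / factorial(m) for m in range(20))
approx = f(x) * f(xp) * series
```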

#### Footnotes

1. Most of this is adapted from Percy Liang’s CS 229T notes — check this out for more detail!

2. A function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a valid kernel if and only if for every finite set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, the kernel matrix $K \in \mathbb{R}^{n \times n}$ defined by $K_{ij} = k(x_i, x_j)$ is positive semidefinite.

3. Each prediction requires computing one inner product for each of the $n$ training examples, and each inner product $\langle \phi(x), \phi(x') \rangle = \langle x, x' \rangle^2$ requires $\mathcal{O}(d)$ time.

4. The first two factors constitute a valid kernel, since for any function $f : \mathcal{X} \rightarrow \mathbb{R}$, $k(x, x') = f(x) f(x')$ is a valid kernel (its kernel matrix is always positive semidefinite).

5. The sum or product of two kernels is also a valid kernel. The Gaussian kernel is the product of two kernels, the second of which is an infinite sum of polynomial kernels.