The kernel trick
TL;DR: The kernel trick lets us represent expressive functions implicitly and efficiently in high-dimensional feature spaces.
The kernel trick is usually introduced pretty early in intro machine learning classes — probably too early. The first time I saw it, I came away with only vague ideas about inner products and infinite-dimensional feature spaces… that can’t be right, can it?
In this post, I’ll explain why thinking about kernels can help us in machine learning, and also show how it’s possible to magically work with infinite-dimensional features.^{1}
What is a kernel and why should we use them?
Given two points $x, x' \in \mathbb{R}^d$, a kernel $k(x, x')$ is a function that computes how similar they are.^{2} An example is the familiar inner product, or linear kernel:

$$k(x, x') = \langle x, x' \rangle = x^\top x'$$
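As a quick sanity check, here is the linear kernel in NumPy (the function name `linear_kernel` and the example points are mine, just for illustration):

```python
import numpy as np

def linear_kernel(x, x_prime):
    """Similarity of two points as their inner product."""
    return np.dot(x, x_prime)

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 0.5])
print(linear_kernel(x, x_prime))  # 1*3 + 2*0.5 = 4.0
```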
Kernels buy us two things:
- A new perspective on features — instead of describing a single input, they determine similarity between data points
- A computationally efficient way to work with high-dimensional features
When can we use kernels?
In general, we use kernels to replace inner products between feature vectors. Taking linear regression as an example, for inputs $x \in \mathbb{R}^d$, we can fit a model

$$f(x) = w^\top x$$
Using an arbitrary feature map $\phi : \mathbb{R}^d \to \mathbb{R}^p$ and fitting $f(x) = w^\top \phi(x)$ instead, we can extend this to nonlinear relationships — for example, if $x \in \mathbb{R}$, choosing $\phi(x) = (1, x, x^2)$ results in a quadratic model

$$f(x) = w_0 + w_1 x + w_2 x^2$$
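A minimal sketch of this idea, assuming scalar inputs and the quadratic feature map above (the synthetic target $y = 2x^2 - x + 3$ is my choice for illustration; I fit with least squares rather than gradient descent for brevity):

```python
import numpy as np

def phi(x):
    """Feature map for scalar inputs: phi(x) = (1, x, x^2)."""
    return np.array([1.0, x, x**2])

# Synthetic data from y = 2x^2 - x + 3: a linear model in
# feature space can represent this nonlinear function exactly.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=20)
ys = 2 * xs**2 - xs + 3

Phi = np.stack([phi(x) for x in xs])          # design matrix, shape (20, 3)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)  # fit weights in feature space
print(np.round(w, 3))                         # recovers [3., -1., 2.]
```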
A problem we run into with this approach is that we might need a very high-dimensional feature map $\phi$ to represent sufficiently expressive nonlinear functions.
If we use gradient descent to minimize the mean squared error (MSE) over $n$ training examples

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^\top \phi(x_i) - y_i \right)^2,$$

we end up with a parameter vector that is a linear combination of the feature vectors

$$w = \sum_{i=1}^{n} \alpha_i \phi(x_i),$$

where the $\alpha_i$ are learned constants (if we initialize $w = 0$, every gradient step adds a multiple of each $\phi(x_i)$, so $w$ remains in their span). This means that we can make predictions

$$f(x) = w^\top \phi(x) = \sum_{i=1}^{n} \alpha_i \langle \phi(x_i), \phi(x) \rangle$$

that depend only on inner products between feature vectors.
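We can check this span property numerically. The sketch below (random features and targets, my choice of learning rate and iteration count) runs gradient descent on the MSE while tracking the dual coefficients $\alpha_i$ alongside $w$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 3
Phi = rng.normal(size=(n, p))   # feature vectors phi(x_i) as rows
y = rng.normal(size=n)

w = np.zeros(p)                 # explicit weights, initialized at zero
alpha = np.zeros(n)             # dual coefficients, one per example
lr = 0.1
for _ in range(200):
    residuals = Phi @ w - y                  # shape (n,)
    w -= lr * (2 / n) * Phi.T @ residuals    # MSE gradient step in w
    alpha -= lr * (2 / n) * residuals        # the same step, per example

# Every update added multiples of the phi(x_i), so w stayed in their span:
print(np.allclose(w, Phi.T @ alpha))  # True
```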
Efficient computation with kernels
When the features $\phi(x)$ are high-dimensional, computing these inner products explicitly is expensive — kernels allow us to compute the similarity between data points implicitly, without ever working in the high-dimensional feature space.
Suppose we have $x, x' \in \mathbb{R}^2$ and kernel $k(x, x') = (x^\top x')^2$. We can expand this to get

$$k(x, x') = (x_1 x'_1 + x_2 x'_2)^2 = x_1^2 x_1'^2 + 2\, x_1 x'_1 x_2 x'_2 + x_2^2 x_2'^2,$$

so applying kernel $k$ is equivalent to computing an inner product in the feature space corresponding to the feature mapping

$$\phi(x) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right)$$
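This equivalence is easy to verify numerically (function names and example points are mine):

```python
import numpy as np

def quad_kernel(x, xp):
    """k(x, x') = (x^T x')^2, computed without explicit features."""
    return np.dot(x, xp) ** 2

def phi_quad(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quad_kernel(x, xp))        # (3 - 2)^2 = 1.0
print(phi_quad(x) @ phi_quad(xp))  # same value via explicit features
```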
For inputs $x \in \mathbb{R}^d$, the quadratic kernel corresponds to an $O(d^2)$-dimensional feature space, so making a prediction using the explicit computation $f(x) = w^\top \phi(x)$ requires $O(d^2)$ time. If the feature dimension is much larger than the number of examples, $d \gg n$, the implicit computation $f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$ using the corresponding kernel requires only $O(nd)$ time.^{3}
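A back-of-envelope illustration of the gap (the dimensions $d = 1000$ and the random points are my choices): the homogeneous degree-2 feature space has $d(d+1)/2$ distinct monomials $x_i x_j$, while one kernel evaluation is just a $d$-dimensional dot product.

```python
import numpy as np

d = 1000
explicit_dim = d * (d + 1) // 2   # distinct monomials x_i * x_j
print(explicit_dim)               # 500500 features per point

# The implicit route never builds a 500500-dim vector —
# one kernel evaluation is a single O(d) dot product:
rng = np.random.default_rng(2)
x, xp = rng.normal(size=d), rng.normal(size=d)
k = np.dot(x, xp) ** 2
```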
Infinitedimensional features
The Gaussian or radial basis function (RBF) kernel allows us to implicitly work with infinite-dimensional features and represent very expressive functions — how is this possible?
First, we rewrite the Gaussian kernel as follows:

$$k(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right) = \exp\left( -\frac{\lVert x \rVert^2}{2\sigma^2} \right) \exp\left( -\frac{\lVert x' \rVert^2}{2\sigma^2} \right) \exp\left( \frac{x^\top x'}{\sigma^2} \right)$$

Taking the Taylor expansion of the third factor,

$$\exp\left( \frac{x^\top x'}{\sigma^2} \right) = \sum_{n=0}^{\infty} \frac{(x^\top x')^n}{\sigma^{2n}\, n!},$$

we see that the Gaussian kernel corresponds to an infinite-dimensional feature space.^{4}^{5}
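We can sanity-check this factorization numerically: truncating the Taylor series after a handful of terms already matches the exact Gaussian kernel closely (the truncation depth, $\sigma = 1$, and the example points are my choices):

```python
import numpy as np
from math import factorial

def rbf(x, xp, sigma=1.0):
    """Exact Gaussian (RBF) kernel."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

def rbf_truncated(x, xp, sigma=1.0, terms=10):
    """Approximate the RBF kernel by truncating the Taylor series
    of exp(x^T x' / sigma^2) after `terms` terms."""
    s = np.dot(x, xp) / sigma**2
    series = sum(s**n / factorial(n) for n in range(terms))
    prefac = np.exp(-(np.sum(x**2) + np.sum(xp**2)) / (2 * sigma**2))
    return prefac * series

x, xp = np.array([0.3, -0.5]), np.array([0.1, 0.4])
print(rbf(x, xp), rbf_truncated(x, xp))  # the two values agree closely
```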
Footnotes

1. Most of this is adapted from Percy Liang’s CS 229T notes — check those out for more detail! ↩

2. A function $k$ is a valid kernel if and only if for every finite set of points $\{x_1, \dots, x_n\}$, the kernel matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive semidefinite. ↩

3. Each prediction requires computing one inner product $k(x_i, x)$ for each of the $n$ training examples, and each inner product requires $O(d)$ time. ↩

4. The first two factors constitute a valid kernel, since for any function $f$, the kernel $k(x, x') = f(x) f(x')$ is positive semidefinite: for any $v$, $v^\top K v = \left( \sum_i v_i f(x_i) \right)^2 \geq 0$. ↩

5. The sum or product of two valid kernels is also a valid kernel. The Gaussian kernel is the product of two kernels, the second of which is an infinite sum of polynomial kernels. ↩