I write about the kernel trick and give an informal derivation of how it applies to an L2-regularized least squares example.
Published
23 October 2024
In this post, I will demonstrate an intuitive derivation of the kernelized form of L2-regularized least squares as an illustration of how the kernel trick can be useful.
Let’s begin by solving the basic L2-regularized least squares problem.
Solving the L2-regularized Least Squares Problem
The regularized least squares problem is:
$$\min_{w \in \mathbb{R}^M} \; \frac{1}{2N} \left[\, \|\Phi w - y\|^2 + \lambda \|w\|^2 \,\right]$$
Recall that $\Phi_{ij} = \phi_j(x_i)$, so our design matrix $\Phi$ has dimensions $N \times M$ (this is an important detail).
This is also known as ridge regression.
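As a quick aside, here is what solving this directly might look like in NumPy; the polynomial feature map, the toy data, and all names below are illustrative choices of mine rather than part of the setup above.

```python
import numpy as np

# Minimal sketch of the primal ridge regression solution,
# w_hat = (Phi^T Phi + lambda I_M)^{-1} Phi^T y.
def ridge_primal(Phi, y, lam):
    M = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(M)        # M x M system
    return np.linalg.solve(A, Phi.T @ y)     # solve, rather than invert explicitly

# Toy 1-D example with polynomial features phi_j(x) = x^j (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
Phi = np.vander(x, 6, increasing=True)       # N x M design matrix with M = 6
w_hat = ridge_primal(Phi, y, lam=1e-2)
```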
Now, to derive the solution to the ridge regression empirical risk minimization:
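One way to sketch the argument: set the gradient of the objective to zero to obtain the usual primal solution, then apply the push-through identity $(\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I_N)^{-1}$ to rewrite it in terms of an $N$-dimensional coefficient vector:

$$\nabla_w \, \frac{1}{2N} \left[ \|\Phi w - y\|^2 + \lambda \|w\|^2 \right] = 0 \;\Rightarrow\; \Phi^T (\Phi \hat{w} - y) + \lambda \hat{w} = 0 \;\Rightarrow\; \hat{w} = (\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T y$$

$$\hat{w} = \Phi^T \underbrace{(\Phi \Phi^T + \lambda I_N)^{-1} y}_{\alpha} = \sum_{i=1}^{N} \alpha_i \, \phi(x_i) \quad\Rightarrow\quad \hat{f}(x) = \hat{w}^T \phi(x) = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)^T \phi(x)$$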
Let’s now summarize our findings for the reframed ridge regression solution:
$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)^T \phi(x), \quad \text{where } \alpha = (G + \lambda I_N)^{-1} y$$
We can make two key observations:
The training inputs {$x_i$} appear directly in the prediction function
Each new prediction requires $O(N)$ operations, one term per training point
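To make the reformulation concrete, here is a minimal NumPy sketch (reusing the illustrative `Phi` and `y` from the earlier snippet; all names are mine):

```python
import numpy as np

# Dual form of ridge regression: alpha = (G + lambda I_N)^{-1} y, with G = Phi Phi^T.
def ridge_dual_fit(Phi, y, lam):
    N = Phi.shape[0]
    G = Phi @ Phi.T                           # N x N Gram matrix, G_ij = phi(x_i)^T phi(x_j)
    return np.linalg.solve(G + lam * np.eye(N), y)

# f_hat(x) = sum_i alpha_i phi(x_i)^T phi(x), evaluated for each row of Phi_new.
def ridge_dual_predict(alpha, Phi_train, Phi_new):
    return Phi_new @ Phi_train.T @ alpha

# alpha = ridge_dual_fit(Phi, y, lam=1e-2)
# Predictions agree with Phi_new @ w_hat from the primal solution.
```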
Now, to see how this reframing is useful, let’s make some comparisons with the original ridge regression solution.
Comparing Two Formulations of the Solution
Original solution:
$$\hat{f}(x) = \sum_{j=0}^{M-1} \hat{w}_j \, \phi_j(x), \quad \text{where } \hat{w} = \big(\underbrace{\Phi^T \Phi}_{M \times M} + \lambda I_M\big)^{-1} \Phi^T y$$

Reformulated solution:
$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)^T \phi(x), \quad \text{where } \alpha = \big(\underbrace{G}_{N \times N} + \lambda I_N\big)^{-1} y, \quad G_{ij} = \phi(x_i)^T \phi(x_j)$$
These two formulations are the exact same function, but we can make the following key comparisons:
The original solution requires solving an $M \times M$ linear system, while the reformulated solution only requires solving an $N \times N$ one. There are quite a few cases where $M \gg N$ for the $N \times M$ design matrix $\Phi$, and in those cases the $N \times N$ Gram matrix $G = \Phi \Phi^T$ is much more computationally feasible to work with.
Most importantly, the reformulated solution depends on the feature maps {$\phi_j$} only through the dot product $\phi(x)^T \phi(x')$.
Next, let's see how we can generalize this solution to kernel form.
Generalizing to Kernel Form
So, we have our reformulated solution:

$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)^T \phi(x)$$

⇓

Notice we have a similarity measure of the form $\phi(x)^T \phi(x')$, which involves computing the inner product of the two feature mappings.

⇓

What if we forget about the feature map $\phi$ and simply define a kernel $K(x, x') = \phi(x)^T \phi(x')$?

⇓

Such that

$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \, K(x_i, x), \quad \alpha = (G + \lambda I_N)^{-1} y, \quad G_{ij} = K(x_i, x_j)$$
Why is this useful? Given $K(x,x') = \phi(x)^T\phi(x')$, the kernel computes the same value as the inner product of the feature maps, but does so without explicitly computing $\phi(x)$ or $\phi(x')$. In other words, the kernel encapsulates all the information about the feature space $\phi$ and the inner product defined on it. We don't need to explicitly map $x$ and $x'$ into the feature space and then compute the inner product, because $K(x,x')$ provides the result directly in terms of $x$ and $x'$ in the input space.
This might seem suspiciously convenient. Why are we able to swap $\phi(x)^T\phi(x')$ for $K(x,x')$? Well, Mercer's Theorem states that if $K$ is a symmetric positive definite (SPD) kernel, then there exists a feature space $H$ and a feature map $\phi$ such that $K(x,x') = \phi(x)^T\phi(x')$.
In other words, any kernel function that satisfies Mercer’s Theorem implicitly corresponds to a valid feature map $\phi(x)$.
In a sense, this is the core idea of the Kernel Trick.
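Putting the pieces together, here is a short kernel ridge regression sketch using the Gaussian (RBF) kernel introduced below; note that $\phi$ is never formed. The toy data and all names are illustrative.

```python
import numpy as np

# K(x, x') = exp(-||x - x'||^2 / (2 s^2)), computed for all pairs of rows.
def rbf_kernel(X1, X2, s=0.5):
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-sq_dists / (2 * s**2))

def kernel_ridge_fit(X, y, lam, kernel):
    G = kernel(X, X)                                   # G_ij = K(x_i, x_j)
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def kernel_ridge_predict(alpha, X_train, X_new, kernel):
    return kernel(X_new, X_train) @ alpha              # f_hat(x) = sum_i alpha_i K(x_i, x)

# Toy data (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = kernel_ridge_fit(X, y, lam=1e-2, kernel=rbf_kernel)
y_grid = kernel_ridge_predict(alpha, X, np.linspace(-1, 1, 5)[:, None], rbf_kernel)
```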
Constructing a Kernel
There are a few ways to construct a kernel.
Firstly, you could construct a valid kernel by defining your basis functions (the feature mapping $\phi$) explicitly, and then computing the corresponding kernel as $k(x,x') = \phi(x)^T\phi(x')$.
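For instance (a standard example of this approach): for $x \in \mathbb{R}^2$, defining the feature map explicitly and expanding the dot product gives

$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \quad\Rightarrow\quad k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x^T x')^2$$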
Next, you could directly define a kernel function as long as it is known to be valid. Some examples of SPD kernels include:
Linear Kernel: $k(x, x') = x^T x'$ or $k(x, x') = 1 + x^T x'$
Polynomial Kernel: $k(x, x') = (1 + x^T x')^m$, $m \in \mathbb{Z}^+$
Gaussian (RBF) Kernel: $k(x, x') = \exp\!\left(-\dfrac{\|x - x'\|^2}{2 s^2}\right)$, $s > 0$
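Written as code, these example kernels are one-liners; the hyperparameter defaults below are arbitrary placeholders.

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                         # or 1 + x @ xp

def polynomial_kernel(x, xp, m=3):
    return (1 + x @ xp) ** m                              # m a positive integer

def gaussian_kernel(x, xp, s=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * s**2))    # s > 0
```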
Finally, you could also construct kernels from simpler kernels, by using them as building blocks. For example: