Kernel Methods
by Heinrich Hartmann / 2025-06-26 / Stemwede
Motivation
Kernel Theory starts from a familiar setting of Linear Regression and transforms the problem of finding an ideal weight vector in a Feature Space to the Observation Space, where we now seek an ideal function \(s\) that minimizes a dualized loss function. This study originated from an effort to understand Section 6.1 in Bishop's 2006 book on Pattern Recognition and Machine Learning, where I first learned about this concept.
As a mathematician, duality results are the source of a lot of powerful results and techniques. For example, the Fourier Transform translates between the Time domain and Frequency domain and has broad theoretical and practical applications. In Geometry, there is a deep duality between "Spaces" and "Algebras" that is at the basis of the whole field.
So once I learned that Linear Regression has a dual formulation, I was immediately curious to understand this. In this note, I share a rigorous and geometric perspective on the Kernel Duality in Regression. It's aimed at the mathematically inclined reader, who is not afraid of abstractions.
At the same time, we stick to a highly practical setting, working with finite sets and matrices and avoid dealing with infinite-dimensional function spaces, convergence issues, etc. entirely. We use a slightly unusual notation for matrix calculation that we developed in Matrix Calculus. The key difference is that we don't distinguish between row and column vectors, and have "\(*\)" denote the "tensor contraction" between adjacent indices, so our \(a*K*b\) is commonly denoted by \(a^t \cdot K \cdot b\).
Setting
- Let \(X\) be a (finite) set. We think of this as base/space of grid of the functions we want to learn.
- We consider functions on \(X\) as elements \(s,t \in \IR[X]\), \(s[x] \in \IR\).
- Let \(W\) be another set. We think of this as a set of parameters for the models that we are considering.
- We call \(\IR[W]\) the feature space, consisting of linear combinations of parameters \(a,b \in \IR[W]\), \(a[w] \in \IR\).
- For each parameter \(w \in W\), we are given basis function \(\phi(w) \in \IR[X].\)
- For each point \(x \in X\), we get a feature vector \(\psi(x) \in \IR[W]\), by evaluating \(\phi(w)\) on \(x\).
- Both operations are adjoint/transposed to each other: \(\psi(x)[w]=\phi(w)[x]\).
The data given \(\phi,\psi\) can be organized in a matrix \(\Psi\), so that evaluation of \(\phi\) and \(\psi\) can be expressed as matrix multiplication. The entries \(\Psi[w,x]\) are derived by mapping \(x\) to the feature space \(\psi(x)\) and contracting with the weight \(w\), or equivalently by starting with \(w\), and evaluating the basis function \(\phi(w)\) at \(x\). More formally:
- The Design Matrix is defined as \(\Psi[w,x] = \phi(w)[x] = \psi(x)[w]\), \(\Psi \in \IR[W,X]\).
- In this way \(\phi(w) = \hat{w} * \Psi\), \(\psi(x) = \Psi * \hat{x}\) and $$ \Psi[w,x] = \phi(w)[x] = \phi(w) * \hat{x} = \hat{w} * \Psi * \hat{x} = \hat{w} * \psi(x) = \psi(x)[w] $$
Example: 1d-Linear Models
- Let \(X=\{x_1, \dots, x_n\} \subset \IR\) a finite set of points.
- Let \(W = \{0,1\}=[2]\), with basis function \(\phi(0)[x] = 1\) and \(\phi(1)[x] = x\).
- For \(x \in X\), we have a feature vector \(\psi(x) = [1, x] \in \IR[W]\).
-
The design matrix is given by:
\[ \Psi = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \]
Example: 1d-Polynomial Models
- Let \(X=\{x_1, \dots, x_n\} \subset \IR\) a finite set of points.
- Let \(W = \{0, \dots, D\}=[D+1]\), with polynomial basis functions \(\phi(d)[x] = x^d, x \in X\).
- For \(x \in X\), we have a feature vector \(\psi(x) = [1, x, x^2, \dots, x^D] \in \IR[W]\).
-
So \(\Psi[d,x] = x^d\) is the Vandermonde Matrix.
\[ \Psi = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ x_1^2 & x_2^2 & \cdots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_1^D & x_2^D & \cdots & x_n^D \end{bmatrix} \]
Example: Nd-Linear Models
- Let \(X \subset \IR[N]\) a finite set of points.
- Let \(W = [N]\), with basis functions \(\phi(n)[x] = x[n]\), the n-th component.
- The feature vector \(x \in X\) is the given by the point itself \(\psi(x) = x \in \IR[N]\).
- So \(\Psi[n,x] = x[n]\) has columns \(\Psi[:,x]=x\).
Example: 2D Image Reconstruction
- Let \(X = \{x_1,\dots, x_{1024} \} = [32] \times [32]\) be a grid of pixel coordinates for a 32x32 image.
- Let \(W = \{w_1,...,w_{25}\} \subset [32] \times [32]\) be a 5x5 grid of filter centers.
- Gaussian basis functions centered at each \(w \in W\): \(\phi(w)[x] = \exp(-\|x-w\|^2/2)\).
- For a pixel \(x \in X\), its feature vector \(\psi(x) \in \IR[25]\) contains the responses: \(\psi(x)[i] = \exp(-\|x-w_i\|^2/2)\).
- The design matrix \(\Psi \in \IR[25,1024]\) represents these Gaussian responses: \(\Psi[i,j] = \exp(-\|x_j-w_i\|^2/2)\).
Linear Regression
Learning Objective
We are given a training datum \(t \in \IR[X]\) and seek to find a parameters \(a \in \IR[W]\), so that the following (non-linear) function on \(\IR[W]\) is minimized:
Where \(L = \Psi * \Psi^* \in \IR[W,W]\).
If \(\hat{a}\) is minimizing \(\kL\), then \(\phi(\hat{a}) = \hat{a} * \Psi\) is the target function that we were looking for.
Proposition
- (Existence) \(\kL\) can always be minimized.
- (Uniqueness) Any two solutions \(a,a'\) differ by elements in \(Ker(\Psi^*)\).
These are features \(d \in \IR[W]\), so that \(\phi(d) = d * \Psi = 0 \in \IR[X]\). - The minimum of \(\kL\) is unique iff the basis functions \(\phi(w)\) are linearly independent in \(\IR[X]\).
- In this case, the minimum is given by \(\hat{a} = (\Psi * \Psi^*)^{-1} * \Psi * t = L^{-1} * \psi(t)\).
We will prove the proposition further below, after exploring some more variants of this. For now, let's just note that having a clear grasp, and algebraic formulas on the solution of a minimization problem is not typical. Usually, we are hunting down single approximate solutions with gradient descent methods.
Kernel Regression
Instead of minimizing \(\kL\) in the feature space \(\IR[W]\) we can pull the minimization problem over to \(\IR[X]\) by looking at \(\kL(\Psi * s)\). For a given \(t \in \IR[X]\), we seek \(s \in \IR[X]\) so that the following functional is minimized:
Where \(K=\Psi^* * \Psi \in \IR[X,X]\).
Note, that for a vector \(s \in \IR[X]\) the corresponding solution function can be expressed in terms of \(K\) as well:
Therefore, the entire optimization problem can be articulated in terms of \(X\) and \(K\). The space \(W\) and the map \(\Psi\) are implicit.
Proposition
- If \(s\) minimizes \(\kL^\vee\) then \(a = \Psi * s\) minimizes \(\kL\).
- If \(a\) minimizes \(\kL\), then there is an \(s\) so that \(\kL(a)=\kL(\Psi * s)\) assumes the minimum.
- If \(K\) is invertible, then the solution is unique, and given by \(\hat{s} = K^{-1} t\)
Regularized Regression
Returning back to our original setting, we may modify Loss function with a multiple of \(\| a \|^2\) in the weight space, to encourage solutions to have "small weights".
Proposition Assume \(\lambda >0\).
- There is always a unique global minimum \(\hat{a}\) for the regularized regression problem.
- The solution is given by \(\hat{a} = (\Psi * \Psi^* + \lambda I)^{-1} * \Psi * t = (L+\lambda I)^{-1} \psi(t).\)
Regularized Kernel Regression
Composing the regularized regression with \(\Psi\) leaves us with the following:
Proposition Assume \(\lambda >0\).
- A minimum for the function \(\kL_\lambda^\vee\) is attained at \(\hat{s} = (K+\lambda I)^{-1}t.\)
- Two minimum solutions \(s,s'\) differ by elements in \(Ker(K) = Ker(\Psi)\).
These functions \(u \in Ker(K)\) have vanishing feature vector \(\psi(u) = \Psi * u = 0 \in \IR[W]\). - The solutions is unique if \(Ker(\Psi) = Ker(K) = 0\).
This is the case iff the feature vectors \(\psi(x)\) are linearly independent or equivalently the functions \(K * \hat{x} \in \IR[X]\) are linearly independent.
Kernel Theory
Recall that \(K = \Psi^* * \Psi \in \IR[X,X]\).
Proposition
- \(K\) is symmetric: \(K[x,y] = K[y,x]\).
- \(K\) is positive semi-definite: \(s * K * s \geq 0\) for all \(s \in \IR[X]\).
Question
Let's assume conversely, that we are just given any symmetric, positive semi-definite matrix \(K \in \IR[X,X]\).
Can we construct a "latent feature space" \(H,\<,\>\), with a feature map \(\Phi: \IR[X] \ra H\), so that \(K = \Psi^* \Psi\) ?
Construction
Given a symmetric, positive-definite matrix \(K \in \IR[X,X]\). We give 3 different constructions of latent feature spaces for \(K\).
-
Cholesky Construction
Cholesky Decomposition \(K=R^*R\) with \(R\) upper triangular in \(\IR[X,X]\) with regards to some ordering of \(X\). We get a solution with:
- Feature Space \(H = \IR[X]\), \(\<a,b\> = a*b\).
- Feature map \(R: \IR[X] \ra \IR[X]\)
-
Spectral Construction (Mercer's Theorem)
The Spectral decomposition gives us \(K = U * D * U^*\) where \(U*U^* = I_X\) and \(D\) diagonal with \(D_{xx} \geq 0\). Let \(D^{1/2}\) be the diagonal matrix with entries \(\sqrt{D_{xx}}\). We get a solution with:
- Feature Space \(H = \IR[X]\), \(\<a,b\> = a*b\).
- Feature map \(L = D^{1/2} U^*: \IR[X] \ra \IR[X]\)
-
RKHS Construction (Reproduction Theorem)
Consider the map \(k(x) = \hat{x} * K = K[x,\_]\), which takes values in \(\IR[X]\). If we really wanted \(k\) to be our feature map, can we construct an appropriate feature space? It turns out we can!
Let \(H = Im(k) \subset \IR[X]\) the span of the functions \(k(x)\), \(x\in X\). As we will see below, there is a unique scalar product \((\_,\_)\) on \(H\) for which \(k\) \((k(x), k(y)) = K[x,y]\) for all \(x,y \in X\). We get a solution with:
- Feature space \(H = Im(k)\) with \((a,b)\) constructed below.
- Feature map \(k: \IR[X] \ra H \subset \IR[X]\)
We give two constructions of an appropriate scalar product on \(H\):
- RKHS Construction 1 For two functions \(s = \sum_x a[x] k_x\), \(t = \sum b[x] k_x\) we set $$ (s,t) := \sum_{x,y} a[x]b[y] k[x,y] = \sum_y b[y] s(y) = \sum_x a[x] t(x). $$ The given alternative representation shows that the scalar product is independent of the representations of \(s,t\) in terms of \(k_x\). We find: $$ (k(x), k(y)) = \sum_z \delta_{x}(z) k(y)[z] = k(x)[y] = K[x,y]. $$
-
RKHS Construction 2 Applying spectral decomposition to \(K\) we can write \(K = U * D * U^*\) or equivalently \(K = \sum_i \lambda_i u_i * u_i^*\) where \(u_i\) are Eigenvectors for the positive eigenvalues \(\lambda_i\) (and skipping all potential zero eigenvalues).
We get a pseudo-inverse for \(K\) as \(L := \sum_i \lambda_i^{-1} u_i * u_i^*\). In this way \(K * L = L * K = \sum_i u_i * u_i^*\) is an orthogonal projector onto \(Im(K)=F\).
For two functions \(s,t \in H\) define \((s,t) := s * L * t\). We find: $$ (k(x), k(y)) = \hat{x} * K * L * K * \hat{y} = \hat{x} * K * \hat{y} = K[x,y]. $$
Reproduction Theorem For a function \(a \in H\), the evaluation at a point \(x \in X\) can be expressed as a scalar product with the feature \(k(x)\):
\[ (k(x), s) = s[x]. \]A subset \(H \subset \IR[X]\) endowed with a scalar product \((\_,\_)\) that satisfies this property is called a Reproducing Kernel Hilbert Space (RKHS).
Proof If \(s = \sum_y a[y] k(y)\), then \((k(x), s) = \sum_y a[y] (k(x), k(y)) = \sum_y a[y] k(x,y) = s[x]\).
Abstract RKHS
Let \(X\) be an arbitrary set, and let \(K : X \times X \to \IR\) be a symmetric, positive semi-definite kernel, i.e. for all finite sets \(x_i \in X\) and \(\alpha_i \in \IR\) with \(i=1,\dots, n\) we have \(\sum_{i,j=1}^n \alpha_i \alpha_j K(x_i,x_j) \ge 0.\)
An RKHS for \(K\) is a Hilbert space \(H \subset \IR[X]\) of functions on \(X\), with inner product \((\_,\_)\), such that:
- Spanning property For all \(x\in X\), the function \(k(x) \in H\), and those functions are dense within \(H\) with regards to \((\_,\_)\).
- Reproducing property For all \(x \in X\), \(f \in H\), we have \(f(x) = (k(x), f)\).
Theorem
- For every symmetric, positive semi-definite kernel \(K\) on \(X\), there exists an RKHS \(H\) for \(K\).
- If \((H_1, (\cdot,\cdot)_1)\) and \((H_2, (\cdot,\cdot)_2)\) are two RKHS for \(K\), then there exists a unique isometric isomorphism \(T : H_1 \to H_2\) such that \(T(k_1(x)) = k_2(x)\) for all \(x \in X\).
Kernel Smoothing
Example - Smoothing with a tri-band kernel
Consider the following tri-banded kernel on \(X=[5]\):
This is a symmetric and positive definite matrix if \(\alpha \in [0,1).\) Applying \(K\) to a function \(s\) we get a "smoothed variant" of \(s\): $$ s^\flat = K * s $$ where each value \(s[x]\) is a weighted sum of it's neighbors for \(0<x<4\): $$ s^\flat[x] = \alpha s[x-1] + s[x] + \alpha s[x+1]. $$
We make the following observations:
-
The smoothing operation is given by multiplication with appropriate smoothing kernels: \(s^\flat = K *s\). In this sense kernel smoothing is inverse to Kernel regression where we seek an \(s\) for which \(K *s = t\) and \(t\) is a given training datum.
-
In the example, the amplitude of \(K * s\) is larger than \(s\) since the sum the weights for each point is larger than one.
-
The boundaries are special cases, since we have less neighbours.
In this section we will study construct a large class of Kernels which do avoid boundary problems and amplitude enlargement, as convolutions with delta functions.
We will also explore the duality between kernel smoothing and kernel regression in the regularized case.
Doubly Stochastic Kernels
A kernel \(K \in \IR[X,X]\) is called doubly stochastic if:
- Row sums: \(\sum_{y \in X} K[x,y] = 1\) for all \(x \in X\)
- Column sums: \(\sum_{x \in X} K[x,y] = 1\) for all \(y \in X\)
Note that both criteria are equivalent for symmetric kernels which we are considering here.
We call \(K\) a smoothing kernel if it's symmetric, positive semi-definite and doubly stochastic.
Convolution Kernels on Groups
Let \(X,+,0\) be a finite abelian group. This assumption provides a natural setting for convolution operations, which avoids boundary effects. For each pair \(x,y \in X\), we have the group difference defined as \(x-y \in X\). A convolution kernel on \(X\) is a kernel of the form: $$ K[x,y] = g(x-y) $$ for some function \(g \in \IR[X]\).
Examples:
- Cyclic Groups: Let \(X = \IZ/N\IZ =[N]\) be the cyclic group of order \(N\). This gives us a discrete circle with periodic boundary conditions.
- Discrete Tori: Let \(X = (\IZ/M\IZ) \times (\IZ/N\IZ) = [M]\times[N]\) for image processing applications with periodic boundaries in both dimensions.
- By the Chinese Remainder theorem, all abelian groups are isomorphic to products of cyclic groups.
Proposition If \(g \in \IR[X]\) satisfies \(g(z) \geq 0\), \(g(x) = g(-x)\) and \(\sum_{x \in X} g(x) = 1\), then the convolution kernel \(K[x,y] = g(x-y)\) is symmetric, positive semi-definite, and doubly stochastic.
Proof
(1) Symmetry:
$$
K[x,y] = g(x - y) = g(-(x - y)) = g(y - x) = K[y,x]
$$
by the assumption \(g(x) = g(-x)\).
(2) Stochastic:
$$
\sum_y K[x,y] = \sum_y g(x - y) = \sum_z g(z) = 1
$$
where we substituted \(z = x - y\).
(3) Positive semi-definite:
$$
s * K * s = \sum_{x,y} s(x) g(x - y) s(y) = \sum_z g(z) \sum_x s(x) s(x - z)
$$
$$
= \sum_z g(z) \langle s, \tau_z s \rangle \ge 0
$$
since \(g(z) \ge 0\) and inner products are real.
Example: Convolution Kernel on a 1D Torus
Let $$ X = \mathbb{Z}/5\mathbb{Z} = \set{ 0,1,2,3,4 } $$ be a discrete circle (1D-torus).
We define the convolution kernel as \(g = [\half, \frac{1}{4},0,0, \frac{1}{4}].\) This is a symmetric function \(g(x) = g(-x)\) with sum \(\sum_x g(x) = 1.\) The resulting kernel matrix is:
We validate that it is symmetric and doubly-stochastic.
Kernel Smoothing vs Kernel Regression
If \(s,t \in \IR[X]\) is a function and \(K\) a smoothing kernel. The following statements are equivalent:
- \(t = s^\flat = K*s\) is the \(K\)-smoothed version of \(s\).
- \(s\) is an exact solution to the least squares problem \(\|K s - t\|^2=0\).
So in the regression setting, we are seeking a function \(s\), so that the \(K\)-smoothed version \(K*s\) is close to \(t\).