Lecture 14.2: Kernels continued

Well-defined kernels

The following rules can be used to construct new kernels. Kernels built by recursively combining one or more of these rules are called well-defined kernels:
  1. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top\mathbf{z}\quad\quad\quad(1)$
  2. $\mathsf{k}(\mathbf{x}, \mathbf{z})=c\mathsf{k_1}(\mathbf{x},\mathbf{z})\quad\quad(2)$
  3. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})+\mathsf{k_2}(\mathbf{x},\mathbf{z})$
  4. $\mathsf{k}(\mathbf{x}, \mathbf{z})=g(\mathsf{k}(\mathbf{x},\mathbf{z}))$
  5. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})\mathsf{k_2}(\mathbf{x},\mathbf{z})$
  6. $\mathsf{k}(\mathbf{x}, \mathbf{z})=f(\mathbf{x})\mathsf{k_1}(\mathbf{x},\mathbf{z})f(\mathbf{z})\quad(3)$
  7. $\mathsf{k}(\mathbf{x}, \mathbf{z})=e^{\mathsf{k_1}(\mathbf{x},\mathbf{z})}\quad\quad(4)$
  8. $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top A \mathbf{z}$
where $\mathsf{k}_1,\mathsf{k}_2$ are well-defined kernels, $c\geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A}\succeq 0$ is positive semi-definite. A kernel being well-defined is equivalent to the corresponding kernel matrix, $K$, being positive semi-definite (not proved here), which in turn is equivalent to any of the following statements:
  1. All eigenvalues of $K$ are non-negative.
  2. $\exists \text{ real matrix } P \text{ s.t. } K=P^\top P.$
  3. $\forall \text{ real vector } \mathbf{x}, \mathbf{x}^\top K \mathbf{x} \ge 0.$
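To make these characterizations concrete, here is a minimal numerical sketch (Python with NumPy, which is an assumption of this example, not part of the lecture) that builds a linear-kernel matrix on random data and checks all three statements:

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3
X = np.random.randn(n, d)          # n points in d dimensions (rows are samples)

K = X @ X.T                        # linear-kernel matrix, K_ij = x_i^T x_j

# 1. All eigenvalues are non-negative (up to numerical precision).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)

# 2. K = P^T P for some real matrix P (here P = X^T works by construction).
P = X.T
print(np.allclose(K, P.T @ P))

# 3. x^T K x >= 0 for any real vector x (checked on a random sample).
v = np.random.randn(n)
print(v @ K @ v >= -1e-10)
```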
It is straightforward to show that the linear kernel and the polynomial kernel with integer degree $d$ are both well-defined kernels.
The RBF kernel $\mathsf{k}_{RBF}(\mathbf{x},\mathbf{z})=e^{\frac{-(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}$ is also a well-defined kernel, as the following derivation shows:
\[\begin{aligned} \mathsf{k}_1(\mathbf{x},\mathbf{z})&=\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (1)}\\ \mathsf{k}_2(\mathbf{x},\mathbf{z})&=\frac{2}{\sigma^2}\mathsf{k}_1(\mathbf{x},\mathbf{z})=\frac{2}{\sigma^2}\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (2)}\\ \mathsf{k}_3(\mathbf{x},\mathbf{z})&=e^{\mathsf{k}_2(\mathbf{x},\mathbf{z})}=e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} \qquad \textrm{well defined by rule (4)}\\ \mathsf{k}_{4}(\mathbf{x},\mathbf{z})&=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}} \mathsf{k}_3(\mathbf{x},\mathbf{z}) e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} =e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}} e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} \qquad \textrm{well defined by rule (3) with $f(\mathbf{x})= e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}$}\\ &=e^{\frac{-\mathbf{x}^\top\mathbf{x}+2\mathbf{x}^\top\mathbf{z}-\mathbf{z}^\top\mathbf{z}}{\sigma^2}}=e^{\frac{-(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}=\mathsf{k}_{RBF}(\mathbf{x},\mathbf{z}) \end{aligned}\]
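The algebraic steps above can also be checked numerically. The sketch below (again Python with NumPy, an assumption of this example) compares the composed kernel $\mathsf{k}_4$ with the direct RBF formula on random inputs:

```python
import numpy as np

np.random.seed(1)
sigma = 1.5
x, z = np.random.randn(4), np.random.randn(4)

# k_4: built step by step from rules (1), (2), (4), and (3).
k1 = x @ z
k2 = 2.0 / sigma**2 * k1
k3 = np.exp(k2)
k4 = np.exp(-x @ x / sigma**2) * k3 * np.exp(-z @ z / sigma**2)

# Direct RBF formula.
k_rbf = np.exp(-np.sum((x - z)**2) / sigma**2)

print(np.isclose(k4, k_rbf))   # True: both expressions agree
```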

You can even define kernels over sets, strings, or molecules. The following kernel is defined on two sets, \[\mathsf{k}(S_1,S_2)=e^{|S_1 \cap S_2|}.\] It can be shown that this is a well-defined kernel: list all possible samples in $\Omega$ and arrange them into a sorted list, and for each set $S_i$ define an indicator vector $\mathbf{x}_i \in \{0,1\}^{|\Omega|}$, whose elements indicate whether the corresponding sample is included in $S_i$. It is then easy to show that \[\mathsf{k}(S_1,S_2)=e^{\mathbf{x}_1^\top \mathbf{x}_2},\] which is a well-defined kernel by rules (1) and (4).
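As a sanity check, the two forms of the set kernel can be compared directly. The sketch below uses a small, purely illustrative universe $\Omega$; the helper name indicator is made up for this example:

```python
import numpy as np

# Toy universe of possible samples, arranged in a fixed sorted order.
omega = ["a", "b", "c", "d", "e"]

def indicator(S):
    """Binary vector in {0,1}^|Omega| marking which samples are in S."""
    return np.array([1.0 if s in S else 0.0 for s in omega])

S1, S2 = {"a", "b", "d"}, {"b", "d", "e"}

# Set-kernel definition: k(S1, S2) = exp(|S1 ∩ S2|).
k_sets = np.exp(len(S1 & S2))

# Equivalent inner-product form: k(S1, S2) = exp(x1^T x2).
x1, x2 = indicator(S1), indicator(S2)
k_inner = np.exp(x1 @ x2)

print(np.isclose(k_sets, k_inner))   # True
```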

Kernel Machines

An algorithm can be kernelized in two steps:
  1. Rewrite the algorithm and the classifier entirely in terms of inner products, $\mathbf{x}_i^\top \mathbf{x}_j$.
  2. Define a kernel function and substitute $\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top \mathbf{x}_j$.
Quiz 1: How can you kernelize nearest neighbors (with the Euclidean distance), given training data $D=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}$?
Observation: Nearest neighbor under the squared Euclidean distance is the same as nearest neighbor under the Euclidean distance. Therefore, starting from the original version,
\[ h(\mathbf{x})= y_j \quad \text{where} \quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j) \in D}\ \|\mathbf{x}-\mathbf{x}_j\|^2, \] we can derive the kernel version, \[h(\mathbf{x})= y_j \quad \text{where} \quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j) \in D}\ \big(\mathsf{k}(\mathbf{x}, \mathbf{x})-2\mathsf{k}(\mathbf{x}, \mathbf{x}_j)+\mathsf{k}(\mathbf{x}_j, \mathbf{x}_j)\big). \] In practice, kernelized nearest neighbor is rarely used: it brings little improvement, because the original $k$-NN classifier is already highly non-linear.

Kernel Regression

Kernel regression is kernelized Ordinary Least Squares regression (OLS). Vanilla OLS minimizes the squared loss \begin{equation} \min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i -y_i)^2 \end{equation} to find the hyperplane $\mathbf{w}$. The prediction at a test point is simply $h(\mathbf{x})=\mathbf{w}^\top \mathbf{x}$. If we let $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n]$ and $\mathbf{y}=[y_1,\ldots,y_n]^\top$, the solution of OLS can be written in closed form: \begin{equation} \mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}.\qquad(5) \end{equation}

Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs, \begin{equation} \mathbf{w}=\sum_{i=1}^{n} \alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}. \end{equation} You can verify that such a vector $\vec{\alpha}$ must always exist by observing the gradient updates that occur when the squared loss is minimized with gradient descent starting from $\mathbf{w}_0=\vec{0}$: every update adds a linear combination of the training inputs (and because the squared loss is convex, the solution is independent of the initialization). We now kernelize the algorithm by substituting $\mathsf{k}(\mathbf{x},\mathbf{z})$ for every inner product $\mathbf{x}^\top \mathbf{z}$. It follows that the prediction at a test point becomes \begin{equation} h(\mathbf{x})=\mathbf{w}^\top \mathbf{x} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top\mathbf{x} =\sum_{i=1}^{n} \alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{x})= K_{X:\mathbf{x}}\vec{\alpha}. \end{equation} It remains to show that we can also solve for the values of $\vec{\alpha}$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\vec{\alpha}={\mathbf{K}}^{-1} \mathbf{y}$.
\[\begin{aligned} \mathbf{X}\vec{\alpha}&=\mathbf{w}= (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y} &&\textrm{| multiply from the left by $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$} \\ (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\vec{\alpha} &= (\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X}}_{=\mathbf{I}} \mathbf{y}\\ \vec{\alpha}&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{y} &&\textrm{| substitute $\mathbf{K}=\mathbf{X}^\top \mathbf{X}$}\\ \vec{\alpha}&= \mathbf{K}^{-1} \mathbf{y} \end{aligned}\]

Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes \begin{equation} \vec{\alpha}=(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}. \end{equation} In practice a small value of $\sigma^2>0$ increases stability, especially if $\mathbf{K}$ is not invertible.
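As an illustration, here is a minimal kernel (ridge) regression sketch, assuming Python with NumPy and an RBF kernel; the helper names (rbf_kernel, fit, predict) and the toy data are made up for this example. Setting the ridge term sigma2 to zero recovers the plain kernelized OLS solution $\vec{\alpha}=\mathbf{K}^{-1}\mathbf{y}$ whenever $\mathbf{K}$ is invertible:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / sigma**2)

def fit(X, y, sigma=1.0, sigma2=1e-6):
    """Solve (K + sigma2 * I) alpha = y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + sigma2 * np.eye(len(y)), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    """h(x) = sum_i alpha_i k(x_i, x) for every test point x."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha

# Toy usage: fit a noisy sine curve.
np.random.seed(2)
X = np.random.rand(50, 1) * 6
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
alpha = fit(X, y, sigma=1.0, sigma2=1e-3)
print(predict(X, alpha, X[:5]))   # predictions close to y[:5]
```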

Kernel SVM

The original SVM is a quadratic programming problem: \[\begin{aligned} &\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w}+C \sum_{i=1}^{n} \xi_i \\ \text{s.t. }\forall i, &\quad y_i(\mathbf{w}^\top\mathbf{x}_i +b) \geq 1 - \xi_i\\ &\quad \xi_i \geq 0. \end{aligned}\] Its dual form is \[\begin{aligned} &\min_{\alpha_1,\cdots,\alpha_n}\ \frac{1}{2} \sum_{i,j}\alpha_i \alpha_j y_i y_j K_{ij} - \sum_{i=1}^{n}\alpha_i \\ \text{s.t.} &\quad 0 \leq \alpha_i \leq C\\ &\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \end{aligned}\] where $K_{ij}=\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$, so the data enter only through inner products. From the KKT conditions it is easy to show that $0 < \alpha_i < C \implies y_i(\mathbf{w}^\top \phi(\mathbf{x}_i)+b)=1$. At the solution almost all $\alpha_i = 0$, i.e., the solution is sparse; the training points with $\alpha_i \neq 0$ are called support vectors.
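To see this sparsity in practice, here is a short sketch using scikit-learn's SVC with an RBF kernel (the use of scikit-learn and the toy data are assumptions of this example, not part of the lecture); typically only a small fraction of the training points end up as support vectors with $\alpha_i \neq 0$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary classification problem: two Gaussian blobs.
np.random.seed(3)
X = np.vstack([np.random.randn(100, 2) + 2, np.random.randn(100, 2) - 2])
y = np.hstack([np.ones(100), -np.ones(100)])

# Kernel SVM with an RBF kernel; C bounds the dual variables (0 <= alpha_i <= C).
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# Only the support vectors have alpha_i != 0; all other training points drop out.
print("support vectors:", len(clf.support_), "out of", len(X))
print("dual coefficients (y_i * alpha_i), one per support vector:", clf.dual_coef_.shape)
```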