Lecture 14: Kernels continued

$\begin{aligned} \mathsf{k}_1(\mathbf{x},\mathbf{z})&=\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (1)}\\ \mathsf{k}_2(\mathbf{x},\mathbf{z})&=\frac{2}{\sigma^2}\mathsf{k}_1(\mathbf{x},\mathbf{z})=\frac{2}{\sigma^2}\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (2)}\\ \mathsf{k}_3(\mathbf{x},\mathbf{z})&=e^{\mathsf{k}_2(\mathbf{x},\mathbf{z})}=e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} \qquad \textrm{well defined by rule (4)}\\ \mathsf{k}_{4}(\mathbf{x},\mathbf{z})&=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}} \mathsf{k}_3(\mathbf{x},\mathbf{z}) e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} =e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}} e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} \qquad \textrm{well defined by rule (3) with $f(\mathbf{x})= e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}$}\\ &=e^{\frac{-\mathbf{x}^\top\mathbf{x}+2\mathbf{x}^\top\mathbf{z}-\mathbf{z}^\top\mathbf{z}}{\sigma^2}}=e^{\frac{-\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}=\mathsf{k}_{RBF}(\mathbf{x},\mathbf{z}) \end{aligned}$
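The construction above can be checked numerically. The following is a minimal sketch, not part of the original notes: it builds $\mathsf{k}_4$ step by step and compares it with the RBF kernel evaluated directly. The function names `dot`, `k_rbf`, and `k4`, as well as the test vectors and the value of `sigma`, are illustrative assumptions.

```python
import math

def dot(x, z):
    """Plain inner product x^T z of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))

def k_rbf(x, z, sigma):
    """RBF kernel evaluated directly: exp(-||x - z||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def k4(x, z, sigma):
    """RBF kernel assembled from the four steps of the derivation."""
    k1 = dot(x, z)                    # step justified by rule (1)
    k2 = 2.0 / sigma ** 2 * k1        # step justified by rule (2)
    k3 = math.exp(k2)                 # step justified by rule (4)
    f = lambda v: math.exp(-dot(v, v) / sigma ** 2)
    return f(x) * k3 * f(z)           # step justified by rule (3)

# Arbitrary example inputs (assumed values, chosen only for illustration).
x, z, sigma = [1.0, -2.0, 0.5], [0.3, 0.7, -1.2], 1.5
print(k4(x, z, sigma), k_rbf(x, z, sigma))
```

Both printed values should agree up to floating-point rounding, since the exponents combine into $-\|\mathbf{x}-\mathbf{z}\|^2/\sigma^2$.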

Kernel Machines

Kernel Regression

Kernelization

$\begin{aligned} \mathbf{X}\vec{\alpha}&=\mathbf{w}= (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y} \qquad \textrm{| multiply from left by $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$} \\ (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\vec{\alpha} &= (\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X}}_{=\mathbf{I}} \mathbf{y}\\ \vec{\alpha}&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{y} \qquad \textrm{| substitute $\mathbf{K}=\mathbf{X}^\top \mathbf{X}$}\\ \vec{\alpha}&= \mathbf{K}^{-1} \mathbf{y} \end{aligned}$

Note that the step marked $=\mathbf{I}$ relies on both $(\mathbf{X}\mathbf{X}^\top)^{-1}$ and $(\mathbf{X}^\top\mathbf{X})^{-1}$ existing.
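As a sanity check, the relation $\mathbf{w}=\mathbf{X}\vec{\alpha}$ with $\vec{\alpha}=\mathbf{K}^{-1}\mathbf{y}$ can be verified numerically for the linear kernel. This is a small sketch under stated assumptions: the columns of `X` are the training points, and `X` is chosen square and invertible so that every inverse in the derivation exists. The concrete numbers are made up for illustration.

```python
import numpy as np

# Columns of X are the training points (d x n), matching K = X^T X.
# Assumed toy data: a square, invertible X so all inverses exist.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0]])
y = np.array([1.0, -1.0, 2.0])

# Primal solution: w = (X X^T)^{-1} X y
w = np.linalg.solve(X @ X.T, X @ y)

# Dual solution with the linear kernel: alpha = K^{-1} y, K = X^T X
K = X.T @ X
alpha = np.linalg.solve(K, y)

print(np.allclose(X @ alpha, w))  # does w equal X alpha?
print(np.allclose(K @ alpha, y))  # do the training predictions match y?
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; it computes the same $\vec{\alpha}$.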

Testing

Remember that we defined $\mathbf{w}=\mathbf{X}\vec{\alpha}$. The prediction for a test point $\mathbf{z}$ then becomes

$h(\mathbf{z})=\mathbf{z}^\top \mathbf{w} =\mathbf{z}^\top\underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}} =\mathbf{z}^\top\mathbf{X}\underbrace{(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}} =\underbrace{\mathbf{K}_*}_{\mathbf{z}^\top\mathbf{X}}(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y},$

where $\vec{\alpha}=(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}$ is the regularized (ridge regression) form of the solution, which reduces to $\vec{\alpha}=\mathbf{K}^{-1}\mathbf{y}$ for $\sigma^2=0$, and $\mathbf{K}_*$ is the kernel of the test point with the training points, i.e. $[\mathbf{K}_*]_{i}=\phi(\mathbf{z})^\top\phi(\mathbf{x}_i)$, the inner product between the test point $\mathbf{z}$ and the training point $\mathbf{x}_i$ after the mapping into feature space through $\phi$.
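To make the training and testing steps concrete, here is a minimal sketch of kernel regression with an RBF kernel. It is an illustration under assumptions, not a reference implementation: the helper names `rbf_kernel`, `fit`, and `predict` are hypothetical, data points are stored as rows (the usual NumPy convention, whereas the formulas above store them as columns of $\mathbf{X}$), `reg` plays the role of $\sigma^2$ in $(\mathbf{K}+\sigma^2\mathbf{I})^{-1}$, and `bandwidth` is the separate width parameter of the RBF kernel.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """RBF kernel matrix between the rows of A (m x d) and the rows of B (n x d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / bandwidth ** 2)

def fit(X_train, y, reg, bandwidth):
    """Solve (K + reg * I) alpha = y for the dual weights alpha."""
    K = rbf_kernel(X_train, X_train, bandwidth)
    return np.linalg.solve(K + reg * np.eye(len(y)), y)

def predict(Z, X_train, alpha, bandwidth):
    """h(z) = K_* alpha, where row j of K_* holds k(z_j, x_i) for all i."""
    K_star = rbf_kernel(Z, X_train, bandwidth)
    return K_star @ alpha

# Assumed toy data: five 1-D inputs with targets sin(x).
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.sin(X_train[:, 0])

alpha = fit(X_train, y, reg=0.1, bandwidth=1.0)
print(predict(np.array([[1.5], [2.5]]), X_train, alpha, bandwidth=1.0))
```

Note that prediction only ever touches the training data through kernel evaluations, so $\phi$ never has to be computed explicitly. With `reg=0.0` the fit interpolates the training targets, provided the kernel matrix is invertible.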

Kernel SVM

Kernel SVM - the smart nearest neighbor

Do you remember the k-nearest neighbor algorithm? For binary classification problems ($y_i\in\{+1,-1\}$), we can write the decision function for a test point $\mathbf{z}$ as

$h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i \delta^{nn}(\mathbf{x}_i,\mathbf{z})\right),$

where $\delta^{nn}(\mathbf{x}_i,\mathbf{z})\in\{0,1\}$ with $\delta^{nn}(\mathbf{x}_i,\mathbf{z})=1$ only if $\mathbf{x}_i$ is one of the $k$ nearest neighbors of the test point $\mathbf{z}$. The SVM decision function

$h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i\alpha_i \mathsf{k}(\mathbf{x}_i,\mathbf{z})+b\right)$

is very similar, but instead of limiting the decision to the $k$ nearest neighbors, it considers all training points, and the kernel function assigns more weight to those that are closer (large $\mathsf{k}(\mathbf{x}_i,\mathbf{z})$). In some sense you can view the RBF kernel as a soft nearest neighbor assignment, as the exponential decay with distance will assign almost no weight to all but the neighboring points of $\mathbf{z}$. The kernel SVM algorithm also learns a weight $\alpha_i\geq 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many $\alpha_i=0$.
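The analogy can be illustrated by placing the hard k-NN vote next to a kernel-weighted vote of the same form as the SVM decision function. This is only a sketch: the weights `alpha` and the bias `b` are set by hand here, whereas an actual SVM learns them by solving its optimization problem. The function names and the toy data are assumptions made for illustration.

```python
import math

def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def knn_predict(X, y, z, k):
    """Hard vote: delta_nn is 1 for the k nearest training points, else 0."""
    nearest = sorted(range(len(X)), key=lambda i: sq_dist(X[i], z))[:k]
    score = sum(y[i] for i in nearest)
    return 1 if score >= 0 else -1

def soft_vote_predict(X, y, z, alpha, b, sigma):
    """Kernel-weighted vote: sign(sum_i y_i alpha_i k(x_i, z) + b), RBF kernel."""
    score = b + sum(a * yi * math.exp(-sq_dist(xi, z) / sigma ** 2)
                    for xi, yi, a in zip(X, y, alpha))
    return 1 if score >= 0 else -1

# Assumed toy data: two well-separated clusters.
X = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
y = [+1, +1, +1, -1, -1, -1]

uniform = [1.0] * len(X)                  # every point votes (hand-set, not learned)
sparse = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # most points "removed" via alpha_i = 0

for z in [(0.5, 0.5), (3.5, 3.5)]:
    print(z,
          knn_predict(X, y, z, k=3),
          soft_vote_predict(X, y, z, uniform, b=0.0, sigma=1.0),
          soft_vote_predict(X, y, z, sparse, b=0.0, sigma=1.0))
```

On this toy data the sparse weights give the same labels as the uniform ones, which mirrors how points with $\alpha_i=0$ drop out of the decision function.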