
Lecture 14.2: Kernels continued

Well-defined kernels

Here are the most common kernels:
  - Linear: $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top \mathbf{z}$
  - Polynomial: $k(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top \mathbf{z})^d$
  - RBF (Gaussian): $k(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$
Kernels built by recursively combining one or more of the following rules are called well-defined kernels:
  1. $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top \mathbf{z}$   (1)
  2. $k(\mathbf{x},\mathbf{z})=c\,k_1(\mathbf{x},\mathbf{z})$   (2)
  3. $k(\mathbf{x},\mathbf{z})=k_1(\mathbf{x},\mathbf{z})+k_2(\mathbf{x},\mathbf{z})$
  4. $k(\mathbf{x},\mathbf{z})=g\!\left(k_1(\mathbf{x},\mathbf{z})\right)$
  5. $k(\mathbf{x},\mathbf{z})=k_1(\mathbf{x},\mathbf{z})\,k_2(\mathbf{x},\mathbf{z})$
  6. $k(\mathbf{x},\mathbf{z})=f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$   (3)
  7. $k(\mathbf{x},\mathbf{z})=e^{k_1(\mathbf{x},\mathbf{z})}$   (4)
  8. $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top \mathbf{A}\mathbf{z}$
where $k_1,k_2$ are well-defined kernels, $c\geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A}\succeq 0$ is positive semi-definite. A kernel being well-defined is equivalent to the corresponding kernel matrix $\mathbf{K}$ being positive semi-definite (not proved here), which is equivalent to any of the following statements:
  1. All eigenvalues of K are non-negative.
  2. $\exists$ a real matrix $\mathbf{P}$ s.t. $\mathbf{K}=\mathbf{P}^\top\mathbf{P}$.
  3. $\forall$ real vectors $\mathbf{x}$, $\mathbf{x}^\top \mathbf{K}\mathbf{x}\geq 0$.
It is trivial to prove that the linear kernel and the polynomial kernel (with integer $d$) are both well-defined kernels.
The RBF kernel $k_{\text{RBF}}(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$ is also a well-defined kernel:
$$\begin{aligned}
k_1(\mathbf{x},\mathbf{z})&=\mathbf{x}^\top \mathbf{z} && \text{well-defined by rule (1)}\\
k_2(\mathbf{x},\mathbf{z})&=\frac{2}{\sigma^2}k_1(\mathbf{x},\mathbf{z})=\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2} && \text{well-defined by rule (2)}\\
k_3(\mathbf{x},\mathbf{z})&=e^{k_2(\mathbf{x},\mathbf{z})}=e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} && \text{well-defined by rule (4)}\\
k_4(\mathbf{x},\mathbf{z})&=e^{-\frac{\mathbf{x}^\top \mathbf{x}}{\sigma^2}}\,k_3(\mathbf{x},\mathbf{z})\,e^{-\frac{\mathbf{z}^\top \mathbf{z}}{\sigma^2}}=e^{-\frac{\mathbf{x}^\top \mathbf{x}}{\sigma^2}}e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}}e^{-\frac{\mathbf{z}^\top \mathbf{z}}{\sigma^2}} && \text{well-defined by rule (3) with } f(\mathbf{x})=e^{-\frac{\mathbf{x}^\top \mathbf{x}}{\sigma^2}}\\
&=e^{\frac{-\mathbf{x}^\top \mathbf{x}+2\mathbf{x}^\top \mathbf{z}-\mathbf{z}^\top \mathbf{z}}{\sigma^2}}=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}=k_{\text{RBF}}(\mathbf{x},\mathbf{z})
\end{aligned}$$
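As a quick numerical sanity check of this equivalence (a sketch only; the random data, the NumPy usage, and the choice $\sigma=1$ are assumptions made here, not part of the notes), one can build an RBF kernel matrix and confirm that its eigenvalues are non-negative up to floating-point error:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 arbitrary points in R^3
K = rbf_kernel_matrix(X, sigma=1.0)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
print(eigvals.min())                  # >= 0 up to numerical error
```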

You can even define kernels over sets, strings, or molecules. The following kernel is defined on two sets, $k(S_1,S_2)=e^{|S_1\cap S_2|}$. It can be shown that this is a well-defined kernel. We can list all possible samples $\Omega$ and arrange them into a sorted list. For a set $S$ we define a vector $\mathbf{x}\in\{0,1\}^{|\Omega|}$, where each element indicates whether the corresponding sample is included in the set. It is easy to prove that $k(S_1,S_2)=e^{\mathbf{x}_1^\top \mathbf{x}_2}$, which is a well-defined kernel by rules (1) and (4).
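A small sketch of this argument (the example sets, the `universe` variable, and the helper names are invented for illustration): the kernel evaluated directly on the sets agrees with $e^{\mathbf{x}_1^\top \mathbf{x}_2}$ computed from the indicator vectors.

```python
import numpy as np

def set_kernel(S1, S2):
    # k(S1, S2) = exp(|S1 intersect S2|)
    return np.exp(len(S1 & S2))

def indicator(S, universe):
    # x in {0,1}^|Omega|: x[i] = 1 iff the i-th element of the universe is in S
    return np.array([1.0 if e in S else 0.0 for e in universe])

S1, S2 = {"a", "b", "c"}, {"b", "c", "d"}
universe = sorted(S1 | S2)      # in general: all possible samples Omega, sorted

x1, x2 = indicator(S1, universe), indicator(S2, universe)
print(set_kernel(S1, S2), np.exp(x1 @ x2))   # both equal e^2
```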

Kernel Machines

An algorithm can be kernelized in two steps (a short sketch follows this list):
  1. Rewrite the algorithm and the classifier entirely in terms of inner products, $\mathbf{x}_1^\top \mathbf{x}_2$.
  2. Define a kernel function and substitute $k(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top \mathbf{x}_j$.
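The following minimal sketch illustrates the recipe (the helper names `inner_product`, `rbf_kernel`, and `gram_matrix` are invented for this example): a routine written purely in terms of a pairwise inner product is kernelized simply by handing it a kernel function instead.

```python
import numpy as np

# Step 1: the algorithm touches the data only through inner products x_i^T x_j.
def inner_product(x, z):
    return x @ z

# Step 2: substitute a kernel k(x_i, x_j) for every inner product x_i^T x_j.
def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def gram_matrix(X, kernel=inner_product):
    # Works unchanged with either function above -- that is the whole point.
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(0).normal(size=(5, 2))
print(gram_matrix(X))                     # linear Gram matrix
print(gram_matrix(X, kernel=rbf_kernel))  # kernelized version
```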
Quiz 1: How can you kernelize nearest neighbors (with Euclidean distances)? $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$.
Observation: Nearest neighbor under the squared Euclidean distance is the same as nearest neighbor under the Euclidean distance. Therefore, from the original version,
$$h(\mathbf{x})=y_j \quad\text{where}\quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j)\in D} \|\mathbf{x}-\mathbf{x}_j\|^2,$$
we can derive the kernel version,
$$h(\mathbf{x})=y_j \quad\text{where}\quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j)\in D} \big(k(\mathbf{x},\mathbf{x})-2k(\mathbf{x},\mathbf{x}_j)+k(\mathbf{x}_j,\mathbf{x}_j)\big).$$
In practice, kernel nearest neighbor is rarely used, because it brings little improvement: the original k-NN is already highly non-linear.

Kernel Regression

Kernel regression is kernelized Ordinary Least Squares Regression (OLS). Vanilla OLS minimizes the squared-loss regression objective,
$$\min_{\mathbf{w}} \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2,$$
to find the hyper-plane $\mathbf{w}$. The prediction at a test point is simply $h(\mathbf{x})=\mathbf{w}^\top \mathbf{x}$. If we let $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]$ and $\mathbf{y}=[y_1,\dots,y_n]^\top$, the solution of OLS can be written in closed form:
$$\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} \qquad (5)$$

Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs,
$$\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i=\mathbf{X}\boldsymbol{\alpha}.$$
You can verify that such a vector $\boldsymbol{\alpha}$ must always exist by observing the gradient updates that occur if the squared loss is minimized with gradient descent and the initial vector is set to $\mathbf{w}_0=\mathbf{0}$: every update adds a multiple of some $\mathbf{x}_i$ to $\mathbf{w}$ (and because the squared loss is convex, the solution is independent of the initialization). We now kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for any inner product $\mathbf{x}^\top \mathbf{z}$. It follows that the prediction at a test point becomes
$$h(\mathbf{x})=\mathbf{w}^\top \mathbf{x}=\sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}=\sum_{i=1}^n \alpha_i k(\mathbf{x}_i,\mathbf{x})=K_{X:\mathbf{x}}\,\boldsymbol{\alpha},$$
where $K_{X:\mathbf{x}}=[k(\mathbf{x}_1,\mathbf{x}),\dots,k(\mathbf{x}_n,\mathbf{x})]$. It remains to show that we can also solve for the values of $\boldsymbol{\alpha}$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\boldsymbol{\alpha}=\mathbf{K}^{-1}\mathbf{y}$:
$$\begin{aligned}
\mathbf{X}\boldsymbol{\alpha}=\mathbf{w}&=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} && \big|\ \text{multiply from left by } (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\\
\underbrace{(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}}_{=\mathbf{I}}\,\boldsymbol{\alpha}&=(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}}_{=\mathbf{I}}\,\mathbf{y}\\
\boldsymbol{\alpha}&=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{y} && \big|\ \text{substitute } \mathbf{K}=\mathbf{X}^\top\mathbf{X}\\
\boldsymbol{\alpha}&=\mathbf{K}^{-1}\mathbf{y}
\end{aligned}$$

Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes $\boldsymbol{\alpha}=(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}$. In practice a small value of $\sigma^2>0$ increases stability, especially if $\mathbf{K}$ is not invertible.
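A minimal NumPy sketch of these closed-form solutions (the RBF kernel, the toy sine data, and the value $\sigma^2=10^{-3}$ are illustrative assumptions; setting the ridge term to zero recovers $\boldsymbol{\alpha}=\mathbf{K}^{-1}\mathbf{y}$ whenever $\mathbf{K}$ is invertible):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2)
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))             # training inputs, one per row
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)  # noisy targets

K = rbf(X, X)                                    # n x n kernel matrix
ridge = 1e-3                                     # the sigma^2 ridge term for stability
alpha = np.linalg.solve(K + ridge * np.eye(len(X)), y)  # alpha = (K + sigma^2 I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf(X_test, X) @ alpha                  # h(x) = sum_i alpha_i k(x_i, x)
print(y_pred)
```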

Kernel SVM

The original SVM is a quadratic programming problem:
$$\begin{aligned}
&\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w}+C\sum_{i=1}^n \xi_i\\
&\text{s.t. } \forall i,\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1-\xi_i,\quad \xi_i\geq 0.
\end{aligned}$$
It has a dual form:
$$\begin{aligned}
&\min_{\alpha_1,\dots,\alpha_n}\ \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K_{ij}-\sum_{i=1}^n \alpha_i\\
&\text{s.t. } 0\leq\alpha_i\leq C,\quad \sum_{i=1}^n \alpha_i y_i=0.
\end{aligned}$$
It is easy to show that $0<\alpha_i<C \Rightarrow y_i(\mathbf{w}^\top\phi(\mathbf{x}_i)+b)=1$. Almost all $\alpha_i=0$ (the solution is sparse); the training points with $\alpha_i\neq 0$ are the support vectors.
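As an illustration beyond the notes, off-the-shelf solvers optimize a dual of this form (up to a constant scaling of the objective). Assuming scikit-learn is available, the sketch below fits an SVC on a precomputed RBF kernel matrix and inspects the sparse dual solution; the data and parameters are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(A, B, sigma=1.0):
    # Kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2)
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])          # a non-linear (XOR-like) labeling

K_train = rbf(X, X)
svm = SVC(C=1.0, kernel="precomputed")  # the dual QP is solved internally
svm.fit(K_train, y)

# Sparsity: most alpha_i are exactly zero; only support vectors remain.
print("support vectors:", len(svm.support_), "out of", len(X))
print("dual coefficients (y_i * alpha_i):", svm.dual_coef_[0, :5])

# Predicting at new points only needs kernel values against the training set.
X_test = rng.normal(size=(5, 2))
print(svm.predict(rbf(X_test, X)))
```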