Lecture 14.2: Kernels continued
Well-defined kernels
Here are the most common kernels:
- Linear: $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$
- RBF: $k(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$
- Polynomial: $k(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^d$
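As a quick illustration, here is a minimal NumPy sketch of these three kernels (the test vectors, bandwidth $\sigma$, and degree $d$ below are arbitrary choices):

```python
import numpy as np

def k_linear(x, z):
    # linear kernel: x^T z
    return x @ z

def k_rbf(x, z, sigma=1.0):
    # RBF kernel: exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def k_poly(x, z, d=3):
    # polynomial kernel: (1 + x^T z)^d
    return (1 + x @ z) ** d

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(k_linear(x, z), k_rbf(x, z), k_poly(x, z))
```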
Kernels built by recursively combining one or more of the following rules are
called well-defined kernels:
- $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$ (1)
- $k(\mathbf{x},\mathbf{z})=c\,k_1(\mathbf{x},\mathbf{z})$ (2)
- $k(\mathbf{x},\mathbf{z})=k_1(\mathbf{x},\mathbf{z})+k_2(\mathbf{x},\mathbf{z})$
- $k(\mathbf{x},\mathbf{z})=g(k_1(\mathbf{x},\mathbf{z}))$
- $k(\mathbf{x},\mathbf{z})=k_1(\mathbf{x},\mathbf{z})\,k_2(\mathbf{x},\mathbf{z})$
- $k(\mathbf{x},\mathbf{z})=f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$ (3)
- $k(\mathbf{x},\mathbf{z})=e^{k_1(\mathbf{x},\mathbf{z})}$ (4)
- $k(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{A}\mathbf{z}$

where $k_1,k_2$ are well-defined kernels, $c\geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $\mathbf{A}\succeq 0$ is positive semi-definite.
A kernel being well-defined is equivalent to the corresponding kernel matrix, $K$, being positive semi-definite (not proved here), which is equivalent to any of the following statements:
- All eigenvalues of $K$ are non-negative.
- $\exists$ a real matrix $P$ s.t. $K=P^\top P$.
- $\forall$ real vectors $\mathbf{x}$, $\mathbf{x}^\top K\mathbf{x}\geq 0$.
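For instance, the eigenvalue condition can be checked numerically; the sketch below (random data and an RBF kernel with an arbitrary bandwidth) verifies that the resulting kernel matrix has no eigenvalues below zero beyond round-off error:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

X = np.random.randn(50, 3)           # 50 random points in R^3
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)       # all eigenvalues non-negative up to round-off
```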
It is trivial to prove that the linear kernel and the polynomial kernel with integer $d$ are both well-defined kernels.
The RBF kernel
$$k_{\text{RBF}}(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$$
is also a well-defined kernel.
$$
\begin{aligned}
k_1(\mathbf{x},\mathbf{z})&=\mathbf{x}^\top\mathbf{z} && \text{well-defined by rule (1)}\\
k_2(\mathbf{x},\mathbf{z})&=\frac{2}{\sigma^2}k_1(\mathbf{x},\mathbf{z})=\frac{2\mathbf{x}^\top\mathbf{z}}{\sigma^2} && \text{well-defined by rule (2)}\\
k_3(\mathbf{x},\mathbf{z})&=e^{k_2(\mathbf{x},\mathbf{z})}=e^{\frac{2\mathbf{x}^\top\mathbf{z}}{\sigma^2}} && \text{well-defined by rule (4)}\\
k_4(\mathbf{x},\mathbf{z})&=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}\,k_3(\mathbf{x},\mathbf{z})\,e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}}=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}e^{\frac{2\mathbf{x}^\top\mathbf{z}}{\sigma^2}}e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} && \text{well-defined by rule (3) with } f(\mathbf{x})=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}\\
&=e^{\frac{-\mathbf{x}^\top\mathbf{x}+2\mathbf{x}^\top\mathbf{z}-\mathbf{z}^\top\mathbf{z}}{\sigma^2}}=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}=k_{\text{RBF}}(\mathbf{x},\mathbf{z})
\end{aligned}
$$
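As a sanity check (not part of the proof), the factorization used in rule (3) above can be verified numerically; the vectors and $\sigma$ below are arbitrary:

```python
import numpy as np

sigma = 1.5
x, z = np.random.randn(4), np.random.randn(4)

f = lambda v: np.exp(-v @ v / sigma ** 2)       # f(x) = exp(-x^T x / sigma^2)
k3 = np.exp(2 * (x @ z) / sigma ** 2)           # k_3 = exp(2 x^T z / sigma^2)
k4 = f(x) * k3 * f(z)                           # rule (3)

k_rbf = np.exp(-np.sum((x - z) ** 2) / sigma ** 2)
print(np.isclose(k4, k_rbf))                    # True: k_4 equals the RBF kernel
```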
You can even define kernels over sets, strings, or molecules.
The following kernel is defined on two sets,
$$k(S_1,S_2)=e^{|S_1\cap S_2|}.$$
It can be shown that this is a well-defined kernel. We can list all possible samples $\Omega$ and arrange them into a sorted list. We then define a vector $\mathbf{x}\in\{0,1\}^{|\Omega|}$, where each element indicates whether the corresponding sample is included in the set. It is easy to prove that
$$k(S_1,S_2)=e^{\mathbf{x}_1^\top\mathbf{x}_2},$$
which is a well-defined kernel by rules (1) and (4).
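A minimal sketch of this construction follows; the universe $\Omega$ and the two sets are made-up examples:

```python
import numpy as np

omega = sorted(["a", "b", "c", "d", "e"])        # all possible samples, sorted
S1, S2 = {"a", "c", "d"}, {"c", "d", "e"}

def indicator(S):
    # x in {0,1}^|Omega|: entry is 1 iff the corresponding sample is in the set
    return np.array([1.0 if s in S else 0.0 for s in omega])

x1, x2 = indicator(S1), indicator(S2)
print(np.isclose(np.exp(x1 @ x2), np.exp(len(S1 & S2))))   # the two definitions agree
```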
Kernel Machines
An algorithm can be kernelized in 2 steps:
- Rewrite the algorithm and the classifier entirely in terms of inner products, $\mathbf{x}_1^\top\mathbf{x}_2$.
- Define a kernel function and substitute $k(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top\mathbf{x}_j$.
Quiz 1: How can you kernelize nearest neighbors (with Euclidean distances) on a data set $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$?
Observation: Nearest neighbor under the squared Euclidean distance is the same as nearest neighbor under the Euclidean distance.
Therefore, from the original version,
$$h(\mathbf{x})=y_j \quad\text{where}\quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j)\in D}\|\mathbf{x}-\mathbf{x}_j\|^2,$$
we can derive the kernel version,
$$h(\mathbf{x})=y_j \quad\text{where}\quad j=\operatorname*{argmin}_{(\mathbf{x}_j,y_j)\in D}\big(k(\mathbf{x},\mathbf{x})-2k(\mathbf{x},\mathbf{x}_j)+k(\mathbf{x}_j,\mathbf{x}_j)\big).$$
In practice, kernel nearest neighbor is rarely used, because it brings little improvement: the original k-NN is already highly non-linear.
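For completeness, here is a minimal sketch of the kernelized 1-nearest-neighbor rule derived above; the kernel is passed in as a function, and the helper name is ours:

```python
import numpy as np

def kernel_1nn_predict(x, X_train, y_train, kernel):
    # squared distance in feature space, expressed only through kernel evaluations
    dists = [kernel(x, x) - 2 * kernel(x, xi) + kernel(xi, xi) for xi in X_train]
    return y_train[int(np.argmin(dists))]

# usage with an RBF kernel, for example:
# y_hat = kernel_1nn_predict(x_test, X_train, y_train,
#                            lambda a, b: np.exp(-np.sum((a - b) ** 2)))
```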
Kernel Regression
Kernel regression is kernelized Ordinary Least Squares regression (OLS). Vanilla OLS minimizes the squared-loss objective
$$\min_{\mathbf{w}}\sum_{i=1}^n(\mathbf{w}^\top\mathbf{x}_i-y_i)^2$$
to find the hyperplane $\mathbf{w}$. The prediction at a test point is simply $h(\mathbf{x})=\mathbf{w}^\top\mathbf{x}$.
If we let $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]$ and $\mathbf{y}=[y_1,\dots,y_n]^\top$, the solution of OLS can be written in closed form:
$$\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} \qquad (5)$$
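As a quick numerical check of equation (5), the sketch below solves the closed form on random placeholder data:

```python
import numpy as np

d, n = 3, 20
X = np.random.randn(d, n)             # columns are the training inputs x_1, ..., x_n
y = np.random.randn(n)                # random placeholder labels

w = np.linalg.solve(X @ X.T, X @ y)   # w = (X X^T)^{-1} X y, equation (5)
print(w.shape)                        # (d,)
```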
Kernelization
We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs,
$$\mathbf{w}=\sum_{i=1}^n\alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}.$$
You can verify that such a vector $\vec{\alpha}$ must always exist by observing the gradient updates that occur if (5) is minimized with gradient descent and the initial vector is set to $\mathbf{w}^0=\vec{0}$ (because the squared loss is convex, the solution is independent of its initialization): every update adds a multiple of some $\mathbf{x}_i$ to $\mathbf{w}$, so $\mathbf{w}$ always remains in the span of the training inputs.
We now kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for any inner product $\mathbf{x}^\top\mathbf{z}$.
It follows that the prediction at a test point becomes
$$h(\mathbf{x})=\mathbf{w}^\top\mathbf{x}=\sum_{i=1}^n\alpha_i\mathbf{x}_i^\top\mathbf{x}=\sum_{i=1}^n\alpha_i k(\mathbf{x}_i,\mathbf{x})=K_{X:\mathbf{x}}\vec{\alpha},$$
where $K_{X:\mathbf{x}}$ denotes the row vector of kernel values between the test point $\mathbf{x}$ and all training inputs.
It remains to show that we can also solve for the values of $\vec{\alpha}$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\vec{\alpha}=K^{-1}\mathbf{y}$.
$$
\begin{aligned}
\mathbf{X}\vec{\alpha}=\mathbf{w}&=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} && \big|\ \text{multiply from the left by } (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\\
(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\vec{\alpha}&=(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}}_{=I}\,\mathbf{y}\\
\vec{\alpha}&=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{y} && \big|\ \text{substitute } K=\mathbf{X}^\top\mathbf{X}\\
\vec{\alpha}&=K^{-1}\mathbf{y}
\end{aligned}
$$
Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes
$$\vec{\alpha}=(K+\sigma^2 I)^{-1}\mathbf{y}.$$
In practice, a small value of $\sigma^2>0$ increases stability, especially if $K$ is not invertible.
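The sketch below implements kernel ridge regression exactly as in these formulas; the toy data, RBF bandwidth, and ridge parameter $\sigma^2$ are arbitrary choices:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

# toy training data: n points in R^2 with a noisy 1-d target
X_train = np.random.randn(30, 2)
y_train = np.sin(X_train[:, 0]) + 0.1 * np.random.randn(30)

K = np.array([[rbf(xi, xj) for xj in X_train] for xi in X_train])
sigma2 = 1e-3                                                    # small ridge term for stability
alpha = np.linalg.solve(K + sigma2 * np.eye(len(K)), y_train)    # alpha = (K + sigma^2 I)^{-1} y

def predict(x):
    # h(x) = sum_i alpha_i k(x_i, x)
    return sum(a * rbf(xi, x) for a, xi in zip(alpha, X_train))

print(predict(np.array([0.3, -0.2])))
```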
Kernel SVM
The original SVM is a quadratic programming problem:
$$\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w}+C\sum_{i=1}^n\xi_i \quad \text{s.t. } \forall i:\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1-\xi_i,\ \ \xi_i\geq 0.$$
It has a dual form:
$$\min_{\alpha_1,\dots,\alpha_n}\ \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K_{ij}-\sum_{i=1}^n\alpha_i \quad \text{s.t. } 0\leq\alpha_i\leq C,\ \ \sum_{i=1}^n\alpha_i y_i=0.$$
It is easy to show (via the KKT conditions) that $0<\alpha_i<C \Rightarrow y_i(\mathbf{w}^\top\phi(\mathbf{x}_i)+b)=1$, i.e. such points lie exactly on the margin.
Typically almost all $\alpha_i=0$ (the solution is sparse); the training points with $\alpha_i\neq 0$ are called support vectors.
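As an illustration of this sparsity, the sketch below uses scikit-learn's SVC (which solves the dual QP internally) rather than setting up the quadratic program by hand; the toy data and hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data: two Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)    # solves the dual QP internally
clf.fit(X, y)

# only the support vectors have alpha_i != 0
print("support vectors:", len(clf.support_), "out of", len(X))
print("dual coefficients (y_i * alpha_i):", clf.dual_coef_.ravel()[:5])
```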