Lecture 14: Kernels continued
Well-defined kernels
Here are the most common kernels:
- Linear: $k(x,z) = x^\top z$
- RBF: $k(x,z) = e^{-\frac{\|x-z\|^2}{\sigma^2}}$
- Polynomial: $k(x,z) = (1 + x^\top z)^d$
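For reference, here is a minimal NumPy sketch of these three kernels (the function names and the default values of $\sigma$ and $d$ are ours, purely for illustration):

```python
import numpy as np

def linear_kernel(x, z):
    # k(x,z) = x^T z
    return np.dot(x, z)

def rbf_kernel(x, z, sigma=1.0):
    # k(x,z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, d=3):
    # k(x,z) = (1 + x^T z)^d
    return (1.0 + np.dot(x, z)) ** d
```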
Kernels built by recursively combining one or more of the following rules are
called well-defined kernels:
- $k(x,z) = x^\top z$ (1)
- $k(x,z) = c\,k_1(x,z)$ (2)
- $k(x,z) = k_1(x,z) + k_2(x,z)$
- $k(x,z) = g\big(k_1(x,z)\big)$
- $k(x,z) = k_1(x,z)\,k_2(x,z)$
- $k(x,z) = f(x)\,k_1(x,z)\,f(z)$ (3)
- $k(x,z) = e^{k_1(x,z)}$ (4)
- $k(x,z) = x^\top A z$
where $k_1, k_2$ are well-defined kernels, $c \geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $A \succeq 0$ is positive semi-definite.
A kernel being well-defined is equivalent to the corresponding kernel matrix, $K$, being positive semi-definite (not proved here), which is equivalent to any of the following statements:
- All eigenvalues of $K$ are non-negative.
- $\exists$ real matrix $P$ s.t. $K = P^\top P$.
- $\forall$ real vector $x$, $x^\top K x \geq 0$.
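As a quick numerical sanity check (a sketch, not part of the lecture; the data and $\sigma$ are arbitrary), one can build a small RBF kernel matrix and verify the eigenvalue and quadratic-form conditions above:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)          # 5 arbitrary points in R^3
sigma = 1.0

# RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / sigma^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dists / sigma ** 2)

# (a) all eigenvalues of K are non-negative (up to numerical tolerance)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # True

# (b) v^T K v >= 0 for any real vector v
for _ in range(1000):
    v = np.random.randn(5)
    assert v @ K @ v >= -1e-10
```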
It is trivial to prove that the linear kernel and the polynomial kernel with integer $d$ are both well-defined kernels.
The RBF kernel
$$k_{\text{RBF}}(x,z) = e^{-\frac{\|x-z\|^2}{\sigma^2}}$$
is a well-defined kernel.
$$
\begin{aligned}
k_1(x,z) &= x^\top z && \text{well-defined by rule (1)}\\
k_2(x,z) &= \frac{2}{\sigma^2}k_1(x,z) = \frac{2\,x^\top z}{\sigma^2} && \text{well-defined by rule (2)}\\
k_3(x,z) &= e^{k_2(x,z)} = e^{\frac{2\,x^\top z}{\sigma^2}} && \text{well-defined by rule (4)}\\
k_4(x,z) &= e^{-\frac{x^\top x}{\sigma^2}}\,k_3(x,z)\,e^{-\frac{z^\top z}{\sigma^2}} = e^{-\frac{x^\top x}{\sigma^2}}\,e^{\frac{2\,x^\top z}{\sigma^2}}\,e^{-\frac{z^\top z}{\sigma^2}} && \text{well-defined by rule (3) with } f(x) = e^{-\frac{x^\top x}{\sigma^2}}\\
&= e^{-\frac{x^\top x - 2\,x^\top z + z^\top z}{\sigma^2}} = e^{-\frac{\|x-z\|^2}{\sigma^2}} = k_{\text{RBF}}(x,z)
\end{aligned}
$$
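The same derivation can be checked numerically; the following sketch (with arbitrary points and $\sigma$) builds $k_4$ step by step and compares it to the direct RBF evaluation:

```python
import numpy as np

np.random.seed(1)
x, z = np.random.randn(4), np.random.randn(4)
sigma = 2.0

k1 = x @ z                                  # rule (1)
k2 = 2.0 / sigma ** 2 * k1                  # rule (2), c = 2 / sigma^2
k3 = np.exp(k2)                             # rule (4)
f = lambda v: np.exp(-(v @ v) / sigma ** 2)
k4 = f(x) * k3 * f(z)                       # rule (3)

k_rbf = np.exp(-np.sum((x - z) ** 2) / sigma ** 2)
print(np.isclose(k4, k_rbf))                # True
```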
You can even define kernels over sets, strings, or molecules.
The following kernel is defined on two sets,
$$k(S_1, S_2) = e^{|S_1 \cap S_2|}.$$
It can be shown that this is a well-defined kernel. We can list all possible samples $\Omega$ and arrange them into a sorted list. For a set $S$ we define a vector $x \in \{0,1\}^{|\Omega|}$, where each of its elements indicates whether the corresponding sample is included in the set. With $x_1$ and $x_2$ denoting the indicator vectors of $S_1$ and $S_2$, it is easy to prove that
$$k(S_1, S_2) = e^{x_1^\top x_2},$$
which is a well-defined kernel by rules (1) and (4).
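A small sketch of this set kernel and its indicator-vector form (the universe $\Omega$, the sets, and all names are illustrative):

```python
import numpy as np

omega = sorted(["a", "b", "c", "d", "e"])    # all possible samples, sorted
S1, S2 = {"a", "b", "c"}, {"b", "c", "e"}

def set_kernel(s1, s2):
    # k(S1, S2) = exp(|S1 intersect S2|)
    return np.exp(len(s1 & s2))

def indicator(s):
    # x in {0,1}^|Omega|: x_j = 1 iff the j-th sample of Omega is in the set
    return np.array([1.0 if o in s else 0.0 for o in omega])

x1, x2 = indicator(S1), indicator(S2)
print(np.isclose(set_kernel(S1, S2), np.exp(x1 @ x2)))   # True
```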
Kernel Machines
An algorithm can be kernelized in 2 steps:
- Rewrite the algorithm and the classifier entirely in terms of inner products, $x_1^\top x_2$.
- Define a kernel function and substitute $k(x_i, x_j)$ for $x_i^\top x_j$.
Quiz 1: How can you kernelize nearest neighbors (with Euclidean distances)?
$D = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
Observation: the nearest neighbor under the squared Euclidean distance is the same as the nearest neighbor under the Euclidean distance.
Therefore, starting from the original version,
$$h(x) = y_j \quad \text{where} \quad j = \operatorname*{argmin}_{(x_j, y_j) \in D} \|x - x_j\|^2,$$
and using $\|x - x_j\|^2 = x^\top x - 2\,x^\top x_j + x_j^\top x_j$, we can derive the kernel version,
$$h(x) = y_j \quad \text{where} \quad j = \operatorname*{argmin}_{(x_j, y_j) \in D} \big(k(x, x) - 2\,k(x, x_j) + k(x_j, x_j)\big).$$
In practice, kernelized nearest neighbor is rarely used, because the original k-NN is already highly non-linear and kernelization brings little improvement.
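For completeness, here is a minimal sketch of the kernelized 1-nearest-neighbor rule from the quiz, assuming an RBF kernel (data and names are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_1nn_predict(x, X_train, y_train, k=rbf_kernel):
    # squared distance in feature space: k(x,x) - 2 k(x,x_j) + k(x_j,x_j)
    dists = [k(x, x) - 2 * k(x, xj) + k(xj, xj) for xj in X_train]
    return y_train[int(np.argmin(dists))]

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, -1, +1])
print(kernel_1nn_predict(np.array([1.8, 1.9]), X_train, y_train))   # +1
```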
Kernel Regression
Kernel regression is kernelized Ordinary Least Squares Regression (OLS). Vanilla OLS minimizes the squared-loss objective
$$\min_{w} \sum_{i=1}^{n} (w^\top x_i - y_i)^2$$
to find the hyperplane $w$. The prediction at a test point is simply $h(x) = w^\top x$.
If we let $X = [x_1, \dots, x_n]$ and $y = [y_1, \dots, y_n]^\top$, the OLS solution can be written in closed form:
$$w = (XX^\top)^{-1}Xy \qquad (5)$$
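As a sketch, the closed-form solution (5) in NumPy, with $X$ stored as a $d \times n$ matrix whose columns are the training inputs (the synthetic data and names are ours):

```python
import numpy as np

np.random.seed(0)
d, n = 3, 50
X = np.random.randn(d, n)                       # columns are the training inputs x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * np.random.randn(n)    # noisy linear labels

# w = (X X^T)^{-1} X y, computed without forming the inverse explicitly
w = np.linalg.solve(X @ X.T, X @ y)
print(w)                                        # close to w_true
```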
Kernelization
We begin by expressing the solution $w$ as a linear combination of the training inputs,
$$w = \sum_{i=1}^{n}\alpha_i x_i = X\vec{\alpha}.$$
You can verify that such a vector $\vec{\alpha}$ must always exist by observing the gradient updates that occur if (5) is minimized with gradient descent and the initial vector is set to $w_0 = \vec{0}$ (because the squared loss is convex, the solution is independent of the initialization).
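To spell out that argument: the gradient of the squared loss is itself a linear combination of the training inputs,
$$\nabla_w \sum_{i=1}^{n}(w^\top x_i - y_i)^2 = \sum_{i=1}^{n} 2\,(w^\top x_i - y_i)\,x_i,$$
so each gradient descent update $w_{t+1} = w_t - s\sum_{i=1}^{n} 2\,(w_t^\top x_i - y_i)\,x_i$ (with step size $s$) only adds multiples of the $x_i$ to $w_t$. Starting from $w_0 = \vec{0}$, every iterate, and therefore the final solution, stays in $\operatorname{span}(x_1, \dots, x_n)$, i.e. it can be written as $X\vec{\alpha}$ for some $\vec{\alpha}$.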
We now kernelize the algorithm by substituting $k(x, z)$ for any inner product $x^\top z$. It follows that the prediction at a test point becomes
$$h(x) = w^\top x = \sum_{i=1}^{n}\alpha_i x_i^\top x = \sum_{i=1}^{n}\alpha_i k(x_i, x) = K_{X:x}\,\vec{\alpha},$$
where $[K_{X:x}]_i = k(x_i, x)$.
It remains to show that we can also solve for the values of $\vec{\alpha}$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\vec{\alpha} = K^{-1}y$:
$$
\begin{aligned}
X\vec{\alpha} = w &= (XX^\top)^{-1}Xy && \big|\ \text{multiply from left by } (X^\top X)^{-1}X^\top\\
(X^\top X)^{-1}X^\top X\vec{\alpha} &= (X^\top X)^{-1}\underbrace{X^\top (XX^\top)^{-1}X}_{=I}\,y &&\\
\vec{\alpha} &= (X^\top X)^{-1}y && \big|\ \text{substitute } K = X^\top X\\
\vec{\alpha} &= K^{-1}y &&
\end{aligned}
$$
Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes
$$\vec{\alpha} = (K + \sigma^2 I)^{-1}y.$$
In practice a small value of $\sigma^2 > 0$ increases stability, especially if $K$ is not invertible. If $\sigma^2 = 0$, kernel ridge regression becomes kernelized ordinary least squares. Typically kernel ridge regression is also simply referred to as kernel regression.
Testing
Remember that we defined $w = X\vec{\alpha}$. The prediction at a test point $z$ then becomes
$$h(z) = z^\top w = z^\top\underbrace{X\vec{\alpha}}_{w} = \underbrace{z^\top X}_{K_*}\,\underbrace{(K + \sigma^2 I)^{-1}y}_{\vec{\alpha}} = K_*(K + \sigma^2 I)^{-1}y,$$
where $K_*$ is the kernel of the test point with the training points, i.e. $[K_*]_i = \phi(z)^\top\phi(x_i) = k(z, x_i)$, the inner product between the test point $z$ and the training point $x_i$ after the mapping into feature space through $\phi$.
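Putting the training and testing formulas together, here is a minimal NumPy sketch of kernel (ridge) regression with an RBF kernel (the data, $\sigma$, and regularizer value are illustrative):

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / sigma^2), with the rows of A and B as points
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma ** 2)

np.random.seed(0)
X_train = np.random.uniform(-3, 3, size=(40, 1))              # rows are training inputs
y_train = np.sin(X_train[:, 0]) + 0.1 * np.random.randn(40)
X_test = np.linspace(-3, 3, 5)[:, None]

sigma2 = 1e-2                                                  # regularizer sigma^2
K = rbf_kernel_matrix(X_train, X_train)                        # n x n training kernel
alpha = np.linalg.solve(K + sigma2 * np.eye(len(K)), y_train)  # (K + sigma^2 I)^{-1} y

K_star = rbf_kernel_matrix(X_test, X_train)                    # kernel of test vs. training points
y_pred = K_star @ alpha                                        # h(z) = K_* alpha
print(y_pred)                                                  # roughly sin(X_test)
```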
Kernel SVM
The original, primal SVM is a quadratic programming problem:
$$
\begin{aligned}
\min_{w, b}\quad & w^\top w + C\sum_{i=1}^{n}\xi_i\\
\text{s.t. } \forall i,\quad & y_i(w^\top x_i + b) \geq 1 - \xi_i\\
& \xi_i \geq 0
\end{aligned}
$$
It has the dual form
$$
\begin{aligned}
\min_{\alpha_1, \dots, \alpha_n}\quad & \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K_{ij} - \sum_{i=1}^{n}\alpha_i\\
\text{s.t. }\quad & 0 \leq \alpha_i \leq C\\
& \sum_{i=1}^{n}\alpha_i y_i = 0
\end{aligned}
$$
where $w = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i)$ (although this is never computed explicitly) and
$$h(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i k(x_i, x) + b\right).$$
Almost all $\alpha_i = 0$ (i.e. $\vec{\alpha}$ is sparse). We refer to the inputs with $\alpha_i > 0$ as support vectors. At test time you only have to store the vectors $x_i$ and values $\alpha_i$ that correspond to support vectors.
For support vectors with $0 < \alpha_i < C$ it is easy to show that $y_i(w^\top\phi(x_i) + b) = 1$. This allows us to solve for $b$ from those support vectors (it is best to average the values of $b$ obtained from all such support vectors, as there may be numerical precision problems).
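As an illustration, the following sketch uses scikit-learn's SVC (which solves this dual internally) and reconstructs the decision value $\sum_i \alpha_i y_i k(x_i, z) + b$ from the sparse dual coefficients; the data and parameter values are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(0)
X = np.random.randn(100, 2)
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)        # non-linearly separable labels in {-1,+1}

gamma = 0.5                                           # sklearn's RBF: exp(-gamma * ||x - z||^2)
clf = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors only (alpha is sparse)
print(len(clf.support_), "support vectors out of", len(X))

# rebuild h(z) = sum_i alpha_i y_i k(x_i, z) + b by hand for one test point
z = np.array([0.3, -0.2])
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - z) ** 2, axis=1))
decision = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]
print(np.sign(decision) == clf.predict(z[None, :])[0])   # True
```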
Quiz: What is the dual form of the hard-margin SVM?
Kernel SVM - the smart nearest neighbor
Do you remember the k-nearest neighbor algorithm? For binary classification problems ($y_i \in \{+1, -1\}$), we can write the decision function for a test point $z$ as
$$h(z) = \operatorname{sign}\left(\sum_{i=1}^{n} y_i\,\delta^{nn}(x_i, z)\right),$$
where $\delta^{nn}(x_i, z) \in \{0, 1\}$, with $\delta^{nn}(x_i, z) = 1$ only if $x_i$ is one of the $k$ nearest neighbors of the test point $z$.
The SVM decision function
$$h(z) = \operatorname{sign}\left(\sum_{i=1}^{n} y_i\alpha_i k(x_i, z) + b\right)$$
is very similar, but instead of limiting the decision to the k nearest neighbors, it considers all training points, and the kernel function assigns more weight to those that are closer (large $k(z, x_i)$). In some sense you can view the RBF kernel as a soft nearest-neighbor assignment, as its exponential decay with distance assigns almost no weight to all but the points neighboring $z$.
The kernel SVM algorithm also learns a weight $\alpha_i \geq 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many $\alpha_i = 0$.