Advantage: It is simple, and your problem stays convex and well behaved (i.e., you can still use your normal gradient descent code).
Disadvantage: ϕ(x) might be very high dimensional. (Let's worry about this later)
Consider the following example: $\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_d\end{pmatrix}$, and define $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots\\ x_d\\ x_1x_2\\ \vdots\\ x_{d-1}x_d\\ \vdots\\ x_1x_2\cdots x_d\end{pmatrix}$.
Quiz: What is the dimensionality of $\phi(\mathbf{x})$?
This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
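To get a feel for this, here is a tiny sketch (the helper name `phi` is mine, not from the notes) that constructs this feature map explicitly for a small $d$ by enumerating all subsets of features; the length of the resulting vector answers the quiz for that $d$.

```python
import itertools
import numpy as np

def phi(x):
    """Map x in R^d to the vector of products over all subsets of features.

    The empty subset contributes the constant 1, singletons contribute
    x_1, ..., x_d, pairs contribute x_i * x_j, and so on up to the
    product of all d features.
    """
    d = len(x)
    features = []
    for k in range(d + 1):
        for subset in itertools.combinations(range(d), k):
            features.append(np.prod([x[i] for i in subset]))  # empty product = 1
    return np.array(features)

x = np.array([2.0, 3.0, 5.0])   # d = 3
print(len(phi(x)))              # number of dimensions of phi(x) for d = 3
```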
Now, note that for many loss functions the gradient used in gradient descent is a linear combination of the input samples. Take the squared loss for example:
$$\ell(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$
The gradient descent rule, with step-size $s>0$, updates $\mathbf{w}$ over time,
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\ \textrm{ where: }\ \frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^{n} \underbrace{2(\mathbf{w}^\top \mathbf{x}_i - y_i)}_{\gamma_i\,:\,\textrm{function of } \mathbf{x}_i, y_i} \mathbf{x}_i = \sum_{i=1}^{n} \gamma_i \mathbf{x}_i$$
We will now show that we can express $\mathbf{w}$ as a linear combination of all input vectors,
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0 = \begin{pmatrix}0\\ \vdots\\ 0\end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$ is trivially $\alpha_1 = \dots = \alpha_n = 0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1, \dots, \alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}
\mathbf{w}_1 &= \mathbf{w}_0 - s\sum_{i=1}^{n} 2(\mathbf{w}_0^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i && (\textrm{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0)\\
\mathbf{w}_2 &= \mathbf{w}_1 - s\sum_{i=1}^{n} 2(\mathbf{w}_1^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i && (\textrm{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1)\\
\mathbf{w}_3 &= \mathbf{w}_2 - s\sum_{i=1}^{n} 2(\mathbf{w}_2^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^3 \mathbf{x}_i && (\textrm{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2)\\
&\ \ \vdots\\
\mathbf{w}_t &= \mathbf{w}_{t-1} - s\sum_{i=1}^{n} 2(\mathbf{w}_{t-1}^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^t \mathbf{x}_i && (\textrm{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1})
\end{aligned}$$
The update-rule for $\alpha_i^t$ is thus $\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}$, and we have $\alpha_i^t = -s\sum_{r=0}^{t-1} \gamma_i^r$. In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1, \dots, \alpha_n$.

Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner-product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner-products between training inputs:
$$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top \mathbf{x}_j.$$
Consequently, we can also re-write the squared-loss from $\ell(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \left(\sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test-input $\mathbf{x}_t$, and we can write the entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
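To make this concrete, here is a minimal NumPy sketch (the function names are mine, not from the notes) of this gradient descent carried out purely on the $\alpha_i$ coefficients; only the inner products $\mathbf{x}_i^\top\mathbf{x}_j$ are ever touched, never $\mathbf{w}$ itself.

```python
import numpy as np

def squared_loss_gd_in_alpha(X, y, s=0.001, iterations=1000):
    """Run gradient descent on the alpha coefficients only.

    X is the n x d matrix of training inputs (rows x_i), y the n labels.
    w = sum_i alpha_i x_i is never formed explicitly; every step only
    uses the inner products K[i, j] = x_i^T x_j.
    """
    n = X.shape[0]
    K = X @ X.T                        # all pairwise inner products
    alpha = np.zeros(n)                # w_0 = 0  =>  all alpha_i = 0
    for _ in range(iterations):
        gamma = 2.0 * (K @ alpha - y)  # gamma_i = 2 (w^T x_i - y_i)
        alpha -= s * gamma             # alpha_i^t = alpha_i^{t-1} - s * gamma_i^{t-1}
    return alpha

def predict(alpha, X, x_test):
    """h(x_t) = sum_j alpha_j x_j^T x_t -- again only inner products."""
    return alpha @ (X @ x_test)
```

The step size $s$ is assumed small enough for the iteration to converge (roughly, small relative to the largest eigenvalue of $XX^\top$); that choice is part of the sketch, not of the derivation above.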
Replacing these inner-products with a kernel function $\mathsf{K}(\mathbf{x},\mathbf{z}) = \phi(\mathbf{x})^\top \phi(\mathbf{z})$ lets us work with the feature map $\phi$ implicitly. Some commonly used kernels are:

Linear: $\mathsf{K}(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.

(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality $d$ of the data is high.)
Polynomial: $\mathsf{K}(\mathbf{x},\mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $\mathsf{K}(\mathbf{x},\mathbf{z}) = e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular kernel! It is a universal approximator, and its corresponding feature vector is infinite dimensional.
In the following we provide some other kernels (a short code sketch of these kernel functions appears after the list).
Exponential Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z}) = e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z}) = e^{-\frac{|\mathbf{x}-\mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z}) = \tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
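As referenced above, here is a minimal sketch of these kernel functions, plus a helper that builds the matrix of all pairwise kernel values; the parameter defaults (degree, bandwidth $\sigma$, and the sigmoid's $a$, $c$) are illustrative choices, not values from the notes.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):                 # degree d is an illustrative choice
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):                  # exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def laplacian_kernel(x, z, sigma=1.0):            # exp(-|x - z| / sigma), l1 distance
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):           # tanh(a x^T z + c)
    return np.tanh(a * (x @ z) + c)

def gram_matrix(kernel, X):
    """n x n matrix of all pairwise kernel values K[i, j] = kernel(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```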
Think about it: Can any function $\mathsf{K}(\cdot,\cdot)$ be used as a kernel?
No, the matrix $\mathsf{K}(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x}\rightarrow\phi(\mathbf{x})$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.

Definition: A matrix $A\in\mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q} \geq 0$.
Why is that?
Remember $\mathsf{K}(\mathbf{x},\mathbf{z}) = \phi(\mathbf{x})^\top \phi(\mathbf{z})$. A matrix of the form
$$A = \begin{pmatrix} \mathbf{x}_1^\top\mathbf{x}_1 & \dots & \mathbf{x}_1^\top\mathbf{x}_n \\ \vdots & \ddots & \vdots \\ \mathbf{x}_n^\top\mathbf{x}_1 & \dots & \mathbf{x}_n^\top\mathbf{x}_n \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{pmatrix} \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix}$$
must be positive semi-definite because
$$\mathbf{q}^\top A \mathbf{q} = \Big(\underbrace{(\mathbf{x}_1, \cdots, \mathbf{x}_n)\,\mathbf{q}}_{\textrm{a vector with the same dimension as } \mathbf{x}_i}\Big)^\top \Big((\mathbf{x}_1, \cdots, \mathbf{x}_n)\,\mathbf{q}\Big) \geq 0 \ \textrm{ for all } \mathbf{q}\in\mathbb{R}^n.$$
You can even define kernels over sets, strings, graphs and molecules.
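To illustrate the definition numerically, here is a small sketch (my own, not from the notes) that checks positive semi-definiteness via the equivalent eigenvalue test: a symmetric matrix is positive semi-definite iff all its eigenvalues are non-negative.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """Numerically check that a symmetric matrix is positive semi-definite."""
    eigenvalues = np.linalg.eigvalsh(K)   # eigvalsh assumes a symmetric matrix
    return bool(np.all(eigenvalues >= -tol))

# A Gram matrix of real inner products is always PSD ...
X = np.random.randn(5, 3)
print(is_psd(X @ X.T))          # True

# ... whereas an arbitrary symmetric matrix need not be.
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])      # eigenvalues 3 and -1
print(is_psd(A))                # False
```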
Figure 1: The demo shows how a kernel function can solve a problem that linear classifiers cannot solve. The RBF kernel produces a good decision boundary in this case.