Advantage: It is simple, and your problem stays convex and well behaved (i.e., you can still use your normal gradient descent code).
Disadvantage: $\phi(\mathbf{x})$ might be very high dimensional. (Let's worry about this later.)
Consider the following example: $\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_d\end{pmatrix}$, and define $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots\\ x_d\\ x_1x_2\\ \vdots\\ x_{d-1}x_d\\ \vdots\\ x_1x_2\cdots x_d\end{pmatrix}$.
Quiz: What is the dimensionality of $\phi(\mathbf{x})$?
So, as shown above, $\phi(\mathbf{x})$ is very expressive, but its dimensionality is extremely high.
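To make the dimensionality concrete, here is a minimal sketch (not part of the original example, just an illustration) that builds $\phi(\mathbf{x})$ explicitly for a small $d$: the feature vector contains one product per subset of coordinates, so it has $2^d$ entries.

```python
# A minimal sketch: explicitly build phi(x) for a small d by taking the product
# over every subset of coordinates. The empty subset gives the constant feature 1,
# singletons give x_1,...,x_d, pairs give x_i x_j, and so on.
from itertools import combinations
import numpy as np

def phi(x):
    """Map x in R^d to all 2^d products of subsets of its coordinates."""
    d = len(x)
    features = []
    for k in range(d + 1):                      # subset sizes 0, ..., d
        for idx in combinations(range(d), k):   # every subset of that size
            features.append(np.prod([x[i] for i in idx]))  # empty product = 1
    return np.array(features)

x = np.array([2.0, 3.0, 5.0])   # d = 3
print(phi(x))                   # 1, 2, 3, 5, 6, 10, 15, 30
print(len(phi(x)))              # 8 = 2**3
```

Already at $d=30$ this feature vector would have more than a billion entries, which is why we will want to avoid ever computing $\phi(\mathbf{x})$ explicitly.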
Now, note that for many loss functions, the gradient of the loss is a linear combination of the input samples. Take the squared loss, for example:
$$\ell(\mathbf{w})=\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i-y_i)^2$$
The gradient descent rule, with step-size $s>0$, updates $\mathbf{w}$ over time,
$$\mathbf{w}_{t+1}\leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\ \textrm{ where: }\ \frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n \underbrace{2(\mathbf{w}^\top \mathbf{x}_i-y_i)}_{\gamma_i\,:\,\textrm{function of } \mathbf{x}_i, y_i}\mathbf{x}_i = \sum_{i=1}^n \gamma_i\, \mathbf{x}_i$$
We will now show that we can express $\mathbf{w}$ as a linear combination of all input vectors,
$$\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0=\begin{pmatrix}0\\ \vdots\\ 0\end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w}=\sum_{i=1}^n \alpha_i \mathbf{x}_i$ is trivially $\alpha_1=\dots=\alpha_n=0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1,\dots,\alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}
\mathbf{w}_1 &= \mathbf{w}_0 - s\sum_{i=1}^n 2(\mathbf{w}_0^\top \mathbf{x}_i-y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^1 \mathbf{x}_i & (\textrm{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0)\\
\mathbf{w}_2 &= \mathbf{w}_1 - s\sum_{i=1}^n 2(\mathbf{w}_1^\top \mathbf{x}_i-y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^2 \mathbf{x}_i & (\textrm{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1)\\
\mathbf{w}_3 &= \mathbf{w}_2 - s\sum_{i=1}^n 2(\mathbf{w}_2^\top \mathbf{x}_i-y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^3 \mathbf{x}_i & (\textrm{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2)\\
&\ \ \vdots & \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - s\sum_{i=1}^n 2(\mathbf{w}_{t-1}^\top \mathbf{x}_i-y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^n \alpha_i^t \mathbf{x}_i & (\textrm{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1})
\end{aligned}$$
The update-rule for $\alpha_i^t$ is thus $\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}$, and we have $\alpha_i^t = -s\sum_{r=0}^{t-1}\gamma_i^r$. In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$. Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner-product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner-products between training inputs:
$$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^n \alpha_i\, \mathbf{x}_i^\top \mathbf{x}_j.$$
Consequently, we can also re-write the squared-loss from $\ell(\mathbf{w})=\sum_{i=1}^n(\mathbf{w}^\top \mathbf{x}_i-y_i)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\boldsymbol{\alpha})=\sum_{i=1}^n\left(\sum_{j=1}^n \alpha_j\, \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test-input $\mathbf{x}_t$, and we can write the entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t)=\mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^n \alpha_j\, \mathbf{x}_j^\top \mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
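The derivation above translates directly into code. Below is a minimal sketch (the toy data, step size, and iteration count are arbitrary choices of ours, not part of the notes) that runs gradient descent for the squared loss purely in terms of the $\alpha_i$ coefficients, touching the data only through the Gram matrix of pairwise inner-products.

```python
# A minimal sketch: gradient descent on the squared loss without ever updating w
# directly. We only maintain the n coefficients alpha_i, and all data access
# goes through the Gram matrix G[i, j] = x_i^T x_j.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))                  # training inputs, one row per x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

G = X @ X.T                                  # Gram matrix of inner products
alpha = np.zeros(n)                          # w_0 = 0  =>  all alpha_i = 0
s = 0.001                                    # step size

for t in range(500):
    gamma = 2 * (G @ alpha - y)              # gamma_i = 2 (w^T x_i - y_i)
    alpha = alpha - s * gamma                # alpha_i^t = alpha_i^{t-1} - s * gamma_i^{t-1}

w = X.T @ alpha                              # only to check: w = sum_i alpha_i x_i
print(np.sum((X @ w - y) ** 2))              # squared loss after training
```

Note that the loop never forms $\mathbf{w}$; we reconstruct it at the very end only to verify the loss.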
We can therefore replace each inner-product $\mathbf{x}^\top\mathbf{z}$ with a kernel function $\mathsf{K}(\mathbf{x},\mathbf{z})=\phi(\mathbf{x})^\top\phi(\mathbf{z})$, which computes the inner-product after the feature transformation without ever constructing $\phi(\mathbf{x})$ explicitly. Some commonly used kernels:
Linear: $\mathsf{K}(\mathbf{x},\mathbf{z})=\mathbf{x}^\top \mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality $d$ of the data is high.)
Polynomial: $\mathsf{K}(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top \mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $\mathsf{K}(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular kernel! It is a universal approximator!! Its corresponding feature vector is infinite dimensional.
In the following we provide some other kernels.
Exponential Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})= e^{-\frac{\| \mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})= e^{-\frac{\|\mathbf{x}-\mathbf{z}\|_1}{\sigma}}$
Sigmoid Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
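As a reference, here is a minimal sketch of the kernels listed above as plain NumPy functions (the bandwidth $\sigma$, the polynomial degree, the sigmoid parameters $a$ and $c$, and the toy vectors are arbitrary illustration values; the Laplacian kernel is implemented with the sum of absolute coordinate differences).

```python
# A minimal sketch of the kernel functions listed above, for vectors x and z.
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, degree=3):
    return (1 + x @ z) ** degree

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)       # squared Euclidean distance

def exponential(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))  # Euclidean distance

def laplacian(x, z, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / sigma)            # L1 distance

def sigmoid(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear, polynomial, rbf, exponential, laplacian, sigmoid):
    print(k.__name__, k(x, z))
```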
Think about it: Can any function $\mathsf{K}(\cdot,\cdot)$ be used as a kernel?
No, the matrix with entries $\mathsf{K}(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x}\rightarrow \phi(\mathbf{x})$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.
Definition: A matrix $A\in \mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq 0$.
Why is that?
Remember $\mathsf{K}(\mathbf{x},\mathbf{z})=\phi(\mathbf{x})^\top \phi(\mathbf{z})$. A matrix of the form $$A=\begin{pmatrix} \mathbf{x}_1^\top \mathbf{x}_1 & \cdots & \mathbf{x}_1^\top \mathbf{x}_n \\ \vdots & & \vdots \\ \mathbf{x}_n^\top \mathbf{x}_1 & \cdots & \mathbf{x}_n^\top \mathbf{x}_n \end{pmatrix}=\begin{pmatrix} \mathbf{x}_1^\top\\ \vdots \\ \mathbf{x}_n^\top \end{pmatrix} \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix}$$ must be positive semi-definite because, for all $\mathbf{q}\in\mathbb{R}^n$, $$\mathbf{q}^\top A\mathbf{q}=\Big(\underbrace{\begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix}\mathbf{q}}_{\text{a vector of the same dimension as } \mathbf{x}_i}\Big)^\top \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix}\mathbf{q}\geq 0.$$
You can even define kernels over sets, strings, graphs and molecules.
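The argument is easy to check numerically. Here is a minimal sketch (with a toy data matrix of our own choosing) that builds the Gram matrix $A$, verifies that $\mathbf{q}^\top A\mathbf{q}$ equals the squared norm of $\begin{pmatrix}\mathbf{x}_1 \cdots \mathbf{x}_n\end{pmatrix}\mathbf{q}$ and is non-negative, and checks that all eigenvalues of $A$ are (numerically) non-negative.

```python
# A minimal sketch: a Gram matrix built from real inner products is positive
# semi-definite, checked via q^T A q >= 0 for random q and via its eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))             # rows are x_1, ..., x_n

A = X @ X.T                             # A[i, j] = x_i^T x_j

for _ in range(5):
    q = rng.normal(size=6)
    Cq = X.T @ q                        # (x_1 ... x_n) q, a vector of the same dimension as x_i
    print(q @ A @ q, Cq @ Cq)           # the two quantities agree and are >= 0

print(np.linalg.eigvalsh(A) >= -1e-10)  # all eigenvalues are (numerically) non-negative
```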
Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. The RBF kernel produces a good decision boundary in this case.
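Below is a toy re-creation of such a demo, a sketch only (the XOR-style dataset, the RBF bandwidth, and the use of the $\alpha$-based squared-loss learner from above are our own choices, not the exact demo behind the figure): the labels are not linearly separable, yet the kernelized learner fits them by simply replacing every inner-product $\mathbf{x}_i^\top\mathbf{x}_j$ with $\mathsf{K}(\mathbf{x}_i,\mathbf{x}_j)$.

```python
# A toy sketch: the alpha-coefficient gradient descent from above, with the Gram
# matrix of inner products replaced by an RBF kernel matrix, on XOR-style data
# that no linear classifier can separate.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                 # XOR-like labels: not linearly separable

def rbf_gram(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / sigma ** 2)

K = rbf_gram(X, X)                             # K[i, j] = K(x_i, x_j)
alpha = np.zeros(len(X))
s = 0.01
for t in range(3000):
    alpha -= s * 2 * (K @ alpha - y)           # same update as before, x_i^T x_j -> K(x_i, x_j)

pred = np.sign(K @ alpha)                      # h(x_i) = sum_j alpha_j K(x_j, x_i)
print("training accuracy:", (pred == y).mean())
```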