Regularization with Dot-Product Kernels 
Alex J. Smola Zoltfin L. )viri and Robert C. Williamson 
Department of Engineering 
Australian National University 
Canberra, ACT, 0200 
Abstract 
In this paper we give necessary and sufficient conditions under 
which kernels of dot product type $k(x, y) = k(x \cdot y)$ satisfy Mercer's
condition and thus may be used in Support Vector Machines (SVM),
Regularization Networks (RN) or Gaussian Processes (GP).
In particular, we show that if the kernel is analytic
(i.e. can be expanded in a Taylor series), all expansion coefficients 
have to be nonnegative. We give an explicit functional form for the 
feature map by calculating its eigenfunctions and eigenvalues. 
1 Introduction 
Kernel functions are widely used in learning algorithms such as Support Vector
Machines, Gaussian Processes, or Regularization Networks. A possible interpretation
of their effects is that they represent dot products in some feature space $\mathcal{F}$, i.e.
$k(x, y) = \Phi(x) \cdot \Phi(y)$ (1)
where $\Phi$ is a map from input (data) space $\mathcal{X}$ into $\mathcal{F}$. Another interpretation is to
connect $\Phi$ with the regularization properties of the corresponding learning algorithm
[8]. Most popular kernels can be described by three main categories: translation
invariant kernels [9]
$k(x, y) = k(x - y)$, (2)
kernels originating from generative models (e.g. those of Jaakkola and Haussler, or
Watkins), and thirdly, dot-product kernels
$k(x, y) = k(x \cdot y)$. (3)
Since k influences the properties of the estimates generated by any of the algorithms 
above, it is natural to ask which regularization properties are associated with k. 
In [8, 10, 9] the general connections between kernels and regularization properties 
are pointed out, containing details on the connection between the Fourier spectrum 
of translation invariant kernels and the smoothness properties of the estimates. In 
a nutshell, the necessary and sufficient condition for k(x - y) to be a Mercer kernel 
(i.e. be admissible for any of the aforementioned kernel methods) is that its Fourier 
transform be nonnegative. This also allowed for an easy to check criterion for new 
kernel functions. Moreover, [5] gave a similar analysis for kernels derived from 
generative models. 
Dot product kernels $k(x \cdot y)$, on the other hand, have so far eluded further
theoretical analysis, and only a necessary condition [1] was found, based on geometrical
considerations. Unfortunately, it does not provide much insight into the smoothness
properties of the corresponding estimate.
Our aim in the present paper is to shed some light on the properties of dot product
kernels, give an explicit equation for how their eigenvalues can be determined, and, finally,
show that for analytic kernels that can be expanded in terms of monomials $\xi^n$ or
associated Legendre polynomials $P_n^d(\xi)$ [4], i.e.
$k(x, y) = k(x \cdot y)$ with $k(\xi) = \sum_{n=0}^{\infty} a_n \xi^n$ or $k(\xi) = \sum_{n=0}^{\infty} b_n P_n^d(\xi)$ (4)
a necessary and sufficient condition is $a_n \ge 0$ for all $n \in \mathbb{N}$ if no assumption
about the dimensionality of the input space is made (for finite dimensional spaces
of dimension $d$, the condition is that $b_n \ge 0$). In other words, the polynomial
series expansion in dot product kernels plays the role of the Fourier transform in
translation invariant kernels.
2 Regularization Kernels and Integral Operators 
Let us briefly review some results from regularization theory that are needed for the
further understanding of the paper. Many algorithms (SVM, GP, RN, etc.) can be
understood as minimizing a regularized risk functional
$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \Omega[f]$ (5)
where $R_{\mathrm{emp}}$ is the training error of the function $f$ on the given data, $\lambda > 0$, and $\Omega[f]$
is the so-called regularization term. The first term depends on the specific problem
at hand (classification, regression, large margin algorithms, etc.), $\lambda$ is generally
adjusted by some model selection criterion, and $\Omega[f]$ is a nonnegative functional
of $f$ which models our belief about which functions should be considered simple (a
prior in the Bayesian sense or a structure in a Structural Risk Minimization sense).
2.1 Regularization Operators 
One possible interpretation of $k$ is [8] that it leads to regularized risk functionals
where
$\Omega[f] = \frac{1}{2} \|Pf\|^2$ or equivalently $\langle Pk(x, \cdot), Pk(y, \cdot) \rangle = k(x, y)$. (6)
Here $P$ is a regularization operator mapping functions $f$ on $\mathcal{X}$ into a dot product
space (we choose $L_2(\mathcal{X})$). The following theorem allows us to construct explicit
operators $P$ and it provides a criterion for whether a symmetric function $k(x, y)$ is
suitable.
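Theorem 1 (Mercer [3]) Suppose $k \in L_\infty(\mathcal{X}^2)$ such that the integral operator
$T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X})$,
$T_k f(\cdot) := \int_{\mathcal{X}} k(\cdot, x) f(x) \, d\mu(x)$ (7)
is positive. Let $\Phi_j \in L_2(\mathcal{X})$ be the eigenfunction of $T_k$ with eigenvalue $\lambda_j \neq 0$,
normalized such that $\|\Phi_j\|_{L_2} = 1$, and let $\bar{\Phi}_j$ denote its complex conjugate. Then
1. $(\lambda_j(T_k))_j \in \ell_1$.
2. $\Phi_j \in L_\infty(\mathcal{X})$ and $\sup_j \|\Phi_j\|_{L_\infty} < \infty$.
3. $k(x, x') = \sum_{j \in \mathbb{N}} \lambda_j \bar{\Phi}_j(x) \Phi_j(x')$ holds for almost all $(x, x')$, where the series
converges absolutely and uniformly for almost all $(x, x')$.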
This means that by finding the eigensystem $(\lambda_i, \Phi_i)$ of $T_k$ we can also determine
the regularization operator $P$ via [8]
$Pf = \sum_{i=1}^{\infty} \frac{a_i}{\sqrt{\lambda_i}} \Phi_i$ for any $f = \sum_{i=1}^{\infty} a_i \Phi_i$. (8)
The eigensystem $(\lambda_i, \Phi_i)$ tells us which functions are considered "simple" in terms
of the operator $P$. Consequently, in order to determine the regularization properties
of dot product kernels we have to find their eigenfunctions and eigenvalues.
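That (8) is consistent with (6) can be checked in a few lines. The following is a short sketch, assuming real-valued eigenfunctions $\Phi_i$ that are orthonormal in $L_2(\mathcal{X})$ and strictly positive eigenvalues $\lambda_i$ (so that $\bar{\Phi}_i = \Phi_i$ in the expansion of Theorem 1):

```latex
\begin{align*}
k(x, \cdot) &= \sum_{i} \lambda_i \Phi_i(x)\, \Phi_i(\cdot)
  && \text{(Mercer expansion, i.e. } a_i = \lambda_i \Phi_i(x) \text{)} \\
P k(x, \cdot) &= \sum_{i} \frac{\lambda_i \Phi_i(x)}{\sqrt{\lambda_i}}\, \Phi_i(\cdot)
  = \sum_{i} \sqrt{\lambda_i}\, \Phi_i(x)\, \Phi_i(\cdot) \\
\langle P k(x, \cdot), P k(y, \cdot) \rangle
  &= \sum_{i, j} \sqrt{\lambda_i \lambda_j}\, \Phi_i(x)\, \Phi_j(y)\, \langle \Phi_i, \Phi_j \rangle
  = \sum_{i} \lambda_i \Phi_i(x)\, \Phi_i(y) = k(x, y).
\end{align*}
```

In the same notation the regularization term becomes $\frac{1}{2} \|Pf\|^2 = \frac{1}{2} \sum_i a_i^2 / \lambda_i$, so components of $f$ along eigenfunctions with small eigenvalues are penalized most strongly.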
2.2 Specific Assumptions 
Before we diagonalize $T_k$ for a given kernel we have yet to specify the assumptions
we make about the measure $\mu$ and the domain of integration $\mathcal{X}$. Since a suitable
choice can drastically simplify the problem we try to keep as many of the symmetries
imposed by $k(x \cdot y)$ as possible. The predominant symmetry in dot product kernels
is rotation invariance. Therefore we choose the unit ball in $\mathbb{R}^d$
$\mathcal{X} := U_d := \{x \mid x \in \mathbb{R}^d \text{ and } \|x\| \le 1\}$. (9)
This is a benign assumption since the radius can always be adjusted by rescaling
$k(x \cdot y) \to k((\theta x) \cdot (\theta y))$. Similar considerations apply to translation. In some cases
the unit sphere in $\mathbb{R}^d$ is more amenable to our analysis. There we choose
$\mathcal{X} := S^{d-1} := \{x \mid x \in \mathbb{R}^d \text{ and } \|x\| = 1\}$. (10)
The latter is a good approximation of the situation where dot product kernels
perform best: when the training data has approximately equal Euclidean norm (e.g.
in images or handwritten digits). For the sake of simplicity we will limit ourselves
to (10) in most cases.
Secondly we choose $\mu$ to be the uniform measure on $\mathcal{X}$. This means that we have to
solve the following integral equation: find functions $\Phi_i \in L_2(\mathcal{X})$ together with
coefficients $\lambda_i$ such that $T_k \Phi_i(x) := \int_{\mathcal{X}} k(x \cdot y) \Phi_i(y) \, dy = \lambda_i \Phi_i(x)$.
3 Orthogonal Polynomials and Spherical Harmonics 
Before we can give eigenfunctions or state necessary and sufficient conditions we
need some basic relations about Legendre Polynomials and spherical harmonics.
Denote by $P_n(\xi)$ the Legendre Polynomials and by $P_n^d(\xi)$ the associated Legendre
Polynomials (see e.g. [4] for details). They have the following properties:
• The polynomials $P_n(\xi)$ and $P_n^d(\xi)$ are of degree $n$, and moreover $P_n := P_n^3$.
• The (associated) Legendre Polynomials form an orthogonal basis with
$\int_{-1}^{1} P_n^d(\xi) P_m^d(\xi) (1 - \xi^2)^{\frac{d-3}{2}} \, d\xi = \frac{|S^{d-1}|}{|S^{d-2}|} \frac{1}{N(d, n)} \delta_{m,n}$. (11)
Here $|S^{d-1}| = \frac{2 \pi^{d/2}}{\Gamma(d/2)}$ denotes the surface of $S^{d-1}$, and $N(d, n)$ denotes
the multiplicity of spherical harmonics of order $n$ on $S^{d-1}$, i.e. $N(d, n) = \frac{2n + d - 2}{n} \binom{n + d - 3}{n - 1}$.
• This admits the orthogonal expansion of any analytic function $k(\xi)$ on
$[-1, 1]$ into $P_n^d$ by
$k(\xi) = \sum_{n=0}^{\infty} N(d, n) \frac{|S^{d-2}|}{|S^{d-1}|} P_n^d(\xi) \int_{-1}^{1} k(\xi') P_n^d(\xi') (1 - \xi'^2)^{\frac{d-3}{2}} \, d\xi'$. (12)
Moreover, the Legendre Polynomials may be expanded into an orthonormal basis
of spherical harmonics $Y_{n,j}^d$ by the Funk-Hecke equation (cf. e.g. [4]) to obtain
$P_n^d(x \cdot y) = \frac{|S^{d-1}|}{N(d, n)} \sum_{j=1}^{N(d,n)} Y_{n,j}^d(x) Y_{n,j}^d(y)$ (13)
where $\|x\| = \|y\| = 1$ and moreover
$\int_{S^{d-1}} Y_{n,j}^d(x) Y_{n',j'}^d(x) \, dx = \delta_{n,n'} \delta_{j,j'}$. (14)
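As a quick numerical sanity check of the orthogonality relation (11), consider the special case $d = 3$, where the weight function equals 1 and the associated Legendre polynomials reduce to the ordinary $P_n$. The sketch below computes $N(d, n)$ and $|S^{d-1}|$ and compares both sides of (11). The function names, the use of Bonnet's recurrence for $P_n$, and the Simpson quadrature are illustrative choices, not part of the original analysis.

```python
import math

def multiplicity(d, n):
    """N(d, n): number of linearly independent spherical harmonics of order n on S^{d-1}."""
    if n == 0:
        return 1
    # (2n + d - 2) / n * binom(n + d - 3, n - 1); the integer division is exact.
    return (2 * n + d - 2) * math.comb(n + d - 3, n - 1) // n

def sphere_surface(d):
    """|S^{d-1}| = 2 * pi^(d/2) / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)

def legendre(n, xi):
    """Legendre polynomial P_n(xi) (the case d = 3), via Bonnet's recurrence."""
    p_prev, p_curr = 1.0, xi
    if n == 0:
        return p_prev
    for m in range(1, n):
        p_prev, p_curr = p_curr, ((2 * m + 1) * xi * p_curr - m * p_prev) / (m + 1)
    return p_curr

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

d = 3
for n in range(5):
    lhs = simpson(lambda xi: legendre(n, xi) ** 2, -1.0, 1.0)
    rhs = sphere_surface(d) / sphere_surface(d - 1) / multiplicity(d, n)
    print(n, lhs, rhs)  # for d = 3 both columns should agree with 2 / (2n + 1)
```

For $d = 3$ one has $|S^2| / |S^1| = 2$ and $N(3, n) = 2n + 1$, so (11) reduces to the familiar normalization $\int_{-1}^{1} P_n^2 \, d\xi = 2 / (2n + 1)$.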
4 Conditions and Eigensystems on $S^{d-1}$
Schoenberg [7] gives necessary and sufficient conditions under which a function
$k(x \cdot y)$ defined on $S^{d-1}$ satisfies Mercer's condition. In particular he proves the
following two theorems:
Theorem 2 (Dot Product Kernels in Finite Dimensions) A kernel $k(x \cdot y)$
defined on $S^{d-1} \times S^{d-1}$ satisfies Mercer's condition if and only if its expansion into
Legendre polynomials $P_n^d$ has only nonnegative coefficients, i.e.
$k(\xi) = \sum_{n=0}^{\infty} b_n P_n^d(\xi)$ with $b_n \ge 0$. (15)
Theorem 3 (Dot Product Kernels in Infinite Dimensions) A kernel $k(x \cdot y)$
defined on the unit sphere in a Hilbert space satisfies Mercer's condition if and only
if its Taylor series expansion has only nonnegative coefficients:
$k(\xi) = \sum_{n=0}^{\infty} a_n \xi^n$ with $a_n \ge 0$. (16)
Therefore, all we have to do in order to check whether a particular kernel may be 
used in a SV machine or a Gaussian Process is to look at its polynomial series 
expansion and check the coefficients. This will be done in Section 5. 
Before doing so note that (16) is a more stringent condition than (15). In other
words, in order to prove Mercer's condition for arbitrary dimensions it suffices to
show that the Taylor expansion contains only nonnegative coefficients. On the other
hand, in order to prove that a candidate kernel function will never satisfy
Mercer's condition, it is sufficient to show this for (15) where $P_n^d = P_n$, i.e. for the
Legendre Polynomials.
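The check suggested by Theorem 2 can be sketched numerically for $d = 3$, where the expansion (12) reduces to $b_n = \frac{2n+1}{2} \int_{-1}^{1} k(\xi) P_n(\xi) \, d\xi$. The helper names, the Simpson quadrature, and the two test kernels ($e^{\xi}$, whose Taylor coefficients are all positive, and $\tanh \xi$) are assumptions made for illustration. A numerically negative $b_n$ only indicates a violation of (15) up to quadrature error, and finding no negative coefficient up to a finite order proves nothing by itself.

```python
import math

def legendre(n, xi):
    """Legendre polynomial P_n(xi) via Bonnet's recurrence."""
    p_prev, p_curr = 1.0, xi
    if n == 0:
        return p_prev
    for m in range(1, n):
        p_prev, p_curr = p_curr, ((2 * m + 1) * xi * p_curr - m * p_prev) / (m + 1)
    return p_curr

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def legendre_coefficients(k, n_max):
    """Coefficients b_0..b_{n_max} of k in the Legendre basis (d = 3), from (12)."""
    return [
        (2 * n + 1) / 2.0 * simpson(lambda xi: k(xi) * legendre(n, xi), -1.0, 1.0)
        for n in range(n_max + 1)
    ]

def violates_condition_15(k, n_max=6, tol=1e-9):
    """True if some b_n with n <= n_max is negative beyond the tolerance."""
    return any(b < -tol for b in legendre_coefficients(k, n_max))

print(violates_condition_15(math.exp))   # no negative coefficient up to n_max
print(violates_condition_15(math.tanh))  # b_3 is negative for tanh(xi)
```

The tolerance `tol` guards against coefficients that are zero in exact arithmetic (for example the even-order coefficients of an odd function) but come out as tiny negative numbers numerically.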
We conclude this section with an explicit representation of the eigensystem of $k(x \cdot y)$.
It is given by the following lemma:
Lemma 4 (Eigensystem of Dot Product Kernels) Denote by $k(x \cdot y)$ a kernel
on $S^{d-1} \times S^{d-1}$ satisfying condition (15) of Theorem 2. Then the eigensystem of $k$
is given by
$\Phi_{n,j} = Y_{n,j}^d$ with eigenvalues $\lambda_{n,j} = b_n \frac{|S^{d-1}|}{N(d, n)}$ of multiplicity $N(d, n)$. (17)
In other words, $\frac{b_n}{N(d, n)}$ determines the regularization properties of $k(x \cdot y)$.
Proof Using the Funk-Hecke formula (13) we may expand (15) further into Spherical
Harmonics $Y_{n,j}^d$. The latter, however, are orthonormal, hence computing the dot
product of the resulting expansion with $Y_{n,j}^d(y)$ over $S^{d-1}$ leaves only the coefficient
$Y_{n,j}^d(x) \frac{|S^{d-1}|}{N(d, n)} b_n$, which proves that the $Y_{n,j}^d$ are eigenfunctions of the integral operator $T_k$.
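In order to obtain the eigensystem of $k(x \cdot y)$ on $U_d$ we have to expand $k$ into a series
in powers of the radial part $\|x\| \|y\|$ and Legendre polynomials $P_n^d\left(\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right)$ of the
angular part, and expand the eigenfunctions into a radial and an angular factor.
The latter is very technical and is thus omitted. See [6] for details.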
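Written out, the orthogonality argument in the proof of Lemma 4 is the following short calculation. It uses only (13), (14) and (15), with $b_m$ denoting the coefficients from (15):

```latex
\begin{align*}
(T_k Y_{n,j}^d)(x)
  &= \int_{S^{d-1}} k(x \cdot y)\, Y_{n,j}^d(y)\, dy \\
  &= \sum_{m=0}^{\infty} b_m \frac{|S^{d-1}|}{N(d,m)} \sum_{j'=1}^{N(d,m)} Y_{m,j'}^d(x)
     \int_{S^{d-1}} Y_{m,j'}^d(y)\, Y_{n,j}^d(y)\, dy \\
  &= b_n \frac{|S^{d-1}|}{N(d,n)}\, Y_{n,j}^d(x).
\end{align*}
```

For $d = 3$, where $|S^2| = 4\pi$ and $N(3, n) = 2n + 1$, the eigenvalue of order $n$ is therefore $4 \pi b_n / (2n + 1)$.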
5 Examples and Applications 
In the following we will analyze a few kernels and state under which conditions they 
may be used as SV kernels. 
Example 1 (Homogeneous Polynomial Kernels $k(x, y) = (x \cdot y)^p$) It is well
known that this kernel satisfies Mercer's condition for $p \in \mathbb{N}$. We will show that for
$p \notin \mathbb{N}$ this is never the case.
Thus we have to show that (15) cannot hold for an expansion in terms of Legendre
Polynomials ($d = 3$). From [2, 7.126.1] we obtain for $k(\xi) = |\xi|^p$ (we need $|\xi|$ to
make $k$ well-defined)
$\int_{-1}^{1} P_n(\xi) |\xi|^p \, d\xi = \frac{\sqrt{\pi} \, \Gamma(p + 1)}{2^p \, \Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right) \Gamma\left(\frac{3}{2} + \frac{p}{2} + \frac{n}{2}\right)}$ if $n$ even. (18)
For odd $n$ the integral vanishes since $P_n(-\xi) = (-1)^n P_n(\xi)$. In order to satisfy
(15), the integral has to be nonnegative for all $n$. One can see that $\Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right)$
is the only term in (18) that may change its sign. Since the sign of the $\Gamma$ function
alternates with period 1 for $x < 0$ (and has poles for negative integer arguments) we
cannot find any $p$ for which $n = 2 \lfloor \frac{p}{2} + 1 \rfloor$ and $n = 2 \lceil \frac{p}{2} + 1 \rceil$ both correspond to positive
values of the integral.
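The sign pattern of (18) is easy to inspect numerically. The sketch below evaluates the right-hand side of (18) with the standard library's `math.gamma`. The function names are hypothetical, and treating $1 / \Gamma$ as 0 at the poles of $\Gamma$ is a convention adopted here so that even integer $p$ can be evaluated as well.

```python
import math

def inv_gamma(z):
    """1 / Gamma(z), set to 0 at the poles z = 0, -1, -2, ..."""
    if z <= 0 and z == math.floor(z):
        return 0.0
    return 1.0 / math.gamma(z)

def abs_power_moment(p, n):
    """Right-hand side of (18): integral of P_n(xi) * |xi|^p over [-1, 1]."""
    if n % 2 == 1:
        return 0.0  # the integrand is odd, so the integral vanishes
    return (math.sqrt(math.pi) * math.gamma(p + 1) / 2.0 ** p
            * inv_gamma(1 + p / 2 - n / 2) * inv_gamma(1.5 + p / 2 + n / 2))

for p in (2, 1.5):
    print(p, [abs_power_moment(p, n) for n in range(0, 8, 2)])
```

For $p = 2$ the moments are positive for $n = 0, 2$ and vanish for larger even $n$. For $p = 1.5$ the candidates named in the text are $n = 2 \lfloor 1.75 \rfloor = 2$ and $n = 2 \lceil 1.75 \rceil = 4$, and the value at $n = 4$ is negative because $\Gamma(-0.25) < 0$.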
Example 2 (Inhomogeneous Polynomial Kernels $k(x, y) = (x \cdot y + 1)^p$)
Likewise we might conjecture that $k(\xi) = (1 + \xi)^p$ is an admissible kernel for all
$p > 0$. Again, we expand $k$ in a series of Legendre Polynomials to obtain [2, 7.127]
$\int_{-1}^{1} P_n(\xi) (\xi + 1)^p \, d\xi = \frac{2^{p+1} \Gamma^2(p + 1)}{\Gamma(p + 2 + n) \Gamma(p + 1 - n)}$. (19)
For $p \in \mathbb{N}$ all terms with $n > p$ vanish and the remainder is positive. For noninteger
$p$, however, (19) may change its sign. This is due to $\Gamma(p + 1 - n)$. In particular,
for any $p \notin \mathbb{N}$ (with $p > 0$) we have $\Gamma(p + 1 - n) < 0$ for $n = \lceil p \rceil + 1$. This violates
condition (15), hence such kernels cannot be used in SV machines either.
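The claim about the first sign change in (19) can be illustrated with a few lines of code. This is a sketch under the same convention as before ($1 / \Gamma$ is taken to be 0 at the poles of $\Gamma$), and the function names are hypothetical.

```python
import math

def inv_gamma(z):
    """1 / Gamma(z), set to 0 at the poles z = 0, -1, -2, ..."""
    if z <= 0 and z == math.floor(z):
        return 0.0
    return 1.0 / math.gamma(z)

def shifted_power_moment(p, n):
    """Right-hand side of (19): integral of P_n(xi) * (1 + xi)^p over [-1, 1]."""
    return (2.0 ** (p + 1) * math.gamma(p + 1) ** 2
            * inv_gamma(p + 2 + n) * inv_gamma(p + 1 - n))

def first_negative_order(p, n_max=20):
    """Smallest n <= n_max for which (19) is negative, or None if there is none."""
    for n in range(n_max + 1):
        if shifted_power_moment(p, n) < 0:
            return n
    return None

for p in (2, 0.5, 1.5):
    print(p, first_negative_order(p))
```

For integer $p$ no negative moment is found, while for $p = 0.5$ and $p = 1.5$ the first negative moment occurs at $n = \lceil p \rceil + 1$, in line with the argument above.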
Example 3 (Vovk's Real Polynomial $k(x, y) = \frac{1 - (x \cdot y)^p}{1 - (x \cdot y)}$ with $p \in \mathbb{N}$) This
kernel can be written as $k(\xi) = \sum_{n=0}^{p-1} \xi^n$, hence all the coefficients $a_n$ with $n < p$ equal 1
(and vanish otherwise), which means that this kernel can be used regardless of the
dimensionality of the input space. Likewise we can analyze an infinite power series:
Example 4 (Vovk's Infinite Polynomial $k(x, y) = (1 - (x \cdot y))^{-1}$) This kernel
can be written as $k(\xi) = \sum_{n=0}^{\infty} \xi^n$, hence all the coefficients $a_n = 1$. It suggests poor
generalization properties of that kernel.
Example 5 (Neural Networks Kernels $k(x, y) = \tanh(a + (x \cdot y))$) It is a
longstanding open question whether kernels $k(\xi) = \tanh(a + \xi)$ may be used as SV
kernels, or for which sets of parameters this might be possible. We show that this is
impossible for any set of parameters.
The technique is identical to the one of Examples 1 and 2: we have to show that $k$
fails the conditions of Theorem 2. Since this is very technical (and is best done by
using computer algebra programs, e.g. Maple), we refer the reader to [6] for details
and explain for the simpler case of Theorem 3 how the method works. Expanding
$\tanh(a + \xi)$ into a Taylor series yields
$\tanh a + \xi \frac{1}{\cosh^2 a} - \xi^2 \frac{\tanh a}{\cosh^2 a} - \frac{\xi^3}{3} (1 - \tanh^2 a)(1 - 3 \tanh^2 a) + O(\xi^4)$. (20)
Now we analyze (20) coefficient-wise. Since all of them have to be nonnegative we
obtain from the first term $a \in [0, \infty)$, from the third term $a \in (-\infty, 0]$, and finally from
the fourth term $|a| \in [\operatorname{arctanh} \frac{1}{\sqrt{3}}, \operatorname{arctanh} 1]$. This leaves us with $a \in \emptyset$, hence under
no conditions on its parameters does the kernel above satisfy Mercer's condition.
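The coefficient-wise argument can be replayed numerically. The sketch below evaluates the four coefficients appearing in (20) and checks, on a grid of offsets $a$, that at least one of them is negative. The function names and the grid $a \in [-3, 3]$ are illustrative assumptions; for much larger $|a|$ the factor $1 / \cosh^2 a$ underflows in floating point and the check becomes uninformative.

```python
import math

def tanh_taylor_coefficients(a):
    """Coefficients of xi^0, ..., xi^3 in the Taylor expansion (20) of tanh(a + xi)."""
    t = math.tanh(a)
    s = 1.0 - t * t  # equals 1 / cosh(a)^2
    return [t, s, -t * s, -s * (1.0 - 3.0 * t * t) / 3.0]

def has_negative_coefficient(a):
    """True if one of the first four Taylor coefficients is negative."""
    return min(tanh_taylor_coefficients(a)) < 0.0

offsets = [i / 100.0 for i in range(-300, 301)]
print(all(has_negative_coefficient(a) for a in offsets))
```

For $a < 0$ the constant term is negative, for $a > 0$ the coefficient of $\xi^2$ is negative, and at $a = 0$ the coefficient of $\xi^3$ equals $-1/3$, so the three cases of the argument above cover the whole grid.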
6 Eigensystems on $U_d$
In order to find the eigensystem of $T_k$ on $U_d$ we have to find a different representation
of $k$ where the radial part $\|x\| \|y\|$ and the angular part $\xi = \left(\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right)$ are factored
out separately. We assume that $k(x \cdot y)$ can be written as
$k(x \cdot y) = \sum_{n=0}^{\infty} \kappa_n(\|x\| \|y\|) P_n^d(\xi)$ (21)
where the $\kappa_n$ are polynomials. To see that we can always find such an expansion for
analytic functions, first expand $k$ in a Taylor series and then expand each coefficient
$(\|x\| \|y\| \xi)^n$ into $(\|x\| \|y\|)^n \sum_{j=0}^{n} c_j(d, n) P_j^d(\xi)$. Rearranging terms into a series of
$P_j^d$ gives expansion (21). This allows us to factorize the integral operator into its
radial and its angular part. We obtain the following theorem:
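Theorem 5 (Eigenfunctions of $T_k$ on $U_d$) For any kernel $k$ with expansion
(21) the eigensystem of the integral operator $T_k$ on $U_d$ is given by
$\Phi_{n,j,l}(x) = Y_{n,j}^d\left(\frac{x}{\|x\|}\right) \phi_{n,l}(\|x\|)$ (22)
with eigenvalues $\Lambda_{n,j,l} = \frac{|S^{d-1}|}{N(d, n)} \lambda_{n,l}$ and multiplicity $N(d, n)$, where $(\phi_{n,l}, \lambda_{n,l})$ is
the eigensystem of the integral operator
$\int_0^1 r_x^{d-1} \kappa_n(r_x r_y) \phi_{n,l}(r_x) \, dr_x = \lambda_{n,l} \phi_{n,l}(r_y)$. (23)
In general, (23) cannot be solved analytically. However, the accuracy of numerically
solving (23) (finite integral in one dimension) is much higher than when
diagonalizing $T_k$ directly.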
Proof All we have to do is split the integral $\int_{U_d} dx$ into $\int_0^1 r^{d-1} \, dr \int_{S^{d-1}} d\Omega$. Moreover
note that since $T_k$ commutes with the group of rotations it follows from group
theory [4] that we may separate the angular and the radial part in the eigenfunctions,
hence use the ansatz $\Phi(x) = \Phi_\Omega\left(\frac{x}{\|x\|}\right) \phi(\|x\|)$.
Next apply the Funk-Hecke equation (13) to expand the associated Legendre
Polynomials $P_n^d$ into the spherical harmonics $Y_{n,j}^d$. As in Lemma 4 this leads to the
spherical harmonics as the angular part of the eigensystem. The remaining radial
part is then (23). See [6] for more details.
This leads to the eigensystem of the homogeneous polynomial kernel $k(x, y) = (x \cdot y)^p$:
if we use (18) in conjunction with (12) to expand $\xi^p$ into a series of $P_n^d(\xi)$
we obtain an expansion of type (21) where all $\kappa_n(r_x r_y) \propto (r_x r_y)^p$ for $n \le p$ and
$\kappa_n(r_x r_y) = 0$ otherwise. Hence, the only solution to (23) is $\phi_n(r) = r^p$, thus
$\Phi_{n,j}(x) = \|x\|^p Y_{n,j}^d\left(\frac{x}{\|x\|}\right)$. Eigenvalues can be obtained in a similar way.
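The remark that (23) lends itself to numerical solution can be illustrated on the homogeneous polynomial case, taking $\kappa_n(s) = s^p$ (the proportionality constant is set to 1 here, which is an assumption). Inserting $\phi(r) = r^p$ into (23) then gives the eigenvalue $\int_0^1 r^{2p + d - 1} \, dr = 1 / (2p + d)$. The sketch below discretizes (23) with a midpoint rule, symmetrizes it through the substitution $u_i = w_i \phi(r_i)$ with $w_i = \sqrt{h \, r_i^{d-1}}$, and runs a power iteration. The discretization, the function names, and the parameters $p = 2$, $d = 3$ are illustrative choices rather than the procedure used in [6].

```python
import math

def radial_matrix(kappa, d, m=100):
    """Symmetrized midpoint-rule discretization of the radial operator in (23)."""
    h = 1.0 / m
    r = [(i + 0.5) * h for i in range(m)]           # midpoints of the m subintervals
    w = [math.sqrt(h * ri ** (d - 1)) for ri in r]  # sqrt(quadrature weight * r^(d-1))
    a = [[w[i] * kappa(r[i] * r[j]) * w[j] for j in range(m)] for i in range(m)]
    return r, w, a

def leading_eigenpair(a, iters=50):
    """Power iteration for the largest eigenvalue of a symmetric matrix
    with nonnegative entries."""
    v = [1.0] * len(a)
    lam = 0.0
    for _ in range(iters):
        av = [sum(x * y for x, y in zip(row, v)) for row in a]
        lam = math.sqrt(sum(x * x for x in av))
        v = [x / lam for x in av]
    return lam, v

p, d = 2, 3
r, w, a = radial_matrix(lambda s: s ** p, d)
lam, u = leading_eigenpair(a)
phi = [ui / wi for ui, wi in zip(u, w)]       # undo the symmetrizing substitution
print(lam, 1.0 / (2 * p + d))                 # numerical vs. exact eigenvalue
print(phi[-1] / phi[0], (r[-1] / r[0]) ** p)  # phi(r) should be proportional to r^p
```

The symmetrization keeps the matrix symmetric despite the weight $r^{d-1}$, so standard symmetric eigenvalue methods apply. For a general $\kappa_n$ the operator has more than one nonzero eigenvalue, and a full symmetric eigensolver would replace the power iteration.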
7 Discussion 
In this paper we gave conditions on dot product kernels under
which they satisfy Mercer's condition. While the requirements are relatively
easy to check in the case where data is restricted to spheres (which allowed us to
prove that several kernels can never be suitable SV kernels) and led to explicit
formulations for eigenvalues and eigenfunctions, the corresponding calculations on
balls are more intricate and mainly amenable to numerical analysis.
Acknowledgments: AS was supported by the DFG (Sm 62-1). The authors thank
Bernhard Schölkopf for helpful discussions.
References 
[1] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf,
C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -- Support 
Vector Learning, pages 89-116, Cambridge, MA, 1999. MIT Press. 
[2] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Academic 
Press, New York, 1981. 
[3] J. Mercer. Functions of positive and negative type and their connection with the 
theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909. 
[4] C. Müller. Analysis of Spherical Symmetries in Euclidean Spaces, volume 129 of
Applied Mathematical Sciences. Springer, New York, 1997. 
[5] N. Oliver, B. Schölkopf, and A.J. Smola. Natural regularization in SVMs. In A.J.
Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large
Margin Classifiers, pages 51 - 60, Cambridge, MA, 2000. MIT Press. 
[6] Z. Óvári. Kernels, eigenvalues and support vector machines. Honours thesis,
Australian National University, Canberra, 2000.
[7] I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 9:96-108, 1942. 
[8] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization
operators and support vector kernels. Neural Networks, 11:637-649, 1998. 
[9] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional 
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. 
[10] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to 
linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in 
Graphical Models. Kluwer, 1998. 
