Sparsity of data representation of optimal kernel 
machine and leave-one-out estimator 
A. Kowalczyk 
Chief Technology Office, Telstra 
770 Blackburn Road, Clayton, Vic. 3168, Australia 
(adam.kowalczyk@team.telstra.com)
Abstract 
Vapnik's result that the expectation of the generalisation error of the opti- 
mal hyperplane is bounded by the expectation of the ratio of the number 
of support vectors to the number of training examples is extended to a 
broad class of kernel machines. The class includes Support Vector Ma- 
chines for soft margin classification and regression, and Regularization 
Networks with a variety of kernels and cost functions. We show that key 
inequalities in Vapnik's result become equalities once "the classification 
error" is replaced by "the margin error", with the latter defined as an in- 
stance with positive cost. In particular we show that expectations of the 
true margin error and the empirical margin error are equal, and that the 
sparse solutions for kernel machines are possible only if the cost function 
is "partially" insensitive. 
1 Introduction 
Minimization of regularized risk is a backbone of several recent advances in machine
learning, including Support Vector Machines (SVM) [13], Regularization Networks (RN) [5]
and Gaussian Processes [15]. Such a machine is typically implemented as a weighted sum of
a kernel function evaluated for pairs composed of the data vector in question and a number
of selected training vectors, the so-called support vectors. For practical machines it is
desirable to have as few support vectors as possible. It has been observed empirically that
SVM solutions often have very few support vectors, i.e. they are sparse, while RN solutions
are not. This paper shows that this behaviour is determined by the properties of the cost
function used (its partial insensitivity, to be precise).
Another motivation for interest in sparsity of solutions comes from the celebrated result of
Vapnik [13], which links the number of support vectors to the generalization error of SVMs
via a bound on the leave-one-out estimator [9]. This result was originally shown for the
special case of classification with a hard margin cost function (optimal hyperplane). The
papers by Opper and Winther [10], Jaakkola and Haussler [6], and Joachims [7] extend
Vapnik's result in the direction of bounds for the classification error of SVMs. The first of
those papers deals with the hard margin case, while the other two derive tighter bounds on
the classification error of soft margin SVMs with $\varepsilon$-insensitive linear cost.
In this paper we extend Vapnik's result in another direction. Firstly, we show that it holds
for a wide range of kernel machines optimized for a variety of cost functions, for both
classification and regression tasks. Secondly, we find that Vapnik's key inequalities become
equalities once "the misclassification error" is replaced by "the margin error" (defined as
the rate of data instances incurring positive costs). In particular, we find that for margin
errors the following three expectations are equal to each other: (i) of the empirical risk,
(ii) of the true risk and (iii) of the leave-one-out risk estimator. Moreover, we show that
they are equal to the expectation of the ratio of the number of support vectors to the number
of training examples.
The main results are given in Section 2. A brief discussion of the results is given in Section 3.
2 Main results 
Given an $l$-sample $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ of patterns $x_i \in X \subset \mathbb{R}^n$
and target values $y_i \in Y \subset \mathbb{R}$, the learning algorithms used by SVMs [13],
RNs [5] or Gaussian Processes [15] minimise the regularized risk functional of the form:
\[
\min_{(f,b) \in \mathcal{H} \times \mathbb{R}} R_{reg}[f,b] = \sum_{i=1}^{l} c(x_i, y_i, \xi_i[f,b]) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2. \tag{1}
\]
Here $\mathcal{H}$ denotes a reproducing kernel Hilbert space (RKHS) [1], $\|\cdot\|_{\mathcal{H}}$ is the
corresponding norm, $\lambda > 0$ is a regularization constant, $c: X \times Y \times \mathbb{R} \to \mathbb{R}^+$
is a non-negative cost function penalising the deviation $\xi_i[f,b] = y_i - \hat{y}_i$ of the
estimator $\hat{y}_i := f(x_i) + \beta b$ from the target $y_i$ at location $x_i$, $b \in \mathbb{R}$ is a
constant (bias) and $\beta \in \{0, 1\}$ is another constant ($\beta = 0$ is used to switch the bias off).
The impoant Representer Theorem [8, 4] states that the minimizer (1) has the expansion: 
l 
= E 
(2) 
i=1 
where k: X x X --> 11 is the kernel corresponding to the RKHS 7/. In the following 
section we shall show that under general assumptions this expansion is unique. 
If cti  0, then xi is called a support vector of f(.). 
2.1 Unique Representer Theorem 
We recall that a function is called a real analytic function on a domain in $\mathbb{R}^q$ if for every
point of this domain the Taylor series for the function converges to this function in some
neighborhood of that point.¹
A proof of the following crucial Lemma is omitted due to lack of space. 
Lemma 2.1. If $\varphi: X \to \mathbb{R}$ is an analytic function on an open connected subset
$X \subset \mathbb{R}^n$, then the subset $\varphi^{-1}(0) \subset X$ is either equal to $X$ or has Lebesgue measure 0.
Analyticity is essential for the above result; in general the result does not hold even for
infinitely differentiable functions. Indeed, for every closed subset $V \subset \mathbb{R}^n$ there exists
an infinitely differentiable ($C^\infty$) function $\varphi$ on $\mathbb{R}^n$ such that $\varphi^{-1}(0) = V$, and there
exist closed subsets with positive Lebesgue measure and empty interior. Hence the Lemma, and
consequently the subsequent results, do not hold for the broader class of all $C^\infty$ functions.
¹ Examples of analytic functions are polynomials. Ordinary functions such as $\sin(x)$, $\cos(x)$
and $\exp(x)$ are examples of non-polynomial analytic functions. The function $\psi(x) := \exp(-1/x^2)$
for $x > 0$ and $0$ otherwise is an example of a function on the real line that is infinitely
differentiable but not analytic (locally it is not equal to its Taylor series expansion at zero).
Standing assumptions. The following is assumed. 
1. The set $X \subset \mathbb{R}^n$ is open and connected and either $Y = \{\pm 1\}$ (the case of
classification) or $Y \subset \mathbb{R}$ is an open segment (the case of regression).
2. The kernel $k: X \times X \to \mathbb{R}$ is a real analytic function on its domain.
3. The cost function $\xi \mapsto c(x, y, \xi)$ is convex, differentiable on $\mathbb{R}$ and $c(x, y, 0) = 0$
for every $(x, y) \in X \times Y$. It can be shown that
\[
c(x, y, \xi) > 0 \iff \frac{\partial c}{\partial \xi}(x, y, \xi) \neq 0. \tag{3}
\]
4. $l$ is a fixed integer, $1 < l \leq \dim(\mathcal{H})$, and the training sample $(x_1, y_1), \ldots, (x_l, y_l)$
is iid drawn from a continuous probability density $p(x, y)$ on $X \times Y$.
5. The phrase "with probability 1" will mean with probability 1 with respect to the 
selection of the training sample. 
Note that the standard polynomial kernel $k(x, x') = (1 + x \cdot x')^d$, $x, x' \in \mathbb{R}^n$, satisfies the
above assumptions with $\dim(\mathcal{H}) = \binom{n+d}{d}$. Similarly, the Gaussian kernel
$k(x, x') = \exp(-\|x - x'\|^2/\sigma)$ satisfies them with $\dim(\mathcal{H}) = \infty$.
Typical cost functions such as the super-linear loss functions $c_p(x, y, \xi) = (y\xi)_+^p :=
(\max(0, y\xi))^p$ used for SVM classification, or $c_p(x, y, \xi) = (|\xi| - \varepsilon)_+^p$ used for SVM
regression, or the super-linear loss $c_p(x, y, \xi) = |\xi|^p$ for $p > 1$ for RN regression, satisfy
the above assumptions². Similarly, variations of the Huber robust loss [11, 14] satisfy those
assumptions.
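The three cost families named above can be written down directly. The following is a minimal NumPy sketch, not code from the paper: the function names and the default values `p=2.0` and `eps=0.1` are illustrative assumptions, and the deviation is taken to be `xi = y - y_hat` as in the definition of the regularized risk.

```python
import numpy as np

def svm_classification_cost(y, xi, p=2.0):
    """(y * xi)_+^p, with xi = y - y_hat and labels y in {-1, +1}."""
    return np.maximum(0.0, y * xi) ** p

def svm_regression_cost(xi, eps=0.1, p=2.0):
    """(|xi| - eps)_+^p: vanishes everywhere inside the eps-tube."""
    return np.maximum(0.0, np.abs(xi) - eps) ** p

def rn_regression_cost(xi, p=2.0):
    """|xi|^p: vanishes only at xi = 0."""
    return np.abs(xi) ** p

# Evaluate the costs on a few deviations (values chosen only for illustration)
xi = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
print(svm_regression_cost(xi))
print(rn_regression_cost(xi))
print(svm_classification_cost(1.0, xi))
```

The SVM costs are zero on a whole interval of deviations, whereas the RN cost is zero at a single point only; this difference is what the sparsity results of the paper turn on.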
The following result strengthens the Representer Theorem [8, 4].
Theorem 2.2. If $l \leq \dim \mathcal{H}$, then both the minimizer of the regularized risk (1) and its
expansion (2) are unique with probability 1.
Proof outline. Convexity of the functional $(f, b) \mapsto R_{reg}[f, b]$ and its strict convexity with
respect to $f \in \mathcal{H}$ implies the uniqueness of $f \in \mathcal{H}$ minimizing the regularized risk (1);
cf. [3]. From the assumption that $l \leq \dim \mathcal{H}$ we derive the existence of
$\tilde{x}_1, \ldots, \tilde{x}_l \in X$ such that the functions $k(\tilde{x}_i, \cdot)$, $i = 1, \ldots, l$, are linearly
independent. Equivalently, the following Gram determinant is $\neq 0$:
\[
\varphi(\tilde{x}_1, \ldots, \tilde{x}_l) := \det[\langle k(\tilde{x}_i, \cdot), k(\tilde{x}_j, \cdot) \rangle_{\mathcal{H}}]_{1 \leq i,j \leq l} = \det[k(\tilde{x}_i, \tilde{x}_j)]_{1 \leq i,j \leq l} \neq 0.
\]
Now Lemma 2.1 implies that $\varphi(x_1, \ldots, x_l) \neq 0$ with probability 1, since $\varphi: X^l \to \mathbb{R}$ is an
analytic function. Hence the functions $k(x_i, \cdot)$ are linearly independent and the expansion (2)
is unique with probability 1. Q.E.D.
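The role of the condition that $l$ does not exceed the dimension of the RKHS can be checked numerically through the rank of the Gram matrix. The sketch below is an illustration under stated assumptions rather than part of the proof: it uses the polynomial kernel with $n = 2$ and degree $d = 2$, points drawn from a standard normal distribution, and a hypothetical helper `numerical_rank` with an arbitrarily chosen relative tolerance.

```python
import math
import numpy as np

def poly_gram(X, d):
    """Gram matrix of the polynomial kernel k(x, x') = (1 + x . x')^d."""
    return (1.0 + X @ X.T) ** d

def numerical_rank(K, rel_tol=1e-10):
    """Number of singular values above rel_tol times the largest one."""
    s = np.linalg.svd(K, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
n, d = 2, 2
dim_H = math.comb(n + d, d)              # dimension of the RKHS for this kernel

X_small = rng.standard_normal((4, n))    # l = 4  <= dim(H)
X_large = rng.standard_normal((10, n))   # l = 10 >  dim(H)

rank_small = numerical_rank(poly_gram(X_small, d))
rank_large = numerical_rank(poly_gram(X_large, d))
print(dim_H, rank_small, rank_large)
```

With $l$ at most the RKHS dimension the Gram matrix of randomly drawn points has full rank $l$, so the functions $k(x_i, \cdot)$ are linearly independent; with more points than dimensions the rank saturates and the expansion coefficients are no longer unique.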
2.2 Leave-one-out estimator 
In this section the minimizer of (1) for the whole data sequence of $l$ training instances and
some other objects related to it will be additionally marked with the superscript '$(l)$'. The
superscript '$(l \setminus i)$' will be used analogously to mark objects corresponding to the minimizer
of (1) for the reduced training sequence, with the $i$th instance removed.
Lemma 2.3. With probability 1, for every $i \in \{1, \ldots, l\}$:
\[
\alpha_i^{(l)} \neq 0 \iff c(x_i, y_i, \xi_i[f^{(l)}, b^{(l)}]) > 0, \tag{4}
\]
\[
\alpha_i^{(l)} \neq 0 \iff c(x_i, y_i, \xi_i[f^{(l \setminus i)}, b^{(l \setminus i)}]) > 0. \tag{5}
\]
² Note that in general, if a function $\phi: \mathbb{R} \to \mathbb{R}$ is convex, differentiable and such that
$\frac{d\phi}{d\xi}(0) = 0$, then the cost function $c(x, y, \xi) := \phi((\xi)_+)$ is convex and differentiable.
Proof outline. With probability 1, the functions $k(x_j, \cdot)$, $j = 1, \ldots, l$, are linearly independent
(cf. the proof of Theorem 2.2) and there exists a feature map $\Phi: X \to \mathbb{R}^l$ such that the
vectors $z_j := \Phi(x_j)$, $j = 1, \ldots, l$, are linearly independent, $k(x_j, x) = z_j \cdot \Phi(x)$ and
$f^{(l)}(x) = z^{(l)} \cdot \Phi(x) + \beta b^{(l)}$ for every $x \in X$, where $z^{(l)} := \sum_{j=1}^{l} \alpha_j^{(l)} z_j$. The pair
$(z^{(l)}, b^{(l)})$ minimizes the function
\[
R_{reg}^{(l)}(z, b) := \sum_{j=1}^{l} c(x_j, y_j, \xi_j(z, b)) + \frac{\lambda}{2} \|z\|^2, \tag{6}
\]
where $\xi_j(z, b) := y_j - z \cdot z_j - \beta b$. This function is differentiable due to the standing
assumptions on the cost $c$. Hence, necessarily $\mathrm{grad}\, R_{reg}^{(l)} = 0$ at the minimum $(z^{(l)}, b^{(l)})$,
which due to the linear independence of the vectors $z_j$ gives
\[
\alpha_j^{(l)} = \frac{1}{\lambda} \frac{\partial c}{\partial \xi}(x_j, y_j, \xi_j(z^{(l)}, b^{(l)})) \tag{7}
\]
for every $j = 1, \ldots, l$. This equality combined with equivalence (3) proves (4).
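The stationarity condition (7) can be verified numerically in the one case where the minimizer has a closed form, namely the squared loss $c = \xi^2$ with the bias switched off ($\beta = 0$). The sketch below is only an illustration: it assumes a regularizer of the form $(\lambda/2)\|f\|^2$, a Gaussian kernel with $\sigma = 1$, and synthetic data; the helper names are hypothetical.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / sigma)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def fit_squared_loss(K, y, lam):
    """Minimise sum_i xi_i^2 + (lam / 2) * alpha^T K alpha, with xi = y - K alpha.

    Setting the gradient to zero gives (K + (lam / 2) I) alpha = y.
    """
    l = K.shape[0]
    return np.linalg.solve(K + 0.5 * lam * np.eye(l), y)

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(12, 2))              # synthetic patterns
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(12)   # synthetic targets
lam = 1.0

K = gaussian_gram(X)
alpha = fit_squared_loss(K, y, lam)
xi = y - K @ alpha                      # deviations xi_j at the minimum

# Condition (7): alpha_j = (1 / lam) * dc/dxi, and dc/dxi = 2 * xi for c = xi^2
print(np.allclose(alpha, 2.0 * xi / lam))
# Count coefficients that are numerically nonzero
print(np.count_nonzero(np.abs(alpha) > 1e-12), "of", len(alpha))
```

Because the squared loss is positive at every nonzero deviation, each coefficient is proportional to a nonzero residual, so in this example every training point is expected to be a support vector.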
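Now we proceed to the proof of (5). Note that the pair $(z^{(l \setminus i)}, b^{(l \setminus i)})$, where
$z^{(l \setminus i)} := \sum_{j \neq i} \alpha_j^{(l \setminus i)} z_j$, corresponds in the feature space to the minimizer
$(f^{(l \setminus i)}, b^{(l \setminus i)})$ of the reduced regularized risk:
\[
R_{reg}^{(l \setminus i)}(z, b) := \sum_{j=1,\, j \neq i}^{l} c(x_j, y_j, \xi_j(z, b)) + \frac{\lambda}{2} \|z\|^2.
\]
Sufficiency in (5). From (4) and the characterization (7) of the critical point it follows
immediately that if $\alpha_i^{(l)} = 0$, then the minimizers for the full and reduced data sets are identical.
Necessity in (5). A supposition of $\alpha_i^{(l)} \neq 0$ and $c(x_i, y_i, \xi_i[f^{(l \setminus i)}, b^{(l \setminus i)}]) = 0$ leads to a
contradiction. Indeed, from (4), $c(x_i, y_i, \xi_i(z^{(l)}, b^{(l)})) > 0$, hence:
\[
R_{reg}^{(l)}(z^{(l \setminus i)}, b^{(l \setminus i)}) = R_{reg}^{(l \setminus i)}(z^{(l \setminus i)}, b^{(l \setminus i)})
\leq R_{reg}^{(l \setminus i)}(z^{(l)}, b^{(l)}) = R_{reg}^{(l)}(z^{(l)}, b^{(l)}) - c(x_i, y_i, \xi_i(z^{(l)}, b^{(l)}))
< R_{reg}^{(l)}(z^{(l)}, b^{(l)}) = \min_{(z,b) \in \mathbb{R}^l \times \mathbb{R}} R_{reg}^{(l)}(z, b).
\]
This contradiction completes the proof. Q.E.D.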
We say that $x_i$ is a sensitive support vector if $\alpha_i^{(l)} \neq 0$ and $f^{(l)} \neq f^{(l \setminus i)}$, i.e., if its
removal from the training set changes the solution.
Corollary 2.4. Every support vector is sensitive with probability 1. 
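Proof. If $\alpha_i^{(l)} \neq 0$, then the vector $z^{(l)} \notin \mathrm{Lin}(z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_l)$, since $z^{(l)}$
has a non-trivial component $\alpha_i^{(l)} z_i$ in the direction of the $i$th feature vector $z_i$, while
$z^{(l \setminus i)} \in \mathrm{Lin}(z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_l)$. Thus $z^{(l)}$ and $z^{(l \setminus i)}$ have different
directions in $\mathrm{Lin}(z_1, \ldots, z_l) \subset \mathbb{R}^l$ and there exists $j \in \{1, \ldots, l\}$ such that
$f^{(l)}(x_j) \neq f^{(l \setminus i)}(x_j)$. Q.E.D.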
We define the empirical risk and the expected (true) risk of margin error
\[
R_{emp}[f, b] := \frac{1}{l} \sum_{i=1}^{l} I\{c(x_i, y_i, \xi_i[f,b]) > 0\} = \frac{\#\{i;\ c(x_i, y_i, \xi_i[f,b]) > 0\}}{l},
\]
\[
R_{exp}[f, b] := \mathrm{Prob}[c(x, y, y - f(x) - \beta b) > 0],
\]
where $(f, b) \in \mathcal{H} \times \mathbb{R}$, $I\{\cdot\}$ denotes the indicator function and $\#$ denotes the cardinality
(number of elements) of a set.
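The empirical margin-error rate defined above is a simple count. The following sketch computes it for given targets and predictions; as a simplifying assumption the cost is passed as a function of the deviation $\xi$ alone, and the squared $\varepsilon$-insensitive cost with `eps=0.1` and the data values are purely illustrative.

```python
import numpy as np

def empirical_margin_error(y, y_hat, cost):
    """Fraction of training points with positive cost: #{i : c_i > 0} / l."""
    xi = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(cost(xi) > 0.0))

def eps_insensitive(xi, eps=0.1):
    """Squared eps-insensitive cost (|xi| - eps)_+^2."""
    return np.maximum(0.0, np.abs(xi) - eps) ** 2

y = [1.0, 2.0, 3.0, 4.0]          # targets (made-up values)
y_hat = [1.05, 2.5, 3.0, 3.0]     # predictions f(x_i) + beta * b (made-up values)
print(empirical_margin_error(y, y_hat, eps_insensitive))
```

Here two of the four deviations fall outside the tube of half-width 0.1, so half of the points count as margin errors.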
From the above Lemma we obtain immediately the following result: 
Corollary 2.5. With probability 1:
\[
\frac{\#\{i;\ c(x_i, y_i, y_i - f^{(l \setminus i)}(x_i) - \beta b^{(l \setminus i)}) > 0\}}{l} = \frac{\#\{i;\ \alpha_i^{(l)} \neq 0\}}{l} = R_{emp}[f^{(l)}, b^{(l)}].
\]
There exist counter-examples showing the phrase "with probability 1" above cannot be 
omitted. The sum on L.H.S. above is the leave-one-out estimator of the risk of margin 
error [14] for the minimizer of regularized risk (1). The above corollary shows that this 
estimator is uniquely determined by the number of support vectors as well as the number 
of training margin errors. 
Now from the Lunts-Brailovsky Theorem [14, Theorem 10.8] applied to the risk
$Q(x, y; f, b) := I\{c(x, y, y - f(x) - \beta b) > 0\}$ the following result is obtained.
Theorem 2.6.
\[
E[R_{exp}[f^{(l-1)}, b^{(l-1)}]] = E[R_{emp}[f^{(l)}, b^{(l)}]] = \frac{E[\#\{i;\ \alpha_i^{(l)} \neq 0\}]}{l}, \tag{8}
\]
where the first expectation is over the selection of the training $(l-1)$-sample and the remaining
two are with respect to the selection of the training $l$-sample.
A cost function is called partially insensitive if there exist $(x, y) \in X \times Y$ and $\xi_1 \neq \xi_2$
such that $c(x, y, \xi_1) = c(x, y, \xi_2) = 0$. Otherwise, the cost $c$ is called sensitive. Typical
SVM cost functions are partially insensitive while typical RN cost functions are sensitive.
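The distinction between partially insensitive and sensitive costs can be probed numerically by looking for two distinct deviations with zero cost. The sketch below is a heuristic grid check under stated assumptions, not a proof of either property: the helper names, the grid, and the two example costs (squared $\varepsilon$-insensitive with $\varepsilon = 0.1$, and squared loss) are illustrative choices, and the cost is treated as a function of $\xi$ alone.

```python
import numpy as np

def zero_set_on_grid(cost, grid):
    """Grid values xi at which the cost vanishes exactly."""
    grid = np.asarray(grid, dtype=float)
    return grid[cost(grid) == 0.0]

def looks_partially_insensitive(cost, grid):
    """True if the cost is zero at two or more distinct grid points."""
    return len(np.unique(zero_set_on_grid(cost, grid))) >= 2

grid = np.linspace(-1.0, 1.0, 201)

def svm_cost(xi):
    """Squared eps-insensitive cost with eps = 0.1 (typical SVM regression cost)."""
    return np.maximum(0.0, np.abs(xi) - 0.1) ** 2

def rn_cost(xi):
    """Squared loss (typical RN cost)."""
    return xi ** 2

print(looks_partially_insensitive(svm_cost, grid))
print(looks_partially_insensitive(rn_cost, grid))
```

A positive answer on the grid exhibits the two required zeros directly; a negative answer only says that no such pair was found at the chosen resolution.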
The following result can be derived from Theorem 2.6 and Lemma 2.3. 
Corollary 2.7. If the number of support vectors is $< l$ with a probability $> 0$, then the cost
function has to be partially insensitive.
Typical cost functions penalize an allocation of a wrong sign, i.e.
\[
\forall (x, y, \hat{y}) \in X \times Y \times \mathbb{R}: \quad y\hat{y} < 0 \implies c(x, y, y - \hat{y}) > 0. \tag{9}
\]
Let us define the risk of misclassification of the kernel machine $\hat{y}(x) = f(x) + \beta b$ for
$(f, b) \in \mathcal{H} \times \mathbb{R}$ as $R_{clas}[f, b] := \mathrm{Prob}[y\hat{y}(x) < 0]$. Assuming (9), we have
$R_{clas}[f, b] \leq R_{exp}[f, b]$. Combining this observation with (8) we obtain an extension of
Vapnik's result [14, Theorem 10.5]:
Corollary 2.8. If condition (9) holds then
\[
E[R_{clas}[f^{(l-1)}, b^{(l-1)}]] \leq \frac{E[\#\{i;\ \alpha_i^{(l)} \neq 0\}]}{l} = E[R_{emp}[f^{(l)}, b^{(l)}]]. \tag{10}
\]
Note that Vapnik's original result consists of an inequality analogous to the inequality
in (10) for the specific case of classification by optimal hyperplanes (hard
margin support vector machines).
3 Brief Discussion of Results 
Essentiality of assumptions. For every formal result in this paper and any of the standing
assumptions there exists an example of a minimizer of (1) which violates the conclusions of
the result. In this sense all those assumptions are essential.
Linear combinations of admissible cost functions. Any weighted sum of cost functions
satisfying our Standing Assumption 3 will satisfy this assumption as well, hence our
formalism will apply to it. An illustrative example is the following cost function for
classification: $c(x, y, \xi) = \sum_{j=1}^{q} \mu_j (\max(0, y\xi - \varepsilon_j))^{p_j}$, where $\mu_j > 0$,
$\varepsilon_j \geq 0$ and $p_j > 1$ are constants and $y \in Y = \{\pm 1\}$.
Non-differentiable cost functions. Our formal results can be extended with minor
modifications to the case of typical non-differentiable linear cost functions such as
$c = (y\xi)_+ = \max(0, y\xi)$ for SVM classification, $c = (|\xi| - \varepsilon)_+$ for SVM regression, and
to classification with hard margin SVMs (optimal hyperplanes). Details are beyond
the scope of this paper. Note that the above linear cost functions can be uniformly
approximated by differentiable cost functions, e.g. by the Huber cost function [11, 14], to which
our formalism applies. This implies that our formalism "applies approximately" to the linear
loss case, and some partial extension of it can be obtained directly using limit
arguments. However, using a direct algebraic approach based on an evaluation of Kuhn-Tucker
conditions one can come to stronger conclusions. Details will be presented elsewhere.
Theory of generalization. Equality of expectations of empirical and expected risk
provided by Theorem 2.6 implies that minimizers of regularized risk (1) are on average
consistent. We should emphasize that this result holds for small training samples, of size
$l$ smaller than the VC dimension of the function class, which is $\dim(\mathcal{H}) + 1$ in our case. This
should be contrasted with uniform convergence bounds [2, 13, 14], which are vacuous
unless $l \gg$ VC dimension.
Significance of approximate solutions for RNs. Corollary 2.7 shows that sparsity of 
solutions is practically not achievable for optimal RN solutions since they use sensitive 
cost functions. This emphasizes the significance of research into approximately optimal 
solution algorithms in such a case, cf. [12]. 
Application to selection of the regularization constant. The bound provided by
Corollary 2.8 and the equivalence given by Theorem 2.6 can be used as a justification of the
heuristic that the optimal value of the regularization constant $\lambda$ is the one which minimizes
the number of margin errors (cf. [14]). This is especially appealing in the case of regression
with $\varepsilon$-insensitive cost, where the margin error has a straightforward interpretation as a
sample being outside of the $\varepsilon$-tube.
Application to modelling of additive noise. Let us suppose that the data is iid drawn from
a distribution of the form $y = f(x) + \varepsilon_{noise}$, where $\varepsilon_{noise}$ is a random noise independent
of $x$, with zero mean. Theorem 2.6 implies the following heuristic for approximation of the
noise distribution in the regression model $y = f(x) + \varepsilon_{noise}$:
\[
\mathrm{Prob}[|\varepsilon_{noise}| > \varepsilon] \approx \frac{\#\{i;\ \alpha_i^{(l)} \neq 0\}}{l}.
\]
Here $(f^{(l)}, b^{(l)})$ is a minimizer of the regularized risk (1) with an $\varepsilon$-insensitive cost function,
i.e. such that $c(x, y, \xi) > 0$ iff $|\xi| > \varepsilon$.
Acknowledgement. The permission of the Chief Technology Officer, Telstra, to publish 
this paper, is gratefully acknowledged. 
References 
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathe- 
matical Society, 68:337 - 404, 1950. 
[2] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector
machines and other pattern classifiers. In B. Schölkopf et al., eds., Advances in Kernel
Methods, pages 43-54, MIT Press, 1998.
[3] C. Burges and D. J. Crisp. Uniqueness of the SVM solution. In S. Solla et al., eds.,
Adv. in Neural Info. Proc. Sys. 12, pages 144-152, MIT Press, 2000.
[4] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related
estimators. Ann. Statist., 18:1676-1695, 1990.
[5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks
architectures. Neural Computation, 7(2):219-269, 1995.
[6] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proc. Seventh
Work. on AI and Stat., San Francisco, 1999. Morgan Kaufmann.
[7] T. Joachims. Estimating the Generalization Performance of an SVM Efficiently. In
Proc. of the International Conference on Machine Learning, 2000. Morgan Kaufmann.
[8] G. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation of
stochastic processes and smoothing by splines. Ann. Math. Statist., 41:495-502, 1970.
[9] A. Lunts and V. Brailovsky. Evaluation of attributes obtained in statistical decision
rules. Engineering Cybernetics, 3:98-109, 1967.
[10] M. Opper and O. Winther. Gaussian process classification and SVM: Mean field
results and leave-one-out estimator. In P. Bartlett et al., eds., Advances in Large
Margin Classifiers, pages 301-316, MIT Press, 2000.
[11] A. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and
Computing, 1998. In press.
[12] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine
learning. Typescript, March 2000.
[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York,
1995.
[14] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[15] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to
linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in
Graphical Models. Kluwer, 1998.
