From Margin To Sparsity 
Thore Graepel, Ralf Herbrich 
Computer Science Department 
Technical University of Berlin 
Berlin, Germany 
{guru, ralfh}@cs.tu-berlin.de
Robert C. Williamson 
Department of Engineering 
Australian National University 
Canberra, Australia 
Bob. Williamsonanu. edu. au 
Abstract 
We present an improvement of Novikoff's perceptron convergence 
theorem. Reinterpreting this mistake bound as a margin dependent 
sparsity guarantee allows us to give a PAC-style generalisation er- 
ror bound for the classifier learned by the perceptron learning algo- 
rithm. The bound value crucially depends on the margin a support 
vector machine would achieve on the same data set using the same 
kernel. Ironically, the bound yields better guarantees than are cur- 
rently available for the support vector solution itself. 
1 Introduction
In the last few years there has been a large controversy about the significance 
of the attained margin, i.e. the smallest real-valued output of a classifier before
thresholding, as an indicator of generalisation performance. Results in the VC, PAC 
and luckiness frameworks seem to indicate that a large margin is a pre-requisite 
for small generalisation error bounds (see [14, 12]). These results caused many 
researchers to focus on large margin methods such as the well known support vector 
machine (SVM). On the other hand, the notion of sparsity is deemed important for 
generalisation as can be seen from the popularity of Occam's razor like arguments 
as well as compression considerations (see [8]). 
In this paper we reconcile the two notions by reinterpreting an improved version of 
Novikoff's well known perceptron convergence theorem as a sparsity guarantee in 
dual space: the existence of large margin classifiers implies the existence of sparse 
consistent classifiers in dual space. Even better, this solution is easily found by 
the perceptron algorithm. By combining the perceptron mistake bound with a 
compression bound that originated from the work of Littlestone and Warmuth [8] 
we are able to provide a PAC like generalisation error bound for the classifier found 
by the perceptron algorithm whose size is determined by the magnitude of the 
maximally achievable margin on the dataset. 
The paper is structured as follows: after introducing the perceptron in dual variables 
in Section 2 we improve on Novikoff's perceptron convergence bound in Section 3. 
Our main result is presented in the subsequent section and its consequences for the
theoretical foundation of SVMs are discussed in Section 5. 
2 (Dual) Kernel Perceptrons 
We consider learning given m objects X = {x_1, ..., x_m} ∈ 𝒳^m and a set Y =
{y_1, ..., y_m} ∈ 𝒴^m drawn iid from a fixed distribution P_Z = P_XY over the space
𝒳 × {-1, +1} = 𝒵 of input-output pairs. Our hypotheses are linear classifiers
x ↦ sign(⟨w, φ(x)⟩) in some fixed feature space 𝒦, where we assume that the
mapping φ : 𝒳 → 𝒦 is chosen a priori¹. Given the features φ : 𝒳 → 𝒦 the classical
(primal) perceptron algorithm aims at finding a weight vector w ∈ 𝒦 consistent
with the training data. Recently, Vapnik [14] and others -- in their work on SVMs
-- have rediscovered that it may be advantageous to learn in the dual representation
(see [1]), i.e. to expand the weight vector in terms of the training data,

    w_α = Σ_{i=1}^m α_i φ(x_i) = Σ_{i=1}^m α_i x_i ,                         (1)

and to learn the m expansion coefficients α ∈ ℝ^m rather than the components of
w ∈ 𝒦. This is particularly useful if the dimensionality n = dim(𝒦) of the feature
space 𝒦 is much greater than the number m of training points (or possibly infinite).
The dual representation can be used for a rather wide class of learning algorithms
(see [15]) -- in particular if all we need for learning is the real-valued output ⟨w, x_i⟩
of the classifier at the m training points x_1, ..., x_m. Thus it suffices to choose a
symmetric function k : 𝒳 × 𝒳 → ℝ, called a kernel, and to ensure that there exists
a mapping φ : 𝒳 → 𝒦 such that

    ∀x, x' ∈ 𝒳 :  k(x, x') = ⟨φ(x), φ(x')⟩ .                                 (2)

A sufficient condition is given by Mercer's theorem.
Theorem 1 (Mercer Kernel [9, 7]). Any symmetric function k ∈ L_∞(𝒳 × 𝒳)
that is positive semidefinite, i.e.

    ∀f ∈ L_2(𝒳) :  ∫_𝒳 ∫_𝒳 k(x, x') f(x) f(x') dx dx' ≥ 0 ,                 (3)

is called a Mercer kernel and has the following property: if ψ_i ∈ L_2(𝒳) solve the
eigenvalue problem ∫_𝒳 k(x, x') ψ_i(x') dx' = λ_i ψ_i(x) with ∫_𝒳 ψ_i²(x) dx = 1 and
∀i ≠ j : ∫_𝒳 ψ_i(x) ψ_j(x) dx = 0, then k can be expanded in a uniformly convergent
series, i.e.

    k(x, x') = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x') .
In order to see that a Mercer kernel fulfils equation (2) consider the mapping
φ(x) = (√λ_1 ψ_1(x), √λ_2 ψ_2(x), ...), whose existence is ensured by Theorem 1.
Finally, the perceptron learning algorithm we are going to consider is described in
the following definition.
Definition 1 (Perceptron Learning). The perceptron learning procedure with
a fixed learning rate η ∈ ℝ⁺ is as follows:

1. Start in step zero, i.e. t = 0, with the vector α_0 = 0.

2. If there exists an index i ∈ {1, ..., m} such that y_i ⟨w_{α_t}, x_i⟩ ≤ 0 then

    (α_{t+1})_i = (α_t)_i + η y_i   ⟺   w_{α_{t+1}} = w_{α_t} + η y_i x_i ,   (4)

and t ← t + 1.

3. Stop, if there is no i ∈ {1, ..., m} such that y_i ⟨w_{α_t}, x_i⟩ ≤ 0.

¹Sometimes, we abbreviate φ(x) by x, always assuming φ is fixed.
Other variants of this algorithm have been presented elsewhere (see [2, 3]). 
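Since the procedure of Definition 1 touches the data only through inner products, it can be run entirely from a Gram matrix. A minimal Python sketch of the dual perceptron with η = 1 (all function and variable names are ours):

```python
import numpy as np

def kernel_perceptron(K, y, eta=1.0, max_epochs=1000):
    """Dual perceptron of Definition 1.

    K : (m, m) Gram matrix, K[i, j] = k(x_i, x_j)
    y : (m,) labels in {-1, +1}
    Returns the coefficient vector alpha with w = sum_i alpha_i phi(x_i).
    """
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(y)):
            # real-valued output <w_alpha, phi(x_i)> = sum_j alpha_j k(x_j, x_i)
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += eta * y[i]      # update rule (4) in dual form
                mistakes += 1
        if mistakes == 0:                   # step 3: consistent on Z, stop
            break
    return alpha

# Usage on a separable toy set with the linear kernel k(x, x') = <x, x'>
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = kernel_perceptron(K, y)
assert np.all(y * (K @ alpha) > 0)          # consistent classifier
```

Note that each update touches a single coefficient, so the number of non-zero entries of alpha is at most the number of mistakes made.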
3 An Improvement of Novikoff's Theorem 
In the early 60's Novikoff [10] was able to give an upper bound on the number 
of mistakes made by the classical perceptron learning procedure. Two years later, 
this bound was generalised to feature spaces using Mercer kernels by Aizerman et 
al. [1]. The quantity determining the upper bound is the maximally achievable
unnormalised margin max_{α ∈ ℝ^m} γ_Z(α) normalised by the total extent R(X) of the
data in feature space, i.e. R(X) = max_{x_i ∈ X} ‖x_i‖.

Definition 2 (Unnormalised Margin). Given a training set Z = (X, Y) and a
vector α ∈ ℝ^m the unnormalised margin γ_Z(α) is given by

    γ_Z(α) = min_{(x_i, y_i) ∈ Z} y_i ⟨w_α, x_i⟩ / ‖w_α‖ .
Theorem 2 (Novikoff's Perceptron Convergence Theorem [10, 1]). Let Z =
(X, Y) be a training set of size m. Suppose that there exists a vector α* ∈ ℝ^m such
that γ_Z(α*) > 0. Then the number of mistakes made by the perceptron algorithm
in Definition 1 on Z is at most

    ( R(X) / γ_Z(α*) )² .
Surprisingly, this bound is highly influenced by the data point x_i ∈ X with the
largest norm ‖x_i‖, although rescaling a data point would not change its classifica-
tion. Let us consider rescaling the training set X before applying the perceptron
algorithm. Then for the normalised training set we would have R(X_norm) = 1 and
γ_Z(α) would change into the normalised margin Γ_Z(α) first advocated in [6].
Definition 3 (Normalised Margin). Given a training set Z = (X, Y) and a
vector α ∈ ℝ^m the normalised margin Γ_Z(α) is given by

    Γ_Z(α) = min_{(x_i, y_i) ∈ Z} y_i ⟨w_α, x_i⟩ / ( ‖w_α‖ ‖x_i‖ ) .

By definition, for all x_i ∈ X we have R(X) ≥ ‖x_i‖. Hence, for any α ∈ ℝ^m and
all (x_i, y_i) ∈ Z such that y_i ⟨w_α, x_i⟩ > 0,

    y_i ⟨w_α, x_i⟩ / ( ‖w_α‖ ‖x_i‖ )  ≥  y_i ⟨w_α, x_i⟩ / ( ‖w_α‖ R(X) ) ,

which immediately implies for all Z = (X, Y) ∈ 𝒵^m such that γ_Z(α) > 0

    Γ_Z(α) ≥ γ_Z(α) / R(X) .                                                 (5)
Thus when normalising the data in feature space, i.e.

    k_norm(x, x') = k(x, x') / √( k(x, x) k(x', x') ) ,

the upper bound on the number of steps until convergence of the classical perceptron
learning procedure of Rosenblatt [11] provably decreases: with the normalised kernel
it is given by Γ_Z^{-2}(α*), which by (5) is at most the original bound (R(X)/γ_Z(α*))².
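Both margins and the normalised kernel are straightforward to compute from a Gram matrix; the sketch below (our illustration, names ours) evaluates the quantities of Definitions 2 and 3, checks inequality (5), and verifies that k_norm places every point on the unit sphere:

```python
import numpy as np

def normalise_gram(K):
    """k_norm(x, x') = k(x, x') / sqrt(k(x, x) k(x', x'))."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def margins(K, y, alpha):
    """Unnormalised margin gamma_Z(alpha) and normalised margin Gamma_Z(alpha)."""
    out = y * (K @ alpha)                  # y_i <w_alpha, x_i>
    w_norm = np.sqrt(alpha @ K @ alpha)    # ||w_alpha||
    x_norms = np.sqrt(np.diag(K))          # ||x_i||
    return out.min() / w_norm, (out / x_norms).min() / w_norm

# Toy check of inequality (5): Gamma_Z(alpha) >= gamma_Z(alpha) / R(X)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = y / len(y)                         # some consistent dual vector
gamma, Gamma = margins(K, y, alpha)
R = np.sqrt(np.diag(K)).max()              # R(X)
assert Gamma >= gamma / R - 1e-12
# After normalisation every point lies on the unit sphere: R(X_norm) = 1
assert np.allclose(np.diag(normalise_gram(K)), 1.0)
```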
Considering the form of the update rule (4) we observe that this result not only
bounds the number of mistakes made during learning but also the number ‖α‖_0
of non-zero coefficients in the α vector. To be precise, for η = 1 it bounds the
ℓ_1 norm ‖α‖_1 of the coefficient vector α which, in turn, bounds the zero norm ‖α‖_0
from above for all vectors with integer components. Theorem 2 thus establishes a
relation between the existence of a large margin classifier w* and the sparseness of
any solution found by the perceptron algorithm.
4 Main Result 
In order to exploit the guaranteed sparseness of the solution of a kernel perceptron 
we make use of the following lemma to be found in [8, 4]. 
Lemma 1 (Compression Lemma). Fix d ∈ {1, ..., m}. For any measure P_Z,
the probability that m examples Z drawn iid according to P_Z will yield a classifier
α(Z) learned by the perceptron algorithm with ‖α(Z)‖_0 = d whose generalisation
error P_XY[ Y ⟨w_{α(Z)}, φ(X)⟩ ≤ 0 ] is greater than ε is at most

    (m choose d) (1 - ε)^{m-d} .                                             (6)

Proof. Since we restrict the solution α(Z) with generalisation error greater than
ε to use only d points Z_d ⊆ Z but still to be consistent with the remaining set
Z \ Z_d, this probability is at most (1 - ε)^{m-d} for a fixed subset Z_d. The result
follows by the union bound over all (m choose d) subsets Z_d. Intuitively, the
consistency on the m - d unused training points witnesses the small generalisation
error with high probability.                                                 □
If we set (6) equal to δ/m and solve for ε we have that with probability at most δ/m
over the random draw of the training set Z the perceptron learning algorithm finds
a vector α such that ‖α‖_0 = d and whose generalisation error is greater than

    ε(m, d) = 1/(m - d) ( ln (m choose d) + ln(m) + ln(1/δ) ) .

Thus, by the union bound over the m possible values of d, if the perceptron
algorithm converges, the probability that the generalisation error of its solution is
greater than ε(m, ‖α‖_0) is at most δ. We have shown the following sparsity bound,
also to be found in [4].
Theorem 3 (Generalisation Error Bound for Perceptrons). For any measure
P_Z, with probability at least 1 - δ over the random draw of the training set Z of
size m, if the perceptron learning algorithm converges to the vector α of coefficients
then its generalisation error P_XY[ Y ⟨w_{α(Z)}, φ(X)⟩ ≤ 0 ] is less than

    1/(m - ‖α‖_0) ( ln (m choose ‖α‖_0) + ln(m) + ln(1/δ) ) .                (7)
This theorem in itself constitutes a powerful result and can easily be adapted to
hold for a large class of learning algorithms including SVMs [4]. This bound often
outperforms margin bounds for practically relevant training set sizes, e.g. m <
100,000. Combining Theorem 2 and Theorem 3 thus gives our main result.
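The bound (7) is cheap to evaluate numerically via ln (m choose d) = lgamma(m+1) - lgamma(d+1) - lgamma(m-d+1); as a sanity check, the sketch below (our illustration, names ours) reproduces the bound values of Table 1 for the digit "0":

```python
from math import lgamma, log

def compression_bound(m, d, delta=0.05):
    """Evaluate bound (7): valid once the perceptron has converged
    with d = ||alpha||_0 non-zero coefficients out of m examples."""
    ln_binom = lgamma(m + 1) - lgamma(d + 1) - lgamma(m - d + 1)
    return (ln_binom + log(m) + log(1.0 / delta)) / (m - d)

# Digit "0" column of Table 1 (m = 60000, delta = 0.05):
print(round(100 * compression_bound(60000, 740), 1))   # perceptron -> 6.7
print(round(100 * compression_bound(60000, 1379), 1))  # SVM        -> 11.2
```

The sparser perceptron solution (740 versus 1379 non-zero coefficients) directly yields the smaller bound value.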
Theorem 4 (Margin Bound). For any measure P_Z, with probability at least
1 - δ over the random draw of the training set Z of size m, if there exists a
vector α* such that

    n* = ⌈ Γ_Z^{-2}(α*) ⌉ ≤ m ,

then the generalisation error P_XY[ Y ⟨w_{α(Z)}, φ(X)⟩ ≤ 0 ] of the classifier α
found by the perceptron algorithm is less than

    1/(m - n*) ( ln (m choose n*) + ln(m) + ln(1/δ) ) .                      (8)
The most intriguing feature of this result is that the mere existence of a large
margin classifier α* is sufficient to guarantee a small generalisation error for the
solution α of the perceptron although its attained margin γ_Z(α) is likely to be
much smaller than γ_Z(α*). It has long been argued that the attained margin γ_Z(α)
itself is the crucial quantity controlling the generalisation error of α. In light of
our new result, if there exists a consistent classifier α* with large margin we know
that there also exists at least one classifier α with high sparsity that can efficiently
be found using the perceptron algorithm. In fact, whenever the SVM appears to
be theoretically justified by a large observed margin, every solution found by the
perceptron algorithm has a small guaranteed generalisation error -- mostly even
smaller than current bounds on the generalisation error of SVMs. Note that for
a given training sample Z it is not unlikely that by permutation of Z there exist
O((m choose n*)) many different consistent sparse classifiers.
5 Impact on the Foundations of Support Vector Machines 
Support vector machines owe their popularity mainly to their theoretical justifica-
tion in learning theory. In particular, two arguments have been put forward to
single out the solutions found by SVMs [14, p. 139]:

    SVMs (optimal hyperplanes) can generalise because
    1. the expectation of the data compression is large;
    2. the expectation of the margin is large.
The second reason is often justified by margin results (see [14, 12]) which bound the
generalisation error of a classifier α in terms of its own attained margin γ_Z(α). If
we require the slightly stronger condition that n* ≤ m/2, then our bound (8) for
solutions of perceptron learning can be upper bounded by

    2/m ( n* ln(2em/n*) + ln(2m) + ln(1/δ) ) ,

which has to be compared with the PAC margin bound (see [12, 5])

    2/m ( 64 n* log_2(em/(16 n*)) log_2(32m) + log_2(2m) + log_2(1/δ) ) .

Despite the fact that the former result also holds true for the margin Γ_Z(α*) (which
could loosely be bounded using (5)),

- the PAC margin bound's decay (as a function of m) is slower by a log_2(32m)
  factor;
digit             0      1      2      3      4      5      6      7      8      9
perceptron
  error (%)     0.2    0.2    0.4    0.4    0.4    0.4    0.4    0.5    0.6    0.7
  ‖α‖_0         740    643   1168   1512   1078   1277    823   1103   1856   1920
  mistakes      844    843   1345   1811   1222   1497    960   1323   2326   2367
  bound (%)     6.7    6.0    9.8   12.0    9.2   10.5    7.4    9.4   14.3   14.6
SVM
  error (%)     0.2    0.1    0.4    0.4    0.4    0.5    0.3    0.4    0.5    0.6
  ‖α‖_0        1379    989   1958   1900   1224   2024   1527   2064   2332   2765
  bound (%)    11.2    8.6   14.9   14.5   10.2   15.3   12.2   15.5   17.1   19.6

Table 1: Results of kernel perceptrons and SVMs on NIST (taken from [2, Table
3]). The kernel used was k(x, x') = (⟨x, x'⟩ + 1)^4 and m = 60,000. For both
algorithms we give the measured generalisation error (in %), the attained sparsity
‖α‖_0 and the bound value (in %, δ = 0.05) of (7).
- for any m and almost any δ, the margin bound given in Theorem 4 guaran-
  tees a smaller generalisation error.

For example, using the empirical value n* ≈ 600 (see [14, p. 153]) for the NIST
handwritten digit recognition task and inserting this value into the PAC margin
bound, one would need the astronomically large number of m > 410,743,386
training examples to obtain a bound value of 0.112 as obtained by (7) for the
digit "0" (see Table 1).
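Taking the two bound expressions exactly as transcribed above (our transcription; all function names are ours), the gap is easy to reproduce numerically for the empirical value n* ≈ 600:

```python
from math import lgamma, log, log2, e

def margin_bound(m, n_star, delta=0.05):
    """Theorem 4's bound (8), with ln (m choose n*) via lgamma."""
    ln_binom = lgamma(m + 1) - lgamma(n_star + 1) - lgamma(m - n_star + 1)
    return (ln_binom + log(m) + log(1.0 / delta)) / (m - n_star)

def pac_margin_bound(m, n_star, delta=0.05):
    """The PAC margin bound of [12, 5] as given in the text."""
    return (2.0 / m) * (64 * n_star * log2(e * m / (16 * n_star)) * log2(32 * m)
                        + log2(2 * m) + log2(1.0 / delta))

# Theorem 4's guarantee is smaller over a wide range of sample sizes
for m in (10**4, 10**5, 10**6, 10**7, 10**8):
    assert margin_bound(m, 600) < pac_margin_bound(m, 600)
```

At m = 60,000 and n* = 600 the former evaluates to roughly 0.06 while the latter is still far above one.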
With regard to the first reason, it has been confirmed experimentally that SVMs find
solutions which are sparse in the expansion coefficients α. However, there cannot
exist any distribution-free guarantee that the number of support vectors will in fact
be small². In contrast, Theorem 2 gives an explicit bound on the sparsity in terms
of the achievable margin γ_Z(α*). Furthermore, experimental results on the NIST
datasets show that the sparsity of the solution found by the perceptron algorithm is
consistently (and often by a factor of two) greater than that of the SVM solution
(see [2, Table 3] and Table 1).
6 Conclusion 
We have shown that the generalisation error of a very simple and efficient learning 
algorithm for linear classifiers -- the perceptron algorithm -- can be bounded by 
a quantity involving the margin of the classifier the SVM would have found on the 
same training data using the same kernel. This result implies that the SVM solution 
is not at all singled out as being superior in terms of provable generalisation error. 
Also, the result indicates that sparsity of the solution may be a more fundamental 
property than the size of the attained margin (since a large value of the latter 
implies a large value of the former). 
Our analysis raises an interesting question: having chosen a good kernel, correspond-
ing to a metric in which inter-class distances are great and intra-class distances are
short, to what extent does it matter which consistent classifier we use? Experimental
²Consider a distribution P_X on two parallel lines with support in the unit ball. Suppose
that their mutual distance is √2. Then the number of support vectors equals the training
set size whereas the perceptron algorithm never uses more than two points by Theorem 2.
One could argue that it is the number of essential support vectors [13] that characterises 
the data compression of an SVM (which would also have been two in our example). Their 
determination, however, involves a combinatorial optimisation problem and can thus never 
be performed in practical applications. 
results seem to indicate that a vast variety of heuristics for finding consistent clas- 
sifiers, e.g. kernel Fisher discriminant, linear programming machines, Bayes point 
machines, kernel PCA & linear SVM, sparse greedy matrix approximation perform 
comparably (see http://www.kernel-machines.org/).
Acknowledgements 
This work was done while TG and RH were visiting the ANU Canberra. They 
would like to thank Peter Bartlett and Jon Baxter for many interesting discussions. 
Furthermore, we would like to thank the anonymous reviewer, Olivier Bousquet and 
Matthias Seeger for very useful remarks on the paper. 
References 
[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the po- 
tential function method in pattern recognition learning. Automation and Remote 
Control, 25:821-837, 1964. 
[2] Y. Freund and R. E. Schapire. Large margin classification using the perceptron
algorithm. Machine Learning, 37(3):277-296, 1999.
[3] T. Friess, N. Cristianini, and C. Campbell. The Kernel-Adatron: A fast and sim- 
ple learning procedure for Support Vector Machines. In Proceedings of the 15-th 
International Conference in Machine Learning, pages 188-196, 1998. 
[4] T. Graepel, R. Herbrich, and J. Shawe-Taylor. Generalisation error bounds for sparse 
linear classifiers. In Proceedings of the Thirteenth Annual Conference on Computa- 
tional Learning Theory, pages 298-303, 2000. in press. 
[5] R. Herbrich. Learning Linear Classifiers - Theory and Algorithms. PhD thesis, Tech-
nische Universität Berlin, 2000. Accepted for publication by MIT Press.
[6] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: 
Why SVMs work. In Advances in Neural Information System Processing 13, 2001. 
[7] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.
[8] N. Littlestone and M. Warmuth. Relating data compression and learnability. Tech- 
nical report, University of California Santa Cruz, 1986. 
[9] J. Mercer. Functions of positive and negative type and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society of London
(A), 209:415-446, 1909.
[10] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium
on Mathematical Theory of Automata, pages 24-26, Polytechnic Institute of Brooklyn,
1962.
[11] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain
Mechanisms. Spartan Books, Washington D.C., 1962.
[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk 
minimization over data-dependent hierarchies. IEEE Transactions on Information 
Theory, 44(5):1926-1940, 1998. 
[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998. 
[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 1999. 
[15] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the ran- 
domized GACV. Technical report, Department of Statistics, University of Wisconsin, 
Madison, 1997. TR-NO-984. 
