A PAC-Bayesian Margin Bound for Linear 
Classifiers: Why SVMs work 
Ralf Herbrich 
Statistics Research Group 
Computer Science Department 
Technical University of Berlin 
ralfh cs. tu-berlin. de 
Thore Graepel 
Statistics Research Group 
Computer Science Department 
Technical University of Berlin 
guru cs. tu- berlin. de 
Abstract 
We present a bound on the generalisation error of linear classifiers 
in terms of a refined margin quantity on the training set. The 
result is obtained in a PAC-Bayesian framework and is based on 
geometrical arguments in the space of linear classifiers. The new 
bound constitutes an exponential improvement of the so far tightest 
margin bound by Shawe-Taylor et al. [8] and scales logarithmically 
in the inverse margin. Even in the case of fewer training examples 
than input dimensions, sufficiently large margins lead to non-trivial 
bound values and -- for maximum margins -- to a vanishing complexity 
term. Furthermore, the classical margin is too coarse a 
measure for the essential quantity that controls the generalisation 
error: the volume ratio between the whole hypothesis space and 
the subset of consistent hypotheses. The practical relevance of the 
result lies in the fact that the well-known support vector machine 
is optimal w.r.t. the new bound only if the feature vectors are all of 
the same length. As a consequence we recommend using SVMs on 
normalised feature vectors only -- a recommendation that is well 
supported by our numerical experiments on two benchmark data 
sets. 
1 Introduction 
Linear classifiers are exceedingly popular in the machine learning community due 
to their straightforward applicability and high flexibility, which has recently been 
boosted by the so-called kernel methods [13]. A natural and popular framework 
for the theoretical analysis of classifiers is the PAC (probably approximately 
correct) framework [11], which is closely related to Vapnik's work on the generalisation 
error [12]. For binary classifiers it turned out that the growth function is an 
appropriate measure of "complexity" and can be tightly upper bounded by the VC 
(Vapnik-Chervonenkis) dimension [14]. Later, structural risk minimisation [12] was 
suggested for directly minimising the VC dimension based on a training set and an 
a priori structuring of the hypothesis space. 
In practice, e.g. in the case of linear classifiers, often a thresholded real-valued 
function is used for classification. In 1993, Kearns and Schapire [4] demonstrated that 
considerably tighter bounds can be obtained by considering a scale-sensitive complexity 
measure known as the fat shattering dimension. Further results [1] provided bounds on the 
growth function similar to those proved by Vapnik and others [14, 6]. The popularity 
of the theory was boosted by the invention of the support vector machine (SVM) 
[13], which aims at directly minimising the complexity as suggested by theory. 
Until recently, however, the success of the SVM remained somewhat obscure because 
in PAC/VC theory the structuring of the hypothesis space must be independent of 
the training data -- in contrast to the data-dependence of the canonical hyperplane. 
As a consequence, Shawe-Taylor et al. [8] developed the luckiness framework, where 
luckiness refers to a complexity measure that is a function of both hypothesis and 
training sample. 
Recently, David McAllester presented some PAC-Bayesian theorems [5] that bound 
the generalisation error of Bayesian classifiers independently of the correctness of the 
prior and regardless of the underlying data distribution -- thus fulfilling the basic 
desiderata of PAC theory. In [3] McAllester's bounds on the Gibbs classifier were 
extended to the Bayes (optimal) classifier. The PAC-Bayesian framework provides 
a posteriori bounds and is thus closely related in spirit to the luckiness framework (see footnote 1). 
In this paper we give a tight margin bound for linear classifiers in the PAC-Bayesian 
framework. The main idea is to identify the generalisation error of the classifier h of 
interest with that of the Bayes (optimal) classifier of a (point-symmetric) subset Q 
that is summarised by h. We show that for a uniform prior the normalised margin 
of h is directly related to the volume of a large subset Q summarised by h. In 
particular, the result suggests that a learning algorithm for linear classifiers should 
aim at maximising the normalised margin instead of the classical margin. In Sections 
2 and 3 we review the basic PAC-Bayesian theorem and show how it can be applied 
to single classifiers. In Section 4 we give our main result and outline its proof. In 
Section 5 we discuss the consequences of the new result for the application of SVMs 
and demonstrate experimentally that in fact a normalisation of the feature vectors 
leads to considerably superior generalisation performance. 
We denote n-tuples by italic bold letters (e.g. $\boldsymbol{x} = (x_1, \ldots, x_n)$), vectors by roman 
bold letters (e.g. $\mathbf{x}$), random variables by sans serif font (e.g. $\mathsf{X}$) and vector spaces 
by calligraphic capitalised letters (e.g. $\mathcal{X}$). The symbols $\mathbf{P}$, $\mathbf{E}$, $\mathbf{I}$ and $\ell_2^n$ denote a 
probability measure, the expectation of a random variable, the indicator function 
and the normed space (2-norm) of sequences of length n, respectively. 
2 A PAC Margin Bound 
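We consider learning in the PAC framework. Let $\mathcal{X}$ be the input space, and let 
$\mathcal{Y} = \{-1, +1\}$. Let a labelled training sample $z = (\boldsymbol{x}, \boldsymbol{y}) \in (\mathcal{X} \times \mathcal{Y})^m = \mathcal{Z}^m$ be drawn 
iid according to some unknown probability measure $\mathbf{P}_{\mathsf{Z}} = \mathbf{P}_{\mathsf{Y}|\mathsf{X}} \mathbf{P}_{\mathsf{X}}$. Furthermore, for 
a given hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ we assume the existence of a "true" hypothesis 
$h^* \in \mathcal{H}$ that labelled the data, 

$$\mathbf{P}_{\mathsf{Y}|\mathsf{X}=x}(y) = \mathbf{I}_{y = h^*(x)} . \tag{1}$$

We consider linear hypotheses 

$$\mathcal{H} = \{ h_{\mathbf{w}} : x \mapsto \mathrm{sign}(\langle \mathbf{w}, \phi(x) \rangle_{\mathcal{K}}) \mid \mathbf{w} \in \mathcal{W} \}, \qquad \mathcal{W} = \{ \mathbf{w} \in \mathcal{K} \mid \|\mathbf{w}\|_{\mathcal{K}} = 1 \}, \tag{2}$$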
[Footnote 1] In fact, even Shawe-Taylor et al. concede that "... a Bayesian might say that luckiness 
is just a complicated way of encoding a prior. The sole justification for our particular way 
of encoding is that it allows us to get the PAC like results we sought..." [9, p. 4]. 
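where the mapping $\phi : \mathcal{X} \to \mathcal{K} \subseteq \ell_2^n$ maps the input data to some feature space 
$\mathcal{K}$ (see footnote 2) and $\|\mathbf{w}\|_{\mathcal{K}} = 1$ leads to a one-to-one correspondence of hypotheses $h_{\mathbf{w}}$ to their 
parameters $\mathbf{w}$. From the existence of $h^*$ we know that there exists a version space 
$V(z) \subseteq \mathcal{W}$, 

$$V(z) = \{ \mathbf{w} \in \mathcal{W} \mid \forall (x, y) \in z : h_{\mathbf{w}}(x) = y \} .$$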
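Our analysis aims at bounding the true risk $R[\mathbf{w}]$ of consistent hypotheses $h_{\mathbf{w}}$, 

$$R[\mathbf{w}] = \mathbf{P}_{\mathsf{XY}}(h_{\mathbf{w}}(\mathsf{X}) \neq \mathsf{Y}) .$$

Since all classifiers $\mathbf{w} \in V(z)$ are indistinguishable in terms of the number of errors 
committed on the given training set $z$, let us introduce the concept of the margin 
$\gamma_z(\mathbf{w})$ of a classifier $\mathbf{w}$, i.e. 

$$\gamma_z(\mathbf{w}) = \min_{(x_i, y_i) \in z} \frac{y_i \langle \mathbf{w}, \mathbf{x}_i \rangle_{\mathcal{K}}}{\|\mathbf{w}\|_{\mathcal{K}}} . \tag{3}$$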
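The following theorem due to Shawe-Taylor et al. [8] bounds the generalisation 
error $R[\mathbf{w}]$ of all classifiers $\mathbf{w} \in V(z)$ in terms of the margin $\gamma_z(\mathbf{w})$. 

Theorem 1 (PAC margin bound). For all probability measures $\mathbf{P}_{\mathsf{Z}}$ such that 
$\mathbf{P}_{\mathsf{X}}(\|\phi(\mathsf{X})\|_{\mathcal{K}} \le \varsigma) = 1$, for any $\delta > 0$, with probability at least $1 - \delta$ over the random 
draw of the training set $z$, if we succeed in correctly classifying $m$ samples $z$ 
with a linear classifier $\mathbf{w}$ achieving a positive margin $\gamma_z(\mathbf{w}) > \sqrt{32 \varsigma^2 / m}$, then the 
generalisation error $R[\mathbf{w}]$ of $\mathbf{w}$ is bounded from above by 

$$\frac{2}{m} \left( \kappa_z(\mathbf{w}) \log_2\!\left( \frac{8em}{\kappa_z(\mathbf{w})} \right) \log_2(32m) + \ln\!\left( \frac{8m}{\delta} \right) \right), \qquad \kappa_z(\mathbf{w}) = \left\lceil \left( \frac{8\varsigma}{\gamma_z(\mathbf{w})} \right)^2 \right\rceil . \tag{4}$$

As the bound on $R[\mathbf{w}]$ depends linearly on $\gamma_z^{-2}(\mathbf{w})$, we see that Theorem 1 provides 
a theoretical foundation of all algorithms that aim at maximising $\gamma_z(\mathbf{w})$, e.g. SVMs 
and Boosting [13, 7]. 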
3 PAC-Bayesian Analysis 
We first present a result [5] that bounds the risk of the generalised Gibbs 
classification strategy $\mathrm{Gibbs}_{W(z)}$ by the measure $\mathbf{P}_{\mathsf{W}}(W(z))$ on a consistent subset 
$W(z) \subseteq V(z)$. This average risk is then related via the Bayes-Gibbs lemma to the 
risk of the Bayes classification strategy $\mathrm{Bayes}_{W(z)}$ on $W(z)$. For a single consistent 
hypothesis $\mathbf{w} \in \mathcal{W}$ it is then necessary to identify a consistent subset $Q(\mathbf{w})$ such 
that the Bayes strategy $\mathrm{Bayes}_{Q(\mathbf{w})}$ on $Q(\mathbf{w})$ always agrees with $\mathbf{w}$. Let us define 
the Gibbs classification strategy $\mathrm{Gibbs}_{W(z)}$ w.r.t. the subset $W(z) \subseteq V(z)$ by 

$$\mathrm{Gibbs}_{W(z)}(x) = h_{\mathbf{w}}(x), \qquad \mathbf{w} \sim \mathbf{P}_{\mathsf{W} \mid \mathsf{W} \in W(z)} . \tag{5}$$

Then the following theorem [5] holds for the risk of $\mathrm{Gibbs}_{W(z)}$. 
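Theorem 2 (PAC-Bayesian bound for subsets of classifiers). For any measure 
$\mathbf{P}_{\mathsf{W}}$ and any measure $\mathbf{P}_{\mathsf{Z}}$, for any $\delta > 0$, with probability at least $1 - \delta$ over 
the random draw of the training set $z$, for all subsets $W(z) \subseteq V(z)$ such that 
$\mathbf{P}_{\mathsf{W}}(W(z)) > 0$ the generalisation error of the associated Gibbs classification 
strategy $\mathrm{Gibbs}_{W(z)}$ is bounded from above by 

$$R[\mathrm{Gibbs}_{W(z)}] \le \frac{1}{m} \left( \ln\!\left( \frac{1}{\mathbf{P}_{\mathsf{W}}(W(z))} \right) + 2 \ln(m) + \ln\!\left( \frac{1}{\delta} \right) + 1 \right) . \tag{6}$$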
[Footnote 2] For notational simplicity we sometimes abbreviate $\phi(x)$ by $\mathbf{x}$, which should not be 
confused with the sample $\boldsymbol{x}$ of training objects. 
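Now consider the Bayes classifier $\mathrm{Bayes}_{W(z)}$, 

$$\mathrm{Bayes}_{W(z)}(x) = \mathrm{sign}\left( \mathbf{E}_{\mathsf{W} \mid \mathsf{W} \in W(z)} [ h_{\mathsf{W}}(x) ] \right) ,$$

where the expectation $\mathbf{E}_{\mathsf{W} \mid \mathsf{W} \in W(z)}$ is taken over a cut-off posterior given by 
combining the PAC-likelihood (1) and the prior $\mathbf{P}_{\mathsf{W}}$. 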
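Lemma 1 (Bayes-Gibbs Lemma). For any two measures $\mathbf{P}_{\mathsf{W}}$ and $\mathbf{P}_{\mathsf{XY}}$ and any 
set $W \subseteq \mathcal{W}$, 

$$\mathbf{P}_{\mathsf{XY}}(\mathrm{Bayes}_W(\mathsf{X}) \neq \mathsf{Y}) \le 2 \cdot \mathbf{P}_{\mathsf{XY}}(\mathrm{Gibbs}_W(\mathsf{X}) \neq \mathsf{Y}) . \tag{7}$$

Proof. (Sketch) Consider only the simple PAC setting we need. At all those points 
$x \in \mathcal{X}$ at which $\mathrm{Bayes}_W$ is wrong, by definition at least half of the classifiers $\mathbf{w} \in W$ 
under consideration make a mistake as well. □ 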
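
The counting argument behind the Bayes-Gibbs lemma can be checked on a finite toy problem. The following is a minimal sketch, not part of the original analysis: the finite classifier set, the uniform weighting of classifiers and points, and the function name `gibbs_and_bayes_error` are all assumptions made for illustration.

```python
import random


def gibbs_and_bayes_error(predictions, labels):
    """Return (Gibbs error, Bayes error) for a finite set of classifiers.

    predictions[j][i] is the label in {-1, +1} that classifier j assigns to
    point i. Classifiers and points are both weighted uniformly, which is an
    assumption of this toy example.
    """
    n_clf, n_pts = len(predictions), len(labels)
    gibbs_mistakes = 0
    bayes_mistakes = 0
    for i, y in enumerate(labels):
        # Number of classifiers that err on point i (drives the Gibbs error).
        wrong = sum(1 for j in range(n_clf) if predictions[j][i] != y)
        gibbs_mistakes += wrong
        # Bayes strategy: sign of the average vote, ties broken towards +1.
        vote = sum(predictions[j][i] for j in range(n_clf))
        bayes_label = 1 if vote >= 0 else -1
        if bayes_label != y:
            bayes_mistakes += 1
    return gibbs_mistakes / (n_clf * n_pts), bayes_mistakes / n_pts


# Toy data: 15 noisy classifiers, each agreeing with the labels on about
# 70% of 200 points (hypothetical numbers, fixed seed for repeatability).
rng = random.Random(0)
labels = [rng.choice((-1, 1)) for _ in range(200)]
predictions = [[y if rng.random() < 0.7 else -y for y in labels]
               for _ in range(15)]
gibbs, bayes = gibbs_and_bayes_error(predictions, labels)
print(gibbs, bayes, bayes <= 2 * gibbs)
```

Whenever the majority vote is wrong at a point, at least half of the classifiers are wrong there, so each Bayes mistake contributes at least 1/2 to the Gibbs count; this is exactly the factor 2 in (7).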
The combination of Lemma 1 with Theorem 2 yields a bound on the risk of 
$\mathrm{Bayes}_{W(z)}$. For a single hypothesis $\mathbf{w} \in \mathcal{W}$ let us find a (Bayes-admissible) subset 
$Q(\mathbf{w})$ of version space $V(z)$ such that $\mathrm{Bayes}_{Q(\mathbf{w})}$ on $Q(\mathbf{w})$ agrees with $\mathbf{w}$ on every 
point in $\mathcal{X}$. 
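Definition 1 (Bayes-admissibility). Given the hypothesis space in (2) and a 
prior measure $\mathbf{P}_{\mathsf{W}}$ over $\mathcal{W}$ we call a subset $Q(\mathbf{w}) \subseteq \mathcal{W}$ Bayes admissible w.r.t. $\mathbf{w}$ 
and $\mathbf{P}_{\mathsf{W}}$ if and only if 

$$\forall x \in \mathcal{X} : \quad h_{\mathbf{w}}(x) = \mathrm{Bayes}_{Q(\mathbf{w})}(x) .$$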
Although Bayes-admissibility is difficult to achieve in general, the following 
geometrically plausible lemma establishes it for the case of interest. 
Lemma 2 (Bayes-admissibility for linear classifiers). For uniform measure 
$\mathbf{P}_{\mathsf{W}}$ over $\mathcal{W}$ each ball $Q(\mathbf{w}) = \{ \mathbf{v} \in \mathcal{W} \mid \|\mathbf{w} - \mathbf{v}\|_{\mathcal{K}} \le r \}$ is Bayes admissible 
w.r.t. its centre $\mathbf{w}$. 
Please note that by considering a ball $Q(\mathbf{w})$ rather than just $\mathbf{w}$ we make use of 
the fact that $\mathbf{w}$ summarises all its neighbouring classifiers $\mathbf{v} \in Q(\mathbf{w})$. Now, using 
a uniform prior $\mathbf{P}_{\mathsf{W}}$, the normalised margin 

$$\Gamma_z(\mathbf{w}) = \min_{(x_i, y_i) \in z} \frac{y_i \langle \mathbf{w}, \mathbf{x}_i \rangle_{\mathcal{K}}}{\|\mathbf{w}\|_{\mathcal{K}} \|\mathbf{x}_i\|_{\mathcal{K}}} \tag{8}$$

quantifies the relative volume of classifiers summarised by $\mathbf{w}$ and thus allows us 
to bound its risk. Note that in contrast to the classical margin $\gamma_z$ (see (3)) this 
normalised margin is a dimensionless quantity and constitutes a measure for the 
relative size of the version space invariant under rescaling of both weight vectors $\mathbf{w}$ 
and feature vectors $\mathbf{x}_i$. 
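
The difference between the classical margin (3) and the normalised margin (8) can be illustrated numerically. The sketch below assumes explicit, finite-dimensional feature vectors stored as Python tuples; the helper names and the toy data are illustrative only.

```python
import math


def dot(u, v):
    """Inner product of two equally long sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean (2-norm) length of a vector."""
    return math.sqrt(dot(u, u))


def classical_margin(w, xs, ys):
    """Equation (3): min_i y_i <w, x_i> / ||w||."""
    return min(y * dot(w, x) for x, y in zip(xs, ys)) / norm(w)


def normalised_margin(w, xs, ys):
    """Equation (8): min_i y_i <w, x_i> / (||w|| ||x_i||)."""
    return min(y * dot(w, x) / (norm(w) * norm(x)) for x, y in zip(xs, ys))


# Hypothetical toy sample: two points, both classified correctly by w.
w = (1.0, 1.0)
xs = [(1.0, 2.0), (-3.0, -1.0)]
ys = [1, -1]

# Rescale every feature vector by a factor of 10.
xs_scaled = [tuple(10.0 * a for a in x) for x in xs]

print(classical_margin(w, xs, ys), classical_margin(w, xs_scaled, ys))
print(normalised_margin(w, xs, ys), normalised_margin(w, xs_scaled, ys))
```

Rescaling the feature vectors multiplies the classical margin by the same factor, whereas the normalised margin is unchanged; it is the cosine of the largest angle between the weight vector and any correctly signed training point, and therefore always lies in [-1, 1].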
4 A PAC-Bayesian Margin Bound 
Combining the ideas outlined in the previous section allows us to derive a 
generalisation error bound for linear classifiers $\mathbf{w} \in V(z)$ in terms of their normalised 
margin $\Gamma_z(\mathbf{w})$. 
Figure 1: Illustration of the volume ratio for the classifier at the north pole. Four 
training points shown as great circles make up version space -- the polyhedron 
on top of the sphere. The radius of the "cap" of the sphere is proportional to the 
margin $\Gamma_z$, which only for constant $\|\mathbf{x}_i\|_{\mathcal{K}}$ is maximised by the SVM. 
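Theorem 3 (PAC-Bayesian margin bound). Suppose $\mathcal{K} \subseteq \ell_2^n$ is a given 
feature space of dimensionality $n$. For all probability measures $\mathbf{P}_{\mathsf{Z}}$, for any 
$\delta > 0$, with probability at least $1 - \delta$ over the random draw of the training set 
$z$, if we succeed in correctly classifying $m$ samples $z$ with a linear classifier $\mathbf{w}$ 
achieving a positive margin $\Gamma_z(\mathbf{w}) > 0$, then the generalisation error $R[\mathbf{w}]$ of 
$\mathbf{w}$ is bounded from above by 

$$\frac{2}{m} \left( d \ln\!\left( \frac{1}{1 - \sqrt{1 - \Gamma_z^2(\mathbf{w})}} \right) + 2 \ln(m) + \ln\!\left( \frac{1}{\delta} \right) + 2 \right) , \tag{9}$$

where $d = \min(m, n)$. 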
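
To get a feeling for the size of the bound in Theorem 3, its right-hand side can be evaluated directly. The following is a minimal sketch; the function name and the example values of m, n and the confidence parameter are assumptions chosen for illustration and do not come from the experiments reported here.

```python
import math


def pac_bayes_margin_bound(gamma, m, n, delta):
    """Evaluate the right-hand side of the PAC-Bayesian margin bound (9).

    gamma: normalised margin, in (0, 1]
    m:     number of training examples
    n:     dimensionality of the feature space
    delta: confidence parameter, in (0, 1)
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("normalised margin must lie in (0, 1]")
    d = min(m, n)
    # Complexity term d * ln(1 / (1 - sqrt(1 - gamma^2))); zero for gamma = 1.
    complexity = d * math.log(1.0 / (1.0 - math.sqrt(1.0 - gamma ** 2)))
    return (2.0 / m) * (complexity + 2.0 * math.log(m)
                        + math.log(1.0 / delta) + 2.0)


# Hypothetical setting: 1000 examples, 50 features, 95% confidence.
for gamma in (0.2, 0.5, 0.9, 1.0):
    print(gamma, pac_bayes_margin_bound(gamma, m=1000, n=50, delta=0.05))
```

Values above 1 are trivial as bounds on an error probability, so the function is only informative where it returns something smaller than 1. Very small margins (below roughly 1e-8) would underflow the term under the logarithm in double precision and are not handled by this sketch.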
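Proof. Geometrically the hypothesis space $\mathcal{W}$ is the unit sphere in $\ell_2^n$ (see Figure 
1). Let us assume that $\mathbf{P}_{\mathsf{W}}$ is uniform on the unit sphere as suggested by symmetry. 
Given the training set $z$ and a classifier $\mathbf{w}$, all classifiers $\mathbf{v} \in Q(\mathbf{w})$, 

$$Q(\mathbf{w}) = \left\{ \mathbf{v} \in \mathcal{W} \;\middle|\; \langle \mathbf{w}, \mathbf{v} \rangle_{\mathcal{K}} > \sqrt{1 - \Gamma_z^2(\mathbf{w})} \right\} , \tag{10}$$

are within $V(z)$ (for a proof see [2]). Such a set $Q(\mathbf{w})$ is Bayes-admissible by 
Lemma 2 and hence we can use $\mathbf{P}_{\mathsf{W}}(Q(\mathbf{w}))$ to bound the generalisation error of 
$\mathbf{w}$. Since $\mathbf{P}_{\mathsf{W}}$ is uniform, the value $-\ln(\mathbf{P}_{\mathsf{W}}(Q(\mathbf{w})))$ is simply the logarithm of the 
volume ratio between the surface of the unit sphere and the surface of all $\mathbf{v}$ fulfilling 
equation (10). In [2] it is shown that this ratio is exactly given by 

$$\ln\left( \frac{\int_0^{\pi} \sin^{n-2}(\theta)\, d\theta}{\int_0^{\arccos\left( \sqrt{1 - \Gamma_z^2(\mathbf{w})} \right)} \sin^{n-2}(\theta)\, d\theta} \right) .$$

It can be shown that this ratio is tightly bounded from above by 

$$n \ln\!\left( \frac{1}{1 - \sqrt{1 - \Gamma_z^2(\mathbf{w})}} \right) + \ln(2) .$$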
Figure 2: Generalisation errors of classifiers learned by an SVM with (dashed line) 
and without (solid line) normalisation of the feature vectors $\mathbf{x}_i$, plotted against 
the polynomial degree p (horizontal axis, ticks from 10 to 40). The error bars 
indicate one standard deviation over 100 random splits of the data sets. The two 
plots are obtained on the (a) thyroid and (b) sonar data set. 
With $\ln(2) < 1$ we obtain the desired result. Note that $m$ points maximally span 
an $m$-dimensional space and thus we can marginalise over the remaining $n - m$ 
dimensions of feature space $\mathcal{K}$. This gives $d = \min(m, n)$. □ 
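
The volume-ratio step of the proof can be checked numerically. The sketch below assumes that the surface ratio is the ratio of the integrals of sin^(n-2) over [0, pi] and over [0, arccos(sqrt(1 - Gamma^2))], and compares its logarithm with the upper bound n ln(1 / (1 - sqrt(1 - Gamma^2))) + ln(2). The midpoint-rule integrator and all function names are illustrative choices, not taken from [2].

```python
import math


def sin_power_integral(exponent, upper, steps=20000):
    """Midpoint-rule approximation of the integral of sin(t)^exponent on [0, upper]."""
    h = upper / steps
    return h * sum(math.sin((k + 0.5) * h) ** exponent for k in range(steps))


def log_volume_ratio(gamma, n):
    """ln(surface of the unit sphere / surface of the cap Q(w)) in n dimensions."""
    alpha = math.acos(math.sqrt(1.0 - gamma ** 2))  # opening angle of the cap
    whole = sin_power_integral(n - 2, math.pi)
    cap = sin_power_integral(n - 2, alpha)
    return math.log(whole / cap)


def log_volume_ratio_bound(gamma, n):
    """Upper bound n * ln(1 / (1 - sqrt(1 - gamma^2))) + ln(2)."""
    return n * math.log(1.0 / (1.0 - math.sqrt(1.0 - gamma ** 2))) + math.log(2.0)


for n in (2, 3, 10, 50):
    for gamma in (0.1, 0.5, 0.9, 1.0):
        print(n, gamma, log_volume_ratio(gamma, n), log_volume_ratio_bound(gamma, n))
```

For n = 3 the ratio has the closed form 2 / (1 - sqrt(1 - Gamma^2)), which gives a quick sanity check of the integrator. At Gamma = 1 the cap is a hemisphere, so both the exact value and the bound equal ln(2).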
An appealing feature of equation (9) is that for $\Gamma_z(\mathbf{w}) = 1$ the bound reduces 
to $\frac{2}{m} (2 \ln(m) - \ln(\delta) + 2)$ with a rapid decay to zero as $m$ increases. In case of 
margins $\Gamma_z(\mathbf{w}) > 0.91$ the troublesome situation of $d = m$, which occurs e.g. for 
RBF kernels, is compensated for. Furthermore, upper bounding $1/(1 - \sqrt{1 - \Gamma_z^2(\mathbf{w})})$ by 
$2/\Gamma_z^2(\mathbf{w})$ we see that Theorem 3 is an exponential improvement of Theorem 1 in terms 
of the attained margins. It should be noted, however, that the new bound depends 
on the dimensionality of the input space via $d = \min(m, n)$. 
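
The upper bound used in this comparison follows from an elementary inequality. The short derivation below is added for completeness and writes $\Gamma$ for $\Gamma_z(\mathbf{w}) \in (0, 1]$; it is a sketch of the missing step rather than a quotation from [2].

```latex
% Step 1: since 0 < \Gamma \le 1, both sides below are non-negative and
\Bigl(1 - \tfrac{\Gamma^2}{2}\Bigr)^2
  = 1 - \Gamma^2 + \tfrac{\Gamma^4}{4} \;\ge\; 1 - \Gamma^2
\quad\Longrightarrow\quad
\sqrt{1 - \Gamma^2} \;\le\; 1 - \tfrac{\Gamma^2}{2} .

% Step 2: rearranging gives the bound on the argument of the logarithm,
1 - \sqrt{1 - \Gamma^2} \;\ge\; \tfrac{\Gamma^2}{2}
\quad\Longrightarrow\quad
\frac{1}{1 - \sqrt{1 - \Gamma^2}} \;\le\; \frac{2}{\Gamma^2} .

% Step 3: the complexity term of (9) therefore satisfies
d \,\ln\!\left(\frac{1}{1 - \sqrt{1 - \Gamma^2}}\right)
  \;\le\; d \,\ln\!\left(\frac{2}{\Gamma^2}\right)
  \;=\; 2d \,\ln\!\left(\frac{1}{\Gamma}\right) + d \ln(2) .
```

The complexity term thus grows only logarithmically in the inverse normalised margin, in contrast to the dependence on the squared inverse margin in Theorem 1.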
5 Experimental Study 
Theorem 3 suggests the following learning algorithm: given a version space $V(z)$ 
(through a given training set $z$), find the classifier $\mathbf{w}$ that maximises $\Gamma_z(\mathbf{w})$. This 
algorithm, however, is given by the SVM only if the training data in feature space 
$\mathcal{K}$ are normalised. We investigate the influence of such a normalisation on the 
generalisation error in the feature space $\mathcal{K}$ of all monomials up to the p-th degree 
(well-known from handwritten digit recognition, see [13]). Since the SVM learning 
algorithm as well as the resulting classifier only refer to inner products in $\mathcal{K}$, it 
suffices to use an easy-to-calculate kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for all 
$x, x' \in \mathcal{X}$, $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{K}}$, given in our case by the polynomial kernel 

$$\forall p \in \mathbb{N} : \quad k(x, x') = (\langle x, x' \rangle_{\mathcal{X}} + 1)^p .$$
Earlier experiments have shown [13] that without normalisation too large values 
of p may lead to "overfitting". We used the UCI [10] data sets thyroid (d = 5, 
m = 140, m_test = 75) and sonar (d = 60, m = 124, m_test = 60) and plotted 
the generalisation error of SVM solutions (estimated over 100 different splits of the 
data set) as a function of p (see Figure 2). As suggested by Theorem 3 in almost all 
cases the normalisation improved the performance of the support vector machine 
solution at a statistically significant level. As a consequence, we recommend: 
When training an SVM, always normalise your data in feature space. 
Intuitively, it is only the spatial direction of both weight vector and feature vectors 
that determines the classification. Hence the different lengths of feature vectors in 
the training set should not enter the SVM optimisation problem. 
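
The text does not spell out how the normalisation in feature space was implemented. One common way, assumed here, is to divide the kernel value by the feature-space lengths of both arguments, which are themselves available through the kernel. The function names below are illustrative.

```python
import math


def poly_kernel(x, y, p):
    """Polynomial kernel (<x, y> + 1)^p on plain sequences of floats."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** p


def normalised_kernel(kernel, x, y):
    """Kernel of the feature vectors rescaled to unit length.

    Uses ||phi(x)||^2 = k(x, x), so that the result equals
    <phi(x), phi(y)> / (||phi(x)|| ||phi(y)||).
    """
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))


def k3(a, b):
    """Degree-3 polynomial kernel (the degree is an arbitrary example value)."""
    return poly_kernel(a, b, 3)


# Hypothetical inputs of very different length in input space.
x1 = (0.5, -1.0, 2.0)
x2 = (10.0, 3.0, -4.0)
print(k3(x1, x1), k3(x2, x2))                # feature-space squared lengths differ
print(normalised_kernel(k3, x1, x1),
      normalised_kernel(k3, x2, x2))         # both equal 1 after normalisation
print(normalised_kernel(k3, x1, x2))
```

An SVM trained on the normalised kernel sees only feature vectors of unit length, so maximising the classical margin then coincides with maximising the normalised margin.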
6 Conclusion 
The PAC-Bayesian framework together with simple geometrical arguments yields 
the so far tightest margin bound for linear classifiers. The role of the normalised 
margin $\Gamma_z$ in the new bound suggests that the SVM is theoretically justified only for 
input vectors of constant length. We hope that this result is recognised as a useful 
bridge between theory and practice in the spirit of Vapnik's famous statement: 
Nothing is more practical than a good theory 
Acknowledgements We would like to thank David McAllester, John Shawe-Taylor, 
Bob Williamson, Olivier Chapelle, John Langford, Alex Smola and Bernhard 
Schölkopf for interesting discussions and useful suggestions on earlier drafts. 
References 
[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale sensitive dimensions, 
uniform convergence and learnability. Journal of the ACM, 44(4):615-631, 1997. 
[2] R. Herbrich. Learning Linear Classifiers - Theory and Algorithms. PhD thesis, 
Technische Universität Berlin, 2000. Accepted for publication by MIT Press. 
[3] R. Herbrich, T. Graepel, and C. Campbell. Bayesian learning in reproducing kernel 
Hilbert spaces. Technical report, Technical University of Berlin, 1999. TR 99-11. 
[4] M. J. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic 
concepts. Journal of Computer and System Sciences, 48(2):464-497, 1993. 
[5] D. A. McAllester. Some PAC Bayesian theorems. In Proceedings of the Eleventh 
Annual Conference on Computational Learning Theory, pages 230-234, Madison, 
Wisconsin, 1998. 
[6] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series 
A, 13:145-147, 1972. 
[7] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A 
new explanation for the effectiveness of voting methods. In Proceedings of the 14th 
International Conference on Machine Learning, 1997. 
[8] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk 
minimization over data-dependent hierarchies. IEEE Transactions on Information 
Theory, 44(5):1926-1940, 1998. 
[9] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayesian estimator. 
Technical report, Royal Holloway, University of London, 1997. NC2-TR-1997-013. 
[10] UCI. University of California Irvine: Machine Learning Repository, 1990. 
[11] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134- 
1142, 1984. 
[12] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982. 
[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. 
[14] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of 
events to their probabilities. Theory of Probability and its Applications, 16(2):264-281, 
1971. 
