Some new bounds on the generalization error of combined classifiers
Vladimir Koltchinskii
Department of Mathematics and Statistics
University of New Mexico
Albuquerque, NM 87131-1141
vlad@math.unm.edu

Dmitriy Panchenko
Department of Mathematics and Statistics
University of New Mexico
Albuquerque, NM 87131-1141
panchenk@math.unm.edu

Fernando Lozano
Department of Electrical and Computer Engineering
University of New Mexico
Albuquerque, NM 87131
flozano@eece.unm.edu
Abstract 
In this paper we develop the method of bounding the generalization error
of a classifier in terms of its margin distribution, which was introduced in
the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The
theory of Gaussian and empirical processes allows us to prove margin
type inequalities for very general functional classes, the complexity
of the class being measured via so-called Gaussian complexity functions.
As a simple application of our results, we obtain the bounds of
Schapire, Freund, Bartlett and Lee for the generalization error of boosting.
We also substantially improve the results of Bartlett on bounding
the generalization error of neural networks in terms of the $\ell_1$-norms of the
weights of neurons. Furthermore, under additional assumptions on the
complexity of the class of hypotheses, we provide tighter bounds,
which in the case of boosting improve the results of Schapire, Freund,
Bartlett and Lee.
1 Introduction and margin type inequalities for general functional classes
Let $(X,Y)$ be a random couple, where $X$ is an instance in a space $S$ and $Y \in \{-1,1\}$ is
a label. Let $\mathcal{G}$ be a set of functions from $S$ into $\mathbb{R}$. For $g \in \mathcal{G}$, $\mathrm{sign}(g(X))$ will be used as
a predictor (a classifier) of the unknown label $Y$. If the distribution of $(X,Y)$ is unknown,
then the choice of the predictor is based on the training data $(X_1,Y_1),\dots,(X_n,Y_n)$ that
consists of $n$ i.i.d. copies of $(X,Y)$. The goal of learning is to find a predictor $\hat{g} \in \mathcal{G}$ (based
on the training data) whose generalization (classification) error $\mathbb{P}\{Y\hat{g}(X) \le 0\}$ is small
enough. We will first introduce some probabilistic bounds for general functional classes
and then give several examples of their applications to bounding the generalization error of
boosting and neural networks. We omit all the proofs and refer an interested reader to [5].
Let $(S,\mathcal{A},P)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from
$(S,\mathcal{A})$ into $\mathbb{R}$. Let $\{X_k\}$ be a sequence of i.i.d. random variables taking values in
$(S,\mathcal{A})$ with common distribution $P$. Let $P_n$ be the empirical measure based on the sample
$(X_1,\dots,X_n)$, $P_n := n^{-1}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_x$ denotes the probability distribution
concentrated at the point $x$. We will denote $Pf := \int_S f\,dP$, $P_n f := \int_S f\,dP_n$, etc. In what
follows, $\ell^\infty(\mathcal{F})$ denotes the Banach space of uniformly bounded real valued functions on
$\mathcal{F}$ with the norm $\|Y\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}}|Y(f)|$, $Y \in \ell^\infty(\mathcal{F})$. Define
$$G_n(\mathcal{F}) := \mathbb{E}\Big\| n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{F}} = \mathbb{E}\sup_{f\in\mathcal{F}}\Big| n^{-1}\sum_{i=1}^{n} g_i f(X_i)\Big|,$$
where $\{g_i\}$ is a sequence of i.i.d. standard normal random variables, independent of $\{X_i\}$.
We will call $n \mapsto G_n(\mathcal{F})$ the Gaussian complexity function of the class $\mathcal{F}$. One can find in
the literature (see, e.g., [11]) various upper bounds on such quantities as $G_n(\mathcal{F})$ in terms of
entropies, VC-dimensions, etc.
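For a finite class, $G_n(\mathcal{F})$ can also be estimated by straightforward Monte Carlo simulation over the Gaussian weights. The sketch below (our own illustrative Python, not part of the paper; the class of stumps, sample size and thresholds are arbitrary choices) evaluates $\mathbb{E}\sup_f |n^{-1}\sum_i g_i f(X_i)|$ on a fixed sample.

```python
import numpy as np

def gaussian_complexity(fvals, n_rounds=2000, rng=None):
    """Monte Carlo estimate of G_n(F) = E sup_{f in F} |n^{-1} sum_i g_i f(X_i)|
    for a finite class: row k of `fvals` holds (f_k(X_1), ..., f_k(X_n))."""
    rng = np.random.default_rng(rng)
    m, n = fvals.shape
    sups = np.empty(n_rounds)
    for k in range(n_rounds):
        g = rng.standard_normal(n)   # i.i.d. N(0, 1), independent of the sample
        sups[k] = np.abs(fvals @ g).max() / n
    return float(sups.mean())

# Example: 50 decision stumps x -> sign(b - x) on a uniform sample of size 200.
rng = np.random.default_rng(0)
X = rng.uniform(size=200)
stumps = np.array([np.sign(b - X) for b in np.linspace(0.0, 1.0, 50)])
print(gaussian_complexity(stumps, rng=1))
```

The estimate is of the order $\sqrt{V/n}$, in line with the VC-type bound on $G_n$ mentioned in Section 2.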
We give below a bound in terms of margin cost functions (compare to [6, 7]) and Gaussian 
complexities. 
Let $\Phi = \{\varphi_k : \mathbb{R} \to \mathbb{R}\}_{k=1}^{\infty}$ be a class of Lipschitz functions such that $(1+\mathrm{sgn}(-x))/2 \le
\varphi_k(x)$ for all $x \in \mathbb{R}$ and all $k$. For each $\varphi \in \Phi$, $L(\varphi)$ will denote its Lipschitz constant.
Theorem 1 For all $t > 0$,
$$\mathbb{P}\Big\{\exists f\in\mathcal{F}:\ P\{f\le 0\} > \inf_{k\ge 1}\Big[P_n\varphi_k(f) + \sqrt{2\pi}\,L(\varphi_k)\,G_n(\mathcal{F}) + \Big(\frac{\log(k+1)}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$
Let us consider a special family of cost functions. Assume that $\varphi$ is a fixed nonincreasing
Lipschitz function from $\mathbb{R}$ into $\mathbb{R}$ such that $\varphi(x) \ge (1+\mathrm{sgn}(-x))/2$ for all $x \in \mathbb{R}$. One can
easily observe that $L(\varphi(\cdot/\delta)) \le L(\varphi)\delta^{-1}$. Applying Theorem 1 to the class of Lipschitz
functions $\Phi := \{\varphi(\cdot/\delta_k) : k \ge 0\}$, where $\delta_k := 2^{-k}$, we get the following result.
Theorem 2 For all $t > 0$,
$$\mathbb{P}\Big\{\exists f\in\mathcal{F}:\ P\{f\le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\{f\le\delta\} + \frac{\sqrt{2\pi}\,L(\varphi)}{\delta}G_n(\mathcal{F}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$
In [5] an example was given which shows that, in general, the order of the factor $\delta^{-1}$ in the
second term of the bound cannot be improved.
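Once the sample margins are known, the infimum over $\delta$ in a bound of this type is easy to evaluate numerically on the dyadic grid $\delta_k = 2^{-k}$ used in the construction above. The sketch below is our own illustration; the constant in front of the complexity term is an illustrative choice, not the paper's exact one.

```python
import numpy as np

def margin_bound(margins, complexity):
    """Evaluate, over the dyadic grid delta_k = 2^{-k}, the infimum of
       P_n{f <= delta} + sqrt(2*pi) * complexity / delta
                       + sqrt(log(log2(2/delta)) / n),
    given the sample margins and (an estimate of) G_n(F)."""
    margins = np.asarray(margins, dtype=float)
    n = margins.size
    best = np.inf
    for k in range(10):
        d = 2.0 ** (-k)
        emp = float(np.mean(margins <= d))            # P_n{f <= delta}
        conf = np.sqrt(np.log(np.log2(2.0 / d)) / n)  # log-log confidence term
        best = min(best, emp + np.sqrt(2.0 * np.pi) * complexity / d + conf)
    return best

# A sample with large margins and a small complexity gives a small bound.
print(margin_bound(np.full(1000, 0.9), 0.01))
```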
Given a metric space $(T,d)$, we denote by $H_d(T;\varepsilon)$ the $\varepsilon$-entropy of $T$ with respect to $d$,
i.e. $H_d(T;\varepsilon) := \log N_d(T;\varepsilon)$, where $N_d(T;\varepsilon)$ is the minimal number of balls of radius
$\varepsilon$ covering $T$. The next theorem improves the previous results under some additional
assumptions on the growth of the random entropies $H_{d_{P_n,2}}(\mathcal{F};u)$. Define for $\gamma \in (0,1]$
$$\delta_n(\gamma;f) := \sup\{\delta\in(0,1) :\ \delta^{\gamma} P\{f\le\delta\} \le n^{-\gamma/2}\}$$
and
$$\hat\delta_n(\gamma;f) := \sup\{\delta\in(0,1) :\ \delta^{\gamma} P_n\{f\le\delta\} \le n^{-\gamma/2}\}.$$
We call $\delta_n(\gamma;f)$ and $\hat\delta_n(\gamma;f)$, respectively, the $\gamma$-margin and the empirical $\gamma$-margin of $f$.
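Numerically, the empirical $\gamma$-margin can be found by a grid search over $\delta$. The helper below is our own sketch, using the normalization $\delta^{\gamma}P_n\{f\le\delta\}\le n^{-\gamma/2}$; the grid and names are illustrative.

```python
import numpy as np

def empirical_gamma_margin(margins, gamma, grid_size=1000):
    """Largest delta in (0, 1) (on a finite grid) with
    delta**gamma * P_n{f <= delta} <= n**(-gamma/2): a numerical sketch of
    the empirical gamma-margin of a sample of margins y_i * f(x_i)."""
    margins = np.asarray(margins, dtype=float)
    n = margins.size
    thresh = n ** (-gamma / 2.0)
    best = 0.0
    for d in np.linspace(1e-3, 1.0, grid_size, endpoint=False):
        if d ** gamma * np.mean(margins <= d) <= thresh:
            best = float(d)
    return best

m = np.full(1000, 0.9)   # a sample with all margins equal to 0.9
d_hat = empirical_gamma_margin(m, gamma=2.0 / 3.0)
print(d_hat, (np.sqrt(1000) * d_hat) ** (-2.0 / 3.0))  # margin and resulting bound
```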
Theorem 3 Suppose that for some $\alpha \in (0,2)$ and for some constant $D > 0$
$$H_{d_{P_n,2}}(\mathcal{F};u) \le D u^{-\alpha},\quad u > 0\ \ \text{a.s.}$$
Then for any $\gamma \ge \frac{2\alpha}{2+\alpha}$, for some constants $A, B > 0$ and for all large enough $n$,
$$\mathbb{P}\big\{\forall f\in\mathcal{F}:\ A^{-1}\delta_n(\gamma;f) \le \hat\delta_n(\gamma;f) \le A\,\delta_n(\gamma;f)\big\} \ge 1 - B(\log_2\log_2 n)\exp\{-n^{\gamma}/2\}. \tag{1}$$
This implies that with high probability for all $f \in \mathcal{F}$
$$P\{f\le 0\} \le \frac{C}{n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma}}.$$
The bound of Theorem 2 corresponds to the case of $\gamma = 1$. It is easy to see from the
definitions of $\gamma$-margins that the quantity $(n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ increases in $\gamma \in (0,1]$.
This shows that the bound in the case of $\gamma < 1$ is tighter. Further discussion of this
type of bounds and their experimental study in the case of convex combinations of simple
classifiers is given in the next section.
2 Bounding the generalization error of convex combinations of classifiers
Recently, several authors ([1, 8]) suggested a new class of upper bounds on the generalization
error that are expressed in terms of the empirical distribution of the margin of the predictor
(the classifier). The margin is defined as the product $Y\hat{g}(X)$. The bounds in question are
especially useful in the case of classifiers that are combinations of simpler classifiers
(that belong, say, to a class $\mathcal{H}$). One example of such classifiers is provided by the
classifiers obtained by boosting [3, 4], bagging [2] and other voting methods of combining
classifiers. We will now demonstrate how our general results can be applied to the case
of convex combinations of simple base classifiers.
We assume that $\tilde S := S \times \{-1,1\}$ and $\tilde{\mathcal{F}} := \{\tilde f : f \in \mathcal{F}\}$, where $\tilde f(x,y) := yf(x)$. $P$ will
denote the distribution of $(X,Y)$, $P_n$ the empirical distribution based on the observations
$((X_1,Y_1),\dots,(X_n,Y_n))$. It is easy to see that $G_n(\tilde{\mathcal{F}}) = G_n(\mathcal{F})$. One can easily see
that if $\mathcal{F} := \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a class of base classifiers, then $G_n(\mathcal{F}) = G_n(\mathcal{H})$.
These easy observations allow us to obtain useful bounds for boosting and other methods
of combining classifiers. For instance, we get in this case the following theorem that
implies the bound of Schapire, Freund, Bartlett and Lee [8] when $\mathcal{H}$ is a VC-class of sets.
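The identity $G_n(\mathrm{conv}(\mathcal{H})) = G_n(\mathcal{H})$ holds because, for each fixed Gaussian vector, the map $f \mapsto |n^{-1}\sum_i g_i f(X_i)|$ is convex, so its supremum over the convex hull is attained at a base function. A quick numerical sanity check (our own illustrative code, with an arbitrary finite base class):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.choice([-1.0, 1.0], size=(20, 100))  # 20 base classifiers on n = 100 points
g = rng.standard_normal(100)                 # one fixed Gaussian weight vector

base_sup = np.abs(H @ g).max()               # sup of |sum_i g_i h(X_i)| over H
# Random points of conv(H): simplex weights lambda applied to the rows of H.
lams = rng.dirichlet(np.ones(20), size=5000)
hull_vals = np.abs((lams @ H) @ g)
assert hull_vals.max() <= base_sup + 1e-9    # the hull never exceeds the base sup
print(base_sup, hull_vals.max())
```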
Theorem 4 Let $\mathcal{F} := \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a class of measurable functions from $S$
into $\mathbb{R}$. For all $t > 0$,
$$\mathbb{P}\Big\{\exists f\in\mathcal{F}:\ P\{yf(x)\le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\{yf(x)\le\delta\} + \frac{\sqrt{2\pi}}{\delta}G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$
In particular, if $\mathcal{H}$ is a VC-class of classifiers $h : S \to \{-1,1\}$ (which means that the class
of sets $\{\{x : h(x) = +1\} : h \in \mathcal{H}\}$ is a Vapnik-Chervonenkis class) with VC-dimension
$V(\mathcal{H})$, we have with some constant $C > 0$, $G_n(\mathcal{H}) \le C(V(\mathcal{H})/n)^{1/2}$. This implies that
with probability at least $1-\alpha$
$$P\{yf(x)\le 0\} \le \inf_{\delta\in(0,1]}\Big[P_n\{yf(x)\le\delta\} + \frac{C}{\delta}\Big(\frac{V(\mathcal{H})}{n}\Big)^{1/2} + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\sqrt{\log(2/\alpha)/2}+2}{\sqrt{n}},$$
which slightly improves the bound obtained previously by Schapire, Freund, Bartlett and
Lee [8].
Theorem 3 provides some improvement of the above bounds on the generalization error of
convex combinations of base classifiers. To be specific, consider the case when $\mathcal{H}$ is a
VC-class of classifiers. Let $V := V(\mathcal{H})$ be its VC-dimension. A well known bound (going
back to Dudley) on the entropy of the convex hull (see [11], p. 142) implies that
$$H_{d_{P_n,2}}(\mathrm{conv}(\mathcal{H});u) \le \sup_{Q\in\mathcal{P}(S)} H_{d_{Q,2}}(\mathrm{conv}(\mathcal{H});u) \le D u^{-\frac{2(V-1)}{V}}.$$
It immediately follows from Theorem 3 that for all $\gamma \ge \frac{2(V-1)}{2V-1}$ and for some constants
$C, B > 0$
$$\mathbb{P}\Big\{\exists f\in\mathrm{conv}(\mathcal{H}):\ P\{f\le 0\} > \frac{C}{n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma}}\Big\} \le B\,\log_2\log_2 n\,\exp\{-n^{\gamma}/2\},$$
where
$$\hat\delta_n(\gamma;f) := \sup\{\delta\in(0,1) :\ \delta^{\gamma}P_n\{(x,y) : yf(x)\le\delta\} \le n^{-\gamma/2}\}.$$
This shows that in the case when the VC-dimension of the base class is relatively small, the
generalization error of boosting and some other convex combinations of simple classifiers
obtained by various versions of voting methods becomes better than was suggested by the
bounds of Schapire, Freund, Bartlett and Lee. One can also conjecture that the remarkable
generalization ability of these methods observed in numerous experiments can be related
to the fact that the combined classifier belongs to a subset of the convex hull for which
the random entropy $H_{d_{P_n,2}}$ is much smaller than for the whole convex hull (see [9, 10] for
improved margin type bounds in a much more special setting).
To demonstrate the improvement provided by our bounds over previous results, we show 
some experimental evidence obtained for a simple artificially generated problem, for which 
we are able to compute exactly the generalization error as well as the $\gamma$-margins.
We consider the problem of learning a classifier consisting of the indicator function of the
union of a finite number of intervals in the input space $S = [0,1]$. We used the Adaboost
algorithm [4] to find a combined classifier, using as the base class $\mathcal{H} = \{I_{[0,b]} : b \in [0,1]\} \cup
\{I_{[b,1]} : b \in [0,1]\}$ (i.e. decision stumps). Notice that in this case $V = 2$, and according to
the theory, values of $\gamma$ in $(2/3, 1)$ should result in tighter bounds on the generalization
error.
For our experiments we used a target function with 10 equally spaced intervals and a sample
of size 1000, generated according to the uniform distribution on $[0,1]$. We ran Adaboost
for 500 rounds, and computed at each round the generalization error of the combined
classifier and the bound $C(n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for different values of $\gamma$. We set the constant
$C$ to one.
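A minimal reconstruction of this experiment can be sketched as follows. This is our own illustrative code, not the authors'; it uses fewer boosting rounds than the figures, an arbitrary threshold grid and seed, and the $\gamma$-margin normalization $\delta^{\gamma}P_n\{yf(x)\le\delta\}\le n^{-\gamma/2}$ with the constant set to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: union of 10 equally spaced intervals in [0, 1], labels in {-1, +1}.
def target(x):
    return np.where(np.floor(10 * np.asarray(x)).astype(int) % 2 == 0, 1.0, -1.0)

n = 1000
X = rng.uniform(size=n)
Y = target(X)

# Base class: decision stumps x -> sign(b - x) and x -> sign(x - b).
bs = np.linspace(0.0, 1.0, 201)

def stump(j, x):
    b = bs[j % bs.size]
    s = np.sign(b - x) if j < bs.size else np.sign(x - b)
    return np.where(s == 0, 1.0, s)

H = np.vstack([stump(j, X) for j in range(2 * bs.size)])

def adaboost(Hmat, Y, rounds):
    """Plain AdaBoost, picking each round the stump of smallest weighted error."""
    w = np.full(Y.size, 1.0 / Y.size)
    alphas, chosen = [], []
    for _ in range(rounds):
        errs = ((Hmat != Y[None, :]) * w[None, :]).sum(axis=1)
        j = int(errs.argmin())
        e = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        a = 0.5 * np.log((1 - e) / e)
        w = w * np.exp(-a * Y * Hmat[j])
        w /= w.sum()
        alphas.append(a)
        chosen.append(j)
    return np.array(alphas), chosen

alphas, chosen = adaboost(H, Y, rounds=200)

def f(x):
    """Combined classifier, rescaled to lie in the convex hull of the stumps."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for a, j in zip(alphas, chosen):
        total += a * stump(j, x)
    return total / alphas.sum()

# Near-exact generalization error under the uniform distribution on [0, 1].
grid = np.linspace(0.0, 1.0, 100001)
gen_err = float(np.mean(target(grid) * f(grid) <= 0))

margins = Y * f(X)

def bound(gamma):
    """The bound (sqrt(n) * hat-delta_n(gamma; f))**(-gamma), constant one."""
    ds = np.linspace(1e-3, 1.0, 1000, endpoint=False)
    mask = np.array([d ** gamma * np.mean(margins <= d) <= n ** (-gamma / 2.0)
                     for d in ds])
    d_hat = float(ds[mask].max()) if mask.any() else 1e-3
    return (np.sqrt(n) * d_hat) ** (-gamma)

print(gen_err, bound(1.0), bound(2.0 / 3.0))
```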
In Figure 1 we plot the generalization error and the bounds for $\gamma = 1$, $0.8$ and $2/3$. As
expected, for $\gamma = 1$ (which corresponds roughly to the bounds in [8]) the bound is very
loose, and as $\gamma$ decreases, the bound gets closer to the generalization error. In Figure 2
we show that by reducing further the value of $\gamma$ we get a curve even closer to the actual
generalization error (although for $\gamma = 0.2$ we do not get an upper bound). This seems to
support the conjecture that Adaboost generates combined classifiers that belong to a subset
of the convex hull of $\mathcal{H}$ with a smaller random entropy. In Figure 3 we plot the ratio
$\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ for $\gamma = 0.4$, $2/3$ and $0.8$ against the boosting iteration. We can see that
the ratio is close to one in all the examples, indicating that the value of the constant $A$ in
Theorem 3 is close to one in this case.
Figure 1: Comparison of the generalization error (thicker line) with $(n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$
for $\gamma = 1$, $0.8$ and $2/3$ (thinner lines, top to bottom).
Figure 2: Comparison of the generalization error (thicker line) with $(n^{\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$
for $\gamma = 0.5$, $0.4$ and $0.2$ (thinner lines, top to bottom).
Figure 3: Ratio $\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ versus boosting round for $\gamma = 0.4$, $2/3$ and $0.8$ (top to
bottom).
3 Bounding the generalization error in neural network learning 
We turn now to applications of the bounds of the previous section in neural network learning.
Let $\mathcal{H}$ be a class of measurable functions from $(S,\mathcal{A})$ into $\mathbb{R}$. Given a sigmoid $\sigma$
from $\mathbb{R}$ into $[-1,1]$ and a vector $w := (w_1,\dots,w_n) \in \mathbb{R}^n$, let $N_{\sigma,w}(u_1,\dots,u_n) :=
\sigma\big(\sum_{j\le n} w_j u_j\big)$. We call the function $N_{\sigma,w}$ a neuron with weights $w$ and sigmoid $\sigma$. For
$w \in \mathbb{R}^n$, $\|w\|_{\ell_1} := \sum_{j\le n}|w_j|$. Let $\sigma_j$, $j \ge 1$, be functions from $\mathbb{R}$ into $[-1,1]$, satisfying
the Lipschitz conditions $|\sigma_j(u) - \sigma_j(v)| \le L_j|u-v|$, $u,v \in \mathbb{R}$.
Let $\{A_j\}$ be a sequence of positive numbers. We define recursively classes of neural
networks with restrictions on the weights of neurons ($j$ below is the number of layers):
$$\mathcal{H}_0 := \mathcal{H},\qquad \mathcal{H}_j(A_1,\dots,A_j) :=$$
$$\big\{N_{\sigma_j,w}(h_1,\dots,h_n) : n \ge 0,\ h_i \in \mathcal{H}_{j-1}(A_1,\dots,A_{j-1}),\ w \in \mathbb{R}^n,\ \|w\|_{\ell_1} \le A_j\big\}
\cup\ \mathcal{H}_{j-1}(A_1,\dots,A_{j-1}).$$
Theorem 5 For all $t > 0$ and for all $l \ge 1$,
$$\mathbb{P}\Big\{\exists f\in\mathcal{H}_l(A_1,\dots,A_l):\ P\{f\le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\{f\le\delta\} + \frac{\sqrt{2\pi}}{\delta}\prod_{j=1}^{l}(2L_jA_j+1)\,G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$
Remark. Bartlett [1] obtained a similar bound for a more special class $\mathcal{H}$ and with larger
constants. In the case when $A_j \equiv A$, $L_j \equiv L$ (the case considered by Bartlett) the
expression in the right hand side of his bound includes the factor $(AL)^{l(l+1)/2}$, which is replaced in our
bound by $(2AL+1)^{l}$. This improvement can be substantial in applications, since the above
quantities play the role of complexity penalties.
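To see the size of this improvement, compare the growth in the number of layers $l$ of the two factors: the exponent is linear in $l$ for $(2AL+1)^{l}$ but quadratic for $(AL)^{l(l+1)/2}$. A quick numeric comparison with illustrative values $A = 3$, $L = 1$ (our own choice):

```python
# Depth dependence of the two complexity penalties for l-layer networks with
# A_j = A and L_j = L: (2*A*L + 1)**l (Theorem 5) versus (A*L)**(l*(l+1)/2).
A, L = 3.0, 1.0
for l in range(1, 6):
    print(l, (2 * A * L + 1) ** l, (A * L) ** (l * (l + 1) // 2))
```

Already at $l = 5$ the quadratic-exponent factor is almost three orders of magnitude larger.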
Finally, it is worth mentioning that the theorems of Section 1 can also be applied to bounding
the generalization error in multi-class problems. Namely, we assume that the labels
take values in a finite set $\mathcal{Y}$ with $\mathrm{card}(\mathcal{Y}) =: L$. Consider a class $\mathcal{F}$ of functions from
$\tilde S := S \times \mathcal{Y}$ into $\mathbb{R}$. A function $f \in \mathcal{F}$ predicts a label $y \in \mathcal{Y}$ for an example $x \in S$ iff
$$f(x,y) > \max_{y'\ne y} f(x,y').$$
The margin of a labeled example $(x,y)$ is defined as
$$m_f(x,y) := f(x,y) - \max_{y'\ne y} f(x,y'),$$
so $f$ misclassifies the example $(x,y)$ iff $m_f(x,y) \le 0$. Let
$$\mathcal{F}' := \{f(\cdot,y) : y \in \mathcal{Y},\ f \in \mathcal{F}\}.$$
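The multi-class margin is straightforward to compute from a matrix of scores $f(x,y)$. The helper below is our own illustrative sketch (the names and the integer label encoding are assumptions, not the paper's notation).

```python
import numpy as np

def multiclass_margins(scores, y):
    """Margins m_f(x, y) = f(x, y) - max_{y' != y} f(x, y') for a score matrix
    `scores` of shape (n_examples, n_classes) and integer labels `y`."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    true = scores[np.arange(n), y]
    masked = scores.copy()
    masked[np.arange(n), y] = -np.inf     # exclude the true label from the max
    return true - masked.max(axis=1)

scores = np.array([[2.0, 0.5, 1.0],
                   [0.1, 0.3, 0.2]])
y = np.array([0, 2])
# First example has margin 1.0; the second is misclassified (margin < 0).
print(multiclass_margins(scores, y))
```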
The next result follows from Theorem 2. 
Theorem 6 For all $t > 0$,
$$\mathbb{P}\Big\{\exists f\in\mathcal{F}:\ P\{m_f\le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\{m_f\le\delta\} + \frac{4\sqrt{2\pi}\,L(2L-1)}{\delta}G_n(\mathcal{F}') + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$
References 
[1] Bartlett, P. (1998) The Sample Complexity of Pattern Classification with Neural Net- 
works: The Size of the Weights is More Important than the Size of the Network. IEEE 
Transactions on Information Theory, 44, 525-536. 
[2] Breiman, L. (1996). Bagging Predictors. Machine Learning, 26(2), 123-140. 
[3] Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and
Computation, 121(2), 256-285.
[4] Freund Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line 
learning and an application to boosting. Journal of Computer and System Sciences, 
55(1),119-139. 
[5] Koltchinskii, V. and Panchenko, D. (2000) Empirical margin distributions and bound- 
ing the generalization error of combined classifiers, preprint. 
[6] Mason, L., Bartlett, P. and Baxter, J. (1999) Improved Generalization through Explicit 
Optimization of Margins. Machine Learning, 0, 1-11. 
[7] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (1999) Functional Gradient Tech-
niques for Combining Hypotheses. In: Advances in Large Margin Classifiers. Smola,
Bartlett, Schölkopf and Schuurmans (Eds), to appear.
[8] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. S. (1998) Boosting the Margin: A 
New Explanation of Effectiveness of Voting Methods. Ann. Statist., 26, 1651-1687. 
[9] Shawe-Taylor, J. and Cristianini, N. (1999) Margin Distribution Bounds on Gener-
alization. In: Lecture Notes in Artificial Intelligence, 1572. Computational Learning
Theory, 4th European Conference, EuroCOLT'99, 263-273.
[10] Shawe-Taylor, J. and Cristianini, N. (1999) Further Results on the Margin Distribu-
tion. Proc. of COLT'99, 278-285.
[11] van der Vaart, A. and Wellner, J. (1996) Weak Convergence and Empirical Processes.
With Applications to Statistics. Springer-Verlag, New York.
