Weak Learners and Improved Rates of 
Convergence in Boosting 
Shie Mannor and Ron Meir 
Department of Electrical Engineering 
Technion, Haifa 32000, Israel 
{shie,rmeir}@{techunix,ee}.technion.ac.il 
Abstract 
The problem of constructing weak classifiers for boosting algo- 
rithms is studied. We present an algorithm that produces a linear 
classifier that is guaranteed to achieve an error better than random 
guessing for any distribution on the data. While this weak learner 
is not useful for learning in general, we show that under reasonable 
conditions on the distribution it yields an effective weak learner for 
one-dimensional problems. Preliminary simulations suggest that 
similar behavior can be expected in higher dimensions, a result 
which is corroborated by some recent theoretical bounds. Addi- 
tionally, we provide improved convergence rate bounds for the gen- 
eralization error in situations where the empirical error can be made 
small, which is exactly the situation that occurs if weak learners 
with guaranteed performance that is better than random guessing 
can be established. 
1 Introduction
The recently introduced boosting approach to classification (e.g., [10]) has been 
shown to be a highly effective procedure for constructing complex classifiers. Boost- 
ing type algorithms have recently been shown [9] to be strongly related to other in- 
cremental greedy algorithms (e.g., [6]). Although a great deal of numerical evidence 
suggests that boosting works very well across a wide spectrum of tasks, it is not a 
panacea for solving classification problems. In fact, many versions of boosting algo- 
rithms currently exist (e.g., [4],[9]), each possessing advantages and disadvantages 
in terms of classification accuracy, interpretability and ease of implementation. 
The field of boosting provides two major theoretical results. First, it is shown that 
in certain situations the training error of the classifier formed converges to zero 
(see (2)). Moreover, under certain conditions, a positive margin can be guaranteed. 
Second, bounds are provided for the generalization error of the classifier (see (1)). 
The main contribution of this paper is twofold. First, we present a simple and 
efficient algorithm which is shown, for every distribution on the data, to yield a 
linear classifier with guaranteed error which is smaller than 1/2 − γ, where γ is
strictly positive. This establishes that a weak linear classifier exists. From the 
theory of boosting [10] it is known that such a condition suffices to guarantee that 
the training error converges to zero as the number of boosting iterations increases. 
In fact, the empirical error with a finite margin is shown to converge to zero if γ
is sufficiently large. However, the existence of a weak learner with error 1/2 − γ
is not always useful in terms of generalization error, since it applies even to the 
extreme case where the binary labels are drawn independently at random with 
equal probability at each point, in which case we cannot expect any generalization. 
It is then clear that in order to construct useful weak learners, some assumptions 
need to be made about the data. In this work we show that under certain natural 
conditions, a useful weak learner can be constructed for one-dimensional problems, 
in which case the linear hyper-plane degenerates to a point. We speculate that 
similar results hold for higher dimensional problems, and present some supporting 
numerical evidence for this. In fact, some very recent results [7] show that this 
expectation is indeed borne out. The second contribution of our work consists of 
establishing faster convergence rates for the generalization error bounds introduced
recently by Mason et al. [8]. These improved bounds show that faster convergence 
can be achieved if we allow for convergence to a slightly larger value than in previous 
bounds. Given the guaranteed convergence of the empirical loss to zero (in the 
limited situations in which we have proved such a bound), such a result may yield a 
better trade-off between the terms appearing in the bound, offering a better model 
selection criterion (see Chapter 15 in [1]). 
2 Construction of a Linear Weak Learner 
We recall the basic generalization bound for convex combinations of classifiers. Let 
H be a class of binary classifiers of VC-dimension d_V, and denote by co(H) the
convex hull of H. Given a sample S = {(x_1, y_1), ..., (x_m, y_m)} ∈ (X × {−1, +1})^m
of m examples drawn independently at random from a probability distribution D
over X × {−1, +1}, Schapire et al. [10] show that with probability at least 1 − δ,
for every f ∈ co(H) and every θ > 0,

P_D[yf(x) ≤ 0] ≤ P_S[yf(x) ≤ θ] + O( (1/√m) [ d_V log²(m/d_V)/θ² + log(1/δ) ]^{1/2} ),   (1)

where the margin-error P_S[yf(x) ≤ θ] denotes the fraction of training points for
which y_i f(x_i) ≤ θ. Clearly, if the first term can be made small for a large value of
the margin θ, a tight bound can be established. Schapire et al. [10] also show that
if each weak classifier can achieve an error smaller than 1/2 − γ, then

P_S[yf(x) ≤ θ] ≤ ( (1 − 2γ)^{1−θ} (1 + 2γ)^{1+θ} )^{T/2},   (2)
where T is the number of boosting iterations. Note that if γ > 0, the bound
decreases to zero exponentially fast. It is thus clear that a large value of γ is needed
in order to guarantee a small value for the margin-error. However, if γ (and thus
θ) behaves like m^{−β} for some β > 0, the rate of convergence in the second term in
(1) will deteriorate, leading to worse bounds than those available by using standard 
VC results [11]. What is needed is a characterization of conditions under which 
the achievable 0 does not decrease rapidly with m. In this section we present such 
conditions for one-dimensional problems, and mention recent work [7] that proves 
a similar result in higher dimensions. 
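For concreteness, bound (2) is easy to evaluate numerically. The following sketch (our own illustrative code, not part of the original analysis; the particular values of γ and θ are assumptions chosen for illustration) shows that the bound decays exponentially in T whenever θ < γ, and becomes vacuous when θ > γ:

```python
import math

def margin_error_bound(gamma, theta, T):
    """Right-hand side of bound (2): ((1-2g)^(1-t) * (1+2g)^(1+t))^(T/2)."""
    base = (1 - 2 * gamma) ** (1 - theta) * (1 + 2 * gamma) ** (1 + theta)
    return base ** (T / 2)

# Illustrative (assumed) values: weak-learner edge gamma = 0.1 and
# margin theta = 0.05 < gamma, so the base is < 1 and the bound
# decays exponentially in the number of boosting rounds T.
for T in (10, 100, 1000):
    print(T, margin_error_bound(0.1, 0.05, T))
```

For θ > γ the base exceeds 1 and the right-hand side of (2) grows with T, which is the quantitative reason a large edge γ is needed to control the margin-error.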
We begin by demonstrating that for any distribution on m points, a linear classifier 
can achieve an error smaller than 1/2 − γ, where γ = Ω(1/m). In view of our
comments above, such a fast convergence of γ to zero may be useless for generalization
bounds. We then use our construction to show that, under certain regularity
conditions, a value of γ, and thus of θ, which is independent of m can be established
for one-dimensional problems.
Let {x,... ,Xm} be points in 1 a, and denote by {y,... ,Ym} their binary labels, 
i.e., Yi  {-1, q-l}. A linear decision rule takes the form (x) - sgn(a. x q-b), 
where  is the standard inner product in 1 a. Let P  A m be a probability measure 
on the m points. The weighted misclassification error for a classifier  is Pe (a, b) = 
Y]4m_- PiI(yi  i). For technical reasons, we prefer to use the expression 1 - 2Pe - 
m i e 
Y]4= PiYii' Obviously if I - 2Pe <> e we have that P X 2 2' 
Lemma 1 For any sample of m distinct points, S = {(x_i, y_i)}_{i=1}^m ∈ (ℝ^d ×
{−1, +1})^m, and a probability measure P ∈ Δ_m on S, there exist a ∈ ℝ^d and
b ∈ ℝ such that the weighted misclassification error of the linear classifier ŷ =
sgn(a·x + b) is bounded away from 1/2; in particular, Σ_{i=1}^m P_i I(y_i ≠ ŷ_i) ≤ 1/2 − 1/(4m).
Proof The basic idea of the proof is to project the finite set of points onto a line
so that no two points coincide. Since there is at least one point x_i whose weight
is not smaller than 1/m, we consider the four possible linear classifiers defined by
this line with boundaries near x_i (at both sides of it and with opposite sign), and show
that one of these yields the desired result. We proceed to the detailed proof. Fix
a probability vector P = (P_1, ..., P_m) ∈ Δ_m. We may assume w.l.o.g. that all the
x_i are different, or we can merge two elements and get m − 1 points. First, observe
that if |Σ_{i=1}^m P_i y_i| > 1/(2m), then the problem is trivially solved. To see this, denote
by S_± the sub-samples of S labelled by ±1 respectively. Assume, for example,
that Σ_{i∈S_+} P_i > Σ_{i∈S_−} P_i + 1/(2m). Then the choice a = 0, b = 1, namely ŷ_i = 1
for all i, implies that Σ_i P_i y_i ŷ_i ≥ 1/(2m). Similarly, the choice a = 0, b = −1 solves
the problem if Σ_{i∈S_−} P_i > Σ_{i∈S_+} P_i + 1/(2m). Thus, we can assume, without loss of
generality, that |Σ_{i=1}^m P_i y_i| ≤ 1/(2m). Next, note that there exists a direction u such
that i ≠ j implies that u·x_i ≠ u·x_j. This can be seen by the following argument.
Construct all one-dimensional lines containing two data points or more; clearly the
number of such lines is at most m(m − 1)/2. It is then obvious that any line which
is not perpendicular to any of these lines obeys the required condition. Let x_i be
a data-point for which P_i ≥ 1/m, and set ε to be a positive number such that
0 < ε < min{|u·x_j − u·x_l| : j ≠ l, j, l ∈ {1, ..., m}}. Such an ε always exists since the
points are assumed to be distinct. Note the following trivial algebraic fact:

|A + B| < δ_1 and A ≥ δ_2  ⟹  A − B > 2δ_2 − δ_1.   (3)

For each j = 1, 2, ..., m let the classification be given by ŷ_j = sgn(u·x_j + b), where
the bias b is given by b = −u·x_i + εy_i. Then clearly ŷ_i = y_i and ŷ_j = sgn(u·x_j − u·x_i)
for j ≠ i, and therefore Σ_j P_j y_j ŷ_j = P_i + Σ_{j≠i} P_j y_j sgn(u·x_j − u·x_i). Let A = P_i and
B = Σ_{j≠i} P_j y_j sgn(u·x_j − u·x_i). If |A + B| ≥ 1/(2m) we are done (flipping the
overall sign of the classifier if A + B is negative). Otherwise, if
|A + B| < 1/(2m), consider the classifier ŷ'_j = sgn(−u·x_j + b'), with b' = u·x_i + εy_i
(note that ŷ'_i = y_i and ŷ'_j = −ŷ_j, j ≠ i). Using (3) with δ_1 = 1/(2m) and δ_2 = 1/m,
the claim follows. □
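The construction in the proof of Lemma 1 translates almost line-by-line into code. The following sketch is our own illustrative implementation (it assumes the x_i are distinct, and it picks the projection direction u at random, which separates all projections with probability one):

```python
import numpy as np

def weak_linear_classifier(X, y, P):
    """Sketch of the Lemma 1 construction (assumes the x_i are distinct).
    Returns (a, b) such that sum_i P_i y_i sgn(a.x_i + b) >= 1/(2m)."""
    m, d = X.shape
    # Trivial case: a constant classifier already has edge > 1/(2m).
    if abs(np.dot(P, y)) > 1.0 / (2 * m):
        return np.zeros(d), (1.0 if np.dot(P, y) > 0 else -1.0)
    # A random direction u makes all projections distinct with probability one.
    gen = np.random.default_rng(0)
    while True:
        u = gen.normal(size=d)
        z = X @ u
        if len(np.unique(z)) == m:
            break
    i = int(np.argmax(P))                  # a point with P_i >= 1/m
    gaps = np.abs(z - z[i])
    gaps[i] = np.inf
    eps = 0.5 * gaps.min()                 # 0 < eps < min_j |u.x_j - u.x_i|
    # First candidate: threshold right next to x_i, classifying x_i correctly.
    b = -z[i] + eps * y[i]
    edge = float(np.dot(P * y, np.sign(z + b)))
    if abs(edge) >= 1.0 / (2 * m):
        return (u, b) if edge > 0 else (-u, -b)   # flip overall sign if needed
    # Otherwise the mirrored classifier sgn(-u.x + b') has a large edge,
    # by the algebraic fact (3) with delta_1 = 1/(2m) and delta_2 = 1/m.
    return -u, z[i] + eps * y[i]

# Usage on a small random weighted sample (illustrative data):
gen = np.random.default_rng(1)
X = gen.normal(size=(7, 2))
y = gen.choice([-1.0, 1.0], size=7)
P = gen.random(7); P /= P.sum()
a, b = weak_linear_classifier(X, y, P)
edge = float(np.dot(P * y, np.sign(X @ a + b)))
assert edge >= 1 / (2 * len(X)) - 1e-12    # guaranteed weighted edge
```

The returned edge Σ_i P_i y_i ŷ_i ≥ 1/(2m) corresponds exactly to the error bound 1/2 − 1/(4m) of the lemma.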
We comment that the upper bound in Lemma 1 may be improved to 1/2 − 1/(4(m − 1)),
m ≥ 2, using a more refined argument.
Remark 1 Lemma 1 implies that an error of 1/2 − γ, where γ = Ω(1/m), can
be guaranteed for any set of arbitrarily weighted points. It is well known that the
problem of finding a linear classifier with minimal classification error is NP-hard
(in d) [5]. Moreover, even the problem of approximating the optimal solution is
NP-hard [2]. Since the algorithm described in Lemma 1 is clearly polynomial (in
m and d), there seems to be a transition, as a function of γ, between the classes NP
and P (assuming, as usual, that they are different). This issue warrants further
investigation.
While the result given in Lemma 1 is interesting, its generality precludes its use- 
fulness for bounding generalization error. This can be seen by observing that the 
theorem guarantees the given margin even in the case where the labels y_i are drawn
uniformly at random from {±1}, in which case no generalization can be expected.
In order to obtain a more useful result, we need to restrict the complexity of the 
data distribution. We do this by imposing constraints on the types of decision re- 
gions characterizing the data. In order to generate complex, yet tractable, decision 
regions, we consider a multi-linear mapping from ℝ^d to {−1, 1}^k, generated by the
k hyperplanes π_i = {x : w_i·x = w_{i0}, x ∈ ℝ^d}, i = 1, ..., k, as in the first hidden
layer of a neural network. Such a mapping generates a partition of the input space
ℝ^d into M connected components, {ℝ^d \ ∪_{i=1}^k π_i}, each characterized by a unique
binary vector of length k. Assume that the weight vectors (w_i, w_{i0}) ∈ ℝ^{d+1} are in
general position. The number of connected components is given by (e.g., Lemma
3.3 in [1]) C(k, d+1) = 2 Σ_{i=0}^d \binom{k−1}{i}. This number can be bounded from below
by 2\binom{k−1}{d}, which in turn is bounded below by 2((k − 1)/d)^d. An upper bound is
given by 2(e(k − 1)/d)^d, k − 1 ≥ d. In other words, C(k, d+1) = Θ((k/d)^d). In order
to generate a binary classification problem, we observe that there exists a binary
function from {−1, 1}^k to {−1, 1}, characterized by these M decision regions. This
can be seen as follows. Choose an arbitrary connected component, and label it by
+1 (say). Proceed by labelling all its neighbors by −1, where neighbors share a
common boundary (a (d − 1)-dimensional hyperplane in d dimensions). Proceeding
by induction, we generate a binary classification problem composed of exactly M
decision regions. Thus, we have constructed a binary classification problem, char-
acterized by at least 2\binom{k−1}{d} ≥ 2((k − 1)/d)^d decision regions. Clearly, as k becomes
arbitrarily large, very elaborate regions are formed.
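The counting formula and the bounds quoted above are easy to check numerically. The sketch below (our own code) evaluates C(k, d+1) and the two bounds for illustrative values of k and d:

```python
from math import comb, e

def num_components(k, d):
    """C(k, d+1) = 2 * sum_{i=0}^{d} binom(k-1, i): the component count
    quoted from Lemma 3.3 of [1] for k hyperplanes in general position."""
    return 2 * sum(comb(k - 1, i) for i in range(d + 1))

k, d = 10, 2                            # illustrative values
M = num_components(k, d)
lower = 2 * ((k - 1) / d) ** d          # 2((k-1)/d)^d
upper = 2 * (e * (k - 1) / d) ** d      # 2(e(k-1)/d)^d, valid for k-1 >= d
assert lower <= 2 * comb(k - 1, d) <= M <= upper
print(M, lower, upper)
```

This confirms the Θ((k/d)^d) growth of the number of components as k increases for fixed d.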
We now apply these ideas, together with Lemma 1, to a one-dimensional problem.
Note that in this case the partition is composed of intervals. 
Theorem 1 Let , be a class of functions from t to {4-1} which partitions the real 
line into at most k intervals, k _ 2. Let l be an arbitrary probability measure on 
. Then for any f  , there exist a, -*   for which, 
I 1 
l {x' f(x)sgn(ax- -*) -- 1} _  + 4- 
(4) 
Proof Let a function f be given, and denote its connected components by I,..., I, 
that is I -[-x,/), 12 - [/,/2), 13 -[/2,/3), and so on until Ik -[/_, x], with 
-x - l0  l  12  '"  l_. Associate with every interval a point in , 
x = l - 1, x2 = (l +/2)/2,... ,x_ = (l-2 + lk_)/2, x = l_ + 1, 
a weight/i - l(Ii),i - 1,... ,k, and a label f(xi)  {4.1}. We now apply Lemma 1 
k 
to conclude that there exist a  {4.1} and -   such that -i= lif(xi)sgn(axi- 
-)  1/(4k). The value of - lies between li and li+ for some i  {0,1,...,k- 
1} (recall that l0 - -x). We identify -* of (4) as li+. This is the case since 
by choosing this -*, f(x) in any segment Ii is equal to f(xi) so we have that 
1 k 1 1  
 {x' f(x)sgn(ax - -*) = 1} =  + -i= if(xi)sgn(axi - T*) _  + . 
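The reduction in the proof is easy to exercise numerically. The sketch below (our own illustrative code, assuming distinct representative points) searches all decision stumps h(x) = sgn(a(x − t)), a ∈ {±1}, which is equivalent to the sgn(ax − τ) form with τ = at, and verifies the edge of 1/(2k) guaranteed by Lemma 1 on the k weighted representatives:

```python
import numpy as np

def best_stump(x, y, w):
    """Exhaustive search over stumps h(x) = sgn(a * (x - t)), a in {+1, -1}.
    Returns (a, t, edge) maximizing the weighted edge sum_i w_i y_i h(x_i)."""
    order = np.argsort(x)
    x, y, w = x[order], y[order], w[order]
    # Thresholds strictly between consecutive points, plus one on each side,
    # so sgn never sees an exact zero (distinct x_i assumed).
    ts = np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2, [x[-1] + 1.0]))
    best = (1.0, ts[0], -np.inf)
    for a in (1.0, -1.0):
        for t in ts:
            edge = float(np.sum(w * y * np.sign(a * (x - t))))
            if edge > best[2]:
                best = (a, t, edge)
    return best

# Illustrative data mimicking the proof's reduction: one representative
# point per interval, random interval labels and weights.
gen = np.random.default_rng(0)
k = 6
x = np.arange(float(k))                  # representative points x_1, ..., x_k
y = gen.choice([-1.0, 1.0], size=k)      # interval labels f(x_i)
w = gen.random(k); w /= w.sum()          # interval weights mu(I_i)
a, t, edge = best_stump(x, y, w)
assert edge >= 1 / (2 * k) - 1e-12       # the edge guaranteed by Lemma 1
```

Since the stump family contains the constant classifiers and every threshold between adjacent points, the exhaustive search always attains at least the Lemma 1 guarantee, for any labels and weights.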
Note that the result in Theorem 1 is in fact more general than we need, as it applies
to arbitrary distributions, rather than distributions over a finite set of points. An 
open problem at this point is whether a similar result applies to d-dimensional prob- 
lems. We conjecture that in d dimensions γ behaves like k^{−l(d)} for some function
l, where k is a measure for the number of homogeneous convex regions defined by
the data (a homogeneous region is one in which all points possess identical labels). 
While we do not have a general proof at this stage, we have recently shown [7] 
that the conjecture holds under certain natural conditions on the data. This result 
implies that, at least under appropriate conditions, boosting-like algorithms are ex-
pected to have excellent generalization performance. To provide some motivation,
we present results of some numerical simulations for two-dimensional problems. For 
this simulation we used random lines to generate a partition of the unit square in
ℝ². We then drew 1000 points at random from the unit square and assigned them
labels according to the partition. Finally, in order to have a non-trivial problem we 
made sure that the cumulative weights of each class are equal. We then calculated 
the optimal linear classifier by exhaustive search. In Figure 1(b) we show a sample
partition with 93 regions. Figure 1(a) shows the dependence of γ on the
number of regions k. As it turns out, there is a significant logarithmic dependence
between γ and k, which leads us to conjecture that γ = Ck^{−l} + E for some C, l and
E. In the presented case, l = 3 turns out to fit our model well.
It is important to note, however, that the procedure described above only supports 
our claim in an average-case, rather than worst-case, setting as is needed. 
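A minimal version of this experiment can be sketched as follows (our own illustrative code: it uses uniform point weights rather than the class-balanced weights described above, and a coarse sweep over projection directions in place of a full exhaustive search):

```python
import numpy as np

gen = np.random.default_rng(0)

def parity_labels(X, W, b):
    """Label a point by the parity of the number of random lines
    w_j . x + b_j > 0 it satisfies; neighboring cells get opposite labels."""
    side = (X @ W.T + b) > 0
    return np.where(side.sum(axis=1) % 2 == 0, 1.0, -1.0)

def best_linear_edge(X, y, n_dirs=120):
    """Coarse search for the best linear classifier in the plane: sweep
    projection directions, then all thresholds along each direction.
    Returns the best edge (1/n) sum_i y_i h(x_i)."""
    n = len(X)
    best = -1.0
    for phi in np.linspace(0.0, np.pi, n_dirs, endpoint=False):
        z = X @ np.array([np.cos(phi), np.sin(phi)])
        ys = y[np.argsort(z)]
        prefix = np.concatenate(([0.0], np.cumsum(ys)))  # label mass left of each cut
        # classifier: -1 left of the cut, +1 right (|.| also covers the flip)
        edges = np.abs((prefix[-1] - prefix) - prefix) / n
        best = max(best, float(edges.max()))
    return best

k = 20                                    # number of random lines (illustrative)
W = gen.normal(size=(k, 2))
b = gen.uniform(-1.0, 1.0, size=k)
X = gen.uniform(size=(1000, 2))           # 1000 points in the unit square
y = parity_labels(X, W, b)
gamma = best_linear_edge(X, y) / 2        # error = 1/2 - edge/2, so gamma = edge/2
assert 0.0 <= gamma <= 0.5
```

Repeating this over a range of k and fitting log γ against log k is the kind of computation behind Figure 1(a).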
Figure 1: (a) γ as a function of the number of regions. (b) A typical complex
partition of the unit square used in the simulations.
3 Improved Convergence Rates 
In Section 2 we proved that under certain conditions a weak learner exists with a 
sufficiently large margin, and thus the first term in (1) indeed converges to zero. 
We now analyze the second term in (1) and show that it may be made to converge 
considerably faster, if the first term is made somewhat larger. First, we briefly 
recall the framework introduced recently by Mason et al. [8]. These authors begin 
by introducing the notion of a B-admissible family of functions. For completeness 
we repeat their definition. 
Definition 1 (Definition 2 in [8]) A family {C_N : N ∈ ℕ} of margin cost functions
is B-admissible for B ≥ 0 if for all N ∈ ℕ there is an interval Y ⊂ ℝ of length no
more than B and a function Ψ_N : [−1, 1] → Y that satisfies

sgn(−α) ≤ E_{Z∼Q_{N,α}}[Ψ_N(Z)] ≤ C_N(α)

for all α ∈ [−1, 1], where E_{Z∼Q_{N,α}}(·) denotes the expectation when Z is chosen
randomly as Z = (1/N) Σ_{i=1}^N Z_i, with Z_i ∈ {−1, +1} and P(Z_i = 1) = (1 + α)/2.
Denote the convex hull of a class H by co(H). The main theoretical result in [8] is 
the following lemma. 
Lemma 2 ([8], Theorem 3) For any B-admissible family {C_N : N ∈ ℕ} of
margin cost functions, for any binary hypothesis class H of VC-dimension d_V and any
distribution D on X × {−1, +1}, with probability at least 1 − δ over a random
sample S of m examples drawn at random according to D, every N and ev-
ery f ∈ co(H) satisfies P_D[yf(x) ≤ 0] ≤ E_S[C_N(yf(x))] + ε_N, where ε_N =
O( [ (B²N d_V log m + log(N/δ)) / m ]^{1/2} ).
Remark 2 The most appealing feature of Lemma 2, as with other results for convex
combinations, is the fact that the bound does not depend on the number of hy-
potheses from H defining f ∈ co(H), which may in fact be infinite. Using standard
VC results (e.g. [11]) would lead to useless bounds, since the VC dimension of these 
classes is often huge (possibly infinite). 
Lemma 2 considers binary hypotheses. Since recent work has demonstrated the
effectiveness of using real-valued hypotheses, we consider the case where the weak
classifiers may be confidence-rated, i.e., taking values in [−1, 1] rather than {±1}.
We first extend Lemma 2 to confidence-rated classifiers. Note that the variables Zi 
in Definition 1 are no longer binary in this case. 
Lemma 3 Let the conditions of Lemma 2 hold, except that H is a class of real-
valued functions from X to [−1, +1] of pseudo-dimension d_p. Assume further that
Ψ_N in Definition 1 obeys a Lipschitz condition of the form |Ψ_N(x) − Ψ_N(x')| ≤
L|x − x'| for every x, x' ∈ [−1, 1]. Then with probability at least 1 − δ, P_D[yf(x) ≤ 0] ≤
E_S[C_N(yf(x))] + ε_N, where ε_N = O( [ (LB²N d_p log m + log(N/δ)) / m ]^{1/2} ).

Proof The proof is very similar to the proof of Theorem 2, and will be omitted for
the sake of brevity. □
It is well known that in the standard setting, where C_N is replaced by the empirical
classification error, improved rates, replacing O(√(log m/m)) by O(log m/m), are
possible in two situations: (i) if the minimal value of C_N is zero (the restricted
model of [1]), and (ii) if the empirical error is replaced by (1 + α)C_N for some α > 0.
The latter case is especially important in a model selection setup, where nested 
classes of hypothesis functions are considered, since in this case one expects that, 
with high probability, CN becomes smaller as the classes become more complex. In 
this situation, case (ii) provides better overall bounds, often leading to the optimal 
minimax rates for nonparametric problems (see a discussion of these issues in Sec.
15.4 of [1]). 
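To illustrate the size of this gap, the two rates can be compared directly (plain Python; the sample sizes are illustrative):

```python
import math

# Slow rate O(sqrt(log m / m)) versus fast rate O(log m / m),
# for a range of sample sizes m.
for m in (10**3, 10**4, 10**5, 10**6):
    slow = math.sqrt(math.log(m) / m)
    fast = math.log(m) / m
    print(m, slow, fast)
```

The fast rate is the square of the slow one, so the advantage grows with m; this is the improvement at stake in Theorem 2 below.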
We now establish a faster convergence rate to a slightly larger value than 
Es[CN(Yf(X))]. In situations where the latter quantity approaches zero, the 
overall convergence rate may be improved, as discussed above. We consider cost 
functions C_N(α) which obey the condition

C_N(α) ≤ (1 + β) I(α ≤ 0) + V    (β > 0, V > 0),   (5)

for some positive β and V (see [8] for details on legitimate cost functions).
Theorem 2 Let D be a distribution over X × {−1, +1}, and let S be a sample of
m points chosen independently at random according to D. Let d_p be the pseudo-
dimension of the class H, and assume that C_N(α) obeys condition (5). Then for
sufficiently large m, with probability at least 1 − δ, every function f ∈ co(H) satisfies
the following bound for every 0 < α < 1:

P_D[yf(x) ≤ 0] ≤ (1 + α) E_S[C_N(yf(x))] + O( (LB²N d_p log m + log(N/δ)) / (αm) ).
Proof The proof combines two ideas. First, we use the method of [8] to transform 
the problem from co(H) to a discrete approximation of it. Then, we use recent 
results for relative uniform deviations of averages from their means [3]. Due to lack 
of space, we defer the complete proof to the full version of the paper. 
4 Discussion 
In this paper we have presented two main results pertaining to the theory of boost- 
ing. First, we have shown that, under reasonable conditions, an effective weak
classifier exists for one-dimensional problems. We conjectured, and supported our
claim by numerical simulations, that such a result holds for multi-dimensional prob- 
lems as well. The non-trivial extension of the proof to multiple dimensions can be 
found in [7]. Second, using recent advances in the theory of uniform convergence and 
boosting we have presented bounds on the generalization error, which may, under 
certain conditions, be significantly better than standard bounds, being particularly 
useful in the context of model selection. 
Acknowledgment We thank Shai Ben-David and Yoav Freund for helpful discus- 
sions. 
References
[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] P. Bartlett and S. Ben-David. On the hardness of learning with neural networks. In Proceedings of the Fourth European Conference on Computational Learning Theory, 1999.
[3] P. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44:55-62, 1999.
[4] J. Friedman, T. Hastie and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, to appear, 2000.
[5] D.S. Johnson and F.P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93-107, 1978.
[6] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 41(12):3397-3415, December 1993.
[7] S. Mannor and R. Meir. On the existence of weak learners and applications to boosting. Submitted to Machine Learning.
[8] L. Mason, P. Bartlett and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 2000. To appear.
[9] L. Mason, P. Bartlett, J. Baxter and M. Frean. Functional gradient techniques for combining hypotheses. In A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 2000.
[10] R.E. Schapire, Y. Freund, P. Bartlett and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
[11] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.
