Convergence of Large Margin Separable Linear Classification

Tong Zhang
Mathematical Sciences Department
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
tzhang@watson.ibm.com

Abstract
Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed "optimal hyperplane" approaches zero at a rate proportional to the inverse training sample size. This rate is usually characterized by the margin and the maximum norm of the input data. In this paper, we argue that another quantity, namely the robustness of the input data distribution, also plays an important role in characterizing the convergence behavior of the expected misclassification error. Based on this concept of robustness, we show that for a large margin separable linear classification problem, the expected misclassification error may converge to zero exponentially fast in the training sample size.
1 Introduction 
We consider the binary classification problem: to determine a label y ∈ {−1, 1} associated with an input vector x. A useful method for solving this problem is by using linear discriminant functions. Specifically, we seek a weight vector w and a threshold θ such that w^T x < θ if its label y = −1 and w^T x > θ if its label y = 1.
In this paper, we are mainly interested in problems that are linearly separable by a positive margin (although, as we shall see later, our analysis is suitable for non-separable problems). That is, there exists a hyperplane that perfectly separates the in-class data from the out-of-class data. We shall also assume θ = 0 throughout the rest of the paper for simplicity. This restriction usually does not cause problems in practice, since one can always append a constant feature to the input data x, which offsets the effect of θ.
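The constant-feature trick mentioned above can be sketched in a few lines (a minimal illustration with hypothetical data; numpy is assumed):

```python
import numpy as np

# Hypothetical input vectors x^i (one per row); values are for illustration only.
X = np.array([[0.5, 1.0],
              [-1.2, 0.3],
              [2.0, -0.7]])

# Appending a constant feature to each x absorbs the threshold theta:
# for w_aug = [w, theta'], we have w_aug . [x, 1] = w . x + theta',
# so a separator with threshold 0 in the augmented space is as expressive
# as one with a free threshold in the original space.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
```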
For linearly separable problems, given a training set of n labeled data (x^1, y^1), …, (x^n, y^n), Vapnik recently proposed a method that optimizes a hard margin bound, which he calls the "optimal hyperplane" method (see [11]). The optimal hyperplane w_n is the solution to the following quadratic programming problem:

\[
\min_w \frac{1}{2} w^T w \quad \text{s.t.} \quad w^T x^i y^i \ge 1 \;\; \text{for } i = 1, \ldots, n. \tag{1}
\]

For linearly non-separable problems, a generalization of the optimal hyperplane method has appeared in [2], where a slack variable ξ_i is introduced for each data point (x^i, y^i) for i = 1, …, n. We compute a hyperplane w_n that solves

\[
\min_w \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad w^T x^i y^i \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \text{for } i = 1, \ldots, n, \tag{2}
\]

where C > 0 is a given parameter (also see [11]).
In this paper, we are interested in the quality of the computed weight w_n for the purpose of predicting the label y of an unseen data point x. We study this predictive power of w_n in the standard batch learning framework. That is, we assume that the training data (x^i, y^i) for i = 1, …, n are independently drawn from the same underlying data distribution D, which is unknown. The predictive power of the computed parameter w_n then corresponds to the classification performance of w_n with respect to the true distribution D.
We organize the paper as follows. In Section 2, we briefly review a number of existing techniques for analyzing separable linear classification problems. We then derive an exponential convergence rate of the misclassification error in Section 3 for certain large margin linear classification problems. Section 4 compares the newly derived bound with known results from the traditional margin analysis. We explain that the exponential bound relies on a new quantity (the robustness of the distribution) which is not explored in a traditional margin bound. Note that for certain batch learning problems, exponential learning curves have already been observed [10]. It is thus not surprising that an exponential rate of convergence can be achieved by large margin linear classification.
2 Some known results on generalization analysis 
There are a number of ways to obtain bounds on the generalization error of a linear classi- 
fier. A general framework is to use techniques from empirical processes (aka VC analysis). 
Many such results that are related to large margin classification have been described in 
chapter 4 of [3]. 
The main advantage of this framework is its generality. The analysis does not require the estimated parameter to converge to the true parameter, which is ideal for combinatorial problems. However, for problems that are numerical in nature, the potential parameter space can be significantly reduced by using the first order condition of the optimal solution. In this case, the VC analysis may become suboptimal, since it assumes a larger search space than what a typical numerical procedure uses. Generally speaking, for a problem that is linearly separable with a large margin, the expected classification error of the computed hyperplane resulting from this analysis is of the order O(log n / n).¹ Similar generalization bounds can also be obtained for non-separable problems.
In chapter 10 of [11], Vapnik described a leave-one-out cross-validation analysis for linearly separable problems. This analysis takes into account the first order KKT condition of the optimal hyperplane w_n. The expected generalization performance from this analysis is O(1/n), which is better than the corresponding bounds from the VC analysis. Unfortunately, this technique is only suitable for deriving an expected generalization bound (for example, it is not useful for obtaining a PAC style probability bound).
Another well-known technique for analyzing linearly separable problems is the mistake bound framework in online learning. It is possible to obtain an algorithm with a small generalization error in the batch learning setting from an algorithm with a small online mistake bound. The reader is referred to [6] and references therein for this type of analysis. The technique may lead to a bound with an expected generalization performance of O(1/n).

¹Bounds described in [3] would imply an expected classification error of O(log² n / n), which can be slightly improved (by a log n factor) if we adopt a slightly better covering number estimate, such as the bounds in [12, 14].
Besides the above-mentioned approaches, generalization ability can also be studied in the statistical mechanical learning framework. It was shown that for linearly separable problems, exponential decrease of the misclassification error is possible under this framework [1, 5, 7, 8]. Unfortunately, it is unclear how to relate the statistical mechanical learning framework to the batch learning framework considered in this paper. Their analysis, employing approximation techniques, does not seem to imply the small sample bounds in which we are interested.
The statistical mechanical learning results suggest that it may be possible to obtain a similar exponential decay of the misclassification error in the batch learning setting, which we prove in the next section. Furthermore, we show that the exponential rate depends on a quantity that is different from the traditional margin concept. Our analysis relies on a PAC style probability estimate of the convergence rate of the estimated parameter from (2) to the true parameter. Consequently, it is suitable for non-separable problems. A direct analysis of the convergence rate of the estimated parameter to the true parameter is important for problems that are numerical in nature, such as (2). However, a disadvantage of our analysis is that we are unable to directly deal with the linearly separable formulation (1).
3 Exponential convergence 
We can rewrite the SVM formulation (2) by eliminating ξ_i as:

\[
w_n(\lambda) = \arg\min_w \frac{1}{n} \sum_{i=1}^n f(w^T x^i y^i - 1) + \frac{\lambda}{2} w^T w, \tag{3}
\]

where λ = 1/(nC) and

\[
f(z) = \begin{cases} -z & z \le 0, \\ 0 & z > 0. \end{cases}
\]

Denote by D the true underlying data distribution of (x, y), and let w_*(λ) be the optimal solution with respect to the true distribution:

\[
w_*(\lambda) = \arg\inf_w E_D f(w^T x y - 1) + \frac{\lambda}{2} w^T w. \tag{4}
\]

Let w_* be the solution to

\[
w_* = \arg\inf_w \frac{1}{2} w^T w \quad \text{s.t.} \quad E_D f(w^T x y - 1) = 0, \tag{5}
\]

which is the infinite-sample version of the optimal hyperplane method.

Throughout this section, we assume ||w_*||_2 < ∞ and E_D ||x||_2 < ∞. The latter condition ensures that E_D f(w^T x y − 1) ≤ ||w||_2 E_D ||x||_2 + 1 exists for all w.
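The estimator (3) can be approximated numerically with a simple subgradient method. The following is a sketch only, not the quadratic programming solver assumed in the paper; the synthetic data, step-size schedule, and iteration count are illustrative assumptions:

```python
import numpy as np

def f(z):
    # Hinge-type loss from (3): f(z) = -z for z <= 0 and 0 for z > 0.
    return np.maximum(0.0, -z)

def solve_w_n(X, y, lam, iters=3000):
    """Minimize (1/n) sum_i f(w^T x^i y^i - 1) + (lam/2) w^T w by
    subgradient descent with step 1/(lam * t), the usual schedule for
    lam-strongly convex objectives (a sketch, not the paper's method)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        margins = (X @ w) * y            # w^T x^i y^i for each i
        active = margins - 1.0 <= 0.0    # points where the subgradient of f is -1
        grad = -(X[active] * y[active, None]).sum(axis=0) / n + lam * w
        w -= grad / (lam * t)
    return w

rng = np.random.default_rng(0)
# Synthetic linearly separable data: two Gaussian clusters (illustrative only).
X = np.vstack([rng.normal(2.0, 0.3, size=(50, 2)),
               rng.normal(-2.0, 0.3, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w_n = solve_w_n(X, y, lam=0.01)
```

With a small λ and well-separated clusters, the returned w_n separates the training sample with margins close to 1.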
3.1 Continuity of solution under regularization 
In this section, we show that ||w_*(λ) − w_*||_2 → 0 as λ → 0. This continuity result allows us to approximate (5) by using (4) and (3) with a small positive regularization parameter λ. We only need to show that within any sequence of λ that converges to zero, there exists a subsequence λ_i → 0 such that w_*(λ_i) converges to w_* strongly.
We first consider the following inequality, which follows from the definition of w_*(λ):

\[
E_D f(w_*(\lambda)^T x y - 1) + \frac{\lambda}{2} \|w_*(\lambda)\|_2^2 \le \frac{\lambda}{2} \|w_*\|_2^2. \tag{6}
\]

Therefore ||w_*(λ)||_2 ≤ ||w_*||_2.
It is well-known that every bounded sequence in a Hilbert space contains a weakly convergent subsequence (cf. Proposition 66.4 in [4]). Therefore within any sequence of λ that converges to zero, there exists a subsequence λ_i → 0 such that w_*(λ_i) converges weakly. We denote the limit by ŵ.

Since f(w_*(λ_i)^T x y − 1) is dominated by ||w_*||_2 ||x||_2 + 1, which has a finite integral with respect to D, from (6) and the Lebesgue dominated convergence theorem we obtain

\[
0 = \lim_i E_D f(w_*(\lambda_i)^T x y - 1) = E_D \lim_i f(w_*(\lambda_i)^T x y - 1) = E_D f(\hat{w}^T x y - 1). \tag{7}
\]

Also note that ||ŵ||_2 ≤ lim inf_i ||w_*(λ_i)||_2 ≤ ||w_*||_2; therefore by the definition of w_*, we must have ŵ = w_*.
Since w_* is the weak limit of w_*(λ_i), we obtain ||w_*||_2 ≤ lim inf_i ||w_*(λ_i)||_2. Also, since ||w_*(λ_i)||_2 ≤ ||w_*||_2, it follows that lim_i ||w_*(λ_i)||_2 = ||w_*||_2. This equality implies that w_*(λ_i) converges to w_* strongly, since

\[
\lim_i \|w_*(\lambda_i) - w_*\|_2^2 = \lim_i \|w_*(\lambda_i)\|_2^2 + \|w_*\|_2^2 - 2 \lim_i w_*(\lambda_i)^T w_* = 0.
\]
3.2 Accuracy of estimated hyperplane with non-zero regularization parameter 
Our goal is to show that for the estimation method (3) with a nonzero regularization parameter λ > 0, the estimated parameter w_n(λ) converges to the true parameter w_*(λ) in probability when the sample size n → ∞. Furthermore, we give a large deviation bound on the rate of convergence.

From (4), we obtain the following first order condition:

\[
E_D f'(\lambda, x, y)\, x y + \lambda w_*(\lambda) = 0, \tag{8}
\]

where f'(λ, x, y) = f'(w_*(λ)^T x y − 1) and f'(z) ∈ [−1, 0] denotes a member of the subgradient of f at z [9].² In the finite sample case, we can also interpret f'(λ, x^i, y^i) in (8) as a scaled dual variable α_i: f'(λ, x^i, y^i) = −α_i/C, where α_i appears in the dual (or kernel) formulation of an SVM (for example, see chapter 10 of [11]).
The convexity of f implies that f(z_1) + (z_2 − z_1) f'(z_1) ≤ f(z_2) for any subgradient f' of f. This implies the following inequality:

\[
\frac{1}{n} \sum_i f(w_*(\lambda)^T x^i y^i - 1) + (w_n(\lambda) - w_*(\lambda))^T \frac{1}{n} \sum_i f'(\lambda, x^i, y^i)\, x^i y^i \le \frac{1}{n} \sum_i f(w_n(\lambda)^T x^i y^i - 1),
\]

which is equivalent to:

\[
\frac{1}{n} \sum_i f(w_*(\lambda)^T x^i y^i - 1) + \frac{\lambda}{2} \|w_*(\lambda)\|_2^2 + (w_n(\lambda) - w_*(\lambda))^T \Big[ \frac{1}{n} \sum_i f'(\lambda, x^i, y^i)\, x^i y^i + \lambda w_*(\lambda) \Big] + \frac{\lambda}{2} \|w_n(\lambda) - w_*(\lambda)\|_2^2 \le \frac{1}{n} \sum_i f(w_n(\lambda)^T x^i y^i - 1) + \frac{\lambda}{2} \|w_n(\lambda)\|_2^2.
\]
²For readers not familiar with the subgradient concept in convex analysis, our analysis requires little modification if we replace f with a smoother convex function such as f², which avoids the discontinuity in the first order derivative.
Also note that by the definition of w_n(λ), we have:

\[
\frac{1}{n} \sum_i f(w_n(\lambda)^T x^i y^i - 1) + \frac{\lambda}{2} \|w_n(\lambda)\|_2^2 \le \frac{1}{n} \sum_i f(w_*(\lambda)^T x^i y^i - 1) + \frac{\lambda}{2} \|w_*(\lambda)\|_2^2.
\]

Therefore, by comparing the above two inequalities, we obtain:

\[
\frac{\lambda}{2} \|w_n(\lambda) - w_*(\lambda)\|_2^2 \le -(w_n(\lambda) - w_*(\lambda))^T \Big[ \frac{1}{n} \sum_i f'(\lambda, x^i, y^i)\, x^i y^i + \lambda w_*(\lambda) \Big].
\]

Using the first order condition (8) and the Cauchy-Schwarz inequality, we therefore have

\[
\|w_n(\lambda) - w_*(\lambda)\|_2 \le \frac{2}{\lambda} \Big\| \frac{1}{n} \sum_i f'(\lambda, x^i, y^i)\, x^i y^i - E_D f'(\lambda, x, y)\, x y \Big\|_2. \tag{9}
\]
Note that in (9), we have already bounded the convergence of w_n(λ) to w_*(λ) in terms of the convergence of the empirical expectation of a random vector f'(λ, x, y) x y to its mean. In order to obtain a large deviation bound on the convergence rate, we need the following result, which can be found in [13], page 95:

Theorem 3.1 Let ξ_1, …, ξ_n be zero-mean independent random vectors in a Hilbert space. If there exists M > 0 such that for all natural numbers l ≥ 2: (1/n) Σ_{i=1}^n E ||ξ_i||_2^l ≤ (l!/2) M^l, then for all δ > 0:

\[
P\Big( \Big\| \frac{1}{n} \sum_{i=1}^n \xi_i \Big\|_2 \ge \delta \Big) \le 2 \exp\Big( -\frac{n \delta^2}{2(M^2 + \delta M)} \Big).
\]
Using the fact that f'(λ, x, y) ∈ [−1, 0], it is easy to verify the following corollary by using Theorem 3.1 and (9), where we also bound the l-th moment of the right hand side of (9) using the following form of Jensen's inequality: |a + b|^l ≤ 2^{l−1}(|a|^l + |b|^l) for l ≥ 2.

Corollary 3.1 If there exists M > 0 such that for all natural numbers l ≥ 2: E_D ||x||_2^l ≤ (l!/2) M^l, then for all δ > 0:

\[
P\big( \|w_n(\lambda) - w_*(\lambda)\|_2 \ge \delta \big) \le 2 \exp\Big( -\frac{n \lambda^2 \delta^2}{8(4M^2 + \lambda \delta M)} \Big).
\]
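To get a feel for such a tail bound, one can evaluate a bound of the form 2 exp(−nλ²δ²/(8(4M² + λδM))) numerically. The constants follow the reconstruction given here and should be treated as illustrative; the essential feature is that, for fixed λ, δ, and M, the bound decays exponentially in n:

```python
import math

def deviation_bound(n, lam, delta, M):
    # Tail bound of the form in Corollary 3.1 (constants illustrative):
    # 2 exp(-n lam^2 delta^2 / (8 (4 M^2 + lam delta M))).
    return 2.0 * math.exp(-n * lam ** 2 * delta ** 2
                          / (8.0 * (4.0 * M ** 2 + lam * delta * M)))

bounds = [deviation_bound(n, lam=0.1, delta=0.5, M=1.0)
          for n in (10 ** 3, 10 ** 4, 10 ** 5)]
```

Since the exponent is linear in n, multiplying n by 10 raises the factor e^{−cn} to the tenth power, so the bound shrinks super-polynomially once nc is of order one.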
Let P_D(·) denote the probability with respect to the distribution D; then the following bound on the expected misclassification error of the computed hyperplane w_n(λ) is a straightforward consequence of Corollary 3.1:

Corollary 3.2 Under the assumptions of Corollary 3.1, for any non-random values λ, γ, K > 0, we have:

\[
E_X P_D(w_n(\lambda)^T x y \le 0) \le P_D(w_*(\lambda)^T x y < \gamma) + P_D(\|x\|_2 > K) + 2 \exp\Big( -\frac{n \lambda^2 \gamma^2}{8(4K^2 M^2 + \lambda \gamma K M)} \Big),
\]

where the expectation E_X is taken over n random samples from D, with w_n(λ) estimated from the n samples.
We now consider linearly separable classification problems where the solution w_* of (5) is finite. Throughout the rest of this section, we impose an additional assumption that the distribution D is finitely supported: ||x||_2 ≤ M almost everywhere with respect to the measure D.

From Section 3.1, we know that for any sufficiently small positive number λ, ||w_* − w_*(λ)||_2 < 1/M. Since E_D f(w_*^T x y − 1) = 0 and f ≥ 0, we have w_*^T x y ≥ 1 almost everywhere, so w_*(λ) also separates the in-class data from the out-of-class data: w_*(λ)^T x y ≥ 1 − M||w_* − w_*(λ)||_2 > 0 almost everywhere. Therefore for sufficiently small λ, we can define:

\[
\gamma(\lambda) = \sup\{ b : P_D(w_*(\lambda)^T x y \le b) = 0 \} \ge 1 - M \|w_* - w_*(\lambda)\|_2 > 0.
\]

By Corollary 3.2 (applied with K = M, so that P_D(||x||_2 > K) = 0, and γ = γ(λ)), we obtain the following upper bound on the misclassification error if we compute a linear separator from (3) with a small nonzero regularization parameter λ:

\[
E_X P_D(w_n(\lambda)^T x y \le 0) \le 2 \exp\Big( -\frac{n \lambda^2 \gamma(\lambda)^2}{8(4M^4 + \lambda \gamma(\lambda) M^2)} \Big).
\]
This indicates that the expected misclassification error of an appropriately computed hyperplane for a linearly separable problem decays exponentially in n. However, the rate of convergence depends on λγ(λ)/M². This quantity is different from the margin concept which has been widely used in the literature to characterize the generalization behavior of a linear classification problem. The new quantity measures the convergence rate of w_*(λ) to w_* as λ → 0: the faster the convergence, the more "robust" the linear classification problem is, and hence the faster the exponential decay of the misclassification error. As we shall see in the next section, this "robustness" is related to the degree of outliers in the problem.
4 Example 
We give an example to illustrate the "robustness" concept that characterizes the exponential 
decay of misclassification error. It is known from Vapnik's cross-validation bound in [11] 
(Theorem 10.7) that by using the large margin idea alone, one can derive an expected 
misclassification error bound that is of the order O(1/n), where the constant is margin 
dependent. We show that this bound is tight by using the following example. 
Example 4.1 Consider a two-dimensional problem. Assume that with probability 1 − γ, we observe a data point x with label y such that x y = [1, 0]; and with probability γ, we observe a data point x with label y such that x y = [−1, 1]. This problem is obviously linearly separable with a large margin that is γ independent.

Now, for n random training data, with probability at most γ^n + (1 − γ)^n, we observe either x^i y^i = [1, 0] for all i = 1, …, n, or x^i y^i = [−1, 1] for all i = 1, …, n. For all other cases, the computed optimal hyperplane w_n = w_*. This means that the expected misclassification error is γ(1 − γ)(γ^{n−1} + (1 − γ)^{n−1}). This error converges to zero exponentially as n → ∞. However, the convergence rate depends on the fraction of outliers in the distribution, characterized by γ.

In particular, for any n, if we let γ = 1/n, then we have an expected misclassification error that is at least (1 − 1/n)^n / n ≈ 1/(en). □
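The closed-form error in Example 4.1 is easy to probe numerically (a plain-Python sketch; the specific n and γ values are arbitrary):

```python
import math

def expected_error(n, gamma):
    # Expected misclassification error from Example 4.1: the estimated
    # hyperplane errs only when all n samples hit the same support point,
    # giving gamma * (1 - gamma) * (gamma^(n-1) + (1 - gamma)^(n-1)).
    return gamma * (1.0 - gamma) * (gamma ** (n - 1) + (1.0 - gamma) ** (n - 1))

# For fixed gamma the error decays exponentially in n ...
fixed_gamma = [expected_error(n, 0.1) for n in (10, 20, 40)]

# ... but with gamma = 1/n it behaves like 1/(e n): n * error -> 1/e.
scaled = [n * expected_error(n, 1.0 / n) for n in (10 ** 2, 10 ** 4, 10 ** 6)]
```

For fixed γ = 0.1 the dominant term is 0.09 · 0.9^{n−1}, so doubling n from 20 to 40 multiplies the error by almost exactly 0.9^20, while the γ = 1/n sequence stalls at the 1/(en) rate.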
The above tightness construction of the linear decay rate of the expected generalization error (using the margin concept alone) requires a scenario in which a small fraction of the data (on the order of the inverse sample size) is very different from the rest. This small portion of data can be considered as outliers, which can be measured by the "robustness" of the distribution. In general, w_*(λ) converges to w_* slowly when there exists such a small portion of data (outliers) that cannot be correctly classified from the observation of the remaining data. It can be seen that the optimal hyperplane in (1) is quite sensitive to even a single outlier. Intuitively, this instability is quite undesirable. However, previous large margin learning bounds seem to have dismissed this concern. This paper indicates that such a concern is still valid. In the worst case, even if the problem is separable by a large margin, outliers can still cause a slowdown of the exponential convergence rate.
5 Conclusion 
In this paper, we derived new generalization bounds for large margin linearly separable classification. Even though we have only discussed the consequences of this analysis for separable problems, the technique can be easily applied to non-separable problems (see Corollary 3.2). For large margin separable problems, we show that exponential decay of the generalization error may be achieved with an appropriately chosen regularization parameter. However, the bound depends on a quantity which characterizes the robustness of the distribution. An important difference between the robustness concept and the margin concept is that outliers may not be observable with large probability from data, while the margin generally will be. This implies that without any prior knowledge, it could be difficult to directly apply our bound using only the observed data.
References

[1] J. K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhys. Lett., 10(7):687-692, 1989.
[2] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[3] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[4] Harro G. Heuser. Functional Analysis. John Wiley & Sons Ltd., Chichester, 1982. Translated from the German by John Horváth, A Wiley-Interscience Publication.
[5] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. In Lecture Notes in Physics, volume 368, pages 175-188. Springer-Verlag, 1990.
[6] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1-64, 1997.
[7] M. Opper. Learning times of neural networks: Exact solution for a perceptron algorithm. Phys. Rev. A, 38(7):3824-3826, 1988.
[8] M. Opper. Learning in neural networks: Solvable dynamics. Europhysics Letters, 8(4):389-392, 1989.
[9] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[10] Dale Schuurmans. Characterizing rational versus exponential learning curves. J. Comput. Syst. Sci., 55:140-160, 1997.
[11] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[12] Robert C. Williamson, Alexander J. Smola, and Bernhard Schölkopf. Entropy numbers of linear function classes. In COLT'00, pages 309-319, 2000.
[13] Vadim Yurinsky. Sums and Gaussian Vectors. Springer-Verlag, Berlin, 1995.
[14] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999. Abstract in NIPS'99, pp. 370-376.
