Fast Training of Support Vector Classifiers 
F. Pérez-Cruz†, P. L. Alarcón-Diana†, A. Navia-Vázquez‡ and A. Artés-Rodríguez‡
†Dpto. Teoría de la Señal y Com., Escuela Politécnica, Universidad de Alcalá.
28871-Alcalá de Henares (Madrid) Spain. e-mail: fernando@tsc.uc3m.es
‡Dpto. Tecnologías de las comunicaciones, Escuela Politécnica Superior,
Universidad Carlos III de Madrid, Avda. Universidad 30, 28911-Leganés (Madrid) Spain.
Abstract 
In this communication we present a new algorithm for solving Support 
Vector Classifiers (SVC) with large training data sets. The new algorithm 
is based on an Iterative Re-Weighted Least Squares procedure which is 
used to optimize the SVC. Moreover, a novel sample selection strategy 
for the working set is presented, which randomly chooses the working 
set among the training samples that do not fulfill the stopping criteria. 
The validity of both proposals, the optimization procedure and sample 
selection strategy, is shown by means of computer experiments using 
well-known data sets. 
1 INTRODUCTION 
The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition prob- 
lems [13, 14] in such a way that the solution is completely described as a linear combination 
of several training samples, named the Support Vectors. The training procedure for solving 
the SVC is usually based on Quadratic Programming (QP) which presents some inherent 
limitations, mainly the computational complexity and memory requirements for large train- 
ing data sets. This problem is typically avoided by dividing the QP problem into sets of 
smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for 
the whole set of training samples. These schemes rely on an optimizing engine, QP, and on
the sample selection strategy for each sub-problem, in order to obtain a fast solution for the
SVC.
An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as 
an alternative solver for the SVC [10] and the Support Vector Regressor [9], being compu- 
tationally efficient in absolute terms. In this communication, we will show that the IRWLS 
algorithm can replace the QP one in any chunking scheme in order to find the SVC solution 
for large training data sets. Moreover, we consider that the strategy to decide which training 
samples must join the working set is critical to reduce the total number of iterations needed 
to attain the SVC solution, and the runtime complexity as a consequence. To address this
issue, the computer program SVCradit has been developed to solve the SVC for
large training data sets using the IRWLS procedure and fixed-size working sets.
The paper is organized as follows. In Section 2, we start by giving a summary of the
IRWLS procedure for SVC and explain how it can be incorporated into a chunking scheme
to obtain an overall implementation which efficiently deals with large training data sets.
We present in Section 3 a novel strategy to make up the working set. Section 4 shows the
capabilities of the new implementation and compares them with the fastest available
SVC implementation, SVMlight [6]. We end with some concluding remarks.
2 IRWLS-SVC 
In order to solve classification problems, the SVC has to minimize 
$$L_P = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \mu_i \xi_i - \sum_i \alpha_i \left( y_i(\phi^T(x_i) w + b) - 1 + \xi_i \right) \tag{1}$$
with respect to $w$, $b$ and $\xi_i$ and maximize it with respect to $\alpha_i$ and $\mu_i$, subject to $\alpha_i, \mu_i \geq 0$,
where $\phi(\cdot)$ is a nonlinear transformation (usually unknown) to a higher dimensional space
and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker 
(KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey 
by Burges [2] and to the work of Vapnik [13, 14]. 
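In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that
the terms depending on $\xi_i$ can be removed because, at the solution, $C - \alpha_i - \mu_i = 0 \;\; \forall i$
(one of the KKT conditions [2]) must hold.
$$L_P = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \left( 1 - y_i(\phi^T(x_i) w + b) \right) = \frac{1}{2}\|w\|^2 + \frac{1}{2}\sum_i a_i e_i^2 \tag{2}$$
where
$$e_i = y_i - (\phi^T(x_i) w + b) \quad \text{and} \quad a_i = \frac{2\alpha_i}{1 - y_i(\phi^T(x_i) w + b)}$$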
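The step from the multiplier term of the Lagrangian to the weighted quadratic term can be checked sample by sample. The short derivation below is only a sketch: it assumes labels $y_i \in \{-1, +1\}$ (so that $y_i^2 = 1$), which the text does not state explicitly, and it uses the definitions of $e_i$ and $a_i$ given above.

```latex
\begin{align*}
y_i e_i &= y_i^2 - y_i\left(\phi^T(x_i) w + b\right) = 1 - y_i\left(\phi^T(x_i) w + b\right), \\
e_i^2 &= (y_i e_i)^2, \\
\tfrac{1}{2}\, a_i e_i^2 &= \tfrac{1}{2}\,\frac{2\alpha_i}{y_i e_i}\,(y_i e_i)^2
  = \alpha_i\, y_i e_i
  = \alpha_i\left(1 - y_i\left(\phi^T(x_i) w + b\right)\right).
\end{align*}
```

Summing the last line over $i$ recovers the multiplier term that remains in the Lagrangian once the $\xi_i$ terms have cancelled.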
The weighted least squares nature of (2) can be understood if $e_i$ is defined as the error on
each sample and $a_i$ as its associated weight, where $\frac{1}{2}\|w\|^2$ is a regularizing functional. The
minimization of (2) cannot be accomplished in a single step because $a_i = a_i(e_i)$, and we
need to apply an IRWLS procedure [4], summarized below in three steps:
1. Considering the $a_i$ fixed, minimize (2).
2. Recalculate $a_i$ from the solution of step 1.
3. Repeat until convergence.
In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure
does, we require that $w = \sum_i \beta_i y_i \phi(x_i)$ and, in order to obtain a non-zero $b$, that
$\sum_i \beta_i y_i = 0$. Substituting them into (2), its minimum with respect to $\beta_i$ and $b$ for a fixed
set of $a_i$ is found by solving the following linear equation system¹
$$\begin{bmatrix} H + D_a^{-1} & y \\ y^T & 0 \end{bmatrix} \begin{bmatrix} \beta \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{1} \\ 0 \end{bmatrix} \tag{3}$$
¹ The detailed description of the steps needed to obtain (3) from (2) can be found in [10].
where
$$y = [y_1, y_2, \ldots, y_n]^T \tag{4}$$
$$(H)_{ij} = y_i y_j \phi^T(x_i)\phi(x_j) = y_i y_j K(x_i, x_j) \quad \forall i, j = 1, \ldots, n \tag{5}$$
$$(D_a)_{ij} = a_i \delta[i - j] \quad \forall i, j = 1, \ldots, n \tag{6}$$
$$\beta = [\beta_1, \beta_2, \ldots, \beta_n]^T \tag{7}$$
and $\delta[\cdot]$ is the discrete impulse function. Finally, the dependency of $a_i$ upon the Lagrange
multipliers is eliminated using the KKT conditions, obtaining
$$a_i = \begin{cases} 0, & e_i y_i < 0 \\ \dfrac{C}{e_i y_i}, & e_i y_i \geq 0 \end{cases} \tag{8}$$
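One pass of the IRWLS steps (solve the linear system for fixed weights, recompute the errors, recompute the weights) can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function names `solve_weighted_system`, `errors` and `update_weights`, and the small constant `tiny` that guards the division in the weight update, are assumptions introduced here, and the weights passed to the solver are assumed to be strictly positive.

```python
import numpy as np

def solve_weighted_system(H, y, a):
    """Solve the bordered linear system for beta and b with the weights a held fixed.

    Assumes every a_i > 0, so that D_a^{-1} = diag(1 / a_i) exists.
    """
    n = y.size
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = H + np.diag(1.0 / a)   # H + D_a^{-1}
    A[:n, n] = y                       # last column: y
    A[n, :n] = y                       # last row: y^T
    rhs = np.concatenate([np.ones(n), [0.0]])
    sol = np.linalg.solve(A, rhs)      # dense LU-based solve
    return sol[:n], sol[n]

def errors(H, y, beta, b):
    """e_i = y_i - (sum_j beta_j y_j K_ij + b), written with (H)_ij = y_i y_j K_ij."""
    return y - y * (H @ beta) - b

def update_weights(e, y, C, tiny=1e-12):
    """Weight update: 0 where e_i y_i < 0, C / (e_i y_i) otherwise (denominator floored at tiny)."""
    ey = e * y
    return np.where(ey < 0.0, 0.0, C / np.maximum(ey, tiny))
```

Samples whose weight becomes zero cannot be passed back through `1.0 / a`, which is one practical reason for removing them from the linear system, as the division of the training samples into sets does in Section 2.1.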
2.1 IRWLS ALGORITHMIC IMPLEMENTATION 
The SVC solution with the IRWLS procedure can be simplified by dividing the training
samples into three sets. The first set, $S_1$, contains the training samples verifying $0 <
\beta_i < C$, which have to be determined by solving (3). The second one, $S_2$, includes every
training sample whose $\beta_i = 0$. And the last one, $S_3$, is made up of the training samples
whose $\beta_i = C$. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm
is shown in Table 1.
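0. Initialization:
   $S_1$ will contain every training sample, $S_2 = \emptyset$ and $S_3 = \emptyset$. Compute $H$.
   $e_a = y$, $\beta_a = 0$, $b_a = 0$, $G_{13} = G_{in}$, $a = \mathbf{1}$ and $G_{b3} = G_{bin}$.
1. Solve
   $$\begin{bmatrix} (H)_{S_1,S_1} + (D_a)^{-1}_{S_1,S_1} & (y)_{S_1} \\ (y)^T_{S_1} & 0 \end{bmatrix} \begin{bmatrix} (\beta)_{S_1} \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{1} - G_{13} \\ G_{b3} \end{bmatrix},$$
   $(\beta)_{S_2} = 0$ and $(\beta)_{S_3} = C$.
2. $e = e_a - D_y H (\beta - \beta_a) - (b - b_a)\mathbf{1}$, where $(D_y)_{ij} = y_i \delta[i - j]$.
3. $a_i = \begin{cases} 0, & e_i y_i < 0 \\ C/(e_i y_i), & e_i y_i \geq 0 \end{cases} \quad \forall i \in S_1 \cup S_2 \cup S_3$
4. Sets reordering:
   a. Move every sample in $S_3$ with $e_i y_i < 0$ to $S_2$.
   b. Move every sample in $S_1$ with $\beta_i \geq C$ to $S_3$.
   c. Move every sample in $S_1$ with $a_i = 0$ to $S_2$.
   d. Move every sample in $S_2$ with $a_i \neq 0$ to $S_1$.
5. $e_a = e$, $\beta_a = \beta$, $G_{13} = (H)_{S_1,S_3}(\beta)_{S_3} + G_{in}$,
   $b_a = b$ and $G_{b3} = -y^T_{S_3}(\beta)_{S_3} + G_{bin}$.
6. Go to step 1 and repeat until convergence.
Table 1: IRWLS-SVC algorithm.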
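The linear system of step 1 only involves the samples in the first set; the samples anchored at $C$ enter through the right-hand side. The sketch below illustrates that reduced solve under stated assumptions: the name `solve_step1`, the index-list arguments `S1` and `S3`, and the optional `G_in` / `Gb_in` arguments (the contribution of an inactive set, zero by default) are hypothetical names chosen for this example, and the weights of the samples in `S1` are assumed strictly positive.

```python
import numpy as np

def solve_step1(H, y, a, S1, S3, C, G_in=None, Gb_in=0.0):
    """Solve for beta on S1 and for b, with beta = C on S3 and beta = 0 elsewhere."""
    S1 = np.asarray(S1, dtype=int)
    S3 = np.asarray(S3, dtype=int)
    n, m = y.size, S1.size
    if G_in is None:
        G_in = np.zeros(m)
    beta = np.zeros(n)
    beta[S3] = C
    # Contribution of the samples fixed at C to the independent term.
    G13 = H[np.ix_(S1, S3)] @ beta[S3] + G_in
    Gb3 = -(y[S3] @ beta[S3]) + Gb_in
    # Bordered system restricted to S1.
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = H[np.ix_(S1, S1)] + np.diag(1.0 / a[S1])
    A[:m, m] = y[S1]
    A[m, :m] = y[S1]
    rhs = np.concatenate([1.0 - G13, [Gb3]])
    sol = np.linalg.solve(A, rhs)
    beta[S1] = sol[:m]
    return beta, sol[m]
```

With `S3` empty and `S1` containing every sample, this reduces to the full system solved at initialization; the size of the system shrinks as samples leave the first set.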
The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunking
scheme such as the one proposed in [8, 6], so that it can also be directly applied in the one
proposed in [1]. A chunking scheme is needed to solve the SVC whenever $H$ is too large
to fit into memory. In those cases, several SVCs with a reduced set of training samples are
iteratively solved until the solution for the whole set is found. The samples are divided into
a working set, $S_w$, which is solved as a full SVC problem, and an inactive set, $S_{in}$. If there
are support vectors in the inactive set, as may happen, the inactive set modifies the IRWLS-SVC
procedure, adding a contribution to the independent term in the linear equation system
(3). Those support vectors in $S_{in}$ can be seen as anchored samples in $S_3$, because their $\beta_i$ is
not zero and cannot be modified by the IRWLS procedure. This contribution ($G_{in}$ and $G_{bin}$)
is calculated in the same way as $G_{13}$ and $G_{b3}$ (Table 1, 5th step), before calling the
IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider
$G_{in}$ and $G_{bin}$, which must be set to zero if the Hessian matrix, $H$, fits into memory for the
whole set of training samples.
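The contribution of the inactive set can be computed without forming the full matrix $H$, since only the inactive samples with non-zero $\beta_i$ contribute. The following sketch shows one way to do this; the function name `inactive_contribution`, the `kernel` callable and the index arguments `work` and `inactive` are assumptions made for the example and do not come from the original program.

```python
import numpy as np

def inactive_contribution(kernel, X, y, beta, work, inactive):
    """Return (G_in, G_bin) for a working set given the inactive samples.

    kernel(A, B) is assumed to return the matrix of kernel values between
    the rows of A and the rows of B.
    """
    work = np.asarray(work, dtype=int)
    inactive = np.asarray(inactive, dtype=int)
    sv = inactive[beta[inactive] != 0.0]          # only support vectors contribute
    K = kernel(X[work], X[sv])                    # |S_w| x |support vectors| block
    H_block = (y[work][:, None] * y[sv][None, :]) * K
    G_in = H_block @ beta[sv]
    G_bin = -float(y[sv] @ beta[sv])
    return G_in, G_bin
```

Restricting the kernel evaluations to the inactive support vectors keeps the cost of this step proportional to the working set size times the number of such support vectors, rather than to the size of the whole training set.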
The resolution of the SVC for large training data sets, employing the IRWLS procedure as
minimization engine, is summarized in the following steps:
1. Select the samples that will form the working set.
2. Construct $G_{in} = (H)_{S_w,S_{in}}(\beta)_{S_{in}}$ and $G_{bin} = -y^T_{S_{in}}(\beta)_{S_{in}}$.
3. Solve the IRWLS-SVC procedure, following the steps in Table 1.
4. Compute the error of every training sample.
5. If the stopping conditions
   $$y_i e_i < \epsilon \quad \forall i \mid \beta_i = 0 \tag{9}$$
   $$y_i e_i > -\epsilon \quad \forall i \mid \beta_i = C \tag{10}$$
   $$|e_i y_i| < \epsilon \quad \forall i \mid 0 < \beta_i < C \tag{11}$$
   are fulfilled, the SVC solution has been reached; otherwise, go back to step 1.
The stopping conditions are the ones proposed in [6] and $\epsilon$ must be a small value around
$10^{-3}$; a full discussion concerning this topic can be found in [6].
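A check of the stopping conditions can be sketched as follows. The sketch assumes conditions of the form $y_i e_i < \epsilon$ for $\beta_i = 0$, $y_i e_i > -\epsilon$ for $\beta_i = C$ and $|y_i e_i| < \epsilon$ for $0 < \beta_i < C$; the function name `violates_stopping` and the exact comparisons against `0.0` and `C` are choices made for this illustration.

```python
import numpy as np

def violates_stopping(e, y, beta, C, eps=1e-3):
    """Boolean mask marking the samples that violate a stopping condition."""
    ey = e * y
    at_zero = beta == 0.0
    at_C = beta == C
    inside = (beta > 0.0) & (beta < C)
    return (
        (at_zero & (ey >= eps))             # beta_i = 0 requires y_i e_i < eps
        | (at_C & (ey <= -eps))             # beta_i = C requires y_i e_i > -eps
        | (inside & (np.abs(ey) >= eps))    # 0 < beta_i < C requires |y_i e_i| < eps
    )
```

The solution is reached when `violates_stopping(...).any()` is false; the samples flagged by the mask are natural candidates for the next working set.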
3 SAMPLE SELECTION STRATEGY 
The selection of the training samples that will constitute the working set in each iteration
is the most critical decision in any chunking scheme, because it directly determines
the number of IRWLS-SVC (or QP-SVC) procedures to be called and the
number of reproducing kernel evaluations to be made, which are, by far, the two most
time-consuming operations in any chunking scheme.
In order to solve the SVC efficiently, we first need to define a candidate set of training
samples to form the working set in each iteration. The candidate set will be made up,
naturally, of all the training samples that violate the stopping conditions
(9)-(11); we will also add all those training samples that satisfy condition (11) but for
which a small variation in their error would make them violate it.
The strategies to select the working set are as numerous as the number of problems to be
solved, but one can think of three simple strategies:
• Select those samples which do not fulfill the stopping criteria and present the
largest $|e_i|$ values.
• Select those samples which do not fulfill the stopping criteria and present the
smallest $|e_i|$ values.
• Select them randomly from the ones that do not fulfill the stopping conditions.
The first strategy seems the most natural one and it was proposed in [6]. If the samples with
the largest $|e_i|$ are selected, we guarantee that the attained solution gives the greatest step towards the
solution of (1). But if the step is too large, which usually happens, it will cause the solution
in each iteration and the $\beta_i$ values to oscillate around their optimal values. The magnitude of
this effect is directly proportional to the value of $C$ and $q$ (size of the working set), so in
the case of small $C$ ($C < 10$) and low $q$ ($q < 20$) it would be less noticeable.
The second one is the most conservative strategy because we will be moving towards the 
solution of (1) with small steps. Its drawback is readily discerned if the starting point is 
inappropriate, needing too many iterations to reach the SVC solution. 
The last strategy, which has been implemented together with the IRWLS-SVC procedure,
is a mid-point between the other two, but if the number of samples with $0 < \beta_i < C$
increases above $q$ there might be some iterations where we make no progress (the working
set is only made up of training samples that fulfill the stopping condition in (11)). This
situation is easily avoided by introducing one sample that violates each one of the stopping
conditions per class. Finally, if the cardinality of the candidate set is less than $q$, the working
set is completed with those samples that fulfill the stopping criteria and present
the smallest $|e_i|$.
In summary, the proposed sample selection strategy is²:
1. Construct the candidate set, $S_c$, with those samples that do not fulfill stopping
conditions (9) and (10), and those samples whose $\beta_i$ obeys $0 < \beta_i < C$.
2. If $|S_c| < q$ go to 5.
3. Choose a sample per class that violates each one of the stopping conditions and
move them from $S_c$ to the working set, $S_w$.
4. Choose randomly $q - |S_w|$ samples from $S_c$ and move them to $S_w$. Go to Step 6.
5. Move every sample from $S_c$ to $S_w$; the $q - |S_w|$ samples that fulfill the stopping
conditions (9) and (10) and present the lowest $|e_i|$ values are then used to complete $S_w$.
6. Go on, obtaining $G_{in}$ and $G_{bin}$.
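The selection steps above can be sketched in Python. This is an illustrative reading of the strategy, not the code of the original program: the function name `select_working_set`, the use of NumPy's random generator, and the choice of picking the per-class violators at random in step 3 are assumptions made here, and the stopping conditions are assumed to be $y_i e_i < \epsilon$ for $\beta_i = 0$, $y_i e_i > -\epsilon$ for $\beta_i = C$ and $|y_i e_i| < \epsilon$ otherwise.

```python
import numpy as np

def select_working_set(e, y, beta, C, q, eps=1e-3, rng=None):
    """Return the indices of a working set of (at most) q samples."""
    rng = np.random.default_rng() if rng is None else rng
    ey = e * y
    inside = (beta > 0.0) & (beta < C)
    viol9 = (beta == 0.0) & (ey >= eps)
    viol10 = (beta == C) & (ey <= -eps)
    viol11 = inside & (np.abs(ey) >= eps)
    # Step 1: candidates are the violators of (9), (10) and every 0 < beta_i < C.
    cand = np.flatnonzero(viol9 | viol10 | inside)
    # Steps 2 and 5: too few candidates, so complete the working set with the
    # remaining samples that present the lowest |e_i|.
    if cand.size < q:
        rest = np.setdiff1d(np.arange(y.size), cand)
        rest = rest[np.argsort(np.abs(e[rest]))]
        return np.concatenate([cand, rest[: q - cand.size]])
    # Step 3: one violator of each condition per class, when one exists.
    work = []
    for viol in (viol9, viol10, viol11):
        for label in (-1.0, 1.0):
            idx = np.flatnonzero(viol & (y == label))
            if idx.size and len(work) < q:
                work.append(int(rng.choice(idx)))
    # Step 4: fill the rest of the working set at random from the candidates.
    remaining = np.setdiff1d(cand, work)
    extra = rng.choice(remaining, size=q - len(work), replace=False)
    return np.concatenate([np.array(work, dtype=int), extra])
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes the random choices reproducible between runs.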
4 BENCHMARK FOR THE IRWLS-SVC 
We have prepared two different experiments to test both the IRWLS and the sample selection
strategy for solving the SVC. The first one compares the IRWLS against QP and the
second one compares the sample selection strategy, together with the IRWLS, against a
complete solving procedure for SVC, SVMlight.
In the first trial, we have replaced the LOQO interior point optimizer used by SVMlight
version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing
engines with the same sample selection strategy. The comparison has been made on a Pentium
III-450MHz with 128Mb running Windows 98, and the programs have been compiled
using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first
one, containing 4781 training samples, needs most CPU resources to compute the RKHS
and the second one, containing 2175 training samples, uses most CPU resources to solve
the SVC for each $S_w$, where $q$ indicates the size of the working set. The value of $C$ has
been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has
been employed, where its parameter $\sigma$ has been set, respectively, to 10 and 70.

           Adult4 (4781 samples)              Splice (2175 samples)
       CPU time       Optimize time       CPU time       Optimize time
 q    LOQO   IRWLS    LOQO   IRWLS       LOQO   IRWLS    LOQO   IRWLS
20    21.25  20.70    0.61   0.39        46.19  30.76    21.94  4.77
40    20.60  19.22    1.01   0.17        71.34  24.93    46.26  8.07
70    21.15  18.72    2.30   0.46        53.77  20.32    34.24  7.72

Table 2: CPU time indicates the time consumed, in seconds, by the whole procedure.
Optimize time indicates the time consumed, in seconds, by the LOQO or IRWLS procedure.

² In what follows, $|\cdot|$ represents absolute value for numbers and cardinality for sets.
As can be seen, SVMlight with IRWLS is significantly faster than with the LOQO
procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for
both procedures. The results in Table 2 validate the IRWLS procedure as the fastest SVC
solver.
For the second trial, we have compiled a computer program that uses the IRWLS-SVC
procedure and the working set selection in Section 3; we will refer to it as SVCradit
from now on. We have borrowed the chunking and shrinking ideas from SVMlight
[6] for our computer program. To test these two programs several data sets have been
used. The Adult and Web data sets have been obtained from J. Platt's web page
http://research.microsoft.com/~jplatt/smo.html; the Gauss-M data set is a two-dimensional
classification problem proposed in [3] to test neural networks, which comprises one
Gaussian random variable per class, with highly overlapping classes. The Banana, Diabetes and
Splice data sets have been obtained from Gunnar Rätsch's web page
http://svm.first.gmd.de/~raetsch/. The selection of $C$ and the RKHS has been done as indicated
in [11] for the Adult and Web data sets and in http://svm.first.gmd.de/~raetsch/ for the Banana,
Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set,
where the value of $q$ has been chosen as the one that gives the lowest runtime.
Database   Dim   N Sampl.   C       σ    SV      q               CPU time (s)
                                                 radit   light   radit      light
Adult6     123   11221      1       10   4477    150     40      118.2      124.46
Adult9     123   32562      1       10   12181   130     70      1093.29    1097.09
Adult1     123   1605       1000    10   630     100     10      25.98      113.54
Web1       300   2477       5       10   224     100     10      2.42       2.36
Web7       300   24693      5       10   1444    150     10      158.13     124.57
Gauss-M    2     4000       1       1    1736    70      10      12.69      48.28
Gauss-M    2     4000       100     1    1516    100     10      61.68      3053.20
Banana     2     400        316.2   1    80      40      70      0.33       0.77
Banana     2     4900       316.2   1    1084    70      40      22.46      1786.56
Diabetes   8     768        10      2    409     40      10      2.41       6.04
Splice     69    2175       1000    70   525     150     20      14.06      49.19

Table 3: Runtime complexity for several data sets when solved with SVCradit (radit for
short) and SVMlight (light for short).
One can appreciate that SVCradit is faster than SVMlight for most data sets. For
the Web data set, which is the only data set for which SVMlight is slightly faster, the value
of $C$ is low and most training samples end up as support vectors with $\beta_i < C$. In such
cases the best strategy is to take the largest step towards the solution in every iteration, as
SVMlight does [6], because the $\beta_i$ of most training samples will not be affected by the
$\beta_j$ values of the other training samples. But in those cases where the value of $C$ increases,
the SVCradit sample selection strategy is much more appropriate than the one used in SVMlight.
5 CONCLUSIONS 
In this communication a new algorithm for solving the SVC for large training data sets
has been presented. Its two major contributions deal with the optimizing engine and the
sample selection strategy. An IRWLS procedure is used to solve the SVC in each step,
which is much faster than the usual QP procedure, and simpler to implement, because the
most difficult step is the solution of the linear equation system, which can be easily obtained
by means of LU decomposition [12]. The random working set selection from the samples not
fulfilling the KKT conditions is the best option if the working set is large, because it reduces the
number of chunks to be solved. This strategy benefits from the IRWLS procedure, which
makes it possible to work with large training data sets. All these modifications have been
implemented in the SVCradit solving procedure, publicly available at http://svm.tsc.uc3m.es/.
6 ACKNOWLEDGEMENTS 
We are sincerely grateful to Thorsten Joachims, who has allowed and encouraged us to
use his SVMlight to test our IRWLS procedure; these comparisons could not have been
properly done otherwise.
References 
[1] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin
classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburgh,
U.S.A., 1992.
[2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] S. Haykin. Neural Networks: A comprehensive foundation. Prentice-Hall, 1994.
[4] P. W. Holland and R. E. Welch. Robust regression using iterative re-weighted least
squares. Communications in Statistics: Theory and Methods, A6(9):813-27, 1977.
[5] T. Joachims. http://www-ai.informatik.uni-dortmund.de/forschung/verfahren/svm_light/svm_light.eng.html.
Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research
Center on 'Complexity Reduction in Multivariate Data', 1998.
[6] T. Joachims. Making Large Scale SVM Learning Practical. In Advances in Kernel
Methods: Support Vector Learning, Editors Schölkopf, B., Burges, C. J. C. and
Smola, A. J., pages 169-184. M.I.T. Press, 1999.
[7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support
vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal
Processing, pages 276-285, Amelia Island, U.S.A., 1997.
[8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines.
In ICPR'98, Brisbane, Australia, August 1998.
[9] F. Pérez-Cruz, A. Navia-Vázquez, P. L. Alarcón-Diana, and A. Artés-Rodríguez. An
IRWLS procedure for SVR. In Proceedings of EUSIPCO'00, Tampere, Finland,
September 2000.
[10] F. Pérez-Cruz, A. Navia-Vázquez, J. L. Rojo-Álvarez, and A. Artés-Rodríguez. A new
training algorithm for support vector machines. In Proceedings of the Fifth Bayona
Workshop on Emerging Technologies in Telecommunications, volume 1, pages 116-120,
Baiona, Spain, September 1999.
[11] J. C. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support
Vector Machines. In Advances in Kernel Methods: Support Vector Learning, Editors
Schölkopf, B., Burges, C. J. C. and Smola, A. J., pages 185-208. M.I.T. Press, 1999.
[12] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes
in C. Cambridge University Press, Cambridge, UK, 2nd edition, 1994.
[13] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[14] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
