Regularized Winnow Methods 
Tong Zhang 
Mathematical Sciences Department 
IBM T.J. Watson Research Center 
Yorktown Heights, NY 10598 
tzhang@watson.ibm.com 
Abstract 
In theory, the Winnow multiplicative update has certain advantages over 
the Perceptron additive update when there are many irrelevant attributes. 
Recently, there has been much effort on enhancing the Perceptron algo- 
rithm by using regularization, leading to a class of linear classification 
methods called support vector machines. Similarly, it is also possible to 
apply the regularization idea to the Winnow algorithm, which gives meth- 
ods we call regularized Winnows. We show that the resulting methods 
compare with the basic Winnows in a similar way that a support vector 
machine compares with the Perceptron. We investigate algorithmic is- 
sues and learning properties of the derived methods. Some experimental 
results will also be provided to illustrate different methods. 
1 Introduction 
In this paper, we consider the binary classification problem of determining a label 
y ∈ {−1, 1} associated with an input vector x. A useful method for solving this problem is 
through linear discriminant functions, which consist of linear combinations of the compo- 
nents of the input variable. Specifically, we seek a weight vector w and a threshold θ such 
that w^T x < θ if the label is y = −1 and w^T x > θ if the label is y = 1. Given a training set of 
labeled data (x^1, y^1), ..., (x^n, y^n), a number of approaches to finding linear discriminant 
functions have been advanced over the years. In this paper, we are especially interested in 
the following two families of online algorithms: Perceptron [12] and Winnow [10]. These 
algorithms typically fix the threshold θ and update the weight vector w by going through 
the training data repeatedly. They are mistake driven in the sense that the weight vector is 
updated only when the algorithm is not able to correctly classify an example. 
For the Perceptron algorithm, the update rule is additive: if the linear discriminant function 
misclassifies an input training vector x^i with true label y^i, then we update each component 
j of the weight vector w as: w_j ← w_j + η x_j^i y^i, where η > 0 is a parameter called the 
learning rate. The initial weight vector can be taken as w = 0. 
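A minimal sketch of this additive update, assuming a fixed threshold of 0 and repeated sweeps over the training data (the function name `perceptron_train` and its defaults are our own, not from the paper):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=100):
    """Mistake-driven Perceptron with additive updates and fixed threshold 0.

    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}.
    On each misclassified example, w_j <- w_j + eta * x_j^i * y^i.
    """
    n, d = X.shape
    w = np.zeros(d)                      # initial weight vector w = 0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # mistake (threshold fixed at 0)
                w += eta * y[i] * X[i]   # additive update
                mistakes += 1
        if mistakes == 0:                # training set is separated
            break
    return w
```

On linearly separable data the loop terminates after finitely many updates, in line with the mistake bound discussed below.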
For the (unnormalized) Winnow algorithm (with positive weights), the update rule is mul- 
tiplicative: if the linear discriminant function misclassifies an input training vector x^i 
with true label y^i, then we update each component j of the weight vector w as: w_j ← 
w_j exp(η x_j^i y^i), where η > 0 is the learning rate parameter, and the initial weight vector 
can be taken as w_j = μ_j > 0. The Winnow algorithm belongs to a general family of algo- 
rithms called exponentiated gradient descent with unnormalized weights (EGU) [9]. There 
can be several variants. One is called balanced Winnow, which is equivalent to an embed- 
ding of the input space into a higher dimensional space as x̃ = [x, −x]. This modification 
allows the positive-weight Winnow algorithm for the augmented input x̃ to have the effect 
of both positive and negative weights for the original input x. Another modification is to 
normalize the one-norm of the weight so that Σ_j w_j = W, leading to the normalized 
Winnow. 
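The multiplicative update, the balanced embedding x̃ = [x, −x], and the optional one-norm normalization Σ_j w_j = W can be sketched together as follows (all names and default values are our own choices):

```python
import numpy as np

def winnow_train(X, y, eta=0.1, mu=0.01, epochs=100, balanced=True, W=None):
    """Mistake-driven (unnormalized) Winnow with multiplicative updates.

    On each mistake, w_j <- w_j * exp(eta * x_j^i * y^i).  Weights stay
    positive; balanced=True applies the embedding x -> [x, -x] so the
    effect of negative weights on the original input is recovered.
    If W is given, the one-norm of w is rescaled to W after each update
    (normalized Winnow).
    """
    if balanced:
        X = np.hstack([X, -X])               # balanced embedding
    n, d = X.shape
    w = np.full(d, mu)                       # prior mu > 0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:
                w *= np.exp(eta * y[i] * X[i])   # multiplicative update
                if W is not None:
                    w *= W / w.sum()             # keep sum_j w_j = W
                mistakes += 1
        if mistakes == 0:
            break
    return w
```

Prediction on a new point x uses the same embedding, i.e. the sign of w^T [x, −x].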
Theoretical properties of multiplicative update algorithms have been extensively studied 
since the introduction of Winnow. For linearly separable binary-classification problems, 
both Perceptron and Winnow are able to find a weight that separates the in-class vectors from 
the out-of-class vectors in the training set within a finite number of steps. However, the 
number of mistakes (updates) before finding a separating hyperplane can be very different 
[10, 9]. This difference suggests that the two algorithms serve for different purposes. 
For linearly separable problems, Vapnik proposed a method that optimizes the Perceptron 
mistake bound which he calls "optimal hyperplane" (see [15]). The same method has also 
appeared in the statistical mechanical learning literature (see [1, 8, 11]), and is referred 
to as achieving optimal stability. For non-separable problems, a generalization of optimal 
hyperplane was proposed in [2] by introducing a "soft-margin" loss term. In this paper, we 
derive regularized Winnow methods by constructing "optimal hyperplanes" that minimize 
the Winnow mistake bound (rather than the Perceptron mistake bound as in an SVM). We 
then derive a "soft-margin" version of the algorithms for non-separable problems. 
For simplicity, we shall assume θ = 0 in this paper. The restriction does not cause problems 
in practice since one can always append a constant feature to the input data x, which offsets 
the effect of θ. The formulation with θ = 0 can be more amenable to theoretical analysis. 
For an SVM, a fixed threshold also allows a simple Perceptron-like numerical algorithm as 
described in chapter 12 of [13], and in [7]. Although more complex, a non-fixed θ does not 
introduce any fundamental difficulty. 
The paper is organized as follows. In Section 2, we review mistake bounds for Perceptron 
and Winnow. Based on the bounds, we show how regularized Winnow methods can be 
derived by mimicking the optimal stability method (and SVM) for Perceptron. We also 
discuss the relationship of the newly derived methods with related methods. In Section 3, 
we investigate learning aspects of the newly proposed methods in a context similar to some 
known SVM results. An example will be given in Section 4 to illustrate these methods. 
2 SVM and regularized Winnow 
2.1 From Perceptron to SVM 
We review the derivation of SVM from Perceptron, which serves as a reference for our 
derivation of regularized Winnow. Consider linearly separable problems and let w be 
a weight that separates the in-class vectors from the out-of-class vectors in the training 
set. It is well known that the Perceptron algorithm computes a weight that correctly 
classifies all training data after at most M updates (a proof can be found in [15]), where 

    M = ||w||_2^2 max_i ||x^i||_2^2 / (min_i w^T x^i y^i)^2. 

The weight vector w_* that minimizes the right-hand side of the bound is called the 
optimal hyperplane in [15] or the optimal stability hyperplane in [1, 8, 11]. This optimal 
hyperplane is the solution to the following quadratic programming problem: 

    min_w (1/2) w^T w   s.t. w^T x^i y^i ≥ 1 for i = 1, ..., n. 
For non-separable problems, we introduce a slack variable ξ_i for each data point (x^i, y^i) 
(i = 1, ..., n), and compute a weight vector w_*(C) that solves 

    min_w (1/2) w^T w + C Σ_i ξ_i   s.t. w^T x^i y^i ≥ 1 − ξ_i, ξ_i ≥ 0 for i = 1, ..., n, 

where C > 0 is a given parameter [15]. It is known that when C → ∞, ξ_i → 0 and w_*(C) 
converges to the weight vector w_* of the optimal hyperplane. We can write down the KKT 
conditions for the above optimization problem, and let α^i be the Lagrangian multiplier for 
w^T x^i y^i ≥ 1 − ξ_i. After elimination of w and ξ, we obtain the following dual optimization 
problem in the dual variable α (see [15], chapter 10 for details): 

    max_α Σ_i α^i − (1/2) ||Σ_i α^i x^i y^i||_2^2   s.t. α^i ∈ [0, C] for i = 1, ..., n. 
The weight w_*(C) is given by w_*(C) = Σ_i α^i x^i y^i at the optimal solution. To solve this 
problem, one can use the following modification of the Perceptron update algorithm (see 
[7] and chapter 12 of [13]): at each data point (x^i, y^i), we fix all α^k with k ≠ i, and update 
α^i to maximize the dual objective functional, which gives: 

    α^i ← max(min(C, α^i + η(1 − w^T x^i y^i)), 0), 

where w = Σ_k α^k x^k y^k. The learning rate η can be set as η = 1/(x^iT x^i), which corresponds 
to the exact maximization of the dual objective functional. 
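The clipped coordinate-wise dual update above can be sketched as follows; maintaining w = Σ_k α^k x^k y^k incrementally as each α^i changes is our own implementation choice, and all names are ours:

```python
import numpy as np

def svm_dual_train(X, y, C=1.0, epochs=100):
    """Coordinate ascent on the SVM dual with fixed threshold 0.

    Sweeps the data; at (x^i, y^i) it takes the exact maximizing step
    eta = 1 / (x^iT x^i), clipped to [0, C]:
        alpha_i <- max(min(C, alpha_i + eta * (1 - y_i w^T x_i)), 0),
    where w = sum_k alpha_k x^k y^k is kept in sync incrementally.
    """
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            eta = 1.0 / (X[i] @ X[i])
            new_a = np.clip(alpha[i] + eta * (1.0 - y[i] * (w @ X[i])), 0.0, C)
            w += (new_a - alpha[i]) * y[i] * X[i]   # keep w consistent with alpha
            alpha[i] = new_a
    return w, alpha
```

For separable data and a large C, the returned weight approaches the optimal stability hyperplane.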
2.2 From Winnow to regularized Winnow 
Similar to Perceptron, if a problem is linearly separable with a positive weight w, then 
Winnow computes a solution that correctly classifies all training data after at most M up- 
dates with 

    M = 2W (Σ_j w_j ln(w_j/μ_j) − ||w||_1 + ||μ||_1) max_i ||x^i||_∞^2 / δ^2, 

where 0 < δ ≤ min_i w^T x^i y^i, W ≥ ||w||_1, and the learning rate is η = δ/(W max_i ||x^i||_∞^2). 
The proof of this specific bound can be found in [16], which employed techniques in [5] 
(also see [10] for earlier results). Note that unlike the Perceptron mistake bound, the above 
bound is learning-rate dependent. It also depends on the prior μ_j > 0, which is the initial 
value of w_j in the basic Winnows. 
For problems separable with positive weights, to obtain an optimal stability hyperplane 
associated with the Winnow mistake bound, we consider fixing ||w||_1 such that ||w||_1 = 
W > 0. It is then natural to define the optimal hyperplane as the (positive weight) solution 
to the following convex programming problem: 

    min_{w ≥ 0} Σ_j w_j ln(w_j/(e μ_j))   s.t. w^T x^i y^i ≥ 1 for i = 1, ..., n. 
We use e to denote the base of the natural logarithm. Similar to the derivation of SVM, for 
non-separable problems, we introduce a slack variable ξ_i for each data point (x^i, y^i), and 
compute a weight vector w_*(C) that solves 

    min_{w ≥ 0} Σ_j w_j ln(w_j/(e μ_j)) + C Σ_i ξ_i   s.t. w^T x^i y^i ≥ 1 − ξ_i, ξ_i ≥ 0 for i = 1, ..., n, 

where C > 0 is a given parameter. Note that to derive the above methods, we have assumed 
that ||w||_1 is fixed at ||w||_1 = ||μ||_1 = W, where W is a given parameter. This implies that 
the derived methods are in fact regularized versions of the normalized Winnow. One can 
also ignore this normalization constraint so that the derived methods correspond to regular- 
ized versions of the unnormalized Winnow. The entropy regularization condition is natural 
to all exponentiated gradient methods [9], as can be observed from the theoretical results 
in [9]. The regularized normalized Winnow is closely related to the maximum entropy 
discrimination [6] (the two methods are almost identical for linearly separable problems). 
However, in the framework of maximum entropy discrimination, the Winnow connection 
is non-obvious. As we shall show later, it is possible to derive interesting learning bounds 
for our methods that are connected with the Winnow mistake bound. 
Similar to the SVM formulation, the non-separable formulation of regularized Winnow 
approaches the separable formulation as C → ∞. We shall thus only focus on the non- 
separable case below. Also similar to an SVM, we can write down the KKT conditions and 
let α^i be the Lagrangian multiplier for w^T x^i y^i ≥ 1 − ξ_i. After elimination of w and ξ, 
we obtain (the algebra resembles that of [15], chapter 10, which we shall skip due to the 
limitation of space) the following dual formulation for regularized unnormalized Winnow: 

    max_α Σ_i α^i − Σ_j μ_j exp(Σ_i α^i x_j^i y^i)   s.t. α^i ∈ [0, C] for i = 1, ..., n. 

The j-th component of the weight w_*(C) is given by w_*(C)_j = μ_j exp(Σ_i α^i x_j^i y^i) at the 
optimal solution. For regularized normalized Winnow with ||w||_1 = W > 0, we obtain 

    max_α Σ_i α^i − W ln(Σ_j μ_j exp(Σ_i α^i x_j^i y^i))   s.t. α^i ∈ [0, C] for i = 1, ..., n. 

The weight w_*(C) is given by w_*(C)_j = W μ_j exp(Σ_i α^i x_j^i y^i) / Σ_k μ_k exp(Σ_i α^i x_k^i y^i) 
at the optimal solution. 
Similar to the Perceptron-like update rule for the dual SVM formulation, it is possible to 
derive Winnow-like update rules for the regularized Winnow formulations. At each data 
point (x^i, y^i), we fix all α^k with k ≠ i, and update α^i to maximize the dual objective 
functionals. We shall not try to derive an analytical solution, but rather use a gradient 
ascent method with a learning rate η: α^i ← α^i + η ∂L_D/∂α^i, where we use L_D to denote 
the dual objective function to be maximized. η can be either fixed as a small number or 
computed by Newton's method. It is not hard to verify that we obtain the following 
update rule for regularized unnormalized Winnow: 

    α^i ← max(min(C, α^i + η(1 − w^T x^i y^i)), 0), 

where w_j = μ_j exp(Σ_k α^k x_j^k y^k). This gradient ascent on the dual variable gives an EGU 
rule as in [9]. Compared with the SVM dual update rule, which is a soft-margin version 
of the Perceptron update rule, this method naturally corresponds to a soft-margin version 
of the unnormalized Winnow update. Similarly, we obtain the following dual update rule for 
regularized normalized Winnow: 

    α^i ← max(min(C, α^i + η(1 − w^T x^i y^i)), 0), 

where w_j = W μ_j exp(Σ_k α^k x_j^k y^k) / Σ_l μ_l exp(Σ_k α^k x_l^k y^k). Again, this rule (which is 
an EG rule in [9]) can be naturally regarded as the soft-margin version of the normalized 
Winnow update. In our experience, these update rules are numerically very efficient. Note 
that for regularized normalized Winnow, the normalization constant W needs to be care- 
fully chosen based on the data. For example, if the data is infinity-norm bounded by 1, then 
it does not seem appropriate to choose W ≤ 1: since |w^T x| ≤ ||w||_1 ||x||_∞ ≤ 1, a hyperplane 
with ||w||_1 ≤ 1 does not achieve a reasonable margin. This problem is less crucial for 
unnormalized Winnow, but the norm of the initial weight μ still affects the solution. 
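A sketch of the soft-margin dual update for regularized unnormalized Winnow, with a fixed learning rate η; all names are our own, and a balanced embedding would be applied to the inputs beforehand:

```python
import numpy as np

def reg_winnow_dual_train(X, y, C=1.0, mu=0.1, eta=0.1, epochs=200):
    """Gradient-ascent dual update for regularized unnormalized Winnow.

    Maintains z_j = sum_k alpha_k x_j^k y^k so that w_j = mu * exp(z_j),
    and sweeps the data with the clipped EGU-style step
        alpha_i <- max(min(C, alpha_i + eta * (1 - y_i w^T x_i)), 0).
    A common prior mu_j = mu is assumed for simplicity.
    """
    n, d = X.shape
    alpha = np.zeros(n)
    z = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            w = mu * np.exp(z)                       # current primal weight
            new_a = np.clip(alpha[i] + eta * (1.0 - y[i] * (w @ X[i])), 0.0, C)
            z += (new_a - alpha[i]) * y[i] * X[i]    # keep z in sync
            alpha[i] = new_a
    return mu * np.exp(z), alpha
```

The normalized variant differs only in how w is computed from z (rescaling μ exp(z) to one-norm W).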
Besides maximum entropy discrimination which is closely related to regularized normal- 
ized Winnow, a large margin version of unnormalized Winnow has also been proposed 
based on some heuristics [3, 4]. However, their algorithm was purely mistake driven without 
dual variables α^i (the algorithm does not compute an optimal stability hyperplane for 
the Winnow mistake bound). In addition, they did not include a regularization parameter 
C which in practice may be important for non-separable problems. 
3 Some statistical properties of regularized Winnows 
In this section, we derive some learning bounds based on our formulations that minimize 
the Winnow mistake bound. The following result is an analogy of a leave-one-out cross- 
validation bound for separable SVMs -- Theorem 10.7 in [15]. 
Theorem 3.1 The expected misclassification error err_n with respect to the true distribution, 
obtained by using the hyperplane w from the linearly separable (C = ∞) unnormal- 
ized regularized Winnow algorithm with n training samples, is bounded by 

    err_n ≤ (1/(n + 1)) E min(K, 1.5 W (Σ_j w_j ln(w_j/μ_j)) max_i ||x^i||_∞^2), 

where the right-hand side expectation is taken with respect to n + 1 random samples 
(x^1, y^1), ..., (x^{n+1}, y^{n+1}), and K is the number of support vectors of the solution. Let 
w be the optimal solution using all the samples, with dual α^i for i = 1, ..., n + 1, and let 
w^i be the weight obtained from setting α^i = 0; then W ≥ max_i(||w||_1, ||w^i||_1). 
Proof Sketch. We only describe the major steps due to the limitation of space. Denote by 
w^i the weight obtained from the optimal solution by removing (x^i, y^i) from the training 
sample. Similar to the proof of Theorem 10.7 in [15], we need to bound the leave-one- 
out cross-validation error, which is at most K. Also note that the leave-one-out cross- 
validation error is at most |{k : ||w^k − w||_1 ||x^k||_∞ ≥ 1}|. We then use the following 
two inequalities: ||w^k − w||_1^2 ≤ 2W Σ_j (w_j^k − w_j − w_j ln(w_j^k/w_j)), and a bound on 
Σ_j (w_j^k − w_j − w_j ln(w_j^k/w_j)) in terms of the dual variable α^k; the latter can be obtained by 
comparing the dual objective functionals and by using the corresponding KKT condition 
of the dual problem. The remaining problem is now reduced to proving that |{k : Σ_j (w_j^k − 
w_j − w_j ln(w_j^k/w_j)) ≥ 1/(2W ||x^k||_∞^2)}| ≤ 1.5 W (Σ_j w_j ln(w_j/μ_j)) max_i ||x^i||_∞^2. For the dual formulation, 
by summing over the index k the KKT first-order condition with respect to the dual α^k, 
multiplied by α^k, one obtains Σ_k α^k = Σ_j w_j ln(w_j/μ_j). We thus only need to show that if 
Σ_j (w_j^k − w_j − w_j ln(w_j^k/w_j)) ≥ 1/(2W ||x^k||_∞^2), then α^k ≥ 2/(3W ||x^k||_∞^2). This can be 
checked directly through a Taylor expansion. □ 
By using the same technique, we may also obtain a bound for regularized normalized Win- 
now. One disadvantage of the above bound is that it is the expectation of a random estimator 
that is no better than the leave-one-out cross-validation error based on observed data. How- 
ever, the bound does convey some useful information: for example, we can observe that 
the expected misclassification error (learning curve) converges at a rate of O(1/n) as long 
as W (Σ_j w_j ln(w_j/μ_j)) and sup ||x||_∞ are reasonably bounded. 
It is also not difficult to obtain interesting PAC style bounds by using the covering number 
result for entropy regularization in [16] and ideas in [14]. Although the PAC analysis would 
imply a slightly suboptimal learning curve of O(log n/n) for linearly separable problems, 
the bound itself provides a probability confidence and can be generalized to non-separable 
problems. We state below an example for non-separable problems, which justifies the 
entropy regularization. The bound itself is a direct consequence of Theorem 2.2 and a 
covering number result with entropy regularization in [16]. Note that as in [14], the square 
root can be removed if k_γ = 0; γ can also be made data-dependent. 
Theorem 3.2 If the data is infinity-norm bounded as ||x||_∞ ≤ b, then consider the family 
Γ of hyperplanes w such that ||w||_1 ≤ a and Σ_j w_j ln(w_j/μ_j) ≤ c. Denote by err(w) 
the misclassification error of w with respect to the true distribution. Then there is a constant C such 
that for any γ > 0, with probability 1 − η over n random samples, any w ∈ Γ satisfies: 

    err(w) ≤ k_γ/n + sqrt((C/n) ((b^2 (a^2 + ac)/γ^2) ln(n/γ + 2) + ln(1/η))), 

where k_γ = |{i : w^T x^i y^i < γ}| is the number of samples with margin less than γ. 
4 An example 
We use an artificial dataset to show that a regularized Winnow can enhance a Winnow 
just like an SVM can enhance a Perceptron. In addition, it shows that for problems with 
many irrelevant features, the Winnow algorithms are superior to the Perceptron family 
algorithms. 
The data in this experiment are generated as follows. We select an input data dimension 
d, with d = 500 or d = 5000. The first 5 components of the target linear weight w are 
set to ones; the 6th component is -1; and the remaining components are zeros. The linear 
threshold 0 is 2. Data are generated as random vectors with each component randomly 
chosen to be either 0 or 1 with probability 0.5 each. Five percent of the data are given wrong 
labels. The remaining data are given correct labels, but we remove data with margins that 
are less than 1. One thousand training and one thousand test data are generated. 
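The generation recipe above can be sketched as follows (the function name, the seed handling, and the rejection loop are our own choices):

```python
import numpy as np

def make_dataset(n, d, noise=0.05, seed=0):
    """Artificial data as described: the first five target weights are 1,
    the sixth is -1, the rest are 0; the threshold theta is 2.  Inputs are
    random 0/1 vectors; 5% of points keep a flipped label, and correctly
    labeled points with margin below 1 are rejected.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    w[:5] = 1.0
    w[5] = -1.0
    theta = 2.0
    X_out, y_out = [], []
    while len(X_out) < n:
        x = rng.integers(0, 2, size=d).astype(float)   # random 0/1 vector
        margin = w @ x - theta
        y = 1.0 if margin > 0 else -1.0
        if rng.random() < noise:
            X_out.append(x); y_out.append(-y)          # 5% wrong labels kept
        elif abs(margin) >= 1.0:                       # drop small-margin points
            X_out.append(x); y_out.append(y)
    return np.array(X_out), np.array(y_out)
```

With n = 1000 and d = 500 or 5000 this reproduces the setup of the experiment.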
We shall only consider balanced versions of the Winnows. We also compensate for the effect 
of θ by appending a constant 1 to each data point, as mentioned earlier. We use UWin 
and NWin to denote the basic unnormalized and normalized Winnows respectively. LM- 
UWin and LM-NWin denote the corresponding large margin versions. The SVM style large 
margin Perceptron is denoted as LM-Perc. We use 200 iterations over the training data for 
all algorithms. The initial values for the Winnows are set to be the priors: μ_j = 0.01. 
For online algorithms, we fix the learning rates at 0.01. For large margin Winnows, we 
use learning rates η = 0.01 in the gradient ascent update. For the (2-norm regularized) large 
margin Perceptron, we use the exact update, which corresponds to the choice η = 1/(x^iT x^i). 
Accuracies (in percentage) of different methods are listed in Table 1. For regularization 
methods, accuracies are reported with the optimal regularization parameters. The superior- 
ity of the regularized Winnows is obvious, especially for high dimensional data. Accuracies 
of regularized algorithms with different regularization parameters are plotted in Figure 1. 
These behaviors are very typical for regularized algorithms. In practice, the optimal regu- 
larization parameter can be found by cross-validation. 
dimension  Perceptron  LM-Perc  UWin  LM-UWin  NWin  LM-NWin 
500        82.2        87.1     82.4  94.0     82.4  94.3 
5000       67.9        69.8     69.7  87.4     69.7  88.6 

Table 1: Testset accuracy (in percentage) on the artificial dataset 
Figure 1: Testset accuracy (in percentage) as a function of the regularization parameter λ 
(left: d = 500; right: d = 5000) 
5 Conclusion 
In this paper, we derived regularized versions of Winnow online update algorithms. We 
studied algorithmic and theoretical properties of the newly obtained algorithms, and com- 
pared them to the Perceptron family algorithms. Experimental results indicated that for 
problems with many irrelevant features, the Winnow family algorithms are superior to Per- 
ceptron family algorithms. This is consistent with the implications from both the online 
learning theory, and learning bounds obtained in this paper. 
References 
[1] J.K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhys. 
Lett., 10(7):687-692, 1989. 
[2] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 
1995. 
[3] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In 
Proceedings of the Second Conference on Empirical Methods in NLP, 1997. 
[4] A. Grove and D. Roth. Linear concepts and hidden variables. Machine Learning, 
2000. To appear; early version appeared in NIPS-10. 
[5] A.J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear 
discriminant updates. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 
171-183, 1997. 
[6] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. 
In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information 
Processing Systems 12, pages 470-476. MIT Press, 2000. 
[7] T.S. Jaakkola, Mark Diekhans, and D. Haussler. A discriminative framework for 
detecting remote protein homologies. Journal of Computational Biology, to appear. 
[8] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. In Lecture 
Notes in Physics, volume 368, pages 175-188. Springer-Verlag, 1990. 
[9] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for 
linear prediction. Journal of Information and Computation, 132:1-64, 1997. 
[10] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear- 
threshold algorithm. Machine Learning, 2:285-318, 1988. 
[11] M. Opper. Learning times of neural networks: exact solution for a perceptron algo- 
rithm. Phys. Rev. A, 38(7):3824-3826, 1988. 
[12] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain 
Mechanisms. Spartan, New York, 1962. 
[13] Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola, editors. Ad- 
vances in Kernel Methods: Support Vector Learning. The MIT Press, 1999. 
[14] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk 
minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926- 
1940, 1998. 
[15] V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998. 
[16] Tong Zhang. Analysis of regularized linear functions for classification problems. 
Technical Report RC-21572, IBM, 1999. Abstract in NIPS'99, pp. 370-376. 
