A Mathematical Programming Approach to the Kernel Fisher Algorithm
Sebastian Mika*, Gunnar Rätsch*, and Klaus-Robert Müller*+
* GMD FIRST.IDA, Kekuléstraße 7, 12489 Berlin, Germany
+ University of Potsdam, Am Neuen Palais 10, 14469 Potsdam
{mika, raetsch, klaus}@first.gmd.de
Abstract 
We investigate a new kernel-based classifier: the Kernel Fisher Discrim- 
inant (KFD). A mathematical programming formulation based on the ob- 
servation that KFD maximizes the average margin permits an interesting 
modification of the original KFD algorithm yielding the sparse KFD. We 
find that both KFD and the proposed sparse KFD can be understood
in a unifying probabilistic context. Furthermore, we show connections
to Support Vector Machines and Relevance Vector Machines. From this 
understanding, we are able to outline an interesting kernel-regression 
technique based upon the KFD algorithm. Simulations support the use- 
fulness of our approach. 
1 Introduction 
Recent years have shown an enormous interest in kernel-based classification algorithms, 
primarily in Support Vector Machines (SVM) [2]. The success of SVMs seems to be trig- 
gered by (i) their good generalization performance, (ii) the existence of a unique solution, 
and (iii) the strong theoretical background: structural risk minimization [12], supporting 
the good empirical results. One of the key ingredients responsible for this success is the 
use of Mercer kernels, allowing for nonlinear decision surfaces which even might incorpo- 
rate some prior knowledge about the problem to solve. For our purpose, a Mercer kernel 
can be defined as a function k : ℝⁿ × ℝⁿ → ℝ for which some (nonlinear) mapping
Φ : ℝⁿ → F into a feature space F exists, such that k(x, y) = (Φ(x) · Φ(y)). Clearly, the
use of such kernel functions is not limited to SVMs. The interpretation as a dot-product 
in another space makes it particularly easy to develop new algorithms: take any (usually) 
linear method and reformulate it using training samples only in dot-products, which are 
then replaced by the kernel. Examples thereof, among others, are Kernel-PCA [9] and the 
Kernel Fisher Discriminant (KFD [4]; see also [8, 1]). 
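As a minimal illustration of this kernel trick (our own sketch, not part of the original text), the homogeneous polynomial kernel of degree 2 on ℝ² evaluates the dot product under the explicit feature map Φ(x) = (x₁², x₂², √2·x₁x₂) without ever forming Φ:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Homogeneous polynomial kernel k(x, y) = (x . y)^d."""
    return float(np.dot(x, y)) ** d

def phi(x):
    """Explicit feature map for d = 2 on R^2:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so that
    (phi(x) . phi(y)) = (x . y)^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
# the kernel and the explicit feature-space dot product agree (both 16.0)
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```

Any algorithm written purely in terms of such dot products can thus be "kernelized" by substituting k for them.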
In this article we consider algorithmic ideas for KFD. Interestingly KFD - although ex- 
hibiting a similarly good performance as SVMs - has no explicit concept of a margin. This 
is noteworthy since the margin is often regarded as explanation for good generalization 
in SVMs. We will give an alternative formulation of KFD which makes the difference 
between both techniques explicit and allows a better understanding of the algorithms. An- 
other advantage of the new formulation is that we can derive more efficient algorithms for 
optimizing KFDs, that have e.g. sparseness properties or can be used for regression. 
2 A Review of Kernel Fisher Discriminant 
The idea of the KFD is to solve the problem of Fisher's linear discriminant in a kernel 
feature space F, thereby yielding a nonlinear discriminant in the input space. First we
fix some notation. Let {x_i | i = 1, …, ℓ} be our training sample and y ∈ {−1, 1}^ℓ be
the vector of corresponding labels. Furthermore, define 1 ∈ ℝ^ℓ as the vector of all ones,
1₁, 1₂ ∈ ℝ^ℓ as binary (0, 1) vectors corresponding to the class labels, and let I, I₁, and I₂
be appropriate index sets over all samples and the two classes, respectively (with ℓ = |I| and ℓ_k = |I_k|).
In the linear case, Fisher's discriminant is computed by maximizing the coefficient J(w) =
(wᵀ S_B w)/(wᵀ S_W w) of between and within class variance, i.e. S_B = (m₂ − m₁)(m₂ −
m₁)ᵀ and S_W = Σ_{k=1,2} Σ_{i∈I_k} (x_i − m_k)(x_i − m_k)ᵀ, where m_k denotes the sample
mean for class k. To solve the problem in a kernel feature space F one needs a formulation
which makes use of the training samples only in terms of dot-products. One first shows 
[4] that there exists an expansion for w ∈ F in terms of mapped training patterns, i.e.
w = Σ_{i∈I} α_i Φ(x_i).    (1)
Using some straightforward algebra, the optimization problem for the KFD can then be
written as [5]:
max_α  J(α) = (αᵀ M α)/(αᵀ N α),    (2)
where μ_k = (1/ℓ_k) K 1_k, N = K Kᵀ − Σ_{k=1,2} ℓ_k μ_k μ_kᵀ, μ = μ₂ − μ₁, M = μ μᵀ, and
K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j).
The projection of a test point onto the discriminant 
is computed by (w · Φ(x)) = Σ_{i∈I} α_i k(x_i, x). As the dimension of the feature space is
usually much higher than the number of training samples ℓ, some form of regularization
is necessary. In [4] it was proposed to add e.g. the identity or the kernel matrix K to N,
penalizing ‖α‖² or ‖w‖², respectively (see also [3]).
There are several equivalent ways to optimize (2). One could either solve the generalized 
eigenproblem Mα = λNα, selecting the eigenvector α with maximal eigenvalue λ, or
compute α ∝ N⁻¹(μ₂ − μ₁). Another way, which will be detailed in the following, exploits
the special structure of problem (2). 
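To make this review concrete, the following sketch builds K, μ₁, μ₂, and N from data and solves α ∝ N⁻¹(μ₂ − μ₁) with a multiple of the identity added to N. The RBF kernel, its width, and the midpoint threshold are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def rbf(a, b, width=1.0):
    # Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / width)
    return np.exp(-np.sum((a - b) ** 2) / width)

def kfd_fit(X, y, width=1.0, reg=1e-3):
    """KFD as in Sec. 2: alpha = N^{-1}(mu2 - mu1), with reg*I added
    to N (penalizing ||alpha||^2). Returns alpha and a bias placing
    the decision threshold midway between the projected class means."""
    l = len(y)
    K = np.array([[rbf(a, b, width) for b in X] for a in X])
    N = K @ K.T
    mu = {}
    for c in (-1, 1):
        idx = np.flatnonzero(y == c)
        mu[c] = K[:, idx].mean(axis=1)
        N -= len(idx) * np.outer(mu[c], mu[c])
    N += reg * np.eye(l)                       # regularization
    alpha = np.linalg.solve(N, mu[1] - mu[-1])
    proj = K @ alpha                           # projections of training data
    b = -0.5 * (proj[y == 1].mean() + proj[y == -1].mean())
    return alpha, b

def kfd_predict(alpha, b, X_train, X_new, width=1.0):
    # (w . Phi(x)) = sum_i alpha_i k(x_i, x), then threshold at 0
    k = np.array([[rbf(x, xi, width) for xi in X_train] for x in X_new])
    return np.sign(k @ alpha + b)
```

On two well-separated clusters this recovers the labels of the training sample; note that every α_i is in general non-zero, which is precisely the sparsity issue addressed below.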
3 Casting KFD into a Quadratic Program 
Although there exist many efficient off-the-shelf eigensolvers or Cholesky packages
which could be used to optimize (2), two problems remain: for a large sample size ℓ
the matrices N and M become unpleasantly large and the solutions α are non-sparse (with
no obvious way to introduce sparsity in e.g. the matrix inverse). In the following we show 
how KFD can be cast as a convex quadratic programming problem. This new formulation 
will prove helpful in solving the problems mentioned above and makes it much easier to 
gain a deeper understanding of KFD. 
As a first step we exploit the facts that the matrix M is only rank one, i.e. αᵀ M α =
(αᵀ(μ₂ − μ₁))², and that with α any multiple of α is an optimal solution to (2). Thus we
may fix αᵀ(μ₂ − μ₁) to any non-zero value, say 2, and minimize αᵀ N α. This amounts to
the following quadratic program:
min_α  αᵀ N α + C · P(α)    (3)
subject to:  αᵀ(μ₂ − μ₁) = 2.    (3a)
The regularization formerly incorporated in N is made explicitly visible here through the
operator P, where C is a regularization constant. This program still makes use of the
rather un-intuitive matrix N. This can be avoided by our final reformulation, which can
be understood as follows: Fisher's Discriminant tries to minimize the variance of the data 
along the projection whilst maximizing the distance between the average outputs for each 
class. Considering the argumentation leading to (3) the following quadratic program does 
exactly this: 
min_{α,b,ξ}  ‖ξ‖² + C · P(α)    (4)
subject to:  K α + 1 b = y + ξ,    (4a)
             1_kᵀ ξ = 0  for k = 1, 2,    (4b)
for α, ξ ∈ ℝ^ℓ and b, C ∈ ℝ, C ≥ 0. The constraint (4a), which can be read as
(w · Φ(x_i)) + b = y_i + ξ_i for all i ∈ I, pulls the output for each sample to its class label. The
term ‖ξ‖² minimizes the variance of the error committed, while the constraints (4b) ensure
that the average output for each class is the label, i.e. for ±1 labels the average distance of
the projections is two. The following proposition establishes the link to KFD: 
Proposition 1. For a given C ≥ 0, any optimal solution α to the optimization problem (3)
is also optimal for (4) and vice versa. 
The formal, rather straightforward but lengthy, proof of Proposition 1 is omitted here. It 
shows (i) that the feasible sets of (3) and (4) are identical with respect to α and (ii) that the
objective functions coincide. Formulation (4) has a number of appealing properties which 
we will exploit in the following. 
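For the standard choice of quadratic loss and P(α) = ‖α‖², formulation (4) is an equality-constrained QP and can be solved directly through its KKT conditions. The sketch below is our own illustration: it eliminates ξ via (4a) and solves the resulting KKT system in one linear solve:

```python
import numpy as np

def kfd_qp(K, y, C=1.0):
    """Solve (4) with quadratic loss and P(alpha) = ||alpha||^2 by
    eliminating xi = K alpha + 1 b - y via (4a), leaving an
    equality-constrained QP in z = (alpha, b) with objective
    ||A z - y||^2 + C ||alpha||^2 and the two class constraints (4b),
    whose KKT conditions form one linear system."""
    l = len(y)
    A = np.column_stack([K, np.ones(l)])          # z = (alpha, b) -> K alpha + 1 b
    H = 2.0 * (A.T @ A)                           # Hessian of ||A z - y||^2
    H[:l, :l] += 2.0 * C * np.eye(l)              # + C ||alpha||^2 (alpha block only)
    g = 2.0 * A.T @ y
    # (4b): 1_k^T (A z - y) = 0 for each class k
    G = np.stack([(y == c).astype(float) @ A for c in (-1, 1)])
    h = np.array([(y == c).astype(float) @ y for c in (-1, 1)])
    KKT = np.block([[H, G.T], [G, np.zeros((2, 2))]])
    sol = np.linalg.solve(KKT, np.concatenate([g, h]))
    alpha, b = sol[:l], sol[l]
    xi = A @ sol[:l + 1] - y
    return alpha, b, xi
```

By construction the solution satisfies the zero-mean constraints: the average output over each class equals its label.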
4 A Probabilistic Interpretation 
We would like to point out the following connection (which is not specific to the formu- 
lation (4) of KFD): The Fisher discriminant is the Bayes optimal classifier for two normal 
distributions with equal covariance (i.e. KFD is Bayes optimal for two Gaussians in feature
space). To see this connection to Gaussians, consider a regression onto the labels of the
form (w. (x)) + b, where w is given by (1). Assuming a Gaussian noise model with 
variance 0- the likelihood can be written as 
p(gtla, a 2 -- exp(--- 
20-2 
1 
y((w . (xi)) + b - yi) 2) = exp(- I1112. 
i 
Now, assume some prior p(α | θ) over the weights with hyper-parameters θ. Comput-
ing the posterior we would end up with the Relevance Vector Machine (RVM) [11]. An
advantage of the RVM approach is that all hyper-parameters σ and θ are estimated auto-
matically. The drawback, however, is that one has to solve a hard, computationally expen-
sive optimization problem. The following simplifications show how KFD can be seen as
an approximation to this probabilistic approach. Assuming the noise variance σ² is known
(i.e. dropping all terms depending solely on σ²) and taking the logarithm of the posterior
p(y | α, σ²) p(α | θ) yields the following optimization problem
min_α  ‖ξ‖² − log p(α | θ)    (5)
subject to the constraint (4a). Interpreting the prior as a regularization operator P, intro-
ducing an appropriate weighting factor C, and adding the two zero-mean constraints (4b)
yields the KFD problem (4). The latter are necessary for classification as the two classes 
are independently assumed to be zero-mean Gaussians. This probabilistic interpretation 
has some appealing properties which we outline in the following: 
Interpretation of outputs The probabilistic framework reflects the fact that the outputs
produced by KFD can be interpreted as probabilities, thus making it possible to assign a
confidence to the final classification. This is in contrast to SVMs, whose outputs cannot
directly be seen as probabilities.
Noise models In the above illustration we assumed a Gaussian noise model and some yet 
unspecified prior which was then interpreted as regularizer. Of course, one is not limited 
to Gaussian models. E.g. assuming a Laplacian noise model we would get ‖ξ‖₁ instead of
‖ξ‖² in the objective (5) or (4), respectively. Table 1 gives a selection of different noise
models and their corresponding loss functions which could be used (cf. Figure 1 for an 
illustration). All of them still lead to convex linear or quadratic programming problems in 
the KFD framework. 
Table 1: Loss functions for the slack variables ξ and their corresponding density/noise
models in a probabilistic framework [10].

    loss function                            density model
    ε-insensitive:  |ξ|_ε                    ∝ exp(−|ξ|_ε)
    Laplacian:      |ξ|                      ∝ exp(−|ξ|)
    Gaussian:       ξ²/2                     ∝ exp(−ξ²/2)
    Huber's robust: ξ²/(2σ) if |ξ| ≤ σ,      ∝ exp(−ξ²/(2σ)) if |ξ| ≤ σ,
                    |ξ| − σ/2 otherwise      ∝ exp(σ/2 − |ξ|) otherwise
Regularizers Still open in this probabilistic interpretation is the choice of the prior or
regularizer p(α | θ). One choice would be a zero-mean Gaussian, as for the RVM. Assum-
ing again that this Gaussian's variance θ is known and a multiple of the identity, this would
lead to a regularizer of the form P(α) = ‖α‖². Crucially, choosing a single, fixed variance
parameter for all α we would not achieve sparsity as in the RVM anymore. But of course any
other choice, e.g. from Table 1, is possible. Especially interesting is the choice of a Lapla-
cian prior, which in the optimization procedure corresponds to an ℓ₁-loss on the α's,
i.e. P(α) = ‖α‖₁. This choice leads to sparse solutions in the KFD, as the ℓ₁-norm can
be seen as an approximation to the ℓ₀-norm. In the following we call this particular setting
sparse KFD (SKFD).
Figure 1: Illustration of Gaussian, Laplacian, Huber's robust and ε-insensitive loss func-
tions (dotted) and corresponding densities (solid).
Regression and connection to SVM Considering the program (4) it is rather simple to
modify the KFD approach for regression. Instead of ±1 outputs y we now have real-valued
y's, and instead of two classes there is only one class left. Thus, we can use KFD for
regression as well by simply dropping the distinction between classes in constraint (4b).
The remaining constraint requires the average error to be zero while the variance of the 
errors is minimized. 
This also gives a connection to SVM regression (e.g. [12]), where one uses the ε-
insensitive loss for ξ (cf. Table 1) and a K-regularizer, i.e. P(α) = αᵀ K α = ‖w‖².
Finally, we can also draw the connection to an SVM classifier. In SVM classification one
maximizes the (smallest) margin, traded off against the complexity controlled by ‖w‖².
In contrast, besides parallels in the algorithmic formulation, KFD has no explicit concept of
a margin. Instead, implicitly the average margin, i.e. the average distance of samples from
different classes, is maximized.
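The regression variant described above can be sketched in the same way as the classification QP: under quadratic loss and P(α) = ‖α‖², the only change is replacing the two class constraints (4b) by the single zero-mean-error constraint 1ᵀξ = 0. This is our own illustration, not the authors' code:

```python
import numpy as np

def kfd_regress(K, y, C=1.0):
    """KFD regression sketch: program (4) with real-valued targets y,
    quadratic loss, P(alpha) = ||alpha||^2, and the class constraints
    (4b) replaced by the single constraint 1^T xi = 0 (zero mean error).
    Solved via the KKT linear system after eliminating xi = A z - y."""
    l = len(y)
    ones = np.ones(l)
    A = np.column_stack([K, ones])                # z = (alpha, b) -> K alpha + 1 b
    H = 2.0 * (A.T @ A)
    H[:l, :l] += 2.0 * C * np.eye(l)
    g = 2.0 * A.T @ y
    G = (ones @ A).reshape(1, -1)                 # 1^T (A z - y) = 0
    h = np.array([ones @ y])
    KKT = np.block([[H, G.T], [G, np.zeros((1, 1))]])
    sol = np.linalg.solve(KKT, np.concatenate([g, h]))
    return sol[:l], sol[l]                        # alpha, b
```

The remaining constraint forces the training errors to average to zero, while the quadratic objective minimizes their variance, exactly as described above.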
Optimization Besides a more intuitive understanding, the formulation (4) allows for de- 
riving more efficient algorithms as well. Using a sparsity regularizer (i.e. SKFD) one could 
employ chunking techniques during the optimization of (4). However, the problem of se- 
lecting a good working set is not solved yet, and contrary to e.g. SVM, for KFD all samples
will influence the final solution via the constraints (4a), not just the ones with α_i ≠ 0. Thus
these samples cannot simply be eliminated from the optimization problem. Another in-
teresting option induced by (4) is to use a sparsity regularizer and a linear loss function, 
e.g. the Laplacian loss (cf. Table 1). This results in a linear program which we call linear 
sparse KFD (LSKFD). This can very efficiently be solved by column generation techniques 
known from mathematical programming. A final possibility to optimize (4) for the stan- 
dard KFD problem (i.e. quadratic loss and regularizer) is described in [6]. Here one uses 
a greedy approximation scheme which iteratively constructs a (sparse) solution to the full 
problem. Such an approach is straight forward to implement and much faster than solving 
a quadratic program, provided that the number of non-zero o's necessary to get a good 
approximation to the full solution is small. 
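As one concrete instance, the LSKFD linear program (ℓ₁ loss on ξ, ℓ₁ regularizer on α) can be written with the standard positive/negative variable split and handed to any LP solver. The sketch below uses `scipy.optimize.linprog` for illustration; the authors' column-generation solver is not reproduced here:

```python
import numpy as np
from scipy.optimize import linprog

def lskfd(K, y, C=1.0):
    """LSKFD sketch: min ||xi||_1 + C ||alpha||_1 subject to (4a), (4b),
    as an LP via the split x = x+ - x- with x+, x- >= 0.
    Variable order: a+, a-, xi+, xi-, b+, b- (sizes l, l, l, l, 1, 1)."""
    l = len(y)
    ones = np.ones((l, 1))
    c = np.concatenate([C * np.ones(2 * l), np.ones(2 * l), [0.0, 0.0]])
    # (4a): K alpha + 1 b - xi = y
    A_main = np.hstack([K, -K, -np.eye(l), np.eye(l), ones, -ones])
    rows = []
    for cl in (-1, 1):                     # (4b): 1_k^T xi = 0
        ind = (y == cl).astype(float)
        rows.append(np.concatenate([np.zeros(2 * l), ind, -ind, [0.0, 0.0]]))
    A_eq = np.vstack([A_main] + rows)
    b_eq = np.concatenate([y, [0.0, 0.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    z = res.x
    alpha = z[:l] - z[l:2 * l]
    b = z[4 * l] - z[4 * l + 1]
    return alpha, b
```

The ℓ₁ objective typically drives most entries of α to zero at the optimum, which is the sparsity effect discussed above.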
5 Experiments 
In this section we present some experimental results aimed at (i) showing that KFD
and some of its variants proposed here are capable of producing state-of-the-art results,
and (ii) comparing the influence of different settings for the regularization P(α) and the
loss function applied to ξ in kernel-based classifiers.
The Output Distribution In an initial experiment we compare the output distributions 
generated by a SVM and the KFD (cf. Figure 2). By maximizing the smallest margin and 
using linear slack variables for patterns which do not achieve a reasonable margin, the 
SVM produces a training output sharply peaked around ±1 with Laplacian tails inside the
margin area (the inside-margin area is the interval [−1, 1], the outside area its complement).
In contrast, KFD produces normal distributions which have a small variance along the dis-
criminating direction. Comparing the distributions on the training set to those on the test
set, there is almost no difference for KFD. In this sense the direction found on the training 
data is consistent with the test data. For SVM the output distribution on the test set is signif- 
icantly different. In the example given in Figure 2 the KFD performed slightly better than 
SVM (1.5% vs. 1.7%; for both the best parameters found by 5-fold cross validation were 
used), a fact that is surprising looking only on the training distribution (which is perfectly 
separated for SVM but has some overlap for KFD). 
[Figure 2: four panels showing output histograms on the interval [−2, 2]: SVM training set, SVM test set, KFD training set, KFD test set.]
Figure 2: Comparison of output distributions on training and test set for SVM and KFD for 
optimal parameters on the ringnorm dataset (averaged over 100 different partitions). It is 
clearly observable that the training and test set distributions for KFD are almost identical
while they are considerably different for SVM.
Performance To evaluate the performance of the various KFD approaches on real data 
sets, we performed an extensive comparison to SVMs¹. The results in Table 2 show the
¹Thanks to M. Zwitter and M. Soklic for the breast cancer data. All data sets used in the experi-
ments can be obtained via http://www.first.gmd.de/~raetsch/.
average test error and the standard deviation of the averages' estimation, over 100 runs 
with different realizations of the datasets. To estimate the necessary parameters, we ran 
5-fold cross validation on the first five realizations of the training sets and took the model 
parameters to be the median over the five estimates (see [7] for details of the experimental 
setup). 
From Table 2 it can be seen that both SVM and the KFD variants on average perform
equally well. In terms of (4), KFD denotes the formulation with quadratic regularizer, SKFD
the one with ℓ₁-regularizer, and LSKFD the one with ℓ₁-regularizer and ℓ₁ loss on ξ. The comparable
performance might be seen as an indicator that maximizing the smallest margin or the
average margin does not make a big difference on the data sets studied. The same seems
to be true for using different regularizers and loss functions. Noteworthy is the significantly
higher degree of sparsity for the sparse KFD variants.
Regression Just to show that the proposed KFD regression works in principle, we con- 
ducted a toy experiment on the sinc function (cf. Figure 3). In terms of the number of 
support vectors we obtain similarly sparse results as with RVMs [11], i.e. a much smaller 
number of non-zero coefficients than in SVM regression. A thorough evaluation is cur- 
rently being carried out. 
Figure 3: Illustration of KFD regression. The left panel shows a fit to the noise-free sinc
function sampled on 100 equally spaced points, the right panel with Gaussian noise of
std. dev. 0.2 added. In both cases we used an RBF-kernel exp(−‖x − y‖²/c) of width c = 4.0
and c = 3.0, respectively. The regularization was C = 0.01 and C = 0.1, respectively
(small dots: training samples, circled dots: SVs).
             SVM               KFD          SKFD              LSKFD
Banana       11.5±0.07 (78%)   10.8±0.05    11.2±0.48 (86%)   10.6±0.04 (92%)
B.Cancer     26.0±0.47 (42%)   25.8±0.46    25.2±0.44 (88%)   25.8±0.47 (88%)
Diabetes     23.5±0.17 (57%)   23.2±0.16    23.1±0.18 (97%)   23.6±0.18 (97%)
German       23.6±0.21 (58%)   23.7±0.22    23.6±0.23 (96%)   24.1±0.23 (98%)
Heart        16.0±0.33 (51%)   16.1±0.34    16.4±0.31 (88%)   16.0±0.36 (96%)
Ringnorm      1.7±0.01 (62%)    1.5±0.01     1.6±0.01 (85%)    1.5±0.01 (94%)
F.Sonar      32.4±0.18 (9%)    33.2±0.17    33.4±0.17 (67%)   34.4±0.23 (99%)
Thyroid       4.8±0.22 (79%)    4.2±0.21     4.3±0.18 (88%)    4.7±0.22 (89%)
Titanic      22.4±0.10 (10%)   23.2±0.20    22.6±0.17 (8%)    22.5±0.20 (95%)
Waveform      9.9±0.04 (60%)    9.9±0.04    —                 10.2±0.04 (96%)

Table 2: Comparison between KFD, sparse KFD (SKFD), sparse KFD with linear loss
on ξ (LSKFD), and SVMs (see text). All experiments were carried out with RBF-kernels
exp(−‖x − y‖²/c). Best result in bold face, second best in italics. The numbers in brackets
denote the fraction of expansion coefficients which were zero.
6 Conclusion and Outlook 
In this work we showed how KFD can be reformulated as a mathematical programming 
problem. This allows a better understanding of KFD and interesting extensions: First, a 
probabilistic interpretation gives new insights about connections to RVM, SVM and regu-
larization properties. Second, using a Laplacian prior, i.e. an ℓ₁-regularizer, yields the sparse
algorithm SKFD. Third, the more general modeling permits a very natural KFD algorithm 
for regression. Finally, due to the quadratic programming formulation, we can use tricks 
known from SVM literature like chunking or active set methods for solving the optimiza- 
tion problem. However, the optimal choice of a working set is not completely resolved and
is still an issue of ongoing research. In this sense sparse KFD inherits some of the most ap-
pealing properties of both SVM and RVM: a unique, mathematical programming solution
from SVM and a higher sparsity together with interpretable outputs from RVM. 
Our experimental studies show a competitive performance of our new KFD algorithms if 
compared to SVMs. This indicates that neither the margin, nor sparsity, nor a specific out-
put distribution alone seems to be responsible for the good performance of kernel machines.
Further theoretical and experimental research is therefore needed to learn more about this 
interesting question. Our future research will also investigate the role of output distribu- 
tions and their difference between training and test set. 
Acknowledgments This work was partially supported by grants of the DFG (JA 379/7- 
1,9-1). Thanks to K. Tsuda for helpful comments and discussions. 
References 
[1] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural 
Computation, 12(10):2385-2404, 2000. 
[2] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In 
D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning 
Theory, pages 144-152, 1992. 
[3] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Associ-
ation, 84(405):165–175, 1989.
[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis
with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for 
Signal Processing IX, pages 41-48. IEEE, 1999. 
[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A.J. Smola, and K.-R. Müller. Invariant feature
extraction and classification in kernel spaces. In S.A. Solla, T.K. Leen, and K.-R. Müller,
editors, Advances in Neural Information Processing Systems 12, pages 526-532. MIT Press, 
2000. 
[6] S. Mika, A.J. Smola, and B. Schölkopf. An improved training algorithm for kernel Fisher dis-
criminants. In Proceedings AISTATS 2001. Morgan Kaufmann, 2001. To appear.
[7] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning,
42(3):287–320, March 2001. Also NeuroCOLT Technical Report NC-TR-1998-021.
[8] V. Roth and V. Steinhage. Nonlinear discriminant analysis using kernel functions. In S.A. Solla, 
T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12,
pages 568-574. MIT Press, 2000. 
[9] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigen-
value problem. Neural Computation, 10:1299-1319, 1998. 
[10] A.J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998.
[11] M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K.-R. Müller,
editors, Advances in Neural Information Processing Systems 12, pages 652-658. MIT Press, 
2000. 
[12] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, New York, 1995. 
