Sparse Kernel 
Principal Component Analysis 
Michael E. Tipping 
Microsoft Research 
St George House, 1 Guildhall St
Cambridge CB2 3NH, U.K.
mtipping@microsoft.com
Abstract 
'Kernel' principal component analysis (PCA) is an elegant non- 
linear generalisation of the popular linear data analysis method, 
where a kernel function implicitly defines a nonlinear transforma- 
tion into a feature space wherein standard PCA is performed. Un- 
fortunately, the technique is not 'sparse', since the components 
thus obtained are expressed in terms of kernels associated with ev- 
ery training vector. This paper shows that by approximating the 
covariance matrix in feature space by a reduced number of exam- 
ple vectors, using a maximum-likelihood approach, we may obtain 
a highly sparse form of kernel PCA without loss of effectiveness. 
1 Introduction
Principal component analysis (PCA) is a well-established technique for dimension- 
ality reduction, and examples of its many applications include data compression, 
image processing, visualisation, exploratory data analysis, pattern recognition and 
time series prediction. Given a set of N d-dimensional data vectors x, which we 
take to have zero mean, the principal components are the linear projections onto 
the 'principal axes', defined as the leading eigenvectors of the sample covariance 
matrix $S = N^{-1}\sum_{n=1}^{N} x_n x_n^{\top} = N^{-1}X^{\top}X$, where $X = (x_1, x_2, \ldots, x_N)^{\top}$ is the
conventionally-defined 'design' matrix. These projections are of interest as they 
retain maximum variance and minimise error of subsequent linear reconstruction. 
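As a concrete illustration, the principal axes can be computed directly as eigenvectors of the sample covariance. The following is a minimal NumPy sketch; the function name `pca_axes` and the synthetic data are ours, for illustration only:

```python
import numpy as np

def pca_axes(X):
    """Principal axes of zero-mean data X (N x d): eigenvectors of the
    sample covariance S = X^T X / N, sorted by decreasing eigenvalue."""
    N = X.shape[0]
    S = X.T @ X / N
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
X -= X.mean(axis=0)                        # zero mean, as assumed above
lam, U = pca_axes(X)
Z = X @ U[:, :2]                           # maximum-variance linear projections
```

The variance of the projections onto axis $i$ equals the eigenvalue $\lambda_i$, which is what "retain maximum variance" means in practice.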
However, because PCA only defines a linear projection of the data, the scope of 
its application is necessarily somewhat limited. This has naturally motivated vari- 
ous developments of nonlinear 'principal component analysis' in an effort to model 
non-trivial data structures more faithfully, and a particularly interesting recent in- 
novation has been 'kernel PCA' [4]. 
Kernel PCA, summarised in Section 2, makes use of the 'kernel trick', so effectively 
exploited by the 'support vector machine', in that a kernel function k(., .) may 
be considered to represent a dot (inner) product in some transformed space if it 
satisfies Mercer's condition -- i.e. if it is the continuous symmetric kernel of a 
positive integral operator. This can be an elegant way to 'non-linearise' linear 
procedures which depend only on inner products of the examples. 
Applications utilising kernel PCA are emerging [2], but in practice the approach 
suffers from one important disadvantage in that it is not a sparse method. Com- 
putation of principal component projections for a given input $x$ requires evaluation
of the kernel function $k(x, x_n)$ in respect of all $N$ 'training' examples $x_n$. This is
an unfortunate limitation as in practice, to obtain the best model, we would like to 
estimate the kernel principal components from as much data as possible. 
Here we tackle this problem by first approximating the covariance matrix in feature 
space by a subset of outer products of feature vectors, using a maximum-likelihood 
criterion based on a 'probabilistic PCA' model detailed in Section 3. Subsequently 
applying (kernel) PCA defines sparse projections. Importantly, the approximation 
we adopt is principled and controllable, and is related to the choice of the number of 
components to 'discard' in the conventional approach. We demonstrate its efficacy 
in Section 4 and illustrate how it can offer similar performance to a full non-sparse 
kernel PCA implementation while offering much reduced computational overheads. 
2 Kernel PCA 
Although PCA is conventionally defined (as above) in terms of the covariance, or 
outer-product, matrix, it is well-established that the eigenvectors of XTX can be 
obtained from those of the inner-product matrix XX T. If U is an orthogonal ma- 
trix of column eigenvectors of XX T with corresponding eigenvalues in the diagonal 
matrix $\Lambda$, then by definition $(XX^{\top})U = U\Lambda$. Pre-multiplying by $X^{\top}$ gives:

$$(X^{\top}X)(X^{\top}U) = (X^{\top}U)\Lambda. \qquad (1)$$

From inspection, it can be seen that the eigenvectors of $X^{\top}X$ are $X^{\top}U$, with eigenvalues $\Lambda$. Note, however, that the column vectors $X^{\top}U$ are not normalised, since for column $i$, $u_i^{\top}XX^{\top}u_i = \lambda_i u_i^{\top}u_i = \lambda_i$, so the correctly normalised eigenvectors of $X^{\top}X$, and thus the principal axes of the data, are given by $U_{\rm pca} = X^{\top}U\Lambda^{-1/2}$.
This derivation is useful if d > N, when the dimensionality of x is greater than 
the number of examples, but it is also fundamental for implementing kernel PCA. 
In kernel PCA, the data vectors $x$ are implicitly mapped into a feature space by a function $\phi\colon x \mapsto \phi(x)$. Although the vectors $\phi_n = \phi(x_n)$ in the feature space are generally not known explicitly, their inner products are defined by the kernel: $\phi_m^{\top}\phi_n = k(x_m, x_n)$. Defining $\Phi$ as the (notional) design matrix in feature space, and exploiting the above inner-product PCA formulation, allows the eigenvectors of the covariance matrix in feature space, $S_{\phi} = N^{-1}\sum_n \phi_n\phi_n^{\top}$, to be specified as:

$$U_{\rm kpca} = \Phi^{\top}U\Lambda^{-1/2}, \qquad (2)$$
where $U, \Lambda$ are the eigenvectors/eigenvalues of the kernel matrix $K$, with $(K)_{mn} = k(x_m, x_n)$. Although we can't compute $U_{\rm kpca}$ since we don't know $\Phi$ explicitly, we can compute projections of arbitrary test vectors $x_* \mapsto \phi_*$ onto $U_{\rm kpca}$ in feature space:

$$\phi_*^{\top}U_{\rm kpca} = \phi_*^{\top}\Phi^{\top}U\Lambda^{-1/2} = k_*^{\top}U\Lambda^{-1/2}, \qquad (3)$$

where $k_*$ is the $N$-vector of inner products of $x_*$ with the data in kernel space: $(k_*)_n = k(x_*, x_n)$. We can thus compute, and plot, these projections; Figure 1 gives an example for some synthetic 3-cluster data in two dimensions.
¹Here, and in the rest of the paper, we do not 'centre' the data in feature space,
although this may be achieved if desired (see [4]). In fact, we would argue that when using 
a Gaussian kernel, it does not necessarily make sense to do so. 
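Equations (2) and (3) translate directly into a short computation: eigendecompose the kernel matrix, then project via $k_*^{\top}U\Lambda^{-1/2}$. A minimal NumPy sketch follows; the function names and the 3-cluster toy data are our own, for illustration, and no feature-space centring is applied, as in the paper:

```python
import numpy as np

def gaussian_kernel(A, B, r=0.25):
    # k(x, x') = exp(-||x - x'||^2 / r^2), the kernel used for Figure 1
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r ** 2)

def kernel_pca_project(X, X_test, n_components, kernel):
    """Projections of test points onto kernel principal axes, eq. (3)."""
    K = kernel(X, X)
    lam, U = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:n_components]
    lam, U = lam[order], U[:, order]
    K_star = kernel(X_test, X)             # rows are the vectors k_*
    return K_star @ U / np.sqrt(lam)       # k_*^T U Lambda^{-1/2}

# three Gaussian clusters (std 0.1) of 30 points each, as in Figure 1
rng = np.random.default_rng(1)
centres = np.repeat(np.array([[0.0, 0.3], [-0.3, -0.2], [0.3, -0.2]]), 30, axis=0)
X = centres + rng.normal(scale=0.1, size=(90, 2))
Z = kernel_pca_project(X, X, n_components=9, kernel=gaussian_kernel)
```

For the training data themselves, $K U \Lambda^{-1/2} = U\Lambda^{1/2}$, so the projections onto different components are mutually orthogonal with squared norms equal to the eigenvalues.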
[Figure 1: nine contour-plot panels; eigenvalues per panel, in descending order: 0.218, 0.203, 0.191, 0.057, 0.053, 0.051, 0.047, 0.043, 0.036.]
Figure 1: Contour plots of the first nine principal component projections evaluated over 
region of input space for data from 3 Gaussian clusters (standard deviation 0.1; axis scales 
are shown in Figure 3) each comprising 30 vectors. A Gaussian kernel, $\exp(-\|x - x'\|^2/r^2)$,
with width $r = 0.25$, was used. The corresponding eigenvalues are given above each
projection. Note how the first three components 'pick out' the individual clusters [4]. 
3 Probabilistic Feature-Space PCA 
Our approach to sparsifying kernel PCA is to a priori approximate the feature space 
sample covariance matrix $S_{\phi}$ with a sum of weighted outer products of a reduced
number of feature vectors. (The basis of this technique is thus general and its 
application not necessarily limited to kernel PCA.) This is achieved probabilistically, 
by maximising the likelihood of the feature vectors under a Gaussian density model $\phi \sim \mathcal{N}(0, C)$, where we specify the covariance $C$ by:

$$C = \sigma^2 I + \sum_{i=1}^{N} w_i \phi_i \phi_i^{\top} = \sigma^2 I + \Phi^{\top}W\Phi, \qquad (4)$$

where $w_1 \ldots w_N$ are the adjustable weights, $W$ is a matrix with those weights on the diagonal, and $\sigma^2$ is an isotropic 'noise' component common to all dimensions of feature space. Of course, a naive maximum of the likelihood under this model is obtained with $\sigma^2 = 0$ and all $w_i = 1/N$. However, if we fix $\sigma^2$ and optimise only the weighting factors $w_i$, we will find that the maximum-likelihood estimates of many $w_i$ are zero, thus realising a sparse representation of the covariance matrix.
This probabilistic approach is motivated by the fact that if we relax the form of the model, by defining it in terms of outer products of $N$ arbitrary vectors $v_i$ (rather than the fixed training vectors), i.e. $C = \sigma^2 I + \sum_{i=1}^{N} w_i v_i v_i^{\top}$, then we realise a form of 'probabilistic PCA' [6]. That is, if $\{u_i, \lambda_i\}$ are the set of eigenvectors/eigenvalues of $S_{\phi}$, then the likelihood under this model is maximised by $w_i v_i v_i^{\top} = (\lambda_i - \sigma^2)\, u_i u_i^{\top}$ for those $i$ for which $\lambda_i > \sigma^2$. For $\lambda_i \le \sigma^2$, the most likely weights $w_i$ are zero.
3.1 Computations in feature space 
We wish to maximise the likelihood under a Gaussian model with covariance given by (4). Ignoring terms independent of the weighting parameters, its log is given by:

$$\mathcal{L} = -\frac{1}{2}\left[N\log|C| + \sum_{n=1}^{N}\phi_n^{\top}C^{-1}\phi_n\right]. \qquad (5)$$
Computing (5) requires the quantities $|C|$ and $\phi^{\top}C^{-1}\phi$, which for infinite-dimensionality feature spaces might appear problematic. However, by judicious re-writing of the terms of interest, we are able to both compute the log-likelihood (to within a constant) and optimise it with respect to the weights. First, we can write:

$$\log|\sigma^2 I + \Phi^{\top}W\Phi| = D\log\sigma^2 + \log|W^{-1} + \sigma^{-2}\Phi\Phi^{\top}| + \log|W|. \qquad (6)$$

The potential problem of infinite dimensionality, $D$, of the feature space now enters only in the first term, which is constant if $\sigma^2$ is fixed and so does not affect maximisation. The term in $|W|$ is straightforward, and the remaining term can be expressed in terms of the inner-product (kernel) matrix:

$$W^{-1} + \sigma^{-2}\Phi\Phi^{\top} = W^{-1} + \sigma^{-2}K, \qquad (7)$$

where $K$ is the kernel matrix such that $(K)_{mn} = k(x_m, x_n)$.
For the data-dependent term in the likelihood, we can use the Woodbury matrix inversion identity to compute the quantities $\phi_n^{\top}C^{-1}\phi_n$:

$$\phi_n^{\top}(\sigma^2 I + \Phi^{\top}W\Phi)^{-1}\phi_n = \phi_n^{\top}\left[\sigma^{-2}I - \sigma^{-4}\Phi^{\top}(W^{-1} + \sigma^{-2}\Phi\Phi^{\top})^{-1}\Phi\right]\phi_n = \sigma^{-2}k(x_n, x_n) - \sigma^{-4}k_n^{\top}(W^{-1} + \sigma^{-2}K)^{-1}k_n, \qquad (8)$$

with $k_n = [k(x_n, x_1), k(x_n, x_2), \ldots, k(x_n, x_N)]^{\top}$.
3.2 Optimising the weights 
To maximise the log-likelihood with respect to the $w_i$, differentiating (5) gives us:

$$\frac{\partial\mathcal{L}}{\partial w_i} = -\frac{1}{2}\left[N\phi_i^{\top}C^{-1}\phi_i - \sum_{n=1}^{N}\left(\phi_n^{\top}C^{-1}\phi_i\right)^2\right] \qquad (9)$$
$$= \frac{1}{2w_i^2}\left[\sum_{n=1}^{N}\mu_{ni}^2 + N\Sigma_{ii} - Nw_i\right], \qquad (10)$$

where $\mu_n$ and $\Sigma$ are defined respectively by:

$$\Sigma = (W^{-1} + \sigma^{-2}K)^{-1}, \qquad (11)$$
$$\mu_n = \sigma^{-2}\Sigma k_n. \qquad (12)$$

Setting (10) to zero gives re-estimation equations for the weights:

$$w_i^{\rm new} = N^{-1}\sum_{n=1}^{N}\mu_{ni}^2 + \Sigma_{ii}. \qquad (13)$$
The re-estimates (13) are equivalent to expectation-maximisation updates, which would be obtained by adopting a factor-analytic perspective [3], and introducing a set of 'hidden' Gaussian explanatory variables whose conditional means and common covariance, given the feature vectors and the current values of the weights, are given by $\mu_n$ and $\Sigma$ respectively (hence the notation). As such, (13) is guaranteed to increase $\mathcal{L}$ unless it is already at a maximum. However, an alternative re-arrangement of (10), motivated by [5], leads to a re-estimation update which typically converges significantly more quickly:

$$w_i^{\rm new} = \frac{\sum_{n=1}^{N}\mu_{ni}^2}{N(1 - \Sigma_{ii}/w_i)}. \qquad (14)$$

Note that these $w_i$ updates (14) are defined in terms of the computable (i.e. not dependent on explicit feature-space vectors) quantities $\Sigma$ and $\mu_n$.
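The whole optimisation therefore needs only the kernel matrix. The following NumPy sketch iterates update (14), pruning weights that fall below a small threshold; the function name, initialisation, and stopping rule are our own choices, not prescribed by the paper:

```python
import numpy as np

def sparse_weights(K, sigma2, n_iter=300, tol=1e-6):
    """Iterate eq. (14) for the weights of C = sigma^2 I + Phi^T W Phi;
    many weights are driven to (numerically) zero."""
    N = K.shape[0]
    w = np.full(N, 1.0 / N)                        # uniform initialisation
    for _ in range(n_iter):
        active = w > tol                           # prune negligible kernels
        # Sigma = (W^{-1} + sigma^{-2} K)^{-1}, eq. (11), active subset only
        Sigma = np.linalg.inv(np.diag(1.0 / w[active])
                              + K[np.ix_(active, active)] / sigma2)
        # mu_n = sigma^{-2} Sigma k_n, eq. (12); row n of Mu is mu_n^T
        Mu = K[:, active] @ Sigma / sigma2
        num = (Mu ** 2).sum(axis=0)                # sum_n mu_ni^2
        denom = N * (1.0 - np.diag(Sigma) / w[active])
        w_new = np.zeros(N)
        w_new[active] = num / np.maximum(denom, 1e-12)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```

On clustered data such as that of Figure 1, most weights decay to zero and only a small set of representing kernels survives, mirroring the sparsity the paper reports.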
3.3 Principal component analysis 
The principal axes

Sparse kernel PCA proceeds by finding the principal axes of the covariance model $C = \sigma^2 I + \Phi^{\top}W\Phi$. These are identical to those of $\Phi^{\top}W\Phi$, but with eigenvalues all $\sigma^2$ larger. Letting $\hat{\Phi} = W^{1/2}\Phi$, then, we need the eigenvectors of $\hat{\Phi}^{\top}\hat{\Phi}$. Using the technique of Section 2, if the eigenvectors of $\hat{\Phi}\hat{\Phi}^{\top} = W^{1/2}\Phi\Phi^{\top}W^{1/2} = W^{1/2}KW^{1/2}$ are $\hat{U}$, with corresponding eigenvalues $\hat{\Lambda}$, then the eigenvectors/eigenvalues $\{U, \Lambda\}$ of $C$ that we desire are given by:

$$U = \hat{\Phi}^{\top}\hat{U}\hat{\Lambda}^{-1/2} = \Phi^{\top}W^{1/2}\hat{U}\hat{\Lambda}^{-1/2}, \qquad (15)$$
$$\Lambda = \hat{\Lambda} + \sigma^2 I. \qquad (16)$$
Computing projections

Again, we can't compute the eigenvectors $U$ explicitly in (15), but we can compute the projections of a general feature vector $\phi_*$ onto the principal axes:

$$\phi_*^{\top}U = \phi_*^{\top}\Phi^{\top}W^{1/2}\hat{U}\hat{\Lambda}^{-1/2} = k_*^{\top}W^{1/2}\hat{U}\hat{\Lambda}^{-1/2} = \hat{k}_*^{\top}P, \qquad (17)$$

where $\hat{k}_*$ is the sparse vector containing the non-zero-weighted elements of $k_*$, defined earlier. The corresponding rows of $W^{1/2}\hat{U}\hat{\Lambda}^{-1/2}$ are combined into a single projecting matrix $P$, each column of which gives the coefficients of the kernel functions for the evaluation of each principal component.
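In code, the projecting matrix $P$ of (17) requires only the reduced kernel matrix over the representing vectors. A NumPy sketch, where the names are ours and `w` is the sparse weight vector found in Section 3.2:

```python
import numpy as np

def sparse_kpca_projector(K, w, sigma2, tol=1e-6):
    """Build P of eq. (17) from the kernels with non-zero weight.
    Returns the mask of representing vectors, P, and the eigenvalues of C."""
    keep = w > tol
    K_hat = K[np.ix_(keep, keep)]                   # kernels among representers
    root_w = np.sqrt(w[keep])
    M = root_w[:, None] * K_hat * root_w[None, :]   # W^{1/2} K W^{1/2}
    lam_hat, U_hat = np.linalg.eigh(M)
    order = np.argsort(lam_hat)[::-1]
    lam_hat, U_hat = lam_hat[order], U_hat[:, order]
    pos = lam_hat > 1e-10                           # drop numerically null axes
    # P = rows of W^{1/2} U_hat Lambda_hat^{-1/2}; eq. (16): Lambda = Lambda_hat + sigma^2
    P = root_w[:, None] * U_hat[:, pos] / np.sqrt(lam_hat[pos])
    return keep, P, lam_hat[pos] + sigma2

# projections of a test point x_*: k_hat_star @ P, where k_hat_star holds
# k(x_*, x_i) for the representing vectors x_i only
```

A quick check of correctness: with $U$ as in (15), $U^{\top}U = \hat{\Lambda}^{-1/2}\hat{U}^{\top}(W^{1/2}KW^{1/2})\hat{U}\hat{\Lambda}^{-1/2} = I$, which in code means $P^{\top}\hat{K}P$ should be the identity.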
3.4 Computing Reconstruction Error

The squared reconstruction error in kernel space for a test vector $\phi_*$ is given by:

$$\|(I - UU^{\top})\phi_*\|^2 = k(x_*, x_*) - \hat{k}_*^{\top}\hat{K}^{-1}\hat{k}_*, \qquad (18)$$

with $\hat{K}$ the kernel matrix evaluated only for the representing vectors.
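Equation (18) likewise needs only kernel evaluations against the representing vectors. A small sketch with hypothetical names; note that for a test point that is itself a representing vector the error is exactly zero, which makes a handy unit test:

```python
import numpy as np

def sparse_reconstruction_error(k_self, k_hat_star, K_hat):
    """Eq. (18): squared feature-space reconstruction error
    k(x_*, x_*) - k_hat_*^T K_hat^{-1} k_hat_*."""
    return k_self - k_hat_star @ np.linalg.solve(K_hat, k_hat_star)
```

Since $\hat{K}$ is positive definite, the subtracted quadratic form is non-negative, and the error shrinks as the test point moves closer (in kernel terms) to the span of the representing vectors.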
4 Examples 
To obtain sparse kernel PCA projections, we first specify the noise variance $\sigma^2$, which is the amount of variance per co-ordinate that we are prepared to allow to be explained by the (structure-free) isotropic noise rather than by the principal axes (this choice is a surrogate for deciding how many principal axes to retain in conventional kernel PCA). Unfortunately, the measure is in feature space, which makes it rather more difficult to interpret than if it were in data space (equally so, of course, for interpretation of the eigenvalue spectrum in the non-sparse case).
We apply sparse kernel PCA to the Gaussian data of Figure 1 earlier, with the same 
kernel function and specifying $\sigma = 0.25$, deliberately chosen to give nine representing
kernels so as to facilitate comparison. Figure 2 shows the nine principal component 
projections based on the approximated covariance matrix, and gives qualitatively 
equivalent results to Figure 1 while utilising only 10% of the kernels. Figure 3 shows 
the data and highlights those examples corresponding to the nine kernels with non- 
zero weights. Note, although we do not consider this aspect further here, that these 
representing vectors are themselves highly informative of the structure of the data 
(i.e. with a Gaussian kernel, for example, they tend to represent distinguishable 
clusters). Also in Figure 3, contours of reconstruction error, based only on those 
nine kernels, are plotted and indicate that the nonlinear model has more faithfully 
captured the structure of the data than would standard linear PCA. 
[Figure 2: nine contour-plot panels; eigenvalues per panel: 0.199, 0.184, 0.161, 0.082, 0.074, 0.074, 0.074, 0.072, 0.071.]
Figure 2: The nine principal component projections obtained by sparse kernel PCA. 
To further illustrate the fidelity of the sparse approximation, we analyse the 200 
training examples of the 7-dimensional 'Pima Indians diabetes' database [1]. Fig- 
ure 4 (left) shows a plot of reconstruction error against the number of principal 
components utilised by both conventional kernel PCA and its sparse counterpart, 
with $\sigma^2$ chosen so as to utilise 20% of the kernels (40). An expected small reduction
in accuracy is evident in the sparse case. Figure 4 (right) shows the error on
the associated test set when using a linear support vector machine to classify the 
data based on those numbers of principal components. Here the sparse projections 
actually perform marginally better on average, a consequence of both randomness 
and, we note with interest, presumably some inherent complexity control implied 
by the use of a sparse approximation. 
Figure 3: The data with the nine representing kernels circled and contours of reconstruc- 
tion error (computed in feature space although displayed as a function of x) overlaid. 
Figure 4: RMS reconstruction error (left) and test set misclassifications (right) for num- 
bers of retained principal components ranging from 1-25. For the standard case, this was 
based on all 200 training examples, for the sparse form, a subset of 40. A Gaussian kernel 
of width 10 was utilised, which gives near-optimal results if used in an SVM classification. 
References 
[1] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 
Cambridge, 1996. 
[2] S. Romdhani, S. Gong, and A. Psarrou. A multi-view nonlinear active shape model 
using kernel PCA. In Proceedings of the 1999 British Machine Vision Conference, 
pages 483-492, 1999. 
[3] D. B. Rubin and D. T. Thayer. EM algorithms for ML factor analysis. Psychometrika, 
47(1):69-76, 1982. 
[4] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10:1299-1319, 1998. Technical Report No.
44, 1996, Max Planck Institut für biologische Kybernetik, Tübingen.
[5] M. E. Tipping. The Relevance Vector Machine. In S. A. Solla, T. K. Leen, and K.-R.
Müller, editors, Advances in Neural Information Processing Systems 12, pages 652-658.
Cambridge, Mass: MIT Press, 2000.
[6] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal 
of the Royal Statistical Society, Series B, 61(3):611-622, 1999. 
