Sparse Greedy 
Gaussian Process Regression 
Alex J. Smola*
RSISE and Department of Engineering
Australian National University
Canberra, ACT, 0200
Alex.Smola@anu.edu.au

Peter Bartlett
RSISE
Australian National University
Canberra, ACT, 0200
Peter.Bartlett@anu.edu.au
Abstract 
We present a simple sparse greedy technique to approximate the 
maximum a posteriori estimate of Gaussian Processes with much 
improved scaling behaviour in the sample size m. In particular, 
computational requirements are O(n²m), storage is O(nm), the
cost for prediction is O(n) and the cost to compute confidence
bounds is O(nm), where n ≪ m. We show how to compute a
stopping criterion, give bounds on the approximation error, and 
show applications to large scale problems. 
1 Introduction
Gaussian processes have become popular because they allow exact Bayesian analysis 
with simple matrix manipulations, yet provide good performance. They share with 
Support Vector machines and Regularization Networks the concept of regularization 
via Reproducing Kernel Hilbert spaces [3], that is, they allow the direct specification 
of the smoothness properties of the class of functions under consideration. However, 
Gaussian processes are not always the method of choice for large datasets, since they 
involve evaluations of the covariance function at m points (where m denotes the 
sample size) in order to carry out inference at a single additional point. This may 
be rather costly to implement -- practitioners prefer to use only a small number of 
basis functions (i.e. covariance function evaluations). 
Furthermore, the Maximum a Posteriori (MAP) estimate requires computation,
storage, and inversion of the full m × m covariance matrix Kij = k(xi, xj), where
x1, ..., xm are training patterns. While there exist techniques [2, 8] to reduce the
computational cost of finding an estimate to O(km²) rather than O(m³) when
the covariance matrix contains a significant number of small eigenvalues, all these 
methods still require computation and storage of the full covariance matrix. None 
of these methods addresses the problem of speeding up the prediction stage (except 
for the rare case when the integral operator corresponding to the kernel can be 
diagonalized analytically [8]). 
*Supported by the DFG (Sm 62-1) and the Australian Research Council.

We devise a sparse greedy method, similar to those proposed in the context of
wavelets [4], solutions of linear systems [5], or matrix approximation [7], that finds
an approximation of the MAP estimate by expanding it in terms of a small subset
of kernel functions k(xi, .). Briefly, the technique works as follows: given a set of 
(already chosen) kernel functions, we seek the additional function that increases 
the posterior probability most. We add it to the set of basis functions and repeat 
until the maximum is approximated sufficiently well. A similar approach for a tight 
upper bound on the posterior probability gives a stopping criterion. 
2 Gaussian Process Regression 
Consider a finite set X = {x1, ..., xm} of inputs. In Gaussian Process Regression,
we assume that for any such set there is a covariance matrix K with elements 
Kij = k(xi, xj). We assume that for each input x there is a corresponding output 
y(x), and that these outputs are generated by 
y(x) = t(x) + ξ,                                              (1)

where t(x) and ξ are both normal random variables, with ξ ~ N(0, σ²) and
t = (t(x1), ..., t(xm))ᵀ ~ N(0, K). We can use Bayes' theorem to determine the
distribution of the output y(x) at a (new) input x. Conditioned on the data (X, y), 
the output y(x) is normally distributed. It follows that the mean of this distribution 
is the maximum a posteriori probability (MAP) estimate of y. We are interested in 
estimating this mean, and also the variance. 
It is possible to give an equivalent parametric representation of y that is more
convenient for our purposes. We may assume that the vector y = (y(x1), ..., y(xm))ᵀ
of outputs is generated by
y = Kα + ξ,                                                   (2)

where α ~ N(0, K⁻¹) and ξ ~ N(0, σ²1). Consequently the posterior probability
p(α | y, X) over the latent variables α is proportional to

exp(-‖y - Kα‖² / (2σ²)) exp(-½ αᵀKα),                         (3)

and the conditional expectation of y(x) for a (new) location x is E[y(x) | y, X] =
kᵀα_opt, where k denotes the vector (k(x1, x), ..., k(xm, x))ᵀ and α_opt is the value
of α that maximizes (3). Thus, it suffices to compute α_opt before any predictions
are required. The problem of choosing the MAP estimate of α is equivalent to the
problem of minimizing the negative log-posterior,

minimize over α ∈ R^m:   [-yᵀKα + ½ αᵀ(σ²K + KᵀK)α]          (4)
(ignoring constant terms and rescaling by σ²). It is easy to show that the mean of
the conditional distribution of y(x) is kᵀ(K + σ²1)⁻¹y, and its variance is k(x, x) +
σ² - kᵀ(K + σ²1)⁻¹k (see, for example, [2]).
3 Approximate Minimization of Quadratic Forms 
For Gaussian process regression, searching for an approximate solution to (4) relies 
on the assumption that a set of variables whose posterior probability is close to that 
of the mode of the distribution will be a good approximation for the MAP estimate. 
The following theorem suggests a simple approach to estimating the accuracy of an 
approximate solution to (4). It uses an idea from [2] in a modified, slightly more 
general form. 
Theorem 1 (Approximation Bounds for Quadratic Forms) Denote by K ∈
R^{m×m} a positive semidefinite matrix, y, α ∈ R^m, and define the two quadratic forms

Q(α) := -yᵀKα + ½ αᵀ(σ²K + KᵀK)α,                             (5)
Q*(α) := -yᵀα + ½ αᵀ(σ²1 + K)α.                               (6)

Suppose Q and Q* have minima Q_min and Q*_min. Then for all α, α* ∈ R^m we have

Q(α) ≥ Q_min ≥ -½‖y‖² - σ²Q*(α*),                             (7)
Q*(α*) ≥ Q*_min ≥ (-½‖y‖² - Q(α)) / σ²,                       (8)

with equalities throughout when Q(α) = Q_min and Q*(α*) = Q*_min.
Hence, by minimizing Q* in addition to Q we can bound Q's closeness to the 
optimum and vice versa. 
Proof The minimum of Q(α) is obtained for α_opt = (K + σ²1)⁻¹y (which also
minimizes Q*), hence

Q_min = -½ yᵀK(K + σ²1)⁻¹y   and   Q*_min = -½ yᵀ(K + σ²1)⁻¹y.   (9)

This allows us to combine Q_min and Q*_min into Q_min + σ²Q*_min = -½‖y‖². Since by
definition Q(α) ≥ Q_min for all α (and likewise Q*(α*) ≥ Q*_min for all α*), we may
solve Q_min + σ²Q*_min = -½‖y‖² for either Q_min or Q*_min to obtain lower bounds for
each of the two quantities. This proves (7) and (8). ∎
Equation (7) is useful for computing an approximation to the MAP solution,
whereas (8) can be used to obtain error bars on the estimate. To see this, note that
in calculating the variance, the expensive quantity to compute is kᵀ(K + σ²1)⁻¹k.
However, this can be found by solving

minimize over α ∈ R^m:   [-kᵀα + ½ αᵀ(σ²1 + K)α],             (10)

and the expression being minimized is Q*(α) with y = k (see (6)). Hence, an
approximate minimizer of (10) gives an upper bound on the error bars, and lower
bounds can be obtained from (8).
In practice we will use the quantity

gap(α, α*) := 2 (Q(α) + σ²Q*(α*) + ½‖y‖²) / (|Q(α)| + |σ²Q*(α*) + ½‖y‖²|),

i.e. the relative size of the difference between upper and lower bound, as the
stopping criterion.
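These bounds are easy to verify numerically. The following sketch, with an arbitrary random positive semidefinite matrix standing in for K, checks the identity from the proof of Theorem 1 and evaluates gap(α, α*) at arbitrary trial points (all data here is ours, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma2 = 30, 0.5
B = rng.normal(size=(m, m))
K = B @ B.T / m                  # random positive semidefinite "covariance"
y = rng.normal(size=m)

def Q(a):       # Eq. (5)
    return -y @ K @ a + 0.5 * a @ (sigma2 * K + K.T @ K) @ a

def Qstar(a):   # Eq. (6)
    return -y @ a + 0.5 * a @ (sigma2 * np.eye(m) + K) @ a

a_opt = np.linalg.solve(K + sigma2 * np.eye(m), y)  # minimizes both Q and Q*
Qmin, Qsmin = Q(a_opt), Qstar(a_opt)

# Identity from the proof of Theorem 1: Q_min + sigma^2 Q*_min = -||y||^2 / 2.
assert np.isclose(Qmin + sigma2 * Qsmin, -0.5 * y @ y)

# Bound (7) and the relative gap, at arbitrary trial points a, a*.
a, astar = rng.normal(size=m), rng.normal(size=m)
assert Q(a) >= Qmin >= -0.5 * y @ y - sigma2 * Qstar(astar)
gap = 2 * (Q(a) + sigma2 * Qstar(astar) + 0.5 * y @ y) / (
      abs(Q(a)) + abs(sigma2 * Qstar(astar) + 0.5 * y @ y))
assert gap >= 0.0
```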
4 A Sparse Greedy Algorithm 
The central idea is that in order to obtain a faster algorithm, one has to reduce 
the number of free variables. Denote by P ∈ R^{m×n}, with n ≤ m and m, n ∈ N, an
extension matrix (i.e. Pᵀ is a projection) with PᵀP = 1. We will make the ansatz

α_P := Pβ   where β ∈ R^n,                                    (11)

and find solutions β such that Q(α_P) (or Q*(α_P)) is minimized. The solution is

β_opt = (Pᵀ(σ²K + KᵀK)P)⁻¹ PᵀKᵀy.                             (12)

Clearly if P is of rank m, this will also be the solution of (4) (the minimum negative
log posterior for all α ∈ R^m). In all other cases, however, it is an approximation.
Computational Cost of Greedy Decompositions 
For a given P ∈ R^{m×n}, let us analyze the computational cost involved in the
estimation procedures. To compute (12) we need to evaluate PᵀKᵀy, which is O(nm),
(KP)ᵀ(KP), which is O(n²m), and invert an n × n matrix, which is O(n³). Hence the
total cost is O(n²m). Predictions then cost only kᵀα, which is O(n). Using P also
to minimize Q*(Pβ*) costs no more than O(n³); this is needed to upper-bound
the log posterior.
For error bars, we have to approximately minimize (10), which can be done for
α = Pβ at O(n³) cost; if we compute (PᵀKP)⁻¹ beforehand, this can be done at
O(n²). Likewise for the lower bounds on the error bars, we have to minimize
-kᵀKPβ + ½ βᵀPᵀ(σ²K + KᵀK)Pβ, which costs O(n²m) (once the inverse matrices
have been computed, one may, however, use them to compute error bars at different
locations, too, thus costing only O(n²)). The lower bounds on the error bars may
not be so crucial, since a bad estimate will only lead to overly conservative confidence
intervals and not have any other negative effect. Finally, note that all we ever have
to compute and store is KP, i.e. an m × n submatrix of K, rather than K itself.
Table 1 summarizes the scaling behaviour of several optimization algorithms.
                 Exact       Conjugate      Optimal Sparse     Sparse Greedy
                 Solution    Gradient [2]   Decomposition      Approximation
Memory           O(m²)       O(m²)          O(nm)              O(nm)
Initialization   O(m³)       O(nm²)         O(n²m)             O(κn²m)
Pred. Mean       O(m)        O(m)           O(n)               O(n)
Error Bars       O(m²)       O(nm²)         O(n²m) or O(n²)    O(κn²m) or O(n²)

Table 1: Computational Cost of Optimization Methods. Note that n ≪ m, and also
note that the n used in the Conjugate Gradient, Sparse Decomposition, and Sparse
Greedy Approximation methods will differ, with n_CG ≤ n_SD ≤ n_SGA, since the
search spaces are more restricted. Here κ denotes the size of the random candidate
subsets; κ = 60 gives near-optimal results.
Sparse Greedy Approximation 
Several choices for P are possible, including choosing the principal components 
of K [8], using conjugate gradient descent to minimize Q [2], symmetric Cholesky 
factorization [1], or using a sparse greedy approximation of K [7]. Yet these methods 
have the disadvantage that they either do not take the specific form of y into account 
[8, 7] or lead to expansions that cost O(m) for prediction and require computation 
and storage of the full matrix [8, 2]. 
If we require a sparse expansion of y(x) in terms of k(xi, x) (i.e. many αi in y(x) =
Σi αi k(xi, x) will be 0), we must consider matrices P that are a collection of unit
vectors ei (here (ei)j = δij). We use a greedy approach to find a good approximation.
First, for n = 1, we choose P = ei such that Q(Pβ_opt) is minimal. In this case we could
permit ourselves to consider all possible indices i ∈ {1, ..., m} and find the best one
by trying out all of them. Next assume that we have found a good solution Pβ,
where P contains n columns. In order to improve this solution, we may expand
P into the matrix P_new := [P_old, ei] ∈ R^{m×(n+1)} and seek the best ei such that
P_new minimizes min_β Q(P_new β). (Performing a full search over all possible n + 1
out of m indices would be too costly.) This greedy approach to finding a sparse
approximate solution is described in Algorithm 1. The algorithm also maintains an
approximate minimum of Q*, and exploits the bounds of Theorem 1 to determine
when the approximation is sufficiently accurate. (Note that we leave unspecified
how the subsets M ⊆ I, M* ⊆ I* are chosen. Assume for now that we choose
M = I, M* = I*, the full set of indices that have not yet been selected.) This
method is very similar to Matching Pursuit [4] or iterative reduced set Support
Vector algorithms [6], with the difference that the target to be approximated (the
full solution α) is only given implicitly via Q(α).
Approximation Quality 
Natarajan [5] studies the following Sparse Linear Approximation problem: Given
A ∈ R^{m×n}, b ∈ R^m, ε > 0, find x ∈ R^n with a minimal number of nonzero entries
such that ‖Ax - b‖ ≤ ε.

If we define A := (σ²K + KᵀK)^{1/2} and b := A⁻¹Ky, then we may write Q(α) =
½‖b - Aα‖² + c, where c is a constant independent of α. Thus the problem of
sparse approximate minimization of Q(α) is a special case of Natarajan's problem
(where the matrix A is square, symmetric, and positive definite). In addition, the
algorithm considered by Natarajan in [5] involves sequentially choosing columns of
A to maximally decrease ‖Ax - b‖. This is clearly equivalent to the sparse greedy
algorithm described above. Hence, it is straightforward to obtain the following
result from Theorem 2 in [5].
Theorem 2 (Approximation Rate) Algorithm 1 achieves Q(α) ≤ Q(α_opt) + ε
when α has

n ≤ (18 n*(ε/4) / λ²) ln(‖A⁻¹Ky‖ / ε)

non-zero components, where n*(ε/4) is the minimal number of nonzero components
in vectors α with Q(α) ≤ Q(α_opt) + ε/4, A = (σ²K + KᵀK)^{1/2}, and λ is
the minimum of the magnitudes of the singular values of Â, the matrix obtained by
normalizing the columns of A.
Randomized Algorithms for Subset Selection 
Unfortunately, the approximation algorithm considered above is still too expensive
for large m, since each search operation involves O(m) indices. Yet, if we are satisfied
with finding a relatively good index rather than the best one, we may resort to
selecting a random subset of size κ ≪ m. In Algorithm 1, this corresponds to
choosing M ⊆ I, M* ⊆ I* as random subsets of size κ. In fact, a constant value of κ
will typically suffice. To see why, we recall a simple lemma from [7]: the cumulative
distribution function of the maximum of κ i.i.d. random variables ξ1, ..., ξκ is
F^κ(ξ), where F(ξ) is the cdf of each ξi. Thus, in order to find a column to add to P
that is, with probability 0.95, among the best 0.05 of all such columns, a random
subsample of size ⌈log 0.05 / log 0.95⌉ = 59 will suffice.
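The subsample size can be checked directly; the short simulation below merely illustrates the order-statistics lemma with uniform scores.

```python
import math
import numpy as np

# Best-of-kappa lemma: to land in the top 5% of all candidates with
# probability 0.95, a subsample of size ceil(log 0.05 / log 0.95) suffices.
kappa = math.ceil(math.log(0.05) / math.log(0.95))

# Empirical check: the maximum of kappa i.i.d. uniform draws exceeds the
# 0.95-quantile in roughly 95% of trials.
rng = np.random.default_rng(3)
hits = (rng.uniform(size=(20000, kappa)).max(axis=1) > 0.95).mean()
```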
Algorithm 1 Sparse Greedy Quadratic Minimization.
Require: Training data X = {x1, ..., xm}, targets y, noise σ², precision ε
  Initialize index sets I, I* = {1, ..., m}; S, S* = ∅.
  repeat
    Choose M ⊆ I, M* ⊆ I*.
    Find arg min_{i ∈ M} Q([P, ei] β_opt) and arg min_{i* ∈ M*} Q*([P*, e_{i*}] β*_opt).
    Move i from I to S, and i* from I* to S*.
    Set P := [P, ei], P* := [P*, e_{i*}].
  until Q(Pβ_opt) + σ²Q*(P*β*_opt) + ½‖y‖² ≤ (ε/2)(|Q(Pβ_opt)| + |σ²Q*(P*β*_opt) + ½‖y‖²|)
  Output: Set of indices S, β_opt, (PᵀKP)⁻¹, and (Pᵀ(KᵀK + σ²K)P)⁻¹
Numerical Considerations 
The crucial part is to obtain the values of Q(Pβ_opt) cheaply (with P = [P_old, ei]),
provided we have solved the problem for P_old. From (12) one can see that all that
needs to be done is a rank-1 update on the inverse. In the following we show that this
can be obtained in O(nm) operations, provided the inverse of the smaller subsystem
is known. Expressing the relevant terms using P_old and the i-th column ki := Kei,
we obtain

PᵀKᵀy = [P_old, ei]ᵀKᵀy = (P_oldᵀKᵀy, kiᵀy),

Pᵀ(KᵀK + σ²K)P = [ P_oldᵀ(KᵀK + σ²K)P_old    P_oldᵀ(Kᵀ + σ²1)ki ]
                 [ kiᵀ(K + σ²1)P_old          kiᵀki + σ²Kii      ].
Thus computation of the terms costs only O(nm), given the values for P_old.
Furthermore, it is easy to verify that we can write the inverse of a symmetric
positive semidefinite matrix as

[ A   B ]⁻¹    [ A⁻¹ + (A⁻¹B)γ(A⁻¹B)ᵀ    -(A⁻¹B)γ ]
[ Bᵀ  C ]   =  [ -γ(A⁻¹B)ᵀ               γ        ],   where γ := (C - BᵀA⁻¹B)⁻¹.  (13)

Hence, inversion of Pᵀ(KᵀK + σ²K)P costs only O(n²) per iteration. Thus, to find
P of size m × n takes O(κn²m) time. For the error bars, (PᵀKP)⁻¹ will generally
be a good starting value for the minimization of (10), so the typical cost for (10)
will be O(ñnm) for some ñ < n, rather than O(mn²). Finally, for added numerical
stability, one may want to use an incremental Cholesky factorization in (13) instead
of the inverse of a matrix.
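Equation (13) is the standard Schur-complement block-inverse identity; the following check (with γ written as a full matrix, although in the rank-one update of Algorithm 1 it is a scalar) confirms it numerically on random data of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
M = rng.normal(size=(n + 1, n + 1))
M = M @ M.T + np.eye(n + 1)            # symmetric positive definite
A, B, C = M[:n, :n], M[:n, n:], M[n:, n:]

Ainv = np.linalg.inv(A)
gamma = np.linalg.inv(C - B.T @ Ainv @ B)   # Schur complement inverse
AB = Ainv @ B
Minv = np.block([[Ainv + AB @ gamma @ AB.T, -AB @ gamma],
                 [-(AB @ gamma).T,          gamma]])

# The update (13) reproduces the inverse of the enlarged matrix.
assert np.allclose(Minv, np.linalg.inv(M))
```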
5 Experiments and Discussion 
We used the Abalone dataset from the UCI Repository to investigate the properties
of the algorithm. The dataset is of size 4177; we used a (4000, 177) training/testing
split to analyze the numerical performance, and a (3000, 1177) split to assess the
generalization error (the latter was needed in order to be able to invert (and keep in
memory) the full matrix K + σ²1 for a comparison). The data was rescaled to zero
mean and unit variance coordinate-wise. Finally, the gender encoding
(male/female/infant) was mapped into {(1, 0, 0), (0, 1, 0), (0, 0, 1)}.
In all our experiments we used Gaussian kernels k(x, x') = exp(-‖x - x'‖² / (2ω²))
as covariance kernels. Figure 1 analyzes the speed of convergence for different
subset sizes κ.
Figure 1: Speed of Convergence. We plot the size of the gap between upper and
lower bound of the log posterior (gap(α, α*)) against the number of iterations, for
the first 4000 samples from the Abalone dataset (σ² = 0.1 and 2ω² = 10). From
top to bottom: subsets of size 1, 2, 5, 10, 20, 50, 100, 200. The results were averaged
over 10 runs. The relative variance of the gap size was less than 10%. One can see
that subsets of size 50 and above ensure rapid convergence.
For the optimal parameters (2σ² = 0.1 and 2ω² = 10, chosen after [7]), the average
test error of the sparse greedy approximation trained until gap(α, α*) < 0.025 on a
(3000, 1177) split was 1.785 ± 0.32 (the results were averaged over ten independent
choices of training sets), slightly worse than for the exact GP estimate (1.782 ± 0.33).
The log posterior was -1.572 · 10³ (1 ± 0.005), the optimal value -1.571 · 10³ (1 ±
0.005). Hence, for all practical purposes, full inversion of the covariance matrix and
the sparse greedy approximation have statistically indistinguishable generalization
performance.
In a third experiment (Table 2) we analyzed the number of basis functions needed
to minimize the log posterior to gap(α, α*) < 0.025, depending on different choices
of the kernel width ω. In all cases, fewer than 10% of the kernel functions suffice to
find a good minimizer of the log posterior; for the error bars, even fewer than 2%
are sufficient. This is a dramatic improvement over previous techniques.
Kernel width 2ω²            1      2      5      10     20     50
Kernels for log-posterior   373    287    255    257    251    270
Kernels for error bars      79-61  49-43  26-27  17-16  12-9   8-5

Table 2: Number of basis functions needed to minimize the log posterior on the
Abalone dataset (4000 training samples), depending on the width ω of the kernel.
Also, the number of basis functions required to approximate kᵀ(K + σ²1)⁻¹k, which
is needed to compute the error bars. We averaged over the remaining 177 test
samples.
To ensure that our results were not dataset specific, and that the algorithm scales
well, we tested it on a larger synthetic dataset of size 10000 in 20 dimensions,
distributed according to N(0, 1). The data was generated by adding normal noise
with variance σ² = 0.1 to a function consisting of 200 randomly chosen Gaussians
of width 2ω² = 40 and normally distributed coefficients and centers.

We purposely chose an inadequate Gaussian process prior (but correct noise level)
of Gaussians with width 2ω² = 10 in order to avoid trivial sparse expansions. After
500 iterations (i.e. after using 5% of all basis functions) the size of gap(α, α*) was
less than 0.023 (note that this problem is too large to be solved exactly).
We believe that sparse greedy approximation methods are a key technique to scale
up Gaussian Process regression to sample sizes of 10,000 and beyond. The tech-
niques presented in this paper, however, are by no means limited to regression.
Work on the solution of dense quadratic programs and classification problems is in
progress. The authors thank Bob Williamson and Bernhard Schölkopf.
References 
[1] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representation. 
Technical report, IBM Watson Research Center, New York, 2000. 
[2] M. Gibbs and D.J.C. MacKay. Efficient implementation of Gaussian processes. Technical
report, Cavendish Laboratory, Cambridge, UK, 1997.
[3] F. Girosi. An equivalence between sparse approximation and support vector machines. 
Neural Computation, 10(6):1455-1480, 1998. 
[4] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE
Transactions on Signal Processing, 41:3397-3415, 1993.
[5] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on
Computing, 24(2):227-234, 1995.
[6] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola.
Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural
Networks, 10(5):1000-1017, 1999.
[7] A.J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learn-
ing. In P. Langley, editor, Proceedings of the 17th International Conference on Machine
Learning, pages 911-918, San Francisco, 2000. Morgan Kaufmann.
[8] C.K.I. Williams and M. Seeger. The effect of the input density distribution on kernel- 
based classifiers. In P. Langley, editor, Proceedings of the Seventeenth International 
Conference on Machine Learning, pages 1159 - 1166, San Francisco, California, 2000. 
Morgan Kaufmann. 
