Learning curves for Gaussian processes 
regression: A framework for good 
approximations 
Dörthe Malzahn    Manfred Opper
Neural Computing Research Group 
School of Engineering and Applied Science 
Aston University, Birmingham B4 7ET, United Kingdom. 
[malzahnd, opperm]@aston.ac.uk
Abstract 
Based on a statistical mechanics approach, we develop a method
for approximately computing average case learning curves for
Gaussian process regression models. The approximation works well in
the large sample size limit and for arbitrary dimensionality of the
input space. We explain how the approximation can be systematically
improved and argue that similar techniques can be applied to
general likelihood models.
1 Introduction
Gaussian process (GP) models have gained considerable interest in the neural
computation community (see e.g. [1, 2, 3, 4]) in recent years. Because they are
non-parametric by construction, their theoretical understanding seems to be less
well developed than that of simpler parametric models like neural networks. We
are especially interested in developing theoretical approaches which will at
least give good approximations to generalization errors when the number of
training data is sufficiently large.
In this paper we present a step in this direction which is based on a statistical
mechanics approach. In contrast to most previous applications of statistical
mechanics to learning theory, we are not limited to the so-called "thermodynamic"
limit, which would require a high-dimensional input space.
Our work is very much motivated by recent papers of Peter Sollich (see e.g. [5]),
who presented an approximate treatment of the Bayesian generalization error of GP
regression which gives good results even in the case of a one-dimensional
input space. His method is based on an exact recursion for the generalization
error of the regression problem together with approximations that decouple certain
correlations of random variables. Unfortunately, the method seems to be limited
in three respects. First, the exact recursion is an artifact of the Gaussianity
of the regression model and is not available for other cases such as
classification models. Second, it is not clear how to assess the quality of the
approximations made and how one may systematically improve on them. Finally, the
calculation is (so far) restricted to a full Bayesian scenario, where a prior
average over the unknown data generating function simplifies the analysis.
Our approach has the advantage that it is more general and may also be applied to 
other likelihoods. It allows us to compute other quantities besides the generalization 
error. Finally, it is possible to compute the corrections to our approximations. 
2 Regression with Gaussian processes
To explain the Gaussian process scenario for regression problems [2], we assume
that we observe corrupted values $y(x) \in \mathbb{R}$ of an unknown function
$f(x)$ at input points $x \in \mathbb{R}^d$. If the corruption is due to
independent Gaussian noise with variance $\sigma^2$, the likelihood for a set of
$m$ example data $D = (y(x_1), \ldots, y(x_m))$ is given by
$$P(D|f) = \frac{\exp\left(-\sum_{i=1}^{m} \frac{(y_i - f(x_i))^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{m/2}} \qquad (1)$$
where $y_i \doteq y(x_i)$. The goal of a learner is to give an estimate of the
function $f(x)$.
The available prior information is that $f$ is a realization of a Gaussian process
(random field) with zero mean and covariance $C(x,x') = E[f(x)f(x')]$, where $E$
denotes the expectation over the Gaussian process. We assume that the prediction
at a test point $x$ is given by the posterior expectation of $f(x)$, i.e. by
$$\hat{f}(x) = E(f(x)|D) = \frac{E\, f(x)\, P(D|f)}{Z} \qquad (2)$$
where the partition function $Z$ normalises the posterior. Calling the true data
generating function $f^*$ (in order to distinguish it from the functions over
which we integrate in the expectations), we are interested in the learning curve,
i.e. the generalization (mean square) error averaged over independent draws of
example data, $\varepsilon_g = [\langle (f^*(x) - \hat{f}(x))^2 \rangle]_D$, as a
function of the sample size $m$. The brackets $[\ldots]_D$ denote averages over
example data sets, where we assume that the inputs $x_i$ are drawn independently
at random from a density $p(x)$. $\langle \ldots \rangle$ denotes an average over
test inputs drawn from the same density. Later, the same brackets will also be
used for averages over several different test points and for joint averages
over test inputs and test outputs.
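For Gaussian noise the posterior expectation in Eq. (2) has the well-known closed
form $\hat{f}(x) = k(x)^T (K + \sigma^2 I)^{-1} y$ with $K_{ij} = C(x_i, x_j)$ and
$k_i(x) = C(x, x_i)$. The following minimal sketch estimates the learning curve
$\varepsilon_g(m)$ by brute-force averaging over data sets. The squared-exponential
kernel, the target function, the uniform input density on [0, 1] and all function
names are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def kernel(a, b, length=0.1):
    """Squared-exponential covariance C(x, x') (an illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def posterior_mean(x_train, y_train, x_test, noise_var):
    """Posterior expectation of f at x_test for Gaussian noise, cf. Eq. (2)."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    return kernel(x_test, x_train) @ np.linalg.solve(K, y_train)

def generalization_error(f_star, m, noise_var, n_sets=50, n_test=200, seed=0):
    """Monte Carlo estimate of eps_g = [<(f*(x) - fhat(x))^2>]_D."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=m)              # inputs drawn from p(x)
        y = f_star(x) + rng.normal(0.0, np.sqrt(noise_var), size=m)
        x_test = rng.uniform(0.0, 1.0, size=n_test)    # test inputs from p(x)
        f_hat = posterior_mean(x, y, x_test, noise_var)
        errors.append(np.mean((f_star(x_test) - f_hat) ** 2))
    return float(np.mean(errors))

def f_star(x):
    """Hypothetical data generating function."""
    return np.sin(2.0 * np.pi * x)

# Empirical learning curve at a few sample sizes m.
curve = {m: generalization_error(f_star, m, noise_var=0.01) for m in (5, 20, 80)}
```

Such a simulation is the kind of baseline against which an analytical
approximation of the learning curve can be compared.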
3 The Partition Function 
As is typical of statistical mechanics approaches, we base our analysis on the
averaged "free energy" $[-\ln Z]_D$, where the partition function $Z$ (see
Eq. (2)) is
$$Z = E\, P(D|f). \qquad (3)$$
$[\ln Z]_D$ serves as a generating function for suitable posterior averages. The
concrete application to $\varepsilon_g$ will be given in the next section. The
computation of $[\ln Z]_D$ is based on the replica trick
$\ln Z = \lim_{n \to 0} \frac{Z^n - 1}{n}$, where we compute $[Z^n]_D$ for integer
$n$ and perform the continuation at the end.
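The identity behind the replica trick can be checked numerically for a single
positive number $Z$. The sketch below only illustrates the limit
$(Z^n - 1)/n \to \ln Z$; it does not perform the continuation from integer $n$
that the actual calculation relies on. The value of `Z` and the function name are
arbitrary choices.

```python
import numpy as np

def replica_estimate(Z, n):
    """Finite-n approximation (Z**n - 1) / n to ln Z."""
    return (Z ** n - 1.0) / n

Z = 3.7                                   # arbitrary positive "partition function"
exact = np.log(Z)
n_values = (1.0, 0.1, 0.01, 0.001)
approximations = [replica_estimate(Z, n) for n in n_values]
# The deviation shrinks roughly linearly in n, like n * (ln Z)**2 / 2.
errors = [abs(a - exact) for a in approximations]
```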
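Introducing a set of auxiliary integration variables $z_{ka}$ in order to decouple
the squares, we get
$$[Z^n]_D = \int \prod_{k,a} \frac{dz_{ka}}{2\pi}\; e^{-\frac{\sigma^2}{2} \sum_{k,a} z_{ka}^2}\; E_n \left[ \exp\left( i \sum_{k=1}^{m} \sum_{a=1}^{n} z_{ka} (f_a(x_k) - y_k) \right) \right]_D \qquad (4)$$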
where $E_n$ denotes the expectation over the $n$ times replicated GP measure. In
general, it seems impossible to perform the average over the data. Using a
cumulant expansion, an infinite series of terms would be created. However, one may
be tempted to try the following heuristic approximation: if (for a fixed function
$f$) the distribution of $f(x_k) - y_k$ were a zero-mean Gaussian, we would simply
end up with only the second cumulant and
$$[Z^n]_D \simeq \int \prod_{k,a} \frac{dz_{ka}}{2\pi}\; e^{-\frac{\sigma^2}{2} \sum_{k,a} z_{ka}^2} \times E_n \exp\left( -\frac{1}{2} \sum_{a,b} \sum_{k} z_{ka} z_{kb} \langle (f_a(x) - y)(f_b(x) - y) \rangle \right). \qquad (5)$$
Although such a reasoning may be justified in cases where the dimensionality of
the inputs $x$ is large, the assumption of approximate Gaussianity is typically
(in the sense of the prior measure over functions $f$) completely wrong for small
dimensions. Nevertheless, we will argue in Section 5 that the expression Eq. (5)
(justified by a different reason) is a good approximation for large sample sizes
and nonzero noise level. We will postpone the argument and proceed to evaluate
Eq. (5) following a fairly standard recipe: the high-dimensional integrals over
$z_{ka}$ are turned into low-dimensional integrals by the introduction of
"order parameters" $\eta_{ab} = \sum_{k=1}^{m} z_{ka} z_{kb}$, so that
$$[Z^n]_D = \int \prod_{a \le b} d\eta_{ab}\; e^{G(\{\eta\})} \times E_n \exp\left( -\frac{1}{2} \sum_{a,b} \eta_{ab} \langle (f_a(x) - y)(f_b(x) - y) \rangle \right) \qquad (6)$$
where $e^{G(\{\eta\})} = \int \prod_{k,a} \frac{dz_{ka}}{2\pi}\, e^{-\frac{\sigma^2}{2} \sum_{k,a} z_{ka}^2} \prod_{a \le b} \delta\left( \sum_{k=1}^{m} z_{ka} z_{kb} - \eta_{ab} \right)$.
We expect that in the limit of large sample size $m$, the integrals are well
approximated by the saddle-point method. To perform the limit $n \to 0$, we make
the assumption that the saddle point of the matrix $\eta$ is replica symmetric,
i.e. $\eta_{ab} = \eta$ for $a \neq b$ and $\eta_{aa} = \eta_0$. After some
calculations we arrive at
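$$[\ln Z]_D = -\frac{\sigma^2}{2}\eta_0 + \frac{m}{2} \ln(\eta_0 - \eta) + \frac{m\eta}{2(\eta_0 - \eta)} - \frac{\eta}{2} \langle E^0 f^2(x) \rangle + \ln E \exp\left[ -\frac{\eta_0 - \eta}{2} \langle (f(x) - y)^2 \rangle \right] - \frac{m}{2} (\ln(2\pi m) - 1) \qquad (7)$$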
into which we have to insert the values $\eta$ and $\eta_0$ that make the
right-hand side an extremum. We have defined a new auxiliary (translated) Gaussian
measure over functions by
$$E^0\{\phi\} = \frac{E\left\{ \phi \exp\left[ -\frac{\eta_0 - \eta}{2} \langle f^2(x) \rangle \right] \right\}}{E \exp\left[ -\frac{\eta_0 - \eta}{2} \langle f^2(x) \rangle \right]} \qquad (8)$$
where $\phi$ is a functional of $f$. For a given input distribution it is possible
to compute the required expectations in terms of sums over eigenvalues and
eigenfunctions of the covariance kernel $C(x, x')$. We will give the details as
well as the explicit order parameter equations in a full version of the paper.
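The replica-symmetric ansatz makes the continuation to $n \to 0$ explicit because
an $n \times n$ matrix with diagonal entries $\eta_0$ and off-diagonal entries
$\eta$ has the determinant $(\eta_0 - \eta)^{n-1} (\eta_0 + (n-1)\eta)$, a standard
linear-algebra fact. The sketch below checks this formula numerically and
evaluates the small-$n$ limit
$\frac{1}{n} \ln\det\eta \to \ln(\eta_0 - \eta) + \eta/(\eta_0 - \eta)$. That the
entropic term $G$ is governed by $\ln\det\eta$ for large $m$ is an assumption made
here (it is the usual outcome for Gram-matrix constraints of this type), and the
function names and numerical values are illustrative only.

```python
import numpy as np

def rs_matrix(n, eta0, eta):
    """Replica-symmetric n x n matrix: eta0 on the diagonal, eta elsewhere."""
    return (eta0 - eta) * np.eye(n) + eta * np.ones((n, n))

def rs_logdet(n, eta0, eta):
    """ln det of the replica-symmetric matrix; also usable for non-integer n."""
    return (n - 1) * np.log(eta0 - eta) + np.log(eta0 + (n - 1) * eta)

def small_n_limit(eta0, eta):
    """lim_{n -> 0} (1/n) ln det eta."""
    return np.log(eta0 - eta) + eta / (eta0 - eta)

eta0, eta = 3.0, -0.5                     # hypothetical values, eta < 0 < eta0 - eta
direct = np.linalg.slogdet(rs_matrix(4, eta0, eta))[1]   # integer n = 4
closed_form = rs_logdet(4, eta0, eta)
near_zero = rs_logdet(1e-4, eta0, eta) / 1e-4            # continuation to small n
limit = small_n_limit(eta0, eta)
```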
4 Generalization error 
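To relate the generalization error with the order parameters, note that in the
replica framework (assuming the approximation Eq. (5)) we have
$$\varepsilon_g + \sigma^2 = \lim_{n \to 0} \int \prod_{a \le b} d\eta_{ab}\; e^{G(\{\eta\})} \times E_n \left\{ \langle (f_1(x) - y)(f_2(x) - y) \rangle \exp\left( -\frac{1}{2} \sum_{a,b} \eta_{ab} \langle (f_a(x) - y)(f_b(x) - y) \rangle \right) \right\}$$
which by a partial integration and a subsequent saddle point integration yields
$$\varepsilon_g = -\frac{m\eta}{(\eta_0 - \eta)^2} - \sigma^2. \qquad (9)$$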
It is also possible to compute other error measures in terms of the order
parameters, like the expected error on the (noisy) training data defined by
$$\varepsilon_t = \frac{1}{m} \sum_{i=1}^{m} [(y_i - \hat{f}(x_i))^2]_D = -\frac{\sigma^4 \eta}{m}. \qquad (10)$$
The "true" training error, which compares the prediction with the data generating
function $f^*$, is somewhat more complicated and will be given elsewhere.
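Read as $\varepsilon_g = -m\eta/(\eta_0 - \eta)^2 - \sigma^2$ and
$\varepsilon_t = -\sigma^4\eta/m$, Eqs. (9) and (10) turn a pair of order
parameters directly into error estimates; since $\eta$ is negative, both
expressions can be positive. The sketch below evaluates them. The numerical values
of $\eta_0$ and $\eta$ are hypothetical placeholders, not solutions of the order
parameter equations (which are not given in this paper), and the function names
are our own.

```python
def generalization_error(eta0, eta, m, noise_var):
    """Eq. (9): eps_g = -m * eta / (eta0 - eta)**2 - sigma^2."""
    return -m * eta / (eta0 - eta) ** 2 - noise_var

def training_error(eta, m, noise_var):
    """Eq. (10): eps_t = -sigma^4 * eta / m."""
    return -noise_var ** 2 * eta / m

m, noise_var = 1000, 0.01
eta0, eta = -2.0e4, -1.2e5        # hypothetical values with eta0 - eta = 1e5 > 0

eps_g = generalization_error(eta0, eta, m, noise_var)
eps_t = training_error(eta, m, noise_var)
```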
5 Why (and when) the approximation works 
Our intuition behind the approximation Eq. (5) is that for sufficiently large
sample size, the partition function is dominated by regions in function space
which are close to the data generating function $f^*$, such that terms like
$\langle (f_a(x) - y)(f_b(x) - y) \rangle$ are typically small and higher-order
polynomials in $f_a(x) - y$ generated by a cumulant expansion are less important.
This intuition can be checked self-consistently by estimating the omitted terms
perturbatively. We use the following modified partition function
$$[Z^n(\lambda)]_D = \int \prod_{k,a} \frac{dz_{ka}}{2\pi}\; e^{-\frac{\sigma^2}{2} \sum_{k,a} z_{ka}^2}\; E_n \left[ \exp\left( i\lambda \sum_{k,a} z_{ka} (f_a(x_k) - y_k) - \frac{1 - \lambda^2}{2} \sum_{a,b} \sum_{k} z_{ka} z_{kb} \langle (f_a(x) - y)(f_b(x) - y) \rangle \right) \right]_D \qquad (11)$$
which for $\lambda = 1$ becomes the "true" partition function, whereas Eq. (5) is
obtained for $\lambda = 0$. Expanding in powers of $\lambda$ (the terms with odd
powers vanish) is equivalent to generating the cumulant expansion and subsequently
expanding the non-quadratic terms down. Within the saddle-point approximation, the
first nonzero correction to our approximation of $[\ln Z]_D$ is given by
A4( (q -q) (o'('(x,x)) + (((x,x)F(x))- (((x,x')F(x)F(x')) 
2m 
+v(x, (x', x")) - (x, (x, x'))) 
I + 
+ (-q- --)(('(x,x))-('(x,x')))). (12) 
m 
$C^0(x,x') = E^0\{f(x)f(x')\}$ denotes the covariance with respect to the auxiliary
measure and $F(x) \doteq f^*(x) - \langle C^0(x,x'') f^*(x'') \rangle$. The
significance of the individual terms as $m \to \infty$ can be estimated from the
following scaling. We find that $(\eta_0 - \eta) = O(m)$ is a positive quantity,
whereas $\eta = O(m)$ is negative, and $C^0(x,x') = O(1/m)$. Using these
relations, we can show that Eq. (12) remains finite as $m \to \infty$, whereas the
leading approximation Eq. (7) diverges with $m$.
We have not (yet) computed the resulting correction to $\varepsilon_g$. However,
we have studied the somewhat simpler error measure
$\varepsilon' \doteq \frac{1}{m} \sum_i [E\{(f^*(x_i) - f(x_i))^2 | D\}]_D$,
which can be obtained from a derivative of $[\ln Z]_D$ with respect to $\sigma^2$.
It equals the error of a Gibbs algorithm (sampling from the posterior) on the
training data. We can show that the correction to $\varepsilon'$ is typically
smaller than the leading term by a factor of $O(1/m)$. However, our approximation
becomes worse with decreasing noise variance $\sigma^2$. The case $\sigma = 0$ is
singular: at least for some GPs with slowly decreasing eigenvalues, it can be
shown that our approximation for $\varepsilon_g$ decays to zero at the wrong rate.
For small values of $\sigma$, $\sigma \to 0$, we expect that higher order terms in
the perturbation expansion will become relevant.
6 Results 
We compare our analytical results for the error measures $\varepsilon_g$ and
$\varepsilon_t$ with simulations of GP regression. For simplicity, we have chosen
periodic processes of the form
$f(x) = \sqrt{2} \sum_n (a_n \cos(2\pi n x) + b_n \sin(2\pi n x))$ for
$x \in [0,1]$, where the coefficients $a_n$, $b_n$ are independent Gaussians with
$E\{a_n^2\} = E\{b_n^2\} = \Lambda_n$. This choice is convenient for analytical
calculations because of the orthogonality of the trigonometric functions when we
sample the $x_i$ from a uniform density in $[0, 1]$. The $\Lambda_n$ and the
translation invariant covariance kernel are related by
$c(x - y) \doteq C(x,y) = 2 \sum_n \Lambda_n \cos(2\pi n (x - y))$ and
$\Lambda_n = \int_0^1 c(x) \cos(2\pi n x)\, dx$. We specialise to the (periodic)
RBF kernel $c(x) = \sum_{k=-\infty}^{\infty} \exp[-(x - k)^2 / 2l^2]$ with
$l = 0.1$. For an illustration we generated learning curves for two target
functions $f^*$ as displayed in
Fig. 1. One function is a sine wave $f^*(x) = 2\sqrt{\Lambda_1} \sin(2\pi x)$, while
the other is a random realisation from the prior distribution. The symbols in the
left panel of Fig. 1
represent example sets of fifty data points. The data points have been obtained by
corruption of the target function with Gaussian noise of variance
$\sigma^2 = 0.01$. The right panel of Fig. 1 shows the data-averaged
generalization and training errors $\varepsilon_g$, $\varepsilon_t$ as a function
of the number $m$ of example data. Solid curves display simulation results, while
the results of our theory, Eqs. (9) and (10), are given by dashed lines. The
training error $\varepsilon_t$ converges to the noise level $\sigma^2$. As one can
see from the figure, our theory is very accurate when the number $m$ of example
data is sufficiently large. While the generalization error $\varepsilon_g$ differs
initially, the asymptotic decay is the same.
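The ingredients of this simulation setup can be sketched as follows. The code
evaluates the periodic RBF kernel with a truncated image sum, computes the
$\Lambda_n$ by numerical integration, and draws a random function from the prior.
The truncation of the image sum, the number of Fourier modes, the omission of the
constant ($n = 0$) mode and all function names are assumptions of this sketch and
are not specified in the text.

```python
import numpy as np

def periodic_rbf(x, length=0.1, n_images=10):
    """c(x) = sum_k exp(-(x - k)^2 / (2 l^2)), truncated to |k| <= n_images."""
    k = np.arange(-n_images, n_images + 1)
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[..., None] - k) ** 2 / (2.0 * length ** 2)).sum(axis=-1)

def eigenvalue(n, length=0.1, n_grid=4096):
    """Lambda_n = int_0^1 c(x) cos(2 pi n x) dx via a periodic Riemann sum."""
    x = np.arange(n_grid) / n_grid
    return float(np.mean(periodic_rbf(x, length) * np.cos(2.0 * np.pi * n * x)))

def sample_prior(x, n_modes=10, length=0.1, seed=0):
    """Draw f(x) = sqrt(2) sum_n (a_n cos + b_n sin) with Var a_n = Var b_n = Lambda_n."""
    rng = np.random.default_rng(seed)
    f = np.zeros_like(x, dtype=float)
    for n in range(1, n_modes + 1):       # assumption: modes n = 1, ..., n_modes only
        lam = max(eigenvalue(n, length), 0.0)   # guard against round-off below zero
        a, b = rng.normal(0.0, np.sqrt(lam), size=2)
        phase = 2.0 * np.pi * n * x
        f += np.sqrt(2.0) * (a * np.cos(phase) + b * np.sin(phase))
    return f

# Noisy example data from a prior sample, with sigma^2 = 0.01 as in the text.
rng = np.random.default_rng(1)
x_data = rng.uniform(0.0, 1.0, size=50)
y_data = sample_prior(x_data) + rng.normal(0.0, 0.1, size=50)

# For this kernel the integral has a closed form (Fourier transform of a Gaussian).
lambda_1_exact = np.sqrt(2.0 * np.pi) * 0.1 * np.exp(-2.0 * np.pi ** 2 * 0.1 ** 2)
```

The closed form $\Lambda_n = \sqrt{2\pi}\, l\, e^{-2\pi^2 n^2 l^2}$ used in the last
line shows how quickly the eigenvalues of this kernel decay with $n$.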
7 The Bayes error 
We can also apply our method to the Bayesian generalization error (previously
approximated by Peter Sollich [5]). The Bayes error is obtained by averaging the
generalization error over "true" functions $f^*$ drawn at random from the prior
distribution. Within our approach this can be achieved by an average of Eq. (7)
over $f^*$. The resulting order parameter equations and their relation to the
Bayes error turn out to be identical to Sollich's result. Hence, we have managed
to re-derive his approximation within a broader framework from which possible
corrections can also be obtained.
[Figure: left panels titled "Data generating function", showing $f^*(x)$ versus $x$; right panels titled "Learning curves", showing errors on a logarithmic scale versus the number $m$ of example data.]
Figure 1: The left panels show two data generating functions $f^*(x)$ and example
sets of 50 data points. The right panels display the corresponding averaged
learning curves. Solid curves display simulation results for generalization and
training errors $\varepsilon_g$, $\varepsilon_t$. The results of our theory,
Eqs. (9) and (10), are given by dashed lines.
8 Future work 
We are currently extending our method in the following directions:
• The statistical mechanics framework presented in this paper is based on
  a partition function $Z$ which can be used to generate a variety of other
  data averages for posterior expectations. An obvious interesting quantity
  is given by the sample fluctuations of the generalization error
  $$[\langle (f^*(x) - \hat{f}(x))^2 \rangle^2]_D - ([\langle (f^*(x) - \hat{f}(x))^2 \rangle]_D)^2 \qquad (13)$$
  which gives confidence intervals on $\varepsilon_g$.
• Obviously, our method is not restricted to a regression model (in this case
  however, all resulting integrals are elementary) but can also be directly
  generalized to other likelihoods such as the classification case [4, 6]. A
  further application to Support Vector Machines should be possible.
• The saddle-point approximation neglects fluctuations of the order
  parameters. This may be well justified when $m$ is sufficiently large. It is
  possible to improve on this by including the quadratic expansion around the
  saddle point.
• Finally, one may criticise our method as being of minor relevance to
  practical applications, because our calculations require the knowledge of the
  unknown function $f^*$ and the density of the inputs $x$. However, Eqs. (9)
  and (10) show that important error measures are solely expressed by the
  order parameters $\eta$ and $\eta_0$. Hence, estimating some error measures and
  the posterior variance at the data points empirically would allow us to
  predict values for the order parameters. Those in turn could be used to make
  predictions for the unknown generalization error.
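In a simulation, the sample fluctuation of Eq. (13) is simply the variance over
data sets of the test-averaged squared error. The following sketch computes it
from a list of per-data-set errors. The numbers are hypothetical and the function
name is our own; nothing here reproduces the analytical calculation announced
above.

```python
import numpy as np

def sample_fluctuation(errors_per_dataset):
    """Eq. (13): [e^2]_D - ([e]_D)^2 with e = <(f*(x) - fhat(x))^2> per data set."""
    e = np.asarray(errors_per_dataset, dtype=float)
    return float(np.mean(e ** 2) - np.mean(e) ** 2)

# Hypothetical generalization errors measured on five independent data sets.
errors = [0.012, 0.009, 0.015, 0.011, 0.013]
fluctuation = sample_fluctuation(errors)
spread = np.sqrt(fluctuation)     # typical deviation of the error from its mean
```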
Acknowledgement 
This work has been supported by EPSRC grant GR/M81601. 
References 
[1] D. J. C. MacKay, Gaussian Processes, A Replacement for Neural
Networks, NIPS tutorial 1997. May be obtained from
http://wol.ra.phy.cam.ac.uk/pub/mackay/.
[2] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, in
Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and
M. E. Hasselmo, eds., 514-520, MIT Press (1996).
[3] C. K. I. Williams, Computing with Infinite Networks, in Neural Information
Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301,
MIT Press (1997).
[4] D. Barber and C. K. I. Williams, Gaussian Processes for Bayesian Classification
via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C.
Mozer, M. I. Jordan and T. Petsche, eds., 340-346, MIT Press (1997).
[5] P. Sollich, Learning curves for Gaussian processes, in Neural Information
Processing Systems 11, M. S. Kearns, S. A. Solla and D. A. Cohn, eds., 344-350,
MIT Press (1999).
[6] L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther, Efficient
approaches to Gaussian process classification, in Advances in Neural Information
Processing Systems, volume 12, 2000.
