Computing with Finite and Infinite Networks 
Ole Winther* 
Theoretical Physics, Lund University 
Sölvegatan 14 A, S-223 62 Lund, Sweden
winther@nimis.thep.lu.se
Abstract 
Using statistical mechanics results, I calculate learning curves (average 
generalization error) for Gaussian processes (GPs) and Bayesian neural 
networks (NNs) used for regression. Applying the results to learning a 
teacher defined by a two-layer network, I can directly compare GP and 
Bayesian NN learning. I find that a GP in general requires $O(d^s)$ training
examples to learn input features of order $s$ ($d$ is the input dimension),
whereas a NN can learn the task with a number of training examples of the order
of the number of adjustable weights. Since a GP can be considered as an infinite
NN, the results show that even in the Bayesian approach, it is important 
to limit the complexity of the learning machine. The theoretical findings 
are confirmed in simulations with analytical GP learning and a NN mean 
field algorithm. 
1 Introduction 
Non-parametric kernel methods such as Gaussian Processes (GPs) and Support Vector
Machines (SVMs) are closely related to neural networks (NNs). Kernel methods may be
considered as single layer networks in a possibly infinite dimensional feature space.
Both the Bayesian GP approach and SVMs regularize the learning problem so that only a
finite number of the features (dependent on the amount of data) is used.
Neal [1] has shown that Bayesian NNs converge to GPs in the limit of an infinite number of
hidden units and furthermore argued that (1) there is no reason to believe that real-world
problems should require only a 'small' number of hidden units and (2) there are in the
Bayesian approach no reasons (besides computational) to limit the size of the network.
Williams [2] has derived kernels allowing for efficient computation with both infinite
feedforward and radial basis networks.
In this paper, I show that learning with finite rather than infinite networks can make a
profound difference by studying the case where the task to be learned is defined by a large
but finite two-layer NN. A theoretical analysis of the Bayesian approach to learning this
task shows that the Bayesian student makes a learning transition from a linear model to a
specialized non-linear one when the number of examples is of the order of the number of
adjustable weights in the network. This effect, which is also seen in the simulations, is a
consequence of the finite complexity of the network. In an infinite network, i.e. a GP, on
the other hand, such a transition will not occur. It will eventually learn the task but it
requires $O(d^s)$ training examples to learn features of order $s$, where $d$ is the input
dimension.

*http://www.thep.lu.se/tf2/staff/winther/
Here, I focus entirely on regression. However, the basic conclusions regarding learning
with kernel methods and NNs turn out to be valid more generally, e.g. for classification
(unpublished results and [3]).
I consider the usual Bayesian setup of supervised learning: A training set
$D_N = \{(x_i, y_i) \mid i = 1, \ldots, N\}$ ($x \in \mathbb{R}^d$ and $y \in \mathbb{R}$)
is known and the output for the new input $x$ is predicted by the function $f(x)$ which is
sampled from the prior distribution of model outputs. I will consider both a Gaussian
process prior and the prior implied by a large (but finite) two-layer network. The output
noise is taken to be Gaussian, so the likelihood becomes
$p(y|f(x)) = e^{-(y - f(x))^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$. The error measure is minus the
log-likelihood and the Bayes regressor (which minimizes the expected error) is the posterior
mean prediction

$$\langle f(x) \rangle = \frac{E_f\, f(x) \prod_i p(y_i|f(x_i))}{E_f \prod_i p(y_i|f(x_i))}, \qquad (1)$$

where I have introduced $E_f$, $f = f(x_1), \ldots, f(x_N), f(x)$, to denote an average with
respect to the model output prior.
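To make eq. (1) concrete, the following minimal Python sketch estimates the posterior mean by drawing functions from the prior and weighting each draw by the likelihood of the training data. The toy prior $f(x) = w x$ with $w \sim \mathcal{N}(0, 1)$, the data values and the function names are illustrative assumptions and are not taken from the analysis in this paper; the toy model is chosen only because its posterior mean is known in closed form, which allows a check.

```python
import math
import random


def bayes_regressor_mc(x_new, xs, ys, noise_var, n_samples=20000, seed=0):
    """Estimate eq. (1) by averaging over functions drawn from the prior.

    Illustrative prior (an assumption of this example): f(x) = w * x
    with a scalar weight w ~ N(0, 1).
    """
    rng = random.Random(seed)
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_samples):
        w = rng.gauss(0.0, 1.0)  # one draw of f from the prior
        # Product of Gaussian likelihoods; the normalisation cancels in the ratio.
        log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise_var)
        weight = math.exp(log_lik)
        numerator += (w * x_new) * weight
        denominator += weight
    return numerator / denominator


def exact_posterior_mean(x_new, xs, ys, noise_var):
    """Closed-form posterior mean for the same linear-Gaussian toy model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return x_new * sxy / (noise_var + sxx)


xs = [0.5, -1.0, 1.5]
ys = [0.4, -0.9, 1.2]
estimate = bayes_regressor_mc(2.0, xs, ys, noise_var=0.25)
exact = exact_posterior_mean(2.0, xs, ys, noise_var=0.25)
print(estimate, exact)
```

Sampling from the prior becomes inefficient as $N$ grows, because fewer and fewer prior draws carry appreciable weight, which is one reason analytical or mean field approximations are needed in practice.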
Gaussian processes. In this case, the model output prior is by definition Gaussian

$$p(f) = \frac{1}{\sqrt{(2\pi)^{N+1} \det C}} \exp\left(-\frac{1}{2} f^T C^{-1} f\right), \qquad (2)$$

where $C$ is the covariance matrix. The covariance matrix is computed from the kernel
(covariance function) $C(x, x')$. Below I give an explicit example corresponding to an
infinite two-layer network.
Bayesian neural networks. The output of the two-layer NN is given by
$f(x, w, W) = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} W_k \Phi(w_k \cdot x)$, where an especially
convenient choice of transfer function in what follows is
$\Phi(z) = \int_{-z}^{z} dt\, e^{-t^2/2}/\sqrt{2\pi}$. I consider a Bayesian framework (with
fixed known hyperparameters) with a weight prior that factorizes over hidden units,
$p(w, W) = \prod_k [p(W_k) p(w_k)]$, and Gaussian input-to-hidden weights
$w_k \sim \mathcal{N}(0, \Sigma)$.
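From Bayesian NNs to GPs. The prior over outputs for the Bayesian neural network is
$p(f) = \int dw\, dW\, p(w, W) \prod_i \delta(f(x_i) - f(x_i, w, W))$. In the infinite hidden
unit limit, $K \to \infty$, when $p(W_k)$ has zero mean and finite, say unit, variance, it
follows from the central limit theorem (CLT) that the prior distribution converges to a
Gaussian process $f \sim \mathcal{N}(0, C)$ with kernel [1, 2]

$$C(x, x') = \int dw\, p(w)\, \Phi(w \cdot x)\, \Phi(w \cdot x') = \frac{2}{\pi} \arcsin\left(\frac{x^T \Sigma x'}{\sqrt{(1 + x^T \Sigma x)(1 + x'^T \Sigma x')}}\right). \qquad (3)$$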
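The infinite-network kernel can be checked numerically. The sketch below assumes the transfer function $\Phi(z) = \mathrm{erf}(z/\sqrt{2})$ and an isotropic covariance $\Sigma$ = `weight_var * I` (both assumptions of this example), evaluates the closed-form arcsin expression, and compares it with a Monte Carlo estimate of the defining average over $w$. The function names and input vectors are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def arcsin_kernel(x, xp, weight_var):
    """Closed-form infinite-network kernel for Sigma = weight_var * I."""
    num = weight_var * dot(x, xp)
    den = math.sqrt((1.0 + weight_var * dot(x, x)) * (1.0 + weight_var * dot(xp, xp)))
    return (2.0 / math.pi) * math.asin(num / den)


def monte_carlo_kernel(x, xp, weight_var, n_samples=20000, seed=0):
    """Estimate E_w[Phi(w.x) Phi(w.x')] with Phi(z) = erf(z / sqrt(2))."""
    rng = random.Random(seed)
    std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, std) for _ in x]
        total += (math.erf(dot(w, x) / math.sqrt(2.0))
                  * math.erf(dot(w, xp) / math.sqrt(2.0)))
    return total / n_samples


x = [1.0, 0.5, -0.3]
xp = [0.2, -1.0, 0.7]
closed_form = arcsin_kernel(x, xp, weight_var=0.8)
estimate = monte_carlo_kernel(x, xp, weight_var=0.8)
print(closed_form, estimate)
```

The two numbers should agree up to the Monte Carlo error, which is below about 0.01 for this sample size.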
The rest of the paper deals with theoretical statistical mechanics analysis and simulations
for GPs and Bayesian NNs learning tasks defined by either a NN or a GP. For the simulations,
I use analytical GP learning (scaling like $O(N^3)$) [4] and a TAP mean field algorithm
for Bayesian NNs.
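For Gaussian noise, analytical GP learning reduces to the standard posterior mean $k(x)^T (C + \sigma^2 I)^{-1} y$, where $k(x)$ collects the kernel values between the test input and the training inputs; the $O(N^3)$ cost comes from solving the linear system. The following is a minimal standard-library sketch of that computation, not the implementation used for the simulations. The kernel assumes $\Sigma$ = `weight_var * I`, and the helper names and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def arcsin_kernel(x, xp, weight_var=1.0):
    """Infinite-network kernel, assuming Sigma = weight_var * I."""
    num = weight_var * dot(x, xp)
    den = math.sqrt((1.0 + weight_var * dot(x, x)) * (1.0 + weight_var * dot(xp, xp)))
    return (2.0 / math.pi) * math.asin(num / den)


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting, O(N^3)."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_posterior_mean(x_new, xs, ys, noise_var, kernel):
    """Posterior mean k(x)^T (C + noise_var * I)^{-1} y."""
    n = len(xs)
    A = [[kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_linear(A, ys)
    return sum(kernel(x_new, x_i) * a_i for x_i, a_i in zip(xs, alpha))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0.3, -0.2, 0.5]
prediction = gp_posterior_mean([1.0, 0.0], xs, ys, noise_var=1e-6, kernel=arcsin_kernel)
print(prediction)
```

With a very small noise level the prediction at a training input nearly reproduces the training target, which is a convenient sanity check.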
2 Statistical mechanics of learning 
The aim of the average case statistical mechanics analysis is to derive learning curves, i.e.
the expected generalization error as a function of the number of training examples. The
generalization error of the Bayes regressor $\langle f(x) \rangle$, eq. (1), is

$$\epsilon_g = \langle\langle (y - \langle f(x) \rangle)^2 \rangle\rangle, \qquad (4)$$

where double brackets $\langle\langle \ldots \rangle\rangle = \int \prod_i [dx_i\, dy_i\, p(x_i, y_i)] \ldots$
denote an average over both training examples and the test example $(x, y)$. Rather than
using eq. (4) directly, $\epsilon_g$ will, as is usually done, be derived from the average of
the free energy $-\langle\langle \ln Z \rangle\rangle$, where the partition function is given by

$$Z = E_f \exp\left(-\frac{1}{2\sigma^2} \sum_i (y_i - f(x_i))^2\right). \qquad (5)$$

I will not give many details of the actual calculations here since they are beyond the scope
of the paper, but only outline some of the basic assumptions.
2.1 Gaussian processes 
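The calculation for Gaussian processes is given in another NIPS contribution [5]. The basic
assumption made is that $y - f(x)$ becomes Gaussian with zero mean¹ under an average over
the training examples, $y - f(x) \sim \mathcal{N}(0, \langle\langle (y - f(x))^2 \rangle\rangle)$.
This assumption can be justified by the CLT when $f(x)$ is a sum of many random parts
contributing on the same scale. Corrections to the Gaussian assumption may also be
calculated [5]. The free energy may be written in terms of a set of order parameters, which
are found by saddlepoint integration. Assuming that the teacher is noisy,
$y = f_*(x) + \eta$ with noise variance $\sigma_*^2$, the generalization error is given by the
following equation, which depends upon an order parameter $v$:

$$\epsilon_g = \frac{\sigma_*^2 + \langle\langle f_*^2(x) \rangle\rangle - \partial_v \left( v^2 E'_f \langle\langle f(x) f_*(x) \rangle\rangle^2 \right)}{1 + \frac{v^2}{N}\, \partial_v E'_f \langle\langle f^2(x) \rangle\rangle}, \qquad (6)$$

$$v = \frac{N}{\sigma^2 + E'_f \langle\langle f^2(x) \rangle\rangle}, \qquad (7)$$

where the new normalized measure
$E'_f \ldots \propto E_f \exp(-v \langle\langle f^2(x) \rangle\rangle / 2) \ldots$ has been
introduced.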
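Kernels in feature space. By performing a Karhunen-Loève expansion, $f(x)$ can be
written as a linear perceptron with weights $w_p$ in a possibly infinite feature space

$$f(x) = \sum_p w_p \sqrt{\lambda_p}\, \phi_p(x), \qquad (8)$$

where the features $\phi_p(x)$ are orthonormal eigenvectors of the covariance function with
eigenvalues $\lambda_p$: $\int dx\, p(x)\, C(x', x)\, \phi_p(x) = \lambda_p \phi_p(x')$ and
$\int dx\, p(x)\, \phi_{p'}(x)\, \phi_p(x) = \delta_{pp'}$. The teacher $f_*(x)$ may also be
expanded in terms of the features:

$$f_*(x) = \sum_p a_p \sqrt{\lambda_p}\, \phi_p(x), \qquad a_p \sqrt{\lambda_p} = \int dx\, p(x)\, f_*(x)\, \phi_p(x).$$

Using the orthonormality, the averages may be found:
$\langle\langle f^2(x) \rangle\rangle = \sum_p \lambda_p w_p^2$,
$\langle\langle f(x) f_*(x) \rangle\rangle = \sum_p \lambda_p w_p a_p$ and
$\langle\langle f_*^2(x) \rangle\rangle = \sum_p \lambda_p a_p^2$. For a Gaussian process prior,
the prior over the weights is a spherical Gaussian, $w \sim \mathcal{N}(0, I)$. Averaging over
$w$, the saddlepoint equations can be written in terms of the number of examples $N$, the
noise levels $\sigma^2$ and $\sigma_*^2$, the eigenvalues of the covariance function
$\lambda_p$ and the teacher projections $a_p$:

$$\epsilon_g = \frac{N}{v}\, \frac{\sigma_*^2 + \sum_p \frac{\lambda_p a_p^2}{(1 + v \lambda_p)^2}}{\sigma^2 + \sum_p \frac{\lambda_p}{(1 + v \lambda_p)^2}}, \qquad (9)$$

$$\frac{N}{v} = \sigma^2 + \sum_p \frac{\lambda_p}{1 + v \lambda_p}. \qquad (10)$$

¹ Generalization to non-zero mean is straightforward.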
These eqs. are valid for a fixed teacher. However, eq. (9) may also be averaged over the
distribution of teachers. In the Bayes optimal scenario, the teacher is sampled from the
same prior as the student and $\sigma^2 = \sigma_*^2$. Thus $a_p \sim \mathcal{N}(0, 1)$,
implying $\overline{a_p^2} = 1$, where the average over the teacher is denoted by an
overline. In this case the equations reduce to the Bayes optimal result first derived by
Sollich [6]: $\epsilon_g = \epsilon_g^{\mathrm{Bayes}} = N/v$.
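The saddlepoint equations are easy to solve numerically. The sketch below assumes that eq. (10) reads $N/v = \sigma^2 + \sum_p \lambda_p/(1 + v\lambda_p)$ and that eq. (9) reads $\epsilon_g = (N/v)\,[\sigma_*^2 + \sum_p \lambda_p a_p^2/(1+v\lambda_p)^2]/[\sigma^2 + \sum_p \lambda_p/(1+v\lambda_p)^2]$. It finds $v$ by bisection and then evaluates $\epsilon_g$. The power-law spectrum and the function names are purely illustrative assumptions.

```python
def solve_order_parameter(n_examples, eigenvalues, noise_var, iterations=200):
    """Solve N / v = noise_var + sum_p lambda_p / (1 + v * lambda_p) for v > 0."""

    def g(v):
        # v times the residual of the equation; strictly decreasing in v.
        return (n_examples - noise_var * v
                - sum(v * lam / (1.0 + v * lam) for lam in eigenvalues))

    lo, hi = 0.0, n_examples / noise_var  # g(lo) = N > 0 and g(hi) <= 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def generalization_error(n_examples, eigenvalues, teacher_sq, noise_var, teacher_noise_var):
    """Learning-curve value for a fixed teacher with teacher_sq[p] = a_p ** 2."""
    v = solve_order_parameter(n_examples, eigenvalues, noise_var)
    num = teacher_noise_var + sum(lam * a2 / (1.0 + v * lam) ** 2
                                  for lam, a2 in zip(eigenvalues, teacher_sq))
    den = noise_var + sum(lam / (1.0 + v * lam) ** 2 for lam in eigenvalues)
    return (n_examples / v) * num / den, v


# Illustrative spectrum (an assumption): lambda_p = 1 / p^2 for p = 1..50.
eigenvalues = [1.0 / p ** 2 for p in range(1, 51)]
bayes_teacher = [1.0] * len(eigenvalues)  # averaged teacher with a_p^2 = 1
for n in (10, 100, 1000):
    eps, v = generalization_error(n, eigenvalues, bayes_teacher, 0.01, 0.01)
    print(n, eps, n / v)
```

In the Bayes optimal setting used in the loop, the two printed error columns coincide, reproducing $\epsilon_g = N/v$.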
Learning finite nets. Next, I consider the case where the teacher is the two-layer network
$f_*(x) = f(x, w, W)$ and the GP student uses the infinite net kernel, eq. (3). The average
over the teacher corresponds to an average over the weight prior, and since
$\overline{f_*(x) f_*(x')} = C(x, x')$, I get

$$\overline{a_p^2} = \frac{1}{\lambda_p} \int dx\, dx'\, p(x)\, p(x')\, C(x, x')\, \phi_p(x)\, \phi_p(x') = \frac{\lambda_p}{\lambda_p} = 1, \qquad (11)$$

where the eigenvalue equation and the orthonormality have been used. The theory therefore
predicts that a GP student (with the infinite network kernel) will have the same learning
curve irrespective of the number of hidden units of the NN teacher. This result is a direct
consequence of the Gaussian assumption made for the average over examples. However,
what is more surprising is that it is found to be a very good approximation in simulations
down to $K = 1$, i.e. a simple perceptron with a sigmoid non-linearity.
Inner product kernels. I specialize to inner product kernels $C(x, x') = c(x \cdot x'/d)$
and consider large input dimensionality $d$ and input components which are iid with
zero mean and unit variance. The eigenvectors are products of the input components,
$\phi_p(x) = \prod_{m \in p} x_m$, and are indexed by subsets of input indices, e.g.
$p = \{1, 2, 42\}$ [3]. The eigenvalues are $\lambda_p = c^{(|p|)}(0)/d^{|p|}$ with degeneracy
$n_{|p|} = \binom{d}{|p|} \approx d^{|p|}/|p|!$, where $|p|$ is the cardinality (in the example
above $|p| = 3$). Plugging these results into eqs. (9) and (10), it follows that to learn
features that are of order $s$ in the inputs, $O(d^s)$ examples are needed. The same behavior
has been predicted for learning in SVMs [3].

The infinite net, eq. (3), reduces to an inner product covariance function for
$\Sigma = T I/d$ ($T$ controls the degree of non-linearity of the rule) and large $d$,
$x \cdot x \approx d$:

$$C(x, x') = c(x \cdot x'/d) = \frac{2}{\pi} \arcsin\left(\frac{T}{1 + T}\, \frac{x \cdot x'}{d}\right). \qquad (12)$$
Figure 1 shows learning curves for GPs for the infinite network kernel. The mismatch
between theory and simulations is expected to be due to $O(1/d)$ corrections to the
eigenvalues $\lambda_p$. The figure clearly shows that learning of the different order
features takes place on different scales. The stars on the $\epsilon_g$-axis show the
theoretical prediction of the asymptotic error for $N = O(d), O(d^3), \ldots$ (the teacher is
an odd function).
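The separation of scales can be made explicit by tabulating the spectrum of the inner product kernel. The sketch below assumes the large-$d$ form $c(z) = \frac{2}{\pi} \arcsin(T z/(1 + T))$, eigenvalues $c^{(s)}(0)/d^s$ and degeneracies $\binom{d}{s}$, and uses the Taylor coefficients of the arcsin function. Only odd orders appear because the kernel is odd. The function names are hypothetical, and the values $d = 10$, $T = 10$ are those quoted for Figure 1.

```python
import math


def arcsin_derivative_at_zero(s):
    """s-th derivative of arcsin(u) at u = 0; it vanishes for even s."""
    if s % 2 == 0:
        return 0.0
    n = (s - 1) // 2
    # For s = 2n + 1 the derivative equals ((2n)! / (2^n n!))^2 = 1, 1, 9, 225, ...
    return (math.factorial(2 * n) / (2 ** n * math.factorial(n))) ** 2


def spectrum(d, T, max_order):
    """Return (order s, eigenvalue, degeneracy) for all non-zero orders up to max_order."""
    a = T / (1.0 + T)
    rows = []
    for s in range(1, max_order + 1):
        c_s = (2.0 / math.pi) * a ** s * arcsin_derivative_at_zero(s)
        if c_s > 0.0:
            rows.append((s, c_s / d ** s, math.comb(d, s)))
    return rows


d, T = 10, 10.0
for s, lam, n_s in spectrum(d, T, max_order=5):
    # n_s * lam is the prior variance carried by features of order s;
    # d ** s is the scale of N at which these features start to be learned.
    print(s, lam, n_s, n_s * lam, d ** s)
```

Each successive odd order has a much smaller eigenvalue but a much larger degeneracy, so it only contributes to learning once $N$ reaches the corresponding scale $d^s$.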
Figure 1: Learning curve for Gaussian processes with the infinite network kernel ($d = 10$,
$T = 10$ and $\sigma^2 = 0.01$) for two scales of training examples. The full line is the
theoretical prediction for the Bayes optimal GP scenario. The two other curves (almost on
top of each other as predicted by theory) are simulations for the Bayes optimal scenario
(dotted line) and for GP learning a neural network with $K = 30$ hidden units (dash-dotted
line).

2.2 Bayesian neural networks
The limit of large but finite NNs allows for efficient computation since the prior over
functions can be approximated by a Gaussian. The hidden-to-output weights are for
simplicity set to one and we introduce the 'fields' $h_k(x) = w_k \cdot x$ and write the
output as $f(x, w) = f(h(x)) = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi(h_k(x))$,
$h(x) = h_1(x), \ldots, h_K(x)$. In the following, I discuss the TAP mean field algorithm
used to find an approximation to the Bayes regressor and briefly the theoretical statistical
mechanics analysis for the NN task.
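Mean field algorithm. The derivation sketched here is a straightforward generalization
of previous results for neural networks [7]. The basic cavity assumption [7, 8] is that for
large $d$, $K$ and for a suitable input distribution, the predictive distribution
$p(f(x)|D_N)$ is Gaussian:

$$p(f(x)|D_N) \approx \mathcal{N}(\langle f(x) \rangle, \langle f^2(x) \rangle - \langle f(x) \rangle^2).$$

The predictive distribution for the fields $h(x)$ is also assumed to be Gaussian,

$$p(h(x)|D_N) \approx \mathcal{N}(\langle h(x) \rangle, V),$$

where $V = \langle h(x) h(x)^T \rangle - \langle h(x) \rangle \langle h(x) \rangle^T$. Using these
assumptions, I get an approximate Bayes regressor

$$\langle f(x) \rangle \approx \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \Phi\left(\frac{\langle h_k(x) \rangle}{\sqrt{1 + V_{kk}}}\right). \qquad (13)$$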
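Under the Gaussian assumption for the fields, the average of the transfer function has a closed form: for $h \sim \mathcal{N}(m, V)$ and $\Phi(z) = \mathrm{erf}(z/\sqrt{2})$ one has $E[\Phi(h)] = \Phi(m/\sqrt{1 + V})$. The sketch below checks this identity by Monte Carlo for a few hypothetical field means and variances. The choice of transfer function, the numbers and the function names are assumptions of this example.

```python
import math
import random


def phi(z):
    """Transfer function Phi(z) = erf(z / sqrt(2))."""
    return math.erf(z / math.sqrt(2.0))


def mean_field_prediction(field_means, field_vars):
    """K^{-1/2} sum_k Phi(m_k / sqrt(1 + V_kk)) for Gaussian fields h_k ~ N(m_k, V_kk)."""
    K = len(field_means)
    total = sum(phi(m / math.sqrt(1.0 + v)) for m, v in zip(field_means, field_vars))
    return total / math.sqrt(K)


def monte_carlo_prediction(field_means, field_vars, n_samples=20000, seed=0):
    """Sample the fields and average K^{-1/2} sum_k Phi(h_k) directly.

    Only the marginal variances matter for this average, so the fields
    are sampled independently here.
    """
    rng = random.Random(seed)
    K = len(field_means)
    total = 0.0
    for _ in range(n_samples):
        total += sum(phi(rng.gauss(m, math.sqrt(v)))
                     for m, v in zip(field_means, field_vars))
    return total / (n_samples * math.sqrt(K))


means = [0.5, -1.2, 2.0]
variances = [0.3, 1.0, 0.1]
closed_form = mean_field_prediction(means, variances)
sampled = monte_carlo_prediction(means, variances)
print(closed_form, sampled)
```

The closed form is what makes prediction cheap once the posterior means and variances of the fields are available.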
To me predictions, we therefore need the two first moments of the weights since 
hk(x) = x ana = -- We can simplify 
this in the lge d limit by ting the inputs to by lid with zero mean and unit viance: 
Vk  wk. w - wk. w. This approximation can be avoiaea at a substantial com- 
putational cost [8]. Furthermore, wk .w  tums out equal to the prior covelance 
[7]. The following exact relation is obtained for the mean weights 
(wk) = akixi, aki- O(hk(xi)) lnp(YilDNN(xi'Yi)) (14) 
i 
where 
p(ylDmX(x,y)) = f dh(x)p(y Ih(x))p(h(x)lDmX(x,y)). 
Figure 2: Learning curves for Bayesian NNs and GPs (generalization error $\epsilon_g$
versus $N/dK$). The dashed line is simulations for the TAP mean field algorithm ($d = 30$,
$K = 5$, $T = 1$ and $\sigma^2 = 0.01$) learning a corresponding NN task, i.e. an
approximation to the Bayes optimal scenario. The dash-dotted line is the simulations for
GPs learning the NN task. Virtually on top of that curve is the curve for the Bayes optimal
GP scenario (dotted line). The full lines are the theoretical prediction. Up to
$N = N_c = 2.51\, dK$, the learning curves for Bayesian NNs and GPs coincide. At $N_c$, the
statistical mechanics theory predicts a first order transition to a specialized solution for
the NN Bayes optimal scenario (lower full line).
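Here $p(y|h(x))$ is the likelihood and $p(h(x) | D_N \setminus (x, y))$ is a predictive
distribution for $h(x)$ for a training set where the $i$th example has been left out. In
accordance with the above, I assume
$p(h(x_i) | D_N \setminus (x_i, y_i)) \approx \mathcal{N}(\langle h(x_i) \rangle_{\setminus i}, V)$.
Finally, generalizing the relation found in Refs. [7, 8], I can relate the reduced mean to
the full posterior mean:

$$\langle h_k(x_i) \rangle_{\setminus i} = \langle h_k(x_i) \rangle - \sum_l V_{kl}\, \alpha_{li},$$

to express everything in terms of $\langle w_k \rangle$ and $\alpha_{ki}$, $k = 1, \ldots, K$
and $i = 1, \ldots, N$.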
The mean field eqs. are solved by iteration in $\alpha_{ki}$ and $\langle w_k \rangle$
following the recipe given in Ref. [8]. The algorithm is tested using a teacher sampled from
the NN prior, i.e. the Bayes optimal scenario. Two types of solutions are found: a linear
symmetric one and a non-linear specialized one. In the symmetric solution,
$\langle w_k \rangle = \langle w \rangle$ and
$\langle w_k \rangle \cdot \langle w_k \rangle = O(T/dK)$. This means that the machine is
linear (when $T \ll K$). For $N = O(dK)$, a transition to a specialized solution occurs,
where each $\langle w_k \rangle$, $k = 1, \ldots, K$, aligns to a distinct weight vector of
the teacher and $\langle w_k \rangle \cdot \langle w_k \rangle = O(T/d)$. The Bayesian student
thus learns the linear features for $N = O(d)$. However, unlike the GP, it learns all of the
remaining non-linear features for $N = O(dK)$. The resulting empirical learning curve
averaged over 25 independent runs is shown in figure 2. It turned out that setting
$\langle h_k(x_i) \rangle_{\setminus i} = \langle h_k(x_i) \rangle$ was a necessary heuristic in
order to find the specialized solution. The transition to the specialized solution, although
very abrupt for the individual run, is smeared out because it occurs at a different $N$ for
each run.
The theoretical learning curve is also shown in figure 2. It has been derived by
generalizing the results of Ref. [9] for the Gibbs algorithm to the Bayes optimal scenario.
The picture that emerges is in accordance with the empirical findings. The transition to the
specialized solution is predicted to be first order, i.e. with a discontinuous jump in the
relevant order parameters at the number of examples $N_c(\sigma^2, T)$, where the specialized
solution becomes the physical solution (i.e. the lowest free energy solution).
The mean field algorithm cannot completely reproduce the theoretical predictions because 
the solution gets trapped in the meta-stable symmetric solution. This is often observed 
for first order transitions and should also be observable in the Monte Carlo approach to 
Bayesian NNs [1]. 
3 Discussion 
I have compared learning a finite two-layer regression NN using (1) the Bayes optimal
algorithm and (2) the Bayes optimal algorithm for an infinite network (implemented by a GP).
It is found that the Bayes optimal algorithm can have far superior performance.
This can be explained as an entropic effect: The infinite network will, although the
correct finite network solution is included a priori, have a vanishing probability of finding
this solution. The finite network, on the other hand, is much more constrained with respect
to the functions it implements. It can thus, even in the Bayesian setting, pay off greatly to
limit complexity.
For a $d$-dimensional inner product kernel with iid input distribution, it is found that it
in general requires $O(d^s)$ training examples to learn features of order $s$. Unpublished
results and [3] show that these conclusions remain true also for SVM and GP classification.
For SVM hand-written digit recognition, fourth order kernels give good results in practice.
Since $N = O(10^4) - O(10^5)$, it can be concluded that the 'effective' dimension is
$d_{\mathrm{effective}} = O(10)$ against typically $d = 400$, i.e. some inputs must be highly
correlated and/or carry very little information. It could therefore be interesting to develop
methods to measure the effective dimension and to extract the important lower dimensional
features rather than performing the classification directly from the images.
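The effective dimension estimate is a one-line inversion of the scaling $N = O(d^s)$: with $s = 4$, $d_{\mathrm{effective}} \approx N^{1/4}$. A minimal sketch of the arithmetic, ignoring all order-one prefactors (the function name is hypothetical):

```python
def effective_dimension(n_examples, order):
    """Invert N ~ d ** order to obtain d ~ N ** (1 / order), ignoring prefactors."""
    return n_examples ** (1.0 / order)


for n in (1e4, 1e5):
    print(n, effective_dimension(n, 4))
```

For $N$ between $10^4$ and $10^5$ this gives values between 10 and about 18, far below the typical input dimension $d = 400$.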
Acknowledgments 
I am thankful to Manfred Opper for valuable discussions and for sharing his results with
me and to Klaus-Robert Müller for discussions at NIPS. This research is supported by the
Swedish Foundation for Strategic Research.
References 
[1] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Springer (1996).
[2] C. K. I. Williams, Computing with Infinite Networks, In Neural Information Processing
Systems 9, Eds. M. C. Mozer, M. I. Jordan and T. Petsche, 295-301, MIT Press (1997).
[3] R. Dietrich, M. Opper and H. Sompolinsky, Statistical Mechanics of Support Vector Machines,
Phys. Rev. Lett. 82, 2975-2978 (1999).
[4] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, In Advances in
Neural Information Processing Systems 8 (NIPS'95), Eds. D. S. Touretzky, M. C. Mozer and
M. E. Hasselmo, 514-520, MIT Press (1996).
[5] D. Malzahn and M. Opper, In this volume.
[6] P. Sollich, Learning Curves for Gaussian Processes, In Advances in Neural Information
Processing Systems 11 (NIPS'98), Eds. M. S. Kearns, S. A. Solla, and D. A. Cohn, 344-350,
MIT Press (1999).
[7] M. Opper and O. Winther, Mean Field Approach to Bayes Learning in Feed-Forward Neural
Networks, Phys. Rev. Lett. 76, 1964-1967 (1996).
[8] M. Opper and O. Winther, Gaussian Processes for Classification: Mean Field Algorithms,
Neural Computation 12, 2655-2684 (2000).
[9] M. Ahr, M. Biehl and R. Urbanczik, Statistical physics and practical training of
soft-committee machines, Eur. Phys. J. B 10, 583 (1999).
