Learning continuous distributions: 
Simulations with field theoretic priors 
Ilya Nemenman ,2and William Bialek 2 X Department of Physics, Princeton University, Princeton, New Jersey 08544 
2NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540 
nemenman @research. nj. nec. com, bialek@research. nj. nec. com 
Abstract 
Learning of a smooth but nonparametric probability density can be reg- 
ularized using methods of Quantum Field Theory. We implement a field 
theoretic prior numerically, test its efficacy, and show that the free pa- 
rameter of the theory ('smoothness scale') can be determined self con- 
sistently by the data; this forms an infinite dimensional generalization of 
the MDL principle. Finally, we study the implications of one's choice 
of the prior and the parameterization and conclude that the smoothness 
scale determination makes density estimation very weakly sensitive to 
the choice of the prior, and that even wrong choices can be advantageous 
for small data sets. 
One of the central problems in learning is to balance 'goodness of fit' criteria against the 
complexity of models. An important development in the Bayesian approach was thus the 
realization that there does not need to be any extra penalty for model complexity: if we 
compute the total probability that data are generated by a model, there is a factor from the 
volume in parameter space--the 'Occam factor'--that discriminates against models with 
more parameters [1, 2]. This works remarkably well for systems with a finite number of 
parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equiv- 
alent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if 
the a priori distributions involved are strictly Gaussian, the ideas have also been proven to 
apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, 
what happens if we leave the finite dimensional setting to consider nonparametric prob- 
lems which are not Gaussian, such as the estimation of a smooth probability density. A 
possible route to progress on the nonparametric problem was opened by noticing [5] that 
a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In 
particular, there are field theoretic methods for computing the infinite dimensional analog 
of the Occam factor, at least asymptotically for large numbers of examples. These obser- 
vations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and 
their implications for the speed of learning. Here we return to the original formulation 
of Ref. [5] and use numerical methods to address some of the questions left open by the 
analytic work [10]: What is the result of balancing the infinite dimensional Occam factor 
against the goodness of fit? Is the QFT inference optimal in using all of the information 
relevant for learning [11]? What happens if our learning problem is strongly atypical of the
prior distribution? 
Following Ref. [5], if $N$ i.i.d. samples $\{x_i\}$, $i = 1 \ldots N$, are observed, then the
probability that a particular density $Q(x)$ gave rise to these data is given by
$$P[Q(x)|\{x_i\}] = \frac{P[Q(x)] \prod_{i=1}^{N} Q(x_i)}{\int [dQ(x)]\, P[Q(x)] \prod_{i=1}^{N} Q(x_i)}, \tag{1}$$
where $P[Q(x)]$ encodes our a priori expectations of $Q$. Specifying this prior on a space of
functions defines a QFT, and the optimal least square estimator is then
$$Q_{\rm est}(x|\{x_i\}) = \frac{\langle Q(x) Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}{\langle Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}, \tag{2}$$
where $\langle \ldots \rangle^{(0)}$ means averaging with respect to the prior. Since $Q(x) \geq 0$, it is
convenient to define an unconstrained field $\phi(x)$, $Q(x) \equiv (1/\ell_0) \exp[-\phi(x)]$. Other
definitions are also possible [6], but we think that most of our results do not depend on this choice.
The next step is to select a prior that regularizes the infinite number of degrees of freedom
and allows learning. We want the prior $\mathcal{P}[\phi]$ to make sense as a continuous theory,
independent of discretization of $x$ on small scales. We also require that when we estimate the
distribution $Q(x)$ the answer must be everywhere finite. These conditions imply that our
field theory must be convergent at small length scales. For $x$ in one dimension, a minimal
choice is
$$\mathcal{P}[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{\ell^{2\eta-1}}{2} \int dx \left( \frac{\partial^{\eta} \phi}{\partial x^{\eta}} \right)^{2} \right] \delta\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \tag{3}$$
where $\eta > 1/2$, $\mathcal{Z}$ is the normalization constant, and the $\delta$-function enforces
normalization of $Q$. We refer to $\ell$ and $\eta$ as the smoothness scale and the exponent, respectively.
In [5] this theory was solved for large $N$ and $\eta = 1$:
$$\Big\langle \prod_{i=1}^{N} Q(x_i) \Big\rangle^{(0)} \approx \frac{1}{\ell_0^{N}} \exp\left( -S_{\rm eff}[\phi_{\rm cl}(x); \{x_i\}] \right), \tag{4}$$
$$S_{\rm eff} = \int dx \left[ \frac{\ell}{2} \left( \partial_x \phi_{\rm cl} \right)^{2} + \frac{1}{2} \sqrt{\frac{N e^{-\phi_{\rm cl}(x)}}{\ell \ell_0}} \right] + \sum_{j=1}^{N} \phi_{\rm cl}(x_j), \tag{5}$$
$$\ell\, \partial_x^{2} \phi_{\rm cl}(x) + \frac{N}{\ell_0} e^{-\phi_{\rm cl}(x)} = \sum_{j=1}^{N} \delta(x - x_j), \tag{6}$$
where $\phi_{\rm cl}$ is the 'classical' (maximum likelihood, saddle point) solution. In the effective
action [Eq. (5)], it is the square root term that arises from integrating over fluctuations
around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular
even at finite $N$, that the mean value of $\phi_{\rm cl}$ converges to the negative logarithm of the
target distribution $P(x)$ very quickly, and that the variance of fluctuations
$\psi(x) \equiv \phi(x) - [-\log \ell_0 P(x)]$ falls off as $\sim 1/\sqrt{\ell N P(x)}$. Finally, it was
speculated that if the actual $\ell$ is unknown one may average over it and hope that, much as in
Bayesian model selection [2], the competition between the data and the fluctuations will select
the optimal smoothness scale $\ell^*$.
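The three terms of the effective action, Eq. (5), can be evaluated on a grid once a candidate $\phi_{\rm cl}$ is available. The following is a minimal sketch for $\eta = 1$, not the code used for the simulations reported here. The function name `effective_action`, the forward-difference derivative, the nearest-grid-point lookup for the data term, and the default $\ell_0 = 1$ are all assumptions made for illustration.

```python
import numpy as np

def effective_action(phi, samples, ell, L=1.0, ell0=1.0):
    """Evaluate Eq. (5) for eta = 1 on a uniform periodic grid (illustrative sketch).

    phi     : values of phi_cl at M equally spaced grid points x_i = i*L/M
    samples : observed data points x_j in [0, L)
    ell     : smoothness scale
    """
    phi = np.asarray(phi, dtype=float)
    samples = np.asarray(samples, dtype=float)
    M, N = phi.size, samples.size
    h = L / M
    # Kinetic (smoothness) term: (ell/2) * integral of (d phi / dx)^2,
    # using a periodic forward difference for the derivative.
    dphi = (np.roll(phi, -1) - phi) / h
    kinetic = 0.5 * ell * np.sum(dphi**2) * h
    # Fluctuation (Occam) term: (1/2) * integral of sqrt(N exp(-phi) / (ell * ell0)).
    occam = 0.5 * np.sum(np.sqrt(N * np.exp(-phi) / (ell * ell0))) * h
    # Data term: sum of phi_cl at the samples, read off at the nearest grid point.
    idx = np.rint(samples / h).astype(int) % M
    data = np.sum(phi[idx])
    return kinetic + occam + data
```

For a flat field $\phi_{\rm cl} = 0$ with $L = \ell_0 = 1$ the kinetic and data terms vanish, and the function returns the fluctuation term alone, $\frac{1}{2}\sqrt{N/\ell}$.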
At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This
impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by 
the fluctuation determinant that plays against the goodness of fit in the smoothness scale 
(model) selection. However, both similarities are incomplete. The Gaussian penalty in 
the prior is amended by the normalization constraint, which gives rise to the exponential 
term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the 
representer theorem [12] being just one of them. In the semi-classical limit of large N, 
Gaussianity is restored approximately, but the classical solution is extremely non-trivial, 
and the fluctuation determinant is only the leading term of the Occam's razor, not the com- 
plete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus 
remarkably different from the usual determinants arising in the literature. 
The algorithm to implement the discussed density estimation procedure numerically is
rather simple. First, to make the problem well posed [10, 11] we confine $x$ to a box
$0 \leq x \leq L$ with periodic boundary conditions. The boundary value problem Eq. (6) is
then solved by a standard 'relaxation' (or Newton) method of iterative improvements to
a guessed solution [13] (the target precision is always $10^{-5}$). The independent variable
$x \in [0, 1]$ is discretized in equal steps [$10^4$ for Figs. (1.a-2.b), and $10^5$ for Figs. (3.a, 3.b)].
We use an equally spaced grid to ensure stability of the method, while small step sizes are
needed since the scale for variation of $\phi_{\rm cl}(x)$ is [5]
$$\delta x \sim \sqrt{\ell / [N P(x)]}, \tag{7}$$
which can be rather small for large $N$ or small $\ell$.
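The relaxation step for Eq. (6) can be illustrated with a damped Newton iteration on a finite-difference grid. This is a sketch under stated assumptions rather than a reproduction of the solver of [13] used here. The function name `solve_classical`, the cap of 0.5 on the change of $\phi$ per iteration, the use of the Newton update size as the convergence test, and the dense linear solve on a coarse grid of 256 points are all choices made for brevity. A grid as fine as the $10^4$ steps quoted above would call for a sparse cyclic tridiagonal solver instead.

```python
import numpy as np

def solve_classical(samples, ell, L=1.0, ell0=1.0, M=256, tol=1e-5, max_iter=200):
    """Solve ell*phi'' + (N/ell0)*exp(-phi) = sum_j delta(x - x_j) with periodic
    boundary conditions by a damped Newton iteration (illustrative sketch)."""
    samples = np.asarray(samples, dtype=float)
    N = samples.size
    h = L / M
    # Discretized delta functions: sample counts at the nearest grid point, per unit length.
    idx = np.rint(samples / h).astype(int) % M
    rho = np.bincount(idx, minlength=M) / h
    # Periodic second-difference matrix (dense, for brevity).
    D2 = -2.0 * np.eye(M) + np.eye(M, k=1) + np.eye(M, k=-1)
    D2[0, -1] = D2[-1, 0] = 1.0
    D2 /= h**2
    # Initial guess: the uniform density Q = 1/L, i.e. phi = log(L/ell0).
    phi = np.full(M, np.log(L / ell0))
    for _ in range(max_iter):
        w = (N / ell0) * np.exp(-phi)
        residual = ell * (D2 @ phi) + w - rho          # left side minus right side of Eq. (6)
        # Newton update: minus the Jacobian is symmetric positive definite.
        delta = np.linalg.solve(-ell * D2 + np.diag(w), residual)
        step = np.max(np.abs(delta))
        if step < tol:
            phi += delta
            break
        # Damping: never change phi by more than 0.5 at any grid point in one iteration.
        phi += delta * min(1.0, 0.5 / step)
    x = h * np.arange(M)
    return x, phi
```

Summing the discretized Eq. (6) over the periodic grid shows that a converged solution automatically satisfies $\int dx\, Q_{\rm cl}(x) = 1$ with $Q_{\rm cl} = (1/\ell_0) \exp[-\phi_{\rm cl}]$, so no separate normalization step is needed.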
Since the theory is short scale insensitive, we can generate random probability densities
chosen from the prior by replacing $\phi$ with its Fourier series and truncating the latter at some
sufficiently high wavenumber $k_c$ [$k_c = 1000$ for Figs. (1.a-2.b), and 5000 for Figs. (3.a,
3.b)]. Then Eq. (3) enforces the amplitude of the $k$'th mode to be distributed a priori
normally with the standard deviation
$$\sigma_k = \frac{2^{1/2}}{\ell^{\eta - 1/2}} \left( \frac{L}{2 \pi k} \right)^{\eta}. \tag{8}$$
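Drawing a density from the prior through its truncated Fourier series can be sketched as follows. The decomposition into independent cosine and sine amplitudes per wavenumber, the function name `sample_density_from_prior`, and the small default values of `kc` and `M` are assumptions for illustration. The text specifies only the standard deviation of Eq. (8) and the cutoff $k_c$. The normalization constraint of Eq. (3) is imposed at the end by rescaling, which fixes the constant (zero) mode of $\phi$.

```python
import numpy as np

def sample_density_from_prior(ell, eta=1.0, L=1.0, kc=100, M=1024, rng=None):
    """Draw a random density Q(x) on a uniform grid from the prior of Eq. (3),
    using a Fourier series truncated at wavenumber kc (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x = L * np.arange(M) / M
    k = np.arange(1, kc + 1)
    # Standard deviation of the k'th mode amplitude, Eq. (8).
    sigma = np.sqrt(2.0) / ell**(eta - 0.5) * (L / (2.0 * np.pi * k))**eta
    a = rng.normal(0.0, sigma)      # cosine amplitudes (assumed decomposition)
    b = rng.normal(0.0, sigma)      # sine amplitudes (assumed decomposition)
    phase = 2.0 * np.pi * np.outer(x, k) / L
    phi = np.cos(phase) @ a + np.sin(phase) @ b
    q = np.exp(-phi)
    q /= np.sum(q) * (L / M)        # enforce unit integral of Q over [0, L)
    return x, q
```

The grid size `M` should exceed $2 k_c$ so that the highest retained mode is resolved. Data points can then be drawn from the returned density, for example by inverse transform sampling on the grid.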
Coded in such a way, the simulations are extremely computationally intensive. Therefore,
Monte Carlo averagings given here are only over 500 runs, fluctuation determinants
are calculated according to Eq. (5), not using numerical path integration, and
$Q_{\rm cl} = (1/\ell_0) \exp[-\phi_{\rm cl}]$ is always used as an approximation to $Q_{\rm est}$.
As an example of the algorithm's performance, Fig. (1.a) shows one particular learning run 
for $\eta = 1$ and $\ell = 0.2$. We see that singularities and overfitting are absent even for $N$ as
low as 10. Moreover, the approach of $Q_{\rm cl}(x)$ to the actual distribution $P(x)$ is remarkably
fast: for N = 10, they are similar; for N = 1000, very close; for N = 100000, one needs 
to look carefully to see the difference between the two. 
To quantify this similarity of distributions, we compute the Kullback-Leibler divergence
$D_{KL}(P \| Q_{\rm est})$ between the true distribution $P(x)$ and its estimate $Q_{\rm est}(x)$, and then
average over the realizations of the data points and the true distribution. As discussed in
[11], this learning curve $\Lambda(N)$ measures the (average) excess cost incurred in coding the
$(N+1)$'st data point because of the finiteness of the data sample, and thus can be called the
"universal learning curve". If the inference algorithm uses all of the information contained
in the data that is relevant for learning ("predictive information" [11]), then [5, 9, 11, 10]
$$\Lambda(N) \sim (L/\ell)^{1/(2\eta)} N^{1/(2\eta) - 1}. \tag{9}$$
We test this prediction against the learning curves in the actual simulations. For $\eta = 1$
and $\ell = 0.4, 0.2, 0.05$, these are shown on Fig. (1.b). One sees that the exponents are
extremely close to the expected 1/2, and the ratios of the prefactors are within the errors
from the predicted scaling $\sim 1/\sqrt{\ell}$. All of this means that the proposed algorithm for
finding densities not only works, but is at most a constant factor away from being optimal
in using the predictive information of the sample set.
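The two numerical ingredients of this comparison, a grid estimate of the Kullback-Leibler divergence and a power-law fit of the learning curve, can be sketched as below. The function names `kl_divergence` and `fit_power_law` are hypothetical, and the fit is an ordinary least-squares line in log-log coordinates. The text does not state which fitting procedure produced the quoted best fits, so this is one plausible choice rather than the method used.

```python
import numpy as np

def kl_divergence(p, q, h):
    """D_KL(P||Q) = integral of P log(P/Q), in nats, on a uniform grid of spacing h.
    Grid points where P vanishes contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * h

def fit_power_law(N, lam):
    """Fit lam = A * N**(-alpha) by least squares on log(lam) versus log(N).
    Returns the prefactor A and the decay exponent alpha."""
    slope, intercept = np.polyfit(np.log(N), np.log(lam), 1)
    return np.exp(intercept), -slope
```

Applied to a learning curve that follows Eq. (9) with $\eta = 1$, `fit_power_law` should return a decay exponent close to 1/2 and a prefactor that scales as $1/\sqrt{\ell}$.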
Figure 1: (a) $Q_{\rm cl}$ found for different $N$ at $\ell = 0.2$. (b) $\Lambda$ as a function of $N$ and $\ell$.
The best fits are: for $\ell = 0.4$, $\Lambda = (0.54 \pm 0.07) N^{-0.483 \pm 0.014}$; for $\ell = 0.2$,
$\Lambda = (0.83 \pm 0.08) N^{-0.493 \pm \ldots}$; for $\ell = 0.05$, $\Lambda = (1.64 \pm 0.16) N^{-0.5\ldots}$.
[Panel (a): fits for 10, 1000, and 100000 samples and the actual distribution, versus $x$.
Panel (b): data and best fits for $\ell = 0.4, 0.2, 0.05$, versus $N$.]

Next we investigate how one's choice of the prior influences learning. We first stress that
there is no such thing as a wrong prior. If one admits a possibility of it being wrong, then
it does not encode all of the a priori knowledge! It does make sense, however, to ask what
happens if the distribution we are trying to learn is an extreme outlier in the prior $\mathcal{P}[\phi]$.
One way to generate such an example is to choose a typical function from a different prior
$\mathcal{P}'[\phi]$, and this is what we mean by 'learning with a wrong prior.' If the prior is wrong
in this sense, and learning is described by Eqs. (2-6), then we still expect the asymptotic
behavior, Eq. (9), to hold; only the prefactors of $\Lambda$ should change, and those must increase
since there is an obvious advantage in having the right prior; we illustrate this in Figs. (2.a,
2.b).
For Fig. (2.a), both $\mathcal{P}'[\phi]$ and $\mathcal{P}[\phi]$ are given by Eq. (3), but $\mathcal{P}'$ has the 'actual'
smoothness scale $\ell_a = 0.4, 0.05$, and for $\mathcal{P}$ the 'learning' smoothness scale is $\ell = 0.2$
(we show the case $\ell_a = \ell = 0.2$ again as a reference). The $\Lambda \sim 1/\sqrt{N}$ behavior is seen
unmistakably. The prefactors are a bit larger (unfortunately, insignificantly) than the
corresponding ones from Fig. (1.b), so we may expect that the 'right' $\ell$, indeed, provides
better learning (see later for a detailed discussion).
Further, Fig. (2.b) illustrates learning when not only $\ell$, but also $\eta$ is 'wrong' in the sense
defined above. We illustrate this for $\eta_a = 2, 0.8, 0.6, 0$ (remember that only $\eta_a > 0.5$
removes UV divergences). Again, the inverse square root decay of $\Lambda$ should be observed,
and this is evident for $\eta_a = 2$. The $\eta_a = 0.8, 0.6, 0$ cases are different: even for $N$ as high
as $10^5$ the estimate of the distribution is far from the target, thus the asymptotic regime is
not reached. This is a crucial observation for our subsequent analysis of the smoothness
scale determination from the data. Remarkably, $\Lambda$ (both averaged and in the single runs
shown) is monotonic, so even in the cases of qualitatively less smooth distributions there
still is no overfitting. On the other hand, $\Lambda$ is well above the asymptote for $\eta_a = 2$ and small
$N$, which means that initially too many details are expected and wrongfully introduced into
the estimate, but then they are almost immediately ($N \sim 300$) eliminated by the data.
Figure 2: (a) $\Lambda$ as a function of $N$ and $\ell_a$. Best fits are: for $\ell_a = 0.4$,
$\Lambda = (0.56 \pm 0.08) N^{-0.477 \pm \ldots}$; for $\ell_a = 0.05$, $\Lambda = (1.90 \pm 0.16) N^{-0.5\ldots}$. Learning is
always with $\ell = 0.2$. (b) $\Lambda$ as a function of $N$, $\eta_a$ and $\ell_a$. Best fits: for $\eta_a = 2$, $\ell_a = 0.1$,
$\Lambda = (0.40 \pm 0.05) N^{-0.49\ldots}$; for $\eta_a = 0.8$, $\ell_a = 0.1$, $\Lambda = (1.06 \pm 0.08) N^{\ldots}$.
$\ell = 0.2$ for all graphs, but the one with $\eta_a = 0$, for which $\ell = 0.1$.
[Panel (a): data and best fits for $\ell_a = 0.4, 0.05$, versus $N$. Panel (b) legend: $\eta_a = 1$, $\ell_a = 0.2$;
$\eta_a = 2$, $\ell_a = 0.1$; $\eta_a = 0.8$, $\ell_a = 0.1$ (data, best fit); $\eta_a = 0.6$, $\ell_a = 0.1$ and
$\eta_a = 0$, $\ell_a = 0.12$ (data, one run); versus $N$.]

Following the argument suggested in [5], we now view $\mathcal{P}[\phi]$, Eq. (3), as being a part of
some wider model that involves a prior over $\ell$. The details of the prior are irrelevant,
however, if $S_{\rm eff}(\ell)$, Eq. (5), has a minimum that becomes more prominent as $N$ grows. We
explicitly note that this mechanism is not tuning of the prior's parameters, but Bayesian
inference at work: $\ell^*$ emerges in a competition between the smoothness, the data, and the
Occam terms to make $S_{\rm eff}$ smaller, and thus the total probability of the data is larger. In its
turn, larger probability means shorter total code length.
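The chain from the effective action to the probability of the data and to code length can be written out explicitly. This is a sketch that combines Eq. (4) with the standard identification of code length with negative log probability. Treating Eq. (4) as the probability of the data at a fixed smoothness scale, and neglecting the prior over the smoothness scale, are assumptions that hold only within the large-$N$ approximation. The code length is defined up to a constant, independent of the smoothness scale, set by the precision with which the samples are recorded.

```latex
\begin{align}
P(\{x_i\} \mid \ell) &= \Big\langle \prod_{i=1}^{N} Q(x_i) \Big\rangle^{(0)}
  \approx \ell_0^{-N} \exp\!\big( -S_{\rm eff}(\ell) \big), \\
-\log_2 P(\{x_i\} \mid \ell) &\approx \frac{S_{\rm eff}(\ell)}{\ln 2} + N \log_2 \ell_0
  \qquad \text{(code length in bits)}, \\
\ell^* &= \arg\min_{\ell}\, S_{\rm eff}(\ell).
\end{align}
```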
The data term, on average, is equal to $N D_{KL}(P \| Q_{\rm cl})$, and, for very regular $P(x)$ (an
implicit assumption in [5]), it is small. Thus only the kinetic and the Occam terms matter,
and $\ell^* \sim N^{1/3}$ [5]. For less regular distributions $P(x)$, this is not true [cf. Fig. (2.b)]. For
$\eta = 1$, $Q_{\rm cl}(x)$ approximates large-scale features of $P(x)$ very well, but details at scales
smaller than $\sim \sqrt{\ell/NL}$ are averaged out. If $P(x)$ is taken from the prior, Eq. (3), with
some $\eta_a$, then these details fall off with the wave number $k$ as $\sim k^{-\eta_a}$. Thus the data term
is $\sim N^{1.5 - \eta_a} \ell^{\eta_a - 0.5}$ and is not necessarily small. For $\eta_a < 1.5$ this dominates the kinetic
term and competes with the fluctuations to set
$$\ell^* \sim N^{(\eta_a - 1)/\eta_a}, \qquad \eta_a < 1.5. \tag{10}$$
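The balance of terms behind these two scalings can be made explicit. The sketch below keeps only the dependence on $N$ and on the smoothness scale; the coefficients $a$, $b$, $c$ are hypothetical constants of order one introduced for illustration, and the kinetic term is taken to scale linearly with the smoothness scale, which assumes that the integral of the squared gradient of the classical solution stays of order one.

```latex
\begin{align}
S_{\rm eff}(\ell) &\sim \underbrace{a\,\ell}_{\text{kinetic}}
  + \underbrace{b\,N^{1/2}\,\ell^{-1/2}}_{\text{fluctuations}}
  + \underbrace{c\,N^{1.5-\eta_a}\,\ell^{\,\eta_a-0.5}}_{\text{data}}, \\
\text{regular } P(x): \quad & a\,\ell \sim b\,N^{1/2}\,\ell^{-1/2}
  \;\Rightarrow\; \ell^{3/2} \sim N^{1/2}
  \;\Rightarrow\; \ell^* \sim N^{1/3}, \\
\eta_a < 1.5: \quad & c\,N^{1.5-\eta_a}\,\ell^{\,\eta_a-0.5} \sim b\,N^{1/2}\,\ell^{-1/2}
  \;\Rightarrow\; \ell^{\,\eta_a} \sim N^{\eta_a-1}
  \;\Rightarrow\; \ell^* \sim N^{(\eta_a-1)/\eta_a}.
\end{align}
```

At the second solution the kinetic term scales as $N^{1 - 1/\eta_a}$ and the fluctuation term as $N^{1/(2\eta_a)}$, so the kinetic term is indeed subdominant exactly when $\eta_a < 1.5$.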
There are two remarkable things about Eq. (10). First, for $\eta_a = 1$, $\ell^*$ stabilizes at some
constant value, which we expect to be equal to $\ell_a$. Second, even for $\eta \neq \eta_a$, Eqs. (9, 10)
ensure that $\Lambda$ scales as $\sim N^{1/(2\eta_a) - 1}$, which is at worst a constant factor away from the best
scaling, Eq. (9), achievable with the 'right' prior, $\eta = \eta_a$. So, by allowing $\ell^*$ to vary with
$N$ we can correctly capture the structure of models that are qualitatively different from our
expectations ($\eta \neq \eta_a$) and produce estimates of $Q$ that are extremely robust to the choice
of the prior. To our knowledge, this feature has not been noted before in a reference to a
nonparametric problem.
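The learning curve exponent quoted above follows by substituting the data-selected smoothness scale into Eq. (9). The short derivation below assumes learning with $\eta = 1$ and $\eta_a < 1.5$, and tracks only powers of $N$; the two numerical cases are plain arithmetic on the resulting exponent, not simulation results.

```latex
\begin{align}
\Lambda(N) &\sim \left( \frac{L}{\ell^*} \right)^{1/2} N^{-1/2}
  \sim N^{-\frac{\eta_a - 1}{2\eta_a}}\, N^{-\frac{1}{2}}
  = N^{\frac{1}{2\eta_a} - 1}, \\
\eta_a = 0.8: \quad & \Lambda \sim N^{-0.375}, \qquad
\eta_a = 1: \quad \Lambda \sim N^{-0.5}.
\end{align}
```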
We present simulations relevant to these predictions in Figs. (3.a, 3.b). Unlike in the
previous Figures, the results are not averaged due to extreme computational costs, so all our
further claims have to be taken cautiously. On the other hand, selecting $\ell^*$ in single runs
has some practical advantages: we are able to ensure the best possible learning for any
realization of the data. Fig. (3.a) shows single learning runs for various $\eta_a$ and $\ell_a$. In
addition, to keep the Figure readable, we do not show runs with $\eta_a = 0.6, 0.7, 1.2, 1.5, 3$,
and $\eta_a \to \infty$, which is a finitely parameterizable distribution. All of these display a good
agreement with the predicted scalings: Eq. (10) for $\eta_a < 1.5$, and $\ell^* \sim N^{1/3}$ otherwise.
Figure 3: (a) Comparison of learning speed for the same data sets with different a priori
assumptions. (b) Smoothness scale selection by the data. The lines that go off the axis for
small $N$ symbolize that $S_{\rm eff}$ monotonically decreases as $\ell \to \infty$.
[Panel (a) legend: $\eta_a = 1$, $\ell_a = 0.2$; $\eta_a = 0.8$; $\eta_a = 1$, variable $\ell_a$, mean 0.12; $\eta_a = 2$, $\ell_a = 0.1$.
Panel (b) legend: $\eta_a = 0.8$ and $\eta_a = 2$, each with $\ell_a = 0.1$, for $\ell = \ell^*$ and for $\ell = 0.2$.
Horizontal axes: $N$.]

Next we calculate the KL divergence between the target and the estimate at $\ell = \ell^*$; the
average of this divergence over the samples and the prior is the learning curve [cf. Eq. (9)].
For $\eta_a = 0.8, 2$ we plot the divergences on Fig. (3.b) side by side with their fixed $\ell = 0.2$
analogues. Again, the predictions clearly are fulfilled. Note that for $\eta_a \neq \eta$ there is a
qualitative advantage in using the data induced smoothness scale.
The last four Figures have illustrated some aspects of learning with 'wrong' priors. How- 
ever, all of our results may be considered as belonging to the 'wrong prior' class. Indeed, 
the actual probability distributions we used were not nonparametric continuous functions 
with smoothness constraints, but were composed of $k_c$ Fourier modes, thus had $2k_c$
parameters. For finite parameterization, asymptotic properties of learning usually do not depend
on the priors (cf. [3, 11]), and priorless theories can be considered [14]. In such theories
it would take well over $2k_c$ samples to even start to close down on the actual value of the
parameters, and yet a lot more to get accurate results. However, using the wrong continuous
parameterization [$\phi(x)$] we were able to obtain good fits for as low as 1000 samples
[cf. Fig. (1.a)] with the help of the prior Eq. (3). Moreover, learning happened continuously 
and monotonically without huge chaotic jumps of overfitting that necessarily accompany 
any brute force parameter estimation method at low N. So, for some cases, a seemingly 
more complex model is actually easier to learn! 
Thus our claim: when data are scarce and the parameters are abundant, one gains even by 
using the regularizing powers of wrong priors. The priors select some large scale features 
that are the most important to learn first and fill in the details as more data become available 
(see [11] on relation of this to the Structural Risk Minimization theory). If the global 
features are dominant (arguably, this is generic), one actually wins in the learning speed 
[cf. Figs. (1.b, 2.a, 3.b)]. If, however, small scale details are as important, then one at least 
is guaranteed to avoid overfitting [cf. Fig. (2.b)]. 
One can summarize this in an Occam-like fashion [11]: if two models provide equally good
fits to data, a simpler one should always be used. In particular, the predictive information,
which quantifies complexity [11], and of which $\Lambda$ is the derivative, in a QFT model is
$\sim N^{1/(2\eta)}$, and it is $\sim k_c \log N$ in the parametric case. So, for $k_c > N^{1/(2\eta)}$, one should
prefer a 'wrong' QFT formulation to the correct finite parameter model. These results are
very much in the spirit of our whole program: not only is the value of $\ell^*$ selected that
simplifies the description of the data, but the continuous parameterization itself serves the
same purpose. This is an unexpectedly neat generalization of the MDL principle [3] to
nonparametric cases.
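As a rough illustration of where this crossover sits for the cutoffs used in the simulations, the condition can be inverted for $N$. The arithmetic below ignores all order-one prefactors and the $\log N$ factor in the parametric case, so the resulting sample sizes are order-of-magnitude guides under those assumptions, not sharp thresholds.

```latex
\begin{align}
k_c > N^{1/(2\eta)} \;&\Longleftrightarrow\; N < k_c^{\,2\eta}, \\
\eta = 1,\ k_c = 1000: \quad & N < 10^{6}, \\
\eta = 1,\ k_c = 5000: \quad & N < 2.5 \times 10^{7}.
\end{align}
```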
Summary: The field theoretic approach to density estimation not only regularizes the learn- 
ing process but also allows the self-consistent selection of smoothness criteria through an 
infinite dimensional version of the Occam factors. We have shown numerically that this 
works, even more clearly than was conjectured: for $\eta_a < 1.5$, the learning curve truly
becomes a property of the data, and not of the Bayesian prior! If we can extend these results to
other $\eta_a$ and combine this work with the reparameterization invariant formulation of [7, 8],
this should give a complete theory of Bayesian learning for one dimensional distributions,
and this theory has no arbitrary parameters. In addition, if this theory properly treats the
limit $\eta_a \to \infty$, we should be able to see how the well-studied finite dimensional Occam
factors and the MDL principle arise from a more general nonparametric formulation. 
References
[1] D. MacKay, Neural Comp. 4, 415-448 (1992).
[2] V. Balasubramanian, Neural Comp. 9, 349-368 (1997),
http://xxx.lanl.gov/abs/adap-org/9601001.
[3] J. Rissanen. Stochastic Complexity and Statistical Inquiry. World Scientific,
Singapore (1989).
[4] D. MacKay, NIPS, Tutorial Lecture Notes (1997),
ftp://wol.ra.phy.cam.ac.uk/pub/mackay/gp.ps.gz.
[5] W. Bialek, C. Callan, and S. Strong, Phys. Rev. Lett. 77, 4693-4697 (1996),
http://xxx.lanl.gov/abs/cond-mat/9607180.
[6] T. Holy, Phys. Rev. Lett. 79, 3545-3548 (1997),
http://xxx.lanl.gov/abs/physics/9706015.
[7] V. Periwal, Phys. Rev. Lett. 78, 4671-4674 (1997),
http://xxx.lanl.gov/hep-th/9703135.
[8] V. Periwal, Nucl. Phys. B, 554 [FS], 719-730 (1999),
http://xxx.lanl.gov/adap-org/9801001.
[9] T. Aida, Phys. Rev. Lett. 83, 3554-3557 (1999),
http://xxx.lanl.gov/cond-mat/9911474.
[10] A more detailed version of our current analysis may be found in: I. Nemenman, Ph.D.
Thesis, Princeton (2000), http://xxx.lanl.gov/abs/physics/0009032.
[11] W. Bialek, I. Nemenman, N. Tishby. Preprint
http://xxx.lanl.gov/abs/physics/0007070.
[12] G. Wahba. In B. Schölkopf, C. J. S. Burges, and A. J. Smola, eds., Advances in Kernel
Methods--Support Vector Learning, pp. 69-88. MIT Press, Cambridge, MA (1999),
ftp://ftp.stat.wisc.edu/pub/wahba/nips97rr.ps.
[13] W. Press et al. Numerical Recipes in C. Cambridge UP, Cambridge (1988).
[14] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York (1998).
