High-temperature expansions for learning 
models of nonnegative data 
Oliver B. Downs 
Dept. of Mathematics 
Princeton University 
Princeton, NJ 08544 
obdowns@princeton.edu 
Abstract 
Recent work has exploited boundedness of data in the unsupervised 
learning of new types of generative model. For nonnegative data it was 
recently shown that the maximum-entropy generative model is a Non- 
negative Boltzmann Distribution not a Gaussian distribution, when the 
model is constrained to match the first and second order statistics of the 
data. Learning for practical sized problems is made difficult by the need 
to compute expectations under the model distribution. The computa- 
tional cost of Markov chain Monte Carlo methods and low fidelity of 
naive mean field techniques has led to increasing interest in advanced 
mean field theories and variational methods. Here I present a second- 
order mean-field approximation for the Nonnegative Boltzmann Machine 
model, obtained using a "high-temperature" expansion. The theory is 
tested on learning a bimodal 2-dimensional model, a high-dimensional 
translationally invariant distribution, and a generative model for hand- 
written digits. 
1 Introduction 
Unsupervised learning of generative and feature-extracting models for continuous nonneg- 
ative data has recently been proposed [1], [2]. In [1], it was pointed out that the maximum 
entropy distribution (matching 1st- and 2nd-order statistics) for continuous nonnegative
data is not Gaussian, and indeed that a Gaussian is not in general a good approximation 
to that distribution. The true maximum entropy distribution is known as the Nonnegative
Boltzmann Distribution (NNBD) (previously the rectified Gaussian distribution [3]),
which has the functional form
$$p(x) = \begin{cases} \frac{1}{Z}\exp[-E(x)] & \text{if } x_i \geq 0 \;\forall i, \\ 0 & \text{if any } x_i < 0, \end{cases} \tag{1}$$
where the energy function E(x) and normalisation constant Z are:
$$E(x) = \beta x^T A x - b^T x, \tag{2}$$
$$Z = \int_{x \geq 0} dx \, \exp[-E(x)]. \tag{3}$$
In contrast to the Gaussian distribution, the NNBD can be multimodal in which case its 
modes are confined to the boundaries of the nonnegative orthant. 
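As a minimal illustration of Eqs. 1-2 and of the boundary modes, the following Python sketch evaluates the energy on a grid for a 2-dimensional example. The values of A, b and beta, the grid, and the function names are assumptions chosen for illustration; they are not the parameters used in the experiments of this paper.

```python
import numpy as np

def nnbd_energy(x, A, b, beta=1.0):
    """Energy E(x) = beta * x^T A x - b^T x (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    return beta * x @ A @ x - b @ x

def nnbd_log_density_unnormalised(x, A, b, beta=1.0):
    """log p(x) + log Z (Eq. 1): -E(x) inside the nonnegative orthant, -inf outside."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        return -np.inf
    return -nnbd_energy(x, A, b, beta)

# Hypothetical 'competitive' parameters: the off-diagonal coupling dominates,
# so A is indefinite but the energy is still bounded below for x >= 0.
A = np.array([[1.0, 2.0], [2.0, 1.0]])
b = np.array([2.0, 2.0])

# Brute-force search for the lowest-energy point on a grid over [0, 2]^2.
grid = np.linspace(0.0, 2.0, 201)
energies = np.array([[nnbd_energy([u, v], A, b) for v in grid] for u in grid])
i, j = np.unravel_index(np.argmin(energies), energies.shape)
grid_mode = np.array([grid[i], grid[j]])

# The only interior point with zero energy gradient, for comparison.
interior_stationary = np.array([1.0, 1.0]) / 3.0
```

For these assumed parameters the lowest-energy grid point lies on an axis, while the interior stationary point has higher energy, which is the two-mode structure described above.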
The Nonnegative Boltzmann Machine (NNBM) has been proposed as a method for learning 
the maximum likelihood parameters for this maximum entropy model from data. Without 
hidden units, it has the stochastic-EM learning rule: 
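$$\Delta A_{ij} = \langle x_i x_j \rangle_f - \langle x_i x_j \rangle_c \tag{4}$$
$$\Delta b_i = \langle x_i \rangle_c - \langle x_i \rangle_f \tag{5}$$
where the subscript "c" denotes a "clamped" average over the data, and the subscript "f"
denotes a "free" average over the NNBD:
$$\langle f(x) \rangle_c = \frac{1}{M} \sum_{\mu=1}^{M} f(x^{(\mu)}) \tag{6}$$
$$\langle f(x) \rangle_f = \int_{x \geq 0} dx \, p(x) f(x). \tag{7}$$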
This learning rule has hitherto been extremely computationally costly to implement, since 
naive variational/mean-field approximations for $\langle x x^T \rangle_f$ are found empirically to be poor,
leading to the need to use Markov chain Monte Carlo methods. This has made the NNBM 
impractical for application to high-dimensional data. 
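The clamped side of the learning rule is cheap to compute; only the free averages are costly. The sketch below assumes the rule has the form used in [1], with the update to A proportional to the free minus clamped second moments and the update to b proportional to the clamped minus free means. The function names and the learning rate `lr` are hypothetical, and the free statistics are simply passed in from whatever estimator is available.

```python
import numpy as np

def clamped_statistics(X):
    """Clamped averages (Eq. 6) over M data vectors stored as the rows of X."""
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    mean_c = X.mean(axis=0)      # <x_i>_c
    second_c = X.T @ X / M       # <x_i x_j>_c
    return mean_c, second_c

def learning_rule_step(A, b, X, free_mean, free_second, lr=0.01):
    """One update of the assumed rule, given estimates of the free statistics."""
    mean_c, second_c = clamped_statistics(X)
    A_new = A + lr * (free_second - second_c)   # Delta A ~ <xx>_f - <xx>_c
    b_new = b + lr * (mean_c - free_mean)       # Delta b ~ <x>_c - <x>_f
    return A_new, b_new

# Tiny example: two data vectors in two dimensions.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
mean_c, second_c = clamped_statistics(X)
```

When the free statistics equal the clamped ones, the update leaves A and b unchanged, which is the fixed point exploited in the next section.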
While the NNBD is generally skewed and hence has moments of order greater than 2, the 
maximum-likelihood learning rule suggests that the distribution can be described solely in 
terms of the 1st- and 2nd-order statistics of the data. With that in mind, I have pursued
advanced approximate models for the NNBM. 
In the following section I derive a second-order approximation for $\langle x_i x_j \rangle_f$ analogous to the
TAP-Onsager correction for the mean-field Ising Model, using a high temperature
expansion [4]. This produces an analytic approximation for the parameters A, b in terms of
the mean and cross-correlation matrix of the training data.
2 Learning approximate NNBM parameters using high-temperature expansion
Here I use Taylor expansion of a "free energy" directly related to the partition function
of the distribution, Z, in the $\beta = 0$ limit, to derive a second-order approximation for the
NNBM model parameters. In this free energy we embody the constraint that Eq. 5 is
satisfied:
$$G(\beta, m) = -\ln \int_{x \geq 0} dx \, \exp\Big(-\beta \sum_{i,j} A_{ij} x_i x_j - \sum_i \lambda_i(\beta)(x_i - m_i)\Big) \tag{8}$$
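where $\beta$ is an "inverse temperature". There is a direct relationship between the "free
energy", G, and the normalisation, Z, of the NNBD, Eq. 3:
$$-\ln Z = G(\beta, m) + \mathrm{Constant}(b, m) \tag{9}$$
Thus,
$$-\frac{\partial \ln Z}{\partial A_{ij}} = \frac{\partial G}{\partial A_{ij}} = \beta \langle x_i x_j \rangle_f \tag{10}$$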
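The Lagrange multipliers, $\lambda_i$, embody the constraint that $\langle x_i \rangle_f$ match the mean field of the
patterns, $m_i = \langle x_i \rangle_c$. This effectively forces $\Delta b = 0$ in Eq. 5, with $b_i = -\lambda_i(\beta)$.
Since the Lagrange constraint is enforced for all temperatures, we can solve for the specific
case $\beta = 0$.
$$m_i = \langle x_i \rangle_f \big|_{\beta=0} = \frac{\prod_k \int_0^\infty x_i \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)\, dx_k}{\prod_k \int_0^\infty \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)\, dx_k} = \frac{1}{\lambda_i(0)} \tag{11}$$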
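Note that this embodies the unboundedness of x in the nonnegative orthant, as compared
to the equivalent term of Georges & Yedidia for the Ising model, $m_i = \tanh(\lambda_i(0))$.
We consider Taylor expansion of Eq. 8 about the "high temperature" limit, $\beta = 0$.
$$G(\beta, m) = G(0, m) + \beta \left.\frac{\partial G}{\partial \beta}\right|_{\beta=0} + \frac{\beta^2}{2} \left.\frac{\partial^2 G}{\partial \beta^2}\right|_{\beta=0} + \cdots \tag{12}$$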
Since the integrand becomes factorable in xi in this limit, the infinite temperature values of 
G and its derivatives are analytically calculable. 
G(/,m)lz=o = - yln exp - Ai(O)(xi-mi) dxn (13) 
k =0 - 
using Eq. 11; 
G(,m)lz=o = - ln X(0) exp Xi(0)mi = N + lnm (14) 
k 
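The evaluation of the infinite-temperature free energy rests on a single one-dimensional integral. As a worked check, using only the definitions above, one factor of the product gives:

```latex
\int_0^\infty e^{-\lambda_k(0)(x_k - m_k)}\,dx_k
  = \frac{e^{\lambda_k(0)\,m_k}}{\lambda_k(0)}
  = e\,m_k
  \quad\text{when } \lambda_k(0) = \frac{1}{m_k} \text{ (Eq. 11)},
\qquad\text{so}\qquad
\ln \int_0^\infty e^{-\lambda_k(0)(x_k - m_k)}\,dx_k = 1 + \ln m_k .
```

Each of the N factors therefore contributes $1 + \ln m_k$ to the sum of logarithms, which enters G with the overall minus sign of Eq. 13.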
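The first derivative is then as follows
$$\left.\frac{\partial G}{\partial \beta}\right|_{\beta=0} = \frac{\prod_k \int_0^\infty \sum_{i,j} A_{ij} x_i x_j \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)\, dx_k}{\prod_k \int_0^\infty \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)\, dx_k} \tag{15}$$
$$= \sum_{i,j} (1 + \delta_{ij}) A_{ij} m_i m_j \tag{16}$$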
This term is exactly the result of applying naive mean-field theory to this system, as in [1]. 
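The naive mean-field result can be checked numerically, since at infinite temperature the variables are independent exponentials with means $m_i$ (rates $1/m_i$, Eq. 11). The sketch below compares the closed form $(1 + \delta_{ij}) m_i m_j$ with a Monte Carlo estimate. The values of m and A, the sample size and the seed are arbitrary assumptions, and the function names are hypothetical.

```python
import numpy as np

def naive_mean_field_second_moment(m):
    """<x_i x_j> at beta = 0: (1 + delta_ij) m_i m_j."""
    m = np.asarray(m, dtype=float)
    return np.outer(m, m) + np.diag(m ** 2)

def first_derivative_G(A, m):
    """dG/dbeta at beta = 0 (Eq. 16): sum_ij (1 + delta_ij) A_ij m_i m_j."""
    return float(np.sum(A * naive_mean_field_second_moment(m)))

# Monte Carlo check with independent exponentials of mean m_i.
rng = np.random.default_rng(0)
m = np.array([0.5, 1.0, 2.0])
A = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
samples = rng.exponential(scale=m, size=(200_000, 3))

# Empirical <x_i x_j> and empirical <sum_ij A_ij x_i x_j>.
empirical_second = samples.T @ samples / samples.shape[0]
empirical_dG = float(np.mean(np.einsum("ni,ij,nj->n", samples, A, samples)))
```

The diagonal factor of 2 comes from the second moment of an exponential variable, $\langle x^2 \rangle = 2m^2$, which is where the nonnegative case departs from the Ising result.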
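Likewise we obtain the second derivative
$$\left.\frac{\partial^2 G}{\partial \beta^2}\right|_{\beta=0} = -\Big\langle \Big(\sum_{i,j} A_{ij} x_i x_j\Big)^2 \Big\rangle + \Big(\sum_{i,j} (1 + \delta_{ij}) A_{ij} m_i m_j\Big)^2 + \Big\langle \sum_{i,j} A_{ij} x_i x_j \sum_k \frac{\partial \lambda_k}{\partial \beta}(x_k - m_k) \Big\rangle \tag{17}$$
$$= -\sum_{i,j} \sum_{k,l} \alpha_{ijkl} A_{ij} A_{kl} m_i m_j m_k m_l \tag{18}$$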
where $\alpha_{ijkl}$ contains the integer coefficients arising from integration by parts in the first
and second terms and $(1 + \delta_{ij})$ in the second term of Eq. 17.
This expansion is to the same order as the TAP-Onsager correction term for the Ising model,
which can be derived by an analogous approach to the equivalent free-energy [4].
Substituting these results into Eq. 10, we obtain
$$\langle x_i x_j \rangle_f \approx (1 + \delta_{ij}) m_i m_j - \beta \sum_{k,l} \alpha_{ijkl} A_{kl} m_i m_j m_k m_l \tag{19}$$
We arrive at an analytic approximation for $A_{ij}$ as a function of the 1st and 2nd moments of
the data, using Eq. 19 in the learning rule, Eq. 4, setting $\Delta A_{ij} = 0$ and solving the linear
equation for A.
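The linear solve can be sketched as follows, assuming Eq. 19 has the form $\langle x_i x_j \rangle_f \approx (1 + \delta_{ij}) m_i m_j - \beta \sum_{kl} \alpha_{ijkl} A_{kl} m_i m_j m_k m_l$. The coefficient tensor `alpha` is taken as an input here because its entries are not listed in the text; the random values used in the round-trip check are placeholders only, and the function name is hypothetical.

```python
import numpy as np

def solve_couplings(C, m, alpha, beta=1.0):
    """Solve (assumed) Eq. 19 = <x_i x_j>_c for A, a linear system in the entries of A.

    C     : (N, N) clamped second moments <x_i x_j>_c
    m     : (N,)   clamped means <x_i>_c
    alpha : (N, N, N, N) coefficient tensor alpha_ijkl, supplied by the caller
    """
    m = np.asarray(m, dtype=float)
    N = m.size
    naive = np.outer(m, m) + np.diag(m ** 2)        # (1 + delta_ij) m_i m_j
    # T_ijkl = beta * alpha_ijkl * m_i m_j m_k m_l
    T = beta * alpha * np.einsum("i,j,k,l->ijkl", m, m, m, m)
    # Setting Delta A = 0:  sum_kl T_ijkl A_kl = naive_ij - C_ij
    lhs = T.reshape(N * N, N * N)
    rhs = (naive - C).reshape(N * N)
    A_flat, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return A_flat.reshape(N, N)

# Round-trip check with placeholder coefficients (NOT the paper's alpha_ijkl).
rng = np.random.default_rng(1)
N = 3
m = rng.uniform(0.5, 1.5, size=N)
alpha = rng.normal(size=(N, N, N, N))
A_true = rng.normal(size=(N, N))
naive = np.outer(m, m) + np.diag(m ** 2)
T = alpha * np.einsum("i,j,k,l->ijkl", m, m, m, m)
C = naive - np.einsum("ijkl,kl->ij", T, A_true)     # second moments implied by A_true
A_rec = solve_couplings(C, m, alpha)
```

A least-squares solve is used so that the sketch still returns an answer when the system is rank deficient; whether that arises for the true coefficients is not addressed here.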
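We can obtain an equivalent expansion for $\lambda_i(\beta)$ and hence $b_i$. To first order in $\beta$
(equivalent to the order of $\beta$ in the approximation for A), we have
$$\lambda_i(\beta) \approx \lambda_i(0) + \beta \left.\frac{\partial \lambda_i}{\partial \beta}\right|_{\beta=0} + \cdots \tag{20}$$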
Using Eqs. 11 & 15 
Hence 
Y5 = 07., 
= - + 
(21) 
(22) 
(23) 
$$b_i = -\lambda_i(\beta) \approx -\frac{1}{m_i} + \beta \sum_j (1 + \delta_{ij}) A_{ij} m_j \tag{24}$$
The approach presented here makes an explicit approximation of the statistics required
for the NNBM learning rule, $\langle x x^T \rangle_f$, which can be substituted in the fixed-point equation
Eq. 4, and yields a linear equation in A to be solved. This is in contrast to the linear
response theory approach of Kappen & Rodriguez [6] to the Boltzmann Machine, which
exploits the relationship
$$\frac{\partial^2 \ln Z}{\partial b_i \partial b_j} = \langle x_i x_j \rangle_f - \langle x_i \rangle_f \langle x_j \rangle_f = \chi_{ij} \tag{25}$$
between the free energy and the covariance matrix $\chi$ of the model. In the learning problem,
this produces a quadratic equation in A, the solution of which is non-trivial.
Computationally efficient solutions of the linear response theory are then obtained by secondary
approximation of the 2nd-order term, compromising the fidelity of the model.
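The linear response relationship between the second derivative of $\ln Z$ with respect to b and the model covariance can be verified by hand in the simplest case. The following worked example assumes one dimension with the quadratic term switched off ($\beta = 0$) and $b < 0$, so that the integral converges:

```latex
Z(b) = \int_0^\infty e^{b x}\,dx = -\frac{1}{b},
\qquad
\ln Z = -\ln(-b),
\qquad
\frac{\partial \ln Z}{\partial b} = -\frac{1}{b} = \langle x \rangle,
\qquad
\frac{\partial^2 \ln Z}{\partial b^2} = \frac{1}{b^2}
  = \langle x^2 \rangle - \langle x \rangle^2 .
```

The last equality holds because an exponential variable with mean $m = -1/b$ has $\langle x^2 \rangle = 2m^2$, hence variance $m^2 = 1/b^2$.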
3 Learning a 'Competitive' Nonnegative Boltzmann Distribution 
A visualisable test problem is that of learning a bimodal NNBD in 2 dimensions. Monte
Carlo slice sampling (see [1] & [5]) was used to generate 200 samples from an NNBD, as
shown in Fig. 1(a). The high temperature expansion was then used to learn approximate
parameters for the NNBM model of this data. A surface plot of the resulting model
distribution is shown in Fig. 1(b); it is clearly a valid candidate generative distribution for the
data. This is in strong contrast with a naive mean field ($\beta = 0$) model, which by
construction would be unable to produce a multiple-peaked approximation, as previously
described [1].

Figure 1: (a) Training data, generated from 2-dimensional 'competitive' NNBD, (b)
Learned model distribution, under the high temperature expansion.

4 Orientation Tuning in Visual Cortex - a translationally invariant model
The neural network model of Ben-Yishai et al. [7] for orientation tuning in visual cortex
has the property that its dynamics exhibit a continuum of stable states which are
translationally invariant across the network. The energy function of the network model is a
translationally invariant function of the angles of maximal response, $\theta_i$, of the N neurons,
and can be mapped directly onto the energy of the NNBM, as described in [1].
$$A_{ij} = \beta \Big(\delta_{ij} + \frac{1}{N} - \frac{\epsilon}{N} \cos\Big(\frac{2\pi}{N} |i - j|\Big)\Big), \qquad b_i = \beta \tag{26}$$
We can generate training data for the NNBM by sampling from the neural network model 
with known parameters. It is easily shown that Aij has 2 equal negative eigenvalues, the 
remainder being positive and equal in value. The corresponding pair of eigenvectors of A 
are sinusoids of period equal to the width of the stable activation bumps of the network, 
with a small relative phase. 
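The eigenvalue structure can be inspected numerically. The sketch below assumes the ring-network coupling takes the form $A_{ij} = \beta(\delta_{ij} + 1/N - (\epsilon/N)\cos(2\pi(i - j)/N))$, with the illustrative values N = 10, $\epsilon = 4$ and $\beta = 100$; both the functional form and the function name are assumptions made for this example.

```python
import numpy as np

def ring_coupling_matrix(N, eps, beta):
    """Assumed coupling: beta * (delta_ij + 1/N - (eps/N) cos(2 pi (i - j) / N))."""
    idx = np.arange(N)
    diff = idx[:, None] - idx[None, :]
    return beta * (np.eye(N) + 1.0 / N - (eps / N) * np.cos(2.0 * np.pi * diff / N))

A = ring_coupling_matrix(N=10, eps=4.0, beta=100.0)

# eigh is appropriate because A is symmetric; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(A)
negative_pair = eigenvalues[:2]             # the two lowest eigenvalues
negative_modes = eigenvectors[:, :2]        # a basis for their eigenspace
```

Because the two negative eigenvalues are degenerate, `eigh` returns an arbitrary orthonormal basis of that eigenspace, so the columns of `negative_modes` are sinusoids only up to a rotation (a common phase shift).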
Here, the NNBM parameters have been solved using the high-temperature expansion for 
training data generated by Monte Carlo slice-sampling [5] from a 10-neuron model with 
parameters $\epsilon = 4$, $\beta = 100$ in Eq. 26. Fig. 2 illustrates modal activity patterns of the learned
NNBM model distribution, found using gradient ascent of the log-likelihood function from 
a random initialisation of the variables. 
$$\Delta x \propto [-Ax + b]^+ \tag{27}$$
where the superscript + denotes rectification. 
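A mode search of this kind can be sketched with projected gradient ascent, a closely related variant in which the updated state (rather than the gradient) is rectified so that x stays in the nonnegative orthant. This is an illustrative substitute and not necessarily the exact update of Eq. 27. The step size, iteration count, function name and the 2-dimensional parameters are assumptions.

```python
import numpy as np

def find_mode(A, b, x0, beta=1.0, step=0.05, n_steps=500):
    """Projected gradient ascent on log p(x) = -E(x) - log Z, restricted to x >= 0."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        grad = b - 2.0 * beta * A @ x           # gradient of -E(x) for symmetric A
        x = np.maximum(x + step * grad, 0.0)    # rectify to stay in the orthant
    return x

# Hypothetical 2-dimensional 'competitive' parameters.
A = np.array([[1.0, 2.0], [2.0, 1.0]])
b = np.array([2.0, 2.0])

# Different initialisations fall into different boundary modes.
mode_a = find_mode(A, b, x0=[0.8, 0.2])
mode_b = find_mode(A, b, x0=[0.2, 0.8])
```

For these assumed parameters the two starting points converge to different modes, one on each axis, which mirrors the use of several random initialisations to locate distinct modal states.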
These modes of the approximate NNBM model are highly similar to the training patterns,
and the eigenvectors and eigenvalues of A exhibit similar properties in their learned
and training forms. This gives evidence that the approximation is successful in learning a
high-dimensional translationally invariant NNBM model.

Figure 2: Upper: 2 modal states of the NNBM model density, located by gradient-ascent
of the log-likelihood from different random initialisations. Lower: The two
negative-eigenvalue eigenvectors of A, a) in the learned model, and b) as used to generate the
training data. (Horizontal axes: neuron number.)

5 Generative Model for Handwritten Digits
In Fig. 3, I show the results of applying the high-temperature NNBM to learning a
generative model for the feature coactivations of the Nonnegative Matrix Factorization [2]
decomposition of a database of the handwritten digits, 0-9. This problem contains none of
the space-filling symmetry of the visual cortex model, and hence requires a more strongly
multimodal generative model distribution to generate distinct digits. Here performance is
poor, although superior to uniformly-sampled feature activations.
6 Discussion 
In this work, an approximate technique has been derived for directly determining the 
NNBM parameters A, b in terms of the 1st- and 2nd-order statistics of the data, using
the method of high-temperature expansion. To second order this produces corrections to 
the naive mean field approximation of the system analogous to the TAP term for the Ising 
Model/Boltzmann Machine. The efficacy of this approximation has been demonstrated 
in the pathological case of learning the 'competitive' NNBD, learning the translationally 
invariant model in 10 dimensions, and a generative model for handwritten digits. 
These results demonstrate an improvement in approximation to models in this class over 
a naive mean field ($\beta = 0$) approach, without reversion to secondary assumptions such as
those made in the linear response theory for the Boltzmann Machine. 
There is strong current interest in the relationship between TAP-like mean field theory, 
variational approximation and belief-propagation in graphical models with loops. All of 
these can be interpreted in terms of minimising an effective free energy of the system [8]. 
The distinction in the work presented here lies in choosing optimal approximate statistics 
to learn the true model, under the assumption that satisfaction of the fixed-point equations 
of the true model optimises the free energy. This compares favourably with variational
approaches which directly optimise an approximate model distribution.

Figure 3: Digit images generated with feature activations sampled from a) a uniform
distribution, and b) a high-temperature NNBM model for the digits.

Methods of this type fail when they add spurious fixed points to the learning dynamics. 
Future work will focus on understanding the origins of such fixed points, and the regimes 
in which they lead to a poor approximation of the model parameters. 
7 Acknowledgements 
This work was inspired by the NIPS 1999 Workshop on Advanced Mean Field Methods. 
The author is especially grateful to David MacKay and Gayle Wittenberg for comments on 
early versions of this manuscript. I also acknowledge guidance from John Hopfield and 
David Heckerman, detailed discussion with Bert Kappen, Daniel Lee and David Barber 
and encouragement from Kim Midwood. 
References 
[1] Downs, OB, MacKay, DJC, & Lee, DD (2000). The Nonnegative Boltzmann Machine. Ad- 
vances in Neural Information Processing Systems 12, 428-434. 
[2] Lee, DD, and Seung, HS (1999) Learning the parts of objects by non-negative matrix factor- 
ization. Nature 401, 788-791. 
[3] Socci, ND, Lee, DD, and Seung, HS (1998). The rectified Gaussian distribution. Advances in 
Neural Information Processing Systems 10, 350-356. 
[4] Georges, A, & Yedidia, JS (1991). How to expand around mean-field theory using high- 
temperature expansions. Journal of Physics A 24, 2173-2192. 
[5] Neal, RM (1997). Markov chain Monte Carlo methods based on 'slicing' the density function. 
Technical Report 9722, Dept. of Statistics, University of Toronto. 
[6] Kappen, HJ & Rodriguez, FB (1998). Efficient learning in Boltzmann Machines using linear 
response theory. Neural Computation 10, 1137-1156. 
[7] Ben-Yishai, R, Bar-Or, RL, & Sompolinsky, H (1995). Theory of orientation tuning in visual 
cortex. Proc. Nat. Acad. Sci. USA, 92(9):3844-3848. 
[8] Yedidia, JS, Freeman, WT, & Weiss, Y (2000). Generalized Belief Propagation. Mitsubishi 
Electric Research Laboratory Technical Report, TR-2000-26. 
