A variational mean-field theory for
sigmoidal belief networks
C. Bhattacharyya 
Computer Science and Automation 
Indian Institute of Science 
Bangalore, India, 560012 
cbchiru@csa.iisc.ernet.in
S. Sathiya Keerthi 
Mechanical and Production Engineering 
National University of Singapore 
mpessk@guppy.mpe.nus.edu.sg
Abstract 
A variational derivation of Plefka's mean-field theory is presented. 
This theory is then applied to sigmoidal belief networks with the 
aid of further approximations. Empirical evaluation on small-scale
networks shows that the proposed approximations are quite competitive.
1 Introduction
Application of mean-field theory to solve the problem of inference in Belief Networks (BNs) is well known [1]. In this paper we discuss a variational mean-field theory and its application to BNs, sigmoidal BNs in particular.
We present a variational derivation of the mean-field theory proposed by Plefka [2].
The theory will be developed for a stochastic system consisting of N binary random variables, S_i \in \{0, 1\}, described by an energy function E(\bar{S}) and the following Boltzmann-Gibbs distribution at a temperature T:

P(\bar{S}) = \frac{e^{-E(\bar{S})/T}}{Z}, \qquad Z = \sum_{\bar{S}} e^{-E(\bar{S})/T}
The application of this mean-field method to Boltzmann Machines (BMs) has already been carried out [3]. A large class of BNs is described by the following energy function:

E(\bar{S}) = -\sum_{i=1}^{N} \left\{ S_i \ln f(M_i) + (1 - S_i)\ln(1 - f(M_i)) \right\}, \qquad M_i = \sum_{j=1}^{i-1} w_{ij} S_j + h_i
The application of the mean-field theory to such energy functions is not straightforward and further approximations are needed. We propose a new approximation scheme and discuss its utility for sigmoid networks, which are obtained by substituting

f(x) = \frac{1}{1 + e^{-x}}

in the above energy function. The paper is organized as follows. In section 2 we present a variational derivation of Plefka's mean-field theory. In section 3 the theory is extended to sigmoidal belief networks. In section 4 an empirical evaluation is presented. Concluding remarks are given in section 5.
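For concreteness, the energy function above can be sketched in Python; the 3-unit weights below are hypothetical and chosen only for illustration. A useful sanity check: at T = 1, e^{-E(\bar{S})} is exactly the product of the conditional Bernoulli probabilities f(M_i)^{S_i}(1 - f(M_i))^{1 - S_i}, so the unclamped partition function is 1.

```python
import itertools
import numpy as np

def f(x):
    """Sigmoid activation f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def energy(S, W, h):
    """E(S) = -sum_i [S_i ln f(M_i) + (1 - S_i) ln(1 - f(M_i))],
    with M_i = sum_{j<i} w_ij S_j + h_i (W strictly lower triangular)."""
    M = W @ S + h
    p = f(M)
    return -np.sum(S * np.log(p) + (1 - S) * np.log(1 - p))

# Hypothetical 3-unit network.
W = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [-0.3, 0.8, 0.0]])
h = np.array([0.1, -0.2, 0.4])

# Sum e^{-E(S)} over all 2^3 states: equals 1 up to floating point,
# by the chain rule of the directed factorization.
Z = sum(np.exp(-energy(np.array(S, dtype=float), W, h))
        for S in itertools.product((0, 1), repeat=3))
print(Z)  # 1.0 (up to rounding)
```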
2 A Variational mean-field theory 
Plefka [2] proposed a mean-field theory in the context of spin glasses. This theory can, in principle, yield arbitrarily close approximations to log Z. In this section we present an alternate derivation from a variational viewpoint; see also [4], [5].
Let \gamma be a real parameter that takes values from 0 to 1. Let us define a \gamma-dependent partition function and distribution,

Z_\gamma = \sum_{\bar{S}} e^{-\gamma E(\bar{S})/T}, \qquad p_\gamma(\bar{S}) = \frac{e^{-\gamma E(\bar{S})/T}}{Z_\gamma} \qquad (1)

Note that Z_1 = Z and p_1 = p. Introducing an external real vector \bar\theta, let us rewrite (1) as

Z_\gamma = \tilde{Z}_\gamma \left\langle e^{-\sum_i \theta_i S_i} \right\rangle_{\tilde{p}_\gamma} \qquad (2)

where \tilde{Z}_\gamma is the partition function associated with the distribution \tilde{p}_\gamma given by

\tilde{Z}_\gamma = \sum_{\bar{S}} e^{-\gamma E(\bar{S})/T + \sum_i \theta_i S_i}, \qquad \tilde{p}_\gamma(\bar{S}) = \frac{e^{-\gamma E(\bar{S})/T + \sum_i \theta_i S_i}}{\tilde{Z}_\gamma} \qquad (3)
Using Jensen's inequality, \langle e^{-x} \rangle \ge e^{-\langle x \rangle}, we get

Z_\gamma = \tilde{Z}_\gamma \left\langle e^{-\sum_i \theta_i S_i} \right\rangle_{\tilde{p}_\gamma} \ge \tilde{Z}_\gamma e^{-\sum_i \theta_i u_i} \qquad (4)

where

u_i = \langle S_i \rangle_{\tilde{p}_\gamma} \qquad (5)

Taking logarithms on both sides of (4) we obtain

\log Z_\gamma \ge \log \tilde{Z}_\gamma - \sum_i \theta_i u_i \qquad (6)
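The bound (6) holds for any choice of the external vector \bar\theta. A small numeric check on a toy system (random energies and \bar\theta values, chosen arbitrarily for illustration):

```python
import itertools
import math
import random

random.seed(0)
N, gamma, T = 4, 0.7, 1.0
# Random energy table over the 2^N binary states.
E = {S: random.uniform(-1, 1) for S in itertools.product((0, 1), repeat=N)}
theta = [random.uniform(-1, 1) for _ in range(N)]

def tilted(S):
    """Unnormalized tilted weight e^{-gamma E(S)/T + sum_i theta_i S_i}."""
    return math.exp(-gamma * E[S] / T + sum(t * s for t, s in zip(theta, S)))

Zg = sum(math.exp(-gamma * E[S] / T) for S in E)   # Z_gamma
Zt = sum(tilted(S) for S in E)                     # tilde Z_gamma, eq. (3)
# u_i = <S_i> under the tilted distribution, eq. (5)
u = [sum(S[i] * tilted(S) for S in E) / Zt for i in range(N)]

lhs = math.log(Zg)
rhs = math.log(Zt) - sum(t * ui for t, ui in zip(theta, u))
print(lhs >= rhs)  # Jensen's bound (6): True for any theta
```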
The right-hand side is defined as a function of \bar{u} and \gamma via the following assumption.

Invertibility assumption: for each fixed \bar{u} and \gamma, (5) can be solved for \bar\theta.

If the invertibility assumption holds then we can use \bar{u} as the independent vector (with \bar\theta dependent on \bar{u}) and rewrite (6) as

-\ln Z_\gamma \le G(\bar{u}, \gamma) \qquad (7)

where G is defined as

G(\bar{u}, \gamma) = -\log \tilde{Z}_\gamma + \sum_i \theta_i u_i

This gives a variational feel: treat \bar{u} as an external variable vector and choose it to minimize G for a fixed \gamma. The stationarity conditions of this minimization problem yield

\frac{\partial G}{\partial u_i} = \theta_i = 0.

At the minimum point we have the equality G = -\log Z_\gamma.
It is difficult to invert (5) for \gamma \ne 0, making it impossible to write an algebraic expression for G for any nonzero \gamma. At \gamma = 0 the inversion is straightforward and one obtains

G(\bar{u}, 0) = \sum_{i=1}^{N} \left( u_i \ln u_i + (1 - u_i)\ln(1 - u_i) \right), \qquad \tilde{p}_0 = \prod_i u_i^{S_i} (1 - u_i)^{1 - S_i}
A Taylor series approach is then undertaken around \gamma = 0 to build an approximation to G. Define

G_M(\bar{u}, \gamma) = G(\bar{u}, 0) + \sum_{k=1}^{M} \frac{\gamma^k}{k!} \left. \frac{\partial^k G}{\partial \gamma^k} \right|_{\gamma=0} \qquad (8)

Then G_M can be considered an approximation of G. The stationarity conditions are enforced by setting

\frac{\partial G}{\partial u_i} \approx \frac{\partial G_M}{\partial u_i} = 0.
In this paper we restrict ourselves to M = 2. To do this we need to evaluate the following derivatives:

\left. \frac{\partial G}{\partial \gamma} \right|_{\gamma=0} = \frac{1}{T} \langle E \rangle_0 \qquad (9)

\left. \frac{\partial^2 G}{\partial \gamma^2} \right|_{\gamma=0} = -\frac{1}{T^2} \left( \langle (E - \langle E \rangle_0)^2 \rangle_0 - \sum_i \frac{\mathrm{cov}^2(E, S_i)}{\mathrm{var}(S_i)} \right) \qquad (10)

where

\mathrm{cov}(E, S_i) = \langle (E - \langle E \rangle_0)(S_i - u_i) \rangle_0, \qquad \mathrm{var}(S_i) = \langle (S_i - u_i)^2 \rangle_0

For M = 1 we have the standard mean-field approach. The expression for M = 2 can be identified with the TAP correction; for the BM energy function, (10) yields the TAP term.
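The M = 1 case can be made concrete on a small Boltzmann-machine energy (the setting of [3]), where \langle E \rangle_0 has the closed form -\frac{1}{2}\bar{u}^T J \bar{u} - \bar{h}^T\bar{u} under the factorized \tilde{p}_0, and the stationarity condition reduces to the familiar fixed point u_i = f((J\bar{u} + \bar{h})_i / T). This is a sketch with arbitrary random couplings, not the paper's experiment; note that (7) guarantees G_1 upper-bounds -\ln Z at any \bar{u}, converged or not.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1.0
J = np.triu(rng.uniform(-1, 1, (N, N)), 1)
J = J + J.T                         # symmetric couplings, zero diagonal
h = rng.uniform(-1, 1, N)

def E(S):
    """Boltzmann-machine energy E(S) = -S.J.S/2 - h.S."""
    return -0.5 * S @ J @ S - h @ S

# Exact partition function by enumeration over all 2^N states.
Z = sum(math.exp(-E(np.array(S, dtype=float)) / T)
        for S in itertools.product((0, 1), repeat=N))

# M = 1 fixed-point iteration: u_i = f((J u + h)_i / T).
u = np.full(N, 0.5)
for _ in range(200):
    u = 1.0 / (1.0 + np.exp(-(J @ u + h) / T))

# G_1(u) = sum_i [u ln u + (1-u) ln(1-u)] + <E>_0 / T
G1 = np.sum(u * np.log(u) + (1 - u) * np.log(1 - u)) \
     + (-0.5 * u @ J @ u - h @ u) / T
print(G1 >= -math.log(Z))           # True: G upper-bounds -ln Z, as in (7)
```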
3 Mean-field approximations for BNs 
The method, as developed in the previous section, is not directly useful for BNs because of the intractability of the partial derivatives at \gamma = 0. To overcome this problem, we suggest an approximation based on a Taylor series expansion. Though in this paper we restrict ourselves to the sigmoid activation function, the method is applicable to other activation functions as well. It enables calculation of all the terms required for extending Plefka's method to BNs. Since, for BN operation, T is fixed at 1, T will be dropped from all equations in the rest of the paper.
Let us define a new energy function

\hat{E}(\beta, \bar{S}, \gamma, \bar{u}) = -\sum_{i=1}^{N} \left\{ S_i \ln f(\hat{M}_i(\beta)) + (1 - S_i)\ln(1 - f(\hat{M}_i(\beta))) \right\}

where 0 \le \beta \le 1 and

\hat{M}_i(\beta) = \beta \sum_{j=1}^{i-1} w_{ij}(S_j - u_j) + \bar{M}_i, \qquad \bar{M}_i = \sum_{j=1}^{i-1} w_{ij} u_j + h_i \qquad (11)

Note that \hat{E}(1, \bar{S}, \gamma, \bar{u}) = E(\bar{S}). Since \beta is the important parameter, \hat{E}(\beta, \bar{S}, \gamma, \bar{u}) will be referred to as \hat{E}(\beta) so as to avoid notational clumsiness. We use a Taylor series approximation of \hat{E}(\beta) with respect to \beta. Let us define

\hat{E}_C(\beta) = \hat{E}(0) + \sum_{k=1}^{C} \frac{\beta^k}{k!} \left. \frac{\partial^k \hat{E}}{\partial \beta^k} \right|_{\beta=0} \qquad (13)

If \hat{E}_C approximates \hat{E}, then we can write

\hat{E}_C(1) \approx \hat{E}(1) = E \qquad (14)

Let us now define the following function

A(\gamma, \beta, \bar{u}) = -\ln \sum_{\bar{S}} e^{-\gamma \hat{E} + \sum_i \theta_i S_i} + \sum_i \theta_i u_i \qquad (15)

The \theta_i are assumed to be functions of \bar{u}, \beta, \gamma, which are obtained by inverting the equations

u_k = \sum_{\bar{S}} S_k \tilde{p}_{\gamma\beta} \quad \forall k, \qquad \tilde{p}_{\gamma\beta} = \frac{e^{-\gamma \hat{E} + \sum_i \theta_i S_i}}{\sum_{\bar{S}} e^{-\gamma \hat{E} + \sum_i \theta_i S_i}} \qquad (12)

By replacing \hat{E} by \hat{E}_C in (15) we obtain A_C:

A_C(\gamma, \beta, \bar{u}) = -\ln \sum_{\bar{S}} e^{-\gamma \hat{E}_C + \sum_i \hat\theta_i S_i} + \sum_i \hat\theta_i u_i \qquad (16)

where the definition of \hat\theta is obtained by replacing \hat{E} by \hat{E}_C. In view of (14) one can consider A_C as an approximation to A. This observation suggests an approximation to G:

G(\bar{u}, \gamma) = A(\gamma, 1, \bar{u}) \approx A_C(\gamma, 1, \bar{u}) \qquad (17)

The terms needed in the Taylor expansion of G in \gamma can then be approximated by

G(\bar{u}, 0) = A(0, 1, \bar{u}) = A_C(0, 1, \bar{u}), \qquad \left. \frac{\partial^k G}{\partial \gamma^k} \right|_{\gamma=0} = \left. \frac{\partial^k A}{\partial \gamma^k} \right|_{\gamma=0, \beta=1} \approx \left. \frac{\partial^k A_C}{\partial \gamma^k} \right|_{\gamma=0, \beta=1}

The biggest advantage of working with A_C rather than G is that the partial derivatives of A_C with respect to \gamma at \gamma = 0 and \beta = 1 can be expressed as functions of \bar{u}. We define

\hat{G}_{MC}(\bar{u}, \gamma) = A_C(0, 1, \bar{u}) + \sum_{k=1}^{M} \frac{\gamma^k}{k!} \left. \frac{\partial^k A_C}{\partial \gamma^k} \right|_{\gamma=0, \beta=1} \qquad (18)

In light of the above discussion one can consider G_M \approx \hat{G}_{MC}; hence the mean-field equations can be stated as

\frac{\partial G}{\partial u_i} \approx \frac{\partial \hat{G}_{MC}}{\partial u_i} = 0 \qquad (19)

In this paper we restrict ourselves to M = 2. The relevant objective functions for a general C are given by

\hat{G}_{1C} = \sum_i \left\{ u_i \log u_i + (1 - u_i)\log(1 - u_i) \right\} + \langle \hat{E}_C \rangle_0 \qquad (20)

\hat{G}_{2C} = \hat{G}_{1C} - \frac{1}{2} \left( \langle (\hat{E}_C - \langle \hat{E}_C \rangle_0)^2 \rangle_0 - \sum_i \frac{\mathrm{cov}^2(\hat{E}_C, S_i)}{\mathrm{var}(S_i)} \right) \qquad (21)

All these objective functions can be expressed as functions of \bar{u}.

Figure 1: Three-layer BN (2 x 4 x 6) with top-down propagation of beliefs. The activation function was chosen to be sigmoid.
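The \beta-expansion (13) admits closed-form derivatives for the sigmoid case: writing a_i = \sum_{j<i} w_{ij}(S_j - u_j), one finds \partial\hat{E}/\partial\beta = -\sum_i a_i (S_i - f(\hat{M}_i)) and \partial^2\hat{E}/\partial\beta^2 = \sum_i a_i^2 f(\hat{M}_i)(1 - f(\hat{M}_i)). A sketch comparing \hat{E}_2(1) with the exact E = \hat{E}(1); the network, \bar{u}, and state below are arbitrary illustrations:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
N = 6
W = np.tril(rng.uniform(-1, 1, (N, N)), -1)   # w_ij, j < i
h = rng.uniform(-1, 1, N)
u = rng.uniform(0.05, 0.95, N)                # mean-field parameters u_i
S = rng.integers(0, 2, N).astype(float)       # one binary configuration

Mbar = W @ u + h          # \bar M_i of (11)
a = W @ (S - u)           # so \hat M_i(beta) = beta * a_i + \bar M_i
fi = f(Mbar)

def Ehat(beta):
    """\hat E(beta) as defined via (11)."""
    p = f(Mbar + beta * a)
    return -np.sum(S * np.log(p) + (1 - S) * np.log(1 - p))

# Closed-form derivatives of \hat E at beta = 0.
dE = -np.sum(a * (S - fi))                    # first derivative
d2E = np.sum(a**2 * fi * (1 - fi))            # second derivative
E2 = Ehat(0.0) + dE + 0.5 * d2E               # \hat E_2(1), i.e. (13) at beta = 1
print(E2, Ehat(1.0))                          # truncated vs. exact energy
```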
4 Experimental results 
To test the approximation schemes developed in the previous sections, numerical experiments were conducted. Saul et al. [1] pioneered the application of mean-field theory to BNs. We will refer to their method as the SJJ approach and compare our schemes against it.
Small networks were chosen so that ln Z can be computed by exact enumeration for evaluation purposes. For all the experiments the network topology was fixed to the one shown in Figure 1. This choice of network enables us to compare our results with those of [1]. To compare the performance of our methods with theirs, we repeated the experiment conducted in [1] for sigmoid BNs. Ten thousand networks were generated by randomly choosing weight values in [-1, 1]. The bottom-layer units, or visible units, of each network were instantiated to zero. The likelihood, ln Z, was computed by exact enumeration of all the states in the higher two layers. The approximate value of -ln Z was computed by \hat{G}_{MC}; \bar{u} was computed by solving the fixed-point equations obtained from (19). The goodness of the approximation scheme was tested by the following measure:

\Delta = \frac{\ln Z + \hat{G}_{MC}}{|\ln Z|} \qquad (22)

For a proper comparison we also implemented the SJJ method. The goodness of approximation for the SJJ scheme is evaluated by substituting -\hat{G}_{MC} in (22) by L_{approx}, the SJJ bound on ln Z; for the specific formula see [1]. The results are presented in the form of histograms in Figure 2. We also repeated the experiment with weights and biases taking values between -5 and 5; the results are again presented in the form of histograms, in Figure 3. The findings are summarized, in the form of means, in Table 1.

Table 1: Mean of \Delta for randomly generated sigmoid networks, in different weight ranges.

                    small weights [-1, 1]    large weights [-5, 5]
\hat{G}_{11}              -0.0404                  -0.0440
\hat{G}_{21}               0.0155                   0.0231
\hat{G}_{22}               0.0029                  -0.0456
SJJ                        0.0157                   0.0962
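For the topology of Figure 1, the exact term ln Z in (22) is cheap: with the six visible units clamped, only the 2^{2+4} = 64 joint states of the top two layers need to be enumerated. A minimal sketch of this enumeration; the layered weight matrix, seed, and ranges are illustrative assumptions, not the paper's generated networks:

```python
import itertools
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
N = 2 + 4 + 6                       # top, middle, visible layers
W = np.zeros((N, N))                # unit i receives input from the layer above
W[2:6, 0:2] = rng.uniform(-1, 1, (4, 2))
W[6:12, 2:6] = rng.uniform(-1, 1, (6, 4))
h = rng.uniform(-1, 1, N)

def log_prob(S):
    """log P(S) = sum_i [S_i ln f(M_i) + (1 - S_i) ln(1 - f(M_i))]."""
    p = f(W @ S + h)
    return np.sum(S * np.log(p) + (1 - S) * np.log(1 - p))

visible = np.zeros(6)               # visible units instantiated to zero
lnZ = np.log(sum(np.exp(log_prob(np.concatenate([np.array(Hd, dtype=float),
                                                 visible])))
                 for Hd in itertools.product((0, 1), repeat=6)))
print(lnZ)                          # exact log-likelihood of the evidence
```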
For small weights \hat{G}_{21} and the SJJ approach show close results, which was expected. But the improvement achieved by the \hat{G}_{22} scheme is remarkable; it gave a mean value of 0.0029, which compares substantially well against the mean value of 0.01139 reported in [6]. The improvement in [6] was achieved by using a mixture distribution, which requires the introduction of extra variational variables; more than 100 extra variational variables are needed for a 5-component mixture. This results in a substantial increase in computational cost. On the other hand, the extra computational cost of \hat{G}_{22} over \hat{G}_{21} is marginal, which makes the \hat{G}_{22} scheme computationally attractive compared to the mixture distribution.
Figure 2: Histograms of \Delta for the \hat{G}_{MC} and SJJ schemes for weights taking values in [-1, 1], for sigmoid networks. The plot on the left shows histograms for the schemes \hat{G}_{11} and \hat{G}_{21}; they did not have any overlap. \hat{G}_{11} gives a mean of -0.040 while \hat{G}_{21} gives a mean of 0.0155. The middle plot shows the histogram for the SJJ scheme, whose mean is 0.0157. The plot at the extreme right is for the scheme \hat{G}_{22}, having a mean of 0.0029.
Of the three schemes, \hat{G}_{21} is the most robust and also yields reasonably accurate results. It is outperformed only by \hat{G}_{22}, in the case of sigmoid networks with small weights. Empirical evidence thus suggests that the choice of a scheme is not straightforward and depends on the activation function as well as the parameter values.
Figure 3: Histograms of \Delta for the \hat{G}_{MC} and SJJ schemes for weights taking values in [-5, 5] for sigmoid networks. The leftmost histogram shows \Delta for the \hat{G}_{11} scheme, having a mean of -0.0440; second from left is the \hat{G}_{21} scheme, with a mean of 0.0231; second from right is the SJJ scheme, with a mean of 0.0962. The \hat{G}_{22} scheme is at the extreme right, with mean -0.0456.
5 Discussion 
Application of Plefka's theory to BNs is not straightforward; it requires the computation of averages that are not tractable. We presented a scheme in which the BN energy function is approximated by a Taylor series, which gives a tractable approximation to the terms required for Plefka's method. Various approximation schemes, depending on the degree of the Taylor series expansion, were derived. Unlike the approach in [1], the schemes discussed here are simpler, as they do not introduce extra variational variables. Empirical evaluation on small-scale networks shows that the quality of the approximations is quite good. For a more detailed discussion of these points see [7].
References 
[1] Saul, L. K., Jaakkola, T. and Jordan, M. I. (1996), Mean field theory for sigmoid belief networks, Journal of Artificial Intelligence Research, 4.
[2] Plefka, T. (1982), Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model, J. Phys. A: Math. Gen., 15.
[3] Kappen, H. J. and Rodriguez, F. B. (1998), Boltzmann machine learning using mean field theory and linear response correction, Advances in Neural Information Processing Systems 10, (eds.) M. I. Jordan, M. J. Kearns and S. A. Solla, MIT Press.
[4] Georges, A. and Yedidia, J. S. (1991), How to expand around mean-field theory using high-temperature expansions, J. Phys. A: Math. Gen., 24.
[5] Bhattacharyya, C. and Keerthi, S. S. (2000), Information geometry and Plefka's mean-field theory, J. Phys. A: Math. Gen., 33.
[6] Bishop, C. M., Lawrence, N., Jaakkola, T. and Jordan, M. I. (1997), Approximating posterior distributions in belief networks using mixtures, Advances in Neural Information Processing Systems 10, (eds.) M. I. Jordan, M. J. Kearns and S. A. Solla, MIT Press.
[7] Bhattacharyya, C. and Keerthi, S. S. (1999), Mean field theory for a special class of belief networks, accepted in Journal of Artificial Intelligence Research.
