Generalized Belief Propagation 
Jonathan S. Yedidia 
MERL 
201 Broadway 
Cambridge, MA 02139 
Phone: 617-621-7544 
yedidia@merl.com
William T. Freeman 
MERL 
201 Broadway 
Cambridge, MA 02139 
Phone: 617-621-7527 
freeman@merl.com
Yair Weiss 
Computer Science Division 
UC Berkeley, 485 Soda Hall 
Berkeley, CA 94720-1776 
Phone: 510-642-5029 
yweiss@cs.berkeley.edu
Abstract 
Belief propagation (BP) was only supposed to work for tree-like 
networks but works surprisingly well in many applications involving 
networks with loops, including turbo codes. However, there has 
been little understanding of the algorithm or the nature of the 
solutions it finds for general graphs. 
We show that BP can only converge to a stationary point of an 
approximate free energy, known as the Bethe free energy in statis- 
tical physics. This result characterizes BP fixed-points and makes 
connections with variational approaches to approximate inference. 
More importantly, our analysis lets us build on the progress made 
in statistical physics since Bethe's approximation was introduced 
in 1935. Kikuchi and others have shown how to construct more ac- 
curate free energy approximations, of which Bethe's approximation 
is the simplest. Exploiting the insights from our analysis, we de- 
rive generalized belief propagation (GBP) versions of these Kikuchi 
approximations. These new message passing algorithms can be 
significantly more accurate than ordinary BP, at an adjustable in- 
crease in complexity. We illustrate such a new GBP algorithm on 
a grid Markov network and show that it gives much more accurate 
marginal probabilities than those found using ordinary BP. 
1 Introduction
Local "belief propagation" (BP) algorithms such as those introduced by Pearl are 
guaranteed to converge to the correct marginal posterior probabilities in tree-like 
graphical models. For general networks with loops, the situation is much less clear. 
On the one hand, a number of researchers have empirically demonstrated good 
performance for BP algorithms applied to networks with loops. One dramatic case 
is the near Shannon-limit performance of "Turbo codes", whose decoding algorithm 
is equivalent to BP on a loopy network [2, 6]. For some problems in computer vision 
involving networks with loops, BP has also been shown to be accurate and to converge
very quickly [2, 1, 7]. On the other hand, for other networks with loops, BP may 
give poor results or fail to converge [7]. 
For a general graph, little has been understood about what approximation BP 
represents, and how it might be improved. This paper's goal is to provide that 
understanding and introduce a set of new algorithms resulting from that under- 
standing. We show that BP is the first in a progression of local message-passing 
algorithms, each giving equivalent results to a corresponding approximation from 
statistical physics known as the "Kikuchi" approximation to the Gibbs free energy. 
These algorithms have the attractive property of being user-adjustable: by pay- 
ing some additional computational cost, one can obtain considerable improvement 
in the accuracy of one's approximation, and can sometimes obtain a convergent 
message-passing algorithm when ordinary BP does not converge. 
2 Belief propagation fixed-points are zero gradient points of the Bethe free energy
We assume that we are given an undirected graphical model of N nodes with pair- 
wise potentials (a Markov network). Such a model is very general, as essentially 
any graphical model can be converted into this form. The state of each node $i$ is
denoted by $x_i$, and the joint probability distribution function is given by

$$P(x_1, x_2, \ldots, x_N) = \frac{1}{Z} \prod_{ij} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i) \tag{1}$$

where $\psi_i(x_i)$ is the local "evidence" for node $i$, $\psi_{ij}(x_i, x_j)$ is the compatibility
matrix between nodes $i$ and $j$, and $Z$ is a normalization constant. Note that we are
subsuming any fixed evidence nodes into our definition of $\psi_i(x_i)$.
The standard BP update rules are:

$$m_{ij}(x_j) \leftarrow \alpha \sum_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \tag{2}$$

$$b_i(x_i) \leftarrow \alpha\, \psi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i) \tag{3}$$

where $\alpha$ denotes a normalization constant and $N(i) \setminus j$ means all nodes neighboring
node $i$, except $j$. Here $m_{ij}$ refers to the message that node $i$ sends
to node $j$ and $b_i$ is the belief (approximate marginal posterior probability) at
node $i$, obtained by multiplying all incoming messages to that node by the local
evidence. Similarly, we can define the belief $b_{ij}(x_i, x_j)$ at the pair of nodes
$(x_i, x_j)$ as the product of the local potentials and all messages incoming to the
pair of nodes:

$$b_{ij}(x_i, x_j) = \alpha\, \phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \prod_{l \in N(j) \setminus i} m_{lj}(x_j),$$

where $\phi_{ij}(x_i, x_j) \equiv \psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, \psi_j(x_j)$.
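As a rough illustration of Eqs. 2 and 3, the following Python sketch runs synchronous BP on a small tree and compares the resulting beliefs with brute-force marginals of Eq. 1. The function names (`run_bp`, `exact_marginals`), the dictionary layout, the uniform initialization, the fixed iteration count, and the example potentials are all assumptions made for this sketch; none of them is specified in the text.

```python
import itertools
import math


def run_bp(psi_node, psi_edge, n_iters=50):
    """Synchronous BP (Eqs. 2-3) on a pairwise Markov network.

    psi_node: {i: [psi_i(0), psi_i(1), ...]}
    psi_edge: {(i, j): matrix m with m[xi][xj] = psi_ij(xi, xj)}, one key per edge.
    Returns {i: normalized belief list}.
    """
    nodes = list(psi_node)
    nbrs = {i: [] for i in nodes}
    for (i, j) in psi_edge:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pair(i, j, xi, xj):
        # Look up psi_ij(xi, xj) regardless of the stored edge orientation.
        if (i, j) in psi_edge:
            return psi_edge[(i, j)][xi][xj]
        return psi_edge[(j, i)][xj][xi]

    # m[(i, j)][xj] is the message from node i to node j, initialized uniform.
    m = {(i, j): [1.0 / len(psi_node[j])] * len(psi_node[j])
         for i in nodes for j in nbrs[i]}

    for _ in range(n_iters):
        new_m = {}
        for (i, j) in m:
            out = []
            for xj in range(len(psi_node[j])):
                total = 0.0
                for xi in range(len(psi_node[i])):
                    prod = psi_node[i][xi] * pair(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:  # product over N(i) \ j
                            prod *= m[(k, i)][xi]
                    total += prod
                out.append(total)
            z = sum(out)
            new_m[(i, j)] = [v / z for v in out]
        m = new_m

    beliefs = {}
    for i in nodes:
        b = [psi_node[i][xi] * math.prod(m[(k, i)][xi] for k in nbrs[i])
             for xi in range(len(psi_node[i]))]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs


def exact_marginals(psi_node, psi_edge):
    """Brute-force one-node marginals of Eq. 1 by enumerating every joint state."""
    nodes = list(psi_node)
    marg = {i: [0.0] * len(psi_node[i]) for i in nodes}
    for state in itertools.product(*(range(len(psi_node[i])) for i in nodes)):
        x = dict(zip(nodes, state))
        w = math.prod(psi_node[i][x[i]] for i in nodes)
        w *= math.prod(mat[x[i]][x[j]] for (i, j), mat in psi_edge.items())
        for i in nodes:
            marg[i][x[i]] += w
    z = sum(marg[nodes[0]])
    return {i: [v / z for v in marg[i]] for i in nodes}


# Hypothetical 4-node star-shaped tree with binary states.
psi_node = {0: [1.0, 2.0], 1: [3.0, 1.0], 2: [1.0, 1.5], 3: [0.5, 1.0]}
psi_edge = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
            (0, 2): [[1.0, 3.0], [3.0, 1.0]],
            (0, 3): [[1.5, 1.0], [1.0, 1.5]]}
bp = run_bp(psi_node, psi_edge)
exact = exact_marginals(psi_node, psi_edge)
```

On this tree the two sets of marginals agree to numerical precision; on a graph with loops the same routine would in general return only approximate beliefs, and may not converge.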
Claim 1: Let {mj} be a set of BP messages and let {bj, bi} be the beliefs calculated 
from those messages. Then the beliefs are fixed-points of the BP algorithm if and 
only if they are zero gradient points of the Bethe free energy, FZ: 
y y bij(xi,xj) [ln bij(xi,xj) - ln cpij(xi, xj)] 
ij xi 
- Y(qi- 1) y bi(xi)[in bi(xi)- in i(xi)] 
i 
(4) 
subject to the normalization and marginalization constraints: Y'x bi(xi) = 1, 
Y'x bij(xi, xj) = bj(xj). (qi is the number of neighbors of node i.) 
To prove this claim we add Lagrange multipliers to form a Lagrangian $L$: $\lambda_{ij}(x_j)$
is the multiplier corresponding to the constraint that $b_{ij}(x_i, x_j)$ marginalizes down
to $b_j(x_j)$, and $\gamma_{ij}$, $\gamma_i$ are multipliers corresponding to the normalization constraints.
The equation $\partial L / \partial b_{ij}(x_i, x_j) = 0$ gives: $\ln b_{ij}(x_i, x_j) = \ln(\phi_{ij}(x_i, x_j)) + \lambda_{ij}(x_j) + \lambda_{ji}(x_i) + \gamma_{ij} - 1$.
The equation $\partial L / \partial b_i(x_i) = 0$ gives: $(q_i - 1)(\ln b_i(x_i) + 1) = (q_i - 1) \ln \psi_i(x_i) + \sum_{j \in N(i)} \lambda_{ji}(x_i) + \gamma_i$.
Setting $\lambda_{ij}(x_j) = \ln \prod_{k \in N(j) \setminus i} m_{kj}(x_j)$ and using
the marginalization constraints, we find that the stationary conditions on the
Lagrangian are equivalent to the BP fixed-point conditions. (Empirically, we find 
that stable BP fixed-points correspond to local minima of the Bethe free energy, 
rather than maxima or saddle-points.) 
2.1 Implications 
The fact that $F_\beta(\{b_{ij}, b_i\})$ is bounded below implies that the BP equations always
possess a fixed-point (obtained at the global minimum of $F_\beta$). To our knowledge,
this is the first proof of existence of fixed-points for a general graph with arbitrary 
potentials (see [9] for a complicated proof for a special case). 
The free energy formulation clarifies the relationship to variational approaches which 
also minimize an approximate free energy [3]. For example, the mean field approximation
finds a set of $\{b_i\}$ that minimize:

$$F_{MF}(\{b_i\}) = -\sum_{ij} \sum_{x_i, x_j} b_i(x_i)\, b_j(x_j) \ln \psi_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) \left[\ln b_i(x_i) - \ln \psi_i(x_i)\right] \tag{5}$$

subject to the constraint $\sum_{x_i} b_i(x_i) = 1$.
The BP free energy includes first-order terms $b_i(x_i)$ as well as second-order terms
$b_{ij}(x_i, x_j)$, while the mean field free energy uses only the first-order ones. It is easy
to show that the BP free energy is exact for trees while the mean field one is not.
Furthermore the optimization methods are different: typically $F_{MF}$ is minimized
directly in the primal variables $\{b_i\}$ while $F_\beta$ is minimized using the messages, which
are a combination of the dual variables $\{\lambda_{ij}(x_j)\}$.
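One way to picture minimizing Eq. 5 "directly in the primal variables" is a coordinate-wise scheme that re-optimizes one $b_i$ at a time. The Python sketch below does this for a small loop and records the free energy after every step. The update used, $b_i(x_i) \propto \psi_i(x_i) \exp\big(\sum_{j} \sum_{x_j} b_j(x_j) \ln \psi_{ij}(x_i, x_j)\big)$, is the standard mean field fixed-point form and is an assumption here, since the text does not prescribe a particular minimization scheme. The example potentials and function names are likewise hypothetical.

```python
import itertools
import math

# Hypothetical 3-node loop with binary states.
psi_node = {0: [1.0, 2.0], 1: [3.0, 1.0], 2: [1.0, 1.5]}
psi_edge = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
            (1, 2): [[1.0, 3.0], [3.0, 1.0]],
            (0, 2): [[1.5, 1.0], [1.0, 1.5]]}


def log_psi(i, j, xi, xj):
    """ln psi_ij(xi, xj), whichever way the edge is stored."""
    if (i, j) in psi_edge:
        return math.log(psi_edge[(i, j)][xi][xj])
    return math.log(psi_edge[(j, i)][xj][xi])


def mean_field_free_energy(b):
    """Evaluate Eq. 5 for factorized beliefs b = {i: [b_i(0), b_i(1)]}."""
    f = 0.0
    for (i, j) in psi_edge:
        for xi, xj in itertools.product(range(2), repeat=2):
            f -= b[i][xi] * b[j][xj] * log_psi(i, j, xi, xj)
    for i, psi in psi_node.items():
        for xi in range(2):
            f += b[i][xi] * (math.log(b[i][xi]) - math.log(psi[xi]))
    return f


def update_node(b, i):
    """Minimize Eq. 5 over b_i alone, holding the other beliefs fixed."""
    nbrs = [j for e in psi_edge for j in e if i in e and j != i]
    logits = [math.log(psi_node[i][xi])
              + sum(b[j][xj] * log_psi(i, j, xi, xj)
                    for j in nbrs for xj in range(2))
              for xi in range(2)]
    top = max(logits)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in logits]
    b[i] = [v / sum(w) for v in w]


b = {i: [0.5, 0.5] for i in psi_node}
history = [mean_field_free_energy(b)]
for _ in range(20):
    for i in psi_node:
        update_node(b, i)
        history.append(mean_field_free_energy(b))

# Exact partition function by enumeration, for comparison.
Z = sum(math.prod(psi_node[i][x[i]] for i in psi_node)
        * math.prod(m[x[i]][x[j]] for (i, j), m in psi_edge.items())
        for x in itertools.product(range(2), repeat=3))
```

Each coordinate step cannot increase `history`, and the final value stays above `-math.log(Z)`, which is the usual variational bound for a fully factorized approximation.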
Kabashima and Saad [4] have previously pointed out the correspondence between 
BP and the Bethe approximation (expressed using the TAP formalism) for some 
specific graphical models with random disorder. Our proof answers in the affirma- 
tive their question about whether there is a "deep general link between the two 
methods." [4] 
3 Kikuchi Approximations to the Gibbs Free Energy 
The Bethe approximation, for which the energy and entropy are approximated by 
terms that involve at most pairs of nodes, is the simplest version of the Kikuchi 
"cluster variational method." [5, 10] In a general Kikuchi approximation, the free 
energy is approximated as a sum of the free energies of basic clusters of nodes, 
minus the free energy of over-counted cluster intersections, minus the free energy 
of the over-counted intersections of intersections, and so on. 
Let R be a set of regions that include some chosen basic clusters of nodes, their 
intersections, the intersections of the intersections, and so on. The choice of basic 
clusters determines the Kikuchi approximation; for the Bethe approximation, the
basic clusters consist of all linked pairs of nodes. Let $x_r$ be the state of the nodes
in region $r$ and $b_r(x_r)$ be the "belief" in $x_r$. We define the energy of a region by
$E_r(x_r) \equiv -\ln \prod_{ij} \psi_{ij}(x_i, x_j) - \ln \prod_i \psi_i(x_i) \equiv -\ln \psi_r(x_r)$, where the products are
over all interactions contained within the region $r$. For models with higher than
pair-wise interactions, the region energy is generalized to include those interactions
as well.
The Kikuchi free energy is

$$F_K = \sum_{r \in R} c_r \left( \sum_{x_r} b_r(x_r)\, E_r(x_r) + \sum_{x_r} b_r(x_r) \ln b_r(x_r) \right) \tag{6}$$

where $c_r$ is the over-counting number of region $r$, defined by: $c_r = 1 - \sum_{s \in \mathrm{super}(r)} c_s$
where $\mathrm{super}(r)$ is the set of all super-regions of $r$. For the largest regions in $R$,
$c_r = 1$. The belief $b_r(x_r)$ in region $r$ has several constraints: it must sum to one
and be consistent with the beliefs in regions which intersect with r. In general, 
increasing the size of the basic clusters improves the approximation one obtains by 
minimizing the Kikuchi free energy. 
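The recursive definition of the over-counting numbers is easy to mechanize. The Python sketch below builds the region set $R$ by closing a list of basic clusters under intersection and then assigns $c_r = 1 - \sum_{s \in \mathrm{super}(r)} c_s$, visiting the largest regions first. The function names and the two example cluster choices (four 2x2 plaquettes on a 3x3 grid, and the linked pairs of a 3-node chain) are assumptions chosen for illustration.

```python
from itertools import combinations


def build_regions(basic_clusters):
    """Close the basic clusters under pairwise intersection to obtain R."""
    regions = {frozenset(c) for c in basic_clusters}
    while True:
        new = {a & b for a, b in combinations(regions, 2) if a & b} - regions
        if not new:
            return regions
        regions |= new


def overcounting_numbers(regions):
    """c_r = 1 - (sum of c_s over all strict super-regions s of r)."""
    c = {}
    # Visit larger regions first so every super-region is already assigned.
    for r in sorted(regions, key=len, reverse=True):
        c[r] = 1 - sum(c[s] for s in c if r < s)
    return c


# Four overlapping 2x2 plaquettes on a 3x3 grid, nodes numbered row by row 0..8.
squares = [{0, 1, 3, 4}, {1, 2, 4, 5}, {3, 4, 6, 7}, {4, 5, 7, 8}]
c_square = overcounting_numbers(build_regions(squares))

# Bethe case: the basic clusters are the linked pairs of the chain 0 - 1 - 2.
c_bethe = overcounting_numbers(build_regions([{0, 1}, {1, 2}]))
```

In the plaquette example the squares get $c_r = 1$, the shared edges $c_r = -1$, and the central node $c_r = 1$. In the Bethe example the middle node gets $c_r = 1 - q_i = -1$, matching the $-(q_i - 1)$ weight in the Bethe free energy; the end nodes, with $q_i = 1$, would carry weight zero and do not appear as intersections.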
4 Generalized belief propagation (GBP) 
Minimizing the Kikuchi free energy subject to the constraints on the beliefs is 
not simple. Nearly all applications of the Kikuchi approximation in the physics 
literature exploit symmetries in the underlying physical system and the choice of 
clusters to reduce the number of equations that need to be solved from $O(N)$ to
$O(1)$. But just as the Bethe free energy can be minimized by the BP algorithm, we
introduce a class of analogous generalized belief propagation (GBP) algorithms that 
minimize an arbitrary Kikuchi free energy. These algorithms represent an advance 
in physics, in that they open the way to the exploitation of Kikuchi approximations 
for inhomogeneous physical systems. 
There are in fact many possible GBP algorithms which all correspond to the same 
Kikuchi approximation. We present a "canonical" GBP algorithm which has the 
nice property of reducing to ordinary BP at the Bethe level. We introduce messages 
$m_{rs}(x_s)$ between all regions $r$ and their "direct sub-regions" $s$. (Define the set
$\mathrm{sub}_d(r)$ of direct sub-regions of $r$ to be those regions that are sub-regions of $r$
but have no super-regions that are also sub-regions of $r$, and similarly for the set
$\mathrm{super}_d(r)$ of "direct super-regions.") It is helpful to think of this as a message
from those nodes in $r$ but not in $s$ (which we denote by $r \setminus s$) to the nodes in $s$.
Intuitively, we want messages to propagate information that lies outside of a region
into it. Thus, for a given region $r$, we want the belief $b_r(x_r)$ to depend on exactly
those messages $m_{r's'}$ that start outside of the region $r$ and go into the region $r$. We
define this set of messages $M(r)$ to be those messages $m_{r's'}(x_{s'})$ such that region
$r' \setminus s'$ has no nodes in common with region $r$, and such that region $s'$ is a sub-region
of $r$ or the same as region $r$. We also define the set $M(r, s)$ of messages to be all
those messages that start in a sub-region of $r$ and also belong to $M(s)$, and we
define $M(r) \setminus M(s)$ to be those messages that are in $M(r)$ but not in $M(s)$.
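The definitions of direct sub-regions and of the message set $M(r)$ can be checked mechanically on a concrete region set. The Python sketch below uses the square lattice with 2x2 plaquettes as basic clusters, so the regions are squares, edges, and single nodes. The lattice size, the periodic boundaries, and all function names are assumptions made for this illustration.

```python
from itertools import product

N = 4  # hypothetical lattice size; periodic (toroidal) boundaries


def node(i, j):
    return (i % N, j % N)


squares, edges, nodes = set(), set(), set()
for i, j in product(range(N), repeat=2):
    nodes.add(frozenset([node(i, j)]))
    edges.add(frozenset([node(i, j), node(i + 1, j)]))
    edges.add(frozenset([node(i, j), node(i, j + 1)]))
    squares.add(frozenset([node(i, j), node(i + 1, j),
                           node(i, j + 1), node(i + 1, j + 1)]))
regions = squares | edges | nodes


def direct_subregions(r):
    """Sub-regions of r that have no super-region which is also a sub-region of r."""
    subs = [s for s in regions if s < r]
    return [s for s in subs if not any(s < t for t in subs)]


# One message m_{rs} for every region r and each of its direct sub-regions s.
messages = [(r, s) for r in regions for s in direct_subregions(r)]


def M(r):
    """Messages m_{r's'} with r' minus s' disjoint from r, and s' inside or equal to r."""
    return [(rp, sp) for (rp, sp) in messages
            if not ((rp - sp) & r) and sp <= r]


a_node = frozenset([node(0, 0)])
an_edge = frozenset([node(0, 0), node(1, 0)])
a_square = frozenset([node(0, 0), node(1, 0), node(0, 1), node(1, 1)])
counts = {name: len(M(r)) for name, r in
          [("node", a_node), ("edge", an_edge), ("square", a_square)]}
```

With these definitions a node receives 4 messages, an edge 8 (six edge-to-node and two square-to-edge), and a square 12 (eight edge-to-node and four square-to-edge).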
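The canonical generalized belief propagation update rules are:

$$m_{rs} \leftarrow \alpha \left[ \sum_{x_{r \setminus s}} \psi_{r \setminus s}(x_{r \setminus s}) \prod_{m_{r''s''} \in M(r) \setminus M(s)} m_{r''s''} \right] \Big/ \prod_{m_{r's'} \in M(r, s)} m_{r's'} \tag{7}$$

$$b_r \leftarrow \alpha\, \psi_r(x_r) \prod_{m_{r's'} \in M(r)} m_{r's'} \tag{8}$$

where $\psi_{r \setminus s}$ collects the potentials contained in region $r$ but not in region $s$, and
where for brevity we have suppressed the functional dependences of the beliefs and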
messages. The messages are updated starting with the messages into the smallest 
regions first. One can then use the newly computed messages in the product over 
$M(r, s)$ of the message-update rule. Empirically, this helps convergence.
Claim 2: Let $\{m_{rs}(x_s)\}$ be a set of canonical GBP messages and let $\{b_r(x_r)\}$ be
the beliefs calculated from those messages. Then the beliefs are fixed-points of
the canonical GBP algorithm if and only if they are zero gradient points of the
constrained Kikuchi free energy $F_K$.
We prove this claim by adding Lagrange multipliers: $\gamma_r$ to enforce the normalization
of $b_r$ and $\lambda_{rs}(x_s)$ to enforce the consistency of each region $r$ with all of
its direct sub-regions $s$. This set of consistency constraints is actually more than
sufficient, but there is no harm in adding extra constraints. We then rotate to
another set of Lagrange multipliers $\mu_{rs}(x_s)$ of equal dimensionality which enforce a
linear combination of the original constraints: $\mu_{rs}(x_s)$ enforces all those constraints
involving marginalizations by all direct super-regions $r'$ of $s$ into $s$ except that of
region $r$ itself. The rotation matrix is in a block form which can be guaranteed
to be full rank. We can then show that the $\mu_{rs}(x_s)$ constraints can be written
in the form $\mu_{rs}(x_s) \sum_{r' \in R(\mu_{rs})} c_{r'} \sum_{x_{r'}} b(x_{r'})$ where $R(\mu_{rs})$ is the set of all regions
which receive the message $\mu_{rs}$ in the belief update rule of the canonical algorithm.
We then re-arrange the sum over all $\mu$'s into a sum over all regions, which has
the form $\sum_{r \in R} c_r \sum_{x_r} b_r(x_r) \sum_{\mu_{r's'} \in M_\mu(r)} \mu_{r's'}(x_{s'})$. ($M_\mu(r)$ is a set of $\mu_{r's'}$ in one-to-one
correspondence with the $m_{r's'}$ in $M(r)$.) Finally, we differentiate the Kikuchi
free energy with respect to $b_r(x_r)$, and identify $\mu_{rs}(x_s) = \ln m_{rs}(x_s)$ to obtain the
canonical GBP belief update rules, Eq. 8. Using the belief update rules in the 
marginalization constraints, we obtain the canonical GBP message update rules, 
Eq. 7. 
It is clear from this proof outline that other GBP message passing algorithms which 
are equivalent to the Kikuchi approximation exist. If one writes any set of constraints
which are sufficient to ensure the consistency of all Kikuchi regions, one can
associate the exponentiated Lagrange multipliers of those constraints with a set of 
messages. 
The GBP algorithms we have described solve exactly those networks which have 
the topology of a tree of basic clusters. This is reminiscent of Pearl's method of 
clustering [8], wherein one groups clusters of nodes into "super-nodes,"
and then applies a belief propagation method to the equivalent super-node lattice. 
We can show that the clustering method, using Kikuchi clusters as super-nodes, 
also gives results equivalent to the Kikuchi approximation for those lattices and 
cluster choices where there are no intersections between the intersections of the 
Kikuchi basic clusters. For those networks and cluster choices which do not obey this
condition (a simple example that we discuss below is the square lattice with clusters
that consist of all square plaquettes of four nodes), Pearl's clustering method must
be modified by adding additional update conditions to agree with GBP algorithms
and the Kikuchi approximation.
5 Application to Specific Lattices 
We illustrate the canonical GBP algorithm for the Kikuchi approximation of overlapping
4-node clusters on a square lattice of nodes. Figure 1 (a), (b), (c) illustrate
the beliefs at a node, pair of nodes, and at a cluster of 4 nodes, in terms of messages 
propagated in the network. Vectors are the single index messages also used in ordi- 
nary BP. Vectors with line segments indicate the double-indexed messages arising 
from the Kikuchi approximation used here. These can be thought of as correction 
terms accounting for correlations between messages that ordinary BP treats as in- 
dependent. (For comparison, Fig. 1 (d), (e), (f) shows the corresponding marginal 
computations for the triangular lattice with all triangles chosen as the basic Kikuchi 
clusters). 
We find the message update rules by equating marginalizations of Fig. 1 (b) and 
(c) with the beliefs in Fig. 1 (a) and (b), respectively. Figure 2 (a) and (b) show 
(graphically) the resulting fixed point equations. The update rule (a) is like that for 
ordinary BP, with the addition of two double-indexed messages. The update rule 
for the double-indexed messages involves division by the newly-computed single- 
indexed messages. Fixed points of these message update equations give beliefs that 
are stationary points (empirically minima) of the corresponding Kikuchi approxi- 
mation to the free energy. 
Figure 1: Marginal probabilities in terms of the node links and GBP messages.
For (a) node, (b) line, (c) square cluster, using a Kikuchi approximation
with 4-node clusters on a square lattice. E.g., (b) depicts
(a special case of Eq. 8, written here using node labels): $b_{ab}(x_a, x_b) = \alpha\, \psi_{ab}(x_a, x_b)\, \psi_a(x_a)\, \psi_b(x_b)$
multiplied by the messages $M$ drawn in the panel, where super- and subscripts
indicate which nodes message $M$ goes from and to. (d), (e), (f): Marginal
probabilities for triangular lattice with 3-node Kikuchi clusters.
Figure 2: Graphical depiction of message update equations (Eq. 7; marginalize
over nodes shown unfilled) for GBP using overlapping 4-node Kikuchi
clusters. (a) Update equation for the single-index messages: the message into
node $a$ is $\alpha \sum_{x_b} \psi_b(x_b)\, \psi_{ab}(x_a, x_b)$ multiplied by the messages $M$ drawn in the
panel. (b) Update equation for double-indexed messages (involves a division by
the single-index messages on the left hand side).
6 Experimental Results 
Ordinary BP is expected to perform relatively poorly for networks with many tight 
loops, conflicting interactions, and weak evidence. We constructed such a network, 
known in the physics literature as the square lattice Ising spin glass in a random 
magnetic field. The nodes are on a square lattice, with nearest neighbor nodes 
connected by a compatibility matrix of the form

$$\psi_{ij} = \begin{pmatrix} \exp(J_{ij}) & \exp(-J_{ij}) \\ \exp(-J_{ij}) & \exp(J_{ij}) \end{pmatrix}$$

and local evidence vectors of the form $\psi_i = (\exp(h_i); \exp(-h_i))$. To instantiate a
particular network, the $J_{ij}$ and $h_i$ parameters are chosen randomly and independently
from zero-mean Gaussian probability distributions with standard deviations
$J$ and $h$ respectively.
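A network of this kind could be instantiated along the following lines. This is a minimal Python sketch: the function name `make_spin_glass`, the dictionary layout keyed by lattice coordinates, the fixed random seed, and the use of toroidal wrap-around links are assumptions of the sketch rather than details of the original experiments.

```python
import math
import random


def make_spin_glass(n, J, h, seed=0):
    """Random potentials for an n-by-n Ising spin glass with toroidal boundaries.

    J and h are the standard deviations of the zero-mean Gaussian couplings
    J_ij and fields h_i. Returns (psi_node, psi_edge) keyed by node coordinates.
    """
    rng = random.Random(seed)  # the seed is an illustrative choice
    psi_node, psi_edge = {}, {}
    for i in range(n):
        for j in range(n):
            h_i = rng.gauss(0.0, h)
            psi_node[(i, j)] = [math.exp(h_i), math.exp(-h_i)]
            # Link each node to its lower and right neighbor, wrapping around.
            for nbr in (((i + 1) % n, j), (i, (j + 1) % n)):
                J_ij = rng.gauss(0.0, J)
                psi_edge[((i, j), nbr)] = [
                    [math.exp(J_ij), math.exp(-J_ij)],
                    [math.exp(-J_ij), math.exp(J_ij)],
                ]
    return psi_node, psi_edge


psi_node, psi_edge = make_spin_glass(n=10, J=1.0, h=0.1)
```

For `n = 10` this yields 100 evidence vectors and 200 compatibility matrices, one per nearest-neighbor link of the torus.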
The following results are for n by n lattices with toroidal boundary conditions and 
with J = 1, and h = 0.1. This model is designed to show off the weaknesses 
of ordinary BP, which performs well for many other networks. Ordinary BP is a 
special case of canonical GBP, so we exploited this to use the same general-purpose 
GBP code for both ordinary BP and canonical GBP using overlapping square four- 
node clusters, thus making computational cost comparisons reasonable. We started 
with randomized messages and only stepped half-way towards the computed values 
of the messages at each iteration in order to help convergence. We found that 
canonical GBP took about twice as long as ordinary BP per iteration, but would 
typically reach a given level of convergence in many fewer iterations. In fact, for 
the majority of the dozens of samples that we looked at, BP did not converge at 
all, while canonical GBP always converged for this model and always to accurate 
answers. (We found that for the zero-field 3-dimensional spin glass with toroidal 
boundary conditions, which is an even more difficult model, canonical GBP with 
2x2x2 cubic clusters would also fail to converge). 
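The half-way stepping mentioned above can be sketched as a damped update applied to each message after it is recomputed. In this Python sketch the function name `damped_step`, the linear interpolation of message entries, and the renormalization are assumptions; the text states only that the messages were moved half-way towards their computed values.

```python
def damped_step(old_msg, new_msg, step=0.5):
    """Move a message part of the way from its old value to the newly computed one.

    step=0.5 corresponds to stepping half-way. The interpolation is done
    linearly on the message entries, which is an assumption of this sketch.
    """
    mixed = [(1.0 - step) * o + step * n for o, n in zip(old_msg, new_msg)]
    total = sum(mixed)
    return [v / total for v in mixed]


old = [0.5, 0.5]
computed = [0.9, 0.1]
half_way = damped_step(old, computed)
```

Setting `step=1.0` recovers the undamped update, so the same routine can serve both damped and undamped runs.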
For n = 20 or larger, it was difficult to make comparisons with any other algorithm, 
because ordinary BP did not converge and Monte Carlo simulations suffered from 
extremely slow equilibration. However, generalized belief propagation converged 
reasonably rapidly to plausible-looking beliefs. For small n, we could compare with 
exact results, by using Pearl's clustering method on a chain of n by 1 super-nodes. 
To give a qualitative feel for the results, we compare ordinary BP, canonical GBP, 
and the exact results for an n = 10 lattice where ordinary BP did converge. Listing 
the values of the one-node marginal probabilities in one of the rows, we find that 
ordinary BP gives (.0043807, .74502, .32866, .62190, .37745, .41243, .57842, .74555, 
.85315, .99632), canonical GBP gives (.40255, .54115, .49184, .54232, .44812, .48014, 
.51501, .57693, .57710, .59757), and the exact results were (.40131, .54038, .48923, 
.54506, .44537, .47856, .51686, .58108, .57791, .59881). 
References 
[1] W. T. Freeman and E. Pasztor. Learning low-level vision. In 7th Intl. Conf. 
Computer Vision, pages 1182-1189, 1999. 
[2] B. J. Frey. Graphical Models for Machine Learning and Digital Communica- 
tion. MIT Press, 1998. 
[3] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to 
variational methods for graphical models. In M. Jordan, editor, Learning in 
Graphical Models. MIT Press, 1998. 
[4] Y. Kabashima and D. Saad. Belief propagation vs. TAP for decoding corrupted 
messages. Euro. Phys. Lett., 44:668, 1998. 
[5] R. Kikuchi. Phys. Rev., 81:988, 1951. 
[6] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of 
Pearl's 'belief propagation' algorithm. IEEE J. on Sel. Areas in Comm.,
16(2):140-152, 1998. 
[7] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approxi- 
mate inference: an empirical study. In Proc. Uncertainty in AI, 1999. 
[8] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible 
inference. Morgan Kaufmann, 1988. 
[9] T. J. Richardson. The geometry of turbo-decoding dynamics. IEEE Trans. 
Info. Theory, 46(1):9-23, Jan. 2000. 
[10] Special issue on Kikuchi methods. Progr. Theor. Phys. Suppl., vol. 115, 1994. 
