An Information Maximization Approach to 
Overcomplete and Recurrent Representations 
Oren Shriki and Haim Sompolinsky 
Racah Institute of Physics and 
Center for Neural Computation 
Hebrew University 
Jerusalem, 91904, Israel 
Daniel D. Lee 
Bell Laboratories 
Lucent Technologies 
Murray Hill, NJ 07974 
Abstract 
The principle of maximizing mutual information is applied to learning 
overcomplete and recurrent representations. The underlying model con- 
sists of a network of input units driving a larger number of output units 
with recurrent interactions. In the limit of zero noise, the network is de- 
terministic and the mutual information can be related to the entropy of 
the output units. Maximizing this entropy with respect to both the feed- 
forward connections as well as the recurrent interactions results in simple 
learning rules for both sets of parameters. The conventional independent 
components (ICA) learning algorithm can be recovered as a special case 
where there is an equal number of output units and no recurrent con- 
nections. The application of these new learning rules is illustrated on a 
simple two-dimensional input example. 
1 Introduction 
Many unsupervised learning algorithms such as principal component analysis, vector quan- 
tization, self-organizing feature maps, and others use the principle of minimizing recon- 
struction error to learn appropriate features from multivariate data [1, 2]. Independent 
components analysis (ICA) can similarly be understood as maximizing the likelihood of 
the data under a non-Gaussian generative model, and thus is related to minimizing a re- 
construction cost [3, 4, 5]. On the other hand, the same ICA algorithm can also be derived 
without regard to a particular generative model by maximizing the mutual information be- 
tween the data and a nonlinearly transformed version of the data [6]. This principle of 
information maximization has also been previously applied to explain optimal properties 
for single units, linear networks, and symplectic transformations [7, 8, 9]. 
In these proceedings, we show how the principle of maximizing mutual information can 
be generalized to overcomplete as well as recurrent representations. In the limit of zero 
noise, we derive gradient descent learning rules for both the feedforward and recurrent 
weights. Finally, we show the application of these learning rules to some simple illustrative 
examples. 
Recurrent 
Weights 
M output variables 
Wll 
Feedforward 
Weights 
N input variables 
Figure 1: Network diagram of an overcomplete, recurrent representation. x are input data 
which influence the output signals s through feedforward connections W. The signals s 
also interact with each other through the recurrent interactions K. 
2 Information Maximization 
The "Infomax" formulation of ICA considers the problem of maximizing the mutual in- 
formation between N-dimensional data observations {x} which are input to a network 
resulting in N-dimensional output signals {s} [6]. Here, we consider the general problem 
where the signals s are M-dimensional with M _> N. Thus, the representation is overcom- 
plete because there are more signal components than data components. We also consider 
the situation where a signal component si can influence another component sj through a 
recurrent interaction Kji. As a network, this is diagrammed in Fig. 1 with the feedfor- 
ward connections described by the M x N matrix I4/and the recurrent connections by the 
M x M matrix K. The network response s is a deterministic function of the input x: 
) 
si = g Wijxj q- E KikSk (1) 
j=l k=i 
where g is some nonlinear squashing function. In this case, the mutual information between 
the inputs x and outputs s is functionally only dependent on the entropy of the outputs: 
I(s, x) = H(s)- H(slx )  H(s). 
(2) 
The distribution of s is a N-dimensional manifold embedded in a M-dimensional vector 
space and nominally has a negatively divergent entropy. However, as shown in Appendix 
1, the probability density of s can be related to the input distribution via the relation: 
P(x) 
P(s) (3) 
/det(xTx) 
where the susceptibility (or Jacobian) matrix X is defined as: 
Xij = Oxj' (4) 
This result can be understood in terms of the singular value decomposition (SVD) of the 
matrix X. The transformation performed by X can be decomposed into a series of three 
transformations: an orthogonal transformation that rotates the axes, a diagonal transfor- 
mation that scales each axis, followed by another orthogonal transformation. A volume 
element in the input space is mapped onto a volume element in the output space, and its 
volume change is described by the diagonal scaling operation. This scale change is given 
by the product of the square roots of the eigenvalues of xTx. Thus, the relationship be- 
tween the probability distribution in the input and output spaces includes the proportionality 
factor, x/'det(xtrX), as formally derived in Appendix 1. 
We now get the following expression for the entropy of the outputs: 
/ (P(x) ) I (log det(xTx) ) + H(x), (5) 
H(s) ,- - dxP(x) log k x/'det(X2rX) =  
where the brackets indicate averaging over the input distribution. 
3 Learning rules 
From Eq. (5), we see that minimizing the following cost function: 
1 
E = Tr (log(xTx)), (6) 
is equivalent to maximizing the mutual information. We first note that the susceptibility X 
satisfies the following recursion relation: 
)ij = g ' (Wij 'q- E Kik)kj ) = (GW + GKx)ij, (7) 
k 
where O,j = d,jg and g -- g' (Ej W, jzj + Ek K, ksk). 
Solving for X in Eq. (7) yields the result: 
X = ( G- - K) -W = OW, (8) 
where - -- G - - K. ij can be interpreted as the sensitivity in the recurrent network 
of the ith unit's output to changes in the total input of the jth unit. 
We next derive the learning rules for the network parameters using gradient descent, as 
shown in detail in Appendix 2. The resulting expression for the learning rule for the feed- 
forward weights is: 
OE 
AW = -VW = q (FT + T?xtr) (9) 
where q is the learning rate, the matrix F is defined as 
r = (XTX)-XXT (10) 
and the vector 3' is given by 
(11) 
= (xr). (gl) 3 . 
Multiplying the gradient in Eq. (9) by the matrix (WW ) yields an expression analogous 
to the "natural" gradient learning rule [ 10]: 
AW = ,W (Z + (XY7xY)). (12) 
Similarly, the learning role for the recuent interactions is 
OE 
aK = -VOK = V(xF)T + TTsT)' (13) 
In the case when there me equal numbers of input and output units, M = N, and there 
are no recuent interactions, K = 0, most of the previous expressions simplify. The 
susceptibility matrix X is diagonal,  = G, and F = W -. Substituting back into Eq. (9) 
for the learning rule for W results in the update rule: 
aw =. + 
tit t 
where zi = &/gi. Thus, the well-known Infomax ICA learning rule is recovered as a 
special case of Eq. (9) [6]. 
(a) 
(b) 
(c) 
Figure 2: Results of fitting 3 filters to a 2-dimensional hexagon distribution with 10000 
sample points. 
4 Examples 
We now apply the preceding learning algorithms to a simple two-dimensional (N = 2) 
input example. Each input point is generated by a linear combination of three (two- 
dimensional) unit vectors with angles of 0 , 120  and 240 . The coefficients are taken 
from a uniform distribution on the unit interval. The resulting distribution has the shape 
of a unit hexagon, which is slightly more dense close to the origin than at the boundaries. 
Samples of the input distribution are shown in Fig. 2. The second order cross correlations 
vanish, so that all the structure in the data is described only by higher order correlations. 
We fix the sigmoidal nonlinearity to be #(x) = tanh(x). 
4.1 Feedforward weights 
A set of M = 3 overcomplete filters for W are learned by applying the update rule in 
Eq. (9) to random normalized initial conditions while keeping the recurrent interactions 
fixed at K = 0. The length of the rows of W were constrained to be identical so that the 
filters are projections along certain directions in the two-dimensional space. The algorithm 
converged after about 20 iterations. Examples of the resulting learned filters are shown 
by plotting the rows of W as vectors in Fig. 2. As shown in the figure, there are several 
different local minimum solutions. If the lengths of the rows of W are left unconstrained, 
slight deviations from these solutions occur, but relative orientation differences of 60  or 
120  between the various filters are preserved. 
4.2 Recurrent interactions 
To investigate the effect of recurrent interactions on the representation, we fixed the feed- 
forward weights in W to point in the directions shown in Fig. 2(a), and learned the optimal 
recurrent interactions K using Eq. (13). Depending upon the length of the rows of W 
which scaled the input patterns, different optimal values are seen for the recurrent connec- 
tions. This is shown in Fig. 3 by plotting the value of the cost function against the strength 
of the uniform recurrent interaction. For small scaled inputs, the optimal recurrent strength 
is negative which effectively amplifies the output signals since the 3 signals are negatively 
correlated. With large scaled inputs, the optimal recurrent strength is positive which tend to 
decrease the outputs. Thus, in this example, optimizing the recurrent connections performs 
gain control on the inputs. 
3 
2.5 
2 
1.5 
0.5 
0 
-0.5 
-1 
'5 -1' '5 
 -0.'5 0 0'.5 1. 
Figure 3: Effect of adding recurrent interactions to the representation. The cost function 
is plotted as a function of the recurrent interaction strength, for two different input scaling 
parameters. 
5 Discussion 
The learned feedforward weights are similar to the results of another ICA model that can 
learn overcomplete representations [ 11]. Our algorithm, however, does not need to perform 
approximate inference on a generative model. Instead, it directly maximizes the mutual in- 
formation between the outputs and inputs of a nonlinear network. Our method also has 
the advantage of being able to learn recurrent connections that can enhance the representa- 
tional power of the network. We also note that this approach can be easily generalized to 
undercomplete representations by simply changing the order of the matrix product in the 
cost function. However, more work still needs to be done in order to understand technical 
issues regarding speed of convergence and local minima in larger applications. Possible 
extensions of this work would be to optimize the nonlinearity that is used, or to adaptively 
change the number of output units to best match the input distribution. 
We acknowledge the financial support of Bell Laboratories, Lucent Technologies, and the 
US-Israel Binational Science Foundation. 
6 Appendix 1: Relationship between input and output distributions 
In general, the relation between the input and output distributions is given by 
P(s) = / dxP(x)P(slx ). 
(15) 
Since we use a deterministic mapping, the conditional distribution of the response given 
the input is given by P(slx ) = 8(s - #(Wx + Ks)). By adding independent Gaussian 
noise to the responses of the output units and considering the limit where the variance of 
the noise goes to zero, we can write this term as 
P(slx) lim 1 11s_a(Wx+Ks)112 
= e- 2--x- (16) 
a-0 
The output space can be partitioned into those points which belong to the image of the 
input space, and those which are not. For points outside the image of the input space, 
?(s) = 0. Consider a point s inside the image. This means that there exists x0 such that 
s = g(Wxo + Ks). For small A, we can expand g(Wx + Ks) - s _ X6x, where X is 
defined in Eq. (4), and 6x = x - x0. We then get 
-  fix fix 
P(slx) = lim 
a-0 (2r/X2)N/2 e 
I - [ fix fix 
= lira (17) 
The expression in the sque brackets is a delta function in x ound xo. Using Eq. (1S) we 
finally get 
P(s) = P(x) (s) (8) 
where the characteristic function Q(s) is 1 if s belongs to the image of the input space 
and is zero otherwise. Note that for the case when X is a square matrix (M = N), this 
expression reduces to the relation P(s) = P(x)/I det(x)l. 
7 Appendix 2: Derivation of the learning rules 
To derive the appropriate learning rules, we need to calculate the derivatives of E with 
respect to some set of parameters A. In general, these derivatives are obtained from the 
expression: 
Tr (xTx) -0(XTX) =-Tr (xTx)-X T0X 
0X - - 0X  . (19) 
7.1 Feedforward weights 
In order to derive the learning rule for the weights W, we first calculate 
OXab ( OWcb O(I) ac ) O(I) ac 
O-t m =   a c O----tt  + O W  =  a t 6 m +  O W  . (20) 
From the definition of , we see that: 
0 OG  
OWlm --  ai OWjc (21) 
and 
OG  & Og g' Os 
OW, m = --I  Om = --& 1:5  07m' 
where g' -- g" (yj Wijxj + Yk Kiks). 
The derivatives of s also satisfy a recursion relation similar to Eq. (7): 
OWlm ---- gi ' 5ilXm q- Y Kij OWlm ' 
3 
which has the solution: 
(22) 
(23) 
-- : (I)ilX m. (24) 
OWlm 
Putting all these results together in Eq. (19) and taking the trace, we get the gradient descent 
rule in Eq. (9). 
7.2 Recurrent interactions 
To derive the learning rules for the recurrent weights K, we first calculate the derivatives 
of Xat, with respect to Ktm: 
OXab 
OIlrn 
(25) 
From the definition of , we obtain: 
OKlm 
(g)20Klm 
5ilSjm. 
(26) 
The derivatives of g are obtained from the following relations: 
Ogl gi' 
OKlm g OKlm 
and 
(27) 
-- ---- ilSm . (28) 
OKlm 
which results from a recursion relation similar to Eq. (23). Finally, after combining these 
results and calculating the trace, we get the gradient descent learning rule in Eq. (13). 
References 
[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag. 
[2] Haykin, S (1999). Neural networks: a comprehensive foundation. 2rid ed., Prentice-Hall, Upper 
Saddle River, NJ. 
[3] Jutten, C & Herault, J (1991). Blind separation of sources, part I: An adaptive algorithm based 
on neuromimetic architecture. Signal Processing 24, 1-10. 
[4] Hinton, G & Ghahramani, Z (1997). Generafive models for discovering sparse distributed rep- 
resentations. Philosophical Transactions Royal Society B 352, 1177-1190. 
[5] Pearlmutter, B & Parra, L (1996). A context-sensitive generalization of ICA. In ICONIP'96, 
151-157. 
[6] Bell, AJ & Sejnowski, TJ (1995). An information maximization approach to blind separation 
and blind deconvolution. Neural Cornput. 7, 1129-1159. 
[7] Barlow, HB (1989). Unsupervised learning. Neural Cornput. 1,295-311. 
[8] Linsker, R (1992). Local synaptic learning rules suffice to maximize mutual information in a 
linear network. Neural Cornput. 4, 691-702. 
[9] Parra, L, Deco, G, & Miesbach, S (1996). Statistical independence and novelty detection with 
information preserving nonlinear maps. Neural Cornput. 8, 260-269. 
[10] Amari, S, Cichocki, A & Yang, H (1996). A new learning algorithm for blind signal separation. 
Advances in Neural Information Processing Systems 8, 757-763. 
[11] Lewicki, MS & Sejnowski, TJ (2000). Learning overcomplete representations. Neural Compu- 
tation 12 337-365. 
