Algorithms for Non-negative Matrix 
Factorization 
Daniel D. Lee* 
*Bell Laboratories 
Lucent Technologies 
Murray Hill, NJ 07974 
H. Sebastian Seung*†
†Dept. of Brain and Cog. Sci.
Massachusetts Institute of Technology 
Cambridge, MA 02138 
Abstract 
Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 
1 Introduction 
Unsupervised learning algorithms such as principal components analysis and vector quan- 
tization can be understood as factorizing a data matrix subject to different constraints. De- 
pending upon the constraints utilized, the resulting factors can be shown to have very dif- 
ferent representational properties. Principal components analysis enforces only a weak or- 
thogonality constraint, resulting in a very distributed representation that uses cancellations 
to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner- 
take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. 
We have previously shown that nonnegativity is a useful constraint for matrix factorization 
that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are 
learned are used in distributed, yet still sparse combinations to generate expressiveness in 
the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms 
for learning the optimal nonnegative factors from data. 
2 Non-negative matrix factorization 
We formally consider algorithms for solving the following problem: 
Non-negative matrix factorization (NMF) Given a non-negative matrix 
V, find non-negative matrix factors W and H such that: 
$$V \approx WH \tag{1}$$
NMF can be applied to the statistical analysis of multivariate data in the following manner. 
Given a set of multivariate n-dimensional data vectors, the vectors are placed in the
columns of an n × m matrix V, where m is the number of examples in the data set. This
matrix is then approximately factorized into an n × r matrix W and an r × m matrix H.
Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original 
matrix V. This results in a compressed version of the original data matrix. 
What is the significance of the approximation in Eq. (1)? It can be rewritten column by 
column as v ,, Wh, where v and h are the corresponding columns of V and H. In other 
words, each data vector v is approximated by a linear combination of the columns of W, 
weighted by the components of h. Therefore W can be regarded as containing a basis 
that is optimized for the linear approximation of the data in V. Since relatively few basis 
vectors are used to represent many data vectors, good approximation can only be achieved 
if the basis vectors discover structure that is latent in the data. 
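As a small illustration of this column-by-column reading, the following Python sketch uses hypothetical factors W and H (values and sizes chosen only for illustration, not taken from any data set) and checks that each column of WH is the combination of the columns of W weighted by the matching column of H. The helper names `matmul`, `column` and `combine` are assumptions of this sketch.

```python
# Hypothetical small factors, chosen only for illustration (n = 3, r = 2, m = 4).
W = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 2.0]]             # n x r: each column is a basis vector
H = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]   # r x m: each column encodes one data vector


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def column(M, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


def combine(W, h):
    """Linear combination of the columns of W, weighted by the entries of h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


WH = matmul(W, H)  # n x m reconstruction

# Column j of WH equals W applied to column j of H, i.e. v ~ Wh.
columns_match = all(column(WH, j) == combine(W, column(H, j)) for j in range(4))

# Entry counts for a larger hypothetical problem: W and H together hold
# (n + m) * r numbers, against n * m for V.
n, m, r = 100, 500, 10
factor_entries = (n + m) * r
data_entries = n * m
```

W and H together are smaller than V only when (n + m) r < nm, which the larger hypothetical sizes above satisfy.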
The present submission is not about applications of NMF, but focuses instead on the tech- 
nical aspects of finding non-negative matrix factorizations. Of course, other types of ma- 
trix factorizations have been extensively studied in numerical linear algebra, but the non- 
negativity constraint makes much of this previous work inapplicable to the present case 
[8]. 
Here we discuss two algorithms for NMF based on iterative updates of W and H. Because 
these algorithms are easy to implement and their convergence properties are guaranteed, 
we have found them very useful in practical applications. Other algorithms may possibly 
be more efficient in overall computation time, but are more difficult to implement and may 
not generalize to different cost functions. Algorithms similar to ours where only one of the 
factors is adapted have previously been used for the deconvolution of emission tomography 
and astronomical images [9, 10, 11, 12]. 
At each iteration of our algorithms, the new value of W or H is found by multiplying the 
current value by some factor that depends on the quality of the approximation in Eq. (1). We 
prove that the quality of the approximation improves monotonically with the application 
of these multiplicative update rules. In practice, this means that repeated iteration of the 
update rules is guaranteed to converge to a locally optimal matrix factorization. 
3 Cost functions 
To find an approximate factorization V ≈ WH, we first need to define cost functions
that quantify the quality of the approximation. Such a cost function can be constructed
using some measure of distance between two non-negative matrices A and B. One useful
measure is simply the square of the Euclidean distance between A and B [13],
$$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2 \tag{2}$$
This is lower bounded by zero, and clearly vanishes if and only if A = B.
Another useful measure is 
$$D(A\|B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right) \tag{3}$$
Like the Euclidean distance this is also lower bounded by zero, and vanishes if and only
if A = B. But it cannot be called a "distance", because it is not symmetric in A and B,
so we will refer to it as the "divergence" of A from B. It reduces to the Kullback-Leibler
divergence, or relative entropy, when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that A and B can be
regarded as normalized probability distributions.
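The two measures can be compared on a small example. The sketch below is a minimal Python rendering of Eqs. (2) and (3), assuming matrices stored as lists of rows, strictly positive entries in the second argument, and the convention 0 log 0 = 0. The matrices A and B are hypothetical and chosen so that their entries sum to one.

```python
import math


def squared_distance(A, B):
    """Eq. (2): sum of squared entrywise differences."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def divergence(A, B):
    """Eq. (3). Assumes every entry of B is positive and uses 0 log 0 = 0."""
    total = 0.0
    for ra, rb in zip(A, B):
        for a, b in zip(ra, rb):
            if a > 0:
                total += a * math.log(a / b)
            total += b - a
    return total


# Hypothetical matrices whose entries each sum to one.
A = [[0.5, 0.3], [0.1, 0.1]]
B = [[0.25, 0.25], [0.25, 0.25]]

d_ab = divergence(A, B)
d_ba = divergence(B, A)

# Plain Kullback-Leibler divergence of the normalized entries.
kl_ab = sum(a * math.log(a / b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```

For these normalized inputs `d_ab` agrees with `kl_ab` up to rounding, while `d_ab` and `d_ba` differ, which is the asymmetry noted above.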
We now consider two alternative formulations of NMF as optimization problems: 
Problem 1 Minimize $\|V - WH\|^2$ with respect to W and H, subject to the constraints
W, H ≥ 0.
Problem 2 Minimize $D(V\|WH)$ with respect to W and H, subject to the constraints
W, H ≥ 0.
Although the functions $\|V - WH\|^2$ and $D(V\|WH)$ are convex in W only or H only, they
are not convex in both variables together. Therefore it is unrealistic to expect an algorithm 
to solve Problems 1 and 2 in the sense of finding global minima. However, there are many 
techniques from numerical optimization that can be applied to find local minima. 
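The lack of joint convexity already shows up in the smallest possible case. The sketch below assumes a hypothetical scalar problem (n = m = r = 1, data value v = 1) and tests the midpoint inequality that a convex function would have to satisfy.

```python
def cost(w, h, v=1.0):
    """Squared error for the scalar case n = m = r = 1 with data value v."""
    return (v - w * h) ** 2


# Two nonnegative exact factorizations of v = 1, and their midpoint.
p = (2.0, 0.5)
q = (0.5, 2.0)
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Convexity would require cost(mid) <= average of the endpoint costs,
# i.e. a gap that is zero or negative.
joint_gap = cost(*mid) - 0.5 * (cost(*p) + cost(*q))

# With w held fixed, the cost is a convex quadratic in h alone.
fixed_w_gap = cost(2.0, 0.5) - 0.5 * (cost(2.0, 0.0) + cost(2.0, 1.0))
```

Here `joint_gap` is positive, so the cost is not convex in (w, h) jointly, while `fixed_w_gap` is negative, as convexity in h alone requires.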
Gradient descent is perhaps the simplest technique to implement, but convergence can be 
slow. Other methods such as conjugate gradient have faster convergence, at least in the 
vicinity of local minima, but are more complicated to implement than gradient descent 
[8]. Gradient-based methods also have the disadvantage that their convergence is very
sensitive to the choice of step size, which can be very inconvenient for large applications.
4 Multiplicative update rules 
We have found that the following "multiplicative update rules" are a good compromise 
between speed and ease of implementation for solving Problems 1 and 2. 
Theorem 1 The Euclidean distance $\|V - WH\|$ is nonincreasing under the update rules
$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \tag{4}$$
The Euclidean distance is invariant under these updates if and only if W and H are at a 
stationary point of the distance. 
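A minimal Python sketch of the update rules of Theorem 1 follows. It assumes strictly positive random data and initial factors (so that no denominator vanishes), stores matrices as lists of rows to stay dependency-free, and uses hypothetical sizes and helper names. It records the squared error after each pair of updates so that the nonincreasing behaviour can be inspected.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def scale(X, num, den):
    """Entrywise X * num / den."""
    return [[x * p / q for x, p, q in zip(rx, rp, rq)]
            for rx, rp, rq in zip(X, num, den)]


def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def update_H(V, W, H):
    """H <- H * (W^T V) / (W^T W H), entrywise."""
    Wt = transpose(W)
    return scale(H, matmul(Wt, V), matmul(matmul(Wt, W), H))


def update_W(V, W, H):
    """W <- W * (V H^T) / (W H H^T), entrywise."""
    Ht = transpose(H)
    return scale(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))


def random_matrix(rows, cols, rng):
    # Strictly positive entries keep every denominator above zero.
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, r = 6, 8, 2            # hypothetical problem size
V = random_matrix(n, m, rng)
W = random_matrix(n, r, rng)
H = random_matrix(r, m, rng)

errors = [squared_error(V, W, H)]
for _ in range(100):
    H = update_H(V, W, H)
    W = update_W(V, W, H)
    errors.append(squared_error(V, W, H))
```

Because each update only multiplies by a ratio of nonnegative quantities, W and H stay nonnegative without any explicit projection step.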
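Theorem 2 The divergence $D(V\|WH)$ is nonincreasing under the update rules
$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}} \tag{5}$$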
The divergence is invariant under these updates if and only if W and H are at a stationary 
point of the divergence. 
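The update rules of Theorem 2 can be sketched in the same way. The Python below again assumes strictly positive random inputs, list-of-rows matrices, and hypothetical sizes and helper names, and tracks the divergence across iterations.

```python
import math
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def divergence(V, W, H):
    """D(V || WH); all entries are assumed positive here."""
    WH = matmul(W, H)
    return sum(v * math.log(v / x) - v + x
               for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def ratio(V, W, H):
    """Entrywise V / (WH)."""
    WH = matmul(W, H)
    return [[v / x for v, x in zip(rv, rx)] for rv, rx in zip(V, WH)]


def update_H(V, W, H):
    """H_au <- H_au * [sum_i W_ia V_iu / (WH)_iu] / [sum_k W_ka]."""
    num = matmul(transpose(W), ratio(V, W, H))
    col_sums = [sum(col) for col in zip(*W)]
    return [[h * p / col_sums[a] for h, p in zip(H[a], num[a])]
            for a in range(len(H))]


def update_W(V, W, H):
    """W_ia <- W_ia * [sum_u H_au V_iu / (WH)_iu] / [sum_v H_av]."""
    num = matmul(ratio(V, W, H), transpose(H))
    row_sums = [sum(row) for row in H]
    return [[w * p / s for w, p, s in zip(rw, rp, row_sums)]
            for rw, rp in zip(W, num)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, r = 6, 8, 2            # hypothetical problem size
V = random_matrix(n, m, rng)
W = random_matrix(n, r, rng)
H = random_matrix(r, m, rng)

divergences = [divergence(V, W, H)]
for _ in range(100):
    H = update_H(V, W, H)
    W = update_W(V, W, H)
    divergences.append(divergence(V, W, H))
```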
Proofs of these theorems are given in a later section. For now, we note that each update 
consists of multiplication by a factor. In particular, it is straightforward to see that this 
multiplicative factor is unity when V = WH, so that perfect reconstruction is necessarily 
a fixed point of the update rules. 
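The fixed-point remark can be checked directly. The sketch below assumes hypothetical integer-valued factors, so that V = WH is exact in floating point, and evaluates the multiplicative factor applied to H under both theorems.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical integer-valued factors, so V = WH is exact in floating point.
W = [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
H = [[1.0, 0.0, 2.0], [2.0, 1.0, 1.0]]
V = matmul(W, H)

Wt = transpose(W)

# Factor in the H rule of Theorem 1: (W^T V) / (W^T W H).
num = matmul(Wt, V)
den = matmul(matmul(Wt, W), H)
euclidean_factor = [[p / q for p, q in zip(rp, rq)] for rp, rq in zip(num, den)]

# Factor in the H rule of Theorem 2: sum_i W_ia V_iu / (WH)_iu over sum_k W_ka.
WH = matmul(W, H)
ratio = [[v / x for v, x in zip(rv, rx)] for rv, rx in zip(V, WH)]
col_sums = [sum(col) for col in zip(*W)]
divergence_factor = [[p / s for p in row]
                     for row, s in zip(matmul(Wt, ratio), col_sums)]
```

Every entry of both factor matrices equals one, so the update leaves H unchanged, including the entry of H that is zero.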
5 Multiplicative versus additive update rules 
It is useful to contrast these multiplicative updates with those arising from gradient descent 
[14]. In particular, a simple additive update for H that reduces the squared distance can be 
written as 
Hat, - Hat` + flat` [(wTV)at` - (WTWH)at`] . (6) 
If r/at` are all set equal to some small positive number, this is equivalent to conventional 
gradient descent. As long as this number is sufficiently small, the update should reduce 
IIV-WHII, 
Now if we diagonally rescale the variables and set 
Ha (7) 
]au = (WTWH)au, 
then we obtain the update rule for H that is given in Theorem 1. Note that this rescaling 
results in a multiplicative factor with the positive component of the gradient in the denom- 
inator and the absolute value of the negative component in the numerator of the factor. 
For the divergence, diagonally rescaled gradient descent takes the form 
Hau - Hu + Ou Wi (WH)iu Wi  (8) 
Again, if me are small and positive, mis update should reduce D(VIIWH). If we now 
set 
Hu (9) 
then we obtain the update rule for H that is given in Theorem 2. This rescaling can also
be interpreted as a multiplicative rule with the positive component of the gradient in the
denominator and negative component as the numerator of the multiplicative factor.
Since our choices for $\eta_{a\mu}$ are not small, it may seem that there is no guarantee that such a
rescaled gradient descent should cause the cost function to decrease. Surprisingly, this is
indeed the case as shown in the next section.
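The equivalence between the rescaled additive updates and the multiplicative rules is a short algebraic identity, and it can also be confirmed numerically. The sketch below assumes hypothetical positive matrices V, W and H and compares the additive form with the step sizes of Eqs. (7) and (9) against the multiplicative form of Theorems 1 and 2.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


# Hypothetical positive data and factors.
V = [[1.0, 2.0, 3.0], [2.0, 1.0, 1.0], [0.5, 1.5, 2.5]]
W = [[0.5, 1.0], [1.0, 0.2], [0.3, 0.8]]
H = [[1.0, 0.5, 0.7], [0.4, 1.2, 2.0]]

Wt = transpose(W)
WtV = matmul(Wt, V)
WtWH = matmul(matmul(Wt, W), H)
rank, cols = len(H), len(H[0])

# Euclidean case: additive update with eta = H / (W^T W H).
additive_euclid = [[H[a][u] + H[a][u] / WtWH[a][u] * (WtV[a][u] - WtWH[a][u])
                    for u in range(cols)] for a in range(rank)]
multiplicative_euclid = [[H[a][u] * WtV[a][u] / WtWH[a][u]
                          for u in range(cols)] for a in range(rank)]

# Divergence case: additive update with eta = H / sum_i W_ia.
WH = matmul(W, H)
ratio = [[v / x for v, x in zip(rv, rx)] for rv, rx in zip(V, WH)]
S = matmul(Wt, ratio)                       # sum_i W_ia V_iu / (WH)_iu
col_sums = [sum(col) for col in zip(*W)]    # sum_i W_ia
additive_div = [[H[a][u] + H[a][u] / col_sums[a] * (S[a][u] - col_sums[a])
                 for u in range(cols)] for a in range(rank)]
multiplicative_div = [[H[a][u] * S[a][u] / col_sums[a]
                       for u in range(cols)] for a in range(rank)]

euclid_gap = max_abs_diff(additive_euclid, multiplicative_euclid)
div_gap = max_abs_diff(additive_div, multiplicative_div)
```

Both gaps are at the level of floating-point rounding, since in each case the step size cancels the subtracted term exactly.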
6 Proofs of convergence 
To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used 
in the Expectation-Maximization algorithm [15, 16].
Definition 1 $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions
$$G(h, h') \geq F(h), \qquad G(h, h) = F(h) \tag{10}$$
are satisfied.
The auxiliary function is a useful concept because of the following lemma, which is also 
graphically illustrated in Fig. 1. 
Lemma 1 If G is an auxiliary function, then F is nonincreasing under the update
$$h^{t+1} = \arg\min_h G(h, h^t) \tag{11}$$
Proof: $F(h^{t+1}) \leq G(h^{t+1}, h^t) \leq G(h^t, h^t) = F(h^t)$.
Note that $F(h^{t+1}) = F(h^t)$ only if $h^t$ is a local minimum of $G(h, h^t)$. If the derivatives
of F exist and are continuous in a small neighborhood of $h^t$, this also implies that the
derivatives $\nabla F(h^t) = 0$. Thus, by iterating the update in Eq. (11) we obtain a sequence
of estimates that converge to a local minimum $h_{\min} = \arg\min_h F(h)$ of the objective
function:
$$F(h_{\min}) \leq \ldots \leq F(h^{t+1}) \leq F(h^t) \leq \ldots \leq F(h^2) \leq F(h^1) \leq F(h^0). \tag{12}$$
We will show that by defining the appropriate auxiliary functions $G(h, h^t)$ for both
$\|V - WH\|$ and $D(V\|WH)$, the update rules in Theorems 1 and 2 easily follow from Eq. (11).
,,,,' F(h) 
h t h t+l hmi n h 
Figure 1: Minimizing the auxiliary function G(h, h t) _> F(h) guarantees that F(h t+) _< 
F(h t) for h + = argminn G(h, ht). 
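Lemma 2 If $K(h^t)$ is the diagonal matrix
$$K_{ab}(h^t) = \delta_{ab} (W^T W h^t)_a / h^t_a \tag{13}$$
then
$$G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T K(h^t) (h - h^t) \tag{14}$$
is an auxiliary function for
$$F(h) = \frac{1}{2} \sum_i \Big( v_i - \sum_a W_{ia} h_a \Big)^2 \tag{15}$$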
Proof: Since $G(h, h) = F(h)$ is obvious, we need only show that $G(h, h^t) \geq F(h)$. To
do this, we compare
$$F(h) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T (W^T W) (h - h^t) \tag{16}$$
with Eq. (14) to find that $G(h, h^t) \geq F(h)$ is equivalent to
$$0 \leq (h - h^t)^T [K(h^t) - W^T W] (h - h^t) \tag{17}$$
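To prove positive semidefiniteness, consider the matrix¹:
$$M_{ab}(h^t) = h^t_a (K(h^t) - W^T W)_{ab} h^t_b \tag{18}$$
which is just a rescaling of the components of $K - W^T W$. Then $K - W^T W$ is positive
semidefinite if and only if M is, and
$$\nu^T M \nu = \sum_{ab} \nu_a M_{ab} \nu_b \tag{19}$$
$$= \sum_{ab} h^t_a (W^T W)_{ab} h^t_b \nu_a^2 - \nu_a h^t_a (W^T W)_{ab} h^t_b \nu_b \tag{20}$$
$$= \sum_{ab} (W^T W)_{ab} h^t_a h^t_b \left[ \frac{1}{2} \nu_a^2 + \frac{1}{2} \nu_b^2 - \nu_a \nu_b \right] \tag{21}$$
$$= \frac{1}{2} \sum_{ab} (W^T W)_{ab} h^t_a h^t_b (\nu_a - \nu_b)^2 \tag{22}$$
$$\geq 0 \tag{23}$$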
¹ One can also show that $K - W^T W$ is positive semidefinite by considering the matrix
$K^{1/2} (I - K^{-1/2} W^T W K^{-1/2}) K^{1/2}$. Then $\sqrt{h^t_a (W^T W h^t)_a}$ is a positive eigenvector of
$K^{-1/2} W^T W K^{-1/2}$ with unity eigenvalue, and application of the Frobenius-Perron theorem
shows that Eq. (17) holds.
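The chain of equalities in this proof can be spot-checked numerically. The sketch below assumes a hypothetical random positive W and h^t, builds the diagonal of K, and compares the quadratic form in M, evaluated directly, with the sum of squares that the proof arrives at, for test vectors with entries of either sign (written `nu` in the code).

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


rng = random.Random(1)
n, r = 5, 3   # hypothetical sizes
W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(n)]
h = [rng.uniform(0.1, 1.0) for _ in range(r)]      # plays the role of h^t

WtW = matmul(transpose(W), W)
WtWh = [sum(WtW[a][b] * h[b] for b in range(r)) for a in range(r)]
K = [WtWh[a] / h[a] for a in range(r)]             # diagonal entries of K(h^t)


def quad_form_M(nu):
    """nu^T M nu, with M_ab = h_a (K - W^T W)_ab h_b."""
    total = 0.0
    for a in range(r):
        for b in range(r):
            k_ab = K[a] if a == b else 0.0
            total += nu[a] * h[a] * (k_ab - WtW[a][b]) * h[b] * nu[b]
    return total


def sum_of_squares(nu):
    """(1/2) sum_ab (W^T W)_ab h_a h_b (nu_a - nu_b)^2."""
    return 0.5 * sum(WtW[a][b] * h[a] * h[b] * (nu[a] - nu[b]) ** 2
                     for a in range(r) for b in range(r))


trials = [[rng.uniform(-1.0, 1.0) for _ in range(r)] for _ in range(20)]
pairs = [(quad_form_M(nu), sum_of_squares(nu)) for nu in trials]
```

In every trial the two numbers agree up to rounding and the sum of squares is nonnegative, which is what positive semidefiniteness of M requires.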
 
We can now demonstrate the convergence of Theorem 1: 
Proof of Theorem 1 Replacing G(h, h t) in Eq. (11) by Eq. (14) results in the update rule: 
h t+ = h t - K(ht)-VF(h t) (24) 
Since Eq. (14) is an auxiliary function, F is nonincreasing under this update rule, according 
to Lemma 1. Writing the components of this equation explicitly, we obtain 
(WTV)a 
= (wTwnt)a' (25) 
By reversing the roles of W and H in Lemma 1 and 2, F can similarly be shown to be 
nonincreasing under the update rules for W.  
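For a single data column, the whole argument can be traced numerically. The sketch below assumes a hypothetical W, v and starting point h^t, evaluates F from Eq. (15) and G from Eq. (14), and checks that the componentwise update of Eq. (25) coincides with the rescaled gradient step of Eq. (24) and satisfies the chain of inequalities used in Lemma 1.

```python
W = [[0.5, 1.0], [1.0, 0.2], [0.3, 0.8]]   # hypothetical n x r basis
v = [1.0, 2.0, 0.5]                         # one data column
h_t = [0.4, 1.5]                            # current estimate h^t
n, r = len(W), len(h_t)


def F(h):
    """Half the squared error for one column, as in Eq. (15)."""
    return 0.5 * sum((v[i] - sum(W[i][a] * h[a] for a in range(r))) ** 2
                     for i in range(n))


WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(r)]
       for a in range(r)]
Wtv = [sum(W[i][a] * v[i] for i in range(n)) for a in range(r)]
WtWh = [sum(WtW[a][b] * h_t[b] for b in range(r)) for a in range(r)]
grad = [WtWh[a] - Wtv[a] for a in range(r)]          # gradient of F at h^t
K = [WtWh[a] / h_t[a] for a in range(r)]             # diagonal of K(h^t)


def G(h):
    """The auxiliary function G(h, h^t) of Eq. (14)."""
    d = [h[a] - h_t[a] for a in range(r)]
    return (F(h_t) + sum(d[a] * grad[a] for a in range(r))
            + 0.5 * sum(K[a] * d[a] ** 2 for a in range(r)))


h_next = [h_t[a] * Wtv[a] / WtWh[a] for a in range(r)]      # Eq. (25)
h_rescaled = [h_t[a] - grad[a] / K[a] for a in range(r)]    # Eq. (24)

chain = (F(h_next), G(h_next), G(h_t), F(h_t))
```

The four numbers in `chain` come out in nondecreasing order, with the last two equal, matching F(h^{t+1}) ≤ G(h^{t+1}, h^t) ≤ G(h^t, h^t) = F(h^t).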
We now consider the following auxiliary function for the divergence cost function: 
Lemma 3 Define
$$G(h, h^t) = \sum_i (v_i \log v_i - v_i) + \sum_{ia} W_{ia} h_a \tag{26}$$
$$\qquad - \sum_{ia} v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right) \tag{27}$$
This is an auxiliary function for
$$F(h) = \sum_i \left[ v_i \log \left( \frac{v_i}{\sum_a W_{ia} h_a} \right) - v_i + \sum_a W_{ia} h_a \right] \tag{28}$$
Proof: It is straightforward to verify that $G(h, h) = F(h)$. To show that $G(h, h^t) \geq F(h)$,
we use convexity of the log function to derive the inequality
$$-\log \sum_a W_{ia} h_a \leq -\sum_a \alpha_a \log \frac{W_{ia} h_a}{\alpha_a} \tag{29}$$
which holds for all nonnegative $\alpha_a$ that sum to unity. Setting
$$\alpha_a = \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \tag{30}$$
we obtain
$$-\log \sum_a W_{ia} h_a \leq -\sum_a \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right) \tag{31}$$
From this inequality it follows that $F(h) \leq G(h, h^t)$.
Theorem 2 then follows from the application of Lemma 1:
Proof of Theorem 2: The minimum of $G(h, h^t)$ with respect to h is determined by setting
the gradient to zero:
$$\frac{dG(h, h^t)}{dh_a} = -\sum_i v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \frac{1}{h_a} + \sum_i W_{ia} = 0 \tag{32}$$
Thus, the update rule of Eq. (11) takes the form
$$h^{t+1}_a = \frac{h^t_a}{\sum_k W_{ka}} \sum_i \frac{v_i}{\sum_b W_{ib} h^t_b} W_{ia}. \tag{33}$$
Since G is an auxiliary function, F in Eq. (28) is nonincreasing under this update. Rewritten
in matrix form, this is equivalent to the update rule in Eq. (5). By reversing the roles of
H and W, the update rule for W can similarly be shown to be nonincreasing.
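The auxiliary function for the divergence can be checked on a single column in the same spirit. The sketch below assumes a hypothetical positive W, v and starting point h^t, implements F from Eq. (28) and G from Eqs. (26) and (27), and applies the update of Eq. (33). All function and variable names are assumptions of this sketch.

```python
import math

W = [[0.5, 1.0], [1.0, 0.2], [0.3, 0.8]]   # hypothetical n x r basis
v = [1.0, 2.0, 0.5]                         # one data column
h_t = [0.4, 1.5]                            # current estimate h^t
n, r = len(W), len(h_t)


def F(h):
    """Divergence of v from Wh for one column, as in Eq. (28)."""
    total = 0.0
    for i in range(n):
        wh = sum(W[i][a] * h[a] for a in range(r))
        total += v[i] * math.log(v[i] / wh) - v[i] + wh
    return total


def G(h, h_ref):
    """The auxiliary function G(h, h_ref) of Eqs. (26) and (27)."""
    total = 0.0
    for i in range(n):
        denom = sum(W[i][b] * h_ref[b] for b in range(r))
        total += v[i] * math.log(v[i]) - v[i]
        for a in range(r):
            alpha = W[i][a] * h_ref[a] / denom   # weights that sum to one over a
            total += W[i][a] * h[a]
            total -= v[i] * alpha * (math.log(W[i][a] * h[a]) - math.log(alpha))
    return total


def update(h_ref):
    """The minimizer of G(., h_ref), as in Eq. (33)."""
    col_sums = [sum(W[k][a] for k in range(n)) for a in range(r)]
    out = []
    for a in range(r):
        s = sum(v[i] * W[i][a] / sum(W[i][b] * h_ref[b] for b in range(r))
                for i in range(n))
        out.append(h_ref[a] * s / col_sums[a])
    return out


h_next = update(h_t)
chain = (F(h_next), G(h_next, h_t), G(h_t, h_t), F(h_t))
```

As in the Euclidean case, the values in `chain` are in nondecreasing order and the last two agree up to rounding, so the divergence does not increase under the update.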
7 Discussion 
We have shown that application of the update rules in Eqs. (4) and (5) is guaranteed to
find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence
proofs rely upon defining an appropriate auxiliary function. We are currently working to 
generalize these theorems to more complex constraints. The update rules themselves are 
extremely easy to implement computationally, and will hopefully be utilized by others for 
a wide variety of applications. 
We acknowledge the support of Bell Laboratories. We would also like to thank Carlos 
Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam 
Roweis, Larry Saul, and Margaret Wright for helpful discussions. 
References 
[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag. 
[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71-86. 
[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad. 
Press. 
[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings
of the Conference on Neural Information Processing Systems 9, 515-521.
[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factoriza- 
tion. Nature 401, 788-791. 
[6] Field, DJ (1994). What is the goal of sensory coding? Neural Comput. 6, 559-601.
[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. The Handbook of Brain 
Theory and Neural Networks, 895-898. (MIT Press, Cambridge, MA). 
[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical recipes: the art 
of scientific computing. (Cambridge University Press, Cambridge, England). 
[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography. 
IEEE Trans. MI-2, 113-122. 
[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. 
Am. 62, 55-59. 
[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. 
J. 74, 745-754. 
[12] Bouman, CA & Sauer, K (1996). A unified approach to statistical tomography using coordinate
descent optimization. IEEE Trans. Image Proc. 5, 480-492. 
[13] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analy- 
sis. Chemometr. Intell. Lab. 37, 23-35. 
[14] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear 
prediction. Journal of Information and Computation 132, 1-64. 
[15] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via
the EM algorithm. J. Royal Stat. Soc. 39, 1-38.
[16] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language
processing. In C. Cardie and R. Weischedel (eds). Proceedings of the Second Conference on 
Empirical Methods in Natural Language Processing, 81-89. ACL Press. 
