Multiplicative Updates for Convolutional NMF Under β-Divergence
Pedro J. Villasana T., Stanislaw Gorlow, Arvind T. Hariraman
Abstract—In this letter, we generalize the convolutional NMF by taking the β-divergence as the contrast function and present the correct multiplicative updates for its factors in closed form. The new updates unify the β-NMF and the convolutional NMF. We state why almost all of the existing updates are inexact and approximative w.r.t. the convolutional data model. We show that our updates are stable and that their convergence performance is consistent across the most common values of β.

Index Terms—Convolution, nonnegative matrix factorization, multiplicative update rules, β-divergence.
I. INTRODUCTION

Nonnegative matrix factorization finds its application mostly in the fields of machine learning and in connection with inverse problems. It became immensely popular after Lee and Seung derived multiplicative update rules that made the up until then additive steps in the direction of the negative gradient obsolete [1]. In [2], they gave empirical evidence of their convergence to a stationary point, using (a) the squared Euclidean distance, and (b) the generalized Kullback–Leibler divergence as the contrast function. The factorization's origins can be traced back to [3], [4].

A convolutional variant of the factorization is introduced in [5] based on the Kullback–Leibler divergence. The main idea is to exploit temporal dependencies in the neighborhood of a point in the time-frequency plane. In their original form, the updates result in a biased factorization. To provide a remedy, multiple coefficient matrices are updated in [6], one for each translation, and the final update is by taking the average over all coefficient matrices.

A nonnegative matrix factor deconvolution in 2D based on the Kullback–Leibler divergence is found in [7]. Not only do the authors give a derivation of the update rules, they show a simple way of making the update rules multiplicative. It may be pointed out that the update rule for the coefficient matrix is different from the one in [6]. The same guiding principles are applied to derive the convolutional factorization based on the (squared) Euclidean distance in [8]. But in the attempt to give a formal proof for their update rules, the authors largely reformulate a biased factorization comparable to [5].

In [9], nonnegative matrix factorization is generalized to a family of α-divergences under the constraints of sparsity and smoothness, while the unconstrained β-divergence is brought into focus in [10]. For both cases multiplicative update rules were given. The properties of the β-divergence are discussed in detail in [10], [11]. The combined α-β-divergence together with the corresponding multiplicative updates can be found in [12].

In this letter, we provide multiplicative update rules for the factorial deconvolution under the β-divergence. Furthermore, we argue that the updates in [5], [6] and in [7] are empirical and/or inexact w.r.t. the convolutional data model. According to our simulation, the updates in [8] do not signify any extra improvement over [6] despite the additional load. Finally, we show that the exact updates are stable and that their behavior is consistent for β ∈ {0, 1, 2}.
II. NONNEGATIVE MATRIX FACTORIZATION

The nonnegative matrix factorization (NMF) is an umbrella term for a low-rank matrix approximation of the form

$$\mathbf{V} \simeq \mathbf{W}\mathbf{H} \tag{1}$$

with $\mathbf{V} \in \mathbb{R}_{\geq 0}^{K \times N}$, $\mathbf{W} \in \mathbb{R}_{\geq 0}^{K \times I}$, and $\mathbf{H} \in \mathbb{R}_{\geq 0}^{I \times N}$, where $I$ is the predetermined rank of the factorization. The letters above help distinguish between visible ($v$) and hidden variables ($h$) that are put in relation through weights ($w$). The factorization is usually formulated as a convex minimization problem with a dedicated cost function $C$ according to

$$\underset{\mathbf{W},\,\mathbf{H}}{\text{minimize}} \; C(\mathbf{W}, \mathbf{H}) \quad \text{subject to} \quad w_{ki}, h_{in} \geq 0 \tag{2}$$

with

$$C(\mathbf{W}, \mathbf{H}) = L(\mathbf{V}, \mathbf{W}\mathbf{H}), \tag{3}$$

where $L$ is a loss function that assesses the error between $\mathbf{V}$ and the factorization $\mathbf{W}\mathbf{H}$.

A. β-Divergence

The loss in (3) can be expressed by means of a contrast or distance function between the elements of $\mathbf{V}$ and $\mathbf{W}\mathbf{H}$. Due to its robustness with respect to outliers for certain values of the input parameter $\beta \in \mathbb{R}$, we resort to the β-divergence [13] as a subclass of the Bregman divergence [11], [14], which for the points $p$ and $q$ is given by [11]

$$d_\beta(p, q) = \begin{cases} \dfrac{p^\beta + (\beta - 1)\,q^\beta - \beta\,p\,q^{\beta-1}}{\beta(\beta - 1)}, & \beta \neq 0, 1, \\[2mm] p \log \dfrac{p}{q} - p + q, & \beta = 1, \\[2mm] \dfrac{p}{q} - \log \dfrac{p}{q} - 1, & \beta = 0. \end{cases} \tag{4}$$

Accordingly, the β-divergence for matrices $\mathbf{V}$ and $\mathbf{W}\mathbf{H}$ can be defined entrywise, as

$$D_\beta(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}) \overset{\text{def}}{=} \sum_{k=1}^{K} \sum_{n=1}^{N} d_\beta\Big(v_{kn}, \sum_i w_{ki} h_{in}\Big), \tag{5}$$

which can further be viewed as the β-divergence between two (unnormalized) marginal probability mass functions with $k$ as the marginal variable. Note that the β-divergence has a single global minimum for $\sum_n v_{kn} = \sum_{i,n} w_{ki} h_{in}$, $\forall k$, although it is strictly convex only for $\beta \in [1, 2]$ [10], [11].
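For illustration, (4) and (5) translate into a few lines of NumPy; the following is a minimal sketch, where the function name and the assumption of strictly positive entries (needed for the logarithms and inverses) are ours, not the letter's.

import numpy as np

def beta_divergence(V, U, beta):
    """Entrywise beta-divergence D_beta(V || U) of (4)-(5), summed over
    all entries; assumes strictly positive entries for beta <= 1."""
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if beta == 0:                      # Itakura-Saito divergence
        R = V / U
        return np.sum(R - np.log(R) - 1.0)
    if beta == 1:                      # generalized Kullback-Leibler divergence
        return np.sum(V * np.log(V / U) - V + U)
    # generic case, beta != 0, 1
    return np.sum((V**beta + (beta - 1.0) * U**beta
                   - beta * V * U**(beta - 1.0)) / (beta * (beta - 1.0)))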
B. Discrete Convolution

As can be seen explicitly in (5), the weight $w_{ki}$ for the $i$th hidden variable $h_{in}$ at point $(k, n)$ is applied using the scalar product. Given that $h_i$ evolves with $n$, we can assume that $h_i$ is correlated with its past and future states. We can take this into account by extending the dot product to a convolution in our model. Postulating causality and letting the weights have finite support of cardinality $M$, the convolution writes

$$u(n) = (w_{ki} * h_i)(n) \overset{\text{def}}{=} \sum_{m=0}^{M-1} w_{ki}(m)\, h_{i,n-m}. \tag{6}$$

The operation can be converted to a matrix multiplication by lining up the states $h_{in}$ in a truncated Toeplitz matrix:

$$\begin{bmatrix} h_{i,1} & h_{i,2} & h_{i,3} & \cdots & h_{i,N-1} & h_{i,N} \\ 0 & h_{i,1} & h_{i,2} & \cdots & h_{i,N-2} & h_{i,N-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & h_{i,N-M} & h_{i,N-M+1} \end{bmatrix}. \tag{7}$$

In accordance with (1), the convolutional NMF (CNMF) can be formulated as follows to accommodate the structure given in (7), see also [5], [6]:

$$\mathbf{V} \simeq \mathbf{U} = \sum_{m=0}^{M-1} \mathbf{W}_m \mathbf{H}_{m\to}, \tag{8}$$

where $(\cdot)_{m\to}$ is a columnwise right-shift operation (similar to a logical shift in programming languages) that shifts all the columns of $\mathbf{H}$ by $m$ positions to the right and fills the vacant positions with zeros. The operation is size-preserving. It can be seen that the CNMF has $M$ times as many weights as the regular NMF, whereas the number of hidden states is equal.
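The shift operation and the model (8) map directly onto code. Below is a minimal NumPy sketch under our own conventions, which are not prescribed by the letter: the weights are stacked into a single array W of shape (M, K, I), and H has shape (I, N). Later sketches reuse these helpers.

import numpy as np

def shift_right(H, m):
    """Columnwise right-shift: move all columns of H by m positions to
    the right and zero-fill the vacated columns; size-preserving."""
    if m == 0:
        return H.copy()
    S = np.zeros_like(H)
    S[:, m:] = H[:, :-m]
    return S

def cnmf_model(W, H):
    """The CNMF approximation (8): U = sum_m W_m (H right-shifted by m)."""
    return sum(W[m] @ shift_right(H, m) for m in range(W.shape[0]))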
III. β-CNMF

Ensuing from the preliminary considerations in Section II, we adopt the formulation of the CNMF from (8) and derive multiplicative update rules for gradient descent, while taking the entrywise β-divergence from (5) as the loss function. The result is referred to as the β-CNMF.

A. Problem Statement

Under the premise that $\mathbf{V}$ is factorizable into $\{\mathbf{W}_m\}$ and $\mathbf{H}$, $m = 0, 1, \ldots, M-1$, and given the cost function

$$C(\{\mathbf{W}_m\}, \mathbf{H}) = D_\beta(\mathbf{V} \,\|\, \mathbf{U}), \tag{9}$$

we seek to find the multiplicative equivalents of the iterative update rules for gradient descent:

$$\mathbf{W}_m^{t+1} = \mathbf{W}_m^t - \kappa\, \frac{\partial}{\partial \mathbf{W}_m^t}\, C\big(\{\mathbf{W}_m^t\}, \mathbf{H}^t\big), \tag{10a}$$

$$\mathbf{H}^{t+1} = \mathbf{H}^t - \mu\, \frac{\partial}{\partial \mathbf{H}^t}\, C\big(\{\mathbf{W}_m^t\}, \mathbf{H}^t\big), \tag{10b}$$

where (10a) and (10b) alternate at each iteration ($t \geq 0$). The step sizes $\kappa$ and $\mu$ are allowed to change at every iteration.
B. Multiplicative Update Rules

Computing the partial derivatives of $C$ w.r.t. $\mathbf{W}_m$ and $\mathbf{H}$, and by choosing appropriate values for $\kappa$ and $\mu$, the iterative update rules from (10) become multiplicative in the form

$$\mathbf{W}_m^{t+1} = \mathbf{W}_m^t \circ \left[ \mathbf{U}^{t\,\circ(\beta-1)}\, \mathbf{H}_{m\to}^{t\,\mathsf{T}} \right]^{\circ-1} \circ \left[ \mathbf{V} \circ \mathbf{U}^{t\,\circ(\beta-2)} \right] \mathbf{H}_{m\to}^{t\,\mathsf{T}}, \tag{11a}$$

$$\mathbf{H}^{t+1} = \mathbf{H}^t \circ \left[ \sum_m \mathbf{W}_m^{t\,\mathsf{T}}\, \widetilde{\mathbf{U}}_{m\leftarrow}^{t\,\circ(\beta-1)} \right]^{\circ-1} \circ \sum_m \mathbf{W}_m^{t\,\mathsf{T}} \left[ \mathbf{V}_{m\leftarrow} \circ \widetilde{\mathbf{U}}_{m\leftarrow}^{t\,\circ(\beta-2)} \right], \tag{11b}$$

for $m = 0, 1, \ldots, M-1$, with

$$\widetilde{\mathbf{U}}^t = \sum_m \mathbf{W}_m^{t+1} \mathbf{H}_{m\to}^t, \tag{12}$$

where $\circ$ denotes the Hadamard, i.e. entrywise, product, $(\cdot)^{\circ p}$ is equivalent to entrywise exponentiation and $(\cdot)^{\circ-1}$, respectively, stands for the entrywise inverse. The $(\cdot)_{m\leftarrow}$ operator is the left-shift counterpart of the right-shift operator. The details of the derivation of (11) can be found in the Appendix.
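For concreteness, a minimal sketch of one alternating iteration of (11a), (11b) and (12) follows, reusing shift_right and cnmf_model from the sketch in Section II-B. The function names, shift_left, and the small eps guard against division by zero are our own additions, not part of the letter.

import numpy as np

def shift_left(A, m):
    """Left-shift counterpart of shift_right: move all columns of A by m
    positions to the left and zero-fill the vacated columns."""
    if m == 0:
        return A.copy()
    S = np.zeros_like(A)
    S[:, :-m] = A[:, m:]
    return S

def beta_cnmf_step(V, W, H, beta, eps=1e-12):
    """One alternating beta-CNMF iteration; W has shape (M, K, I)."""
    M = W.shape[0]
    U = cnmf_model(W, H)                  # approximation at iteration t
    W_new = np.empty_like(W)
    for m in range(M):                    # (11a): every W_m uses the same U
        Hm_T = shift_right(H, m).T
        W_new[m] = W[m] * ((V * U**(beta - 2)) @ Hm_T) \
                        / ((U**(beta - 1)) @ Hm_T + eps)
    U_tilde = cnmf_model(W_new, H)        # (12): refresh with updated weights
    # (11b): V and U-tilde are left-shifted jointly, keeping them time-aligned
    num = sum(W[m].T @ shift_left(V * U_tilde**(beta - 2), m) for m in range(M))
    den = sum(W[m].T @ shift_left(U_tilde**(beta - 1), m) for m in range(M))
    return W_new, H * num / (den + eps)   # complete update of the hidden states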
C. Relation to Previous Works

Several multiplicative updates for the CNMF can be found in the existing literature using different loss functions. In [5], the loss function is stated as

$$L(\mathbf{V}, \mathbf{U}) = \sqrt{\sum_{k=1}^{K} \sum_{n=1}^{N} \left| v_{kn} \log \frac{v_{kn}}{u_{kn}} - v_{kn} + u_{kn} \right|^2} \tag{13}$$

and the corresponding update rules for $\mathbf{W}_m$ and $\mathbf{H}$ are

$$\mathbf{W}_m^{t+1} = \mathbf{W}_m^t \circ \left[ \mathbf{1}\, \mathbf{H}_{m\to}^{t\,\mathsf{T}} \right]^{\circ-1} \circ \left[ \mathbf{V} \circ \mathbf{U}^{t\,\circ-1} \right] \mathbf{H}_{m\to}^{t\,\mathsf{T}}, \tag{14a}$$

$$\mathbf{H}^{t+1} = \mathbf{H}^t \circ \left[ \mathbf{W}_m^{t\,\mathsf{T}}\, \mathbf{1} \right]^{\circ-1} \circ \mathbf{W}_m^{t\,\mathsf{T}} \left[ \mathbf{V}_{m\leftarrow} \circ \widetilde{\mathbf{U}}_{m\leftarrow}^{t\,\circ-1} \right], \tag{14b}$$

where the $\mathbf{1}$-matrix is of size $K$-by-$N$ with $[\mathbf{1}]_{kn} = 1$. At $t$, for every $m$, $\mathbf{H}$ is updated first using $\mathbf{W}_m^t$ and subsequently $\mathbf{W}_m^{t+1}$ is computed from the updated $\mathbf{H}$, or vice versa. In [5] it is mentioned, and more explicitly stated in [6], that to avoid bias in $\mathbf{H}$ towards $\mathbf{W}_{M-1}$ it is best to first update all $\{\mathbf{W}_m\}$ using $\mathbf{H}^t$ and to update $\mathbf{H}$ according to

$$\mathbf{H}^{t+1} = \frac{1}{M} \sum_{m=0}^{M-1} \mathbf{H}^t \circ \left[ \mathbf{W}_m^{(t+1)\,\mathsf{T}}\, \mathbf{1} \right]^{\circ-1} \circ \mathbf{W}_m^{(t+1)\,\mathsf{T}} \left[ \mathbf{V}_{m\leftarrow} \circ \widetilde{\mathbf{U}}_{m\leftarrow}^{t\,\circ-1} \right]. \tag{15}$$

Comparing (14) and (15) with (11), it can be noted that both update rules have the same factors for $\beta = 1$. The respective loss function in this case is the generalized Kullback–Leibler divergence $D_1(\mathbf{V} \,\|\, \mathbf{U})$ and not (13). Moreover, the $\mathbf{1}$-matrix in (14b) is not aligned with the $\mathbf{U}$-matrix. Given that $\beta = 1$, (14) and (11) are identical for $M = 1$, i.e. when the NMF is nonconvolutional. For $M > 1$, (14) is equal to the updates in [2] for $D_1(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H})$ for different $\mathbf{W}_m$-matrices where the $\mathbf{H}$-matrix is time-aligned via $m$. Eqs. (15) and (11b) are also different for $M > 1$ because, unlike (11b), (14b) is the update derived from a nonconvolutional model. $\mathbf{H}$ in (15) brings out the central tendency of the elements in $\mathbf{H}$ but does not make the factorization in the original loss convolutional. For all the reasons given above, the updates in (14b) and (15) are at best an approximation of the update in (11b) for $\beta = 1$.

In [7], multiplicative updates are given for a CNMF in 2D (time and frequency) with the (generalized) Kullback–Leibler divergence and the squared Euclidean distance as the loss or cost function. In the dimension of time, the updates are very much the same as the updates in (11) for $\beta = 2$. For $\beta = 1$, there is the minor difference that the $\mathbf{1}$-matrix is not aligned with the $\mathbf{U}$-matrix, just like in (14b).

Other multiplicative updates for a CNMF with the squared Euclidean distance can be found in [8]. In essence, they are derived in the exact same manner as the updates in (14), but for a different loss function, which is $D_2(\mathbf{V} \,\|\, \mathbf{U})$. Thus, the updates are equal to the ones in [2] with $\mathbf{W} = \mathbf{W}_m$, $\mathbf{U}$ as in (8) and $\mathbf{H}$ time-aligned. For the same reasons that the updates in (14b) and (15) are approximative of the update in (11b) for $\beta = 1$, the updates in [8] are approximative of the update in (11b) for $\beta = 2$. Beyond that, (12) is updated more efficiently as

$$\widetilde{\mathbf{U}}^t = \widetilde{\mathbf{U}}^t + \big( \mathbf{W}_m^{t+1} - \mathbf{W}_m^t \big)\, \mathbf{H}_{m\to}^t \tag{16}$$

in between the updates of $\mathbf{W}_m$ and $\mathbf{W}_{m+1}$.
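The incremental refresh (16) is a one-liner; here is a minimal sketch (names ours, shift_right from Section II-B). Starting from the model built with the old weights and applying it once per m = 0, 1, ..., M-1 reproduces (12) without a full recomputation.

def refresh_U_tilde(U_tilde, W_m_old, W_m_new, H, m):
    """Add the contribution of the changed W_m to the running model, cf. (16)."""
    return U_tilde + (W_m_new - W_m_old) @ shift_right(H, m)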
D. Interpretation

The two update rules given in (11) are of significant value because, apart from being exact:

• They are multiplicative, and thus, they converge fast and are easy to implement.
• Eq. (11a) extends the update rule of the β-NMF for $\mathbf{W}$ to a set of $M$ weight matrices that are linked through a convolution operation. It also extends the corresponding update rule of the existing convolutional NMFs with the squared Euclidean distance or the generalized Kullback–Leibler divergence to the family of β-divergences.
• Eq. (11b) is even more important, as it yields a complete update of the hidden states at every iteration, taking all $M$ weight matrices into account at once.

The update rule in (11b) can be viewed as an equivalent to a descent in the direction of the Reynolds gradient, which is to take the average over the partial derivatives under the group of time translations (in $m$). The average operator reduces the gradient spread as a function of $\mathbf{W}_m$ at each iteration $t$, such that the loss function converges to a single point that is most likely.

The β-NMF being referred to here is the heuristic β-NMF derived in [9]. It was shown in [10] to converge faster than a computationally equivalent majorization-minimization (MM) algorithm for $\beta \notin [1, 2]$ and equally fast in the opposite case. The heuristic updates were proven to converge for $\beta \in [0, 2]$, which is the interval of practical value.
IV. SIMULATION

In this section, we compare our proposed updates with the existing ones in terms of convergence behavior and run time for 1000 iterations. To that end, we generate 100 distinct $\mathbf{V}$-matrices from $M = 16$ $\chi^2$-distributed $\mathbf{W}_m$-matrices,

$$w_{ki}(m) = \sum_{p} w_{ki_p}^2(m) \sim \chi^2, \qquad w_{ki_p}(m) \sim \mathcal{N}(0, 1), \tag{17}$$

and a uniformly distributed $\mathbf{H}$-matrix,

$$h_{in} \sim \mathcal{U}(0, 1). \tag{18}$$

The factorizations are repeated with 10 random initializations of $\{\mathbf{W}_m^0\}$ and $\mathbf{H}^0$ with non-zero entries; a code sketch of this setup is given at the end of the section. The results shown in Fig. 1 thus are computed over ensembles of 1000 losses at each iteration. The number of visible and hidden variables is $K = 1000$ and $I = 10$, and the number of realizations (time samples) is $N = 100$. The run time was measured on an Intel Xeon E5-2637 v3 CPU at 3.5 GHz with 16 GB of RAM. In Table I, the figures represent the averages over 1000 runs put into relation to the average run time of the proposed updates for 1000 iterations.

TABLE I
AVERAGE RUN TIME OF THE EXISTING CONVOLUTIONAL UPDATES RELATIVE TO THE PROPOSED UPDATES FOR DIFFERENT BETAS
[Columns: Smaragdis (biased), Smaragdis (average), Schmidt et al., Wang et al.; rows: β = 0, β = 1, β = 2; the numeric entries are not recoverable.]

In Fig. 1 it can be seen that the biased updates are clearly least stable under the divergence for which they were meant in the first place (β = 1). Already in [7], convergence issues with these updates were reported. For β ∈ {0, 2} at less than 100 iterations they can converge faster because $\mathbf{H}$ and $\mathbf{U}$ are updated $M$ times per iteration, which explains the significant increase in run time. Between 100 and 1000 iterations, other updates show better performance. Wang's updates are similar in performance to Smaragdis' average updates in spite of the additional intermediate updates of $\mathbf{U}$ for β = 1, and slightly worse otherwise. As stated above, Schmidt's updates are the same as ours for β ≠ 1, and so is their behavior. For β = 1, our updates show the smallest variance overall and yield the lowest cost below 100 iterations, which is a typical upper bound in practice. The longer run time is due to the shifting of the $\mathbf{1}$-matrix. For β ∈ {0, 2}, the loss distributions of our, Wang's, and Smaragdis' average updates look like they have the same mean but a slightly different standard deviation. To test the hypothesis that the costs are statistically equivalent in respect of the mean, we employ Welch's t-test. The p-values suggest that the null hypothesis can be rejected almost surely for β = 0, whereas for β = 2 in general it cannot.
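For reproducibility, a minimal sketch of one test instance per (17) and (18) follows, reusing the earlier helpers (cnmf_model, beta_cnmf_step, beta_divergence). The letter elides the number of squared-normal terms in (17), so P = 2 below, like the seed and the variable names, is only an assumption on our part.

import numpy as np

rng = np.random.default_rng(0)              # fixed seed for repeatability
M, K, I, N = 16, 1000, 10, 100              # sizes as stated in the text
P = 2                                       # number of squared-normal terms: our assumption

# (17): chi-square-distributed weights as sums of squared standard normals
W = np.sum(rng.standard_normal((P, M, K, I))**2, axis=0)
# (18): uniformly distributed hidden states
H = rng.uniform(0.0, 1.0, size=(I, N))
# visible data according to the convolutional model (8)
V = cnmf_model(W, H)

# a few iterations of the sketch from Section III-B, monitoring the loss
W0 = rng.uniform(0.1, 1.0, size=(M, K, I))  # non-zero random initialization
H0 = rng.uniform(0.1, 1.0, size=(I, N))
for t in range(10):
    W0, H0 = beta_cnmf_step(V, W0, H0, beta=1)
    print(t, beta_divergence(V, cnmf_model(W0, H0), beta=1))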
V. CONCLUSION

To the best of our knowledge, our letter is the only one to provide a complete and exact derivation of the multiplicative updates for the convolutional NMF. Above, the cost function is generalized to the family of β-divergences. It is shown by simulation that the updates are stable and that their behavior is consistent for β ∈ {0, 1, 2}.

Fig. 1. Simulation results including the mean and the standard deviation of the loss distribution and the p-value from Welch's t-test for the hypothesis that the proposed and the existing update rules on average have the same cost (and eventually converge to the same minimum) as a function of iteration. [Legend: Villasana et al., Smaragdis (biased), Smaragdis (average), Schmidt et al., Wang et al.; the plot panels are not recoverable.]

APPENDIX
Let $u_{kn} = \sum_{i,m} w_{ki}(m)\, h_{i,n-m}$ and $\mathbf{U} = [u_{kn}] \in \mathbb{R}^{K \times N}$. Then, for any $p \in \{1, 2, \ldots, K\}$, $q \in \{1, 2, \ldots, I\}$, and $r \in \{0, 1, \ldots, M-1\}$:

$$\frac{\partial}{\partial w_{pq}(r)} C(\{\mathbf{W}_m\}, \mathbf{H}) = \frac{\partial}{\partial w_{pq}(r)} \sum_n \left( \frac{u_{pn}^\beta}{\beta} - \frac{v_{pn}\, u_{pn}^{\beta-1}}{\beta - 1} \right) = \sum_n \big( u_{pn}^{\beta-1} - v_{pn}\, u_{pn}^{\beta-2} \big)\, h_{q,n-r}. \tag{19}$$

Choosing $\kappa$ from (10a) as

$$\kappa = \frac{w_{pq}(r)}{\sum_n u_{pn}^{\beta-1}\, h_{q,n-r}}, \tag{20}$$

leads to the first update rule

$$w_{ki}^{t+1}(m) = w_{ki}^t(m)\, \frac{\sum_n v_{kn}\, \big(u_{kn}^t\big)^{\beta-2}\, h_{i,n-m}^t}{\sum_n \big(u_{kn}^t\big)^{\beta-1}\, h_{i,n-m}^t}. \qquad \blacksquare \tag{21}$$

Further, for any $p \in \{1, 2, \ldots, I\}$ and $q \in \{1, 2, \ldots, N\}$:

$$\frac{\partial}{\partial h_{pq}} C(\{\mathbf{W}_m\}, \mathbf{H}) = \frac{\partial}{\partial h_{pq}} \sum_{k,n} \left( \frac{u_{kn}^\beta}{\beta} - \frac{v_{kn}\, u_{kn}^{\beta-1}}{\beta - 1} \right). \tag{22}$$

It is straightforward to show that

$$\frac{\partial}{\partial h_{pq}} u_{kn} = w_{kp}(n - q) \tag{23}$$

by setting $n - m = q$, i.e. $m = n - q$. As a result, plugging in $q + m$ for $n$ in (22) and using (23), we finally obtain

$$\frac{\partial}{\partial h_{pq}} C(\{\mathbf{W}_m\}, \mathbf{H}) = \sum_{k,m} w_{kp}(m) \big( u_{k,q+m}^{\beta-1} - v_{k,q+m}\, u_{k,q+m}^{\beta-2} \big). \tag{24}$$

Choosing $\mu$ from (10b) as

$$\mu = \frac{h_{pq}}{\sum_{k,m} w_{kp}(m)\, u_{k,q+m}^{\beta-1}}, \tag{25}$$

leads to the second update rule

$$h_{in}^{t+1} = h_{in}^t\, \frac{\sum_{k,m} w_{ki}^t(m)\, v_{k,n+m}\, \big(u_{k,n+m}^t\big)^{\beta-2}}{\sum_{k,m} w_{ki}^t(m)\, \big(u_{k,n+m}^t\big)^{\beta-1}}. \qquad \blacksquare \tag{26}$$

REFERENCES
[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[2] ——, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[3] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[4] P. Paatero, "Least squares formulation of robust non-negative factor analysis," Chemometrics and Intelligent Laboratory Systems, vol. 37, no. 1, pp. 23–35, 1997.
[5] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Independent Component Analysis and Blind Signal Separation, 2004, pp. 494–499.
[6] ——, "Convolutive speech bases and their application to supervised speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 1–12, 2007.
[7] M. N. Schmidt and M. Mørup, "Nonnegative matrix factor 2-D deconvolution for blind single channel source separation," in Independent Component Analysis and Blind Signal Separation, 2006, pp. 700–707.
[8] W. Wang, A. Cichocki, and J. A. Chambers, "A multiplicative algorithm for convolutive non-negative matrix factorization based on squared Euclidean distance," IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2858–2864, 2009.
[9] A. Cichocki, R. Zdunek, and S. Amari, "Csiszár's divergences for non-negative matrix factorization: Family of new algorithms," in Independent Component Analysis and Blind Signal Separation, 2006, pp. 32–39.
[10] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[11] A. Cichocki and S.-i. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[12] A. Cichocki, S. Cruces, and S.-i. Amari, "Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization," Entropy, vol. 13, no. 1, pp. 134–170, 2011.
[13] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, "Robust and efficient estimation by minimising a density power divergence," Biometrika, vol. 85, no. 3, pp. 549–559, 1998.
[14] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, 1967.