Primal-Dual Algorithms for Non-negative Matrix Factorization with the Kullback-Leibler Divergence
Felipe Yanez and Francis Bach
INRIA – SIERRA Project-Team
Département d'Informatique de l'École Normale Supérieure
Paris, France
December 5, 2014
Abstract
Non-negative matrix factorization (NMF) approximates a given matrix as a product of two non-negative matrices. Multiplicative algorithms deliver reliable results, but they show slow convergence for high-dimensional data and may be stuck away from local minima. Gradient descent methods have better behavior, but only apply to smooth losses such as the least-squares loss. In this article, we propose a first-order primal-dual algorithm for non-negative decomposition problems (where one factor is fixed) with the KL divergence, based on the Chambolle-Pock algorithm. All required computations may be obtained in closed form and we provide an efficient heuristic way to select step sizes. By using alternating optimization, our algorithm readily extends to NMF and, on synthetic examples, face recognition or music source separation datasets, it is either faster than existing algorithms, or leads to improved local optima, or both.
1 Introduction

The current development of techniques for big data applications has been extremely useful in many fields, including data analysis, bioinformatics and scientific computing. These techniques need to handle large amounts of data and often rely on dimensionality reduction; this is often cast as approximating a matrix with a low-rank element. Non-negative matrix factorization (NMF) is a method that aims at finding part-based, linear representations of non-negative data by factorizing it as the product of two low-rank non-negative matrices (Paatero and Tapper, 1994; Lee and Seung, 1999). In 2000, two multiplicative algorithms for NMF were introduced by Lee and Seung: one that minimizes the conventional least-squares error, and one that minimizes the generalized Kullback-Leibler (KL) divergence (Lee and Seung, 2000). These algorithms extend to other losses and have been applied in many settings, e.g., face recognition (Wang et al., 2005), music analysis (Févotte et al., 2009), and text mining (Guduru, 2006). An important weakness of multiplicative algorithms is their slow convergence rate on high-dimensional data and their susceptibility to becoming trapped in poor local optima (Lin, 2007). Gradient descent methods for NMF provide additional flexibility and fast convergence (Lin, 2007; Kim et al., 2008; Gillis, 2011); these methods have been extensively studied for the minimization of the least-squares error (Lin, 2007; Kim et al., 2008).

The goal of this paper is to provide similar first-order methods for the KL divergence, with updates as cheap as multiplicative updates. Our method builds on the recent work of Sun and Févotte (2014), which considers the alternating direction method of multipliers (ADMM) adapted to this problem. We instead rely on the Chambolle-Pock algorithm (Chambolle and Pock, 2011), which may be seen as a linearized version of ADMM; we may thus reuse some of the tools developed by Sun and Févotte (2014) while obtaining an empirically faster algorithm.
The main contributions of this article are as follows:
– We propose a new primal-dual formulation for the convex KL decomposition problem in Section 3.1, and an extension to the non-convex problem of NMF by alternating minimization in Section 3.5.
– We provide a purely data-driven way to select all step sizes of our algorithm in Section 3.3.
– In our simulations in Section 4 on synthetic examples, face recognition and music source separation datasets, our algorithm is either faster than existing algorithms, or leads to improved local optima, or both.
– We derive a cheap and efficient implementation (Algorithm 2). Matlab code is available online at: anonymized website
2 Problem formulation

Let V ∈ R_+^{n×m} denote the given matrix formed by m non-negative column vectors of dimensionality n. Considering r ≤ min(n, m), let W ∈ R_+^{n×r} and H ∈ R_+^{r×m} be the matrix factors such that V ≈ WH. Two widely used cost functions for NMF are the conventional least-squares error (not detailed herein) and the generalized KL divergence

D(V ‖ WH) = − ∑_{i=1}^{n} ∑_{j=1}^{m} V_{ij} { log( (WH)_{ij} / V_{ij} ) + 1 } + ∑_{i=1}^{n} ∑_{j=1}^{m} (WH)_{ij}.   (1)

In this work, only the KL divergence is considered. The optimization problem is therefore

minimize_{W, H ≥ 0}  D(V ‖ WH).   (2)

We recall that this problem is non-convex in both factors simultaneously, but convex in each factor separately; i.e., the non-negative decomposition (ND) problems

minimize_{W ≥ 0}  D(V ‖ WH)   (3)

and

minimize_{H ≥ 0}  D(V ‖ WH)   (4)

are convex. We now present two existing algorithms for NMF: the multiplicative updates (Lee and Seung, 2000) and the ADMM-based approach (Sun and Févotte, 2014).

2.1 Multiplicative updates

Lee and Seung (2000) introduced two multiplicative update algorithms for NMF: one minimizes the conventional least-squares error, and the other minimizes the KL divergence. The NMF problem, for both losses, is non-convex in W and H simultaneously, but convex with respect to each variable taken separately; this makes alternating optimization techniques, i.e., solving two separate convex problems at each iteration, well adapted: first fixing H to estimate W, and then fixing W to estimate H (Lee and Seung, 2000; Févotte et al., 2009). The multiplicative update algorithms (like ours) follow this approach.

For the KL divergence loss, the multiplicative update rules (Lee and Seung, 2000) for W and H are as follows, and may be derived from expectation-maximization (EM) for a certain probabilistic model (Lee and Seung, 2000; Févotte and Cemgil, 2009):

W_{ia} ← W_{ia} ( ∑_{μ=1}^{m} H_{aμ} V_{iμ} / (WH)_{iμ} ) / ( ∑_{ν=1}^{m} H_{aν} ),   and   H_{aμ} ← H_{aμ} ( ∑_{i=1}^{n} W_{ia} V_{iμ} / (WH)_{iμ} ) / ( ∑_{k=1}^{n} W_{ka} ).

The complexity per iteration is O(rmn).
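For concreteness, the KL multiplicative updates above can be written in a few lines of NumPy (a minimal sketch with variable names of our own; the constant eps is an implementation safeguard against division by zero and is not part of the original updates):

```python
import numpy as np

def mua_kl(V, W, H, n_iter=1000, eps=1e-12):
    """Sketch of the multiplicative KL updates of Lee and Seung (2000)."""
    for _ in range(n_iter):
        R = V / (W @ H + eps)                            # R_ij = V_ij / (WH)_ij
        W *= (R @ H.T) / (H.sum(axis=1) + eps)           # update of W_ia
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update of H_amu
    return W, H
```

Each iteration costs O(rmn), dominated by the two products involving the n × m ratio matrix.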
2.2 ADMM

Sun and Févotte (2014) propose an ADMM technique to solve Problem (2) by reformulating it as

minimize  D(V ‖ X)   subject to   X = YZ,  Y = W,  Z = H,  W ≥ 0,  H ≥ 0.

The updates for the primal variables W, H, X, Y and Z are as follows, and involve certain proximal operators for the KL loss which are the same as ours in Section 3.2:

Y^⊤ ← ( ZZ^⊤ + I )^{−1} ( ZX^⊤ + W^⊤ + (1/ρ)( Z α_X^⊤ − α_Y^⊤ ) )
Z ← ( Y^⊤Y + I )^{−1} ( Y^⊤X + H + (1/ρ)( Y^⊤ α_X − α_Z ) )
X ← [ (ρYZ − α_X − 1) + √( (ρYZ − α_X − 1) ∘ (ρYZ − α_X − 1) + 4ρV ) ] / (2ρ)
W ← ( Y + (1/ρ) α_Y )_+
H ← ( Z + (1/ρ) α_Z )_+ ,

where the X update is entry-wise and (·)_+ denotes projection onto the non-negative orthant. Note that the primal updates require solving linear systems of size r × r, but the overall complexity remains O(rmn) per iteration (the same as for the multiplicative updates). The updates for the dual variables α_X, α_Y and α_Z are then:

α_X ← α_X + ρ(X − YZ),   α_Y ← α_Y + ρ(Y − W),   α_Z ← α_Z + ρ(Z − H).

ADMM has a regularization parameter ρ ∈ R_+ that needs to be tuned (in our experiments we choose the best-performing value from several candidates). Our approach has the following differences: (1) we aim at alternately solving convex problems with a few steps of primal-dual algorithms for convex problems, as opposed to aiming at solving the non-convex problem directly with an iterative approach; (2) for the convex decomposition problem, we have certificates of optimality, which can be of use for online methods in which decomposition problems are repeatedly solved (Lefèvre et al., 2011); and (3) we use a different splitting method, namely that of Chambolle and Pock (2011), which does not require matrix inversions and which allows us to compute all step sizes in a data-driven way.

3 Proposed method

In this section we present a formulation of the convex KL decomposition problem as a first-order primal-dual algorithm (FPA), followed by the proposed NMF technique.
3.1 Primal-dual formulation

We consider a vector a ∈ R_+^p and a matrix K ∈ R_+^{p×q} as known parameters, and x ∈ R_+^q as an unknown vector to be estimated such that a ≈ Kx; we aim at minimizing the KL divergence between a and Kx. This is equivalent to an ND problem as defined in Problems (3) and (4), considering a as a column of the given data, K as the fixed factor, and x as a column of the estimated factor; i.e., in Problem (3), a and x are columns of V^⊤ and W^⊤ with the same index and K is H^⊤, while in Problem (4), a and x are columns of V and H with the same index and K is W.

The convex ND problem with the KL divergence is thus

minimize_{x ∈ R_+^q}  − ∑_{i=1}^{p} a_i ( log(K_i x / a_i) + 1 ) + ∑_{i=1}^{p} K_i x,   (5)

where K_i denotes the i-th row of K. This may be written as

minimize_{x ∈ X}  F(Kx) + G(x),   (6)

with

F(z) = − ∑_{i=1}^{p} a_i ( log(z_i / a_i) + 1 )   and   G(x) = ι_{x ⪰ 0}(x) + ∑_{i=1}^{p} K_i x.

Following Pock et al. (2009) and Chambolle and Pock (2011), we obtain the dual problem

maximize_{y ∈ Y}  − F*(y) − G*(−K* y),

with

F*(y) = sup_z { y^⊤z − F(z) } = − ∑_{i=1}^{p} a_i log(−y_i)   and   G*(y) = sup_x { y^⊤x − G(x) } = ι_{y ⪯ K^⊤1}(y).
We then get the dual problem

maximize_{K^⊤(−y) ⪯ K^⊤1}  a^⊤ log(−y).   (7)

In order to provide a certificate of optimality, we have to make sure that the constraint K^⊤(−y) ⪯ K^⊤1 is satisfied. Therefore, when it is not satisfied, we project as follows:

y ← y / max{ K^⊤(−y) ⊘ K^⊤1 },

where ⊘ denotes the entry-wise division operator.
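In code, this certificate might be computed as follows (a sketch in NumPy under the notation above; the rescaling is applied only when the constraint is violated):

```python
import numpy as np

def kl_duality_gap(a, K, x, y):
    """Sketch: dual-feasibility projection for (7) and primal-dual gap for (5)."""
    # Rescale y when K^T(-y) <= K^T 1 does not hold (entry-wise division):
    scale = np.max((K.T @ (-y)) / (K.T @ np.ones(K.shape[0])))
    if scale > 1.0:
        y = y / scale
    Kx = K @ x
    primal = -np.sum(a * (np.log(Kx / a) + 1.0)) + np.sum(Kx)  # objective of (5)
    dual = np.sum(a * np.log(-y))                              # objective of (7)
    return primal - dual, y
```

A vanishing gap certifies that x solves Problem (5).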
3.2 Algorithm and proximal operators

The general FPA framework of the approach proposed by Chambolle and Pock for Problem (6) is presented in Algorithm 1.

Algorithm 1: First-order primal-dual algorithm.
Select K ∈ R_+^{p×q}, x_0 ∈ R_+^q, σ > 0, and τ > 0;
Set x = x̄ = x_old = x_0 and y = Kx_0;
for N iterations do
  y ← prox_{σF*}(y + σK x̄);
  x ← prox_{τG}(x − τK* y);
  x̄ ← 2x − x_old;
  x_old ← x;
end
return x* = x.

Algorithm 1 requires the computation of the proximal operators prox_{σF*}(y) and prox_{τG}(x). These are defined as follows:

prox_{σF*}(y) = argmin_v { ‖v − y‖² / (2σ) + F*(v) },   and   prox_{τG}(x) = argmin_u { ‖u − x‖² / (2τ) + G(u) }.

For further details, see Boyd and Vandenberghe (2004) and Rockafellar (1997). Using the convex functions F* and G, we can easily solve the problems for the proximal operators and derive the closed-form solution operators

prox_{σF*}(y) = ½ ( y − √( y ∘ y + 4σa ) ),   and   prox_{τG}(x) = ( x − τK^⊤1 )_+ .

The detailed derivation of these operators may be found in the Appendix; the first one was already computed by Sun and Févotte (2014).
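With these closed forms, Algorithm 1 translates directly into NumPy (a sketch; variable names are ours):

```python
import numpy as np

def fpa_nd(a, K, x0, sigma, tau, n_iter=100):
    """Sketch of Algorithm 1 for the ND problem min_{x >= 0} KL(a || Kx)."""
    KT1 = K.T @ np.ones(K.shape[0])          # K^T 1, used by prox_{tau G}
    x, x_old, x_bar = x0.copy(), x0.copy(), x0.copy()
    y = K @ x0
    for _ in range(n_iter):
        y = y + sigma * (K @ x_bar)
        y = 0.5 * (y - np.sqrt(y * y + 4.0 * sigma * a))      # prox_{sigma F*}
        x = np.maximum(x - tau * (K.T @ y) - tau * KT1, 0.0)  # prox_{tau G}
        x_bar = 2.0 * x - x_old                               # extrapolation
        x_old = x.copy()
    return x
```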
3.3 Automatic heuristic selection of σ and τ

In this section, we provide a heuristic way to select σ and τ without user intervention, based on the convergence result of Chambolle and Pock (2011, Theorem 1), which states that (a) the step sizes have to satisfy τσ‖K‖² ≤ 1, where ‖K‖ = max{‖Kx‖ : x ∈ X with ‖x‖ ≤ 1} is the largest singular value of K; and (b) the convergence rate is controlled by the quantity

C = ‖y_0 − y*‖² / (2σ) + ‖x_0 − x*‖² / (2τ),

where (x*, y*) is an optimal primal/dual pair. If (x*, y*) were known, we could thus consider the following minimization problem under the constraint τσ‖K‖² = 1:

min_{σ,τ}  ‖y_0 − y*‖² / (2σ) + ‖x_0 − x*‖² / (2τ)   ⟺   min_{σ}  ‖y_0 − y*‖² / (2σ) + ‖x_0 − x*‖² σ‖K‖² / 2.

Applying first-order conditions, we find that

σ = ‖y_0 − y*‖ / ( ‖x_0 − x*‖ ‖K‖ )   and   τ = ‖x_0 − x*‖ / ( ‖y_0 − y*‖ ‖K‖ ).

However, we do not know the optimal pair (x*, y*), so we use heuristic replacements. That is, we consider unknown constants α and β, and assume that x* = α1 and y* = β1 solve Problems (5) and (7). Letting (x_0, y_0) = (0, 0), we have

‖x_0 − x*‖ = |α|√q   and   ‖y_0 − y*‖ = |β|√p.

Plugging x* into Problem (5), we find that α = 1^⊤a / (1^⊤K1) > 0. Now, using the optimality condition y* ∘ (Kx*) = −a, we obtain β = −1, so that

σ = √(p/q) · 1 / (α‖K‖)   and   τ = √(q/p) · α / ‖K‖.

Finally, the automatic heuristic selection of the step sizes σ and τ is

σ = ( √p · 1^⊤K1 ) / ( √q ‖K‖ · 1^⊤a )   and   τ = ( √q · 1^⊤a ) / ( √p ‖K‖ · 1^⊤K1 ).

Note the invariance by rescaling of a and K.
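In code, the heuristic amounts to a couple of lines (a sketch; here ‖K‖ is computed exactly, although a few power iterations would suffice in practice):

```python
import numpy as np

def heuristic_step_sizes(a, K):
    """Sketch of the data-driven selection of sigma and tau (Section 3.3)."""
    p, q = K.shape
    norm_K = np.linalg.norm(K, 2)  # largest singular value of K
    sigma = np.sqrt(p) * K.sum() / (np.sqrt(q) * norm_K * a.sum())
    tau = np.sqrt(q) * a.sum() / (np.sqrt(p) * norm_K * K.sum())
    return sigma, tau  # by construction, tau * sigma * norm_K**2 == 1
```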
3.4 Solving the ND problems

The proposed method is based on Algorithm 1. The required parameters to solve each ND problem are:

• Problem (3): a ← (V^⊤)_i, K ← H^⊤, x ← (W^⊤)_i, σ ← √(m/r) (1^⊤H1) / ( 1^⊤(V^⊤)_i ‖H‖ ), τ ← √(r/m) (1^⊤(V^⊤)_i) / ( 1^⊤H1 ‖H‖ );
• Problem (4): a ← V_i, K ← W, x ← H_i, σ ← √(n/r) (1^⊤W1) / ( 1^⊤V_i ‖W‖ ), τ ← √(r/n) (1^⊤V_i) / ( 1^⊤W1 ‖W‖ ).

The previous summary treats each ND problem by columns. For algorithmic efficiency, we work directly with the matrices, e.g., a ∈ R_+^{n×m} instead of R_+^n. We also include normalization steps such that the columns of W sum to one. The stopping criteria are a maximum number of iterations (accesses to data) and a duality-gap tolerance.

3.5 Alternating minimization for NMF

A pseudo-code of the first-order primal-dual algorithm for non-negative matrix factorization is given in Algorithm 2. It corresponds to alternating between minimizing with respect to H and minimizing with respect to W. A key algorithmic choice is the number of inner iterations iter_ND of the convex method, which we consider in Section 4.

Algorithm 2: Proposed technique.
Select V ∈ R_+^{n×m}, W_0 ∈ R_+^{n×r}, and H_0 ∈ R_+^{r×m};
Set W = W̄ = W_old = W_0, H = H̄ = H_old = H_0, and χ = W_0H_0;
while stopping criteria not reached do
  Normalize W; set σ = √(m/r) (1^⊤H1) ⊘ ( ‖H‖ V1 ) and τ = √(r/m) (V1) ⊘ ( ‖H‖ 1^⊤H1 ), and rescale χ so that H(−χ^⊤) ⪯ (H1)1^⊤;
  for iter_ND iterations do
    χ^⊤ ← χ^⊤ + σ ∘ (W̄H)^⊤;
    χ^⊤ ← ½ ( χ^⊤ − √( χ^⊤ ∘ χ^⊤ + 4σ ∘ V^⊤ ) );
    W^⊤ ← ( W^⊤ − τ ∘ ( H(χ^⊤ + 1) ) )_+;
    W̄^⊤ ← 2W^⊤ − W_old^⊤;  W_old^⊤ ← W^⊤;
  end
  Normalize H; set σ = √(n/r) (1^⊤W1) ⊘ ( ‖W‖ V^⊤1 ) and τ = √(r/n) (V^⊤1) ⊘ ( ‖W‖ 1^⊤W1 ), and rescale χ so that W^⊤(−χ) ⪯ (W^⊤1)1^⊤;
  for iter_ND iterations do
    χ ← χ + σ ∘ (WH̄);
    χ ← ½ ( χ − √( χ ∘ χ + 4σ ∘ V ) );
    H ← ( H − τ ∘ ( W^⊤(χ + 1) ) )_+;
    H̄ ← 2H − H_old;  H_old ← H;
  end
end
return W* = W and H* = H.

Here ∘ and ⊘ denote entry-wise multiplication and division, and the column-wise step sizes σ and τ of Section 3.4 are applied to the corresponding columns.

The running-time complexity is still O(rmn) for each inner iteration. Note, moreover, that computing the largest singular value of H or W (required for the heuristic selection of step sizes every time we switch from one convex problem to the other) is of order O(r max{m, n}) and is thus negligible compared to the iteration cost.

Probabilistic latent semantic analysis (Hofmann, 1999) and latent Dirichlet allocation (Blei et al., 2003), generative probabilistic models for collections of discrete data, have been extensively used in text analysis. Their formulations are related to ours in Problem (5); we only need to include the additional constraint 1^⊤x = 1. In this sense, if we modify G, i.e., G(x) = ι_{1^⊤x = 1, x ⪰ 0}(x) + 1^⊤Kx, we can use Algorithm 1 to find the latent topics. It is important to mention that prox_{τG}(x) then no longer has a closed-form solution, but it can be computed efficiently with dedicated methods for orthogonal projection onto the simplex (Maculan and de Paula, 1989), as sketched below.
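A simple sort-based projection onto the simplex (O(q log q), a stand-in for the linear-time median-finding scheme of Maculan and de Paula, 1989) and the corresponding prox of the modified G might look as follows (a sketch; names are ours):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {x : x >= 0, 1^T x = 1} (sort-based sketch)."""
    u = np.sort(z)[::-1]                     # sorted in decreasing order
    cssv = np.cumsum(u)
    k = np.arange(1, z.size + 1)
    rho = np.nonzero(u - (cssv - 1.0) / k > 0)[0][-1]
    theta = (cssv[rho] - 1.0) / (rho + 1.0)
    return np.maximum(z - theta, 0.0)

def prox_tau_G_simplex(x, tau, K):
    """prox of G(x) = indicator{1^T x = 1, x >= 0} + 1^T K x (sketch)."""
    return project_simplex(x - tau * (K.T @ np.ones(K.shape[0])))
```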
4 Experiments

The proposed technique was tested on synthetic data, on the CBCL face images database, and on a music excerpt from a well-known jazz song by Louis Armstrong & His Hot Five. The performance of the proposed first-order primal-dual algorithm (FPA) was compared against the traditional multiplicative updates algorithm (MUA) of Lee and Seung (2000) and the more recent alternating direction method of multipliers (ADMM) of Sun and Févotte (2014). The three techniques were implemented in Matlab.

4.1 Synthetic data

ND problem

A given matrix V of size n = 200 and m = 1000 is randomly generated from the uniform distribution U(0, 1), and the low-rank element is set to r = 15. Initializations W_0 and H_0 are defined by the magnitude of standard normal samples plus a small offset. To examine the accuracy of our method, we first apply Algorithm 2 to convex ND problems for fixed values of n, m and r, solving Problems (3) and (4) separately. For both problems, we set the number of iterations of the traditional MUA and of the ADMM to 1200, as well as for the proposed FPA. Optimal factors W* and H* are obtained by running 5000 iterations of the MUA, starting from W_0 and H_0. For the first ND problem, we fix H to H* and estimate W starting from W_0; for the second one, we fix W to W* and estimate H from H_0. The optimal regularization parameter of ADMM, the tuning parameter that controls the convergence rate, is ρ = 0.15
(small values imply larger step sizes, which may result in faster convergence but also in instability). Figure 1 (a-b) presents the distance to optimum of MUA and ADMM, as well as of the primal and dual iterates of our technique, revealing strong duality. The FPA and ADMM algorithms converge to the same point, whereas the MUA exhibits slow convergence. Note, moreover, the significant advantage of our algorithm FPA, together with the fact that we set all step sizes automatically. In Figure 1 (c-d), we illustrate the distance to optimum versus time for the three methods.
Figure 1: ND on synthetic data. (a-b) Distance to optimum versus iteration number when estimating W given H* and H given W*; the distance to optimum is the difference between the objective values and the optimal value p* (for dual values, the difference between p* and the dual objective). (c-d) Distance to optimum versus time.

NMF problem

The setting is slightly different from that of the ND experiment: we increase the problem dimensions to n = 250, m = 2000 and r = 50, and repeat both previously presented experiments. For all methods,
we set the number of iterations to 3000. The parameter iter_ND, the number of iterations used to solve each ND problem, is set to 5. To have a fair comparison between algorithms, for FPA the number of iterations counts accesses to data, i.e., we use 5 iterations to solve Problem (3) (as well as for Problem (4)) and repeat this 600 times. The optimal regularization parameter of the ADMM is here ρ = 1.

In Figure 2 we present the objective function of the three algorithms for the non-convex Problem (2). The MUA initially decreases the objective rapidly, but as time increases it exhibits evident slow tail convergence. The FPA primal objective decreases dramatically in only a few seconds (a few iterations) and, furthermore, presents fast tail convergence, achieving the lowest objective value. The ADMM has poor initial performance, but then achieves an objective value similar to the one obtained by FPA. In order to show that FPA converges faster and at lower computational cost, we store the cost function values and computation times at each iteration. The total time required by the FPA was 190 s, versus 205 s for the ADMM and 473 s for the MUA. We then analyze the ADMM and MUA for the same running time of 190 s (the vertical dotted line in Figure 2 (b)), i.e., 2786 and 1211 iterations, respectively: the competing algorithms have a significantly larger cost function value for the same running time. The result of this experiment is illustrated in Figure 2 (b); the results in terms of objective function versus iteration number may be found in the Appendix.

Figure 2: NMF on synthetic data. (a) Objective function versus time; (b) zoomed version. Recall that the dual function is not presented due to the non-convexity of the NMF problem.

NMF problem with warm restarts

The problem dimension is n = 150, m = 2000 and r = 50. We run 3000 iterations of each method using initializations W_0 and H_0; we then increase the low-rank element to r = 100 and finally run 2000 more iterations. The idea is to use as initializations the estimates obtained after the first 3000 iterations, W_1 and H_1, taking into account that the low-rank element has changed. A trivial solution would be to append random entries so that W_1 and H_1 have the proper dimensions, but that increases the objective value, degrading the estimates. On the other hand, if we append zero entries so that W_1 and H_1 have the proper dimensions, we would be at a saddle point from which none of the algorithms can make progress.
However, if we set only one factor with zero entries, [W_1, c] ∈ R^{n×r} with c = 0, and the other one with non-zero values, [H_1; ν] ∈ R^{r×m}, we still maintain the same objective value and FPA can make progress. In this situation, the MUA cannot make progress either (zero entries are absorbing under multiplicative updates), so we instead try several values of c to run the algorithm. Figure 3 illustrates the proposed experiment. Notice that as c →
0, the MUA gets stuck in a poor local optimum, i.e., the one obtained with W_1 and H_1. ADMM behaves similarly to FPA and is therefore not displayed.

Figure 3: NMF with warm restarts on synthetic data. Value of the objective function at each iteration, for MUA with c = 1, 0.01 and 0.0001, and for FPA (iter_ND = 5) with c = 0.
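The padding used in this warm-restart experiment can be written explicitly (a sketch with stand-in matrices; c = 0 is used for FPA, while small positive values of c are tried for MUA, whose multiplicative updates cannot escape exact zeros):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r_old, r_new = 150, 2000, 50, 100
# Stand-ins for the estimates W_1, H_1 obtained after the first 3000 iterations:
W1, H1 = rng.uniform(size=(n, r_old)), rng.uniform(size=(r_old, m))

c = 1e-2  # one of the values tried for MUA; c = 0 for FPA
W_init = np.hstack([W1, np.full((n, r_new - r_old), c)])        # [W_1, c]
H_init = np.vstack([H1, rng.uniform(size=(r_new - r_old, m))])  # [H_1; nu]
```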
4.2 MIT-CBCL Face Database

We use the CBCL face images database (Sung, 1996), composed of m = 2429 images of n = 361 pixels each. The low-rank element is set to r = 49. Figure 4 shows samples from the database.

Figure 4: MIT-CBCL Face Database (sample images).

Our next experiment is to determine the optimal number of inner iterations for the current database. We therefore run 3000 iterations of FPA, using 3, 5, 10 and 15 iterations per ND problem; the MUA and ADMM (ρ = 50) algorithms are run as well. Figure 5 illustrates the decay of the objective function of the FPA, MUA and ADMM algorithms.

Figure 5: NMF on the CBCL database. Value of the objective function at each iteration when solving Problem (2), varying the number of iterations used to solve each ND problem; (a) full view, (b) zoomed version.

We observe that setting the number of inner iterations to 3 leads to over-alternation, whereas using 15 or more iterations results in an under-alternating method. Using 10 iterations gives good performance, but the best trade-off is obtained with 5 iterations. Therefore, we set iter_ND = 5, i.e., the number of iterations used to solve Problems (3) and (4), in all following results on the MIT-CBCL Face Database.

Figure 6: NMF on the CBCL face image database. (a) Objective function versus time; (b) zoomed version.
4.3 “My Heart (Will Always Lead Me Back to You)”

The last experiment is to decompose a 108-second music excerpt from “My Heart (Will Always Lead Me Back to You)”, recorded by Louis Armstrong & His Hot Five in the 1920s (Févotte et al., 2009). The time-domain recorded signal is illustrated in Figure 7.
Figure 7: Time-domain recorded signal.

The recording consists of a trumpet, a double bass, a clarinet, a trombone, and a piano. The recorded signal is original unprocessed mono material contaminated with noise. The signal was downsampled to 11025 Hz, yielding 1.19 × 10^6 samples. The Fourier transform of the recorded signal was computed using a sine-bell analysis window of length 23 ms with 50% overlap between two frames, leading to m = 9312 frames and n = 129 frequency bins. Additionally, we set r = 10. Figure 8 illustrates the resulting spectrogram.

Figure 8: Log-power spectrogram (frequency bins versus frames, in dB).

The decomposition of the song is produced by the three algorithms, initialized with the same random values W_0 and H_0. For a fair comparison, the number of iterations is set to 5000 for MUA and ADMM, while for our algorithm FPA we count accesses to data, i.e., we use 5 iterations per ND problem, repeated 1000 times. We also measure the computation time of the three techniques: FPA has a running time of 13 min, versus 15 min for the ADMM (ρ = 10) and 80 min for the MUA.

Figure 9 illustrates the evolution of the objectives of the three techniques over time. Initially the MUA attains the lowest objective value but, as previously discussed, as the number of iterations increases it exhibits evident slow tail convergence; at approximately 100 s it is caught by the FPA and shows no further substantial decrease, i.e., it gets stuck in a worse local optimum. The FPA converges to a slightly lower cost value, overtaking the MUA. Finally, the ADMM shows slow initial performance on this dataset, but later converges to a point similar to that of the other algorithms. The results in terms of objective function versus iteration number may be found in the Appendix.

Figure 9: NMF on an excerpt of Armstrong's song. (a) Objective function versus time; (b) zoomed version.
5 Conclusion

We have presented an alternating minimization technique for NMF with the KL divergence loss; this approach solves the convex ND problems with the FPA. Our approach demonstrated faster convergence than the traditional MUA of Lee and Seung (2000) and the more recent ADMM of Sun and Févotte (2014). The FPA introduces one new parameter, the number of iterations for each convex ND problem; experiments reveal that suitable values are mostly between 3 and 10 iterations, which makes tuning by the user trivial. Our algorithm therefore affords reasonable simplicity, and no further user intervention is required. Finally, an extension to latent Dirichlet allocation and probabilistic latent semantic indexing can easily be implemented using our proposed method, thus allowing one to go beyond the potential slowness of the expectation-maximization (EM) algorithm.
Acknowledgements
This work was partially supported by a grant from the European Research Council (ERC SIERRA 239993).
References
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.

C. Févotte and A. T. Cemgil. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of the 17th European Signal Processing Conference, 2009.

C. Févotte, N. Bertin, and J. L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21:793–830, 2009.

N. Gillis. Nonnegative Matrix Factorization: Complexity, Algorithms and Applications. PhD thesis, Université Catholique de Louvain, 2011.

N. Guduru. Text mining with support vector machines and non-negative matrix factorization algorithms. Master's thesis, University of Rhode Island, 2006.

T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, 1999.

D. Kim, S. Sra, and I. S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1:38–51, 2008.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, 2000.

A. Lefèvre, F. Bach, and C. Févotte. Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011.

C. J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19:2756–2779, 2007.

N. Maculan and G. Galdino de Paula. A linear-time median-finding algorithm for projecting a vector on the simplex of R^n. Operations Research Letters, 8:219–222, 1989.

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the Mumford-Shah functional. In Proceedings of the 12th International Conference on Computer Vision, 2009.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1997.

D. L. Sun and C. Févotte. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In Proceedings of the 39th International Conference on Acoustics, Speech and Signal Processing, 2014.

K. K. Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, Massachusetts Institute of Technology, 1996.

Y. Wang, Y. Jia, C. Hu, and M. Turk. Non-negative matrix factorization framework for face recognition. International Journal of Pattern Recognition and Artificial Intelligence, 19:495–511, 2005.

Appendix
Derivation of proximal operators
The proximal operators of F* and G, i.e., (I + σ∂F*)^{−1}(y) and (I + τ∂G)^{−1}(x), respectively, are defined as follows (Pock et al., 2009; Chambolle and Pock, 2011):

(I + σ∂F*)^{−1}(y) = argmin_v { ‖v − y‖² / (2σ) + F*(v) },   and   (I + τ∂G)^{−1}(x) = argmin_u { ‖u − x‖² / (2τ) + G(u) },

where ∂F* and ∂G are the subgradients of the convex functions F* and G.

To facilitate the computation of (I + σ∂F*)^{−1}(y), we consider Moreau's identity

y = (I + σ∂F*)^{−1}(y) + σ ( I + (1/σ)∂F )^{−1}( y/σ ) = prox_{σF*}(y) + σ prox_{F/σ}(y/σ).

Let us consider the variable v ∈ Y. Using Moreau's identity, we can compute

prox_{σF*}(y) = y − σ prox_{F/σ}(y/σ)
  = y − σ argmin_v { (σ/2) ‖v − y/σ‖² + F(v) }
  = y − σ argmin_v { (σ/2) ‖v − y/σ‖² − ∑_{i=1}^{p} a_i ( log(v_i/a_i) + 1 ) }
  = y − σ argmin_v { ∑_{i=1}^{p} (σ/2) v_i² − y_i v_i − a_i log(v_i) },

where constants independent of v have been dropped. Applying first-order conditions to obtain the minimum:

d/dv_i { (σ/2) v_i² − y_i v_i − a_i log(v_i) } = 0   ⟹   σv_i − y_i − a_i/v_i = 0   ⟹   σv_i² − y_i v_i − a_i = 0
  ⟹   v_i = ( y_i ± √( y_i² + 4σa_i ) ) / (2σ)   ⟹   v = ( y + √( y ∘ y + 4σa ) ) / (2σ),

taking the positive root since v ≻ 0. Finally, the proximal operator is

prox_{σF*}(y) = y − σv = ½ ( y − √( y ∘ y + 4σa ) ).

For the second proximal operator, we consider u ∈ X and compute

prox_{τG}(x) = argmin_{u ⪰ 0} { ‖u − x‖² / (2τ) + G(u) } = argmin_{u ⪰ 0} { ‖u − x‖² / (2τ) + ∑_{i=1}^{p} K_i u } = ( x − τK^⊤1 )_+ .
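The closed form for prox_{σF*} is easy to sanity-check numerically; below is a small sketch comparing it against random perturbations of the prox objective (all names and tolerances are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma = 5, 0.7
a = rng.uniform(0.5, 2.0, p)
y = -rng.uniform(0.5, 2.0, p)  # dual points live in the negative orthant

def prox_objective(v):
    # ||v - y||^2 / (2 sigma) + F*(v), with F*(v) = -sum_i a_i log(-v_i)
    if np.any(v >= 0):
        return np.inf
    return np.sum((v - y) ** 2) / (2.0 * sigma) - np.sum(a * np.log(-v))

v_star = 0.5 * (y - np.sqrt(y * y + 4.0 * sigma * a))  # closed form derived above
for _ in range(1000):
    v = v_star + 1e-3 * rng.standard_normal(p)
    assert prox_objective(v_star) <= prox_objective(v) + 1e-9
print("closed-form prox verified against random perturbations")
```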
Synthetic data: additional results

NMF problem
A given matrix V of size n = 250 and m = 2000 is randomly generated from the uniform distribution U(0, 1), and the low-rank element is set to r = 50. For the three methods, we set the number of iterations to 3000 and iter_ND to 5. The optimal tuning parameter of the ADMM is here ρ = 1. In Figure 10 we present the objective function versus iteration number of the three algorithms for the non-convex NMF problem.

Figure 10: NMF on synthetic data. (a) Objective versus iteration number; (b) zoomed version. It is important to recall that the dual function is not presented due to the non-convexity of the NMF problem.
MIT-CBCL Face Database
ND problem
We solve convex ND problems for fixed values of n, m and r, setting the number of iterations of all algorithms to 1500. Optimal factors W* and H* are obtained by running 5000 iterations of the MUA. The optimal tuning parameter of the ADMM is here ρ = 0.1.
Figure 11 (a-b) presents the distance to optimum of the MUA and ADMM, as well as of the primal and dual iterates of our technique, revealing strong duality in all experiments. In Figure 11 (c-d), we illustrate the distance to optimum versus time for the three methods.
NMF problem
For all methods, we set the number of iterations to 3000 and iter_ND to 5. The optimal tuning parameter of the ADMM is here ρ = 50. In Figure 12 we present the objective function versus iteration number of the three algorithms.

Features learned from the CBCL face image database
The features learned from the CBCL face image database with the three algorithms are presented in Figure 13. The figure reveals the parts-based representation learned by each algorithm, i.e., W.
Figure 11: Distance to optimum versus (a-b) iteration number and (c-d) time, when estimating W given H* and H given W*.

Figure 12: NMF on the CBCL face image database. (a) Objective versus iteration number; (b) zoomed version.

“My Heart (Will Always Lead Me Back to You)”: additional results
ND problem
We solve convex ND problems for fixed values of n, m and r, setting the number of iterations of all algorithms to 2000. Optimal factors W* and H* are obtained by running 5000 iterations of the MUA. The optimal tuning parameter of the ADMM is here ρ = 0.5.
Figure 14 (a-b) presents the distance to optimum of the MUA and ADMM, as well as of the primal and dual iterates of our technique, revealing strong duality in all experiments. In Figure 14 (c-d), we illustrate the distance to optimum versus time for the three methods.
NMF problem
For all methods, we set the number of iterations to 5000 and iter_ND to 5. The optimal tuning parameter of the ADMM is here ρ = 10. In Figure 15 we present the objective function versus iteration number of the three algorithms.

Figure 13: Features learned from the CBCL face image database. (a) MUA; (b) ADMM; (c) FPA.

Figure 14: Distance to optimum versus (a-b) iteration number and (c-d) time.

Figure 15: NMF on an excerpt of Armstrong's song. (a) Objective versus iteration number; (b) zoomed version.
Features learned from the song
The decomposition of the song by Louis Armstrong and his band obtained with the proposed FPA is presented in Figure 16, revealing the parts-based representation learned by the algorithm, i.e., W. The time-domain signal is recovered by Wiener filtering (Févotte et al., 2009).

Figure 16: W, the parts-based representation learned by the FPA (frequency in Hz versus frequency bins, one panel per component).