A Unified Convergence Analysis of the Multiplicative Update Algorithm for Regularized Nonnegative Matrix Factorization
Renbo Zhao, Student Member, IEEE, and Vincent Y. F. Tan, Senior Member, IEEE
Abstract—The multiplicative update (MU) algorithm has been extensively used to estimate the basis and coefficient matrices in nonnegative matrix factorization (NMF) problems under a wide range of divergences and regularizers. However, theoretical convergence guarantees have only been derived for a few special divergences without regularization. In this work, we provide a conceptually simple, self-contained, and unified proof for the convergence of the MU algorithm applied on NMF with a wide range of divergences and regularizers. Our main result shows that the sequence of iterates (i.e., pairs of basis and coefficient matrices) produced by the MU algorithm converges to the set of stationary points of the non-convex NMF optimization problem. Our proof strategy has the potential to open up new avenues for analyzing similar problems in machine learning and signal processing.
Index Terms—Nonnegative Matrix Factorization, Multiplicative Update Algorithm, Convergence Analysis, Nonconvex Optimization, Stationary Points
I. INTRODUCTION
Nonnegative Matrix Factorization (NMF) has been a popular dimensionality reduction technique in recent years, due to its non-subtractive and parts-based interpretation of the learned basis [2]. In the general formulation of NMF, given a nonnegative matrix $V \in \mathbb{R}^{F \times N}_+$, one seeks to find a nonnegative basis matrix $W \in \mathbb{R}^{F \times K}_+$ and a nonnegative coefficient matrix $H \in \mathbb{R}^{K \times N}_+$ such that $V \approx WH$. To find such a pair of matrices, a popular approach is to solve the optimization problem

$$\min_{W \ge 0,\; H \ge 0} \;\big[\, \ell(W, H) \triangleq D(V \,\|\, WH) \,\big], \tag{1}$$

where $D(\cdot\|\cdot)$ denotes the divergence (or distance) between two nonnegative matrices and $W \ge 0$ (and $H \ge 0$) denotes entrywise inequality. In the NMF literature, many algorithms have been proposed to solve (1), including multiplicative updates (MU) [3]–[6], block principal pivoting (BPP) [7], projected gradient descent (PGD) [8], and the alternating direction method of multipliers (ADMM) [9], [10]. However, some algorithms only solve (1) for certain divergences $D(\cdot\|\cdot)$. For example, the BPP algorithm is only applicable to the squared-Frobenius loss $\|V - WH\|_F^2$. Among all the algorithms, the MU algorithm has arguably the widest applicability: it has been used to solve (1) when $D(\cdot\|\cdot)$ belongs to the family of $\alpha$-divergences [11], $\beta$-divergences [5], $\gamma$-divergences [12], etc.

An abridged version of this paper was presented at ICASSP 2017 [1]. Renbo Zhao and Vincent Y. F. Tan are with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (NUS). Renbo Zhao is also with the Department of Industrial and Systems Engineering, NUS. They are supported in part by the NUS Young Investigator Award (grant number R-263-000-B37-133).
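As a concrete instance of (1), the classic updates of Lee and Seung [3] for the squared-Frobenius loss can be stated in a few lines of NumPy. The sketch below is ours (the function name, initialization scheme, and the small `eps` safeguard are illustrative additions, not part of [3]):

```python
import numpy as np

def mu_frobenius(V, K, n_iter=200, eps=1e-12, seed=0):
    """Classic multiplicative updates [3] for min_{W,H >= 0} ||V - WH||_F^2.
    `eps` guards against division by zero; it is an implementation
    convenience and not part of the original update rules."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.uniform(0.1, 1.0, size=(F, K))  # strictly positive initialization
    H = rng.uniform(0.1, 1.0, size=(K, N))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # H <- H . (W^T V) / (W^T W H)
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # W <- W . (V H^T) / (W H H^T)
    return W, H
```

Each update multiplies the current iterate by a nonnegative factor, so nonnegativity is preserved automatically; the algorithms derived in Section III generalize exactly this mechanism.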
Despite the popularity of the MU algorithm, its convergence properties have not been studied systematically when the divergence is not the standard squared-Frobenius loss and when there are regularizers on $W$ and $H$. To describe this problem precisely, let $\{(W_t, H_t)\}_{t=1}^{\infty}$ be the sequence of pairs of basis and coefficient matrices generated by the MU algorithm, where $t \ge 1$ denotes the iteration index. Many previous works [3], [5], [11] showed that the sequence of (nonnegative) objective values $\{\ell(W_t, H_t)\}_{t=1}^{\infty}$ in the MU algorithm is non-increasing and hence converges. However, the convergence of objective values does not imply the convergence of $\{(W_t, H_t)\}_{t=1}^{\infty}$, whose limit points (assuming they exist) serve as natural candidates for the output of the MU algorithm. The limit points of $\{(W_t, H_t)\}_{t=1}^{\infty}$ tend to be empirically appealing, i.e., they represent each column of $V$ (i.e., a data sample) as a linear combination of $K$ nonnegative basis vectors in a meaningful manner [2], [13]. As such, the convergence properties of $\{(W_t, H_t)\}_{t=1}^{\infty}$, and especially the optimality of its limit points, are of theoretical and practical importance.

A. Related Works
Due to the nonconvex nature of (1), algorithms that are guaranteed to converge to the global minima of (1) are in general out of reach. Indeed, [14] has shown that (1) is NP-hard. To ameliorate this situation, a line of works that imposes structural assumptions on the data matrix $V$, such as separability [15], [16], has emerged. Under such an assumption, polynomial-time algorithms [16]–[19] have been proposed to find $W$ and $H$ such that $WH$ exactly equals $V$. However, for many applications in signal processing and machine learning, the data matrix $V$ does not strictly satisfy the aforementioned assumptions. In such scenarios, exact (nonnegative) factorization of $V$ is generally infeasible. As a result, given a general nonnegative matrix $V$, many works [3]–[10] only aim to reduce (or preserve) the function value of $\ell(\cdot,\cdot)$ at each iteration $t$, hoping that the sequence $\{(W_t, H_t)\}_{t=1}^{\infty}$ will converge to a limit point that is "reasonably good". To understand the theoretical convergence properties of this sequence, many works have been conducted [20]–[26]. In particular, for the MU algorithm, some representative works include [20]–[23]. When the divergence $D(V\|WH) = \|V - WH\|_F^2$, Lin [20] and Gillis and Glineur [21] modified the algorithm originally proposed in [3] (in different ways), and proved the convergence of the modified algorithms to the set of stationary points of (1). However, their approaches cannot be easily generalized to other divergences, e.g., the (generalized) Kullback-Leibler (KL) divergence. To overcome this restriction, the authors of [22] and [23] modified the nonnegativity constraints on $W$ and $H$ in (1) to $W \ge \epsilon$ and $H \ge \epsilon$, for some $\epsilon > 0$. Accordingly, they developed algorithms for this "positive matrix factorization" problem [27] with a wider class of divergences (including the $\beta$-divergences) and showed that their algorithms converge to the stationary points of the new problem. However, the positivity constraints on $W$ and $H$ are restrictive and change the original NMF problem in (1) substantially. Therefore, the convergence analyses in [22] and [23] are not applicable to the MU algorithms for the canonical NMF problem (in which $W$ and $H$ are allowed to have zero entries), which is of interest in many applications. In another related work [28], the authors analyzed the stability of local minima of (1) under the MU algorithm, where $D(\cdot\|\cdot)$ belongs to the class of $\beta$-divergences. However, the stability analysis therein does not yield definite answers on whether (and when) the MU algorithm converges to any local minimum (or even stationary point) of $\ell(\cdot,\cdot)$ if the algorithm is started at an arbitrary (feasible) starting point.

B. Motivations and Main Contributions
In this work, we analyze the convergence of the MU algorithm for regularized NMF problems with a general class of divergences, termed $h$-divergences (see Definition 1), in a unified manner. The set of $h$-divergences includes many important classes of divergences, including (but not limited to) the $\alpha$ ($\alpha \ne 0$), $\beta$, $\gamma$, $\alpha$-$\beta$ and Rényi divergences. For each class of divergences, the corresponding MU algorithm has been proposed in the literature [5], [11], [12], [29], but without convergence guarantees. In addition, we also include regularizers on $W$ and $H$ in the objective function. The purpose of including regularizers is twofold: (i) convenience of mathematical analysis and (ii) increased generality of the problem setting. Although many MU algorithms have been proposed for NMF problems with various regularizers [30]–[33], thus far, the convergence analyses of these algorithms are still lacking. The absence of theoretical convergence guarantees for the MU algorithms in the abovementioned cases thus becomes a major motivation of our work.

Our contributions consist of two parts. First, we develop a unified MU algorithm for the NMF problem (1) with any (weighted) $h$-divergence and $\ell_{1,1}$ (and Tikhonov) regularizers on $W$ and $H$. Our algorithm subsumes many existing algorithms in previous works [3]–[6] as special cases. From our update rules, we discover that minimizing $D(V\|WH)$ with the $\ell_{1,1}$ regularization on $W$ and $H$ corresponds to a stability-preserving heuristic commonly employed in implementing the MU algorithms. (See Remark 8 for details.) Therefore, this justifies the need to incorporate $\ell_{1,1}$ regularization into the NMF objective. Second, we conduct a novel convergence analysis for this unified MU algorithm, by making innovative use of the recently proposed block majorization-minimization framework [34], [35]. Our results show that the sequence of iterates $\{(W_t, H_t)\}_{t=1}^{\infty}$ generated from our MU algorithm has at least one limit point, and any limit point of this sequence is a stationary point of (1). (See Definition 4 for the definition of convergence of a sequence to a set.) Thus, for the first time, it is shown that the host of MU algorithms in the NMF literature [5], [11], [12], [29]–[33] enjoys strong theoretical convergence guarantees.

C. Notations
In this paper we use $\mathbb{R}_+$, $\mathbb{R}_{++}$ and $\mathbb{N}$ to denote the set of nonnegative real numbers, positive real numbers and natural numbers (excluding zero) respectively. For $n \in \mathbb{N}$, we define $[n] \triangleq \{1, 2, \ldots, n\}$. We use boldface capital letters, boldface lowercase letters and plain lowercase letters to denote matrices, vectors and scalars respectively. For a vector $x$, we denote its $i$-th entry, $\ell_1$ and $\ell_2$ norms as $x_i$, $\|x\|_1$ and $\|x\|_2$ respectively. For a matrix $X$, we denote its $(i,j)$-th entry as $x_{ij}$ and its $\ell_{1,1}$ norm as $\|X\|_{1,1} \triangleq \sum_{ij} |x_{ij}|$. In addition, for a scalar $\delta \in \mathbb{R}$, we use $X = \delta$ and $X \ge \delta$ to denote entrywise equality and inequality. For matrices $X$ and $Y$, we use $X \odot Y$ and $\langle X, Y \rangle$ to denote their Hadamard product and Frobenius inner product respectively. We use $\overset{c}{=}$ to denote equality up to additive constants. In this work, technical lemmas (whose indices begin with 'T') appear in Appendix D.

II. PROBLEM FORMULATION
A. Definition of h-Divergences

Before introducing the notion of $h$-divergences, we first define an important function

$$h(\sigma, t) \triangleq \begin{cases} (\sigma^t - 1)/t, & t \ne 0 \\ \log \sigma, & t = 0, \end{cases} \tag{2}$$

where for any $t \in \mathbb{R}$, the domain of $\sigma$ is given by the natural domain of $\sigma \mapsto h(\sigma, t)$, denoted as $\Xi_t$. To be more explicit, $\Xi_t = [0, \infty)$ if $t > 0$ and $\Xi_t = (0, \infty)$ if $t \le 0$.

Definition 1 ($h$-divergences [6, Section IV]). Given any $V \in \mathbb{R}^{F\times N}_+$, $D(V\|\cdot): \mathbb{R}^{F\times N}_+ \to \mathbb{R}_+$ is called an $h$-divergence if for any $\widehat V \in \mathbb{R}^{F\times N}_+$, there exists a constant $P \ge 1$ such that

$$D(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{p=1}^P \mu_p\, h\Big( \sum_{i=1}^F \sum_{j=1}^N \nu_{pij}\, h(\hat v_{ij}, \zeta_p),\; \xi_p \Big), \tag{3}$$

where '$\overset{c}{=}$' omits constants that are independent of $\widehat V$, and $\mu_p$, $\nu_{pij}$, $\zeta_p$ and $\xi_p$ are all real constants independent of $\widehat V$. In addition, $\{\zeta_p\}_{p=1}^P$ are distinct, and for any $i \in [F]$ and $j \in [N]$, there exists $p' \in [P]$ such that $\nu_{p'ij} \ne 0$.

Remark 1 (Examples of $h$-divergences). By choosing the constants $\mu_p$, $\nu_{pij}$, $\zeta_p$ and $\xi_p$ in different ways, we obtain different $h$-divergences. The $h$-divergences subsume many important classes of divergences, including the families of $\alpha$ ($\alpha \ne 0$), $\beta$, $\gamma$, $\alpha$-$\beta$ and Rényi divergences [5], [11], [12], [29]. In particular, some important instances of $h$-divergences include the Hellinger, Itakura-Saito (IS), KL and squared-Frobenius divergences. Each instance can be obtained by appropriately choosing the constants $\{\mu_p\}_p$, $\{\nu_{pij}\}_{p,i,j}$, $\{\zeta_p\}_p$ and $\{\xi_p\}_p$. See Remark 2 for an example of how (3) yields the KL divergence with an appropriate set of parameters.

Remark 2 (Separable $h$-divergences). When $\mu_p = \xi_p = 1$ for all $p \in [P]$, $D(V\|\cdot)$ is separable across the entries of $\widehat V$, i.e.,

$$D(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{i=1}^F \sum_{j=1}^N \sum_{p=1}^P \nu_{pij}\, h(\hat v_{ij}, \zeta_p). \tag{4}$$

We term such a divergence a separable $h$-divergence. In particular, any member of the classes of $\alpha$- (for $\alpha \ne 0$) or $\beta$-divergences is separable. For example, taking $P = 2$, $\nu_{1ij} = -v_{ij}$, $\zeta_1 = 0$, $\nu_{2ij} = 1$ and $\zeta_2 = 1$, we obtain the KL divergence, which belongs to both classes ($\alpha$- and $\beta$-divergences).

Remark 3 (Weighted $h$-divergences). For some special instances in the class of $h$-divergences, such as the squared Euclidean distance and the KL divergence, a weighted version has been proposed and studied in the literature [36]–[40]. Based on Definition 1, we can also define weighted $h$-divergences, which subsume the aforementioned weighted divergences as special cases. Given a nonnegative matrix $M \in \mathbb{R}^{F\times N}_+$, define its support $\Omega(M) \triangleq \{(i,j) \in [F] \times [N] : m_{ij} > 0\}$. Also, for any $p \in [P]$, define $\nu'_{pij} \triangleq \nu_{pij}\, m_{ij}$. For any $h$-divergence $D(V\|\cdot)$, define its $M$-weighted version $D_M(V\|\cdot): \mathbb{R}^{F\times N}_+ \to \mathbb{R}_+$ as

$$D_M(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{p=1}^P \mu_p\, h\Big( \sum_{(i,j) \in \Omega(M)} \nu'_{pij}\, h(\hat v_{ij}, \zeta_p),\; \xi_p \Big). \tag{5}$$

Comparing (5) to (3), we observe that the only changes are the constants $\{\nu_{pij}\}_{p,i,j}$. These constants are independent of $\widehat V$. Therefore, our algorithms and convergence analysis developed for the $h$-divergences are also applicable to their weighted counterparts. Indeed, the weighted $h$-divergences in (5) are more general than the $h$-divergences in (3), since by choosing $M$ such that $m_{ij} = 1$ for any $i$ and $j$, we recover (3) from (5).
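To make the constants in Remark 2 concrete, the following display (our own verification, not part of the original text) checks that the choice $P = 2$, $\nu_{1ij} = -v_{ij}$, $\zeta_1 = 0$, $\nu_{2ij} = 1$, $\zeta_2 = 1$ in (4) recovers the KL divergence, using $h(\hat v, 0) = \log \hat v$ and $h(\hat v, 1) = \hat v - 1$:

$$\sum_{i=1}^F \sum_{j=1}^N \big[ -v_{ij} \log \hat v_{ij} + (\hat v_{ij} - 1) \big] \;\overset{c}{=}\; \sum_{i=1}^F \sum_{j=1}^N \Big[ v_{ij} \log \frac{v_{ij}}{\hat v_{ij}} - v_{ij} + \hat v_{ij} \Big] = D_{\mathrm{KL}}(V \,\|\, \widehat V),$$

where the two sides differ only by the constants $v_{ij}\log v_{ij} - v_{ij} + 1$, which are independent of $\widehat V$, as (3) requires.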
B. Optimization Problem

For convenience, first define two functions $\phi_1(\cdot) \triangleq \|\cdot\|_{1,1}$ and $\phi_2(\cdot) \triangleq \|\cdot\|_F^2$. The first and second functions are known as the $\ell_{1,1}$ and Tikhonov regularizers respectively. Accordingly, for any $V \in \mathbb{R}^{F\times N}_+$, define the regularized objective function

$$\ell(W, H) \triangleq D(V \,\|\, WH) + \sum_{i=1}^{2} \lambda_i\, \phi_i(W) + \sum_{j=1}^{2} \tilde\lambda_j\, \phi_j(H), \tag{6}$$

where $W \in \mathbb{R}^{F\times K}_+$, $H \in \mathbb{R}^{K\times N}_+$, $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$. The optimization problem in this work can be stated succinctly as

$$\min_{W \in \mathbb{R}^{F\times K}_+,\; H \in \mathbb{R}^{K\times N}_+} \ell(W, H). \tag{7}$$

Remark 4. The above regularizers involving $\phi_1(\cdot)$ and $\phi_2(\cdot)$ are collectively known as the elastic-net regularizer [41] in the literature. Both $\ell_{1,1}$ and Tikhonov regularizers have been widely employed in the NMF literature. Specifically, the $\ell_{1,1}$ regularizer promotes element-wise sparsity on the basis matrix $W$ and coefficient matrix $H$ [30], thereby enhancing the interpretability of both the basis vectors and the conic combination model in NMF. The Tikhonov regularizer promotes smoothness on $W$ and $H$ and also prevents overfitting [42].

Remark 5 (Positivity of $\lambda_1$ and $\tilde\lambda_1$). We require both $\lambda_1$ and $\tilde\lambda_1$ to be positive for both convenience of analysis and numerical stability. Specifically, the inclusion of $\ell_{1,1}$ regularization on $W$ and $H$ ensures that both $W \mapsto \ell(W, H)$ and $H \mapsto \ell(W, H)$ are coercive, a property that we will leverage in our analysis. In addition, as will be shown in Proposition 2, the positivity of $\lambda_1$ and $\tilde\lambda_1$ prevents the denominators in the multiplicative update rules of $W$ and $H$ from being arbitrarily close to zero, thereby ensuring that the updates in the MU algorithm are numerically stable.
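For monitoring purposes, the regularized objective (6) is easy to evaluate directly. The sketch below is our own (illustrative names; the `eps` guard is a numerical convenience, not part of (6)) and instantiates (6) with the KL divergence of Remark 2:

```python
import numpy as np

def regularized_objective(V, W, H, lam1, lam2, lam1_t, lam2_t, eps=1e-12):
    """Evaluate (6) with D = KL divergence:
    D(V||WH) + lam1*||W||_{1,1} + lam2*||W||_F^2
             + lam1_t*||H||_{1,1} + lam2_t*||H||_F^2."""
    V_hat = W @ H
    kl = np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)
    reg_W = lam1 * np.abs(W).sum() + lam2 * np.square(W).sum()
    reg_H = lam1_t * np.abs(H).sum() + lam2_t * np.square(H).sum()
    return kl + reg_W + reg_H
```

Since the iterates of Algorithm 1 below remain entrywise positive (Remark 9), plotting this quantity over iterations should reproduce the monotone decrease guaranteed by the surrogate construction.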
III. ALGORITHMS

In this section we first define the notions of surrogate functions and first-order surrogate functions. Next, we present a general framework for deriving the MU algorithm for the problem (7), based on majorization-minimization [6]. This framework is sufficient for our convergence analysis. However, as side contributions, we also present a systematic procedure to construct first-order surrogate functions of $(W, H) \mapsto \ell(W, H)$ for $W$ (resp. $H$) and to derive the specific multiplicative update rules for $W$ (resp. $H$). Finally, we discuss how to apply these techniques to the family of $\alpha$-divergences and how to extend the techniques to the dual KL divergence.

A. First-Order Surrogate Functions
Definition 2. Given $n$ finite-dimensional real Euclidean spaces $\{\mathcal{X}_i\}_{i=1}^n$, define $\mathcal{X} \triangleq \prod_{i=1}^n \mathcal{X}_i$. For any $x \in \mathcal{X}$, denote its $n$-block form as $(x_1, \ldots, x_n)$, where for any $i \in [n]$, $x_i \in \mathcal{X}_i$ denotes the $i$-th block of $x$. Consider a differentiable function $f: \mathcal{X} \to \mathbb{R}$. For any $i \in [n]$, a first-order surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$, denoted as $F_i(\cdot|\cdot): \mathcal{X}_i \times \mathcal{X} \to \mathbb{R}$, satisfies the following five properties:

(P1) $F_i(\tilde x_i \,|\, \tilde x) = f(\tilde x)$, for any $\tilde x \in \mathcal{X}$;
(P2) $F_i(x_i \,|\, \tilde x) \ge f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)$, for any $(x_i, \tilde x) \in \mathcal{X}_i \times \mathcal{X}$;
(P3) $F_i(\cdot|\cdot)$ is differentiable on $\mathcal{X}_i \times \mathcal{X}$, and for any $\tilde x \in \mathcal{X}$, there exists a function $g(\cdot \,|\, \tilde x): \mathcal{X}_i \to \mathbb{R}$ such that $\nabla F_i(\cdot \,|\, \tilde x) = g(\cdot / \tilde x_i \,|\, \tilde x)$ on $\mathcal{X}_i$;
(P4) $\nabla_{x_i} F_i(x_i \,|\, \tilde x)\big|_{x_i = \tilde x_i} = \nabla_{x_i} f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)\big|_{x_i = \tilde x_i}$, for any $\tilde x \in \mathcal{X}$;
(P5) $F_i(\cdot \,|\, \tilde x)$ is strictly convex on $\mathcal{X}_i$, for any $\tilde x \in \mathcal{X}$.

If $F_i(\cdot|\cdot)$ only satisfies properties (P1) to (P3), then it is called a surrogate function of $f$ for $x_i$. Note that in general, (first-order) surrogate functions may not be unique.

Remark 6 (Roles of (P1) to (P5)). From (P5), we know the minimizer of $F_i(\cdot \,|\, \tilde x)$ over $\mathcal{X}_i$ is unique. Let us denote it as $x_i^*$. From both (P1) and (P2), we can deduce that $f(\tilde x_1, \ldots, x_i^*, \ldots, \tilde x_n) \le f(\tilde x)$. In addition, (P3) ensures that minimizing $x_i \mapsto F_i(x_i \,|\, \tilde x)$ over $\mathcal{X}_i$ yields a multiplicative update for the $i$-th block $x_i$. Finally, (P4) ensures that for any $\tilde x \in \mathcal{X}$ and $i \in [n]$, the gradient of $x_i \mapsto F_i(x_i \,|\, \tilde x)$ agrees with that of $x_i \mapsto f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)$ at $\tilde x_i$. This property will be leveraged in our convergence analysis.

Remark 7. With a slight abuse of terminology, we shall term any function $x_i \mapsto \widetilde F_i(x_i \,|\, \tilde x)$ a (first-order) surrogate function if it differs from $x_i \mapsto F_i(x_i \,|\, \tilde x)$ by a constant that is independent of $x_i$. This is because such a constant difference does not affect the minimizer(s) of $F_i(\cdot \,|\, \tilde x)$ or $\widetilde F_i(\cdot \,|\, \tilde x)$ over $\mathcal{X}_i$, or their gradients w.r.t. $x_i$. Consequently, it does not affect the resulting multiplicative updates for $x_i$ or the convergence analysis in Section IV.
Algorithm 1: General Framework for Multiplicative Updates

Input: Data matrix $V \in \mathbb{R}^{F\times N}_+$, latent dimension $K$, regularization weights $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$, maximum number of iterations $t_{\max}$
Initialize $W_0 \in \mathbb{R}^{F\times K}_{++}$, $H_0 \in \mathbb{R}^{K\times N}_{++}$
For $t = 0, 1, \ldots, t_{\max} - 1$:
  $W_{t+1} := \arg\min_{W \in \mathbb{R}^{F\times K}_+} G_1(W \,|\, W_t, H_t)$  (8)
  $H_{t+1} := \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, W_{t+1}, H_t)$  (9)
End
Output: Basis matrix $W_{t_{\max}}$ and coefficient matrix $H_{t_{\max}}$

B. General Framework for Multiplicative Updates
The general framework for deriving the MU algorithm for (7) is shown in Algorithm 1, where $G_1(\cdot|\cdot)$ and $G_2(\cdot|\cdot)$ denote the first-order surrogate functions of $(W, H) \mapsto \ell(W, H)$ for $W$ and $H$ respectively. As will be shown in Proposition 2, the minimization steps in (8) and (9) indeed result in multiplicative updates for $W$ and $H$ respectively.
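In code, Algorithm 1 is nothing more than the alternation of two surrogate minimizations. The skeleton below is a sketch with our own names; `update_W` and `update_H` stand for the closed-form surrogate minimizers derived in Proposition 2:

```python
def mu_framework(V, W0, H0, update_W, update_H, t_max=100):
    """Skeleton of Algorithm 1. `update_W(V, W, H)` and `update_H(V, W, H)`
    must return the (unique, by (P5)) minimizers of G1(. | W, H) and
    G2(. | W, H) over the respective nonnegative orthants."""
    W, H = W0.copy(), H0.copy()
    for _ in range(t_max):
        W = update_W(V, W, H)  # step (8): W_{t+1} minimizes G1(. | W_t, H_t)
        H = update_H(V, W, H)  # step (9): H_{t+1} minimizes G2(. | W_{t+1}, H_t)
    return W, H
```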
C. Construction of First-Order Surrogate Functions

We only focus on constructing a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, and when $D(V\|\cdot)$ is a separable $h$-divergence. By the symmetry between $W$ and $H$, such a surrogate function for $H$ can be easily obtained (by taking transposition). In addition, since $h(\cdot, t)$ is either convex or concave for any $t \in \mathbb{R}$ (cf. Lemma T-1), $D(V\|\cdot)$ in (3) is a difference-of-convex function. Therefore, by using either Jensen's inequality or a first-order Taylor expansion, a nonseparable $h$-divergence can be easily converted to a separable one. Such conversion techniques are common in the NMF literature [5], [6], [43].
Proposition 1. For any $V \in \mathbb{R}^{F\times N}_+$, $W \in \mathbb{R}^{F\times K}_+$, $\widetilde W \in \mathbb{R}^{F\times K}_{++}$, $\widetilde H \in \mathbb{R}^{K\times N}_{++}$, and $\vartheta_1, \vartheta_2 \in \mathbb{R}$, define $\widetilde Z \triangleq (\widetilde W, \widetilde H)$ and a function

$$G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \Big[ (s^+_{ik} + \lambda_1)\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) + 2\lambda_2\, \tilde w_{ik}^2\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) - s^-_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_1\Big) \Big], \tag{10}$$

where $S^+$ and $S^-$ are the sums of positive and unsigned negative terms in $\nabla_W D(V \,\|\, W\widetilde H)\big|_{W = \widetilde W}$ respectively (cf. [3]). (Note that the decomposition of $\nabla_W D(V \,\|\, W\widetilde H)\big|_{W = \widetilde W}$ into positive and negative terms is not unique. However, Proposition 1, and hence Proposition 2, holds for any such decomposition.) Then for any separable $h$-divergence $D(V\|\cdot)$, there exist real numbers $\vartheta_1 < \vartheta_2$ such that $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ with respect to the variable $W$.

Proof. See Appendix A. □
D. Derivation of Multiplicative Updates
Based on Proposition 1, by setting $\nabla_W G_1(W \,|\, \widetilde Z)$ to zero, we obtain the multiplicative update for $W$.

Proposition 2. Let $V$, $D(V\|\cdot)$, $\vartheta_1$, $\vartheta_2$, $S^+$ and $S^-$ be given as in Proposition 1. For any $t \ge 0$, let $\widetilde W$ (resp. $\widetilde H$) denote the basis (resp. coefficient) matrix at iteration $t$, and $W$ (resp. $H$) denote the basis (resp. coefficient) matrix at iteration $t+1$. For any $(i, k) \in [F] \times [K]$, the multiplicative update corresponding to (8) in Algorithm 1 admits the form

$$w_{ik} := \tilde w_{ik} \left( \frac{s^-_{ik}}{s^+_{ik} + 2\lambda_2 \tilde w_{ik} + \lambda_1} \right)^{1/(\vartheta_2 - \vartheta_1)}. \tag{11}$$

Remark 8 (Numerical stability). In (11), the presence of $\lambda_1 > 0$ ensures numerical stability, i.e., it prevents the denominator of the multiplicative factor from being arbitrarily small (which may lead to numerical overflow). As a popular heuristic (e.g., [29]), a small positive number is usually added to this denominator artificially. Here we establish the connection between this artificially added small number and the $\ell_1$ regularization for $h$-divergences, thereby theoretically justifying this heuristic. (This connection has been observed for some special $h$-divergences [5], [30], but here we provide a more general and unified discussion.)

Remark 9 (Positivity of $W_t$). As shown in [6], both matrices $S^+$ and $S^-$ are entry-wise positive, i.e., $S^+, S^- \in \mathbb{R}^{F\times K}_{++}$. Therefore, if $\widetilde W \in \mathbb{R}^{F\times K}_{++}$ in (11), then $W \in \mathbb{R}^{F\times K}_{++}$. Since the initial basis matrix $W_0 \in \mathbb{R}^{F\times K}_{++}$ in Algorithm 1, for any finite index $t \in \mathbb{N}$, $W_t$ will be entry-wise positive. Similar arguments apply to $H_t$. Therefore, the positivity requirements on $(\widetilde W, \widetilde H)$ in Proposition 1 can be satisfied at any finite iteration. Note that this does not prevent any limit point of $\{(W_t, H_t)\}_{t=1}^{\infty}$ from having zero entries.
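The update (11) is a single vectorized expression once $S^+$ and $S^-$ are available. The following sketch is ours (divergence-specific code must supply `S_plus` and `S_minus`) and makes the roles of $\lambda_1$, $\lambda_2$ and the exponent explicit:

```python
def mu_step_W(W_tilde, S_plus, S_minus, lam1, lam2, theta1, theta2):
    """One multiplicative update of the basis matrix per (11):
    w_ik <- w~_ik * (s-_ik / (s+_ik + 2*lam2*w~_ik + lam1))^(1/(theta2-theta1)).
    With lam1 > 0 the denominator is bounded away from zero (Remark 8), and
    entrywise-positive inputs yield an entrywise-positive output (Remark 9),
    so no extra safeguard is needed."""
    ratio = S_minus / (S_plus + 2.0 * lam2 * W_tilde + lam1)
    return W_tilde * ratio ** (1.0 / (theta2 - theta1))
```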
E. A Concrete Example

As the first-order surrogate function (10) in Proposition 1 and the multiplicative update rule (11) in Proposition 2 may seem abstract, we apply them, as a concrete example, to the family of $\alpha$-divergences ($\alpha \ne 0$). Details are deferred to Appendix B.

F. Extension to the Dual KL Divergence
When $\alpha = 0$, the corresponding $\alpha$-divergence is called the (generalized) dual KL divergence. Strictly speaking, it does not belong to the class of $h$-divergences. However, equipped with a few more algebraic manipulations based on several technical definitions, we can also construct a first-order surrogate function and derive a multiplicative update in the form of (10) and (11) respectively. See Appendix C for details. Consequently, the result of our convergence analysis (i.e., Theorem 1) also applies to this case.

IV. CONVERGENCE ANALYSIS
A. Preliminaries
We first define important concepts and quantities that will be used in our convergence analysis in Section IV-B.
Definition 3 (Stationary points of constrained optimization problems). Given a finite-dimensional real Euclidean space $\mathcal{X}$ with inner product $\langle\cdot,\cdot\rangle$, a differentiable function $g: \mathcal{X} \to \mathbb{R}$ and a set $\mathcal{K} \subseteq \mathcal{X}$, $\bar x \in \mathcal{K}$ is a stationary point of the constrained optimization problem $\min_{x \in \mathcal{K}} g(x)$ if $\langle \nabla g(\bar x), x - \bar x \rangle \ge 0$ for all $x \in \mathcal{K}$.

Define $X \triangleq [W^T \;\; H] \in \mathbb{R}^{K\times(F+N)}_+$ and, with a slight abuse of notation, write $\ell(X) \triangleq \ell(W, H)$. Thus by Definition 3, $(\bar W, \bar H)$ is a stationary point of (7) if and only if $\langle \nabla_X \ell(\bar X), X - \bar X \rangle \ge 0$ for any $X \in \mathbb{R}^{K\times(F+N)}_+$, where $\bar X \triangleq [\bar W^T \;\; \bar H]$. In particular, this is true if

$$\langle \nabla_W \ell(\bar W, \bar H), W - \bar W \rangle \ge 0, \quad \forall\, W \in \mathbb{R}^{F\times K}_+, \tag{12}$$
$$\langle \nabla_H \ell(\bar W, \bar H), H - \bar H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{13}$$

Remark 10. In some previous works (e.g., [20]), stationary points are defined in terms of the KKT conditions, i.e.,

$$\bar W \ge 0, \quad \bar H \ge 0,$$
$$\nabla_W \ell(\bar W, \bar H) \ge 0, \quad \nabla_H \ell(\bar W, \bar H) \ge 0,$$
$$\bar W \odot \nabla_W \ell(W, \bar H)\big|_{W = \bar W} = 0, \quad \bar H \odot \nabla_H \ell(\bar W, H)\big|_{H = \bar H} = 0.$$

Since both $\bar W$ and $\bar H$ are nonnegative, it is easy to show these three conditions are equivalent to (12) and (13). In our analysis, we will use (12) and (13) for convenience.

Definition 4 (Convergence of a sequence to a set). Given a finite-dimensional real Euclidean space $\mathcal{X}$ with norm $\|\cdot\|$, a sequence $\{x_n\}_{n=1}^{\infty}$ in $\mathcal{X}$ is said to converge to a set $\mathcal{A} \subseteq \mathcal{X}$, denoted as $x_n \to \mathcal{A}$, if $\lim_{n\to\infty} \inf_{a\in\mathcal{A}} \|x_n - a\| = 0$.
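Numerically, the KKT conditions of Remark 10 give a convenient stopping diagnostic. The helper below is our own suggestion (it is not used in the paper's proofs); it returns a scalar residual that vanishes exactly at points satisfying the three conditions, given gradients computed elsewhere:

```python
import numpy as np

def kkt_residual(W, H, grad_W, grad_H):
    """Residual of the KKT conditions in Remark 10 at nonnegative (W, H):
    (i) the gradients must be entrywise nonnegative, and (ii) complementary
    slackness W . grad_W = 0 and H . grad_H = 0 must hold."""
    dual_feas = max(np.maximum(-grad_W, 0.0).max(),
                    np.maximum(-grad_H, 0.0).max())
    comp_slack = max(np.abs(W * grad_W).max(), np.abs(H * grad_H).max())
    return max(dual_feas, comp_slack)
```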
B. Main Result
Theorem 1. For any $V \in \mathbb{R}^{F\times N}_+$, $K \in \mathbb{N}$, $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$, the sequence of iterates $\{(W_t, H_t)\}_{t=1}^{\infty}$ generated by Algorithm 1 converges to the set of stationary points of (7).

Proof. Since $x_n \to \mathcal{A}$ (cf. Definition 4) if and only if every limit point of $\{x_n\}_{n=1}^{\infty}$ lies in $\mathcal{A}$, it suffices to show every limit point of $\{(W_t, H_t)\}_{t=1}^{\infty}$ is a stationary point of (7). Since $\lambda_1, \tilde\lambda_1 > 0$, $(W, H) \mapsto \ell(W, H)$ is jointly coercive in $(W, H)$ [44]. In addition, the continuous differentiability of $h(\cdot, t)$ implies the joint continuous differentiability of $(W, H) \mapsto \ell(W, H)$ in $(W, H)$. Hence the sub-level set

$$\mathcal{S} \triangleq \big\{ (W, H) \in \mathbb{R}^{F\times K}_+ \times \mathbb{R}^{K\times N}_+ \,\big|\, \ell(W, H) \le \ell(W_0, H_0) \big\} \tag{14}$$

is compact. Since the sequence $\{\ell(W_t, H_t)\}_{t=1}^{\infty}$ is nonincreasing, $\{(W_t, H_t)\}_{t=1}^{\infty} \subseteq \mathcal{S}$. By the compactness of $\mathcal{S}$, $\{(W_t, H_t)\}_{t=1}^{\infty}$ has at least one limit point. Pick any such limit point and denote it as $\mathring Z \triangleq (\mathring W, \mathring H)$. We also define

$$Z_t \triangleq \begin{cases} \big(W_{t/2},\, H_{t/2}\big), & t \text{ even} \\ \big(W_{\lfloor t/2 \rfloor + 1},\, H_{\lfloor t/2 \rfloor}\big), & t \text{ odd}, \end{cases} \quad \forall\, t \in \mathbb{N}. \tag{15}$$

Note that the subsequence of $\{Z_t\}_{t=1}^{\infty}$ with even indices corresponds to the sequence $\{(W_t, H_t)\}_{t=1}^{\infty}$. Hence, there exists a subsequence $\{Z_{t_j}\}_{j=1}^{\infty}$ that converges to $\mathring Z \in \mathcal{S}$, where the $\{t_j\}_{j=1}^{\infty}$ are all even. Moreover, there exists a subsequence of $\{Z_{t_j - 1}\}_{j=1}^{\infty}$, denoted as $\{Z_{t_{j_i} - 1}\}_{i=1}^{\infty}$, such that $Z_{t_{j_i} - 1}$ converges to (possibly) some other limit point $\mathring Z' \triangleq (\mathring W', \mathring H')$ as $i \to \infty$.

Next we show $\mathring Z = \mathring Z'$. By the update rule (9), we have

$$H_{t_{j_i}/2} \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2\big(H \,|\, Z_{t_{j_i} - 1}\big), \quad \forall\, i \in \mathbb{N}. \tag{16}$$

Thus for any $i \in \mathbb{N}$,

$$G_2\big(H_{t_{j_i}/2} \,|\, Z_{t_{j_i} - 1}\big) \le G_2\big(H \,|\, Z_{t_{j_i} - 1}\big), \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{17}$$

By (P2), we also have for any $i \in \mathbb{N}$,

$$\ell\big(Z_{t_{j_i}}\big) \triangleq \ell\big(W_{t_{j_i}/2}, H_{t_{j_i}/2}\big) \le G_2\big(H_{t_{j_i}/2} \,|\, Z_{t_{j_i} - 1}\big). \tag{18}$$

Taking $i \to \infty$ on both sides of (17) and (18), we have

$$\ell(\mathring Z) \le G_2(\mathring H \,|\, \mathring Z') \le G_2(H \,|\, \mathring Z'), \quad \forall\, H \in \mathbb{R}^{K\times N}_+, \tag{19}$$

by the joint continuity of $G_2(\cdot|\cdot)$ in both arguments from (P3). Thus

$$\mathring H \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, \mathring Z'). \tag{20}$$

Taking $H = \mathring H'$ in (19), we have

$$\ell(\mathring Z) \le G_2(\mathring H \,|\, \mathring Z') \le G_2(\mathring H' \,|\, \mathring Z') = \ell(\mathring Z'). \tag{21}$$

Since $\{\ell(Z_t)\}_{t=1}^{\infty}$ converges (to a unique limit), we have $\ell(\mathring Z) = \ell(\mathring Z')$. This implies that $\ell(\mathring Z) = G_2(\mathring H \,|\, \mathring Z')$. Then for any $H \in \mathbb{R}^{K\times N}_+$,

$$G_2(\mathring H' \,|\, \mathring Z') = \ell(\mathring Z') = \ell(\mathring Z) = G_2(\mathring H \,|\, \mathring Z') \le G_2(H \,|\, \mathring Z'). \tag{22}$$

This implies that

$$\mathring H' \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, \mathring Z'). \tag{23}$$

Combining (20) and (23), by the strict convexity of $G_2(\cdot \,|\, \mathring Z')$ from (P5), $\mathring H = \mathring H'$. By symmetry, we can show $\mathring W = \mathring W'$, hence $\mathring Z = \mathring Z'$. Thus (22) becomes

$$G_2(\mathring H \,|\, \mathring Z) \le G_2(H \,|\, \mathring Z), \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{24}$$

Now, the convexity of $G_2(\cdot \,|\, \mathring Z)$ implies that

$$\langle \nabla_H G_2(\mathring H \,|\, \mathring Z), H - \mathring H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{25}$$

From the first-order property of $G_2(\cdot \,|\, \mathring Z)$ in (P4), we have

$$\langle \nabla_H \ell(\mathring W, \mathring H), H - \mathring H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{26}$$

Similarly, we also have

$$\langle \nabla_W \ell(\mathring W, \mathring H), W - \mathring W \rangle \ge 0, \quad \forall\, W \in \mathbb{R}^{F\times K}_+. \tag{27}$$

The variational inequalities (26) and (27) together show that $(\mathring W, \mathring H)$ is a stationary point of (7). □

Remark 11. We now provide some intuition for the proof. We first use the positivity of $\lambda_1$ and $\tilde\lambda_1$ to assert that $(W, H) \mapsto \ell(W, H)$ is coercive and hence that $\mathcal{S}$ is compact. This allows us to extract convergent subsequences.
The most crucial step, (24), states that at an arbitrary limit point of $\{Z_t\}_{t=1}^{\infty}$, denoted as $\mathring Z = (\mathring W, \mathring H)$, $\mathring H$ serves as a minimizer of $G_2(\cdot \,|\, \mathring Z)$ over $\mathbb{R}^{K\times N}_+$. By symmetry, $\mathring W$ also serves as a minimizer of $G_1(\cdot \,|\, \mathring Z)$ over $\mathbb{R}^{F\times K}_+$. In the single-block case, this idea is fairly intuitive. However, to prove (24) in the double-block case, we consider two subsequences $\{Z_{t_{j_i}}\}_{i=1}^{\infty}$ and $\{Z_{t_{j_i}-1}\}_{i=1}^{\infty}$. In each sequence, only $W$ or $H$ is updated. Then we show these two sequences converge to the same limit point. This implies the Gauss-Seidel minimization procedure [45, Section 7.3] in the double-block case is essentially the same as the minimization in the single-block case. The claim then follows.

V. CONCLUSION AND FUTURE WORK
In this work, we present a unified MU algorithm for (weighted) $h$-divergences with $\ell_1$ and Tikhonov regularization and analyze its convergence (to stationary points).

In the future, we plan to investigate further properties of the MU algorithm. Specifically, we would like to understand whether it is able to converge to second-order stationary points [46]. This question is motivated by some recent works which have shown that for low-rank matrix factorization problems [47], under mild conditions, all the second-order stationary points are local minima. For these problems, the local minima have been shown to possess strong theoretical properties [48]. Therefore, investigating the convergence of the MU algorithm to second-order stationary points is meaningful.

Acknowledgements: The authors would like to thank Prof. Zhirong Yang for many useful comments on the manuscript.
APPENDIX A
PROOF OF PROPOSITION 1

We first show that $G_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$ by decomposing $G_1(\cdot|\cdot)$ into two functions $\widetilde G_1(\cdot|\cdot)$ and $\overline G_1(\cdot|\cdot)$, where

$$\widetilde G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \Big[ s^+_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) - s^-_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_1\Big) \Big], \tag{28}$$

$$\overline G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \big( \lambda_1 \tilde w_{ik} + 2\lambda_2 \tilde w_{ik}^2 \big)\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big). \tag{29}$$

Define constants $\{\zeta'_p\}_{p=1}^P$ such that $\zeta'_p \triangleq 1$ if $\zeta_p \in (0, 1]$ and $\zeta'_p \triangleq \zeta_p$ otherwise, for any $p \in [P]$. Accordingly, define

$$\zeta'_{\min} \triangleq \min\,\{\zeta'_p\}_{p=1}^P \quad \text{and} \quad \zeta'_{\max} \triangleq \max\,\{\zeta'_p\}_{p=1}^P. \tag{30}$$

When $\vartheta_1 = \zeta'_{\min}$ and $\vartheta_2 = \zeta'_{\max}$, Yang and Oja [6] showed that $\widetilde G_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$. By Lemmas T-1 and T-2(b), $\widetilde G_1(\cdot|\cdot)$ remains a surrogate function for any $\vartheta_2 \ge \zeta'_{\max}$. Define a new function

$$G'_1(W \,|\, \widetilde Z) \triangleq \overline G_1(W \,|\, \widetilde Z) + \sum_{i=1}^{2} \lambda_i\, \phi_i(\widetilde W) + \sum_{j=1}^{2} \tilde\lambda_j\, \phi_j(\widetilde H), \tag{31}$$

so that $W \mapsto \overline G_1(W \,|\, \widetilde Z)$ and $W \mapsto G'_1(W \,|\, \widetilde Z)$ differ by a constant that is independent of $W$. By Lemma T-2(a), to show $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, it suffices to show $G'_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto \sum_{i=1}^2 \lambda_i \phi_i(W) + \sum_{j=1}^2 \tilde\lambda_j \phi_j(H)$ for $W$, with $\vartheta_2 \ge \zeta'_{\max}$. First, note that

$$\lambda_1 \|W\|_{1,1} + \lambda_2 \|W\|_F^2 = \sum_{i=1}^F \sum_{k=1}^K \lambda_1 \tilde w_{ik} \Big[ \frac{w_{ik}}{\tilde w_{ik}} - 1 \Big] + 2\lambda_2 \tilde w_{ik}^2 \cdot \frac{1}{2}\Big[ \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big)^2 - 1 \Big] + \lambda_1 \|\widetilde W\|_{1,1} + \lambda_2 \|\widetilde W\|_F^2 \tag{32}$$

$$\le \sum_{i=1}^F \sum_{k=1}^K \big( \lambda_1 \tilde w_{ik} + 2\lambda_2 \tilde w_{ik}^2 \big)\, h\Big( \frac{w_{ik}}{\tilde w_{ik}},\, \max\{\zeta'_{\max},\, 1 + \mathrm{sgn}(\lambda_2)\} \Big) + \lambda_1 \|\widetilde W\|_{1,1} + \lambda_2 \|\widetilde W\|_F^2, \tag{33}$$

where $\mathrm{sgn}(\lambda_2)$ equals $0$ if $\lambda_2 = 0$ and equals $1$ if $\lambda_2 > 0$, and in (33) we use the monotonicity of $h(\sigma, \cdot)$ for any $\sigma > 0$ from Lemma T-1. Since $h(1, t) = 0$ for any $t > 0$, by choosing $\vartheta_2 = \max\{\zeta'_{\max}, 1 + \mathrm{sgn}(\lambda_2)\}$, we see that $G'_1(\cdot|\cdot)$ satisfies properties (P1) and (P2). In addition, by (41) in Lemma T-1, $G'_1(\cdot|\cdot)$ obviously satisfies (P3).

To prove $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, we show it also satisfies properties (P4) and (P5). First, for any $i \in [F]$ and $k \in [K]$,

$$\big[ \nabla_W G_1(W \,|\, \widetilde Z) \big]_{ik} = \big( s^+_{ik} + \lambda_1 + 2\lambda_2 \tilde w_{ik} \big) \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_2 - 1} - s^-_{ik} \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_1 - 1}.$$

Therefore, $\nabla_W G_1(W \,|\, \widetilde Z)\big|_{W = \widetilde W} = \nabla_W \ell(W, \widetilde H)\big|_{W = \widetilde W}$. This shows (P4). Next, since $\vartheta_1 \le 1 \le \vartheta_2$, $\vartheta_2 - \vartheta_1 > 0$ and $S^+, S^- \in \mathbb{R}^{F\times K}_+$, for any $i \in [F]$, $k \in [K]$ and $W \in \mathbb{R}^{F\times K}_{++}$,

$$\frac{\partial^2}{\partial w_{ik}^2}\, G_1(W \,|\, \widetilde Z) = (1 - \vartheta_1)\, \frac{s^-_{ik}}{\tilde w_{ik}} \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_1 - 2} + \Big( \frac{s^+_{ik} + \lambda_1}{\tilde w_{ik}} + 2\lambda_2 \Big) (\vartheta_2 - 1) \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_2 - 2} > 0.$$

This implies $W \mapsto G_1(W \,|\, \widetilde Z)$ is strictly convex on $\mathbb{R}^{F\times K}_{++}$. Since $W \mapsto G_1(W \,|\, \widetilde Z)$ is continuous and convex on $\mathbb{R}^{F\times K}_+$, it is strictly convex on $\mathbb{R}^{F\times K}_+$. This proves (P5). □

APPENDIX B
FIRST-ORDER SURROGATE FUNCTIONS AND MULTIPLICATIVE UPDATES FOR THE α-DIVERGENCES (α ≠ 0)

Definition 5 ($\alpha$-divergences). Given any matrix $V \in \mathbb{R}^{F\times N}_{++}$, the $\alpha$-divergence $D_\alpha(V\|\cdot): \mathbb{R}^{F\times N}_{++} \to \mathbb{R}_+$ is defined as $D_\alpha(V \,\|\, \widehat V) \triangleq \sum_{i=1}^F \sum_{j=1}^N d_\alpha(v_{ij} \,\|\, \hat v_{ij})$, where for $v, \hat v > 0$,

$$d_\alpha(v \,\|\, \hat v) \triangleq \begin{cases} \dfrac{\hat v \big[ (v/\hat v)^\alpha - 1 \big] - \alpha (v - \hat v)}{\alpha(\alpha - 1)}, & \alpha \in \mathbb{R} \setminus \{0, 1\} \\[4pt] v \log(v/\hat v) - v + \hat v, & \alpha = 1 \\[2pt] \hat v \log(\hat v / v) - \hat v + v, & \alpha = 0. \end{cases}$$
To construct the first-order surrogate function $G_{1,\alpha}(\cdot|\cdot)$ of $(W, H) \mapsto \ell(W, H)$ in (6) for $W$, when $D(V\|\cdot)$ belongs to the family of $\alpha$-divergences ($\alpha \ne 0$), first recall the definitions of $S^+$ and $S^-$ in Proposition 1. From the definition of the $\alpha$-divergences (Definition 5), given any $\widetilde W \in \mathbb{R}^{F\times K}_{++}$ and $\widetilde H \in \mathbb{R}^{K\times N}_{++}$, and for any $(i, k) \in [F] \times [K]$, we have

$$s^+_{ik} = \frac{1}{\alpha} \sum_{j=1}^N \tilde h_{kj} \quad \text{and} \quad s^-_{ik} = \frac{1}{\alpha} \sum_{j=1}^N q_{ij}^{\alpha}\, \tilde h_{kj}, \quad \alpha \ne 0, \tag{34}$$

where for any $(i, j) \in [F] \times [N]$, $q_{ij} \triangleq v_{ij} / (\widetilde W \widetilde H)_{ij}$. In addition, from Definition 5, we can also read off the values of $\zeta_1$ and $\zeta_2$ (see Definition 1), and hence deduce from (30) that

$$(\zeta'_{\max},\, \zeta'_{\min}) = \begin{cases} (1,\; 1 - \alpha), & \alpha > 0 \\ (1 - \alpha,\; 1), & \alpha < 0. \end{cases} \tag{35}$$

By choosing $\vartheta_1 = \zeta'_{\min}$ and $\vartheta_2 = \max\{\zeta'_{\max}, 1 + \mathrm{sgn}(\lambda_2)\}$ as in Appendix A, we can obtain $G_{1,\alpha}(\cdot|\cdot)$ per (10) in Proposition 1. Additionally, the multiplicative update (11) in the case of $\alpha$-divergences becomes

$$w_{ik} := \tilde w_{ik} \left( \frac{\sum_{j=1}^N q_{ij}^{\alpha}\, \tilde h_{kj}}{\sum_{j=1}^N \tilde h_{kj} + 2\alpha\lambda_2 \tilde w_{ik} + \alpha\lambda_1} \right)^{1/\varphi(\alpha)}, \quad \alpha \ne 0, \quad \text{where} \quad \varphi(\alpha) = \begin{cases} \alpha + \mathrm{sgn}(\lambda_2), & \alpha > 0 \\ \max\{-\alpha,\, \mathrm{sgn}(\lambda_2)\}, & \alpha \in (-1, 0) \\ -\alpha, & \alpha \le -1. \end{cases} \tag{36}$$
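For readers who wish to try the update (36), the sketch below implements it for $\alpha > 0$ only, where $\varphi(\alpha) = \alpha + \mathrm{sgn}(\lambda_2)$; the function and variable names are ours, and entrywise-positive inputs with $\lambda_1 > 0$ are assumed, as Remark 9 guarantees along the iterations:

```python
import numpy as np

def mu_step_W_alpha(V, W_tilde, H_tilde, alpha, lam1, lam2):
    """Multiplicative update of W per (34)-(36) for an alpha-divergence.
    V: (F, N); W_tilde: (F, K); H_tilde: (K, N), all entrywise positive."""
    assert alpha > 0, "this sketch covers only the alpha > 0 branch of (36)"
    Q = V / (W_tilde @ H_tilde)              # q_ij = v_ij / (W~ H~)_ij
    num = (Q ** alpha) @ H_tilde.T           # [num]_ik = sum_j q_ij^alpha h~_kj
    den = H_tilde.sum(axis=1)[None, :] + 2 * alpha * lam2 * W_tilde + alpha * lam1
    phi = alpha + (1.0 if lam2 > 0 else 0.0)  # phi(alpha) = alpha + sgn(lam2)
    return W_tilde * (num / den) ** (1.0 / phi)
```

The update for $H$ follows by symmetry (exchange the roles of $W$ and $H$ and transpose), and alternating the two steps instantiates Algorithm 1 for this family.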
APPENDIX C
FIRST-ORDER SURROGATE FUNCTIONS AND MULTIPLICATIVE UPDATES FOR THE DUAL KL DIVERGENCE
In [12], a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$, when $D(V\|\cdot)$ is the dual KL divergence, is given by

$$\widetilde G_{1,0}(W \,|\, \widetilde Z) \triangleq \sum_{ik} \Big[ w_{ik} \log\Big(\frac{w_{ik}}{\tilde w_{ik}}\Big) \sum_j \tilde h_{kj} - w_{ik} \sum_j (\log q_{ij} + 1)\, \tilde h_{kj} \Big]. \tag{37}$$

However, this surrogate function cannot be expressed as (28) with an appropriate choice of parameters, hence it cannot be directly used to derive the first-order surrogate function for (6). Therefore, consider a majorant of $\widetilde G_{1,0}(W \,|\, \widetilde Z)$ as follows:

$$\widetilde G'_{1,0}(W \,|\, \widetilde Z) \triangleq \frac{1}{\eta} \sum_{ik} \Big[ \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big)^{1+\eta} \tilde w_{ik} \sum_j q_{ij}^{-\eta}\, \tilde h_{kj} - (1 + \eta) \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big) \tilde w_{ik} \sum_j \tilde h_{kj} \Big], \tag{38}$$

where $\eta$ is any positive real number. By Lemma T-2(b), $\widetilde G'_{1,0}(\cdot|\cdot)$ is also a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$. By comparing (28) to (38), we have

$$\vartheta_1 = 1, \quad \vartheta_2 = 1 + \eta, \quad s^+_{ik} = \frac{1}{\eta} \sum_j q_{ij}^{-\eta}\, \tilde h_{kj}, \quad s^-_{ik} = \frac{1 + \eta}{\eta} \sum_j \tilde h_{kj}.$$

Note that in this case $S^+$ and $S^-$ may not represent the sums of positive and unsigned negative terms in $\nabla_W D(V \,\|\, W \widetilde H)\big|_{W = \widetilde W}$. However, as long as they lie in $\mathbb{R}^{F\times K}_+$, Propositions 1 and 2 still hold. Therefore, by choosing $\vartheta_2 = \max\{1 + \eta,\, 1 + \mathrm{sgn}(\lambda_2)\}$, the first-order surrogate function for the dual KL divergence can be constructed per Proposition 1. Additionally, per Proposition 2, the multiplicative update is

$$w_{ik} := \tilde w_{ik} \left( \frac{(1 + \eta) \sum_j \tilde h_{kj}}{\sum_j q_{ij}^{-\eta}\, \tilde h_{kj} + 2\eta\lambda_2 \tilde w_{ik} + \eta\lambda_1} \right)^{1/\psi(\eta)}, \tag{39}$$

where

$$\psi(\eta) \triangleq \max\{1 + \eta,\, 1 + \mathrm{sgn}(\lambda_2)\} - 1. \tag{40}$$
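As in Appendix B, (39)-(40) reduce to one vectorized expression. The sketch below is ours (illustrative names; $\eta > 0$, $\lambda_1 > 0$ and entrywise-positive inputs are assumed):

```python
import numpy as np

def mu_step_W_dual_kl(V, W_tilde, H_tilde, eta, lam1, lam2):
    """Multiplicative update of W per (39)-(40) for the dual KL divergence."""
    Q = V / (W_tilde @ H_tilde)                       # q_ij = v_ij / (W~ H~)_ij
    num = (1.0 + eta) * H_tilde.sum(axis=1)[None, :]  # (1+eta) * sum_j h~_kj
    den = (Q ** (-eta)) @ H_tilde.T + 2 * eta * lam2 * W_tilde + eta * lam1
    psi = max(1.0 + eta, 1.0 + (1.0 if lam2 > 0 else 0.0)) - 1.0  # (40)
    return W_tilde * (num / den) ** (1.0 / psi)
```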
APPENDIX D
TECHNICAL LEMMAS

Among the following technical lemmas, Lemma T-1 is adapted from [49, Lemma 1], whereas Lemma T-2 can be proved directly from the definitions.
Lemma T-1 (Regularity of $h$ in (2)). For any $t \in \mathbb{R}$, denote the natural domain of $h(\cdot, t)$ as $\Xi_t$. Then $h(\cdot, t)$ is continuously differentiable on $\mathrm{int}(\Xi_t)$, i.e., the interior of $\Xi_t$. In particular,

$$\frac{\partial}{\partial \sigma}\, h(\sigma, t) = \sigma^{t-1}, \quad \forall\, t \in \mathbb{R},\; \forall\, \sigma \in \mathrm{int}(\Xi_t). \tag{41}$$

In addition, $h(\cdot, t)$ is either convex or concave on $\mathrm{int}(\Xi_t)$, for any $t \in \mathbb{R}$. Finally, for every $\nu > 0$, the function $h(\nu, \cdot)$ is nondecreasing on $\mathbb{R}$.

Lemma T-2 (Calculus of surrogate functions). Let $f: \mathcal{X} \to \mathbb{R}$ be given as in Definition 2. Let $i$ be any index in $[n]$.
(a) If $f = f_1 + f_2$, and $F_{1,i}(\cdot|\cdot)$ and $F_{2,i}(\cdot|\cdot)$ are surrogate functions of $(x_1, \ldots, x_n) \mapsto f_1(x_1, \ldots, x_n)$ and $(x_1, \ldots, x_n) \mapsto f_2(x_1, \ldots, x_n)$ for $x_i$ respectively, then $F_i(\cdot|\cdot) \triangleq F_{1,i}(\cdot|\cdot) + F_{2,i}(\cdot|\cdot)$ is a surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$.
(b) If $F_i(\cdot|\cdot)$ is a surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$, and there exists $\widetilde F_i: \mathcal{X}_i \times \mathcal{X} \to \mathbb{R}$ that satisfies (P3) and

$$\widetilde F_i(\tilde x_i \,|\, \tilde x) = F_i(\tilde x_i \,|\, \tilde x), \quad \forall\, \tilde x \in \mathcal{X}, \tag{42}$$
$$\widetilde F_i(x_i \,|\, \tilde x) \ge F_i(x_i \,|\, \tilde x), \quad \forall\, (x_i, \tilde x) \in \mathcal{X}_i \times \mathcal{X}, \tag{43}$$

then $\widetilde F_i(\cdot|\cdot)$ is a surrogate function of $f$ for $x_i$.

REFERENCES
[1] R. Zhao and V. Y. F. Tan, "A unified convergence analysis of the multiplicative update algorithm for nonnegative matrix factorization," in Proc. ICASSP, New Orleans, LA, USA, Mar. 2017.
[2] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, Oct. 1999.
[3] ——, "Algorithms for non-negative matrix factorization," in Proc. NIPS, Denver, USA, Dec. 2000, pp. 556–562.
[4] I. S. Dhillon and S. Sra, "Generalized nonnegative matrix approximations with Bregman divergences," in Proc. NIPS, Dec. 2006.
[5] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the beta-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, 2011.
[6] Z. Yang and E. Oja, "Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1878–1891, Dec. 2011.
[7] J. Kim and H. Park, "Toward faster nonnegative matrix factorization: A new algorithm and comparisons," in Proc. ICDM, Pisa, Italy, Dec. 2008, pp. 353–362.
[8] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Comput., vol. 19, no. 10, pp. 2756–2779, Oct. 2007.
[9] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, "An alternating direction algorithm for matrix completion with nonnegative factors," Frontiers Math. China, vol. 7, pp. 365–384, 2012.
[10] D. Sun and C. Févotte, "Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence," in Proc. ICASSP, Florence, Italy, May 2014, pp. 6201–6205.
[11] A. Cichocki, H.-K. Lee, Y.-D. Kim, and S. Choi, "Nonnegative matrix factorization with alpha-divergence," Pattern Recognit. Lett., vol. 29, no. 9, pp. 1433–1440, 2008.
[12] A. Cichocki and S. Amari, "Families of Alpha- Beta- and Gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[13] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009.
[14] S. A. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM J. Optim., vol. 20, no. 3, pp. 1364–1377, 2009.
[15] D. Donoho and V. Stodden, "When does non-negative matrix factorization give correct decomposition into parts?" in Proc. NIPS, Vancouver, Canada, Dec. 2004, pp. 1141–1148.
[16] V. Bittorf, B. Recht, C. Ré, and J. A. Tropp, "Factoring nonnegative matrices with linear programs," in Proc. NIPS, 2012, pp. 1214–1222.
[17] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization – provably," in Proc. STOC, New York, NY, USA, May 2012, pp. 145–162.
[18] N. Gillis and R. Luce, "Robust near-separable nonnegative matrix factorization using linear optimization," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1249–1280, 2014.
[19] K. Huang, N. D. Sidiropoulos, and A. Swami, "Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition," IEEE Trans. Signal Process., vol. 62, no. 1, pp. 211–224, Jan. 2014.
[20] C.-J. Lin, "On the convergence of multiplicative update algorithms for nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1589–1596, 2007.
[21] N. Gillis and F. Glineur, "Nonnegative factorization and the maximum edge biclique problem," arXiv:0810.4225, 2008.
[22] N. Takahashi and R. Hibi, "Global convergence of modified multiplicative updates for nonnegative matrix factorization," Comput. Optim. Appl., vol. 57, no. 2, pp. 417–440, 2014.
[23] N. Takahashi, J. Katayama, and J. Takeuchi, "A generalized sufficient condition for global convergence of modified multiplicative updates for NMF," in Int. Symp. Nonlinear Theory Appl., Luzern, Switzerland, 2014, pp. 44–47.
[24] H. Kim and H. Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," Bioinform., pp. 1495–1502, 2007.
[25] D. Hajinezhad, T. H. Chang, X. Wang, Q. Shi, and M. Hong, "Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis," in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 4742–4746.
[26] C. Févotte and A. T. Cemgil, "Nonnegative matrix factorizations as probabilistic inference in composite models," in Proc. EUSIPCO, Glasgow, UK, Aug. 2009, pp. 1913–1917.
[27] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[28] R. Badeau, N. Bertin, and E. Vincent, "Stability analysis of multiplicative update algorithms and application to nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 21, no. 12, pp. 1869–1881, Dec. 2010.
[29] A. Cichocki, S. Cruces, and S.-i. Amari, "Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization," Entropy, vol. 13, no. 1, pp. 134–170, 2011.
[30] P. O. Hoyer, "Non-negative sparse coding," in Proc. NNSP, Valais, Switzerland, Sep. 2002, pp. 557–565.
[31] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, 2011.
[32] L. Taslaman and B. Nilsson, "A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data," PLoS ONE, vol. 7, no. 11, pp. 1–7, 2012.
[33] A. Mirzal, "Nonparametric Tikhonov regularized NMF and its application in cancer clustering," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 11, no. 6, pp. 1208–1217, 2014.
[34] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM J. Optim., vol. 23, no. 2, 2013.
[35] M. Hong, M. Razaviyayn, Z. Q. Luo, and J. S. Pang, "A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing," IEEE Signal Process. Mag., vol. 33, no. 1, pp. 57–77, Jan. 2016.
[36] D. Guillamet, M. Bressan, and J. Vitria, "A weighted non-negative matrix factorization for local representations," in Proc. CVPR, vol. 1, 2001, pp. 942–947.
[37] D. Guillamet, J. Vitrià, and B. Schiele, "Introducing a weighted non-negative matrix factorization for image classification," Pattern Recognit. Lett., vol. 24, no. 14, pp. 2447–2454, Oct. 2003.
[38] N.-D. Ho, "Nonnegative matrix factorization: Algorithms and applications," Ph.D. dissertation, Université catholique de Louvain, 2008.
[39] Q. Gu, J. Zhou, and C. H. Q. Ding, "Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs," in Proc. SDM, 2010, pp. 199–210.
[40] X. Zheng, S. Zhu, J. Gao, and H. Mamitsuka, "Instance-wise weighted nonnegative matrix factorization for aggregating partitions with locally reliable clusters," in Proc. IJCAI, Buenos Aires, Argentina, 2015, pp. 4091–4097.
[41] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320, 2005.
[42] V. P. Pauca, J. Piper, and R. J. Plemmons, "Nonnegative matrix factorization for spectral data analysis," Linear Algebra Appl., vol. 416, no. 1, pp. 29–47, 2006.
[43] J. Mairal, "Incremental majorization-minimization optimization with application to large-scale machine learning," SIAM J. Optim., vol. 25, no. 2, pp. 829–855, 2015.
[44] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[45] R. L. Burden and J. D. Faires, Numerical Analysis. Brooks/Cole Publishing Company, 2016.
[46] Y. Nesterov and B. Polyak, "Cubic regularization of Newton method and its global performance," Mathematical Programming, vol. 108, no. 1, pp. 177–205, 2006.
[47] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Trans. Inf. Theory, vol. 62, no. 11, Nov. 2016.
[48] R. Ge, J. D. Lee, and T. Ma, "Matrix completion has no spurious local minimum," in Proc. NIPS, Barcelona, Spain, Dec. 2016.