A Unified Convergence Analysis of the Multiplicative Update Algorithm for Regularized Nonnegative Matrix Factorization
Renbo Zhao, Student Member, IEEE, and Vincent Y. F. Tan, Senior Member, IEEE
Abstract—The multiplicative update (MU) algorithm has been extensively used to estimate the basis and coefficient matrices in nonnegative matrix factorization (NMF) problems under a wide range of divergences and regularizers. However, theoretical convergence guarantees have only been derived for a few special divergences without regularization. In this work, we provide a conceptually simple, self-contained, and unified proof for the convergence of the MU algorithm applied on NMF with a wide range of divergences and regularizers. Our main result shows that the sequence of iterates (i.e., pairs of basis and coefficient matrices) produced by the MU algorithm converges to the set of stationary points of the non-convex NMF optimization problem. Our proof strategy has the potential to open up new avenues for analyzing similar problems in machine learning and signal processing.
Index Terms—Nonnegative Matrix Factorization, Multiplicative Update Algorithm, Convergence Analysis, Nonconvex Optimization, Stationary Points
I. INTRODUCTION
Nonnegative Matrix Factorization (NMF) has been a popular dimensionality reduction technique in recent years, due to its non-subtractive and parts-based interpretation of the learned basis [2]. In the general formulation of NMF, given a nonnegative matrix $V \in \mathbb{R}^{F \times N}_+$, one seeks to find a nonnegative basis matrix $W \in \mathbb{R}^{F \times K}_+$ and a nonnegative coefficient matrix $H \in \mathbb{R}^{K \times N}_+$ such that $V \approx WH$. To find such a pair of matrices, a popular approach is to solve the optimization problem

$$\min_{W \ge 0,\; H \ge 0} \;\big[\, \ell(W, H) \triangleq D(V \,\|\, WH) \,\big], \tag{1}$$

where $D(\cdot\|\cdot)$ denotes the divergence (or distance) between two nonnegative matrices and $W \ge 0$ (and $H \ge 0$) denotes entrywise inequality. In the NMF literature, many algorithms have been proposed to solve (1), including multiplicative updates (MU) [3]–[6], block principal pivoting (BPP) [7], projected gradient descent (PGD) [8], and the alternating direction method of multipliers (ADMM) [9], [10]. However, some algorithms only solve (1) for certain divergences $D(\cdot\|\cdot)$. For example, the BPP algorithm is only applicable to the squared-Frobenius loss $\|V - WH\|_F^2$. Among all the algorithms, the MU algorithm has arguably the widest applicability: it has been used to solve (1) when $D(\cdot\|\cdot)$ belongs to the family of $\alpha$-divergences [11], $\beta$-divergences [5], $\gamma$-divergences [12], etc.

An abridged version of this paper was presented at ICASSP 2017 [1]. Renbo Zhao and Vincent Y. F. Tan are with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (NUS). Renbo Zhao is also with the Department of Industrial and Systems Engineering, NUS. They are supported in part by the NUS Young Investigator Award (grant number R-263-000-B37-133).
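As a concrete instance of (1), the classic updates of Lee and Seung [3] for the squared-Frobenius loss can be stated in a few lines of NumPy. The sketch below is ours (the function name, initialization scheme, and the small `eps` safeguard are illustrative additions, not part of [3]):

```python
import numpy as np

def mu_frobenius(V, K, n_iter=200, eps=1e-12, seed=0):
    """Classic multiplicative updates [3] for min_{W,H >= 0} ||V - WH||_F^2.
    `eps` guards against division by zero; it is an implementation
    convenience and not part of the original update rules."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.uniform(0.1, 1.0, size=(F, K))  # strictly positive initialization
    H = rng.uniform(0.1, 1.0, size=(K, N))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # H <- H . (W^T V) / (W^T W H)
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # W <- W . (V H^T) / (W H H^T)
    return W, H
```

Each update multiplies the current iterate by a nonnegative factor, so nonnegativity is preserved automatically; the algorithms derived in Section III generalize exactly this mechanism.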
Despite the popularity of the MU algorithm, its convergence properties have not been studied systematically when the divergence is not the standard squared-Frobenius loss and when there are regularizers on $W$ and $H$. To describe this problem precisely, let $\{(W_t, H_t)\}_{t=1}^{\infty}$ be the sequence of pairs of basis and coefficient matrices generated by the MU algorithm, where $t \ge 1$ denotes the iteration index. Many previous works [3], [5], [11] showed that the sequence of (nonnegative) objective values $\{\ell(W_t, H_t)\}_{t=1}^{\infty}$ in the MU algorithm is non-increasing and hence converges. However, the convergence of objective values does not imply the convergence of $\{(W_t, H_t)\}_{t=1}^{\infty}$, whose limit points (assuming they exist) serve as natural candidates for the output of the MU algorithm. The limit points of $\{(W_t, H_t)\}_{t=1}^{\infty}$ tend to be empirically appealing, i.e., they represent each column of $V$ (i.e., a data sample) as a linear combination of $K$ nonnegative basis vectors in a meaningful manner [2], [13]. As such, the convergence properties of $\{(W_t, H_t)\}_{t=1}^{\infty}$, and especially the optimality of its limit points, are of theoretical and practical importance.

A. Related Works
Due to the nonconvex nature of (1), algorithms that are guaranteed to converge to the global minima of (1) are in general out of reach. Indeed, [14] has shown that (1) is NP-hard. To ameliorate this situation, a line of works that imposes structural assumptions on the data matrix $V$, such as separability [15], [16], has emerged. Under such an assumption, polynomial-time algorithms [16]–[19] have been proposed to find $W$ and $H$ such that $WH$ exactly equals $V$. However, for many applications in signal processing and machine learning, the data matrix $V$ does not strictly satisfy the aforementioned assumptions. In such scenarios, exact (nonnegative) factorization of $V$ is generally infeasible. As a result, given a general nonnegative matrix $V$, many works [3]–[10] only aim to reduce (or preserve) the function value of $\ell(\cdot,\cdot)$ at each iteration $t$, hoping that the sequence $\{(W_t, H_t)\}_{t=1}^{\infty}$ will converge to a limit point that is "reasonably good". To understand the theoretical convergence properties of this sequence, many works have been conducted [20]–[26]. In particular, for the MU algorithm, some representative works include [20]–[23]. When the divergence $D(V\|WH) = \|V - WH\|_F^2$, Lin [20] and Gillis and Glineur [21] modified the algorithm originally proposed in [3] (in different ways), and proved the convergence of the modified algorithms to the set of stationary points of (1). However, their approaches cannot be easily generalized to other divergences, e.g., the (generalized) Kullback-Leibler (KL) divergence. To overcome this restriction, the authors of [22] and [23] modified the nonnegativity constraints on $W$ and $H$ in (1) to $W \ge \epsilon$ and $H \ge \epsilon$, for some $\epsilon > 0$. Accordingly, they developed algorithms for this "positive matrix factorization" problem [27] with a wider class of divergences (including the $\beta$-divergences) and showed that their algorithms converge to the stationary points of the new problem. However, the positivity constraints on $W$ and $H$ are restrictive and change the original NMF problem in (1) substantially. Therefore, the convergence analyses in [22] and [23] are not applicable to the MU algorithms for the canonical NMF problem (in which $W$ and $H$ are allowed to have zero entries), which is of interest in many applications. In another related work [28], the authors analyzed the stability of local minima of (1) under the MU algorithm, where $D(\cdot\|\cdot)$ belongs to the class of $\beta$-divergences. However, the stability analysis therein does not yield definite answers on whether (and when) the MU algorithm converges to any local minimum (or even stationary point) of $\ell(\cdot,\cdot)$ if the algorithm is started at an arbitrary (feasible) starting point.

B. Motivations and Main Contributions
In this work, we analyze the convergence of the MU algorithm for regularized NMF problems with a general class of divergences, termed $h$-divergences (see Definition 1), in a unified manner. The set of $h$-divergences includes many important classes of divergences, including (but not limited to) the $\alpha$ ($\alpha \ne 0$), $\beta$, $\gamma$, $\alpha$-$\beta$ and Rényi divergences. For each class of divergences, the corresponding MU algorithm has been proposed in the literature [5], [11], [12], [29], but without convergence guarantees. In addition, we also include regularizers on $W$ and $H$ in the objective function. The purpose of including regularizers is twofold: (i) convenience of mathematical analysis and (ii) increased generality of the problem setting. Although many MU algorithms have been proposed for NMF problems with various regularizers [30]–[33], thus far, the convergence analyses of these algorithms are still lacking. The absence of theoretical convergence guarantees for the MU algorithms in the abovementioned cases thus becomes a major motivation of our work.

Our contributions consist of two parts. First, we develop a unified MU algorithm for the NMF problem (1) with any (weighted) $h$-divergence and $\ell_{1,1}$ (and Tikhonov) regularizers on $W$ and $H$. Our algorithm subsumes many existing algorithms in previous works [3]–[6] as special cases. From our update rules, we discover that minimizing $D(V\|WH)$ with the $\ell_{1,1}$ regularization on $W$ and $H$ corresponds to a stability-preserving heuristic commonly employed in implementing the MU algorithms. (See Remark 8 for details.) Therefore, this justifies the need to incorporate $\ell_{1,1}$ regularization into the NMF objective. Second, we conduct a novel convergence analysis for this unified MU algorithm, by making innovative use of the recently proposed block majorization-minimization framework [34], [35]. Our results show that the sequence of iterates $\{(W_t, H_t)\}_{t=1}^{\infty}$ generated from our MU algorithm has at least one limit point, and any limit point of this sequence is a stationary point of (1). (See Definition 4 for the definition of convergence of a sequence to a set.) Thus, for the first time, it is shown that the host of MU algorithms in the NMF literature [5], [11], [12], [29]–[33] enjoys strong theoretical convergence guarantees.

C. Notations
In this paper we use $\mathbb{R}_+$, $\mathbb{R}_{++}$ and $\mathbb{N}$ to denote the set of nonnegative real numbers, positive real numbers and natural numbers (excluding zero) respectively. For $n \in \mathbb{N}$, we define $[n] \triangleq \{1, 2, \ldots, n\}$. We use boldface capital letters, boldface lowercase letters and plain lowercase letters to denote matrices, vectors and scalars respectively. For a vector $x$, we denote its $i$-th entry, $\ell_1$ and $\ell_2$ norms as $x_i$, $\|x\|_1$ and $\|x\|_2$ respectively. For a matrix $X$, we denote its $(i,j)$-th entry as $x_{ij}$ and its $\ell_{1,1}$ norm as $\|X\|_{1,1} \triangleq \sum_{ij} |x_{ij}|$. In addition, for a scalar $\delta \in \mathbb{R}$, we use $X = \delta$ and $X \ge \delta$ to denote entrywise equality and inequality. For matrices $X$ and $Y$, we use $X \odot Y$ and $\langle X, Y \rangle$ to denote their Hadamard product and Frobenius inner product respectively. We use $\overset{c}{=}$ to denote equality up to additive constants. In this work, technical lemmas (whose indices begin with 'T') appear in Appendix D.

II. PROBLEM FORMULATION
A. Definition of h-Divergences

Before introducing the notion of $h$-divergences, we first define an important function

$$h(\sigma, t) \triangleq \begin{cases} (\sigma^t - 1)/t, & t \ne 0 \\ \log \sigma, & t = 0, \end{cases} \tag{2}$$

where for any $t \in \mathbb{R}$, the domain of $\sigma$ is given by the natural domain of $\sigma \mapsto h(\sigma, t)$, denoted as $\Xi_t$. To be more explicit, $\Xi_t = [0, \infty)$ if $t > 0$ and $\Xi_t = (0, \infty)$ if $t \le 0$.

Definition 1 ($h$-divergences [6, Section IV]). Given any $V \in \mathbb{R}^{F\times N}_+$, $D(V\|\cdot): \mathbb{R}^{F\times N}_+ \to \mathbb{R}_+$ is called an $h$-divergence if for any $\widehat V \in \mathbb{R}^{F\times N}_+$, there exists a constant $P \ge 1$ such that

$$D(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{p=1}^P \mu_p\, h\Big( \sum_{i=1}^F \sum_{j=1}^N \nu_{pij}\, h(\hat v_{ij}, \zeta_p),\; \xi_p \Big), \tag{3}$$

where '$\overset{c}{=}$' omits constants that are independent of $\widehat V$, and $\mu_p$, $\nu_{pij}$, $\zeta_p$ and $\xi_p$ are all real constants independent of $\widehat V$. In addition, $\{\zeta_p\}_{p=1}^P$ are distinct, and for any $i \in [F]$ and $j \in [N]$, there exists $p' \in [P]$ such that $\nu_{p'ij} \ne 0$.

Remark 1 (Examples of $h$-divergences). By choosing the constants $\mu_p$, $\nu_{pij}$, $\zeta_p$ and $\xi_p$ in different ways, we obtain different $h$-divergences. The $h$-divergences subsume many important classes of divergences, including the families of $\alpha$ ($\alpha \ne 0$), $\beta$, $\gamma$, $\alpha$-$\beta$ and Rényi divergences [5], [11], [12], [29]. In particular, some important instances of $h$-divergences include the Hellinger, Itakura-Saito (IS), KL and squared-Frobenius divergences. Each instance can be obtained by appropriately choosing the constants $\{\mu_p\}_p$, $\{\nu_{pij}\}_{p,i,j}$, $\{\zeta_p\}_p$ and $\{\xi_p\}_p$. See Remark 2 for an example of how (3) yields the KL divergence with an appropriate set of parameters.

Remark 2 (Separable $h$-divergences). When $\mu_p = \xi_p = 1$ for all $p \in [P]$, $D(V\|\cdot)$ is separable across the entries of $\widehat V$, i.e.,

$$D(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{i=1}^F \sum_{j=1}^N \sum_{p=1}^P \nu_{pij}\, h(\hat v_{ij}, \zeta_p). \tag{4}$$

We term such a divergence a separable $h$-divergence. In particular, any member of the classes of $\alpha$- (for $\alpha \ne 0$) or $\beta$-divergences is separable. For example, taking $P = 2$, $\nu_{1ij} = -v_{ij}$, $\zeta_1 = 0$, $\nu_{2ij} = 1$ and $\zeta_2 = 1$, we obtain the KL divergence, which belongs to both classes ($\alpha$- and $\beta$-divergences).

Remark 3 (Weighted $h$-divergences). For some special instances in the class of $h$-divergences, such as the squared Euclidean distance and the KL divergence, a weighted version has been proposed and studied in the literature [36]–[40]. Based on Definition 1, we can also define weighted $h$-divergences, which subsume the aforementioned weighted divergences as special cases. Given a nonnegative matrix $M \in \mathbb{R}^{F\times N}_+$, define its support $\Omega(M) \triangleq \{(i,j) \in [F] \times [N] : m_{ij} > 0\}$. Also, for any $p \in [P]$, define $\nu'_{pij} \triangleq \nu_{pij}\, m_{ij}$. For any $h$-divergence $D(V\|\cdot)$, define its $M$-weighted version $D_M(V\|\cdot): \mathbb{R}^{F\times N}_+ \to \mathbb{R}_+$ as

$$D_M(V \,\|\, \widehat V) \;\overset{c}{=}\; \sum_{p=1}^P \mu_p\, h\Big( \sum_{(i,j) \in \Omega(M)} \nu'_{pij}\, h(\hat v_{ij}, \zeta_p),\; \xi_p \Big). \tag{5}$$

Comparing (5) to (3), we observe that the only changes are the constants $\{\nu_{pij}\}_{p,i,j}$. These constants are independent of $\widehat V$. Therefore, our algorithms and convergence analysis developed for the $h$-divergences are also applicable to their weighted counterparts. Indeed, the weighted $h$-divergences in (5) are more general than the $h$-divergences in (3), since by choosing $M$ such that $m_{ij} = 1$ for any $i$ and $j$, we recover (3) from (5).
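To make the constants in Remark 2 concrete, the following display (our own verification, not part of the original text) checks that the choice $P = 2$, $\nu_{1ij} = -v_{ij}$, $\zeta_1 = 0$, $\nu_{2ij} = 1$, $\zeta_2 = 1$ in (4) recovers the KL divergence, using $h(\hat v, 0) = \log \hat v$ and $h(\hat v, 1) = \hat v - 1$:

$$\sum_{i=1}^F \sum_{j=1}^N \big[ -v_{ij} \log \hat v_{ij} + (\hat v_{ij} - 1) \big] \;\overset{c}{=}\; \sum_{i=1}^F \sum_{j=1}^N \Big[ v_{ij} \log \frac{v_{ij}}{\hat v_{ij}} - v_{ij} + \hat v_{ij} \Big] = D_{\mathrm{KL}}(V \,\|\, \widehat V),$$

where the two sides differ only by the constants $v_{ij}\log v_{ij} - v_{ij} + 1$, which are independent of $\widehat V$, as (3) requires.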
B. Optimization Problem

For convenience, first define two functions $\phi_1(\cdot) \triangleq \|\cdot\|_{1,1}$ and $\phi_2(\cdot) \triangleq \|\cdot\|_F^2$. The first and second functions are known as the $\ell_{1,1}$ and Tikhonov regularizers respectively. Accordingly, for any $V \in \mathbb{R}^{F\times N}_+$, define the regularized objective function

$$\ell(W, H) \triangleq D(V \,\|\, WH) + \sum_{i=1}^{2} \lambda_i\, \phi_i(W) + \sum_{j=1}^{2} \tilde\lambda_j\, \phi_j(H), \tag{6}$$

where $W \in \mathbb{R}^{F\times K}_+$, $H \in \mathbb{R}^{K\times N}_+$, $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$. The optimization problem in this work can be stated succinctly as

$$\min_{W \in \mathbb{R}^{F\times K}_+,\; H \in \mathbb{R}^{K\times N}_+} \ell(W, H). \tag{7}$$

Remark 4. The above regularizers involving $\phi_1(\cdot)$ and $\phi_2(\cdot)$ are collectively known as the elastic-net regularizer [41] in the literature. Both $\ell_{1,1}$ and Tikhonov regularizers have been widely employed in the NMF literature. Specifically, the $\ell_{1,1}$ regularizer promotes element-wise sparsity on the basis matrix $W$ and coefficient matrix $H$ [30], thereby enhancing the interpretability of both the basis vectors and the conic combination model in NMF. The Tikhonov regularizer promotes smoothness on $W$ and $H$ and also prevents overfitting [42].

Remark 5 (Positivity of $\lambda_1$ and $\tilde\lambda_1$). We require both $\lambda_1$ and $\tilde\lambda_1$ to be positive for both convenience of analysis and numerical stability. Specifically, the inclusion of $\ell_{1,1}$ regularization on $W$ and $H$ ensures that both $W \mapsto \ell(W, H)$ and $H \mapsto \ell(W, H)$ are coercive, a property that we will leverage in our analysis. In addition, as will be shown in Proposition 2, the positivity of $\lambda_1$ and $\tilde\lambda_1$ prevents the denominators in the multiplicative update rules of $W$ and $H$ from being arbitrarily close to zero, thereby ensuring that the updates in the MU algorithm are numerically stable.
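For monitoring purposes, the regularized objective (6) is easy to evaluate directly. The sketch below is our own (illustrative names; the `eps` guard is a numerical convenience, not part of (6)) and instantiates (6) with the KL divergence of Remark 2:

```python
import numpy as np

def regularized_objective(V, W, H, lam1, lam2, lam1_t, lam2_t, eps=1e-12):
    """Evaluate (6) with D = KL divergence:
    D(V||WH) + lam1*||W||_{1,1} + lam2*||W||_F^2
             + lam1_t*||H||_{1,1} + lam2_t*||H||_F^2."""
    V_hat = W @ H
    kl = np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)
    reg_W = lam1 * np.abs(W).sum() + lam2 * np.square(W).sum()
    reg_H = lam1_t * np.abs(H).sum() + lam2_t * np.square(H).sum()
    return kl + reg_W + reg_H
```

Since the iterates of Algorithm 1 below remain entrywise positive (Remark 9), plotting this quantity over iterations should reproduce the monotone decrease guaranteed by the surrogate construction.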
III. ALGORITHMS

In this section we first define the notions of surrogate functions and first-order surrogate functions. Next, we present a general framework for deriving the MU algorithm for the problem (7), based on majorization-minimization [6]. This framework is sufficient for our convergence analysis. However, as side contributions, we also present a systematic procedure to construct first-order surrogate functions of $(W, H) \mapsto \ell(W, H)$ for $W$ (resp. $H$) and to derive the specific multiplicative update rules for $W$ (resp. $H$). Finally, we discuss how to apply these techniques to the family of $\alpha$-divergences and how to extend the techniques to the dual KL divergence.

A. First-Order Surrogate Functions
Definition 2. Given $n$ finite-dimensional real Euclidean spaces $\{\mathcal{X}_i\}_{i=1}^n$, define $\mathcal{X} \triangleq \prod_{i=1}^n \mathcal{X}_i$. For any $x \in \mathcal{X}$, denote its $n$-block form as $(x_1, \ldots, x_n)$, where for any $i \in [n]$, $x_i \in \mathcal{X}_i$ denotes the $i$-th block of $x$. Consider a differentiable function $f: \mathcal{X} \to \mathbb{R}$. For any $i \in [n]$, a first-order surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$, denoted as $F_i(\cdot|\cdot): \mathcal{X}_i \times \mathcal{X} \to \mathbb{R}$, satisfies the following five properties:

(P1) $F_i(\tilde x_i \,|\, \tilde x) = f(\tilde x)$, for any $\tilde x \in \mathcal{X}$;
(P2) $F_i(x_i \,|\, \tilde x) \ge f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)$, for any $(x_i, \tilde x) \in \mathcal{X}_i \times \mathcal{X}$;
(P3) $F_i(\cdot|\cdot)$ is differentiable on $\mathcal{X}_i \times \mathcal{X}$, and for any $\tilde x \in \mathcal{X}$, there exists a function $g(\cdot \,|\, \tilde x): \mathcal{X}_i \to \mathbb{R}$ such that $\nabla F_i(\cdot \,|\, \tilde x) = g(\cdot / \tilde x_i \,|\, \tilde x)$ on $\mathcal{X}_i$;
(P4) $\nabla_{x_i} F_i(x_i \,|\, \tilde x)\big|_{x_i = \tilde x_i} = \nabla_{x_i} f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)\big|_{x_i = \tilde x_i}$, for any $\tilde x \in \mathcal{X}$;
(P5) $F_i(\cdot \,|\, \tilde x)$ is strictly convex on $\mathcal{X}_i$, for any $\tilde x \in \mathcal{X}$.

If $F_i(\cdot|\cdot)$ only satisfies properties (P1) to (P3), then it is called a surrogate function of $f$ for $x_i$. Note that in general, (first-order) surrogate functions may not be unique.

Remark 6 (Roles of (P1) to (P5)). From (P5), we know the minimizer of $F_i(\cdot \,|\, \tilde x)$ over $\mathcal{X}_i$ is unique. Let us denote it as $x_i^*$. From both (P1) and (P2), we can deduce that $f(\tilde x_1, \ldots, x_i^*, \ldots, \tilde x_n) \le f(\tilde x)$. In addition, (P3) ensures that minimizing $x_i \mapsto F_i(x_i \,|\, \tilde x)$ over $\mathcal{X}_i$ yields a multiplicative update for the $i$-th block $x_i$. Finally, (P4) ensures that for any $\tilde x \in \mathcal{X}$ and $i \in [n]$, the gradient of $x_i \mapsto F_i(x_i \,|\, \tilde x)$ agrees with that of $x_i \mapsto f(\tilde x_1, \ldots, x_i, \ldots, \tilde x_n)$ at $\tilde x_i$. This property will be leveraged in our convergence analysis.

Remark 7. With a slight abuse of terminology, we shall term any function $x_i \mapsto \widetilde F_i(x_i \,|\, \tilde x)$ a (first-order) surrogate function if it differs from $x_i \mapsto F_i(x_i \,|\, \tilde x)$ by a constant that is independent of $x_i$. This is because such a constant difference does not affect the minimizer(s) of $F_i(\cdot \,|\, \tilde x)$ or $\widetilde F_i(\cdot \,|\, \tilde x)$ over $\mathcal{X}_i$, or their gradients w.r.t. $x_i$. Consequently, it does not affect the resulting multiplicative updates for $x_i$ or the convergence analysis in Section IV.
Algorithm 1: General Framework for Multiplicative Updates

Input: Data matrix $V \in \mathbb{R}^{F\times N}_+$, latent dimension $K$, regularization weights $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$, maximum number of iterations $t_{\max}$
Initialize $W_0 \in \mathbb{R}^{F\times K}_{++}$, $H_0 \in \mathbb{R}^{K\times N}_{++}$
For $t = 0, 1, \ldots, t_{\max} - 1$:
  $W_{t+1} := \arg\min_{W \in \mathbb{R}^{F\times K}_+} G_1(W \,|\, W_t, H_t)$  (8)
  $H_{t+1} := \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, W_{t+1}, H_t)$  (9)
End
Output: Basis matrix $W_{t_{\max}}$ and coefficient matrix $H_{t_{\max}}$

B. General Framework for Multiplicative Updates
The general framework for deriving the MU algorithm for (7) is shown in Algorithm 1, where $G_1(\cdot|\cdot)$ and $G_2(\cdot|\cdot)$ denote the first-order surrogate functions of $(W, H) \mapsto \ell(W, H)$ for $W$ and $H$ respectively. As will be shown in Proposition 2, the minimization steps in (8) and (9) indeed result in multiplicative updates for $W$ and $H$ respectively.
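In code, Algorithm 1 is nothing more than the alternation of two surrogate minimizations. The skeleton below is a sketch with our own names; `update_W` and `update_H` stand for the closed-form surrogate minimizers derived in Proposition 2:

```python
def mu_framework(V, W0, H0, update_W, update_H, t_max=100):
    """Skeleton of Algorithm 1. `update_W(V, W, H)` and `update_H(V, W, H)`
    must return the (unique, by (P5)) minimizers of G1(. | W, H) and
    G2(. | W, H) over the respective nonnegative orthants."""
    W, H = W0.copy(), H0.copy()
    for _ in range(t_max):
        W = update_W(V, W, H)  # step (8): W_{t+1} minimizes G1(. | W_t, H_t)
        H = update_H(V, W, H)  # step (9): H_{t+1} minimizes G2(. | W_{t+1}, H_t)
    return W, H
```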
C. Construction of First-Order Surrogate Functions

We only focus on constructing a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, and when $D(V\|\cdot)$ is a separable $h$-divergence. By the symmetry between $W$ and $H$, such a surrogate function for $H$ can be easily obtained (by taking transposition). In addition, since $h(\cdot, t)$ is either convex or concave for any $t \in \mathbb{R}$ (cf. Lemma T-1), $D(V\|\cdot)$ in (3) is a difference-of-convex function. Therefore, by using either Jensen's inequality or a first-order Taylor expansion, a nonseparable $h$-divergence can be easily converted to a separable one. Such conversion techniques are common in the NMF literature [5], [6], [43].
Proposition 1. For any $V \in \mathbb{R}^{F\times N}_+$, $W \in \mathbb{R}^{F\times K}_+$, $\widetilde W \in \mathbb{R}^{F\times K}_{++}$, $\widetilde H \in \mathbb{R}^{K\times N}_{++}$, and $\vartheta_1, \vartheta_2 \in \mathbb{R}$, define $\widetilde Z \triangleq (\widetilde W, \widetilde H)$ and a function

$$G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \Big[ (s^+_{ik} + \lambda_1)\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) + 2\lambda_2\, \tilde w_{ik}^2\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) - s^-_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_1\Big) \Big], \tag{10}$$

where $S^+$ and $S^-$ are the sums of positive and unsigned negative terms in $\nabla_W D(V \,\|\, W\widetilde H)\big|_{W = \widetilde W}$ respectively (cf. [3]). (Note that the decomposition of $\nabla_W D(V \,\|\, W\widetilde H)\big|_{W = \widetilde W}$ into positive and negative terms is not unique. However, Proposition 1, and hence Proposition 2, holds for any such decomposition.) Then for any separable $h$-divergence $D(V\|\cdot)$, there exist real numbers $\vartheta_1 < \vartheta_2$ such that $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ with respect to the variable $W$.

Proof. See Appendix A. □
D. Derivation of Multiplicative Updates
Based on Proposition 1, by setting $\nabla_W G_1(W \,|\, \widetilde Z)$ to zero, we obtain the multiplicative update for $W$.

Proposition 2. Let $V$, $D(V\|\cdot)$, $\vartheta_1$, $\vartheta_2$, $S^+$ and $S^-$ be given as in Proposition 1. For any $t \ge 0$, let $\widetilde W$ (resp. $\widetilde H$) denote the basis (resp. coefficient) matrix at iteration $t$, and $W$ (resp. $H$) denote the basis (resp. coefficient) matrix at iteration $t+1$. For any $(i, k) \in [F] \times [K]$, the multiplicative update corresponding to (8) in Algorithm 1 admits the form

$$w_{ik} := \tilde w_{ik} \left( \frac{s^-_{ik}}{s^+_{ik} + 2\lambda_2 \tilde w_{ik} + \lambda_1} \right)^{1/(\vartheta_2 - \vartheta_1)}. \tag{11}$$

Remark 8 (Numerical stability). In (11), the presence of $\lambda_1 > 0$ ensures numerical stability, i.e., it prevents the denominator of the multiplicative factor from being arbitrarily small (which may lead to numerical overflow). As a popular heuristic (e.g., [29]), a small positive number is usually added to this denominator artificially. Here we establish the connection between this artificially added small number and the $\ell_1$ regularization for $h$-divergences, thereby theoretically justifying this heuristic. (This connection has been observed for some special $h$-divergences [5], [30], but here we provide a more general and unified discussion.)

Remark 9 (Positivity of $W_t$). As shown in [6], both matrices $S^+$ and $S^-$ are entry-wise positive, i.e., $S^+, S^- \in \mathbb{R}^{F\times K}_{++}$. Therefore, if $\widetilde W \in \mathbb{R}^{F\times K}_{++}$ in (11), then $W \in \mathbb{R}^{F\times K}_{++}$. Since the initial basis matrix $W_0 \in \mathbb{R}^{F\times K}_{++}$ in Algorithm 1, for any finite index $t \in \mathbb{N}$, $W_t$ will be entry-wise positive. Similar arguments apply to $H_t$. Therefore, the positivity requirements on $(\widetilde W, \widetilde H)$ in Proposition 1 can be satisfied at any finite iteration. Note that this does not prevent any limit point of $\{(W_t, H_t)\}_{t=1}^{\infty}$ from having zero entries.
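The update (11) is a single vectorized expression once $S^+$ and $S^-$ are available. The following sketch is ours (divergence-specific code must supply `S_plus` and `S_minus`) and makes the roles of $\lambda_1$, $\lambda_2$ and the exponent explicit:

```python
def mu_step_W(W_tilde, S_plus, S_minus, lam1, lam2, theta1, theta2):
    """One multiplicative update of the basis matrix per (11):
    w_ik <- w~_ik * (s-_ik / (s+_ik + 2*lam2*w~_ik + lam1))^(1/(theta2-theta1)).
    With lam1 > 0 the denominator is bounded away from zero (Remark 8), and
    entrywise-positive inputs yield an entrywise-positive output (Remark 9),
    so no extra safeguard is needed."""
    ratio = S_minus / (S_plus + 2.0 * lam2 * W_tilde + lam1)
    return W_tilde * ratio ** (1.0 / (theta2 - theta1))
```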
E. A Concrete Example

As the first-order surrogate function (10) in Proposition 1 and the multiplicative update rule (11) in Proposition 2 may seem abstract, we apply them, as a concrete example, to the family of $\alpha$-divergences ($\alpha \ne 0$). Details are deferred to Appendix B.

F. Extension to the Dual KL Divergence
When $\alpha = 0$, the corresponding $\alpha$-divergence is called the (generalized) dual KL divergence. Strictly speaking, it does not belong to the class of $h$-divergences. However, equipped with a few more algebraic manipulations based on several technical definitions, we can also construct a first-order surrogate function and derive a multiplicative update in the form of (10) and (11) respectively. See Appendix C for details. Consequently, the result of our convergence analysis (i.e., Theorem 1) also applies to this case.

IV. CONVERGENCE ANALYSIS
A. Preliminaries
We first define important concepts and quantities that will be used in our convergence analysis in Section IV-B.
Definition 3 (Stationary points of constrained optimization problems). Given a finite-dimensional real Euclidean space $\mathcal{X}$ with inner product $\langle\cdot,\cdot\rangle$, a differentiable function $g: \mathcal{X} \to \mathbb{R}$ and a set $\mathcal{K} \subseteq \mathcal{X}$, $\bar x \in \mathcal{K}$ is a stationary point of the constrained optimization problem $\min_{x \in \mathcal{K}} g(x)$ if $\langle \nabla g(\bar x), x - \bar x \rangle \ge 0$ for all $x \in \mathcal{K}$.

Define $X \triangleq [W^T \;\; H] \in \mathbb{R}^{K\times(F+N)}_+$ and, with a slight abuse of notation, write $\ell(X) \triangleq \ell(W, H)$. Thus by Definition 3, $(\bar W, \bar H)$ is a stationary point of (7) if and only if $\langle \nabla_X \ell(\bar X), X - \bar X \rangle \ge 0$ for any $X \in \mathbb{R}^{K\times(F+N)}_+$, where $\bar X \triangleq [\bar W^T \;\; \bar H]$. In particular, this is true if

$$\langle \nabla_W \ell(\bar W, \bar H), W - \bar W \rangle \ge 0, \quad \forall\, W \in \mathbb{R}^{F\times K}_+, \tag{12}$$
$$\langle \nabla_H \ell(\bar W, \bar H), H - \bar H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{13}$$

Remark 10. In some previous works (e.g., [20]), stationary points are defined in terms of the KKT conditions, i.e.,

$$\bar W \ge 0, \quad \bar H \ge 0,$$
$$\nabla_W \ell(\bar W, \bar H) \ge 0, \quad \nabla_H \ell(\bar W, \bar H) \ge 0,$$
$$\bar W \odot \nabla_W \ell(W, \bar H)\big|_{W = \bar W} = 0, \quad \bar H \odot \nabla_H \ell(\bar W, H)\big|_{H = \bar H} = 0.$$

Since both $\bar W$ and $\bar H$ are nonnegative, it is easy to show these three conditions are equivalent to (12) and (13). In our analysis, we will use (12) and (13) for convenience.

Definition 4 (Convergence of a sequence to a set). Given a finite-dimensional real Euclidean space $\mathcal{X}$ with norm $\|\cdot\|$, a sequence $\{x_n\}_{n=1}^{\infty}$ in $\mathcal{X}$ is said to converge to a set $\mathcal{A} \subseteq \mathcal{X}$, denoted as $x_n \to \mathcal{A}$, if $\lim_{n\to\infty} \inf_{a\in\mathcal{A}} \|x_n - a\| = 0$.
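Numerically, the KKT conditions of Remark 10 give a convenient stopping diagnostic. The helper below is our own suggestion (it is not used in the paper's proofs); it returns a scalar residual that vanishes exactly at points satisfying the three conditions, given gradients computed elsewhere:

```python
import numpy as np

def kkt_residual(W, H, grad_W, grad_H):
    """Residual of the KKT conditions in Remark 10 at nonnegative (W, H):
    (i) the gradients must be entrywise nonnegative, and (ii) complementary
    slackness W . grad_W = 0 and H . grad_H = 0 must hold."""
    dual_feas = max(np.maximum(-grad_W, 0.0).max(),
                    np.maximum(-grad_H, 0.0).max())
    comp_slack = max(np.abs(W * grad_W).max(), np.abs(H * grad_H).max())
    return max(dual_feas, comp_slack)
```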
B. Main Result
Theorem 1. For any $V \in \mathbb{R}^{F\times N}_+$, $K \in \mathbb{N}$, $\lambda_1, \tilde\lambda_1 > 0$ and $\lambda_2, \tilde\lambda_2 \ge 0$, the sequence of iterates $\{(W_t, H_t)\}_{t=1}^{\infty}$ generated by Algorithm 1 converges to the set of stationary points of (7).

Proof. Since $x_n \to \mathcal{A}$ (cf. Definition 4) if and only if every limit point of $\{x_n\}_{n=1}^{\infty}$ lies in $\mathcal{A}$, it suffices to show every limit point of $\{(W_t, H_t)\}_{t=1}^{\infty}$ is a stationary point of (7). Since $\lambda_1, \tilde\lambda_1 > 0$, $(W, H) \mapsto \ell(W, H)$ is jointly coercive in $(W, H)$ [44]. In addition, the continuous differentiability of $h(\cdot, t)$ implies the joint continuous differentiability of $(W, H) \mapsto \ell(W, H)$ in $(W, H)$. Hence the sub-level set

$$\mathcal{S} \triangleq \big\{ (W, H) \in \mathbb{R}^{F\times K}_+ \times \mathbb{R}^{K\times N}_+ \,\big|\, \ell(W, H) \le \ell(W_0, H_0) \big\} \tag{14}$$

is compact. Since the sequence $\{\ell(W_t, H_t)\}_{t=1}^{\infty}$ is nonincreasing, $\{(W_t, H_t)\}_{t=1}^{\infty} \subseteq \mathcal{S}$. By the compactness of $\mathcal{S}$, $\{(W_t, H_t)\}_{t=1}^{\infty}$ has at least one limit point. Pick any such limit point and denote it as $\mathring Z \triangleq (\mathring W, \mathring H)$. We also define

$$Z_t \triangleq \begin{cases} \big(W_{t/2},\, H_{t/2}\big), & t \text{ even} \\ \big(W_{\lfloor t/2 \rfloor + 1},\, H_{\lfloor t/2 \rfloor}\big), & t \text{ odd}, \end{cases} \quad \forall\, t \in \mathbb{N}. \tag{15}$$

Note that the subsequence of $\{Z_t\}_{t=1}^{\infty}$ with even indices corresponds to the sequence $\{(W_t, H_t)\}_{t=1}^{\infty}$. Hence, there exists a subsequence $\{Z_{t_j}\}_{j=1}^{\infty}$ that converges to $\mathring Z \in \mathcal{S}$, where the $\{t_j\}_{j=1}^{\infty}$ are all even. Moreover, there exists a subsequence of $\{Z_{t_j - 1}\}_{j=1}^{\infty}$, denoted as $\{Z_{t_{j_i} - 1}\}_{i=1}^{\infty}$, such that $Z_{t_{j_i} - 1}$ converges to (possibly) some other limit point $\mathring Z' \triangleq (\mathring W', \mathring H')$ as $i \to \infty$.

Next we show $\mathring Z = \mathring Z'$. By the update rule (9), we have

$$H_{t_{j_i}/2} \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2\big(H \,|\, Z_{t_{j_i} - 1}\big), \quad \forall\, i \in \mathbb{N}. \tag{16}$$

Thus for any $i \in \mathbb{N}$,

$$G_2\big(H_{t_{j_i}/2} \,|\, Z_{t_{j_i} - 1}\big) \le G_2\big(H \,|\, Z_{t_{j_i} - 1}\big), \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{17}$$

By (P2), we also have for any $i \in \mathbb{N}$,

$$\ell\big(Z_{t_{j_i}}\big) \triangleq \ell\big(W_{t_{j_i}/2}, H_{t_{j_i}/2}\big) \le G_2\big(H_{t_{j_i}/2} \,|\, Z_{t_{j_i} - 1}\big). \tag{18}$$

Taking $i \to \infty$ on both sides of (17) and (18), we have

$$\ell(\mathring Z) \le G_2(\mathring H \,|\, \mathring Z') \le G_2(H \,|\, \mathring Z'), \quad \forall\, H \in \mathbb{R}^{K\times N}_+, \tag{19}$$

by the joint continuity of $G_2(\cdot|\cdot)$ in both arguments from (P3). Thus

$$\mathring H \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, \mathring Z'). \tag{20}$$

Taking $H = \mathring H'$ in (19), we have

$$\ell(\mathring Z) \le G_2(\mathring H \,|\, \mathring Z') \le G_2(\mathring H' \,|\, \mathring Z') = \ell(\mathring Z'). \tag{21}$$

Since $\{\ell(Z_t)\}_{t=1}^{\infty}$ converges (to a unique limit), we have $\ell(\mathring Z) = \ell(\mathring Z')$. This implies that $\ell(\mathring Z) = G_2(\mathring H \,|\, \mathring Z')$. Then for any $H \in \mathbb{R}^{K\times N}_+$,

$$G_2(\mathring H' \,|\, \mathring Z') = \ell(\mathring Z') = \ell(\mathring Z) = G_2(\mathring H \,|\, \mathring Z') \le G_2(H \,|\, \mathring Z'). \tag{22}$$

This implies that

$$\mathring H' \in \arg\min_{H \in \mathbb{R}^{K\times N}_+} G_2(H \,|\, \mathring Z'). \tag{23}$$

Combining (20) and (23), by the strict convexity of $G_2(\cdot \,|\, \mathring Z')$ from (P5), $\mathring H = \mathring H'$. By symmetry, we can show $\mathring W = \mathring W'$, hence $\mathring Z = \mathring Z'$. Thus (22) becomes

$$G_2(\mathring H \,|\, \mathring Z) \le G_2(H \,|\, \mathring Z), \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{24}$$

Now, the convexity of $G_2(\cdot \,|\, \mathring Z)$ implies that

$$\langle \nabla_H G_2(\mathring H \,|\, \mathring Z), H - \mathring H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{25}$$

From the first-order property of $G_2(\cdot \,|\, \mathring Z)$ in (P4), we have

$$\langle \nabla_H \ell(\mathring W, \mathring H), H - \mathring H \rangle \ge 0, \quad \forall\, H \in \mathbb{R}^{K\times N}_+. \tag{26}$$

Similarly, we also have

$$\langle \nabla_W \ell(\mathring W, \mathring H), W - \mathring W \rangle \ge 0, \quad \forall\, W \in \mathbb{R}^{F\times K}_+. \tag{27}$$

The variational inequalities (26) and (27) together show that $(\mathring W, \mathring H)$ is a stationary point of (7). □

Remark 11. We now provide some intuition for the proof. We first use the positivity of $\lambda_1$ and $\tilde\lambda_1$ to assert that $(W, H) \mapsto \ell(W, H)$ is coercive and hence that $\mathcal{S}$ is compact. This allows us to extract convergent subsequences.
The most crucial step, (24), states that at an arbitrary limit point of $\{Z_t\}_{t=1}^{\infty}$, denoted as $\mathring Z = (\mathring W, \mathring H)$, $\mathring H$ serves as a minimizer of $G_2(\cdot \,|\, \mathring Z)$ over $\mathbb{R}^{K\times N}_+$. By symmetry, $\mathring W$ also serves as a minimizer of $G_1(\cdot \,|\, \mathring Z)$ over $\mathbb{R}^{F\times K}_+$. In the single-block case, this idea is fairly intuitive. However, to prove (24) in the double-block case, we consider two subsequences $\{Z_{t_{j_i}}\}_{i=1}^{\infty}$ and $\{Z_{t_{j_i}-1}\}_{i=1}^{\infty}$. In each sequence, only $W$ or $H$ is updated. Then we show these two sequences converge to the same limit point. This implies the Gauss-Seidel minimization procedure [45, Section 7.3] in the double-block case is essentially the same as the minimization in the single-block case. The claim then follows.

V. CONCLUSION AND FUTURE WORK
In this work, we present a unified MU algorithm for (weighted) $h$-divergences with $\ell_1$ and Tikhonov regularization and analyze its convergence (to stationary points).

In the future, we plan to investigate further properties of the MU algorithm. Specifically, we would like to understand whether it is able to converge to second-order stationary points [46]. This question is motivated by some recent works which have shown that for low-rank matrix factorization problems [47], under mild conditions, all the second-order stationary points are local minima. For these problems, the local minima have been shown to possess strong theoretical properties [48]. Therefore, investigating the convergence of the MU algorithm to second-order stationary points is meaningful.

Acknowledgements: The authors would like to thank Prof. Zhirong Yang for many useful comments on the manuscript.
APPENDIX A
PROOF OF PROPOSITION 1

We first show that $G_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$ by decomposing $G_1(\cdot|\cdot)$ into two functions $\widetilde G_1(\cdot|\cdot)$ and $\overline G_1(\cdot|\cdot)$, where

$$\widetilde G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \Big[ s^+_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big) - s^-_{ik}\, \tilde w_{ik}\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_1\Big) \Big], \tag{28}$$

$$\overline G_1(W \,|\, \widetilde Z) \triangleq \sum_{i=1}^F \sum_{k=1}^K \big( \lambda_1 \tilde w_{ik} + 2\lambda_2 \tilde w_{ik}^2 \big)\, h\Big(\frac{w_{ik}}{\tilde w_{ik}}, \vartheta_2\Big). \tag{29}$$

Define constants $\{\zeta'_p\}_{p=1}^P$ such that $\zeta'_p \triangleq 1$ if $\zeta_p \in (0, 1]$ and $\zeta'_p \triangleq \zeta_p$ otherwise, for any $p \in [P]$. Accordingly, define

$$\zeta'_{\min} \triangleq \min\,\{\zeta'_p\}_{p=1}^P \quad \text{and} \quad \zeta'_{\max} \triangleq \max\,\{\zeta'_p\}_{p=1}^P. \tag{30}$$

When $\vartheta_1 = \zeta'_{\min}$ and $\vartheta_2 = \zeta'_{\max}$, Yang and Oja [6] showed that $\widetilde G_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$. By Lemmas T-1 and T-2(b), $\widetilde G_1(\cdot|\cdot)$ remains a surrogate function for any $\vartheta_2 \ge \zeta'_{\max}$. Define a new function

$$G'_1(W \,|\, \widetilde Z) \triangleq \overline G_1(W \,|\, \widetilde Z) + \sum_{i=1}^{2} \lambda_i\, \phi_i(\widetilde W) + \sum_{j=1}^{2} \tilde\lambda_j\, \phi_j(\widetilde H), \tag{31}$$

so that $W \mapsto \overline G_1(W \,|\, \widetilde Z)$ and $W \mapsto G'_1(W \,|\, \widetilde Z)$ differ by a constant that is independent of $W$. By Lemma T-2(a), to show $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, it suffices to show $G'_1(\cdot|\cdot)$ is a surrogate function of $(W, H) \mapsto \sum_{i=1}^2 \lambda_i \phi_i(W) + \sum_{j=1}^2 \tilde\lambda_j \phi_j(H)$ for $W$, with $\vartheta_2 \ge \zeta'_{\max}$. First, note that

$$\lambda_1 \|W\|_{1,1} + \lambda_2 \|W\|_F^2 = \sum_{i=1}^F \sum_{k=1}^K \lambda_1 \tilde w_{ik} \Big[ \frac{w_{ik}}{\tilde w_{ik}} - 1 \Big] + 2\lambda_2 \tilde w_{ik}^2 \cdot \frac{1}{2}\Big[ \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big)^2 - 1 \Big] + \lambda_1 \|\widetilde W\|_{1,1} + \lambda_2 \|\widetilde W\|_F^2 \tag{32}$$

$$\le \sum_{i=1}^F \sum_{k=1}^K \big( \lambda_1 \tilde w_{ik} + 2\lambda_2 \tilde w_{ik}^2 \big)\, h\Big( \frac{w_{ik}}{\tilde w_{ik}},\, \max\{\zeta'_{\max},\, 1 + \mathrm{sgn}(\lambda_2)\} \Big) + \lambda_1 \|\widetilde W\|_{1,1} + \lambda_2 \|\widetilde W\|_F^2, \tag{33}$$

where $\mathrm{sgn}(\lambda_2)$ equals $0$ if $\lambda_2 = 0$ and equals $1$ if $\lambda_2 > 0$, and in (33) we use the monotonicity of $h(\sigma, \cdot)$ for any $\sigma > 0$ from Lemma T-1. Since $h(1, t) = 0$ for any $t > 0$, by choosing $\vartheta_2 = \max\{\zeta'_{\max}, 1 + \mathrm{sgn}(\lambda_2)\}$, we see that $G'_1(\cdot|\cdot)$ satisfies properties (P1) and (P2). In addition, by (41) in Lemma T-1, $G'_1(\cdot|\cdot)$ obviously satisfies (P3).

To prove $G_1(\cdot|\cdot)$ is a first-order surrogate function of $(W, H) \mapsto \ell(W, H)$ for $W$, we show it also satisfies properties (P4) and (P5). First, for any $i \in [F]$ and $k \in [K]$,

$$\big[ \nabla_W G_1(W \,|\, \widetilde Z) \big]_{ik} = \big( s^+_{ik} + \lambda_1 + 2\lambda_2 \tilde w_{ik} \big) \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_2 - 1} - s^-_{ik} \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_1 - 1}.$$

Therefore, $\nabla_W G_1(W \,|\, \widetilde Z)\big|_{W = \widetilde W} = \nabla_W \ell(W, \widetilde H)\big|_{W = \widetilde W}$. This shows (P4). Next, since $\vartheta_1 \le 1 \le \vartheta_2$, $\vartheta_2 - \vartheta_1 > 0$ and $S^+, S^- \in \mathbb{R}^{F\times K}_+$, for any $i \in [F]$, $k \in [K]$ and $W \in \mathbb{R}^{F\times K}_{++}$,

$$\frac{\partial^2}{\partial w_{ik}^2}\, G_1(W \,|\, \widetilde Z) = (1 - \vartheta_1)\, \frac{s^-_{ik}}{\tilde w_{ik}} \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_1 - 2} + \Big( \frac{s^+_{ik} + \lambda_1}{\tilde w_{ik}} + 2\lambda_2 \Big) (\vartheta_2 - 1) \Big( \frac{w_{ik}}{\tilde w_{ik}} \Big)^{\vartheta_2 - 2} > 0.$$

This implies $W \mapsto G_1(W \,|\, \widetilde Z)$ is strictly convex on $\mathbb{R}^{F\times K}_{++}$. Since $W \mapsto G_1(W \,|\, \widetilde Z)$ is continuous and convex on $\mathbb{R}^{F\times K}_+$, it is strictly convex on $\mathbb{R}^{F\times K}_+$. This proves (P5). □

APPENDIX B
FIRST-ORDER SURROGATE FUNCTIONS AND MULTIPLICATIVE UPDATES FOR THE α-DIVERGENCES (α ≠ 0)

Definition 5 ($\alpha$-divergences). Given any matrix $V \in \mathbb{R}^{F\times N}_{++}$, the $\alpha$-divergence $D_\alpha(V\|\cdot): \mathbb{R}^{F\times N}_{++} \to \mathbb{R}_+$ is defined as $D_\alpha(V \,\|\, \widehat V) \triangleq \sum_{i=1}^F \sum_{j=1}^N d_\alpha(v_{ij} \,\|\, \hat v_{ij})$, where for $v, \hat v > 0$,

$$d_\alpha(v \,\|\, \hat v) \triangleq \begin{cases} \dfrac{\hat v \big[ (v/\hat v)^\alpha - 1 \big] - \alpha (v - \hat v)}{\alpha(\alpha - 1)}, & \alpha \in \mathbb{R} \setminus \{0, 1\} \\[4pt] v \log(v/\hat v) - v + \hat v, & \alpha = 1 \\[2pt] \hat v \log(\hat v / v) - \hat v + v, & \alpha = 0. \end{cases}$$
To construct the first-order surrogate function $G_{1,\alpha}(\cdot|\cdot)$ of $(W, H) \mapsto \ell(W, H)$ in (6) for $W$, when $D(V\|\cdot)$ belongs to the family of $\alpha$-divergences ($\alpha \ne 0$), first recall the definitions of $S^+$ and $S^-$ in Proposition 1. From the definition of the $\alpha$-divergences (Definition 5), given any $\widetilde W \in \mathbb{R}^{F\times K}_{++}$ and $\widetilde H \in \mathbb{R}^{K\times N}_{++}$, and for any $(i, k) \in [F] \times [K]$, we have

$$s^+_{ik} = \frac{1}{\alpha} \sum_{j=1}^N \tilde h_{kj} \quad \text{and} \quad s^-_{ik} = \frac{1}{\alpha} \sum_{j=1}^N q_{ij}^{\alpha}\, \tilde h_{kj}, \quad \alpha \ne 0, \tag{34}$$

where for any $(i, j) \in [F] \times [N]$, $q_{ij} \triangleq v_{ij} / (\widetilde W \widetilde H)_{ij}$. In addition, from Definition 5, we can also read off the values of $\zeta_1$ and $\zeta_2$ (see Definition 1), and hence deduce from (30) that

$$(\zeta'_{\max},\, \zeta'_{\min}) = \begin{cases} (1,\; 1 - \alpha), & \alpha > 0 \\ (1 - \alpha,\; 1), & \alpha < 0. \end{cases} \tag{35}$$

By choosing $\vartheta_1 = \zeta'_{\min}$ and $\vartheta_2 = \max\{\zeta'_{\max}, 1 + \mathrm{sgn}(\lambda_2)\}$ as in Appendix A, we can obtain $G_{1,\alpha}(\cdot|\cdot)$ per (10) in Proposition 1. Additionally, the multiplicative update (11) in the case of $\alpha$-divergences becomes

$$w_{ik} := \tilde w_{ik} \left( \frac{\sum_{j=1}^N q_{ij}^{\alpha}\, \tilde h_{kj}}{\sum_{j=1}^N \tilde h_{kj} + 2\alpha\lambda_2 \tilde w_{ik} + \alpha\lambda_1} \right)^{1/\varphi(\alpha)}, \quad \alpha \ne 0, \quad \text{where} \quad \varphi(\alpha) = \begin{cases} \alpha + \mathrm{sgn}(\lambda_2), & \alpha > 0 \\ \max\{-\alpha,\, \mathrm{sgn}(\lambda_2)\}, & \alpha \in (-1, 0) \\ -\alpha, & \alpha \le -1. \end{cases} \tag{36}$$
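For readers who wish to try the update (36), the sketch below implements it for $\alpha > 0$ only, where $\varphi(\alpha) = \alpha + \mathrm{sgn}(\lambda_2)$; the function and variable names are ours, and entrywise-positive inputs with $\lambda_1 > 0$ are assumed, as Remark 9 guarantees along the iterations:

```python
import numpy as np

def mu_step_W_alpha(V, W_tilde, H_tilde, alpha, lam1, lam2):
    """Multiplicative update of W per (34)-(36) for an alpha-divergence.
    V: (F, N); W_tilde: (F, K); H_tilde: (K, N), all entrywise positive."""
    assert alpha > 0, "this sketch covers only the alpha > 0 branch of (36)"
    Q = V / (W_tilde @ H_tilde)              # q_ij = v_ij / (W~ H~)_ij
    num = (Q ** alpha) @ H_tilde.T           # [num]_ik = sum_j q_ij^alpha h~_kj
    den = H_tilde.sum(axis=1)[None, :] + 2 * alpha * lam2 * W_tilde + alpha * lam1
    phi = alpha + (1.0 if lam2 > 0 else 0.0)  # phi(alpha) = alpha + sgn(lam2)
    return W_tilde * (num / den) ** (1.0 / phi)
```

The update for $H$ follows by symmetry (exchange the roles of $W$ and $H$ and transpose), and alternating the two steps instantiates Algorithm 1 for this family.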
APPENDIX C
FIRST-ORDER SURROGATE FUNCTIONS AND MULTIPLICATIVE UPDATES FOR THE DUAL KL DIVERGENCE
In [12], a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$, when $D(V\|\cdot)$ is the dual KL divergence, is given by

$$\widetilde G_{1,0}(W \,|\, \widetilde Z) \triangleq \sum_{ik} \Big[ w_{ik} \log\Big(\frac{w_{ik}}{\tilde w_{ik}}\Big) \sum_j \tilde h_{kj} - w_{ik} \sum_j (\log q_{ij} + 1)\, \tilde h_{kj} \Big]. \tag{37}$$

However, this surrogate function cannot be expressed as (28) with an appropriate choice of parameters, hence it cannot be directly used to derive the first-order surrogate function for (6). Therefore, consider a majorant of $\widetilde G_{1,0}(W \,|\, \widetilde Z)$ as follows:

$$\widetilde G'_{1,0}(W \,|\, \widetilde Z) \triangleq \frac{1}{\eta} \sum_{ik} \Big[ \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big)^{1+\eta} \tilde w_{ik} \sum_j q_{ij}^{-\eta}\, \tilde h_{kj} - (1 + \eta) \Big(\frac{w_{ik}}{\tilde w_{ik}}\Big) \tilde w_{ik} \sum_j \tilde h_{kj} \Big], \tag{38}$$

where $\eta$ is any positive real number. By Lemma T-2(b), $\widetilde G'_{1,0}(\cdot|\cdot)$ is also a surrogate function of $(W, H) \mapsto D(V \,\|\, WH)$ for $W$. By comparing (28) to (38), we have

$$\vartheta_1 = 1, \quad \vartheta_2 = 1 + \eta, \quad s^+_{ik} = \frac{1}{\eta} \sum_j q_{ij}^{-\eta}\, \tilde h_{kj}, \quad s^-_{ik} = \frac{1 + \eta}{\eta} \sum_j \tilde h_{kj}.$$

Note that in this case $S^+$ and $S^-$ may not represent the sums of positive and unsigned negative terms in $\nabla_W D(V \,\|\, W \widetilde H)\big|_{W = \widetilde W}$. However, as long as they lie in $\mathbb{R}^{F\times K}_+$, Propositions 1 and 2 still hold. Therefore, by choosing $\vartheta_2 = \max\{1 + \eta,\, 1 + \mathrm{sgn}(\lambda_2)\}$, the first-order surrogate function for the dual KL divergence can be constructed per Proposition 1. Additionally, per Proposition 2, the multiplicative update is

$$w_{ik} := \tilde w_{ik} \left( \frac{(1 + \eta) \sum_j \tilde h_{kj}}{\sum_j q_{ij}^{-\eta}\, \tilde h_{kj} + 2\eta\lambda_2 \tilde w_{ik} + \eta\lambda_1} \right)^{1/\psi(\eta)}, \tag{39}$$

where

$$\psi(\eta) \triangleq \max\{1 + \eta,\, 1 + \mathrm{sgn}(\lambda_2)\} - 1. \tag{40}$$
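As in Appendix B, (39)-(40) reduce to one vectorized expression. The sketch below is ours (illustrative names; $\eta > 0$, $\lambda_1 > 0$ and entrywise-positive inputs are assumed):

```python
import numpy as np

def mu_step_W_dual_kl(V, W_tilde, H_tilde, eta, lam1, lam2):
    """Multiplicative update of W per (39)-(40) for the dual KL divergence."""
    Q = V / (W_tilde @ H_tilde)                       # q_ij = v_ij / (W~ H~)_ij
    num = (1.0 + eta) * H_tilde.sum(axis=1)[None, :]  # (1+eta) * sum_j h~_kj
    den = (Q ** (-eta)) @ H_tilde.T + 2 * eta * lam2 * W_tilde + eta * lam1
    psi = max(1.0 + eta, 1.0 + (1.0 if lam2 > 0 else 0.0)) - 1.0  # (40)
    return W_tilde * (num / den) ** (1.0 / psi)
```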
APPENDIX D
TECHNICAL LEMMAS

Among the following technical lemmas, Lemma T-1 is adapted from [49, Lemma 1], whereas Lemma T-2 can be proved directly from the definitions.
Lemma T-1 (Regularity of $h$ in (2)). For any $t \in \mathbb{R}$, denote the natural domain of $h(\cdot, t)$ as $\Xi_t$. Then $h(\cdot, t)$ is continuously differentiable on $\mathrm{int}(\Xi_t)$, i.e., the interior of $\Xi_t$. In particular,

$$\frac{\partial}{\partial \sigma}\, h(\sigma, t) = \sigma^{t-1}, \quad \forall\, t \in \mathbb{R},\; \forall\, \sigma \in \mathrm{int}(\Xi_t). \tag{41}$$

In addition, $h(\cdot, t)$ is either convex or concave on $\mathrm{int}(\Xi_t)$, for any $t \in \mathbb{R}$. Finally, for every $\nu > 0$, the function $h(\nu, \cdot)$ is nondecreasing on $\mathbb{R}$.

Lemma T-2 (Calculus of surrogate functions). Let $f: \mathcal{X} \to \mathbb{R}$ be given as in Definition 2. Let $i$ be any index in $[n]$.
(a) If $f = f_1 + f_2$, and $F_{1,i}(\cdot|\cdot)$ and $F_{2,i}(\cdot|\cdot)$ are surrogate functions of $(x_1, \ldots, x_n) \mapsto f_1(x_1, \ldots, x_n)$ and $(x_1, \ldots, x_n) \mapsto f_2(x_1, \ldots, x_n)$ for $x_i$ respectively, then $F_i(\cdot|\cdot) \triangleq F_{1,i}(\cdot|\cdot) + F_{2,i}(\cdot|\cdot)$ is a surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$.
(b) If $F_i(\cdot|\cdot)$ is a surrogate function of $(x_1, \ldots, x_n) \mapsto f(x_1, \ldots, x_n)$ for $x_i$, and there exists $\widetilde F_i: \mathcal{X}_i \times \mathcal{X} \to \mathbb{R}$ that satisfies (P3) and

$$\widetilde F_i(\tilde x_i \,|\, \tilde x) = F_i(\tilde x_i \,|\, \tilde x), \quad \forall\, \tilde x \in \mathcal{X}, \tag{42}$$
$$\widetilde F_i(x_i \,|\, \tilde x) \ge F_i(x_i \,|\, \tilde x), \quad \forall\, (x_i, \tilde x) \in \mathcal{X}_i \times \mathcal{X}, \tag{43}$$

then $\widetilde F_i(\cdot|\cdot)$ is a surrogate function of $f$ for $x_i$.

REFERENCES
[1] R. Zhao and V. Y. F. Tan, "A unified convergence analysis of the multiplicative update algorithm for nonnegative matrix factorization," in Proc. ICASSP, New Orleans, LA, USA, Mar. 2017.
[2] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, Oct. 1999.
[3] ——, "Algorithms for non-negative matrix factorization," in Proc. NIPS, Denver, USA, Dec. 2000, pp. 556–562.
[4] I. S. Dhillon and S. Sra, "Generalized nonnegative matrix approximations with Bregman divergences," in Proc. NIPS, Dec. 2006.
[5] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the beta-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, 2011.
[6] Z. Yang and E. Oja, "Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1878–1891, Dec. 2011.
[7] J. Kim and H. Park, "Toward faster nonnegative matrix factorization: A new algorithm and comparisons," in Proc. ICDM, Pisa, Italy, Dec. 2008, pp. 353–362.
[8] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Comput., vol. 19, no. 10, pp. 2756–2779, Oct. 2007.
[9] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, "An alternating direction algorithm for matrix completion with nonnegative factors," Frontiers Math. China, vol. 7, pp. 365–384, 2012.
[10] D. Sun and C. Févotte, "Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence," in Proc. ICASSP, Florence, Italy, May 2014, pp. 6201–6205.
[11] A. Cichocki, H.-K. Lee, Y.-D. Kim, and S. Choi, "Nonnegative matrix factorization with alpha-divergence," Pattern Recognit. Lett., vol. 29, no. 9, pp. 1433–1440, 2008.
[12] A. Cichocki and S. Amari, "Families of Alpha- Beta- and Gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[13] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009.
[14] S. A. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM J. Optim., vol. 20, no. 3, pp. 1364–1377, 2009.
[15] D. Donoho and V. Stodden, "When does non-negative matrix factorization give correct decomposition into parts?" in Proc. NIPS, Vancouver, Canada, Dec. 2004, pp. 1141–1148.
[16] V. Bittorf, B. Recht, C. Ré, and J. A. Tropp, "Factoring nonnegative matrices with linear programs," in Proc. NIPS, 2012, pp. 1214–1222.
[17] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization – provably," in Proc. STOC, New York, NY, USA, May 2012, pp. 145–162.
[18] N. Gillis and R. Luce, "Robust near-separable nonnegative matrix factorization using linear optimization," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1249–1280, 2014.
[19] K. Huang, N. D. Sidiropoulos, and A. Swami, "Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition," IEEE Trans. Signal Process., vol. 62, no. 1, pp. 211–224, Jan. 2014.
[20] C.-J. Lin, "On the convergence of multiplicative update algorithms for nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1589–1596, 2007.
[21] N. Gillis and F. Glineur, "Nonnegative factorization and the maximum edge biclique problem," arXiv:0810.4225, 2008.
[22] N. Takahashi and R. Hibi, "Global convergence of modified multiplicative updates for nonnegative matrix factorization," Comput. Optim. Appl., vol. 57, no. 2, pp. 417–440, 2014.
[23] N. Takahashi, J. Katayama, and J. Takeuchi, "A generalized sufficient condition for global convergence of modified multiplicative updates for NMF," in Int. Symp. Nonlinear Theory Appl., Luzern, Switzerland, 2014, pp. 44–47.
[24] H. Kim and H. Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," Bioinform., pp. 1495–1502, 2007.
[25] D. Hajinezhad, T. H. Chang, X. Wang, Q. Shi, and M. Hong, "Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis," in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 4742–4746.
[26] C. Févotte and A. T. Cemgil, "Nonnegative matrix factorizations as probabilistic inference in composite models," in Proc. EUSIPCO, Glasgow, UK, Aug. 2009, pp. 1913–1917.
[27] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[28] R. Badeau, N. Bertin, and E. Vincent, "Stability analysis of multiplicative update algorithms and application to nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 21, no. 12, pp. 1869–1881, Dec. 2010.
[29] A. Cichocki, S. Cruces, and S.-i. Amari, "Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization," Entropy, vol. 13, no. 1, pp. 134–170, 2011.
[30] P. O. Hoyer, "Non-negative sparse coding," in Proc. NNSP, Valais, Switzerland, Sep. 2002, pp. 557–565.
[31] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, 2011.
[32] L. Taslaman and B. Nilsson, "A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data," PLoS ONE, vol. 7, no. 11, pp. 1–7, 2012.
[33] A. Mirzal, "Nonparametric Tikhonov regularized NMF and its application in cancer clustering," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 11, no. 6, pp. 1208–1217, 2014.
[34] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM J. Optim., vol. 23, no. 2, 2013.
[35] M. Hong, M. Razaviyayn, Z. Q. Luo, and J. S. Pang, "A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing," IEEE Signal Process. Mag., vol. 33, no. 1, pp. 57–77, Jan. 2016.
[36] D. Guillamet, M. Bressan, and J. Vitria, "A weighted non-negative matrix factorization for local representations," in Proc. CVPR, vol. 1, 2001, pp. 942–947.
[37] D. Guillamet, J. Vitrià, and B. Schiele, "Introducing a weighted non-negative matrix factorization for image classification," Pattern Recognit. Lett., vol. 24, no. 14, pp. 2447–2454, Oct. 2003.
[38] N.-D. Ho, "Nonnegative matrix factorization: Algorithms and applications," Ph.D. dissertation, Université catholique de Louvain, 2008.
[39] Q. Gu, J. Zhou, and C. H. Q. Ding, "Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs," in Proc. SDM, 2010, pp. 199–210.
[40] X. Zheng, S. Zhu, J. Gao, and H. Mamitsuka, "Instance-wise weighted nonnegative matrix factorization for aggregating partitions with locally reliable clusters," in Proc. IJCAI, Buenos Aires, Argentina, 2015, pp. 4091–4097.
[41] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320, 2005.
[42] V. P. Pauca, J. Piper, and R. J. Plemmons, "Nonnegative matrix factorization for spectral data analysis," Linear Algebra Appl., vol. 416, no. 1, pp. 29–47, 2006.
[43] J. Mairal, "Incremental majorization-minimization optimization with application to large-scale machine learning," SIAM J. Optim., vol. 25, no. 2, pp. 829–855, 2015.
[44] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[45] R. L. Burden and J. D. Faires, Numerical Analysis. Brooks/Cole Publishing Company, 2016.
[46] Y. Nesterov and B. Polyak, "Cubic regularization of Newton method and its global performance," Mathematical Programming, vol. 108, no. 1, pp. 177–205, 2006.
[47] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Trans. Inf. Theory, vol. 62, no. 11, Nov. 2016.
[48] R. Ge, J. D. Lee, and T. Ma, "Matrix completion has no spurious local minimum," in Proc. NIPS, Barcelona, Spain, Dec. 2016.