Improved SVD-based Initialization for Nonnegative Matrix Factorization using Low-Rank Correction
Syed Muhammad Atif (1), Sameer Qazi (1), Nicolas Gillis (2)*

(1) Graduate School of Science and Engineering, PAF-Karachi Institute of Economics and Technology, Karachi, Pakistan
(2) Department of Mathematics and Operational Research, Faculté polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium

* Corresponding author. Email: [email protected].
Abstract
Due to the iterative nature of most nonnegative matrix factorization (NMF) algorithms, initialization is a key aspect as it significantly influences both the convergence and the final solution obtained. Many initialization schemes have been proposed for NMF, among which one of the most popular classes of methods is based on the singular value decomposition (SVD). However, these SVD-based initializations do not satisfy a rather natural condition, namely that the error should decrease as the rank of the factorization increases. In this paper, we propose a novel SVD-based NMF initialization to specifically address this shortcoming by taking into account the SVD factors that were discarded to obtain a nonnegative initialization. This method, referred to as nonnegative SVD with low-rank correction (NNSVD-LRC), allows us to significantly reduce the initial error at a negligible additional computational cost by using the low-rank structure of the discarded SVD factors. NNSVD-LRC has two other advantages compared to previous SVD-based initializations: (1) it provably generates sparse initial factors, and (2) it is faster as it only requires the computation of a truncated SVD of rank $\lceil r/2 \rceil + 1$, where $r$ is the factorization rank of the sought NMF decomposition (as opposed to a rank-$r$ truncated SVD for the other methods). We show on several standard dense and sparse data sets that our new method competes favorably with state-of-the-art SVD-based initializations for NMF.

Keywords: nonnegative matrix factorization, initialization, singular value decomposition.
1 Introduction

Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative matrix $X$ as the product of two nonnegative matrices: given $X \in \mathbb{R}^{m \times n}_{\geq 0}$ and an integer $r$, find $W \in \mathbb{R}^{m \times r}_{\geq 0}$ and $H \in \mathbb{R}^{r \times n}_{\geq 0}$ such that $X \approx WH$. NMF allows one to reconstruct data using a purely additive model: each column of $X$ is a nonnegative linear combination of the columns of $W$. For this reason, it is widely employed in research fields like image processing and computer vision [8, 20], data mining and document clustering [6], hyperspectral image analysis [18, 24], signal processing [31] and computational biology [19]; see also [5, 9] and the references therein.

To measure the quality of the NMF approximation, a distance metric has to be chosen. In this paper, we focus on the most widely used one, namely the Frobenius norm, leading to the following optimization problem:

$$\min_{W \in \mathbb{R}^{m \times r},\, H \in \mathbb{R}^{r \times n}} \|X - WH\|_F^2 \quad \text{such that } W \geq 0,\ H \geq 0, \qquad (1)$$

where $\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}$ is the Frobenius norm of a matrix $M$. Most algorithms tackling (1) use standard nonlinear optimization schemes such as block coordinate descent methods, hence the initialization of the factors $(W, H)$ is crucial in practice as it will influence

(i) the number of iterations needed for an algorithm to converge (in fact, if the initial point is closer to a local minimum, it will require fewer iterations to converge to it), and

(ii) the final solution to which the algorithm will converge.
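To make problem (1) concrete, here is a minimal Python sketch (illustrative only, not taken from the paper's Matlab code; all names are ours) that builds a random instance and evaluates the Frobenius objective that an initialization seeks to reduce:

    import numpy as np

    # Illustrative instance of the NMF model (1): X (m x n, nonnegative)
    # is approximated by W (m x r) times H (r x n), both nonnegative.
    rng = np.random.default_rng(0)
    m, n, r = 50, 40, 5
    X = rng.random((m, n))   # nonnegative data matrix
    W = rng.random((m, r))   # nonnegative basis factor
    H = rng.random((r, n))   # nonnegative coefficient factor

    # Frobenius error of (1): the quantity that a good initialization
    # should make small before any NMF iterations are run.
    print(np.linalg.norm(X - W @ H, 'fro'))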
Many approaches have been proposed for NMF initialization, for example based on $k$-means and spherical $k$-means [29], on fuzzy $c$-means [22], on nature-inspired heuristic algorithms [13], on Lanczos bidiagonalization [28], on subtractive clustering [4], and on the successive projection algorithm [23], to name a few; see also [15].

In this paper, we focus on SVD-based initializations for NMF. Two of the most widely used methods are NNDSVD [2] and SVD-NMF [21], which are described in the next section. These methods suffer from the fact that the approximation error $\|X - WH\|_F$ of the initial factors $(W, H)$ increases as the rank increases, which is not a desirable property for NMF initializations. Our key contribution is to provide a new SVD-based initialization that does not suffer from this shortcoming, while (i) it generates sparse factors, which not only provide storage efficiency [10] but also provide better part-based representations [4, 7] and resilience to noise [30, 26], and (ii) it only requires a truncated SVD of rank $\lceil r/2 \rceil + 1$, as opposed to a truncated SVD of rank $r$ for the other SVD-based initializations.

Outline of the paper
This paper is organized as follows. Section 2 discusses our proposed solution in detail, highlighting the differences with existing SVD-based initializations. In Section 3, we evaluate our proposed solution against other SVD-based initializations on dense and sparse data sets. Section 4 concludes the paper.
2 Nonnegative SVD with low-rank correction (NNSVD-LRC)

The truncated SVD is a low-rank matrix approximation technique that approximates a given matrix $X \in \mathbb{R}^{m \times n}$ as a sum of $r$ rank-one terms made of singular triplets, where $1 \leq r \leq \operatorname{rank}(X)$. Each singular triplet $(u_i, v_i, \sigma_i)$, $1 \leq i \leq r$, consists of two column vectors $u_i$ and $v_i$, which are the left and the right singular vectors, respectively, associated with the $i$th singular value $\sigma_i$ (we assume the singular values are sorted in nonincreasing order). We have

$$X \approx X_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T = U_r \Sigma_r V_r^T, \qquad (2)$$

where $(\cdot)^T$ denotes the transpose of a matrix or vector, $X_r$ is the rank-$r$ approximation of $X$, the columns of $U_r \in \mathbb{R}^{m \times r}$ (resp. of $V_r \in \mathbb{R}^{n \times r}$) are the left (resp. right) singular vectors, and $\Sigma_r \in \mathbb{R}^{r \times r}$ is the diagonal matrix containing the singular values on its diagonal. According to the Eckart–Young theorem, $X_r$ is an optimal rank-$r$ approximation of $X$ with respect to the Frobenius and spectral norms [12]. To simplify our later derivations, we transform the three factors of the SVD into two factors, as in NMF, by multiplying $U_r$ and $V_r^T$ by the square root of $\Sigma_r$ to obtain $Y_r$ and $Z_r$:

$$X \approx X_r = \sum_{i=1}^{r} y_i z_i = Y_r Z_r, \qquad (3)$$

where $Y_r = U_r \Sigma_r^{1/2}$, $Z_r = \Sigma_r^{1/2} V_r^T$, $y_i = \sqrt{\sigma_i}\, u_i$ and $z_i = \sqrt{\sigma_i}\, v_i^T$ for $1 \leq i \leq r$. The matrices $Y_r$ and $Z_r$ cannot be used directly for NMF initialization since they usually contain negative elements (roughly half of them, except for the first factor, by the Perron–Frobenius theorem [1]).
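The two-factor form (3) can be sketched in a few lines of Python (an illustrative sketch; a dense SVD is used for clarity, whereas large sparse data would call for a truncated solver):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((50, 40))
    r = 5

    # Dense SVD; numpy returns the singular values in nonincreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Yr = U[:, :r] * np.sqrt(s[:r])          # Y_r = U_r Sigma_r^{1/2}
    Zr = np.sqrt(s[:r])[:, None] * Vt[:r]   # Z_r = Sigma_r^{1/2} V_r^T

    # X_r = Y_r Z_r is an optimal rank-r approximation (Eckart-Young).
    print(np.linalg.norm(X - Yr @ Zr, 'fro'))

    # Roughly half of the entries of Y_r and Z_r are negative (except in
    # the first factor), so (Y_r, Z_r) is not a feasible NMF initialization.
    print((Yr < 0).mean(), (Zr < 0).mean())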
Given a vector $x$, let us denote $x^{(\geq 0)} = \max(0, x)$ its nonnegative part and $x^{(\leq 0)} = \max(0, -x)$ its nonpositive part, so that $x = x^{(\geq 0)} - x^{(\leq 0)}$. Using this notation, (3) can be rewritten as

$$X \approx X_r = \sum_{i=1}^{r} y_i z_i = \sum_{i=1}^{r} \left( y_i^{(\geq 0)} z_i^{(\geq 0)} + y_i^{(\leq 0)} z_i^{(\leq 0)} \right) - \sum_{i=1}^{r} \left( y_i^{(\geq 0)} z_i^{(\leq 0)} + y_i^{(\leq 0)} z_i^{(\geq 0)} \right). \qquad (4)$$

To obtain a feasible initialization for NMF, we have to deal with the second summand, which leads to negative elements in the decomposition. Currently, there are mostly two approaches used in practice for this purpose.

The first approach discards the second summand and selects $r$ product terms from the first summand on the basis of some criterion. In particular, the most widely used method, namely nonnegative double SVD (NNDSVD) [2], selects $r$ terms as follows: for each $i$, it selects $y_i^{(\geq 0)} z_i^{(\geq 0)}$ if $\|y_i^{(\geq 0)} z_i^{(\geq 0)}\|_F > \|y_i^{(\leq 0)} z_i^{(\leq 0)}\|_F$, and otherwise it selects $y_i^{(\leq 0)} z_i^{(\leq 0)}$. This is equivalent to projecting $Y_r$ and $Z_r$ onto the nonnegative orthant while taking advantage of the sign ambiguity of the SVD [3]. The second approach takes the absolute value of the SVD factors, which is equivalent to using $W = |Y_r|$ and $H = |Z_r|$ as an initialization for NMF [21]. This method is referred to as SVD-NMF. (An illustrative code sketch of these two baselines is given just before Algorithm 1.)

Let us denote $X_r^{\geq}$ the solution obtained by one of the two approaches mentioned above. In both cases, we will have $X_{r+1}^{\geq} \geq X_r^{\geq}$ for all $r \geq 1$, since each rank-one factor selected from the SVD is nonnegative. Hence, for $r$ sufficiently large, the error $\|X - X_r^{\geq}\|_F$ will increase as $r$ increases, since the negative terms are not taken into account; see Figure 1 for examples on real data sets. Like for the unconstrained rank-$r$ approximation $X_r$ of $X$, it would make sense that the approximation quality of $X_r^{\geq}$ increases as $r$ increases. Another drawback of these approaches is that they either throw away half of the rank-one factors of the first summand and all of the rank-one factors of the second summand (as in NNDSVD), or sum them together so that the sign information is lost (as in SVD-NMF): a lot of information is wasted.

In order to avoid these two important drawbacks, we propose a new method where

(i) we keep all the terms from the first summand in (4). Hence, we will only need a truncated SVD of rank $\lceil r/2 \rceil + 1$. In fact, assuming the matrices $XX^T$ and $X^T X$ are irreducible (which is the case for all the matrices we have tested in practice; recall that a symmetric matrix is irreducible if and only if its associated graph is connected), the first rank-one factor $y_1 z_1$ of the SVD is positive, by the Perron–Frobenius theorem [1]. This implies that $y_i^{(\geq 0)} z_i^{(\geq 0)} \neq 0$ and $y_i^{(\leq 0)} z_i^{(\leq 0)} \neq 0$ for all $i \geq 2$, since $y_i^T y_1 = z_1 z_i^T = 0$ for all $i \geq 2$, which implies that $y_i$ and $z_i$ contain at least one positive and one negative entry;

(ii) although we also discard the second summand as in NNDSVD, we use this information to improve the terms in the first summand. This can be done computationally very efficiently using the low-rank structure of the second summand; see the details below.

Our initialization is described in Algorithm 1. It works as follows. Let $p = \lceil r/2 \rceil + 1$. Then:

1. Compute the rank-$p$ truncated SVD of $X$, with $X_p = \sum_{i=1}^{p} y_i z_i$; see (3).

2. The first rank-one factor of the SVD is used to initialize $W(:,1)$ and $H(1,:)$, that is, $W(:,1) = |y_1|$ and $H(1,:) = |z_1|$. Note that the absolute value is used because the SVD has a sign ambiguity (hence it could generate $y_1$ and $z_1$ with negative entries). In any case, $|y_1| |z_1|$ is an optimal rank-one approximation since $X$ is nonnegative [1].

3. The other $r - 1$ columns of $W$ and rows of $H$ are built from the nonnegative and nonpositive parts of the remaining factors of the truncated SVD as follows: $W(:,i) = y_j^{(\geq 0)}$, $W(:,i+1) = y_j^{(\leq 0)}$, $H(i,:) = z_j^{(\geq 0)}$ and $H(i+1,:) = z_j^{(\leq 0)}$ for $i = 2, 4, \ldots$, taking $j = 2, 3, \ldots$ successively, in order to obtain a nonnegative NMF initialization $(W, H)$ with $r$ factors. Note that, by this construction, the average sparsity of these factors is at least 50%. (In practice, SVD factors usually do not contain zero entries, hence the average sparsity is exactly 50%, ignoring the first rank-one factor.)

4. In order to improve the current solution $(W, H)$ built using the first $p$ singular triplets, we propose to update it using the low-rank approximation $X_p$ by performing a few iterations of an NMF algorithm on the problem
$$\min_{W \geq 0,\, H \geq 0} \|X_p - WH\|_F^2, \quad \text{where } X_p = Y_p Z_p.$$
The reason for this choice is that, for most NMF algorithms, performing such iterations is significantly cheaper than performing a standard NMF iteration on the input matrix $X$. In fact, the most expensive step of most NMF algorithms is to compute $XH^T$, $W^T X$, $HH^T$ and $W^T W$, which relates to computing the gradients of the objective function; see, e.g., [11]. When $X = X_p$ has a low-rank representation $X_p = Y_p Z_p$, the cost of one NMF iteration reduces from $O(mnr)$ operations to $O((m+n)r^2)$ operations. In this paper, we use the state-of-the-art NMF algorithm referred to as accelerated hierarchical alternating least squares (A-HALS) [11] to perform this step. A proper implementation requires $O((m+n)r^2)$ operations per iteration, instead of $O(mnr)$ if we applied A-HALS to the input matrix $X$, as explained above. We run A-HALS as long as each iteration decreases the error by at least a proportion $\delta$ of the initial error. We used $\delta = 5\%$, which led in all tested cases to fewer than 10 iterations, whose cost is negligible compared to computing the truncated SVD, which requires $\Omega(pmn)$ operations, and to the subsequent NMF iterations, which require $O(mnr)$ operations.

The idea of using a low-rank approximation of $X$ to speed up NMF computations was proposed in [33], but not in combination with A-HALS nor as an initialization procedure.

For these reasons, we will refer to our method as nonnegative SVD with low-rank correction (NNSVD-LRC), as it consists of (i) a selection of nonnegative factors from the SVD, followed by (ii) NMF iterations that use the low-rank approximation $X_p$ of $X$, at a negligible additional computational cost of $O((m+n)r^2)$ operations.
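For concreteness, the two baselines described above can be sketched as follows (an illustrative Python transcription of their descriptions, not the authors' reference implementations):

    import numpy as np

    def pos(a):  # nonnegative part a^{(>=0)} used in (4)
        return np.maximum(a, 0)

    def neg(a):  # nonpositive part a^{(<=0)} used in (4)
        return np.maximum(-a, 0)

    def svd_nmf_init(Y, Z):
        # SVD-NMF [21]: absolute values of the rank-r SVD factors.
        return np.abs(Y), np.abs(Z)

    def nndsvd_init(Y, Z):
        # NNDSVD [2]: for each i, keep the pair of nonnegative parts with
        # the larger Frobenius norm; note ||y z||_F = ||y||_2 ||z||_2 for
        # a rank-one outer product y z.
        W, H = np.zeros_like(Y), np.zeros_like(Z)
        W[:, 0], H[0] = np.abs(Y[:, 0]), np.abs(Z[0])  # Perron-Frobenius factor
        for i in range(1, Y.shape[1]):
            yp, ym = pos(Y[:, i]), neg(Y[:, i])
            zp, zm = pos(Z[i]), neg(Z[i])
            if np.linalg.norm(yp) * np.linalg.norm(zp) >= \
               np.linalg.norm(ym) * np.linalg.norm(zm):
                W[:, i], H[i] = yp, zp
            else:
                W[:, i], H[i] = ym, zm
        return W, H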
Algorithm 1: Nonnegative Singular Value Decomposition with Low-Rank Correction (NNSVD-LRC)

Input: An m-by-n nonnegative matrix X and a positive integer r.
Output: Nonnegative factors W ∈ R^{m×r} and H ∈ R^{r×n} such that X ≈ WH.

    p = ⌈r/2⌉ + 1;
    [U, Σ, V] = truncated-SVD(X, p);
    Y_p = U Σ^{1/2}; Z_p = Σ^{1/2} V^T;
    % Populate W and H using Y_p and Z_p
    W(:,1) = |Y_p(:,1)|; H(1,:) = |Z_p(1,:)|;
    i = 2; j = 2;
    while i ≤ r do
        if i is even then
            W(:,i) = max(Y_p(:,j), 0); H(i,:) = max(Z_p(j,:), 0);
        else
            W(:,i) = max(−Y_p(:,j), 0); H(i,:) = max(−Z_p(j,:), 0);
            j = j + 1;
        end if
        i = i + 1;
    end while
    e_0 = ||X_p − WH||_F; k = 0;
    % Improve W and H by applying A-HALS to the low-rank matrix X_p = Y_p Z_p
    while k = 0 or e_{k−1} − e_k ≥ δ e_0 do
        Perform one iteration of A-HALS on X_p = Y_p Z_p starting from (W, H) to obtain an improved solution (W, H);
        e_{k+1} = ||X_p − WH||_F;
        k = k + 1;
    end while

Remark 1 (Computation of the error). In Algorithm 1, the error $\|X_p - WH\|_F$ has to be computed. This can be done in $O((m+n)r^2)$ operations, observing that

$$\|X_p - WH\|_F^2 = \langle X_p, X_p \rangle - 2 \langle X_p, WH \rangle + \langle WH, WH \rangle = \langle Y_p^T Y_p, Z_p Z_p^T \rangle - 2 \langle (W^T Y_p) Z_p, H \rangle + \langle W^T W, H H^T \rangle,$$

where $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j}$ is the inner product associated with the Frobenius norm; a numerical check of this identity is given after Table 2 below.

3 Numerical experiments

In this section, we compare NNSVD-LRC with NNDSVD and SVD-NMF. All tests are performed using Matlab R2017b (Student License) on a laptop with an Intel Core i5-2540M CPU @ 2.60GHz and 4GB RAM. The code is available from https://sites.google.com/site/nicolasgillis/code. Due to the space limit, we restrict ourselves to three dense and three sparse widely used data sets; see Tables 1 and 2. We also restrict ourselves to using the multiplicative updates algorithm, one of the most widely used ones. (In the Matlab code provided online, we provide experiments for two other data sets, namely the CBCL facial images and the classic document data set, in combination with A-HALS.)

Table 1: Biometric data sets.

Data set               Image size (h × w)   m = h × w   n
AT&T Faces [21]        112 × 92             10304       400
IITD Iris [14]         240 × 320            76800       200
TDF Fingerprints [25]  750 × 800            600000      100

The TDF data set is available from http://ivg.au.tsinghua.edu.cn/dataset/TDFD.php.

Table 2: Document data sets from [32].

Data set   #nonzeros   sparsity (%)   m      n
Sports     1091723     99.14          8580   14870
Reviews    758635      98.99          4069   18483
Hitech     331373      98.57          2301   10080
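To illustrate Remark 1, the following sketch (illustrative Python with random factors) checks that the inner-product identity matches a naive evaluation of $\|X_p - WH\|_F^2$ while only forming small Gram matrices:

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, p, r = 200, 150, 4, 7
    Yp, Zp = rng.standard_normal((m, p)), rng.standard_normal((p, n))
    W, H = rng.random((m, r)), rng.random((r, n))

    # Naive evaluation: forms the m x n matrix explicitly, O(mn(p+r)).
    e_naive = np.linalg.norm(Yp @ Zp - W @ H, 'fro') ** 2

    # Identity of Remark 1: O((m+n)r^2) operations, no m x n matrix formed.
    e_fast = (np.sum((Yp.T @ Yp) * (Zp @ Zp.T))
              - 2 * np.sum(((W.T @ Yp) @ Zp) * H)
              + np.sum((W.T @ W) * (H @ H.T)))
    print(np.isclose(e_naive, e_fast))  # True up to rounding errors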
Throughout this section, we will use the following two quantities:

1. the relative error, which measures the quality of an NMF solution:
$$\text{relative error}(W, H) = \frac{\|X - WH\|_F}{\|X\|_F};$$

2. the sparsity, which measures the proportion of zero entries in a matrix:
$$\text{sparsity}(W) = \frac{\text{number of zero entries of } W}{\text{number of entries of } W}.$$
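In code, these two quantities can be computed as follows (a minimal illustrative sketch):

    import numpy as np

    def relative_error(X, W, H):
        # ||X - WH||_F / ||X||_F, reported in percent in the tables below.
        return 100 * np.linalg.norm(X - W @ H, 'fro') / np.linalg.norm(X, 'fro')

    def sparsity(A):
        # Percentage of zero entries: 0% for dense, 100% for all-zero.
        return 100 * np.mean(A == 0)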
Initial error
Figure 1 displays the relative errors in percent for different values of r for each data set. This illustrates the fact that the error of NNDSVD and SVD-NMF increases as r increases (as soon as r is sufficiently large); see the discussion in Section 2. In contrast, the error of NNSVD-LRC decreases as r increases. Note that the relative error of SVD-NMF grows much faster than that of NNDSVD.

Figure 1: Relative error of the SVD-based NMF initializations for different values of the rank r.

One may argue that the above comparison is not totally fair, as SVD-NMF and NNDSVD did not update the factors W and H, as opposed to NNSVD-LRC. Therefore, Figure 1 also displays the relative error of these initializations after the matrix H is updated with the solution of the nonnegative least squares (NNLS) problem min_{H≥0} ||X − WH||_F for W fixed (a sketch of this update is given after Table 3). This allows us to compare the quality of the basis matrices generated by the different initializations. We observe that NNSVD-LRC still outperforms SVD-NMF and NNDSVD after this update. Table 3 displays the relative error in percent of the three SVD-based initializations for different values of the factorization rank r, after the NNLS update, and also after one iteration of the HALS algorithm. Although the error of SVD-NMF and NNDSVD decreases significantly compared to the initial error (cf. Figure 1), it is still much higher than that of NNSVD-LRC.

Table 3: Comparison of the relative error (in percent) of the SVD-based NMF initializations when they are aided by one iteration of HALS and the NNLS update of H. The lowest error is highlighted in bold.

               AT&T                     IITD                     TDF
               r=60   r=80   r=100      r=30   r=40   r=50       r=15   r=20   r=25
NNDSVD+HALS    22.10  21.71  21.35      27.69  27.28  26.94      35.03  34.58  34.22
NNDSVD+NNLS    25.55  25.49  25.46      30.98  30.89  30.82      35.99  35.73  35.47
SVD-NMF+HALS   22.14  21.37  20.76      28.61  27.85  27.28      36.47  36.17  35.81
SVD-NMF+NNLS   27.80  27.77  27.76      33.24  33.23  33.22      38.17  38.16  38.16
NNSVD-LRC

               Sports                   Reviews                  Hitech
               r=15   r=20   r=25       r=15   r=20   r=25       r=15   r=20   r=25
NNDSVD+HALS    85.69  84.65  83.90      84.19  83.35  82.78      89.93  89.09  88.29
NNDSVD+NNLS    87.46  86.74  86.29      86.41  85.82  85.52      91.46  90.87  90.29
SVD-NMF+HALS   87.04  86.12  85.46      84.83  84.10  83.64      90.72  90.02  89.45
SVD-NMF+NNLS   90.52  90.16  89.99      88.63  88.21  88.05      93.48  93.29  93.07
NNSVD-LRC
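The NNLS update of H used here can be sketched as follows (illustrative Python; solving one NNLS problem per column of X is simple, though not the fastest option for large n):

    import numpy as np
    from scipy.optimize import nnls

    def nnls_update(X, W):
        # Solve min_{H >= 0} ||X - WH||_F for fixed W, column by column:
        # scipy.optimize.nnls solves min_{h >= 0} ||W h - x||_2.
        H = np.zeros((W.shape[1], X.shape[1]))
        for j in range(X.shape[1]):
            H[:, j], _ = nnls(W, X[:, j])
        return H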
SVD-NMF generates dense initial factors, withsparsity 0% in all cases (because SVD generated dense factors and SVD-NMF take their absolutevalues as initial estimates for W and H ). NNDSVD generates factors with average sparsity 49%, withthe sparsity of every initialization (
W, H ) being between 45% and 53% for all data sets.
NNSVD-LRC generates factors with average sparsity 45% (resp. 58%), with the sparsity of every initialization (
W, H )being between 23% (resp. 51%) and 59% (resp. 66%) for dense (resp. sparse) data sets. This confirmsour discussion in Section 2 where the initialization provided by
NNSVD-LRC has average sparsityaround 50%, similarly as
NNDSVD . (Note that this is not exactly 50% because of the low-rankcorrection step performed by
NNSVD-LRC .) Computational time
Table 4 reports the computational time for the different initializations on the different data sets, averaged over 100 runs. As expected, NNDSVD and SVD-NMF have roughly the same computational cost, the main cost being the computation of the rank-r truncated SVD, while NNSVD-LRC is faster as its main computational cost is the computation of the rank-p truncated SVD, with p = ⌈r/2⌉ + 1, with an additional cost of running A-HALS on the rank-p approximation of X.

Table 4: CPU time (in s.) taken by the different NMF initializations for the different data sets. Bold indicates the algorithm that took the least CPU time.

            AT&T                     IITD                     TDF
            r=60   r=80   r=100      r=30   r=40   r=50       r=15   r=20   r=25
NNDSVD      4.10   5.68   7.72       28.59  65.97  84.89      18.65  14.91  15.66
SVD-NMF     4.09   5.70   7.72       28.46  65.53  84.48      18.54  14.83
NNSVD-LRC

            Sports                   Reviews                  Hitech
            r=15   r=20   r=25       r=15   r=20   r=25       r=15   r=20   r=25
NNDSVD      4.30   4.63   5.96       3.13   3.39   4.40       1.89   2.17   2.83
SVD-NMF     4.31   4.62   5.95       3.10   3.39   4.40       1.89   2.17   2.82
NNSVD-LRC
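The source of the speed-up can be illustrated by timing truncated SVDs of rank r versus rank p = ⌈r/2⌉ + 1 (an illustrative benchmark; absolute timings depend on the machine, the solver, and the data):

    import time
    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(3)
    X = rng.random((2000, 1000))
    r = 40
    p = int(np.ceil(r / 2)) + 1  # rank needed by NNSVD-LRC

    for k in (r, p):
        t0 = time.perf_counter()
        svds(X, k=k)  # ARPACK-based truncated SVD
        print(f"rank-{k} truncated SVD: {time.perf_counter() - t0:.2f} s")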
Convergence of NMF algorithms

We now compare the three NMF initializations used in combination with one of the most widely used NMF algorithms, namely the multiplicative updates (MU) [16, 17]; see the sketch at the end of this section. Table 5 displays the relative error in percent after 1, 10 and 100 iterations of MU.

Table 5: Relative error in percent of MU after 1, 10 and 100 iterations when seeded by the different SVD-based NMF initializations on the dense and sparse data sets. The lowest error is highlighted in bold.

                  AT&T                     IITD                     TDF
           iter   r=60   r=80   r=100      r=30   r=40   r=50       r=15   r=20   r=25
NNDSVD     1      24.58  24.51  24.47      29.73  29.56  29.44      35.88  35.55  35.26
SVD-NMF           30.03  30.02  30.02      34.40  34.41  34.40      38.97  39.06  39.12
NNSVD-LRC
NNDSVD     10     21.71  21.52  21.40      26.99  26.68  26.43      34.43  33.81  33.27
SVD-NMF           27.18  27.15  27.14      31.66  31.60  31.55      37.11  36.96  36.85
NNSVD-LRC
NNDSVD     100    17.83  17.09  16.52      24.40  23.69  23.13      33.37  32.42  31.55
SVD-NMF           17.06  16.40  15.92
NNSVD-LRC

                  Sports                   Reviews                  Hitech
           iter   r=15   r=20   r=25       r=15   r=20   r=25       r=15   r=20   r=25
NNDSVD     1      87.22  86.53  86.07      85.57  84.91  84.58      91.13  90.56  89.99
SVD-NMF           90.90  90.56  90.46      88.89  88.54  88.38      93.60  93.48  93.31
NNSVD-LRC
NNDSVD     10     84.17  82.70  81.40      82.78  81.66  81.01      88.48  87.34  86.24
SVD-NMF           84.02  82.68  81.34      83.01  81.90  81.03
NNSVD-LRC
NNDSVD     100
SVD-NMF
NNSVD-LRC

We observe the following:

• NNDSVD and SVD-NMF with 1 or 10 iterations of MU are not enough to catch up with NNSVD-LRC, except for the Hitech data set, where SVD-NMF achieves a slightly lower error (by 0.05% for r = 15 and 0.04% for r = 25). This is explained by the fact that the initial error of NNSVD-LRC is much lower, as shown in Figure 1 and Table 3.

• After 100 iterations of MU, NNDSVD and SVD-NMF are sometimes able to catch up with NNSVD-LRC: there is no clear winner (although, on these six data sets, NNDSVD seems to perform worse). The MU have converged (close) to different stationary points, and there is no guarantee in general that NNSVD-LRC will lead to better local solutions.

In summary, NNSVD-LRC is able to obtain a better (and sparser) initial solution faster than NNDSVD and SVD-NMF. It should therefore always be preferred if one wants to quickly obtain a good solution. However, due to the complexity of NMF [27], if one wants to obtain a possibly better solution, it is recommended to use multiple initializations and keep the best solution obtained; see, e.g., [5] for a discussion.
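For reference, the MU of Lee and Seung [16, 17] can be sketched in a few lines (an illustrative reimplementation, not the code used for the experiments above):

    import numpy as np

    def mu_nmf(X, W, H, n_iter=100, eps=1e-16):
        # Multiplicative updates for min ||X - WH||_F^2 with W, H >= 0;
        # eps guards against division by zero. Note that entries that are
        # initialized at zero remain zero under these updates.
        W, H = W.copy(), H.copy()
        for _ in range(n_iter):
            H *= (W.T @ X) / np.maximum((W.T @ W) @ H, eps)
            W *= (X @ H.T) / np.maximum(W @ (H @ H.T), eps)
        return W, H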
4 Conclusion

In this paper, we presented a novel SVD-based NMF initialization. Our motivation was to address the shortcomings of previously proposed SVD-based NMF initializations. Our newly proposed method, referred to as nonnegative singular value decomposition with low-rank correction (NNSVD-LRC), has the following advantages:

1. the initial error decreases as the factorization rank r increases,
2. the average sparsity of the initial factors (W, H) is close to 50%,
3. it is computationally cheaper as it only requires the computation of a truncated SVD of rank p = ⌈r/2⌉ + 1, instead of r, and
4. it takes advantage of the discarded factors using highly efficient NMF iterations based on the low-rank approximation computed by the SVD.

In summary, NNSVD-LRC provides better initial NMF factors (both in terms of error and sparsity) at a lower computational cost. This was confirmed on both dense and sparse real data sets. This allows NMF algorithms to converge faster to a stationary point, although there is no guarantee that this stationary point will have a lower error than with other initializations, as NMF is a difficult nonconvex optimization problem [27].

Acknowledgement
The financial support of HEC Pakistan, which granted a PhD scholarship to the first author, is gratefully acknowledged. NG acknowledges the support of the European Research Council (ERC starting grant no. ).

References

[1] Berman, A., Plemmons, R.J.: Nonnegative Matrices in the Mathematical Sciences, vol. 9. SIAM (1994)
[2] Boutsidis, C., Gallopoulos, E.: SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition (4), 1350–1362 (2008). DOI 10.1016/j.patcog.2007.09.010
[3] Bro, R., Acar, E., Kolda, T.G.: Resolving the sign ambiguity in the singular value decomposition. Journal of Chemometrics (2), 135–140 (2008)
[4] Casalino, G., Del Buono, N., Mencar, C.: Subtractive clustering for seeding non-negative matrix factorizations. Information Sciences, 369–387 (2014). DOI 10.1016/j.ins.2013.05.038
[5] Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.i.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons (2009)
[6] Du, R., Drake, B., Park, H.: Hybrid clustering based on content and connection structure using joint nonnegative matrix factorization. Journal of Global Optimization (2017). DOI 10.1007/s10898-017-0578-x
[7] Elad, M., Figueiredo, M.A.T., Ma, Y.: On the role of sparse and redundant representations in image processing. Proceedings of the IEEE (6), 972–982 (2010). DOI 10.1109/JPROC.2009.2037655
[8] Ensari, T.: Character recognition analysis with nonnegative matrix factorization. International Journal of Computers, 219–222 (2016)
[9] Gillis, N.: The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines (257) (2014)
[10] Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative matrix factorization. Pattern Recognition (4), 1676–1687 (2010). DOI 10.1016/j.patcog.2009.11.013
[11] Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation (4), 1085–1105 (2012). DOI 10.1162/NECO_a_00256
[12] Golub, G.H., Van Loan, C.F.: Matrix Computations, fourth edn. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore (2013)
[13] Janecek, A., Tan, Y.: Using population based algorithms for initializing nonnegative matrix factorization. In: Y. Tan, Y. Shi, Y. Chai, G. Wang (eds.) Advances in Swarm Intelligence, vol. 6729, pp. 307–316. Springer, Berlin, Heidelberg (2011). DOI 10.1007/978-3-642-21524-7_37
[14] Kumar, A., Passi, A.: Comparison and combination of iris matchers for reliable personal authentication. Pattern Recognition (3), 1016–1026 (2010). DOI 10.1016/j.patcog.2009.08.016
[15] Langville, A.N., Meyer, C.D., Albright, R., Cox, J., Duling, D.: Initializations for the nonnegative matrix factorization. In: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23–26 (2006)
[16] Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature, 788 (1999). URL http://dx.doi.org/10.1038/44565
[17] Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: T.K. Leen, T.G. Dietterich, V. Tresp (eds.) Advances in Neural Information Processing Systems 13, pp. 556–562. MIT Press (2001). URL http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf
[18] Luce, R., Hildebrandt, P., Kuhlmann, U., Liesen, J.: Using separable nonnegative matrix factorization techniques for the analysis of time-resolved Raman spectra. Applied Spectroscopy (9), 1464–1475 (2016). DOI 10.1177/0003702816662600
[19] Maruyama, R., Maeda, K., Moroda, H., Kato, I., Inoue, M., Miyakawa, H., Aonishi, T.: Detecting cells using non-negative matrix factorization on calcium imaging data. Neural Networks, 11–19 (2014). DOI 10.1016/j.neunet.2014.03.007
[20] Prajapati, S.J., Jadhav, K.R.: Brain tumor detection by various image segmentation techniques with introduction to non negative matrix factorization. Brain (3), 600–3 (2015)
[21] Qiao, H.: New SVD based initialization strategy for non-negative matrix factorization. Pattern Recognition Letters, 71–77 (2015). DOI 10.1016/j.patrec.2015.05.019
[22] Rezaei, M., Boostani, R., Rezaei, M.: An efficient initialization method for nonnegative matrix factorization. Journal of Applied Sciences (2), 354–359 (2011). DOI 10.3923/jas.2011.354.359
[23] Sauwen, N., Acou, M., Bharath, H.N., Sima, D.M., Veraart, J., Maes, F., Himmelreich, U., Achten, E., Van Huffel, S.: The successive projection algorithm as an initialization method for brain tumor segmentation using non-negative matrix factorization. PLOS ONE (8), e0180268 (2017). DOI 10.1371/journal.pone.0180268
[24] Shiga, M., Tatsumi, K., Muto, S., Tsuda, K., Yamamoto, Y., Mori, T., Tanji, T.: Sparse modeling of EELS and EDX spectral imaging data by nonnegative matrix factorization. Ultramicroscopy, 43–59 (2016). DOI 10.1016/j.ultramic.2016.08.006
[25] Si, X., Feng, J., Zhou, J., Luo, Y.: Detection and rectification of distorted fingerprints. IEEE Transactions on Pattern Analysis and Machine Intelligence (3), 555–568 (2015). DOI 10.1109/TPAMI.2014.2345403
[26] Sun, F., Xu, M., Hu, X., Jiang, X.: Graph regularized and sparse nonnegative matrix factorization with hard constraints for data representation. Neurocomputing, 233–244 (2016). DOI 10.1016/j.neucom.2015.01.103
[27] Vavasis, S.A.: On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization (3), 1364–1377 (2010). DOI 10.1137/070709967
[28] Wang, X., Xie, X., Lu, L.: An effective initialization for orthogonal nonnegative matrix factorization. Journal of Computational Mathematics (1), 34–46 (2012). DOI 10.4208/jcm.1110-m11si10
[29] Wild, S., Curry, J., Dougherty, A.: Improving non-negative matrix factorizations through structured initialization. Pattern Recognition (11), 2217–2232 (2004)
[30] Ye, M., Qian, Y., Zhou, J.: Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing (5), 2621–2639 (2015). DOI 10.1109/TGRS.2014.2363101
[31] Yoshii, K., Itoyama, K., Goto, M.: Student's t nonnegative matrix factorization and positive semidefinite tensor factorization for single-channel audio source separation. In: ICASSP, pp. 51–55. IEEE (2016). DOI 10.1109/ICASSP.2016.7471635
[32] Zhong, S., Ghosh, J.: Generative model-based document clustering: a comparative study. Knowledge and Information Systems, 374–384 (2005)
[33] Zhou, G., Cichocki, A., Xie, S.: Fast nonnegative matrix/tensor factorization based on low-rank approximation. IEEE Transactions on Signal Processing (6), 2928–2940 (2012). DOI 10.1109/TSP.2012.2190410