Exterior Point Method for Completely Positive Factorization
aa r X i v : . [ m a t h . O C ] F e b Noname manuscript No. (will be inserted by the editor)
Exterior Point Method for Completely Positive Factorization
Zhenyue Zhang · Bingjie Li
Received: date / Accepted: date
Abstract
Completely positive factorization (CPF) is a critical task with applicationsin many fields. This paper proposes a novel method for the CPF. Based on the idea ofexterior point iteration, an optimization model is given, which aims to orthogonallytransform a symmetric lower rank factor to be nonnegative. The optimization prob-lem can be solved via a modified nonlinear conjugate gradient method iteratively.The iteration points locate on the exterior of the orthonormal manifold and the closedset whose transformed matrices are nonnegative before convergence generally. Con-vergence analysis is given for the local or global optimum of the objective function,together with the iteration algorithm. Some potential issues that may affect the CPFare explored numerically. The exterior point method performs much better than otheralgorithms, not only in the efficiency of computational cost or accuracy, but also inthe ability to address the CPF in some hard cases.
Keywords completely positive factorization · completely positive rank · nonlinearconjugate gradient method · nonnegative factorization of matrices Mathematics Subject Classification (2020) · · Completely positive factorization (CPF) of a nonnegative, symmetric, and pos-itive semidefinite matrix A looks for a nonnegative factorization A = BB T with anonnegative factor B . If such a CPF exists, the matrix A is said to be completelypositive. More preciously, a strict CPF of A means the factorization A = BB T with a Zhenyue ZhangSchool of Mathematics Science, Zhejiang University, Yuquan Campus, Hangzhou 310038, ChinaResearch Center for Advanced Artificial Intelligence Theory, Zhejiang Lab, Hangzhou 311121, ChinaE-mail: [email protected] LiSchool of Mathematics Science, Zhejiang University, Yuquan Campus, Hangzhou 310038, ChinaE-mail: [email protected] Zhenyue Zhang, Bingjie Li nonnegative factor B having the smallest column number r cp since such a factoriza-tion may be not unique if it exists. The smallest column number r cp is also called asthe completely positive rank, or cp-rank for short, of A .The CPF problem appears in many fields such as balanced incomplete block de-signing [8], energy demand model designing [23], probability estimation of DNAsequences [26]. In matrix games, completely positive matrices are used to check ifthe Nash equilibrium can be found by algorithm Lemke that solves a linear com-plementary problem [32]. In the recent decade, the CPF plays an important rolein combinatorial and nonconvex quadratic optimization. For example, some binaryquadratic problem or graph optimization can be relaxed as a continuous optimiza-tion problem over a set of completely positive matrices [2,3,13], and the iterationof completely positive matrices could be implemented in a cone where the updatedcompletely positive matrices are given in the form of convex combination of rank-onenonnegative matrices [5], or via optimizing the nonnegative factors of the CPF [9].In the latter case, an initial CPF should be given for running the factor-optimization.In data science, CPF or completely positive approximation can also be used to refinenoisy graphs for data clustering [27,29,39] or to estimate the probabilities of extremeevents happen in nature or financial markets [12]. Hence, the CPF is interesting inboth of theory analysis and applications.However, the CPF problem is NP-hard in the following issues [6]: detectingwhether a nonnegative, symmetric, and positive semidefinite matrix is completelypositive [16], determining the cp-rank of a completely positive matrix, finding a non-negative factor B when the cp-rank r cp ( A ) of A is known [21], or solving the prob-lem min B ≥ k A − BB T k F for completely positive approximation, where B has a givennumber of columns [37]. Only for a narrow part of completely positive matrices withspecial structures, the CPF can be obtained directly or within polynomial time. Forexample, diagonal dominant nonnegative symmetric matrices are completely posi-tive [25]. A bipartite graph, i.e. , A = (cid:20) D C T C D (cid:21) with positive diagonal D and D andnonnegative C , is completely positive and has an explicit CPF [7]. For an acyclic orcircular graph, if it is positive semidefinite, one can check whether it is completelypositive or not within polynomial time. Furthermore, if it is completely positive, onecan also get its CPF within polynomial time [15]. 
It was shown in [28] that if a com-pletely positive matrix of rank r has a principal diagonal matrix of order r , then itscp-rank equals to r and its CPF can be obtained explicitly.There are some efforts on algorithms for this problem in the literature. MoodyChu proposed a direct approach in [18] via successive rank-one reduction, based onthe Wedderburn’s formula to reduce the rank [38]: ˆ A = A − σ − uu T with u = Ax and σ = x T Ax . Here x should yield a nonnegative u and a positive σ , and meanwhile, ˆ A is also nonnegative. A greedy approach was given in [18] to determine such an x via maximizing the smallest entries of ˆ A . Kuang et al. considered the natural modelmin W ≥ k A − WW T k F and solved it by projected Newton method in [27]. This prob-lem was relaxed to the penalty form min W , H ≥ k A − W H T k F + λ k W − H k F in [29]so that it can be solved iteratively via alternatively optimizing the factors W and H , The procedure terminates if such an x cannot be obtained.xterior Point Method for Completely Positive Factorization 3 each is a nonnegative least square problem without a closed form solution. Writing k A − WW T k F = k A j − w · j w T · j k for each j , where w · j is the j -th column of W and A j = A − ∑ i = j w · i w T · i , In [37], Vandaele et al. considered the successive column-updating: min w · j ≥ k A j − w · j w T · j k F , sweeping all the columns of W repeatedly untilconvergence.Different from the direct optimization on nonnegative factors, [24] and [21] con-sidered two approaches that orthogonally transform a full-rank factor W of the sym-metric factorization A = WW T to be nonnegative as B = W Q directly or indirectly.In [24], the explicit error k B − WQ k F is minimized, subjected to nonnegative B andorthogonal Q . This problem is solved by alternatively optimizing nonnegative B andorthogonal Q , and the iteration converges to a point satisfying its first order conditionof local optimum. The optimization considered in [21] is implemented in the originalspace as min k Q − P k F subjected to orthogonal Q and restricted matrix P with non-negative image W P . This problem is also solved by alternative iteration on Q and P .However, it is hard to solve the subproblem on P because of the implicit restriction on P . Local convergence of these alternative projection algorithms was given in [21].However, it is not clear if the local convergence could be linear. In [14], Drusvyatskiyproved that for an alternative projection onto two closed sets X and Y , the linearconvergence occurring nearby an intersected point z of X and Y if X intersects Y at z transversally. That is, at z , the normal cone of X and the negative normal cone of Y are intersected at the origin only. For the alternative projection considered in [21],this condition at an orthogonal Q with nonnegative B = W Q is equivalent to that forany nonnegative matrix Y ∈ R n × r , S = B T Y is not symmetric or its trace is not zero if S =
0. See Proposition 4 given in Appendix A. This condition is not satisfied in somecases. Below is an instance: the completely positive matrix A has a factor W , andthere is an orthogonal Q with B = W Q ≥ Y ≥ B T Y is symmetricand nonzero. However, it has a zero trace. A = , W = − − − − − − − , Q = − − , Y = . These algorithms mentioned above may suffer from the nonnegative or orthogonalrestrictions. Because of the restrictions, an inner iteration is required to solve theinvolved subproblems if there are no closed form solutions, or adopt an estimatedsolution instead, but it may delay the convergence.In this paper, we consider a novel approach for the CPF without restrictions. Dif-ferent from the alternative projection mentioned above, our idea is exterior point it-eration that pursues an orthogonal Q on the exterior of the set of orthogonal matricesand the set of P ’s whose images W P are nonnegative. It may be expected that suchan exterior point iteration is more efficient since no restrictions should be obeyed.We will give an optimization model for implementing the exterior point iteration. We will use w i for the i -th row of W . An approximate rule of P was suggested in [21] to simplify the updating, but WP may be no longernonnegative. Zhenyue Zhang, Bingjie Li Decreasing the objective function at a point X implicitly propels X moving towardthe set of orthogonal matrices, and meanwhile, its image W X is nonnegative even-tually. We will show some properties about the first order condition and the secondorder condition for the optimal solutions. We will also characterize the global opti-mum that can be used to check whether the iterative point will go to a global optimalone or not, when we solve this optimization problem iteratively. Based on this prop-erty, a restart strategy can be also used in order to avoid local optimum as much aspossible. The classical nonlinear conjugate gradient (NCG) method requires a twicecontinuously differentiable function [1,35] or differentiable function with Lipschitzcontinuous gradient [35] to guarantee the convergence. We modify the NCG so thatthe condition of the objective function can be weakened to be continuously differ-entiable as that in our model. The modified NCG performs very efficiently in ourexperiments.It is not clear what is the dominant issue that determines the difficulty of CPF.In this paper, we will numerically explore the three possible issues: the cp-rank of A , the sparsity of the nonnegative factor B in a CPF of A , or the approximately cp-rank deficiency of A . An interesting observation is that there is a special rank-sparsityboundary nearby which the CPF is much harder than others. We believe that thesephenomena offer some important insight into the CPF problem.The exterior point method performs much better than other algorithms given inthe literature, not only on the suitableness of completely positive matrices wheneverwhose cp-rank is equal to or larger than the rank, but also on the computational effi-ciency on the accuracy of the factorization or the computational cost for completelypositive matrices in large scales. We will report the results of the numerical experi-ments to show the advantages and comparisons with other algorithms.The remaining part of this paper is organized as follows. In Section 2, we show themotivation of the exterior point method and the optimization model of the CPF. Thenthe first order and second order conditions are discussed. We also discussed someinteresting properties of the global optimum. 
The NCG method and its modificationsare discussed in Section 3 for solving the problem. In Section 4, we report someinteresting phenomena of the CPF via a lot of numerical experiments by the modifiedNCG, which partially explores possible issues resulting in difficult CPF. We alsoshow some improvements for difficult CPF by weak CPF or restart techniques inSection 5. Comparisons with other algorithms are given in Section 6 that further showthe efficiency of our exterior point method on synthetic completely positive matricesin small or larger scales and some difficult examples reported in the literature. Finally,some remarks are given in the conclusion section. Let A be a completely positive matrix of order n , and let r = r ( A ) and r cp = r cp ( A ) be the matrix rank and cp-rank of A , respectively. It is easy to estimate the matrixrank but hard for the cp-rank unless r = , Theoretically, the cp-rank could be In this special case, r cp = r .xterior Point Method for Completely Positive Factorization 5 equal to or much larger than the matrix rank or the matrix order. There are two kindsof estimations on the cp-rank. One was an upper bound of r cp in term of r given in[4]: r cp ≤ ( r + r − ) for r ≥
2. The estimation is not tight when n is relativelylarge. For instance, the upper bound cannot be touched if A is of full rank with n > r cp ≤ ( n + n − ) if n ≥
6, not depending on the rank of A .A lower-bound given in [11] as that r cp ≥ (cid:0) n + n − − ( √ n − ) n (cid:1) for n ≥ r cp = r . It makes sense since if wearbitrarily choose r nonnegative n -dimensional vectors with r ≤ n , these vectors arealmost always linearly independent, i.e. , the matrix B of these vectors is of full rank,and hence A = BB T has rank r almost always.In this section, we do not assume that r cp = r . Furthermore, we do not ask thecolumn number of B equal to r cp in the CPF that we are going to determined. Similarwith that in [24] and [21], we look for an orthonormal transformation B = W Q from afull-rank factor W of a symmetric factorization A = WW T . That is, W is of r columns, B has r + columns with a given integer r + ≥ r cp , and Q is a row-orthonormal matrixof order r × r + . Such an orthonormal matrix exists according to the following lemma.A similar result was given in [24], here we give a much simpler proof. Lemma 1
Let A = WW T be a full column-rank and symmetric factorization of A,and r + ≥ r cp ( A ) . Then A = BB T with a nonnegative B of r + columns, if and only ifthere is a row-orthonormal Q of order r × r + such that B = W Q.Proof
The sufficiency is obvious since QQ T = I . For the necessity, the equality BB T = WW T implies that B and W have the same rank and span ( W ) = span ( B ) .Therefore, B = W Q for a matrix Q ∈ R r × r + . Substituting B = W Q into BB T = WW T ,we get that QQ T = I since W is of full column rank. The proof is then completed. (cid:3) { Q k } and { P k } in the constrained domains Q = { Q ∈ R r × r + : QQ T = I r } , P = { P ∈ R r × r + : W P ≥ } , respectively, to pursue an intersect point of the feasible domains, via alternative pro-jection. Different from the greedy alternative projection, we consider a new strategyof exterior point iteration for solving the problem of CPF. The idea is that we lookfor an iterative sequence { X k } that alive on the exterior of the two subdomains. Each X k is not orthogonal and W X k may be not nonnegative. Meanwhile, we hope that thesequence { X k } can get close to both of Q and P more and more, and eventually, thesequence { X k } can converge to an intersect point Q of Q and P to get a nonnegativefactor B = W Q of A .To this end, let us consider the distances of a given matrix X to Q and P , respec-tively, d ( X , Q ) = min Q ∈ Q k X − Q k F , d ( X , P ) = min P ∈ P k X − P k F . Zhenyue Zhang, Bingjie Li
The ideal model for the CPF ismin X ∈ R r × r + (cid:8) d ( X , Q ) + d ( X , P ) (cid:9) . (1)Clearly, d ( X , Q ) has a simple representation d ( X , Q ) = ∑ k ( − σ k ( X )) , where { σ k ( X ) } are the singular values of X . Unfortunately, there is not a closed form to rep-resent the distance d ( X , P ) because of the implicit restriction in P . However, d ( X , P ) can be represented by k W X − W P k F since k W X − W P k F ≤ k W k k X − P k F . It has alower bounded as d ( X , P ) ≥ min P ∈ P k W X − W P k F k W k ≥ min B ∈ R n × r ++ k B − WX k F k W k = k ( W X ) − k F k W k , where ( W X ) − = min (cid:8) W X , (cid:9) is the negative part of W X . Therefore, the ideal objec-tive function can be equivalently transformed tomin X ∈ R r × r + n ∑ k (cid:0) − σ k ( X ) (cid:1) + k ( W X ) − k F k W k o . (2)However, the above model should be slightly modified in the view of numericalcomputation since it costs much to evaluate all the singular values. This disadvantagecan be addressed by taking into account the equality k XX T − I k F = ∑ i ( − σ i ( X )) .Since the global optimal solution is achieved at an X with σ i =
1, when σ i ( X ) ≈ i , we see that k XX T − I k F = ∑ i ( − σ i ( X )) ( + σ i ( X )) ≈ ∑ i ( − σ i ( X )) . Moreimportantly, it can be evaluated economically. Notice that k XX T − I k F is a quarticfunction of X , while k ( W X ) − k F is quadratic. We slight modify the coefficients of thetwo functions in (2) when the first one is replaced by k XX T − I k F ,min X ∈ R r × r + n k XX T − I k F + λ k ( W X ) − k F o . (3)where λ = / k W k . One may slightly change λ if necessary. Numerically, it maybe more robust if we rescale the rows of W to have unit norm, i.e. , replace W by itsnormalized ˜ W = diag ( k w k − , · · · , k w n k − ) W in (3), where { w i } are the rows of W . Since √ n = k ˜ W k F ≤ √ r k ˜ W k , 1 / k ˜ W k ≤ rn .Hence, we can simply set λ ≈ rn in this case. In our experiments, we always use thisparameter set for λ and it works very well. We can set B = ( W X ) + , the nonnegativepart of W X , as the required nonnegative factor.There are some advantages to the problem (3) and its objective function f ( X ) = k XX T − I k F + λ k ( W X ) − k F . At first, X ∈ Q ∩ P if and only if f ( X ) = i.e. , it is an optimal solution of (3). Second,because of the square form of the negative entries in the second term, f is derivative.Third, if at a point X , W X does not have zero entries, f ( X ) is a polynomial functionnear X . This property implies a similar behavior as a polynomial function that ben-efits the analysis of convergence or global optimality. In the next subsection, we willfurther discuss the local or global optimality. xterior Point Method for Completely Positive Factorization 7 f is derivable, it is not difficult to characterize itsfirst order condition for its local optimum. One can verify that ∂∂ x i j (cid:13)(cid:13) ( W X ) − (cid:13)(cid:13) F = ∂∂ x i j n ∑ s = (cid:16) r ∑ t = w st x t j (cid:17) − = n ∑ s = w si (cid:16) r ∑ t = w st x t j (cid:17) − . The gradient of k ( W X ) − k F can be represented as ∇ k ( W X ) − k F = W T ( W X ) − . Wealso have that ∇ k XX T − I k F = ( XX T − I ) X . Hence, ∇ f ( X ) = ( XX T − I ) X + λ W T ( W X ) − . (4)The first order condition for a local minimum of f ( X ) follows immediately. Lemma 2
The first order condition of a (local) minimum of f is ( XX T − I ) X + λ W T ( W X ) − = . (5)Some interesting properties follows for any stationary point of f , i.e. , a matrix X satisfying (5). The first one is that the term λ k ( W X ) − k F and f can be represented interms of the singular values { σ i ( X ) } of X . Proposition 1
Let X be a stationary point of f and σ i = σ i ( X ) for simplicity. Then λ k ( W X ) − k F = ∑ i σ i ( − σ i ) , f ( X ) = ∑ i ( − σ i ) . (6) Proof
Since trace ( X T ( I − XX T ) X ) = trace ( X T X ( I − X T X )) = ∑ i σ i ( − σ i ) , λ k ( W X ) − k F = trace ( λ ( W X ) T ( W X ) − ) = trace ( X T ( I − XX T ) X ) = ∑ σ i ( − σ i ) . This is the first inequality in (6). The second one follows it with the definition of f and k XX T − I k F = ∑ i ( − σ i ) and ( − σ i ) + σ i ( − σ i ) = − σ i . (cid:3) Obviously, all σ i ( X ) = X is globally optimal. One can also expect σ i ( X ) ≈ X is just a locally optimal or stationary point since λ k ( W X ) − k F > ∑ σ i > σ i ( σ i − ) = ∑ σ i < σ i ( − σ i ) − λ k ( W X ) − k F . Practically, one can conclude form (5) that some σ i ( X ) = W X has partial non-negative columns because of the symmetry of λ ( W X ) T ( W X ) − = X T ( I − XX T ) X . If ( W X ) + is the nonnegative part of W X , the symmetry gives ( W X ) T + ( W X ) − = ( W X ) T − ( W X ) + . (7) Proposition 2
For a stationary point X of f , if W X has a nonnegative submatrixW X and the remaining W X does not, then there are at least r ( X ) singular values σ i ( X ) = , and ( W X ) T ( W X ) − = , ( W X ) T ( W X ) ≥ . Zhenyue Zhang, Bingjie Li
Proof
By (5), ( XX T − I r ) X =
0. Hence, XX T has at least r = r ( X ) eigenvaluesequal to one. It follows that X has at least r singular values equal to one. By (7), ( W X ) T + ( W X ) − = ( W X ) T − ( W X ) + = . It follows that ( W X ) T ( W X ) = ( W X ) T + ( W X ) = ( W X ) T + ( W X ) + ≥ (cid:3) The orthogonality between
W X and ( W X ) − implies that the rows of W X cor-responding to the nonzero rows of W X must be nonnegative. Hence, if W X has apositive column,
W X itself must be nonnegative. If
W X exists, W X must have zerorows, and we can partition W X as W X = (cid:20) W X W X W X (cid:21) within permutation, whereboth W X and W X are nonnegative and all the rows of W X are nonzero.A second order condition can guarantee a stationary point to be local optimal. Forour problem, it is not difficult to give the second order condition. Theorem 1
Assume that X is a stationary point of f , and t j is the indicator of nega-tive components of the column x j of X. If the symmetric matrix S = ( S i j ) , partitionedwith blocks S i j = (cid:26) k x i k I + x i x Ti + W T diag ( t i ) W + ( XX T − I ) , i = j ; ( x Ti x j ) I + x j x Ti , i = j , (8) is positive definite, or positive semidefinite with a null space spanned by vectorslinked by all columns of ∆ satisfying ∆ X T + X ∆ T = and the nonzero eigenvaluesof S are not smaller than k X k , then X is a local minimizer of f . The proof is given in Appendix B. Because f is not convex, local optimum hap-pens when we solve the problem (3) iteratively. Hence, it is important to checkwhether an iterative sequence goes to a global optimum or not. To address this issue,we will further discuss sufficient conditions for the global optimum of f , focusing ondescending sequences { f ( X k ) } that may be generated via some iteration algorithms,as the modified NCG that we will adopt to solve the problem (3).2.3 Global optimumWe fist give some equivalent conditions for the global optimum of f , based onProposition 1 and Proposition 2. Proposition 3
The following statements are equivalent for a stationary point X of f .(a) X ∈ P and X is of full row rank;(b) X ∈ Q , i.e., XX T = I r ;(c) X ∈ Q ∩ P , i.e., X is a global minimizer of f .Proof The proof is simple. In fact, if X ∈ P , i.e. , X = X in Proposition 2, then all thesingular values of X are equal to one since r ( X ) = r , which implies that XX T = I r , i.e. , X ∈ Q . Conversely, if X ∈ Q , we have ( W X ) − = i.e. , X ∈ P , from Proposition1 directly. (cid:3) xterior Point Method for Completely Positive Factorization 9 The full-rank condition on a stationary point X is satisfied generally. Practically,if f ( X ) < / X must be full rank by (6). If an algorithm generates the iterativesequence { X k } with descending { f ( X k ) } , starting with an row-orthonormal matrix, itis highly possible to have f ( X ) < / X of { X k } . Hence, X ∈ P is equivalent to X ∈ Q for an accumulative point X of the sequence generally.Proposition 3 also show that any stationary point X of f always locates the ex-terior of both Q and P except it is a solution of (3). In Section 3, we will give aniterative algorithm that yields a sequence of iterative points on the exterior of Q ∪ P unless it converges globally. This is why we call it as an exterior point method, quitedifferent from the alternative methods discussed in the previous subsection, whereiterative points alternatively drop into one of the subdomains.Let us consider those stationary points of f on the exterior of Q ∪ P . Each X ofthem has an indicator matrix sgn ( | ( W X ) − | ) . However, the number of different ones { T ℓ } is limit. Let { X ( ℓ ) } be the stationary points with sgn ( | ( W X ( ℓ ) ) − | ) = T ℓ , and let f ℓ ( X ) = k XX T − I k F + λ k ( W X ) ⊙ T ℓ k F . Each f ℓ has a finite number of different critical values since it is a polynomial [36]. Asshown in the proof of Theorem 1 given in Appendix B, each critical value of f is alsoa critical value of one of f ℓ ’s. Hence, f also has a finite number of critical values. Let f min be the smallest one of these nonzero critical values of f . The following theoremshows that a descending sequence { f ( X k ) } yields the optimum of f if there is an f ( X k ) < f min . Theorem 2
Let { f ( X k ) } be a descending sequence and lim k ∇ f ( X k ) = . If there isan f ( X k ) < f min , then lim k f ( X k ) = .Proof Obviously, { f ( X k ) } converges and { X k } is bounded. Let X ∞ be any accumu-lative point of { X k } and subsequence X k j → X ∞ . Then ∇ f ( X ∞ ) = lim ∇ f ( X k j ) = f ( X ∞ ) is a critical value of f . Since f ( X ∞ ) < f ( X k ) < f min , we conclude that f ( X ∞ ) = (cid:3) Theorem 2 shows that any accumulative point of { X k } is a global minimizer of f if there is an f ( X k ) < f min . It is theoretically meaningful, but the condition is notdetectable since f min is unknown. The following theorem gives sufficient conditionsfor the global optimum. Theorem 3
Let { f ( X k ) } be a descending sequence such that lim k ∇ f ( X k ) = . If lim k k X k k ≤ , f ( X k ) < α for a constant α ∈ ( , ) , and for sufficiently large k, λ k ( W X k ) − k F ≤ α k X k X Tk − I k F − k X k X Tk − I k F , (9) then any accumulative point X ∞ of { X k } is a globally optimal solution. Different stationary points may share a common indicator matrix.0 Zhenyue Zhang, Bingjie Li
Proof
Let X ∞ be any accumulative point of { X k } and X k j → X ∞ as in the proof ofTheorem 2. The condition lim k k X k k ≤ σ i ≤ k X ∞ k ≤ { σ i } of X ∞ . Below we show that all σ i = i.e. , f ( X ∞ ) = σ i <
1. We have that 0 < f ( X ∞ ) < α . We willgive a contradiction, based on the observation that there is a constant β > f ( X ∞ ) ≤ α ( β + ) ( β + ) , λ k ( W X ∞ ) − k F ≤ β k X ∞ X T ∞ − I k F . (10)To show (10), let a = k X ∞ X T ∞ − I k F and b = λ k ( W X ∞ ) − k F . Both a and b arepositive by Proposition 3 since f ( X ∞ ) >
0. It is easy to verify that the first one in (10)is equivalent to β ≤ c − c , where c = p − f ( X ∞ ) / α ∈ ( , ) . The second one isequivalent to ba ≤ β . Thus, the existence of a positive β is equivalent to ba ≤ c − c , orequivalently, ba + b ≤ c . Recalling that the condition (9) gives a + b ≤ α √ a . Hence,1 − c = f ( X ∞ ) α = a + b α ≤ a ( a + b )( a + b ) = − b ( a + b ) . That is, the inequality ba + b ≤ c holds. Therefore, there is a positive β satisfying (10).By Proposition 1 with X = X ∞ , the inequalities in (10) become ∑ σ i < ( − σ i ) ≤ α ( β + )( β + ) , ∑ σ i < σ i ( − σ i ) ≤ β ∑ σ i < ( − σ i ) . (11)However, the left inequality above implies that 1 − σ i < β + ( β + ) since α < i.e. , σ i ( β + ) > β for each σ i <
1. Thus, σ i > β ( − σ i ) and σ i ( − σ i ) > β ( − σ i ) ,which yields a contradiction to the right inequality in (11), completing the proof. (cid:3) We will show that the modified NCG given in the next section can yield a se-quence { X k } that guarantees the descent of { f ( X k ) } and lim k ∇ f ( X k ) =
0. Becauseof the descent, it is easy to have f ( X k ) < /
4. The inequality (9) is always satisfiedexcept a few of early X k ’s. Generally, k X k k > k >
1, and k X k k → ρ ≥ { X k } converges locally or globally. The difference is that ρ > ρ = f and the norm of its derivative ∇ f nearby a stationary point of f . This relation willbe used in our algorithm to check whether an iterative sequence tends to the globaloptimum or not. Theorem 4
If X is a stationary point of f , then for sufficiently small ∆ = ˜ X − X,f ( ˜ X ) = f ( X ) + h ∇ f ( ˜ X ) , ∆ i − h ∆∆ T X , ∆ i − k ∆∆ T k F . (12) Furthermore, if X is globally optimal, then f ( ˜ X ) < k ∇ f ( ˜ X ) k F k ∆ k F . xterior Point Method for Completely Positive Factorization 11 Proof
Consider an arbitrary ˜ X in the open neighborhood of X , N ( X ) = (cid:8) ˜ X : ( W ˜ X ) i j ( W X ) i j > ( W X ) i j = (cid:9) . We will show that k ˜ X ˜ X T − I k F = (cid:13)(cid:13) XX T − I (cid:13)(cid:13) F + h ( ˜ X ˜ X T − I ) ˜ X , ∆ i − h λ ( W X ) − , W ∆ i− h X , ∆∆ T ∆ i − k ∆∆ T k F ; (13) k ( W ˜ X ) − k F = k ( W X ) − k F + h ( W ˜ X ) − , W ∆ i + h ( W X ) − , W ∆ i . (14)Substituting these equalities into f ( ˜ X ) = k ˜ X ˜ X T − I k F + λ k ( W ˜ X ) − k F , and combin-ing h ∇ f ( ˜ X ) , ∆ i = h ( ˜ X ˜ X T − I ) ˜ X , ∆ i + h λ ( W ˜ X ) − , W ∆ i , we can get (12) immediately.We prove (13) below, based on ˜ X ˜ X T − I = XX T − I + X ∆ T + ∆ X T + ∆∆ T . Onone hand, from the equality, we have k ˜ X ˜ X T − I k F = (cid:13)(cid:13) XX T − I (cid:13)(cid:13) F + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F + (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T (cid:11) . On the other hand, we rewrite 2 h ( ˜ X ˜ X T − I ) ˜ X , ∆ i = h ˜ X ˜ X T − I , ˜ X ∆ T + ∆ ˜ X T i as2 (cid:10) ( ˜ X ˜ X T − I ) ˜ X , ∆ (cid:11) = (cid:10) XX T − I + X ∆ T + ∆ X T + ∆∆ T , X ∆ T + ∆ X T + ∆∆ T i = (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T i + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F + h XX T − I + X ∆ T + ∆ X T + ∆∆ T , ∆∆ T i = (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T i + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F − h XX T − I − ∆∆ T , ∆ X T i + k ∆∆ T k F . Since ( XX T − I ) X = − λ W T ( W X ) − by ∇ f ( X ) =
0, we also have that h XX T − I , ∆ X T i = −h λ ( W X ) − , W ∆ i . Hence, combining these equalities, we get (13).To show (14), let T − and T be the indicator matrices of the negative or zeroentries of W X , respectively. Since ˜ X ∈ N ( X ) , ( W ˜ X ) − = ( W ˜ X ) ⊙ ( T − + T ) = ( W X + W ∆ ) ⊙ ( T − + T ) = ( W X ) − + H ∆ where H ∆ = W ∆ ⊙ ( T − + T ) . Combining it with h ( W X ) − , H ∆ i = h ( W X ) − , W ∆ i , weget k ( W ˜ X ) − k F = k ( W X ) − k F + h ( W X ) − , W ∆ i + k H ∆ k F , h ( W ˜ X ) − , W ∆ i = h ( W X ) − , W ∆ i + h H ∆ , W ∆ i = h ( W X ) − , W ∆ i + k H ∆ k F . Hence (14) is also true, completing the proof of (12).For a globally optimal X , f ( X ) = T − =
0. We estimate f ( ˜ X ) in the twosubsets according to the sign of function g ( ∆ ) = h ∆ T X , ∆ T ∆ i + k ∆∆ T k F , N + = { ˜ X ∈ N ( X ) : g ( ˜ X − X ) ≥ } ; N − = { ˜ X ∈ N ( X ) : g ( ˜ X − X ) < } . For ˜ X ∈ N + , we have f ( ˜ X ) ≤ h ∇ f ( ˜ X ) , ∆ i ≤ k ∇ f ( ˜ X ) k F k ∆ k F directly.Consider ℓ ( ∆ ) = X ∆ T X + ∆ X T X + λ W T H ∆ for ˜ X = X + ∆ ∈ N − , and rewrite ∇ f ( ˜ X ) = ℓ ( ∆ ) + ( X ∆ T + ∆ X T ) ∆ + ∆∆ T ( X + ∆ ) We should have ℓ ( ∆ ) = X = X + ∆ ∈ N − different from X . Otherwise, ℓ ( ∆ ) = ∆ such that ˜ X = X + ∆ ∈ N − . It follows that0 = h ℓ ( ∆ ) , ∆ i = h X ∆ T X + ∆ X T X + λ W T H ∆ , ∆ i = h ∆ T X , X T ∆ i + k ∆ X T k F + λ k H ∆ k F . Since λ W T H ∆ = − ( X ∆ T + ∆ X T ) X by ℓ ( ∆ ) = k λ W T H ∆ k F = k X ∆ T + ∆ X T k F = k ∆ X T k F + h ∆ T X , X T ∆ i = − λ k H ∆ k F , which implies that ∆ T X + X T ∆ =
0. Thus, by g ( ∆ ) <
0, we get a contradiction0 ≤ k ∆∆ T k F < − h ∆ T X , ∆ T ∆ i = −h ∆ T X + X T ∆ , ∆ T ∆ i = . Therefore, ℓ ( ∆ ) is non-singular. Since it is piece-wisely linear, there is a positiveconstant c such that k ℓ ( ∆ ) k ≥ c k ∆ k F for ˜ X ∈ N − . Thus, for k ∆ k F ≤ min { , c / } , k ℓ ( ∆ ) k ≥ k ∆ k F , and by definition, k ∇ f ( ˜ X ) k F ≥ c k ∆ k F − k ∆ k F ≥ k ∆ k F . Hence, f ( ˜ X ) ≤ ( k ∇ f ( ˜ X ) k F + k ∆ k F ) k ∆ k F ≤ k ∇ f ( ˜ X ) k F k ∆ k F . The theorem is then proven. (cid:3)
Concluding from Theorem 4, we get thatlim ˜ X → X k ∇ f ( ˜ X ) k F f ( ˜ X ) = (cid:26) , if X is not globally optimal; + ∞ , if X is globally optimal . (15)In the next section, we will consider a nonlinear conjugated gradient method forsolving the minimization problem (3). The sufficient conditions mentioned abovewill be taken into account as much as possible. The NCG method [19] is commonly used for solving nonlinear smooth optimiza-tion problems. Generally, its convergence requires the objective function to be twicecontinuously differentiable [1,35], or be differentiable and its gradient is Lipschitzcontinuous [35] if the conjugated gradient direction is suitably updated. Unfortu-nately, in our case, the function f is continuously differentiable only.In this section we briefly describe the NCG and conditions of its convergence atfirst. To guarantee the convergence when the NCG is applied to the exterior pointmodel (3), we propose two modifications for the NCG. One is a new approach forupdating the conjugate gradient in the NCG, and the other one is a simpler rule forsetting an inexact line search for updating the iteration point. These modifications cannot only guarantee the convergence for continuously differentiable functions withoutother conditions, but also improve the efficiency of NCG. Hence, the modified NCGworks on our exterior point problem (3). xterior Point Method for Completely Positive Factorization 13 φ ( x ) via the two classical steps, starting at d = − g , with g = ∇φ ( x ) on aninitial point x : Modify the current point x k to x k + = x k + α k d k (16)along the direction d k with a suitable step length α k . Then, update the conjugatedirection d k to d k + = − g k + + β k d k (17)with the gradient g k + = ∇φ ( x k + ) at the updated point x k + and a suitable value β k .The weak convergence lim g k = x k to the minimizerof φ depends on the smoothness of φ and the choices of { α k } and { β k } . The ideal α k is the minimizer of φ ( x k + α d k ) with respect to α , which is called as exact linesearch. Inexact search is commonly suggested but α k should satisfy the weak Wolfeconditions [33] φ ( x k + α d k ) ≤ φ ( x k ) + ρα h g k , d k i (18) h ∇φ ( x k + α d k ) , d k i ≥ σ h g k , d k i (19)with two positive parameters ρ < σ <
1, or the strong Wolfe conditions (18) and |h ∇φ ( x k + α d k ) , d k i| ≤ − σ h g k , d k i . (20)Generally, the inexact line search α k can be obtained via bisection [31] or interpola-tion [35], or combination of the two approaches [20]. About the step choice for β k ,there are three commonly used approaches in the literature [19,34,40]: β FR k = k g k + k k g k k , (21) β PRP k = h g k + , g k + − g k ik g k k , (22) β MPRP k = β PRP k − min (cid:8) β PRP k , ν k g k + − g k k k g k k h g k + , d k i (cid:9) , ν > . (23)Sun and Yuan have shown in [35] that the strong convergence of the NCG withthe exact line search and { β PRP k } is true if φ is twice continuously differentiable anduniformly convex, and the level set { x : φ ( x ) ≤ φ ( x ) } is bounded. However, thestrong convergence is not guaranteed if the exact search is relaxed to the inexact one,even if it satisfies the strong Wolfe conditions. The weak convergence is guaranteedfor the NCG with { β FR k } and inexact { α k } satisfying the strong Wolfe conditionswith 0 < ρ < σ < , under the same conditions on φ without uniformly convexity, orthe gradient of φ is Lipschitz continuous, a stricter condition than the continuouslydifferentiable φ . For the modified PRP, MPRP, if the weak Wolfe conditions are sat-isfied by { α k } and { α k } has a positive lower bounded, the weak convergence can be slightly improved to g k → φ as for FR [40]. It wasreported in the literature that the PRP is more efficient than FR in applications al-though its stronger conditions may not be satisfied. MPRP performs better than PRPgenerally if the parameter ν is suitably set.3.2 Modifications for the NCG Modified step β k . Practically, the MPRP adopts a restart strategy: reset β k = i.e. , d k + = − g k + , when h g k + , g k + − g k − ν k g k + − g k k k g k k d k i ≤
0. It guarantees h d k + , g k + i ≤ − ( − ν ) k g k + k , a sufficient condition for the descent of φ ( x k + ) by (18). However, this descent con-dition is not sufficient for the convergence of { g k } . An additional condition about thepositive lower-bound of { α k } is required [40].To guarantee the same convergence as MPRP for continuously differentiable φ ( x ) without the lower-bound condition on { α k } , we further modify MPRP as that β k = min n h g k + , g k + − g k − ν k g k + − g k k k g k k d k i + k g k k , κ k g k + k k d k k o . (24)where ν > as in (23) and κ >
0. The modification can guarantee a stronger sufficientdescent condition h d k + , g k + i ≤ − µ k d k + k k g k + k . We will show it in the nextsubsection. The stronger sufficient descent condition with µ ∈ ( , ) was given in[35] for the convergence g k → φ is uniformly continuous in the level set { x : φ ( x ) ≤ φ ( x ) } . We willfurther show that the uniformly continuous condition is also not necessary for theconvergence. Simple approach for inexact line search.
For continuously differentiable φ ( x ) ,the interpolation method does not guarantee the capture of required weak line search α k since it asks for a thrice times continuously differentiable and unimodal φ ( x ) [35]. One can get α k by the combination method [20] that is more efficient than thebisection approach [31]. In [20], the bisection is combined with the interpolation in abit complicated way for interval shrinking. Here we give a simpler and more efficientapproach for determining α k satisfying the weak Wolfe conditions.Theoretically, at a current point x = x k with the conjugate direction d = d k , therequired inexact line search α = α k satisfying the weak Wolfe conditions (18-19) canbe chosen as α ∗ = sup (cid:8) ˆ α : the Wolfe condition (18) holds over ( , ˆ α ) (cid:9) . (25) We always assume that both g k and d k are always nonzero. NCG terminates if g k = d k = It exists, is positive, and satisfies (18-19). To verify this claim, let’s consider thefunction g ( α ) = φ ( x ) + ρα h ∇φ ( x ) , d i − φ ( x + α d ) . Clearly, (18) is equivalent to g ( α ) ≥
0, and meanwhile, (19) holds if g ′ ( α ) ≤
0. By thedefinition and the continuousness of φ , (18) is true for 0 < α ≤ α ∗ . The supremum in(25) implies that g ( α ∗ ) = g ′ ( α ∗ ) ≤
0. Hence, (19) is also satisfied for α = α ∗ .Practically, there is a relative large sub-interval of ( , α ∗ ] in which both (18) and (19)are true. For instance, if ˆ α ∈ ( , α ∗ ] is the largest point such that g ( α ) is a localmaximum, then g ′ ( α ) ≤ [ ˆ α , α ∗ ] . Therefore, (18-19) hold for α ∈ [ ˆ α , α ∗ ] .An ideal choice of α is the minimizer α min of φ ( x + α d ) over ( , α ∗ ] since it de-creases φ as small as possible, while both (18) and (19) are still satisfied. In thissubsection, we give a simple rule for pursuing α min via a quadratic interpolationto φ ( x + α d ) , assuming φ is continuously differentiable. It generates a nested andshrunk interval sequence containing the required α . The pursuing terminates as soonas a point satisfying (18-19) is found.Initially, we set α ′ = α ′′ > α ′′ will be given later. Startingwith [ α ′ , α ′′ ] , we generate a sequence of intervals [ α ′ , α ′′ ] iteratively such that each α ′ ℓ satisfies (18) but α ′′ ℓ does not, and meanwhile, α ′ ℓ doesn’t satisfy (19). That is, for x ′ ℓ = x + α ′ ℓ d and x ′′ ℓ = x + α ′′ ℓ d φ ( x ′ ℓ ) ≤ φ ( x ) + ρα ′ ℓ h g , d i , φ ( x ′′ ℓ ) > φ ( x ) + ρα ′′ ℓ h g , d i , h ∇φ ( x ′ ℓ ) , d i < σ h g , d i , (26)where g = ∇φ ( x ) . The third inequality above implies that h ∇φ ( x ′ ℓ ) , d i <
0. Further-more, by the first two inequalities in (26), we have that φ ( x ′′ ℓ ) > φ ( x ′ ℓ ) + ρ ( α ′′ ℓ − α ′ ℓ ) h g , d i > φ ( x ′ ℓ ) + ρσ ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i . (27)In the current interval, we consider a quadratic function q ( α ) with interpolation con-ditions q ( α ′ ℓ ) = φ ( x ′ ℓ ) , q ′ ( α ′ ℓ ) = h ∇φ ( x ′ ℓ ) , d i , q ( α ′′ ℓ ) = φ ( x ′′ ℓ ) , It can be represented as q ( α ) = φ ( x ′ ℓ ) + ( α − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i + (cid:0) φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i (cid:1) ( α − α ′ ℓ ) ( α ′′ ℓ − α ′ ℓ ) with the minimizer c ℓ = argmin α q ( α ) given by c ℓ = α ′ ℓ + α ′′ ℓ − α ′ ℓ − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i > α ′ ℓ . (28)By the Mean-Value Theorem for derivatives and the second inequality in (27),0 < ( − M ℓ ) − ≤ (cid:16) − h ∇φ ( ¯ x ℓ ) , d ih ∇φ ( x ′ ℓ ) , D k i (cid:17) − = − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , D k i φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , D k i < σσ − ρ , (29) where ¯ x ℓ = x + ¯ α ℓ d with ¯ α ℓ ∈ [ α ′ ℓ , α ′′ ℓ ] and M ℓ = min α ∈ [ α ′ ℓ , α ′′ ℓ ] h ∇φ ( x + α d ) , d ih ∇φ ( x ′ ℓ ) , d i ≤ h ∇φ ( ¯ x ℓ ) , d ih ∇φ ( x ′ ℓ ) , d i < ρσ . Hence, if 0 < ρ < σ , we have that α ′ ℓ < α ′ ℓ + ( − M ℓ ) ( α ′′ ℓ − α ′ ℓ ) < c ℓ < α ′ ℓ + σ ( σ − ρ ) ( α ′′ ℓ − α ′ ℓ ) < α ′′ ℓ . (30)We may shrink [ α ′ ℓ , α ′′ ℓ ] to [ c ℓ , α ′′ ℓ ] or [ α ′ ℓ , c ℓ ] , if α = c ℓ satisfies (18) or does not.However, if (18) is satisfied, the interval length is α ′′ ℓ − c ℓ ≤ − M ℓ − M ℓ ( α ′′ ℓ − α ′ ℓ ) . When M ℓ < | M ℓ | is large, − M ℓ − M ℓ ≈
1. The interval shrinking is inefficient in this case.To avoid this phenomenon, we slightly modify c ℓ as that with η = σ ( σ − ρ ) ˜ c ℓ = max (cid:8) c ℓ , ηα ′ ℓ + ( − η ) α ′′ ℓ (cid:9) ∈ ( α ′ ℓ , α ′′ ℓ ) . (31)Since ˜ c ℓ ≥ ηα ′ ℓ + ( − η ) α ′′ ℓ and c ℓ < α ′ ℓ + η ( α ′′ ℓ − α ′ ℓ ) by (30), we get α ′′ ℓ − ˜ c ℓ ≤ η ( α ′′ ℓ − α ′ ℓ ) , ˜ c ℓ − α ′ ℓ ≤ max (cid:8) η , − η (cid:9) ( α ′′ ℓ − α ′ ℓ ) = η ( α ′′ ℓ − α ′ ℓ ) . The last equality holds since η > /
2. Hence, if the Wolfe conditions (18-19) aresatisfied for α = ˜ c ℓ , we get the required α k = ˜ c ℓ . Otherwise, shrink [ α ′ ℓ , α ′′ ℓ ] as [ α ′ ℓ + , α ′′ ℓ + ] = (cid:26) [ α ′ ℓ , ˜ c ℓ ] , if (18) does not hold for α = ˜ c ℓ ; [ ˜ c ℓ , α ′′ ℓ ] , otherwise . (32)The interval length is significantly decreased as 0 < α ′′ ℓ + − α ′ ℓ + ≤ η ( α ′′ ℓ − α ′ ℓ ) , where η < ρ < σ . Hence, α ′′ ℓ − α ′ ℓ → ℓ → ∞ . Lemma 3 If φ is lower bounded and continuously differentiable, an α = ˜ c ℓ ∗ satisfy-ing (18-19) can be obtained within a finite iterations of (32) if < ρ < σ < .Proof If (18-19) do not hold for all ˜ c ℓ , the updating rule (32) yields a sequence ofnested intervals { [ α ′ ℓ , α ′′ ℓ ] } . Since 0 < ρ < σ <
1, the intervals tend to a singlepoint α ∗ and both { x ′ ℓ } and { x ′′ ℓ } tend to x ∗ = x + α ∗ d . Hence, by (27) and the Taylorextension of φ ( x + α d ) at α = α ∗ , we get h ∇φ ( x ∗ ) , d i = lim ℓ → ∞ φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) α ′′ ℓ − α ′ ℓ ≥ ρ h ∇φ ( x ) , d i > σ h ∇φ ( x ) , d i (33)since h ∇φ ( x ) , d i < ρ < σ . However, by (26), h ∇φ ( x ∗ ) , d i ≤ σ h ∇φ ( x ) , d i , acontradiction with (33). (cid:3) A good choice of α ′′ helps to pursue the minimizer α min . Motivated by the aboveanalysis on the estimation of the shrinking rate η ℓ , we suggest the experiential setting α ′′ = min (cid:8) α = p η : (18) is not satisfied for α = p η with integer p ≥ (cid:9) . (34)Starting with the initial setting, the interval updating (32) converges quickly. For in-stance, we set ρ = . τ = .
4, the interval iteration terminates within one or twoiterations generally in our experiments. Algorithm 1 gives the details of the procedurefor determining an inexact line search α k , given x k , φ k , g k , the conjugate direction d k . xterior Point Method for Completely Positive Factorization 17 Algorithm 1
An inexact line search satisfying the weak Wolfe conditions
Require: point x , φ = φ ( x ) , g = ∇φ ( x ) , direction d , and parameters σ , ρ , ε Ensure: α satisfying (18-19) within accuracy ε , x : = x + α d , φ ( x ) , and g = ∇φ ( x ) . Set α ′ = x ′ = x , φ ′ = φ , g ′ = g , ν = ρ h g , d i . Find the smallest integer p ≥ α = η p , and set α ′′ = η p . Repeat the following iteration until α ′′ − α ′ < ε . Compute c as (28), ˜ c as (31), and ˜ φ = φ ( ˜ x ) at ˜ x = x + ˜ cd . If ˜ φ > φ + ˜ c ν , update ( α ′′ , φ ( x ′′ )) by ( ˜ c , φ ( ˜ x )) and go to Step 3. Compute ˜ g = ∇φ ( ˜ x ) . If h ˜ g , d i ≥ σ h g , d i , set x = ˜ x , φ = ˜ φ , g = ˜ g , and terminate. Otherwise, update α ′ , φ ′ , g ′ by ˜ c , ˜ φ , ˜ g , respectively. End iteration3.3 Convergence of the modified NCGWe have two results for the convergence.
Lemma 4
Let β k be defined by (24) and µ = ν − ν ( + κ ) . Then h d k + , g k + i ≤ − µ k d k + k k g k + k . (35) Proof
We rewrite β k = ρ k ˜ β k and d k + = ρ k ˜ d k + + ( ρ k − ) g k + , where ρ k ∈ [ , ] , ˜ β k is the step in (23), and ˜ d k + = − g k + + ˜ β k d k . Since h ˜ d k + , g k + i ≤ ( ν − ) k g k + k , h d k + , g k + i = ρ k h ˜ d k + , g k + i + ( ρ k − ) k g k + k ≤ (cid:0) ρ k ( ν − ) + ( ρ k − ) (cid:1) k g k + k ≤ − ν ν k g k + k . By β k ≤ κ k g k + k k d k k , we also have that k d k + k ≤ k g k + k + β k k d k k ≤ ( + κ ) k g k + k .The inequality (35) follows since k g k + k ≥ + κ k d k + k k g k + k and − ν ν < (cid:3) Combining (18), the inequality (35) guarantees the descent of { φ ( x k ) } . The NCGwith the inexact line search discussed in the previous subsection and the modified step β k given in (24) is convergent if φ is continuously differentiable and lower bounded.The convergence analysis is slightly different from that for the PRP step in [35]. Theorem 5
Starting with an arbitrary x , the NCG with inexact line search { α k } sat-isfying the weak Wolfe condition (18-19) and steps { β k } defined in (24) is convergentin the sense that { φ ( x k ) } is monotone decreasing and converges and that ∇φ ( x k ) → .Proof We assume g k = ∇φ ( x k ) = k without loss of generalities, and let s k = α k d k . By Lemma 4, h g k , s k i ≤ − µ k g k kk s k k ≤
0. The Wolfe condition (18) gives φ ( x k + ) − φ ( x k ) ≤ ρ h g k , s k i ≤ − ρµ k g k kk s k k ≤ . Algorithm 2
Modified nonlinear conjugated gradient (MNCG) method
Require: initial point x , parameters ε , ε , ρ , σ , ν , κ , λ , and k NCGmax . Ensure: an approximate solution x ∗ of min x φ ( x ) with the given accuracy Compute φ = φ ( x ) , g = ∇φ ( x ) , and set d = − g . For k = , · · · , k NCGmax , Save φ old = φ , and update ( x , φ , g ) by Algorithm 1 with searching direct d . If k g k F < ε and φ old − φ < ε , then set x ∗ = x and terminate the iteration. Otherwise, compute β by (24) and update d : = − g + β d . End forIt means that { φ ( x k ) } is monotone decreasing. Hence, it is convergent since φ itselfis lower bounded, which also implies that k g k kk s k k → k g k k →
0. Otherwise, there is a subsequence {k g k i k } witha positive lower bound. Correspondently, k g k i kk s k i k → k s k i k →
0. Bythe Taylor extensions φ ( x k i + ) = φ ( x k i ) + h g k i , s k i i + o (cid:0) k s k i k (cid:1) , φ ( x k i ) = φ ( x k i + ) − h g k i + , s k i i + o (cid:0) k s k i k (cid:1) , and the Wolfe condition (19) that gives h g k i + , s k i i ≥ σ h g k i , s k i i , we have that o ( k s k i k ) = h g k i + , s k i i − h g k i , s k i i ≥ ( σ − ) h g k i , s k i i . Hence, 1 − σ = o ( k s ki k ) −h g ki , s ki i ≤ o ( k s ki k ) µ k g ki kk s ki k → {k g k i k} has a positive lower bound,which implies σ ≤
1, a contradiction with σ < (cid:3) Algorithm 2 gives the details of NCG for minimizing a nonlinear function φ ( x ) .We will use it to solve the exterior point model (3). The algorithm performs verywell in our tests. For instance, applying on a symmetric factorization A = WW T of acompletely positive matrix of order 20000 with cp-rank 10, the algorithm can get aCPF with accuracy 10 − within 150 iterations and 3 seconds, starting at the identitymatrix of order r . As a comparison, using the same initial point, the alternative pro-jection method given in [24] gives an approximate CPF in the accuracy 10 − , whichcosts more than 450000 iterations and more than 1500 seconds.3.4 PostprocessingGenerally, a solution X of (3) solve by the modified NCG is not exactly row-orthonormal since the algorithm terminates within a limit accuracy. We can get anapproximate CPF with a nonnegative factor ˜ B = ( W X ) + truncated from W X .To improve the accuracy of the approximate CPF, we suggest postprocessing onthe solution X . That is, find a row-orthonormal matrix Q ∈ R r × r + nearest to X at first,and then truncate W Q to be a nonnegative B + = ( W Q ) + . This Q can be a solutionto the Procrustes problem min QQ T = I k X − Q k F . That is, Q = UV T when we have the xterior Point Method for Completely Positive Factorization 19 singular value decomposition X = U Σ V T of X , where U is an orthogonal matrix oforder r and V ∈ R r × r + is column-orthonormal. The following estimation gives insightinto the improvement.Let N = ( W X ) − for simplicity, then ( W X ) + = W X − N , and k A − ( WX ) + ( W X ) T + k F = k A − ( WX )( W X ) T + ( W X ) N T + N ( W X ) T − NN T k F ≤ k A − ( WX )( W X ) T k F + ( k W X k + k N k ∞ ) k N k F , where k N k ∞ is the largest absolute entry of N . When X = Q , it is simplified as k A − ( WQ ) + ( W Q ) T + k F ≤ (cid:0) p k A k + k N k ∞ (cid:1) k N k F . This postprocessing may slightly increase the negative component k N k F , but it van-ishes the term k A − ( W X )( W X ) T k F , and yields a significant decreasing of the ap-proximate error eventually. In our experiments, we always adopt the postprocessingand take the orthogonal projection Q of a solution X as an eventual output. The CPF was thought to be NP-hard in [21] without proofs, even if the columnnumber of a nonnegative factor is relaxed to be larger than the cp-rank. That is, oneis not able to get an algorithm to compute such a CPF for all completely positivematrices within polynomial time of the matrix order. However, it may be possible toget a good factorization with high accuracy for some completely positive matriceswithin acceptable time. It is tricky that we know less about what kind of completelypositive matrices whose CPF is easy or hard to obtain.In this section, we will explore some potential issues that may influence the CPFnumerically, implemented by our exterior point method using the modified NCG thatis given in the previous section. We focus on the three issues on the truly existednonnegative factor B of a completely positive matrix A : the distribution of its columnnorms, its sparsity, and its approximately rank deficiency. It is not clear whether afixed A has multiple CPFs whose nonnegative factors B have quite different propertieson these three issues. However, we do not find evident differences in our experimentswhen A has a spare nonnegative factor B .Four kinds of distributions of the column norms of B are considered: constant,linear, convex, or concave. 
In each set of those B ’s with the same kind of the col-umn distribution, we also consider the influences of the column number (the cp-rankof A ), sparsity, and approximate rank deficiency of B to the CPF. Synthetic com-pletely positive matrices are randomly constructed with these properties. Because ofthe construction, we always have that r cp ( A ) = r ( A ) for these matrices. Hence, we set r + = r ( A ) . For simplicity, we also fix the order of these synthetic matrices as n = We say A = BB T is a weak CPF later if the column number of the nonnegative B is larger than thecp-rank r cp of A , distinguishing it from the strict CPF whose factor has r cp columns. It is more likely for A with a dense nonnegative factor to have multiple CPFs.0 Zhenyue Zhang, Bingjie Li r b min − − − − − − − − − − − − − − − − constant 100 45 98 100linear 90 99 99 99 98 100 100 100 100 100 100 100 100 100 100 100convex 87 97 97 98 99 100 100 100 100 100 100 100 100 100 100 100concave 92 100 100 100 97 99 98 99 99 100 100 100 100 100 100 100 Table 1
Percentage of CPF by EPM achieving accuracy ε = − , dense factors, and n = A few completely positive matrices with cp-rank larger than rank reported in the lit-erature and the synthetic completely positive matrices in a larger scale ( n = W from the symmet-ric factorization A = WW T to ˜ W = D − W with D = diag ( k w k , · · · , k w n k ) beforeits CPF. The postprocessing discussed in the last section is also adopted. That is, weuse the orthogonal projection Q as the output and set ˜ B = ( W Q ) + as an approximatenonnegative factor of A . We measure the factorization accuracy by the relative errorError ( ˜ A ) = k A − ˜ B ˜ B T k F k A k F . (36)4.1 Column distributionCompletely positive matrices in the form A = BB T can be easily constructed byrandomly choosing a nonnegative factor B with a given number of columns. Gener-ally, the cp-rank of such a matrix A is also equal to its rank. We consider four sets of A ’s with the different distributions of column norm sequence { b i } of B : One is thatwith constant b i = i , and the others have the same form as b i = − ( − b min ) t i − t t r − t ∈ [ b min , ] , i = , · · · , r , (37)where r is the number of columns, t i = i d and d = − − , 1, and 2, respectively.The different values of d determine the different sharp of { b i } : convex ( d = − − ),linear ( d = d = b min determines how small someof columns of B can be in these three types.The four types of B ’s are constructed as follows. We first choose ˆ B = ( ˆ b i j ) or order n × r with entries uniformly distributed in the interval ( , ) , and then normalize eachcolumn of ˆ B to B = ( b i j ) that has a given column norm sequence { b i } . That is, theentries are b i j = ˆ b i j (cid:0) b j / q ∑ k ˆ b k j (cid:1) .Table 1 lists the percentages of tested matrices whose CPF can be successfullyobtained by our exterior point method (EPM) with the accuracy Error ( ˜ A ) < ε = − among 1000 repeats in each type of column norm distribution with fixed b min and r . For constant { b i } , the percentages are that among 4000 tests.xterior Point Method for Completely Positive Factorization 21 Type constant linear, b min = . s r Table 2
Percentage of CPF achieving accuracy ε = − depending on s = s ( B ) and r = r cp ( A ) ( n = The parameters are also fixed for all the tests as λ = . , ρ = . , σ = . , ν = , κ = , ε = − , ε = − . (38)In this experiment, starting at a randomly chosen orthogonal matrix, our algorithmworks very well for almost all the cases, except the constant case with r =
10. Thereare only about 45% matrices of rank r =
10 whose factorization errors can achieve theaccuracy 10 − and about 53% matrices have factorization errors larger than 10 − .Similar phenomenon also occurs when the matrix size n is larger. The CPF is not dif-ficult for matrices with a dense factor B whose column norms are distributed linearly,convexly, or concavely, except in a spacial case shown in the following phenomenon.Phenomenon A. For a dense and full rank B ≥ with approximately equal columnnorms, if it has a special column number depending on its row number, the CPF ofA = BB T is relatively difficult than others. It is a bit puzzling. A similar phenomenon also occurs when B is sparse. We willshow it in the next subsection.4.2 SparsityBesides the distribution of the column norms, the sparsity of factor B is anotherissue that may affect the CPF. To show this phenomenon, we modify the construc-tion in the previous subsection by vanishing partial smallest entries of ˆ B with a givensparsity in percentage, without modifications on the normalization for having a spe-cial column distribution. Generally, the sparsity strategy does not change the rank andcp-rank if the sparsity is not large. However it affects the difficulty of CPF. If the sparsity is large, some rows of ˆ B may be zero. These rows should be deleted and the size n ofthe resulting A is slightly reduced.2 Zhenyue Zhang, Bingjie Li Type convex concave b min − − − − s r Table 3
Table 3  Dependence of successful CPF on approximate cp-rank deficiency: percentages of successful factorizations for the convex and concave column-norm distributions with varying $b_{\min}$, sparsity $s$, and rank $r$.

Table 2 lists the percentages of tested matrices whose CPF errors are smaller than $10^{-12}$ among 100 repeats with the constant or linear distribution of the column norms (the results for the concave and convex distributions are given in Table 3). It shows that the hardness of the CPF does not depend monotonically on the rank/cp-rank or on the sparsity of a nonnegative factor $B$. Very interestingly, the hardness is tightly related to a special rank-sparsity boundary.

Phenomenon B. For each column-norm distribution, there is a special rank-sparsity boundary such that the closer the rank-sparsity pair of $B$ is to this boundary, the more difficult the CPF of $A = BB^T$ is.

A small $b_{\min}$ yields an approximately deficient cp-rank of the completely positive matrices $A$ tested in the previous subsections. This is because we can rewrite $A = \sum_{i=1}^{r} d_i u_i u_i^T$ with nonnegative unit vectors $\{u_i\}$ and $d_r = b_{\min}^2$; for instance, if $b_{\min} = 10^{-2}$, then $d_r = 10^{-4}$. Clearly, deleting the last column of $B$ only slightly modifies $A$ to a completely positive matrix $\hat A = \sum_{i=1}^{r-1} d_i u_i u_i^T$ that is very close to $A$, with $\|A - \hat A\|_F = d_r$, but $\hat A$ has a smaller cp-rank. The approximate deficiency does not affect the CPF very much if the nonnegative factor $B$ is dense, as shown in Table 1. However, as the sparsity increases, the influence of the approximate cp-rank deficiency on the CPF becomes more and more evident.

Table 3 illustrates the phenomenon. We test two smaller values of the parameter $b_{\min}$ in the construction of completely positive matrices with the concave or convex distribution of column norms. As $b_{\min}$ decreases, the percentage of successful CPFs drops evidently when the sparsity of $B$ is not negligible, especially near the rank-sparsity boundary mentioned in Phenomenon B. We summarize this as

Phenomenon C. Approximate cp-rank deficiency significantly aggravates the CPF near the special rank-sparsity boundary of Phenomenon B.

Fig. 1  Weak CPF errors of 1000 tests on each of four matrices without approximate cp-rank deficiency: a dense constant-type factor ($r = 10$, $s(B) = 0\%$) and sparse factors of constant type ($r = 12$, $s(B) = 10\%$), convex type, and concave type (both $r = 12$, $s(B) = 7\%$), for several relaxed column numbers $r_+$.
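The rank-one expansion behind this explanation is easy to check numerically; a minimal sketch (the function name is ours):

import numpy as np

def deficiency_gap(B):
    """Distance from A = B B^T to the completely positive matrix obtained by
    deleting the column of B with the smallest norm; it equals b_min^2."""
    norms = np.linalg.norm(B, axis=0)
    j = int(np.argmin(norms))
    B_hat = np.delete(B, j, axis=1)
    gap = np.linalg.norm(B @ B.T - B_hat @ B_hat.T)     # ||A - A_hat||_F
    assert np.isclose(gap, norms[j] ** 2)               # d_r = b_min^2
    return gap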
One explanation is that many rows of the sparse nonnegative factor lie on the boundary of the nonnegative cone $\mathbb{R}^{r_{cp}}_+$, and it is hard to pursue these rows accurately.

In this section, we consider two strategies to increase the possibility of getting a CPF with acceptable accuracy. One is the relaxation of the cp-rank, in which the column number of the factor $B$ is relaxed from the exact cp-rank $r_{cp}$ to a larger integer $r_+$. Our algorithm works in this case without any modification, simply starting at an initial matrix of order $r \times r_+$, for example an arbitrarily chosen row-orthonormal matrix. The second is a fast restart strategy, for which we will give a new stopping criterion based on Theorem 4 to reduce the computational cost.

5.1 Weak CPF

By a weak CPF, we mean a symmetric factorization $A = \tilde B \tilde B^T$ with a nonnegative factor $\tilde B$ whose relaxed column number $r_+$ is larger than the cp-rank of $A$, as distinguished from the strict CPF with exactly $r_{cp}$ columns in its nonnegative factor. Relative to the strict factorization, the weak problem is a bit easier if $A$ is not approximately cp-rank deficient. A rough but reliable explanation is that increasing the column number enlarges the solution set, and most rows of a solution $\tilde B$ are far from the boundary of the nonnegative cone in a higher-dimensional space. In the relatively dense case, the percentage of successful factorizations even increases slightly as $b_{\min}$ becomes smaller; this will be explained in the next section.
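To make the relaxation concrete, the following minimal sketch evaluates the objective being minimized and its gradient, consistent with the expressions in Appendix B; the function name is ours, and relaxing $r_{cp}$ to $r_+$ only widens $X$:

import numpy as np

def f_and_grad(X, W, lam):
    """f(X) = ||X X^T - I||_F^2 + lam ||(W X)_-||_F^2 for X in R^{r x r_+},
    together with its gradient; cf. the formulas in Appendix B."""
    r = X.shape[0]
    E = X @ X.T - np.eye(r)              # orthonormality residual
    neg = np.minimum(W @ X, 0.0)         # negative part (W X)_-
    f = np.sum(E * E) + lam * np.sum(neg * neg)
    grad = 4.0 * E @ X + 2.0 * lam * (W.T @ neg)
    return f, grad

At a global minimizer $f(X) = 0$, so $XX^T = I_r$ and $WX \ge 0$; then $\tilde B = WX$ satisfies $\tilde B\tilde B^T = WXX^TW^T = WW^T = A$, a weak CPF when $r_+ > r_{cp}$.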
Fig. 2  Weak CPF errors of 100 tests on two matrices with an approximately deficient cp-rank (small $b_{\min}$), with dense $B$ (left) or sparse $B$ (middle). The right panel illustrates the improvement of the approximate CPF, via column shrinking, on a matrix with sparse $B$ like the one shown in the middle panel.

The relaxation strategy can significantly decrease the hardness of the strict CPF on matrices without cp-rank deficiency, whether $B$ is dense or not. Figure 1 illustrates this on four matrices whose strict CPFs are difficult, as shown in Tables 1 and 2. One of the matrices has a dense factor ($r = 10$) with constant-type column norms, and the other three have sparse factors ($r = 12$, with $s(B) = 10\%$ or $7\%$). In the dense case, due to the increased column number $r_+$, the improvement is significant in the sense that a weak CPF with higher accuracy becomes much more likely. In the sparse case, the algorithm cannot obtain a strict CPF with accuracy smaller than $10^{-12}$ in these repeated tests; however, if we increase the column number from $r_{cp} = 12$ to $r_+ = 18$ (or $r_+ = 24$ for the constant type), the factorization accuracy of the weak CPF reaches $10^{-12}$ in about 900 of the 1000 tests.

However, the improvement of the weak CPF is limited when the matrix has an approximately deficient cp-rank. As with the strict CPF, the weak CPF still works well if the factor $B$ is dense; we illustrate the performance in the left panel of Figure 2. The improvement exists but is very slight, which partially explains the slightly higher percentages listed in the first row of Table 3 corresponding to the smaller $b_{\min}$. The relaxation strategy may lose its efficiency when $A$ is approximately cp-rank deficient and $B$ is sparse, as illustrated in the middle panel of Figure 2.

To partially address the difficulty when $A$ is approximately cp-rank deficient and has a sparse factor $B$, we may use the CPF of its lower-rank modification $\tilde A = \tilde W\tilde W^T$ as an approximate CPF of the original $A$ if $\tilde A$ is also nonnegative, where $\tilde W$ is $W$ with its small columns deleted. That is, we apply the algorithm to $\tilde W$ with a column number $r_-$ smaller than $r_{cp}$. The right panel of Figure 2 shows the improvement when this column-shrinking strategy is adopted: in more than 40% of the tests, the factorization error is reduced. However, because of the perturbation in the input $\tilde W$, the factorization error with respect to the original $A$ cannot, in theory, be smaller than $b_{\min}^2$ in scale.

Algorithm 3
Restart exterior point method (REPM) for the CPF
Require: $W$, $r_{cp}$, $\rho$, $\sigma$, $\nu$, $\kappa$, $\lambda$, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_f$, $\varepsilon_{df}$, $k_{\mathrm{NCGmax}}$, and $k_{\mathrm{REPMtotal}}$.
Ensure: a row-orthonormal matrix $Q$ of order $r \times r_{cp}$.
1. Arbitrarily choose a row-orthonormal $X \in \mathbb{R}^{r \times r_{cp}}$, and set $k = 0$.
2. Starting from $X$, run Algorithm 2 for at most $k_{\mathrm{NCGmax}}$ iterations, until (39) or (40) is satisfied, and count the number $k_{\mathrm{NCG}}$ of iterations of this NCG procedure.
3. Update $k := k + k_{\mathrm{NCG}}$. If $k \ge k_{\mathrm{REPMtotal}}$, terminate the restart procedure.
4. If (39) is satisfied or $k_{\mathrm{NCGmax}}$ is reached, reset $X = -X$ and go to Step 2.
5. If (40) holds, continue the iteration of Algorithm 2 until $f(X_k) - f(X_{k-1}) < \varepsilon_{df}$.

5.2 Restart EPM

Basically, the convergence behaviour of the exterior point iteration may depend on the initial point. Two extreme phenomena may occur in the iteration:
– The CPF, whether strict or weak, is easy for some matrices: the iteration algorithm converges quickly starting at almost any orthogonal matrix.
– The CPF is difficult for some matrices: starting at almost any point, the iteration algorithm either drops to a local minimum quickly or converges to a global optimal solution very slowly.
In the remaining cases, the convergence may depend significantly on the initial point.

One may repeatedly try different initial points to increase the probability of a successful CPF. Let $p$ be the probability that an algorithm obtains a CPF with a given accuracy for a fixed $A$, starting at an arbitrarily chosen point. If we take an additional test as soon as the first one fails, the probability increases to $p + (1-p)p = 1 - (1-p)^2$; in general, the probability is $1 - (1-p)^k$ within at most $k$ tests. However, this may also cost $k$ times the original computational cost statistically, which can be unacceptably expensive if $k$ is large.

Fortunately, Theorem 4 implies the possibility of quickly checking whether the iteration can converge to a global optimizer or not. Clearly, detecting in time that the iteration is likely to converge only locally avoids unnecessary computational cost and, combined with a restart strategy, makes the algorithm more competitive for solving the CPF problem. In this subsection, we give a practical approach for restarting the algorithm EPM.

Theorem 4, or Eq. (15), shows two extremely different behaviours of the ratio $\|\nabla f(\tilde X)\|_F / f(\tilde X)$ as $\tilde X$ approaches a stationary point of $f$. Motivated by this observation, we adopt two criteria for judging whether global convergence can be expected:
– Not a global optimum. The current iterate $X$ satisfies
  $\|\nabla f(X)\|_F < \varepsilon_1$, $f(X) > \|\nabla f(X)\|_F$. (39)
  By the first inequality, $X$ is at least approximately a stationary point; meanwhile, the second condition means that $X$ is far from a global minimizer.
– Global optimum. We obtain a global optimal solution $X$ in high accuracy if
  $\|\nabla f(X)\|_F < \varepsilon_2$, $f(X) < \varepsilon_f$, (40)
  with $\varepsilon_f \ll \varepsilon_2 \ll \varepsilon_1$, set in our experiments as in (41)-(42).

Therefore, if condition (39) is satisfied, we conclude that the iteration is very likely to drop into a local minimum, and we restart the algorithm at a new point. To avoid falling back to the same local minimum, the new starting point should be far from the current one; a simple choice is the opposite point $-X$. If (40) is satisfied, the iterate is approaching a global minimizer, and it is suggested to continue the iteration until convergence in high accuracy or until the restriction on the total iteration number $k_{\mathrm{REPMtotal}}$ is reached. Details of this approach are given in Algorithm 3.
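For concreteness, here is a compact sketch of the restart loop; the inner solver is plain gradient descent standing in for the modified NCG of Algorithm 2, and the tolerances and step size are placeholders rather than the settings of (41)-(42):

import numpy as np

def repm(W, r_plus, lam=10.0, eps1=1e-2, eps2=1e-6, eps_f=1e-12,
         k_ncg_max=2000, k_total=20000, step=1e-3, seed=0):
    """Sketch of the restart logic of Algorithm 3 (REPM)."""
    rng = np.random.default_rng(seed)
    r = W.shape[1]
    X = np.linalg.qr(rng.standard_normal((r_plus, r)))[0].T   # X X^T = I_r
    k = 0
    while k < k_total:
        for _ in range(k_ncg_max):                 # stand-in for Algorithm 2
            k += 1
            E = X @ X.T - np.eye(r)
            neg = np.minimum(W @ X, 0.0)
            f = np.sum(E * E) + lam * np.sum(neg * neg)
            g = 4.0 * E @ X + 2.0 * lam * (W.T @ neg)
            gn = np.linalg.norm(g)
            if gn < eps2 and f < eps_f:            # criterion (40): global optimum
                return np.maximum(W @ X, 0.0)      # nonnegative factor B = W X
            if gn < eps1 and f > gn:               # criterion (39): local minimum
                break
            X = X - step * g
        X = -X                                     # restart far from the local minimum
    return np.maximum(W @ X, 0.0)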
To show the efficiency of REPM, we test matrices of order $n = 200$ with rank (cp-rank) $r = 12$, constructed in the four types ($b_{\min} = 0.1$) as before. The sparsity $s(B)$ varies from 1% to 19%. As shown in Tables 2 and 3, the CPF is difficult for most of the constructed matrices. 50 matrices are constructed for each type and each sparsity, so 1400 matrices are tested in total. For each matrix, we run Algorithm REPM 10 times, each starting at a randomly chosen orthogonal matrix. The parameters $\rho$, $\sigma$, $\nu$, $\kappa$, and $\lambda$ are set as in (38), and the tolerances $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_f$, $\varepsilon_{df}$ and the iteration limits $k_{\mathrm{NCGmax}}$, $k_{\mathrm{REPMtotal}}$ are fixed as in (41) and (42).

Table 4  Percentage of successful CP factorizations with accuracy $\varepsilon$, with and without the restart strategy ($r = 12$, $b_{\min} = 0.1$).

  No restart:
         s(B):  1%  4%  7% 10% 13% 16% 19% |  1%  4%  7% 10% 13% 16% 19%
                        constant           |          linear
       1e-12    47  21   3   0   0   3   4 | 97  83   7  26  42  46  46
       1e-13    47  21   3   0   0   3   4 | 95  82   7  26  42  46  46
       1e-14    47  19   3   0   0   3   4 | 95  80   7   2   5  10  14
                        convex             |          concave
       1e-12    99  88  13  11  44  56  73 | 97  71  12   9  35  43  50
       1e-13    99  87  12  11  44  56  73 | 97  69  10   9  35  43  50
       1e-14    99  83  11   5  33  48  58 | 96  68  10   1   1   8   2

Table 4 shows the percentages of CPFs achieving a given accuracy $\varepsilon$ obtained by Algorithm REPM among the 500 tests for each sparsity setting and each type of column-norm distribution. The restart strategy can significantly increase the possibility of obtaining a CPF with high accuracy. For example, the original algorithm almost
always fails to obtain a CPF with high accuracy for constant-type matrices with sparsities varying from 7% to 19%, whereas the factorization via the restart algorithm is very successful in most of the tests.

Table 5  REPM ($r = 12$, $b_{\min} = 0.1$): results for each column-norm type at sparsities $s(B) = 1\%, 4\%, \ldots, 19\%$.

It should be pointed out that the number of restarts and the computation time depend on the tested matrix. Roughly speaking, the larger the number of restarts, the harder the CPF. However, a large number of restarts does not necessarily mean an expensive computational cost, since a quick restart may occur. For example, on constant-type matrices with
$r = 12$, the computational cost is relatively expensive when $s(B)$ is 7% or 10%, compared with the other cases.

In this section, we compare our exterior point method with four state-of-the-art algorithms: two alternating projection methods, given in [24] (marked AP-H) and [21] (marked AP-G), the coordinate descent method given in [37] (marked CD), and the alternating nonnegative least squares method given in [29] (marked ANLS). (It was reported in [37] that the Newton method of [27] cannot beat the coordinate descent method, so we omit the comparison with the Newton method.) The algorithms and their parameter settings are briefly described below.

AP-H solves $\min_{B \ge 0,\, QQ^T = I} \|B - WQ\|_F$ by alternately minimizing $\|B_{k-1} - WQ\|_F$ over orthogonal $Q$ to get $Q_k$, and minimizing $\|B - WQ_k\|_F$ over $B \ge 0$ to update $B_k = (WQ_k)_+$. It was suggested in [24] to terminate the iteration when $|\langle B_{k-1}, B_{k-1} - WQ_k\rangle| < \varepsilon$ with a given $\varepsilon$; this criterion may miss global convergence.
Since $f_k^{\mathrm{APH}} = \|B_k - WQ_k\|_F$ is monotonically decreasing, we instead terminate the iteration if
  $\dfrac{f_{k-1}^{\mathrm{APH}} - f_k^{\mathrm{APH}}}{f_k^{\mathrm{APH}}} < \varepsilon_r^{\mathrm{APH}}$ and either $f_k^{\mathrm{APH}} > \sqrt{\varepsilon^{\mathrm{APH}}}$ (local optimum) or $f_k^{\mathrm{APH}} < \varepsilon^{\mathrm{APH}}$ (global optimum). (43)
The first condition avoids meaningless iterations that do not provide an acceptable decrease of the objective function, while the second distinguishes whether only a local minimum is achieved. This modification increases the efficiency and is helpful for restarting the algorithm, as done in REPM.

AP-G aims to minimize $\min_{WP \ge 0,\, QQ^T = I} \|Q - P\|_F$ via alternating projection [21]: set $P_{k+1} = Q_k - W^+(WQ_k)_-$, an approximate solution of $\min_{WP \ge 0} \|Q_k - P\|_F$ for the given $Q_k$, and $Q_{k+1} = \arg\min_{QQ^T = I} \|Q - P_{k+1}\|_F$ for the given $P_{k+1}$. The algorithm of [21] terminates if $|\min_{ij}(WQ_k)_{ij}| \le \varepsilon$ with a given $\varepsilon$. We observe that the gap $\|Q_k - P_k\|_F$ matches the factorization error $\|A - (WQ_k)_+(WQ_k)_+^T\|_F$ better than $|\min_{ij}(WQ_k)_{ij}|$ does, and that $\{\|Q_k - P_k\|_F\}$ is not monotonically decreasing. A more efficient termination criterion is
  $\varepsilon_k = \min_{i \le k} \|Q_i - P_i\|_F < \varepsilon^{\mathrm{APG}}$ (44)
with a given $\varepsilon^{\mathrm{APG}}$. Let $i_k \le k$ be the smallest index such that $\|Q_{i_k} - P_{i_k}\|_F = \varepsilon_k$. We terminate the iteration and take $Q_{i_k}$ as the output if (44) is satisfied; if $\varepsilon_{i_k}$ does not change within $k^{\mathrm{APG}}$ iterations, we also terminate the algorithm to avoid unnecessary iterations. Figure 3 illustrates the necessity of these modifications for AP-H and AP-G. In our experiments, $\varepsilon_r^{\mathrm{APH}}$, $\varepsilon^{\mathrm{APH}}$, $\varepsilon^{\mathrm{APG}}$, and $k^{\mathrm{APG}}$ are set to fixed values.

Fig. 3  Iterative behaviours of Algorithms AP-H (left) and AP-G (right).

CD minimizes $\|A - BB^T\|_F^2$ via column-by-column optimization, together with component-wise optimization for the column updates. ANLS minimizes $\|A - B_LB_R^T\|_F^2 + \alpha\|B_L - B_R\|_F^2$ with $B_L \ge 0$ and $B_R \ge 0$, starting from a scaled random matrix $B_R$; it is more efficient than CD. However, both algorithms converge slowly in our experiments, and because of this weakness the restart strategy is no longer suitable for CD or ANLS.
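As an illustration, a minimal sketch of the AP-H iteration with the modified stopping test follows; the orthogonal Procrustes step is the standard SVD solution, and the tolerance values are illustrative, not those of our experiments:

import numpy as np

def ap_h(W, eps_r=1e-8, eps=1e-24, max_iter=100000, seed=0):
    """AP-H sketch: alternate B_k = (W Q_k)_+ with the Procrustes update of Q,
    stopping when the relative decrease of f stalls, cf. (43)."""
    rng = np.random.default_rng(seed)
    r = W.shape[1]
    Q = np.linalg.qr(rng.standard_normal((r, r)))[0]   # random orthogonal start
    B = np.maximum(W @ Q, 0.0)
    f_old = np.inf
    for _ in range(max_iter):
        U, _, Vt = np.linalg.svd(W.T @ B)              # Q_k = argmin_Q ||B - W Q||_F
        Q = U @ Vt
        B = np.maximum(W @ Q, 0.0)                     # B_k = (W Q_k)_+
        f = np.linalg.norm(B - W @ Q)
        if f_old - f < eps_r * max(f, 1e-300):         # relative decrease stalled
            return B, bool(f < eps)                    # True marks a global optimum
        f_old = f
    return B, False

A sketch of AP-G would be analogous, alternating $P_{k+1} = Q_k - W^+(WQ_k)_-$ with an orthogonal Procrustes step and monitoring the gap $\|Q_k - P_k\|_F$ as in (44).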
Fig. 4  Sorted CPF errors (left) and computational time (right) of the five algorithms on the 360 matrices.

6.1 Synthetic matrices with equal rank and cp-rank

The tested matrices are randomly constructed with fixed $n$, $r = 12$, and $b_{\min} = 0.1$, as in the last section. For each of the four column-norm distributions, we choose 9 values of $s(B)$, varying from 1% to 25%. 10 completely positive matrices are constructed for each type and each sparsity, so in total we have 360 testing matrices in this experiment. The CPF is not easy on some of these matrices, as shown in Tables 2-4.

The left panel of Figure 4 plots the CPF errors of REPM, RAP-H, RAP-G, CD, and ANLS (the latter two without the restart strategy) on the 360 matrices, each tested with random starting points. (As in REPM, we randomly construct an orthogonal matrix as the starting point for RAP-H and RAP-G, since this performs better than the initial setting provided by their authors.) REPM gives good results, with relative CPF errors below $10^{-12}$ on about 97.8% of the matrices; the percentage decreases to 56.9% and 31.7% for RAP-H and RAP-G, respectively. Meanwhile, the computational cost of REPM is much lower than that of RAP-H and RAP-G. The percentages for CD and ANLS are very small. See the right panel of Figure 4 for a comparison of the computational times of these algorithms.

The efficiency of an algorithm may depend on the data set. To make the comparison as fair as possible, for each of REPM, RAP-H, and RAP-G, we choose two sets of testing matrices that are especially suitable or unsuitable for the selected algorithm, in the sense that the CPF errors are smaller than $10^{-12}$ (success set) or larger than $10^{-4}$ (fail set), respectively, and then compare the other algorithms on these special sets. In the success sets of RAP-H and RAP-G, REPM is also successful. Meanwhile, in the success set of REPM, RAP-H and RAP-G fail on about 39.1% and 76.0% of the matrices, respectively. Conversely, in the sets where RAP-H and
RAP-G fail, REPM still has success rates of 64.4% and 67.4%, respectively; these rates increase to 97.0% and 98.4% if the required accuracy is slightly decreased. Figure 5 shows the distributions of the relative CPF errors of the three algorithms in each of the six sets.

Fig. 5  CPF errors of the algorithms in the six subsets of tests corresponding to CPF errors smaller than $10^{-12}$ (success set) or larger than $10^{-4}$ (fail set) for REPM, RAP-H, and RAP-G, respectively.

As shown in Figure 4, CD fails to give a CPF with acceptable accuracy: its CPF errors are always larger than $10^{-4}$, and the factorization error is not decreased by restarting CD. ANLS benefits slightly from the restart strategy, but its running time increases significantly. For example, if we restart ANLS at most 10 times, each run restricted to at most 20 seconds, the percentage of CPFs with small $\mathrm{Error}(\tilde A)$ increases only slightly, from 13.3% to 17.5%, while the average computational time increases unacceptably, from 17.6 seconds to 171 seconds.

6.2 Large-scale completely positive matrices

In this comparison, we show the efficiency of REPM, RAP-H, and RAP-G on matrices of larger scale. We randomly construct 80 matrices with
$r = 20$ and 4 different sparsity values $s(B) = 0$, 10%, 20%, and 30%, with each of the four distributions of the column norms of $B$, as before.

The left panel of Figure 6 plots the CPF errors of REPM, RAP-H, and RAP-G, where each algorithm is terminated when the limit on the computational time, $t_{\max} = 100$ seconds, is reached. All the CPF errors of REPM are smaller than $10^{-12}$ on the 80 matrices. For RAP-H, only 37.5% of the CPF errors are smaller than $10^{-12}$, and RAP-G performs poorly: its CPF errors are larger than $10^{-4}$ on all the matrices. Meanwhile, the computational cost of REPM is much lower than that of RAP-H and RAP-G. In the right panel of Figure 6, we also compare the computational times in seconds of the three algorithms on the 80 matrices; REPM is much faster than RAP-H and RAP-G.
Fig. 6  CPF errors (left) and computation time (right) of REPM, RAP-H, and RAP-G on the 80 matrices.

6.3 Special matrices

We also test the algorithms on four special completely positive matrices $A_1, \ldots, A_4$, where $A_2 = WDW^T$ for a fixed matrix $W$ and diagonal matrix $D$,
$$A_3 = \begin{pmatrix} I_k & \tfrac{1}{k}J_k \\ \tfrac{1}{k}J_k & I_k \end{pmatrix}, \qquad A_4 = \begin{pmatrix} G & H & H \\ H & G & H \\ H & H & G \end{pmatrix},$$
where both $I_k$ and $J_k$ are of order $k$, $I_k$ is the identity, $J_k$ has all entries equal to 1, and
$$G = \mathrm{diag}(91, 42, 42, 42), \qquad H = \begin{pmatrix} 19 & 24 & 24 & 24 \\ 24 & 6 & 6 & 6 \\ 24 & 6 & 6 & 6 \\ 24 & 6 & 6 & 6 \end{pmatrix}.$$
In particular, $(r(A_3), r_{cp}(A_3)) = (2k-1, k^2)$. $A_1$ is a small example whose cp-rank achieves the upper bound in the estimation $r_{cp} \le r(r+1)/2 - 1$. $A_3$ is an example whose cp-rank can be significantly larger than the matrix order [22], since $r_{cp} = n^2/4$. It is diagonally dominant and hence has an explicit CPF by the factorization
$$A = \sum_{1 \le i < j \le n} a_{ij}(e_i + e_j)(e_i + e_j)^T + \mathrm{diag}(c_1, \ldots, c_n), \qquad (45)$$
given in [25] for any symmetric matrix $A = (a_{ij})$, where $c_i = a_{ii} - \sum_{j \ne i} a_{ij}$ and $e_i$ is the $i$-th column of the identity matrix of order $n$. Hence, if $A$ is nonnegative and diagonally dominant, i.e., if all
$c_i \ge 0$, then it must be completely positive, since (45) gives $A = BB^T$ with a nonnegative matrix $B$ of at most $r_+$ nonzero columns, where $r_+$ is the number of nonzero entries among $\{a_{ij} : i < j\}$ and $\{c_i\}$. For $A_3$, $r_+ = k^2 = r_{cp}$.
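Since (45) is fully constructive, it is straightforward to state as code; a minimal sketch (the function name is ours):

import numpy as np

def dd_cpf(A, tol=1e-12):
    """Explicit CPF (45) of a nonnegative, diagonally dominant, symmetric A:
    one column sqrt(a_ij) (e_i + e_j) per nonzero a_ij (i < j), and one
    column sqrt(c_i) e_i per nonzero c_i."""
    n = A.shape[0]
    c = np.diag(A) - (A.sum(axis=1) - np.diag(A))      # c_i = a_ii - sum_{j != i} a_ij
    assert np.all(A >= 0) and np.all(c >= -tol)
    cols = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] > 0:
                e = np.zeros(n); e[i] = e[j] = 1.0
                cols.append(np.sqrt(A[i, j]) * e)
        if c[i] > tol:
            e = np.zeros(n); e[i] = 1.0
            cols.append(np.sqrt(c[i]) * e)
    B = np.column_stack(cols)                          # A = B B^T
    assert np.allclose(B @ B.T, A)
    return B

Applied to $A_3$, this returns exactly $k^2$ columns, matching $r_+ = k^2 = r_{cp}$.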
$A_4$ is an example given in [11] to show that there is a completely positive matrix whose cp-rank is larger than $n^2/4$, stated there without an explicit CPF; in Appendix C we give an explicit form of its CPF. Our exterior point method can give a strict CPF for each of these matrices almost exactly.

REPM, RAP-H, and RAP-G perform very well on $A_1$, $A_2$, and $A_3$ with small $k$: each of the three algorithms quickly obtains a good CPF with an error smaller than $10^{-12}$. However, both CD and ANLS fail to yield an acceptable CPF at this accuracy. We restrict the running time of CD and ANLS to at most 20 seconds and, because each iteration of CD is faster than that of ANLS, we also impose an additional limit $n_{\mathrm{iter}}$ on their iteration numbers. As mentioned before, the restart strategy is not suitable for CD, since CD seldom terminates before these two restrictions are reached. In this experiment, we do restart ANLS as long as its total iteration number is less than $n_{\mathrm{iter}}^{\mathrm{ANLS}}$; the total iteration number of the restarted ANLS can then slightly exceed $n_{\mathrm{iter}}^{\mathrm{ANLS}}$, but stays below $2n_{\mathrm{iter}}^{\mathrm{ANLS}}$.

In Table 6, we list the average computational costs and total iteration numbers among 100 repeats on each matrix. REPM converges much faster than RAP-H and RAP-G (none of these three algorithms reaches the restriction on time). Besides the advantage in computational time, REPM is also more robust to the initial point. We also test EPM, AP-H, and AP-G without the restart strategy on $A_1$, $A_2$, and $A_3$, with 100 tests for each algorithm; in these tests EPM has a higher success rate, see Figure 7. (In REPM we here use a randomly chosen row-orthonormal matrix as the restart point, rather than $-X$, keeping the other settings unchanged.)

Table 6  Average computational costs (time in seconds / total iteration number) of the five algorithms on the special matrices.

  Error(Ã)      Algorithm   A_1               A_2               A_3 (k = 5)
  < 10^{-12}    REPM        0.002 / 128       0.004 / 283       0.104 / 6121
                RAP-H       0.091 / 16543     0.012 / 1493      1.510 / 89956
                RAP-G       0.07 / 11226      0.119 / 16007     2.512 / 140245
  > 10^{-4}     CD          1.913 / 500001    6.077 / 1000001   20.000 / 353369
                ANLS        9.629 / 100000    9.878 / 100000    24.582 / 121204

Fig. 7  Sorted CPF errors of the three algorithms without the restart strategy on $A_1$, $A_2$, and $A_3$ ($k = 5$).
Table 7  Performance of REPM, RAP-H, and RAP-G on the matrix $A_3$ with $k = 6, 8, 10$ and on $A_4$ (success rate / average time / total iteration number among 100 runs).

  Error(Ã)      Matrix        REPM                  RAP-H                   RAP-G
  < 10^{-12}    A_3, k = 6    100% / 0.4s / 17764   100% / 3.9s / 157958    100% / 7.9s / 36718
  > 10^{-12}    A_3, k = 6    0% / – / –            0% / – / –              0% / – / –
                A_4           0% / – / –            93% / 40.1s / 1740837   100% / 40.1s / 1590793

The cp-rank of $A_3$ is much larger than its rank and its matrix order once $k$ is even slightly large, and the same holds for $A_4$; the CPF is difficult in this case. We use a relatively weak criterion for RAP-H and RAP-G: these algorithms are terminated when both the total running time $t_{\max}$ (in seconds) and the total iteration number $k_{\mathrm{total}}$ are reached, but a strict criterion for REPM: it is terminated when half of the time $t_{\max}$ is reached. We use different values of $t_{\max}$ and $k_{\mathrm{total}}$ for $A_3$ and $A_4$: the budgets increase with $k$ for $A_3$ with $k = 6$, 8, and 10, and for $A_4$ we set $t_{\max} = 40$ seconds.

Table 7 shows the performance of REPM, RAP-H, and RAP-G on $A_3$ and $A_4$, reporting the success rates together with the average computational time and total iteration number among 100 runs for each matrix. The three algorithms perform very well on $A_3$ with small $k = 6$. As $k$ increases slightly, to 8 or 10, the percentages of successful CPFs of RAP-H and RAP-G decrease and their computational times increase quickly. RAP-H is a bit better than RAP-G on $A_4$, but its success rate is only 7%. REPM performs much better than RAP-H and RAP-G: on $A_3$ with $k = 8$ and on $A_4$ it always gives a CPF with error smaller than $10^{-12}$, and on the difficult $A_3$ with $k = 10$ it still provides a CPF within this accuracy in 47% of the runs.
7 Conclusion

In this paper, we applied the idea of the exterior point method to the CPF problem numerically. The proposed optimization model and its iterative solver via a modified NCG implement the exterior point method, and in the numerical experiments reported in this paper, the exterior point method performs much better than the algorithms in the literature. Through extensive numerical experiments with our algorithm, we also discussed some potential issues that may affect the CPF; some of the observed phenomena are interesting and might be helpful for further analysis of this topic. However, we have only touched the tip of the iceberg. For instance, the special rank-sparsity boundary mentioned in Phenomenon B may determine how hard the CPF is, but we do not yet know how to verify its existence or characterize such a boundary theoretically. Besides this, approximate cp-rank deficiency may also result in a difficult CPF. For our algorithm and its restart version on difficult CPFs, it deserves further study to find an efficient initial setting. Finally, the weak CPF problem is relatively easier than the strict CPF, and it may also be an interesting topic to transform a weak CPF into a strict CPF efficiently. It is worth working further on these topics.
Acknowledgements
The work was supported in part by NSFC project 11971430 and the Major Scientific Research Project of Zhejiang Lab (No. 2019KB0AB01).
A Transversal intersection of $\mathcal{Q}_r$ and $\mathcal{P}_r$

Proposition 4
The Stiefel manifold $\mathcal{Q}_r = \{Q \in \mathbb{R}^{r \times r_+} : QQ^T = I_r\}$ transversally intersects the submanifold $\mathcal{P}_r = \{P \in \mathbb{R}^{r \times r_+} : WP \ge 0\}$ at $Q \in \mathcal{Q}_r \cap \mathcal{P}_r$ if and only if, for any nonnegative matrix $Y \in \mathbb{R}^{n \times r_+}$, the matrix $W^TYQ^T$ is not symmetric, or its trace is not zero whenever $W^TYQ^T$ is not zero.

Proof  It is known that the normal space of $\mathcal{Q}_r$ at $Q$ is $N_{\mathcal{Q}_r}(Q) = \{SQ : S \in \mathbb{R}^{r \times r},\ S = S^T\}$. If also $Q \in \mathcal{P}_r$, the normal cone of $\mathcal{P}_r$ at $Q$ is defined as $N_{\mathcal{P}_r}(Q) = \{N : \langle N, P - Q\rangle \le 0,\ \forall P \in \mathcal{P}_r\}$. Since $W$ is of full column rank, we can rewrite any $N \in \mathbb{R}^{r \times r_+}$ as $N = -W^TY$ with some $Y \in \mathbb{R}^{n \times r_+}$ ($Y$ is not unique in this representation). The restriction defining $N_{\mathcal{P}_r}(Q)$ becomes $\langle Y, WP - WQ\rangle \ge 0$ for all $P \in \mathcal{P}_r$. Since $Q \in \mathcal{P}_r$, choosing $P = 2Q$ and $P = 0$ in this restriction shows that it is equivalent to $\langle Y, WQ\rangle = 0$ together with $\langle Y, WP\rangle \ge 0$ for all $P \in \mathcal{P}_r$. Furthermore, we can restrict $Y$ to be nonnegative, which makes the inequality $\langle Y, WP\rangle \ge 0$, $\forall P \in \mathcal{P}_r$, hold automatically, and hence we can represent
$$N_{\mathcal{P}_r}(Q) = \{-W^TY : Y \in \mathbb{R}^{n \times r_+},\ Y \ge 0,\ \langle Y, WQ\rangle = 0\}.$$
To show the existence of such a nonnegative $Y$ for a fixed $W^TY$, assume on the contrary that $W^TY \ne W^T\tilde Y$ for every $\tilde Y \ge 0$. Then $W^TY$ and the closed cone $\{W^T\tilde Y : \tilde Y \ge 0\}$ can be separated: there is an $H \in \mathbb{R}^{r \times r_+}$ such that $\langle Y, WH\rangle < \langle\tilde Y, WH\rangle$ for all $\tilde Y \ge 0$. Let $\tilde Y_t \ge 0$ be the matrix with $(\tilde Y_t)_{ij} = t > 0$ at a single entry $(i, j)$ and zeros elsewhere. Letting $t \to \infty$, $(WH)_{ij} = \frac{1}{t}\langle\tilde Y_t, WH\rangle > \frac{1}{t}\langle Y, WH\rangle \to 0$, which implies $(WH)_{ij} \ge 0$; that is, $H \in \mathcal{P}_r$. A contradiction follows immediately: then $\langle Y, WH\rangle \ge 0$ since $H \in \mathcal{P}_r$, while choosing $\tilde Y = 0$ in the separation gives $\langle Y, WH\rangle < \langle 0, WH\rangle = 0$.

That $\mathcal{Q}_r$ intersects $\mathcal{P}_r$ transversally at $Q \in \mathcal{Q}_r \cap \mathcal{P}_r$, which by definition means that $N_{\mathcal{Q}_r}(Q)$ and $-N_{\mathcal{P}_r}(Q)$ intersect only at the origin, is thus equivalent to the following: if $SQ = W^TY$ for a symmetric $S \in \mathbb{R}^{r \times r}$ and a nonnegative $Y \in \mathbb{R}^{n \times r_+}$ satisfying $\langle WQ, Y\rangle = 0$, then $S = W^TYQ^T = 0$. This is in turn equivalent to the condition that, for any nonzero $Y \ge 0$ in $\mathbb{R}^{n \times r_+}$, there is no symmetric $S$ equal to $W^TYQ^T$, i.e., $W^TYQ^T$ is not symmetric, or $Y \notin -N_{\mathcal{P}_r}(Q)$, i.e., the trace $\langle WQ, Y\rangle$ of $W^TYQ^T$ is not zero when $W^TYQ^T$ is not zero. □

B Proof of Theorem 1
Proof  For a stationary point $X$ of $f$, let $T_0$ and $T_-$ be the indicator matrices of the zero entries and of the negative entries of $WX$, respectively. Consider a sufficiently small neighborhood $N(X)$ of $X$ in which $(W\tilde X)_{ij} < 0$ wherever $(WX)_{ij} < 0$, and $(W\tilde X)_{ij} > 0$ wherever $(WX)_{ij} > 0$. For $\tilde X = X + \Delta \in N(X)$, since $(WX) \odot T_0 = 0$, we get
$$(W\tilde X)_- = (W\tilde X) \odot T_- + (W\Delta)_- \odot T_0, \quad \|(W\tilde X)_-\|_F^2 = \|(W\tilde X) \odot T_-\|_F^2 + \|(W\Delta)_- \odot T_0\|_F^2.$$
Let $f_-(\tilde X) = \|\tilde X\tilde X^T - I\|_F^2 + \lambda\|(W\tilde X) \odot T_-\|_F^2$, so that $\nabla f_-(\tilde X) = 4(\tilde X\tilde X^T - I)\tilde X + 2\lambda W^T\big((W\tilde X) \odot T_-\big)$. Hence,
$$f(\tilde X) = f_-(\tilde X) + \lambda\|(W\Delta)_- \odot T_0\|_F^2, \qquad \nabla f(\tilde X) = \nabla f_-(\tilde X) + 2\lambda W^T\big((W\Delta)_- \odot T_0\big). \qquad (46)$$
Obviously, $f_-(X) = f(X)$ and $\nabla f_-(X) = \nabla f(X) = 0$. It is easy to verify that
$$f_-(\tilde X) - f_-(X) = \lambda\|(W\Delta) \odot T_-\|_F^2 + 2\langle XX^T - I, \Delta\Delta^T\rangle + \|\Delta X^T + X\Delta^T + \Delta\Delta^T\|_F^2 = \langle\mathcal{A}(\Delta), \Delta\rangle + 2\langle\Delta X^T + X\Delta^T, \Delta\Delta^T\rangle + \|\Delta\Delta^T\|_F^2,$$
where $\langle\mathcal{A}(\Delta), \Delta\rangle$ is a quadratic form with the linear mapping $\mathcal{A} : \mathbb{R}^{r \times r_+} \to \mathbb{R}^{r \times r_+}$,
$$\mathcal{A}(\Delta) = 2(XX^T - I)\Delta + 2\Delta X^TX + 2X\Delta^TX + \lambda W^T\big((W\Delta) \odot T_-\big).$$
It is not difficult to rewrite $\langle\mathcal{A}(\Delta), \Delta\rangle$ as a quadratic form $\delta^TS\delta$ with a symmetric matrix $S$ and the vector representation $\delta = \mathrm{vec}(\Delta)$ of the matrix $\Delta$. If $S$ is positive definite, the quadratic form is positive definite, and hence $X$ is a local minimizer of $f_-$. If $S$ is positive semidefinite, let $\{\delta^{(k)} = \mathrm{vec}(\Delta^{(k)})\}$ be the $rr_+$ unit eigenvectors of $S$ corresponding to the eigenvalues $\{\lambda_k\}$ in ascending order, the first $m$ of which are zero. By the assumption, $\Delta^{(k)}X^T + X(\Delta^{(k)})^T = 0$ for $k \le m$. Then, for $\tilde X = X + \Delta$ with a nonzero $\Delta = \sum_k \alpha_k\Delta^{(k)}$,
$$f_-(\tilde X) - f_-(X) = \langle\mathcal{A}(\Delta), \Delta\rangle + 2\langle\Delta X^T + X\Delta^T, \Delta\Delta^T\rangle + \|\Delta\Delta^T\|_F^2 = \sum_{k>m}\big\{\alpha_k^2\lambda_k + 4\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\big\} + \|\Delta\Delta^T\|_F^2.$$
Since $\big|\sum_{k>m}\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\big| = \big|\langle\sum_{k>m}\alpha_k\Delta^{(k)}, \Delta\Delta^TX\rangle\big| \le \big\|\sum_{k>m}\alpha_k\Delta^{(k)}\big\|_F\|\Delta\Delta^T\|_F\|X\|_2$ and, by the orthogonality of $\{\delta^{(k)}\}$, $\big\|\sum_{k>m}\alpha_k\Delta^{(k)}\big\|_F = \sqrt{\sum_{k>m}\alpha_k^2}$, we get
$$\Big|4\sum_{k>m}\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\Big| \le 4\sqrt{\sum_{k>m}\alpha_k^2}\,\|X\|_2\,\|\Delta\Delta^T\|_F \le 4\sum_{k>m}\alpha_k^2\|X\|_2^2 + \|\Delta\Delta^T\|_F^2.$$
Therefore, $f_-(\tilde X) - f_-(X) \ge \sum_{k>m}\alpha_k^2\big(\lambda_k - 4\|X\|_2^2\big) \ge$