Exterior Point Method for Completely Positive Factorization
aa r X i v : . [ m a t h . O C ] F e b Noname manuscript No. (will be inserted by the editor)
Exterior Point Method for Completely Positive Factorization
Zhenyue Zhang · Bingjie Li
Received: date / Accepted: date
Abstract
Completely positive factorization (CPF) is a critical task with applicationsin many fields. This paper proposes a novel method for the CPF. Based on the idea ofexterior point iteration, an optimization model is given, which aims to orthogonallytransform a symmetric lower rank factor to be nonnegative. The optimization prob-lem can be solved via a modified nonlinear conjugate gradient method iteratively.The iteration points locate on the exterior of the orthonormal manifold and the closedset whose transformed matrices are nonnegative before convergence generally. Con-vergence analysis is given for the local or global optimum of the objective function,together with the iteration algorithm. Some potential issues that may affect the CPFare explored numerically. The exterior point method performs much better than otheralgorithms, not only in the efficiency of computational cost or accuracy, but also inthe ability to address the CPF in some hard cases.
Keywords completely positive factorization · completely positive rank · nonlinearconjugate gradient method · nonnegative factorization of matrices Mathematics Subject Classification (2020) · · Completely positive factorization (CPF) of a nonnegative, symmetric, and pos-itive semidefinite matrix A looks for a nonnegative factorization A = BB T with anonnegative factor B . If such a CPF exists, the matrix A is said to be completelypositive. More preciously, a strict CPF of A means the factorization A = BB T with a Zhenyue ZhangSchool of Mathematics Science, Zhejiang University, Yuquan Campus, Hangzhou 310038, ChinaResearch Center for Advanced Artificial Intelligence Theory, Zhejiang Lab, Hangzhou 311121, ChinaE-mail: [email protected] LiSchool of Mathematics Science, Zhejiang University, Yuquan Campus, Hangzhou 310038, ChinaE-mail: [email protected] Zhenyue Zhang, Bingjie Li nonnegative factor B having the smallest column number r cp since such a factoriza-tion may be not unique if it exists. The smallest column number r cp is also called asthe completely positive rank, or cp-rank for short, of A .The CPF problem appears in many fields such as balanced incomplete block de-signing [8], energy demand model designing [23], probability estimation of DNAsequences [26]. In matrix games, completely positive matrices are used to check ifthe Nash equilibrium can be found by algorithm Lemke that solves a linear com-plementary problem [32]. In the recent decade, the CPF plays an important rolein combinatorial and nonconvex quadratic optimization. For example, some binaryquadratic problem or graph optimization can be relaxed as a continuous optimiza-tion problem over a set of completely positive matrices [2,3,13], and the iterationof completely positive matrices could be implemented in a cone where the updatedcompletely positive matrices are given in the form of convex combination of rank-onenonnegative matrices [5], or via optimizing the nonnegative factors of the CPF [9].In the latter case, an initial CPF should be given for running the factor-optimization.In data science, CPF or completely positive approximation can also be used to refinenoisy graphs for data clustering [27,29,39] or to estimate the probabilities of extremeevents happen in nature or financial markets [12]. Hence, the CPF is interesting inboth of theory analysis and applications.However, the CPF problem is NP-hard in the following issues [6]: detectingwhether a nonnegative, symmetric, and positive semidefinite matrix is completelypositive [16], determining the cp-rank of a completely positive matrix, finding a non-negative factor B when the cp-rank r cp ( A ) of A is known [21], or solving the prob-lem min B ≥ k A − BB T k F for completely positive approximation, where B has a givennumber of columns [37]. Only for a narrow part of completely positive matrices withspecial structures, the CPF can be obtained directly or within polynomial time. Forexample, diagonal dominant nonnegative symmetric matrices are completely posi-tive [25]. A bipartite graph, i.e. , A = (cid:20) D C T C D (cid:21) with positive diagonal D and D andnonnegative C , is completely positive and has an explicit CPF [7]. For an acyclic orcircular graph, if it is positive semidefinite, one can check whether it is completelypositive or not within polynomial time. Furthermore, if it is completely positive, onecan also get its CPF within polynomial time [15]. 
It was shown in [28] that if a com-pletely positive matrix of rank r has a principal diagonal matrix of order r , then itscp-rank equals to r and its CPF can be obtained explicitly.There are some efforts on algorithms for this problem in the literature. MoodyChu proposed a direct approach in [18] via successive rank-one reduction, based onthe Wedderburn’s formula to reduce the rank [38]: ˆ A = A − σ − uu T with u = Ax and σ = x T Ax . Here x should yield a nonnegative u and a positive σ , and meanwhile, ˆ A is also nonnegative. A greedy approach was given in [18] to determine such an x via maximizing the smallest entries of ˆ A . Kuang et al. considered the natural modelmin W ≥ k A − WW T k F and solved it by projected Newton method in [27]. This prob-lem was relaxed to the penalty form min W , H ≥ k A − W H T k F + λ k W − H k F in [29]so that it can be solved iteratively via alternatively optimizing the factors W and H , The procedure terminates if such an x cannot be obtained.xterior Point Method for Completely Positive Factorization 3 each is a nonnegative least square problem without a closed form solution. Writing k A − WW T k F = k A j − w · j w T · j k for each j , where w · j is the j -th column of W and A j = A − ∑ i = j w · i w T · i , In [37], Vandaele et al. considered the successive column-updating: min w · j ≥ k A j − w · j w T · j k F , sweeping all the columns of W repeatedly untilconvergence.Different from the direct optimization on nonnegative factors, [24] and [21] con-sidered two approaches that orthogonally transform a full-rank factor W of the sym-metric factorization A = WW T to be nonnegative as B = W Q directly or indirectly.In [24], the explicit error k B − WQ k F is minimized, subjected to nonnegative B andorthogonal Q . This problem is solved by alternatively optimizing nonnegative B andorthogonal Q , and the iteration converges to a point satisfying its first order conditionof local optimum. The optimization considered in [21] is implemented in the originalspace as min k Q − P k F subjected to orthogonal Q and restricted matrix P with non-negative image W P . This problem is also solved by alternative iteration on Q and P .However, it is hard to solve the subproblem on P because of the implicit restriction on P . Local convergence of these alternative projection algorithms was given in [21].However, it is not clear if the local convergence could be linear. In [14], Drusvyatskiyproved that for an alternative projection onto two closed sets X and Y , the linearconvergence occurring nearby an intersected point z of X and Y if X intersects Y at z transversally. That is, at z , the normal cone of X and the negative normal cone of Y are intersected at the origin only. For the alternative projection considered in [21],this condition at an orthogonal Q with nonnegative B = W Q is equivalent to that forany nonnegative matrix Y ∈ R n × r , S = B T Y is not symmetric or its trace is not zero if S =
0. See Proposition 4 given in Appendix A. This condition is not satisfied in somecases. Below is an instance: the completely positive matrix A has a factor W , andthere is an orthogonal Q with B = W Q ≥ Y ≥ B T Y is symmetricand nonzero. However, it has a zero trace. A = , W = − − − − − − − , Q = − − , Y = . These algorithms mentioned above may suffer from the nonnegative or orthogonalrestrictions. Because of the restrictions, an inner iteration is required to solve theinvolved subproblems if there are no closed form solutions, or adopt an estimatedsolution instead, but it may delay the convergence.In this paper, we consider a novel approach for the CPF without restrictions. Dif-ferent from the alternative projection mentioned above, our idea is exterior point it-eration that pursues an orthogonal Q on the exterior of the set of orthogonal matricesand the set of P ’s whose images W P are nonnegative. It may be expected that suchan exterior point iteration is more efficient since no restrictions should be obeyed.We will give an optimization model for implementing the exterior point iteration. We will use w i for the i -th row of W . An approximate rule of P was suggested in [21] to simplify the updating, but WP may be no longernonnegative. Zhenyue Zhang, Bingjie Li Decreasing the objective function at a point X implicitly propels X moving towardthe set of orthogonal matrices, and meanwhile, its image W X is nonnegative even-tually. We will show some properties about the first order condition and the secondorder condition for the optimal solutions. We will also characterize the global opti-mum that can be used to check whether the iterative point will go to a global optimalone or not, when we solve this optimization problem iteratively. Based on this prop-erty, a restart strategy can be also used in order to avoid local optimum as much aspossible. The classical nonlinear conjugate gradient (NCG) method requires a twicecontinuously differentiable function [1,35] or differentiable function with Lipschitzcontinuous gradient [35] to guarantee the convergence. We modify the NCG so thatthe condition of the objective function can be weakened to be continuously differ-entiable as that in our model. The modified NCG performs very efficiently in ourexperiments.It is not clear what is the dominant issue that determines the difficulty of CPF.In this paper, we will numerically explore the three possible issues: the cp-rank of A , the sparsity of the nonnegative factor B in a CPF of A , or the approximately cp-rank deficiency of A . An interesting observation is that there is a special rank-sparsityboundary nearby which the CPF is much harder than others. We believe that thesephenomena offer some important insight into the CPF problem.The exterior point method performs much better than other algorithms given inthe literature, not only on the suitableness of completely positive matrices wheneverwhose cp-rank is equal to or larger than the rank, but also on the computational effi-ciency on the accuracy of the factorization or the computational cost for completelypositive matrices in large scales. We will report the results of the numerical experi-ments to show the advantages and comparisons with other algorithms.The remaining part of this paper is organized as follows. In Section 2, we show themotivation of the exterior point method and the optimization model of the CPF. Thenthe first order and second order conditions are discussed. We also discussed someinteresting properties of the global optimum. 
The NCG method and its modificationsare discussed in Section 3 for solving the problem. In Section 4, we report someinteresting phenomena of the CPF via a lot of numerical experiments by the modifiedNCG, which partially explores possible issues resulting in difficult CPF. We alsoshow some improvements for difficult CPF by weak CPF or restart techniques inSection 5. Comparisons with other algorithms are given in Section 6 that further showthe efficiency of our exterior point method on synthetic completely positive matricesin small or larger scales and some difficult examples reported in the literature. Finally,some remarks are given in the conclusion section. Let A be a completely positive matrix of order n , and let r = r ( A ) and r cp = r cp ( A ) be the matrix rank and cp-rank of A , respectively. It is easy to estimate the matrixrank but hard for the cp-rank unless r = , Theoretically, the cp-rank could be In this special case, r cp = r .xterior Point Method for Completely Positive Factorization 5 equal to or much larger than the matrix rank or the matrix order. There are two kindsof estimations on the cp-rank. One was an upper bound of r cp in term of r given in[4]: r cp ≤ ( r + r − ) for r ≥
2. The estimation is not tight when n is relativelylarge. For instance, the upper bound cannot be touched if A is of full rank with n > r cp ≤ ( n + n − ) if n ≥
6, not depending on the rank of A .A lower-bound given in [11] as that r cp ≥ (cid:0) n + n − − ( √ n − ) n (cid:1) for n ≥ r cp = r . It makes sense since if wearbitrarily choose r nonnegative n -dimensional vectors with r ≤ n , these vectors arealmost always linearly independent, i.e. , the matrix B of these vectors is of full rank,and hence A = BB T has rank r almost always.In this section, we do not assume that r cp = r . Furthermore, we do not ask thecolumn number of B equal to r cp in the CPF that we are going to determined. Similarwith that in [24] and [21], we look for an orthonormal transformation B = W Q from afull-rank factor W of a symmetric factorization A = WW T . That is, W is of r columns, B has r + columns with a given integer r + ≥ r cp , and Q is a row-orthonormal matrixof order r × r + . Such an orthonormal matrix exists according to the following lemma.A similar result was given in [24], here we give a much simpler proof. Lemma 1
Let A = WW T be a full column-rank and symmetric factorization of A,and r + ≥ r cp ( A ) . Then A = BB T with a nonnegative B of r + columns, if and only ifthere is a row-orthonormal Q of order r × r + such that B = W Q.Proof
The sufficiency is obvious since QQ T = I . For the necessity, the equality BB T = WW T implies that B and W have the same rank and span ( W ) = span ( B ) .Therefore, B = W Q for a matrix Q ∈ R r × r + . Substituting B = W Q into BB T = WW T ,we get that QQ T = I since W is of full column rank. The proof is then completed. (cid:3) { Q k } and { P k } in the constrained domains Q = { Q ∈ R r × r + : QQ T = I r } , P = { P ∈ R r × r + : W P ≥ } , respectively, to pursue an intersect point of the feasible domains, via alternative pro-jection. Different from the greedy alternative projection, we consider a new strategyof exterior point iteration for solving the problem of CPF. The idea is that we lookfor an iterative sequence { X k } that alive on the exterior of the two subdomains. Each X k is not orthogonal and W X k may be not nonnegative. Meanwhile, we hope that thesequence { X k } can get close to both of Q and P more and more, and eventually, thesequence { X k } can converge to an intersect point Q of Q and P to get a nonnegativefactor B = W Q of A .To this end, let us consider the distances of a given matrix X to Q and P , respec-tively, d ( X , Q ) = min Q ∈ Q k X − Q k F , d ( X , P ) = min P ∈ P k X − P k F . Zhenyue Zhang, Bingjie Li
The ideal model for the CPF ismin X ∈ R r × r + (cid:8) d ( X , Q ) + d ( X , P ) (cid:9) . (1)Clearly, d ( X , Q ) has a simple representation d ( X , Q ) = ∑ k ( − σ k ( X )) , where { σ k ( X ) } are the singular values of X . Unfortunately, there is not a closed form to rep-resent the distance d ( X , P ) because of the implicit restriction in P . However, d ( X , P ) can be represented by k W X − W P k F since k W X − W P k F ≤ k W k k X − P k F . It has alower bounded as d ( X , P ) ≥ min P ∈ P k W X − W P k F k W k ≥ min B ∈ R n × r ++ k B − WX k F k W k = k ( W X ) − k F k W k , where ( W X ) − = min (cid:8) W X , (cid:9) is the negative part of W X . Therefore, the ideal objec-tive function can be equivalently transformed tomin X ∈ R r × r + n ∑ k (cid:0) − σ k ( X ) (cid:1) + k ( W X ) − k F k W k o . (2)However, the above model should be slightly modified in the view of numericalcomputation since it costs much to evaluate all the singular values. This disadvantagecan be addressed by taking into account the equality k XX T − I k F = ∑ i ( − σ i ( X )) .Since the global optimal solution is achieved at an X with σ i =
1, when σ i ( X ) ≈ i , we see that k XX T − I k F = ∑ i ( − σ i ( X )) ( + σ i ( X )) ≈ ∑ i ( − σ i ( X )) . Moreimportantly, it can be evaluated economically. Notice that k XX T − I k F is a quarticfunction of X , while k ( W X ) − k F is quadratic. We slight modify the coefficients of thetwo functions in (2) when the first one is replaced by k XX T − I k F ,min X ∈ R r × r + n k XX T − I k F + λ k ( W X ) − k F o . (3)where λ = / k W k . One may slightly change λ if necessary. Numerically, it maybe more robust if we rescale the rows of W to have unit norm, i.e. , replace W by itsnormalized ˜ W = diag ( k w k − , · · · , k w n k − ) W in (3), where { w i } are the rows of W . Since √ n = k ˜ W k F ≤ √ r k ˜ W k , 1 / k ˜ W k ≤ rn .Hence, we can simply set λ ≈ rn in this case. In our experiments, we always use thisparameter set for λ and it works very well. We can set B = ( W X ) + , the nonnegativepart of W X , as the required nonnegative factor.There are some advantages to the problem (3) and its objective function f ( X ) = k XX T − I k F + λ k ( W X ) − k F . At first, X ∈ Q ∩ P if and only if f ( X ) = i.e. , it is an optimal solution of (3). Second,because of the square form of the negative entries in the second term, f is derivative.Third, if at a point X , W X does not have zero entries, f ( X ) is a polynomial functionnear X . This property implies a similar behavior as a polynomial function that ben-efits the analysis of convergence or global optimality. In the next subsection, we willfurther discuss the local or global optimality. xterior Point Method for Completely Positive Factorization 7 f is derivable, it is not difficult to characterize itsfirst order condition for its local optimum. One can verify that ∂∂ x i j (cid:13)(cid:13) ( W X ) − (cid:13)(cid:13) F = ∂∂ x i j n ∑ s = (cid:16) r ∑ t = w st x t j (cid:17) − = n ∑ s = w si (cid:16) r ∑ t = w st x t j (cid:17) − . The gradient of k ( W X ) − k F can be represented as ∇ k ( W X ) − k F = W T ( W X ) − . Wealso have that ∇ k XX T − I k F = ( XX T − I ) X . Hence, ∇ f ( X ) = ( XX T − I ) X + λ W T ( W X ) − . (4)The first order condition for a local minimum of f ( X ) follows immediately. Lemma 2
The first order condition of a (local) minimum of f is ( XX T − I ) X + λ W T ( W X ) − = . (5)Some interesting properties follows for any stationary point of f , i.e. , a matrix X satisfying (5). The first one is that the term λ k ( W X ) − k F and f can be represented interms of the singular values { σ i ( X ) } of X . Proposition 1
Let X be a stationary point of f and σ i = σ i ( X ) for simplicity. Then λ k ( W X ) − k F = ∑ i σ i ( − σ i ) , f ( X ) = ∑ i ( − σ i ) . (6) Proof
Since trace ( X T ( I − XX T ) X ) = trace ( X T X ( I − X T X )) = ∑ i σ i ( − σ i ) , λ k ( W X ) − k F = trace ( λ ( W X ) T ( W X ) − ) = trace ( X T ( I − XX T ) X ) = ∑ σ i ( − σ i ) . This is the first inequality in (6). The second one follows it with the definition of f and k XX T − I k F = ∑ i ( − σ i ) and ( − σ i ) + σ i ( − σ i ) = − σ i . (cid:3) Obviously, all σ i ( X ) = X is globally optimal. One can also expect σ i ( X ) ≈ X is just a locally optimal or stationary point since λ k ( W X ) − k F > ∑ σ i > σ i ( σ i − ) = ∑ σ i < σ i ( − σ i ) − λ k ( W X ) − k F . Practically, one can conclude form (5) that some σ i ( X ) = W X has partial non-negative columns because of the symmetry of λ ( W X ) T ( W X ) − = X T ( I − XX T ) X . If ( W X ) + is the nonnegative part of W X , the symmetry gives ( W X ) T + ( W X ) − = ( W X ) T − ( W X ) + . (7) Proposition 2
For a stationary point X of f , if W X has a nonnegative submatrixW X and the remaining W X does not, then there are at least r ( X ) singular values σ i ( X ) = , and ( W X ) T ( W X ) − = , ( W X ) T ( W X ) ≥ . Zhenyue Zhang, Bingjie Li
Proof
By (5), ( XX T − I r ) X =
0. Hence, XX T has at least r = r ( X ) eigenvaluesequal to one. It follows that X has at least r singular values equal to one. By (7), ( W X ) T + ( W X ) − = ( W X ) T − ( W X ) + = . It follows that ( W X ) T ( W X ) = ( W X ) T + ( W X ) = ( W X ) T + ( W X ) + ≥ (cid:3) The orthogonality between
W X and ( W X ) − implies that the rows of W X cor-responding to the nonzero rows of W X must be nonnegative. Hence, if W X has apositive column,
W X itself must be nonnegative. If
W X exists, W X must have zerorows, and we can partition W X as W X = (cid:20) W X W X W X (cid:21) within permutation, whereboth W X and W X are nonnegative and all the rows of W X are nonzero.A second order condition can guarantee a stationary point to be local optimal. Forour problem, it is not difficult to give the second order condition. Theorem 1
Assume that X is a stationary point of f , and t j is the indicator of nega-tive components of the column x j of X. If the symmetric matrix S = ( S i j ) , partitionedwith blocks S i j = (cid:26) k x i k I + x i x Ti + W T diag ( t i ) W + ( XX T − I ) , i = j ; ( x Ti x j ) I + x j x Ti , i = j , (8) is positive definite, or positive semidefinite with a null space spanned by vectorslinked by all columns of ∆ satisfying ∆ X T + X ∆ T = and the nonzero eigenvaluesof S are not smaller than k X k , then X is a local minimizer of f . The proof is given in Appendix B. Because f is not convex, local optimum hap-pens when we solve the problem (3) iteratively. Hence, it is important to checkwhether an iterative sequence goes to a global optimum or not. To address this issue,we will further discuss sufficient conditions for the global optimum of f , focusing ondescending sequences { f ( X k ) } that may be generated via some iteration algorithms,as the modified NCG that we will adopt to solve the problem (3).2.3 Global optimumWe fist give some equivalent conditions for the global optimum of f , based onProposition 1 and Proposition 2. Proposition 3
The following statements are equivalent for a stationary point X of f .(a) X ∈ P and X is of full row rank;(b) X ∈ Q , i.e., XX T = I r ;(c) X ∈ Q ∩ P , i.e., X is a global minimizer of f .Proof The proof is simple. In fact, if X ∈ P , i.e. , X = X in Proposition 2, then all thesingular values of X are equal to one since r ( X ) = r , which implies that XX T = I r , i.e. , X ∈ Q . Conversely, if X ∈ Q , we have ( W X ) − = i.e. , X ∈ P , from Proposition1 directly. (cid:3) xterior Point Method for Completely Positive Factorization 9 The full-rank condition on a stationary point X is satisfied generally. Practically,if f ( X ) < / X must be full rank by (6). If an algorithm generates the iterativesequence { X k } with descending { f ( X k ) } , starting with an row-orthonormal matrix, itis highly possible to have f ( X ) < / X of { X k } . Hence, X ∈ P is equivalent to X ∈ Q for an accumulative point X of the sequence generally.Proposition 3 also show that any stationary point X of f always locates the ex-terior of both Q and P except it is a solution of (3). In Section 3, we will give aniterative algorithm that yields a sequence of iterative points on the exterior of Q ∪ P unless it converges globally. This is why we call it as an exterior point method, quitedifferent from the alternative methods discussed in the previous subsection, whereiterative points alternatively drop into one of the subdomains.Let us consider those stationary points of f on the exterior of Q ∪ P . Each X ofthem has an indicator matrix sgn ( | ( W X ) − | ) . However, the number of different ones { T ℓ } is limit. Let { X ( ℓ ) } be the stationary points with sgn ( | ( W X ( ℓ ) ) − | ) = T ℓ , and let f ℓ ( X ) = k XX T − I k F + λ k ( W X ) ⊙ T ℓ k F . Each f ℓ has a finite number of different critical values since it is a polynomial [36]. Asshown in the proof of Theorem 1 given in Appendix B, each critical value of f is alsoa critical value of one of f ℓ ’s. Hence, f also has a finite number of critical values. Let f min be the smallest one of these nonzero critical values of f . The following theoremshows that a descending sequence { f ( X k ) } yields the optimum of f if there is an f ( X k ) < f min . Theorem 2
Let { f ( X k ) } be a descending sequence and lim k ∇ f ( X k ) = . If there isan f ( X k ) < f min , then lim k f ( X k ) = .Proof Obviously, { f ( X k ) } converges and { X k } is bounded. Let X ∞ be any accumu-lative point of { X k } and subsequence X k j → X ∞ . Then ∇ f ( X ∞ ) = lim ∇ f ( X k j ) = f ( X ∞ ) is a critical value of f . Since f ( X ∞ ) < f ( X k ) < f min , we conclude that f ( X ∞ ) = (cid:3) Theorem 2 shows that any accumulative point of { X k } is a global minimizer of f if there is an f ( X k ) < f min . It is theoretically meaningful, but the condition is notdetectable since f min is unknown. The following theorem gives sufficient conditionsfor the global optimum. Theorem 3
Let { f ( X k ) } be a descending sequence such that lim k ∇ f ( X k ) = . If lim k k X k k ≤ , f ( X k ) < α for a constant α ∈ ( , ) , and for sufficiently large k, λ k ( W X k ) − k F ≤ α k X k X Tk − I k F − k X k X Tk − I k F , (9) then any accumulative point X ∞ of { X k } is a globally optimal solution. Different stationary points may share a common indicator matrix.0 Zhenyue Zhang, Bingjie Li
Proof
Let X ∞ be any accumulative point of { X k } and X k j → X ∞ as in the proof ofTheorem 2. The condition lim k k X k k ≤ σ i ≤ k X ∞ k ≤ { σ i } of X ∞ . Below we show that all σ i = i.e. , f ( X ∞ ) = σ i <
1. We have that 0 < f ( X ∞ ) < α . We willgive a contradiction, based on the observation that there is a constant β > f ( X ∞ ) ≤ α ( β + ) ( β + ) , λ k ( W X ∞ ) − k F ≤ β k X ∞ X T ∞ − I k F . (10)To show (10), let a = k X ∞ X T ∞ − I k F and b = λ k ( W X ∞ ) − k F . Both a and b arepositive by Proposition 3 since f ( X ∞ ) >
0. It is easy to verify that the first one in (10)is equivalent to β ≤ c − c , where c = p − f ( X ∞ ) / α ∈ ( , ) . The second one isequivalent to ba ≤ β . Thus, the existence of a positive β is equivalent to ba ≤ c − c , orequivalently, ba + b ≤ c . Recalling that the condition (9) gives a + b ≤ α √ a . Hence,1 − c = f ( X ∞ ) α = a + b α ≤ a ( a + b )( a + b ) = − b ( a + b ) . That is, the inequality ba + b ≤ c holds. Therefore, there is a positive β satisfying (10).By Proposition 1 with X = X ∞ , the inequalities in (10) become ∑ σ i < ( − σ i ) ≤ α ( β + )( β + ) , ∑ σ i < σ i ( − σ i ) ≤ β ∑ σ i < ( − σ i ) . (11)However, the left inequality above implies that 1 − σ i < β + ( β + ) since α < i.e. , σ i ( β + ) > β for each σ i <
1. Thus, σ i > β ( − σ i ) and σ i ( − σ i ) > β ( − σ i ) ,which yields a contradiction to the right inequality in (11), completing the proof. (cid:3) We will show that the modified NCG given in the next section can yield a se-quence { X k } that guarantees the descent of { f ( X k ) } and lim k ∇ f ( X k ) =
0. Becauseof the descent, it is easy to have f ( X k ) < /
4. The inequality (9) is always satisfiedexcept a few of early X k ’s. Generally, k X k k > k >
1, and k X k k → ρ ≥ { X k } converges locally or globally. The difference is that ρ > ρ = f and the norm of its derivative ∇ f nearby a stationary point of f . This relation willbe used in our algorithm to check whether an iterative sequence tends to the globaloptimum or not. Theorem 4
If X is a stationary point of f , then for sufficiently small ∆ = ˜ X − X,f ( ˜ X ) = f ( X ) + h ∇ f ( ˜ X ) , ∆ i − h ∆∆ T X , ∆ i − k ∆∆ T k F . (12) Furthermore, if X is globally optimal, then f ( ˜ X ) < k ∇ f ( ˜ X ) k F k ∆ k F . xterior Point Method for Completely Positive Factorization 11 Proof
Consider an arbitrary ˜ X in the open neighborhood of X , N ( X ) = (cid:8) ˜ X : ( W ˜ X ) i j ( W X ) i j > ( W X ) i j = (cid:9) . We will show that k ˜ X ˜ X T − I k F = (cid:13)(cid:13) XX T − I (cid:13)(cid:13) F + h ( ˜ X ˜ X T − I ) ˜ X , ∆ i − h λ ( W X ) − , W ∆ i− h X , ∆∆ T ∆ i − k ∆∆ T k F ; (13) k ( W ˜ X ) − k F = k ( W X ) − k F + h ( W ˜ X ) − , W ∆ i + h ( W X ) − , W ∆ i . (14)Substituting these equalities into f ( ˜ X ) = k ˜ X ˜ X T − I k F + λ k ( W ˜ X ) − k F , and combin-ing h ∇ f ( ˜ X ) , ∆ i = h ( ˜ X ˜ X T − I ) ˜ X , ∆ i + h λ ( W ˜ X ) − , W ∆ i , we can get (12) immediately.We prove (13) below, based on ˜ X ˜ X T − I = XX T − I + X ∆ T + ∆ X T + ∆∆ T . Onone hand, from the equality, we have k ˜ X ˜ X T − I k F = (cid:13)(cid:13) XX T − I (cid:13)(cid:13) F + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F + (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T (cid:11) . On the other hand, we rewrite 2 h ( ˜ X ˜ X T − I ) ˜ X , ∆ i = h ˜ X ˜ X T − I , ˜ X ∆ T + ∆ ˜ X T i as2 (cid:10) ( ˜ X ˜ X T − I ) ˜ X , ∆ (cid:11) = (cid:10) XX T − I + X ∆ T + ∆ X T + ∆∆ T , X ∆ T + ∆ X T + ∆∆ T i = (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T i + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F + h XX T − I + X ∆ T + ∆ X T + ∆∆ T , ∆∆ T i = (cid:10) XX T − I , X ∆ T + ∆ X T + ∆∆ T i + (cid:13)(cid:13) X ∆ T + ∆ X T + ∆∆ T (cid:13)(cid:13) F − h XX T − I − ∆∆ T , ∆ X T i + k ∆∆ T k F . Since ( XX T − I ) X = − λ W T ( W X ) − by ∇ f ( X ) =
0, we also have that h XX T − I , ∆ X T i = −h λ ( W X ) − , W ∆ i . Hence, combining these equalities, we get (13).To show (14), let T − and T be the indicator matrices of the negative or zeroentries of W X , respectively. Since ˜ X ∈ N ( X ) , ( W ˜ X ) − = ( W ˜ X ) ⊙ ( T − + T ) = ( W X + W ∆ ) ⊙ ( T − + T ) = ( W X ) − + H ∆ where H ∆ = W ∆ ⊙ ( T − + T ) . Combining it with h ( W X ) − , H ∆ i = h ( W X ) − , W ∆ i , weget k ( W ˜ X ) − k F = k ( W X ) − k F + h ( W X ) − , W ∆ i + k H ∆ k F , h ( W ˜ X ) − , W ∆ i = h ( W X ) − , W ∆ i + h H ∆ , W ∆ i = h ( W X ) − , W ∆ i + k H ∆ k F . Hence (14) is also true, completing the proof of (12).For a globally optimal X , f ( X ) = T − =
0. We estimate f ( ˜ X ) in the twosubsets according to the sign of function g ( ∆ ) = h ∆ T X , ∆ T ∆ i + k ∆∆ T k F , N + = { ˜ X ∈ N ( X ) : g ( ˜ X − X ) ≥ } ; N − = { ˜ X ∈ N ( X ) : g ( ˜ X − X ) < } . For ˜ X ∈ N + , we have f ( ˜ X ) ≤ h ∇ f ( ˜ X ) , ∆ i ≤ k ∇ f ( ˜ X ) k F k ∆ k F directly.Consider ℓ ( ∆ ) = X ∆ T X + ∆ X T X + λ W T H ∆ for ˜ X = X + ∆ ∈ N − , and rewrite ∇ f ( ˜ X ) = ℓ ( ∆ ) + ( X ∆ T + ∆ X T ) ∆ + ∆∆ T ( X + ∆ ) We should have ℓ ( ∆ ) = X = X + ∆ ∈ N − different from X . Otherwise, ℓ ( ∆ ) = ∆ such that ˜ X = X + ∆ ∈ N − . It follows that0 = h ℓ ( ∆ ) , ∆ i = h X ∆ T X + ∆ X T X + λ W T H ∆ , ∆ i = h ∆ T X , X T ∆ i + k ∆ X T k F + λ k H ∆ k F . Since λ W T H ∆ = − ( X ∆ T + ∆ X T ) X by ℓ ( ∆ ) = k λ W T H ∆ k F = k X ∆ T + ∆ X T k F = k ∆ X T k F + h ∆ T X , X T ∆ i = − λ k H ∆ k F , which implies that ∆ T X + X T ∆ =
0. Thus, by g ( ∆ ) <
0, we get a contradiction0 ≤ k ∆∆ T k F < − h ∆ T X , ∆ T ∆ i = −h ∆ T X + X T ∆ , ∆ T ∆ i = . Therefore, ℓ ( ∆ ) is non-singular. Since it is piece-wisely linear, there is a positiveconstant c such that k ℓ ( ∆ ) k ≥ c k ∆ k F for ˜ X ∈ N − . Thus, for k ∆ k F ≤ min { , c / } , k ℓ ( ∆ ) k ≥ k ∆ k F , and by definition, k ∇ f ( ˜ X ) k F ≥ c k ∆ k F − k ∆ k F ≥ k ∆ k F . Hence, f ( ˜ X ) ≤ ( k ∇ f ( ˜ X ) k F + k ∆ k F ) k ∆ k F ≤ k ∇ f ( ˜ X ) k F k ∆ k F . The theorem is then proven. (cid:3)
Concluding from Theorem 4, we get thatlim ˜ X → X k ∇ f ( ˜ X ) k F f ( ˜ X ) = (cid:26) , if X is not globally optimal; + ∞ , if X is globally optimal . (15)In the next section, we will consider a nonlinear conjugated gradient method forsolving the minimization problem (3). The sufficient conditions mentioned abovewill be taken into account as much as possible. The NCG method [19] is commonly used for solving nonlinear smooth optimiza-tion problems. Generally, its convergence requires the objective function to be twicecontinuously differentiable [1,35], or be differentiable and its gradient is Lipschitzcontinuous [35] if the conjugated gradient direction is suitably updated. Unfortu-nately, in our case, the function f is continuously differentiable only.In this section we briefly describe the NCG and conditions of its convergence atfirst. To guarantee the convergence when the NCG is applied to the exterior pointmodel (3), we propose two modifications for the NCG. One is a new approach forupdating the conjugate gradient in the NCG, and the other one is a simpler rule forsetting an inexact line search for updating the iteration point. These modifications cannot only guarantee the convergence for continuously differentiable functions withoutother conditions, but also improve the efficiency of NCG. Hence, the modified NCGworks on our exterior point problem (3). xterior Point Method for Completely Positive Factorization 13 φ ( x ) via the two classical steps, starting at d = − g , with g = ∇φ ( x ) on aninitial point x : Modify the current point x k to x k + = x k + α k d k (16)along the direction d k with a suitable step length α k . Then, update the conjugatedirection d k to d k + = − g k + + β k d k (17)with the gradient g k + = ∇φ ( x k + ) at the updated point x k + and a suitable value β k .The weak convergence lim g k = x k to the minimizerof φ depends on the smoothness of φ and the choices of { α k } and { β k } . The ideal α k is the minimizer of φ ( x k + α d k ) with respect to α , which is called as exact linesearch. Inexact search is commonly suggested but α k should satisfy the weak Wolfeconditions [33] φ ( x k + α d k ) ≤ φ ( x k ) + ρα h g k , d k i (18) h ∇φ ( x k + α d k ) , d k i ≥ σ h g k , d k i (19)with two positive parameters ρ < σ <
1, or the strong Wolfe conditions (18) and |h ∇φ ( x k + α d k ) , d k i| ≤ − σ h g k , d k i . (20)Generally, the inexact line search α k can be obtained via bisection [31] or interpola-tion [35], or combination of the two approaches [20]. About the step choice for β k ,there are three commonly used approaches in the literature [19,34,40]: β FR k = k g k + k k g k k , (21) β PRP k = h g k + , g k + − g k ik g k k , (22) β MPRP k = β PRP k − min (cid:8) β PRP k , ν k g k + − g k k k g k k h g k + , d k i (cid:9) , ν > . (23)Sun and Yuan have shown in [35] that the strong convergence of the NCG withthe exact line search and { β PRP k } is true if φ is twice continuously differentiable anduniformly convex, and the level set { x : φ ( x ) ≤ φ ( x ) } is bounded. However, thestrong convergence is not guaranteed if the exact search is relaxed to the inexact one,even if it satisfies the strong Wolfe conditions. The weak convergence is guaranteedfor the NCG with { β FR k } and inexact { α k } satisfying the strong Wolfe conditionswith 0 < ρ < σ < , under the same conditions on φ without uniformly convexity, orthe gradient of φ is Lipschitz continuous, a stricter condition than the continuouslydifferentiable φ . For the modified PRP, MPRP, if the weak Wolfe conditions are sat-isfied by { α k } and { α k } has a positive lower bounded, the weak convergence can be slightly improved to g k → φ as for FR [40]. It wasreported in the literature that the PRP is more efficient than FR in applications al-though its stronger conditions may not be satisfied. MPRP performs better than PRPgenerally if the parameter ν is suitably set.3.2 Modifications for the NCG Modified step β k . Practically, the MPRP adopts a restart strategy: reset β k = i.e. , d k + = − g k + , when h g k + , g k + − g k − ν k g k + − g k k k g k k d k i ≤
0. It guarantees h d k + , g k + i ≤ − ( − ν ) k g k + k , a sufficient condition for the descent of φ ( x k + ) by (18). However, this descent con-dition is not sufficient for the convergence of { g k } . An additional condition about thepositive lower-bound of { α k } is required [40].To guarantee the same convergence as MPRP for continuously differentiable φ ( x ) without the lower-bound condition on { α k } , we further modify MPRP as that β k = min n h g k + , g k + − g k − ν k g k + − g k k k g k k d k i + k g k k , κ k g k + k k d k k o . (24)where ν > as in (23) and κ >
0. The modification can guarantee a stronger sufficientdescent condition h d k + , g k + i ≤ − µ k d k + k k g k + k . We will show it in the nextsubsection. The stronger sufficient descent condition with µ ∈ ( , ) was given in[35] for the convergence g k → φ is uniformly continuous in the level set { x : φ ( x ) ≤ φ ( x ) } . We willfurther show that the uniformly continuous condition is also not necessary for theconvergence. Simple approach for inexact line search.
For continuously differentiable φ ( x ) ,the interpolation method does not guarantee the capture of required weak line search α k since it asks for a thrice times continuously differentiable and unimodal φ ( x ) [35]. One can get α k by the combination method [20] that is more efficient than thebisection approach [31]. In [20], the bisection is combined with the interpolation in abit complicated way for interval shrinking. Here we give a simpler and more efficientapproach for determining α k satisfying the weak Wolfe conditions.Theoretically, at a current point x = x k with the conjugate direction d = d k , therequired inexact line search α = α k satisfying the weak Wolfe conditions (18-19) canbe chosen as α ∗ = sup (cid:8) ˆ α : the Wolfe condition (18) holds over ( , ˆ α ) (cid:9) . (25) We always assume that both g k and d k are always nonzero. NCG terminates if g k = d k = It exists, is positive, and satisfies (18-19). To verify this claim, let’s consider thefunction g ( α ) = φ ( x ) + ρα h ∇φ ( x ) , d i − φ ( x + α d ) . Clearly, (18) is equivalent to g ( α ) ≥
0, and meanwhile, (19) holds if g ′ ( α ) ≤
0. By thedefinition and the continuousness of φ , (18) is true for 0 < α ≤ α ∗ . The supremum in(25) implies that g ( α ∗ ) = g ′ ( α ∗ ) ≤
0. Hence, (19) is also satisfied for α = α ∗ .Practically, there is a relative large sub-interval of ( , α ∗ ] in which both (18) and (19)are true. For instance, if ˆ α ∈ ( , α ∗ ] is the largest point such that g ( α ) is a localmaximum, then g ′ ( α ) ≤ [ ˆ α , α ∗ ] . Therefore, (18-19) hold for α ∈ [ ˆ α , α ∗ ] .An ideal choice of α is the minimizer α min of φ ( x + α d ) over ( , α ∗ ] since it de-creases φ as small as possible, while both (18) and (19) are still satisfied. In thissubsection, we give a simple rule for pursuing α min via a quadratic interpolationto φ ( x + α d ) , assuming φ is continuously differentiable. It generates a nested andshrunk interval sequence containing the required α . The pursuing terminates as soonas a point satisfying (18-19) is found.Initially, we set α ′ = α ′′ > α ′′ will be given later. Startingwith [ α ′ , α ′′ ] , we generate a sequence of intervals [ α ′ , α ′′ ] iteratively such that each α ′ ℓ satisfies (18) but α ′′ ℓ does not, and meanwhile, α ′ ℓ doesn’t satisfy (19). That is, for x ′ ℓ = x + α ′ ℓ d and x ′′ ℓ = x + α ′′ ℓ d φ ( x ′ ℓ ) ≤ φ ( x ) + ρα ′ ℓ h g , d i , φ ( x ′′ ℓ ) > φ ( x ) + ρα ′′ ℓ h g , d i , h ∇φ ( x ′ ℓ ) , d i < σ h g , d i , (26)where g = ∇φ ( x ) . The third inequality above implies that h ∇φ ( x ′ ℓ ) , d i <
0. Further-more, by the first two inequalities in (26), we have that φ ( x ′′ ℓ ) > φ ( x ′ ℓ ) + ρ ( α ′′ ℓ − α ′ ℓ ) h g , d i > φ ( x ′ ℓ ) + ρσ ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i . (27)In the current interval, we consider a quadratic function q ( α ) with interpolation con-ditions q ( α ′ ℓ ) = φ ( x ′ ℓ ) , q ′ ( α ′ ℓ ) = h ∇φ ( x ′ ℓ ) , d i , q ( α ′′ ℓ ) = φ ( x ′′ ℓ ) , It can be represented as q ( α ) = φ ( x ′ ℓ ) + ( α − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i + (cid:0) φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i (cid:1) ( α − α ′ ℓ ) ( α ′′ ℓ − α ′ ℓ ) with the minimizer c ℓ = argmin α q ( α ) given by c ℓ = α ′ ℓ + α ′′ ℓ − α ′ ℓ − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , d i > α ′ ℓ . (28)By the Mean-Value Theorem for derivatives and the second inequality in (27),0 < ( − M ℓ ) − ≤ (cid:16) − h ∇φ ( ¯ x ℓ ) , d ih ∇φ ( x ′ ℓ ) , D k i (cid:17) − = − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , D k i φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) − ( α ′′ ℓ − α ′ ℓ ) h ∇φ ( x ′ ℓ ) , D k i < σσ − ρ , (29) where ¯ x ℓ = x + ¯ α ℓ d with ¯ α ℓ ∈ [ α ′ ℓ , α ′′ ℓ ] and M ℓ = min α ∈ [ α ′ ℓ , α ′′ ℓ ] h ∇φ ( x + α d ) , d ih ∇φ ( x ′ ℓ ) , d i ≤ h ∇φ ( ¯ x ℓ ) , d ih ∇φ ( x ′ ℓ ) , d i < ρσ . Hence, if 0 < ρ < σ , we have that α ′ ℓ < α ′ ℓ + ( − M ℓ ) ( α ′′ ℓ − α ′ ℓ ) < c ℓ < α ′ ℓ + σ ( σ − ρ ) ( α ′′ ℓ − α ′ ℓ ) < α ′′ ℓ . (30)We may shrink [ α ′ ℓ , α ′′ ℓ ] to [ c ℓ , α ′′ ℓ ] or [ α ′ ℓ , c ℓ ] , if α = c ℓ satisfies (18) or does not.However, if (18) is satisfied, the interval length is α ′′ ℓ − c ℓ ≤ − M ℓ − M ℓ ( α ′′ ℓ − α ′ ℓ ) . When M ℓ < | M ℓ | is large, − M ℓ − M ℓ ≈
1. The interval shrinking is inefficient in this case.To avoid this phenomenon, we slightly modify c ℓ as that with η = σ ( σ − ρ ) ˜ c ℓ = max (cid:8) c ℓ , ηα ′ ℓ + ( − η ) α ′′ ℓ (cid:9) ∈ ( α ′ ℓ , α ′′ ℓ ) . (31)Since ˜ c ℓ ≥ ηα ′ ℓ + ( − η ) α ′′ ℓ and c ℓ < α ′ ℓ + η ( α ′′ ℓ − α ′ ℓ ) by (30), we get α ′′ ℓ − ˜ c ℓ ≤ η ( α ′′ ℓ − α ′ ℓ ) , ˜ c ℓ − α ′ ℓ ≤ max (cid:8) η , − η (cid:9) ( α ′′ ℓ − α ′ ℓ ) = η ( α ′′ ℓ − α ′ ℓ ) . The last equality holds since η > /
2. Hence, if the Wolfe conditions (18-19) aresatisfied for α = ˜ c ℓ , we get the required α k = ˜ c ℓ . Otherwise, shrink [ α ′ ℓ , α ′′ ℓ ] as [ α ′ ℓ + , α ′′ ℓ + ] = (cid:26) [ α ′ ℓ , ˜ c ℓ ] , if (18) does not hold for α = ˜ c ℓ ; [ ˜ c ℓ , α ′′ ℓ ] , otherwise . (32)The interval length is significantly decreased as 0 < α ′′ ℓ + − α ′ ℓ + ≤ η ( α ′′ ℓ − α ′ ℓ ) , where η < ρ < σ . Hence, α ′′ ℓ − α ′ ℓ → ℓ → ∞ . Lemma 3 If φ is lower bounded and continuously differentiable, an α = ˜ c ℓ ∗ satisfy-ing (18-19) can be obtained within a finite iterations of (32) if < ρ < σ < .Proof If (18-19) do not hold for all ˜ c ℓ , the updating rule (32) yields a sequence ofnested intervals { [ α ′ ℓ , α ′′ ℓ ] } . Since 0 < ρ < σ <
1, the intervals tend to a singlepoint α ∗ and both { x ′ ℓ } and { x ′′ ℓ } tend to x ∗ = x + α ∗ d . Hence, by (27) and the Taylorextension of φ ( x + α d ) at α = α ∗ , we get h ∇φ ( x ∗ ) , d i = lim ℓ → ∞ φ ( x ′′ ℓ ) − φ ( x ′ ℓ ) α ′′ ℓ − α ′ ℓ ≥ ρ h ∇φ ( x ) , d i > σ h ∇φ ( x ) , d i (33)since h ∇φ ( x ) , d i < ρ < σ . However, by (26), h ∇φ ( x ∗ ) , d i ≤ σ h ∇φ ( x ) , d i , acontradiction with (33). (cid:3) A good choice of α ′′ helps to pursue the minimizer α min . Motivated by the aboveanalysis on the estimation of the shrinking rate η ℓ , we suggest the experiential setting α ′′ = min (cid:8) α = p η : (18) is not satisfied for α = p η with integer p ≥ (cid:9) . (34)Starting with the initial setting, the interval updating (32) converges quickly. For in-stance, we set ρ = . τ = .
4, the interval iteration terminates within one or twoiterations generally in our experiments. Algorithm 1 gives the details of the procedurefor determining an inexact line search α k , given x k , φ k , g k , the conjugate direction d k . xterior Point Method for Completely Positive Factorization 17 Algorithm 1
An inexact line search satisfying the weak Wolfe conditions
Require: point x , φ = φ ( x ) , g = ∇φ ( x ) , direction d , and parameters σ , ρ , ε Ensure: α satisfying (18-19) within accuracy ε , x : = x + α d , φ ( x ) , and g = ∇φ ( x ) . Set α ′ = x ′ = x , φ ′ = φ , g ′ = g , ν = ρ h g , d i . Find the smallest integer p ≥ α = η p , and set α ′′ = η p . Repeat the following iteration until α ′′ − α ′ < ε . Compute c as (28), ˜ c as (31), and ˜ φ = φ ( ˜ x ) at ˜ x = x + ˜ cd . If ˜ φ > φ + ˜ c ν , update ( α ′′ , φ ( x ′′ )) by ( ˜ c , φ ( ˜ x )) and go to Step 3. Compute ˜ g = ∇φ ( ˜ x ) . If h ˜ g , d i ≥ σ h g , d i , set x = ˜ x , φ = ˜ φ , g = ˜ g , and terminate. Otherwise, update α ′ , φ ′ , g ′ by ˜ c , ˜ φ , ˜ g , respectively. End iteration3.3 Convergence of the modified NCGWe have two results for the convergence.
Lemma 4
Let β k be defined by (24) and µ = ν − ν ( + κ ) . Then h d k + , g k + i ≤ − µ k d k + k k g k + k . (35) Proof
We rewrite β k = ρ k ˜ β k and d k + = ρ k ˜ d k + + ( ρ k − ) g k + , where ρ k ∈ [ , ] , ˜ β k is the step in (23), and ˜ d k + = − g k + + ˜ β k d k . Since h ˜ d k + , g k + i ≤ ( ν − ) k g k + k , h d k + , g k + i = ρ k h ˜ d k + , g k + i + ( ρ k − ) k g k + k ≤ (cid:0) ρ k ( ν − ) + ( ρ k − ) (cid:1) k g k + k ≤ − ν ν k g k + k . By β k ≤ κ k g k + k k d k k , we also have that k d k + k ≤ k g k + k + β k k d k k ≤ ( + κ ) k g k + k .The inequality (35) follows since k g k + k ≥ + κ k d k + k k g k + k and − ν ν < (cid:3) Combining (18), the inequality (35) guarantees the descent of { φ ( x k ) } . The NCGwith the inexact line search discussed in the previous subsection and the modified step β k given in (24) is convergent if φ is continuously differentiable and lower bounded.The convergence analysis is slightly different from that for the PRP step in [35]. Theorem 5
Starting with an arbitrary x , the NCG with inexact line search { α k } sat-isfying the weak Wolfe condition (18-19) and steps { β k } defined in (24) is convergentin the sense that { φ ( x k ) } is monotone decreasing and converges and that ∇φ ( x k ) → .Proof We assume g k = ∇φ ( x k ) = k without loss of generalities, and let s k = α k d k . By Lemma 4, h g k , s k i ≤ − µ k g k kk s k k ≤
0. The Wolfe condition (18) gives φ ( x k + ) − φ ( x k ) ≤ ρ h g k , s k i ≤ − ρµ k g k kk s k k ≤ . Algorithm 2
Modified nonlinear conjugated gradient (MNCG) method
Require: initial point x , parameters ε , ε , ρ , σ , ν , κ , λ , and k NCGmax . Ensure: an approximate solution x ∗ of min x φ ( x ) with the given accuracy Compute φ = φ ( x ) , g = ∇φ ( x ) , and set d = − g . For k = , · · · , k NCGmax , Save φ old = φ , and update ( x , φ , g ) by Algorithm 1 with searching direct d . If k g k F < ε and φ old − φ < ε , then set x ∗ = x and terminate the iteration. Otherwise, compute β by (24) and update d : = − g + β d . End forIt means that { φ ( x k ) } is monotone decreasing. Hence, it is convergent since φ itselfis lower bounded, which also implies that k g k kk s k k → k g k k →
0. Otherwise, there is a subsequence {k g k i k } witha positive lower bound. Correspondently, k g k i kk s k i k → k s k i k →
0. Bythe Taylor extensions φ ( x k i + ) = φ ( x k i ) + h g k i , s k i i + o (cid:0) k s k i k (cid:1) , φ ( x k i ) = φ ( x k i + ) − h g k i + , s k i i + o (cid:0) k s k i k (cid:1) , and the Wolfe condition (19) that gives h g k i + , s k i i ≥ σ h g k i , s k i i , we have that o ( k s k i k ) = h g k i + , s k i i − h g k i , s k i i ≥ ( σ − ) h g k i , s k i i . Hence, 1 − σ = o ( k s ki k ) −h g ki , s ki i ≤ o ( k s ki k ) µ k g ki kk s ki k → {k g k i k} has a positive lower bound,which implies σ ≤
1, a contradiction with σ < (cid:3) Algorithm 2 gives the details of NCG for minimizing a nonlinear function φ ( x ) .We will use it to solve the exterior point model (3). The algorithm performs verywell in our tests. For instance, applying on a symmetric factorization A = WW T of acompletely positive matrix of order 20000 with cp-rank 10, the algorithm can get aCPF with accuracy 10 − within 150 iterations and 3 seconds, starting at the identitymatrix of order r . As a comparison, using the same initial point, the alternative pro-jection method given in [24] gives an approximate CPF in the accuracy 10 − , whichcosts more than 450000 iterations and more than 1500 seconds.3.4 PostprocessingGenerally, a solution X of (3) solve by the modified NCG is not exactly row-orthonormal since the algorithm terminates within a limit accuracy. We can get anapproximate CPF with a nonnegative factor ˜ B = ( W X ) + truncated from W X .To improve the accuracy of the approximate CPF, we suggest postprocessing onthe solution X . That is, find a row-orthonormal matrix Q ∈ R r × r + nearest to X at first,and then truncate W Q to be a nonnegative B + = ( W Q ) + . This Q can be a solutionto the Procrustes problem min QQ T = I k X − Q k F . That is, Q = UV T when we have the xterior Point Method for Completely Positive Factorization 19 singular value decomposition X = U Σ V T of X , where U is an orthogonal matrix oforder r and V ∈ R r × r + is column-orthonormal. The following estimation gives insightinto the improvement.Let N = ( W X ) − for simplicity, then ( W X ) + = W X − N , and k A − ( WX ) + ( W X ) T + k F = k A − ( WX )( W X ) T + ( W X ) N T + N ( W X ) T − NN T k F ≤ k A − ( WX )( W X ) T k F + ( k W X k + k N k ∞ ) k N k F , where k N k ∞ is the largest absolute entry of N . When X = Q , it is simplified as k A − ( WQ ) + ( W Q ) T + k F ≤ (cid:0) p k A k + k N k ∞ (cid:1) k N k F . This postprocessing may slightly increase the negative component k N k F , but it van-ishes the term k A − ( W X )( W X ) T k F , and yields a significant decreasing of the ap-proximate error eventually. In our experiments, we always adopt the postprocessingand take the orthogonal projection Q of a solution X as an eventual output. The CPF was thought to be NP-hard in [21] without proofs, even if the columnnumber of a nonnegative factor is relaxed to be larger than the cp-rank. That is, oneis not able to get an algorithm to compute such a CPF for all completely positivematrices within polynomial time of the matrix order. However, it may be possible toget a good factorization with high accuracy for some completely positive matriceswithin acceptable time. It is tricky that we know less about what kind of completelypositive matrices whose CPF is easy or hard to obtain.In this section, we will explore some potential issues that may influence the CPFnumerically, implemented by our exterior point method using the modified NCG thatis given in the previous section. We focus on the three issues on the truly existednonnegative factor B of a completely positive matrix A : the distribution of its columnnorms, its sparsity, and its approximately rank deficiency. It is not clear whether afixed A has multiple CPFs whose nonnegative factors B have quite different propertieson these three issues. However, we do not find evident differences in our experimentswhen A has a spare nonnegative factor B .Four kinds of distributions of the column norms of B are considered: constant,linear, convex, or concave. 
In each set of those B ’s with the same kind of the col-umn distribution, we also consider the influences of the column number (the cp-rankof A ), sparsity, and approximate rank deficiency of B to the CPF. Synthetic com-pletely positive matrices are randomly constructed with these properties. Because ofthe construction, we always have that r cp ( A ) = r ( A ) for these matrices. Hence, we set r + = r ( A ) . For simplicity, we also fix the order of these synthetic matrices as n = We say A = BB T is a weak CPF later if the column number of the nonnegative B is larger than thecp-rank r cp of A , distinguishing it from the strict CPF whose factor has r cp columns. It is more likely for A with a dense nonnegative factor to have multiple CPFs.0 Zhenyue Zhang, Bingjie Li r b min − − − − − − − − − − − − − − − − constant 100 45 98 100linear 90 99 99 99 98 100 100 100 100 100 100 100 100 100 100 100convex 87 97 97 98 99 100 100 100 100 100 100 100 100 100 100 100concave 92 100 100 100 97 99 98 99 99 100 100 100 100 100 100 100 Table 1
Percentage of CPF by EPM achieving accuracy ε = − , dense factors, and n = A few completely positive matrices with cp-rank larger than rank reported in the lit-erature and the synthetic completely positive matrices in a larger scale ( n = W from the symmet-ric factorization A = WW T to ˜ W = D − W with D = diag ( k w k , · · · , k w n k ) beforeits CPF. The postprocessing discussed in the last section is also adopted. That is, weuse the orthogonal projection Q as the output and set ˜ B = ( W Q ) + as an approximatenonnegative factor of A . We measure the factorization accuracy by the relative errorError ( ˜ A ) = k A − ˜ B ˜ B T k F k A k F . (36)4.1 Column distributionCompletely positive matrices in the form A = BB T can be easily constructed byrandomly choosing a nonnegative factor B with a given number of columns. Gener-ally, the cp-rank of such a matrix A is also equal to its rank. We consider four sets of A ’s with the different distributions of column norm sequence { b i } of B : One is thatwith constant b i = i , and the others have the same form as b i = − ( − b min ) t i − t t r − t ∈ [ b min , ] , i = , · · · , r , (37)where r is the number of columns, t i = i d and d = − − , 1, and 2, respectively.The different values of d determine the different sharp of { b i } : convex ( d = − − ),linear ( d = d = b min determines how small someof columns of B can be in these three types.The four types of B ’s are constructed as follows. We first choose ˆ B = ( ˆ b i j ) or order n × r with entries uniformly distributed in the interval ( , ) , and then normalize eachcolumn of ˆ B to B = ( b i j ) that has a given column norm sequence { b i } . That is, theentries are b i j = ˆ b i j (cid:0) b j / q ∑ k ˆ b k j (cid:1) .Table 1 lists the percentages of tested matrices whose CPF can be successfullyobtained by our exterior point method (EPM) with the accuracy Error ( ˜ A ) < ε = − among 1000 repeats in each type of column norm distribution with fixed b min and r . For constant { b i } , the percentages are that among 4000 tests.xterior Point Method for Completely Positive Factorization 21 Type constant linear, b min = . s r Table 2
Percentage of CPF achieving accuracy ε = − depending on s = s ( B ) and r = r cp ( A ) ( n = The parameters are also fixed for all the tests as λ = . , ρ = . , σ = . , ν = , κ = , ε = − , ε = − . (38)In this experiment, starting at a randomly chosen orthogonal matrix, our algorithmworks very well for almost all the cases, except the constant case with r =
10. Thereare only about 45% matrices of rank r =
10 whose factorization errors can achieve theaccuracy 10 − and about 53% matrices have factorization errors larger than 10 − .Similar phenomenon also occurs when the matrix size n is larger. The CPF is not dif-ficult for matrices with a dense factor B whose column norms are distributed linearly,convexly, or concavely, except in a spacial case shown in the following phenomenon.Phenomenon A. For a dense and full rank B ≥ with approximately equal columnnorms, if it has a special column number depending on its row number, the CPF ofA = BB T is relatively difficult than others. It is a bit puzzling. A similar phenomenon also occurs when B is sparse. We willshow it in the next subsection.4.2 SparsityBesides the distribution of the column norms, the sparsity of factor B is anotherissue that may affect the CPF. To show this phenomenon, we modify the construc-tion in the previous subsection by vanishing partial smallest entries of ˆ B with a givensparsity in percentage, without modifications on the normalization for having a spe-cial column distribution. Generally, the sparsity strategy does not change the rank andcp-rank if the sparsity is not large. However it affects the difficulty of CPF. If the sparsity is large, some rows of ˆ B may be zero. These rows should be deleted and the size n ofthe resulting A is slightly reduced.2 Zhenyue Zhang, Bingjie Li Type convex concave b min − − − − s r Table 3
Table 3  Dependence of successful CPF on approximate cp-rank deficiency: percentages of successful factorizations for the convex and concave column-norm distributions with varying $b_{\min}$, sparsity $s$, and rank $r$.

Table 2 lists the percentages of tested matrices whose CPF errors are smaller than $10^{-12}$ among 100 repeats with the constant or linear distribution of the column norms (the results for the concave and convex distributions are given in Table 3). It shows that the hardness of the CPF does not depend monotonically on the rank/cp-rank or on the sparsity of a nonnegative factor $B$. Very interestingly, the hardness is tightly related to a special rank-sparsity boundary.

Phenomenon B. For each column-norm distribution, there is a special rank-sparsity boundary such that the closer the rank-sparsity pair of $B$ is to this boundary, the more difficult the CPF of $A = BB^T$ is.

A small $b_{\min}$ yields an approximately deficient cp-rank of the completely positive matrices $A$ tested in the previous subsections. This is because we can rewrite $A = \sum_{i=1}^{r} d_i u_i u_i^T$ with nonnegative unit vectors $\{u_i\}$ and $d_r = b_{\min}^2$; for instance, if $b_{\min} = 10^{-2}$, then $d_r = 10^{-4}$. Clearly, deleting the last column of $B$ only slightly modifies $A$ to a completely positive matrix $\hat A = \sum_{i=1}^{r-1} d_i u_i u_i^T$ that is very close to $A$, with $\|A - \hat A\|_F = d_r$, but $\hat A$ has a smaller cp-rank. The approximate deficiency does not affect the CPF very much if the nonnegative factor $B$ is dense, as shown in Table 1. However, as the sparsity increases, the influence of the approximate cp-rank deficiency on the CPF becomes more and more evident.

Table 3 illustrates the phenomenon. We test two smaller values of the parameter $b_{\min}$ in the construction of completely positive matrices with the concave or convex distribution of column norms. As $b_{\min}$ decreases, the percentage of successful CPFs drops evidently when the sparsity of $B$ is not negligible, especially near the rank-sparsity boundary mentioned in Phenomenon B. We summarize this as

Phenomenon C. Approximate cp-rank deficiency significantly aggravates the CPF near the special rank-sparsity boundary of Phenomenon B.

Fig. 1  Weak CPF errors of 1000 tests on each of four matrices without approximate cp-rank deficiency: a dense constant-type factor ($r = 10$, $s(B) = 0\%$) and sparse factors of constant type ($r = 12$, $s(B) = 10\%$), convex type, and concave type (both $r = 12$, $s(B) = 7\%$), for several relaxed column numbers $r_+$.
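The rank-one expansion behind this explanation is easy to check numerically; a minimal sketch (the function name is ours):

import numpy as np

def deficiency_gap(B):
    """Distance from A = B B^T to the completely positive matrix obtained by
    deleting the column of B with the smallest norm; it equals b_min^2."""
    norms = np.linalg.norm(B, axis=0)
    j = int(np.argmin(norms))
    B_hat = np.delete(B, j, axis=1)
    gap = np.linalg.norm(B @ B.T - B_hat @ B_hat.T)     # ||A - A_hat||_F
    assert np.isclose(gap, norms[j] ** 2)               # d_r = b_min^2
    return gap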
One explanation is that many rows of the sparse nonnegative factor lie on the boundary of the nonnegative cone $\mathbb{R}^{r_{cp}}_+$, and it is hard to pursue these rows accurately.

In this section, we consider two strategies to increase the possibility of getting a CPF with acceptable accuracy. One is the relaxation of the cp-rank, in which the column number of the factor $B$ is relaxed from the exact cp-rank $r_{cp}$ to a larger integer $r_+$. Our algorithm works in this case without any modification, simply starting at an initial matrix of order $r \times r_+$, for example an arbitrarily chosen row-orthonormal matrix. The second is a fast restart strategy, for which we will give a new stopping criterion based on Theorem 4 to reduce the computational cost.

5.1 Weak CPF

By a weak CPF, we mean a symmetric factorization $A = \tilde B \tilde B^T$ with a nonnegative factor $\tilde B$ whose relaxed column number $r_+$ is larger than the cp-rank of $A$, as distinguished from the strict CPF with exactly $r_{cp}$ columns in its nonnegative factor. Relative to the strict factorization, the weak problem is a bit easier if $A$ is not approximately cp-rank deficient. A rough but reliable explanation is that increasing the column number enlarges the solution set, and most rows of a solution $\tilde B$ are far from the boundary of the nonnegative cone in a higher-dimensional space. In the relatively dense case, the percentage of successful factorizations even increases slightly as $b_{\min}$ becomes smaller; this will be explained in the next section.
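To make the relaxation concrete, the following minimal sketch evaluates the objective being minimized and its gradient, consistent with the expressions in Appendix B; the function name is ours, and relaxing $r_{cp}$ to $r_+$ only widens $X$:

import numpy as np

def f_and_grad(X, W, lam):
    """f(X) = ||X X^T - I||_F^2 + lam ||(W X)_-||_F^2 for X in R^{r x r_+},
    together with its gradient; cf. the formulas in Appendix B."""
    r = X.shape[0]
    E = X @ X.T - np.eye(r)              # orthonormality residual
    neg = np.minimum(W @ X, 0.0)         # negative part (W X)_-
    f = np.sum(E * E) + lam * np.sum(neg * neg)
    grad = 4.0 * E @ X + 2.0 * lam * (W.T @ neg)
    return f, grad

At a global minimizer $f(X) = 0$, so $XX^T = I_r$ and $WX \ge 0$; then $\tilde B = WX$ satisfies $\tilde B\tilde B^T = WXX^TW^T = WW^T = A$, a weak CPF when $r_+ > r_{cp}$.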
Fig. 2  Weak CPF errors of 100 tests on two matrices with an approximately deficient cp-rank (small $b_{\min}$), with dense $B$ (left) or sparse $B$ (middle). The right panel illustrates the improvement of the approximate CPF, via column shrinking, on a matrix with sparse $B$ like the one shown in the middle panel.

The relaxation strategy can significantly decrease the hardness of the strict CPF on matrices without cp-rank deficiency, whether $B$ is dense or not. Figure 1 illustrates this on four matrices whose strict CPFs are difficult, as shown in Tables 1 and 2. One of the matrices has a dense factor ($r = 10$) with constant-type column norms, and the other three have sparse factors ($r = 12$, with $s(B) = 10\%$ or $7\%$). In the dense case, due to the increased column number $r_+$, the improvement is significant in the sense that a weak CPF with higher accuracy becomes much more likely. In the sparse case, the algorithm cannot obtain a strict CPF with accuracy smaller than $10^{-12}$ in these repeated tests; however, if we increase the column number from $r_{cp} = 12$ to $r_+ = 18$ (or $r_+ = 24$ for the constant type), the factorization accuracy of the weak CPF reaches $10^{-12}$ in about 900 of the 1000 tests.

However, the improvement of the weak CPF is limited when the matrix has an approximately deficient cp-rank. As with the strict CPF, the weak CPF still works well if the factor $B$ is dense; we illustrate the performance in the left panel of Figure 2. The improvement exists but is very slight, which partially explains the slightly higher percentages listed in the first row of Table 3 corresponding to the smaller $b_{\min}$. The relaxation strategy may lose its efficiency when $A$ is approximately cp-rank deficient and $B$ is sparse, as illustrated in the middle panel of Figure 2.

To partially address the difficulty when $A$ is approximately cp-rank deficient and has a sparse factor $B$, we may use the CPF of its lower-rank modification $\tilde A = \tilde W\tilde W^T$ as an approximate CPF of the original $A$ if $\tilde A$ is also nonnegative, where $\tilde W$ is $W$ with its small columns deleted. That is, we apply the algorithm to $\tilde W$ with a column number $r_-$ smaller than $r_{cp}$. The right panel of Figure 2 shows the improvement when this column-shrinking strategy is adopted: in more than 40% of the tests, the factorization error is reduced. However, because of the perturbation in the input $\tilde W$, the factorization error with respect to the original $A$ cannot, in theory, be smaller than $b_{\min}^2$ in scale.

Algorithm 3
Restart exterior point method (REPM) for the CPF
Require: $W$, $r_{cp}$, $\rho$, $\sigma$, $\nu$, $\kappa$, $\lambda$, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_f$, $\varepsilon_{df}$, $k_{\mathrm{NCGmax}}$, and $k_{\mathrm{REPMtotal}}$.
Ensure: a row-orthonormal matrix $Q$ of order $r \times r_{cp}$.
1. Arbitrarily choose a row-orthonormal $X \in \mathbb{R}^{r \times r_{cp}}$, and set $k = 0$.
2. Starting from $X$, run Algorithm 2 for at most $k_{\mathrm{NCGmax}}$ iterations, until (39) or (40) is satisfied, and count the number $k_{\mathrm{NCG}}$ of iterations of this NCG procedure.
3. Update $k := k + k_{\mathrm{NCG}}$. If $k \ge k_{\mathrm{REPMtotal}}$, terminate the restart procedure.
4. If (39) is satisfied or $k_{\mathrm{NCGmax}}$ is reached, reset $X = -X$ and go to Step 2.
5. If (40) holds, continue the iteration of Algorithm 2 until $f(X_k) - f(X_{k-1}) < \varepsilon_{df}$.

5.2 Restart EPM

Basically, the convergence behaviour of the exterior point iteration may depend on the initial point. Two extreme phenomena may occur in the iteration:
– The CPF, whether strict or weak, is easy for some matrices: the iteration algorithm converges quickly starting at almost any orthogonal matrix.
– The CPF is difficult for some matrices: starting at almost any point, the iteration algorithm either drops to a local minimum quickly or converges to a global optimal solution very slowly.
In the remaining cases, the convergence may depend significantly on the initial point.

One may repeatedly try different initial points to increase the probability of a successful CPF. Let $p$ be the probability that an algorithm obtains a CPF with a given accuracy for a fixed $A$, starting at an arbitrarily chosen point. If we take an additional test as soon as the first one fails, the probability increases to $p + (1-p)p = 1 - (1-p)^2$; in general, the probability is $1 - (1-p)^k$ within at most $k$ tests. However, this may also cost $k$ times the original computational cost statistically, which can be unacceptably expensive if $k$ is large.

Fortunately, Theorem 4 implies the possibility of quickly checking whether the iteration can converge to a global optimizer or not. Clearly, detecting in time that the iteration is likely to converge only locally avoids unnecessary computational cost and, combined with a restart strategy, makes the algorithm more competitive for solving the CPF problem. In this subsection, we give a practical approach for restarting the algorithm EPM.

Theorem 4, or Eq. (15), shows two extremely different behaviours of the ratio $\|\nabla f(\tilde X)\|_F / f(\tilde X)$ as $\tilde X$ approaches a stationary point of $f$. Motivated by this observation, we adopt two criteria for judging whether global convergence can be expected:
– Not a global optimum. The current iterate $X$ satisfies
  $\|\nabla f(X)\|_F < \varepsilon_1$, $f(X) > \|\nabla f(X)\|_F$. (39)
  By the first inequality, $X$ is at least approximately a stationary point; meanwhile, the second condition means that $X$ is far from a global minimizer.
– Global optimum. We obtain a global optimal solution $X$ in high accuracy if
  $\|\nabla f(X)\|_F < \varepsilon_2$, $f(X) < \varepsilon_f$, (40)
  with $\varepsilon_f \ll \varepsilon_2 \ll \varepsilon_1$, set in our experiments as in (41)-(42).

Therefore, if condition (39) is satisfied, we conclude that the iteration is very likely to drop into a local minimum, and we restart the algorithm at a new point. To avoid falling back to the same local minimum, the new starting point should be far from the current one; a simple choice is the opposite point $-X$. If (40) is satisfied, the iterate is approaching a global minimizer, and it is suggested to continue the iteration until convergence in high accuracy or until the restriction on the total iteration number $k_{\mathrm{REPMtotal}}$ is reached. Details of this approach are given in Algorithm 3.
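For concreteness, here is a compact sketch of the restart loop; the inner solver is plain gradient descent standing in for the modified NCG of Algorithm 2, and the tolerances and step size are placeholders rather than the settings of (41)-(42):

import numpy as np

def repm(W, r_plus, lam=10.0, eps1=1e-2, eps2=1e-6, eps_f=1e-12,
         k_ncg_max=2000, k_total=20000, step=1e-3, seed=0):
    """Sketch of the restart logic of Algorithm 3 (REPM)."""
    rng = np.random.default_rng(seed)
    r = W.shape[1]
    X = np.linalg.qr(rng.standard_normal((r_plus, r)))[0].T   # X X^T = I_r
    k = 0
    while k < k_total:
        for _ in range(k_ncg_max):                 # stand-in for Algorithm 2
            k += 1
            E = X @ X.T - np.eye(r)
            neg = np.minimum(W @ X, 0.0)
            f = np.sum(E * E) + lam * np.sum(neg * neg)
            g = 4.0 * E @ X + 2.0 * lam * (W.T @ neg)
            gn = np.linalg.norm(g)
            if gn < eps2 and f < eps_f:            # criterion (40): global optimum
                return np.maximum(W @ X, 0.0)      # nonnegative factor B = W X
            if gn < eps1 and f > gn:               # criterion (39): local minimum
                break
            X = X - step * g
        X = -X                                     # restart far from the local minimum
    return np.maximum(W @ X, 0.0)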
To show the efficiency of REPM, we test matrices of order $n = 200$ with rank (cp-rank) $r = 12$, constructed in the four types ($b_{\min} = 0.1$) as before. The sparsity $s(B)$ varies from 1% to 19%. As shown in Tables 2 and 3, the CPF is difficult for most of the constructed matrices. 50 matrices are constructed for each type and each sparsity, so 1400 matrices are tested in total. For each matrix, we run Algorithm REPM 10 times, each starting at a randomly chosen orthogonal matrix. The parameters $\rho$, $\sigma$, $\nu$, $\kappa$, and $\lambda$ are set as in (38), and the tolerances $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_f$, $\varepsilon_{df}$ and the iteration limits $k_{\mathrm{NCGmax}}$, $k_{\mathrm{REPMtotal}}$ are fixed as in (41) and (42).

Table 4  Percentage of successful CP factorizations with accuracy $\varepsilon$, with and without the restart strategy ($r = 12$, $b_{\min} = 0.1$).

  No restart:
         s(B):  1%  4%  7% 10% 13% 16% 19% |  1%  4%  7% 10% 13% 16% 19%
                        constant           |          linear
       1e-12    47  21   3   0   0   3   4 | 97  83   7  26  42  46  46
       1e-13    47  21   3   0   0   3   4 | 95  82   7  26  42  46  46
       1e-14    47  19   3   0   0   3   4 | 95  80   7   2   5  10  14
                        convex             |          concave
       1e-12    99  88  13  11  44  56  73 | 97  71  12   9  35  43  50
       1e-13    99  87  12  11  44  56  73 | 97  69  10   9  35  43  50
       1e-14    99  83  11   5  33  48  58 | 96  68  10   1   1   8   2

Table 4 shows the percentages of CPFs achieving a given accuracy $\varepsilon$ obtained by Algorithm REPM among the 500 tests for each sparsity setting and each type of column-norm distribution. The restart strategy can significantly increase the possibility of obtaining a CPF with high accuracy. For example, the original algorithm almost
always fails to obtain a CPF with high accuracy for constant-type matrices with sparsities varying from 7% to 19%, whereas the factorization via the restart algorithm is very successful in most of the tests.

Table 5  REPM ($r = 12$, $b_{\min} = 0.1$): results for each column-norm type at sparsities $s(B) = 1\%, 4\%, \ldots, 19\%$.

It should be pointed out that the number of restarts and the computation time depend on the tested matrix. Roughly speaking, the larger the number of restarts, the harder the CPF. However, a large number of restarts does not necessarily mean an expensive computational cost, since a quick restart may occur. For example, on constant-type matrices with
$r = 12$, the computational cost is relatively expensive when $s(B)$ is 7% or 10%, compared with the other cases.

In this section, we compare our exterior point method with four state-of-the-art algorithms: two alternating projection methods, given in [24] (marked AP-H) and [21] (marked AP-G), the coordinate descent method given in [37] (marked CD), and the alternating nonnegative least squares method given in [29] (marked ANLS). (It was reported in [37] that the Newton method of [27] cannot beat the coordinate descent method, so we omit the comparison with the Newton method.) The algorithms and their parameter settings are briefly described below.

AP-H solves $\min_{B \ge 0,\, QQ^T = I} \|B - WQ\|_F$ by alternately minimizing $\|B_{k-1} - WQ\|_F$ over orthogonal $Q$ to get $Q_k$, and minimizing $\|B - WQ_k\|_F$ over $B \ge 0$ to update $B_k = (WQ_k)_+$. It was suggested in [24] to terminate the iteration when $|\langle B_{k-1}, B_{k-1} - WQ_k\rangle| < \varepsilon$ with a given $\varepsilon$; this criterion may miss global convergence.
Since $f_k^{\mathrm{APH}} = \|B_k - WQ_k\|_F$ is monotonically decreasing, we instead terminate the iteration if
  $\dfrac{f_{k-1}^{\mathrm{APH}} - f_k^{\mathrm{APH}}}{f_k^{\mathrm{APH}}} < \varepsilon_r^{\mathrm{APH}}$ and either $f_k^{\mathrm{APH}} > \sqrt{\varepsilon^{\mathrm{APH}}}$ (local optimum) or $f_k^{\mathrm{APH}} < \varepsilon^{\mathrm{APH}}$ (global optimum). (43)
The first condition avoids meaningless iterations that do not provide an acceptable decrease of the objective function, while the second distinguishes whether only a local minimum is achieved. This modification increases the efficiency and is helpful for restarting the algorithm, as done in REPM.

AP-G aims to minimize $\min_{WP \ge 0,\, QQ^T = I} \|Q - P\|_F$ via alternating projection [21]: set $P_{k+1} = Q_k - W^+(WQ_k)_-$, an approximate solution of $\min_{WP \ge 0} \|Q_k - P\|_F$ for the given $Q_k$, and $Q_{k+1} = \arg\min_{QQ^T = I} \|Q - P_{k+1}\|_F$ for the given $P_{k+1}$. The algorithm of [21] terminates if $|\min_{ij}(WQ_k)_{ij}| \le \varepsilon$ with a given $\varepsilon$. We observe that the gap $\|Q_k - P_k\|_F$ matches the factorization error $\|A - (WQ_k)_+(WQ_k)_+^T\|_F$ better than $|\min_{ij}(WQ_k)_{ij}|$ does, and that $\{\|Q_k - P_k\|_F\}$ is not monotonically decreasing. A more efficient termination criterion is
  $\varepsilon_k = \min_{i \le k} \|Q_i - P_i\|_F < \varepsilon^{\mathrm{APG}}$ (44)
with a given $\varepsilon^{\mathrm{APG}}$. Let $i_k \le k$ be the smallest index such that $\|Q_{i_k} - P_{i_k}\|_F = \varepsilon_k$. We terminate the iteration and take $Q_{i_k}$ as the output if (44) is satisfied; if $\varepsilon_{i_k}$ does not change within $k^{\mathrm{APG}}$ iterations, we also terminate the algorithm to avoid unnecessary iterations. Figure 3 illustrates the necessity of these modifications for AP-H and AP-G. In our experiments, $\varepsilon_r^{\mathrm{APH}}$, $\varepsilon^{\mathrm{APH}}$, $\varepsilon^{\mathrm{APG}}$, and $k^{\mathrm{APG}}$ are set to fixed values.

Fig. 3  Iterative behaviours of Algorithms AP-H (left) and AP-G (right).

CD minimizes $\|A - BB^T\|_F^2$ via column-by-column optimization, together with component-wise optimization for the column updates. ANLS minimizes $\|A - B_LB_R^T\|_F^2 + \alpha\|B_L - B_R\|_F^2$ with $B_L \ge 0$ and $B_R \ge 0$, starting from a scaled random matrix $B_R$; it is more efficient than CD. However, both algorithms converge slowly in our experiments, and because of this weakness the restart strategy is no longer suitable for CD or ANLS.
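As an illustration, a minimal sketch of the AP-H iteration with the modified stopping test follows; the orthogonal Procrustes step is the standard SVD solution, and the tolerance values are illustrative, not those of our experiments:

import numpy as np

def ap_h(W, eps_r=1e-8, eps=1e-24, max_iter=100000, seed=0):
    """AP-H sketch: alternate B_k = (W Q_k)_+ with the Procrustes update of Q,
    stopping when the relative decrease of f stalls, cf. (43)."""
    rng = np.random.default_rng(seed)
    r = W.shape[1]
    Q = np.linalg.qr(rng.standard_normal((r, r)))[0]   # random orthogonal start
    B = np.maximum(W @ Q, 0.0)
    f_old = np.inf
    for _ in range(max_iter):
        U, _, Vt = np.linalg.svd(W.T @ B)              # Q_k = argmin_Q ||B - W Q||_F
        Q = U @ Vt
        B = np.maximum(W @ Q, 0.0)                     # B_k = (W Q_k)_+
        f = np.linalg.norm(B - W @ Q)
        if f_old - f < eps_r * max(f, 1e-300):         # relative decrease stalled
            return B, bool(f < eps)                    # True marks a global optimum
        f_old = f
    return B, False

A sketch of AP-G would be analogous, alternating $P_{k+1} = Q_k - W^+(WQ_k)_-$ with an orthogonal Procrustes step and monitoring the gap $\|Q_k - P_k\|_F$ as in (44).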
Fig. 4  Sorted CPF errors (left) and computational time (right) of the five algorithms on the 360 matrices.

6.1 Synthetic matrices with equal rank and cp-rank

The tested matrices are randomly constructed with fixed $n$, $r = 12$, and $b_{\min} = 0.1$, as in the last section. For each of the four column-norm distributions, we choose 9 values of $s(B)$, varying from 1% to 25%. 10 completely positive matrices are constructed for each type and each sparsity, so in total we have 360 testing matrices in this experiment. The CPF is not easy on some of these matrices, as shown in Tables 2-4.

The left panel of Figure 4 plots the CPF errors of REPM, RAP-H, RAP-G, CD, and ANLS (the latter two without the restart strategy) on the 360 matrices, each tested with random starting points. (As in REPM, we randomly construct an orthogonal matrix as the starting point for RAP-H and RAP-G, since this performs better than the initial setting provided by their authors.) REPM gives good results, with relative CPF errors below $10^{-12}$ on about 97.8% of the matrices; the percentage decreases to 56.9% and 31.7% for RAP-H and RAP-G, respectively. Meanwhile, the computational cost of REPM is much lower than that of RAP-H and RAP-G. The percentages for CD and ANLS are very small. See the right panel of Figure 4 for a comparison of the computational times of these algorithms.

The efficiency of an algorithm may depend on the data set. To make the comparison as fair as possible, for each of REPM, RAP-H, and RAP-G, we choose two sets of testing matrices that are especially suitable or unsuitable for the selected algorithm, in the sense that the CPF errors are smaller than $10^{-12}$ (success set) or larger than $10^{-4}$ (fail set), respectively, and then compare the other algorithms on these special sets. In the success sets of RAP-H and RAP-G, REPM is also successful. Meanwhile, in the success set of REPM, RAP-H and RAP-G fail on about 39.1% and 76.0% of the matrices, respectively. Conversely, in the sets where RAP-H and
RAP-G fail, REPM still has success rates of 64.4% and 67.4%, respectively; these rates increase to 97.0% and 98.4% if the required accuracy is slightly decreased. Figure 5 shows the distributions of the relative CPF errors of the three algorithms in each of the six sets.

Fig. 5  CPF errors of the algorithms in the six subsets of tests corresponding to CPF errors smaller than $10^{-12}$ (success set) or larger than $10^{-4}$ (fail set) for REPM, RAP-H, and RAP-G, respectively.

As shown in Figure 4, CD fails to give a CPF with acceptable accuracy: its CPF errors are always larger than $10^{-4}$, and the factorization error is not decreased by restarting CD. ANLS benefits slightly from the restart strategy, but its running time increases significantly. For example, if we restart ANLS at most 10 times, each run restricted to at most 20 seconds, the percentage of CPFs with small $\mathrm{Error}(\tilde A)$ increases only slightly, from 13.3% to 17.5%, while the average computational time increases unacceptably, from 17.6 seconds to 171 seconds.

6.2 Large-scale completely positive matrices

In this comparison, we show the efficiency of REPM, RAP-H, and RAP-G on matrices of larger scale. We randomly construct 80 matrices with
$r = 20$ and 4 different sparsity values $s(B) = 0$, 10%, 20%, and 30%, with each of the four distributions of the column norms of $B$, as before.

The left panel of Figure 6 plots the CPF errors of REPM, RAP-H, and RAP-G, where each algorithm is terminated when the limit on the computational time, $t_{\max} = 100$ seconds, is reached. All the CPF errors of REPM are smaller than $10^{-12}$ on the 80 matrices. For RAP-H, only 37.5% of the CPF errors are smaller than $10^{-12}$, and RAP-G performs poorly: its CPF errors are larger than $10^{-4}$ on all the matrices. Meanwhile, the computational cost of REPM is much lower than that of RAP-H and RAP-G. In the right panel of Figure 6, we also compare the computational times in seconds of the three algorithms on the 80 matrices; REPM is much faster than RAP-H and RAP-G.
Fig. 6  CPF errors (left) and computation time (right) of REPM, RAP-H, and RAP-G on the 80 matrices.

6.3 Special matrices

We also test the algorithms on four special completely positive matrices $A_1, \ldots, A_4$, where $A_2 = WDW^T$ for a fixed matrix $W$ and diagonal matrix $D$,
$$A_3 = \begin{pmatrix} I_k & \tfrac{1}{k}J_k \\ \tfrac{1}{k}J_k & I_k \end{pmatrix}, \qquad A_4 = \begin{pmatrix} G & H & H \\ H & G & H \\ H & H & G \end{pmatrix},$$
where both $I_k$ and $J_k$ are of order $k$, $I_k$ is the identity, $J_k$ has all entries equal to 1, and
$$G = \mathrm{diag}(91, 42, 42, 42), \qquad H = \begin{pmatrix} 19 & 24 & 24 & 24 \\ 24 & 6 & 6 & 6 \\ 24 & 6 & 6 & 6 \\ 24 & 6 & 6 & 6 \end{pmatrix}.$$
In particular, $(r(A_3), r_{cp}(A_3)) = (2k-1, k^2)$. $A_1$ is a small example whose cp-rank achieves the upper bound in the estimation $r_{cp} \le r(r+1)/2 - 1$. $A_3$ is an example whose cp-rank can be significantly larger than the matrix order [22], since $r_{cp} = n^2/4$. It is diagonally dominant and hence has an explicit CPF by the factorization
$$A = \sum_{1 \le i < j \le n} a_{ij}(e_i + e_j)(e_i + e_j)^T + \mathrm{diag}(c_1, \ldots, c_n), \qquad (45)$$
given in [25] for any symmetric matrix $A = (a_{ij})$, where $c_i = a_{ii} - \sum_{j \ne i} a_{ij}$ and $e_i$ is the $i$-th column of the identity matrix of order $n$. Hence, if $A$ is nonnegative and diagonally dominant, i.e., if all
$c_i \ge 0$, then it must be completely positive, since (45) gives $A = BB^T$ with a nonnegative matrix $B$ of at most $r_+$ nonzero columns, where $r_+$ is the number of nonzero entries among $\{a_{ij} : i < j\}$ and $\{c_i\}$. For $A_3$, $r_+ = k^2 = r_{cp}$.
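Since (45) is fully constructive, it is straightforward to state as code; a minimal sketch (the function name is ours):

import numpy as np

def dd_cpf(A, tol=1e-12):
    """Explicit CPF (45) of a nonnegative, diagonally dominant, symmetric A:
    one column sqrt(a_ij) (e_i + e_j) per nonzero a_ij (i < j), and one
    column sqrt(c_i) e_i per nonzero c_i."""
    n = A.shape[0]
    c = np.diag(A) - (A.sum(axis=1) - np.diag(A))      # c_i = a_ii - sum_{j != i} a_ij
    assert np.all(A >= 0) and np.all(c >= -tol)
    cols = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] > 0:
                e = np.zeros(n); e[i] = e[j] = 1.0
                cols.append(np.sqrt(A[i, j]) * e)
        if c[i] > tol:
            e = np.zeros(n); e[i] = 1.0
            cols.append(np.sqrt(c[i]) * e)
    B = np.column_stack(cols)                          # A = B B^T
    assert np.allclose(B @ B.T, A)
    return B

Applied to $A_3$, this returns exactly $k^2$ columns, matching $r_+ = k^2 = r_{cp}$.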
$A_4$ is an example given in [11] to show that there is a completely positive matrix whose cp-rank is larger than $n^2/4$, stated there without an explicit CPF; in Appendix C we give an explicit form of its CPF. Our exterior point method can give a strict CPF for each of these matrices almost exactly.

REPM, RAP-H, and RAP-G perform very well on $A_1$, $A_2$, and $A_3$ with small $k$: each of the three algorithms quickly obtains a good CPF with an error smaller than $10^{-12}$. However, both CD and ANLS fail to yield an acceptable CPF at this accuracy. We restrict the running time of CD and ANLS to at most 20 seconds and, because each iteration of CD is faster than that of ANLS, we also impose an additional limit $n_{\mathrm{iter}}$ on their iteration numbers. As mentioned before, the restart strategy is not suitable for CD, since CD seldom terminates before these two restrictions are reached. In this experiment, we do restart ANLS as long as its total iteration number is less than $n_{\mathrm{iter}}^{\mathrm{ANLS}}$; the total iteration number of the restarted ANLS can then slightly exceed $n_{\mathrm{iter}}^{\mathrm{ANLS}}$, but stays below $2n_{\mathrm{iter}}^{\mathrm{ANLS}}$.

In Table 6, we list the average computational costs and total iteration numbers among 100 repeats on each matrix. REPM converges much faster than RAP-H and RAP-G (none of these three algorithms reaches the restriction on time). Besides the advantage in computational time, REPM is also more robust to the initial point. We also test EPM, AP-H, and AP-G without the restart strategy on $A_1$, $A_2$, and $A_3$, with 100 tests for each algorithm; in these tests EPM has a higher success rate, see Figure 7. (In REPM we here use a randomly chosen row-orthonormal matrix as the restart point, rather than $-X$, keeping the other settings unchanged.)

Table 6  Average computational costs (time in seconds / total iteration number) of the five algorithms on the special matrices.

  Error(Ã)      Algorithm   A_1               A_2               A_3 (k = 5)
  < 10^{-12}    REPM        0.002 / 128       0.004 / 283       0.104 / 6121
                RAP-H       0.091 / 16543     0.012 / 1493      1.510 / 89956
                RAP-G       0.07 / 11226      0.119 / 16007     2.512 / 140245
  > 10^{-4}     CD          1.913 / 500001    6.077 / 1000001   20.000 / 353369
                ANLS        9.629 / 100000    9.878 / 100000    24.582 / 121204

Fig. 7  Sorted CPF errors of the three algorithms without the restart strategy on $A_1$, $A_2$, and $A_3$ ($k = 5$).
Table 7  Performance of REPM, RAP-H, and RAP-G on the matrix $A_3$ with $k = 6, 8, 10$ and on $A_4$ (success rate / average time / total iteration number among 100 runs).

  Error(Ã)      Matrix        REPM                  RAP-H                   RAP-G
  < 10^{-12}    A_3, k = 6    100% / 0.4s / 17764   100% / 3.9s / 157958    100% / 7.9s / 36718
  > 10^{-12}    A_3, k = 6    0% / – / –            0% / – / –              0% / – / –
                A_4           0% / – / –            93% / 40.1s / 1740837   100% / 40.1s / 1590793

The cp-rank of $A_3$ is much larger than its rank and its matrix order once $k$ is even slightly large, and the same holds for $A_4$; the CPF is difficult in this case. We use a relatively weak criterion for RAP-H and RAP-G: these algorithms are terminated when both the total running time $t_{\max}$ (in seconds) and the total iteration number $k_{\mathrm{total}}$ are reached, but a strict criterion for REPM: it is terminated when half of the time $t_{\max}$ is reached. We use different values of $t_{\max}$ and $k_{\mathrm{total}}$ for $A_3$ and $A_4$: the budgets increase with $k$ for $A_3$ with $k = 6$, 8, and 10, and for $A_4$ we set $t_{\max} = 40$ seconds.

Table 7 shows the performance of REPM, RAP-H, and RAP-G on $A_3$ and $A_4$, reporting the success rates together with the average computational time and total iteration number among 100 runs for each matrix. The three algorithms perform very well on $A_3$ with small $k = 6$. As $k$ increases slightly, to 8 or 10, the percentages of successful CPFs of RAP-H and RAP-G decrease and their computational times increase quickly. RAP-H is a bit better than RAP-G on $A_4$, but its success rate is only 7%. REPM performs much better than RAP-H and RAP-G: on $A_3$ with $k = 8$ and on $A_4$ it always gives a CPF with error smaller than $10^{-12}$, and on the difficult $A_3$ with $k = 10$ it still provides a CPF within this accuracy in 47% of the runs.
7 Conclusion

In this paper, we applied the idea of the exterior point method to the CPF problem numerically. The proposed optimization model and its iterative solver via a modified NCG implement the exterior point method, and in the numerical experiments reported in this paper, the exterior point method performs much better than the algorithms in the literature. Through extensive numerical experiments with our algorithm, we also discussed some potential issues that may affect the CPF; some of the observed phenomena are interesting and might be helpful for further analysis of this topic. However, we have only touched the tip of the iceberg. For instance, the special rank-sparsity boundary mentioned in Phenomenon B may determine how hard the CPF is, but we do not yet know how to verify its existence or characterize such a boundary theoretically. Besides this, approximate cp-rank deficiency may also result in a difficult CPF. For our algorithm and its restart version on difficult CPFs, it deserves further study to find an efficient initial setting. Finally, the weak CPF problem is relatively easier than the strict CPF, and it may also be an interesting topic to transform a weak CPF into a strict CPF efficiently. It is worth working further on these topics.
Acknowledgements
The work was supported in part by NSFC project 11971430 and the Major Scientific Research Project of Zhejiang Lab (No. 2019KB0AB01).
A Transversal intersection of $\mathcal{Q}_r$ and $\mathcal{P}_r$

Proposition 4
The Stiefel manifold $\mathcal{Q}_r = \{Q \in \mathbb{R}^{r \times r_+} : QQ^T = I_r\}$ transversally intersects the submanifold $\mathcal{P}_r = \{P \in \mathbb{R}^{r \times r_+} : WP \ge 0\}$ at $Q \in \mathcal{Q}_r \cap \mathcal{P}_r$ if and only if, for any nonnegative matrix $Y \in \mathbb{R}^{n \times r_+}$, the matrix $W^TYQ^T$ is not symmetric, or its trace is not zero whenever $W^TYQ^T$ is not zero.

Proof  It is known that the normal space of $\mathcal{Q}_r$ at $Q$ is $N_{\mathcal{Q}_r}(Q) = \{SQ : S \in \mathbb{R}^{r \times r},\ S = S^T\}$. If also $Q \in \mathcal{P}_r$, the normal cone of $\mathcal{P}_r$ at $Q$ is defined as $N_{\mathcal{P}_r}(Q) = \{N : \langle N, P - Q\rangle \le 0,\ \forall P \in \mathcal{P}_r\}$. Since $W$ is of full column rank, we can rewrite any $N \in \mathbb{R}^{r \times r_+}$ as $N = -W^TY$ with some $Y \in \mathbb{R}^{n \times r_+}$ ($Y$ is not unique in this representation). The restriction defining $N_{\mathcal{P}_r}(Q)$ becomes $\langle Y, WP - WQ\rangle \ge 0$ for all $P \in \mathcal{P}_r$. Since $Q \in \mathcal{P}_r$, choosing $P = 2Q$ and $P = 0$ in this restriction shows that it is equivalent to $\langle Y, WQ\rangle = 0$ together with $\langle Y, WP\rangle \ge 0$ for all $P \in \mathcal{P}_r$. Furthermore, we can restrict $Y$ to be nonnegative, which makes the inequality $\langle Y, WP\rangle \ge 0$, $\forall P \in \mathcal{P}_r$, hold automatically, and hence we can represent
$$N_{\mathcal{P}_r}(Q) = \{-W^TY : Y \in \mathbb{R}^{n \times r_+},\ Y \ge 0,\ \langle Y, WQ\rangle = 0\}.$$
To show the existence of such a nonnegative $Y$ for a fixed $W^TY$, assume on the contrary that $W^TY \ne W^T\tilde Y$ for every $\tilde Y \ge 0$. Then $W^TY$ and the closed cone $\{W^T\tilde Y : \tilde Y \ge 0\}$ can be separated: there is an $H \in \mathbb{R}^{r \times r_+}$ such that $\langle Y, WH\rangle < \langle\tilde Y, WH\rangle$ for all $\tilde Y \ge 0$. Let $\tilde Y_t \ge 0$ be the matrix with $(\tilde Y_t)_{ij} = t > 0$ at a single entry $(i, j)$ and zeros elsewhere. Letting $t \to \infty$, $(WH)_{ij} = \frac{1}{t}\langle\tilde Y_t, WH\rangle > \frac{1}{t}\langle Y, WH\rangle \to 0$, which implies $(WH)_{ij} \ge 0$; that is, $H \in \mathcal{P}_r$. A contradiction follows immediately: then $\langle Y, WH\rangle \ge 0$ since $H \in \mathcal{P}_r$, while choosing $\tilde Y = 0$ in the separation gives $\langle Y, WH\rangle < \langle 0, WH\rangle = 0$.

That $\mathcal{Q}_r$ intersects $\mathcal{P}_r$ transversally at $Q \in \mathcal{Q}_r \cap \mathcal{P}_r$, which by definition means that $N_{\mathcal{Q}_r}(Q)$ and $-N_{\mathcal{P}_r}(Q)$ intersect only at the origin, is thus equivalent to the following: if $SQ = W^TY$ for a symmetric $S \in \mathbb{R}^{r \times r}$ and a nonnegative $Y \in \mathbb{R}^{n \times r_+}$ satisfying $\langle WQ, Y\rangle = 0$, then $S = W^TYQ^T = 0$. This is in turn equivalent to the condition that, for any nonzero $Y \ge 0$ in $\mathbb{R}^{n \times r_+}$, there is no symmetric $S$ equal to $W^TYQ^T$, i.e., $W^TYQ^T$ is not symmetric, or $Y \notin -N_{\mathcal{P}_r}(Q)$, i.e., the trace $\langle WQ, Y\rangle$ of $W^TYQ^T$ is not zero when $W^TYQ^T$ is not zero. □

B Proof of Theorem 1
Proof  For a stationary point $X$ of $f$, let $T_0$ and $T_-$ be the indicator matrices of the zero entries and of the negative entries of $WX$, respectively. Consider a sufficiently small neighborhood $N(X)$ of $X$ in which $(W\tilde X)_{ij} < 0$ wherever $(WX)_{ij} < 0$, and $(W\tilde X)_{ij} > 0$ wherever $(WX)_{ij} > 0$. For $\tilde X = X + \Delta \in N(X)$, since $(WX) \odot T_0 = 0$, we get
$$(W\tilde X)_- = (W\tilde X) \odot T_- + (W\Delta)_- \odot T_0, \quad \|(W\tilde X)_-\|_F^2 = \|(W\tilde X) \odot T_-\|_F^2 + \|(W\Delta)_- \odot T_0\|_F^2.$$
Let $f_-(\tilde X) = \|\tilde X\tilde X^T - I\|_F^2 + \lambda\|(W\tilde X) \odot T_-\|_F^2$, so that $\nabla f_-(\tilde X) = 4(\tilde X\tilde X^T - I)\tilde X + 2\lambda W^T\big((W\tilde X) \odot T_-\big)$. Hence,
$$f(\tilde X) = f_-(\tilde X) + \lambda\|(W\Delta)_- \odot T_0\|_F^2, \qquad \nabla f(\tilde X) = \nabla f_-(\tilde X) + 2\lambda W^T\big((W\Delta)_- \odot T_0\big). \qquad (46)$$
Obviously, $f_-(X) = f(X)$ and $\nabla f_-(X) = \nabla f(X) = 0$. It is easy to verify that
$$f_-(\tilde X) - f_-(X) = \lambda\|(W\Delta) \odot T_-\|_F^2 + 2\langle XX^T - I, \Delta\Delta^T\rangle + \|\Delta X^T + X\Delta^T + \Delta\Delta^T\|_F^2 = \langle\mathcal{A}(\Delta), \Delta\rangle + 2\langle\Delta X^T + X\Delta^T, \Delta\Delta^T\rangle + \|\Delta\Delta^T\|_F^2,$$
where $\langle\mathcal{A}(\Delta), \Delta\rangle$ is a quadratic form with the linear mapping $\mathcal{A} : \mathbb{R}^{r \times r_+} \to \mathbb{R}^{r \times r_+}$,
$$\mathcal{A}(\Delta) = 2(XX^T - I)\Delta + 2\Delta X^TX + 2X\Delta^TX + \lambda W^T\big((W\Delta) \odot T_-\big).$$
It is not difficult to rewrite $\langle\mathcal{A}(\Delta), \Delta\rangle$ as a quadratic form $\delta^TS\delta$ with a symmetric matrix $S$ and the vector representation $\delta = \mathrm{vec}(\Delta)$ of the matrix $\Delta$. If $S$ is positive definite, the quadratic form is positive definite, and hence $X$ is a local minimizer of $f_-$. If $S$ is positive semidefinite, let $\{\delta^{(k)} = \mathrm{vec}(\Delta^{(k)})\}$ be the $rr_+$ unit eigenvectors of $S$ corresponding to the eigenvalues $\{\lambda_k\}$ in ascending order, the first $m$ of which are zero. By the assumption, $\Delta^{(k)}X^T + X(\Delta^{(k)})^T = 0$ for $k \le m$. Then, for $\tilde X = X + \Delta$ with a nonzero $\Delta = \sum_k \alpha_k\Delta^{(k)}$,
$$f_-(\tilde X) - f_-(X) = \langle\mathcal{A}(\Delta), \Delta\rangle + 2\langle\Delta X^T + X\Delta^T, \Delta\Delta^T\rangle + \|\Delta\Delta^T\|_F^2 = \sum_{k>m}\big\{\alpha_k^2\lambda_k + 4\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\big\} + \|\Delta\Delta^T\|_F^2.$$
Since $\big|\sum_{k>m}\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\big| = \big|\langle\sum_{k>m}\alpha_k\Delta^{(k)}, \Delta\Delta^TX\rangle\big| \le \big\|\sum_{k>m}\alpha_k\Delta^{(k)}\big\|_F\|\Delta\Delta^T\|_F\|X\|_2$ and, by the orthogonality of $\{\delta^{(k)}\}$, $\big\|\sum_{k>m}\alpha_k\Delta^{(k)}\big\|_F = \sqrt{\sum_{k>m}\alpha_k^2}$, we get
$$\Big|4\sum_{k>m}\alpha_k\langle\Delta^{(k)}X^T, \Delta\Delta^T\rangle\Big| \le 4\sqrt{\sum_{k>m}\alpha_k^2}\,\|X\|_2\,\|\Delta\Delta^T\|_F \le 4\sum_{k>m}\alpha_k^2\|X\|_2^2 + \|\Delta\Delta^T\|_F^2.$$
Therefore, $f_-(\tilde X) - f_-(X) \ge \sum_{k>m}\alpha_k^2\big(\lambda_k - 4\|X\|_2^2\big) \ge$