Deterministic guarantees for Burer-Monteiro factorizations of smooth semidefinite programs
NICOLAS BOUMAL
Mathematics Department and Program in Applied and Computational Mathematics, Princeton University
VLADISLAV VORONINSKI
Helm.ai
AND AFONSO S. BANDEIRA
Department of Mathematics and Center for Data Science, Courant Institute of Mathematical Sciences, New York University
Abstract
We consider semidefinite programs (SDPs) with equality constraints. The variable to be optimized is a positive semidefinite matrix X of size n. Following the Burer–Monteiro approach, we optimize a factor Y of size n × p instead, such that X = YY⊤. This ensures positive semidefiniteness at no cost and can reduce the dimension of the problem if p is small, but results in a non-convex optimization problem with a quadratic cost function and quadratic equality constraints in Y. In this paper, we show that if the set of constraints on Y regularly defines a smooth manifold, then, despite non-convexity, first- and second-order necessary optimality conditions are also sufficient, provided p is large enough. For smaller values of p, we show a similar result holds for almost all (linear) cost functions. Under those conditions, a global optimum Y maps to a global optimum X = YY⊤ of the SDP. We deduce old and new consequences for SDP relaxations of the generalized eigenvector problem, the trust-region subproblem and quadratic optimization over several spheres, as well as for the Max-Cut and Orthogonal-Cut SDPs, which are common relaxations in stochastic block modeling and synchronization of rotations.

https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21830

1. Introduction

We consider semidefinite programs (SDPs) of the form

    f⋆ = min_{X ∈ S^{n×n}} ⟨C, X⟩  subject to  A(X) = b,  X ⪰ 0,    (SDP)

where S^{n×n} is the set of real symmetric matrices of size n, C ∈ S^{n×n} is the cost matrix, ⟨C, X⟩ = Tr(C⊤X), A: S^{n×n} → ℝ^m is a linear operator capturing m equality constraints with right-hand side b ∈ ℝ^m, and the variable X is symmetric, positive semidefinite. Let A_1, …, A_m ∈ S^{n×n} be the constraint matrices such that A(X)_i = ⟨A_i, X⟩, and let

    C = { X ∈ S^{n×n} : A(X) = b and X ⪰ 0 }    (1.1)

be the search space of (SDP), assumed non-empty. Interior point methods solve (SDP) in polynomial time [23].
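As a concrete illustration of the factorization idea, consider Max-Cut-type constraints diag(X) = 1, one of the examples treated in Section 5. The following minimal numerical sketch (not part of the paper; it assumes only NumPy) checks that any factor Y with unit-norm rows yields a feasible X = YY⊤ which is positive semidefinite at no extra cost:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3

# Max-Cut-style constraints: A_i = e_i e_i^T and b_i = 1, i.e. diag(X) = 1.
# Any Y with unit-norm rows lies on the manifold M_p of (P); the matrix
# X = Y Y^T is then feasible for (SDP) and positive semidefinite for free.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
X = Y @ Y.T

assert np.allclose(np.diag(X), np.ones(n))       # A(X) = b holds
assert np.linalg.eigvalsh(X).min() >= -1e-12     # X is PSD by construction
```

Here X has n(n+1)/2 = 21 degrees of freedom while Y has only np = 18, and the gap widens quickly when p is small relative to n.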
In practice, however, for n beyond a few thousands, such algorithms run out of memory (and time), prompting research into alternative solvers. Crucially, if C is compact, then (SDP) admits a global optimum of rank at most r, where r(r+1)/2 ≤ m [24, 7]; we review this fact in Section 2.2. Thus, if one restricts C to matrices of rank at most p with p(p+1)/2 ≥ m, the optimal value remains unchanged. This restriction is easily enforced by factorizing X = YY⊤, where Y has size n × p, yielding a quadratically constrained quadratic program:

    min_{Y ∈ ℝ^{n×p}} ⟨CY, Y⟩  subject to  A(YY⊤) = b.    (P)

In general, (P) is non-convex because its search space

    M_p = { Y ∈ ℝ^{n×p} : A(YY⊤) = b }    (1.2)

is non-convex. (When p is clear from context or unimportant, we just write M.) Non-convexity makes it a priori unclear how to solve (P). Still, the benefits are that M requires no conic constraint and can be lower dimensional than C. This has motivated Burer and Monteiro [12, 13] to try to solve (P) using local optimization methods, with surprisingly good results. They developed theory in support of this observation (details below). About their results, Burer and Monteiro write:

“How large must we take p so that the local minima of (P) are guaranteed to map to global minima of (SDP)? Our theorem asserts that we need only p(p+1)/2 > m (with the important caveat that positive-dimensional faces of (SDP) which are ‘flat’ with respect to the objective function can harbor non-global local minima).” (End of Section 3 in [13], mutatis mutandis.)

The caveat, namely the existence or non-existence of non-global local optima and their potentially adverse effect on local optimization algorithms, was not further discussed. How mild this caveat really is (as stated) is hard to gauge, considering C can have a continuum of faces.

Contributions
In this paper, we identify settings where the non-convexity of (P) is benign, in the sense that second-order necessary optimality conditions are sufficient for global optimality: an unusual property for a non-convex problem. This paper extends a previous conference paper by the same authors [11]. (The condition relating p and m in [13] is slightly, but inconsequentially, different from ours.) Our core assumption is as follows.

Assumption 1.1. For a given p such that M (1.2) is non-empty, the constraints on (SDP) defined by A_1, …, A_m ∈ S^{n×n} and b ∈ ℝ^m satisfy at least one of the following:
a. {A_1Y, …, A_mY} are linearly independent in ℝ^{n×p} for all Y ∈ M; or
b. {A_1Y, …, A_mY} span a subspace of constant dimension in ℝ^{n×p} for all Y in an open neighborhood of M in ℝ^{n×p}.
In either case, let m′ denote the dimension of the space spanned by {A_1Y, …, A_mY}. (By assumption, m′ is independent of the choice of Y ∈ M.)

Under Assumption 1.1, M is a smooth manifold, which is why we say such an (SDP) is smooth. Furthermore, if the assumption holds for several values of p, then m′ is the same for all. Formal statements follow; proofs are in Appendix A.

Proposition 1.2.
Under Assumption 1.1, M is an embedded submanifold of ℝ^{n×p} of dimension np − m′.

Proposition 1.3.
If Assumption 1.1 holds for some p, it holds for all p′ ≤ p such that M_{p′} is non-empty. Furthermore, if Assumption 1.1a holds for p = n, then it holds for all p′ such that M_{p′} is non-empty. In both cases, m′ is independent of p.

Examples of SDPs satisfying Assumption 1.1 are detailed in Section 5 (they all satisfy Assumption 1.1a for p = n). The assumption itself is further discussed in Section 6. Our first main result is as follows, where rank A can be replaced by m if preferred. Optimality conditions are derived in Section 2.

Theorem 1.4.
Let p be such that p(p+1)/2 > rank A and such that Assumption 1.1 holds. For almost any cost matrix C ∈ S^{n×n}, if Y ∈ M satisfies first- and second-order necessary optimality conditions for (P), then Y is globally optimal and X = YY⊤ is globally optimal for (SDP).

The proof combines two intermediate results (Proposition 3.1 and Lemma 3.3 below):
(1) If Y is column-rank deficient and satisfies first- and second-order necessary optimality conditions for (P), then it is globally optimal and X = YY⊤ is optimal for (SDP); and
(2) If p(p+1)/2 > rank A, then, for almost all C, every Y which satisfies first-order necessary optimality conditions is column-rank deficient.

The first step is a variant of well-known results [12, 13, 17]. The second step is new and crucial, as it allows us to formally exclude the existence of spurious local optima, thus resolving the caveat raised by Burer and Monteiro generically in C.

Theorem 1.4 is a statement about the optimization problem itself, not about specific algorithms. If C is compact, then so is M, and known algorithms for optimization on manifolds converge to second-order critical points regardless of initialization [10]. Thus, provided p is large enough, for almost any cost matrix C, such algorithms generate sequences which converge to global optima of (P). Each iteration requires a polynomial number of arithmetic operations.

In practice, the algorithm is stopped after a finite number of iterations, at which point one can only guarantee approximate satisfaction of first- and second-order necessary optimality conditions. Ideally, this should lead to a statement of approximate optimality. We are only able to make that statement for large values of p. We state this result informally here, and give a precise statement in Corollary 4.5 below.

Theorem 1.5 (Informal). Assume C is compact and Assumption 1.1 holds for p = n + 1. Then, for any cost matrix C ∈ S^{n×n}, if Y ∈ M_{n+1} approximately satisfies first- and second-order necessary optimality conditions for (P), then it is approximately globally optimal and X = YY⊤ is approximately globally optimal for (SDP), in terms of attained cost value.

Theorem 1.4 does not exclude the possibility that a zero-measure subset of cost matrices C may pose difficulties. Theorem 1.5 does apply to all cost matrices, but requires a large value of p. A complementary result in this paper, which comes with a more geometric proof, constitutes a refinement of the caveat raised by Burer and Monteiro [13] in the excerpt quoted above. It states that a suboptimal second-order critical point Y must map to a face F_{YY⊤} of the convex search space C whose dimension is large (rather than just positive) when p itself is large. The facial structure of C is discussed in Section 2.2. The following is a consequence of Corollary 2.9 and Theorem 3.4 below.

Theorem 1.6.
Let Assumption 1.1 hold for some p. Let Y ∈ M be a second-order critical point of (P). If rank(Y) < p, or if rank(Y) = p and dim F_{YY⊤} < p(p+1)/2 − m′ + p, then Y is globally optimal for (P) and X = YY⊤ is globally optimal for (SDP).

Combining this theorem with bounds on the dimension of faces of C allows us to conclude the optimality of second-order critical points for all cost matrices C, with bounds on p that are smaller than n. Implications of these theorems for examples of SDPs are treated in Section 5, including the trust-region subproblem, Max-Cut and Orthogonal-Cut.

(Throughout, second-order critical points are points which satisfy first- and second-order necessary optimality conditions. Compactness of C ensures a minimum is attained in (P), hence also that second-order critical points exist.)

Notation

S^{n×n} is the set of real, symmetric matrices of size n. A symmetric matrix X is positive semidefinite (X ⪰ 0) if and only if u⊤Xu ≥ 0 for all u ∈ ℝ^n. For matrices A, B, the standard Euclidean inner product is ⟨A, B⟩ = Tr(A⊤B). The associated (Frobenius) norm is ‖A‖ = √⟨A, A⟩. Id is the identity operator and I_n is the identity matrix of size n. The variable m′ ≤ m is defined in Assumption 1.1. The adjoint of A is A*, such that A*(ν) = ν_1A_1 + ⋯ + ν_mA_m.

We first discuss the smooth geometry of (P) and the convex geometry of (SDP), as well as optimality conditions for both.

(P)
Endow ℝ^{n×p} with the classical Euclidean metric ⟨U_1, U_2⟩ = Tr(U_1⊤U_2), corresponding to the Frobenius norm ‖U‖ = √⟨U, U⟩. As stated in Proposition 1.2, under Assumption 1.1 for a given p, the search space M of (P) defined in (1.2) is a submanifold of ℝ^{n×p} of dimension dim M = np − m′. Furthermore, the tangent space to M at Y is a subspace of ℝ^{n×p} obtained by linearizing the equality constraints.

Lemma 2.1.
Under Assumption 1.1, the tangent space at Y to M, T_YM, obeys

    T_YM = { Ẏ ∈ ℝ^{n×p} : A(ẎY⊤ + YẎ⊤) = 0 }
         = { Ẏ ∈ ℝ^{n×p} : ⟨A_iY, Ẏ⟩ = 0 for i = 1, …, m }.    (2.1)

Proof. By definition, Ẏ ∈ ℝ^{n×p} is a tangent vector to M at Y if and only if there exists a curve γ: ℝ → M such that γ(0) = Y and γ̇(0) = Ẏ, where γ̇ is the derivative of γ. Then, A(γ(t)γ(t)⊤) = b for all t. Differentiating on both sides yields A(γ̇(t)γ(t)⊤ + γ(t)γ̇(t)⊤) = 0. Evaluating at t = 0 shows that T_YM is included in the subspace (2.1). To conclude, use the fact that both subspaces have the same dimension under Assumption 1.1, by Proposition 1.2. □

Each tangent space is equipped with a restriction of the metric ⟨·, ·⟩, thus making M a Riemannian submanifold of ℝ^{n×p}. From (2.1), it is clear that the A_iY span the normal space at Y:

    N_YM = span{A_1Y, …, A_mY}.    (2.2)

An important tool is the orthogonal projector Proj_Y: ℝ^{n×p} → T_YM:

    Proj_Y Z = argmin_{Ẏ ∈ T_YM} ‖Ẏ − Z‖.    (2.3)

We have the following lemma to characterize it.

Lemma 2.2.
Under Assumption 1.1, the orthogonal projector is given by:
    Proj_Y Z = Z − A*(G†A(ZY⊤))Y,

where A*: ℝ^m → S^{n×n} is the adjoint of A, G = G(Y) is a Gram matrix defined by G_{ij} = ⟨A_iY, A_jY⟩, and G† denotes the Moore–Penrose pseudo-inverse of G. Furthermore, if Y ↦ Z(Y) is differentiable in an open neighborhood of M in ℝ^{n×p}, then Y ↦ Proj_Y Z(Y) is differentiable at all Y in M.

Proof. Orthogonal projection is along the normal space, so that Proj_Y Z ∈ T_YM and Z − Proj_Y Z ∈ N_YM (2.2). From the latter we infer there exists μ ∈ ℝ^m such that

    Z − Proj_Y Z = ∑_{i=1}^m μ_iA_iY = A*(μ)Y,

since the adjoint of A is A*(μ) = μ_1A_1 + ⋯ + μ_mA_m by definition. Multiply on the right by Y⊤ and apply A to obtain

    A(ZY⊤) = A(A*(μ)YY⊤),

where we used A(Proj_Y(Z)Y⊤) = 0, which holds because Proj_Y(Z) ∈ T_YM (2.1). The right-hand side expands into

    A(A*(μ)YY⊤)_i = ⟨A_i, ∑_{j=1}^m μ_jA_jYY⊤⟩ = ∑_{j=1}^m ⟨A_iY, A_jY⟩μ_j = (Gμ)_i.

Thus, any μ satisfying Gμ = A(ZY⊤) will do. Without loss of generality, we pick the smallest norm solution: μ = G†A(ZY⊤). The function Y ↦ G† is continuous and differentiable at Y ∈ M provided G has constant rank in an open neighborhood of Y in ℝ^{n×p} [16, Thm. 4.3], which is the case under Assumption 1.1. □

Problem (P) minimizes

    g(Y) = ⟨CY, Y⟩    (2.4)

over M, where g is defined over all of ℝ^{n×p}. Its classical (Euclidean) gradient at Y is ∇g(Y) = 2CY. The Riemannian gradient of g at Y, grad g(Y), is defined as the unique tangent vector at Y such that, for all tangent Ẏ, ⟨grad g(Y), Ẏ⟩ = ⟨∇g(Y), Ẏ⟩. This is given by the projection of the classical gradient onto the tangent space [3, eq. (3.37)]:

    grad g(Y) = Proj_Y(∇g(Y)) = 2 Proj_Y(CY) = 2(C − A*(G†A(CYY⊤)))Y.

This motivates the definition of S as follows, with G_{ij} = ⟨A_iY, A_jY⟩:

    S = S(Y) = S(YY⊤) = C − A*(μ), with μ = G†A(CYY⊤).    (2.5)

This is indeed well defined since G_{ij} is a function of YY⊤. We get a convenient formula for the gradient:

    grad g(Y) = 2SY.    (2.6)

In the sequel, S will play a major role.

Turning toward second-order derivatives, the Riemannian Hessian of g at Y is a symmetric operator on the tangent space at Y, obtained as the projection of the derivative of the Riemannian gradient vector field [3, eq. (5.15)]. The latter is indeed differentiable owing to Lemma 2.2. With D denoting classical Fréchet differentiation, writing S = S(Y) and Ṡ = D(Y ↦ S(Y))(Y)[Ẏ],

    Hess g(Y)[Ẏ] = Proj_Y(D grad g(Y)[Ẏ]) = 2 Proj_Y(ṠY + SẎ) = 2 Proj_Y(SẎ).    (2.7)

The projection of ṠY vanishes because Ṡ = A*(ν) for some ν ∈ ℝ^m, so that ṠY = ∑_{i=1}^m ν_iA_iY is in the normal space at Y (2.2).

These differentials are relevant for their role in necessary optimality conditions of (P).

Definition 2.3. Y ∈ M is a (first-order) critical point for (P) if

    ½ grad g(Y) = SY = 0,    (2.8)

where S is a function of Y (2.5). If furthermore Hess g(Y) ⪰ 0, that is (using the fact that Proj_Y is self-adjoint),

    ∀Ẏ ∈ T_YM,  ½⟨Ẏ, Hess g(Y)[Ẏ]⟩ = ⟨Ẏ, SẎ⟩ ≥ 0,    (2.9)

then Y is a second-order critical point for (P).

Proposition 2.4.
Under Assumption 1.1, all local (and global) minima of (P) are second-order critical points.

Proof. These are standard necessary optimality conditions on manifolds; see [31, Rem. 4.2 and Cor. 4.2]. □
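As a numerical sanity check on (2.5)–(2.8) (a sketch, not part of the paper, again for the Max-Cut constraints diag(X) = 1 of Section 5, for which A*(μ) = Diag(μ) and the Gram matrix G is the identity on the manifold):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
C = rng.standard_normal((n, n)); C = (C + C.T) / 2

Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # feasible: diag(YY^T) = 1

# (2.5) with G = I: mu = A(C Y Y^T) = diag(C Y Y^T) and S = C - Diag(mu).
mu = np.diag(C @ Y @ Y.T)
S = C - np.diag(mu)

# (2.6): S Y = (1/2) grad g(Y) must equal the orthogonal projection of the
# Euclidean half-gradient C Y onto the tangent space (2.1); for these
# constraints that projection simply removes each row's radial component.
Z = C @ Y
proj = Z - np.sum(Z * Y, axis=1, keepdims=True) * Y
assert np.allclose(S @ Y, proj)
```

The agreement of the two expressions illustrates why S acts as a closed-form dual certificate: no separate dual solve is needed.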
Thus, the central role of S in necessary optimality conditions for the non-convex problem is clear. Its role for the convex problem is elucidated next.

(SDP)

The search space of (SDP) is the convex set C defined in (1.1), assumed non-empty. Geometry-wise, we are primarily interested in the facial structure of C [27, §18].

Definition 2.5. A face of C is a convex subset F of C such that every (closed) line segment in C with a relative interior point in F has both endpoints in F. The empty set and C itself are faces of C.

For example, the non-empty faces of a cube are its vertices, edges, facets and the cube itself. By [27, Thm. 18.2], the collection of relative interiors of the non-empty faces forms a partition of C (the relative interior of a singleton is the singleton). That is, each X ∈ C is in the relative interior of exactly one face of C, called F_X. The dimension of a face is the dimension of the lowest dimensional affine subspace which contains that face. Of particular interest are the zero-dimensional faces of C (singletons).

Definition 2.6. X ∈ C is an extreme point of C if dim F_X = 0.

In other words, X is extreme if it does not lie on an open line segment included in C. If C is compact, it is the convex hull of its extreme points [27, Cor. 18.5.1]. Of importance to us, if C is compact, (SDP) always attains its minimum at one of its extreme points, since the linear cost function of (SDP) is (a fortiori) concave [27, Cor. 32.3.2]. The faces of C can be described explicitly as follows. The proof is in Appendix B.

Proposition 2.7.
Let X ∈ C have rank p and let F_X be its associated face (that is, X is in the relative interior of F_X). Then, with Y ∈ M_p such that X = YY⊤,

    F_X = { X′ = Y(I_p + A)Y⊤ : A ∈ ker L_X and I_p + A ⪰ 0 },    (2.10)

where L_X: S^{p×p} → ℝ^m is defined by:

    L_X(A) = A(YAY⊤) = (⟨Y⊤A_1Y, A⟩, …, ⟨Y⊤A_mY, A⟩)⊤.    (2.11)

Thus, the dimension of F_X is the dimension of the kernel of L_X. Since the dimension of S^{p×p} is p(p+1)/2 and rank(L_X) ≤ m′, the rank–nullity theorem gives a lower bound:

    dim F_X = p(p+1)/2 − rank L_X ≥ p(p+1)/2 − m′.    (2.12)

For extreme points, dim F_X = 0; then, p(p+1)/2 = rank L_X ≤ m′. Solving for p (the rank of X) shows extreme points have small rank, namely,

    dim F_X = 0  ⟹  rank(X) ≤ p* := ⌊(√(8m′ + 1) − 1)/2⌋.    (2.13)

Since (SDP) attains its minimum at an extreme point for compact C, we recover the known fact that one of the optima has rank at most p*. This approach to proving that statement is well known [24, Thm. 2.1].

Optimality conditions for (SDP) are easily stated once S (2.5) is introduced; it acts as a dual certificate, known in closed form owing to the underlying smooth geometry of M. We need a first general fact about SDPs (Assumption 1.1 is not required).

Proposition 2.8.
Let X ∈ C and let S = C − A*(ν) for some ν ∈ ℝ^m (as is the case in (2.5), for example). If S ⪰ 0 and ⟨S, X⟩ = 0, then X is optimal for (SDP).

Proof. First, use S ⪰ 0: for any X′ ∈ C, since X′ ⪰ 0 and A(X) = A(X′) = b,

    0 ≤ ⟨S, X′⟩ = ⟨C, X′⟩ − ⟨A*(ν), X′⟩ = ⟨C, X′⟩ − ⟨ν, A(X)⟩.

Concentrating on the last term, use ⟨S, X⟩ = 0:

    ⟨ν, A(X)⟩ = ⟨A*(ν), X⟩ = ⟨C, X⟩ − ⟨S, X⟩ = ⟨C, X⟩.

Hence, ⟨C, X⟩ ≤ ⟨C, X′⟩, which shows X is optimal. □

Since (SDP) is a relaxation of (P), this leads to a corollary of prime importance.
Corollary 2.9.
Let Assumption 1.1 hold for some p. If Y is a critical point for (P) as defined by (2.8), and S (2.5) is positive semidefinite, then X = YY⊤ is globally optimal for (SDP) and Y is globally optimal for (P).

Proof. Since Y is a critical point, SY = 0; thus, ⟨S, X⟩ = ⟨S, YY⊤⟩ = 0 and Proposition 2.8 applies. □

A converse of Proposition 2.8 holds under additional conditions which are satisfied by all examples in Section 5. Thus, for those cases, for a critical point Y, YY⊤ is optimal if and only if S is positive semidefinite. We state it here for completeness (this result is not needed in the sequel).

Proposition 2.10.
Let X ∈ C be a global optimum of (SDP) and assume strong duality holds. Let Assumption 1.1a hold with p = rank(X). Then, S ⪰ 0 and ⟨S, X⟩ = 0, where S = S(X) is as in (2.5).

Proof. Consider the dual of (SDP):

    max_{ν ∈ ℝ^m} ⟨b, ν⟩  subject to  C − A*(ν) ⪰ 0.    (DSDP)

Since we assume strong duality and X is optimal, there exists ν optimal for the dual such that ⟨C, X⟩ = ⟨b, ν⟩. Using ⟨b, ν⟩ = ⟨A(X), ν⟩ = ⟨X, A*(ν)⟩, this implies

    0 = ⟨C, X⟩ − ⟨b, ν⟩ = ⟨C − A*(ν), X⟩.

Since both C − A*(ν) and X are positive semidefinite, (C − A*(ν))X = 0. As a result, by definition of μ and G (2.5),

    μ = G†A(CX) = G†A(A*(ν)X) = G†Gν = ν,

where we used G† = G⁻¹ under Assumption 1.1a and

    (Gν)_i = ∑_j G_{ij}ν_j = ∑_j ⟨A_i, A_jX⟩ν_j = ⟨A_i, A*(ν)X⟩ = A(A*(ν)X)_i.

Thus, S = C − A*(μ) = C − A*(ν) has the desired properties. This concludes the proof, and shows uniqueness of the dual certificate. □

We aim to show that second-order critical points of (P) are global optima, provided p is sufficiently large. To this end, we first recall a known result about rank-deficient second-order critical points.

Proposition 3.1.
Let Assumption 1.1 hold for some p and let Y ∈ M be a second-order critical point for (P). If rank(Y) < p, then S(Y) ⪰ 0, so that Y is globally optimal for (P) and so is X = YY⊤ for (SDP).

(Optimality of rank-deficient local optima is shown, under different assumptions, in [13, 17], with the proof in [17] actually only requiring second-order criticality.)
Proof.
The proof parallels the one in [17]. By Corollary 2.9, it is sufficient to show that S = S(Y) (2.5) is positive semidefinite. Since rank(Y) < p, there exists a nonzero z ∈ ℝ^p such that Yz = 0. Furthermore, for all x ∈ ℝ^n, the matrix Ẏ = xz⊤ is such that YẎ⊤ = 0. In particular, Ẏ is a tangent vector at Y (2.1). Since Y is second-order critical, inequality (2.9) holds, and here simplifies to:

    0 ≤ ⟨Ẏ, SẎ⟩ = ⟨xz⊤, Sxz⊤⟩ = ‖z‖² · x⊤Sx.

This holds for all x ∈ ℝ^n. Thus, S is positive semidefinite. □

Corollary 3.2.
Let Assumption 1.1 hold for some p ≥ n. Then, any second-order critical point Y ∈ M of (P) is globally optimal, and X = YY⊤ is globally optimal for (SDP).

Proof. For p > n (in particular, p = n + 1), points of M are necessarily column-rank deficient, so that the corollary follows from Proposition 3.1. For p = n, if Y is rank deficient, use the same proposition. Otherwise, Y is invertible and SY = 0 implies S = 0, which is a fortiori positive semidefinite. By (2.5), this only happens if C = A*(μ) for some μ, in which case the cost function ⟨C, X⟩ = ⟨A*(μ), X⟩ = ⟨μ, b⟩ is constant over C. □

In this paper, we aim to secure optimality of second-order critical points for p less than n. As indicated by Proposition 3.1, the sole concern in that respect is the possible existence of full-rank second-order critical points. We first give a result which excludes the existence of full-rank first-order critical points (thus, a fortiori, of second-order critical points) for almost all cost matrices C, provided p is sufficiently large. The argument is by dimensionality counting.

Lemma 3.3.
Let p be such that p(p+1)/2 > rank A and such that Assumption 1.1 holds. Then, for almost all C, all critical points of (P) are column-rank deficient.

Proof. Let Y ∈ M be a critical point for (P). By the definition of S(Y) = C − A*(μ(Y)) (2.5) and the first-order condition S(Y)Y = 0, we have

    rank(Y) ≤ null(C − A*(μ(Y))) ≤ max_{ν ∈ ℝ^m} null(C − A*(ν)),    (3.1)

where null denotes the nullity (dimension of the kernel). This first step in the proof is inspired by [30, Thm. 3]. If the right-hand side evaluates to ℓ, then there exist ν and M = C − A*(ν) such that null(M) = ℓ. Writing C = M + A*(ν), we find that

    C ∈ N_ℓ + im(A*),    (3.2)

where N_ℓ denotes the set of symmetric matrices of size n with nullity ℓ and the + is a set-sum. The set N_ℓ has dimension

    dim N_ℓ = n(n+1)/2 − ℓ(ℓ+1)/2.    (3.3)

Assume the right-hand side of (3.1) evaluates to p or more. Then, a fortiori,

    C ∈ ⋃_{ℓ = p, …, n} (N_ℓ + im(A*)).    (3.4)

The set on the right-hand side contains all “bad” matrices C, that is, those for which (3.1) offers no information about the rank of Y. The dimension of that set is bounded as follows, using the fact that the dimension of a finite union is at most the maximal dimension, and the dimension of a finite sum of sets is at most the sum of the set dimensions:

    dim(⋃_{ℓ = p, …, n} (N_ℓ + im(A*))) ≤ dim(N_p + im(A*)) ≤ n(n+1)/2 − p(p+1)/2 + rank A.

Since C ∈ S^{n×n} lives in a space of dimension n(n+1)/2, almost no C verifies (3.4) if

    n(n+1)/2 − p(p+1)/2 + rank A < n(n+1)/2.

Hence, if p(p+1)/2 > rank A, for almost all C, critical points have rank(Y) < p. □

Theorem 1.4 follows as an easy corollary of Proposition 3.1 and Lemma 3.3.

In order to make a statement valid for all C, we further explore the implications of second-order criticality on the definiteness of S. For large p (though still smaller than n), we expect full-rank second-order critical points should indeed be optimal. The intuition is as follows. If Y ∈ M is a second-order critical point of rank p, then, by (2.8), SY = 0, so S has a kernel of dimension at least p. Furthermore, by (2.9), S has “positive curvature” along directions in T_YM, whose dimension grows with p. Overall, the larger p, the more conditions force S to have nonnegative eigenvalues. The main concern is to avoid double counting, as the two conditions are redundant along certain directions: this is where the facial structure of C comes into play.

The following theorem refines this intuition. We use ⊗ for Kronecker products and vec to vectorize a matrix by stacking its columns on top of each other, so that vec(AXB) = (B⊤ ⊗ A)vec(X). A real number a is rounded down as ⌊a⌋.

Theorem 3.4.
Let p be such that Assumption 1.1 holds. Let Y ∈ M be a second-order critical point for (P). The matrix X = YY⊤ belongs to the relative interior of the face F_X (2.10). If rank(Y) = p, then S = S(X) (2.5) has at most

    ⌊(dim F_X − ∆)/p⌋    (3.5)

negative eigenvalues, where

    ∆ = p(p+1)/2 − m′.    (3.6)

In particular, if dim F_X < ∆ + p, then S is positive semidefinite and both X and Y are globally optimal.

Proof. Consider the subspace vec(T_YM) of vectorized tangent vectors at Y: it has dimension k := dim M. Pick U ∈ ℝ^{np×k} with columns forming an orthonormal basis for that subspace: U⊤U = I_k. Then, U⊤(I_p ⊗ S)U has the same spectrum as ½ Hess g(Y). Indeed, for all Ẏ ∈ T_YM there exists x ∈ ℝ^k such that vec(Ẏ) = Ux, and, by (2.9),

    ½⟨Ẏ, Hess g(Y)[Ẏ]⟩ = ⟨Ẏ, SẎ⟩ = ⟨Ux, (I_p ⊗ S)Ux⟩ = ⟨x, U⊤(I_p ⊗ S)Ux⟩.

In particular, U⊤(I_p ⊗ S)U is positive semidefinite since Y is second-order critical.

Let V ∈ ℝ^{np×p²}, V⊤V = I_{p²}, have columns forming an orthonormal basis of the space spanned by the vectors vec(YR) for R ∈ ℝ^{p×p}: such V exists because rank(Y) = p. Indeed, vec(YR) = (I_p ⊗ Y)vec(R) and I_p ⊗ Y ∈ ℝ^{np×p²} then has full rank p². Since Y is a critical point, SY = 0, hence (I_p ⊗ S)V = 0.

Let k′ denote the dimension of the space spanned by the columns of both U and V, and let W ∈ ℝ^{np×k′}, W⊤W = I_{k′}, be an orthonormal basis for this space. It follows that M = W⊤(I_p ⊗ S)W is positive semidefinite. Indeed, for any z, there exist x, y such that Wz = Ux + Vy. Hence, z⊤Mz = x⊤U⊤(I_p ⊗ S)Ux ≥ 0.

Let λ_0 ≤ ⋯ ≤ λ_{n−1} denote the eigenvalues of S, and let λ̃_0 ≤ ⋯ ≤ λ̃_{np−1} denote the eigenvalues of I_p ⊗ S. The latter are simply the eigenvalues of S repeated p times, thus λ̃_i = λ_{⌊i/p⌋}. Let μ_0 ≤ ⋯ ≤ μ_{k′−1} denote the eigenvalues of M. The Cauchy interlacing theorem states that, for all i,

    λ̃_i ≤ μ_i ≤ λ̃_{i + np − k′}.    (3.7)

In particular, since M ⪰ 0, we have 0 ≤ μ_0 ≤ λ̃_{np−k′} = λ_{⌊(np−k′)/p⌋}, so that S has at most ⌊(np − k′)/p⌋ negative eigenvalues.

It remains to determine k′. From Proposition 1.2, recall that k = dim M = np − m′. We now investigate how many new dimensions V adds to U. All matrices R ∈ ℝ^{p×p} admit a unique decomposition as

    R = R_skew + R_{ker L} + R_{(ker L)⊥},

where R_skew is skew-symmetric, R_{ker L} is in the kernel of L_X (2.11) and R_{(ker L)⊥} is in the orthogonal complement of the latter in S^{p×p}. Recalling the definition of tangent vectors (2.1), it is clear that Ẏ = YR_skew is tangent. Similarly, Ẏ = YR_{ker L} is tangent because of the definition of L_X (2.11). Thus, vectorized versions of these are already in the span of U. On the other hand, by definition, YR_{(ker L)⊥} is not tangent at Y (if it is nonzero). This raises k′ (the rank of W) by

    dim (ker L_X)⊥ = p(p+1)/2 − dim ker L_X.

Since dim ker L_X = dim F_X, we have:

    k′ = np − m′ + p(p+1)/2 − dim F_X = np + ∆ − dim F_X.    (3.8)

Thus, np − k′ = dim F_X − ∆. Combine with λ_{⌊(np−k′)/p⌋} ≥ 0 to conclude. □

Theorem 1.6 follows easily from Corollary 2.9 and Theorem 3.4.
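The Cauchy interlacing step (3.7) is the engine of the proof above. The following sketch (plain NumPy, with hypothetical small sizes unrelated to the paper) verifies the interlacing inequalities for a random orthonormal compression:

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 8, 5   # hypothetical sizes standing in for np and k'

M = rng.standard_normal((N, N)); M = (M + M.T) / 2
W, _ = np.linalg.qr(rng.standard_normal((N, k)))   # orthonormal columns

lam = np.sort(np.linalg.eigvalsh(M))               # spectrum of M
mu = np.sort(np.linalg.eigvalsh(W.T @ M @ W))      # compressed spectrum

# Cauchy interlacing, as in (3.7): lam_i <= mu_i <= lam_{i + N - k}.
for i in range(k):
    assert lam[i] <= mu[i] + 1e-12
    assert mu[i] <= lam[i + N - k] + 1e-12
```

In the proof, the upper inequality applied to the smallest eigenvalue of the positive semidefinite compression is what caps the number of negative eigenvalues of S.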
Remark. What does it take for a second-order critical point Y ∈ M to be suboptimal? For local optima, the quote from Burer and Monteiro [13, §3] in the introduction readily states that Y must have rank p, and the face F_X (with X = YY⊤) must be positive dimensional and such that the cost function ⟨C, X⟩ is constant over F_X. Here, under Assumption 1.1 for p, Theorem 3.4 states that if Y is second-order critical and suboptimal, then F_X must have dimension ∆ + p or higher. Since (2.12) suggests generic faces at rank p have dimension ∆, this further shows that suboptimal second-order critical points, if they exist, can only occur if the cost function is constant over a high-dimensional face of C.

To use Theorem 3.4 in a particular application, one needs to obtain upper bounds on the dimensions of faces of C. We follow this path for a number of examples in Section 5.

Under Assumption 1.1, problem (P) is an example of smooth optimization over a smooth manifold. This suggests using
Riemannian optimization to solve it [3], as already proposed by Journée et al. [17] in a similar context. Importantly, known algorithms, in particular the Riemannian trust-region method (RTR), converge to second-order critical points regardless of initialization [2]. We state here a recent computational result to that effect [10].
Proposition 4.1.
Under Assumption 1.1, if C is compact, RTR initialized with any Y_0 ∈ M produces in O(1/(ε_g²ε_H) + 1/ε_H³) iterations a point Y ∈ M such that

    g(Y) ≤ g(Y_0),  ‖grad g(Y)‖ ≤ ε_g,  and  Hess g(Y) ⪰ −ε_H Id,

where g (2.4) is the cost function of (P).

Proof. Apply the main results of [10], using the fact that g has locally Lipschitz continuous gradient and Hessian in ℝ^{n×p} and M is a compact submanifold of ℝ^{n×p}. □

Importantly, only a finite number of iterations of any algorithm can be run in practice, so that only approximate second-order critical points can be computed. Thus, it is of interest to establish whether approximate second-order critical points are also approximately optimal. As a first step, we give a soft version of Corollary 2.9. We remark that the condition I_n ∈ im A* is satisfied in all examples of Section 5.

Lemma 4.2.
Let Assumption 1.1 hold for some p and assume C (1.1) is compact. For any Y on the manifold M, if ‖grad g(Y)‖ ≤ ε_g and S(Y) ⪰ −ε_H I_n, then the optimality gap at Y with respect to (SDP) is bounded as

    0 ≤ g(Y) − f⋆ ≤ ε_H R + ε_g √R,    (4.1)

where f⋆ is the optimal value of (SDP) and R = max_{X ∈ C} Tr(X) < ∞ measures the size of C. If I_n ∈ im(A*), the right-hand side of (4.1) can be replaced by ε_H R. This holds in particular if all X ∈ C have the same trace and C has a relative interior point (Slater condition).

Proof. By assumption on S(Y) = C − A*(μ(Y)) (2.5), with μ(Y) = G†A(CYY⊤),

    ∀X′ ∈ C,  −ε_H Tr(X′) ≤ ⟨S(Y), X′⟩ = ⟨C, X′⟩ − ⟨A*(μ(Y)), X′⟩ = ⟨C, X′⟩ − ⟨μ(Y), b⟩.

This holds in particular for X′ optimal for (SDP). Thus, we may set ⟨C, X′⟩ = f⋆; and certainly, Tr(X′) ≤ R. Furthermore,

    ⟨μ(Y), b⟩ = ⟨μ(Y), A(YY⊤)⟩ = ⟨C − S(Y), YY⊤⟩ = g(Y) − ⟨S(Y)Y, Y⟩.

Combining the displayed equations and using ½ grad g(Y) = S(Y)Y (2.8), we find

    0 ≤ g(Y) − f⋆ ≤ ε_H R + ½⟨grad g(Y), Y⟩.    (4.2)

In general, we do not assume I_n ∈ im(A*), and we get the result by Cauchy–Schwarz on (4.2) and ‖Y‖ = √Tr(YY⊤) ≤ √R:

    0 ≤ g(Y) − f⋆ ≤ ε_H R + (ε_g/2)√R ≤ ε_H R + ε_g √R.

But if I_n ∈ im(A*), then we show that Y is a normal vector at Y, so that it is orthogonal to grad g(Y). Formally: there exists ν ∈ ℝ^m such that I_n = A*(ν), and

    ⟨grad g(Y), Y⟩ = ⟨grad g(Y)Y⊤, I_n⟩ = ⟨A(grad g(Y)Y⊤), ν⟩ = 0,

since grad g(Y) ∈ T_YM (2.1). This indeed allows us to simplify (4.2).

To conclude, we show that if C has a relative interior point X′ (that is, A(X′) = b and X′ ≻ 0) and if Tr(X) is constant for X in C, then I_n ∈ im(A*). Indeed, S^{n×n} = im(A*) ⊕ ker A, so there exist ν ∈ ℝ^m and M ∈ ker A such that I_n = A*(ν) + M. Thus, for all X in C,

    0 = Tr(X − X′) = ⟨A*(ν) + M, X − X′⟩ = ⟨M, X − X′⟩.

This implies M is orthogonal to all X − X′. These span ker A since X′ is interior. Indeed, for any H ∈ ker A, since X′ ≻ 0, there exists t > 0 such that X := X′ + tH ⪰ 0; moreover, A(X) = b, so that X ∈ C. Hence, M ∈ ker A is orthogonal to ker A. Consequently, M = 0 and I_n = A*(ν). □
The lemma above involves a condition on the spectrum of S. Next, we show this condition is satisfied under an assumption on the spectrum of Hess g together with rank deficiency.

Lemma 4.3.
Let Assumption 1.1 hold for some p. If Y ∈ M is column-rank deficient and Hess g(Y) ⪰ −ε_H Id, then S(Y) ⪰ −(ε_H/2) I_n.

Proof. By assumption, there exists z ∈ R^p with ‖z‖ = 1 such that Yz = 0. Thus, for any x ∈ R^n, we can form Ẏ = xz^⊤: it is a tangent vector since YẎ^⊤ = Yzx^⊤ = 0, and ‖Ẏ‖ = ‖x‖. Then, condition (2.9) combined with the assumption on Hess g(Y) tells us

−ε_H ‖x‖² ≤ ⟨Ẏ, Hess g(Y)[Ẏ]⟩ = 2⟨Ẏ, S(Y)Ẏ⟩ = 2⟨xz^⊤zx^⊤, S(Y)⟩ = 2x^⊤S(Y)x.

This holds for all x ∈ R^n, hence S(Y) ⪰ −(ε_H/2) I_n as required. □

We now combine the two previous lemmas to form a soft optimality statement.
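The mechanism of Lemma 4.3 is easy to exercise numerically. The sketch below is our own illustration, not part of the paper's development: it specializes to Max-Cut-type constraints diag(YY^⊤) = 1, for which S(Y) = C − Diag(μ) with μ = diag(CYY^⊤) (our instantiation of (2.5)), and uses the standard Hessian quadratic form on the product of spheres, 2⟨C, ẎẎ^⊤⟩ − 2Σ_i μ_i‖ẏ_i‖² (also an assumption of this example).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3

# Random symmetric cost matrix.
C = rng.standard_normal((n, n))
C = (C + C.T) / 2

# Rank-deficient feasible point for the constraints diag(Y Y^T) = 1:
# last column is zero, rows normalized.
Y = rng.standard_normal((n, p))
Y[:, -1] = 0
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Dual matrix S(Y) = C - A*(mu(Y)); for diagonal constraints, A*(mu) = Diag(mu)
# with mu = diag(C Y Y^T).
mu = np.diag(C @ Y @ Y.T)
S = C - np.diag(mu)

# Tangent direction Ydot = x z^T with Y z = 0 (z = last canonical vector).
z = np.zeros(p)
z[-1] = 1.0
x = rng.standard_normal(n)
Ydot = np.outer(x, z)

# Hessian quadratic form of g(Y) = <CY, Y> on the product of spheres:
# <Ydot, Hess g(Y)[Ydot]> = 2 <C, Ydot Ydot^T> - 2 sum_i mu_i ||ydot_i||^2.
quad = 2 * np.sum(C * (Ydot @ Ydot.T)) - 2 * np.sum(mu * np.sum(Ydot**2, axis=1))
print(abs(quad - 2 * x @ S @ x))  # numerically zero
```

Along Ẏ = xz^⊤ with Yz = 0, the quadratic form collapses to 2x^⊤S(Y)x, which is exactly the inequality exploited in the proof above.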
Theorem 4.4.
Assume C is compact and let R < ∞ be the maximal trace of any X feasible for (SDP). For some p, let Assumption 1.1 hold for both p and p+1. For any Y ∈ M_p, form Ỹ = [Y | 0_{n×1}] in M_{p+1}. The optimality gap at Y is bounded as

0 ≤ 2(g(Y) − f⋆) ≤ √R ‖grad g(Y)‖ − R λ_min(Hess g(Ỹ)).   (4.3)

If all X ∈ C have the same trace R and there exists a positive definite feasible X, then the bound

0 ≤ 2(g(Y) − f⋆) ≤ −R λ_min(Hess g(Ỹ))   (4.4)

holds. If p > n, the bounds hold with Ỹ = Y (and Assumption 1.1 only needs to hold for p).

Proof. Since ỸỸ^⊤ = YY^⊤, we have S(Ỹ) = S(Y); in particular, g(Ỹ) = g(Y) and ‖grad g(Ỹ)‖ = ‖grad g(Y)‖. Since Ỹ has deficient column rank, apply Lemmas 4.2 and 4.3. For p > n, there is no need to form Ỹ, as Y itself necessarily has deficient column rank. □

This works well with Proposition 4.1. Indeed, equation (4.3) also implies the following:

λ_min(Hess g(Ỹ)) ≤ −(2(g(Y) − f⋆) − √R ‖grad g(Y)‖)/R.

That is, an approximate critical point Y in M_p which is far from optimal (for (SDP)) maps to a comfortably-escapable approximate saddle point Ỹ in M_{p+1}. This can be helpful for the development of optimization algorithms.

For p = n+1, the bound in Theorem 4.4 can be controlled a priori: approximate second-order critical points are approximately optimal, for any C. With p = n+1, problem (P) is no longer lower dimensional than (SDP), but it retains the advantage of not involving a positive semidefiniteness constraint.
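As a toy illustration of the p = n + 1 regime (our own example, not from the paper), consider the Max-Cut constraints diag(X) = 1 with cost C = −𝟙𝟙^⊤, where 𝟙 is the all-ones vector; the SDP optimal value f⋆ = −n² is known, attained at X = 𝟙𝟙^⊤. Simple block-coordinate updates on the factor — a stand-in for the RTR method of Proposition 4.1 — reach the global optimum from a random start:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
p = n + 1  # the "safe" rank discussed above

# Instance with known optimum: for C = -1 1^T, the SDP
# min <C, X> s.t. diag(X) = 1, X >= 0 has f_star = -n^2 (at X = 1 1^T).
C = -np.ones((n, n))
f_star = -float(n * n)

def cost(Y):
    return np.sum((C @ Y) * Y)  # <C, Y Y^T>

Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Block-coordinate minimization: row i minimizes the cost with the other
# rows fixed, i.e. y_i <- -normalize(sum_{j != i} C_ij y_j), which here is
# the normalized sum of the other rows.
for _ in range(1000):
    for i in range(n):
        v = Y.sum(axis=0) - Y[i]
        nv = np.linalg.norm(v)
        if nv > 1e-12:
            Y[i] = v / nv

print(cost(Y))  # approaches f_star = -36
```

The iterates stay feasible by construction, and the limit attains the SDP value despite the non-convexity of the factorized problem.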
Corollary 4.5.
Assume C is compact. Let Assumption 1.1 hold for p = n+1. If Y ∈ M_{n+1} satisfies both ‖grad g(Y)‖ ≤ ε_g and Hess g(Y) ⪰ −ε_H Id, then Y is approximately optimal in the sense that (with R = max_{X∈C} Tr(X)):

0 ≤ 2(g(Y) − f⋆) ≤ ε_g √R + ε_H R.

Under the same condition as in Theorem 4.4, the bound holds with right-hand side ε_H R instead.
Theorem 1.5 is an informal statement of this corollary.
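The corollary can be probed numerically. The sketch below is our own construction: it instantiates the Max-Cut constraints with C = −𝟙𝟙^⊤ (so f⋆ = −n² is known), takes an arbitrary feasible Y with p = n + 1, computes ε_g = ‖grad g(Y)‖ and ε_H = max(0, −λ_min(Hess g(Y))) in an explicit tangent basis, and checks 2(g(Y) − f⋆) ≤ ε_g√R + ε_H R with R = n. The formulas for the gradient and Hessian specialized to diagonal constraints are assumptions on our part, based on (2.5)–(2.9).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
p = n + 1
R_tr = float(n)  # R = max trace over {X >= 0 : diag(X) = 1}

C = -np.ones((n, n))       # known optimum: f_star = -n^2
f_star = -float(n * n)

# An arbitrary feasible point (unit rows), not necessarily critical.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

mu = np.diag(C @ Y @ Y.T)  # multipliers for the diagonal constraints
S = C - np.diag(mu)
g = np.sum((C @ Y) * Y)
eps_g = np.linalg.norm(2 * S @ Y)  # Riemannian gradient norm

# Riemannian Hessian in an orthonormal tangent basis: per row, a basis B_i
# of the hyperplane orthogonal to y_i; block (i, j) of the Hessian is
# 2 C_ij B_i^T B_j - 2 mu_i [i == j] I.
B = []
for i in range(n):
    U = np.linalg.svd(np.eye(p) - np.outer(Y[i], Y[i]))[0]
    B.append(U[:, : p - 1])
k = p - 1
H = np.zeros((n * k, n * k))
for i in range(n):
    for j in range(n):
        blk = 2 * C[i, j] * (B[i].T @ B[j])
        if i == j:
            blk -= 2 * mu[i] * np.eye(k)
        H[i * k:(i + 1) * k, j * k:(j + 1) * k] = blk

eps_H = max(0.0, -np.linalg.eigvalsh(H).min())
print(2 * (g - f_star) <= eps_g * np.sqrt(R_tr) + eps_H * R_tr + 1e-9)  # True
```

Since p = n + 1, Y is automatically column-rank deficient, so the bound applies at every feasible point, not only at near-critical ones.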
In all applications below, Assumption 1.1a holds for all p such that the search space is non-empty. For each one, we deduce the consequences of Theorems 1.4 and 1.6. For the latter, the key part is to investigate the facial structure of the SDP. As everywhere else in the paper, ‖x‖ denotes the 2-norm of a vector x and ‖X‖ denotes the Frobenius norm of a matrix X.

The generalized symmetric eigenvalue problem admits a well-known extremal formulation:

min_{x∈R^n} x^⊤Cx subject to x^⊤Bx = 1,   (EIG)

where C, B are symmetric of size n ≥ 2. The usual relaxation by lifting introduces X = xx^⊤ and discards the constraint rank(X) = 1:

min_{X∈S^{n×n}} ⟨C, X⟩ subject to ⟨B, X⟩ = 1, X ⪰ 0.   (EIG-SDP)

Let C denote the search space of (EIG-SDP). It is non-empty and compact if and only if B ≻ 0, which we now assume. A direct application of (2.13) guarantees all extreme points of C have rank 1, so that it always admits a solution of rank 1: the SDP relaxation is always tight, which is well known. Under our assumption, B admits a Cholesky factorization B = R^⊤R with R ∈ R^{n×n} invertible. The corresponding Burer–Monteiro formulation at rank p reads:

min_{Y∈R^{n×p}} ⟨CY, Y⟩ subject to ‖RY‖ = 1.   (EIG-BM)

Let M denote its search space. Assumption 1.1a holds for any p ≥ 1 with m′ = 1: for Y ∈ M, {BY} spans a subspace of dimension 1, since BY = R^⊤RY, RY ≠ 0 and R^⊤ is invertible. Thus, Theorem 1.4 readily states that for p ≥ 2, for almost all C, all second-order critical points of (EIG-BM) are optimal.

We can do better. The facial structure of C is easily described. Recalling (2.12), for all X = YY^⊤ ∈ C we have dim F_X = p(p+1)/2 − 1, since Y^⊤BY ≠ 0. Hence, by Theorem 1.6, for any value of p ≥ 1, all second-order critical points of (EIG-BM) are optimal (for any C). In particular, for p = 1:

Corollary 5.1.
All second-order critical points of (EIG) are optimal.
This is a well-known fact, though usually proven by direct inspection of necessary optimality conditions.
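As a numerical illustration (our own sketch, not from the paper), projected gradient descent for (EIG-BM) with p = 1 — which, after the change of variables u = Rx, is a shifted power method on M = R^{−⊤}CR^{−1} — generically converges to a second-order critical point and hence, by Corollary 5.1, to a global optimum of (EIG):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
G = rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)  # symmetric positive definite

# Cholesky B = R^T R; with u = R x, (EIG-BM) at p = 1 becomes
# min_{||u|| = 1} u^T M u, where M = R^{-T} C R^{-1}.
R = np.linalg.cholesky(B).T
Rinv = np.linalg.inv(R)
M = Rinv.T @ C @ Rinv

# Projected gradient descent with step 1/s equals the shifted power
# method u <- normalize((s I - M) u).
s = 1.1 * np.linalg.norm(M, 2)
A_shift = s * np.eye(n) - M
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
for _ in range(50000):
    u = A_shift @ u
    u /= np.linalg.norm(u)

x = Rinv @ u       # back to the original variable: x^T B x = 1
lam = x @ C @ x    # value of (EIG) at x
print(abs(lam - np.linalg.eigvalsh(M).min()) < 1e-6)  # True
```

The computed value matches the smallest generalized eigenvalue, as predicted by the corollary (the iteration count is generous; convergence is typically much faster).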
The trust-region subproblem consists in minimizing a quadratic function on a sphere, with n ≥ 2:

min_{x∈R^n} x^⊤Ax + 2b^⊤x + c subject to ‖x‖ = 1.   (TRS)

It is not difficult to produce (A, b, c) such that (TRS) admits suboptimal second-order critical points. The usual lifting here introduces

X = (x; 1)(x^⊤ 1) = [xx^⊤ x; x^⊤ 1],  and  C = [A b; b^⊤ c].

The quadratic cost and constraint are linear in X, yielding this SDP relaxation:

min_{X∈S^{(n+1)×(n+1)}} ⟨C, X⟩ subject to Tr(X_{1:n,1:n}) = 1, X_{n+1,n+1} = 1, X ⪰ 0.   (TRS-SDP)

Let C denote the search space of (TRS-SDP). It is non-empty and compact. Here too, a direct application of (2.13) guarantees the SDP relaxation is always tight (it always admits a solution of rank 1), which is a well-known fact related to the S-lemma [25]. The Burer–Monteiro relaxation at rank p reads:

min_{Y₁∈R^{n×p}, y₂∈R^p} ⟨CY, Y⟩ subject to ‖Y₁‖ = 1, ‖y₂‖ = 1, with Y = [Y₁; y₂^⊤].   (TRS-BM)

Let M denote its search space. After verifying Assumption 1.1 holds (see below), application of Theorem 1.4 guarantees that for p ≥ 2 and for almost all (A, b, c), all second-order critical points of (TRS-BM) are optimal. We can further strengthen this result by looking at the faces of C, as we do now.

Lemma 5.2.
Assumption 1.1a holds for any p ≥ 1 with m′ = 2. Furthermore, for X ∈ C of rank p,

dim F_X = 0 if p = 1, and dim F_X = p(p+1)/2 − 2 if p ≥ 2.

Proof.
The constraints of (SDP) are defined by

A₁ = [I_n 0; 0 0], b₁ = 1,  and  A₂ = [0 0; 0 1], b₂ = 1.

For Y ∈ M, we have

A₁Y = [Y₁; 0_{1×p}],  A₂Y = [0_{n×p}; y₂^⊤].

These are nonzero and always linearly independent, so that dim span{A₁Y, A₂Y} = 2 for all Y ∈ M, which confirms Assumption 1.1a holds with m′ = 2.

The facial structure of C is simple as well. Let X ∈ C have rank p and consider Y ∈ M such that X = YY^⊤. To use (2.12), note that:

Y^⊤A₁Y = Y₁^⊤Y₁,  Y^⊤A₂Y = y₂y₂^⊤.

These are nonzero. For p = 1, they are scalars: they span a subspace of dimension 1. Then, dim F_X = 1 − 1 = 0. For p > 1, we argue they are linearly independent. Indeed, if they are not, there exists α ≠ 0 such that Y₁^⊤Y₁ = α·y₂y₂^⊤. If so, Y₁ must have rank 1 with row space spanned by y₂, so that Y₁ = zy₂^⊤ for some z ∈ R^n with ‖z‖ = 1. As a result, Y itself has rank 1, which is a contradiction. Thus, dim F_X = p(p+1)/2 − 2, as announced. □

Combining the latter with Theorem 1.6 yields the following new result, which holds for all (A, b, c). Notice that for p = 1, the theorem correctly allows second-order critical points to be suboptimal in general.
Corollary 5.3.
For p ≥ 2, all second-order critical points of (TRS-BM) are globally optimal.

A second-order critical point Y of (TRS-BM) with p = 2 has rank 1 or 2. If Y has rank 1, it is straightforward to extract a solution of (TRS) from it. If Y has rank 2, it maps to a face of dimension 1. The endpoints of that face have rank 1 and are also optimal. The following lemma shows these can be computed easily from Y by solving two scalar equations.

Lemma 5.4. Let Y ∈ M be a second-order critical point of (TRS-BM) with p = 2, and let z ∈ R² satisfy ‖Y₁z‖ = 1 and y₂^⊤z = 1. Then, Y₁z is a global optimum of (TRS).

Proof. If rank(Y) = 1, then Y₁ = xy₂^⊤ for some x ∈ R^n, and ‖Y₁‖ = ‖y₂‖ = ‖x‖ = 1. Solutions to y₂^⊤z = 1 are of the form z = y₂ + u, where y₂^⊤u = 0. For any such z, Y₁z = x, which is indeed optimal for (TRS) since Y is globally optimal for (TRS-BM) and x attains the same cost for the restricted problem (TRS).

Now assume rank(Y) = 2. By (2.10), the one-dimensional face F_{YY^⊤} contains all matrices of the form Y(I − M)Y^⊤ such that I − M ⪰ 0 and ⟨M, Y₁^⊤Y₁⟩ = ⟨M, y₂y₂^⊤⟩ = 0. This face has two extreme points of rank 1, for which I − M is a positive semidefinite matrix of rank 1, so that I − M = zz^⊤ for some z ∈ R². Given that Yzz^⊤Y^⊤ must be feasible, the conditions on z are ‖Y₁z‖ = 1 and y₂^⊤z = ±1. These equations define an ellipse in R² and two parallel lines, totaling four intersections ±z, ±z′, which can be computed explicitly. Fixing y₂^⊤z = +1 resolves the sign ambiguity. □

(Footnote to the remark preceding Corollary 5.3: suboptimal second-order critical points at p = 1 can indeed occur, notably if (A, b, c) forms a so-called hard-case TRS (details omitted). This observation shows that it is indeed necessary to exclude some non-trivial matrices C in Lemma 3.3.)

The trust-region subproblem generalizes to the optimization of a quadratic function over k spheres, possibly in different dimensions n₁, …, n_k ≥ 1:

min_{x_i∈R^{n_i}, i=1…k} x^⊤Cx subject to ‖x₁‖ = ⋯ = ‖x_k‖ = 1,   (Spheres)

with x^⊤ = (x₁^⊤ ⋯ x_k^⊤ 1). The variable x is in R^{n+1}, with n = n₁ + ⋯ + n_k. Since the last entry of x is 1, this indeed covers all possible quadratic functions of x₁, …, x_k. The SDP relaxation by lifting reads:

min_{X∈S^{(n+1)×(n+1)}} ⟨C, X⟩ subject to Tr(X₁₁) = ⋯ = Tr(X_kk) = 1, X_{n+1,n+1} = 1, X ⪰ 0,   (Spheres-SDP)

where X_ij denotes the block of size n_i × n_j of matrix X, in the obvious way. This SDP has a non-empty compact search space and m = k + 1 constraints, so that (2.13) guarantees a solution of rank at most p⋆, the smallest integer with p⋆(p⋆+1)/2 ≥ k + 1. The Burer–Monteiro relaxation at rank p reads:

min_{Y∈R^{(n+1)×p}} ⟨CY, Y⟩ subject to ‖Y₁‖ = ⋯ = ‖Y_k‖ = 1, ‖y‖ = 1,   (Spheres-BM)

with Y^⊤ = (Y₁^⊤ ⋯ Y_k^⊤ y), where Y_i ∈ R^{n_i×p} and y ∈ R^p. It is easily checked that Assumption 1.1a holds for all p ≥ 1. Thus, Theorem 1.4 gives this result:
Corollary 5.5.
For p > (√(8k+9) − 1)/2 and for almost all C, all second-order critical points of (Spheres-BM) are optimal and map to optima of (Spheres-SDP).

To apply Theorem 1.6, we first investigate the facial structure of the SDP.
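As a numerical companion to the facial computations below (our own sketch; the block sizes are arbitrary choices), one can evaluate the right-hand side of (2.12) at a random feasible point, treating (2.12) as an equality at full-rank Y as in the proof of Lemma 5.6:

```python
import numpy as np

rng = np.random.default_rng(5)
k, p = 2, 3
sizes = [2, 3]  # n_1, n_2: arbitrary block sizes for this sketch

# Feasible point for (Spheres-BM): each block Y_i has unit Frobenius
# norm, and the extra row y has unit norm.
blocks = []
for ni in sizes:
    Yi = rng.standard_normal((ni, p))
    blocks.append(Yi / np.linalg.norm(Yi))
y = rng.standard_normal(p)
y /= np.linalg.norm(y)

# (2.12): dim F_{YY^T} = p(p+1)/2 - dim span{Y_1^T Y_1, ..., Y_k^T Y_k, y y^T}.
gram = [Yi.T @ Yi for Yi in blocks] + [np.outer(y, y)]
span = np.linalg.matrix_rank(np.stack([Mx.ravel() for Mx in gram]))
dim_face = p * (p + 1) // 2 - span
print(span, dim_face)  # generically k + 1 = 3 and 3
```

Generically the k + 1 Gram matrices are linearly independent, so the face dimension drops well below the worst-case bound of the lemma that follows.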
Lemma 5.6.
Let Y be feasible for (Spheres-BM) and have full rank p. The dimension of the face of the search space of (Spheres-SDP) at YY^⊤ obeys: dim F_{YY^⊤} ≤ p(p+1)/2 − 2 if p ≥ 2, and dim F_{YY^⊤} = 0 if p = 1.

Proof. Following (2.12),

dim F_{YY^⊤} = p(p+1)/2 − dim span(Y₁^⊤Y₁, …, Y_k^⊤Y_k, yy^⊤).

Since Y is feasible, each defining element of the span is nonzero, so that the dimension is at least 1. If p = 1, these elements are scalars: they span R. Now consider p ≥ 2 and suppose, for contradiction, that the span has dimension 1. Then, for each i, Y_i^⊤Y_i = α_i · yy^⊤ for some nonzero α_i. If so, each Y_i has rank 1 and there exists z_i ∈ R^{n_i} such that Y_i = z_iy^⊤. In turn, this implies Y has rank 1, which is a contradiction. Thus, the span has dimension at least two. □

Corollary 5.7.
For p ≥ max(2, k), all second-order critical points of (Spheres-BM) are optimal and map to optima of (Spheres-SDP) (for any C).

For k = 1, this recovers the main result about the trust-region subproblem. If the cost function in (Spheres) is a homogeneous quadratic, then it can be written as

min_{x_i∈R^{n_i}, i=1…k} x^⊤Cx subject to ‖x₁‖ = ⋯ = ‖x_k‖ = 1,   (SpheresH)

with x^⊤ = (x₁^⊤ ⋯ x_k^⊤). The corresponding relaxation and Burer–Monteiro formulations read:

min_{X∈S^{n×n}} ⟨C, X⟩ subject to Tr(X₁₁) = ⋯ = Tr(X_kk) = 1, X ⪰ 0,   (SpheresH-SDP)

and:

min_{Y∈R^{n×p}} ⟨CY, Y⟩ subject to ‖Y₁‖ = ⋯ = ‖Y_k‖ = 1,   (SpheresH-BM)

with Y^⊤ = (Y₁^⊤ ⋯ Y_k^⊤). Assumption 1.1a holds for all p ≥ 1 with m′ = k. A similar analysis of the facial structure yields the following corollary of Theorem 1.6.

Corollary 5.8. For almost all C, provided p > (√(8k+1) − 1)/2, all second-order critical points of (SpheresH-BM) are optimal and map to optima of (SpheresH-SDP). If p ≥ k, the result holds for all C.

For k = 1, this recovers the results for (EIG) with B = I_n.

Let n = qd for some integers q, d. Consider the semidefinite program

min_{X∈S^{n×n}} ⟨C, X⟩ subject to sbd(X) = I_n, X ⪰ 0,   (OrthoCut)

where sbd: S^{n×n} → S^{n×n} preserves the diagonal blocks of size d×d and zeros out all other blocks. Specifically, with X_ij denoting the (i, j)th block of size d×d in matrix X,

sbd(X)_ij = X_ii if i = j, and 0_{d×d} otherwise.
For example, with d = 1, the constraint sbd(X) = I_n is equivalent to diag(X) = 1, and this SDP is the Max-Cut SDP [15]. For general d, the diagonal blocks of X of size d×d are constrained to be identity matrices: this SDP is known as Orthogonal-Cut [6, 9]. Among other uses, it appears as a relaxation of synchronization on Z₂ = {±1} [5, 21, 1] and of synchronization of rotations [28, 14], with applications in stochastic block modeling (community detection) and SLAM (simultaneous localization and mapping for robotics).

The Stiefel manifold St(p, d) is the set of matrices of size p×d with orthonormal columns. The Burer–Monteiro formulation of (OrthoCut) is an optimization problem over q copies of St(p, d):

min_{Y₁,…,Y_q∈R^{p×d}} ⟨CY, Y⟩ subject to Y_k^⊤Y_k = I_d ∀k, with Y^⊤ = [Y₁ ⋯ Y_q].   (OrthoCut-BM)

For d = 1, this problem captures one side of the Grothendieck inequality [18, eq. (1.1)]. Assumption 1.1a holds for all p ≥ d with m′ = q·d(d+1)/2 (which is the number of constraints). Theorem 1.4 applies as follows.

Corollary 5.9. If p > (√(4n(d+1)+1) − 1)/2, for almost all C, any second-order critical point Y of (OrthoCut-BM) is a global optimum, and X = YY^⊤ is globally optimal for (OrthoCut).

In order to apply Theorem 1.6, we must investigate the facial structure of C = {X ∈ S^{n×n} : sbd(X) = I_n, X ⪰ 0}. The following result generalizes a result in [19, Thm. 3.1(i)] to d ≥ 1.

Theorem 5.10.
If X ∈ C has rank p, then the face F_X (2.10) has dimension bounded as:

p(p+1)/2 − n(d+1)/2 ≤ dim F_X ≤ p(p+1)/2 − p(d+1)/2.   (5.1)

If p is an integer multiple of d, the upper bound is attained for some X.
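Before the proof (deferred to Appendix C), the bounds (5.1) are easy to probe numerically for d = 1, where C is the elliptope. The sketch below is ours: it computes dim F_X via the rank of {y_iy_i^⊤}, using (2.12) as an equality at full-rank Y.

```python
import numpy as np

rng = np.random.default_rng(11)
d = 1
n, p = 8, 3

# Random rank-p feasible point of (OrthoCut) with d = 1 (the elliptope):
# a correlation matrix X = Y Y^T with unit-norm rows.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# dim F_X = p(p+1)/2 - rank(L_X), where rank(L_X) is the dimension of
# span{y_i y_i^T} (the constraints <A, y_i y_i^T> = 0 on A in S^p).
span = np.linalg.matrix_rank(
    np.stack([np.outer(Y[i], Y[i]).ravel() for i in range(n)])
)
dim_face = p * (p + 1) // 2 - span

lower = p * (p + 1) // 2 - n * (d + 1) // 2
upper = p * (p + 1) // 2 - p * (d + 1) // 2
print(lower <= dim_face <= upper)  # True, as (5.1) predicts
```

A random X typically lies on a face of dimension 0; the upper bound of (5.1) is attained only for specially structured X, as the theorem states.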
The proof is in Appendix C. Combining this with Theorem 1.6 yields the following result.
Corollary 5.11.
If p > n(d+1)/(d+3), any second-order critical point Y for (OrthoCut-BM) is globally optimal, and X = YY^⊤ is globally optimal for (OrthoCut). In particular, for the Max-Cut SDP (d = 1), the requirement is p > n/2.

Proof. If Y is rank deficient, use Proposition 3.1. Otherwise, since rank(X) = p, Theorem 5.10 gives dim F_X ≤ p(p+1)/2 − p(d+1)/2, and Theorem 1.6 gives optimality if dim F_X < p(p+1)/2 − n(d+1)/2 + p. This is the case provided (n − p)(d+1) < 2p, that is, if p > n(d+1)/(d+3). □

We now discuss the assumptions that appear in the main theorems.

The starting point of this investigation is the hope to solve (SDP) by solving (P) instead. For smooth, non-convex optimization problems, even verifying local optimality is usually hard [22]. Thus, we wish to restrict our attention to efficiently computable points, such as points which satisfy first- and second-order Karush–Kuhn–Tucker (KKT) conditions for (P)—see [12]. A global optimum Y necessarily satisfies KKT conditions if constraint qualifications (CQs) hold at Y [29]. The standard CQs for equality constrained programs are Robinson's conditions or metric regularity (they are here equivalent). They read as follows:

CQs hold at Y ∈ M if A₁Y, …, A_mY are linearly independent in R^{n×p}.   (CQ)

Considering all cost matrices C, global optima could, a priori, be anywhere in M. Thus, we require CQs to hold at all Y in M rather than only at the (unknown) global optima. This leads to Assumption 1.1a. Adding redundant constraints (for example, duplicating ⟨A₁, X⟩ = b₁) would break the CQs, but does not change the optimization problem. This is allowed by Assumption 1.1b.

In general, (SDP) may not have an optimal solution. One convenient way to guarantee that it does is to require C to be compact, which is why this assumption appears in Theorem 1.5 to bound optimality gaps for approximate second-order critical points.
When C is compact, one furthermore gets the guarantee that at least one of the global optima is an extreme point of C, which leads to the guarantee that at least one of the global optima has rank p bounded as p(p+1)/2 ≤ m′ (2.13). The other way around, it is possible to pick the cost matrix C such that the unique solution to (SDP) is an extreme point of maximal rank, which can be as large as allowed by (2.13). This justifies why, in Theorem 1.4, the bound on p is essentially optimal. The compactness assumption could conceivably be relaxed, provided candidate global optima remain bounded. This could plausibly come about by restricting attention to positive definite cost matrices C.

One restriction in particular in Theorem 1.4 merits further investigation: the exclusion of a zero-measure set of cost matrices ("bad C"). From the trust-region subproblem example in Section 5.2, we know that it is necessary (in general) to allow the exclusion of a zero-measure set of cost matrices in Lemma 3.3. Yet, in that same example, the excluded cost matrices do not give rise to suboptimal second-order critical points (as we proved through a different argument involving Theorem 1.6). Thus, it remains unclear whether or not a zero-measure set of cost matrices must be excluded in Theorem 1.4. Resolving this question is key to gaining a deeper understanding of the relationship between (SDP) and (P).
Finally, we connect the notion of smooth SDP used in this paper to the more standard notion of non-degeneracy in SDPs as defined in [4, Def. 5]. Informally: for linearly independent A_i, non-degeneracy at all points is equivalent to smoothness. The proof is in Appendix D.

Definition 6.1. X is primal non-degenerate for (SDP) if it is feasible and T_X + ker A = S^{n×n}, where T_X is the tangent space at X to the manifold of symmetric matrices of rank r embedded in S^{n×n}, with r = rank(X).

Proposition 6.2.
Let A₁, …, A_m defining A be linearly independent. Then, Assumption 1.1a holds for all p such that M_p is non-empty if and only if all X ∈ C are primal non-degenerate.

We have shown how, under Assumption 1.1 and extra conditions (on p, compactness, and the cost matrix), the Burer–Monteiro factorization approach to solving (SDP) is "safe", despite non-convexity. For future research, it is of interest to determine if the proposed assumptions can be relaxed. Furthermore, it is important for practical purposes to determine whether approximate second-order critical points are approximately optimal for values of p well below n (an example of this for a specific context is given in [5]). One possible way forward is a smoothed analysis of the type developed recently in [8, 26], though these early works leave plenty of room for improvement.

Appendix A: Consequences and properties of Assumption 1.1
Proof of Proposition 1.2.
The set M is defined as the zero level set of Φ: R^{n×p} → R^m, where Φ(Y) = A(YY^⊤) − b. The differential of Φ at Y, DΦ(Y), has rank equal to the dimension of the space spanned by {A₁Y, …, A_mY}. Under Assumption 1.1a, DΦ(Y) has full rank m on M and the result follows from [20, Corollary 5.14]. Under Assumption 1.1b, DΦ(Y) has constant rank m′ in a neighborhood of M and the result follows from [20, Theorem 5.12]. □

Proof of Proposition 1.3.
First, let Assumption 1.1a hold for some p, and consider p′ < p such that M_{p′} is non-empty. For any Y′ ∈ M_{p′}, form Y = [Y′ | 0_{n×(p−p′)}] ∈ R^{n×p}. Clearly, Y is in M_p, so that

m = dim span{A₁Y, …, A_mY} = dim span{A₁Y′, …, A_mY′},

as desired. For p = n, we now consider the case p′ > n. Let Y′ ∈ M_{p′} and consider its full SVD, Y′ = UΣV^⊤, with Σ ∈ R^{n×p′}. Then, Y′V is in M_{p′} as well. Since the last p′ − n columns of Σ are zero, we have Y′V = UΣ = [Y | 0_{n×(p′−n)}] with Y ∈ M_n. Thus, as desired,

dim span{A₁Y′, …, A_mY′} = dim span{A₁Y′V, …, A_mY′V} = dim span{A₁Y, …, A_mY} = m.

Second, let Assumption 1.1b hold for some p, and consider p′ < p such that M_{p′} is non-empty. For any Y′ ∈ M_{p′}, form Y = [Y′ | 0_{n×(p−p′)}] ∈ M_p. By assumption, there exists an open ball B_Y in R^{n×p} of radius ε = ε(Y) > 0 centered at Y such that dim span{A₁Ỹ, …, A_mỸ} = m′ for all Ỹ ∈ B_Y. Let B_{Y′} be the open ball in R^{n×p′} of radius ε(Y) and center Y′. For any Ỹ′ ∈ B_{Y′}, form Ỹ = [Ỹ′ | 0_{n×(p−p′)}]. Since ‖Ỹ − Y‖ = ‖Ỹ′ − Y′‖ < ε, we have Ỹ ∈ B_Y, so that

m′ = dim span{A₁Ỹ, …, A_mỸ} = dim span{A₁Ỹ′, …, A_mỸ′}.

Thus, Assumption 1.1b holds with the open neighborhood of M_{p′} consisting of the union of all balls B_{Y′} for Y′ ∈ M_{p′} as described above. □

Appendix B: The facial structure of C

Proof of Proposition 2.7.
The construction follows [24] and applies for any linear equality constraints. We first show that if X′ is of the form in (2.10), then it must be in F_X. This is clear if X′ = X. Otherwise, write X′ = Y(I_p + A)Y^⊤ and pick t > 0 small enough that I_p − tA ⪰ 0. Then, X′ and X″ = Y(I_p − tA)Y^⊤ define a closed line segment in C whose relative interior contains X. By Definition 2.5, this implies X′ (and X″) are in F_X.

The other way around, we now show that any point in F_X must be of the form of X′ in (2.10). Let W ∈ S^{n×n} be such that X′ = X + W. Since X is in the relative interior of F_X, which is convex, there exists t > 0 such that X − tW ∈ F_X. Let Y⊥ ∈ R^{n×(n−p)} be such that M = [Y Y⊥] is invertible. We can express X = YY^⊤ and W as

X = M [I_p 0; 0 0] M^⊤  and  W = M [A B; B^⊤ C] M^⊤.

Then, explicitly, these two matrices must belong to C:

X + W = M [I_p + A, B; B^⊤, C] M^⊤,  and  X − tW = M [I_p − tA, −tB; −tB^⊤, −tC] M^⊤.

In particular, they must both be positive semidefinite, which implies C ⪰ 0 and −tC ⪰ 0, so that C = 0. By a Schur complement argument, it then follows that B = 0. Thus, W = YAY^⊤ for some A ∈ S^{p×p} such that I_p + A ⪰ 0. Furthermore, A(X′) = A(X + W) = b, so that A(W) = 0. The latter is equivalent to L_X(A) = 0. □
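The parametrization (2.10) just established can be exercised numerically. The following sketch is our own: it specializes to the Max-Cut constraints diag(X) = 1, builds a symmetric A with L_X(A) = 0 and I + A ≻ 0, and checks that X′ = Y(I + A)Y^⊤ remains feasible, as Proposition 2.7 predicts.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 4, 3  # p(p+1)/2 = 6 > n, so L_X has a nontrivial kernel

# Feasible Y for the Max-Cut constraints diag(Y Y^T) = 1, full rank p.
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
X = Y @ Y.T

# Basis of S^{p x p}, and the constraints <A, y_i y_i^T> = 0 (that is,
# L_X(A) = 0) expressed on the basis coefficients.
basis = []
for a in range(p):
    for b in range(a, p):
        E = np.zeros((p, p))
        E[a, b] = E[b, a] = 1.0
        basis.append(E)
cons = np.array([[np.sum(E * np.outer(Y[i], Y[i])) for E in basis]
                 for i in range(n)])

# A kernel direction of the 4 x 6 constraint matrix gives a valid A.
coeff = np.linalg.svd(cons)[2][-1]
A = sum(c * E for c, E in zip(coeff, basis))
A *= 0.5 / np.abs(np.linalg.eigvalsh(A)).max()  # now I + A > 0

# X' = Y (I + A) Y^T stays in the face F_X; in particular it is feasible.
Xp = Y @ (np.eye(p) + A) @ Y.T
print(np.allclose(np.diag(Xp), 1.0), np.linalg.eigvalsh(Xp).min() > -1e-10)
```

The point X′ is distinct from X yet satisfies the same constraints, tracing out the face F_X as A varies over the kernel of L_X.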
Appendix C: Faces of the Ortho-Cut SDP
Proof of Theorem 5.10.
Consider the definition of L_X (2.11) and inequality (2.12): the latter covers the lower bound, and shows that to establish the upper bound it suffices to prove rank L_X ≥ p(d+1)/2, that is, to exhibit p(d+1)/2 linearly independent equations among the constraints L_X(A) = 0, A ∈ S^{p×p}.

Let Y ∈ M_p be such that X = YY^⊤, and let y₁, …, y_n ∈ R^p denote the rows of Y, transposed. Greedily select p linearly independent rows of Y, in order, such that row i is picked iff it is linearly independent from the rows already selected among y₁ to y_{i−1}. This is always possible since Y has rank p. Write t = {t₁ < ⋯ < t_p} to denote the indices of selected rows. Write s_k = {(k−1)d + 1, …, kd} to denote the indices of rows in slice Y_k^⊤, and let c_k = s_k ∩ t be the indices of selected rows in that slice.

We make use of the following fact [19, Lem. 2.1]: for x₁, …, x_p ∈ R^p linearly independent, the p(p+1)/2 matrices x_ix_j^⊤ + x_jx_i^⊤ form a basis of S^{p×p}. Defining E_ij = y_iy_j^⊤ + y_jy_i^⊤ = E_ji, this means E = {E_{t_ℓ t_ℓ′} : ℓ, ℓ′ = 1 … p} forms a basis of S^{p×p} (E is a set, so that E_ij and E_ji contribute only one element). Similarly, since each slice Y_k^⊤ has orthonormal rows, the matrices in {E_ij : i, j ∈ s_k} are linearly independent.

The constraint L_X(A) = 0 amounts to ⟨A, E_ij⟩ = 0 for each k and for each i, j ∈ s_k. To establish the theorem, we need to extract a subset T of at least p(d+1)/2 linearly independent elements among these qd(d+1)/2 constraints. Consider:

T = {E_ij : k ∈ {1, …, q} and i ∈ c_k ⊆ s_k, j ∈ s_k}.   (C.1)

That is, for each slice k, T includes all constraints of that slice which involve at least one of the selected rows. For each slice k, there are |c_k|d − |c_k|(|c_k|−1)/2 such constraints—note the correction for double-counting the E_ij's where both i and j are in c_k. Thus, using |c₁| + ⋯ + |c_q| = p, the cardinality of T is:

|T| = Σ_{k=1}^q [|c_k|d − |c_k|(|c_k|−1)/2] = p(d + 1/2) − (1/2) Σ_{k=1}^q |c_k|².   (C.2)

We first show matrices in T are linearly independent. Then, we show |T| is large enough.

Consider one E_ij ∈ T: i, j ∈ s_k for some k and i = t_ℓ for some ℓ (otherwise, permute i and j). By construction of t, we can expand y_j in terms of the rows selected in slices 1 to k, i.e., y_j = Σ_{ℓ′=1}^{ℓ_k} α_{j,ℓ′} y_{t_ℓ′}, where ℓ_k = |c₁| + ⋯ + |c_k|. As a result, E_ij expands in the basis E as follows: E_ij = Σ_{ℓ′=1}^{ℓ_k} α_{j,ℓ′} E_{t_ℓ t_ℓ′}. As noted before, E_ij's in T contributed by a same slice k are linearly independent. Furthermore, they expand in only a subset of the basis E, namely, E^(k) = {E_{t_ℓ t_ℓ′} : ℓ_{k−1} < ℓ ≤ ℓ_k, ℓ′ ≤ ℓ_k}: t_ℓ is a selected row of slice k and t_ℓ′ is a selected row of some slice between 1 and k. For k ≠ k′, E^(k) and E^(k′) are disjoint; in fact, they form a partition of E. Hence, elements of T are linearly independent.

It remains to lower bound (C.2). To this end, use |c_k| ≤ d and |c₁| + ⋯ + |c_q| = p to get:

Σ_{k=1}^q |c_k|² ≤ max_{x ∈ R^q : ‖x‖_∞ ≤ d, ‖x‖₁ = p} ‖x‖₂² = ⌊p/d⌋ d² + (p − ⌊p/d⌋ d)² ≤ pd.

Indeed, the maximum in x is attained by making as many of the entries of x as large as possible, that is, by setting ⌊p/d⌋ entries to d and setting one other entry to p − ⌊p/d⌋d if the latter is nonzero. This many entries are available since p ≤ qd = n. That this is optimal can be verified using KKT conditions. In combination with (C.2), this confirms |T| ≥ p(d + 1/2) − pd/2 = p(d+1)/2, thus upper bounding dim F_X.

To conclude, we argue that the proposed upper bound is essentially tight. Indeed, build Y ∈ M_p by repeating q times the d first rows of I_p, then by replacing its p first rows with I_p (to ensure Y has full rank). If p/d is an integer, then exactly the p/d first slices each contribute d(d+1)/2 linearly independent constraints, so that dim F_{YY^⊤} = p(p+1)/2 − p(d+1)/2. □

Appendix D: Equivalence of global non-degeneracy and smoothness
Proof of Proposition 6.2.
By Proposition 1.3, it is sufficient to consider the case p = n. Consider X ∈ C of rank r and a diagonalization X = QDQ^⊤, where D = diag(λ₁, …, λ_r, 0, …, 0) and Q = [Q₁ Q₂] is orthogonal of size n with Q₁ ∈ R^{n×r}. By [4, Thm. 6], since A₁, …, A_m are linearly independent, X is primal non-degenerate if and only if the matrices

B_k = [Q₁^⊤A_kQ₁, Q₁^⊤A_kQ₂; Q₂^⊤A_kQ₁, 0],  k = 1, …, m,

are linearly independent. The B_k are linearly dependent if and only if there exist α₁, …, α_m not all zero such that α₁B₁ + ⋯ + α_mB_m = 0. Considering the first r columns of the B_k, the latter holds if and only if Σ_k α_k Q^⊤A_kQ₁ = 0, which holds if and only if Σ_k α_k A_kQ₁ = 0. For any Y ∈ R^{n×p} such that X = YY^⊤, since span(Y) = span(Q₁), we have Σ_k α_k A_kQ₁ = 0 if and only if Σ_k α_k A_kY = 0. This shows the B_k are linearly dependent if and only if the A_kY are linearly dependent. Thus, X is primal non-degenerate if and only if {A₁Y, …, A_mY} are linearly independent. Overall, primal non-degeneracy holds at all X ∈ C if and only if Assumption 1.1a holds. □

Acknowledgment.
NB was partially supported by NSF grant DMS-1719558. Part of this work was done while NB was with the D.I. at École normale supérieure de Paris and INRIA's SIERRA team. ASB was partially supported by NSF grants DMS-1712730 and DMS-1719545. Part of this work was done while ASB was with the Mathematics Department at MIT and partially supported by NSF grant DMS-1317308.
Bibliography

[1] Abbé, E.; Bandeira, A.; Hall, G. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory (2016), no. 1, 471–487.
[2] Absil, P.-A.; Baker, C. G.; Gallivan, K. A. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics (2007), no. 3, 303–330.
[3] Absil, P.-A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
[4] Alizadeh, F.; Haeberly, J.-P.; Overton, M. Complementarity and nondegeneracy in semidefinite programming. Mathematical Programming (1997), no. 1, 111–128.
[5] Bandeira, A.; Boumal, N.; Voroninski, V. On the low-rank approach for semidefinite programs arising in synchronization and community detection, in Proceedings of The 29th Conference on Learning Theory, COLT 2016, New York, NY, June 23–26, 2016.
[6] Bandeira, A.; Kennedy, C.; Singer, A. Approximating the little Grothendieck problem over the orthogonal and unitary groups. Mathematical Programming (2016), 1–43.
[7] Barvinok, A. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry (1995), no. 1, 189–202.
[8] Bhojanapalli, S.; Boumal, N.; Jain, P.; Netrapalli, P. Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form, in Proceedings of the 31st Conference On Learning Theory, Proceedings of Machine Learning Research, vol. 75, edited by S. Bubeck; V. Perchet; P. Rigollet, PMLR, 2018, pp. 3243–3270. Available at: http://proceedings.mlr.press/v75/bhojanapalli18a.html
[9] Boumal, N. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575 (2015).
[10] Boumal, N.; Absil, P.-A.; Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis (2018).
[11] Boumal, N.; Voroninski, V.; Bandeira, A. The non-convex Burer–Monteiro approach works on smooth semidefinite programs, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee; M. Sugiyama; U. V. Luxburg; I. Guyon; R. Garnett, pp. 2757–2765, Curran Associates, Inc., 2016.
[12] Burer, S.; Monteiro, R. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (2003), no. 2, 329–357.
[13] Burer, S.; Monteiro, R. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming (2005), no. 3, 427–444.
[14] Eriksson, A.; Olsson, C.; Kahl, F.; Chin, T.-J. Rotation averaging and strong duality, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 127–135.
[15] Goemans, M.; Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM) (1995), no. 6, 1115–1145.
[16] Golub, G. H.; Pereyra, V. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis (1973), no. 2, 413–432.
[17] Journée, M.; Bach, F.; Absil, P.-A.; Sepulchre, R. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization (2010), no. 5, 2327–2351.
[18] Khot, S.; Naor, A. Grothendieck-type inequalities in combinatorial optimization. Communications on Pure and Applied Mathematics (2012), no. 7, 992–1035.
[19] Laurent, M.; Poljak, S. On the facial structure of the set of correlation matrices. SIAM Journal on Matrix Analysis and Applications (1996), no. 3, 530–547.
[20] Lee, J. Introduction to Smooth Manifolds, Graduate Texts in Mathematics, vol. 218, Springer-Verlag New York, 2012, 2nd ed.
[21] Mei, S.; Misiakiewicz, T.; Montanari, A.; Oliveira, R. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality, in Proceedings of the 2017 Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 65, edited by S. Kale; O. Shamir, PMLR, Amsterdam, Netherlands, 2017, pp. 1476–1515. Available at: http://proceedings.mlr.press/v65/mei17a.html
[22] Murty, K.; Kabadi, S. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming (1987), no. 2, 117–129.
[23] Nesterov, Y.; Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming, SIAM, 1994.
[24] Pataki, G. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research (1998), no. 2, 339–358.
[25] Pólik, I.; Terlaky, T. A survey of the S-lemma. SIAM Review (2007), no. 3, 371–418.
[26] Pumir, T.; Jelassi, S.; Boumal, N. Smoothed analysis of the low-rank approach for smooth semidefinite programs, in Advances in Neural Information Processing Systems 31, edited by S. Bengio; H. Wallach; H. Larochelle; K. Grauman; N. Cesa-Bianchi; R. Garnett, pp. 2283–2292, Curran Associates, Inc., 2018.
[27] Rockafellar, R. Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[28] Rosen, D. M.; Carlone, L.; Bandeira, A. S.; Leonard, J. J. A certifiably correct algorithm for synchronization over the special euclidean group. arXiv preprint arXiv:1611.00128 (2016).
[29] Ruszczyński, A. Nonlinear Optimization, Princeton University Press, Princeton, NJ, 2006.
[30] Wen, Z.; Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming (2013), no. 1–2, 397–434.
[31] Yang, W.; Zhang, L.-H.; Song, R. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization (2014), no. 2, 415–434.