The Mixing method: low-rank coordinate descent for semidefinite programming with diagonal constraints
Po-Wei Wang [email protected]
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
Wei-Cheng Chang [email protected]
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213
J. Zico Kolter [email protected]
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, and Bosch Center for Artificial Intelligence, Pittsburgh, PA 15222
Abstract
In this paper, we propose a low-rank coordinate descent approach to structured semidefinite programming with diagonal constraints. The approach, which we call the Mixing method, is extremely simple to implement, has no free parameters, and typically attains an order of magnitude or better improvement in optimization performance over the current state of the art. We show that the method is strictly decreasing, converges to a critical point, and further that for sufficient rank all non-optimal critical points are unstable. Moreover, we prove that with a step size, the Mixing method converges to the global optimum of the semidefinite program almost surely at a locally linear rate under random initialization. This is the first low-rank semidefinite programming method that has been shown to achieve a global optimum on the spherical manifold without assumptions. We apply our algorithm to two related domains: solving the maximum cut semidefinite relaxation, and solving a maximum satisfiability relaxation (we also briefly consider additional applications such as learning word embeddings). In all settings, we demonstrate substantial improvement over the existing state of the art along various dimensions, and in total, this work expands the scope and scale of problems that can be solved using semidefinite programming methods.
Keywords: semidefinite program, non-convex optimization, spherical manifold, convergence to global optimum, random initialization
1. Introduction
This paper considers the solution of large-scale, structured semidefinite programming problems (SDPs). A generic SDP can be written as the optimization problem

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ \langle A_i, X \rangle = b_i, \; i = 1 \ldots p, \tag{1}$$

where $X \in \mathbb{S}^n$ is the optimization variable (a symmetric $n \times n$ matrix), and $A_i \in \mathbb{R}^{n \times n}$, $b_i \in \mathbb{R}$, $i = 1, \ldots, p$ are problem data. Semidefinite programs can encode a huge range of practical problems, including relaxations of many combinatorial optimization tasks (Boyd and Vandenberghe, 2004), approximate probabilistic inference (Jordan and Wainwright, 2004), metric learning (Yang, 2006), matrix completion (Candes and Recht, 2012), and many others. Unfortunately, generic semidefinite programs involve optimizing a matrix-valued variable $X$, where the number of variables grows quadratically, so that solving them quickly becomes unaffordable for solvers employing exact methods such as primal-dual interior point algorithms.

Fortunately, a property of these problems, which has been recognized for some time now, is that the solution to such problems is often low-rank; specifically, the problem always admits an optimal solution with rank at most $\lceil \sqrt{2p} \rceil$ (Barvinok, 1995; Pataki, 1998), and many SDPs are set up to have even lower rank solutions in practice. This has motivated the development of non-convex low-rank solvers for these systems: that is, we can attempt to solve the equivalent (but now non-convex) formulation of the problem

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ \langle C, V^T V \rangle, \quad \text{subject to}\ \langle A_i, V^T V \rangle = b_i, \; i = 1 \ldots p,$$

with the optimization variable $V \in \mathbb{R}^{k \times n}$. Here we are explicitly representing $X$ by the rank-$k$ matrix $V$ (typically with $k \ll n$), $X = V^T V$. Note that because we are representing $X$ in this way, we no longer need to explicitly enforce semidefiniteness, as it is implied by the change of variables. In a long series of work dating back several years, it has been shown that, somewhat surprisingly, this change to a non-convex problem does not cause as many difficulties as might be thought: in practice, local solutions to the problem tend to recover the optimal solution (Burer and Monteiro, 2003); assuming sufficient rank $k$, all second-order local optima of the problem are also global optima (Boumal et al., 2016); and it even holds that approximate local optima have good approximation properties (Mei et al., 2017) for convex relaxations of some combinatorial problems. Despite these advances, solving for $V$ remains a practical challenge. Traditional methods for handling non-linear equality constraints, such as augmented Lagrangian methods and Riemannian manifold methods, suffer from slow convergence, difficulties in selecting step size, or other deficiencies.

In this paper, we present a low-rank coordinate descent approach to solving SDPs that have the additional specific structure of constraints (only) on the diagonal entries of the matrix (we consider the case of unit diagonal constraints, though we can easily extend to arbitrary positive numbers):

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ X_{ii} = 1, \; i = 1 \ldots n. \tag{2}$$

This is clearly a very special case of the full semidefinite program, but it also captures some fundamental problems, such as the famous semidefinite relaxation of the maximum cut (MAXCUT) combinatorial optimization problem; indeed, the MAXCUT relaxation will be one of the primary applications in this paper.
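Before turning to the low-rank approach, it may help to see problem (2) posed directly. The following is a minimal sketch of ours (not code from the paper) that solves a small instance of (2) with the generic CVXPY modeling package; because it optimizes the full matrix $X$, this route becomes unaffordable as $n$ grows, which is precisely what motivates the factorized methods below.

```python
import cvxpy as cp
import numpy as np

# Small synthetic instance of problem (2): minimize <C, X> s.t. X PSD, X_ii = 1.
n = 50
rng = np.random.default_rng(0)
C = rng.normal(size=(n, n))
C = (C + C.T) / 2                      # symmetric cost matrix

X = cp.Variable((n, n), PSD=True)      # the full n x n matrix variable
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), [cp.diag(X) == 1])
prob.solve()
print(prob.value)                      # optimal value of the SDP relaxation
```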
In this setting, if we consider the above low-rank form of the problem, we show that we can derive the coordinate descent updates in a very simple closed form, resulting in an algorithm several times faster than the existing state of the art. We call our approach the Mixing method, since the updates have a natural interpretation in terms of giving each $v_i$ as a mixture of the remaining $v_j$ terms. We will also show, however, that the method can be applied to other problems as well, such as a (novel, to the best of our knowledge) relaxation of the MAXSAT problem, as well as an unconstrained quadratic SDP similar to the GloVe word embedding algorithm (Pennington et al., 2014).

On the theoretical side, we prove several strong properties about the Mixing method. Most notably, despite the fact that the method is solving a non-convex formulation of the original MAXCUT SDP, it nonetheless will converge to the true global optimum of the original SDP problem, provided the rank is sufficient, $k > \sqrt{2n}$ (which is of course much smaller than optimizing over the entire $n^2$ variables of the original SDP). We prove this by first showing that the method is strictly decreasing, and always converges to a first-order critical point. We then show that all non-optimal critical points (that is, all points which are critical points in the non-convex formulation in $V$, but not optimal in $X$ in the original SDP) are unstable, i.e., they are saddle points in $V$ and will be unstable under updates of the Mixing method; this means that, in practice, the Mixing method is extremely unlikely to converge to any non-optimal solution. However, to formally prove that the method will converge to the global optimum, we consider a slightly modified "step size" version of the algorithm, for which we can prove formally that the method indeed converges to the global optimum in all cases (although in practice such a step size is not needed). Finally, for both the traditional and step size versions of the algorithm, we show that the Mixing method attains locally linear convergence around the global optimum. The primary tools we use for these proofs require analyzing the spectrum of the Jacobian of the Mixing method at non-optimal critical points, showing that it is always unstable around those points; we combine these with a geometric analysis of critical points due to Boumal et al. (2016) and a convergence proof for coordinate gradient descent due to Lee et al. (2017) to reach our main result (both results require slight specialization for our case, so are included for completeness, but the overall thrust of those supporting points is due to these past papers).

Contributions of the current work
In summary, the main contributions of this paper are: 1) We propose a low-rank coordinate descent method, the Mixing method, for the diagonally constrained SDP problem, which is extremely fast and simple to implement. 2) We prove that despite its non-convex formulation, the method is guaranteed to converge to the global optimum of the original SDP with locally linear convergence, provided that we use a small rank $k > \sqrt{2n}$. 3) We evaluate the proposed method on the MAXCUT SDP relaxation, showing that it is 10-100x faster than other state-of-the-art solvers and scales to millions of variables. 4) We extend the MAX-2SAT relaxation of Goemans and Williamson (1995) to general MAXSAT problems, showing that the proposed formulation can be solved by the Mixing method in time linear in the number of literals. Further, experiments show that the proposed method achieves much better approximation ratios than other approximation algorithms, and is even comparable to the best partial MAXSAT solvers on some instances.
2. Background and related work
Low-rank methods for semidefinite programming
Given an SDP problem with $p$ constraints, it was proven by Barvinok (1995); Pataki (1998) that, if the problem is solvable, it admits solutions of rank $k = \lceil \sqrt{2p} \rceil$. That is, we have solutions satisfying $X = V^T V$, with $V \in \mathbb{R}^{k \times n}$. Thus, if we can solve the problem in the space of $V$, we can ignore the semidefinite constraint and also have many fewer variables. The idea of using this low-rank structure during optimization was first proposed by Burer and Monteiro (2003) in their solver SDPLR, in which they solve the low-rank problem with L-BFGS on the extended Lagrangian problem. Since then, many low-rank optimization algorithms have been developed. One of the most notable is the Riemannian trust region method introduced by Absil et al. (2009). They considered the Riemannian manifold of low-rank structures, and extended the non-linear conjugate gradient method to work on the manifold. Later, Boumal and Absil (2015) improved the method by including preconditioned CG; these methods are implemented in the popular Manopt package (Boumal et al., 2014).

Somewhat surprisingly, all the above low-rank methods are observed to converge to a global optimum in practice (Burer and Monteiro, 2003; Absil et al., 2009; Boumal et al., 2014). However, to the best of the authors' knowledge, there is not yet a general proof of convergence to the globally optimal solution without strong assumptions. Absil et al. (2009, Theorem 7.4.2) proved that their method converges to a critical point under a sufficient decrease condition, with super-linear convergence near an isolated local minimizer when the Riemannian Hessian is positive definite. Nonetheless, these results do not apply to the linear objective in our problem. Boumal et al. (2016, Theorem 2) proved that, for sufficiently large $k$, all second-order optimal solutions are globally optimal for almost all cost matrices $C$. However, this alone does not imply that $f$ converges to $f^*$. Sato and Iwai (2015) proved convergence to a critical point without assumptions for a modified Riemannian conjugate gradient method, and Boumal et al. (2018, Theorem 12 and Proposition 19) proved that the Riemannian trust region method converges to a solution with Hessian larger than $-\gamma I$ in $O(1/\gamma^3)$ iterations and provide bounds on $f - f^*$ when $k > \sqrt{2n}$. Recently, Bhojanapalli et al. (2018) proved the connection between $\gamma$, the norm of the gradients, and $f_\mu - f_\mu^*$ for the unconstrained penalty form $f_\mu$, but this does not establish the relationship between $f$ and $f^*$ after projecting the unconstrained solution back to the feasible space.

Other existing results on global convergence to the optimum of low-rank problems do not apply to our problem setting. For example, Bhojanapalli et al. (2016) proved that under a certain spectral initialization, the gradient descent method converges to the global optimum for unconstrained low-rank SDPs. Park et al. (2016) further proved that the method works for norm-constrained low-rank SDP problems when the feasible space of $V$ is convex. Lee et al. (2016, 2017); O'Neill and Wright (2017) proved that, under random initialization, first-order methods converge to local minimizers for unconstrained problems. However, O'Neill and Wright (2017) only concerns unconstrained optimization in Euclidean space, and the results in Lee et al. (2016, 2017) do not carry over to the spherical manifold due to the singularity in the Jacobian (even with a step size).
Because of these issues, certifying global convergence for low-rank SDP methods typically requires a two-stage algorithm, where one iteratively solves the low-rank problem via some "inner" optimization, then checks for a certificate of (global) optimality via the dual objective of the original SDP and inflates the rank if the solution is not optimal (i.e., the "outer" iteration) (Burer and Monteiro, 2003). Even here, a globally optimal solution is not theoretically guaranteed to be achieved with reduced rank in all cases, but at least it can be verified after the fact. Computing this dual solution, however, typically requires solving an eigenvalue problem of the size of the original $X$ matrix, so it is often avoided entirely in practice in favor of just solving one low-rank problem, and this is the setting we consider here.

Approximation algorithms for MAXCUT and MAXSAT
Semidefinite programming has many applications in approximation algorithms for NP-complete problems. In particular, Goemans and Williamson (1995) proposed a classical SDP relaxation of the MAXCUT and MAX-2SAT problems, which has a 0.878 approximation guarantee. Experiments on MAX-2SAT (Gomes et al., 2006) show that the SDP upper bound and lower bound are much tighter than the classical linear programming relaxation for MAXSAT (Goemans and Williamson, 1994). While traditionally SDPs are more expensive to solve than linear programs, we will show here that with our approach, SDP relaxations can achieve substantially better results than linear programming in less time.
3. The Mixing method
As mentioned above, the goal of the Mixing method is to solve the semidefinite program (2) with a unit diagonal constraint. As discussed, we can replace the $X \succeq 0$ constraint with $X = V^T V$ for some $V \in \mathbb{R}^{k \times n}$; when we do this, the constraint $X_{ii} = 1$ translates to the constraint $\|v_i\| = 1$, where $v_i$ is the $i$th column of $V$. This leads to the equivalent (non-convex) problem on the spherical manifold

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ \langle C, V^T V \rangle, \quad \text{subject to}\ \|v_i\| = 1, \; i = 1 \ldots n. \tag{3}$$

Although the problem is non-convex, it is known (Barvinok, 1995; Pataki, 1998) that when $k > \sqrt{2n}$, the optimal solution for $V \in \mathbb{R}^{k \times n}$ can recover the optimal solution for $X$.

We consider solving problem (3) via a coordinate descent method. The resulting algorithm is extremely simple to implement but, as we will show, it performs substantially better than existing approaches for the semidefinite problems of interest. Specifically, the objective terms that depend on $v_i$ are given by $2 v_i^T (\sum_{j=1}^n c_{ij} v_j)$. Because $\|v_i\| = 1$, we can assume that $c_{ii} = 0$ without affecting the solution of the optimization problem. Thus, the problem is equivalent to simply minimizing the inner product $v_i^T g_i$ (where $g_i = \sum_{j=1}^n c_{ij} v_j$), subject to the constraint that $\|v_i\| = 1$; this problem has a closed-form solution, simply given by $v_i = -g_i/\|g_i\|$. Put in terms of the original $v_j$ variables, this is simply the update

$$v_i := \text{normalize}\Big(-\sum_{j=1}^n c_{ij} v_j\Big).$$

This way, we can initialize $v_i$ on the unit sphere and perform cyclic updates over all the $i = 1 \ldots n$ in closed form. We call this the Mixing method, because for each $v_i$ our update mixes and normalizes the remaining vectors $v_j$ according to the weights $c_{ij}$. In the case of sparse $C$ (which is common for any large data problem), the time complexity for updating all variables once is $O(k \cdot m)$, where $k$ is the rank of $V$ and $m$ is the number of nonzeros in $C$. This is significantly cheaper than interior point methods, which typically admit complexities cubic in $n$. However, the details of efficient computation differ depending on the precise nature of the SDP, so we will describe them in more detail in the subsequent application sections. A complete description of the generic algorithm is shown in Algorithm 1.

Algorithm 1: The Mixing method for the MAXCUT problem
  Initialize $v_i$ randomly on a unit sphere;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          $v_i := \text{normalize}\big(-\sum_{j=1}^n c_{ij} v_j\big)$;
      end
  end
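As a concrete reference implementation, the following NumPy sketch of Algorithm 1 is our own illustration rather than the authors' code; the function name and fixed sweep count are assumptions, and a practical implementation would exploit the sparsity of $C$ and test convergence explicitly.

```python
import numpy as np

def mixing_maxcut(C, k, n_sweeps=100, seed=0):
    """Sketch of Algorithm 1 for problem (3): minimize <C, V^T V> with unit columns."""
    n = C.shape[0]
    C = C - np.diag(np.diag(C))            # w.l.o.g. set c_ii = 0
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(k, n))
    V /= np.linalg.norm(V, axis=0)         # columns v_i start on the unit sphere
    for _ in range(n_sweeps):
        for i in range(n):                 # cyclic coordinate updates, in place
            g = V @ C[:, i]                # g_i = sum_j c_ij v_j (C symmetric)
            V[:, i] = -g / np.linalg.norm(g)   # assumes ||g|| >= delta (Assumption 3.1)
    return V

# usage: V = mixing_maxcut(C, k=int(np.ceil(np.sqrt(2 * n))))
#        objective = np.trace(C @ V.T @ V)
```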
3.1 Convergence properties of the Mixing methods

This section presents the theoretical analysis of the Mixing methods, which constitutes four main properties:

- We prove that the Mixing method is strictly decreasing in objective value and always converges to a first-order critical point over iterations.

- Further, we show that for rank $k > \sqrt{2n}$, all non-optimal critical points $V \in \mathbb{R}^{k \times n}$ are unstable for the Mixing method. That is, when $V^T V$ is non-optimal for the convex problem (2), the critical point $V$ will sit on a saddle of the non-convex problem (3) and the Mixing method tends to diverge from the point locally.

- To rigorously prove the global convergence, we show that a variant of the Mixing method with a proper step size converges to a global optimum almost surely under random initialization for almost every cost matrix $C$, without any assumptions.

- Moreover, we prove that both Mixing methods (with or without the step size) converge linearly to the global optimum when the solution is close enough, regardless of the rank or the existence of nearby non-optimal critical points.

In total, our method represents the first low-rank semidefinite programming method which will provably converge to a global optimum under constraints, and further at a locally linear rate.

As mentioned above, our only assumption (satisfiable by picking a proper step size) is simply that the $v_i$ updates will not lead to a vector with zero norm before normalization.

Assumption 3.1
Assume that for all $i = 1 \ldots n$, the norms $\|\sum_{j=1}^n c_{ij} v_j\|$ do not degenerate during the procedure. That is, all such norms are always greater than or equal to a constant $\delta > 0$.

Convergence to a critical point.
Our first property shows that the Mixing method is strictly decreasing, and always converges to a first-order critical point, for any $k$. This is a relatively weak statement, but useful in the context of the further proofs.
1. The tightness of the $\sqrt{2n}$ rank (more precisely, of any rank satisfying $k(k+1)/2 \geq n$) is proved in Barvinok (2001).

Theorem 3.2
Under Assumption 3.1, the Mixing method on the SDP problem (2) is strictly decreasing and always converges to a first-order critical point.
The proof is in Appendix A; it mainly involves setting up the Mixing iteration in a matrix form, and showing that the difference in objective value between two iterations is given by a particular positive term based upon this form.
Instability of non-optimal critical points.
Next we prove the main result of our approach: not only is the function decreasing, but every non-optimal critical point is unstable; that is, although problem (3) is non-convex, the algorithm tends to diverge (locally) from any solution that is not globally optimal. Further, the local divergence from non-optimal critical points and the global convergence to a critical point hint that the Mixing method can only converge to a globally optimal solution.
Theorem 3.3
Pick $k > \sqrt{2n}$. Under Assumption 3.1, for almost all $C$, all non-optimal first-order critical points are unstable fixed points for the Mixing method.

The full proof is provided in Section 3.2. The main idea is to show that the Jacobian of the dynamics, evaluated at a critical point at which $V$ is not optimal, is guaranteed to have a spectral radius (the magnitude of the largest eigenvalue) greater than one. We do this by showing that the Jacobian of the Mixing method has the same structure as the Jacobian of a standard Gauss-Seidel update plus an additional projection matrix. By a property from Boumal et al. (2016), plus an analysis of the eigenvectors of Kronecker products, we can then guarantee that the eigenvalues of the Mixing method Jacobian contain those of the Gauss-Seidel update. We then use an argument based upon Lee et al. (2017) to show that the Gauss-Seidel Jacobian is similarly unstable around non-optimal critical points, proving our main theorem.

Globally optimal convergence of the Mixing method with a step size.
Though the above two theorems in practice ensure convergence of the Mixing method to a global optimum, because the method makes discrete updates, there is the theoretical possibility that the method will "jump" directly to a non-optimal critical point (again, this has never been observed in practice). For completeness, however, in the next theorem we highlight the fact that a version of the Mixing method modified with a step size will always converge to a global optimum.
Theorem 3.4
Consider the Mixing method with a step size $\theta > 0$. That is,

$$v_i := \text{normalize}\Big(v_i - \theta \sum_{j=1}^n c_{ij} v_j\Big), \quad \text{for } i = 1, \ldots, n.$$

Take $k > \sqrt{2n}$ and $\theta \in \big(0, \frac{1}{\max_i \|c_i\|_1}\big)$, where $\|\cdot\|_1$ denotes the 1-norm. Then for almost every $C$, the method converges to a global optimum almost surely under random initialization.
2. Any distribution will suffice if it maps zero volume in the spherical manifold to zero probability. For example, both the spherical Gaussian distribution and the normalized uniform distribution work.

The full proof is provided in Appendix C. The main difference in the proof from the step-size-free version is that, with a step size, we can prove the diffeomorphism of the Mixing method and thus are able to use the stable manifold theorem. Because the Jacobian of any feasible method on the spherical manifold is singular, we need to construct the inverse function explicitly and show the smoothness of such a function. We can then use an analysis of the Gauss-Seidel method with a step size to show our result. To the best of our knowledge, this represents the first globally optimal convergence result for a low-rank method applied to (constrained) semidefinite programming, without additional assumptions such as the Cauchy decrease (Boumal et al., 2018, Assumption A3), or solving a sequence of "bounded" log-barrier subproblems exactly (Burer and Monteiro, 2005, Theorem 5.3).

3. Note that on the spherical manifold, the Jacobian of any feasible method is singular because the Jacobian of $v_i/\|v_i\|$ is singular. Thus, the proof in Lee et al. (2017, Section 5.5) does not carry over to the case of the Mixing method, even with a step size. That past proof required a non-singular Jacobian, and thus different techniques are required in our setting.
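In implementation terms, the step-size variant of Theorem 3.4 changes only the inner update of the earlier sketch. A hedged sketch of that modification (the helper name and the particular safety factor on $\theta$ are our assumptions):

```python
import numpy as np

def mixing_step(V, C, theta):
    """One sweep of the step-size variant from Theorem 3.4 (sketch, not the paper's code)."""
    for i in range(C.shape[0]):
        w = V[:, i] - theta * (V @ C[:, i])   # v_i - theta * g_i
        V[:, i] = w / np.linalg.norm(w)       # re-normalize onto the sphere
    return V

# a theta inside the theorem's range (0, 1 / max_i ||c_i||_1), with a 0.9 safety factor:
# theta = 0.9 / np.abs(C).sum(axis=0).max()
```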
Locally linear convergence.

Finally, as a last point of analysis, we show that the Mixing methods exhibit linear convergence to the global optimum whenever the solution is close enough, regardless of the rank and of the existence of nearby non-optimal critical points, for both versions with and without the step size. This also echoes practical experience, where the Mixing method does exhibit this rate of convergence.
Theorem 3.5
The Mixing methods converge linearly to the global optimum when close enough to the solution, with the step size (no assumption) or without the step size (under Assumption 3.1).
The full proof is provided in Appendix D. We prove it by exploiting the Lipschitz smoothness (Lipschitz continuous gradient) of the Mixing mappings. The main difficulty here is that the corresponding linear system in the Gauss-Seidel method, $S^* \in \mathbb{R}^{n \times n}$, is semidefinite, so that the corresponding Jacobian $J_{GS}$ contains eigenvectors with eigenvalues of magnitude 1 in the null space of $S^*$. We overcome the difficulty by proving the result directly in the function-value space, like Wang and Lin (2014), so that the eigenvectors in $\text{null}(S^*)$ can be ignored. This is the first local linear convergence result on the spherical manifold without assumptions.

Remark 3.6
Assume there are $m$ nonzeros in $C \in \mathbb{R}^{n \times n}$. By Theorems 3.4 and 3.5, for almost every $C$, the Mixing method with a step size admits an asymptotic complexity of $O(m\sqrt{n}\log\frac{1}{\epsilon})$ for reaching a global optimality gap of $f - f^* \leq \epsilon$, almost surely under random initialization.

This concludes the theoretical analysis of the Mixing method. Now we will prove one of our main results: the instability of non-optimal critical points.
3.2 Proof of Theorem 3.3: the instability of non-optimal critical points

Before starting the proofs, we discuss our notation and reformulate the Mixing methods.
Notations.
We use upper-case letters for matrices, and lower-case letters for vectors and scalars. For a matrix $V \in \mathbb{R}^{k \times n}$, $v_i \in \mathbb{R}^k$ refers to the $i$-th column of $V$, and $\text{vec}(V)$ and $\text{vect}(V) \in \mathbb{R}^{nk}$ denote the vectors stacking the columns and the rows of $V$, respectively. For a vector $y \in \mathbb{R}^n$, $D_y = \text{diag}(y) \in \mathbb{R}^{n \times n}$ denotes the matrix with $y$ on the diagonal, and $y_{\max}, y_{\min}, y_{\text{min-nz}} \in \mathbb{R}$ denote the maximum, the minimum, and the minimum nonzero element of $y$, respectively. The symbol $\otimes$ denotes the Kronecker product, $\dagger$ the Moore-Penrose inverse, $\sigma(\cdot)$ the vector of eigenvalues, $\rho(\cdot)$ the spectral radius, $\mathbf{1}_n$ the all-ones vector of length $n$, $I_n$ the identity matrix of size $n$, $\text{tr}(\cdot)$ the trace, $\langle A, B \rangle = \text{tr}(A^T B)$ the dot product, and $\|V\| = \sqrt{\text{tr}(V V^T)}$ the generalized L2 norm. Indices $i, j$ are reserved for matrix elements, and the index $r$ for iterations.

The Mixing methods.
We denote by $C \in \mathbb{R}^{n \times n}$ the cost matrix of the problem

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ f(V) \equiv \langle C, V^T V \rangle, \quad \text{subject to}\ \|v_i\| = 1, \; i = 1 \ldots n,$$

and w.l.o.g. assume that $c_{ii} = 0$ in all proofs. The matrices $V$ and $\hat{V}$ refer to the current and the next iterate, and $V^*$ to a global optimum attaining the optimal value $f^*$ of the semidefinite program (2). Let

$$g_i = \sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j, \tag{4}$$

and let the matrix $L$ be the strict lower triangular part of $C$. With these notations, the mapping of the Mixing method $M : \mathbb{R}^{k \times n} \to \mathbb{R}^{k \times n}$ and its variant $M_\theta$ with step size $\theta$ can be written as

$$M(V)^T = -(L + D_y)^{-1} L^T V^T, \quad \text{where } y_i = \|g_i\|, \; i = 1 \ldots n, \text{ and} \tag{5}$$

$$M_\theta(V)^T = (\theta L + D_y)^{-1} (I - \theta L)^T V^T, \quad \text{where } y_i = \|v_i - \theta g_i\|, \; i = 1 \ldots n. \tag{6}$$

Note that both $M$ and $M_\theta$ are valid functions of $V$ because $y$ can be calculated from $V$ by the original algorithmic definitions in Section 3. This formulation is similar to the classical analysis of the Gauss-Seidel method for linear equations in Golub and Van Loan (2012); the difference is that $y$ here is not constant with respect to $V$, and thus the evolution is not linear.

4. By Pataki (1998), the optimality in (2) is always attainable by $V \in \mathbb{R}^{k \times n}$ when $k > \sqrt{2n}$.

5. The reason to reformulate here is to avoid the "overwrite" of variables in the algorithmic definition. Moving the inverse term to the left-hand side, the reader can recover the original sequential algorithm.
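As a sanity check on the reformulation, one can verify numerically that the matrix form (5) reproduces a full sequential sweep of Algorithm 1. The following toy check is our own illustration under the notation above, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 7, 4
C = rng.normal(size=(n, n)); C = C + C.T; np.fill_diagonal(C, 0)
V = rng.normal(size=(k, n)); V /= np.linalg.norm(V, axis=0)

# sequential sweep, recording y_i = ||g_i||
Vh, y = V.copy(), np.zeros(n)
for i in range(n):
    g = Vh @ C[:, i]                       # g_i uses hat{v}_j for j < i (in-place)
    y[i] = np.linalg.norm(g)
    Vh[:, i] = -g / y[i]

# matrix form (5): M(V)^T = -(L + D_y)^{-1} L^T V^T, L = strict lower triangle of C
L = np.tril(C, -1)
M_T = -np.linalg.solve(L + np.diag(y), L.T @ V.T)
assert np.allclose(M_T, Vh.T)
```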
Proof of technical lemmas.

We start by analyzing the Jacobian of the Mixing method.
Lemma 3.7
Using the notation in (5), the Jacobian of the Mixing method is

$$J(V) = -\big((L + D_y)^{-1} \otimes I_k\big)\, P\, (L^T \otimes I_k),$$

in which $P$ is the rejection matrix of $\hat{V}$. That is, $P = \text{diag}(P_1, \ldots, P_n) \in \mathbb{R}^{nk \times nk}$, where $P_i = I_k - \hat{v}_i \hat{v}_i^T \in \mathbb{R}^{k \times k}$.

Proof
Denote by $V$ and $\hat{V}$ the current and the next iterate. Taking total derivatives of the update of the Mixing method (7), we have

$$y_i\, d\hat{v}_i = -P_i\, dg_i = -P_i \Big( \sum_{j<i} c_{ij}\, d\hat{v}_j + \sum_{j>i} c_{ij}\, dv_j \Big), \quad i = 1 \ldots n.$$
Moving the $d\hat{v}_j$ terms to the left-hand side, by the implicit function theorem we have the Jacobian

$$J(V) = -\begin{pmatrix} y_1 I_k & & & \\ c_{21} P_2 & y_2 I_k & & \\ \vdots & \ddots & \ddots & \\ c_{n1} P_n & \cdots & c_{n,n-1} P_n & y_n I_k \end{pmatrix}^{-1} \begin{pmatrix} 0 & c_{12} P_1 & \cdots & c_{1n} P_1 \\ & 0 & \ddots & \vdots \\ & & \ddots & c_{n-1,n} P_{n-1} \\ & & & 0 \end{pmatrix}.$$

The implicit function theorem holds here because the triangular matrix is always invertible. Rewriting $J(V)$ with Kronecker products, we have equivalently

$$J(V) = -\big(P (L \otimes I_k) + D_y \otimes I_k\big)^{-1} P\, (L^T \otimes I_k).$$

Further, we can push $P$ into the inverse, so that

$$\big(P (L \otimes I_k) + D_y \otimes I_k\big)^{-1} P = \big(P (L \otimes I_k) + P (D_y \otimes I_k)\big)^\dagger = \big((L + D_y) \otimes I_k\big)^{-1} P.$$

Thus, $J(V)$ can be reformulated as

$$J(V) = -\big((L + D_y)^{-1} \otimes I_k\big)\, P\, (L^T \otimes I_k).$$

Note that $V = \hat{V}$ at critical points, which are also fixed points of the Mixing method. Now we demonstrate how to analyze the Jacobian. Recall the notation $\text{vect}(Z) = \text{vec}(Z^T)$. With it, we have the following convenient property of the Kronecker product.

Lemma 3.8
For matrices $A, B, Q, R$, we have $(A \otimes B)\,\text{vect}(Q R^T) = \text{vect}\big((AQ)(BR)^T\big)$.

Proof

$$(A \otimes B)\,\text{vect}(Q R^T) = (A \otimes B)\,\text{vec}(R Q^T) = \text{vec}(B R Q^T A^T) = \text{vect}\big((AQ)(BR)^T\big).$$

By the above property, part of the spectrum of the Jacobian can be analyzed.
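A toy numerical check of Lemma 3.8 (our own illustration; recall that vect stacks rows, i.e., $\text{vect}(Z) = \text{vec}(Z^T)$):

```python
import numpy as np

# Toy check of Lemma 3.8: (A kron B) vect(Q R^T) = vect((AQ)(BR)^T).
rng = np.random.default_rng(0)
n, k, d = 5, 3, 2
A, B = rng.normal(size=(n, n)), rng.normal(size=(k, k))
Q, R = rng.normal(size=(n, d)), rng.normal(size=(k, d))

vect = lambda Z: Z.flatten()              # row-stacking: vect(Z) = vec(Z^T)
lhs = np.kron(A, B) @ vect(Q @ R.T)       # Q R^T is n x k, so vect has length nk
rhs = vect((A @ Q) @ (B @ R).T)
assert np.allclose(lhs, rhs)
```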
Lemma 3.9 (Overlapping Eigenvalues)
Assume $V \in \mathbb{R}^{k \times n}$ has $\text{rank}(V) < k$. Let $P = \text{diag}(P_1, \ldots, P_n)$, where $P_i = I_k - v_i v_i^T$. Then for any $A, B \in \mathbb{R}^{n \times n}$, any eigenvalue of $AB$ is also an eigenvalue of $J = (A \otimes I_k)\, P\, (B \otimes I_k)$.

Proof
Because $\text{rank}(V) < k$, by linear dependency there is a nonzero $r \in \mathbb{R}^k$ such that

$$r^T v_i = 0 \text{ for } i = 1 \ldots n \implies P_i r = r \text{ for } i = 1 \ldots n.$$

Let $q \in \mathbb{C}^n$ be an eigenvector of $AB$ with eigenvalue $\lambda \in \mathbb{C}$, and let $Z = q r^T$. With Lemma 3.8,

$$J\,\text{vect}(Z) = (A \otimes I_k)\, P\,\text{vect}\big((Bq) r^T\big) = (A \otimes I_k)\,\text{vect}\big((Bq) r^T\big) = \text{vect}\big((ABq) r^T\big) = \lambda\,\text{vect}(q r^T),$$

where the second equality uses $P_i r = r$ on each block of $\text{vect}\big((Bq) r^T\big)$. Thus, every eigenvalue $\lambda$ of $AB$ is also an eigenvalue of $J$.

By the above lemma, the spectral radius of $J = -\big((L + D_y)^{-1} \otimes I_k\big) P (L^T \otimes I_k)$ is lower bounded by that of $J_{GS} = -(L + D_y)^{-1} L^T$, which can in turn be lower bounded as follows.

Lemma 3.10
For a positive vector $y \in \mathbb{R}^n$, consider the matrix, under the notation in (5),

$$J_{GS} = -(L + D_y)^{-1} L^T.$$

Let $S = C + D_y$. When $S \not\succeq 0$, the spectral radius $\rho(J_{GS}) > 1$.

Proof
The proof is more technical and is given in Appendix B.

Further, the assumption in Lemma 3.9 is fulfilled by the following property of critical points.
Lemma 3.11 (Boumal et al., 2016, Lemma 9) Let $k(k+1)/2 > n$. Then, for almost all $C \in \mathbb{R}^{n \times n}$, all first-order critical points $V \in \mathbb{R}^{k \times n}$ have rank smaller than $k$.

Proof
The proof is listed in Appendix E for completeness.

Next, we characterize the optimality of $V$ by proving that every non-optimal critical $V$ admits $S \not\succeq 0$.

Lemma 3.12
For a critical solution $V$, denote $S = C + \text{diag}(y)$, where $y_i = \|V c_i\|, \forall i$. Then

$$S \succeq 0 \iff V \text{ is optimal.}$$

Further, if $V$ is optimal, all $y_i$ are positive except when $c_i = 0$.

Proof
Consider the dual problem of the SDP problem (2),

$$\text{maximize}_{y \in \mathbb{R}^n}\ -\mathbf{1}_n^T y, \quad \text{subject to}\ C + \text{diag}(y) \succeq 0.$$

If $S = C + \text{diag}(y) \succeq 0$, the variable $y$ becomes a feasible solution of the above dual problem. Further, since $V$ is a critical solution, we have

$$V S = 0 \implies V^T V S = 0 \implies \text{tr}(V^T V C) = -\text{tr}\big(V^T V \,\text{diag}(y)\big) = -\mathbf{1}_n^T y,$$

which means that $V^T V$ and $y$ are primal and dual optimal solutions that close the duality gap. Thus, for critical $V$, $S \succeq 0$ implies optimality, and non-optimality implies $S \not\succeq 0$.

In the other direction, when the solution $V$ is optimal, there is a corresponding dual optimal solution $y$ satisfying

$$V^T V (C + \text{diag}(y)) = 0 \implies v_i^T V (C + \text{diag}(y)) = 0, \; \forall i \implies y_i = \|V c_i\|, \; \forall i,$$

and $S = C + \text{diag}(y) \succeq 0$ follows from dual feasibility. By the characterization of SPSD matrices, all submatrices of $S \succeq 0$ are SPSD; thus, $y_i \geq 0$. If the equality $y_i = 0$ holds, by the same reasoning all $2 \times 2$ submatrices

$$\begin{pmatrix} 0 & c_{ij} \\ c_{ij} & y_j \end{pmatrix} \succeq 0, \; \forall j,$$

which means $c_i = 0$.
6. If $S \succeq 0$, we can prove that the spectral radius $\rho(J_{GS}) \leq 1$, in which case all eigenvectors with eigenvalues of magnitude 1 reside in the null space of $S$; see Wang and Lin (2014, Corollary 3.4). However, that result is not used here.

7. Let $y_i^* = \|V^* c_i\|$ and $S^* = C + \text{diag}(y^*) \succeq 0$. An immediate consequence of the lemma is that, for any feasible $U$,

$$f(U) - f^* = \text{tr}(U C U^T) + \mathbf{1}_n^T y^* = \text{tr}(U S^* U^T).$$

Further, suppose $U$ is also an optimum; then

$$f(U) - f^* = \text{tr}(U S^* U^T) = 0 \iff U S^* = 0 \iff \|U c_i\| = y_i^* = \|V^* c_i\|, \; \forall i.$$

That is, $y^*$ is unique.
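In practice, Lemma 3.12 gives a cheap optimality certificate for a candidate critical point $V$: form $S = C + \text{diag}(y)$ with $y_i = \|V c_i\|$ and test positive semidefiniteness. A minimal sketch (our own illustration; at scale the dense eigenvalue check would be replaced by a sparse eigensolver):

```python
import numpy as np

def is_optimal(V, C, tol=1e-8):
    """Certificate from Lemma 3.12 for a critical point V of problem (3)."""
    y = np.linalg.norm(V @ C, axis=0)            # y_i = ||V c_i|| (columns of V C)
    S = C + np.diag(y)
    return np.linalg.eigvalsh(S).min() >= -tol   # S >= 0 iff V is globally optimal
```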
Proof
Consider the non-optimal critical point V . If (cid:107) V c i (cid:107) degenerates for any i , byAssumption 3.1, the Mixing method would not converge to the solution. Thus, we only needto discuss the critical point with y i = (cid:107) V c i (cid:107) ≥ δ > , i = 1 . . . n .We first derive the Jacobian J of the Mixing method in Lemma 3.7, which gives J = − ( L + D y ) − ⊗ I k P L T ⊗ I k , where P = diag( P , . . . , P n ) and P i = I − v i v Ti because ˆ v i = v i on the critical point (also afixed point). In Lemma 3.9, we prove that when rank ( V ) < k , the eigenvalues of J containthe eigenvalues of J GS = − ( L + D y ) − L T . The assumption in Lemma 3.9 is fulfilled by Lemma 3.11, which guarantees that for almostevery C , all the first-order critical point must have rank ( V ) < k . Further, Lemma 3.10 and3.12 show that J GS happens to be the Jacobian of the Gauss-Seidel method on a linearsystem, which has a spectral radius ρ ( J GS ) > on the non-optimal first-order critical point V .Thus, Lemma 3.9 implies ρ ( J ) ≥ ρ ( J GS ) > , which means that all non-optimal first-ordercritical points are unstable for the Mixing method.
4. Applications
4.1 Maximum cut problems (MAXCUT)

The SDP MAXCUT relaxation is indeed the motivating example for the Mixing method, so we consider it first. In this section, we demonstrate how to apply our method to this problem, which originates from Goemans and Williamson (1995).
Problem description.
The maximum cut problem is an NP-hard binary optimization problem, which seeks a partition of a set of vertices $i = 1 \ldots n$, so that the sum of edge weights $c_{ij}$ across the partition is maximized. If we denote the two partitions as $\pm 1$, we can formulate the assignment $v_i$ of vertex $i$ as the following binary optimization problem:

$$\text{maximize}_{v_i \in \{\pm 1\}, \forall i}\ \sum_{i<j} c_{ij}\, \frac{1 - v_i v_j}{2}.$$

Goemans and Williamson (1995) proposed that we can approximate the above solution by "lifting" the assignment $v_i$ from $\{\pm 1\}$ to a unit sphere in $\mathbb{R}^k$ for sufficiently large $k$, as

$$\text{maximize}_{\|v_i\|=1, \forall i}\ \sum_{i<j} c_{ij}\, \frac{1 - v_i^T v_j}{2}.$$

To recover the binary assignment, we can apply a randomized rounding: pick a random vector $r \in \mathbb{R}^k$ on the unit sphere, and let the binary assignment of vertex $i$ be $\text{sign}(r^T v_i)$. Their analysis shows that the approximation ratio for the NP-hard problem is 0.878, which means that the expected objective from the randomized rounding scheme is at least 0.878 times the optimal binary objective.
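A hedged sketch of this randomized rounding (our own illustration; it assumes $C$ has zero diagonal and simply keeps the best of several random hyperplanes):

```python
import numpy as np

def gw_rounding(V, C, n_trials=100, seed=0):
    """Randomized rounding of the lifted solution V; returns the best sampled cut."""
    k, n = V.shape
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(n_trials):
        r = rng.normal(size=k)                   # random hyperplane direction
        x = np.where(r @ V >= 0, 1.0, -1.0)      # binary assignment sign(r^T v_i)
        val = 0.25 * (C.sum() - x @ C @ x)       # cut value, assuming zero diagonal
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```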
Algorithm design.

Because the problem can be solved by the unit diagonal SDP (2), we can apply the Mixing method directly, as presented in Algorithm 1. Further, for a sparse adjacency matrix $C$, the coefficient $\sum_j c_{ij} v_j$ can be constructed in time proportional to the number of nonzeros in column $i$ of $C$. Thus, the time complexity of running a round of updates for all $v_i$ is $O(k \cdot |\text{edges}|)$, in which $k$ is at most $\sqrt{2n}$.
4.2 Maximum satisfiability problems (MAXSAT)

Using similar ideas as in the previous section, Goemans and Williamson (1995) proposed that we can use SDPs to approximate the maximum 2-satisfiability problem. In this section, we propose a formulation that generalizes this idea to the general maximum satisfiability problem, and apply the Mixing method to it. The proposed relaxation here is novel, to the best of our knowledge, and (as we will show) achieves substantially better approximation results than existing relaxations.

Problem description.
The MAXSAT problem is an extension of the well-known satisfiability problem, where the goal is to find an assignment that maximizes the number of satisfied clauses. Let $v_i \in \{\pm 1\}$ be a binary variable and $s_{ij} \in \{-1, 0, 1\}$ be the sign of variable $i$ in clause $j$. The goal of MAXSAT can then be written as the optimization problem

$$\text{maximize}_{v \in \{-1,1\}^n}\ \sum_{j=1}^m \bigvee_{i=1}^n \mathbf{1}\{s_{ij} v_i > 0\}.$$

Note that most clauses will contain relatively few variables, so the $s_j$ vectors will be sparse. To avoid the need for an additional bias term, we introduce an auxiliary "truth" variable $v_0$ with $s_{0j} = -1$, and define $z_j = \sum_{i=1}^n s_{ij} v_i - v_0 = \sum_{i=0}^n s_{ij} v_i = V s_j$. Then the MAXSAT problem can be approximated as

$$\text{maximize}_{v \in \{-1,1\}^n}\ \sum_{j=1}^m \left( 1 - \frac{\|V s_j\|^2 - (|s_j| - 1)^2}{4 |s_j|} \right),$$

where $|s_j|$ denotes the number of literals in clause $j$. Although we will not derive it formally, the reader can verify that for any configuration $v \in \{-1,1\}^n$, this term represents an upper bound on the exact MAXSAT solution. Similar to the MAXCUT SDP, we can relax the $v_i$'s to be vectors in $\mathbb{R}^k$ with $\|v_i\| = 1$. This leads to the full MAXSAT semidefinite programming relaxation

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ X_{ii} = 1, \; i = 0 \ldots n, \quad \text{where}\ C = \sum_{j=1}^m w_j s_j s_j^T \text{ and } w_j = 1/(4|s_j|).$$
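As a quick numerical sanity check of the per-clause term (a toy illustration of ours, under the assumption that $|s_j|$ counts the literals of clause $j$), one can verify on random binary assignments that the term equals 0 for an unsatisfied clause and is at least 1 for a satisfied one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
for _ in range(1000):
    v = rng.choice([-1.0, 1.0], size=n)                  # binary assignment
    lits = rng.choice(n, size=rng.integers(1, n + 1), replace=False)
    s = np.zeros(n + 1)
    s[0] = -1.0                                          # truth variable v_0: s_0j = -1
    s[1 + lits] = rng.choice([-1.0, 1.0], size=len(lits))
    V = np.concatenate(([1.0], v))                       # rank-1 embedding, v_0 = +1
    k = len(lits)                                        # |s_j|: number of literals
    score = 1 - ((V @ s) ** 2 - (k - 1) ** 2) / (4 * k)
    satisfied = np.any(s[1 + lits] * v[lits] > 0)
    assert (score >= 1 - 1e-9) if satisfied else np.isclose(score, 0)
```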
Algorithm design.

Because the $C$ matrix here is not sparse ($s_j s_j^T$ has $|s_j|^2$ nonzero entries), we need a slightly more involved approach than for MAXCUT, but the algorithm is still extremely simple. Specifically, we maintain $z_j = V s_j$ for all clauses $j$. Because in each subproblem only one variable $v_i$ is changed, $z_j$ can be maintained in $O(k \cdot m_i)$ time, where $m_i$ denotes the number of clauses that contain variable $i$. In total, the iteration time complexity is $O(k \cdot m)$, where $m$ is the number of literals in the problem. Also, because applying an arbitrary rotation $R \in \mathbb{R}^{k \times k}$ to $V$ does not change the objective value of our problem, we can avoid updating $v_0$. Algorithm 2 shows the complete algorithm.

8. Actually, the formula matches the approximation of Goemans and Williamson (1995) for MAX-2SAT.

Algorithm 2: The Mixing method for the MAXSAT problem

  Initialize all $v_i$ randomly on a unit sphere, $i = 1 \ldots n$;
  Let $z_j = \sum_{i=0}^n s_{ij} v_i$ for $j = 1, \ldots, m$;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          foreach $s_{ij} \neq 0$ do $z_j := z_j - s_{ij} v_i$;
          $v_i := \text{normalize}\big(-\sum_{j=1}^m \frac{s_{ij}}{4|s_j|} z_j\big)$;
          foreach $s_{ij} \neq 0$ do $z_j := z_j + s_{ij} v_i$;
      end
  end

To recover the binary assignment, we apply the following classic rounding scheme: sample a random vector $r$ from a unit sphere, then assign binary variable $i$ to true if $\text{sign}(r^T v_i) = \text{sign}(r^T v_0)$ and to false otherwise.
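The following NumPy sketch of Algorithm 2 is our own illustration (the function name, dense sign matrix, and fixed sweep count are assumptions; a real solver would store $S$ sparsely); it shows the key point, namely the incremental maintenance of $z_j = V s_j$:

```python
import numpy as np

def mixing_maxsat(S, k, n_sweeps=100, seed=0):
    """Sketch of Algorithm 2. S is a dense (n+1) x m sign matrix whose row 0
    is the truth variable (s_0j = -1 for every clause j)."""
    n1, m = S.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(k, n1))
    V /= np.linalg.norm(V, axis=0)             # columns v_0 ... v_n on the sphere
    w = 1.0 / (4.0 * (np.count_nonzero(S, axis=0) - 1))   # w_j = 1/(4|s_j|), minus row 0
    Z = V @ S                                  # z_j = V s_j, maintained incrementally
    for _ in range(n_sweeps):
        for i in range(1, n1):                 # skip v_0 (rotation invariance)
            nz = np.nonzero(S[i])[0]           # clauses containing variable i
            Z[:, nz] -= np.outer(V[:, i], S[i, nz])   # remove s_ij v_i from z_j
            g = Z[:, nz] @ (w[nz] * S[i, nz])         # sum_j w_j s_ij z_j
            V[:, i] = -g / np.linalg.norm(g)          # closed-form coordinate step
            Z[:, nz] += np.outer(V[:, i], S[i, nz])   # add updated s_ij v_i back
    return V
```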
5. Experimental results
Running time comparison for MAXCUT
Figure 1 shows the results of running the Mixing method on several instances of benchmark MAXCUT problems. These range in size from approximately 1000 nodes and 20000 edges to approximately 2 million nodes and 3 million edges. For this application, we are largely concerned with evaluating the runtime of our Mixing method versus other approaches for solving the same semidefinite program. Specifically, we compare to DSDP (Benson and Ye, 2005), a mature interior point solver; SDPLR (Burer and Monteiro, 2003), one of the first approaches to exploit low-rank structure; Pure-RBR (Wen et al., 2009, 2012), a coordinate descent method in the $X$ space, which outputs the best rank-1 update at each step; and Manopt (Boumal et al., 2014), a recent toolkit for optimization on Riemannian manifolds, with specialized solvers dedicated to the MAXCUT problem. To be specific, we use DSDP 5.8, SDPLR 1.03-beta, and Manopt 3.0 with their default parameters. For Manopt, we compare to the subroutine "elliptopefactory", specially designed for diagonally constrained SDPs. For Pure-RBR, we implement the specialized algorithm (Wen et al., 2009, Algorithm 2) for the MAXCUT SDP with a sparse graph in C++, which only requires a single pass over the sparse matrix per iteration. We omit the log barrier and initialize the RBR with full-rank $X$. All experiments are run on an Intel Xeon E5-2670 machine with 256 GB memory, and all solvers are run in single-core mode to ensure fair comparisons. As the results show, in all cases the Mixing method is substantially faster than the other approaches: for reaching modest accuracy (defined as $10^{-4}$ times the difference between the initial and optimal value), we are typically 10-100x faster than all competing approaches; only the Manopt algorithm ever surpasses our approach, and this happens only
once both methods have achieved very high accuracy. Crucially, on the largest problems, we remain about 10x (or more) faster than Manopt over the entire run, which allows the Mixing method to scale to substantially larger problems.

9. We did not compare to commercial software such as MOSEK, an interior-point solver like DSDP, because it is not open-source and Boumal (2015) already showed that SDPLR is faster than MOSEK on diagonally constrained problems.

[Figure 1: nine log-log panels (axes: log running time (s) vs. log function difference; curves: Mixing, Manopt, SDPLR, DSDP, RBR) on benchmark instances including g3.rud, g27.rud, g35.rud, g48.rud, ca-GrQc, ca-HepPh, email-Enron, and roadNet.CA, ranging from n=800 with 19176 edges up to n=1971281 with 2766607 edges.]

Figure 1: Objective value difference versus training time for the MAXCUT problems (log-log plot, lower is better). The horizontal lines mark the default stopping precision of the Mixing method, which is $10^{-4}$ times the starting relative objective of the Mixing method. Experiments show that our method (the blue line) is 10-100x faster than other solvers at our default stopping precision. Note that curves for SDPLR, DSDP, and RBR are sometimes not plotted because they either crashed or did not output any solution within an hour.

Effectiveness of the Mixing method on approximating MAXSAT problems.
Unlike the previous experiment (where the focus was solely on optimization performance), in this section we highlight the fact that with the Mixing method we are able to obtain MAXSAT results with a high approximation ratio on challenging domains (as the problems are similar, relative optimization performance is similar to that of the MAXCUT evaluations). Specifically, we evaluate examples from the 2016 MAXSAT competition (Argelich et al., 2016) and compare our results to the best heuristic complete and partial solvers. Note that a complete solver produces a verified result, while a partial solver outputs a non-verified solution. Out of 525 problems solved in the complete track (every problem solved exactly by some solver within 30 minutes during the competition), our method achieves an average approximation ratio of 0.978, and usually finds such solutions within seconds or less. Further, on some instances we obtain a perfect solution faster than the best partial solvers. Figure 2 shows the progress of the approximation quality versus the running time. Besides the best heuristic solvers in MAXSAT 2016, we also show the approximation ratio over time for the well-known linear programming approximation (Goemans and Williamson, 1994) (solved via the Gurobi solver). Note that each point in the blue and green curves denotes the
6. Conclusion
In this paper we have presented the Mixing method, a low-rank coordinate descent approach for solving diagonally constrained semidefinite programming problems. The algorithm is extremely simple to implement, and involves no free parameters such as learning rates. On the theoretical side, we have proved that the method converges to a first-order critical point, and that all non-optimal critical points are unstable under sufficient rank. With a proper step size, the method converges to the global optimum almost surely under random initialization. This is the first convergence result to the global optimum on the spherical manifold without assumptions. Further, we have shown that the proposed methods admit local linear convergence in the neighborhood of the optimum regardless of the rank. In experiments, we have demonstrated the method on three different application domains: the MAXCUT SDP, a MAXSAT relaxation, and a word embedding problem (in the appendix). In all cases we show positive results, demonstrating that the method performs much faster than existing approaches from an optimization standpoint (for MAXCUT and word embeddings), and that the resulting solutions have high quality from an application perspective (for MAXSAT). In total, the method substantially raises the bar as to what applications can be feasibly addressed using semidefinite programming, and also advances the state of the art in structured low-rank optimization.

Acknowledgement
We thank Gary L. Miller for his helpful discussion on the geometry of the unit sphere, and Simon S. Du for pointing out many recent references.
References
P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Josep Argelich, Chu Min Li, Felip Manya, and Jordi Planes. Max-SAT-2016: eleventh Max-SAT evaluation. http://maxsat.ia.udl.cat/, 2016.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Rand-walk: A latent variable model approach to word embeddings. arXiv preprint arXiv:1502.03520, 2015.

Alexander Barvinok. A remark on the rank of positive semidefinite matrices subject to affine constraints. Discrete & Computational Geometry, 25(1):23-31, 2001.

Alexander I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2):189-202, 1995.

Steven J. Benson and Yinyu Ye. DSDP5: Software for semidefinite programming. Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, September 2005. Submitted to ACM Transactions on Mathematical Software.

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pages 530-582, 2016.

Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain, and Praneeth Netrapalli. Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form. arXiv preprint arXiv:1803.00186, 2018.

Nicolas Boumal. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015.

Nicolas Boumal and P-A Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200-239, 2015.

Nicolas Boumal, Bamdev Mishra, Pierre-Antoine Absil, Rodolphe Sepulchre, et al. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(1):1455-1459, 2014.

Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757-2765, 2016.

Nicolas Boumal, P-A Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, page drx080, 2018. doi: 10.1093/imanum/drx080. URL http://dx.doi.org/10.1093/imanum/drx080.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.

Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427-444, 2005.

Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111-119, 2012.

Michel X Goemans and David P Williamson. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7(4):656-666, 1994.

Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115-1145, 1995.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.

Carla P Gomes, Willem-Jan van Hoeve, and Lucian Leahu. The power of semidefinite programming relaxations for MAX-SAT. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 104-118. Springer, 2006.

Michael I Jordan and Martin J Wainwright. Semidefinite relaxations for approximate inference on graphs with cycles. In Advances in Neural Information Processing Systems, pages 369-376, 2004.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246-1257, 2016.

Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

Song Mei, Theodor Misiakiewicz, Andrea Montanari, and Roberto I Oliveira. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729, 2017.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013b.

Michael O'Neill and Stephen J Wright. Behavior of accelerated gradient methods near critical points of nonconvex problems. arXiv preprint arXiv:1706.07993, 2017.

Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.

Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, and Sujay Sanghavi. Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems. arXiv preprint arXiv:1606.01316, 2016.

Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339-358, 1998.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532-1543, 2014.

Hiroyuki Sato and Toshihiro Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization, 64(4):1011-1031, 2015.

Michael Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.

Po-Wei Wang and Chih-Jen Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15(1):1523-1548, 2014.

Zaiwen Wen, Donald Goldfarb, Shiqian Ma, and Katya Scheinberg. Row by row methods for semidefinite programming. Industrial Engineering, pages 1-21, 2009.

Zaiwen Wen, Donald Goldfarb, and Katya Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semidefinite, Conic and Polynomial Optimization, pages 533-564. Springer, 2012.

Liu Yang. Distance metric learning: A comprehensive survey. 2006.

Appendix A. Proof of Theorem 3.2: Convergence to a critical point
Lemma A.1
Let $\hat{V} = M(V)$ for the Mixing method $M : \mathbb{R}^{k \times n} \to \mathbb{R}^{k \times n}$ defined in (4) and (5). Then

$$f(V) - f(\hat{V}) = \sum_{i=1}^n y_i \|v_i - \hat{v}_i\|^2.$$

Proof
Recall that the objective function $f$ w.r.t. the variable $v_i$, while fixing all other variables $v_j$, is

$$\langle C, V^T V \rangle = \sum_i \sum_j c_{ij} v_i^T v_j = 2 v_i^T \Big(\sum_j c_{ij} v_j\Big) + \text{constant}.$$

Note that the $\sum_j c_{ij} v_j$ term is independent of $v_i$ because $c_{ii} = 0$. Now consider the inner cyclic iteration of the Mixing method updating $v_i$ to $\hat{v}_i$. Because only those $v_j$ with $j < i$ have already been updated to $\hat{v}_j$, the objective value before updating $v_i$ to $\hat{v}_i$ equals $2 v_i^T (\sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j) = 2 g_i^T v_i$ plus constants. Thus, the updates of the Mixing method can be written as

$$\hat{v}_i = -g_i / y_i, \quad \text{where } y_i = \|g_i\| \text{ and } g_i = \sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j, \; i = 1 \ldots n, \tag{7}$$

and the objective difference after updating $v_i$ to $\hat{v}_i$ is

$$2 g_i^T (v_i - \hat{v}_i) = -2 \|g_i\|\, \hat{v}_i^T (v_i - \hat{v}_i) = 2 y_i (1 - v_i^T \hat{v}_i) = y_i \|v_i - \hat{v}_i\|^2.$$

The result follows from summing the above equation over $i = 1 \ldots n$.
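Lemma A.1 can be checked numerically with a plain sequential sweep; the following toy verification is our own illustration, not code from the paper:

```python
import numpy as np

# Toy check of Lemma A.1: one Mixing sweep on a random symmetric C (zero diagonal).
rng = np.random.default_rng(1)
n, k = 8, 4
C = rng.normal(size=(n, n)); C = C + C.T; np.fill_diagonal(C, 0)
V = rng.normal(size=(k, n)); V /= np.linalg.norm(V, axis=0)

f = lambda V: np.trace(C @ V.T @ V)          # f(V) = <C, V^T V>
f0, drop = f(V), 0.0
for i in range(n):
    g = V @ C[:, i]                          # g_i mixes updated and old columns
    y = np.linalg.norm(g)
    v_new = -g / y
    drop += y * np.linalg.norm(V[:, i] - v_new) ** 2
    V[:, i] = v_new
assert np.isclose(f0 - f(V), drop)           # f(V) - f(V_hat) = sum_i y_i ||v_i - v_hat_i||^2
```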
Proof of Theorem 3.2

Proof

By Assumption 3.1, $y$ does not degenerate over the iterations, so Lemma A.1 implies that the Mixing method is strictly decreasing and admits a unique limit point $\bar{V} = \lim_{r \to \infty} V_r$ for the sequence $\{V_r\}$ generated by the algorithm. Denote by $\bar{y}$ the corresponding $y$ for $\bar{V}$. Then $\bar{V}$ being a fixed point of (5) implies

$$\bar{V}\big(C + \text{diag}(\bar{y})\big) = 0,$$

which means the projected gradient of $\langle C, V^T V \rangle$ at $\bar{V}$ is zero. Thus, together with the feasibility of the $V_r$, we have proved that the Mixing method converges to a critical point.

Appendix B. Proof of Lemma 3.10: Divergence of the Gauss-Seidel method
Proof
Because the dynamics of the Gauss-Seidel method (GS) on the system

$$\min_{x \in \mathbb{R}^n} f(x), \quad \text{where } f(x) \equiv x^T S x,$$

have the same Jacobian as $J_{GS}$, proving $\rho(J_{GS}) > 1$ is equivalent to proving the "linear divergence" of the Gauss-Seidel method, which cyclically optimizes each coordinate of $x \in \mathbb{R}^n$. Further, since $S \not\succeq 0$, there is an eigenvector $q \in \mathbb{R}^n$ of $S$ such that $q^T S q < 0$.

Consider the sequence $\{x_r\}_{r=0,1,\ldots}$ generated by GS. That is, $x_r = (J_{GS})^r x_0$ for all $r > 0$, where $(J_{GS})^r$ is $J_{GS}$ raised to the $r$-th power. Let the initial solution of the system be $x_0 = q$, so that $f(x_0) < 0$. Because the Gauss-Seidel method is greedy in every coordinate update, it is monotonically decreasing in the function value. Thus, there are only two cases for the sequence of function values: 1) the function value converges below zero; 2) the function value goes to negative infinity.

Denote by $z_i^r$ the iterate $x_r$ before updating the $i$-th coordinate, and let $z_1^r = x_r$ and $z_{n+1}^r = x_{r+1}$. This way, only the $i$-th coordinate differs between $z_i^r$ and $z_{i+1}^r$, and the inner cyclic updates can be flattened as

$$x_r = z_1^r \to z_2^r \to \ldots \to z_n^r \to z_{n+1}^r = x_{r+1}. \tag{8}$$
1) When the function value converges.
The monotonic decreasing property of GS implies that the function difference converges to zero. By the same analysis as in Lemma A.1, we have

$$f(x_r) - f(x_{r+1}) = \sum_{i=1}^n y_i \|z_i^r - z_{i+1}^r\|^2.$$

Thus, the flattened sequence $\{z_i^r\}$ converges, which implies that $\{x_r\}$ also converges. Let $\bar{x}$ be the limit of the sequence $\{x_r\}$. Being a limit of the GS sequence means that $\bar{x}$ is a fixed point of GS, which implies $S \bar{x} = 0$ and $f(\bar{x}) = \bar{x}^T S \bar{x} = 0$. This contradicts the monotonic decreasing property of GS and the fact that $f(x_0) < 0$.
2) When the function value $x_r^T S x_r$ goes to negative infinity.  Because the spectrum of $S$ is bounded, we know that $\|x_r\|$ also goes to infinity. For simplicity, we focus on the $r$-th iterate and write $z_i^r$ as $z_i$. In GS, the coordinate $z_{i,i}$ is updated to $z_{i+1,i} = -\frac{1}{y_i} \sum_{j \neq i} c_{ij} z_{j,i}$, and we have

$$f(z_i) - f(z_{i+1}) = y_i \|z_i - z_{i+1}\|^2 = y_i \Big(z_{i,i} + \frac{1}{y_i} \sum_{j \neq i} c_{ij} z_{j,i}\Big)^2 = |e_i^T S z_i|^2 / y_i, \tag{9}$$

where $e_i$ is the $i$-th coordinate vector and the first equality is from $f(x) = x^T S x$. Then we have the following claim, adapted from Lee et al. (2017, Claim 1).

Claim B.1
Assume $x_r$ is in the range of $S$. Then there exists an index $j$ such that $\frac{1}{y_j}|e_j^T S z_j| \geq \omega \|z_j\|$ for some global constant $\omega > 0$ that depends only on $S$ and $n$.

The full proof of the claim is listed after this lemma for completeness. To fulfill the assumption, we can decompose $x_r = x_r^{\parallel} + x_r^{\perp}$, where $x_r^{\parallel}$ is in the range of $S$ and $x_r^{\perp}$ is in the null space of $S$. Consider the flattened inner cyclic updates $z_i$ as in (8) but starting from $x_r^{\parallel}$, so that
$$x_r^{\parallel} = z_1 \to z_2 \to \ldots \to z_n \to z_{n+1}.$$
Because $J_{\mathrm{GS}}$ maps the null space of $S$ to itself,
$$f(x_r) - f(x_{r+1}) = f(x_r) - f\big(J_{\mathrm{GS}}(x_r^{\parallel} + x_r^{\perp})\big) = f(z_1) - f(z_{n+1} + x_r^{\perp}) = f(z_1) - f(z_{n+1}).$$
Further, because GS is coordinate-wise monotonically decreasing and the function decrease of a single coordinate update is no larger than that of the whole cyclic sweep, the above equality and (9) give
$$f(x_r) - f(x_{r+1}) = f(z_1) - f(z_{n+1}) \;\geq\; \frac{(e_j^T S z_j)^2}{y_j} \;\geq\; y_j\,\omega^2\|z_j\|^2.$$
The last inequality is from Claim B.1. Thus,
$$f(x_{r+1}) \leq f(x_r) - y_j\,\omega^2\|z_j\|^2. \qquad (10)$$
Further, because $\|z_j\|^2 \geq |z_j^T S z_j|/\rho(S)$ and $z_j^T S z_j = f(z_j) \leq f(x_r^{\parallel}) = f(x_r) \leq f(x_0) < 0$,
$$f(x_r) - y_j\,\omega^2\|z_j\|^2 \;\leq\; f(x_r) + \frac{y_j\,\omega^2}{\rho(S)}\, z_j^T S z_j \;\leq\; \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big) f(x_r). \qquad (11)$$
Combining (10) and (11), we obtain the exponential divergence to negative infinity
$$f(x_{r+1}) \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big) f(x_r) \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big)^{r+1} f(x_0), \quad \forall r \geq 0, \qquad (12)$$
where the last inequality comes from applying the first inequality recursively. Because $S \succeq \sigma_{\min}(S) I_n$ and $\sigma_{\min}(S) < 0$,
$$f(x_r) = \big((J_{\mathrm{GS}})^r x_0\big)^T S\, (J_{\mathrm{GS}})^r x_0 \geq \sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r x_0\|^2 \geq \sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r\|^2\,\|x_0\|^2. \qquad (13)$$
Combining (12) and (13), we have
$$\sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r\|^2\,\|x_0\|^2 \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big)^r f(x_0), \quad \forall r > 0.$$
Applying Gelfand's theorem for the spectral radius, we conclude that
$$\rho(J_{\mathrm{GS}}) = \lim_{r\to\infty} \|(J_{\mathrm{GS}})^r\|^{1/r} \geq \sqrt{1 + \frac{y_{\min}\,\omega^2}{\rho(S)}},$$
which means that the spectral radius $\rho(J_{\mathrm{GS}})$ is strictly larger than $1$.

10. We can obtain this result by fixing $y$ in Lemma A.1 to be a constant and letting $V \in \mathbb{R}^{1\times n}$. The result can also be obtained by examining the coordinate updates of GS, which is already known in the literature.
11. Note that only $x_r$ is decomposed into $x_r^{\parallel} \in \mathrm{range}(S)$ and $x_r^{\perp} \in \mathrm{null}(S)$. The symbols $z_i$ are the GS iterates generated from $x_r^{\parallel}$ and might not be in the range of $S$.
12. Consider $p$ such that $Sp = 0$. Then $(J_{\mathrm{GS}})\,p = -(L + D_y)^{-1}L^T p = -(L + D_y)^{-1}Sp + p = p$.
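The conclusion $\rho(J_{\mathrm{GS}}) > 1$ is easy to observe numerically. The sketch below (ours, with made-up data) forms the Gauss-Seidel iteration matrix $J_{\mathrm{GS}} = -(L + D)^{-1}L^T$ for a random symmetric matrix $S$ with positive diagonal that is not positive semidefinite, and prints its spectral radius, which exceeds $1$ as the lemma predicts.

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
S = (A + A.T) / 2
np.fill_diagonal(S, 1.0)                      # positive diagonal; S indefinite w.h.p.
assert np.linalg.eigvalsh(S).min() < 0        # confirm S is not PSD

L = np.tril(S, k=-1)                          # strictly lower-triangular part
D = np.diag(np.diag(S))
J_GS = -np.linalg.solve(L + D, L.T)           # Gauss-Seidel Jacobian
rho = np.abs(np.linalg.eigvals(J_GS)).max()
print(f"spectral radius of J_GS: {rho:.3f}")  # strictly larger than 1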
Proof of Claim B.1

Note that the following proof is essentially the same as that of Lee et al. (2017, Claim 1), where their step size $\alpha$ corresponds to our $1/y_i$ and their $y_t$ corresponds to our $x_r$. The only difference is that we prove the result for the exact Gauss-Seidel method, while they prove it for coordinate gradient descent with a step size. The proof is listed here for completeness.

Proof
We prove by contradiction. Assume that
$$\frac{1}{y_j}|e_j^T S z_j| < \omega \|z_j\| \quad\text{for all } j = 1 \ldots n, \text{ for a certain } \omega. \qquad (14)$$
We now show the following result by induction: for $j = 2 \ldots n+1$,
$$\|x_r - z_j\| < 2^{j-1}\,\omega\,\|x_r\|. \qquad (15)$$
Recall from (9) that
$$\|z_j - z_{j+1}\| = \frac{1}{y_j}\,|e_j^T S z_j|. \qquad (16)$$
For $j = 2$, we have the induction basis for (15) from the above equality and (14):
$$\|x_r - z_2\| = \|z_1 - z_2\| = \frac{1}{y_1}|e_1^T S z_1| < \omega\|z_1\| = \omega\|x_r\| < 2\omega\|x_r\|,$$
and accordingly $\|z_2\| \leq \|z_2 - x_r\| + \|x_r\| < (1 + 2\omega)\|x_r\|$. Now we do the induction step. Suppose the hypothesis (15) holds for some $j$. This implies
$$\|z_j\| \leq \|z_j - x_r\| + \|x_r\| < (1 + 2^{j-1}\omega)\|x_r\|. \qquad (17)$$
Then at $j+1$,
$$\begin{aligned}
\|x_r - z_{j+1}\| &\leq \|x_r - z_j\| + \|z_j - z_{j+1}\| &&\text{by the triangle inequality}\\
&< 2^{j-1}\omega\|x_r\| + \tfrac{1}{y_j}|e_j^T S z_j| &&\text{by hypothesis (15) at } j \text{ and (16)}\\
&< 2^{j-1}\omega\|x_r\| + \omega\|z_j\| &&\text{by assumption (14)}\\
&< 2^{j-1}\omega\|x_r\| + \omega(1 + 2^{j-1}\omega)\|x_r\| &&\text{by (17)}\\
&\leq 2^{j}\omega\|x_r\|,
\end{aligned}$$
where the last inequality holds by picking $\omega \in (0, \tfrac12)$ so that $\omega(1 + 2^{j-1}\omega) \leq 2^{j-1}\omega$ for all $j = 2 \ldots n$. Thus, the induction on (15) holds. With the result (15), for $j = 2 \ldots n$ we have
$$\begin{aligned}
\tfrac{1}{y_{\max}}|e_j^T S x_r| &\leq \tfrac{1}{y_j}|e_j^T S x_r| &&\text{by } y_{\max} \geq y_j\\
&\leq \tfrac{1}{y_j}\big(|e_j^T S z_j| + |e_j^T S (x_r - z_j)|\big) &&\text{by the triangle inequality}\\
&< \omega\|z_j\| + \tfrac{1}{y_j}\|S e_j\|\,\|x_r - z_j\| &&\text{by (14) and the Cauchy-Schwarz inequality}\\
&< \omega(1 + 2^{j-1}\omega)\|x_r\| + \tfrac{1}{y_j}\, 2^{j-1}\omega\,\|S e_j\|\,\|x_r\| &&\text{by (17) and (15)}\\
&\leq \omega\Big(1 + 2^n \omega + \tfrac{2^n}{y_{\min}}\rho(S)\Big)\|x_r\|, &&\qquad (18)
\end{aligned}$$
where the last inequality is from $\|S e_j\| \leq \|S\|\|e_j\| \leq \rho(S)$, $y_{\min} \leq y_j$, and $j \leq n+1$. Note that (18) also holds for $j = 1$ because of (14) and $x_r = z_1$. Summing the squares of (18) over $j = 1 \ldots n$ and taking the square root, we have
$$\sqrt{n}\,\omega\Big(1 + 2^n\omega + \frac{2^n \rho(S)}{y_{\min}}\Big)\|x_r\| \;>\; \frac{1}{y_{\max}}\|S x_r\| \;\geq\; \frac{1}{y_{\max}}\,\kappa_{\text{min-nz}}(S)\,\|x_r\|,$$
where $\kappa_{\text{min-nz}}(S) = \sqrt{\sigma_{\text{min-nz}}(S^T S)} > 0$ is the minimum nonzero singular value of $S$, and the last inequality holds because $\kappa_{\text{min-nz}}(S)\|x_r\| \leq \|S x_r\|$ for $x_r \in \mathrm{range}(S)$. Cancelling $\|x_r\|$ from both sides of the above inequality, the left-hand side goes to $0$ as $\omega \to 0$ while the right-hand side stays constant. Thus, picking $\omega$ small enough that
$$\frac{1}{y_{\max}}\,\kappa_{\text{min-nz}}(S) \;\geq\; \sqrt{n}\,\omega\Big(1 + 2^n\omega + \frac{2^n\rho(S)}{y_{\min}}\Big)$$
leads to a contradiction. So the claim holds.
13. Note that the choice of $\omega$ depends only on $n$ and $S$.

Appendix C. Proof of Theorem 3.4: Global convergence with a step size
Lemma C.1
The Mixing method $M_\theta$ with a step size $\theta \in (0, 1/\max_i \|c_i\|_1)$ never degenerates. That is, there is a constant $\delta \in (0,1)$ such that
$$\|\theta V c_i\| \leq 1 - \delta < 1 \quad\text{and}\quad \|v_i - \theta V c_i\| \geq \delta > 0.$$
Taking a constant $\theta \in (0, 1/\max_i \|c_i\|_1)$ is equivalent to taking $\theta = \frac{1-\delta}{\max_i \|c_i\|_1}$ for a constant $\delta \in (0,1)$. From the triangle inequality,
$$\|V c_i\| = \Big\|\sum_j c_{ij} v_j\Big\| \leq \sum_j |c_{ij}|\,\|v_j\| = \|c_i\|_1.$$
So we have $\|\theta V c_i\| \leq 1 - \delta < 1$. The second result follows from $\|v_i - \theta V c_i\| \geq 1 - \|\theta V c_i\| \geq \delta$.
The Mixing method $M_\theta$ with a step size $\theta \in (0, 1/\max_i \|c_i\|_1)$ is a diffeomorphism.
Note that $M_\theta$ can be decomposed as $M_\theta(V) = \Phi_n(\Phi_{n-1}(\ldots \Phi_1(V)))$, where the column update $\Phi_i(V): \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ is defined as
$$(\Phi_i(V))_s = \begin{cases} \dfrac{v_i - \theta V c_i}{\|v_i - \theta V c_i\|} & \text{if } s = i,\\[4pt] v_s & \text{otherwise,}\end{cases} \qquad s = 1 \ldots n.$$
Thus, if we can prove that each $\Phi_i(V)$ is a diffeomorphism for $i = 1 \ldots n$, then $M_\theta$ is a diffeomorphism, because compositions of diffeomorphisms are diffeomorphisms. Specifically, because all variables $v_j$ except $v_i$ are given and stay the same, proving that $\Phi_i(V)$ is a diffeomorphism is equivalent to proving that the projective mapping $\phi: \mathbb{R}^k \to \mathbb{R}^k$,
$$\phi(v) = \frac{v - g}{\|v - g\|}, \quad\text{where } g = \theta V c_i \text{ is known},$$
is a diffeomorphism. Let $\phi(v) = z$. We claim the inverse function $\phi^{-1}(z)$ is
$$\phi^{-1}(z) = \alpha z + g, \quad\text{where } \alpha = -z^T g + \sqrt{(z^T g)^2 + 1 - \|g\|^2}.$$
The square root is valid because of Lemma C.1. We prove the claim by validation. First,
$$\phi^{-1}(\phi(v)) = \phi^{-1}\Big(\frac{v-g}{\|v-g\|}\Big).$$
Using the property $\|v\| = 1$, the $\alpha$ for the above expression is
$$\begin{aligned}
\alpha &= -\frac{(v-g)^T g}{\|v-g\|} + \sqrt{\Big(\frac{(v-g)^T g}{\|v-g\|}\Big)^2 + 1 - \|g\|^2}\\
&= \frac{1}{\|v-g\|}\Big(-(v-g)^T g + \sqrt{((v-g)^T g)^2 + \|v-g\|^2(1 - \|g\|^2)}\Big)\\
&= \frac{1}{\|v-g\|}\big(-v^T g + \|g\|^2 + 1 - v^T g\big) = \|v-g\|.
\end{aligned}$$
Thus, $\phi^{-1}(\phi(v)) = \|v-g\|\,\frac{v-g}{\|v-g\|} + g = v$ is indeed the inverse function. The diffeomorphism follows from the smoothness of $\phi(v)$ and $\phi^{-1}(z)$ when $\|v - g\| \geq \delta > 0$, which holds by Lemma C.1.
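The closed-form inverse can also be verified numerically. A quick NumPy check (illustrative only): for a random unit vector $v$ and a random $g$ with $\|g\| < 1$, applying the claimed $\phi^{-1}$ to $\phi(v)$ recovers $v$.

import numpy as np

rng = np.random.default_rng(2)
k = 5
v = rng.standard_normal(k)
v /= np.linalg.norm(v)                         # ||v|| = 1
g = rng.standard_normal(k)
g *= 0.9 / np.linalg.norm(g)                   # ||g|| < 1, as Lemma C.1 guarantees

z = (v - g) / np.linalg.norm(v - g)            # z = phi(v)
alpha = -z @ g + np.sqrt((z @ g) ** 2 + 1 - g @ g)
assert np.allclose(alpha * z + g, v)           # phi_inv(phi(v)) == v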
For the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i \|c_i\|_1)$, let $\hat V = M_\theta(V)$. Following the notation in (4) and (6), we have
$$f(V) - f(\hat V) = \sum_i \frac{1 + y_i}{\theta}\,\|v_i - \hat v_i\|^2,$$
where $y_i = \|v_i - \theta g_i\|$ and $g_i = \sum_{j<i} c_{ij}\hat v_j + \sum_{j>i} c_{ij} v_j$.
Consider the inner iteration of the Mixing method with a step size. By the same analysis as in Lemma A.1, the objective value before updating the variable $v_i$ is $2 g_i^T v_i$ plus constants, and the objective difference after updating $v_i$ to $\hat v_i = (v_i - \theta g_i)/y_i$ is
$$\begin{aligned}
2 g_i^T (v_i - \hat v_i) &= 2\Big(g_i + \frac{v_i - \theta g_i}{\theta}\Big)^T (v_i - \hat v_i) - \frac{2}{\theta}(v_i - \theta g_i)^T (v_i - \hat v_i)\\
&= \frac{2}{\theta}\, v_i^T (v_i - \hat v_i) - \frac{2 y_i}{\theta}\, \hat v_i^T (v_i - \hat v_i)\\
&= \frac{2(1 + y_i)}{\theta}\,(1 - v_i^T \hat v_i) = \frac{1 + y_i}{\theta}\,\|v_i - \hat v_i\|^2.
\end{aligned}$$
Thus, the result follows from summing the above equation over $i = 1 \ldots n$.
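As with Lemma A.1, the decrease formula can be checked numerically. The sketch below (ours) performs one sweep of $M_\theta$ with $\theta$ just inside the admissible range and asserts the identity of Lemma C.3.

import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 4
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
theta = 0.9 / np.abs(C).sum(axis=0).max()      # theta < 1 / max_i ||c_i||_1

V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

f_before = np.trace(C @ V.T @ V)
decrease = 0.0
for i in range(n):
    g = V @ C[:, i]                            # g_i, with updated v_j for j < i
    y = np.linalg.norm(V[:, i] - theta * g)    # y_i = ||v_i - theta g_i||
    v_new = (V[:, i] - theta * g) / y          # the M_theta update
    decrease += (1 + y) / theta * np.linalg.norm(V[:, i] - v_new) ** 2
    V[:, i] = v_new
f_after = np.trace(C @ V.T @ V)

assert abs((f_before - f_after) - decrease) < 1e-8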
Proof of Theorem 3.4

Proof  Similar to Lemma 3.7, the Jacobian of the Mixing method $M_\theta$ with step size $\theta$ is
$$J(V) = \big((D_y + \theta L)^{-1} \otimes I_k\big)\, P\, \big((I_n - \theta L^T) \otimes I_k\big),$$
where $P = \mathrm{diag}(P_1, \ldots, P_n)$ and $P_i = I_k - \hat v_i \hat v_i^T$. By Lemma 3.9, the spectral radius $\rho(J)$ at a critical point is lower bounded by that of $J_{\mathrm{CGD}} = (D_y + \theta L)^{-1}(I_n - \theta L^T)$, which equals the Jacobian of coordinate gradient descent (CGD) on the linear system given by $S = C + \mathrm{diag}(\|V c_i\|)$. Because CGD admits a Jacobian with $\rho(J_{\mathrm{CGD}}) > 1$ when $S \not\succeq 0$ (Lee et al., 2017, Proposition 5), Lemma 3.12 implies that all non-optimal critical points are unstable fixed points of $M_\theta$. Further, since $M_\theta$ is a diffeomorphism by Lemma C.2, we can apply the center-stable manifold theorem (Shub, 2013, Theorem III.5) to the mapping $M_\theta$. Specifically, because the non-optimal critical points are unstable fixed points and $M_\theta$ is a diffeomorphism, Lee et al. (2017, Theorem 2) implies that the Mixing method $M_\theta$ escapes all non-optimal critical points almost surely under random initialization. Further, because $M_\theta$ is strictly decreasing (Lemma C.3) and the objective value is lower bounded, $M_\theta$ converges to a first-order critical point (by the same analysis as in Theorem 3.2). In conclusion, the almost-sure escape from non-optimal critical points and the convergence to a critical point together imply that $M_\theta$ almost surely converges to a global optimum.
14. Note that Lee et al. (2017, Lemma 1) uses only the diffeomorphism property, so their assumption of a non-singular Jacobian is not necessary. Indeed, a non-singular Jacobian is a sufficient condition for the mapping to be locally one-to-one, but not a necessary one.
15. Note that the critical points here are non-isolated because they are invariant to rotations in $\mathbb{R}^k$. While Lee et al. (2017, Theorem 2) suffices for our result, interested readers can also refer to Panageas and Piliouras (2016) for how the Lindelöf lemma resolves the non-isolation issue.

Appendix D. Proof of Theorem 3.5: Local linear convergence
Lemma D.1
Compositions of Lipschitz continuous functions are Lipschitz continuous. That is, if a mapping $F$ is $\lambda_F$-Lipschitz continuous and $G$ is $\lambda_G$-Lipschitz continuous, then $G(F(\cdot))$ is $\lambda_G\lambda_F$-Lipschitz continuous.
The mapping $F$ being $\lambda_F$-Lipschitz continuous means that for variables $X, Y$ in its domain,
$$\|F(X) - F(Y)\| \leq \lambda_F \|X - Y\|.$$
Consequently, for the $\lambda_G$-Lipschitz continuous mapping $G$,
$$\|G(F(X)) - G(F(Y))\| \leq \lambda_G \|F(X) - F(Y)\| \leq \lambda_G\lambda_F\|X - Y\|,$$
which establishes the proof.
Under Assumption 3.1, the Mixing method $M: \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ and its normalizer $y: \mathbb{R}^{k\times n} \to \mathbb{R}^n$ are Lipschitz continuous to any optimum $V^*$. Further, there are constants $\beta, \gamma > 0$ such that
$$\|V - M(V)\|^2 \geq \big(\beta - \gamma\|y(V) - y(V^*)\|\big)\big(f(V) - f^*\big). \qquad (19)$$
By Lemma D.1, proving the Lipschitz continuity of the mapping $y(V)$ and the corresponding $M(V)$ reduces to proving the Lipschitz continuity of $\psi_i(V) = \|V c_i\|$ and $\phi_i(V) = -\frac{1}{\psi_i(V)} V c_i$. First, from the definition of $\psi_i$ and the Cauchy-Schwarz inequality,
$$|\psi_i(V) - \psi_i(V^*)| \leq \|V c_i - V^* c_i\| \leq \|c_i\|\,\|V - V^*\|.$$
Further, from $\phi_i(V^*) = v_i^*$, the Cauchy-Schwarz inequality, and the above inequality, we have
$$\begin{aligned}
\|\phi_i(V) - \phi_i(V^*)\| &= \Big\|-\frac{1}{\psi_i(V)}V c_i + \frac{1}{\psi_i(V)}V^* c_i - \Big(\frac{\psi_i(V)}{\psi_i(V)} - \frac{\psi_i(V^*)}{\psi_i(V)}\Big)v_i^*\Big\|\\
&\leq \frac{1}{\psi_i(V)}\big(\|c_i\|\,\|V - V^*\| + |\psi_i(V) - \psi_i(V^*)|\,\|v_i^*\|\big)\\
&\leq \frac{2\|c_i\|}{\delta}\,\|V - V^*\|,
\end{aligned}$$
where the last inequality is from Assumption 3.1, that is, $\psi_i(V) \geq \delta > 0$. Thus, the Lipschitz continuity holds by applying Lemma D.1 recursively. Now we prove the result (19). Let $S^* = C + D_{y^*}$ and $S = C + D_y$, where $y^* = y(V^*)$ and $y = y(V)$. Under the notation in (5),
$$V - M(V) = V(L^T + D_y)(L^T + D_y)^{-1} + V L (L^T + D_y)^{-1} = V S (L^T + D_y)^{-1}.$$
For simplicity, let $R := (L^T + D_y)^{-1}$. Expanding the square using $S = S^* + D_y - D_{y^*}$ and dropping the last squared term after the expansion, we have
$$\begin{aligned}
\|V - M(V)\|^2 &\geq \|V S^* R\|^2 + 2\,\mathrm{tr}\big(V^T V S^* R R^T (D_y - D_{y^*})\big)\\
&\geq \sigma_{\min}^2(R)\,\sigma_{\text{min-nz}}(S^*)\,\mathrm{tr}(V^T V S^*) - 2\|y - y^*\|\,\sigma_{\max}^2(R)\,\mathrm{tr}(V^T V S^*).
\end{aligned}$$
The last inequality is from $\|y - y^*\| I_n + (D_y - D_{y^*}) \succeq 0$ and Lemma 3.12 ($S^* \succeq 0$). Note that $\mathrm{tr}(V^T V S^*) = \mathrm{tr}(V^T V C) - (-\mathbf{1}^T y^*) = f(V) - f^*$ by the analysis in Lemma 3.12. With $\sigma_{\min}(R) = 1/y_{\max}$ and $\sigma_{\max}(R) = 1/y_{\min}$, the result (19) follows from
$$\|V - M(V)\|^2 \geq \Big(\frac{\sigma_{\text{min-nz}}(S^*)}{y_{\max}^2} - \frac{2\|y - y^*\|}{y_{\min}^2}\Big)\big(f(V) - f^*\big).$$

16. The coefficient $y$ defined in (5) is also a function of $V$.
17. This is because $y$ and $M$ can be composed from $\psi_i$ and $\phi_i$, as in Lemma C.2.
The Mixing method $M_\theta: \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ with step size $\theta$ and its normalizer $y: \mathbb{R}^{k\times n} \to \mathbb{R}^n$ are Lipschitz continuous to any optimum $V^*$. Further, there are constants $\beta, \gamma > 0$ such that
$$\|V - M_\theta(V)\|^2 \geq \big(\beta - \gamma\|y(V) - y(V^*)\|\big)\big(f(V) - f^*\big). \qquad (20)$$
As in the proof of Lemma D.2, proving the Lipschitz continuity of $y(V)$ and $M_\theta(V)$ reduces to proving the Lipschitz continuity of the mappings $\psi_i(V) = \|v_i - \theta V c_i\|$ and $\phi_i(V) = \frac{1}{\psi_i(V)}(v_i - \theta V c_i)$. First, the Lipschitz continuity of $\psi_i(V)$ follows from the Cauchy-Schwarz inequality:
$$|\psi_i(V) - \psi_i(V^*)| \leq \|v_i - \theta V c_i - (v_i^* - \theta V^* c_i)\| \leq (1 + \theta\|c_i\|)\,\|V - V^*\|.$$
Combining the above inequality and $\phi_i(V^*) = v_i^*$, we have the Lipschitz continuity
$$\begin{aligned}
\|\phi_i(V) - \phi_i(V^*)\| &= \Big\|\frac{1}{\psi_i(V)}(v_i - \theta V c_i) - \frac{1}{\psi_i(V)}(v_i^* - \theta V^* c_i) - \Big(\frac{\psi_i(V)}{\psi_i(V)} - \frac{\psi_i(V^*)}{\psi_i(V)}\Big)v_i^*\Big\|\\
&\leq \frac{1}{\psi_i(V)}\big((1 + \theta\|c_i\|)\|V - V^*\| + |\psi_i(V) - \psi_i(V^*)|\big)\\
&\leq \frac{2(1 + \theta\|c_i\|)}{\delta}\,\|V - V^*\|.
\end{aligned}$$
The last inequality is from Lemma C.1. Next, we prove the result (20). Under the notation in (6),
$$V - M_\theta(V) = V - V(I_n - \theta L)(\theta L^T + D_y)^{-1} = V(D_y - I_n + \theta C)(\theta L^T + D_y)^{-1}. \qquad (21)$$
Let $y^* = y(V^*)$ and $y = y(V)$. We have $y_i^* = 1 + \theta\|V^* c_i\|$ by $M_\theta(V^*) = V^*$, and thus $S^* = C + \frac{1}{\theta}(D_{y^*} - I_n) \succeq 0$ by Lemma 3.12. Further,
$$D_y - I_n + \theta C = \theta S^* + D_y - D_{y^*}. \qquad (22)$$
Expanding $\|V - M_\theta(V)\|^2$ with the equalities (21) and (22) and following the same analysis as in Lemma D.2, the result (20) holds with
$$\|V - M_\theta(V)\|^2 \geq \theta\Big(\frac{\theta\,\sigma_{\text{min-nz}}(S^*)}{y_{\max}^2} - \frac{2\|y - y^*\|}{y_{\min}^2}\Big)\big(f(V) - f^*\big).$$
For both Mixing methods $M$ (under Assumption 3.1) and $M_\theta$ (no assumption), there is a constant $\tau > 0$ such that, for the normalizers $y^* = y(V^*)$ and $y = y(V)$,
$$\|y - y^*\|^2 \leq \tau\,\big(f(V) - f^*\big).$$
First we prove the result for the Mixing method $M$. Denote by $Z_i$ the (flattened) inner cyclic iterate before updating $v_i$ to $\hat v_i$, so that only the vector $v_i$ changes between $Z_i$ and $Z_{i+1}$; that is,
$$V = Z_1 \to Z_2 \to \ldots \to Z_n \to Z_{n+1} = M(V).$$
Let $S^* = C + \mathrm{diag}(\|V^* c_i\|)$ for an optimal solution $V^*$. Observe that
$$(y_i - y_i^*)\,\hat v_i = -Z_i c_i - y_i^* \hat v_i = -Z_i s_i^* + y_i^*(v_i - \hat v_i).$$
The terms $Z_i s_i^*$ and $v_i - \hat v_i$ can be bounded by Lemma 3.12 and Lemma A.1 (under Assumption 3.1):
$$f(V) - f^* \geq f(Z_i) - f^* = \mathrm{tr}(Z_i^T Z_i S^*) \geq \frac{1}{\rho_{\max}(S^*)}\,\|Z_i S^*\|^2 \geq \frac{1}{\rho_{\max}(S^*)}\,\|Z_i s_i^*\|^2$$
and
$$f(V) - f^* \geq f(Z_i) - f(Z_{i+1}) \geq \delta\,\|v_i - \hat v_i\|^2.$$
Thus, by $(y_i - y_i^*)^2 \leq 2\|Z_i s_i^*\|^2 + 2(y_i^*)^2\|v_i - \hat v_i\|^2$ and the above two inequalities, $(y_i - y_i^*)^2$ is bounded by $f(V) - f^*$ times a constant. Summing over $i = 1 \ldots n$ gives the result for a constant $\tau > 0$. The result for the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i\|c_i\|_1)$ follows from the same analysis and Lemma C.3 (no assumption).
Proof of Theorem 3.5

Proof  By Lemma D.2, there are constants $\beta, \gamma > 0$ for the Mixing method $M$ such that
$$\|V - M(V)\|^2 \geq \big(\beta - \gamma\|y - y^*\|\big)\big(f(V) - f^*\big).$$
By Lemma D.4, we can pick a neighborhood of the optimum in which $f(V) - f^*$ is sufficiently small so that $\beta - \gamma\|y - y^*\| \geq \kappa$ for a constant $\kappa > 0$. Consequently,
$$\|V - M(V)\|^2 \geq \kappa\,\big(f(V) - f^*\big).$$
Further, the above inequality holds for all subsequent iterates, because $f(V) - f^*$ is strictly decreasing and thus remains small enough for $\beta - \gamma\|y - y^*\| \geq \kappa$ to hold. Together with Lemma A.1 (under Assumption 3.1), there is another constant $\delta > 0$ such that $f(V) - f(M(V)) \geq \delta\|V - M(V)\|^2$, and thus
$$f(V) - f(M(V)) \geq \delta\kappa\,\big(f(V) - f^*\big) \implies (1 - \delta\kappa)\big(f(V) - f^*\big) \geq f(M(V)) - f^*.$$
That is, the Mixing method $M$ converges Q-linearly to the optimal objective value in the neighborhood of the optimum. The same result follows for the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i\|c_i\|_1)$ from Lemma C.3 (no assumption) and Lemma D.3. Further, because $f(V) - f^* = \mathrm{tr}(V S^* V^T)$ and $S^* = C + \mathrm{diag}(\|V^* c_i\|) \succeq 0$ by Lemma 3.12, the above linear convergence in objective value implies the R-linear convergence of the rows of $V$ to the null space of $S^*$. Together with the Mixing methods being feasible methods, we obtain the linear convergence of $V$ to the global optimum in the neighborhood of the optimum.
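The local linear rate is visible in practice. In the short experiment below (our illustration, on a random dense $C$), the per-sweep decrease $f(V_r) - f(V_{r+1})$ shrinks geometrically once the iterates enter a neighborhood of the optimum, consistent with the theorem.

import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 9                                   # k(k+1)/2 > n, enough rank
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

def f(V):
    return np.trace(C @ V.T @ V)

prev = f(V)
for r in range(60):
    for i in range(n):                         # one cyclic sweep of updates (7)
        g = V @ C[:, i]
        V[:, i] = -g / np.linalg.norm(g)
    cur = f(V)
    if r % 10 == 0:
        print(f"sweep {r:2d}: decrease {prev - cur:.3e}")
    prev = cur                                 # decreases fall off geometrically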
Appendix E. Proof of Lemma 3.11: Rank Deficiency in Critical Points

The proof is a specialized version of Boumal et al. (2016, Lemma 9) for the MAXCUT SDP, in which their $Y$ is our $V$ and their $\mu(V)$ is our $y$. We list it here for completeness.

Proof
Let $V$ be a first-order critical point of problem (2), which means that there is a corresponding $y_i = \|V c_i\|$, $i = 1 \ldots n$, such that
$$V S = 0, \quad\text{where } S \equiv C + D_y.$$
This implies
$$\mathrm{rank}(V) \leq \mathrm{null}(C + D_y) \leq \max_\nu\, \mathrm{null}(C + D_\nu).$$
Note that the right-hand side is independent of $V$, so we can use it to bound the rank of all critical points $V$. Let $\nu$ be a maximizer of the right-hand side, $M \equiv C + D_\nu$, and $\ell \equiv \mathrm{null}(M)$. Writing $C = M - D_\nu$, we have
$$C \in \mathcal{N}_\ell + \mathrm{im}(\mathcal{D}),$$
in which $+$ denotes the set sum, $\mathrm{im}(\mathcal{D})$ denotes the image of the set of all diagonal matrices of size $n$, and $\mathcal{N}_\ell$ denotes the set of symmetric matrices of size $n$ with nullity $\ell$. Because of the symmetry of the matrices in $\mathcal{N}_\ell$,
$$\dim(\mathcal{N}_\ell) = \frac{n(n+1)}{2} - \frac{\ell(\ell+1)}{2}.$$
Further, for $M$ to admit a critical point $V$ of rank $k$, we must have $\ell \geq k$. Taking the union over all possible $\ell$,
$$C \in \bigcup_{\ell = k \ldots n} \mathcal{N}_\ell + \mathrm{im}(\mathcal{D}).$$
Note that the right-hand side is now independent of $C$. Because the dimension of a finite union is at most the maximal dimension, and the dimension of a finite set sum is at most the sum of the set dimensions,
$$\dim\Big(\bigcup_{\ell = k \ldots n} \mathcal{N}_\ell + \mathrm{im}(\mathcal{D})\Big) \leq \dim\big(\mathcal{N}_k + \mathrm{im}(\mathcal{D})\big) \leq \frac{n(n+1)}{2} - \frac{k(k+1)}{2} + n, \qquad (23)$$
where $\dim(\mathrm{im}(\mathcal{D})) = n$ because the space of diagonal matrices has $n$ free dimensions. Because the symmetric matrix $C$ lives in a space of dimension $\frac{n(n+1)}{2}$, almost no $C$ satisfies the right-hand side of (23) if we take $k$ large enough that
$$\frac{n(n+1)}{2} - \frac{k(k+1)}{2} + n < \frac{n(n+1)}{2}.$$
Thus, almost no $C$ admits a critical point of rank $k$ when $\frac{k(k+1)}{2} > n$, which means that for almost all $C$, every critical point has rank at most $k - 1$.
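The rank deficiency is also easy to observe empirically. In the toy run below (ours), $n = 10$ and $k = 5$ give $k(k+1)/2 = 15 > n$, and the Mixing method converges to a critical point whose numerical rank is at most $k - 1$ for a generic $C$.

import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 5                                   # k(k+1)/2 = 15 > n = 10
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

for _ in range(2000):                          # run the Mixing method (7) to convergence
    for i in range(n):
        g = V @ C[:, i]
        V[:, i] = -g / np.linalg.norm(g)

# numerical rank of the limit point; at most k - 1 for almost every C
print(np.linalg.matrix_rank(V, tol=1e-4))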
Algorithm 3: The Mixing method for the word embedding problem

  Initialize $v_i$ randomly on a unit sphere; initialize $b_i := 0$ for each $i = 1, \ldots, n$;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          Solve $Hd + g = 0$ approximately by the conjugate gradient method, using
          $Hd := \sum_j w_{ij}(v_j^T d)\, v_j$ and $g := \sum_j w_{ij}\, e_{ij}\, v_j$, where $e_{ij} = v_i^T v_j + b_i + b_j - \log(c_{ij})$;
          Let $v_i := v_i + d$;
          Update the bias term $b_i := b_i - \big(\sum_j w_{ij} e_{ij}\big) / \big(\sum_j w_{ij}\big)$;
      end
  end

Appendix F. Application: Word embedding problem
Word embedding is a feature-learning technique that represents the meanings of words as low-dimensional vectors. For example, the popular Word2vec model (Mikolov et al., 2013b,a) successfully captures similarities and analogies between words using a shallow neural network. Another popular model, GloVe (Pennington et al., 2014), uses a factorization-based formulation and achieves better accuracy on analogy tasks than Word2vec. Theoretical justifications of these two models are discussed in RANDWALK (Arora et al., 2015). Here, we show that our coordinate descent approach can also be applied to learning word embeddings with the GloVe objective function, highlighting the fact that the Mixing methods can be applied to problems typically regarded as purely non-convex optimization.
Problem description.
Let $C$ be the co-occurrence matrix, so that $c_{ij}$ is the number of times words $i$ and $j$ co-occur within the same window in the corpus. We consider solving a slightly modified version of the GloVe objective function (Pennington et al., 2014),
$$\min_{V \in \mathbb{R}^{k\times n},\; b \in \mathbb{R}^n}\; \sum_{i \neq j} w_{ij}\,\big(v_i^T v_j + b_i + b_j - \log c_{ij}\big)^2,$$
where $n$ is the size of the vocabulary, $k$ is the number of latent factors, and $w_{ij}$ is a weighting factor that suppresses the influence of high-frequency words (GloVe uses $w_{ij} = \min\{(c_{ij}/c_{\max})^{3/4},\, 1\}$ for a cutoff parameter $c_{\max}$). The only difference from GloVe is that we do not include the self-loop terms (i.e., $i = j$) in the formulation.
Focusing on the subproblem involving only the variable $v_i$ and taking a step $d \in \mathbb{R}^k$, the subproblem $\min_d f(v_i + d)$ becomes
$$\min_{d \in \mathbb{R}^k}\; \sum_j w_{ij}\,\big((v_i + d)^T v_j + b_i + b_j - \log c_{ij}\big)^2.$$

[Figure 3: $(f - f_{\min})$ versus training time for the word embedding problem on (a) wiki8 and (b) enwiki, where $f_{\min}$ is the minimum objective value we obtained. The experiments show that the Mixing method generalizes to nonlinear objective functions and is faster than SGD.]

Define $e_{ij} = v_i^T v_j + b_i + b_j - \log c_{ij}$. Then the above subproblem can be reformulated as
$$\min_{d \in \mathbb{R}^k}\; d^T \Big(\underbrace{\textstyle\sum_j w_{ij}\, v_j v_j^T}_{H}\Big)\, d + 2\Big(\underbrace{\textstyle\sum_j w_{ij}\, e_{ij}\, v_j}_{g}\Big)^T d,$$
which is an unconstrained quadratic program over $d \in \mathbb{R}^k$. In practice, we apply the conjugate gradient method with a stopping condition on $\|\nabla_d f(v_i + d)\|$ to obtain a good enough approximation, and we update all the $v_i$ cyclically. In total, updating all the variables once takes $O(k \cdot m \cdot T)$ time, where $m$ is the number of nonzeros in $C$ and $T$ is the number of conjugate gradient iterations, which is typically small in our settings. Algorithm 3 contains a complete description.
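For concreteness, the following Python sketch (our rendering of Algorithm 3 on dense toy data; the names update_word, W, and logC are ours, and we recompute the residuals before the bias update) implements one inner iteration: a few conjugate gradient steps for $Hd = -g$, followed by the $v_i$ and $b_i$ updates.

import numpy as np

def update_word(i, V, b, W, logC, cg_steps=5):
    """One inner iteration of Algorithm 3 for word i (dense toy version)."""
    w = W[i].copy()
    w[i] = 0.0                                 # exclude the self-loop term (i = j)
    e = V.T @ V[:, i] + b[i] + b - logC[i]     # e_ij = v_i^T v_j + b_i + b_j - log c_ij
    g = V @ (w * e)                            # g = sum_j w_ij e_ij v_j
    H = lambda d: V @ (w * (V.T @ d))          # Hessian-vector product H d
    # conjugate gradient on H d = -g, starting from d = 0
    d = np.zeros(V.shape[0])
    r = -g
    p = r.copy()
    for _ in range(cg_steps):
        Hp = H(p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    V[:, i] += d
    e = V.T @ V[:, i] + b[i] + b - logC[i]     # refresh residuals with the new v_i
    b[i] -= (w @ e) / w.sum()                  # exact minimization over b_i

# toy usage with synthetic weights and log-counts
rng = np.random.default_rng(6)
k, n = 8, 20
V = 0.1 * rng.standard_normal((k, n))
b = np.zeros(n)
W = rng.random((n, n)); W = (W + W.T) / 2
logC = rng.standard_normal((n, n)); logC = (logC + logC.T) / 2
for i in range(n):
    update_word(i, V, b, W, logC)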
Results.  Figure 3 compares the proposed Mixing method with the stochastic gradient method (SGD), the default solver for GloVe, on the wiki8 and enwiki corpora, two widely used benchmarks for word embeddings. The corpora are pre-processed following Pennington et al. (2014) (removing non-textual elements, sentence splitting, and tokenization), and words that appear fewer than a threshold number of times in the corpus are ignored. For both datasets, the Mixing method converges substantially faster than SGD and attains a lower objective value.