The Mixing method: low-rank coordinate descent for semidefinite programming with diagonal constraints
Po-Wei Wang [email protected]
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
Wei-Cheng Chang [email protected]
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213
J. Zico Kolter [email protected]
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, and Bosch Center for Artificial Intelligence, Pittsburgh, PA 15222
Abstract
In this paper, we propose a low-rank coordinate descent approach to structured semidefinite programming with diagonal constraints. The approach, which we call the Mixing method, is extremely simple to implement, has no free parameters, and typically attains an order of magnitude or better improvement in optimization performance over the current state of the art. We show that the method is strictly decreasing, converges to a critical point, and further that for sufficient rank all non-optimal critical points are unstable. Moreover, we prove that with a step size, the Mixing method converges to the global optimum of the semidefinite program almost surely at a locally linear rate under random initialization. This is the first low-rank semidefinite programming method that has been shown to achieve a global optimum on the spherical manifold without assumptions. We apply our algorithm to two related domains: solving the maximum cut semidefinite relaxation, and solving a maximum satisfiability relaxation (we also briefly consider additional applications such as learning word embeddings). In all settings, we demonstrate substantial improvement over the existing state of the art along various dimensions, and in total, this work expands the scope and scale of problems that can be solved using semidefinite programming methods.
Keywords: semidefinite program, non-convex optimization, spherical manifold, convergence to global optimum, random initialization
1. Introduction
This paper considers the solution of large-scale, structured semidefinite programming problems (SDPs). A generic SDP can be written as the optimization problem

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ \langle A_i, X \rangle = b_i, \; i = 1 \ldots p, \tag{1}$$

where $X \in \mathbb{S}^n$ is the optimization variable (a symmetric $n \times n$ matrix), and $A_i \in \mathbb{R}^{n \times n}$, $b_i \in \mathbb{R}$, $i = 1, \ldots, p$ are problem data. Semidefinite programs can encode a huge range of practical problems, including relaxations of many combinatorial optimization tasks (Boyd and Vandenberghe, 2004), approximate probabilistic inference (Jordan and Wainwright, 2004), metric learning (Yang, 2006), matrix completion (Candes and Recht, 2012), and many others. Unfortunately, generic semidefinite programs involve optimizing a matrix-valued variable $X$, where the number of variables grows quadratically, so that solving them quickly becomes unaffordable for solvers employing exact methods such as primal-dual interior point algorithms.

Fortunately, a property of these problems, which has been recognized for some time now, is that the solution to such problems is often low-rank; specifically, the problem always admits an optimal solution with rank at most $\lceil \sqrt{2p} \rceil$ (Barvinok, 1995; Pataki, 1998), and many SDPs are set up to have even lower rank solutions in practice. This has motivated the development of non-convex low-rank solvers for these systems: that is, we can attempt to solve the equivalent (but now non-convex) formulation of the problem

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ \langle C, V^T V \rangle, \quad \text{subject to}\ \langle A_i, V^T V \rangle = b_i, \; i = 1 \ldots p,$$

with the optimization variable $V \in \mathbb{R}^{k \times n}$. Here we are explicitly representing $X$ by the rank-$k$ matrix $V$ (typically with $k \ll n$), $X = V^T V$. Note that because we are representing $X$ in this way, we no longer need to explicitly enforce semidefiniteness, as it is implied by the change of variables. In a long series of work dating back several years, it has been shown that, somewhat surprisingly, this change to a non-convex problem does not cause as many difficulties as might be thought: in practice, local solutions to the problem tend to recover the optimal solution (Burer and Monteiro, 2003); assuming sufficient rank $k$, all second-order local optima of the problem are also global optima (Boumal et al., 2016); and it even holds that approximate local optima have good approximation properties (Mei et al., 2017) for convex relaxations of some combinatorial problems. Despite these advances, solving for $V$ remains a practical challenge. Traditional methods for handling non-linear equality constraints, such as augmented Lagrangian methods and Riemannian manifold methods, suffer from slow convergence, difficulties in selecting step size, or other deficiencies.

In this paper, we present a low-rank coordinate descent approach to solving SDPs that have the additional specific structure of constraints (only) on the diagonal entries of the matrix (we consider the case of unit diagonal constraints, though we can easily extend to arbitrary positive numbers):

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ X_{ii} = 1, \; i = 1 \ldots n. \tag{2}$$

This is clearly a very special case of the full semidefinite program, but it also captures some fundamental problems, such as the famous semidefinite relaxation of the maximum cut (MAXCUT) combinatorial optimization problem; indeed, the MAXCUT relaxation will be one of the primary applications in this paper.
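Before turning to the low-rank approach, it may help to see problem (2) posed directly. The following is a minimal sketch of ours (not code from the paper) that solves a small instance of (2) with the generic CVXPY modeling package; because it optimizes the full matrix $X$, this route becomes unaffordable as $n$ grows, which is precisely what motivates the factorized methods below.

```python
import cvxpy as cp
import numpy as np

# Small synthetic instance of problem (2): minimize <C, X> s.t. X PSD, X_ii = 1.
n = 50
rng = np.random.default_rng(0)
C = rng.normal(size=(n, n))
C = (C + C.T) / 2                      # symmetric cost matrix

X = cp.Variable((n, n), PSD=True)      # the full n x n matrix variable
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), [cp.diag(X) == 1])
prob.solve()
print(prob.value)                      # optimal value of the SDP relaxation
```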
In this setting, if we consider the above low-rank form of the problem, we show that we can derive the coordinate descent updates in a very simple closed form, resulting in an algorithm several times faster than the existing state of the art. We call our approach the Mixing method, since the updates have a natural interpretation in terms of giving each $v_i$ as a mixture of the remaining $v_j$ terms. We will also show, however, that the method can be applied to other problems as well, such as a (novel, to the best of our knowledge) relaxation of the MAXSAT problem, as well as an unconstrained quadratic SDP similar to the GloVe word embedding algorithm (Pennington et al., 2014).

On the theoretical side, we prove several strong properties about the Mixing method. Most notably, despite the fact that the method is solving a non-convex formulation of the original MAXCUT SDP, it nonetheless will converge to the true global optimum of the original SDP problem, provided the rank is sufficient, $k > \sqrt{2n}$ (which is of course much smaller than optimizing over the entire $n^2$ variables of the original SDP). We prove this by first showing that the method is strictly decreasing, and always converges to a first-order critical point. We then show that all non-optimal critical points (that is, all points which are critical points in the non-convex formulation in $V$, but not optimal in $X$ in the original SDP) are unstable, i.e., they are saddle points in $V$ and will be unstable under updates of the Mixing method; this means that, in practice, the Mixing method is extremely unlikely to converge to any non-optimal solution. However, to formally prove that the method will converge to the global optimum, we consider a slightly modified "step size" version of the algorithm, for which we can prove formally that the method indeed converges to the global optimum in all cases (although in practice such a step size is not needed). Finally, for both the traditional and step size versions of the algorithm, we show that the Mixing method attains locally linear convergence around the global optimum. The primary tools we use for these proofs require analyzing the spectrum of the Jacobian of the Mixing method at non-optimal critical points, showing that it is always unstable around those points; we combine these with a geometric analysis of critical points due to Boumal et al. (2016) and a convergence proof for coordinate gradient descent due to Lee et al. (2017) to reach our main result (both results require slight specialization for our case, so are included for completeness, but the overall thrust of those supporting points is due to these past papers).

Contributions of the current work
In summary, the main contributions of this paper are: 1) We propose a low-rank coordinate descent method, the Mixing method, for the diagonally constrained SDP problem, which is extremely fast and simple to implement. 2) We prove that despite its non-convex formulation, the method is guaranteed to converge to the global optimum of the original SDP with locally linear convergence, provided that we use a small rank $k > \sqrt{2n}$. 3) We evaluate the proposed method on the MAXCUT SDP relaxation, showing that it is 10-100x faster than other state-of-the-art solvers and scales to millions of variables. 4) We extend the MAX-2SAT relaxation of Goemans and Williamson (1995) to general MAXSAT problems, showing that the proposed formulation can be solved by the Mixing method in time linear in the number of literals. Further, experiments show that the proposed method achieves much better approximation ratios than other approximation algorithms, and is even comparable to the best partial MAXSAT solvers on some instances.
2. Background and related work
Low-rank methods for semidefinite programming
Given an SDP problem with $p$ constraints, it was proven by Barvinok (1995); Pataki (1998) that, if the problem is solvable, it admits solutions of rank $k = \lceil \sqrt{2p} \rceil$. That is, we have solutions satisfying $X = V^T V$, with $V \in \mathbb{R}^{k \times n}$. Thus, if we can solve the problem in the space of $V$, we can ignore the semidefinite constraint and also have many fewer variables. The idea of using this low-rank structure during optimization was first proposed by Burer and Monteiro (2003) in their solver SDPLR, in which they solve the low-rank problem with L-BFGS on the extended Lagrangian problem. Since then, many low-rank optimization algorithms have been developed. One of the most notable is the Riemannian trust region method introduced by Absil et al. (2009). They considered the Riemannian manifold of low-rank structures, and extended the non-linear conjugate gradient method to work on the manifold. Later, Boumal and Absil (2015) improved the method by including preconditioned CG; these methods are implemented in the popular Manopt package (Boumal et al., 2014).

Somewhat surprisingly, all the above low-rank methods are observed to converge to a global optimum in practice (Burer and Monteiro, 2003; Absil et al., 2009; Boumal et al., 2014). However, to the best of the authors' knowledge, there is not yet a general proof of convergence to the globally optimal solution without strong assumptions. Absil et al. (2009, Theorem 7.4.2) proved that their method converges to a critical point under a sufficient decrease condition, with super-linear convergence near an isolated local minimizer when the Riemannian Hessian is positive definite. Nonetheless, these results do not apply to the linear objective in our problem. Boumal et al. (2016, Theorem 2) proved that, for sufficiently large $k$, all second-order optimal solutions are globally optimal for almost all cost matrices $C$. However, this alone does not imply that $f$ converges to $f^*$. Sato and Iwai (2015) proved convergence to a critical point without assumptions for a modified Riemannian conjugate gradient method, and Boumal et al. (2018, Theorem 12 and Proposition 19) proved that the Riemannian trust region method converges to a solution with Hessian larger than $-\gamma I$ in $O(1/\gamma^3)$ iterations and provide bounds on $f - f^*$ when $k > \sqrt{2n}$. Recently, Bhojanapalli et al. (2018) proved the connection between $\gamma$, the norm of the gradients, and $f_\mu - f_\mu^*$ for the unconstrained penalty form $f_\mu$, but this does not establish the relationship between $f$ and $f^*$ after projecting the unconstrained solution back to the feasible space.

Other existing results on global convergence to the optimum of low-rank problems do not apply to our problem setting. For example, Bhojanapalli et al. (2016) proved that under a certain spectral initialization, the gradient descent method converges to the global optimum for unconstrained low-rank SDPs. Park et al. (2016) further proved that the method works for norm-constrained low-rank SDP problems when the feasible space of $V$ is convex. Lee et al. (2016, 2017); O'Neill and Wright (2017) proved that, under random initialization, first-order methods converge to local minimizers for unconstrained problems. However, O'Neill and Wright (2017) only concerns unconstrained optimization in Euclidean space, and the results in Lee et al. (2016, 2017) do not carry over to the spherical manifold due to the singularity in the Jacobian (even with a step size).
Because of these issues, certifying global convergence for low-rank SDP methods typically requires a two-stage algorithm, where one iteratively solves the low-rank problem via some "inner" optimization, then checks for a certificate of (global) optimality via the dual objective of the original SDP and inflates the rank if the solution is not optimal (i.e., the "outer" iteration) (Burer and Monteiro, 2003). Even here, a globally optimal solution is not theoretically guaranteed to be achieved with reduced rank in all cases, but at least it can be verified after the fact. Computing this dual solution, however, typically requires solving an eigenvalue problem of the size of the original $X$ matrix, so it is often avoided entirely in practice in favor of just solving one low-rank problem, and this is the setting we consider here.

Approximation algorithms for MAXCUT and MAXSAT
Semidefinite programming has many applications in approximation algorithms for NP-complete problems. In particular, Goemans and Williamson (1995) proposed a classical SDP relaxation of the MAXCUT and MAX-2SAT problems, which has a 0.878 approximation guarantee. Experiments on MAX-2SAT (Gomes et al., 2006) show that the SDP upper bound and lower bound are much tighter than the classical linear programming relaxation for MAXSAT (Goemans and Williamson, 1994). While traditionally SDPs are more expensive to solve than linear programs, we will show here that with our approach, SDP relaxations can achieve substantially better results than linear programming in less time.
3. The Mixing method
As mentioned above, the goal of the Mixing method is to solve the semidefinite program (2) with a unit diagonal constraint. As discussed, we can replace the $X \succeq 0$ constraint with $X = V^T V$ for some $V \in \mathbb{R}^{k \times n}$; when we do this, the constraint $X_{ii} = 1$ translates to the constraint $\|v_i\| = 1$, where $v_i$ is the $i$th column of $V$. This leads to the equivalent (non-convex) problem on the spherical manifold

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ \langle C, V^T V \rangle, \quad \text{subject to}\ \|v_i\| = 1, \; i = 1 \ldots n. \tag{3}$$

Although the problem is non-convex, it is known (Barvinok, 1995; Pataki, 1998) that when $k > \sqrt{2n}$, the optimal solution for $V \in \mathbb{R}^{k \times n}$ can recover the optimal solution for $X$.

We consider solving problem (3) via a coordinate descent method. The resulting algorithm is extremely simple to implement but, as we will show, it performs substantially better than existing approaches for the semidefinite problems of interest. Specifically, the objective terms that depend on $v_i$ are given by $2 v_i^T (\sum_{j=1}^n c_{ij} v_j)$. Because $\|v_i\| = 1$, we can assume that $c_{ii} = 0$ without affecting the solution of the optimization problem. Thus, the problem is equivalent to simply minimizing the inner product $v_i^T g_i$ (where $g_i = \sum_{j=1}^n c_{ij} v_j$), subject to the constraint that $\|v_i\| = 1$; this problem has a closed-form solution, simply given by $v_i = -g_i/\|g_i\|$. Put in terms of the original $v_j$ variables, this is simply the update

$$v_i := \text{normalize}\Big(-\sum_{j=1}^n c_{ij} v_j\Big).$$

This way, we can initialize $v_i$ on the unit sphere and perform cyclic updates over all the $i = 1 \ldots n$ in closed form. We call this the Mixing method, because for each $v_i$ our update mixes and normalizes the remaining vectors $v_j$ according to the weights $c_{ij}$. In the case of sparse $C$ (which is common for any large data problem), the time complexity for updating all variables once is $O(k \cdot m)$, where $k$ is the rank of $V$ and $m$ is the number of nonzeros in $C$. This is significantly cheaper than interior point methods, which typically admit complexities cubic in $n$. However, the details of efficient computation differ depending on the precise nature of the SDP, so we will describe them in more detail in the subsequent application sections. A complete description of the generic algorithm is shown in Algorithm 1.

Algorithm 1: The Mixing method for the MAXCUT problem
  Initialize $v_i$ randomly on a unit sphere;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          $v_i := \text{normalize}\big(-\sum_{j=1}^n c_{ij} v_j\big)$;
      end
  end
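As a concrete reference implementation, the following NumPy sketch of Algorithm 1 is our own illustration rather than the authors' code; the function name and fixed sweep count are assumptions, and a practical implementation would exploit the sparsity of $C$ and test convergence explicitly.

```python
import numpy as np

def mixing_maxcut(C, k, n_sweeps=100, seed=0):
    """Sketch of Algorithm 1 for problem (3): minimize <C, V^T V> with unit columns."""
    n = C.shape[0]
    C = C - np.diag(np.diag(C))            # w.l.o.g. set c_ii = 0
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(k, n))
    V /= np.linalg.norm(V, axis=0)         # columns v_i start on the unit sphere
    for _ in range(n_sweeps):
        for i in range(n):                 # cyclic coordinate updates, in place
            g = V @ C[:, i]                # g_i = sum_j c_ij v_j (C symmetric)
            V[:, i] = -g / np.linalg.norm(g)   # assumes ||g|| >= delta (Assumption 3.1)
    return V

# usage: V = mixing_maxcut(C, k=int(np.ceil(np.sqrt(2 * n))))
#        objective = np.trace(C @ V.T @ V)
```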
3.1 Convergence properties of the Mixing methods

This section presents the theoretical analysis of the Mixing methods, which constitutes four main properties:

- We prove that the Mixing method is strictly decreasing in objective value and always converges to a first-order critical point over iterations.

- Further, we show that for rank $k > \sqrt{2n}$, all non-optimal critical points $V \in \mathbb{R}^{k \times n}$ are unstable for the Mixing method. That is, when $V^T V$ is non-optimal for the convex problem (2), the critical point $V$ will sit on a saddle of the non-convex problem (3) and the Mixing method tends to diverge from the point locally.

- To rigorously prove the global convergence, we show that a variant of the Mixing method with a proper step size converges to a global optimum almost surely under random initialization for almost every cost matrix $C$, without any assumptions.

- Moreover, we prove that both Mixing methods (with or without the step size) converge linearly to the global optimum when the solution is close enough, regardless of the rank or the existence of nearby non-optimal critical points.

In total, our method represents the first low-rank semidefinite programming method which will provably converge to a global optimum under constraints, and further at a locally linear rate.

As mentioned above, our only assumption (satisfiable by picking a proper step size) is simply that the $v_i$ updates will not lead to a vector with zero norm before normalization.

Assumption 3.1
Assume that for all $i = 1 \ldots n$, the norms $\|\sum_{j=1}^n c_{ij} v_j\|$ do not degenerate during the procedure. That is, all such norms are always greater than or equal to a constant $\delta > 0$.

Convergence to a critical point.
Our first property shows that the Mixing method is strictly decreasing, and always converges to a first-order critical point, for any $k$. This is a relatively weak statement, but useful in the context of the further proofs.
1. The tightness of the $\sqrt{2n}$ rank (more precisely, of any rank satisfying $k(k+1)/2 \geq n$) is proved in Barvinok (2001).

Theorem 3.2
Under Assumption 3.1, the Mixing method on the SDP problem (2) is strictly decreasing and always converges to a first-order critical point.
The proof is in Appendix A; it mainly involves setting up the Mixing iteration in a matrix form, and showing that the difference in objective value between two iterations is given by a particular positive term based upon this form.
Instability of non-optimal critical points.
Next we prove the main result of our approach: not only is the function decreasing, but every non-optimal critical point is unstable; that is, although problem (3) is non-convex, the algorithm tends to diverge (locally) from any solution that is not globally optimal. Further, the local divergence from non-optimal critical points and the global convergence to a critical point hint that the Mixing method can only converge to a globally optimal solution.
Theorem 3.3
Pick $k > \sqrt{2n}$. Under Assumption 3.1, for almost all $C$, all non-optimal first-order critical points are unstable fixed points for the Mixing method.

The full proof is provided in Section 3.2. The main idea is to show that the Jacobian of the dynamics, evaluated at a critical point at which $V$ is not optimal, is guaranteed to have a spectral radius (the magnitude of the largest eigenvalue) greater than one. We do this by showing that the Jacobian of the Mixing method has the same structure as the Jacobian of a standard Gauss-Seidel update plus an additional projection matrix. By a property from Boumal et al. (2016), plus an analysis of the eigenvectors of Kronecker products, we can then guarantee that the eigenvalues of the Mixing method Jacobian contain those of the Gauss-Seidel update. We then use an argument based upon Lee et al. (2017) to show that the Gauss-Seidel Jacobian is similarly unstable around non-optimal critical points, proving our main theorem.

Globally optimal convergence of the Mixing method with a step size.
Though the above two theorems in practice ensure convergence of the Mixing method to a global optimum, because the method makes discrete updates, there is the theoretical possibility that the method will "jump" directly to a non-optimal critical point (again, this has never been observed in practice). For completeness, however, in the next theorem we highlight the fact that a version of the Mixing method modified with a step size will always converge to a global optimum.
Theorem 3.4
Consider the Mixing method with a step size $\theta > 0$. That is,

$$v_i := \text{normalize}\Big(v_i - \theta \sum_{j=1}^n c_{ij} v_j\Big), \quad \text{for } i = 1, \ldots, n.$$

Take $k > \sqrt{2n}$ and $\theta \in \big(0, \frac{1}{\max_i \|c_i\|_1}\big)$, where $\|\cdot\|_1$ denotes the 1-norm. Then for almost every $C$, the method converges to a global optimum almost surely under random initialization.
2. Any distribution will suffice if it maps zero volume in the spherical manifold to zero probability. For example, both the spherical Gaussian distribution and the normalized uniform distribution work.

The full proof is provided in Appendix C. The main difference in the proof from the step-size-free version is that, with a step size, we can prove the diffeomorphism of the Mixing method and thus are able to use the stable manifold theorem. Because the Jacobian of any feasible method on the spherical manifold is singular, we need to construct the inverse function explicitly and show the smoothness of such a function. We can then use an analysis of the Gauss-Seidel method with a step size to show our result. To the best of our knowledge, this represents the first globally optimal convergence result for a low-rank method applied to (constrained) semidefinite programming, without additional assumptions such as the Cauchy decrease (Boumal et al., 2018, Assumption A3), or solving a sequence of "bounded" log-barrier subproblems exactly (Burer and Monteiro, 2005, Theorem 5.3).

3. Note that on the spherical manifold, the Jacobian of any feasible method is singular because the Jacobian of $v_i/\|v_i\|$ is singular. Thus, the proof in Lee et al. (2017, Section 5.5) does not carry over to the case of the Mixing method, even with a step size. That past proof required a non-singular Jacobian, and thus different techniques are required in our setting.
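In implementation terms, the step-size variant of Theorem 3.4 changes only the inner update of the earlier sketch. A hedged sketch of that modification (the helper name and the particular safety factor on $\theta$ are our assumptions):

```python
import numpy as np

def mixing_step(V, C, theta):
    """One sweep of the step-size variant from Theorem 3.4 (sketch, not the paper's code)."""
    for i in range(C.shape[0]):
        w = V[:, i] - theta * (V @ C[:, i])   # v_i - theta * g_i
        V[:, i] = w / np.linalg.norm(w)       # re-normalize onto the sphere
    return V

# a theta inside the theorem's range (0, 1 / max_i ||c_i||_1), with a 0.9 safety factor:
# theta = 0.9 / np.abs(C).sum(axis=0).max()
```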
Locally linear convergence.

Finally, as a last point of analysis, we show that the Mixing methods exhibit linear convergence to the global optimum whenever the solution is close enough, regardless of the rank and of the existence of nearby non-optimal critical points, for both versions with and without the step size. This also echoes practical experience, where the Mixing method does exhibit this rate of convergence.
Theorem 3.5
The Mixing methods converge linearly to the global optimum when close enough to the solution, with the step size (no assumption) or without the step size (under Assumption 3.1).
The full proof is provided in Appendix D. We prove it by exploiting the Lipschitz smoothness (Lipschitz continuous gradient) of the Mixing mappings. The main difficulty here is that the corresponding linear system in the Gauss-Seidel method, $S^* \in \mathbb{R}^{n \times n}$, is semidefinite, so that the corresponding Jacobian $J_{GS}$ contains eigenvectors with eigenvalues of magnitude 1 in the null space of $S^*$. We overcome the difficulty by proving the result directly in the function-value space, like Wang and Lin (2014), so that the eigenvectors in $\text{null}(S^*)$ can be ignored. This is the first local linear convergence result on the spherical manifold without assumptions.

Remark 3.6
Assume there are $m$ nonzeros in $C \in \mathbb{R}^{n \times n}$. By Theorems 3.4 and 3.5, for almost every $C$, the Mixing method with a step size admits an asymptotic complexity of $O(m\sqrt{n}\log\frac{1}{\epsilon})$ for reaching a global optimality gap of $f - f^* \leq \epsilon$, almost surely under random initialization.

This concludes the theoretical analysis of the Mixing method. Now we will prove one of our main results: the instability of non-optimal critical points.
3.2 Proof of Theorem 3.3: the instability of non-optimal critical points

Before starting the proofs, we discuss our notation and reformulate the Mixing methods.
Notations.
We use upper-case letters for matrices, and lower-case letters for vectors and scalars. For a matrix $V \in \mathbb{R}^{k \times n}$, $v_i \in \mathbb{R}^k$ refers to the $i$-th column of $V$, and $\text{vec}(V)$ and $\text{vect}(V) \in \mathbb{R}^{nk}$ denote the vectors stacking the columns and the rows of $V$, respectively. For a vector $y \in \mathbb{R}^n$, $D_y = \text{diag}(y) \in \mathbb{R}^{n \times n}$ denotes the matrix with $y$ on the diagonal, and $y_{\max}, y_{\min}, y_{\text{min-nz}} \in \mathbb{R}$ denote the maximum, the minimum, and the minimum nonzero element of $y$, respectively. The symbol $\otimes$ denotes the Kronecker product, $\dagger$ the Moore-Penrose inverse, $\sigma(\cdot)$ the vector of eigenvalues, $\rho(\cdot)$ the spectral radius, $\mathbf{1}_n$ the all-ones vector of length $n$, $I_n$ the identity matrix of size $n$, $\text{tr}(\cdot)$ the trace, $\langle A, B \rangle = \text{tr}(A^T B)$ the dot product, and $\|V\| = \sqrt{\text{tr}(V V^T)}$ the generalized L2 norm. Indices $i, j$ are reserved for matrix elements, and the index $r$ for iterations.

The Mixing methods.
We denote by $C \in \mathbb{R}^{n \times n}$ the cost matrix of the problem

$$\text{minimize}_{V \in \mathbb{R}^{k \times n}}\ f(V) \equiv \langle C, V^T V \rangle, \quad \text{subject to}\ \|v_i\| = 1, \; i = 1 \ldots n,$$

and w.l.o.g. assume that $c_{ii} = 0$ in all proofs. The matrices $V$ and $\hat{V}$ refer to the current and the next iterate, and $V^*$ to a global optimum attaining the optimal value $f^*$ of the semidefinite program (2). Let

$$g_i = \sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j, \tag{4}$$

and let the matrix $L$ be the strict lower triangular part of $C$. With these notations, the mapping of the Mixing method $M : \mathbb{R}^{k \times n} \to \mathbb{R}^{k \times n}$ and its variant $M_\theta$ with step size $\theta$ can be written as

$$M(V)^T = -(L + D_y)^{-1} L^T V^T, \quad \text{where } y_i = \|g_i\|, \; i = 1 \ldots n, \text{ and} \tag{5}$$

$$M_\theta(V)^T = (\theta L + D_y)^{-1} (I - \theta L)^T V^T, \quad \text{where } y_i = \|v_i - \theta g_i\|, \; i = 1 \ldots n. \tag{6}$$

Note that both $M$ and $M_\theta$ are valid functions of $V$ because $y$ can be calculated from $V$ by the original algorithmic definitions in Section 3. This formulation is similar to the classical analysis of the Gauss-Seidel method for linear equations in Golub and Van Loan (2012); the difference is that $y$ here is not constant with respect to $V$, and thus the evolution is not linear.

4. By Pataki (1998), the optimality in (2) is always attainable by $V \in \mathbb{R}^{k \times n}$ when $k > \sqrt{2n}$.

5. The reason to reformulate here is to avoid the "overwrite" of variables in the algorithmic definition. Moving the inverse term to the left-hand side, the reader can recover the original sequential algorithm.
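As a sanity check on the reformulation, one can verify numerically that the matrix form (5) reproduces a full sequential sweep of Algorithm 1. The following toy check is our own illustration under the notation above, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 7, 4
C = rng.normal(size=(n, n)); C = C + C.T; np.fill_diagonal(C, 0)
V = rng.normal(size=(k, n)); V /= np.linalg.norm(V, axis=0)

# sequential sweep, recording y_i = ||g_i||
Vh, y = V.copy(), np.zeros(n)
for i in range(n):
    g = Vh @ C[:, i]                       # g_i uses hat{v}_j for j < i (in-place)
    y[i] = np.linalg.norm(g)
    Vh[:, i] = -g / y[i]

# matrix form (5): M(V)^T = -(L + D_y)^{-1} L^T V^T, L = strict lower triangle of C
L = np.tril(C, -1)
M_T = -np.linalg.solve(L + np.diag(y), L.T @ V.T)
assert np.allclose(M_T, Vh.T)
```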
Proof of technical lemmas.

We start by analyzing the Jacobian of the Mixing method.
Lemma 3.7
Using the notation in (5), the Jacobian of the Mixing method is

$$J(V) = -\big((L + D_y)^{-1} \otimes I_k\big)\, P\, (L^T \otimes I_k),$$

in which $P$ is the rejection matrix of $\hat{V}$. That is, $P = \text{diag}(P_1, \ldots, P_n) \in \mathbb{R}^{nk \times nk}$, where $P_i = I_k - \hat{v}_i \hat{v}_i^T \in \mathbb{R}^{k \times k}$.

Proof
Denote by $V$ and $\hat{V}$ the current and the next iterate. Taking total derivatives of the update of the Mixing method (7), we have

$$y_i\, d\hat{v}_i = -P_i\, dg_i = -P_i \Big( \sum_{j<i} c_{ij}\, d\hat{v}_j + \sum_{j>i} c_{ij}\, dv_j \Big), \quad i = 1 \ldots n.$$
Moving the $d\hat{v}_j$ terms to the left-hand side, by the implicit function theorem we have the Jacobian

$$J(V) = -\begin{pmatrix} y_1 I_k & & & \\ c_{21} P_2 & y_2 I_k & & \\ \vdots & \ddots & \ddots & \\ c_{n1} P_n & \cdots & c_{n,n-1} P_n & y_n I_k \end{pmatrix}^{-1} \begin{pmatrix} 0 & c_{12} P_1 & \cdots & c_{1n} P_1 \\ & 0 & \ddots & \vdots \\ & & \ddots & c_{n-1,n} P_{n-1} \\ & & & 0 \end{pmatrix}.$$

The implicit function theorem holds here because the triangular matrix is always invertible. Rewriting $J(V)$ with Kronecker products, we have equivalently

$$J(V) = -\big(P (L \otimes I_k) + D_y \otimes I_k\big)^{-1} P\, (L^T \otimes I_k).$$

Further, we can push $P$ into the inverse, so that

$$\big(P (L \otimes I_k) + D_y \otimes I_k\big)^{-1} P = \big(P (L \otimes I_k) + P (D_y \otimes I_k)\big)^\dagger = \big((L + D_y) \otimes I_k\big)^{-1} P.$$

Thus, $J(V)$ can be reformulated as

$$J(V) = -\big((L + D_y)^{-1} \otimes I_k\big)\, P\, (L^T \otimes I_k).$$

Note that $V = \hat{V}$ at critical points, which are also fixed points of the Mixing method. Now we demonstrate how to analyze the Jacobian. Recall the notation $\text{vect}(Z) = \text{vec}(Z^T)$. With it, we have the following convenient property of the Kronecker product.

Lemma 3.8
For matrices $A, B, Q, R$, we have $(A \otimes B)\,\text{vect}(Q R^T) = \text{vect}\big((AQ)(BR)^T\big)$.

Proof

$$(A \otimes B)\,\text{vect}(Q R^T) = (A \otimes B)\,\text{vec}(R Q^T) = \text{vec}(B R Q^T A^T) = \text{vect}\big((AQ)(BR)^T\big).$$

By the above property, part of the spectrum of the Jacobian can be analyzed.
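A toy numerical check of Lemma 3.8 (our own illustration; recall that vect stacks rows, i.e., $\text{vect}(Z) = \text{vec}(Z^T)$):

```python
import numpy as np

# Toy check of Lemma 3.8: (A kron B) vect(Q R^T) = vect((AQ)(BR)^T).
rng = np.random.default_rng(0)
n, k, d = 5, 3, 2
A, B = rng.normal(size=(n, n)), rng.normal(size=(k, k))
Q, R = rng.normal(size=(n, d)), rng.normal(size=(k, d))

vect = lambda Z: Z.flatten()              # row-stacking: vect(Z) = vec(Z^T)
lhs = np.kron(A, B) @ vect(Q @ R.T)       # Q R^T is n x k, so vect has length nk
rhs = vect((A @ Q) @ (B @ R).T)
assert np.allclose(lhs, rhs)
```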
Lemma 3.9 (Overlapping Eigenvalues)
Assume $V \in \mathbb{R}^{k \times n}$ has $\text{rank}(V) < k$. Let $P = \text{diag}(P_1, \ldots, P_n)$, where $P_i = I_k - v_i v_i^T$. Then for any $A, B \in \mathbb{R}^{n \times n}$, any eigenvalue of $AB$ is also an eigenvalue of $J = (A \otimes I_k)\, P\, (B \otimes I_k)$.

Proof
Because $\text{rank}(V) < k$, by linear dependency there is a nonzero $r \in \mathbb{R}^k$ such that

$$r^T v_i = 0 \text{ for } i = 1 \ldots n \implies P_i r = r \text{ for } i = 1 \ldots n.$$

Let $q \in \mathbb{C}^n$ be an eigenvector of $AB$ with eigenvalue $\lambda \in \mathbb{C}$, and let $Z = q r^T$. With Lemma 3.8,

$$J\,\text{vect}(Z) = (A \otimes I_k)\, P\,\text{vect}\big((Bq) r^T\big) = (A \otimes I_k)\,\text{vect}\big((Bq) r^T\big) = \text{vect}\big((ABq) r^T\big) = \lambda\,\text{vect}(q r^T),$$

where the second equality uses $P_i r = r$ on each block of $\text{vect}\big((Bq) r^T\big)$. Thus, every eigenvalue $\lambda$ of $AB$ is also an eigenvalue of $J$.

By the above lemma, the spectral radius of $J = -\big((L + D_y)^{-1} \otimes I_k\big) P (L^T \otimes I_k)$ is lower bounded by that of $J_{GS} = -(L + D_y)^{-1} L^T$, which can in turn be lower bounded as follows.

Lemma 3.10
For a positive vector $y \in \mathbb{R}^n$, consider the matrix, under the notation in (5),

$$J_{GS} = -(L + D_y)^{-1} L^T.$$

Let $S = C + D_y$. When $S \not\succeq 0$, the spectral radius $\rho(J_{GS}) > 1$.

Proof
The proof is more technical and is given in Appendix B.

Further, the assumption in Lemma 3.9 is fulfilled by the following property of critical points.
Lemma 3.11 (Boumal et al., 2016, Lemma 9) Let $k(k+1)/2 > n$. Then, for almost all $C \in \mathbb{R}^{n \times n}$, all first-order critical points $V \in \mathbb{R}^{k \times n}$ have rank smaller than $k$.

Proof
The proof is listed in Appendix E for completeness.

Next, we characterize the optimality of $V$ by proving that every non-optimal critical $V$ admits $S \not\succeq 0$.

Lemma 3.12
For a critical solution $V$, denote $S = C + \text{diag}(y)$, where $y_i = \|V c_i\|, \forall i$. Then

$$S \succeq 0 \iff V \text{ is optimal.}$$

Further, if $V$ is optimal, all $y_i$ are positive except when $c_i = 0$.

Proof
Consider the dual problem of the SDP problem (2),

$$\text{maximize}_{y \in \mathbb{R}^n}\ -\mathbf{1}_n^T y, \quad \text{subject to}\ C + \text{diag}(y) \succeq 0.$$

If $S = C + \text{diag}(y) \succeq 0$, the variable $y$ becomes a feasible solution of the above dual problem. Further, since $V$ is a critical solution, we have

$$V S = 0 \implies V^T V S = 0 \implies \text{tr}(V^T V C) = -\text{tr}\big(V^T V \,\text{diag}(y)\big) = -\mathbf{1}_n^T y,$$

which means that $V^T V$ and $y$ are primal and dual optimal solutions that close the duality gap. Thus, for critical $V$, $S \succeq 0$ implies optimality, and non-optimality implies $S \not\succeq 0$.

In the other direction, when the solution $V$ is optimal, there is a corresponding dual optimal solution $y$ satisfying

$$V^T V (C + \text{diag}(y)) = 0 \implies v_i^T V (C + \text{diag}(y)) = 0, \; \forall i \implies y_i = \|V c_i\|, \; \forall i,$$

and $S = C + \text{diag}(y) \succeq 0$ follows from dual feasibility. By the characterization of SPSD matrices, all submatrices of $S \succeq 0$ are SPSD; thus, $y_i \geq 0$. If the equality $y_i = 0$ holds, by the same reasoning all $2 \times 2$ submatrices

$$\begin{pmatrix} 0 & c_{ij} \\ c_{ij} & y_j \end{pmatrix} \succeq 0, \; \forall j,$$

which means $c_i = 0$.
6. If $S \succeq 0$, we can prove that the spectral radius $\rho(J_{GS}) \leq 1$, in which case all eigenvectors with eigenvalues of magnitude 1 reside in the null space of $S$; see Wang and Lin (2014, Corollary 3.4). However, that result is not used here.

7. Let $y_i^* = \|V^* c_i\|$ and $S^* = C + \text{diag}(y^*) \succeq 0$. An immediate consequence of the lemma is that, for any feasible $U$,

$$f(U) - f^* = \text{tr}(U C U^T) + \mathbf{1}_n^T y^* = \text{tr}(U S^* U^T).$$

Further, suppose $U$ is also an optimum; then

$$f(U) - f^* = \text{tr}(U S^* U^T) = 0 \iff U S^* = 0 \iff \|U c_i\| = y_i^* = \|V^* c_i\|, \; \forall i.$$

That is, $y^*$ is unique.
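In practice, Lemma 3.12 gives a cheap optimality certificate for a candidate critical point $V$: form $S = C + \text{diag}(y)$ with $y_i = \|V c_i\|$ and test positive semidefiniteness. A minimal sketch (our own illustration; at scale the dense eigenvalue check would be replaced by a sparse eigensolver):

```python
import numpy as np

def is_optimal(V, C, tol=1e-8):
    """Certificate from Lemma 3.12 for a critical point V of problem (3)."""
    y = np.linalg.norm(V @ C, axis=0)            # y_i = ||V c_i|| (columns of V C)
    S = C + np.diag(y)
    return np.linalg.eigvalsh(S).min() >= -tol   # S >= 0 iff V is globally optimal
```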
Proof
Consider the non-optimal critical point V . If (cid:107) V c i (cid:107) degenerates for any i , byAssumption 3.1, the Mixing method would not converge to the solution. Thus, we only needto discuss the critical point with y i = (cid:107) V c i (cid:107) ≥ δ > , i = 1 . . . n .We first derive the Jacobian J of the Mixing method in Lemma 3.7, which gives J = − ( L + D y ) − ⊗ I k P L T ⊗ I k , where P = diag( P , . . . , P n ) and P i = I − v i v Ti because ˆ v i = v i on the critical point (also afixed point). In Lemma 3.9, we prove that when rank ( V ) < k , the eigenvalues of J containthe eigenvalues of J GS = − ( L + D y ) − L T . The assumption in Lemma 3.9 is fulfilled by Lemma 3.11, which guarantees that for almostevery C , all the first-order critical point must have rank ( V ) < k . Further, Lemma 3.10 and3.12 show that J GS happens to be the Jacobian of the Gauss-Seidel method on a linearsystem, which has a spectral radius ρ ( J GS ) > on the non-optimal first-order critical point V .Thus, Lemma 3.9 implies ρ ( J ) ≥ ρ ( J GS ) > , which means that all non-optimal first-ordercritical points are unstable for the Mixing method.
4. Applications
4.1 Maximum cut problems (MAXCUT)

The SDP MAXCUT relaxation is indeed the motivating example for the Mixing method, so we consider it first. In this section, we demonstrate how to apply our method to this problem, which originates from Goemans and Williamson (1995).
Problem description.
The maximum cut problem is an NP-hard binary optimization problem, which seeks a partition of a set of vertices $i = 1 \ldots n$, so that the sum of edge weights $c_{ij}$ across the partition is maximized. If we denote the two partitions as $\pm 1$, we can formulate the assignment $v_i$ of vertex $i$ as the following binary optimization problem:

$$\text{maximize}_{v_i \in \{\pm 1\}, \forall i}\ \sum_{i<j} c_{ij}\, \frac{1 - v_i v_j}{2}.$$

Goemans and Williamson (1995) proposed that we can approximate the above solution by "lifting" the assignment $v_i$ from $\{\pm 1\}$ to a unit sphere in $\mathbb{R}^k$ for sufficiently large $k$, as

$$\text{maximize}_{\|v_i\|=1, \forall i}\ \sum_{i<j} c_{ij}\, \frac{1 - v_i^T v_j}{2}.$$

To recover the binary assignment, we can apply a randomized rounding: pick a random vector $r \in \mathbb{R}^k$ on the unit sphere, and let the binary assignment of vertex $i$ be $\text{sign}(r^T v_i)$. Their analysis shows that the approximation ratio for the NP-hard problem is 0.878, which means that the expected objective from the randomized rounding scheme is at least 0.878 times the optimal binary objective.
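A hedged sketch of this randomized rounding (our own illustration; it assumes $C$ has zero diagonal and simply keeps the best of several random hyperplanes):

```python
import numpy as np

def gw_rounding(V, C, n_trials=100, seed=0):
    """Randomized rounding of the lifted solution V; returns the best sampled cut."""
    k, n = V.shape
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(n_trials):
        r = rng.normal(size=k)                   # random hyperplane direction
        x = np.where(r @ V >= 0, 1.0, -1.0)      # binary assignment sign(r^T v_i)
        val = 0.25 * (C.sum() - x @ C @ x)       # cut value, assuming zero diagonal
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```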
Algorithm design.

Because the problem can be solved by the unit diagonal SDP (2), we can apply the Mixing method directly, as presented in Algorithm 1. Further, for a sparse adjacency matrix $C$, the coefficient $\sum_j c_{ij} v_j$ can be constructed in time proportional to the number of nonzeros in column $i$ of $C$. Thus, the time complexity of running a round of updates for all $v_i$ is $O(k \cdot |\text{edges}|)$, in which $k$ is at most $\sqrt{2n}$.
4.2 Maximum satisfiability problems (MAXSAT)

Using similar ideas as in the previous section, Goemans and Williamson (1995) proposed that we can use SDPs to approximate the maximum 2-satisfiability problem. In this section, we propose a formulation that generalizes this idea to the general maximum satisfiability problem, and apply the Mixing method to it. The proposed relaxation here is novel, to the best of our knowledge, and (as we will show) achieves substantially better approximation results than existing relaxations.

Problem description.
The MAXSAT problem is an extension of the well-known satisfiability problem, where the goal is to find an assignment that maximizes the number of satisfied clauses. Let $v_i \in \{\pm 1\}$ be a binary variable and $s_{ij} \in \{-1, 0, 1\}$ be the sign of variable $i$ in clause $j$. The goal of MAXSAT can then be written as the optimization problem

$$\text{maximize}_{v \in \{-1,1\}^n}\ \sum_{j=1}^m \bigvee_{i=1}^n \mathbf{1}\{s_{ij} v_i > 0\}.$$

Note that most clauses will contain relatively few variables, so the $s_j$ vectors will be sparse. To avoid the need for an additional bias term, we introduce an auxiliary "truth" variable $v_0$ with $s_{0j} = -1$, and define $z_j = \sum_{i=1}^n s_{ij} v_i - v_0 = \sum_{i=0}^n s_{ij} v_i = V s_j$. Then the MAXSAT problem can be approximated as

$$\text{maximize}_{v \in \{-1,1\}^n}\ \sum_{j=1}^m \left( 1 - \frac{\|V s_j\|^2 - (|s_j| - 1)^2}{4 |s_j|} \right),$$

where $|s_j|$ denotes the number of literals in clause $j$. Although we will not derive it formally, the reader can verify that for any configuration $v \in \{-1,1\}^n$, this term represents an upper bound on the exact MAXSAT solution. Similar to the MAXCUT SDP, we can relax the $v_i$'s to be vectors in $\mathbb{R}^k$ with $\|v_i\| = 1$. This leads to the full MAXSAT semidefinite programming relaxation

$$\text{minimize}_{X \succeq 0}\ \langle C, X \rangle, \quad \text{subject to}\ X_{ii} = 1, \; i = 0 \ldots n, \quad \text{where}\ C = \sum_{j=1}^m w_j s_j s_j^T \text{ and } w_j = 1/(4|s_j|).$$
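As a quick numerical sanity check of the per-clause term (a toy illustration of ours, under the assumption that $|s_j|$ counts the literals of clause $j$), one can verify on random binary assignments that the term equals 0 for an unsatisfied clause and is at least 1 for a satisfied one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
for _ in range(1000):
    v = rng.choice([-1.0, 1.0], size=n)                  # binary assignment
    lits = rng.choice(n, size=rng.integers(1, n + 1), replace=False)
    s = np.zeros(n + 1)
    s[0] = -1.0                                          # truth variable v_0: s_0j = -1
    s[1 + lits] = rng.choice([-1.0, 1.0], size=len(lits))
    V = np.concatenate(([1.0], v))                       # rank-1 embedding, v_0 = +1
    k = len(lits)                                        # |s_j|: number of literals
    score = 1 - ((V @ s) ** 2 - (k - 1) ** 2) / (4 * k)
    satisfied = np.any(s[1 + lits] * v[lits] > 0)
    assert (score >= 1 - 1e-9) if satisfied else np.isclose(score, 0)
```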
Algorithm design.

Because the $C$ matrix here is not sparse ($s_j s_j^T$ has $|s_j|^2$ nonzero entries), we need a slightly more involved approach than for MAXCUT, but the algorithm is still extremely simple. Specifically, we maintain $z_j = V s_j$ for all clauses $j$. Because in each subproblem only one variable $v_i$ is changed, $z_j$ can be maintained in $O(k \cdot m_i)$ time, where $m_i$ denotes the number of clauses that contain variable $i$. In total, the iteration time complexity is $O(k \cdot m)$, where $m$ is the number of literals in the problem. Also, because applying an arbitrary rotation $R \in \mathbb{R}^{k \times k}$ to $V$ does not change the objective value of our problem, we can avoid updating $v_0$. Algorithm 2 shows the complete algorithm.

8. Actually, the formula matches the approximation of Goemans and Williamson (1995) for MAX-2SAT.

Algorithm 2: The Mixing method for the MAXSAT problem

  Initialize all $v_i$ randomly on a unit sphere, $i = 1 \ldots n$;
  Let $z_j = \sum_{i=0}^n s_{ij} v_i$ for $j = 1, \ldots, m$;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          foreach $s_{ij} \neq 0$ do $z_j := z_j - s_{ij} v_i$;
          $v_i := \text{normalize}\big(-\sum_{j=1}^m \frac{s_{ij}}{4|s_j|} z_j\big)$;
          foreach $s_{ij} \neq 0$ do $z_j := z_j + s_{ij} v_i$;
      end
  end

To recover the binary assignment, we apply the following classic rounding scheme: sample a random vector $r$ from a unit sphere, then assign binary variable $i$ to true if $\text{sign}(r^T v_i) = \text{sign}(r^T v_0)$ and to false otherwise.
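The following NumPy sketch of Algorithm 2 is our own illustration (the function name, dense sign matrix, and fixed sweep count are assumptions; a real solver would store $S$ sparsely); it shows the key point, namely the incremental maintenance of $z_j = V s_j$:

```python
import numpy as np

def mixing_maxsat(S, k, n_sweeps=100, seed=0):
    """Sketch of Algorithm 2. S is a dense (n+1) x m sign matrix whose row 0
    is the truth variable (s_0j = -1 for every clause j)."""
    n1, m = S.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(k, n1))
    V /= np.linalg.norm(V, axis=0)             # columns v_0 ... v_n on the sphere
    w = 1.0 / (4.0 * (np.count_nonzero(S, axis=0) - 1))   # w_j = 1/(4|s_j|), minus row 0
    Z = V @ S                                  # z_j = V s_j, maintained incrementally
    for _ in range(n_sweeps):
        for i in range(1, n1):                 # skip v_0 (rotation invariance)
            nz = np.nonzero(S[i])[0]           # clauses containing variable i
            Z[:, nz] -= np.outer(V[:, i], S[i, nz])   # remove s_ij v_i from z_j
            g = Z[:, nz] @ (w[nz] * S[i, nz])         # sum_j w_j s_ij z_j
            V[:, i] = -g / np.linalg.norm(g)          # closed-form coordinate step
            Z[:, nz] += np.outer(V[:, i], S[i, nz])   # add updated s_ij v_i back
    return V
```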
5. Experimental results
Running time comparison for MAXCUT
Figure 1 shows the results of running the Mixing method on several instances of benchmark MAXCUT problems. These range in size from approximately 1000 nodes and 20000 edges to approximately 2 million nodes and 3 million edges. For this application, we are largely concerned with evaluating the runtime of our Mixing method versus other approaches for solving the same semidefinite program. Specifically, we compare to DSDP (Benson and Ye, 2005), a mature interior point solver; SDPLR (Burer and Monteiro, 2003), one of the first approaches to exploit low-rank structure; Pure-RBR (Wen et al., 2009, 2012), a coordinate descent method in the $X$ space, which outputs the best rank-1 update at each step; and Manopt (Boumal et al., 2014), a recent toolkit for optimization on Riemannian manifolds, with specialized solvers dedicated to the MAXCUT problem. To be specific, we use DSDP 5.8, SDPLR 1.03-beta, and Manopt 3.0 with their default parameters. For Manopt, we compare to the subroutine "elliptopefactory", specially designed for diagonally constrained SDPs. For Pure-RBR, we implement the specialized algorithm (Wen et al., 2009, Algorithm 2) for the MAXCUT SDP with a sparse graph in C++, which only requires a single pass over the sparse matrix per iteration. We omit the log barrier and initialize the RBR with full-rank $X$. All experiments are run on an Intel Xeon E5-2670 machine with 256 GB memory, and all solvers are run in single-core mode to ensure fair comparisons. As the results show, in all cases the Mixing method is substantially faster than the other approaches: for reaching modest accuracy (defined as $10^{-4}$ times the difference between the initial and optimal value), we are typically 10-100x faster than all competing approaches; only the Manopt algorithm ever surpasses our approach, and this happens only
once both methods have achieved very high accuracy. Crucially, on the largest problems, we remain about 10x (or more) faster than Manopt over the entire run, which allows the Mixing method to scale to substantially larger problems.

9. We did not compare to commercial software such as MOSEK, an interior-point solver like DSDP, because it is not open-source and Boumal (2015) already showed that SDPLR is faster than MOSEK on diagonally constrained problems.

[Figure 1: nine log-log panels (axes: log running time (s) vs. log function difference; curves: Mixing, Manopt, SDPLR, DSDP, RBR) on benchmark instances including g3.rud, g27.rud, g35.rud, g48.rud, ca-GrQc, ca-HepPh, email-Enron, and roadNet.CA, ranging from n=800 with 19176 edges up to n=1971281 with 2766607 edges.]

Figure 1: Objective value difference versus training time for the MAXCUT problems (log-log plot, lower is better). The horizontal lines mark the default stopping precision of the Mixing method, which is $10^{-4}$ times the starting relative objective of the Mixing method. Experiments show that our method (the blue line) is 10-100x faster than other solvers at our default stopping precision. Note that curves for SDPLR, DSDP, and RBR are sometimes not plotted because they either crashed or did not output any solution within an hour.

Effectiveness of the Mixing method on approximating MAXSAT problems.
Unlike the previous experiment (where the focus was solely on optimization performance), in this section we highlight the fact that with the Mixing method we are able to obtain MAXSAT results with a high approximation ratio on challenging domains (as the problems are similar, relative optimization performance is similar to that of the MAXCUT evaluations). Specifically, we evaluate examples from the 2016 MAXSAT competition (Argelich et al., 2016) and compare our results to the best heuristic complete and partial solvers. Note that a complete solver produces a verified result, while a partial solver outputs a non-verified solution. Out of 525 problems solved in the complete track (every problem solved exactly by some solver within 30 minutes during the competition), our method achieves an average approximation ratio of 0.978, and usually finds such solutions within seconds or less. Further, on some instances we obtain a perfect solution faster than the best partial solvers. Figure 2 shows the progress of the approximation quality versus the running time. Besides the best heuristic solvers in MAXSAT 2016, we also show the approximation ratio over time for the well-known linear programming approximation (Goemans and Williamson, 1994) (solved via the Gurobi solver). Note that each point in the blue and green curves denotes the
6. Conclusion
In this paper we have presented the Mixing method, a low-rank coordinate descent approach for solving diagonally constrained semidefinite programming problems. The algorithm is extremely simple to implement, and involves no free parameters such as learning rates. On the theoretical side, we have proved that the method converges to a first-order critical point, and that all non-optimal critical points are unstable under sufficient rank. With a proper step size, the method converges to the global optimum almost surely under random initialization. This is the first convergence result to the global optimum on the spherical manifold without assumptions. Further, we have shown that the proposed methods admit local linear convergence in the neighborhood of the optimum regardless of the rank. In experiments, we have demonstrated the method on three different application domains: the MAXCUT SDP, a MAXSAT relaxation, and a word embedding problem (in the appendix). In all cases we show positive results, demonstrating that the method performs much faster than existing approaches from an optimization standpoint (for MAXCUT and word embeddings), and that the resulting solutions have high quality from an application perspective (for MAXSAT). In total, the method substantially raises the bar as to what applications can be feasibly addressed using semidefinite programming, and also advances the state of the art in structured low-rank optimization.

Acknowledgement
We thank Gary L. Miller for his helpful discussion on the geometry of the unit sphere, and Simon S. Du for pointing out many recent references.
References
P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Josep Argelich, Chu Min Li, Felip Manya, and Jordi Planes. Max-SAT-2016: eleventh Max-SAT evaluation. http://maxsat.ia.udl.cat/, 2016.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Rand-walk: A latent variable model approach to word embeddings. arXiv preprint arXiv:1502.03520, 2015.

Alexander Barvinok. A remark on the rank of positive semidefinite matrices subject to affine constraints. Discrete & Computational Geometry, 25(1):23-31, 2001.

Alexander I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2):189-202, 1995.

Steven J. Benson and Yinyu Ye. DSDP5: Software for semidefinite programming. Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, September 2005. Submitted to ACM Transactions on Mathematical Software.

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pages 530-582, 2016.

Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain, and Praneeth Netrapalli. Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form. arXiv preprint arXiv:1803.00186, 2018.

Nicolas Boumal. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015.

Nicolas Boumal and P-A Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200-239, 2015.

Nicolas Boumal, Bamdev Mishra, Pierre-Antoine Absil, Rodolphe Sepulchre, et al. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(1):1455-1459, 2014.

Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757-2765, 2016.

Nicolas Boumal, P-A Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, page drx080, 2018. doi: 10.1093/imanum/drx080. URL http://dx.doi.org/10.1093/imanum/drx080.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.

Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427-444, 2005.

Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111-119, 2012.

Michel X Goemans and David P Williamson. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7(4):656-666, 1994.

Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115-1145, 1995.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.

Carla P Gomes, Willem-Jan van Hoeve, and Lucian Leahu. The power of semidefinite programming relaxations for MAX-SAT. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 104-118. Springer, 2006.

Michael I Jordan and Martin J Wainwright. Semidefinite relaxations for approximate inference on graphs with cycles. In Advances in Neural Information Processing Systems, pages 369-376, 2004.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246-1257, 2016.

Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

Song Mei, Theodor Misiakiewicz, Andrea Montanari, and Roberto I Oliveira. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729, 2017.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013b.

Michael O'Neill and Stephen J Wright. Behavior of accelerated gradient methods near critical points of nonconvex problems. arXiv preprint arXiv:1706.07993, 2017.

Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.

Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, and Sujay Sanghavi. Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems. arXiv preprint arXiv:1606.01316, 2016.

Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339-358, 1998.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532-1543, 2014.

Hiroyuki Sato and Toshihiro Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization, 64(4):1011-1031, 2015.

Michael Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.

Po-Wei Wang and Chih-Jen Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15(1):1523-1548, 2014.

Zaiwen Wen, Donald Goldfarb, Shiqian Ma, and Katya Scheinberg. Row by row methods for semidefinite programming. Industrial Engineering, pages 1-21, 2009.

Zaiwen Wen, Donald Goldfarb, and Katya Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semidefinite, Conic and Polynomial Optimization, pages 533-564. Springer, 2012.

Liu Yang. Distance metric learning: A comprehensive survey. 2006.

Appendix A. Proof of Theorem 3.2: Convergence to a critical point
Lemma A.1
Let $\hat{V} = M(V)$ for the Mixing method $M : \mathbb{R}^{k \times n} \to \mathbb{R}^{k \times n}$ defined in (4) and (5). Then

$$f(V) - f(\hat{V}) = \sum_{i=1}^n y_i \|v_i - \hat{v}_i\|^2.$$

Proof
Recall that the objective function $f$ w.r.t. the variable $v_i$, while fixing all other variables $v_j$, is

$$\langle C, V^T V \rangle = \sum_i \sum_j c_{ij} v_i^T v_j = 2 v_i^T \Big(\sum_j c_{ij} v_j\Big) + \text{constant}.$$

Note that the $\sum_j c_{ij} v_j$ term is independent of $v_i$ because $c_{ii} = 0$. Now consider the inner cyclic iteration of the Mixing method updating $v_i$ to $\hat{v}_i$. Because only those $v_j$ with $j < i$ have already been updated to $\hat{v}_j$, the objective value before updating $v_i$ to $\hat{v}_i$ equals $2 v_i^T (\sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j) = 2 g_i^T v_i$ plus constants. Thus, the updates of the Mixing method can be written as

$$\hat{v}_i = -g_i / y_i, \quad \text{where } y_i = \|g_i\| \text{ and } g_i = \sum_{j<i} c_{ij} \hat{v}_j + \sum_{j>i} c_{ij} v_j, \; i = 1 \ldots n, \tag{7}$$

and the objective difference after updating $v_i$ to $\hat{v}_i$ is

$$2 g_i^T (v_i - \hat{v}_i) = -2 \|g_i\|\, \hat{v}_i^T (v_i - \hat{v}_i) = 2 y_i (1 - v_i^T \hat{v}_i) = y_i \|v_i - \hat{v}_i\|^2.$$

The result follows from summing the above equation over $i = 1 \ldots n$.
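Lemma A.1 can be checked numerically with a plain sequential sweep; the following toy verification is our own illustration, not code from the paper:

```python
import numpy as np

# Toy check of Lemma A.1: one Mixing sweep on a random symmetric C (zero diagonal).
rng = np.random.default_rng(1)
n, k = 8, 4
C = rng.normal(size=(n, n)); C = C + C.T; np.fill_diagonal(C, 0)
V = rng.normal(size=(k, n)); V /= np.linalg.norm(V, axis=0)

f = lambda V: np.trace(C @ V.T @ V)          # f(V) = <C, V^T V>
f0, drop = f(V), 0.0
for i in range(n):
    g = V @ C[:, i]                          # g_i mixes updated and old columns
    y = np.linalg.norm(g)
    v_new = -g / y
    drop += y * np.linalg.norm(V[:, i] - v_new) ** 2
    V[:, i] = v_new
assert np.isclose(f0 - f(V), drop)           # f(V) - f(V_hat) = sum_i y_i ||v_i - v_hat_i||^2
```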
Proof of Theorem 3.2

Proof

By Assumption 3.1, $y$ does not degenerate over the iterations, so Lemma A.1 implies that the Mixing method is strictly decreasing and admits a unique limit point $\bar{V} = \lim_{r \to \infty} V_r$ for the sequence $\{V_r\}$ generated by the algorithm. Denote by $\bar{y}$ the corresponding $y$ for $\bar{V}$. Then $\bar{V}$ being a fixed point of (5) implies

$$\bar{V}\big(C + \text{diag}(\bar{y})\big) = 0,$$

which means the projected gradient of $\langle C, V^T V \rangle$ at $\bar{V}$ is zero. Thus, together with the feasibility of the $V_r$, we have proved that the Mixing method converges to a critical point.

Appendix B. Proof of Lemma 3.10: Divergence of the Gauss-Seidel method
Proof
Because the dynamics of the Gauss-Seidel method (GS) on the system

$$\min_{x \in \mathbb{R}^n} f(x), \quad \text{where } f(x) \equiv x^T S x,$$

have the same Jacobian as $J_{GS}$, proving $\rho(J_{GS}) > 1$ is equivalent to proving the "linear divergence" of the Gauss-Seidel method, which cyclically optimizes each coordinate of $x \in \mathbb{R}^n$. Further, since $S \not\succeq 0$, there is an eigenvector $q \in \mathbb{R}^n$ of $S$ such that $q^T S q < 0$.

Consider the sequence $\{x_r\}_{r=0,1,\ldots}$ generated by GS. That is, $x_r = (J_{GS})^r x_0$ for all $r > 0$, where $(J_{GS})^r$ is $J_{GS}$ raised to the $r$-th power. Let the initial solution of the system be $x_0 = q$, so that $f(x_0) < 0$. Because the Gauss-Seidel method is greedy in every coordinate update, it is monotonically decreasing in the function value. Thus, there are only two cases for the sequence of function values: 1) the function value converges below zero; 2) the function value goes to negative infinity.

Denote by $z_i^r$ the iterate $x_r$ before updating the $i$-th coordinate, and let $z_1^r = x_r$ and $z_{n+1}^r = x_{r+1}$. This way, only the $i$-th coordinate differs between $z_i^r$ and $z_{i+1}^r$, and the inner cyclic updates can be flattened as

$$x_r = z_1^r \to z_2^r \to \ldots \to z_n^r \to z_{n+1}^r = x_{r+1}. \tag{8}$$
1) When the function value converges.
The monotonic decreasing property of GS implies that the function difference converges to zero. By the same analysis as in Lemma A.1, we have

$$f(x_r) - f(x_{r+1}) = \sum_{i=1}^n y_i \|z_i^r - z_{i+1}^r\|^2.$$

Thus, the flattened sequence $\{z_i^r\}$ converges, which implies that $\{x_r\}$ also converges. Let $\bar{x}$ be the limit of the sequence $\{x_r\}$. Being a limit of the GS sequence means that $\bar{x}$ is a fixed point of GS, which implies $S \bar{x} = 0$ and $f(\bar{x}) = \bar{x}^T S \bar{x} = 0$. This contradicts the monotonic decreasing property of GS and the fact that $f(x_0) < 0$.
2) When the function value $x_r^T S x_r$ goes to negative infinity.  Because the spectrum of $S$ is bounded, we know that $\|x_r\|$ also goes to infinity. For simplicity, we focus on the $r$-th iterate and write $z_i^r$ as $z_i$. In GS, the coordinate $z_{i,i}$ is updated to $z_{i+1,i} = -\frac{1}{y_i} \sum_{j \neq i} c_{ij} z_{j,i}$, and we have

$$f(z_i) - f(z_{i+1}) = y_i \|z_i - z_{i+1}\|^2 = y_i \Big(z_{i,i} + \frac{1}{y_i} \sum_{j \neq i} c_{ij} z_{j,i}\Big)^2 = |e_i^T S z_i|^2 / y_i, \tag{9}$$

where $e_i$ is the $i$-th coordinate vector and the first equality is from $f(x) = x^T S x$. Then we have the following claim, adapted from Lee et al. (2017, Claim 1).

Claim B.1
Assume $x_r$ is in the range of $S$. Then there exists an index $j$ such that $\frac{1}{y_j}|e_j^T S z_j| \geq \omega \|z_j\|$ for some global constant $\omega > 0$ that depends only on $S$ and $n$.

The full proof of the claim is listed after this lemma for completeness. To fulfill the assumption, we can decompose $x_r = x_r^{\parallel} + x_r^{\perp}$, where $x_r^{\parallel}$ is in the range of $S$ and $x_r^{\perp}$ is in the null space of $S$. Consider the flattened inner cyclic updates $z_i$ as in (8) but starting from $x_r^{\parallel}$, so that
$$x_r^{\parallel} = z_1 \to z_2 \to \ldots \to z_n \to z_{n+1}.$$
Because $J_{\mathrm{GS}}$ maps the null space of $S$ to itself,
$$f(x_r) - f(x_{r+1}) = f(x_r) - f\big(J_{\mathrm{GS}}(x_r^{\parallel} + x_r^{\perp})\big) = f(z_1) - f(z_{n+1} + x_r^{\perp}) = f(z_1) - f(z_{n+1}).$$
Further, because GS is coordinate-wise monotonically decreasing and the function decrease of a single coordinate update is no larger than that of the whole cyclic sweep, the above equality and (9) give
$$f(x_r) - f(x_{r+1}) = f(z_1) - f(z_{n+1}) \;\geq\; \frac{(e_j^T S z_j)^2}{y_j} \;\geq\; y_j\,\omega^2\|z_j\|^2.$$
The last inequality is from Claim B.1. Thus,
$$f(x_{r+1}) \leq f(x_r) - y_j\,\omega^2\|z_j\|^2. \qquad (10)$$
Further, because $\|z_j\|^2 \geq |z_j^T S z_j|/\rho(S)$ and $z_j^T S z_j = f(z_j) \leq f(x_r^{\parallel}) = f(x_r) \leq f(x_0) < 0$,
$$f(x_r) - y_j\,\omega^2\|z_j\|^2 \;\leq\; f(x_r) + \frac{y_j\,\omega^2}{\rho(S)}\, z_j^T S z_j \;\leq\; \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big) f(x_r). \qquad (11)$$
Combining (10) and (11), we obtain the exponential divergence to negative infinity
$$f(x_{r+1}) \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big) f(x_r) \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big)^{r+1} f(x_0), \quad \forall r \geq 0, \qquad (12)$$
where the last inequality comes from applying the first inequality recursively. Because $S \succeq \sigma_{\min}(S) I_n$ and $\sigma_{\min}(S) < 0$,
$$f(x_r) = \big((J_{\mathrm{GS}})^r x_0\big)^T S\, (J_{\mathrm{GS}})^r x_0 \geq \sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r x_0\|^2 \geq \sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r\|^2\,\|x_0\|^2. \qquad (13)$$
Combining (12) and (13), we have
$$\sigma_{\min}(S)\,\|(J_{\mathrm{GS}})^r\|^2\,\|x_0\|^2 \leq \Big(1 + \frac{y_{\min}\,\omega^2}{\rho(S)}\Big)^r f(x_0), \quad \forall r > 0.$$
Applying Gelfand's theorem for the spectral radius, we conclude that
$$\rho(J_{\mathrm{GS}}) = \lim_{r\to\infty} \|(J_{\mathrm{GS}})^r\|^{1/r} \geq \sqrt{1 + \frac{y_{\min}\,\omega^2}{\rho(S)}},$$
which means that the spectral radius $\rho(J_{\mathrm{GS}})$ is strictly larger than $1$.

10. We can obtain this result by fixing $y$ in Lemma A.1 to be a constant and letting $V \in \mathbb{R}^{1\times n}$. The result can also be obtained by examining the coordinate updates of GS, which is already known in the literature.
11. Note that only $x_r$ is decomposed into $x_r^{\parallel} \in \mathrm{range}(S)$ and $x_r^{\perp} \in \mathrm{null}(S)$. The symbols $z_i$ are the GS iterates generated from $x_r^{\parallel}$ and might not be in the range of $S$.
12. Consider $p$ such that $Sp = 0$. Then $(J_{\mathrm{GS}})\,p = -(L + D_y)^{-1}L^T p = -(L + D_y)^{-1}Sp + p = p$.
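The conclusion $\rho(J_{\mathrm{GS}}) > 1$ is easy to observe numerically. The sketch below (ours, with made-up data) forms the Gauss-Seidel iteration matrix $J_{\mathrm{GS}} = -(L + D)^{-1}L^T$ for a random symmetric matrix $S$ with positive diagonal that is not positive semidefinite, and prints its spectral radius, which exceeds $1$ as the lemma predicts.

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
S = (A + A.T) / 2
np.fill_diagonal(S, 1.0)                      # positive diagonal; S indefinite w.h.p.
assert np.linalg.eigvalsh(S).min() < 0        # confirm S is not PSD

L = np.tril(S, k=-1)                          # strictly lower-triangular part
D = np.diag(np.diag(S))
J_GS = -np.linalg.solve(L + D, L.T)           # Gauss-Seidel Jacobian
rho = np.abs(np.linalg.eigvals(J_GS)).max()
print(f"spectral radius of J_GS: {rho:.3f}")  # strictly larger than 1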
Proof of Claim B.1

Note that the following proof is essentially the same as that of Lee et al. (2017, Claim 1), where their step size $\alpha$ corresponds to our $1/y_i$ and their $y_t$ corresponds to our $x_r$. The only difference is that we prove the result for the exact Gauss-Seidel method, while they prove it for coordinate gradient descent with a step size. The proof is listed here for completeness.

Proof
We prove by contradiction. Assume that
$$\frac{1}{y_j}|e_j^T S z_j| < \omega \|z_j\| \quad\text{for all } j = 1 \ldots n, \text{ for a certain } \omega. \qquad (14)$$
We now show the following result by induction: for $j = 2 \ldots n+1$,
$$\|x_r - z_j\| < 2^{j-1}\,\omega\,\|x_r\|. \qquad (15)$$
Recall from (9) that
$$\|z_j - z_{j+1}\| = \frac{1}{y_j}\,|e_j^T S z_j|. \qquad (16)$$
For $j = 2$, we have the induction basis for (15) from the above equality and (14):
$$\|x_r - z_2\| = \|z_1 - z_2\| = \frac{1}{y_1}|e_1^T S z_1| < \omega\|z_1\| = \omega\|x_r\| < 2\omega\|x_r\|,$$
and accordingly $\|z_2\| \leq \|z_2 - x_r\| + \|x_r\| < (1 + 2\omega)\|x_r\|$. Now we do the induction step. Suppose the hypothesis (15) holds for some $j$. This implies
$$\|z_j\| \leq \|z_j - x_r\| + \|x_r\| < (1 + 2^{j-1}\omega)\|x_r\|. \qquad (17)$$
Then at $j+1$,
$$\begin{aligned}
\|x_r - z_{j+1}\| &\leq \|x_r - z_j\| + \|z_j - z_{j+1}\| &&\text{by the triangle inequality}\\
&< 2^{j-1}\omega\|x_r\| + \tfrac{1}{y_j}|e_j^T S z_j| &&\text{by hypothesis (15) at } j \text{ and (16)}\\
&< 2^{j-1}\omega\|x_r\| + \omega\|z_j\| &&\text{by assumption (14)}\\
&< 2^{j-1}\omega\|x_r\| + \omega(1 + 2^{j-1}\omega)\|x_r\| &&\text{by (17)}\\
&\leq 2^{j}\omega\|x_r\|,
\end{aligned}$$
where the last inequality holds by picking $\omega \in (0, \tfrac12)$ so that $\omega(1 + 2^{j-1}\omega) \leq 2^{j-1}\omega$ for all $j = 2 \ldots n$. Thus, the induction on (15) holds. With the result (15), for $j = 2 \ldots n$ we have
$$\begin{aligned}
\tfrac{1}{y_{\max}}|e_j^T S x_r| &\leq \tfrac{1}{y_j}|e_j^T S x_r| &&\text{by } y_{\max} \geq y_j\\
&\leq \tfrac{1}{y_j}\big(|e_j^T S z_j| + |e_j^T S (x_r - z_j)|\big) &&\text{by the triangle inequality}\\
&< \omega\|z_j\| + \tfrac{1}{y_j}\|S e_j\|\,\|x_r - z_j\| &&\text{by (14) and the Cauchy-Schwarz inequality}\\
&< \omega(1 + 2^{j-1}\omega)\|x_r\| + \tfrac{1}{y_j}\, 2^{j-1}\omega\,\|S e_j\|\,\|x_r\| &&\text{by (17) and (15)}\\
&\leq \omega\Big(1 + 2^n \omega + \tfrac{2^n}{y_{\min}}\rho(S)\Big)\|x_r\|, &&\qquad (18)
\end{aligned}$$
where the last inequality is from $\|S e_j\| \leq \|S\|\|e_j\| \leq \rho(S)$, $y_{\min} \leq y_j$, and $j \leq n+1$. Note that (18) also holds for $j = 1$ because of (14) and $x_r = z_1$. Summing the squares of (18) over $j = 1 \ldots n$ and taking the square root, we have
$$\sqrt{n}\,\omega\Big(1 + 2^n\omega + \frac{2^n \rho(S)}{y_{\min}}\Big)\|x_r\| \;>\; \frac{1}{y_{\max}}\|S x_r\| \;\geq\; \frac{1}{y_{\max}}\,\kappa_{\text{min-nz}}(S)\,\|x_r\|,$$
where $\kappa_{\text{min-nz}}(S) = \sqrt{\sigma_{\text{min-nz}}(S^T S)} > 0$ is the minimum nonzero singular value of $S$, and the last inequality holds because $\kappa_{\text{min-nz}}(S)\|x_r\| \leq \|S x_r\|$ for $x_r \in \mathrm{range}(S)$. Cancelling $\|x_r\|$ from both sides of the above inequality, the left-hand side goes to $0$ as $\omega \to 0$ while the right-hand side stays constant. Thus, picking $\omega$ small enough that
$$\frac{1}{y_{\max}}\,\kappa_{\text{min-nz}}(S) \;\geq\; \sqrt{n}\,\omega\Big(1 + 2^n\omega + \frac{2^n\rho(S)}{y_{\min}}\Big)$$
leads to a contradiction. So the claim holds.
13. Note that the choice of $\omega$ depends only on $n$ and $S$.

Appendix C. Proof of Theorem 3.4: Global convergence with a step size
Lemma C.1
The Mixing method $M_\theta$ with a step size $\theta \in (0, 1/\max_i \|c_i\|_1)$ never degenerates. That is, there is a constant $\delta \in (0,1)$ such that
$$\|\theta V c_i\| \leq 1 - \delta < 1 \quad\text{and}\quad \|v_i - \theta V c_i\| \geq \delta > 0.$$
Taking a constant $\theta \in (0, 1/\max_i \|c_i\|_1)$ is equivalent to taking $\theta = \frac{1-\delta}{\max_i \|c_i\|_1}$ for a constant $\delta \in (0,1)$. From the triangle inequality,
$$\|V c_i\| = \Big\|\sum_j c_{ij} v_j\Big\| \leq \sum_j |c_{ij}|\,\|v_j\| = \|c_i\|_1.$$
So we have $\|\theta V c_i\| \leq 1 - \delta < 1$. The second result follows from $\|v_i - \theta V c_i\| \geq 1 - \|\theta V c_i\| \geq \delta$.
The Mixing method $M_\theta$ with a step size $\theta \in (0, 1/\max_i \|c_i\|_1)$ is a diffeomorphism.
Note that $M_\theta$ can be decomposed as $M_\theta(V) = \Phi_n(\Phi_{n-1}(\ldots \Phi_1(V)))$, where the column update $\Phi_i(V): \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ is defined as
$$(\Phi_i(V))_s = \begin{cases} \dfrac{v_i - \theta V c_i}{\|v_i - \theta V c_i\|} & \text{if } s = i,\\[4pt] v_s & \text{otherwise,}\end{cases} \qquad s = 1 \ldots n.$$
Thus, if we can prove that each $\Phi_i(V)$ is a diffeomorphism for $i = 1 \ldots n$, then $M_\theta$ is a diffeomorphism, because compositions of diffeomorphisms are diffeomorphisms. Specifically, because all variables $v_j$ except $v_i$ are given and stay the same, proving that $\Phi_i(V)$ is a diffeomorphism is equivalent to proving that the projective mapping $\phi: \mathbb{R}^k \to \mathbb{R}^k$,
$$\phi(v) = \frac{v - g}{\|v - g\|}, \quad\text{where } g = \theta V c_i \text{ is known},$$
is a diffeomorphism. Let $\phi(v) = z$. We claim the inverse function $\phi^{-1}(z)$ is
$$\phi^{-1}(z) = \alpha z + g, \quad\text{where } \alpha = -z^T g + \sqrt{(z^T g)^2 + 1 - \|g\|^2}.$$
The square root is valid because of Lemma C.1. We prove the claim by validation. First,
$$\phi^{-1}(\phi(v)) = \phi^{-1}\Big(\frac{v-g}{\|v-g\|}\Big).$$
Using the property $\|v\| = 1$, the $\alpha$ for the above expression is
$$\begin{aligned}
\alpha &= -\frac{(v-g)^T g}{\|v-g\|} + \sqrt{\Big(\frac{(v-g)^T g}{\|v-g\|}\Big)^2 + 1 - \|g\|^2}\\
&= \frac{1}{\|v-g\|}\Big(-(v-g)^T g + \sqrt{((v-g)^T g)^2 + \|v-g\|^2(1 - \|g\|^2)}\Big)\\
&= \frac{1}{\|v-g\|}\big(-v^T g + \|g\|^2 + 1 - v^T g\big) = \|v-g\|.
\end{aligned}$$
Thus, $\phi^{-1}(\phi(v)) = \|v-g\|\,\frac{v-g}{\|v-g\|} + g = v$ is indeed the inverse function. The diffeomorphism follows from the smoothness of $\phi(v)$ and $\phi^{-1}(z)$ when $\|v - g\| \geq \delta > 0$, which holds by Lemma C.1.
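The closed-form inverse can also be verified numerically. A quick NumPy check (illustrative only): for a random unit vector $v$ and a random $g$ with $\|g\| < 1$, applying the claimed $\phi^{-1}$ to $\phi(v)$ recovers $v$.

import numpy as np

rng = np.random.default_rng(2)
k = 5
v = rng.standard_normal(k)
v /= np.linalg.norm(v)                         # ||v|| = 1
g = rng.standard_normal(k)
g *= 0.9 / np.linalg.norm(g)                   # ||g|| < 1, as Lemma C.1 guarantees

z = (v - g) / np.linalg.norm(v - g)            # z = phi(v)
alpha = -z @ g + np.sqrt((z @ g) ** 2 + 1 - g @ g)
assert np.allclose(alpha * z + g, v)           # phi_inv(phi(v)) == v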
For the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i \|c_i\|_1)$, let $\hat V = M_\theta(V)$. Following the notation in (4) and (6), we have
$$f(V) - f(\hat V) = \sum_i \frac{1 + y_i}{\theta}\,\|v_i - \hat v_i\|^2,$$
where $y_i = \|v_i - \theta g_i\|$ and $g_i = \sum_{j<i} c_{ij}\hat v_j + \sum_{j>i} c_{ij} v_j$.
Consider the inner iteration of the Mixing method with a step size. By the same analysis as in Lemma A.1, the objective value before updating the variable $v_i$ is $2 g_i^T v_i$ plus constants, and the objective difference after updating $v_i$ to $\hat v_i = (v_i - \theta g_i)/y_i$ is
$$\begin{aligned}
2 g_i^T (v_i - \hat v_i) &= 2\Big(g_i + \frac{v_i - \theta g_i}{\theta}\Big)^T (v_i - \hat v_i) - \frac{2}{\theta}(v_i - \theta g_i)^T (v_i - \hat v_i)\\
&= \frac{2}{\theta}\, v_i^T (v_i - \hat v_i) - \frac{2 y_i}{\theta}\, \hat v_i^T (v_i - \hat v_i)\\
&= \frac{2(1 + y_i)}{\theta}\,(1 - v_i^T \hat v_i) = \frac{1 + y_i}{\theta}\,\|v_i - \hat v_i\|^2.
\end{aligned}$$
Thus, the result follows from summing the above equation over $i = 1 \ldots n$.
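As with Lemma A.1, the decrease formula can be checked numerically. The sketch below (ours) performs one sweep of $M_\theta$ with $\theta$ just inside the admissible range and asserts the identity of Lemma C.3.

import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 4
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
theta = 0.9 / np.abs(C).sum(axis=0).max()      # theta < 1 / max_i ||c_i||_1

V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

f_before = np.trace(C @ V.T @ V)
decrease = 0.0
for i in range(n):
    g = V @ C[:, i]                            # g_i, with updated v_j for j < i
    y = np.linalg.norm(V[:, i] - theta * g)    # y_i = ||v_i - theta g_i||
    v_new = (V[:, i] - theta * g) / y          # the M_theta update
    decrease += (1 + y) / theta * np.linalg.norm(V[:, i] - v_new) ** 2
    V[:, i] = v_new
f_after = np.trace(C @ V.T @ V)

assert abs((f_before - f_after) - decrease) < 1e-8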
Proof of Theorem 3.4

Proof  Similar to Lemma 3.7, the Jacobian of the Mixing method $M_\theta$ with step size $\theta$ is
$$J(V) = \big((D_y + \theta L)^{-1} \otimes I_k\big)\, P\, \big((I_n - \theta L^T) \otimes I_k\big),$$
where $P = \mathrm{diag}(P_1, \ldots, P_n)$ and $P_i = I_k - \hat v_i \hat v_i^T$. By Lemma 3.9, the spectral radius $\rho(J)$ at a critical point is lower bounded by that of $J_{\mathrm{CGD}} = (D_y + \theta L)^{-1}(I_n - \theta L^T)$, which equals the Jacobian of coordinate gradient descent (CGD) on the linear system given by $S = C + \mathrm{diag}(\|V c_i\|)$. Because CGD admits a Jacobian with $\rho(J_{\mathrm{CGD}}) > 1$ when $S \not\succeq 0$ (Lee et al., 2017, Proposition 5), Lemma 3.12 implies that all non-optimal critical points are unstable fixed points of $M_\theta$. Further, since $M_\theta$ is a diffeomorphism by Lemma C.2, we can apply the center-stable manifold theorem (Shub, 2013, Theorem III.5) to the mapping $M_\theta$. Specifically, because the non-optimal critical points are unstable fixed points and $M_\theta$ is a diffeomorphism, Lee et al. (2017, Theorem 2) implies that the Mixing method $M_\theta$ escapes all non-optimal critical points almost surely under random initialization. Further, because $M_\theta$ is strictly decreasing (Lemma C.3) and the objective value is lower bounded, $M_\theta$ converges to a first-order critical point (by the same analysis as in Theorem 3.2). In conclusion, the almost-sure escape from non-optimal critical points and the convergence to a critical point together imply that $M_\theta$ almost surely converges to a global optimum.
14. Note that Lee et al. (2017, Lemma 1) uses only the diffeomorphism property, so their assumption of a non-singular Jacobian is not necessary. Indeed, a non-singular Jacobian is a sufficient condition for the mapping to be locally one-to-one, but not a necessary one.
15. Note that the critical points here are non-isolated because they are invariant to rotations in $\mathbb{R}^k$. While Lee et al. (2017, Theorem 2) suffices for our result, interested readers can also refer to Panageas and Piliouras (2016) for how the Lindelöf lemma resolves the non-isolation issue.

Appendix D. Proof of Theorem 3.5: Local linear convergence
Lemma D.1
Compositions of Lipschitz continuous functions are Lipschitz continuous. That is, if a mapping $F$ is $\lambda_F$-Lipschitz continuous and $G$ is $\lambda_G$-Lipschitz continuous, then $G(F(\cdot))$ is $\lambda_G\lambda_F$-Lipschitz continuous.
The mapping $F$ being $\lambda_F$-Lipschitz continuous means that for variables $X, Y$ in its domain,
$$\|F(X) - F(Y)\| \leq \lambda_F \|X - Y\|.$$
Consequently, for the $\lambda_G$-Lipschitz continuous mapping $G$,
$$\|G(F(X)) - G(F(Y))\| \leq \lambda_G \|F(X) - F(Y)\| \leq \lambda_G\lambda_F\|X - Y\|,$$
which establishes the proof.
Under Assumption 3.1, the Mixing method $M: \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ and its normalizer $y: \mathbb{R}^{k\times n} \to \mathbb{R}^n$ are Lipschitz continuous to any optimum $V^*$. Further, there are constants $\beta, \gamma > 0$ such that
$$\|V - M(V)\|^2 \geq \big(\beta - \gamma\|y(V) - y(V^*)\|\big)\big(f(V) - f^*\big). \qquad (19)$$
By Lemma D.1, proving the Lipschitz continuity of the mapping $y(V)$ and the corresponding $M(V)$ reduces to proving the Lipschitz continuity of $\psi_i(V) = \|V c_i\|$ and $\phi_i(V) = -\frac{1}{\psi_i(V)} V c_i$. First, from the definition of $\psi_i$ and the Cauchy-Schwarz inequality,
$$|\psi_i(V) - \psi_i(V^*)| \leq \|V c_i - V^* c_i\| \leq \|c_i\|\,\|V - V^*\|.$$
Further, from $\phi_i(V^*) = v_i^*$, the Cauchy-Schwarz inequality, and the above inequality, we have
$$\begin{aligned}
\|\phi_i(V) - \phi_i(V^*)\| &= \Big\|-\frac{1}{\psi_i(V)}V c_i + \frac{1}{\psi_i(V)}V^* c_i - \Big(\frac{\psi_i(V)}{\psi_i(V)} - \frac{\psi_i(V^*)}{\psi_i(V)}\Big)v_i^*\Big\|\\
&\leq \frac{1}{\psi_i(V)}\big(\|c_i\|\,\|V - V^*\| + |\psi_i(V) - \psi_i(V^*)|\,\|v_i^*\|\big)\\
&\leq \frac{2\|c_i\|}{\delta}\,\|V - V^*\|,
\end{aligned}$$
where the last inequality is from Assumption 3.1, that is, $\psi_i(V) \geq \delta > 0$. Thus, the Lipschitz continuity holds by applying Lemma D.1 recursively. Now we prove the result (19). Let $S^* = C + D_{y^*}$ and $S = C + D_y$, where $y^* = y(V^*)$ and $y = y(V)$. Under the notation in (5),
$$V - M(V) = V(L^T + D_y)(L^T + D_y)^{-1} + V L (L^T + D_y)^{-1} = V S (L^T + D_y)^{-1}.$$
For simplicity, let $R := (L^T + D_y)^{-1}$. Expanding the square using $S = S^* + D_y - D_{y^*}$ and dropping the last squared term after the expansion, we have
$$\begin{aligned}
\|V - M(V)\|^2 &\geq \|V S^* R\|^2 + 2\,\mathrm{tr}\big(V^T V S^* R R^T (D_y - D_{y^*})\big)\\
&\geq \sigma_{\min}^2(R)\,\sigma_{\text{min-nz}}(S^*)\,\mathrm{tr}(V^T V S^*) - 2\|y - y^*\|\,\sigma_{\max}^2(R)\,\mathrm{tr}(V^T V S^*).
\end{aligned}$$
The last inequality is from $\|y - y^*\| I_n + (D_y - D_{y^*}) \succeq 0$ and Lemma 3.12 ($S^* \succeq 0$). Note that $\mathrm{tr}(V^T V S^*) = \mathrm{tr}(V^T V C) - (-\mathbf{1}^T y^*) = f(V) - f^*$ by the analysis in Lemma 3.12. With $\sigma_{\min}(R) = 1/y_{\max}$ and $\sigma_{\max}(R) = 1/y_{\min}$, the result (19) follows from
$$\|V - M(V)\|^2 \geq \Big(\frac{\sigma_{\text{min-nz}}(S^*)}{y_{\max}^2} - \frac{2\|y - y^*\|}{y_{\min}^2}\Big)\big(f(V) - f^*\big).$$

16. The coefficient $y$ defined in (5) is also a function of $V$.
17. This is because $y$ and $M$ can be composed from $\psi_i$ and $\phi_i$, as in Lemma C.2.
The Mixing method $M_\theta: \mathbb{R}^{k\times n} \to \mathbb{R}^{k\times n}$ with step size $\theta$ and its normalizer $y: \mathbb{R}^{k\times n} \to \mathbb{R}^n$ are Lipschitz continuous to any optimum $V^*$. Further, there are constants $\beta, \gamma > 0$ such that
$$\|V - M_\theta(V)\|^2 \geq \big(\beta - \gamma\|y(V) - y(V^*)\|\big)\big(f(V) - f^*\big). \qquad (20)$$
As in the proof of Lemma D.2, proving the Lipschitz continuity of $y(V)$ and $M_\theta(V)$ reduces to proving the Lipschitz continuity of the mappings $\psi_i(V) = \|v_i - \theta V c_i\|$ and $\phi_i(V) = \frac{1}{\psi_i(V)}(v_i - \theta V c_i)$. First, the Lipschitz continuity of $\psi_i(V)$ follows from the Cauchy-Schwarz inequality:
$$|\psi_i(V) - \psi_i(V^*)| \leq \|v_i - \theta V c_i - (v_i^* - \theta V^* c_i)\| \leq (1 + \theta\|c_i\|)\,\|V - V^*\|.$$
Combining the above inequality and $\phi_i(V^*) = v_i^*$, we have the Lipschitz continuity
$$\begin{aligned}
\|\phi_i(V) - \phi_i(V^*)\| &= \Big\|\frac{1}{\psi_i(V)}(v_i - \theta V c_i) - \frac{1}{\psi_i(V)}(v_i^* - \theta V^* c_i) - \Big(\frac{\psi_i(V)}{\psi_i(V)} - \frac{\psi_i(V^*)}{\psi_i(V)}\Big)v_i^*\Big\|\\
&\leq \frac{1}{\psi_i(V)}\big((1 + \theta\|c_i\|)\|V - V^*\| + |\psi_i(V) - \psi_i(V^*)|\big)\\
&\leq \frac{2(1 + \theta\|c_i\|)}{\delta}\,\|V - V^*\|.
\end{aligned}$$
The last inequality is from Lemma C.1. Next, we prove the result (20). Under the notation in (6),
$$V - M_\theta(V) = V - V(I_n - \theta L)(\theta L^T + D_y)^{-1} = V(D_y - I_n + \theta C)(\theta L^T + D_y)^{-1}. \qquad (21)$$
Let $y^* = y(V^*)$ and $y = y(V)$. We have $y_i^* = 1 + \theta\|V^* c_i\|$ by $M_\theta(V^*) = V^*$, and thus $S^* = C + \frac{1}{\theta}(D_{y^*} - I_n) \succeq 0$ by Lemma 3.12. Further,
$$D_y - I_n + \theta C = \theta S^* + D_y - D_{y^*}. \qquad (22)$$
Expanding $\|V - M_\theta(V)\|^2$ with the equalities (21) and (22) and following the same analysis as in Lemma D.2, the result (20) holds with
$$\|V - M_\theta(V)\|^2 \geq \theta\Big(\frac{\theta\,\sigma_{\text{min-nz}}(S^*)}{y_{\max}^2} - \frac{2\|y - y^*\|}{y_{\min}^2}\Big)\big(f(V) - f^*\big).$$
For both Mixing methods $M$ (under Assumption 3.1) and $M_\theta$ (no assumption), there is a constant $\tau > 0$ such that, for the normalizers $y^* = y(V^*)$ and $y = y(V)$,
$$\|y - y^*\|^2 \leq \tau\,\big(f(V) - f^*\big).$$
First we prove the result for the Mixing method $M$. Denote by $Z_i$ the (flattened) inner cyclic iterate before updating $v_i$ to $\hat v_i$, so that only the vector $v_i$ changes between $Z_i$ and $Z_{i+1}$; that is,
$$V = Z_1 \to Z_2 \to \ldots \to Z_n \to Z_{n+1} = M(V).$$
Let $S^* = C + \mathrm{diag}(\|V^* c_i\|)$ for an optimal solution $V^*$. Observe that
$$(y_i - y_i^*)\,\hat v_i = -Z_i c_i - y_i^* \hat v_i = -Z_i s_i^* + y_i^*(v_i - \hat v_i).$$
The terms $Z_i s_i^*$ and $v_i - \hat v_i$ can be bounded by Lemma 3.12 and Lemma A.1 (under Assumption 3.1):
$$f(V) - f^* \geq f(Z_i) - f^* = \mathrm{tr}(Z_i^T Z_i S^*) \geq \frac{1}{\rho_{\max}(S^*)}\,\|Z_i S^*\|^2 \geq \frac{1}{\rho_{\max}(S^*)}\,\|Z_i s_i^*\|^2$$
and
$$f(V) - f^* \geq f(Z_i) - f(Z_{i+1}) \geq \delta\,\|v_i - \hat v_i\|^2.$$
Thus, by $(y_i - y_i^*)^2 \leq 2\|Z_i s_i^*\|^2 + 2(y_i^*)^2\|v_i - \hat v_i\|^2$ and the above two inequalities, $(y_i - y_i^*)^2$ is bounded by $f(V) - f^*$ times a constant. Summing over $i = 1 \ldots n$ gives the result for a constant $\tau > 0$. The result for the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i\|c_i\|_1)$ follows from the same analysis and Lemma C.3 (no assumption).
Proof of Theorem 3.5

Proof  By Lemma D.2, there are constants $\beta, \gamma > 0$ for the Mixing method $M$ such that
$$\|V - M(V)\|^2 \geq \big(\beta - \gamma\|y - y^*\|\big)\big(f(V) - f^*\big).$$
By Lemma D.4, we can pick a neighborhood of the optimum in which $f(V) - f^*$ is sufficiently small so that $\beta - \gamma\|y - y^*\| \geq \kappa$ for a constant $\kappa > 0$. Consequently,
$$\|V - M(V)\|^2 \geq \kappa\,\big(f(V) - f^*\big).$$
Further, the above inequality holds for all subsequent iterates, because $f(V) - f^*$ is strictly decreasing and thus remains small enough for $\beta - \gamma\|y - y^*\| \geq \kappa$ to hold. Together with Lemma A.1 (under Assumption 3.1), there is another constant $\delta > 0$ such that $f(V) - f(M(V)) \geq \delta\|V - M(V)\|^2$, and thus
$$f(V) - f(M(V)) \geq \delta\kappa\,\big(f(V) - f^*\big) \implies (1 - \delta\kappa)\big(f(V) - f^*\big) \geq f(M(V)) - f^*.$$
That is, the Mixing method $M$ converges Q-linearly to the optimal objective value in the neighborhood of the optimum. The same result follows for the Mixing method $M_\theta$ with step size $\theta \in (0, 1/\max_i\|c_i\|_1)$ from Lemma C.3 (no assumption) and Lemma D.3. Further, because $f(V) - f^* = \mathrm{tr}(V S^* V^T)$ and $S^* = C + \mathrm{diag}(\|V^* c_i\|) \succeq 0$ by Lemma 3.12, the above linear convergence in objective value implies the R-linear convergence of the rows of $V$ to the null space of $S^*$. Together with the Mixing methods being feasible methods, we obtain the linear convergence of $V$ to the global optimum in the neighborhood of the optimum.
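The local linear rate is visible in practice. In the short experiment below (our illustration, on a random dense $C$), the per-sweep decrease $f(V_r) - f(V_{r+1})$ shrinks geometrically once the iterates enter a neighborhood of the optimum, consistent with the theorem.

import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 9                                   # k(k+1)/2 > n, enough rank
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

def f(V):
    return np.trace(C @ V.T @ V)

prev = f(V)
for r in range(60):
    for i in range(n):                         # one cyclic sweep of updates (7)
        g = V @ C[:, i]
        V[:, i] = -g / np.linalg.norm(g)
    cur = f(V)
    if r % 10 == 0:
        print(f"sweep {r:2d}: decrease {prev - cur:.3e}")
    prev = cur                                 # decreases fall off geometrically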
Appendix E. Proof of Lemma 3.11: Rank Deficiency in Critical Points

The proof is a specialized version of Boumal et al. (2016, Lemma 9) for the MAXCUT SDP, in which their $Y$ is our $V$ and their $\mu(V)$ is our $y$. We list it here for completeness.

Proof
Let $V$ be a first-order critical point of problem (2), which means that there is a corresponding $y_i = \|V c_i\|$, $i = 1 \ldots n$, such that
$$V S = 0, \quad\text{where } S \equiv C + D_y.$$
This implies
$$\mathrm{rank}(V) \leq \mathrm{null}(C + D_y) \leq \max_\nu\, \mathrm{null}(C + D_\nu).$$
Note that the right-hand side is independent of $V$, so we can use it to bound the rank of all critical points $V$. Let $\nu$ be a maximizer of the right-hand side, $M \equiv C + D_\nu$, and $\ell \equiv \mathrm{null}(M)$. Writing $C = M - D_\nu$, we have
$$C \in \mathcal{N}_\ell + \mathrm{im}(\mathcal{D}),$$
in which $+$ denotes the set sum, $\mathrm{im}(\mathcal{D})$ denotes the image of the set of all diagonal matrices of size $n$, and $\mathcal{N}_\ell$ denotes the set of symmetric matrices of size $n$ with nullity $\ell$. Because of the symmetry of the matrices in $\mathcal{N}_\ell$,
$$\dim(\mathcal{N}_\ell) = \frac{n(n+1)}{2} - \frac{\ell(\ell+1)}{2}.$$
Further, for $M$ to admit a critical point $V$ of rank $k$, we must have $\ell \geq k$. Taking the union over all possible $\ell$,
$$C \in \bigcup_{\ell = k \ldots n} \mathcal{N}_\ell + \mathrm{im}(\mathcal{D}).$$
Note that the right-hand side is now independent of $C$. Because the dimension of a finite union is at most the maximal dimension, and the dimension of a finite set sum is at most the sum of the set dimensions,
$$\dim\Big(\bigcup_{\ell = k \ldots n} \mathcal{N}_\ell + \mathrm{im}(\mathcal{D})\Big) \leq \dim\big(\mathcal{N}_k + \mathrm{im}(\mathcal{D})\big) \leq \frac{n(n+1)}{2} - \frac{k(k+1)}{2} + n, \qquad (23)$$
where $\dim(\mathrm{im}(\mathcal{D})) = n$ because the space of diagonal matrices has $n$ free dimensions. Because the symmetric matrix $C$ lives in a space of dimension $\frac{n(n+1)}{2}$, almost no $C$ satisfies the right-hand side of (23) if we take $k$ large enough that
$$\frac{n(n+1)}{2} - \frac{k(k+1)}{2} + n < \frac{n(n+1)}{2}.$$
Thus, almost no $C$ admits a critical point of rank $k$ when $\frac{k(k+1)}{2} > n$, which means that for almost all $C$, every critical point has rank at most $k - 1$.
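The rank deficiency is also easy to observe empirically. In the toy run below (ours), $n = 10$ and $k = 5$ give $k(k+1)/2 = 15 > n$, and the Mixing method converges to a critical point whose numerical rank is at most $k - 1$ for a generic $C$.

import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 5                                   # k(k+1)/2 = 15 > n = 10
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
V = rng.standard_normal((k, n))
V /= np.linalg.norm(V, axis=0)

for _ in range(2000):                          # run the Mixing method (7) to convergence
    for i in range(n):
        g = V @ C[:, i]
        V[:, i] = -g / np.linalg.norm(g)

# numerical rank of the limit point; at most k - 1 for almost every C
print(np.linalg.matrix_rank(V, tol=1e-4))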
Algorithm 3: The Mixing method for the word embedding problem

  Initialize $v_i$ randomly on a unit sphere; initialize $b_i := 0$ for each $i = 1, \ldots, n$;
  while not yet converged do
      for $i = 1, \ldots, n$ do
          Solve $Hd + g = 0$ approximately by the conjugate gradient method, using
          $Hd := \sum_j w_{ij}(v_j^T d)\, v_j$ and $g := \sum_j w_{ij}\, e_{ij}\, v_j$, where $e_{ij} = v_i^T v_j + b_i + b_j - \log(c_{ij})$;
          Let $v_i := v_i + d$;
          Update the bias term $b_i := b_i - \big(\sum_j w_{ij} e_{ij}\big) / \big(\sum_j w_{ij}\big)$;
      end
  end

Appendix F. Application: Word embedding problem
Word embedding is a feature-learning technique that represents the meanings of words as low-dimensional vectors. For example, the popular Word2vec model (Mikolov et al., 2013b,a) successfully captures similarities and analogies between words using a shallow neural network. Another popular model, GloVe (Pennington et al., 2014), uses a factorization-based formulation and achieves better accuracy on analogy tasks than Word2vec. Theoretical justifications of these two models are discussed in RANDWALK (Arora et al., 2015). Here, we show that our coordinate descent approach can also be applied to learning word embeddings with the GloVe objective function, highlighting the fact that the Mixing methods can be applied to problems typically regarded as purely non-convex optimization.
Problem description.
Let $C$ be the co-occurrence matrix, so that $c_{ij}$ is the number of times words $i$ and $j$ co-occur within the same window in the corpus. We consider solving a slightly modified version of the GloVe objective function (Pennington et al., 2014),
$$\min_{V \in \mathbb{R}^{k\times n},\; b \in \mathbb{R}^n}\; \sum_{i \neq j} w_{ij}\,\big(v_i^T v_j + b_i + b_j - \log c_{ij}\big)^2,$$
where $n$ is the size of the vocabulary, $k$ is the number of latent factors, and $w_{ij}$ is a weighting factor that suppresses the influence of high-frequency words (GloVe uses $w_{ij} = \min\{(c_{ij}/c_{\max})^{3/4},\, 1\}$ for a cutoff parameter $c_{\max}$). The only difference from GloVe is that we do not include the self-loop terms (i.e., $i = j$) in the formulation.
Focusing on the subproblem involving only the variable $v_i$ and taking a step $d \in \mathbb{R}^k$, the subproblem $\min_d f(v_i + d)$ becomes
$$\min_{d \in \mathbb{R}^k}\; \sum_j w_{ij}\,\big((v_i + d)^T v_j + b_i + b_j - \log c_{ij}\big)^2.$$

[Figure 3: $(f - f_{\min})$ versus training time for the word embedding problem on (a) wiki8 and (b) enwiki, where $f_{\min}$ is the minimum objective value we obtained. The experiments show that the Mixing method generalizes to nonlinear objective functions and is faster than SGD.]

Define $e_{ij} = v_i^T v_j + b_i + b_j - \log c_{ij}$. Then the above subproblem can be reformulated as
$$\min_{d \in \mathbb{R}^k}\; d^T \Big(\underbrace{\textstyle\sum_j w_{ij}\, v_j v_j^T}_{H}\Big)\, d + 2\Big(\underbrace{\textstyle\sum_j w_{ij}\, e_{ij}\, v_j}_{g}\Big)^T d,$$
which is an unconstrained quadratic program over $d \in \mathbb{R}^k$. In practice, we apply the conjugate gradient method with a stopping condition on $\|\nabla_d f(v_i + d)\|$ to obtain a good enough approximation, and we update all the $v_i$ cyclically. In total, updating all the variables once takes $O(k \cdot m \cdot T)$ time, where $m$ is the number of nonzeros in $C$ and $T$ is the number of conjugate gradient iterations, which is typically small in our settings. Algorithm 3 contains a complete description.
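For concreteness, the following Python sketch (our rendering of Algorithm 3 on dense toy data; the names update_word, W, and logC are ours, and we recompute the residuals before the bias update) implements one inner iteration: a few conjugate gradient steps for $Hd = -g$, followed by the $v_i$ and $b_i$ updates.

import numpy as np

def update_word(i, V, b, W, logC, cg_steps=5):
    """One inner iteration of Algorithm 3 for word i (dense toy version)."""
    w = W[i].copy()
    w[i] = 0.0                                 # exclude the self-loop term (i = j)
    e = V.T @ V[:, i] + b[i] + b - logC[i]     # e_ij = v_i^T v_j + b_i + b_j - log c_ij
    g = V @ (w * e)                            # g = sum_j w_ij e_ij v_j
    H = lambda d: V @ (w * (V.T @ d))          # Hessian-vector product H d
    # conjugate gradient on H d = -g, starting from d = 0
    d = np.zeros(V.shape[0])
    r = -g
    p = r.copy()
    for _ in range(cg_steps):
        Hp = H(p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    V[:, i] += d
    e = V.T @ V[:, i] + b[i] + b - logC[i]     # refresh residuals with the new v_i
    b[i] -= (w @ e) / w.sum()                  # exact minimization over b_i

# toy usage with synthetic weights and log-counts
rng = np.random.default_rng(6)
k, n = 8, 20
V = 0.1 * rng.standard_normal((k, n))
b = np.zeros(n)
W = rng.random((n, n)); W = (W + W.T) / 2
logC = rng.standard_normal((n, n)); logC = (logC + logC.T) / 2
for i in range(n):
    update_word(i, V, b, W, logC)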
Results.  Figure 3 compares the proposed Mixing method with the stochastic gradient method (SGD), the default solver for GloVe, on the wiki8 and enwiki corpora, two widely used benchmarks for word embeddings. The corpora are pre-processed following Pennington et al. (2014) (removing non-textual elements, sentence splitting, and tokenization), and words that appear fewer than a threshold number of times in the corpus are ignored. For both datasets, the Mixing method converges substantially faster than SGD and attains a lower objective value.