A Faster Interior Point Method for Semidefinite Programming
Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, Zhao Song
aa r X i v : . [ c s . D S ] S e p A Faster Interior Point Method for Semidefinite Programming
Haotian Jiang ∗ Tarun Kathuria † Yin Tat Lee ‡ Swati Padmanabhan § Zhao Song ¶ Abstract
Semidefinite programs (SDPs) are a fundamental class of optimization problems with im-portant recent applications in approximation algorithms, quantum complexity, robust learning,algorithmic rounding, and adversarial deep learning. This paper presents a faster interior pointmethod to solve generic SDPs with variable size n × n and m constraints in time e O ( √ n ( mn + m ω + n ω ) log(1 /ǫ )) , where ω is the exponent of matrix multiplication and ǫ is the relative accuracy. In the predom-inant case of m ≥ n , our runtime outperforms that of the previous fastest SDP solver, which isbased on the cutting plane method [JLSW20].Our algorithm’s runtime can be naturally interpreted as follows: e O ( √ n log(1 /ǫ )) is the num-ber of iterations needed for our interior point method, mn is the input size, and m ω + n ω is thetime to invert the Hessian and slack matrix in each iteration. These constitute natural barriersto further improving the runtime of interior point methods for solving generic SDPs. ∗ [email protected] . University of Washington. † [email protected] . University of California, Berkeley. ‡ [email protected] . University of Washington. § [email protected] . University of Washington ¶ [email protected] . Princeton University and Institute for Advanced Study. ontents T mat ( n, mn, n ) and T mat ( m, n , m ) . . . . . . . . . . . . . . 143.4 Specific upper bound on T mat ( m, n , m ) . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.1 Exponent of matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36A.2 Matrix multiplication tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36A.3 Implication of matrix multiplication technique . . . . . . . . . . . . . . . . . . . . . . 37A.4 General bound on T mat ( n, mn, n ) and T mat ( m, n , m ) . . . . . . . . . . . . . . . . . . 391 Introduction
Semidefinite programs (SDPs) constitute a class of convex optimization problems that optimize alinear objective over the intersection of the cone of positive semidefinite matrices with an affine space.SDPs generalize linear programs and have a plethora of applications in operations research, controltheory, and theoretical computer science [VB96]. Applications in theoretical computer science in-clude improved approximation algorithms for fundamental problems (e.g., Max-Cut [GW95], color-ing 3-colorable graphs [KMS94], and sparsest cut [ARV09]), quantum complexity theory [JJUW11],robust learning and estimation [CG18, CDG19, CDGW19], and algorithmic discrepancy and round-ing [BDG16, BG17, Ban19]. We formally define SDPs with variable size n × n and m constraints: Definition 1.1 (Semidefinite programming) . Given symmetric matrices C, A , · · · , A m ∈ R n × n and b i ∈ R for all i ∈ [ m ] , the goal is to solve the convex optimization problem max h C, X i subject to X (cid:23) , h A i , X i = b i ∀ i ∈ [ m ] (1) where h A, B i := P i,j A i,j B i,j is the trace product. Cutting plane and interior point methods
Two prominent methods for solving SDPs, withruntimes depending logarithmically on the accuracy parameter ǫ , are the cutting plane method andthe interior point method .The cutting plane method maintains a convex set containing the optimal solution. In eachiteration, the algorithm queries a separation oracle, which returns a hyperplane that divides theconvex set into two subsets. The convex set is then updated to contain the subset with the optimalsolution. This process is repeated until the volume of the maintained set becomes small enough anda near-optimal solution can be found. Since Khachiyan proved [Kha80] that the ellipsoid methodsolves linear programs in polynomial time, cutting plane methods have played a crucial role in bothdiscrete and continuous optimization [GLS81, GV02].In contrast, interior point methods add a barrier function to the objective and, by adjusting theweight of this barrier function, solve a different optimization problem in each iteration. The solutionsto these successive problems form a well-defined central path . Since Karmarkar proved [Kar84] thatinterior point methods can solve linear programs in polynomial time, these methods have becomean active research area. Their number of iterations is usually the square root of the number ofdimensions, as opposed to the linear dependence on dimensions in cutting plane methods.Since cutting plane methods use less structural information than interior point methods, theyare slower at solving almost all problems where interior point methods are known to apply. However,SDPs remain one of the most fundamental optimization problems where the state of the art is, infact, the opposite: the current fastest cutting plane methods of [LSW15, JLSW20] solve a generalSDP in time m ( mn + m + n ω ) , while the fastest SDP solvers based on interior point methods in thework of [NN92] and [Ans00] achieve runtimes of √ n ( m n + mn ω + m ω ) and ( mn ) / ( m n + m n ω ) ,respectively, which are slower in the most common regime of m ∈ [ n, n ] (see Table 1.2). Thisapparent paradox raises the following natural question: How fast can SDPs be solved using interior point methods? We can assume that
C, A , · · · , A m are symmetric, since given any M ∈ { C, A , · · · , A m } , we have P i,j M ij X ij = P i,j M ij X ji = P i,j ( M ⊤ ) ij X ij , and therefore we can replace M with ( M + M ⊤ ) / . [JLSW20] improves upon the runtime of [LSW15] in terms of the dependence on log( n/ǫ ) , while the polynomialfactors are the same in both runtimes. .1 Our results We present a faster interior point method for solving SDPs. Our main result is the following theorem,the formal version of which is given in Theorem 4.1.
Theorem 1.2 (Main result, informal) . There is an interior point method that solves a general SDPwith variable size n × n and m constraints in time O ∗ ( √ n ( mn + m ω + n ω )) . Our runtime can be roughly interpreted as follows:• √ n is the iteration complexity of the interior point method with the log barrier function.• mn is the input size.• m ω is the cost of inverting the Hessian of the log barrier.• n ω is the cost of inverting the slack matrix.Thus, the terms in the runtime of our algorithm arise as a natural barrier to further speeding upSDP solvers. See Section 1.2.2, 1.2.3, and 1.2.4 for more detail.Table 1.1 compares our result with previous SDP solvers. The first takeaway of this table andTheorem 1.2 is that our interior point method always runs faster than that in [NN92] and is fasterthan that in [NN94] and [Ans00] when m ≥ n / . A second consequence is that whenever m ≥ √ n ,our interior point method is faster than the current fastest cutting plane method [LSW15, JLSW20].We note that n ≤ m ≤ n is satisfied in most SDP applications known to us, such as classicalcombinatorial optimization problems over graphs, experiment design problems in statistics andmachine learning, and sum-of-squares problems. An explicit comparison to previous algorithms inthe cases of m = n and m = n is shown in Table 1.2. Year References Method m mn + m + n ω m mn + m . + n ω m mn + m ω + n ω √ n m n + mn ω + m ω ( mn ) / m n + m n ω m mn + m ω + n ω m mn + m + n ω m mn + m + n ω √ n mn + m ω + n ω Table 1.1: Summary of key SDP algorithms. CPM stands for cutting plane method, and IPM, interior point method. n is the size of the variable matrix, and m ≤ n is the number of constraints. Runtimes hide n o (1) , m o (1) and poly log(1 /ǫ ) factors, where ǫ is the accuracy parameter. [Ans00] simplifies the proofs in [NN94, Section 5.5]. Nei-ther [Ans00] nor [NN94] explicitly analyzed their runtimes, and their runtimes shown here are our best estimates. Even in the more general case where the SDP might not be dense, where nnz( A ) is the inputsize (i.e., the total number of non-zeroes in all matrices A i for i ∈ [ m ] and C ), our interior pointmethod runs faster than the current fastest cutting plane methods[LSW15, JLSW20], which run intime O ∗ ( m (nnz( A ) + m + n ω )) . We use O ∗ to hide n o (1) and log O (1) ( n/ǫ ) factors and e O to hide log O (1) ( n/ǫ ) factors, where ǫ is the accuracyparameter. ear References Method Runtime m = n m = n n n n . n n n . n . n . n . n . n n . n n n n n . n . Table 1.2: Total runtimes for the algorithms in Table 1.1 for SDPs when m = n and m = n , where n is the size ofmatrices, and m is the number of constraints. The runtimes shown in the table hide n o (1) , m o (1) and poly log(1 /ǫ ) factors, where ǫ is the accuracy parameter and assume ω to equal its currently best known upper bound of . . Theorem 1.3 (Comparison with Cutting Plane Method) . When m ≥ n , there is an interior pointmethod that solves an SDP with n × n matrices, m constraints, and nnz( A ) input size, faster thanthe current best cutting plane method [LSW15, JLSW20], over all regimes of nnz( A ) . By removing redundant constraints, we can, without loss of generality, assume m ≤ n in the primalformulation of the SDP (1). Thereafter, instead of solving the primal SDP, which has variable size n × n , we solve its dual formulation, which has dimension m ≤ n : min b ⊤ y subject to S = m X i =1 y i A i − C, and S (cid:23) . 
(2)Interior point methods solve (2) by minimizing the penalized objective function: min y ∈ R m f η ( y ) , where f η ( y ) := η · b ⊤ y + φ ( y ) , (3)where η > is a parameter and φ : R m → R is a barrier function that approaches infinity as y approaches the boundary of the feasible set { y ∈ R m : P mi =1 y i A i (cid:23) C } . These methods first obtainan approximate minimizer of f η for some small η > , which they then use as an initial point tominimize f (1+ c ) η , for some constant c > , via the Newton method. This process repeats until theparameter η in (3) becomes sufficiently large, at which point the minimizer of f η is provably close tothe optimal solution of (2). The iterates y generated by this method follow a central path . Differentchoices of the barrier function φ lead to different run times in solving (3), as we next describe. The log barrier
Nesterov and Nemirovski [NN92] use the log barrier function, φ ( y ) = g ( y ) := − log det m X i =1 y i A i − C ! , (4)in (3) and, in O ( √ n log( n/ǫ )) iterations, obtain a feasible dual solution y that satisfies b ⊤ y ≤ b ⊤ y ∗ + ǫ , where y ∗ ∈ R m is the optimal solution for (2). Within each iteration, the costliest step4s to compute the inverse of the Hessian of the log barrier function for the Newton step. For each ( j, k ) ∈ [ m ] × [ m ] , the ( j, k ) -th entry of H is given by H j,k = tr[ S − A j S − A k ] . (5)The analysis of [NN92] first computes S − / A j S − / for all j ∈ [ m ] , which takes time O ∗ ( mn ω ) ,and then calculates the m trace products tr[ S − A j S − A k ] for all ( j, k ) ∈ [ m ] × [ m ] , each ofwhich takes O ( n ) time. Inverting the Hessian costs O ∗ ( m ω ) , which results in a total runtime of O ∗ ( √ n ( m n + mn ω + m ω )) . The volumetric barrier
Vaidya [Vai89a] introduced the volumetric barrier for a polyhedral set { x ∈ R n : Ax ≥ c } , where A ∈ R m × n and c ∈ R m . Nesterov and Nemirovski [NN94] studied thefollowing extension of the volumetric barrier to the convex subset { y ∈ R m : P mi =1 y i A i (cid:23) C } of thepolyhedral cone: V ( y ) = 12 log det( ∇ g ( y )) , where g ( y ) is the log barrier function defined in (4). They proved that choosing φ ( y ) = √ nV ( y ) in (3) makes the interior point method converge in e O ( √ mn / ) iterations, which is smaller than the e O ( √ n ) iteration complexity of [NN92] when m ≤ √ n . They also studied the combined volumetric-logarithmic barrier V ρ ( y ) = V ( y ) + ρ · g ( y ) and showed that taking φ ( y ) = p n/m · V ρ ( y ) for ρ = ( m − / ( n − yields an iteration complexityof e O (( mn ) / ) . when m ≤ n , this iteration complexity is lower than e O ( √ n ) of [NN92]. We referreaders to the much simpler proofs in [Ans00] for these results.However, the volumetric barrier (and thus the combined volumetric-logarithmic barrier) leads tocomplicated expressions for the gradient and Hessian that make each iteration costly. For instance,the Hessian of the volumetric barrier is ∇ V ( y ) = 2 Q ( y ) + R ( y ) − T ( y ) , where Q ( y ) , R ( y ) , and T ( y ) are m × m matrices such that for each ( j, k ) ∈ [ m ] × [ m ] , Q ( y ) j,k = tr h A H − A ⊤ (cid:0)(cid:0) S − A j S − A k S − (cid:1) b ⊗ S − (cid:1)i ,R ( y ) j,k = tr h A H − A ⊤ (cid:0)(cid:0) S − A j S − (cid:1) b ⊗ (cid:0) S − A k S − (cid:1)(cid:1)i , (6) T ( y ) j,k = tr h A H − A ⊤ (cid:0)(cid:0) S − A j S − (cid:1) b ⊗ S − (cid:1) A H − A ⊤ (cid:0)(cid:0) S − A k S − (cid:1) b ⊗ S − (cid:1)i . Here,
A ∈ R n × m is the n × m matrix whose i th column is obtained by flattening A i into a vectorof length n , and b ⊗ is the symmetric Kronecker product A b ⊗ B := 12 ( A ⊗ B + B ⊗ A ) , where ⊗ is the Kronecker product (see Section 2.1for formal definition). Due to the complicatedformulas in (6), efficient computation of Newton step in each iteration of the interior point methodis difficult; in fact, each iteration runs slower than the Nesterov-Nemirovski interior point methodby a factor of m . Since most applications of SDPs known to us have the number of constraints m be at least linear in n , the total runtime of interior point methods based on the volumetric barrierand the combined volumetric-logarithmic barrier is inevitably slow.5 .2.2 Our techniques Given the inefficiency of implementing the volumetric and volumetric-logarithmic barriers discussedabove, this paper uses the log barrier in (4). We now describe some of our key techniques thatimprove the runtime of the Nesterov-Nemirovski interior point method [NN92].
Hessian computation using fast rectangular matrix multiplication
As noted in Sec-tion 1.2.1, the runtime bottleneck in [NN92] is computing the inverse of the Hessian of the logbarrier function, where the Hessian is described in (5). In [NN92], each of these m entries iscomputed separately, resulting in a runtime of O ( m n ) per iteration.Instead contrast, we show below how to group these computations using rectangular matrixmultiplication. The expression from (5) can be re-written as H j,k = tr[ S − / A j S − / · S − / A k S − / ] . (7)We first compute the key quantity S − / A j S − / ∈ R n × n for all j ∈ [ m ] by stacking all matrices A j ∈ R n × n into a tall matrix of size mn × n , and then compute the product of S − / ∈ R n × n with thistall matrix. This matrix product can be computed in time T mat ( n, mn, n ) using fast rectangularmatrix multiplication. We then flatten each S − / A j S − / into a row vector of length n and stackall m vectors to form a matrix B of size m × n , i.e., the j -th row of B is B j = vec ( S − / A j S − / ) .It follows that the Hessian can be computed as H = BB ⊤ , (8)which takes time T mat ( m, n , m ) by applying fast rectangular matrix multiplication. By leveragingrecent developments in this area [GU18], this approach already improves upon the runtime in [NN92].Thus far, we have reduced the per iteration cost of O ∗ ( m n + mn ω ) for Hessian computationdown to T mat ( n, mn, n ) + T mat ( m, n , m ) . Low rank update on the slack matrix
The fast rectangular matrix multiplication approachnoted above, however, is still not very efficient, because the Hessian must be computed from scratchin each iteration of the interior point method. If there are T iterations in total, it then takes time T · ( T mat ( n, mn, n ) + T mat ( m, n , m )) . To further improve the runtime, we need to efficiently update the Hessian for the current iterationfrom the Hessian computed in the previous one. Generally, this is not possible, as the slack matrix S ∈ R n × n in (7) might change arbitrarily in the Nesterov-Nemirovski interior point method.To overcome this problem, we propose a new interior point method that maintains an approxi-mate slack matrix e S ∈ R n × n , which is a spectral approximation of the true slack matrix S ∈ R n × n such that e S admits a low-rank update in each iteration. Where needed, we will now use the subscript t to denote a matrix in the t -th iteration. Our algorithm updates only the directions in which e S t deviates too much from S t +1 ; the changes to S t for the remaining directions are not propagated in e S t . This process of selective update ensures a low-rank change in e S t even when S t suffers from a See Section 3 for the definition. t ∈ [ T ] , we define the difference matrix Z t = S − / t e S t S − / t − I ∈ R n × n , which intuitively captures how far the approximate slack matrix e S t is from the true slack matrix S t .We maintain the invariant k Z t k op ≤ c for some sufficiently small constant c > . In the ( t + 1) -thiteration when S t gets updated to S t +1 , our construction of e S t +1 involves a novel approach of zeroingout some of the largest eigenvalues of | Z t | to bound the rank of the update on the approximate slackmatrix.We prove that with this approach, the updates on e S ∈ R n × n over all T = e O ( √ n ) iterationssatisfy the following rank inequality (see Theorem 6.1for the formal statement). Theorem 1.4 (Rank inequality, informal version) . Let e S , e S , · · · , e S T ∈ R n × n denote the sequenceof approximate slack matrices generated in our interior point method. For each t ∈ [ T − , denoteby r t = rank( e S t +1 − e S t ) the rank of the update on e S t . Then, the sequence r , r , · · · , r T satisfies T X t =1 √ r t = e O ( T ) . The key component to proving Theorem 1.4 is the potential function
Φ : R n × n → R ≥ Φ( Z ) := n X ℓ =1 | λ ( Z ) | [ ℓ ] √ ℓ , where | λ ( Z ) | [ ℓ ] is the ℓ -th in the list of eigenvalues of Z ∈ R n × n sorted in decreasing order of theirabsolute values. We show an upper bound on the increase in this potential when S is updated, alower bound on its decrease when e S is updated, and combine the two with non-negativity of thepotential to obtain Theorem 1.4.Specifically, first we prove that whenever S is updated in an iteration, the potential function increases by at most e O (1) (see Lemma 6.2). The proof of this statement crucially uses the structuralproperty of interior point method that slack matrices in consecutive steps are sufficiently close toeach other. Formally, for any iteration t ∈ [ T ] , we show in Theorem 5.1 that the consecutive slackmatrices S t and S t +1 satisfy k S − / t S t +1 S − / t − I k F = O (1) (9)and combine this bound with the Hoffman-Wielandt theorem [HJ12], which relates the ℓ distancebetween the spectrum of two matrices with the Frobenius norm of their difference (see Fact 2.2).Next, when e S gets updated, we prove that our method of zeroing out the r t largest eigenvalues of | Z t | , thereby incurring a rank- r t update to e S t , results in a potential decrease of at least e O ( √ r t ) (seeLemma 6.3). Maintaining rectangular matrix multiplication for Hessian computation.
Given the low-rank update on e S described above, we show how to efficiently update the approximate Hessian e H ,defined as e H j,k = tr[ e S − A j e S − A k ] (10)7or each entry ( j, k ) ∈ [ m ] × [ m ] . The approximate slack matrix e S being a spectral approximation ofthe true slack matrix S implies that the approximate Hessian e H is also a spectral approximation ofthe true Hessian H (see Lemma 5.3). This approximate Hessian therefore suffices for our algorithmto approximately follow the central path.To efficiently update the approximate Hessian e H in (10), we notice that a rank- r update on e S implies a rank- r update on e S − via the Woodbury matrix identity (see Fact 2.4). The change in e S − can be expressed as ∆( e S − ) = V + V ⊤ + − V − V ⊤− , (11)where V + , V − ∈ R n × r . Plugging (11) into (10), we can express ∆ e H j,k as the sum of multiple terms,among the costliest of which are those of the form tr[ e S − A j V V ⊤ A k ] , where V ∈ R n × r is either V + or V − . We compute tr[ e S − A j V V ⊤ A k ] for all ( j, k ) ∈ [ m ] × [ m ] in time T mat ( r, n, mn ) by firstcomputing V ⊤ A k for all k ∈ [ m ] by horizontally concatenating all A k ’s into a wide matrix of size n × mn . We then compute the product of e S − / with A j V for all j ∈ [ m ] , which can be done in time T mat ( n, n, mr ) , which equals T mat ( n, mr, n ) (see Lemma 3.3). Finally, by flattening each e S − / A j V into a vector of length nr and stacking all these vectors to form a matrix e B ∈ R m × nr with j -th row e B j = vec ( e S − / A j V ) , the task of computing tr[ e S − A j V V ⊤ A k ] for all ( j, k ) ∈ [ m ] × [ m ] reduces to computing e B e B ⊤ , whichcosts T mat ( m, nr, m ) .In this way, we reduce the runtime of T · ( T mat ( n, mn, n ) + T mat ( m, n , m )) for computing theHessian using fast rectangular matrix multiplication down to T X t =1 ( T mat ( r t , n, mn ) + T mat ( n, mr t , n ) + T mat ( m, nr t , m )) , (12)where r t is the rank of the update on e S t . Applying Theorem 1.4 with several properties of fastrectangular matrix multiplication that we prove in Section 3 , we upper bound the runtime in (12)by O ∗ ( √ n ( mn + m ω + n ω )) , which implies Theorem 1.2. In Section 1.2.3 and 1.2.4, we discuss bottlenecks to further improvingour runtime. In most cases, the costliest term in our runtime is the per iteration cost of mn , which correspondsto reading the entire input in each iteration. Our subsequent discussions therefore focus on thesteps in our algorithm that require at least mn time per iteration. Slack matrix computation.
When y is updated in each iteration of our interior point method,we need to compute the true slack matrix S as S = X i ∈ [ m ] y i A i − C. Computing S is needed to update the approximate slack matrix e S so that e S remains a spectralapproximation to S . As S might suffer from full-rank changes, it naturally requires mn time tocompute in each iteration. This is the first appearance of the mn cost per iteration.8 radient computation. Recall from (3) that our interior point method follows the central pathdefined via the penalized objective function min y ∈ R m f η ( y ) where f η ( y ) := ηb ⊤ y + φ ( y ) , for a parameter η > and φ ( y ) = − log det S . In each iteration, to perform the Newton step, thegradient of the penalized objective is computed as g η ( y ) j = η · b j − tr[ S − A j ] (13)for each coordinate j ∈ [ m ] . Even if we are given S − , it still requires mn time to compute (13)for all j ∈ [ m ] . This is the second appearance of the per iteration cost of mn . Approximate Hessian computation.
Recall from Section 1.2.2 that updating the approximateslack matrix S by rank r means the time needed to update the approximate Hessian is dominatedby computing the term ∆ j,k = tr[ e S − / A j V · V ⊤ A k e S − / ] , where V ∈ R n × r is a tall, skinny matrix that comes from the spectral decomposition of ∆ e S − .Computing ∆ j,k for all ( j, k ) ∈ [ m ] × [ m ] requires reading at least A j for all j ∈ [ m ] , which takestime mn . This is the third bottleneck that leads to the mn term in the cost per iteration. The preceeding discussion of bottlenecks suggests that reading the entire input in each iteration,which takes mn time per iteration, stands as a natural barrier to further improving the runtime ofSDP solvers based on interior point methods.In the context of linear programming (LP), several recent results [CLS19, BLSS20] yield fasterinterior point methods that bypass reading the entire input in every iteration. Two techniques crucialto these results are: (1) showing that the Hessian (projection matrix) admits low-rank updates, and(2) speeding computation of the Hessian via sampling.We now describe these techniques in the context of SDP and argue that they are unlikely toimprove our runtime. Showing that the Hessian admits low-rank updates.
We saw in Section 1.2.2 that construct-ing an approximate slack matrix e S that admits low-rank updates in each iterations leveraged thefact that the true slack matrix S changes “slowly” throughout our interior point method as describedin (9). One natural question that follows is whether a similar upper bound can be obtained for theHessian. If such a result could be proved, then one could maintain an approximate Hessian thatadmitted low-rank updates, which would speed up the approximate Hessian computation. Indeed,in the context of LP, such a bound for the Hessian can be proved (e.g., [BLSS20, Lemma 47]).Unfortunately, it is impossible to prove such a statement for the Hessian in the context of SDP.To show this, it is convenient to express the Hessian using the Kronecker product (Section 2.1)as H = A ⊤ · ( S − ⊗ S − ) · A , where A ∈ R n × m is the n × m matrix whose i th column is obtained by flattening A i into a vector oflength n . By proper scaling, we can assume without loss of generality that the current slack matrix9s S = I , and the slack matrix in the next iteration is S new = I + ∆ S , which satisfies k ∆ S k F = c for some tiny constant c > . Consider the simple example where A = I (we are assuming herethat m = n so that A is a square matrix), which implies that the change in the Hessian can beapproximately computed as (cid:13)(cid:13)(cid:13) H − / ∆ HH − / (cid:13)(cid:13)(cid:13) F ≈ tr h (( I − ∆ S ) ⊗ ( I − ∆ S ) − I ⊗ I ) i ≈ tr (cid:2) ( I ⊗ ∆ S + ∆ S ⊗ I ) (cid:3) ≥ · tr[ I ] · tr (cid:2) (∆ S ) (cid:3) = 2 n k ∆ S k F ≫ . This large change indicates that we are unlikely to obtain an approximation to the Hessian thatadmits low-rank updates, which is a key difference between LP and SDP.
Sampling for faster Hessian computation.
Recall from (8) that the Hessian can be computedas H = B · B ⊤ , where the j th row of B ∈ R m × n is B j = vec ( S − / A j S − / ) for all j ∈ [ m ] . We might attemptto approximately compute H faster by sampling a subset of columns of B indexed by L ⊆ [ n ] and compute the product for only the sampled columns. This could reduce the dimension of thematrix multiplication and speed up the Hessian computation. Indeed, sampling techniques havebeen successfully used to obtain faster LP solvers [CLS19, BLSS20].For SDP, however, sampling is unlikely to speed up the Hessian computation. In general, wemust sample at least m columns (i.e. | L | ≥ m ) of B to spectrally approximate H or the computedmatrix will not be full rank. However, this requires computing the entries of S − / A j S − / thatcorrespond to L ⊆ [ n ] for all j ∈ [ m ] , which requires reading all A j ’s and thus still takes O ( mn ) time. Linear Programming.
Linear Programming is a class of fundamental problems in convex opti-mization. There is a long list of work focused on fast algorithms for linear programming [Dan47,Kha80, Kar84, Vai87, Vai89b, LS14, LS15, Sid15, Lee16, CLS19, Bra20, BLSS20].
Cutting Plane Method.
Cutting plane method is a class of optimization methods that itera-tively refine a convex set that contains the optimal solution by querying a separation oracle. Sinceits introduction in the 1950s, there has been a long line of work on obtaining fast cutting planemethods [Sho77, YN76, Kha80, KTE88, NN89, Vai89a, AV95, BV02, LSW15, JLSW20].
First-Order SDP Algorithms.
As the focus of this paper, cutting plane methods and interiorpoint methods solve SDPs in time that depends logarithmically on /ǫ , where ǫ is the accuracyparameter. A third class of algorithms, the first-order methods , solve SDPs at runtimes that depend polynomially on /ǫ . While having worse dependence on /ǫ compared to IPM and CPM, thesefirst-order algorithms usually have better dependence on the dimension. There is a long list of workon first-order methods for general SDP or special classes of SDP (e.g. Max-Cut SDP [AK07, GH16,AZL17, CDST19, LP20, YTF + + Preliminaries
For any integer d , we use [ d ] to denote the set { , , · · · , d } . We use S n × n to denote the set ofsymmetric n × n matrices, S n × n ≥ for the set of n × n positive semidefinite matrices, and S n × n> forthe set of n × n positive definite matrices. For two matrices A, B ∈ S n × n , the notation A (cid:22) B means that B − A ∈ S n × n ≥ . When clear from the context, we use to denote the all-zeroes matrix(e.g. A (cid:23) ). For a vector v ∈ R n , we use diag( v ) to denote the diagonal n × n matrix with diag( v ) i,i = v i . For A, B ∈ S n × n , we define the inner product to be the trace product of A and B ,defined as h A, B i := tr[ A ⊤ B ] = P i,j ∈ [ n ] A i,j B i,j . For two matrices A ∈ R m × n and B ∈ R k × l , the Kronecker product of A and B , denoted as A ⊗ B , is defined as the mk × nl block matrix whose ( i, j ) block is A i,j B , for all ( i, j ) ∈ [ m ] × [ n ] .Throughout this paper, unless otherwise specified, m denotes the number of constraints for theprimal SDP (1), and the variable matrix X is of size n × n . The number of non-zero entries in allthe A i and C of (1) is denoted by nnz( A ) . Linear algebra.
Some matrix norms we frequently use in this paper are the Frobenius and op-erator norms, defined as follows. The Frobenius norm of a matrix A ∈ R n × n is defined to be k A k F := p tr[ A ⊤ A ] . The operator (or spectral) norm k A k op of A ∈ R n × n is defined to be thelargest singular value of A . In the case of symmetric matrices (which is what we encounter in thispaper), this can be shown to equal the largest absolute eigenvalue of the matrix. A property of tracewe frequently use is the following: given matrices A ∈ R m × n , A ∈ R n × n , . . . , A k ∈ R n k − × n k , thetrace of their product is invariant under cyclic permutation tr[ A A . . . A k ] = tr[ A A . . . A k A ] = · · · = tr[ A k A . . . A k − A k − ] . A matrix A ∈ R n × n is called normal if A commutes with its transpose,i.e. AA ⊤ = A ⊤ A . We note that all symmetric n × n matrices are normal. Two matrices A, B ∈ R n × n are said to be similar if there exists a nonsingular matrix S ∈ R n × n such that A = S − BS . Inparticular, if matrices A and B are similar, then they have the same set of eigenvalues. We use thefollowing simple fact involving Loewner ordering: given two invertible matrices A and B satisfying α B (cid:22) A (cid:22) αB for some α > , we have α B − (cid:22) A − (cid:22) αB − . We further need the following facts. Fact 2.1 (Generalized Lieb-Thirring Inequality [Eld13, ALO16, JLL + . Given a symmetric ma-trix B , a positive semi-definite matrix A and α ∈ [0 , , we have tr[ A α BA − α B ] ≤ tr[ AB ] . Fact 2.2 (Hoffman-Wielandt Theorem, [HW53, HJ12]) . Let
A, E ∈ R n × n such that A and A + E are both normal matrices. Let λ , λ , . . . , λ n be the eigenvalues of A , and let b λ , b λ , . . . , b λ n be theeigenvalues of A + E in any order. There is a permutation σ of the integers , . . . , n such that P i ∈ [ n ] | b λ σ ( i ) − λ i | ≤ k E k F = tr[ E ∗ E ] . Fact 2.3 (Corollary of the Hoffman-Wielandt Theorem, [HJ12]) . Let
A, E ∈ R n × n such that A isHermitian and A + E is normal. Let λ , . . . , λ n be the eigenvalues of A arranged in increasing order λ ≤ . . . ≤ λ n . Let b λ , . . . , b λ n be the eigenvalues of A + E , ordered so that Re( b λ ) ≤ . . . ≤ Re( b λ n ) .Then, P i ∈ [ n ] | b λ i − λ i | ≤ k E k F . Fact 2.4 (Woodbury matrix identity, [Woo49, Woo50]) . Given matrices A ∈ R n × n , U ∈ R n × k , C ∈ R k × k , and V ∈ R k × n , such that A , C , and A + U CV are invertible, we have ( A + U CV ) − = A − − A − U ( C − + V A − U ) − V A − . Matrix Multiplication
The main goal of this section is to derive upper bounds on the time to perform the following tworectangular matrix multiplication tasks (Lemma 3.9, 3.10, and 3.11):• Multiplying a matrix of dimensions m × n with one of dimensions n × m ,• Multiplying a matrix of dimensions n × mn with one of dimensions mn × n .Besides being crucial to the runtime analysis of our interior point method in Section 7, these results(as well as several intermediate results) might be of independent interest. We need the following definitions to describe the cost of certain fundamental matrix operations weuse.
Definition 3.1.
Define T mat ( n, r, m ) to be the number of operations needed to compute the productof matrices of dimensions n × r and r × m . Definition 3.2.
We define the function ω ( k ) to be the minimum value such that T mat ( n, n k , n ) = n ω ( k )+ o (1) . We overload notation and use ω to denote the cost of multiplying two n × n matrices.Thus, we have ω (1) = ω . The following is a basic property of T mat that we frequently use. Lemma 3.3 ([BCS97, Blä13]) . For any three positive integers n, m, r , we have T mat ( n, r, m ) = O ( T mat ( n, m, r )) = O ( T mat ( m, n, r )) . We refer to Table 3 in [GU18] for the latest upper bounds on ω ( k ) for different values of k . Inparticular, we need the following upper bounds in our paper. Lemma 3.4 ([GU18]) . We have:• ω = ω (1) ≤ . ,• ω (1 . ≤ . ,• ω (1 . ≤ . ,• ω (2) ≤ . . In this section, we derive some technical results on T mat and ω that we extensively use for ourruntime analysis. Some of these results can be derived using tensors, and we demonstrate this inAppendix A. We hope that the use of tensors can yield better runtimes for this problem in future. Lemma 3.5 (Sub-linearity) . For any p ≥ q ≥ , we have ω ( p ) ≤ p − q + ω ( q ) . Proof.
We assume that n p and n q are integers for notational simplicity. Consider multiplying an n × n p matrix with an n p × n matrix. One can cut the n × n p matrix into n p − q rectangular blocksof size n × n q and the n p × n matrix into n p − q rectangular blocks of size n q × n , and compute themultiplication of the corresponding blocks. This approach takes time n p − q + ω ( q )+ o (1) , from whichthe desired inequality immediately follows. 12ey to our analysis is the following lemma, which establishes the convexity of ω ( k ) . Lemma 3.6 (Convexity) . The fast rectangular matrix multiplication time exponent ω ( k ) as definedin Definition 3.2 is convex in k .Proof. Let k = α · p + (1 − α ) · q for α ∈ (0 , . For notational simplicity, we assume that n p , n q and n k are all integers. Consider a rectangular matrix of dimensions n × n k . Since αp ≤ k , we cantile this rectangular matrix with matrices of dimensions n α × n αp . Then, the product of this tiledmatrix with another similarly tiled matrix of dimensions n k × n can be obtained by viewing it asa multiplication of a matrix of dimensions n/n α × n k /n αp with one of dimensions n k /n αp × n /α ,where each “element” of these two matrices is itself a matrix of dimensions n α × n αp . With thisrecursion in tow, we obtain the following upper bound. T mat ( n, n k , n ) ≤T mat ( n α , n αp , n α ) · T mat ( n/n α , n k /n αp , n/n α )= T mat ( n α , n αp , n α ) · T mat ( n (1 − α ) , n (1 − α ) q , n (1 − α ) ) ≤ n α · ω ( p )+ o (1) · n (1 − α ) · ω ( q )+ o (1) . The final step above follows from denoting m = n α and observing that multiplying matrices ofdimensions n α × n α · p costs, by Definition 3.2, m ω ( p )+ o (1) , which is exactly n α ( ω ( p )+ o (1)) . ApplyingDefinition 3.2 and comparing exponents, this implies that ω ( k ) ≤ α · ω ( p ) + (1 − α ) · ω ( q ) , which proves the convexity of the function ω ( k ) . Claim 3.7. ω (1 . ≤ . .Proof. We can upper bound ω (1 . in the following sense ω (1 . ω (0 . · . − . · . ≤ . · ω (1 .
5) + (1 − . · ω (1 . ≤ . · . − . · . ≤ . , where the first step follows from convexity of ω (Lemma 3.6), the third step follows from ω (1 . ≤ . and ω (1 . ≤ . (Lemma 3.4). Lemma 3.8.
Let T mat be defined as in Definition 3.1. Then for any positive integers h , ℓ , and k ,we have T mat ( h, ℓk, h ) ≤ O ( T mat ( hk, ℓ, hk )) . Proof.
Given any matrices
A, B ⊤ ∈ R h,ℓk , by Definition 3.1, the cost of computing the matrixproduct AB is T mat ( h, ℓk, h ) . We now show how to compute this product in time O ( T mat ( hk, ℓ, hk )) .We cut A and B ⊤ into k sub-matrices each of size h × ℓ , i.e. A = ( A , · · · , A k ) and B ⊤ =( B ⊤ , · · · , B ⊤ k ) , where each A i , B ⊤ i ∈ R h × ℓ for all i ∈ [ k ] . By performing matrix multiplicationblockwise, we can write AB = k X i =1 A i B i . k matrices A , · · · , A k vertically to form a matrix A ′ ∈ R hk,ℓ . Similarly, westack the k matrices B , · · · , B k horizontally to form a matrix B ′ = ( B , · · · , B k ) ∈ R ℓ,hk . ByDefinition 3.1, we can compute A ′ B ′ ∈ R hk,hk in time T mat ( hk, ℓ, hk ) . To complete the proof, wenote that we can derive AB from A ′ B ′ as follows: for each j ∈ [ k ] , the j th diagonal block of A ′ B ′ of size h × h is exactly A j B j , and summing up the k diagonal h × h blocks of A ′ B ′ gives AB . T mat ( n, mn, n ) and T mat ( m, n , m ) Lemma 3.9.
Let T mat be defined as in Definition 3.1.If m ≥ n , then we have T mat ( n, mn, n ) ≤ O ( T mat ( m, n , m )) . If m ≤ n , then we have T mat ( m, n , m ) ≤ O ( T mat ( n, mn, n )) . Proof.
We only prove the case of m ≥ n , as the other case where m < n is similar. This is animmediate consequence of Lemma 3.8 by taking h = n , ℓ = n , and k = ⌊ m/n ⌋ , where k is apositive integer because m ≥ n .In the next lemma, we derive upper bounds on the term T mat ( m, n , m ) when m ≥ n and T mat ( n, mn, n ) when m < n , which is crucial to our runtime analysis. Lemma 3.10.
Let T mat be defined as in Definition 3.1 and ω be defined as in Definition 3.2.Property I. We have T mat ( n, mn, n ) ≤ O ( mn ω + o (1) ) . Property II. We have T mat ( m, n , m ) ≤ O (cid:0) √ n (cid:0) mn + m ω (cid:1)(cid:1) . Proof.
Property I.
Recall from Definition 3.1 that T mat ( n, mn, n ) is the cost of multiplying a matrix of size n × mn with one of size mn × n . We can cut each of the matrices into m sub-matrices of size n × n each.The product in question then can be obtained by multiplying these sub-matrices. Since there are m of them, and each product of an n × n submatrix with another n × n submatrix costs, by definition, n ω + o (1) , we get T mat ( n, mn, n ) ≤ O ( mn ω + o (1) ) , as claimed. Property II.
Let m = n a , where a ∈ (0 , ∞ ) . By definition, T mat ( m, n , m ) is the cost of multiplying a matrixof size m × n with one of size n × m . Expressing n as m /a then gives, by Definition 3.2, that T mat ( m, n , m ) = m ω (2 /a )+ o (1) = n a · ω (2 /a )+ o (1) Property II is then an immediate consequence of the following inequality, which we prove next: ω (2 /a ) < max(1 + 2 . /a, ω (1) + 0 . /a ) ∀ a ∈ (0 , ∞ ) . (14)Define b = 2 /a ∈ (0 , ∞ ) . Then the desired inequality in (14) can be expressed in terms of b as ω ( b ) < max(1 + 5 b/ , ω (1) + b/ ∀ b ∈ (0 , ∞ ) . (15)14otice that the RHS of (15) is a maximum of two linear functions of b and these intersect at b ∗ = ω (1) − . By the convexity of ω ( · ) as proved in Lemma 3.6, it suffices to verify (15) at theendpoints b → , b → ∞ and b = b ∗ . In the case where b = δ for any δ < , (15) follows immediatelyfrom the observation that ω ( δ ) < ω (1) . We next argue about the case b → ∞ . By Lemma 3.4 wehave ω (2) ≤ . . Using Lemma 3.5, we have ω ( b ) ≤ b − ω (2) . Combining these two factsimplies that for any b > , we have ω ( b ) ≤ b − ω (2) ≤ b/ , which again satisfies (15). The final case is b = b ∗ = ω (1) − , for which (15) is equivalent to ω ( ω (1) − < ω (1) / − / . (16)By Lemma 3.4, we have that ω (1) − ∈ [0 , . . Then to prove (16), it is sufficient to showthat ω ( t + 1) < t/ / ∀ t ∈ [0 , . . (17)By the convexity of ω ( · ) as proved in Lemma 3.6, the upper bound of ω (2) ≤ . in Lemma3.4, and recalling that ω (1) = t + 2 for t ∈ [0 , . , we have for k ∈ [1 , , ω ( k ) ≤ ω (1) + ( k − · (3 . − ( t + 2)) = t + 2 + ( k − · (1 . − t ) . In particular, using this inequality for k = t + 1 , we have ω ( t + 1) − t/ − / ≤ ( t + 2) + t · (1 . − t ) − t/ − / − t + 1 . t − / , which is negative on the entire interval [0 , . . This establishes (17) and finishes the proof. T mat ( m, n , m ) Lemma 3.11.
For any two positive integers n and m , we have T mat ( m, n , m ) = o (cid:0) m + mn . (cid:1) . Proof.
Let m = n a where a ∈ (0 , ∞ ) . Recall that T mat ( m, n , m ) = m ω (2 /a )+ o (1) = n aω (2 /a )+ o (1) .We consider the following two cases according to the range of a . Case 1: a ∈ [1 . , ∞ ) . In this case, we have ω (2 /a ) ≤ ω (2 / . ≤ ω (1 . < ,where the last inequality follows from Claim 3.7. This implies that T mat ( m, n , m ) = o ( n a ) = o ( m ) . (18) Case 2: a ∈ (0 , . . In this case, we have /a ∈ [1 . , ∞ ) . Consider the linear function y ( t ) = 1 + 2 . · t . (19)By Claim 3.7, we have ω (1 . < . ≤ y (1 . . (20)15y Lemma 3.4, we have ω (2) < .
37 = y (2) . (21)An application of Lemma 3.5 then gives, for any t ≥ , the inequality ω ( t ) ≤ t − ω (2) < t − y (2) ≤ y ( t ) , (22)where the last inequality is by definition of y ( t ) from (19). Therefore, combining the convexity of ω ( · ) , as proved in Lemma 3.6, with (20), (21), and (22), we conclude that for any t ∈ [1 . , ∞ ) ,the function ω is bounded from above by the affine function y , expressed as follows. ω ( t ) < y ( t ) = 1 + 2 . · t . This implies that T mat ( m, n , m ) = n a · ω (2 /a )+ o (1) = o ( n a +2 . ) = o ( mn . ) . (23)Combining the results from (18) and (23) finishes the proof of the lemma. In this section, we give the formal statement of our main result.
Theorem 4.1 (Main result, formal) . Consider a semidefinite program with variable size n × n and m constraints (assume there are no redundant constraints): max h C, X i subject to X (cid:23) , h A i , X i = b i for all i ∈ [ m ] . (24) Assume that any feasible solution X ∈ S n × n ≥ satisfies k X k op ≤ R . Then for any error parameter < δ ≤ . , there is an interior point method that outputs in time O ∗ ( √ n ( mn + m ω + n ω ) log( n/δ )) a positive semidefinite matrix X ∈ R n × n ≥ such that h C, X i ≥ h
C, X ∗ i − δ · k C k op · R and X i ∈ [ m ] (cid:12)(cid:12)(cid:12) h A i , b X i − b i (cid:12)(cid:12)(cid:12) ≤ nδ · ( R X i ∈ [ m ] k A i k + k b k ) , where ω is the exponent of matrix multiplication, X ∗ is any optimal solution to the semidefiniteprogram in (24) , and k A i k is the Schatten -norm of matrix A i . The proof of Theorem 4.1 is given in the subsequent sections.
Our main result of this section is the following.
Theorem 5.1 (Approximate central path) . Consider a semidefinite program as in Definition 1.1with no redundant constraints. Assume that any feasible solution X ∈ S n × n ≥ satisfies k X k op ≤ R .Then for any error parameter < δ ≤ . and Newton step size ǫ N satisfying √ δ < ǫ N ≤ . , lgorithm 1 outputs, in T = ǫ N √ n log( n/δ ) iterations, a positive semidefinite matrix X ∈ R n × n ≥ that satisfies h C, X i ≥ h
C, X ∗ i − δ · k C k op · R and X i ∈ [ m ] (cid:12)(cid:12)(cid:12) h A i , b X i − b i (cid:12)(cid:12)(cid:12) ≤ nδ · ( R X i ∈ [ m ] k A i k + k b k ) , (25) where X ∗ is any optimal solution to the semidefinite program in Definition 1.1, and k A i k is theSchatten -norm of matrix A i . Further, in each iteration of Algorithm 1, the following invariantholds for α H = 1 . : k S − / S new S − / − I k F ≤ α H · ǫ N . (26) Proof.
At the start of Algorithm 1, Lemma 9.1 is called to modify the semidefinite program toobtain an initial dual solution y for the modified SDP that is close to the dual central path at η = 1 / ( n + 2) . This ensures that the invariant g η ( y ) ⊤ H ( y ) − g η ( y ) ≤ ǫ N holds at the start of thealgorithm. Therefore, by Lemma 5.4 and Lemma 5.5, this invariant continues to hold throughout therun of the algorithm. Therefore, after T = ǫ N √ n log (cid:0) nδ (cid:1) iterations, the step size η in Algorithm 1grows to η = (1 + ǫ N √ n ) T / ( n + 2) ≥ n/δ . It then follows from Lemma 5.6 that b ⊤ y ≤ b ⊤ y ∗ + nη · (1 + 2 ǫ N ) ≤ b ⊤ y ∗ + δ . Thus when the algorithm stops, the dual solution y has duality gap at most δ for the modified SDP.Lemma 9.1 then shows how to obtain an approximate solution to the original SDP that satisfies theguarantees in (25).To prove (26), define ∆ S = S new − S ∈ R n × n and δ y = y new − y ∈ R m . For each i ∈ [ n ] , we use δ y,i to denote the i -th coordinate of vector δ y . We rewrite k S − / S new S − / − I k F as k S − / S new S − / − I k F = tr h ( S − / (∆ S ) S − / ) i = tr S − m X i =1 δ y,i A i ! S − m X j =1 δ y,j A j = m X i,j =1 δ y,i δ y,j tr[ S − A i S − A j ]= ( δ y ) ⊤ H ( y ) δ y = g η ( y ) ⊤ e H ( y ) − H ( y ) e H ( y ) − g η ( y ) , (27)where we used the fact that ∆ S = P mi =1 ( δ y ) i A i . It then follows from Lemma 5.4 and the invariant g η ( y ) ⊤ H ( y ) − g η ( y ) ≤ ǫ N that g η ( y ) ⊤ e H ( y ) − H ( y ) e H ( y ) − g η ( y ) ≤ α H · ǫ N , (28)where α H = 1 . . Combining Equation (27) with Inequality (28) completes the proof of thetheorem. 17 able 5.1: Summary of parameters in approxiate central path. Notation Choice Appearance Meaning α H α − H · H (cid:22) e H (cid:22) α H · Hǫ N ( g ⊤ η H − g η ) / ǫ S (1 − ǫ S ) · S (cid:22) e S (cid:22) (1 + ǫ S ) · S Algorithm 1 procedure Main ( n, m, δ, ǫ N , C, A, b ) ⊲ C ∈ S n × n , { A i } mi =1 ∈ S n × n , vector b ∈ R m , errorparameter < δ < . , Newton step size parameter < ǫ N < . Modify the SDP and obtain an initial dual solution y according to Lemma 9.1 η ← / ( n + 2) T ← ǫ N √ n log (cid:0) nδ (cid:1) e S ← S ← P i ∈ [ m ] y i A i − C . for iter = 1 → T do η new ← η (cid:16) ǫ N √ n (cid:17) for j = 1 , · · · , m do ⊲ Gradient computation g η new ( y ) j ← η new · b j − tr[ S − · A j ] end for for j = 1 , · · · , m do ⊲ Hessian computation for k = 1 , · · · , m do e H j,k ( y ) ← tr[ e S − · A j · e S − · A k ] end for end for δ y ← − e H ( y ) − g η new ( y ) ⊲ Update on y y new ← y + δ y ⊲ Approximate Newton step S new ← P i ∈ [ m ] ( y new ) i A i − C e S new ← ApproxSlackUpdate ( S new , e S ) ⊲ Approximate slack computation y ← y new , S ← S new , e S ← e S new ⊲ Update variables end for
Return an approximate solution to the original SDP according to Lemma 9.1 end procedure
Lemma 5.2.
Given positive definite matrices S new , e S ∈ S n × n> and any parameter < ǫ S < . ,there is an algorithm (procedure ApproxSlackUpdate in Algorithm 2) that takes O ( n ω + o (1) ) timeto output a positive definite matrix e S new ∈ S n × n> such that k S − / e S new S − / − I k op ≤ ǫ S . (29) Proof.
The runtime of O ( n ω + o (1) ) is by the spectral decomposition Z = U · Λ · U ⊤ , the costliest stepin the algorithm. To prove (29), we notice that λ new are the eigenvalues of S − / e S new S − / − I andby the algorithm description (lines 6 - 13), the upper bound ( λ new ) i ≤ ǫ S holds for each i ∈ [ n ] .18 lgorithm 2 Approximate Slack Update procedure ApproxSlackUpdate ( S new , e S ) ⊲ S new , e S ∈ S n × n ≥ are positive definite matrices ǫ S ← . ⊲ Spectral approximation constant Z mid ← S − / · e S · S − / − I Compute spectral decomposition Z mid = U · Λ · U ⊤ ⊲ Λ = diag( λ , · · · , λ n ) are the eigenvalues of Z mid , and U ∈ R n × n is orthogonal Let π : [ n ] → [ n ] be a sorting permutation such that | λ π ( i ) | ≥ | λ π ( i +1) | if | λ π (1) | ≤ ǫ S then e S new ← e S else r ← while | λ π (2 r ) | > ǫ S or | λ π (2 r ) | > (1 − / log n ) | λ π ( r ) | do r ← r + 1 end while ( λ new ) π ( i ) ← ( if i = 1 , , · · · , r ; λ π ( i ) otherwise. e S new ← e S + S / · U · diag( λ new − λ ) · U ⊤ · S / end if return e S new end procedure Lemma 5.3.
Given symmetric matrices A , · · · , A m ∈ S n × n , and positive definite matrices e S, S ∈ S n × n> , define matrices e H ∈ R m × m and H ∈ R m × m as e H j,k = tr[ e S − A j e S − A k ] and H j,k = tr[ S − A j S − A k ] . Then both e H and H are positive semidefinite. For any accuracy parameter α S ≥ , if α − S · S (cid:22) e S (cid:22) α S · S, then we have that α − S · H (cid:22) e H (cid:22) α S · H. Proof.
For any vector v ∈ R n , we define A ( v ) = P mi =1 v i A i . We can rewrite v ⊤ Hv as follows. v ⊤ Hv = m X i =1 m X j =1 v i v j H i,j = m X i =1 m X j =1 v i v j tr[ S − A i S − A j ] = tr[ S − / A ( v ) S − A ( v ) S − / ] . (30)Similarly, we have v ⊤ e Hv = tr[ e S − / A ( v ) e S − A ( v ) e S − / ] . (31)As the RHS of (30) and (31) are non-negative, both e H and H are positive semidefinite. Since e S (cid:22) α S · S , we have S − (cid:22) α S · e S − (see Section 2.2), which gives the following inequalities tr[ S − / A ( v ) S − A ( v ) S − / ] ≤ α S · tr[ S − / A ( v ) e S − A ( v ) S − / ] ≤ α S · tr[ e S − / A ( v ) e S − A ( v ) e S − / ] , (32)19here the first inequality follows from viewing tr[ S − / A ( v ) S − A ( v ) S − / ] as P ni =1 u ⊤ i S − u i for u i = A ( v ) S − / e i and the second inequality follows similarly, after using the cyclic permutationproperty of trace. Similarly, using α − S · S (cid:22) e S , we have tr[ S − / A ( v ) S − A ( v ) S − / ] ≥ α − S · tr[ e S − / A ( v ) e S − A ( v ) e S − / ] . (33)Combining (32) and (33) with (30) and (31) along with the fact that v can be any arbitrary n -dimensional vector finishes the proof of the lemma. Lemma 5.4.
In each iteration of Algorithm 1, for α H = 1 . , the approximate Hessian e H ( y ) satisfies that α − H H ( y ) (cid:22) e H ( y ) (cid:22) α H · H ( y ) . Proof.
By Lemma 5.2, given as input two positive definite matrices S new and e S , Algorithm 2 outputsa matrix e S new such that k S − / e S new S − / − I k op ≤ ǫ S , where ǫ S = 0 . as in Algorithm 2. By definition of operator norm, this implies that in each iterationof Algorithm 1, we have, for α S = 1 . , α − S · S (cid:22) e S (cid:22) α S · S. The statement of this lemma then follows from Lemma 5.3.
The following lemma is standard in the theory of interior point methods (e.g. see [Ren01]).
Lemma 5.5 (Invariance of Newton step [Ren01]) . Given any parameters ≤ α H ≤ . and < ǫ N ≤ / , suppose that g η ( y ) ⊤ H ( y ) − g η ( y ) ≤ ǫ N holds for some feasible dual solution y ∈ R m and parameter η > , and positive definite matrix e H ∈ S n × n> satisfies α − H H ( y ) (cid:22) e H (cid:22) α H H ( y ) Then η new = η (1 + ǫ N √ n ) and y new = y − e H − g η new ( y ) satisfy g η new ( y new ) ⊤ H ( y new ) − g η new ( y new ) ≤ ǫ N . The following lemma is also standard in interior point method.
Lemma 5.6 (Approximate optimality [Ren01]) . Suppose < ǫ N ≤ / , dual feasible solution y ∈ R m , and parameter η ≥ satisfy the following bound on Newton step size: g η ( y ) ⊤ H ( y ) − g η ( y ) ≤ ǫ N . Let y ∗ be an optimal solution to the dual formulation (2) . Then we have b ⊤ y ≤ b ⊤ y ∗ + nη · (1 + 2 ǫ N ) . Low-rank Update
Crucial to being able to efficiently approximate the Hessian in each iteration is the condition thatthe rank of the update be not too large. We formalize this idea in the following theorem, essentialto the runtime analysis in Section 7.
Theorem 6.1 (Rank inequality) . Let r = n and r i be the rank of the update to the approximateslack matrix e S when calling Algorithm 2 in iteration i of Algorithm 1. Then, over T iterations ofAlgorithm 1, the ranks r i satisfy the inequality T X i =0 √ r i ≤ O ( T log . n ) . The rest of this section is devoted to proving Theorem 6.1. To this end, we define the “error”matrix Z ∈ R n × n as follows Z = S − / e SS − / − I (34)and the potential function Φ : R n × n → R Φ( Z ) = n X i =1 | λ ( Z ) | [ i ] √ i , (35)where | λ ( Z ) | [ i ] denotes the i ’th entry in the list of absolute eigenvalues of Z sorted in descend-ing order. The following lemma bounds, from above, the change in the potential described byEquation (35), when S is updated to S new . Lemma 6.2 (Potential change when S changes) . Suppose matrices S , S new and e S satisfy theinequalities k S − / S new S − / − I k F ≤ . and k S − / e SS − / − I k op ≤ . . (36) Define matrices Z = S − / e SS − / − I and Z mid = ( S new ) − / e S ( S new ) − / − I . Then we have Φ( Z mid ) − Φ( Z ) ≤ p log n. Proof.
Our goal is to prove n X i =1 ( λ ( Z ) [ i ] − λ ( Z mid ) [ i ] ) ≤ − . (37)We first show that the lemma statement is implied by (37). We rearrange the order of the eigenvaluesof Z mid and Z so that λ ( Z mid ) i and λ ( Z ) i are the i th largest eigenvalues of Z mid and Z , respectively.For each i ∈ [ n ] , denote ∆ i = λ ( Z mid ) i − λ ( Z ) i . Then (37) is equivalent to k ∆ k ≤ − . Let τ be thedescending order of the magnitudes of eigenvalues of Z mid , i.e. | λ ( Z mid ) τ (1) | ≥ · · · ≥ | λ ( Z mid ) τ ( n ) | .The potential change Φ( Z mid ) − Φ( Z ) can be upper bounded as Φ( Z mid ) = n X i =1 √ i | λ ( Z mid ) τ ( i ) |≤ n X i =1 (cid:18) √ i | λ ( Z ) τ ( i ) | + 1 √ i | ∆ τ ( i ) | (cid:19) ≤ Φ( Z ) + n X i =1 i ! / n X i =1 | ∆ i | ! / ≤ Φ( Z ) + p log n, X i √ i | λ ( Z ) τ ( i ) | ≤ X i √ i | λ ( Z ) | [ i ] and Cauchy-Schwarz inequality. This proves the lemma.The remaining part of this proof is therefore devoted to proving (37). Define W = S − / S / .Then, we can express Z mid in terms of Z and W in the following way. Z mid = ( S new ) − / e S ( S new ) − / − I = ( S new ) − / S / S − / e SS − / S / ( S new ) − / − I = W ZW ⊤ + W W ⊤ − I. (38)Let λ ( M ) [ i ] denote the i ’th (ordered) eigenvalue of a matrix M . We then have n X i =1 ( λ ( Z mid ) [ i ] − λ ( W ZW ⊤ ) [ i ] ) ≤ k Z mid − W ZW ⊤ k F = k W ⊤ W − I k F , (39)where the first inequality is by Fact 2.3 (which is applicable here because Z mid and W ZW ⊤ areboth normal matrices) and the second step is by (38). Denote the eigenvalues of S − / S new S − / by { ν i } ni =1 . Then the first assumption in (36) implies that P i ∈ [ n ] ( ν i − ≤ × − . It followsthat k W ⊤ W − I k F = k S / S − S / − I k F = X i ∈ [ n ] (1 /ν i − ≤ × − , (40)where the last inequality is because the first assumption from (36) implies ν i ≥ . for all i ∈ [ n ] .Plugging (40) into the right hand side of (39), we have n X i =1 ( λ ( Z mid ) [ i ] − λ ( W ZW ⊤ ) [ i ] ) ≤ × − . (41)Let W = U Σ V ⊤ be the singular value decomposition of W , with U and V being n × n unitarymatrices. Because of the invariance of the Frobenius norm under unitary transformation, (40) isthen equivalent to k Σ − I k F = n X i =1 ( σ i − ≤ × − . (42)Since U and V are unitary, the matrix W ZW ⊤ = U Σ V ⊤ ZV Σ U ⊤ is similar to Σ V ⊤ ZV Σ , and thematrix Z ′ = V ⊤ ZV is similar to Z . Therefore, n X i =1 ( λ ( W ZW ⊤ ) [ i ] − λ ( Z ) [ i ] ) = n X i =1 ( λ (Σ Z ′ Σ) [ i ] − λ ( Z ′ ) [ i ] ) ≤ k Σ Z ′ Σ − Z ′ k F , (43)22here the last inequality is by Fact 2.3. We rewrite the Frobenius norm as k Σ Z ′ Σ − Z ′ k F = k (Σ − I ) Z ′ (Σ − I ) + (Σ − I ) Z ′ + Z ′ (Σ − I ) k F ≤ k (Σ − I ) Z ′ (Σ − I ) k F + 2 k (Σ − I ) Z ′ k F . (44)The first term can be bounded as: k (Σ − I ) Z ′ (Σ − I ) k F = tr[(Σ − I ) Z ′ (Σ − I ) Z ′ (Σ − I )] ≤ tr[(Σ − I ) · ( Z ′ ) ] ≤ . · tr[(Σ − I ) ]= n X i =1 ( σ i − ≤ × − , (45)The first inequality above uses Fact 2.1, the second used the observation that k Z ′ k op = k Z k op ≤ . ,and the last inequality follows from (42) and the fact that P ni =1 ( σ i − ≤ P ni =1 ( σ i − . Similarly,we can bound the second term as k (Σ − I ) Z ′ k F = tr[(Σ − I )( Z ′ ) (Σ − I )] ≤ tr[(Σ − I ) ( Z ′ ) ] ≤ . · tr[(Σ − I ) ] ≤ − . (46)It follows from (43), (44) and (46) that n X i =1 ( λ ( W ZW ⊤ ) [ i ] − λ ( Z ) [ i ] ) ≤ − . (47)Combining (41) and (47), we get that P ni =1 ( λ ( Z ) [ i ] − λ ( Z mid ) [ i ] ) ≤ − which establishes (37).This completes the proof of the lemma. Lemma 6.3 (Potential change when e S changes) . 
Given positive definite matrices S new , e S ∈ S n> ,let e S new and r be generated during the run of Algorithm 2 when the inputs are S new and e S . Definethe matrices Z mid = ( S new ) − / e S ( S new ) − / − I and Z new = ( S new ) − / e S new ( S new ) − / − I . Thenwe have Φ( Z mid ) − Φ( Z new ) ≥ − log n √ r. Proof.
The setup of the lemma considers the eigenvalues of Z when S̃ changes. For notational convenience, we define y = |λ(Z_mid)|, the vector of absolute values of the eigenvalues of Z_mid = (S_new)^{-1/2} S̃ (S_new)^{-1/2} − I. Recall from Table 5.1 that ε_S = 0.01. We consider two cases below.

Case 1. There does not exist an i ≤ n/2 that satisfies the two conditions y_[2i] < ε_S and y_[2i] < (1 − 1/(10 log n)) · y_[i]. In this case, we have r = n/2. We consider two sub-cases.

• Case (a). For all i ∈ [n], we have y_[i] ≥ ε_S. In this case, we change all n coordinates of y, and the change in coordinate i contributes a potential decrease of at least ε_S/√n. Therefore, we have Φ(Z_mid) − Φ(Z_new) ≥ ε_S · √n ≥ (10^{-3}/(2 log n)) · √r.

• Case (b). There exists a minimum index i ≤ n/2 such that y_[2j] < ε_S holds for all j in the range i ≤ j ≤ n/2. In this case, for all j in the above range, we have that y_[2j] ≥ (1 − 1/(10 log n)) · y_[j]. In particular, picking j = i, 2i, 4i, ... gives

y_[n] ≥ y_[i] · (1 − 1/(10 log n))^{⌈log n⌉} ≥ ε_S / 2,

where the last inequality uses (1 − 1/(10 log n))^{⌈log n⌉} ≥ 1 − ⌈log n⌉/(10 log n) ≥ 1/2. Recalling that our notation y_[i] denotes the i-th absolute eigenvalue in decreasing order, we use the above inequality and repeat the argument from the previous sub-case to conclude that Φ(Z_mid) − Φ(Z_new) ≥ (ε_S/2) · √n ≥ (10^{-3}/(2 log n)) · √r.

Case 2. There exists an index i for which both the conditions y_[2i] < ε_S and y_[2i] < (1 − 1/(10 log n)) · y_[i] are satisfied. By definition, r ≤ n/2 is the smallest such index. Consider the index j such that for all j' < j, we have y_[j'] ≥ ε_S, and for all j' ≥ j, we have y_[j'] < ε_S. By the same argument as in Case 1(b), we can prove y_[r] ≥ ε_S/2. Moreover, y_[2r] < (1 − 1/(10 log n)) · y_[r] by the definition of r. Denote by y_new the vector of magnitudes of the eigenvalues of Z_new. Since y_new[i] is set to 0 for each i ∈ [2r], we have y_new[i] = y_[i+2r] ≤ y_[i]. Further, y_[2r] < (1 − 1/(10 log n)) · y_[r] implies that for each i ∈ [r], we have

y_[i] − y_new[i] ≥ (1/(10 log n)) · y_[r] ≥ (1/(20 log n)) · ε_S = 10^{-3} / (2 log n),

where ε_S = 0.01 by Table 5.1. Therefore, we can bound, from below, the decrease in the potential function as

Φ(Z_mid) − Φ(Z_new) ≥ Σ_{i=1}^r (y_[i] − y_new[i]) / √i ≥ (10^{-3}/(2 log n)) · √r.

This finishes the proof of the lemma.
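To make the case analysis above concrete, here is a small numpy sketch (our own illustration, not the paper's code; the helper names select_rank and truncate, and the dense handling of the eigenvalue vector, are assumptions added for exposition) of the rank-selection rule analyzed in Lemma 6.3, with ε_S = 0.01 as in Table 5.1.

import numpy as np

def select_rank(y, eps_S=0.01):
    """Return the number r of eigenvalue pairs to zero out.

    y is assumed sorted in decreasing order of magnitude (the vector |lambda(Z_mid)|).
    r is the smallest index i <= n/2 with y[2i] < eps_S and
    y[2i] < (1 - 1/(10 log n)) * y[i]; if no such index exists, r = n/2 (Case 1).
    """
    n = len(y)
    shrink = 1.0 - 1.0 / (10.0 * np.log(n))
    for i in range(1, n // 2 + 1):
        # y_[2i] and y_[i] in 1-based notation are y[2i-1] and y[i-1] here.
        if y[2 * i - 1] < eps_S and y[2 * i - 1] < shrink * y[i - 1]:
            return i          # Case 2: a clear drop in the spectrum exists
    return n // 2             # Case 1: zero out everything

def truncate(y, r):
    """Zero the 2r largest magnitudes; Lemma 6.3 lower-bounds the potential drop."""
    y_new = np.zeros_like(y)
    y_new[: len(y) - 2 * r] = y[2 * r:]
    return y_new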
Proof of Theorem 6.1.
Recall the definition of the potential function in (35) for an error matrix Z ∈ S^{n×n}:

Φ(Z) = Σ_{i=1}^n |λ(Z)|_[i] / √i.

Let S^(i) and S̃^(i) be the true and approximate slack matrices in the i-th iteration of Algorithm 1. Define Z^(i) = (S^(i))^{-1/2} S̃^(i) (S^(i))^{-1/2} − I and Z^(i)_mid = (S^(i+1))^{-1/2} S̃^(i) (S^(i+1))^{-1/2} − I. By Lemma 6.2, we have that

Φ(Z^(i)_mid) − Φ(Z^(i)) ≤ (1/10) · √(log n).

From Lemma 6.3, we have the following potential decrease:

Φ(Z^(i)_mid) − Φ(Z^(i+1)) ≥ (10^{-3}/(2 log n)) · √(r_i).

These together imply that

Φ(Z^(i+1)) − Φ(Z^(i)) ≤ (1/10) · √(log n) − (10^{-3}/(2 log n)) · √(r_i). (48)

We note that Φ(Z^(0)) = 0, as we initialized S̃ = S at the beginning of the algorithm, and that the potential function Φ(Z) is always non-negative. The theorem then follows by summing up (48) over all T iterations.
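For concreteness, the following is a minimal numpy sketch (our own illustration, not code from the paper; the function names potential and error_matrix, and the dense eigendecompositions, are assumptions) of how the potential Φ(Z) in (35) and the error matrix Z it is applied to could be evaluated.

import numpy as np

def potential(Z):
    """Phi(Z) = sum_i |lambda(Z)|_[i] / sqrt(i), magnitudes sorted decreasingly."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(Z)))[::-1]
    return float(np.sum(mags / np.sqrt(np.arange(1, len(mags) + 1))))

def error_matrix(S, S_tilde):
    """Z = S^{-1/2} S_tilde S^{-1/2} - I, the relative error of the approximate slack."""
    w, V = np.linalg.eigh(S)                       # S is positive definite
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    return S_inv_half @ S_tilde @ S_inv_half - np.eye(S.shape[0])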
Runtime Analysis

Our main result of this section is the following bound on the runtime of Algorithm 1.
Theorem 7.1 (Runtime bound). The total runtime of Algorithm 1 for solving an SDP with variable size n × n and m constraints is at most

O*( √n ( mn^2 + max(m, n)^ω ) ),

where ω is the matrix multiplication exponent as defined in Definition 3.2.

To prove Theorem 7.1, we first upper bound the runtime in terms of fast rectangular matrix multiplication times. The iteration complexity of Algorithm 1 is T = Õ(√n).

Lemma 7.2 (Total cost). The total runtime of Algorithm 1 over T iterations is upper bounded as

T_total ≤ O*( min( n · nnz(A), mn^{2.5} ) + √n · max(m, n)^ω + Σ_{i=0}^T ( T_mat(n, m r_i, n) + T_mat(m, n r_i, m) ) ), (49)

where nnz(A) is the total number of non-zero entries in all the constraint matrices, r_i, as defined in Theorem 6.1, is the rank of the update to the approximate slack matrix S̃ in iteration i, and ω and T_mat are defined in Definitions 3.2 and 3.1, respectively.

Remark 7.3.
A more careful analysis can improve the first term on the RHS of (49) to √n · nnz(A)^{1−γ} · (mn^2)^γ for some γ ∈ (0, 1) that depends only on ω. For the purpose of this paper, however, we will only need the simpler bound given in Lemma 7.2.

Proof. The total runtime of Algorithm 1 consists of two parts:

• Part 1. The time to compute the approximate Hessian H̃(y) (which we abbreviate as H̃) in Lines 11–15.

• Part 2. The total cost of all operations other than computing the approximate Hessian.
Part 1.
We analyze the cost of computing the approximate Hessian H̃.

Part 1a. Initialization.
We start by computing H̃ in the first iteration of the algorithm. Each entry of H̃ involves the computation

H̃_{j,k} = tr[ (S̃^{-1/2} A_j S̃^{-1/2})(S̃^{-1/2} A_k S̃^{-1/2}) ].

It first costs O*(n^ω) to invert S̃. Then the cost of computing the key module of the approximate Hessian, S̃^{-1/2} A_j S̃^{-1/2} for all j ∈ [m], is obtained by stacking the matrices A_j together:

T( S̃^{-1/2} A_j S̃^{-1/2} for all j ∈ [m] ) ≤ O( T_mat(n, mn, n) ). (50)

Vectorizing the matrices S̃^{-1/2} A_j S̃^{-1/2} into row vectors of length n^2, for each j ∈ [m], and stacking these rows vertically to form a matrix B of dimensions m × n^2, one observes that H̃ = B B^⊤. We therefore have

T( computing H̃ from B ) ≤ O( T_mat(m, n^2, m) ). (51)

Combining (50), (51), and the initial cost of inverting S̃ gives the following cost for computing H̃ in the first iteration:

T_part 1a ≤ O*( T_mat(m, n^2, m) + T_mat(n, mn, n) + n^ω ). (52)
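The following dense numpy sketch (our own illustration, with assumed variable names; it ignores the fast-matrix-multiplication accounting above) shows the stacking step behind (50)–(51): form B by vectorizing the matrices S̃^{-1/2} A_j S̃^{-1/2}, then read off the approximate Hessian as B B^⊤.

import numpy as np

def initial_hessian(S_tilde, A):
    """A has shape (m, n, n), holding the symmetric constraint matrices A_1..A_m."""
    w, V = np.linalg.eigh(S_tilde)               # S_tilde is positive definite
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    m, n, _ = A.shape
    # Stack vec(S^{-1/2} A_j S^{-1/2}) as the rows of an m x n^2 matrix B.
    B = np.stack([(S_inv_half @ A[j] @ S_inv_half).reshape(-1) for j in range(m)])
    return B @ B.T                                # H[j,k] = tr[(..A_j..)(..A_k..)]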
Part 1b. Accumulating low-rank changes over all the iterations.

Once the approximate Hessian in the first iteration has been computed, every subsequent iteration computes the approximate Hessian using a rank-r_i update to the approximate slack matrix S̃ (see Line 15 of Algorithm 2). If the update from S̃ to S̃_new has rank r_i, Fact 2.4 implies that we can compute, in time O(n^{ω+o(1)}), the n × r_i matrices V_+ and V_− satisfying

(S̃_new)^{-1} = S̃^{-1} + V_+ V_+^⊤ − V_− V_−^⊤.

The cost of updating H̃ is then dominated by the computation of tr[ S̃^{-1/2} A_j V V^⊤ A_k S̃^{-1/2} ], where V ∈ R^{n × r_i} is either V_+ or V_−. We note that

T( A_j V for all j ∈ [m] ) ≤ O*( min( r_i · nnz(A), m n^2 r_i^{ω−2+o(1)} ) ), (53)

where nnz(A) is the total number of non-zero entries in all the constraint matrices, and the second term in the minimum is obtained by stacking the matrices A_j together and splitting the stacked matrix and V into blocks of dimensions r_i × r_i. Further, pre-multiplying the matrices A_j V for all j ∈ [m] by S̃^{-1/2} essentially involves computing the product of an n × n matrix with an n × m r_i matrix, which, by Definition 3.1, costs T_mat(n, m r_i, n). This, together with (53), gives

T( S̃^{-1/2} A_j V for all j ∈ [m] ) ≤ O*( T_mat(n, m r_i, n) + min( r_i · nnz(A), m n^2 r_i^{ω−2+o(1)} ) ). (54)

The final step is to vectorize all the matrices S̃^{-1/2} A_j V, for each j ∈ [m], and stack these vertically to get an m × n r_i matrix B, from which the update to the Hessian is computed as B B^⊤. This costs, by definition, T_mat(m, n r_i, m). Combining this with (54) gives the following runtime for one update to the approximate Hessian:

T( rank-r_i Hessian update ) ≤ O*( T_mat(n, m r_i, n) + min( r_i · nnz(A), m n^2 r_i^{ω−2} ) + T_mat(m, n r_i, m) + n^ω ). (55)

Using this bound over all T = Õ(√n) iterations, and applying Σ_{i=0}^T √(r_i) ≤ Õ(√n) from Theorem 6.1, gives

T_part 1b ≤ O*( min( n · nnz(A), mn^{2.5} ) + √n · n^ω + Σ_{i=1}^T ( T_mat(n, m r_i, n) + T_mat(m, n r_i, m) ) ). (56)
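As an illustration of the building block behind (53)–(55), the following numpy sketch (ours; the name lowrank_update_blocks and the dense matrix handling are assumptions) forms the m × m matrix of traces tr[ S̃^{-1/2} A_j V V^⊤ A_k S̃^{-1/2} ] by stacking the vectorized matrices S̃^{-1/2} A_j V; the actual algorithm assembles the new Hessian from a few such blocks.

import numpy as np

def lowrank_update_blocks(S_inv_half, A, V):
    """Return G with G[j,k] = tr[S^{-1/2} A_j V V^T A_k S^{-1/2}].

    S_inv_half is S~^{-1/2}, A has shape (m, n, n), V has shape (n, r).
    """
    m = A.shape[0]
    # Each S^{-1/2} A_j V is an n x r matrix; vectorize and stack into m x (n r).
    B = np.stack([(S_inv_half @ A[j] @ V).reshape(-1) for j in range(m)])
    return B @ B.T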
Combining Part 1a and 1b. Combining (52) and (56), we have

T_part 1 ≤ T_part 1a + T_part 1b ≤ O*( min( n · nnz(A), mn^{2.5} ) + √n · n^ω + Σ_{i=0}^T ( T_mat(n, m r_i, n) + T_mat(m, n r_i, m) ) ), (57)

where we incorporated the bound from (52) into the i = 0 case.
Part 2. Observe that there are four operations performed in Algorithm 1 other than computing H̃:

• Part 2a. computing the gradient g_η(y);
• Part 2b. inverting the approximate Hessian H̃;
• Part 2c. updating the dual variables y_new and S(y_new);
• Part 2d. computing the new approximate slack matrix S̃(y_new).

Part 2a.
The i-th coordinate of the gradient is expressed as g_η(y)_i = η · b_i − tr[S^{-1} A_i]. The cost per iteration of computing this quantity equals O(nnz(A) + n^{ω+o(1)}), where the second term comes from inverting the matrix S.

Part 2b.
The cost of inverting the approximate Hessian H̃ is O(m^{ω+o(1)}) per iteration.

Part 2c.
The cost of updating the dual variable y_new = y − H̃^{-1} g_{η_new}(y), given H̃^{-1} and g_{η_new}(y), is O(m^2) per iteration. The cost of computing the new slack matrix S_new = Σ_{i∈[m]} (y_new)_i A_i − C is O(nnz(A)) per iteration.

Part 2d.
The per-iteration cost of updating the approximate slack matrix S̃_new is O(n^{ω+o(1)}) by Lemma 5.2.

Combining Part 2a, 2b, 2c and 2d.
The total cost of the operations other than computing the Hessian over the T = Õ(√n) iterations is therefore bounded by

T_part 2 ≤ T_part 2a + T_part 2b + T_part 2c + T_part 2d ≤ O*( √n ( nnz(A) + max(m, n)^ω ) ). (58)
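The following numpy sketch (our own schematic, not the paper's code; the function name dual_step and its dense operations are assumptions, and Part 2d, the maintenance of S̃ by Algorithm 2, is omitted) strings together the Part 2a–2c operations for one iteration.

import numpy as np

def dual_step(y, S, H_tilde, A, b, C, eta):
    """One iteration's worth of Parts 2a-2c: gradient, Newton-type step, new slack."""
    S_inv = np.linalg.inv(S)                                     # Part 2a: invert S
    g = eta * b - np.array([np.trace(S_inv @ A[i]) for i in range(len(b))])
    y_new = y - np.linalg.solve(H_tilde, g)                      # Parts 2b-2c: H~^{-1} g
    S_new = np.einsum('i,ijk->jk', y_new, A) - C                 # Part 2c: S(y_new)
    return y_new, S_new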
Combining Part 1 and Part 2. Combining (57) and (58) and using r_0 = n finishes the proof of the lemma:

T_total ≤ T_part 1 + T_part 2 ≤ O*( min( n · nnz(A), mn^{2.5} ) + √n · max(m, n)^ω + Σ_{i=0}^T ( T_mat(n, m r_i, n) + T_mat(m, n r_i, m) ) ).

Lemma 7.4.
Let T_mat be as defined in Definition 3.1. Let T = Õ(√n) and {r_1, ..., r_T} be a sequence that satisfies

Σ_{i=1}^T √(r_i) ≤ O( T log^{1.5} n ).

Property I. We have Σ_{i=1}^T T_mat(m, n r_i, m) ≤ O*( √n · max(m^ω, n^ω) + T_mat(m, n^2, m) ).

Property II. We have Σ_{i=1}^T T_mat(n, m r_i, n) ≤ O*( √n · max(m^ω, n^ω) + T_mat(n, mn, n) ).

Proof. We give only the proof of Property I, as the proof of Property II is similar. Let m = n^a. For each i ∈ [T], let r_i = n^{b_i}, where b_i ∈ [0, 1]. Then

T_mat(m, n r_i, m) = T_mat(n^a, n^{1+b_i}, n^a) = n^{a·ω((1+b_i)/a) + o(1)}. (59)

For each number k ∈ {0, 1, ..., log n}, define the set of iterations I_k = { i ∈ [T] : 2^k ≤ r_i ≤ 2^{k+1} }. Then our assumption on the sequence {r_1, ..., r_T} can be expressed as Σ_{k=0}^{log n} |I_k| · 2^{k/2} ≤ O(T log^{1.5} n). This implies that for each k ∈ {0, 1, ..., log n}, we have |I_k| ≤ O( T log^{1.5} n / 2^{k/2} ). Next, taking the summation of Eq. (59) over all i ∈ [T], we have

Σ_{i=1}^T T_mat(m, n r_i, m) = Σ_{i=1}^T n^{a·ω((1+b_i)/a)} = Σ_{k=0}^{log n} Σ_{i∈I_k} n^{a·ω((1+b_i)/a)}
  ≤ O(log n) · max_k max_{i∈I_k} ( T log^{1.5} n / 2^{k/2} ) · n^{a·ω((1+b_i)/a)}
  ≤ Õ(1) · max_k max_{2^k ≤ n^{b_i} ≤ 2^{k+1}} ( √n / 2^{k/2} ) · n^{a·ω((1+b_i)/a)}
  ≤ Õ(1) · max_{b_i ∈ [0,1]} n^{1/2 − b_i/2 + a·ω((1+b_i)/a)},

where the fourth step follows from T = Õ(√n). To bound the exponent on n above, we define the function g,

g(b_i) = 1/2 − b_i/2 + a·ω((1+b_i)/a). (60)

This function is convex in b_i due to the convexity of the function ω (Lemma 3.6). Therefore, over the interval b_i ∈ [0, 1], the maximum of g is attained at one of the endpoints. We simply evaluate this function at the endpoints.

Case 1.
Consider the case b_i = 0. In this case, we have g(0) = 1/2 + a·ω(1/a). We consider the following two subcases.

Case 1a. If a ≥ 1, then we have g(0) = 1/2 + a·ω(1/a) ≤ 1/2 + a·ω(1) = 1/2 + aω.

Case 1b. If a ∈ (0, 1), then we define k = 1/a > 1. It follows from Lemma 3.5 and ω ≥ 2 that

g(0) = 1/2 + a·ω(1/a) = 1/2 + ω(k)/k ≤ 1/2 + (k − 1 + ω)/k ≤ 1/2 + ω.

Combining both Case 1a and Case 1b, we have that

n^{g(0)} ≤ max( n^{1/2 + aω}, n^{1/2 + ω} ) ≤ √n · max(m^ω, n^ω).

Case 2.
Consider the other case, b_i = 1. In this case, g(1) = 1/2 − 1/2 + a·ω(2/a) = a·ω(2/a).

We now finish the proof by combining Case 1 and Case 2 as follows:

max_{b_i ∈ [0,1]} n^{1/2 − b_i/2 + a·ω((1+b_i)/a)} ≤ √n · max(m^ω, n^ω) + n^{a·ω(2/a)}.

Proof of Theorem 7.1. In light of Lemma 7.4, the upper bound on the runtime given in Lemma 7.2 can be written as

T_total ≤ O*( min{ n · nnz(A), mn^{2.5} } + √n · max(m, n)^ω + T_mat(n, mn, n) + T_mat(m, n^2, m) ). (61)

Combining this with Lemma 3.10, we have the following upper bound on the total runtime of Algorithm 1:

T_total ≤ O*( min{ n · nnz(A), mn^{2.5} } + √n · max(m, n)^ω + √n ( mn^2 + m^ω ) )
        ≤ O*( √n ( mn^2 + max(m, n)^ω ) ).

This finishes the proof of the theorem.
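For concreteness, the following worked instance (added here for illustration; it is not stated in the paper) evaluates the bound of Theorem 7.1 in the square-constraint regime m = n^2, using ω ≤ 2.373:

$\sqrt{n}\,\big(mn^2 + \max(m,n)^{\omega}\big) \;=\; \sqrt{n}\,\big(n^{4} + n^{2\omega}\big) \;=\; n^{4.5} + n^{0.5 + 2\omega} \;\le\; O\!\big(n^{5.25}\big).$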
In this section, we prove Theorem 1.3, restated below.
Theorem 1.3 (Comparison with Cutting Plane Method) . When m ≥ n , there is an interior pointmethod that solves an SDP with n × n matrices, m constraints, and nnz( A ) input size, faster thanthe current best cutting plane method [LSW15, JLSW20], over all regimes of nnz( A ) . Remark 8.1.
In the dense case with nnz(A) = Θ(mn^2), Algorithm 1 is faster than the cutting plane method whenever m ≥ √n.

Proof of Theorem 1.3. Recall that the current best runtime of the cutting plane method for solving an SDP (1) is T_CP = O*( m · nnz(A) + m n^{2.372927} + m^3 ) [LSW15, JLSW20], where 2.372927 is the current best upper bound on the exponent of matrix multiplication ω. By Lemmas 7.2 and 7.4, we have the following upper bound on the total runtime of Algorithm 1:

T_total ≤ O*( min{ n · nnz(A), mn^{2.5} } + √n · max(m, n)^ω + T_mat(n, mn, n) + T_mat(m, n^2, m) ).

Since m ≥ n by assumption, Lemma 3.9 further simplifies the runtime to

T_total ≤ O*( min{ n · nnz(A), mn^{2.5} } + √n · m^ω + T_mat(m, n^2, m) ). (62)

Note that min{ n · nnz(A), mn^{2.5} } ≤ m · nnz(A) ≤ O(T_CP) and that √n · m^ω = o(m^3) ≤ o(T_CP) since m ≥ n. Furthermore, Lemma 3.11 states that T_mat(m, n^2, m) = o( m^3 + m n^{2.372927} ) ≤ o(T_CP). Since each term on the RHS of (62) is upper bounded by T_CP, we reach the stated conclusion.

Lemma 9.1 (Initialization). Consider a semidefinite program as in Definition 1.1 of dimension n × n with m constraints, and assume that it has the following properties.

1. Bounded diameter: for any X ⪰ 0 with ⟨A_i, X⟩ = b_i for all i ∈ [m], we have ‖X‖_op ≤ R.
2. Lipschitz objective: ‖C‖_op ≤ L.

For any 0 < δ ≤ 1, the following modified semidefinite program in the (n+2) × (n+2) variable X̄,

max_{X̄ ⪰ 0} ⟨C̄, X̄⟩  s.t.  ⟨Ā_i, X̄⟩ = b̄_i, ∀ i ∈ [m+1],

where

Ā_i = [ A_i, 0_n, 0_n ; 0_n^⊤, 0, 0 ; 0_n^⊤, 0, b_i/R − tr[A_i] ] for all i ∈ [m],
Ā_{m+1} = [ I_n, 0_n, 0_n ; 0_n^⊤, 1, 0 ; 0_n^⊤, 0, 0 ],
b̄ = [ b/R ; n+1 ],
C̄ = [ (δ/L) · C, 0_n, 0_n ; 0_n^⊤, 0, 0 ; 0_n^⊤, 0, −1 ],

satisfies the following statements.

1. The following are feasible primal and dual solutions:

X̄ = I_{n+2},  ȳ = [ 0_m ; 1 ],  S̄ = [ I_n − (δ/L) · C, 0_n, 0_n ; 0_n^⊤, 1, 0 ; 0_n^⊤, 0, 1 ].
2. For any feasible primal and dual solutions (X̄, ȳ, S̄) of the modified SDP with duality gap at most δ^2, the matrix X̂ = R · X̄_{[n]×[n]}, where X̄_{[n]×[n]} is the top-left n × n block submatrix of X̄, is an approximate solution to the original semidefinite program in the following sense:

⟨C, X̂⟩ ≥ ⟨C, X*⟩ − LR · δ,  X̂ ⪰ 0,  Σ_{i∈[m]} |⟨A_i, X̂⟩ − b_i| ≤ 4nδ · ( R · Σ_{i∈[m]} ‖A_i‖_1 + ‖b‖_1 ),

where X* is any optimal solution to the original SDP and ‖A‖_1 denotes the Schatten 1-norm of a matrix A.

Proof. For the first result, straightforward calculations show that ⟨Ā_i, X̄⟩ = b̄_i for all i ∈ [m+1], and that Σ_{i∈[m+1]} ȳ_i Ā_i − S̄ = C̄.

Now we prove the second result. Denote by OPT and OPT' the optimal values of the original and the modified SDP, respectively. Our first goal is to establish a lower bound for OPT' in terms of OPT. For any optimal solution X* ∈ S^{n×n} of the original SDP, consider the following matrix X̄* ∈ R^{(n+2)×(n+2)}:

X̄* = [ X*/R, 0_n, 0_n ; 0_n^⊤, n+1 − tr[X*]/R, 0 ; 0_n^⊤, 0, 0 ].

Notice that X̄* is a feasible primal solution to the modified SDP, and that

OPT' ≥ ⟨C̄, X̄*⟩ = (δ/(LR)) · ⟨C, X*⟩ = (δ/(LR)) · OPT,

since X* is an optimal solution to the original SDP.

Given a feasible primal solution X̄ ∈ R^{(n+2)×(n+2)} of the modified SDP with duality gap at most δ^2, we may assume without loss of generality that

X̄ = [ X̄_{[n]×[n]}, 0_n, 0_n ; 0_n^⊤, τ, 0 ; 0_n^⊤, 0, θ ],

where τ, θ ≥ 0. This is because if the entries of X̄ other than the diagonal and the top-left n × n block are not 0, then we can zero these entries out and the matrix remains feasible and positive semidefinite. We thus immediately have X̂ ⪰ 0. Notice that

(δ/L) · ⟨C, X̄_{[n]×[n]}⟩ − θ = ⟨C̄, X̄⟩ ≥ OPT' − δ^2 ≥ (δ/(LR)) · OPT − δ^2. (63)

Therefore, we can lower bound the objective value of X̄_{[n]×[n]} in the original SDP as

⟨C, X̂⟩ = R · ⟨C, X̄_{[n]×[n]}⟩ ≥ OPT − LR · δ,

where the last inequality follows from (63). By the matrix Hölder inequality, we have

(δ/L) · ⟨C, X̄_{[n]×[n]}⟩ ≤ (δ/L) · ‖C‖_op · tr[ X̄_{[n]×[n]} ] ≤ (δ/L) · ‖C‖_op · ⟨Ā_{m+1}, X̄⟩ ≤ (n+1)δ,

where the last step follows from ‖C‖_op ≤ L and b̄_{m+1} = n+1. We can thus upper bound θ as

θ ≤ (δ/L) · ⟨C, X̄_{[n]×[n]}⟩ + δ^2 − (δ/(LR)) · OPT ≤ (2n+1)δ + δ^2 ≤ 4nδ, (64)

where the first step follows from (63), the second step follows from OPT ≥ −‖C‖_op · ‖X*‖_1 ≥ −nLR, where ‖·‖_1 is the Schatten 1-norm, and the last step follows from δ ≤ 1 ≤ n. Notice that by the feasibility of X̄ for the modified SDP, we have

⟨A_i, X̄_{[n]×[n]}⟩ + ( b_i/R − tr[A_i] ) · θ = b_i/R.

This implies that

|⟨A_i, X̂⟩ − b_i| = |( b_i − R · tr[A_i] ) · θ| ≤ 4nδ · ( R ‖A_i‖_1 + |b_i| ),

where the final step follows from the upper bound on θ in (64). Summing the above inequality over all i ∈ [m] finishes the proof of the lemma.
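The padding in Lemma 9.1 is mechanical enough that a small dense numpy sketch (our own illustration; the function name initialize_sdp and the dense layout are assumptions, and no attempt is made at the sparse or fast-matrix-multiplication handling used elsewhere in the paper) may help fix the block structure:

import numpy as np

def initialize_sdp(C, A, b, R, L, delta):
    """Build the padded (n+2)x(n+2) SDP of Lemma 9.1 and its feasible initial point.

    A has shape (m, n, n); returns (A_bar, b_bar, C_bar, X_bar, y_bar, S_bar).
    """
    m, n, _ = A.shape
    A_bar = np.zeros((m + 1, n + 2, n + 2))
    for i in range(m):
        A_bar[i, :n, :n] = A[i]
        A_bar[i, n + 1, n + 1] = b[i] / R - np.trace(A[i])
    A_bar[m, :n, :n] = np.eye(n)
    A_bar[m, n, n] = 1.0
    b_bar = np.append(b / R, n + 1)
    C_bar = np.zeros((n + 2, n + 2))
    C_bar[:n, :n] = (delta / L) * C
    C_bar[n + 1, n + 1] = -1.0
    # Feasible initial point from statement 1: X = I_{n+2}, y = e_{m+1},
    # S = diag(I_n - (delta/L) C, 1, 1), so that sum_i y_i A_bar_i - S_bar = C_bar.
    X_bar = np.eye(n + 2)
    y_bar = np.append(np.zeros(m), 1.0)
    S_bar = np.zeros((n + 2, n + 2))
    S_bar[:n, :n] = np.eye(n) - (delta / L) * C
    S_bar[n, n] = 1.0
    S_bar[n + 1, n + 1] = 1.0
    return A_bar, b_bar, C_bar, X_bar, y_bar, S_bar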
Acknowledgment

We thank Aaron Sidford for many helpful discussions and Deeksha Adil, Sally Dong, Sandy Kaplan, and Kevin Tian for useful feedback on the writing. We gratefully acknowledge funding from CCF-1749609, CCF-1740551, DMS-1839116, Microsoft Research Faculty Fellowship, and Sloan Research Fellowship. Zhao Song is partially supported by Ma Huateng Foundation, Schmidt Foundation, Simons Foundation, NSF, DARPA/SRC, Google and Amazon.

References

[AK07] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In
Proceedings of the 39th Annual ACM Symposium on Theory of Computing(STOC) , 2007.[ALO16] Zeyuan Allen Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain awidth-independent, parallel, simpler, and faster positive SDP solver. In
Proceedings ofthe Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms(SODA) ,2016.[Ans00] Kurt M Anstreicher. The volumetric barrier for semidefinite programming.
Mathematicsof Operations Research , 2000.[ARV09] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddingsand graph partitioning.
Journal of the ACM (JACM) , 2009.[AV95] David S Atkinson and Pravin M Vaidya. A cutting plane algorithm for convex pro-gramming that uses analytic centers.
Mathematical Programming , 69(1-3):1–43, 1995.[AZL17] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: faster online learningof eigenvectors and faster mmwu. In
Proceedings of the 34th International Conferenceon Machine Learning-Volume 70 , 2017.[Ban19] Nikhil Bansal. On a generalization of iterated and randomized rounding. In
Proceedingsof the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC) , 2019.[BCS97] Peter Bürgisser, Michael Clausen, and Mohammad A Shokrollahi.
Algebraic complexity theory, volume 315. Springer Science & Business Media, 1997.

[BDG16] Nikhil Bansal, Daniel Dadush, and Shashwat Garg. An algorithm for Komlós conjecture matching Banaszczyk's bound. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2016.

[BG17] Nikhil Bansal and Shashwat Garg. Algorithmic discrepancy beyond partial coloring. In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing(STOC) , 2017.[Blä13] Markus Bläser. Fast matrix multiplication.
Theory of Computing, pages 1–60, 2013.

[BLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2020.

[Bra20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In
ACM-SIAM Symposium on Discrete Algorithms (SODA) , 2020.[BV02] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks.In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing(STOC) , pages 109–115. ACM, 2002.[CDG19] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimationin nearly-linear time. In
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2019.

[CDGW19] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David Woodruff. Faster algorithms for high-dimensional robust covariance estimation. In
Conference on Learning Theory (COLT) ,2019.[CDST19] Yair Carmon, John C. Duchi, Aaron Sidford, and Kevin Tian. A rank-1 sketch formatrix multiplicative weights. In
Conference on Learning Theory, COLT 2019, 25-28June 2019, Phoenix, AZ, USA , pages 589–623, 2019.[CG18] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adver-sary. In
Conference On Learning Theory (COLT) , 2018.[CLS19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the currentmatrix multiplication time. In
Proceedings of the 51st Annual ACM Symposium onTheory of Computing (STOC) , 2019.[Dan47] George B Dantzig. Maximization of a linear function of variables subject to linearinequalities.
Activity analysis of production and allocation , 13:339–347, 1947.[Eld13] Ronen Eldan. Thin shell implies spectral gap up to polylog via a stochastic localizationscheme.
Geometric and Functional Analysis , 2013.[GH16] Dan Garber and Elad Hazan. Sublinear time algorithms for approximate semidefiniteprogramming.
Mathematical Programming , 158(1-2):329–361, 2016.[GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method andits consequences in combinatorial optimization.
Combinatorica , 1981.[GU18] François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication usingpowers of the coppersmith-winograd tensor. In
Proceedings of the Twenty-Ninth AnnualACM-SIAM Symposium on Discrete Algorithms , SODA ’18, 2018.[GV02] Jean-Louis Goffin and Jean-Philippe Vial. Convex nondifferentiable optimization: Asurvey focused on the analytic center cutting plane method.
Optimization methods andsoftware , 2002.[GW95] Michel X Goemans and David P Williamson. Improved approximation algorithms formaximum cut and satisfiability problems using semidefinite programming.
Journal ofthe ACM (JACM) , 1995.[HJ12] Roger A. Horn and Charles R. Johnson.
Matrix Analysis . Cambridge University Press,New York, NY, USA, 2nd edition, 2012.[HW53] A. J. Hoffman and H. W. Wielandt. The variation of the spectrum of a normal matrix.
Duke Math. J. , 20(1):37–39, 03 1953.[JJUW11] Rahul Jain, Zhengfeng Ji, Sarvagya Upadhyay, and John Watrous. QIP = PSPACE.
Journal of the ACM (JACM) , 2011.[JLL +
20] Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, and Kevin Tian. Posi-tive semidefinite programming: mixed, parallel, and width-independent. In
Proccedingsof the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020,Chicago, IL, USA, June 22-26, 2020 . ACM, 2020.33JLSW20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cuttingplane method for convex optimization, convex-concave games and its applications. In
STOC , 2020.[JY11] Rahul Jain and Penghui Yao. A parallel approximation algorithm for positive semidefi-nite programming. In
Proceedings of the 2011 IEEE 52nd Annual Symposium on Foun-dations of Computer Science (FOCS) , 2011.[Kar84] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In
Proceedings of the sixteenth annual ACM symposium on Theory of computing (STOC) ,1984.[Kha80] Leonid G Khachiyan. Polynomial algorithms in linear programming.
USSR Computa-tional Mathematics and Mathematical Physics , 20(1):53–72, 1980.[KM03] Kartik Krishnan and John E Mitchell. Properties of a cutting plane method for semidef-inite programming. submitted for publication , 2003.[KMS94] David Karger, Rajeev Motwani, and Madhu Sudan. Approximate graph coloring bysemidefinite programming. In
Proceedings 35th Annual Symposium on Foundations ofComputer Science (FOCS) . IEEE, 1994.[KTE88] Leonid G Khachiyan, Sergei Pavlovich Tarasov, and I. I. Erlikh. The method of inscribedellipsoids. In
Soviet Math. Dokl , volume 37, pages 226–230, 1988.[Lee16] Yin Tat Lee.
Faster algorithms for convex and combinatorial optimization. PhD thesis, Massachusetts Institute of Technology, 2016.

[LP20] Yin Tat Lee and Swati Padmanabhan. An Õ(m/ε^{3.5})-cost algorithm for semidefinite programs with diagonal constraints. In Jacob D. Abernethy and Shivani Agarwal, editors,
Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], Proceedings of Machine Learning Research. PMLR, 2020.

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in O(√rank) iterations and faster algorithms for maximum flow. In 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2014.

[LS15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2015.

[LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2015.

[NN89] Yurii Nesterov and Arkadi Nemirovski. Self-concordant functions and polynomial time methods in convex programming. Preprint, Central Economic & Mathematical Institute, USSR Acad. Sci., Moscow, USSR, 1989.

[NN92] Yurii Nesterov and Arkadi Nemirovski. Conic formulation of a convex programming problem and duality.
Optimization Methods and Software, 1(2):95–115, 1992.

[NN94] Yurii Nesterov and Arkadi Nemirovski.
Interior-point polynomial algorithms in convexprogramming , volume 13. Siam, 1994.[PT12] Richard Peng and Kanat Tangwongsan. Faster and simpler width-independent parallelalgorithms for positive semidefinite programming. In
Proceedings of the twenty-fourthannual ACM symposium on Parallelism in algorithms and architectures (SPAA) , pages101–108, 2012.[Ren01] James Renegar.
A Mathematical View of Interior-point Methods in Convex Optimiza-tion . Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001.[Sho77] Naum Z Shor. Cut-off method with space extension in convex programming problems.
Cybernetics and systems analysis , 13(1):94–96, 1977.[Sid15] Aaron Daniel Sidford.
Iterative methods, combinatorial optimization, and linear pro-gramming beyond the universal barrier . PhD thesis, Massachusetts Institute of Tech-nology, 2015.[Str91] Volker Strassen. Degeneration and complexity of bilinear maps: some asymptotic spec-tra.
J. reine angew. Math, 413:127–180, 1991.

[Vai87] Pravin M Vaidya. An algorithm for linear programming which requires O(((m+n)n^2 + (m+n)^{1.5} n)L) arithmetic operations. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC), 1987.

[Vai89a] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 338–343, 1989.

[Vai89b] Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332–337. IEEE, 1989.

[VB96] Lieven Vandenberghe and Stephen P. Boyd. Semidefinite programming. SIAM Review, 1996.

[Woo49] Max A Woodbury. The stability of out-input matrices.
Chicago, IL , 9, 1949.[Woo50] Max A Woodbury. Inverting modified matrices. 1950.[YN76] David B Yudin and Arkadi S Nemirovski. Evaluation of the information complexity ofmathematical programming problems.
Ekonomika i Matematicheskie Metody , 12:128–142, 1976.[YTF +
19] Alp Yurtsever, Joel A. Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable semidefinite programming, 2019.
Matrix Multiplication: A Tensor Approach
The main goal of this section is to rederive, using tensors, some of the technical results from Section 3. In particular, we use tensors to derive upper bounds on the time to perform the following two rectangular matrix multiplication tasks (Lemmas A.12 and A.13):

• Multiplying a matrix of dimensions m × n^2 with one of dimensions n^2 × m,
• Multiplying a matrix of dimensions n × mn with one of dimensions mn × n.

Our hope is that these techniques will eventually be useful in further improving the results of this paper.

A.1 Exponent of matrix multiplication
We recall two definitions to describe the cost of certain fundamental matrix operations, along withtheir properties.
Definition A.1.
Define T mat ( n, r, m ) to be the number of operations needed to compute the productof matrices of dimensions n × r and r × m . Definition A.2.
We define the function ω(k) to be the minimum value such that T_mat(n, n^k, n) = n^{ω(k)+o(1)}. We overload notation and use ω to denote the exponent of matrix multiplication (in other words, the cost of multiplying two n × n matrices is n^{ω+o(1)}), and let α denote the dual exponent of matrix multiplication. Thus, we have ω(1) = ω and ω(α) = 2.

Lemma A.3 ([GU18]). We have:
• ω = ω(1) ≤ 2.372927,
• ω(1.5) ≤ 2.796537,
• ω(1.75) ≤ 3.021591,
• ω(2) ≤ 3.251640.

Lemma A.4 ([BCS97, Blä13]). For any three positive integers n, m, r, we have T_mat(n, r, m) = O(T_mat(n, m, r)) = O(T_mat(m, n, r)).

A.2 Matrix multiplication tensor
The rank of a tensor T, denoted R(T), is the minimum number of simple tensors that sum up to T. For any two tensors S = (S_{i,j,k})_{i,j,k} and T = (T_{a,b,c})_{a,b,c}, we write S ≤ T if there exist three matrices A, B and C (of appropriate sizes) such that S_{i,j,k} = Σ_{a,b,c} A_{i,a} B_{j,b} C_{k,c} T_{a,b,c} for all i, j, k. For any i, j, k, denote by e_{i,j,k} the tensor with 1 in the (i, j, k)-th entry, and 0 elsewhere.

Definition A.5 (Matrix-multiplication tensor). For any three positive integers a, b, c, we define

⟨a, b, c⟩ := Σ_{i∈[a]} Σ_{j∈[b]} Σ_{k∈[c]} e_{(i−1)b+j, (j−1)c+k, (k−1)a+i}

to be the matrix-multiplication tensor corresponding to multiplying a matrix of size a × b with one of size b × c.

For any positive integers n_i and m_i, where i = 1, 2, 3, we have ⟨n_1, n_2, n_3⟩ ⊗ ⟨m_1, m_2, m_3⟩ = ⟨n_1 m_1, n_2 m_2, n_3 m_3⟩. Let ⟨n⟩ = Σ_{i∈[n]} e_{i,i,i} be the identity tensor. For any three tensors S, T_1 and T_2, if T_1 ≤ T_2, then we have S ⊗ T_1 ≤ S ⊗ T_2.

Lemma A.6 (Monotonicity of tensor rank, [Str91]). Tensor rank is monotone under the relation ≤, i.e. if T_1 ≤ T_2, then we have R(T_1) ≤ R(T_2).

Lemma A.7 (Sub-multiplicativity of tensor rank, [Str91]). For any tensors T_1 and T_2, we have R(T_1 ⊗ T_2) ≤ R(T_1) · R(T_2).

Lemma A.8.
The tensor rank of a matrix multiplication tensor is equal to the cost of multiplying the two correspondingly sized matrices up to a constant factor, i.e., R(⟨a, b, c⟩) = Θ(T_mat(a, b, c)).

A.3 Implication of matrix multiplication technique
Lemma A.9 (Sub-linearity). For any p ≥ q ≥ 1, we have ω(p) ≤ p − q + ω(q).

Proof.
We have ⟨n, n^p, n⟩ = ⟨n, n^q, n⟩ ⊗ ⟨1, n^{p−q}, 1⟩. Applying tensor rank on both sides,

R(⟨n, n^p, n⟩) = R(⟨n, n^q, n⟩ ⊗ ⟨1, n^{p−q}, 1⟩) ≤ R(⟨n, n^q, n⟩) · R(⟨1, n^{p−q}, 1⟩),

where the last line follows from Lemma A.7. Applying Lemma A.8, we have

T_mat(n, n^p, n) ≤ O(1) · T_mat(n, n^q, n) · n^{p−q}.

Using the definition of ω(p), we have

n^{ω(p)+o(1)} ≤ O(1) · n^{ω(q)+o(1)} · n^{p−q}.

Comparing the exponents on both sides completes the proof.

The next lemma establishes the convexity of ω(k) as a function of k.

Lemma A.10 (Convexity of ω(k)). The fast rectangular matrix multiplication time exponent ω(k) as defined in Definition A.2 is convex in k.

Proof. Let k = α·p + (1−α)·q for α ∈ (0, 1). We have

⟨n, n^k, n⟩ = ⟨n^α, n^{α·p}, n^α⟩ ⊗ ⟨n^{1−α}, n^{(1−α)·q}, n^{1−α}⟩.

Applying the tensor rank on both sides,

R(⟨n, n^k, n⟩) = R(⟨n^α, n^{α·p}, n^α⟩ ⊗ ⟨n^{1−α}, n^{(1−α)·q}, n^{1−α}⟩) ≤ R(⟨n^α, n^{α·p}, n^α⟩) · R(⟨n^{1−α}, n^{(1−α)·q}, n^{1−α}⟩),

where the last line follows from Lemma A.7. By Lemma A.8, we have

T_mat(n, n^k, n) ≤ O(1) · T_mat(n^α, n^{αp}, n^α) · T_mat(n^{1−α}, n^{(1−α)q}, n^{1−α}).

By the definition of ω(·), we have

n^{ω(k)+o(1)} ≤ O(1) · n^{α·ω(p)} · n^{(1−α)·ω(q)}.

Comparing the exponents, we conclude that ω(k) ≤ α·ω(p) + (1−α)·ω(q).
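As a numerical illustration (added here; it uses the [GU18] bounds quoted in Lemma A.3), convexity interpolates between the square and rectangular exponents:

$\omega(1.5) \;\le\; \tfrac{1}{2}\,\omega(1) + \tfrac{1}{2}\,\omega(2) \;\le\; \tfrac{1}{2}\,(2.372927 + 3.251640) \;\approx\; 2.812,$

which is already close to the sharper direct bound ω(1.5) ≤ 2.796537 from [GU18].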
Lemma A.11. Let T_mat be defined as in Definition A.1. Then for any positive integers a, b, c and k, we have T_mat(a, bk, c) ≤ O(T_mat(ak, b, ck)).

Proof.
Notice that ⟨1, k, 1⟩ ≤ ⟨k, 1, k⟩. Therefore, we have

⟨a, bk, c⟩ = ⟨a, b, c⟩ ⊗ ⟨1, k, 1⟩ ≤ ⟨a, b, c⟩ ⊗ ⟨k, 1, k⟩ = ⟨ak, b, ck⟩.

It then follows from Lemma A.6 that R(⟨a, bk, c⟩) ≤ R(⟨ak, b, ck⟩). Finally, using Lemma A.8 gives T_mat(a, bk, c) ≤ O(T_mat(ak, b, ck)). Thus we complete the proof.

A.4 General bound on T_mat(n, mn, n) and T_mat(m, n^2, m)

Lemma A.12.
Let T_mat be defined as in Definition A.1. If m ≥ n, then we have T_mat(n, mn, n) ≤ O(T_mat(m, n^2, m)). If m ≤ n, then we have T_mat(m, n^2, m) ≤ O(T_mat(n, mn, n)).

Proof.
We only prove the case of m ≥ n, as the other case where m < n is similar. This is an immediate consequence of Lemma A.11 by taking a = c = n, b = n^2, and k = ⌊m/n⌋, where k is a positive integer because m ≥ n.

In the next lemma, we derive upper bounds on the term T_mat(m, n^2, m) when m ≥ n and T_mat(n, mn, n) when m < n, which is crucial to our runtime analysis.

Lemma A.13.
Let T_mat be defined as in Definition A.1 and ω be defined as in Definition A.2.

Property I. We have T_mat(n, mn, n) ≤ O(m · n^{ω+o(1)}).

Property II. We have T_mat(m, n^2, m) ≤ O( √n ( mn^2 + m^ω ) ).

Proof.
Property I.
Since ⟨n, mn, n⟩ = ⟨n, n, n⟩ ⊗ ⟨1, m, 1⟩, applying the tensor rank on both sides, we have

R(⟨n, mn, n⟩) = R(⟨n, n, n⟩ ⊗ ⟨1, m, 1⟩) ≤ R(⟨n, n, n⟩) · R(⟨1, m, 1⟩) ≤ n^{ω+o(1)} · m,

where the last step uses R(⟨n, n, n⟩) = n^{ω+o(1)} and R(⟨1, m, 1⟩) = m. By Lemma A.8, this gives T_mat(n, mn, n) ≤ O(m · n^{ω+o(1)}). Thus, we complete the proof.
Property II.
Let m = n^a, where a ∈ (0, ∞). We have ⟨m, n^2, m⟩ = ⟨n^a, (n^a)^{2/a}, n^a⟩. It implies that

T_mat(m, n^2, m) = n^{a·ω(2/a)+o(1)}.

Property II is then an immediate consequence of the following inequality, which we prove next:

ω(2/a) < max( 1 + 2.5/a, ω(1) + 0.5/a ), ∀ a ∈ (0, ∞).

Let b = 2/a ∈ (0, ∞). Then the above desired inequality can be expressed in terms of b as

ω(b) < max( 1 + 5b/4, ω(1) + b/4 ), ∀ b ∈ (0, ∞). (65)

Notice that the RHS of (65) is a maximum of two linear functions of b, and these intersect at b* = ω(1) − 1. By the convexity of ω(·) as proved in Lemma A.10, it suffices to verify (65) at the endpoints b → 0, b → ∞ and b = b*. In the case where b = δ for any δ < 1, (65) follows immediately from the observation that ω(δ) ≤ ω(1). For the case b → ∞, by Lemma A.3 we have ω(2) ≤ 3.251640. It then follows from Lemma A.9 that for any b > 2, we have ω(b) ≤ b − 2 + ω(2) ≤ 1 + 5b/4.

The final case is where b = b* = ω(1) − 1, for which (65) is equivalent to

ω(ω(1) − 1) < 5ω(1)/4 − 1/4. (66)

By Lemma A.3, we have that ω(1) − 2 ∈ [0, 0.372927]. Then to prove (66), it is sufficient to show that

ω(t + 1) < 5t/4 + 9/4, ∀ t ∈ [0, 0.372927], where t = ω(1) − 2. (67)

By the convexity of ω(·) as proved in Lemma A.10 and the upper bound ω(2) ≤ 3.251640 in Lemma A.3, we have, for k ∈ [1, 2],

ω(k) ≤ ω(1) + (k − 1) · (3.251640 − (t + 2)) = t + 2 + (k − 1) · (1.251640 − t).

In particular, using this inequality for k = t + 1, we have

ω(t + 1) − 5t/4 − 9/4 ≤ (t + 2) + t · (1.251640 − t) − 5t/4 − 9/4 = −t^2 + 1.00164·t − 1/4,

which is negative on the entire interval [0, 0.372927]. This completes the proof of Property II.