Bilinear Factorization for Low-Rank SDP Learning
En-Liang Hu, Department of Mathematics, Yunnan Normal University { [email protected] }

Abstract
Many machine learning problems can be reduced to learning a low-rank positive semidefinite matrix (denoted as Z), which leads to a semidefinite program (SDP). Existing SDP solvers are often too expensive for large-scale learning. To avoid solving the SDP directly, some works convert it into a nonconvex program by factorizing Z quadratically as XX^T. However, this introduces higher-order nonlinearity, leaving little structure to exploit in the subsequent optimization. In this paper, we propose a novel surrogate for SDP-related learning in which the structure of the subproblems is exploited. More specifically, we surrogate the unconstrained SDP by a biconvex problem: we factorize Z bilinearly as XY^T and use a Courant penalty to penalize the difference between X and Y, so that the resultant subproblems are convex. Furthermore, we provide a theoretical bound for the associated penalty parameter under the assumption that the subobjective function is L-Lipschitz-smooth and σ-strongly convex, such that the proposed surrogate solves the original SDP whenever the penalty parameter exceeds this bound (that is, γ > (L − σ)/4). Experiments on two SDP-related machine learning applications demonstrate that the proposed algorithm is as accurate as the state-of-the-art, but is faster on large-scale learning.

Introduction
In this paper, we consider the minimization of a smooth convex function over symmetric positive semidefinite (PSD) matrices, that is,

    min_{Z ∈ S^n_+} f̃(Z),    (1)

where f̃(·) is differentiable convex, and S^n_+ = { Z | Z = Z^T, Z ⪰ 0, Z ∈ R^{n×n} } is the cone of symmetric PSD matrices, in which ⪰ denotes the Löwner partial order. Many machine learning problems can be reduced to (1). Prominent examples include adaptive clustering (Royer 2017) and sparse PCA (d'Aspremont et al. 2007), among others. A popular remedy for the expensive PSD constraint is to factorize Z quadratically as XX^T and surrogate (1) by a nonconvex program,

    min_{X ∈ R^{n×r}} f̃(XX^T),    (2)

where X is a real matrix and r ≪ n if Z is low-rank. Compared with (1), problem (2) has no PSD constraint, but its objective becomes a nonconvex function. For problem (2), existing results (Burer and Monteiro 2003; Journée et al. 2010) establish the following:
1. Any local minimizer X̄ of problem (2) provides a stationary point Z̄ = X̄X̄^T of problem (1) if X̄ is rank deficient; this stationary point Z̄ is then an optimal solution of problem (1).
2. In the case r = n, any local minimizer X̄ of problem (2) provides a stationary point Z̄ = X̄X̄^T of problem (1), and Z̄ is then an optimal solution of (1).
The above results show that a local minimizer X̄ of (2) solves (1) if we set r > rank(X̄) (i.e., X̄ is rank deficient; refer to (Li and Tang 2017) for more details). Instead of factorizing Z quadratically as XX^T as in (2), in this paper we propose a novel surrogate that factorizes Z bilinearly as XY^T and penalizes the difference between X and Y. Specifically, letting Z = XY^T, we convert the SDP (1) into a biconvex problem with the addition of a Courant penalty (Nocedal and Wright 2006), in which the resultant subproblems are convex w.r.t. X and Y respectively.
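As a sanity check on the factorization idea, the following NumPy sketch (illustrative only; the random toy data are our own assumptions, not from the paper) confirms that Z = XX^T is symmetric PSD for any X and has rank at most r, so the PSD constraint of (1) is handled implicitly in (2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
X = rng.standard_normal((n, r))   # arbitrary unconstrained factor
Z = X @ X.T                       # Z = X X^T

# Z is symmetric PSD by construction, with rank at most r
assert np.allclose(Z, Z.T)
assert np.all(np.linalg.eigvalsh(Z) >= -1e-10)
assert np.linalg.matrix_rank(Z) <= r
```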
We will show that the proposed surrogate leads to faster computation in optimizing (1), especially when f̃(·) is a quadratic function. Empirical results on two applications demonstrate that our proposal brings a substantial computational gain without loss of accuracy.

Related Work

The interior point method (Nesterov and Nemirovski 1994) can produce highly accurate solutions for the SDP problem. However, it is not scalable, because at least O(n^3) time and O(n^2) space are needed in each iteration. Moreover, the interior point method has difficulty exploiting additional information about the problem structure, such as the target matrix being low-rank. To exploit low-rank structure, a popular approach for solving SDPs is the Frank-Wolfe (FW) algorithm (Jaggi 2013), which is based on sparse greedy approximation. For smooth problems, the complexity of FW is O(N/ε^{1.5}) arithmetic operations (Jaggi 2013); that is, O(1/ε) iterations are needed to converge to an ε-accurate solution, and each iteration needs O(N log(n)/√ε) arithmetic operations, where N is the number of nonzero entries in the gradient. When the gradient of the objective is dense, FW can still be expensive. Alternatively, as Z in (1) is symmetric and positive semidefinite, it can be factorized quadratically as XX^T, and problem (1) can then be rewritten as the nonconvex program (2) (Burer and Monteiro 2003). It is known that a local minimizer X̄ of (2) corresponds to a minimizer of (1) if X̄ is rank-deficient (Burer and Monteiro 2003; Journée et al. 2010).

Notation
The transpose of a vector or matrix is denoted by the superscript ^T. The identity matrix is denoted by I, with appropriate size. For a matrix A = [a_ij], tr(A) is its trace, ‖A‖_F = (Σ_ij a_ij^2)^{1/2} is its Frobenius norm, ‖·‖ is its spectral norm (i.e., maximal singular value), and vec(A) unrolls A into a column vector. For two matrices A, B, ⟨A, B⟩ = tr(A^T B), and A ⊗ B is their Kronecker product. For a function g(x), ∇_x g(·) is the derivative w.r.t. the variable x, and g is L-smooth and σ-strongly convex if for any x_1, x_2 there exist constants 0 ≤ σ ≤ L < ∞ such that

    σ‖x_1 − x_2‖ ≤ ‖∇_x g(x_1) − ∇_x g(x_2)‖ ≤ L‖x_1 − x_2‖.

If G(u, v) is a function of both a left-variable u and a right-variable v, then G(·⌋·) denotes the function of the left-variable only, with a (or any) constant for the right-variable, while G(·⌊·) denotes the function of the right-variable only, with a (or any) constant for the left-variable. Hence, if both ū and v̄ are constants, we have G(ū, v̄) = G(ū ⌋ v̄) = G(ū ⌊ v̄), and G(ū ⌋ v̄) = G(v̄ ⌋ ū) iff G(ū, v̄) = G(v̄, ū). Naturally, ∇G(·⌋·) and ∇G(·⌊·) are the partial derivatives of the left-variable function and the right-variable function, respectively.

Problem Formulation
Motivation
For current large-scale optimization in machine learning, first-order methods such as (accelerated/stochastic) gradient descent are popular because of their computational simplicity. In this class of methods, the most expensive operation usually lies in repeatedly searching for a stepsize that decreases the objective at each iteration, which requires evaluating the objective many times. If a good stepsize is available in a simple analytical form, the computational cost decreases greatly. However, a simple stepsize is impossible in (2) even if f̃ is a simple quadratic function. This is because when f̃ in (1) is quadratic w.r.t. Z, the objective in (2) becomes a quartic function w.r.t. X, so that the stepsize search in (2) involves several complex computations, including constructing the coefficients of a quartic polynomial, solving a cubic equation, and comparing the resulting solutions (Burer and Choi 2006). Again, this requires computing the objective many times. The above difficulties motivate us to propose the biconvex formulation (3) instead of (2). In each subproblem of (3), the stepsize search is usually easier than in (2). For example, when f̃ is a quadratic function, the subobjective F(X ⌋ Y; γ) or F(X ⌊ Y; γ) in (3) is still quadratic w.r.t. X or Y, respectively. Hence, the optimal stepsize for descending the objective w.r.t. X or Y is analytical, simple, and unique for (3) ((18) is an example).

Problem Formulation
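Before formalizing the surrogate, the stepsize contrast of the Motivation can be illustrated numerically. In the sketch below (a toy instance of our own choosing), f̃(Z) = ½‖Z − T‖²_F stands in for a quadratic objective; restricted to a line X − τD, the quadratic factorization f̃((X−τD)(X−τD)^T) is a degree-4 polynomial in τ, while the bilinear counterpart f̃((X−τD)Y^T) stays degree 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 5, 2
T = rng.standard_normal((n, n)); T = (T + T.T) / 2   # symmetric target
X = rng.standard_normal((n, r))
Y = rng.standard_normal((n, r))
D = rng.standard_normal((n, r))                      # a descent direction

ftil = lambda Z: 0.5 * np.linalg.norm(Z - T, 'fro') ** 2   # stand-in quadratic f~

taus = np.linspace(-1.0, 1.0, 9)
phi = [ftil((X - t * D) @ (X - t * D).T) for t in taus]    # line search in (2): quartic
psi = [ftil((X - t * D) @ Y.T) for t in taus]              # subproblem line search: quadratic

c_phi = np.polyfit(taus, phi, 4)   # quartic: nonzero leading coefficient
c_psi = np.polyfit(taus, psi, 4)   # quadratic: degree-3 and 4 coefficients vanish
assert abs(c_phi[0]) > 1e-8
assert np.allclose(c_psi[:2], 0.0, atol=1e-8)
```

This is why an exact closed-form stepsize exists for the bilinear subproblems but not for the quadratic factorization.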
Instead of factorizing Z quadratically as XX^T, we factorize Z bilinearly as XY^T and penalize the difference between X and Y, which can be formulated as

    min_{X,Y ∈ R^{n×r}} F(X, Y; γ) ≡ f(X, Y) + (γ/2)‖X − Y‖_F^2,    (3)

where γ > 0 is a penalty parameter and f(X, Y) is the symmetrization of f̃(XY^T) in terms of X and Y, namely f(X, Y) = f(Y, X) is satisfied. For example, f can be defined as

    f(X, Y) = f̃(XY^T),                       if f̃(XY^T) = f̃(YX^T),
    f(X, Y) = [f̃(XY^T) + f̃(YX^T)] / 2,      otherwise,    (4)

and we will see (in the proof of Theorem 1) that the symmetry of f is key to bounding γ. Note that the penalty term in (3) is the classic quadratic (or Courant) penalty for the constraint X = Y. In general, problem (3) approaches problem (2) only when γ goes to infinity. However, we will show (in Theorem 1) that γ can be bounded if f(·⌋·) and f(·⌊·) are Lipschitz-smooth, so that when γ is larger than that bound, (3) is equivalent to (2) in the sense that a stationary point (X̄, Ȳ) of (3) provides X̄ as a stationary point of (2). Hence, any local minimizer of (3) produces an optimal solution of (1) if r is set large enough. Moreover, it will be shown that the objective in (3) is biconvex, and thus we can choose a convex optimization algorithm to decrease the objective function F w.r.t. X and Y alternately. Consequently, this opens a door to surrogating an unconstrained SDP by biconvex optimization.

Contributions
Our main contributions can be summarized as follows:
1. We propose a biconvex penalty formulation (3) to surrogate unconstrained SDP. This opens a door to solving many SDP-based learning problems using biconvex optimization.
2. We propose a theoretical bound for the penalty parameter γ. This bridges the stationary points of (3) to the stationary points of (2) when γ is set larger than this bound.
3. We show that problem (3) is markedly easier to solve than (2), especially when f̃(·) is a quadratic function. This is validated on two SDP-related applications in the experimental section.

Proposed Algorithm
Biconvex Penalty
We first give the definition of biconvexity as follows.

Definition 1. g(·, ·) is biconvex if for any x_1, x_2, y_1, y_2, and any 0 ≤ α_i, β_i ≤ 1 (i = 1, 2) such that α_1 + α_2 = β_1 + β_2 = 1, we have

    g(α_1 x_1 + α_2 x_2, β_1 y_1 + β_2 y_2) ≤ α_1 β_1 g(x_1, y_1) + α_1 β_2 g(x_1, y_2) + α_2 β_1 g(x_2, y_1) + α_2 β_2 g(x_2, y_2).

Namely, g(·⌋y) and g(x⌊·) are convex, respectively, for fixed y and x.

Definition 2. g(·) is L̂-smooth and σ̂-strongly convex if for any x_1, x_2 we have σ̂‖x_1 − x_2‖ ≤ ‖∇g(x_1) − ∇g(x_2)‖ ≤ L̂‖x_1 − x_2‖ with L̂ ≥ σ̂ ≥ 0.

The following proposition shows that f in (4) and the objective in (3) are biconvex if f̃ in (1) is convex.

Proposition 1. If f̃ in (1) is convex w.r.t. Z, then f in (4) and F in (3) are biconvex w.r.t. X and Y.

We assume that f in (3) is coercive and satisfies the following assumptions.

Assumption 1. f̃ (and thus f) is coercive, namely satisfying lim_{‖Z‖→+∞} f̃(Z)/‖Z‖ = +∞. Hence, the sublevel set L_c = {(X, Y) | f(X, Y) ≤ c} is bounded.

Assumption 2. f(·⌋·) is L_y-smooth and σ_y-strongly convex, and f(·⌊·) is L_x-smooth and σ_x-strongly convex. Hence, both f(·⌋·) and f(·⌊·) are L-smooth and σ-strongly convex with L = max{L_y, L_x}, σ = min{σ_y, σ_x}.

The Theoretical Bound
In the following theorem, we provide a theoretical bound on the penalty parameter γ that connects (3) to (2), and thus to (1).

Theorem 1. If (X̄, Ȳ) is a stationary point of (3), then X̄ = Ȳ when γ > (L − σ)/4.

Corollary 1.
When the condition of Theorem 1 is satisfied, we have: (i) X̄ is a stationary point of (2) if (X̄, Ȳ) is a stationary point of (3); (ii) X̄ is a local minimizer of (2) if (X̄, Ȳ) is a local minimizer of (3).

Remark 1.
Following Corollary 1, a local minimizer (X̄, Ȳ) of (3) provides X̄ as a local minimizer of (2) if γ > (L − σ)/4. Furthermore, the produced Z̄ (= X̄X̄^T) solves (1) if the rank-deficiency condition (i.e., r > rank(X̄)) is satisfied (Li and Tang 2017).

Our Algorithm
Now we focus on solving (3) under the assumption that the penalty parameter γ satisfies Theorem 1. Owing to the convexity of the subproblems, we can solve (3) by alternate convex search. Specifically, we can solve the convex subproblem w.r.t. X with fixed Y and the convex subproblem w.r.t. Y with fixed X alternately, that is,

    X_k = argmin_{X ∈ R^{n×r}} F(X ⌋ Y_{k−1}; γ);    (5)
    Y_k = argmin_{Y ∈ R^{n×r}} F(X_k ⌊ Y; γ).    (6)

In fact, it makes sense to solve (5) and (6) only inexactly. An example is multiconvex optimization (Xu and Yin 2013), in which an alternating proximal gradient descent was proposed for multiconvex objectives. We first give a brief introduction to the accelerated proximal gradient (APG) descent method, and then fuse it into our algorithm. A well-known APG method is FISTA (Beck and Teboulle 2009), which was proposed to solve convex composite objectives of the form

    min_x ℓ(x) + h(x),

where ℓ(x) is a differentiable loss and h(x) is a nondifferentiable regularizer. When h(x) = 0 (i.e., the 'proximal' part is absent, as in our case later), FISTA degenerates into computing the sequence {x_s} via the iteration

    t_s = (1 + √(1 + 4t_{s−1}^2)) / 2,
    ω_s = (t_{s−1} − 1) / t_s,
    x̂_{s−1} = x_{s−1} + ω_s (x_{s−1} − x_{s−2}),
    x_s = x̂_{s−1} − τ_s ∇_x ℓ(x̂_{s−1}),    (7)

where τ_s is the stepsize for descent. We can apply (7) (as a degenerated FISTA) to (5) and (6). Since F(X ⌋ Y_{k−1}; γ) is convex and differentiable, regarding it as ℓ(·) we can inexactly solve (5) by using (7) as the inner iterations. Analogously, we can inexactly solve (6) by using (7) as the inner procedure. The smaller the number of inner iterations, the faster the updates of X_k and Y_k interact with each other.
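Iteration (7) with h = 0 is straightforward to implement. The sketch below (a toy strongly convex quadratic ℓ of our own choosing, with a fixed stepsize τ_s = 1/L) shows the momentum recursion driving the gradient to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
A = rng.standard_normal((n, n))
A = A @ A.T / n + np.eye(n)          # PD Hessian of l(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)
grad = lambda x: A @ x - b           # gradient of the smooth loss l
tau = 1.0 / np.linalg.norm(A, 2)     # constant stepsize 1/L

x_prev = x = np.zeros(n)
t_prev = 1.0
for _ in range(500):
    t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))   # t_s update
    omega = (t_prev - 1.0) / t                           # momentum weight omega_s
    x_hat = x + omega * (x - x_prev)                     # extrapolation x_hat_{s-1}
    x_prev, x = x, x_hat - tau * grad(x_hat)             # gradient step at x_hat
    t_prev = t

assert np.linalg.norm(grad(x)) < 1e-6                    # converged to A^{-1} b
```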
Specially, if we set 1 as the number of inner iterations for inexactly solving (5) and (6), respectively, the optimization can be realized as Algorithm 1, our alternating accelerated gradient descent (AAGD) algorithm, in which one accelerated gradient step inexactly solves (5) and one inexactly solves (6), while the remaining steps estimate the penalty parameter. From another perspective, Algorithm 1 (excluding the penalty-estimation steps) can be regarded as a degenerate realization (with a slight difference in the rectified ω_k) of the alternating proximal gradient method in multiconvex optimization (Xu and Yin 2013). Hence, the convergence of Algorithm 1 follows directly from (Xu and Yin 2013).

Algorithm 1: Alternating Accelerated Gradient Descent (AAGD).
Initialization: X_0, Y_0, X_{−1} = X_0, Y_{−1} = Y_0, t_0 = 1.
for k = 1, 2, . . . do
    t_k = (1 + √(1 + 4t_{k−1}^2)) / 2, ω_k = (t_{k−1} − 1)/t_k;
    estimate the penalty parameter γ_k;
    X̂_{k−1} = X_{k−1} + ω_k (X_{k−1} − X_{k−2});
    X_k = X̂_{k−1} − τ_k^X ∇F(X̂_{k−1} ⌋ Y_{k−1}; γ_k);
    estimate the penalty parameter γ_k;
    Ŷ_{k−1} = Y_{k−1} + ω_k (Y_{k−1} − Y_{k−2});
    Y_k = Ŷ_{k−1} − τ_k^Y ∇F(X_k ⌊ Ŷ_{k−1}; γ_k);
    if a stopping criterion is satisfied then
        return Z̄ = Y_k Y_k^T (or X_k X_k^T).
    end if
end for

Applications
NPKL: nonparametric kernel learning
Given n patterns, let M be the must-link set containing pairs that should belong to the same class, and C be the cannot-link set containing pairs that should not belong to the same class. We use one of the formulations in (Zhuang et al. 2011):

    min_{Z ∈ S^n_+} f̃^{(1)}(Z) = (1/2) Σ_{{j,i} ∈ T} (Z_ji − T_ji)^2 + λ tr(ZΔ),    (8)

where Z = [Z_ji] ∈ R^{n×n} is the target kernel matrix, Δ is the graph Laplacian matrix of the data, T = M ∪ C ∪ {j = i}, T = [T_ji] with T_ji = 1 if {j, i} ∈ M or j = i, and 0 if {j, i} ∈ C, and λ is a tradeoff parameter. The first term of the objective in (8) measures the difference between Z_ji and T_ji, and the second term tr(ZΔ) encourages smoothness on the data manifold by aligning Z with Δ.

Biconvex reformulation
Let f^{(1)}(X, Y) = f̃^{(1)}(XY^T). We have

    f^{(1)}(X, Y) = (1/2) Σ_{{j,i} ∈ T} [(tr(X^T S_ji Y))^2 − 2 T_ji tr(X^T S_ji Y)] + λ tr(X^T Δ Y) + C
                  = (1/2) vec(X)^T Â(Y) vec(X) − vec(X)^T b̂(Y) + C,    (9)

where

    Â(Y) = (Y^T ⊗ I) P P^T (Y^T ⊗ I)^T,   b̂(Y) = (Y^T ⊗ I) q,
    P is the matrix whose columns are {vec(S_ji)}_{{j,i} ∈ T}, so that P P^T = Σ_{{j,i} ∈ T} vec(S_ji) vec(S_ji)^T,
    q = Σ_{{j,i} ∈ T} T_ji vec(S_ji) − λ vec(Δ),
    S_ji = I(:, j)(I(i, :))^T,

I denotes the identity matrix with appropriate size, I(:, i) is the i-th column of I, and C is a constant. Clearly, f^{(1)} is symmetric in terms of X and Y, since T = T^T and Δ = Δ^T. Plugging f = f^{(1)} into (3), we have

    F(X, Y; γ) = f^{(1)}(X, Y) + (γ/2)‖X − Y‖_F^2
               = (1/2) vec(X)^T A(Y, γ) vec(X) − vec(X)^T b(Y, γ) + (γ/2) vec(Y)^T vec(Y) + C,

where A(Y, γ) = γI + Â(Y) and b(Y, γ) = (Y^T ⊗ I)(q + γ vec(I)). So the subproblem for X (corresponding to (5)) is

    min_{X ∈ R^{n×r}} F(X ⌋ Y_{k−1}; γ_k) ⇔ min_{X ∈ R^{n×r}} (1/2) vec(X)^T A_k^Y vec(X) − vec(X)^T b_k^Y,    (10)

where A_k^Y = A(Y_{k−1}, γ_k) and b_k^Y = b(Y_{k−1}, γ_k). Since F is symmetric in terms of X and Y, similarly to (10) the subproblem for Y (corresponding to (6)) is

    min_{Y ∈ R^{n×r}} F(X_k ⌊ Y; γ_k) ⇔ min_{Y ∈ R^{n×r}} (1/2) vec(Y)^T A_k^X vec(Y) − vec(Y)^T b_k^X,    (11)

where A_k^X = A(X_k, γ_k) and b_k^X = b(X_k, γ_k).

Estimating the bound of γ. From (9), the Hessian of f^{(1)}(X, Y) with respect to the variable X is

    H(Y) = Â(Y) = (Y^T ⊗ I) P P^T (Y^T ⊗ I)^T.
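The vectorization identities behind (9) can be verified numerically. The sketch below (toy sizes of our own choosing) checks that tr(X^T S_ji Y) equals (XY^T)_ji and matches the Kronecker form vec(X)^T (Y^T ⊗ I) vec(S_ji) used to build Â(Y) and b̂(Y):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 4, 2
X = rng.standard_normal((n, r))
Y = rng.standard_normal((n, r))
I = np.eye(n)
j, i = 1, 3
S = np.outer(I[:, j], I[i, :])            # S_ji = I(:,j) (I(i,:))^T

vec = lambda M: M.reshape(-1, order='F')  # column-stacking vectorization

lhs = np.trace(X.T @ S @ Y)
assert np.isclose(lhs, (X @ Y.T)[j, i])              # tr(X^T S_ji Y) = Z_ji for Z = X Y^T
rhs = vec(X) @ (np.kron(Y.T, I) @ vec(S))            # vec(X)^T (Y^T (x) I) vec(S_ji)
assert np.isclose(lhs, rhs)                          # the identity behind A_hat(Y), b_hat(Y)
```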
So L_y (the Lipschitz-smoothness constant) and σ_y (the strong-convexity constant) in Assumption 2 can be estimated as the maximal singular value (i.e., spectral norm) and the minimal singular value of H(Y), that is, L_y = σ̇_max(H(Y)), σ_y = σ̇_min(H(Y)), and

    L_y = ‖H(Y)‖ ≤ ‖P P^T‖ ‖Y Y^T‖ = ‖P‖^2 ‖Y‖^2,    (12)

where ‖·‖ is the spectral norm, the inequality uses ‖AB‖ ≤ ‖A‖‖B‖, and the equality uses ‖AA^T‖ = ‖A‖^2. In practice, L_y can be estimated dynamically, such that at the k-th iteration, L_y and σ_y can be approximated as

    L_y ≈ L_{k−1}^y ≜ ‖P‖^2 ‖Y_{k−1}‖^2,    (13)
    σ_y = σ̇_min(H(Y_{k−1})).    (14)

By the symmetry of f^{(1)}, σ_x and L_x can similarly be approximated as

    L_x ≈ L_k^x ≜ ‖P‖^2 ‖X_k‖^2,    (15)
    σ_x = σ̇_min(H(X_k)).    (16)

By Algorithm 1, the objective c_k = F(X_k, Y_k; γ) descends over the iterations, and Assumption 1 implies that the sublevel set L_{c_k} = {(X, Y) | f(X, Y) ≤ c_k} does not expand over the iterations. Thus, L_{k−1}^y in (13) and L_k^x in (15) are bounded. Meanwhile, we can dynamically set the penalty parameter by Theorem 1 as

    γ_k > (1/4)(L_{k−1}^y − σ_y)  and  γ_k > (1/4)(L_k^x − σ_x).    (17)

Additionally, instead of (13) and (15), we can obtain the Lipschitz-smoothness constants L_{k−1}^y and L_k^x by line search or backtracking.

Computing the optimal stepsizes
The objective functions of (10) and (11) are quadratic, so the optimal stepsizes for descending the subobjectives w.r.t. X and Y are analytical, simple, and unique; they are provided by the following proposition.

Proposition 2.
For subproblems (10) and (11), the optimal stepsizes in Algorithm 1 have the closed forms

    τ_k^X = ‖d_k^X‖^2 / ((d_k^X)^T A_k^Y d_k^X),   τ_k^Y = ‖d_k^Y‖^2 / ((d_k^Y)^T A_k^X d_k^Y),    (18)

where d_k^X = vec(∇F(X̂_{k−1} ⌋ Y_{k−1}; γ_k)) and d_k^Y = vec(∇F(X_k ⌊ Ŷ_{k−1}; γ_k)).

CMVU: colored maximum variance unfolding
Colored maximum variance unfolding (CMVU) is a colored variant of maximum variance unfolding (Weinberger et al. 2004):

    min_{Z ∈ S^n_+} f̃^{(2)}(Z) = (1/2) Σ_{{i,j} ∈ N} (Z_ii + Z_jj − 2Z_ij − d_ij)^2 − λ tr(HZHL),    (19)

where d_ij denotes the squared Euclidean distance between the i-th and j-th objects, N denotes the set of neighbor pairs whose distances are to be preserved in the unfolding space, L is a kernel matrix of the labels, H = [H_ij] = [δ_ij − 1/n] centers the data and the labels in the unfolding space, and λ controls the tradeoff between dependence maximization and distance preservation. Let E_ij = I(:, i) − I(:, j); then (19) can be reformulated as

    min_{Z ∈ S^n_+} f̃^{(2)}(Z) = (1/2) Σ_{{i,j} ∈ N} (E_ij^T Z E_ij − d_ij)^2 − λ tr(ZHLH).

Let f^{(2)}(X, Y) = f̃^{(2)}(XY^T); it is easy to see that f^{(2)} satisfies the symmetry. Letting S_ji = E_ij E_ij^T and viewing d_ij as T_ji in (8), the subproblems of (19) w.r.t. X and Y can be derived; they are similar to (10) and (11), respectively.

Complexity Analysis
The iterations in Algorithm 1 are inexpensive. Estimating the penalty parameter via (13) and (15) mainly costs O(nr) operations, since ‖P‖ is a constant that needs to be computed only once, and the spectral norm ‖Y_{k−1}‖ or ‖X_k‖ costs O(nr). As in Proposition 2, calculating the optimal stepsizes also takes O(nr) operations. The space complexity is scalable, because only O(nr) space is needed.

Experiments
In this part, we perform experiments on the two SDP-related applications of the previous section: NPKL and CMVU. The proposed algorithm,
AAGD in Algorithm 1, is compared with the following state-of-the-art baselines for solving the SDP in (1):
1. FW: the accelerated Frank-Wolfe algorithm (Laue 2012), applied to (1);
2. AGD: transform (1) to (2), and solve (2) using accelerated gradient descent (Ghadimi and Lan 2016; Li and Lin 2015), with the stepsize search of (Burer and Choi 2006);
3.
L-BFGS: transform (1) to (2), and solve (2) using the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (similarly to what is done in (Burer and Monteiro 2003));
4.
SQLP: semidefinite-quadratic-linear programming (Wu et al. 2009).
SQLP is expected to be faster than a general-purpose SDP solver on a quadratic SDP problem like (8) or (19). All these algorithms are implemented in Matlab. The stopping condition of FW is reached when the relative change in objective value falls below a fixed tolerance; the stopping conditions of the others are reached when their objective values are not larger than the final objective value of FW, or when the number of iterations exceeds a preset maximum. All experiments are run on a PC with a 3.07GHz CPU and 24GB RAM. The rank r of the solution is chosen automatically by the FW algorithm, and is then used by all the other algorithms except SQLP. How to estimate the rank r in (2) and (3) is beyond the scope of this paper; interested readers can refer to (Journée et al. 2010).

Results on NPKL
Experiments are performed on the adult data sets (a1a ∼ a9a; see Table 1), which have been commonly used for benchmarking NPKL algorithms. As in (Zhuang et al. 2011), we perform k-means clustering with the number of clusters k equal to the number of classes. We set λ = 10 and repeat the clustering on each data set with random starting points (X_0, Y_0) and random pair constraints {M, C}, and then report the average results (± standard deviations). As in (Rand 1971), clustering accuracy is measured by the Rand index (a + b) / (n(n − 1)/2), where a is the number of pattern pairs that belong to the same class and are placed in the same cluster by k-means, b is the number of pattern pairs that belong to different classes and are placed in different clusters, and the denominator n(n − 1)/2 is the total number of distinct pairs. The higher the Rand index, the better the accuracy. Results on clustering accuracy and running CPU time are shown in Tables 2 and 3, respectively.

Table 1: Adult data sets a1a ∼ a9a used in the nonparametric kernel learning experiment. Each data sample has 123 features.

                               a1a     a2a     a3a     a4a     a5a     a6a      a7a      a8a      a9a
    number of samples          1,605   2,265   3,185   4,781   6,414   11,220   16,100   22,696   32,561
    number of must-link pairs  1,204   2,269   2,389   3,586   4,811   8,415    12,075   17,022   24,421
    number of cannot-link pairs 1,204  2,269   2,389   3,586   4,811   8,415    12,075   17,022   24,421

Table 2: Rand indices (%) on the adult data sets a1a ∼ a9a in Table 1. The best and comparable results (according to the pairwise t-test with 95% confidence) are highlighted. [Numeric entries missing in the source.]

Table 3: CPU time (in sec) on the adult data sets a1a ∼ a9a in Table 1. The best and comparable results (according to the pairwise t-test with 95% confidence) are highlighted. [Numeric entries missing in the source.]
As can be seen, all algorithms obtain almost the same accuracy, and AAGD is the fastest. Figure 1 shows the convergence of the objective on the smallest data set, a1a, and the largest data set, a9a. It can be observed that, in reaching the final objective value (the broken mauve line),
AAGD converges significantly faster than the others. The optimal stepsizes exist in simple closed form (i.e., (18)) for AAGD, while this is unavailable for AGD, which is one of the reasons that AAGD is faster than AGD. Figure 2 shows the progress of ‖X_k − Y_k‖_F over the iterations. It clearly converges to zero, implying that the penalty term vanishes after convergence.

Results on CMVU
Two benchmark data sets, USPS Digits and Newsgroups 20, are used in our experiments; their details can be found in (Song et al. 2008). The neighbor set N is constructed from the nearest neighbors of each point. The tradeoff parameter λ is set to its default value. The problem dimensionality n is simply set as the number of all data points.

Figure 1: Objective vs CPU time (log scale) on (a) a1a and (b) a9a.

Figure 2: The progress of the residual value ‖X_k − Y_k‖_F on a9a.

Figure 3 shows the convergence of the objective on the USPS Digits and
Newsgroups 20 data sets, respectively. It can be observed that, in reaching the final objective value (the broken mauve line), AAGD converges significantly faster than the others. On these two data sets, the run time of SQLP is excessive, so we do not plot it.

Figure 3: Objective vs CPU time (log scale) on (a) USPS Digits and (b) Newsgroups 20.

Conclusion
Many machine learning problems can be reduced to an SDP formulation, which is expensive for large-scale learning. In this paper, we considered the scalable learning of SDP problems. More specifically, we reformulated the unconstrained SDP as a biconvex problem with the addition of a Courant penalty, which can be easily optimized using an alternating accelerated gradient descent algorithm. Furthermore, we showed that when the penalty parameter is larger than our theoretical bound, any local minimizer of the biconvex surrogate provides a global minimizer of the original SDP problem if the rank-deficiency condition is satisfied. Experiments on two SDP-related problems, nonparametric kernel learning and colored maximum variance unfolding, demonstrate that the proposed algorithm is both accurate and scalable. In the future, we would like to generalize the surrogate of bilinear factorization to more complex settings.
References

[Aksoylar et al. 2017] Aksoylar et al. In ICML, pages 51–59, 2017.
[Beck and Teboulle 2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[Burer and Choi 2006] S. Burer and C. Choi. Computational enhancements in low-rank semidefinite programming. Optimization Methods and Software, 21(3):493–512, 2006.
[Burer and Monteiro 2003] S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95:329–357, 2003.
[D'Aspremont et al. 2007] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[Erdogdu et al. 2017] Erdogdu et al. In NeurIPS, pages 416–424, 2017.
[Ge et al. 2017] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML, pages 1233–1242, 2017.
[Ghadimi and Lan 2016] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
[Hu et al. 2011] Hu et al. In ICML, pages 209–216, 2011.
[Jaggi 2013] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[Journée et al. 2010] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
[Laue 2012] S. Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, pages 177–184, 2012.
[Li and Lin 2015] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In NeurIPS, pages 379–387, 2015.
[Li and Tang 2017] Q.W. Li and G.G. Tang. The nonconvex geometry of low-rank matrix optimizations with general objective functions. In GlobalSIP, 2017.
[Nesterov and Nemirovski 1994] Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
[Nocedal and Wright 2006] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2006.
[Obozinski et al. 2009] Obozinski et al. Statistics and Computing, 20(2):231–252, 2009.
[Rand 1971] W.M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
[Royer 2017] M. Royer. Adaptive clustering through semidefinite programming. In NeurIPS, pages 1793–1801, 2017.
[Song et al. 2008] L. Song, A. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In NeurIPS, 2008.
[Srebro et al. 2005] N. Srebro, J.D.M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NeurIPS 17, pages 1329–1336, 2005.
[Tropp et al. 2017] Tropp et al. In NeurIPS 30, pages 1225–1234, 2017.
[Weinberger et al. 2004] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, pages 839–846, 2004.
[Wu et al. 2009] Wu et al. In NeurIPS 22, pages 1964–1972, 2009.
[Xing et al. 2002] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NeurIPS, pages 505–512, 2002.
[Xu and Yin 2013] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
[Zhuang et al. 2011] J. Zhuang, I.W. Tsang, and S.C.H. Hoi. A family of simple non-parametric kernel learning algorithms. JMLR, 12:1313–1347, 2011.