[PDF] Large-scale Binary Quadratic Optimization Using Semidefinite Relaxation and Applications

Abstract

In computer vision, many problems such as image segmentation, pixel labelling, and scene parsing can be formulated as binary quadratic programs (BQPs). For submodular problems, cuts based methods can be employed to efficiently solve large-scale problems. However, general nonsubmodular problems are significantly more challenging to solve. Finding a solution when the problem is of large size to be of practical interest, however, typically requires relaxation. Two standard relaxation methods are widely used for solving general BQPs--spectral methods and semidefinite programming (SDP), each with their own advantages and disadvantages. Spectral relaxation is simple and easy to implement, but its bound is loose. Semidefinite relaxation has a tighter bound, but its computational complexity is high, especially for large scale problems. In this work, we present a new SDP formulation for BQPs, with two desirable properties. First, it has a similar relaxation bound to conventional SDP formulations. Second, compared with conventional SDP methods, the new SDP formulation leads to a significantly more efficient and scalable dual optimization approach, which has the same degree of complexity as spectral methods. We then propose two solvers, namely, quasi-Newton and smoothing Newton methods, for the dual problem. Both of them are significantly more efficiently than standard interior-point methods. In practice, the smoothing Newton solver is faster than the quasi-Newton solver for dense or medium-sized problems, while the quasi-Newton solver is preferable for large sparse/structured problems. Our experiments on a few computer vision applications including clustering, image segmentation, co-segmentation and registration show the potential of our SDP formulation for solving large-scale BQPs.

Full PDF

AAPPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 1

Large-scale Binary Quadratic OptimizationUsing Semideﬁnite Relaxation and Applications

Peng Wang , Chunhua Shen , Anton van den Hengel , Philip H. S. Torr Abstract —In computer vision, many problems can be formulated as binary quadratic programs (BQPs), which are in general NP hard.Finding a solution when the problem is of large size to be of practical interest typically requires relaxation. Semideﬁnite relaxationusually yields tight bounds, but its computational complexity is high. In this work, we present a semideﬁnite programming (SDP)formulation for BQPs, with two desirable properties. First, it produces similar bounds to the standard SDP formulation. Second,compared with the conventional SDP formulation, the proposed SDP formulation leads to a considerably more efﬁcient and scalabledual optimization approach. We then propose two solvers, namely, quasi-Newton and smoothing Newton methods, for the simpliﬁeddual problem. Both of them are signiﬁcantly more efﬁcient than standard interior-point methods. Empirically the smoothing Newtonsolver is faster than the quasi-Newton solver for dense or medium-sized problems, while the quasi-Newton solver is preferable for largesparse/structured problems.

Index Terms —Binary Quadratic Optimization, Semideﬁnite Programming, Markov Random Fields (cid:70) University of Adelaide, Australia; University of Oxford, UnitedKingdom. Correspondence should be addressed to C. Shen ([email protected]). a r X i v : . [ c s . C V ] M a y PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 2 C ONTENTS References Biographies

A.1 Preliminaries

A.2 Inexact Smoothing Newton Methods: Computational Details

PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 3

NTRODUCTION

Binary quadratic programs (BQPs) are a class of combinatorialoptimization problems with binary variables, quadratic objec-tive function and linear/quadratic constraints. They appear in awide variety of applications in computer vision, such as imagesegmentation/pixel labelling, image registration/matching, imagedenoising/restoration. Moreover, Maximum a Posteriori (MAP)inference problems for Markov Random Fields (MRFs) can beformulated as BQPs too. There are a long list of referencesto applications formulated as BQPs or speciﬁcally MRF-MAPproblems. Readers may refer to [1], [2], [3], [4], [5], [6] and thereferences therein for detailed studies.Unconstrained BQPs with submodular pairwise terms canbe solved exactly and efﬁciently using graph cuts [7], [8], [9].However solving general BQP problems is known to be NP-hard (see [10] for exceptions). In other words, it is unlikely toﬁnd polynomial time algorithms to exactly solve these problems.Alternatively, relaxation approaches can be used to produce afeasible solution close to the global optimum in polynomial time.In order to accept such a relaxation we require a guarantee thatthe divergence between the solutions to the original problem andthe relaxed problem is bounded. The quality of the relaxation thusdepends upon the tightness of the bounds. Developing an efﬁcientrelaxation algorithm with a tight relaxation bound that can achievea good solution (particularly for large problems) is thus of greatpractical importance. There are a number of relaxation methodsfor BQPs (in particular MRF-MAP inference problems) in theliterature, including linear programming (LP) relaxation [11], [12],[13], [14], quadratic programming relaxation [15], second ordercone relaxation [16], [17], [18], spectral relaxation [19], [20], [21],[22] and SDP relaxation [17], [23].Spectral methods are effective for many computer visionapplications, such as image segmentation [19], [20] and motionsegmentation [24]. The optimization of spectral methods even-tually lead to the computation of top eigenvectors. Nevertheless,spectral methods may produce loose relaxation bounds in manycases [25], [26], [27]. Moreover, the inherent quadratic program-ming formulation of spectral methods is difﬁcult to incorporatecertain types of additional constraints [21].SDP relaxation has been shown that it leads to tighter ap-proximation than other relaxation methods for many combinatorialoptimization problems [28], [29], [30], [31]. In particular for themax-cut problem, Goemans and Williamson [32] achieve the state-of-the-art . approximation ratio using SDP relaxation. SDPrelaxation has also been used in a range of vision problems,such as image segmentation [33], restoration [34], [35], graphmatching [36], [37] and co-segmentation [38]. In a standard SDPproblem, a linear function of a symmetric matrix X is optimized,subject to linear (in)equality constraints and the constraint of X being positive semideﬁnite (p.s.d.). The standard SDP problemand its Lagrangian dual problem are written as:(SDP-P) min X ∈ S n + p( X ) := (cid:104) X , A (cid:105) , (1) s . t . (cid:104) X , B i (cid:105) = b i , i ∈ I eq , (cid:104) X , B i (cid:105) ≤ b i , i ∈ I in , (SDP-D) max u ∈ R m d( u ) := − u (cid:62) b , (2) s . t . A + (cid:80) mi =1 u i B i ∈ S n + ,u i ≥ , i ∈ I in , where m = | I eq | + | I in | , and I eq ( I in ) denotes the indexes oflinear (in)equality constraints. The p.s.d. constraint X ∈ S n + isconvex, so SDP problems are convex optimization problems andthe above two formulations are equivalent if a feasible solutionexists. The SDP problem (1) can be considered as a semi-inﬁniteLP problem, as the p.s.d. constraint can be converted to an inﬁnite number of linear constraints: (cid:104) X , aa (cid:62) (cid:105) ≥ , ∀ a ∈ R n . ThroughSDP, these inﬁnite number of linear constraints can be handled in ﬁnite time.It is widely accepted that interior-point methods [39], [40]are very robust and accurate for general SDP problems up to amoderate size (see SeDuMi [41], SDPT3 [42] and MOSEK [43]for implementations). However, its high computational complex-ity and memory requirement hampers the application of SDPmethods to large-scale problems. Approximate nonlinear program-ming methods [44], [45], [46] are proposed for SDP problemsbased on low-rank factorization, which may converge to a localoptimum. Augmented Lagrangian methods [47], [48] and thevariants [49], [50] have also been developed. As gradient-descendbased methods [51], they may converge slowly. The spectralbundle method [52] and the log-barrier algorithm [53] can be usedfor large-scale problems as well. A drawback is that they can failto solve some SDP problems to satisfactory accuracies [48].In this work, we propose a regularized SDP relaxation ap-proach to BQPs. Preliminary results of this paper appeared in[54]. Our main contributions are as follows.1) Instead of directly solving the standard SDP relaxationto BQPs, we propose a quadratically regularized versionof the original SDP formulation, which can be solvedefﬁciently and achieve a solution quality comparable tothe standard SDP relaxation.2) We proffer two algorithms to solve the dual problem,based on quasi-Newton (referred to as SDCut-QN) andsmoothing Newton (referred to as SDCut-SN) methodsrespectively. The sparse or low-rank structure of speciﬁcproblems are also exploited to speed up the computation.The proposed solvers require much lower computationalcost and storage memory than standard interior-pointmethods. In particular, SDCut-QN has a lower compu-tational cost in each iteration while needs more iterationsto converge. On the other hand, SDCut-SN convergesquadratically with higher computational complexity periteration. In our experiments, SDCut-SN is faster fordense or medium-sized problems, and SDCut-QN is moreefﬁcient for large-scale sparse/structured problems.3) We demonstrate the efﬁciency and ﬂexibility of ourproposed algorithms by applying them to a variety ofcomputer vision tasks. We show that due to the capa-bility of accommodating various constraints, our meth-ods can encode problem-dependent information. Morespeciﬁcally, the formulation of SDCut allows multipleadditional linear and quadratic constraints, which enablesa broader set of applications than what spectral methodsand graph-cut methods can be applied to. Notation

A matrix (column vector) is denoted by a bold capital(lower-case) letter. R n denotes the space of real-valued n × vectors. R n + and R n − represent the non-negative and non-positiveorthants of R n respectively. S n denotes the space of n × n symmetric matrices, and S n + represents the corresponding coneof positive semideﬁnite (p.s.d.) matrices. For two vectors, x ≤ y PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 4 indicates the element-wise inequality; trace( X ) , rank( X ) and diag( X ) denote the trace, rank and the main diagonal elementsof X respectively. Diag( x ) denotes a diagonal matrix with theelements of vector x on the main diagonal. (cid:107) X (cid:107) F denotes theFrobenius norm of X . The inner product of two matrices is deﬁnedas (cid:104) X , Y (cid:105) . I n indicates the n × n identity matrix. and denoteall-zero and all-one column vectors respectively. ∇ f( · ) and ∇ f( · ) stand for the ﬁrst-order and second-order derivatives of function f( · ) respectively. S AND THEIR

SDP

RELAXATION

Let us consider a binary quadratic program of the following form: min x ∈{ +1 , − } n x (cid:62) A x + a (cid:62) x , (3a) s . t . x (cid:62) A i x + a (cid:62) i x = b i , i ∈ I eq , (3b) x (cid:62) A i x + a (cid:62) i x ≤ b i , i ∈ I in , (3c)where A i ∈ S n , a i ∈ R n , ∀ i ∈ I eq ∪ I in ; b ∈ R | I eq | + | I in | .Note that BQP problems can be considered as special casesof quadratically constrained quadratic program (QCQP), as theconstraint x ∈ { , − } n is equivalent to x i = 1 , ∀ i = 1 , · · · , n .Problems over x ∈ { , } n can be also expressed as { , − } -problems (3) by replacing x with y = 2 x − .Solving (3) is in general NP-hard, so relaxation methodsare considered in this paper. Relaxation to (3) can be done byextending the feasible set to a larger set, such that the optimalvalue of the relaxation is a lower bound on the optimal value of(3). The SDP relaxation to (3) can be expressed as: min x , X (cid:104) X , A (cid:105) + a (cid:62) x , (4a) s . t . diag( X ) = , (4b) (cid:104) X , A i (cid:105) + a (cid:62) i x = b i , i ∈ I eq , (4c) (cid:104) X , A i (cid:105) + a (cid:62) i x ≤ b i , i ∈ I in , (4d) (cid:104) x (cid:62) x X (cid:105) ∈ S n +1 . (4e)Note that constraint (4e) is equivalent to X − xx (cid:62) ∈ S n , which isthe convex relaxation to the nonconvex constraint X − xx (cid:62) = .In other words, (4) is equivalent to (3), by replacing constraint (4e)with X = xx (cid:62) or by adding the constraint rank( (cid:104) x (cid:62) x X (cid:105) ) = 1 .The objective function and constraints (apart from the p.s.d.constraint) of the problem (4) are all linear with respect to X = (cid:104) x (cid:62) x X (cid:105) , so (4) can be expressed in the homogenized formshown in (1) with respect to X . For simplicity, we consider thehomogeneous problem (1), instead of (4), in the sequel.Note that the SDP solution does not offer a feasible solution tothe BQP (3) directly, unless it is of rank . A rounding procedure isrequired to extract a feasible BQP solution from the SDP solution,which will be discussed in Section 4.3. UT F ORMULATION

A regularized SDP formulation is considered in this work:(SDCut-P) min X ∈ S n + p γ ( X ) := (cid:104) X , A (cid:105) + 12 γ (cid:107) X (cid:107) F , (5a) s . t . (cid:104) B i , X (cid:105) = b i , i ∈ I eq , (5b) (cid:104) B i , X (cid:105) ≤ b i , i ∈ I in , (5c) where γ > is a prescribed parameter (its practical value isdiscussed in Section 5.1).Compared to (1), the formulation (5) adds into the objectivefunction a Frobenius-norm term with respect to X . The reasons forchoosing this particular formulation are two-fold: i ) The solutionquality of (5) can be as close to that of (4) as desired by making γ sufﬁciently large. ii ) A simple dual formulation can be derivedfrom (5), which can be optimized using quasi-Newton or inexactgeneralized Newton approaches.In the following, a few desirable properties of (5) are demon-strated, where X (cid:63) denotes the optimal solution to (1) and X (cid:63)γ denotes the optimal solution to (5) with respect to γ . The proofscan be found in Section 7. Proposition 1.

The following results hold: ( i ) ∀ (cid:15) > , ∃ γ > such that | p( X (cid:63) ) − p( X (cid:63)γ ) | ≤ (cid:15) ; ( ii ) ∀ γ > γ > , we have p( X (cid:63)γ ) ≥ p( X (cid:63)γ ) . The above results show that the solution quality of (5) canbe monotonically improved towards that of (4), by making γ sufﬁciently large. Proposition 2.

The dual problem of (5) can be simpliﬁed to(SDCut-D) max u ∈ R m d γ ( u ) := − u (cid:62) b − γ (cid:107) Π S n + ( C ( u )) (cid:107) F , s . t . u i ≥ , i ∈ I in , (6) where C ( u ) := − A − (cid:80) mi =1 u i B i , and Π S n + ( C ( u )) := (cid:80) ni =1 max(0 , λ i ) p i p (cid:62) i .λ i , p i , i = 1 , · · · , n are eigenvalues and the correspondingeigenvectors of C ( u ) . Supposing problem (5) is feasible anddenoting u (cid:63) as the dual optimal solution, we have: X (cid:63) = γ Π S n + ( C ( u (cid:63) )) . (7)The simpliﬁed dual (6) is convex and contains only simplebox constraints. Furthermore, its objective function d γ ( · ) has thefollowing important properties. Proposition 3. d γ ( · ) is continuously differentiable but not neces-sarily twice differentiable, and its gradient is given by ∇ d γ ( u ) = − γ Φ (cid:104) Π S n + ( C ( u )) (cid:105) − b . (8) where Φ : S n → R m denotes the linear transformation Φ[ X ] :=[ (cid:104) B , X (cid:105) , · · · , (cid:104) B m , X (cid:105) ] (cid:62) . Based on the above result, the dual problem can be solvedby quasi-Newton methods directly. Furthermore, we also showin Section 4.2 that, the second-order derivatives of d γ ( · ) can besmoothed such that inexact generalized Newton methods can beapplied. Proposition 4. ∀ u ∈ R | I eq | × R | I in | + , ∀ γ > , d γ ( u ) − n γ yields alower-bound on the optimum of the BQP (3) . The above result is important as the lower-bound can be usedto examine how close between an approximate binary solution andthe global optimum.

PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 5

Considering the original SDP dual problem (2), we can ﬁnd thatits p.s.d. constraint, that is A + (cid:80) mi =1 u i B i ∈ S n + , is penalized in(6) by minimizing (cid:107) Π S n + ( C ( u )) (cid:107) F = (cid:80) ni =1 max(0 , λ i ) , where λ i , · · · , λ n are the eigenvalues of − A − (cid:80) mi =1 u i B i . The p.s.d.constraint is satisﬁed if and only if the penalty term equals to zero.Other forms of penalty terms may be employed in the dual.The spectral bundle method of [52] penalizes λ max ( X ) and thelog-barrier function is used in [53]. It is shown in [48] thatthese two ﬁrst-order methods may converge slowly for some SDPproblems. Note that the objective function of the spectral bundlemethods is not necessarily differentiable ( λ max ( · ) is differentiableif and only if it has multiplicity one). The objective functionof our formulation is differentiable and its twice derivatives canbe smoothed, such that classical methods can be easily used forsolving our problems, using quasi-Newton and inexact generalizedNewton methods.Consider a proximal algorithm for solving SDP with onlyequality constraints (see [47], [48], [49], [50], [55]): min Y ∈ S n (cid:0) G γ ( Y ) := min X ∈ S n + , Φ[ X ]= b (cid:104) X , A (cid:105) + 12 γ || X − Y || F (cid:1) , (9)where Φ[ X ] := [ (cid:104) B , X (cid:105) , · · · , (cid:104) B m , X (cid:105) ] (cid:62) . Our algorithm isequivalent to solving the inner problem, that is, evaluating G γ ( Y ) ,with a ﬁxed γ and Y = . In other words, our methodsattempt to solve the original SDP relaxation approximately, witha faster speed. After rounding, typically, the resulting solutionsof our algorithms are already close to those of the original SDPrelaxation.Our method is mainly motivated by the work of Shen etal . [56], which presented a fast dual SDP approach to Mahalanobismetric learning. They, however, focused on learning a real-valuedmetric for nearest neighbour classiﬁcation. Here, in contrast, weare interested in discrete combinatorial optimization problemsarising in computer vision. Krislock et al . [57] have independentlyformulated a similar SDP problem for the max-cut problem, whichis simpler than the problems that we solve here. Moreover, theyfocus on globally solving the max-cut problem using branch-and-bound. OLVING THE D UAL P ROBLEM

Based on Proposition 3, ﬁrst-order methods (for example gradientdescent, quasi-Newton), which only require the calculation of theobjective function and its gradients, can be directly applied tosolving (6). It is difﬁcult in employing standard Newton methods,however, as they require the calculation of second-order deriva-tives. In the following two sections, we present two algorithms forsolving the dual (6), which are based on quasi-Newton and inexactgeneralized Newton methods respectively.

One main advantage of quasi-Newton methods over Newtonmethods is that the inversion of the Hessian matrix is approx-imated by analyzing successive gradient vectors, and thus thatthere is no need to explicitly compute the Hessian matrix and itsinverse, which can be very expensive. Therefore the per-iterationcomputation cost of quasi-Newton methods is less than that ofstandard Newton methods.

Algorithm 1

SDCut-QN: Solving (6) using quasi-Newton methods.

Input : A , Φ , b , γ , u , K max , τ > . Step 1:

Solving the dual using L-BFGS-B for k = 0 , , , . . . , K max doStep 1.1: Compute ∇ d γ ( u k ) and update H . Step 1.2:

Compute the descent direction ∆ u = − H ∇ d γ ( u k ) . Step 1.3:

Find a step size ρ , and u k +1 = u k + ρ ∆ u . Step 1.4:

Exit, if (d γ ( u k +1 ) − d γ ( u k ))max {| d γ ( u k +1 ) | , | d γ ( u k ) | , } ≤ τ . Step 2: u (cid:63) = u k +1 , X (cid:63) = γ Π S n + ( C ( u (cid:63) )) . Step 3: x (cid:63) = Round( X (cid:63) ) . Output : x (cid:63) , u (cid:63) , upper-bound: p( x (cid:63) x (cid:63) (cid:62) ) and lower-bound: d γ ( u (cid:63) ) − n γ . The quasi-Newton algorithm for (6) (referred to as SDCut-QN)is summarized in Algorithm 1. In Step 1, the dual problem (6) issolved using L-BFGS-B [58], which only requires the calculationof the dual objective function (6) and its gradient (8). At eachiteration, a descent direction for ∆ u is computed based on thegradient ∇ d γ ( u ) and the approximated inverse of the Hessianmatrix: H ≈ ( ∇ d γ ( u )) − . A step size ρ is found using linesearch. The algorithm is stopped when the difference betweensuccessive dual objective values is smaller than a pre-set tolerance.After solving the dual using L-BFGS-B, the primal optimalvariable X (cid:63) is calculated from the dual optimal u (cid:63) based onEquation (7) in Step 2.Finally in Step 3, the primal optimal variable X (cid:63) is discretizedand factorized to produce the feasible binary solution x (cid:63) , whichwill be described in Section 4.3.Now we have an upper-bound and a lower-bound (see Propsi-tion 4) on the optimum of the original BQP (3) (referred to as p (cid:63) ): p( x (cid:63) x (cid:63) (cid:62) ) ≥ p (cid:63) ≥ d γ ( u (cid:63) ) − n γ . These two values are used tomeasure the solution quality in the experiments. As d γ ( u ) is a concave function, the dual problem (6) is equivalentto ﬁnding u (cid:63) ∈ D such that (cid:104) u − u (cid:63) , −∇ d γ ( u (cid:63) ) (cid:105) ≥ , ∀ u ∈ D ,which is known as variational inequality [59]. D := R | I eq | × R | I in | + is used to denote the feasible set of the dual problem. Thus (6) isalso equivalent to ﬁnding a root of the following equation: F( u ) := u − Π D (cid:0) u − γ Φ (cid:104) Π S n + ( C ( u )) (cid:105) − b (cid:1) = , u ∈ R m , (10)where [Π D ( v )] i := (cid:110) v i if i ∈ I eq max(0 , v i ) if i ∈ I in can be consideredas a metric projection from R m to R | I eq | × R | I in | + . Note that F( u ) is continuous but not continuously differentiable, as both Π D and Π S n + have the same smoothness property. Therefore, standardNewton methods cannot be applied directly to solving (10). In thiswork, we use the inexact smoothing Newton method in [60] tosolve the smoothed Newton equation: E( (cid:15), u ) := (cid:104) (cid:15) ; ˜F( (cid:15), u ) (cid:105) = , ( (cid:15), u ) ∈ R × R m , (11)where ˜F( (cid:15), u ) is a smoothing function of F( u ) , which is con-structed as follows.Firstly, the smoothing functions for Π D and Π S n + are respec-tively written as: (cid:104) ˜Π D ( (cid:15), v ) (cid:105) i := (cid:26) v i if i ∈ I eq ,φ ( (cid:15), v i ) if i ∈ I in , ( (cid:15), v ) ∈ R × R m , (12) ˜Π S n + ( (cid:15), X ) := n (cid:88) i =1 φ ( (cid:15), λ i ) p i p (cid:62) i , ( (cid:15), X ) ∈ R × S n , (13) PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 6

Algorithm 2

SDCut-SN: Solving (6) using smoothing Newton methods.

Input : A , Φ , b , γ , u , (cid:15) , K max , τ > , µ ∈ (0 , , ρ ∈ (0 , . Step 1:

Solving the dual using smoothing Newton methods for k = 0 , , , . . . , K max doStep 1.1: ¯ (cid:15) ← (cid:15) k or µ(cid:15) k . Step 1.2:

Solve the following linear system up to certain accuracy E( (cid:15) k , u k ) + ∇ E( (cid:15) k , u k ) [∆ (cid:15) k ; ∆ u k ] = [¯ (cid:15) ; ] . (16) Step 1.3:

Line Search l = 0 ; while (cid:107) E( (cid:15) k + ρ l ∆ (cid:15) k , u k + ρ l ∆ u k ) (cid:107) ≥ (cid:107) E( (cid:15) k , u k ) (cid:107) do l = l + 1 ; (cid:15) k +1 = (cid:15) k + ρ l ∆ (cid:15) k , u k +1 = u k + ρ l ∆ u k . Step 1.4: If | d γ ( u k +1 ) − d γ ( u k ) | max {| d γ ( u k +1 ) | , | d γ ( u k ) | , } ≤ τ , break. Step 2: u (cid:63) = u k +1 , X (cid:63) = γ Π S n + ( C ( u (cid:63) )) . Step 3: x (cid:63) = Round( X (cid:63) ) . Output : x (cid:63) , u (cid:63) , upper-bound: p( x (cid:63) x (cid:63) (cid:62) ) and lower-bound: d γ ( u (cid:63) ) − n γ . Algorithm 3

Randomized Rounding Procedure: x (cid:63) = Round( X (cid:63) ) Input : The SDP solution X (cid:63) , which is decomposed to a set of vectors v . . . v n ∈ R r where r = rank( X (cid:63) ) . for k = 0 , , , . . . , K doStep 1: Random sampling: obtain a real -dimensional vector z =[ v . . . v n ] (cid:62) y , where y ∼ N ( , I r ) . Step 2:

Discretization: z is discretized to a feasible BQP solution(see Table 2 for problem-speciﬁc methods). Output : x (cid:63) is assigned to the best feasible solution. where λ i and p i are the i th eigenvalue and the correspondingeigenvector of X . φ ( (cid:15), v ) is the Huber smoothing function thatwe adopt here to replace max(0 , v ) : φ ( (cid:15), v ) :=  v if v > . (cid:15), ( v + 0 . (cid:15) ) / (cid:15), if − . (cid:15) ≤ v ≤ . (cid:15), if v < − . (cid:15). (14)Note that at (cid:15) = 0 , φ ( (cid:15), v ) = max(0 , v ) , ˜Π D ( (cid:15), v ) = Π D ( v ) and ˜Π S n + ( (cid:15), X ) = Π S n + ( X ) . φ , ˜Π D , ˜Π S n + are Lipschitz continu-ous on R , R × R m , R × S n respectively, and they are continuouslydifferentiable when (cid:15) (cid:54) = 0 . Then ˜F( (cid:15), u ) is deﬁned as: ˜F( (cid:15), u ) := u − ˜Π D (cid:16) (cid:15), u − γ Φ (cid:104) ˜Π S n + ( (cid:15), C ( u )) (cid:105) − b (cid:17) , (15)which has the same smoothness property as ˜Π D and ˜Π S n + .The presented inexact smoothing Newton method (referred toas SDCut-SN) is shown in Algorithm 2. In Step 1.2, the Newtonlinear system (16) is solved approximately using conjugate gradi-ent (CG) methods when | I in | = 0 and using biconjugate gradientstabilized (BiCGStab) methods [61] otherwise. In Step 1.3, wecarry out a search in the direction [∆ (cid:15) k ; ∆ u k ] for an appropriatestep size ρ l such that the norm of E( (cid:15), u ) is decreased. In this section, we describe a randomized rounding procedure (seeAlgorithm 3) for obtaining a feasible binary solution from therelaxed SDP solution X (cid:63) .Suppose that X (cid:63) is decomposed into a set of r -dimensionalvectors v . . . v n , such that X (cid:63)ij = v (cid:62) i v j . This decompositioncan be easily obtained through the eigen-decomposition of X (cid:63) : X = VV (cid:62) and V = [ v . . . v n ] (cid:62) . We can see that these vectorsreside on the r -dimensional unit sphere S r := { v ∈ R r , v (cid:62) v =1 } , and the angle between two vectors v i and v j deﬁnes howlikely the corresponding two variables x i and x j will be separated(assigned with different labels). To transform these vectors into binary solutions, they are ﬁrstly projected onto a random -dimensional line y ∼ N ( , I r ) in Step of Algorithm 3,that is, z = [ v . . . v n ] (cid:62) y . Note that Step is equivalent tosampling z from the Gaussian distribution N ( , X (cid:63) ) , which has aprobabilistic interpretation [62], [63]: X (cid:63) is the optimal solutionto the problem min Σ E z ∼ N ( , Σ) [ z (cid:62) Az ] , (17) s . t . E z ∼ N ( , Σ) [ z (cid:62) B i z ] = b i , i ∈ I eq , E z ∼ N ( , Σ) [ z (cid:62) B i z ] ≤ b i , i ∈ I in , where Σ denotes a covariance matrix. The proof is simple: since E z ∼ N ( , Σ) [ z (cid:62) Az ] = (cid:80) i,j A ij E z ∼ N ( , Σ) [ z i z j ] = (cid:80) i,j A ij Σ ij for any A ∈ S n , (17) is equivalent to (1). In other words, z solvesthe BQP in expectation. As the eigen-decomposition of X (cid:63) isalready known when computing Π S n + ( C ( u (cid:63) )) at the last descentstep, there is no extra computation for obtaining v . . . v n . Dueto the low-rank structure of SDP solutions (see Section 4.4), thecomputational complexity of sampling z is linear in the numberof variables n .Note that the above random sampling procedure does notguarantee that a feasible solution can always be found. In partic-ular, this procedure will certainly fail when equality constraintsare imposed on the problems [62]. But for all the problemsconsidered in this work, each random sample z can be discretizedto a “nearby” feasible solution (Step of Algorithm 3). Thediscretization step is problem dependant, which is discussed inTable 2. In this section, we discuss several techniques for the eigen-decompostion of C ( u ) , which is one of the computational bot-tleneck for our algorithms. Low-rank Solution

In our experiments, we observe that theﬁnal p.s.d. solution typically has a low-rank structure and r =rank(Π S n + ( C ( u ))) usually decreases sharply such that r (cid:28) n for most of descent iterations in both our algorithms. Actually,it is known (see [64] and [65]) that any SDP problem with m linear constraints has an optimal solution X (cid:63) ∈ S n + , such that rank( X (cid:63) )(rank( X (cid:63) ) + 1) / ≤ m . It means that the rank of X (cid:63) is roughly bounded by √ m . Then Lanczos methods can be usedto efﬁciently calculate the r positive eigenvalues of C ( u ) andthe corresponding eigenvectors. Lanczos methods rely only on theproduct of the matrix C ( u ) and a column vector. This simpleinterface allows us to exploit speciﬁc structures of the coefﬁcientmatrices A and B i , i = 1 , · · · , m . Speciﬁc Problem Structure

In many cases, A and B i are sparseor structured. Such that the computational complexity and memoryrequirement of the matrix-vector product with respect to C ( u ) canbe considered as linear in n , which are assumed as O ( nt ) and O ( nt ) respectively. The iterative Lanczos methods are faster thanstandard eigensolvers when r (cid:28) n and C ( u ) is sparse/structured,which require O ( nr + nt r ) ﬂops and O ( nr + nt ) bytes ateach iteration of Lanczos factorization, given that the numberof Lanczos basis vectors is set to a small multiple ( ∼ ) of r . ARPACK [66], an implementation of Lanczos algorithms, isemployed in this work for the eigen-decomposition of sparse orstructured matrices. The DSYEVR function in LAPACK [67] isused for dense matrices. PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 7

Warm Start

A good initial point is crucial for the convergencespeed of iterative Lanczos methods. In quasi-Newton and smooth-ing Newton methods, the step size ∆ u = u k +1 − u k tendsto decrease with descent iterations. It means that C ( u k +1 ) and C ( u k ) may have similar eigenstructures, which inspires us to usea random linear combination of eigenvectors of C ( u k ) as thestarting point of the Lanczos process for C ( u k +1 ) . Parallelization

Due to the importance of eigen-decomposition, itsparallelization has been well studied and there are several off-the-shelf parallel eigensolvers (such as SLEPc [68], PLASMA [69]and MAGMA [70]). Therefore, our algorithms can also be easilyparallelized by using these off-the-shelf parallel eigensolvers.

SDCut-QN

In general, quasi-Newton methods converge superlin-early given that the objective function is at least twice differen-tiable (see [71], [72], [73]). However, the dual objective functionin our case (6) is not necessarily twice differentiable. So thetheoretical convergence speed of SDCut-QN is unknown .At each iteration of L-BFGS-B, both of the computationalcomplexity and memory requirement of L-BFGS-B itself are O ( m ) . The only computational bottleneck of SDCut-QN is on thecomputation of the projection Π S n + ( C ( u )) , which is discussed inSection 4.4. SDCut-SN

The inexact smoothing Newton method SDCut-SNis quadratically convergent under the assumption that the con-straint nondegenerate condition holds at the optimal solution(see [60]). There are two computationally intensive aspects ofSDCut-SN: i ). the CG algorithms for solving the linear sys-tem (16). In the appendix, we show that the Jacobian-vectorproduct requires O ( m + n r ) ﬂops at each CG iteration, where r = rank(Π S n + ( C ( u ))) . ii ). All eigenpairs of C ( u ) are neededto obtain Jacobian matrices implicitly, which takes O ( n ) ﬂopsusing DSYEVR function in LAPACK.From Table 1, we can see that the computational costs andmemory requirements for both SDCut-QN and SDCut-SN arelinear in m , which means that our methods are much more scalableto m than interior-point methods. In terms of n , our methods isalso more scalable than interior-point methods and comparableto spectral methods. Especially for sparse/structured matrices, thecomputational complexity of SDCut-QN is linear in n . As SDCut-SN cannot signiﬁcantly beneﬁt from sparse/structured matrices, itneeds more time than SDCut-QN in each descent iteration for suchmatrices. However, SDCut-SN has a fast convergence rate thanSDCut-QN. In the experiment section, we compare the speeds ofSDCut-SN and SDCut-QN in different cases. PPLICATIONS

We now show how we can attain good solutions on various visiontasks with the proposed methods. The two proposed methods,SDCut-QN and SDCut-SN, are evaluated on several computervision applications. The BQP formulation of different applicationsand the corresponding rounding heuristics are demonstrated inTable 2. The corresponding SDP relaxation can be obtained basedon (4). In the experiments, we also compare our methods to spec-tral methods [19], [20], [21], [22], graph cuts based methods [7],[8], [9] and interior-point based SDP methods [41], [42], [43].The upper-bounds (that is, the objective value of BQP solutions)and the lower-bounds (on the optimal objective value of BQPs) −2 −1 0 1 2−2−1012 −2 −1 0 1 2−2−1012 −2 −1 0 1 2−2−1012 −2 −1 0 1 2−2−1012

Original data NCut RatioCut SDCut-QN

Fig. 1:

Results for -demensional points bisection. The resulting two classesof points are shown in red ‘+’ and blue ‘ ◦ ’ respectively. SDCut-QN succeedsin clustering the points as desired, while both RatioCut and NCut failed inthese two cases. achieved by different methods are demonstrated, and the runtimesare also compared.The code is written in Matlab, with some key subroutinesimplemented in C/MEX. We have used the L-BFGS-B [58] for theoptimization in SDCut-QN. All of the experiments are evaluatedon a core of Intel Xeon E - . GHz CPU ( MB cache). Themaximum number of descent iterations of SDCut-QN and SDCut-SN are set to and respectively. As shown in Algorithm 1and Algorithm 2, the same stopping criterion is used for SDCut-QN and SDCut-SN, and the tolerance τ is set to eps where eps is the machine precision. The initial values of the dual variables u i , i ∈ I eq are set to , and u i , i ∈ I in are set to a small positivenumber. The selection of parameter γ will be discussed in the nextsection. Graph bisection is a problem of separating the nodes of a weightedgraph G = ( V , E ) into two disjoint sets with equal cardinality,while minimizing the total weights of the edges being cut. V denotes the set of nodes and E denotes the set of non-zero edges.The BQP formulation of graph bisection can be found in (18) ofTable 2. To enforce the feasibility (two partitions with equal size),the randomized score vector z in Algorithm 3 is dicretized bythresholding the median value (see Table 2).To show that the proposed SDP methods have better solutionquality than spectral methods we compare the graph-bisectionresults of RatioCut [76], Normalized Cut (NCut) [19] and SDCut-QN on two artiﬁcial 2-dimensional datasets.As shown in Fig. 1, the ﬁrst data set (the ﬁrst row) contains twosets of points with different densities, and the second set containsan outlier. RatioCut and NCut fail to offer satisfactory results onboth of the data sets, possibly due to the poor approximation ofspectral relaxation. In contrast, our SDCut-QN achieves desiredresults on these data sets.Secondly, to demonstrate the impact of the parameter γ ,we test SDCut-QN and SDCut-SN on a random graph with γ ranging from to ( A and B i in (1) are scaled suchthat (cid:107) A (cid:107) F = (cid:107) B (cid:107) F = 1 ). The graph is generated with vertices and all possible edges are assigned a non-zero weightuniformly sampled from (0 , . As the resulting afﬁnity matricesare dense, the DSYEVR routine in LAPACK package is usedfor eigen-decomposition. In Fig. 2, we show the upper-bounds, PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 8

Algorithms Convergence Eigen-solver Computational Complexity Memory RequirementSDCut-QN DenseSparse/Structured unknown LAPACK-DSYEVRARPACK O ( m + n ) O ( m ) + O ( nr + nt r ) × Lanczos-iters O ( m + n ) O ( m + nr + nt ) SDCut-SN quadratic LAPACK-DSYEVR O ( n ) + O ( m + n r ) × CG-iters O ( m + n ) Interior Point Methods quadratic − O ( m + mn + m n ) O ( m + n ) TABLE 1:

The comparison of our algorithms and interior-point algorithms on convergence rate, computational complexity and memory requirement. SDCut-QNis considered in two cases: the matrix C ( u ) is dense or sparse/structured and different eigen-solvers are applied. n and m denotes the primal p.s.d. matrix sizeand the number of dual variables. The deﬁnition of r , t and t can be found in Section 4.4. Application BQP formulation CommentsGraphbisection(Sec. 5.1) min x ∈{− , } n − x (cid:62) Wx , (18a) s . t . x (cid:62) = 0 . (18b) W ij = (cid:26) exp( − d ij /σ ) if ( i, j ) ∈ E , otherwise, where d ij denotes the Euclidean distancebetween i and j . Discretization : x = sign( z − median( z )) .Imagesegmentationwith partialgroupingconstraints(Sec. 5.2) min x ∈{− , } n − x (cid:62) Wx , (19a) s . t . ( s (cid:62) f x ) ≥ κ , (19b) ( s (cid:62) b x ) ≥ κ , (19c) − x (cid:62) ( s f s (cid:62) b + s b s (cid:62) f ) x ≥ κ . (19d) W ij = (cid:40) exp (cid:16) −(cid:107) f i − f j (cid:107) /σ f − d ij /σ d (cid:17) if d ij in this work: min x ∈{− , } n x (cid:62) ( α B − W ) x + α a (cid:62) x , s . t . (20c). Constraint (20c) is used to avoid trivialsolutions. Discretization : x = sign( z − θ ) . See (26) for the computation of the threshold θ .Image co-segmentation(Sec. 5.3) min x ∈{− , } n x (cid:62) Ax , (21a) s . t . ( x (cid:62) t i ) ≤ κ n i , i = 1 , . . . , s. (21b) The deﬁnition of A can be found in [38]. s is the number of images, n i is the number ofpixels for i -th image, and n = (cid:80) si =1 n i . t i ∈ { , } n is the indicator vector for the i -thimage. κ ∈ (0 , . Discretization : see (27).Graphmatching(Sec. 5.4) min x ∈{ , } KL h (cid:62) x + x (cid:62) Hx , (22a) s . t . (cid:80) Lj =1 x ( i − L + j = 1 , i =1 ,. . . ,K, (22b) (cid:80) Ki =1 x ( i − L + j ≤ , j =1 ,. . . ,L. (22c) x ( i − L + j =1 if the i -th source point is matched to the j -th target point; otherwise it equalsto . h ( i − L + j records the local feature similarity between source point i and target point j ; H ( i − L + j, ( k − L + l = exp( − (d ij − d kl ) /σ ) encodes the structural consistencyof source point i , j and target point k , l . See [37] for details. Discretization : see (29).Imagedeconvolution(Sec. 5.5) min x ∈{ , } n (cid:107) q − Kx (cid:107) + S( x ) . (23) K is the convolution matrix corresponding to the blurring kernel k ; S denotes the smoothnesscost; x and q represent the input image and the blurred image respectively. See [74] fordetails. Discretization : x = (sign( z ) + ) / .Chinesecharacterinpainting(Sec. 5.6) min x ∈{− , } n h (cid:62) x + x (cid:62) Hx . (24) The unary terms ( h ∈ R n ) and pairwise terms ( H ∈ R n × n ) are learned using decision treeﬁelds [75]. Discretization : x = sign( z ) . TABLE 2:

BQP formulations for different applications considered in this paper. The discretization step in Algorithm 3 for each application is also described. lower-bounds, number of iterations and time achieved by SDCut-QN and SDCut-SN, with respect to different values of γ . Thereare several observations: i ) With the increase of γ , upper-boundsbecome smaller and lower-bounds become larger, which implies atighter relaxation. ii ) Both SDCut-QN and SDCut-SN take moreiterations to converge when γ is larger. iii ) SDCut-SN uses feweriterations than SDCut-QN. The above observations coincide withthe analysis in Section 4.5. Using a larger parameter γ yields bettersolution quality, but at the cost of slower convergence speed. Thechoice of a good γ is data dependant. To reduce the difﬁculty ofthe choice of γ , the matrices A and B i of Equation (1) are scaledsuch that the Frobenius norm is in the following experiments.Thirdly, experiments are performed to evaluate another twofactors affecting the speed of our methods: the sparsity of theafﬁnity matrix W and the matrix size n . The numerical resultscorresponding to dense and sparse afﬁnity matrices are shownin Table 3 and Table 4 respectively. The sparse afﬁnity matrices are generated from random graphs with -neighbour connection.In these experiments, the size of matrix W is varied from to . ARPACK is used by SDCut-QN for partial eigen-decomposition of sparse problems, and DSYEVR is used forother cases. For both SDCut-QN and SDCut-SN, the numberof iterations does not grow signiﬁcantly with the increase of n .However, the running time is still correlated with n , since aneigen-decompostion of an n × n matrix needs to be computedat each iteration for both of our methods. We also ﬁnd that thesecond-order method SDCut-SN uses signiﬁcantly fewer iterationsthan the ﬁrst-order method SDCut-QN. For dense afﬁnity matrices,SDCut-SN runs consistently faster than SDCut-QN. In contrast forsparse afﬁnity matrices, SDCut-SN is only faster than SDCut-QNon problems of size up to n ≥ . That is because the Lanc-zos method used by SDCut-QN (for partial eigen-decompostion)scales much better for large sparse matrices than the standard fac-torization method (DSYEVR) used by SDCut-SN (for full eigen- PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 9

CPU CPU+GPUTime/Iters h m/ . m / . Upper-bound .

42 20 . Lower-bound − . − . TABLE 5:

Graph bisection on large dense graphs ( n = 10000 , m = 10001 ).SDCut-QN is tested on core of Intel Xeon E - . GHz CPU ( MBcache) and a workstation with Intel Xeon E - . GHz CPU ( coresand MB cache) and NVIDIA Tesla K c GPU. A -fold speedup isachieved by using CPU+GPU compared with using CPU only.

100 500 1000 5000 10000−6−4−20246

Parameter γ U pp e r - b o und / L o w e r - b o und I t e r a t i o n s Upper−boundLower−boundIterations SDCut−SNIterations SDCut−QN

Fig. 2:

Results for graph bisection with different values of the parameter γ .The illustrated results are averaged over random graphs. Upper-boundsand lower-bounds achieved by SDCut-QN are shown in this ﬁgure (those ofSDCut-SN is very similar and thus omitted). The relaxation becomes tighter(that is, upper-bounds and lower-bounds are closer) for larger γ . The numberof iterations for both SDCut-SN and SDCut-QN grows with the increase of γ . decomposition). The upper-/lower-bounds yielded by our methodsare similar to those of the interior-point methods. Meanwhile,NCut and RatioCut run much faster than other methods, but offersigniﬁcantly worse upper-bounds.Finally, we evaluate SDCut-QN on a large dense graphwith nodes. The speed performance is compared ona single CPU core (using DSYEVR function of LAPACK aseigensolver) and a hybrid CPU+GPU workstation (using theDSYEVDX STAGE function of MAGMA as eigensolver). Theresults are shown in Table 5 and we can see that the parallelizationbrings a -fold speedup over running on a single CPU core. Thelower-/upper-bounds are almost identical as there is no differenceapart from the implementation of eigen-decompostion. We consider image segmentation with two types of quadraticconstraints (with respect to x ): partial grouping constraints [20]and histogram constraints [77]. The afﬁnity matrix W is sparse,so ARPACK is used by SDCut-QN for eigen-decomposition.Besides interior-point SDP methods, we also compare ourmethods with graph-cuts [7], [8], [9] and two constrained spectralclustering method proposed by Maji et al . [78] (referred to asBNCut) and Wang and Davidson [79] (referred to as SMQC).BNCut and SMQC can encode only one quadratic constraint, butit is difﬁcult (if not impossible) to generalize them to multiplequadratic constraints. Partial Grouping Constraints

The corresponding BQP formula-tion is Equation (19) in Table 2. A feasible solution x to (19) can Methods SDCut-QN SeDuMi SDPT3Time . s m s m sUpper-bound − . − . − . TABLE 6:

Numerical results for image segmentation with partial groupingconstraints. Time and upper-bound are the means over the ﬁve images in Fig. 3.SDCut-QN runs times faster than SeDuMi and SDPT3, and offers a similarupper-bound. Methods SDCut-QN SDCut-SN SeDuMi SDPT3 MOSEK GC SMQCTime/Iters . s/ . . s / . m s m s m s . s . sF-measure .

930 0 .

925 0 .

928 0 .

926 0 .

928 0 .

722 0 . Upper-bound − . − . − . − . − . − − Lower-bound − . − . − . − . − . − − TABLE 7:

Image segmentation with histogram constraints. Results are theaverage of the eight images shown in Fig. 4. SDCut-SN uses fewer iterationsthan SDCut-QN and is faster than all other SDP based methods. Graph cutsand SMQC exhibit worse F-measure scores than SDP based methods. obtained from any random sample z as follows: x i =  sign( z i − θ f ) if ( s f ) i > , sign( z i − θ b ) if ( s b ) i > , sign( z i ) otherwise , (25)where θ f and θ b are chosen from [min( { z i | ( s f ) i > } ) , + ∞ ) and ( −∞ , max( { z i | ( s b ) i > } ] respectively. Note that for anysample z , x is feasible if θ f = min( { z i | ( s f ) i > } ) and θ b =max( { z i | ( s b ) i > } ) .Fig. 3 illustrates the result for image segmentation with partialgrouping constraints on the Berkeley dataset [80]. All the testimages are over-segmented into about superpixels. We ﬁndthat BNCut did not accurately segment foreground, as it onlyincorporates a single set of grouping pixels (foreground). Incontrast, our methods are able to accommodate multiple sets ofgrouping pixels and segment the foreground more accurately. InTable 6, we compare the CPU time and the upper-bounds ofSDCut-QN, SeDuMi and SDPT3. SDCut-QN achieves objectivevalues similar to that of SeDuMi and SDPT3, yet is over timesfaster. Histogram Constraints

Given a random sample z , a feasiblesolution x to the corresponding BQP formulation (20) can beobtained through x = sign( z − θ ) , where θ =  if | (cid:62) sign( z ) | ≤ κn, (˜ z (cid:98) n + κn (cid:99) + ˜ z (cid:98) n + κn (cid:99) +1 ) / if (cid:62) sign( z ) > κn, (˜ z (cid:100) n − κn (cid:101) + ˜ z (cid:100) n − κn (cid:101) +1 ) / if (cid:62) sign( z ) < − κn, (26)and ˜ z is obtained by sorting z in descending order. For graphcuts methods, the histogram constraint is encoded as unaryterms: ϕ i = − ln (cid:0) Pr( f i | fore ) / Pr( f i | back ) (cid:1) , i = 1 , , . . . , n . Pr( f i | fore ) and Pr( f i | back ) are probabilities for the color of the i th pixel belonging to foreground and background respectively.Fig. 4 and Table 7 demonstrate the results for image segmen-tation with histogram constraints. We can see that unary terms(the second row in Fig. 4) are not ideal especially when thecolor distribution of foreground and background are overlapped.For example in the ﬁrst image, the white collar of the personin the foreground have similar unary terms with the white wallin the background. The fourth row of Fig. 4 shows that theunsatisfactory unary terms degrade the segmentation results ofgraph cuts methods signiﬁcantly.The average F-measure of all evaluated methods are reportedin Table 7. Our methods outperforms graph cuts and SMQC interms of F-measure. As for the running time, SDCut-SN is faster PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 10 n , m Methods SDCut-QN SDCut-SN SeDuMi SDPT3 MOSEK NCut RatioCut , Time/Iters . s/ . . / . . s . s . s . s . sUpper-bound .

03 1 .

04 1 .

03 1 .

04 1 .

82 4 . Lower-bound − . − . − . − . − . − − , Time/Iters . s/ . . / . m s . s . s . s . sUpper-bound .

94 2 .

96 2 .

93 2 .

92 2 .

93 4 .

01 9 . Lower-bound − . − . − . − . − . − − , Time/Iters . s/ . . / . m s † m s . s . sUpper-bound .

06 5 .

10 5 . † .

04 6 .

10 13 . Lower-bound − . − .

19 0 . † . − − , Time/Iters m s/ . . / . m s † m s . s . sUpper-bound .

02 7 .

99 7 . † .

95 9 .

00 20 . Lower-bound − . − .

18 0 . † . − − , Time/Iters m s/ . m s/ . h m † h m . s . sUpper-bound .

89 13 .

87 13 . † .

60 14 .

91 33 . Lower-bound − . − .

32 0 . † . − − TABLE 3:

Numerical results forgraph bisection with dense afﬁnitymatrices. All the results are the av-erage over random graphs. SDPbased methods (the left ﬁve columns)achieve better upper-bounds thanspectral methods (NCut and Ratio-Cut). SDCut-SN uses fewer itera-tions than SDCut-QN and achievesthe fastest speed of the ﬁve SDPbased methods. † denotes the caseswhere SDPT3 fails to output feasiblesolutions. n , m Methods SDCut-QN SDCut-SN SeDuMi SDPT3 MOSEK NCut RatioCut , Time/Iters . s/ . . / . . s . s . s . s . sUpper-bound − . − . − . − . − .

57 8 . − . Lower-bound − . − . − . − . − . − − , Time/Iters . s/ . . / . m s . s . s . s . sUpper-bound .

65 0 .

64 0 .

65 0 .

64 0 .

64 19 .

20 0 . Lower-bound − . − . − . − . − . − − , Time/Iters . s/ . . / . m s † m s . s . sUpper-bound .

35 1 .

35 1 . † .

34 28 .

32 1 . Lower-bound .

25 0 .

25 0 . † . − − , Time/Iters m s/ . m s/ . m s † m s . s . sUpper-bound .

43 2 .

43 2 . † .

41 41 .

18 2 . Lower-bound .

01 1 .

01 1 . † . − − , Time/Iters m s/ . m s/ . h m † h m . s . sUpper-bound .

00 3 .

99 3 . † .

95 64 .

98 4 . Lower-bound .

24 2 .

24 3 . † . − − TABLE 4:

Numerical results forgraph bisection with sparse afﬁn-ity matrices. All the results are theaverage over random graphs.The upper-bounds achieved by SDPbased methods are close to eachother and signiﬁcantly better thanspectral methods (NCut and Ratio-Cut). The number of iterations forSDCut-SN is much less than SDCut-QN. For problems with n ≤ ,SDCut-SN is faster than SDCut-QN.While for larger problems ( n ≥ ), SDCut-QN achieves fasterspeeds than SDCut-SN. † denotesthe cases where SDPT3 fails to out-put feasible solutions. I m a g e s B N C u t S D C u t - QN Fig. 3:

Image segmentation with partial grouping constraints. The top row shows the original images with labelled foreground (red markers) and background (blue markers) pixels. SDCut-QN achieves signiﬁcantly better results than BNCut. The results of SeDuMi and SDPT3 are omitted, as they aresimilar to those of SDCut-QN. than all other SDP-based methods (that is, SDCut-QN, SeDuMi,SDPT3 and MOSEK). As expected, SDCut-SN uses much less( / ) iterations than SDCut-QN. SDCut-QN and SDCut-SN havecomparable upper-bounds and lower-bounds than interior-pointmethods.From Table 7, we can ﬁnd that SMQC is faster than ourmethods. However, SMQC does not scale well to large problemssince it needs to compute full eigen-decomposition. We alsotest SDCut-QN and SMQC on problems with a larger numberof superpixels ( ). Both of the algorithms achieve similarsegmentation results, but SDCut-QN is much faster than SMQC( m s vs . h m). The task of image co-segmentation [38] aims to partition a com-mon object from multiple images simultaneously. In this work, theWeizman horses and MSRC datasets are tested. There are ∼ images in each of four classes, namely “car-front”, “car-back”,“face” and “horse”. Each image is oversegmented to ∼ superpixels. The number of binary variables n is then increased to ∼ . PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 11 I m a g e s G T G r a ph c u t s un a r y G r a ph c u t s S M Q C S D C u t - QN Fig. 4:

Image segmentation with histogram constraints (coarse over-segmentation). The number of superpixels is around . From top to bottom are: originalimages, ground-truth (GT), superpixels, unary terms for graph-cuts, results of graph-cuts, SMQC and SDCut-QN. Results for other SDP based methods aresimilar to that of SDCut-QN and thus omitted. Graph cuts tends to mix together the foreground and background with similar color. SDCut-QN achieves the bestsegmentation results.

The BQP formulation for image co-segmentation can be foundin Table 2 (see [38] for details). The matrix A can be decomposedinto a sparse matrix and a structural matrix, such that ARPACKcan be used by SDCut-QN. Each vector z = [ z (1) (cid:62) , · · · , z ( K ) (cid:62) ] (cid:62) (where z i corresponds to the i -th image) randomly sampled from N ( , X (cid:63) ) is discretized to a feasible BQP solution as follows: x = (cid:104) sign( z (1) − θ (1) ) (cid:62) , · · · , sign( z ( K ) − θ ( K ) ) (cid:62) (cid:105) (cid:62) , (27)where θ ( i ) can be obtained as (26).We compare our methods with the low-rank factorizationmethod [38] (referred to as LowRank) and interior-point methods.As we can see in Table 8, SDCut-QN takes times moreiterations than SDCut-SN, but still runs faster than SDCut-SNespecially when the size of problem is large (see “face” data).The reason is that SDCut-QN can exploit the speciﬁc structure ofmatrix A in eigen-decomposition. SDCut-QN runs also timesfaster than LowRank. All methods provide similar upper-bounds(primal objective values), and the score vectors shown in Fig. 5also show that the evaluated methods achieve similar visual results. In the graph matching problems considered in this work, each ofthe K source points must be matched to one of the L target points,where L ≥ K . The optimal matching should maximize both of thelocal feature similarities between matched-pairs and the structuresimilarity between the source and target graphs. The BQP formulation of graph matching can be found inTable 2, which can be relaxed to: min x , X (cid:104) X , H (cid:105) + h (cid:62) x (28a) s . t . diag( X ) = x , (28b) (cid:80) Lj =1 x ( i − L + j = 1 , ∀ i ∈ K , (28c) X ( i − L + j, ( i − L + k = 0 , ∀ i ∈ K , j (cid:54) = k ∈ L , (28d) X ( j − L + i, ( k − L + i = 0 , ∀ i ∈ L , j (cid:54) = k ∈ K , (28e) (cid:104) x (cid:62) x X (cid:105) ∈ S KL +1 , (28f)where K = { , · · · , K } and L = { , · · · , L } .A feasible binary solution x is obtained by solving the follow-ing linear program (see [37] for details): max x ≥ x (cid:62) diag( X (cid:63) ) , (29a) s . t . (cid:80) Lj =1 x ( i − L + j = 1 , ∀ i ∈ K , (29b) (cid:80) Ki =1 x ( i − L + j ≤ , ∀ j ∈ L . (29c)Two-dimensional points are randomly generated for evalua-tion. Table 9 shows the results for different problem sizes: n rangesfrom to and m ranges from to . SDCut-SN and SDCut-QN achieves exactly the same upper-bounds asinterior-point methods and comparable lower-bounds. Regardingthe running time, SDCut-SN takes much less number of iterationsto converge and is relatively faster (within times) than SDCut-QN. Our methods run signiﬁcantly faster than interior-point meth-ods. Taking the case K × L = 25 × as an example, SDCut-SN and SDCut-QN converge at around minutes and interior-point methods do not converge within hours. Furthermore,interior-point methods runs out of G memory limit when thenumber of primal constraints m is over . SMAC [22], a spectral PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 12 I m a g e s L o w R a nk S D C u t - QN I m a g e s L o w R a nk S D C u t - QN Fig. 5:

Image co-segmentation on Weizman horses and MSRC datasets. The original images, the results (score vectors) of LowRank and SDCut-QN areillustrated from top to bottom. Other methods produce similar segmentation results.

Data, n , m Methods SDCut-QN SDCut-SN SeDuMi MOSEK LowRankcar-back, , Time/Iters m s/

140 09 m s/ h m h m m sUpper-bound .

71 12 .

74 12 .

74 12 . Lower-bound − . − . − . − . − car-front, , Time/Iters m s/

188 11 m s/ h m h m m sUpper-bound .

16 8 .

16 8 . Lower-bound − . − . − . − . − face, , Time/Iters m s/

164 43 m s/ > hrs h m m sUpper-bound .

65 12 . − .

96 20 . Lower-bound − . − . − − . − horse, , Time/Iters m s/

167 17 m s/ h m h m m sUpper-bound .

78 14 .

76 14 .

76 15 . Lower-bound − . − . − . − . − TABLE 8:

Numerical results for image co-segmentation. All the evaluated are SDP basedmethods and achieve similar upper-boundsand lower-bounds. SDCut-QN runs signiﬁcantlyfaster than other methods, although it needsmore iterations to converge than SDCut-SN.

Methods SDCut-SN SeDuMi MOSEK TRWS MPLPTime/Iters m s/ . h m m s m mError . . .

083 0 .

111 0 . Upper-bound − . − . − . − . − . Lower-bound − . − . − . − . − . TABLE 10:

Image deconvolution ( n = 2250 , m = 2250 ). The resultsare the average over two models in Figure 6. TRWS and MPLP are stoppedat minutes. Compared to TRWS and MPLP, SDCut-SN achieves betterupper-/lower-bounds and uses less time. The bounds given by SDCut-SN iscomparable to those of interior-point methods. method incorporating afﬁne constrains, is also evaluated in thisexperiment, which provides worse upper-bounds and error ratios. Image deconvolution with a known blurring kernel is typicallyequivalent to solving a regularized linear inverse problem (see(23) in Table 2). In this experiment, we test our algorithms on twobinary × images blurred by an × Gaussian kernel. LPbased methods such as TRWS [12] and MPLP [13] are also eval-uated. Note that the resulting models are difﬁcult for graph cuts orLP relaxation based methods in that it is densely connected andcontains a large portion of non-submodular pairwise potentials.We can see from Fig. 6 and Table 10 that QPBO [7], [8], [9] leaves most of pixels unlabelled and LP methods (TRWS and MPLP)achieves worse segmentation accuracy. SDCut-SN achieves a -fold speedup over interior-point methods while keep comparableupper-/lower-bounds. Using much less runtime, SDCut-SN stillyields signiﬁcantly better upper-/lower-bounds than LP methods. The MRF models for Chinese character inpainting are obtainedfrom the OpenGM benchmark [82], in which the unary termsand pairwise terms are learned using decision tree ﬁelds [75]. Asthere are non-submodular terms in these models, they cannot besolved exactly using graph cuts. In this experiments, all models areﬁrstly reduced using QPBO and different algorithms are comparedon the reduced models. Our approach is compared to LP-based methods, including TRWS, MPLP. From the results shownin Table 11, we can see that SDCut-SN runs much faster thaninterior-point methods (SeDuMi and MOSEK) and has similarupper-bounds and lower-bounds. SDCut-SN is also better thanTRWS and MPLP in terms of upper-bound and lower-bound.An extension of MPLP (refer to as MPLP-C) [81], [14], whichadds violated cycle constraints iteratively, is also evaluated inthis experiment. In MPLP-C,

LP iterations are performedinitially and then cycle constraints are added at every PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 13 K × L , n , m Methods SDCut-QN SDCut-SN SeDuMi SDPT3 MOSEK SMAC × , , Time/Iters . s/ . . / . m s . s . s . sError ratio /

100 1 /

100 1 / Upper-bound − . × − − . × − − . × − − . × − − . × − − . × − Lower-bound − . × − − . × − − . × − − . × − − . × − − × , , Time/Iters . s/ . . / . h m m s m s . sError ratio /

150 1 /

150 6 / Upper-bound − . × − − . × − − . × − − . × − − . × − − . × − Lower-bound − . × − − . × − − . × − − . × − − . × − − × , , Time/Iters m s/ . . / . h m h m h m . sError ratio /

200 1 /

200 6 / Upper-bound . × − . × − . × − . × − . × − . × − Lower-bound . × − . × − . × − . × − . × − − × , , Time/Iters m s/ . m s/ . > hrs > hrs > hrs . sError ratio /

250 0 / − − − / Upper-bound . × − . × − − − − . × − Lower-bound . × − . × − − − − − × , , Time/Iters m s/ . m s/ . > hrs > hrs > hrs . sError ratio /

300 2 / − − − / Upper-bound . × − . × − − − − . × − Lower-bound . × − . × − − − − − × , , Time/Iters h m/ . h m/ . Out of mem. Out of mem. Out of mem. . sError ratio /

400 1 / − − − / Upper-bound . × − . × − − − − . × − Lower-bound . × − . × − − − − − TABLE 9:

Numerical results for graph matching, which are the mean over random graphs. For the fourth and ﬁfth models, interior-point methods includingSedumi, SDPT3 and Mosek do not converge within hours. For the last model with around × constraints, Sedumi, SDPT3 and Mosek run out of Gmemory limit. SDCut-SN uses fewer iterations than SDCut-QN and achieves the fastest speed over all SDP based methods. All SDP based methods achievethe same upper-bounds and error rates. The lower-bounds for SDCut-SN and SDCut-QN are slightly worse than interior-point methods. SMAC provides worseupper-bounds and error rates than SDP-based methods.

Images Blurred Images SDCut-SN MOSEK QPBO TRWS MPLP

Fig. 6:

Image deconvolution. QPBO cannot label most of pixels (grey pixels denote unlabelled pixels), as the MRF models are highly non-submodular.SDCut-SN and MOSEK have similar segmentation results. TRWS and MPLP achieve worse segmentation results than our methods.

LP iteration. MPLP-C performs worse than SDCut-SN under theruntime limit of minutes, and outperforms SDCut-SN with amuch longer runtime limit ( hour). We also ﬁnd that SDCut-SN achieves better lower-bounds than MPLP-C on the instanceswith more edges (pairwise potential terms). Note that the timecomplexity of MPLP-C (per LP iteration) is proportional to thenumber of edges, while the time complexity of SDCut-SN (seeTable 1) is less affected by the edge number. It should be alsonoticed that SDCut-SN uses much less runtime than MPLP-C,and its bound quality can be improved by adding linear constraints(including cycle constraints) as well [83]. ONCLUSION

In this paper, we have presented a regularized SDP algorithm (SD-Cut) for BQPs. SDCut produces bounds comparable to the conven-tional SDP relaxation, and can be solved much more efﬁciently.Two algorithms are proposed based on quasi-Newton methods(SDCut-QN) and smoothing Newton methods (SDCut-SN) re-spectively. Both SDCut-QN and SDCut-SN are more efﬁcientthan classic interior-point algorithms. To be speciﬁc, SDCut-SNis faster than SDCut-QN for small to medium sized problems. Ifthe matrix to be eigen-decomposed, C ( u ) , has a special structure(for example, sparse or low-rank) such that matrix-vector productscan be computed efﬁciently, SDCut-QN is much more scalable tolarge problems. The proposed algorithms have been applied to several computer vision tasks, which demonstrate their ﬂexibilityin accommodating different types of constraints. Experiments alsoshow the computational efﬁciency and good solution quality ofSDCut. We have made the code available online . Acknowledgements

We thank the anonymous reviewers for theconstructive comments on Propositions 1 and 4.This work was in part supported by ARC Future FellowshipFT120100969. This work was also in part supported by the Datato Decisions Cooperative Research Centre.

ROOFS

Proof. ( i ) Let t := γ and P := { X ∈ S n + |(cid:104) B i , X (cid:105) = b i , i ∈ I eq ; (cid:104) B i , X (cid:105) ≤ b i , i ∈ I in } , we have | p( X (cid:63) ) − p( X (cid:63)γ ) | ≤ p( X (cid:63)γ ) + 12 γ (cid:107) X (cid:63)γ (cid:107) F − p( X (cid:63) ) (30) = min X ∈ P p( X ) + t (cid:107) X (cid:107) F − p( X (cid:63) ) := θ ( t ) . As a pointwise minimum of afﬁne functions of t , θ ( t ) is concaveand continuous. It is also easy to ﬁnd that θ (0) = 0 and θ ( t ) ismonotonically increasing on R + . So for any (cid:15) > , there is a t > (and equivalently γ = t > ) such that | p( X (cid:63) ) − p( X (cid:63)γ ) | < (cid:15) .

3. http://cs.adelaide.edu.au/ ∼ chhshen/projects/BQP/ PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 14

Methods SDCut-SN SeDuMi MOSEK TRWS MPLP MPLP-C ( m) MPLP-C ( hr)Time/Iters . s/ . m s s . s . s m m sUpper-bound − . − . − . − . − . − . − . Lower-bound − . − . − . − . − . − . − . TABLE 11:

Chinese character inpainting using decision tree ﬁelds ( n = 191 ∼ , m = 191 ∼ ). The results are the average over models. Thenumbers of instances on which SDCut-SN performs better are shown in the parentheses. The upper-/lower-bounds given by SDCut-SN is comparable to thoseof interior-point methods. SDCut-SN also achieves better solutions than TRWS, MPLP and MPLP-C ( m). Using a much longer runtime ( m s vs . . s),MPLP-C outperforms SDCut-SN. ( ii ) By the deﬁnition of X (cid:63)γ and X (cid:63)γ , it is clear that p γ ( X (cid:63)γ ) ≤ p γ ( X (cid:63)γ ) and p γ ( X (cid:63)γ ) ≤ p γ ( X (cid:63)γ ) . Thenwe have p γ ( X (cid:63)γ ) − γ γ p γ ( X (cid:63)γ ) = (1 − γ γ ) · p( X (cid:63)γ ) ≤ p γ ( X (cid:63)γ ) − γ γ p γ ( X (cid:63)γ ) = (1 − γ γ ) · p( X (cid:63)γ ) . Because γ /γ > , p( X (cid:63)γ ) ≥ p( X (cid:63)γ ) . Proof.

The Lagrangian of the primal problem (5) is: L( X , u , Z ) = (cid:104) X , A (cid:105) − (cid:104) X , Z (cid:105) + 12 γ (cid:107) X (cid:107) F + (cid:80) mi =1 u i ( (cid:104) X , B i (cid:105)− b i ) , (31)where u ∈ R | I eq | × R | I in | + and Z ∈ S n + are Lagrangian multipliers.Supposing (5) and (31) are feasible, strong duality holds and ∇ X L( X (cid:63) , u (cid:63) , Z (cid:63) ) = 0 , where X (cid:63) , u (cid:63) and Z (cid:63) are optimalsolutions. Then we have that X (cid:63) = γ ( Z (cid:63) − A − (cid:80) mi =1 u (cid:63)i B i ) = γ ( Z (cid:63) + C ( u (cid:63) )) . (32)By substituting X (cid:63) to (31), we have the dual: max u ∈ R | I eq | × R | I in | + , Z ∈ S n + − u (cid:62) b − γ (cid:107) Z + C ( u ) (cid:107) F . (33)The variable Z can be further eliminated as follows. Givena ﬁxed u , (33) can be simpliﬁed to: min Z ∈ S n + (cid:107) Z + C ( u ) (cid:107) F ,which is proved to has the solution Z (cid:63) = Π S n + ( − C ( u )) (see[84] or Section 8.1.1 of [85]). Note that C ( u ) = Π S n + ( C ( u )) − Π S n + ( − C ( u )) , so Z (cid:63) + C ( u ) = Π S n + ( C ( u )) . By substituting Z into (33) and (32), we have the simpliﬁed dual (6) and Equa-tion (7). Proof.

Set ζ ( X ) := (cid:107) Π S n + ( X ) (cid:107) F = (cid:80) ni =1 (max(0 , λ i )) ,where λ i denotes the i -th eigenvalue of X . ζ ( X ) is a sep-arable spectral function associated with the function g( x ) = (max(0 , x )) . ζ : S n → R is continuously differentiable butnot necessarily twice differentiable at X ∈ S n , as g : R → R hasthe same smoothness property (see [86], [87], [88]). We also have ∇ ζ ( X ) = Π S n + ( X ) . Before proving Proposition 4, we ﬁrst give the following theorem.

Theorem 5. (The spherical constraint). For any X ∈ S n + , we havethe inequality (cid:107) X (cid:107) F ≤ trace( X ) , in which the equality holds ifand only if rank( X ) = 1 .Proof. The proof given here is an extension of the one in [89]. Wehave (cid:107) X (cid:107) F = trace( XX (cid:62) ) = (cid:80) ni =1 λ i ≤ (trace( X )) , where λ i ≥ denotes the i -th eigenvalue of X . Note that (cid:107) X (cid:107) F =trace( X ) (that is (cid:80) ni =1 λ i = ( (cid:80) ni =1 λ i ) ), if and only if there isonly one non-zero eigenvalue of X , that is, rank( X ) = 1 . Proof.

Firstly, we have the following inequalities: p( X (cid:63) ) = p γ ( X (cid:63) ) − (cid:107) X (cid:63) (cid:107) F γ ≥ p γ ( X (cid:63) ) − (trace( X (cid:63) )) γ , (34)where the second inequality is based on Theorem 5. For theBQP (3) that we consider, it is easy to see trace( X (cid:63) ) = n .Furthermore, p γ ( X (cid:63) ) ≥ p γ ( X (cid:63)γ ) holds by deﬁnition. Then wehave that p( X (cid:63) ) ≥ p γ ( X (cid:63)γ ) − n γ . (35)It is known that the optimum of the original SDP problem (4)is a lower-bound on the optimum of the BQP (3) (denoted by p (cid:63) ): p( X (cid:63) ) ≤ p (cid:63) . Then according to (35), we have p γ ( X (cid:63)γ ) − n γ ≤ p( X (cid:63) ) ≤ p (cid:63) . Finally based on the strong duality, the primalobjective value is not smaller than the dual objective value in thefeasible set (see for example [85]): d γ ( u ) ≤ p γ ( X (cid:63)γ ) , where u ∈ R | I eq | × R | I in | + , γ > . In summary, we have: d γ ( u ) − n γ ≤ p γ ( X (cid:63)γ ) − n γ ≤ p( X (cid:63) ) ≤ p (cid:63) , ∀ u ∈ R | I eq | × R | I in | + , ∀ γ > . R EFERENCES [1] N. Van Thoai, “Solution methods for general quadratic programmingproblem with continuous and binary variables: Overview,” in

AdvancedComputational Methods for Knowledge Engineering . Springer, 2013,pp. 3–17.[2] G. Kochenberger, J.-K. Hao, F. Glover, M. Lewis, Z. L¨u, H. Wang, andY. Wang, “The unconstrained binary quadratic programming problem: asurvey,”

J. Combinatorial Optim. , vol. 28, no. 1, pp. 58–81, 2014.[3] S. Z. Li, “Markov random ﬁeld models in computer vision,” in

Proc. Eur.Conf. Comp. Vis.

Springer, 1994, pp. 361–370.[4] C. Wang, N. Komodakis, and N. Paragios, “Markov random ﬁeld mod-eling, inference & learning in computer vision & image understanding:A survey,”

Comp. Vis. Image Understanding , vol. 117, no. 11, pp. 1610–1627, 2013.[5] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schn¨orr, S. Nowozin,D. Batra, S. Kim, B. X. Kausler, T. Kr¨oger, J. Lellmann, N. Komodakis,B. Savchynskyy, and C. Rother, “A comparative study of modern infer-ence techniques for structured discrete energy minimization problems,”

Int. J. Comp. Vis. , 2015.[6] S. D. Givry, B. Hurley, D. Allouche, G. Katsirelos, B. O’Sullivan,and T. Schiex, “An experimental evaluation of CP/AI/OR solvers foroptimization in graphical models,” in

Congr`es ROADEF’2014, Bordeaux,FRA , 2014.[7] V. Kolmogorov and R. Zabin, “What energy functions can be minimizedvia graph cuts?”

IEEE Trans. Pattern Anal. Mach. Intell. , vol. 26, no. 2,pp. 147–159, 2004.[8] V. Kolmogorov and C. Rother, “Minimizing nonsubmodular functionswith graph cuts-a review,”

IEEE Trans. Pattern Anal. Mach. Intell. ,vol. 29, no. 7, pp. 1274–1279, 2007.[9] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer, “Optimizingbinary MRFs via extended roof duality,” in

Proc. IEEE Conf. Comp. Vis.Patt. Recogn. , 2007, pp. 1–8.[10] D. Li, X. Sun, S. Gu, J. Gao, and C. Liu, “Polynomially solvable cases ofbinary quadratic programs,” in

Optimization and Optimal Control , 2010,pp. 199–225.

PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 15 [11] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “MAP estimationvia agreement on trees: message-passing and linear programming,”

IEEETrans. Information Theory , vol. 51, no. 11, pp. 3697–3717, 2005.[12] V. Kolmogorov, “Convergent tree-reweighted message passing for energyminimization,”

IEEE Trans. Pattern Anal. Mach. Intell. , vol. 28, no. 10,pp. 1568–1583, 2006.[13] A. Globerson and T. Jaakkola, “Fixing max-product: Convergent messagepassing algorithms for MAP LP-relaxations,” in

Proc. Adv. Neural Inf.Process. Syst. , 2007.[14] D. Sontag, D. K. Choe, and Y. Li, “Efﬁciently searching for frustratedcycles in MAP inference,” in

Proc. Uncertainty in Artiﬁcial Intell. , 2012.[15] P. Ravikumar and J. Lafferty, “Quadratic programming relaxations formetric labeling and Markov random ﬁeld MAP estimation,” in

Proc. Int.Conf. Mach. Learn. , 2006, pp. 737–744.[16] M. P. Kumar, P. H. Torr, and A. Zisserman, “Solving Markov randomﬁelds using second order cone programming relaxations,” in

Proc. IEEEConf. Comp. Vis. Patt. Recogn. , vol. 1, 2006, pp. 1045–1052.[17] S. Kim and M. Kojima, “Exact solutions of some nonconvex quadraticoptimization problems via SDP and SOCP relaxations,”

Comput. Optim.Appl. , vol. 26, no. 2, pp. 143–154, 2003.[18] B. Ghaddar, J. C. Vera, and M. F. Anjos, “Second-order cone relaxationsfor binary quadratic polynomial programs,”

SIAM J. Optim. , vol. 21,no. 1, pp. 391–414, 2011.[19] J. Shi and J. Malik, “Normalized cuts and image segmentation,”

IEEETrans. Pattern Anal. Mach. Intell. , vol. 22, no. 8, pp. 888–905, 8 2000.[20] S. X. Yu and J. Shi, “Segmentation given partial grouping constraints,”

IEEE Trans. Pattern Anal. Mach. Intell. , vol. 26, no. 2, pp. 173–183,2004.[21] T. Cour and J. Bo, “Solving Markov random ﬁelds with spectral relax-ation,” in

Proc. Int. Workshop Artiﬁcial Intell. & Statistics , 2007.[22] T. Cour, P. Srinivasan, and J. Shi, “Balanced graph matching,” in

Proc.Adv. Neural Inf. Process. Syst. , 2006, pp. 313–320.[23] M. J. Jordan and M. I. Wainwright, “Semideﬁnite relaxations for approx-imate inference on graphs with cycles,” in

Proc. Adv. Neural Inf. Process.Syst. , vol. 16, pp. 369376, 2003.[24] F. Lauer and C. Schnorr, “Spectral clustering of linear subspaces formotion segmentation,” in

Proc. IEEE Int. Conf. Comp. Vis. , 2009.[25] S. Guattery and G. Miller, “On the quality of spectral separators,”

SIAMJ. Matrix Anal. Appl. , vol. 19, pp. 701–719, 1998.[26] K. J. Lang, “Fixing two weaknesses of the spectral method,” in

Proc.Adv. Neural Inf. Process. Syst. , 2005, pp. 715–722.[27] R. Kannan, S. Vempala, and A. Vetta, “On clusterings: Good, bad andspectral,”

J. ACM , vol. 51, pp. 497–515, 2004.[28] L. Vandenberghe and S. Boyd, “Semideﬁnite programming,”

SIAMreview , vol. 38, no. 1, pp. 49–95, 1996.[29] H. Wolkowicz, R. Saigal, and L. Vandenberghe,

Handbook of Semidef-inite Programming: Theory, Algorithms, and Applications . SpringerScience & Business Media, 2000.[30] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential fam-ilies, and variational inference,”

Foundations and Trends R (cid:13) in MachineLearning , vol. 1, no. 1-2, pp. 1–305, 2008.[31] M. P. Kumar, V. Kolmogorov, and P. H. S. Torr, “An analysis of convexrelaxations for MAP estimation of discrete MRFs,” J. Mach. Learn. Res. ,vol. 10, pp. 71–106, Jun 2009.[32] M. X. Goemans and D. Williamson, “Improved approximation algo-rithms for maximum cut and satisﬁability problems using semideﬁniteprogramming,”

J. ACM , vol. 42, pp. 1115–1145, 1995.[33] M. Heiler, J. Keuchel, and C. Schnorr, “Semideﬁnite clustering for imagesegmentation with a-priori knowledge,” in

Proc. DAGM Symp. PatternRecogn. , 2005, pp. 309–317.[34] J. Keuchel, C. Schnoerr, C. Schellewald, and D. Cremers, “Binarypartitioning, perceptual grouping and restoration with semideﬁnite pro-gramming,”

IEEE Trans. Pattern Anal. Mach. Intell. , vol. 25, no. 11, pp.1364–1379, 2003.[35] C. Olsson, A. Eriksson, and F. Kahl, “Solving large scale binary quadraticproblems: Spectral methods vs. semideﬁnite programming,” in

Proc.IEEE Conf. Comp. Vis. Patt. Recogn. , 2007, pp. 1–8.[36] P. Torr, “Solving Markov random ﬁelds using semi-deﬁnite program-ming,” in

Proc. Int. Workshop Artiﬁcial Intell. & Statistics , 2003, pp.1–8.[37] C. Schellewald and C. Schn¨orr, “Probabilistic subgraph matching basedon convex relaxation,” in

Proc. Int. Conf. Energy Minimization Methodsin Comp. Vis. & Pattern Recogn. , 2005, pp. 171–186.[38] A. Joulin, F. Bach, and J. Ponce, “Discriminative clustering for imageco-segmentation,” in

Proc. IEEE Conf. Comp. Vis. Patt. Recogn. , 2010. [39] F. Alizadeh, “Interior point methods in semideﬁnite programming withapplications to combinatorial optimization,”

SIAM J. Optim. , vol. 5, no. 1,pp. 13–51, 1995.[40] Y. Nesterov, A. Nemirovskii, and Y. Ye,

Interior-point polynomial algo-rithms in convex programming . SIAM, 1994, vol. 13.[41] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimizationover symmetric cones,”

Optim. Methods Softw. , vol. 11, pp. 625–653,1999.[42] K. C. Toh, M. Todd, and R. H. T¨ut¨unc¨u, “SDPT3—a MATLAB softwarepackage for semideﬁnite programming,”

Optim. Methods Softw. , vol. 11,pp. 545–581, 1999.[43]

The MOSEK optimization toolbox for MATLAB manual. Version 7.0(Revision 139) , MOSEK ApS, Denmark.[44] S. Burer and R. D. Monteiro, “A nonlinear programming algorithmfor solving semideﬁnite programs via low-rank factorization,”

Math.Program. , vol. 95, no. 2, pp. 329–357, 2003.[45] M. Journ´ee, F. Bach, P.-A. Absil, and R. Sepulchre, “Low-rank opti-mization on the cone of positive semideﬁnite matrices,”

SIAM J. Optim. ,vol. 20, no. 5, pp. 2327–2351, 2010.[46] R. Frostig, S. Wang, P. S. Liang, and C. D. Manning, “Simple MAPinference via low-rank relaxations,” in

Proc. Adv. Neural Inf. Process.Syst. , 2014, pp. 3077–3085.[47] J. Malick, J. Povh, F. Rendl, and A. Wiegele, “Regularization methods forsemideﬁnite programming,”

SIAM J. Optim. , vol. 20, no. 1, pp. 336–356,2009.[48] X.-Y. Zhao, D. Sun, and K.-C. Toh, “A Newton-CG augmented La-grangian method for semideﬁnite programming,”

SIAM J. Optim. , vol. 20,no. 4, pp. 1737–1765, 2010.[49] Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmentedlagrangian methods for semideﬁnite programming,”

Math. Program.Comput. , vol. 2, no. 3-4, pp. 203–230, 2010.[50] Q. Huang, Y. Chen, and L. Guibas, “Scalable semideﬁnite relaxation formaximum a posterior estimation,” in

Proc. Int. Conf. Mach. Learn. , 2014.[51] R. T. Rockafellar, “A dual approach to solving nonlinear programmingproblems by unconstrained optimization,”

Math. Program. , vol. 5, no. 1,pp. 354–373, 1973.[52] C. Helmberg and F. Rendl, “A spectral bundle method for semideﬁniteprogramming,”

SIAM J. Optim. , vol. 10, no. 3, pp. 673–696, 2000.[53] S. Burer, R. D. Monteiro, and Y. Zhang, “A computational study of agradient-based log-barrier algorithm for a class of large-scale SDPs,”

Math. Program. , vol. 95, no. 2, pp. 359–379, 2003.[54] P. Wang, C. Shen, and A. van den Hengel, “A fast semideﬁnite approachto solving binary quadratic problems,” in

Proc. IEEE Conf. Comp. Vis.Patt. Recogn.

IEEE, 2013, pp. 1312–1319.[55] D. Henrion and J. Malick, “Projection methods in conic optimization,”in

Handbook on Semideﬁnite, Conic and Polynomial Optimization .Springer, 2012, pp. 565–600.[56] C. Shen, J. Kim, and L. Wang, “A scalable dual approach to semideﬁnitemetric learning,” in

Proc. IEEE Conf. Comp. Vis. Patt. Recogn. , 2011,pp. 2601–2608.[57] N. Krislock, J. Malick, and F. Roupin, “Improved semideﬁnite boundingprocedure for solving Max-Cut problems to optimality,”

Math. Program.Ser. A , 2013, published online 13 Oct. 2012 at http://doi.org/k2q.[58] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,”

ACM Trans. Math. Softw. , vol. 23, no. 4, pp. 550–560, 1997.[59] P. T. Harker and J.-S. Pang, “Finite-dimensional variational inequalityand nonlinear complementarity problems: a survey of theory, algorithmsand applications,”

Math. Program. , vol. 48, no. 1-3, pp. 161–220, 1990.[60] Y. Gao and D. Sun, “Calibrating least squares covariance matrix problemswith equality and inequality constraints,”

SIAM J. Matrix Anal. Appl. ,vol. 31, pp. 1432–1457, 2009.[61] H. A. Van der Vorst, “Bi-CGSTAB: A fast and smoothly convergingvariant of Bi-CG for the solution of nonsymmetric linear systems,”

SIAMJ. scientiﬁc Statistical Computing , vol. 13, no. 2, pp. 631–644, 1992.[62] A. d’Aspremont and S. Boyd, “Relaxations and randomized methods fornonconvex QCQPs,”

EE392o Class Notes, Stanford University , 2003.[63] Z.-Q. Luo, W.-k. Ma, A. M.-C. So, Y. Ye, and S. Zhang, “Semideﬁniterelaxation of quadratic optimization problems,”

IEEE Signal ProcessingMag. , vol. 27, no. 3, pp. 20–34, 2010.[64] A. I. Barvinok, “Problems of distance geometry and convex propertiesof quadratic maps,”

Discrete Comput. Geometry , vol. 13, no. 1, pp. 189–202, 1995.[65] G. Pataki, “On the rank of extreme matrices in semideﬁnite programs andthe multiplicity of optimal eigenvalues,”

Math. oper. res. , vol. 23, no. 2,pp. 339–358, 1998.

PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 16 [66] R. B. Lehoucq, D. C. Sorensen, and C. Yang, “ARPACK users’ guide:Solution of large scale eigenvalue problems with implicitly restartedArnoldi methods.”[67] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra,J. Du Croz, A. Greenbaum, S. Hammerling, A. McKenney et al. , LAPACK Users’ guide . Siam, 1999, vol. 9.[68] V. Hernandez, J. E. Roman, and V. Vidal, “SLEPc: A scalable andﬂexible toolkit for the solution of eigenvalue problems,”

ACM Trans.Math. Softw. , vol. 31, no. 3, pp. 351–362, 2005.[69] “Plasma 2.7.1,” http://icl.cs.utk.edu/plasma/index.html, 2015.[70] “Magma 1.6.3,” http://icl.cs.utk.edu/magma/, 2015.[71] C. G. Broyden, J. E. Dennis, and J. J. Mor´e, “On the local and superlinearconvergence of quasi-Newton methods,”

IMA J. Appl. Math. , vol. 12,no. 3, pp. 223–245, 1973.[72] J. Dennis and J. J. Mor´e, “A characterization of superlinear convergenceand its application to quasi-Newton methods,”

Math. Comput. , vol. 28,no. 126, pp. 549–560, 1974.[73] L. Qi, “On superlinear convergence of quasi-Newton methods for non-smooth equations,”

Oper. research letters , vol. 20, no. 5, pp. 223–228,1997.[74] A. Raj and R. Zabih, “A graph cut algorithm for generalized imagedeconvolution,” in

Proc. IEEE Int. Conf. Comp. Vis. , 2005.[75] S. Nowozin, C. Rother, S. Bagon, T. Sharp, B. Yao, and P. Kohli,“Decision tree ﬁelds,” in

Proc. IEEE Int. Conf. Comp. Vis. , 2011.[76] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partition-ing and clustering,”

IEEE Trans. Computer-aided Design of IntegratedCircuits and Systems , vol. 11, no. 9, pp. 1074–1085, 1992.[77] L. Gorelick, F. R. Schmidt, Y. Boykov, A. Delong, and A. Ward,“Segmentation with non-linear regional constraints via line-search cuts,”in

Proc. Eur. Conf. Comp. Vis.

Springer, 2012, pp. 583–597.[78] S. Maji, N. K. Vishnoi, and J. Malik, “Biased normalized cuts,” in

Proc.IEEE Conf. Comp. Vis. Patt. Recogn. , 2011, pp. 2057–2064.[79] X. Wang and I. Davidson, “Flexible constrained spectral clustering,” in

Proc. ACM Int. Conf. Knowledge Discovery & Data Mining . ACM,2010, pp. 563–572.[80] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of humansegmented natural images and its application to evaluating segmentationalgorithms and measuring ecological statistics,” in

Proc. IEEE Conf.Comp. Vis. Patt. Recogn. , vol. 2, 2001, pp. 416–423.[81] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss, “Tighten-ing LP relaxations for MAP using message passing,” in

Proc. Uncertaintyin Artiﬁcial Intell. , 2008.[82] B. Andres, T. Beier, and J. Kappes, “OpenGM: A c++ library for discretegraphical models,” http://hci.iwr.uni-heidelberg.de/opengm2/, 2012.[83] C. Helmberg and R. Weismantel, “Cutting plane algorithms for semideﬁ-nite relaxations,”

Fields Institute Communications , vol. 18, pp. 197–213,1998.[84] N. J. Higham, “Computing a nearest symmetric positive semideﬁnitematrix,”

Linear algebra appl. , vol. 103, pp. 103–118, 1988.[85] S. Boyd and L. Vandenberghe,

Convex Optimization . CambridgeUniversity Press, 2004.[86] A. S. Lewis, “Derivatives of spectral functions,”

Math. Oper. Res. ,vol. 21, no. 3, pp. 576–588, 1996.[87] A. S. Lewis and H. S. Sendov, “Twice differentiable spectral functions,”

SIAM J. Matrix Anal. Appl. , vol. 23, no. 2, pp. 368–386, 2001.[88] H. S. Sendov, “The higher-order derivatives of spectral functions,”

Linearalgebra appl. , vol. 424, no. 1, pp. 240–281, 2007.[89] J. Malick, “The spherical constraint in boolean quadratic programs,”

J.Glob. Optim. , vol. 39, no. 4, pp. 609–622, 2007.

Peng Wang received the B.S. degree in electrical engineering andautomation, and the PhD degree in control science and engineeringfrom Beihang University, China, in 2004 and 2011, respectively. He isnow a post-doctoral researcher at the University of Adelaide.

Chunhua Shen is a Professor at School of Computer Science, TheUniversity of Adelaide. His research interests are in the intersection ofcomputer vision and statistical machine learning. He studied at NanjingUniversity, at Australian National University, and received his PhD de-gree from University of Adelaide. In 2012, he was awarded the AustralianResearch Council Future Fellowship.

Anton van den Hengel is a Professor and the Founding Directorof the Australian Centre for Visual Technologies, at the University ofAdelaide, focusing on innovation in the production and analysis of visualdigital media. He received the Bachelor of mathematical science degree,Bachelor of laws degree, Master’s degree in computer science, and thePhD degree in computer vision from The University of Adelaide in 1991,1993, 1994, and 2000, respectively.

Philip H. S. Torr received the Ph.D. (D.Phil.) degree from the RoboticsResearch Group at the University of Oxford, Oxford, U.K., under Prof. D.Murray of the Active Vision Group. He was a research fellow at Oxfordfor another three years. He left Oxford to work as a research scientistwith Microsoft Research for six years, ﬁrst with Redmond, WA, USA,and then with the Vision Technology Group, Cambridge, U.K., foundingthe vision side of the Machine Learning and Perception Group. Currently,he is a professor in computer vision and machine learning with OxfordUniversity, Oxford, U.K.

PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 17 A PPENDIX

In this section, we present some computational details.

A.1 P

RELIMINARIES

A.1.1 Euclidean Projection onto the P.S.D. Cone

Theorem A.1.

The Euclidean projection of a symmetric matrix X ∈ S n onto the positive semideﬁnite cone S n + , is given by Π S n + ( X ) := arg min Y ∈ S n + (cid:107) Y − X (cid:107) F = n (cid:88) i =1 max(0 , λ i ) p i p (cid:62) i , (A.1) where λ i , p i , i = 1 , · · · , n are the eigenvalues and the corresponding eigenvectors of X .Proof. This result is well-known and its proof can be found in [84] or Section 8.1.1 of [85].

A.1.2 Derivatives of Separable Spectral Functions A spectral function F( X ) : S n → R is a function which depends only on the eigenvalues of a symmetric matrix X , and can be writtenas f( λ ) for some symmetric function f : R n → R , where λ = [ λ i , · · · , λ n ] (cid:62) denotes the vector of eigenvalues of X . A function f( · ) is symmetric means that f( x ) = f( Ux ) for any permutation matrix U and any x in the domain of f( · ) . Such symmetric functions and thecorresponding spectral functions are called separable , when f( x ) = (cid:80) ni =1 g( x i ) for some function g : R → R . It is known (see [86],[87], [88], for example) that a spectral function has the following properties: Theorem A.2.

A separable spectral function F( · ) is k -times (continuously) differentiable at X ∈ S n , if and only if its correspondingfunction g( · ) is k -times (continuously) differentiable at λ i , i = 1 , . . . , n , and the ﬁrst- and second-order derivatives of F( · ) are givenby ∇ F( X ) = P (cid:16) diag (cid:0) ∇ g( λ ) , ∇ g( λ ) , . . . , ∇ g( λ n ) (cid:1)(cid:17) P (cid:62) , (A.2) ∇ F( X )( H ) = P (cid:16) Ω( λ ) ◦ ( P (cid:62) HP ) (cid:17) P (cid:62) , ∀ H ∈ S n (A.3) where [Ω( λ )] ij := (cid:40) ∇ g( λ i ) −∇ g( λ j ) λ i − λ j if λ i (cid:54) = λ j , ∇ g( λ i ) if λ i = λ j , i, j = 1 , . . . , n . λ = [ λ , · · · , λ n ] (cid:62) and P = [ p , · · · , p n ] are the collectionof eigenvalues and the corresponding eigenvectors of X . A.2 I

NEXACT S MOOTHING N EWTON M ETHODS : C

OMPUTATIONAL D ETAILS

A.2.1 Smoothing Function

In this section, we show how to constrcut a smoothing function of F( u ) (see (10)). First, the smoothing functions for Π D and Π S n + arewritten as follows respectively: ˜Π D ( (cid:15), v ) := (cid:26) v i if i ∈ I eq ,φ ( (cid:15), v i ) if i ∈ I in , ( (cid:15), v ) ∈ R × R m , (A.4) ˜Π S n + ( (cid:15), X ) := n (cid:88) i =1 φ ( (cid:15), λ i ) p i p (cid:62) i , ( (cid:15), X ) ∈ R × S n , (A.5)where λ i and p i are the i th eigenvalue and the corresponding eigenvector of X . φ ( (cid:15), v ) is the Huber smoothing function that we adopthere to replace max(0 , v ) : φ ( (cid:15), v ) :=  v if v > . (cid:15), ( v + 0 . (cid:15) ) / (cid:15), if − . (cid:15) ≤ v ≤ . (cid:15), if v < − . (cid:15). (A.6)Note that at (cid:15) = 0 , φ ( (cid:15), v ) = max(0 , v ) , ˜Π D ( (cid:15), v ) = Π D ( v ) and ˜Π S n + ( (cid:15), X ) = Π S n + ( X ) . φ , ˜Π D , ˜Π S n + are Lipschitz continuous on R , R × R m , R × S n respectively, and they are continuously differentiable when (cid:15) (cid:54) = 0 . Now we have a smoothing function for F( · ) : ˜F( (cid:15), u ) := u − ˜Π D (cid:16) (cid:15), u − γ Φ (cid:104) ˜Π S n + ( (cid:15), C ( u )) (cid:105) − b (cid:17) , ( (cid:15), u ) ∈ R × R m , (A.7)which has the same smooth property as ˜Π D and ˜Π S n + . PPEARING IN IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE, FEB. 2016 18

A.2.2 Solving the Linear System (16)The linear system (16) can be decomposed to two parts:(16) ⇔ (cid:20) (cid:15) k ˜F( (cid:15) k , u k ) (cid:21) + (cid:20) ∇ (cid:15) ˜F)( (cid:15) k , u k ) ( ∇ u ˜F)( (cid:15) k , u k ) (cid:21) , (A.8a) ⇔ (cid:26) ∆ (cid:15) k = ¯ (cid:15) − (cid:15) k (A.8b) ∇ u ˜F( (cid:15) k , u k )(∆ u k ) = − ˜F( (cid:15) k , u k ) − ∇ (cid:15) ˜F( (cid:15) k , u k )(∆ (cid:15) k ) , (A.8c)where ∇ (cid:15) ˜F and ∇ u ˜F denote the partial derivatives of ˜F with respect to (cid:15) and u respectively. One can ﬁrstly obtain the value of ∆ (cid:15) k by (A.8b) and then solve the linear system (A.8c) using CG-like algorithms.Since the Jacobian matrix ∇ u ˜F( (cid:15) k , u k ) ∈ R m × m is nonsymmetric when inequality constraints exist, biconjugate gradient stabilized(BiCGStab) methods [61] are used for (A.8c) with respect to | I in | (cid:54) = 0 , and classic conjugate gradient methods are used when | I in | = 0 .The computational bottleneck of CG-like algorithms is on the Jacobian-vector products at each iteration. We discuss in the followingthe computational complexity of it in our speciﬁc cases. Firstly, we give the partial derivatives of smoothing functions φ ( (cid:15), v ) : R × R → R , ˜Π D ( (cid:15), v ) : R × R m → R m and ˜Π S n + ( (cid:15), X ) : R × S n → S n : ∇ (cid:15) φ ( (cid:15), v ) = (cid:26) . − . v/(cid:15) ) if − . (cid:15) ≤ v ≤ . (cid:15), otherwise, (A.9a) ∇ v φ ( (cid:15), v ) =  if v > . (cid:15), . v/(cid:15) if − . (cid:15) ≤ v ≤ . (cid:15), if v < − . (cid:15), (A.9b) (cid:104) ∇ (cid:15) ˜Π D ( (cid:15), v ) (cid:105) i = (cid:26) if i ∈ I eq , ∇ (cid:15) φ ( (cid:15), v i ) if i ∈ I in , (A.10a) (cid:104) ∇ v ˜Π D ( (cid:15), v ) (cid:105) ij =  if i = j ∈ I eq , ∇ v i φ ( (cid:15), v i ) if i = j ∈ I in , if i (cid:54) = j, (A.10b) ∇ (cid:15) ˜Π S n + ( (cid:15), X ) = P diag ( ∇ (cid:15) φ ( (cid:15), λ )) P (cid:62) , (A.11a) ∇ X ˜Π S n + ( (cid:15), X )( H ) = P (cid:16) Ω( (cid:15), λ ) ◦ ( P (cid:62) HP ) (cid:17) P (cid:62) , (A.11b)where λ and P are the collection of eigenvalues and the corresponding eigenvectors of X . ∇ (cid:15) φ ( (cid:15), λ ) := [ ∇ (cid:15) φ ( (cid:15), λ i )] ni =1 and Ω( (cid:15), λ ) : R × R n → S n is deﬁned as [Ω( (cid:15), λ )] ij := (cid:40) φ ( (cid:15),λ i ) − φ ( (cid:15),λ j ) λ i − λ j if λ i (cid:54) = λ j , ∇ λ i φ ( (cid:15), λ i ) if λ i = λ j , i, j = 1 , . . . , n. (A.12)Equations (A.11a) and (A.11b) are derived based on Theorem A.2.Then we have the partial derivatives of ˜F( (cid:15), u ) : R × R m → R m with respect to (cid:15) and u : ∇ (cid:15) ˜F( (cid:15), u ) = −∇ (cid:15) ˜Π D ( (cid:15), w ) − ∇ w ˜Π D ( (cid:15), w ) ( ∇ (cid:15) w ) , = −∇ (cid:15) ˜Π D ( (cid:15), w ) + ∇ w ˜Π D ( (cid:15), w ) (cid:16) γ Φ (cid:104) P diag ( ∇ (cid:15) φ ( (cid:15), λ )) P (cid:62) (cid:105)(cid:17) , (A.13a) ∇ u ˜F( (cid:15), u )( h ) = h − ∇ w ˜Π D ( (cid:15), w ) ( ∇ u w ) , = h − ∇ w ˜Π D ( (cid:15), w ) (cid:16) h + γ Φ (cid:104) P (cid:16) Ω( (cid:15), λ ) ◦ ( P (cid:62) Ψ[ h ] P ) (cid:17) P (cid:62) (cid:105)(cid:17) , (A.13b)where w := u − γ Φ (cid:104) ˜Π S n + ( (cid:15), C ( u )) (cid:105) − b ; C ( u ) := − A − Ψ[ u ] ; Φ( X ) := [ (cid:104) B , X (cid:105) , · · · , (cid:104) B m , X (cid:105) ] (cid:62) ; Ψ( u ) := (cid:80) mi =1 u i B i ; λ and P are the collection of eigenvalues and the corresponding eigenvectors of C ( u ) .In general cases, computing (A.13a) and (A.13b) needs O ( mn + n ) ﬂops. However, based on the observation that most of B i , i = 1 , . . . , m contain only O (1) elements and r = rank(Π S n + ( C ( u ))) (cid:28) n , the computation cost can be dramatically reduced.Firstly, super sparse B i s lead to the computation cost of Φ and Ψ reduced from O ( mn ) to O ( m + n ) . Secondly, note that [Ω( (cid:15), λ )] ij =0 , ∀ λ i , λ j < . Given r (cid:28) n and (cid:15) is small enough, the matrix Ω only contains non-zero elements in the ﬁrst r columns and rows.Thus the matrix multiplication in (A.13a), (A.13b) and (A.7) can be computed in O ( n r ) ﬂops rather than the usual O ( n ) ﬂops.In summary, the computation cost of the right hand side of Equ. (A.8c) and the Jacobian-vector product (A.13b) can be reducedfrom O ( mn + n ) to O ( m + n r ))