A Projection Method for Metric-Constrained Optimization
NATE VELDT (Purdue University, Department of Mathematics), DAVID F. GLEICH (Purdue University, Department of Computer Science), ANTHONY WIRTH (The University of Melbourne, School of Computing and Information Systems), AND JAMES SAUNDERSON (Monash University, Department of Electrical and Computer Systems Engineering)

Abstract.
We outline a new approach for solving optimization problems which enforce triangle inequalities on output variables. We refer to this as metric-constrained optimization, and give several examples where problems of this form arise in machine learning applications and in theoretical approximation algorithms for graph clustering. Although these problems are interesting from a theoretical perspective, they are challenging to solve in practice due to the high memory requirements of black-box solvers. In order to address this challenge, we first prove that the metric-constrained linear programming relaxation of correlation clustering is equivalent to a special case of the metric nearness problem. We then develop a general solver for metric-constrained linear and quadratic programs by generalizing and improving a simple projection algorithm originally developed for metric nearness. We give novel approximation guarantees for using our framework to find lower bounds on the optimal solutions of several challenging graph clustering problems. We also demonstrate the power of our framework by solving optimization problems involving tens of millions of variables and hundreds of billions of constraints.

1. Introduction
Learning pairwise distance scores among objects in a dataset is an important task in machine learning and data mining. Principled methods abound for how to approach this task in different contexts. A number of these methods rely on solving a convex optimization problem involving metric constraints of the form x_ij ≤ x_ik + x_jk, where, for instance, x_ij represents the distance score between objects i and j. Solving metric-constrained optimization problems has been applied to semi-supervised clustering [10, 4], joint clustering of image segmentations [51, 30], the metric nearness problem [12, 22, 23], and sensor location [28, 29]. Metric-constrained linear programs also arise frequently as convex relaxations of NP-hard graph clustering problems [16, 50, 38, 1, 43].

Obtaining distance scores by solving a metric-constrained optimization problem is of great theoretical interest, but can be very challenging in practice, since these problems often involve O(n^2) variables and O(n^3) constraints, where n is the number of items in a dataset or the number of nodes in a derived graph. In this paper we specifically consider a class of linear programming relaxations of NP-hard graph clustering objectives that have been extensively studied in theory, but rarely solved in practice due to the memory constraints of traditional black-box optimization software. The goal of our work is to develop practical solvers for these and other related metric-constrained optimization tasks.

The starting point of our work is the observation that the linear programming relaxation for correlation clustering [3, 16] is equivalent to a special case of the ℓ_1 metric nearness problem [12]. Based on this, we develop a general strategy for metric-constrained optimization that is related to the iterative triangle-fixing algorithms that Dhillon et al. developed for metric nearness [23]. Our approach applies broadly to any metric-constrained linear or quadratic program, and also comes with a more robust stopping criterion for the underlying iterative procedure it employs. This stopping criterion is based on a careful consideration of the dual objective function and a rounding step that is applied when our solver is close to convergence. This leads to significantly stronger constraint satisfaction and output guarantees. We additionally provide several novel results and strategies for setting a key parameter, γ, which governs the relationship between a metric-constrained linear program and the related quadratic program that can be more easily solved in practice using our iterative procedure.

We demonstrate the success of our metric-constrained optimization framework by obtaining high-quality solutions to convex relaxations of NP-hard clustering objectives on a much larger scale than has previously been accomplished in practice. In particular, we solve the Leighton-Rao relaxation of sparsest cut on graphs with thousands of nodes, as well as the LP relaxation of correlation clustering on real-world networks with over 11 thousand nodes. In other words, we are able to use our techniques to solve optimization problems with tens of millions of variables and hundreds of billions of constraints.

2. Background and Related Work
We briefly outline three areas in machine learning and optimization that are closely related to our work on developing solvers for metric-constrained optimization problems.
2.1. Metric Learning and Metric Nearness.
The problems we study are tangentially related to distance metric learning [55, 8], in which one is given a set of points in a metric space and the goal is to learn a new metric that additionally respects a set of must-link and cannot-link constraints. Specific results within the metric learning literature involve solving a metric-constrained optimization task: Batra et al. introduced a linear program with metric constraints defined on a set of code words [4]. Biswas and Jacobs considered a similar quadratic program which includes a full set of O(n^3) triangle inequality constraints [11]. The authors note that such a QP would be very challenging to solve in practice. Our work is even more closely related to the metric nearness problem [12], which seeks to learn metric distance scores that are as close as possible (with respect to an ℓ_p norm) to a set of non-metric dissimilarity scores. Dhillon et al. used Dykstra's projection method to solve ℓ_p metric nearness problems when p = 1, 2, or ∞, and applied generalized Bregman projections for all other p [23].

2.2. Implementing Clustering Approximation Algorithms.
A number of triangle-inequality-constrained LP relaxations can be rounded to produce good approximation guarantees for NP-hard clustering problems [50, 38, 16]. However, these are rarely implemented due to memory constraints. For correlation clustering, Wirth noted that the LP can be solved more efficiently by using a multicommodity flow formulation of the problem, though this is still very expensive in practice [53]. Van Gael and Zhu employed an LP chunking technique which allowed them to solve the correlation clustering relaxation on graphs with up to nearly 500 nodes [48]. The LP rounding algorithm of Charikar et al. [15] inspired others to use metric-constrained LPs for modularity clustering [1] and joint clustering of image segmentations [51, 30]. In practice these algorithms scaled to only a few hundred nodes when a full set of O(n^3) constraints was included. In the case of sparsest cut, Lang and Rao developed a practical algorithm closely related to the original Leighton-Rao algorithm [36], which was later evaluated empirically by Lang et al. [37]. However, the algorithm only heuristically solves the underlying multicommodity flow problem, and therefore does not satisfy the same theoretical guarantees.
2.3. Projection Methods for Optimization.
The algorithmic framework we develop for solving metric-constrained optimization problems is based on a well-known projection method developed by Dykstra [25]. Dykstra's method is also equivalent to Han's method [31], and equivalent to Hildreth's method in the case of half-space constraints [33]. Certain variants of the problem are also equivalent to performing coordinate descent on a dual objective function [47]. For detailed convergence results and empirical evaluations of different projection algorithms, see [7, 27, 13, 14].
3. Metric-Constrained LPs and Graph Clustering
Formally, we define a metric-constrained optimization problem to be an optimization problem involving constraints of the form x_ij ≤ x_ik + x_jk, where x_ij is a non-negative variable representing the learned distance between two objects i and j in a given dataset. Optimization problems of this form arise very naturally in the study of graph clustering objectives, since any non-overlapping clustering C for a graph G = (V, E) is in one-to-one correspondence with a set of binary variables x = (x_ij) satisfying triangle inequality constraints:

x_ij ∈ {0, 1} for all i, j and x_ij ≤ x_ik + x_jk for all i, j, k  ⟺  there exists a clustering C such that x_ij = 0 if i, j are together in C, and x_ij = 1 otherwise.

(A small computational check of this correspondence is sketched below.) In this section, we specifically consider a number of metric-constrained linear programs, most of which arise as relaxations of NP-hard graph clustering tasks. We also prove a new equivalence between the metric nearness and correlation clustering LPs (Theorem 1).
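As a quick illustration of this correspondence, the following small Julia sketch (ours, not part of the paper; the function names and example labels are purely illustrative) builds the binary variables from a clustering, verifies every triangle inequality, and then recovers the clusters from the variables.

# Sketch: clusterings <-> binary variables satisfying the triangle inequalities.
# Distances: x[i,j] = 0 if i and j share a cluster, 1 otherwise.

function clustering_to_x(labels::Vector{Int})
    n = length(labels)
    x = ones(Int, n, n)
    for i in 1:n, j in 1:n
        if labels[i] == labels[j]
            x[i, j] = 0
        end
    end
    return x
end

# Check x_ij <= x_ik + x_jk for every ordered triple (i, j, k).
function satisfies_metric_constraints(x::Matrix{Int})
    n = size(x, 1)
    for i in 1:n, j in 1:n, k in 1:n
        if x[i, j] > x[i, k] + x[j, k]
            return false
        end
    end
    return true
end

labels = [1, 1, 2, 3, 2, 1]            # a hypothetical clustering of six nodes
x = clustering_to_x(labels)
@assert satisfies_metric_constraints(x)

# Conversely, binary variables satisfying the constraints define a clustering:
# the relation "x_ij = 0" is transitive (x_ik <= x_ij + x_jk = 0), so grouping
# nodes at distance zero recovers the clusters.
n = length(labels)
recovered = [findfirst(j -> x[i, j] == 0, 1:n) for i in 1:n]
@assert all(x[i, j] == (recovered[i] == recovered[j] ? 0 : 1) for i in 1:n, j in 1:n)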
3.1. Metric Nearness.
The Metric Nearness Problem [12] seeks the nearest metric ma-trix X ∗ = ( x ∗ ij ) to a dissimilarity matrix D = ( d ij ) . Here, a dissimilarity matrix is anon-negative, symmetric, zero-diagonal matrix, and a metric matrix is a dissimilarity ma-trix whose entries satisfy metric constraints. If M n represents the set of metric matrices ofsize n × n , then the problem can be formalized as follows:(1) X ∗ = argmin X ∈ M n X i = j w ij | ( x ij − d ij ) | p /p where w ij ≥ is a weight indicating how strongly we wish X ∗ and D to coincide at entry ij . When p = 1 , the problem can be cast as a linear program by introducing variables M = ( m ij ) :(2) minimize P i Correlation clustering is an NP-hard problem for partition-ing a signed graph G = ( V, W + , W − ) [3, 54]. Each pair of distinct nodes i and j in G possesses two non-negative weights, w + ij ∈ W + and w − ij ∈ W − , indicating measures of simi-larity and dissimilarity, respectively. The goal is to cluster the nodes in a way that minimizesmistakes, where the mistake at pair ( i, j ) is w + ij if i and j are separated, and w − ij if they are PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 4 together. The objective can be written formally as an integer linear program:(3) minimize P i Consider an instance of correlation clustering G = ( V, W + , W − ) and set w ij = | w + ij − w − ij | . Define an n × n matrix D = ( d ij ) where d ij = 1 if w − ij > w + ij and d ij = 0 otherwise. Then X = ( x ij ) is an optimal solution to the LP relaxation of (3) if and only if ( X , M ) is an optimal solution for (2) , where M = ( m ij ) = ( | x ij − d ij | ) .Proof. We will assume that at most one of w + ij , w − ij is positive, so every pair of nodes is eitherlabeled similar or dissimilar. If this were not the case, we could alter the edge weights sothat this holds without changing the LP solution .We equivalently consider an unsigned graph G ′ = ( V, E ) with the same node set V andan adjacency matrix A = ( A ij ) where A ij = 1 if w − ij = 0 and A ij = 0 otherwise. If w ij = max { w + ij , w − ij } , the correlation clustering LP can then be written(4) minimize P i Sparsest Cut. The sparsest cut score of a set S ⊂ V in an n -node graph G = ( V, E ) is defined to be φ ( S ) = cut ( S ) | S | + cut ( S ) | ¯ S | = n cut ( S ) | S || ¯ S | , where ¯ S = V \ S is the complement of S and cut ( S ) indicates the number of edges crossingbetween S and ¯ S . Leighton and Rao developed an O (log n ) -approximation for finding theminimum sparsest cut set φ ∗ = min S ⊂ V φ ( S ) for any graph by solving a maximum multi-commodity flow problem [38]. This result is equivalent to solving the LP relaxation for thefollowing metric-constrained linear program:(6) minimize P ( i,j ) ∈ E x ij subject to P i Maximum modularity clustering [42, 41] takesa graph G = ( V, E ) as input and seeks to optimize the following objective score over allclusterings C :(7) max 12 | E | X i,j (cid:18) A ij − d i d j | E | (cid:19) δ C ij where d i is the degree of node i , and A ij is the { , } indicator for whether i, j are adjacentin G . The δ C ij variables encode the clustering: δ C ij = ( if i, j are together in C otherwise . Although modularity has been widely used in clustering applications, Dinh showed that itis NP-hard to obtain a constant-factor approximation algorithm for the objective [24]. How-ever, inspired by the LP rounding algorithm of Charikar et al. 
for correlation clustering [15],Agarwal and Kempe noted that by replacing δ C ij = 1 − x ij in (7) and introducing metricconstraints, one obtains a the following LP relaxation [1]:(8) maximize | E | P i,j (cid:16) A ij − d i d j | E | (cid:17) (1 − x ij ) subject to x ij ≤ x ik + x jk for all i, j, k ≤ x ij ≤ for all i, j .Solving this LP relaxation provides a useful upper bound on the maximum modularity. Inpractice this opens up the possibility of obtaining a posteriori approximation guarantees forusing heuristic methods for modularity clustering.3.5. Cluster Deletion. Cluster deletion is the problem of deleting a minimum number ofedges in an unweighted, undirected graph G = ( V, E ) so that the remaining graph is adisjoint set of cliques. This problem can be viewed as a variant of correlation clusteringin which each pair of nodes has weights ( w + ij = 1 , w − ij = 0) or ( w + ij = 0 , w − ij = ∞ ) . TheLP relaxation can be obtained by starting with the relaxation for correlation clustering and PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 6 fixing x ij = 1 for ( i, j ) / ∈ E . This eliminates the need to work explicitly with variables fornon-edges, and therefore simplifies into the following relaxation:(9) minimize P ( i,j ) ∈ E x ij subject to x ij ≤ x ik + x jk if ( i, j, k ) ∈ T ≤ x ik + x jk if ( i, j, k ) ∈ ˜ T k ≤ x ij ≤ for all ( i, j ) ∈ E. In the above, T represents the set of triangles, i.e. triplets of nodes i, j, k that form a cliquein G . The set ˜ T k represents bad triangles “centered" at k , i.e. k shares an edge with i and j ,but ( i, j ) / ∈ E . In recent work we showed that the solution to this LP can be rounded intoa clustering that is within a factor two of the optimal cluster deletion solution [50].3.6. Maximum Cut. Given a graph G = ( V, E ) , the maximum cut problem seeks topartition G into two clusters in a way that maximizes the number of edges crossing betweenthe clusters. The linear programming relaxation for Max Cut is(10) maximize P i,j A ij x ij subject to x ij ≤ x ik + x jk for all i, j, kx ij + x ik + x jk ≤ for all i, j, k ≤ x ij ≤ for all i, j .Note that if the linear constraints x ij ∈ [0 , were replaced with binary constraints x ij ∈{ , } , then this would correspond to an integer linear program for the exact Max Cutobjective. The constraint x ij + x ik + x jk ≤ are included to ensure in the binary case thatnodes are assigned to at most two different clusters. It is well known that the integrality gapof this linear program is − ǫ [43]. Fernandez de la Vega and Mathieu noted that this can beimproved to a ǫ integrality gap in the case of dense graphs when additional constraintsare added [20].4. Projection Methods for Quadratic Programming The graph clustering problems we have considered above can be relaxed to obtain an LPof the form(11) min x c T x s.t. Ax ≤ b , where A is a large and very sparse matrix with up to O ( n ) rows and O ( n ) columns.Standard optimization software will be unable to solve these LPs for large values of n , dueto memory constraints, so we instead turn our attention to applying a simple projectionmethod for solving a closely related quadratic program:(12) min x Q ( x ) = c T x + 12 γ x T Wx s.t. Ax ≤ b , where W is a diagonal matrix with positive diagonal entries and γ > . When W is theidentity matrix, it is well known that there exists some γ > such that for all γ > γ , theoptimal solution to the quadratic program corresponds to the minimum 2-norm solution ofthe original LP [40].4.1. 
Applying Dykstra’s Projection Method. The dual of (12) is another quadraticprogram: max y D ( y ) = − b T y − γ ( A T y + c ) T W − ( A T y + c ) s.t. y ≥ . (13)The core of our algorithm for solving (12) is Dykstra’s projection method [25], which itera-tively updates a set of primal and dual variables x and y that are guaranteed to converge to PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 7 Algorithm 1 Dykstra’s Method for Quadratic Programming Input: A ∈ R N × M , b ∈ R M , c ∈ R N , γ > , W ∈ R N × N ( diagonal, positive definite ) Output: ˆ x = argmin x ∈A Q ( x ) where A = { x ∈ R N : Ax ≤ b } y := ∈ R M x := − γ W − c , k := 0 while not converged do k := k + 1 (Visit constraints cyclically): i := ( k − 1) mod M + 1 (Perform correction step): x := x + y i ( γ W − a i ) where a i is the i th row of A (Perform projection step): x := x − θ + i ( γ W − a i ) where θ + i = max { a Ti x − b i , } γ a Ti W − a i (Update dual variables): y i := θ + i ≥ the optimal solution of (12) and (13) respectively. The method cyclically visits constraintsone at a time, first performing a correction step to the vector x based on the dual variableassociated with the constraint, and then performing a projection step so that x satisfies thelinear constraint in question. Pseudocode for the method, specifically when applied to ourquadratic program, is given in Algorithm 1. For quadratic programming, Dykstra’s methodis also equivalent to Hildreth’s projection method [33], and is guaranteed to have a linearconvergence rate [27]. In Appendix B, we provide more extensive background informationregarding Dykstra’s method. In particular we prove the equivalence relationship betweenDykstra’s method and Hildreth’s method in the case of quadratic programming. We alsoprovide a slight generalization of the results of Dax [19] which show that the dual variablesproduced by Dykstra’s method allow us to obtain a strictly increasing lower bound on thequadratic objective (12) we are trying to solve.5. Implementing Dykstra’s Method for Metric-Constrained Optimization Our full algorithmic approach takes Dykstra’s method (Algorithm 1) and includes anumber of key features that allow us to efficiently obtain high-quality solutions to metric-constrained problems in practice. The first feature, a procedure for locally performingprojections at metric constraints, is the key insight which led Dhillon et al. to develop effi-cient algorithms for metric nearness [23]. In addition, we detail a sparse storage scheme fordual vectors, and include a more robust convergence check that leads to better constraintsatisfaction and stronger optimality guarantees for a variety of metric-constrained problems.In order to demonstrate our application of Dykstra’s method to metric-constrained linearprograms, we will specifically consider the quadratic program related to the Leighton-Raosparsest cut relaxation:(14) minimize P ( i,j ) ∈ E x ij + γ P i Projections of the form x := x + α W − a i for W diagonaland a constant α will change x by at most the number of nonzero entries of a i , the i th rowof constraint matrix A . In the case of triangle inequality constraints, which dominate ourconstraint set, there are exactly three non-zero entries per constraint, so we perform eachprojection in a constant number of operations.Consider the triangle inequality constraint x ij − x ik − x jk ≤ associated with an orderedtriplet ( i, j, k ) . Let t = t ijk represent a unique ID corresponding to this constraint. 
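Before specializing to triangle constraints, it may help to see the iteration of Algorithm 1 written out. The following Julia sketch is ours and is only illustrative: it assumes a small dense constraint matrix and a fixed number of passes, whereas the solver described below never forms A explicitly and uses the stopping criteria of Section 5.3.

using LinearAlgebra

# Minimal sketch of Algorithm 1 (Dykstra's method) for
#   min  c'x + (1/(2γ)) x'Wx   s.t.  Ax ≤ b,
# with W = Diagonal(w), w > 0. The dense matrix A is used here purely for
# illustration; in the triangle-constrained problems each row has three nonzeros.
function dykstra_qp(A::Matrix{Float64}, b::Vector{Float64}, c::Vector{Float64},
                    w::Vector{Float64}, γ::Float64; maxpasses::Int = 100)
    M, N = size(A)
    x = -γ .* (c ./ w)          # x := -γ W⁻¹ c, the unconstrained minimizer
    y = zeros(M)                # one dual variable per constraint
    for _ in 1:maxpasses
        for i in 1:M
            a = A[i, :]
            # Correction step: undo the adjustment made for constraint i last pass.
            if y[i] > 0
                x .+= y[i] .* (γ .* a ./ w)
            end
            # Projection step: project x onto {x : a'x ≤ b_i} in the (W/γ)-norm.
            θ = max(dot(a, x) - b[i], 0.0) / (γ * sum(a[j]^2 / w[j] for j in 1:N))
            x .-= θ .* (γ .* a ./ w)
            y[i] = θ            # dual update; y stays nonnegative
        end
        # (The paper's stopping test in Section 5.3 monitors the duality gap here.)
    end
    return x, y
end

For a triangle constraint, the row a_i has only three nonzero entries, so the correction and projection steps above touch only three entries of x; this is exactly the constant-time update derived next.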
For nowwe ignore the correction step of Dykstra’s method, which is skipped over in the first roundsince the vector of dual variables is initialized to zero (i.e. y t = 0 ). The projection step wemust perform is x ← x − [ a Tt x − b t ] + a Tt W − γ a t W − γ a t where a t contains exactly three entries: , − , and − , at the locations corresponding tovariables x ij , x ik , and x jk . This projection will only change x if constraint t is violated, sowe first check if δ = a Tt x − b t = x ij − x ik − x jk > . If so, we compute θ + t = [ a Tt x − b t ] + a Tt W − γ a t = δ /w ij + 1 /w ik + 1 /w jk = δw ij w ik w jk w ij w ik + w ij w jk + w ik w ij . The projection step then updates exactly three entries of x : x ij ← x ij − θ + t x ij w ij , x ik ← x ik − θ + t x ik w ik , x jk ← x jk − θ + t x jk w jk . Note that all of this can be done in a constant number of operations for each triangleinequality constraint.5.2. Sparse storage of y . For constraint sets that include triangle inequalities for everytriplet of nodes ( i, j, k ) , the dual vector y will be of length O ( n ) . Observe that the correctionstep in Algorithm 1 at constraint t will be nontrivial if and only if there was a nontrivialprojection last time the constraint was visited. In other words, θ + t was nonzero in theprevious round and therefore y t > .5.2.1. Sparsity in the Triangle Constraint Variables. Note that each triplet ( i, j, k ) corre-sponds to three different metric constraints: x ij − x ik − x jk ≤ , x jk − x ik − x ij ≤ , and x ik − x ij − x jk ≤ , and in each round at most one of these constraints will be violated, indi-cating that at least two dual variables will be zero. Dhillon et al. concluded that (cid:0) n (cid:1) floatingpoint numbers must be stored in implementing Dykstra’s algorithm for the metric nearnessproblem [22]. We further observe, especially for the correlation clustering LP, that oftenin practice for a large percentage of triplets ( i, j, k ) , none of the three metric constraints isviolated. Thus we can typically avoid the worst case O ( n ) memory requirement by storing y sparsely. PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 9 Algorithm 2 MetricProjection ( i, j, k ) t := unique ID for ( i, j, k ) Obtain ( x ij , x ik , x jk ) and weights ( w ij , w ik , w jk ) from X and W γ if y t > then x ij ← x ij + y t x ij w ij , x ik ← x ik + y t x ik w ik , x jk ← x jk + y t x jk w jk . δ := x ij − x ik − x jk if δ > then θ t = δw ij w ik w jk w ij w ik + w ij w jk + w ik w ij x ij ← x ij − θ t x ij w ij , x ik ← x ik − θ t x ik w ik , x jk ← x jk − θ t x jk w jk .Store y t = θ t Storing y in dictionaries or arrays. Conceptually the easiest approach to storingnonzero entries in y is to maintain a dictionary of key-value pairs ( t, y t ) . In this case, whenvisiting constraint t , we check if the dictionary contains a nonzero dual variable y t > forthis constraint, and if so we perform the corresponding non-trivial correction step. However,because we always visit constraints in the same order, we find it faster in practice to store twoarrays with pairs ( t, y t ) rather than a dictionary. The first array stores entries y t that wereset to a nonzero value in the previous pass through the constraints. We maintain a pointerto the entry in the array which gives us the next such constraint t that will require a nonzerocorrection in the current pass through the constraint set. 
The second array allocates spacefor the new dual variables that become nonzero after a projection step in the current passthrough the constraints. These will be needed for corrections in the next round. Dykstra’smethod does not require we remember history beyond the last pass through constraints, sowe never need more than two arrays storing pairs ( t, y t ) .5.2.3. Pseudocode. Algorithm 2 displays pseudocode for one step of our implementation ofDykstra’s method when visiting a metric constraint. We assume the nonzero dual variables y t are stored sparsely and can be efficiently queried and updated. The same basic outlineapplies also to non-metric constraints.5.3. Robust Stopping Criteria. Many implementations of Dykstra’s method stop whenthe change in vector x drops below a certain tolerance after one or more passes through theentire constraint set. However, Birgin et al. noted that in some cases this may occur evenwhen the iterates are far from convergence [9]. Because we are applying Dykstra’s methodspecifically to quadratic programming, we can obtain a much more robust stopping criterionby carefully monitoring the dual objective function and dual variables, in a manner similarto the approach of Dax [19].5.3.1. Optimality Conditions. Let ( y k , x k ) denote the pair of primal and dual vectors com-puted by Dykstra’s method after k projections. We know that these vectors will convergeto an optimal pair (ˆ x , ˆ y ) such that D (ˆ y ) = ˆ Q = Q (ˆ x ) where ˆ Q is the optimal objective forboth the primal (12) and dual (13) quadratic programs. The KKT optimality conditions forquadratic programming state that the pair (ˆ x , ˆ y ) is optimal for the primal (12) and dual (13)quadratic programs if and only if the following conditions hold: A ˆ x ≤ b ˆ y T ( A ˆ x − b ) = 0 W γ ˆ x = − A T ˆ y − c . 4. ˆ y ≥ In this case we know that D (ˆ y ) = ˆ Q = Q (ˆ x ) where ˆ Q is the optimal objective for both theprimal and dual quadratic programs. We show iin Appendix B that the dual update step y i := θ + i in Algorithm 1 will guarantee that the last two KKT conditions are always satisfied.In other words, the primal and dual variables at iteration k , ( x k , y k ) , satisfy y k ≥ and PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 10 W γ x k = − A T y k − c . This means that y k is always feasible for the dual objective (13), andby weak duality we have the following lower bound on the optimal solution to objective (12)(15) D ( y k ) = − b T y k − γ ( A T y k + c ) T W − γ ( A T y k + c ) = − b T y k − x Tk W γ x k . We also prove in the Appendix that performing Dykstra’s method is equivalent to applying acoordinate ascent procedure on the dual quadratic program (13). This means that D ( y k ) is astrictly increasing lower bound that converges to ˆ Q . Meanwhile, Q ( x k ) does not necessarilyupper bound ˆ Q since x k is not necessarily feasible. However, x k converges to the optimalprimal solution, so as the algorithm progresses, the maximum constraint violation of x k decreases to zero. In practice, once x k has satisfied constraints to within a small enoughtolerance we treat Q ( x k ) as an upper bound.After each pass through the constraints we check the primal-dual gap ω k and maximumconstraint violation ρ k , given by ω k = D ( y k ) − Q ( x k ) D ( y k ) ρ k = max t ( b t − a Tt x k ) . Together these two scores provide an indication for how close ( x k , y k ) are to convergence.5.3.2. Computing ω k and ρ k . 
To compute ω_k in practice, we note that the quadratic term (1/(2γ)) x_k^T W x_k appears in both Q(x_k) and D(y_k) (see (15)). This term, as well as the term c^T x_k, can be easily computed by iterating through the O(n^2) entries of x. Finding b^T y_k could theoretically involve O(n^3) computations, but it can be computed during the main loop of Dykstra's algorithm by continually updating a variable that adds up terms of the form y_t b_t whenever y_t and b_t are both nonzero for a particular constraint t. Note that in most of the problems we have considered here, b_t = 0 for the majority of the constraints. For example, in the sparsest cut relaxation (14), b_t is nonzero only for the single constraint Σ_{i<j} x_ij ≥ n.
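A minimal sketch of this check (ours; the function and variable names are illustrative, and only the triangle constraints are tested for violation) is given below. It assumes the current iterate is stored as a symmetric distance matrix and that the value of b^T y_k has been accumulated during the main loop as described above.

# Sketch: duality gap ω and maximum triangle-constraint violation ρ after one pass.
# X is the symmetric distance matrix, Wm the (matrix of) diagonal weights,
# C the linear costs, and btY the accumulated value of b'y.
function convergence_check(X::Matrix{Float64}, Wm::Matrix{Float64},
                           C::Matrix{Float64}, btY::Float64, γ::Float64)
    n = size(X, 1)
    quad = 0.0      # (1/(2γ)) x'Wx, shared by the primal and dual objectives
    lin  = 0.0      # c'x
    for i in 1:n-1, j in i+1:n
        quad += Wm[i, j] * X[i, j]^2 / (2γ)
        lin  += C[i, j] * X[i, j]
    end
    Qx = lin + quad                 # primal objective Q(x_k)
    Dy = -btY - quad                # dual objective D(y_k), a lower bound
    ω  = (Dy - Qx) / Dy             # relative duality gap
    ρ  = 0.0                        # maximum triangle-inequality violation
    for i in 1:n-2, j in i+1:n-1, k in j+1:n
        ρ = max(ρ, X[i, j] - X[i, k] - X[j, k],
                   X[i, k] - X[i, j] - X[j, k],
                   X[j, k] - X[i, j] - X[i, k])
    end
    return ω, ρ
end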
A keycontribution in our work is a set of results, outlined in this section, which show how to set W and γ in order to obtain specific guarantees for approximating the correlation clustering andsparsest cut objectives. These results hold for all γ > , whether larger or smaller thanthe unknown value γ . We begin with a general theorem that provides a useful strategy forobtaining approximation guarantees for a large class of linear programs. Theorem 2. Let A ∈ R M × N , b ∈ R N , c ∈ R N> , and A = { x ∈ R N : Ax ≤ b } . Denote x ∗ = argmin x ∈A c T x and assume that all entries of x ∗ are between and B > . Let W bea diagonal matrix with entries W ii = c i > and let ˆ x = argmin x ∈A [ c T x + 1 / (2 γ ) x T Wx ] .Then c T x ∗ ≤ c T ˆ x ≤ c T x ∗ (1 + B/ (2 γ )) . Proof. Vectors ˆ x and x ∗ are optimal for their respective problems, meaning that c T x ∗ ≤ c T ˆ x ≤ c T ˆ x + 12 γ ˆ x T W ˆ x ≤ c T x ∗ + 12 γ ( x ∗ ) T Wx ∗ . The proof follows from combining these inequalities with a bound on the second term onthe far right. By our construction of W and the bounds we assume hold for x ∗ , we have: ( x ∗ ) T Wx ∗ = n X i =1 c i ( x ∗ i ) = B n X i =1 c i ( x ∗ i /B ) ≤ B n X i =1 c i x ∗ i = B c T x ∗ where the second to last step holds because ≤ x ∗ i ≤ B = ⇒ ( x ∗ i /B ) < ( x ∗ i /B ) . (cid:3) Cluster Deletion Approximation. We observe that Theorem 2 directly implies aresult for the cluster deletion LP relaxation (9). Cluster deletion has a variable x ij for eachedge ( i, j ) ∈ E . The objective can be written e T x = P ( i,j ) ∈ E x ij , where e is the all onesvector. Since the LP also includes constraints x ij ∈ [0 , , we see that the assumptions ofTheorem 2 hold with W equal to the identity matrix and B = 1 . This means that we canuse our Dykstra-based projection method to optimize a quadratic objective that produces asolution within a factor (1 + 1 / (2 γ )) of the optimal cluster deletion LP relaxation. Couplingthis result with the LP rounding procedure we developed in previous work, we can obtain a (2 + 1 /γ ) approximation for cluster deletion in practice [50]. PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 12 Correlation Clustering. Consider a correlation clustering problem on n nodes whereeach pair of nodes ( i, j ) is either strictly similar or strictly dissimilar, with a nonzero weight w ij > . That is, exactly one of the weights ( w − ij , w + ij ) is positive and the other is zero.We focus on the LP relaxation for this problem given in the form of the ℓ metric nearnessLP (2). We slightly alter this formulation by performing a change of variables y ij = x ij − d ij .The LP can then be written equivalently as:(16) minimize P i Let ( m ∗ ij ) and ( y ∗ ij ) be the optimal solution vectors for the correlation clusteringLP relaxation given in (16) and ( ˆ m ij ) , (ˆ y ij ) be the optimal solution to the related QP (18) .Then X i Therefore, given any rounding procedure for the original LP that gives a factor p ap-proximation for a correlation clustering problem, we can instead solve the related QP usingprojection methods to obtain a factor p (1 + 1 /γ ) approximation. For weighted correlationclustering, the best rounding procedures guarantee an O (log n ) approximation [21, 26, 15],so this can still be achieved even if we use a small value for γ .6.3. Sparsest Cut. The Leighton-Rao linear programming relaxation for sparsest cut ispresented in (6). This LP has a variable x ij for every pair of distinct nodes i < j in someunweighted graph G = ( V, E ) . 
Let x = ( x ij ) be a linearization of these distance variables,and define c = ( c ij ) to be the adjacency indicators, i.e. c ij = ( if ( i, j ) ∈ E otherwise . Then the objective can be written in the familiar format min x c T x . If we assume we havechosen a weight matrix W and a parameter γ > , the regularized version of (6) has objective min x X i Let G = ( V, E ) be a connected graph with n = | V | > . Let φ ∗ be the minimumsparsest cut score for G and assume that each side of the optimal sparsest cut partition hasat least 2 nodes. Let γ > , W = diag ( w ) be defined as above for a given λ ∈ (0 , , and let A denote the set of constraints from the Leighton-Rao LP relaxation for sparsest cut. Then min x ∈A c T x + 12 γ x T Wx ≤ (cid:18) λn γ (cid:19) φ ∗ . Proof. The quadratic regularization of the sparsest cut LP relaxation is(21) minimize P i Without loss of generality, assume | S ∗ | ≤ | ¯ S ∗ | . In the statement of the theorem we assumethat G is connected, n > , and | S ∗ | > . The connectivity of G ensures the problem can’t betrivially solved by finding a single connected component, and guarantees that cut ( S ∗ ) ≥ .Together the remaining two assumptions guarantee that n | S ∗ || ¯ S ∗ | ≤ n n − ≤ , which will beuseful later in the proof. We will also use the fact that n | S ∗ || ¯ S ∗ | ≤ n cut ( S ∗ ) | S ∗ || ¯ S ∗ | = φ ∗ . Note thatif n ≤ , the problem is trivial to solve by checking all possible partitions, and if | S ∗ | = 1 then the minimum sparsest cut problem is easy to solve by checking all n partitions thatput a single node by itself.In order to encode the optimal partition as a vector, define s ∗ = ( s ∗ ij ) by s ∗ ij = ( n | S ∗ || ¯ S ∗ | if nodes i and j are on opposite side of the partition { S ∗ , ¯ S ∗ } otherwise . Observe that this vector s ∗ satisfies the constraints of (21) and that X i The approximation bounds in the previous section provide helpful suggestions for howto set parameters γ and W before running Dykstra’s projection algorithm on a quadraticregularization of a metric-constrained LP. Once we have chosen these parameters and solved PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 15 the quadratic program, we would like to see if we can improve these guarantees using theoutput solution for the QP.7.1. A First Strategy for Improved Bounds. Consider again the optimal solutions tothe LP and QP given by x ∗ = argmin A c T x ˆ x = argmin A c T x + 12 γ x T Wx . where A = { x ∈ R N : Ax ≤ b } is the set of feasible solutions. For each of the NP-hardgraph clustering objectives we have considered, we have proven a sequence of inequalities ofthe form c T x ∗ ≤ c T ˆ x ≤ c T ˆ x + 12 γ ˆ x T W ˆ x ≤ c T x ∗ + 12 γ ( x ∗ ) T W ( x ∗ ) ≤ (1 + A ) OP T where A is a term in the approximation factor (e.g. /γ , / (2 γ ) , (1 + λn ) /γ ) and OPT is theoptimal score for the NP-hard objective. If we have already computed ˆ x , we can improvethis approximation result by computing R = ˆ x T W ˆ x γ c T ˆ x . We then get an improved approximation guarantee: c T ˆ x + 12 γ ˆ x T W ˆ x = (1 + R ) c T ˆ x = ⇒ c T ˆ x ≤ A R OP T. In some cases R will be small and this improvement will be minimal. However, intuitivelywe can see that in some special cases R may be large enough to significantly improve theapproximation factor. For example, it may be the case that for some correlation clusteringrelaxation, we choose γ large enough so that the optimal solution to the QP, ˆ m , and theoptimal solution to the LP, m ∗ , are actually identical. 
Even after computing ˆ m we may notrealize that c T ˆ x = c T x ∗ . However, in some cases, a significant proportion of the m ∗ ij = ˆ m ij variables will close to zero or close to one. Thus, ˆ m ij ≈ ˆ m ij for many pairs i, j . For thecorrelation clustering relaxation this will mean that ˆ xW ˆ x ≈ c T ˆ x = ⇒ R ≈ A . Even incases where x ∗ and ˆ x are not identical but very close, similar reasoning shows that the abovea posteriori approximation result may be much better than the a priori (1+ A ) approximation.We note that our approximation results for correlation clustering in the experiments sectionare greatly aided by this a posteriori guarantee.7.2. Improved Guarantees by Solving a Small LP. We outline one more approach forgetting improved approximation guarantees, this time based on a careful consideration ofdual variables ˆ y computed by Dykstra’s method. This result requires a more sophisticatedapproach than the guarantee given in the last section. We find it extremely helpful forproviding strong a posteriori guarantees when solving our quadratic relaxation of sparsestcut.Once more we consider our initial linear program, which we assume is too challenging tosolve using black-box software because of memory constraints: min x c T x (22) s.t. Ax ≤ b . PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 16 We again let x ∗ denote the (unknown) optimizer for (22). In practice, we solve a quadraticregularization: min x c T x + 12 γ x T Wx (23) s.t. Ax ≤ b . We solve (23) by finding a primal-dual pair of vectors (ˆ x , ˆ y ) satisfying KKT conditions. Inparticular, as noted in previous sections, these vectors satisfy γ W ˆ x = − A T ˆ y − c (24) − b T ˆ y − γ ˆ x T W ˆ x = c T ˆ x + 12 γ ˆ x T W ˆ x . (25)Given this setup, we prove a new theorem for obtaining a lower bound on c T x ∗ by considering ˆ y and solving another small, less expensive LP. Theorem 5. Given (ˆ x , ˆ y ) , set ˆ p = 1 /γ W ˆ x and let ˜ x be the optimal solution to the followingnew linear program: max x ˆ p T x (26) s.t. c T x ≤ c T ˆ xx ∈ B where B is any set which is guaranteed to contain x ∗ (i.e. B encodes a subset of constraintsthat are known to be satisfied by x ∗ ). Then we have the following lower bound on the optimalsolution to (22) : (27) − b T ˆ y − ˆ p T ˜ x ≤ c T x ∗ . Furthermore, if x ∗ = ˆ x = ˜ x , then this bound is tight.Proof. The dual of the original linear program (22) is max − b T y (28) s.t. − A T y − c = 0 y ≥ . One way to obtain a lower bound on c T x ∗ would be to find some feasible point y for (28),in which case − b T y ≤ c T x ∗ . Note that we have access to a vector ˆ y satisfying ˆ y ≥ and A T ˆ y − c = ˆ p = (1 /γ ) W ˆ x . This ˆ y is not feasible for (28), but we note that if the entries of ˆ p are very small (which they will be for large γ ), then the constraint A T y − c = 0 is nearly satisfied by ˆ y . If we define a new vector ˆ c = c + ˆ p , then we can observe that ˆ y is feasiblefor a slightly perturbed linear program: max − b T y (29) s.t. − A T y − ˆ c = 0 y ≥ . We realize that this is the dual of a slight perturbation of the original LP (22): min x ˆ c T x (30) s.t. Ax ≤ b . PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 17 Since ˆ y is feasible for (29) and x ∗ is feasible for (30), we have the following inequality:(31) − b T ˆ y ≤ ˆ c T x ∗ = c T x ∗ + ˆ px ∗ . Finally, observe that x ∗ is feasible for the LP (26) defined in the statement of the theorem,and therefore ˆ px ∗ ≤ ˆ p ˜ x . 
Combining this fact with (31) we get our final result: − b T ˆ y ≤ c T x ∗ + ˆ px ∗ ≤ c T x ∗ + ˆ p ˜ x = ⇒ − b T ˆ y − ˆ p T ˜ x ≤ c T x ∗ . If we happen to choose γ > and W in such a way that x ∗ = ˆ x , and then pick a set B sothat ˜ x = x ∗ , then property (25) ensures that this bound will be tight. (cid:3) Typically it will be difficult to choose parameters in such a way that x ∗ = ˜ x = ˆ x . However,the fact that this bound is tight for a certain choice of parameters is a good sign that thebound will not be too loose to be useful in practice as long as we choose parameters carefully.7.3. A Bound for Sparsest Cut. Consider the quadratic regularization of the sparsestcut relaxation shown in (21), with diagonal weight matrix defined as in Section 6.3. Assume (ˆ x , ˆ y ) is the set of primal and dual variables obtained by solving the objective with Dykstra’smethod. We give a corollary to Theorem 5 that shows how to obtain good a posterioriapproximations for how close c T ˆ x is to the original LP relaxation of sparsest cut (6). Corollary 6. Let ˜ x = (˜ x ij ) be the optimizer for the following LP: (32) maximize (1 /γ ) P i We implement Dykstra-based solvers for relaxations of sparsest cut (DykstraSC) and cor-relation clustering (DykstraCC) in the Julia programming language. Most previous metric-constrained LP solvers have managed to obtain results only on graphs with 500 or fewernodes [48, 1, 30]. In contrast, Dhillon et al. apply their triangle-fixing algorithm to solvemetric nearness problems on random n × n dissimilarity matrices with n up to 5000 [23].However, their method simply runs Dykstra’s method until the change in the solution vectorfalls below a certain threshold. Since this approach does not take constraint satisfaction orduality gap into consideration, it comes with no output guarantees. Here we use DykstraSCto solve the sparsest cut relaxation on real-world graphs with up to 3068 nodes. Our methodis able to satisfy constraints to within machine precision, and our choice of γ allows us toobtain strong guarantees with respect to the optimal sparsest cut score. We also solve thecorrelation clustering relaxation to within a small constraint tolerance on signed, weightedgraphs with up to 11,204 nodes. This corresponds to solving a quadratic program with over × constraints.8.1. Using Gurobi Software. We compare our algorithms against black-box Gurobi op-timization software. A free academic license for Gurobi software can be obtained online at Gurobi.com . When comparing our algorithm against black-box software, we take care to en-sure as fair of a comparison as possible. Gurobi possesses a number of underlying solvers forLPs. In practice we separately run Gurobi’s barrier method (i.e. the interior point solver),the primal simplex method, and the dual simplex method, to see which performs the best.For the interior point method, Gurobi’s default setting is to convert any solution it finds toa basic feasible solution, but we turn this setting off since we do not require this of our ownsolver and we are simply interested in finding any solution to the LP. In practice we findthat the interior point solver is the fastest. The runtimes we report do not include the timespent forming the constraint matrix. This in and of itself is an expensive task that must betaken into account when using black-box software to solve problems of this form.8.1.1. Lazy-Constraint Method. 
Both for sparsest cut and correlation clustering we also test an additional lazy-constraint method when employing Gurobi software. This procedure works as follows:
(1) Given a metric-constrained LP, solve the objective on a subproblem that includes all the same constraints except the metric constraints.
(2) Given the solution to the subproblem, check for violations of the metric constraints. Update the constraint set to include all such violated constraints. Re-solve the LP using black-box software on the updated set of constraints.
(3) Continually re-solve the problem, check for violations, and update the constraint set. If we reach a point where all original metric constraints are satisfied before the algorithm fails due to memory issues, the solution is guaranteed to be the solution to the original metric-constrained LP.
This procedure in some cases leads to significantly improved runtimes, since it may permit us to solve the original LP without ever forming the entire O(n^3) × O(n^2) constraint matrix. Quite often we find, especially for correlation clustering problems, that many constraints will naturally be satisfied without explicitly including them in the problem setup. However, for the sparsest cut relaxation, we find that a large number of metric constraints are tight at optimality, and therefore must be included explicitly in the constraint set. In practice, therefore, we observe that for the sparsest cut relaxation Gurobi continues to add constraints until a very large percentage of the original constraints are included explicitly. It therefore typically does not save time or space to repeatedly solve smaller subproblems.

8.2. Real-world Graphs. In our experiments we use real-world networks obtained almost exclusively from the SuiteSparse Matrix Collection [17]. In particular we use graphs from the Newman, Arenas, Pajek, and MathWorks groups for our sparsest cut experiments. In our correlation clustering experiments, we use Power, from the Newman group, and three collaboration networks from the SNAP repository [39]. The graphs fall into the following categories:
• Citation networks: SmallW and SmaGri
• Collaboration networks: caGrQc, caHepTh, caHepPh, Netscience, Erdos991
• Power grid: Power
• US flights graph: USAir97
• Web-based graphs: Harvard500 (web matrix), Polblogs (links between political blogs), Email (email correspondence graph)
• Word graph: Roget (thesaurus associations)
• Biology networks: C.El-Neural (neural network for the nematode C. elegans), C.El-Meta (metabolic network for C. elegans)
We also run one experiment on a graph not included in the SuiteSparse Matrix Collection. The graph Vassar85 is a snapshot of the Facebook network at Vassar College from the Facebook100 datasets. We include it in order to run our algorithm on an undirected network with around 3000 nodes. Before running experiments on any of the graphs above, we make all edges undirected, remove edge weights, and find the largest connected component. In this way we ensure we are always working with connected, unweighted, and undirected networks.

8.3. The Sparsest Cut Relaxation. We run DykstraSC on graphs ranging in size from 198 to 3068 nodes. Our machine has two 14-core 2.66 GHz Xeon processors, and for ease of reproducibility we limit experiments to 100 GB of RAM. Results are shown in Table 1 and Figure 1. Gurobi has an advantage on smaller graphs, but slows down and then runs out of memory once the graphs scale beyond a few hundred nodes.
Since DykstraSC is in factoptimizing a quadratic regularization of the sparsest cut LP relaxation, we also report howclose our solution is to the optimal LP solution, either by comparing against Gurobi or usingour a posteriori approximation guarantee presented in Corollary 6. In nearly all cases weare within 1% of the optimal LP solution.When running Gurobi, for graphs with fewer than 500 nodes we have run all three solvers(interior point, dual simplex, and primal simplex). We report times for the interior pointsolver, since it proves to be the fastest in all cases. Gurobi runs out of memory when tryingto form the entire constraint matrix for larger problems. We also test the lazy-constraintmethod to find it yields almost not benefit for the sparsest cut relaxation. For graphs smallerthan Harvard500, where Gurobi was able to work with the entire constraint matrix, couplingthe interior point solver with the lazy-constraint procedure leads to much longer runtimes.Additionally, we find in all cases that by the time the lazy-constraint solver converged, well PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 20 n(n-1)/210 R un t i m e ( s ) Figure 1. Runtimes forDykstraSC on real-worldgraphs with 198 to 3068nodes. If n is the num-ber of nodes in the graph,then DykstraSC solves for n ( n − / distance scores.over half of the original constraint set had to be explicitly included in order to force all othermetric constraints to be satisfied. Therefore, in addition to significantly worse runtimes, wesee only a minor decrease in the memory requirement.On larger graphs, the slight decrease in memory afforded by the lazy-constraint methoddoes allows us to solve the sparsest cut relaxation on Harvard500, which was not possiblewhen forming the entire constraint matrix up front. This is the only positive result wesee for using this approach for this relaxation. However, it still requires solving a largenumber of expensive subproblems, leading to a runtime that is an order of magnitude slowerthan DykstraSC. We also tried the lazy-constraint approach on Roget, SmaGri, Email, andPolblogs. For all of these graphs, Gurobi spends a considerable amount of time solvingsubproblems, but still eventually runs out of memory before finding a solution. Due to thisrepeated failure to produce results on much smaller graphs, we did not attempt to run thelazy-constraint solver on Vassar85. Table 1. We solve the LP relaxation for sparsest cut via DykstraSC on 13graphs. For Vassar85, we set γ = 2 and λ = 1 / ; for all other datasets weset γ = 5 and λ = 1 /n . Both DykstraSC and Gurobi (when it doesn’t runout of memory) solve the problems to within a relative gap tolerance of − ,and satisfy constraints to within machine precision. The last column reportsan upper bound on the ratio between the LP score produced by DykstraSCand the optimal LP solution. Time is given in seconds.Graph | V | | E | . × 60 81 1.003SmallW 233 994 . × 93 166 1.001C.El-Neural 297 2148 . × 274 350 1.000USAir97 332 2126 . × 471 511 1.041Netscience 379 914 . × 887 1134 1.000Erdos991 446 1413 . × . × . × . × out of memory 53449 1.008SmaGri 1024 4916 . × out of memory 25703 1.002Email 1133 5451 . × out of memory 34621 1.005Polblogs 1222 16714 . × out of memory 41080 1.013Vassar85 3068 119161 . × out of memory 155333 1.165 PROJECTION METHOD FOR METRIC-CONSTRAINED OPTIMIZATION 21 Weighted Correlation Clustering. We convert several real-world graphs into in-stances of correlation clustering using the approach of Wang et al. [52]. 
The procedure is as follows:
(1) Given an input graph G = (V, E), compute the Jaccard coefficient between each pair of nodes i, j:
J_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|,
where N(u) is the set of nodes adjacent to node u.
(2) Apply a non-linear function to the Jaccard coefficients to obtain a score indicating similarity or dissimilarity:
S_ij = log( (1 + (J_ij − δ)) / (1 − (J_ij − δ)) ).
Here, δ is a parameter set so that S_ij > 0 if J_ij > δ and S_ij < 0 when J_ij < δ. Following Wang et al. [52], we fix δ to a small constant threshold.
(3) Wang et al. stop after the above step and use the S_ij scores for their correlation clustering problems. We additionally offset each entry by ±ε to avoid cases where edge weights are zero:
Z_ij = S_ij + ε if S_ij > 0;  S_ij − ε if S_ij < 0;  ε if S_ij = 0 and (i, j) ∈ E;  −ε if S_ij = 0 and (i, j) ∉ E.
If S_ij = 0, this indicates there is no strong similarity or dissimilarity between the nodes based on their Jaccard coefficient. If in this case nodes i and j are adjacent, we interpret this as a small indication of similarity and assign them a small positive weight. Otherwise we assign a small negative weight. In all our experiments we fix ε to a small positive constant.
The sign of Z_ij indicates whether nodes i and j are similar or dissimilar, and w_ij = |Z_ij| > 0 is the non-negative weight for the associated correlation clustering problem. Results for running DykstraCC and Gurobi on the resulting signed graphs are shown in Table 2. A code sketch of this construction appears below.
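The following Julia sketch (ours, not the paper's code) illustrates the construction; the default values of δ and ε in the signature are placeholders and are not the constants used in the paper's experiments.

# Sketch: turn an unweighted graph (given as adjacency lists) into a signed
# correlation clustering instance following the Jaccard-based procedure above.
function signed_weights(adj::Vector{Vector{Int}}; δ::Float64 = 0.05, ε::Float64 = 0.01)
    n = length(adj)
    nbrs = [Set(a) for a in adj]
    Z = zeros(n, n)
    for i in 1:n-1, j in i+1:n
        inter = length(intersect(nbrs[i], nbrs[j]))
        uni   = length(union(nbrs[i], nbrs[j]))
        J = uni == 0 ? 0.0 : inter / uni                 # Jaccard coefficient
        S = log((1 + (J - δ)) / (1 - (J - δ)))           # positive iff J > δ
        if S > 0
            Z[i, j] = S + ε
        elseif S < 0
            Z[i, j] = S - ε
        else                                             # S == 0: fall back on adjacency
            Z[i, j] = (j in nbrs[i]) ? ε : -ε
        end
        Z[j, i] = Z[i, j]
    end
    return Z    # sign(Z[i,j]) marks similar/dissimilar; |Z[i,j]| is the CC weight
end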
On problems of this size, we have no hope of ever forming the entire constraint matrix and using Gurobi without running out of memory. Therefore, we restrict to using the lazy-constraint approach, coupled with Gurobi's interior point solver. In one case, the lazy-constraint method converges very quickly. Effectively, it finds a small subset of constraints that are sufficient to force all other metric constraints to be satisfied at optimality. However, Gurobi runs out of memory on the other large problems considered, indicating that, even if we are extremely careful, black-box solvers are unable to compete with our Dykstra-based approach.

Because the correlation clustering problems we address are so large, we set γ = 1 and run Dykstra's method until constraints are satisfied to within a tolerance of 0.01. We find that long before the constraint tolerance reaches this point, the relative duality gap has already shrunk to a negligible value. We note that although it takes a long time to reach convergence on graphs with thousands of nodes, DykstraCC has no issues with memory. Monitoring the memory usage of our machine, we noted that for the 11,204-node graph, DykstraCC was using only around 12 GB of the 100 GB of available RAM. Given enough time, therefore, we expect our method to be able to solve metric-constrained LPs on an even larger scale. The ability to solve these relaxations on problems of this scale is already an accomplishment, given that standard optimization software often fails on graphs with even a few hundred nodes. In future work we wish to continue exploring options for randomized and parallel projection methods, so that we can more quickly obtain answers for large-scale correlation clustering relaxations and other related problems.

Table 2. DykstraCC can solve convex relaxations of correlation clustering with up to 700 billion constraints. The lazy-constraint Gurobi solver does very well for one very sparse graph, but runs out of memory on all other problems. We set γ = 1 and the constraint tolerance to 0.01. Selecting a small γ leads to poorer approximation guarantees, but dramatically decreases the number of iterations needed until convergence. We can still obtain the standard O(log n) approximation for weighted correlation clustering by applying previous rounding techniques [21, 26, 15].

Graph     |V|     |E|      constraints      Gurobi (lazy)    DykstraCC   ratio
Power     —       —        —                549 s            7.6 hrs     1.07
caGrQc    4158    13422    ≈ 3.6 × 10^10    out of memory    6.6 hrs     1.33
caHepTh   8638    24806    ≈ 3.2 × 10^11    out of memory    88.3 hrs    1.34
caHepPh   11204   117619   ≈ 7.0 × 10^11    out of memory    167.5 hrs   1.27

8.5. Experimental Differences Between Metric LPs. We note that the correlation clustering relaxation appears categorically easier to solve than the sparsest cut relaxation in our experiments. Dykstra's method tends to generate a denser dual vector for the sparsest cut relaxation, leading to memory issues on relatively small problems. The lazy-constraint method, which can successfully solve some correlation clustering problems with thousands of nodes, appears to provide no benefit for the sparsest cut relaxation. These observations are perhaps surprising, given that the underlying constraint set for both problems is nearly identical.

In order to explain this phenomenon, we first consider the original clustering objectives, rather than their LP relaxations. Note that the minimum sparsest cut always partitions a graph into two clusters, which often are at least somewhat balanced in size. Consider what this means for the binary variables x_ij that encode the clustering: in the case of two balanced clusters, many of these variables will be zero. In other words, there will be a large number of tight triangle inequality constraints of the form x_ij = 0 = 0 + 0 = x_ik + x_jk, because of triplets (i, j, k) that are all in the same cluster. Similarly, for two nodes i, j in one cluster and a third node k in the other cluster, the triangle constraints will also be tight: x_ik = c = c + 0 = x_jk + x_ij, where c > 0 is some constant depending on the graph and the underlying partition (see Section 6.3 for details). On the other hand, the optimal correlation clustering solution will often partition a graph into a large number of clusters. In this case, triplets of nodes (i, j, k) that are all in distinct clusters are much more prevalent, and such triplets correspond to metric constraints that are not tight at optimality (x_ij = c < c + c = x_ik + x_jk). When we relax these clustering objectives so that distances x_ij are no longer binary, we expect this phenomenon to still be reflected to some degree in the relaxed distances.

Intuitively, given that the sparsest cut relaxation typically involves a larger number of tight constraints, we would expect the dual vector in our projection algorithm to become dense more quickly, since adjustments need to be made more frequently and carefully at tight constraints. This explains the higher memory requirement we see for solving the sparsest cut relaxation. It also explains why the lazy-constraint black-box method is unsuccessful: the existence of many tight constraints implies that we eventually need to include a significant fraction of the metric constraints explicitly if we wish to solve the original problem. In practice, therefore, it is often best to simply include the full constraint set up front rather than solving a large number of subproblems that do not yield the optimal solution.

9.
9. Discussion and Future Challenges

In this paper we have developed an approach for solving expensive convex relaxations of clustering objectives that works at a much larger scale than was previously possible. We now note several challenges that seem inherent in improving this approach further without significantly departing from the use of projection methods. We tried variants of Bauschke's method [5] and Haugazeau's projection method [32, 6], which, unlike Dykstra's method, do not compute dual variables. Such methods hence require only O(n^2) memory instead of O(n^3), but come with significantly worse convergence guarantees. Because it is hard to determine in practice when these methods have converged, it is difficult to use them to obtain an output satisfying any guarantees with respect to the optimal solution. Another natural approach to consider is the use of parallelization or randomization. Parallel versions of Dykstra's method exist [34], but they rely on averaging out a large number of very small changes in each iteration, one per constraint. Since in our case there are O(n^3) constraints, we find that, in practice, this averaging approach leads to changes that are so small that no meaningful progress is made from one iteration to the next. The challenge in using a randomized approach (see [35]) is that visiting constraints at random leads to a much higher cost for visiting the same number of constraints: accessing dual variables at random from a dictionary is slower in practice than sequentially visiting the elements of an array of dual variables.

References

[1] G. Agarwal and D. Kempe. Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B, 66(3):409–418, Dec 2008.
[2] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2), 2009.
[3] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
[4] D. Batra, R. Sukthankar, and T. Chen. Semi-supervised clustering via learnt codeword distances. In Proceedings of the British Machine Vision Conference, BMVA 2008, pages 90.1–90.10. BMVA Press, 2008. doi:10.5244/C.22.90.
[5] Heinz H. Bauschke. The approximation of fixed points of compositions of nonexpansive mappings in Hilbert space. Journal of Mathematical Analysis and Applications, 202(1):150–159, 1996.
[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 2011. Springer, 2017.
[7] Heinz H. Bauschke and Valentin R. Koch. Projection methods: Swiss army knives for solving feasibility and best approximation problems with halfspaces. Contemp. Math, 636:1–40, 2015.
[8] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
[9] Ernesto G. Birgin and Marcos Raydan. Robust stopping criteria for Dykstra's algorithm. SIAM Journal on Scientific Computing, 26(4):1405–1414, 2005.
[10] Arijit Biswas. Semi-supervised and Active Image Clustering with Pairwise Constraints from Humans. PhD thesis, University of Maryland, College Park, 2014.
[11] Arijit Biswas and David Jacobs. An efficient algorithm for learning distances that obey the triangle inequality. In Proceedings of the British Machine Vision Conference, BMVA 2015, pages 10.1–10.13. BMVA Press, September 2015.
[12] Justin Brickell, Inderjit S. Dhillon, Suvrit Sra, and Joel A. Tropp. The metric nearness problem. SIAM Journal on Matrix Analysis and Applications, 30(1):375–396, 2008.
[13] Andrzej Cegielski. Iterative Methods for Fixed Point Problems in Hilbert Spaces, volume 2057. Springer, 2012.
[14] Yair Censor. Computational acceleration of projection algorithms for the linear best approximation problem. Linear Algebra and its Applications, 416(1):111–123, 2006.
[15] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005. Learning Theory 2003.
[16] Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, pages 219–228. ACM, 2015.
[17] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.
[18] Achiya Dax. On theory and practice of row relaxation methods. In Dan Butnariu, Yair Censor, and Simeon Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, volume 8 of Studies in Computational Mathematics, pages 153–186. Elsevier, 2001.
[19] Achiya Dax. The adventures of a simple algorithm. Linear Algebra and its Applications, 361:41–61, 2003.
[20] Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu. Linear programming relaxations of maxcut. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pages 53–61, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[21] Erik D. Demaine and Nicole Immorlica. Correlation clustering with partial information. In Sanjeev Arora, Klaus Jansen, José D. P. Rolim, and Amit Sahai, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 1–13, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
[22] Inderjit S. Dhillon, Suvrit Sra, and Joel A. Tropp. The metric nearness problems with applications. Technical report, 2003.
[23] Inderjit S. Dhillon, Suvrit Sra, and Joel A. Tropp. Triangle fixing algorithms for the metric nearness problem. In Advances in Neural Information Processing Systems 17, NIPS 2004, pages 361–368, Cambridge, MA, USA, 2004. MIT Press.
[24] Thang N. Dinh, Xiang Li, and My T. Thai. Network clustering via maximizing modularity: Approximation algorithms and theoretical limits. In Proceedings of the 2015 IEEE International Conference on Data Mining, ICDM 2015, pages 101–110. IEEE, 2015.
[25] Richard L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
[26] Dotan Emanuel and Amos Fiat. Correlation clustering – minimizing disagreements on arbitrary weighted graphs. In European Symposium on Algorithms, pages 208–220, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
[27] R. Escalante and M. Raydan. Alternating Projection Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2011.
[28] Camillo Gentile. Sensor location through linear programming with triangle inequality constraints. In IEEE International Conference on Communications, volume 5, pages 3192–3196. IEEE, 2005.
[29] Camillo Gentile. Distributed sensor location through linear programming with triangle inequality constraints. IEEE Transactions on Wireless Communications, 6(7), 2007.
[30] D. Glasner, S. N. Vitaladevuni, and R. Basri. Contour-based joint clustering of multiple segmentations. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, pages 2385–2392, Washington, DC, USA, 2011. IEEE Computer Society.
[31] Shih-Ping Han. A successive projection method. Mathematical Programming, 40(1-3):1–14, 1988.
[32] Yves Haugazeau. Sur les inéquations variationnelles et la minimisation de fonctionnelles convexes. PhD thesis, Université de Paris, 1968.
[33] Clifford Hildreth. A quadratic programming procedure. Naval Research Logistics (NRL), 4(1):79–85, 1957.
[34] Alfredo N. Iusem and Alvaro R. De Pierro. On the convergence of Han's method for convex programming with quadratic objective. Mathematical Programming, 52(1-3):265–284, 1991.
[35] Noreen Jamil, Xuemei Chen, and Alexander Cloninger. Hildreth's algorithm with applications to soft constraints for user interface layout. Journal of Computational and Applied Mathematics, 288:193–202, 2015.
[36] Kevin Lang and Satish Rao. Finding near-optimal cuts: an empirical evaluation. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 1993, pages 212–221. Society for Industrial and Applied Mathematics, 1993.
[37] Kevin J. Lang, Michael W. Mahoney, and Lorenzo Orecchia. Empirical evaluation of graph partitioning using spectral embeddings and flow. In International Symposium on Experimental Algorithms, SEA 2009, pages 197–208. Springer, 2009.
[38] Tom Leighton and Satish Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM (JACM), 46(6):787–832, November 1999.
[39] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[40] O. L. Mangasarian. Normal solutions of linear programs. Mathematical Programming at Oberwolfach II, pages 206–216, 1984.
[41] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[42] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(026113), 2004.
[43] S. Poljak and Zsolt Tuza. Maximum cuts and large bipartite subgraphs. Pages 181–244. AMS, Providence, 1995.
[44] Gregory J. Puleo and Olgica Milenkovic. Correlation clustering with constrained cluster sizes and extended weights bounds. SIAM Journal on Optimization, 25(3):1857–1872, 2015.
[45] Gregory J. Puleo and Olgica Milenkovic. Correlation clustering and biclustering with locally bounded errors. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pages 869–877. JMLR.org, 2016.
[46] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, pages 526–527. Society for Industrial and Applied Mathematics, 2004.
[47] Ryan J. Tibshirani. Dykstra's algorithm, ADMM, and coordinate descent: Connections, insights, and extensions. In Advances in Neural Information Processing Systems 30, NIPS 2017, pages 517–528. Curran Associates, Inc., 2017.
[48] Jurgen Van Gael and Xiaojin Zhu. Correlation clustering for crosslingual link detection. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 2007, pages 1744–1749, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[49] Anke van Zuylen and David P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. Mathematics of Operations Research, 34(3):594–620, 2009.
[50] Nate Veldt, David F. Gleich, and Anthony Wirth. A correlation clustering framework for community detection. In Proceedings of the 2018 World Wide Web Conference, WWW 2018, pages 439–448, 2018.
[51] S. N. Vitaladevuni and R. Basri. Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, pages 2203–2210, June 2010.
[52] Yubo Wang, Linli Xu, Yucheng Chen, and Hao Wang. A scalable approach for general correlation clustering. In International Conference on Advanced Data Mining and Applications, ADMA 2013, pages 13–24. Springer, 2013.
[53] Anthony Wirth. Approximation Algorithms for Clustering. PhD thesis, Princeton University, 2004.
[54] Anthony Wirth. Correlation clustering. In Encyclopedia of Machine Learning, pages 227–231. 2010.
[55] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, NIPS 2002, pages 521–528, Cambridge, MA, USA, 2002. MIT Press.

Appendix A. Lemma Supporting the Proof of Theorem 1

The full proof of Theorem 1 relies on the following lemma, which shows that we can safely discard the upper and lower bounds on the x_{ij} variables without changing the optimal result.

Lemma 7. Even without the constraint family "0 ≤ x_{ij} ≤ 1 for all pairs (i, j)", every optimal solution to problem (5) in fact satisfies those constraints.

Proof. The proof proceeds by contradiction: assume that X = (x_{ij}) is an optimal solution obtained by optimizing objective (5) without the constraints x_{ij} ∈ [0, 1], and assume also that at least one variable is greater than one or less than zero. Next, define a new set of variables Z = (z_{ij}) as follows:

z_{ij} = 0 if x_{ij} < 0;  x_{ij} if x_{ij} ∈ [0, 1];  1 if x_{ij} > 1.

Notice that this Z would have a strictly lower (i.e., better) objective score than X for problem (5), because |z_{ij} − d_{ij}| ≤ |x_{ij} − d_{ij}| for all (i, j), and this inequality is strict for at least one pair of nodes, since we assumed that there is at least one variable x_{ij} that is not in [0, 1]. It just remains to show that Z also satisfies the triangle inequality constraints and is thus feasible, which would give the desired contradiction to the optimality of X.

Proving that Z is feasible is tedious, as it requires checking a long list of cases. We consider a triplet (i, j, k), and we wish to check that

z_{ij} ≤ z_{jk} + z_{ik}
z_{ik} ≤ z_{jk} + z_{ij}
z_{jk} ≤ z_{ij} + z_{ik}

for all possible values of (x_{ij}, x_{ik}, x_{jk}). To illustrate the technique we provide proofs for the cases where the X variables are all nonnegative, but may be larger than 1. The same approach works if we also consider the cases where some of the X variables are negative, though this involves checking 27 cases, which we do not list exhaustively here. Let v_{ab} be a binary variable indicating whether x_{ab} > 1.

• Case 1: (v_{ij}, v_{ik}, v_{jk}) = (0, 0, 0). In this case the z variables are identical to the x variables and the constraints hold.
• Case 2: (v_{ij}, v_{ik}, v_{jk}) = (1, 1, 1). In this case z_{ij} = z_{ik} = z_{jk} = 1 and the constraints hold.
• Case 3: (v_{ij}, v_{ik}, v_{jk}) = (1, 0, 0). Since x_{ij} > 1, by construction z_{ij} = 1, and z_{jk} = x_{jk} ≤ 1, z_{ik} = x_{ik} ≤ 1, so we can confirm that:

z_{ij} = 1 < x_{ij} ≤ x_{ik} + x_{jk} = z_{ik} + z_{jk}
z_{ik} = x_{ik} ≤ 1 ≤ x_{jk} + 1 = z_{jk} + z_{ij}
z_{jk} = x_{jk} ≤ 1 ≤ x_{ik} + 1 = z_{ik} + z_{ij}

• Case 4: (v_{ij}, v_{ik}, v_{jk}) = (0, 1, 0). This is symmetric to Case 3.
• Case 5: (v_{ij}, v_{ik}, v_{jk}) = (1, 1, 0). In this case we see that z_{ij} = 1 < x_{ij}, z_{ik} = 1 < x_{ik}, and z_{jk} = x_{jk}, so

z_{ij} = 1 ≤ 1 + z_{jk} = z_{ik} + z_{jk}
z_{ik} = 1 ≤ 1 + z_{jk} = z_{ij} + z_{jk}
z_{jk} ≤ 1 < 2 = z_{ij} + z_{ik}.

• Case 6: (v_{ij}, v_{ik}, v_{jk}) = (0, 0, 1). Symmetric to Cases 3 and 4.
• Case 7: (v_{ij}, v_{ik}, v_{jk}) = (1, 0, 1). Symmetric to Case 5.
• Case 8: (v_{ij}, v_{ik}, v_{jk}) = (0, 1, 1). Symmetric to Cases 5 and 7. □

Appendix B. Projection Methods and Quadratic Programming

In this appendix we provide more general background on Dykstra's projection method [25] and how it can be used to minimize a quadratic program. The results here are not new, but are rather a compilation of the past work on projection methods and quadratic programming that is most relevant to us.

B.1. Quadratic Programming. Let A ∈ R^{M×N}, b ∈ R^M, c ∈ R^N, γ > 0, and let W ∈ R^{N×N} be a positive definite matrix. Consider the following quadratic program:

(33)   min_x Q(x) = (1/2) x^T W x + c^T x   subject to   Ax ≤ b.

The dual of this quadratic program is

(34)   max_y D(y) = −b^T y − (1/2) ||A^T y + c||^2_{W^{-1}}   subject to   y ≥ 0.

The objectives are equal when the Karush–Kuhn–Tucker (KKT) conditions are satisfied, i.e., when we find vectors x̂ and ŷ that satisfy:

KKT Conditions.
(1) W x̂ = −A^T ŷ − c
(2) A x̂ ≤ b
(3) ŷ^T (A x̂ − b) = 0
(4) ŷ ≥ 0.

If the above are satisfied, we know that ŷ = argmax D(y) and x̂ = argmin Q(x), with D(ŷ) = Q(x̂). In general, if x satisfies Ax ≤ b and y ≥ 0, then we know that D(y) ≤ D(ŷ) = Q(x̂) ≤ Q(x).

B.2. The Best Approximation Problem. Let C_i ⊂ R^N for i = 1, 2, . . . , M be convex sets and let C = ∩_{i=1}^M C_i be their intersection (which is also convex). Given z ∈ R^N, the best approximation problem (BAP) is to find

(35)   x* = P_C(z) = argmin_{x ∈ C} ||x − z||,

where we can specify any norm ||·||. P_C(z) is the projection of z onto C. The BAP is often solved using projection methods, which operate by visiting the constraint sets C_i cyclically and repeatedly performing easier projections of the form

x_i = P_{C_i}(z_i) = argmin_{x ∈ C_i} ||x − z_i||.

Note that we are not restricted to the standard Euclidean norm; in fact we will use a weighted norm in our work.

B.3. Dykstra's Method for BAP. Dykstra's method is one common approach to solving the BAP (see [25]). In order to solve (35), Dykstra's method computes the following updates:

(1) I_i = 0 ∈ R^N for i = 1, 2, . . . , M (increment or correction vectors)
(2) x := z
(3) At step k:
  • i := (k − 1) mod M + 1 (cyclically visit constraints)
  • x_c := x − I_i (correction step)
  • x := P_{C_i}(x_c) (projection step)
  • I_i := x − x_c (update increment).

This simplifies when we consider applying Dykstra's method to a quadratic program like (33). To do this, we will not use the standard norm and standard inner product but, given arbitrary vectors f, g ∈ R^N, the weighted inner product and norm

⟨f, g⟩_W = f^T W g,    ||f||_W^2 = ⟨f, f⟩_W = f^T W f.
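To make the generic updates in B.3 concrete, the following is a minimal numpy sketch of Dykstra's method for projecting a point onto an intersection of half-spaces in the standard Euclidean norm. The two-constraint instance at the bottom is an illustrative assumption, not a problem from the paper.

import numpy as np

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the half-space {z : a^T z <= b}."""
    viol = a @ x - b
    if viol <= 0:
        return x
    return x - (viol / (a @ a)) * a

def dykstra_bap(z, A, b, sweeps=200):
    """Dykstra's method for P_C(z), where C is the intersection of half-spaces a_i^T x <= b_i.

    A: (M, N) constraint matrix, b: (M,) right-hand sides.
    Keeps one increment (correction) vector I_i per constraint, as in B.3.
    """
    M, N = A.shape
    x = z.copy()
    I = np.zeros((M, N))
    for k in range(sweeps * M):
        i = k % M                                  # cyclically visit constraints
        xc = x - I[i]                              # correction step
        x = project_halfspace(xc, A[i], b[i])      # projection step
        I[i] = x - xc                              # update increment
    return x

# Illustrative instance: project the point (2, 2) onto {x + y <= 1} ∩ {x - y <= 0}.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 0.0])
print(dykstra_bap(np.array([2.0, 2.0]), A, b))     # approaches the nearest feasible point (0.5, 0.5)

The next part of this appendix specializes these updates to the weighted norm ||·||_W, which is what turns the generic BAP solver into a quadratic programming method.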
Observe the following equivalence:

(1/2) ||x + W^{-1} c||_W^2 = (1/2) x^T W x + x^T W W^{-1} c + (1/2) c^T W^{-1} c = (1/2) x^T W x + c^T x + constant.

In other words, the quadratic program (33) is equivalent to the following best approximation problem:

(36)   x* = P_C(z) = argmin_{x ∈ C} ||x − z||_W,

where C = ∩_{i=1}^M C_i. Here, z = −W^{-1} c and we are projecting onto half-space constraints:

C_i = {x ∈ R^N : a_i^T x ≤ b_i} = {x ∈ R^N : ⟨ã_i, x⟩_W ≤ b_i},

where a_i^T is the i-th row of the constraint matrix A and ã_i = W^{-1} a_i is scaled so that ⟨ã_i, x⟩_W = ã_i^T W x = a_i^T W^{-1} W x = a_i^T x. Dykstra's method relies on being able to project quickly onto each constraint set C_i. For simple half-space constraints, the projection can be computed as follows:

P_{C_i}(x) = argmin_{x' ∈ C_i} ||x' − x|| = x − ([⟨ã_i, x⟩ − b_i]_+ / ||ã_i||^2) ã_i,

where [a]_+ equals a if a > 0 and is zero otherwise (a standard textbook result; see Section 4.1.3 of [13]). Since we are using the W-weighted norm and inner product, for our problem this becomes:

P_{C_i}(x) = x − ([a_i^T x − b_i]_+ / (a_i^T W^{-1} a_i)) (W^{-1} a_i).

Observe that this type of projection always takes the original vector x and simply adds a constant times the vector W^{-1} a_i when visiting constraint i. Therefore, when implementing the algorithm, we do not need to store an entire increment vector I_i for each constraint. As long as we have the weight matrix W and the constraint matrix A stored up front, it suffices to store a single extra constant [a_i^T x − b_i]_+ / (a_i^T W^{-1} a_i) in order to perform the correction step.

We re-write Dykstra's algorithm for quadratic programming as follows:

Dykstra's Algorithm applied to Quadratic Program (33)
(1) y := 0 ∈ R^M
(2) x := −W^{-1} c
(3) At step k:
  • i := (k − 1) mod M + 1 (cyclically visit constraints)
  • x := x + y_i W^{-1} a_i (correction step)
  • x := x − θ_i^+ W^{-1} a_i (projection step), where θ_i^+ = [a_i^T x − b_i]_+ / (a_i^T W^{-1} a_i)
  • y_i := θ_i^+ ≥ 0.

B.4. Hildreth's Projection Method. It is well known that for half-space constraints, Dykstra's method is equivalent to Hildreth's method [33]. At first glance there appear to be slight differences, but these are easily accounted for.

Hildreth's Algorithm
(1) y := 0 ∈ R^M
(2) x := −W^{-1} c
(3) At step k:
  • i := (k − 1) mod M + 1
  • θ_i = (a_i^T x − b_i) / (a_i^T W^{-1} a_i)
  • δ = min{−θ_i, y_i}
  • x := x + δ W^{-1} a_i
  • y_i := y_i − δ.

Proving the equivalence between the two methods amounts simply to combining the two separate updates

  • x_c = x + y_i W^{-1} a_i (correction step)
  • x' := x_c − ([a_i^T x_c − b_i]_+ / (a_i^T W^{-1} a_i)) W^{-1} a_i (projection step)

from Dykstra's method. Combining these updates gives:

x' = x + y_i W^{-1} a_i − ([a_i^T (x + y_i W^{-1} a_i) − b_i]_+ / (a_i^T W^{-1} a_i)) W^{-1} a_i.

The outcome of this update depends on the value of

a_i^T (x + y_i W^{-1} a_i) − b_i = (a_i^T x + y_i a_i^T W^{-1} a_i) − b_i,

which is greater than or equal to zero if and only if

−θ_i = −(a_i^T x − b_i) / (a_i^T W^{-1} a_i) ≤ y_i.

Therefore, if −θ_i ≤ y_i, then δ = −θ_i and the update is

x' = x + y_i W^{-1} a_i − ((a_i^T x − b_i + y_i a_i^T W^{-1} a_i) / (a_i^T W^{-1} a_i)) W^{-1} a_i = x + y_i W^{-1} a_i − (θ_i + y_i) W^{-1} a_i = x − θ_i W^{-1} a_i = x + δ W^{-1} a_i.

On the other hand, if y_i < −θ_i, then [a_i^T (x + y_i W^{-1} a_i) − b_i]_+ = 0, δ = y_i, and the update is x' = x + y_i W^{-1} a_i = x + δ W^{-1} a_i.

B.4.1. Nonnegative Dual Variables.
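Before discussing the dual variables, here is a minimal numpy sketch of the Dykstra/Hildreth updates for quadratic program (33) as written above. This is not the DykstraCC solver itself (which exploits the special structure of metric constraints and never forms A explicitly); the dense random instance, the fixed number of sweeps, and the assumption that W is symmetric positive definite are illustrative choices.

import numpy as np

def hildreth_qp(W, c, A, b, sweeps=500):
    """Hildreth/Dykstra updates for min (1/2) x^T W x + c^T x  s.t.  A x <= b.

    W is assumed symmetric positive definite. The iterates maintain
    W x = -A^T y - c and y >= 0, i.e., KKT conditions (1) and (4).
    """
    Winv = np.linalg.inv(W)
    M = A.shape[0]
    y = np.zeros(M)                          # dual variables, one per constraint
    x = -Winv @ c                            # starting point z = -W^{-1} c
    Wa = A @ Winv                            # row i is (W^{-1} a_i)^T
    denom = np.einsum('ij,ij->i', A, Wa)     # a_i^T W^{-1} a_i for each i
    for k in range(sweeps * M):
        i = k % M                            # cyclically visit constraints
        theta = (A[i] @ x - b[i]) / denom[i]
        delta = min(-theta, y[i])
        x = x + delta * Wa[i]                # primal step along W^{-1} a_i
        y[i] = y[i] - delta                  # dual step; stays nonnegative
    return x, y

# Illustrative instance: W = I, so this projects -c onto {A x <= b}.
rng = np.random.default_rng(0)
W = np.eye(3)
c = -rng.normal(size=3)
A = rng.normal(size=(4, 3))
b = np.ones(4)
x, y = hildreth_qp(W, c, A, b)
print("max constraint violation:", max((A @ x - b).max(), 0.0))
print("dual objective D(y):", -b @ y - 0.5 * (A.T @ y + c) @ np.linalg.solve(W, A.T @ y + c))

Because the sketch tracks the dual vector y, the dual objective D(y) printed at the end gives a valid lower bound on Q(x) at any feasible x, which is the property our stopping criterion exploits.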
Note that y_i ≥ 0 is maintained in either case: if δ = y_i then we update y_i := y_i − δ = 0, and if δ = −θ_i then −θ_i ≤ y_i implies 0 ≤ y_i + θ_i = y_i − δ. This is important because, as we show in the next section, these y_i variables are the nonnegative dual variables of the optimization problem.

B.5. Hildreth's Method, Coordinate Descent, and KKT Conditions. We now show the equivalence between Dykstra's/Hildreth's method and performing coordinate descent on the negative of the dual function. The details shown here are a slight generalization of the result shown by Dax [18, 19], updated to explicitly include a non-identity weight matrix W.

We first replace the maximization objective (34) with an equivalent quadratic program that is minimized:

(37)   min_y F(y) = b^T y + (1/2) ||A^T y + c||^2_{W^{-1}}   subject to   y ≥ 0,

and consider the following optimization approach: let y = 0 and set x = −W^{-1} c, so that the first KKT condition is satisfied: Wx = −A^T y − c. We will update y one variable at a time so that F(y) is strictly decreasing and so that the first and fourth KKT conditions are satisfied at every step: W x_k = −A^T y_k − c and y_k ≥ 0.

B.5.1. Maintaining Two KKT Conditions. When we visit the i-th constraint, we perform the following steps:

(1) Compute θ_i = (a_i^T x − b_i) / (a_i^T W^{-1} a_i)
(2) Set δ = min{−θ_i, y_i}
(3) Update x := x + δ W^{-1} a_i and y := y − e_i δ,

where e_i is the vector with zeros everywhere except for a 1 in the i-th position. We have already noted that the entries of y computed by the method will always be nonnegative. To see that the other KKT condition is also maintained, assume that Wx = −A^T y − c holds at one point in time and perform a single update to get new primal and dual vectors x' and y':

x' = x + δ W^{-1} a_i,    y' = y − e_i δ.

Then note that

x' = x + δ W^{-1} a_i = W^{-1}(−A^T y − c) + δ W^{-1} a_i = W^{-1}(−A^T y + δ a_i − c)

and

−A^T y' − c = −A^T (y − e_i δ) − c = −A^T y − c + δ a_i,

so these combine to yield W x' = −A^T y' − c.

B.5.2. Coordinate Descent. The connection to coordinate descent is seen by realizing that the value θ_i computed above uniquely minimizes the following one-variable function:

f(θ) = F(y + e_i θ)
     = b^T y + θ b_i + (1/2) ||A^T (y + e_i θ) + c||^2_{W^{-1}}
     = b^T y + θ b_i + (1/2) ||−Wx + θ a_i||^2_{W^{-1}}
     = b^T y + θ b_i + (1/2) ((−Wx + θ a_i)^T W^{-1} (−Wx + θ a_i))
     = b^T y + θ b_i + (1/2) (x^T W x + θ^2 a_i^T W^{-1} a_i − 2θ a_i^T x)
     = θ b_i + (1/2) θ^2 a_i^T W^{-1} a_i − θ a_i^T x + (1/2) x^T W x + b^T y.

This is minimized when

f'(θ) = b_i − a_i^T x + θ a_i^T W^{-1} a_i = 0  ⟹  θ = (a_i^T x − b_i) / (a_i^T W^{-1} a_i) = θ_i.

If we can perform this update without violating the constraint y_i ≥ 0, then we do so, updating y_i := y_i + θ_i ≥ 0. Otherwise, if θ_i < −y_i ≤ 0, we simply decrease y_i as much as we can without violating the constraint (i.e., we set y_i = 0). In either case, writing the change made to y_i as δ, the function strictly decreases, since

f(0) − f(δ) ≥ (1/2) δ^2 a_i^T W^{-1} a_i > 0.

To see this, realize that δ = ν θ_i for some ν ∈ [0, 1] and that θ_i (a_i^T W^{-1} a_i) = a_i^T x − b_i, so

f(0) − f(δ) = δ (a_i^T x − b_i) − (δ^2/2) a_i^T W^{-1} a_i
            = ν θ_i (θ_i a_i^T W^{-1} a_i) − (ν^2 θ_i^2 / 2) a_i^T W^{-1} a_i
            = (1/2) θ_i^2 ν^2 a_i^T W^{-1} a_i (2 − ν)/ν
            ≥ (1/2) δ^2 a_i^T W^{-1} a_i   (since ν ∈ [0, 1]