A Unifying Approximate Method of Multipliers for Distributed Composite Optimization
XUYANG WU AND JIE LU†

Abstract.
This paper investigates solving convex composite optimization on an arbitrarily connected graph, where each node in the graph, endowed with a smooth component function and a nonsmooth one, is required to minimize the sum of all the component functions throughout the graph. To address such a problem, a general Approximate Method of Multipliers (AMM) is developed, which attempts to approximate the Method of Multipliers by virtue of a surrogate function with numerous options. We then design the surrogate function in various ways, leading to several subsets of AMM that enable distributed implementations over the graph. We demonstrate that AMM unifies a variety of existing distributed optimization algorithms, and certain specific designs of its surrogate function introduce different types of original algorithms to the literature. Furthermore, we show that AMM is able to achieve an O(1/k) rate of convergence to both optimality and feasibility, and the convergence rate becomes linear when the problem is locally restricted strongly convex and smooth. Such convergence rates provide new and stronger convergence results to many state-of-the-art methods that can be viewed as specializations of AMM.

Key words.
Distributed optimization, convex composite optimization, Method of Multipliers
AMS subject classifications.
1. Introduction. This paper addresses the following convex composite optimization problem:

minimize_{x ∈ R^d}  Σ_{i=1}^N ( f_i(x) + h_i(x) ),    (1.1)

where each f_i : R^d → R is convex and has a Lipschitz continuous gradient, and each h_i : R^d → R ∪ {+∞} is convex and can be non-differentiable. Notice that f_i is allowed to be a zero function. Also note that if h_i contains an indicator function I_{X_i} with respect to a convex set X_i ⊂ R^d, i.e., I_{X_i}(x) = 0 if x ∈ X_i and I_{X_i}(x) = +∞ otherwise, then problem (1.1) turns into a nonsmooth, constrained convex program.

We consider solving problem (1.1) in a distributed way over a connected, undirected graph G = (V, E), where V = {1, . . . , N} is the vertex set and E ⊆ {{i, j} | i, j ∈ V, i ≠ j} is the edge set. We suppose each node i ∈ V possesses two private component functions f_i and h_i, aims at solving (1.1), and only exchanges information with its neighbors, denoted by the set N_i = {j | {i, j} ∈ E} of nodes. Applications of such a distributed optimization problem include economic dispatch in smart grids [33], resource allocation for communication networks [3], robust estimation by sensor networks [25], as well as distributed learning [21].

To date, a large body of distributed optimization algorithms have been proposed to solve problem (1.1) or its special cases. For instance, the distributed subgradient methods [16, 30, 34] and the distributed gradient-tracking-based methods [17, 20, 19, 31, 27] are designed to emulate the centralized subgradient/gradient methods. Another typical practice is to reformulate (1.1) into a constrained problem with a separable objective function and a consensus constraint to be relaxed.

∗ Funding: This work has been supported by the National Natural Science Foundation of China under grant 61603254.
† School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China ([email protected], [email protected]).
For example, the inexact methods [2, 13] solve this reformulated problem by penaliz-ing the consensus constraint in the objective function, and the primal-dual methods[24, 22, 11, 10, 18, 12, 35, 1, 23, 32, 9, 28, 7, 5, 14, 15, 29] relax the consensus constraintthrough dual decomposition techniques.Despite the growing literature, relatively few methods manage to tackle the gen-eral form of (1.1), among which only the first-order methods [18, 28, 11, 35, 5, 12,1, 23, 32, 7, 10] are guaranteed to converge to optimality with constant step-sizes.Here, the algorithms in [18, 28] rely on strong convexity to evolve, and the methodsin [11, 35, 5] only prove asymptotic convergence for non-strongly convex problems. Incontrast, the remaining algorithms [12, 1, 23, 32, 7, 10] are able to establish O (1 /k )convergence rates in solving (1.1). Although these algorithms are developed usingdifferent rationales, we discover that the majority of them, i.e., [12, 1, 23, 32, 7], canindeed be thought to originate from the Method of Multipliers [6].The Method of Multipliers is a seminal (centralized) optimization method, andone of its notable variants is the Alternating Direction Method of Multipliers (ADMM)[6]. In section 2 of this paper, we develop a paradigm of solving (1.1) via approx-imating the behavior of the Method of Multipliers, called Approximate Method ofMultipliers (AMM). The proposed AMM adopts a possibly time-varying surrogatefunction under mild conditions to take the place of the smooth objective function atevery minimization step in the Method of Multipliers, facilitating an abundance ofdistributed algorithm designs. By narrowing down the diverse choices of the surrogatefunction in AMM, we first construct a subset of AMM, called
Distributed Approximate Method of Multipliers (DAMM), which opts for a Bregman-divergence-type surrogate function to enable fully decentralized solving of (1.1). We also utilize convex conjugate and neighbor-sparse structures induced from the graph to design two additional distributed subsets of AMM, referred to as DAMM-SC and DAMM-SQ, for solving smooth convex optimization, i.e., (1.1) with h_i ≡ 0 ∀i ∈ V. We concretely exemplify these distributed subsets of AMM, specifying that they can be specialized to new proximal algorithms, second-order methods, gradient-tracking methods, etc.

In section 3, we unify various state-of-the-art distributed algorithms [22, 32, 15, 7, 23, 1, 24, 12, 9, 17, 14] for solving (1.1) or its special cases, identifying that they can be cast into the form of AMM including its distributed subsets DAMM and DAMM-SQ. In section 4, we show that AMM enables the nodes to reach a consensus and converge to the optimal value at a rate of O(1/k) either with a particular set of surrogate functions or under a local restricted strong convexity condition on Σ_{i∈V} f_i(x). The O(1/k) convergence rate of AMM in solving the general form of (1.1) achieves the best order in the literature and is of the same order as [12, 1, 23, 32, 7, 10]. Note that the algorithms in [12, 1, 23, 32, 7] are indeed specializations of AMM with that particular set of surrogate functions, and the convergence rate in [10] is in terms of an implicit measure of optimality error. Moreover, we show that when (1.1) is both smooth and locally restricted strongly convex with respect to the optimum, AMM allows the nodes to attain the optimum at a linear rate. Unlike most existing works, such linear convergence is established in no need of global strong convexity. Furthermore, the convergence analysis on AMM naturally yields new convergence results for many existing methods discussed in section 3, and relaxes their problem assumptions without loss in their convergence rate orders. Subsequently, section 5 validates the practical performance of several distributed versions of AMM via a numerical example, and section 6 concludes the paper.

Notation and Definition: We use A = (a_ij)_{n×n} to denote an n×n real matrix whose (i, j)-entry, denoted by [A]_ij, is equal to a_ij. In addition, Null(A) is the null space of A, Range(A) is the range of A, and ‖A‖ is the spectral norm of A. Besides, diag(D_1, . . . , D_n) ∈ R^{nd×nd} represents the block diagonal matrix with D_1, . . . , D_n ∈ R^{d×d} being its diagonal blocks. Also, 0, 1, and O represent a zero vector, an all-one vector, and a zero square matrix of proper dimensions, respectively, and I_n represents the n×n identity matrix. For any two matrices A and B, A ⊗ B is the Kronecker product of A and B. If A and B are square matrices of the same size, A ⪰ B means A − B is positive semidefinite and A ≻ B means A − B is positive definite. For any A = A^T ∈ R^{n×n}, λ_max(A) denotes the largest eigenvalue of A. For any z ∈ R^n and A ⪰ O, ‖z‖ = √(z^T z) and ‖z‖_A = √(z^T A z). For any countable set S, we use |S| to represent its cardinality. For any convex function f : R^d → R, ∂f(x) ⊂ R^d denotes its subdifferential (i.e., the set of subgradients) at x.
If f is differentiable, ∂f(x) only contains the gradient ∇f(x).

Given a convex set X ⊆ R^d, a function f : R^d → R is said to be strongly convex on X with convexity parameter σ > 0 (i.e., σ-strongly convex on X) if ⟨g_x − g_y, x − y⟩ ≥ σ‖x − y‖² ∀x, y ∈ X, ∀g_x ∈ ∂f(x), ∀g_y ∈ ∂f(y). We say f is (globally) strongly convex if it is strongly convex on R^d. Given x̃ ∈ X, f is said to be restricted strongly convex with respect to x̃ on X with convexity parameter σ > 0 if ⟨g_x − g_x̃, x − x̃⟩ ≥ σ‖x − x̃‖² ∀x ∈ X, ∀g_x ∈ ∂f(x), ∀g_x̃ ∈ ∂f(x̃). We say f is locally restricted strongly convex with respect to x̃ if it is restricted strongly convex with respect to x̃ on every convex and compact subset of R^d containing x̃. Given L ≥ 0, f is said to be L-smooth if it is differentiable and its gradient is Lipschitz continuous with Lipschitz constant L, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ ∀x, y ∈ R^d.

This section develops distributed optimization algorithms for solving problem (1.1) over the undirected, connected graph G, with the following formalized problem assumption.

Assumption 2.1. For each i ∈ V, f_i : R^d → R and h_i : R^d → R ∪ {+∞} are convex. In addition, f_i is M_i-smooth for some M_i > 0. Moreover, there exists at least one optimal solution x⋆ to problem (1.1).

We first propose a family of optimization algorithms that effectively approximate the Method of Multipliers [6] and serve as a cornerstone of developing distributed algorithms for solving problem (1.1). Let each node i ∈ V keep a copy x_i ∈ R^d of the global decision variable x ∈ R^d in problem (1.1), and define

f(x) := Σ_{i∈V} f_i(x_i),   h(x) := Σ_{i∈V} h_i(x_i),

where x = (x_1^T, . . . , x_N^T)^T ∈ R^{Nd}. Then, problem (1.1) can be equivalently transformed into the following:

minimize_{x ∈ R^{Nd}}  f(x) + h(x)
subject to  x ∈ S := {x ∈ R^{Nd} | x_1 = · · · = x_N}.    (2.1)

Next, let H̃ ∈ R^{Nd×Nd} be such that H̃ = H̃^T, H̃ ⪰ O, and Null(H̃) = S. It has been shown in [14] that the consensus constraint x ∈ S in problem (2.1) is equivalent to H̃x = 0. Therefore, (2.1) is identical to

minimize_{x ∈ R^{Nd}}  f(x) + h(x)
subject to  H̃x = 0.    (2.2)
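One concrete choice of such an H̃ — consistent with the graph Laplacian example that appears later below Assumption 2.3 — is H̃ = L_G ⊗ I_d, where L_G is the Laplacian of G. The following minimal Python sketch (on a made-up 4-node cycle graph, not data from the paper) verifies numerically that this choice is symmetric, positive semidefinite, and has Null(H̃) = S.

```python
import numpy as np

def laplacian(num_nodes, edges):
    """Graph Laplacian: -1 on edges, node degrees on the diagonal."""
    L = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        L[i, j] = L[j, i] = -1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

N, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]               # an illustrative connected cycle graph
L_G = laplacian(N, edges)
H_tilde = np.kron(L_G, np.eye(d))                      # H~ = L_G (x) I_d

consensus = np.kron(np.ones(N), np.random.randn(d))    # some x with x_1 = ... = x_N
arbitrary = np.random.randn(N * d)
print(np.allclose(H_tilde, H_tilde.T))                       # symmetry
print(np.all(np.linalg.eigvalsh(H_tilde) >= -1e-10))         # positive semidefiniteness
print(np.allclose(H_tilde @ consensus, 0.0))                 # S lies in Null(H~)
print(np.allclose(H_tilde @ arbitrary, 0.0))                 # False almost surely: Null(H~) is exactly S
```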
Clearly, problems (1.1), (2.1), and (2.2) have the same optimal value. Also, x⋆ ∈ R^d is an optimal solution to (1.1) if and only if x⋆ = ((x⋆)^T, . . . , (x⋆)^T)^T ∈ R^{Nd} is an optimal solution to (2.1) and (2.2).

Consider applying the Method of Multipliers [6] to solve (2.2), which gives

x^{k+1} ∈ argmin_{x ∈ R^{Nd}}  f(x) + h(x) + (ρ/2)‖H̃x‖² + (v^k)^T H̃x,
v^{k+1} = v^k + ρH̃x^{k+1}.

Here, x^k ∈ R^{Nd} is the primal variable at iteration k ≥ 0, which is updated by minimizing an augmented Lagrangian function with the penalty (ρ/2)‖H̃x‖², ρ > 0, on the violation of the constraint H̃x = 0. In addition, v^k ∈ R^{Nd} is the dual variable, whose initial value v^0 can be arbitrarily set.

Although the Method of Multipliers may be applied to solve (2.2) with properly selected parameters, it is not implementable in a distributed fashion over G, even when the problem reduces to linear programming. To address this issue, our strategy is to first derive a paradigm of approximating the Method of Multipliers and then design its distributed realizations.

Our approximation approach is as follows: Starting from any v^0 ∈ R^{Nd},

x^{k+1} = argmin_{x ∈ R^{Nd}}  u_k(x) + h(x) + (ρ/2)‖x‖²_H + (v^k)^T H̃x,    (2.3)
v^{k+1} = v^k + ρH̃x^{k+1},  ∀k ≥ 0.    (2.4)

Compared to the Method of Multipliers, we adopt the same dual update (2.4) but construct a different primal update (2.3). In (2.3), we use a possibly time-varying surrogate function u_k : R^{Nd} → R to replace f(x) in the primal update of the Method of Multipliers, whose conditions are imposed in Assumption 2.2 below. Additionally, to introduce more flexibility, we use a different weight matrix H ∈ R^{Nd×Nd} to define the penalty term (ρ/2)‖x‖²_H, ρ >
0. We suppose H has the same properties as ˜ H , i.e., H = H T (cid:23) O and Null( H ) = S . Assumption u k ∀ k ≥ u k ∀ k ≥ u k + ρ k·k H ∀ k ≥ uniformly bounded from below by some positive constant.(c) ∇ u k ∀ k ≥ uniformly bounded from above by some nonnegative constant.(d) ∇ u k ( x k ) = ∇ f ( x k ) ∀ k ≥
0, where x k is generated by (2.3)–(2.4).The strong convexity condition in Assumption 2.2(b) guarantees that x k +1 in (2.3)is well-defined and uniquely exists. Assumption 2.2(d) is the key to making (2.3)–(2.4) solve problem (2.2). To explain this, note that (2.3) is equivalent to finding theunique x k +1 satisfying −∇ u k ( x k +1 ) − ρH x k +1 − ˜ H v k ∈ ∂h ( x k +1 ) . (2.5)Let ( x ⋆ , v ⋆ ) be a primal-dual optimal solution pair of problem (2.2), which satisfies − ˜ H v ⋆ − ∇ f ( x ⋆ ) ∈ ∂h ( x ⋆ ) . (2.6)If ( x k , v k ) = ( x ⋆ , v ⋆ ), then x k +1 has to be x ⋆ because of Assumption 2.2(d), H x ⋆ = ,(2.6), and the uniqueness of x k +1 in (2.5). It follows from (2.4) and ˜ H x ⋆ = UNIFYING APPROXIMATE METHOD OF MULTIPLIERS v k +1 = v ⋆ . Therefore, ( x ⋆ , v ⋆ ) is a fixed point of (2.3)–(2.4). The remainingconditions in Assumption 2.2 will be used for convergence analysis later in section 4.Algorithms described by (2.3)–(2.4) and obeying Assumption 2.2 are called Ap-proximate Method of Multipliers (AMM). As there are numerous options of the sur-rogate function u k , AMM unifies a wealth of optimization algorithms, including avariety of existing methods (cf. section 3) and many brand new algorithms. More-over, since Assumption 2.2 allows u k to have a more favorable structure than f , AMMwith appropriate u k ’s may induce a prominent reduction in computational cost com-pared to the Method of Multipliers. In the sequel, we will provide various options of u k , which give rise to a series of distributed versions of AMM. This subsectionlays out the parameter designs of AMM for distributed implementations.We first apply the following change of variable to AMM: q k = ˜ H v k , k ≥ . (2.7)Then, AMM (2.3)–(2.4) can be rewritten as x k +1 = arg min x ∈ R Nd u k ( x ) + h ( x ) + ρ k x k H + ( q k ) T x , (2.8) q k +1 = q k + ρ ˜ H x k +1 . (2.9)Moreover, note thatRange( ˜ H ) = Range( ˜ H ) = S ⊥ := { x ∈ R Nd | x + . . . + x N = } , where S ⊥ is the orthogonal complement of S in (2.1). Hence, (2.7) requires q k ∈ S ⊥ ∀ k ≥
0, which, due to (2.9), can be ensured simply by the following initialization:

q^0 ∈ S^⊥.    (2.10)

Therefore, (2.8)–(2.10) is an equivalent form of AMM.

Next, partition the primal variable x^k and the dual variable q^k in (2.8)–(2.10) as x^k = ((x_1^k)^T, . . . , (x_N^k)^T)^T and q^k = ((q_1^k)^T, . . . , (q_N^k)^T)^T. Suppose each node i ∈ V maintains x_i^k ∈ R^d and q_i^k ∈ R^d. Clearly, the nodes manage to collectively meet (2.10) by setting, for instance, q_i^0 = 0 ∀i ∈ V. Below, we discuss the selections of u_k, H, and H̃ for the sake of distributed implementations of (2.8) and (2.9).

To this end, we let H = P ⊗ I_d, H̃ = P̃ ⊗ I_d, and impose Assumption 2.3 on P, P̃ ∈ R^{N×N}.

Assumption 2.3. P = (p_ij)_{N×N} and P̃ = (p̃_ij)_{N×N} satisfy the following:
(a) p_ij = p_ji, p̃_ij = p̃_ji, ∀{i, j} ∈ E.
(b) p_ij = p̃_ij = 0, ∀i ∈ V, ∀j ∉ N_i ∪ {i}.
(c) Null(P) = Null(P̃) = span(1).
(d) P ⪰ O, P̃ ⪰ O.

The nodes can jointly determine P, P̃ under Assumption 2.3 without any centralized coordination. For instance, we can let each node i ∈ V agree with every neighbor j ∈ N_i on p_ij = p_ji < 0 and p̃_ij = p̃_ji <
0, and set p ii = − P j ∈N i p ij and XUYANG WU AND JIE LU ˜ p ii = − P j ∈N i ˜ p ij , which directly guarantee Assumption 2.3(a). Then, Assump-tion 2.3(c)(d) are satisfied effortlessly by means of Assumption 2.3(b) and the con-nectivity of G . Typical examples of such P and ˜ P include the graph Laplacian matrix L G ∈ R N × N with [ L G ] ij = [ L G ] ji = − ∀{ i, j } ∈ E and the Metropolis weight matrix M G ∈ R N × N with [ M G ] ij = [ M G ] ji = − {|N i | , |N j |} +1 ∀{ i, j } ∈ E . Furthermore, allthe conditions on H and ˜ H imposed in subsection 2.1 hold due to Assumption 2.3.With Assumption 2.3, (2.9) can be accomplished by letting each node update as q k +1 i = q ki + ρ X j ∈N i ∪{ i } ˜ p ij x k +1 j , ∀ i ∈ V . (2.11)Observe that this is a local operation, as each node i ∈ V only needs to acquire theprimal variables associated with its neighbors.It remains to design the surrogate function u k so that (2.8) can be realized in adistributed way. To this end, let ψ ki : R d → R ∀ i ∈ V ∀ k ≥ Assumption ψ ki ∀ i ∈ V ∀ k ≥ ψ ki ∀ i ∈ V ∀ k ≥ ψ ki ∀ i ∈ V ∀ k ≥ uniformly bounded from below by some positive constant.(c) ∇ ψ ki ∀ i ∈ V ∀ k ≥ uniformly bounded from above by some positive constant.(d) ( P i ∈V ψ ki ( x i )) − ρ k x k H ∀ k ≥ ψ ki ∀ i ∈ V ∀ k ≥ ρλ max ( H ) >
0, and one readily available upper bound on λ max ( H ) is max i ∈V , j ∈N i | p ij | N .Now define φ k ( x ) := (cid:0)X i ∈V ψ ki ( x i ) (cid:1) − ρ k x k H , ∀ k ≥ , (2.12)and let D φ k ( x , x k ) be the Bregman divergence associated with φ k at x and x k , i.e., D φ k ( x , x k ) = φ k ( x ) − φ k ( x k ) − h∇ φ k ( x k ) , x − x k i . (2.13)Then, we set u k as u k ( x ) = D φ k ( x , x k ) + h∇ f ( x k ) , x i . (2.14)With Assumption 2.4, (2.14) is sufficient to ensure Assumption 2.2. To see this, notefrom Assumption 2.4(a)(d) that u k ( x ) in (2.14) is twice continuously differentiableand convex, i.e., Assumption 2.2(a) holds. Also note from (2.13) and (2.14) that ∇ u k ( x ) = ∇ φ k ( x ) − ∇ φ k ( x k ) + ∇ f ( x k ) . (2.15)This, along with Assumption 2.4 (b)(c), guarantees Assumption 2.2(b)(c)(d).To see how (2.14) results in a distributed implementation of (2.8), note from(2.15) that (2.8) is equivalent to = ∇ φ k ( x k +1 ) − ∇ φ k ( x k ) + ∇ f ( x k ) + g k +1 + ρH x k +1 + q k UNIFYING APPROXIMATE METHOD OF MULTIPLIERS g k +1 ∈ ∂h ( x k +1 ). Then, using the definition of φ k in (2.12) and the structureof H given in Assumption 2.3, this can be written as = ∇ ψ ki ( x k +1 i ) + g k +1 i + q ki − ∇ ψ ki ( x ki ) + ∇ f i ( x ki ) + ρ X j ∈N i ∪{ i } p ij x kj , ∀ i ∈ V , where g k +1 i ∈ ∂h i ( x k +1 i ). In other words, (2.8) can be achieved by letting each node i ∈ V solve the following strongly convex optimization problem: x k +1 i = arg min x ∈ R d ψ ki ( x ) + h i ( x ) + h x, q ki − ∇ ψ ki ( x ki ) + ∇ f i ( x ki ) + ρ X j ∈N i ∪{ i } p ij x kj i , (2.16)which can be locally carried out by node i .Algorithms described by (2.10), (2.16), and (2.11) under Assumption 2.3 and As-sumption 2.4 constitute a subset of AMM, which can be implemented by the nodes in G via exchanging information with their neighbors only and, thus, are called DistributedApproximate Method of Multipliers (DAMM). Algorithm 2.1 describes in detail howthe nodes act in DAMM.
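To complement the listing that follows, here is a hedged Python sketch of the DAMM updates (2.16) and (2.11) for one simple admissible design: ψ_i^k(x) = (ε/2)‖x‖² (the first example after Algorithm 2.1 with r_i ≡ 0) and h_i(x) = λ‖x‖₁, with f_i(x) = ½‖B_i x − b_i‖² as in the numerical example of section 5. Under this ψ_i^k, the minimization in (2.16) reduces to entrywise soft-thresholding. The graph, data, and the values of ρ, λ, ε below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft-thresholding, i.e., the proximal operator of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
N, d, m = 4, 3, 5
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
nbrs = {i: [] for i in range(N)}
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

# P = P_tilde with negative Metropolis-type off-diagonal weights (Assumption 2.3).
P = np.zeros((N, N))
for i, j in edges:
    P[i, j] = P[j, i] = -1.0 / (max(len(nbrs[i]), len(nbrs[j])) + 1)
np.fill_diagonal(P, -P.sum(axis=1))

B = [rng.standard_normal((m, d)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]
grad_f = lambda i, x: B[i].T @ (B[i] @ x - b[i])     # gradient of f_i(x) = 0.5*||B_i x - b_i||^2

rho, lam, eps = 1.0, 0.1, 20.0                       # eps taken comfortably large (illustrative choice)
x = [rng.standard_normal(d) for _ in range(N)]       # x_i^0 arbitrary
q = [np.zeros(d) for _ in range(N)]                  # q_i^0 = 0 satisfies (2.10)

for k in range(1000):
    # Primal update (2.16): with psi_i^k = (eps/2)||x||^2 and h_i = lam*||x||_1 it is a soft threshold.
    w = [q[i] - eps * x[i] + grad_f(i, x[i])
         + rho * sum(P[i, j] * x[j] for j in nbrs[i] + [i]) for i in range(N)]
    x_new = [soft_threshold(-w[i] / eps, lam / eps) for i in range(N)]
    # Dual update (2.11), using only neighbors' new iterates.
    q = [q[i] + rho * sum(P[i, j] * x_new[j] for j in nbrs[i] + [i]) for i in range(N)]
    x = x_new

print("max disagreement between nodes:", max(np.linalg.norm(x[i] - x[0]) for i in range(N)))
```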
Algorithm 2.1
DAMM Initialization: Each node i ∈ V selects q i ∈ R d such that P i ∈V q i = (or simply sets q i = ). Each node i ∈ V arbitrarily sets x i ∈ R d and sends x i to every neighbor j ∈ N i . for k ≥ do Each node i ∈ V computes x k +1 i = arg min x ∈ R d ψ ki ( x ) + h i ( x ) + h x, q ki −∇ ψ ki ( x ki ) + ∇ f i ( x ki ) + ρ P j ∈N i ∪{ i } p ij x kj i . Each node i ∈ V sends x k +1 i to every neighbor j ∈ N i . Each node i ∈ V computes q k +1 i = q ki + ρ P j ∈N i ∪{ i } ˜ p ij x k +1 j . end for Finally, we provide two examples of DAMM with two particular choices of ψ ki . Example i ∈ V , let ψ ki ( x ) = r i ( x − x ki ) + ǫ i k x k ∀ k ≥
0, where r i : R d → R ∀ i ∈ V can be any convex, smooth, and twice continuously differen-tiable functions and ǫ i > ǫ i ≥ ρλ max ( H ). Then, DAMM reduces to adistributed proximal algorithm, with the following update of x ki : x k +1 i = arg min x ∈ R d r i ( x − x ki )+ ǫ i k x − x ki k + h i ( x )+ h x, q ki −∇ r i ( )+ ∇ f i ( x ki )+ ρ X j ∈N i ∪{ i } p ij x kj i . Example f i ∀ i ∈ V are twice continuously differentiable. Since ∇ f i ( x ) (cid:22) M i I d ∀ x ∈ R d , we can let each ψ ki ( x ) = x T ( ∇ f i ( x ki ) + ǫ i I d ) x , where ǫ i > ǫ i ≥ ρλ max ( H ). Then, the resulting DAMM is a distributed second-order method, with the following update of x ki : x k +1 i = arg min x ∈ R d k x − x ki k ∇ f i ( x ki )+ ǫ i I d h i ( x ) + h x, q ki + ∇ f i ( x ki ) + ρ X j ∈N i ∪{ i } p ij x kj i . In this subsection, we focus on thesmooth convex optimization problem min x ∈ R d P i ∈V f i ( x ), i.e., (1.1) with h i ( x ) ≡ XUYANG WU AND JIE LU ∀ i ∈ V . For this special case of (1.1), we provide additional designs of the surrogatefunction u k in AMM, leading to a couple of variations of DAMM.Here, we let u k still be in the form of (2.14), but no long require φ k + ρ k · k H be a separable function as in (2.12). Instead, we construct φ k based upon anotherfunction γ k : R Nd → R under Assumption 2.7. Assumption γ k ∀ k ≥ γ k ∀ k ≥ γ k ∀ k ≥ uniformly bounded from below by some γ > ∇ γ k ∀ k ≥ uniformly bounded from above by some ¯ γ > γ k ) ⋆ ( x ) − ρ k x k H ∀ k ≥ γ k ) ⋆ ( x ) = sup y ∈ R Nd h x , y i − γ k ( y )is the convex conjugate function of γ k .(e) For any k ≥ x ∈ R Nd , the i th d -dimensional block of ∇ γ k ( x ), denotedby ∇ i γ k ( x ), is independent of x j ∀ j / ∈ N i ∪ { i } .From Assumption 2.7(b)(c), each ( γ k ) ⋆ is (1 / ¯ γ )-strongly convex and (1 /γ )-smooth[8], so that Assumption 2.7(d) holds as long as I Nd / ¯ γ − ρH (cid:23) O . Now we set φ k ( x ) = ( γ k ) ⋆ ( x ) − ρ k x k H , ∀ k ≥ . (2.17)Below, we show that u k ∀ k ≥ γ k ) ⋆ guaranteeAssumption 2.2(b)(c). Also note from (2.15) that Assumption 2.2(d) is assured. Inaddition, due to Assumption 2.7(d), φ k is convex and, thus, so is u k . To show thetwice continuous differentiability of u k in Assumption 2.2(a), consider the fact from[4] that due to Assumption 2.7(b)(c), ∇ γ k is invertible and its inverse function is( ∇ γ k ) − = ∇ ( γ k ) ⋆ . This, along with (2.17), implies that ∇ φ k ( x ) = ( ∇ γ k ) − ( x ) − ρH x . (2.18)From the inverse function theorem [26], ( ∇ γ k ) − is continuously differentiable, so that φ k and therefore u k given by (2.14) and (2.17) are twice continuously differentiable.We then conclude that Assumption 2.2 holds.Equipped with (2.14), (2.17), and Assumption 2.7, we start to derive additionaldistributed subsets of AMM (2.8)–(2.10) for minimizing P i ∈V f i ( x ). As h i ( x ) ≡ ∀ i ∈ V , (2.8) is equivalent to ∇ φ k ( x k +1 ) − ∇ φ k ( x k ) + ∇ f ( x k ) + ρH x k +1 + q k = . (2.19)This, together with (2.18), gives x k +1 = ∇ γ k ( ∇ φ k ( x k ) − ∇ f ( x k ) − q k ) . (2.20)Like DAMM, we let each node i ∈ V maintain the i th d -dimensional blocks of x k and q k , i.e., x ki and q ki , where the update of q ki is the same as (2.11). According to(2.20), the update of x ki is given by x k +1 i = ∇ i γ k ( ∇ φ k ( x k ) − ∇ f ( x k ) − q k ). FromAssumption 2.7(e), this can be computed by node i provided that it has access to ∇ j φ k ( x k ) − ∇ f j ( x kj ) − q kj ∀ j ∈ N i ∪ { i } , where ∇ j φ k ( x k ) ∈ R d denotes the j th blockof ∇ φ k ( x k ). 
Therefore, the remaining question is whether each node i ∈ V is able to UNIFYING APPROXIMATE METHOD OF MULTIPLIERS ∇ i φ k ( x k ). In fact, this can be enabled by the following two concreteways of constructing γ k . Way : Let γ k = γ ∀ k ≥ γ : R Nd → R satisfying Assumption 2.7.Then, according to (2.17), φ k ∀ k ≥ φ : R Nd → R . We introducean auxiliary variable y k = (( y k ) T , . . . , ( y kN ) T ) T such that y k = ∇ φ ( x k ) ∀ k ≥
0. From(2.19), y k satisfies the following recursive relation: y k +1 = y k − ∇ f ( x k ) − q k − ρH x k +1 . Due to the structure of H in Assumption 2.3, each node i ∈ V is able to locallyupdate y k +1 i as above. Nevertheless, the initialization y = ∇ φ ( x ) = ( ∇ γ ) − ( x ) − ρH x cannot be achieved without centralized coordination, given that x is arbitrarilychosen in AMM. We may overcome this by imposing a restriction on x as follows:Let each node i ∈ V pick any ˜ z i ∈ R d , and force x i = ∇ i γ (˜ z ) with ˜ z = (˜ z T , . . . , ˜ z TN ) T ,so that ˜ z = ( ∇ γ ) − ( x ). Then, by letting each y i = ˜ z i − ρ P j ∈N i ∪{ i } p ij x j , we obtain y = ∇ φ ( x ). Due to Assumption 2.7(e), such initialization only requires each node i to share ˜ z i and x i with its neighbors, and thus is fully decentralized.Incorporating the above into (2.20) yields another subset of AMM, which is called Distributed Approximate Method of Multipliers for Smooth problems with Constantupdate functions (DAMM-SC) and is specified in Algorithm 2.2.
Algorithm 2.2
DAMM-SC Initialization: Each node i ∈ V selects q i ∈ R d such that P i ∈V q i = (or simply sets q i = ). Each node i ∈ V arbitrarily chooses ˜ z i ∈ R d and sends ˜ z i to every neighbor j ∈ N i . Each node i ∈ V sets x i = ∇ i γ (˜ z ) (which only depends on ˜ z j ∀ j ∈ N i ∪ { i } ) andsends it to every neighbor j ∈ N i . Each node i ∈ V sets y i = ˜ z i − ρ P j ∈N i ∪{ i } p ij x j . Each node i ∈ V sends y i − ∇ f i ( x i ) − q i to every neighbor j ∈ N i . for k ≥ do Each node i ∈ V computes x k +1 i = ∇ i γ ( y k − ∇ f ( x k ) − q k ) (which only dependson y kj − ∇ f j ( x kj ) − q kj ∀ j ∈ N i ∪ { i } ). Each node i ∈ V sends x k +1 i to every neighbor j ∈ N i . Each node i ∈ V computes y k +1 i = y ki − ∇ f i ( x ki ) − q ki − ρ P j ∈N i ∪{ i } p ij x k +1 j . Each node i ∈ V computes q k +1 i = q ki + ρ P j ∈N i ∪{ i } ˜ p ij x k +1 j . Each node i ∈ V sends y k +1 i − ∇ f i ( x k +1 i ) − q k +1 i to every neighbor j ∈ N i . end for Example ∇ i γ to solely depend on node i . For example, let g i : R d → R ∀ i ∈V be arbitrary twice continuously differentiable, smooth, convex functions and let g ( x ) = P i ∈V g i ( x i ). Then, we may choose γ ( x ) = g ( x ) + ǫ k x k with ǫ >
0. It isstraightforward to see that Assumption 2.7(a)(b)(c) hold and Assumption 2.7(d) canbe satisfied with proper ρ, ǫ >
0. For such a choice of γ, ∇_iγ(x) = ∇g_i(x_i) + εx_i, which is up to node i itself and hence meets Assumption 2.7(e). Thus, the primal update of DAMM-SC, i.e., Line 8 of Algorithm 2.2, is simplified to x_i^{k+1} = ∇g_i(y_i^k − ∇f_i(x_i^k) − q_i^k) + ε(y_i^k − ∇f_i(x_i^k) − q_i^k), so that each node i only needs to share x_i^{k+1} with its neighbors during each iteration and the local communications in Line 6 and Line 12 of Algorithm 2.2 are eliminated.
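Below is a hedged per-node sketch of Algorithm 2.2 under this choice of γ, with the further illustrative assumption g_i(x) = (c_i/2)‖x‖² (so ∇g_i(z) = c_i z and ∇_iγ(z) = (c_i + ε)z_i) and smooth local costs f_i(x) = ½‖B_i x − b_i‖². All constants (c_i, ε, ρ), the graph, and the data are placeholders and are not tuned to the conditions of Assumption 2.7.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 4, 3, 5
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
nbrs = {i: [] for i in range(N)}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

# P = P_tilde as in Assumption 2.3 (negative Metropolis-type off-diagonal weights).
P = np.zeros((N, N))
for i, j in edges:
    P[i, j] = P[j, i] = -1.0 / (max(len(nbrs[i]), len(nbrs[j])) + 1)
np.fill_diagonal(P, -P.sum(axis=1))

B = [rng.standard_normal((m, d)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]
grad_f = lambda i, x: B[i].T @ (B[i] @ x - b[i])

rho, eps = 1.0, 0.005
c = [0.02 + 0.005 * i for i in range(N)]             # node-specific curvature of g_i (illustrative)
grad_gamma_i = lambda i, z: c[i] * z + eps * z       # = grad g_i(z) + eps*z, depends only on node i

# Decentralized initialization of Algorithm 2.2.
z_tilde = [rng.standard_normal(d) for _ in range(N)]
x = [grad_gamma_i(i, z_tilde[i]) for i in range(N)]                                   # x_i^0
y = [z_tilde[i] - rho * sum(P[i, j] * x[j] for j in nbrs[i] + [i]) for i in range(N)] # y_i^0
q = [np.zeros(d) for _ in range(N)]                                                   # q_i^0

for k in range(2000):
    s = [y[i] - grad_f(i, x[i]) - q[i] for i in range(N)]                 # quantity shared with neighbors
    x_new = [grad_gamma_i(i, s[i]) for i in range(N)]                     # simplified primal update
    y = [s[i] - rho * sum(P[i, j] * x_new[j] for j in nbrs[i] + [i]) for i in range(N)]
    q = [q[i] + rho * sum(P[i, j] * x_new[j] for j in nbrs[i] + [i]) for i in range(N)]
    x = x_new

print("disagreement:", max(np.linalg.norm(x[i] - x[0]) for i in range(N)))
print("norm of summed gradients:", np.linalg.norm(sum(grad_f(i, x[i]) for i in range(N))))
```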
Way : For each k ≥
0, let γ k ( x ) = x T G k x , where G k = ( G k ) T ∈ R Nd × Nd .Suppose there exist ¯ γ ≥ γ > γI Nd (cid:22) G k (cid:22) ¯ γI Nd ∀ k ≥
0, and alsosuppose ( G k ) − (cid:23) ρH ∀ k ≥
0. Moreover, we let each G k have a neighbor-sparsestructure, i.e., the d × d ( i, j )-block of G k , denoted by G kij , is equal to O ∈ R d × d forall j / ∈ N i ∪ { i } . It can be shown that such quadratic γ k ’s satisfy Assumption 2.7.Substituting γ k ( x ) = x T G k x into (2.20) gives(2.21) x k +1 = x k − G k ( ρH x k + ∇ f ( x k ) + q k ) , which can be executed in a distributed manner due to the neighbor-sparse structure of G k . This leads to an additional distributed subset of AMM, which is called DistributedApproximate Method of Multipliers for Smooth problems with Quadratic update func-tions (DAMM-SQ) and is elaborated in Algorithm 2.3 where we introduce, for con-venience, an auxiliary variable z k = (( z k ) T , . . . , ( z kN ) T ) T = ρH x k + ∇ f ( x k ) + q k . Algorithm 2.3
DAMM-SQ Initialization: Same as Lines 2–3 in Algorithm 2.1. Each node i ∈ V computes z i = ∇ f i ( x i ) + q i + ρ P j ∈N i ∪{ i } p ij x j and sends it toevery neighbor j ∈ N i . for k ≥ do Each node i ∈ V computes x k +1 i = x ki − P j ∈N i ∪{ i } G kij z kj . Each node i ∈ V sends x k +1 i to every neighbor j ∈ N i . Each node i ∈ V computes q k +1 i = q ki + ρ P j ∈N i ∪{ i } ˜ p ij x k +1 j . Each node i ∈ V computes z k +1 i = ∇ f i ( x k +1 i ) + q k +1 i + ρ P j ∈N i ∪{ i } p ij x k +1 j . Each node i ∈ V sends z k +1 i to every neighbor j ∈ N i . end for Example gradient-tracking algorithm as anexample of DAMM-SQ, where G k = ρ ( I N − P ) ⊗ I d ∀ k ≥ P = ˜ P ≺ I N underAssumption 2.3. Note that P = ˜ P ≺ I N can be satisfied by letting I N − P = I N − ˜ P bestrictly diagonally dominant with positive diagonal entries. Similar to the discussionsbelow Assumption 2.3, such P, ˜ P and therefore G k can be locally determined by thenodes. Moreover, since ( I N − P ) − (cid:23) I N ≻ P , ( G k ) − = ρ ( I N − P ) − ⊗ I d ≻ ρH .Also, it can be verified that all the other conditions on G k required by DAMM-SQhold. With this particular G k , (2.21) becomes x k +1 = (( P + ( I N − P ) ) ⊗ I d ) x k − ρ (( I N − P ) ⊗ I d )( ∇ f ( x k ) + q k ) . Since P = and ( T ⊗ I d ) q k = , this indicates N P i ∈V x k +1 i = N P i ∈V x ki − ρ (cid:16) N P i ∈V ∇ f i ( x ki ) (cid:17) . It can thus be seen that this example of DAMM-SQ tracks theaverage of all the local gradients so as to make the average primal iterates N P i ∈V x ki imitate the behavior of the (centralized) gradient descent method. This section shows that AMMand its distributed subsets generalize a variety of existing distributed optimizationalgorithms. Although these existing methods were originally developed in differentways and some of them are not explicitly connected to the Method of Multipliers, wemanage to unify them as special forms of DAMM-SQ, DAMM, and AMM.
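Before examining those specializations, the gradient-tracking instance of DAMM-SQ in Example 2.9 — update (2.21) with G^k = (1/ρ)(I_N − P) ⊗ I_d and P = P̃ — can be prototyped as follows. Rather than claiming tuned convergence, the sketch numerically verifies the averaging identity derived in the example: the network average of the x_i's takes a plain gradient step along the average of the local gradients. The graph, data, and ρ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 5, 2, 4
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
nbrs = {i: [] for i in range(N)}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

# P = P_tilde with I - P strictly diagonally dominant, so that P < I as required in Example 2.9.
P = np.zeros((N, N))
for i, j in edges:
    P[i, j] = P[j, i] = -0.5 / (max(len(nbrs[i]), len(nbrs[j])) + 1)
np.fill_diagonal(P, -P.sum(axis=1))

B = [rng.standard_normal((m, d)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]
grad_f = lambda X: np.vstack([B[i].T @ (B[i] @ X[i] - b[i]) for i in range(N)])  # row i = grad f_i(x_i)

rho = 30.0
G_mat = (1.0 / rho) * (np.eye(N) - P)      # node-level factor of G^k = (1/rho)(I - P) (x) I_d
X = rng.standard_normal((N, d))            # row i holds x_i^k
Q = np.zeros((N, d))                       # row i holds q_i^k, with sum_i q_i^0 = 0

for k in range(10):
    grad = grad_f(X)
    X_new = X - G_mat @ (rho * (P @ X) + grad + Q)       # update (2.21)
    Q = Q + rho * (P @ X_new)                            # dual update with P_tilde = P
    # Gradient-tracking identity from Example 2.9: the average iterate follows a centralized
    # gradient step of length 1/rho along the average of the local gradients.
    assert np.allclose(X_new.mean(axis=0), X.mean(axis=0) - grad.mean(axis=0) / rho)
    X = X_new

print("averaging identity verified for 10 iterations")
```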
DAMM-SQ in Algorithm 2.3 generalizes several distributed first-order and second-order algorithms for solving problem (1.1) with h_i ≡ 0 ∀i ∈ V, including EXTRA [22], ID-FBBS [32], and DQM [15]. EXTRA [22] is a well-known distributed first-order algorithm developed from a decentralized gradient descent method. From [22, Eq. (3.5)], EXTRA can be expressed as

(3.1)    x^{k+1} = (W̃ ⊗ I_d)x^k − α∇f(x^k) + Σ_{t=0}^{k} ((W − W̃) ⊗ I_d)x^t,  ∀k ≥ 0,

where x^0 is arbitrarily given, α >
0, and W, ˜ W ∈ R N × N are two average matricesassociated with G . By letting q k = α P kt =0 (( ˜ W − W ) ⊗ I d ) x t , (3.1) becomes x k +1 = ( ˜ W ⊗ I d ) x k − α ∇ f ( x k ) − α q k , (3.2) q k +1 = q k + 1 α (( ˜ W − W ) ⊗ I d ) x k +1 . (3.3)This is in the form of DAMM-SQ with ρ = 1 /α , ˜ P = ˜ W − W , P = I N − ˜ W , and G k = αI Nd . As [22] assumes ˜ W (cid:23) W and Null( ˜ W − W ) = span( ), Assumption 2.3and (2.10) are guaranteed. Besides, [22] assumes ˜ W ≻ O , so that ( G k ) − = ρI Nd (cid:23) ρ ( I N − ˜ W ) ⊗ I d = ρH . It is then straightforward to see that this particular G k satisfiesall the conditions in subsection 2.3. ID-FBBS [32] takes the form of (3.2)–(3.3), except that W =2 ˜ W − I N and q can be any vector in S ⊥ . Since [32] also assumes ˜ W ≻ O , it followsfrom the analysis in subsection 3.1.1 that ID-FBBS is a particular example of DAMM-SQ, where ρ = 1 /α , ˜ P = P = I N − ˜ W , and G k = αI Nd , with Assumption 2.3 and allthe conditions on G k in subsection 2.3 satisfied. DQM [15] is a distributed second-order method for solving prob-lem (1.1) with strongly convex, smooth, and twice continuously differentiable f i ’s andzero h i ’s. DQM takes the following form: x k +1 i = x ki − (2 c |N i | I d + ∇ f i ( x ki )) − ( c X j ∈N i ( x ki − x kj ) + ∇ f i ( x ki ) + q ki ) ,q k +1 i = q ki + c X j ∈N i ( x k +1 i − x k +1 j ) , where x i ∀ i ∈ V are arbitrarily given, q i ∀ i ∈ V satisfy P i ∈V q i = , and c >
0. Ob-serve that DAMM-SQ reduces to DQM if we set G k = (2 c diag( |N | , . . . , |N N | ) ⊗ I d + ∇ f ( x k )) − , ρ = c , p ij = ˜ p ij = − ∀{ i, j } ∈ E , and p ii = ˜ p ii = |N i | ∀ i ∈ V . Clearly, P and ˜ P satisfy Assumption 2.3. Additionally, ( G k ) − (cid:23) c diag( |N | , . . . , |N N | ) ⊗ I d (cid:23) ρP ⊗ I d = ρH , and G k meets all the other requirements in subsection 2.3. A number of distributed algorithms for com-posite or nonsmooth convex optimization can be cast into the form of DAMM in Al-gorithm 2.1, including PGC [7], PG-EXTRA [23], DPGA [1], a decentralized ADMMin [24], and D-FBBS [32]. We say W ∈ R N × N is an average matrix associated with G if W = W T , W = , k W − T N k <
1, and [W]_ij = 0 ∀i ∈ V, ∀j ∉ N_i ∪ {i}.
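To make the notion of an average matrix concrete, the following short sketch builds one from Metropolis-type weights on a made-up 5-node graph and checks the defining properties numerically; this particular construction is a common choice assumed here for illustration, not one prescribed by the paper.

```python
import numpy as np

N = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
deg = np.zeros(N, dtype=int)
for i, j in edges:
    deg[i] += 1; deg[j] += 1

# Metropolis-type averaging weights: W_ij = 1/(max(deg_i, deg_j)+1) on edges,
# with the diagonal chosen so that each row sums to one.
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + 1)
np.fill_diagonal(W, 1.0 - W.sum(axis=1))

one = np.ones(N)
print(np.allclose(W, W.T))                                     # W = W^T
print(np.allclose(W @ one, one))                               # W 1 = 1
print(np.linalg.norm(W - np.outer(one, one) / N, 2) < 1.0)     # ||W - 11^T/N|| < 1
print(all(W[i, j] == 0 for i in range(N) for j in range(N)
          if i != j and (i, j) not in edges and (j, i) not in edges))   # neighbor sparsity
```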
PGC [7] and PG-EXTRA [23] are two recentlyproposed distributed methods for solving problem (1.1), where PGC is constructedupon ADMM [6] and PG-EXTRA is an extension of EXTRA [22] to address (1.1)with nonzero h i ’s. According to [7, section V-D], PGC can be described by q ∈ S ⊥ , q k = q + P kℓ =1 ((Λ β ( ˜ W − W )) ⊗ I d ) x ℓ ∀ k ≥
1, and x k +1 = arg min x ∈ R Nd h ( x )+ 12 k x − ( ˜ W ⊗ I d ) x k k β ⊗ I d + h∇ f ( x k ) + q k , x i , ∀ k ≥ , where x is arbitrarily given and the parameters are chosen as follows: Let Λ β =diag( β , . . . , β N ) be a positive definite diagonal matrix and W, ˜ W ∈ R N × N be tworow-stochastic matrices such that Λ β W and Λ β ˜ W are symmetric, [ W ] ij , [ ˜ W ] ij > ∀ j ∈ N i ∪ { i } , and [ W ] ij = [ ˜ W ] ij = 0 otherwise. To cast PGC in the form of DAMM,let ρ = 1, ˜ P = Λ β ( ˜ W − W ), and P = Λ β ( I N − ˜ W ). Then, starting from any q ∈ S ⊥ ,the updates of PGC can be expressed as x k +1 = arg min x ∈ R Nd h ( x )+ 12 k x − x k k β ⊗ I d + h∇ f ( x k )+ q k + ρ ( P ⊗ I d ) x k , x i , (3.4) q k +1 = q k + ρ ( ˜ P ⊗ I d ) x k +1 . (3.5)This means that PGC is exactly in the form of DAMM with ψ ki ( · ) = β i k · k . Notethat Λ β (cid:23) Λ β ˜ W and Null(Λ β ( I N − ˜ W )) = span( ). In addition, [7] assumes Λ β ˜ W (cid:23) Λ β W , Λ β ˜ W (cid:23) O , and Null(Λ β ( ˜ W − W )) = span( ). Consequently, Assumption 2.3and Assumption 2.4 hold. PG-EXTRA can also be described by (3.4)–(3.5) with β i = ρ > ∀ i ∈ V and q = ρ ( ˜ P ⊗ I d ) x , i.e., is a special form of PGC. Therefore,DAMM generalizes both PGC and PG-EXTRA. DPGA [1] is a distributed prox-imal gradient method and has the following form: Given arbitrary x i and q i = , x k +1 i = arg min x ∈ R d h i ( x ) + 12 c i k x − x ki k + h∇ f i ( x ki ) + q ki + X j ∈N i ∪{ i } Γ ij x kj , x i ,q k +1 i = q ki + X j ∈N i ∪{ i } Γ ij x k +1 j , ∀ i ∈ V , where c i > ∀ i ∈ V , Γ ij = Γ ji < ∀{ i, j } ∈ E , and Γ ii = − P j ∈N i Γ ij ∀ i ∈ V . Theabove update equations of DPGA are equivalent to those of DAMM with ψ ki ( · ) = c i k · k , ρ = 1, and P, ˜ P such that ˜ p ij , p ij are equal to Γ ij if { i, j } ∈ E or i = j andare equal to 0 otherwise. Apparently, P and ˜ P satisfy Assumption 2.3. Furthermore,due to the conditions on c i in [1], it is guaranteed that P i ∈V ψ ki ( x i ) − ρ k x k H isconvex and, thus, Assumption 2.4 holds. The decentralized ADMM in [24] for solvingstrongly convex and smooth problems can be viewed as a special case of DPGA with c i = c |N i | ∀ i ∈ V for some c > ij = Γ ji = − c ∀{ i, j } ∈ E , so that it is also aspecialization of DAMM. D-FBBS [32] addresses problem (1.1) with f i ≡ ∀ i ∈ V ,which is designed based on a Bregman method and an operator splitting technique.D-FBBS is described by (3.4)–(3.5) with β i = ρ > ∀ i ∈ V , ˜ P = P = I N − W for an average matrix W ∈ R N × N associated with G , x being arbitrary, and q ∈ S ⊥ . Clearly, P, ˜ P satisfy Assumption 2.3. Now let ψ ki ( · ) = ρ k · k , which satisfiesAssumption 2.4 because [32] assumes W ≻ O . Therefore, we conclude that D-FBBScan be specialized from DAMM. UNIFYING APPROXIMATE METHOD OF MULTIPLIERS Since DAMM-SQ and DAMM are subsets ofAMM, the algorithms in subsection 3.1 and subsection 3.2 are also specializations ofAMM. Below, we present a number of existing methods [12, 9, 17, 14] that can bespecialized from AMM but do not belong to DAMM-SQ or DAMM.
In [12], a distributed ADMM in the followingform is proposed to solve problem (1.1) with f i ≡ i ∈ V : x k +1 = arg min x ∈ R Nd h ( x ) + h Q T w k , x i + c h Q T (Λ − ⊗ I d ) Q x k , x i + c k x − x k k Q , w k +1 = w k + c (Λ − ⊗ I d ) Q x k +1 , where x ∈ R Nd is arbitrarily given and w = . In the above, c >
0, Λ = diag( |N | +1 , . . . , |N N | + 1), and Q = Γ ⊗ I d with Γ ∈ R N × N satisfying [Γ] ij = 0 ∀ i ∈ V ∀ j / ∈N i ∪ { i } and Null(Γ T Λ − Γ) = span( ). Additionally, ˜ Q = diag( Q T Q , . . . , Q TN Q N ) ∈ R Nd × Nd , where Q i ∈ R Nd × d ∀ i ∈ V are such that Q = ( Q , . . . , Q N ). It is shown in[12] that ˜ Q (cid:23) Q T (Λ − ⊗ I d ) Q . We may view this algorithm as a specialization fromAMM (2.8)–(2.10), in which q k = Q T w k , H = ˜ H = Q T (Λ − ⊗ I d ) Q , ρ = c , and u k ( x ) = ρ ( x − x k ) T ( ˜ Q − H )( x − x k ). Apparently, H, ˜ H are symmetric and positivesemidefinite, and Null( H ) = Null( ˜ H ) = S , as required in subsection 2.1. Also, sinceeach Q i has full column rank, ˜ Q ≻ O . This guarantees Assumption 2.2. In [9], a distributed primal-dual algorithm is developed to solve problem (1.1) with each h i equal to the indicatorfunction I X i with respect to a closed convex set X i , and takes the following form: x k +1 = P X [ x k − α ( ∇ f ( x k ) + (Γ ⊗ I d ) w k + (Γ ⊗ I d ) x k )] , (3.6) w k +1 = w k + α (Γ ⊗ I d ) x k , (3.7)where x , w are arbitrarily given, X = X × · · · × X N , and P X [ · ] is the projectiononto X . Besides, Γ ∈ R N × N is a symmetric positive semidefinite matrix satisfying[Γ] ij = 0 ∀ i ∈ V ∀ j / ∈ N i ∪ { i } and Null(Γ) = span( ), and 0 < α ≤ k Γ k . To see howthis algorithm relates to AMM, we use (3.7) to rewrite (3.6) as x k +1 = arg min x ∈ R Nd h ( x )+ 12 α k x − ( x k − α ( ∇ f ( x k ) + (Γ ⊗ I d ) w k +1 + ((Γ − α Γ ) ⊗ I d ) x k )) k , (3.8)where h ( x ) = I X ( x ). It can be seen that (3.7)–(3.8) are equivalent to AMM (2.3)–(2.4) with v k = w k +1 , ρ = α , ˜ H = Γ ⊗ I d , H = (Γ /α − Γ ) ⊗ I d , and u k ( x ) = h∇ f ( x k ) , x i + α k x − x k k I Nd − α (Γ − α Γ ) ⊗ I d . Since α ≤ k Γ k , we have H (cid:23) Γ ⊗ I d / (2 α )and I Nd − α (Γ − α Γ ) ⊗ I d (cid:23) I Nd − α Γ ⊗ I d ≻ O . Thus, H, ˜ H satisfy the condi-tions required by AMM in subsection 2.1. Also, u k is strongly convex and satisfiesAssumption 2.2. DIGing [17] is a distributed gradient-tracking method for solving problem (1.1) with h i ≡ ∀ i ∈ V over time-varyingnetworks. Here, we only consider DIGing on static networks. Let α > W ∈ R N × N satisfy W = W T = , [ W ] ij = 0 ∀ i ∈ V ∀ j / ∈ N i ∪{ i } , and k W − N T k < W = W T , DIGing can be written as x k +2 = (2 W ⊗ I d ) x k +1 − ( W ⊗ I d ) x k − α ( ∇ f ( x k +1 ) − ∇ f ( x k )) , ∀ k ≥ , XUYANG WU AND JIE LU where x can be arbitrary and x = ( W ⊗ I d ) x − α ∇ f ( x ). Adding the aboveequation from k = 0 to k = K − x K +1 = ( W ⊗ I d ) x K + ( I N − W ) ⊗ I d ) x − K X k =0 (( I N − W ) ⊗ I d ) x k − α ∇ f ( x K )for all K ≥
0. By letting q = α (( W − W ) ⊗ I d ) x , the above update is the same as x K +1 = ( W ⊗ I d ) x K − α ∇ f ( x K ) − α q K , q K +1 = q K + 1 α (( I N − W ) ⊗ I d ) x K +1 . Such an algorithmic form of DIGing is identical to AMM (2.8)–(2.10) with the abovegiven q ∈ S ⊥ , ρ = 1 /α , H = ( I N − W ) ⊗ I d , ˜ H = ( I N − W ) ⊗ I d , and u k ( x ) = h∇ f ( x k ) , x i + ρ k ( W ⊗ I d )( x − x k ) k . It can be verified that u k ∀ k ≥ H, ˜ H satisfy all the conditions in subsection 2.1. ESOM [14] is a class of distributed second-order algorithms thataddress problem (1.1) with f i ∀ i ∈ V being strongly convex, smooth, and twicecontinuously differentiable and with h i ≡ ∀ i ∈ V . It is developed by incorporatinga proximal technique and certain second-order approximations into the Method ofMultipliers. To describe ESOM, let α > ǫ >
0, and W ∈ R N × N be an averagematrix associated with G such that [ W ] ij ≥ ∀ i, j ∈ V . In addition, define W d :=diag([ W ] , . . . , [ W ] NN ), B := α ( I Nd + ( W − W d ) ⊗ I d ), D k := ∇ f ( x k ) + ǫI Nd +2 α ( I Nd − W d ⊗ I d ), and Q k ( K ) := ( D k ) − P Kt =0 (( D k ) − B ( D k ) − ) t ( D k ) − , K ≥ K algorithm can be described by x k +1 = x k − Q k ( K )( ∇ f ( x k ) + q k + α ( I Nd − W ⊗ I d ) x k ) , q k +1 = q k + α ( I Nd − W ⊗ I d ) x k +1 , (3.9)where x is any vector in R Nd and q = which satisfies the initialization (2.10) ofAMM. Note that the primal and dual updates of AMM, i.e., (2.8)–(2.9), reduce to(3.9) when ˜ H = H = I Nd − W ⊗ I d , ρ = α , and u k ( x ) = x T (( Q k ( K )) − − ρH ) x −h (( Q k ( K )) − − ρH ) x k , x i + h∇ f ( x k ) , x i . Clearly, H and ˜ H satisfy the conditions insubsection 2.1. Also, Assumption 2.2 holds, since ( M + ǫ + 2 ρ ) I Nd (cid:23) ( Q k ( K )) − (cid:23)∇ f ( x k ) + ǫI Nd + ρH [14], where M > ∇ f i ’s. In this section, we analyze the convergence perfor-mance of AMM described by (2.3)–(2.4).For the purpose of analysis, we add H (cid:23) ˜ H to the conditions on H, ˜ H in subsec-tion 2.1, leading to the following assumption. Assumption H and ˜ H are symmetric and Null( H ) = Null( ˜ H ) = S , where S is defined in (2.1). In addition, H (cid:23) ˜ H (cid:23) O .Recall that in DAMM, DAMM-SC, and DAMM-SQ, we adopt H = P ⊗ I d and˜ H = ˜ P ⊗ I d , where P and ˜ P comply with Assumption 2.3. Thus, as long as we furtherimpose ˜ P (cid:22) P , Assumption 4.1 holds. Moreover, all the existing specializations ofAMM, DAMM, and DAMM-SQ in section 3, except DIGing [17], also need to satisfyAssumption 4.1, while DIGing is not necessarily required to meet H (cid:23) ˜ H .In what follows, we let ( x k , v k ) be generated by (2.3)–(2.4), and let ( x ⋆ , v ⋆ ) bea primal-dual optimal solution pair of problem (2.2), which satisfies (2.6). Also, let UNIFYING APPROXIMATE METHOD OF MULTIPLIERS M := diag( M , . . . , M N ) ⊗ I d , where M i > ∇ f i . Inaddition, define A k := R ∇ u k ((1 − s ) x k + s x k +1 ) ds ∀ k ≥
0. Note that every A k exists and is symmetric due to Assumption 2.2. Moreover, since ∇ u k ( x k ) = ∇ f ( x k ), ∇ u k ( x k +1 ) = ∇ u k ( x k ) + Z ∇ u k ((1 − s ) x k + s x k +1 )( x k +1 − x k ) ds = ∇ f ( x k ) + A k ( x k +1 − x k ) . (4.1)Then, we introduce the following auxiliary lemma for deriving the main results. Lemma
Suppose Assumption , Assumption , and Assumption hold.For any η ∈ [0 , and k ≥ , ρ h v k − v k +1 , v k +1 − v ⋆ i ≥ η h x k − x ⋆ , ∇ f ( x k ) − ∇ f ( x ⋆ ) i + h x k +1 − x ⋆ , A k ( x k +1 − x k ) i − k x k +1 − x k k M − η ) . (4.2) Proof.
See Appendix 7.1.
In this subsection, we pro-vide the convergence rates of AMM in solving the general problem (1.1).Let ¯ x k = k P kt =1 x t ∀ k ≥ x t from t = 1 to t = k .Below, we derive sublinear convergence rates for (i) the consensus error k ˜ H ¯ x k k ,which represents the infeasibility of ¯ x k in solving (2.2), and (ii) the objective error | f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) | , which is a direct measure of optimality. Throughoutsubsection 4.1, we tentatively consider the following form of the surrogate function u k that fulfills Assumption 2.2:(4.3) u k ( x ) = 12 k x − x k k A + h∇ f ( x k ) , x i , ∀ k ≥ , where A ∈ R Nd × Nd satisfies A = A T , A (cid:23) O , and A + ρH ≻ O . Such a choice of u k results in A k = A ∀ k ≥ u k ’s other than (4.3),they all require problem (1.1) to be strongly convex and smooth—In such a case, wewill provide convergence rates for the general form of AMM in subsection 4.2.Now we bound k ˜ H ¯ x k k and | f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) | . This plays a keyrole in acquiring the rates at which these errors vanish. Lemma
Suppose Assumption , Assumption , and (4.3) hold. For any k ≥ , k ˜ H ¯ x k k = 1 ρk k v k − v k , (4.4) f (¯ x k )+ h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≤ k k v k ρ + k x − x ⋆ k A k − X t =0 k x t +1 − x t k M − A ! , (4.5) f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≥ − k v ⋆ k · k v k − v k ρk . (4.6)6 XUYANG WU AND JIE LU
Proof.
See Appendix 7.2.Observe from Lemma 4.3 that as long as k v k − v k and P k − t =0 k x t +1 − x t k M − A are bounded, k ˜ H ¯ x k k and | f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) | are guaranteed to convergeto 0. The following lemma assures this is true with the help of Lemma 4.2, in which z k := (( x k ) T , ( v k ) T ) T ∀ k ≥ z ⋆ := (( x ⋆ ) T , ( v ⋆ ) T ) T , and G := diag( A, I Nd /ρ ). Lemma
Suppose all the conditions in Lemma hold. Also suppose that A ≻ Λ M / . For any k ≥ , k v k − v k ≤ k v − v ⋆ k + √ ρ k z − z ⋆ k G , (4.7) k − X t =0 k x t +1 − x t k M − A ≤ k z − z ⋆ k G − σ , (4.8) where σ := min { σ | σA (cid:23) Λ M / } ∈ (0 , .Proof. See Appendix 7.3.The following theorem results from Lemma 4.3 and Lemma 4.4, which provides O (1 /k ) convergence rates for both the consensus error and the objective error at therunning average ¯ x k . Theorem
Suppose all the conditions in Lemma hold. For any k ≥ , k ˜ H ¯ x k k ≤ k v − v ⋆ k + √ ρ k z − z ⋆ k G ρk ,f (¯ x k )+ h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≤ k (cid:18) k v k ρ + k x − x ⋆ k A + k z − z ⋆ k G − σ (cid:19) ,f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≥ − k v ⋆ k ( k v − v ⋆ k + √ ρ k z − z ⋆ k G ) ρk . Proof.
Substituting (4.7)–(4.8) into (4.4)–(4.6) completes the proof.Among the existing algorithms [35, 11, 10, 12, 1, 23, 32, 7, 5, 16, 30] that are ableto solve nonsmooth convex optimization problems with non-strongly convex objectivefunctions, the best convergence rates so far are of O (1 /k ), achieved by [12, 1, 7, 23,32, 10]. Indeed, the algorithms in [12, 1, 7, 23, 32] are special cases of AMM under allthe conditions in Theorem 4.5. Like Theorem 4.5, [12, 1, 7] achieve O (1 /k ) rates forboth of the objective error and the consensus error, whereas [23, 32, 10] reach O (1 /k )rates in terms of optimality residuals that implicitly reflect the differences from anoptimality condition, and they only guarantee O (1 / √ k ) rates for the consensus error.Theorem 4.5 also provides new convergence results for some existing algorithmsspecialized from AMM. In particular, the O (1 /k ) rate in terms of the objective erroris new to PG-EXTRA [23] and D-FBBS [32] for nonsmooth problems as well asEXTRA [22] and ID-FBBS [32] for smooth problems, which only had O (1 /k ) rateswith respect to optimality residuals before. Moreover, Theorem 4.5 improves their O (1 / √ k ) rates in reaching consensus to O (1 /k ). Furthermore, Theorem 4.5 extendsthe O (1 /k ) rate of the distributed primal-dual algorithm in [9] for constrained smoothconvex optimization to general nonsmooth convex problems, and allows PGC [7] toestablish its O (1 /k ) rate without the originally required condition that each h i needsto have a compact domain. UNIFYING APPROXIMATE METHOD OF MULTIPLIERS Inthis subsection, we additionally impose an assumption on problem (1.1) and acquirethe corresponding convergence rates of AMM (2.3)–(2.4). Henceforth, we no longerrestrict u k to be in the form of (4.3). Assumption P i ∈V f i ( x ) is locally restricted strongly convexwith respect to x ⋆ , where x ⋆ is an optimal solution to problem (1.1).Under Assumption 4.6, problem (1.1) has a unique optimal solution x ⋆ , andproblem (2.2) has a unique optimal solution x ⋆ = (( x ⋆ ) T , . . . , ( x ⋆ ) T ) T ∈ S [29].Note that even with Assumption 4.6, f ( x ) may not be locally restricted stronglyconvex. Nevertheless, it is shown in Proposition 4.7 below that f ( x ) + ρ k x k H islocally restricted strongly convex with respect to x ⋆ . To present this result, we de-fine the following: For any convex and compact set C ⊂ R Nd containing x ⋆ , let¯ C := { x ∈ R d | x = P i ∈V y i /N, ( y T , . . . , y TN ) T ∈ C } ⊂ R d . Clearly, ¯ C is also convexand compact, and x ⋆ ∈ ¯ C . Owing to Assumption 4.6, P i ∈V f i ( x ) is restricted stronglyconvex with respect to x ⋆ on ¯ C with some convexity parameter m ¯ C >
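In an implementation, the two quantities bounded in Theorem 4.5 are straightforward to monitor. A minimal hedged sketch follows; the iterates, the objective f + h, its optimal value, and H̃ are placeholders to be supplied from an actual AMM run.

```python
import numpy as np

def theorem45_errors(iterates, H_tilde, f_plus_h, opt_val):
    """Consensus and objective errors at the running average xbar^k = (1/k) sum_{t=1}^k x^t.

    iterates : list of stacked vectors x^1, ..., x^k in R^{Nd} produced by AMM
    H_tilde  : matrix defining the consensus constraint in (2.2)
    f_plus_h : callable evaluating f(.) + h(.) on a stacked vector
    opt_val  : optimal value of problem (2.2)
    """
    xbar = np.mean(iterates, axis=0)
    consensus_error = np.linalg.norm(H_tilde @ xbar)
    objective_error = abs(f_plus_h(xbar) - opt_val)
    return consensus_error, objective_error
```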
0. Additionally,let λ ˜ H > H and M := max i ∈V M i . Proposition [29, Lemma 1] Suppose Assumption and Assumption hold. For any convex and compact C ⊂ R Nd containing x ⋆ , f ( x )+ ρ k x k H is restrictedstrongly convex with respect to x ⋆ on C . The corresponding convexity parameter canbe given by ζ C ( γ ) := min { m ¯ C /N − M γ, ρ λ ˜ H /γ ) } > ∀ γ ∈ (0 , m ¯ C MN ) . Subsequently, we bound x k ∀ k ≥ C ⊂ R Nd containing x ⋆ , so that the restricted strong convexity in Proposition 4.7 takes intoeffect on C . To introduce C , note from Assumption 2.2 that there exist symmetricpositive semidefinite matrices A ℓ , A u ∈ R Nd × Nd such that for any x ∈ R Nd and k ≥ A ℓ (cid:22) ∇ u k ( x ) (cid:22) A u . Let ∆ := k A u − A ℓ k ≥
0. Moreover, define A a = ( A ℓ + A u ) / G = diag( A a , I Nd /ρ ), which are also symmetric positive semidefinite. Then, let C = n x | k x − x ⋆ k A a ≤ q k z − z ⋆ k G + ρ k x k H o , where z and z ⋆ are defined as in subsection 4.1. Thus, from Proposition 4.7, thereexists m ρ ∈ (0 , ∞ ) as the convexity parameter of f ( x ) + ρ k x k H on C . The followinglemma is another consequence of Lemma 4.2, showing that x k stays identically in C . Lemma
Suppose Assumption , Assumption , Assumption , and As-sumption hold. For each k ≥ , x k ∈ C , provided that A a ≻ (cid:18) ∆ ηm ρ + ∆ (cid:19) I Nd + Λ M − η ) for some η ∈ (0 , . (4.9) Proof.
See Appendix 7.4.When ∆ = 0 (which means ∇ u k is constant) or P i ∈V f i ( x ) is globally restrictedstrongly convex (which means m ρ is independent of C ), it is straightforward to find u k so that (4.9) holds. Otherwise, m ρ depends on C and thus on A a , so that both sides of(4.9) involve A a . With that being said, (4.9) can always be satisfied by proper u k ’s.To see this, arbitrarily pick η ∈ (0 ,
1) and ¯ λ a , λ a > λ a > λ a > M − η ) . Ifwe choose u k such that the corresponding A a satisfies λ a I Nd (cid:22) A a (cid:22) ¯ λ a I Nd , then C is a subset of C ′ := { x | λ a k x − x ⋆ k ≤ ¯ λ a k x − x ⋆ k + ρ k v − v ⋆ k + ρ k x k H } . Let m ′ ρ be any convexity parameter of f ( x ) + ρ k x k H on C ′ . Clearly, m ′ ρ ∈ (0 , m ρ ] and is8 XUYANG WU AND JIE LU independent of A a . Then, the following suffices to satisfy (4.9):(4.10) (cid:18) ∆ ηm ′ ρ + ∆ + λ a (cid:19) I Nd (cid:22) A a (cid:22) ¯ λ a I Nd . The lower and upper bounds on A a in (4.10) do not need to depend on A a . Also,(4.10) is well-posed, since ∆ ηm ′ ρ + ∆ + λ a ≤ ¯ λ a holds for sufficiently small ∆ > u k ∀ k ≥ A a satisfying (4.10) meet (4.9).Next, we present the convergence rates of AMM in both optimality and feasibilityunder the local restricted strong convexity condition. We first consider the smoothcase of (1.1) with each h i identically equal to 0, and provide a linear convergence ratefor k z k − z ⋆ k G + ρ k x k k H , which quantifies the distance to primal optimality, dualoptimality, and primal feasibility of (2.2). In Theorem 4.9, we force v ⋆ to satisfy notonly (2.6) but v ⋆ − v ∈ S ⊥ as well. Such v ⋆ is a particular optimum to the Lagrangedual of (2.2), and can be chosen as v ⋆ = ˜ v + (( N P i ∈V ( v i − ˜ v i )) T , . . . , ( N P i ∈V ( v i − ˜ v i )) T ) T , where ˜ v is any Lagrange dual optimum of (2.2) and v = (( v ) T , . . . , ( v N ) T ) T is the initial dual iterate of AMM. Theorem
Suppose all the conditions in Lemma hold. Also suppose h i ( x ) ≡ ∀ i ∈ V . Then, there exists δ ∈ (0 , such that for each k ≥ , k z k +1 − z ⋆ k G + ρ k x k +1 k H ≤ (1 − δ )( k z k − z ⋆ k G + ρ k x k k H ) . (4.11) Proof.
See Appendix 7.5.In comparison with many existing works such as [24, 14, 15, 12, 17, 20, 11, 10, 19,31] that assume global strong convexity to derive linear convergence rates, the linearrate in Theorem 4.9 only requires the weaker condition of local restricted strong con-vexity in Assumption 4.6. Moreover, Theorem 4.9 provides new convergence results toa number of existing algorithms that can be specialized from AMM. Specifically, The-orem 4.9 establishes linear convergence for D-FBBS [32], ID-FBBS [32], and DPGA[1], which has never been discussed before. In addition, Theorem 4.9 allows EXTRA[22], DIGing [17], ESOM [14], DQM [15], and the distributed ADMMs in [12, 24]to achieve linear convergence under less restrictive problem assumptions, includingrelaxing the global strong convexity assumed in [17, 14, 15, 12, 24] and the globalrestricted strong convexity assumed in [22] to local restricted strong convexity, andeliminating the Lipschitz continuity of ∇ f i required in [14, 15].Finally, we remove the condition of each h i ≡ Theorem
Suppose all the conditions in Lemma hold. For any k ≥ , k ˜ H ¯ x k k ≤ d + k v − v ⋆ k ρk , (4.12) f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≤ k (cid:18) d ρ (cid:16) ∆2(2 ηm ρ − β ∆) + 1 − η − σ (cid:17) + k x − x ⋆ k A u k v k ρ (cid:19) , (4.13) f (¯ x k ) + h (¯ x k ) − f ( x ⋆ ) − h ( x ⋆ ) ≥ − k v ⋆ k ( d + k v − v ⋆ k ) ρk , (4.14) where d = q ρ k z − z ⋆ k G + ρ k x k H , η ∈ (0 , is given in (4.9) , and β > and σ ∈ (0 , are such that β ∆ < ηm ρ and σA a (cid:23) ( β + 1)∆ I Nd + Λ M − η ) . UNIFYING APPROXIMATE METHOD OF MULTIPLIERS Proof.
See Appendix 7.6.The convergence rates in Theorem 4.10 are of the same order as those in The-orem 4.5. Theorem 4.5 considers a more general optimization problem, while Theo-rem 4.10 allows for more general u k ’s. In the literature, [18, 34, 28, 35] also deal withnonsmooth, strongly convex problems. Even under global strong convexity, [34, 28]provide convergence rates slower than O (1 /k ), and [18] derives an O (1 /k ) rate of con-vergence only to dual optimality. Although [35] attains linear convergence, it considersa more restricted problem, which assumes each f i to be globally restricted stronglyconvex and the subgradients of each h i to be uniformly bounded. Remark u k is specialized to thecorresponding particular forms introduced in section 3. In this section, we compare the convergence perfor-mance of several state-of-the-art distributed optimization algorithms and a new par-ticular DAMM in addressing a numerical example of convex composite optimization.Consider the following constrained l -regularized problem:(5.1) minimize x ∈∩ i ∈V X i X i ∈V (cid:18) k B i x − b i k + 1 N k x k (cid:19) , where each B i ∈ R m × d , b i ∈ R m , and X i = { x | k x − a i k ≤ k a i k + 1 } with a i ∈ R d ,so that ∩ i ∈V X i contains and is nonempty. Thus, we set f i ( x ) = k B i x − b i k and h i ( x ) = N k x k + I X i ( x ) ∀ i ∈ V . We choose N = 20, d = 5, and m = 3. Besides,we randomly generate each B i , b i , and a i . The graph G is also randomly generated,which is connected by 26 links in total.In the simulation, we implement PG-EXTRA [23], D-FBBS [32], DPGA [1], andthe distributed ADMM in [12], which are guaranteed to solve (5.1). Recall fromsection 3 that all these algorithms are specializations of AMM, and the first threealgorithms can also be specialized from DAMM. In addition to these existing methods,we consider a particular DAMM with ψ ki ( · ) = ̟ i ( · ) := k · k B Ti B i + ǫI d , ǫ >
0, which depends on the problem data. Note that this is a new algorithm for solving (5.1).

The algorithmic forms of PG-EXTRA, D-FBBS, DPGA, and the distributed ADMM are given in section 3. For DPGA, we let c_i = c ∀i ∈ V for some c > 0 and Γ_ij = c[M_G]_ij ∀i ∈ V, ∀j ∈ N_i ∪ {i}, where M_G is the Metropolis weight matrix of G and is defined below Assumption 2.3. We also set the weight matrix Γ = M_G in the distributed ADMM. We assign P = P̃ = M_G to the above new DAMM and to PG-EXTRA and D-FBBS when cast into the form of DAMM. The remaining algorithm parameters are all fine-tuned in their theoretical ranges.

Figure 1 plots the objective error |f(·) + h(·) − f(x⋆) − h(x⋆)| and the consensus error ‖H̃(·)‖ at the running average x̄^k and the iterate x^k generated by each of the aforementioned algorithms in solving problem (5.1).

Fig. 1. Convergence performance: (a) objective error at x̄^k; (b) objective error at x^k; (c) consensus error at x̄^k; (d) consensus error at x^k.

Observe that the running averages produce smoother curves than the iterates, while the iterates converge faster than the running averages. It can be seen that the new DAMM with ψ_i^k = ϖ_i outperforms the other four existing methods in convergence to both optimality and feasibility. This suggests that AMM not only unifies a diversity of existing methods, but also can create novel distributed optimization algorithms that achieve superior performance in solving various convex composite optimization problems.

6. Conclusion. We have introduced a unifying Approximate Method of Multipliers (AMM) that emulates the Method of Multipliers via a surrogate function. Proper designs of the surrogate function lead to a wealth of distributed algorithms for solving convex composite optimization over undirected graphs. Sublinear and linear convergence rates for AMM are established under various mild conditions. The proposed AMM and its distributed subsets generalize a number of well-noted methods in the literature. Moreover, the theoretical convergence rates of AMM are no worse than and sometimes even better than the convergence results of such existing specializations of AMM in the sense of rate order, problem assumption, etc. The generality of AMM as well as its convergence analysis provides insights into the design and analysis of distributed optimization algorithms originating from the Method of Multipliers, and allows us to explore high-performance specializations of AMM for addressing specific convex optimization problems in practice.
Define $g^{k+1}=-\nabla u^k(x^{k+1})-\rho Hx^{k+1}-\tilde Hv^k$ and $g^\star=-\nabla f(x^\star)-\tilde Hv^\star$. Due to (2.5) and (2.6), $g^{k+1}\in\partial h(x^{k+1})$ and $g^\star\in\partial h(x^\star)$. It follows from (4.1) that
(7.1)  $\tilde H(v^\star-v^k)=\nabla f(x^k)-\nabla f(x^\star)+A^k(x^{k+1}-x^k)+g^{k+1}-g^\star+\rho Hx^{k+1}$.
Also, since $\mathrm{Null}(\tilde H)=\mathrm{Null}(H)=\mathcal S$, we have $\tilde Hx^\star=Hx^\star=\mathbf 0$. This, along with (2.4), gives $v^k-v^{k+1}=-\rho\tilde H(x^{k+1}-x^\star)$. Thus, $\langle v^k-v^{k+1},\,v^k-v^\star\rangle=-\rho\langle\tilde H(x^{k+1}-x^\star),\,v^k-v^\star\rangle=\rho\langle x^{k+1}-x^\star,\,\tilde H(v^\star-v^k)\rangle$. By substituting (7.1) into the above equation and using $Hx^\star=\mathbf 0$, we obtain $\langle v^k-v^{k+1},\,v^k-v^\star\rangle=\rho\langle x^{k+1}-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle+\rho\langle x^{k+1}-x^\star,\,g^{k+1}-g^\star\rangle+\rho\langle x^{k+1}-x^\star,\,A^k(x^{k+1}-x^k)\rangle+\rho^2\|x^{k+1}\|_H^2$. In this equation, $\langle x^{k+1}-x^\star,\,g^{k+1}-g^\star\rangle\ge0$, because $g^{k+1}\in\partial h(x^{k+1})$, $g^\star\in\partial h(x^\star)$, and $h$ is convex. Moreover, due to $H\succeq\tilde H^2$ and (2.4), we have $\rho^2\|x^{k+1}\|_H^2\ge\rho^2\|x^{k+1}\|_{\tilde H^2}^2=\|v^{k+1}-v^k\|^2=\langle v^k-v^{k+1},\,v^k-v^\star\rangle-\langle v^k-v^{k+1},\,v^{k+1}-v^\star\rangle$. It follows that $\frac{1}{\rho}\langle v^k-v^{k+1},\,v^{k+1}-v^\star\rangle\ge\langle x^{k+1}-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle+\langle x^{k+1}-x^\star,\,A^k(x^{k+1}-x^k)\rangle$. Hence, to prove (4.2), it suffices to show that for any $\eta\in(0,1)$,
(7.2)  $\langle x^{k+1}-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle\ge\eta\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle-\|x^{k+1}-x^k\|^2_{\Lambda_M/(4(1-\eta))}$.
To do so, note from the AM-GM inequality and the Lipschitz continuity of $\nabla f_i$ that for any $i\in\mathcal V$ and $c_i>0$, $\langle x_i^{k+1}-x_i^k,\,\nabla f_i(x_i^k)-\nabla f_i(x^\star)\rangle\ge-\frac{c_i}{2}\|\nabla f_i(x_i^k)-\nabla f_i(x^\star)\|^2-\frac{1}{2c_i}\|x_i^{k+1}-x_i^k\|^2\ge-\frac{c_iM_i}{2}\langle x_i^k-x^\star,\,\nabla f_i(x_i^k)-\nabla f_i(x^\star)\rangle-\frac{1}{2c_i}\|x_i^{k+1}-x_i^k\|^2$. By adding the above inequality over all $i\in\mathcal V$ with $c_i=\frac{2(1-\eta)}{M_i}$, we have $\langle x^{k+1}-x^k,\,\nabla f(x^k)-\nabla f(x^\star)\rangle+(1-\eta)\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle\ge-\|x^{k+1}-x^k\|^2_{\Lambda_M/(4(1-\eta))}$. Then, adding $\eta\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle$ to both sides of this inequality leads to (7.2).
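The second inequality in the chain above invokes the standard co-coercivity property of gradients of convex functions with Lipschitz continuous gradients; the following display is our own reminder of the classical fact being used, not a quotation from the paper.

% Co-coercivity of a convex f_i with M_i-Lipschitz gradient (classical fact, stated for reference):
\[
  \|\nabla f_i(x)-\nabla f_i(y)\|^2 \;\le\; M_i\,\bigl\langle x-y,\;\nabla f_i(x)-\nabla f_i(y)\bigr\rangle
  \qquad \forall\, x,y\in\mathbb{R}^d ,
\]
% which, multiplied by -c_i/2, yields the bound
% -\tfrac{c_i}{2}\|\nabla f_i(x_i^k)-\nabla f_i(x^\star)\|^2
%   \;\ge\; -\tfrac{c_i M_i}{2}\,\langle x_i^k-x^\star,\;\nabla f_i(x_i^k)-\nabla f_i(x^\star)\rangle .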
Let $k\ge1$. Due to (2.4), $\tilde H\bar x^k=\frac{1}{k}\sum_{t=1}^k\tilde Hx^t=\frac{1}{\rho k}(v^k-v^0)$, so that (4.4) holds.

Next, we prove (4.5). For simplicity, let $J^t(x)=\langle\nabla f(x^t),x\rangle+\frac12\|x-x^t\|_A^2+h(x)+\frac{\rho}{2}\|x\|_H^2+(v^t)^T\tilde Hx$ for all $t\ge0$. Since $J^t(x)-\frac12\|x\|_A^2$ is convex, we have $\bigl(J^t(x^\star)-\frac12\|x^\star\|_A^2\bigr)-\bigl(J^t(x^{t+1})-\frac12\|x^{t+1}\|_A^2\bigr)\ge\langle\tilde g^{t+1}-Ax^{t+1},\,x^\star-x^{t+1}\rangle$ for all $\tilde g^{t+1}\in\partial J^t(x^{t+1})$. Due to (2.3), $\mathbf 0\in\partial J^t(x^{t+1})$. Thus, by letting $\tilde g^{t+1}=\mathbf 0$ in the above inequality, we obtain
(7.3)  $J^t(x^{t+1})-J^t(x^\star)\le-\tfrac12\|x^{t+1}-x^\star\|_A^2$.
In addition, because of (2.4) and $H\succeq\tilde H^2$,
(7.4)  $\langle\tilde Hx^{t+1},v^t\rangle=\frac{1}{\rho}\langle v^{t+1}-v^t,v^t\rangle=\frac{\|v^{t+1}\|^2-\|v^t\|^2-\|v^{t+1}-v^t\|^2}{2\rho}=\frac{\|v^{t+1}\|^2-\|v^t\|^2}{2\rho}-\frac{\rho}{2}\|x^{t+1}\|_{\tilde H^2}^2\ge\frac{\|v^{t+1}\|^2-\|v^t\|^2}{2\rho}-\frac{\rho}{2}\|x^{t+1}\|_H^2$.
Also, due to the convexity and the $M_i$-smoothness of each $f_i$,
(7.5)  $f(x^{t+1})-f(x^\star)=(f(x^{t+1})-f(x^t))+(f(x^t)-f(x^\star))\le\langle\nabla f(x^t),\,x^{t+1}-x^t\rangle+\tfrac12\|x^{t+1}-x^t\|_{\Lambda_M}^2+\langle\nabla f(x^t),\,x^t-x^\star\rangle=\langle\nabla f(x^t),\,x^{t+1}-x^\star\rangle+\tfrac12\|x^{t+1}-x^t\|_{\Lambda_M}^2$.
Through combining (7.3), (7.4), and (7.5) and utilizing $Hx^\star=\tilde Hx^\star=\mathbf 0$, we derive
(7.6)  $f(x^{t+1})+h(x^{t+1})-f(x^\star)-h(x^\star)+\frac{\|v^{t+1}\|^2-\|v^t\|^2}{2\rho}\le\tfrac12\|x^t-x^\star\|_A^2-\tfrac12\|x^{t+1}-x^\star\|_A^2+\tfrac12\|x^{t+1}-x^t\|_{\Lambda_M-A}^2$.
Now adding (7.6) over $t=0,\dots,k-1$ gives $\sum_{t=1}^k\bigl(f(x^t)+h(x^t)-f(x^\star)-h(x^\star)\bigr)\le\frac{\|v^0\|^2}{2\rho}+\tfrac12\|x^0-x^\star\|_A^2+\tfrac12\sum_{t=0}^{k-1}\|x^{t+1}-x^t\|_{\Lambda_M-A}^2$. Moreover, the convexity of $f$ and $h$ implies $f(\bar x^k)+h(\bar x^k)\le\frac{1}{k}\sum_{t=1}^k\bigl(f(x^t)+h(x^t)\bigr)$. Combining the above results in (4.5).

Finally, to prove (4.6), note that $x^\star$ is an optimal solution of $\min_x f(x)+h(x)+\langle v^\star,\tilde Hx\rangle$. It then follows from $\tilde Hx^\star=\mathbf 0$ that $f(x^\star)+h(x^\star)\le f(\bar x^t)+h(\bar x^t)+\langle v^\star,\tilde H\bar x^t\rangle\le f(\bar x^t)+h(\bar x^t)+\|v^\star\|\cdot\|\tilde H\bar x^t\|$. This, together with (4.4), implies (4.6).

From the definition of $G$, for each $t\ge0$,
(7.7)  $\|z^t-z^\star\|_G^2-\|z^{t+1}-z^\star\|_G^2-\|z^{t+1}-z^t\|_G^2=2\langle G(z^t-z^{t+1}),\,z^{t+1}-z^\star\rangle=2\langle A(x^t-x^{t+1}),\,x^{t+1}-x^\star\rangle+\frac{2}{\rho}\langle v^t-v^{t+1},\,v^{t+1}-v^\star\rangle$.
By substituting (4.2) with $A^t=A$ and $\eta=0$ into (7.7),
(7.8)  $\|z^t-z^\star\|_G^2-\|z^{t+1}-z^\star\|_G^2\ge\frac{1}{\rho}\|v^{t+1}-v^t\|^2+\|x^{t+1}-x^t\|_{A-\Lambda_M/2}^2\ge(1-\sigma)\|x^{t+1}-x^t\|_A^2$,
where the last step is due to $\sigma A\succeq\Lambda_M/2$.
Now adding (7.8) from $t=0$ to $t=k-1$ and using $\|v^k-v^\star\|\le\sqrt{\rho}\,\|z^k-z^\star\|_G$, we have $\frac{1}{\rho}\|v^k-v^\star\|^2+(1-\sigma)\sum_{t=0}^{k-1}\|x^{t+1}-x^t\|_A^2\le\|z^0-z^\star\|_G^2$. Therefore, (4.7) can be proved by combining the above inequality with $\|v^k-v^0\|\le\|v^k-v^\star\|+\|v^0-v^\star\|$. Furthermore, the above inequality, along with $A\succ\Lambda_M-A$, implies (4.8).
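The first equality in (7.7) is the usual three-point identity for a weighted norm, applied to the stacked variable $z=(x;v)$. The block-diagonal form of $G$ assumed below is inferred from the two terms on the right-hand side of (7.7) and from the bound $\|v^k-v^\star\|\le\sqrt{\rho}\,\|z^k-z^\star\|_G$ used above; it is our reading of the (unreproduced) definition in section 4, not a quotation of it.

% Assumed structure (hedged): z^t = (x^t; v^t) and G = blkdiag(A, rho^{-1} I_{Nd}).
% For any symmetric positive semidefinite G and any points z^t, z^{t+1}, z^\star,
\[
  \|z^t-z^\star\|_G^2-\|z^{t+1}-z^\star\|_G^2-\|z^{t+1}-z^t\|_G^2
  \;=\; 2\,\bigl\langle G\,(z^t-z^{t+1}),\; z^{t+1}-z^\star \bigr\rangle ,
\]
% which, under the assumed block-diagonal G, splits into the A-weighted x-term and the
% (1/rho)-weighted v-term displayed in (7.7).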
For simplicity, define $c^k:=\|z^k-z^\star\|_{\tilde G}^2+\rho\|x^k\|_H^2$. To show $x^k\in\mathcal C$ for all $k\ge0$, it suffices to show that $c^k\le c^0$ for all $k\ge0$. We prove this by induction. Clearly, $c^k\le c^0$ holds for $k=0$. Now suppose $c^k\le c^0$; below we show $c^{k+1}\le c^k\le c^0$. Using (4.2) and (7.7) with $G$ replaced by $\tilde G$,
(7.9)  $\|z^k-z^\star\|_{\tilde G}^2-\|z^{k+1}-z^\star\|_{\tilde G}^2-\|z^{k+1}-z^k\|_{\tilde G}^2\ge2\eta\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle+2\langle x^{k+1}-x^\star,\,(A^k-A_a)(x^{k+1}-x^k)\rangle-\|x^{k+1}-x^k\|^2_{\Lambda_M/(4(1-\eta))}$.
On the other hand, since $A_\ell\preceq A^k\preceq A_u$, we have $\|A^k-A_a\|\le\Delta/2$. Due to (4.9), there exist $\beta>0$ and $\sigma\in(0,1)$ such that
(7.10)  $\beta\Delta<\eta m_\rho$,  $\sigma A_a\succeq\bigl(\tfrac{1}{4\beta}+1\bigr)\Delta I_{Nd}+\Lambda_M/(4(1-\eta))$.
Moreover, $\langle x^{k+1}-x^\star,\,(A^k-A_a)(x^{k+1}-x^k)\rangle=\langle x^k-x^\star,\,(A^k-A_a)(x^{k+1}-x^k)\rangle+\langle x^{k+1}-x^k,\,(A^k-A_a)(x^{k+1}-x^k)\rangle\ge-\bigl(\beta\Delta\|x^k-x^\star\|^2+\tfrac{\Delta}{4\beta}\|x^{k+1}-x^k\|^2\bigr)-\tfrac{\Delta}{2}\|x^{k+1}-x^k\|^2$, which further implies $\|x^{k+1}-x^k\|_{A_a}^2+2\langle x^{k+1}-x^\star,\,(A^k-A_a)(x^{k+1}-x^k)\rangle-\|x^{k+1}-x^k\|^2_{\Lambda_M/(4(1-\eta))}\ge(1-\sigma)\|x^{k+1}-x^k\|_{A_a}^2-\beta\Delta\|x^k-x^\star\|^2$. Substituting this into (7.9) and applying (2.4), we obtain
(7.11)  $c^k-c^{k+1}\ge(1-\sigma)\|x^{k+1}-x^k\|_{A_a}^2+2\eta\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle+\rho\|x^k\|_H^2-\beta\Delta\|x^k-x^\star\|^2$.
Due to the hypothesis $c^k\le c^0$, we have $x^k\in\mathcal C$. It then follows from Proposition 4.7 and $\tilde Hx^\star=\mathbf 0$ that $2\eta\langle x^k-x^\star,\,\nabla f(x^k)-\nabla f(x^\star)\rangle\ge2\eta m_\rho\|x^k-x^\star\|^2-\rho\eta\|x^k\|_H^2$. This, together with (7.11), results in
(7.12)  $c^k-c^{k+1}\ge(1-\sigma)\|x^{k+1}-x^k\|_{A_a}^2+(2\eta m_\rho-\beta\Delta)\|x^k-x^\star\|^2+\rho(1-\eta)\|x^k\|_H^2$.
Note that $\sigma\in(0,1)$, $\beta\Delta<\eta m_\rho$ due to (7.10), and $\eta\in(0,1)$. Hence, (7.12) implies $c^{k+1}\le c^k\le c^0$.
Let $\eta\in(0,1)$, $\beta>0$, and $\sigma\in(0,1)$ be such that (7.10) holds. Since $h(x)\equiv0$, we have $g^{k+1}=g^\star=\mathbf 0$, so that (7.1) becomes $\tilde H(v^\star-v^k)=\nabla f(x^k)-\nabla f(x^\star)+(A^k+\rho H)(x^{k+1}-x^k)+\rho Hx^k$. Then, via the AM-GM inequality,
(7.13)  $\|\tilde H(v^k-v^\star)\|^2\le(1+\theta_1)(1+\theta_2)\|\nabla f(x^k)-\nabla f(x^\star)\|^2+(1+\theta_1)(1+1/\theta_2)\|(A^k+\rho H)(x^{k+1}-x^k)\|^2+\rho^2(1+1/\theta_1)\|Hx^k\|^2$
for any $\theta_1,\theta_2>0$. Recall from the paragraph above Theorem 4.9 that we force $v^0-v^\star\in\mathcal S^\perp$. Also, due to (2.4), $v^k-v^0\in\mathrm{Range}(\tilde H)=\mathcal S^\perp$. Hence, $v^k-v^\star\in\mathcal S^\perp=\mathrm{Range}(\tilde H)$, which results in $\|v^k-v^\star\|^2\le\frac{1}{\lambda_{\tilde H}}\|\tilde H(v^k-v^\star)\|^2$. By combining this with (7.13) and utilizing the $M_i$-smoothness of each $f_i$, we obtain, for all $\delta\in(0,1)$, $\delta\|z^k-z^\star\|_{\tilde G}^2\le\delta\|x^k-x^\star\|^2_{(1+\theta_1)(1+\theta_2)\Lambda_M^2/(\rho\lambda_{\tilde H})+A_a}+\frac{\delta\rho(1+1/\theta_1)}{\lambda_{\tilde H}}\|Hx^k\|^2+\frac{\delta(1+\theta_1)(1+1/\theta_2)}{\rho\lambda_{\tilde H}}\|(A^k+\rho H)(x^{k+1}-x^k)\|^2$. Because of this, (7.12), and $\|A^k+\rho H\|\le\|A_u+\rho H\|$, we have $(1-\delta)\bigl(\|z^k-z^\star\|_{\tilde G}^2+\rho\|x^k\|_H^2\bigr)-\bigl(\|z^{k+1}-z^\star\|_{\tilde G}^2+\rho\|x^{k+1}\|_H^2\bigr)\ge\|x^k-x^\star\|^2_{B_1(\delta)}+\|x^k\|^2_{B_2(\delta)}+\|x^{k+1}-x^k\|^2_{B_3(\delta)}$, where $B_1(\delta)=(2\eta m_\rho-\beta\Delta)I_{Nd}-\delta A_a-\delta(1+\theta_1)(1+\theta_2)\Lambda_M^2/(\rho\lambda_{\tilde H})$, $B_2(\delta)=\rho(1-\delta-\eta)\tilde H-\frac{\rho\delta(1+1/\theta_1)}{\lambda_{\tilde H}}H$, and $B_3(\delta)=(1-\sigma)A_a-\frac{\delta(1+\theta_1)(1+1/\theta_2)\|A_u+\rho H\|^2}{\rho\lambda_{\tilde H}}I_{Nd}$. Note from (7.10) that $2\eta m_\rho-\beta\Delta>0$ and $A_a\succ O$. Also note that $\theta_1>0$, $\theta_2>0$, $\rho>0$, $\lambda_{\tilde H}>0$, $\eta\in(0,1)$, $H\succeq O$, $\mathrm{Range}(\tilde H)=\mathrm{Range}(H)=\mathcal S^\perp$, and $\sigma\in(0,1)$. Therefore, there exists $\delta\in(0,1)$ such that $B_i(\delta)\succeq O$ for all $i\in\{1,2,3\}$, which guarantees (4.11).

In the proof of Lemma 4.8, we have shown that $(c^k)_{k=0}^\infty$ is non-increasing. Thus, $\|z^k-z^\star\|_{\tilde G}^2\le c^k\le c^0=d^2/\rho$. It follows that $\|v^0-v^k\|\le\|v^0-v^\star\|+\|v^\star-v^k\|\le\|v^0-v^\star\|+\sqrt{\rho}\,\|z^k-z^\star\|_{\tilde G}\le\|v^0-v^\star\|+d$. By substituting this into (4.4) and (4.6), we obtain (4.12) and (4.14). Note that this substitution is legitimate, since (4.4) and (4.6) do not rely on (4.3) to hold.
To prove (4.13), we let $\tilde J^t(x)=u^t(x)+h(x)+\frac{\rho}{2}\|x\|_H^2+(v^t)^T\tilde Hx$ for all $t\ge0$. Since $\nabla^2u^t(x)\succeq A_\ell$ for all $x\in\mathbb R^{Nd}$, $\tilde J^t(x)-\frac12\|x\|_{A_\ell}^2$ is convex. Similar to the derivation of (7.3),
(7.14)  $\tilde J^t(x^{t+1})-\tilde J^t(x^\star)\le-\tfrac12\|x^{t+1}-x^\star\|_{A_\ell}^2$.
On the other hand, since $A_\ell\preceq\nabla^2u^t(x)\preceq A_u$ for all $x\in\mathbb R^{Nd}$, we have $u^t(x^{t+1})\ge u^t(x^t)+\langle\nabla u^t(x^t),\,x^{t+1}-x^t\rangle+\frac12\|x^{t+1}-x^t\|_{A_\ell}^2$ and $u^t(x^\star)\le u^t(x^t)+\langle\nabla u^t(x^t),\,x^\star-x^t\rangle+\frac12\|x^\star-x^t\|_{A_u}^2$. These two inequalities, along with $\nabla u^t(x^t)=\nabla f(x^t)$, imply
(7.15)  $u^t(x^{t+1})-u^t(x^\star)\ge\langle\nabla f(x^t),\,x^{t+1}-x^\star\rangle+\tfrac12\|x^{t+1}-x^t\|_{A_\ell}^2-\tfrac12\|x^\star-x^t\|_{A_u}^2$.
Again, note that (7.4) and (7.5) hold without requiring (4.3). Hence, by integrating (7.14), (7.15), (7.4), (7.5), and $Hx^\star=\tilde Hx^\star=\mathbf 0$, we derive $f(x^{t+1})+h(x^{t+1})-f(x^\star)-h(x^\star)+\frac{\|v^{t+1}\|^2-\|v^t\|^2}{2\rho}\le\tfrac12\|x^t-x^\star\|_{A_u}^2-\tfrac12\|x^{t+1}-x^\star\|_{A_\ell}^2+\tfrac12\|x^{t+1}-x^t\|_{\Lambda_M-A_\ell}^2$. Similar to the derivation of (4.5) from (7.6), it follows that $f(\bar x^k)+h(\bar x^k)-f(x^\star)-h(x^\star)\le\frac{1}{k}\bigl(\frac{\|v^0\|^2}{2\rho}+\tfrac12\|x^0-x^\star\|_{A_u}^2\bigr)+\frac{1}{2k}\sum_{t=0}^{k-1}\|x^{t+1}-x^t\|_{\Lambda_M-A_\ell}^2+\frac{\Delta}{2k}\sum_{t=1}^{k-1}\|x^t-x^\star\|^2$. From (7.12), we have $\sum_{t=0}^{k-1}\|x^{t+1}-x^t\|_{A_a}^2\le\frac{c^0-c^k}{1-\sigma}\le\frac{d^2}{\rho(1-\sigma)}$ and $\sum_{t=1}^{k-1}\|x^t-x^\star\|^2\le\frac{c^0-c^k}{2\eta m_\rho-\beta\Delta}\le\frac{d^2}{\rho(2\eta m_\rho-\beta\Delta)}$, where $\beta>0$ and $\sigma\in(0,1)$ satisfying (7.10) exist due to (4.9). Also, from (4.9), $\Lambda_M-A_\ell\preceq\Lambda_M\preceq4(1-\eta)A_a$. Combining the above gives (4.13).

REFERENCES

[1]
N. S. Aybat, Z. Wang, T. Lin, and S. Ma, Distributed linearized alternating direction method of multipliers for composite convex consensus optimization, IEEE Transactions on Automatic Control, 63 (2018), pp. 5–20.
[2] D. Bajović, D. Jakovetić, N. Krejić, and N. K. Jerinkić, Newton-like method with diagonal correction for distributed optimization, SIAM Journal on Optimization, 27 (2017), pp. 1171–1203.
[3] A. Beck, A. Nedić, A. Ozdaglar, and M. Teboulle, An O(1/k) gradient method for network resource allocation problems, IEEE Transactions on Control of Network Systems, 1 (2014), pp. 64–73.
[4] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters, 31 (2003), pp. 167–175.
[5] P. Bianchi, W. Hachem, and F. Iutzeler, A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization, IEEE Transactions on Automatic Control, 61 (2016), pp. 2947–2957.
[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (2011), pp. 1–122.
[7] M. Hong and T. Chang, Stochastic proximal gradient consensus over random networks, IEEE Transactions on Signal Processing, 65 (2017), pp. 2933–2948.
[8] S. Kakade, S. Shalev-Shwartz, and A. Tewari, On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, Manuscript, (2009).
[9] J. Lei, H. Chen, and H. Fang, Primal-dual algorithm for distributed constrained optimization, Systems & Control Letters, 96 (2016), pp. 110–117.
[10] Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent stepsizes and separated convergence rates, IEEE Transactions on Signal Processing, 67 (2019), pp. 4494–4506.
[11] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, Communication-censored ADMM for decentralized consensus optimization, IEEE Transactions on Signal Processing, 67 (2019), pp. 2565–2579.
[12] A. Makhdoumi and A. Ozdaglar, Convergence rate of distributed ADMM over networks, IEEE Transactions on Automatic Control, 62 (2017), pp. 5082–5095.
[13] F. Mansoori and E. Wei, Fast distributed asynchronous Newton-based optimization algorithm, IEEE Transactions on Automatic Control, 65 (2020), pp. 2769–2784.
[14] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, A decentralized second-order method with exact linear convergence rate for consensus optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 507–522.
[15] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, DQM: Decentralized quadratically approximated alternating direction method of multipliers, IEEE Transactions on Signal Processing, 64 (2016), pp. 5158–5173.
[16] A. Nedić and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Transactions on Automatic Control, 60 (2015), pp. 601–615.
[17] A. Nedić, A. Olshevsky, and W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633.
[18] I. Notarnicola and G. Notarstefano, Asynchronous distributed optimization via randomized dual proximal gradient, IEEE Transactions on Automatic Control, 62 (2017), pp. 2095–2106.
[19] S. Pu, W. Shi, J. Xu, and A. Nedić, Push-pull gradient methods for distributed optimization in networks, accepted to IEEE Transactions on Automatic Control, 2020.
[20] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, 5 (2017), pp. 1245–1260.
[21] A. Sayed, Adaptation, learning, and optimization over networks, Foundations and Trends in Machine Learning, 7 (2014), pp. 311–801.
[22] W. Shi, Q. Ling, G. Wu, and W. Yin, EXTRA: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966.
[23] W. Shi, Q. Ling, G. Wu, and W. Yin, A proximal gradient algorithm for decentralized composite optimization, IEEE Transactions on Signal Processing, 63 (2015), pp. 6013–6023.
[24] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Transactions on Signal Processing, 62 (2014), pp. 1750–1761.
[25] S.-H. Son, M. Chiang, S. R. Kulkarni, and S. C. Schwartz, The value of clustering in distributed estimation for sensor networks, in Proc. International Conference on Wireless Networks, Communications and Mobile Computing, Princeton, USA, 2005, pp. 969–974.
[26] M. Spivak, Calculus on manifolds: A modern approach to classical theorems of advanced calculus, Addison-Wesley, 1965.
[27] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, Newton-Raphson consensus for distributed convex optimization, IEEE Transactions on Automatic Control, 61 (2016), pp. 994–1009.
[28] X. Wu and J. Lu, Fenchel dual gradient methods for distributed convex optimization over time-varying networks, IEEE Transactions on Automatic Control, 64 (2019), pp. 4629–4636.
[29] X. Wu, Z. Qu, and J. Lu, A second-order proximal algorithm for consensus optimization, accepted to IEEE Transactions on Automatic Control, 2020.
[30] C. Xi and U. A. Khan, Distributed subgradient projection algorithm over directed graphs, IEEE Transactions on Automatic Control, 62 (2017), pp. 3986–3992.
[31] R. Xin and U. A. Khan, Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking, IEEE Transactions on Automatic Control, 65 (2020), pp. 2627–2633.
[32] J. Xu, S. Zhu, Y. Soh, and L. Xie, A Bregman splitting scheme for distributed optimization over networks, IEEE Transactions on Automatic Control, 63 (2018), pp. 3809–3824.
[33] T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, and K. Johansson, A distributed algorithm for economic dispatch over time-varying directed networks with delays, IEEE Transactions on Industrial Electronics, 64 (2017), pp. 5095–5106.
[34] D. Yuan, Y. Hong, D. W. C. Ho, and C. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization, Automatica, 90 (2018), pp. 196–203.
[35]