Distributed Optimization with Coupling Constraints via Dual Proximal Gradient Method with Applications to Asynchronous Networks
Jianzheng Wang and Guoqiang Hu
Abstract—In this paper, we consider solving a distributed optimization problem (DOP) with a composite objective function and coupling constraints in a multi-agent network based on the proximal gradient method. In this problem, each agent aims to minimize an individual cost function composed of both smooth and non-smooth parts. To this end, we derive the dual problem via the concept of the Fenchel conjugate, which yields two kinds of dual problems: a consensus-like constrained problem and an augmented unconstrained problem. In the first scenario, we propose a fully distributed dual proximal gradient (D-DPG) algorithm, where the agents make updates only with the dual information of their neighbours and local step-sizes. Moreover, if the non-smooth parts of the objective functions have certain simple structures, the agents only need to update the dual variables with some simple (projected) gradient based iterations, which reduces the overall computational complexity. In the second scenario, an augmented dual proximal gradient (A-DPG) algorithm is proposed, which allows for asymmetric interpretations of the global constraints by the agents. Based on the A-DPG algorithm, an asynchronous dual proximal gradient (Asyn-DPG) algorithm is proposed for asynchronous networks where each agent updates its strategy with a heterogeneous step-size and possibly outdated dual information of others. Analytical convergence rates are derived for all the discussed scenarios. The effectiveness of the proposed algorithms is verified by solving a social welfare optimization problem in the electricity market.
Index Terms—Multi-agent system, distributed optimization, proximal gradient, dual algorithm, Fenchel conjugate, asynchronous, electricity market.
I. INTRODUCTION
A. Background and Motivation

Distributed optimization problems (DOPs) have drawn much attention due to their wide applications in solving practical problems, such as task assignment in multi-robot systems, machine learning problems, and economic dispatch in power systems [1–3]. In DOPs, each agent focuses on optimizing its local objective function, and the optimal solution for the system is achieved through multiple rounds of strategy-making and communication among the agents. In this work, we consider a class of DOPs with composite objective functions, i.e., composed of both smooth and non-smooth parts, which have been extensively studied in the fields of Lasso regression, boosting and support vector machines [4–6]. To solve these problems, applicable techniques include primal-dual subgradient methods, the alternating direction method of multipliers (ADMM), proximal gradient methods and first-order iterative algorithms [7–10].

Jianzheng Wang and Guoqiang Hu are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (email: [email protected], [email protected]).

Most of the existing works on DOPs require that the agents be fully connected to ensure the correctness of the optimization results, which limits their usage in large-scale distributed networks, e.g., in [11, 12]. To overcome this issue, more distributed algorithms have been investigated in recent years by applying graph theory to model the communication links, where the agents only need to communicate with their neighbours [13]. However, considering the increasing demand on the computational efficiency of the algorithms and the asynchrony of networks in various fields, further exploration of algorithm development for DOPs is needed. Compared with subgradient methods, proximal gradient based algorithms take advantage of simple-structured objective functions and are usually numerically more stable than their gradient based counterparts [14]. With this motivation, in this paper, we aim to develop efficient optimization algorithms for DOPs based on the proximal gradient method and further investigate their effectiveness in asynchronous multi-agent networks.

The dual proximal gradient method applies the proximal gradient method to dual problems. Based on the concept of the Fenchel conjugate [15], we propose three dual proximal gradient based algorithms, namely the D-DPG, A-DPG and Asyn-DPG algorithms, which can handle composite DOPs with local convex and coupling affine constraints. Firstly, we formulate a consensus-like constrained DOP, where each agent maintains a private objective function with the smooth and non-smooth parts decoupled from each other.
With our proposed D-DPG algorithm, each agent only needs to access the dual information of its neighbours, which realizes a fully distributed computation environment for the agents. Secondly, we consider some special-structured DOPs, where the agents whose decision variables are included in the same coupling constraints are connected. By considering asymmetric individual interpretations of the global constraint, an unconstrained dual optimization problem is formulated, with the smooth parts of the objective functions coupled while the non-smooth parts are decoupled. Then, an A-DPG algorithm is proposed, by which the agents collect the dual functions from their neighbours and make strategies with local dual information. To adapt to asynchronous factors in multi-agent systems, an Asyn-DPG algorithm is proposed based on the architecture of the A-DPG algorithm, where heterogeneous step-sizes and bounded communication delays are considered. Given that the non-smooth parts of the objective functions have certain simple structures, our proposed D-DPG, A-DPG and Asyn-DPG algorithms only require updates of the dual variables, which reduces the computational complexity of each updating step and improves the overall computational efficiency. In addition, our algorithms require only some commonly used assumptions on the primal problems, and explicit convergence rates are provided for all the discussed scenarios.

B. Literature Review
Fruitful distributed algorithms for solving composite DOPs have been studied in existing works. To adapt to large-scale distributed networks, consensus based DOPs without coupling constraints were studied in [16–21], where the agents update their strategies by optimizing their private functions, and agreement on the optimal solution is achieved by all the agents through local communications. Alternatively, in this work, we focus on optimizing a class of composite DOPs subject to coupling affine constraints based on the dual proximal gradient method. To facilitate the algorithm development for the dual problem, we decouple the smooth and non-smooth parts of the objective functions. In this respect, the method applied in this work is related to [22], where a consensus based dual proximal gradient method for solving a composite DOP without coupling constraints was proposed. In addition, [23, 24] discussed two faster dual proximal gradient methods for solving composite DOPs without coupling constraints. To the best knowledge of the authors, this is the first work that investigates dual proximal gradient based algorithms for composite DOPs with general coupling affine constraints.

Firstly, to develop a fully distributed algorithm for DOPs with a general affine coupling constraint, we propose the D-DPG algorithm. To highlight the new features and advantages of our proposed algorithm, comparisons with some state-of-the-art works with similar problem setups are summarized as follows.

• The authors of [25–27] studied distributed algorithms with certain parts of the primal problems coupled among the agents, e.g., coupled individual objective functions and coupled global constraints, which requires some globally shared information in the updating algorithms and may raise the issue of releasing sensitive individual data.
In our work, the separated objective functions and affine global constraints enable the agents to access only limited local information, e.g., private objective functions, local parameters in the global constraints and the strategies of neighbours, which can effectively protect the privacy of the agents.

• The consensus based distributed optimization algorithms studied in [27–30] require updating some weighted running averages, which increases the computational complexity and requires more memory capacity for the auxiliary variables in each updating step. In addition, the algorithms in [27–30] assume compact local constraints to ensure convergence. By contrast, in our work, there is no boundedness requirement on the primal variables.

• In [25–32], some common fixed or varying global step-sizes are required during the updating process. However, our proposed fully distributed D-DPG algorithm allows for heterogeneous step-sizes determined by local information, which provides more flexibility in the initialization process and is more adaptive to large-scale distributed networks.

• An explicit convergence rate is derived for our proposed D-DPG algorithm, which is not provided in [25, 27, 28, 30]. Also, the objective functions in [25, 27] are assumed to be continuous, while the D-DPG algorithm only requires the non-smooth parts of the objective functions to be lower semi-continuous.

• One distinct advantage of our proposed D-DPG algorithm is that, if the non-smooth parts of the objective functions have certain simple structures, updating the primal variables is not required. Instead, we compute intermediate variables related only to the non-smooth parts.
Moreover, if the proximal mapping of the conjugate of the non-smooth parts can be given explicitly, then we only need to update the dual variables with some simple operations, which technically can be more efficient than the existing algorithms that compute the primal variables through costly computations in each updating step, e.g., those in [25–32].

Secondly, we consider some special-structured DOPs, where the agents involved in the same coupling constraints are connected. In these problems, the primal constrained DOP is transformed into a dual DOP without coupling constraints. Then, an A-DPG algorithm with a single type of updating variable is proposed, which can be more efficient than the D-DPG algorithm. The Asyn-DPG algorithm can be viewed as an application of the A-DPG algorithm in asynchronous networks with heterogeneous step-sizes and communication delays. Specifically, the bounded communication delays are addressed based on deterministic analysis with an asymptotic convergence result, as discussed in [34–38]. However, the algorithms in [34–36] do not consider any general coupling equality/inequality constraint. Also, it is unclear whether the particularly structured equality constraints in [37, 38] can be extended to general cases. Compared with all the aforementioned works, the elimination of the primal variable update remains a distinct advantage for reducing the computational complexity.

The contributions of this work are summarized as follows.

• We consider a standard composite DOP with both local convex and coupling affine constraints. To solve this problem, by decoupling the smooth and non-smooth parts of the primal problem, we propose three dual proximal gradient based algorithms, namely the D-DPG, A-DPG and Asyn-DPG algorithms, which can be flexibly selected according to different implementation environments.

We also note that some dual gradient based algorithms dealing with smooth objective functions can also avoid the update of primal variables, e.g., as discussed in [33].
However, directly extending these algorithms to non-smooth cases can be challenging in the sense that computing the subgradient of the conjugate of a non-smooth function can be costly in general. Therefore, our contribution to computational efficiency (also by the A-DPG and Asyn-DPG algorithms introduced later) is established for DOPs with non-smooth objective functions.

• We formulate two dual problems of the primal DOP: (i) a consensus-like constrained problem and (ii) an augmented unconstrained problem. In (i), by the D-DPG algorithm, the optimal solution to the primal problem can be obtained when all the agents access only the dual information of their neighbours and update their variables with locally determined step-sizes, which results in a fully distributed computation environment. In addition, the D-DPG algorithm allows the agents to update only the dual variables if the non-smooth parts of their objective functions have certain simple structures, which avoids some costly computations of the primal variables and reduces the computational complexity. In (ii), an unconstrained DOP is formulated with the smooth parts of the objective functions coupled while the non-smooth parts are decoupled. To solve this problem, an A-DPG algorithm is proposed, which can be more efficient than the D-DPG algorithm for some special-structured problems and allows for asymmetric individual interpretations of the global constraints. Based on the A-DPG algorithm, we further propose an Asyn-DPG algorithm, which can be applied in asynchronous networks with heterogeneous step-sizes and bounded communication delays.

• The convergence rate of the dual sequences in all the discussed scenarios is analytically derived. The feasibility of the proposed algorithms is verified by solving a social welfare optimization problem in the electricity market.

The remainder of this paper is organized as follows. Section II gives some frequently used definitions in this work.
Section III provides the formulation of the primal problem and some basic assumptions. In Section IV, a consensus-like dual problem of the primal problem is formulated, and the D-DPG algorithm is proposed therein. In Section V, an unconstrained dual problem is formulated by considering the asymmetric constraints of the agents, and the A-DPG algorithm is proposed. In Section VI, an Asyn-DPG algorithm is proposed by considering heterogeneous step-sizes and communication delays in the network. The convergence analysis of the D-DPG, A-DPG and Asyn-DPG algorithms is conducted in Section VII. The feasibility of our proposed algorithms is verified by solving a social welfare optimization problem in the electricity market in Section VIII. Section IX concludes this paper.
C. Notations

$\mathbb{N}$ and $\mathbb{N}_+$ denote the non-negative and positive integer spaces, respectively. Let $|\mathcal{A}| \in \mathbb{N}$ be the size of set $\mathcal{A}$. Let $\mathbb{R}$ denote the real number space. $\mathbb{R}^n$ and $\mathbb{R}^{n\times m}$ denote the real Euclidean spaces with dimensions $n$ and $n\times m$, respectively. $\mathbb{R}^n_+$ denotes the $n$-dimensional Euclidean space with only non-negative real elements. Operator $(\cdot)^T$ represents the transpose of a matrix. $\mathcal{A}_1 \times \mathcal{A}_2$ denotes the Cartesian product of sets $\mathcal{A}_1$ and $\mathcal{A}_2$. $\mathrm{core}(\mathcal{A})$ is the algebraic interior of a convex set $\mathcal{A}$. $u \preceq v$ means that each element of vector $u$ is smaller than or equal to the corresponding element of $v$, where $u$ and $v$ have suitable dimensions (analogously, $\succeq$ means "larger than or equal to"). Let $\lfloor u \rfloor$ ($\lceil u \rceil$) be the largest integer no larger than (smallest integer no smaller than) $u$. $\|\cdot\|_1$ and $\|\cdot\|$ refer to the $l_1$- and $l_2$-norms, respectively. $\langle \cdot,\cdot\rangle$ is the inner product operator. Define $\|X\|^2_v := v\|X\|^2$ with $v$ a scalar and $\|X\|^2_Y := X^T Y X$ with $Y$ a symmetric square matrix. $\otimes$ is the Kronecker product operator. $\mathbf{0}_n$ and $\mathbf{1}_n$ denote $n$-dimensional column vectors with all elements being 0 and 1, respectively. $I_n$ and $O_{n\times m}$ denote the $n$-dimensional identity matrix and the $n\times m$-dimensional zero matrix, respectively. $\mathrm{dom}(\zeta)$ represents the domain of a function $\zeta$. Define
$$\mathrm{diag}[u_n]_{n\in\mathcal{A}} := \begin{bmatrix} u_1 & & O \\ & \ddots & \\ O & & u_{|\mathcal{A}|} \end{bmatrix} \in \mathbb{R}^{|\mathcal{A}|\times|\mathcal{A}|}, \quad (1)$$
and
$$D^m_{\mathcal{A}}[u_n] := \mathrm{diag}[u_n]_{n\in\mathcal{A}} \otimes I_m = \begin{bmatrix} u_1 I_m & & O \\ & \ddots & \\ O & & u_{|\mathcal{A}|} I_m \end{bmatrix} \in \mathbb{R}^{m|\mathcal{A}|\times m|\mathcal{A}|}. \quad (2)$$

II. PRELIMINARIES
In this section, we present some fundamental definitions regarding convex analysis, graph theory, the proximal gradient algorithm and the Fenchel conjugate.
1) Convex Analysis:
A function $\zeta: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is said to be convex if $\zeta(au + (1-a)v) \le a\zeta(u) + (1-a)\zeta(v)$, $\forall u, v \in \mathbb{R}^n$, $\forall a \in [0,1]$. $\zeta$ is $\nu$-Lipschitz continuously differentiable if $\|\nabla\zeta(u) - \nabla\zeta(v)\| \le \nu\|u-v\|$, $\nu > 0$. For a non-smooth function $\psi: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, its subdifferential at $u$ is defined by $\partial\psi(u) = \{w \in \mathbb{R}^n \mid \psi(v) \ge \psi(u) + w^T(v-u), \forall v \in \mathbb{R}^n\}$.
2) Graph Theory:
A multi-agent network can be described by a connected and undirected graph $\mathcal{G}$, which is composed of the set of vertices $\mathcal{V} = \{1, 2, ..., |\mathcal{V}|\}$ and the set of edges $\mathcal{E} \subseteq \{(i,j) \mid i,j \in \mathcal{V} \text{ and } i \ne j\}$ with $(i,j)$ an unordered pair. $\mathcal{V}_i$ denotes the set of neighbours of agent $i$. Let $L \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ denote the Laplacian matrix of $\mathcal{G}$ [39]. Let $d_{ij}$ be the element at the intersection of the $i$th row and $j$th column of $L$. Then $d_{ij} = -1$ if $(i,j) \in \mathcal{E}$, $d_{ii} = |\mathcal{V}_i|$, and $d_{ij} = 0$ otherwise, $\forall i, j \in \mathcal{V}$. An augmented Laplacian matrix is defined as $\bar{L} := L \otimes I_n \in \mathbb{R}^{|\mathcal{V}|n\times|\mathcal{V}|n}$.
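The Laplacian entries described above can be assembled directly from the edge list. The following minimal sketch does so for a hypothetical 4-agent path graph (agents zero-indexed for convenience); it is an illustration of the definition, not part of the algorithms in this paper:

```python
# Build the Laplacian matrix L of a small undirected graph as defined above:
# d_ij = -1 if (i, j) is an edge, d_ii = |V_i| (the degree of i), d_ij = 0 otherwise.

def laplacian(n, edges):
    L = [[0] * n for _ in range(n)]
    for (i, j) in edges:
        L[i][j] = L[j][i] = -1   # off-diagonal entries for each edge
        L[i][i] += 1             # each edge raises the degree of both endpoints
        L[j][j] += 1
    return L

# hypothetical 4-agent path graph 0-1-2-3
L = laplacian(4, [(0, 1), (1, 2), (2, 3)])

# every row of a Laplacian sums to zero, so the all-ones vector is in its null space
assert all(sum(row) == 0 for row in L)
assert L[0][0] == 1 and L[1][1] == 2   # degrees of an end node and an interior node
```

The row-sum-zero property is exactly what makes constraints of the form $\bar{L}\hat{\theta} = \mathbf{0}$ encode consensus: any vector with equal blocks lies in the null space.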
3) Proximal Gradient Method:
A basic proximal mapping of a closed and proper function $\psi: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is defined by $\mathrm{prox}_{a\psi}[u] := \arg\min_{v\in\mathbb{R}^n} \big(\psi(v) + \frac{1}{2a}\|v - u\|^2\big)$ with a scalar $a > 0$. The classical proximal gradient method is an iterative application of the proximal mapping that incorporates a smooth componential function $\zeta: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$, written as $u(k+1) = \mathrm{prox}_{a\psi}[u(k) - a\nabla\zeta(u(k))]$, $k \in \mathbb{N}$. A generalized version of the proximal mapping can be defined as
$$\mathrm{prox}^X_{\psi}[u] := \arg\min_{v\in\mathbb{R}^n} \Big(\psi(v) + \frac{1}{2}\|v - u\|^2_{X^{-1}}\Big), \quad (3)$$
with $X \in \mathbb{R}^{n\times n}$ a positive definite matrix [22].
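For $\psi(v) = \beta\|v\|_1$ the proximal mapping has the well-known closed-form solution of elementwise soft-thresholding, so a proximal gradient iteration can be sketched in a few lines. The quadratic smooth part below is a hypothetical example chosen so the minimizer is easy to check; this is an illustration of the method, not the algorithms developed later:

```python
# One proximal gradient step u(k+1) = prox_{a*psi}[u(k) - a*grad_zeta(u(k))]
# for psi(v) = beta*||v||_1, whose proximal mapping is elementwise soft-thresholding.

def soft_threshold(u, t):
    # prox of t*||.||_1 at u: shrink each entry towards zero by t
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in u]

def prox_grad_step(u, grad_zeta, a, beta):
    # forward (gradient) step on the smooth part, backward (proximal) step on the non-smooth part
    v = [x - a * g for x, g in zip(u, grad_zeta(u))]
    return soft_threshold(v, a * beta)

# hypothetical smooth part zeta(u) = 0.5*||u - d||^2 with gradient u - d
d = [1.0, -0.2, 0.05]
grad = lambda u: [x - y for x, y in zip(u, d)]

u = [0.0, 0.0, 0.0]
for _ in range(200):
    u = prox_grad_step(u, grad, a=0.5, beta=0.1)
# for this separable problem the iterates converge to soft_threshold(d, beta) componentwise
```

Note how the third component is driven exactly to zero: entries of `d` smaller in magnitude than the threshold $\beta$ are eliminated, which is the sparsifying effect exploited in Lasso-type problems.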
4) Fenchel Conjugate:
Let $\zeta: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper function. The Fenchel conjugate of $\zeta$ is defined as $\zeta^\diamond(u) := \sup_{v} \{u^T v - \zeta(v)\}$, which is convex [15].

III. PROBLEM FORMULATION
Consider a connected and undirected multi-agent network $\mathcal{G} := \{\mathcal{V}, \mathcal{E}\}$ and a global cost function $H = \sum_{i\in\mathcal{V}} H_i(x_i)$. Agent $i \in \mathcal{V}$ aims to minimize a private cost function $H_i(x_i) := f_i(x_i) + g_i(x_i)$, where $f_i: \Omega_i \to \mathbb{R}\cup\{+\infty\}$ is smooth and $g_i: \Omega_i \to \mathbb{R}\cup\{+\infty\}$ is possibly non-smooth, $\Omega_i \subseteq \mathbb{R}^M$, $i \in \mathcal{V}$. Then, the domain of $H$ can be defined by $\Omega := \Omega_1 \times \Omega_2 \times ... \times \Omega_{|\mathcal{V}|} \subseteq \mathbb{R}^{|\mathcal{V}|M}$. We consider a global affine constraint $Ax = b$, $A \in \mathbb{R}^{N\times|\mathcal{V}|M}$, $b \in \mathbb{R}^N$, $x = [x_1^T, ..., x_{|\mathcal{V}|}^T]^T \in \mathbb{R}^{|\mathcal{V}|M}$. Then, a DOP of $\mathcal{V}$ can be formulated as
$$\text{(P1)} \quad \min_{x_i\in\Omega_i, \forall i\in\mathcal{V}} \; \sum_{i\in\mathcal{V}} H_i(x_i) \quad \text{s.t. } Ax = b,$$
which is equivalent to
$$\text{(P2)} \quad \min_{x} \; \sum_{i\in\mathcal{V}} \big(H_i(x_i) + \mathcal{I}_{\Omega_i}(x_i)\big) \quad \text{s.t. } Ax = b,$$
with
$$\mathcal{I}_{\Omega_i}(x_i) = \begin{cases} 0 & \text{if } x_i \in \Omega_i, \\ +\infty & \text{otherwise}. \end{cases} \quad (4)$$

Remark 1.
Note that for an inequality constraint $Ax \preceq b$, we can formulate an equality constraint $Ax + u = b$ with $u \in \mathbb{R}^N_+$ a slack vector. Then, the inequality-constrained problem can be equivalently written as
$$\text{(P1+)} \quad \min_{x_i\in\Omega_i, u\in\mathbb{R}^N_+, \forall i\in\mathcal{V}} \; \sum_{i\in\mathcal{V}} H_i(x_i) \quad \text{s.t. } Ax + u = b.$$
To realize decentralized computations, $u$ can be decomposed and assigned to the agents. Hence, the structure of Problem (P1+) complies with that of Problem (P1).

The subdifferential of $\mathcal{I}_{\Omega_i}(x_i)$, $\forall i \in \mathcal{V}$, can be determined by [40]
$$\partial\mathcal{I}_{\Omega_i}(x_i) = \begin{cases} \{u \in \mathbb{R}^M \mid \langle u, v_i - x_i\rangle \le 0, \forall v_i \in \Omega_i\} & \text{if } x_i \in \Omega_i, \\ \emptyset & \text{otherwise}. \end{cases} \quad (5)$$

Assumption 1. $f_i(x_i)$ is proper, $\sigma_i$-strongly convex and smooth, $\sigma_i > 0$; $g_i(x_i)$ is proper, convex, lower semi-continuous, and possibly non-smooth, $i \in \mathcal{V}$.

Assumption 2. $\Omega_i$ is non-empty, convex and closed, $i \in \mathcal{V}$; $b \in \mathrm{core}(A\Omega)$.

Remark 2.
By Assumption 2, $\mathcal{I}_{\Omega_i}$ is convex and lower semi-continuous [41].

IV. DISTRIBUTED DUAL PROXIMAL GRADIENT ALGORITHM DEVELOPMENT
A. Dual Problem
By decoupling the objective function of Problem (P2), we have
$$\text{(P3)} \quad \min_{x} \; \sum_{i\in\mathcal{V}} \big(f_i(x_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i)\big) \quad \text{s.t. } Ax = b, \; x_i = z_i, \forall i \in \mathcal{V},$$
where $z_i \in \mathbb{R}^M$ is a slack vector. The Lagrangian function of Problem (P3) can be given by
$$L(x, z, \theta, \mu) := \sum_{i\in\mathcal{V}} \big(f_i(x_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i) + \mu_i^T(x_i - z_i)\big) + \theta^T(Ax - b), \quad (6)$$
where $\mu_i \in \mathbb{R}^M$ and $\theta \in \mathbb{R}^N$ are the Lagrangian multiplier vectors associated with constraints $x_i = z_i$ and $Ax = b$, respectively, $i \in \mathcal{V}$, and $\mu = [\mu_1^T, ..., \mu_{|\mathcal{V}|}^T]^T \in \mathbb{R}^{|\mathcal{V}|M}$. With some rearrangements, the Lagrangian function (6) can be rewritten as
$$L(x, z, \theta, \mu) = \sum_{i\in\mathcal{V}} \big(f_i(x_i) + x_i^T(A_i^T\theta + \mu_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i) - z_i^T\mu_i\big) - b^T\theta, \quad (7)$$
where $A_i \in \mathbb{R}^{N\times M}$ is a sub-block of $A$ with $A = [A_1, ..., A_i, ..., A_{|\mathcal{V}|}]$. Then, the dual function can be obtained by minimizing $L(x, z, \theta, \mu)$ over the variables $(x, z)$:
$$\begin{aligned} V(\theta, \mu) &:= \min_{x,z} L(x, z, \theta, \mu) \\ &= \min_{x,z} \sum_{i\in\mathcal{V}} \big(f_i(x_i) + x_i^T(A_i^T\theta + \mu_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i) - z_i^T\mu_i\big) - b^T\theta \\ &= \min_{x,z} \sum_{i\in\mathcal{V}} \big(f_i(x_i) - x_i^T W_i\lambda_i + (g_i + \mathcal{I}_{\Omega_i})(z_i) - z_i^T F\lambda_i - \kappa_i E\lambda_i\big) \\ &= \sum_{i\in\mathcal{V}} \big(-f_i^\diamond(W_i\lambda_i) - (g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i) - \kappa_i E\lambda_i\big), \end{aligned} \quad (8)$$
where $W_i = [-A_i^T, -I_M] \in \mathbb{R}^{M\times(M+N)}$, $\lambda_i = [\theta^T, \mu_i^T]^T \in \mathbb{R}^{M+N}$, $F = [O_{M\times N}, I_M] \in \mathbb{R}^{M\times(M+N)}$, $E = [b^T, \mathbf{0}_M^T] \in \mathbb{R}^{1\times(M+N)}$, and $\kappa_i$ is a decomposition coefficient with $\sum_{i\in\mathcal{V}} \kappa_i = 1$. The third equality in (8) employs the definition of the Fenchel conjugate, and $(g_i + \mathcal{I}_{\Omega_i})^\diamond$ denotes the Fenchel conjugate of $g_i + \mathcal{I}_{\Omega_i}$, $i \in \mathcal{V}$. Hence, the dual problem of Problem (P3) can be formulated as
$$\text{(P4)} \quad \min_{\lambda} \; P(\lambda) + G(\lambda),$$
where $\lambda = [\lambda_1^T, ..., \lambda_{|\mathcal{V}|}^T]^T \in \mathbb{R}^{|\mathcal{V}|N+|\mathcal{V}|M}$, $P(\lambda) = \sum_{i\in\mathcal{V}} \big(f_i^\diamond(W_i\lambda_i) + \kappa_i E\lambda_i\big)$ and $G(\lambda) = \sum_{i\in\mathcal{V}} (g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i)$.

Lemma 1. (Fenchel Duality [42, Cor.
3.3.11]) Let $\zeta: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a convex function, $Z: \mathbb{R}^n \to \mathbb{R}^m$ a linear mapping, and $v \in \mathbb{R}^m$ a vector. If $v \in \mathrm{core}(Z\,\mathrm{dom}(\zeta))$, then
$$\inf_{u\in\mathbb{R}^n} \{\zeta(u) \mid Zu = v\} = \sup_{y\in\mathbb{R}^m} \{v^T y - \zeta^\diamond(Z^T y)\}$$
holds.

Based on Lemma 1 and Assumptions 1 and 2, strong duality between Problems (P3) and (P4) holds.
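The conjugate functions entering $P(\lambda)$ and $G(\lambda)$ are often available in closed form. As a minimal numerical sanity check of the definition $\zeta^\diamond(u) = \sup_v\{u^T v - \zeta(v)\}$ (using a hypothetical scalar quadratic, not the problem data of this paper), the quadratic $\zeta(v) = \frac{1}{2}v^2$ is self-conjugate, i.e., $\zeta^\diamond(u) = \frac{1}{2}u^2$:

```python
# Numerical check that the Fenchel conjugate of zeta(v) = 0.5*v^2 is 0.5*u^2,
# by maximizing u*v - zeta(v) over a fine grid of v values.

def conjugate(zeta, u, grid):
    # grid approximation of sup_v { u*v - zeta(v) }
    return max(u * v - zeta(v) for v in grid)

zeta = lambda v: 0.5 * v * v
grid = [i * 0.001 for i in range(-5000, 5001)]   # v in [-5, 5], step 0.001

for u in (-2.0, -0.5, 0.0, 1.0, 3.0):
    approx = conjugate(zeta, u, grid)
    exact = 0.5 * u * u                          # supremum attained at v = u
    assert abs(approx - exact) < 1e-3
```

The supremum is attained at $v = u$ (where the derivative $u - v$ vanishes), which is also why $\nabla\zeta^\diamond$ recovers the maximizing primal point, the mechanism underlying the primal recovery in dual proximal gradient schemes.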
B. Distributed Dual Proximal Gradient Algorithm
In this subsection, we aim to solve Problem (P4) in a distributed manner based on the proximal gradient method. In Problem (P4), the variables of $f_i^\diamond(W_i\lambda_i)$ are coupled through the common component $\theta$ in $\lambda_i$, but those of $(g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i)$ are decoupled since $F\lambda_i = \mu_i$, $i \in \mathcal{V}$. In the following, with a slight abuse of notation, we redefine $\lambda_i := [\theta_i^T, \mu_i^T]^T \in \mathbb{R}^{M+N}$, where $\theta_i$ is the local estimate of the common $\theta$, $i \in \mathcal{V}$. Then, Problem (P4) can be equivalently rewritten as
$$\text{(P5)} \quad \min_{\lambda} \; P(\lambda) + G(\lambda) \quad \text{s.t. } K\lambda_j = K\lambda_l, \; \forall (j,l) \in \mathcal{E}, \quad (9)$$
where $K = [I_N, O_{N\times M}]$. Constraint (9) ensures partial consistency among the $\lambda_i$ in terms of the component $\theta_i$, i.e., $\theta_i = K\lambda_i$, $\forall i \in \mathcal{V}$. Let $\lambda^* := [(\lambda_1^*)^T, ..., (\lambda_{|\mathcal{V}|}^*)^T]^T$ be an optimal solution to Problem (P5) with $\lambda_i^* = [(\theta_i^*)^T, (\mu_i^*)^T]^T$, $i \in \mathcal{V}$. In the following, we assume that the range of $\theta_i^*$ is estimable, i.e., $\theta_i^* \in \mathcal{S}_i$ with $\mathcal{S}_i \subset \mathbb{R}^N$ an estimated non-empty, convex and compact set, $\forall i \in \mathcal{V}$. For the convenience of the following discussion, we define $\Gamma := \max_{i\in\mathcal{V}} \sup_{\theta_i\in\mathcal{S}_i} \|\theta_i\|$. Note that imposing the constraint $\theta_i \in \mathcal{S}_i$ is equivalent to accommodating an indicator function $\mathcal{I}_{\mathcal{S}_i}(\theta_i)$ in the non-smooth part. Then, Problem (P5) can be modified into
$$\text{(P6)} \quad \min_{\lambda} \; \varphi(\lambda) \quad \text{s.t. } K\lambda_j = K\lambda_l, \; \forall (j,l) \in \mathcal{E}, \quad (10)$$
where $\varphi(\lambda) = P(\lambda) + Q(\lambda)$, $P(\lambda) = \sum_{i\in\mathcal{V}} p_i(\lambda_i)$, $Q(\lambda) = \sum_{i\in\mathcal{V}} q_i(\lambda_i)$, $p_i(\lambda_i) = f_i^\diamond(W_i\lambda_i) + \kappa_i E\lambda_i$, and $q_i(\lambda_i) = (g_i + \mathcal{I}_{\Omega_i})^\diamond(\mu_i) + \mathcal{I}_{\mathcal{S}_i}(\theta_i) = (g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i) + \mathcal{I}_{\mathcal{S}_i}(K\lambda_i)$. Note that (10) can be represented by the compact equation $M\lambda = \mathbf{0}_{|\mathcal{V}|N}$, where $M = L \otimes K \in \mathbb{R}^{|\mathcal{V}|N\times(|\mathcal{V}|N+|\mathcal{V}|M)}$. It can be checked that $M\lambda = \widetilde{L}\hat{\theta}$, where $\widetilde{L} = L \otimes I_N \in \mathbb{R}^{|\mathcal{V}|N\times|\mathcal{V}|N}$ is the augmented Laplacian matrix of $\mathcal{G}$ and $\hat{\theta} = [\theta_1^T, ..., \theta_{|\mathcal{V}|}^T]^T \in \mathbb{R}^{|\mathcal{V}|N}$.
Then, the Lagrangian function of Problem (P6) can be given by
$$\phi(\lambda, \xi) := P(\lambda) + Q(\lambda) + \xi^T M\lambda, \quad (11)$$
where $\xi = [\xi_1^T, ..., \xi_{|\mathcal{V}|}^T]^T \in \mathbb{R}^{|\mathcal{V}|N}$ is the aggregated Lagrangian multiplier vector with $\xi_i$ the individual multiplier maintained by agent $i \in \mathcal{V}$. Let $\mathcal{R}$ be the set of saddle points of $\phi(\lambda, \xi)$. Then, any saddle point $(\lambda^*, \xi^*) \in \mathcal{R}$ satisfies [43]
$$\phi(\lambda, \xi^*) \ge \phi(\lambda^*, \xi^*) \ge \phi(\lambda^*, \xi), \quad (12)$$
$\forall \lambda \in \mathbb{R}^{|\mathcal{V}|N+|\mathcal{V}|M}$, $\forall \xi \in \mathbb{R}^{|\mathcal{V}|N}$. We aim to seek a saddle point of $\phi(\lambda, \xi)$, which means solving
$$\text{(P7)} \quad \min_{\lambda}\max_{\xi} \; \phi(\lambda, \xi). \quad (13)$$
The solution to Problem (P7) can be characterized by the Karush-Kuhn-Tucker (KKT) conditions [44]
$$\mathbf{0}_{|\mathcal{V}|M+|\mathcal{V}|N} \in \nabla_{\lambda} P(\lambda^*) + \partial_{\lambda} Q(\lambda^*) + M^T\xi^*, \quad (14)$$
$$M\lambda^* = \mathbf{0}_{|\mathcal{V}|N}, \quad (15)$$
which means
$$\mathbf{0}_{M+N} \in \nabla_{\lambda_i} p_i(\lambda_i^*) + \partial_{\lambda_i} q_i(\lambda_i^*) + M_i^T\xi^*, \quad (16)$$
$$\sum_{l\in\mathcal{V}} M_l\lambda_l^* = \mathbf{0}_{|\mathcal{V}|N}, \quad (17)$$
where $M_i$ is the $i$th sub-block of $M$ with $M = [M_1, ..., M_i, ..., M_{|\mathcal{V}|}]$, $i \in \mathcal{V}$.

Based on the previous discussion, the D-DPG algorithm for solving Problem (P7) is designed as
$$\lambda(k+1) = \mathrm{prox}^{D^{M+N}_{\mathcal{V}}[c_l]}_{Q}\big[\lambda(k) - D^{M+N}_{\mathcal{V}}[c_l](\nabla_{\lambda} P(\lambda(k)) + M^T\xi(k))\big], \quad (18)$$
$$\xi(k+1) = \xi(k) + D^N_{\mathcal{V}}[\gamma_l]\, M\lambda(k+1), \quad (19)$$
which, due to the separability of $P$ and $Q$, means
$$\lambda_i(k+1) = \mathrm{prox}_{c_i q_i}\Big[\lambda_i(k) - c_i\Big(\nabla_{\lambda_i} p_i(\lambda_i(k)) + \sum_{j\in\mathcal{V}_i} K^T(\xi_i(k) - \xi_j(k))\Big)\Big], \quad (20)$$
$$\xi_i(k+1) = \xi_i(k) + \gamma_i\sum_{j\in\mathcal{V}_i} K(\lambda_i(k+1) - \lambda_j(k+1)), \quad (21)$$
where $c_i, \gamma_i > 0$ are step-sizes, $k \in \mathbb{N}$, $\forall i \in \mathcal{V}$.

Remark 3.
From (20) and (21), it can be seen that each agent only needs the information of its neighbours and updates with locally determined step-sizes, which results in a fully distributed computation fashion for the D-DPG algorithm.
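As a concrete illustration of the coupled updates (20)-(21), the following toy sketch runs them on a hypothetical two-agent network with scalar dual variables, quadratic $p_i(\lambda) = \frac{1}{2}(\lambda - t_i)^2$ and $q_i = 0$ (so the proximal mapping is the identity and $K$ reduces to 1). These problem data are invented purely for illustration, chosen so the saddle point is easy to verify: the updates should drive both agents to $\lambda_1 = \lambda_2 = (t_1 + t_2)/2$.

```python
# Generic sketch of the D-DPG updates (20)-(21) on a hypothetical two-agent network.
# p_i(l) = 0.5*(l - t_i)^2 so grad p_i(l) = l - t_i; q_i = 0 so prox is the identity.

neighbours = {0: [1], 1: [0]}          # a single undirected edge (0, 1)
t = [1.0, 3.0]                         # parameters of the quadratic p_i (hypothetical)
c = [0.1, 0.1]                         # local step-sizes c_i
gamma = [0.1, 0.1]                     # local step-sizes gamma_i

lam = [0.0, 0.0]                       # dual variables lambda_i
xi = [0.0, 0.0]                        # multipliers xi_i

for k in range(500):
    # lambda-update (20): gradient step on p_i plus the neighbour disagreement term
    lam = [
        lam[i] - c[i] * ((lam[i] - t[i]) + sum(xi[i] - xi[j] for j in neighbours[i]))
        for i in range(2)
    ]
    # xi-update (21): driven by the disagreement of the freshly updated lambda
    xi = [
        xi[i] + gamma[i] * sum(lam[i] - lam[j] for j in neighbours[i])
        for i in range(2)
    ]

# both agents agree on the consensus minimizer (t_1 + t_2)/2 = 2
assert abs(lam[0] - 2.0) < 1e-6 and abs(lam[1] - 2.0) < 1e-6
```

Each agent touches only its own data ($t_i$, $c_i$, $\gamma_i$) and its neighbours' iterates, mirroring the fully distributed character noted in Remark 3.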
The detailed computation procedure of the D-DPG algorithm is stated in Algorithm 1.
Algorithm 1
Distributed Dual Proximal Gradient Algorithm

Initialize $\lambda(0)$, $\xi(0)$. Determine step-sizes $c_i, \gamma_i > 0$, $\forall i \in \mathcal{V}$.
for $k = 0, 1, 2, ...$ do
  for $i = 1, 2, ..., |\mathcal{V}|$ do (in parallel)
    Update $\lambda_i$:
    $$\lambda_i(k+1) = \mathrm{prox}_{c_i q_i}\Big[\lambda_i(k) - c_i\Big(\nabla_{\lambda_i} p_i(\lambda_i(k)) + \sum_{j\in\mathcal{V}_i} K^T(\xi_i(k) - \xi_j(k))\Big)\Big]. \quad (22)$$
    Update $\xi_i$:
    $$\xi_i(k+1) = \xi_i(k) + \gamma_i\sum_{j\in\mathcal{V}_i} K(\lambda_i(k+1) - \lambda_j(k+1)). \quad (23)$$
  end for
  Obtain an output $(\lambda_{\mathrm{out}}, \xi_{\mathrm{out}})$ under a certain convergence criterion.
end for

C. Computational Complexity of the D-DPG Algorithm
To apply (22), one needs to compute (i) $p_i$ and (ii) the proximal mapping of $q_i$, $i \in \mathcal{V}$, depending on the structure of the problem. For (i), the conjugate of $f_i$ can be derived analytically if $f_i$ is a simple-structured smooth function; the conjugates of some widely used functions can be found in [41, Sec. 3.3.1]. For (ii), some feasible methods for different cases are introduced as follows.
1) Directly use the analytical expression of the proximal mapping of $q_i$ if it can be easily obtained, $i \in \mathcal{V}$. For example, (i) if $g_i = 0$, then $q_i(\lambda_i) = \mathcal{I}^\diamond_{\Omega_i}(\mu_i) + \mathcal{I}_{\mathcal{S}_i}(\theta_i)$ with $\mathcal{I}^\diamond_{\Omega_i}(\mu_i)$ the support function of $\Omega_i$ [41, Sec. 3.3.1]. Then, $\mathrm{prox}_{c_i q_i}(\lambda_i) = \mathrm{prox}_{c_i\mathcal{I}^\diamond_{\Omega_i}}(\mu_i) \times \mathrm{prox}_{c_i\mathcal{I}_{\mathcal{S}_i}}(\theta_i)$ [45, Th. 6.6], where $\mathrm{prox}_{c_i\mathcal{I}_{\mathcal{S}_i}}(\theta_i)$ is the Euclidean projection onto $\mathcal{S}_i$ [9, Sec. 1.2] and $\mathrm{prox}_{c_i\mathcal{I}^\diamond_{\Omega_i}}(\mu_i)$ can be computed by the property to be introduced in Lemma 2; (ii) if we can estimate a sufficiently large $\mathcal{S}_i$ such that $\theta_i(k)$ is always an interior point of $\mathcal{S}_i$, then $q_i(\lambda_i) = (g_i + \mathcal{I}_{\Omega_i})^\diamond(\mu_i)$, $i \in \mathcal{V}$, $k \in \mathbb{N}$. In this case, Lemma 2 can also be used.
2) Take advantage of the structure of $g_i$, $i \in \mathcal{V}$. For example, consider the widely discussed $l_1$-regularization problem, where $g_i(x_i) = \|x_i\|_1$, $\Omega_i = \mathbb{R}^M$, and $\mathcal{S}_i$ is an estimated $N$-dimensional Euclidean box, $i \in \mathcal{V}$. Then, we have
$$q_i(\lambda_i) = g_i^\diamond(\mu_i) + \mathcal{I}_{\mathcal{S}_i}(\theta_i) = \mathcal{I}_{\mathcal{W}_i}(\mu_i) + \mathcal{I}_{\mathcal{S}_i}(\theta_i) = \begin{cases} 0 & \text{if } \mu_i \in \mathcal{W}_i \text{ and } \theta_i \in \mathcal{S}_i, \\ +\infty & \text{otherwise}, \end{cases} \;=\; \mathcal{I}_{\mathcal{Y}_i}(\lambda_i), \quad (24)$$
where $\mathcal{W}_i = \{u \in \mathbb{R}^M \mid \|u\|_\infty \le 1\}$ (a convex set) and $\mathcal{Y}_i = \mathcal{S}_i \times \mathcal{W}_i$ (a convex set). The second equality holds by computing the conjugate of the $l_1$-norm [41, Sec. 3.3.1]. Then, the proximal mapping of the indicator function $q_i(\lambda_i)$ is the Euclidean projection onto $\mathcal{Y}_i$ [9, Sec. 1.2].
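In this $l_1$-regularization case, $\mathcal{Y}_i$ is a product of boxes, so the proximal mapping reduces to componentwise clipping. A minimal sketch, using a hypothetical box $\mathcal{S}_i = [-2, 2]^2$ and the unit $l_\infty$-ball $\mathcal{W}_i = [-1, 1]^2$ (the numerical data are invented for illustration):

```python
# Euclidean projection onto a box {v : lo <= v <= hi}: componentwise clipping.
# For Y_i = S_i x W_i the projection factorizes over the (theta_i, mu_i) components.

def project_box(u, lo, hi):
    return [min(max(x, l), h) for x, l, h in zip(u, lo, hi)]

theta = [3.0, -0.5]    # theta_i component, projected onto S_i = [-2, 2]^2
mu = [0.7, -4.0]       # mu_i component, projected onto W_i = [-1, 1]^2 (unit l-inf ball)

theta_p = project_box(theta, [-2.0, -2.0], [2.0, 2.0])
mu_p = project_box(mu, [-1.0, -1.0], [1.0, 1.0])

assert theta_p == [2.0, -0.5]   # only the out-of-range entry is clipped
assert mu_p == [0.7, -1.0]
```

Because the projection costs only a comparison per coordinate, the per-iteration work of the $\lambda_i$-update in this case is dominated by the gradient of $p_i$, which is the source of the efficiency claim above.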
3) If $q_i$ has a complex structure, as a general method we rewrite (22) by the definition of the proximal mapping, which gives
$$\lambda_i(k+1) = \arg\min_{\lambda_i} \bigg(q_i(\lambda_i) + \frac{1}{2c_i}\Big\|\lambda_i - \lambda_i(k) + c_i\Big(\nabla_{\lambda_i} p_i(\lambda_i(k)) + \sum_{j\in\mathcal{V}_i} K^T(\xi_i(k) - \xi_j(k))\Big)\Big\|^2\bigg). \quad (25)$$
To solve (25), one can utilize a subgradient descent method by computing the subdifferential of $q_i$ with the help of (5) and Lemma 3 to be introduced in Section VII, i.e.,
$$\arg\max_{v\in\mathbb{R}^M} \big((F\lambda_i)^T v - (g_i + \mathcal{I}_{\Omega_i})(v)\big) \in \partial_{F\lambda_i}(g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i), \quad (26)$$
$$\partial q_i(\lambda_i) = \partial_{\lambda_i}(g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i) + \partial\mathcal{I}_{\mathcal{S}_i}(\lambda_i) = F^T\partial_{F\lambda_i}(g_i + \mathcal{I}_{\Omega_i})^\diamond(F\lambda_i) + \partial\mathcal{I}_{\mathcal{S}_i}(\lambda_i). \quad (27)$$
In cases 1 and 2, each agent only needs to update $\lambda_i$ and $\xi_i$ by some simple (projected) gradient based iterations without any other costly computation, e.g., the updating of primal variables in [25–32], which reduces the computational complexity of each iteration step. In case 3, the update of $\lambda_i(k+1)$ requires an inner-loop computation of $\partial q_i$, which can be completed with local information only, $i \in \mathcal{V}$, $k \in \mathbb{N}$. This property complies with [45, Th. 6.6] since $\mathrm{prox}^{\alpha I}_{\zeta} = \mathrm{prox}_{\alpha\zeta}$, where $\alpha > 0$ and $\zeta$ is an extended real-valued function.

V. AUGMENTED DUAL PROXIMAL GRADIENT ALGORITHM DEVELOPMENT
In this section, we formulate an augmented dual problem of Problem (P1) by augmenting the dual variables of the agents. In the new problem, we consider an asymmetric scenario where each agent $i$ maintains a private constraint $\mathcal{X}_i := \{x \in \mathbb{R}^{|\mathcal{V}|M} \mid A^{(i)}x = b^{(i)}\}$, which can be regarded as an individual interpretation of the global constraint $\mathcal{X} := \{x \in \mathbb{R}^{|\mathcal{V}|M} \mid Ax = b\}$, $A^{(i)} \in \mathbb{R}^{N\times|\mathcal{V}|M}$, $b^{(i)} \in \mathbb{R}^N$, $\forall i \in \mathcal{V}$. Therefore, it is reasonable to assume that $\cap_{i\in\mathcal{V}}\mathcal{X}_i = \mathcal{X}$. Then, Problem (P1) can be equivalently written as
$$\text{(P8)} \quad \min_{x_i\in\Omega_i, \forall i\in\mathcal{V}} \; \sum_{i\in\mathcal{V}} H_i(x_i) \quad \text{s.t. } A^{(i)}x = b^{(i)}, \; \forall i \in \mathcal{V}.$$
To facilitate the following discussion, we let $A^{(l)}_i \in \mathbb{R}^{N\times M}$ denote the $i$th column block of $A^{(l)}$, i.e., $A^{(l)} = [A^{(l)}_1, ..., A^{(l)}_i, ..., A^{(l)}_{|\mathcal{V}|}]$, $i, l \in \mathcal{V}$.

To realize distributed computations, constraint $A^{(i)}x = b^{(i)}$ should only contain the decision variables of agent $i$'s neighbours, i.e.,
$$A^{(i)}_j = O_{N\times M}, \quad \forall (i,j) \notin \mathcal{E} \text{ and } i \ne j. \quad (28)$$
Specifically, in some distributed optimization problems in telecommunication systems, $A^{(i)}x = b^{(i)}$ can be defined as an affine edge-constraint maintained by agent $i$ [46]. In addition, if
$$A^{(i)}_i = |\mathcal{V}_i|I_M, \quad A^{(i)}_l = -I_M, \quad A^{(i)}_j = O_{M\times M}, \quad b^{(i)} = \mathbf{0}_M, \quad \forall (i,l) \in \mathcal{E}, \; \forall (i,j) \notin \mathcal{E} \text{ and } i \ne j, \quad (29)$$
then the constraint of Problem (P8) can be written as $\widehat{L}x = \mathbf{0}_{|\mathcal{V}|M}$ with $\widehat{L} = L \otimes I_M \in \mathbb{R}^{|\mathcal{V}|M\times|\mathcal{V}|M}$ (i.e., $x_i = x_l$, $\forall (i,l) \in \mathcal{E}$), which means Problem (P8) is essentially a consensus optimization problem.

By introducing the slack variables into Problem (P8), we have
$$\text{(P9)} \quad \min_{x} \; \sum_{i\in\mathcal{V}} \big(f_i(x_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i)\big) \quad \text{s.t. } A^{(i)}x = b^{(i)}, \; x_i = z_i, \; \forall i \in \mathcal{V}.$$
Then, the Lagrangian function of Problem (P9) can be written as
$$\begin{aligned} \widetilde{L}(x, z, \widetilde{\theta}, \widetilde{\mu}) &:= \sum_{i\in\mathcal{V}} \big(f_i(x_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i) + (\mu^{(i)})^T(x_i - z_i)\big) + \sum_{i\in\mathcal{V}} (\theta^{(i)})^T(A^{(i)}x - b^{(i)}) \\ &= \sum_{i\in\mathcal{V}} \big(f_i(x_i) + (g_i + \mathcal{I}_{\Omega_i})(z_i) + (\mu^{(i)})^T(x_i - z_i)\big) + \sum_{i\in\mathcal{V}} x_i^T\sum_{l\in\mathcal{V}} (A^{(l)}_i)^T\theta^{(l)} - \sum_{i\in\mathcal{V}} (b^{(i)})^T\theta^{(i)} \\ &= \sum_{i\in\mathcal{V}} \Big(f_i(x_i) + x_i^T\Big(\sum_{l\in\mathcal{V}} (A^{(l)}_i)^T\theta^{(l)} + \mu^{(i)}\Big) + (g_i + \mathcal{I}_{\Omega_i})(z_i) - z_i^T\mu^{(i)}\Big) - \sum_{i\in\mathcal{V}} (b^{(i)})^T\theta^{(i)}, \end{aligned} \quad (30)$$
where we use
$$\sum_{i\in\mathcal{V}} (\theta^{(i)})^T A^{(i)}x = \sum_{i\in\mathcal{V}} (\theta^{(i)})^T\sum_{l\in\mathcal{V}} A^{(i)}_l x_l = \sum_{i\in\mathcal{V}}\sum_{l\in\mathcal{V}} (\theta^{(l)})^T A^{(l)}_i x_i = \sum_{i\in\mathcal{V}} x_i^T\sum_{l\in\mathcal{V}} (A^{(l)}_i)^T\theta^{(l)}, \quad (31)$$
with $\widetilde{\theta} = [(\theta^{(1)})^T, ..., (\theta^{(|\mathcal{V}|)})^T]^T \in \mathbb{R}^{|\mathcal{V}|N}$ and $\widetilde{\mu} = [(\mu^{(1)})^T, ..., (\mu^{(|\mathcal{V}|)})^T]^T \in \mathbb{R}^{|\mathcal{V}|M}$. $\theta^{(i)}$ and $\mu^{(i)}$ denote the Lagrangian multiplier vectors associated with the constraints $A^{(i)}x = b^{(i)}$ and $x_i = z_i$, respectively, $i \in \mathcal{V}$.
Then, the dual function can be obtained by minimizing $\tilde L(x,z,\tilde\theta,\tilde\mu)$ with respect to $(x,z)$, which gives
$$\tilde V(\tilde\theta,\tilde\mu) := \min_{x,z}\tilde L(x,z,\tilde\theta,\tilde\mu) = \min_{x,z}\sum_{i\in\mathcal V}\Big(f_i(x_i)+x_i^T\big(\textstyle\sum_{l\in\mathcal V}(A_i^{(l)})^T\theta^{(l)}+\mu^{(i)}\big)+(g_i+I_{\Omega_i})(z_i)-z_i^T\mu^{(i)}\Big)-\sum_{i\in\mathcal V}(b^{(i)})^T\theta^{(i)}$$
$$= \min_{x,z}\sum_{i\in\mathcal V}\big(f_i(x_i)-x_i^T\widetilde W_i\tilde\lambda+(g_i+I_{\Omega_i})(z_i)-z_i^T\tilde F_i\tilde\lambda\big)-\tilde E\tilde\lambda = \sum_{i\in\mathcal V}\big(-f_i^\diamond(\widetilde W_i\tilde\lambda)-(g_i+I_{\Omega_i})^\diamond(\tilde F_i\tilde\lambda)-\tilde E_i\tilde\lambda\big), \quad (32)$$
where $\widetilde W_i = [-(A_i^{(1)})^T, ..., -(A_i^{(|\mathcal V|)})^T, O_{M\times(i-1)M}, -I_M, O_{M\times(|\mathcal V|-i)M}]\in\mathbb R^{M\times(|\mathcal V|M+|\mathcal V|N)}$, $\tilde\lambda=[\tilde\theta^T,\tilde\mu^T]^T\in\mathbb R^{|\mathcal V|M+|\mathcal V|N}$, $\tilde F_i=[O_{M\times|\mathcal V|N}, O_{M\times(i-1)M}, I_M, O_{M\times(|\mathcal V|-i)M}]\in\mathbb R^{M\times(|\mathcal V|M+|\mathcal V|N)}$, $\tilde E=[(b^{(1)})^T,...,(b^{(|\mathcal V|)})^T,\mathbf 0^T_{|\mathcal V|M}]\in\mathbb R^{1\times(|\mathcal V|M+|\mathcal V|N)}$, and $\tilde E_i=[\mathbf 0^T_{(i-1)N},(b^{(i)})^T,\mathbf 0^T_{(|\mathcal V|-i)N+|\mathcal V|M}]\in\mathbb R^{1\times(|\mathcal V|M+|\mathcal V|N)}$.
In the fourth equality, we decompose $\tilde E\tilde\lambda = \sum_{i\in\mathcal V}\tilde E_i\tilde\lambda$, which enables agent $i$ to maintain the local information $\tilde E_i\tilde\lambda = (b^{(i)})^T\theta^{(i)}$.

Then, the dual problem of Problem (P9) can be formulated as

(P10) $\min_{\tilde\lambda}\ \tilde\varphi(\tilde\lambda)$,

where $\tilde\varphi(\tilde\lambda)=\tilde P(\tilde\lambda)+\tilde Q(\tilde\lambda)$, $\tilde P(\tilde\lambda)=\sum_{i\in\mathcal V}\tilde p_i(\tilde\lambda)$, $\tilde Q(\tilde\lambda)=\sum_{i\in\mathcal V}\tilde q_i(\tilde\lambda)$, $\tilde p_i(\tilde\lambda)=f_i^\diamond(\widetilde W_i\tilde\lambda)+\tilde E_i\tilde\lambda$, and $\tilde q_i(\tilde\lambda)=(g_i+I_{\Omega_i})^\diamond(\tilde F_i\tilde\lambda)$. Define $\mathcal H$ as the set of the optimal solutions of Problem (P10). To solve Problem (P10), the A-DPG algorithm is designed as
$$\tilde\lambda(k+1)=\mathrm{prox}_{c\tilde Q}\big[\tilde\lambda(k)-c\nabla_{\tilde\lambda}\tilde P(\tilde\lambda(k))\big], \quad (33)$$
which means
$$\theta^{(i)}(k+1)=\theta^{(i)}(k)-c\nabla_{\theta^{(i)}}\tilde P(\tilde\lambda(k)),\quad \mu^{(i)}(k+1)=\mathrm{prox}_{c\tilde q_i}\big[\mu^{(i)}(k)-c\nabla_{\mu^{(i)}}\tilde P(\tilde\lambda(k))\big],\ \forall i\in\mathcal V, \quad (34)$$
where $\nabla_{\tilde\lambda}\tilde P=[\nabla^T_{\theta^{(1)}}\tilde P,...,\nabla^T_{\theta^{(|\mathcal V|)}}\tilde P,\nabla^T_{\mu^{(1)}}\tilde P,...,\nabla^T_{\mu^{(|\mathcal V|)}}\tilde P]^T$, $c>0$ is the step-size, and $k\in\mathbb N$. The proximal mapping for computing $\tilde\theta$ is omitted since $\tilde\theta$ is not contained in $\tilde Q$.
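The update (33) is the generic proximal gradient template $\tilde\lambda\leftarrow\mathrm{prox}_{c\tilde Q}[\tilde\lambda-c\nabla\tilde P(\tilde\lambda)]$. As a minimal, self-contained sketch of this template (applied to an assumed toy composite problem with data $a$ and weight $\tau$, not to the paper's dual function), the snippet below takes the smooth part as a quadratic and the non-smooth part as an $\ell_1$ term, whose proximal mapping is componentwise soft-thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal mapping of t*||.||_1: componentwise shrinkage towards zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(a, tau, c, iters):
    # minimise P(lam) + Q(lam) with smooth P(lam) = 0.5*||lam - a||^2 and
    # non-smooth Q(lam) = tau*||lam||_1, via the template of (33):
    #   lam <- prox_{c*Q}[lam - c*grad P(lam)]
    lam = np.zeros_like(a)
    for _ in range(iters):
        lam = soft_threshold(lam - c * (lam - a), c * tau)
    return lam

a = np.array([3.0, -0.5, 1.2])
lam = prox_grad(a, tau=1.0, c=1.0, iters=50)
# for this toy problem the minimiser is soft_threshold(a, tau)
print(lam)  # [2.  0.  0.2]
```

The iterate matches the closed-form minimiser `soft_threshold(a, tau)`; in (33) the only difference is that the smooth part is the conjugate-based $\tilde P$ and the prox acts blockwise on the $\mu^{(i)}$ components.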
To realize distributed computations, we can let the updating of $\tilde\lambda_i := [(\theta^{(i)})^T,(\mu^{(i)})^T]^T\in\mathbb R^{N+M}$ be maintained by agent $i$, i.e.,
$$\tilde\lambda_i(k+1)=\mathrm{prox}_{c\tilde q_i}\big[\tilde\lambda_i(k)-c\nabla_{\tilde\lambda_i}\tilde P(\tilde\lambda(k))\big], \quad (35)$$
which means
$$\theta^{(i)}(k+1)=\theta^{(i)}(k)-c\nabla_{\theta^{(i)}}\tilde P(\tilde\lambda(k)), \quad (36)$$
$$\mu^{(i)}(k+1)=\mathrm{prox}_{c\tilde q_i}\big[\mu^{(i)}(k)-c\nabla_{\mu^{(i)}}\tilde P(\tilde\lambda(k))\big]. \quad (37)$$
Note that $\tilde F_i\tilde\lambda=\mu^{(i)}$; hence the variables of $\tilde q_i$ are decoupled from each other, $i\in\mathcal V$. However, each $\tilde p_i(\tilde\lambda)$ contains the information $\sum_{l\in\mathcal V}(A_i^{(l)})^T\theta^{(l)}=\sum_{l\in\mathcal V_i}(A_i^{(l)})^T\theta^{(l)}$ (due to (28)), which means that $\tilde P(\tilde\lambda)$ is coupled among the neighbouring agents. Therefore, to compute the complete gradient vector $\nabla_{\tilde\lambda_i}\tilde P(\tilde\lambda(k))$, agent $i$ needs to collect $\tilde p_l(\tilde\lambda)$ from each neighbour $l\in\mathcal V_i$, $k\in\mathbb N$. The detailed computation procedure is stated in Algorithm 2.

Algorithm 2:
Augmented Dual Proximal Gradient Algorithm
Initialize $\tilde\lambda(0)$. Determine step-size $c>0$.
for $k=0,1,2,...$ do
  for $i=1,2,...,|\mathcal V|$ do (in parallel)
    Broadcast $\tilde p_i(\tilde\lambda)$ to neighbours at $k=0$.
    Update $\tilde\lambda_i$:
    $$\tilde\lambda_i(k+1)=\mathrm{prox}_{c\tilde q_i}\big[\tilde\lambda_i(k)-c\nabla_{\tilde\lambda_i}\tilde P(\tilde\lambda(k))\big]. \quad (38)$$
  end for
  Obtain an output $\tilde\lambda_{\mathrm{out}}$ under certain convergence criterion.
end for

Remark 4. The asymmetric individual constraints imply that different agents may interpret some common global constraints in different manners. For instance, agent $i$ may interpret constraint $Ax=b$ via a linear transformation $T_iAx=T_ib$, i.e., $A^{(i)}=T_iA$ and $b^{(i)}=T_ib$, where $T_i\in\mathbb R^{N\times N}$ is not necessarily of full column rank, $i\in\mathcal V$. Compared with symmetric scenarios, the asymmetric individual constraints introduce asymmetric Lagrangian multipliers for the coupling constraints, where the dual variables of the agents are decomposed in a natural way and no global consensus of $\theta^{(i)}$ is required, $\forall i\in\mathcal V$.

VI. ASYNCHRONOUS DUAL PROXIMAL GRADIENT ALGORITHM DEVELOPMENT
In synchronous networks, the information accessed by the agents is assumed to be up-to-date, which requires efficient data transmission and can be restrictive in practical applications [47]. To address this issue, we propose an Asyn-DPG algorithm for asynchronous networks by considering communication delays. To this end, based on the setup of Problem (P10), we define $\tau(k)$ as a time instant previous to instant $k$ with $k-\tau(k)\geq 0$, $\tau(k)\in\mathbb N$, $k\in\mathbb N$. Therefore, the dual information accessed by the agents may not be the latest $\tilde\lambda(k)$ but a historical version $\tilde\lambda(\tau(k))$. It is reasonable to assume that each agent always knows its own latest information.

[Fig. 1 schematic: each agent $i$ stores $\Omega_i$, $c_i$, $\tilde p_i(\tilde\lambda)$, and $\tilde p_l(\tilde\lambda)$, $\forall l\in\mathcal V_i$, and exchanges this information with its neighbours over the communication network.]
Fig. 1. An illustration of the proposed communication and computation mechanism of the asynchronous network.
Assumption 3.
The communication delays in the network are upper bounded by $D\in\mathbb N$, which means $0\leq k-\tau(k)\leq D$, $\tau(k)\in\mathbb N$, $k\in\mathbb N$.

An upper bound on the delays is a commonly used assumption for asynchronous networks, e.g., in [34]. The proposed mechanism of the asynchronous network is illustrated in Fig. 1. The proposed Asyn-DPG algorithm is stated in Algorithm 3.
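The bounded-delay model of Assumption 3 is easy to mimic numerically. The sketch below (a plain gradient iteration on an assumed quadratic, not the Asyn-DPG update itself) evaluates the gradient at the stale iterate $\tilde\lambda(\tau(k))$ with the worst-case delay $\tau(k)=\max\{0,k-D\}$, and shrinks the step-size by the factor $1/(D+1)$ in the spirit of the delay-dependent step-size condition used later in Theorem 3:

```python
import numpy as np

def delayed_grad_descent(a, D, iters):
    # gradient iteration on 0.5*||lam - a||^2 (gradient Lipschitz constant 1),
    # but each step uses the gradient at the stale iterate lam(tau(k)) with the
    # worst-case bounded delay tau(k) = max(0, k - D) of Assumption 3;
    # the step-size is shrunk to c = 1/(D+1) to compensate for the delay
    c = 1.0 / (D + 1)
    hist = [np.zeros_like(a)]
    for k in range(iters):
        stale = hist[max(0, k - D)]          # possibly outdated information
        hist.append(hist[-1] - c * (stale - a))
    return hist[-1]

a = np.array([1.0, -2.0, 0.5])
lam = delayed_grad_descent(a, D=5, iters=1500)
print(np.max(np.abs(lam - a)))  # residual error; small despite stale gradients
```

The delay slows convergence but, with the shrunken step-size, does not destroy it; this is the qualitative behaviour quantified by Theorem 3.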
Lemma 2 (Extended Moreau Decomposition [22, Lemma V.10]). Let $\zeta(u):\mathbb R^n\to\mathbb R\cup\{+\infty\}$ be a proper, closed, convex function and $\zeta^\diamond$ be its Fenchel conjugate. Then, we have
$$u=\alpha\,\mathrm{prox}_{\frac{1}{\alpha}\zeta^\diamond}\Big[\frac{u}{\alpha}\Big]+\mathrm{prox}_{\alpha\zeta}[u], \quad (40)$$
$\forall\alpha>0$, $\forall u\in\mathbb R^n$.

Remark 5.
Based on Lemma 2, the updating of $\mu^{(i)}$ in Algorithms 2 and 3 can be equivalently written as
$$\tilde\varrho_i(k)=\mu^{(i)}(k)-c\nabla_{\mu^{(i)}}\tilde P(\tilde\lambda(k)), \quad (41)$$
$$\mu^{(i)}(k+1)=\mathrm{prox}_{c\tilde q_i}\big[\tilde\varrho_i(k)\big]=\tilde\varrho_i(k)-c\,\mathrm{prox}_{\frac{1}{c}\tilde q_i^\diamond}\Big[\frac{\tilde\varrho_i(k)}{c}\Big], \quad (42)$$
and
$$\tilde\varrho'_i(k)=\mu^{(i)}(k)-c_i\nabla_{\mu^{(i)}}\tilde P(\tilde\lambda(\tau(k))), \quad (43)$$
$$\mu^{(i)}(k+1)=\mathrm{prox}_{c_i\tilde q_i}\big[\tilde\varrho'_i(k)\big]=\tilde\varrho'_i(k)-c_i\,\mathrm{prox}_{\frac{1}{c_i}\tilde q_i^\diamond}\Big[\frac{\tilde\varrho'_i(k)}{c_i}\Big], \quad (44)$$
respectively, with $\tilde q_i^\diamond(\tilde\lambda)=(g_i+I_{\Omega_i})^{\diamond\diamond}(\tilde F_i\tilde\lambda)=(g_i+I_{\Omega_i})(\tilde F_i\tilde\lambda)$ due to the convexity and lower semi-continuity of $g_i+I_{\Omega_i}$, where $(g_i+I_{\Omega_i})^{\diamond\diamond}$ is the biconjugate of $g_i+I_{\Omega_i}$, $\forall i\in\mathcal V$ [41, Sec. 3.3.2]. With these arrangements, the calculation of $(g_i+I_{\Omega_i})^\diamond$ is not required, which reduces the computational complexity.

Algorithm 3: Asynchronous Dual Proximal Gradient Algorithm
Initialize $\tilde\lambda(0)$. Determine step-sizes $c_i>0$, $\forall i\in\mathcal V$.
for $k=0,1,2,...$ do
  for $i=1,2,...,|\mathcal V|$ do (in parallel)
    Broadcast $\tilde p_i(\tilde\lambda)$ to neighbours at $k=0$.
    Update $\tilde\lambda_i$:
    $$\tilde\lambda_i(k+1)=\mathrm{prox}_{c_i\tilde q_i}\big[\tilde\lambda_i(k)-c_i\nabla_{\tilde\lambda_i}\tilde P(\tilde\lambda(\tau(k)))\big]. \quad (39)$$
  end for
  Obtain an output $\tilde\lambda_{\mathrm{out}}$ under certain convergence criterion.
end for

VII. CONVERGENCE ANALYSIS AND DISCUSSION
Lemma 3.
Let $\zeta:\mathbb R^n\to\mathbb R\cup\{+\infty\}$ be a proper, closed, convex function and $\zeta^\diamond$ be its Fenchel conjugate. Then
$$\arg\max_{v}\,(u^Tv-\zeta(v))\in\partial_u\zeta^\diamond(u). \quad (45)$$
Moreover, if $\zeta$ is $\sigma$-strongly convex, $\sigma>0$, then
$$\arg\max_{v}\,(u^Tv-\zeta(v))=\nabla_u\zeta^\diamond(u), \quad (46)$$
and $\zeta^\diamond(u)$ is $\frac{1}{\sigma}$-Lipschitz continuously differentiable [22, Lemma V.7]. The proof of (45) is given in Appendix A.
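The extended Moreau decomposition (40) of Lemma 2 can be sanity-checked numerically. A small sketch with the assumed choice $\zeta=\|\cdot\|_1$ (an illustration, not taken from the paper): its conjugate $\zeta^\diamond$ is the indicator of the $\ell_\infty$ unit ball, so $\mathrm{prox}_{\alpha\zeta}$ is soft-thresholding and the prox of $\frac{1}{\alpha}\zeta^\diamond$ is projection onto $[-1,1]^n$ for any positive scaling:

```python
import numpy as np

def prox_l1(u, alpha):
    # prox of alpha*||.||_1: soft-thresholding
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def prox_conj_l1(u):
    # the conjugate of ||.||_1 is the indicator of the l-inf unit ball;
    # the prox of any positive multiple of an indicator is the projection
    return np.clip(u, -1.0, 1.0)

rng = np.random.default_rng(0)
u = 3.0 * rng.normal(size=5)
alpha = 0.7
# identity (40): u = alpha * prox_{(1/alpha) zeta*}(u / alpha) + prox_{alpha zeta}(u)
recomposed = alpha * prox_conj_l1(u / alpha) + prox_l1(u, alpha)
print(np.allclose(recomposed, u))  # True
```

This is exactly the mechanism used in (42) and (44): the prox of $\tilde q_i$ is recovered from the prox of its conjugate, so $(g_i+I_{\Omega_i})^\diamond$ itself never has to be evaluated.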
Lemma 4.
With Assumption 1, the Lipschitz constant of $\nabla_\lambda P(\lambda)$ is given by
$$h=\sqrt{\sum_{i\in\mathcal V}h_i^2},\ \text{where}\ h_i=\frac{\|W_i\|^2}{\sigma_i}. \quad (47)$$
The proof is given in Appendix B.
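The blockwise constant $h_i=\|W_i\|^2/\sigma_i$ in (47) can be checked on a concrete instance. In the sketch below (an assumed example: $f(x)=\frac{\sigma}{2}\|x\|^2$, so $f^\diamond(y)=\frac{1}{2\sigma}\|y\|^2$ and the map $u\mapsto\nabla_u f^\diamond(W u)=\frac{1}{\sigma}W^TWu$ is linear), the empirical Lipschitz ratio never exceeds $\|W\|^2/\sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                            # strong-convexity modulus of f = (sigma/2)||.||^2
W = rng.normal(size=(4, 6))
h = np.linalg.norm(W, 2) ** 2 / sigma  # claimed Lipschitz constant ||W||^2 / sigma

def grad(u):
    # gradient of u -> f*(W u) for f*(y) = ||y||^2 / (2 sigma)
    return W.T @ (W @ u) / sigma

ratios = []
for _ in range(1000):
    u, v = rng.normal(size=6), rng.normal(size=6)
    ratios.append(np.linalg.norm(grad(u) - grad(v)) / np.linalg.norm(u - v))
print(max(ratios) <= h + 1e-9)  # True
```

Here the bound is tight since the gradient map is linear with spectral norm exactly $\|W\|^2/\sigma$; for a general $\sigma_i$-strongly convex $f_i$, Lemma 3 gives the same bound via the $\frac{1}{\sigma_i}$-Lipschitz continuity of $\nabla f_i^\diamond$.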
Lemma 5. By algorithm (19), we have
$$(\xi-\xi(k+1))^TD_{N\mathcal V}\big[\tfrac{1}{\gamma_l}\big](\xi(k)-\xi(k+1))+(\xi-\xi(k+1))^TM\lambda(k+1)=0, \quad (48)$$
where $D_{N\mathcal V}[\tfrac{1}{\gamma_l}]=(D_{N\mathcal V}[\gamma_l])^{-1}$, $\forall\xi\in\mathbb R^{|\mathcal V|N}$. The proof is given in Appendix C. (Footnote: (41) and (43) can be substituted into (42) and (44), respectively, to generate one-step update formulas for $\mu^{(i)}$, $i\in\mathcal V$.)
Lemma 6.
Suppose that Assumptions 1 and 2 hold. Based on Algorithm 1, we have, for any $(\lambda^*,\xi^*)\in\mathcal R$,
$$\varphi(\lambda(k+1))-\varphi(\lambda^*)+(\xi^*)^TM\lambda(k+1)\in[0,C_k],$$
where
$$C_k=\tfrac12\|\lambda^*-\lambda(k)\|^2_{D_{(M+N)\mathcal V}[\frac{1}{c_l}]}-\tfrac12\|\lambda^*-\lambda(k+1)\|^2_{D_{(M+N)\mathcal V}[\frac{1}{c_l}]}+\tfrac12\|\xi^*-\xi(k)\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}-\tfrac12\|\xi^*-\xi(k+1)\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}$$
$$-\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D_{(M+N)\mathcal V}[\frac{1}{c_l}-h_l]}-\tfrac12\|\xi(k)-\xi(k+1)\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}+\|M\lambda(k+1)\|^2_{D_{N\mathcal V}[\gamma_l]},\ k\in\mathbb N. \quad (49)$$
The proof is given in Appendix D.

Theorem 1.
Suppose that Assumptions 1 and 2 hold. Let $c_i\leq\frac{1}{h_i}$, $\forall i\in\mathcal V$. By Algorithm 1, we have, for any $(\lambda^*,\xi^*)\in\mathcal R$,
$$|\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)|\leq\frac{1}{K+1}\Big(\|\lambda^*-\lambda(0)\|^2_{D_{(M+N)\mathcal V}[\frac{1}{c_l}]}+\|\xi^*\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}+\|\xi(0)\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}\Big)+O(\gamma_{\max}), \quad (50)$$
$$\|\xi^*\|\|M\bar\lambda(K+1)\|\leq\frac{1}{K+1}\Big(\|\lambda^*-\lambda(0)\|^2_{D_{(M+N)\mathcal V}[\frac{1}{c_l}]}+\|\xi^*\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}+\|\xi(0)\|^2_{D_{N\mathcal V}[\frac{1}{\gamma_l}]}\Big)+O(\gamma_{\max}), \quad (51)$$
where $\gamma_{\max}=\max_{l\in\mathcal V}\gamma_l$, $O(\gamma_{\max})=\gamma_{\max}|\mathcal V|\|\tilde L\|^2\Gamma^2$, $\bar\lambda(K+1)=\frac{1}{K+1}\sum_{k=0}^K\lambda(k+1)$, $K\in\mathbb N_+$. The proof is given in Appendix E.
Remark 6. $O(\gamma_{\max})$ characterizes the upper bound of the stationary error of the algorithm. A larger estimated zone of $S_i$ may lead to a larger $\Gamma$, which may produce a larger stationary error. Meanwhile, a smaller $\gamma_{\max}$ can reduce the stationary error but may sacrifice convergence speed.

Lemma 7.
With Assumption 1, the Lipschitz constant of $\nabla_{\tilde\lambda}\tilde P(\tilde\lambda)$ is given by
$$\tilde h=\sum_{i\in\mathcal V}\tilde h_i,\ \text{where}\ \tilde h_i=\frac{\|\widetilde W_i\|^2}{\sigma_i}. \quad (52)$$
The proof is given in Appendix F.

Theorem 2.
Suppose that Assumptions 1 and 2 hold. Let $\frac{1}{c}\geq\tilde h$. By Algorithm 2, we have, for any $\tilde\lambda^*\in\mathcal H$,
$$\tilde\varphi(\tilde\lambda(K+1))-\tilde\varphi(\tilde\lambda^*)\leq\frac{\tilde h\,\|\tilde\lambda(0)-\tilde\lambda^*\|^2}{2K},\ K\in\mathbb N_+. \quad (53)$$
The result of Theorem 2 can be deduced from [11, Th. 3.1] by employing the Lipschitz constant in (52). Hence, the detailed proof is omitted for simplicity.

Lemma 8.
Based on Assumption 3, for some $K\in\mathbb N_+$, we have
$$\sum_{k=0}^K\big(\|\tilde\lambda(k+1)-\tilde\lambda(k)\|^2+\cdots+\|\tilde\lambda(\tau(k)+1)-\tilde\lambda(\tau(k))\|^2\big)\leq\sum_{k=0}^K(D+1)\|\tilde\lambda(k+1)-\tilde\lambda(k)\|^2, \quad (54)$$
and
$$\sum_{k=0}^K k\big(\|\tilde\lambda(k+1)-\tilde\lambda(k)\|^2+\cdots+\|\tilde\lambda(\tau(k)+1)-\tilde\lambda(\tau(k))\|^2\big)\leq\sum_{k=0}^K\frac{(2k+D)(D+1)}{2}\|\tilde\lambda(k+1)-\tilde\lambda(k)\|^2. \quad (55)$$
The proof is given in Appendix G.
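The counting bound (54) states that, when the stale increments between $\tau(k)$ and $k$ are accumulated over all $k$, each individual increment is counted at most $D+1$ times. This can be checked numerically on an arbitrary (assumed, randomly generated) iterate sequence:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = rng.normal(size=(60, 3))       # an arbitrary iterate sequence lam(0..59)
D, K = 4, 50
tau = lambda k: max(0, k - D)        # worst-case bounded delay of Assumption 3

def diff(j):
    # squared increment ||lam(j+1) - lam(j)||^2
    return np.linalg.norm(lam[j + 1] - lam[j]) ** 2

# left-hand side of (54): for each k, sum the increments from tau(k) to k
lhs = sum(sum(diff(j) for j in range(tau(k), k + 1)) for k in range(K + 1))
# right-hand side of (54): each increment counted at most D+1 times
rhs = sum((D + 1) * diff(k) for k in range(K + 1))
print(lhs <= rhs + 1e-9)             # True
```

The same multiplicity argument, weighted by $k$, yields (55).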
Theorem 3. Suppose that Assumptions 1 to 3 hold. By Algorithm 3, given that
$$\frac{1}{c_i}\geq\tilde h(D+1), \quad (56)$$
for some $K\in\mathbb N_+$, we have, for any $\tilde\lambda^*\in\mathcal H$,
$$\tilde\varphi(\tilde\lambda(K+1))-\tilde\varphi(\tilde\lambda^*)\leq\frac{\Lambda(c_1,...,c_{|\mathcal V|},D)}{K+1}, \quad (57)$$
where
$$\Lambda(c_1,...,c_{|\mathcal V|},D)=\sum_{k=0}^{\lfloor\frac{D}{2}\rfloor}\sum_{i\in\mathcal V}\Big(\tilde h\frac{(2k+D)(D+1)}{2}-\frac{k}{c_i}\Big)\|\tilde\lambda_i(k+1)-\tilde\lambda_i(k)\|^2+\sum_{i\in\mathcal V}\frac{1}{2c_i}\|\tilde\lambda_i(0)-\tilde\lambda^*_i\|^2. \quad (58)$$
The proof is given in Appendix H.

Remark 7.
As seen from Theorem 3, if $c_i$ equals a common value $c>0$ and $D=0$, $\forall i\in\mathcal V$, then the Asyn-DPG algorithm reduces to the A-DPG algorithm.

VIII. NUMERICAL RESULT
In this section, we verify the feasibility of Algorithms 1 to 3 by considering a social welfare optimization problem in an electricity market which contains 2 utility companies (UCs) and 3 energy users.
A. Simulation Setup
The social welfare optimization problem of the market is formulated as follows:

(P11) $\min_x\ \sum_{i\in\mathcal V_{UC}}\Phi_i(x^{UC}_i)-\sum_{j\in\mathcal V_{user}}\Psi_j(x^{user}_j)$
s.t. $\sum_{i\in\mathcal V_{UC}}x^{UC}_i=\sum_{j\in\mathcal V_{user}}x^{user}_j$, (59a)
$x^{UC}_i\in[0,x^{UC}_{i,\max}]$, $\forall i\in\mathcal V_{UC}$, (59b)
$x^{user}_j\in[0,x^{user}_{j,\max}]$, $\forall j\in\mathcal V_{user}$, (59c)

where $\mathcal V_{UC}=\{1,2,...,|\mathcal V_{UC}|\}$ and $\mathcal V_{user}=\{1,2,...,|\mathcal V_{user}|\}$ are the sets of UCs and users, respectively, and $x=[x^{UC}_1,...,x^{UC}_{|\mathcal V_{UC}|},x^{user}_1,...,x^{user}_{|\mathcal V_{user}|}]^T$ with $x^{UC}_i$ and $x^{user}_j$ being the quantities of energy generation and consumption of UC $i$ and user $j$, respectively. $\Phi_i(x^{UC}_i)$ is the cost function of UC $i$ and $\Psi_j(x^{user}_j)$ is the utility function of user $j$, $i\in\mathcal V_{UC}$, $j\in\mathcal V_{user}$. Constraint (59a) ensures the supply-demand balance in the market. $x^{UC}_{i,\max}>0$ and $x^{user}_{j,\max}>0$ are the upper bounds of $x^{UC}_i$ and $x^{user}_j$, respectively. The detailed expressions of $\Phi_i(x^{UC}_i)$ and $\Psi_j(x^{user}_j)$ are designed as
$$\Phi_i(x^{UC}_i)=\kappa_i(x^{UC}_i)^2+\vartheta_i x^{UC}_i+\beta_i, \quad (60)$$
$$\Psi_j(x^{user}_j)=\begin{cases}\pi_j x^{user}_j-\varsigma_j(x^{user}_j)^2, & x^{user}_j\leq\frac{\pi_j}{2\varsigma_j},\\[2pt] \frac{\pi_j^2}{4\varsigma_j}, & x^{user}_j>\frac{\pi_j}{2\varsigma_j},\end{cases} \quad (61)$$
where $\kappa_i$, $\vartheta_i$, $\beta_i$, $\pi_j$, $\varsigma_j$ are parameters, $\forall i\in\mathcal V_{UC}$, $\forall j\in\mathcal V_{user}$. The values of the parameters are set in Table I [48].

[TABLE I: parameters of the UCs ($\kappa_i$, $\vartheta_i$, $\beta_i$, $x^{UC}_{i,\max}$) and of the energy users ($\pi_j$, $\varsigma_j$, $x^{user}_{j,\max}$), indexed by $i/j$.]
Fig. 2. Two communication topologies of the market participants.
We rewrite Problem (P11) as

(P12) $\min_x\ \sum_{i\in\mathcal V_{UC}}\big(\Phi_i(x^{UC}_i)+I_{\Omega_i}(x^{UC}_i)\big)+\sum_{j\in\mathcal V_{user}}\big(-\Psi_j(x^{user}_j)+I_{\Omega_j}(x^{user}_j)\big)$
s.t. $\sum_{i\in\mathcal V_{UC}}x^{UC}_i=\sum_{j\in\mathcal V_{user}}x^{user}_j$, (62)

where $\Omega_i=[0,x^{UC}_{i,\max}]$ and $\Omega_j=[0,x^{user}_{j,\max}]$, $\forall i\in\mathcal V_{UC}$, $\forall j\in\mathcal V_{user}$. Moreover, we define the constraint matrix $A:=[\mathbf 1^T_{|\mathcal V_{UC}|},-\mathbf 1^T_{|\mathcal V_{user}|}]$. Then, (62) can be represented by $Ax=0$. With the above arrangements, Problem (P12) complies with the structure of Problem (P2).

By decomposing the objective function of Problem (P12) with slack variables, the corresponding Lagrangian function can be written as
$$L(x,z,\theta,\mu):=\sum_{i\in\mathcal V_{UC}}\big(\Phi_i(x^{UC}_i)+I_{\Omega_i}(x^{UC}_i)\big)+\sum_{j\in\mathcal V_{user}}\big(-\Psi_j(x^{user}_j)+I_{\Omega_j}(x^{user}_j)\big)+\theta Ax+\sum_{i\in\mathcal V_{UC}}\mu^{UC}_i(x^{UC}_i-z^{UC}_i)+\sum_{j\in\mathcal V_{user}}\mu^{user}_j(x^{user}_j-z^{user}_j), \quad (63)$$
where $z=[z^{UC}_1,...,z^{UC}_{|\mathcal V_{UC}|},z^{user}_1,...,z^{user}_{|\mathcal V_{user}|}]^T$ is a slack vector, and $\theta$ and $\mu=[\mu^{UC}_1,...,\mu^{UC}_{|\mathcal V_{UC}|},\mu^{user}_1,...,\mu^{user}_{|\mathcal V_{user}|}]^T$ are dual variables. Define $\hat\theta:=[\theta^{UC}_1,...,\theta^{UC}_{|\mathcal V_{UC}|},\theta^{user}_1,...,\theta^{user}_{|\mathcal V_{user}|}]^T$, which contains the local estimates of $\theta$. Let $\xi:=[\xi^{UC}_1,...,\xi^{UC}_{|\mathcal V_{UC}|},\xi^{user}_1,...,\xi^{user}_{|\mathcal V_{user}|}]^T$ be the Lagrangian multiplier vector associated with the constraint on $\hat\theta$ as indicated in Problem (P7). To construct the asymmetric individual constraints, we define $A^{(i),UC}:=[A^{(i),UC}_1,...,A^{(i),UC}_{|\mathcal V_{UC}|},A^{(i),UC}_{|\mathcal V_{UC}|+1},...,A^{(i),UC}_{|\mathcal V_{UC}|+|\mathcal V_{user}|}]$ and $A^{(j),user}:=[A^{(j),user}_1,...,A^{(j),user}_{|\mathcal V_{UC}|},A^{(j),user}_{|\mathcal V_{UC}|+1},...,A^{(j),user}_{|\mathcal V_{UC}|+|\mathcal V_{user}|}]$ as the constraint matrices of UC $i\in\mathcal V_{UC}$ and user $j\in\mathcal V_{user}$, respectively.
Then, the augmented Lagrangian function of Problem (P12) can be written as
$$\tilde L(x,z,\tilde\theta,\tilde\mu):=\sum_{i\in\mathcal V_{UC}}\big(\Phi_i(x^{UC}_i)+I_{\Omega_i}(x^{UC}_i)\big)+\sum_{j\in\mathcal V_{user}}\big(-\Psi_j(x^{user}_j)+I_{\Omega_j}(x^{user}_j)\big)$$
$$+\sum_{i\in\mathcal V_{UC}}x^{UC}_i\Big(\sum_{l\in\mathcal V_{UC}}A^{(l),UC}_i\theta^{UC}_l+\sum_{l'\in\mathcal V_{user}}A^{(l'),user}_i\theta^{user}_{l'}+\mu^{UC}_i\Big)+\sum_{j\in\mathcal V_{user}}x^{user}_j\Big(\sum_{l\in\mathcal V_{UC}}A^{(l),UC}_{|\mathcal V_{UC}|+j}\theta^{UC}_l+\sum_{l'\in\mathcal V_{user}}A^{(l'),user}_{|\mathcal V_{UC}|+j}\theta^{user}_{l'}+\mu^{user}_j\Big)$$
$$-\sum_{i\in\mathcal V_{UC}}z^{UC}_i\mu^{UC}_i-\sum_{j\in\mathcal V_{user}}z^{user}_j\mu^{user}_j. \quad (64)$$
With some direct calculations, the optimal solution to Problem (P12) is $x^*=[0,\cdots]^T$. In this simulation, the initial states are set as $\theta^{UC}_i(0),\theta^{user}_j(0),\mu^{UC}_i(0),\mu^{user}_j(0),\xi^{UC}_i(0),\xi^{user}_j(0)\in[0,\cdots]$, $\forall i\in\mathcal V_{UC}$, $\forall j\in\mathcal V_{user}$.

B. Simulation Result and Discussion

1) Simulation with D-DPG Algorithm:
To demonstrate the performance of Algorithm 1, we consider the communication topology shown in Fig. 2-(a). The step-sizes of the UCs and users are chosen as $c_l\in[0.\cdots,0.\cdots]$ and $\gamma_l\in[0.\cdots,0.\cdots]$, $\forall l\in\mathcal V_{UC}\cup\mathcal V_{user}$. The simulation result is shown in Figs. 3-(a) to 3-(c). Fig. 3-(a) shows the dynamics of the dual variables $\hat\theta$ and $\mu$. It can be seen that all the elements of $\hat\theta$ converge to a common negative value $\theta^*$ while $\mu$ converges to $\mu^*$. It can be checked that the optimal solution to the primal problem is $x^*=\arg\min_x L(x,z,\theta^*,\mu^*)=[0,\cdots]^T$, which means that the lower and upper bounds of $x^{UC}_1$ and $x^{UC}_2$ are activated, respectively, while the other variables reach interior optimal solutions. Fig. 3-(b) depicts the dynamics of $\xi$. Fig. 3-(c) shows that the value of the dual function $\varphi(\lambda)$ decreases to around 756.53.
2) Simulation with A-DPG Algorithm:
To apply Algorithm 2, we consider a fully connected network since all the agents are involved in the supply-demand balance constraint. In this case, the communication graph of the network is shown in Fig. 2-(b). (Footnote: the minimization with respect to $x$ is independent of $z$ since $x$ and $z$ are decoupled in (63).)

Fig. 3. (a) Dynamics of $\hat\theta$ and $\mu$; (b) dynamics of $\xi$; (c) dynamics of $\varphi(\lambda)$.

Due to the different individual interpretations of the global constraint, with the linear transformations introduced in Remark 4, we let $[A^{(1),UC},A^{(2),UC},A^{(1),user},A^{(2),user},A^{(3),user}]=[T_1A,T_2A,T_3A,T_4A,T_5A]$, where $T_1=1$, $T_2=2$, $T_4=1$, and $T_3$, $T_5$ are negative scalars. A uniform step-size $c=0.\cdots$ is adopted by all the agents. The simulation result is shown in Figs. 4-(a) and 4-(b). As shown in Fig. 4-(a), different from the result of simulation 1, no consensus of the elements of $\tilde\theta$ is ensured (see Remark 4 for a detailed explanation). Meanwhile, the optimal solution to the primal problem can be obtained by $x^*=\arg\min_x\tilde L(x,z,\tilde\theta^*,\tilde\mu^*)=[0,\cdots]^T$, and the value of the dual function $\tilde\varphi(\tilde\lambda)$ reaches its minimum $\tilde\varphi^*$, which is around 756.53.
3) Simulation with Asyn-DPG Algorithm:
To apply Algorithm 3, we utilize the communication topology of simulation 2. The heterogeneous step-sizes are set as $c_l\in[0.\cdots,0.\cdots]$, $\forall l\in\mathcal V_{UC}\cup\mathcal V_{user}$. The upper bound of communication delays is set as $D\in\{1,5,10,15\}$. To represent the "worst delays", we let $\tau(k)=\max\{0,k-D\}$, $k\in\mathbb N$. In addition, we define $\epsilon(k):=\tilde\varphi(\tilde\lambda(k))-\tilde\varphi^*$ to characterize the dynamics of the convergence error. With the same asymmetric constraints as in simulation 2, the simulation result is shown in Fig. 5. It can be seen that, with different delays, the minimum of $\tilde\varphi(\tilde\lambda)$, i.e., $\tilde\varphi^*$, is achieved asymptotically within a certain convergence error tolerance, which implies that the optimal solution to the primal problem is achieved, since simulations 2 and 3 are based on the same setup of Problem (P10). In Fig. 5, one can also note that a larger delay slows down the convergence, which is consistent with result (57), i.e., a larger value of $D$ can produce a larger error bound for a given step index.

Fig. 4. (a) Dynamics of $\tilde\theta$ and $\tilde\mu$; (b) dynamics of $\tilde\varphi(\tilde\lambda)$.

Fig. 5. Dynamics of $\epsilon$ with different delays ($D=1,5,10,15$).

IX. CONCLUSION
In this work, we focused on optimizing a composite DOP with both local convex and coupling affine constraints by developing several efficient distributed optimization algorithms. Specifically, three kinds of dual proximal gradient based algorithms were proposed, namely the D-DPG, A-DPG and Asyn-DPG algorithms. Each algorithm has its respective advantages depending on the implementation environment. With the D-DPG algorithm, each agent updates its state with local parameters and the dual information of its neighbours, which can reduce the computational complexity for objective functions with simple structures and realizes a fully distributed optimization fashion. With the A-DPG algorithm, all the agents jointly solve an unconstrained dual optimization problem where asymmetric individual constraints are considered. Finally, the Asyn-DPG algorithm was proposed based on the architecture of the A-DPG algorithm for solving an asynchronous optimization problem with heterogeneous step-sizes and communication delays. In future works, many topics are worth investigating based on the dual proximal gradient method, such as distributed optimization problems with non-convex coupling constraints, generalized Nash games [49], and their applications in asynchronous networks.

APPENDIX
A. Proof of Lemma 3
By the first-order optimality condition of the maximization on the left-hand side of (45), any optimal solution $v^*$ satisfies $u\in\partial\zeta(v^*)$. Then, by [43, Cor. 23.5.1], we have $v^*\in\partial\zeta^\diamond(u)$, which proves (45).

B. Proof of Lemma 4
By the $\frac{1}{\sigma_i}$-Lipschitz continuity of $\nabla f_i^\diamond$ (see Lemma 3), we have
$$\|\nabla_u f_i^\diamond(W_iu)-\nabla_v f_i^\diamond(W_iv)\| = \|W_i^T(\nabla_{W_iu}f_i^\diamond(W_iu)-\nabla_{W_iv}f_i^\diamond(W_iv))\| \leq \|W_i^T\|\,\|\nabla_{W_iu}f_i^\diamond(W_iu)-\nabla_{W_iv}f_i^\diamond(W_iv)\|$$
$$\leq \frac{\|W_i^T\|}{\sigma_i}\|W_iu-W_iv\| \leq \frac{\|W_i\|^2}{\sigma_i}\|u-v\| = h_i\|u-v\|, \quad (65)$$
$\forall u,v\in\mathbb R^{M+N}$, which means that $\nabla_{\lambda_i}f_i^\diamond(W_i\lambda_i)$ is $h_i$-Lipschitz continuous and $\nabla_{\lambda_i}p_i(\lambda_i)=\nabla_{\lambda_i}f_i^\diamond(W_i\lambda_i)+\kappa_iE^T$ is also $h_i$-Lipschitz continuous, $\forall\lambda_i\in\mathbb R^{M+N}$, $\forall i\in\mathcal V$. On the other hand, due to the separability of $P(\lambda)$, $\nabla_\lambda P(\lambda)$ can be decoupled with respect to each $\lambda_i$, $i\in\mathcal V$, i.e.,
$$\nabla_\lambda P(\lambda)=[\nabla^T_{\lambda_1}p_1(\lambda_1),...,\nabla^T_{\lambda_{|\mathcal V|}}p_{|\mathcal V|}(\lambda_{|\mathcal V|})]^T. \quad (66)$$
By using the Euclidean $l_2$-norm, the Lipschitz constant of $\nabla_\lambda P(\lambda)$ can be obtained as (47).

C. Proof of Lemma 5
By (19), we have
$$\mathbf 0_{|\mathcal V|N}=D_{N\mathcal V}\big[\tfrac{1}{\gamma_l}\big](\xi(k)-\xi(k+1))+M\lambda(k+1). \quad (67)$$
Therefore, multiplying both sides of (67) by $(\xi-\xi(k+1))^T$ yields (48), $\forall\xi\in\mathbb R^{|\mathcal V|N}$.

D. Proof of Lemma 6
By the first-order optimality condition of (18) in terms of (3), we have (abbreviating $D_{(M+N)\mathcal V}[\cdot]$ and $D_{N\mathcal V}[\cdot]$ as $D[\cdot]$ when clear from the context)
$$\mathbf 0_{|\mathcal V|N+|\mathcal V|M}\in\partial_\lambda Q(\lambda(k+1))-D[\tfrac{1}{c_l}](\lambda(k)-\lambda(k+1))+\nabla_\lambda P(\lambda(k))+M^T\xi(k)$$
$$=\partial_\lambda Q(\lambda(k+1))-D[\tfrac{1}{c_l}](\lambda(k)-\lambda(k+1))+\nabla_\lambda P(\lambda(k))+M^T\xi(k+1)-M^TD[\gamma_l]M\lambda(k+1). \quad (68)$$
From the convexity of $Q(\lambda)$, we have
$$Q(\lambda)-Q(\lambda(k+1))\geq(\lambda-\lambda(k+1))^TD[\tfrac{1}{c_l}](\lambda(k)-\lambda(k+1))-(\lambda-\lambda(k+1))^T\nabla_\lambda P(\lambda(k))-(\lambda-\lambda(k+1))^TM^T\xi(k+1)+(\lambda-\lambda(k+1))^TM^TD[\gamma_l]M\lambda(k+1). \quad (69)$$
From the convexity and $h_i$-Lipschitz continuous differentiability of $p_i$, we have
$$(\lambda-\lambda(k+1))^T\nabla_\lambda P(\lambda(k))=\sum_{i\in\mathcal V}(\lambda_i-\lambda_i(k))^T\nabla_{\lambda_i}p_i(\lambda_i(k))+\sum_{i\in\mathcal V}(\lambda_i(k)-\lambda_i(k+1))^T\nabla_{\lambda_i}p_i(\lambda_i(k))$$
$$\leq\sum_{i\in\mathcal V}\big(p_i(\lambda_i)-p_i(\lambda_i(k))\big)+\sum_{i\in\mathcal V}\big(p_i(\lambda_i(k))-p_i(\lambda_i(k+1))\big)+\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[h_l]}=P(\lambda)-P(\lambda(k+1))+\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[h_l]}. \quad (70)$$
By adding up (69) and (70) on both sides, we have
$$\varphi(\lambda(k+1))-\varphi(\lambda)\leq-(\lambda-\lambda(k+1))^TD[\tfrac{1}{c_l}](\lambda(k)-\lambda(k+1))+(\lambda-\lambda(k+1))^TM^T\xi(k+1)+\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[h_l]}-(\lambda-\lambda(k+1))^TM^TD[\gamma_l]M\lambda(k+1)$$
$$=-(\lambda-\lambda(k+1))^TD[\tfrac{1}{c_l}](\lambda(k)-\lambda(k+1))-(\xi-\xi(k+1))^TD[\tfrac{1}{\gamma_l}](\xi(k)-\xi(k+1))-(\xi-\xi(k+1))^TM\lambda(k+1)+(\xi(k+1))^TM\lambda-(\xi(k+1))^TM\lambda(k+1)+\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[h_l]}-(\lambda-\lambda(k+1))^TM^TD[\gamma_l]M\lambda(k+1)$$
$$=\tfrac12\|\lambda-\lambda(k)\|^2_{D[\frac{1}{c_l}]}-\tfrac12\|\lambda-\lambda(k+1)\|^2_{D[\frac{1}{c_l}]}-\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[\frac{1}{c_l}]}+\tfrac12\|\xi-\xi(k)\|^2_{D[\frac{1}{\gamma_l}]}-\tfrac12\|\xi-\xi(k+1)\|^2_{D[\frac{1}{\gamma_l}]}-\tfrac12\|\xi(k)-\xi(k+1)\|^2_{D[\frac{1}{\gamma_l}]}+(\xi(k+1))^TM\lambda-\xi^TM\lambda(k+1)+\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[h_l]}-(\lambda-\lambda(k+1))^TM^TD[\gamma_l]M\lambda(k+1)$$
$$=\tfrac12\|\lambda-\lambda(k)\|^2_{D[\frac{1}{c_l}]}-\tfrac12\|\lambda-\lambda(k+1)\|^2_{D[\frac{1}{c_l}]}-\tfrac12\|\lambda(k)-\lambda(k+1)\|^2_{D[\frac{1}{c_l}-h_l]}+\tfrac12\|\xi-\xi(k)\|^2_{D[\frac{1}{\gamma_l}]}-\tfrac12\|\xi-\xi(k+1)\|^2_{D[\frac{1}{\gamma_l}]}-\tfrac12\|\xi(k)-\xi(k+1)\|^2_{D[\frac{1}{\gamma_l}]}+(\xi(k+1))^TM\lambda-\xi^TM\lambda(k+1)+\|M\lambda(k+1)\|^2_{D[\gamma_l]}-(M\lambda)^TD[\gamma_l]M\lambda(k+1), \quad (71)$$
where we use relation (48) in the first equality, and the second equality holds with the relation
$$u^Tv=\tfrac12\big(\|u\|^2+\|v\|^2-\|u-v\|^2\big), \quad (72)$$
$\forall u,v\in\mathbb R^{M|\mathcal V|+N|\mathcal V|}$.
Let $\xi=\xi^*$ and $\lambda=\lambda^*$ and rearrange (71); then we have
$$\varphi(\lambda(k+1))-\varphi(\lambda^*)+(\xi^*)^TM\lambda(k+1)\leq C_k, \quad (73)$$
with $C_k$ defined in (49), where KKT condition (15) (i.e., $M\lambda^*=\mathbf 0_{|\mathcal V|N}$) is used. By combining (11), (12) and (15), we have
$$\varphi(\lambda)-\varphi(\lambda^*)+(\xi^*)^TM\lambda\geq 0, \quad (74)$$
$\forall\lambda\in\mathbb R^{|\mathcal V|M+|\mathcal V|N}$. Based on (73) and (74), the proof is completed.

E. Proof of Theorem 1
Recall that (71) holds for any $\xi\in\mathbb R^{|\mathcal V|N}$.
1): If $M\bar\lambda(K+1)\neq\mathbf 0_{|\mathcal V|N}$, by letting $\lambda=\lambda^*$ and $\xi=\frac{2\|\xi^*\|}{\|M\bar\lambda(K+1)\|}M\bar\lambda(K+1)$ in (71) and summing up the result over $k=0,1,...,K$, we have (denoting for brevity $B_0:=\|\lambda^*-\lambda(0)\|^2_{D[\frac{1}{c_l}]}+\|\xi^*\|^2_{D[\frac{1}{\gamma_l}]}+\|\xi(0)\|^2_{D[\frac{1}{\gamma_l}]}$)
$$(K+1)\big(\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)+2\|\xi^*\|\|M\bar\lambda(K+1)\|\big)\leq\sum_{k=0}^K\big(\varphi(\lambda(k+1))-\varphi(\lambda^*)+2\|\xi^*\|\|M\bar\lambda(K+1)\|\big)$$
$$\leq\Big\|\frac{2\|\xi^*\|}{\|M\bar\lambda(K+1)\|}M\bar\lambda(K+1)-\xi(0)\Big\|^2_{D[\frac{1}{\gamma_l}]}+\|\lambda^*-\lambda(0)\|^2_{D[\frac{1}{c_l}]}+\sum_{k=0}^K\|M\lambda(k+1)\|^2_{D[\gamma_l]}$$
$$\leq B_0+(K+1)\gamma_{\max}|\mathcal V|\|\tilde L\|^2\Gamma^2, \quad (75)$$
where the first inequality is from the convexity of $\varphi$, the second inequality holds due to $c_i\leq\frac{1}{h_i}$, and the third inequality is from the relation $\|u+v\|^2\leq 2(\|u\|^2+\|v\|^2)$ and
$$\|M\lambda\|^2_{D[\gamma_l]}=\|\tilde L\hat\theta\|^2_{D[\gamma_l]}\leq\gamma_{\max}\|\tilde L\|^2\|\hat\theta\|^2\leq\gamma_{\max}|\mathcal V|\|\tilde L\|^2\Gamma^2,$$
$\forall u,v\in\mathbb R^{|\mathcal V|N}$, $\forall\lambda\in\mathbb R^{|\mathcal V|N+|\mathcal V|M}$. Therefore,
$$\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)\leq\frac{B_0}{K+1}+\gamma_{\max}|\mathcal V|\|\tilde L\|^2\Gamma^2-2\|\xi^*\|\|M\bar\lambda(K+1)\|\leq\frac{B_0}{K+1}+O(\gamma_{\max}). \quad (76)$$
By letting $\lambda=\bar\lambda(K+1)$ in (74), we have
$$\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)\geq-\|\xi^*\|\|M\bar\lambda(K+1)\|. \quad (77)$$
By combining the first inequality in (76) and (77), we have
$$\|\xi^*\|\|M\bar\lambda(K+1)\|\leq\frac{B_0}{K+1}+O(\gamma_{\max}). \quad (78)$$
By (77) and (78), we have
$$\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)\geq-\|\xi^*\|\|M\bar\lambda(K+1)\|\geq-\frac{B_0}{K+1}-O(\gamma_{\max}). \quad (79)$$
By combining (76), (78) and (79), (50) and (51) are proved.
2): If $M\bar\lambda(K+1)=\mathbf 0_{|\mathcal V|N}$, one can let $\lambda=\lambda^*$ and $\xi=2\xi^*$ in (71), which directly gives
$$\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)\leq\frac{1}{K+1}\Big(\|\lambda^*-\lambda(0)\|^2_{D[\frac{1}{c_l}]}+\|\xi^*\|^2_{D[\frac{1}{\gamma_l}]}+\|\xi(0)\|^2_{D[\frac{1}{\gamma_l}]}\Big)+O(\gamma_{\max}) \quad (80)$$
by the same derivation process as in (75) and (76). Then, considering that $\|M\bar\lambda(K+1)\|=0$ and $\varphi(\bar\lambda(K+1))-\varphi(\lambda^*)\geq-\|\xi^*\|\|M\bar\lambda(K+1)\|=0$, (50) and (51) hold as well.

F. Proof of Lemma 7
Similar to the proof of Lemma 4, the Lipschitz constant of $\nabla_{\tilde\lambda}\tilde p_i(\tilde\lambda)$ can be obtained as $\tilde h_i=\frac{\|\widetilde W_i\|^2}{\sigma_i}$ by substituting $W_i$ with $\widetilde W_i$ in (65); the detailed derivation is omitted for simplicity. Then, since every $\tilde p_i$ depends on the common variable $\tilde\lambda$, the Lipschitz constant of $\nabla_{\tilde\lambda}\tilde P(\tilde\lambda)$ is the sum of the $\tilde h_i$, which gives (52), $i\in\mathcal V$.

G. Proof of Lemma 8
For (54),
\[
\begin{aligned}
&\sum_{k=0}^{K}\big(\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\| + \cdots + \|\tilde{\lambda}(\tau(k)+1)-\tilde{\lambda}(\tau(k))\|\big) \\
&= \big(\|\tilde{\lambda}(K+1)-\tilde{\lambda}(K)\| + \|\tilde{\lambda}(K)-\tilde{\lambda}(K-1)\| + \cdots + \|\tilde{\lambda}(\tau(K)+1)-\tilde{\lambda}(\tau(K))\|\big) \\
&\quad + \big(\|\tilde{\lambda}(K)-\tilde{\lambda}(K-1)\| + \|\tilde{\lambda}(K-1)-\tilde{\lambda}(K-2)\| + \cdots + \|\tilde{\lambda}(\tau(K-1)+1)-\tilde{\lambda}(\tau(K-1))\|\big) \\
&\quad + \cdots + \big(\|\tilde{\lambda}(1)-\tilde{\lambda}(0)\| + \cdots + \|\tilde{\lambda}(\tau(0)+1)-\tilde{\lambda}(\tau(0))\|\big) \\
&\le \|\tilde{\lambda}(K+1)-\tilde{\lambda}(K)\| + 2\|\tilde{\lambda}(K)-\tilde{\lambda}(K-1)\| + \cdots + (D+1)\|\tilde{\lambda}(\tau(K)+1)-\tilde{\lambda}(\tau(K))\| + \cdots + (D+1)\|\tilde{\lambda}(1)-\tilde{\lambda}(0)\| \\
&\le \sum_{k=0}^{K}(D+1)\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\|, \quad K \in \mathbb{N}_+.
\end{aligned}\tag{81}
\]
For (55),
\[
\begin{aligned}
&\sum_{k=0}^{K} k\big(\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\| + \cdots + \|\tilde{\lambda}(\tau(k)+1)-\tilde{\lambda}(\tau(k))\|\big) \\
&= K\big(\|\tilde{\lambda}(K+1)-\tilde{\lambda}(K)\| + \cdots + \|\tilde{\lambda}(\tau(K)+1)-\tilde{\lambda}(\tau(K))\|\big) \\
&\quad + (K-1)\big(\|\tilde{\lambda}(K)-\tilde{\lambda}(K-1)\| + \cdots + \|\tilde{\lambda}(\tau(K-1)+1)-\tilde{\lambda}(\tau(K-1))\|\big) \\
&\quad + \cdots + 1\cdot\big(\|\tilde{\lambda}(2)-\tilde{\lambda}(1)\| + \cdots + \|\tilde{\lambda}(\tau(1)+1)-\tilde{\lambda}(\tau(1))\|\big) + 0\cdot\big(\|\tilde{\lambda}(1)-\tilde{\lambda}(0)\| + \cdots + \|\tilde{\lambda}(\tau(0)+1)-\tilde{\lambda}(\tau(0))\|\big) \\
&\le K\|\tilde{\lambda}(K+1)-\tilde{\lambda}(K)\| + \big((K-1)+K\big)\|\tilde{\lambda}(K)-\tilde{\lambda}(K-1)\| + \cdots \\
&\quad + \big(k+(k+1)+\cdots+(k+D)\big)\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\| + \cdots + \big(0+1+\cdots+D\big)\|\tilde{\lambda}(1)-\tilde{\lambda}(0)\| \\
&\le \sum_{k=0}^{K}\big(k+(k+1)+\cdots+(k+D)\big)\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\| = \sum_{k=0}^{K}\frac{(2k+D)(D+1)}{2}\|\tilde{\lambda}(k+1)-\tilde{\lambda}(k)\|, \quad K \in \mathbb{N}_+.
\end{aligned}\tag{82}
\]

H. Proof of Theorem 3
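The asynchronous iteration analyzed in this proof can be sketched on a toy instance. The quadratic smooth part, the $\ell_1$ choice for the non-smooth part, and the one-agent-per-tick activation below are all hypothetical simplifications, with the step size chosen conservatively on the order of $1/(\tilde{h}(D+1)^2)$ in line with the delay-dependent analysis:

```python
import numpy as np

# Toy sketch (hypothetical data) of an asynchronous proximal gradient
# iteration with bounded delay D: at each tick, one agent updates its
# dual block using a possibly outdated gradient of the smooth part.
rng = np.random.default_rng(2)
n_agents, dim, D, K = 4, 2, 3, 600
A = rng.standard_normal((n_agents * dim, n_agents * dim))
H = A.T @ A + np.eye(n_agents * dim)   # smooth part P(l) = 0.5 l^T H l - b^T l
b = 3.0 * rng.standard_normal(n_agents * dim)
h = np.linalg.norm(H, 2)               # Lipschitz constant of grad P
c = 1.0 / (h * (D + 1) ** 2)           # conservative delay-aware step size

def prox_l1(v, t):                     # prox of t*||.||_1 (non-smooth part)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def phi(l):                            # full objective: smooth + l1 parts
    return 0.5 * l @ H @ l - b @ l + np.abs(l).sum()

lam = np.zeros(n_agents * dim)
hist = [lam.copy()]
for k in range(K):
    i = int(rng.integers(n_agents))                # active agent this tick
    s = slice(i * dim, (i + 1) * dim)
    tau = max(0, k - int(rng.integers(0, D + 1)))  # outdated index, delay <= D
    g = (H @ hist[tau] - b)[s]                     # delayed partial gradient
    lam[s] = prox_l1(lam[s] - c * g, c)
    hist.append(lam.copy())

print(f"phi at start: {phi(hist[0]):.3f}, phi after {K} ticks: {phi(lam):.3f}")
```

This is only a schematic of the update pattern; the paper's Asyn-DPG algorithm operates on the dual variables $\tilde{\lambda}_i$ of the original coupled problem rather than on this synthetic quadratic.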
For agent $i \in \mathcal{V}$, by the first-order optimality condition of the proximal mapping (39), we have
\[
\mathbf{0}_{M+N} \in \nabla_{\tilde{\lambda}_i}\tilde{P}(\tilde{\lambda}(\tau(k))) + \partial\tilde{q}_i(\tilde{\lambda}_i(k+1)) + \frac{1}{c_i}\big(\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\big). \tag{83}
\]
By the convexity of $\tilde{q}_i$, $\forall \tilde{\lambda}_i \in \mathbb{R}^{M+N}$, we have
\[
\tilde{q}_i(\tilde{\lambda}_i(k+1)) - \tilde{q}_i(\tilde{\lambda}_i) \le \big\langle\nabla_{\tilde{\lambda}_i}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda}_i - \tilde{\lambda}_i(k+1)\big\rangle + \frac{1}{c_i}\big\langle\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k),\, \tilde{\lambda}_i - \tilde{\lambda}_i(k+1)\big\rangle. \tag{84}
\]
Summing up (84) over $i \in \mathcal{V}$ gives
\[
\tilde{Q}(\tilde{\lambda}(k+1)) - \tilde{Q}(\tilde{\lambda}) \le \big\langle\nabla_{\tilde{\lambda}}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda} - \tilde{\lambda}(k+1)\big\rangle + \sum_{i\in\mathcal{V}}\frac{1}{c_i}\big\langle\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k),\, \tilde{\lambda}_i - \tilde{\lambda}_i(k+1)\big\rangle, \tag{85}
\]
where the separability of $\tilde{Q}$ is used. By the Lipschitz continuity of $\nabla_{\tilde{\lambda}}\tilde{P}$ and the convexity of $\tilde{P}$, $\forall \tilde{\lambda} \in \mathbb{R}^{|\mathcal{V}|M+|\mathcal{V}|N}$, we have
\[
\begin{aligned}
\tilde{P}(\tilde{\lambda}(k+1)) &\le \tilde{P}(\tilde{\lambda}(\tau(k))) + \big\langle\nabla_{\tilde{\lambda}}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\big\rangle + \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 \\
&\le \tilde{P}(\tilde{\lambda}) - \big\langle\nabla_{\tilde{\lambda}}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda} - \tilde{\lambda}(\tau(k))\big\rangle + \big\langle\nabla_{\tilde{\lambda}}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\big\rangle + \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 \\
&\le \tilde{P}(\tilde{\lambda}) + \big\langle\nabla_{\tilde{\lambda}}\tilde{P}(\tilde{\lambda}(\tau(k))),\, \tilde{\lambda}(k+1) - \tilde{\lambda}\big\rangle + \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2.
\end{aligned}\tag{86}
\]
Summing up both sides of (85) and (86) gives
\[
\begin{aligned}
\tilde{\varphi}(\tilde{\lambda}(k+1)) - \tilde{\varphi}(\tilde{\lambda}) &\le \sum_{i\in\mathcal{V}}\frac{1}{c_i}\big\langle\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k),\, \tilde{\lambda}_i - \tilde{\lambda}_i(k+1)\big\rangle + \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 \\
&= \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 - \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\big(\|\tilde{\lambda}_i(k) - \tilde{\lambda}_i\|^2 - \|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i\|^2\big),
\end{aligned}\tag{87}
\]
where relation (72) is used. By letting $\tilde{\lambda} = \tilde{\lambda}^*$ in (87) and summing up the result over $k = 0, \ldots, K$, we have
\[
\begin{aligned}
\sum_{k=0}^{K}\big(\tilde{\varphi}(\tilde{\lambda}(k+1)) - \tilde{\varphi}(\tilde{\lambda}^*)\big) \le{}& \sum_{k=0}^{K}\Big(\frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 - \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\big(\|\tilde{\lambda}_i(k) - \tilde{\lambda}^*_i\|^2 - \|\tilde{\lambda}_i(k+1) - \tilde{\lambda}^*_i\|^2\big)\Big) \\
\le{}& \sum_{k=0}^{K}\Big(\frac{\tilde{h}(D+1)}{2}\big(\|\tilde{\lambda}(k+1) - \tilde{\lambda}(k)\|^2 + \cdots + \|\tilde{\lambda}(\tau(k)+1) - \tilde{\lambda}(\tau(k))\|^2\big) - \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 \\
&\quad + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\big(\|\tilde{\lambda}_i(k) - \tilde{\lambda}^*_i\|^2 - \|\tilde{\lambda}_i(k+1) - \tilde{\lambda}^*_i\|^2\big)\Big) \\
\le{}& \sum_{k=0}^{K}\Big(\frac{\tilde{h}(D+1)^2}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(k)\|^2 - \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\big(\|\tilde{\lambda}_i(k) - \tilde{\lambda}^*_i\|^2 - \|\tilde{\lambda}_i(k+1) - \tilde{\lambda}^*_i\|^2\big)\Big) \\
\le{}& \sum_{k=0}^{K}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(D+1)^2}{2} - \frac{1}{2c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(0) - \tilde{\lambda}^*_i\|^2,
\end{aligned}\tag{88}
\]
where the Cauchy-Schwarz inequality and (54) are used in the second and third inequalities, respectively. Letting $\tilde{\lambda} = \tilde{\lambda}(k)$ in (87) gives
\[
\tilde{\varphi}(\tilde{\lambda}(k+1)) - \tilde{\varphi}(\tilde{\lambda}(k)) \le \frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 - \sum_{i\in\mathcal{V}}\frac{1}{c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2. \tag{89}
\]
Multiplying (89) by $k$ and summing up the result over $k = 0, \ldots, K$ gives
\[
\begin{aligned}
\sum_{k=0}^{K}k\big(\tilde{\varphi}(\tilde{\lambda}(k+1)) - \tilde{\varphi}(\tilde{\lambda}(k))\big) &= \sum_{k=0}^{K}\big((k+1)\tilde{\varphi}(\tilde{\lambda}(k+1)) - k\tilde{\varphi}(\tilde{\lambda}(k)) - \tilde{\varphi}(\tilde{\lambda}(k+1))\big) = (K+1)\tilde{\varphi}(\tilde{\lambda}(K+1)) - \sum_{k=0}^{K}\tilde{\varphi}(\tilde{\lambda}(k+1)) \\
&\le \sum_{k=0}^{K}k\Big(\frac{\tilde{h}}{2}\|\tilde{\lambda}(k+1) - \tilde{\lambda}(\tau(k))\|^2 - \sum_{i\in\mathcal{V}}\frac{1}{c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2\Big) \\
&\le \sum_{k=0}^{K}k\Big(\frac{\tilde{h}(D+1)}{2}\big(\|\tilde{\lambda}(k+1) - \tilde{\lambda}(k)\|^2 + \cdots + \|\tilde{\lambda}(\tau(k)+1) - \tilde{\lambda}(\tau(k))\|^2\big) - \sum_{i\in\mathcal{V}}\frac{1}{c_i}\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2\Big) \\
&\le \sum_{k=0}^{K}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(2k+D)(D+1)^2}{4} - \frac{k}{c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2,
\end{aligned}\tag{90}
\]
where (55) is used in the last inequality. By summing up both sides of (88) and (90), we have
\[
\begin{aligned}
&(K+1)\big(\tilde{\varphi}(\tilde{\lambda}(K+1)) - \tilde{\varphi}(\tilde{\lambda}^*)\big) \\
&\le \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(0) - \tilde{\lambda}^*_i\|^2 + \sum_{k=0}^{K}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(D+1)^2}{2} - \frac{1}{2c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{k=0}^{K}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(2k+D)(D+1)^2}{4} - \frac{k}{c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 \\
&= \sum_{k=0}^{K}\sum_{i\in\mathcal{V}}\Big(\underbrace{\frac{\tilde{h}(D+1)^2}{2} - \frac{1}{2c_i}}_{:=\varpi_1}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{k=0}^{\lfloor D\rfloor}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(2k+D)(D+1)^2}{4} - \frac{k}{c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 \\
&\quad + \sum_{k=\lceil D\rceil}^{K}\sum_{i\in\mathcal{V}}\Big(k\Big(\underbrace{\frac{\tilde{h}(D+1)^2}{2} - \frac{1}{2c_i}}_{:=\varpi_2}\Big) + \Big(\underbrace{\frac{\tilde{h}D(D+1)^2}{4} - \frac{k}{2c_i}}_{:=\varpi_3}\Big)\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(0) - \tilde{\lambda}^*_i\|^2 \\
&\le \sum_{k=0}^{\lfloor D\rfloor}\sum_{i\in\mathcal{V}}\Big(\frac{\tilde{h}(2k+D)(D+1)^2}{4} - \frac{k}{c_i}\Big)\|\tilde{\lambda}_i(k+1) - \tilde{\lambda}_i(k)\|^2 + \sum_{i\in\mathcal{V}}\frac{1}{2c_i}\|\tilde{\lambda}_i(0) - \tilde{\lambda}^*_i\|^2 =: \Lambda(c_1, \ldots, c_{|\mathcal{V}|}, D),
\end{aligned}\tag{91}
\]
where $\varpi_1$, $\varpi_2$,
$\varpi_3 \le 0$ with $\frac{1}{c_i} \ge \tilde{h}(D+1)^2$, $i \in \mathcal{V}$. This proves (57).