Bilevel Distributed Optimization in Directed Networks
Achieving Convergence Rates for Distributed Constrained and Unconstrained Optimization in Directed Networks
Farzad Yousefian
School of Industrial Engineering & Management
Oklahoma State University
Stillwater, OK 74078, USA
[email protected]
Abstract
We consider the problem of distributed optimization over directed networks where the agents communicate their information locally to their neighbors to cooperatively minimize a global cost. The existing theory and algorithms for this class of multi-agent problems are founded on two strong assumptions: (1) The problem is often assumed to be unconstrained; e.g., this is the case in the recently developed Push-DIGing and Push-Pull algorithms. (2) The objective function is assumed to be strongly convex, an assumption that appears in many of the existing models and algorithms in the area of multi-agent optimization. In this work, we aim at addressing these shortcomings. We first introduce a new unifying distributed constrained optimization model that is characterized as a bilevel optimization problem. The proposed model captures a wide range of existing problems, including the following two classes: (i) distributed linearly constrained optimization over directed networks; (ii) distributed unconstrained non-strongly convex optimization over directed networks. To address the proposed optimization model, utilizing a regularization-based relaxation approach, we develop a new push-pull gradient method where at each iteration, the information of iteratively regularized gradients is pushed to the neighbors, while the information about the decision variable is pulled from the neighbors. The analysis of the proposed algorithm leads to two main contributions: (1) For model (i), we derive new convergence rates for suboptimality, infeasibility, and consensus violation. (2) For model (ii), we derive a new convergence rate statement despite the absence of the strong convexity property. The numerical performance of the proposed algorithm is presented.
Preprint. Under review. arXiv [math.OC].

We consider a new class of distributed optimization problems in directed networks given as follows:
$$\min_{x \in \mathbb{R}^n} \ \sum_{i=1}^{m} f_i(x) \quad \text{subject to:} \quad x \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big\{ \sum_{i=1}^{m} g_i(x) \Big\}, \qquad (1)$$
under the following assumptions (see the Notation for details):

Assumption 1. (a) Functions $f_i:\mathbb{R}^n \to \mathbb{R}$ are $\mu_f$-strongly convex and $L_f$-smooth for all $i \in [m]$. (b) Functions $g_i:\mathbb{R}^n \to \mathbb{R}$ are convex and $L_g$-smooth for all $i \in [m]$. (c) The set $\operatorname*{argmin}_{x \in \mathbb{R}^n}\{\sum_{i=1}^m g_i(x)\}$ is nonempty.

Here, $m$ agents cooperatively seek to find, among the optimal solutions to the problem $\min_{x \in \mathbb{R}^n} \sum_{i=1}^m g_i(x)$, one that minimizes a secondary metric, i.e., $\sum_{i=1}^m f_i(x)$. The functions $f_i$ and $g_i$ are known locally only by agent $i$, and the cooperation among the agents occurs over a directed network. Given a set of nodes $\mathcal{N}$, a directed graph (digraph) is denoted by $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ is the set of ordered pairs of vertices. For any edge $(i,j) \in \mathcal{E}$, $i$ and $j$ are called the parent node and the child node, respectively. Graph $\mathcal{G}$ is called strongly connected if there is a path between any pair of two different vertices. The digraph induced by a given nonnegative matrix $B \in \mathbb{R}^{m \times m}$ is denoted by $\mathcal{G}_B := (\mathcal{N}_B, \mathcal{E}_B)$, where $\mathcal{N}_B := [m]$ and $(j,i) \in \mathcal{E}_B$ if and only if $B_{ij} > 0$. We let $\mathcal{N}^{\text{in}}_B(i)$ and $\mathcal{N}^{\text{out}}_B(i)$ denote the set of parents (in-neighbors) and the set of children (out-neighbors) of vertex $i$, respectively. Also, $\mathcal{R}_{\mathcal{G}}$ denotes the set of roots of all possible spanning trees in $\mathcal{G}$.

Significance of the problem formulation:
Problem (1) provides a unifying mathematical framework capturing different variants of existing problems in the distributed optimization literature. From these, we present two important cases below:

(i) Distributed linearly constrained optimization in directed networks: Consider the model given as:
$$\min_{x \in \mathbb{R}^n} \ \sum_{i=1}^m f_i(x) \quad \text{subject to:} \quad A_i x = b_i \ \text{ for all } i \in [m], \qquad (2)$$
where $A_i \in \mathbb{R}^{m_i \times n}$ and $b_i \in \mathbb{R}^{m_i}$ are known parameters. Let problem (2) be feasible. Then, by choosing $g_i(x) := \|A_i x - b_i\|^2$ for $i \in [m]$, problem (2) is equivalent to (1).

(ii) Distributed unconstrained optimization in the absence of strong convexity: Let us define $f_i(x) := \|x\|^2/m$. Then, problem (1) is equivalent to finding the least $\ell_2$-norm solution of the following canonical distributed unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^n} \ \sum_{i=1}^m g_i(x), \qquad (3)$$
where the component functions $g_i$ are all merely convex and smooth.

Existing distributed optimization models and algorithms:
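To make the two special cases concrete, the following sketch runs plain gradient descent on a regularized combination $g + \lambda f$ for a small fixed $\lambda > 0$. The toy instance (a single linear constraint $x_1 + x_2 = 2$ playing the role of $g$, and $f(x) = \|x\|^2$ as the secondary metric) is our own illustrative choice, not taken from the paper; it previews the regularization idea developed later in the paper.

```python
# Toy instance of models (i)/(ii): g(x) = (x1 + x2 - 2)^2 is merely convex and
# its minimizers form a line; f(x) = ||x||^2 is the secondary metric. Minimizing
# g + lam*f for a small lam > 0 approximates the least 2-norm feasible point,
# which here is (1, 1). The instance and lam = 0.01 are our illustrative choices.

def grad(x, lam):
    r = x[0] + x[1] - 2.0                 # residual of the constraint x1 + x2 = 2
    return [2.0 * r + 2.0 * lam * x[0],   # d/dx1 of (r^2 + lam * ||x||^2)
            2.0 * r + 2.0 * lam * x[1]]   # d/dx2 of (r^2 + lam * ||x||^2)

lam, step = 0.01, 0.1
x = [0.0, 0.0]
for _ in range(2000):                     # plain gradient descent
    gx = grad(x, lam)
    x = [x[0] - step * gx[0], x[1] - step * gx[1]]

# The regularized minimizer satisfies x1 = x2 = 2/(2 + lam), which tends to the
# least 2-norm solution (1, 1) as lam -> 0.
print(x)
```

Driving $\lambda$ to zero along the iterations, rather than fixing it, is precisely the iterative regularization idea the paper develops.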
The classical mathematical models, tools, and algorithms for consensus-based optimization were introduced and studied as early as the '70s [14] and '80s [39, 40, 6]. Of these, in the seminal work of Tsitsiklis [39], it was assumed that the agents share a global (smooth) objective while the components of their decision vector are distributed locally over the network. In the past two decades, in light of the unprecedented growth in data and its imperative role in several broad fields such as social networks, biology, and medicine, the theory of distributed and parallel optimization over networks has evolved considerably. Distributed optimization problems with local objective functions were first studied in [22, 28]. In this framework, the agents communicate their local information with their neighbors in the network at discrete times to cooperatively minimize the global cost function. Without characterizing the communication rules explicitly, this model can be formulated as minimizing $\sum_{i=1}^m f_i(x)$ subject to $x \in X$. Here, the local function $f_i$ is known only to agent $i$, and $X$ denotes the system constraint set. This modeling framework captures a wide spectrum of decentralized applications in the areas of statistical learning, signal processing, sensor networks, control, and robotics [15, 32, 43, 18, 33, 12, 10]. Because of this, in the past decade, there has been a flurry of research focused on the design and analysis of fast and scalable computational methods to address applications in networks. Among these, average-based consensus methods are one of the most studied approaches. Here, the network is characterized with a stochastic matrix that is possibly time-varying. The underlying idea is that at a given time, each agent uses this matrix and obtains a weighted average of its neighbors' local variables. Then, the update is completed by performing a standard subgradient step for the agent.
In the case where the model $\min_x \sum_{i=1}^m f_i(x)$ is unconstrained, the seminal work [28] proposed the following scheme:
$$x_i(k+1) := \Big( \sum_{j=1}^m W_{ij}(k)\, x_j(k) \Big) - \alpha_k \tilde{\nabla} f_i(x_i(k)), \qquad (4)$$
where $x_i(k) \in \mathbb{R}^n$, $W_{ij}(k) \in [0,1]$, and $\tilde{\nabla} f_i$ denote the local decision variable of agent $i$ at time $k$, the weight agent $i$ assigns to the estimate of agent $j$, and a subgradient of the local function $f_i$, respectively. Random projection variants of this scheme were developed for both synchronous and asynchronous cases, assuming $X := \cap_{i=1}^m X_i$ [19, 36]. For the constrained model where $X$ is easy-to-project, a dual averaging variant of (4) was developed in [11]. The algorithm EXTRA [34] and its proximal variant were developed addressing $X = \mathbb{R}^n$. EXTRA is a synchronous and time-invariant scheme and achieves a sublinear and a linear rate of convergence for smooth merely convex and strongly convex problems, respectively. Among many other works such as [21, 16] is the DIGing algorithm [27], which was the first work achieving a linear convergence rate for unconstrained optimization over a time-varying network. When the graph is directed, a key shortcoming of the weighted-averaging schemes lies in that the double stochasticity requirement on the weight matrix is impractical. Push-sum protocols were first leveraged in [25, 26, 29] to weaken this requirement. Recently, the Push-Pull algorithm, equipped with a linear convergence rate, was developed in [30] for unconstrained strongly convex problems. Extensions of push-sum algorithms to nonconvex regimes have been developed more recently [37, 23, 38]. Other popular distributed optimization schemes are the dual-based methods, such as the ADMM-type methods studied in [7, 42, 41, 20, 35, 3]. Most of these algorithms can address only static and undirected graphs. Moreover, there are only a few works in the literature that can cope with constraints by employing primal-dual methods [9, 24, 8, 2].
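The averaging-plus-(sub)gradient scheme (4) can be sketched as follows. The three-agent network, the uniform doubly stochastic weights $W_{ij} = 1/3$, and the quadratic local costs are our own toy choices for illustration.

```python
# Minimal sketch of scheme (4): each agent mixes its neighbors' estimates with a
# stochastic weight matrix W, then takes a (sub)gradient step on its local cost.
# Local costs f_i(x) = (x - c_i)^2, so the global minimizer is the mean of the c_i.

c = [0.0, 1.0, 2.0]                   # local data; optimum of sum_i f_i is 1.0
x = [0.0, 0.0, 0.0]                   # local (scalar) decision variables
W = [[1 / 3] * 3 for _ in range(3)]   # doubly stochastic mixing matrix

for k in range(5000):
    alpha = 1.0 / (k + 2)             # diminishing step-size
    mixed = [sum(W[i][j] * x[j] for j in range(3)) for i in range(3)]
    # gradient of f_i at the agent's own previous iterate, as in (4)
    x = [mixed[i] - alpha * 2.0 * (x[i] - c[i]) for i in range(3)]

print(x)  # all three agents approach the consensus optimum 1.0
```

With a doubly stochastic $W$ the average of the iterates follows a centralized (sub)gradient step; it is exactly this double stochasticity that becomes impractical over directed graphs, motivating the push-sum and push-pull protocols discussed next.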
Research gap and contributions:
Despite the aforementioned advances, the existing theory andalgorithms for in-network optimization are founded on two strong assumptions: (1) The problemis often assumed to be unconstrained, e.g., this is the case in Push-DIGing [27] and Push-Pull [30]algorithms that have been recently developed. (2) The objective function is assumed to be stronglyconvex, an assumption that appears in many of the existing models and algorithms in the area ofmulti-agent optimization. In this work, we aim at addressing these shortcomings through consideringthe bilevel framework (1). To address this model, utilizing a novel regularization-based relaxationapproach, we develop a new push-pull gradient algorithm where at each iteration, the information ofiteratively regularized gradients is pushed to the neighbors, while the information about the decisionvariable is pulled from the neighbors. Our contributions are as follows: (i)
For the bilevel model (1),in Theorem 1, we derive a family of rates for suboptimality, infeasibility, and consensus violation. (ii)
For the linearly constrained model (2), in Corollary 1, we derive a new convergence rate of O (1 /k ) for suboptimality and a rate of O (1 /k . ) for the infeasibility of the generated iterates. (iii) For theunconstrained non-strongly convex model (3), in Corollary 2, we derive a new convergence rate of O (1 /k . ) and establish the consensus to the least (cid:96) -norm optimal solution. Notation:
For an integer $m$, the set $\{1, \ldots, m\}$ is denoted as $[m]$. A vector $x$ is assumed to be a column vector (unless otherwise noted) and $x^T$ denotes its transpose. We use $\|x\|$ to denote the Euclidean vector norm of $x$. A continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is said to be $\mu_f$-strongly convex if and only if its gradient mapping is $\mu_f$-strongly monotone, i.e., $(\nabla f(x) - \nabla f(y))^T (x - y) \ge \mu_f \|x - y\|^2$ for any $x, y \in \mathbb{R}^n$. Also, it is said to be $L_f$-smooth if its gradient mapping is Lipschitz continuous with parameter $L_f > 0$, i.e., for any $x, y \in \mathbb{R}^n$, we have $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|$. We use the following definitions:
$$\mathbf{x} := [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m \times n}, \qquad \mathbf{y} := [y_1, y_2, \ldots, y_m]^T \in \mathbb{R}^{m \times n},$$
$$f(x) := \sum_{i=1}^m f_i(x), \quad \mathbf{f}(\mathbf{x}) := \sum_{i=1}^m f_i(x_i), \quad \nabla \mathbf{f}(\mathbf{x}) := [\nabla f_1(x_1), \ldots, \nabla f_m(x_m)]^T \in \mathbb{R}^{m \times n},$$
$$g(x) := \sum_{i=1}^m g_i(x), \quad \mathbf{g}(\mathbf{x}) := \sum_{i=1}^m g_i(x_i), \quad \nabla \mathbf{g}(\mathbf{x}) := [\nabla g_1(x_1), \ldots, \nabla g_m(x_m)]^T \in \mathbb{R}^{m \times n}.$$
Here, $x_i$ denotes the local copy of the decision vector for agent $i$, and $\mathbf{x}$ collects the local copies of all agents. Vector $y_i$ denotes the auxiliary variable employed by agent $i$ to track the average of the regularized gradient mappings. Throughout, we use the following definition of a matrix norm: given an arbitrary vector norm $\|\cdot\|$, the induced norm of a matrix $W \in \mathbb{R}^{m \times n}$ is defined as $\|W\| := \big\| [\|W_{\bullet 1}\|, \ldots, \|W_{\bullet n}\|] \big\|$, where $W_{\bullet j}$ denotes the $j$th column of $W$.

Remark 1.
Under the above definition of the matrix norm, it can be seen that $\|A\mathbf{x}\| \le \|A\| \|\mathbf{x}\|$ for any $A \in \mathbb{R}^{m \times m}$ and $\mathbf{x} \in \mathbb{R}^{m \times p}$. Also, for any $a \in \mathbb{R}^{m \times 1}$ and $x \in \mathbb{R}^{1 \times n}$, we have $\|a x\| = \|a\| \|x\|$.

Lagrangian duality and relaxation rules have proved very successful in addressing classical constrained optimization problems [5]. However, in distributed optimization over networks in the presence of hard-to-project functional constraints, only a very limited number of works in the literature have been successful in employing duality theory [9, 24, 8, 2]. When it comes to solving the proposed model (1) in directed networks, due to the presence of the inner-level optimization constraints, Lagrangian duality does not seem applicable. Overcoming this challenge calls for new relaxation rules that can tackle the inner-level constraints. In this paper, we consider a regularization-based relaxation rule. To this end, motivated by the recent success of so-called iteratively regularized (IR) algorithms in centralized regimes [45, 44, 17, 1], we develop Algorithm 1. Core to the IR framework is the philosophy that the regularization parameter $\lambda_k$ is updated after every step within the algorithm. Here, each agent holds a local copy of the global variable $x$, denoted by $x_{i,k}$, and an auxiliary variable $y_{i,k}$ used to track the average of the regularized gradients. At each iteration, each agent $i$ uses the $i$th row of the two matrices $R = [R_{ij}] \in \mathbb{R}^{m \times m}$ and $C = [C_{ij}] \in \mathbb{R}^{m \times m}$ to update the vectors $x_{i,k}$ and $y_{i,k}$, respectively. Below, we state the main assumptions on these two weight mixing matrices.

Assumption 2. (a) The matrix $R$ is nonnegative, with a strictly positive diagonal, and is row-stochastic, i.e., $R\mathbf{1} = \mathbf{1}$. (b) The matrix $C$ is nonnegative, with a strictly positive diagonal, and is column-stochastic, i.e., $\mathbf{1}^T C = \mathbf{1}^T$. (c) The induced digraphs $\mathcal{G}_R$ and $\mathcal{G}_{C^T}$ satisfy $\mathcal{R}_{\mathcal{G}_R} \cap \mathcal{R}_{\mathcal{G}_{C^T}} \ne \emptyset$.
Algorithm 1
Iteratively Regularized Push-Pull

Input: For all $i \in [m]$, agent $i$ sets a step-size $\gamma_{i,0} \ge 0$, pulling weights $R_{ij} \ge 0$ for all $j \in \mathcal{N}^{\text{in}}_R(i)$, pushing weights $C_{\ell i} \ge 0$ for all $\ell \in \mathcal{N}^{\text{out}}_C(i)$, an arbitrary initial point $x_{i,0} \in \mathbb{R}^n$, and an initial $y_{i,0} := \nabla g_i(x_{i,0}) + \lambda_0 \nabla f_i(x_{i,0})$;
for $k = 0, 1, \ldots$ do
For all $i \in [m]$, agent $i$ receives (pulls) the vector $x_{j,k} - \gamma_{j,k} y_{j,k}$ from each agent $j \in \mathcal{N}^{\text{in}}_R(i)$, sends (pushes) $C_{\ell i} y_{i,k}$ to each agent $\ell \in \mathcal{N}^{\text{out}}_C(i)$, and does the following updates:
$x_{i,k+1} := \sum_{j=1}^m R_{ij} (x_{j,k} - \gamma_{j,k} y_{j,k})$;
$y_{i,k+1} := \sum_{j=1}^m C_{ij} y_{j,k} + \nabla g_i(x_{i,k+1}) + \lambda_{k+1} \nabla f_i(x_{i,k+1}) - \nabla g_i(x_{i,k}) - \lambda_k \nabla f_i(x_{i,k})$;
end for

Importantly, Assumption 2 does not require the strong condition of a doubly stochastic weight matrix for communication in a directed network. In turn, utilizing a push-pull protocol and in a fashion similar to the recent work [30], it only entails a row-stochastic matrix $R$ and a column-stochastic matrix $C$. An example is as follows, where agent $i$ chooses scalars $r_i, c_i > 0$ and:
$$R_{ij} = \begin{cases} 1/\big(|\mathcal{N}^{\text{in}}_R(i)| + r_i\big), & \text{if } j \in \mathcal{N}^{\text{in}}_R(i), \\ r_i/\big(|\mathcal{N}^{\text{in}}_R(i)| + r_i\big), & \text{if } j = i, \\ 0, & \text{otherwise}, \end{cases} \qquad C_{\ell i} = \begin{cases} 1/\big(|\mathcal{N}^{\text{out}}_C(i)| + c_i\big), & \text{if } \ell \in \mathcal{N}^{\text{out}}_C(i), \\ c_i/\big(|\mathcal{N}^{\text{out}}_C(i)| + c_i\big), & \text{if } \ell = i, \\ 0, & \text{otherwise}. \end{cases} \qquad (5)$$
Note that Assumption 2(c) is weaker than imposing strong connectivity on $\mathcal{G}_R$ and $\mathcal{G}_C$. The update rules in Algorithm 1 can be compactly represented as follows:
$$\mathbf{x}_{k+1} := R(\mathbf{x}_k - \gamma_k \mathbf{y}_k), \qquad (6)$$
$$\mathbf{y}_{k+1} := C\mathbf{y}_k + \nabla \mathbf{g}(\mathbf{x}_{k+1}) + \lambda_{k+1} \nabla \mathbf{f}(\mathbf{x}_{k+1}) - \nabla \mathbf{g}(\mathbf{x}_k) - \lambda_k \nabla \mathbf{f}(\mathbf{x}_k), \qquad (7)$$
where $\gamma_k$ is a nonnegative diagonal matrix defined as $\gamma_k := \text{diag}(\gamma_{1,k}, \ldots, \gamma_{m,k})$.

In this section, we provide the mathematical toolbox for the convergence and rate analysis of Algorithm 1. Under Assumption 2, there exists a unique nonnegative left eigenvector $u \in \mathbb{R}^m$ such that $u^T R = u^T$ and $u^T \mathbf{1} = m$.
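The updates of Algorithm 1 can be sketched in code as follows. The directed 3-cycle, the weights built by rule (5) with $r_i = c_i = 1$, the toy bilevel instance, and the schedule constants are all our own illustrative choices (not the paper's experimental setup); the inner solution set is the line $x_1 + x_2 = 2$ and the least $\ell_2$-norm solution is $(1,1)$.

```python
# Iteratively regularized push-pull sketch on the directed cycle 0 -> 1 -> 2 -> 0.
# Every agent holds g_i(x) = (x1 + x2 - 2)^2 (merely convex) and f_i(x) = ||x||^2/m.
m, n = 3, 2
R = [[0.5, 0.0, 0.5],   # row-stochastic (rule (5), r_i = 1): self + in-neighbor
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
C = [[0.5, 0.0, 0.5],   # column-stochastic (rule (5), c_i = 1): self + out-neighbor
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]

def G(x, lam):          # regularized local gradient: grad g_i + lam * grad f_i
    r = x[0] + x[1] - 2.0
    return [2.0 * r + lam * 2.0 * x[0] / m,
            2.0 * r + lam * 2.0 * x[1] / m]

gamma0, lam0, a, b = 0.05, 2.0, 0.5, 0.25        # schedules as in Assumption 3
xs = [[1.0, -0.5], [0.0, 0.0], [0.5, 0.5]]       # arbitrary initial points
lam = lam0
ys = [G(xs[i], lam) for i in range(m)]           # y_{i,0} = G_{0,i}(x_{i,0})

for k in range(30000):
    gamma = gamma0 / (k + 1) ** a
    lam_next = lam0 / (k + 2) ** b
    # pull step (6): x_{i,k+1} = sum_j R_ij (x_{j,k} - gamma * y_{j,k})
    zs = [[xs[j][d] - gamma * ys[j][d] for d in range(n)] for j in range(m)]
    xs_new = [[sum(R[i][j] * zs[j][d] for j in range(m)) for d in range(n)]
              for i in range(m)]
    # push step (7): y_{i,k+1} = sum_j C_ij y_{j,k} + G_{k+1,i}(new) - G_{k,i}(old)
    ys_new = []
    for i in range(m):
        mix = [sum(C[i][j] * ys[j][d] for j in range(m)) for d in range(n)]
        g_new, g_old = G(xs_new[i], lam_next), G(xs[i], lam)
        ys_new.append([mix[d] + g_new[d] - g_old[d] for d in range(n)])
    xs, ys, lam = xs_new, ys_new, lam_next

print(xs)  # each local copy approaches the least 2-norm solution (1, 1)
```

Column stochasticity of $C$ preserves the tracking invariant $\sum_i y_{i,k} = \sum_i G_{k,i}(x_{i,k})$, which is what lets the pushed quantity track the average regularized gradient while $\lambda_k$ decays.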
Similarly, there exists a unique nonnegative right eigenvector $\nu \in \mathbb{R}^m$ such that $C\nu = \nu$ and $\mathbf{1}^T \nu = m$ (cf. Lemma 1 in [30]). Throughout the analysis, we use the following definitions for any integer $k \ge 0$ and regularization parameter $\lambda_k > 0$:
$$x^* := \operatorname*{argmin}_{x \in \operatorname{argmin} g(x)} \{f(x)\} \in \mathbb{R}^{1 \times n}, \qquad x^*_{\lambda_k} := \operatorname*{argmin}_{x \in \mathbb{R}^n} \{g(x) + \lambda_k f(x)\} \in \mathbb{R}^{1 \times n}, \qquad (8)$$
$$G_k(\mathbf{x}) := \nabla \mathbf{g}(\mathbf{x}) + \lambda_k \nabla \mathbf{f}(\mathbf{x}) \in \mathbb{R}^{m \times n}, \qquad \bar{G}_k(\mathbf{x}) := \tfrac{1}{m} \mathbf{1}^T G_k(\mathbf{x}) \in \mathbb{R}^{1 \times n}, \qquad (9)$$
$$\mathcal{G}_k(x) := \bar{G}_k(\mathbf{1} x) \in \mathbb{R}^{1 \times n}, \qquad \bar{g}_k := \mathcal{G}_k(\bar{x}_k) \in \mathbb{R}^{1 \times n}, \qquad L_k := L_g + \lambda_k L_f, \qquad (10)$$
$$\bar{x}_k := \tfrac{1}{m} u^T \mathbf{x}_k \in \mathbb{R}^{1 \times n}, \qquad \bar{y}_k := \tfrac{1}{m} \mathbf{1}^T \mathbf{y}_k \in \mathbb{R}^{1 \times n}, \qquad \Lambda_k := \Big| 1 - \frac{\lambda_{k+1}}{\lambda_k} \Big|. \qquad (11)$$
Here, $x^*$ denotes the optimal solution of problem (1) and $x^*_{\lambda_k}$ is defined as the optimal solution to a regularized problem. Note that the strong convexity of $g(x) + \lambda_k f(x)$ implies that $x^*_{\lambda_k}$ exists and is a unique vector (cf. Proposition 1.1.2 in [4]). Also, under Assumption 1, the set $\operatorname{argmin} g(x)$ is closed and convex. As such, from the strong convexity of $f$ and invoking Proposition 1.1.2 in [4] again, we conclude that $x^*$ also exists and is a unique vector. The sequence $\{x^*_{\lambda_k}\}$ is the so-called Tikhonov trajectory and plays a key role in the convergence analysis (cf. [13]). The mapping $G_k(\mathbf{x})$ denotes the regularized gradient matrix. The vector $\bar{x}_k$ holds a weighted average of the local copies of the agents' iterates. Next, we consider a family of update rules for the sequences of the step-size and the regularization parameter under which the convergence and rate analysis can be performed.

Assumption 3 (Update rules). Assume the step-size $\gamma_k$ and the regularization parameter $\lambda_k$ are updated satisfying $\hat{\gamma}_k := \frac{\hat{\gamma}_0}{(k+1)^a}$ and $\lambda_k := \frac{\lambda_0}{(k+1)^b}$, where $\hat{\gamma}_k := \max_{j \in [m]} \gamma_{j,k}$ for $k \ge 0$, and $a$ and $b$ satisfy the conditions $0 < b < a < 1$ and $a + b < 1$. Also, let $\alpha_k \ge \theta \hat{\gamma}_k$ for all $k \ge 0$ for some $\theta > 0$, where $\alpha_k := \tfrac{1}{m} u^T \gamma_k \nu$.

The constant $\theta$ in Assumption 3 measures the size of the range within which the agents in $\mathcal{R}_{\mathcal{G}_R} \cap \mathcal{R}_{\mathcal{G}_{C^T}}$ select their step-sizes. The condition $\alpha_k \ge \theta \hat{\gamma}_k$ is satisfied in many cases, including the case where all the agents choose strictly positive step-sizes (see Remark 4 in [30] for more details). In the following lemma, we list some of the main properties of the update rules in Assumption 3. We will make use of these properties in the convergence and rate analysis.

Lemma 1 (Properties of the update rules). Let Assumption 3 hold. Then, the following properties hold: $\{\lambda_k\}_{k=0}^{\infty}$ is a decreasing strictly positive sequence satisfying $\lambda_k \to 0$, $\frac{\Lambda_k}{\lambda_k} \to 0$, $\Lambda_{k+1} \le \Lambda_k$ for all $k \ge 0$, and $\Lambda_{k-1} \le \frac{1}{k+1}$ for $k \ge 1$, where $\Lambda_k$ is given by (11). Also, $\{\hat{\gamma}_k\}_{k=0}^{\infty}$ is a decreasing strictly positive sequence such that $\hat{\gamma}_k \to 0$ and $\frac{\hat{\gamma}_k}{\lambda_k} \to 0$. Moreover, for any scalar $\tau > 0$, there exists an integer $K_{\tau}$ such that $\frac{(k+1)\hat{\gamma}_k \lambda_k}{k \hat{\gamma}_{k-1} \lambda_{k-1}} - 1 \le \tau \hat{\gamma}_k \lambda_k$ for all $k \ge K_{\tau}$.

Proof. See Appendix A.1.

Next, we present a key property of the sequence $\{x^*_{\lambda_k}\}$ and quantify the error between two consecutive elements of the trajectory.

Lemma 2 (Properties of the Tikhonov trajectory). Let Assumptions 1 and 3 hold and let $x^*_{\lambda_k}$ be defined as in (8). Then: (i) The sequence $\{x^*_{\lambda_k}\}$ converges to $x^*$, the unique solution of problem (1). (ii) There exists a scalar $M > 0$ such that for any $k \ge 1$, we have $\|x^*_{\lambda_k} - x^*_{\lambda_{k-1}}\| \le \frac{M}{\mu_f} \Lambda_{k-1}$.

Proof. See Appendix A.2.

In the following, we state the properties of the regularized mappings. These results will be used in finding suitable error bounds in the next section.
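The schedules of Assumption 3 and the claims of Lemma 1 can be checked numerically as follows. This is a sanity check, not a proof; the constants $\hat{\gamma}_0 = \lambda_0 = 1$, $a = 0.5$, $b = 0.25$ are our own sample choices satisfying $0 < b < a < 1$ and $a + b < 1$.

```python
# Sanity check of the update-rule properties in Lemma 1 for sample constants.
ghat0, lam0, a, b = 1.0, 1.0, 0.5, 0.25

def ghat(k):            # step-size bound: ghat_k = ghat0 / (k + 1)^a
    return ghat0 / (k + 1) ** a

def lam(k):             # regularization parameter: lambda_k = lam0 / (k + 1)^b
    return lam0 / (k + 1) ** b

def Lam(k):             # Lambda_k = |1 - lambda_{k+1} / lambda_k|
    return abs(1.0 - lam(k + 1) / lam(k))

for k in [10, 100, 1000, 10000]:
    # lambda_k decreases to 0; ghat_k/lam_k -> 0 (since a > b); Lambda_k = O(1/k)
    print(k, round(lam(k), 4), round(ghat(k) / lam(k), 4), round(Lam(k) * (k + 1), 4))
```

The printed third column stays bounded, illustrating $\Lambda_k = O(1/k)$, which together with $\Lambda_k / \lambda_k \to 0$ is what controls the drift of the Tikhonov trajectory in Lemma 2.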
Lemma 3.
Consider Algorithm 1. Let Assumptions 1 and 2 hold. For any $k \ge 0$, the mappings $G_k$, $\mathcal{G}_k$, and $\bar{g}_k$ given by (9) and (10) satisfy the following relations: (i) We have $\bar{y}_k = \bar{G}_k(\mathbf{x}_k)$. (ii) We have $\mathcal{G}_k(x^*_{\lambda_k}) = 0$. (iii) The mapping $\mathcal{G}_k(x)$ is $(\lambda_k \mu_f)$-strongly monotone and Lipschitz continuous with parameter $L_k$. (iv) We have $\|\bar{y}_k - \bar{g}_k\| \le \frac{L_k}{\sqrt{m}} \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|$ and $\|\bar{g}_k\| \le L_k \|\bar{x}_k - x^*_{\lambda_k}\|$.

Proof. See Appendix A.3.

We state the following result from [30], introducing two matrix norms induced by the matrices $R$ and $C$.

Lemma 4 (cf. Lemma 4 and Lemma 6 in [30]). Let Assumption 2 hold. Then: (i) There exist matrix norms $\|\cdot\|_R$ and $\|\cdot\|_C$ such that, for $\sigma_R := \big\|R - \frac{\mathbf{1}u^T}{m}\big\|_R$ and $\sigma_C := \big\|C - \frac{\nu \mathbf{1}^T}{m}\big\|_C$, we have $\sigma_R < 1$ and $\sigma_C < 1$. (ii) There exist scalars $\delta_{R,2}, \delta_{C,2}, \delta_{R,C}, \delta_{C,R} > 0$ such that for any $W \in \mathbb{R}^{m \times n}$, we have $\|W\|_R \le \delta_{R,2} \|W\|$, $\|W\|_C \le \delta_{C,2} \|W\|$, $\|W\|_R \le \delta_{R,C} \|W\|_C$, $\|W\|_C \le \delta_{C,R} \|W\|_R$, $\|W\| \le \|W\|_R$, and $\|W\| \le \|W\|_C$.

In this section, we analyze the convergence of Algorithm 1 by introducing the three error metrics $\|\bar{x}_{k+1} - x^*_{\lambda_k}\|$, $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R$, and $\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C$. Of these, the first term relates the averaged iterate to the Tikhonov trajectory, the second term measures the consensus violation for the decision matrix, and the third term measures the consensus violation for the matrix of the regularized gradients. We begin by deriving a system of recursive relations for these three error terms, provided below.
Proposition 1.
Consider Algorithm 1 under Assumptions 1, 2, and 3. Let $\alpha_k$ and $\hat{\gamma}_k$ be given by Assumption 3, and let $c_0 := \delta_{C,2} \big\| I - \frac{1}{m}\nu \mathbf{1}^T \big\|_C$. Then, there exist scalars $M > 0$ and $B_g > 0$ and an integer $K_0$ such that for any $k \ge K_0$, the following recursive relations hold:
$$\|\bar{x}_{k+1} - x^*_{\lambda_k}\| \le (1 - \mu_f \alpha_k \lambda_k) \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + \frac{M \Lambda_{k-1}}{\mu_f} + \frac{\alpha_k L_k}{\sqrt{m}} \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R + \frac{\hat{\gamma}_k \|u\|}{m} \|\mathbf{y}_k - \nu \bar{y}_k\|_C, \qquad (12)$$
$$\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R \le \sigma_R \Big( 1 + \frac{\hat{\gamma}_k \|\nu\|_R L_k}{\sqrt{m}} \Big) \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R + \sigma_R \hat{\gamma}_k \delta_{R,C} \|\mathbf{y}_k - \nu \bar{y}_k\|_C + \sigma_R \hat{\gamma}_k L_k \|\nu\|_R \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + \frac{M \sigma_R \hat{\gamma}_k L_k \|\nu\|_R}{\mu_f} \Lambda_{k-1}, \qquad (13)$$
$$\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C \le \big( \sigma_C + c_0 L_k \hat{\gamma}_k \|R\| \big) \|\mathbf{y}_k - \nu \bar{y}_k\|_C + c_0 L_k \Big( \|R - I\| + \frac{\hat{\gamma}_k \|R\| \|\nu\| L_k}{\sqrt{m}} + 2\Lambda_k \Big) \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + c_0 L_k \big( \hat{\gamma}_k \|R\| \|\nu\| L_k + 2\sqrt{m}\, \Lambda_k \big) \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + c_0 L_k \Big( \hat{\gamma}_k \|R\| \|\nu\| L_k + \sqrt{m}\, \Lambda_k + \frac{\mu_f B_g}{c_0 L_k M} \Big) \frac{M}{\mu_f} \Lambda_{k-1}. \qquad (14)$$

Proof. Proof of (12): From (6) and (11), $\bar{x}_{k+1} = \frac{1}{m} u^T R (\mathbf{x}_k - \gamma_k \mathbf{y}_k) = \bar{x}_k - \frac{1}{m} u^T \gamma_k \mathbf{y}_k$. Thus:
$$\bar{x}_{k+1} = \bar{x}_k - \tfrac{1}{m} u^T \gamma_k (\mathbf{y}_k - \nu \bar{y}_k + \nu \bar{y}_k) = \bar{x}_k - \alpha_k \bar{g}_k - \alpha_k (\bar{y}_k - \bar{g}_k) - \tfrac{1}{m} u^T \gamma_k (\mathbf{y}_k - \nu \bar{y}_k).$$
From Assumption 3, there exists a $K_0$ such that $\alpha_k < \frac{1}{L_0} \le \frac{1}{L_k}$ for $k \ge K_0$. From Lemma 3(iii), $\mathcal{G}_k(x)$ is $(\mu_f \lambda_k)$-strongly monotone and Lipschitz continuous with parameter $L_k$. Invoking Lemma 10 in [31], we have for $\alpha_k < \frac{1}{L_k}$ that $\|\bar{x}_k - \alpha_k \bar{g}_k - x^*_{\lambda_k}\| \le \max\big( |1 - \mu_f \lambda_k \alpha_k|, |1 - L_k \alpha_k| \big) \|\bar{x}_k - x^*_{\lambda_k}\|$. Thus, since $\mu_f \lambda_k \le L_k$, we obtain for $\alpha_k \le \frac{1}{L_k}$ that $\|\bar{x}_k - \alpha_k \bar{g}_k - x^*_{\lambda_k}\| \le (1 - \mu_f \lambda_k \alpha_k) \|\bar{x}_k - x^*_{\lambda_k}\|$. Using the preceding two relations, we obtain:
$$\|\bar{x}_{k+1} - x^*_{\lambda_k}\| \le \|\bar{x}_k - x^*_{\lambda_k} - \alpha_k \bar{g}_k\| + \alpha_k \|\bar{y}_k - \bar{g}_k\| + \tfrac{1}{m} \|u^T \gamma_k (\mathbf{y}_k - \nu \bar{y}_k)\| \le (1 - \mu_f \alpha_k \lambda_k) \|\bar{x}_k - x^*_{\lambda_k}\| + \alpha_k \|\bar{y}_k - \bar{g}_k\| + \tfrac{1}{m} \|u^T \gamma_k (\mathbf{y}_k - \nu \bar{y}_k)\|.$$
By adding and subtracting $x^*_{\lambda_{k-1}}$ and using Lemmas 2 and 3(iv), we obtain:
$$\|\bar{x}_{k+1} - x^*_{\lambda_k}\| \le (1 - \mu_f \alpha_k \lambda_k) \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + \frac{M \Lambda_{k-1}}{\mu_f} + \frac{\alpha_k L_k}{\sqrt{m}} \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + \frac{\|u\| \|\gamma_k\|}{m} \|\mathbf{y}_k - \nu \bar{y}_k\|.$$
Then, inequality (12) is obtained by invoking Lemma 4(ii), Remark 1, and the definition of $\hat{\gamma}_k$.

Proof of (13): From (6) and (11) and the fact that $R\mathbf{1} = \mathbf{1}$, we have:
$$\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1} = R(\mathbf{x}_k - \gamma_k \mathbf{y}_k) - \mathbf{1}\bar{x}_k + \tfrac{1}{m} \mathbf{1} u^T \gamma_k \mathbf{y}_k = \Big( R - \frac{\mathbf{1}u^T}{m} \Big) \big( (\mathbf{x}_k - \mathbf{1}\bar{x}_k) - \gamma_k \mathbf{y}_k \big).$$
Therefore:
$$\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R \le \sigma_R \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R + \sigma_R \|\gamma_k\| \|\mathbf{y}_k - \nu \bar{y}_k\|_R + \sigma_R \|\gamma_k\| \|\nu\|_R \|\bar{y}_k\| \le \sigma_R \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R + \sigma_R \hat{\gamma}_k \|\mathbf{y}_k - \nu \bar{y}_k\|_R + \sigma_R \hat{\gamma}_k \|\nu\|_R \Big( \frac{L_k}{\sqrt{m}} \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + L_k \|\bar{x}_k - x^*_{\lambda_k}\| \Big) \le \sigma_R \Big( 1 + \frac{\hat{\gamma}_k \|\nu\|_R L_k}{\sqrt{m}} \Big) \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R + \sigma_R \hat{\gamma}_k \delta_{R,C} \|\mathbf{y}_k - \nu \bar{y}_k\|_C + \sigma_R \hat{\gamma}_k L_k \|\nu\|_R \|\bar{x}_k - x^*_{\lambda_k}\|.$$
Adding and subtracting $x^*_{\lambda_{k-1}}$ and using Lemma 2, we obtain inequality (13).

Proof of (14): From (7) and the definition of $G_k(\mathbf{x})$ in (9), we obtain $\mathbf{y}_{k+1} = C\mathbf{y}_k + G_{k+1}(\mathbf{x}_{k+1}) - G_k(\mathbf{x}_k)$. Multiplying both sides of the preceding relation by $\frac{1}{m}\mathbf{1}^T$ and using the definition of $\bar{y}_k$ in (11), we obtain $\bar{y}_{k+1} = \bar{y}_k + \frac{1}{m}\mathbf{1}^T G_{k+1}(\mathbf{x}_{k+1}) - \frac{1}{m}\mathbf{1}^T G_k(\mathbf{x}_k)$. From the last two relations, we have:
$$\mathbf{y}_{k+1} - \nu \bar{y}_{k+1} = \Big( C - \frac{\nu \mathbf{1}^T}{m} \Big)(\mathbf{y}_k - \nu \bar{y}_k) + \Big( I - \frac{\nu \mathbf{1}^T}{m} \Big)\big( G_{k+1}(\mathbf{x}_{k+1}) - G_k(\mathbf{x}_k) \big).$$
Invoking Lemma 4 and the definitions of $G_k(\mathbf{x})$ in (9) and of $c_0$, we obtain:
$$\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C \le \sigma_C \|\mathbf{y}_k - \nu \bar{y}_k\|_C + c_0 \|\lambda_{k+1} \nabla \mathbf{f}(\mathbf{x}_k) - \lambda_k \nabla \mathbf{f}(\mathbf{x}_k)\| + c_0 \|\nabla \mathbf{g}(\mathbf{x}_{k+1}) + \lambda_{k+1} \nabla \mathbf{f}(\mathbf{x}_{k+1}) - \nabla \mathbf{g}(\mathbf{x}_k) - \lambda_{k+1} \nabla \mathbf{f}(\mathbf{x}_k)\| \le \sigma_C \|\mathbf{y}_k - \nu \bar{y}_k\|_C + c_0 \Big| 1 - \frac{\lambda_{k+1}}{\lambda_k} \Big| \|\lambda_k \nabla \mathbf{f}(\mathbf{x}_k)\| + c_0 L_k \|\mathbf{x}_{k+1} - \mathbf{x}_k\|. \qquad (15)$$
Next, we estimate the two terms $\|\lambda_k \nabla \mathbf{f}(\mathbf{x}_k)\|$ and $\|\mathbf{x}_{k+1} - \mathbf{x}_k\|$. From Lemma 2, there exists a scalar $B_g < \infty$ such that $\sqrt{m}\, L_g \|x^*_{\lambda_k} - x^*\| \le B_g$ for all $k$. Since $\nabla \mathbf{g}(\mathbf{1}x^*) = 0$, we have:
$$\|\lambda_k \nabla \mathbf{f}(\mathbf{x}_k)\| \le \|\nabla \mathbf{g}(\mathbf{x}_k) + \lambda_k \nabla \mathbf{f}(\mathbf{x}_k)\| + \|\nabla \mathbf{g}(\mathbf{x}_k) - \nabla \mathbf{g}(\mathbf{1}x^*)\| \le \|\nabla \mathbf{g}(\mathbf{x}_k) + \lambda_k \nabla \mathbf{f}(\mathbf{x}_k) - \nabla \mathbf{g}(\mathbf{1}x^*_{\lambda_k}) - \lambda_k \nabla \mathbf{f}(\mathbf{1}x^*_{\lambda_k})\| + L_g \|\mathbf{x}_k - \mathbf{1}x^*\| \le (L_k + L_g) \|\mathbf{x}_k - \mathbf{1}x^*_{\lambda_k}\| + L_g \|\mathbf{1}(x^*_{\lambda_k} - x^*)\| \le 2L_k \big( \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + \|\mathbf{1}(\bar{x}_k - x^*_{\lambda_k})\| \big) + B_g \le 2L_k \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + 2\sqrt{m}\, L_k \|\bar{x}_k - x^*_{\lambda_k}\| + B_g.$$
From the row-stochasticity of $R$, we have $(R - I)\mathbf{1}\bar{x}_k = 0$. Thus, from Lemma 3 we have:
$$\|\mathbf{x}_{k+1} - \mathbf{x}_k\| = \|R(\mathbf{x}_k - \gamma_k \mathbf{y}_k) - \mathbf{x}_k\| = \|(R - I)(\mathbf{x}_k - \mathbf{1}\bar{x}_k) - R\gamma_k \mathbf{y}_k\| \le \|R - I\| \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + \|R\| \|\gamma_k\| \big( \|\mathbf{y}_k - \nu \bar{y}_k\| + \|\nu\| \|\bar{y}_k - \bar{g}_k\| + \|\nu\| \|\bar{g}_k\| \big) \le \|R - I\| \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\| + \hat{\gamma}_k \|R\| \Big( \|\mathbf{y}_k - \nu \bar{y}_k\| + L_k \|\nu\| \Big( \frac{\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|}{\sqrt{m}} + \|\bar{x}_k - x^*_{\lambda_k}\| \Big) \Big).$$
It remains to find a recursive bound for the term $\|\bar{x}_k - x^*_{\lambda_k}\|$. From Lemma 2, we can write:
$$\|\bar{x}_k - x^*_{\lambda_k}\| \le \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + \|x^*_{\lambda_{k-1}} - x^*_{\lambda_k}\| \le \|\bar{x}_k - x^*_{\lambda_{k-1}}\| + \frac{M}{\mu_f} \Lambda_{k-1}.$$
Combining (15) with the preceding three relations, we obtain the desired relation (14).

Next, we derive a unifying recursive bound for the three error terms in Proposition 1.
This result will play a key role in deriving the rate statements.
Proposition 2.
Consider Algorithm 1. Let Assumptions 1, 2, and 3 hold. Let us define the error metric $\Delta_k := \big[ \|\bar{x}_k - x^*_{\lambda_{k-1}}\|, \ \|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R, \ \|\mathbf{y}_k - \nu \bar{y}_k\|_C \big]^T$ for $k \ge 1$. Then, there exists an integer $K_1 \ge 0$ such that for any $k \ge K_1$, the following holds:

(a) Let $\Theta := \max\Big\{ 1, \ \sigma_R \hat{\gamma}_0 L_0 \|\nu\|_R, \ c_0 L_0 \Big( \hat{\gamma}_0 \|R\| \|\nu\| L_0 + \sqrt{m}\, \Lambda_0 + \frac{\mu_f B_g}{c_0 L_0 M} \Big) \Big\} \frac{\sqrt{3}\, M}{\mu_f}$. Then: $\|\Delta_{k+1}\| \le (1 - 0.5\, \mu_f \alpha_k \lambda_k) \|\Delta_k\| + \Theta \Lambda_{k-1}$.

(b) There exists a scalar $B > 0$ such that $\|\Delta_k\| \le \frac{B}{k^{1-a-b}}$.

Proof. See Appendix A.4.

We next provide a family of convergence rates for the suboptimality in both levels of the problem formulation (1).
Theorem 1 (Rate statements for the bilevel model). Consider problem (1) and Algorithm 1. Let Assumptions 1, 2, and 3 hold. Then, we have the following results:

(a) We have $\lim_{k \to \infty} \bar{x}_k = x^*$. Also, the consensus violations of $\mathbf{x}_k$ and $\mathbf{y}_k$, characterized by $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R$ and $\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C$, respectively, are both bounded by $\mathcal{O}\big( 1/k^{1-a-b} \big)$ for any sufficiently large $k$.

(b) We have $f(\bar{x}_k) - f(x^*) \le \frac{Q_1 (L_g + \lambda_0 L_f)}{2} \frac{1}{k^{1-a-b}}$ for some $Q_1 > 0$ and any sufficiently large $k$.

(c) We have $g(\bar{x}_k) - g(x^*) \le \frac{Q_1 (L_g + \lambda_0 L_f)}{2} \frac{1}{k^{1-a-b}} + \frac{\lambda_0 Q_2}{k^b}$ for some $Q_1, Q_2 > 0$ and any sufficiently large $k$.

Proof. See Appendix A.5.
Corollary 1 (Rate statements for the linearly constrained model). Consider problem (2) and Algorithm 1, where we set $g_i(x) := \|A_i x - b_i\|^2$. Let the feasible set be nonempty and let Assumption 1(a) and Assumption 2 hold. Suppose Assumption 3 holds with $a := 0.5$ and $b := 0.25 - \epsilon/2$, where $\epsilon > 0$ is a sufficiently small scalar. Then, we have $\lim_{k \to \infty} \bar{x}_k = x^*$ and, for any sufficiently large $k$:

(a) We have $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R = \mathcal{O}\big( 1/k^{0.25+\epsilon/2} \big)$ and $\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C = \mathcal{O}\big( 1/k^{0.25+\epsilon/2} \big)$.

(b) We have $f(\bar{x}_k) - f(x^*) = \mathcal{O}\big( 1/k^{0.25+\epsilon/2} \big)$.

(c) We have $\|A \bar{x}_k^T - b\| = \mathcal{O}\big( 1/k^{0.125-\epsilon/4} \big)$, where $A := [A_1^T, \ldots, A_m^T]^T$ and $b := [b_1^T, \ldots, b_m^T]^T$.

Proof. See Appendix A.6.
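To isolate the regularization mechanism behind Corollary 1 from the network aspects, the following centralized sketch runs the iteratively regularized gradient update $x_{k+1} = x_k - \gamma_k(\nabla g(x_k) + \lambda_k \nabla f(x_k))$ with the decaying schedules of Assumption 3. The instance ($A = [1, 1]$, $b = 1$, $f(x) = \|x - p\|^2$ with $p = (2, 0)$) and the constants are our own illustrative choices.

```python
# Centralized iteratively regularized gradient sketch for a linearly constrained
# toy problem: minimize ||x - p||^2 subject to x1 + x2 = 1, encoded through
# g(x) = (x1 + x2 - 1)^2. The constrained minimizer is the projection of p onto
# the line, namely (1.5, -0.5).

p = (2.0, 0.0)
gamma0, lam0, a, b = 0.1, 2.0, 0.5, 0.2   # decaying schedules, 0 < b < a, a+b < 1
x = [0.0, 0.0]
for k in range(20000):
    gamma = gamma0 / (k + 1) ** a
    lam = lam0 / (k + 1) ** b
    r = x[0] + x[1] - 1.0                 # constraint residual A x - b
    grad = [2.0 * r + lam * 2.0 * (x[0] - p[0]),   # grad g + lam * grad f
            2.0 * r + lam * 2.0 * (x[1] - p[1])]
    x = [x[0] - gamma * grad[0], x[1] - gamma * grad[1]]

print(x)  # approaches (1.5, -0.5); the residual |x1 + x2 - 1| decays more slowly
```

As the corollary suggests, the infeasibility decays at a slower rate than the suboptimality: the residual is driven to zero only as fast as $\lambda_k$ lets the inner objective dominate.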
Corollary 2 (Rate statements for the unconstrained non-strongly convex model). Consider problem (3) and Algorithm 1, where we set $f_i(x) := \|x\|^2/m$. Let Assumptions 1(b), 1(c), and 2 hold. Suppose Assumption 3 holds with $a := 0.5$ and $b := 0.25 - \epsilon$, where $\epsilon > 0$ is a sufficiently small scalar. Let $x^*_{\ell_2}$ denote the least $\ell_2$-norm optimal solution of problem (3). Then, for any sufficiently large $k$:

(a) We have $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R = \mathcal{O}\big( 1/k^{0.25+\epsilon} \big)$ and $\|\mathbf{y}_{k+1} - \nu \bar{y}_{k+1}\|_C = \mathcal{O}\big( 1/k^{0.25+\epsilon} \big)$.

(b) We have $g(\bar{x}_k) - g(x^*_{\ell_2}) = \mathcal{O}\big( 1/k^{0.25-\epsilon} \big)$ and $\|\bar{x}_k - x^*_{\ell_2}\| = \mathcal{O}\big( 1/k^{\epsilon} \big)$.

Proof. See Appendix A.7.
We present a numerical comparison of the performance of Algorithm 1 with that of the Push-Pull algorithm [30]. Motivated by sensor network applications, we consider the unconstrained ill-posed problem $\min_{x\in\mathbb{R}^n}\sum_{i=1}^m\|z_i-H_ix\|^2$, where $H_i\in\mathbb{R}^{d\times n}$ and $z_i\in\mathbb{R}^d$ denote the measurement matrix and the noisy observation of the $i$th sensor. Due to the challenges raised by ill-conditioning, and also the lack of convergence and rate guarantees, the Push-Pull algorithm needs to be applied to a regularized variant of the problem. To this end, in the implementation of the Push-Pull scheme, we use an $\ell_2$ regularizer with a small positive parameter; importantly, our comparison leads to similar results with smaller choices of this parameter as well (see Appendix B). Accordingly, we use the same value for $\lambda_0$ in Algorithm 1. We employ the tuning rules according to Corollary 2, while a constant step size is used for the Push-Pull method. We generate $H_i$ and $z_i$ randomly and choose $m=10$, $n=20$, and $d=1$. We generate matrices $R$ and $C$ from the same underlying graph for three different directed graphs (see Figure 1). For the star and ring graphs, we use $R = I - \frac{1}{d^{\mathrm{in}}_{\max}}L_R$, where $L_R$ denotes the Laplacian matrix and $d^{\mathrm{in}}_{\max}$ denotes the maximum in-degree. We use the same formula for $C$ using the maximum out-degree. For the Lollipop graph, we generate matrices $R$ and $C$ using rules (5) with $r_i = c_i = 3$ for all $i$.
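The Laplacian-based construction of the weight matrices can be sketched as follows (the edge set below is illustrative, and the normalization $R = I - L_R/d^{\mathrm{in}}_{\max}$, $C = I - L_C/d^{\mathrm{out}}_{\max}$ is our reading of the formula in the text): $R$ comes out row-stochastic and $C$ column-stochastic, as required by the push-pull scheme.

```python
# Build R (row-stochastic) and C (column-stochastic) from a directed graph.
# An edge (i, j) means "i sends to j"; the star below is bidirected so that
# the graph is strongly connected.
m = 5
edges = [(0, j) for j in range(1, m)] + [(j, 0) for j in range(1, m)]

in_deg = [sum(1 for (i, j) in edges if j == v) for v in range(m)]
out_deg = [sum(1 for (i, j) in edges if i == v) for v in range(m)]
d_in, d_out = max(in_deg), max(out_deg)

# R = I - L_R / d_in, with in-Laplacian L_R[v][v] = in_deg(v), L_R[v][u] = -1
# for each edge u -> v.  Each row of R then sums to one.
R = [[(1.0 - in_deg[v] / d_in) if u == v else 0.0 for u in range(m)] for v in range(m)]
for (u, v) in edges:
    R[v][u] += 1.0 / d_in

# C = I - L_C / d_out, with out-Laplacian L_C[v][v] = out_deg(v), L_C[v][u] = -1
# for each edge u -> v.  Each column of C then sums to one.
C = [[(1.0 - out_deg[v] / d_out) if u == v else 0.0 for u in range(m)] for v in range(m)]
for (u, v) in edges:
    C[v][u] += 1.0 / d_out

row_sums = [sum(row) for row in R]
col_sums = [sum(C[v][u] for v in range(m)) for u in range(m)]
```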
Insights: Figure 1 shows the comparison of the two schemes. We compare objective function values and consensus violations; for the latter, we use the term $\left\|\mathbf{x}_k - \frac{\mathbf{1}\mathbf{1}^T}{m}\mathbf{x}_k\right\|$.

Figure 1: Algorithm 1 vs. regularized Push-Pull algorithm under different choices of R and C (columns: directed star, ring, and Lollipop graphs; rows: objective value and consensus violation, versus iterations).

In terms of the objective function value, Algorithm 1 performs significantly better in the star and ring cases. In the Lollipop case, while Algorithm 1 performs almost the same as Push-Pull in terms of the objective value, it achieves a lower consensus violation.

Acknowledgments and Disclosure of Funding
Farzad Yousefian gratefully acknowledges the support of the National Science Foundation throughCAREER grant ECCS-1944500.
References

[1] M. Amini and F. Yousefian. An iterative regularized incremental projected subgradient method for a class of bilevel optimization problems. In Proceedings of the American Control Conference (ACC), pages 4069–4074, 2019.
[2] N. S. Aybat and E. Y. Hamedani. A primal-dual method for conic constrained distributed optimization problems. In Advances in Neural Information Processing Systems, pages 5050–5058, 2016.
[3] N. S. Aybat, Z. Wang, T. Lin, and S. Ma. Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. IEEE Transactions on Automatic Control, 63(1):5–20, 2017.
[4] D. P. Bertsekas. Nonlinear Programming: 3rd Edition. Athena Scientific, Belmont, MA, 2016.
[5] D. P. Bertsekas, A. Nedich, and A. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, MA, 2003.
[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA, 1989.
[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.
[8] T. H. Chang. A proximal dual consensus ADMM method for multi-agent constrained optimization. IEEE Transactions on Signal Processing, 64(14):3719–3734, 2016.
[9] T. H. Chang, A. Nedich, and A. Scaglione. Distributed constrained optimization by consensus-based primal-dual perturbation method. IEEE Transactions on Automatic Control, 59(6):1524–1538, 2014.
[10] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826–838, 2004.
[11] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization (SIOPT), 22(2):674–701, 2012.
[12] J. W. Durham, A. Franchi, and F. Bullo. Distributed pursuit-evasion without mapping or global localization via local frontiers. Autonomous Robots, 32(1):81–95, 2012.
[13] F. Facchinei and J.-S. Pang. Finite-dimensional Variational Inequalities and Complementarity Problems. Vols. I, II. Springer Series in Operations Research. Springer-Verlag, New York, 2003.
[14] M. H. De Groot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.
[15] A. Jadbabaie, J. Lin, and A. S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001, 2003.
[16] D. Jakovetić, J. Xavier, and J. M. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.
[17] H. Kaushik and F. Yousefian. A randomized block coordinate iterative regularized subgradient method for high-dimensional ill-posed convex optimization. In Proceedings of the American Control Conference (ACC), pages 3420–3425, 2019.
[18] H. Lakshmanan and D. Farias. Decentralized resource allocation in dynamic networks of agents. SIAM Journal on Optimization, 19(2):911–940, 2008.
[19] S. Lee and A. Nedich. Asynchronous gossip-based random projection algorithms over networks. IEEE Transactions on Automatic Control, 61(4):953–958, 2016.
[20] Q. Ling and A. Ribeiro. Decentralized dynamic optimization through the alternating direction method of multipliers. IEEE Transactions on Signal Processing, 62(5):1185–1197, 2014.
[21] I. Lobel, A. Ozdaglar, and D. Feijer. Distributed multi-agent optimization with state-dependent communication. Mathematical Programming, 129(2):255–284, 2011.
[22] C. G. Lopes and A. H. Sayed. Distributed processing over adaptive networks. In 9th International Symposium on Signal Processing and Its Applications, pages 1–3, 2007.
[23] P. Di Lorenzo and G. Scutari. Distributed nonconvex optimization over time-varying networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[24] D. Mateos-Nunez and J. Cortes. Distributed subgradient methods for saddle-point problems. In IEEE Conference on Decision and Control (CDC), 2015.
[25] A. Nedich and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
[26] A. Nedich and A. Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.
[27] A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[28] A. Nedich and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
[29] S. Pu and A. Nedich. Distributed stochastic gradient tracking methods. Mathematical Programming, 2020.
[30] S. Pu, W. Shi, J. Xu, and A. Nedich. Push-pull gradient methods for distributed optimization in networks. IEEE Transactions on Automatic Control, 2020.
[31] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2018.
[32] M. Rabbat and R. D. Nowak. Distributed optimization in sensor networks. In The International Conference on Information Processing in Sensor Networks, pages 20–27, 2004.
[33] S. S. Ram, V. V. Veeravalli, and A. Nedich. Distributed non-autonomous power control through distributed convex optimization. In IEEE International Conference on Computer Communications, pages 3001–3005, 2009.
[34] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[35] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
[36] K. Srivastava and A. Nedich. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
[37] Y. Sun, G. Scutari, and D. Palomar. Distributed nonconvex multiagent optimization over time-varying networks. In 50th Asilomar Conference on Signals, Systems and Computers, pages 788–794, 2016.
[38] T. Tatarenko and B. Touri. Non-convex distributed optimization. IEEE Transactions on Automatic Control, 62(8):3744–3757, 2017.
[39] J. N. Tsitsiklis. Problems in Decentralized Decision Making and Computation. PhD thesis, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1984.
[40] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
[41] E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In IEEE Conference on Decision and Control (CDC), pages 5445–5450, 2012.
[42] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In IEEE Global Conference on Signal and Information Processing, pages 551–554, 2013.
[43] L. Xiao, S. Boyd, and S.-J. Kim. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33–46, 2007.
[44] F. Yousefian, A. Nedich, and U. V. Shanbhag. On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Mathematical Programming, 165(1):391–431, 2017.
[45] F. Yousefian, A. Nedich, and U. V. Shanbhag. On stochastic and deterministic quasi-Newton methods for nonstrongly convex optimization: Asymptotic convergence and rate analysis. SIAM Journal on Optimization, 30(2):1144–1172, 2020.

Appendices
Appendix A Supplementary proofs
A.1 Proof of Lemma 1
Recall that $\hat\gamma_k = \frac{\hat\gamma_0}{(k+1)^a}$ and $\lambda_k = \frac{\lambda_0}{(k+1)^b}$, where $0<b<a<1$ and $a+b<1$. Consequently, $\{\hat\gamma_k\}_{k=0}^\infty$ and $\{\lambda_k\}_{k=0}^\infty$ are strictly positive decreasing sequences and $\hat\gamma_k\to 0$, $\lambda_k\to 0$, and $\hat\gamma_k/\lambda_k\to 0$. Next, we show that $\Lambda_{k-1} \le \frac{1}{k+1}$ for $k \ge 1$. From (11) and that $\lambda_k \le \lambda_{k-1}$, for any $k \ge 1$ we have:
$$\Lambda_{k-1} = 1 - \frac{\lambda_k}{\lambda_{k-1}} = 1 - \frac{\lambda_0(k+1)^{-b}}{\lambda_0 k^{-b}} = 1 - \left(\frac{k}{k+1}\right)^b = 1 - \left(1-\frac{1}{k+1}\right)^b. \qquad (16)$$
From $0<b<a$ and $a+b<1$, we have $b<0.5$. This implies that $\left(1-\frac{1}{k+1}\right)^b \ge \left(1-\frac{1}{k+1}\right)^{0.5}$. Combining this relation with (16), we have:
$$\Lambda_{k-1} \le 1 - \left(1-\frac{1}{k+1}\right)^{0.5} = \frac{1-\left(1-\frac{1}{k+1}\right)}{1+\sqrt{1-\frac{1}{k+1}}} = \frac{1}{1+\sqrt{1-\frac{1}{k+1}}}\cdot\frac{1}{k+1} \le \frac{1}{k+1},$$
where the last inequality is implied from $k \ge 1$. Next, we show $\Lambda_{k+1} \le \Lambda_k$ for all $k \ge 0$. From (16), we have:
$$\Lambda_{k+1} = 1 - \left(1-\frac{1}{k+3}\right)^b \le 1 - \left(1-\frac{1}{k+2}\right)^b = \Lambda_k.$$
Next, we show that for any scalar $\tau>0$, there exists an integer $K_\tau$ such that $\frac{(k+1)\hat\gamma_k\lambda_k}{k\hat\gamma_{k-1}\lambda_{k-1}} \le 1+\tau\hat\gamma_k\lambda_k$ for all $k \ge K_\tau$. It suffices to show $\lim_{k\to\infty}\left(\frac{(k+1)\hat\gamma_k\lambda_k}{k\hat\gamma_{k-1}\lambda_{k-1}}-1\right)\frac{1}{\hat\gamma_k\lambda_k} = 0$. From the update formulas of $\hat\gamma_k$ and $\lambda_k$, we have:
$$\left(\frac{(k+1)\hat\gamma_k\lambda_k}{k\hat\gamma_{k-1}\lambda_{k-1}}-1\right)\frac{1}{\hat\gamma_k\lambda_k} = \left(\frac{k+1}{k}\left(\frac{k}{k+1}\right)^{a+b}-1\right)\frac{(k+1)^{a+b}}{\hat\gamma_0\lambda_0} \le \left(\left(1+\frac{1}{k}\right)^{1-a-b}-1\right)\frac{2k^{a+b}}{\hat\gamma_0\lambda_0} \le \frac{1}{k}\cdot\frac{2k^{a+b}}{\hat\gamma_0\lambda_0} = \frac{2}{\hat\gamma_0\lambda_0\,k^{1-a-b}}.$$
Therefore, $\lim_{k\to\infty}\left(\frac{(k+1)\hat\gamma_k\lambda_k}{k\hat\gamma_{k-1}\lambda_{k-1}}-1\right)\frac{1}{\hat\gamma_k\lambda_k} = 0$, implying the existence of the specified integer $K_\tau$.

A.2 Proof of Lemma 2
(a) From Assumption 1, the function $g(x)+\lambda_kf(x)$ is $(\lambda_k\mu_f)$-strongly convex. Since $x^*_{\lambda_k}$ is the minimizer of this regularized function, we have:
$$g(x^*)+\lambda_kf(x^*) \ge g(x^*_{\lambda_k})+\lambda_kf(x^*_{\lambda_k})+\frac{\lambda_k\mu_f}{2}\|x^*-x^*_{\lambda_k}\|^2. \qquad (17)$$
Also, the definition of $x^*$ in (8) implies that $x^* \in \operatorname{argmin} g(x)$ and so $g(x^*_{\lambda_k}) \ge g(x^*)$. From the preceding two inequalities, we obtain:
$$f(x^*) \ge f(x^*_{\lambda_k})+\frac{\mu_f}{2}\|x^*-x^*_{\lambda_k}\|^2 \quad \text{for all } k \ge 0. \qquad (18)$$
Recall that under Assumption 1, $x^*$ and $x^*_{\lambda_k}$ both exist and are unique vectors. Thus, we have that $f(x^*)-f(x^*_{\lambda_k}) < \infty$. Therefore, from relation (18), the sequence $\{x^*_{\lambda_k}\}$ is bounded, implying that it must have at least one limit point. Let $\{x^*_{\lambda_k}\}_{k\in\mathcal{K}}$ be an arbitrary subsequence such that $\lim_{k\to\infty,\,k\in\mathcal{K}} x^*_{\lambda_k} = \hat x$. Next, we show that $\hat x$ is a feasible solution to problem (1). Consider relation (17). Taking the limit from both sides of (17), using the continuity of $g$ and $f$ (induced from their convexity), invoking $\lim_{k\to\infty}\lambda_k = 0$, and neglecting the nonnegative term $\frac{\mu_f}{2}\|x^*-x^*_{\lambda_k}\|^2$, we obtain that:
$$g(x^*) \ge g\left(\lim_{k\to\infty,\,k\in\mathcal{K}} x^*_{\lambda_k}\right) = g(\hat x),$$
implying that $\hat x$ is a minimizer of $g(x)$ and so it is a feasible solution to problem (1). Next, we show that $\hat x$ is indeed the optimal solution of problem (1). Taking the limit from both sides of (18), using the continuity of $f$, and neglecting the nonnegative term $\frac{\mu_f}{2}\|x^*-x^*_{\lambda_k}\|^2$, we obtain:
$$f(x^*) \ge f\left(\lim_{k\to\infty,\,k\in\mathcal{K}} x^*_{\lambda_k}\right) = f(\hat x).$$
Hence, from the uniqueness of $x^*$ we conclude that all limit points of the Tikhonov trajectory are identical to $x^*$. Therefore, $\lim_{k\to\infty} x^*_{\lambda_k} = x^*$.
(b) If $x^*_{\lambda_k} = x^*_{\lambda_{k-1}}$, the desired relation holds trivially.
Consider the case where $x^*_{\lambda_k} \ne x^*_{\lambda_{k-1}}$ for $k \ge 1$. Writing the optimality conditions for $x^*_{\lambda_k}$ and $x^*_{\lambda_{k-1}}$, for any $x, y \in \mathbb{R}^n$, we have that:
$$g(x)+\lambda_{k-1}f(x) \ge g(x^*_{\lambda_{k-1}})+\lambda_{k-1}f(x^*_{\lambda_{k-1}})+\frac{\lambda_{k-1}\mu_f}{2}\|x-x^*_{\lambda_{k-1}}\|^2,$$
$$g(y)+\lambda_kf(y) \ge g(x^*_{\lambda_k})+\lambda_kf(x^*_{\lambda_k})+\frac{\lambda_k\mu_f}{2}\|y-x^*_{\lambda_k}\|^2.$$
By substituting $x := x^*_{\lambda_k}$ and $y := x^*_{\lambda_{k-1}}$ and adding the resulting two relations together, we have:
$$(\lambda_{k-1}-\lambda_k)\left(f(x^*_{\lambda_k})-f(x^*_{\lambda_{k-1}})\right) \ge \frac{\mu_f(\lambda_k+\lambda_{k-1})}{2}\|x^*_{\lambda_{k-1}}-x^*_{\lambda_k}\|^2.$$
From the convexity property of $f$, we have that:
$$f(x^*_{\lambda_k})-f(x^*_{\lambda_{k-1}}) \le \left(x^*_{\lambda_k}-x^*_{\lambda_{k-1}}\right)^T\nabla f(x^*_{\lambda_k}).$$
From the preceding two inequalities and that $\lambda_{k-1} > \lambda_k$, we obtain:
$$\frac{\mu_f\lambda_{k-1}}{2}\|x^*_{\lambda_{k-1}}-x^*_{\lambda_k}\|^2 \le (\lambda_{k-1}-\lambda_k)\left(x^*_{\lambda_k}-x^*_{\lambda_{k-1}}\right)^T\nabla f(x^*_{\lambda_k}).$$
Dividing both sides by $\lambda_{k-1}$ and using the Cauchy–Schwarz inequality, we have:
$$\frac{\mu_f}{2}\|x^*_{\lambda_{k-1}}-x^*_{\lambda_k}\|^2 \le \left(1-\frac{\lambda_k}{\lambda_{k-1}}\right)\|x^*_{\lambda_k}-x^*_{\lambda_{k-1}}\|\,\|\nabla f(x^*_{\lambda_k})\|.$$
Note that since $x^*_{\lambda_k} \ne x^*_{\lambda_{k-1}}$, we have $\|x^*_{\lambda_k}-x^*_{\lambda_{k-1}}\| \ne 0$. Thus, we obtain:
$$\|x^*_{\lambda_{k-1}}-x^*_{\lambda_k}\| \le \frac{2}{\mu_f}\left(1-\frac{\lambda_k}{\lambda_{k-1}}\right)\|\nabla f(x^*_{\lambda_k})\| = \frac{2\Lambda_{k-1}}{\mu_f}\|\nabla f(x^*_{\lambda_k})\|.$$
From part (a), $\{x^*_{\lambda_t}\}_{t=0}^\infty$ is a bounded sequence.
Thus, there exists a compact ball $\mathcal{X} \subset \mathbb{R}^n$ such that $\{x^*_{\lambda_t}\}_{t=0}^\infty \subseteq \mathcal{X}$. From the continuity of the mapping $\nabla f$, there exists a constant $M>0$ such that $\|\nabla f(x^*_{\lambda_k})\| \le M$ for all $k \ge 0$. Combining the preceding two relations, we obtain the desired inequality.
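The quantitative ingredients of Lemmas 1 and 2(a) can be checked numerically (the parameter choice and the quadratic instance below are illustrative assumptions for the sketch): the bound $\Lambda_{k-1} = 1-(k/(k+1))^b \le \frac{1}{k+1}$ with its monotonicity, and the convergence of the Tikhonov trajectory $x^*_\lambda = \operatorname{argmin}\{g(x)+\lambda f(x)\}$ to the least $\ell_2$-norm minimizer of $g$.

```python
import numpy as np

# --- Lemma 1: Lambda_{k-1} = 1 - (k/(k+1))^b is nonincreasing and <= 1/(k+1).
b = 0.3                                            # illustrative exponent, b < 0.5
Lam = [1.0 - (k / (k + 1)) ** b for k in range(1, 3001)]
lam_bound_ok = all(Lam[k - 1] <= 1.0 / (k + 1) for k in range(1, 3001))
lam_monotone_ok = all(Lam[k] <= Lam[k - 1] for k in range(1, 3000))

# --- Lemma 2(a): Tikhonov trajectory for g(x) = 0.5||Ax - c||^2, f = 0.5||x||^2.
# A is wide, so argmin g is an affine set; pinv(A) @ c is its least-norm element.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))
c = A @ rng.standard_normal(6)                     # consistent right-hand side
x_star = np.linalg.pinv(A) @ c                     # least 2-norm minimizer of g

def x_reg(lam):
    """Unique minimizer of g + lam*f, i.e. (A^T A + lam I)^{-1} A^T c."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ c)

dists = [np.linalg.norm(x_reg(10.0 ** (-t)) - x_star) for t in range(7)]
```

The distances `dists` shrink monotonically toward zero as $\lambda$ decreases, mirroring the limit established in part (a).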
A.3 Proof of Lemma 3
(i) Multiplying both sides of (7) by $\frac{1}{m}\mathbf{1}^T$ and from the definitions of $G_k$ and $\bar G_k$ in (9), we obtain:
$$\bar y_{k+1} = \frac{1}{m}\mathbf{1}^T\mathbf{y}_k + \frac{1}{m}\mathbf{1}^T\left(G_{k+1}(\mathbf{x}_{k+1}) - G_k(\mathbf{x}_k)\right) = \bar y_k + \bar G_{k+1}(\mathbf{x}_{k+1}) - \bar G_k(\mathbf{x}_k),$$
where we used $\mathbf{1}^TC = \mathbf{1}^T$. From Algorithm 1, we have $\mathbf{y}_0 := \nabla g(\mathbf{x}_0) + \lambda_0\nabla f(\mathbf{x}_0) = G_0(\mathbf{x}_0)$, implying that $\bar y_0 = \bar G_0(\mathbf{x}_0)$. From the two preceding relations, we obtain that $\bar y_k = \bar G_k(\mathbf{x}_k)$.
(ii, iii) From (9), we have that $\bar G_k(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m\left(\nabla g_i(x_{i,k}) + \lambda_k\nabla f_i(x_{i,k})\right)$. Thus, for a single vector argument $x \in \mathbb{R}^n$, we obtain that $\bar G_k(x) = \bar G_k(\mathbf{1}x^T) = \frac{1}{m}\sum_{i=1}^m\left(\nabla g_i(x) + \lambda_k\nabla f_i(x)\right) = \frac{1}{m}\left(\nabla g(x) + \lambda_k\nabla f(x)\right)$. Thus, from the definition of $x^*_{\lambda_k}$ in (8), we obtain $\bar G_k(x^*_{\lambda_k}) = 0$. Also, from Assumption 1, we conclude that $\bar G_k(x)$ is a $(\lambda_k\mu_f)$-strongly monotone mapping and Lipschitz continuous with parameter $L_k \triangleq L_g + \lambda_kL_f$ for $k \ge 0$.
(iv) For any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{m\times n}$, with $u_i, v_i \in \mathbb{R}^n$ denoting the $i$th rows of $\mathbf{u}, \mathbf{v}$, respectively, we have:
$$\|\bar G_k(\mathbf{u}) - \bar G_k(\mathbf{v})\| = \left\|\frac{1}{m}\mathbf{1}^T\left(\nabla g(\mathbf{u}) + \lambda_k\nabla f(\mathbf{u})\right) - \frac{1}{m}\mathbf{1}^T\left(\nabla g(\mathbf{v}) + \lambda_k\nabla f(\mathbf{v})\right)\right\|$$
$$\le \frac{1}{m}\left\|\sum_{i=1}^m\nabla g_i(u_i) - \sum_{i=1}^m\nabla g_i(v_i)\right\| + \frac{\lambda_k}{m}\left\|\sum_{i=1}^m\nabla f_i(u_i) - \sum_{i=1}^m\nabla f_i(v_i)\right\|$$
$$\le \frac{1}{m}\sum_{i=1}^m\left(\|\nabla g_i(u_i) - \nabla g_i(v_i)\| + \lambda_k\|\nabla f_i(u_i) - \nabla f_i(v_i)\|\right) \le \frac{1}{m}\sum_{i=1}^m\left(L_g\|u_i-v_i\| + \lambda_kL_f\|u_i-v_i\|\right) = \frac{L_k}{m}\sum_{i=1}^m\|u_i-v_i\| \le \frac{L_k}{\sqrt m}\|\mathbf{u}-\mathbf{v}\|.$$
Consequently, we obtain $\|\bar y_k - \bar g_k\| = \|\bar G_k(\mathbf{x}_k) - \bar G_k(\bar{\mathbf{x}}_k)\| \le \frac{L_k}{\sqrt m}\|\mathbf{x}_k - \bar{\mathbf{x}}_k\|$.
Also, using the Lipschitzian property of $\bar G_k$ in part (ii) and $\bar G_k(x^*_{\lambda_k}) = 0$, we obtain:
$$\|\bar g_k\| = \|\bar G_k(\bar x_k)\| = \|\bar G_k(\bar x_k) - \bar G_k(x^*_{\lambda_k})\| \le L_k\|\bar x_k - x^*_{\lambda_k}\|.$$

A.4 Proof of Proposition 2
(a) In the first step, from Proposition 1 we have, for all $k \ge K$, that $\Delta_{k+1} \le H_k\Delta_k + h_k$, where $H_k = [H_{ij,k}]_{3\times 3}$ and $h_k = [h_{i,k}]_{3\times 1}$ are given as follows:
$$H_{11,k} := 1 - 0.5\mu_f\alpha_k\lambda_k, \quad H_{12,k} := \frac{\alpha_kL_k}{\sqrt m}, \quad H_{13,k} := \frac{\hat\gamma_k\|u\|}{m},$$
$$H_{21,k} := \sigma_R\hat\gamma_kL_k\|\nu\|_R, \quad H_{22,k} := \sigma_R\left(1 + \frac{\hat\gamma_k\|\nu\|_RL_k}{\sqrt m}\right), \quad H_{23,k} := \sigma_R\hat\gamma_k\delta_{R,C},$$
$$H_{31,k} := c_0L_k\left(\hat\gamma_k\|R\|\|\nu\|L_k + 2\sqrt m\,\Lambda_k\right), \quad H_{32,k} := c_0L_k\left(\|R-I\| + \frac{\hat\gamma_k\|R\|\|\nu\|L_k}{\sqrt m} + 2\Lambda_k\right), \quad H_{33,k} := \sigma_C + c_0L_k\hat\gamma_k\|R\|,$$
$$h_{1,k} := \frac{2M\Lambda_{k-1}}{\mu_f}, \quad h_{2,k} := \frac{2M\sigma_R\hat\gamma_kL_k\|\nu\|_R}{\mu_f}\Lambda_{k-1}, \quad h_{3,k} := c_0L_k\left(\hat\gamma_k\|R\|\|\nu\|L_k + 2\sqrt m\,\Lambda_k + \frac{\mu_f}{c_0B_gM}\right)\frac{2M}{\mu_f}\Lambda_{k-1}.$$
Let us define the sequence $\{\rho_k\}$ as $\rho_k \triangleq 1 - 0.25\mu_f\alpha_k\lambda_k$ for $k \ge 0$. Next, we utilize our assumptions to find suitable upper bounds for some of the above terms. We define $\hat H_k = [\hat H_{ij,k}]_{3\times 3}$ and $\hat h_k = [\hat h_{i,k}]_{3\times 1}$ as follows:
$$\hat H_{11,k} := 1 - 0.5\mu_f\alpha_k\lambda_k, \quad \hat H_{12,k} := \frac{\alpha_kL_0}{\sqrt m}, \quad \hat H_{13,k} := \frac{\hat\gamma_k\|u\|}{m}, \quad \hat H_{21,k} := \sigma_R\hat\gamma_kL_0\|\nu\|_R, \quad \hat H_{22,k} := \rho_k - \frac{1-\sigma_R}{2},$$
$$\hat H_{23,k} := \sigma_R\hat\gamma_k\delta_{R,C}, \quad \hat H_{31,k} := c_0L_0\left(\hat\gamma_k\|R\|\|\nu\|L_0 + 2\sqrt m\,\Lambda_k\right), \quad \hat H_{32,k} := c_0L_0\left(\|R-I\| + \frac{\hat\gamma_k\|R\|\|\nu\|L_0}{\sqrt m} + 2\Lambda_0\right),$$
$$\hat H_{33,k} := \rho_k - \frac{1-\sigma_C}{2}, \quad \hat h_{1,k} := \frac{\Theta}{\sqrt 3}\Lambda_{k-1}, \quad \hat h_{2,k} := \frac{\Theta}{\sqrt 3}\Lambda_{k-1}, \quad \hat h_{3,k} := \frac{\Theta}{\sqrt 3}\Lambda_{k-1}.$$
Note that
$$\hat H_{22,k} - H_{22,k} = 1 - 0.25\mu_f\alpha_k\lambda_k - \frac{1-\sigma_R}{2} - \sigma_R - \frac{\sigma_R\hat\gamma_k\|\nu\|_RL_k}{\sqrt m} = \frac{1-\sigma_R}{2} - 0.25\mu_f\alpha_k\lambda_k - \frac{\sigma_R\hat\gamma_k\|\nu\|_RL_k}{\sqrt m}.$$
From Assumption 3 and the definition of $\alpha_k$, we have $\hat\gamma_k\to 0$, $\alpha_k\to 0$, and $\lambda_k\to 0$. Thus, there exists an integer $K_R \ge 0$ such that for all $k \ge K_R$ we have $H_{22,k} \le \hat H_{22,k}$. Similarly, there exists an integer $K_C \ge 0$ such that for all $k \ge K_C$ we have $H_{33,k} \le \hat H_{33,k}$. Thus, taking into account that $\lambda_k$ and $\Lambda_k$ are nonincreasing sequences and invoking the definition of $\Theta$, we have $H_k \le \hat H_k$ and $h_k \le \hat h_k$. This implies that for all $k \ge \max\{K, K_R, K_C\}$, we have $\Delta_{k+1} \le \hat H_k\Delta_k + \hat h_k$. Consequently, we obtain:
$$\|\Delta_{k+1}\| \le \rho\big(\hat H_k\big)\|\Delta_k\| + \Theta\Lambda_{k-1}, \qquad (19)$$
where $\rho\big(\hat H_k\big)$ denotes the spectral norm of $\hat H_k$. Next, we show that for a sufficiently large $k$, we have $\rho\big(\hat H_k\big) \le \rho_k$. To show this relation, employing Lemma 5 in [29], it suffices to show that $0 \le \hat H_{ii,k} < \rho_k$ for $i \in \{1,2,3\}$ and $\det\big(\rho_kI - \hat H_k\big) > 0$. Among these, it can be easily seen that $\hat H_{ii,k} < \rho_k$ holds for all $i \in \{1,2,3\}$. Since $\alpha_k\to 0$ and $\lambda_k\to 0$, there exists an integer $K_1$ such that $\hat H_{11,k} = 1 - 0.5\mu_f\alpha_k\lambda_k > 0$. Similarly, from $\sigma_R < 1$ and $\sigma_C < 1$, there exist integers $K_2$ and $K_3$ such that $\hat H_{22,k} > 0$ and $\hat H_{33,k} > 0$, respectively. Next, we show $\det\big(\rho_kI - \hat H_k\big) > 0$. We have:
$$\det\big(\rho_kI - \hat H_k\big) = \prod_{i=1}^3\big(\rho_k - \hat H_{ii,k}\big) - \big(\rho_k - \hat H_{11,k}\big)\hat H_{23,k}\hat H_{32,k} - \big(\rho_k - \hat H_{33,k}\big)\hat H_{12,k}\hat H_{21,k} - \hat H_{12,k}\hat H_{23,k}\hat H_{31,k} - \hat H_{13,k}\hat H_{21,k}\hat H_{32,k} - \big(\rho_k - \hat H_{22,k}\big)\hat H_{13,k}\hat H_{31,k}$$
$$= (0.25\mu_f\alpha_k\lambda_k)\left(\frac{1-\sigma_R}{2}\right)\left(\frac{1-\sigma_C}{2}\right) - (0.25\mu_f\alpha_k\lambda_k)(\sigma_R\hat\gamma_k\delta_{R,C})\,c_0L_0\left(\|R-I\| + \frac{\hat\gamma_k\|R\|\|\nu\|L_0}{\sqrt m} + 2\Lambda_0\right)$$
$$- \left(\frac{1-\sigma_C}{2}\right)\left(\frac{\alpha_kL_0}{\sqrt m}\right)(\sigma_R\hat\gamma_kL_0\|\nu\|_R) - \left(\frac{\alpha_kL_0}{\sqrt m}\right)(\sigma_R\hat\gamma_k\delta_{R,C})\left(c_0L_0\left(\hat\gamma_k\|R\|\|\nu\|L_0 + 2\sqrt m\,\Lambda_k\right)\right)$$
$$- \left(\frac{\hat\gamma_k\|u\|}{m}\right)(\sigma_R\hat\gamma_kL_0\|\nu\|_R)\left(c_0L_0\left(\|R-I\| + \frac{\hat\gamma_k\|R\|\|\nu\|L_0}{\sqrt m} + 2\Lambda_0\right)\right) - \left(\frac{1-\sigma_R}{2}\right)\left(c_0L_0\left(\hat\gamma_k\|R\|\|\nu\|L_0 + 2\sqrt m\,\Lambda_k\right)\right)\left(\frac{\hat\gamma_k\|u\|}{m}\right).$$
Next, we find lower and upper bounds for $\alpha_k$ in terms of $\hat\gamma_k$. Note that Assumption 3 provides $\theta\hat\gamma_k$ as a lower bound for $\alpha_k$. To find an upper bound, from Lemma 1 in [30], we have that the eigenvector $u$ is nonzero only on the entries $i \in \mathcal{R}_R$. Similarly, the eigenvector $\nu$ is nonzero only on the entries $i \in \mathcal{R}_{C^T}$. Also, we have $u^T\nu > 0$. Thus, from the definition of $\alpha_k$, we can write:
$$\frac{\alpha_k}{\hat\gamma_k} = \frac{1}{m}u^T\frac{\gamma_k}{\hat\gamma_k}\nu = \frac{1}{m}\sum_{i\in\mathcal{R}_R\cap\mathcal{R}_{C^T}}u_i\nu_i\frac{\gamma_{i,k}}{\hat\gamma_k} \le \frac{1}{m}\sum_{i\in\mathcal{R}_R\cap\mathcal{R}_{C^T}}u_i\nu_i = \frac{1}{m}u^T\nu > 0.$$
Let us define $\bar\theta \triangleq \frac{1}{m}u^T\nu$. Thus, we have $\theta\hat\gamma_k \le \alpha_k \le \bar\theta\hat\gamma_k$ for all $k \ge 0$. Using these bounds and rearranging the terms, we can obtain:
$$\det\big(\rho_kI - \hat H_k\big) \ge -c_2\hat\gamma_k^3 - c_1\hat\gamma_k^2 + c_3\hat\gamma_k\lambda_k - c_4\hat\gamma_k\Lambda_k,$$
where $c_1, c_2, c_3, c_4$ are defined as below:
$$c_2 \triangleq (0.25\mu_f\bar\theta\lambda_0)(\sigma_R\delta_{R,C})\,c_0L_0\left(\frac{\|R\|\|\nu\|L_0}{\sqrt m}\right) + \left(\frac{\bar\theta L_0}{\sqrt m}\right)(\sigma_R\delta_{R,C})\left(c_0L_0\|R\|\|\nu\|L_0\right) + \left(\frac{\|u\|}{m}\right)(\sigma_RL_0\|\nu\|_R)\left(\frac{c_0L_0\|R\|\|\nu\|L_0}{\sqrt m}\right),$$
$$c_1 \triangleq (0.25\mu_f\bar\theta\lambda_0)(\sigma_R\delta_{R,C})\,c_0L_0\left(\|R-I\| + 2\Lambda_0\right) + \left(\frac{1-\sigma_C}{2}\right)\left(\frac{\bar\theta L_0}{\sqrt m}\right)(\sigma_RL_0\|\nu\|_R) + \left(\frac{\bar\theta L_0}{\sqrt m}\right)(\sigma_R\delta_{R,C})\left(2c_0L_0\sqrt m\,\Lambda_0\right)$$
$$+ \left(\frac{\|u\|}{m}\right)(\sigma_RL_0\|\nu\|_R)\left(c_0L_0\left(\|R-I\| + 2\Lambda_0\right)\right) + \left(\frac{1-\sigma_R}{2}\right)\left(c_0L_0\|R\|\|\nu\|L_0\right)\left(\frac{\|u\|}{m}\right),$$
$$c_3 \triangleq 0.25\mu_f\theta\left(\frac{1-\sigma_R}{2}\right)\left(\frac{1-\sigma_C}{2}\right), \quad c_4 \triangleq \left(\frac{1-\sigma_R}{2}\right)\left(2c_0L_0\sqrt m\right)\left(\frac{\|u\|}{m}\right).$$
It suffices to show that $-c_2\hat\gamma_k^3 - c_1\hat\gamma_k^2 + c_3\hat\gamma_k\lambda_k - c_4\hat\gamma_k\Lambda_k > 0$ for any sufficiently large $k$. From Lemma 1, we have $\Lambda_k/\lambda_k \to 0$. Thus, there exists an integer $K_4 \ge 0$ such that for any $k \ge K_4$ we have $c_4\Lambda_k \le 0.5c_3\lambda_k$. As such, it suffices to show that $c_2\hat\gamma_k^2 + c_1\hat\gamma_k < 0.5c_3\lambda_k$. From Lemma 1, since $\hat\gamma_k \to 0$ and $\hat\gamma_k/\lambda_k \to 0$, there exists an integer $K_5 \ge 0$ such that $c_2\hat\gamma_k^2 + c_1\hat\gamma_k < 0.5c_3\lambda_k$ for any $k \ge K_5$. We conclude that for $K^* \triangleq \max\{K, K_1, K_2, K_3, K_4, K_5, K_R, K_C\}$, we have $\det\big(\rho_kI - \hat H_k\big) > 0$ for any $k \ge K^*$. Therefore, we have $\rho\big(\hat H_k\big) \le 1 - 0.25\mu_f\alpha_k\lambda_k$ for all $k \ge K^*$. The desired inequality is obtained from this inequality and relation (19).
(b) From Lemma 1, we have that $\Lambda_{k-1} \le \frac{1}{k+1}$. From part (a) and Assumption 3, we obtain:
$$\|\Delta_{k+1}\| \le \left(1 - 0.25\mu_f\theta\hat\gamma_k\lambda_k\right)\|\Delta_k\| + \frac{\Theta}{k+1} \quad \text{for all } k \ge K^*. \qquad (20)$$
We use induction to show that the desired inequality holds for:
$$B \triangleq \max\left\{(K^*+1)^{1-a-b}\|\Delta_{K^*}\|,\ \frac{8\Theta}{\mu_f\lambda_0\hat\gamma_0\theta}\right\}.$$
First, we observe that the inequality holds for $k := K^*$. This is because:
$$\|\Delta_{K^*}\| = \frac{(K^*+1)^{1-a-b}\|\Delta_{K^*}\|}{(K^*+1)^{1-a-b}} \le \frac{B}{(K^*+1)^{1-a-b}} \le \frac{B}{(K^*)^{1-a-b}}.$$
Let us assume that $\|\Delta_k\| \le \frac{B}{k^{1-a-b}}$ for some $k \ge K^*$. We show that this relation also holds for $k+1$. Consider Lemma 1. Let us choose $\tau := 0.125\mu_f\theta$. Thus, from Lemma 1, there exists a $K_\tau$ such that $\frac{(k+1)\hat\gamma_k\lambda_k}{k\hat\gamma_{k-1}\lambda_{k-1}} \le 1 + 0.125\mu_f\theta\hat\gamma_k\lambda_k$ for all $k \ge K_\tau$. This implies that:
$$\frac{1}{k^{1-a-b}} \le \frac{1}{(k+1)^{1-a-b}}\left(1 + 0.125\mu_f\theta\hat\gamma_k\lambda_k\right). \qquad (21)$$
Let $K_6$ be an integer such that $0.25\mu_f\theta\hat\gamma_k\lambda_k < 1$ for all $k \ge K_6$. Without loss of generality, let us assume $K^* \ge \max\{K_\tau, K_6\}$. From (20) and the induction hypothesis, we obtain:
$$\|\Delta_{k+1}\| \le \left(1 - 0.25\mu_f\theta\hat\gamma_k\lambda_k\right)\frac{B}{k^{1-a-b}} + \frac{\Theta}{k+1}.$$
From the preceding relation and (21), we obtain:
$$\|\Delta_{k+1}\| \le \left(1 - 0.25\mu_f\theta\hat\gamma_k\lambda_k\right)\left(1 + 0.125\mu_f\theta\hat\gamma_k\lambda_k\right)\frac{B}{(k+1)^{1-a-b}} + \frac{\Theta}{k+1}.$$
From the definition of $B$, we have $\Theta \le 0.125\mu_f\lambda_0\hat\gamma_0\theta B$, and so $\frac{\Theta}{k+1} \le 0.125\mu_f\theta\hat\gamma_k\lambda_k\frac{B}{(k+1)^{1-a-b}}$. From this relation and rearranging the terms in the preceding inequality, we obtain:
$$\|\Delta_{k+1}\| \le \frac{B}{(k+1)^{1-a-b}}\left(1 - 0.125\mu_f\theta\hat\gamma_k\lambda_k - 0.03125\left(\mu_f\theta\hat\gamma_k\lambda_k\right)^2\right) + 0.125\mu_f\theta\hat\gamma_k\lambda_k\frac{B}{(k+1)^{1-a-b}}.$$
This implies that $\|\Delta_{k+1}\| \le \frac{B}{(k+1)^{1-a-b}}$. Thus, the induction statement holds for $k+1$ and hence the proof is completed.

A.5 Proof of Theorem 1

(a) From Lemma 2(a), we have that $\{x^*_{\lambda_k}\}$ converges to $x^*$. Moreover, from Proposition 2(b), we have that $\|\bar x_k - x^*_{\lambda_{k-1}}\|$ converges to zero. Therefore, we have $\lim_{k\to\infty}\bar x_k = x^*$. To derive the bounds for $\|\mathbf{x}_k - \mathbf{1}\bar x_k\|_R$ and $\|\mathbf{y}_k - \nu\bar y_k\|_C$, from the definition of $\Delta_k$ in Proposition 2, we can write $\|\mathbf{x}_k - \mathbf{1}\bar x_k\|_R \le \|\Delta_k\| = \mathcal{O}\!\left(1/k^{1-a-b}\right)$. Similarly, we obtain $\|\mathbf{y}_k - \nu\bar y_k\|_C = \mathcal{O}\!\left(1/k^{1-a-b}\right)$.
(b) Consider the regularized function $g(x) + \lambda_kf(x)$. Note that it is $L_k$-smooth, where $L_k \triangleq L_g + \lambda_kL_f$. Since $x^*_{\lambda_k}$ is the minimizer of $g(x) + \lambda_kf(x)$, we have:
$$g(x) + \lambda_kf(x) - g(x^*_{\lambda_k}) - \lambda_kf(x^*_{\lambda_k}) \le \frac{L_k}{2}\|x - x^*_{\lambda_k}\|^2 \quad \text{for all } x \in \mathbb{R}^n.$$
Also, we can write that $g(x^*_{\lambda_k}) + \lambda_kf(x^*_{\lambda_k}) \le g(x^*) + \lambda_kf(x^*)$. Combining the preceding two relations and substituting $x$ by $\bar x_{k+1}$, we obtain:
$$g(\bar x_{k+1}) - g(x^*) + \lambda_k\left(f(\bar x_{k+1}) - f(x^*)\right) \le \frac{L_k}{2}\|\bar x_{k+1} - x^*_{\lambda_k}\|^2.$$
Applying the bound from Proposition 2(b), we obtain:
$$g(\bar x_{k+1}) - g(x^*) + \lambda_k\left(f(\bar x_{k+1}) - f(x^*)\right) \le \frac{L_kB^2}{2}(k+1)^{-(2-2a-2b)} \quad \text{for all } k \ge K^*. \qquad (22)$$
Note that from the definition of $x^*$ in (8), we have $g(\bar x_{k+1}) - g(x^*) \ge 0$. This implies that:
$$f(\bar x_{k+1}) - f(x^*) \le \frac{L_kB^2}{2\lambda_k}(k+1)^{-(2-2a-2b)} \le \frac{L_0B^2}{2\lambda_0}(k+1)^{-(2-2a-3b)} \quad \text{for all } k \ge K^*.$$
Therefore, the desired relation holds for $Q_1 \triangleq B^2/\lambda_0$.
(c) From part (a), we know that $\{\bar x_k\}$ converges to $x^*$. This result and the continuity of $f$ imply that there exists a scalar $Q_2 > 0$ such that $|f(\bar x_{k+1}) - f(x^*)| \le Q_2$. Thus, from inequality (22) and the update rule for $\lambda_k$, we obtain:
$$g(\bar x_{k+1}) - g(x^*) \le \frac{L_0B^2}{2}(k+1)^{-(2-2a-2b)} + \frac{Q_2\lambda_0}{(k+1)^b} \quad \text{for all } k \ge K^*.$$
Therefore, the desired relation holds with $Q_1 \triangleq B^2/\lambda_0$ and the scalar $Q_2$ defined above.

A.6 Proof of Corollary 1
First, we show that problem (2) is equivalent to problem (1) where $g_i(x) := \frac{1}{2}\|A_ix - b_i\|^2$. Let $X_1$ and $X_2$ denote the feasible sets of problems (1) and (2), respectively. Suppose $\hat x \in X_1$ is an arbitrary vector. Thus, we have $\hat x \in \operatorname{argmin}_{x\in\mathbb{R}^n}\frac{1}{2}\sum_{i=1}^m\|A_ix - b_i\|^2$. From the assumption $X_2 \ne \emptyset$, there exists a point $\bar x$ satisfying $A\bar x = b$. This implies that the minimum of the function $\frac{1}{2}\sum_{i=1}^m\|A_ix - b_i\|^2$ is zero. Therefore, $\hat x$ must satisfy $A\hat x = b$, implying that $\hat x \in X_2$. Next, suppose $\hat x \in X_2$ is an arbitrary vector. Thus, we have $\frac{1}{2}\sum_{i=1}^m\|A_i\hat x - b_i\|^2 = 0$, implying that $\hat x$ is a minimizer of $\frac{1}{2}\sum_{i=1}^m\|A_ix - b_i\|^2$. Therefore, we have $\hat x \in X_1$. We conclude that $X_1 = X_2$ and thus problems (1) and (2) are equivalent. Next, we show that Assumption 1(b) is satisfied. From the definition of the function $g_i$, we have that $\nabla g_i(x) = A_i^T(A_ix - b_i)$. We can write, for all $x, y \in \mathbb{R}^n$:
$$\|\nabla g_i(x) - \nabla g_i(y)\| = \|A_i^T(A_ix - b_i) - A_i^T(A_iy - b_i)\| \le \rho\big(A_i^TA_i\big)\|x - y\|,$$
where $\rho\big(A_i^TA_i\big)$ denotes the spectral norm of $A_i^TA_i$. Thus, we conclude that Assumption 1(b) is met for $L_g \triangleq \max_{i\in[m]}\rho\big(A_i^TA_i\big)$. Therefore, all conditions of Theorem 1 hold. To obtain the rate results in parts (a), (b), and (c), it suffices to substitute the specified values of $a$ and $b$ in the corresponding parts of Theorem 1.
Figure 2: Algorithm 1 vs. regularized Push-Pull algorithm (with a smaller choice of the regularization parameter) under different choices of R and C (columns: directed star, ring, and Lollipop graphs; rows: objective value and consensus violation, versus iterations).
Figure 3: Algorithm 1 vs. regularized Push-Pull algorithm (with the regularization parameter 0.005) under different choices of R and C (columns: directed star, ring, and Lollipop graphs; rows: objective value and consensus violation, versus iterations).

A.7 Proof of Corollary 2
Note that problem (3) is equivalent to problem (1) where $f_i(x) := \frac{\|x\|^2}{2m}$. This implies that Assumption 1(a) holds with $\mu_f = L_f = \frac{1}{m}$. Therefore, all conditions of Theorem 1 hold. To obtain the rate results in parts (a) and (b), it suffices to substitute the specified values of $a$ and $b$ in the rate results of Theorem 1.

Appendix B Supplementary numerical results
In this section, we present additional support for the numerical experiments, where we let the regularization parameter employed in the Push-Pull algorithm take different values. Figure 2 and Figure 3 show the results for smaller choices of this parameter, the smallest being 0.005, and lead to conclusions similar to those drawn from Figure 1.
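The sensitivity observed in these experiments can be mimicked in a centralized sketch (assumed toy data): with a fixed regularization parameter $\lambda$, the regularized solution $(H^TH + \lambda I)^{-1}H^Tz$ that a Push-Pull-style baseline targets carries a bias, roughly linear in $\lambda$, relative to the least-norm solution; shrinking $\lambda$ shrinks the bias but never removes it, which is why the diminishing schedule $\lambda_k \to 0$ of Algorithm 1 matters.

```python
import numpy as np

# Bias of a *fixed* regularization parameter lam on an ill-posed least-squares
# instance: distance to the least-norm solution scales roughly linearly in lam.
rng = np.random.default_rng(2)
d, n = 4, 10
H = rng.standard_normal((d, n))
z = H @ rng.standard_normal(n)
x_ls = np.linalg.pinv(H) @ z               # least 2-norm least-squares solution

def biased(lam):
    """Minimizer of ||Hx - z||^2/2 + lam*||x||^2/2 for a fixed lam."""
    return np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ z)

bias = {lam: np.linalg.norm(biased(lam) - x_ls) for lam in (0.1, 0.05, 0.005)}
print(bias)
```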