Achieving Linear Convergence in Distributed Asynchronous Multi-agent Optimization
Ye Tian, Ying Sun, and Gesualdo Scutari
Abstract—This paper studies multi-agent (convex and nonconvex) optimization over static digraphs. We propose a general distributed asynchronous algorithmic framework whereby i) agents can update their local variables as well as communicate with their neighbors at any time, without any form of coordination; and ii) they can perform their local computations using (possibly) delayed, out-of-sync information from the other agents. Delays need not be known to the agents or obey any specific profile, and can also be time-varying (but bounded). The algorithm builds on a tracking mechanism that is robust against asynchrony (in the above sense), whose goal is to estimate locally the average of agents' gradients. When applied to strongly convex functions, we prove that the algorithm converges at an R-linear (geometric) rate as long as the step-size is sufficiently small. A sublinear convergence rate is proved when nonconvex problems and/or diminishing, uncoordinated step-sizes are considered. To the best of our knowledge, this is the first distributed algorithm with provable geometric convergence rate in such a general asynchronous setting. Preliminary numerical results demonstrate the efficacy of the proposed algorithm and validate our theoretical findings.
Index Terms—Asynchrony, Delay, Directed graphs, Distributed optimization, Linear convergence, Nonconvex optimization.
I. INTRODUCTION
We study convex and nonconvex distributed optimization over a network of agents, modeled as a fixed directed graph. Agents aim at cooperatively solving the optimization problem

min_{x ∈ R^n} F(x) ≜ Σ_{i=1}^{I} f_i(x),  (P)

where f_i : R^n → R is the cost function of agent i, assumed to be smooth (possibly nonconvex) and known only to agent i. In this setting, optimization has to be performed in a distributed, collaborative manner: agents can only receive/send information from/to their immediate neighbors. Instances of (P) that require distributed computing have found a wide range of applications in different areas, including network information processing, resource allocation in communication networks, swarm robotics, and machine learning, just to name a few.

Many of the aforementioned applications give rise to extremely large-scale problems and networks, which naturally call for asynchronous, parallel solution methods. In fact, an asynchronous modus operandi reduces the idle times of workers, mitigates communication and/or memory-access congestion, saves power (as agents need not perform computations and communications at every iteration), and makes algorithms more fault-tolerant. In this paper, we consider the following very general, abstract, asynchronous model [3]:

Part of this work has been presented at the 56th Annual Allerton Conference [1] and posted on arXiv [2] in March 2018. This work has been supported by the USA National Science Foundation under Grants CIF 1632599 and CIF 1719205; and in part by the Office of Naval Research under Grant N00014-16-1-2244, and the Army Research Office under Grant W911NF1810238. The authors are with the School of Industrial Engineering, Purdue University, West Lafayette, IN, USA; Emails: {tian110, sun578, gscutari}@purdue.edu.
(i) Agents can perform their local computations as well as communicate (possibly in parallel) with their immediate neighbors at any time, without any form of coordination or centralized scheduling; and
(ii) when solving their local subproblems, agents can use outdated information from their neighbors.

In (ii), no constraint is imposed on the delay profiles: delays can be arbitrary (but bounded), time-varying, and (possibly) dependent on the specific activation rules adopted to wake up the agents in the network. This model captures in a unified fashion several forms of asynchrony: some agents execute more iterations than others; some agents communicate more frequently than others; and inter-agent communications can be unreliable and/or subject to unpredictable, time-varying delays.

Several forms of asynchrony have been studied in the literature–see Sec. I-A for an overview of related works. However, we are not aware of any distributed algorithm that is compliant with the asynchrony model (i)-(ii) and the distributed (nonconvex) setting above. Furthermore, when considering the special case of a strongly convex function F, it is not clear how to design a (first-order) distributed asynchronous algorithm (as specified above) that achieves a linear convergence rate. This paper answers these questions–see Sec. I-B and Table 1 for a summary of our contributions.

A. Literature Review
Since the seminal work [11], asynchronous parallelism has been applied to several centralized optimization algorithms, including block coordinate descent (e.g., [11]–[13]) and stochastic gradient (e.g., [14], [15]) methods. However, these schemes are not applicable to the networked setup considered in this paper, because they would require the knowledge of the function F from each agent. Distributed methods exploring (some form of) asynchrony over networks with no centralized node have been studied in [4]–[10], [16]–[26]. We group next these works based upon the features (i)-(ii) above.

(a) Random activations and no delays [16]–[20]: These schemes considered distributed convex unconstrained optimization over undirected graphs. While substantially different in the form of the updates performed by the agents–[16], [18], [20] are instances of primal-dual (proximal-based) algorithms, [19] is an ADMM-type algorithm, while [17] is based on the distributed gradient tracking mechanism introduced in [27]–[29]–all these algorithms are asynchronous in the sense of feature (i) [but not (ii)]: at each iteration, a subset of agents [16], [18], [20] (or edge-connected agents [17], [19]), chosen at random, is activated, performing then their updates and communications with their immediate neighbors; between two activations, agents are assumed to be in idle mode (i.e., able to continuously receive information). However, no form of delays is allowed: every agent must perform its local computations/updates using the most updated information from its neighbors. This means that all the actions performed by the agent(s) in an activation must be completed before a new activation (agent) takes place (wakes up), which calls for some coordination among the agents. Finally, no convergence rate was provided for the aforementioned schemes but [17], [19].

[Table 1: Comparison with state-of-the-art distributed asynchronous algorithms–Asyn. Broadcast [4], Asyn. Diffusion [5], Asyn. ADMM [6], Dual Ascent [7], ra-NRC [8], ARock [9], ASY-PrimalDual [10], and ASY-SONATA–over the features: nonconvex cost functions, no idle time, arbitrary delays, parallel updates, fixed or uncoordinated diminishing step-sizes, digraphs, global convergence to exact solutions, and rate analysis (linear rate for strongly convex F; nonconvex F). Current schemes can deal with uncoordinated activations but only with some forms of delays; ASY-SONATA enjoys all the desirable features listed in the table.]

(b) Synchronous activations and delays [21]–[26]: These schemes considered distributed constrained convex optimization over undirected graphs. They study the impact of delayed gradient information [21], [22] or communication delays (fixed [23], uniform [22], [26], or time-varying [24], [25]) on the convergence rate of distributed gradient (proximal [21], [22] or projection-based [25], [26]) algorithms or dual-averaging-based distributed schemes [23], [24]. While these schemes are all synchronous [thus lacking feature (i)], they can tolerate communication delays [an instantiation of feature (ii)], converging at a sublinear rate to an optimal solution. Delays must be such that no losses occur–every agent's message will eventually reach its destination within a finite time.

(c) Random/cyclic activations and some form of delays [4]–[10]: The class of optimization problems along with the key features of the algorithms proposed in these papers are summarized in Table 1 and briefly discussed next. The majority of these works studied distributed (strongly) convex optimization over undirected graphs, with [5] assuming that all the functions f_i have the same minimizer, [6] considering also nonconvex objectives, and [8] being implementable also over digraphs. The algorithms in [4], [5] are gradient-based schemes; [6] is a decentralized instance of ADMM; [9] applies an asynchronous parallel ADMM scheme to distributed optimization; and [10] builds on a primal-dual method.
The schemes in [7], [8] instead build on (approximate) second-order information. All these algorithms are asynchronous in the sense of feature (i): [4]–[6], [9], [10] considered random activations of the agents (or edge-connected agents), while [7], [8] studied deterministic, uncoordinated activation rules. As far as feature (ii) is concerned, some form of delays is allowed. More specifically, [4]–[6], [8] can deal with packet losses: the information sent by an agent to its neighbors either gets lost or is received with no delay. They also assume that agents are always in idle mode between two activations. Closer to the proposed asynchronous framework are the schemes in [9], [10], wherein a probabilistic model is employed to describe the activation of the agents and the aged information used in their updates. The model requires that the random variables triggering the activation of the agents are i.i.d. and independent of the delay vector used by the agent to perform its update. While this assumption makes the convergence analysis possible, in reality there is a strong dependence of the delays on the activation index; see [13] for a detailed discussion on this issue and several counterexamples. Other consequences of this model are: the schemes [9], [10] are not parallel–only one agent at a time can perform the update–and a random self-delay must be used in the update of each agent (even if agents have access to their most recent information). Furthermore, [9] calls for the solution of a convex subproblem by each agent at every iteration. Referring to the convergence rate, [9] is the only scheme exhibiting linear convergence in expectation, when each f_i is strongly convex and the graph undirected. No convergence rate is available in any of the aforementioned papers when F is nonconvex.

B. Summary of Contributions
This paper proposes a general distributed, asynchronous algorithmic framework for (strongly) convex and nonconvex instances of Problem (P), over directed graphs. The algorithm leverages a perturbed "sum-push" mechanism that is robust against asynchrony, whose goal is to track locally the average of agents' gradients; this scheme, along with its convergence analysis, is of independent interest. To the best of our knowledge, the proposed framework is the first scheme combining the following attractive features (cf. Table 1): (a) it is parallel and asynchronous [in the sense of (i) and (ii)]–multiple agents can be activated at the same time (with no coordination) and/or outdated information can be used in the agents' updates; our asynchronous setting (i)-(ii) is less restrictive than the one in [9], [10]; furthermore, in contrast with [9], our scheme avoids solving possibly complicated subproblems; (b) it is applicable to nonconvex problems, with provable convergence to stationary solutions of (P); (c) it is implementable over digraphs; (d) it employs either a constant step-size or uncoordinated diminishing ones; (e) it converges at an R-linear rate (resp. sublinear rate) when F is strongly convex (resp. nonconvex) and a constant (resp. diminishing, uncoordinated) step-size is employed; this contrasts with [9], wherein each f_i needs to be strongly convex; and (f) it is "protocol-free", meaning that agents need not obey any specific communication protocol or asynchronous modus operandi (as long as delays are bounded and agents update/communicate uniformly infinitely often).

On the technical side, convergence is studied by introducing two techniques of independent interest, namely: i) the asynchronous agent system is reduced to a synchronous "augmented" one with no delays by adding virtual agents to the graph.
While this idea was first explored in [30]–[32], the proposed enlarged system and algorithm differ from those used therein, which cannot deal with the general asynchronous model considered here–see Remark 13, Sec. VI; and ii) the rate analysis is carried out by putting forth a generalization of the small gain theorem (widely used in the literature [33] to analyze synchronous schemes), which is expected to be broadly applicable to other distributed algorithms.

C. Notation
Throughout the paper we use the following notation. Given the matrix M ≜ (M_ij)_{i,j=1}^{I}, M_{i,:} and M_{:,j} denote its i-th row vector and j-th column vector, respectively. Given the sequence {M^t}_{t=s}^{k}, with k ≥ s, we define M^{k:s} ≜ M^k M^{k−1} ··· M^{s+1} M^s, if k > s; and M^{k:s} ≜ M^s otherwise. Given two matrices (vectors) A and B of the same size, by A ≤ B we mean that B − A is a nonnegative matrix (vector). The dimensions of the all-one vector 1 and the i-th canonical vector e_i will be clear from the context. We use ‖·‖ to represent the Euclidean norm for a vector and the spectral norm for a matrix. The indicator function 1[E] of an event E equals 1 when the event E is true, and 0 otherwise. Finally, we use the conventions Σ_{t∈∅} x_t = 0 and Π_{t∈∅} x_t = 1.

II. PROBLEM SETUP AND PRELIMINARIES
A. Problem Setup
We study Problem (P) under the following assumptions.
Assumption 1 (On the optimization problem).
a. Each f_i : R^n → R is proper, closed, and L_i-Lipschitz differentiable;
b. F is bounded from below. □

Note that f_i need not be convex. We also make the blanket assumption that each agent i knows only its own f_i, but not Σ_{j≠i} f_j. To state linear convergence, we will use the following extra condition on the objective function.

Assumption 2 (Strong convexity). Assumption 1(a) holds and, in addition, F is τ-strongly convex. □

On the communication network:
The communication network of the agents is modeled as a fixed, directed graph G = (V, E), where V = {1, . . . , I} is the set of nodes (agents), and E ⊆ V × V is the set of edges (communication links). If (i, j) ∈ E, it means that agent i can send information to agent j. We assume that the digraph does not have self-loops. We denote by N_i^in the set of in-neighbors of node i, i.e., N_i^in ≜ {j ∈ V | (j, i) ∈ E}, while N_i^out ≜ {j ∈ V | (i, j) ∈ E} is the set of out-neighbors of agent i. We make the following standard assumption on the graph connectivity.

Assumption 3.
The graph G is strongly connected. □
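Assumption 3 can be checked mechanically on a given digraph with two breadth-first searches: strong connectivity holds iff every node is reachable from an arbitrary root both in the graph and in its transpose. The following sketch uses this criterion; the dictionary-based graph encoding and the function names are our own illustration, not notation from the paper.

```python
from collections import deque

def reachable_from(root, nbrs):
    """Nodes reachable from `root` by breadth-first search."""
    seen, queue = {root}, deque([root])
    while queue:
        i = queue.popleft()
        for j in nbrs.get(i, []):
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def is_strongly_connected(nodes, edges):
    """A digraph is strongly connected iff, from an arbitrary root, every node
    is reachable in the graph and in its transpose."""
    out_nbrs, in_nbrs = {}, {}
    for (i, j) in edges:                      # (i, j) in E: i can send to j
        out_nbrs.setdefault(i, []).append(j)
        in_nbrs.setdefault(j, []).append(i)
    root = next(iter(nodes))
    return (reachable_from(root, out_nbrs) == set(nodes)
            and reachable_from(root, in_nbrs) == set(nodes))

print(is_strongly_connected({1, 2, 3}, [(1, 2), (2, 3), (3, 1)]))  # True
print(is_strongly_connected({1, 2, 3}, [(1, 2), (2, 3)]))          # False
```

The directed 3-cycle passes the test, while removing any edge of the cycle breaks strong connectivity.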
The proposed asynchronous algorithmic framework builds on the synchronous SONATA algorithm, proposed in [34], [35] to solve (nonconvex) multi-agent optimization problems over time-varying digraphs. This is motivated by the fact that SONATA has the unique property of being provably applicable to both convex and nonconvex problems, and it achieves linear convergence when applied to strongly convex objectives F. We thus begin by reviewing SONATA, tailored to (P); we then generalize it to the asynchronous setting (cf. Sec. IV).

Every agent i controls and iteratively updates the tuple (x_i, y_i, z_i, φ_i): x_i is agent i's copy of the shared variable x in (P); y_i acts as a local proxy of the sum-gradient ∇F; and z_i and φ_i are auxiliary variables instrumental to deal with communications over digraphs. Let x_i^k, z_i^k, φ_i^k, and y_i^k denote the values of the aforementioned variables at iteration k ∈ N. The update of each agent i reads:

x_i^{k+1} = Σ_{j∈N_i^in ∪ {i}} w_ij (x_j^k − α^k y_j^k),  (1)
z_i^{k+1} = Σ_{j∈N_i^in ∪ {i}} a_ij z_j^k + ∇f_i(x_i^{k+1}) − ∇f_i(x_i^k),  (2)
φ_i^{k+1} = Σ_{j∈N_i^in ∪ {i}} a_ij φ_j^k,  (3)
y_i^{k+1} = z_i^{k+1} / φ_i^{k+1},  (4)

with z_i^0 = y_i^0 = ∇f_i(x_i^0) and φ_i^0 = 1, for all i ∈ V. In (1), y_i^k is a local estimate of the average-gradient (1/I) Σ_{i=1}^{I} ∇f_i(x_i^k). Therefore, every agent first moves along the estimated gradient direction, generating x_i^k − α^k y_i^k (α^k is the step-size), and then performs a consensus step to force asymptotic agreement among the local variables x_i. Steps (2)-(4) represent a perturbed push-sum update, aiming at tracking the gradient (1/I)∇F [28], [29], [35]. The weight-matrices W ≜ (w_ij)_{i,j=1}^{I} and A ≜ (a_ij)_{i,j=1}^{I} satisfy the following standard assumptions.

Assumption 4 (On the weight-matrices). The weight-matrices W ≜ (w_ij)_{i,j=1}^{I} and A ≜ (a_ij)_{i,j=1}^{I} satisfy (we will write M ≜ (m_ij)_{i,j=1}^{I} to denote either A or W):
a.
∃ m̄ > 0 such that m_ii ≥ m̄, for all i ∈ V; and m_ij ≥ m̄, for all (j, i) ∈ E; m_ij = 0 otherwise;
b. W is row-stochastic, that is, W1 = 1;
c. A is column-stochastic, that is, A^⊤1 = 1. □

In [33], a special instance of SONATA was proved to converge at an R-linear rate when F is strongly convex. This result was extended to constrained, nonsmooth (composite), distributed optimization in [36]. A natural question is whether SONATA works also in an asynchronous setting while still converging at a linear rate. Naive asynchronization of the updates (1)-(4)–such as using uncoordinated activations and/or replacing instantaneous information with delayed one–would not work. For instance, the tracking (2)-(4) calls for the invariance of the averages, i.e., Σ_{i=1}^{I} z_i^k = Σ_{i=1}^{I} ∇f_i(x_i^k), for all k ∈ N. It is not difficult to check that any perturbation in (2)–e.g., in the form of delays or packet losses–puts this property in jeopardy. To cope with the above challenges, a first step is robustifying the gradient tracking scheme. In Sec. III, we introduce P-ASY-SUM-PUSH–an asynchronous, perturbed instance of the push-sum algorithm [37]–which serves as a unified algorithmic framework to accomplish several tasks over digraphs in an asynchronous manner, such as solving the average consensus problem and tracking the average of agents' time-varying signals. Building on P-ASY-SUM-PUSH, in Sec. IV, we finally present the proposed distributed asynchronous optimization framework, termed ASY-SONATA.

III. PERTURBED ASYNCHRONOUS SUM-PUSH
We present P-ASY-SUM-PUSH; the algorithm was first introduced in our conference paper [1], to which we refer for details on the genesis of the scheme and intuitions; here we directly introduce the scheme and study its convergence.

Consider an asynchronous setting wherein agents compute and communicate independently, without coordination. Every agent i maintains the state variables z_i, φ_i, y_i, along with the following auxiliary variables that are instrumental to deal with uncoordinated activations and delayed information: i) the cumulative-mass variables ρ_ji and σ_ji, with j ∈ N_i^out, which capture the cumulative (sum) information generated by agent i up to the current time and to be sent to agent j ∈ N_i^out; consequently, ρ_ij and σ_ij are received by i from its in-neighbors j ∈ N_i^in; and ii) the buffer variables ρ̃_ij and σ̃_ij, with j ∈ N_i^in, which store the information sent from j ∈ N_i^in to i and used by i in its last update. Values of these variables at iteration k ∈ N are denoted by the same symbols with the superscript "k". Note that, because of the asynchrony, each agent i might have outdated ρ_ij and σ_ij; ρ_ij^{k−d_j^k} (resp. σ_ij^{k−d_j^k}) is a delayed version of the current ρ_ij^k (resp. σ_ij^k) owned by j at time k, where 0 ≤ d_j^k ≤ D < ∞ is the delay. Similarly, ρ̃_ij and σ̃_ij might differ from the last information generated by j for i, because agent i might not have received that information yet (due to delays) or never will (due to packet losses).

The proposed asynchronous algorithm, P-ASY-SUM-PUSH, is summarized in Algorithm 1. A global iteration clock (not known to the agents) is introduced: k → k+1 is triggered by the completion by one agent, say i^k, of the following actions. (S.2): agent i^k maintains a local variable τ_{i^k j}, for each j ∈ N_{i^k}^in, which keeps track of the "age" (generation time) of the (ρ, σ)-variables that it has received from its in-neighbors and already used.
If k − d_j^k is larger than the current counter τ_{i^k j}^{k−1}, indicating that the received (ρ, σ)-variables are newer than those currently stored, agent i^k accepts ρ_{i^k j}^{k−d_j^k} and σ_{i^k j}^{k−d_j^k}, and updates τ_{i^k j} to k − d_j^k; otherwise, the variables are discarded and τ_{i^k j} remains unchanged. Note that (5) can be performed without any coordination: it is sufficient that each agent attaches a time-stamp to the information it produces, reflecting its local timing counter. We describe next the other steps, assuming that new information has come in to agent i^k, that is, τ_{i^k j} = k − d_j^k. (S.3.1): In (6), agent i^k builds the intermediate "mass" z_{i^k}^{k+} based upon its current information z_{i^k}^k and ρ̃_{i^k j}^k, and the (possibly) delayed one from its in-neighbors, ρ_{i^k j}^{k−d_j^k}; ε^k ∈ R^n is an exogenous perturbation (later this perturbation will be properly chosen to accomplish specific goals, see Sec. IV). Note that the way agent i^k forms its own estimates ρ_{i^k j}^{k−d_j^k} is immaterial to the description of the algorithm. The local buffer ρ̃_{i^k j}^k stores the value of ρ_{i^k j} that agent i^k used in its last update. Therefore, if the information in ρ_{i^k j}^{k−d_j^k} is not older than the one in ρ̃_{i^k j}^k, the difference ρ_{i^k j}^{k−d_j^k} − ρ̃_{i^k j}^k in (6) captures the sum of the a_{i^k j} z_j's that have been generated by j ∈ N_{i^k}^in for i^k up until k − d_j^k and not used by agent i^k yet. For instance, in a synchronous setting, one would have ρ_{i^k j}^k − ρ̃_{i^k j}^k = a_{i^k j} z_j^{k+}. (S.3.2): the generated z_{i^k}^{k+} is "pushed back" to agent i^k itself and its out-neighbors. Specifically, out of the total mass z_{i^k}^{k+} generated, agent i^k keeps a_{i^k i^k} z_{i^k}^{k+}, determining the update z_{i^k}^k → z_{i^k}^{k+1}, while the remaining mass is allocated to the agents j ∈ N_{i^k}^out, with a_{j i^k} z_{i^k}^{k+} cumulating to the mass buffer ρ_{j i^k}^k and generating the update ρ_{j i^k}^k → ρ_{j i^k}^{k+1}, to be sent to agent j. (S.3.3): each local buffer variable ρ̃_{i^k j}^k is updated to account for the use of new information from j ∈ N_{i^k}^in. The final information is then read on the y-variables [cf. (S.3.4)].

Algorithm 1: P-ASY-SUM-PUSH (Global View)

Data: z_i^0 ∈ R^n, φ_i^0 = 1, ρ̃_ij^0 = 0, σ̃_ij^0 = 0, τ_ij^{−1} = −D, for all j ∈ N_i^in and i ∈ V; σ_ij^t = 0 and ρ_ij^t = 0, for all t = −D, . . . , 0; and {ε^k}_{k∈N}. Set k = 0.
While a termination criterion is not met, do:
(S.1) Pick (i^k, d^k), with d^k ≜ (d_j^k)_{j∈N_{i^k}^in};
(S.2) Set (purge out the old information):
  τ_{i^k j}^k = max(τ_{i^k j}^{k−1}, k − d_j^k),  ∀ j ∈ N_{i^k}^in;  (5)
(S.3) Update the variables performing
• (S.3.1) Sum step:
  z_{i^k}^{k+} = z_{i^k}^k + Σ_{j∈N_{i^k}^in} (ρ_{i^k j}^{τ_{i^k j}^k} − ρ̃_{i^k j}^k) + ε^k  (6)
  φ_{i^k}^{k+} = φ_{i^k}^k + Σ_{j∈N_{i^k}^in} (σ_{i^k j}^{τ_{i^k j}^k} − σ̃_{i^k j}^k)
• (S.3.2) Push step:
  z_{i^k}^{k+1} = a_{i^k i^k} z_{i^k}^{k+},  φ_{i^k}^{k+1} = a_{i^k i^k} φ_{i^k}^{k+}
  ρ_{j i^k}^{k+1} = ρ_{j i^k}^k + a_{j i^k} z_{i^k}^{k+},  (7)
  σ_{j i^k}^{k+1} = σ_{j i^k}^k + a_{j i^k} φ_{i^k}^{k+},  ∀ j ∈ N_{i^k}^out
• (S.3.3) Mass-Buffer update:
  ρ̃_{i^k j}^{k+1} = ρ_{i^k j}^{τ_{i^k j}^k},  σ̃_{i^k j}^{k+1} = σ_{i^k j}^{τ_{i^k j}^k},  ∀ j ∈ N_{i^k}^in  (8)
• (S.3.4) Set: y_{i^k}^{k+1} = z_{i^k}^{k+1} / φ_{i^k}^{k+1}.
(S.4) Untouched state variables shift to state k + 1 while keeping the same value; k ← k + 1.

Remark 5 (Global view description).
Note that each agent's update is fully defined once i^k and d^k are given. The selection of (i^k, d^k) in (S.1) is not performed by anyone; it is instead an a-posteriori description of agents' actions: all agents act asynchronously and continuously; the agent completing the "push" step and updating its own variables triggers retrospectively the iteration counter k → k + 1 and determines the pair (i^k, d^k), along with all quantities involved in the other steps. Differently from most of the current literature, this "global view" description of the agents' actions allows us to abstract from specific computation-communication protocols and asynchronous modi operandi, and captures by a unified model a gamut of asynchronous schemes.

Convergence is given under the following assumptions.

Assumption 6 (On the asynchronous model). Suppose:
a. ∃ 0 < T < ∞ such that ∪_{t=k}^{k+T−1} {i^t} = V, for all k ∈ N;
b. ∃ 0 < D < ∞ such that 0 ≤ d_j^k ≤ D, for all j ∈ N_{i^k}^in and k ∈ N. □

The next theorem studies the convergence of P-ASY-SUM-PUSH, establishing geometric decay of the error ‖y_i^k − (1/I)·m_z^k‖, even in the presence of unknown (bounded) perturbations, where m_z^k ≜ Σ_{i=1}^{I} z_i^k + Σ_{(j,i)∈E} (ρ_ij^k − ρ̃_ij^k) represents the "total mass" of the system at iteration k.

Theorem 7.
Let {y^k ≜ [y_1^k, . . . , y_I^k]^⊤, z^k ≜ [z_1^k, . . . , z_I^k]^⊤, (ρ_ij^k, ρ̃_ij^k)_{(j,i)∈E}}_{k∈N} be the sequence generated by Algorithm 1, under Assumptions 3 and 6, and with A ≜ (a_ij)_{i,j=1}^{I} satisfying Assumption 4(a),(c). Define K ≜ (2I − 1)·T + I·D. There exist constants ρ ∈ (0, 1) and C > 0 such that

‖y_i^{k+1} − (1/I)·m_z^{k+1}‖ ≤ C (ρ^k ‖z^0‖ + Σ_{l=0}^{k} ρ^{k−l} ‖ε^l‖),  (9)

for all i ∈ V and k ≥ K − 1. Furthermore, m_z^k = Σ_{i=1}^{I} z_i^0 + Σ_{t=0}^{k−1} ε^t.

Proof. See Sec. VI.
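As an illustration of Theorem 7, the following self-contained sketch simulates the perturbation-free case (ε^k = 0) of Algorithm 1 with scalar states: cyclic activations (so Assumption 6(a) holds with T = I) and i.i.d. bounded delays on the (ρ, σ) reads. The toy digraph, the uniform column-stochastic weights, and all variable names are our own illustrative choices, not prescriptions from the paper.

```python
import random

def p_asy_sum_push(z0, out_nbrs, num_iters=2000, max_delay=2, seed=0):
    """Scalar, perturbation-free (eps^k = 0) simulation of Algorithm 1:
    cyclic activations, bounded random delays on the (rho, sigma) reads."""
    rng = random.Random(seed)
    agents = sorted(z0)
    in_nbrs = {i: [j for j in agents if i in out_nbrs[j]] for i in agents}
    # Column-stochastic A: sender i splits its mass uniformly over itself
    # and its out-neighbors (an arbitrary illustrative choice).
    a = {i: 1.0 / (1 + len(out_nbrs[i])) for i in agents}
    z, phi, y = dict(z0), {i: 1.0 for i in agents}, dict(z0)
    edges = [(i, j) for i in agents for j in in_nbrs[i]]   # (receiver, sender)
    rho = {e: 0.0 for e in edges}; sig = {e: 0.0 for e in edges}
    rho_t = {e: 0.0 for e in edges}; sig_t = {e: 0.0 for e in edges}  # buffers
    hist_r = {e: [0.0] for e in edges}; hist_s = {e: [0.0] for e in edges}
    tau = {e: 0 for e in edges}
    for k in range(num_iters):
        i = agents[k % len(agents)]                 # cyclic activation (T = I)
        zp, pp = z[i], phi[i]
        for j in in_nbrs[i]:
            d = rng.randint(0, min(max_delay, k))   # delay d_j^k <= D
            tau[(i, j)] = max(tau[(i, j)], k - d)   # (S.2): purge old information
            zp += hist_r[(i, j)][tau[(i, j)]] - rho_t[(i, j)]   # (S.3.1) sum step
            pp += hist_s[(i, j)][tau[(i, j)]] - sig_t[(i, j)]
            rho_t[(i, j)] = hist_r[(i, j)][tau[(i, j)]]         # (S.3.3) buffers
            sig_t[(i, j)] = hist_s[(i, j)][tau[(i, j)]]
        z[i], phi[i] = a[i] * zp, a[i] * pp         # (S.3.2) push: keep own share
        for j in out_nbrs[i]:
            rho[(j, i)] += a[i] * zp                # push the rest downstream
            sig[(j, i)] += a[i] * pp
        y[i] = z[i] / phi[i]                        # (S.3.4)
        for e in edges:                             # record the state at time k+1
            hist_r[e].append(rho[e]); hist_s[e].append(sig[e])
    return y

out_nbrs = {1: [2], 2: [3], 3: [1, 2]}              # strongly connected toy digraph
z0 = {1: 1.0, 2: 5.0, 3: -3.0}
y = p_asy_sum_push(z0, out_nbrs)
avg = sum(z0.values()) / len(z0)                    # total mass / I = 1.0
print(max(abs(y[i] - avg) for i in z0))             # geometrically small error
```

Despite the delayed reads, every y_i approaches the average of the initial z-values, consistent with the error-free instance of (9).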
Discussion:
Several comments are in order.
1) On the asynchronous model:
Algorithm 1 captures a gamut of asynchronous parallel schemes and architectures through the mechanism of generation of (i^k, d^k). Assumption 6 on (i^k, d^k) is quite mild: (a) controls the frequency of the updates, whereas (b) limits the age of the old information used in the computations; both can be easily enforced in practice. For instance, (a) is readily satisfied if each agent wakes up and performs an update whenever some independent internal clock ticks or it is triggered by some of its neighbors; (b) imposes conditions on the frequency and quality of the communications: information used by each agent cannot become infinitely old, implying that successful communications must occur sufficiently often. This, however, does not enforce any specific protocol on the activations/idle times/communications. For instance, i) agents need not perform the actions in Algorithm 1 sequentially or inside the same activation round; and ii) executing the "push" step does not mean that agents must broadcast their new variables in the same activation; this would just incur a delay (or packet loss) in the communication.

Note that the time-varying nature of the delays d^k also permits modeling packet losses, as detailed next. Suppose that at iteration k agent j sends its current (ρ, σ)-variables to its out-neighbor ℓ and they get lost; and let k′ be the subsequent iteration at which j updates again. Let t be the first iteration after k′ at which agent ℓ performs its update; it will use information from j such that t − d_j^t ∉ [k + 1, k′], for some d_j^t ≤ D < ∞. If t − d_j^t < k + 1, no newer information from j has been used by ℓ; otherwise t − d_j^t ≥ k′ + 1 (implying k′ < t), meaning that agent ℓ has used information not older than k′ + 1.
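Conditions (a) and (b) of Assumption 6 can be checked mechanically on a finite activation/delay trace. The following hypothetical helper (trace encoding and names are ours) verifies that every window of T consecutive activations covers all agents and that all delays are bounded by D:

```python
def satisfies_assumption6(trace, agents, T, D):
    """Check a finite trace of pairs (i^k, d^k) against Assumption 6:
    (a) every window of T consecutive activations covers all agents;
    (b) every delay lies in [0, D].
    `trace` is a list of (agent, {in_neighbor: delay}) pairs."""
    acts = [i for (i, _) in trace]
    windows_ok = all(set(acts[k:k + T]) == set(agents)
                     for k in range(len(acts) - T + 1))
    delays_ok = all(0 <= d <= D
                    for (_, delays) in trace for d in delays.values())
    return windows_ok and delays_ok

# Round-robin activations over a 3-agent digraph, delays bounded by 2.
trace = [(1, {3: 0}), (2, {1: 1}), (3, {2: 2}),
         (1, {3: 1}), (2, {1: 0}), (3, {2: 1})]
print(satisfies_assumption6(trace, {1, 2, 3}, T=3, D=2))  # True
print(satisfies_assumption6(trace, {1, 2, 3}, T=2, D=2))  # False
```

The same round-robin trace satisfies Assumption 6(a) with T = 3 but not with T = 2, since a window of two activations cannot cover three agents.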
2) Comparison with [8], [30], [38]:
The use of counter variables [such as the (ρ, σ, ρ̃, σ̃)-variables in our scheme] was first introduced in [30] to design a synchronous average consensus algorithm robust to packet losses. In [38], this scheme was extended to deal with uncoordinated (deterministic) agents' activations, whereas [8] built on [38] to design, in the same setting, a distributed Newton-Raphson algorithm. There are important differences between P-ASY-SUM-PUSH and the aforementioned schemes, namely: i) none of them can deal with delays, but only with packet losses; ii) [30] is synchronous; and iii) [8], [38] are not parallel schemes, as at each iteration only one agent is allowed to wake up and transmit information to its neighbors. For instance, [8], [38] cannot model synchronous parallel (Jacobi) updates. Hence, the convergence analysis of P-ASY-SUM-PUSH calls for a new line of proof, as introduced in Sec. VI.
3) Beyond average consensus:
By properly choosing the perturbation signal ε^k, P-ASY-SUM-PUSH can solve different problems. Some examples are discussed next.

(i) Error free: ε^k = 0. P-ASY-SUM-PUSH solves the average consensus problem, and (9) reads

‖y_i^{k+1} − (1/I)·Σ_{i=1}^{I} z_i^0‖ ≤ C ρ^k ‖z^0‖.

(ii) Vanishing error: lim_{k→∞} ‖ε^k‖ = 0. Using [29, Lemma 7(a)], (9) yields lim_{k→∞} ‖y_i^{k+1} − (1/I)·m_z^{k+1}‖ = 0.

(iii) Asynchronous tracking: Each agent i owns a (time-varying) signal {u_i^k}_{k∈N}; the average tracking problem consists in asymptotically tracking the average signal ū^k ≜ (1/I)·Σ_{i=1}^{I} u_i^k, that is,

lim_{k→∞} ‖y_i^{k+1} − ū^{k+1}‖ = 0, ∀ i ∈ V.  (10)

Under mild conditions on the signal, this can be accomplished in a distributed and asynchronous fashion using P-ASY-SUM-PUSH, as formalized next.

Corollary 7.1.
Consider the following setting in P-ASY-SUM-PUSH: z_i^0 = u_i^0, for all i ∈ V; ε^k = u_{i^k}^{k+1} − ũ_{i^k}^k, with

ũ_i^{k+1} = { u_i^{k+1}, if i = i^k;  ũ_i^k, otherwise },  ũ_i^0 = u_i^0.

Then (9) holds, with m_z^{k+1} = Σ_{i=1}^{I} ũ_i^{k+1}. Furthermore, if lim_{k→∞} Σ_{i=1}^{I} ‖u_i^{k+1} − u_i^k‖ = 0, then (10) holds.

Proof. See Appendix E.

This instance of P-ASY-SUM-PUSH will be used in Sec. IV to perform asynchronous gradient tracking.
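The tracking instance of Corollary 7.1 can be sketched as follows, in a deliberately simplified regime (scalar signals, zero communication delays, random-permutation activations, uniform push weights, and a converging toy signal u_i(t) = c_i + b_i·0.99^t; all of these choices, and all names, are our own simplifications, not the paper's general setting). Even with zero delays the cumulative-mass and buffer variables are still needed, because agents consume their in-neighbors' pushes only when they are activated.

```python
import random

def asy_tracking(out_nbrs, c, b, num_iters=3000, seed=1):
    """Track the average of the signals u_i(t) = c_i + b_i * 0.99**t by feeding
    the perturbation eps^k = u_{i^k}(k+1) - utilde_{i^k} into the sum-push updates."""
    rng = random.Random(seed)
    agents = sorted(out_nbrs)
    in_nbrs = {i: [j for j in agents if i in out_nbrs[j]] for i in agents}
    a = {i: 1.0 / (1 + len(out_nbrs[i])) for i in agents}  # column-stochastic A
    u = lambda i, t: c[i] + b[i] * 0.99 ** t
    z = {i: u(i, 0) for i in agents}                       # z_i^0 = u_i^0
    phi = {i: 1.0 for i in agents}
    ut = {i: u(i, 0) for i in agents}                      # utilde_i^0 = u_i^0
    y = dict(z)
    edges = [(i, j) for i in agents for j in in_nbrs[i]]
    rho = {e: 0.0 for e in edges}; sig = {e: 0.0 for e in edges}
    rho_t = {e: 0.0 for e in edges}; sig_t = {e: 0.0 for e in edges}
    order = []
    for k in range(num_iters):
        if not order:                                      # permutation rounds:
            order = rng.sample(agents, len(agents))        # Assumption 6(a), T = 2I-1
        i = order.pop()
        eps = u(i, k + 1) - ut[i]                          # Corollary 7.1 perturbation
        ut[i] = u(i, k + 1)
        zp, pp = z[i] + eps, phi[i]
        for j in in_nbrs[i]:                               # zero-delay (rho, sigma) reads
            zp += rho[(i, j)] - rho_t[(i, j)]
            pp += sig[(i, j)] - sig_t[(i, j)]
            rho_t[(i, j)] = rho[(i, j)]; sig_t[(i, j)] = sig[(i, j)]
        z[i], phi[i] = a[i] * zp, a[i] * pp                # push step
        for j in out_nbrs[i]:
            rho[(j, i)] += a[i] * zp; sig[(j, i)] += a[i] * pp
        y[i] = z[i] / phi[i]
    return y

out_nbrs = {1: [2], 2: [3], 3: [1, 2]}
c = {1: 2.0, 2: -1.0, 3: 5.0}; b = {1: 1.0, 2: -2.0, 3: 0.5}
y = asy_tracking(out_nbrs, c, b)
ubar = sum(c.values()) / len(c)                            # limit of the average = 2.0
print(max(abs(y[i] - ubar) for i in c))                    # small tracking error
```

Since Σ_i ‖u_i^{k+1} − u_i^k‖ → 0 for this signal, (10) predicts that each y_i tracks the average signal, here converging to the mean of the c_i's.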
Remark 8 (Asynchronous average consensus). To the best of our knowledge, the error-free instance of P-ASY-SUM-PUSH discussed above is the first (step-size-free) scheme that provably solves the average consensus problem at a linear rate under the general asynchronous model described by Assumption 6. In fact, the existing asynchronous consensus schemes [31], [32] achieve an agreement among the agents' local variables whose value is in general not the average of their initial values, but rather some unknown function of them and of the asynchronous modus operandi of the agents. Related to P-ASY-SUM-PUSH is the ra-AC algorithm in [38], which enjoys the same convergence property but under a more restrictive and specific asynchronous model (no delays, only packet losses, and single-agent activation per iteration).
IV. ASYNCHRONOUS SONATA (ASY-SONATA)

We are now ready to introduce our distributed asynchronous algorithm, ASY-SONATA. The algorithm combines SONATA (cf. Sec. II-B) with P-ASY-SUM-PUSH (cf. Sec. III), the latter replacing the synchronous tracking scheme (2)-(4). The "global view" of the scheme is given in Algorithm 2.
Algorithm 2: ASY-SONATA (Global View)

Data: For all agents i and ∀ j ∈ N_i^in: x_i^0 ∈ R^n, z_i^0 = ∇f_i(x_i^0), φ_i^0 = 1, ρ̃_ij^0 = 0, σ̃_ij^0 = 0, τ_ij^{−1} = −D. And for t = −D, −D + 1, . . . , 0: ρ_ij^t = 0, σ_ij^t = 0, v_i^t = 0. Set k = 0.
While a termination criterion is not met, do:
(S.1) Pick (i^k, d^k);
(S.2) Set: τ_{i^k j}^k = max(τ_{i^k j}^{k−1}, k − d_j^k), ∀ j ∈ N_{i^k}^in.
(S.3) Local Descent:
  v_{i^k}^{k+1} = x_{i^k}^k − γ^k z_{i^k}^k.  (11)
(S.4) Consensus:
  x_{i^k}^{k+1} = w_{i^k i^k} v_{i^k}^{k+1} + Σ_{j∈N_{i^k}^in} w_{i^k j} v_j^{τ_{i^k j}^k}.
(S.5) Gradient Tracking:
• (S.5.1) Sum step:
  z_{i^k}^{k+} = z_{i^k}^k + Σ_{j∈N_{i^k}^in} (ρ_{i^k j}^{τ_{i^k j}^k} − ρ̃_{i^k j}^k) + ∇f_{i^k}(x_{i^k}^{k+1}) − ∇f_{i^k}(x_{i^k}^k)
• (S.5.2) Push step:
  z_{i^k}^{k+1} = a_{i^k i^k} z_{i^k}^{k+},  ρ_{j i^k}^{k+1} = ρ_{j i^k}^k + a_{j i^k} z_{i^k}^{k+},  ∀ j ∈ N_{i^k}^out
• (S.5.3) Mass-Buffer update:
  ρ̃_{i^k j}^{k+1} = ρ_{i^k j}^{τ_{i^k j}^k},  ∀ j ∈ N_{i^k}^in
(S.6) Untouched state variables shift to state k + 1 while keeping the same value; k ← k + 1.

In ASY-SONATA, agents continuously and with no coordination perform: i) their local computations [cf. (S.3)], possibly using an out-of-sync estimate z_{i^k}^k of the average gradient; in (11), γ^k is a step-size (to be properly chosen); ii) a consensus step on the x-variables, using possibly outdated information v_j^{τ_{i^k j}^k} from their in-neighbors [cf. (S.4)]; and iii) gradient tracking [cf. (S.5)] to update the local estimate z_{i^k}^k, based on the current cumulative-mass variables ρ_{i^k j}^{τ_{i^k j}^k} and buffer variables ρ̃_{i^k j}^k, j ∈ N_{i^k}^in.

Note that in Algorithm 1 the tracking variable y_{i^k}^{k+1} is obtained by rescaling z_{i^k}^{k+1} by the factor 1/φ_{i^k}^{k+1}. In Algorithm 2, we absorbed the scaling 1/φ_{i^k}^{k+1} into the step-size and use directly z_{i^k}^{k+1} as a proxy of the average gradient, thus eliminating the φ-variables (and the related σ- and σ̃-variables).
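A minimal sketch of Algorithm 2 on a toy strongly convex problem follows, again under simplifying assumptions of our own (scalar variables, zero delays, random-permutation activations, uniform weights, and a step-size chosen heuristically rather than via the theoretical bound γ̄ of Theorem 9). Each agent privately holds f_i(x) = 0.5·(x − t_i)², so the unique minimizer of F is the mean of the targets t_i.

```python
import random

def asy_sonata(targets, out_nbrs, gamma=0.1, num_iters=9000, seed=2):
    """Toy ASY-SONATA run: agent i privately minimizes f_i(x) = 0.5*(x - t_i)^2;
    the unique minimizer of F = sum_i f_i is the mean of the targets t_i."""
    rng = random.Random(seed)
    agents = sorted(out_nbrs)
    in_nbrs = {i: [j for j in agents if i in out_nbrs[j]] for i in agents}
    grad = lambda i, x: x - targets[i]
    w = {i: 1.0 / (1 + len(in_nbrs[i])) for i in agents}   # row-stochastic W
    a = {i: 1.0 / (1 + len(out_nbrs[i])) for i in agents}  # column-stochastic A
    x = {i: 0.0 for i in agents}; v = dict(x)
    z = {i: grad(i, x[i]) for i in agents}                 # z_i^0 = grad f_i(x_i^0)
    rho = {(i, j): 0.0 for i in agents for j in in_nbrs[i]}
    rho_t = dict(rho)
    order = []
    for k in range(num_iters):
        if not order:                               # permutation rounds satisfy
            order = rng.sample(agents, len(agents)) # Assumption 6(a) with T = 2I-1
        i = order.pop()
        g_old = grad(i, x[i])
        v[i] = x[i] - gamma * z[i]                  # (S.3) local descent (11)
        x[i] = w[i] * (v[i] + sum(v[j] for j in in_nbrs[i]))  # (S.4) consensus
        zp = z[i] + grad(i, x[i]) - g_old           # (S.5.1) sum step ...
        for j in in_nbrs[i]:                        # ... with zero-delay rho reads
            zp += rho[(i, j)] - rho_t[(i, j)]
            rho_t[(i, j)] = rho[(i, j)]             # (S.5.3) mass-buffer update
        z[i] = a[i] * zp                            # (S.5.2) push step
        for j in out_nbrs[i]:
            rho[(j, i)] += a[i] * zp
    return x

targets = {1: 1.0, 2: 4.0, 3: -2.0}
out_nbrs = {1: [2], 2: [3], 3: [1, 2]}
x = asy_sonata(targets, out_nbrs)
x_star = sum(targets.values()) / len(targets)       # = 1.0
print(max(abs(x[i] - x_star) for i in targets))     # small optimality gap
```

All local copies x_i approach the exact global minimizer, illustrating the exact convergence that a naive asynchronization of (1)-(4) would lose.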
Also, for notational simplicity and without loss of generality, we assumed that the $v$- and $\rho$-variables are subject to the same delays (e.g., they are transmitted within the same packet); the same convergence results hold if different delays are considered. We now study convergence of the scheme, under either a constant step-size or diminishing, uncoordinated ones.

A. Constant Step-size
Theorem 9 below establishes linear convergence of ASY-SONATA when $F$ is strongly convex.

Theorem 9 (Geometric convergence). Consider (P) under Assumption 2, and let $x^\star$ denote its unique solution. Let $\{(x_i^k)_{i=1}^I\}_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 2, under Assumptions 3 and 6, and with weight matrices $W$ and $A$ satisfying Assumption 4. Then, there exists a constant $\bar\gamma_1 > 0$ [cf. (46)] such that, if $\gamma^k \equiv \gamma \le \bar\gamma_1$, it holds

$M_{sc}(x^k) \triangleq \|x^k - 1_I \otimes x^\star\| = O(\lambda^k)$,   (12)

with $\lambda \in (0,1)$ given by

$\lambda = \begin{cases} 1 - \tau\,\bar m^{K}\,\gamma, & \text{if } \gamma \in (0, \hat\gamma_1],\\ \rho + \sqrt{J\,\gamma}, & \text{if } \gamma \in (\hat\gamma_1, \hat\gamma_2), \end{cases}$   (13)

where $\hat\gamma_1$ and $\hat\gamma_2$ are some constants strictly smaller than $\bar\gamma_1$, and $J \triangleq (1-\rho)^2/\hat\gamma_2$.

Proof. See Sec. VII.

When $F$ is convex (resp. nonconvex), we introduce the following merit function to measure the progress of the algorithm towards optimality (resp. stationarity) and consensus:

$M_F(x^k) \triangleq \max\big\{ \|\nabla F(\bar x^k)\|,\; \|x^k - 1_I \otimes \bar x^k\| \big\}$,   (14)

where $x^k \triangleq [x_1^{k\top}, \cdots, x_I^{k\top}]^\top$ and $\bar x^k \triangleq (1/I)\cdot\sum_{i=1}^I x_i^k$. Note that $M_F$ is a valid merit function, since it is continuous and $M_F(x) = 0$ if and only if all the $x_i$'s are consensual and optimal (resp. stationary) solutions.

Theorem 10 (Sublinear convergence). Consider (P) under Assumption 1 (thus possibly nonconvex). Let $\{(x_i^k)_{i=1}^I\}_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 2, in the same setting of Theorem 9. Given $\delta > 0$, let $T_\delta$ be the first iteration $k \in \mathbb{N}$ such that $M_F(x^k) \le \delta$. Then, there exists a $\bar\gamma_2 > 0$ [cf. (55)] such that, if $\gamma^k \equiv \gamma \le \bar\gamma_2$, $T_\delta = O(1/\delta)$. The values of the above constants are given in the proof.

Proof. See Sec. VIII.

Theorem 9 states that the consensus and optimization errors of the sequence generated by ASY-SONATA vanish at a linear rate. We are not aware of any other scheme enjoying such a property in such a distributed, asynchronous computing environment.
For general, possibly nonconvex instances of Problem (P), Theorem 10 shows that both consensus and optimization errors of the sequence generated by ASY-SONATA vanish at a sublinear rate of $O(1/\delta)$.

The choice of a proper step-size calls for estimates of $\bar\gamma_1$ and $\bar\gamma_2$ in Theorems 9 and 10, which depend on the following quantities: the optimization parameters $L_i$ (Lipschitz constants of the gradients) and $\tau$ (strong convexity constant), the network connectivity parameter $\rho$, and the constants $D$ and $T$ due to the asynchrony (cf. Assumption 6). Notice that the dependence of the step-size on $L_i$, $\tau$, and $\rho$ is common to all existing distributed synchronous algorithms, and so is that on $T$ and $D$ to (even centralized) asynchronous algorithms [3]. While $L_i$, $\tau$, and $\rho$ can be acquired following approaches discussed in the literature (see, e.g., [33, Remark 4]), it is less clear how to estimate $D$ and $T$, as they are related to the asynchronous model, which is generally not known to the agents. As an example, we address this question considering the following fairly general model for the agents' activations and asynchronous communications. Suppose that the length of any time window between consecutive "push" steps of any agent belongs to $[p_{\min}, p_{\max}]$, for some $p_{\max} \ge p_{\min} > 0$, and that an agent always sends out its updated information immediately after the completion of its "push" step. The traveling time of each packet is at most $D_{\rm tv}$. Also, at least one packet is successfully received in every $D_{\rm ls}$ successive one-hop transmissions. Note that there is a vast literature on how to estimate $D_{\rm tv}$ and $D_{\rm ls}$, based upon the specific channel model under consideration; see, e.g., [39], [40]. In this setting, it is not difficult to check that one can set $T = (I-1)\,\lceil p_{\max}/p_{\min} \rceil + 1$ and $D = I\, \lceil D_{\rm tv}/p_{\min} \rceil\, D_{\rm ls}$. To cope with the issue of estimating $\bar\gamma_1$ and $\bar\gamma_2$, in the next section we show how to employ diminishing, uncoordinated step-sizes in ASY-SONATA.

B. Uncoordinated diminishing step-sizes
The use of a diminishing step-size shared across the agents is quite common in synchronous distributed algorithms. However, it is not clear how to implement such an option in an asynchronous setting without enforcing any coordination among the agents (they would need to know the global iteration counter $k$). In this section, we provide for the first time a solution to this issue. Inspired by [41], our model assumes that each agent, independently and with no coordination with the others, draws its step-size from a local sequence $\{\alpha^t\}_{t\in\mathbb{N}}$, according to its local clock. The sequence $\{\gamma^k\}_{k\in\mathbb{N}}$ in (11) will thus be the result of the "uncoordinated samplings" of the local out-of-sync sequences $\{\alpha^t\}_{t\in\mathbb{N}}$. The next theorem shows that, in this setting, ASY-SONATA converges at a sublinear rate for both convex and nonconvex objectives.

Theorem 11.
Consider Problem (P) under Assumption 1 (thus possibly nonconvex). Let $\{(x_i^k)_{i=1}^I\}_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 2, in the same setting of Theorem 9, but with the agents using a local step-size sequence $\{\alpha^t\}_{t\in\mathbb{N}}$ satisfying $\alpha^t \downarrow 0$ and $\sum_{t=0}^{\infty} \alpha^t = \infty$. Given $\delta > 0$, let $T_\delta$ be the first iteration $k \in \mathbb{N}$ such that $M_F(x^k) \le \delta$. Then,

$T_\delta \le \inf\big\{ k \in \mathbb{N} \;\big|\; \sum_{t=0}^{k} \gamma^t \ge c/\delta \big\}$,   (15)

where $c$ is a positive constant.

Proof. See Sec. VIII.

V. NUMERICAL RESULTS
We test ASY-SONATA on the least square regression andthe binary classification problems. The MATLAB code can befound at https://github.com/YeTian-93/ASY-SONATA.
Figure 1. Directed graphs: optimality gap $J^k$ versus number of rounds, for SONATA, ASY-SONATA, and ASY-SONATA-dimi.

A. Least square regression
In the LS problem, each agent $i$ aims to estimate an unknown signal $x_0 \in \mathbb{R}^n$ through linear measurements $b_i = M_i x_0 + n_i$, where $M_i \in \mathbb{R}^{d_i \times n}$ is the sensing matrix and $n_i \in \mathbb{R}^{d_i}$ is additive noise. The LS problem can be written in the form of (P), with each $f_i(x) = \|M_i x - b_i\|^2$. Data:
We fix $x_0$ with its elements being i.i.d. random variables drawn from the standard normal distribution. For each $M_i$, we first generate all its elements as i.i.d. random variables drawn from the standard normal distribution, and then normalize the matrix by multiplying it by the reciprocal of its spectral norm. The elements of the additive noise $n_i$ are i.i.d. Gaussian, with zero mean and a fixed small variance. We set $n = 200$ and $d_i = 30$ for each agent. Network model:
We simulate a network of $I = 30$ agents. Each agent $i$ has three out-neighbors: one of them belongs to a directed cycle graph connecting all the agents, while the other two are picked uniformly at random. Asynchronous model:
Agents are activated according to a cyclic rule whose order is randomly permuted at the beginning of each round. Once activated, every agent performs all the steps as in Algorithm 2 and then sends its updates to all its out-neighbors. Each transmitted message has an (integer) traveling time drawn uniformly at random from the interval $[0, D_{\rm tv}]$. We set $D_{\rm tv} = 40$. We test ASY-SONATA with a constant step size $\gamma \approx 3$, and also with a diminishing step-size rule whereby each agent updates its local step size according to $\alpha^{t+1} = \alpha^t(1 - \mu\,\alpha^t)$, with $\mu$ a small positive constant and $\alpha^0 \approx 3$; as a benchmark, we also simulate the synchronous instance of the algorithm with a constant step size. In Fig. 1, we plot $J^k \triangleq (1/I)\sqrt{\sum_{i=1}^I \|x_i^k - x^\star\|^2}$ versus the number of rounds (one round corresponds to one update of all the agents). The curves are averaged over Monte Carlo simulations, with different graph and data instantiations. The plot clearly shows the linear convergence of ASY-SONATA with a constant step-size.
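The behavior of the local diminishing rule $\alpha^{t+1} = \alpha^t(1 - \mu\,\alpha^t)$ can be checked directly: it decays like $1/(\mu t)$ for large $t$, so each local sequence vanishes yet is non-summable, matching the requirements of Theorem 11. The sketch below uses illustrative values of $\mu$ and $\alpha^0$, not the exact constants of the experiments.

```python
def local_stepsizes(alpha0, mu, T):
    """Generate the recursion alpha^{t+1} = alpha^t * (1 - mu * alpha^t)."""
    seq = [alpha0]
    for _ in range(T - 1):
        a_t = seq[-1]
        seq.append(a_t * (1.0 - mu * a_t))
    return seq

# illustrative constants: alpha0 ~ 3, small mu
alphas = local_stepsizes(alpha0=3.0, mu=0.01, T=200_000)
partial_sum = sum(alphas)   # grows like (1/mu) * log t: the series diverges
```

The sequence is monotonically decreasing, vanishes like $1/(\mu t)$, and its partial sums keep growing, which is exactly the pair of conditions $\alpha^t \downarrow 0$ and $\sum_t \alpha^t = \infty$ used in Theorem 11.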
B. Binary classification
In this subsection, we consider a strongly convex and a nonconvex instance of Problem (P) over digraphs, namely the regularized logistic regression (RLR) and the robust classification (RC) problems. Both formulations can be abstracted as

$\min_x \; \dfrac{1}{|\mathcal{D}|} \displaystyle\sum_{i=1}^{I} \sum_{j \in \mathcal{D}_i} V\big( y_j \cdot \ell_x(u_j) \big) + \lambda\, \|\nabla \ell_x(\cdot)\|^2$,   (16)

where $\mathcal{D} = \cup_{i=1}^I \mathcal{D}_i$ is the set of indices of the data distributed across the agents, with agent $i$ owning $\mathcal{D}_i$ and $\mathcal{D}_i \cap \mathcal{D}_l = \emptyset$ for all $i \ne l$; $u_j$ and $y_j \in \{-1, 1\}$ are the feature vector and associated label of the $j$-th sample in $\mathcal{D}$; $\ell_x(\cdot)$ is a linear function, parameterized by $x$; and $V$ is the loss function. More specifically, for the RLR problem, $V$ reads $V(r) = \log(1 + e^{-r})$, while for the RC problem we have [42]

$V(r) = \begin{cases} 0, & \text{if } r > 1,\\ \frac{1}{4}\big(r^3 - 3r + 2\big), & \text{if } -1 \le r \le 1,\\ 1, & \text{if } r < -1. \end{cases}$

Data:
We use the following data sets for the RLR and RC problems. (RLR): We set $\ell_x(u) = x^\top u$, $n = 100$, each $|\mathcal{D}_i| = 20$, and a small regularization parameter $\lambda$. The underlying statistical model is the following: we generate the ground truth $\hat x$ with i.i.d. $\mathcal{N}(0,1)$ components; each training pair $(u_j, y_j)$ is generated independently, with each element of $u_j$ being i.i.d. $\mathcal{N}(0,1)$, and $y_j$ set to $1$ with probability $V(\ell_{\hat x}(u_j))$, and $-1$ otherwise. (RC): We use the Cleveland Heart Disease data set with 14 features [43], preprocessing it by deleting observations with missing entries, scaling the features between 0 and 1, and distributing the data evenly across the agents. We set $\ell_x(u) = e_1^\top x + \sum_{d} e_{d+1}^\top x \; e_d^\top u$ (an affine function of $u$, with the first entry of $x$ acting as the bias). Network model:
We simulated a digraph of $I = 30$ agents. Each agent has three out-neighbors: one belongs to a directed cycle connecting all the agents, while the other two are picked uniformly at random. One row-stochastic and one column-stochastic matrix with uniform weights are generated. Asynchronous model: a) Activation lists are generated by concatenating random rounds.
To generate one round, we first sample its length uniformly from the interval $[I, T]$, with $T = 90$. Within a round, we first have each agent appearing exactly once, and then sample agents uniformly at random for the remaining spots; finally, a random shuffle of the agents' order is performed on each round. b) Each transmitted message has an (integer) traveling time sampled uniformly from the interval $[0, D_{\rm tv}]$, with $D_{\rm tv} = 90$.

We compare the performance of our algorithm with AsySubPush [44] and AsySPA [45], which appeared online during the revision process of our paper. AsySubPush and AsySPA differ from ASY-SONATA in the following aspects: i) they do not employ any gradient-tracking mechanism; ii) they cannot handle packet losses or purge old information from the system (information is used as it is received); iii) when $F$ is strongly convex, they provably converge only at a sublinear rate; and iv) they cannot handle nonconvex $F$. The step sizes of all the algorithms are manually tuned to obtain the best practical performance. We run two instances of ASY-SONATA, one employing a constant step size and the other using the diminishing step-size rule $\alpha^{t+1} = \alpha^t(1 - \mu\,\alpha^t)$, where $\mu$ is a small positive constant and $t$ is the local iteration counter. For AsySubPush (resp. AsySPA) we set, for each agent $i$, a constant step size $\alpha_i$ (resp. a diminishing step size $\rho(k) = c/\sqrt{k}$, with $c > 0$), tuned separately for the RLR and RC problems. The results are averaged over 20 Monte Carlo experiments with different digraph instances, and are presented in Fig. 2; for each algorithm, we plot the merit functions $M_{sc}$ (left panel) and $M_F$ (right panel), evaluated along the generated trajectory, versus the global iteration counter $k$.

Figure 2. Left: regularized logistic regression; right: robust classification.

Consistently with the convergence theory,
ASY-SONATA with a constant step size exhibits a linear convergence rate. Also, ASY-SONATA outperforms the other two algorithms; this is mainly due to i) the presence in ASY-SONATA of an asynchronous gradient-tracking mechanism, which provides at each iteration a better estimate of $\nabla F$; and ii) the ability of ASY-SONATA to discard old information when it is received after newer information [cf. (5)].

VI. CONVERGENCE ANALYSIS OF P-ASY-SUM-PUSH

We prove Theorem 7; without loss of generality, we assume $n = 1$. The proof is organized in the following two steps.

Step 1:
We first reduce the asynchronous agent system to a synchronous "augmented" one with no delays. This is done by adding virtual agents to the graph $\mathcal{G}$, along with their state variables, so that P-ASY-SUM-PUSH can be rewritten as a (synchronous) perturbed push-sum algorithm on the augmented graph. While this idea was first explored in [30], [31], there are some important differences between the proposed enlarged system and those used therein; see Remark 13.

Step 2:
We conclude the proof by establishing convergence of the perturbed push-sum algorithm built in Step 1.

A. Step 1: Reduction to a synchronous perturbed push-sum

1) The augmented graph:
We begin by constructing the augmented graph, an enlarged agent system obtained by adding virtual agents to the original graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Specifically, we associate with each edge $(j,i) \in \mathcal{E}$ an ordered set of virtual nodes (agents), one for each of the possible delay values, denoted with a slight abuse of notation by $(j,i)_0, (j,i)_1, \ldots, (j,i)_D$; see Fig. 3. Roughly speaking, these virtual nodes store the "information on the fly" based upon its associated delay, that is, the information that has been generated by $j \in \mathcal{N}_i^{\rm in}$ for $i$ but not yet used (received) by $i$. Adopting the terminology in [31], nodes in the original graph $\mathcal{G}$ are termed computing agents, while the virtual nodes are called noncomputing agents. With a slight abuse of notation, we define the set of computing and noncomputing agents as $\widehat{\mathcal{V}} \triangleq \mathcal{V} \cup \{(i,j)_d \,|\, (i,j) \in \mathcal{E},\; d = 0, 1, \ldots, D\}$, and its cardinality as $S \triangleq |\widehat{\mathcal{V}}| = I + (D+1)|\mathcal{E}|$. We now identify the neighbors of each agent in this augmented system. Computing agents no longer communicate among themselves; each $j \in \mathcal{V}$ can only send information to the noncomputing nodes $(j,i)_0$, with $i \in \mathcal{N}_j^{\rm out}$. Each noncomputing agent $(j,i)_d$ can either send information to the next noncomputing agent, that is, $(j,i)_{d+1}$ (if any), or to the computing agent $i$; see Fig. 3(b).

Figure 3. Example of the augmented graph, when the maximum delay is $D = 2$; three noncomputing agents are added for each edge $(j,i) \in \mathcal{E}$. (a) Snapshot of the original graph; (b) augmented graph associated with (a).

To describe the information stored by the agents in the augmented system at each iteration, let us first introduce the following quantities: $\mathcal{T}_i \triangleq \{k \,|\, i^k = i,\, k \in \mathbb{N}\}$ is the set of global iteration indices at which the computing agent $i \in \mathcal{V}$ wakes up; and, given $k \in \mathbb{N}$, let $\mathcal{T}_i^k \triangleq \{t \in \mathcal{T}_i \,|\, t \le k\}$.
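As a sanity check on the bookkeeping above, the augmented node set and its cardinality $S = I + (D+1)|\mathcal{E}|$ can be generated mechanically; the toy digraph below is an illustrative choice:

```python
def augmented_nodes(V, E, D):
    """Computing agents plus one virtual node (j, i)_d per edge and delay value d."""
    return list(V) + [(j, i, d) for (j, i) in E for d in range(D + 1)]

V = [0, 1, 2]                          # computing agents (the original graph)
E = [(0, 1), (1, 2), (2, 0), (0, 2)]   # directed edges (sender, receiver)
D = 2                                  # maximum delay
V_hat = augmented_nodes(V, E, D)
S = len(V_hat)                         # S = I + (D + 1) * |E|
```

Here each triple `(j, i, d)` encodes the virtual node $(j,i)_d$; with $I = 3$, $|\mathcal{E}| = 4$, and $D = 2$, the augmented system has $S = 3 + 3\cdot 4 = 15$ agents.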
It is not difficult to conclude from (7) and (8) that

$\rho_{ij}^k = \displaystyle\sum_{t \in \mathcal{T}_j^{k-1}} a_{ij}\, z_j^{t+1/2}$ and $\tilde\rho_{ij}^k = \rho_{ij}^{\tau_{ij}^{k-1}}$, $(j,i) \in \mathcal{E}$.   (17)

At iteration $k = 0$, every computing agent $i$ stores $z_i^0$, whereas the values of the noncomputing agents are initialized to $0$. At the beginning of iteration $k$, every computing agent $i$ stores $z_i^k$, whereas every noncomputing agent $(j,i)_d$, with $0 \le d \le D-1$, stores the mass $a_{ij} z_j$ (if any) generated by $j$ for $i$ at iteration $k-d-1$ (thus $k-d-1 \in \mathcal{T}_j^{k-1}$), i.e., $a_{ij}\, z_j^{(k-d-1)+1/2}$ (cf. Step 3.2), and not yet used by $i$ (thus $k-d > \tau_{ij}^{k-1}$); otherwise it stores $0$. Formally, we have

$z_{(j,i)_d}^k \triangleq a_{ij}\, z_j^{t+1/2} \cdot \mathbb{1}\big[\, t = k-d-1 \in \mathcal{T}_j^{k-1} \;\&\; t+1 > \tau_{ij}^{k-1} \,\big]$.   (18)

The virtual node $(j,i)_D$ accumulates all the masses $a_{ij}\, z_j^{t+1/2}$ with delay $d \ge D$, not received by $i$ yet:

$z_{(j,i)_D}^k \triangleq \displaystyle\sum_{t \in \mathcal{T}_j^{k-D-1},\; t+1 > \tau_{ij}^{k-1}} a_{ij}\, z_j^{t+1/2}$.   (19)

We next write P-ASY-SUM-PUSH on the augmented graph in terms of the $z$-variables of both the computing and noncomputing agents, absorbing the $(\rho, \tilde\rho)$-variables using (17)-(19).

The sum-step over the augmented graph.
In the sum-step, the update of the $z$-variables of the computing agents reads:

$z_{i^k}^{k+1/2} = z_{i^k}^{k} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{\tau_{i^k j}^k} - \tilde\rho_{i^k j}^{k}\big) + \epsilon^k \;\overset{(17)-(19)}{=}\; z_{i^k}^{k} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \sum_{d = k - \tau_{i^k j}^k}^{D} z_{(j,i^k)_d}^{k} + \epsilon^k$;   (20a)

$z_j^{k+1/2} = z_j^{k}$, $j \in \mathcal{V} \setminus \{i^k\}$.   (20b)

In words, node $i^k$ builds the update $z_{i^k}^{k} \to z_{i^k}^{k+1/2}$ based upon the masses transmitted by the noncomputing agents $(j,i^k)_{k-\tau_{i^k j}^k}, (j,i^k)_{k-\tau_{i^k j}^k+1}, \ldots, (j,i^k)_D$ [cf. (20a)]. All the
other computing agents keep their masses unchanged [cf. (20b)]. The updates of the noncomputing agents are set to

$z_{(j,i^k)_d}^{k+1/2} \triangleq 0$, $d = k - \tau_{i^k j}^k, \ldots, D$, $j \in \mathcal{N}_{i^k}^{\rm in}$;   (20c)

$z_{(j',i)_\tau}^{k+1/2} \triangleq z_{(j',i)_\tau}^{k}$, for all the other $(j',i)_\tau \in \widehat{\mathcal{V}}$.   (20d)

The noncomputing agents in (20c) set their variables to zero (as they transferred their masses to $i^k$), while the other noncomputing agents keep their variables unchanged [cf. (20d)]. Fig. 4 illustrates the sum-step over the augmented graph.

Figure 4. Sum step on the augmented graph: $\tau_{i^k j}^k = k - 1$ (delay one); the two noncomputing agents, $(j,i^k)_1$ and $(j,i^k)_2$, send their masses to $i^k$.

Figure 5. Push step on the augmented graph: agent $i^k$ keeps $a_{i^k i^k} z_{i^k}^{k+1/2}$ while sending $a_{\ell i^k} z_{i^k}^{k+1/2}$ to the virtual nodes $(i^k, \ell)_0$, $\ell \in \mathcal{N}_{i^k}^{\rm out}$.

The push-step over the augmented graph.
In the push-step, the update of the $z$-variables of the computing agents reads:

$z_{i^k}^{k+1} = a_{i^k i^k}\, z_{i^k}^{k+1/2}$;   (21a)

$z_j^{k+1} = z_j^{k+1/2}$, for $j \in \mathcal{V} \setminus \{i^k\}$.   (21b)

In words, agent $i^k$ keeps the portion $a_{i^k i^k} z_{i^k}^{k+1/2}$ of the newly generated mass [cf. (21a)], whereas the other computing agents do not change their variables [cf. (21b)]. The noncomputing agents update as:

$z_{(i^k,\ell)_0}^{k+1} \triangleq a_{\ell i^k}\, z_{i^k}^{k+1/2}$, $\ell \in \mathcal{N}_{i^k}^{\rm out}$;   (21c)

$z_{(i,j)_0}^{k+1} \triangleq 0$, $(i,j) \in \mathcal{E}$, $i \ne i^k$;   (21d)

$z_{(i,j)_d}^{k+1} \triangleq z_{(i,j)_{d-1}}^{k+1/2}$, $d = 1, \ldots, D-1$, $(i,j) \in \mathcal{E}$;   (21e)

$z_{(i,j)_D}^{k+1} \triangleq z_{(i,j)_D}^{k+1/2} + z_{(i,j)_{D-1}}^{k+1/2}$, $(i,j) \in \mathcal{E}$.   (21f)

In words, the computing agent $i^k$ pushes the masses $a_{\ell i^k} z_{i^k}^{k+1/2}$ to the noncomputing agents $(i^k,\ell)_0$, with $\ell \in \mathcal{N}_{i^k}^{\rm out}$ [cf. (21c)]. As the other noncomputing agents $(i,j)_0$, $i \ne i^k$, do not receive any mass from their associated computing agents, they set their variables to zero [cf. (21d)]. Finally, the other noncomputing agents $(i,j)_d$, with $0 \le d \le D-1$, transfer their mass to the next noncomputing node $(i,j)_{d+1}$ [cf. (21e), (21f)]. This push-step is illustrated in Fig. 5.

The following result establishes the equivalence between the updates of the enlarged system and those of Algorithm 1.
Consider the setting of Theorem 7. Thevalues of the z -variables of the computing agents in (20) - (21) coincide with those of the z -variables generated by P-ASY-SUM-PUSH (Algorithm 1), for all iterations k ∈ N . Proof.
By construction, the updates of the computing agents in (20a)-(20b) and (21a)-(21b) coincide with the $z$-updates in the sum- and push-steps of P-ASY-SUM-PUSH, respectively. Therefore, we only need to show that the updates of the noncomputing agents are consistent with those of the $(\rho, \tilde\rho)$-variables in P-ASY-SUM-PUSH. This follows using (17) and noting that the updates (21c)-(21f) are compliant with (18) and (19). For instance, by (17)-(18), it must be $z_{(i^k,j)_0}^{k+1} = a_{j i^k}\, z_{i^k}^{t+1/2} \cdot \mathbb{1}[\, t = k \in \mathcal{T}_{i^k}^{k} \text{ and } t+1 > \tau_{j i^k}^{k} \,] = a_{j i^k}\, z_{i^k}^{k+1/2}$, which in fact coincides with (21c). The other equations (21d)-(21f) can be similarly validated.

Proposition 12 opens the way to studying convergence of P-ASY-SUM-PUSH via that of the synchronous perturbed push-sum algorithm (20)-(21). To do so, it is convenient to rewrite (20)-(21) in vector-matrix form, as described next. We begin by introducing an enumeration rule for the components of the $z$-vector in the augmented system. We enumerate all the elements of $\mathcal{E}$ as $1, 2, \ldots, |\mathcal{E}|$. The computing agents in $\widehat{\mathcal{V}}$ are indexed as in $\mathcal{V}$, that is, $1, 2, \ldots, I$. Each noncomputing agent $(j,i)_d$ is indexed as $I + d|\mathcal{E}| + s$, where $s$ is the index associated with $(j,i)$ in $\mathcal{E}$; we will use interchangeably $z_{I + d|\mathcal{E}| + s}$ and $z_{(j,i)_d}$. We define the $z$-vector as $\widehat{z} = [z_i]_{i=1}^S$; its value at iteration $k \in \mathbb{N}$ is denoted by $\widehat{z}^k$.

The transition matrix $S^k$ of the sum step is defined as

$S^k_{hm} \triangleq \begin{cases} 1, & \text{if } m \in \{(j,i^k)_d \,|\, k - \tau_{i^k j}^k \le d \le D\} \text{ and } h = i^k;\\ 1, & \text{if } m \in \widehat{\mathcal{V}} \setminus \{(j,i^k)_d \,|\, k - \tau_{i^k j}^k \le d \le D\} \text{ and } h = m;\\ 0, & \text{otherwise}. \end{cases}$

Let $\varepsilon^k \triangleq \epsilon^k e_{i^k}$ be the $S$-dimensional perturbation vector. The sum-step can be written in compact form as

$\widehat{z}^{k+1/2} = S^k \widehat{z}^k + \varepsilon^k$.   (22)

Define the transition matrix $P^k$ of the push step as

$P^k_{hm} \triangleq \begin{cases} a_{j i^k}, & \text{if } m = i^k \text{ and } h = (i^k, j)_0,\; j \in \mathcal{N}_{i^k}^{\rm out};\\ a_{i^k i^k}, & \text{if } m = h = i^k;\\ 1, & \text{if } m = h \in \mathcal{V} \setminus \{i^k\};\\ 1, & \text{if } m = (i,j)_d,\; h = (i,j)_{d+1},\; (i,j) \in \mathcal{E},\; 0 \le d \le D-1;\\ 1, & \text{if } m = h = (i,j)_D,\; (i,j) \in \mathcal{E};\\ 0, & \text{otherwise}. \end{cases}$

Then, the push-step can be written as

$\widehat{z}^{k+1} = P^k \widehat{z}^{k+1/2}$.   (23)

Combining (22) and (23) yields

$\widehat{z}^{k+1} = \widehat{A}^k \widehat{z}^k + p^k$, $\quad \widehat{A}^k \triangleq P^k S^k$, $\quad p^k \triangleq P^k \varepsilon^k$.   (24)

The updates of the $\phi$-variables and the definition of the $\phi$-vector are analogous. In summary, the P-ASY-SUM-PUSH algorithm can be rewritten in compact form as

$\widehat{z}^{k+1} = \widehat{A}^k \widehat{z}^k + p^k$, $\quad p^k = \epsilon^k P^k e_{i^k}$;   (25a)

$\widehat{\phi}^{k+1} = \widehat{A}^k \widehat{\phi}^k$;   (25b)

with initialization $z_i^0 \in \mathbb{R}$ and $\phi_i^0 = 1$, for $i \in \mathcal{V}$; and $z_i^0 = 0$ and $\phi_i^0 = 0$, for $i \in \widehat{\mathcal{V}} \setminus \mathcal{V}$.

Remark 13 (Comparison with [30]-[32], [38]). The idea of reducing asynchronous (consensus) algorithms to synchronous ones over an augmented system was already explored in [31], [32], [38]. However, there are several important differences between the models therein and the proposed augmented graph. First of all, [38] extends the analysis in [30] to deal with asynchronous activations, but both works consider only packet losses (no delays). Second, our augmented graph model departs from that in [31], [32] in the following aspects: i) in our model, the virtual nodes are associated with the edges of the original graph rather than with the nodes; ii) the noncomputing nodes store the information on the fly (i.e., generated by a sender but not yet received by the intended receiver), while in [31], [32] each noncomputing agent owns a delayed copy of the message generated by the associated computing agent; and iii) the dynamics (25) over the augmented graph, used to describe the P-ASY-SUM-PUSH procedure, are different from those of the asynchronous consensus schemes [31, (1)] and [32, (1)].

B. Step 2: Proof of Theorem 7

1) Preliminaries:
We begin by studying some properties of the matrix product $\widehat{A}^{k:t}$, which will be instrumental in proving convergence of the perturbed push-sum scheme (25).

Lemma 14.
Let $\{\widehat{A}^k\}_{k\in\mathbb{N}}$ be the sequence of matrices in (25), generated by Algorithm 1, under Assumption 6, and with $A \triangleq (a_{ij})_{i,j=1}^I$ satisfying Assumption 4 (i), (iii). The following hold, for all $k \in \mathbb{N}$: a) $\widehat{A}^k$ is column stochastic; and b) the entries of the first $I$ rows of $\widehat{A}^{k+K-1:k}$ are uniformly lower bounded by $\eta \triangleq \bar m^{K} \in (0,1)$, with $K \triangleq (2I-1)\cdot T + I \cdot D$.

Proof. The lemma essentially proves that $(\widehat{A}^{k+K-1:k})^\top$ is an SIA (Stochastic, Indecomposable, Aperiodic) matrix [32], by showing that, within any time window of $K$ iterations, there exists a path from any node $m$ in the augmented graph to any computing node $h$. While at a high level the proof shares some similarities with those of [31, Lemma 2] and [32, Lemma 5(a)], there are important differences due to the distinct modeling of our augmented system. The complete proof is in Appendix A.

The key result of this section is stated next; it shows that, as $k - t$ increases, $\widehat{A}^{k:t}$ approaches a column-stochastic rank-one matrix at a linear rate. Given Lemma 14, the proof follows the path of [31, Lemmata 4 and 5] and [32, Lemma 4, Lemma 5(b,c)] and is thus omitted.

Lemma 15.
In the setting above, there exists a sequence of stochastic vectors $\{\xi^k\}_{k\in\mathbb{N}}$ such that, for any $k \ge t \in \mathbb{N}$ and $i,j \in \{1, \cdots, S\}$, there holds

$\big| \widehat{A}^{k:t}_{ij} - \xi_i^k \big| \le C\, \rho^{k-t}$,   (26)

with $C \triangleq \bar m^{-K}/(1 - \bar m^{K})$ and $\rho \triangleq (1 - \bar m^{K})^{1/K} \in (0,1)$. Furthermore, $\xi_i^k \ge \eta$, for all $i \in \mathcal{V}$ and $k \in \mathbb{N}$.
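The only structural property of (25) needed for the mass bookkeeping below is that the transition matrices are column stochastic, which forces $\mathbb{1}^\top \widehat{z}^{k+1} = \mathbb{1}^\top \widehat{z}^0 + \sum_l \epsilon^l$ regardless of the particular matrices. A quick numerical check with random column-stochastic matrices (an illustrative stand-in, not the actual augmented matrices $\widehat{A}^k$, $P^k$):

```python
import numpy as np

rng = np.random.default_rng(2)
S = 8                                  # size of a toy augmented state
z = rng.standard_normal(S)
expected_mass = z.sum()
for k in range(50):
    A = rng.random((S, S)) + 0.01      # strictly positive entries
    A /= A.sum(axis=0, keepdims=True)  # column stochastic: each column sums to 1
    eps = rng.standard_normal()        # scalar perturbation eps^k
    i_k = rng.integers(S)
    p = eps * A[:, i_k]                # p^k = eps^k * (column-stochastic matrix) e_{i_k}
    z = A @ z + p                      # cf. (25a)
    expected_mass += eps               # since 1^T p^k = eps^k
gap = abs(z.sum() - expected_mass)     # zero up to floating-point rounding
```

Whatever the column-stochastic matrices are, the total mass of the state evolves exactly by the injected perturbations, which is the discrete conservation law exploited in the proof of Theorem 7.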
2) Proof of Theorem 7:
Applying (25) telescopically yields $\widehat{z}^{k+1} = \widehat{A}^{k:0} \widehat{z}^0 + \sum_{l=1}^{k} \widehat{A}^{k:l} p^{l-1} + p^k$ and $\widehat{\phi}^{k+1} = \widehat{A}^{k:0} \widehat{\phi}^0$, which, using the column stochasticity of $\widehat{A}^{k:t}$, yields

$\mathbb{1}^\top \widehat{z}^{k+1} = \mathbb{1}^\top \widehat{z}^0 + \displaystyle\sum_{l=0}^{k} \mathbb{1}^\top p^l$, $\quad \mathbb{1}^\top \widehat{\phi}^{k+1} = \mathbb{1}^\top \widehat{\phi}^0 = I$.   (27)

Using (27) and $\phi_i^{k+1} \ge I\eta$, for all $i \in \mathcal{V}$ and $k \ge K-1$ [due to Lemma 14(b)], we have, for $i \in \mathcal{V}$ and $k \ge K-1$:

$\Big| \dfrac{z_i^{k+1}}{\phi_i^{k+1}} - \dfrac{\mathbb{1}^\top \widehat{z}^{k+1}}{I} \Big| \le \dfrac{1}{I\eta} \Big| z_i^{k+1} - \dfrac{\phi_i^{k+1}}{I} \big(\mathbb{1}^\top \widehat{z}^{k+1}\big) \Big| \le \dfrac{1}{I\eta} \big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big| + \dfrac{1}{I\eta} \Big| \Big( \xi_i^k - \dfrac{\phi_i^{k+1}}{I} \Big)\, \mathbb{1}^\top \widehat{z}^{k+1} \Big|$

$\le \dfrac{1}{I\eta} \big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big| + \dfrac{1}{I\eta} \Big| \xi_i^k - \dfrac{\widehat{A}^{k:0}_{i,:} \widehat{\phi}^0}{I} \Big| \cdot \Big| \mathbb{1}^\top \widehat{z}^0 + \displaystyle\sum_{l=0}^{k} \mathbb{1}^\top p^l \Big|$

$\overset{(26)}{\le} \dfrac{1}{I\eta} \big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big| + \dfrac{C \rho^k}{\sqrt{I}\,\eta} \Big( \|z^0\| + \displaystyle\sum_{l=0}^{k} |\epsilon^l| \Big)$.   (28)

The next lemma provides a bound on $\big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big|$.

Lemma 16.
Let $\{\widehat{z}^k\}_{k=0}^\infty$ be the sequence generated by the perturbed system (25a), under Assumption 6, with $A = (a_{ij})_{i,j=1}^I$ satisfying Assumption 4 (i), (iii), and given $\{\epsilon^k\}_{k\in\mathbb{N}}$. For any $i \in \mathcal{V}$ and $k \ge 0$, there holds

$\big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big| \le C_1 \Big( \rho^k \|z^0\| + \displaystyle\sum_{l=0}^{k} \rho^{k-l} |\epsilon^l| \Big)$,   (29)

with $\{\xi^k\}_{k\in\mathbb{N}}$ defined in Lemma 15 and $C_1 \triangleq \sqrt{2S}\, C/\rho$.
Proof.

$\big| z_i^{k+1} - \xi_i^k\, \mathbb{1}^\top \widehat{z}^{k+1} \big| \overset{(25a)}{=} \Big| \widehat{A}^{k:0}_{i,:} \widehat{z}^0 + \displaystyle\sum_{l=1}^{k} \widehat{A}^{k:l}_{i,:} p^{l-1} + p_i^k - \xi_i^k \Big( \mathbb{1}^\top \widehat{z}^0 + \sum_{l=0}^{k} \mathbb{1}^\top p^l \Big) \Big|$

$\le |p_i^k| + |\mathbb{1}^\top p^k| + \big\| \widehat{A}^{k:0}_{i,:} - \xi_i^k \mathbb{1}^\top \big\|\, \big\| \widehat{z}^0 \big\| + \displaystyle\sum_{l=1}^{k} \big\| \widehat{A}^{k:l}_{i,:} - \xi_i^k \mathbb{1}^\top \big\|\, \big\| p^{l-1} \big\|$

$\overset{(26)}{\le} \dfrac{\sqrt{S}}{\rho}\, C \Big( \rho^k \|\widehat{z}^0\| + \displaystyle\sum_{l=0}^{k} \rho^{k-l} \|P^l\|\, |\epsilon^l| \Big) \overset{(a)}{\le} C_1 \Big( \rho^k \|z^0\| + \displaystyle\sum_{l=0}^{k} \rho^{k-l} |\epsilon^l| \Big)$,

where in (a) we used $\|P^l\| \le \sqrt{\|P^l\|_1 \|P^l\|_\infty} \le \sqrt{2}$.

Combining (28) and (29) leads to

$\Big| \dfrac{z_i^{k+1}}{\phi_i^{k+1}} - \dfrac{\mathbb{1}^\top \widehat{z}^{k+1}}{I} \Big| \le C_2 \Big( \rho^k \|z^0\| + \displaystyle\sum_{l=0}^{k} \rho^{k-l} |\epsilon^l| \Big)$,

where we defined $C_2 \triangleq (C_1 + \sqrt{I}\,C)/(I\eta)$. Recalling the definition of $m_z^k \triangleq \sum_{i=1}^I z_i^k + \sum_{(j,i)\in\mathcal{E}} (\rho_{ij}^k - \tilde\rho_{ij}^k)$, to complete the proof it remains to show that

$m_z^k \overset{(I)}{=} \displaystyle\sum_{i=1}^{I} z_i^0 + \sum_{t=0}^{k-1} \epsilon^t \overset{(II)}{=} \mathbb{1}^\top \widehat{z}^k$.   (30)

We prove next the equalities (I) and (II) separately. Proof of (I): Since $m_z^0 = \sum_{i=1}^I z_i^0$, it suffices to show that $m_z^{k+1} = m_z^k + \epsilon^k$ for all $k \in \mathbb{N}$. Since agent $i^k$ triggers the transition $k \to k+1$, we only need to show that

$z_{i^k}^{k+1} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} (\rho_{i^k j}^{k+1} - \tilde\rho_{i^k j}^{k+1}) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} (\rho_{j i^k}^{k+1} - \tilde\rho_{j i^k}^{k+1}) = z_{i^k}^{k} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} (\rho_{i^k j}^{k} - \tilde\rho_{i^k j}^{k}) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} (\rho_{j i^k}^{k} - \tilde\rho_{j i^k}^{k}) + \epsilon^k$.
We have

$z_{i^k}^{k+1} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} (\rho_{i^k j}^{k+1} - \tilde\rho_{i^k j}^{k+1}) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} (\rho_{j i^k}^{k+1} - \tilde\rho_{j i^k}^{k+1})$

$\overset{(a)}{=} a_{i^k i^k} z_{i^k}^{k+1/2} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{k} - \rho_{i^k j}^{\tau_{i^k j}^k}\big) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} \big(\rho_{j i^k}^{k} + a_{j i^k} z_{i^k}^{k+1/2} - \tilde\rho_{j i^k}^{k}\big)$

$\overset{(b)}{=} z_{i^k}^{k+1/2} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{k} - \rho_{i^k j}^{\tau_{i^k j}^k}\big) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} \big(\rho_{j i^k}^{k} - \tilde\rho_{j i^k}^{k}\big)$

$\overset{(c)}{=} z_{i^k}^{k} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{\tau_{i^k j}^k} - \tilde\rho_{i^k j}^{k}\big) + \epsilon^k + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{k} - \rho_{i^k j}^{\tau_{i^k j}^k}\big) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} \big(\rho_{j i^k}^{k} - \tilde\rho_{j i^k}^{k}\big)$

$= z_{i^k}^{k} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} \big(\rho_{i^k j}^{k} - \tilde\rho_{i^k j}^{k}\big) + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} \big(\rho_{j i^k}^{k} - \tilde\rho_{j i^k}^{k}\big) + \epsilon^k$,

where in (a) we used the definition of the push step, $\rho_{i^k j}^{k+1} = \rho_{i^k j}^{k}$ for all $j \in \mathcal{N}_{i^k}^{\rm in}$, and $\tilde\rho_{j i^k}^{k+1} = \tilde\rho_{j i^k}^{k}$ for all $j \in \mathcal{N}_{i^k}^{\rm out}$; (b) follows from $a_{i^k i^k} + \sum_{j \in \mathcal{N}_{i^k}^{\rm out}} a_{j i^k} = 1$; and in (c) we used the sum-step.

Proof of (II):
Using (27) yields $\mathbb{1}^\top \widehat{z}^{k+1} = \mathbb{1}^\top \widehat{z}^0 + \sum_{l=0}^{k} \mathbb{1}^\top p^l = \mathbb{1}^\top \widehat{z}^k + \mathbb{1}^\top \varepsilon^k = \sum_{i=1}^{I} z_i^0 + \sum_{t=0}^{k} \epsilon^t$.

VII. ASY-SONATA: PROOF OF THEOREM 9
Step 1:
We introduce and study convergence of an auxiliary perturbed consensus scheme, which serves as a unified model for the descent and consensus updates in ASY-SONATA; the main result is summarized in Proposition 18;
Step 2:
We introduce the consensus and gradient-tracking errors, along with a suitably defined optimization error, and derive bounds connecting these quantities, building on the results of Step 1 and the convergence of P-ASY-SUM-PUSH; see Proposition 19. The goal is to prove that the aforementioned errors vanish at a linear rate. To do so,
Step 3 introduces a general form of the small gain theorem (Theorem 23), along with some technical results, which allow us to establish the desired linear convergence through the boundedness of the solution of an associated linear system of inequalities.

Step 4 builds such a linear system for the error quantities introduced in Step 2 and proves the boundedness of its solution, thus proving Theorem 9. The rate expression (13) is derived in Appendix D. Throughout the proof we assume $n = 1$ (scalar variables), and define $\bar L \triangleq \max_{i=1,\ldots,I} L_i$ and $L \triangleq \sum_{i=1}^{I} L_i$.

Step 1: A perturbed asynchronous consensus scheme
We introduce a unified model to study the dynamics of the consensus and optimization errors in ASY-SONATA, which consists in pulling out the tracking update (Step 5) and treating the $z$-variables (the term $-\gamma^k z_{i^k}^k$ in (11)) as an exogenous perturbation $\delta^k$. More specifically, consider the following scheme (with a slight abuse of notation, we use the same symbols as in ASY-SONATA):

$v_{i^k}^{k+1} = x_{i^k}^{k} + \delta^k$,   (31a)

$x_{i^k}^{k+1} = w_{i^k i^k} v_{i^k}^{k+1} + \displaystyle\sum_{j \in \mathcal{N}_{i^k}^{\rm in}} w_{i^k j}\, v_j^{k - d_j^k}$,   (31b)

$v_j^{k+1} = v_j^{k}$, $x_j^{k+1} = x_j^{k}$, $\forall j \in \mathcal{V} \setminus \{i^k\}$,   (31c)

with given $x_i^0 \in \mathbb{R}$ and $v_i^t = 0$, $t = -D, -D+1, \ldots, 0$, for all $i \in \mathcal{V}$. We make the blanket assumption that the agents' activations and delays satisfy Assumption 6.

Let us rewrite (31) in vector-matrix form. Define $x^k \triangleq [x_1^k, \cdots, x_I^k]^\top$ and $v^k \triangleq [v_1^k, \cdots, v_I^k]^\top$. Construct the $(D+2)I$-dimensional concatenated vectors

$h^k \triangleq [x^{k\top}, v^{k\top}, v^{(k-1)\top}, \cdots, v^{(k-D)\top}]^\top$, $\quad \delta^k \triangleq \delta^k e_{i^k}$;   (32)

and the augmented matrix $\widehat{W}^k$, defined as

$\widehat{W}^k_{rm} \triangleq \begin{cases} w_{i^k i^k}, & \text{if } r = m = i^k;\\ w_{i^k j}, & \text{if } r = i^k,\; m = j + (d_j^k + 1) I;\\ 1, & \text{if } r = m \in \{1, 2, \ldots, 2I\} \setminus \{i^k,\, i^k + I\};\\ 1, & \text{if } r \in \{2I+1, 2I+2, \ldots, (D+2)I\} \cup \{i^k + I\} \text{ and } m = r - I;\\ 0, & \text{otherwise}. \end{cases}$

System (31) can be rewritten in compact form as

$h^{k+1} = \widehat{W}^k \big( h^k + \delta^k \big)$.   (33)

The following lemma captures the asymptotic behavior of $\widehat{W}^k$.

Lemma 17.
Let $\{\widehat{W}^k\}_{k\in\mathbb{N}}$ be the sequence of matrices in (33), generated by (31), under Assumption 6 and with $W$ satisfying Assumption 4 (i), (ii). The following hold: for all $k \in \mathbb{N}$, a) $\widehat{W}^k$ is row stochastic; b) there exists a sequence of stochastic vectors $\{\psi^k\}_{k\in\mathbb{N}}$ such that

$\big\| \widehat{W}^{k:t} - \mathbb{1}\, \psi^{t\top} \big\| \le C_3\, \rho^{k-t}$, $\quad C_3 \triangleq \dfrac{\sqrt{(D+2)I}\,\big(1 + \bar m^{-K}\big)}{1 - \bar m^{K}}$.   (34)

Furthermore, $\psi_i^k \ge \eta = \bar m^{K}$, for all $k \ge 0$ and $i \in \mathcal{V}$.

Proof. The proof follows similar techniques as in [31], [32], and can be found in Appendix G.

We now define a proper consensus error for (33). Writing $h^k$ in (33) recursively yields

$h^{k+1} = \widehat{W}^{k:0} h^0 + \displaystyle\sum_{l=0}^{k} \widehat{W}^{k:l} \delta^l$.   (35)

Using Lemma 17, for any fixed $N \in \mathbb{N}$, we have

$\displaystyle\lim_{k\to\infty} \Big( \widehat{W}^{k:0} h^0 + \sum_{l=0}^{N} \widehat{W}^{k:l} \delta^l \Big) = \mathbb{1}\, \psi^{0\top} h^0 + \sum_{l=0}^{N} \mathbb{1}\, \psi^{l\top} \delta^l$.

Define

$x_\psi^0 \triangleq \psi^{0\top} h^0$, $\quad x_\psi^{k+1} \triangleq \psi^{0\top} h^0 + \displaystyle\sum_{l=0}^{k} \psi^{l\top} \delta^l$, $k \in \mathbb{N}$.   (36)

Applying (36) inductively, it is easy to check that

$x_\psi^{k+1} = x_\psi^{k} + \psi^{k\top} \delta^k = x_\psi^{k} + \psi_{i^k}^{k}\, \delta^k$.   (37)

We are now ready to state the main result of this subsection, which is a bound on the consensus disagreement $\|h^{k+1} - \mathbb{1}\, x_\psi^{k+1}\|$ in terms of the magnitude of the perturbation.

Proposition 18.
In the above setting, the consensus error $\|h^{k+1} - \mathbf{1}\,x_\psi^{k+1}\|$ satisfies: for all $k\in\mathbb{N}$,
\[
\big\|h^{k+1} - \mathbf{1}\,x_\psi^{k+1}\big\| \le C_2\,\rho^{k}\,\big\|h^0 - \mathbf{1}\,x_\psi^0\big\| + C_2\sum_{l=0}^{k}\rho^{k-l}\,\big|\delta^l\big|.
\]
Proof.
The proof follows readily from (35), (36), and Lemma 17; we omit further details.
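The perturbed system (31)-(33) is easy to exercise numerically. The sketch below (hypothetical uniform weights and 0-based indices, not values from the paper) builds the augmented matrix $\widehat W^k$ as defined above, checks its row stochasticity [Lemma 17(a)], and verifies the unrolled recursion (35):

```python
import numpy as np

def augmented_matrix(W, i_k, delays, D):
    """Augmented matrix W-hat^k of (33) for the agent i_k active at iteration k.

    W      : I x I row-stochastic weight matrix (hypothetical input);
    delays : dict j -> d_j^k in {0,...,D} for the in-neighbors j != i_k;
    D      : maximum delay. All indices are 0-based here.
    """
    I = W.shape[0]
    n = (D + 2) * I
    What = np.zeros((n, n))
    What[i_k, i_k] = W[i_k, i_k]                  # weight on x_{i^k}^k + delta^k
    for j, dj in delays.items():                  # delayed values v_j^{k - d_j^k}
        What[i_k, j + (dj + 1) * I] = W[i_k, j]
    for r in range(2 * I):                        # x_j, v_j unchanged for j != i^k
        if r not in (i_k, i_k + I):
            What[r, r] = 1.0
    What[i_k + I, i_k] = 1.0                      # v_{i^k}^{k+1} = x_{i^k}^k + delta^k
    for r in range(2 * I, n):                     # the delayed blocks shift down
        What[r, r - I] = 1.0
    return What

rng = np.random.default_rng(0)
I, D, K = 3, 2, 10
W = np.full((I, I), 1.0 / I)                      # toy fully connected graph
h0 = rng.standard_normal((D + 2) * I)

Ws, deltas, h = [], [], h0.copy()
for k in range(K):
    i_k = int(rng.integers(I))
    d = {j: int(rng.integers(D + 1)) for j in range(I) if j != i_k}
    Wk = augmented_matrix(W, i_k, d, D)
    assert np.allclose(Wk.sum(axis=1), 1.0)       # Lemma 17(a): row stochastic
    delta = np.zeros((D + 2) * I)
    delta[i_k] = rng.standard_normal()
    Ws.append(Wk)
    deltas.append(delta)
    h = Wk @ (h + delta)                          # recursion (33)

def chain(l):                                     # product W^{K-1} ... W^l
    P = np.eye((D + 2) * I)
    for t in range(l, K):
        P = Ws[t] @ P
    return P

# Closed form (35): h^K = W^{K-1:0} h^0 + sum_l W^{K-1:l} delta^l
assert np.allclose(h, chain(0) @ h0 + sum(chain(l) @ deltas[l] for l in range(K)))
```

The delayed blocks of $h^k$ make the asynchronous update a plain linear time-varying system, which is what Proposition 18 exploits.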
Step 2: Consensus, tracking, and optimization errors.

1) Consensus disagreement: As anticipated, the updates of ASY-SONATA are also described by (31), if one sets therein $\delta^k = -\gamma^k z_{i^k}^k$ (with $z_{i^k}^k$ satisfying Step 5 of ASY-SONATA). Let $h^k$ and $x_\psi^k$ be defined as in (32) and (36), respectively, with $\delta^k = -\gamma^k z_{i^k}^k$. The consensus error at iteration $k$ is defined as
\[
E_c^k \triangleq \big\|h^k - \mathbf{1}\,x_\psi^k\big\|. \tag{38}
\]
2) Gradient tracking error: The gradient tracking step in ASY-SONATA is an instance of P-ASY-SUM-PUSH, with $\epsilon^k = \nabla f_{i^k}(x_{i^k}^{k+1}) - \nabla f_{i^k}(x_{i^k}^{k})$. By Proposition 12, P-ASY-SUM-PUSH is equivalent to (25). In view of Lemma 16 and the property
\[
\mathbf{1}^\top \widehat{z}^k = \sum_{i=1}^{I}\nabla f_i(x_i^0) + \sum_{t=0}^{k-1}\big(\nabla f_{i^t}(x_{i^t}^{t+1}) - \nabla f_{i^t}(x_{i^t}^{t})\big) = \sum_{i=1}^{I}\nabla f_i(x_i^k),
\]
where the first equality follows from (30) and $\epsilon^k = \nabla f_{i^k}(x_{i^k}^{k+1}) - \nabla f_{i^k}(x_{i^k}^{k})$, while in the second equality we used $x_j^{t+1} = x_j^t$ for $j \neq i^t$, the tracking error at iteration $k$ along with the magnitude of the tracking variables are defined as
\[
E_t^k \triangleq \big|z_{i^k}^k - \xi_{i^k}^{k-1}\,\bar g^k\big|, \qquad E_z^k \triangleq \big|z_{i^k}^k\big|, \qquad \bar g^k \triangleq \sum_{i=1}^{I}\nabla f_i(x_i^k), \tag{39}
\]
with $\xi_i^{-1} \triangleq \eta$, $i\in\mathcal{V}$. Let $g^k \triangleq [\nabla f_1(x_1^k),\ldots,\nabla f_I(x_I^k)]^\top$.
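The conservation property $\mathbf{1}^\top\widehat z^k = \sum_i\nabla f_i(x_i^k)$ is a telescoping invariant of the tracking mass and can be checked numerically. The sketch below uses hypothetical scalar quadratic costs $f_i(x)=a_i x^2/2$ (for illustration only), updates one randomly activated agent at a time, and verifies the invariant at every iteration:

```python
import numpy as np

rng = np.random.default_rng(2)
I = 4
a = rng.random(I) + 0.5              # hypothetical quadratics f_i(x) = a_i x^2 / 2
grad = lambda i, x: a[i] * x         # so grad f_i(x) = a_i x
x = rng.standard_normal(I)

# mass plays the role of 1^T z-hat^k: initialized with the sum of local
# gradients, then corrected by eps^t = grad f_{i^t}(x^{t+1}) - grad f_{i^t}(x^t)
# each time an agent wakes up and changes its local variable.
mass = sum(grad(i, x[i]) for i in range(I))
for t in range(50):
    i_t = int(rng.integers(I))
    x_new = x[i_t] - 0.1 * grad(i_t, x[i_t])   # any local update works here
    mass += grad(i_t, x_new) - grad(i_t, x[i_t])
    x[i_t] = x_new
    assert np.isclose(mass, sum(grad(i, x[i]) for i in range(I)))
```

Only the activated agent contributes a correction, exactly because $x_j^{t+1}=x_j^t$ for $j\neq i^t$; this is the second equality in the display above.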
3) Optimization error:
Let $x^\star$ be the unique minimizer of $F$. Given the definition of the consensus disagreement in (38), we define the optimization error at iteration $k$ as
\[
E_o^k \triangleq \big|x_\psi^k - x^\star\big|. \tag{40}
\]
Note that this is a natural choice as, if consensual, all agents' local variables will converge to a limit point of $\{x_\psi^k\}_{k\in\mathbb{N}}$.
4) Connection among $E_c^k$, $E_t^k$, $E_z^k$, and $E_o^k$: The following proposition establishes bounds on the above quantities.
Proposition 19.
Let $\{x^k, v^k, z^k\}_{k\in\mathbb{N}}$ be the sequence generated by ASY-SONATA, in the setting of Theorem 9, but possibly with a time-varying step-size $\{\gamma^k\}_{k\in\mathbb{N}}$. The error quantities $E_c^k$, $E_t^k$, $E_z^k$, and $E_o^k$ satisfy: for all $k\in\mathbb{N}$,
\[
E_c^{k+1} \le C_2\,\rho^k E_c^0 + C_2\sum_{l=0}^{k}\rho^{k-l}\gamma^l E_z^l; \tag{41a}
\]
\[
E_t^{k+1} \le 3\,C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big(E_c^l + \gamma^l E_z^l\big) + C_1\rho^k\big\|g^0\big\|; \tag{41b}
\]
\[
E_z^k \le E_t^k + C_2L\sqrt{I}\,E_c^k + L\,E_o^k. \tag{41c}
\]
Further assume $\gamma^k \le 1/L$, $k\in\mathbb{N}$; then
\[
E_o^{k+1} \le \sum_{l=0}^{k}\Big(\prod_{t=l+1}^{k}\big(1-\tau\eta^2\gamma^t\big)\Big)\big(C_2L\sqrt{I}\,E_c^l + E_t^l\big)\gamma^l + \prod_{t=0}^{k}\big(1-\tau\eta^2\gamma^t\big)E_o^0, \tag{41d}
\]
where $\eta\in(0,1)$ is defined in Lemma 15 and $\tau$ is the strong convexity constant of $F$.

Proof. Eq. (41a) follows readily from Proposition 18.

We prove now (41b). Recall $\mathbf{1}^\top\widehat z^k = \bar g^k$. Using Lemma 16 with $\epsilon^k = \nabla f_{i^k}(x_{i^k}^{k+1}) - \nabla f_{i^k}(x_{i^k}^k)$, we obtain: for all $i\in\mathcal{V}$,
\begin{align*}
\big|z_i^{k+1} - \xi_i^k\,\bar g^{k+1}\big| &\le C_1\Big(\rho^k\big\|g^0\big\| + \sum_{l=0}^{k}\rho^{k-l}\big|\nabla f_{i^l}(x_{i^l}^{l+1}) - \nabla f_{i^l}(x_{i^l}^{l})\big|\Big)\\
&\le C_1\rho^k\big\|g^0\big\| + C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big|x_{i^l}^{l+1} - x_{i^l}^{l}\big|\\
&\le C_1\rho^k\big\|g^0\big\| + C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big\|h^{l+1}-h^l\big\|\\
&= C_1\rho^k\big\|g^0\big\| + C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big\|\widehat W^l\big(h^l+\boldsymbol\delta^l\big)-h^l\big\|\\
&\overset{(a)}{=} C_1\rho^k\big\|g^0\big\| + C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big\|\big(\widehat W^l - I\big)\big(h^l - \mathbf 1\, x_\psi^l\big) - \gamma^l z_{i^l}^l\, \widehat W^l e_{i^l}\big\|\\
&\le C_1\rho^k\big\|g^0\big\| + C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\Big(\big\|\widehat W^l\big\|\,\gamma^l E_z^l + \big(\big\|\widehat W^l\big\| + \|I\|\big)E_c^l\Big)\\
&\overset{(b)}{\le} C_1\rho^k\big\|g^0\big\| + 3\,C_1C_2L\sum_{l=0}^{k}\rho^{k-l}\big(E_c^l + \gamma^l E_z^l\big),
\end{align*}
where in (a) we used (33) and the row stochasticity of $\widehat W^k$ [Lemma 17(a)], which implies $(\widehat W^l - I)\mathbf 1 = 0$; and (b) follows from $\|\widehat W^l\| \le \sqrt{\|\widehat W^l\|_1\|\widehat W^l\|_\infty} \le \sqrt 2$. This proves (41b).

Eq. (41c) follows readily from
\[
E_z^k = \big|z_{i^k}^k\big| \le \big|z_{i^k}^k - \xi_{i^k}^{k-1}\bar g^k\big| + \xi_{i^k}^{k-1}\big|\bar g^k - \nabla F(x_\psi^k)\big| + \xi_{i^k}^{k-1}\big|\nabla F(x_\psi^k) - \nabla F(x^\star)\big|.
\]
Finally, we prove (41d). Using (41c) and $x_\psi^{k+1} = x_\psi^k - \gamma^k\psi_{i^k}^k z_{i^k}^k$ [cf. (37), and recall $\delta^k = -\gamma^k z_{i^k}^k$], we can write
\begin{align*}
E_o^{k+1} &= \big|x_\psi^k - \gamma^k\psi_{i^k}^k z_{i^k}^k - x^\star\big|\\
&\le \gamma^k\psi_{i^k}^k\xi_{i^k}^{k-1}\big|\nabla F(x_\psi^k) - \bar g^k\big| + \gamma^k\psi_{i^k}^k\big|\xi_{i^k}^{k-1}\bar g^k - z_{i^k}^k\big| + \big|x_\psi^k - \gamma^k\psi_{i^k}^k\xi_{i^k}^{k-1}\nabla F(x_\psi^k) - x^\star\big|\\
&\overset{(a)}{\le} \big(1-\tau\eta^2\gamma^k\big)E_o^k + C_2L\sqrt{I}\,\gamma^k\big\|h^k - \mathbf 1\,x_\psi^k\big\| + \gamma^k E_t^k\\
&\overset{(b)}{\le} \sum_{l=0}^{k}\Big(\prod_{t=l+1}^{k}\big(1-\tau\eta^2\gamma^t\big)\Big)\big(C_2L\sqrt I\,E_c^l + E_t^l\big)\gamma^l + \prod_{t=0}^{k}\big(1-\tau\eta^2\gamma^t\big)E_o^0,
\end{align*}
where in (a) we used $\eta^2 \le \psi_{i^k}^k\xi_{i^k}^{k-1} < 1$ (cf. Lemmas 15 and 17) and $|x - \gamma\nabla F(x) - x^\star| \le (1-\tau\gamma)|x - x^\star|$, which holds for $\gamma \le 1/L$; (b) follows readily by applying the above inequality telescopically.

Step 3: The generalized small gain theorem
The last step of our proof is to show that the error quantities $E_c^k$, $E_t^k$, $E_z^k$, and $E_o^k$ vanish linearly. This is not a straightforward task, as these quantities are interconnected through the inequalities (41). This subsection provides tools to address this issue. The key result is a generalization of the small gain theorem (cf. Theorem 23), first used in [33].

Definition 20 ([33]). Given the sequence $\{u^k\}_{k=0}^\infty$, a constant $\lambda\in(0,1)$, and $N\in\mathbb{N}$, define
\[
|u|_{\lambda,N} \triangleq \max_{k=0,\ldots,N}\frac{\big|u^k\big|}{\lambda^k}, \qquad |u|_{\lambda} \triangleq \sup_{k\in\mathbb{N}}\frac{\big|u^k\big|}{\lambda^k}.
\]
If $|u|_{\lambda}$ is upper bounded, then $u^k = O(\lambda^k)$, for all $k\in\mathbb{N}$.

The following lemma shows how one can interpret the inequalities in (41) using the notions introduced in Definition 20.
Lemma 21.
Let $\{u^k\}_{k=0}^\infty$ and $\{v_i^k\}_{k=0}^\infty$, $i=1,\ldots,m$, be nonnegative sequences; let $\lambda_0,\lambda_1,\ldots,\lambda_m\in(0,1)$; and let $R_0,R_1,\ldots,R_m\in\mathbb{R}_+$ be such that
\[
u^{k+1} \le R_0(\lambda_0)^k + \sum_{i=1}^{m}R_i\sum_{l=0}^{k}(\lambda_i)^{k-l}v_i^l, \qquad \forall k\in\mathbb{N}.
\]
Then, there holds
\[
|u|_{\lambda,N} \le u^0 + \frac{R_0}{\lambda} + \sum_{i=1}^{m}\frac{R_i}{\lambda-\lambda_i}\,|v_i|_{\lambda,N},
\]
for any $\lambda\in(\max_{i=0,1,\ldots,m}\lambda_i,\,1)$ and $N\in\mathbb{N}$.

Proof. See Appendix B.
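Definition 20 and Lemma 21 can be sanity-checked on synthetic sequences. The sketch below (all constants arbitrary, with a single driving sequence, i.e. $m=1$) builds $u$ so that the hypothesis of Lemma 21 holds with equality and verifies the claimed bound on $|u|_{\lambda,N}$:

```python
import numpy as np

def lam_norm(u, lam, N):
    """Weighted norm |u|_{lam,N} = max_{k<=N} |u^k| / lam^k (Definition 20)."""
    return max(abs(u[k]) / lam**k for k in range(N + 1))

rng = np.random.default_rng(3)
N = 60
lam0, lam1, R0, R1 = 0.5, 0.6, 2.0, 1.5
v = rng.random(N + 1)                       # arbitrary nonnegative driving sequence

# Tight case of the hypothesis: u^{k+1} = R0 lam0^k + R1 sum_l lam1^{k-l} v^l
u = np.zeros(N + 1)
u[0] = 0.3
for k in range(N):
    u[k + 1] = R0 * lam0**k + R1 * sum(lam1**(k - l) * v[l] for l in range(k + 1))

lam = 0.8                                   # any lam in (max(lam0, lam1), 1)
lhs = lam_norm(u, lam, N)
rhs = u[0] + R0 / lam + R1 / (lam - lam1) * lam_norm(v, lam, N)
assert lhs <= rhs + 1e-12
```

The geometric-series step in the proof is where the requirement $\lambda > \lambda_i$ enters: the kernel $\sum_j(\lambda_i/\lambda)^j$ must be summable.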
Lemma 22.
Let $\{u^k\}_{k=0}^\infty$ and $\{v^k\}_{k=0}^\infty$ be two nonnegative sequences. The following hold:
a. $u^k \le v^k$, for all $k\in\mathbb{N}$ $\Longrightarrow$ $|u|_{\lambda,N} \le |v|_{\lambda,N}$, for any $\lambda\in(0,1)$ and $N\in\mathbb{N}$;
b. $|\beta_1 u + \beta_2 v|_{\lambda,N} \le |\beta_1|\,|u|_{\lambda,N} + |\beta_2|\,|v|_{\lambda,N}$, for any $\beta_1,\beta_2\in\mathbb{R}$, $\lambda\in(0,1)$, and positive integer $N$.

\[
\begin{bmatrix} |E_z|_{\lambda,N}\\[2pt] |E_c|_{\lambda,N}\\[2pt] |E_t|_{\lambda,N}\\[2pt] |E_o|_{\lambda,N}\end{bmatrix}
\le
\underbrace{\begin{bmatrix}
0 & b_1 & 1 & L\\[4pt]
\dfrac{C_2\gamma}{\lambda-\rho} & 0 & 0 & 0\\[4pt]
\dfrac{b_2\gamma}{\lambda-\rho} & \dfrac{b_2}{\lambda-\rho} & 0 & 0\\[4pt]
0 & \dfrac{b_1\gamma}{\lambda-L(\gamma)} & \dfrac{\gamma}{\lambda-L(\gamma)} & 0
\end{bmatrix}}_{\triangleq\,K}
\begin{bmatrix} |E_z|_{\lambda,N}\\[2pt] |E_c|_{\lambda,N}\\[2pt] |E_t|_{\lambda,N}\\[2pt] |E_o|_{\lambda,N}\end{bmatrix}
+
\begin{bmatrix}
0\\[2pt] \Big(1+\dfrac{C_2}{\lambda}\Big)E_c^0\\[2pt] \dfrac{C_1\|g^0\|}{\lambda}+E_t^0\\[2pt] \Big(1+\dfrac{L(\gamma)}{\lambda}\Big)E_o^0
\end{bmatrix},
\qquad b_1 \triangleq C_2L\sqrt{I}, \quad b_2 \triangleq 3\,C_1C_2L. \tag{42}
\]

The major result of this section is the generalized small gain theorem, stated next.

Theorem 23 (Generalized Small Gain Theorem). Given nonnegative sequences $\{u_i^k\}_{k=0}^\infty$, $i=1,\ldots,m$, a nonnegative matrix $T\in\mathbb{R}^{m\times m}$, $\boldsymbol\beta\in\mathbb{R}^m$, and $\lambda\in(0,1)$ such that
\[
\mathbf{u}_{\lambda,N} \le T\,\mathbf{u}_{\lambda,N} + \boldsymbol\beta, \qquad \forall N\in\mathbb{N}, \tag{43}
\]
where $\mathbf{u}_{\lambda,N} \triangleq [\,|u_1|_{\lambda,N},\ldots,|u_m|_{\lambda,N}]^\top$. If $\rho(T)<1$, then $|u_i|_\lambda$ is bounded, for all $i=1,\ldots,m$; that is, each $u_i^k$ vanishes at an R-linear rate $O(\lambda^k)$. Proof.
See Appendix C.

The following results are instrumental in finding a sufficient condition for $\rho(T)<1$. Lemma 24.
Consider a polynomial $p(z) = z^m - a_1 z^{m-1} - a_2 z^{m-2} - \ldots - a_{m-1}z - a_m$, with $z\in\mathbb{C}$ and $a_i\in\mathbb{R}_+$, $i=1,\ldots,m$. Define $z_p \triangleq \max\big\{|z_i| : p(z_i)=0,\ i=1,\ldots,m\big\}$. Then, $z_p < 1$ if and only if $p(1) > 0$.
See Appendix F.
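Lemma 24 reduces the spectral test on all roots of $p$ to evaluating $p(1) = 1 - \sum_i a_i$ at a single point. A quick numerical check (coefficients arbitrary):

```python
import numpy as np

def max_root_modulus(a):
    """Largest modulus among the roots of p(z) = z^m - a_1 z^{m-1} - ... - a_m."""
    coeffs = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return max(abs(r) for r in np.roots(coeffs))

def p_at_one(a):
    return 1.0 - float(np.sum(a))        # p(1) = 1 - sum_i a_i

# Lemma 24: with a_i >= 0, all roots lie strictly inside the unit disk
# iff p(1) > 0, i.e. iff sum_i a_i < 1.
a_stable = [0.3, 0.2, 0.1]               # sum = 0.6 < 1
a_unstable = [0.7, 0.5]                  # sum = 1.2 > 1
assert p_at_one(a_stable) > 0 and max_root_modulus(a_stable) < 1
assert p_at_one(a_unstable) < 0 and max_root_modulus(a_unstable) >= 1
```

This is the test applied to the characteristic polynomial $p_K$ in Step 4, where checking $p_K(1)>0$ yields the explicit step-size bound.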
Step 4: Linear convergence rate (proof of Theorem 9)
Our path to prove the linear convergence rate passes through Theorem 23: we first cast the set of inequalities (41) into a system of the form (43), and then study the spectral properties of the resulting coefficient matrix.

Given $\gamma < 1/L$, define $L(\gamma) \triangleq 1 - \tau\eta^2\gamma$; and choose $\lambda\in\mathbb{R}$ such that
\[
\max\big(\rho, L(\gamma)\big) < \lambda < 1. \tag{44}
\]
Note that $L(\gamma) < 1$, as $\gamma < 1/L$; hence the interval in (44) is nonempty. Applying Lemma 21 and Lemma 22 to the set of inequalities (41) with $\gamma^k \equiv \gamma$, we obtain the system (42). By Theorem 23, to prove the desired linear convergence rate, it is sufficient to show that $\rho(K) < 1$. The characteristic polynomial $p_K(t)$ of $K$ satisfies the conditions of Lemma 24; hence $\rho(K) < 1$ if and only if $p_K(1) > 0$, that is,
\[
\Big(1+\frac{L\gamma}{\lambda-L(\gamma)}\Big)\frac{\gamma}{\lambda-\rho}\Big(C_2\,b_1 + b_2 + \frac{C_2\,b_2}{\lambda-\rho}\Big) \;\triangleq\; B(\lambda;\gamma) \;<\; 1. \tag{45}
\]
By the continuity of $B(\lambda;\gamma)$ and (44), $B(1;\gamma) < 1$ is sufficient to claim the existence of some $\lambda\in(\max(\rho,L(\gamma)),1)$ such that $B(\lambda;\gamma)<1$. Hence, imposing $B(1;\gamma)<1$ yields $0<\gamma<\bar\gamma$, with
\[
\bar\gamma \triangleq \frac{\tau\eta^2(1-\rho)^2}{(\tau\eta^2+L)\big(b_2(C_2+1-\rho) + b_1C_2(1-\rho)\big)}. \tag{46}
\]
It is easy to check that $\bar\gamma < 1/L$. Therefore, $0<\gamma<\bar\gamma$ is sufficient for $E_c^k$, $E_t^k$, $E_z^k$, $E_o^k$ to vanish at an R-linear rate. The desired result, $|x_i^k - x^\star| = O(\lambda^k)$, $i\in\mathcal V$, follows readily from $E_c^k = O(\lambda^k)$ and $E_o^k = O(\lambda^k)$. The explicit expression of the rate $\lambda$, as in (13), is derived in Appendix D.

VIII. ASY-SONATA: PROOF OF THEOREMS 10 AND 11
A. Preliminaries
We begin by establishing a connection between the merit function $M_F$ defined in (14) and the error quantities $E_c^k$, $E_t^k$, and $E_z^k$, defined in (38) and (39). Lemma 25.
The merit function $M_F$ satisfies
\[
M_F(x^k) \le C_4\,(E_c^k)^2 + 3\,\eta^{-2}\big((E_t^k)^2 + (E_z^k)^2\big), \tag{47}
\]
with $C_4 \triangleq 3(C_2+1)^2L^2I + 4$.
Define $J \triangleq (1/I)\,\mathbf 1\mathbf 1^\top$ and $\bar x^k \triangleq (1/I)\,\mathbf 1^\top x^k$; and recall the definition of $\xi_i^k$ (cf. Lemma 15) and that $x_\psi^{k+1} = x_\psi^k - \gamma^k\psi_{i^k}^k z_{i^k}^k$ [cf. (37)]. We have
\begin{align}
M_F(x^k) &\le \big|\nabla F(\bar x^k)\big|^2 + \big\|x^k - \mathbf 1\,\bar x^k\big\|^2 \tag{48}\\
&\le \big|\nabla F(\bar x^k)\big|^2 + 2\big\|x^k - \mathbf 1\, x_\psi^k\big\|^2 + 2\big\|\mathbf 1\, x_\psi^k - \mathbf 1\,\bar x^k\big\|^2 \tag{49}\\
&\le \big|\nabla F(\bar x^k)\big|^2 + 2\big\|x^k - \mathbf 1\, x_\psi^k\big\|^2 + 2\big\|J\big(\mathbf 1\, x_\psi^k - x^k\big)\big\|^2 \nonumber\\
&\le \big|\nabla F(\bar x^k)\big|^2 + 4\big\|x^k - \mathbf 1\, x_\psi^k\big\|^2. \tag{50}
\end{align}
We bound now $|\nabla F(\bar x^k)|$; we have
\begin{align}
\big|\nabla F(\bar x^k)\big| &\le \big|\nabla F(x_\psi^k)\big| + L\sqrt I\,\big|\bar x^k - x_\psi^k\big| \nonumber\\
&\le \big|\nabla F(x_\psi^k) - \bar g^k\big| + \big|\bar g^k - (\xi_{i^k}^{k-1})^{-1}z_{i^k}^k\big| + (\xi_{i^k}^{k-1})^{-1}\big|z_{i^k}^k\big| + L\sqrt I\,\big\|J\big(x^k - \mathbf 1\, x_\psi^k\big)\big\| \nonumber\\
&\le \big(C_2L\sqrt I + L\sqrt I\big)E_c^k + \eta^{-1}E_t^k + \eta^{-1}E_z^k, \tag{51}
\end{align}
where in the last inequality we used $\xi_{i^k}^{k-1} \ge \eta$ for all $k$ (cf. Lemma 15) and $\|J(x^k - \mathbf 1 x_\psi^k)\| \le E_c^k$. Eq. (47) follows readily from (50) and (51).

Our ultimate goal is to show that the RHS of (47) is summable. To do so, we need two further results, Proposition 26 and Lemma 27 below. Proposition 26 establishes a connection between $F(x_\psi^{k+1})$ and $E_c^k$, $E_t^k$, and $E_z^k$. Proposition 26.
In the above setting, there holds: for all $k\in\mathbb{N}$,
\[
F(x_\psi^{k+1}) \le F(x_\psi^0) + \frac{1}{2}\big(L+\alpha^{-1}+\beta^{-1}\big)\sum_{t=0}^{k}(E_z^t)^2(\gamma^t)^2 - \eta\sum_{t=0}^{k}(E_z^t)^2\gamma^t + \frac{\alpha}{2}\,C_2^2L^2I\sum_{t=0}^{k}(E_c^t)^2 + \frac{\beta}{2}\,\eta^{-2}\sum_{t=0}^{k}(E_t^t)^2, \tag{52}
\]
where $\alpha$ and $\beta$ are two arbitrary positive constants.

Proof. By the descent lemma, we get
\begin{align*}
F(x_\psi^{k+1}) &\le F(x_\psi^k) + \gamma^k\psi_{i^k}^k\big\langle\nabla F(x_\psi^k), -z_{i^k}^k\big\rangle + \frac{L}{2}\big(\gamma^k\psi_{i^k}^k\big)^2\big|z_{i^k}^k\big|^2\\
&\le F(x_\psi^k) + \frac{L}{2}(\gamma^k)^2\big|z_{i^k}^k\big|^2 + \gamma^k\psi_{i^k}^k\big\langle(\xi_{i^k}^{k-1})^{-1}z_{i^k}^k, -z_{i^k}^k\big\rangle + \gamma^k\psi_{i^k}^k\big\langle\nabla F(x_\psi^k)-\bar g^k, -z_{i^k}^k\big\rangle + \gamma^k\psi_{i^k}^k\big\langle\bar g^k - (\xi_{i^k}^{k-1})^{-1}z_{i^k}^k, -z_{i^k}^k\big\rangle\\
&\le F(x_\psi^k) + \frac{L}{2}(\gamma^k)^2\big|z_{i^k}^k\big|^2 - \gamma^k\eta\big|z_{i^k}^k\big|^2 + \gamma^kC_2L\sum_{j=1}^{I}\big|x_\psi^k - x_j^k\big|\,\big|z_{i^k}^k\big| + \gamma^k\eta^{-1}E_t^k\big|z_{i^k}^k\big|\\
&\le F(x_\psi^k) + \frac{L}{2}(\gamma^k)^2\big|z_{i^k}^k\big|^2 - \gamma^k\eta\big|z_{i^k}^k\big|^2 + \gamma^kC_2L\sqrt I\,E_c^k\big|z_{i^k}^k\big| + \gamma^k\eta^{-1}E_t^k\big|z_{i^k}^k\big|\\
&\le F(x_\psi^k) + \frac{L}{2}(\gamma^k)^2\big|z_{i^k}^k\big|^2 - \gamma^k\eta\big|z_{i^k}^k\big|^2 + \frac{\alpha}{2}C_2^2L^2I(E_c^k)^2 + \frac{1}{2\alpha}\big|z_{i^k}^k\big|^2(\gamma^k)^2 + \frac{\beta}{2}\eta^{-2}(E_t^k)^2 + \frac{1}{2\beta}\big|z_{i^k}^k\big|^2(\gamma^k)^2\\
&\le F(x_\psi^k) + \frac{1}{2}\big(L+\alpha^{-1}+\beta^{-1}\big)(E_z^k)^2(\gamma^k)^2 - \eta\,(E_z^k)^2\gamma^k + \frac{\alpha}{2}C_2^2L^2I(E_c^k)^2 + \frac{\beta}{2}\eta^{-2}(E_t^k)^2.
\end{align*}
Applying the above inequality inductively one gets (52).

The last result we need is a bound on $\sum_{t=0}^k(E_c^t)^2$ and $\sum_{t=0}^k(E_t^t)^2$ in (52) in terms of $\sum_{t=0}^k(E_z^t)^2(\gamma^t)^2$. Lemma 27.
Define $\varrho_c \triangleq \dfrac{2\,C_2^2}{(1-\rho)^2}$ and $\varrho_t \triangleq \dfrac{36\,(C_1C_2L)^2\big(2C_2^2+(1-\rho)^2\big)}{(1-\rho)^4}$. The following holds: for all $k\in\mathbb{N}$,
\[
\sum_{t=0}^{k}(E_c^t)^2 \le c_c + \varrho_c\sum_{t=0}^{k}(E_z^t)^2(\gamma^t)^2, \qquad \sum_{t=0}^{k}(E_t^t)^2 \le c_t + \varrho_t\sum_{t=0}^{k}(E_z^t)^2(\gamma^t)^2, \tag{53}
\]
where $c_c$ and $c_t$ are some positive constants.

Proof. The proof follows from Proposition 19 and Lemma 28 below, which is a variant of a result in [28] (its proof is thus omitted).
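The estimate in Lemma 28 below can be checked numerically on synthetic sequences satisfying its hypothesis with equality; the constants below are arbitrary, and the right-hand side is the bound $(u^0)^2 + 2R^2/(1-\lambda^2) + 2(1-\lambda)^{-2}\sum_l (v^l)^2$ as reconstructed here:

```python
import numpy as np

rng = np.random.default_rng(4)
K, R, lam = 200, 1.7, 0.85
v = rng.random(K + 1)                       # arbitrary nonnegative driving sequence

# Tight case of the hypothesis: u^{k+1} = R lam^k + sum_l lam^{k-l} v^l
u = np.zeros(K + 1)
u[0] = 0.4
for k in range(K):
    u[k + 1] = R * lam**k + sum(lam**(k - l) * v[l] for l in range(k + 1))

# Check the squared-sum bound for every partial sum
for k in range(K + 1):
    lhs = np.sum(u[:k + 1]**2)
    rhs = u[0]**2 + 2 * R**2 / (1 - lam**2) + 2 / (1 - lam)**2 * np.sum(v[:k + 1]**2)
    assert lhs <= rhs
```

The two constants come from $(a+b)^2 \le 2a^2+2b^2$ applied to the geometric and convolution parts, plus Young's inequality for the convolution (the $\ell^1$ norm of the kernel is $1/(1-\lambda)$).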
Lemma 28.
Let $\{u^k\}_{k=0}^\infty$ and $\{v^k\}_{k=0}^\infty$ be nonnegative sequences; $\lambda\in(0,1)$; and $R\in\mathbb{R}_+$ such that
\[
u^{k+1} \le R\,\lambda^k + \sum_{l=0}^{k}\lambda^{k-l}v^l.
\]
Then, there holds: for all $k\in\mathbb{N}$,
\[
\sum_{l=0}^{k}(u^l)^2 \le (u^0)^2 + \frac{2R^2}{1-\lambda^2} + \frac{2}{(1-\lambda)^2}\sum_{l=0}^{k}(v^l)^2.
\]

Using (53) in (52), we finally obtain
\[
\sum_{t=0}^{k}(E_z^t)^2\gamma^t\big(\eta - \gamma^t\,C(\alpha,\beta)\big) \le F(x_\psi^0) - F_{\inf} + \widetilde C(\alpha,\beta), \tag{54}
\]
with $C(\alpha,\beta) \triangleq (1/2)\big(L + \alpha^{-1} + \beta^{-1} + C_2^2L^2I\,\alpha\,\varrho_c + \eta^{-2}\beta\,\varrho_t\big)$ and $\widetilde C(\alpha,\beta) \triangleq (1/2)\big(C_2^2L^2I\,\alpha\,c_c + \eta^{-2}\beta\,c_t\big)$; and $F_{\inf} > -\infty$ is the lower bound of $F$. We are now ready to prove Theorems 10 and 11.

B. Proof of Theorem 10
Set $\gamma^k \equiv \gamma$, for all $k\in\mathbb{N}$. By (54), one infers that $\sum_{t=0}^\infty(E_z^t)^2 < \infty$ if $\gamma$ satisfies $0<\gamma<\bar\gamma(\alpha,\beta)$, with $\bar\gamma(\alpha,\beta) \triangleq \eta/C(\alpha,\beta)$. Note that $\bar\gamma(\alpha,\beta)$ is maximized setting $\alpha = \alpha^\star = \big(C_2L\sqrt{I\varrho_c}\big)^{-1}$ and $\beta = \beta^\star = \eta\,\varrho_t^{-1/2}$, resulting in
\[
\bar\gamma(\alpha^\star,\beta^\star) = (2\eta)\big/\big(L + 2C_2L\sqrt{I\varrho_c} + 2\eta^{-1}\sqrt{\varrho_t}\big). \tag{55}
\]
Let $0<\gamma<\bar\gamma(\alpha^\star,\beta^\star)$. Given $\delta>0$, let $T_\delta$ be the first iteration $k\in\mathbb{N}$ such that $M_F(x^k)\le\delta$. Then we have
\[
T_\delta\cdot\delta < \sum_{k=0}^{T_\delta-1}M_F(x^k) \le \sum_{k=0}^{\infty}M_F(x^k) \overset{(47)}{\le} C_4\sum_{k=0}^{\infty}(E_c^k)^2 + 3\eta^{-2}\sum_{k=0}^{\infty}\big((E_t^k)^2+(E_z^k)^2\big) \overset{(53),(54)}{\le} \frac{F(x_\psi^0)-F_{\inf}+\widetilde C(\alpha^\star,\beta^\star)}{\gamma\big(\eta-\gamma\,C(\alpha^\star,\beta^\star)\big)}\cdot C_7 + C_8 < \infty,
\]
where $C_7 \triangleq C_4\,\varrho_c\,\gamma^2 + 3\eta^{-2}\big(\varrho_t\,\gamma^2 + 1\big)$ and $C_8$ is some constant. Therefore, $T_\delta = O(1/\delta)$.

C. Proof of Theorem 11
We begin by showing that the step-size sequence $\{\gamma^t\}_{t\in\mathbb{N}}$, induced by the local step-size sequences $\{\alpha^t\}_{t\in\mathbb{N}}$ and the asynchrony mechanism satisfying Assumption 6, is nonsummable. The proof is straightforward and is thus omitted. Lemma 29.
Let $\{\gamma^t\}_{t\in\mathbb{N}}$ be the global step-size sequence resulting from Algorithm 2, under Assumption 6. Then, there hold: $\lim_{t\to\infty}\gamma^t = 0$ and $\sum_{t=0}^\infty\gamma^t = \infty$.

Since $\lim_{t\to\infty}\gamma^t=0$, there exists a sufficiently large $k\in\mathbb{N}$, say $\bar k$, such that $\eta-\gamma^k C(\alpha^\star,\beta^\star)\ge\eta/2$, for all $k>\bar k$. It is not difficult to check that this, together with (54), yields $\sum_{k=0}^\infty(E_z^k)^2\gamma^k<\infty$. We can then write
\[
\sum_{k=0}^{\infty}M_F(x^k)\,\gamma^k \overset{(47)}{\le} C_4\sum_{k=0}^{\infty}(E_c^k)^2\gamma^k + 3\eta^{-2}\sum_{k=0}^{\infty}\big((E_t^k)^2+(E_z^k)^2\big)\gamma^k < C_9, \tag{56}
\]
for some finite constant $C_9$, where in the last inequality we used (53), $\sum_{k=0}^\infty(E_z^k)^2\gamma^k<\infty$, and $\lim_{t\to\infty}\gamma^t=0$.

Let $N_\delta \triangleq \inf\{k\in\mathbb{N}: \sum_{t=0}^{k}\gamma^t \ge C_9/\delta\}$. Note that $N_\delta$ exists, as $\sum_{k=0}^\infty\gamma^k=\infty$ (cf. Lemma 29). Let $T_\delta \triangleq \inf\{k\in\mathbb{N}: M_F(x^k)\le\delta\}$. It must be $T_\delta\le N_\delta$. In fact, suppose by contradiction that $T_\delta>N_\delta$, and thus $M_F(x^k)>\delta$ for $0\le k\le N_\delta$. This would imply $\sum_{k=0}^{N_\delta}M_F(x^k)\gamma^k > \delta\sum_{k=0}^{N_\delta}\gamma^k \ge \delta\cdot(C_9/\delta) = C_9$, which contradicts (56). This proves (15).

IX. CONCLUSIONS
We proposed ASY-SONATA, a distributed asynchronous algorithmic framework for convex and nonconvex (unconstrained, smooth) multi-agent problems over digraphs. The algorithm is robust against uncoordinated agents' activations and time-varying (communication/computation) delays. When employing a constant step-size, ASY-SONATA achieves a linear rate for strongly convex objectives, matching the rate of a centralized gradient algorithm, and a sublinear rate for (non)convex problems. A sublinear rate is also established when agents employ uncoordinated diminishing step-sizes, which is more realistic in a distributed setting. To the best of our knowledge, ASY-SONATA is the first distributed algorithm enjoying the above properties in the general asynchronous setting described in the paper.

APPENDIX
A. Proof of Lemma 14
We study any entry $\widehat A^{k+K-1:k}_{hm}$ with $m\in\widehat{\mathcal V}$ and $h\in\mathcal V$. We prove the result by considering the following four cases.

(i) Assume $h=m\in\mathcal V$. It is easy to check that $\widehat A^k_{hh}\ge\bar m$, for any $k\in\mathbb{N}$ and $h\in\mathcal V$. Therefore, $\widehat A^{k+s-1:k}_{hh}\ge\prod_{t=k}^{k+s-1}\widehat A^t_{hh}\ge\bar m^s$, for all $k,s\in\mathbb{N}$ and $h\in\mathcal V$.

(ii) Let $(m,h)\in\mathcal E$; and let $s$ be the first time $m$ wakes up in the interval $[k,\,k+T-1]$. We have $\widehat A^s_{(m,h)_0,m} = a_{hm}$. The information that node $m$ sent to node $(m,h)_0$ at iteration $s$ is received by node $h$ when the information is on some virtual node $(m,h)_d$. We discuss separately the following three sub-cases for $d$: 1) $1\le d\le D-1$; 2) $d=0$; and 3) $d=D$.

1) $1\le d\le D-1$: We have
\[
\widehat A^{s+d:s+1}_{(m,h)_d,(m,h)_0} = \widehat A^{s+d}_{(m,h)_d,(m,h)_{d-1}}\cdots\widehat A^{s+1}_{(m,h)_1,(m,h)_0} = 1, \qquad \widehat A^{s+d+1}_{h,(m,h)_d} = a_{hh}.
\]
Therefore, $\widehat A^{s+d+1:s}_{hm} \ge \widehat A^{s+d+1}_{h,(m,h)_d}\,\widehat A^{s+d:s+1}_{(m,h)_d,(m,h)_0}\,\widehat A^{s}_{(m,h)_0,m} = a_{hh}\,a_{hm} \ge \bar m^2$.

2) $d=0$: We simply have $\widehat A^{s+1:s}_{hm} \ge \widehat A^{s+1}_{h,(m,h)_0}\,\widehat A^{s}_{(m,h)_0,m} = a_{hh}\,a_{hm} \ge \bar m^2$.

Therefore, for $0\le d\le D-1$,
\[
\widehat A^{k+2T+D-1:k}_{hm} \ge \widehat A^{k+2T+D-1:s+d+2}_{hh}\,\widehat A^{s+d+1:s}_{hm}\,\widehat A^{s-1:k}_{mm} \ge \bar m^{\,k+2T+D-s-d-2}\,\bar m^{2}\,\bar m^{\,s-k} \ge \bar m^{\,2T+D}.
\]

3) $d=D$: Before agent $h$ wakes up at some time $s+D+\tau$, with $1\le\tau\le T$, the information stays on the virtual node $(m,h)_D$. Once agent $h$ wakes up, node $(m,h)_D$ sends all its information to it. Then we have
\[
\widehat A^{s+D:s+1}_{(m,h)_D,(m,h)_0} = 1, \qquad \widehat A^{s+D+\tau:s+D+1}_{h,(m,h)_D} = a_{hh}.
\]
Similarly, we have
\[
\widehat A^{k+2T+D-1:k}_{hm} \ge \widehat A^{k+2T+D-1:s+D+\tau+1}_{hh}\,\widehat A^{s+D+\tau:s+D+1}_{h,(m,h)_D}\,\widehat A^{s+D:s+1}_{(m,h)_D,(m,h)_0}\,\widehat A^{s}_{(m,h)_0,m}\,\widehat A^{s-1:k}_{mm} \ge \bar m^{\,2T+D}.
\]
To summarize, in all three sub-cases we have
\[
\widehat A^{k+K-1:k}_{hm} \ge \widehat A^{k+K-1:k+2T+D}_{hh}\,\widehat A^{k+2T+D-1:k}_{hm} \ge \bar m^{\,K-2T-D}\,\bar m^{\,2T+D} = \bar m^{K}.
\]

(iii) Let $m\neq h$ and $(m,h)\in\mathcal V\times\mathcal V\setminus\mathcal E$. Since the graph $(\mathcal V,\mathcal E)$ is connected, there are mutually different agents $i_1,\ldots,i_r$, with $r\le I-2$, such that $(m,i_1),(i_1,i_2),\ldots,(i_{r-1},i_r),(i_r,h)\in\mathcal E$, which is a directed path from $m$ to $h$ in the graph $(\mathcal V,\mathcal E)$. Then, by the result proved in (ii), we have
\[
\widehat A^{k+(I-1)(2T+D)-1:k}_{hm} \ge \widehat A^{k+(I-1)(2T+D)-1:k+(r+1)(2T+D)}_{hh}\,\widehat A^{k+(r+1)(2T+D)-1:k+r(2T+D)}_{h\,i_r}\cdots\widehat A^{k+2T+D-1:k}_{i_1 m} \ge \bar m^{(I-2-r)(2T+D)}\,\bar m^{(r+1)(2T+D)} = \bar m^{(I-1)(2T+D)}.
\]
We can then easily get
\[
\widehat A^{k+K-1:k}_{hm} \ge \widehat A^{k+K-1:k+(I-1)(2T+D)}_{hh}\,\widehat A^{k+(I-1)(2T+D)-1:k}_{hm} \ge \bar m^{K}.
\]

(iv) If $m$ is a virtual node, it must be associated with an edge $(j,i)\in\mathcal E$: there exists $0\le d\le D$ such that $m=(j,i)_d$. A similar argument as in (ii) above shows that any information on $m$ eventually enters node $i$, taking $1\le\tau\le D+T$ iterations; that is, $\widehat A^{k+\tau-1:k}_{im} \ge a_{ii}$, for some $1\le\tau\le D+T$. On the other hand, by the above results, we know $\widehat A^{k+T+D+(I-1)(2T+D)-1:k+T+D}_{hi} \ge \bar m^{(I-1)(2T+D)}$. Therefore,
\[
\widehat A^{k+K-1:k}_{hm} \ge \widehat A^{k+K-1:k+T+D}_{hi}\,\widehat A^{k+T+D-1:k+\tau}_{ii}\,\widehat A^{k+\tau-1:k}_{im} \ge \bar m^{(I-1)(2T+D)}\,\bar m^{\,T+D-\tau}\,\bar m \ge \bar m^{K}.
\]

B. Proof of Lemma 21
Fix $N\in\mathbb{N}$, and let $k$ be such that $0\le k+1\le N$. We have
\begin{align*}
\frac{u^{k+1}}{\lambda^{k+1}} &\le \frac{R_0}{\lambda}\Big(\frac{\lambda_0}{\lambda}\Big)^k + \sum_{i=1}^{m}\frac{R_i}{\lambda}\sum_{l=0}^{k}\Big(\frac{\lambda_i}{\lambda}\Big)^{k-l}\frac{v_i^l}{\lambda^l}\\
&\le \frac{R_0}{\lambda} + \sum_{i=1}^{m}\frac{R_i}{\lambda}\,|v_i|_{\lambda,N}\sum_{l=0}^{k}\Big(\frac{\lambda_i}{\lambda}\Big)^{k-l}\\
&\le \frac{R_0}{\lambda} + \sum_{i=1}^{m}\frac{R_i}{\lambda-\lambda_i}\,|v_i|_{\lambda,N}.
\end{align*}
Hence,
\[
|u|_{\lambda,N} \le \max\Big(u^0,\ \frac{R_0}{\lambda} + \sum_{i=1}^{m}\frac{R_i}{\lambda-\lambda_i}|v_i|_{\lambda,N}\Big) \le u^0 + \frac{R_0}{\lambda} + \sum_{i=1}^{m}\frac{R_i}{\lambda-\lambda_i}|v_i|_{\lambda,N}.
\]

C. Proof of Theorem 23
From [46, Ch. 5.6], we know that if $\rho(T)<1$, then $\lim_{k\to\infty}T^k=0$, the series $\sum_{k=0}^\infty T^k$ converges (wherein we define $T^0 \triangleq I$), $I-T$ is invertible, and $\sum_{k=0}^\infty T^k = (I-T)^{-1}$. Given $N\in\mathbb{N}$, using (43) recursively yields
\[
\mathbf u_{\lambda,N} \le T\,\mathbf u_{\lambda,N}+\boldsymbol\beta \le T\big(T\,\mathbf u_{\lambda,N}+\boldsymbol\beta\big)+\boldsymbol\beta = T^2\mathbf u_{\lambda,N} + (T+I)\boldsymbol\beta \le \cdots \le T^\ell\,\mathbf u_{\lambda,N} + \sum_{k=0}^{\ell-1}T^k\boldsymbol\beta,
\]
for any $\ell\in\mathbb{N}$. Letting $\ell\to\infty$, we get $\mathbf u_{\lambda,N} \le (I-T)^{-1}\boldsymbol\beta$. Since this holds for any given $N\in\mathbb{N}$, we have $\mathbf u_{\lambda} \le (I-T)^{-1}\boldsymbol\beta$. Hence, $\mathbf u_\lambda$ is bounded, and thus each $u_i^k$ vanishes at an R-linear rate $O(\lambda^k)$.

D. Proof of the rate decay (13) (Theorem 9)
Let $\lambda \ge L(\gamma) + \epsilon\gamma$, with $\epsilon>0$ to be properly chosen; then $\lambda - L(\gamma) \ge \epsilon\gamma$ and thus
\[
B(\lambda;\gamma) \le \Big(1+\frac{L}{\epsilon}\Big)\frac{\gamma}{\lambda-\rho}\Big(C_2\,b_1 + b_2 + \frac{C_2\,b_2}{\lambda-\rho}\Big). \tag{57}
\]
Using $\lambda-\rho<1$, a sufficient condition for (45) is [RHS of (57) less than one]
\[
\Big(1+\frac{L}{\epsilon}\Big)\big(C_2\,b_1 + b_2(1+C_2)\big)\,\gamma \le (\lambda-\rho)^2. \tag{58}
\]
Now set $\epsilon = (\tau\eta^2)/2$. Since the RHS of (58) can be arbitrarily close to $(1-\rho)^2$, an upper bound on $\gamma$ is
\[
\hat\gamma \triangleq (1-\rho)^2\Big/\underbrace{\Big(1+\frac{2L}{\tau\eta^2}\Big)\big(C_2\,b_1 + b_2(1+C_2)\big)}_{\triangleq\,J}.
\]
According to $\lambda \ge L(\gamma)+\epsilon\gamma$ and (58), we can take
\[
\lambda = \max\Big(1-\frac{\tau\eta^2}{2}\gamma,\ \rho+\sqrt{J\gamma}\Big). \tag{59}
\]
Notice that, as $\gamma$ goes from $0$ to $\hat\gamma$, the first argument inside the max decreases from $1$ to $1-(\tau\eta^2\hat\gamma)/2$, while the second increases from $\rho$ to $1$. Letting $1-\frac{\tau\eta^2}{2}\gamma = \rho+\sqrt{J\gamma}$, we get the solution
\[
\tilde\gamma = \Bigg(\frac{\sqrt{J+2\tau\eta^2(1-\rho)}-\sqrt J}{\tau\eta^2}\Bigg)^2.
\]
The expression of $\lambda$ as in (13) follows readily.

E. Proof of Corollary 7.1
We know that $m_z^k = \sum_{i=1}^{I}z_i^0 + \sum_{t=0}^{k-1}\epsilon^t$. Clearly $m_z^0 = \sum_{i=1}^{I}z_i^0 = \sum_{i=1}^{I}\tilde u_i^0$. Suppose that, for $k=\ell$, we have $m_z^\ell = \sum_{i=1}^{I}\tilde u_i^\ell$. Then,
\[
m_z^{\ell+1} = m_z^\ell + \epsilon^\ell = \Big(\sum_{i=1}^{I}\tilde u_i^\ell\Big) + u_{i^\ell}^{\ell+1} - \tilde u_{i^\ell}^\ell = \sum_{j\neq i^\ell}\tilde u_j^\ell + u_{i^\ell}^{\ell+1} = \sum_{i=1}^{I}\tilde u_i^{\ell+1}.
\]
Thus $m_z^k = \sum_{i=1}^{I}\tilde u_i^k$, for all $k\in\mathbb{N}$.

Now assume that $\lim_{k\to\infty}\sum_{i=1}^{I}|u_i^{k+1}-u_i^k| = 0$. Notice that, for $k\ge T$,
\[
\big|\epsilon^k\big| = \big|u_{i^k}^{k+1} - \tilde u_{i^k}^k\big| \le \sum_{t=k-T+1}^{k}\big|u_{i^k}^{t+1}-u_{i^k}^t\big| \le \sum_{t=k-T+1}^{k}\sum_{i=1}^{I}\big|u_i^{t+1}-u_i^t\big|.
\]
Therefore $\lim_{k\to\infty}|\epsilon^k| = 0$. According to Theorem 7 and [29, Lemma 7(a)], we have
\[
\lim_{k\to\infty}\Big|y_i^{k+1} - (1/I)\sum_{i=1}^{I}\tilde u_i^{k+1}\Big| = 0.
\]
On the other hand, we have
\[
\Big|\sum_{i=1}^{I}u_i^{k+1} - \sum_{i=1}^{I}\tilde u_i^{k+1}\Big| \le \sum_{i=1}^{I}\big|u_i^{k+1}-\tilde u_i^{k+1}\big| \le \sum_{i=1}^{I}\sum_{t=k-T+1}^{k}\big|u_i^{t+1}-u_i^t\big| \xrightarrow{k\to\infty} 0.
\]
By the triangle inequality, we get
\[
\lim_{k\to\infty}\Big|y_i^{k+1} - (1/I)\sum_{i=1}^{I}u_i^{k+1}\Big| = 0.
\]

F. Proof of Lemma 24

"$\Longleftarrow$": From $p(1)>0$, we know that $\sum_{i=1}^{m}a_i < 1$. We proceed by contradiction. Suppose there is a root $z_*$ of $p(z)$ with $|z_*|\ge 1$; then
\[
z_*^m = a_1 z_*^{m-1} + a_2 z_*^{m-2} + \ldots + a_{m-1}z_* + a_m.
\]
Clearly $z_*\neq 0$; so, equivalently,
\[
1 = \frac{a_1}{z_*} + \frac{a_2}{z_*^2} + \ldots + \frac{a_{m-1}}{z_*^{m-1}} + \frac{a_m}{z_*^m}.
\]
Further,
\[
\Big|\frac{a_1}{z_*} + \frac{a_2}{z_*^2} + \ldots + \frac{a_m}{z_*^m}\Big| \le \frac{a_1}{|z_*|} + \frac{a_2}{|z_*|^2} + \ldots + \frac{a_m}{|z_*|^m} \le a_1 + a_2 + \ldots + a_{m-1} + a_m < 1.
\]
This is a contradiction.

"$\Longrightarrow$": If $p(1)=0$, we clearly have $z_p\ge 1$. Now suppose $p(1)<0$. Because $\lim_{z\in\mathbb R,\,z\to+\infty}p(z) = +\infty$ and $p(z)$ is continuous on $\mathbb R$, we know that $p(z)$ has a zero in $(1,+\infty)\subset\mathbb R$. Thus $z_p>1$.

G. Proof of Lemma 17
We interpret the dynamical system (33) over an augmented graph. We begin by constructing the augmented graph obtained by adding virtual nodes to the original graph $\mathcal G = (\mathcal V,\mathcal E)$. We associate with each node $i\in\mathcal V$ an ordered set of virtual nodes $i_{[0]}, i_{[1]},\ldots,i_{[D]}$; see Fig. 6. We still call the nodes in the original graph $\mathcal G$ computing agents and the virtual nodes noncomputing agents. We now identify the neighbors of each agent in this augmented system. Any noncomputing agent $i_{[d]}$, $d=D,D-1,\ldots,1$, can only receive information from the previous virtual node $i_{[d-1]}$; $i_{[0]}$ can only receive information from the real node $i$, or simply keep its value unchanged; computing agents cannot communicate among themselves.

Figure 6. Example of augmented graph, when the maximum delay is $D=2$; three noncomputing agents are added for each node $i\in\mathcal V$. (a) Snapshot of the original graph; (b) augmented graph associated with (a).

At the beginning of each iteration $k$, every computing agent $i\in\mathcal V$ stores the information $x_i^k$, whereas every noncomputing agent $i_{[d]}$, with $d=0,1,\ldots,D$, stores the delayed information $v_i^{k-d}$. The dynamics over the augmented graph at iteration $k$ is described by (33). In words: any noncomputing agent $i_{[d]}$, with $i\in\mathcal V$ and $d=D,D-1,\ldots,1$, receives the information from $i_{[d-1]}$; the noncomputing agent $i^k_{[0]}$ receives the perturbed information $x_{i^k}^k+\delta^k$ from node $i^k$; the values of the noncomputing agents $j_{[0]}$, $j\in\mathcal V\setminus\{i^k\}$, remain the same; node $i^k$ sets its new value as a weighted average of the perturbed information $x_{i^k}^k+\delta^k$ and the $v_j^{\,k-d_j^k}$'s received from the virtual nodes $j_{[d_j^k]}$, $j\in\mathcal N^{\rm in}_{i^k}$; and the values of the other computing agents remain the same. The dynamics is further illustrated in Fig. 7.
The following lemma shows that the product of a sufficiently large number of any instantiations of the matrix $\widehat W^k$, under Assumption 6, is a scrambling matrix.

Figure 7. The dynamics at iteration $k$: agent $i^k$ uses the delayed information $v_j^{k-1}$ from the virtual node $j_{[1]}$.

Lemma 30.
Let $\{\widehat W^k\}_{k\in\mathbb N}$ be the sequence of augmented matrices generated according to the dynamical system (31), under Assumption 6, and with $W$ satisfying Assumption 4 (i), (ii). Then, for any $k\in\mathbb N$, $\widehat W^k$ is row stochastic, and $\widehat W^{k+K-1:k}$ has the property that all entries of its first $I$ columns are uniformly lower bounded by $\eta$.

Proof. We study any entry $\widehat W^{k+K-1:k}_{hm}$ with $m\in\mathcal V$ and $h\in\widehat{\mathcal V}$. We prove the result by considering the following four cases.

(i) Assume $h=m\in\mathcal V$. Since $\widehat W^k_{hh}\ge\bar m$ for any $k\in\mathbb N$ and any $h\in\mathcal V$, we have $\widehat W^{k+s-1:k}_{hh}\ge\prod_{t=k}^{k+s-1}\widehat W^t_{hh}\ge\bar m^s$, for all $k,s\in\mathbb N$ and $h\in\mathcal V$.

(ii) Assume $(m,h)\in\mathcal E$. Suppose that the first time agent $h$ wakes up during the interval $[k+T+D,\,k+2T+D-1]$ is $s$, and that agent $h$ uses the information $v_m^{s-d}$ from the noncomputing agent $m_{[d]}$. Then we have
\[
\widehat W^{s:s-d}_{h,m_{[0]}} \ge \widehat W^{s}_{h,m_{[d]}}\cdots\widehat W^{s-d}_{m_{[1]},m_{[0]}} = w_{hm} \ge \bar m.
\]
Then suppose that the last time agent $m$ wakes up during the interval $[s-d-T,\,s-d-1]$ is $s-d-t$. The noncomputing agent $m_{[0]}$ receives the (perturbed) information from agent $m$ at iteration $s-d-t$ and then performs self-loops (i.e., keeps its value unchanged) during the interval $[s-d-t+1,\,s-d-1]$. Thus we have
\[
\widehat W^{s-d-1:s-d-t}_{m_{[0]},m} = \widehat W^{s-d-1:s-d-t+1}_{m_{[0]},m_{[0]}}\,\widehat W^{s-d-t}_{m_{[0]},m} = 1\cdot 1 \ge \bar m^{\,t-1}.
\]
Therefore we have
\[
\widehat W^{k+2T+D-1:k}_{hm} \ge \widehat W^{k+2T+D-1:s+1}_{hh}\,\widehat W^{s:s-d}_{h,m_{[0]}}\,\widehat W^{s-d-1:s-d-t}_{m_{[0]},m}\,\widehat W^{s-d-t-1:k}_{mm} \ge \bar m^{\,2T+D}.
\]
Further, we have
\[
\widehat W^{k+K-1:k}_{hm} \ge \widehat W^{k+K-1:k+2T+D}_{hh}\,\widehat W^{k+2T+D-1:k}_{hm} \ge \bar m^{\,K-2T-D}\,\bar m^{\,2T+D} = \bar m^{K}.
\]

(iii) Assume $m\neq h$ and $(m,h)\in\mathcal V\times\mathcal V\setminus\mathcal E$. Because the graph $(\mathcal V,\mathcal E)$ is connected, there are mutually different agents $i_1,\ldots,i_r$, with $r\le I-2$, such that $(m,i_1),(i_1,i_2),\ldots,(i_{r-1},i_r),(i_r,h)\in\mathcal E$, which is a directed path from $m$ to $h$. Then, by the result proved in (ii), we have
\[
\widehat W^{k+(I-1)(2T+D)-1:k}_{hm} \ge \widehat W^{k+(I-1)(2T+D)-1:k+(r+1)(2T+D)}_{hh}\,\widehat W^{k+(r+1)(2T+D)-1:k+r(2T+D)}_{h\,i_r}\cdots\widehat W^{k+2T+D-1:k}_{i_1 m} \ge \bar m^{(I-2-r)(2T+D)}\,\bar m^{(r+1)(2T+D)} = \bar m^{(I-1)(2T+D)}.
\]
Then we can easily get
\[
\widehat W^{k+K-1:k}_{hm} \ge \widehat W^{k+K-1:k+(I-1)(2T+D)}_{hh}\,\widehat W^{k+(I-1)(2T+D)-1:k}_{hm} \ge \bar m^{\,K-(I-1)(2T+D)}\,\bar m^{(I-1)(2T+D)} = \bar m^{K}.
\]

(iv) If $h$ is a noncomputing node, it must be affiliated with a computing agent $j\in\mathcal V$; i.e., there exists $0\le d\le D$ such that $h=j_{[d]}$. Then we have
\[
\widehat W^{k+K-1:k+K-d}_{h,j_{[0]}} = \widehat W^{k+K-1}_{j_{[d]},j_{[d-1]}}\cdots\widehat W^{k+K-d}_{j_{[1]},j_{[0]}} = 1.
\]
Suppose that the last time agent $j$ wakes up during the interval $[k+K-d-T,\,k+K-d-1]$ is $s$. We have
\[
\widehat W^{k+K-d-1:s}_{j_{[0]},j} = \widehat W^{k+K-d-1:s+1}_{j_{[0]},j_{[0]}}\,\widehat W^{s}_{j_{[0]},j} = 1.
\]
By the results proved before, we have
\[
\widehat W^{k+K-1:k}_{hm} \ge \widehat W^{k+K-1:k+K-d}_{h,j_{[0]}}\,\widehat W^{k+K-d-1:s}_{j_{[0]},j}\,\widehat W^{s-1:k+(I-1)(2T+D)}_{jj}\,\widehat W^{k+(I-1)(2T+D)-1:k}_{jm} \ge 1\cdot 1\cdot\bar m^{\,s-k-(I-1)(2T+D)}\,\bar m^{(I-1)(2T+D)} \ge \bar m^{K}.
\]

Based on Lemma 30, we get the following result according to the discussion in [31].
Lemma 31.
In the setting above, there exists a sequence of stochastic vectors $\{\psi^k\}_{k\in\mathbb N}$ such that, for any $k\ge t\ge 0$,
\[
\big\|\widehat W^{k:t} - \mathbf 1\,\psi^{t\top}\big\|_\infty \le \frac{1+\bar m^{-K}}{1-\bar m^{K}}\,\rho^{k-t}.
\]
Furthermore, $\psi_i^k\ge\eta=\bar m^{K}$, for all $k\ge 0$ and $i\in\mathcal V$.

The above result leads to Lemma 17 by noticing that
\[
\big\|\widehat W^{k:t}-\mathbf 1\,\psi^{t\top}\big\| \le \sqrt{(D+2)I}\,\big\|\widehat W^{k:t}-\mathbf 1\,\psi^{t\top}\big\|_\infty \le C_2\,\rho^{k-t}.
\]

REFERENCES

[1] Y. Tian, Y. Sun, and G. Scutari, “Asy-sonata: Achieving linear convergence for distributed asynchronous optimization,” in
Proc. of Allerton 2018, Oct. 3-5.
[2] ——, “Asy-sonata: Achieving geometric convergence for distributed asynchronous optimization,” arXiv:1803.10359, Mar. 2018.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[4] A. Nedić, “Asynchronous broadcast-based convex optimization over a network,” IEEE Trans. Auto. Contr., vol. 56, no. 6, pp. 1337–1351, 2011.
[5] X. Zhao and A. H. Sayed, “Asynchronous adaptation and learning over networks–Part I/Part II/Part III: Modeling and stability analysis/Performance analysis/Comparison analysis,” IEEE Trans. Signal Process., vol. 63, no. 4, pp. 811–858, 2015.
[6] S. Kumar, R. Jain, and K. Rajawat, “Asynchronous optimization over heterogeneous networks via consensus ADMM,” IEEE Trans. Signal Inf. Process. Netw., vol. 3, no. 1, pp. 114–129, 2017.
[7] M. Eisen, A. Mokhtari, and A. Ribeiro, “Decentralized quasi-Newton methods,” IEEE Trans. Signal Process., vol. 65, no. 10, pp. 2613–2628, 2017.
[8] N. Bof, R. Carli, G. Notarstefano, L. Schenato, and D. Varagnolo, “Newton-Raphson consensus under asynchronous and lossy communications for peer-to-peer networks,” arXiv:1707.09178, 2017.
[9] Z. Peng, Y. Xu, M. Yan, and W. Yin, “ARock: an algorithmic framework for asynchronous parallel coordinate updates,” SIAM J. Sci. Comput., vol. 38, no. 5, pp. A2851–A2879, 2016.
[10] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, “Decentralized consensus optimization with asynchrony and delays,” IEEE Trans. Signal Inf. Process. Netw., vol. PP, no. 99, 2017.
[11] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Trans. Auto. Contr., vol. 31, no. 9, pp. 803–812, 1986.
[12] J. Liu and S. J. Wright, “Asynchronous stochastic coordinate descent: Parallelism and convergence properties,” SIAM J. Optim., vol. 25, no. 1, pp. 351–376, 2015.
[13] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex optimization,” Math. Prog., June 2019.
[14] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild: a lock-free approach to parallelizing stochastic gradient descent,” in Proc. of NIPS 2011, pp. 693–701.
[15] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Proc. of NIPS 2015, pp. 2719–2727.
[16] I. Notarnicola and G. Notarstefano, “Asynchronous distributed optimization via randomized dual proximal gradient,” IEEE Trans. Auto. Contr., vol. 62, no. 5, pp. 2095–2106, 2017.
[17] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Convergence of asynchronous distributed gradient methods over stochastic networks,” IEEE Trans. Auto. Contr., vol. 63, no. 2, pp. 434–448, 2017.
[18] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, “Asynchronous distributed optimization using a randomized alternating direction method of multipliers,” in Proc. of CDC 2013, pp. 3671–3676.
[19] E. Wei and A. Ozdaglar, “On the o(1/k) convergence of asynchronous distributed alternating direction method of multipliers,” in Proc. of GlobalSIP 2013, pp. 551–554.
[20] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization,”
IEEE Trans. Auto. Contr. , vol. 61, no. 10, pp. 2947–2957, 2016.[21] H. Wang, X. Liao, T. Huang, and C. Li, “Cooperative distributedoptimization in multiagent networks with delays,”
IEEE Trans. Syst.Man Cybern. Syst. , vol. 45, no. 2, pp. 363–369, 2015.[22] J. Li, G. Chen, Z. Dong, and Z. Wu, “Distributed mirror descent methodfor multi-agent optimization with delay,”
Neurocomputing , vol. 177, pp.643–650, 2016.[23] K. I. Tsianos and M. G. Rabbat, “Distributed dual averaging for convexoptimization under communication delays,” in
Proc. of ACC 2012 , pp.1067–1072.[24] ——, “Distributed consensus and optimization under communicationdelays,” in
Proc. of Allerton 2011 , pp. 974–982. [25] P. Lin, W. Ren, and Y. Song, “Distributed multi-agent optimization sub-ject to nonidentical constraints and communication delays,”
Automatica ,vol. 65, pp. 120–131, 2016.[26] T. T. Doan, C. L. Beck, and R. Srikant, “Impact of communicationdelays on the convergence rate of distributed optimization algorithms,” arXiv:1708.03277 , 2017.[27] P. Di Lorenzo and G. Scutari, “Distributed nonconvex optimization overnetworks,” in
Proc. of IEEE CAMSAP 2015 , Dec. 2015.[28] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradientmethods for multi-agent optimization under uncoordinated constantstepsizes,” in
Proc. of CDC 2015 , Dec., pp. 2055–2060.[29] P. Di Lorenzo and G. Scutari, “Next: In-network nonconvex optimiza-tion,”
IEEE Trans. Signal Inf. Process. Netw. , vol. 2, no. 2, pp. 120–136,2016.[30] C. N. Hadjicostis, N. H. Vaidya, and A. D. Dominguez-Garcia, “Robustdistributed average consensus via exchange of running sums,”
IEEETrans. Auto. Contr. , vol. 31, no. 6, pp. 1492–1507, 2016.[31] A. Nedi´c and A. Ozdaglar, “Convergence rate for consensus withdelays,”
J. Glob. Optim. , vol. 47, no. 3, pp. 437–456, 2010.[32] P. Lin and W. Ren, “Constrained consensus in unbalanced networkswith communication delays,”
IEEE Trans. Auto. Contr. , vol. 59, no. 3,pp. 775–781, 2013.[33] A. Nedi´c, A. Olshevsky, and W. Shi, “Achieving geometric convergencefor distributed optimization over time-varying graphs,”
SIAM J. Optim. ,vol. 27, no. 4, pp. 2597–2633, 2017.[34] Y. Sun, G. Scutari, and D. Palomar, “Distributed nonconvex multiagentoptimization over time-varying networks,” in
Proc. of Asilomar 2016 .IEEE, pp. 788–794.[35] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimizationover time-varying digraphs,”
Math. Prog. , vol. 176, no. 1-2, pp. 497–544, July 2019.[36] Y. Sun, A. Daneshmand, and G. Scutari, “Convergence rate of distributedoptimization algorithms based on gradient tracking,” arXiv:1905.02637 ,2019.[37] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation ofaggregate information,” in
Proc. of FOCS 2003 . IEEE, pp. 482–491.[38] N. Bof, R. Carli, and L. Schenato, “Average consensus with asyn-chronous updates and unreliable communication,” in
Proc. of the IFACWord Congress 2017 , pp. 601–606.[39] T. S. Rappaport,
Wireless Communications: Principles & Practice .Prentice Hall, 2002.[40] S. M. Kay,
Fundamentals of Statistical Signal Processing–EstimationTheory . Prentice Hall, 1993.[41] L. Cannelli, F. Facchinei, and G. Scutari, “Multi-agent asynchronousnonconvex large-scale optimization,” in
Proc. of IEEE CAMSAP 2017 ,pp. 1–5.[42] L. Zhao, M. Mammadov, and J. Yearwood, “From convex to nonconvex:a loss function analysis for binary classification,” in , pp. 1281–1288.[43] D. Dua and C. Graff, “UCI machine learning repository,” 2017.[Online]. Available: http://archive.ics.uci.edu/ml[44] M. Assran and M. Rabbat, “Asynchronous subgradient-push,” arXiv:1803.08950 , 2018.[45] J. Zhang and K. You, “Asyspa: An exact asynchronous algorithm forconvex optimization over digraphs,” arXiv:1808.04118 , 2018.[46] R. A. Horn and C. R. Johnson,