Communication-Censored Linearized ADMM for Decentralized Consensus Optimization
Weiyu Li, Yaohua Liu, Zhi Tian, and Qing Ling
Abstract—In this paper, we propose a communication- and computation-efficient algorithm to solve a convex consensus optimization problem defined over a decentralized network. A remarkable existing algorithm to solve this problem is the alternating direction method of multipliers (ADMM), in which at every iteration every node updates its local variable by combining neighboring variables and solving an optimization subproblem. The proposed algorithm, called communication-censored linearized ADMM (COLA), leverages a linearization technique to reduce the iteration-wise computation cost of ADMM and uses a communication-censoring strategy to alleviate the communication cost. To be specific, COLA introduces successive linearization approximations to the local cost functions such that the resultant computation is first-order and light-weight. Since the linearization technique slows down the convergence speed, COLA further adopts the communication-censoring strategy to avoid transmissions of less informative messages. A node is allowed to transmit only if the distance between the current local variable and its previously transmitted one is larger than a censoring threshold. COLA is proven to be convergent when the local cost functions have Lipschitz continuous gradients and the censoring threshold is summable. When the local cost functions are further strongly convex, we establish the linear (sublinear) convergence rate of COLA, given that the censoring threshold linearly (sublinearly) decays to zero. Numerical experiments corroborate the theoretical findings and demonstrate the satisfactory communication-computation tradeoff of COLA.

Index Terms—Decentralized network, consensus optimization, communication-censoring strategy, linearized approximation, alternating direction method of multipliers.
I. INTRODUCTION
In this paper, we consider solving a convex consensus optimization problem

$$\tilde{x}^* = \arg\min_{\tilde{x}} \sum_{i=1}^{n} f_i(\tilde{x}), \qquad (1)$$

which is defined over a bidirectionally connected decentralized network consisting of n nodes. All the nodes cooperate to find an optimal argument x̃* of the common optimization variable x̃ ∈ R^p, but the convex local cost function f_i(x̃): R^p → R held by every node i is kept private. We focus on the scenario where the nodes are unable to afford complicated computation, while the communication resources are also limited. Our goal is to devise a communication-efficient decentralized algorithm, which relies on light-weight computation, to solve (1).

Weiyu Li is with the School of Gifted Young, University of Science and Technology of China, Hefei, Anhui 230026, China. Yaohua Liu is with the Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China. Zhi Tian is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030, USA. Qing Ling is with the School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, Guangdong 510006, China. Zhi Tian is supported by NSF grants CCF-1527396 and IIS-1741338. Qing Ling is supported by China NSF grants 61573331 and 11728105. Part of this paper has appeared in the 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, May 12–17, 2019. Corresponding Email: [email protected].

Decentralized consensus optimization has attracted extensive interest in recent years. Problems in the form of (1) arise in a variety of research areas, including wireless sensor networks [1]–[3], communication networks [4], [5], multi-robot networks [6], [7], smart grids [8]–[10], and machine learning systems [11]–[13], to name a few. Popular algorithms to solve (1) span from the primal domain to the dual domain.
The primal domain algorithms, such as sub-gradient descent [14]–[16], dual averaging [17]–[19] and network Newton [20], have to use diminishing step sizes to guarantee exact convergence to an optimal solution, and thus suffer from slow convergence. On the other hand, (1) can be reformulated as a constrained optimization problem and solved by dual domain algorithms, among which the celebrated alternating direction method of multipliers (ADMM) is able to achieve fast and exact convergence [2], [21]–[23]. When ADMM is implemented in a synchronous manner, at every iteration, every node solves an optimization subproblem dependent on its local cost function, and then exchanges the calculated local variable with its neighbors. Therefore, if the local cost functions are not in simple forms, solving the subproblems is computationally demanding. To alleviate the computation cost, the decentralized linearized ADMM (DLM) replaces the local cost functions in ADMM by their linear approximations, and attains a dual domain method with light-weight computation [24], [25]. Similar techniques have also been applied to develop other first-order dual domain algorithms, such as EXTRA [26], NEXT [27], and gradient tracking methods [28]–[32]. If computing the inverse of a Hessian matrix is affordable at a node, one can replace the local cost functions by their quadratic approximations. The resultant second-order algorithms, DQM and ESOM, have faster convergence than their first-order counterparts [33], [34]. Between the first- and second-order algorithms, a recent work in [35] develops a primal-dual quasi-Newton method that approximates the second-order information with local gradients. The lower complexity bounds and rate-optimal algorithms of decentralized optimization are developed in [36]–[38].
Note that the communication cost in the aforementioned algorithms is proportional to the number of iterations, since at every iteration every node needs to communicate with its neighbors.

In all decentralized algorithms, there is an essential communication-computation tradeoff [39]–[43]. An algorithm with light-weight iteration-wise computation generally needs more iterations, and consequently more communication cost, to reach a target accuracy. For example, compared with ADMM, DLM enjoys simple gradient-based computation, but suffers from relatively slow convergence speed and high communication cost. In this paper, we aim at achieving a favorable communication-computation tradeoff in a decentralized network whose nodes can only afford light-weight gradient-based computation. The limitation on the computation power may come from the nodes being equipped with cheap computing units in a wireless sensor network, or from the fact that using higher-order information is prohibitively time-consuming for finding a high-dimensional solution in a machine learning system.

Given the constraint on the computation cost, we adopt the communication-censoring strategy to further save the communication cost. The basic idea of the communication-censoring strategy is to only allow transmissions of informative messages over the network. A simple yet powerful protocol is to prevent a node from transmitting a variable that is close to its previously transmitted one, where the "closeness" is determined by comparing the Euclidean distance with a predefined time-varying censoring threshold. The communication-censoring strategy is tightly related to event-triggered control of continuous-time networks [44]–[46], and finds successful applications in discrete-time decentralized optimization [47]–[50]. It has been combined with primal domain methods such as sub-gradient descent [47] and dual averaging [48], as well as dual domain methods such as dual decomposition [49] and ADMM [50].
However, similar to their uncensored counterparts, the primal domain methods in [47], [48] have to use diminishing step sizes to guarantee exact convergence. On the other hand, the dual domain methods in [49], [50] require the nodes to solve computationally demanding subproblems. Our proposed algorithm, called communication-censored linearized ADMM (COLA), combines the communication-censoring strategy with the first-order dual domain method DLM. Particularly, we modify the standard communication-censoring strategy in [47]–[50] to fit the special algorithmic structure of DLM so as to attain better performance. We rigorously establish convergence as well as sublinear and linear convergence rates of COLA. To the best of our knowledge, COLA is the first communication-censored method that only uses gradient information but achieves linear convergence.

Starting from the derivation of the classical ADMM in Section II-A, we introduce COLA in Section II-B. COLA modifies ADMM in two aspects. First, linearizing the local cost functions enables approximately solving the time-consuming subproblems in ADMM, and thus saves computation. Second, the communication-censoring strategy is applied to remedy the poor communication efficiency caused by the linearization step. To further demonstrate the design principles of COLA, its tradeoff between communication and computation is discussed and compared with those of several existing dual domain algorithms in Section II-C. In Section III, we prove that when the censoring threshold is properly chosen, COLA converges to an optimal solution of (1) (Theorem 1). Moreover, when the local cost functions are strongly convex, the linear and sublinear convergence rates of COLA are established (Theorems 2 and 3). The analysis provides guidelines for choosing the parameters of COLA to reduce computation and communication costs. Section IV presents numerical experiments and demonstrates the communication-computation tradeoff of COLA. Section V summarizes our work.
Notation. For matrices A ∈ R^{a×n} and B ∈ R^{b×n}, [A; B] ∈ R^{(a+b)×n} stacks the two matrices by rows. Define the inner product of two vectors u and v as ⟨u, v⟩ := u^T v, which naturally induces the Euclidean norm ‖v‖ := √⟨v, v⟩ of a vector v. For a matrix M, define λ_min(M) as the smallest eigenvalue, σ_max(M) as the largest singular value, and σ̃_min(M) as the smallest nonzero singular value. When M is a block matrix, (M)_{i,j} denotes its (i, j)-th block.

Throughout the paper, we consider a bidirectionally connected network G = {V, A}, where V = {1, …, n} denotes the set of n nodes and A = {1, …, m} is the set of m directed arcs. Nodes i and j are called neighbors if (i, j) ∈ A and (j, i) ∈ A. We denote the set of node i's neighbors as N_i, with cardinality d_{ii} = |N_i|. Further define the extended block arc source matrix A_s ∈ R^{mp×np} containing m × n square blocks (A_s)_{e,i} ∈ R^{p×p}. The block (A_s)_{e,i} = I_p if the arc e = (i, j) ∈ A originates at node i and is null otherwise, where I_p is the p-dimensional identity matrix. Likewise, define the extended block arc destination matrix A_d ∈ R^{mp×np}, whose block (A_d)_{e,j} ∈ R^{p×p} is not null but I_p if and only if the arc e = (i, j) ∈ A terminates at node j. Then, define the extended oriented incidence matrix as G_o = A_s − A_d and the unoriented one as G_u = A_s + A_d. The oriented Laplacian is L_o = (1/2)G_o^T G_o and the unoriented Laplacian is L_u = (1/2)G_u^T G_u. The degree matrix is defined as D = (L_o + L_u)/2, which is block diagonal with diagonal blocks D_{i,i} = d_{ii} I_p.

II. ALGORITHM DEVELOPMENT
In this section, we propose COLA, the communication-censored linearized ADMM, to solve the decentralized consensus optimization problem (1). Rooted in ADMM, COLA features two ingredients: linearization to reduce the computation cost, and communication censoring to reduce the communication cost. We shall first introduce the development of ADMM in Section II-A, and then combine the linearization and communication-censoring techniques to devise COLA in Section II-B. The tradeoff between computation and communication is discussed in Section II-C.
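To make the notation concrete before developing the algorithm, the following sketch builds the arc matrices and the derived Laplacians for a small example. All choices here are ours for illustration, not from the paper: a three-node path graph with p = 1, so every I_p block reduces to the scalar 1.

```python
import numpy as np

# Illustrative 3-node path graph 0 - 1 - 2, with both directed arcs
# per edge, and p = 1 so each I_p block is the scalar 1.
arcs = [(0, 1), (1, 0), (1, 2), (2, 1)]
n, m = 3, len(arcs)

A_s = np.zeros((m, n)); A_d = np.zeros((m, n))
for e, (i, j) in enumerate(arcs):
    A_s[e, i] = 1.0          # arc e = (i, j) originates at node i
    A_d[e, j] = 1.0          # arc e = (i, j) terminates at node j

G_o, G_u = A_s - A_d, A_s + A_d          # oriented / unoriented incidence
L_o = 0.5 * G_o.T @ G_o                  # oriented Laplacian
L_u = 0.5 * G_u.T @ G_u                  # unoriented Laplacian
D = 0.5 * (L_o + L_u)                    # degree matrix
print(np.diag(D))                        # node degrees: [1. 2. 1.]
```

As a quick check, L_o recovers the usual graph Laplacian of the path, and the diagonal of D lists the node degrees d_{ii}.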
A. ADMM: Alternating Direction Method of Multipliers
ADMM is a powerful tool to solve a structured optimization problem with two blocks of variables, which are separable in the cost function and subject to a linear equality constraint. To rewrite (1) into the standard bivariate form, we introduce local variables x_i ∈ R^p as copies of x̃ at nodes i, and auxiliary variables z_{ij} ∈ R^p at arcs (i, j) ∈ A. Since the network is connected, (1) is equivalent to

$$\min_{\{x_i\},\{z_{ij}\}} \sum_{i=1}^{n} f_i(x_i), \quad \text{s.t. } x_i = z_{ij},\ x_j = z_{ij},\ \forall (i,j)\in\mathcal{A}. \qquad (2)$$

An optimal solution of (2) satisfies x_i* = x̃* and z_{ij}* = x̃*, where x̃* is an optimal solution of (1).

Concatenate the variables as x = [x_1; …; x_n] ∈ R^{np} and z = [z_1; …; z_m] ∈ R^{mp}, introduce the aggregate function f(x) := Σ_{i=1}^n f_i(x_i), and denote A := [A_s; A_d] ∈ R^{2mp×np} and B := [−I_{mp}; −I_{mp}]. The matrix form of (2) is

$$\min_{x,z} f(x), \quad \text{s.t. } Ax + Bz = 0, \qquad (3)$$

which is the standard bivariate form handled by ADMM, except that the variable z is absent in the cost function. Introduce the augmented Lagrangian of (3) as

$$L(x, z, \lambda) = f(x) + \langle \lambda, Ax + Bz \rangle + \frac{c}{2}\|Ax + Bz\|^2,$$

where the penalty parameter c > 0 is an arbitrary positive constant and the Lagrange multiplier λ := [φ; ψ] ∈ R^{2mp}. The two vectors φ, ψ ∈ R^{mp} are the Lagrange multipliers associated with the two constraints A_s x − z = 0 and A_d x − z = 0, respectively. At time k, the ADMM update follows

$$x^{k+1} = \arg\min_x L(x, z^k, \lambda^k),$$
$$z^{k+1} = \arg\min_z L(x^{k+1}, z, \lambda^k),$$
$$\lambda^{k+1} = \lambda^k + c(Ax^{k+1} + Bz^{k+1}).$$

According to [23], if the variables are initialized with φ^0 = −ψ^0 and G_u x^0 = 2z^0, then we can eliminate z^{k+1} and replace λ^{k+1} by a lower-dimensional dual variable, such that the update reduces to

$$x^{k+1} = \arg\min_x f(x) + \langle \mu^k - cL_u x^k, x \rangle + c x^T D x, \qquad (4)$$
$$\mu^{k+1} = \mu^k + cL_o x^{k+1}, \qquad (5)$$

where µ^k := G_o^T φ^k ∈ R^{np}. By splitting µ^k = [µ_1^k; …; µ_n^k], µ_i^k ∈ R^p denotes the local dual variable of node i.
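As a sanity check of the reduced recursion (4)-(5), the sketch below runs it on a hypothetical scalar instance (p = 1) with quadratic local costs f_i(x) = (h_i/2)(x − a_i)² on a triangle graph, for which the minimization in (4) has a closed form. The data h_i, a_i, the topology, and c are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Reduced ADMM recursion (4)-(5), p = 1, quadratic local costs
# f_i(x) = (h_i/2)*(x - a_i)^2 on a triangle graph (illustrative data).
h = np.array([1.0, 2.0, 3.0])            # local curvatures
a = np.array([3.0, 0.0, 1.0])            # local minimizers
c = 1.0
L_o = np.array([[2., -1., -1.], [-1., 2., -1.], [-1., -1., 2.]])
L_u = np.array([[2., 1., 1.], [1., 2., 1.], [1., 1., 2.]])
D = 0.5 * (L_o + L_u)                    # diag(2, 2, 2): node degrees

x = np.zeros(3); mu = np.zeros(3)
for k in range(500):
    # (4): the minimizer solves grad f(x) + mu - c*L_u*x_old + 2*c*D*x = 0
    x = np.linalg.solve(np.diag(h) + 2 * c * D, h * a - mu + c * L_u @ x)
    mu = mu + c * L_o @ x                # (5)

x_star = np.sum(h * a) / np.sum(h)       # centralized optimum of (1): 1.0
print(np.round(x, 4), x_star)            # all local copies reach consensus
```

With µ^0 = 0 and x^0 = z^0 = 0, the initialization conditions before (4) hold trivially, and every local copy approaches the centralized optimum Σ h_i a_i / Σ h_i.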
Using the definitions of f(x), D, L_u and L_o, we describe how the decentralized ADMM is implemented. At time k, every node i updates its local primal variable x_i^{k+1} using its x_i^k and µ_i^k, as well as x_j^k from all neighbors j, via

$$x_i^{k+1} = \arg\min_{x_i} f_i(x_i) + \big\langle \mu_i^k - c\sum_{j\in N_i}(x_i^k + x_j^k),\ x_i \big\rangle + c d_{ii}\|x_i\|^2. \qquad (6)$$

Then node i broadcasts x_i^{k+1} to all neighbors. Finally, node i updates its local dual variable µ_i^{k+1} using its x_i^{k+1} and µ_i^k, as well as x_j^{k+1} from all neighbors j, via

$$\mu_i^{k+1} = \mu_i^k + c\sum_{j\in N_i}(x_i^{k+1} - x_j^{k+1}). \qquad (7)$$

The costs of implementing ADMM are two-fold. The first is in computing the local primal and dual variables x_i^k and µ_i^k, in which the update of x_i^k in (6) is particularly demanding when the local cost function f_i(x_i) is complicated. The second is in transmitting the local primal variables x_i^{k+1}, which is expensive when the bandwidth resource is limited.

B. COLA: Communication-Censored Linearized ADMM
COLA adopts two strategies to improve the computation and communication efficiency of ADMM: linearization and communication censoring. The linearization technique has been used in [24], [25] to devise DLM, a gradient-based variant of ADMM. DLM effectively reduces the computation cost of solving subproblems in ADMM, but sacrifices convergence speed and thus results in high communication cost. Therefore, we use the communication-censoring strategy to prevent transmissions of less informative messages. Note that though the communication-censoring strategy has been applied to improve the communication efficiency of sub-gradient descent, dual averaging, dual decomposition and ADMM [47]–[50], we customize it in COLA so as to achieve a satisfactory balance between communication and computation, as we shall explain below.
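The censoring protocol itself can be isolated as a tiny helper, sketched here with illustrative names (not code from the paper): a node broadcasts a newly computed iterate only if it has moved at least a threshold tau away from the value it last transmitted.

```python
import numpy as np

# Sketch of the communication-censoring rule (names are illustrative).
def censor(x_hat_i, x_new_i, tau):
    """Return the state value to keep and whether node i transmits."""
    if np.linalg.norm(x_hat_i - x_new_i) >= tau:
        return x_new_i, True     # informative enough: broadcast it
    return x_hat_i, False        # censored: neighbors keep the old state
```

Setting tau = 0 always transmits, which recovers the uncensored algorithm.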
Linearization.
Notice that the update of the primal variable x_i^{k+1} in (6), which usually has no explicit solution, dominates the computation cost of ADMM, since a computationally demanding inner loop is required to solve for x_i^{k+1}. To address this issue, [24], [25] linearize the local cost functions at every iteration. To be specific, at time k, the function f_i(x_i) in (6) is replaced by its quadratic approximation f_i(x_i^k) + ⟨∇f_i(x_i^k), x_i − x_i^k⟩ + (ρ/2)‖x_i − x_i^k‖² at x_i = x_i^k, where ρ > 0 is a positive linearization parameter. Therefore, the primal variable is updated via

$$x_i^{k+1} = x_i^k - \frac{1}{2cd_{ii}+\rho}\Big(\nabla f_i(x_i^k) + c\sum_{j\in N_i}(x_i^k - x_j^k) + \mu_i^k\Big). \qquad (8)$$

Note that the main computation cost of (8) is in calculating the gradient ∇f_i(x_i^k), which is light-weight. The update of the dual variable remains the same as (7) in ADMM.

Communication censoring.
The linearization technique significantly reduces the computation cost of ADMM, but slows down the convergence speed, and hence results in high communication cost. Therefore, we introduce the communication-censoring strategy to further reduce the communication cost. Intuitively, when x_i^{k+1} is close to x_i^k, it is not necessary for node i to transmit both of them to its neighbors. Motivated by this fact, the communication-censoring strategy prevents transmissions of less informative messages so as to reduce the communication cost.

To rigorously explain the communication-censoring strategy, define a state variable x̂_i^k ∈ R^p as the latest value that node i has transmitted to its neighbors before time k. At time k, after calculating x_i^{k+1}, node i evaluates the difference between x̂_i^k and x_i^{k+1} by their Euclidean distance ξ_i^{k+1} = ‖x̂_i^k − x_i^{k+1}‖, and then compares the difference with a predefined censoring threshold τ^{k+1} ≥ 0. Node i is allowed to transmit x_i^{k+1} to its neighbors and update x̂_i^{k+1} = x_i^{k+1} if and only if ξ_i^{k+1} ≥ τ^{k+1}. Otherwise, the transmission is censored and x̂_i^{k+1} = x̂_i^k. With the state variable x̂_i^k, COLA changes the DLM updates (8) and (7) to

$$x_i^{k+1} = x_i^k - \frac{1}{2cd_{ii}+\rho}\Big(\nabla f_i(x_i^k) + c\sum_{j\in N_i}(\hat{x}_i^k - \hat{x}_j^k) + \mu_i^k\Big), \qquad (9)$$
$$\mu_i^{k+1} = \mu_i^k + c\sum_{j\in N_i}(\hat{x}_i^{k+1} - \hat{x}_j^{k+1}). \qquad (10)$$

Stacking the state variables in x̂ = [x̂_1; …; x̂_n] ∈ R^{np}, we can write (9) and (10) in the matrix form of

$$x^{k+1} = x^k - (2cD + \rho I)^{-1}\big(\nabla f(x^k) + cL_o \hat{x}^k + \mu^k\big), \qquad (11)$$
$$\mu^{k+1} = \mu^k + cL_o \hat{x}^{k+1}. \qquad (12)$$

COLA run by node i is outlined in Algorithm 1. At time 0, node i initializes its local variables to x_i^0 = 0, µ_i^0 = 0, x̂_i^0 = 0 and x̂_j^0 = 0 for all j ∈ N_i.
For all times k, node i first computes its local primal variable x_i^{k+1} by (9). The computation of x_i^{k+1} at node i is based on its latest local primal-dual variables x_i^k and µ_i^k, the latest broadcast information x̂_i^k of itself and x̂_j^k from its neighbors j, as well as the gradient of the local cost function f_i(x_i) at x_i = x_i^k. Then the difference between the newly computed primal variable x_i^{k+1} and the previously transmitted x̂_i^k is calculated and denoted by ξ_i^{k+1}. If ξ_i^{k+1} ≥ τ^{k+1}, meaning that the difference exceeds the threshold to communicate, node i transmits x_i^{k+1} to its neighbors and lets x̂_i^{k+1} = x_i^{k+1}. Otherwise, node i does not transmit and lets x̂_i^{k+1} = x̂_i^k. On the other hand, if node i receives x_j^{k+1} from any neighbor j, then it lets x̂_j^{k+1} = x_j^{k+1}. Otherwise, it lets x̂_j^{k+1} = x̂_j^k. Observe that this communication protocol guarantees that node i and its neighbors store the same state variable x̂_i^{k+1}. Finally, the local dual variable µ_i^{k+1} is updated by (10).

Algorithm 1 COLA Run by Node i
Require: Initialize local variables to x_i^0 = 0, µ_i^0 = 0, x̂_i^0 = 0 and x̂_j^0 = 0 for all j ∈ N_i.
for times k = 0, 1, ⋯ do
  Compute local primal variable x_i^{k+1} by
    x_i^{k+1} = x_i^k − (1/(2cd_{ii} + ρ)) (∇f_i(x_i^k) + c Σ_{j∈N_i}(x̂_i^k − x̂_j^k) + µ_i^k).
  Compute ξ_i^{k+1} = ‖x̂_i^k − x_i^{k+1}‖.
  If ξ_i^{k+1} ≥ τ^{k+1}, transmit x_i^{k+1} to neighbors and let x̂_i^{k+1} = x_i^{k+1}; else do not transmit and let x̂_i^{k+1} = x̂_i^k.
  If receive x_j^{k+1} from any neighbor j, let x̂_j^{k+1} = x_j^{k+1}; else let x̂_j^{k+1} = x̂_j^k.
  Update local dual variable µ_i^{k+1} as
    µ_i^{k+1} = µ_i^k + c Σ_{j∈N_i}(x̂_i^{k+1} − x̂_j^{k+1}).
end for

Remark 1. Comparing (8) and (7) with (9) and (10), we observe that the only difference between DLM and COLA is replacing c Σ_{j∈N_i}(x_i^k − x_j^k) by c Σ_{j∈N_i}(x̂_i^k − x̂_j^k) in the primal-dual updates. This is not the standard strategy used in the other communication-censored algorithms [47]–[50], where all the local primal variables x_i are replaced by the state variables x̂_i. We customize the communication-censoring strategy for COLA and keep the local primal variable x_i^k in x_i^k − (1/(2cd_{ii} + ρ))∇f_i(x_i^k) as it is, because x_i^k is already available at node i, and is more up-to-date than x̂_i^k. Recall that the term x_i^k − (1/(2cd_{ii} + ρ))∇f_i(x_i^k) comes from the linearization of f_i(x_i). Intuitively, linearization around x_i = x_i^k leads to faster convergence than linearization around x_i = x̂_i^k, which has been validated in our preliminary numerical experiments. On the other hand, we do not replace the state variables x̂_i^k by the corresponding local primal variables x_i^k in the term c Σ_{j∈N_i}(x̂_i^k − x̂_j^k) in either the primal or the dual update. Note that x_i^k and x̂_i^k are not equal when communication censoring happens and the error between them is determined by the censoring threshold τ^k, while the dual update (10) accumulates all the previous differences between the neighboring state variables x̂_i^k and x̂_j^k.
Thus, replacing the state variables x̂_i^k therein by the corresponding local primal variables x_i^k would accumulate the errors, and result in instability or even divergence of the recursion.

The censoring threshold τ^k is a critical factor that influences the communication-computation tradeoff of COLA. Setting a large τ^k prevents less-informative transmissions, and thus reduces the iteration-wise communication cost, though the recursion needs more iterations and hence more computation cost to reach a target accuracy. However, a too large τ^k slows down the convergence speed, which in turn increases both the overall computation and communication costs. Since τ^k sets an upper bound for the distance between x_i^k and x̂_i^k, a small improvement of the local primal variable x_i^k cannot be accepted into the state variable x̂_i^k and diffused to the network. In this sense, the primal variable cannot converge faster than τ^k. We shall give a rigorous analysis of this issue in the theoretical analysis. If we expect to obtain a linear rate of convergence, a choice for the censoring threshold is

$$\tau^k = \alpha \cdot \beta^k, \qquad (13)$$

where β ∈ (0, 1) and α > 0 are constants. If τ^k is set as α · k^{−r} with r > 1, a sublinear rate depending on r will be derived. A special case is τ^k = 0 for all times k, meaning that there is no censoring and COLA degenerates to DLM.

C. Tradeoff between Communication and Computation
Here we discuss the communication-computation tradeoff of ADMM, DLM, and their communication-censored versions, COCA and COLA.

Generally speaking, among the four algorithms, ADMM needs the fewest iterations to reach a target accuracy, but the computation cost of solving subproblems is often remarkable. DLM alleviates the iteration-wise computation cost through linearization, but requires more iterations and higher overall communication cost than ADMM.

The communication-censoring strategy in COCA and COLA adjusts the communication-computation tradeoff through tuning the censoring threshold τ^k. As we have discussed in Section II-B, a larger τ^k leads to more iterations and thus higher computation cost, but lower iteration-wise communication cost. Regarding the overall communication cost required to reach a target accuracy, there is a phase transition in tuning τ^k. When τ^k is too large, communication censoring happens too often and many more iterations are necessary to compensate for the information loss, which deteriorates the overall communication cost.

Though COCA and COLA both adopt the communication-censoring strategy, their application scenarios are different. COCA fits applications where the computation of solving complicated subproblems is not an issue, but communication is the main bottleneck. Examples include distributed resource allocation in a data center network and collaborative target tracking in a radar network. On the contrary, COLA inherits the advantage of light-weight computation from DLM, and further reduces the communication cost on top of it. In this sense, COLA fits applications where nodes are unable to afford solving complicated subproblems due to hardware or time constraints, such as an IoT network equipped with cheap computation units or a drone network cruising in a fast-changing environment.

An illustration of the tradeoff between computation efficiency and communication efficiency in ADMM, DLM, COCA and COLA is given in Fig. 1.
Fig. 1. Illustration of the tradeoff between computation efficiency and communication efficiency in ADMM, DLM, COCA and COLA.
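To make Algorithm 1 and the tradeoff concrete, here is a minimal runnable sketch of COLA on a hypothetical scalar instance f_i(x) = (1/2)(x − a_i)² over a 4-node ring, with a linearly decaying threshold τ^k = αβ^k. All data and parameters are illustrative choices of ours, not from the paper; here M = m = 1 and ρ = 2 > M, so the step-size condition cλ_min(L_u) + ρ > M of the analysis holds.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (COLA): scalar costs f_i(x) = 0.5*(x - a_i)^2
# on a 4-node ring; a_i, c, rho, alpha, beta are illustrative choices.
n, c, rho = 4, 1.0, 2.0
a = np.array([1.0, 2.0, 3.0, 4.0])       # consensus optimum is mean(a) = 2.5
nbrs = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
alpha, beta = 1.0, 0.9                   # censoring threshold tau^k = alpha*beta^k

x = np.zeros(n); mu = np.zeros(n); x_hat = np.zeros(n)
transmissions = 0
for k in range(600):
    tau = alpha * beta ** (k + 1)
    # primal update (9): gradient at x_i, neighbor terms via the states x_hat
    x = np.array([x[i] - ((x[i] - a[i])
                          + c * sum(x_hat[i] - x_hat[j] for j in nbrs[i])
                          + mu[i]) / (2 * c * len(nbrs[i]) + rho)
                  for i in range(n)])
    for i in range(n):                   # censoring step
        if abs(x_hat[i] - x[i]) >= tau:
            x_hat[i] = x[i]              # transmit and refresh the state
            transmissions += 1
    for i in range(n):                   # dual update (10) on the states
        mu[i] += c * sum(x_hat[i] - x_hat[j] for j in nbrs[i])

print(np.round(x, 3), transmissions)     # local copies near 2.5
```

With τ^k = 0 every update is transmitted and the loop reduces to DLM; increasing α or β trades extra iterations for fewer transmissions.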
III. CONVERGENCE AND RATES OF CONVERGENCE
In this section, we prove that COLA converges to an optimal solution of the convex consensus optimization problem (1) under mild conditions. Further, if the local cost functions are strongly convex, COLA converges to the unique optimal solution of (1) at a linear or sublinear rate, depending on the choice of the censoring threshold. Section III-A provides assumptions and lemmas for the proofs. Section III-B analyzes the convergence of COLA, while linear and sublinear rates are established in Section III-C.
A. Preliminaries
We make the following assumptions for the analysis. Assumptions 1–4 are sufficient for proving the convergence of COLA to an optimal solution of (1). Further, with Assumption 5, COLA is guaranteed to converge to the unique optimal solution of (1) at a linear (sublinear) rate when the censoring threshold is linearly (sublinearly) decaying to zero.

Assumption 1 (Network connectivity). The communication graph G = {V, A} is bidirectionally connected.

Assumption 2 (Convexity and differentiability). The local cost functions f_i are convex and differentiable.

Assumption 3 (Lipschitz continuous gradients). The gradients of the local cost functions ∇f_i are Lipschitz continuous with constant M > 0. That is, given any x̃, ỹ ∈ R^p, ‖∇f_i(x̃) − ∇f_i(ỹ)‖ ≤ M‖x̃ − ỹ‖ for any i.

Assumption 4 (Initialization). The dual variable µ^0 of COLA is initialized in the column space of G_o^T. That is, there exists a vector φ^0 ∈ R^{mp} such that µ^0 = G_o^T φ^0.

Assumption 5 (Strong convexity). The local cost functions f_i are strongly convex with constant m > 0. That is, given any x̃, ỹ ∈ R^p, ⟨∇f_i(x̃) − ∇f_i(ỹ), x̃ − ỹ⟩ ≥ m‖x̃ − ỹ‖² for any i.

Assumptions 1, 2, 3 and 5 are standard in the analysis of decentralized algorithms. The initial condition in Assumption 4 can be easily satisfied, with the simplest choice being µ^0 = 0. COLA involves a primal sequence {x^k} and a dual sequence {µ^k}. In the theoretical analysis, we shall construct a triple (x^k, z^k, φ^k) from the pair (x^k, µ^k), and prove its convergence to (x*, z*, φ*), which is optimal for (3). Here z^k, z*, φ^k, φ* ∈ R^{mp}. The next lemma gives the properties of (x*, z*, φ*).

Lemma 1. (Lemma 1, [24]) Given a primal optimal solution x* of (3) and z* := (1/2)G_u x*, there exist multiple optimal dual variables [φ*; −φ*] such that every (x*, z*, [φ*; −φ*]) is a primal-dual optimal solution of (3).
Among all these optimal dual variables, there exists a unique [φ*; −φ*] in which φ* lies in the column space of G_o. Moreover, any primal-dual optimal solution (x*, z*, [φ*; −φ*]) satisfies the KKT conditions

$$\nabla f(x^*) + G_o^T \phi^* = 0, \qquad (14)$$
$$G_o x^* = 0, \qquad (15)$$
$$G_u x^* = 2z^*. \qquad (16)$$

According to Lemma 1, it is natural to construct z^k := (1/2)G_u x^k ∈ R^{mp}. To construct φ^k, note that under Assumption 4, µ^0 lies in the column space of G_o^T, and by the definition of L_o = (1/2)G_o^T G_o, every µ^{k+1} in the dual update (12) also lies in the column space of G_o^T. Thus, there exists a vector φ^k ∈ R^{mp} satisfying µ^k = G_o^T φ^k for any k ≥ 0, such that the recursion of COLA can be rewritten as

$$x^{k+1} = x^k - (2cD + \rho I)^{-1}\big(\nabla f(x^k) + cL_o \hat{x}^k + G_o^T \phi^k\big), \qquad (17)$$
$$\phi^{k+1} = \phi^k + \frac{c}{2} G_o \hat{x}^{k+1}. \qquad (18)$$

Combining (17) and (18) with the KKT conditions (14)–(16), the next lemma gives two equations that are cornerstones of the theoretical analysis. To emphasize the error caused by the communication-censoring strategy, we define an error term E^k := x^k − x̂^k therein.

Lemma 2.
Let x* and φ* be a primal-dual optimal pair of (3), with φ* lying in the column space of G_o. Then, for all k ≥ 0, the recursion of COLA satisfies

$$\nabla f(x^k) - \nabla f(x^*) = (cL_u + \rho I)(x^k - x^{k+1}) - G_o^T(\phi^{k+1} - \phi^*) + cL_o(E^k - E^{k+1}), \qquad (19)$$
$$\frac{c}{2} G_o(x^{k+1} - x^*) = \phi^{k+1} - \phi^k + \frac{c}{2} G_o E^{k+1}. \qquad (20)$$

Proof:
See Appendix A.

The convergence analysis of COLA relies on the following energy function

$$V^k := \rho\|x^k - x^*\|^2 + c\|z^k - z^*\|^2 + \frac{1}{c}\|\phi^k - \phi^*\|^2, \qquad (21)$$

where the auxiliary variables z^k and φ^k as well as their optimal values z* and φ* are defined above. This energy function also appears in the analysis of DLM, the uncensored version of COLA [24]. However, due to the communication-censoring strategy, which introduces an error term into the recursion, the analysis of COLA is significantly different from that of DLM.

B. Convergence
The convergence of COLA is established as follows.
Theorem 1.
Under Assumptions 1–4, in COLA we choose the penalty parameter c > 0 and the linearization parameter ρ > 0 such that cλ_min(L_u) + ρ > M, and set the censoring threshold {τ^k} as a non-increasing non-negative summable sequence such that Σ_{k=0}^∞ τ^k < ∞. Then the primal variable x^k converges to an optimal solution x* of (3).

Proof: See Appendix B.

Theorem 1 asserts that COLA converges to an optimal solution of (1) under mild conditions and provides guidelines for setting the parameters. It is interesting to see that the requirement cλ_min(L_u) + ρ > M is the same as that in DLM [24]. Fixing ρ, a network with better connectedness (namely, larger λ_min(L_u)) allows us to choose a smaller penalty constant c. Fixing c and λ_min(L_u), the linearization parameter ρ must be large enough to guarantee convergence. Note that ρI_p approximates the Hessians of the local cost functions f_i(x_i). A large ρ over-approximates the curvature and forces x_i^{k+1} to be close to x_i^k, which stabilizes the recursion. On the contrary, a small ρ under-approximates the curvature and allows the local variables to change quickly, at the cost of possible divergence. Fig. 2 illustrates the impact of ρ. Regarding the censoring threshold τ^k, we require it to be summable. Intuitively, τ^k determines the maximal error introduced to the primal update. When this error is controllable, the convergence of COLA is guaranteed.

C. Rates of Convergence
In Section III-B, we have shown that COLA requires {τ^k} to be summable so as to guarantee convergence. Below, we shall prove that the convergence rate of COLA also depends on the convergence rate of {τ^k}. In addition to Assumptions 1–4, we need the local cost functions to be strongly convex, as stated in Assumption 5. In this circumstance, COLA converges to the unique optimal solution of (1) at a linear (sublinear) rate when {τ^k} is linearly (sublinearly) decaying.
Under Assumptions 1–5, in COLA we choose the penalty parameter c > 0 and the linearization parameter ρ > M²/m, and set the censoring threshold τ^k = α · β^k with α > 0 and β ∈ (0, 1). Then there exists a positive constant δ > 0 such that the primal variable x^k converges to the unique optimal solution x* of (3) at a global linear rate O((1 + δ)^{−k}).

Fig. 2. An illustration of choosing different approximation parameters ρ (here ρ = 0.8, 2, 5). At the top-right point, we approximate the original cost function (in blue) by the dashed lines (in cyan, green and red). When ρ = 2 and ρ = 5, which are both larger than the accurate second derivative, the updates are conservative and go to the green and red points, respectively. When ρ = 0.8, which is smaller than the accurate second derivative, the update is aggressive and goes to the cyan point.

Proof:
See Appendix C.

As shown in (60) in the proof of Theorem 2, the constant δ depends on the algorithm parameters c, ρ and β, the network topology parameterized by σ̃_min(G_o) and σ_max(G_u), and the properties of the local cost functions parameterized by M and m. Define the condition numbers of the cost functions and the graph as κ_f = M/m and κ_G = σ_max(G_u)/σ̃_min(G_o), respectively. The following corollary shows more clearly how, by properly setting c and ρ, the constant δ is determined by κ_f, κ_G and β.

Corollary 1.
Under Assumptions 1–5, in COLA we choose c = M σ_max(G_u)/σ̃_min(G_o) and ρ = M κ_f. Then the global linear rate O((1 + δ)^{−k}) satisfies

δ ≤ min { 1/κ_G, 1/(κ_f + 16 κ_f κ_G), 1/(κ_f κ_G + 6 κ_f κ_G), 1/β − 1 }.   (22)

In (22), the terms 1/κ_G, 1/(κ_f + 16 κ_f κ_G) and 1/(κ_f κ_G + 6 κ_f κ_G) are monotonically decreasing when either κ_f or κ_G increases, meaning that the convergence is slower when the cost functions are worse-conditioned and/or the network is less connected. In addition, δ is bounded by 1/β − 1. Therefore, (22) shows that among all β that do not affect the convergence rate, the one satisfying min{1/κ_G, 1/(κ_f + 16 κ_f κ_G), 1/(κ_f κ_G + 6 κ_f κ_G)} = 1/β − 1 achieves the largest communication reduction per iteration.

Theorem 3.
Under Assumptions 1–5, in COLA we choose the penalty parameter c > 0 and the linearization parameter ρ > M²/m, and set the censoring threshold τ^k = α · (k + 1)^{−r} with α > 0 and r > 1. Then there exists a finite time index k_0 such that the distance between the primal variable x^k and the unique optimal solution x^* of (3) is upper-bounded by a sequence decaying sublinearly to 0 at a rate of O((k + 1)^{−q}), where q ∈ (0, r − 1), when k ≥ k_0.

Proof: See Appendix D.

Theorems 2 and 3 indicate that, to achieve linear (sublinear) convergence, we have to impose stronger requirements on the parameters. The sequence of censoring thresholds should be not only summable, but also linearly (sublinearly) decaying. The parameters c and ρ should be larger, too. Note that because M ≥ m, we have ρ > M²/m ≥ M and consequently cλ_min(L_u) + ρ > M, which is required in Theorem 1. According to the upper bound of δ given in (60), the linear rate at which x^k reaches x^* (namely, O((1 + δ)^{−k/2})) must be slower than the linear rate at which τ^k decays to 0 (namely, O(β^k)). From Theorem 3, one can also see that the sublinear rate at which x^k reaches x^* (namely, O((k + 1)^{−q}) with q ∈ (0, r − 1)) must be slower than the sublinear rate at which τ^k decays to 0 (namely, O((k + 1)^{−r})). Therefore, in both the linear and the sublinear cases, the sequence of censoring thresholds τ^k bounds the convergence rate of x^k to x^*. This makes sense because τ^k bounds the maximal error allowed to enter the recursion of x^k due to communication censoring.

Remark 2.
Though COLA is devised from DLM, the error caused by the communication-censoring strategy makes its analysis different from that of DLM. The analysis of COLA is also different from that of COCA, the censored version of ADMM. The reason is that COLA updates x^k by gradient descent steps, while COCA updates x^k by solving optimization subproblems. This is analogous to the difference between the proofs of DLM and ADMM. In addition to the difference in proof techniques, we also establish the sublinear convergence of COLA, which is absent in the analyses of DLM and COCA.

Remark 3.
When the censoring threshold τ^k is set to 0, COLA degenerates to DLM. Intuitively, the convergence rate of COLA is no faster than that of DLM due to the introduction of the communication-censoring strategy. This is also observed from, for example, the linear convergence constant δ in Corollary 1. Nevertheless, the slower convergence in terms of the number of iterations is acceptable, since COLA effectively reduces the iteration-wise communication cost. We shall demonstrate with numerical experiments that COLA can reduce the overall communication cost compared to DLM.

IV. NUMERICAL EXPERIMENTS
This section provides numerical experiments to demonstrate the satisfactory communication-computation tradeoff of COLA. In particular, we shall show that COLA inherits the advantage of cheap computation from its uncensored counterpart DLM [24], [25], but significantly reduces the overall communication cost. Beyond DLM, we compare COLA with the classical ADMM [23] and its censored version COCA [50], both of which do not use the linearization technique and are not computation-efficient. We also compare with the event-triggered sub-gradient descent (ETSD) algorithm [47], which is a primal-domain first-order method but much slower than COLA in terms of convergence speed. We consider two decentralized consensus optimization problems: least squares in Section IV-A and logistic regression in Section IV-B. The cost functions are both smooth, but the latter is not strongly convex. For ADMM and COCA, the subproblems in least squares have explicit solutions, while those in logistic regression need computationally demanding inner loops. We use the accuracy of the primal variable as the performance metric, defined by ‖x^k − x^*‖ / ‖x^0 − x^*‖. Logistic regression may have multiple optimal solutions, among which we choose the one closest to the limit of the iterates as x^*. The computation cost is evaluated by the time spent to reach a target accuracy, and the communication cost is defined as the accumulated number of broadcast messages. The simulations are carried out on a laptop with an Intel i7 processor and 8GB memory, programmed with Matlab R2017a in macOS Sierra.

A. Decentralized Least Squares
The local cost function in the decentralized least squares problem is f_i(x̃) = (1/2)‖A_(i) x̃ − y_(i)‖², with A_(i) ∈ R^{p×p} and y_(i) ∈ R^p being private for node i. Thus, the primal update of node i at time k in COLA is

x_i^{k+1} = x_i^k − (2c d_ii + ρ)^{−1} [ A_(i)^T (A_(i) x_i^k − y_(i)) + c Σ_{j∈N_i} (x̂_i^k − x̂_j^k) + μ_i^k ].

Note that node i can compute (2c d_ii + ρ)^{−1} in advance to accelerate the computation. In the experiments, entries of A_(i) and b_(i) are independently and identically sampled from the uniform distribution on [0, 1]. Then we let y_(i) = A_(i) b_(i). We set the network size as n = 50 and the dimension of the local variables as p = 3.

First, we compare four algorithms, COLA, DLM, COCA and ADMM, over four network topologies: line, random, star and complete, as shown in Figs. 3–6. In the random network, a fraction of all possible bidirectional edges is randomly chosen to be connected. The accuracies are compared with respect to the number of iterations and the cumulative communication cost. The parameters c and ρ are tuned to be the best for the uncensored algorithms DLM and ADMM, and kept the same in their censored counterparts, respectively. We use the linear censoring threshold in the form of τ^k = α · β^k, where the parameters α and β are hand-tuned in COLA and COCA so as to achieve the best communication efficiency; taking the random network as an example, c and ρ in DLM, c in ADMM, and α and β in COLA and COCA are all tuned in this manner. In all the networks, the two censored methods COLA and COCA require more iterations to reach the target accuracy than their uncensored counterparts due to the error caused by censoring, but the saving in communication is remarkable.
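The closed-form primal update above can be sketched in a few lines. This is a minimal illustration with synthetic stand-in data (the values of c, ρ, the neighbor states, and the degree d_ii are assumptions for the example, not tuned parameters from the experiments); it is not the paper's Matlab implementation.

```python
import numpy as np

# Sketch of one COLA primal update for node i in the least-squares problem,
# following the closed-form step in the text. All data are synthetic
# stand-ins; c and rho are illustrative values.
rng = np.random.default_rng(0)
p = 3
c, rho = 0.5, 1.0

A_i = rng.uniform(0.0, 1.0, size=(p, p))   # private data of node i
y_i = rng.uniform(0.0, 1.0, size=p)
x_i = np.zeros(p)                          # current local variable x_i^k
mu_i = np.zeros(p)                         # local dual variable mu_i^k
x_hat_i = np.zeros(p)                      # last broadcast state of node i
x_hat_neighbors = [rng.uniform(0.0, 1.0, size=p) for _ in range(2)]
d_ii = len(x_hat_neighbors)                # degree of node i

grad = A_i.T @ (A_i @ x_i - y_i)           # gradient of (1/2)||A_i x - y_i||^2
consensus = c * sum(x_hat_i - x_hat_j for x_hat_j in x_hat_neighbors)
x_i_next = x_i - (grad + consensus + mu_i) / (2 * c * d_ii + rho)
print(x_i_next)
```

As noted in the text, the scalar (2c d_ii + ρ)^{−1} depends only on the topology and the parameters, so it can be precomputed once per node.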
Fig. 3. Performance over the line network for decentralized least squares.
Fig. 4. Performance over random network for decentralized least squares.
Compared to DLM and given a target accuracy of 10^{−5}, COLA substantially reduces the communication cost in the line and random networks, and also yields clear savings in the star and complete networks. The required number of iterations in the line network is much larger than in the other networks, since the connectedness of the line network is the worst. In better connected networks such as star and complete, variable updating is often informative, such that the deterioration of convergence speed caused by skipping transmissions becomes more noticeable, yet communication per iteration is still saved by censoring.

We study the influence of communication censoring over the random network. The censoring pattern of the first 200 iterations is shown in Fig. 7. The horizontal axis is the number of iterations, and the vertical axis is the node index. A white dot means that the node broadcasts at that time, while a black dot means that the node is silent. Observe that communication censoring happens uniformly, namely, the frequency of communication censoring does not change too much along the optimization process. In addition, the nodes have similar communication costs eventually. On average, every node broadcasts fewer than one message per iteration.

Next, we compare different choices of the censoring threshold in COLA over the random network. We compare four censoring
Fig. 5. Performance over the star network for decentralized least squares.
Fig. 6. Performance over the complete network for decentralized least squares.

thresholds, namely the linear sequences τ^k = α · β^k with a fixed α and β = 0.93, 0.95 and 0.97, as well as the sublinear sequence τ^k = α · (k + 1)^{−r} with α = 1000 and r = 2.5. The parameters c and ρ remain the same. As shown in Fig. 8, the linear censoring thresholds outperform the sublinear censoring threshold, in terms of both communication and computation. The reason is that the sublinear rate of the threshold limits the convergence rate of COLA, as we have theoretically analyzed in Section III. Regarding the different choices of the linear rate, we observe that a smaller β needs fewer iterations to reach a target accuracy, since it leads to faster decay of the censoring threshold, and thus less communication censoring per iteration. In contrast, with a larger β, we need more iterations but less communication cost per iteration. Therefore, a moderate β, such as β = 0.95 in this case, is preferred.

In Fig. 8, we also compare COLA with ETSD, a communication-censored primal-domain first-order method. ETSD adopts the Metropolis-Hastings rule to design its mixing matrix. It uses a linear censoring threshold α · β^k and a sublinearly diminishing step size, where the parameters are all hand-tuned to achieve the best communication efficiency. From Fig. 8, we observe that ETSD requires many more iterations and a much higher communication cost to reach a
Fig. 7. Censoring pattern of the first 200 iterations of COLA over the random network for decentralized least squares. The horizontal axis is the number of iterations, and the vertical axis is the node index. A dark dot represents that the node is censored at that time.
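The broadcast/censor decision behind the pattern in Fig. 7 can be sketched as follows: a node transmits its current local variable only if it differs from its previously transmitted state by at least the threshold τ^k, and otherwise its neighbors keep the stale state. This is an illustrative sketch with synthetic stand-in data (n, p and τ^k are example values); it also checks the error bound ‖E^k‖ ≤ √n τ^k used in the convergence analysis.

```python
import numpy as np

# Sketch of the communication-censoring rule: node i broadcasts x_i^k only if
# it differs from its last broadcast state by at least tau_k; otherwise the
# stale state is kept, so each block of E^k = x^k - xhat^k has norm < tau_k
# and the stacked error satisfies ||E^k|| <= sqrt(n) * tau_k.
rng = np.random.default_rng(3)
n, p, tau_k = 4, 3, 0.1

x = rng.standard_normal((n, p))                         # current variables x_i^k
x_hat_prev = x + rng.uniform(-0.05, 0.05, size=(n, p))  # last broadcast states

x_hat = np.where(
    np.linalg.norm(x_hat_prev - x, axis=1, keepdims=True) >= tau_k,
    x,            # broadcast: transmitted state is refreshed
    x_hat_prev,   # censored: neighbors keep the stale state
)
E = x - x_hat
print(np.linalg.norm(E), np.sqrt(n) * tau_k)
```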
Fig. 8. Performance of DLM, ETSD and COLA with different censoring thresholds over the random network for decentralized least squares. In COLA, with the linear censoring thresholds τ^k = α · β^k, we fix α and choose different β. With the sublinear censoring threshold τ^k = α · (k + 1)^{−r}, we choose α = 1000 and r = 2.5.

target accuracy compared to COLA. The main reason for the unsatisfactory performance of ETSD is the diminishing step size, which is used to guarantee exact convergence. A similar performance gap can be observed in comparing the uncensored algorithms, sub-gradient descent and DLM.

B. Decentralized Logistic Regression
In the decentralized logistic regression problem, the local cost function of node i is

f_i(x̃) = (1/l_i) Σ_{l=1}^{l_i} ln( 1 + exp(−y_(i)l q_(i)l^T x̃) ),

where q_(i)l ∈ R^p is the l-th column of a matrix Q_(i) ∈ R^{p×l_i}, y_(i)l ∈ {−1, +1} is the l-th element of a binary vector y_(i) ∈ R^{l_i}, and l_i is the number of samples held by node i. The primal update of node i at time k in COLA is

x_i^{k+1} = x_i^k − (2c d_ii + ρ)^{−1} [ −(1/l_i) Σ_{l=1}^{l_i} ( y_(i)l exp(−y_(i)l q_(i)l^T x_i^k) / (1 + exp(−y_(i)l q_(i)l^T x_i^k)) ) q_(i)l + c Σ_{j∈N_i} (x̂_i^k − x̂_j^k) + μ_i^k ].
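The logistic primal update can be sketched analogously to the least-squares one; only the local gradient changes. This is an illustrative sketch with synthetic stand-in data (c, ρ, the neighbor states, and the sample count l_i are assumed example values), not the paper's Matlab implementation.

```python
import numpy as np

# Sketch of one COLA primal update for node i in the logistic-regression
# problem, using the gradient of (1/l_i) sum_l ln(1 + exp(-y_l q_l^T x)).
# Data, c, rho, and neighbor states are synthetic illustrative values.
rng = np.random.default_rng(1)
p, l_i = 3, 5
c, rho = 0.5, 1.0

Q_i = rng.uniform(0.0, 1.0, size=(p, l_i))   # columns are the samples q_(i)l
y_i = rng.choice([-1.0, 1.0], size=l_i)      # binary labels
x_i = np.zeros(p)
mu_i = np.zeros(p)
x_hat_i = np.zeros(p)
x_hat_neighbors = [np.ones(p), -np.ones(p)]
d_ii = len(x_hat_neighbors)

margins = y_i * (Q_i.T @ x_i)                     # y_l * q_l^T x
weights = np.exp(-margins) / (1.0 + np.exp(-margins))
grad = -(Q_i @ (y_i * weights)) / l_i             # gradient of the local loss
consensus = c * sum(x_hat_i - x_hat_j for x_hat_j in x_hat_neighbors)
x_i_next = x_i - (grad + consensus + mu_i) / (2 * c * d_ii + rho)
print(x_i_next)
```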
Fig. 9. Performance over the random network with 50 nodes for decentralized logistic regression.
Fig. 10. Performance over the random network with 100 nodes for decentralized logistic regression.

The primal updates of ADMM and COCA, in contrast, have no explicit solutions. Therefore, we solve the subproblems therein by a gradient descent inner loop, which terminates when the ℓ2 norm of the gradient drops below a preset tolerance.

We conduct simulations over two random networks with n = 50 and n = 100 nodes, in both of which a fraction of all possible bidirectional edges is randomly chosen to be connected. The dimension of the local variables is p = 3. The numbers of samples held by the nodes are i.i.d. and uniformly chosen from a fixed range of integers. Entries of the first two rows of Q_(i) follow an i.i.d. discrete uniform distribution on a finite set of equally spaced positive values, while entries of the last row are all set to 1. Entries of y_(i) are i.i.d. and follow the uniform distribution on {−1, +1}. As we have done in Section IV-A, c in ADMM is tuned to achieve the fastest convergence, and is also used for COCA; c and ρ in DLM are tuned to achieve the fastest convergence, and are also used for COLA. The censoring threshold in both COCA and COLA is set as τ^k = α · β^k, with the parameters α and β hand-tuned to obtain the best communication efficiency.

As depicted in Figs. 9 and 10, the four algorithms behave similarly to how they do in the least squares problem (Figs. 3–6).

TABLE I. THE TIME SPENT OF THE FOUR ALGORITHMS IN TWO NETWORKS WITH DIFFERENT NUMBERS OF NODES n AND TARGET ACCURACIES.

COLA saves a large fraction of the communication cost with only a few more iterations compared to DLM. To demonstrate its computation efficiency, we also show the CPU time for the four algorithms to reach two target accuracies in Table I. Notice that the two linearized algorithms, COLA and DLM, compute much faster than COCA and ADMM. The time spent by COLA in both networks is slightly more than that of DLM due to the communication-censoring operations.

V. CONCLUSION
In this paper, we propose COLA, a communication- and computation-efficient decentralized consensus optimization algorithm. Compared to the classical ADMM, COLA uses the linearization technique to reduce the iteration-wise computation cost, and suits networks where only light-weight computation is affordable. To compensate for the sacrifice in convergence speed, which is caused by the linearization step and results in low communication efficiency, COLA further introduces the communication-censoring strategy to prevent a node from transmitting its "less-informative" local variable to neighbors. We establish the convergence and rates of convergence of COLA, and demonstrate the communication-computation tradeoff with numerical experiments. Our future work is to apply the linearization and communication-censoring techniques to decentralized optimization applications in dynamic, online and stochastic environments.
REFERENCES

[1] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," In: Proceedings of IPSN, 2004.
[2] I. Schizas, A. Ribeiro, and G. Giannakis, "Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals," IEEE Trans. on Signal Processing, vol. 56, no. 1, pp. 350–364, 2008.
[3] S. Dougherty and M. Guay, "An extremum-seeking controller for distributed optimization over sensor networks," IEEE Trans. on Automatic Control, vol. 62, no. 2, pp. 928–933, 2017.
[4] F. Zeng, C. Li, and Z. Tian, "Distributed compressive spectrum sensing in cooperative multi-hop wideband cognitive networks," IEEE Journal of Selected Topics on Signal Processing, vol. 5, no. 1, pp. 37–48, 2011.
[5] G. Giannakis, Q. Ling, G. Mateos, I. Schizas, and H. Zhu, "Decentralized learning for wireless communications and networking," In: Splitting Methods in Communication and Imaging, Science and Engineering, Springer, 2016.
[6] F. Bullo, J. Cortes, and S. Martinez,
Distributed Control of Robotic Networks, Princeton University Press, 2009.
[7] Y. Cao, W. Yu, W. Ren, and G. Chen, "An overview of recent progress in the study of distributed multi-agent coordination," IEEE Trans. on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2013.
[8] G. Giannakis, V. Kekatos, N. Gatsis, S. Kim, H. Zhu, and B. Wollenberg, "Monitoring and optimization for power grids: A signal processing perspective," IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 107–128, 2013.
[9] K. Antoniadou-Plytaria, I. Kouveliotis-Lysikatos, P. Georgilakis, and N. Hatziargyriou, "Distributed and decentralized voltage control of smart distribution networks: Models, methods, and future research," IEEE Trans. on Smart Grid, vol. 8, no. 6, pp. 2999–3008, 2017.
[10] H. Liu, W. Shi, and H. Zhu, "Distributed voltage control in distribution networks: Online and robust implementations," IEEE Trans. on Smart Grid, vol. 9, no. 6, pp. 6106–6117, 2018.
[11] X. Zhao and A. Sayed, "Distributed clustering and learning over networks," IEEE Trans. on Signal Processing, vol. 63, no. 13, pp. 3285–3300, 2015.
[12] A. Mokhtari, Efficient Methods for Large-Scale Empirical Risk Minimization, PhD Thesis, University of Pennsylvania, 2017.
[13] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," In: Proceedings of NIPS, 2017.
[14] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multiagent optimization," IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[15] D. Jakovetic, J. Xavier, and J. Moura, "Fast distributed gradient methods," IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[16] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM Journal on Optimization, vol. 30, no. 5, pp. 1835–1854, 2016.
[17] J. Duchi, A. Agarwal, and M.
Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Trans. on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
[18] K. Tsianos and M. Rabbat, "Distributed dual averaging for convex optimization under communication delays," In: Proceedings of ACC, 2012.
[19] S. Lee, A. Nedic, and M. Raginsky, "Stochastic dual averaging for decentralized online optimization on time-varying communication graphs," IEEE Trans. on Automatic Control, vol. 62, no. 12, pp. 6407–6414, 2017.
[20] A. Mokhtari, Q. Ling, and A. Ribeiro, "Network Newton distributed optimization methods," IEEE Trans. on Signal Processing, vol. 65, no. 1, pp. 146–161, 2017.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, pp. 1–122, 2010.
[22] G. Mateos, J. Bazerque, and G. Giannakis, "Distributed sparse linear regression," IEEE Trans. on Signal Processing, vol. 58, no. 10, pp. 5262–5276, 2010.
[23] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Trans. on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
[24] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, "DLM: Decentralized linearized alternating direction method of multipliers," IEEE Trans. on Signal Processing, vol. 63, pp. 4051–4064, 2015.
[25] T. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Trans. on Signal Processing, vol. 63, no. 2, pp. 482–497, 2015.
[26] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[27] P. Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Trans. on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
[28] Y. Sun, G. Scutari, and D.
Palomar, "Distributed nonconvex multiagent optimization over time-varying networks," In: Proceedings of ASILOMAR, 2016.
[29] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.
[30] A. Nedic, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal of Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[31] R. Xin and U. Khan, "A linear algorithm for optimization over directed graphs with geometric convergence," IEEE Control Systems Letters, vol. 2, no. 3, pp. 325–330, 2018.
[32] S. Pu, W. Shi, J. Xu, and A. Nedic, "A push-pull gradient method for distributed optimization in networks," In: Proceedings of CDC, 2018.
[33] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, "DQM: Decentralized quadratically approximated alternating direction method of multipliers," IEEE Trans. on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.
[34] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, "A decentralized second-order method with exact linear convergence rate for consensus optimization," IEEE Trans. on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
[35] M. Eisen, A. Mokhtari, and A. Ribeiro, "A primal-dual quasi-Newton method for exact consensus optimization," arXiv: 1809.01212, 2018.
[36] K. Scaman, F. Bach, S. Bubeck, Y. Lee, and L. Massoulie, "Optimal algorithms for smooth and strongly convex distributed optimization in networks," In: Proceedings of ICML, 2017.
[37] K. Scaman, F. Bach, S. Bubeck, Y. Lee, and L. Massoulie, "Optimal algorithms for non-smooth distributed optimization in networks," In: Proceedings of NeurIPS, 2018.
[38] H. Sun and M. Hong, "Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms," arXiv: 1804.02729, 2018.
[39] K. Tsianos, S. Lawlor, and M.
Rabbat, "Communication/computation tradeoffs in consensus-based distributed optimization," In: Proceedings of NIPS, 2012.
[40] A. Berahas, R. Bollapragada, N. Keskar, and E. Wei, "Balancing communication and computation in distributed optimization," arXiv: 1709.02999, 2017.
[41] G. Lan, S. Lee, and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization," arXiv: 1701.03961, 2017.
[42] A. Nedic, A. Olshevsky, and M. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," arXiv: 1709.08765, 2017.
[43] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, "Communication compression for decentralized training," arXiv: 1803.06443, 2018.
[44] D. Dimarogonas, E. Frazzoli, and K. Johansson, "Distributed event-triggered control for multi-agent systems," IEEE Trans. on Automatic Control, vol. 57, no. 5, pp. 1291–1297, 2012.
[45] E. Garcia, Y. Cao, H. Yu, P. Antsaklis, and D. Casbeer, "Decentralised event-triggered cooperative control with limited communication," International Journal of Control, vol. 86, no. 9, pp. 1479–1488, 2013.
[46] C. Nowzari and J. Cortes, "Distributed event-triggered coordination for average consensus on weight-balanced digraphs," Automatica, vol. 68, pp. 237–244, 2016.
[47] Q. Lu and H. Li, "Event-triggered discrete-time distributed consensus optimization over time-varying graphs," Complexity, 5385708, 2017.
[48] K. Tsianos, S. Lawlor, J. Yu, and M. Rabbat, "Networked optimization with adaptive communication," In: Proceedings of GlobalSIP, 2013.
[49] W. Chen and W. Ren, "Event-triggered zero-gradient-sum distributed consensus optimization over directed networks," Automatica, vol. 65, pp. 90–97, 2016.
[50] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, "COCA: Communication-censored ADMM for decentralized consensus optimization," In: Proceedings of ASILOMAR, 2018.

APPENDIX A
PROOF OF LEMMA 2

Proof:
From (18), it holds that

φ^{k+1} − φ^k = c G_o x̂^{k+1} = c G_o (x̂^{k+1} − x^*) = c G_o (x^{k+1} − E^{k+1} − x^*),   (23)

where the second equality uses (15) and the last equality uses the definition E^{k+1} = x^{k+1} − x̂^{k+1}. Rearranging terms in (23) yields (19).

Also, rearranging terms in (17) to place ∇f(x^k) on the left side, we have

∇f(x^k) = (2cD + ρI)(x^k − x^{k+1}) − cL_o x̂^k − G_o^T φ^k.   (24)

Subtracting (14) from (24) and noticing the definitions 2D = L_o + L_u and L_o = G_o^T G_o, we have

∇f(x^k) − ∇f(x^*)
= (2cD + ρI)(x^k − x^{k+1}) − cL_o x̂^k − G_o^T(φ^k − φ^*)
= (cL_u + ρI)(x^k − x^{k+1}) + cL_o(x^k − x^{k+1}) − cL_o x̂^k − G_o^T(φ^{k+1} − φ^*) + G_o^T(φ^{k+1} − φ^k)
= (cL_u + ρI)(x^k − x^{k+1}) + cL_o(x^k − x^{k+1}) − cL_o x̂^k − G_o^T(φ^{k+1} − φ^*) + cL_o(x^{k+1} − E^{k+1} − x^*)
= (cL_u + ρI)(x^k − x^{k+1}) − G_o^T(φ^{k+1} − φ^*) + cL_o(E^k − E^{k+1}),

where the third equality uses (23) and the last equality uses E^k = x^k − x̂^k together with (15). This completes the proof.

APPENDIX B
PROOF OF THEOREM 1

Proof:
Throughout the proof, we assume τ^0 > 0. When τ^0 = 0, all τ^k = 0 and COLA degenerates to DLM; since in this case the value of τ^0 does not affect the operation of COLA, we can simply set τ^0 in the proof to be any positive constant.

Step 1.
The proof in this step is analogous to the proof of Lemma 3 in [24], but more complicated due to the existence of the censoring error. From Assumptions 2 and 3, the gradients of the convex local cost functions ∇f_i are Lipschitz continuous with constant M > 0. Thus, we have

(1/M)‖∇f(x^k) − ∇f(x^*)‖² ≤ ⟨∇f(x^k) − ∇f(x^*), x^k − x^*⟩ = ⟨∇f(x^k) − ∇f(x^*), x^{k+1} − x^*⟩ + ⟨∇f(x^k) − ∇f(x^*), x^k − x^{k+1}⟩.   (25)

For the second term on the right-hand side of (25), we choose an upper bound

⟨∇f(x^k) − ∇f(x^*), x^k − x^{k+1}⟩ ≤ (1/(2M))‖∇f(x^k) − ∇f(x^*)‖² + (M/2)‖x^k − x^{k+1}‖².   (26)

To establish an upper bound for the first term on the right-hand side of (25), we use (19) in Lemma 2 to rewrite it as

⟨∇f(x^k) − ∇f(x^*), x^{k+1} − x^*⟩ = ⟨(cL_u + ρI)(x^k − x^{k+1}), x^{k+1} − x^*⟩ − ⟨G_o^T(φ^{k+1} − φ^*), x^{k+1} − x^*⟩ + ⟨cL_o(E^k − E^{k+1}), x^{k+1} − x^*⟩.   (27)

We shall handle the terms on the right-hand side of (27) one by one. The first one satisfies

⟨(cL_u + ρI)(x^k − x^{k+1}), x^{k+1} − x^*⟩ = c⟨G_u(x^k − x^{k+1}), G_u(x^{k+1} − x^*)⟩ + ρ⟨x^k − x^{k+1}, x^{k+1} − x^*⟩ = 2c⟨z^k − z^{k+1}, z^{k+1} − z^*⟩ + ρ⟨x^k − x^{k+1}, x^{k+1} − x^*⟩,   (28)

which uses the definitions L_u = G_u^T G_u, z^k = G_u x^k and z^* = G_u x^*. The second one satisfies

−⟨G_o^T(φ^{k+1} − φ^*), x^{k+1} − x^*⟩ = −⟨φ^{k+1} − φ^*, G_o(x^{k+1} − x^*)⟩ = (2/c)⟨φ^{k+1} − φ^*, φ^k − φ^{k+1}⟩ − ⟨φ^{k+1} − φ^*, G_o E^{k+1}⟩,   (29)

where the last equality uses (20). The third one satisfies

⟨cL_o(E^k − E^{k+1}), x^{k+1} − x^*⟩ = c⟨G_o(E^k − E^{k+1}), G_o(x^{k+1} − x^*)⟩ = ⟨G_o(E^k − E^{k+1}), φ^{k+1} − φ^k⟩ + c⟨G_o(E^k − E^{k+1}), G_o E^{k+1}⟩,   (30)

where the last equality uses (20) and the definition L_o = G_o^T G_o.
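The bound (26) is an instance of the weighted Young's inequality, which a quick numeric spot-check illustrates (the vectors and the weight M below are arbitrary example values):

```python
import numpy as np

# Numeric spot-check of the weighted Young's inequality behind (26):
# <u, v> <= ||u||^2 / (2M) + (M/2) * ||v||^2 for any M > 0, which follows
# from 0 <= (1/2) * || u / sqrt(M) - sqrt(M) * v ||^2.
rng = np.random.default_rng(4)
u, v = rng.standard_normal((2, 5))
M = 3.0

lhs = np.dot(u, v)
rhs = np.dot(u, u) / (2 * M) + (M / 2) * np.dot(v, v)
print(lhs, rhs)
```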
Summing up (28), (29) and (30), applying the equality ⟨v_a − v_b, v_b − v_c⟩ = (1/2)(‖v_a − v_c‖² − ‖v_a − v_b‖² − ‖v_b − v_c‖²), which holds for any vectors v_a, v_b and v_c, to ⟨z^k − z^{k+1}, z^{k+1} − z^*⟩, ⟨x^k − x^{k+1}, x^{k+1} − x^*⟩ and ⟨φ^{k+1} − φ^*, φ^k − φ^{k+1}⟩, and then reorganizing terms, we can rewrite (27) as

⟨∇f(x^k) − ∇f(x^*), x^{k+1} − x^*⟩   (31)
= c(‖z^k − z^*‖² − ‖z^k − z^{k+1}‖² − ‖z^{k+1} − z^*‖²) + (ρ/2)(‖x^k − x^*‖² − ‖x^k − x^{k+1}‖² − ‖x^{k+1} − x^*‖²) + (1/c)(‖φ^k − φ^*‖² − ‖φ^k − φ^{k+1}‖² − ‖φ^{k+1} − φ^*‖²) − ⟨φ^{k+1} − φ^*, G_o E^{k+1}⟩ + ⟨G_o(E^k − E^{k+1}), φ^{k+1} − φ^k⟩ + c⟨G_o(E^k − E^{k+1}), G_o E^{k+1}⟩
= V^k − V^{k+1} − c‖z^k − z^{k+1}‖² − (ρ/2)‖x^k − x^{k+1}‖² − (1/c)‖φ^k − φ^{k+1}‖² − ⟨G_o E^{k+1}, φ^k − φ^*⟩ + ⟨G_o(E^k − E^{k+1}), φ^{k+1} − φ^k⟩ + c⟨G_o(E^k − E^{k+1}), G_o E^{k+1}⟩,

where the last equality uses the definition of the energy function V^k in (21). For the term −⟨G_o E^{k+1}, φ^k − φ^*⟩, we observe that

−⟨G_o E^{k+1}, φ^k − φ^*⟩   (32)
≤ (c_1/2)‖G_o E^{k+1}‖ ‖φ^k − φ^*‖² + (1/(2c_1))‖G_o E^{k+1}‖
≤ (c_1 σ_max(G_o)/2)‖E^{k+1}‖ ‖φ^k − φ^*‖² + (σ_max(G_o)/(2c_1))‖E^{k+1}‖,

where c_1 > 0 is any positive constant. Similarly, for ⟨G_o(E^k − E^{k+1}), φ^{k+1} − φ^k⟩, it holds that

⟨G_o(E^k − E^{k+1}), φ^{k+1} − φ^k⟩   (33)
≤ (c_2/2)‖G_o(E^k − E^{k+1})‖ ‖φ^{k+1} − φ^k‖² + (1/(2c_2))‖G_o(E^k − E^{k+1})‖
≤ (c_2 σ_max(G_o)/2)(‖E^k‖ + 2‖E^{k+1}‖)‖φ^{k+1} − φ^k‖² + (σ_max(G_o)/(2c_2))(‖E^k‖ + 2‖E^{k+1}‖),

where c_2 > 0 is any positive constant. For c⟨G_o(E^k − E^{k+1}), G_o E^{k+1}⟩, we have

c⟨G_o(E^k − E^{k+1}), G_o E^{k+1}⟩   (34)
≤ c⟨G_o E^k, G_o E^{k+1}⟩
≤ (c/2)(‖G_o E^k‖² + ‖G_o E^{k+1}‖²)
≤ (c/2)σ_max²(G_o)‖E^k‖² + (c/2)σ_max²(G_o)‖E^{k+1}‖².
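The rearrangement from (27) to (31) hinges on the three-point identity ⟨v_a − v_b, v_b − v_c⟩ = (1/2)(‖v_a − v_c‖² − ‖v_a − v_b‖² − ‖v_b − v_c‖²); a quick numeric check on arbitrary vectors:

```python
import numpy as np

# Numeric spot-check of the three-point identity used to pass from (27) to
# (31): <va - vb, vb - vc> = (||va-vc||^2 - ||va-vb||^2 - ||vb-vc||^2) / 2.
rng = np.random.default_rng(2)
va, vb, vc = rng.standard_normal((3, 6))

lhs = np.dot(va - vb, vb - vc)
rhs = (np.linalg.norm(va - vc) ** 2
       - np.linalg.norm(va - vb) ** 2
       - np.linalg.norm(vb - vc) ** 2) / 2.0
print(lhs, rhs)
```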
Using (32), (33) and (34) to rewrite (31), followed by substituting the result and (26) into (25), we obtain

V^k − V^{k+1} − c‖z^k − z^{k+1}‖² − ((ρ − M)/2)‖x^k − x^{k+1}‖² − (1/c)‖φ^k − φ^{k+1}‖² + (c_1 σ_max(G_o)/2)‖E^{k+1}‖ ‖φ^k − φ^*‖² + (σ_max(G_o)/(2c_1))‖E^{k+1}‖ + (c_2 σ_max(G_o)/2)(‖E^k‖ + 2‖E^{k+1}‖)‖φ^{k+1} − φ^k‖² + (σ_max(G_o)/(2c_2))(‖E^k‖ + 2‖E^{k+1}‖) + (c/2)σ_max²(G_o)‖E^k‖² + (c/2)σ_max²(G_o)‖E^{k+1}‖² ≥ 0.   (35)

Step 2.
Now we characterize the upper bound of ‖E^k‖. According to the censoring strategy, x̂_i^k − x_i^k, the i-th block of E^k, becomes 0 if ‖x̂_i^{k−1} − x_i^k‖ ≥ τ^k, and equals x̂_i^{k−1} − x_i^k otherwise. In both cases, it holds that ‖x̂_i^k − x_i^k‖ ≤ τ^k. Therefore, we know ‖E^k‖ ≤ √n τ^k. Since τ^k is non-increasing, it also holds that ‖E^{k+1}‖ ≤ √n τ^{k+1} ≤ √n τ^k. Thus, (35) becomes

c‖z^k − z^{k+1}‖² + ((ρ − M)/2)‖x^k − x^{k+1}‖² + (1/c − (3c_2/2)σ_max(G_o)√n τ^k)‖φ^k − φ^{k+1}‖²   (36)
≤ V^k − V^{k+1} + (c_1/2)σ_max(G_o)√n τ^k ‖φ^k − φ^*‖² + (1/(2c_1) + 3/(2c_2))σ_max(G_o)√n τ^k + c n σ_max²(G_o)(τ^k)².

Setting the constants c_1 and c_2 in (36) as c_1 = 3c_2 = 1/(c σ_max(G_o)√n τ^0), we rewrite (36) as

c‖z^k − z^{k+1}‖² + ((ρ − M)/2)‖x^k − x^{k+1}‖² + (1/c − τ^k/(2cτ^0))‖φ^k − φ^{k+1}‖²   (37)
≤ V^k − V^{k+1} + (τ^k/(2cτ^0))‖φ^k − φ^*‖² + 5cn σ_max²(G_o) τ^0 τ^k + cn σ_max²(G_o)(τ^k)².

Since τ^k is non-increasing, 1/c − τ^k/(2cτ^0) ≥ 1/(2c). Meanwhile, by the definition of the energy function, V^k ≥ (1/c)‖φ^k − φ^*‖². By the definitions z^k = G_u x^k and L_u = G_u^T G_u, ‖z^k − z^{k+1}‖² = ‖G_u(x^k − x^{k+1})‖² ≥ λ_min(L_u)‖x^k − x^{k+1}‖². Applying these three facts to (37) yields

(cλ_min(L_u) + (ρ − M)/2)‖x^k − x^{k+1}‖² + (1/(2c))‖φ^k − φ^{k+1}‖² ≤ (1 + τ^k/(2τ^0))V^k − V^{k+1} + 5cn σ_max²(G_o) τ^0 τ^k + cn σ_max²(G_o)(τ^k)².   (38)

Step 3.
Define θ^k := 5cn σ_max²(G_o) τ^0 τ^k + cn σ_max²(G_o)(τ^k)², which is a non-increasing, non-negative and summable sequence, as τ^k is. The left-hand side of (38) is non-negative because cλ_min(L_u) + ρ ≥ M. Thus, (38) leads to

(1 + τ^k/(2τ^0))V^k − V^{k+1} + θ^k ≥ 0.   (39)

We use this inequality to show that V^k has a finite upper bound. From (39) we have

V^{k+1} ≤ (1 + τ^k/(2τ^0))V^k + θ^k
≤ (1 + τ^k/(2τ^0))((1 + τ^{k−1}/(2τ^0))V^{k−1} + θ^{k−1}) + θ^k
≤ ...
≤ V^0 Π_{k'=0}^{k} (1 + τ^{k'}/(2τ^0)) + Σ_{k''=0}^{k−1} θ^{k''} Π_{k'=k''+1}^{k} (1 + τ^{k'}/(2τ^0)) + θ^k
≤ (V^0 + Σ_{k''=0}^{∞} θ^{k''}) Π_{k'=0}^{∞} (1 + τ^{k'}/(2τ^0))
≤ (V^0 + Σ_{k''=0}^{∞} θ^{k''}) exp{ Σ_{k'=0}^{∞} τ^{k'}/(2τ^0) } < ∞,   (40)

where we use the inequality 1 + a ≤ exp{a}, which holds for all a ∈ R, and the fact that τ^k and θ^k are both non-negative and summable. Thus, we conclude that V^k has a finite upper bound, denoted as V̄.

Step 4.
Now we begin to prove the convergence. Summing up (38) from k = 0 to k = ∞ yields

Σ_{k=0}^{∞} [ (cλ_min(L_u) + (ρ − M)/2)‖x^k − x^{k+1}‖² + (1/(2c))‖φ^k − φ^{k+1}‖² ]
≤ V^0 + Σ_{k=0}^{∞} (τ^k/(2τ^0)) V^k + Σ_{k=0}^{∞} θ^k ≤ V^0 + (V̄/(2τ^0)) Σ_{k=0}^{∞} τ^k + Σ_{k=0}^{∞} θ^k < ∞.   (41)

Thus, we conclude that lim_{k→∞}(x^k − x^{k+1}) = 0 and lim_{k→∞}(φ^k − φ^{k+1}) = 0. Following these limiting properties, when k → ∞, the dual update (18) leads to G_o x̂^k → 0, which implies that

G_o x^k = G_o x̂^k + G_o E^k → 0.   (42)

Also, we have L_o x̂^k → 0 as L_o = G_o^T G_o. Consequently, in the limit (17) becomes

∇f(x^k) + G_o^T φ^k → 0.   (43)

Meanwhile, by definition,

G_u x^k − z^k = 0.   (44)

Comparing (43), (42) and (44) with the KKT conditions (14), (15) and (16), we conclude that the triple (x^k, z^k, φ^k) satisfies the KKT conditions when k goes to infinity.

Next, we show that {(x^k, z^k, φ^k)} converges as k → ∞. Since the sequence V^k is bounded, ‖x^k − x^*‖² and ‖φ^k − φ^*‖² are also bounded. Thus, there exists a subsequence {(x^{k_t}, φ^{k_t})} which converges to a cluster point (x^∞, φ^∞) of {(x^k, φ^k)}, and (x^∞, φ^∞) is optimal to (3). Construct another energy function V_∞^k := (ρ/2)‖x^k − x^∞‖² + c‖z^k − z^∞‖² + (1/c)‖φ^k − φ^∞‖², where z^∞ := G_u x^∞. The analysis for V^k can be applied to V_∞^k. In particular, analogous to (40), given any fixed k_t, we have

V_∞^k ≤ (V_∞^{k_t} + Σ_{k''=k_t}^{∞} θ^{k''}) exp{ Σ_{k'=k_t}^{∞} τ^{k'}/(2τ^0) },   (45)

for any k ≥ k_t. Observe that (x^{k_t}, φ^{k_t}) → (x^∞, φ^∞) leads to V_∞^{k_t} → 0. In addition, the sequences θ^k and τ^k are summable, i.e., Σ_{k''=0}^{∞} θ^{k''} < ∞ and Σ_{k'=0}^{∞} τ^{k'} < ∞, respectively. Therefore, for any ε > 0 there exists an integer t such that

V_∞^{k_t} < ε,   Σ_{k''=k_t}^{∞} θ^{k''} < ε,   and   Σ_{k'=k_t}^{∞} τ^{k'} < 2τ^0 log 2.
Then according to (45) we have V^k_∞ < 4ε for all k ≥ k_t. Therefore, V^k_∞ → 0 as k → ∞. From the definition of V^k_∞, we conclude that {(x^k, z^k, φ^k)} converges to (x^∞, z^∞, φ^∞), which is optimal to (3).

APPENDIX C: PROOF OF THEOREM 2

Proof:
Step 1.
From Assumption 5, the local cost functions f_i are strongly convex with constant m > 0. Thus, we have

  m‖x^{k+1} − x^*‖² ≤ ⟨∇f(x^{k+1}) − ∇f(x^*), x^{k+1} − x^*⟩   (46)
    = ⟨∇f(x^k) − ∇f(x^*), x^{k+1} − x^*⟩ + ⟨∇f(x^{k+1}) − ∇f(x^k), x^{k+1} − x^*⟩.

Observe that ⟨∇f(x^k) − ∇f(x^*), x^{k+1} − x^*⟩, the first term at the right-hand side of (46), also appears in (25) in the proof of Theorem 1. We follow that derivation to obtain (31), but then look for new upper bounds on ⟨G_o E^{k+1}, φ^k − φ^*⟩ and ⟨G_o E^k, φ^{k+1} − φ^k⟩, which are different from those in (32) and (33). For the term ⟨G_o E^{k+1}, φ^k − φ^*⟩, we observe that

  ⟨G_o E^{k+1}, φ^k − φ^*⟩   (47)
    = ⟨G_o E^{k+1}, φ^{k+1} − φ^*⟩ + ⟨G_o E^{k+1}, φ^k − φ^{k+1}⟩
    ≤ (c₁/2)‖G_o E^{k+1}‖² + (1/(2c₁))‖φ^{k+1} − φ^*‖² + (c₁/2)‖G_o E^{k+1}‖² + (1/(2c₁))‖φ^k − φ^{k+1}‖²
    ≤ c₁σ_max²(G_o)‖E^{k+1}‖² + (1/(2c₁))‖φ^{k+1} − φ^*‖² + (1/(2c₁))‖φ^k − φ^{k+1}‖²,

where c₁ > 0 is any positive constant. For ⟨G_o E^k, φ^{k+1} − φ^k⟩, it holds that

  ⟨G_o E^k, φ^{k+1} − φ^k⟩   (48)
    ≤ (c₂/2)‖G_o E^k‖² + (1/(2c₂))‖φ^{k+1} − φ^k‖²
    ≤ (c₂/2)σ_max²(G_o)‖E^k‖² + (1/(2c₂))‖φ^{k+1} − φ^k‖²,

where c₂ > 0 is any positive constant. For the second term at the right-hand side of (46), we have

  ⟨∇f(x^{k+1}) − ∇f(x^k), x^{k+1} − x^*⟩   (49)
    ≤ (c₃/2)‖∇f(x^{k+1}) − ∇f(x^k)‖² + (1/(2c₃))‖x^{k+1} − x^*‖²
    ≤ (c₃M²/2)‖x^{k+1} − x^k‖² + (1/(2c₃))‖x^{k+1} − x^*‖²,

where c₃ > 0 is any positive constant. The last inequality uses the fact that the gradients of the local cost functions ∇f_i are Lipschitz continuous with constant M > 0 according to Assumption 3.

Using (47), (48) and (34) to rewrite (31), and then substituting the result and (49) into (46), we obtain

  V^{k+1} ≤ V^k − (c/2)‖z^k − z^{k+1}‖²   (50)
    − ( ρ/2 − c₃M²/2 )‖x^k − x^{k+1}‖²
    − ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    − ( m − 1/(2c₃) )‖x^{k+1} − x^*‖² + (1/(2c₁))‖φ^{k+1} − φ^*‖²
    + ((c₂ + c)/2)σ_max²(G_o)‖E^k‖² + ( c₁ + c/2 )σ_max²(G_o)‖E^{k+1}‖².
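Steps (47)–(49) repeatedly use the Young-type bound ⟨a, b⟩ ≤ (c/2)‖a‖² + (1/(2c))‖b‖², valid for any positive constant c. A quick numerical check (illustration only, with random vectors) confirms it:

```python
import numpy as np

# Check <a,b> <= (c/2)||a||^2 + (1/(2c))||b||^2 for random vectors and several c.
rng = np.random.default_rng(0)
violations = 0
for _ in range(200):
    a = rng.standard_normal(6)
    b = rng.standard_normal(6)
    for c in (0.1, 1.0, 10.0):   # the bound holds for ANY positive constant c
        lhs = float(a @ b)
        rhs = 0.5 * c * float(a @ a) + 0.5 / c * float(b @ b)
        if lhs > rhs + 1e-12:
            violations += 1
assert violations == 0
```

The free constants c₁, c₂, c₃ in (47)–(49) are exactly the c of this bound, and are tuned later to make the coefficients in (50) negative.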
By the same reasoning as in the proof of Theorem 1, ‖E^k‖ ≤ √n τ^k and ‖E^{k+1}‖ ≤ √n τ^k. Thus, (50) becomes

  V^{k+1} ≤ V^k − (c/2)‖z^k − z^{k+1}‖²   (51)
    − ( ρ/2 − c₃M²/2 )‖x^k − x^{k+1}‖²
    − ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    − ( m − 1/(2c₃) )‖x^{k+1} − x^*‖² + (1/(2c₁))‖φ^{k+1} − φ^*‖²
    + sn(τ^k)²,

where s := ( c₁ + c₂/2 + c )σ_max²(G_o) > 0.

Step 2.
Now we are going to find constants δ > 0 and γ ≥ 0 such that

  (1 + δ)V^{k+1} ≤ V^k + γn(τ^k)².   (52)

Given any δ, use the definition of the energy function V^{k+1} to rewrite (51) as

  (1 + δ)V^{k+1} ≤ V^k − (c/2)‖z^k − z^{k+1}‖²   (53)
    − ( ρ/2 − c₃M²/2 )‖x^k − x^{k+1}‖²
    − ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    − ( m − 1/(2c₃) − ρδ/2 )‖x^{k+1} − x^*‖²
    + ( 1/(2c₁) + δ/(2c) )‖φ^{k+1} − φ^*‖² + (cδ/2)‖z^{k+1} − z^*‖² + sn(τ^k)².

We shall replace the terms ‖z^{k+1} − z^*‖² and ‖φ^{k+1} − φ^*‖² in (53) with the terms ‖z^k − z^{k+1}‖², ‖x^k − x^{k+1}‖², ‖x^{k+1} − x^*‖² and (τ^k)².

For ‖z^{k+1} − z^*‖², because z^{k+1} − z^* = (1/2)G_u(x^{k+1} − x^*), we have

  ‖z^{k+1} − z^*‖² ≤ ( σ_max²(G_u)/4 )‖x^{k+1} − x^*‖².   (54)

To handle ‖φ^{k+1} − φ^*‖², use the fact that L_u = G_u^T G_u and reorganize (19) to obtain

  G_o^T(φ^{k+1} − φ^*)   (55)
    = −( ∇f(x^k) − ∇f(x^*) ) + cG_u^T(z^k − z^{k+1}) + ρ(x^k − x^{k+1}) + cL_o(E^k − E^{k+1}).

Since both φ^{k+1} and φ^* are in the column space of G_o, the left-hand side of (55) is lower-bounded by

  σ̃_min²(G_o)‖φ^{k+1} − φ^*‖² ≤ ‖G_o^T(φ^{k+1} − φ^*)‖².   (56)

The right-hand side of (55) is upper-bounded by

  ‖ −( ∇f(x^k) − ∇f(x^*) ) + cG_u^T(z^k − z^{k+1}) + ρ(x^k − x^{k+1}) + cL_o(E^k − E^{k+1}) ‖²   (57)
    ≤ 4‖∇f(x^k) − ∇f(x^*)‖² + 4‖cG_u^T(z^k − z^{k+1})‖² + 4‖ρ(x^k − x^{k+1})‖² + 4‖cL_o(E^k − E^{k+1})‖²
    ≤ 8‖∇f(x^{k+1}) − ∇f(x^*)‖² + 8‖∇f(x^k) − ∇f(x^{k+1})‖² + 4‖cG_u^T(z^k − z^{k+1})‖²
      + 4‖ρ(x^k − x^{k+1})‖² + 8‖cL_o E^k‖² + 8‖cL_o E^{k+1}‖²
    ≤ 8M²‖x^{k+1} − x^*‖² + (8M² + 4ρ²)‖x^k − x^{k+1}‖² + 4c²σ_max²(G_u)‖z^k − z^{k+1}‖² + 16c²σ_max⁴(G_o)n(τ^k)².
The last inequality uses the fact that ∇f is Lipschitz continuous with constant M > 0, so that

  ‖∇f(x^{k+1}) − ∇f(x^*)‖² ≤ M²‖x^{k+1} − x^*‖²,  ‖∇f(x^k) − ∇f(x^{k+1})‖² ≤ M²‖x^k − x^{k+1}‖²,

and the definition L_o = G_o^T G_o, so that

  ‖cL_o E^k‖² ≤ c²σ_max⁴(G_o)‖E^k‖² ≤ c²σ_max⁴(G_o)n(τ^k)²,  ‖cL_o E^{k+1}‖² ≤ c²σ_max⁴(G_o)n(τ^k)².

Combining (55), (56) and (57), we obtain

  ‖φ^{k+1} − φ^*‖² ≤ (1/σ̃_min²(G_o)) ( 8M²‖x^{k+1} − x^*‖²   (58)
    + (8M² + 4ρ²)‖x^k − x^{k+1}‖² + 4c²σ_max²(G_u)‖z^k − z^{k+1}‖² + 16c²σ_max⁴(G_o)n(τ^k)² ).

Thus, we can use (54) and (58) to rewrite (53) as

  (1 + δ)V^{k+1}   (59)
    ≤ V^k − (c/2)( 1 − ( 1/(2c₁) + δ/(2c) ) · 8cσ_max²(G_u)/σ̃_min²(G_o) )‖z^k − z^{k+1}‖²
    − ( ρ/2 − c₃M²/2 − ( 1/(2c₁) + δ/(2c) ) · (8M² + 4ρ²)/σ̃_min²(G_o) )‖x^k − x^{k+1}‖²
    − ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    − ( m − 1/(2c₃) − ρδ/2 − cδσ_max²(G_u)/8 − ( 1/(2c₁) + δ/(2c) ) · 8M²/σ̃_min²(G_o) )‖x^{k+1} − x^*‖²
    + ( s + ( 1/(2c₁) + δ/(2c) ) · 16c²σ_max⁴(G_o)/σ̃_min²(G_o) )n(τ^k)².

For convenience, set the constants as

  c₁ := 4/( m(2mρ − M²)σ̃_min²(G_o) ) · ( M²(2mρ + M²) + 16m(2M² + ρ²) + 2cσ_max²(G_u)m(2mρ − M²) ),
  c₂ := cc₁/(c₁ − c),
  c₃ := (2mρ + M²)/(4mM²) ∈ ( 1/(2m), ρ/M² ),

where the range of c₃ follows from the hypothesis that ρ > M²/(2m). Then (52) is achieved with constants

  δ ≤ min{ σ̃_min²(G_o)/(4σ_max²(G_u)) − c/c₁,
           cσ̃_min²(G_o)(2mρ − M²)/(32m(ρ² + 2M²)) − c/c₁,
           ( m(2mρ − M²)/(2mρ + M²) − 4M²/(c₁σ̃_min²(G_o)) ) / ( ρ/2 + cσ_max²(G_u)/8 + 4M²/(cσ̃_min²(G_o)) ) },
  γ := s + ( δ/(2c) + 1/(2c₁) ) · 16c²σ_max⁴(G_o)/σ̃_min²(G_o).

Note that δ > 0 and γ > 0.

Step 3.
Now we prove the linear convergence of V^k to 0, which implies the linear convergence of x^k to x^*. Using the censoring threshold rule τ^k = αβ^k, we further rewrite (52) as (1 + δ)V^{k+1} ≤ V^k + γnα²β^{2k}. Analogous to the technique used in handling (40), it holds that

  V^{k+1} ≤ (1 + δ)^{−1}( V^k + γnα²β^{2k} )
    ≤ (1 + δ)^{−1}[ (1 + δ)^{−1}( V^{k−1} + γnα²β^{2(k−1)} ) + γnα²β^{2k} ]
    ≤ ...
    ≤ (1 + δ)^{−(k+1)}V^0 + γnα² Σ_{k′=0}^{k} (1 + δ)^{−(k+1−k′)}β^{2k′}
    ≤ (1 + δ)^{−(k+1)}[ V^0 + γnα² Σ_{k′=0}^{k} ( (1 + δ)β² )^{k′} ]
    ≤ (1 + δ)^{−(k+1)}( V^0 + γnα²/(1 − (1 + δ)β²) ),

where the last inequality holds when (1 + δ)β² < 1. In summary, for any positive δ that satisfies

  δ ≤ min{ σ̃_min²(G_o)/(4σ_max²(G_u)) − c/c₁,
           cσ̃_min²(G_o)(2mρ − M²)/(32m(ρ² + 2M²)) − c/c₁,
           ( m(2mρ − M²)/(2mρ + M²) − 4M²/(c₁σ̃_min²(G_o)) ) / ( ρ/2 + cσ_max²(G_u)/8 + 4M²/(cσ̃_min²(G_o)) ),
           β^{−2} − 1 },   (60)

the energy function V^k converges to 0 at a linear rate of O((1 + δ)^{−k}). Moreover, by the definition of V^k, it holds that V^k ≥ (ρ/2)‖x^k − x^*‖². Thus, the primal variable x^k converges to the unique optimal solution x^* at a linear rate of O((1 + δ)^{−k}).

Remark 4.
If we set c₁ = Mσ_max(G_u)/σ̃_min(G_o) and ρ = Mκ_f, and further choose the remaining free constants accordingly in (59) (e.g., c₂ = δc and c₃ = 1/m), then following the proof of Theorem 2 we can derive another upper bound on δ, as shown in (22) of Corollary 1.

APPENDIX D: PROOF OF THEOREM 3

Proof:
Step 1.
As in Step 1 of the proof of Theorem 2,we obtain the inequality (51).
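Inequality (51) relies on the censoring-rule bound ‖E^k‖ ≤ √n τ^k: each node's transmission error is at most τ^k in norm, so the stacked error of n nodes satisfies ‖E^k‖² = Σ_i ‖e_i‖² ≤ n(τ^k)². A small numerical sketch (random block errors, illustrative only):

```python
import numpy as np

def stacked_error(n, p, tau, seed=1):
    """Stack n per-node errors, each scaled so that ||e_i|| <= tau,
    mimicking the censoring rule's per-node guarantee."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n):
        e = rng.standard_normal(p)
        e *= rng.uniform(0.0, tau) / np.linalg.norm(e)  # enforce ||e_i|| <= tau
        blocks.append(e)
    return np.concatenate(blocks)

n, tau = 20, 0.05
E = stacked_error(n=n, p=3, tau=tau)
assert np.linalg.norm(E) <= np.sqrt(n) * tau   # ||E|| <= sqrt(n) * tau
```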
Step 2.
Now our aim is different from that in Step 2 of the proof of Theorem 2, as we are going to find constants q > 0 and η^k ≥ 0, as well as a time index k₀, such that

  (k + 1)^q ( V^{k+1} + η^{k+1} ) ≤ k^q ( V^k + η^k ),   (61)

for all k ≥ k₀. Toward (61), in which k₀, q and η^k will be determined later, we have

  (k + 1)^q ( V^{k+1} + η^{k+1} )
    = k^q V^{k+1} + [ (k + 1)^q − k^q ] ( (ρ/2)‖x^{k+1} − x^*‖² + (c/2)‖z^{k+1} − z^*‖² + (1/(2c))‖φ^{k+1} − φ^*‖² ) + (k + 1)^q η^{k+1}

  (by (51))
    ≤ k^q V^k − (c/2)k^q ‖z^k − z^{k+1}‖²
    − k^q ( ρ/2 − c₃M²/2 )‖x^k − x^{k+1}‖²
    − k^q ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    − { k^q ( m − 1/(2c₃) ) − [ (k + 1)^q − k^q ](ρ/2) }‖x^{k+1} − x^*‖²
    + (c/2)[ (k + 1)^q − k^q ]‖z^{k+1} − z^*‖²
    + [ k^q/(2c₁) + ( (k + 1)^q − k^q )/(2c) ]‖φ^{k+1} − φ^*‖²
    + k^q sn(τ^k)² + (k + 1)^q η^{k+1}

  (by (54), (58))
    ≤ k^q V^k − (c/2)k^q [ 1 − ( 1/(2c₁) + ( (k + 1)^q − k^q )/(2ck^q) ) · 8cσ_max²(G_u)/σ̃_min²(G_o) ]‖z^k − z^{k+1}‖²
    − k^q { ρ/2 − c₃M²/2 − ( 1/(2c₁) + ( (k + 1)^q − k^q )/(2ck^q) ) · (8M² + 4ρ²)/σ̃_min²(G_o) }‖x^k − x^{k+1}‖²
    − k^q [ m − 1/(2c₃) − 4M²/(c₁σ̃_min²(G_o)) − ( (k + 1)^q − k^q )/k^q · ( ρ/2 + cσ_max²(G_u)/8 + 4M²/(cσ̃_min²(G_o)) ) ]‖x^{k+1} − x^*‖²
    − k^q ( 1/(2c) − 1/(2c₁) − 1/(2c₂) )‖φ^k − φ^{k+1}‖²
    + k^q t^k n(τ^k)² + (k + 1)^q η^{k+1},

where

  t^k := s + 8c²σ_max⁴(G_o)/(c₁σ̃_min²(G_o)) + ( (k + 1)^q − k^q )/k^q · 8cσ_max⁴(G_o)/σ̃_min²(G_o) > 0.

Set the constants c₁, c₂ and c₃ to the same values as those in the proof of Theorem 2. Notice that ( (k + 1)^q − k^q )/k^q = (1 + 1/k)^q − 1 → 0 as k goes to infinity.
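The vanishing of ((k+1)^q − k^q)/k^q = (1 + 1/k)^q − 1, which is what allows a suitable k₀ to be chosen below, is easy to check numerically (q is arbitrary; the value below is only an illustration):

```python
# The ratio (1 + 1/k)^q - 1 is strictly decreasing in k and tends to 0,
# so it eventually drops below any fixed positive bound.
q = 1.9
ratios = [(1.0 + 1.0 / k) ** q - 1.0 for k in range(1, 10001)]
assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))   # strictly decreasing
assert ratios[-1] < 1e-3                                    # tends to 0
```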
Then, there exists a time index k₀ such that for any k ≥ k₀, it holds that

  ( (k + 1)^q − k^q )/k^q ≤ min{ σ̃_min²(G_o)/(4σ_max²(G_u)) − c/c₁,
    cσ̃_min²(G_o)(2mρ − M²)/(32m(ρ² + 2M²)) − c/c₁,
    ( m − 1/(2c₃) − 4M²/(c₁σ̃_min²(G_o)) ) / ( ρ/2 + cσ_max²(G_u)/8 + 4M²/(cσ̃_min²(G_o)) ),
    σ̃_min²(G_o)/(8cσ_max⁴(G_o)) },   (62)

where the right-hand side is larger than 0. In this situation, t^k ≤ t := s + 8c²σ_max⁴(G_o)/(c₁σ̃_min²(G_o)) + 1. Further, using the censoring threshold τ^k = α(k + 1)^{−r}, we have

  (k + 1)^q ( V^{k+1} + η^{k+1} ) ≤ k^q V^k + tnα²(k + 1)^{q−2r} + (k + 1)^q η^{k+1}.   (63)

Now we determine the values of q and η^k. Since Σ_{k′=k₀}^{∞} (k′ + 1)^{q−2r} < ∞ for any time index k₀ when 2r − q > 1, setting

  k^q η^k := Σ_{k′=k}^{∞} tnα²(k′ + 1)^{q−2r}

in (63) leads to an equivalent form

  (k + 1)^q ( V^{k+1} + η^{k+1} ) ≤ k^q ( V^k + η^k ),

which is exactly what we want in (61). Therefore, for any k ≥ k₀, it holds that

  V^k ≤ V^k + η^k ≤ k₀^q ( V^{k₀} + η^{k₀} ) / k^q.

That is, the energy function V^k converges to 0 at a sublinear rate of O(k^{−q}). Moreover, by the definition of V^k, it holds that V^k ≥ (ρ/2)‖x^k − x^*‖². Thus, the primal variable x^k converges to the unique optimal solution x^* at a sublinear rate of O(k^{−q}).
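To illustrate the contrast between the rates established in Theorems 2 and 3, the toy recursion below iterates (1 + δ)V^{k+1} = V^k + g(τ^k)² (a stand-in for (52), with arbitrary constants, not the paper's) under the two threshold schedules: a geometric threshold τ^k = αβ^k yields linear decay O((1 + δ)^{−k}) when (1 + δ)β² < 1, while a polynomial threshold τ^k = α(k + 1)^{−r} yields sublinear decay, bounded by k^{−q} for any q < 2r − 1 up to a constant.

```python
def run(thresholds, delta=0.2, g=1.0, v0=1.0):
    """Iterate (1+delta) V^{k+1} = V^k + g*(tau^k)^2 and record the trajectory."""
    v = v0
    out = [v]
    for tau in thresholds:
        v = (v + g * tau * tau) / (1.0 + delta)
        out.append(v)
    return out

K = 400
lin = run([0.5 * 0.8 ** k for k in range(K)])          # (1+delta)*beta^2 = 0.768 < 1
sub = run([0.5 * (k + 1) ** -1.5 for k in range(K)])   # r = 1.5, so any q < 2 works

# Linear case: bounded by C*(1+delta)^{-k}, with C from the geometric-sum bound.
C_lin = 1.0 + 0.25 / (1.0 - 1.2 * 0.64)
assert all(lin[k] <= C_lin * 1.2 ** -k + 1e-15 for k in range(K + 1))
# Sublinear case: below k^{-q} for q = 1.9, up to a (generous) constant.
assert sub[400] <= 100 * 400 ** -1.9
```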