Communication-Efficient Variance-Reduced Decentralized Stochastic Optimization over Time-Varying Directed Graphs
CCommunication-Efficient Variance-Reduced DecentralizedStochastic Optimization over Time-Varying Directed Graphs
Yiyue Chen, Abolfazl Hashemi, Haris Vikalo ∗ Abstract
We consider decentralized optimization over time-varying directed networks. Networknodes can access only their local objectives, and aim to collaboratively minimize a globalfunction by exchanging messages with their neighbors. Leveraging sparsification, gradienttracking and variance-reduction, we propose a novel communication-efficient decentralizedoptimization scheme that is suitable for resource-constrained time-varying directed networks.We prove that in the case of smooth and strongly-convex objective functions, the proposedscheme achieves an accelerated linear convergence rate. To our knowledge, this is the firstdecentralized optimization framework that achieves such a convergence rate and applies tosettings requiring sparsified communication. Experimental results on both synthetic and realdatasets verify the theoretical results and demonstrate efficacy of the proposed scheme.
Decentralized optimization problems are encountered in a number of settings in control, signalprocessing, and machine learning [1, 2, 3]. Formally, the goal of a decentralized optimizationtask is to minimize global objective in the form of a finite summin x ∈X (cid:34) f ( x ) := 1 n n (cid:88) i =1 f i ( x ) (cid:35) , (1)where f i ( x ) = m i (cid:80) m i j =1 f i,j ( x ) : R d → R for i ∈ [ n ] := { , ..., n } denotes the local objectivefunction that averages loss of m i data points at node i , and X denotes a convex compactconstraint set. The n nodes of the network exchange messages to collaboratively solve (1).Since the communication links between nodes in real-world networks are often uni-directionaland dynamic (i.e., time-varying), we model the network by a sequence of directed graphs, G ( t ) = ( | n | , E ( t )), where the existence of an edge e i,j ∈ E ( t ) implies that node i can sendmessages to node j at time step t .As the networks and datasets become large-scale, computational complexity and commu-nication cost of decentralized optimization start presenting major challenges. To reduce thecomplexity of computing gradients, decentralized stochastic methods that allow each agent toperform gradient estimation by processing small subset of local data are preferable [4]. However,such techniques exhibit low convergence rates and suffer from high variance of the local stochasticgradients. Remedies for these impediments in decentralized optimization over networks andin stochastic optimization in centralized settings include gradient tracking [5, 6] and variancereduction [7, 8], respectively; however, no such remedies have been developed for decentralizedoptimization over time-varying directed networks. On another note, communication constraints ∗ Yiyue Chen and Haris Vikalo are with the Department of Electrical and Computer Engineering, Universityof Texas at Austin, Austin, TX 78712 USA. Abolfazl Hashemi is with the Oden Institute for ComputationalEngineering and Sciences, University of Texas at Austin, Austin, TX 78712 USA. This work was supported inpart by NSF grant 1809327. a r X i v : . [ ee ss . S Y ] J a n ften exacerbate large-scale decentralized optimization problems where the size of the network orthe dimension of local model parameters may be on the order of millions. This motivates designof communication-efficient algorithms that compress messages exchanged between network nodesyet preserve fast convergence. In this paper, we study the general setting where a network istime-varying and directed, and present, to the best of our knowledge, the first variance-reducedcommunication-sparsifying algorithm for decentralized convex optimization for such networks.Moreover, we theoretically establish that when the local objective functions are smooth andstrongly-convex, the proposed algorithm achieves accelerated linear convergence rate. Note thatwhile our focus is on communication-constrained settings, the proposed scheme readily appliesto decentralized optimization problems where time-varying directed graphs operate under nocommunication constraints; in fact, to our knowledge, this is the first variance-reduced stochasticalgorithm for such problems. The first work on decentralized optimization over networks dates back to the 1980s [9]. Anumber of studies that followed in subsequent years was focused on the decentralized averageconsensus problem, where the network nodes work collaboratively to compute the average valueof local vectors. Convergence conditions for achieving consensus over directed and undirectedtime-varying graphs were established in [2, 1, 10, 11, 12]. In [13], the first communication-efficient algorithm that achieves linear convergence over time-invariant (static) undirected graphswas proposed, while [14] presents the first communication-sparsifying algorithm that achievessublinear convergence over time-varying directed graphs.The consensus problem can be viewed as a stepping stone towards more general decentralizedoptimization problems, where the nodes in a network aim to collaboratively minimize the sum oflocal objective functions. A number of solutions to this problem has been proposed for the settingwhere the network is undirected, including the well-known distributed (sub)gradient descentalgorithm (DGD) [3, 15], distributed alternating direction method of multipliers (D-ADMM)[16], and decentralized dual averaging methods [17, 18, 19]. Recently, [20, 13] proposed a novelcommunication-efficient algorithm for decentralized convex optimization problems; the provablyconvergent algorithm relies on a message-passing scheme with memory and biased compression.A key technical property required to ensure convergence of decentralized convex optimizationalgorithms over undirected networks is that the so-called mixing matrix characterizing thenetwork connectivity is doubly-stochastic. However, in directed networks characterized bycommunication link asymmetry, doubly-stochastic mixing matrices are atypical. This motivatedalgorithms that rely on auxiliary variables to cancel out the imbalance in asymmetric directednetworks in order to achieve convergence. For instance, the subgradient-push algorithm [21,22] works with column-stochastic mixing matrices and introduces local normalization scalarsto ensure converge. The directed distributed gradient descent (D-DGD) algorithm [23], onthe other hand, keeps track of the variation of local models by utilizing auxiliary variablesof the same dimension as the local model parameters. For convex objective functions, bothalgorithms achieve O ( ln T √ T ) convergence rate. When the objectives are strongly-convex withLipshitz gradients, and assuming availability of only the stochastic gradient terms, the stochasticgradient-push algorithm proposed in [24] achieves O ( ln TT ) convergence rate. A common featureof these algorithms is their reliance upon diminishing stepsizes to achieve convergence to theoptimal solution; in comparison, using a fixed stepsize can accelerate the gradient search butcannot guarantee the exact convergence, only the convergence to a neighborhood of the optimalsolution. The implied exactness-speed dilemma can be overcome using schemes that deploygradient tracking (see, e.g., [5, 6, 25]). These schemes utilize fixed step sizes to achieve linearconvergence rate when the objective functions are both smooth and strongly-convex. Amongthem, the Push-DIGing algorithm [6] follows the same basic ideas of the subgradient-push2lgorithm, while TV-AB [25] relies on column- and row-stochastic matrices to update modelparameters and gradient terms, respectively.The aforementioned linearly convergent methods rely on full gradient, i.e., each node isassumed to use all of its data to compute the local gradient. However, if the number ofdata points stored at each node is large, full gradient computation becomes infeasible. Tothis end, stochastic gradient descent was adapted to decentralized settings, but the resultingcomputational savings come at the cost of sublinear convergence rate [26, 24]. To acceleratetraditional stochastic gradient methods in centralized settings, variance-reduction algorithmssuch as SAG [8] and SVRG [7] have been proposed; these schemes enable linear convergence whenthe objective functions are smooth and strongly-convex. In decentralized settings, GT-SVRG[26] and Network-SVRG [27] leverage variance-reduction techniques to achieve linear convergencerate. However, these algorithms are restricted to static and undirected networks, and a narrowclass of directed networks where the mixing matrices can be rendered doubly stochastic. In recent years, decentralized learning tasks have experienced rapid growth in the amount ofdata and the dimension of the optimization problems, which may lead to practically infeasibledemands for communication between network nodes. To this end, various communicationcompression schemes have been proposed; among them, the most frequently encountered arequantization and sparsification. Quantization schemes limit the number of bits encoding themessages, while the sparsification schemes select a subset of features and represent messages inlower dimension. For instance, [29, 30, 31, 20, 32, 33] propose algorithms for distributed trainingof (non)convex machine learning models in static master-worker settings (i.e., star graph topology)using quantized/compressed information, while [34, 13, 35, 36] develop communication-efficientalgorithms for decentralized optimization over static and undirected networks. However, directednetworks in general, and time-varying ones in particular, have received considerably less attention.Decentralized optimization over such networks faces technical challenges of developing analgorithmic framework conducive to theoretical analysis and establishing convergence guarantees,which is further exacerbated when the communication is compressed. Early steps in thisdirection were made in [37] by building upon the subgradient-push algorithm to develop aquantized communication framework for decentralized optimization over a static network. In ourrecent work [14], we proposed a communication-sparsifying algorithm for decentralized convexoptimization over time-varying directed networks; however, that algorithm requires full gradientcomputation. In Table 1 we briefly summarize and contrast several algorithms for decentralizedoptimization over directed graphs.
We use lowercase bold letters to represent vectors and uppercase letters to represent matrices.[ A ] ij denotes the ( i, j ) entry of matrix A , while (cid:107) · (cid:107) denotes the standard Euclidean norm. Thespectral radius of a matrix A is denoted by ρ ( A ). The weighted infinity norm of x given apositive vector w is (cid:107) x (cid:107) w ∞ = max i | x i | /w i and the induced matrix norm is (cid:107)| · |(cid:107) w ∞ . Finally, I denotes the identity matrix whose dimension is inferred from the context. To have doubly stochastic mixing matrices, directed graphs require weight balance, i.e., at each node of agraph, the sum of the weights from in-coming edges should be equal to that of the weights from out-coming edges[28].
Algorithm Convergence Digraph Gradient Convex Objective Setting CompressionSubgradient-push [24] O ( (cid:15) ) Time-varying Stoc Strong convexity (cid:55) Push-DIGing [6] O (ln (cid:15) ) Time-varying Full Strong convexity and smoothness (cid:55) Quantized Push-sum [37] O ( (cid:15) ) Static Full - (cid:51) De-Full [14] O ( (cid:15) ) Time-varying Full - (cid:51) This work O (ln (cid:15) ) Time-varying Stoc Strong convexity and smoothness (cid:51) Assume a network of n agents where each node maintains a local model consisting of d parameters.The agents’ goal is to collaboratively solve the decentralized convex optimization problemmin x ∈ R d (cid:34) f ( x ) := 1 n n (cid:88) i =1 f i ( x ) (cid:35) , (2)where f i : R d → R denotes the local objective function at node i ; f i is assumed to be smoothand not accessible to nodes other than the i th one, and the global objective f is assumed to bestrongly-convex. We further assume existence of a unique optimal solution x ∗ ∈ R d and thateach node can communicate to its neighbors; the nodes identify x ∗ by exchanging messages overa time-varying directed network. The network’s connectivity properties are elaborated upon inSection 3. In practice, bandwidth limitations may restrict the amount of data that the network nodes cancommunicate to each other; this is typical of high-dimensional scenarios where the dimension d of local parameters x i is exceedingly large. To handle communication constraints, networknodes may employ sparsification to reduce the size of the messages. Typically, these are twoapproaches to sparsification: (i) each node selects and communicates d q out of d componentsof a d -dimensional message; or (ii) each component of a d -dimensional message is selected tobe communicated independently with probability d q /d . Note that the former imposes a hardconstraint on the number of communicated entries while the latter results in d q communicatedentries on expectation. Both select a specific entry with probability d q /d . Throughout thispaper, we focus on and study the first approach.Let Q : R d → R d denote the sparsification operator; we allow for biased Q with varianceproportional to the argument norm, i.e., E [ Q ( x )] (cid:54) = x and E [ (cid:107) Q ( x ) − x (cid:107) ] ∝ (cid:107) x (cid:107) . This standsin contrast to typical compression operators which aim to achieve no bias and have boundedvariance (see, e.g., [29]). More recent works [20, 13, 35, 37] do consider biased compressionoperators but only for time-invariant communication networks – a setting that is more restrictivethan the one considered in this paper. We start by specifying a procedure for decentralized average consensus, an intermediate steptowards decentralized optimization and, ultimately, an integral part thereof. Following theidea of [11], for each node of the network we define a so-called surplus vector, i.e., an auxiliaryvariable y i ∈ R d which tracks local state vector variations over consecutive time steps; as shown4ater in this section, one can use the surplus vector to help provide guarantees of convergence tothe optimal solution of the decentralized problem. The surplus vector is exchanged along withthe state vector, i.e., at time t , node i sends both y ti and the state vector x ti to its out-neighbors.For the sake of having compact notation, let us introduce z ti ∈ R d , z ti = (cid:40) x ti , i ∈ { , ..., n } y ti − n , i ∈ { n + 1 , ..., n } , (3)to represent messages node i communicates to its neighbors in the network.We assume that the time-varying graph is B -jointly connected, i.e., that there exists a windowsize B ≥ (cid:83) t + B− l = t G l is strongly connected for all t = k B , k ∈ N .Note that if B = 1, each instance of the graph is strongly connected. This is a more generalassumption than the often used B -bounded strong-connectivity (see, e.g. [25]) which requiresstrong connectivity of the union graph (cid:83) t + B− l = t G l for all t ≥ B -jointly connected graphs, the product of mixing matrices of graph instances over B consecutive time steps has a non-zero spectral gap [14]. To formalize this statement, we need todefine the mixing matrices. Let us start by constructing two weight matrices that reflect thenetwork topology; in particular, let W tin (row-stochastic) and W tout (column-stochastic) denotethe in-neighbor and out-neighbor weight matrices at time t , respectively. It holds that [ W tin ] ij > j ∈ N tin,i and [ W tout ] ij > i ∈ N tout,j , where N tin,i denotes the set ofnodes that may send information to node i (including i ) whereas N tout,j denotes the set of nodesthat may receive information from node j (including j ) at time t . We assume W tin and W tout aregiven and that both N tin,i and N tout,i are known to node i . A common policy for designing W tin and W tout is to assign [ W tin ] ij = 1 / |N tin,i | , [ W tout ] ij = 1 / |N tout,j | . (4)Recall that we are interested in sparsifying messages exchanged between nodes of a network;clearly, the sparsification should impact the structure of a mixing matrix. Indeed, if one attemptssparsifying messages used by existing methods, e.g. [21, 11, 12, 22], without any modificationsof the mixing matrices therein, non-vanishing error terms induced by the compression operatorwill prevent those methods from converging. We note that the impact of sparsification on thecomponents of a message vector is similar to the impact of link failures, and may thereforebe captured by the structure of the weight matrices. To elaborate on this, observe that thevector-valued problem at time t can essentially be decomposed to d individual scalar-valuedtasks with weight matrices { W tin,m } dm =1 and { W tout,m } dm =1 . For the sparsified entries, i.e., thosethat are set to zero and not communicated, the corresponding entries in the weight matrices canbe replaced by zero while the entries in the weight matrices related to the communicated entriesremain unchanged, leading to the violation of stochasticity of the weight matrices. To addressthis, we re-normalize the weight matrices { W tin,m } dm =1 and { W tout,m } dm =1 , thus ensuring theirrow and column stochasticity. Note that the re-normalization of the i th row of { W tin,m } dm =1 ( i th column of { W tout,m } dm =1 ) is performed by the i th network node.To specify the normalization rule, we first need to define the sparsification operation.Sparsification of x ti (and, consequently, y ti ) is done via the compression operator Q ( · ) appliedto z ti ; we denote the result by Q ( z ti ). Let [ Q ( z ti )] m denote the m th component of Q ( z ti ). Let { A tm } dm =1 and { B tm } dm =1 be the weight matrices obtained after normalizing { W tin,m } dm =1 and { W tout,m } dm =1 , respectively. To formalize the normalization procedure, we introduce the weightmatrix [ A tm ] ij = [ W tin,m ] ij (cid:80) j ∈S tm ( i,j ) [ W tin,m ] ij if j ∈ S tm ( i, j )0 otherwise , (5)5here S tm ( i, j ) := { j | j ∈ N tin,i , [ Q ( z tj )] m (cid:54) = 0 } ∪ { i } . Likewise, the weight matrix B tm is defined as[ B tm ] ij = [ W tout,m ] ij (cid:80) i ∈T tm ( i,j ) [ W tout,m ] ij if i ∈ T tm ( i, j )0 otherwise , (6)where T tm ( i, j ) := { i | i ∈ N tout,j , [ Q ( z ti )] m (cid:54) = 0 } ∪ { j } .We can now define the mixing matrix of a directed network with sparsified messages. Definition 1.
The m th mixing matrix at time t of a time-varying directed network with sparsifiedmessages, ¯ M tm ∈ R n × n , is a matrix whose columns sum up to and whose eigenvalues satisfy | λ ( ¯ M tm ) | = | λ ( ¯ M tm ) | ≥ | λ ( ¯ M tm ) | ≥ · · · | λ n ( ¯ M tm ) | , constructed from the current networktopology as ¯ M tm = (cid:20) A tm I − A tm B tm (cid:21) , (7) where A tm and B tm denote the m th normalized in-neighbor and out-neighbor weight matrices attime t , respectively. Given z ti and ¯ M tm in (3) and (7), respectively, one can state a compact recursive update for z ti in the communication-efficient optimization algorithm, z t +1 im = n (cid:88) j =1 [ ¯ M tm ] ij [ Q ( z tj )] m + { t mod B = B− } γ [ F ] ij z B(cid:98) t/ B(cid:99) jm − { t mod B = B− } αg B(cid:98) t/ B(cid:99) im (8)where F = (cid:20) I − I (cid:21) , m denotes the coordinate index, and g im combines global gradient trackingwith local stochastic reduction to achieve accelerated convergence (elaborated upon shortly).Note that (8) implies the following element-wise update rules for state and surplus vectors,respectively: x t +1 im = n (cid:88) j =1 [ A tm ] ij [ Q ( x tj )] m + { t mod B = B− } γy B(cid:98) t/ B(cid:99) im − { t mod B = B− } αg B(cid:98) t/ B(cid:99) im , (9) y t +1 im = n (cid:88) j =1 [ B tm ] ij [ Q ( y tj )] m − ( x t +1 im − x tim ) . (10)Therefore, to solve (2) we rely on state vectors and gradient terms stored at time steps t satisfying t mod B = B − . (11)As seen in (8), vectors z ti (containing state vectors to be averaged) are readily updated viasparsification and multiplication with the mixing matrix at all times t except those that satisfy(11). When (11) does hold, vectors z B(cid:98) t/ B(cid:99) i , stored at time B(cid:98) t/ B(cid:99) , are also used to update z ti . The stored vectors are used in order to mitigate the problem emerging when ¯ M tm has spectralgap (i.e., the difference between the moduli of its largest two eigenvalues) equal to zero; such asituation would be an obstacle to achieving linearly convergent average consensus algorithms.However, with a judiciously chosen perturbation parameter γ , which controls to which extent Note that F has all-zero matrices for its (1 ,
1) and (2 ,
1) blocks and thus we only need to store z B(cid:98) t/ B(cid:99) i (equivalently, y B(cid:98) t/ B(cid:99) i − n ), where n + 1 ≤ i ≤ n . nj =1 [ F ] ij z B(cid:98) t/ B(cid:99) jm affects the update, one can guarantee a nonzero spectral gap of the product of B consecutive mixing matrices starting from t = k B .In the proposed algorithm, updates of the gradient term g ti combine global gradient trackingwith local stochastic variance reduction. In particular, updates of g ti mix gradient messageswhile keeping track of the changes in gradient estimates v ti ; this guides g ti towards the gradientof the global objective, ultimately ensuring convergence to the optimal solution x ∗ (i.e., globalgradient tracking helps avoid the pitfall of non-vanishing local gradients which would otherwiselead the search only to a neighborhood of x ∗ ). The m th entry of g ti , g tim , is updated as g B ( (cid:98) t/ B(cid:99) ) im = (cid:80) nj =1 [ B m ( k B − k − B )] ij g ( k − B jm + v B ( (cid:98) t/ B(cid:99) ) im − v B ( (cid:98) t/ B(cid:99)− im i ≤ n . (12)where k = (cid:98) t/ B(cid:99) . The gradient estimate v ti in (12) is updated via the stochastic variance-reduction method [7]. Specifically, v B ( (cid:98) t/ B(cid:99) +1) i = ∇ f i,l i ( z B(cid:98) t/ B(cid:99) i ) − ∇ f i,l i ( ˜ w i ) + ˜ µ i , ∀ i ∈ [ n ] . (13)One can interpret this update as being executed in a double loop fashion: when a local fullgradient at node i , ˜ µ i , is computed (in what can be considered an outer loop), it is retainedin the subsequent T iterations (the inner loop). In each iteration of the inner loop, if the timestep satisfies (11), node i uniformly at random selects a local sample, l i , for calculation of twostochastic gradient estimates – an estimate of the current state, ∇ f i,l i ( z B(cid:98) t/ B(cid:99) i ), and an estimateof the state from the last outer loop, ∇ f i,l i ( ˜ w i ) – the terms needed to perform update of v ti . Bycomputing a full gradient periodically in the outer loop and estimating gradient stochastically inthe inner loop, the described procedure trades computation cost for convergence speed, ultimatelyachieving linear convergence at fewer gradient computation per sample than the full gradienttechniques.The described procedure is formalized as Algorithm 1. Remark 1.
We highlight a few important observations regarding Algorithm 1.(a) When there are no communication constraint and each agent in the network can send fullinformation to out-neighboring agents, Algorithm 1 reduces to a novel stochastic variance-reduced scheme for decentralized convex optimization over such networks.(b) For B = 1 , the problem reduces to decentralized optimization over networks that are stronglyconnected at all time steps, a typical connectivity assumption for many decentralized opti-mization algorithms [23, 13].(c) Algorithm 1 requires each node in the network to store local vectors of size d , includingthe current state vector, current and past surplus vector, and local gradient vector. Whilethe current state vector and current surplus vector may be communicated to the neighboringnodes, past surplus vectors are only used locally to add local perturbations at the time stepssatisfying (11) .(d) The columns of ¯ M tm sum up to one. However, ¯ M tm is not column-stochastic as it has negativeentries, which stands in contrast to the stochasticity property of the mixing matrices appearingin the average consensus algorithms [38, 13]. lgorithm 1 Directed Communication-Sparsifying Stochastic Variance-Reduced GradientDescent (Di-CS-SVRG) Input: T , x , y = , α , γ set z = [ x ; y ], ˜ w = z and g i = v i = ∇ f i ( x i ) ∀ i ∈ [ n ] for each s ∈ [0 , , ..., S ] do ˜ w = ˜ w s ˜ µ i = ∇ f i ( ˜ w ) = m i (cid:80) m i j =1 ∇ f i,j ( ˜ w ) for each t ∈ [ sT + 1 , ..., ( s + 1) T − do generate non-negative matrices { W tin,m } dm =1 and { W tout,m } dm =1 for each m ∈ [1 , ..., d ] do construct a row-stochastic A tm and a column-stochastic B tm according to (5) and (6) construct ¯ M tm according to (7) for each i ∈ [1 , ..., n ] do update z t +1 im according to (8) end for if t mod B = B − then for each i ∈ [1 , ..., n ] do select l i uniformly randomly from [ m i ]: update v B ( (cid:98) t/ B(cid:99) +1) i according to (13) update g B ( (cid:98) t/ B(cid:99) +1) im according to (12) end for end if end for end for ˜ w s +1 = z ( s +1) T end for For convenience, let us denote the product of a sequence of mixing matrices from time step s to T as ¯ M m ( T : s ) = ¯ M Tm ¯ M T − m · · · ¯ M sm . (14)To further simplify notation, we also introduce M m (( k + 1) B − k B ) = ¯ M m (( k + 1) B − k B ) + γF, (15)and M m ( t : k B ) = ¯ M m ( t : k B ) M m ( k B − k − B ) · · · M m (( k + 1) B − k B ) , (16)where k B ≤ t ≤ ( k + 1) B − k , k ∈ N , k ≤ k . Note that M m (( k + 1) B − k B ) isformed by adding a perturbation matrix γF to the product ¯ M m (( k + 1) B − k B ). Finally, wealso introduce shorthand notation for the product of the weight matrices B m from time s to T , B m ( T : s ) = B Tm B T − m · · · B sm . (17)Our analysis relies on several standard assumptions about the graph and network connectivitymatrices as well as the characteristics of the local and global objective functions. These aregiven next. Assumption 1.
Suppose the following conditions hold: a) The product of consecutive mixing matrices M m (( k + 1) B − k B ) in (15) has a non-zerospectral gap for all k ≥ , ≤ m ≤ d , and all < γ < γ for some < γ < .(b) The collection of all possible mixing matrices { ¯ M tm } is a finite set.(c) Each component of the local objective function f i,j is L -smooth and the global objective f is µ -strongly-convex . Remark 2.
Assumption 1(a) is readily satisfied for a variety of graph structures such as the B -strongly connected directed graph introduced in [6], i.e., the setting where the union of graphsover B consecutive instances starting from k B forms a strongly connected graph for any non-negative integer k . Furthermore, one can readily verify that Assumption 1(b) holds for theweight matrices defined in (4) . Before stating the main theorem, we provide the following lemma which, under Assumption1, establishes the consensus contraction of the product of mixing matrices and the product ofnormalized weight matrices.
Lemma 1.
Suppose Assumptions 1(a) and 1(b) hold. Let σ = max( | λ M, | , | λ B, | ) denote thelarger of the second largest eigenvalues of M m (( k + 1) B − k B ) and B m (( k + 1) B − k B ) .Then, (cid:107) M m (( k + 1) B − k B ) z − ¯ z (cid:107) ≤ σ (cid:107) z − ¯ z (cid:107) , ∀ z ∈ R n and (cid:107) B m (( k + 1) B − k B ) y − ¯ y (cid:107) ≤ σ (cid:107) y − ¯ y (cid:107) , ∀ y ∈ R n , (18) where ¯ z = [ n (cid:80) ni =1 z i , · · · , n (cid:80) ni =1 z i ] T and ¯ y = [ n (cid:80) ni =1 y i , · · · , n (cid:80) ni =1 y i ] T .Proof. The result follows directly from Lemma 2 in [14] and the fact that M m (( k + 1) B − k B )and B m (( k + 1) B − k B ) both have column sums equal to 1. (cid:4) Our main result, stated in Theorem 1 below, establishes that Algorithm 1 provides linearconvergence of local parameters to their average values, which itself converges linearly to theoptimal solution of (1).
Theorem 1.
Suppose Assumption 1 holds. Denote the condition number of f by ˜ Q = Lµ . If thestep size α is chosen according to α = (1 − σ )
187 ˜
QL , (19) the iterates of Algorithm 1 satisfy n n (cid:88) i =1 E (cid:107) ¯ z ST − z STi (cid:107) + E (cid:107) ¯ z ST − x ∗ (cid:107) ≤ λ S × (cid:32) n n (cid:88) i =1 (cid:107) ¯ z − z i (cid:107) + (cid:107) ¯ z − x ∗ (cid:107) + (1 − σ ) nL n (cid:88) i =1 d (cid:88) m =1 E | g im − ¯ g m | (cid:33) , (20) where λ = 8 ˜ Q exp ( − (1 − σ ) T
748 ˜ Q ) + 0 . , (21)¯ z t = n (cid:80) ni =1 z ti , and ¯ g t = n (cid:80) ni =1 g ti . This implies that ∀ x , x ∈ R d there exists L > (cid:107)∇ f i,j ( x ) −∇ f i,j ( x ) (cid:107) ≤ L (cid:107) x − x (cid:107) . Furthermore, ∀ x , x ∈ R d there exists µ > f ( x ) ≥ f ( x ) + (cid:104)∇ f ( x ) , x − x (cid:105) + µ (cid:107) x − x (cid:107) . There are two versions of the definition of B -strongly connected directed graphs, the difference being thewindow starting time. As noted in Section II, we consider the definition where the window may start at any time t = k B ; this differs from the (more demanding in regards to connectivity) definition in [25] where the startingtime is an arbitrary non-negative integer. orollary 1.1. Instate the notation and hypotheses of Theorem 1. If, in addition, the inner-loopduration T is chosen as T = B(cid:100) Q (1 − σ ) B ln(200 ˜ Q ) (cid:101) , (22) the proposed algorithm achieves a linear convergence rate. Furthermore, to reach an (cid:15) -accuratesolution, Algorithm 1 takes at most O ( ˜ Q ln ˜ Q (1 − σ ) ln 1 /(cid:15) ) communication rounds and performs O ( ˜ Q ln ˜ Q (1 − σ ) B ln 1 /(cid:15) ) gradient computations.Proof. It is straightforward to verify that for the stated value of T , λ ≤ . < (cid:4) Note that due to the gradient tracking step in Algorithm 1, when constructing the linearsystem of inequalities (which includes the gradient tracking error), the ˜ Q factor in the coefficientmatrix leads to the dependence of the rate upon ˜ Q . Remark 3.
Clearly, the communication cost of Algorithm 1 depends on the level of sparsification,i.e., the value of parameter d q . Intuitively, if agents communicate fewer entries in each round,the communication cost per round decreases but the algorithm may take more rounds to reach thesame accuracy. Therefore, the total communication cost until reaching a pre-specified (cid:15) -accuracy,found as the product of the number of communication rounds and the cost per round, is ofinterest. Let q denote the fraction of entries being communicated per iteration; smaller q impliesmore aggressive sparsification. This compression level parameter, q , impacts σ in Theorem 1;in particular, for a fixed network connectivity parameter B , smaller q leads to sparser mixingmatrices and, consequently, greater σ . Note that large B may be caused by sparsity of theinstances of a time-varying network, thus leading to large values of σ . In this section, we prove Theorem 1 by analyzing various error terms that collectively impact theconvergence rate of Algorithm 1. Specifically, the convergence rate depends on: (i) the expectedconsensus error, i.e., the expected squared difference between local vectors and the average vectorsat time ( k + 1) B , E [ | z ( k +1) B im − ¯ z ( k +1) B m | ]; (ii) the expected optimality error, i.e., the expectedsquared difference between the average vectors and the optimal vector, E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ]; and(iii) the expected gradient tracking error, E [ (cid:80) dm =1 (cid:80) ni =1 | g ( k +1) B im − ¯ g ( k +1) B m | . Hence, it is criticalto determine evolution of these sequences. Note that compared to the gradient-tracking basedwork, e.g. [6, 26], analyzing the proposed scheme is more involved due to its reliance upon acombination of variance reduction techniques and a communication-sparsifying consensus [14];showing that the novel scheme achieves linear convergence on general directed time-varyinggraphs despite sparsified communication calls for a careful examination of the error terms in amanner distinct from the analysis found in prior work.Dynamics of the aforementioned errors are clearly interconnected. Consequently, our analysisrelies on deriving recursive bounds for the errors in terms of linear combinations of their pastvalues. The results are formally stated in Lemma 2, Lemma 3, and Lemma 4. Proofs of theselemmas are provided in the supplementary document. Lemma 2.
Suppose Assumption 1 holds. Then ∀ i ≤ n , k ≥ and < m ≤ d , updates generatedby Algorithm 1 satisfy E [ | z ( k +1) B im − ¯ z ( k +1) B m | ] ≤ σ E [ | z k B im − ¯ z k B m | ]+ 2 α − σ E [ | g k B im − ¯ g k B m | ] . (23)10aving established in Lemma 2 a recursive bound on the expected consensus error, weproceed by stating in Lemmas 3 and 4 recursive bounds on the expected optimality and gradienttracking errors, respectively. First, let us introduce (for k ≥ τ k B = 1 n n (cid:88) i =1 τ k B i ,τ ( k +1) B i = (cid:40) x ( k +1) B i if ( k + 1) B mod T = 0˜ w i otherwise . (24) Lemma 3.
Suppose Assumption 1 holds and let < α < µ L . Then for all k > it holds that E [ n (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L αµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ (1 − µα E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 L α n E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 4 L α n E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] , (25) Lemma 4.
Suppose Assumption 1 holds. Then L E [ d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ] ≤ − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 891 − σ E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 3 + σ E [ (cid:80) dm =1 (cid:80) ni =1 | g k B im − ¯ g k B m | L ]+ 381 − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 381 − σ E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (26)We proceed by defining a system of linear inequalities involving the three previously discussederror terms; study of the conditions for the geometric convergence of the powers of the resultantmatrix in the system of linear inequalities leads to the linear convergence result in Theorem 1.To this end, we first state Proposition 1 whose proof follows by combining and re-organizing theinequalities in Lemmas 2-4 in a matrix form. Proposition 1.
Suppose Assumption 1 holds. Define u k B = E [ (cid:80) ni =1 (cid:107) ¯ z k B − z k B i (cid:107) ] E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ] E [ (cid:80) dm =1 (cid:80) ni =1 | g k B im − ¯ g k B m | L ] , (27)˜ u k B = (cid:20) E [ (cid:80) ni =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ] E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] (cid:21) , (28)11 α = σ α L − σ L αµ − µα − σ − σ σ , (29) H α = L α n L α n − σ − σ . (30) If ≤ α ≤ µ (1 − σ )14 √ L , then for any k ≥ it holds that u ( k +1) B ≤ J α u k B + H α ˜ u k B . (31)As a direct consequence of Proposition 1, for the iterations of the inner loop of Algorithm 1,for all k ∈ [ s (cid:98) T /
B(cid:99) , ( s + 1) (cid:98) T /
B(cid:99) −
1] it holds that u ( k +1) B ≤ J α u k B + H α u sT , (32)while for the outer loop of Algorithm 1 it holds for all ∀ s ≥ u ( s +1) T ≤ ( J Tα + T − (cid:88) l =0 J lα H α ) u sT . (33)Now, to guarantee linear decay of the outer loop sequence, we restrict the range of the innerloop duration T and the step size α according to ρ ( J Tα + T − (cid:88) l =0 J lα H α ) < , (34)where ρ ( · ) denotes the spectral radius of its argument.In Lemma 5 below, we establish the range of α such that the weighted matrix norms of J Tα and (cid:80) T − l =0 J lα H α are small, thereby ensuring the geometric convergence of the powers of thesematrices to . Lemma 5.
Suppose Assumption 1 holds and let < α ≤ (1 − σ )
187 ˜ QL , where ˜ Q = Lµ . Then ρ ( J α ) < (cid:107)| J α |(cid:107) δ ∞ < − µα and (cid:107)| T − (cid:88) l =0 J lα H α |(cid:107) q ∞ ≤ (cid:107)| ( I − J α ) − H α |(cid:107) q ∞ < . , (36) where δ = (cid:104) , Q , Q (1 − σ ) (cid:105) and q = [1 , , − σ ) ] . Essentially, Lemma 5 establishes the range of the stepsize α such that the matrices involved inthe system of linear inequalities (33) have small norm. Upon setting α = (1 − σ )
187 ˜ QL (i.e., assigning α the largest feasible value), we proceed to determine the number of iterations in the outer loopsuch that the powers of matrix J Tα + (cid:80) T − l =0 J lα H α in (33), and hence the components of u (i.e.the error terms), converge to zero at a geometric rate.To this end, note that since J α is non-negative, (cid:80) T − l =0 J lα ≤ (cid:80) ∞ l =0 J lα = ( I − J α ) − . Hence,for all s ≥ u ( s +1) T ≤ ( J Tα + ( I − J α ) − H α ) u sT . (37)12ince α = (1 − σ ) QL , we may write (cid:107) u ( s +1) T (cid:107) q ∞ ≤ (cid:107)| J Tα + ( I − J α ) − H α |(cid:107) q ∞ (cid:107) u sT (cid:107) q ∞ ≤ ( (cid:107)| J Tα |(cid:107) q ∞ + 0 . (cid:107) u sT (cid:107) q ∞ ≤ (8 ˜ Q ( (cid:107)| J Tα |(cid:107) δ ∞ ) T + 0 . (cid:107) u sT (cid:107) q ∞ ≤ (8 ˜ Q exp ( − (1 − σ ) T
748 ˜ Q ) + 0 . (cid:107) u sT (cid:107) q ∞ := λ (cid:107) u sT (cid:107) q ∞ . (38)The result in (20) follows simply by noting the definitions of u sT and the (cid:107) · (cid:107) q ∞ norm. Therefore,the proof of Theorem 1 is complete. In this section, we report results of benchmarking the proposed Algorithm 1; for convenience,we refer to the algorithm as Di-CS-SVRG (Directed Communication-Sparsifying StochasticVariance-Reduced Gradient Descent). The network consists of 10 nodes with randomly generatedtime-varying connections while ensuring strong connectivity at each time step. The constructionbegins with the Erd˝os–R´enyi model [39] where each edge exists independently with probability0 .
9; then, 2 directed edges are dropped from each strongly connected graph, leading to directedgraphs. Building upon this basic structure, we can design networks with different connectivityprofiles. Recall that the window size parameter B , introduced in Assumption 1 (a), mayimply that the union graph over B consecutive instances, starting from any instance that is amultiple of B , forms an almost-surely strongly connected Erd˝os–R´enyi graph. When B = 1, thenetwork is strongly connected at each time step. The parameter q , the fraction of entries beingcommunicated to neighboring nodes, characterizes the level of message sparsification; q = 1implies communication without compression, while q = 0 indicates there is no communication inthe network. We test the proposed Di-CS-SVRG on two tasks, linear and logistic regression, and benchmark itagainst several baseline algorithms. In particular, we compare Di-CS-SVRG with DecentralizedFull Gradient Descent (De-Full) [14], Decentralized Stochastic Gradient Descent (De-Stoc,stochastic variant), Push-DIGing (Push-DIG-Full) [6] and Push-DIGing Stochastic (Push-DIG-Stoc, stochastic variant). In addition, we also compare it to our previously proposedcommunication-sparsifying scheme that relies on full gradient computation, the SparsifyingDe-Full (S-De-Full) algorithm in [14].
We consider the setting where n nodes collaboratively solve the optimization problemmin x (cid:40) n n (cid:88) i =1 (cid:107) y i − D i x (cid:107) (cid:41) , (39)where for the i th node D i ∈ R × denotes the matrix with 200 local samples of size d = 64, and y i ∈ R denotes the corresponding measurement vector. The true value of x , x ∗ , is generatedfrom a normal distribution, and the samples are generated synthetically. The measurements aregenerated as y i = M i x ∗ + η i , where the entries of M i are generated randomly from the standardnormal distribution and then M i is normalized such that its rows sum up to one. The local13oise vector η i is drawn at random from a zero-mean Gaussian distribution with variance 0 . x i , and run with step size α t = . t .Performance of the algorithms is characterized using three metrics: residual over iterations,residual over average gradient computation and residual over communication cost, where theresidual is computed as (cid:107) x t − x ∗ (cid:107)(cid:107) x − x ∗ (cid:107) . The results are shown in Fig. 1. As seen in Fig. 1 (a),Di-CS-SVRG (i.e., our Algorithm 1) with q = 1 converges at a linear rate and, while being astochastic gradient algorithm, reaches the same residual floor as the full gradient method Push-DIG-Full. Di-CS-SVRG converges much faster than the two baseline algorithms employing SGD,Push-DIG-Stoc and De-Stoc. Fig. 1 (b) shows that Di-CS-SVRG with varied compression levels q converges to the same residual floor, and that (as expected) larger q leads to faster convergence.Moreover, the figure shows that for a fixed q , Di-CS-SVRG achieves faster convergence than thebenchmark algorithms.Fig. 1 (c) compares different algorithms in terms of the number of gradients computedper sample, demonstrating computation efficiency of Di-CS-SVRG. Finally, Fig. 1 (d) showsthe communication cost for varied q , computed as the total number of the (state, surplus andgradient) vector entries communicated across the network. As seen in the figure, to achieve apre-specified level of the residual, Di-CS-SVRG with q = 0 .
05 incurs smaller communication costthan any other considered algorithm.
To perform benchmarking on a logistic regression task, we solve a multi-class classificationproblem on the Stackoverflow dataset [40].min x µ (cid:107) x (cid:107) + n (cid:88) i =1 N (cid:88) j =1 ln(1 + exp( − ( m Tij x ) y ij )) , (40)where the training samples ( m ij , y ij ) ∈ R , m ij represents a vectorized text feature and y ij represents the corresponding tag vector. We compare the performance of Di-CS-SVRGwith the same benchmarking algorithms as in the linear regression problem, and use the sameinitialization setup. The logistic regression experiment is run with the stepsize α t = . t ; theregularization parameter is set to µ = 10 − .Performance of the algorithms on the logistic regression problem is characterized by theclassification correct rate. In particular, we evaluate the following three metrics: the correctrate vs. iterations, the correct rate vs. average gradient computation, and the correct rate vs.communication cost; they are all shown in Fig. 2. As seen in Fig. 2 (a), Di-CS-SVRG convergesand reaches the same residual floor as the full gradient method Push-DIG-Full. Di-CS-SVRGconverges much faster than then the two algorithms that rely on SGD – Push-DIG-Stoc andDe-Stoc. Fig. 2 (b) shows that for varied compression levels q , Di-CS-SVRG converges to thesame residual floor. As expected, larger q leads to faster convergence. For a fixed q , Di-CS-SVRGconverges faster than the benchmark algorithm.Fig. 2(c) reports the average gradient computation, i.e., the number of gradients computedper sample. As can be seen, Di-CS-SVRG with different compression levels uses fewer gradientcomputation than the full gradient schemes (Push-DIG-Full and De-Full) to reach 90% correctclassification rate.Fig. 2 (d) shows the communication cost, defined as the total number of the (state, surplusand gradient) vector entries exchanged across the network, for various values of q . Amongthe considered schemes, Di-CS-SVRG with q = 0 .
08 reaches a pre-specified residual level withsmaller communication cost than any stochastic scheme (Push-DIG-Stoc, De-Stoc as well asDi-CS-SVRG with other q ’s). Even though Push-DIG-Stoc reaches various accuracies slightly14
200 400 600 800 1000Iteration10 − − − R e s i dua l Push-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di-CS-SVRG (a) Residual vs. iterations: full communicationschemes. − − − − − − R e s i dua l q = 1, Di − CS − SVRGq = 0.05, Di − CS − SVRGq = 0.08, Di − CS − SVRGq = 1, S − De − Fullq = 0.05, S − De − Fullq = 0.08, S − De − Full (b) Residual vs. iterations: compression communi-cation schemes. − − − R e s i dua l Push-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di-CS-SVRGq = 0.05, Di − CS − SVRGq = 0.08, Di − CS − SVRG (c) Residual vs. gradient computations per sample. − − − R e s i d a l P sh-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di − CS − SVRGq = 0.05, Di − CS − SVRGq = 0.08, Di − CS − SVRG (d) Residual vs. communication cost.
Figure 1:
Linear regression, B = 5. (a) The residual achieved by full communication schemes andthe residual of Di-CS-SVRG (Algorithm 1) with q = 1 vs. iterations. (b) The residual achieved byDi-CS-SVRG with different compression levels: q = 1, q = 0 .
08, and q = 0 .
05 vs. iterations. (c) Thecumulative number of gradient computations needed to reach given level of the residual; showing boththe compression as well as full communication schemes. (d) The cumulative communication cost to reachgiven level of the residual; showing both the compression as well as full communication schemes. C o rr e c t R a t e Push-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di-CS-SVRG (a) Correct rate vs. iterations: full communicationschemes. C o rr e c t R a t e q = 1, Di − CS − SVRGq = 0.08, Di − CS − SVRGq = 0.12, Di − CS − SVRGq = 1, S − De − Fullq = 0.08, S − De − Fullq = 0.12, S − De − Full (b) Correct rate vs. iterations: compressionschemes. C o rr e c t R a t e Push-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di-CS-SVRGq = 0.08, Di − CS − SVRGq = 0.12, Di − CS − SVRG (c)
Correct rate vs. gradient computations per sample. C o rr e c t R a t e Push-DIG-FullPush-DIG-StocDe-FullDe-Stocq = 1, Di − CS − SVRGq = 0.08, Di − CS − SVRGq = 0.12, Di − CS − SVRG (d) Correct rate vs. communication cost.
Figure 2:
Logistic regression, B = 1. (a) The correct classification rate achieved by full communicationschemes and the correct classification rate of Di-CS-SVRG vs. iterations. (b) The correct classificationrate for Di-CS-SVRG with varied compression levels: q = 1, q = 0 . q = 0 .
08 vs. iterations. (c) Thecumulative number of gradient computations required to reach given levels of correct classification rate.(d) The cumulative communication cost required to reach given level of the correct classification rate.
To further test Di-CS-SVRG in different settings, we apply it to decentralized optimization overnetworks with varied connectivity and sizes.
We consider the linear regression problem and vary the values of the joint connectivity parameter B . Fig. 3 (a) shows the resulting residuals; as seen there, larger B , implying the network takeslong time before the union of its instances forms a strongly connected graph, leads to slowerconvergence. We now consider the logistic regression problem over networks of varied sizes. In particular,we fix the total number of data points and vary the number of nodes in the network. As thenetwork size is increased, i.e., the number of nodes in the network becomes larger, each agenthas fewer locally available data points. Fig. 3 (b) shows the correct rate for compression levels q = 1, q = 0 .
12 and q = 0 .
08 as n grows from 10 to 30. For a pre-specified sparsification level,larger networks, in which each agent has fewer data points to train its local model, requiresmore communication rounds and therefore takes longer to converge. Fig. 3 (b) shows that aswe increase the network size, the locally assigned data points decrease at each agent. For apre-specified sparsification level, larger network converge slower. In this paper we studied decentralized convex optimization problems over time-varying directednetworks and proposed a stochastic variance-reduced algorithm for solving them. The algorithmsparsifies messages exchanged between network nodes thus enabling collaboration in resource-constrained settings. We proved that the proposed algorithm, Di-CS-SVRG, enjoys linearconvergence rate, and demonstrated its efficacy through simulation studies. As part of the futurework, it is of interest to extend this work to decentralized non-convex optimization problems.17
200 400 600 800 1000 1200It erat ion10 − − − − − − R e s i dua l q = 1, B = 3q = 0.05, B = 3q = 0.08, B = 3q = 1, B = 5q = 0.05, B = 5q = 0.08, B = 5q = 1, B = 10q = 0.05, B = 10q = 0.08, B = 10 (a) Linear regression for varied network connectivity. C o rr e c t R a t e q = 1, n = 10q = 0.08, n = 10q = 0.12, n = 10q = 1, n = 20q = 0.08, n = 20q = 0.12, n = 20q = 1, n = 30q = 0.08, n = 30q = 0.12, n = 30 (b) Correct rate on logistic regression with varied network sizes. Figure 3: Varying network connectivity and size.18 eferences [1] Wei Ren and Randal W Beard. “Consensus seeking in multiagent systems under dynamicallychanging interaction topologies”. In:
IEEE Transactions on automatic control
IEEE Transactions on automaticcontrol
IEEE Transactions on Automatic Control
Journal of optimizationtheory and applications
IEEE Transactions on Control of Network Systems
SIAM Journal on Optimization
Advances in neural information processing systems . 2013, pp. 315–323.[8] Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing finite sums with thestochastic average gradient”. In:
Mathematical Programming
Problems in decentralized decision making and computation.
Tech.rep. Massachusetts Inst of Tech Cambridge Lab for Information and Decision Systems,1984.[10] Wei Ren, Randal W Beard, and Ella M Atkins. “Information consensus in multivehiclecooperative control”. In:
IEEE Control systems magazine
Automatica
IEEE Transactions on Automatic Control
InternationalConference on Machine Learning . 2019, pp. 3478–3487.[14] Yiuye Chen, Abolfazl Hashemi, and Haris Vikalo. “Communication-Efficient Algorithmsfor Decentralized Optimization Over Time-Varying Directed Graphs”. In: arXiv preprintarXiv:2005.13189 (2020).[15] Bj¨orn Johansson, Maben Rabi, and Mikael Johansson. “A randomized incremental subgra-dient method for distributed optimization in networked systems”. In:
SIAM Journal onOptimization . IEEE. 2012,pp. 5445–5450. 1917] John C Duchi, Alekh Agarwal, and Martin J Wainwright. “Dual averaging for distributedoptimization: Convergence analysis and network scaling”. In:
IEEE Transactions onAutomatic control . IEEE. 2015, pp. 4497–4503.[19] Lie He, An Bian, and Martin Jaggi. “Cola: Decentralized linear learning”. In:
Advances inNeural Information Processing Systems . 2018, pp. 4536–4546.[20] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. “Sparsified SGD withmemory”. In:
Advances in Neural Information Processing Systems . 2018, pp. 4447–4458.[21] David Kempe, Alin Dobra, and Johannes Gehrke. “Gossip-based computation of aggregateinformation”. In:
IEEE. 2003, pp. 482–491.[22] Angelia Nedi´c and Alex Olshevsky. “Distributed optimization over time-varying directedgraphs”. In:
IEEE Transactions on Automatic Control
Neurocomputing
267 (2017), pp. 508–515.[24] Angelia Nedi´c and Alex Olshevsky. “Stochastic gradient-push for strongly convex functionson time-varying directed graphs”. In:
IEEE Transactions on Automatic Control arXiv preprint arXiv:1810.07393 (2018).[26] Ran Xin, Usman A Khan, and Soummya Kar. “Variance-reduced decentralized stochasticoptimization with accelerated convergence”. In: arXiv preprint arXiv:1912.04230 (2019).[27] Boyue Li et al. “Communication-efficient distributed optimization in networks with gradienttracking and variance reduction”. In:
International Conference on Artificial Intelligenceand Statistics . PMLR. 2020, pp. 1662–1672.[28] Bahman Gharesifard and Jorge Cort´es. “When does a digraph admit a doubly stochasticadjacency matrix?” In:
Proceedings of the 2010 American Control Conference . IEEE. 2010,pp. 2440–2445.[29] Hanlin Tang et al. “Communication compression for decentralized training”. In:
Advancesin Neural Information Processing Systems . 2018, pp. 7652–7662.[30] Wei Wen et al. “Terngrad: Ternary gradients to reduce communication in distributed deeplearning”. In:
Advances in neural information processing systems . 2017, pp. 1509–1519.[31] Hantian Zhang et al. “Zipml: Training linear models with end-to-end low precision, anda little bit of deep learning”. In:
Proceedings of the 34th International Conference onMachine Learning-Volume 70 . JMLR. org. 2017, pp. 4035–4043.[32] Rudrajit Das et al. “Improved Convergence Rates for Non-Convex Federated Learningwith Compression”. In: arXiv preprint arXiv:2012.04061 (2020).[33] Amirhossein Reisizadeh et al. “Robust and communication-efficient collaborative learning”.In:
Advances in Neural Information Processing Systems . 2019, pp. 8386–8397.[34] Zebang Shen et al. “Towards More Efficient Stochastic Decentralized Learning: FasterConvergence and Sparse Communication”. In:
International Conference on MachineLearning . 2018, pp. 4624–4633. 2035] Anastasia Koloskova et al. “Decentralized deep learning with arbitrary communicationcompression”. In: arXiv preprint arXiv:1907.09356 (2019).[36] Abolfazl Hashemi et al. “On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization”. In: arXiv preprint arXiv:2012.04061 (2020).[37] Hossein Taheri et al. “Quantized Decentralized Stochastic Learning over Directed Graphs”.In:
International Conference on Machine Learning (ICML) . 2020.[38] Lin Xiao and Stephen Boyd. “Fast linear iterations for distributed averaging”. In:
Systems& Control Letters
Publicationes mathematicae
The Stack Overflow Data . url : .[41] Ran Xin et al. “Distributed stochastic optimization with gradient tracking over strongly-connected networks”. In: arXiv preprint arXiv:1903.07266 (2019).21 upplementary Material In this supplementary document, we provide proofs of the auxiliary lemmas used in the proof ofTheorem 1 in the main manuscript.The following lemma, restated for convenience, states an upper bound on the consensuserror.
Lemma . Suppose Assumption 1 holds. Then, ∀ i ≤ n , k ≥ , and < m ≤ d , the updatesgenerated by Algorithm 1 satisfy E [ | z ( k +1) B im − ¯ z ( k +1) B m | ] ≤ σ E [ | z k B im − ¯ z k B m | ]+ 2 α − σ E [ | g k B im − ¯ g k B m (cid:107) ] (41) Proof.
By constructing normalized weight matrices and relying on the definition of M m (( k +1) B − k B ), we can simplify the update as z ( k +1) B im = n (cid:88) j =1 [ M m (( k + 1) B − k B )] ij [ Q ( z k B j )] m − αg k B im = n (cid:88) j =1 [ M m (( k + 1) B − k B )] ij z k B jm − αg k B im . (42)Since for any m we have ¯ g k B m = n (cid:80) nj =1 g k B jm , it holds that | z ( k +1) B im − ¯ z ( k +1) B m | = | n (cid:88) j =1 [ M m (( k + 1) B − k B )] ij z k B jm − ¯ z k B m − α ( g k B im − ¯ g k B m ) | . (43)By Young’s inequality, (cid:107) a + b (cid:107) ≤ (1 + η ) (cid:107) a (cid:107) + (1 + 1 η ) (cid:107) b (cid:107) , ∀ η > , a , b . Using Lemma 1, ∀ i ≤ n and 0 < m ≤ d it holds that | n (cid:88) j =1 [ M m (( k + 1) B − k B )] ij z k B jm − ¯ z k B m | ≤ σ | z k B im − ¯ z k B m | . (44)Then for all k ≥ | z ( k +1) B im − ¯ z ( k +1) B m | ≤ (1 + η ) σ | z k B im − ¯ z k B m | + (1 + 1 η ) α | g k B im − ¯ g k B m | . (45)Setting η = − σ σ completes the proof. (cid:4) Next, we are going to prove two lemmas stating upper bounds on the optimality gap and thegradient tracking error. For convenience, we first introduce¯ τ k B = 1 n n (cid:88) i =1 τ k B i (46)and τ ( k +1) B i = (cid:40) x ( k +1) B i if ( k + 1) B mod T = 0˜ w i otherwise . (47)22 emma . Suppose Assumption 1 holds and let < α < µ L . Then for all k > it holds that E [ n (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L αµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ (1 − µα E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 L α n E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 4 L α n E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] , (48) Lemma . Suppose Assumption 1 holds. Then, L E [ d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ] ≤ − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 891 − σ E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 3 + σ E [ (cid:80) dm =1 (cid:80) ni =1 | g k B im − ¯ g k B m | L ]+ 381 − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 381 − σ E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (49)Proving Lemma 3 and 4 requires a series of auxiliary lemmas, Lemma 3.1-3.4. We start withLemma 3.1, which states an upper bound on E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ]. Lemma . . Suppose Assumption 1 holds. Let < α < L , where L is the smoothness parameter.For all k > , it holds that E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L αnµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ] + (1 − µα ) E [ (cid:107) ¯ z k B − x ∗ (cid:107) ]+ α n E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] , (50) where µ is the strong convexity parameter and ∇ f ( x k B ) = [ ∇ f ( x k B ); · · · ; ∇ f n ( x k B n )] .Proof. By definition, ¯ z tm = n (cid:80) ni =1 z tim . Let us denote ∇ ¯ f ( x t ) = ∇ ¯ f ( z t ) = n (cid:80) ni =1 ∇ f i ( z ti ). Byinduction, ¯ g k B m = ¯ v k B m . (51)Next, we have that for any m ¯ z ( k +1) B m = ¯ z k B m − α ¯ g k B m = ¯ z k B m − α ¯ v k B m , (52)which implies that ¯ z ( k +1) B = ¯ z k B − α ¯ g k B = ¯ z k B − α ¯ v k B . (53)23ote that the randomness in Algorithm 1 originates from a set of independent randomvariables { ω ti } t ≥ i ∈ [2 n ] . We rely on the σ -algebra F k B to characterize the history of the dynamicalsystem generated by { ω ti } t ≤ k B− i ∈ [2 n ] , E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) |F k B ] = E [ (cid:107) ¯ z k B − α ¯ v k B − x ∗ (cid:107) |F k B ]= E [ (cid:107) ¯ z k B − α ∇ f (¯ z k B ) − x ∗ + α ( ∇ f (¯ z k B ) − ¯ v k B ) (cid:107) |F k B ]= (cid:107) ¯ z k B − α ∇ f (¯ z k B ) − x ∗ (cid:107) + α E [ (cid:107)∇ f (¯ z k B ) − ¯ v k B (cid:107) |F k B ]+ 2 α (cid:104) ¯ z k B − α ∇ f (¯ z k B ) − x ∗ , ∇ f (¯ z k B ) − ¯ v k B (cid:105) = (cid:107) ¯ z k B − α ∇ f (¯ z k B ) − x ∗ (cid:107) + α E [ (cid:107)∇ f (¯ z k B ) − ¯ v k B (cid:107) |F k B ]+ 2 α (cid:104) ¯ z k B − α ∇ f (¯ z k B ) − x ∗ , ∇ f (¯ z k B ) − ∇ ¯ f ( x B ) (cid:105) . (54)We then proceed by considering E [ (cid:107)∇ f (¯ z k B ) − ¯ v k B (cid:107) |F k B ], E [ (cid:107)∇ f (¯ z k B ) − ¯ v k B (cid:107) |F k B ] = E [ (cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) + ∇ ¯ f ( x k B ) − ¯ v k B (cid:107) |F k B ]= (cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) + E [ (cid:107)∇ ¯ f ( x k B ) − ¯ v k B (cid:107) |F k B ] , (55)where the fact that E [¯ v t |F t ] = ∇ ¯ f ( x t ) is used. Furthermore, note that E [ (cid:107)∇ ¯ f ( x k B ) − ¯ v k B (cid:107) |F k B ] = 1 n E [ (cid:107) n (cid:88) i =1 ( v k B i − ∇ f i ( z k B i )) (cid:107) |F k B ]= 1 n E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) |F k B ] , (56)since (cid:8) v ti (cid:9) ni =1 are independent given F t and E [ (cid:80) i (cid:54) = j (cid:104) v ti − ∇ f i ( z ti ) , v tj − ∇ f j ( z tj ) (cid:105)|F t ] = 0. Recallthe strong convexity of the objective, we have that if 0 < α ≤ L , ∀ x (cid:107) x − α ∇ f ( x ) − x ∗ (cid:107) ≤ (1 − µα ) (cid:107) x − x ∗ (cid:107) (57)It follows that E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) |F k B ] ≤ (1 − µα ) (cid:107) ¯ z k B − x ∗ (cid:107) + α (cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) + 2 α (1 − µα ) (cid:107) ¯ z k B − x ∗ (cid:107)(cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) + α n E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) |F k B ] . (58)Using Young’s inequality, we readily obtain that2 α (cid:107) ¯ z k B − x ∗ (cid:107)(cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) ≤ µα (cid:107) ¯ z k B − x ∗ (cid:107) + αµ (cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) . (59)On the other hand, assuming convexity and smoothness, we have that ∀ k ≥ (cid:107)∇ f (¯ z k B ) − ∇ ¯ f ( x k B ) (cid:107) = (cid:107) n (cid:88) i =1 ∇ f i (¯ z k B ) − ∇ f i ( z k B i ) n (cid:107) (60) ≤ L n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) n (61) ≤ L (cid:118)(cid:117)(cid:117)(cid:116) n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) n (62)= L √ n (cid:118)(cid:117)(cid:117)(cid:116) n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) . (63)24hen by taking the total expectation, E [ (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L αnµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ] + (1 − µα ) E [ (cid:107) ¯ z k B − x ∗ (cid:107) ]+ α n E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] (64) (cid:4) The following lemma helps establish an upper bound on the expected gradient tracking error.
Lemma . . Suppose the objective function f is µ -strongly-convex and that each component ofthe local objective function f i,j is L -smooth. If < α < √ L , E [ (cid:107) g ( k +1) B∼ n − n ¯ g ( k +1) B (cid:107) ] ≤ L − σ E [ (cid:107) z k B − n (¯ z k B ) (cid:48) (cid:107) ] + 2 L − σ E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ ( 1 + σ α L − σ ) E [ (cid:107) g k B∼ n − n ¯ g k B (cid:107) ]+ 51 − σ E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ]+ 41 − σ E [ (cid:107) v ( k +1) B − ∇ f ( x ( k +1) B ) (cid:107) ] , (65) where g t ∼ n = [ g t ; · · · ; g tn ] ∈ R n × d . Proof.
For all i ≤ n and 0 < m ≤ d , it holds that | g ( k +1) B im − ¯ g ( k +1) B m | = | n (cid:88) j =1 [ B m (( k + 1) B − k B )] ij g k B jm + v ( k +1) B im − v k B im − n n (cid:88) l =1 ( n (cid:88) j =1 [ B m (( k + 1) B − k B )] lj g k B jm + v ( k +1) B lm − v k B lm ) | . (66)Denoting g ( k +1) B : m = [ g ( k +1) B m , · · · , g ( k +1) B nm ] T and v ( k +1) B : m = [ v ( k +1) B m , · · · , v ( k +1) B nm ], n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | = (cid:107) g ( k +1) B : m − ¯ g ( k +1) B m n (cid:107) = n (cid:88) i =1 | n (cid:88) j =1 [ B m (( k + 1) B − k B )] ij g k B jm + v ( k +1) B im − v k B im − n n (cid:88) l =1 ( n (cid:88) j =1 [ B m (( k + 1) B − k B )] lj g k B jm + v ( k +1) B lm − v k B lm ) | = (cid:107) B m (( k + 1) B − k B ) g k B : m − ¯ g k B m n + ( v ( k +1) B : m − ¯ v ( k +1) B m n ) − ( v k B : m − ¯ v k B m n ) (cid:107) . (67)Once again applying Young’s inequality yields n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ≤ (1 + 1 − σ σ ) (cid:107) B m (( k + 1) B − k B ) g k B : m − ¯ g k B m n (cid:107) + (1 + 2 σ − σ ) (cid:107) ( v ( k +1) B : m − ¯ v ( k +1) B m n ) − ( v k B : m − ¯ v k B m n ) (cid:107) . (68)25umming up the above objects over all m , we obtain d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ≤ d (cid:88) m =1 (1 + 1 − σ σ ) (cid:107) B m (( k + 1) B − k B ) g k B : m − ¯ g k B m n (cid:107) + (1 + 2 σ − σ ) (cid:107) ( v ( k +1) B : m − ¯ v ( k +1) B m n ) − ( v k B : m − ¯ v k B m n ) (cid:107) ≤ d (cid:88) m =1 σ (cid:107) g k B : m − ¯ g tm n (cid:107) + 21 − σ (cid:107) ( v ( k +1) B : m − ¯ v ( k +1) B m n ) − ( v k B : m − ¯ v k B m n ) (cid:107) ≤ σ d (cid:88) m =1 (cid:107) g k B : m − ¯ g k B m n (cid:107) + 21 − σ (cid:107) v ( k +1) B − v k B (cid:107) (69)Taking the total expectation yields E [ d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ] ≤ σ E [ d (cid:88) m =1 (cid:107) g k B : m − ¯ g k B m n (cid:107) ]+ 21 − σ E [ (cid:107) v ( k +1) B − v k B (cid:107) ] . (70)Next, we derive an upper bound on E [ (cid:107) v ( k +1) B − v k B (cid:107) ] as E [ (cid:107) v ( k +1) B − v k B (cid:107) ] ≤ E [ (cid:107) v ( k +1) B − v k B − ( ∇ f ( x ( k +1) B ) − ∇ f ( x k B )) (cid:107) ]+ 2 E [ (cid:107)∇ f ( x ( k +1) B ) − ∇ f ( x k B ) (cid:107) ] ≤ E [ (cid:107) v ( k +1) B − ∇ f ( x ( k +1) B ) (cid:107) ]+ 2 E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] + 2 L E [ (cid:107) x ( k +1) B − x k B (cid:107) ] ≤ E [ (cid:107) v ( k +1) B − ∇ f ( x ( k +1) B ) (cid:107) ]+ 2 E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] + 2 L E [ (cid:107) z ( k +1) B − z k B (cid:107) ] , (71)where ∇ f ( x ( k +1) B ) = [ ∇ f ( x ( k +1) B ); · · · ; ∇ f n ( x ( k +1) B n )]. To proceed, let us derive an upperbound on E [ (cid:107) z ( k +1) B − z k B (cid:107) ]. First, consider each column of z ( k +1) B and z k B (i.e., z ( k +1) B : m and z k B : m ) separately and observe that (cid:107) z ( k +1) B : m − z k B : m (cid:107) = (cid:107) M m (( k + 1) B − k B ) z k B : m − α g k B : m − z k B : m (cid:107) ≤ (cid:107) z k B : m − ¯ z k B m n (cid:107) + 2 α (cid:107) g k B : m (cid:107) . (72)Then (cid:107) z ( k +1) B − z k B (cid:107) ≤ d (cid:88) m =1 (cid:107) z k B : m − ¯ z k B m n (cid:107) + 2 α (cid:107) g k B (cid:107) . (73)26o derive an upper bound on (cid:107) g k B∼ n (cid:107) , let ¯ z k B = [¯ z k B n , · · · , ¯ z k B d n ] and note that (cid:107) g k B∼ n (cid:107) = (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) + n (¯ v k B ) (cid:48) − n ( ∇ ¯ f ( x k B )) (cid:48) + n ( ∇ ¯ f ( x k B )) (cid:48) − n ( ∇ ¯ f ( x ∗ )) (cid:48) (cid:107)≤ (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) + (cid:107) v k B − ∇ f ( x k B ) (cid:107) + L (cid:107) x k B − n ( x ∗ ) (cid:48) (cid:107)≤ (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) + (cid:107) v k B − ∇ f ( x k B ) (cid:107) + L (cid:107) x k B − n (¯ z k B ) (cid:48) + n (¯ z k B ) (cid:48) − n ( x ∗ ) (cid:48) (cid:107)≤ (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) + (cid:107) v k B − ∇ f ( x k B ) (cid:107) + L (cid:107) z k B − n (¯ z k B ) (cid:48) (cid:107) + √ nL (cid:107) ¯ z k B − x ∗ (cid:107) . (74)Squaring both sides of the above inequality yields (cid:107) g k B∼ n (cid:107) ≤ L (cid:107) z k B − n (¯ z k B ) (cid:48) (cid:107) + 8 nL (cid:107) ¯ z k B − x ∗ (cid:107) + 4 (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) + 4 (cid:107) v k B − ∇ f ( x k B ) (cid:107) . (75)Imposing 0 < α < √ L , E [ (cid:107) z ( k +1) B − z k B (cid:107) ] ≤ . E [ (cid:107) z k B − n (¯ z k B ) (cid:48) (cid:107) ] + 0 . E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 8 α E [ (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) ] + 8 α E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] . (76)Then E [ (cid:107) v ( k +1) B − v k B (cid:107) ] ≤ . L E [ (cid:107) z k B − n (¯ z k B ) (cid:48) (cid:107) ] + L E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 16 α L E [ (cid:107) g k B∼ n − n (¯ g k B ) (cid:48) (cid:107) ] + 2 . E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ]+ 2 E [ (cid:107) v ( k +1) B − ∇ f ( x ( k +1) B ) (cid:107) ] . (77)The proof is completed by combining (70). (cid:4) The gradient estimate error, E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ], appearing on the right-hand side ofinequalities in Lemma 3.1 and 3.2, is analyzed in the following lemma. Lemma . . Suppose the objective function f is µ -strongly-convex, and let τ , ¯ τ be defined asabove. Then ∀ k ≥ , E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] ≤ L n (cid:88) i =1 E [ (cid:107) x k B i − ¯ z k B (cid:107) ] + 4 L E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 L n (cid:88) i =1 E [ (cid:107) τ k B i − ¯ τ k B (cid:107) ] + 4 L E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (78)27 roof. For all i ≤ n , it holds that E [ (cid:107) v k B i − ∇ f i ( x k B i ) (cid:107) |F k B ] = E [ (cid:107)∇ f i,l k B i ( x k B i ) − f i,l k B i ( τ k B i ) − ( ∇ f i ( x k B i ) − ∇ f i ( τ k B i )) (cid:107) |F k B ] ≤ E [ (cid:107)∇ f i,l k B i ( x k B i ) − f i,l k B i ( τ k B i ) (cid:107) |F k B ]= 1 m i m i (cid:88) j =1 (cid:107)∇ f i,j ( x k B i ) − ∇ f i,j ( x ∗ )+ ( ∇ f i,j ( x ∗ ) − ∇ f i,j ( τ k B i )) (cid:107) ≤ L (cid:107) x k B i − x ∗ (cid:107) + 2 L (cid:107) τ k B i − x ∗ (cid:107) ≤ L (cid:107) x k B i − ¯ z k B (cid:107) + 4 L (cid:107) ¯ z k B − x ∗ (cid:107) + 4 L (cid:107) τ k B i − ¯ τ k B (cid:107) + 4 L (cid:107) ¯ τ k B − x ∗ (cid:107) . (79)The proof of the lemma is completed by summing over i from 1 to n and taking the totalexpectation. (cid:4) Combining the results of Lemma 2, 3.1 and 3.3, we obtain the following result.
Lemma . . Suppose the objective function f is µ -strongly-convex. If < α ≤ L , then for all k ≥ it holds E [ (cid:107) v ( k +1) B i − ∇ f i ( x ( k +1) B i ) (cid:107) ] ≤ . L E [ (cid:107) x ( k +1) B i − ¯ z ( k +1) B (cid:107) ]+ 16 L α E [ (cid:107) g k B i − n ¯ g k B (cid:107) ] + 16 . L E [ (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 . L E [ (cid:107) τ k B i − ¯ τ k B (cid:107) ] + 4 . L E [ (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (80) Proof.
The proof is completed by combining Lemma 1, 2 and 3.3. (cid:4)
We can now present an argument proving Lemmas 3 and 4 in the main paper. In particular,combining Lemma 3.1 E [ (cid:107) ¯ z ( t +1) B − x ∗ (cid:107) ] ≤ L αnµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ] + (1 − µα ) E [ (cid:107) ¯ z k B − x ∗ (cid:107) ]+ α n E [ (cid:107) v k B − ∇ f ( x k B ) (cid:107) ] (81)and the results in Lemma 3.2, we obtain E [ n (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L α ( 1 µ + 4 αn ) E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ (1 − µα + 4 L α n ) E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 L α n E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 4 L α n E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (82)28y letting 0 < α ≤ µ L , we can further bound E [ n (cid:107) ¯ z ( k +1) B − x ∗ (cid:107) ] ≤ L αµ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ (1 − µα E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 4 L α n E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 4 L α n E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] , (83)which completes the proof of Lemma 3. Moreover, E [ d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ] ≤ L − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 89 L − σ E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ ( 1 + σ L α − σ ) E [ d (cid:88) m =1 n (cid:88) i =1 | g k B im − ¯ g k B m | ]+ 38 L − σ E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 38 L − σ E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (84)For 0 < α ≤ − σ √ L , we have σ + L α − σ ≤ σ ; this helps complete the proof of Lemma 4, E [ d (cid:88) m =1 n (cid:88) i =1 | g ( k +1) B im − ¯ g ( k +1) B m | ] ≤ L − σ E [ n (cid:88) i =1 (cid:107) ¯ z k B − z k B i (cid:107) ]+ 89 L − σ E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ]+ 3 + σ E [ d (cid:88) m =1 n (cid:88) i =1 (cid:107) g k B im − ¯ g k B m (cid:107) ]+ 38 L − σ E [ n (cid:88) i =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ]+ 38 L − σ E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] . (85)Using the inequalities shown in Lemma 2, 3 and 4, we can construct a dynamic system andcontinue the proof of linear convergence of Algorithm 1. To this end, we first define u k B = E [ (cid:80) ni =1 (cid:107) ¯ z k B − z k B i (cid:107) ] E [ n (cid:107) ¯ z k B − x ∗ (cid:107) ] E [ (cid:80) dm =1 (cid:80) ni =1 (cid:107) g k B im − ¯ g k B m (cid:107) L ] (86)˜ u k B = E [ (cid:80) ni =1 (cid:107) τ k B i − ¯ τ k B (cid:107) ] E [ n (cid:107) ¯ τ k B − x ∗ (cid:107) ] (87)29 α = σ α L − σ L αµ − µα − σ − σ σ (88) H α = L α n L α n − σ − σ (89)and then formally state the dynamic system in Proposition 1. Proposition 2.
Suppose Assumption 1 holds, the objective function f is µ -strongly-convex andeach component of the local objective function f i,j is L -smooth. If ≤ α ≤ µ (1 − σ )14 √ L , then for any k ≥ u ( k +1) B ≤ J α u k B + H α ˜ u k B . (90)It follows that for the inner loop, for all k ∈ [ sT, ( s + 1) T − u ( k +1) B ≤ J α u k B + H α u sT . (91)For the outer loop, for all s ≥
0, it holds u ( s +1) T ≤ ( J Tα + T − (cid:88) l =0 J lα H α ) u sT . (92)To guarantee linear decay of the outer loop sequence and ultimately show linear convergenceof Algorithm 1, we require that the spectral radius of J Tα + (cid:80) T − l =0 J lα H α is small. In Lemma 5,we compute the range of the step size, α , such that the weighted matrix norms of both J Tα and (cid:80) T − l =0 J lα H α are small. Lemma . Suppose Assumption 1 holds and assume that < α ≤ (1 − σ )
187 ˜ QL , where ˜ Q = Lµ . Then, ρ ( J α ) < (cid:107)| J α |(cid:107) δ ∞ < − µα , (93) and (cid:107)| T − (cid:88) l =0 J lα H α |(cid:107) q ∞ ≤ (cid:107)| ( I − J α ) − H α |(cid:107) q ∞ < . , (94) where δ = (cid:104) , Q , Q (1 − σ ) (cid:105) and q = [1 , , − σ ) ] .Proof. Following Lemma 10 in [41], consider a matrix A ∈ R d × d and a positive vector x ∈ R d ,and note that if A x ≤ β x for β > ρ ( A ) ≤ (cid:107)| A |(cid:107) x ∞ ≤ β . Using this lemma wesolve for a range of α and a positive vector δ ∈ R such that J α δ ≤ (1 − µα δ, (95)which is equivalent to the element-wise inequalities1 + σ δ + 2 α L − σ δ ≤ (1 − µα δ L αµ δ + (1 − µα δ ≤ (1 − µα δ − σ δ + 891 − σ δ + 3 + σ δ ≤ (1 − µα δ . (96)30o solve for a meaningful δ , we set δ = 1 and δ = 8 ˜ Q ; then δ = (cid:104) , Q , Q (1 − σ ) (cid:105) and0 < α ≤ (1 − σ )
187 ˜ QL are sufficient to satisfy the first inequality. Since J α is non-negative, (cid:80) T − l =0 J lα ≤ (cid:80) ∞ l =0 J lα = ( I − J α ) − ; this yields u ( s +1) T ≤ ( J Tα + ( I − J α ) − H α ) u sT , (97)where I − J α = − σ − α L − σ − L αµ µα − − σ − − σ − σ (98)and its determinant isdet( I − J α ) = (1 − σ ) µα − L α µ (1 − σ ) − α µL (1 − σ ) . (99)When 0 < α ≤ (1 − σ )
187 ˜ QL , det( I − J α ) ≥ (1 − σ ) µα , and[adj( I − J α )] , = 178 L α (1 − σ ) , [adj( I − J α )] , = µL α (1 − σ ) , [adj( I − J α )] , ≤ (1 − σ ) , [adj( I − J α )] , = 4 L α µ (1 − σ ) , [adj( I − J α )] , = 44 . , [adj( I − J α )] , = µα (1 − σ )4 . Next, we derive a matrix upper-bounding (element-wise) ( I − J α ) − H α = adj( I − J α )det( I − J α ) H α . For0 ≤ α ≤ (1 − σ )
187 ˜ QL , ( I − J α ) − H α ≤ .
039 0 .
039 00 .
23 0 .
23 0 − σ ) − σ ) . (100)If q = [1 , , − σ ) ], we have (( I − J α ) − H α ) q ≤ . q . (101)Finally, invoking the definition of the weighted matrix norm, we complete the proof of the secondinequality in the lemma. (cid:4)(cid:4)