[PDF] A Hybrid Variance-Reduced Method for Decentralized Stochastic Non-Convex Optimization

Abstract

This paper considers decentralized stochastic optimization over a network of n nodes, where each node possesses a smooth non-convex local cost function and the goal of the networked nodes is to find an \epsilon-accurate first-order stationary point of the sum of the local costs. We focus on an online setting, where each node accesses its local cost only by means of a stochastic first-order oracle that returns a noisy version of the exact gradient. In this context, we propose a novel single-loop decentralized hybrid variance-reduced stochastic gradient method, called GT-HSGD, that outperforms the existing approaches in terms of both the oracle complexity and practical implementation. The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient. Remarkably, GT-HSGD achieves a network topology-independent oracle complexity of O(n^{-1}\epsilon^{-3}) when the required error tolerance \epsilon is small enough, leading to a linear speedup with respect to the centralized optimal online variance-reduced approaches that operate on a single node. Numerical experiments are provided to illustrate our main technical results.

Full PDF

AA hybrid variance-reduced method for decentralized stochasticnon-convex optimization ∗ Ran Xin † Usman A. Khan ‡ Soummya Kar † February 16, 2021

Abstract

GT-HSGD algorithm implementsspecialized local hybrid stochastic gradient estimators that are fused over the network to track the globalgradient. Remarkably,

GT-HSGD achieves a network-independent oracle complexity of O ( n − (cid:15) − ) whenthe required error tolerance (cid:15) is small enough, leading to a linear speedup with respect to the centralizedoptimal online variance-reduced approaches that operate on a single node. Numerical experiments areprovided to illustrate our main technical results. Keywords: decentralized optimization, stochastic non-convex optimization, variance reduction.

We consider n nodes, such as machines or edge devices, communicating over a decentralized network describedby a directed graph G = ( V , E ), where V = { , · · · , n } is the set of node indices and E ⊆ V ×V is the collectionof ordered pairs ( i, j ), i, j ∈ V , such that node j sends information to node i . Each node i possesses a privatelocal cost function f i : R p → R and the goal of the networked nodes is to solve, via local computation andcommunication, the following optimization problem:min x ∈ R p F ( x ) = 1 n n (cid:88) i =1 f i ( x ) . This canonical formulation is known as decentralized optimization [1–4] that has emerged as a promisingframework for large-scale data science and machine learning problems [5, 6]. Decentralized optimization isessential in scenarios where data is geographically distributed and/or centralized data processing is infeasibledue to communication and computation overhead or data privacy concerns. In this paper, we focus on an online and non-convex setting. In particular, we assume that each local cost f i is non-convex and each node i only accesses f i by querying a local stochastic ﬁrst-order oracle ( SFO ) [7] that returns a stochastic gradient,i.e., a noisy version of the exact gradient, at the queried point. As a concrete example of practical interest,the SFO mechanism applies to many online learning and expected risk minimization problems where the noise ∗ The work of RX and SK has been partially supported by NSF under award † Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Emails: { ranx,soummyak } @andrew.cmu.edu . ‡ Department of Electrical and Computer Engineering, Tufts University, Medford, MA 02155, USA; Email: [email protected] . a r X i v : . [ m a t h . O C ] F e b n SFO lies in the uncertainty of sampling from the underlying streaming data received at each node [3, 4].We are interested in the oracle complexity, i.e., the total number of queries to

SFO required at each node, toﬁnd an (cid:15) -accurate ﬁrst-order stationary point x ∗ of the global cost F such that E [ (cid:107)∇ F ( x ∗ ) (cid:107) ] ≤ (cid:15) . We now brieﬂy review the literature of decentralized non-convex optimization with

SFO , which has beenwidely studied recently. Perhaps the most well-known approach is the decentralized stochastic gradient de-scent (

DSGD ) and its variants [3–5,8,9], which combine average consensus and a local stochastic gradient step.Although being simple and eﬀective,

DSGD is known to have diﬃculties in handling heterogeneous data [10].Recent works [11–14] achieve robustness to heterogeneous environments by leveraging certain decentralizedbias-correction techniques such as

EXTRA (type) [15–18], gradient tracking [19–25], and primal-dual princi-ples [16, 26–28]. Built on top of these bias-correction techniques, very recent works [29] and [30] propose

D-GET and

D-SPIDER-SFO respectively that further incorporate online

SARAH / SPIDER -type variance reduc-tion schemes [31–33] to achieve lower oracle complexities, when the

SFO satisﬁes a mean-squared smoothnessproperty. Finally, we note that the family of decentralized variance reduced methods has been signiﬁ-cantly enriched recently, see, for instance, [10, 34–41]; however, these approaches are explicitly designedfor empirical minimization where each local cost f i is decomposed as a ﬁnite-sum of component functions,i.e., f i = m (cid:80) mr =1 f i,r ; it is therefore unclear whether these algorithms can be adapted to the online SFO set-ting, which is the focus of this paper.

In this paper, we propose

GT-HSGD , a novel online variance-reduced method for decentralized non-convex op-timization with stochastic ﬁrst-order oracles (

SFO ). To achieve fast and robust performance, the

GT-HSGD al-gorithm is built upon global gradient tracking [19,20] and a local hybrid stochastic gradient estimator [42–44]that can be considered as a convex combination of the vanilla stochastic gradient returned by the

SFO anda

SARAH -type variance-reduced stochastic gradient [45]. In the following, we emphasize the key advantagesof

GT-HSGD compared with the existing decentralized online (variance-reduced) approaches, from both the-oretical and practical aspects.

Improved oracle complexity . A comparison of the oracle complexity of

GT-HSGD with related algo-rithms is provided in Table 1, from which we have the following important observations. First of all, the oraclecomplexity of

GT-HSGD is lower than that of

DSGD , D2 , GT-DSGD and

D-PD-SGD , which are decentralized onlinealgorithms without variance reduction; however,

GT-HSGD imposes on the

SFO an additional mean-squaredsmoothness (MSS) assumption that is required by all online variance-reduced techniques [29–33,42–44,46,47]in the literature. Secondly,

GT-HSGD further achieves a lower oracle complexity than the existing decentral-ized online variance-reduced methods

D-GET [29] and

D-SPIDER-SFO [30], especially in a regime where therequired error tolerance (cid:15) and the network spectral gap 1 − λ are relatively small. Moreover, when (cid:15) issmall enough such that (cid:15) (cid:46) min (cid:8) λ − (1 − λ ) n − , λ − (1 − λ ) . n − (cid:9) , it can be veriﬁed that the oraclecomplexity of GT-HSGD reduces to O ( n − (cid:15) − ), independent of the network topology, and GT-HSGD achieves alinear speedup, in terms of the scaling with the network size n , compared with the centralized optimal onlinevariance-reduced approaches that operate on a single node [31–33, 42, 43, 47]; see Section 3 for a detaileddiscussion. In sharp contrast, D-GET [29] and

D-SPIDER-SFO [30] exhibit no speed up compared with theaforementioned centralized optimal methods even if the network is fully connected, i.e., λ = 0. More practical implementation . Both

D-GET [29] and

D-SPIDER-SFO [30] are double-loop algorithmsthat require very large minibatch sizes. In particular, during each inner loop they execute a ﬁxed number ofminibatch stochastic gradient type iterations with O ( (cid:15) − ) oracle queries per update per node, while at everyouter loop they obtain a stochastic gradient with mega minibatch size by O ( (cid:15) − ) oracle queries at each node.Clearly, querying the oracle exceedingly, i.e., obtaining a large amount of samples from the underlying datastream, at each node and every iteration in online steaming scenarios substantially jeopardizes the actual wall-clock time. This is because the next iteration cannot be performed until all nodes complete the samplingprocess. Moreover, the double-loop implementation may incur periodic network synchronization. Theseissues are especially signiﬁcant when the working environments of the nodes are heterogeneous. Conversely,the proposed GT-HSGD is a single-loop algorithm with O (1) oracle queries per update and only requires a large2inibatch size with O ( (cid:15) − ) oracle queries once in the initialization phase , i.e., before the update recursionis executed; see Algorithm 1 and Corollary 1.Table 1: A comparison of the oracle complexities of decentralized stochastic gradient methods. The oraclecomplexity is in terms of the total number of queries to SFO required at each node to obtain an (cid:15) -accuratestationary point x ∗ of the global cost F such that E [ (cid:107)∇ F ( x ∗ ) (cid:107) ] ≤ (cid:15) . In the table, n is the number of thenodes and (1 − λ ) ∈ (0 ,

1] is the spectral gap of the weight matrix associated with the network. The MSScolumn indicates whether the algorithm in question requires the mean-squared smoothness assumption on the

SFO . We note that

DSGD requires bounded heterogeneity such that sup x n (cid:80) ni =1 (cid:107)∇ f i ( x ) − ∇ F ( x ) (cid:107) ≤ ζ , forsome ζ ∈ R + , while other algorithms in the table do not need this assumption. Centralized optimal methodsthat implement on a single node are used as the benchmark. Algorithm Oracle Complexity MSS Remarks

DSGD [5] O (cid:18) max (cid:26) n(cid:15) , λ n (1 − λ ) (cid:15) (cid:27)(cid:19) (cid:55) bounded heterogeneity D2 [11] O (cid:18) max (cid:26) − λ ) a n(cid:15) , n (1 − λ ) b (cid:15) (cid:27)(cid:19) (cid:55) a ≥ , b ∈ R + are notexplicitly shown in [11] GT-DSGD [13] O (cid:18) max (cid:26) n(cid:15) , λ n (1 − λ ) (cid:15) (cid:27)(cid:19) (cid:55) D-PD-SGD [14] O (cid:18) max (cid:26) n(cid:15) , n (1 − λ ) c (cid:15) (cid:27)(cid:19) (cid:55) c ∈ R + is not explicitlyshown in [14] D-GET [29] O (cid:18) − λ ) d (cid:15) (cid:19) (cid:51) d ∈ R + is not explicitlyshown in [29] D-SPIDER-SFO [30] O (cid:18) − λ ) h (cid:15) (cid:19) (cid:51) h ∈ R + is not explicitlyshown in [30] GT-HSGD ( this work ) O (cid:18) max (cid:26) n(cid:15) , λ (1 − λ ) (cid:15) , λ . n . (1 − λ ) . (cid:15) . (cid:27)(cid:19) (cid:51) Centralizedoptimal methods[31–33, 42, 43, 47] O (cid:18) (cid:15) (cid:19) (cid:51) implementation ona single node Roadmap and notations . The set of positive real numbers and positive integers are denoted by R + and Z + , respectively. For a ∈ R , we denote (cid:100) a (cid:101) as the smallest integer that is not less than a . Weuse lowercase bold letters to denote vectors and uppercase bold letters to denote matrices. The matrix I d represents the d × d identity; d and d are the d -dimensional column vectors of all ones and zeros, respectively.We denote [ x ] i as the i -th entry of a vector x . The Kronecker product of two matrices A and B is denotedby A ⊗ B . We use (cid:107) · (cid:107) to denote the Euclidean norm of a vector or the spectral norm of a matrix. We use σ ( · ) to denote the σ -algebra generated by the sets and/or random vectors in its argument.The rest of the paper is organized as follows. Section 2 states the mathematical formulation of theproblem and develops the proposed GT-HSGD algorithm. Section 3 presents the main convergence resultsof

GT-HSGD in this paper and discusses their implications. Section 4 outlines the proofs of the main results.Section 5 provides numerical experiments to verify our theoretical claims. Section 6 concludes the paper.The detailed proofs are presented in the appendix.

GT-HSGD algorithm

In this section, we introduce the mathematical model of the stochastic ﬁrst-order oracle (

SFO ) at each nodeand the communication network. Based on this model, we develop the proposed

GT-HSGD algorithm.3 .1 Stochastic ﬁrst-order oracle and network model

We work with a rich enough probability space { Ω , P , F} . We consider decentralized recursive algorithms ofinterest that generate a sequence of estimates { x it } t ≥ of the ﬁrst-order stationary points of F at each node i ,where x i is assumed constant. At each iteration t , each node i observes a random vector ξ it in R q , which, forinstance, may be considered as noise or as an online data sample. We then introduce the natural ﬁltration(an increasing family of sub- σ -algebras of F ) induced by these random vectors observed sequentially by thenetworked nodes: F := { Ω , φ } , F t := σ (cid:0)(cid:8) ξ i , ξ i , · · · , ξ it − : i ∈ V (cid:9)(cid:1) , ∀ t ≥ . (1)where φ is the empty set. We are now ready to deﬁne the SFO mechanism in the following. At each iteration t ,each node i , given an input random vector x ∈ R p that is F t -measurable, is able to query the local SFO thatreturns a stochastic gradient of the form g i ( x , ξ it ), where g i : R p × R q → R p is a Borel measurable function.We assume that the SFO satisﬁes the following four properties.

Assumption 1 ( Oracle ) . For any F t -measurable random vectors x , y ∈ R p , we have the following: ∀ i ∈ V and ∀ t ≥ • E (cid:2) g i ( x , ξ it ) |F t (cid:3) = ∇ f i ( x ); • E (cid:2) (cid:107) g i ( x , ξ it ) − ∇ f i ( x ) (cid:107) (cid:3) ≤ ν i , ν := n (cid:80) ni =1 ν i ; • the family (cid:8) ξ it : ∀ t ≥ , i ∈ V (cid:9) of random vectors is independent; • E (cid:2) (cid:107) g i ( x , ξ it ) − g i ( y , ξ it ) (cid:107) (cid:3) ≤ L E (cid:2) (cid:107) x − y (cid:107) (cid:3) .The ﬁrst three properties above are standard and commonly used to establish the convergence of decentral-ized stochastic gradient algorithms. They however do not explicitly impose any structures on the stochasticgradient mapping g i as long as g i is measurable. On the other hand, the last property (the mean-squaredsmoothness), roughly speaking, requires that g i is L -smooth on average with respect to the input arguments x and y . As a simple example, Assumption 1 holds if f i ( x ) = x (cid:62) Qx and g i ( x , ξ ) = Q i x + ξ , where Q i is aconstant matrix and ξ has zero mean and ﬁnite second moment. We further note that the mean-squaredsmoothness of each g i implies, by Jensen’s inequality, that each f i is L -smooth, i.e., (cid:107)∇ f i ( x ) − ∇ f i ( y ) (cid:107) ≤ L (cid:107) x − y (cid:107) , and consequently the global cost F is also L -smooth. In addition, we make the following assump-tions on F and the communication network G . Assumption 2 ( Global Function ) . F is bounded below, i.e., F ∗ := inf x ∈ R p F ( x ) > −∞ . Assumption 3 ( Communication Network ) . The directed network G admits a primitive and doubly-stochastic weight matrix W = { w ij } ∈ R n × n . Hence, W1 n = W (cid:62) n = n and λ := (cid:107) W − J (cid:107) ∈ [0 , W that satisﬁes Assumption 3 may be designed for strongly-connected weight-balanceddirected graphs (and thus for arbitrary connected undirected graphs). For example, the family of directedexponential graphs is weight-balanced and plays a key role in decentralized training [6]. We note that λ isknown as the second largest singular value of W and measures the algebraic connectivity of the graph, i.e.,a smaller value of λ roughly means a better connectivity. We note that several existing approaches requirestrictly stronger assumptions on W . For instance, D2 [11] and D-PD-SGD [14] require W to be symmetricand hence are restricted to undirected networks. We now describe the proposed

GT-HSGD algorithm and provide an intuitive construction. Recall that x it is theestimate of the stationary point of the global cost F at node i and iteration t . Let g i ( x it , ξ it ) and g i ( x it − , ξ it )be the corresponding stochastic gradients returned by the local SFO queried at x it and x it − respectively.Motivated by the strong performance of recently introduced decentralized methods that combine gradienttracking and various variance reduction schemes for ﬁnite-sum problems [29, 37, 39, 40], we seek similar4ariance reduction for decentralized online problems with SFO . In particular, we focus on the following local hybrid stochastic gradient estimator v it introduced in [42–44] for centralized online problems: ∀ t ≥ v it = g i ( x it , ξ it ) + (1 − β ) (cid:0) v it − − g i ( x it − , ξ it ) (cid:1) , (2)for some applicable weight parameter β ∈ [0 , v it is fused, via a gradienttracking mechanism [19, 20], over the network to update the global gradient tracker y it , which is subse-quently used as the descent direction in the x it -update. The complete description of GT-HSGD is provided inAlgorithm 1. We note that the update (2) of v it may be equivalently written as v it = β g i ( x it , ξ it ) (cid:124) (cid:123)(cid:122) (cid:125) Stochastic gradient + (1 − β ) (cid:0) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + v it − (cid:1)(cid:124) (cid:123)(cid:122) (cid:125) SARAH , which is a convex combination of the local vanilla stochastic gradient returned by the SFO and a

SARAH -type [31, 32, 45] gradient estimator. This discussion leads to the fact that

GT-HSGD reduces to

GT-DSGD [12,13, 21] when β = 1, and becomes the inner loop of GT-SARAH [39] when β = 0. However, our convergenceanalysis shows that GT-HSGD achieves its best oracle complexity and outperforms the existing decentralizedonline variance-reduced approaches [29, 30] with a weight parameter β ∈ (0 , GT-DSGD nor the inner loop of

GT-SARAH , on their own, are able to outperform the proposed approach,making

GT-HSGD a non-trivial algorithmic design for this problem class.

Algorithm 1

GT-HSGD at each node i Require: x i = x ∈ R p ; α ∈ R + ; β ∈ (0 , b ∈ Z + ; { w ij } nj =1 ; y i = p ; v i − = p ; T ∈ Z + . Sample { ξ i ,r } b r =1 and initialize the gradient estimator: v i = b (cid:80) b r =1 g i ( x i , ξ i ,r ); Initialize the local gradient tracker: y i = (cid:80) nj =1 w ij (cid:0) y j + v j − v j − (cid:1) ; Initialize the local estimate of the solution: x i = (cid:80) nj =1 w ij (cid:0) x j − α y j (cid:1) ; for t = 1 , , · · · , T − do Sample ξ it and update the local stochastic gradient estimator: v it = g i ( x it , ξ it ) + (1 − β ) (cid:0) v it − − g i ( x it − , ξ it ) (cid:1) ; Update the local gradient tracker: y it +1 = n (cid:88) j =1 w ij (cid:16) y jt + v jt − v jt − (cid:17) ; Update the local estimate of the solution: x it +1 = n (cid:88) j =1 w ij (cid:16) x jt − α y jt +1 (cid:17) ; end forreturn (cid:101) x T selected uniformly at random from { x it : 0 ≤ t ≤ T, i ∈ V} . Remark 1.

Clearly, each v it is a conditionally biased estimator of ∇ f i ( x it ) , i.e., E [ v it |F t ] (cid:54) = ∇ f i ( x it ) ingeneral. However, it can be shown that E [ v it ] = E [ ∇ f i ( x it )] , meaning that v it serves as a surrogate for theunderlying exact gradient in the sense of total expectation. In this section, we present the main convergence results of

GT-HSGD in this paper and discuss their salientfeatures. The formal convergence analysis is deferred to Section 4.5 heorem 1.

If the weight parameter β = L α n and the step-size α is chosen as < α < min (cid:26) (1 − λ ) λ , (cid:112) n (1 − λ )26 λ , √ (cid:27) L , then the output (cid:101) x T of GT-HSGD satisﬁes: ∀ T ≥ , E (cid:104) (cid:107)∇ F ( (cid:101) x T ) (cid:107) (cid:105) ≤ F ( x ) − F ∗ ) αT + 8 βν n + 4 ν βb nT + 64 λ (1 − λ ) T (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + 96 λ ν (1 − λ ) b T + 256 λ β ν (1 − λ ) , where (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) = (cid:80) ni =1 (cid:107)∇ f i (cid:0) x (cid:1) (cid:107) . Remark 2.

Theorem 1 holds for

GT-HSGD with arbitrary initial minibatch size b ≥ Transient and steady-state performance over inﬁnite time horizon. If α and β are chosen accordingto Theorem 1, the mean-squared stationary gap E (cid:2) (cid:107)∇ F ( (cid:101) x T ) (cid:107) (cid:3) of GT-HSGD decays sublinearly at a rateof O (1 /T ) up to a steady-state error (SSE) such thatlim sup T →∞ E (cid:104) (cid:107)∇ F ( (cid:101) x T ) (cid:107) (cid:105) ≤ βν n + 256 λ β ν (1 − λ ) . (3)In view of (3), the SSE of GT-HSGD is bounded by the sum of two terms: (i) the ﬁrst term is in the orderof O ( β ) and the division by n demonstrates the beneﬁt of increasing the network size ; (ii) the second termis in the order of O ( β ) and reveals the impact of the spectral gap 1 − λ of the network topology. Clearly,the SSE can be made arbitrarily small by choosing small enough β and α (note that β = 48 L α n − inTheorem 1). Moreover, since the spectral gap 1 − λ only appears in a higher order term of β in (3), itsimpact reduces as β becomes smaller, i.e., as we require a smaller SSE.The following corollary is concerned with the non-asymptotic mean-squared convergence rate of GT-HSGD with speciﬁc choices of the algorithmic parameters α, β and b . Corollary 1.

Setting α = n / LT / , β = n / T / and b = (cid:100) T / n / (cid:101) in Theorem 1, we have: E (cid:104) (cid:107)∇ F ( (cid:101) x T ) (cid:107) (cid:105) ≤ L ( F ( x ) − F ∗ ) + 12 ν ( nT ) / + 64 λ (1 − λ ) T (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + 240 λ n / ν (1 − λ ) T / , for all T ≥ max (cid:8) λ n (1 − λ ) , λ n . (1 − λ ) . (cid:9) . As a consequence, GT-HSGD achieves an (cid:15) -accurate stationary point x ∗ of the global cost F such that E [ (cid:107)∇ F ( x ∗ ) (cid:107) ] ≤ (cid:15) with H = O (max {H opt , H net } ) iterations, where H opt and H net are given respectively by H opt = ( L ( F ( x ) − F ∗ ) + ν ) . n(cid:15) , H net = max (cid:26) λ (1 − λ ) (cid:15) (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n , λ . n . ν . (1 − λ ) . (cid:15) . (cid:27) . The resulting total number of oracle queries at each node is thus (cid:100)H + H / n − / (cid:101) . Remark 3.

Since H / n − / is much smaller than H , we treat the oracle complexity of GT-HSGD as H forthe ease of exposition in Table 1 and the following discussion. An important implication of Corollary 1 is given in the following.

A regime for network-independent oracle complexity and linear speedup.

According to Corol-lary 1, the oracle complexity of

GT-HSGD at each node is upper bounded by the maximum of two terms: (i) Since

GT-HSGD computes O ( n ) stochastic gradients in parallel per iteration across the nodes, the network size n can beinterpreted as the minibatch size of GT-HSGD . The O ( · ) notation here does not absorb any problem parameters, i.e., it only hides universal constants. H opt is independent of the network topology and, more importantly, is n times smaller than theoracle complexity of the optimal centralized online variance-reduced methods [31–33, 42, 43] that execute ona single node for this problem class; (ii) the second term H net depends on the network spectral gap 1 − λ andis in the lower order of 1 /(cid:15) . These two observations lead to the interesting fact that the oracle complexityof GT-HSGD becomes independent of the network topology, i.e., H opt dominates H net , if the required errortolerance (cid:15) is small enough such that (cid:15) (cid:46) min (cid:8) λ − (1 − λ ) n − , λ − (1 − λ ) . n − (cid:9) . In this regime, GT-HSGD therefore achieves a network-independent oracle complexity H opt = O ( n − (cid:15) − ), exhibiting a linear speed upcompared with the aforementioned optimal centralized algorithms [31–33, 42, 43, 47] in the sense that thetotal number of oracle queries at each node is reduced by a factor of 1 /n . Remark 4.

The small error tolerance (cid:15) regime in the above discussion corresponds to a large number oforacle queries, which translates to the scenario where the required total number of iterations T is large. Notethat a large T further implies that the step-size α and the weight parameter β are small; see the expressionof α and β in Corollary 1. In this section, we outlines the proof of Theorem 1. We let Assumptions 1-3 hold throughout the restof the paper without explicitly stating them. For the ease of exposition, we write the x t - and y t -updateof GT-HSGD in the following equivalent matrix form: ∀ t ≥ y t +1 = W ( y t + v t − v t − ) , (4a) x t +1 = W ( x t − α y t +1 ) , (4b)where W := W ⊗ I p and x t , y t , v t are square-integrable random vectors in R np that respectively concatenatethe local estimates of the solution { x it } ni =1 , gradient trackers { y it } ni =1 , stochastic gradient estimators { v it } ni =1 .It is straightforward to verify that x t and y t are F t -measurable while v t is F t +1 -measurable for all t ≥

0. Forconvenience, we also denote ∇ f ( x t ) := (cid:2) ∇ f ( x t ) (cid:62) , · · · , ∇ f n ( x nt ) (cid:62) (cid:3) (cid:62) and introduce the following quantities: x t := 1 n ( (cid:62) n ⊗ I p ) x t , y t := 1 n ( (cid:62) n ⊗ I p ) y t , v t := 1 n ( (cid:62) n ⊗ I p ) v t , ∇ f ( x t ) := 1 n ( (cid:62) n ⊗ I p ) ∇ f ( x t ) . In the following lemma, we enlist several well-known results in the context of gradient tracking-basedalgorithms for decentralized stochastic optimization, whose proofs may be found in [19, 24, 48].

Lemma 1.

The following relationships hold.(a) (cid:107) Wx − Jx (cid:107) ≤ λ (cid:107) x − Jx (cid:107) , ∀ x ∈ R np .(b) y t +1 = v t , ∀ t ≥ .(c) (cid:13)(cid:13) ∇ f ( x t ) − ∇ F ( x t ) (cid:13)(cid:13) ≤ L n (cid:107) x t − Jx t (cid:107) , ∀ t ≥ . We note that Lemma 1(a) holds since W is primitive and doubly-stochastic, Lemma 1(b) is a directconsequence of the gradient tracking update (4a) and Lemma 1(c) is due to the L -smoothness of each f i .By the estimate update of GT-HSGD described in (4b) and Lemma 1(b), we have: x t +1 = x t − α y t +1 = x t − α v t , ∀ t ≥ . (5)Hence, the mean state x t proceeds in the direction of the average of local stochastic gradient estimators v t .With the help of (5) and the L -smoothness of F and each f i , we establish the following descent inequalitywhich sheds light on the overall convergence analysis. This boundary condition follows from basic algebraic manipulations. emma 2. If < α ≤ L , then we have: ∀ T ≥ , T (cid:88) t =0 (cid:13)(cid:13) ∇ F ( x t ) (cid:13)(cid:13) ≤ F ( x ) − F ∗ ) α − T (cid:88) t =0 (cid:107) v t (cid:107) + 2 T (cid:88) t =0 (cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) + 2 L n T (cid:88) t =0 (cid:107) x t − Jx t (cid:107) . Proof.

See Appendix A.

Remark 5.

Compared with the corresponding descent inequalities for decentralized non-convex optimizationalgorithms with conditionally unbiased stochastic gradient estimators, e.g., [5,13,40], Lemma 2 loses an O ( α ) factor on the coeﬃcient of the variance term (cid:80) t (cid:107) v t − ∇ f ( x t ) (cid:107) since v t is conditionally biased. However,as we shall see later, the gradient variance (cid:107) v t − ∇ f ( x t ) (cid:107) is subject to certain desirable recursive structuresand, consequently, can be properly controlled. In light of Lemma 2, our approach to establishing the convergence of

GT-HSGD is to seek the conditionson the algorithmic parameters of

GT-HSGD , i.e., the step-size α and the weight parameter β such that − T T (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 2 T T (cid:88) t =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) + 2 L nT T (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) = O (cid:18) α, β, b , T (cid:19) , (6)where O ( α, β, /b , /T ) represents a nonnegative quantity which may be made arbitrarily small by choosingsmall enough α and β along with large enough T and b . If (6) holds, then Lemma 2 reduces to1 T + 1 T (cid:88) t =0 E (cid:2) (cid:107)∇ F ( x t ) (cid:107) (cid:3) ≤ F ( x ) − F ∗ ) αT + O (cid:18) α, β, b , T (cid:19) , which leads to the convergence arguments of GT-HSGD . To this aim, we quantify (cid:80) Tt =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) and (cid:80) Tt =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) next. First of all, we establish upper bounds on the gradient variances E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) and E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) by exploiting the hybrid and recursive update of v t . Lemma 3.

The following inequalities hold: ∀ t ≥ , E (cid:104)(cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) (cid:105) ≤ (1 − β ) E (cid:104)(cid:13)(cid:13) v t − − ∇ f ( x t − ) (cid:13)(cid:13) (cid:105) + 6 L α n (1 − β ) E (cid:104) (cid:107) v t − (cid:107) (cid:105) + 2 β ν n + 6 L n (1 − β ) E (cid:104) (cid:107) x t − Jx t (cid:107) + (cid:107) x t − − Jx t − (cid:107) (cid:105) (7) E (cid:104) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:105) ≤ (1 − β ) E (cid:104) (cid:107) v t − − ∇ f ( x t − ) (cid:107) (cid:105) + 6 nL α (1 − β ) E (cid:104) (cid:107) v t − (cid:107) (cid:105) + 2 nβ ν + 6 L (1 − β ) E (cid:104) (cid:107) x t − Jx t (cid:107) + (cid:107) x t − − Jx t − (cid:107) (cid:105) . (8) Proof.

See Appendix B.

Remark 6.

Since v t is a conditionally biased estimator of ∇ f ( x t ) , (7) and (8) do not directly imply eachother and need to be established separately. We emphasize that the recursive structure of the gradient variances shown in Lemma 3 plays a crucialrole in the convergence analysis. The following bounds on the consensus errors E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) are standardin decentralized algorithms based on gradient tracking, e.g., [21, 39]; in particular, it follows directly fromthe x t -update (4b) and Young’s inequality. Lemma 4.

The following inequalities hold: ∀ t ≥ , (cid:107) x t +1 − Jx t +1 (cid:107) ≤ λ (cid:107) x t − Jx t (cid:107) + 2 α λ − λ (cid:107) y t +1 − Jy t +1 (cid:107) . (9) (cid:107) x t +1 − Jx t +1 (cid:107) ≤ λ (cid:107) x t − Jx t (cid:107) + 2 α λ (cid:107) y t +1 − Jy t +1 (cid:107) . (10)8t is then clear from Lemma 4 that we need to further quantify the gradient tracking errors E (cid:2) (cid:107) y t − Jy t (cid:107) (cid:3) in order to bound the consensus errors. Lemma 5.

We have the following.(a) E (cid:2) (cid:107) y − Jy (cid:107) (cid:3) ≤ λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) + λ nν /b . (b) If < α ≤ − λ √ λ L , then ∀ t ≥ , E (cid:2) (cid:107) y t +1 − Jy t +1 (cid:107) (cid:3) ≤ λ E (cid:2) (cid:107) y t − Jy t (cid:107) (cid:3) + 21 λ nL α − λ E [ (cid:107) v t − (cid:107) ] + 63 λ L − λ E (cid:2) (cid:107) x t − − Jx t − (cid:107) (cid:3) + 7 λ β − λ E (cid:2) (cid:107) v t − − ∇ f ( x t − ) (cid:107) (cid:3) + 3 λ nβ ν . Proof.

See Appendix C.We note that establishing Lemma 5 requires a careful examination of the structure of the v t -update. Toproceed, we observe, from Lemma 3, 4 and 5, that the recursions of the gradient variances, consensus andgradient tracking errors admit similar forms. Therefore, we abstract out a formula for the accumulation ofthe error recursion of this type in the following lemma. Lemma 6.

Let { V t } t ≥ , { R t } t ≥ and { Q t } t ≥ be nonnegative sequences and C ≥ be some constant suchthat V t ≤ qV t − + qR t − + Q t + C , ∀ t ≥ , where q ∈ (0 , . Then the following inequality holds: ∀ T ≥ , T (cid:88) t =0 V t ≤ V − q + 11 − q T − (cid:88) t =0 R t + 11 − q T (cid:88) t =1 Q t + CT − q . (11) Similarly, if V t +1 ≤ qV t + R t − + C, ∀ t ≥ , then we have: ∀ T ≥ , T (cid:88) t =1 V t ≤ V − q + 11 − q T − (cid:88) t =0 R t + CT − q . (12) Proof.

See Appendix D.Applying Lemma 6 to Lemma 3 leads to the following upper bounds on the accumulated variances.

Lemma 7.

For any β ∈ (0 , , the following inequalities hold: ∀ T ≥ , T (cid:88) t =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) ≤ ν βb n + 6 L α nβ T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 12 L n β T (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + 2 βν Tn . (13) T (cid:88) t =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) ≤ nν βb + 6 nL α β T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 12 L β T (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + 2 nβν T. (14) Proof.

See Appendix E.Clearly, (14) in Lemma 7 can be used to reﬁne the left hand side of (6). The remaining step, naturally, isto bound (cid:80) t E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) in terms of (cid:80) t E (cid:2) (cid:107) v t (cid:107) (cid:3) . This result is provided in the following lemma thatis obtained with the help of Lemma 4, 5, 6 and 7. Lemma 8. If < α ≤ (1 − λ ) λ L and β ∈ (0 , , then the following inequality holds: ∀ T ≥ , T (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) n ≤ λ L α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 32 λ α (1 − λ ) (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + (cid:18) β − λ + 1 (cid:19) λ ν α (1 − λ ) b + (cid:18) β − λ + 3 (cid:19) λ β ν α T (1 − λ ) . Proof.

See Appendix F.Finally, we observe that Lemma 7 and 8 suﬃce to establish (6) and hence lead to Theorem 1 whosecomplete proof is presented in Appendix G. 9

Numerical experiments

In this section, we numerically illustrate our theoretical results on the convergence of

GT-HSGD with thehelp of a non-convex logistic regression model [49] for binary classiﬁcation. In particular, we consider adecentralized non-convex optimization problem of the form min x ∈ R p F ( x ) := n (cid:80) ni =1 f i ( x ) + r ( x ) , such that f i ( x ) = 1 m m (cid:88) j =1 log (cid:104) e −(cid:104) x , θ ij (cid:105) l ij (cid:105) and r ( x ) = p (cid:88) k =1 R [ x ] k x ] k , where θ i,j is the feature vector, l i,j is the corresponding binary label, and r ( x ) is a non-convex regularizer.To simulate the online SFO setting described in Section 2, each node i is only able to sample with replacement from its local data { θ i,j , l i,j } mj =1 and compute the corresponding (minibatch) stochastic gradient. Throughoutall experiments, we set the number of the nodes to n = 20 and the regularization parameter to R = 10 − . Totest the performance of related algorithms, we distribute the a9a, covertype, KDD98, MiniBooNE datasetsuniformly over the nodes and normalize the feature vectors such that (cid:107) θ i,j (cid:107) = 1 , ∀ i, j ; the statistics of thesedatasets can be found in Table 2. Moreover, we consider the following network topologies: the undirectedring graph, the undirected and directed exponential graphs, and the complete graph; see [5, 6, 22, 50] fordetails. The doubly stochastic weights are generated by the Metropolis rule for the undirected ring andexponential graphs and are set to be uniform for the directed exponential and complete graphs. The secondlargest singular values of these weight matrices are 0 . , . , . ,

0, respectively, demonstrating a signiﬁcantdiﬀerence in the algebraic connectivity of these graphs. We measure the performance of the algorithms inquestion by the decrease of the global cost function value F ( x k ), to which we refer as loss, versus epochs,where each epoch contains m Dataset train ( nm ) dimension ( p )a9a 48 ,

840 123covertype 100 ,

000 54KDD98 75 ,

000 477MiniBooNE 100 ,

000 11

Comparison with the existing decentralized online stochastic gradient methods.

First, we conducta performance comparison of

GT-HSGD with

GT-DSGD [12, 13, 21],

D-GET [29], and

D-SPIDER-SFO [30] over theundirected exponential graph of 20 nodes. Note that we use

GT-DSGD to represent methods that do notincorporate online variance reduction techniques, since it in general matches or outperforms

DSGD [5] and hasa similar performance with D2 [11] and D-PD-SGD [14]. The experimental results are provided in Fig. 1, where

GT-HSGD achieves faster convergence than the other algorithms in comparison on these four datasets. Thisobservation is coherent with our main convergence results that

GT-HSGD achieves a lower oracle complexitythan the existing approaches; see Table 1.

Network-independent convergence rate of

GT-HSGD . Next, we test the performance of

GT-HSGD overdiﬀerent network topologies and the corresponding experimental results are presented in Fig. 2. Clearly, itcan be observed that when the number of iterations is large enough, that is to say, the required error toleranceis small enough, the convergence rate of

GT-HSGD is not aﬀected by the underlying network topology. Thisinteresting phenomenon is consistent with our convergence theory; see Corollary 1 and the related discussionin Section 3.

In this paper, we investigate decentralized stochastic optimization to minimize a sum of smooth non-convexcost functions over a network of nodes. Assuming that each node has access to a stochastic ﬁrst-order10

10 20 30 40 50Epochs3.283.303.323.343.363.383.40 L o ss

1e 1 a9a

GT-DSGDD-GETD-SPIDER-SFOGT-HSGD L o ss

1e 1 covertype

GT-DSGDD-GETD-SPIDER-SFOGT-HSGD L o ss

1e 1 KDD98

GT-DSGDD-GETD-SPIDER-SFOGT-HSGD L o ss

1e 1 MiniBooNE

GT-DSGDD-GETD-SPIDER-SFOGT-HSGD

Figure 1: A comparison of

GT-HSGD with other decentralized stochastic online gradient algorithms over theundirected exponential graph of 20 nodes on the a9a, covertype, KDD98 and MiniBooNE datasets. L o ss

1e 1 a9a undirected ring graphundirected exponential graphdirected exponential graphcomplete graph L o ss

1e 1 covertype undirected ring graphundirected exponential graphdirected exponential graphcomplete graph

Figure 2: Convergence behaviors of

GT-HSGD over diﬀerent network topologies on the a9a and covertypedatasets. 11racle, we propose

GT-HSGD , a novel single-loop decentralized algorithm that leverages local hybrid variancereduction and gradient tracking to achieve provably fast convergence and robust performance. Comparedwith the existing online variance-reduced methods,

GT-HSGD achieves a lower oracle complexity with a morepractical implementation. We further show that

GT-HSGD achieves a network-independent oracle complexity,when the required error tolerance is small enough, leading to a linear speedup with respect to the centralizedoptimal methods that execute on a single node.

References [1] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochasticgradient optimization algorithms.

IEEE Trans. Autom. Control , 31(9):803–812, 1986.[2] A. Nedi´c and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization.

IEEE Trans.Autom. Control , 54(1):48, 2009.[3] S. Kar, J. M. F. Moura, and K. Ramanan. Distributed parameter estimation in sensor networks:Nonlinear observation models and imperfect communication.

IEEE Trans. Inf. Theory , 58(6):3575–3605, 2012.[4] J. Chen and A. H. Sayed. On the learning behavior of adaptive networks—part i: Transient analysis.

IEEE Transactions on Information Theory , 61(6):3487–3517, 2015.[5] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms out-perform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In

Adv. Neural Inf. Process. Syst. , pages 5330–5340, 2017.[6] M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic gradient push for distributed deep learning.In

Proceedings of the 36th International Conference on Machine Learning , pages 97: 344–353, 2019.[7] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic ap-proximation approach to stochastic programming.

SIAM J. Optim. , 19(4):1574–1609, 2009.[8] S. Vlaski and A. H. Sayed. Distributed learning in non-convex environments–Part II: Polynomial escapefrom saddle-points. arXiv:1907.01849 , 2019.[9] H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani. Quantized decentralized stochastic learning overdirected graphs. In

International Conference on Machine Learning , pages 9324–9333, 2020.[10] R. Xin, S. Kar, and U. A. Khan. Decentralized stochastic optimization and machine learning: A uniﬁedvariance-reduction framework for robust performance and fast convergence.

IEEE Signal Process. Mag. ,37(3):102–113, 2020.[11] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D : Decentralized training over decentralized data.In International Conference on Machine Learning , pages 4848–4856, 2018.[12] S. Lu, X. Zhang, H. Sun, and M. Hong. GNSD: A gradient-tracking based nonconvex stochastic algo-rithm for decentralized optimization. In , pages 315–321, 2019.[13] R. Xin, U. A. Khan, and S. Kar. An improved convergence analysis for decentralized online stochasticnon-convex optimization. arXiv preprint arXiv:2008.04195 , 2020.[14] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson. A primal-dual SGD algorithm for distributednonconvex optimization. arXiv preprint arXiv:2006.03474 , 2020.[15] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact ﬁrst-order algorithm for decentralized consensusoptimization.

SIAM J. Optim. , 25(2):944–966, 2015.[16] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independentstep-sizes and separated convergence rates.

IEEE Trans. Signal Process. , 67(17):4494–4506, 2019.1217] K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed. On the inﬂuence of bias-correction on distributedstochastic optimization.

IEEE Trans. Signal Process. , 2020.[18] H. Li and Z. Lin. Revisiting extra for smooth distributed optimization.

SIAM J. Optim. , 30(3):1795–1821, 2020.[19] P. Di Lorenzo and G. Scutari. NEXT: In-network nonconvex optimization.

IEEE Trans. Signal Inf.Process. Netw. Process. , 2(2):120–136, 2016.[20] J. Xu, S. Zhu, Y. C. Soh, and L. Xie. Augmented distributed gradient methods for multi-agent opti-mization under uncoordinated constant stepsizes. In

Proc. IEEE Conf. Decis. Control , pages 2055–2060,2015.[21] S. Pu and A. Nedich. Distributed stochastic gradient tracking methods.

Math. Program. , pages 1–49,2020.[22] R. Xin, S. Pu, A. Nedi´c, and U. A. Khan. A general framework for decentralized optimization withﬁrst-order methods.

Proceedings of the IEEE , 108(11):1869–1889, 2020.[23] A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimizationover time-varying graphs.

SIAM J. Optim. , 27(4):2597–2633, 2017.[24] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization.

IEEE Trans. Control.Netw. Syst. , 5(3):1245–1260, 2017.[25] C. Xi, R. Xin, and U. A. Khan. ADD-OPT: Accelerated distributed directed optimization.

IEEE Trans.Autom. Control , 63(5):1329–1339, 2017.[26] D. Jakoveti´c. A uniﬁcation and generalization of exact distributed ﬁrst-order methods.

IEEE Trans.Signal Inf. Process. Netw. Process. , 5(1):31–46, 2018.[27] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed. Decentralized proximal gradient algorithms withlinear convergence rates.

IEEE Trans. Autom. Control , 2020.[28] J. Xu, Y. Tian, Y. Sun, and G. Scutari. Distributed algorithms for composite optimization: Uniﬁedand tight convergence analysis. arXiv:2002.11534 , 2020.[29] H. Sun, S. Lu, and M. Hong. Improving the sample and communication complexity for decentralizednon-convex optimization: A joint gradient estimation and tracking approach. arXiv:1910.05857 , 2019.[30] T. Pan, J. Liu, and J. Wang. D-SPIDER-SFO: A decentralized optimization algorithm with faster con-vergence rate for nonconvex problems. In

Proceedings of the AAAI Conference on Artiﬁcial Intelligence ,volume 34, pages 1619–1626, 2020.[31] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: near-optimal non-convex optimization via stochasticpath-integrated diﬀerential estimator. In

Proc. Adv. Neural Inf. Process. Syst. , pages 689–699, 2018.[32] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost and momentum: Faster variancereduction algorithms. In

Proc. Adv. Neural Inf. Process. Syst. , pages 2403–2413, 2019.[33] N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: an eﬃcient algorithmicframework for stochastic composite nonconvex optimization.

J. Mach. Learn. Res. , 21(110):1–48, 2020.[34] A. Mokhtari and A. Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm.

J.Mach. Learn. Res. , 17(1):2165–2199, 2016.[35] K. Yuan, B. Ying, J. Liu, and A. H. Sayed. Variance-reduced stochastic learning by networked agentsunder random reshuﬄing.

IEEE Trans. Signal Process. , 67(2):351–366, 2018.[36] H. Li, Z. Lin, and Y. Fang. Optimal accelerated variance reduced EXTRA and DIGing for stronglyconvex and smooth decentralized optimization. arXiv preprint arXiv:2009.04373 , 2020.1337] B. Li, S. Cen, Y. Chen, and Y. Chi. Communication-eﬃcient distributed optimization in networks withgradient tracking and variance reduction.

J. Mach. Learn. Res. , 21(180):1–51, 2020.[38] K. Rajawat and C. Kumar. A primal-dual framework for decentralized stochastic optimization. arXivpreprint arXiv:2012.04402 , 2020.[39] R. Xin, U. A. Khan, and S. Kar. A near-optimal stochastic gradient method for decentralized non-convexﬁnite-sum optimization. arXiv preprint arXiv:2008.07428 , 2020.[40] R. Xin, U. A. Khan, and S. Kar. A fast randomized incremental gradient method for decentralizednon-convex optimization. arXiv preprint arXiv:2011.03853 , 2020.[41] Q. L¨u, X. Liao, H. Li, and T. Huang. A computation-eﬃcient decentralized algorithm for compositeconstrained optimization.

IEEE Trans. Signal Inf. Process. Netw. , 6:774–789, 2020.[42] D. Liu, L. M. Nguyen, and Q. Tran-Dinh. An optimal hybrid variance-reduced algorithm for stochasticcomposite nonconvex optimization. arXiv preprint arXiv:2008.09055 , 2020.[43] Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen. A hybrid stochastic optimization frameworkfor stochastic composite nonconvex optimization.

Math. Program. , 2020.[44] A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. In

Adv. NeuralInf. Process. Syst. , pages 15236–15245, 2019.[45] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takac. SARAH: A novel method for machine learningproblems using stochastic recursive gradient. In

Proc. 34th Int. Conf. Mach. Learn. , pages 2613–2621,2017.[46] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds fornon-convex stochastic optimization. arXiv preprint arXiv:1912.02365 , 2019.[47] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduction for nonconvex optimization.

J. Mach.Learn. Res. , 2020.[48] R. Xin, U. A. Khan, and S. Kar. Variance-reduced decentralized stochastic optimization with acceleratedconvergence.

IEEE Trans. Signal Process. , 68:6255–6271, 2020.[49] A. Antoniadis, I. Gijbels, and M. Nikolova. Penalized likelihood regression for generalized linear modelswith non-quadratic penalties.

Annals of the Institute of Statistical Mathematics , 63(3):585–615, 2011.[50] A. Nedi´c, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoﬀsin decentralized optimization.

P. IEEE , 106(5):953–976, 2018.[51] Y. Nesterov.

Lectures on convex optimization , volume 137. Springer, 2018.

A Proof of Lemma 2

We recall the standard Descent Lemma [51], i.e., ∀ x , y ∈ R p , F ( y ) ≤ F ( x ) + (cid:104)∇ F ( x ) , y − x (cid:105) + L (cid:107) y − x (cid:107) , (15)since the global function F is L -smooth. Setting y = x t +1 and x = x t in (15) and using (5) , we have: ∀ t ≥ F ( x t +1 ) ≤ F ( x t ) − (cid:10) ∇ F ( x t ) , x t +1 − x t (cid:11) + L (cid:107) x t +1 − x t (cid:107) ≤ F ( x t ) − α (cid:10) ∇ F ( x t ) , v t (cid:11) + Lα (cid:107) v t (cid:107) . (16)14sing (cid:104) a , b (cid:105) = (cid:0) (cid:107) a (cid:107) + (cid:107) b (cid:107) − (cid:107) a − b (cid:107) (cid:1) , ∀ a , b ∈ R p , in (16) gives: for 0 < α ≤ L and ∀ t ≥ F ( x t +1 ) ≤ F ( x t ) − α (cid:107)∇ F ( x t ) (cid:107) − (cid:18) α − Lα (cid:19) (cid:107) v t (cid:107) + α (cid:107) v t − ∇ F ( x t ) (cid:107) , ≤ F ( x t ) − α (cid:107)∇ F ( x t ) (cid:107) − (cid:18) α − Lα (cid:19) (cid:107) v t (cid:107) + α (cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) + α (cid:13)(cid:13) ∇ f ( x t ) − ∇ F ( x t ) (cid:13)(cid:13) , ( i ) ≤ F ( x t ) − α (cid:107)∇ F ( x t ) (cid:107) − α (cid:107) v t (cid:107) + α (cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) + αL n (cid:107) x t − Jx t (cid:107) , (17)where ( i ) is due to Lemma 1(c) and that Lα ≤ α since 0 < α ≤ L . Rearranging (17), we have: for 0 <α ≤ L and ∀ t ≥ (cid:107)∇ F ( x t ) (cid:107) ≤ F ( x t ) − F ( x t +1 )) α − (cid:107) v t (cid:107) + 2 (cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) + 2 L n (cid:107) x t − Jx t (cid:107) . (18)Taking the telescoping sum of (18) over t from 0 to T , ∀ T ≥ F bounded belowby F ∗ in the resulting inequality ﬁnishes the proof. B Proof of Lemma 3

B.1 Proof of Eq. (7)

We recall that the update of each local stochastic gradient estimator v it , ∀ t ≥

1, in (2) may be writtenequivalently as follows: v it = β g i ( x it , ξ it ) + (1 − β ) (cid:0) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + v it − (cid:1) , where β ∈ (0 , ∀ t ≥ ∀ i ∈ V , v it − ∇ f i ( x it ) = β g i ( x it , ξ it ) + (1 − β ) (cid:0) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + v it − (cid:1) − β ∇ f i ( x it ) − (1 − β ) ∇ f i ( x it )= β (cid:16) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:17) + (1 − β ) (cid:16) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + v it − − ∇ f i ( x it ) (cid:17) = β (cid:16) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:17) + (1 − β ) (cid:16) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + ∇ f i ( x it − ) − ∇ f i ( x it ) (cid:17) + (1 − β ) (cid:16) v it − − ∇ f i ( x it − ) (cid:17) . (19)In (19), we observe that ∀ t ≥ ∀ i ∈ V , E (cid:2) g i ( x it , ξ it ) − ∇ f i ( x it ) |F t (cid:3) = p , (20) E (cid:2) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + ∇ f i ( x it − ) − ∇ f i ( x it ) |F t (cid:3) = p , (21)where we recall the deﬁnition of the ﬁltration F t in (1). Averaging (19) over i from 1 to n gives: ∀ t ≥ v t − ∇ f ( x t ) = (1 − β ) (cid:16) v t − − ∇ f ( x t − ) (cid:17) + β · n n (cid:88) i =1 (cid:16) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:17)(cid:124) (cid:123)(cid:122) (cid:125) =: s t + (1 − β ) · n n (cid:88) i =1 (cid:16) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + ∇ f i ( x it − ) − ∇ f i ( x it ) (cid:17)(cid:124) (cid:123)(cid:122) (cid:125) =: z t , (22)15here E [ s t |F t ] = E [ z t |F t ] = p by (20) and (21). In light of (22), we have: ∀ t ≥ E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) |F t (cid:3) = (1 − β ) (cid:107) v t − − ∇ f ( x t − ) (cid:107) + E (cid:104) (cid:107) β s t + (1 − β ) z t (cid:107) |F t (cid:105) + 2 E (cid:2)(cid:10) (1 − β ) (cid:0) v t − − ∇ f ( x t − ) (cid:1) , β s t + (1 − β ) z t (cid:11) |F t (cid:3) ( i ) = (1 − β ) (cid:107) v t − − ∇ f ( x t − ) (cid:107) + E (cid:104) (cid:107) β s t + (1 − β ) z t (cid:107) |F t (cid:105) ≤ (1 − β ) (cid:107) v t − − ∇ f ( x t − ) (cid:107) + 2 β E (cid:104) (cid:107) s t (cid:107) |F t (cid:105) + 2(1 − β ) E (cid:104) (cid:107) z t (cid:107) |F t (cid:105) , (23)where ( i ) is due to E (cid:2)(cid:10) (1 − β ) (cid:0) v t − − ∇ f ( x t − ) (cid:1) , β s t + (1 − β ) z t (cid:11) |F t (cid:3) = 0 , since E [ s t |F t ] = E [ z t |F t ] = p and v t − − ∇ f ( x t − ) is F t -measurable. We next bound the second and thethird term in (23) respectively. For the second term in (23), we observe that ∀ t ≥ E (cid:2) (cid:107) s t (cid:107) (cid:3) = 1 n n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:13)(cid:13) (cid:105) + 1 n (cid:88) i (cid:54) = j E (cid:104)(cid:68) g i ( x it , ξ it ) − ∇ f i ( x it ) , g j ( x jt , ξ jt ) − ∇ f j ( x jt ) (cid:69)(cid:105) ( i ) = 1 n n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:13)(cid:13) (cid:105) ≤ ν n , (24)We note that ( i ) in (24) uses that whenever i (cid:54) = j , E (cid:104)(cid:68) g i ( x it , ξ it ) − ∇ f i ( x it ) , g j ( x jt , ξ jt ) − ∇ f j ( x jt ) (cid:69)(cid:12)(cid:12) F t (cid:105) ( ii ) = E (cid:104)(cid:68) E (cid:104) g i ( x it , ξ it ) | σ ( ξ jt , F t ) (cid:105) − ∇ f i ( x it ) , g j ( x jt , ξ jt ) − ∇ f j ( x jt ) (cid:69)(cid:12)(cid:12) F t (cid:105) ( iii ) = E (cid:104)(cid:68) E (cid:2) g i ( x it , ξ it ) |F t (cid:3) − ∇ f i ( x it ) , g j ( x jt , ξ jt ) − ∇ f j ( x jt ) (cid:69)(cid:12)(cid:12)(cid:12) F t (cid:105) = 0 , (25)where ( ii ) is due to the tower property of the conditional expectation and ( iii ) uses that ξ jt is independentof { ξ it , F t } and x it is F t -measurable. Towards the third term (23), we deﬁne for the ease of exposition, ∀ t ≥ (cid:98) ∇ it := ∇ f i ( x it ) − ∇ f i ( x it − )and recall that E (cid:2) g i ( x it , ξ it ) − g i ( x it − , ξ it ) |F t (cid:3) = (cid:98) ∇ it . Observe that ∀ t ≥ E (cid:2) (cid:107) z t (cid:107) |F t (cid:3) = E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16) g i ( x it , ξ it ) − g i ( x it − , ξ it ) − (cid:98) ∇ it (cid:17)(cid:13)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) F t (cid:21) = 1 n n (cid:88) i =1 E (cid:20)(cid:13)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) − (cid:98) ∇ it (cid:13)(cid:13)(cid:13) (cid:12)(cid:12) F t (cid:21) + 1 n (cid:88) i (cid:54) = j E (cid:104)(cid:68) g i ( x it , ξ it ) − g i ( x it − , ξ it ) − (cid:98) ∇ it , g j ( x jt , ξ jt ) − g j ( x jt − , ξ jt ) − (cid:98) ∇ jt (cid:69)(cid:12)(cid:12) F t (cid:105)(cid:124) (cid:123)(cid:122) (cid:125) =0( i ) = 1 n n (cid:88) i =1 E (cid:20)(cid:13)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) − (cid:98) ∇ t (cid:13)(cid:13)(cid:13) (cid:12)(cid:12) F t (cid:21) , ( ii ) ≤ n n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) F t (cid:105) . (26)16here ( i ) follows from a similar line of arguments as (25) and ( ii ) uses the conditional variance decomposition,i.e., for any random vector a ∈ R p consisted of square-integrable random variables, E (cid:104) (cid:107) a − E [ a |F t ] (cid:107) |F t (cid:105) = E (cid:104) (cid:107) a (cid:107) |F t (cid:105) − (cid:107) E [ a |F t ] (cid:107) . (27)To proceed from (26), we take its expectation and observe that ∀ t ≥ E (cid:2) (cid:107) z t (cid:107) (cid:3) ≤ n n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) (cid:13)(cid:13) (cid:105) ( i ) ≤ L n n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) x it − x it − (cid:13)(cid:13) (cid:105) = L n E (cid:104) (cid:107) x t − x t − (cid:107) (cid:105) = L n E (cid:104) (cid:107) x t − Jx t + Jx t − Jx t − + Jx t − − x t − (cid:107) (cid:105) ≤ L n E (cid:104) (cid:107) x t − Jx t (cid:107) + n (cid:107) x t − x t − (cid:107) + (cid:107) x t − − Jx t − (cid:107) (cid:105) ( ii ) = 3 L α n E (cid:104) (cid:107) v t − (cid:107) (cid:105) + 3 L n (cid:16) E (cid:104) (cid:107) x t − Jx t (cid:107) + (cid:107) x t − − Jx t − (cid:107) (cid:105) (cid:17) , (28)where ( i ) uses the mean-squared smoothness of each g i and ( ii ) uses the update of x t in (5). The prooffollows by taking the expectation (23) and then using (24) and (28) in the resulting inequality. B.2 Proof of Eq. (8)

We recall from (19) the following relationship: ∀ t ≥ v it − ∇ f i ( x it ) = β (cid:16) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:17) + (1 − β ) (cid:16) g i ( x it , ξ it ) − g i ( x it − , ξ it ) + ∇ f i ( x it − ) − ∇ f i ( x it ) (cid:17) + (1 − β ) (cid:16) v it − − ∇ f i ( x it − ) (cid:17) . (29)We recall that the conditional expectation of the ﬁrst and the second term in (29) with respect to F t is zeroand that the third term in (29) is F t -measurable. Following a similar procedure in the proof of (23), wehave: ∀ t ≥ E (cid:2) (cid:107) v it − ∇ f i ( x it ) (cid:107) |F t (cid:3) ≤ (1 − β ) (cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) + 2 β E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:13)(cid:13) |F t (cid:105) + 2(1 − β ) E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) − (cid:0) ∇ f i ( x it ) − ∇ f i ( x it − ) (cid:1)(cid:13)(cid:13) |F t (cid:105) ( i ) ≤ (1 − β ) (cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) + 2 β E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − ∇ f i ( x it ) (cid:13)(cid:13) |F t (cid:105) + 2(1 − β ) E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) (cid:13)(cid:13) |F t (cid:105) (30)where ( i ) uses the conditional variance decomposition (27). We then take the expectation of (30) with thehelp of the mean-squared smoothness and the bounded variance of each g i to proceed: ∀ t ≥ E (cid:104)(cid:13)(cid:13) v it − ∇ f i ( x it ) (cid:13)(cid:13) (cid:105) ≤ (1 − β ) E (cid:104)(cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) + 2 β ν i + 2(1 − β ) L E (cid:104)(cid:13)(cid:13) x it − x it − (cid:13)(cid:13) (cid:105) ≤ (1 − β ) E (cid:104)(cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) + 2 β ν i + 6(1 − β ) L (cid:16) E (cid:104)(cid:13)(cid:13) x it − x t (cid:13)(cid:13) + (cid:107) x t − x t − (cid:107) + (cid:13)(cid:13) x t − − x it − (cid:13)(cid:13) (cid:105)(cid:17) , = (1 − β ) E (cid:104)(cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) + 2 β ν i + 6(1 − β ) L α E (cid:104) (cid:107) v t − (cid:107) (cid:105) + 6(1 − β ) L (cid:16) E (cid:104)(cid:13)(cid:13) x it − x t (cid:13)(cid:13) + (cid:13)(cid:13) x it − − x t − (cid:13)(cid:13) (cid:105)(cid:17) , (31)where the last line uses the x t -update in (5). Summing up (31) over i from 1 to n completes the proof.17 Proof of Lemma 5

C.1 Proof of Lemma 5(a)

Recall the initialization of

GT-HSGD that v − = np , y = np and v i = b (cid:80) b r =1 g i ( x i , ξ i ,r ). Using thegradient tracking update (4a) at iteration t = 0, we have: E (cid:104) (cid:107) y − Jy (cid:107) (cid:105) = E (cid:104) (cid:107) W ( y + v − v − ) − JW ( y + v − v − ) (cid:107) (cid:105) ( i ) = E (cid:104) (cid:107) ( W − J ) v (cid:107) (cid:105) ( ii ) ≤ λ E (cid:104) (cid:107) v − ∇ f ( x ) + ∇ f ( x ) (cid:107) (cid:105) = λ n (cid:88) i =1 E (cid:104)(cid:13)(cid:13) v i − ∇ f i ( x i ) (cid:13)(cid:13) (cid:105) + λ (cid:107)∇ f ( x ) (cid:107) iii ) = λ n (cid:88) i =1 E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) b b (cid:88) r =1 (cid:16) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:17)(cid:13)(cid:13)(cid:13)(cid:13) (cid:21) + λ (cid:107)∇ f ( x ) (cid:107) iv ) = λ b n (cid:88) i =1 b (cid:88) r =1 E (cid:104)(cid:13)(cid:13) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:13)(cid:13) (cid:105) + λ (cid:107)∇ f ( x ) (cid:107) , (32)where ( i ) uses JW = J and the initial condition of v − and y , ( ii ) uses (cid:107) W − J (cid:107) = λ , ( iii ) is due tothe initialization of v i , and ( iv ) follows from the fact that { ξ i , , ξ i , , · · · , ξ i ,b } , ∀ i ∈ V , is an independentfamily of random vectors, by a similar line of arguments in (24) and (25). The proof then follows by usingthe bounded variance of each g i in (32). C.2 Proof of Lemma 5(b)

Following the gradient tracking update (4a), we have: ∀ t ≥ (cid:107) y t +1 − Jy t +1 (cid:107) = (cid:107) W ( y t + v t − v t − ) − JW ( y t + v t − v t − ) (cid:107) i ) = (cid:107) Wy t − Jy t + ( W − J ) ( v t − v t − ) (cid:107) = (cid:107) Wy t − Jy t (cid:107) + 2 (cid:10) Wy t − Jy t , ( W − J ) ( v t − v t − ) (cid:11) + (cid:107) ( W − J ) ( v t − v t − ) (cid:107) ii ) ≤ λ (cid:107) y t − Jy t (cid:107) + 2 (cid:10) Wy t − Jy t , ( W − J ) ( v t − v t − ) (cid:11)(cid:124) (cid:123)(cid:122) (cid:125) =: A t + λ (cid:107) v t − v t − (cid:107) , (33)where ( i ) uses JW = J and ( ii ) is due to (cid:107) W − J (cid:107) = λ . In the following, we bound A t and the last termin (33) respectively. We recall the update of each local stochastic gradient estimator v it in (2): ∀ t ≥ v it = g i ( x it , ξ it ) + (1 − β ) v it − − (1 − β ) g i ( x it − , ξ it ) . We observe that ∀ t ≥ ∀ i ∈ V , v it − v it − = g i ( x it , ξ it ) − β v it − − (1 − β ) g i ( x it − , ξ it )= g i ( x it , ξ it ) − g i ( x it − , ξ it ) − β v it − + β g i ( x it − , ξ it )= g i ( x it , ξ it ) − g i ( x it − , ξ it ) − β (cid:0) v it − − ∇ f i ( x it − ) (cid:1) + β (cid:0) g i ( x it − , ξ it ) − ∇ f i ( x it − ) (cid:1) . (34)Moreover, we observe from (34) that ∀ t ≥ E [ v t − v t − |F t ] = ∇ f ( x t ) − ∇ f ( x t − ) − β (cid:0) v t − − ∇ f ( x t − ) (cid:1) . (35)18owards A t , we have: ∀ t ≥ E [ A t |F t ] ( i ) = 2 (cid:68) Wy t − Jy t , ( W − J ) E [ v t − v t − |F t ] (cid:69) ( ii ) = 2 (cid:68) Wy t − Jy t , ( W − J ) (cid:16) ∇ f ( x t ) − ∇ f ( x t − ) − β (cid:0) v t − − ∇ f ( x t − ) (cid:17)(cid:69) ( iii ) ≤ λ (cid:107) y t − Jy t (cid:107) · λ (cid:13)(cid:13)(cid:13) ∇ f ( x t ) − ∇ f ( x t − ) − β (cid:16) v t − − ∇ f ( x t − ) (cid:17)(cid:13)(cid:13)(cid:13) ( iv ) ≤ − λ (cid:107) y t − Jy t (cid:107) + 2 λ − λ (cid:13)(cid:13)(cid:13) ∇ f ( x t ) − ∇ f ( x t − ) − β (cid:16) v t − − ∇ f ( x t − ) (cid:17)(cid:13)(cid:13)(cid:13) , ( v ) ≤ − λ (cid:107) y t − Jy t (cid:107) + 4 λ L − λ (cid:107) x t − x t − (cid:107) + 4 λ β − λ (cid:107) v t − − ∇ f ( x t − ) (cid:107) , (36)where ( i ) is due to the F t -measurability of y t , ( ii ) uses (35), ( iii ) is due to the Cauchy-Schwarz inequalityand (cid:107) W − J (cid:107) = λ , ( iv ) uses the elementary inequality that 2 ab ≤ ηa + b /η , with η = − λ λ for any a, b ∈ R ,and ( v ) holds since each f i is L -smooth. Next, towards the last term in (33), we take the expectation of (34)to obtain: ∀ t ≥ ∀ i ∈ V , E (cid:104)(cid:13)(cid:13) v it − v it − (cid:13)(cid:13) (cid:105) ≤ E (cid:104)(cid:13)(cid:13) g i ( x it , ξ it ) − g i ( x it − , ξ it ) (cid:13)(cid:13) (cid:105) + 3 β E (cid:104)(cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) + 3 β E (cid:104)(cid:13)(cid:13) g i ( x it − , ξ it ) − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) ≤ L E (cid:104)(cid:13)(cid:13) x it − x it − (cid:13)(cid:13) (cid:105) + 3 β E (cid:104)(cid:13)(cid:13) v it − − ∇ f i ( x it − ) (cid:13)(cid:13) (cid:105) + 3 β ν i . (37)where (37) uses the mean-squared smoothness and the bounded variance of each g i . Summing up (37) over i from 1 to n gives an upper bound on the last term in (33): ∀ t ≥ λ E (cid:104) (cid:107) v t − v t − (cid:107) (cid:105) ≤ λ L E (cid:104) (cid:107) x t − x t − (cid:107) (cid:105) + 3 λ β E (cid:104) (cid:107) v t − − ∇ f ( x t − ) (cid:107) (cid:105) + 3 λ nβ ν . (38)We now use (36) and (38) in (33) to obtain: ∀ t ≥ E (cid:104) (cid:107) y t +1 − Jy t +1 (cid:107) (cid:105) ≤ λ E (cid:104) (cid:107) y t − Jy t (cid:107) (cid:105) + 7 λ L − λ E (cid:104) (cid:107) x t − x t − (cid:107) (cid:105) + 7 λ β − λ E (cid:104) (cid:107) v t − − ∇ f ( x t − ) (cid:107) (cid:105) + 3 λ nβ ν . (39)Towards the second term in (39), we use (10) to obtain: ∀ t ≥ (cid:107) x t − x t − (cid:107) = (cid:107) x t − Jx t + Jx t − Jx t − + Jx t − − x t − (cid:107) i ) ≤ (cid:107) x t − Jx t (cid:107) + 3 nα (cid:107) v t − (cid:107) + 3 (cid:107) x t − − Jx t − (cid:107) ≤ λ α (cid:107) y t − Jy t (cid:107) + 3 nα (cid:107) v t − (cid:107) + 9 (cid:107) x t − − Jx t − (cid:107) , (40)where ( i ) uses the x t -update in (5). Finally, we use (40) in (39) to obtain: ∀ t ≥ E (cid:104) (cid:107) y t +1 − Jy t +1 (cid:107) (cid:105) ≤ (cid:18) λ λ L α − λ (cid:19) E (cid:104) (cid:107) y t − Jy t (cid:107) (cid:105) + 21 λ nL α − λ E (cid:104) (cid:107) v t − (cid:107) (cid:105) + 63 λ L − λ E (cid:104) (cid:107) x t − − Jx t − (cid:107) (cid:105) + 7 λ β − λ E (cid:104) (cid:107) v t − − ∇ f ( x t − ) (cid:107) (cid:105) + 3 λ nβ ν . The proof is completed by the fact that λ + λ L α − λ ≤ λ if 0 < α ≤ − λ √ λ L .19 Proof of Lemma 6

D.1 Proof of Eq. (11)

We recursively apply the recursion of V t from t to 0 to obtain: ∀ t ≥ V t ≤ qV t − + qR t − + Q t + C ≤ q V t − + ( q R t − + qR t − ) + ( qQ t − + Q t ) + ( qC + C ) · · ·≤ q t V + t − (cid:88) i =0 q t − i R i + t (cid:88) i =1 q t − i Q i + C t − (cid:88) i =0 q i . (41)Summing up (41) over t from 1 to T gives: ∀ T ≥ T (cid:88) t =0 V t ≤ V T (cid:88) t =0 q t + T (cid:88) t =1 t − (cid:88) i =0 q t − i R i + T (cid:88) t =1 t (cid:88) i =1 q t − i Q i + C T (cid:88) t =1 t − (cid:88) i =0 q i ≤ V ∞ (cid:88) t =0 q t + T − (cid:88) t =0 (cid:32) ∞ (cid:88) i =0 q i (cid:33) R t + T (cid:88) t =1 (cid:32) ∞ (cid:88) i =0 q i (cid:33) Q t + C T (cid:88) t =1 ∞ (cid:88) i =0 q i , and the proof follows by (cid:80) ∞ i =0 q i = (1 − q ) − . D.2 Proof of Eq. (12)

We recursively apply the recursion of V t from t + 1 to 1 to obtain: ∀ t ≥ V t +1 ≤ qV t + R t − + C ≤ q V t − + ( qR t − + R t − ) + ( qC + C ) · · ·≤ q t V + t − (cid:88) i =0 q t − − i R i + C t − (cid:88) i =0 q i . (42)We sum up (42) over t from 1 to T − ∀ T ≥ T − (cid:88) t =0 V t +1 ≤ V T − (cid:88) t =0 q t + T − (cid:88) t =1 t − (cid:88) i =0 q t − − i R i + C T − (cid:88) t =1 t − (cid:88) i =0 q i ≤ V ∞ (cid:88) t =0 q t + T − (cid:88) t =0 (cid:32) ∞ (cid:88) i =0 q i (cid:33) R t + C T − (cid:88) t =1 ∞ (cid:88) i =0 q i , and the proof follows by (cid:80) ∞ i =0 q i = (1 − q ) − . E Proof of Lemma 7

E.1 Proof of Eq. (13)

We ﬁrst observe that − (1 − β ) ≤ β for β ∈ (0 , ∀ T ≥ T (cid:88) t =0 E (cid:104)(cid:13)(cid:13) v t − ∇ f ( x t ) (cid:13)(cid:13) (cid:105) ≤ E (cid:2) (cid:107) v − ∇ f ( x ) (cid:107) (cid:3) β + 6 L α nβ T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 6 L n β T − (cid:88) t =0 E (cid:104)(cid:13)(cid:13) x t +1 − Jx t +1 (cid:107) + (cid:107) x t − Jx t (cid:13)(cid:13) (cid:105) + 2 βν Tn ≤ E (cid:2) (cid:107) v − ∇ f ( x ) (cid:107) (cid:3) β + 6 L α nβ T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 12 L n β T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 βν Tn . (43)20owards the ﬁrst term in (43), we observe that E (cid:104)(cid:13)(cid:13) v − ∇ f ( x ) (cid:13)(cid:13) (cid:105) = E (cid:34)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 b b (cid:88) r =1 (cid:16) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:17)(cid:13)(cid:13)(cid:13) (cid:35) ( i ) = 1 n b n (cid:88) i =1 b (cid:88) r =1 E (cid:104)(cid:13)(cid:13) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:13)(cid:13) (cid:105) ≤ ν nb , (44)where ( i ) follows from a similar line of arguments in (25). Then (13) follows from using (44) in (43). E.2 Proof of Eq. (14)

We apply (11) to (8) to obtain: ∀ T ≥ T (cid:88) t =0 E (cid:104) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:105) ≤ E (cid:2) (cid:107) v − ∇ f ( x ) (cid:107) (cid:3) β + 6 nL α β T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 6 L β T − (cid:88) t =0 E (cid:104)(cid:13)(cid:13) x t +1 − Jx t +1 (cid:107) + (cid:107) x t − Jx t (cid:13)(cid:13) (cid:105) + 2 nβν T ≤ E (cid:2) (cid:107) v − ∇ f ( x ) (cid:107) (cid:3) β + 6 nL α β T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 12 L β T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 nβν T. (45)In (45), we observe that E (cid:2) (cid:107) v − ∇ f ( x ) (cid:107) (cid:3) = n (cid:88) i =1 E (cid:34)(cid:13)(cid:13)(cid:13) b b (cid:88) r =1 (cid:0) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:1) (cid:13)(cid:13)(cid:13) (cid:35) ( i ) = 1 b n (cid:88) i =1 b (cid:88) r =1 E (cid:104)(cid:13)(cid:13) g i ( x i , ξ i ,r ) − ∇ f i ( x i ) (cid:13)(cid:13) (cid:105) ≤ nν b , (46)where ( i ) follows from a similar line of arguments in (25). Then (14) follows from using (46) in (45). F Proof of Lemma 8

We recall that (cid:107) x t − Jx t (cid:107) = 0, since it is assumed without generality that x i = x j for any i, j ∈ V .Applying (11) to (9) yields: ∀ T ≥ T (cid:88) t =0 (cid:107) x t − Jx t (cid:107) ≤ λ α (1 − λ ) T (cid:88) t =1 (cid:107) y t − Jy t (cid:107) . (47)To further bound (cid:80) t (cid:107) y t − Jy t (cid:107) , we apply (12) in Lemma 5(b) to obtain: if 0 < α ≤ − λ √ λ L , then ∀ T ≥ T (cid:88) t =1 E (cid:104) (cid:107) y t − Jy t (cid:107) (cid:105) ≤ E (cid:2) (cid:107) y − Jy (cid:107) (cid:3) − λ + 84 λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 252 λ L (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + 28 λ β (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) + 12 λ nβ ν T − λ ≤ λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 252 λ L (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + 28 λ β (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) + 12 λ nβ ν T − λ + 4 λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) − λ + 4 λ nν (1 − λ ) b , (48)21here the last inequality is due to Lemma 5(a). To proceed, we use (14), an upper bound on (cid:80) t E (cid:2) (cid:107) v t − ∇ f ( x t ) (cid:107) (cid:3) ,in (48) to obtain: if 0 < α ≤ − λ √ λ L and β ∈ (0 , ∀ T ≥ T (cid:88) t =1 E (cid:104) (cid:107) y t − Jy t (cid:107) (cid:105) ≤ λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 588 λ L (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + 28 λ nβν (1 − λ ) b + 56 λ nβ ν T (1 − λ ) + 12 λ nβ ν T − λ + 4 λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) − λ + 4 λ nν (1 − λ ) b = 252 λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 588 λ L (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + (cid:18) β − λ + 1 (cid:19) λ nν (1 − λ ) b + (cid:18) β − λ + 3 (cid:19) λ nβ ν T − λ + 4 λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) − λ . (49)Finally, we use (49) in (47) to obtain: ∀ T ≥ T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) ≤ λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + 2352 λ L α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) x t − Jx t (cid:107) (cid:3) + (cid:18) β − λ + 1 (cid:19) λ nν α (1 − λ ) b + (cid:18) β − λ + 3 (cid:19) λ nβ ν α T (1 − λ ) + 16 λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) α (1 − λ ) , which may be written equivalently as (cid:18) − λ L α (1 − λ ) (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) ≤ λ nL α (1 − λ ) T − (cid:88) t =0 E (cid:2) (cid:107) v t (cid:107) (cid:3) + (cid:18) β − λ + 1 (cid:19) λ nν α (1 − λ ) b + (cid:18) β − λ + 3 (cid:19) λ nβ ν α T (1 − λ ) + 16 λ (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) α (1 − λ ) , (50)We observe in (50) that λ L α (1 − λ ) ≤ if 0 < α ≤ (1 − λ ) λ L , and the proof follows. G Proof of Theorem 1

For the ease of presentation, we denote ∆ := F ( x ) − F ∗ in the following. We apply (13) to Lemma 2 toobtain: if 0 < α ≤ L , then ∀ T ≥ T (cid:88) t =0 E (cid:104) (cid:107)∇ F ( x t ) (cid:107) (cid:105) ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 2 L n T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 ν βb n + 12 L α nβ T − (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 24 L n β T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 4 βν Tn ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 2 L n (cid:18) nβ (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 ν βb n + 4 βν Tn − (cid:18) − L α nβ (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) . (51)Therefore, if 0 < α < √ L and L α n ≤ β <

1, i.e., − L α nβ ≥

0, we may drop the last term in (51) toobtain: ∀ T ≥ T (cid:88) t =0 E (cid:104) (cid:107)∇ F ( x t ) (cid:107) (cid:105) ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 2 L n (cid:18) nβ (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 ν βb n + 4 βν Tn . (52)22oreover, we observe: ∀ T ≥ n n (cid:88) i =1 T (cid:88) t =0 E (cid:104)(cid:13)(cid:13) ∇ F ( x it ) (cid:13)(cid:13) (cid:105) ≤ n n (cid:88) i =1 T (cid:88) t =0 (cid:16) E (cid:104)(cid:13)(cid:13) ∇ F ( x it ) − ∇ F ( x t ) (cid:13)(cid:13) + (cid:107)∇ F ( x t ) (cid:107) (cid:105)(cid:17) = 2 L n T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 2 T (cid:88) t =0 E (cid:104) (cid:107)∇ F ( x t ) (cid:107) (cid:105) , (53)where the last line uses the L -smoothness of F . Using (52) in (53) yields: if 0 < α < √ L and 48 L α /n ≤ β <

1, then1 n n (cid:88) i =1 T (cid:88) t =0 E (cid:104)(cid:13)(cid:13) ∇ F ( x it ) (cid:13)(cid:13) (cid:105) ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 6 L n (cid:18) nβ (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 4 ν βb n + 8 βν Tn . (54)According to (54), if 0 < α < √ L and β = 48 L α /n , we have: ∀ T ≥ n n (cid:88) i =1 T (cid:88) t =0 E (cid:104)(cid:13)(cid:13) ∇ F ( x it ) (cid:13)(cid:13) (cid:105) ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 6 L n (cid:18) L α (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105) + 4 ν βb n + 8 βν Tn ≤ α − T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 2 nα T (cid:88) t =0 E (cid:104) (cid:107) x t − Jx t (cid:107) (cid:105)(cid:124) (cid:123)(cid:122) (cid:125) =:Φ T + 4 ν βb n + 8 βν Tn . (55)where the last line is due to 6 L α < /

8. To simplify Φ T , we use Lemma 8 to obtain: if 0 < α ≤ (1 − λ ) λ L then ∀ T ≥

2, Φ T ≤ − (cid:18) − λ L α (1 − λ ) (cid:19) T (cid:88) t =0 E (cid:104) (cid:107) v t (cid:107) (cid:105) + 64 λ (1 − λ ) (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + (cid:18) β − λ + 1 (cid:19) λ ν (1 − λ ) b + (cid:18) β − λ + 3 (cid:19) λ β ν T (1 − λ ) . (56)In (56), we observe that if 0 < α ≤ (1 − λ ) λ L , then 1 − λ L α (1 − λ ) ≥ < α ≤ √ n (1 − λ )26 λL , then β = L α n ≤ − λ λ . Hence, if 0 < α ≤ min (cid:110) (1 − λ ) λ , √ n (1 − λ )26 λ (cid:111) L , then (56) reduces to: ∀ T ≥ T ≤ λ (1 − λ ) (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + 96 λ ν (1 − λ ) b + 256 λ β ν T (1 − λ ) . (57)Finally, we use (57) in (55) to obtain: if 0 < α < min (cid:110) (1 − λ ) λ , √ n (1 − λ )26 λ , √ (cid:111) L , we have: ∀ T ≥ n ( T + 1) n (cid:88) i =1 T (cid:88) t =0 E (cid:104)(cid:13)(cid:13) ∇ F ( x it ) (cid:13)(cid:13) (cid:105) ≤ αT + 4 ν βb nT + 8 βν n + 64 λ (1 − λ ) T (cid:13)(cid:13) ∇ f (cid:0) x (cid:1)(cid:13)(cid:13) n + 96 λ ν (1 − λ ) b T + 256 λ β ν (1 − λ ) . (58)The proof follows by (58) and that E [ (cid:107)∇ F ( (cid:101) x T ) (cid:107) ] = n ( T +1) (cid:80) ni =1 (cid:80) Tt =0 E [ (cid:107)∇ F ( x it ) (cid:107) ] since (cid:101) x T is chosenuniformly at random from { x it : ∀ i ∈ V , ≤ t ≤ T }}