Convergence of Distributed Stochastic Variance Reduced Methods without Sampling Extra Data
Shicong Cen†, Huishuai Zhang‡, Yuejie Chi†, Wei Chen‡, Tie-Yan Liu‡
†Carnegie Mellon University  ‡Microsoft Research Asia
{shicongc,yuejiec}@andrew.cmu.edu, {huzhang, wche, tyliu}@microsoft.com
January 24, 2020
Abstract
Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods, by assigning outer loops to the parameter server and inner loops to worker machines. This framework is natural and friendly to implement, but its theoretical convergence is not well understood. We obtain a unified understanding of algorithmic convergence with respect to data homogeneity by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate. We also demonstrate that our analysis can be further extended to handle nonconvex loss functions.
1 Introduction

Empirical risk minimization arises frequently in machine learning and signal processing, where the objective function is the average of losses computed at different data points. Due to the increasing size of data, distributed computing architectures, which assign the learning task over multiple computing nodes, are in great need to meet the scalability requirement in terms of both computation power and storage space. In addition, distributed frameworks are suitable for problems where there are privacy concerns in transmitting and storing all the data in a central location, a scenario related to the nascent field of federated learning [14]. It is, therefore, necessary to develop distributed optimization frameworks that are tailored to solving large-scale empirical risk minimization problems with desirable communication-computation trade-offs, where the data are stored disjointly over different machines.

Due to the low per-iteration cost, a popular solution is distributed stochastic gradient descent (SGD) [23], where the parameter server aggregates gradients from each worker and performs mini-batch gradient updates. However, distributed SGD is not communication-efficient and requires many communication rounds to converge, which partially diminishes the benefit of distribution. On the other hand, recent breakthroughs in developing stochastic variance reduced methods have made it possible to achieve fast convergence and small per-iteration cost at the same time, such as the notable SVRG [13] algorithm. Yet, distributed schemes of such variance reduced methods that are both practical and theoretically sound are much less developed.

This paper focuses on a general framework of distributed stochastic variance reduced methods, to be presented in Alg. 1, which is natural and friendly to implement. On a high level, SVRG-type algorithms [13] contain inner loops for parameter updates via variance-reduced SGD, and outer loops for global gradient and parameter updates.
Our general framework assigns outer loops to the parameter server, and inner loops to worker machines. The parameter server collects gradients from worker machines and then distributes the global gradient to each machine. Each worker machine then runs the inner loop independently in parallel using variance reduction techniques, which might differ when distributing different algorithms, and returns the updates to the parameter server at the end. Per iteration, two communication rounds are required: one communication round is used to average the parameter estimates, and the other is used to average the gradients, which is the same as distributed synchronous SGD. However, the premise is that by performing more efficient local computation using stochastic variance reduced methods, the algorithm converges in fewer iterations and is therefore more communication-efficient.

Due to the simplicity of this framework, similar methods have been implemented in several works [14, 7, 24], and have achieved great empirical success. Surprisingly, a complete theoretical understanding of its convergence behavior is still missing at large. Moreover, distributed variants using accelerated variance reduction methods are not developed. The main analysis difficulty is that the variance-reduced gradient of each worker is no longer an unbiased gradient estimator when sampling from re-used local data.

To ease this difficulty, several variants of distributed SVRG, e.g. [15, 28, 32], have been proposed with performance guarantees, which try to bypass the biased gradient estimation issue by emulating the process of i.i.d. sampling from the global data using some complicated random data re-allocation protocol, which requires sampling extra data with or without replacement.
These procedures lead to unnecessary data waste and potential privacy leakage, and can be cumbersome and difficult to implement in practice. Consequently, a natural question arises: can we provide a mathematical analysis of the convergence of the natural framework of distributed stochastic variance reduced methods, under some simple and intuitive metric?
This paper provides a convergence analysis of a family of naturally distributed stochastic variance reduced methods under the framework described in Alg. 1, for both convex and nonconvex loss functions. By using different variance reduction schemes at the worker machines, we study distributed variants of three representative algorithms in this paper: SVRG [13], SARAH employing recursive gradient updates [21, 22], and MiG employing accelerated gradient updates [42]. Our methodology can be extended to study other variants in a similar fashion. The contributions of this paper are summarized below.

• We suggest a simple and intuitive metric called distributed smoothness to gauge data balancedness among workers, defined as the smoothness of the difference f_k − f between the local loss function f_k and the global loss function f, which is the average of the local loss functions. The metric is deterministic, easy to compute, applies to arbitrary dataset splitting, and is shown to play a critical role in the convergence analysis.

• We establish the linear convergence of the distributed algorithms D-SVRG, D-SARAH, and D-MiG under strongly convex losses, as long as the distributed smoothness parameter is smaller than a constant fraction of the strong convexity parameter σ, where the fraction might change for different algorithms. Our bounds capture the phenomenon that the convergence rate improves as the local loss functions become more similar to the global loss function, i.e., as the distributed smoothness parameter decreases. Furthermore, the run time complexity exhibits the so-called “linear speed-up” property in distributed computing, where the complexity depends on the local data size, instead of the global data size, which typically implies an improvement by a factor of n, where n is the number of machines.

• When the local data are highly unbalanced, the distributed smoothness parameter becomes large, which implies that the algorithm might diverge.
We suggest regularization as an effective way to handle this situation, and show that by adding larger regularization to machines that are less distributed smooth, one can still ensure linear convergence in a regularized version of D-SVRG, called D-RSVRG, though at a slower rate of convergence.

• More generally, the notion of distributed smoothness can also be used to establish convergence under nonconvex losses. We demonstrate this through the convergence analysis of D-SARAH in the nonconvex setting.

Table 1: Communication rounds and runtime of the proposed and existing algorithms for strongly convex losses (ignoring logarithmic factors in κ) to reach ε-accuracy. Algorithms with an asterisk are proposed/analyzed in this paper. Here, N is the total number of input data points, n is the number of worker machines, and κ is the condition number of the global loss function f.

Algorithm | Communication Rounds | Runtime | Assumptions
--- | --- | --- | ---
DSVRG [15] | (1 + κ/(N/n)) log(1/ε) | (N/n + κ) log(1/ε) | extra data
DASVRG [15] | (1 + √(κ/(N/n))) log(1/ε) | (N/n + √(κN/n)) log(1/ε) | extra data
Dist. AGD | √κ log(1/ε) | (N/n)√κ log(1/ε) | none
ADMM | κ log(1/ε) | (N/n)κ log(1/ε) | none
SCOPE [41] | κ log(1/ε) | (N/n + κ log κ)κ log(1/ε) | uniform regularization
pSCOPE [40] | log(1/ε) | (N/n + κ) log(1/ε) | good partition
D-SVRG* | log(1/ε) | (N/n + κ) log(1/ε) | distributed restricted smoothness
D-SARAH* | log(1/ε) | (N/n + κ) log(1/ε) | distributed restricted smoothness
D-MiG* | (1 + √(κ/(N/n))) log(1/ε) | (N/n + √(κN/n)) log(1/ε) | distributed smoothness
D-RSVRG* | κ log(1/ε) | (N/n)κ log(1/ε) | large regularization
D-RSVRG* | log(1/ε) | (N/n + κ) log(1/ε) | small regularization

Related Work

Distributed optimization is a classic topic [6, 5], yet recent trends in data-intensive applications are calling for new developments with a focus on communication and computation efficiency.
Examples of deterministic optimization methods include DANE [29], AIDE [24], DiSCo [38], GIANT [33], CoCoA [30], CEASE [9], one-shot averaging [43, 39], etc.

Many stochastic variance reduced methods have been proposed recently, for example, SAG [25], SAGA [8], SVRG [13], SDCA [27], MiG [42], Katyusha [3], Catalyst [18], SCOPE [41, 40], SARAH [21], SPIDER [10], SpiderBoost [34], to name a few. Several previous works have studied distributed variants of SVRG. For example, the D-SVRG algorithm has been empirically studied before in [24, 14] without a theoretical convergence analysis. The pSCOPE algorithm [40] is also a variant of distributed SVRG, and its convergence is studied under an assumption called good data partition in [40], which is hard to interpret and verify in practice. The SCOPE algorithm [41] is similar to the regularized variant D-RSVRG of D-SVRG under large regularization; however, our analysis is much more refined, as it allows different regularizations at different local workers according to the distributed smoothness of local data, and gracefully degenerates to the unregularized case when the distributed smoothness is benign. The general framework of distributed variance-reduced methods covering SARAH and MiG and various loss settings in this paper has not been studied before. For conciseness, Table 1 summarizes the communication and computation complexities of the algorithms most relevant to the current paper.

There are also many recent efforts on reducing the communication cost of distributed GD/SGD via gradient quantization [1, 4, 26, 36] and gradient compression and sparsification [2, 17, 19, 35, 31]. In comparison, we communicate the exact gradient, and it is an interesting future direction to combine gradient compression schemes with distributed variance reduced stochastic gradient methods.
Paper Organization

The rest of this paper is organized as follows. Section 2 presents the problem setup and a general framework of distributed stochastic optimization with variance-reduced local updates. Section 3 presents the convergence guarantees of D-SVRG, D-SARAH and D-MiG under appropriate distributed smoothness assumptions. Section 4 introduces regularization to D-SVRG to handle unbalanced data when distributed smoothness does not hold. Section 5 presents extensions to nonconvex losses for D-SARAH. Section 6 provides an outline of the analysis; the rest of the proofs are deferred to the supplemental materials. Section 7 presents numerical experiments to corroborate the theoretical findings. Finally, we conclude in Section 8.
2 Problem Setup
Suppose we have a data set M = {z_1, · · · , z_N}, where z_j ∈ R^p is the j-th data point for j = 1, ..., N, and N is the total number of data points. In particular, we do not make any assumptions on their statistical distribution. Consider the following empirical risk minimization problem

    min_{x ∈ R^d} f(x) := (1/N) Σ_{z ∈ M} ℓ(x; z),    (1)

where x ∈ R^d is the parameter to be optimized and ℓ : R^d × R^p → R is the sample loss function. For brevity, we use ℓ_z(x) to denote ℓ(x; z) throughout the paper.

In a distributed setting, where the data are distributed to n machines or workers, we define a partition of the data set M as M = ∪_{k=1}^n M_k, where M_i ∩ M_k = ∅ for all i ≠ k. The k-th worker, correspondingly, is in possession of the data subset M_k, 1 ≤ k ≤ n. We assume there is a parameter server (PS) that coordinates the parameter sharing among the workers. The size of the data held by the k-th worker machine is N_k = |M_k|. When the data is split equally, we have N_k = N/n. The original problem (1) can be rewritten as minimizing the following objective function:

    f(x) := (1/n) Σ_{k=1}^n f_k(x),    (2)

where f_k(x) = (1/(N/n)) Σ_{z ∈ M_k} ℓ_z(x) is the local loss function at the k-th worker machine.

Alg. 1 presents a general framework for distributed stochastic variance reduced methods, which assigns the outer loops of an SVRG-type algorithm [13, 21, 42] to the PS and the inner loops to the local workers. By using different variance reduction schemes at the worker machines (i.e.,
LocalUpdate), we obtain distributed variants of different algorithms. On a high level, the framework alternates between local computation by individual workers in parallel, and global information sharing coordinated by the PS (cf. Alg. 1).

• The local worker conducts the local computation
LocalUpdate based on the current estimate x̃^t, the global gradient ∇f(x̃^t), and its local data f_k(·); in this paper, we are primarily interested in local updates using stochastic variance-reduced gradients. A little additional information about the previous update is needed when employing acceleration, which will be specified in Alg. 3.

• The PS combines the local estimates y_k^{t+1} into a global estimate x̃^{t+1}, and then computes the global gradient ∇f(x̃^{t+1}) by pulling the local gradients ∇f_k(x̃^{t+1}), which requires two rounds of communication. For simplicity of analysis, we assume the global estimate x̃^{t+1} is set to one of the local estimates selected at random, rather than the average of all the local estimates.

Throughout, we invoke one or several of the following standard assumptions on the loss function in the convergence analysis.
Assumption 1 (Smoothness). The sample loss ℓ_z(·) is L-smooth for all z ∈ M.

Assumption 2 (Convexity). The sample loss ℓ_z(·) is convex for all z ∈ M.

Assumption 3 (Strong Convexity). The empirical risk f(·) is σ-strongly convex.

It is straightforward to state our results under unequal data splitting with proper rescaling.

Algorithm 1 A general distributed framework
Input: initial point x̃^0.
Initialization: compute ∇f(x̃^0) and distribute it to all machines.
for t = 0, 1, 2, · · · do
    for workers 1 ≤ k ≤ n in parallel do
        y_k^{t+1} = LocalUpdate(f_k, x̃^t, ∇f(x̃^t));
        send y_k^{t+1} to the PS;
    end for
    PS: randomly select x̃^{t+1} from all y_k^{t+1} and push x̃^{t+1} to all workers;
    for workers 1 ≤ k ≤ n in parallel do
        compute ∇f_k(x̃^{t+1}) and send it to the PS;
    end for
    PS: average ∇f(x̃^{t+1}) = (1/n) Σ_{k=1}^n ∇f_k(x̃^{t+1}) and push ∇f(x̃^{t+1}) to all workers.
end for
return x̃^{t+1}

When f is strongly convex, the condition number of f is defined as κ := L/σ. Denote the unique minimizer and the optimal value of f(x) as

    x* := arg min_{x ∈ R^d} f(x),  f* := f(x*).

As it turns out, the smoothness of the deviation f_k − f between the local loss function f_k and the global loss function f plays a key role in the convergence analysis, as it measures the balancedness between local data in a simple and intuitive manner. We refer to this as the “distributed smoothness”. In some cases, a weaker notion called restricted smoothness is sufficient, which is defined below.

Definition 1 (Restricted Smoothness). A differentiable function f : R^d → R is called c-restricted smooth with regard to x* if ‖∇f(x*) − ∇f(y)‖ ≤ c‖x* − y‖ for all y ∈ R^d.

Restricted smoothness, compared to standard smoothness, fixes one of the arguments to x*, and is therefore a much weaker requirement. The following assumptions quantify the distributed smoothness using either restricted smoothness or standard smoothness.
Assumption 4a (Distributed Restricted Smoothness). The deviation f − f_k is c_k-restricted smooth with regard to x* for all 1 ≤ k ≤ n.

Assumption 4b (Distributed Smoothness). The deviation f − f_k is c_k-smooth for all 1 ≤ k ≤ n.

It is straightforward to check that c_k ≤ L for all 1 ≤ k ≤ n. If all the data samples are generated following a certain statistical distribution in an i.i.d. fashion, one can further link the distributed smoothness to the local sample size N/n, where c_k decreases as N/n increases; see e.g. [29, 9] for further discussion.
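To make the framework concrete, here is a serial simulation of Alg. 1 on a synthetic least-squares problem; a minimal sketch under assumed sizes, step sizes, and loss. The LocalUpdate placeholder performs full-batch local gradient steps corrected by the global-gradient drift term; the stochastic variance-reduced routines studied in the next section would be plugged in instead.

```python
import numpy as np

# Serial simulation of the framework in Alg. 1 on a synthetic least-squares
# problem. LocalUpdate is a pluggable placeholder: full-batch local gradient
# steps corrected by the drift term grad_f(x_tilde) - grad_f_k(x_tilde).
# All sizes, step sizes, and the loss are illustrative assumptions.
rng = np.random.default_rng(1)
N, d, n = 2000, 5, 4
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
shards = np.split(np.arange(N), n)              # disjoint local datasets M_k

def grad(x, idx):
    """Gradient of the average least-squares loss over the rows in idx."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def local_update(k, x_tilde, g_global, steps=50, eta=0.1):
    """Placeholder LocalUpdate: corrected local gradient descent."""
    correction = g_global - grad(x_tilde, shards[k])   # fixed during the round
    y = x_tilde.copy()
    for _ in range(steps):
        y -= eta * (grad(y, shards[k]) + correction)
    return y

x_tilde = np.zeros(d)
for t in range(100):                            # outer loops at the PS
    g_global = np.mean([grad(x_tilde, s) for s in shards], axis=0)  # round 1
    ys = [local_update(k, x_tilde, g_global) for k in range(n)]     # parallel
    x_tilde = ys[rng.integers(n)]               # PS picks one estimate; round 2

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.linalg.norm(x_tilde - x_star) < 1e-6
```

Note that only two quantities cross the network per outer loop: the averaged gradient and the selected parameter estimate, mirroring the two communication rounds of the framework.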
3 Convergence under Strongly Convex Losses

In this section, we describe three variance-reduced routines for LocalUpdate used in Alg. 1, namely SVRG [13], SARAH [21, 22], and MiG [42], and analyze their convergence when f(·) is strongly convex.

3.1 D-SVRG

The LocalUpdate routine of D-SVRG is described in Alg. 2. Theorem 1 provides the convergence guarantee of D-SVRG as long as the distributed restricted smoothness parameter is small enough.
Theorem 1 (D-SVRG). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. For proper m and step size η, the iterates of D-SVRG satisfy

    E[f(x̃^{t+1}) − f*] ≤ ν E[f(x̃^t) − f*]

for some ν ∈ (0, 1) that decreases as c/σ decreases. The communication and runtime complexities of finding an ε-optimal solution (in terms of function value) are O(ζ^{-1} log(1/ε)) and O((N/n + ζ^{-2}κ) ζ^{-1} log(1/ε)), respectively, where ζ = 1 − c/σ.

Algorithm 2 LocalUpdate via SVRG/SARAH
Input: local data M_k, x̃^t, ∇f(x̃^t); Parameters: step size η, number of iterations m.
Set y_k^{t,0} = x̃^t, v_k^{t,0} = ∇f(x̃^t);
for s = 0, ..., m − 1 do
    y_k^{t,s+1} = y_k^{t,s} − η v_k^{t,s};
    Sample z from M_k uniformly at random, and compute
        v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t)          (SVRG)
        v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(y_k^{t,s}) + v_k^{t,s}     (SARAH)
end for
Set y_k^{t+1} uniformly at random from {y_k^{t,1}, ..., y_k^{t,m}}.

Theorem 1 establishes the linear convergence of function values in expectation for D-SVRG, as long as the parameter c is sufficiently small relative to σ. From the expressions of the communication and runtime complexities, it can be seen that the smaller c is, the faster D-SVRG converges, suggesting that the homogeneity of distributed data plays an important role in the efficiency of distributed optimization. When c/σ is bounded above by a sufficiently small constant, i.e. ζ = Θ(1), the runtime complexity becomes O((N/n + κ) log(1/ε)), which improves upon the counterpart of SVRG in the centralized setting, O((N + κ) log(1/ε)).

Remark 1.
The above LocalUpdate routine corresponds to the so-called Option II (i.e., setting y_k^{t+1} uniformly at random among the previous updates) specified in [13]. Under similar assumptions, we also establish the convergence of D-SVRG using Option I, where the output y_k^{t+1} is set to y_k^{t,m}. In addition, D-SVRG still converges linearly in the absence of Assumption 2. We leave these extensions to the supplementary materials.
3.2 D-SARAH

The LocalUpdate of D-SARAH is also described in Alg. 2; it differs from SVRG in the update of the stochastic gradient v_k^{t,s}, which uses a recursive formula proposed in [21]. Theorem 2 provides the convergence guarantee of D-SARAH as long as the distributed restricted smoothness parameter is small enough.

Theorem 2 (D-SARAH). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. With proper m and step size η, the iterates of D-SARAH satisfy

    E[‖∇f(x̃^{t+1})‖²] ≤ ν E[‖∇f(x̃^t)‖²]

for some ν ∈ (0, 1) that decreases as c/σ decreases. The communication and runtime complexities of finding an ε-optimal solution (in terms of gradient norm) are O(ζ^{-1} log(1/ε)) and O((N/n + ζ^{-2}κ) ζ^{-1} log(1/ε)), respectively, where ζ = 1 − √2·c/σ.

Theorem 2 establishes the linear convergence of the gradient norm in expectation for D-SARAH, as long as the parameter c is small enough. Similar to D-SVRG, a smaller c leads to faster convergence of D-SARAH. When c/σ is bounded above by a sufficiently small constant, the runtime complexity becomes O((N/n + κ) log(1/ε)), which improves upon the counterpart of SARAH in the centralized setting, O((N + κ) log(1/ε)). In particular, Theorem 2 suggests that D-SARAH may allow a larger c, compared with D-SVRG, to guarantee convergence.

Algorithm 3 LocalUpdate via MiG
Input: local data M_k, x̃^t, ∇f(x̃^t); Parameters: step size η, number of iterations m, and w.
if t = 0 then set x_k^{t,0} = x̃^t else set x_k^{t,0} = x_k^{t−1,m} end if
for s = 0, ..., m − 1 do
    Set y_k^{t,s} = (1 − θ)x̃^t + θ x_k^{t,s};
    Sample z from M_k uniformly at random, and set v_k^{t,s} = ∇ℓ_z(y_k^{t,s}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t);
    x_k^{t,s+1} = x_k^{t,s} − η v_k^{t,s};
end for
Set y_k^{t+1} = (Σ_{j=0}^{m−1} w^j)^{−1} Σ_{j=0}^{m−1} w^j y_k^{t,j+1}.
3.3 D-MiG

The LocalUpdate of D-MiG is described in Alg. 3, which is inspired by the inner loop of the MiG algorithm [42], a recently proposed accelerated variance-reduced algorithm. Compared with D-SVRG and D-SARAH, D-MiG uses additional information from previous rounds, by carrying over x_k^{t,0} = x_k^{t−1,m} in Alg. 3. Theorem 3 provides the convergence guarantee of D-MiG, as long as the distributed smoothness parameter is small enough.
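The three LocalUpdate variants can be sketched for a single worker on a synthetic least-squares shard as follows; the step sizes, m, θ, w, the number of outer rounds, and returning the last inner iterate for SVRG/SARAH (Option I of Remark 1) are illustrative assumptions rather than the tuned parameters behind the theorems.

```python
import numpy as np

# Minimal single-worker sketches of the LocalUpdate variants: SVRG and SARAH
# from Alg. 2, and the MiG-style accelerated update from Alg. 3. All
# hyperparameters and the least-squares loss are illustrative assumptions.
rng = np.random.default_rng(3)
n_data, d = 200, 5
A = rng.standard_normal((n_data, d))
b = rng.standard_normal(n_data)

def grad_sample(x, j):                 # gradient of the sample loss at z_j
    return A[j] * (A[j] @ x - b[j])

def grad_full(x):                      # batch gradient of the local loss
    return A.T @ (A @ x - b) / n_data

def local_update(x_tilde, g_tilde, mode, m=500, eta=0.01):
    """SVRG/SARAH inner loop of Alg. 2 (returns the last inner iterate)."""
    y, v = x_tilde.copy(), g_tilde.copy()
    for _ in range(m):
        y_next = y - eta * v
        j = rng.integers(n_data)
        if mode == "svrg":             # correction anchored at x_tilde
            v = grad_sample(y_next, j) - grad_sample(x_tilde, j) + g_tilde
        else:                          # "sarah": recursive correction
            v = grad_sample(y_next, j) - grad_sample(y, j) + v
        y = y_next
    return y

def mig_update(x_tilde, g_tilde, x0, theta=0.5, eta=0.01, m=500, w=1.01):
    """MiG-style inner loop of Alg. 3 (weighted average plus carry-over)."""
    x = x0.copy()
    y_acc, w_acc, w_pow = np.zeros(d), 0.0, 1.0
    for _ in range(m):
        y = (1 - theta) * x_tilde + theta * x          # coupled iterate
        j = rng.integers(n_data)
        v = grad_sample(y, j) - grad_sample(x_tilde, j) + g_tilde
        x = x - eta * v
        y_acc += w_pow * ((1 - theta) * x_tilde + theta * x)
        w_acc += w_pow
        w_pow *= w                                     # geometric weights w^j
    return y_acc / w_acc, x

results = {}
for mode in ("svrg", "sarah"):
    x = np.zeros(d)
    for _ in range(30):                                # outer rounds, cf. Alg. 1
        x = local_update(x, grad_full(x), mode)
    results[mode] = np.linalg.norm(grad_full(x))

x_tilde, x_carry = np.zeros(d), np.zeros(d)
for _ in range(30):
    x_tilde, x_carry = mig_update(x_tilde, grad_full(x_tilde), x_carry)
results["mig"] = np.linalg.norm(grad_full(x_tilde))

assert all(g < 1e-3 for g in results.values())
```

With a single worker the local loss equals the global loss, so all three variants drive the gradient norm down geometrically; in the distributed setting the discrepancy f − f_k enters through the anchored correction terms.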
Theorem 3 (D-MiG). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4b holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. Let w = (1 + ησ)/(1 + 3ηc). With proper m and step size η, the iterates of D-MiG achieve an ε-optimal solution within a communication complexity of O((1 + √(κ/(N/n))) log(1/ε)) and a runtime complexity of O((N/n + √(κN/n)) log(1/ε)).

Theorem 3 establishes the linear convergence of D-MiG under the standard smoothness of f − f_k, in order to fully harness the power of acceleration. While we do not make it explicit in the theorem statement, the time complexity of D-MiG also decreases as c gets smaller. Furthermore, the time complexity of D-MiG is smaller than that of D-SVRG/D-SARAH when κ ≳ N/n.

Remark 2.
Theorem 3 continues to hold for regularized empirical risk minimization, where the loss function is given as F(x) = f(x) + g(x), and g(x) is a convex and non-smooth regularizer. In this case, the update of x_k^{t,s+1} in Alg. 3 is changed to

    x_k^{t,s+1} = arg min_x { (1/(2η)) ‖x − x_k^{t,s}‖² + ⟨v_k^{t,s}, x⟩ + g(x) }.

4 Handling Unbalanced Data via Regularization

So far, we have established convergence when the distributed smoothness parameter is not too large. While this may be reasonable in certain settings, e.g. in a data center where one has control over how to distribute the data, it becomes increasingly harder to satisfy when the data are generated locally and are heterogeneous across workers. When such conditions are violated, the algorithms might diverge. In this situation, adding a regularization term can ensure convergence, at the cost of possibly slowing down the rate. We consider regularizing the local gradient update of D-SVRG in Alg. 2 as

    v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t) + µ_k (y_k^{t,s+1} − x̃^t),    (3)

where the last term penalizes the distance between the current iterate y_k^{t,s+1} and the reference point x̃^t, and µ_k > 0 is the regularization parameter employed at the k-th worker. We have the following theorem.

Theorem 4 (Distributed Regularized SVRG (D-RSVRG)). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k smaller than the constant fraction required by Theorem 1, applied to σ + µ_k in place of σ. Let µ = min_{1≤k≤n} µ_k. With proper m and step size η, there exists some constant 0 ≤ ν < 1 such that the iterates of D-RSVRG satisfy

    E[f(x̃^{t+1}) − f*] ≤ ( 1 − (1 − ν) max{ σ/(L + µ), 1 − µ/σ } ) · E[f(x̃^t) − f*],    (4)

and the runtime complexity of finding an ε-optimal solution is bounded by

    O( (N/n + ζ^{-2} κ̄) ζ^{-1} min{ κ + µ/σ, (1 − µ/σ)^{-1} } log(1/ε) ),

where ζ = 1 − c/(σ + µ) and κ̄ = (L + µ)/(σ + µ).
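To isolate the effect of the proximal term in (3), the following deterministic caricature replaces the sampled gradient with the full local gradient and applies the regularized update on a deliberately unbalanced two-worker split; the choice µ_k = ‖H − H_k‖₂ and all sizes are illustrative assumptions, not the tuned parameters of Theorem 4.

```python
import numpy as np

# Deterministic, full-batch caricature of the regularized update (3): the
# sampled gradient is replaced by the full local gradient so that the effect
# of the proximal term mu_k*(y - x_tilde) is isolated from sampling noise.
# The unbalanced split and mu_k = ||H - H_k||_2 are illustrative assumptions.
rng = np.random.default_rng(7)
d, m_loc = 3, 60
A1 = 0.3 * rng.standard_normal((m_loc, d))   # worker 1: "small" data
A2 = rng.standard_normal((m_loc, d))         # worker 2: "large" data
b1, b2 = rng.standard_normal(m_loc), rng.standard_normal(m_loc)
workers = [(A1, b1), (A2, b2)]

def grad_local(k, x):
    Ak, bk = workers[k]
    return Ak.T @ (Ak @ x - bk) / m_loc

def grad_global(x):
    return 0.5 * (grad_local(0, x) + grad_local(1, x))

H = sum(Ak.T @ Ak / m_loc for Ak, _ in workers) / 2
mus = [np.linalg.norm(H - Ak.T @ Ak / m_loc, 2) for Ak, _ in workers]

x_tilde = np.zeros(d)
for t in range(150):
    k = rng.integers(2)          # only the worker the PS picks needs simulating
    g = grad_global(x_tilde)
    y = x_tilde.copy()
    for _ in range(300):         # regularized inner loop, cf. (3)
        v = grad_local(k, y) - grad_local(k, x_tilde) + g + mus[k] * (y - x_tilde)
        y = y - 0.2 * v
    x_tilde = y

assert np.linalg.norm(grad_global(x_tilde)) < 1e-6
```

With mus[k] set to 0, the same recursion can diverge on the small-data worker, whose local curvature badly mismatches the global one; this is the failure mode the regularization is designed to prevent.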
Compared with Theorem 1, Theorem 4 relaxes the requirement on c_k from a constant fraction of σ to the same fraction of σ + µ_k, which means that by inserting a larger regularization µ_k at local workers that are less distributed smooth, i.e. those with large c_k, one can still guarantee the convergence of D-RSVRG. However, increasing µ leads to a slower convergence rate: a large µ ≳ L leads to an iteration complexity of O(κ log(1/ε)), similar to gradient descent. Compared with SCOPE [41], which requires a uniform regularization µ > L − σ, our analysis applies tailored regularization to local workers, and potentially allows much smaller regularization to guarantee convergence, since the c_k's can be much smaller than the smoothness parameter L.

5 Extension to Nonconvex Losses

In this section, we extend the convergence analysis of D-SARAH to handle nonconvex loss functions, since SARAH-type algorithms have recently been shown to achieve near-optimal performance for nonconvex problems [34, 22, 10]. As a modification that eases the analysis, we make every worker return y_k^{t+1} = y_k^{t,m} in the last line of Alg. 2. Our result is summarized in the theorem below.

Theorem 5 (D-SARAH for nonconvex losses). Suppose that Assumption 1 and Assumption 4b hold with c_k ≤ c. With a step size η = O(1/(L√m + cm)), D-SARAH satisfies

    (1/(Tm)) Σ_{t=0}^{T−1} Σ_{s=0}^{m−1} E[ ‖∇f(y_{k(t)}^{t,s})‖² ] ≤ (2/(ηTm)) ( f(x̃^0) − f* ),

where k(t) is the index of the worker selected in the t-th round for the parameter update, i.e. x̃^{t+1} = y_{k(t)}^{t+1} (cf. the random selection at the PS in Alg. 1). To find an ε-optimal solution, the communication complexity is O((√(n/N) + c/L) L/ε), and the runtime complexity is O(N/n + (√(N/n) + (N/n)·(c/L)) L/ε) by setting m = Θ(N/n).

Theorem 5 suggests that D-SARAH converges as long as the step size is small enough. Furthermore, a smaller c allows a larger step size η, and hence faster convergence.
To gain further insights, assume i.i.d. data at each worker; by concentration inequalities it is known that c/L = O(√(log(N/n)/(N/n))) under mild conditions [20], and consequently, the runtime complexity of finding an ε-accurate solution using D-SARAH is O(N/n + L√(log(N/n)·(N/n))/ε). This is comparable to the best known result, O(N + L√N/ε), for centralized SARAH-type algorithms in the nonconvex setting [22, 10, 34] up to logarithmic factors, with the data size N replaced by the local data size N/n, demonstrating again the benefit of data distribution.
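The scaling of the distributed smoothness with the local sample size can be eyeballed numerically. For quadratic losses, f − f_k is itself quadratic, so c_k is exactly the spectral norm of the Hessian difference; the i.i.d. Gaussian data and the sizes below are assumptions.

```python
import numpy as np

# For quadratic losses l_z(x) = 0.5*(a^T x - c)^2, the deviation f - f_k is
# quadratic, so its smoothness constant is c_k = ||H - H_k||_2, the spectral
# norm of the Hessian difference. With i.i.d. data, c_k shrinks as the local
# sample size N/n grows; data distribution and sizes are assumptions.
rng = np.random.default_rng(8)
d, n = 5, 4

def max_ck(local_size):
    A = rng.standard_normal((n * local_size, d))
    H = A.T @ A / (n * local_size)                  # Hessian of the global loss
    cks = []
    for shard in np.split(np.arange(n * local_size), n):
        H_k = A[shard].T @ A[shard] / local_size    # Hessian of the local loss
        cks.append(np.linalg.norm(H - H_k, 2))
    return max(cks)

c_small_shards, c_large_shards = max_ck(20), max_ck(2000)
assert c_large_shards < c_small_shards              # c_k decreases with N/n
```

The observed decay is roughly √(d/(N/n)), consistent with the concentration heuristic above.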
6 Outline of the Analysis

In this section, we outline the convergence proofs of D-SVRG (Theorem 1), D-RSVRG (Theorem 4), and D-SARAH in the strongly convex (Theorem 2) and nonconvex (Theorem 5) settings, while leaving the details and the convergence proof of D-MiG (Theorem 3) to the appendix and supplemental materials. Throughout this section, we simplify the notation y_k^{t,s} and v_k^{t,s} by dropping the superscript t and the subscript k whenever the meaning is clear, since it is often sufficient to analyze the convergence of a specific worker k during a single round.

6.1 D-SVRG (Theorem 1)

We generalize the analysis of SVRG via the dissipativity theory of [12] to the analysis of D-SVRG, which might be of independent interest. We start with a quick review of dissipativity theory. Consider the following linear time-invariant system

    ξ_{k+1} = A ξ_k + B w_k,

where ξ_k ∈ R^{n_ξ} is the state and w_k ∈ R^{n_w} is the input. Dissipativity theory characterizes how the inputs w_j, j = 0, 1, 2, . . ., drive the internal energy stored in the states ξ_j, j = 0, 1, 2, . . ., via an energy function V : R^{n_ξ} → R_+ and a supply rate S : R^{n_ξ} × R^{n_w} → R. The theory aims to build the following dissipation inequality:

    V(ξ_{k+1}) ≤ ρ² V(ξ_k) + S(ξ_k, w_k),    (5)

where ρ ∈ (0, 1). The inequality indicates that at least a fraction 1 − ρ² of the internal energy dissipates at every iteration. With an energy function V(ξ) = ξ^⊤ P ξ and supply rates

    S_j(ξ, w) = [ξ^⊤, w^⊤] X_j [ξ^⊤, w^⊤]^⊤,    (6)

we have

    V(ξ_{k+1}) ≤ ρ² V(ξ_k) + Σ_{j=1}^J λ_j S_j(ξ_k, w_k)    (7)

as long as there exist a positive semidefinite matrix P and non-negative scalars λ_j such that

    [ A^⊤PA − ρ²P   A^⊤PB ]
    [ B^⊤PA          B^⊤PB ]  −  Σ_{j=1}^J λ_j X_j  ⪯  0.    (8)

In fact, by left-multiplying [ξ_k^⊤, w_k^⊤] and right-multiplying [ξ_k^⊤, w_k^⊤]^⊤ on (8), we recover (7).
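The implication from the matrix inequality (8) to the dissipation inequality (7) can be checked numerically. The sketch below uses synthetic matrices with J = 1 and constructs X_1 so that (8) holds with a small margin; all matrices and dimensions are assumptions for illustration.

```python
import numpy as np

# Numerical check that the LMI (8) implies the dissipation inequality (7),
# on synthetic matrices with J = 1 and lambda_1 = 1. X_1 is constructed so
# that the LMI holds by a margin eps; everything here is an assumed example.
rng = np.random.default_rng(5)
n_xi, n_w, rho = 3, 2, 0.9

A = 0.5 * rng.standard_normal((n_xi, n_xi))
B = 0.5 * rng.standard_normal((n_xi, n_w))
M = rng.standard_normal((n_xi, n_xi))
P = M @ M.T + np.eye(n_xi)                       # positive definite energy matrix

# Left-hand block matrix of the LMI (8).
top = np.hstack([A.T @ P @ A - rho**2 * P, A.T @ P @ B])
bot = np.hstack([B.T @ P @ A, B.T @ P @ B])
lmi_lhs = np.vstack([top, bot])

eps = 0.1
X1 = lmi_lhs + eps * np.eye(n_xi + n_w)          # then lmi_lhs - X1 = -eps*I <= 0
assert np.all(np.linalg.eigvalsh(lmi_lhs - X1) <= 0)

def V(xi):                                       # energy function V(xi) = xi^T P xi
    return xi @ P @ xi

def S(xi, w):                                    # supply rate with matrix X1
    z = np.concatenate([xi, w])
    return z @ X1 @ z

# The dissipation inequality (7) then holds for every state/input pair.
for _ in range(1000):
    xi, w = rng.standard_normal(n_xi), rng.standard_normal(n_w)
    assert V(A @ xi + B @ w) <= rho**2 * V(xi) + S(xi, w) + 1e-9
```

The check mirrors the sandwiching argument in the text: multiplying the LMI on both sides by the stacked state/input vector yields exactly (7).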
We can capture properties of the objective function, such as co-coercivity and strong convexity, with (8), and characterize the optimality of ξ_k by V(ξ_k), thus reducing the convergence proof to the existence of ρ ∈ (0, 1) satisfying (8), which can be viewed as a generalized eigenvalue optimization problem [16].

Setting ξ_s = y_s − x*, we can write the local update of D-SVRG via the following linear time-invariant system [12]:

    ξ_{s+1} = ξ_s − η [∇ℓ_z(y_s) − ∇ℓ_z(y^0) + ∇f(y^0)],

where z is selected uniformly at random from the local data points M_k and y^0 = x̃^t is the reference point of the current round, or equivalently,

    ξ_{s+1} = A ξ_s + B w_s,    (9)

with A = I_d, B = [−ηI_d, −ηI_d], and

    w_s = [ (∇ℓ_z(y_s) − ∇ℓ_z(x*))^⊤ , (∇ℓ_z(x*) − ∇ℓ_z(y^0) + ∇f(y^0))^⊤ ]^⊤.

Here, I_d is the identity matrix of dimension d. Recall the supply rate in (6). Since ξ_s ∈ R^d and w_s ∈ R^{2d}, we write X_j as X_j = X̄_j ⊗ I_d, where X̄_j ∈ R^{3×3}. Following [12], we consider three supply rates S_1, S_2 and S_3, whose defining matrices X̄_1, X̄_2 and X̄_3 encode co-coercivity and optimality properties of the sample losses. We have the following lemmas, which are proved in Appendices B.1 and B.2, respectively.

Lemma 1.
Suppose that Assumptions 1, 2, 3 and 4a hold with c_k ≤ c. For the supply rates above, we have

    E[S_1] ≤ a_1 L E[f(y_s) − f*] + 2cL E[‖y_s − x*‖²],
    E[S_2] ≤ a_2 L E[f(y^0) − f*] + c(4L + 2c) E[‖y^0 − x*‖²],
    E[S_3] ≤ −a_3 E[f(y_s) − f*] + 3c E[‖y_s − x*‖²] + c E[‖y^0 − x*‖²],

for absolute constants a_1, a_2, a_3 > 0.

Lemma 2. Suppose that Assumptions 1, 2, 3 and 4a hold with c_k ≤ c. If there exist non-negative scalars λ_1, λ_2, λ_3 satisfying a scalar condition, which in particular requires c/σ to be sufficiently small, together with a 3 × 3 linear matrix inequality in η, then D-SVRG satisfies

    E[f(y^+) − f*] ≤ ν E[f(y^0) − f*],    (13)

where ν is an explicit function of η, m, λ_1, λ_2, λ_3, L, c and σ, and the final output y^+ is selected from y_1, · · · , y_m uniformly at random.

We can now prove Theorem 1. Choosing λ_1, λ_2, λ_3 proportional to appropriate powers of η satisfies the matrix inequality, and the scalar condition holds for a sufficiently small step size. With η = (1 − c/σ)/(40L) and m = 160κ(1 − c/σ)^{-2}, the contraction factor ν in (13) is bounded away from 1 whenever c/σ is smaller than the constant required by Theorem 1. Therefore, after the PS selection step of Alg. 1, we have for D-SVRG

    E[f(x̃^{t+1}) − f*] ≤ (1/n) Σ_{k=1}^n E[f(y_k^{t+1}) − f*] ≤ ν E[f(x̃^t) − f*].

To obtain an ε-optimal solution in terms of function value, we need O(ζ^{-1} log(1/ε)) communication rounds, where ζ = 1 − c/σ.
Per round, the runtime complexity at each worker is $O(N/n + m) = O(N/n + \zeta^{-2}\kappa)$, where the first term corresponds to evaluating the batch gradient over the local data in parallel, and the second term corresponds to evaluating the stochastic gradients in the inner loop. Multiplying this by the number of communication rounds gives the overall runtime.

Consider an auxiliary sample function at the $t$th round and the $k$th worker:
\[
\ell^t_{\mu_k}(x; z) = \ell(x; z) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2.
\]
This leads to the auxiliary local and global loss functions, respectively,
\[
f^t_i(x) = \frac{1}{|\mathcal{M}_i|}\sum_{z\in\mathcal{M}_i} \ell^t_{\mu_k}(x; z) = f_i(x) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2, \quad 1 \le i \le n,
\]
and
\[
f^t(x) = \frac{1}{n}\sum_{i=1}^n f^t_i(x) = f(x) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2. \tag{14}
\]
Moreover, we have
\[
\nabla \ell^t_{\mu_k}(y^{t,s+1}_k; z) - \nabla \ell^t_{\mu_k}(\tilde{x}^t; z) + \nabla f^t(\tilde{x}^t)
= \nabla \ell(y^{t,s+1}_k; z) - \nabla \ell(\tilde{x}^t; z) + \nabla f(\tilde{x}^t) + \mu_k(y^{t,s+1}_k - \tilde{x}^t),
\]
which means that D-RSVRG performs in exactly the same way as the unregularized D-SVRG applied to the auxiliary loss functions $\ell^t_{\mu_k}$ in the $t$th round. Note that $\ell^t_{\mu_k}$ is $(\mu_k + L)$-smooth and that $f^t$ is $(\mu_k + \sigma)$-strongly convex, while the restricted smoothness of the $k$th worker remains unchanged since $f^t - f^t_k = f - f_k$. Applying Theorem 1, when Assumptions 1, 2 and 3 hold, and Assumption 4a holds with $c_k < (\sigma + \mu_k)/6$, we have, with proper $m$ and step size $\eta$,
\[
\mathbb{E}\big[f^t(\tilde{x}^{t+1}) - f^{t*}\big] < \nu\,\mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big], \tag{15}
\]
where $f^{t*}$ is the optimal value of $f^t$.

However, the definitions of the regularized loss functions $\ell^t_{\mu_k}$ and $f^t$ rely on $\tilde{x}^t$, which changes over rounds. Our next step is to relate the descent of $f^t$ to that of $f$.
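The gradient identity above, which shows that D-RSVRG is simply D-SVRG run on the auxiliary losses, can be checked numerically. The following sketch uses a single quadratic sample loss as a hypothetical stand-in for $\ell(x;z)$; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
a = rng.standard_normal(d)          # one sample: ell(x; z) = (1/2)(a^T x - 1)^2
mu = 0.3                            # regularization weight mu_k
x_tilde = rng.standard_normal(d)    # anchor point \tilde{x}^t
y = rng.standard_normal(d)          # current inner iterate

def grad_ell(x):
    return (a @ x - 1.0) * a

grad_f = grad_ell                   # single-sample "global" loss suffices for the identity

def grad_ell_reg(x):
    # gradient of ell(x; z) + (mu/2) ||x - x_tilde||^2
    return grad_ell(x) + mu * (x - x_tilde)

# grad f^t at x_tilde: the proximal term vanishes there, so it equals grad f(x_tilde)
lhs = grad_ell_reg(y) - grad_ell_reg(x_tilde) + grad_f(x_tilde)
rhs = grad_ell(y) - grad_ell(x_tilde) + grad_f(x_tilde) + mu * (y - x_tilde)
print(np.allclose(lhs, rhs))  # True
```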
To this end, we have
\begin{align*}
\mathbb{E}\big[f(\tilde{x}^{t+1}) - f^*\big]
&= \mathbb{E}\big[f^t(\tilde{x}^{t+1}) - f^{t*}\big] + f^{t*} - \mathbb{E}\Big[\frac{\mu}{2}\big\|\tilde{x}^t - \tilde{x}^{t+1}\big\|^2\Big] - f^* \\
&< \mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big] - (1-\nu)\,\mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big] + f^{t*} - f^* \\
&= \mathbb{E}\big[f(\tilde{x}^t) - f^*\big] - (1-\nu)\,\mathbb{E}\big[f(\tilde{x}^t) - f^{t*}\big],
\end{align*}
where the first line uses (14), the second line uses (15), and the last line follows from $f(\tilde{x}^t) = f^t(\tilde{x}^t)$.

We can continue to bound $f(\tilde{x}^t) - f^{t*}$ in two manners. First,
\[
f(\tilde{x}^t) - f^{t*} = f^t(\tilde{x}^t) - f^{t*} \ge \frac{1}{2(L+\mu)}\|\nabla f(\tilde{x}^t)\|^2 \ge \frac{\sigma}{L+\mu}\big(f(\tilde{x}^t) - f^*\big),
\]
where we used $\nabla f^t(\tilde{x}^t) = \nabla f(\tilde{x}^t)$, the $(L+\mu)$-smoothness of $f^t$, and the $\sigma$-strong convexity of $f$. On the other hand, we have
\[
f(\tilde{x}^t) - f^{t*} \ge f^t(\tilde{x}^t) - f^t(x^*) = f(\tilde{x}^t) - f(x^*) - \frac{\mu}{2}\|\tilde{x}^t - x^*\|^2 \ge (1 - \mu/\sigma)\big(f(\tilde{x}^t) - f^*\big).
\]
Thus (4) follows immediately by combining the above two bounds.

Theorem 2 can be deduced from the following theorem, whose proof can be found in Appendix C.1.
Theorem 6.
Suppose that Assumptions 1, 2, 3 and 4a hold with $c_k \le c$. Then D-SARAH satisfies
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
When $c < \sigma/(2\sqrt{3})$, we can choose $\eta = \frac{2(1 - 4c^2/\sigma^2)}{(9 - 4c^2/\sigma^2)L}$ and $m = \frac{2\kappa(9 - 4c^2/\sigma^2)}{(1 - 4c^2/\sigma^2)^2}$ in Theorem 6, leading to the following convergence rate:
\[
\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \frac{\sigma^2 + 4c^2}{2(\sigma^2 - 4c^2)}\,\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
Consequently, following similar discussions as for D-SVRG, the communication complexity of finding an $\epsilon$-optimal solution is $O(\zeta^{-1}\log(1/\epsilon))$, and the runtime complexity is $O\big((N/n + \zeta^{-1}\kappa)\zeta^{-1}\log(1/\epsilon)\big)$, where $\zeta = 1 - 2\sqrt{3}c/\sigma$.

Theorem 5 can be deduced from the following theorem, proved in Appendix C.2, which specifies the choice of the step size that guarantees convergence.
Theorem 7.
Suppose that Assumptions 1 and 4b hold with $c_k \le c$. By setting the step size
\[
\eta \le \frac{2}{L\left(\sqrt{1 + 8(m-1) + 4m(m-1)c^2/L^2} + 1\right)},
\]
a single outer loop of D-SARAH satisfies
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(y_m)\big].
\]
By setting $\tilde{x}^{t+1} = y_m$, the above theorem gives
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(\tilde{x}^t) - f(\tilde{x}^{t+1})\big].
\]
Hence, with $T$ outer loops, we have
\[
\frac{1}{Tm}\sum_{t=0}^{T-1}\sum_{s=0}^{m-1}\mathbb{E}\Big[\big\|\nabla f\big(y^{t,s}_{k(t)}\big)\big\|^2\Big] \le \frac{2}{\eta T m}\big(f(\tilde{x}^0) - f^*\big),
\]
where $k(t)$ is the worker index selected in the $t$th round for the parameter update. The communication complexity to achieve an $\epsilon$-optimal solution is
\[
T = O\left(\frac{1}{\eta m \epsilon}\right) = O\left(\frac{\sqrt{m} + m c/L}{m}\cdot\frac{L}{\epsilon}\right) = O\left(\left(\frac{1}{\sqrt{m}} + \frac{c}{L}\right)\cdot\frac{L}{\epsilon}\right),
\]
with the choice $\eta = \Theta\left(\frac{1}{L\sqrt{m} + mc}\right)$. Per round, the runtime complexity at each worker is $O(N/n + m)$. By choosing $m = O(N/n)$, we achieve the runtime complexity
\[
O\left(N/n + \Big(\sqrt{N/n} + N/n\cdot\frac{c}{L}\Big)\frac{L}{\epsilon}\right).
\]

Though the focus of this paper is theoretical, we illustrate the performance of the proposed distributed stochastic variance reduced algorithms in various settings as a proof of concept.
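As a concrete reference for the recursive gradient estimator $v_s$ analyzed in this section, here is a minimal single-machine sketch of a SARAH-style inner loop on a toy strongly convex problem; the data and names are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 5
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_sample(x, i):
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / N

def sarah_inner(y0, eta=0.01, m=1000):
    """One outer loop: v_s = grad_z(y_s) - grad_z(y_{s-1}) + v_{s-1}."""
    y_prev, v = y0.copy(), full_grad(y0)   # v_0 is the exact (local) gradient
    y = y_prev - eta * v
    for _ in range(m - 1):
        i = rng.integers(N)
        v = grad_sample(y, i) - grad_sample(y_prev, i) + v   # recursive estimator
        y_prev, y = y, y - eta * v
    return y

y = np.zeros(d)
for _ in range(15):    # outer rounds
    y = sarah_inner(y)
print(np.linalg.norm(full_grad(y)))
```

Unlike SVRG, the estimator is biased conditionally on the past but its accumulated error telescopes, which is exactly what Lemma 8 quantifies in the analysis.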
Consider $\ell_2$-regularized logistic regression, where the sample loss is defined as
\[
\ell(x; z_i) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2, \tag{16}
\]
with the data $z_i = (a_i, b_i) \in \mathbb{R}^d \times \{\pm 1\}$. We evaluate the performance on the gisette dataset [11], by splitting the data equally across all workers. We scale the data so that $\max_{i\in[N]}\|a_i\| = 1$, and the smoothness parameter is then estimated as $L = 1/4 + \lambda$. We choose $\lambda = N^{-0.5}$, $N^{-0.75}$ and $N^{-1}$ to illustrate the performance under different condition numbers. We use the optimality gap, defined as $f(\tilde{x}^t) - f^*$, to illustrate the convergence behavior.

For D-SVRG and D-SARAH, the step size is set as $\eta = 1/(2L)$. For D-MiG, although the choice of $\omega$ in the theory requires knowledge of $c$, we simply ignore it and set $\omega = 1 + \eta\sigma$, $\theta = 1/2$ and the step size $\eta = 1/(3\theta L)$, to reflect the robustness of the practical performance to the parameters. We further use $\tilde{x}^{t+1} = \frac{1}{n}\sum_{k=1}^n y^{t+1}_k$ at the PS for better empirical performance. For D-AGD, the step size is set as $\eta = 1/L$ and the momentum parameter is set as $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. Following [13], which sets the number of inner loop iterations as $m = 2N$, we set $m \approx 2N/n$ to ensure the same number of total inner iterations.
(a) $\lambda = N^{-0.5}$ (b) $\lambda = N^{-0.75}$ (c) $\lambda = N^{-1}$

Figure 1: The optimality gap on $\ell_2$-regularized logistic regression with respect to the number of communication rounds with 4 workers using the gisette dataset under different conditioning for different algorithms.
(a) $n = 4$ (b) $n = 8$ (c) $n = 16$

Figure 2: The optimality gap on $\ell_2$-regularized logistic regression with respect to the number of communication rounds with different numbers of workers using the gisette dataset for different algorithms when $\lambda = N^{-1}$.

Such parameters can be further tuned to achieve a better trade-off between communication cost and computation cost in practice.

Fig. 1 illustrates the optimality gap of the various algorithms with respect to the number of communication rounds with 4 local workers under different conditioning, and Fig. 2 shows the corresponding results with different numbers of local workers when $\lambda = N^{-1}$. The distributed stochastic variance-reduced algorithms outperform distributed AGD significantly. In addition, D-MiG outperforms D-SVRG and D-SARAH when the condition number is large.

We justify the benefit of regularization by evaluating the proposed algorithms under unbalanced data allocation. We assign 50%, 30%, 19.9% and 0.1% of the data to four workers, respectively, and set $\lambda = N^{-1}$ in the logistic regression loss (16). To deal with the unbalanced data, we perform the regularized update, given in (3), on the worker with the least amount of data, and keep the updates on the rest of the workers unchanged. A similar regularized update can be conceived for D-SARAH and D-MiG, resulting in the regularized variants D-RSARAH and D-RMiG. While our theory does not cover them, we still evaluate their numerical performance. We set $\mu$ on this worker according to its amount of data, proportionally to the inverse square root of its local sample size. We set the number of iterations at the workers as $m = 2N$ on all agents. Fig. 3 shows the optimality gap with respect to the number of communication rounds for all algorithms. It can be seen that all unregularized methods fail to converge, while the regularized algorithms still converge, verifying the role of regularization in addressing unbalanced data.
It is also worth mentioning that the regularization can be flexibly imposed depending on the local data size, rather than homogeneously across all workers.
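For completeness, the $\ell_2$-regularized logistic loss (16) used in these experiments, and its gradient, can be sketched and verified against finite differences as follows; the random toy data here stand in for gisette, and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 8
Adata = rng.standard_normal((N, d))
Adata /= np.linalg.norm(Adata, axis=1).max()   # scale so that max_i ||a_i|| = 1
bdata = rng.choice([-1.0, 1.0], size=N)
lam = 1.0 / N

def loss(x):
    # f(x) = (1/N) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2
    return np.mean(np.log1p(np.exp(-bdata * (Adata @ x)))) + 0.5 * lam * x @ x

def grad(x):
    s = 1.0 / (1.0 + np.exp(bdata * (Adata @ x)))   # sigmoid(-b_i a_i^T x)
    return Adata.T @ (-bdata * s) / N + lam * x

# central finite-difference check of the gradient
x = rng.standard_normal(d)
g, eps = grad(x), 1e-6
fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(g - fd)))   # tiny discrepancy
```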
Figure 3: The optimality gap with respect to the number of communication rounds for highly unbalanced data allocation. It can be seen that the regularized variants of the distributed stochastic variance-reduced algorithms still converge while the unregularized ones no longer converge.
We follow the same setting as [34] to evaluate D-SARAH and distributed gradient descent (D-GD) on the gisette dataset with a nonconvex sample loss function:
\[
\ell_{\mathrm{ncvx}}(x; z_i) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \lambda\sum_{j=1}^d \frac{x_j^2}{1 + x_j^2},
\]
which consists of the logistic loss and a nonconvex regularizer, where $x_j$ is the $j$th entry of $x$. The smoothness parameter of $\ell_{\mathrm{ncvx}}(x; z_i)$ can be estimated as $L = 1/4 + 2\lambda$. Fig. 4 plots the squared norm of the gradient, $\|\nabla f(\tilde{x}^t)\|^2$, for D-SARAH and D-GD with respect to the number of communication rounds. It can be seen that D-SARAH achieves a much lower gradient norm than D-GD within the same number of communication rounds.
Figure 4: The squared norm of the gradient with respect to the number of communication rounds on the gisette dataset with 4 workers using a nonconvex loss function.
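A sketch of this nonconvex sample loss with a finite-difference check of its gradient; the data and the value of $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
a = rng.standard_normal(d)
a /= np.linalg.norm(a)      # mimic the scaling ||a_i|| <= 1
b_lbl, lam = 1.0, 0.1

def loss_ncvx(x):
    # logistic loss plus the nonconvex regularizer lam * sum_j x_j^2 / (1 + x_j^2)
    return np.log1p(np.exp(-b_lbl * (a @ x))) + lam * np.sum(x**2 / (1 + x**2))

def grad_ncvx(x):
    s = 1.0 / (1.0 + np.exp(b_lbl * (a @ x)))        # sigmoid(-b a^T x)
    return -b_lbl * s * a + lam * 2 * x / (1 + x**2) ** 2

x = rng.standard_normal(d)
eps = 1e-6
fd = np.array([(loss_ncvx(x + eps * e) - loss_ncvx(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(grad_ncvx(x) - fd)))   # tiny discrepancy
```

The regularizer term $x^2/(1+x^2)$ has second derivative bounded by 2, which is where the $2\lambda$ contribution to the smoothness estimate comes from.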
In this paper, we have developed a convergence theory for a family of distributed stochastic variance reduced methods without sampling extra data, under a mild distributed smoothness assumption that measures the discrepancy between the local and global loss functions. Convergence guarantees are obtained for distributed stochastic variance reduced methods using acceleration and recursive gradient updates, and for minimizing both strongly convex and nonconvex losses. We also suggest regularization as a means of ensuring convergence when the local data are unbalanced and heterogeneous. We believe the analysis framework is useful for studying distributed variants of other stochastic variance-reduced methods such as Katyusha [3], and proximal variants such as [37].
Acknowledgements
The work of S. Cen was partly done when visiting MSRA. The work of S. Cen and Y. Chi is supported in partby National Science Foundation under the grant CCF-1806154, Office of Naval Research under the grantsN00014-18-1-2142 and N00014-19-1-2404, and Army Research Office under the grant W911NF-18-1-0303.
References [1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd viagradient quantization and encoding. In
Advances in Neural Information Processing Systems , pages1709–1720, 2017.[2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence ofsparsified gradient methods. In
Advances in Neural Information Processing Systems , pages 5973–5983,2018.[3] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In
Proceedings ofthe 49th Annual ACM SIGACT Symposium on Theory of Computing , pages 1200–1205. ACM, 2017.[4] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. Signsgd: Compressed optimisationfor non-convex problems. In
International Conference on Machine Learning , pages 559–568, 2018.[5] D. P. Bertsekas and J. N. Tsitsiklis.
Parallel and distributed computation: numerical methods , volume 23.Prentice hall Englewood Cliffs, NJ, 1989.[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.[7] S. De and T. Goldstein. Efficient distributed sgd with variance reduction. In , pages 111–120. IEEE, 2016.[8] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.[9] J. Fan, Y. Guo, and K. Wang. Communication-efficient accurate statistical estimation. arXiv preprint arXiv:1906.04870, 2019.[10] C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In
Advances in Neural Information Processing Systems , pages687–697, 2018.[11] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the nips 2003 feature selection challenge.In
Advances in neural information processing systems , pages 545–552, 2005.[12] B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: Aunified analysis of svrg and katyusha using semidefinite programs.
International Conference on MachineLearning (ICML) , 2018.[13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction.In
Advances in neural information processing systems, pages 315–323, 2013.[14] J. Konečný, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.[15] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement.
The Journal of Machine Learning Research , 18(1):4404–4446,2017.[16] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integralquadratic constraints.
SIAM Journal on Optimization , 26(1):57–95, 2016.[17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms out-perform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In
Advances in Neural Information Processing Systems , pages 5330–5340, 2017.[18] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In
Advances inNeural Information Processing Systems , pages 3384–3392, 2015.[19] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the commu-nication bandwidth for distributed training. In
International Conference on Learning Representations ,2018.[20] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex losses.
The Annals ofStatistics , 46(6A):2747–2774, 2018.[21] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learningproblems using stochastic recursive gradient. In
International Conference on Machine Learning , pages2613–2621, 2017.[22] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sumsmooth optimization with sarah. arXiv preprint arXiv:1901.07648 , 2019.[23] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochasticgradient descent. In
Advances in neural information processing systems , pages 693–701, 2011.[24] S. J. Reddi, J. Konečn`y, P. Richtárik, B. Póczós, and A. Smola. Aide: fast and communication efficientdistributed optimization. arXiv preprint arXiv:1608.06879 , 2016.[25] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient.
Mathematical Programming , 162(1-2):83–112, 2017.[26] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application todata-parallel distributed training of speech dnns. In
Fifteenth Annual Conference of the InternationalSpeech Communication Association , 2014.[27] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss mini-mization.
Journal of Machine Learning Research , 14(Feb):567–599, 2013.[28] O. Shamir. Without-replacement sampling for stochastic gradient methods. In
Advances in NeuralInformation Processing Systems , pages 46–54, 2016.[29] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an ap-proximate newton-type method. In
International conference on machine learning , pages 1000–1008,2014.[30] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. Cocoa: A general frameworkfor communication-efficient distributed optimization.
Journal of Machine Learning Research, 18:230, 2018.[31] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D²: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.[32] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In
Conference on Learning Theory , pages 1882–1919, 2017.[33] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. Giant: Globally improved approximatenewton method for distributed optimization. In
Advances in Neural Information Processing Systems ,pages 2338–2348, 2018.[34] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost: A class of faster variance-reducedalgorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690 , 2018.[35] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient dis-tributed optimization. In
Advances in Neural Information Processing Systems , pages 1299–1309, 2018.[36] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reducecommunication in distributed deep learning. In
Advances in neural information processing systems ,pages 1509–1519, 2017.[37] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization , 24(4):2057–2075, 2014.[38] Y. Zhang and X. Lin. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss.
Interna-tional Conference on Machine Learning , pages 362–370, 2015.[39] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical opti-mization. In
Advances in Neural Information Processing Systems , pages 1502–1510, 2012.[40] S. Zhao, G.-D. Zhang, M.-W. Li, and W.-J. Li. Proximal scope for distributed sparse learning. In
Advances in Neural Information Processing Systems , pages 6552–6561, 2018.[41] S.-Y. Zhao, R. Xiang, Y.-H. Shi, P. Gao, and W.-J. Li. Scope: scalable composite optimization forlearning on spark. In
Thirty-First AAAI Conference on Artificial Intelligence , 2017.[42] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergencerates. In
International Conference on Machine Learning , pages 5975–5984, 2018.[43] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In
Advancesin neural information processing systems , pages 2595–2603, 2010.
A Preliminary
We first establish a lemma which will be useful later.
Lemma 3.
When Assumptions 1, 2 and one of the distributed smoothness assumptions (Assumption 4a or 4b) hold, we have
\[
\mathbb{E}_z\big[\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2\big] \le 2L\, D_f(x_1, x_2) +
\begin{cases}
2cL\big(\|x_1 - x^*\|^2 + \|x_2 - x^*\|^2\big) & \text{under Assumption 4a,} \\
cL\,\|x_1 - x_2\|^2 & \text{under Assumption 4b,}
\end{cases}
\]
where the expectation is evaluated over $z$ drawn uniformly from $\mathcal{M}_k$, and $D_f(x_1, x_2) = f(x_1) - f(x_2) - \langle \nabla f(x_2), x_1 - x_2\rangle$ is the Bregman divergence.

Proof. Given that $f$ is $L$-smooth and convex, the Bregman divergence $D_f(x_1, x_2)$ is $L$-smooth and convex as a function of $x_1$. When Assumptions 1 and 2 hold, we have
\[
0 \le D_{\ell_z}(x_1, x_2) - \frac{1}{2L}\big\|\nabla_{x_1} D_{\ell_z}(x_1, x_2)\big\|^2 = D_{\ell_z}(x_1, x_2) - \frac{1}{2L}\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2.
\]
Averaging over $z \in \mathcal{M}_k$ gives
\[
2L\, D_{f_k}(x_1, x_2) \ge \mathbb{E}_z\big[\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2\big]. \tag{17}
\]
To further bound the left-hand side, Assumption 4a allows us to compare $D_f$ and $D_{f_k}$:
\begin{align*}
|D_{f_k}(x_1, x_2) - D_f(x_1, x_2)|
&= \Big|D_{f - f_k}(x_1, x^*) + D_{f - f_k}(x^*, x_2) + \big\langle \nabla(f - f_k)(x^*) - \nabla(f - f_k)(x_2), x_1 - x^*\big\rangle\Big| \\
&\le \frac{c}{2}\|x_1 - x^*\|^2 + \frac{c}{2}\|x^* - x_2\|^2 + c\|x^* - x_2\|\|x_1 - x^*\| \\
&\le c\big(\|x_1 - x^*\|^2 + \|x_2 - x^*\|^2\big).
\end{align*}
Following similar arguments, using Assumption 4b we obtain a tighter bound by replacing $x^*$ with any $\tilde{x}$. In particular, setting $\tilde{x} = (x_1 + x_2)/2$, we have $|D_{f_k}(x_1, x_2) - D_f(x_1, x_2)| \le c\|x_1 - x_2\|^2/2$. Combining the above estimates with (17) proves the lemma.
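Lemma 3 can be sanity-checked numerically on quadratic sample losses, for which the Bregman divergence and the restricted smoothness constant are available in closed form. A minimal sketch under these assumptions, reading the Assumption-4b constant $c$ as the smoothness of $f - f_k$ (toy data, illustrative names):

```python
import numpy as np

# Quadratic sample losses ell_z(x) = (1/2)(a_z^T x - b_z)^2 on N samples,
# with worker k holding the first N/n of them.
rng = np.random.default_rng(5)
N, d, n = 120, 4, 4
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
Ak = A[:N // n]                              # data of worker k

H, Hk = A.T @ A / N, Ak.T @ Ak / (N // n)    # Hessians of f and f_k
L = max(np.linalg.norm(a) ** 2 for a in A)   # smoothness of the sample losses
c = np.linalg.norm(H - Hk, 2)                # spectral norm of the Hessian gap

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
delta = x1 - x2

# E_z ||grad ell_z(x1) - grad ell_z(x2)||^2 over worker k's samples
lhs = np.mean([(a @ delta) ** 2 * (a @ a) for a in Ak])
bregman = 0.5 * delta @ H @ delta            # D_f(x1, x2) for a quadratic f
rhs = 2 * L * bregman + c * L * (delta @ delta)
print(lhs <= rhs)  # True
```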
B Proof for D-SVRG
B.1 Proof of Lemma 1
For $\mathbb{E}[S_1]$, we apply Lemma 3 directly:
\[
\mathbb{E}[S_1] = \mathbb{E}\big[\|\nabla \ell_z(y_s) - \nabla \ell_z(x^*)\|^2\big] \le 2L\,\mathbb{E}[f(y_s) - f(x^*)] + 2cL\,\mathbb{E}\big[\|y_s - x^*\|^2\big],
\]
where the inequality follows from $D_f(y_s, x^*) = f(y_s) - f(x^*)$, since $\nabla f(x^*) = 0$. For $\mathbb{E}[S_2]$, we have
\begin{align*}
\mathbb{E}[S_2] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) - (\nabla f_k(x^*) - \nabla f_k(y^0))\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le 4L\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + c(4L + 2c)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where the first inequality is due to $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second inequality follows from evaluating the expectation and Assumption 4a (note that $\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0) = \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*)$ since $\nabla f(x^*) = 0$), and the last step uses Lemma 3 again. For $\mathbb{E}[S_3]$, we have
\begin{align*}
\mathbb{E}[S_3] &= -2\,\mathbb{E}\big[\langle y_s - x^*, \nabla \ell_z(y_s) - \nabla \ell_z(y^0) + \nabla f(y^0)\rangle\big] \\
&= 2\,\mathbb{E}\big[-\langle y_s - x^*, \nabla f(y_s)\rangle\big] - 2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\rangle\big] \\
&\le -2\,\mathbb{E}[f(y_s) - f(x^*)] + 2c\,\mathbb{E}\big[\|y_s - x^*\|\big(\|y_s - x^*\| + \|y^0 - x^*\|\big)\big] \\
&\le -2\,\mathbb{E}[f(y_s) - f(x^*)] + 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where the first inequality is obtained by applying Assumption 2, the Cauchy-Schwarz inequality and Assumption 4a.

B.2 Proof of Lemma 2

Setting $P = I_d$ and $\rho = 1$ in (8), the condition becomes equivalent to (12), and consequently the dissipation inequality (5) holds.
In view of Lemma 1, it can be written as
\begin{align*}
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] &\le (1 + 2cL\lambda_1 + 3c\lambda_3)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] - (2\lambda_3 - 2L\lambda_1)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\big[(4L + 2c)\lambda_2 + \lambda_3\big]\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
Since $f$ is $\sigma$-strongly convex, $\|y_s - x^*\|^2 \le \frac{2}{\sigma}(f(y_s) - f^*)$, so the above becomes
\[
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] \le \mathbb{E}\big[\|y_s - x^*\|^2\big] - \gamma_1\,\mathbb{E}[f(y_s) - f^*] + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + \gamma_2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\]
where $\gamma_1 = 2\lambda_3 - 2L\lambda_1 - (4L\lambda_1 + 6\lambda_3)c/\sigma$ and $\gamma_2 = c[(4L + 2c)\lambda_2 + \lambda_3]$ are introduced as short-hand notations. In addition, $\gamma_1 > 0$ by assumption. Telescoping the above inequality by summing over $s = 0, \ldots, m-1$, we have
\[
\gamma_1 \sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*] \le (1 + \gamma_2 m)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 4L\lambda_2 m\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
Note that the choice of $y^+$ implies
\[
\mathbb{E}\big[f(y^+) - f^*\big] = \frac{1}{m}\sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*].
\]
Therefore,
\[
\gamma_1\,\mathbb{E}\big[f(y^+) - f^*\big] \le (1/m + \gamma_2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
We obtain the final result by substituting $\mathbb{E}\big[\|y^0 - x^*\|^2\big] \le \frac{2}{\sigma}\,\mathbb{E}\big[f(y^0) - f^*\big]$ into the above inequality.
When Assumption 2 does not hold, we can still use similar arguments as the proof of Theorem 1 and establishthe convergence of D-SVRG, though at a slower rate. Using the same supply rates (10), Lemma 1 can bemodified as below.
Lemma 4.
Suppose that Assumptions 1, 3 and 4a hold. For the supply rates defined in (10), we have
\begin{align*}
\mathbb{E}[S_1] &\le 2L^2\sigma^{-1}\,\mathbb{E}[f(y_s) - f^*], \\
\mathbb{E}[S_2] &\le 4(L^2 + c^2)\sigma^{-1}\,\mathbb{E}\big[f(y^0) - f^*\big], \\
\mathbb{E}[S_3] &\le -2\,\mathbb{E}[f(y_s) - f^*] + 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
With the $L$-smoothness of $\ell_z$ we have the following estimate:
\[
\mathbb{E}_z\big[\|\nabla \ell_z(y_1) - \nabla \ell_z(y_2)\|^2\big] \le L^2\|y_1 - y_2\|^2.
\]
So we have
\[
\mathbb{E}[S_1] = \mathbb{E}\big[\|\nabla \ell_z(y_s) - \nabla \ell_z(x^*)\|^2\big] \le L^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] \le 2L^2\sigma^{-1}\,\mathbb{E}[f(y_s) - f^*]
\]
and
\begin{align*}
\mathbb{E}[S_2] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) - (\nabla f_k(x^*) - \nabla f_k(y^0))\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le 2(L^2 + c^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \le 4(L^2 + c^2)\sigma^{-1}\,\mathbb{E}\big[f(y^0) - f^*\big].
\end{align*}
The bound on $\mathbb{E}[S_3]$ is identical to that in Lemma 1.

Following the same process as in the proof of Lemma 2 in Appendix B.2, we have the following inequality with proper choices of $\lambda_1$, $\lambda_2$ and $\lambda_3$:
\begin{align*}
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] &\le (1 + 3c\lambda_3)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] - (2\lambda_3 - 2L^2\sigma^{-1}\lambda_1)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4(L^2 + c^2)\sigma^{-1}\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\lambda_3\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le \mathbb{E}\big[\|y_s - x^*\|^2\big] - \big(2\lambda_3 - 2L^2\sigma^{-1}\lambda_1 - 6c\sigma^{-1}\lambda_3\big)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4(L^2 + c^2)\sigma^{-1}\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\lambda_3\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
By summing the inequality and letting $\lambda_1 = \lambda_2 = 2\eta^2$, $\lambda_3 = \eta$, we have
\begin{align*}
2\eta\big(1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1}\big)\sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*]
&\le (1 + c\eta m)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 8(L^2 + c^2)\sigma^{-1}\eta^2 m\,\mathbb{E}\big[f(y^0) - f^*\big] \\
&\le \big(2\sigma^{-1} + 2c\eta m\sigma^{-1} + 8(L^2 + c^2)\sigma^{-1}\eta^2 m\big)\,\mathbb{E}\big[f(y^0) - f^*\big].
\end{align*}
Therefore, with $1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1} > 0$, the following convergence bound can be established:
\[
\mathbb{E}\big[f(y^+) - f^*\big] \le \frac{(\eta\sigma m)^{-1} + c\sigma^{-1} + 4(L^2 + c^2)\sigma^{-1}\eta}{1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1}}\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
By choosing $\eta = (1 - 3c/\sigma)(40\kappa L)^{-1}$ and $m = 160\kappa^2(1 - 3c/\sigma)^{-2}$, we get a convergence rate no more than $1 - \frac{1}{4}\cdot\frac{\sigma - 3c}{\sigma - c}$. Hence the overall time complexity to find an $\epsilon$-optimal solution is $O\big((N/n + \zeta^{-2}\kappa^2)\zeta^{-1}\log(1/\epsilon)\big)$, where $\zeta = 1 - 3c/\sigma$.

B.4 Convergence of D-SVRG with Option I
Another option for the output of the inner loops of SVRG, i.e. the output of the local workers, is to output the last iterate, $y^{t+1}_k = y^{t,m}_k$, which is called "Option I" in [13]. Here, we establish the convergence of D-SVRG using Option I in the following theorem.

Theorem 8 (D-SVRG with Option I). Suppose that Assumptions 1, 2 and 3 hold, and Assumption 4a holds with $c < \sigma/2$. With sufficiently large $m$ and sufficiently small step size $\eta$, there exists $0 \le \nu < 1$ such that
\[
\mathbb{E}\big[\|\tilde{x}^{t+1} - x^*\|^2\big] < \nu\,\mathbb{E}\big[\|\tilde{x}^t - x^*\|^2\big].
\]
Theorem 8 indicates that the iterates $\tilde{x}^t$ of D-SVRG with Option I converge to the minimizer $x^*$ linearly in expectation as long as $c$ is sufficiently small. The proof is outlined in Section 6.1. By taking $m \to \infty$ and $\eta \to 0$, the rate approaches $\nu_\infty := \frac{c}{2\sigma - 3c}$, which suggests that the algorithm admits faster convergence as $c$ decreases, as expected.

Following [12], we consider the following four supply rates:
\[
\bar{X}_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\bar{X}_2 = \begin{bmatrix} 2\sigma & -1 & -1 \\ -1 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}, \quad
\bar{X}_3 = \begin{bmatrix} 0 & -L & 0 \\ -L & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad
\bar{X}_4 = \begin{bmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}. \tag{18}
\]
We have the following lemma, which is proved at the end of this subsection.

Lemma 5. Suppose that Assumptions 1, 2, 3 and 4a hold. For the supply rates defined in (18), we have
\begin{align*}
\mathbb{E}[S_1] &\le (L^2 + 2cL - \sigma^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big], \\
\mathbb{E}[S_2] &\le 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big], \\
\mathbb{E}[S_3] &\le 0, \\
\mathbb{E}[S_4] &\le c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
Therefore, by choosing $P = I_d$, $\lambda_1 = 2\eta^2$, $\lambda_2 = \eta - L\eta^2$, $\lambda_3 = \eta^2$, $\lambda_4 = L\eta^2$ and $\rho^2 = 1 - 2\sigma(\eta - L\eta^2)$, the condition (8) holds:
\[
\begin{bmatrix} 0 & 0 & 0 \\ 0 & -\eta^2 & \eta^2 \\ 0 & \eta^2 & -\eta^2 \end{bmatrix} \preceq 0.
\]
This immediately leads to the inequality (7), which reads
\[
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] \le \big(1 - (2\sigma - 3c)(\eta - L\eta^2) + cL\eta^2\big)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + \big(2\eta^2(L^2 + 2cL - \sigma^2) + c\eta\big)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big]. \tag{19}
\]
Let $\tilde{\rho}^2 = 1 - (2\sigma - 3c)(\eta - L\eta^2) + cL\eta^2$.
Telescoping the inequality over $s = 0, 1, \cdots, m-1$ leads to
\[
\mathbb{E}\big[\|y_m - x^*\|^2\big] \le \left(\tilde{\rho}^{2m} + \frac{2\eta(L^2 + 2cL - \sigma^2) + c}{(2\sigma - 3c)(1 - L\eta) - cL\eta}\right)\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\]
Note that when $m \to \infty$ and $\eta \to 0$, the rate becomes $\nu := \frac{c}{2\sigma - 3c}$; hence we need $c < \sigma/2$ to get a rate $\nu < 1$. We have
\[
\mathbb{E}\big[\|\tilde{x}^{t+1} - x^*\|^2\big] \le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\big[\|y^{t,m}_k - x^*\|^2\big] \le \frac{1}{n}\sum_{k=1}^n \nu\,\mathbb{E}\big[\|y^{t,0}_k - x^*\|^2\big] = \nu\,\mathbb{E}\big[\|\tilde{x}^t - x^*\|^2\big].
\]

Proof of Lemma 5.
The following inequalities can be viewed as combinations of standard inequalities in convex optimization (co-coercivity, etc.) and the characterization of restricted smoothness. For $S_1$:
\begin{align*}
\mathbb{E}[S_1] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2\,\mathbb{E}\big[\langle \nabla \ell_z(x^*) - \nabla \ell_z(y^0), \nabla f(y^0)\rangle\big] + \mathbb{E}\big[\|\nabla f(y^0)\|^2\big] \\
&\le L^2\,\mathbb{E}\big[\|x^* - y^0\|^2\big] + 2\,\mathbb{E}\big[\langle \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*), \nabla f(y^0)\rangle\big] - \mathbb{E}\big[\|\nabla f(y^0)\|^2\big] \\
&\le (L^2 + 2cL - \sigma^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where we used $\mathbb{E}_z[\nabla \ell_z(x^*) - \nabla \ell_z(y^0)] = \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*) - \nabla f(y^0)$, Assumption 4a, and $\sigma\|y^0 - x^*\| \le \|\nabla f(y^0)\| \le L\|y^0 - x^*\|$. For $S_2$:
\begin{align*}
\mathbb{E}[S_2] &= 2\,\mathbb{E}\big[\sigma\|y_s - x^*\|^2 - \langle y_s - x^*, \nabla \ell_z(y_s) - \nabla \ell_z(y^0) + \nabla f(y^0)\rangle\big] \\
&= 2\,\mathbb{E}\big[\sigma\|y_s - x^*\|^2 - \langle y_s - x^*, \nabla f(y_s)\rangle\big] - 2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\rangle\big] \\
&\le 2c\,\mathbb{E}\big[\|y_s - x^*\|\big(\|y_s - x^*\| + \|y^0 - x^*\|\big)\big] \\
&\le 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
$\mathbb{E}[S_3] \le 0$ is simply a restatement of the co-coercivity of the $L$-smooth convex losses $\ell(\cdot; z)$, $z \in \mathcal{M}_k$. Finally,
\begin{align*}
\mathbb{E}[S_4] &= -2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*)\rangle\big] \\
&\le 2c\,\mathbb{E}\big[\|y_s - x^*\|\,\|y^0 - x^*\|\big] \le c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}

C Proof for D-SARAH
C.1 Proof of Theorem 6
To begin, we cite two supporting lemmas from [21].
Lemma 6 ([21]). Suppose that Assumption 1 holds. Then
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(y_m)\big] + \sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] - (1 - L\eta)\sum_{s=0}^{m-1}\mathbb{E}\big[\|v_s\|^2\big].
\]

Lemma 7 ([21]). Suppose that Assumptions 1 and 2 hold and $\eta < 2/L$. Then
\[
\mathbb{E}\big[\|v_s - v_{s-1}\|^2\big] \le \frac{\eta L}{2 - \eta L}\Big[\mathbb{E}\big[\|v_{s-1}\|^2\big] - \mathbb{E}\big[\|v_s\|^2\big]\Big].
\]

We also present a new lemma below, with the proof given in Appendix C.3.
Lemma 8.
The update rule of D-SARAH satisfies
\[
\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] = \sum_{j=1}^{s}\mathbb{E}\big[\|v_j - v_{j-1}\|^2\big] - \sum_{j=1}^{s}\mathbb{E}\big[\|\nabla f_k(y_j) - \nabla f_k(y_{j-1})\|^2\big].
\]
By combining Lemmas 7 and 8, we have
\[
\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] \le \sum_{j=1}^{s}\mathbb{E}\big[\|v_j - v_{j-1}\|^2\big] \le \frac{\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big]. \tag{20}
\]
By Assumption 4a, we have
\[
\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - \nabla f(y_s)\| = \|\nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\| \le c\|y_s - x^*\| + c\|y^0 - x^*\|. \tag{21}
\]
Therefore, combining (20) and (21), we have
\begin{align*}
\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] &\le 2\,\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - \nabla f(y_s)\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] \\
&\le 4c^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + 4c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + \frac{2\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big].
\end{align*}
Substituting this into Lemma 6 gives
\begin{align*}
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] &\le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + \sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] \\
&\le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + \sum_{s=0}^{m-1}\left(4c^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + 4c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + \frac{2\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big]\right).
\end{align*}
Since $f$ is $\sigma$-strongly convex, we have $4c^2\|y_s - x^*\|^2 \le \frac{4c^2}{\sigma^2}\|\nabla f(y_s)\|^2$. Denote by $y^+$ the local update, which is selected from $y_0, \cdots, y_{m-1}$ uniformly at random. We have
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(y^+)\|^2\big] = \left(1 - \frac{4c^2}{\sigma^2}\right)\frac{1}{m}\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(y^0)\|^2\big].
\]
Since $\tilde{x}^{t+1}$ is randomly chosen from the local outputs $\{y^{t+1}_k, 1 \le k \le n\}$, we have
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
Recall Lemma 6. The theorem follows if
$$\sum_{s=0}^{m-1}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\le 0.$$
The rest of this proof is thus dedicated to showing the above inequality. Note that
$$\begin{aligned}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]&\le 2\,\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-\nabla f(y_s)\right\|^2\right]+2\,\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\right]\\&\le 2c^2\,\mathbb{E}\left[\left\|y_0-y_s\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right]\\&\le 2c^2s\sum_{j=1}^{s}\mathbb{E}\left[\left\|y_j-y_{j-1}\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right]\\&=2c^2s\eta^2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_{j-1}\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right],\end{aligned}$$
where the second inequality follows from Lemma 8 and Assumption 4b, the third inequality follows from $y_0-y_s=-\sum_{j=1}^{s}(y_j-y_{j-1})$ and the Cauchy–Schwarz inequality, and the last line follows from the update rule $y_j=y_{j-1}-\eta v_{j-1}$. The $L$-smoothness of $\ell_z$ implies that
$$\left\|v_j-v_{j-1}\right\|=\left\|\nabla\ell_z(y_j)-\nabla\ell_z(y_{j-1})\right\|\le L\left\|y_j-y_{j-1}\right\|=L\eta\left\|v_{j-1}\right\|.$$
So we have
$$\begin{aligned}&\sum_{s=0}^{m-1}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\\&\le\sum_{s=1}^{m-1}\left(2c^2s+2L^2\right)\eta^2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_{j-1}\right\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\\&\le\sum_{s=0}^{m-1}\left(m(m-1)c^2\eta^2+2L^2\eta^2(m-1)-(1-L\eta)\right)\mathbb{E}\left[\|v_s\|^2\right].\end{aligned}$$
Therefore, with $0<\eta\le\frac{2}{L+\sqrt{L^2+4(m-1)(mc^2+2L^2)}}$, we have $m(m-1)c^2\eta^2+2L^2\eta^2(m-1)-(1-L\eta)\le 0$ and the proof is finished.

C.3 Proof of Lemma 8

First, we write
$$\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s=\left[\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right]+\left[\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right]-\left[v_s-v_{s-1}\right].$$
Let $\mathcal{F}_s$ denote the $\sigma$-algebra generated by all random sample selections in sub-iterations $1,\cdots,s-1$.
We have
$$\begin{aligned}&\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\,\middle|\,\mathcal{F}_s\right]\\&=\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2+\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\,\middle|\,\mathcal{F}_s\right]\\&\quad+2\left\langle\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1},\,\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\rangle\\&\quad-2\left\langle\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1},\,\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]\right\rangle\\&\quad-2\left\langle\nabla f_k(y_s)-\nabla f_k(y_{s-1}),\,\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]\right\rangle\\&=\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2-\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\,\middle|\,\mathcal{F}_s\right],\end{aligned}$$
where the second equality follows from
$$\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]=\mathbb{E}\left[\nabla\ell_z(y_s)-\nabla\ell_z(y_{s-1})\,\middle|\,\mathcal{F}_s\right]=\nabla f_k(y_s)-\nabla f_k(y_{s-1}).$$
Taking expectation over $\mathcal{F}_s$ gives
$$\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\right]=\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2\right]-\mathbb{E}\left[\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2\right]+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\right].$$
Hence, telescoping the above equality (the term for $s=0$ vanishes since $v_0=\nabla f(y_0)$), we obtain the claimed result.
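The D-SARAH outer/inner structure analyzed above can be sanity-checked numerically. The sketch below runs SARAH inner loops on one worker's shard, seeded each round with the exact global gradient as $v_0$, on a synthetic noiseless least-squares problem. All problem sizes, the step size, and the one-worker-per-round schedule are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

# Minimal D-SARAH sketch on synthetic least squares (illustrative parameters).
rng = np.random.default_rng(0)
d, n_machines, local_N = 5, 4, 200
A = [rng.standard_normal((local_N, d)) for _ in range(n_machines)]
x_star = rng.standard_normal(d)
b = [Ak @ x_star for Ak in A]                 # noiseless: x_star minimizes every f_k

def grad_f(x):
    """Global gradient: average of the local least-squares gradients."""
    return sum(Ak.T @ (Ak @ x - bk) / local_N for Ak, bk in zip(A, b)) / n_machines

def sarah_inner(k, x_tilde, eta=0.05, m=100):
    """Worker k's SARAH inner loop, seeded with the global gradient v_0."""
    y_prev = x_tilde.copy()
    v = grad_f(x_tilde)                       # v_0 = full gradient from the server
    y = y_prev - eta * v
    for _ in range(1, m):
        i = rng.integers(local_N)             # sample one local data point z
        a, bi = A[k][i], b[k][i]
        # recursive estimator: v_s = grad l_z(y_s) - grad l_z(y_{s-1}) + v_{s-1}
        v = a * (a @ y - bi) - a * (a @ y_prev - bi) + v
        y_prev, y = y, y - eta * v
    return y

x = np.zeros(d)
g0 = np.linalg.norm(grad_f(x))
for t in range(20):                           # outer loops on the parameter server
    x = sarah_inner(t % n_machines, x)        # one worker per round, for brevity
print(np.linalg.norm(grad_f(x)) / g0)         # gradient-norm ratio shrinks quickly
```

Because the local data are drawn from a common distribution, $f_k\approx f$ (small $c$), and the global gradient norm contracts across communication rounds, as predicted by the strongly convex analysis.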
D Proof for D-MiG (Theorem 3)
As earlier, we simplify the notations $y_k^{t,s}$, $x_k^{t,s}$ and $v_k^{t,s}$ by dropping the superscript $t$ and the subscript $k$. In this section we deal with the composite objective $F(x)=f(x)+g(x)$ as mentioned in Remark 2, where $g$ is a convex and possibly non-smooth function known to all agents. The analysis is carried out by carefully adapting the proof of the centralized algorithm (i.e., [42, Section B.1]) to the distributed smoothness assumption. We impose the following constraint on the step size $\eta$:
$$L\theta+\frac{L\theta}{1-\theta}+2c\le\frac{1}{\eta}.\tag{22}$$
We restate the inequalities (8) and (9) of [42], with notation changed to match our context:
$$f(y_{s-1})-f(u)\le\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle+\left\langle\nabla f(y_{s-1}),x_{s-1}-u\right\rangle,\tag{23}$$
$$\left\langle\nabla f(y_{s-1}),x_{s-1}-u\right\rangle=\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_{s-1}-u\right\rangle+\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle+\left\langle\tilde\nabla,x_s-u\right\rangle,\tag{24}$$
where $\tilde\nabla=\nabla\ell_z(y_{s-1})-\nabla\ell_z(\tilde{x})+\nabla f(\tilde{x})$, with $z$ randomly selected from $\mathcal{M}_k$, and $u\in\mathbb{R}^d$ is an arbitrary vector. Following the $L$-smoothness argument in [42], (24) leads to
$$\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle\le\frac{1}{\theta}\left(f(y_{s-1})-f(y_s)\right)+\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-x_{s-1}\right\rangle+\frac{L\theta}{2}\left\|x_s-x_{s-1}\right\|^2.$$
By plugging in the constraint (22), we have
$$\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle\le\frac{1}{\theta}\left(f(y_{s-1})-f(y_s)\right)+\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-x_{s-1}\right\rangle+\left(\frac{1}{2\eta}-\frac{L\theta}{2(1-\theta)}-c\right)\left\|x_s-x_{s-1}\right\|^2.\tag{25}$$
By combining (23), (24), (25) and [42, Lemma 3], and then taking expectation over the choice of the random sample $z$, we have
$$\begin{aligned}f(y_{s-1})-f(u)&\le\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-u\right\rangle\right]+\frac{1}{\theta}\left(f(y_{s-1})-\mathbb{E}\left[f(y_s)\right]\right)\\&\quad-\left(\frac{L\theta}{2(1-\theta)}+c\right)\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]+\frac{1}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right]+g(u)-\mathbb{E}\left[g(x_s)\right].\end{aligned}\tag{26}$$
We further split the term $\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-u\right\rangle\right]$ as
$$\begin{aligned}&\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,\,x_s-u\right\rangle\right]\\&=\mathbb{E}\left[\left\langle\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla,\,x_s-x_{s-1}\right\rangle\right]+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,\,x_{s-1}-u\right\rangle\right]\\&\quad+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\nabla f(\tilde{x})+\nabla f_k(\tilde{x})-\nabla f_k(y_{s-1}),\,x_s-x_{s-1}\right\rangle\right]\\&\le\frac{\beta}{2}\,\mathbb{E}\left[\left\|\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla\right\|^2\right]+\frac{1}{2\beta}\,\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]\\&\quad+c\left\|\tilde{x}-y_{s-1}\right\|\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|\right]+c\left\|\tilde{x}-y_{s-1}\right\|\left\|x_{s-1}-u\right\|\\&\le\frac{\beta}{2}\left(2L\,D_f(\tilde{x},y_{s-1})+2cL\left\|\tilde{x}-y_{s-1}\right\|^2\right)+\left(\frac{1}{2\beta}+\frac{c}{2}\right)\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]+c\left\|\tilde{x}-y_{s-1}\right\|^2+\frac{c}{2}\left\|x_{s-1}-u\right\|^2,\end{aligned}$$
where the first inequality is due to the Cauchy–Schwarz inequality and Assumption 4b, with $\beta>0$ satisfying $\frac{1-\theta}{\theta}=L\beta$, and the last inequality is obtained by combining
$$\mathbb{E}\left[\left\|\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla\right\|^2\right]\le\mathbb{E}\left[\left\|\nabla\ell_z(y_{s-1})-\nabla\ell_z(\tilde{x})\right\|^2\right]$$
with Lemma 3 under Assumption 4b, along with the inequality $ab\le(a^2+b^2)/2$. By substituting this bound into (26), and noting that the choice of $\beta$ cancels both $\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle$ (through $\beta L\,D_f(\tilde{x},y_{s-1})$) and $\left(\frac{L\theta}{2(1-\theta)}+c\right)\mathbb{E}\left[\|x_s-x_{s-1}\|^2\right]$, we have, using $\tilde{x}-y_{s-1}=\theta(\tilde{x}-x_{s-1})$ so that $\left(\beta cL+c\right)\|\tilde{x}-y_{s-1}\|^2=c\theta\|\tilde{x}-x_{s-1}\|^2$,
$$\begin{aligned}f(y_{s-1})-f(u)&\le\frac{1-\theta}{\theta}\left(f(\tilde{x})-f(y_{s-1})\right)+c\theta\left\|\tilde{x}-x_{s-1}\right\|^2+\frac{c}{2}\left\|x_{s-1}-u\right\|^2+\frac{1}{\theta}\left(f(y_{s-1})-\mathbb{E}\left[f(y_s)\right]\right)\\&\quad+\frac{1}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right]+g(u)-\mathbb{E}\left[g(x_s)\right].\end{aligned}$$
By rearranging the above inequality, we can cancel the term $f(y_{s-1})$. We further use the convexity of $g$ and $y_s=\theta x_s+(1-\theta)\tilde{x}$, which give $-g(x_s)\le\frac{1-\theta}{\theta}g(\tilde{x})-\frac{1}{\theta}g(y_s)$, leading to
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F(u)\right)\le c\theta\left\|\tilde{x}-x_{s-1}\right\|^2+\frac{1-\theta}{\theta}\left(F(\tilde{x})-F(u)\right)+\frac{1+c\eta}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right].$$
Note that $\left\|\tilde{x}-x_{s-1}\right\|^2\le 2\left\|\tilde{x}-x^*\right\|^2+2\left\|x_{s-1}-x^*\right\|^2$. By setting $u=x^*$ and using the $\sigma$-strong convexity of $F$, which gives $2c\theta\|\tilde{x}-x^*\|^2\le\frac{4c\theta}{\sigma}\left(F(\tilde{x})-F^*\right)$, we have
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F^*\right)\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\left(F(\tilde{x})-F^*\right)+\frac{1+(1+4\theta)c\eta}{2\eta}\left\|x_{s-1}-x^*\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-x^*\right\|^2\right].$$
To simplify the analysis we impose another constraint $\theta\le 1/2$, so that $(1+4\theta)c\eta\le 3c\eta$:
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F^*\right)\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\left(F(\tilde{x})-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_{s-1}-x^*\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-x^*\right\|^2\right].\tag{27}$$
Let $w=\frac{1+\eta\sigma}{1+3c\eta}$.
Multiplying (27) at inner iteration $s+1$ by $w^s$ and then summing over $s=0,\cdots,m-1$, the distance terms telescope since $w(1+3c\eta)=1+\eta\sigma$, and we have
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(y_{s+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\left\|x_m-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x})-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_0-x^*\right\|^2.$$
Adding the superscript $t$ and the subscript $k$ back and applying Jensen's inequality to the definition of $y_k^{t+}$, we get
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(y_k^{t+})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_k^{t,0}-x^*\right\|^2.$$
Averaging the inequality over $k=1,\cdots,n$, we have
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2.$$

Case I: $\frac{m(\sigma-c)}{L}\le\frac{3}{4}$. By setting $\theta=\sqrt{\frac{m(\sigma-c)}{3L}}\le\frac{1}{2}$ and $\eta=\frac{1}{3L\theta+2c}$ we satisfy the step size constraint (22). We aim to show that
$$\frac{1}{\theta}\ge\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)w^m.$$
We have $w=\frac{1+\eta\sigma}{1+3c\eta}\le 1+\eta(\sigma-c)$. Note that $L\theta=\sqrt{\frac{Lm(\sigma-c)}{3}}\ge\sqrt{\frac{\sigma(\sigma-c)}{3}}>c$, so $\eta\le\frac{1}{3L\theta}$. Moreover, with $\sigma>c$ we have
$$\zeta:=m\eta(\sigma-c)\le\frac{m(\sigma-c)}{3L\theta}=\theta\le\frac{1}{2},$$
so that $w^m\le\left(1+\zeta/m\right)^m\le e^{\zeta}\le e^{\theta}$. Since $c$ is a small constant fraction of $\sigma$ (e.g., $c\le\sigma/10$), we have $\frac{4c\theta^2}{\sigma}\le\theta-1+e^{-\theta}$ for $\theta\le\frac12$, hence
$$\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)w^m\le\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)e^{\theta}\le 1.$$
So we have
$$\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)w^m=\frac{1}{\theta}\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)w^m\le\frac{1}{\theta}.$$
Therefore we have
$$w^m\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2.$$
The convergence rate over $T$ rounds of communication is $w^{-Tm}=\left(1+\Omega\left(\frac{1}{\sqrt{\kappa m}}\right)\right)^{-Tm}$, so the communication complexity is $T=O\left(\sqrt{\kappa/m}\,\log(1/\epsilon)\right)$. With the choice of $m=\Theta(N/n)$ the runtime complexity is $O\left((N/n+m)T\right)=O\left(\sqrt{\kappa N/n}\,\log(1/\epsilon)\right)$.

Case II: $\frac{m(\sigma-c)}{L}>\frac{3}{4}$. We set $\theta=\frac{1}{2}$ and $\eta=\frac{1}{3L\theta+2c}$. We have $\eta\sigma\le\frac{2\sigma}{3L}\le\frac{2}{3}$ and $3c\eta\le\frac{2c}{L}\le 1$, thus
$$w=\frac{1+\eta\sigma}{1+3c\eta}\ge 1+\frac{\eta(\sigma-3c)}{2}\quad\text{and}\quad w^m\ge 1+\frac{m\eta(\sigma-3c)}{2}\ge 1+\frac{m(\sigma-3c)}{3L+4c},$$
which is bounded away from $1$ by a constant since $m(\sigma-c)>\frac{3L}{4}$ and $c$ is a small constant fraction of $\sigma$. Since $\frac{1}{\theta}=2$ and $\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}=1+\frac{2c}{\sigma}$, the averaged inequality yields
$$\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{1+3c\eta}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\rho\left(\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2\right)$$
with the constant contraction factor $\rho=\left(1+\frac{2c}{\sigma}\right)\max\left\{\frac{1}{2},\,w^{-m}\right\}$, which is strictly less than $1$ when $c$ is a sufficiently small constant fraction of $\sigma$. This implies that $O(\log(1/\epsilon))$ rounds of communication are sufficient to find an $\epsilon$-accurate solution, and that the runtime complexity is $O\left(N/n\,\log(1/\epsilon)\right)$.

E Discussions on Distributed Smoothness
In this paper, we have established that distributed variance reduced methods admit a simple convergence analysis under the distributed smoothness of all worker machines, as long as the parameter $c$ is smaller than a constant fraction of $\sigma$, the strong convexity parameter of the global loss function $f$. In this section, we show that distributed smoothness can be guaranteed for many practical loss functions as long as the local data size is sufficiently large and the data are homogeneous across machines. This is as expected, since SVRG and SARAH rely heavily on exploiting data similarity to reduce the variance.

Since distributed smoothness only examines the gradient information of $f_k-f$, it can be applied to loss functions with non-smooth gradients, e.g. the Huber loss. However, for simplicity of exposition, we limit our focus to the case where the sample loss $\ell_z(\cdot)$ is twice differentiable and demonstrate the smoothness of $f-f_k$ via uniform concentration of Hessian matrices.

For simplicity, we consider the quadratic loss case, which allows us to compare with the existing result for the DANE algorithm [29], a communication-efficient approximate Newton-type algorithm. Assume $\ell(\cdot,z)$ is quadratic for all $z$. Recall the following result on the concentration of Hessian matrices from [29].

Lemma 9. [29] If $0\preceq\nabla^2\ell_z(x)\preceq L\cdot I$ holds for all $z$, then with probability at least $1-\delta$ over the samples, for all $x$,
$$\max_{1\le k\le n}\left\|\nabla^2 f_k(x)-\nabla^2 f(x)\right\|\le\sqrt{\frac{32L^2\log(dn/\delta)}{N/n}}.$$
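The concentration phenomenon in Lemma 9 is easy to observe numerically for quadratic losses, where the Hessian of each local empirical risk is data-dependent but parameter-independent. The sketch below (problem sizes are illustrative assumptions) compares the largest local-versus-global Hessian gap at two local sample sizes.

```python
import numpy as np

# Numerical sketch of Lemma 9's message for quadratic losses: each local
# Hessian (1/|M_k|) * sum_i a_i a_i^T concentrates around the global Hessian,
# with the spectral-norm gap shrinking roughly like sqrt(1/(N/n)).
rng = np.random.default_rng(1)
d, n = 10, 8                                   # dimension, number of machines

def max_hessian_gap(local_size):
    A = [rng.standard_normal((local_size, d)) for _ in range(n)]
    H_loc = [Ak.T @ Ak / local_size for Ak in A]           # Hessian of f_k
    H = sum(H_loc) / n                                     # Hessian of f
    return max(np.linalg.norm(Hk - H, 2) for Hk in H_loc)  # spectral norm

gap_small = max_hessian_gap(100)               # small local data => large gap
gap_large = max_hessian_gap(10000)             # large local data => small gap
print(gap_large < gap_small)
```

Increasing the local sample size by a factor of 100 shrinks the observed gap by roughly an order of magnitude, consistent with the $\sqrt{1/(N/n)}$ rate, and hence yields a smaller distributed-smoothness parameter $c$.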
Moreover, the iteration complexity of DANE is given by the theorem below.
Theorem 9. [29] If $0\preceq\nabla^2\ell_z(x)\preceq L\cdot I$ holds for all $z$ and $\sigma I\preceq\nabla^2 f(x)\preceq L\cdot I$, then with probability exceeding $1-\delta$, DANE needs
$$O\left(\frac{\kappa^2}{N/n}\log\left(\frac{dn}{\delta}\right)\log\left(\frac{L\left\|x_0-x^*\right\|^2}{\epsilon}\right)\right)$$
iterations to find an $\epsilon$-optimal solution.

By Theorem 9, [29] claims that when the local data size of every machine is sufficiently large, namely
$N/n=\Omega\left(\kappa^2\log(dn)\right)$, DANE can find a desired $\epsilon$-optimal solution with $O(\log(1/\epsilon))$ iterations and is thus communication-efficient. Note that at this local data size, according to Lemma 9, it is sufficient to establish $c=O(\sigma)$, which satisfies the convergence requirement of D-SVRG and D-SARAH. Consequently, the proposed D-SVRG and D-SARAH converge at the same iteration complexity as DANE, that is, $O(\log(1/\epsilon))$. Recall that DANE requires its local subproblems to be solved exactly. In contrast, our results formally justify that SVRG and SARAH can be safely used as inexact local solvers.
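This inexact-solver viewpoint can be illustrated with a minimal, self-contained simulation of the distributed SVRG framework on synthetic least squares. The problem sizes, step size, and the server-side averaging step below are illustrative assumptions rather than the paper's exact Algorithm 1.

```python
import numpy as np

# Minimal distributed-SVRG simulation: the server broadcasts the full
# gradient once per round; each worker then runs SVRG inner loops on its
# own shard as an inexact local solver (illustrative parameters).
rng = np.random.default_rng(2)
d, n, local_N = 5, 4, 500
A = [rng.standard_normal((local_N, d)) for _ in range(n)]
x_star = rng.standard_normal(d)
b = [Ak @ x_star for Ak in A]                  # noiseless: x_star is optimal

def grad_f(x):                                 # server-side full gradient
    return sum(Ak.T @ (Ak @ x - bk) / local_N for Ak, bk in zip(A, b)) / n

def svrg_inner(k, x_tilde, g_tilde, eta=0.05, m=100):
    y = x_tilde.copy()
    for _ in range(m):
        i = rng.integers(local_N)
        a, bi = A[k][i], b[k][i]
        # variance-reduced gradient built from worker k's local samples
        v = a * (a @ y - bi) - a * (a @ x_tilde - bi) + g_tilde
        y -= eta * v
    return y

x = np.zeros(d)
g0 = np.linalg.norm(grad_f(x))
for t in range(20):                            # each round = one communication
    g = grad_f(x)
    x = np.mean([svrg_inner(k, x, g) for k in range(n)], axis=0)
print(np.linalg.norm(grad_f(x)) / g0)          # small after a few rounds
```

No local subproblem is ever solved exactly here; a fixed number of variance-reduced stochastic steps per round suffices for the global gradient norm to contract, mirroring the linear-convergence guarantee when $c$ is a small fraction of $\sigma$.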