STL-SGD: Speeding Up Local SGD with Stagewise Communication Period
Shuheng Shen, Yifei Cheng, Jingchang Liu, and Linli Xu∗
University of Science and Technology of China
The Hong Kong University of Science and Technology
Abstract
Distributed parallel stochastic gradient descent algorithms are workhorses for large-scale machine learning tasks. Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity. Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is in the order of O(N^{3/2}T^{1/2}) and O(N^{3/4}T^{3/4}) when the data distributions on clients are identical (IID) or otherwise (Non-IID). In this paper, to accelerate the convergence by reducing the communication complexity, we propose STagewise Local SGD (STL-SGD), which increases the communication period gradually along with the decreasing learning rate. We prove that STL-SGD keeps the same convergence rate and linear speedup as mini-batch SGD. In addition, as the benefit of increasing the communication period, when the objective is strongly convex or satisfies the Polyak-Łojasiewicz condition, the communication complexity of STL-SGD is O(N log T) and O(N^{1/2}T^{1/2}) for the IID case and the Non-IID case respectively, achieving significant improvements over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.

1 Introduction

We consider the task of distributed stochastic optimization, which employs N clients to solve the following empirical risk minimization problem:

    min_{x ∈ ℝ^d} f(x) := (1/N) ∑_{i=1}^N f_i(x),    (1)

where f_i(x) := (1/|D_i|) ∑_{ξ ∈ D_i} f(x, ξ) is the local objective of client i. The D_i's denote the data distributions among clients, which can possibly differ. The scenario where the D_i's are identical corresponds to the central problem of traditional distributed optimization. When they are not identical, Formulation (1) captures the federated learning setting [26, 15], where the local data on each mobile client is independent and private, resulting in high variance of the data distributions.

As representatives of distributed stochastic optimization methods, traditional Synchronous SGD (SyncSGD) [6, 7] and Asynchronous SGD (AsyncSGD) [1, 23] achieve a linear speedup theoretically with respect to the number of clients. Nevertheless, for both SyncSGD and AsyncSGD, communication needs to be conducted at every iteration and O(d) parameters are transmitted each time, incurring a significant communication cost which restricts the performance in terms of time speedup.

∗Corresponding author: [email protected].
Under review.

To address this dilemma, distributed algorithms with low communication cost, either by decreasing the communication frequency [35, 30, 39, 28] or by reducing the number of communicated bits in each round [2, 31, 33], have become widely applied for large-scale training. Among them, Local SGD [30] (also called FedAvg [26]), which conducts communication every k iterations, enjoys excellent theoretical and practical performance [25, 30]. In the IID case and the Non-IID case, the communication complexity of Local SGD is respectively proved to be O(N^{3/2}T^{1/2}) [35, 30] and O(N^{3/4}T^{3/4}) [39, 28], where T is the number of iterations, while the linear speedup is maintained. When the objective satisfies the Polyak-Łojasiewicz condition [16], [9] provides a tighter theoretical analysis which further reduces the communication complexity of Local SGD. In terms of the communication period k, most previous studies of Local SGD choose to fix it through the iterations.
In contrast, [34] suggests using an adaptively decreasing k when the learning rate is fixed, and [9] proposes an adaptively increasing k as the iterations go on. Nevertheless, neither achieves a communication complexity lower than that of Local SGD with a fixed period. For strongly convex objectives, if a small fixed learning rate is adopted, Local SGD with a fixed communication period is proved to achieve O(N log(NT)) communication complexity [32]. However, the small fixed learning rate results in a suboptimal convergence rate of O(log T / (NT)). It remains an open problem whether the communication complexity can be further reduced with a varying k while the optimal convergence rate O(1/(NT)) is maintained, to which this paper provides an affirmative solution.

Main Contributions.
We propose Stagewise Local SGD (STL-SGD), which adopts a stagewisely increasing communication period, and make the following contributions:

• We first prove that Local SGD achieves O(1/√(NT)) convergence when the objective is general convex. A novel insight from this analysis is that the convergence rate O(1/√(NT)) can be attained when setting k to be O(1/(ηN)) and O(1/√(ηN)) in the IID case and the Non-IID case respectively, where η is the learning rate. This indicates that the communication period is negatively related to the learning rate.

• Taking Local SGD as a subalgorithm and tuning its parameters stagewisely, we propose STL-SGD_sc for strongly convex problems, which geometrically increases the communication period along with the decreasing learning rate. We prove that STL-SGD_sc achieves an O(1/(NT)) convergence rate with communication complexities O(N log T) and O(N^{1/2}T^{1/2}) for the IID case and the Non-IID case, respectively.

• For non-convex problems, we propose the STL-SGD_nc algorithm, which uses Local SGD to optimize a regularized objective f^γ_{x_s}(·) inexactly at each stage. When the Polyak-Łojasiewicz condition holds, the same communication complexity as in strongly convex problems is achieved. For general non-convex problems, we prove that STL-SGD_nc achieves the linear speedup with the same order of communication complexity as Local SGD for the IID case and the Non-IID case, respectively.

2 Related Work

Local SGD.
When the data distributions on clients are identical, Local SGD is proved to achieve O(1/(NT)) convergence for strongly convex objectives [30] and O(1/√(NT)) convergence for non-convex objectives [35] when the communication period k satisfies k ≤ O(T^{1/2}/N^{3/2}). As demonstrated in these results, Local SGD achieves a linear speedup with the communication complexity O(N^{3/2}T^{1/2}) for both strongly convex and non-convex objectives in the IID case. In addition, [9] shows that fewer rounds of communication are sufficient to achieve O(1/(NT)) convergence for objectives which satisfy the Polyak-Łojasiewicz condition. On the other hand, for the Non-IID case, Local SGD is proved to have an O(1/√(NT)) convergence rate under a communication complexity of O(N^{3/4}T^{3/4}) for non-convex objectives [39, 28]. Meanwhile, for strongly convex objectives, a suboptimal convergence rate of O(k/(μNT)) is obtained [22]. Beyond that, when a small fixed learning rate is adopted, [32] and [17] prove lower communication complexities for the IID case (O(N log(NT))) and the Non-IID case respectively, at the cost of a suboptimal convergence rate O(log T / (NT)). For general non-convex objectives, [11] proves a lower communication complexity for the Non-IID case under the assumption of bounded gradient diversity. From the practical view, [41] suggests communicating more frequently at the beginning of the optimization process, and [9] verifies that using a geometrically increasing period does not notably harm convergence.

Stagewise Training.
For both strongly convex and non-convex objectives, stagewisely decreasing the learning rate is widely adopted. Epoch-SGD [12] and ASSG [36] use SGD as their subalgorithm and geometrically decrease the learning rate stage by stage. They are proved to achieve the optimal O(1/T) convergence for stochastic strongly convex optimization. For training neural networks, stagewisely decreasing the learning rate [21, 13] is a very important trick. From a theoretical aspect, stagewise SGD is proved to attain O(1/√T) convergence for both general and composite non-convex objectives [3, 4, 5] by adopting SGD to optimize a regularized objective at each stage and decreasing the learning rate linearly stage by stage. Stagewise training is also verified to achieve better testing error than plain SGD [40].

Large Batch SGD (LB-SGD).
SyncSGD with an extremely large batch is proved to achieve a linear speedup with respect to the batch size [32]. Nevertheless, [14] shows that increasing the batch size does not help when the bias dominates the variance. It is also observed in practice that LB-SGD leads to poor generalization [18, 8, 37]. [38] proposes CR-PSGD, which increases the batch size geometrically step by step, and proves that CR-PSGD achieves a linear speedup with O(log T) communication complexity. However, after a large number of iterations, CR-PSGD essentially becomes GD and loses the benefit of SGD.

Local SGD with Variance Reduction.
Recently, several techniques have been proposed to reduce the communication complexity of Local SGD in the Non-IID case. [10] shows that using redundant data among clients yields a lower communication complexity. One variant of Local SGD called VRL-SGD [24] incorporates the variance reduction technique and is proved to achieve a lower communication complexity for non-convex objectives. SCAFFOLD [17] extends VRL-SGD by involving two separate learning rates, and is proved to achieve an O(log(NT)) communication complexity for strongly convex objectives and a lower complexity than Local SGD for non-convex objectives. As SCAFFOLD adopts a small fixed learning rate, its convergence rate for strongly convex objectives is O(log T / (NT)), which is suboptimal. Nevertheless, these methods are orthogonal to our study; combining STL-SGD with variance reduction exceeds the scope of this paper.

For a comprehensive and detailed comparison of STL-SGD and the related works, please refer to Table 3 in the Appendix.

3 Local SGD

Throughout the paper, we let ‖·‖ denote the ℓ₂ norm of a vector and ⟨·,·⟩ the inner product of two vectors. The set {1, 2, ..., n} is denoted as [n]. We use x* to represent the optimal solution of (1), and ∇f the gradient of f. E denotes the full expectation with respect to all the randomness in the algorithm (the stochastic gradients sampled in all iterations and the random choice of the returned iterate).

The data distributions on different clients may not be identical. To quantify the difference between distributions, we define

    ζ*_f := (1/N) ∑_{i=1}^N ‖∇f_i(x*)‖² = (1/N) ∑_{i=1}^N ‖∇f_i(x*) − ∇f(x*)‖²,

which represents the variance of the gradients among clients at x*. Some works assume that the variance of gradients among clients is bounded by a constant ζ² [28], or that the norm of the stochastic gradients is bounded by a constant G [39, 22]. Note that both ζ² and G² are larger than ζ*_f. When the data distributions are identical, we have ‖∇f_i(x*)‖ = 0, and thus ζ*_f = 0.

To state the convergence of algorithms for solving (1), we introduce some definitions, which can also be found in other works [4, 9].

Definition 1 (ρ-weakly convex). A non-convex function f(x) is ρ-weakly convex (ρ > 0) if

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ − (ρ/2)‖x − y‖²,  ∀x, y ∈ ℝ^d.

Algorithm 1 Local-SGD(f, x_0, η, T, k)
Initialize: x_0^i = x_0, ∀i ∈ [N].
for t = 1, ..., T do
    Client C_i does:
        Uniformly sample a mini-batch ξ^i_{t−1} from D_i and compute a stochastic gradient ∇f(x^i_{t−1}, ξ^i_{t−1}).
        if k divides t then
            Communicate with the other clients and update: x_t^i = ∑_{j=1}^N (1/N)(x^j_{t−1} − η∇f(x^j_{t−1}, ξ^j_{t−1})).
        else
            Update locally: x_t^i = x^i_{t−1} − η∇f_i(x^i_{t−1}, ξ^i_{t−1}).
        end if
end for
return x̃ = (1/N) ∑_{i=1}^N x_t^i for a randomly chosen t ∈ {0, 1, ..., T − 1}.

Definition 2 (μ-Polyak-Łojasiewicz (PL)). A function f(x) satisfies the μ-PL condition (μ > 0) if

    2μ(f(x) − f(x*)) ≤ ‖∇f(x)‖²,  ∀x ∈ ℝ^d.

Throughout this paper, we make the following assumptions, all of which are commonly used and basic [30, 39, 22, 4, 3].
Assumption 1. Each f_i(x), i ∈ [N], is L-smooth: ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, ∀x, y ∈ ℝ^d, ∀i ∈ [N].

Assumption 2.
There exists a constant σ such that E_{ξ∼D_i} ‖∇f(x, ξ) − ∇f_i(x)‖² ≤ σ², ∀x ∈ ℝ^d, ∀i ∈ [N].

Assumption 3.
If the objective function is non-convex, we assume it is ρ -weakly convex. Remark 1.
Note that if f(x) is L-smooth, it is also L-weakly convex. This is because Assumption 1 implies −(L/2)‖x − y‖² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)‖x − y‖² [27]. Therefore, for an L-smooth function, the weak-convexity parameter ρ satisfies 0 < ρ ≤ L.

To alleviate the high communication cost of SyncSGD, the periodic averaging technique was proposed [30, 39]. Instead of averaging the models of all clients at every iteration, Local SGD lets the clients update their models locally for k iterations, after which one communication round averages the local models to make them consistent. Specifically, the update rule of Local SGD is

    x_t^i = (1/N) ∑_{j=1}^N ( x^j_{t−1} − η∇f(x^j_{t−1}, ξ^j_{t−1}) ),   if t mod k = 0,
    x_t^i = x^i_{t−1} − η∇f_i(x^i_{t−1}, ξ^i_{t−1}),                    otherwise,

where x_t^i is the local model on client i at iteration t. Therefore, when each client conducts T iterations, the total number of communication rounds is T/k. The complete procedure of Local SGD is summarized in Algorithm 1. Different from previous studies [26, 30, 39], Algorithm 1 returns x̃ = (1/N) ∑_{i=1}^N x_t^i for a randomly chosen t ∈ {0, 1, ..., T − 1}. In practice, we can determine t in advance to avoid redundant iterations.

Although several studies have analysed the convergence of Local SGD, they assume that the objective f(x) is μ-strongly convex or non-convex. [19] focuses on general convex objectives but uses full gradient descent. Besides, most of the existing analyses rely on stronger assumptions, including a bounded gradient norm (i.e., ‖∇f_i(x, ξ)‖ ≤ G) [30, 22] or a bounded variance of gradients among clients [28]. Here, we give a basic convergence result of Local SGD for general convex objectives without these assumptions.

Algorithm 2 STL-SGD_sc(f, x_1, η_1, T_1, k_1)
for s = 1, 2, ..., S do
    x_{s+1} = Local-SGD(f, x_s, η_s, T_s, max{⌊k_s⌋, 1}).
    Set η_{s+1} = η_s / 2, T_{s+1} = 2T_s and k_{s+1} = √2 · k_s (Non-IID case), 2k_s (IID case).
end for
return x_{S+1}.
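Before stating the convergence result, it may help to see Algorithm 1 in executable form. The following is a minimal single-process simulation written for this note (it is not the authors' released code); the names local_sgd and grad_fn are illustrative, and the quadratic local objectives are only a toy example under NumPy.

```python
import numpy as np

def local_sgd(grad_fn, x0, eta, T, k, num_clients, rng):
    """Simulate Algorithm 1 (Local SGD) in a single process.

    grad_fn(i, x, rng) returns a stochastic gradient of client i's
    local objective f_i at the point x.
    """
    x = [x0.copy() for _ in range(num_clients)]        # local models x^i
    iterates = []                                       # averaged model after each step
    for t in range(1, T + 1):
        # every client takes one local stochastic gradient step
        x = [xi - eta * grad_fn(i, xi, rng) for i, xi in enumerate(x)]
        if t % k == 0:                                  # communication round: average models
            avg = sum(x) / num_clients
            x = [avg.copy() for _ in range(num_clients)]
        iterates.append(sum(x) / num_clients)
    # return the averaged model at a uniformly random iteration, as in Algorithm 1
    return iterates[rng.integers(T)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(8, 5))                   # 8 clients, each with a quadratic f_i

    def grad_fn(i, x, rng):
        # gradient of f_i(x) = 0.5 * ||x - targets[i]||^2 plus Gaussian noise
        return (x - targets[i]) + 0.1 * rng.normal(size=x.shape)

    x_tilde = local_sgd(grad_fn, np.zeros(5), eta=0.1, T=200, k=10,
                        num_clients=8, rng=rng)
    print("distance to optimum:", np.linalg.norm(x_tilde - targets.mean(axis=0)))
```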
Theorem 1. Suppose Assumptions 1 and 2 hold, f(x) is convex and η ≤ O(1/L). If we set k ≤ min{ O(1/(ηL²N)), O(1/(ηL)) } for the IID case and k ≤ min{ O(σ / (L√(ηN(σ² + 4ζ*_f)))), O(1/(ηL)) } for the Non-IID case, then

    E f(x̃) − f(x*) ≤ O( ‖x_0 − x*‖² / (ηT) + ησ² / N ).    (2)
Remark 2.
If we set η = √(N/T), we have

    E f(x̃) − f(x*) ≤ O( (‖x_0 − x*‖² + σ²) / √(NT) ),

which is consistent with the result of mini-batch SGD [6].

4 Stagewise Local SGD

To further reduce the communication complexity, we propose STagewise Local SGD (STL-SGD) in this section, with the following features.

• STL-SGD employs Algorithm 1 as a subalgorithm in each stage.
• Instead of using a small fixed learning rate or a gradually decreasing learning rate (e.g., η_t = η_0 / (α + t)), STL-SGD adopts a stagewisely adaptive scheme: the learning rate is kept fixed within a stage and decreased stage by stage.
• The communication period is increased stagewisely.

We propose two variants of STL-SGD for strongly convex and non-convex problems, respectively.
4.1 STL-SGD for Strongly Convex Objectives

In this subsection, we propose the STL-SGD algorithm for strongly convex problems, denoted STL-SGD_sc and summarized in Algorithm 2. At each stage, the learning rate is decreased exponentially, while the number of iterations and the communication period are increased exponentially. Specifically, at the s-th stage we set η_s = η_{s−1}/2 and T_s = 2T_{s−1}. The communication period k_s is set as k_s = 2k_{s−1} for the IID case and k_s = √2 · k_{s−1} for the Non-IID case. Below, let x_s denote the initial point of the s-th stage. Theorem 2 establishes the convergence rate of STL-SGD_sc.
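The stagewise schedule of Algorithm 2 is easy to drive on top of any Local SGD routine. The sketch below is illustrative only (not the authors' implementation): it accepts a user-supplied local_sgd subroutine such as the one sketched earlier, halves the learning rate, doubles the stage length, and grows the communication period (by 2 in the IID case, by √2 otherwise) at every stage.

```python
import math

def stl_sgd_sc(local_sgd, grad_fn, x1, eta1, T1, k1,
               num_stages, num_clients, rng, iid=True):
    """Stagewise schedule of STL-SGD_sc (Algorithm 2), built on a Local SGD routine."""
    x, eta, T, k = x1, eta1, T1, k1
    for s in range(num_stages):
        # run one stage of Local SGD from the current point
        x = local_sgd(grad_fn, x, eta, T, max(int(k), 1), num_clients, rng)
        eta /= 2.0                                   # eta_{s+1} = eta_s / 2
        T *= 2                                       # T_{s+1} = 2 * T_s
        k *= 2.0 if iid else math.sqrt(2.0)          # period grows as the learning rate shrinks
    return x
```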
Theorem 2. Suppose f(x) is μ-strongly convex. Let η_1 ≤ O(1/L) and T_1η_1 = 6/μ. Set k_1 = min{ O(1/(η_1L²N)), O(1/(η_1L)) } for the IID case and k_1 = min{ O(σ / (L√(η_1N(σ² + 4ζ*_f)))), O(1/(η_1L)) } for the Non-IID case. Under Assumptions 1 and 2, when the number of stages satisfies S ≥ log( N(f(x_1) − f(x*)) / (η_1σ²) ) + 2, we have the following result for Algorithm 2:

    E f(x_{S+1}) − f(x*) ≤ O( η_1σ² / (2^S N) ).    (3)
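The step from (3) to the overall rate stated below is just the geometric sum of the stage lengths. Spelled out (a reconstruction of the bookkeeping with constants suppressed):

```latex
% Stage lengths double, so
T \;=\; \sum_{s=1}^{S} T_s \;=\; T_1\,(2^{S} - 1)
\quad\Longrightarrow\quad 2^{S} \;=\; \frac{T}{T_1} + 1 .
% Plugging this into (3) and using that T_1 \eta_1 = 6/\mu is a constant:
\mathbb{E} f(x_{S+1}) - f(x^*)
  \;\le\; O\!\Big(\frac{\eta_1 \sigma^2}{2^{S} N}\Big)
  \;=\; O\!\Big(\frac{\eta_1 T_1 \sigma^2}{(T + T_1)\,N}\Big)
  \;=\; O\!\Big(\frac{\sigma^2}{\mu\,N\,T}\Big).
```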
Defining T := T_1 + T_2 + ··· + T_S, we have

    E f(x_{S+1}) − f(x*) ≤ O( 1/(NT) ).    (4)

Algorithm 3 STL-SGD_nc(f, x_1, η_1, T_1, k_1)
for s = 1, 2, ..., S do
    Let f^γ_{x_s}(x) = f(x) + (1/(2γ))‖x − x_s‖².
    x_{s+1} = Local-SGD(f^γ_{x_s}, x_s, η_s, T_s, max{⌊k_s⌋, 1}).
    Option 1: Set η_{s+1} = η_s / 2, T_{s+1} = 2T_s and k_{s+1} = √2 · k_s (Non-IID case), 2k_s (IID case).
    Option 2: Set η_{s+1} = η_1 / (s + 1), T_{s+1} = (s + 1)T_1 and k_{s+1} = √(s + 1) · k_1 (Non-IID case), (s + 1)k_1 (IID case).
end for
return x_{S+1}.

Remark 3.
Theorem 2 claims the following properties of STL-SGD_sc:

• Linear Speedup. To reach a solution x_{S+1} with E f(x_{S+1}) − f(x*) ≤ ε, the total number of iterations is O(1/(Nε)), which indicates a linear speedup.

• Communication Complexity for the Non-IID Case. For the Non-IID case, we set k_{s+1} = √2 · k_s in Algorithm 2. Therefore, the total communication complexity is T_1/k_1 + ··· + T_S/k_S = (T_1/k_1)(1 + √2 + ··· + (√2)^{S−1}) = O( (T_1/k_1) · (T/T_1)^{1/2} ) = O(N^{1/2}T^{1/2}), where the last equality holds because T_1/k_1 = O(√(T_1²η_1N)) = O(N^{1/2}).

• Communication Complexity for the IID Case. If the data distributions on different clients are identical, we set k_{s+1} = 2k_s in Algorithm 2, so T_s/k_s = T_1/k_1 for every stage. Thus the total communication complexity is T_1/k_1 + ··· + T_S/k_S = S · T_1/k_1 = O(N log T).

4.2 STL-SGD for Non-Convex Objectives

In this subsection, we proceed to propose the variant of the STL-SGD algorithm for non-convex problems (STL-SGD_nc). Different from Algorithm 2, which optimizes a fixed objective during all stages, STL-SGD_nc changes the objective once a stage is finished. Specifically, in the s-th stage the objective is the regularized problem f^γ_{x_s}(x) = f(x) + (1/(2γ))‖x − x_s‖², where x_s is the initial point of the s-th stage and γ is a constant that satisfies γ < ρ⁻¹. The function f^γ_{x_s}(x) is guaranteed to be convex due to the ρ-weak convexity of f(x). In this way, the theoretical properties of Algorithm 1 under the convex setting still hold in each stage of STL-SGD_nc. The other parameters are set in two different ways (Option 1 and Option 2) for non-convex objectives satisfying the PL condition and otherwise, which are detailed in Algorithm 3.

In Option 1, we set η_s, T_s and k_s in the same way as in Algorithm 2; a sketch of both schedules is given below. Here we analyse the theoretical property of STL-SGD_nc with Option 1 for non-convex objectives that satisfy the PL condition.
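To illustrate how the regularized stage objective and the two schedules fit together, here is a short sketch written for this note (illustrative only, under the same assumptions as the earlier snippets: a user-supplied local_sgd routine and a grad_fn for ∇f_i). It wraps the gradient of f_i into the gradient of f^γ_{x_s} and applies either the geometric (Option 1) or the linear (Option 2) schedule.

```python
import math

def stl_sgd_nc(local_sgd, grad_fn, x1, eta1, T1, k1, gamma,
               num_stages, num_clients, rng, option=1, iid=True):
    """Stagewise STL-SGD_nc (Algorithm 3): each stage runs Local SGD on the
    regularized objective f^gamma_{x_s}(x) = f(x) + ||x - x_s||^2 / (2 * gamma)."""
    x = x1
    for s in range(1, num_stages + 1):
        x_s = x.copy()

        def reg_grad(i, z, rng):
            # gradient of the regularized local objective at z
            return grad_fn(i, z, rng) + (z - x_s) / gamma

        eta_s = eta1 / 2 ** (s - 1) if option == 1 else eta1 / s
        T_s = T1 * 2 ** (s - 1) if option == 1 else T1 * s
        growth = (2.0 if iid else math.sqrt(2.0)) ** (s - 1) if option == 1 \
                 else (s if iid else math.sqrt(s))
        k_s = max(int(k1 * growth), 1)

        x = local_sgd(reg_grad, x_s, eta_s, T_s, k_s, num_clients, rng)
    return x
```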
Theorem 3.
Assume f(x) satisfies the PL condition in Definition 2 with constant μ. Suppose Assumptions 1, 2 and 3 hold and f(x) is weakly convex with constant ρ ≤ μ. Let η_1 ≤ O(1/L_γ) and T_1η_1 = Θ(1/ρ). Set k_1 = min{ O(1/(η_1L_γ²N)), O(1/(η_1L_γ)) } for the IID case and k_1 = min{ O(σ / (L_γ√(η_1N(σ² + 4ζ*_f)))), O(1/(η_1L_γ)) } for the Non-IID case. When the number of stages satisfies S ≥ log( N(f(x_1) − f(x*)) / (η_1σ²) ) + 2, Algorithm 3 with Option 1 returns a solution x_{S+1} such that

    E f(x_{S+1}) − f(x*) ≤ O( 1/(NT) ),    (5)

where T = T_1 + T_2 + ··· + T_S.

Remark 4. As the result of Theorem 3 is the same as that of Theorem 2, the properties stated in Remark 3 all hold here.
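Theorem 3 (and Lemma 5 in the appendix) relies on each stage objective f^γ_{x_s} being convex. For completeness, here is a short standard derivation of that fact from Definition 1; it is not taken from the paper but follows directly from weak convexity.

```latex
% f is rho-weakly convex, i.e. for all x, y:
%   f(x) >= f(y) + <grad f(y), x - y> - (rho/2) ||x - y||^2.
% Let g(x) := f^gamma_{x_s}(x) = f(x) + (1/(2*gamma)) ||x - x_s||^2.
\begin{aligned}
g(x) - g(y) - \langle \nabla g(y),\, x - y\rangle
  &= \Big(f(x) - f(y) - \langle \nabla f(y),\, x - y\rangle\Big)
     + \tfrac{1}{2\gamma}\,\|x - y\|^2 \\
  &\ge \Big(\tfrac{1}{2\gamma} - \tfrac{\rho}{2}\Big)\,\|x - y\|^2 ,
\end{aligned}
% so g is (1/gamma - rho)-strongly convex whenever gamma < 1/rho,
% and in particular convex.
```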
Option 2 is employed for non-convex objectives which do not satisfy the PL condition. Instead of increasing the communication period geometrically as in Option 1 of Algorithm 3, we let it increase linearly, i.e., k_s = s·k_1. Meanwhile, we increase the stage length linearly, that is, T_s = s·T_1, while keeping T_sη_s constant.

Theorem 4.
Suppose Assumptions 1, 2 and 3 hold. Let η_1 ≤ O(1/L_γ) and T_1η_1 = Θ(1/ρ). Set k_1 = min{ O(1/(η_1L²N)), O(1/(η_1L)) } for the IID case and k_1 = min{ O(σ / (L√(η_1N(σ² + 4ζ*_f)))), O(1/(η_1L)) } for the Non-IID case. Algorithm 3 with Option 2 guarantees that

    E ‖∇f(x_s)‖² ≤ O( 1/√(NT) ),    (6)

where s is randomly sampled from {1, 2, ..., S} with probability p_s = s / (1 + 2 + ··· + S).

Remark 5. STL-SGD_nc with Option 2 has the following properties:

• Linear Speedup: To achieve E‖∇f(x_s)‖² ≤ ε, the total number of iterations with N clients is O(1/(Nε²)), which shows a linear speedup.

• Communication Complexity for the Non-IID case: Algorithm 3 with Option 2 sets k_s = √s · k_1. Thus, the communication complexity is T_1/k_1 + T_2/k_2 + ··· + T_S/k_S = (T_1/k_1)(1 + √2 + ··· + √S) = O( (T_1/k_1)(T/T_1)^{3/4} ) = O(N^{3/4}T^{3/4}).

• Communication Complexity for the IID case: As k_s = s·k_1, the communication complexity is T_1/k_1 + T_2/k_2 + ··· + T_S/k_S = S · T_1/k_1 = O( (T_1/k_1)(T/T_1)^{1/2} ) = O(N^{3/2}T^{1/2}).

5 Experiments

We validate the performance of the proposed STL-SGD algorithm with experiments on both convex and non-convex problems. For each type of problem, we conduct experiments for both the IID case and the Non-IID case. Experiments are conducted on a machine with 8 Nvidia GeForce GTX 1080 Ti GPUs and 2 Xeon(R) Platinum 8153 CPUs.

To simulate the Non-IID scenarios, we divide the training data among clients and make the distributions of classes very different among them. Similar to the setting in [17], we first randomly take s% i.i.d. data from the training set and divide them equally among the clients. The remaining data are sorted according to their classes and then assigned to the clients in order. In our experiments, we set s = 50 for the convex problems and s = 0 for the non-convex problems.

We compare STL-SGD with SyncSGD, LB-SGD, CR-PSGD [38] and Local SGD [30]. We show the comparison of these algorithms in terms of communication rounds. The investigation of convergence is included in the Appendix, which validates that STL-SGD achieves a convergence similar to SyncSGD.

5.1 Convex Problems

We consider the binary classification problem with logistic regression, i.e.,

    min_{θ ∈ ℝ^d} (1/n) ∑_{i=1}^n log(1 + exp(−y_i x_i^T θ)) + (λ/2)‖θ‖²,    (7)

where (x_i, y_i), i ∈ [n], constitute a set of training examples and λ is the regularization parameter. Note that (7) is a strongly convex problem when λ > 0, and we set λ = 1/n. We take two datasets, a9a and MNIST, from the LIBSVM website. a9a has 32,561 examples and 123 features. For MNIST, we sample a subset of examples (with 784 features) from two classes (4 and 9). Experiments are implemented on 32 clients and communication is handled with MPI.
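The Non-IID partition described above is straightforward to reproduce. The following sketch is illustrative (not the authors' code) and assumes NumPy arrays of features and labels: it keeps an s% i.i.d. share that is split evenly and hands out the remaining, class-sorted examples to clients in contiguous chunks.

```python
import numpy as np

def partition_non_iid(features, labels, num_clients, iid_share=0.5, seed=0):
    """Split a dataset across clients: an `iid_share` fraction is distributed
    i.i.d., the rest is sorted by class and assigned to clients in order."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    perm = rng.permutation(n)
    n_iid = int(iid_share * n)
    iid_idx, rest_idx = perm[:n_iid], perm[n_iid:]

    # sort the remaining examples by class so each client sees only a few classes
    rest_idx = rest_idx[np.argsort(labels[rest_idx], kind="stable")]

    client_indices = [[] for _ in range(num_clients)]
    for c, chunk in enumerate(np.array_split(iid_idx, num_clients)):
        client_indices[c].extend(chunk.tolist())
    for c, chunk in enumerate(np.array_split(rest_idx, num_clients)):
        client_indices[c].extend(chunk.tolist())

    return [(features[idx], labels[idx]) for idx in client_indices]
```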
[Figure 1 panels: a9a IID (LB-SGD B=100, CR-PSGD ρ=1.001, Local-SGD k=800, STL-SGD_sc k_1=400); a9a Non-IID (B=10, ρ=1.001, k=10, k_1=10); MNIST IID (B=100, ρ=1.001, k=100, k_1=100); MNIST Non-IID (B=10, ρ=1.001, k=10, k_1=10). Axes: communication rounds vs. objective gap (log scale).]

Figure 1: Training objective gap f(x) − f(x*) w.r.t. the communication rounds for logistic regression on a9a and MNIST.

Table 1: Communication rounds to reach the target objective gap in convex problems. We also show the speedup of these algorithms compared with SyncSGD.

Algorithms  | a9a (IID)      | a9a (Non-IID) | MNIST (IID)  | MNIST (Non-IID)
SyncSGD     | 100683 (1×)    | 90513 (1×)    | 32664 (1×)   | 22021 (1×)
LB-SGD      | 7620 (13.2×)   | 12221 (7.4×)  | 7011 (4.7×)  | 7740 (2.8×)
CR-PSGD     | 5434 (18.5×)   | 5772 (15.7×)  | 6788 (4.8×)  | 7029 (3.1×)
Local-SGD   | 184 (547.2×)   | 10068 (9.0×)  | 289 (113.0×) | 2642 (8.3×)
STL-SGD_sc  | 61 (1650.5×)   | 4417 (20.5×)  | 79 (413.5×)  | 1518 (14.5×)

SyncSGD, LB-SGD and Local SGD are implemented with a decreasing learning rate η_t = η_0 / (α + t), as suggested in [30, 22], and we tune α over three values for the best performance. For STL-SGD_sc, we set η_1T_1 = 6/λ following Theorem 2. The initial learning rate for all algorithms is tuned over three values proportional to N. The communication period k and the batch size B for LB-SGD are tuned over five candidate values for the IID case and five for the Non-IID case. The scaling factor of the batch size ρ for CR-PSGD is tuned over three values. We report the largest k, B and ρ that do not sacrifice convergence.

Figure 1 shows the objective gap f(x) − f(x*) with regard to the communication rounds. We observe that STL-SGD_sc converges within the fewest communication rounds for both the IID case and the Non-IID case. Although the initial communication period of STL-SGD_sc may need to be set smaller than that of Local SGD in the IID case, the total number of communication rounds of STL-SGD_sc is still significantly lower, which validates that the communication complexity of
[Figure 2 panels: ResNet18 on CIFAR10, IID (LB-SGD B=320, CR-PSGD ρ=1.1, Local-SGD k=10, STL-SGD_nc k_1=10); ResNet18, Non-IID (B=320, ρ=1.1, k=5, k_1=5); VGG16, IID (B=320, ρ=1.1, k=10, k_1=10); VGG16, Non-IID (B=192, ρ=1.1, k=3, k_1=3). Axes: communication rounds vs. training loss.]

Figure 2: Training loss w.r.t. the communication rounds for ResNet18 and VGG16 on CIFAR10.

Table 2: Communication rounds to reach 99% training accuracy in non-convex problems. We run all algorithms for 200 epochs, where an epoch indicates one pass over the dataset. LB-SGD and CR-PSGD cannot reach 99% training accuracy on the VGG16 network by the end of training.

Algorithms    | ResNet18 (IID) | ResNet18 (Non-IID) | VGG16 (IID)   | VGG16 (Non-IID)
SyncSGD       | 7644 (1×)      | 5390 (1×)          | 13622 (1×)    | 15092 (1×)
LB-SGD        | 3000 (2.5×)    | 3180 (1.7×)        | − (−)         | − (−)
CR-PSGD       | 1797 (4.3×)    | 1937 (2.8×)        | − (−)         | − (−)
Local-SGD     | 755 (10.1×)    | 1235 (4.4×)        | 1245 (10.9×)  | 3986 (3.8×)
STL-SGD_nc-2  | 470 (16.3×)    | 1158 (4.7×)        | 696 (19.6×)   | 2732 (5.5×)
STL-SGD_nc-1  | 434 (17.6×)    | 954 (5.7×)         | 602 (22.6×)   | 2179 (6.9×)

STL-SGD_sc is much lower than that of Local SGD. As shown in Table 1, to achieve the target objective gap, STL-SGD_sc requires almost 1.7-3 times fewer communication rounds than Local SGD.

5.2 Non-Convex Problems

We train ResNet18 [13] and VGG16 [29] on the
CIFAR10 [20] dataset, which includes a training set of 50,000 examples from 10 classes. 8 clients are used in total. For our proposed algorithm, we denote STL-SGD_nc with Option 1 and Option 2 as STL-SGD_nc-1 and STL-SGD_nc-2, respectively. The learning rates of SyncSGD, LB-SGD, CR-PSGD and Local-SGD are all kept fixed, as suggested by their convergence theory [7, 38, 39]. The initial learning rate for all algorithms is tuned over three values proportional to N. The basic batch size at each client is 64. The first stage length of STL-SGD_nc is tuned over three values (in epochs), and the parameter γ in STL-SGD_nc is tuned over three values. We tune the communication period k over four values and the batch size B for LB-SGD over four values. For ease of implementation, we increase the batch size in CR-PSGD with B ← ρB once an epoch is finished, and ρ is tuned over three values; B stops growing once it exceeds the threshold suggested in [38].

The experimental results for the training loss with regard to communication rounds are presented in Figure 2, and the communication rounds required to achieve 99% training accuracy are shown in Table 2. As can be seen, STL-SGD_nc-1 and STL-SGD_nc-2 converge with much fewer communications than the other algorithms. In spite of having the same order of communication complexity as Local SGD, STL-SGD_nc-2 performs better, benefiting from the negative relation between the learning rate and the communication period. STL-SGD_nc-1 converges with the fewest communications, as it uses a geometrically increasing communication period.

6 Conclusion

We propose STL-SGD, which adopts a stagewisely increasing communication period to reduce the communication complexity. Two variants of STL-SGD (STL-SGD_sc and STL-SGD_nc) are provided for strongly convex objectives and non-convex objectives, respectively. Theoretically, we prove that: (i) STL-SGD maintains the same convergence rate and linear speedup as SyncSGD; (ii) when the objective is strongly convex or satisfies the PL condition, STL-SGD achieves the state-of-the-art communication complexity while attaining the optimal convergence rate O(1/(NT)); (iii) when the objective is general non-convex, STL-SGD has the same communication complexity as Local SGD, while being more consistent with practical tricks. Experiments on both convex and non-convex problems demonstrate the effectiveness of the proposed algorithm.

Local SGD with variance reduction achieves outstanding communication complexity for the Non-IID case. One interesting direction is to combine the techniques of stagewise training and variance reduction to obtain better results for the Non-IID case. We leave this for future work.

References

[1] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In
Advancesin Neural Information Processing Systems , pages 873–881, 2011.[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd:Communication-efficient sgd via gradient quantization and encoding. In
Advances in NeuralInformation Processing Systems , pages 1709–1720, 2017.[3] Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex andnonconvex sgd. In
Advances in Neural Information Processing Systems , pages 1157–1167,2018.[4] Zaiyi Chen, Zhuoning Yuan, Jinfeng Yi, Bowen Zhou, Enhong Chen, and Tianbao Yang. Uni-versal stagewise learning for non-convex problems with convergence on averaged solutions. In
International Conference on Learning Representations , 2019.[5] Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method fornonsmooth, nonconvex problems.
SIAM Journal on Optimization , 29(3):1908–1930, 2019.[6] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online pre-diction using mini-batches.
Journal of Machine Learning Research , 13(Jan):165–202, 2012.[7] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvexstochastic programming.
SIAM Journal on Optimization , 23(4):2341–2368, 2013.[8] Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge,Michael W. Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batchsizes for stochastic gradient descent, 2018.[9] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe.Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. In
Advancesin Neural Information Processing Systems , pages 11080–11092, 2019.1010] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe.Trading redundancy for communication: Speeding up distributed sgd for non-convex optimiza-tion. In
International Conference on Machine Learning , pages 2545–2554, 2019.[11] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods infederated learning, 2019.[12] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithmsfor stochastic strongly-convex optimization.
The Journal of Machine Learning Research ,15(1):2489–2512, 2014.[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for imagerecognition. In
Proceedings of the IEEE conference on computer vision and pattern recogni-tion , pages 770–778, 2016.[14] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Paral-lelizing stochastic gradient descent for least squares regression: mini-batching, averaging, andmodel misspecification, 2016.[15] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar-jun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 ,2019.[16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In
Joint European Conference onMachine Learning and Knowledge Discovery in Databases , pages 795–811. Springer, 2016.[17] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich,and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for on-device feder-ated learning. arXiv preprint arXiv:1910.06378 , 2019.[18] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and PingTak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp min-ima, 2016.[19] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local gd onheterogeneous data. arXiv preprint arXiv:1909.04715 , 2019.[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.2009.[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deepconvolutional neural networks. In
Advances in neural information processing systems , pages1097–1105, 2012.[22] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the conver-gence of fedavg on non-{iid} data. In
International Conference on Learning Representations ,2020.[23] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradi-ent for nonconvex optimization. In
Advances in Neural Information Processing Systems , pages2737–2745, 2015.[24] Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and YifeiCheng. Variance reduced local sgd with lower communication complexity. arXiv preprintarXiv:1912.12844 , 2019.[25] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217 , 2018.[26] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Ar-cas. Communication-efficient learning of deep networks from decentralized data. In
ArtificialIntelligence and Statistics , pages 1273–1282, 2017.[27] Yurii Nesterov.
Lectures on convex optimization , volume 137. Springer, 2018.[28] Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, and Yifei Cheng. Faster distributeddeep net training: computation and communication decoupled stochastic gradient descent. In
Proceedings of the 28th International Joint Conference on Artificial Intelligence , pages 4582–4589. AAAI Press, 2019. 1129] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scaleimage recognition. arXiv preprint arXiv:1409.1556 , 2014.[30] Sebastian U. Stich. Local SGD converges fast and communicates little. In
International Con-ference on Learning Representations , 2019.[31] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory.In
Advances in Neural Information Processing Systems , pages 4447–4458, 2018.[32] Sebastian U. Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better ratesfor sgd with delayed gradients and compressed communication, 2019.[33] Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. Doublesqueeze: Parallel stochas-tic gradient descent with double-pass error-compensated compression. In
International Con-ference on Machine Learning , pages 6155–6165, 2019.[34] Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. arXiv preprint arXiv:1810.08313 , 2018.[35] Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design andanalysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576 , 2018.[36] Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growthimplies faster global convergence. In
Proceedings of the 34th International Conference onMachine Learning-Volume 70 , pages 3821–3830. JMLR. org, 2017.[37] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, andPeter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning, 2017.[38] Hao Yu and Rong Jin. On the computation and communication complexity of parallel sgd withdynamic batch sizes for stochastic non-convex optimization, 2019.[39] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and lesscommunication: Demystifying why model averaging works for deep learning. In
Proceedingsof the AAAI Conference on Artificial Intelligence , volume 33, pages 5693–5700, 2019.[40] Zhuoning Yuan, Yan Yan, Rong Jin, and Tianbao Yang. Stagewise training accelerates conver-gence of testing error over sgd. In
Advances in Neural Information Processing Systems , pages2604–2614, 2019.[41] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel sgd: Whendoes averaging help? arXiv preprint arXiv:1606.07365 , 2016.12
A Comparison to Previous Results
Table 3 summarizes the comparison of Local SGD and its state-of-the-art extensions with the results in this paper. The table shows the convergence rate and the communication complexity of these algorithms when the data distributions are identical or otherwise. Strongly convex objectives, non-convex objectives which satisfy the PL condition, and general non-convex objectives are all considered.

For both strongly convex objectives and non-convex objectives which satisfy the PL condition, STL-SGD achieves the state-of-the-art communication complexity while attaining the optimal convergence rate of O(1/(NT)). For general non-convex objectives, STL-SGD keeps the same communication complexity as Local SGD, but SCAFFOLD [17] achieves a lower communication complexity when data distributions are not identical. Nevertheless, the variance reduction technique used in SCAFFOLD is orthogonal to our study; it is an interesting direction to combine the techniques of STL-SGD and SCAFFOLD for the Non-IID case. Some existing studies make extra assumptions, including bounded gradients and bounded variance of gradients among clients, while the theoretical analysis in this paper does not depend on these assumptions.

Table 3: A comparison of the results in this paper and previous state-of-the-art results of Local SGD and its variants. Regarding the orders of convergence rate and communication complexity, we highlight the dependency on T (the number of iterations), N (the number of clients) and k (the communication period). Previous results may depend on extra assumptions: (1) an upper bound on the gradient, (2) an upper bound on the gradient variance among clients, and (3) an upper bound on the gradient diversity, shown in the last column.

Algorithms        | Objectives     | Convergence Rate | Communication Complexity | Data Distributions | Extra Assumptions
Local SGD [30]    | Strongly Convex | O(1/(NT))       | O(N^{3/2}T^{1/2})        | IID     | (1)
Local SGD [32]    | Strongly Convex | O(log T/(NT))   | O(N log(NT))             | IID     | No
STL-SGD           | Strongly Convex | O(1/(NT))       | O(N log T)               | IID     | No
Local SGD [22]    | Strongly Convex | O(k/(NT))       | O(T)                     | Non-IID | (1)
Local SGD [17]    | Strongly Convex | O(log T/(NT))   | O(N^{1/2}T^{1/2})        | Non-IID | No
SCAFFOLD [17]     | Strongly Convex | O(log T/(NT))   | O(log(NT))               | Non-IID | No
STL-SGD           | Strongly Convex | O(1/(NT))       | O(N^{1/2}T^{1/2})        | Non-IID | No
Local SGD [9]     | Non-Convex + PL | O(1/(NT))       | O((NT)^{1/3})            | IID     | No
STL-SGD           | Non-Convex + PL | O(1/(NT))       | O(N log T)               | IID     | No
STL-SGD           | Non-Convex + PL | O(1/(NT))       | O(N^{1/2}T^{1/2})        | Non-IID | No
Local SGD [35]    | Non-Convex      | O(1/√(NT))      | O(N^{3/2}T^{1/2})        | IID     | (1)
STL-SGD           | Non-Convex      | O(1/√(NT))      | O(N^{3/2}T^{1/2})        | IID     | No
Local SGD [28]    | Non-Convex      | O(1/√(NT))      | O(N^{3/4}T^{3/4})        | Non-IID | (2)
Local SGD [11]    | Non-Convex      | O(1/√(NT))      | O(N^{3/2}T^{1/2})        | Non-IID | (3)
SCAFFOLD [17]     | Non-Convex      | O(1/√(NT))      | O(N^{3/2}T^{1/2})        | Non-IID | No
STL-SGD           | Non-Convex      | O(1/√(NT))      | O(N^{3/4}T^{3/4})        | Non-IID | No

Although these studies prove a lower communication complexity, a suboptimal O(log T/(NT)) convergence rate is proved due to the small fixed learning rate. The adaptive variant of Local SGD proposed in [9] has the same order of communication complexity as Local SGD.
[Figure 3 panels: a9a IID, a9a Non-IID, MNIST IID, MNIST Non-IID; curves for SyncSGD, LB-SGD, CR-PSGD, Local-SGD and STL-SGD_sc; axes: epochs vs. objective gap.]

Figure 3: Training objective gap f(x) − f(x*) w.r.t. epochs for logistic regression on the a9a and MNIST datasets.

[Figure 4 panels: ResNet18 on CIFAR10 (IID and Non-IID) and VGG16 on CIFAR10 (IID and Non-IID); curves for SyncSGD, LB-SGD, CR-PSGD, Local-SGD, STL-SGD_nc-2 and STL-SGD_nc-1; axes: epochs vs. training loss.]

Figure 4: Training loss w.r.t. epochs for ResNet18 and VGG16 on the CIFAR10 dataset.
B More About Experiments
B.1 Experimental Results for Validating the Convergence Rate
In this subsection, we supplement the experimental results not included in Section 5. The rules for tuning the hyper-parameters are presented in Section 5, and we tune all hyper-parameters so that every algorithm achieves its best convergence speed. We present the training loss with regard to epochs in this subsection; the results for strongly convex objectives and non-convex objectives are shown in Figure 3 and Figure 4 respectively.

From the theoretical perspective, STL-SGD, CR-PSGD and Local SGD maintain the same convergence rate as SyncSGD: O(1/(NT)) for strongly convex objectives and O(1/√(NT)) for non-convex objectives. As shown in Figure 3 and Figure 4, when the hyper-parameters are set properly, the convergence speed of these algorithms is similar. STL-SGD and Local SGD may converge slowly at the beginning, but they match SyncSGD when the number of iterations is relatively large, which is consistent with our theory in Theorem 2 and Theorem 3 that the number of stages cannot be too small. Although LB-SGD is theoretically justified to achieve a linear speedup with respect to the batch size, it cannot maintain the convergence of mini-batch SGD (i.e., SyncSGD) when the batch size B gets large. The reason could be that the bias dominates the variance, as discussed in [14].

C Proofs for Results in Section 3
In this section, we first present some lemmas, then give the proof for Theorem 1.
C.1 Some Basic Lemmas
We bound the norm of the difference between gradients via the Bregman divergence D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩ of a smooth and convex function.

Lemma 1.
Suppose f(x) is L-smooth and convex. Then the following inequality holds:

    ‖∇f(x) − ∇f(y)‖² ≤ 2L · D_f(x, y).

Proof.
This lemma is identical to Theorem 2.1.5 (2.1.10) in [27], which is a basic property of smooth and convex functions.

For ease of analysis, we define x̂_t as the average of the local models, i.e., x̂_t = (1/N) ∑_{i=1}^N x_t^i. According to the update rule in Algorithm 1, we have

    x̂_{t+1} = (1/N) ∑_{i=1}^N x_{t+1}^i = (1/N) ∑_{i=1}^N ( x_t^i − η∇f(x_t^i, ξ_t^i) ) = x̂_t − (η/N) ∑_{i=1}^N ∇f(x_t^i, ξ_t^i).

We use t_p to denote the last communication time before t, i.e., t_p = ⌊t/k⌋ · k. Then we get

    x̂_t = x̂_{t_p} − (η/N) ∑_{τ=t_p}^{t−1} ∑_{i=1}^N ∇f(x_τ^i, ξ_τ^i)   and   x_t^i = x̂_{t_p} − η ∑_{τ=t_p}^{t−1} ∇f(x_τ^i, ξ_τ^i).    (8)

As each client updates its model locally and communicates with the others only periodically, it is important to ensure that the divergence of the local models does not grow too large. We use Lemma 2 to bound the difference between x̂_t and x_t^i.

Lemma 2.
Under Assumptions 1 and 2, for any x ∈ R d , Algorithm 1 ensures that N N X i =1 T − X t =0 E k ˆ x t − x it k ≤ k − − k η L T η σ + 8 kη L T − X t =0 E D f (ˆ x τ , x ) + 4 T kη ζ xf ! . (9) Proof.
According to (8), we have k ˆ x t − x it k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆ x t p − ηN t − X τ = t p N X j =1 ∇ f ( x jτ , ξ jτ ) − ˆ x t p − η t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) − N N X j =1 t − X τ = t p ∇ f ( x jτ , ξ jτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Since N P Ni =1 (cid:13)(cid:13)(cid:13) A i − N P Nj =1 A j (cid:13)(cid:13)(cid:13) = N P Ni =1 k A i k − (cid:13)(cid:13)(cid:13) N P Ni =1 A i (cid:13)(cid:13)(cid:13) , we have N N X i =1 E (cid:13)(cid:13) ˆ x t − x it (cid:13)(cid:13) = η N N X i =1 E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ η N N X i =1 E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . (10)Next, we bound E (cid:13)(cid:13)(cid:13)P t − τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13) : E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) − t − X τ = t p ∇ f i ( x iτ ) + t − X τ = t p ∇ f i ( x iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) a ) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) − t − X τ = t p ∇ f i ( x iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f i ( x iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b ) = t − X τ = t p E (cid:13)(cid:13) ∇ f ( x iτ , ξ iτ ) − ∇ f i ( x iτ ) (cid:13)(cid:13) + E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f i ( x iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c ) ≤ t − X τ = t p E (cid:13)(cid:13) ∇ f ( x iτ , ξ iτ ) − ∇ f i ( x iτ ) (cid:13)(cid:13) + ( t − t p ) t − X τ = t p E (cid:13)(cid:13) ∇ f i ( x iτ ) (cid:13)(cid:13) d ) ≤ ( t − t p ) σ + ( t − t p ) t − X τ = t p E (cid:13)(cid:13) ∇ f i ( x iτ ) (cid:13)(cid:13) , (11)15here ( a ) and ( b ) hold because E ∇ f ( x iτ , ξ iτ ) = ∇ f i ( x iτ ) and ξ iτ ’s are independent; ( c ) followsfrom Cauchy’s inequality; ( d ) is due to Assumption 2. We then bound E (cid:13)(cid:13) ∇ f i ( x iτ ) (cid:13)(cid:13) : E (cid:13)(cid:13) ∇ f i ( x iτ ) (cid:13)(cid:13) = E (cid:13)(cid:13) ∇ f i ( x iτ ) − ∇ f i (ˆ x τ ) + ∇ f i (ˆ x τ ) (cid:13)(cid:13) a ) ≤ E k∇ f i ( x iτ ) − ∇ f i (ˆ x τ ) k + 2 E k∇ f i (ˆ x τ ) k b ) ≤ L E k x iτ − ˆ x τ k + 2 E k∇ f i (ˆ x τ ) − ∇ f i ( x ) + ∇ f i ( x ) k c ) ≤ L E k x iτ − ˆ x τ k + 4 E k∇ f i (ˆ x τ ) − ∇ f i ( x ) k + 4 E k∇ f i ( x ) k d ) ≤ L E k x iτ − ˆ x τ k + 8 L E D f i (ˆ x τ , x ) + 4 E k∇ f i ( x ) k , (12)where ( a ) and ( c ) come from k a + b k ≤ k a k + 2 k b k , ( b ) holds because of Assumption 1, ( d ) follows from Lemma 1. Substituting (12) into (11) and based on t − t p ≤ k − , we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − X τ = t p ∇ f ( x iτ , ξ iτ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ( k − σ + ( k − t − X τ = t p (cid:0) L E k x iτ − ˆ x τ k + 8 L E D f i (ˆ x τ , x ) + 4 E k∇ f i ( x ) k (cid:1) . 
(13)Substituting (13) into (10) and according to the definition of ζ xf , we get N N X i =1 E (cid:13)(cid:13) ˆ x t − x it (cid:13)(cid:13) ≤ η ( k − σ + 2( k − η L N N X i =1 t − X τ = t p E k x iτ − ˆ x τ k +8( k − η L t − X τ = t p E D f (ˆ x τ , x ) + 4( k − η t − X τ = t p ζ xf . Summing up this inequality from t = 0 to T − , we have N N X i =1 T − X t =0 E (cid:13)(cid:13) ˆ x t − x it (cid:13)(cid:13) ≤ ( k − T η σ + 2 η L N N X i =1 T − X t =0 t − X τ = t p k x iτ − ˆ x τ k +8 η L T − X t =0 t − X τ = t p E D f (ˆ x τ , x ) + 4 η T − X t =0 t − X τ = t p ζ xf ! ≤ ( k − T η σ + 2 kη L N N X i =1 T − X t =0 k x iτ − ˆ x τ k + 8 kη L T − X t =0 E D f (ˆ x τ , x ) + 4 T kη ζ xf ! , (14)where the second inequality comes from a simple counting argument: P Tt =0 P t − τ = t p A τ ≤ P Tt =0 P t − τ = t − k A τ ≤ k P Tt =0 A t , A t ≥ . Rearranging (14), we get N N X i =1 T − X t =0 E k ˆ x t − x it k ≤ k − − k η L T η σ + 8 kη L T − X t =0 E D f (ˆ x τ , x ) + 4 T kη ζ xf ! . Below, we use Lemma 3 to bound the average of stochastic gradients.
Lemma 3.
Under Assumptions 1 and 2, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N X i =1 N ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ σ N + 3 L N N X i =1 E (cid:13)(cid:13) x it − ˆ x t (cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k . (15)16 roof. Since ξ it ’s are independent, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it , ξ it ) − N N X i =1 ∇ f i ( x it ) + 1 N N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 N E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N X i =1 ∇ f ( x it , ξ it ) − N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 N N X i =1 E (cid:13)(cid:13) ∇ f ( x it , ξ it ) − ∇ f i ( x it ) (cid:13)(cid:13) + E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ σ N + E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) , (16)where the last inequality comes from Assumption 2. According to Young’s Inequality and Cauchy’sInequality, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i ( x it ) − N N X i =1 ∇ f i (ˆ x t ) + 1 N N X i =1 ∇ f i (ˆ x t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 (cid:0) ∇ f i ( x it ) − ∇ f i (ˆ x t ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 32 E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f i (ˆ x t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 3 N E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N X i =1 (cid:0) ∇ f i ( x it ) − ∇ f i (ˆ x t ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k ≤ N N X i =1 E (cid:13)(cid:13) ∇ f i ( x it ) − ∇ f i (ˆ x t ) (cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k ≤ L N N X i =1 E (cid:13)(cid:13) x it − ˆ x t (cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k , (17)where the last inequality holds since f i ( x ) is L -smooth. Substituting (17) into (16), we complete theproof.Next, we bounded f (ˆ x t ) − f ( x ) for any x ∈ R d with Lemma 4. Lemma 4.
Suppose Assumptions 1 and 2 hold and f ( x ) is convex. When Algorithm 1 runs with afixed learning rate η , for any x ∈ R d , we have η T − X t =0 E ( f (ˆ x t ) − f ( x )) − η T − X t =0 E k∇ f (ˆ x t ) k − ( ηL + 3 η L )( k − − k η L kη L T − X t =0 E D f (ˆ x t , x ) ≤ k ˆ x − x ∗ k + T η σ N + ( ηL + 3 η L )( k − − k η L ( T η σ + 4 T kη ζ xf ) . (18) Proof.
Based on the update rule of Algorithm 1, we obtain E k ˆ x t +1 − x k = E k ˆ x t − x k − η E h ˆ x t − x, N N X i =1 ∇ f ( x it , ξ it ) i + η E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . = E k ˆ x t − x k − η E h ˆ x t − x, N N X i =1 ∇ f i ( x it ) i + η E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . (19)17ince f i ( x ) is convex and L -smooth, we have −h ˆ x t − x, N N X i =1 ∇ f i ( x it ) i = h x − ˆ x t , N N X i =1 ∇ f i ( x it ) i = 1 N N X i =1 (cid:0) h x − x it , ∇ f i ( x it ) i + h x it − ˆ x t , ∇ f i ( x it ) i (cid:1) ≤ N N X i =1 (cid:18)(cid:0) f i ( x ) − f i ( x it ) (cid:1) + (cid:18) f i ( x it ) − f i (ˆ x t ) + L k x it − ˆ x t k (cid:19)(cid:19) = 1 N N X i =1 (cid:18) f i ( x ) − f i (ˆ x t ) + L k x it − ˆ x t k (cid:19) . = f ( x ) − f (ˆ x t ) + L N N X i =1 k x it − ˆ x t k (20)Substituting (20) into (19) yields E k ˆ x t +1 − x k ≤ E k ˆ x t − x k +2 η f ( x ) − f (ˆ x t ) + L N N X i =1 k x it − ˆ x t k ! + η E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . (21)According to (15) in Lemma 3, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∇ f ( x it , ξ it ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ σ N + 3 L N N X i =1 E (cid:13)(cid:13) x it − ˆ x t (cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k (22)Combining (21) and (22), we get E k ˆ x t +1 − x k ≤ E k ˆ x t − x k + 2 η f ( x ) − f (ˆ x t ) + L N N X i =1 k x it − ˆ x t k ! + η σ N + 3 L N N X i =1 E (cid:13)(cid:13) x it − ˆ x t (cid:13)(cid:13) + 32 E k∇ f (ˆ x t ) k ! . = E k ˆ x t − x k − η E ( f (ˆ x t ) − f ( x )) + 3 η E k∇ f (ˆ x t ) k + ηL + 3 η L N N X i =1 k ˆ x t − x it k + η σ N . (23)Summing up this inequality from t = 0 to T − , we have E k ˆ x T − x k ≤ k ˆ x − x k − η T − X t =0 E ( f (ˆ x t ) − f ( x )) + 3 η T − X t =0 E k∇ f (ˆ x t ) k + ηL + 3 η L N N X i =1 T − X t =0 k ˆ x t − x it k + T η σ N . (24)18ubstituting (9) in Lemma 2 into (24), it holds that η T − X t =0 E ( f (ˆ x t ) − f ( x )) ≤ k ˆ x − x k − E k ˆ x T − x k + 3 η T − X t =0 E k∇ f (ˆ x t ) k + ( ηL + 3 η L )( k − − k η L T η σ + 8 kη L T − X t =0 E D f (ˆ x t , x ) + 4 T kη ζ xf ! + T η σ N . ≤ k ˆ x − x k + 3 η T − X t =0 E k∇ f (ˆ x t ) k + ( ηL + 3 η L )( k − − k η L T η σ + 8 kη L T − X t =0 E D f (ˆ x t , x ) + 4 T kη ζ xf ! + T η σ N . (25)Rearranging (25), we get η T − X t =0 E ( f (ˆ x t ) − f ( x )) − η T − X t =0 E k∇ f (ˆ x t ) k − ( ηL + 3 η L )( k − − k η L kη L T − X t =0 E D f (ˆ x t , x ) ≤ k ˆ x − x k + ( ηL + 3 η L )( k − − k η L ( T η σ + 4 T kη ζ xf ) + T η σ N . (26)
C.2 Proof of Theorem 1
Proof.
Applying (18) in Lemma 4 with x = x ∗ , it holds that η T − X t =0 E ( f (ˆ x t ) − f ( x ∗ )) − η T − X t =0 E k∇ f (ˆ x t ) k − ( ηL + 3 η L )( k − − k η L kη L T − X t =0 E D f (ˆ x t , x ∗ ) ≤ k ˆ x − x ∗ k + ( ηL + 3 η L )( k − − k η L ( T η σ + 4 T kη ζ ∗ f ) + T η σ N . (27)As f i ( x ) , i ∈ [ N ] are L -smooth, it is easy to verify that f ( x ) is L -smooth. According to Lemma 1,we have k∇ f (ˆ x t ) k = k∇ f (ˆ x t ) − ∇ f ( x ∗ ) k ≤ L D f (ˆ x t , x ∗ )= 2 L ( f (ˆ x t ) − f ( x ∗ )) . (28)Substituting (28) into the left hand side of (27) yields (cid:18) η − η L − ( ηL + 3 η L )( k − − k η L kη L (cid:19) T − X t =0 E ( f (ˆ x t ) − f ( x ∗ )) ≤ k ˆ x − x ∗ k + ( ηL + 3 η L )( k − − k η L ( T η σ + 4 T kη ζ ∗ f ) + T η σ N . (29)Setting the learning rate η so that η ≤ L and ηk ≤ L , we have ηL + 3 η L − k η L ≤ ηL + ηL − ≤ ηL , (30)19nd η − η L − ( ηL + 3 η L )8( k − kη L − k η L ≥ η − η L − ( ηL + 3 η L )8 k η L − k η L ≥ η − η − ( η + η )8 k η L − ≥ η − η − × × η ≥ η. (31)Substituting (30) and (31) into (29), we get η T − X t =0 E ( f (ˆ x t ) − f ( x ∗ )) ≤ k ˆ x − x ∗ k + 74 T η L ( k − σ + 4 kζ ∗ f ) + T η σ N .
Dividing by ηT on both sides of the above inequality yields T T − X t =0 E ( f (ˆ x t ) − f ( x ∗ )) ≤ k ˆ x − x ∗ k ηT + 2116 η L ( k − σ + 4 kζ ∗ f ) + 3 ησ N . ≤ k ˆ x − x ∗ k ηT + 32 η L ( k − σ + 4 kζ ∗ f ) + 3 ησ N .
Recall that we let ˜ x = ˆ x t for randomly chosen t from { , , · · · , T − } . Taking the expectationwith regard to t , we get E f (˜ x ) − f ( x ∗ ) ≤ k ˆ x − x ∗ k ηT + 32 η L ( k − σ + 4 kζ ∗ f ) + 3 ησ N . (32)Under the result of (32), we set k as k = ( min { ηLN , ηL } ζ ∗ f = 0 , min { σ √ ηLN ( σ +4 ζ f ) , ηL } else. (33)For the IID case, i.e., ζ ∗ f = 0 , based on the setting of k in (33), we have η L ( k − σ + 4 kζ ∗ f ) ≤ η Lkσ ≤ η L ηLN σ = ησ N . (34)For the Non-IID case, we get η L ( k − σ + 4 kζ ∗ f ) ≤ η Lk ( σ + 4 ζ ∗ f ) ≤ η L σ ηLN ( σ + 4 ζ ∗ f ) ( σ + 4 ζ ∗ f )= ησ N . (35)Substituting (34) and (35) into (32) yields E f (˜ x ) − f ( x ∗ ) ≤ k ˆ x − x ∗ k ηT + ησ N , (36)which completes the proof. 20
D Proofs for Results in Section 4.1
D.1 Proof of Theorem 2
Proof.
Based on the parameter settings in Algorithm 2, we have
$$\eta_s T_s = \frac{\eta_1}{2^{s-1}}\cdot 2^{s-1}T_1 = \eta_1 T_1 = \frac{6}{\mu} \quad (37)$$
and
$$k_s = \begin{cases} (\sqrt{2})^{\,s-1}k_1, & \text{Non-IID case},\\ 2^{\,s-1}k_1, & \text{IID case}, \end{cases} \qquad k_1 \le \begin{cases} \min\Big\{\dfrac{\sigma}{\sqrt{8\eta_1 LN(\sigma^2+4\zeta_f^*)}},\ \dfrac{1}{8\eta_1 L}\Big\}, & \text{Non-IID case},\\[8pt] \min\Big\{\dfrac{1}{8\eta_1 LN},\ \dfrac{1}{8\eta_1 L}\Big\}, & \text{IID case}, \end{cases}$$
which, together with $\eta_s = \eta_1/2^{s-1}$, implies
$$k_s \le \begin{cases} \min\Big\{\dfrac{\sigma}{\sqrt{8\eta_s LN(\sigma^2+4\zeta_f^*)}},\ \dfrac{1}{8\eta_s L}\Big\}, & \text{Non-IID case},\\[8pt] \min\Big\{\dfrac{1}{8\eta_s LN},\ \dfrac{1}{8\eta_s L}\Big\}, & \text{IID case}. \end{cases} \quad (38)$$
Thus, according to (37), (38) and (2) in Theorem 1, we get
$$\mathbb{E} f(x_{s+1}) - f(x^*) \le \frac{3\,\mathbb{E}\|x_s - x^*\|^2}{4\eta_s T_s} + \frac{\eta_s\sigma^2}{N} = \frac{\mu\,\mathbb{E}\|x_s - x^*\|^2}{8} + \frac{\eta_1\sigma^2}{2^{s-1}N}. \quad (39)$$
Since the objective $f(x)$ is $\mu$-strongly convex, we have
$$\frac{\mu\,\mathbb{E}\|x_s - x^*\|^2}{8} \le \frac{\mathbb{E} f(x_s) - f(x^*)}{4}. \quad (40)$$
Substituting (40) into (39) yields
$$\mathbb{E} f(x_{s+1}) - f(x^*) \le \frac{\mathbb{E} f(x_s) - f(x^*)}{4} + \frac{\eta_1\sigma^2}{2^{s-1}N}. \quad (41)$$
Subtracting $\frac{8\eta_1\sigma^2}{2^{s+1}N}$ on both sides of (41), we get
$$\mathbb{E} f(x_{s+1}) - f(x^*) - \frac{8\eta_1\sigma^2}{2^{s+1}N} \le \frac{1}{4}\Big(\mathbb{E} f(x_s) - f(x^*) - \frac{8\eta_1\sigma^2}{2^{s}N}\Big).$$
Based on the property of geometric progression, we have
$$\mathbb{E} f(x_S) - f(x^*) - \frac{8\eta_1\sigma^2}{2^{S}N} \le \Big(\frac{1}{4}\Big)^{S-1}\Big(\mathbb{E} f(x_1) - f(x^*) - \frac{4\eta_1\sigma^2}{N}\Big). \quad (42)$$
Setting $S \ge \log_2\big(\frac{N(f(x_1)-f(x^*))}{\eta_1\sigma^2}\big) + 2$ gives
$$f(x_1) - f(x^*) \le \frac{2^{S-2}\eta_1\sigma^2}{N}. \quad (43)$$
By substituting (43) into (42) and rearranging the result further, we obtain
$$\mathbb{E} f(x_S) - f(x^*) \le \frac{8\eta_1\sigma^2}{2^{S}N} + \frac{1}{4^{S-1}}\Big(\mathbb{E} f(x_1) - f(x^*) - \frac{4\eta_1\sigma^2}{N}\Big) \le \frac{8\eta_1\sigma^2}{2^{S}N} + \frac{\mathbb{E} f(x_1) - f(x^*)}{4^{S-1}} \le \frac{8\eta_1\sigma^2}{2^{S}N} + \frac{\eta_1\sigma^2}{2^{S}N} = \frac{9\eta_1\sigma^2}{2^{S}N}. \quad (44)$$
Since $T_s = 2^{s-1}T_1$, we have
$$T = T_1 + T_2 + \cdots + T_S = T_1(1 + 2 + \cdots + 2^{S-1}) = T_1(2^S - 1), \qquad\text{i.e.,}\qquad S = \log_2\Big(\frac{T}{T_1} + 1\Big).$$
Replacing $S$ with $\log_2(\frac{T}{T_1}+1)$ in (44) and combining $\eta_1 T_1 = \frac{6}{\mu}$, we have
$$\mathbb{E} f(x_S) - f(x^*) \le \frac{9\eta_1\sigma^2}{(\frac{T}{T_1}+1)N} = \frac{9\eta_1 T_1\sigma^2}{(T+T_1)N} = \frac{54\sigma^2}{\mu(T+T_1)N} = O\Big(\frac{1}{NT}\Big).$$

E Proofs for Results in Section 4.2
E.1 Proof for the result of STL-SGD$_{nc}$ with Option 1

We will first analyse the convergence of Local-SGD for a single stage in Lemma 5. Then we extend the result to $S$ stages in Theorem 3.

Lemma 5.
Suppose Assumptions 1, 2 and 3 hold. Let $\gamma^{-1} = 2\rho$, $\eta_s \le \frac{1}{24L_\gamma}$ and $k_s\eta_s \le \frac{1}{8L_\gamma}$, where $L_\gamma = L + \gamma^{-1}$. We have the following result for stage $s$ of Algorithm 3 with Option 1:
$$\mathbb{E} f(x_{s+1}) - f(x^*) \le \Big(\frac{3}{4\eta_sT_s} + \frac{4\rho}{3}\Big)\|x_s - x^*\|^2 + \frac{3\eta_s\sigma^2}{4N} + \frac{3}{2}\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta^*_f\big). \quad (45)$$

Proof.
We let the objectives in all stages be convex by setting $\gamma^{-1} > \rho$, where $\rho$ is the weak convexity parameter in Assumption 3. Recall that $f(x)$ is $L$-smooth. Denoting $L_\gamma = L + \gamma^{-1}$, we have
$$\|\nabla f^\gamma_{x_s}(x) - \nabla f^\gamma_{x_s}(y)\| = \Big\|\nabla f(x) - \nabla f(y) + \frac{1}{\gamma}(x-y)\Big\| \le \|\nabla f(x)-\nabla f(y)\| + \frac{1}{\gamma}\|x-y\| \le \Big(L + \frac{1}{\gamma}\Big)\|x-y\| = L_\gamma\|x-y\|, \quad (46)$$
where the first inequality comes from the triangle inequality. Thus, $f^\gamma_{x_s}(x)$ is $L_\gamma$-smooth. Based on Assumption 2, we further have
$$\mathbb{E}_{\xi\sim\mathcal{D}_i}\|\nabla f^\gamma_{x_s}(x,\xi) - \nabla f^\gamma_{x_s,i}(x)\|^2 = \mathbb{E}_{\xi\sim\mathcal{D}_i}\|\nabla f(x,\xi) - \nabla f_i(x)\|^2 \le \sigma^2. \quad (47)$$
As we set $\gamma^{-1} > \rho$, $f^\gamma_{x_s}$ is $(\gamma^{-1}-\rho)$-strongly convex, thus we have
$$-\Big\langle \hat x_t - x,\ \frac{1}{N}\sum_{i=1}^N\nabla f^\gamma_{x_s,i}(x_t^i)\Big\rangle = \Big\langle x - \hat x_t,\ \frac{1}{N}\sum_{i=1}^N\nabla f^\gamma_{x_s,i}(x_t^i)\Big\rangle = \frac{1}{N}\sum_{i=1}^N\Big(\langle x - x_t^i, \nabla f^\gamma_{x_s,i}(x_t^i)\rangle + \langle x_t^i - \hat x_t, \nabla f^\gamma_{x_s,i}(x_t^i)\rangle\Big)$$
$$\le \frac{1}{N}\sum_{i=1}^N\bigg(\Big(f^\gamma_{x_s,i}(x) - f^\gamma_{x_s,i}(x_t^i) - \frac{\gamma^{-1}-\rho}{2}\|x_t^i - x\|^2\Big) + \Big(f^\gamma_{x_s,i}(x_t^i) - f^\gamma_{x_s,i}(\hat x_t) + \frac{L_\gamma}{2}\|x_t^i - \hat x_t\|^2\Big)\bigg)$$
$$= \frac{1}{N}\sum_{i=1}^N\Big(f^\gamma_{x_s,i}(x) - f^\gamma_{x_s,i}(\hat x_t) + \frac{L_\gamma}{2}\|x_t^i - \hat x_t\|^2 - \frac{\gamma^{-1}-\rho}{2}\|x_t^i - x\|^2\Big) \le f^\gamma_{x_s}(x) - f^\gamma_{x_s}(\hat x_t) + \frac{L_\gamma}{2N}\sum_{i=1}^N\|x_t^i - \hat x_t\|^2 - \frac{\gamma^{-1}-\rho}{2}\|\hat x_t - x\|^2, \quad (48)$$
where the last inequality holds because the function $g(x) = \|x\|^2$ is convex. Respectively replacing (20) with (48), $L$ with $L_\gamma$ and $x$ with $x^*$, and going through the proof process in Lemma 4 again, we get
$$2\eta_s\sum_{t=0}^{T_s-1}\mathbb{E}\big(f^\gamma_{x_s}(\hat x_t) - f^\gamma_{x_s}(x^*)\big) - \frac{3\eta_s^2}{2}\sum_{t=0}^{T_s-1}\mathbb{E}\|\nabla f^\gamma_{x_s}(\hat x_t)\|^2 - \frac{(\eta_sL_\gamma + 3\eta_s^2L_\gamma^2)(k_s-1)}{1-8k_s^2\eta_s^2L_\gamma^2}\,8k_s\eta_s^2L_\gamma\sum_{t=0}^{T_s-1}\mathbb{E} D_{f^\gamma_{x_s}}(\hat x_t, x^*)$$
$$\le \|x_s - x^*\|^2 - \eta_s(\gamma^{-1}-\rho)\sum_{t=0}^{T_s-1}\mathbb{E}\|\hat x_t - x^*\|^2 + \frac{(\eta_sL_\gamma + 3\eta_s^2L_\gamma^2)(k_s-1)}{1-8k_s^2\eta_s^2L_\gamma^2}\Big(T_s\eta_s^2\sigma^2 + 4T_sk_s\eta_s^2\zeta^*_{f^\gamma_{x_s}}\Big) + \frac{T_s\eta_s^2\sigma^2}{N}, \quad (49)$$
where $D_{f^\gamma_{x_s}}(\hat x_t, x^*) = f^\gamma_{x_s}(\hat x_t) - f^\gamma_{x_s}(x^*) - \langle\nabla f^\gamma_{x_s}(x^*), \hat x_t - x^*\rangle$ and $\zeta^*_{f^\gamma_{x_s}} = \frac{1}{N}\sum_{i=1}^N\|\nabla f_i(x^*) + \frac{x^* - x_s}{\gamma}\|^2$. We bound $\|\nabla f^\gamma_{x_s}(\hat x_t)\|^2$ as
$$\|\nabla f^\gamma_{x_s}(\hat x_t)\|^2 = \|\nabla f^\gamma_{x_s}(\hat x_t) - \nabla f^\gamma_{x_s}(x^*) + \nabla f^\gamma_{x_s}(x^*)\|^2 \le 2\|\nabla f^\gamma_{x_s}(\hat x_t) - \nabla f^\gamma_{x_s}(x^*)\|^2 + 2\|\nabla f^\gamma_{x_s}(x^*)\|^2 \le 4L_\gamma D_{f^\gamma_{x_s}}(\hat x_t, x^*) + \frac{2}{\gamma^2}\|x^* - x_s\|^2, \quad (50)$$
where the last inequality comes from Lemma 1. As $\frac{1}{N}\sum_{i=1}^N\nabla f_i(x^*) = \nabla f(x^*) = 0$, we have
$$\zeta^*_{f^\gamma_{x_s}} = \frac{1}{N}\sum_{i=1}^N\Big\|\nabla f_i(x^*) + \frac{x^* - x_s}{\gamma}\Big\|^2 = \frac{1}{N}\sum_{i=1}^N\|\nabla f_i(x^*)\|^2 + \frac{1}{\gamma^2}\|x^* - x_s\|^2 = \zeta^*_f + \frac{1}{\gamma^2}\|x^* - x_s\|^2 \quad (51)$$
and
$$D_{f^\gamma_{x_s}}(\hat x_t, x^*) = f^\gamma_{x_s}(\hat x_t) - f^\gamma_{x_s}(x^*) + \frac{1}{\gamma}\langle x^* - x_s, x^* - \hat x_t\rangle \overset{(a)}{=} f^\gamma_{x_s}(\hat x_t) - f^\gamma_{x_s}(x^*) + \frac{1}{2\gamma}\big(\|x^* - x_s\|^2 - \|x_s - \hat x_t\|^2 + \|x^* - \hat x_t\|^2\big), \quad (52)$$
where $(a)$ is based on the fact that $\langle x-y, x-z\rangle = \frac{1}{2}\big(\|x-y\|^2 - \|y-z\|^2 + \|x-z\|^2\big)$. Substituting (50), (51), (52) into (49) and taking the expectation regarding $t$, we get
$$T_s\big(2\eta_s - 3\eta_s^2L_\gamma - 8A_\gamma k_s\eta_s^2L_\gamma\big)\,\mathbb{E}\big(f^\gamma_{x_s}(x_{s+1}) - f^\gamma_{x_s}(x^*)\big) - \Big(\frac{\eta_s^2T_s}{\gamma^2} + \frac{3\eta_s^2L_\gamma T_s}{\gamma} + \frac{4A_\gamma k_s\eta_s^2L_\gamma T_s}{\gamma}\Big)\|x^* - x_s\|^2$$
$$\le \Big(1 + \frac{4A_\gamma k_s\eta_s^2T_s}{\gamma^2}\Big)\|x_s - x^*\|^2 + \Big(\frac{A_\gamma k_s\eta_s^2L_\gamma}{\gamma} + \frac{3\eta_s^2L_\gamma}{\gamma} - \eta_s(\gamma^{-1}-\rho)\Big)\sum_{t=0}^{T_s-1}\mathbb{E}\|\hat x_t - x^*\|^2 + A_\gamma T_s\eta_s^2\big(\sigma^2 + 4k_s\zeta_f^*\big) + \frac{T_s\eta_s^2\sigma^2}{N}, \quad (53)$$
where $A_\gamma = \frac{(\eta_sL_\gamma + 3\eta_s^2L_\gamma^2)(k_s-1)}{1-8k_s^2\eta_s^2L_\gamma^2}$.
Setting $\gamma^{-1} = 2\rho$, $\eta_s \le \frac{1}{24L_\gamma}$ and $\eta_s k_s \le \frac{1}{8L_\gamma}$, and bounding $A_\gamma$ as in (30) and (31), we have
$$A_\gamma k_s\eta_s^2 L_\gamma \le \frac{7}{4}\eta_sL_\gamma\cdot k_s\eta_s\cdot\eta_sL_\gamma \le \frac{7\eta_s}{768}, \quad (54)$$
$$2\eta_s - 3\eta_s^2L_\gamma - 8A_\gamma k_s\eta_s^2L_\gamma \ge 2\eta_s - \frac{\eta_s}{8} - \frac{7\eta_s}{96} \ge \frac{4}{3}\eta_s, \quad (55)$$
and
$$\frac{A_\gamma k_s\eta_s^2L_\gamma}{\gamma} + \frac{3\eta_s^2L_\gamma}{\gamma} - \eta_s(\gamma^{-1}-\rho) \le \frac{7\eta_s\rho}{384} + \frac{\eta_s\rho}{4} - \eta_s\rho \le 0. \quad (56)$$
Substituting (54), (55) and (56) into (53) yields
$$\frac{4}{3}\eta_sT_s\,\mathbb{E}\big(f^\gamma_{x_s}(x_{s+1}) - f^\gamma_{x_s}(x^*)\big) \le \Big(1 + \frac{7\rho^2\eta_sT_s}{48L_\gamma} + 4\rho^2\eta_s^2T_s + 6\rho\eta_s^2L_\gamma T_s + \frac{7\rho\eta_sT_s}{96}\Big)\|x_s - x^*\|^2 + \frac{7}{4}T_s\eta_s^3L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta_f^*\big) + \frac{T_s\eta_s^2\sigma^2}{N}. \quad (57)$$
By the definition of $f^\gamma_{x_s}(x)$ and $\gamma^{-1} = 2\rho$, we have
$$f^\gamma_{x_s}(x_{s+1}) - f^\gamma_{x_s}(x^*) = f(x_{s+1}) - f(x^*) + \rho\|x_{s+1} - x_s\|^2 - \rho\|x^* - x_s\|^2 \ge f(x_{s+1}) - f(x^*) - \rho\|x^* - x_s\|^2. \quad (58)$$
Substituting (58) into (57) and rearranging the result further, we get
$$\frac{4}{3}\eta_sT_s\big(f(x_{s+1}) - f(x^*)\big) \le \Big(1 + \frac{7\rho^2\eta_sT_s}{48L_\gamma} + 4\rho^2\eta_s^2T_s + 6\rho\eta_s^2L_\gamma T_s + \frac{7\rho\eta_sT_s}{96} + \frac{4}{3}\rho\eta_sT_s\Big)\|x_s - x^*\|^2 + \frac{7}{4}T_s\eta_s^3L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta_f^*\big) + \frac{T_s\eta_s^2\sigma^2}{N}.$$
Dividing by $\frac{4}{3}\eta_sT_s$ on both sides of the above inequality yields
$$f(x_{s+1}) - f(x^*) \le \Big(\frac{3}{4\eta_sT_s} + \frac{7\rho^2}{64L_\gamma} + 3\rho^2\eta_s + \frac{9}{2}\rho\eta_sL_\gamma + \frac{7\rho}{128} + \rho\Big)\|x_s - x^*\|^2 + \frac{21}{16}\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta_f^*\big) + \frac{3\eta_s\sigma^2}{4N}.$$
Since $L \ge \rho$, we have $L_\gamma = L + \gamma^{-1} \ge 3\rho$ and $\eta_s \le \frac{1}{24L_\gamma} \le \frac{1}{72\rho}$, so that
$$\frac{7\rho^2}{64L_\gamma} + 3\rho^2\eta_s + \frac{9}{2}\rho\eta_sL_\gamma + \frac{7\rho}{128} \le \frac{7\rho}{192} + \frac{\rho}{24} + \frac{3\rho}{16} + \frac{7\rho}{128} \le \frac{\rho}{3},$$
and hence
$$f(x_{s+1}) - f(x^*) \le \Big(\frac{3}{4\eta_sT_s} + \frac{4\rho}{3}\Big)\|x_s - x^*\|^2 + \frac{3}{2}\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta_f^*\big) + \frac{3\eta_s\sigma^2}{4N},$$
which is (45).
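For readers who prefer code to formulas, the surrogate used throughout this subsection is simply a proximal regularization of $f$ around the stage center $x_s$. Below is a minimal sketch; the gradient oracle, the function name and the toy check are ours.

```python
import numpy as np

# Stage-s surrogate of Option 1:  f^gamma_{x_s}(x) = f(x) + ||x - x_s||^2 / (2 * gamma),
# with gamma^{-1} = 2 * rho, which is (gamma^{-1} - rho)-strongly convex whenever f is
# rho-weakly convex.  grad_f is any gradient oracle for f.
def surrogate_grad(grad_f, x, x_s, rho):
    gamma = 1.0 / (2.0 * rho)
    return grad_f(x) + (x - x_s) / gamma      # = grad f(x) + 2 * rho * (x - x_s)

# Toy check on f(x) = 0.5 * ||x||^2.
grad_f = lambda x: x
print(surrogate_grad(grad_f, np.zeros(3), np.ones(3), rho=0.5))   # -> [-1. -1. -1.]
```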
Proof of Theorem 3
Proof.
Since $f(x)$ satisfies the PL condition with parameter $\mu$, we have
$$\frac{\mu}{2}\|x - x^*\|^2 \le f(x) - f(x^*). \quad (59)$$
Combining (59) with the result of Lemma 5, we have
$$f(x_{s+1}) - f(x^*) \le \Big(\frac{3}{4\eta_sT_s} + \frac{4\rho}{3}\Big)\|x_s - x^*\|^2 + \frac{3}{2}\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta^*_f\big) + \frac{3\eta_s\sigma^2}{4N}$$
$$\le \Big(\frac{3}{4\eta_sT_s} + \frac{4\rho}{3}\Big)\frac{2}{\mu}\big(f(x_s) - f(x^*)\big) + \frac{3}{2}\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta^*_f\big) + \frac{3\eta_s\sigma^2}{4N}. \quad (60)$$
According to the parameter settings in Option 1 of Algorithm 3, we have
$$\eta_sT_s = \frac{\eta_1}{2^{s-1}}\cdot 2^{s-1}T_1 = \eta_1T_1 = \frac{6}{\rho} \quad (61)$$
and, as in (38),
$$k_s \le \begin{cases} \min\Big\{\dfrac{\sigma}{\sqrt{8\eta_sL_\gamma N(\sigma^2+4\zeta^*_f)}},\ \dfrac{1}{8\eta_sL_\gamma}\Big\}, & \text{Non-IID case},\\[8pt] \min\Big\{\dfrac{1}{8\eta_sL_\gamma N},\ \dfrac{1}{8\eta_sL_\gamma}\Big\}, & \text{IID case}. \end{cases} \quad (62)$$
Similar to the proof of (34) and (35), we have
$$\eta_s^2L_\gamma\big((k_s-1)\sigma^2 + 4k_s^2\zeta^*_f\big) \le \frac{\eta_s\sigma^2}{8N}. \quad (63)$$
Substituting (61) and (63) into (60), according to $\mu \ge 12\rho$, we have
$$f(x_{s+1}) - f(x^*) \le \Big(\frac{\rho}{8} + \frac{4\rho}{3}\Big)\frac{2}{\mu}\big(f(x_s) - f(x^*)\big) + \frac{3\eta_s\sigma^2}{16N} + \frac{3\eta_s\sigma^2}{4N} \le \frac{1}{4}\big(f(x_s) - f(x^*)\big) + \frac{\eta_1\sigma^2}{2^{s-1}N}. \quad (64)$$
Note that the formula of (64) is the same as (41). Thus, the rest of the proof is a duplicate of that of Theorem 2.
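The stagewise schedule used in Theorem 2 and in Option 1 above is easy to state in code. The sketch below only mirrors the parameter settings quoted in (37)-(38) and (61)-(62); the function name and the rounding of $k_s$ are ours.

```python
def stagewise_schedule(eta1, T1, k1, S, iid=True):
    """Option-1 / Theorem-2 schedule: eta_s halves, T_s doubles, and k_s doubles
    (IID) or grows by sqrt(2) (Non-IID), so eta_s * T_s stays constant while the
    communication period keeps increasing."""
    stages = []
    for s in range(1, S + 1):
        eta_s = eta1 / 2 ** (s - 1)
        T_s = T1 * 2 ** (s - 1)
        k_s = k1 * (2 ** (s - 1) if iid else 2 ** ((s - 1) / 2))
        stages.append((s, eta_s, T_s, max(1, round(k_s))))
    return stages

for s, eta_s, T_s, k_s in stagewise_schedule(eta1=0.1, T1=100, k1=2, S=5):
    print(f"stage {s}: eta_s={eta_s:.4f}  T_s={T_s}  k_s={k_s}")
```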
E.2 Proof for the result of STL-SGD$_{nc}$ with Option 2

Proof of Theorem 4

Proof.
For convenience of analysis, we let $x^*_s$ denote the optimal solution of the objective used in the $s$-th stage, $f^\gamma_{x_s}(x)$. According to (46) and (47), $f^\gamma_{x_s}$ is $L_\gamma$-smooth and the variance of its stochastic gradients is bounded by $\sigma^2$. We set $\eta_1 \le \frac{1}{8L_\gamma}$, $k_1 = \min\big\{\frac{1}{8\eta_1L_\gamma N}, \frac{1}{8\eta_1L_\gamma}\big\}$ when $\zeta^*_f = 0$ and $k_1 = \min\big\{\frac{\sigma}{\sqrt{8\eta_1L_\gamma N(\sigma^2+4\zeta^*_f)}}, \frac{1}{8\eta_1L_\gamma}\big\}$ when $\zeta^*_f \neq 0$. As $\eta_s = \eta_1/s$ and
$$k_s = \begin{cases} s\,k_1, & \text{IID case},\\ \sqrt{s}\,k_1, & \text{else}, \end{cases}$$
we have
$$\eta_s \le \frac{1}{8L_\gamma} \quad (65)$$
and
$$k_s \le \begin{cases} \min\Big\{\dfrac{1}{8\eta_sL_\gamma N},\ \dfrac{1}{8\eta_sL_\gamma}\Big\}, & \text{IID case},\\[8pt] \min\Big\{\dfrac{\sigma}{\sqrt{8\eta_sL_\gamma N(\sigma^2+4\zeta^*_f)}},\ \dfrac{1}{8\eta_sL_\gamma}\Big\}, & \text{else}. \end{cases} \quad (66)$$
By setting $\gamma^{-1} > \rho$, we can ensure that $f^\gamma_{x_s}$ is strongly convex. Based on these settings, we apply Theorem 1 in each call of Local-SGD in STL-SGD$_{nc}$:
$$f^\gamma_{x_s}(x_{s+1}) - f^\gamma_{x_s}(x^*_s) \le \frac{3\|x_s - x^*_s\|^2}{4\eta_sT_s} + \frac{\eta_s\sigma^2}{N}. \quad (67)$$
Under the definition $f^\gamma_{x_s}(x_{s+1}) = f(x_{s+1}) + \frac{1}{2\gamma}\|x_{s+1} - x_s\|^2$, and the strong convexity $f^\gamma_{x_s}(x_s) - f^\gamma_{x_s}(x^*_s) \ge \frac{\gamma^{-1}-\rho}{2}\|x_s - x^*_s\|^2$, we have
$$f(x_{s+1}) + \frac{1}{2\gamma}\|x_{s+1} - x_s\|^2 + \frac{\gamma^{-1}-\rho}{2}\|x_s - x^*_s\|^2 - f(x_s) \le \frac{3\|x_s - x^*_s\|^2}{4\eta_sT_s} + \frac{\eta_s\sigma^2}{N}. \quad (68)$$
Setting $\gamma^{-1} = 2\rho$ and rearranging (68) yields
$$\rho\|x_{s+1} - x_s\|^2 + \frac{\rho}{2}\|x_s - x^*_s\|^2 \le f(x_s) - f(x_{s+1}) + \frac{3\|x_s - x^*_s\|^2}{4\eta_sT_s} + \frac{\eta_s\sigma^2}{N}. \quad (69)$$
As $\eta_s = \eta_1/s$, $T_s = sT_1$ and $\eta_1T_1 = \frac{3}{\rho}$, we have
$$\rho\|x_{s+1} - x_s\|^2 + \frac{\rho}{4}\|x_s - x^*_s\|^2 \le f(x_s) - f(x_{s+1}) + \frac{\eta_1\sigma^2}{sN}. \quad (70)$$
According to the $L_\gamma$-smoothness of $f^\gamma_{x_s}(x)$, we have
$$\|\nabla f(x_s)\|^2 = \|\nabla f^\gamma_{x_s}(x_s)\|^2 = \|\nabla f^\gamma_{x_s}(x_s) - \nabla f^\gamma_{x_s}(x^*_s)\|^2 \le L_\gamma^2\|x_s - x^*_s\|^2. \quad (71)$$
Combining (70) and (71) yields
$$\frac{\rho}{4L_\gamma^2}\|\nabla f(x_s)\|^2 \le \frac{\rho}{4}\|x_s - x^*_s\|^2 \le f(x_s) - f(x_{s+1}) + \frac{\eta_1\sigma^2}{sN}. \quad (72)$$
Define $w_s = s$ and $\Delta_s = f(x_s) - f(x_{s+1})$. Multiplying both sides by $w_s$, we have
$$\frac{\rho w_s}{4L_\gamma^2}\|\nabla f(x_s)\|^2 \le w_s\Delta_s + \frac{w_s\eta_1\sigma^2}{sN}. \quad (73)$$
After telescoping (73) for $s = 1, 2, \cdots, S$, we get
$$\sum_{s=1}^S w_s\|\nabla f(x_s)\|^2 \le \frac{4L_\gamma^2}{\rho}\Big(\sum_{s=1}^S w_s\Delta_s + \sum_{s=1}^S \frac{w_s\eta_1\sigma^2}{sN}\Big). \quad (74)$$
Taking the expectation w.r.t. $s \in \{1, 2, \cdots, S\}$ with probability $p_s = \frac{s}{1+2+\cdots+S}$, we have
$$\mathbb{E}\|\nabla f(x_s)\|^2 \le \frac{4L_\gamma^2}{\rho}\Bigg(\frac{\sum_{s=1}^S w_s\Delta_s}{\sum_{s=1}^S w_s} + \frac{\sum_{s=1}^S \frac{w_s\eta_1\sigma^2}{sN}}{\sum_{s=1}^S w_s}\Bigg). \quad (75)$$
Based on the definition of $w_s$ and $\Delta_s$, setting $w_0 = 0$, we have
$$\sum_{s=1}^S w_s\Delta_s = \sum_{s=1}^S w_s\big(f(x_s) - f(x_{s+1})\big) = \sum_{s=1}^S f(x_s) - Sf(x_{S+1}) \le S\big(f(\bar x) - f(x_{S+1})\big) \le w_S\big(f(\bar x) - f(x^*)\big), \quad (76)$$
where $\bar x = \arg\max_{x_i, i\in[S]} f(x_i)$. Substituting (76) into (75), we get
$$\mathbb{E}\|\nabla f(x_s)\|^2 \le \frac{4L_\gamma^2}{\rho}\Bigg(\frac{w_S\big(f(\bar x) - f(x^*)\big)}{\sum_{s=1}^S w_s} + \frac{\sum_{s=1}^S \frac{w_s\eta_1\sigma^2}{sN}}{\sum_{s=1}^S w_s}\Bigg) = \frac{8L_\gamma^2}{\rho}\bigg(\frac{f(\bar x) - f(x^*)}{S+1} + \frac{\eta_1\sigma^2}{(S+1)N}\bigg). \quad (77)$$
As $T_s = sT_1$, we have
$$T = T_1 + T_2 + \cdots + T_S = T_1(1 + 2 + \cdots + S) = \frac{T_1S(S+1)}{2} \le T_1(S+1)^2. \quad (78)$$
Substituting $S + 1 \ge \sqrt{\frac{T}{T_1}}$ into (77), we get
$$\mathbb{E}\|\nabla f(x_s)\|^2 \le \frac{8L_\gamma^2}{\rho}\Bigg(\frac{f(\bar x) - f(x^*)}{\sqrt{T/T_1}} + \frac{\eta_1\sigma^2}{\sqrt{T/T_1}\,N}\Bigg) = O\Bigg(\frac{\big(f(\bar x) - f(x^*)\big)\sqrt{T_1}}{\sqrt{T}} + \frac{\sqrt{T_1}\,\eta_1\sigma^2}{N\sqrt{T}}\Bigg) = O\Bigg(\frac{f(\bar x) - f(x^*)}{\sqrt{T\eta_1}} + \frac{\sqrt{\eta_1}\,\sigma^2}{N\sqrt{T}}\Bigg), \quad (79)$$
where the last equality holds since $\eta_1T_1 = 3/\rho$. We use $\eta_1^N$ to denote the learning rate when using $N$ clients. Setting $\eta_1^N = N\eta_1^1$ yields
$$\mathbb{E}\|\nabla f(x_s)\|^2 \le O\Big(\frac{1}{\sqrt{NT}}\Big).$$
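To summarize the Option-2 settings used in this proof, here is a short sketch of the stage parameters and of the weighted choice of the output stage with $p_s \propto s$; as before, the function names and the rounding are ours.

```python
import numpy as np

def option2_schedule(eta1, T1, k1, S, iid=True):
    """Option-2 stage parameters as used above: eta_s = eta1 / s, T_s = s * T1,
    and k_s = s * k1 (IID) or sqrt(s) * k1 (Non-IID)."""
    return [(s, eta1 / s, s * T1,
             max(1, round((s if iid else s ** 0.5) * k1))) for s in range(1, S + 1)]

def sample_output_stage(S, rng=np.random.default_rng(0)):
    """Return the output stage s drawn with probability p_s = s / (1 + ... + S),
    matching the weights w_s = s used in (73)-(77)."""
    weights = np.arange(1, S + 1, dtype=float)
    return int(rng.choice(np.arange(1, S + 1), p=weights / weights.sum()))

print(option2_schedule(eta1=0.1, T1=100, k1=2, S=4))
print(sample_output_stage(S=4))
```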