Convergence of Distributed Stochastic Variance Reduced Methods without Sampling Extra Data
Shicong Cen†, Huishuai Zhang‡, Yuejie Chi†, Wei Chen‡, Tie-Yan Liu‡
†Carnegie Mellon University  ‡Microsoft Research Asia
{shicongc,yuejiec}@andrew.cmu.edu, {huzhang, wche, tyliu}@microsoft.com
January 24, 2020
Abstract
Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods, by assigning outer loops to the parameter server and inner loops to worker machines. This framework is natural and friendly to implement, but its theoretical convergence is not well understood. We obtain a unified understanding of algorithmic convergence with respect to data homogeneity by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the data are less balanced, regularization can be used to ensure convergence, albeit at a slower rate. We also demonstrate that our analysis can be further extended to handle nonconvex loss functions.
1 Introduction

Empirical risk minimization arises frequently in machine learning and signal processing, where the objective function is the average of losses computed at different data points. Due to the increasing size of data, distributed computing architectures, which assign the learning task over multiple computing nodes, are in great need to meet the scalability requirement in terms of both computation power and storage space. In addition, distributed frameworks are suitable for problems where there are privacy concerns in transmitting and storing all the data in a central location, a scenario related to the nascent field of federated learning [14]. It is, therefore, necessary to develop distributed optimization frameworks that are tailored to solving large-scale empirical risk minimization problems with desirable communication-computation trade-offs, where the data are stored disjointly over different machines.

Due to the low per-iteration cost, a popular solution is distributed stochastic gradient descent (SGD) [23], where the parameter server aggregates gradients from each worker and performs mini-batch gradient updates. However, distributed SGD is not communication-efficient and requires many communication rounds to converge, which partially diminishes the benefit of distribution. On the other hand, recent breakthroughs in developing stochastic variance reduced methods have made it possible to achieve fast convergence and small per-iteration cost at the same time, such as the notable SVRG [13] algorithm. Yet, distributed schemes of such variance reduced methods that are both practical and theoretically sound are much less developed.

This paper focuses on a general framework of distributed stochastic variance reduced methods, to be presented in Alg. 1, which is natural and friendly to implement. On a high level, SVRG-type algorithms [13] contain inner loops for parameter updates via variance-reduced SGD, and outer loops for global gradient and parameter updates.
Our general framework assigns outer loops to the parameter server, and inner loops to worker machines. The parameter server collects gradients from worker machines and then distributes the global gradient to each machine. Each worker machine then runs the inner loop independently in parallel using variance reduction techniques, which might differ when distributing different algorithms, and returns the updates to the parameter server at the end. Per iteration, two communication rounds are required: one communication round is used to average the parameter estimates, and the other is used to average the gradients, which is the same as distributed synchronous SGD. However, the premise is that by performing more efficient local computation using stochastic variance reduced methods, the algorithm converges in fewer iterations and is therefore more communication-efficient.

Due to the simplicity of this framework, similar methods have been implemented in several works [14, 7, 24], and have achieved great empirical success. Surprisingly, a complete theoretical understanding of its convergence behavior is still missing at large. Moreover, distributed variants using accelerated variance reduction methods are not developed. The main analysis difficulty is that the variance-reduced gradient of each worker is no longer an unbiased gradient estimator when sampling from re-used local data.

To ease this difficulty, several variants of distributed SVRG, e.g. [15, 28, 32], have been proposed with performance guarantees, which try to bypass the biased gradient estimation issue by emulating the process of i.i.d. sampling from the global data using some complicated random data re-allocation protocol, which requires sampling extra data with or without replacement.
These procedures lead to unnecessary data waste and potential privacy leakage, and can be cumbersome and difficult to implement in practice. Consequently, a natural question arises: can we provide a mathematical analysis of the convergence of the natural framework of distributed stochastic variance reduced methods, under some simple and intuitive metric?
This paper provides a convergence analysis of a family of naturally distributed stochastic variance reduced methods under the framework described in Alg. 1, for both convex and nonconvex loss functions. By using different variance reduction schemes at the worker machines, we study distributed variants of three representative algorithms in this paper: SVRG [13], SARAH employing recursive gradient updates [21, 22], and MiG employing accelerated gradient updates [42]. Our methodology can be extended to study other variants in a similar fashion. The contributions of this paper are summarized below.

• We suggest a simple and intuitive metric called distributed smoothness to gauge data balancedness among workers, defined as the smoothness of the difference f_k − f between the local loss function f_k and the global loss function f, which is the average of the local loss functions. The metric is deterministic, easy to compute, applies to arbitrary dataset splitting, and is shown to play a critical role in the convergence analysis.

• We establish the linear convergence of the distributed algorithms D-SVRG, D-SARAH, and D-MiG under strongly convex losses, as long as the distributed smoothness parameter is smaller than a constant fraction of the strong convexity parameter σ, where the fraction might change for different algorithms. Our bounds capture the phenomenon that the convergence rate improves as the local loss functions become more similar to the global loss function, i.e., as the distributed smoothness parameter decreases. Furthermore, the run time complexity exhibits the so-called “linear speed-up” property in distributed computing, where the complexity depends on the local data size, instead of the global data size, which typically implies an improvement by a factor of n, where n is the number of machines.

• When the local data are highly unbalanced, the distributed smoothness parameter becomes large, which implies that the algorithm might diverge.
We suggest regularization as an effective way to handle this situation, and show that by adding larger regularization to machines that are less distributed smooth, one can still ensure linear convergence in a regularized version of D-SVRG, called D-RSVRG, though at a slower rate of convergence.

• More generally, the notion of distributed smoothness can also be used to establish convergence under nonconvex losses. We demonstrate this through the convergence analysis of D-SARAH in the nonconvex setting.

Table 1: Communication rounds and runtime of the proposed and existing algorithms for strongly convex losses (ignoring logarithmic factors in κ) to reach ε-accuracy. Algorithms with an asterisk are proposed/analyzed in this paper. Here, N is the total number of input data points, n is the number of worker machines, and κ is the condition number of the global loss function f.

Algorithm | Communication Rounds | Runtime | Assumptions
--- | --- | --- | ---
DSVRG [15] | (1 + κ/(N/n)) log(1/ε) | (N/n + κ) log(1/ε) | extra data
DASVRG [15] | (1 + √(κ/(N/n))) log(1/ε) | (N/n + √(κN/n)) log(1/ε) | extra data
Dist. AGD | √κ log(1/ε) | (N/n)√κ log(1/ε) | none
ADMM | κ log(1/ε) | (N/n)κ log(1/ε) | none
SCOPE [41] | κ log(1/ε) | (N/n + κ log κ)κ log(1/ε) | uniform regularization
pSCOPE [40] | log(1/ε) | (N/n + κ) log(1/ε) | good partition
D-SVRG* | log(1/ε) | (N/n + κ) log(1/ε) | distributed restricted smoothness
D-SARAH* | log(1/ε) | (N/n + κ) log(1/ε) | distributed restricted smoothness
D-MiG* | (1 + √(κ/(N/n))) log(1/ε) | (N/n + √(κN/n)) log(1/ε) | distributed smoothness
D-RSVRG* | κ log(1/ε) | (N/n)κ log(1/ε) | large regularization
D-RSVRG* | log(1/ε) | (N/n + κ) log(1/ε) | small regularization

Related Work

Distributed optimization is a classic topic [6, 5], yet recent trends in data-intensive applications are calling for new developments with a focus on communication and computation efficiency.
Examples of deterministic optimization methods include DANE [29], AIDE [24], DiSCo [38], GIANT [33], CoCoA [30], CEASE [9], one-shot averaging [43, 39], etc.

Many stochastic variance reduced methods have been proposed recently, for example, SAG [25], SAGA [8], SVRG [13], SDCA [27], MiG [42], Katyusha [3], Catalyst [18], SCOPE [41, 40], SARAH [21], SPIDER [10], SpiderBoost [34], to name a few. Several previous works have studied distributed variants of SVRG. For example, the D-SVRG algorithm has been empirically studied before in [24, 14] without a theoretical convergence analysis. The pSCOPE algorithm [40] is also a variant of distributed SVRG, and its convergence is studied under an assumption called good data partition in [40], which is hard to interpret and verify in practice. The SCOPE algorithm [41] is similar to the regularized variant D-RSVRG of D-SVRG under large regularization; however, our analysis is much more refined, as it allows different regularizations at different local workers according to the distributed smoothness of local data, and gracefully degenerates to the unregularized case when the distributed smoothness is benign. The general framework of distributed variance-reduced methods covering SARAH and MiG and various loss settings in this paper has not been studied before. For conciseness, Table 1 summarizes the communication and computation complexities of the algorithms most relevant to the current paper.

There are also many recent efforts on reducing the communication cost of distributed GD/SGD via gradient quantization [1, 4, 26, 36] and gradient compression and sparsification [2, 17, 19, 35, 31]. In comparison, we communicate the exact gradient, and it is an interesting future direction to combine gradient compression schemes with distributed variance reduced stochastic gradient methods.
Paper Organization

The rest of this paper is organized as follows. Section 2 presents the problem setup and a general framework of distributed stochastic optimization with variance-reduced local updates. Section 3 presents the convergence guarantees of D-SVRG, D-SARAH and D-MiG under appropriate distributed smoothness assumptions. Section 4 introduces regularization to D-SVRG to handle unbalanced data when distributed smoothness does not hold. Section 5 presents extensions to nonconvex losses for D-SARAH. Section 6 provides an outline of the analysis; the rest of the proofs are deferred to the supplemental materials. Section 7 presents numerical experiments to corroborate the theoretical findings. Finally, we conclude in Section 8.
2 Problem Setup
Suppose we have a data set M = {z_1, · · · , z_N}, where z_j ∈ R^p is the j-th data point for j = 1, ..., N, and N is the total number of data points. In particular, we do not make any assumptions on their statistical distribution. Consider the following empirical risk minimization problem

    min_{x ∈ R^d} f(x) := (1/N) Σ_{z ∈ M} ℓ(x; z),    (1)

where x ∈ R^d is the parameter to be optimized and ℓ : R^d × R^p → R is the sample loss function. For brevity, we use ℓ_z(x) to denote ℓ(x; z) throughout the paper.

In a distributed setting, where the data are distributed to n machines or workers, we define a partition of the data set M as M = ∪_{k=1}^n M_k, where M_i ∩ M_k = ∅ for all i ≠ k. The k-th worker, correspondingly, is in possession of the data subset M_k, 1 ≤ k ≤ n. We assume there is a parameter server (PS) that coordinates the parameter sharing among the workers. The size of the data held by the k-th worker machine is N_k = |M_k|. When the data is split equally, we have N_k = N/n. The original problem (1) can be rewritten as minimizing the following objective function:

    f(x) := (1/n) Σ_{k=1}^n f_k(x),    (2)

where f_k(x) = (1/(N/n)) Σ_{z ∈ M_k} ℓ_z(x) is the local loss function at the k-th worker machine.

Alg. 1 presents a general framework for distributed stochastic variance reduced methods, which assigns the outer loops of an SVRG-type algorithm [13, 21, 42] to the PS and the inner loops to the local workers. By using different variance reduction schemes at the worker machines (i.e.,
LocalUpdate), we obtain distributed variants of different algorithms. On a high level, the framework alternates between local computation by individual workers in parallel, and global information sharing coordinated by the PS (cf. Alg. 1).

• The local worker conducts the local computation
LocalUpdate based on the current estimate x̃^t, the global gradient ∇f(x̃^t), and its local data f_k(·); in this paper, we are primarily interested in local updates using stochastic variance-reduced gradients. A little additional information about the previous update is needed when employing acceleration, which will be specified in Alg. 3.

• The PS combines the local estimates y_k^{t+1} into a global estimate x̃^{t+1}, and then computes the global gradient ∇f(x̃^{t+1}) by pulling the local gradients ∇f_k(x̃^{t+1}), which requires two rounds of communication. For simplicity of analysis, we assume the global estimate x̃^{t+1} is set to one of the local estimates selected at random, rather than the average of all the local estimates.

Throughout, we invoke one or several of the following standard assumptions on the loss function in the convergence analysis.
Assumption 1 (Smoothness). The sample loss ℓ_z(·) is L-smooth for all z ∈ M.

Assumption 2 (Convexity). The sample loss ℓ_z(·) is convex for all z ∈ M.

Assumption 3 (Strong Convexity). The empirical risk f(·) is σ-strongly convex.

It is straightforward to state our results under unequal data splitting with proper rescaling.

Algorithm 1 A general distributed framework
Input: initial point x̃^0.
Initialization: compute ∇f(x̃^0) and distribute it to all machines.
for t = 0, 1, 2, · · · do
    for workers 1 ≤ k ≤ n in parallel do
        y_k^{t+1} = LocalUpdate(f_k, x̃^t, ∇f(x̃^t));
        send y_k^{t+1} to the PS;
    end for
    PS: randomly select x̃^{t+1} from all y_k^{t+1} and push x̃^{t+1} to all workers;
    for workers 1 ≤ k ≤ n in parallel do
        compute ∇f_k(x̃^{t+1}) and send it to the PS;
    end for
    PS: average ∇f(x̃^{t+1}) = (1/n) Σ_{k=1}^n ∇f_k(x̃^{t+1}) and push ∇f(x̃^{t+1}) to all workers.
end for
return x̃^{t+1}

When f is strongly convex, the condition number of f is defined as κ := L/σ. Denote the unique minimizer and the optimal value of f(x) as

    x* := arg min_{x ∈ R^d} f(x),  f* := f(x*).

As it turns out, the smoothness of the deviation f_k − f between the local loss function f_k and the global loss function f plays a key role in the convergence analysis, as it measures the balancedness between local data in a simple and intuitive manner. We refer to this as the “distributed smoothness”. In some cases, a weaker notion called restricted smoothness is sufficient, which is defined below.

Definition 1 (Restricted Smoothness). A differentiable function f : R^d → R is called c-restricted smooth with regard to x* if ‖∇f(x*) − ∇f(y)‖ ≤ c‖x* − y‖ for all y ∈ R^d.

Restricted smoothness, compared to standard smoothness, fixes one of the arguments to x*, and is therefore a much weaker requirement. The following assumptions quantify the distributed smoothness using either restricted smoothness or standard smoothness.
Assumption 4a (Distributed Restricted Smoothness). The deviation f − f_k is c_k-restricted smooth with regard to x* for all 1 ≤ k ≤ n.

Assumption 4b (Distributed Smoothness). The deviation f − f_k is c_k-smooth for all 1 ≤ k ≤ n.

It is straightforward to check that c_k ≤ L for all 1 ≤ k ≤ n. If all the data samples are generated following a certain statistical distribution in an i.i.d. fashion, one can further link the distributed smoothness to the local sample size N/n, where c_k decreases as N/n increases; see e.g. [29, 9] for further discussion.
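To make the framework concrete, here is a serial simulation of Alg. 1 on a synthetic least-squares problem; a minimal sketch under assumed sizes, step sizes, and loss. The LocalUpdate placeholder performs full-batch local gradient steps corrected by the global-gradient drift term; the stochastic variance-reduced routines studied in the next section would be plugged in instead.

```python
import numpy as np

# Serial simulation of the framework in Alg. 1 on a synthetic least-squares
# problem. LocalUpdate is a pluggable placeholder: full-batch local gradient
# steps corrected by the drift term grad_f(x_tilde) - grad_f_k(x_tilde).
# All sizes, step sizes, and the loss are illustrative assumptions.
rng = np.random.default_rng(1)
N, d, n = 2000, 5, 4
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
shards = np.split(np.arange(N), n)              # disjoint local datasets M_k

def grad(x, idx):
    """Gradient of the average least-squares loss over the rows in idx."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def local_update(k, x_tilde, g_global, steps=50, eta=0.1):
    """Placeholder LocalUpdate: corrected local gradient descent."""
    correction = g_global - grad(x_tilde, shards[k])   # fixed during the round
    y = x_tilde.copy()
    for _ in range(steps):
        y -= eta * (grad(y, shards[k]) + correction)
    return y

x_tilde = np.zeros(d)
for t in range(100):                            # outer loops at the PS
    g_global = np.mean([grad(x_tilde, s) for s in shards], axis=0)  # round 1
    ys = [local_update(k, x_tilde, g_global) for k in range(n)]     # parallel
    x_tilde = ys[rng.integers(n)]               # PS picks one estimate; round 2

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.linalg.norm(x_tilde - x_star) < 1e-6
```

Note that only two quantities cross the network per outer loop: the averaged gradient and the selected parameter estimate, mirroring the two communication rounds of the framework.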
3 Convergence under Strongly Convex Losses

In this section, we describe three variance-reduced routines for LocalUpdate used in Alg. 1, namely SVRG [13], SARAH [21, 22], and MiG [42], and analyze their convergence when f(·) is strongly convex.

3.1 D-SVRG

The LocalUpdate routine of D-SVRG is described in Alg. 2. Theorem 1 provides the convergence guarantee of D-SVRG as long as the distributed restricted smoothness parameter is small enough.
Theorem 1 (D-SVRG). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. For proper m and step size η, the iterates of D-SVRG satisfy

    E[f(x̃^{t+1}) − f*] ≤ ν E[f(x̃^t) − f*]

for some ν ∈ (0, 1) that decreases as c/σ decreases. The communication and runtime complexities of finding an ε-optimal solution (in terms of function value) are O(ζ^{-1} log(1/ε)) and O((N/n + ζ^{-2}κ) ζ^{-1} log(1/ε)), respectively, where ζ = 1 − c/σ.

Algorithm 2 LocalUpdate via SVRG/SARAH
Input: local data M_k, x̃^t, ∇f(x̃^t); Parameters: step size η, number of iterations m.
Set y_k^{t,0} = x̃^t, v_k^{t,0} = ∇f(x̃^t);
for s = 0, ..., m − 1 do
    y_k^{t,s+1} = y_k^{t,s} − η v_k^{t,s};
    Sample z from M_k uniformly at random, and compute
        v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t)          (SVRG)
        v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(y_k^{t,s}) + v_k^{t,s}     (SARAH)
end for
Set y_k^{t+1} uniformly at random from {y_k^{t,1}, ..., y_k^{t,m}}.

Theorem 1 establishes the linear convergence of function values in expectation for D-SVRG, as long as the parameter c is sufficiently small relative to σ. From the expressions of the communication and runtime complexities, it can be seen that the smaller c is, the faster D-SVRG converges, suggesting that the homogeneity of distributed data plays an important role in the efficiency of distributed optimization. When c/σ is bounded above by a sufficiently small constant, i.e. ζ = Θ(1), the runtime complexity becomes O((N/n + κ) log(1/ε)), which improves upon the counterpart of SVRG in the centralized setting, O((N + κ) log(1/ε)).

Remark 1.
The above LocalUpdate routine corresponds to the so-called Option II (i.e., setting y_k^{t+1} uniformly at random among the previous updates) specified in [13]. Under similar assumptions, we also establish the convergence of D-SVRG using Option I, where the output y_k^{t+1} is set to y_k^{t,m}. In addition, D-SVRG still converges linearly in the absence of Assumption 2. We leave these extensions to the supplementary materials.
3.2 D-SARAH

The LocalUpdate of D-SARAH is also described in Alg. 2; it differs from SVRG in the update of the stochastic gradient v_k^{t,s}, which uses a recursive formula proposed in [21]. Theorem 2 provides the convergence guarantee of D-SARAH as long as the distributed restricted smoothness parameter is small enough.

Theorem 2 (D-SARAH). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. With proper m and step size η, the iterates of D-SARAH satisfy

    E[‖∇f(x̃^{t+1})‖²] ≤ ν E[‖∇f(x̃^t)‖²]

for some ν ∈ (0, 1) that decreases as c/σ decreases. The communication and runtime complexities of finding an ε-optimal solution (in terms of gradient norm) are O(ζ^{-1} log(1/ε)) and O((N/n + ζ^{-2}κ) ζ^{-1} log(1/ε)), respectively, where ζ = 1 − √2·c/σ.

Theorem 2 establishes the linear convergence of the gradient norm in expectation for D-SARAH, as long as the parameter c is small enough. Similar to D-SVRG, a smaller c leads to faster convergence of D-SARAH. When c/σ is bounded above by a sufficiently small constant, the runtime complexity becomes O((N/n + κ) log(1/ε)), which improves upon the counterpart of SARAH in the centralized setting, O((N + κ) log(1/ε)). In particular, Theorem 2 suggests that D-SARAH may allow a larger c, compared with D-SVRG, to guarantee convergence.

Algorithm 3 LocalUpdate via MiG
Input: local data M_k, x̃^t, ∇f(x̃^t); Parameters: step size η, number of iterations m, and w.
if t = 0 then set x_k^{t,0} = x̃^t else set x_k^{t,0} = x_k^{t−1,m} end if
for s = 0, ..., m − 1 do
    Set y_k^{t,s} = (1 − θ)x̃^t + θ x_k^{t,s};
    Sample z from M_k uniformly at random, and set v_k^{t,s} = ∇ℓ_z(y_k^{t,s}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t);
    x_k^{t,s+1} = x_k^{t,s} − η v_k^{t,s};
end for
Set y_k^{t+1} = (Σ_{j=0}^{m−1} w^j)^{−1} Σ_{j=0}^{m−1} w^j y_k^{t,j+1}.
3.3 D-MiG

The LocalUpdate of D-MiG is described in Alg. 3, which is inspired by the inner loop of the MiG algorithm [42], a recently proposed accelerated variance-reduced algorithm. Compared with D-SVRG and D-SARAH, D-MiG uses additional information from previous rounds, by carrying over x_k^{t,0} = x_k^{t−1,m} in Alg. 3. Theorem 3 provides the convergence guarantee of D-MiG, as long as the distributed smoothness parameter is small enough.
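The three LocalUpdate variants can be sketched for a single worker on a synthetic least-squares shard as follows; the step sizes, m, θ, w, the number of outer rounds, and returning the last inner iterate for SVRG/SARAH (Option I of Remark 1) are illustrative assumptions rather than the tuned parameters behind the theorems.

```python
import numpy as np

# Minimal single-worker sketches of the LocalUpdate variants: SVRG and SARAH
# from Alg. 2, and the MiG-style accelerated update from Alg. 3. All
# hyperparameters and the least-squares loss are illustrative assumptions.
rng = np.random.default_rng(3)
n_data, d = 200, 5
A = rng.standard_normal((n_data, d))
b = rng.standard_normal(n_data)

def grad_sample(x, j):                 # gradient of the sample loss at z_j
    return A[j] * (A[j] @ x - b[j])

def grad_full(x):                      # batch gradient of the local loss
    return A.T @ (A @ x - b) / n_data

def local_update(x_tilde, g_tilde, mode, m=500, eta=0.01):
    """SVRG/SARAH inner loop of Alg. 2 (returns the last inner iterate)."""
    y, v = x_tilde.copy(), g_tilde.copy()
    for _ in range(m):
        y_next = y - eta * v
        j = rng.integers(n_data)
        if mode == "svrg":             # correction anchored at x_tilde
            v = grad_sample(y_next, j) - grad_sample(x_tilde, j) + g_tilde
        else:                          # "sarah": recursive correction
            v = grad_sample(y_next, j) - grad_sample(y, j) + v
        y = y_next
    return y

def mig_update(x_tilde, g_tilde, x0, theta=0.5, eta=0.01, m=500, w=1.01):
    """MiG-style inner loop of Alg. 3 (weighted average plus carry-over)."""
    x = x0.copy()
    y_acc, w_acc, w_pow = np.zeros(d), 0.0, 1.0
    for _ in range(m):
        y = (1 - theta) * x_tilde + theta * x          # coupled iterate
        j = rng.integers(n_data)
        v = grad_sample(y, j) - grad_sample(x_tilde, j) + g_tilde
        x = x - eta * v
        y_acc += w_pow * ((1 - theta) * x_tilde + theta * x)
        w_acc += w_pow
        w_pow *= w                                     # geometric weights w^j
    return y_acc / w_acc, x

results = {}
for mode in ("svrg", "sarah"):
    x = np.zeros(d)
    for _ in range(30):                                # outer rounds, cf. Alg. 1
        x = local_update(x, grad_full(x), mode)
    results[mode] = np.linalg.norm(grad_full(x))

x_tilde, x_carry = np.zeros(d), np.zeros(d)
for _ in range(30):
    x_tilde, x_carry = mig_update(x_tilde, grad_full(x_tilde), x_carry)
results["mig"] = np.linalg.norm(grad_full(x_tilde))

assert all(g < 1e-3 for g in results.values())
```

With a single worker the local loss equals the global loss, so all three variants drive the gradient norm down geometrically; in the distributed setting the discrepancy f − f_k enters through the anchored correction terms.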
Theorem 3 (D-MiG). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4b holds with c_k ≤ c, where c/σ is smaller than a sufficiently small absolute constant. Let w = (1 + ησ)/(1 + 3ηc). With proper m and step size η, the iterates of D-MiG achieve an ε-optimal solution within a communication complexity of O((1 + √(κ/(N/n))) log(1/ε)) and a runtime complexity of O((N/n + √(κN/n)) log(1/ε)).

Theorem 3 establishes the linear convergence of D-MiG under the standard smoothness of f − f_k, in order to fully harness the power of acceleration. While we do not make it explicit in the theorem statement, the time complexity of D-MiG also decreases as c gets smaller. Furthermore, the time complexity of D-MiG is smaller than that of D-SVRG/D-SARAH when κ ≳ N/n.

Remark 2.
Theorem 3 continues to hold for regularized empirical risk minimization, where the loss function is given as F(x) = f(x) + g(x), and g(x) is a convex and non-smooth regularizer. In this case, the update of x_k^{t,s+1} in Alg. 3 is changed to

    x_k^{t,s+1} = arg min_x { (1/(2η)) ‖x − x_k^{t,s}‖² + ⟨v_k^{t,s}, x⟩ + g(x) }.

4 Handling Unbalanced Data via Regularization

So far, we have established convergence when the distributed smoothness parameter is not too large. While this may be reasonable in certain settings, e.g. in a data center where one has control over how to distribute the data, it becomes increasingly harder to satisfy when the data are generated locally and are heterogeneous across workers. When such conditions are violated, the algorithms might diverge. In this situation, adding a regularization term can ensure convergence, at the cost of possibly slowing down the rate. We consider regularizing the local gradient update of D-SVRG in Alg. 2 as

    v_k^{t,s+1} = ∇ℓ_z(y_k^{t,s+1}) − ∇ℓ_z(x̃^t) + ∇f(x̃^t) + µ_k (y_k^{t,s+1} − x̃^t),    (3)

where the last term penalizes the distance between the current iterate y_k^{t,s+1} and the reference point x̃^t, and µ_k > 0 is the regularization parameter employed at the k-th worker. We have the following theorem.

Theorem 4 (Distributed Regularized SVRG (D-RSVRG)). Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4a holds with c_k smaller than the constant fraction required by Theorem 1, applied to σ + µ_k in place of σ. Let µ = min_{1≤k≤n} µ_k. With proper m and step size η, there exists some constant 0 ≤ ν < 1 such that the iterates of D-RSVRG satisfy

    E[f(x̃^{t+1}) − f*] ≤ ( 1 − (1 − ν) max{ σ/(L + µ), 1 − µ/σ } ) · E[f(x̃^t) − f*],    (4)

and the runtime complexity of finding an ε-optimal solution is bounded by

    O( (N/n + ζ^{-2} κ̄) ζ^{-1} min{ κ + µ/σ, (1 − µ/σ)^{-1} } log(1/ε) ),

where ζ = 1 − c/(σ + µ) and κ̄ = (L + µ)/(σ + µ).
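To isolate the effect of the proximal term in (3), the following deterministic caricature replaces the sampled gradient with the full local gradient and applies the regularized update on a deliberately unbalanced two-worker split; the choice µ_k = ‖H − H_k‖₂ and all sizes are illustrative assumptions, not the tuned parameters of Theorem 4.

```python
import numpy as np

# Deterministic, full-batch caricature of the regularized update (3): the
# sampled gradient is replaced by the full local gradient so that the effect
# of the proximal term mu_k*(y - x_tilde) is isolated from sampling noise.
# The unbalanced split and mu_k = ||H - H_k||_2 are illustrative assumptions.
rng = np.random.default_rng(7)
d, m_loc = 3, 60
A1 = 0.3 * rng.standard_normal((m_loc, d))   # worker 1: "small" data
A2 = rng.standard_normal((m_loc, d))         # worker 2: "large" data
b1, b2 = rng.standard_normal(m_loc), rng.standard_normal(m_loc)
workers = [(A1, b1), (A2, b2)]

def grad_local(k, x):
    Ak, bk = workers[k]
    return Ak.T @ (Ak @ x - bk) / m_loc

def grad_global(x):
    return 0.5 * (grad_local(0, x) + grad_local(1, x))

H = sum(Ak.T @ Ak / m_loc for Ak, _ in workers) / 2
mus = [np.linalg.norm(H - Ak.T @ Ak / m_loc, 2) for Ak, _ in workers]

x_tilde = np.zeros(d)
for t in range(150):
    k = rng.integers(2)          # only the worker the PS picks needs simulating
    g = grad_global(x_tilde)
    y = x_tilde.copy()
    for _ in range(300):         # regularized inner loop, cf. (3)
        v = grad_local(k, y) - grad_local(k, x_tilde) + g + mus[k] * (y - x_tilde)
        y = y - 0.2 * v
    x_tilde = y

assert np.linalg.norm(grad_global(x_tilde)) < 1e-6
```

With mus[k] set to 0, the same recursion can diverge on the small-data worker, whose local curvature badly mismatches the global one; this is the failure mode the regularization is designed to prevent.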
Compared with Theorem 1, Theorem 4 relaxes the requirement on c_k from a constant fraction of σ to the same fraction of σ + µ_k, which means that by inserting a larger regularization µ_k at local workers that are less distributed smooth, i.e. those with large c_k, one can still guarantee the convergence of D-RSVRG. However, increasing µ leads to a slower convergence rate: a large µ ≳ L leads to an iteration complexity of O(κ log(1/ε)), similar to gradient descent. Compared with SCOPE [41], which requires a uniform regularization µ > L − σ, our analysis applies tailored regularization to local workers, and potentially allows much smaller regularization to guarantee convergence, since the c_k's can be much smaller than the smoothness parameter L.

5 Extension to Nonconvex Losses

In this section, we extend the convergence analysis of D-SARAH to handle nonconvex loss functions, since SARAH-type algorithms have recently been shown to achieve near-optimal performance for nonconvex problems [34, 22, 10]. As a modification that eases the analysis, we make every worker return y_k^{t+1} = y_k^{t,m} in the last line of Alg. 2. Our result is summarized in the theorem below.

Theorem 5 (D-SARAH for nonconvex losses). Suppose that Assumption 1 and Assumption 4b hold with c_k ≤ c. With a step size η = O(1/(L√m + cm)), D-SARAH satisfies

    (1/(Tm)) Σ_{t=0}^{T−1} Σ_{s=0}^{m−1} E[ ‖∇f(y_{k(t)}^{t,s})‖² ] ≤ (2/(ηTm)) ( f(x̃^0) − f* ),

where k(t) is the index of the worker selected in the t-th round for the parameter update, i.e. x̃^{t+1} = y_{k(t)}^{t+1} (cf. the random selection at the PS in Alg. 1). To find an ε-optimal solution, the communication complexity is O((√(n/N) + c/L) L/ε), and the runtime complexity is O(N/n + (√(N/n) + (N/n)·(c/L)) L/ε) by setting m = Θ(N/n).

Theorem 5 suggests that D-SARAH converges as long as the step size is small enough. Furthermore, a smaller c allows a larger step size η, and hence faster convergence.
To gain further insights, assume i.i.d. data at each worker; by concentration inequalities it is known that c/L = O(√(log(N/n)/(N/n))) under mild conditions [20], and consequently, the runtime complexity of finding an ε-accurate solution using D-SARAH is O(N/n + L√(log(N/n)·(N/n))/ε). This is comparable to the best known result, O(N + L√N/ε), for centralized SARAH-type algorithms in the nonconvex setting [22, 10, 34] up to logarithmic factors, with the data size N replaced by the local data size N/n, demonstrating again the benefit of data distribution.
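The scaling of the distributed smoothness with the local sample size can be eyeballed numerically. For quadratic losses, f − f_k is itself quadratic, so c_k is exactly the spectral norm of the Hessian difference; the i.i.d. Gaussian data and the sizes below are assumptions.

```python
import numpy as np

# For quadratic losses l_z(x) = 0.5*(a^T x - c)^2, the deviation f - f_k is
# quadratic, so its smoothness constant is c_k = ||H - H_k||_2, the spectral
# norm of the Hessian difference. With i.i.d. data, c_k shrinks as the local
# sample size N/n grows; data distribution and sizes are assumptions.
rng = np.random.default_rng(8)
d, n = 5, 4

def max_ck(local_size):
    A = rng.standard_normal((n * local_size, d))
    H = A.T @ A / (n * local_size)                  # Hessian of the global loss
    cks = []
    for shard in np.split(np.arange(n * local_size), n):
        H_k = A[shard].T @ A[shard] / local_size    # Hessian of the local loss
        cks.append(np.linalg.norm(H - H_k, 2))
    return max(cks)

c_small_shards, c_large_shards = max_ck(20), max_ck(2000)
assert c_large_shards < c_small_shards              # c_k decreases with N/n
```

The observed decay is roughly √(d/(N/n)), consistent with the concentration heuristic above.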
6 Outline of the Analysis

In this section, we outline the convergence proofs of D-SVRG (Theorem 1), D-RSVRG (Theorem 4), and D-SARAH in the strongly convex (Theorem 2) and nonconvex (Theorem 5) settings, while leaving the details and the convergence proof of D-MiG (Theorem 3) to the appendix and supplemental materials. Throughout this section, we simplify the notation y_k^{t,s} and v_k^{t,s} by dropping the superscript t and the subscript k whenever the meaning is clear, since it is often sufficient to analyze the convergence of a specific worker k during a single round.

6.1 D-SVRG (Theorem 1)

We generalize the analysis of SVRG via the dissipativity theory of [12] to the analysis of D-SVRG, which might be of independent interest. We start with a quick review of dissipativity theory. Consider the following linear time-invariant system

    ξ_{k+1} = A ξ_k + B w_k,

where ξ_k ∈ R^{n_ξ} is the state and w_k ∈ R^{n_w} is the input. Dissipativity theory characterizes how the inputs w_j, j = 0, 1, 2, . . ., drive the internal energy stored in the states ξ_j, j = 0, 1, 2, . . ., via an energy function V : R^{n_ξ} → R_+ and a supply rate S : R^{n_ξ} × R^{n_w} → R. The theory aims to build the following dissipation inequality:

    V(ξ_{k+1}) ≤ ρ² V(ξ_k) + S(ξ_k, w_k),    (5)

where ρ ∈ (0, 1). The inequality indicates that at least a fraction 1 − ρ² of the internal energy dissipates at every iteration. With an energy function V(ξ) = ξ^⊤ P ξ and supply rates

    S_j(ξ, w) = [ξ^⊤, w^⊤] X_j [ξ^⊤, w^⊤]^⊤,    (6)

we have

    V(ξ_{k+1}) ≤ ρ² V(ξ_k) + Σ_{j=1}^J λ_j S_j(ξ_k, w_k)    (7)

as long as there exist a positive semidefinite matrix P and non-negative scalars λ_j such that

    [ A^⊤PA − ρ²P   A^⊤PB ]
    [ B^⊤PA          B^⊤PB ]  −  Σ_{j=1}^J λ_j X_j  ⪯  0.    (8)

In fact, by left-multiplying [ξ_k^⊤, w_k^⊤] and right-multiplying [ξ_k^⊤, w_k^⊤]^⊤ on (8), we recover (7).
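The implication from the matrix inequality (8) to the dissipation inequality (7) can be checked numerically. The sketch below uses synthetic matrices with J = 1 and constructs X_1 so that (8) holds with a small margin; all matrices and dimensions are assumptions for illustration.

```python
import numpy as np

# Numerical check that the LMI (8) implies the dissipation inequality (7),
# on synthetic matrices with J = 1 and lambda_1 = 1. X_1 is constructed so
# that the LMI holds by a margin eps; everything here is an assumed example.
rng = np.random.default_rng(5)
n_xi, n_w, rho = 3, 2, 0.9

A = 0.5 * rng.standard_normal((n_xi, n_xi))
B = 0.5 * rng.standard_normal((n_xi, n_w))
M = rng.standard_normal((n_xi, n_xi))
P = M @ M.T + np.eye(n_xi)                       # positive definite energy matrix

# Left-hand block matrix of the LMI (8).
top = np.hstack([A.T @ P @ A - rho**2 * P, A.T @ P @ B])
bot = np.hstack([B.T @ P @ A, B.T @ P @ B])
lmi_lhs = np.vstack([top, bot])

eps = 0.1
X1 = lmi_lhs + eps * np.eye(n_xi + n_w)          # then lmi_lhs - X1 = -eps*I <= 0
assert np.all(np.linalg.eigvalsh(lmi_lhs - X1) <= 0)

def V(xi):                                       # energy function V(xi) = xi^T P xi
    return xi @ P @ xi

def S(xi, w):                                    # supply rate with matrix X1
    z = np.concatenate([xi, w])
    return z @ X1 @ z

# The dissipation inequality (7) then holds for every state/input pair.
for _ in range(1000):
    xi, w = rng.standard_normal(n_xi), rng.standard_normal(n_w)
    assert V(A @ xi + B @ w) <= rho**2 * V(xi) + S(xi, w) + 1e-9
```

The check mirrors the sandwiching argument in the text: multiplying the LMI on both sides by the stacked state/input vector yields exactly (7).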
We can capture properties of the objective function, such as co-coercivity and strong convexity, with (8), and characterize the optimality of ξ_k by V(ξ_k), thus reducing the convergence proof to the existence of ρ ∈ (0, 1) satisfying (8), which can be viewed as a generalized eigenvalue optimization problem [16].

Setting ξ_s = y_s − x*, we can write the local update of D-SVRG via the following linear time-invariant system [12]:

    ξ_{s+1} = ξ_s − η [∇ℓ_z(y_s) − ∇ℓ_z(y^0) + ∇f(y^0)],

where z is selected uniformly at random from the local data points M_k and y^0 = x̃^t is the reference point of the current round, or equivalently,

    ξ_{s+1} = A ξ_s + B w_s,    (9)

with A = I_d, B = [−ηI_d, −ηI_d], and

    w_s = [ (∇ℓ_z(y_s) − ∇ℓ_z(x*))^⊤ , (∇ℓ_z(x*) − ∇ℓ_z(y^0) + ∇f(y^0))^⊤ ]^⊤.

Here, I_d is the identity matrix of dimension d. Recall the supply rate in (6). Since ξ_s ∈ R^d and w_s ∈ R^{2d}, we write X_j as X_j = X̄_j ⊗ I_d, where X̄_j ∈ R^{3×3}. Following [12], we consider three supply rates S_1, S_2 and S_3, whose defining matrices X̄_1, X̄_2 and X̄_3 encode co-coercivity and optimality properties of the sample losses. We have the following lemmas, which are proved in Appendices B.1 and B.2, respectively.

Lemma 1.
Suppose that Assumptions 1, 2, 3 and 4a hold with c_k ≤ c. For the supply rates above, we have

    E[S_1] ≤ a_1 L E[f(y_s) − f*] + 2cL E[‖y_s − x*‖²],
    E[S_2] ≤ a_2 L E[f(y^0) − f*] + c(4L + 2c) E[‖y^0 − x*‖²],
    E[S_3] ≤ −a_3 E[f(y_s) − f*] + 3c E[‖y_s − x*‖²] + c E[‖y^0 − x*‖²],

for absolute constants a_1, a_2, a_3 > 0.

Lemma 2. Suppose that Assumptions 1, 2, 3 and 4a hold with c_k ≤ c. If there exist non-negative scalars λ_1, λ_2, λ_3 satisfying a scalar condition, which in particular requires c/σ to be sufficiently small, together with a 3 × 3 linear matrix inequality in η, then D-SVRG satisfies

    E[f(y^+) − f*] ≤ ν E[f(y^0) − f*],    (13)

where ν is an explicit function of η, m, λ_1, λ_2, λ_3, L, c and σ, and the final output y^+ is selected from y_1, · · · , y_m uniformly at random.

We can now prove Theorem 1. Choosing λ_1, λ_2, λ_3 proportional to appropriate powers of η satisfies the matrix inequality, and the scalar condition holds for a sufficiently small step size. With η = (1 − c/σ)/(40L) and m = 160κ(1 − c/σ)^{-2}, the contraction factor ν in (13) is bounded away from 1 whenever c/σ is smaller than the constant required by Theorem 1. Therefore, after the PS selection step of Alg. 1, we have for D-SVRG

    E[f(x̃^{t+1}) − f*] ≤ (1/n) Σ_{k=1}^n E[f(y_k^{t+1}) − f*] ≤ ν E[f(x̃^t) − f*].

To obtain an ε-optimal solution in terms of function value, we need O(ζ^{-1} log(1/ε)) communication rounds, where ζ = 1 − c/σ.
Per round, the runtime complexity at each worker is $O(N/n + m) = O(N/n + \zeta^{-2}\kappa)$, where the first term corresponds to evaluating the batch gradient over the local data in parallel, and the second term corresponds to evaluating the stochastic gradients in the inner loop. Multiplying this by the number of communication rounds gives the overall runtime.

Consider an auxiliary sample function at the $t$th round and the $k$th worker:
\[
\ell^t_{\mu_k}(x; z) = \ell(x; z) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2.
\]
This leads to the auxiliary local and global loss functions, respectively,
\[
f^t_i(x) = \frac{1}{|\mathcal{M}_i|}\sum_{z\in\mathcal{M}_i} \ell^t_{\mu_k}(x; z) = f_i(x) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2, \quad 1 \le i \le n,
\]
and
\[
f^t(x) = \frac{1}{n}\sum_{i=1}^n f^t_i(x) = f(x) + \frac{\mu_k}{2}\|x - \tilde{x}^t\|^2. \tag{14}
\]
Moreover, we have
\[
\nabla \ell^t_{\mu_k}(y^{t,s+1}_k; z) - \nabla \ell^t_{\mu_k}(\tilde{x}^t; z) + \nabla f^t(\tilde{x}^t)
= \nabla \ell(y^{t,s+1}_k; z) - \nabla \ell(\tilde{x}^t; z) + \nabla f(\tilde{x}^t) + \mu_k(y^{t,s+1}_k - \tilde{x}^t),
\]
which means that D-RSVRG performs in exactly the same way as the unregularized D-SVRG applied to the auxiliary loss functions $\ell^t_{\mu_k}$ in the $t$th round. Note that $\ell^t_{\mu_k}$ is $(\mu_k + L)$-smooth and that $f^t$ is $(\mu_k + \sigma)$-strongly convex, while the restricted smoothness of the $k$th worker remains unchanged since $f^t - f^t_k = f - f_k$. Applying Theorem 1, when Assumptions 1, 2 and 3 hold, and Assumption 4a holds with $c_k < (\sigma + \mu_k)/6$, we have, with proper $m$ and step size $\eta$,
\[
\mathbb{E}\big[f^t(\tilde{x}^{t+1}) - f^{t*}\big] < \nu\,\mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big], \tag{15}
\]
where $f^{t*}$ is the optimal value of $f^t$.

However, the definitions of the regularized loss functions $\ell^t_{\mu_k}$ and $f^t$ rely on $\tilde{x}^t$, which changes over rounds. Our next step is to relate the descent of $f^t$ to that of $f$.
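The gradient identity above, which shows that D-RSVRG is simply D-SVRG run on the auxiliary losses, can be checked numerically. The following sketch uses a single quadratic sample loss as a hypothetical stand-in for $\ell(x;z)$; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
a = rng.standard_normal(d)          # one sample: ell(x; z) = (1/2)(a^T x - 1)^2
mu = 0.3                            # regularization weight mu_k
x_tilde = rng.standard_normal(d)    # anchor point \tilde{x}^t
y = rng.standard_normal(d)          # current inner iterate

def grad_ell(x):
    return (a @ x - 1.0) * a

grad_f = grad_ell                   # single-sample "global" loss suffices for the identity

def grad_ell_reg(x):
    # gradient of ell(x; z) + (mu/2) ||x - x_tilde||^2
    return grad_ell(x) + mu * (x - x_tilde)

# grad f^t at x_tilde: the proximal term vanishes there, so it equals grad f(x_tilde)
lhs = grad_ell_reg(y) - grad_ell_reg(x_tilde) + grad_f(x_tilde)
rhs = grad_ell(y) - grad_ell(x_tilde) + grad_f(x_tilde) + mu * (y - x_tilde)
print(np.allclose(lhs, rhs))  # True
```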
To this end, we have
\begin{align*}
\mathbb{E}\big[f(\tilde{x}^{t+1}) - f^*\big]
&= \mathbb{E}\big[f^t(\tilde{x}^{t+1}) - f^{t*}\big] + f^{t*} - \mathbb{E}\Big[\frac{\mu}{2}\big\|\tilde{x}^t - \tilde{x}^{t+1}\big\|^2\Big] - f^* \\
&< \mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big] - (1-\nu)\,\mathbb{E}\big[f^t(\tilde{x}^t) - f^{t*}\big] + f^{t*} - f^* \\
&= \mathbb{E}\big[f(\tilde{x}^t) - f^*\big] - (1-\nu)\,\mathbb{E}\big[f(\tilde{x}^t) - f^{t*}\big],
\end{align*}
where the first line uses (14), the second line uses (15), and the last line follows from $f(\tilde{x}^t) = f^t(\tilde{x}^t)$.

We can continue to bound $f(\tilde{x}^t) - f^{t*}$ in two manners. First,
\[
f(\tilde{x}^t) - f^{t*} = f^t(\tilde{x}^t) - f^{t*} \ge \frac{1}{2(L+\mu)}\|\nabla f(\tilde{x}^t)\|^2 \ge \frac{\sigma}{L+\mu}\big(f(\tilde{x}^t) - f^*\big),
\]
where we used $\nabla f^t(\tilde{x}^t) = \nabla f(\tilde{x}^t)$, the $(L+\mu)$-smoothness of $f^t$, and the $\sigma$-strong convexity of $f$. On the other hand, we have
\[
f(\tilde{x}^t) - f^{t*} \ge f^t(\tilde{x}^t) - f^t(x^*) = f(\tilde{x}^t) - f(x^*) - \frac{\mu}{2}\|\tilde{x}^t - x^*\|^2 \ge (1 - \mu/\sigma)\big(f(\tilde{x}^t) - f^*\big).
\]
Thus (4) follows immediately by combining the above two bounds.

Theorem 2 can be deduced from the following theorem, whose proof can be found in Appendix C.1.
Theorem 6.
Suppose that Assumptions 1, 2, 3 and 4a hold with $c_k \le c$. Then D-SARAH satisfies
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
When $c < \sigma/(2\sqrt{3})$, we can choose $\eta = \frac{2(1 - 4c^2/\sigma^2)}{(9 - 4c^2/\sigma^2)L}$ and $m = \frac{2\kappa(9 - 4c^2/\sigma^2)}{(1 - 4c^2/\sigma^2)^2}$ in Theorem 6, leading to the following convergence rate:
\[
\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \frac{\sigma^2 + 4c^2}{2(\sigma^2 - 4c^2)}\,\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
Consequently, following similar discussions as for D-SVRG, the communication complexity of finding an $\epsilon$-optimal solution is $O(\zeta^{-1}\log(1/\epsilon))$, and the runtime complexity is $O\big((N/n + \zeta^{-1}\kappa)\zeta^{-1}\log(1/\epsilon)\big)$, where $\zeta = 1 - 2\sqrt{3}c/\sigma$.

Theorem 5 can be deduced from the following theorem, proved in Appendix C.2, which specifies the choice of the step size that guarantees convergence.
Theorem 7.
Suppose that Assumptions 1 and 4b hold with $c_k \le c$. By setting the step size
\[
\eta \le \frac{2}{L\left(\sqrt{1 + 8(m-1) + 4m(m-1)c^2/L^2} + 1\right)},
\]
a single outer loop of D-SARAH satisfies
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(y_m)\big].
\]
By setting $\tilde{x}^{t+1} = y_m$, the above theorem gives
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(\tilde{x}^t) - f(\tilde{x}^{t+1})\big].
\]
Hence, with $T$ outer loops, we have
\[
\frac{1}{Tm}\sum_{t=0}^{T-1}\sum_{s=0}^{m-1}\mathbb{E}\Big[\big\|\nabla f\big(y^{t,s}_{k(t)}\big)\big\|^2\Big] \le \frac{2}{\eta T m}\big(f(\tilde{x}^0) - f^*\big),
\]
where $k(t)$ is the worker index selected in the $t$th round for the parameter update. The communication complexity to achieve an $\epsilon$-optimal solution is
\[
T = O\left(\frac{1}{\eta m \epsilon}\right) = O\left(\frac{\sqrt{m} + m c/L}{m}\cdot\frac{L}{\epsilon}\right) = O\left(\left(\frac{1}{\sqrt{m}} + \frac{c}{L}\right)\cdot\frac{L}{\epsilon}\right),
\]
with the choice $\eta = \Theta\left(\frac{1}{L\sqrt{m} + mc}\right)$. Per round, the runtime complexity at each worker is $O(N/n + m)$. By choosing $m = O(N/n)$, we achieve the runtime complexity
\[
O\left(N/n + \Big(\sqrt{N/n} + N/n\cdot\frac{c}{L}\Big)\frac{L}{\epsilon}\right).
\]

Though the focus of this paper is theoretical, we illustrate the performance of the proposed distributed stochastic variance reduced algorithms in various settings as a proof of concept.
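As a concrete reference for the recursive gradient estimator $v_s$ analyzed in this section, here is a minimal single-machine sketch of a SARAH-style inner loop on a toy strongly convex problem; the data and names are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 5
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_sample(x, i):
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / N

def sarah_inner(y0, eta=0.01, m=1000):
    """One outer loop: v_s = grad_z(y_s) - grad_z(y_{s-1}) + v_{s-1}."""
    y_prev, v = y0.copy(), full_grad(y0)   # v_0 is the exact (local) gradient
    y = y_prev - eta * v
    for _ in range(m - 1):
        i = rng.integers(N)
        v = grad_sample(y, i) - grad_sample(y_prev, i) + v   # recursive estimator
        y_prev, y = y, y - eta * v
    return y

y = np.zeros(d)
for _ in range(15):    # outer rounds
    y = sarah_inner(y)
print(np.linalg.norm(full_grad(y)))
```

Unlike SVRG, the estimator is biased conditionally on the past but its accumulated error telescopes, which is exactly what Lemma 8 quantifies in the analysis.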
Consider $\ell_2$-regularized logistic regression, where the sample loss is defined as
\[
\ell(x; z_i) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2, \tag{16}
\]
with the data $z_i = (a_i, b_i) \in \mathbb{R}^d \times \{\pm 1\}$. We evaluate the performance on the gisette dataset [11], by splitting the data equally across all workers. We scale the data so that $\max_{i\in[N]}\|a_i\| = 1$, and the smoothness parameter is then estimated as $L = 1/4 + \lambda$. We choose $\lambda = N^{-0.5}$, $N^{-0.75}$ and $N^{-1}$ to illustrate the performance under different condition numbers. We use the optimality gap, defined as $f(\tilde{x}^t) - f^*$, to illustrate the convergence behavior.

For D-SVRG and D-SARAH, the step size is set as $\eta = 1/(2L)$. For D-MiG, although the choice of $\omega$ in the theory requires knowledge of $c$, we simply ignore it and set $\omega = 1 + \eta\sigma$, $\theta = 1/2$ and the step size $\eta = 1/(3\theta L)$, to reflect the robustness of the practical performance to the parameters. We further use $\tilde{x}^{t+1} = \frac{1}{n}\sum_{k=1}^n y^{t+1}_k$ at the PS for better empirical performance. For D-AGD, the step size is set as $\eta = 1/L$ and the momentum parameter is set as $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. Following [13], which sets the number of inner loop iterations as $m = 2N$, we set $m \approx 2N/n$ to ensure the same number of total inner iterations.
(a) $\lambda = N^{-0.5}$ (b) $\lambda = N^{-0.75}$ (c) $\lambda = N^{-1}$

Figure 1: The optimality gap on $\ell_2$-regularized logistic regression with respect to the number of communication rounds with 4 workers using the gisette dataset under different conditioning for different algorithms.
(a) $n = 4$ (b) $n = 8$ (c) $n = 16$

Figure 2: The optimality gap on $\ell_2$-regularized logistic regression with respect to the number of communication rounds with different numbers of workers using the gisette dataset for different algorithms when $\lambda = N^{-1}$.

Such parameters can be further tuned to achieve a better trade-off between communication cost and computation cost in practice.

Fig. 1 illustrates the optimality gap of the various algorithms with respect to the number of communication rounds with 4 local workers under different conditioning, and Fig. 2 shows the corresponding results with different numbers of local workers when $\lambda = N^{-1}$. The distributed stochastic variance-reduced algorithms outperform distributed AGD significantly. In addition, D-MiG outperforms D-SVRG and D-SARAH when the condition number is large.

We justify the benefit of regularization by evaluating the proposed algorithms under unbalanced data allocation. We assign 50%, 30%, 19.9% and 0.1% of the data to four workers, respectively, and set $\lambda = N^{-1}$ in the logistic regression loss (16). To deal with the unbalanced data, we perform the regularized update, given in (3), on the worker with the least amount of data, and keep the updates on the rest of the workers unchanged. A similar regularized update can be conceived for D-SARAH and D-MiG, resulting in the regularized variants D-RSARAH and D-RMiG. While our theory does not cover them, we still evaluate their numerical performance. We set $\mu$ on this worker according to its amount of data, proportionally to the inverse square root of its local sample size. We set the number of iterations at the workers as $m = 2N$ on all agents. Fig. 3 shows the optimality gap with respect to the number of communication rounds for all algorithms. It can be seen that all unregularized methods fail to converge, while the regularized algorithms still converge, verifying the role of regularization in addressing unbalanced data.
It is also worth mentioning that the regularization can be flexibly imposed depending on the local data size, rather than homogeneously across all workers.
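For completeness, the $\ell_2$-regularized logistic loss (16) used in these experiments, and its gradient, can be sketched and verified against finite differences as follows; the random toy data here stand in for gisette, and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 8
Adata = rng.standard_normal((N, d))
Adata /= np.linalg.norm(Adata, axis=1).max()   # scale so that max_i ||a_i|| = 1
bdata = rng.choice([-1.0, 1.0], size=N)
lam = 1.0 / N

def loss(x):
    # f(x) = (1/N) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2
    return np.mean(np.log1p(np.exp(-bdata * (Adata @ x)))) + 0.5 * lam * x @ x

def grad(x):
    s = 1.0 / (1.0 + np.exp(bdata * (Adata @ x)))   # sigmoid(-b_i a_i^T x)
    return Adata.T @ (-bdata * s) / N + lam * x

# central finite-difference check of the gradient
x = rng.standard_normal(d)
g, eps = grad(x), 1e-6
fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(g - fd)))   # tiny discrepancy
```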
Figure 3: The optimality gap with respect to the number of communication rounds for highly unbalanced data allocation. It can be seen that the regularized variants of the distributed stochastic variance-reduced algorithms still converge while the unregularized ones no longer converge.
We follow the same setting as [34] to evaluate D-SARAH and distributed gradient descent (D-GD) on the gisette dataset with a nonconvex sample loss function:
\[
\ell_{\mathrm{ncvx}}(x; z_i) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \lambda\sum_{j=1}^d \frac{x_j^2}{1 + x_j^2},
\]
which consists of the logistic loss and a nonconvex regularizer, where $x_j$ is the $j$th entry of $x$. The smoothness parameter of $\ell_{\mathrm{ncvx}}(x; z_i)$ can be estimated as $L = 1/4 + 2\lambda$. Fig. 4 plots the squared norm of the gradient, $\|\nabla f(\tilde{x}^t)\|^2$, for D-SARAH and D-GD with respect to the number of communication rounds. It can be seen that D-SARAH achieves a much lower gradient norm than D-GD within the same number of communication rounds.
Figure 4: The squared norm of the gradient with respect to the number of communication rounds on the gisette dataset with 4 workers using a nonconvex loss function.
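A sketch of this nonconvex sample loss with a finite-difference check of its gradient; the data and the value of $\lambda$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
a = rng.standard_normal(d)
a /= np.linalg.norm(a)      # mimic the scaling ||a_i|| <= 1
b_lbl, lam = 1.0, 0.1

def loss_ncvx(x):
    # logistic loss plus the nonconvex regularizer lam * sum_j x_j^2 / (1 + x_j^2)
    return np.log1p(np.exp(-b_lbl * (a @ x))) + lam * np.sum(x**2 / (1 + x**2))

def grad_ncvx(x):
    s = 1.0 / (1.0 + np.exp(b_lbl * (a @ x)))        # sigmoid(-b a^T x)
    return -b_lbl * s * a + lam * 2 * x / (1 + x**2) ** 2

x = rng.standard_normal(d)
eps = 1e-6
fd = np.array([(loss_ncvx(x + eps * e) - loss_ncvx(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(grad_ncvx(x) - fd)))   # tiny discrepancy
```

The regularizer term $x^2/(1+x^2)$ has second derivative bounded by 2, which is where the $2\lambda$ contribution to the smoothness estimate comes from.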
In this paper, we have developed a convergence theory for a family of distributed stochastic variance reduced methods without sampling extra data, under a mild distributed smoothness assumption that measures the discrepancy between the local and global loss functions. Convergence guarantees are obtained for distributed stochastic variance reduced methods using acceleration and recursive gradient updates, and for minimizing both strongly convex and nonconvex losses. We also suggest regularization as a means of ensuring convergence when the local data are unbalanced and heterogeneous. We believe the analysis framework is useful for studying distributed variants of other stochastic variance-reduced methods such as Katyusha [3], and proximal variants such as [37].
Acknowledgements
The work of S. Cen was partly done when visiting MSRA. The work of S. Cen and Y. Chi is supported in partby National Science Foundation under the grant CCF-1806154, Office of Naval Research under the grantsN00014-18-1-2142 and N00014-19-1-2404, and Army Research Office under the grant W911NF-18-1-0303.
References [1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd viagradient quantization and encoding. In
Advances in Neural Information Processing Systems , pages1709–1720, 2017.[2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence ofsparsified gradient methods. In
Advances in Neural Information Processing Systems , pages 5973–5983,2018.[3] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In
Proceedings ofthe 49th Annual ACM SIGACT Symposium on Theory of Computing , pages 1200–1205. ACM, 2017.[4] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. Signsgd: Compressed optimisationfor non-convex problems. In
International Conference on Machine Learning , pages 559–568, 2018.[5] D. P. Bertsekas and J. N. Tsitsiklis.
Parallel and distributed computation: numerical methods , volume 23.Prentice hall Englewood Cliffs, NJ, 1989.[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.[7] S. De and T. Goldstein. Efficient distributed sgd with variance reduction. In , pages 111–120. IEEE, 2016.[8] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.[9] J. Fan, Y. Guo, and K. Wang. Communication-efficient accurate statistical estimation. arXiv preprint arXiv:1906.04870, 2019.[10] C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In
Advances in Neural Information Processing Systems , pages687–697, 2018.[11] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the nips 2003 feature selection challenge.In
Advances in neural information processing systems , pages 545–552, 2005.[12] B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: Aunified analysis of svrg and katyusha using semidefinite programs.
International Conference on MachineLearning (ICML) , 2018.[13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction.In
Advances in neural information processing systems, pages 315–323, 2013.[14] J. Konečný, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.[15] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement.
The Journal of Machine Learning Research , 18(1):4404–4446,2017.[16] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integralquadratic constraints.
SIAM Journal on Optimization , 26(1):57–95, 2016.[17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms out-perform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In
Advances in Neural Information Processing Systems , pages 5330–5340, 2017.[18] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In
Advances inNeural Information Processing Systems , pages 3384–3392, 2015.[19] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the commu-nication bandwidth for distributed training. In
International Conference on Learning Representations ,2018.[20] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex losses.
The Annals ofStatistics , 46(6A):2747–2774, 2018.[21] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learningproblems using stochastic recursive gradient. In
International Conference on Machine Learning , pages2613–2621, 2017.[22] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sumsmooth optimization with sarah. arXiv preprint arXiv:1901.07648 , 2019.[23] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochasticgradient descent. In
Advances in neural information processing systems , pages 693–701, 2011.[24] S. J. Reddi, J. Konečn`y, P. Richtárik, B. Póczós, and A. Smola. Aide: fast and communication efficientdistributed optimization. arXiv preprint arXiv:1608.06879 , 2016.[25] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient.
Mathematical Programming , 162(1-2):83–112, 2017.[26] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application todata-parallel distributed training of speech dnns. In
Fifteenth Annual Conference of the InternationalSpeech Communication Association , 2014.[27] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss mini-mization.
Journal of Machine Learning Research , 14(Feb):567–599, 2013.[28] O. Shamir. Without-replacement sampling for stochastic gradient methods. In
Advances in NeuralInformation Processing Systems , pages 46–54, 2016.[29] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an ap-proximate newton-type method. In
International conference on machine learning , pages 1000–1008,2014.[30] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. Cocoa: A general frameworkfor communication-efficient distributed optimization.
Journal of Machine Learning Research, 18:230, 2018.[31] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D²: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.[32] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In
Conference on Learning Theory , pages 1882–1919, 2017.[33] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. Giant: Globally improved approximatenewton method for distributed optimization. In
Advances in Neural Information Processing Systems ,pages 2338–2348, 2018.[34] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost: A class of faster variance-reducedalgorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690 , 2018.[35] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient dis-tributed optimization. In
Advances in Neural Information Processing Systems , pages 1299–1309, 2018.[36] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reducecommunication in distributed deep learning. In
Advances in neural information processing systems ,pages 1509–1519, 2017.[37] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization , 24(4):2057–2075, 2014.[38] Y. Zhang and X. Lin. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss.
Interna-tional Conference on Machine Learning , pages 362–370, 2015.[39] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical opti-mization. In
Advances in Neural Information Processing Systems , pages 1502–1510, 2012.[40] S. Zhao, G.-D. Zhang, M.-W. Li, and W.-J. Li. Proximal scope for distributed sparse learning. In
Advances in Neural Information Processing Systems , pages 6552–6561, 2018.[41] S.-Y. Zhao, R. Xiang, Y.-H. Shi, P. Gao, and W.-J. Li. Scope: scalable composite optimization forlearning on spark. In
Thirty-First AAAI Conference on Artificial Intelligence , 2017.[42] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergencerates. In
International Conference on Machine Learning , pages 5975–5984, 2018.[43] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In
Advancesin neural information processing systems , pages 2595–2603, 2010.
A Preliminary
We first establish a lemma which will be useful later.
Lemma 3.
When Assumptions 1, 2 and one of the distributed smoothness assumptions (Assumption 4a or 4b) hold, we have
\[
\mathbb{E}_z\big[\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2\big] \le 2L\, D_f(x_1, x_2) +
\begin{cases}
2cL\big(\|x_1 - x^*\|^2 + \|x_2 - x^*\|^2\big) & \text{under Assumption 4a,} \\
cL\,\|x_1 - x_2\|^2 & \text{under Assumption 4b,}
\end{cases}
\]
where the expectation is evaluated over $z$ drawn uniformly from $\mathcal{M}_k$, and $D_f(x_1, x_2) = f(x_1) - f(x_2) - \langle \nabla f(x_2), x_1 - x_2\rangle$ is the Bregman divergence.

Proof. Given that $f$ is $L$-smooth and convex, the Bregman divergence $D_f(x_1, x_2)$ is $L$-smooth and convex as a function of $x_1$. When Assumptions 1 and 2 hold, we have
\[
0 \le D_{\ell_z}(x_1, x_2) - \frac{1}{2L}\big\|\nabla_{x_1} D_{\ell_z}(x_1, x_2)\big\|^2 = D_{\ell_z}(x_1, x_2) - \frac{1}{2L}\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2.
\]
Averaging over $z \in \mathcal{M}_k$ gives
\[
2L\, D_{f_k}(x_1, x_2) \ge \mathbb{E}_z\big[\|\nabla \ell_z(x_1) - \nabla \ell_z(x_2)\|^2\big]. \tag{17}
\]
To further bound the left-hand side, Assumption 4a allows us to compare $D_f$ and $D_{f_k}$:
\begin{align*}
|D_{f_k}(x_1, x_2) - D_f(x_1, x_2)|
&= \Big|D_{f - f_k}(x_1, x^*) + D_{f - f_k}(x^*, x_2) + \big\langle \nabla(f - f_k)(x^*) - \nabla(f - f_k)(x_2), x_1 - x^*\big\rangle\Big| \\
&\le \frac{c}{2}\|x_1 - x^*\|^2 + \frac{c}{2}\|x^* - x_2\|^2 + c\|x^* - x_2\|\|x_1 - x^*\| \\
&\le c\big(\|x_1 - x^*\|^2 + \|x_2 - x^*\|^2\big).
\end{align*}
Following similar arguments, using Assumption 4b we obtain a tighter bound by replacing $x^*$ with any $\tilde{x}$. In particular, setting $\tilde{x} = (x_1 + x_2)/2$, we have $|D_{f_k}(x_1, x_2) - D_f(x_1, x_2)| \le c\|x_1 - x_2\|^2/2$. Combining the above estimates with (17) proves the lemma.
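Lemma 3 can be sanity-checked numerically on quadratic sample losses, for which the Bregman divergence and the restricted smoothness constant are available in closed form. A minimal sketch under these assumptions, reading the Assumption-4b constant $c$ as the smoothness of $f - f_k$ (toy data, illustrative names):

```python
import numpy as np

# Quadratic sample losses ell_z(x) = (1/2)(a_z^T x - b_z)^2 on N samples,
# with worker k holding the first N/n of them.
rng = np.random.default_rng(5)
N, d, n = 120, 4, 4
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
Ak = A[:N // n]                              # data of worker k

H, Hk = A.T @ A / N, Ak.T @ Ak / (N // n)    # Hessians of f and f_k
L = max(np.linalg.norm(a) ** 2 for a in A)   # smoothness of the sample losses
c = np.linalg.norm(H - Hk, 2)                # spectral norm of the Hessian gap

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
delta = x1 - x2

# E_z ||grad ell_z(x1) - grad ell_z(x2)||^2 over worker k's samples
lhs = np.mean([(a @ delta) ** 2 * (a @ a) for a in Ak])
bregman = 0.5 * delta @ H @ delta            # D_f(x1, x2) for a quadratic f
rhs = 2 * L * bregman + c * L * (delta @ delta)
print(lhs <= rhs)  # True
```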
B Proof for D-SVRG
B.1 Proof of Lemma 1
For $\mathbb{E}[S_1]$, we apply Lemma 3 directly:
\[
\mathbb{E}[S_1] = \mathbb{E}\big[\|\nabla \ell_z(y_s) - \nabla \ell_z(x^*)\|^2\big] \le 2L\,\mathbb{E}[f(y_s) - f(x^*)] + 2cL\,\mathbb{E}\big[\|y_s - x^*\|^2\big],
\]
where the inequality follows from $D_f(y_s, x^*) = f(y_s) - f(x^*)$, since $\nabla f(x^*) = 0$. For $\mathbb{E}[S_2]$, we have
\begin{align*}
\mathbb{E}[S_2] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) - (\nabla f_k(x^*) - \nabla f_k(y^0))\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le 4L\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + c(4L + 2c)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where the first inequality is due to $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second inequality follows from evaluating the expectation and Assumption 4a (note that $\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0) = \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*)$ since $\nabla f(x^*) = 0$), and the last step uses Lemma 3 again. For $\mathbb{E}[S_3]$, we have
\begin{align*}
\mathbb{E}[S_3] &= -2\,\mathbb{E}\big[\langle y_s - x^*, \nabla \ell_z(y_s) - \nabla \ell_z(y^0) + \nabla f(y^0)\rangle\big] \\
&= 2\,\mathbb{E}\big[-\langle y_s - x^*, \nabla f(y_s)\rangle\big] - 2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\rangle\big] \\
&\le -2\,\mathbb{E}[f(y_s) - f(x^*)] + 2c\,\mathbb{E}\big[\|y_s - x^*\|\big(\|y_s - x^*\| + \|y^0 - x^*\|\big)\big] \\
&\le -2\,\mathbb{E}[f(y_s) - f(x^*)] + 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where the first inequality is obtained by applying Assumption 2, the Cauchy-Schwarz inequality and Assumption 4a.

B.2 Proof of Lemma 2

Setting $P = I_d$ and $\rho = 1$ in (8), the condition becomes equivalent to (12), and consequently the dissipation inequality (5) holds.
In view of Lemma 1, it can be written as
\begin{align*}
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] &\le (1 + 2cL\lambda_1 + 3c\lambda_3)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] - (2\lambda_3 - 2L\lambda_1)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\big[(4L + 2c)\lambda_2 + \lambda_3\big]\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
Since $f$ is $\sigma$-strongly convex, $\|y_s - x^*\|^2 \le \frac{2}{\sigma}(f(y_s) - f^*)$, so the above becomes
\[
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] \le \mathbb{E}\big[\|y_s - x^*\|^2\big] - \gamma_1\,\mathbb{E}[f(y_s) - f^*] + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + \gamma_2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\]
where $\gamma_1 = 2\lambda_3 - 2L\lambda_1 - (4L\lambda_1 + 6\lambda_3)c/\sigma$ and $\gamma_2 = c[(4L + 2c)\lambda_2 + \lambda_3]$ are introduced as short-hand notations. In addition, $\gamma_1 > 0$ by assumption. Telescoping the above inequality by summing over $s = 0, \ldots, m-1$, we have
\[
\gamma_1 \sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*] \le (1 + \gamma_2 m)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 4L\lambda_2 m\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
Note that the choice of $y^+$ implies
\[
\mathbb{E}\big[f(y^+) - f^*\big] = \frac{1}{m}\sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*].
\]
Therefore,
\[
\gamma_1\,\mathbb{E}\big[f(y^+) - f^*\big] \le (1/m + \gamma_2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 4L\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
We obtain the final result by substituting $\mathbb{E}\big[\|y^0 - x^*\|^2\big] \le \frac{2}{\sigma}\,\mathbb{E}\big[f(y^0) - f^*\big]$ into the above inequality.
When Assumption 2 does not hold, we can still use similar arguments as the proof of Theorem 1 and establishthe convergence of D-SVRG, though at a slower rate. Using the same supply rates (10), Lemma 1 can bemodified as below.
Lemma 4.
Suppose that Assumptions 1, 3 and 4a hold. For the supply rates defined in (10), we have
\begin{align*}
\mathbb{E}[S_1] &\le 2L^2\sigma^{-1}\,\mathbb{E}[f(y_s) - f^*], \\
\mathbb{E}[S_2] &\le 4(L^2 + c^2)\sigma^{-1}\,\mathbb{E}\big[f(y^0) - f^*\big], \\
\mathbb{E}[S_3] &\le -2\,\mathbb{E}[f(y_s) - f^*] + 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
With the $L$-smoothness of $\ell_z$ we have the following estimate:
\[
\mathbb{E}_z\big[\|\nabla \ell_z(y_1) - \nabla \ell_z(y_2)\|^2\big] \le L^2\|y_1 - y_2\|^2.
\]
So we have
\[
\mathbb{E}[S_1] = \mathbb{E}\big[\|\nabla \ell_z(y_s) - \nabla \ell_z(x^*)\|^2\big] \le L^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] \le 2L^2\sigma^{-1}\,\mathbb{E}[f(y_s) - f^*]
\]
and
\begin{align*}
\mathbb{E}[S_2] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) - (\nabla f_k(x^*) - \nabla f_k(y^0))\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f_k(x^*) - \nabla f_k(y^0) + \nabla f(y^0)\|^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le 2(L^2 + c^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \le 4(L^2 + c^2)\sigma^{-1}\,\mathbb{E}\big[f(y^0) - f^*\big].
\end{align*}
The bound on $\mathbb{E}[S_3]$ is identical to that in Lemma 1.

Following the same process as in the proof of Lemma 2 in Appendix B.2, we have the following inequality with proper choices of $\lambda_1$, $\lambda_2$ and $\lambda_3$:
\begin{align*}
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] &\le (1 + 3c\lambda_3)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] - (2\lambda_3 - 2L^2\sigma^{-1}\lambda_1)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4(L^2 + c^2)\sigma^{-1}\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\lambda_3\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] \\
&\le \mathbb{E}\big[\|y_s - x^*\|^2\big] - \big(2\lambda_3 - 2L^2\sigma^{-1}\lambda_1 - 6c\sigma^{-1}\lambda_3\big)\,\mathbb{E}[f(y_s) - f^*] \\
&\quad + 4(L^2 + c^2)\sigma^{-1}\lambda_2\,\mathbb{E}\big[f(y^0) - f^*\big] + c\lambda_3\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
By summing the inequality and letting $\lambda_1 = \lambda_2 = 2\eta^2$, $\lambda_3 = \eta$, we have
\begin{align*}
2\eta\big(1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1}\big)\sum_{s=0}^{m-1}\mathbb{E}[f(y_s) - f^*]
&\le (1 + c\eta m)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + 8(L^2 + c^2)\sigma^{-1}\eta^2 m\,\mathbb{E}\big[f(y^0) - f^*\big] \\
&\le \big(2\sigma^{-1} + 2c\eta m\sigma^{-1} + 8(L^2 + c^2)\sigma^{-1}\eta^2 m\big)\,\mathbb{E}\big[f(y^0) - f^*\big].
\end{align*}
Therefore, with $1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1} > 0$, the following convergence bound can be established:
\[
\mathbb{E}\big[f(y^+) - f^*\big] \le \frac{(\eta\sigma m)^{-1} + c\sigma^{-1} + 4(L^2 + c^2)\sigma^{-1}\eta}{1 - 2L^2\sigma^{-1}\eta - 3c\sigma^{-1}}\,\mathbb{E}\big[f(y^0) - f^*\big].
\]
By choosing $\eta = (1 - 3c/\sigma)(40\kappa L)^{-1}$ and $m = 160\kappa^2(1 - 3c/\sigma)^{-2}$, we get a convergence rate no more than $1 - \frac{1}{4}\cdot\frac{\sigma - 3c}{\sigma - c}$. Hence the overall time complexity to find an $\epsilon$-optimal solution is $O\big((N/n + \zeta^{-2}\kappa^2)\zeta^{-1}\log(1/\epsilon)\big)$, where $\zeta = 1 - 3c/\sigma$.

B.4 Convergence of D-SVRG with Option I
Another option for the output of the inner loops of SVRG, i.e. the output of the local workers, is to output the last iterate, $y^{t+1}_k = y^{t,m}_k$, which is called "Option I" in [13]. Here, we establish the convergence of D-SVRG using Option I in the following theorem.

Theorem 8 (D-SVRG with Option I). Suppose that Assumptions 1, 2 and 3 hold, and Assumption 4a holds with $c < \sigma/2$. With sufficiently large $m$ and sufficiently small step size $\eta$, there exists $0 \le \nu < 1$ such that
\[
\mathbb{E}\big[\|\tilde{x}^{t+1} - x^*\|^2\big] < \nu\,\mathbb{E}\big[\|\tilde{x}^t - x^*\|^2\big].
\]
Theorem 8 indicates that the iterates $\tilde{x}^t$ of D-SVRG with Option I converge to the minimizer $x^*$ linearly in expectation as long as $c$ is sufficiently small. The proof is outlined in Section 6.1. By taking $m \to \infty$ and $\eta \to 0$, the rate approaches $\nu_\infty := \frac{c}{2\sigma - 3c}$, which suggests that the algorithm admits faster convergence as $c$ decreases, as expected.

Following [12], we consider the following four supply rates:
\[
\bar{X}_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\bar{X}_2 = \begin{bmatrix} 2\sigma & -1 & -1 \\ -1 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}, \quad
\bar{X}_3 = \begin{bmatrix} 0 & -L & 0 \\ -L & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad
\bar{X}_4 = \begin{bmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}. \tag{18}
\]
We have the following lemma, which is proved at the end of this subsection.

Lemma 5. Suppose that Assumptions 1, 2, 3 and 4a hold. For the supply rates defined in (18), we have
\begin{align*}
\mathbb{E}[S_1] &\le (L^2 + 2cL - \sigma^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big], \\
\mathbb{E}[S_2] &\le 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big], \\
\mathbb{E}[S_3] &\le 0, \\
\mathbb{E}[S_4] &\le c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
Therefore, by choosing $P = I_d$, $\lambda_1 = 2\eta^2$, $\lambda_2 = \eta - L\eta^2$, $\lambda_3 = \eta^2$, $\lambda_4 = L\eta^2$ and $\rho^2 = 1 - 2\sigma(\eta - L\eta^2)$, the condition (8) holds:
\[
\begin{bmatrix} 0 & 0 & 0 \\ 0 & -\eta^2 & \eta^2 \\ 0 & \eta^2 & -\eta^2 \end{bmatrix} \preceq 0.
\]
This immediately leads to the inequality (7), which reads
\[
\mathbb{E}\big[\|y_{s+1} - x^*\|^2\big] \le \big(1 - (2\sigma - 3c)(\eta - L\eta^2) + cL\eta^2\big)\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + \big(2\eta^2(L^2 + 2cL - \sigma^2) + c\eta\big)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big]. \tag{19}
\]
Let $\tilde{\rho}^2 = 1 - (2\sigma - 3c)(\eta - L\eta^2) + cL\eta^2$.
Telescoping the inequality over $s = 0, 1, \cdots, m-1$ leads to
\[
\mathbb{E}\big[\|y_m - x^*\|^2\big] \le \left(\tilde{\rho}^{2m} + \frac{2\eta(L^2 + 2cL - \sigma^2) + c}{(2\sigma - 3c)(1 - L\eta) - cL\eta}\right)\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\]
Note that when $m \to \infty$ and $\eta \to 0$, the rate becomes $\nu := \frac{c}{2\sigma - 3c}$; hence we need $c < \sigma/2$ to get a rate $\nu < 1$. We have
\[
\mathbb{E}\big[\|\tilde{x}^{t+1} - x^*\|^2\big] \le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\big[\|y^{t,m}_k - x^*\|^2\big] \le \frac{1}{n}\sum_{k=1}^n \nu\,\mathbb{E}\big[\|y^{t,0}_k - x^*\|^2\big] = \nu\,\mathbb{E}\big[\|\tilde{x}^t - x^*\|^2\big].
\]

Proof of Lemma 5.
The following inequalities can be viewed as combinations of standard inequalities in convex optimization (co-coercivity, etc.) and the characterization of restricted smoothness. For $S_1$:
\begin{align*}
\mathbb{E}[S_1] &= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0) + \nabla f(y^0)\|^2\big] \\
&= \mathbb{E}\big[\|\nabla \ell_z(x^*) - \nabla \ell_z(y^0)\|^2\big] + 2\,\mathbb{E}\big[\langle \nabla \ell_z(x^*) - \nabla \ell_z(y^0), \nabla f(y^0)\rangle\big] + \mathbb{E}\big[\|\nabla f(y^0)\|^2\big] \\
&\le L^2\,\mathbb{E}\big[\|x^* - y^0\|^2\big] + 2\,\mathbb{E}\big[\langle \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*), \nabla f(y^0)\rangle\big] - \mathbb{E}\big[\|\nabla f(y^0)\|^2\big] \\
&\le (L^2 + 2cL - \sigma^2)\,\mathbb{E}\big[\|y^0 - x^*\|^2\big],
\end{align*}
where we used $\mathbb{E}_z[\nabla \ell_z(x^*) - \nabla \ell_z(y^0)] = \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*) - \nabla f(y^0)$, Assumption 4a, and $\sigma\|y^0 - x^*\| \le \|\nabla f(y^0)\| \le L\|y^0 - x^*\|$. For $S_2$:
\begin{align*}
\mathbb{E}[S_2] &= 2\,\mathbb{E}\big[\sigma\|y_s - x^*\|^2 - \langle y_s - x^*, \nabla \ell_z(y_s) - \nabla \ell_z(y^0) + \nabla f(y^0)\rangle\big] \\
&= 2\,\mathbb{E}\big[\sigma\|y_s - x^*\|^2 - \langle y_s - x^*, \nabla f(y_s)\rangle\big] - 2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\rangle\big] \\
&\le 2c\,\mathbb{E}\big[\|y_s - x^*\|\big(\|y_s - x^*\| + \|y^0 - x^*\|\big)\big] \\
&\le 3c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}
$\mathbb{E}[S_3] \le 0$ is simply a restatement of the co-coercivity of the $L$-smooth convex losses $\ell(\cdot; z)$, $z \in \mathcal{M}_k$. Finally,
\begin{align*}
\mathbb{E}[S_4] &= -2\,\mathbb{E}\big[\langle y_s - x^*, \nabla(f - f_k)(y^0) - \nabla(f - f_k)(x^*)\rangle\big] \\
&\le 2c\,\mathbb{E}\big[\|y_s - x^*\|\,\|y^0 - x^*\|\big] \le c\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + c\,\mathbb{E}\big[\|y^0 - x^*\|^2\big].
\end{align*}

C Proof for D-SARAH
C.1 Proof of Theorem 6
To begin, we cite two supporting lemmas from [21].
Lemma 6 ([21]). Suppose that Assumption 1 holds. Then
\[
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(y_m)\big] + \sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] - (1 - L\eta)\sum_{s=0}^{m-1}\mathbb{E}\big[\|v_s\|^2\big].
\]

Lemma 7 ([21]). Suppose that Assumptions 1 and 2 hold and $\eta < 2/L$. Then
\[
\mathbb{E}\big[\|v_s - v_{s-1}\|^2\big] \le \frac{\eta L}{2 - \eta L}\Big[\mathbb{E}\big[\|v_{s-1}\|^2\big] - \mathbb{E}\big[\|v_s\|^2\big]\Big].
\]

We also present a new lemma below, with the proof given in Appendix C.3.
Lemma 8.
The update rule of D-SARAH satisfies
\[
\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] = \sum_{j=1}^{s}\mathbb{E}\big[\|v_j - v_{j-1}\|^2\big] - \sum_{j=1}^{s}\mathbb{E}\big[\|\nabla f_k(y_j) - \nabla f_k(y_{j-1})\|^2\big].
\]
By combining Lemmas 7 and 8, we have
\[
\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] \le \sum_{j=1}^{s}\mathbb{E}\big[\|v_j - v_{j-1}\|^2\big] \le \frac{\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big]. \tag{20}
\]
By Assumption 4a, we have
\[
\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - \nabla f(y_s)\| = \|\nabla(f - f_k)(y^0) - \nabla(f - f_k)(y_s)\| \le c\|y_s - x^*\| + c\|y^0 - x^*\|. \tag{21}
\]
Therefore, combining (20) and (21), we have
\begin{align*}
\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] &\le 2\,\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - \nabla f(y_s)\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f(y^0) - \nabla f_k(y^0) + \nabla f_k(y_s) - v_s\|^2\big] \\
&\le 4c^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + 4c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + \frac{2\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big].
\end{align*}
Substituting this into Lemma 6 gives
\begin{align*}
\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] &\le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + \sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s) - v_s\|^2\big] \\
&\le \frac{2}{\eta}\,\mathbb{E}\big[f(y^0) - f(x^*)\big] + \sum_{s=0}^{m-1}\left(4c^2\,\mathbb{E}\big[\|y_s - x^*\|^2\big] + 4c^2\,\mathbb{E}\big[\|y^0 - x^*\|^2\big] + \frac{2\eta L}{2 - \eta L}\,\mathbb{E}\big[\|v_0\|^2\big]\right).
\end{align*}
Since $f$ is $\sigma$-strongly convex, we have $4c^2\|y_s - x^*\|^2 \le \frac{4c^2}{\sigma^2}\|\nabla f(y_s)\|^2$. Denote by $y^+$ the local update, which is selected from $y_0, \cdots, y_{m-1}$ uniformly at random. We have
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(y^+)\|^2\big] = \left(1 - \frac{4c^2}{\sigma^2}\right)\frac{1}{m}\sum_{s=0}^{m-1}\mathbb{E}\big[\|\nabla f(y_s)\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(y^0)\|^2\big].
\]
Since $\tilde{x}^{t+1}$ is randomly chosen from the local outputs $\{y^{t+1}_k, 1 \le k \le n\}$, we have
\[
\left(1 - \frac{4c^2}{\sigma^2}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^{t+1})\|^2\big] \le \left(\frac{1}{\sigma\eta m} + \frac{4c^2}{\sigma^2} + \frac{2\eta L}{2 - \eta L}\right)\mathbb{E}\big[\|\nabla f(\tilde{x}^t)\|^2\big].
\]
Recall Lemma 6. The theorem follows if
$$\sum_{s=0}^{m-1}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\le 0.$$
The rest of this proof is thus dedicated to showing the above inequality. Note that
$$\begin{aligned}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]&\le 2\,\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-\nabla f(y_s)\right\|^2\right]+2\,\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\right]\\&\le 2c^2\,\mathbb{E}\left[\left\|y_0-y_s\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right]\\&\le 2c^2s\sum_{j=1}^{s}\mathbb{E}\left[\left\|y_j-y_{j-1}\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right]\\&=2c^2s\eta^2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_{j-1}\right\|^2\right]+2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_j-v_{j-1}\right\|^2\right],\end{aligned}$$
where the second inequality follows from Lemma 8 and Assumption 4b, the third inequality follows from $y_0-y_s=-\sum_{j=1}^{s}(y_j-y_{j-1})$ and the Cauchy–Schwarz inequality, and the last line follows from the update rule $y_j=y_{j-1}-\eta v_{j-1}$. The $L$-smoothness of $\ell_z$ implies that
$$\left\|v_j-v_{j-1}\right\|=\left\|\nabla\ell_z(y_j)-\nabla\ell_z(y_{j-1})\right\|\le L\left\|y_j-y_{j-1}\right\|=L\eta\left\|v_{j-1}\right\|.$$
So we have
$$\begin{aligned}&\sum_{s=0}^{m-1}\mathbb{E}\left[\|\nabla f(y_s)-v_s\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\\&\le\sum_{s=1}^{m-1}\left(2c^2s+2L^2\right)\eta^2\sum_{j=1}^{s}\mathbb{E}\left[\left\|v_{j-1}\right\|^2\right]-(1-L\eta)\sum_{s=0}^{m-1}\mathbb{E}\left[\|v_s\|^2\right]\\&\le\sum_{s=0}^{m-1}\left(m(m-1)c^2\eta^2+2L^2\eta^2(m-1)-(1-L\eta)\right)\mathbb{E}\left[\|v_s\|^2\right].\end{aligned}$$
Therefore, with $0<\eta\le\frac{2}{L+\sqrt{L^2+4(m-1)(mc^2+2L^2)}}$, we have $m(m-1)c^2\eta^2+2L^2\eta^2(m-1)-(1-L\eta)\le 0$ and the proof is finished.

C.3 Proof of Lemma 8

First, we write
$$\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s=\left[\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right]+\left[\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right]-\left[v_s-v_{s-1}\right].$$
Let $\mathcal{F}_s$ denote the $\sigma$-algebra generated by all random sample selections in sub-iterations $1,\cdots,s-1$.
We have
$$\begin{aligned}&\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\,\middle|\,\mathcal{F}_s\right]\\&=\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2+\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\,\middle|\,\mathcal{F}_s\right]\\&\quad+2\left\langle\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1},\,\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\rangle\\&\quad-2\left\langle\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1},\,\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]\right\rangle\\&\quad-2\left\langle\nabla f_k(y_s)-\nabla f_k(y_{s-1}),\,\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]\right\rangle\\&=\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2-\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\,\middle|\,\mathcal{F}_s\right],\end{aligned}$$
where the second equality follows from
$$\mathbb{E}\left[v_s-v_{s-1}\,\middle|\,\mathcal{F}_s\right]=\mathbb{E}\left[\nabla\ell_z(y_s)-\nabla\ell_z(y_{s-1})\,\middle|\,\mathcal{F}_s\right]=\nabla f_k(y_s)-\nabla f_k(y_{s-1}).$$
Taking expectation over $\mathcal{F}_s$ gives
$$\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_s)-v_s\right\|^2\right]=\mathbb{E}\left[\left\|\nabla f(y_0)-\nabla f_k(y_0)+\nabla f_k(y_{s-1})-v_{s-1}\right\|^2\right]-\mathbb{E}\left[\left\|\nabla f_k(y_s)-\nabla f_k(y_{s-1})\right\|^2\right]+\mathbb{E}\left[\left\|v_s-v_{s-1}\right\|^2\right].$$
Hence, telescoping the above equality (the term for $s=0$ vanishes since $v_0=\nabla f(y_0)$), we obtain the claimed result.
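The D-SARAH outer/inner structure analyzed above can be sanity-checked numerically. The sketch below runs SARAH inner loops on one worker's shard, seeded each round with the exact global gradient as $v_0$, on a synthetic noiseless least-squares problem. All problem sizes, the step size, and the one-worker-per-round schedule are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

# Minimal D-SARAH sketch on synthetic least squares (illustrative parameters).
rng = np.random.default_rng(0)
d, n_machines, local_N = 5, 4, 200
A = [rng.standard_normal((local_N, d)) for _ in range(n_machines)]
x_star = rng.standard_normal(d)
b = [Ak @ x_star for Ak in A]                 # noiseless: x_star minimizes every f_k

def grad_f(x):
    """Global gradient: average of the local least-squares gradients."""
    return sum(Ak.T @ (Ak @ x - bk) / local_N for Ak, bk in zip(A, b)) / n_machines

def sarah_inner(k, x_tilde, eta=0.05, m=100):
    """Worker k's SARAH inner loop, seeded with the global gradient v_0."""
    y_prev = x_tilde.copy()
    v = grad_f(x_tilde)                       # v_0 = full gradient from the server
    y = y_prev - eta * v
    for _ in range(1, m):
        i = rng.integers(local_N)             # sample one local data point z
        a, bi = A[k][i], b[k][i]
        # recursive estimator: v_s = grad l_z(y_s) - grad l_z(y_{s-1}) + v_{s-1}
        v = a * (a @ y - bi) - a * (a @ y_prev - bi) + v
        y_prev, y = y, y - eta * v
    return y

x = np.zeros(d)
g0 = np.linalg.norm(grad_f(x))
for t in range(20):                           # outer loops on the parameter server
    x = sarah_inner(t % n_machines, x)        # one worker per round, for brevity
print(np.linalg.norm(grad_f(x)) / g0)         # gradient-norm ratio shrinks quickly
```

Because the local data are drawn from a common distribution, $f_k\approx f$ (small $c$), and the global gradient norm contracts across communication rounds, as predicted by the strongly convex analysis.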
D Proof for D-MiG (Theorem 3)
As earlier, we simplify the notations $y_k^{t,s}$, $x_k^{t,s}$ and $v_k^{t,s}$ by dropping the superscript $t$ and the subscript $k$. In this section we deal with the composite objective $F(x)=f(x)+g(x)$ as mentioned in Remark 2, where $g$ is a convex and possibly non-smooth function known to all agents. The analysis is carried out by carefully adapting the proof of the centralized algorithm (i.e., [42, Section B.1]) to the distributed smoothness assumption. We impose the following constraint on the step size $\eta$:
$$L\theta+\frac{L\theta}{1-\theta}+2c\le\frac{1}{\eta}.\tag{22}$$
We restate the inequalities (8) and (9) of [42], with notation changed to match our context:
$$f(y_{s-1})-f(u)\le\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle+\left\langle\nabla f(y_{s-1}),x_{s-1}-u\right\rangle,\tag{23}$$
$$\left\langle\nabla f(y_{s-1}),x_{s-1}-u\right\rangle=\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_{s-1}-u\right\rangle+\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle+\left\langle\tilde\nabla,x_s-u\right\rangle,\tag{24}$$
where $\tilde\nabla=\nabla\ell_z(y_{s-1})-\nabla\ell_z(\tilde{x})+\nabla f(\tilde{x})$, with $z$ randomly selected from $\mathcal{M}_k$, and $u\in\mathbb{R}^d$ is an arbitrary vector. Following the $L$-smoothness argument in [42], (24) leads to
$$\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle\le\frac{1}{\theta}\left(f(y_{s-1})-f(y_s)\right)+\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-x_{s-1}\right\rangle+\frac{L\theta}{2}\left\|x_s-x_{s-1}\right\|^2.$$
By plugging in the constraint (22), we have
$$\left\langle\tilde\nabla,x_{s-1}-x_s\right\rangle\le\frac{1}{\theta}\left(f(y_{s-1})-f(y_s)\right)+\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-x_{s-1}\right\rangle+\left(\frac{1}{2\eta}-\frac{L\theta}{2(1-\theta)}-c\right)\left\|x_s-x_{s-1}\right\|^2.\tag{25}$$
By combining (23), (24), (25) and [42, Lemma 3], and then taking expectation over the choice of the random sample $z$, we have
$$\begin{aligned}f(y_{s-1})-f(u)&\le\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-u\right\rangle\right]+\frac{1}{\theta}\left(f(y_{s-1})-\mathbb{E}\left[f(y_s)\right]\right)\\&\quad-\left(\frac{L\theta}{2(1-\theta)}+c\right)\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]+\frac{1}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right]+g(u)-\mathbb{E}\left[g(x_s)\right].\end{aligned}\tag{26}$$
We further split the term $\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,x_s-u\right\rangle\right]$ as
$$\begin{aligned}&\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,\,x_s-u\right\rangle\right]\\&=\mathbb{E}\left[\left\langle\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla,\,x_s-x_{s-1}\right\rangle\right]+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\tilde\nabla,\,x_{s-1}-u\right\rangle\right]\\&\quad+\mathbb{E}\left[\left\langle\nabla f(y_{s-1})-\nabla f(\tilde{x})+\nabla f_k(\tilde{x})-\nabla f_k(y_{s-1}),\,x_s-x_{s-1}\right\rangle\right]\\&\le\frac{\beta}{2}\,\mathbb{E}\left[\left\|\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla\right\|^2\right]+\frac{1}{2\beta}\,\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]\\&\quad+c\left\|\tilde{x}-y_{s-1}\right\|\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|\right]+c\left\|\tilde{x}-y_{s-1}\right\|\left\|x_{s-1}-u\right\|\\&\le\frac{\beta}{2}\left(2L\,D_f(\tilde{x},y_{s-1})+2cL\left\|\tilde{x}-y_{s-1}\right\|^2\right)+\left(\frac{1}{2\beta}+\frac{c}{2}\right)\mathbb{E}\left[\left\|x_s-x_{s-1}\right\|^2\right]+c\left\|\tilde{x}-y_{s-1}\right\|^2+\frac{c}{2}\left\|x_{s-1}-u\right\|^2,\end{aligned}$$
where the first inequality is due to the Cauchy–Schwarz inequality and Assumption 4b, with $\beta>0$ satisfying $\frac{1-\theta}{\theta}=L\beta$, and the last inequality is obtained by combining
$$\mathbb{E}\left[\left\|\nabla f(\tilde{x})-\nabla f_k(\tilde{x})+\nabla f_k(y_{s-1})-\tilde\nabla\right\|^2\right]\le\mathbb{E}\left[\left\|\nabla\ell_z(y_{s-1})-\nabla\ell_z(\tilde{x})\right\|^2\right]$$
with Lemma 3 under Assumption 4b, along with the inequality $ab\le(a^2+b^2)/2$. By substituting this bound into (26), and noting that the choice of $\beta$ cancels both $\frac{1-\theta}{\theta}\left\langle\nabla f(y_{s-1}),\tilde{x}-y_{s-1}\right\rangle$ (through $\beta L\,D_f(\tilde{x},y_{s-1})$) and $\left(\frac{L\theta}{2(1-\theta)}+c\right)\mathbb{E}\left[\|x_s-x_{s-1}\|^2\right]$, we have, using $\tilde{x}-y_{s-1}=\theta(\tilde{x}-x_{s-1})$ so that $\left(\beta cL+c\right)\|\tilde{x}-y_{s-1}\|^2=c\theta\|\tilde{x}-x_{s-1}\|^2$,
$$\begin{aligned}f(y_{s-1})-f(u)&\le\frac{1-\theta}{\theta}\left(f(\tilde{x})-f(y_{s-1})\right)+c\theta\left\|\tilde{x}-x_{s-1}\right\|^2+\frac{c}{2}\left\|x_{s-1}-u\right\|^2+\frac{1}{\theta}\left(f(y_{s-1})-\mathbb{E}\left[f(y_s)\right]\right)\\&\quad+\frac{1}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right]+g(u)-\mathbb{E}\left[g(x_s)\right].\end{aligned}$$
By rearranging the above inequality, we can cancel the term $f(y_{s-1})$. We further use the convexity of $g$ and $y_s=\theta x_s+(1-\theta)\tilde{x}$, which give $-g(x_s)\le\frac{1-\theta}{\theta}g(\tilde{x})-\frac{1}{\theta}g(y_s)$, leading to
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F(u)\right)\le c\theta\left\|\tilde{x}-x_{s-1}\right\|^2+\frac{1-\theta}{\theta}\left(F(\tilde{x})-F(u)\right)+\frac{1+c\eta}{2\eta}\left\|x_{s-1}-u\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-u\right\|^2\right].$$
Note that $\left\|\tilde{x}-x_{s-1}\right\|^2\le 2\left\|\tilde{x}-x^*\right\|^2+2\left\|x_{s-1}-x^*\right\|^2$. By setting $u=x^*$ and using the $\sigma$-strong convexity of $F$, which gives $2c\theta\|\tilde{x}-x^*\|^2\le\frac{4c\theta}{\sigma}\left(F(\tilde{x})-F^*\right)$, we have
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F^*\right)\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\left(F(\tilde{x})-F^*\right)+\frac{1+(1+4\theta)c\eta}{2\eta}\left\|x_{s-1}-x^*\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-x^*\right\|^2\right].$$
To simplify the analysis we impose another constraint $\theta\le 1/2$, so that $(1+4\theta)c\eta\le 3c\eta$:
$$\frac{1}{\theta}\left(\mathbb{E}\left[F(y_s)\right]-F^*\right)\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\left(F(\tilde{x})-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_{s-1}-x^*\right\|^2-\frac{1+\eta\sigma}{2\eta}\,\mathbb{E}\left[\left\|x_s-x^*\right\|^2\right].\tag{27}$$
Let $w=\frac{1+\eta\sigma}{1+3c\eta}$.
Multiplying (27) at inner iteration $s+1$ by $w^s$ and then summing over $s=0,\cdots,m-1$, the distance terms telescope since $w(1+3c\eta)=1+\eta\sigma$, and we have
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(y_{s+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\left\|x_m-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x})-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_0-x^*\right\|^2.$$
Adding the superscript $t$ and the subscript $k$ back and applying Jensen's inequality to the definition of $y_k^{t+}$, we get
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(y_k^{t+})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\left\|x_k^{t,0}-x^*\right\|^2.$$
Averaging the inequality over $k=1,\cdots,n$, we have
$$\frac{1}{\theta}\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2.$$

Case I: $\frac{m(\sigma-c)}{L}\le\frac{3}{4}$. By setting $\theta=\sqrt{\frac{m(\sigma-c)}{3L}}\le\frac{1}{2}$ and $\eta=\frac{1}{3L\theta+2c}$ we satisfy the step size constraint (22). We aim to show that
$$\frac{1}{\theta}\ge\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)w^m.$$
We have $w=\frac{1+\eta\sigma}{1+3c\eta}\le 1+\eta(\sigma-c)$. Note that $L\theta=\sqrt{\frac{Lm(\sigma-c)}{3}}\ge\sqrt{\frac{\sigma(\sigma-c)}{3}}>c$, so $\eta\le\frac{1}{3L\theta}$. Moreover, with $\sigma>c$ we have
$$\zeta:=m\eta(\sigma-c)\le\frac{m(\sigma-c)}{3L\theta}=\theta\le\frac{1}{2},$$
so that $w^m\le\left(1+\zeta/m\right)^m\le e^{\zeta}\le e^{\theta}$. Since $c$ is a small constant fraction of $\sigma$ (e.g., $c\le\sigma/10$), we have $\frac{4c\theta^2}{\sigma}\le\theta-1+e^{-\theta}$ for $\theta\le\frac12$, hence
$$\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)w^m\le\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)e^{\theta}\le 1.$$
So we have
$$\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)w^m=\frac{1}{\theta}\left(1-\theta+\frac{4c\theta^2}{\sigma}\right)w^m\le\frac{1}{\theta}.$$
Therefore we have
$$w^m\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{w^m(1+3c\eta)}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\left(\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}\right)\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2.$$
The convergence rate over $T$ rounds of communication is $w^{-Tm}=\left(1+\Omega\left(\frac{1}{\sqrt{\kappa m}}\right)\right)^{-Tm}$, so the communication complexity is $T=O\left(\sqrt{\kappa/m}\,\log(1/\epsilon)\right)$. With the choice of $m=\Theta(N/n)$ the runtime complexity is $O\left((N/n+m)T\right)=O\left(\sqrt{\kappa N/n}\,\log(1/\epsilon)\right)$.

Case II: $\frac{m(\sigma-c)}{L}>\frac{3}{4}$. We set $\theta=\frac{1}{2}$ and $\eta=\frac{1}{3L\theta+2c}$. We have $\eta\sigma\le\frac{2\sigma}{3L}\le\frac{2}{3}$ and $3c\eta\le\frac{2c}{L}\le 1$, thus
$$w=\frac{1+\eta\sigma}{1+3c\eta}\ge 1+\frac{\eta(\sigma-3c)}{2}\quad\text{and}\quad w^m\ge 1+\frac{m\eta(\sigma-3c)}{2}\ge 1+\frac{m(\sigma-3c)}{3L+4c},$$
which is bounded away from $1$ by a constant since $m(\sigma-c)>\frac{3L}{4}$ and $c$ is a small constant fraction of $\sigma$. Since $\frac{1}{\theta}=2$ and $\frac{1-\theta}{\theta}+\frac{4c\theta}{\sigma}=1+\frac{2c}{\sigma}$, the averaged inequality yields
$$\sum_{s=0}^{m-1}w^s\left(\mathbb{E}\left[F(\tilde{x}^{t+1})\right]-F^*\right)+\frac{1+3c\eta}{2\eta}\,\mathbb{E}\left[\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t+1,0}-x^*\right\|^2\right]\le\rho\left(\sum_{s=0}^{m-1}w^s\left(F(\tilde{x}^t)-F^*\right)+\frac{1+3c\eta}{2\eta}\cdot\frac{1}{n}\sum_{k=1}^{n}\left\|x_k^{t,0}-x^*\right\|^2\right)$$
with the constant contraction factor $\rho=\left(1+\frac{2c}{\sigma}\right)\max\left\{\frac{1}{2},\,w^{-m}\right\}$, which is strictly less than $1$ when $c$ is a sufficiently small constant fraction of $\sigma$. This implies that $O(\log(1/\epsilon))$ rounds of communication are sufficient to find an $\epsilon$-accurate solution, and that the runtime complexity is $O\left(N/n\,\log(1/\epsilon)\right)$.

E Discussions on Distributed Smoothness
In this paper, we have established that distributed variance reduced methods admit a simple convergence analysis under the distributed smoothness of all worker machines, as long as the parameter $c$ is smaller than a constant fraction of $\sigma$, the strong convexity parameter of the global loss function $f$. In this section, we show that distributed smoothness can be guaranteed for many practical loss functions as long as the local data size is sufficiently large and the data are homogeneous across machines. This is as expected, since SVRG and SARAH rely heavily on exploiting data similarity to reduce the variance.

Since distributed smoothness only examines the gradient information of $f_k-f$, it can be applied to loss functions with non-smooth gradients, e.g. the Huber loss. However, for simplicity of exposition, we limit our focus to the case where the sample loss $\ell_z(\cdot)$ is twice differentiable and demonstrate the smoothness of $f-f_k$ via uniform concentration of Hessian matrices.

For simplicity, we consider the quadratic loss case, which allows us to compare with the existing result for the DANE algorithm [29], a communication-efficient approximate Newton-type algorithm. Assume $\ell(\cdot,z)$ is quadratic for all $z$. Recall the following result on the concentration of Hessian matrices from [29].

Lemma 9. [29] If $0\preceq\nabla^2\ell_z(x)\preceq L\cdot I$ holds for all $z$, then with probability at least $1-\delta$ over the samples, for all $x$,
$$\max_{1\le k\le n}\left\|\nabla^2 f_k(x)-\nabla^2 f(x)\right\|\le\sqrt{\frac{32L^2\log(dn/\delta)}{N/n}}.$$
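The concentration phenomenon in Lemma 9 is easy to observe numerically for quadratic losses, where the Hessian of each local empirical risk is data-dependent but parameter-independent. The sketch below (problem sizes are illustrative assumptions) compares the largest local-versus-global Hessian gap at two local sample sizes.

```python
import numpy as np

# Numerical sketch of Lemma 9's message for quadratic losses: each local
# Hessian (1/|M_k|) * sum_i a_i a_i^T concentrates around the global Hessian,
# with the spectral-norm gap shrinking roughly like sqrt(1/(N/n)).
rng = np.random.default_rng(1)
d, n = 10, 8                                   # dimension, number of machines

def max_hessian_gap(local_size):
    A = [rng.standard_normal((local_size, d)) for _ in range(n)]
    H_loc = [Ak.T @ Ak / local_size for Ak in A]           # Hessian of f_k
    H = sum(H_loc) / n                                     # Hessian of f
    return max(np.linalg.norm(Hk - H, 2) for Hk in H_loc)  # spectral norm

gap_small = max_hessian_gap(100)               # small local data => large gap
gap_large = max_hessian_gap(10000)             # large local data => small gap
print(gap_large < gap_small)
```

Increasing the local sample size by a factor of 100 shrinks the observed gap by roughly an order of magnitude, consistent with the $\sqrt{1/(N/n)}$ rate, and hence yields a smaller distributed-smoothness parameter $c$.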
Moreover, the iteration complexity of DANE is given by the theorem below.
Theorem 9. [29] If $0\preceq\nabla^2\ell_z(x)\preceq L\cdot I$ holds for all $z$ and $\sigma I\preceq\nabla^2 f(x)\preceq L\cdot I$, then with probability exceeding $1-\delta$, DANE needs
$$O\left(\frac{\kappa^2}{N/n}\log\left(\frac{dn}{\delta}\right)\log\left(\frac{L\left\|x_0-x^*\right\|^2}{\epsilon}\right)\right)$$
iterations to find an $\epsilon$-optimal solution.

By Theorem 9, [29] claims that when the local data size of every machine is sufficiently large, namely
$N/n=\Omega\left(\kappa^2\log(dn)\right)$, DANE can find a desired $\epsilon$-optimal solution with $O(\log(1/\epsilon))$ iterations and is thus communication-efficient. Note that at this local data size, according to Lemma 9, it is sufficient to establish $c=O(\sigma)$, which satisfies the convergence requirement of D-SVRG and D-SARAH. Consequently, the proposed D-SVRG and D-SARAH converge at the same iteration complexity as DANE, that is, $O(\log(1/\epsilon))$. Recall that DANE requires its local subproblems to be solved exactly. In contrast, our results formally justify that SVRG and SARAH can be safely used as inexact local solvers.
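This inexact-solver viewpoint can be illustrated with a minimal, self-contained simulation of the distributed SVRG framework on synthetic least squares. The problem sizes, step size, and the server-side averaging step below are illustrative assumptions rather than the paper's exact Algorithm 1.

```python
import numpy as np

# Minimal distributed-SVRG simulation: the server broadcasts the full
# gradient once per round; each worker then runs SVRG inner loops on its
# own shard as an inexact local solver (illustrative parameters).
rng = np.random.default_rng(2)
d, n, local_N = 5, 4, 500
A = [rng.standard_normal((local_N, d)) for _ in range(n)]
x_star = rng.standard_normal(d)
b = [Ak @ x_star for Ak in A]                  # noiseless: x_star is optimal

def grad_f(x):                                 # server-side full gradient
    return sum(Ak.T @ (Ak @ x - bk) / local_N for Ak, bk in zip(A, b)) / n

def svrg_inner(k, x_tilde, g_tilde, eta=0.05, m=100):
    y = x_tilde.copy()
    for _ in range(m):
        i = rng.integers(local_N)
        a, bi = A[k][i], b[k][i]
        # variance-reduced gradient built from worker k's local samples
        v = a * (a @ y - bi) - a * (a @ x_tilde - bi) + g_tilde
        y -= eta * v
    return y

x = np.zeros(d)
g0 = np.linalg.norm(grad_f(x))
for t in range(20):                            # each round = one communication
    g = grad_f(x)
    x = np.mean([svrg_inner(k, x, g) for k in range(n)], axis=0)
print(np.linalg.norm(grad_f(x)) / g0)          # small after a few rounds
```

No local subproblem is ever solved exactly here; a fixed number of variance-reduced stochastic steps per round suffices for the global gradient norm to contract, mirroring the linear-convergence guarantee when $c$ is a small fraction of $\sigma$.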