Gradient Coding: Avoiding Stragglers in Synchronous Gradient Descent
Rashish Tandon*, Qi Lei†, Alexandros G. Dimakis‡ and Nikos Karampatziakis§
* Department of Computer Science, UT Austin
† Institute for Computational Engineering and Sciences, UT Austin
‡ Department of Electrical and Computer Engineering, UT Austin
§ Microsoft
{rashish@cs, leiqi@ices, dimakis@austin}.utexas.edu, [email protected]

March 9, 2017
Abstract
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.
1 Introduction

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. The central idea can be seen through the simple example of Figure 1: Consider synchronous Gradient Descent (GD) on three workers (W₁, W₂, W₃). The baseline vanilla system is shown in the left figure and operates as follows: The three workers have different partitions of the labeled data stored locally (D₁, D₂, D₃) and all share the current model. Worker 1 computes the gradient of the model on examples in partition D₁, denoted by g₁. Similarly, Workers 2 and 3 compute g₂ and g₃. The three gradient vectors are then communicated to a central node A (called the master/aggregator), which computes the full gradient by summing these vectors, g₁ + g₂ + g₃, and updates the model with a gradient step. The new model is then sent to the workers and the system moves to the next round (where the same examples or other labeled examples, say D₁′, D₂′, D₃′, will be used in the same way). The problem is that worker nodes can sometimes be stragglers (Li et al., 2014; Ho et al., 2013; Dean et al., 2012), i.e. they can delay significantly in computing and communicating gradient vectors to the master. This is especially pronounced for cheaper virtual machines in the cloud: on t2.micro machines on Amazon EC2, as can be seen in Figure 2, some machines can be 5× slower in computing and communicating gradients compared to typical performance.

Figure 1: The idea of Gradient Coding. (a) Naive synchronous gradient descent. (b) Gradient coding: the vector g₁ + g₂ + g₃ is in the span of any two out of the vectors g₁/2 + g₂, g₂ − g₃, and g₁/2 + g₃.

First, we discuss one way to resolve this problem if we replicate some data across machines, by considering the placement in Fig. 1(b) but without coding. As can be seen, in Fig. 1(b) each example is replicated two times using a specific placement policy. Each worker is assigned to compute two gradients on the two partitions it stores for this round. For example, W₁ will compute the vectors g₁ and g₂. Now let us assume that W₃ is the straggler. If we use control messages, W₁ and W₂ can notify the master A that they are done. Subsequently, if feedback is used, the master can ask W₁ to send g₁ and g₂ and W₂ to send g₃. These feedback control messages can be much smaller than the actual gradient vectors, but they are still a system complication that can cause delays. Nonetheless, feedback makes it possible for a centralized node to coordinate the workers, thereby avoiding stragglers. One can also reduce network communication further by simply asking W₁ to send the sum of two gradient vectors, g₁ + g₂, instead of sending both; the master can then create the global gradient on this batch by summing these two received vectors. Unfortunately, which linear combination must be sent depends on who the straggler is: if W₂ were the straggler, then W₁ should be sending g₂ and W₃ sending g₁ + g₃, so that their sum is the global gradient g₁ + g₂ + g₃.

In this paper we show that feedback and coordination are not necessary: every worker can send a single linear combination of gradient vectors without knowing who the straggler will be. The main coding theoretic question we investigate is how to design these linear combinations so that any two of them (or, generally, any fixed number of them) contain the vector g₁ + g₂ + g₃ in their span. In our example, in Fig. 1(b), W₁ sends g₁/2 + g₂, W₂ sends g₂ − g₃, and W₃ sends g₁/2 + g₃. The reader can verify that A can obtain the vector g₁ + g₂ + g₃ from any two out of these three vectors; for instance, g₁ + g₂ + g₃ = 2(g₁/2 + g₂) − (g₂ − g₃). We call this idea gradient coding.

We consider this problem in the general setting of n machines and any s stragglers. We first establish a lower bound: to compute gradients on all the data in the presence of any s stragglers, each partition must be replicated s + 1 times across machines. We propose two placement and gradient coding schemes that match this optimal s + 1 replication factor. We further consider a partial straggler setting, wherein we assume that a straggler can compute gradients at a fraction of the speed of the others, and show how our schemes can be adapted to such scenarios. All proofs can be found in the appendix.

We also compare our scheme against the popular approach of ignoring the stragglers (Chen et al., 2016): simply performing a gradient step when most workers are done. We see that while ignoring the stragglers is faster, it loses some data, which can hurt the generalization error. This can be especially pronounced in supervised learning with unbalanced labels or heavily unbalanced features, since a few examples may contain critical, previously unseen information.
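To make the span claim above concrete, here is a quick numpy check (our illustration, not code from the paper) that the full gradient can be decoded from any two of the three coded messages of Fig. 1(b):

    import numpy as np

    p = 5                               # toy gradient dimension
    g1, g2, g3 = np.random.randn(3, p)  # stand-ins for the three partial gradients
    full = g1 + g2 + g3

    m1 = g1 / 2 + g2                    # sent by W1 (stores D1, D2)
    m2 = g2 - g3                        # sent by W2 (stores D2, D3)
    m3 = g1 / 2 + g3                    # sent by W3 (stores D1, D3)

    assert np.allclose(full, 1 * m2 + 2 * m3)   # W1 straggles
    assert np.allclose(full, 1 * m1 + 1 * m3)   # W2 straggles
    assert np.allclose(full, 2 * m1 - 1 * m2)   # W3 straggles (the identity above)

The three decoding rules are exactly the rows of the matrix A introduced in the preliminaries below.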
Figure 2: Average communication times, measured over 100 rounds, for a vector of dimension p = 500000, using n = 50 t2.micro worker machines (and a c3.8xlarge master machine). Error bars indicate one standard deviation.

In Figure 2, we show the average time required for 50 t2.micro Amazon EC2 instances to communicate gradients to a single master machine (a c3.8xlarge instance). We observe that a few worker machines incurred a communication delay of up to 5× the typical behavior. Interestingly, throughout the timescale of our experiments (a few hours), the straggling behavior was consistent in the same machines.

We have also experimented extensively with other Amazon EC2 instances: our finding is that cheaper instance types have significantly higher variability in performance. This is especially true for t2-type instances, which AWS describes as having Burstable Performance. Fortunately, these machines have very low cost.

The choice of the number and type of workers used in training big models ultimately depends on the total cost and the time needed until deployment. The main message of this paper is that opting for very low-cost instances and using coding to mitigate stragglers may be a sensible choice for some learning problems.
1.1 Related Work

The slow machine problem is the Achilles heel of many distributed learning systems that run in modern cloud environments. Recognizing that, some recent work has advocated asynchronous approaches to learning (Li et al., 2014; Ho et al., 2013; Mitliagkas et al., 2016). While asynchronous updates are a valid way to avoid slow machines, they give up many other desirable properties, including faster convergence rates, amenability to analysis, and ease of reproducibility and debugging.

Attacking the straggler problem in synchronous machine learning algorithms has, surprisingly, not received much attention in the literature. There do exist general systems solutions such as speculative execution (Zaharia et al., 2008), but we believe that approaches tailored to machine learning can be vastly more efficient. In Chen et al. (2016), the authors use synchronous minibatch SGD and request a small number of additional worker machines so that they have an adequate minibatch size even when some machines are slow. However, this approach does not handle machines that are consistently slow, and the data on those machines might never participate in training. In Narayanamurthy et al. (2013), the authors describe an approach for dealing with failed machines by approximating the loss function in the failed partitions with a linear approximation at the last iterate before they failed. Since the linear approximation is only valid in a small neighborhood of the model parameters, this approach can only work if failed data partitions are restored fairly quickly.

The work of Lee et al. (2015) is the closest in spirit to our work, using coding theory and treating stragglers as erasures in the transmission of the computed results. However, we focus on codes for recovering the batch gradient of any loss function, while Lee et al. (2015) and the more recent work of Dutta et al. (2016) describe techniques for mitigating stragglers in two different distributed applications: data shuffling and matrix multiplication. We also mention Li et al. (2016a), which investigates a generalized view of the coding ideas in Lee et al. (2015), showing that their solution is a single operating point in a general scheme trading off the latency of computation against the load of communication. Further closely related work has shown how coding can be used for distributed MapReduce, exhibiting a similar communication and computation tradeoff (Li et al., 2015, 2016b). All these prior works develop novel coding techniques, but do not code across gradient vectors in the way we are proposing in this paper.
2 Preliminaries

Given data D = {(x₁, y₁), ..., (x_d, y_d)}, with each tuple (x, y) ∈ R^p × R, several machine learning tasks aim to solve the following problem:

    β* = argmin_{β ∈ R^p} Σ_{i=1}^{d} ℓ(β; x_i, y_i) + λR(β)    (1)

where ℓ(·) is a task-specific loss function and R(·) is a regularization function. Typically, this optimization problem is solved using gradient-based approaches. Let g := Σ_{i=1}^{d} ∇ℓ(β^(t); x_i, y_i) be the gradient of the loss at the current model β^(t). Then the updates to the model are of the form:

    β^(t+1) = h_R(β^(t), g)    (2)

where h_R is a gradient-based optimizer, which also depends on R(·). Several methods, such as gradient descent, accelerated gradient, conditional gradient (Frank-Wolfe), proximal methods, LBFGS, and bundle methods, fit in this framework. However, if the number of samples, d, is large, a computational bottleneck in the above update step is the computation of the gradient, g, and this computation can be distributed.

Throughout this paper, we let d denote the number of samples, n the number of workers, k the number of data partitions, and s the number of stragglers/failures. The n workers are denoted as W₁, W₂, ..., W_n, and the partial gradients over the k data partitions as g₁, g₂, ..., g_k. The i-th row of a matrix A or B is denoted a_i or b_i, respectively. For any vector x ∈ R^n, supp(x) denotes its support, i.e. supp(x) = {i | x_i ≠ 0}, and ‖x‖₀ denotes its ℓ₀-norm, i.e. the cardinality of the support. 1_{p×q} and 0_{p×q} denote the all-1s and all-0s matrices of dimension p × q. Finally, for any r ∈ N, [r] denotes the set {1, ..., r}.

We can generalize the scheme in Figure 1(b) to n workers and k data partitions by setting up a system of linear equations:

    AB = 1_{f×k}    (3)

where f denotes the number of combinations of surviving workers/non-stragglers, 1_{f×k} is the all-1s matrix of dimension f × k, and we have matrices A ∈ R^{f×n} and B ∈ R^{n×k}.

We associate the i-th row of B, b_i, with the i-th worker, W_i. The support of b_i, supp(b_i), corresponds to the data partitions that worker W_i has access to, and the entries of b_i encode a linear combination over their gradients that worker W_i transmits. Let ḡ ∈ R^{k×p} be a matrix with each row being the partial gradient of a data partition, i.e. ḡ = [g₁, g₂, ..., g_k]^T. Then worker W_i transmits b_i ḡ. Note that to transmit b_i ḡ, W_i only needs to compute the partial gradients on the partitions in supp(b_i). Now, each row of A is associated with a specific failure/straggler scenario to which tolerance is desired. In particular, any row a_i, with support supp(a_i), corresponds to the scenario where the worker indices in supp(a_i) are alive/non-stragglers. Also, by the construction in Eq. (3), we have:

    a_i B ḡ = [1, 1, ..., 1] ḡ = Σ_{j=1}^{k} g_j    and    (4)

    a_i B ḡ = Σ_{k ∈ supp(a_i)} a_i(k) (b_k ḡ)    (5)

where a_i(k) denotes the k-th element of the row a_i. Thus, the entries of a_i encode a linear combination which, when taken over the transmitted gradients of the alive/non-straggler workers, {b_k ḡ}_{k ∈ supp(a_i)}, yields the full gradient.

Going back to the example in Fig. 1(b), the corresponding A and B matrices under the above generalization are:

    A = [ 0   1   2 ]        B = [ 1/2   1    0 ]
        [ 1   0   1 ]            [  0    1   −1 ]    (6)
        [ 2  −1   0 ]            [ 1/2   0    1 ]

with f = 3, n = 3, k = 3. It is easy to check that AB = 1_{3×3}.
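As a sanity check of Eq. (3), the following numpy snippet (our illustration, not the paper's code) verifies that the matrices of Eq. (6) satisfy AB = 1_{3×3}:

    import numpy as np

    A = np.array([[0., 1., 2.],
                  [1., 0., 1.],
                  [2., -1., 0.]])   # row i decodes the scenario where worker i+1 straggles
    B = np.array([[0.5, 1., 0.],
                  [0., 1., -1.],
                  [0.5, 0., 1.]])   # row i is the code used by worker W_{i+1}

    assert np.allclose(A @ B, np.ones((3, 3)))   # AB = 1_{3x3}, i.e. Eq. (3) with f = k = 3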
Also, since every row of A here has exactly one zero, we say that this scheme is robust to any one straggler. In general, we shall seek schemes, through the construction of (A, B), that are robust to any s stragglers.

The rest of this paper is organized as follows. In Section 3 we provide two schemes applicable to any number of workers n, under the assumption that stragglers can be arbitrarily slow, to the extent of total failure. In Section 4 we relax this assumption to the case of worker slowdown (with a known slowdown factor) instead of failure, and show how our constructions can be adapted to be more effective. Finally, in Section 5 we present results of empirical tests using our proposed distribution schemes on Amazon EC2.

3 Full Stragglers

In this section, we consider schemes robust to any s stragglers, given n workers (with s < n). We assume that any straggler is (what we call) a full straggler, i.e. it can be arbitrarily slow, to the extent of complete failure. We show how to construct the matrices A and B, with AB = 1, such that the scheme (A, B) is robust to any s full stragglers.

Consider any such scheme (A, B). Since every row of A represents a set of non-straggler workers, all possible sets over [n] of size (n − s) must appear as supports in the rows of A. Thus f = C(n, n − s) = C(n, s), where C(·, ·) denotes the binomial coefficient, i.e. the total number of failure scenarios is the number of ways to choose s stragglers out of n workers. Now, since each row of A represents a linear span over some rows of B, and since we require AB = 1, this leads us to the following condition on B:

Condition 1 (B-Span). Consider any scheme (A, B) robust to any s stragglers, given n workers (with s < n). Then we require that for every subset I ⊆ [n], |I| = n − s:

    1_{1×k} ∈ span{b_i | i ∈ I}    (7)

where span{·} is the span of vectors.

The B-Span condition above ensures that the all-1s vector lies in the span of any n − s rows of B. This is of course necessary. However, it is also sufficient: given a B satisfying Condition 1, we can construct an A such that AB = 1 and A has the support structure discussed above. The construction of A is described in Algorithm 1 (in MATLAB syntax), and we have the following lemma.

Lemma 1. Consider B ∈ R^{n×k} satisfying Condition 1 for some s < n. Then Algorithm 1, with input B and s, yields an A ∈ R^{C(n,s)×n} such that AB = 1_{C(n,s)×k}, and the scheme (A, B) is robust to any s full stragglers.

Based on Lemma 1, to obtain a scheme (A, B) robust to any s stragglers, we only need to furnish a B satisfying Condition 1. A trivial B that works is B = 1_{n×k}, the all-ones matrix. However, this is wasteful, since it requires every worker to access, and compute partial gradients on, all k partitions. Ideally, we want a B satisfying Condition 1 while also being as sparse as possible in each row.

Algorithm 1: Algorithm to compute A
    Input: B satisfying Condition 1, s (< n)
    Output: A such that AB = 1_{C(n,s)×k}
    f = nchoosek(n, s);
    A = zeros(f, n);
    r = 0;
    foreach I ⊆ [n] s.t. |I| = (n - s) do
        r = r + 1;
        x = ones(1, k) / B(I, :);    % solves x * B(I, :) = ones(1, k)
        A(r, I) = x;
    end
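For reference, here is the same construction in Python with numpy (a sketch: the function name and the use of a least-squares solve are our choices, not the authors' code):

    import numpy as np
    from itertools import combinations

    def construct_A(B, s):
        # One row of A per surviving set I: solve x * B[I, :] = ones(1, k).
        n, k = B.shape
        rows = []
        for I in map(list, combinations(range(n), n - s)):
            x, *_ = np.linalg.lstsq(B[I, :].T, np.ones(k), rcond=None)
            a = np.zeros(n)
            a[I] = x
            rows.append(a)
        return np.vstack(rows)   # shape: C(n, s) x n

    B = np.array([[0.5, 1, 0], [0, 1, -1], [0.5, 0, 1]])   # the Eq. (6) example
    A = construct_A(B, s=1)
    assert np.allclose(A @ B, np.ones((3, 3)))

For a B satisfying Condition 1, each least-squares solve recovers an exact solution, so every row of the returned A reproduces the full gradient from its surviving workers.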
In this regard, we have the following theorem, which gives a lower bound on the number of non-zeros in any row of B.

Theorem 1 (Lower bound on B's density). Consider any scheme (A, B) robust to any s stragglers, given n workers (with s < n) and k partitions. Then, if all rows of B have the same number of non-zeros, we must have ‖b_i‖₀ ≥ (k/n)(s + 1) for any i ∈ [n].

Theorem 1 implies that any scheme (A, B) that assigns the same amount of data to all the workers must assign at least an (s + 1)/n fraction of the data to each worker. Since this fraction is independent of k, for the remainder of this paper we shall assume that k = n, i.e. the number of partitions is the same as the number of workers. In this case, we want B to be a square matrix satisfying Condition 1, with each row having at least (s + 1) non-zeros. In the sequel, we demonstrate two constructions for B that satisfy Condition 1 and achieve this density lower bound.

3.1 Fractional Repetition Scheme

In this section, we provide a construction for B that works by replicating the task done by a subset of the workers. We note that this construction is only applicable when the number of workers, n, is a multiple of (s + 1), where s is the number of stragglers we seek tolerance to. In this case, the construction is as follows:

• We divide the n workers into (s + 1) groups of size n/(s + 1).
• In each group, we divide all the data equally and disjointly, assigning (s + 1) partitions to each worker.
• All the groups are replicas of each other.
• When finished computing, every worker transmits the sum of its partial gradients.
Figure 3: Fractional Repetition Scheme for n = 6, s = 2.

Fig. 3 shows an instance of the above construction for n = 6, s = 2: three groups of two workers, where each worker is assigned three of the six partitions D₁, ..., D₆. A general description of a B constructed in this way (denoted B_frac) is shown in Eq. (9). Each group of workers in this scheme can be described by a block matrix B_block(n, s) ∈ R^{(n/(s+1))×n}. We define:

    B_block(n, s) = [ 1_{1×(s+1)}   0_{1×(s+1)}   ···   0_{1×(s+1)} ]
                    [ 0_{1×(s+1)}   1_{1×(s+1)}   ···   0_{1×(s+1)} ]
                    [      ⋮              ⋮         ⋱        ⋮      ]
                    [ 0_{1×(s+1)}   0_{1×(s+1)}   ···   1_{1×(s+1)} ]    (8)

Thus, the first worker in the group gets the first (s + 1) partitions, the second worker gets the second (s + 1) partitions, and so on. Then, B_frac is simply (s + 1) replicated copies of B_block(n, s):

    B = B_frac = [ B_block^(1); B_block^(2); ...; B_block^(s+1) ] ∈ R^{n×n}    (9)

where B_block^(t) = B_block(n, s) for each t ∈ {1, ..., s + 1}. It is easy to see that this construction yields robustness to any s stragglers: since any particular partition of the data is replicated over (s + 1) workers, any s stragglers leave at least one non-straggler worker to process it.
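In Python, B_frac is a one-liner via the Kronecker product (our sketch; b_frac is an illustrative name):

    import numpy as np

    def b_frac(n, s):
        # B_frac of Eq. (9): (s+1) stacked copies of the block in Eq. (8).
        assert n % (s + 1) == 0, "fractional repetition needs (s+1) | n"
        block = np.kron(np.eye(n // (s + 1)), np.ones((1, s + 1)))  # one group
        return np.vstack([block] * (s + 1))

    print(b_frac(6, 2))   # the n = 6, s = 2 instance of Figure 3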
We have the following theorem.

Theorem 2. Consider B_frac constructed as in Eq. (9), for a given number of workers n and stragglers s (< n). Then B_frac satisfies the B-Span condition (Condition 1). Consequently, the scheme (A, B_frac), with A constructed using Algorithm 1, is robust to any s stragglers.

The construction of B_frac matches the density lower bound in Theorem 1, and the above theorem shows that the scheme (A, B_frac), with A constructed from Algorithm 1, is robust to s stragglers.

3.2 Cyclic Repetition Scheme

In this section we provide an alternate construction for B which also matches the lower bound in Theorem 1 and satisfies Condition 1. However, in contrast to the construction in the previous section, this construction does not require n to be divisible by (s + 1). Here, instead of assigning disjoint collections of partitions, we consider a cyclic assignment of (s + 1) partitions to the workers. We construct a B = B_cyc with the following support structure:

    supp(B_cyc) = [ ⋆ ⋆ ··· ⋆ 0 0 ··· 0 ]
                  [ 0 ⋆ ⋆ ··· ⋆ 0 ··· 0 ]
                  [ ⋮          ⋱      ⋮ ]
                  [ ⋆ ··· ⋆ 0 ··· 0 ⋆ ⋆ ]    (n × n)    (10)

where ⋆ indicates a non-zero entry of B_cyc and each row has (s + 1) of them. So, the first row of B_cyc has its first (s + 1) entries non-zero. As we move down the rows, the positions of the (s + 1) non-zero entries shift one step to the right, cycling around until the last row.

Given the support structure in Eq. (10), the actual non-zero entries must be carefully assigned in order to satisfy Condition 1. The basic idea is to pick every row of B_cyc, with its particular support, to lie in a suitable subspace S that contains the all-ones vector 1_{n×1}. We consider an (n − s)-dimensional subspace S = {x ∈ R^n | Hx = 0, H ∈ R^{s×n}}, i.e. the null space of the matrix H, for some H satisfying H1 = 0. Now, to make the rows of B_cyc lie in S, we require that the null space of H contain vectors with all the different supports in Eq. (10). This turns out to be equivalent to requiring that any s columns of H be linearly independent, a property also referred to as the MDS property in coding theory. We show that a random choice of H suffices for this, and that we are thereby able to construct a B_cyc with the support structure in Eq. (10). Moreover, for any (n − s) rows of B_cyc, we show that their linear span also contains 1_{n×1}, thereby satisfying Condition 1. Algorithm 2 describes the construction of B_cyc (in MATLAB syntax).

Algorithm 2: Algorithm to construct B = B_cyc
    Input: n, s (< n)
    Output: B ∈ R^{n×n} with (s + 1) non-zeros in each row
    H = randn(s, n);
    H(:, n) = -sum(H(:, 1:n-1), 2);
    B = zeros(n);
    for i = 1:n do
        j = mod(i-1 : s+i-1, n) + 1;
        B(i, j) = [1; -H(:, j(2:s+1)) \ H(:, j(1))];
    end
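A Python rendering of Algorithm 2 (our sketch, mirroring the MATLAB pseudocode above):

    import numpy as np

    def b_cyc(n, s, rng=None):
        rng = rng or np.random.default_rng(0)
        H = rng.standard_normal((s, n))
        H[:, -1] = -H[:, :-1].sum(axis=1)       # forces H @ ones(n) = 0
        B = np.zeros((n, n))
        for i in range(n):
            j = np.arange(i, i + s + 1) % n      # cyclic support of row i (size s+1)
            B[i, j[0]] = 1.0
            # choose the remaining entries so that H @ b_i = 0 (row lies in S)
            B[i, j[1:]] = np.linalg.solve(H[:, j[1:]], -H[:, j[0]])
        assert np.allclose(H @ B.T, 0, atol=1e-8)
        return B

    B = b_cyc(5, 2)   # works even though n = 5 is not divisible by s + 1 = 3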
We then have the following theorem.

Theorem 3. Consider B_cyc constructed using the randomized construction in Algorithm 2, for a given number of workers n and stragglers s (< n). Then, with probability 1, B_cyc satisfies the B-Span condition (Condition 1). Consequently, the scheme (A, B_cyc), with A constructed using Algorithm 1, is robust to any s stragglers.

4 Partial Stragglers

In this section, we revisit our earlier assumption of full stragglers. Under a full straggler assumption, Theorem 1 shows that any non-straggler worker must incur an (s + 1)-factor overhead in computation if we want tolerance to any s stragglers. This may be prohibitively large in many situations. One way to mitigate this is to allow at least some work to be done by the straggling workers as well. Therefore, in this section, we consider a more plausible scenario of slow workers, with a known slowdown factor. We say that a straggler is an α-partial straggler (with α > 1) if it is at most α times slower than any non-straggler. This means that if a non-straggler completes a task in time T, an α-partial straggler requires at most αT time to complete it. We now augment our previous schemes (from Sections 3.1 and 3.2) to be robust to any s stragglers, assuming that every straggler is an α-partial straggler.

Note that our earlier constructions are still applicable: a scheme (A, B), with B = B_frac or B = B_cyc, would still provide robustness to s partial stragglers. However, given that no machine is slower than a factor of α, a more efficient scheme is possible by exploiting at least some computation on every machine. Our basic idea is to couple our earlier schemes with a naive distribution scheme, but on different parts of the data. We split the data into a naive component and a coded component. The key is to do the split such that whenever an α-partial straggler is done processing its naive partitions, a non-straggler is done processing both its naive and its coded partitions.

Figure 4: Scheme for Partial Stragglers, n = 3, s = 1, α = 2. g(·) represents the partial gradient. Each worker W_i holds two naive partitions and two coded partitions; it sends the naive sum n_i (e.g. n₁ = g(N₁) + g(N₂)) and a coded combination b_i, with b₁ = g(C₁)/2 + g(C₂), b₂ = g(C₂) − g(C₃), b₃ = g(C₁)/2 + g(C₃).

In general, for any (n, s, α), our two-stage scheme works as follows:

• We split the data D into n + n(s+1)/(α−1) equal-sized partitions, of which n partitions are coded components and the rest are naive components.
• Each worker gets (s+1)/(α−1) naive partitions, distributed disjointly.
• Each worker gets (s + 1) coded partitions, distributed according to an (A, B) distribution scheme robust to s stragglers (e.g. with B = B_frac or B = B_cyc).
• Any worker W_i first processes all its naive partitions and sends the sum of their gradients to the aggregator. It then processes its coded partitions and sends a linear combination, as per the (A, B) distribution scheme.

Note that each worker now has to send two partial gradients (instead of one, as in the earlier schemes). However, the speedup gained from processing a smaller fraction of the data may mitigate this overhead in communication, since each non-straggler only has to process an (s+1)/n · α/(s+α) fraction of the data, as opposed to an (s+1)/n fraction in the full straggler schemes. Thus, when computation is the bottleneck, adopting a partial straggler scheme may not hurt the overall efficiency. On the other hand, when communication is the bottleneck (and if a 2× overhead is prohibitive), a full straggler scheme may be a better choice, even with its (s+1)-factor overhead in computation for the non-straggler workers.

Fig. 4 illustrates our two-stage strategy for n = 3, s = 1, α = 2. We see that each non-straggler gets a 4/9 ≈ 0.44 fraction of the data, instead of a 2/3 ≈ 0.67 fraction (as, e.g., in Fig. 1b).
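The bookkeeping behind these fractions can be checked mechanically; the snippet below (ours) verifies the 4/9 figure for (n, s, α) = (3, 1, 2) with exact rational arithmetic:

    from fractions import Fraction

    def nonstraggler_fraction(n, s, alpha):
        t = Fraction(s + 1, alpha - 1)     # naive partitions per worker
        total = n + n * t                  # total number of equal-sized partitions
        return (t + (s + 1)) / total       # naive + coded work of a non-straggler

    assert nonstraggler_fraction(3, 1, 2) == Fraction(4, 9)
    # ...and it matches the closed form (s+1)/n * alpha/(s+alpha):
    assert Fraction(2, 3) * Fraction(2, 3) == Fraction(4, 9)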
5 Experiments

In this section, we present experimental results on Amazon EC2, comparing our proposed gradient coding schemes with baseline approaches. We compare our approaches against: (1) the naive scheme, where the data is divided uniformly across all workers without replication and the aggregator waits for all workers to send their gradients, and (2) the ignoring s stragglers scheme, where the data is divided as in the naive scheme but the aggregator performs an update step after any n − s workers have successfully sent their gradients.

We implemented all methods in python using MPI4py (Dalcin et al., 2011), an open source MPI implementation. Based on the method being considered, each worker loads a certain number of partitions of the data into memory before starting the iterations. In iteration t, the aggregator sends the latest model β^(t) to all the workers (using Isend()). Each worker receives the model (using Irecv()) and starts a gradient computation. Once finished, it sends its gradient(s) back to the aggregator. When sufficiently many workers have returned their gradients, the aggregator computes the overall gradient, performs a descent step, and moves on to the next iteration.
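The loop just described has roughly the following shape in mpi4py. This is a stripped-down sketch of the structure only (decoding and the descent step are elided), not the authors' implementation:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()   # rank 0 is the aggregator
    p = 100                                         # model dimension (toy value)
    beta = np.zeros(p)

    for t in range(100):
        if rank == 0:
            reqs = [comm.Isend(beta, dest=w) for w in range(1, size)]
            MPI.Request.Waitall(reqs)
            grads = np.empty((size - 1, p))
            for w in range(1, size):                # a coded scheme would stop at
                comm.Recv(grads[w - 1], source=w)   # the first n - s responders
            # decode with the matching row a_i of A, then take a descent step
        else:
            comm.Irecv(beta, source=0).Wait()
            coded = np.zeros(p)                     # would be b_i . g-bar on real data
            comm.Send(coded, dest=0)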
Figure 5: Empirical running times on Amazon EC2 with n = 12 machines, for (a) s = 1 and (b) s = 2 stragglers, comparing the Naive, Cyclic, Fractional, and Partial Fractional (for several values of α) schemes under artificial straggler delays ranging from 0 to 6 seconds. In this experiment, the stragglers are artificially delayed while the other machines run at normal speed. We note that the partial straggler schemes have much lower data replication.

Our workers were m1.small and t2.micro instances: these are very small, very low-cost EC2 instances. We also observed that our system was often bottlenecked by the number of incoming connections, i.e. all workers trying to talk to the master concurrently. For that reason, and to mitigate this additional overhead to some degree, we used a larger master instance of c3.8xlarge in our experiments.

We ran the various approaches to train logistic regression models, a well-understood convex problem that is widely used in practice. Moreover, logistic regression models are often expanded by including interaction terms that are one-hot encoded for categorical features. This can lead to hundreds of thousands of parameters (or more) in the trained models. To train the logistic regression models using our proposed schemes (or the naive scheme), we used Nesterov's Accelerated Gradient descent with a constant learning rate, where the constant was chosen optimally from a range. Note that other optimizers, such as LBFGS, would also have been applicable here, since we obtain the full gradient in our schemes.

Figure 6: Avg. time per iteration on the Amazon Employee Access dataset, on n = 10, 20, and 30 t2.micro workers, comparing Naive, FracRep, CycRep, and Ignoring s stragglers for various values of s.
Figure 7: AUC vs. time on the Amazon Employee Access dataset, on n = 10, 20, and 30 t2.micro workers. The two proposed methods, FracRep and CycRep, are compared against the frequently used approach of ignoring s stragglers. As can be seen, gradient coding achieves significantly better generalization error on a true holdout.

For the ignoring s stragglers approach, we used gradient descent with a learning rate of c₁/(t + c₂) (which is typical for SGD), where c₁ and c₂ were also chosen optimally from a range. We did not use NAG here, since it is unstable with noisy gradients. While we do not present the corresponding empirical results, we refer the reader to Devolder et al. (2014) for a theoretical and empirical analysis of the effect of noisy gradients on NAG. Thus, another advantage of our schemes over ignoring s stragglers is that the latter cannot be combined with NAG, because errors may quickly accumulate and eventually cause the method to diverge.
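For concreteness, one standard way to write the NAG update with full gradients is sketched below; this is a generic textbook form (the paper does not list its exact update code), with illustrative constants:

    import numpy as np

    def nag_step(beta, beta_prev, full_grad_at, lr=0.05, mu=0.9):
        # Lookahead point y is what the aggregator broadcasts to the workers;
        # full_grad_at(y) is the decoded full gradient at y.
        y = beta + mu * (beta - beta_prev)
        return y - lr * full_grad_at(y), beta   # (new beta, previous beta)

Each call consumes the exact full gradient recovered by the coding scheme, which is precisely what makes the accelerated method usable here.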
Artificial Dataset: In our first experiment, we solved a logistic regression problem on an artificially generated dataset. We generated a dataset of d = 554400 samples D = {(x₁, y₁), ..., (x_d, y_d)} using a mixture of the two Gaussians N(µ₁, I) and N(µ₂, I) (for random µ₁, µ₂ ∈ R^p), with labels y ∼ Ber(κ), where κ = 1/(exp(2x^T β*) + 1) and β* ∈ R^p is the true regressor. In our experiments, we used a model dimension of p = 100 and chose β* randomly.

In this experiment, we also artificially added delays to s random workers in each iteration (using time.sleep()). Figure 5 presents the results of our experiments with s = 1 and s = 2 stragglers, on a cluster of n = 12 m1.small machines. As expected, the baseline naive scheme that waits for the stragglers has poorer performance as the delay increases. The Cyclic and Fractional schemes were designed for one straggler in Figure 5a and for two stragglers in Figure 5b. Therefore, we expect these two schemes to be essentially uninfluenced by the delay of the stragglers (up to some variance due to implementation overheads). The partial straggler schemes were designed for various α; recall that for partial straggler schemes, α denotes the slowdown factor.
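A numpy sketch of this data generation is below. The mixture weights are garbled in our source, so the 50/50 split is an assumption, as are the RNG details:

    import numpy as np

    rng = np.random.default_rng(0)
    d, p = 554400, 100
    beta_star = rng.standard_normal(p)               # true regressor, chosen randomly
    mu1, mu2 = rng.standard_normal(p), rng.standard_normal(p)

    pick = rng.random(d) < 0.5                       # ASSUMED equal mixture weights
    X = np.where(pick[:, None], mu1, mu2) + rng.standard_normal((d, p))
    kappa = 1.0 / (np.exp(2.0 * X @ beta_star) + 1.0)  # kappa = 1/(exp(2 x'b*) + 1)
    y = (rng.random(d) < kappa).astype(int)          # y ~ Ber(kappa)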
Real Dataset: Next, we trained a logistic regression model on the Amazon Employee Access dataset from Kaggle. We used d = 26200 training samples and a model dimension of p = 241915 (after one-hot encoding with interaction terms). These experiments were run on n = 10, 20, 30 t2.micro instances on Amazon EC2.

In Figure 7 we show the generalization AUC of our methods (FracRep and CycRep) versus ignoring s stragglers (IgnoreStragg). As can be seen, gradient coding achieves significantly better generalization error. We emphasize that the results in Figures 6 and 7 do not use any artificial straggling, only the natural delays introduced by the EC2 cluster.

How is this stark difference possible? When stragglers are ignored, we are, at best, receiving a stochastic gradient (when random machines straggle in each iteration). As alluded to earlier, in this case the best we can do as an optimization algorithm is to run gradient descent, as it is robust to noise. When using gradient coding, however, we can retrieve the full gradient, which gives us access to faster optimization algorithms: in Figure 7 we used Nesterov's Accelerated Gradient (NAG).

Another advantage of using full gradients is that we can guarantee that we are training on the same distribution as the one the training set was drawn from. This is not true for the approach that ignores stragglers. If a particular machine is more likely to be a straggler, samples on that machine will likely be underrepresented in the final model, unless particular countermeasures are deployed. There may even be inherent reasons why a particular sample will systematically be excluded when we ignore stragglers. For example, in structured models such as linear-chain CRFs, the computation of the gradient is proportional to the length of the sequence; therefore, extraordinarily long examples can be ignored very frequently.

6 Conclusion

In this paper, we have experimented with various gradient coding ideas on Amazon EC2 instances. This is a complex trade-off space between model sizes, number of samples, worker configurations, and number of workers. Our proposed schemes create computation overheads while keeping communication the same. The benefit of this additional computation is fault tolerance: we are able to recover full gradients even if s machines do not deliver their assigned work, or are slow in doing so. Moreover, our partial straggler schemes provide fault tolerance while allowing all machines to do partial work; they do, however, require an extra round of communication. An interesting open problem is whether partial work on all machines is possible without this extra round of communication. Another open question under our framework is that of approximate gradient coding: can we obtain a vector that is close to the true gradient with smaller computation overheads? Ignoring stragglers does give an approximate gradient in a sense; however, is it possible to have a better approximation with only a small computation overhead (relative to gradient coding)?

Acknowledgements
This research has been supported by NSF Grants CCF 1344364, 1407278, 1422549, 1618689 and ARO YIP W911NF-14-1-0258.

References
Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. (2016). Revisiting distributed synchronous SGD. ArXiv e-prints.

Dalcin, L. D., Paz, R. R., Kler, P. A., and Cosimo, A. (2011). Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139. New Computational Methods and Software Tools.

Dau, S. H., Song, W., Dong, Z., and Yuen, C. (2013). Balanced sparsest generator matrices for MDS codes. In IEEE International Symposium on Information Theory (ISIT), pages 1889–1893.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25.

Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75.

Dutta, S., Cadambe, V., and Grover, P. (2016). Short-Dot: Computing large linear transforms distributedly using coded short dot products. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 2100–2108. Curran Associates, Inc.

Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. (2013). More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems.

Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D. S., and Ramchandran, K. (2015). Speeding up distributed machine learning using codes. CoRR, abs/1512.02673.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014). Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27.

Li, S., Maddah-Ali, M. A., and Avestimehr, A. S. (2015). Coded MapReduce. In Allerton Conference on Communication, Control, and Computing, pages 964–971.

Li, S., Maddah-Ali, M. A., and Avestimehr, A. S. (2016a). A unified coding framework for distributed computing with straggling servers. CoRR, abs/1609.01690.

Li, S., Maddah-Ali, M. A., Yu, Q., and Avestimehr, A. S. (2016b). A fundamental tradeoff between computation and communication in distributed computing. CoRR, abs/1604.07086.

Mitliagkas, I., Zhang, C., Hadjis, S., and Ré, C. (2016). Asynchrony begets momentum, with an application to deep learning. CoRR, abs/1605.09774.

Narayanamurthy, S., Weimer, M., Mahajan, D., Condie, T., Sellamanickam, S., and Keerthi, S. S. (2013). Towards resource-elastic machine learning.

Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. H., and Stoica, I. (2008). Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, page 7.
Appendix - Proofs
Proof of Lemma 1

By Condition 1, we know that for any I ⊆ [n], |I| = n − s, we have 1_{1×k} ∈ span{b_i | i ∈ I}. In other words, there exists at least one x ∈ R^{1×(n−s)} such that:

    x B(I, :) = 1_{1×k}    (11)

Therefore, by construction, we have AB = 1_{C(n,s)×k}, and the scheme (A, B) is robust to any s stragglers.

Proof of Theorem 1

Consider any scheme (A, B) robust to any s stragglers, with B ∈ R^{n×k}. Now, construct a bipartite graph between the n workers, {W₁, ..., W_n}, and the k partitions, {P₁, ..., P_k}, where we add an edge (i, j) if worker W_i has access to partition P_j. In other words, for any i ∈ [n], j ∈ [k]:

    e_{ij} = 1 if B(i, j) ≠ 0, and 0 otherwise    (12)

Now, it is easy to see that the degree of the i-th worker W_i is ‖b_i‖₀. Also, for any partition P_j, its degree must be at least (s + 1): if its degree were s or less, consider the scenario where all its neighbors are stragglers; in this case, there is no non-straggler worker with access to P_j, which contradicts robustness to any s stragglers.

Based on the above discussion, and using the fact that the sum of the degrees of the workers in the bipartite graph must equal the sum of the degrees of the partitions, we get:

    Σ_{i=1}^{n} ‖b_i‖₀ ≥ k(s + 1)    (13)

Since we assume all workers get access to the same number of partitions, this gives:

    ‖b_i‖₀ ≥ k(s + 1)/n, for any i ∈ [n]    (14)

Proof of Theorem 2

Consider groups of partitions {G₁, ..., G_{n/(s+1)}} as follows:

    G₁ = {P₁, ..., P_{s+1}}
    G₂ = {P_{s+2}, ..., P_{2(s+1)}}    (15)
    ⋮
    G_{n/(s+1)} = {P_{n−s}, ..., P_n}    (16)

Fix some set I ⊆ [n], |I| = n − s. Based on our construction, it is easy to observe that for any group G_j there exists some index in I, say i_{G_j} ∈ I, such that the corresponding row of B, b_{i_{G_j}}, has all 1s at the partitions in G_j and 0s elsewhere. This is because there are (s + 1) rows of B that correspond in this way to G_j (one in each block B_block), and so at least one survives in the set I of cardinality (n − s). Now, it is trivial to see that:

    1 ∈ span{b_{i_{G_j}} | j = 1, ..., n/(s + 1)}    (17)

Also, since

    span{b_{i_{G_j}} | j = 1, ..., n/(s + 1)} ⊆ span{b_i | i ∈ I},    (18)

we have 1 ∈ span{b_i | i ∈ I}. Finally, since the above holds for any set I, we get that B satisfies Condition 1. The remainder of the theorem follows from Lemma 1.

Proof of Theorem 3

Consider the subspace given by the null space of the random matrix H (constructed in Algorithm 2):

    S = {x ∈ R^n | Hx = 0}    (19)

Note that H has (n − 1)s distinct random values (s for each column), since its last column is simply the negative sum of its previous (n − 1) columns. Now, we have the following lemma, listing some properties of H and S.
Lemma 2. Consider H ∈ R^{s×n} as constructed in Algorithm 2, and the subspace S as defined in Eq. (19). Then, the following hold:
• Any s columns of H are linearly independent with probability 1.
• dim(S) = n − s with probability 1.
• 1 ∈ S, where 1 is the all-ones vector.

For i ∈ [n], let S_i denote the set S_i = {i mod n, (i + 1) mod n, ..., (i + s) mod n}. Then S_i corresponds to the support of the i-th row of B in our construction, as also given by the support structure in Eq. (10). Recall that we denote the i-th row of B by b_i. By our construction, we have:

    b_i(i) = 1,    b_i(S_i \ {i}) = −H_{S_i\{i}}^{−1} H_i    (20)

Now, we have the following lemma.

Lemma 3. Consider the i-th row of B constructed using Algorithm 2 (also shown in Eq. (20)). Then:
• b_i ∈ S.
• Every element of b_i(S_i \ {i}) is non-zero with probability 1.
• For any subset I ⊆ [n], |I| = n − s, the set of vectors {b_i | i ∈ I} is linearly independent with probability 1.

Now, using Lemma 3, we can conclude that for any subset I ⊆ [n], |I| = n − s, we have dim(span{b_i | i ∈ I}) = n − s and span{b_i | i ∈ I} ⊆ S. Consequently, from Lemma 2, since dim(S) = n − s and 1 ∈ S, this implies that:

    span{b_i | i ∈ I} = S with probability 1,    (21)

and 1 ∈ span{b_i | i ∈ I}. Taking a union bound over every I shows that B satisfies Condition 1. The remainder of the theorem follows from Lemma 1.

Proof of Lemma 2

Consider any subset I ⊆ [n], |I| = s, such that n ∉ I. Then all the elements of H_I are independent, and det(H_I) is a polynomial in the elements of H_I. Consequently, since every element is drawn from a continuous probability distribution (in particular, Gaussian), the set {H_I | det(H_I) = 0} is a zero measure set. So P(det(H_I) ≠ 0) = 1, and thus the columns of H_I are linearly independent with probability 1.

If n ∈ I, then we have:

    det(H_I) = det(H̃)    (22)

where we let H̃ = [H_{I\{n}}, −Σ_{i ∈ [n]\I} H_i]. The elements of H̃ are independent, so using the same argument as above, we again have P(det(H_I) = det(H̃) ≠ 0) = 1. Finally, taking a union bound over all sets I of cardinality s shows that any s columns of H are linearly independent with probability 1.

Since any s columns of H are linearly independent, we have rank(H) = s. Since the subspace S is simply the null space of H, we have dim(S) = n − s. Finally, since H_n = −Σ_{i ∈ [n−1]} H_i (by construction), we have H1 = 0 and thus 1 ∈ S.

Proof of Lemma 3

By the construction of b_i, we have:

    H b_i = H_i + H_{S_i\{i}} b_i(S_i \ {i}) = H_i − H_i = 0    (23)

Thus b_i ∈ S.

Now, suppose for contradiction that b_i(k) = 0 for some k ∈ S_i \ {i}. Then, since b_i ∈ S, we have:

    H b_i = H_i + H_{S_i\{i,k}} b_i(S_i \ {i, k}) = 0    (24)

Consequently, the set of columns {H_j | j ∈ S_i \ {i, k}} ∪ {H_i} is linearly dependent, which contradicts the fact that any s columns of H are linearly independent (Lemma 2). Therefore, every element of b_i(S_i \ {i}) must be non-zero.

Now, consider any subset I ⊆ [n], |I| = n − s. We shall show that the matrix B_I (corresponding to the rows of B with indices in I) has rank n − s with probability 1; consequently, the set of vectors {b_i | i ∈ I} is linearly independent. To show this, we consider some n − s columns of B_I, say given by the set J ⊆ [n], |J| = n − s, and denote the corresponding sub-matrix of columns by B_{I,J}. Then, it suffices to show that det(B_{I,J}) ≠ 0. Now, by the construction in Algorithm 2, we have det(B_{I,J}) = poly₁(H)/poly₂(H), for some polynomials poly₁(·) and poly₂(·) in the entries of H. Therefore, if we can show that there exists at least one H′ with H′1 = 0 and poly₁(H′)/poly₂(H′) ≠ 0, then under a choice of i.i.d. standard Gaussian entries of H, we would have:

    P(poly₁(H)/poly₂(H) ≠ 0) = 1    (25)

The remainder of this proof is dedicated to showing that such an H′ exists.
To show this, we shall consider a matrix B̃ ∈ R^{(n−s)×n} such that supp(B̃) = supp(B_I) and det(B̃_{:,J}) ≠ 0, where B̃_{:,J} corresponds to the sub-matrix of B̃ with columns in the set J. Given such a B̃, we shall show that there exists an s × n matrix H′ (with H′1 = 0) such that when we run Algorithm 2 with this H′, we get a matrix B′ s.t. B′_I = B̃, i.e. the output matrix from Algorithm 2 is identical to our random choice B̃ on the rows in the set I. This suffices to show the existence of an H′ such that poly₁(H′)/poly₂(H′) ≠ 0, since poly₁(H′)/poly₂(H′) = det(B′_{I,J}) = det(B̃_{:,J}) ≠ 0.

Let us pick a random matrix B̃ as:

    B̃ = B_I^r D    (26)

where B_I^r is a matrix with the same support as B_I and with each non-zero entry i.i.d. standard Gaussian, and D is a diagonal matrix with D_ii = 1/Σ_{j=1}^{n−s} B_I^r(j, i), i ∈ [n]. Note that a consequence of the above choice of B̃ is that the sum of all its rows is the all-1s vector. Now, it can be shown that any (n − s) columns of B̃ form an invertible sub-matrix with probability 1. Let S_i be the support of the i-th row of B. The rows of B_I^r have the supports S_i, i ∈ I. Now, because of the cyclic support structure in B, any collection {i₁, i₂, ..., i_k} (0 ≤ k ≤ n − s) satisfies the property:

    |∪_{j=1}^{k} S_{i_j}| ≥ s + k    (27)

Using Lemma 4 in Dau et al. (2013), this implies that there is a perfect matching between the rows of B_I^r and any of its (n − s) columns. Consequently, with probability 1, any (n − s) columns of B_I^r form an invertible sub-matrix. Also, since every column of B_I^r contains at least one non-zero (again owing to the support structure of B), this implies that with probability 1, all the diagonal entries of D are non-zero. Combining the above two observations, we can infer that any (n − s) columns of B̃ form an invertible sub-matrix with probability 1.

So far, we have shown the existence of a matrix B̃ with the following properties: (i) B̃ has the same support structure as B_I, (ii) any (n − s) columns of B̃ form an invertible sub-matrix, and (iii) the sum of all rows of B̃ is the all-1s vector. Now, for any such B̃, we shall show that there exists an H′ such that H′B̃^T = 0 and any s columns of H′ form an invertible sub-matrix. This implies that when we run Algorithm 2 with this H′, the output matrix is the same as B̃ on the rows in the set I. The remainder of the proof then follows from our earlier discussion.

Now, consider any set Q ⊆ [n], |Q| = s. Suppose we pick any invertible H′_{:,Q}, and set H′_{:,[n]\Q} = −H′_{:,Q} B̃_{:,Q}^T (B̃_{:,[n]\Q}^T)^{−1}. Then such an H′ satisfies H′B̃^T = 0, and its columns in the set Q form an invertible sub-matrix. Now, since invertibility on the set Q simply corresponds to det(H′_{:,Q}) ≠ 0 (i.e. some fixed polynomial being non-zero), if we actually picked a uniformly random H′ on the subspace H′B̃^T = 0, then

    P(det(H′_{:,Q}) ≠ 0 | H′B̃^T = 0) = 1    (28)

Taking a union bound over all Q, we get:

    P(any s columns of H′ form an invertible sub-matrix | H′B̃^T = 0) = 1    (29)

Thus, there exists an H′ satisfying H′B̃^T = 0 with any s of its columns forming an invertible sub-matrix. Also, since the sum of all rows of B̃ is 1, this implies H′1 = 0.