Coded Distributed Computing with Partial Recovery
Emre Ozfatura, Sennur Ulukus, and Deniz Gündüz
Abstract
Coded computation techniques provide robustness against straggling workers in distributed computing. However, most of the existing schemes require exact provisioning of the straggling behaviour and ignore the computations carried out by straggling workers. Moreover, these schemes are typically designed to recover the desired computation results accurately, while in many machine learning and iterative optimization algorithms, faster approximate solutions are known to result in an improvement in the overall convergence time. In this paper, we first introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR), which benefits from the advantages of both coded and uncoded computation schemes, and reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation. We then extend this approach to the distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery, where the results of subtasks computed by the workers are coded before being communicated. Numerical simulations on a large linear regression task confirm the benefits of the proposed distributed computation scheme with partial recovery in terms of the trade-off between the computation accuracy and latency.

This paper was presented in part at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK. Emre Ozfatura and Deniz Gündüz are with the Information Processing and Communications Lab, Department of Electrical and Electronic Engineering, Imperial College London. Email: {m.ozfatura, d.gunduz}@imperial.ac.uk. Sennur Ulukus is with the Department of Electrical and Computer Engineering, University of Maryland. This work was supported in part by the Marie Sklodowska-Curie Action SCAVENGE (grant agreement no. 675891), and by the European Research Council (ERC) Starting Grant BEACON (grant agreement no. 677854).
Index Terms
Coded computation, distributed computation, maximum distance separable (MDS) code, linear codes, rateless codes, stragglers.
I. INTRODUCTION
One of the key enablers of efficient machine learning solutions is the availability of large datasets. However, the ever-growing size of the datasets and the complexity of the models trained on them lead also to an increase in the computational complexity and storage requirements of the algorithms employed. In parallel, there is a growing availability of cloud computing platforms (such as Amazon Web Services, Microsoft Azure and Google Cloud Functions) that offer computational resources to users to carry out demanding computation tasks. The associated distributed computation framework allows harnessing the computation and memory resources of multiple heterogeneous computation servers, referred to as workers.

In the most common implementation of distributed computation, a parameter server (PS) divides the main computational task into several subtasks and assigns them to workers. Each worker executes the computation tasks assigned to it, and conveys the result to the PS. Having received the results from all the workers, the PS combines them to obtain the result of the main computation task. In principle, such a distributed computation framework should achieve a speed-up factor proportional to the number of workers employed. However, in real implementations, the overall computation time is constrained by the slowest, so-called straggling worker(s). Moreover, as the number of employed workers increases, communication becomes more complex and introduces additional delays, which can aggravate the straggler problem. To remedy the delays due to straggling workers, various straggler-tolerant distributed computation schemes have been introduced recently, which build upon the idea of assigning redundant computations/subtasks to workers, to let faster workers compensate for the stragglers [1]–[42].
A. Motivation
We will motivate the proposed distributed computation framework on a simple regression problem. In linear regression, the goal is to minimize the empirical mean squared error

L(θ) ≜ (1/N) Σ_{i=1}^{N} (y_i − x_i^T θ)^2,   (1)

where x_1, ..., x_N ∈ R^d are the data points with corresponding labels y_1, ..., y_N ∈ R, and θ ∈ R^d is the parameter vector. The optimal parameter vector can be obtained iteratively by gradient descent (GD), in which the parameter vector is updated as follows:

θ_{t+1} = θ_t − η_t ∇_θ L(θ_t),   (2)

where η_t is the learning rate at the t-th iteration. The gradient of the loss function in (1) can be written as

∇_θ L(θ_t) = X^T X θ_t − X^T y,   (3)

where X = [x_1, ..., x_N]^T and y = [y_1, ..., y_N]^T. In the gradient expression, only θ_t changes over iterations; hence, the key computational task at each iteration is the matrix-vector multiplication W θ_t, where W ≜ X^T X ∈ R^{d×d}.
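To make the iteration concrete, here is a minimal sketch of the update (2) with the gradient (3) (up to a constant factor absorbed into the learning rate), on synthetic data; all sizes, seeds, and the learning rate are illustrative assumptions rather than the paper's setup:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 2000, 40                    # dataset size and model dimension (illustrative)
    X = rng.standard_normal((N, d))
    theta_true = rng.uniform(0, 1, d)
    y = X @ theta_true + 0.1 * rng.standard_normal(N)

    W = X.T @ X                        # computed once; only theta changes across iterations
    b = X.T @ y
    theta = np.zeros(d)
    eta = 0.1 / N                      # learning rate (illustrative)
    for t in range(100):
        grad = W @ theta - b           # the key per-iteration task: the product W @ theta
        theta -= eta * grad
    print(np.linalg.norm(theta - theta_true))   # small residual: GD approaches the least-squares fit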
B. Coded Distributed Matrix-Vector Multiplication

The execution of W θ_t can be distributed across K workers by simply dividing W row-wise into K disjoint submatrices and assigning each submatrix to one of the workers. However, the computation time of this naive approach will be limited by the straggling worker(s). The main challenge in this setup arises because the straggling behaviour (due either to the computation speeds of the workers or to delays in communication) varies over time, and its realization at each iteration is not known in advance. The statistical knowledge of the computation and communication latency of each worker can be acquired over time, and used for a more efficient allocation of computation tasks (e.g., as in [28], [35], [36]) as well as for the design of the coding scheme employed; for the sake of simplicity, we assume homogeneous workers in this work.

Coded computation has been introduced to tolerate stragglers in matrix-vector multiplication by encoding the W matrix, and distributing the partitions of this encoded matrix among the workers, to achieve redundancy [15]–[27]. One well-known method to introduce redundancy in matrix-vector multiplication is to utilize maximum distance separable (MDS) codes to encode W [15]. To elucidate MDS-coded computation (MCC), we can divide W into K̄ disjoint submatrices, W_1, ..., W_K̄ ∈ R^{d̄×d}, d̄ = d/K̄, which are then encoded with a (K̄, K) MDS code. Each coded submatrix is assigned to a different worker, which multiplies it with θ_t, and returns the result to the PS. The PS can recover W θ_t from the results of any K̄ workers. Note that up to K − K̄ stragglers can be tolerated with MCC at the expense of increasing the computation load of each worker by r = K/K̄; that is, each worker is assigned r times more computation compared to the naive approach of equally dividing all the required computations among the workers. As alternatives to MDS codes [15], [16], [22], LDPC codes [17] and rateless codes [23] have also been studied for straggler-tolerant coded computation in the literature.
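As an illustration of MCC, the sketch below realizes a (K̄, K) = (2, 4) MDS code with a Vandermonde-type generator — one standard construction, assumed here for illustration rather than taken from [15] — and recovers Wθ from the two fastest workers:

    import numpy as np

    K, Kbar = 4, 2                      # workers and message submatrices (r = K/Kbar = 2)
    d = 8
    rng = np.random.default_rng(1)
    W = rng.standard_normal((d, d))
    theta = rng.standard_normal(d)

    W1, W2 = np.split(W, Kbar)          # row-wise split into Kbar submatrices
    G = np.array([[1.0, i] for i in range(K)])    # Vandermonde-type (Kbar, K) generator
    coded = [g[0] * W1 + g[1] * W2 for g in G]    # coded submatrix stored at each worker

    # The results of any Kbar = 2 workers suffice to recover W @ theta:
    done = [0, 3]                       # indices of the two fastest workers
    results = np.stack([coded[i] @ theta for i in done])   # what those workers return
    A = G[done]                         # mixing coefficients of the finished workers
    recovered = np.linalg.solve(A, results)                # invert the code
    assert np.allclose(np.concatenate(recovered), W @ theta)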
C. Computation-communication trade-off

Conventional straggler-aware designs assume that a single message is transmitted by each worker after completing its assigned computation task. Under this limitation, straggler-aware schemes require exact provisioning of the straggler behaviour, and otherwise suffer from over-computation and under-utilization [43]. To overcome these obstacles, one can allow each worker to send multiple messages to the PS at each iteration, which we refer to as multi-message communication (MMC) [1], [3], [13], [16], [19], [23], [24], [38]. However, MMC may introduce additional delays due to the communication overhead. Hence, with MMC the objective is to find an optimal operating point that balances the computation and communication latencies [43]. One recent approach for coded matrix-vector multiplication with MMC is to utilize rateless codes [23], due to their advantages of better utilizing the computational resources and their low decoding complexity at the PS. However, in practice, rateless codes reach the target coding rates only if the number of coded messages is sufficiently large, which may not be desirable in a distributed computation framework since it leads to congestion at the PS. Hence, the design of a code structure for distributed computation that can reduce the computation time without inducing an overwhelming communication overhead is an open challenge that we address in this paper.
D. Computation accuracy-computation speed trade-off
In the case of iterative optimization algorithms, we can consider the accuracy of the computations at each iteration as another dimension of the trade-off governing the overall convergence behaviour. For example, when applying gradient descent over large datasets, computation of the gradient in (3) at each iteration can become very costly. The most common alternative iterative optimization framework for large-scale learning problems is stochastic gradient descent (SGD), which uses an estimate of the gradient in (3) at each iteration, evaluated on a random subset of the dataset. Hence, by changing the size of the sampled dataset, it is possible to seek a balance between the accuracy of the gradient estimate and the computation time.

On the other hand, the vast majority of the coded computation schemes in the literature are designed for full gradient recovery. Nevertheless, a simple uncoded computation scheme with MMC [3], [19] can exploit partial computations performed by straggling workers, while also providing the PS a certain flexibility to terminate an iteration when a sufficient number of computations have been received. Accordingly, our goal here is to design a coded computing framework that can efficiently benefit from redundant computations with the flexibility of partial gradient computations. To this end, we introduce a novel hybrid scheme for distributed matrix-vector multiplication, called coded computation with partial recovery (CCPR), bringing together the advantages of uncoded computation, such as low decoding complexity and partial gradient updates, with those of coded computation, such as reduced per-iteration completion time and reduced communication load.

We also want to highlight that, in most of the coded computation schemes in the literature, encoding is executed at the PS in a centralized manner, and the coded submatrices are distributed to the workers; in our proposed strategy, however, as explained in Section IV, the encoding step can be executed in a decentralized manner. Such local encoding provides two key advantages: first, it is possible to dynamically change the codewords over time based on the realization of the straggler behaviour [44], which is particularly desired when the straggler behaviour is correlated over time; second, as further explained in Section V, it allows us to extend the CCPR strategy to more general distributed learning problems.
Fig. 1: Illustration of partial recovery in a naive distributed computation scenario with 6 workers, 2 of which are stragglers.
To the best of our knowledge, the partial recovery approach was first introduced in our preliminary study [26]; in this work, we extend that study and provide a more comprehensive analysis. Our contributions in this paper can be summarized as follows:

• We provide a general framework, and highlight certain design principles to efficiently employ partial recovery in a coded computation scenario, particularly with MMC.

• Based on these design principles, we introduce random circularly shifted (RCS) codes for distributed matrix-vector multiplication.

• We provide a generalization of RCS codes to the distributed implementation of more general computation tasks (beyond matrix-vector multiplication) by proposing a gradient coding scheme with partial computation.

• Through numerical experiments on a linear regression problem, we show that RCS codes outperform existing distributed learning schemes in the presence of straggling workers, and we also present the trade-offs between the update accuracy, communication latency and computation time achieved by these codes.

II. CODED COMPUTATION WITH PARTIAL RECOVERY
In conventional coded computation schemes, the PS waits until a sufficient number of computations are gathered from the workers to recover the correct result of the underlying computation task. In contrast, the partial recovery strategy does not necessarily aim at recovering the result accurately. In particular, for the matrix-vector multiplication task W θ, we will target recovering only a subset of the entries of the d-dimensional result vector. We refer to the fraction q of the entries of W θ that are allowed to remain unrecovered as the tolerance, which will be dictated by the underlying computation task.

For encoding, we utilize a general linear code structure. Matrix W is initially divided into K disjoint submatrices W_1, ..., W_K ∈ R^{d/K × d}. Then, r coded submatrices, W̃_{i,1}, ..., W̃_{i,r}, are assigned to each worker i for computation, where each coded submatrix W̃_{i,j} is a linear combination of the K submatrices, i.e.,

W̃_{i,j} = Σ_{k∈[K]} α^{(i)}_{j,k} W_k.   (4)

Following this initial encoding phase, the i-th worker performs the computations W̃_{i,1}θ, ..., W̃_{i,r}θ in the given order, and sends each result as soon as it is completed. We remark that, in the considered MMC scenario, the order of the assigned coded computations affects the completion time; therefore, we introduce the computation assignment matrix C to represent a coded computation strategy, i.e.,

C ≜ [ W̃_{1,1}  W̃_{2,1}  ...  W̃_{K,1}
      W̃_{1,2}  W̃_{2,2}  ...  W̃_{K,2}
        ...       ...     ...    ...
      W̃_{1,r}  W̃_{2,r}  ...  W̃_{K,r} ].

Here, C shows the computation tasks assigned to the workers together with their execution order: the i-th column of C lists the coded computations of worker i, and entry C(j, i) = W̃_{i,j} denotes the j-th computation task to be executed by the i-th worker. When partial recovery is allowed, the PS waits until (1 − q) × 100 percent of the entries of the result vector are successfully recovered, where the tolerance q is a design parameter.

In the scope of this paper, our aim is to highlight certain design principles for forming the coded submatrices W̃_{i,1}, ..., W̃_{i,r} of each worker i, in order to allow partial recovery with reduced iteration time. Let us first present a simple example to show how coded computation with partial recovery can improve upon other schemes, such as MDS coding or uncoded computation with MMC (UC-MMC).
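In code, the encoding rule (4) and the matrix C can be sketched as follows; the binary coefficients α are drawn at random purely for illustration, while Sections III and IV address how they should actually be chosen:

    import numpy as np

    K, r, d = 4, 2, 8                  # workers, tasks per worker, model dimension (illustrative)
    rng = np.random.default_rng(2)
    W = rng.standard_normal((d, d))
    theta = rng.standard_normal(d)
    Wk = np.split(W, K)                # K disjoint submatrices W_1, ..., W_K

    # alpha[i, j, k]: coefficient of W_k in the j-th coded task of worker i, as in (4)
    alpha = rng.integers(0, 2, size=(K, r, K)).astype(float)
    C = [[sum(alpha[i, j, k] * Wk[k] for k in range(K)) for j in range(r)]
         for i in range(K)]            # C[i][j] is the coded submatrix computed j-th by worker i

    # Each worker streams its results to the PS in the given execution order:
    partials = [[C[i][j] @ theta for j in range(r)] for i in range(K)]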
A. Motivating example

Consider K = 4 workers, and assume that W is divided into 4 submatrices W_1, ..., W_4. Let us first consider two known distributed computation schemes, namely UC-MMC [3], [19] and MDS-coded computation (MCC) [15]. Each scheme is defined by its computation assignment matrix.

In MDS-coded computation, linearly independent coded computation tasks are distributed to the workers as follows:

C_MDS = [ [W_1+W_2; W_3+W_4]  [W_1+2W_2; W_3+2W_4]  [W_1+3W_2; W_3+3W_4]  [W_1+4W_2; W_3+4W_4] ].

C_MDS consists of a single row of computation tasks, since each worker sends the results of its computations only after all of them are completed; e.g., the first worker sends the concatenation [(W_1+W_2)θ, (W_3+W_4)θ] after completing both computations. C_MDS above corresponds to a (2, 4) MDS code; hence, the PS can recover the full gradient from the results of any two workers.

In the UC-MMC scheme with cyclically shifted computation assignment [19], the computation scheduling matrix is given by

C_UC-MMC = [ W_1  W_2  W_3  W_4
             W_2  W_3  W_4  W_1 ],

and each worker sends the results of its computations sequentially, as soon as each of them is completed. This helps to reduce the per-iteration completion time at the expense of an increase in the communication load [3], [19]. With UC-MMC, the full gradient can be recovered even if each worker performs only one computation, which is faster when the workers have similar speeds.

The computation scheduling matrix of the proposed CCPR scheme is instead given by

C_CCPR = [ W_1      W_2      W_3      W_4
           W_3+W_4  W_4+W_1  W_1+W_2  W_2+W_3 ].

As we can see, C_CCPR is a combination of the uncoded and coded approaches. Below, we illustrate the advantages of this scheme by comparing its performance for both accurate and approximate computations. For this analysis, we need a few definitions.

Let N_s(t) denote the number of workers that have completed exactly s computations by time t, s = 0, ..., r. We define N(t) ≜ (N_r(t), ..., N_0(t)) as the cumulative computation type at time t. Additionally, we introduce the K-dimensional score vector S(t) = [s_1(t), ..., s_K(t)], where s_i(t) denotes the number of computations completed and communicated by the i-th worker by time t. We will call a score vector successful if it allows the recovery of the desired computation task at the PS. We note that, due to the homogeneous worker assumption, the probability of observing any score vector with the same cumulative computation type is the same. Therefore, what matters for the overall computation time statistics is the number of successful score vectors corresponding to each computation type.

B. Full gradient performance
Let n_m(N), n_u(N) and n_c(N) denote the number of distinct successful score vectors with cumulative computation type N that allow the recovery of W θ for the MCC, UC-MMC, and CCPR schemes, respectively. For instance, for the cumulative computation type N_8 = (1, 2, 1), the MCC scheme cannot recover the full gradient; however, UC-MMC can recover the full gradient for four S vectors, S = [2, 0, 1, 1], S = [1, 2, 0, 1], S = [1, 1, 2, 0] and S = [0, 1, 1, 2]; hence n_u(N_8) = 4. Finally, in the CCPR scheme, there are in total 8 successful S vectors: [2, 1, 0, 1], [2, 1, 1, 0], [1, 2, 1, 0], [0, 2, 1, 1], [0, 1, 2, 1], [1, 0, 2, 1], [1, 0, 1, 2] and [1, 1, 0, 2].

TABLE I: Number of successful score vectors for each cumulative computation type that can accurately recover the computation task with K = 4 and r = 2.

Cumulative computation type        MCC n_m(N_i)   UC-MMC n_u(N_i)   CCPR n_c(N_i)
N_1: N_2=4, N_1=0, N_0=0                1               1                1
N_2: N_2=3, N_1=1, N_0=0                4               4                4
N_3: N_2=3, N_1=0, N_0=1                4               4                4
N_4: N_2=2, N_1=2, N_0=0                6               6                6
N_5: N_2=2, N_1=1, N_0=1               12               8               12
N_6: N_2=2, N_1=0, N_0=2                6               2                6
N_7: N_2=1, N_1=3, N_0=0                0               4                4
N_8: N_2=1, N_1=2, N_0=1                0               4                8
N_9: N_2=0, N_1=4, N_0=0                0               1                1

These values are listed in Table I for the cumulative computation types with at least one successful score vector for one of the schemes. Particularly striking are the last three rows, which correspond to cases with very few completed computations, i.e., when at most one worker completes all its assigned tasks. In these cases, CCPR is much more likely to recover W θ; hence, the computation deadline can be reduced significantly. For a more explicit comparison of the completion time statistics, we can analyze the probability of each type under a specific computation time model. The probability of cumulative computation type N = (N_r, ..., N_0) at time t is given by

Pr(N(t) = N) = ∏_{s=0}^{r} P_s(t)^{N_s},   (5)

where P_s(t) is the probability of a worker completing exactly s computations by time t. Let T denote the recovery time of the desired computation. Accordingly, for any of the schemes, we can write Pr(T < t) = Σ_{i=1}^{9} n_a(N_i) · Pr(N(t) = N_i), a ∈ {m, u, c}, where the types N_i and the corresponding n_a(N_i), i = 1, ..., 9, are listed in Table I. It is now clear that CCPR has the highest Pr(T < t) for any t, and hence the minimum average completion time E[T]. In the next subsection, we highlight the partial recovery property of CCPR.

TABLE II: Number of successful score vectors for each cumulative computation type that can result in the recovery of at least 3 out of 4 computations with K = 4 and r = 2.

Cumulative computation type        MCC n_m(N_i)   UC-MMC n_u(N_i)   CCPR n_c(N_i)
N_1: N_2=4, N_1=0, N_0=0                1               1                1
N_2: N_2=3, N_1=1, N_0=0                4               4                4
N_3: N_2=3, N_1=0, N_0=1                4               4                4
N_4: N_2=2, N_1=2, N_0=0                6               6                6
N_5: N_2=2, N_1=1, N_0=1               12              12               12
N_6: N_2=2, N_1=0, N_0=2                6               6                6
N_7: N_2=1, N_1=3, N_0=0                0               4                4
N_8: N_2=1, N_1=2, N_0=1                0              12               12
N_9: N_2=1, N_1=1, N_0=2                0               8                8
N_10: N_2=0, N_1=4, N_0=0               0               1                1
N_11: N_2=0, N_1=3, N_0=1               0               4                4
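The entries of Tables I and II can be verified by brute-force enumeration. The sketch below does this for CCPR under the matrix C_CCPR given above, recovering blocks by successive substitution; UC-MMC and MCC can be checked analogously:

    from itertools import product
    from collections import Counter

    K, r = 4, 2
    # C_CCPR of Section II-A (0-based): worker i first computes the uncoded block i,
    # then the degree-2 task summing blocks i+2 and i+3 (mod K).
    uncoded = [{i} for i in range(K)]
    coded = [{(i + 2) % K, (i + 3) % K} for i in range(K)]

    def num_recovered(score):
        eqs = [set(uncoded[i]) for i in range(K) if score[i] >= 1] + \
              [set(coded[i]) for i in range(K) if score[i] >= 2]
        known, progress = set(), True
        while progress:               # successively substitute recovered blocks
            progress = False
            for e in eqs:
                e -= known
                if len(e) == 1:
                    known |= e
                    e.clear()
                    progress = True
        return len(known)

    full, partial = Counter(), Counter()
    for score in product(range(r + 1), repeat=K):
        typ = tuple(sorted(score, reverse=True))
        n = num_recovered(score)
        if n == K:
            full[typ] += 1            # Table I: full recovery
        if n >= 3:
            partial[typ] += 1         # Table II: at least 3 out of 4
    print(full[(2, 1, 1, 0)], partial[(2, 1, 1, 0)])   # -> 8 and 12, as for type N_8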
C. Partial computation performance

Now, we compare the three schemes when we reduce the tolerance level, and aim at recovering only a portion of the computation results. In particular, for the above example, we will deem a scheme successful if it recovers at least 3 out of the 4 values {W_1θ, ..., W_4θ}, corresponding to a tolerance of q = 0.25. For each cumulative computation type, the number of successful score vectors is listed in Table II. We can see that UC-MMC and CCPR have the same average completion time statistics under this tolerance. Hence, CCPR provides a lower average per-iteration completion time for accurate computation compared to UC-MMC, while achieving the same performance when partial computation is allowed.

III. DESIGN PRINCIPLES OF CCPR

For the encoding of the assigned computations, we use a strategy similar to rateless codes, particularly LT codes [45]. We first briefly explain the LT code structure, and highlight the modifications required for our problem setup.

Consider a sequence of symbols W = {W_1, ..., W_K} (in our setup these correspond to the submatrices W_i) to be transmitted over an erasure channel, where erasures correspond to stragglers in our model. The codewords (coded computations in our model) are formed as linear combinations of W_1, ..., W_K, and the goal is to correctly recover the original sequence from only a random subset of the coded symbols. In the encoding phase, a coded symbol is formed by choosing d elements randomly from W and summing them, where d, called the degree of the symbol, is drawn from a degree distribution P(d). In the decoding part, each coded symbol is decomposed by using the recovered symbols; that is, if a coded symbol contains a previously recovered symbol, the latter is subtracted from the coded symbol to obtain a new coded symbol with a smaller degree. Overall, the objective is to recover all K symbols from K(1 + ε) coded symbols with as small an ε as possible, where ε reflects the overhead. We remark that coded symbols with smaller degrees can be decomposed faster; however, having many coded symbols with small degrees increases the probability of linear dependence among the codewords. Hence, the degree distribution plays an important role in the performance of LT codes. It has been shown that, for a carefully chosen P(d), ε goes to zero as K → ∞.

LT codes have the following drawbacks when employed for distributed computation. First, LT codes are designed under the assumption of receiving a large number of coded symbols; however, in a distributed computation scenario the number of symbols is typically limited, since each symbol corresponds to a message transmitted to the PS over the network, and increasing the number of messages may eventually lead to congestion and communication delays. On the other hand, for small K the overhead ε might be high. Second, the degree distribution P(d) of LT codes is designed for the correct recovery of the original sequence W. However, in our partial coded computation scenario, we want to provide a certain flexibility by allowing the recovery of only (1 − q)K symbols, for some predefined tolerance q. Although partial recovery with rateless codes has been studied in the literature [46], we note that partial recovery for coded computation requires a tailored approach, since the computational tasks, each of which corresponds to a distinct coded symbol, are executed sequentially; thus, the corresponding erasure probabilities, induced by the straggling behaviour of the workers, are neither identical nor independent. Therefore, the coded symbols must be designed taking into account their execution orders, to prevent overlaps and to minimize the average completion time.

Hence, the main idea behind the CCPR scheme is to utilize the LT code structure in a more systematic way, such that the degree of each coded computation task is chosen carefully based on its computation order and with the aim of partial recovery.
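As a reference for the decoding step just described, the following is a minimal sketch of the peeling decoder, with each coded symbol represented by the set of source indices it combines (an illustrative implementation, not the code of [48]):

    def peel(coded):
        """Peeling decoder: repeatedly subtract recovered symbols from the coded
        symbols and release any symbol whose degree drops to one."""
        eqs = [set(c) for c in coded]
        recovered, progress = set(), True
        while progress:
            progress = False
            for e in eqs:
                e -= recovered          # subtract previously recovered symbols
                if len(e) == 1:         # degree-1 symbol reveals a new source symbol
                    recovered |= e
                    e.clear()
                    progress = True
        return recovered

    # Coded symbols W1, W2, and W1+W4: W4 is peeled via W1, while W3 stays missing.
    print(peel([{1}, {2}, {1, 4}]))     # -> {1, 2, 4}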
A. Computation order and degree limitation

We want to re-emphasize that coded computation tasks with lower degrees can be recovered faster, but as the number of computations received at the PS increases, lower-degree computations become less and less informative. Therefore, we want the initial computations to be easily recoverable, while those completed later should be more informative. Hence, we introduce the following design criteria: (i) for the first row of the computation assignment matrix, we consider uncoded computations; (ii) for a particular worker and computation orders i < j ≤ r, the degree of the computation at order i cannot be higher than that of the one at order j.

B. Uniformity imposed encoding
As highlighted before, coded messages with lower degrees may result in duplicate recoveries, wasting the computation resources. To this end, under the specified degree limitation, the main design issue is how to form the coded computations so as to prevent duplicate messages as much as possible. Accordingly, the challenge is to distribute the submatrices W_1, ..., W_K among the coded computation tasks in a uniform fashion.
1) Order-wise uniformity:
By order-wise uniformity, we impose a constraint on the code construction such that computations with the same order must have the same degree, and among the computations at the same order, each submatrix W_k must appear in exactly the same number of computations. Formally speaking, for k ∈ [K], let

W̃^j_k ≜ { W̃_{i,j} : α^{(i)}_{j,k} ≠ 0, i ∈ [K] }   (6)

be the set of coded computations at order j ∈ [r] containing submatrix W_k. Then, the order-wise uniformity constraint imposes

|W̃^j_k| = d_j, ∀ j ∈ [r], k ∈ [K],   (7)

for some d_j.
2) Worker-wise uniformity:
Worker-wise uniformity imposes a constraint on the coded computations assigned to each worker, such that these coded computations do not contain any common submatrices. Formally speaking, for any worker i ∈ [K], if α^{(i)}_{j,k} ≠ 0 for some j ∈ [r] and k ∈ [K], then α^{(i)}_{l,k} = 0, ∀ l ∈ [r] \ {j}.

Next, we introduce an encoding structure which ensures both order-wise and worker-wise uniformity.

IV. RANDOMLY CIRCULAR SHIFTED (RCS) CODE DESIGN
Algorithm 1 RCS coded computation
1: Data assignment phase:
2: L = Σ_{i=1}^{r} d_i
3: Choose a random subset I ⊂ [K], |I| = L
4: for row index i = 1, 2, ..., L do
5:    Randomly choose j ∈ I
6:    Update I: I ← I \ {j}
7:    A(i, :) = circshift(W, j − 1)
8: end for
9: Data encoding phase:
10: for worker k = 1, 2, ..., K do
11:    for message j = 1, ..., r do
12:       Starting row index: l_s = Σ_{i=1}^{j−1} d_i + 1
13:       Ending row index: l_e = Σ_{i=1}^{j} d_i
14:       W̃_{k,j} = Σ_{l=l_s}^{l_e} A(l, k)
15:    end for
16: end for

In this section, we introduce the randomly circular shifted (RCS) code design for coded distributed computation, which consists of two steps, namely data assignment and code construction. Before explaining these steps in detail, we first define the degree vector d of length r, where its i-th entry d_i denotes the degree of the coded computations assigned to the workers at order i, 1 ≤ i ≤ r. Based on the aforementioned design criteria, we set d_1 = 1 and, for any i < j, we have d_i ≤ d_j. Once the degree vector d is fixed, the two phases of the RCS code design can be implemented.

In the first phase, an assignment matrix is formed by using random circular shifts of the vector of submatrices W = [W_1, ..., W_K]. At the beginning of the first phase, an index set I ⊂ [K] of size L = Σ_{i=1}^{r} d_i is randomly chosen. Then, using the elements of I as parameters of the circular shift operator on the vector W, an assignment matrix A_RCS is formed following Algorithm 1 (lines 4–8). Let us illustrate this on a simple example. Consider K = 6 workers and I = {1, 2, 3, 4, 5, 6}. Then, for the i-th row of A_RCS, a j ∈ I is chosen randomly and discarded from I, while W is circularly shifted by j − 1. For the sake of simplicity, we assume that the elements of I are chosen in the given order 1, 2, 3, 4, 5, 6; the corresponding assignment matrix is illustrated in Fig. 2. Once the assignment matrix is fixed, the codewords can be generated based on the degree vector d for each column, independently and identically. The colors in the assignment matrix in Fig. 2 represent the submatrices that will form the same coded computation. The code generation for the first worker, using the submatrices in the first column of the assignment matrix, is illustrated in Fig. 3.

Fig. 2: Assignment matrix A_RCS for K = 6 and I = {1, 2, 3, 4, 5, 6}, where the i-th row is W circularly shifted by the i-th chosen element of I minus one.

Fig. 3: Illustration of the encoding phase for the first worker and the corresponding first column of the computation assignment matrix C: with d = [1, 2, 3], the first column of A_RCS is partitioned into groups of sizes 1, 2 and 3, and the submatrices in each group are summed to form W̃_{1,1}, W̃_{1,2} and W̃_{1,3}.

We note that each coded message corresponds to a linear equation, and in the decoding phase any approach for solving a set of linear equations can be utilized; e.g., we can form a matrix from the coefficients α^{(i)}_{j,k} of the coded messages, so that each coded message is represented by a binary row vector, and obtain the reduced row echelon form. Similar to LT codes, we consider a low-complexity decoding framework that decomposes the codewords successively using only recovered symbols. By construction, the maximum number of decompositions required for decoding is K × L̃, where L̃ = Σ_{i=1}^{r} d_i − 1; hence, the complexity of the decoding phase is O(K L̃).
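For concreteness, here is a minimal Python sketch of Algorithm 1, with rows represented by submatrix indices and the circular shift realized by np.roll; the names and the shift direction are implementation assumptions:

    import numpy as np

    def rcs_assign(K, deg, rng):
        """Algorithm 1 (sketch): build the L x K assignment matrix A of submatrix
        indices and each worker's coded tasks, for a given degree vector deg."""
        L = sum(deg)                             # number of assigned rows; requires L <= K
        I = list(rng.choice(K, size=L, replace=False))
        A = np.zeros((L, K), dtype=int)
        for i in range(L):
            j = I.pop(rng.integers(len(I)))      # randomly choose j in I and discard it
            A[i] = np.roll(np.arange(K), j)      # row i: [W_1 ... W_K] circularly shifted
        tasks = []                               # tasks[k][j]: index set of W~_{k,j}
        for k in range(K):
            col, out, start = A[:, k], [], 0
            for dj in deg:
                out.append({int(x) + 1 for x in col[start:start + dj]})  # 1-based ids
                start += dj
            tasks.append(out)
        return A, tasks

    rng = np.random.default_rng(0)
    A, tasks = rcs_assign(K=6, deg=[1, 2, 3], rng=rng)
    print(tasks[0])    # worker 1's tasks: one degree-1, one degree-2 and one degree-3 sum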
V. EXTENSION TO CODED COMMUNICATION SCENARIO

Above, we have mainly focused on coded computation in the context of a linear regression problem, where the main computation task boils down to distributed matrix-vector multiplication. In a more general distributed computation problem, in which the computations cannot be expressed as a linear transform of the dataset, we cannot employ a similar coded computation technique. However, if the overall computation task can be written as the summation of smaller partial computation tasks, then redundancy can be achieved by assigning each of these partial computations to multiple workers. The communication load of such an implementation can be reduced by coded communication, where each worker sends to the PS linear combinations of its partial computations. The gradient coding (GC) scheme, introduced in [8], considers gradient estimates computed on subsets of a dataset as partial computations, and achieves redundancy by replicating parts of the dataset at multiple workers. This approach has been extended in various directions to improve the performance [9]–[14].

Let G = {g_1, ..., g_K} be the results of K partial computations, whose sum is to be recovered at the PS. In the GC scheme with computation load r, r partial computations, denoted by G_k, are assigned to worker k. Each worker, after completing its r partial computations, sends a linear combination of its results to the PS:

c_k ≜ L_k(g_i : g_i ∈ G_k).   (8)

We refer to the linear combinations c_1, ..., c_K as coded partial computations. The PS waits until it receives sufficiently many coded partial computations to recover the full gradient. It is shown in [8] that, for any set of non-straggler workers K̃ ⊆ [K] with |K̃| = K − r + 1, there exists a set of coefficients A_K̃ = {a_k : k ∈ K̃} such that

Σ_{k∈K̃} a_k c_k^(t) = Σ_{k=1}^{K} g_k^(t).   (9)

The GC scheme is designed for exact recovery of the summation, and limits the number of messages per worker to one; the MMC variation of GC has been studied in [11], [13]. Here, we show that the RCS code proposed for matrix-vector multiplication can also be used for partial recovery in coded communication, with a small variation in the encoding phase.
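A minimal sketch of (8) and (9), using the simple fractional repetition construction of [8] (r divides K, and groups of r workers share the same r partial computations), is given below with illustrative sizes:

    import numpy as np

    K, r = 4, 2
    rng = np.random.default_rng(4)
    g = rng.standard_normal((K, 5))            # partial gradients g_1, ..., g_K

    # Fractional repetition: workers 1,2 hold {g1, g2}; workers 3,4 hold {g3, g4}.
    groups = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
    c = np.stack([sum(g[i] for i in G) for G in groups])   # c_k as in (8), L_k a plain sum

    # Any K - r + 1 = 3 non-stragglers contain a worker from each group; use one per group:
    alive = [0, 2, 3]
    a, seen = {}, set()
    for k in alive:
        if not groups[k] <= seen:              # first surviving worker covering a new group
            a[k] = 1.0
            seen |= groups[k]
    assert np.allclose(sum(a[k] * c[k] for k in a), g.sum(axis=0))   # eq. (9)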
We illustrate RCS coded communication on an example. Consider the previous example with K = 6 workers. As before, an L × K computation assignment matrix is formed to assign partial computations to the workers with a certain computation order, as illustrated in Fig. 4. In the coded communication scenario, encoding takes place after the computation; therefore, the computation load of the computation assignment matrix in Fig. 4 is r = L = 6, whereas the same matrix would have a computation load of r = 3 in the coded computation scenario. Again, once the assignment matrix is formed, the coded messages of each worker are constructed according to the given degree vector d and based on the assignment matrix A_RCS, as illustrated in Fig. 5. Note that, similarly to the coded computation scenario, the RCS code allows the recovery of only a subset of the partial computations, and can thus compute an approximation to the required summation. Note, however, that while in coded computation the missing results correspond to entries of the vector we would like to compute, here the missing results will impact every entry of the desired computation, as we will be missing some of the partial computations. We want to remark that, concurrently with our work, GC schemes with a particular focus on the trade-off between computation accuracy and time have been studied in [14], [40]–[42], [47].

Fig. 4: Assignment matrix for K = 6 and I = {1, 2, 3, 4, 5, 6}, with the partial computations g_1, ..., g_6 in place of the submatrices.

Fig. 5: Illustration of the encoding phase at the first worker for coded communication, with the computation assignment matrix in Fig. 4 and the degree vector d = [1, 2, 3].
VI. GENERALIZED RCS CODES
In the introduced RCS code structure, the main computation task is divided into K equal subtasks. However, if the variation in the computation speeds of the workers is small, it might be better to divide the task into even smaller subtasks in order to better utilize the computational resources.

Here, we present generalized RCS codes, which allow adjusting the sizes of the individual computation tasks. In generalized RCS codes, the encoding part remains the same as before, but the construction of the A_RCS matrix is modified as follows. First, W is divided into KN disjoint submatrices W_1, ..., W_{KN}, which are then divided into N groups W(1), ..., W(N), each containing K submatrices, i.e., W(i) = [W_{(i−1)K+1}, ..., W_{iK}]. Before the construction of the A_RCS matrix, we define a vector z that will be useful in the sequel: z is an L-dimensional vector, L = Σ_{i=1}^{r} d_i, whose entries take values in [N]. The construction of A_RCS is executed row by row, such that for the i-th row we first check z(i), and accordingly use the submatrices in group W(z(i)). Once we have decided on the group W(z(i)), we randomly sample an element j from the set I_{z(i)} and circularly shift W(z(i)) by j − 1 before assigning it as the i-th row of A_RCS. We remark that, initially, I_i = [K], ∀ i ∈ [N], and after each circular shift based on a sampled j, we remove j from the corresponding set I_{z(i)} to prevent repetition among the rows. The detailed procedure is given in Algorithm 2.

We present a simple example for K = 4 and N = 2 to clarify the overall procedure. Let z = [1, 2, 1, 2, 2], so that two random samples are drawn from I_1 and three from I_2; the construction of A_RCS is illustrated in Fig. 6, and the computation assignment matrix C_RCS is given for d = [1, 1, 3] in Fig. 7. As in the previous example illustrated in Fig. 3, each worker is allowed to send at most 3 messages, but now the computation load is r = 5/2 instead of 5, as each of the computations is half the size of those in the previous example. If the computation speeds of the workers are likely to be similar, it is better to divide the main computation task into smaller subtasks to utilize the available computation resources better. Here, we remark that the first row of C_RCS is from W(1), while the second row is from W(2). Hence, considering only the first two assigned computations, the recovery probability of W_iθ is higher for W_i ∈ W(1) than for W_i ∈ W(2); therefore, the third row is generated using two submatrices from W(2) and one submatrix from W(1). Consequently, by playing with d and z, different operating points in terms of the computation speed and accuracy can be achieved.

Fig. 6: A_RCS for K = 4 and N = 2, based on z = [1, 2, 1, 2, 2] and random samples drawn from I_1 and I_2.

Fig. 7: C_RCS based on A_RCS and d = [1, 1, 3].
Algorithm 2 Generalized RCS coded computation
1: Data assignment phase:
2: L = Σ_{i=1}^{r} d_i
3: for j = 1, ..., N do
4:    I_j = [K]
5: end for
6: for row index i = 1, 2, ..., L do
7:    Randomly choose j ∈ I_{z(i)}
8:    Update I_{z(i)}: I_{z(i)} ← I_{z(i)} \ {j}
9:    A(i, :) = circshift(W(z(i)), j − 1)
10: end for
11: Data encoding phase:
12: for worker k = 1, 2, ..., K do
13:    for message j = 1, ..., r do
14:       Starting row index: l_s = Σ_{i=1}^{j−1} d_i + 1
15:       Ending row index: l_e = Σ_{i=1}^{j} d_i
16:       W̃_{k,j} = Σ_{l=l_s}^{l_e} A(l, k)
17:    end for
18: end for
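A minimal sketch of Algorithm 2, reusing the conventions of the Algorithm 1 sketch above (index-valued rows, np.roll for the circular shift; z and deg take the example's values):

    import numpy as np

    def gen_rcs_assign(K, N, deg, z, rng):
        """Algorithm 2 (sketch): row i of A is a random circular shift of group
        W(z[i]); group g holds the submatrix ids (g-1)*K+1, ..., g*K."""
        L = sum(deg)
        assert len(z) == L
        I = {grp: list(range(K)) for grp in range(1, N + 1)}   # I_g = [K] per group
        A = np.zeros((L, K), dtype=int)
        for i in range(L):
            grp = z[i]
            j = I[grp].pop(rng.integers(len(I[grp])))          # sample and discard j from I_g
            A[i] = np.roll(np.arange(K), j) + (grp - 1) * K + 1
        return A

    rng = np.random.default_rng(1)
    A = gen_rcs_assign(K=4, N=2, deg=[1, 1, 3], z=[1, 2, 1, 2, 2], rng=rng)
    print(A)           # 5 x 4 assignment over the submatrices W_1, ..., W_8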
VII. NUMERICAL RESULTS AND DISCUSSIONS
For the numerical analysis, we first analyze the convergence performance of the partial recovery strategy for coded computation with RCS codes under different tolerance requirements; then we compare the average per-iteration completion time of the RCS code with the UC-MMC and MDS-coded computation (MCC) schemes; and finally, we extend our analysis to coded communication. The code used to obtain the simulation results can be accessed at [48].
A. Simulation setup
For the statistics of the computation speeds of the workers, we adopt the model in [15], where the probability of completing exactly s computations by time t, P_s(t), is given by

P_s(t) = { 0,                                              if t < sα,
           1 − e^{−µ(t/s − α)},                            if sα ≤ t < (s+1)α,
           e^{−µ(t/(s+1) − α)} − e^{−µ(t/s − α)},          if (s+1)α ≤ t,   (10)

where α is the minimum required time to finish a computation task, and µ is the average number of computations completed in unit time.
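The model (10) is equivalent to giving each worker a shifted-exponential time per task: completing s tasks takes s(α + X/µ) with X ~ Exp(1), so the number completed by a deadline t is ⌊t/(α + X/µ)⌋. The sketch below samples the model this way; the parameter values are illustrative assumptions, not the (unrecovered) values used in the paper:

    import numpy as np

    def completed_by(t, K, mu, alpha, rng):
        """Number of computations each of K workers completes by time t under (10)."""
        per_task = alpha + rng.exponential(1.0, size=K) / mu   # random time per task
        return np.floor(t / per_task).astype(int)

    rng = np.random.default_rng(0)
    counts = completed_by(t=0.05, K=40, mu=10.0, alpha=0.01, rng=rng)  # illustrative values
    print(np.bincount(counts))     # empirical cumulative computation type N(t)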
For the simulations, we consider a linear regression problem over a synthetically created training dataset of 2000 samples, generated according to a normal mixture distribution as in [18]. In the generation of the synthetic data, we first create the true parameter vector θ* by sampling each entry uniformly at random from the interval [0, 1]. Given θ*, we construct two mean vectors µ_1 and µ_2 = −µ_1, both scaled versions of θ*, which are then used in the normal mixture distribution (1/2)N(µ_1, I) + (1/2)N(µ_2, I) to generate each row of the dataset X. We assume K homogeneous workers, and fix µ and α in (10) for the statistics of their computation speeds. For RCS coded computation, we choose a degree vector d with three entries, which corresponds to a computation load of r = 3, and execute Algorithm 1 accordingly. In all the simulations, we use a fixed learning rate.

Fig. 8: Training error vs. number of iterations for tolerance levels q = 0, 0.15 and 0.3.

TABLE III: Comparison of the proposed RCS scheme with the UC-MMC and MCC schemes for the same computation load.

                                      RCS                        UC-MMC                 MCC
Strategy / tolerance         q=0     q=0.15   q=0.3     q=0     q=0.15   q=0.3
Average iteration time       0.1475  0.0936   0.0776    0.2424  0.1170   0.0799   0.1572
Number of received messages  60.93   42.38    35.03     81.29   51.16    36.70    14

B. Simulation Results
We first evaluate the training error over T iterations for tolerance levels q = 0 (which corresponds to full recovery), q = 0.15, and q = 0.3. One can observe in Fig. 8 that, although the convergence speed reduces with an increasing tolerance level, partial recovery does not noticeably harm the convergence behaviour, especially if the tolerance level is moderate, e.g., q = 0.15. (For the convergence plots, we average over 100 independent simulations.) We then repeat the same experiment for a second model size, which demonstrates similar trends, as seen in Fig. 9.

What we want to see next is how much reduction in the per-iteration time can be achieved by the partial recovery scheme. Hence, we present the per-iteration times of the three schemes, namely RCS, UC-MMC and MCC, under the same computation load r = 3. For RCS, we use the degree vector d given above; for UC-MMC, we use the cyclically shifted assignment as in [19]; and for MCC, we use a (⌈K/r⌉, K) MDS code.

We compare the three schemes in terms of two performance metrics, the average completion time and the number of received messages, which capture how fast an iteration is completed and the induced communication load, respectively. From Table III, one can observe that for full recovery, i.e., q = 0, RCS outperforms both the UC-MMC and MCC schemes. We also observe that, by allowing partial recovery, it is possible to achieve approximately 37% and 47% reductions in the per-iteration completion time with q = 0.15 and q = 0.3, respectively. Here, we note that the partial recovery approach can also be employed with UC-MMC; however, as demonstrated in Table III, RCS outperforms UC-MMC for all the given tolerance values. Besides, with RCS, the PS completes an iteration with a smaller number of received messages, which means that, compared to UC-MMC, RCS induces a lower communication load and less congestion. For instance, for q = 0, UC-MMC requires, on average, about 20 more messages to complete an iteration compared to the proposed RCS scheme, which has been shown to be a critical factor affecting the performance of real implementations [43]. We note that MCC requires the minimum number of messages to complete an iteration; hence, for q = 0, MCC can be a better alternative. Nevertheless, another advantage of RCS compared to MCC is that the decoding process can be executed in parallel with the computations, so that the additional latency due to decoding is minimized. Finally, based on the simulation results, we can conclude that the RCS scheme operates most efficiently when partial recovery is targeted with a low tolerance level, e.g., q = 0.15, since in this regime we see the advantages of both coded computation and partial recovery. To clarify this point, we observe that increasing the tolerance level to q = 0.3 has a considerable impact on the training accuracy, while the additional reduction in the average per-iteration computation time is relatively small. Besides, as q increases, the performance of the UC-MMC scheme approaches that of RCS.

Fig. 9: Training error vs. number of iterations for tolerance levels q = 0, 0.15 and 0.3, for the second model size.

TABLE IV: Comparison of the proposed RCS scheme with the UC-MMC and GC schemes in the coded communication scenario.
We also conduct an experiment with the generalized RCS code with N = 2. For this simulation, we choose a degree vector d with four entries, so that 4 coded computations are assigned to each worker in total, and the complexity of each computation is half of that in the original RCS scheme; hence, r = 2. The coded computations contain one submatrix from the first group, one submatrix from the second group, 3 submatrices from each group, and 5 submatrices from each group, respectively, with the group-selection vector z chosen accordingly. For the given code structure, we measure the average iteration time for q = 0, 0.15 and 0.3, and observe that, by using smaller tasks, both the computation time and the computation load can be reduced. On the other hand, the use of smaller subtasks may increase the number of messages; in the given simulation setup, the average number of messages received at the PS increases accordingly for all the tolerance levels.

Although the RCS code is initially designed for coded computation, it can, as explained in Section V, be implemented for coded communication as well. Therefore, we repeat our experiments to compare the performance of RCS, UC-MMC and GC under a fixed computation load; for RCS, we now use a degree vector d with three entries. The simulation results are presented in Table IV. We observe that, in terms of the average per-iteration completion time, UC-MMC and RCS both outperform GC, with UC-MMC typically achieving the lowest computation time. In terms of the communication load, UC-MMC has the worst performance. Hence, the key advantage of RCS in the coded communication scenario is a better balance between the computation and communication latencies. At this point, we also want to highlight that UC-MMC can be considered a special case of RCS, where the degrees of all the messages are one. Overall, one can tune the degree vector to achieve different points on the trade-off between the communication and computation latencies.

C. Discussions
We have proposed a new code construction framework for straggler-aware coded computation/communication schemes, which provides the flexibility to trade off the accuracy of the computation against the computation latency and the communication load. While we have provided a specific code construction, several improvements and/or adaptations of this design are possible. Below, we briefly discuss some possible future extensions.
1) Double threshold scheme:
One possible extension is to use two thresholds to decide when to terminate an iteration. The reasoning behind the two-threshold strategy is that partial recovery is efficient when the sacrifice in computation accuracy provides a noticeable reduction in latency; however, if all the workers are sufficiently fast, then partial recovery may not bring a noticeable reduction in latency, while still losing accuracy. To this end, after sending the latest model vector θ_t, the PS starts a timer and collects messages from the workers for a given fixed duration. Once this duration has elapsed, the PS checks whether the tolerance requirement is satisfied; if it is not, the PS continues to receive messages from the workers until it is satisfied.
2) Adaptive tolerance:
In the case of iterative training of machine learning models, it is known that the update accuracy has a different impact on convergence at different phases of the training process. Hence, the tolerance can be adjusted over time to obtain a better overall convergence result.
3) Memory enhanced updates:
Again, when partial computations are used in the context of iterative optimization or training, the PS can benefit from the computations recovered in previous iterations. In our simulations, at each iteration we use only the computations recovered in that iteration. Instead, the PS could utilize the results from previous iterations to compensate for the missing computation results in the current iteration, similar to the momentum SGD framework.

VIII. CONCLUSIONS
In this paper, we have introduced the CCPR approach, and a particular code structure, called RCS, in order to provide additional flexibility in seeking a balance between the per-iteration completion time, the computation accuracy, and the communication load in distributed computing. In particular, for a matrix-vector multiplication task, the RCS code can adaptively recover a portion of the elements of the resultant vector. The proposed code construction is built upon the LT code structure, but requires additional optimization of the underlying degree distribution due to the correlation among the erasures of the symbols in the code. We have also shown that the RCS code can provide similar flexibility in the distributed computation of any arbitrary computation task that can be written as the summation of multiple partial computations. We have applied the proposed RCS code to iterative SGD in a linear regression problem. By conducting experiments with different tolerance values, we showed that the RCS code can help to reduce the per-iteration completion time at the cost of a reasonable reduction in the update accuracy, which can be tolerated due to the iterative nature of the algorithm. We also showed that, compared to UC-MMC, which can also employ partial recovery, RCS requires, on average, fewer messages to complete an iteration, which means a lower communication load.
REFERENCES

[1] S. Li, S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi, "Near-optimal straggler mitigation for distributed gradient methods," May 2018, pp. 857–866.
[2] N. Ferdinand and S. C. Draper, "Anytime stochastic gradient descent: A time to hear from all the workers," Oct. 2018, pp. 552–559.
[3] M. Mohammadi Amiri and D. Gündüz, "Computation scheduling for distributed machine learning with straggling workers," IEEE Transactions on Signal Processing, vol. 67, no. 24, pp. 6270–6284, Dec. 2019.
[4] A. Behrouzi-Far and E. Soljanin, "On the effect of task-to-worker assignment in distributed computing systems with stragglers," Oct. 2018, pp. 560–566.
[5] J. Chen, R. Monga, S. Bengio, and R. Józefowicz, "Revisiting distributed synchronous SGD," CoRR, vol. abs/1604.00981, 2016. [Online]. Available: http://arxiv.org/abs/1604.00981
[6] M. F. Aktaş and E. Soljanin, "Straggler mitigation at scale," IEEE/ACM Transactions on Networking, vol. 27, no. 6, pp. 2266–2279, 2019.
[7] D. Wang, G. Joshi, and G. W. Wornell, "Efficient straggler replication in large-scale parallel computing," ACM Trans. Model. Perform. Eval. Comput. Syst., vol. 4, no. 2, pp. 7:1–7:23, Apr. 2019. [Online]. Available: http://doi.acm.org/10.1145/3310336
[8] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, "Gradient coding: Avoiding stragglers in distributed learning," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. Sydney, Australia: PMLR, Aug. 2017, pp. 3368–3376.
[9] M. Ye and E. Abbe, "Communication-computation efficient gradient coding," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholm, Sweden: PMLR, Jul. 2018, pp. 5610–5619.
[10] W. Halbawi, N. Azizan, F. Salehi, and B. Hassibi, "Improving distributed gradient descent using Reed-Solomon codes," June 2018, pp. 2027–2031.
[11] E. Ozfatura, D. Gündüz, and S. Ulukus, "Gradient coding with clustering and multi-message communication," June 2019, pp. 42–46.
[12] S. Sasi, V. Lalitha, V. Aggarwal, and B. S. Rajan, "Straggler mitigation with tiered gradient codes," IEEE Transactions on Communications, pp. 1–1, 2020.
[13] L. Tauz and L. Dolecek, "Multi-message gradient coding for utilizing non-persistent stragglers," 2019, pp. 2154–2159.
[14] N. Charalambides, M. Pilanci, and A. O. Hero, "Weighted gradient coding with leverage score sampling," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5215–5219.
[15] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
[16] N. Ferdinand and S. C. Draper, "Hierarchical coded computation," June 2018, pp. 1620–1624.
[17] R. K. Maity, A. Singh Rawat, and A. Mazumdar, "Robust gradient descent via moment encoding and LDPC codes," July 2019, pp. 2734–2738.
[18] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, "Polynomially coded regression: Optimal straggler mitigation via data encoding," CoRR, vol. abs/1805.09934, 2018.
[19] E. Ozfatura, D. Gündüz, and S. Ulukus, "Speeding up distributed gradient descent by utilizing non-persistent stragglers," July 2019, pp. 2729–2733.
[20] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, "On the optimal recovery threshold of coded matrix multiplication," IEEE Transactions on Information Theory, pp. 1–1, 2019.
[21] Q. Yu, M. Maddah-Ali, and S. Avestimehr, "Polynomial codes: an optimal design for high-dimensional coded matrix multiplication," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4403–4413.
[22] H. Park, K. Lee, J. Sohn, C. Suh, and J. Moon, "Hierarchical coding for distributed computing," June 2018, pp. 1630–1634.
[23] A. Mallick, M. Chaudhari, U. Sheth, G. Palanikumar, and G. Joshi, "Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication," Proc. ACM Meas. Anal. Comput. Syst., vol. 3, no. 3, Dec. 2019. [Online]. Available: https://doi.org/10.1145/3366706
[24] S. Kiani, N. Ferdinand, and S. C. Draper, "Exploitation of stragglers in coded computation," June 2018, pp. 1988–1992.
[25] A. B. Das, L. Tang, and A. Ramamoorthy, "C3LES: Codes for coded computation that leverage stragglers," Nov. 2018, pp. 1–5.
[26] E. Ozfatura, S. Ulukus, and D. Gündüz, "Distributed gradient descent with coded partial gradient computations," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 3492–3496.
[27] F. Haddadpour, Y. Yang, M. Chaudhari, V. R. Cadambe, and P. Grover, "Straggler-resilient and communication-efficient distributed iterative linear solver," CoRR, vol. abs/1806.06140, 2018.
[28] H. Wang, S. Guo, B. Tang, R. Li, and C. Li, "Heterogeneity-aware gradient coding for straggler tolerance," 2019, pp. 555–564.
[29] M. Kim, J. Sohn, and J. Moon, "Coded matrix multiplication on a group-based model," July 2019, pp. 722–726.
[30] Y. Yang, M. Interlandi, P. Grover, S. Kar, S. Amizadeh, and M. Weimer, "Coded elastic computing," July 2019, pp. 2654–2658.
[31] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, "Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding," June 2018, pp. 2022–2026.
[32] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, "A unified coded deep neural network training strategy based on generalized polydot codes," June 2018, pp. 1585–1589.
[33] P. Soto, J. Li, and X. Fan, "Dual entangled polynomial code: Three-dimensional coding for distributed matrix multiplication," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, Jun. 2019, pp. 5937–5945.
[34] H. Park and J. Moon, "Irregular product coded computation for high-dimensional matrix multiplication," July 2019, pp. 1782–1786.
[35] Y. Sun, J. Zhao, S. Zhou, and D. Gündüz, "Heterogeneous coded computation across heterogeneous workers," 2019, pp. 1–6.
[36] C.-S. Yang, R. Pedarsani, and A. S. Avestimehr, "Timely-throughput optimal coded computing over cloud networks," in Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. MobiHoc '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 301–310.
[37] B. Buyukates and S. Ulukus, "Timely distributed computation with stragglers," 2019.
[38] B. Hasircioglu, J. Gomez-Vilardebo, and D. Gündüz, "Bivariate polynomial coding for exploiting stragglers in heterogeneous coded computing systems," 2020.
[39] R. Bitar, Y. Xing, Y. Keshtkarjahromi, V. Dasari, S. E. Rouayheb, and H. Seferoglu, "Private and rateless adaptive coded matrix-vector multiplication," 2019.
[40] R. Bitar, M. Wootters, and S. El Rouayheb, "Stochastic gradient coding for straggler mitigation in distributed learning," IEEE Journal on Selected Areas in Information Theory, pp. 1–1, 2020.
[41] H. Wang, Z. B. Charles, and D. S. Papailiopoulos, "ErasureHead: Distributed gradient descent without delays using approximate gradient coding," CoRR, vol. abs/1901.09671, 2019. [Online]. Available: http://arxiv.org/abs/1901.09671
[42] S. Wang, J. Liu, and N. Shroff, "Fundamental limits of approximate gradient coding," Proc. ACM Meas. Anal. Comput. Syst., vol. 3, no. 3, Dec. 2019. [Online]. Available: https://doi.org/10.1145/3366700
[43] E. Ozfatura, S. Ulukus, and D. Gündüz, "Straggler-aware distributed learning: Communication computation latency trade-off," Entropy, special issue Interplay between Storage, Computing, and Communications from an Information-Theoretic Perspective, vol. 22, no. 5, p. 544, May 2020.
[44] E. Ozfatura, B. Buyukates, D. Gündüz, and S. Ulukus, "Age-based coded computation for bias reduction in distributed learning," 2020.
[45] M. Luby, "LT codes," Nov. 2002.
[46] V. Bioglio, M. Grangetto, R. Gaeta, and M. Sereno, "An optimal partial decoding algorithm for rateless codes," July 2011, pp. 2731–2735.
[47] S. Horii, T. Yoshida, M. Kobayashi, and T. Matsushima, "Distributed stochastic gradient descent using LDGM codes," in 2019 IEEE International Symposium on Information Theory (ISIT).