A Survey of Coded Distributed Computing
Jer Shyuan Ng, Wei Yang Bryan Lim, Nguyen Cong Luong, Zehui Xiong, Alia Asheralieva, Dusit Niyato, Cyril Leung, Chunyan Miao
Abstract—Distributed computing has become a common approach for large-scale computation tasks due to benefits such as high reliability, scalability, computation speed, and cost-effectiveness. However, distributed computing faces critical issues related to communication load and straggler effects. In particular, computing nodes need to exchange intermediate results with each other in order to calculate the final result, and this significantly increases communication overheads. Furthermore, a distributed computing network may include straggling nodes that run intermittently slower. This results in a longer overall time needed to execute the computation tasks, thereby limiting the performance of distributed computing. To address these issues, coded distributed computing (CDC), i.e., a combination of coding theoretic techniques and distributed computing, has recently been proposed as a promising solution. Coding theoretic techniques have proved effective in WiFi and cellular systems in dealing with channel noise. Therefore, CDC may significantly reduce the communication load, alleviate the effects of stragglers, and provide fault tolerance, privacy, and security. In this survey, we first introduce the fundamentals of CDC, followed by basic CDC schemes. Then, we review and analyze a number of CDC approaches proposed to reduce the communication costs, mitigate the straggler effects, and guarantee privacy and security. Furthermore, we present and discuss applications of CDC in modern computer networks. Finally, we highlight important challenges and promising research directions related to CDC.
Index Terms—Distributed computing, communication minimization, straggler effects mitigation, security, coded distributed computing
I. INTRODUCTION
In recent years, distributed computing has been used for large-scale computation [1] since it offers several advantages over centralized computing. First, distributed computing is able to provide computing services with high reliability and fault tolerance. In particular, distributed computing systems can efficiently and reliably work even if some of the computing nodes, i.e., computers or workers, fail. Second, distributed computing has high computation speed as the computation load is
JS. Ng and WYB. Lim are with Alibaba Group and Alibaba-NTU Joint Research Institute, Nanyang Technological University, Singapore. N. C. Luong is with the Faculty of Information Technology, PHENIKAA University, Hanoi 12116, Vietnam, and with the PHENIKAA Research and Technology Institute (PRATI), A&A Green Phoenix Group JSC, No. 167 Hoang Ngan, Trung Hoa, Cau Giay, Hanoi 11313, Vietnam. Z. Xiong is with the Alibaba-NTU Joint Research Institute and the School of Computer Science and Engineering, Nanyang Technological University, Singapore. A. Asheralieva is with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China. D. Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. C. Leung is with The University of British Columbia and the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY). C. Miao is with the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY) and the School of Computer Science and Engineering, Nanyang Technological University, Singapore.

shared among various computing nodes. Third, distributed computing systems are scalable since computing nodes can easily be added. Fourth, distributed computing is economical as it can use computing nodes with low-cost hardware. As a result, distributed computing is adopted in cloud computing and other emerging services.
Given the aforementioned advantages, distributed computing has been applied in numerous real-life applications such as telecommunication networks [2] (e.g., telephone networks and wireless sensor networks), network applications [3] (e.g., world-wide web networks, massively multiplayer online games and virtual reality communities, distributed database management systems, and network file systems), real-time process control [4] (e.g., aircraft control systems), and parallel computation [5] (e.g., cluster computing, grid computing, and computer graphics).

However, distributed computing faces serious challenges. Let us consider one of the most common distributed computation frameworks, i.e., MapReduce [6]. MapReduce is a software framework and programming model for processing a computation task across large datasets using a large number of computing nodes, i.e., workers. A set of computing nodes is referred to as a cluster or a grid. In general, the overall computation task is decomposed into three phases, i.e., the "Map" phase, the "Shuffle" phase, and the "Reduce" phase. In the Map phase, a master node splits the computation task into multiple subtasks and assigns the subtasks to the computing nodes. The computing nodes compute the subtasks according to the allocated Map functions to generate intermediate results. Then, the intermediate results are exchanged among the computing nodes, namely "data shuffling", during the Shuffle phase. In the Reduce phase, the computing nodes use these results to compute the final result in a distributed manner using their allocated Reduce functions.

Distributed computing has two major challenges. First, the computing nodes need to exchange a number of intermediate results over the network with each other in order to calculate the final result; this significantly increases communication overheads and limits the performance of distributed computing applications such as Self-Join [7], TeraSort [8], and machine learning [9].
For example, for the Hadoop cluster at Facebook, it is observed that, on average, the data shuffling phase accounts for 33% of the overall job execution time [9]. Moreover, 65% and 70% of the overall job execution time is spent on the Shuffle phase when running TeraSort and Self-Join applications, respectively, on a heterogeneous Amazon EC2 cluster [10]. In fact, the communication bottleneck is worse in the training of convolutional neural networks (CNNs), e.g., ResNet-50 [11] and AlexNet [12], which involves updates of millions of model parameters. Second, distributed computing is executed by a large number of computing nodes which may have very different computing and networking resources. As a result, there are straggling nodes, or stragglers, i.e., computing nodes which run unintentionally slower than others, thereby increasing the overall time needed to complete the computing tasks. To address the straggler effects, traditional approaches such as work exchange [13] and naive replication [14] have been adopted for distributed computing. However, such approaches either introduce redundancy or require coordination among the nodes, which significantly increases communication costs and computation loads. This motivates the need for a novel technique that is able to more effectively address the straggler effects and communication load of distributed computing.

Coding theoretic techniques, e.g., channel coding such as low-density parity-check (LDPC) codes [15], have been widely used in WiFi and cellular systems to combat the impact of channel noise and impairments. They have also been applied in distributed storage systems and cache networks [16] to reduce storage cost and network traffic. The basic principle of coding theoretic techniques is that redundant information, i.e., redundancy, is introduced into messages/signals before they are transmitted to a receiver.
The redundancy is included in the messages in a controlled manner such that it can be utilized by the receiver to correct errors caused by the channel noise. Coding theoretic techniques have recently been regarded as promising solutions to cope with the challenges in distributed computing [17], [18]. For example, coding theoretic techniques can be used to encode the Map tasks of the computing nodes such that the master node is able to recover the final result from the partially finished nodes, thus alleviating the straggler effects [19]. Another example is that coding theoretic techniques enable coding opportunities across intermediate results of the distributed computation tasks, which significantly reduces the communication load by reducing the number and the size of data transmissions among the processing nodes [17]. The combination of coding techniques and distributed computing is called coded distributed computing (CDC) [17]. Apart from reducing the communication load and alleviating the effects of stragglers, CDC can provide fault tolerance, preserve privacy [20], and improve security [21] in distributed computing. As a result, CDC approaches have recently received a lot of attention.

CDC schemes can be applied in modern networks such as Network Function Virtualization (NFV) and edge computing. With data mainly generated by end devices, e.g., Internet of Things (IoT) devices, that have significant sensing as well as computational and storage capabilities, it is natural to perform some computations at the end devices, instead of the cloud, which may not be able to handle the large amounts of data generated. As such, edge computing [22] has been proposed as a solution to perform distributed computation tasks. In order to perform complex computations, e.g., the training of deep neural networks that involve a large number of training layers, resource-constrained devices may need to pool their resources to perform their computations collaboratively [23].
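The straggler-tolerance idea above, i.e., encoding the Map tasks so that the master node can decode the final result from partially finished nodes [19], can be sketched with a toy parity code for distributed matrix-vector multiplication. This is a minimal illustration, not any specific scheme from the literature; the matrix sizes and worker assignments are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 3))   # data matrix, to be split row-wise
x = rng.integers(0, 10, size=3)        # input vector; the goal is A @ x

# Encode: split A into 2 blocks and add one parity block (a (3, 2) MDS-style code),
# so that any 2 of the 3 worker responses suffice to recover A @ x.
A1, A2 = A[:2], A[2:]
tasks = {1: A1, 2: A2, 3: A1 + A2}     # worker 3 computes on the parity block

# Suppose worker 2 straggles: the master hears back only from workers 1 and 3.
results = {w: tasks[w] @ x for w in (1, 3)}

# Decode: A2 @ x = (A1 + A2) @ x - A1 @ x, so worker 2's result is not needed.
y = np.concatenate([results[1], results[3] - results[1]])
assert np.array_equal(y, A @ x)
```

The same decoding works if worker 1 or worker 3 straggles instead, which is the defining property of a maximum distance separable (MDS) code.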
This results in high communication costs and computation latency. CDC schemes can be implemented to overcome these challenges. Furthermore, CDC schemes can be implemented on edge computing networks that involve constantly-moving end devices [24], e.g., vehicles and smartphones, which imposes additional communication constraints.

Fig. 1: Structure of survey.

To the best of our knowledge, although there are several surveys and books related to distributed computing, there is no survey paper on CDC. In particular, large-scale distributed computing and applications are discussed in [1]. Surveys related to distributed computing include grid resource management systems for distributed computing [25], resource allocation in high performance distributed computing [26], wireless distributed computing [27], and wireless grid computing [28]. This motivates the need for this survey on CDC. In summary, our survey has the following contributions:
• We describe the fundamentals of CDC. In particular, we introduce the commonly used distributed computation frameworks for the implementation of coding techniques and algorithms. We then discuss basic CDC schemes.
• We review and discuss a number of CDC schemes to reduce the communication costs for distributed computing. The approaches include file allocation, coded shuffling design, and function allocation. We further analyze and compare the advantages and disadvantages of the CDC schemes.
• We review, discuss, and analyze CDC schemes which mitigate the straggler effects of distributed computing. The approaches include computation load allocations, approximate coding, and exploitation of stragglers.
• We review and present CDC schemes that can improve the privacy and security in distributed computing.
• We analyze and provide insights into the existing approaches and solutions in the CDC literature.
• We present and discuss applications of CDC in modern networks such as NFV and edge computing.
• We highlight challenges and discuss promising research directions related to CDC.

For the reader's convenience, we classify the related CDC studies according to the challenges that need to be handled. In particular, the issues are communication costs, straggler effects, and security. As such, readers who are interested in or working on related issues will benefit from our reviews and in-depth discussions of existing approaches, remaining open problems, and potential solutions.

The rest of this paper is organized as follows. Section II introduces the fundamentals of CDC. Section III presents basic CDC schemes. Section IV reviews CDC approaches that have been proposed to reduce communication costs. Section V discusses CDC approaches that have been proposed to mitigate straggler effects. Section VI presents CDC approaches that have been proposed to enhance privacy and security in distributed computing. Section VII discusses applications of CDC. Section VIII highlights important challenges and promising research directions. Section IX concludes the paper. The structure of the survey is presented in Figure 1. A list of abbreviations commonly used in this paper is given in Table I.

TABLE I: List of common abbreviations used in this paper.
Abbreviation  Description
ARIMA   Auto Regressive Integrated Moving Average
BCC     Batch Coupon's Collector
BGC     Bernoulli Gradient Code
BGW     Ben-Or, Goldwasser, and Wigderson
BPCC    Batch-Processing Based Coded Computing
C3P     Coded Cooperative Computation Protocol
CDC     Coded Distributed Computing
CNN     Convolutional Neural Network
CPGC    Coded Partial Gradient Computation
DAG     Directed Acyclic Graph
DNN     Deep Neural Network
FRC     Fractional Repetition Coding
HCMM    Heterogeneous Coded Matrix Multiplication
IoT     Internet of Things
LCC     Lagrange Coded Computing
LDPC    Low-Density Parity-Check
LT      Luby Transform
MDS     Maximum Distance Separable
MMC     Multi-Message Communication
MPC     Multi-Party Computation
NFV     Network Function Virtualization
PCR     Polynomially Coded Regression
PDA     Placement Delivery Array
SDMM    Secure Distributed Matrix Multiplication
SGC     Stochastic Gradient Coding
SGD     Stochastic Gradient Descent
SVD     Singular Value Decomposition
UAV     Unmanned Aerial Vehicle
II. FUNDAMENTALS OF CODED DISTRIBUTED COMPUTING
Distributed computing has been an important solution to large-scale, complex computation problems which involve massive amounts of data. Various distributed computing models, e.g., cluster computing [29], grid computing [30] and cloud computing [26], [31], have been developed to perform the distributed computation tasks while providing high quality of service (QoS) to the users. Among the distributed computing models, cloud computing has been gaining much popularity recently, as it eliminates the need for users to purchase expensive hardware and software resources: users only need to pay for the cloud services on an on-demand basis according to their usage needs. A comparison between the cluster, grid and cloud computing models is summarized in Table II.

Distributed computing has been widely implemented in a variety of applications, e.g., sensor networks [32], healthcare applications [33], the development of smart cities [34], automated manufacturing processes [35] and vehicular applications [36]. In order to improve the performance of distributed computing systems, various aspects such as resource allocation strategies [26], task allocation strategies [37], [38], scheduling algorithms [39], [40], incentive mechanisms [41], [42], energy efficiency [29], network security [43] and the performance modelling [44] of distributed computing systems have been extensively studied in the literature. However, the performance of distributed computing systems is still limited by the high communication costs and straggler effects which lead to a longer time needed to execute the computation tasks. As a result, recent research has focused on coding techniques to overcome these implementation challenges of distributed computing systems, the aims of which are to minimize the communication load as well as to mitigate the straggler effects.

In this section, we discuss commonly used distributed computation frameworks for the implementation of coding techniques and algorithms.
Note that while the different computation frameworks are useful for different computing applications, we focus specifically on the MapReduce framework [6], as the majority of the research works on CDC schemes are based on the MapReduce computation framework. We also introduce the two main lines of work in CDC, i.e., reducing the communication load and mitigating the straggler effects, which aim to solve the challenges in distributed computing.
A. Coded Distributed Computation Frameworks
While distributed computation frameworks have moved beyond a simple MapReduce framework, the majority of the studies on CDC have focused on the MapReduce framework. MapReduce [6] is a software framework and programming model that runs on a large cluster of commodity machines for the processing of large-scale datasets in a distributed computing environment. The cluster of computers is modelled as a master-worker system which consists of a single master node and multiple workers to store and analyze massive amounts of unstructured data. Due to its scalability and its ability to tolerate machine failures [45], the MapReduce framework is commonly used in a wide range of applications [6], e.g., the analysis of web access logs, the clustering of documents, the construction of web-link graphs that match all source URLs to a target URL, and the development of machine learning algorithms. Generally, the MapReduce computation framework involves the processing of a large input file to generate multiple output pairs, of which each pair consists of a key and a corresponding value. Figure 2 demonstrates the implementation of the conventional MapReduce framework to determine the frequency of occurrence of 4 specific words in the books, where the 4 processing nodes, i.e., the workers, are to compute the 4 output pairs. There are three important phases in the MapReduce computation framework:
TABLE II: Comparison between cluster, grid and cloud computing models [26].

Feature | Cluster | Grid | Cloud
Size | Small to medium | Large | Small to large
Network type | Private, LAN | Private, WAN | Public, WAN
Job management and scheduling | Centralized | Decentralized | Both
Coupling | Tight | Loose/tight | Loose
Resource reservation | Pre-reserved | Pre-reserved | On-demand
Service-level agreement (SLA) constraint | Strict | High | High
Resource support | Homogeneous and heterogeneous (GPU) | Heterogeneous | Heterogeneous
Virtualization | Semi-virtualized | Semi-virtualized | Completely virtualized
Security type | Medium | High | Low
Service-oriented architecture and heterogeneity support | Not supported | Supported | Supported
User interface | Single system image | Diverse and dynamic | Single system image
Initial infrastructure cost | Very high | High | Low
Self service and elasticity | No | No | Yes
Administrative domain | Single | Multi | Both
Fig. 2: Illustration of the conventional MapReduce framework, in which 8 books are split among 4 nodes (2 books each) to count the occurrences of 4 words. The intermediate output pairs are represented by (key, frequency)[book number] and the output pairs are represented by (key, frequency).
1) In the Map phase, there are two stages, namely the allocation of Map tasks and the execution of Map tasks. Generally, a Map task is a function that generates a key-value output pair based on the allocated subfiles. Firstly, as seen in Fig. 2, the master node splits the input file into 8 subfiles of smaller sizes and allocates the subfiles to the 4 workers. Secondly, each worker produces 4 intermediate key-value pairs for each allocated subfile using the Map functions. Since the workers are allocated 2 subfiles each, each worker generates 8 intermediate results.
2) In the Shuffle phase, the workers exchange their computed intermediate results to obtain the required intermediate results for the computation of the Reduce functions. In particular, in each time slot, one of the workers creates a message that contains information of the intermediate output pairs from the Map phase and transmits the message to the other workers. The shuffling process continues until all workers have received the required intermediate output pairs for the Reduce phase.
3) In the Reduce phase, the workers aggregate the 8 intermediate key-value pairs obtained from the Shuffle phase and compute the final result, which is a smaller set of key-value pairs, using the Reduce functions. In particular, the reduction tasks are evenly distributed among the workers. Each Reduce function is responsible for the evaluation of one key. For example, node 1 in Fig. 2 is responsible for the evaluation of "Tree". Therefore, the total number of Reduce functions needed equals the total number of keys of the output, i.e., 4 Reduce functions are needed to compute the 4 output pairs.

Apart from the MapReduce framework, there are other distributed computation frameworks that provide support for the processing of large-scale datasets, such as:
• Spark [46]: It supports applications that need to reuse a working dataset across multiple parallel processes. These applications cannot be expressed as efficiently as acyclic data flows, which are required in popular computation frameworks such as MapReduce. There are two use cases for the implementation of the Spark computation framework: (i) iterative machine learning algorithms which operate on the same dataset repeatedly, and (ii) interactive data analysis tools, where different users query for a subset of data from the same dataset.
• Dryad [47]: By allowing developers to construct their own communication graphs and the subroutines at the vertices through a simple, high-level programming language, Dryad executes large-scale data-intensive computations over clusters consisting of multiple computers. It does not require the developers to express their code in Map, Shuffle and Reduce phases in order to adopt the MapReduce framework for computations. Besides, the Dryad execution engine, which is based on the constructed data flow graph, takes care of the implementation issues of the distributed computation tasks, such as the scheduling of tasks, the allocation of resources and the recovery from communication and computation failures.
• CIEL [48]: The main characteristic of the CIEL computation framework is that it allows data-dependent data flows where the directed acyclic graphs (DAGs) are built dynamically based on the execution of previous computations, rather than being statically predetermined. Instead of maximizing throughput, CIEL aims to minimize the latency of individual tasks, which is very useful for the implementation of iterative algorithms, where latency grows significantly as the number of iterations increases.

TABLE III: Comparison between distributed computation frameworks [48].

Feature | MapReduce [6] | Dryad [47] | CIEL [48]
Dynamic control flow | No | No | Yes
Task dependencies | Fixed (2-stage) | Fixed (DAG) | Dynamic
Fault tolerance | Transparent | Transparent | Transparent
Data locality | Yes | Yes | Yes
Transparent scaling | Yes | Yes | Yes

A comparison between the distributed computation frameworks is presented in Table III.
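The three MapReduce phases described above can be sketched as a minimal, runnable word-count pipeline. The corpus, counts, and function names below are invented for illustration rather than taken from Fig. 2; shuffling is simulated by routing each intermediate pair to the reducer responsible for its key.

```python
from collections import defaultdict

# Toy corpus: each "book" (subfile) is a list of words.
books = {1: ["tree", "bear", "tree"], 2: ["fork", "pots", "tree"],
         3: ["bear", "pots"],         4: ["fork", "tree", "bear"]}

# Map phase: each subfile yields intermediate (key, count)[book] triples.
def map_task(book_id, words):
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return [(key, cnt, book_id) for key, cnt in counts.items()]

intermediate = [p for bid, ws in books.items() for p in map_task(bid, ws)]

# Shuffle phase: route each intermediate pair to the node reducing its key.
reducers = defaultdict(list)
for key, cnt, bid in intermediate:
    reducers[key].append(cnt)

# Reduce phase: one Reduce function per key sums the routed counts.
result = {key: sum(cnts) for key, cnts in reducers.items()}
print(result)  # {'tree': 4, 'bear': 3, 'fork': 2, 'pots': 2}
```

As in the description above, the number of Reduce functions equals the number of output keys, and every intermediate pair must reach the node that reduces its key, which is exactly the traffic the Shuffle phase generates.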
B. Objectives of CDC Schemes
There are two main lines of work in CDC. Firstly, CDC schemes are implemented to minimize the communication load in distributed computing systems. Secondly, CDC schemes aim to mitigate the straggler effects which cause a delay in the computation of the distributed tasks. For each of these objectives, we discuss the importance of solving the issue to improve the performance of distributed computing systems. Then, we briefly discuss the existing solutions that have been proposed in the literature to meet these objectives. However, the current existing solutions do not adopt coding approaches. Different from the existing solutions, CDC schemes are able to meet these objectives by introducing coded redundancy. In fact, CDC schemes outperform the replication methods, e.g., naive replication [18] and the fork-join model [14], [49], which introduce redundancy without coding techniques, in terms of the time taken to execute the tasks.
1) Communication Load:
Among the three phases of the MapReduce computation framework, the Shuffle phase dominates the time required to complete the computation tasks [7], [8], [50], since multiple communications between the processing nodes are needed to exchange their intermediate results. For the Hadoop cluster at Facebook, it is observed that, on average, the data shuffling phase accounts for 33% of the overall job execution time [9]. In fact, the data shuffling phase is more time-consuming when running on heterogeneous clusters with diverse computational, communication and storage capabilities. When running TeraSort [51], a conventional distributed sorting algorithm for large amounts of data, and Self-Join applications on heterogeneous Amazon EC2 clusters, 65% and 70% of the overall job execution time, respectively, is spent on the Shuffle phase [10]. The data shuffling process is also an important step in implementing distributed learning algorithms. In particular, to train machine learning models with distributed algorithms, it is common to shuffle the data randomly and run the algorithms iteratively such that the processing nodes compute a different subset of the data at each iteration until convergence [52]–[54]. For the logistic regression application, which requires at least 100 iterations to converge, 42% of the iteration time is spent on communication [9]. Each time the data shuffling process is performed, the entire training
Fig. 3: Illustration of the naive replication MapReduce framework, in which each node maps 4 of the 8 books (every book is mapped at two nodes) and produces 16 intermediate output pairs.

dataset is communicated over the network, resulting in high communication costs which limit the performance of distributed computing systems.

Since the performance of the data shuffling process has a significant impact on the overall performance of distributed computing systems, it has been extensively studied in the literature, e.g., [55]–[62]. Various data shuffling strategies have been proposed to achieve different objectives, such as minimizing the job execution time, maximizing the utilization of resources and accommodating interactive workloads. While overlapping the Map computations and the Shuffle communications helps to reduce the latency of the distributed computation tasks [55], the computing nodes require large storage capacities for buffering. An efficient and adaptive data shuffling strategy is proposed in [56] to manage the tradeoff between the accumulation of shuffle blocks and the minimization of memory usage, in order to reduce the overall job execution time and improve the scalability of distributed computing systems. In [57], the authors propose a virtual data shuffling strategy which reduces the storage space and traffic load in the network by delaying the actual movement of the data until it is needed to complete the computations in the Reduce phase.

To improve the performance of the data shuffling process, task scheduling algorithms such as the Quincy scheduler [58], the Hadoop Fair Scheduler [61] and the delay scheduling algorithm [62] have also been designed to allocate tasks to the workers. In the design of optimal task scheduling and task selection algorithms, the communication load can be minimized through various approaches, such as optimizing the placement of computation tasks, distributing the computing resources fairly among the nodes and maximizing the resource utilization of the systems.
Since task scheduling schemes are not the focus of this survey, we refer interested readers to the study of [63] and the references therein for more detailed information on scheduling techniques.

One of the ways to reduce the communication load in the Shuffle phase is by repeating the computation tasks [18]. Figure 3 illustrates the implementation of the MapReduce framework using the naive replication method, in which 4 processing nodes are required to compute 4 output pairs, i.e., the same computation task illustrated in Fig. 2. By simply replicating the Map tasks such that each worker computes more intermediate output pairs, i.e., 16 in Fig. 3 instead of 8 in Fig. 2, the communication load is reduced as fewer intermediate output pairs need to be communicated. For example, in the conventional MapReduce framework in Fig. 2, node 1 needs to obtain 6 intermediate output pairs from other workers, whereas in the naive replication scheme in Fig. 3, node 1 only needs to obtain 4 intermediate output pairs from other workers.

However, the aforementioned non-coding methods have limits to which the communication load in the data shuffling phase can be minimized. Given that the naive replication method reduces the communication load in the data shuffling phase by introducing redundancy into the systems (which is also discussed in Section III-A), coding techniques can be used to introduce redundancy to further minimize the communication load, which will be discussed in depth in Section IV.
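To hint at how coding exploits the redundancy that replication introduces, the following toy sketch (with invented node labels and bit-pattern values, in the spirit of the coded shuffling idea of [17]) shows a coded multicast: when each intermediate value is computed at two nodes, a single XOR-coded packet can serve two receivers at once, halving the number of transmissions relative to two unicasts.

```python
# Toy coded-shuffle sketch: 3 nodes, each intermediate value held by 2 nodes.
v = {1: 0b0011, 2: 0b0101, 3: 0b1001}   # intermediate results as bit patterns

has   = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}}   # node -> values it computed
wants = {"A": 3, "B": 1, "C": 2}                  # node -> value it still needs

# Uncoded shuffling needs two unicasts from node A (v[1] to B, v[2] to C).
# Instead, node A multicasts ONE coded packet useful to both receivers:
packet = v[1] ^ v[2]

# Each receiver XORs out the value it already holds to decode the one it wants.
assert packet ^ v[2] == v[wants["B"]]   # node B cancels v[2], recovers v[1]
assert packet ^ v[1] == v[wants["C"]]   # node C cancels v[1], recovers v[2]
```

The side information created by replicated Map tasks is what makes the XOR decodable at both receivers; Section IV reviews schemes that generalize this multicasting gain.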
2) Straggler Effects:
In distributed computing systems, theprocessing nodes may have heterogeneous computing capabil-ities, i.e., different processing speeds. As such, another lineof work in the CDC literature is to solve the bottleneck thatresults from a variation in time taken to complete the allocatedtasks. In distributed computing systems, there are stragglers ,which are the processing nodes that run unexpectedly slowerthan the average or nodes that may be disconnected fromthe network due to several factors such as insufficient power,contention of shared resources, imbalance work allocation andnetwork congestion [64], [65]. As a result, the overall timeneeded to execute the tasks is determined by the slowestprocessing node. We briefly discuss the existing approaches(which are summarized in Table IV) to handle the stragglereffects as follows: • Stragglers Detection:
The most direct approach to mitigate the straggler effects is to detect the stragglers and act on them early in their lifetime. For example, Mantri [65] detects stragglers by identifying the tasks that are processed at a rate slower than the average. The system determines the cause of the delay and implements targeted solutions to mitigate the stragglers. The solutions include restarting the tasks allocated to the stragglers at other processing nodes, optimally allocating the tasks based on network resources, and protecting against interim data loss by replicating the outputs of valuable tasks.
• Work Stealing [66]:
The basic idea of the work stealing algorithm is to allow the faster processing nodes to take over the remaining computation tasks from the slower processing nodes so that the overall job execution time is minimized. Under this approach, the faster processing nodes operate continuously, while the slower processing nodes are left idle for the rest of the computation session once their jobs have been taken over.
• Work Exchange [13]:
By leveraging information about the computational heterogeneity in the system, the master node first allocates the tasks to the workers based on their computational capabilities. Upon receiving the first computed result from any of the workers, the master node pauses the computation process and redistributes the remaining incomplete work among the workers. The process is repeated for a number of iterations until all work is done. Since the workers need to inform the master node of the amount of work done each time the computation process is paused, additional communication costs are incurred. The higher communication costs are also a result of the reallocation of data to the workers.
• Naive Replication:
One of the solutions to handle stragglers in distributed computing systems is to introduce redundancy to minimize computation latency. The computation task is replicated and executed over multiple processing nodes. Since all processing nodes are working on the same computation task, the time required to complete it is determined by the fastest processing node; the partial computations of the remaining processing nodes are discarded. Experiments on Google Trace data [14] have shown the effectiveness of redundancy in minimizing computation latency by eliminating the need for the computed results of the stragglers. However, the introduction of redundancy comes at the expense of higher costs, such as high communication bandwidth and high computation load [14], [49], [67]–[70]. Various redundancy strategies have been analyzed to derive the limiting distribution of the state of the systems [49], [67]. Although the introduction of redundancy helps to reduce latency, the performance varies under different settings; in fact, in some settings, it is optimal to not use any redundancy strategy. Looking into this, the work in [68] presents the optimal redundant-requesting policies under diverse settings.

Similar to the existing methods to reduce communication costs discussed in the previous section (Section II-B1), the existing methods to mitigate the straggler effects do not adopt coding approaches. Coding techniques can also be used to introduce redundancy into the systems to mitigate the straggler effects. The authors in [69], [70] investigate the tradeoff between latency and cost for both replication-redundancy systems and coded-redundancy systems. Coded-redundancy systems outperform replication-redundancy systems in both latency and cost; in other words, by using coding techniques, the latency and cost incurred are lower than those of naive replication.
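The latency advantage of coded over replicated redundancy can be illustrated with a toy simulation. This is only a sketch under the assumption of i.i.d. exponential worker runtimes (the cited works use shifted-exponential and more refined models); both strategies use the same total of 10 workers.

```python
# Sketch: why coded redundancy can beat naive replication for stragglers.
# Assumes i.i.d. Exp(1) worker runtimes, purely for illustration.
import random

random.seed(0)

def replication_time(k: int, r: int) -> float:
    """k subtasks, each replicated on r workers; the job finishes when
    every subtask's fastest replica is done."""
    return max(min(random.expovariate(1.0) for _ in range(r))
               for _ in range(k))

def mds_time(n: int, k: int) -> float:
    """(n, k) MDS-coded: n workers, finish when the k fastest are done."""
    times = sorted(random.expovariate(1.0) for _ in range(n))
    return times[k - 1]

trials = 5000
rep = sum(replication_time(k=5, r=2) for _ in range(trials)) / trials
mds = sum(mds_time(n=10, k=5) for _ in range(trials)) / trials
print(f"replication: {rep:.2f}, MDS-coded: {mds:.2f}")
```

Under this model the (10, 5) MDS strategy waits only for the 5 fastest of 10 workers, which gives a noticeably lower average completion time than replicating each of 5 subtasks twice.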
The use of coding techniques to mitigate the straggler effects is discussed in more detail in Section III-B and Section V.

TABLE IV: Approaches to mitigate the straggler effects.

Approach | Key Ideas | Shortcomings
Stragglers Detection | Detects the straggling nodes, determines the cause of delay and implements targeted solutions | Difficult to identify the cause of delays
Work Stealing | Reallocates the remaining computation tasks from the slower workers to the faster workers | The capabilities of the slower workers are not maximized
Work Exchange | Reallocates the computation tasks every time a worker completes its tasks | Incurs high communication costs due to the feedback from the workers to the master node as well as the reallocation of data
Naive Replication | Introduces redundancy where each computation subtask is performed by more than one processing node | Incurs high communication costs and computation load
Coded Redundancy | Uses coding techniques to introduce redundancy such that the master node can recover the final result from any decodable set of workers | Still incurs high communication costs and computation load, but lower than that of naive replication

To better understand the proposed CDC schemes, some of the commonly used performance metrics of distributed computing systems are defined as follows:
1) Storage space is defined as the total number of files stored across the K processing nodes, normalized by the total number of subfiles N [71].
2) Computation load, represented by r, where 1 ≤ r ≤ K, is defined as the total number of Map functions computed across the K processing nodes, normalized by the total number of subfiles N [17]. In particular, when r = 1, each Map function is computed by only a single processing node; when r = 2, each Map function is computed by two processing nodes on average.
3) Communication load, represented by L, where 0 ≤ L ≤ 1, is defined as the total number of bits communicated by the K processing nodes in the Shuffle phase, normalized by the total number of subfiles N [17].

Given that coding techniques are able to solve the aforementioned implementation challenges of distributed computing systems, we review various proposed CDC schemes, which are the main focus of this paper. In the following section, we present a tutorial on the basic CDC schemes along these two lines of work, i.e., minimizing the communication load and mitigating the straggler effects, which is useful for better understanding the related works discussed in Sections IV, V and VI.

III. CODED DISTRIBUTED COMPUTING (CDC) SCHEMES
Recently, coding techniques have become a popular approach to solve the challenges of distributed computing systems. As mentioned previously, there are two main lines of work in CDC: (i) reducing the communication costs and (ii) mitigating the straggler effects. In this section, we introduce the two basic CDC schemes, which are the first works to show the effectiveness of using coding techniques to solve these two challenges separately. Then, we discuss a unified CDC scheme that characterizes the tradeoff between computation latency and communication load.
A. CDC to Minimize Communication Load
In the conventional MapReduce computation framework shown in Fig. 2, after the input file is split into multiple subfiles, each subfile is mapped to only one of the processing nodes, i.e., the workers. The naive replication scheme, i.e., an uncoded data shuffling scheme that relaxes this restriction, can reduce the communication costs of the system by allowing each subfile to be replicated and mapped to more than one processing node. In the example illustrated in Fig. 3, each subfile is repeated twice. Hence, as compared to the conventional MapReduce framework, each processing node has more Map tasks to perform in the naive replication
scheme. However, by simple replication, the communication load in the Shuffle phase decreases, and this gain is known as the repetition gain. Specifically, the communication load for uncoded schemes, which include both the conventional MapReduce framework and the naive replication scheme, is given as follows [17]:

L_uncoded(r) = 1 − r/K,   (1)

where K is the number of processing nodes in the network. Based on Equation (1), the communication loads achieved by the conventional MapReduce framework in Fig. 2 and the naive replication scheme in Fig. 3 correspond to r = 1 and r = 2 respectively.

To further reduce the communication load, i.e., to increase the repetition gain, the Coded MapReduce computation framework is proposed in [18], where the Map tasks are carefully distributed among the processing nodes and the messages are encoded for transmission in the Shuffle phase using coding theory. Figure 4 illustrates the Coded MapReduce framework with 4 processing nodes computing the 4 output pairs. After the Map phase, a processing node multicasts a bit-wise XOR, denoted by ⊕, of two computed intermediate pairs, satisfying the requirements of two other processing nodes simultaneously. For example, node 1 multicasts a bit-wise XOR of "Bear" and "Fork" to both nodes 2 and 3, which involves the transmission of only one packet of information, instead of the two packets needed if the information were sent separately to the nodes in a unicast manner. Since the intermediate output pairs are now coded, there is an additional step of decoding before the Reduce functions are applied. Given the coded "BearFork" information, node 2 is able to decode and recover the required "Bear" information by cancelling the "Fork" information, since node 2 has also computed the same "Fork" information. Similarly, node 3 can recover the "Fork" information by cancelling the "Bear" information. The simulation results in [18] show that Coded MapReduce reduces the communication load by 66% and 50% as compared to the conventional MapReduce framework and the naive replication scheme, respectively.

Fig. 4: Illustration of the Coded MapReduce framework.

Since the use of coding techniques reduces both the latency and cost of distributed computing systems [70], a more generalized framework known as the Coded Distributed Computing (CDC) scheme is introduced in [17]. The study of [17] presents the fundamental inverse relationship between computation load and communication load. Specifically, the communication load in the Shuffle phase can be reduced by a factor r by increasing the computation load in the Map phase by the same factor r, as shown in Fig. 5. The communication load achieved by the CDC framework, L_coded, is given as follows:

L_coded(r) = (1/r)(1 − r/K).   (2)

Note that the information-theoretic lower bound derived on the minimum communication load L*(r) equals L_coded(r) of the CDC framework. As such, the optimal tradeoff between the computation load and the communication load is characterized as follows [17]:

L*(r) = L_coded(r) = (1/r)(1 − r/K),   r ∈ {1, . . . , K}.   (3)

Fig. 5: Comparison of communication load between the CDC scheme and the uncoded scheme [17].

From Equation (1), which applies to the uncoded computation schemes, the communication load L decreases linearly as the computation load r increases. However, when the number of processing nodes K becomes large, increasing the computation load has no significant impact on the communication load. On the other hand, for the proposed CDC framework, the communication load is inversely proportional to the computation load (Equation (2)). Even when K becomes large, an increase in the computation load still significantly reduces the communication load.

Since the proposed CDC framework can be applied to any distributed computation framework with an underlying MapReduce structure, the performance of the CDC framework on TeraSort [51] is evaluated. Experimental results on Amazon EC2 clusters show that the Coded TeraSort scheme [72], a coded distributed sorting algorithm, reduces the overall job execution time by factors of 2.16 and 3.39 with 16 processing nodes and computation loads of r = 3 and r = 5, respectively, as compared to the uncoded TeraSort scheme.

Previously in [17], the computation load is linearly dependent on the number of replicated Map tasks, i.e., the load redundancy, as each processing node is assumed to compute all intermediate values for all subfiles allocated in its memory. However, the processing nodes can be selective in choosing the intermediate values to compute. As such, the load redundancy is no longer a direct measure of the computation load. In other words, the storage constraints do not necessarily imply computation constraints. Building on the work in [17], the authors in [73] propose an alternative tradeoff between computation load and communication load under a predefined storage constraint. In fact, the computation load is quadratic in terms of the load redundancy. By taking the load redundancy into consideration, an alternative computation-communication tradeoff curve is derived.
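The loads in Equations (1) and (2) can be checked numerically. For the four-node example, the CDC load at r = 2 is consistent with the 66% and 50% reductions reported for Coded MapReduce relative to the conventional (r = 1, uncoded) and naive replication (r = 2, uncoded) schemes.

```python
# The uncoded and coded communication loads of Equations (1) and (2).
def L_uncoded(r: int, K: int) -> float:
    return 1 - r / K

def L_coded(r: int, K: int) -> float:
    return (1 / r) * (1 - r / K)

K = 4
print(L_uncoded(1, K))  # conventional MapReduce, r = 1
print(L_uncoded(2, K))  # naive replication, r = 2
print(L_coded(2, K))    # CDC, r = 2: a further factor-r reduction
```

Note that L_coded(r, K) = L_uncoded(r, K) / r, which is exactly the factor-r coded multicast gain described in the text.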
In particular, this alternative tradeoff curve is especially relevant to processing nodes that do not have sufficient resources or time to perform the computations for all the allocated subfiles. Given that the processing nodes can only perform a limited amount of computation below the computation load threshold, the alternative tradeoff curve proposed in [73] accurately defines the communication load needed for the distributed computation tasks.

Since the processing nodes are not required to compute all intermediate results that can be obtained from their locally stored data, the storage capabilities of the processing nodes should be considered in the CDC computation framework in [17]. The study of [71] characterizes the tradeoff between storage, computation and communication, where the minimum communication load is determined given the storage and computational capabilities of the processing nodes. In particular, the optimal computation curve is obtained by characterizing the optimal storage-communication tradeoff given the minimum computation load. As a result, the triangles between the optimal communication curve and the optimal computation curve reflect the Pareto-optimal surface of all achievable storage-computation-communication triples. However, as the number of processing nodes in the system increases, the number of input files required increases exponentially, resulting in an increase in the number of transmissions needed and hence high communication costs. As such, it is important to reduce the number of input files, which is discussed later in Section IV-A3.

Fig. 6: Illustration of coded computation with 3 workers. The master node is able to recover the final result upon receiving the computed results from any 2 workers.
B. CDC to Mitigate the Straggler Effects
Apart from reducing the communication load in the Shuffle phase of the MapReduce framework, coding techniques can also be used to alleviate the straggler effects. Since matrix multiplication is one of the most basic linear operations used in distributed computing systems, a coded computation framework is proposed in [19] to minimize the computation latency of distributed matrix multiplication tasks. The coded computation framework uses erasure codes to generate redundant intermediate computations. In particular, the master node encodes equal-sized data blocks, i.e., submatrices, and distributes them to the workers to compute the local functions. Upon completion, the workers transmit the computed results to the master node. The master node can recover the final result by using the decoding functions once the local computations from any decodable set are completed. As seen in Fig. 6, the master node can recover the final result upon receiving the computed results from any 2 workers, instead of all 3 workers. As such, the total computation time is not determined by the slowest straggler, but by the time when the master node receives computed results from some decodable set of indices. In this work [19], the authors explore the effectiveness of encoding the submatrices using maximum distance separable (MDS) codes [88] to mitigate the effects of stragglers.

Considering K workers and a shifted-exponential distribution for the job execution time of the distributed algorithm, the simulation results show that the optimal repetition-coded distributed algorithm achieves a lower average job execution time when the straggling parameter is smaller than one, i.e., µ < 1, but is still slower than the optimal MDS-coded distributed algorithm by a factor of Θ(log K).
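The three-worker setup of Fig. 6 can be sketched as a (3, 2) MDS code over matrix blocks: A is split into A1 and A2, worker 3 receives A1 + A2, and the master decodes A·x from any two of the three results. The decoding below is written out by hand for this small example.

```python
# Sketch of the Fig. 6 setup: a (3, 2) MDS code over matrix blocks.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 3)).astype(float)
x = rng.integers(0, 10, size=3).astype(float)

A1, A2 = A[:2], A[2:]
tasks = {1: A1, 2: A2, 3: A1 + A2}              # encoded blocks per worker
results = {i: M @ x for i, M in tasks.items()}  # each worker's local product

def recover(done):
    """Decode A @ x from the results of the two workers in `done`."""
    if done == {1, 2}:
        y1, y2 = results[1], results[2]
    elif done == {1, 3}:
        y1 = results[1]
        y2 = results[3] - results[1]   # A2 x = (A1 + A2) x - A1 x
    else:  # done == {2, 3}
        y2 = results[2]
        y1 = results[3] - results[2]   # A1 x = (A1 + A2) x - A2 x
    return np.concatenate([y1, y2])

# Any 2 of the 3 workers form a decodable set.
for done in [{1, 2}, {1, 3}, {2, 3}]:
    assert np.allclose(recover(done), A @ x)
```

Whichever worker straggles, the master never waits for it, which is the essence of the latency gain in [19].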
However, the storage costof the coded distributed algorithm is higher than that of the uncoded distributed algorithm as more data is required to bestored at the workers’ sites for the coded distributed algorithm.The proposed algorithm is tested on an Amazon EC2 clusterand is compared against various parallel matrix multiplicationalgorithms, e.g., block matrix multiplication, column-partitionmatrix multiplication and row-partition matrix multiplication.The simulation results show that the proposed algorithm in[19] performs better where the coded matrix multiplicationachieves and . reduction in average job executiontime on clusters of m1-small and c1-medium instances with10 workers each respectively as compared to the best of thethree uncoded distributed algorithms.Although the MDS codes proposed in [19] is able tomitigate the straggler effects, it cannot be generalized to alltypes of computation tasks. In order to mitigate the stragglereffects of different distributed computation tasks, the codingtechniques can be designed by exploiting the algebraic struc-tures of the specific operations. An important performancemetric that is introduced in the proposed CDC schemes is therecovery threshold, which refers to the worst-case requirednumber of workers the master needs to wait to recover thefinal result for job completion [78]. The smaller the recoverythreshold, the shorter the computation latency. The objectiveis to reduce the recovery threshold so that the final result canbe recovered by waiting for a smaller number of workers, thuscontributing to a reduction in computation latency. Here, wediscuss the coding techniques for various types of computationtasks, namely (i) matrix-vector multiplications, (ii) matrix-matrix multiplications, (iii) gradient descent, (iv) convolutionand Fourier transform. Table V summarizes the coding tech-niques designed for different distributed computation tasks.
1) Matrix-vector multiplications:
Distributed matrix-vector multiplications are the building blocks of linear transformation computations, which are an important step in machine learning and signal processing applications. In particular, the computation of linear transformations on high-dimensional vectors is required for popular dimensionality reduction techniques such as Linear Discriminant Analysis (LDA) [89] and Principal Component Analysis (PCA) [90].

Instead of using the MDS codes proposed in [19], the authors in [74] propose the use of Luby Transform (LT) codes to mitigate the straggler effects in distributed matrix-vector multiplication problems. Different from the works in [91] and [80], which use LT codes in fixed-rate settings, the rateless property of the LT codes can be exploited to generate an unlimited number of encoded symbols from a finite set of source symbols. There are several advantages of using rateless codes: (i) near-ideal load balancing, (ii) negligible redundant computation, (iii) maximum straggler tolerance, and (iv) low decoding complexity. To further reduce the latency for practical implementations, blockwise communication can be used to transmit the submatrix-vector products. Instead of transmitting each encoded row-vector product separately to the master node, the workers are allowed to transmit the computed results in blocks, where each block comprises a few row-vector products, reducing the number of communication rounds needed and hence minimizing the time needed to complete the computation tasks.

TABLE V: Coding techniques to mitigate the straggler effects.

Problems | Ref. | Coding Schemes | Key Ideas
Matrix-Vector | [19] | MDS Codes | Reduce the computation latency as the master node is able to recover the final result without waiting for the slowest processing node
Matrix-Vector | [74] | LT Codes | Exploit the rateless property to generate an unlimited number of encoded symbols from a finite set of source symbols
Matrix-Vector | [75] | Short-Dot Codes | Reduce the length of dot-products computed at the processing nodes by introducing sparsity to the encoded matrices
Matrix-Vector | [76] | s-Diagonal Codes | Exploit the diagonal structure of the matrices to achieve both optimal recovery threshold and optimal computation load
Matrix-Matrix | [77] | Product Codes | Instead of encoding the matrices along one dimension as in the MDS-coded schemes, encode the matrices with MDS codes along both dimensions, i.e., rows and columns
Matrix-Matrix | [78] | Polynomial Codes | Design the algebraic structure of the encoded matrices such that the MDS structure is found in both the encoded matrices and the intermediate computations; reconstruct the final results by solving a polynomial interpolation problem
Matrix-Matrix | [79] | MatDot Codes | Achieve a lower recovery threshold than Polynomial Codes [78] at the expense of higher communication costs by computing only the relevant cross-products
Matrix-Matrix | [79] | PolyDot Codes | Characterize the tradeoff between recovery threshold and communication costs, where Polynomial Codes [78] and MatDot Codes are the two extreme ends of this tradeoff curve
Matrix-Matrix | [80] | Sparse Codes | Exploit the sparsity in both input and output matrices to reduce computation load, while achieving a near-optimal recovery threshold
Gradient Descent | [81] | Fractional Repetition Coding | Divide the workers into multiple groups and divide the data among the workers in each group; each partition of data is processed by more than one worker
Gradient Descent | [81] | Cyclic Repetition Coding | Allocate the data based on a cyclic assignment strategy
Gradient Descent | [82] | Cyclic MDS Codes | The entries of the columns of the encoding matrix are cyclic shifts of the entries of the first column
Gradient Descent | [83] | Reed-Solomon Codes | Use a balanced mask matrix and choose appropriate codewords from the RS codes to construct the encoding matrix
Gradient Descent | [84] | Batch Coupon's Collector | Divide the data into multiple batches which are allocated randomly to the workers; communication costs are significantly reduced as there is no need for communication between the workers or for feedback from the master node to the workers
Gradient Descent | [85] | Polynomially Coded Regression | Encode the data batches directly instead of the computed intermediate results
Convolution | [86] | Coded Convolution | Split both vectors into multiple parts of specified length and encode one of the vectors with MDS codes
Fourier Transform | [87] | Coded Fourier Transform | Leverage the recursive structure and the linearity of the discrete Fourier transform operations
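The rateless property can be conveyed with a small sketch. Note the caveat: real LT codes use a sparse degree distribution and low-complexity peeling decoding; the sketch below uses dense random row combinations and a least-squares solve purely to illustrate that the master can keep generating encoded symbols until enough independent ones arrive.

```python
# Rateless-style sketch of coded matrix-vector multiplication: encoded
# rows g @ A can be generated indefinitely; decoding succeeds once any
# m linearly independent encoded row-vector products are received.
# (Not a true LT code: no degree distribution, no peeling decoder.)
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

received_G, received_y = [], []
while True:
    g = rng.integers(0, 2, size=m).astype(float)  # random combination of rows
    received_G.append(g)
    received_y.append((g @ A) @ x)                # one worker's encoded product
    G = np.array(received_G)
    if np.linalg.matrix_rank(G) == m:             # enough symbols collected
        Ax = np.linalg.lstsq(G, np.array(received_y), rcond=None)[0]
        break

assert np.allclose(Ax, A @ x)
```

The stopping rule ("collect until decodable") rather than a fixed code rate is what gives rateless schemes their near-ideal load balancing and straggler tolerance.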
The authors in [75] propose Short-Dot codes to perform the computation of linear transforms reliably and efficiently in the presence of straggling nodes. Specifically, the processing nodes compute shorter dot products by imposing sparsity on the encoded submatrices. However, there is a tradeoff between the optimal recovery threshold and the length of the dot-products: the master node needs to wait for computed results from more processing nodes when the length of the dot-products is shorter. The experimental results on the classification of hand-written digits from MNIST show that the Short-Dot codes achieve a 32% faster expected computation time than the MDS codes [19].

Although the Short-Dot codes [75] can offer a lower recovery threshold, the greater length of the dot-products means a greater computation load for the processing nodes. With this concern, s-diagonal codes [76] are proposed to achieve both the optimal recovery threshold and the optimal computation load by exploiting the diagonal structure of the encoding matrix. The computation time can be further reduced by using a low-complexity hybrid decoding algorithm which combines the peeling decoding algorithm and Gaussian elimination techniques.
2) Matrix-matrix multiplications:
For large-scale distributed matrix-matrix multiplications, the coded computation schemes based on MDS codes are no longer suitable, as the encoding and decoding processes scale with the system size. Besides, the size of one of the matrices is assumed to be small enough to allow individual workers to perform the computations [77], restricting the implementation of MDS codes in large-scale multiplications. Hence, for large-scale problems, coded schemes not only need to achieve low computation time, but also require efficient encoding and decoding algorithms in order to minimize the overall job execution time. To deal with the straggler effects in high-dimensional distributed matrix multiplications, four types of coded computation schemes are proposed:
• Product codes [77]:
Product codes are implemented by building a larger code upon smaller MDS codes. Instead of encoding computations along only one dimension as in MDS-coded schemes, the product codes encode computations along both dimensions, i.e., the rows and columns of the matrices. When the number of backup workers increases sub-linearly with the number of subtasks, the product-coded schemes outperform the MDS-coded schemes in terms of average computation time and decoding time. In the linear regime, the one-dimensional decoding of the MDS-coded schemes is sufficient to recover the missing entries of the computation results. By allowing each row and column of the MDS constituent codes to have different code rates [92], the average computation time can be further reduced, contributing to a decrease in the overall job execution time. Product codes can also be used to solve higher-dimensional linear operations such as tensor operations by exploiting the tensor-structured encoding matrix [93]. To reduce the decoding time of the product codes, efficient decoding algorithms such as Reed-Solomon codes and LDPC codes can be explored.
• Polynomial codes [78]:
The key advantage of the polynomial codes in solving large-scale matrix multiplication problems is that they achieve the optimal recovery threshold. For polynomial codes, the recovery threshold does not scale with the number of workers involved, whereas for the MDS codes and the product codes, the recovery thresholds scale proportionally with the number of workers. By taking advantage of the algebraic structure of the polynomial codes, the master node can recover the final result by using polynomial interpolation algorithms, e.g., Reed-Solomon decoding, to decode the computation results from the workers. In addition to the optimal recovery threshold, the polynomial-coded schemes achieve the minimum possible computation latency and communication load for distributed matrix multiplications. However, as the number of workers increases, the encoding and decoding costs are much higher than those of the product codes. Furthermore, by implementing Reed-Solomon codes, there is a limit to the number of workers that can be handled, which is restrictive for practical implementations where the systems may involve up to thousands of nodes. As an extension to the polynomial codes proposed in [78], the entangled polynomial code proposed in [94] achieves a lower recovery threshold, which is only half of that achieved by the PolyDot codes [79], to be discussed later. Different from the polynomial codes, which only allow column-wise partitioning of the matrices, the entangled polynomial codes allow arbitrary partitioning of the input matrices and evaluate only a subspace of bilinear functions such that unnecessary multiplications are avoided. The issue of numerical stability has also received attention to ensure the scalability of the polynomial-coded schemes [95].
• PolyDot codes [79]:
PolyDot codes characterize the tradeoff between the recovery threshold and the communication costs, where the polynomial codes and the MatDot codes are special instances of this coding framework, representing the two extreme ends of this tradeoff: minimizing either the recovery threshold or the communication costs. In particular, the MatDot codes achieve a lower recovery threshold than the polynomial codes at the expense of much higher communication costs. This is achieved by computing only the relevant cross-products of the submatrices. Building on the work of PolyDot codes [79], the Generalized PolyDot codes [96] are used to compute matrix-vector multiplications and achieve the same recovery threshold as the entangled polynomial codes [94]. More importantly, the Generalized PolyDot codes can be extended for the training of large deep neural networks (DNNs), which consist of multiple non-linear layers.

Fig. 7: Fractional repetition coding with 6 workers and 2 stragglers.
• Sparse codes [80]:
Although the polynomial codes [78] achieve the optimal recovery threshold, the computation loads of the workers increase due to the increased density of the input matrix, resulting in an undesirable increase in the overall job execution time. By exploiting sparsity, i.e., the number of zero entries of the encoded matrix, not only is the recovery threshold kept low, but the computation loads of the workers also decrease while maintaining a nearly linear decoding time [97], jointly contributing to a shorter overall job execution time. The basic idea of the algorithm proposed in [97] is to allow the master node to find a linear combination of row vectors such that only the particular relevant sub-blocks are recovered. Then, the entire block of the matrix can be recovered by aggregating the partial recovery of sub-blocks. Simulation results show that the sparse codes require the shortest overall time to complete the job as compared to other computation schemes, e.g., the uncoded scheme, product codes [77], polynomial codes [78], sparse MDS codes [75] and LT codes [74]. A further analysis of the different components of the subtasks, i.e., communication time, computation time and decoding time, shows that the sparse codes require a much shorter time to decode, thus contributing significantly to the shorter overall job execution time.
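The polynomial-code construction above can be sketched for the smallest non-trivial case, m = n = 2 blocks. Worker i evaluates A~(xi) = A0 + A1·xi and B~(xi) = B0 + B1·xi², so its product is a degree-3 matrix polynomial whose coefficients are the four blocks of C = A·B; any 4 of the N workers suffice (recovery threshold mn = 4). This is an illustrative floating-point sketch, not the finite-field construction of [78].

```python
# Sketch of polynomial codes for C = A @ B with m = n = 2 blocks.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
A0, A1 = A[:2], A[2:]          # row blocks of A
B0, B1 = B[:, :2], B[:, 2:]    # column blocks of B

N = 6                          # workers; the recovery threshold is mn = 4
xs = np.arange(1.0, N + 1)     # distinct evaluation points
# Worker i returns A~(xi) @ B~(xi)
#   = A0 B0 + A1 B0 * xi + A0 B1 * xi^2 + A1 B1 * xi^3.
products = {i: (A0 + A1 * xs[i]) @ (B0 + B1 * xs[i] ** 2) for i in range(N)}

# Master waits for ANY 4 workers (two stragglers never respond here),
# then interpolates the degree-3 polynomial entrywise.
done = [0, 2, 3, 5]
V = np.vander(xs[done], 4, increasing=True)     # Vandermonde in 1, x, x^2, x^3
evals = np.stack([products[i] for i in done])   # shape (4, 2, 2)
coeffs = np.einsum('ij,jkl->ikl', np.linalg.inv(V), evals)
C00, C10, C01, C11 = coeffs                     # blocks of C, in degree order

C = np.block([[C00, C01], [C10, C11]])
assert np.allclose(C, A @ B)
```

The choice of exponents (1 for A, 2 for B) is what makes all four cross-products appear as distinct polynomial coefficients, so a single interpolation recovers every block of C at once.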
3) Gradient Descent:
Apart from matrices, coding techniques can be applied to recover batch gradients of any loss function in distributed gradient descent tasks. In [81], the authors introduce the idea of gradient coding, which is useful to mitigate stragglers that may slow down the computation tasks. Two gradient coding schemes are proposed, namely (i) fractional repetition coding (FRC) and (ii) cyclic repetition coding. In the FRC scheme, the workers are first divided into several groups. In each group, the data is equally divided and allocated to the workers. As a result, all groups of workers are replicas of each other, as shown in Fig. 7. Upon completing their subtasks, the workers in each group transmit the sum of their partial gradients to the master node. In the cyclic repetition coding scheme, the data partitions are allocated to the workers based on a cyclic assignment strategy. The partial gradients computed by each worker are encoded by linearly combining them, and the result is transmitted as a single coded message to the master node. By applying the gradient coding schemes, the distributed computation tasks do not suffer from delays incurred by the straggling nodes, as the master node is able to recover the final result from the results of the non-straggling nodes. Other coding theories, such as the cyclic MDS codes [82] and the Reed-Solomon codes [83], can be used to compute exact gradients for distributed gradient descent problems.

To efficiently mitigate the straggler effects in distributed gradient descent algorithms, the Batch Coupon's Collector (BCC) scheme is proposed in [84]. In BCC, there are two important steps, namely (i) batching and (ii) coupon collecting. In batching, the training set is partitioned into batches which are distributed to the workers randomly, whereas in coupon collecting, the master node collects the computed results from the workers until the results from all batches of data are received.
This decentralized BCC scheme does not require any communication between the worker nodes, and each worker is allocated data batches independently of the other workers. As a result, the BCC scheme is easy to implement in practical scenarios. Another important advantage of the BCC scheme is its universality. Different from other coding schemes that are designed to guarantee robustness to a fixed number of straggling nodes, the BCC scheme does not require any prior knowledge about the straggling nodes, which is more practical as it is difficult to estimate the number of straggling nodes present in the clusters. Furthermore, the BCC scheme can be easily extended to solve gradient descent problems over heterogeneous clusters where the workers have different computational and communication capabilities. The simulation results show that the BCC scheme speeds up the overall job execution time by up to 85.4% and 69.9% over the uncoded scheme and the cyclic repetition coding scheme [81], respectively. The gradient coding schemes proposed in [81] illustrate the tradeoff between computation load and straggler tolerance. However, in non-linear learning tasks, communication costs dominate the overall job execution time as the number of iterations increases. As such, to generalize the coding schemes in [81], the authors in [98] incorporate communication costs into their framework and present a fundamental tradeoff between three parameters, namely computation load, straggler tolerance and communication costs. In particular, for a fixed computation load, the communication costs can be reduced by waiting for more workers. Instead of encoding the partial gradients computed based on uncoded data as seen in the studies of [81]–[83], coding techniques can be applied directly to the data batches to reduce the straggler effects and the overall job execution time.
Considering the gradient computations for least-squares regression problems, the polynomially coded regression (PCR) scheme [85] exploits the underlying algebraic properties to generate coded submatrices that are linear combinations of the uncoded input matrices. The master node can evaluate the final gradient by interpolating the polynomials from the partial gradients computed by the workers. Compared to the gradient coding schemes proposed in [81], the simulation results show that the PCR scheme achieves a much lower recovery threshold and hence shorter computation and communication time, resulting in a shorter overall job execution time.
4) Convolution and Fourier transform:
The polynomial codes proposed in both the studies of [78] and [94] can be extended to applications of distributed coded convolution based on the coded convolution scheme proposed in [86]. The work in [86] explores the use of MDS codes to encode the pre-specified vectors such that fast convolution is performed under deadline constraints. In addition, MDS codes can be used to mitigate the straggler effects in widely-implemented distributed discrete Fourier transform operations [87], which are used in many applications such as machine learning algorithms and signal processing frameworks.
C. Unified CDC Scheme
Given the aforementioned coding schemes of [17] and [19], we can observe that coding techniques are used to speed up distributed computing applications in two different ways. On one hand, the authors in [17] propose the "Minimum Bandwidth Code", which minimizes the communication load by repeating the computation tasks in the Map phase to introduce multicasting opportunities in the Shuffle phase. On the other hand, the authors in [19] propose the "Minimum Latency Code", which minimizes the computation latency by encoding the Map tasks such that the master node is able to recover the final result without waiting for the straggling processing nodes. Inspired by these approaches, a unified coded scheme that characterizes the tradeoff between computation latency and communication load given the computation load is proposed in [99]. The coding schemes in [17] and [19] are the extreme cases of this unified scheme. The unified coded scheme exploits the advantages of the two coding approaches by applying MDS codes to the Map tasks and replicating the encoded Map tasks. Specifically, the unified coded scheme first encodes the rows of the matrix, following which the coded rows of the matrix are replicated and stored at the processing nodes in a specific pattern. Then, the processing nodes perform the computation until a certain number of the fastest processing nodes complete their tasks. To reduce the communication load in the Shuffle phase, coded multicasting is used to exchange the intermediate results that are needed to recover the final results in the Reduce phase. An improvement to the latency-communication tradeoff presented in the unified coded scheme [99] is proposed in [100] by leveraging the redundancy created by the repetition code.
By increasing the redundancy rate of the repetition code, both the communication load in the Shuffle phase and the computation latency in the Map phase can be simultaneously improved, thus contributing to an improved latency-communication tradeoff. The aforementioned initial works on coding schemes have shown their effectiveness in minimizing communication costs and alleviating the straggler effects. In the following sections, we review related works that leverage coding techniques to address the implementation challenges of distributed computing systems.

IV. MINIMIZATION OF COMMUNICATION LOAD
With more computing nodes that are equipped with greater capabilities to collect and process data, massive amounts of data are generated for the computation of user-defined tasks. Since the computations are scaled out across a large number of distributed computing nodes, a large number of intermediate results need to be exchanged between the computing nodes in the Shuffle phase of the MapReduce framework to complete the computation tasks, resulting in significant data movement. Oftentimes, for the training of a model with distributed learning algorithms, data is shuffled at each iteration, contributing to high communication costs, which is a bottleneck of distributed computing systems. As a result, there is a need to reduce the communication costs in order to speed up the distributed computation tasks. In this section, we present four approaches to reduce communication costs:

• File Allocation:
In this approach, the studies aim to design an optimal file allocation strategy that considers the heterogeneous capabilities of the processing nodes in the systems, maximizes data locality, or reduces the subpacketization level, which refers to the number of subfiles generated [101], [102]. These different approaches work towards reducing the communication load in distributed computing systems.

• Coded Shuffling Design:
Since the data shuffling phase incurs a large proportion of the communication costs, data is encoded before it is transmitted so that the communication load can be minimized. Apart from combining coding with different techniques, e.g., compression and randomization techniques, to improve the performance of the shuffling phase, the coding techniques are also designed to solve different computation problems, e.g., distributed graph computation problems [103], [104] and multistage MapReduce computations [105].

• Consideration of Underlying Network Architecture:
Generally, the communications between the workers, as well as between the workers and the master node, are affected by the way that they are connected to each other. For example, the server-rack architecture (Fig. 8) is one of the most commonly used methods to connect the various servers. By taking the underlying architecture into consideration, the effectiveness of the coding implementation in reducing communication costs can be greatly improved.

• Function Allocation:
Similar to the allocation of files, the studies apply this approach to heterogeneous systems. In addition, some studies consider a cascaded system [106] where each Reduce function is allowed to be computed at multiple processing nodes. In some cases where the data is randomly stored at the processing nodes, e.g., when the processing nodes are constantly moving, an optimal function allocation strategy is useful in reducing the number of broadcast transmissions and thus minimizing the communication load.
A. File Allocation
The design of the file allocation at each processing node is one of the major steps in the implementation of a CDC scheme. There are a few approaches to an optimal file allocation strategy: (i) considering heterogeneous systems, (ii) maximizing data locality, and (iii) reducing the subpacketization level.
1) Considering Heterogeneous Systems:
As discussed in Section III-A, although the CDC scheme proposed in [17] carefully allocates the subfiles to the processing nodes in order to introduce coded multicasting opportunities, it considers a homogeneous system, which may not be useful for practical implementation. In order to appropriately allocate the files to the distributed computing nodes, heterogeneous systems, where the processing nodes have diverse storage, computational and communication capabilities, should be considered in determining the optimal file allocation strategy and coding scheme that minimize the communication load [107]. By leveraging the extra storage capacity of the workers, the communication costs between the master node and the workers in the process of data shuffling are minimized. The reason is that if more data can be stored at the workers, fewer communication rounds are needed for the workers to receive shuffled data from the master node. In the extreme case, if a worker can store the entire dataset, no communication is needed for that worker to receive shuffled data in any iteration. As a result, there is a tradeoff between the storage capacity of the workers and the communication overhead in the data shuffling process. The data shuffling process consists of two phases, namely data delivery and storage update. Instead of a random storage placement [19], a deterministic and systematic storage update strategy [108] creates more coding opportunities in transmitting data to the workers at each iteration, reducing the communication load.
2) Maximizing Data Locality:
One of the important factors in determining the optimal file allocation strategy is data locality. Data locality is defined as the percentage of local tasks over the total number of Map tasks, i.e., the fraction of Map tasks that are allocated to processing nodes having the required data for the computations, such that no communication is needed to obtain the data. High data locality means that less communication bandwidth is needed for the transmission of subfiles, which is required if a processing node does not have the needed subfiles for the execution of its Map tasks. In order to maximize data locality, the problem of allocating Map tasks to different processing nodes can be tackled by solving a constrained integer optimization problem [109].
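As a toy stand-in for the integer program of [109] (node names, subfile names, and data are hypothetical), a greedy heuristic illustrates the data-locality metric:

```python
def place_map_tasks(task_input, node_files):
    """Assign each Map task to a node that already stores its input subfile
    when possible; report the resulting data-locality fraction."""
    assignment, local = {}, 0
    fallback = next(iter(node_files))          # any node, for non-local tasks
    for task, subfile in task_input.items():
        holders = [n for n, stored in node_files.items() if subfile in stored]
        assignment[task] = holders[0] if holders else fallback
        local += bool(holders)
    return assignment, local / len(task_input)

# 4 Map tasks over 2 nodes; subfile f4 is stored nowhere, so it must be fetched
node_files = {"n1": {"f1", "f2"}, "n2": {"f3"}}
tasks = {"t1": "f1", "t2": "f2", "t3": "f3", "t4": "f4"}
assignment, locality = place_map_tasks(tasks, node_files)
assert locality == 0.75                        # 3 of 4 tasks run locally
```

The optimization in [109] additionally imposes load-balance constraints per node, which is what makes the exact problem an integer program rather than this simple greedy pass.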
3) Reducing Subpacketization Level:
As the number of processing nodes in the network increases, the input file needs to be split into a large number of subfiles. Specifically, the number of subfiles generated increases exponentially in the number of processing nodes [101]. However, there is a maximum allowable subpacketization level, i.e., number of subfiles, where the dataset can only be partitioned into a limited number of packets, beyond which the communication load increases due to the larger number of transmissions required and the unevenly-sized intermediate results. Hence, there are several reasons to reduce the subpacketization level: (i) to reduce the communication load in the Shuffle phase even when there is a large number of processing nodes, (ii) to reduce the packet overheads, which increase with the number of broadcast transmissions, and (iii) to reduce the number of unevenly-mapped outputs, which require zero padding. To keep the subpacketization level below the maximum allowable level, Group-based Coded MapReduce [101] allocates the dataset based on random groupings of the processing nodes and allows the processing nodes to cooperate in the transmission of messages. To avoid splitting the input file too finely, the authors in [102] use an appropriate resolvable design [110], which is based on linear error correcting codes, to determine the number of subfiles, the allocation of the subfiles to the processing nodes and the construction of the coded messages in the Shuffle phase. Building on this initial work, the authors in [111] use the resolvable design based scheme to overcome a limitation of the compressed CDC scheme [112], which uses both compression and coding techniques. Although the compressed CDC scheme helps to reduce the communication load, it requires a large number of jobs to be processed simultaneously. Hence, the resolvable design based scheme is used to reduce the number of subfiles generated.
Specifically, for each job in the compressed CDC scheme, the single-parity code is used to split the input file and the resolvable design based scheme is used to allocate the subfiles to the processing nodes. By aggregating the underlying functions and applying the resolvable design based scheme, multiple jobs can be processed in parallel while minimizing the execution time in the Shuffle phase, contributing to the reduction of the overall job execution time. Although the number of subfiles or the number of jobs generated still increases exponentially with some of the system parameters, e.g., the number of computing nodes and the number of output functions, the exponent is much smaller when the resolvable design based scheme is implemented. In addition to the exponential increase in the number of subfiles required, the number of output functions required also increases exponentially as the number of processing nodes in the network increases. There are other methods to reduce the number of subfiles and the number of output functions, such as the hypercube computing scheme [113] and the placement delivery array (PDA) [114]–[116]. However, most of the CDC schemes consider non-cascaded systems, i.e., each Reduce function is computed at exactly one processing node [111]. In [113], a cascaded system is considered, but only two values for the number of processing nodes that perform each Reduce function are considered. By applying the concept of PDA to the distributed computation framework, the performance of the proposed computation scheme is evaluated for different numbers of processing nodes that compute the Reduce functions [114]. Although the implementation of these various methods reduces the number of subfiles generated, it may come at the expense of a higher communication load [102], [116].

B. Coded Shuffling Design
In the design of coded shuffling algorithms, we classify the approaches into three different categories: (i) compression and randomization, (ii) coding across multiple iterations, and (iii) problem-specific coding approaches.
1) Compression and randomization:
To further reduce the communication costs of the distributed computation tasks, the design of the coded data shuffling scheme can incorporate different techniques to create more coded multicasting opportunities. Besides, the coded shuffling schemes are designed to minimize communication costs for different distributed computation problems such as iterative algorithms, graph computations and multistage dataflow problems. The work in [17] generates replications of the computation tasks in the Map phase in order to reduce the communication load in the Shuffle phase by coding and multicasting the intermediate results. To further reduce the communication load, compression and randomization techniques can be applied in the design of the coded shuffling algorithms.

• Compression Techniques:
The compressed CDC computation scheme is proposed in [112] by jointly using two techniques, i.e., compression and coding. Each processing node first computes the allocated Map tasks and generates the intermediate results. By using compression techniques, several intermediate results of a single computation task are compressed into a single pre-combined value. The communication bandwidth needed to transmit a single pre-combined value is much smaller than that of transmitting several uncombined intermediate values, since the size of the pre-combined value equals the size of only one intermediate value. With the pre-combined values from different computation tasks, the processing node codes them for multicasting to other processing nodes simultaneously. There are two advantages to this compressed CDC scheme: (i) the communication load is reduced in proportion to the storage capacity of each processing node, and (ii) the communication load does not scale linearly, i.e., it scales slower than linearly, with the size of the dataset. In some cases, e.g., parallel stochastic gradient descent (SGD) algorithms, instead of transmitting intermediate results, computed gradient updates are exchanged among the workers. In such cases, Quantized SGD [117], a compression technique, can be used to reduce the communication bandwidth used during the gradient updates between the processing nodes. In each iteration, the processing nodes are allowed to adjust the number of transmitted bits by quantizing each gradient component to a discrete set of values and encoding these quantized gradients.
Instead of introducing coded multicasting opportunities to reduce the communication load in the Shuffle phase, there are other coding techniques that can be applied to increase the efficiency of data shuffling. One of the ways is to perform a semi-random data shuffling and coding scheme based on pliable index coding, which introduces randomization in the data shuffling process [118]. There are two important modifications made to the conventional pliable index coding scheme [119], which is used to minimize the number of broadcast transmissions while satisfying users' demands. Firstly, the correlation of messages between workers is reduced. In order to do so, a message should only be transmitted to a fraction of the workers so that the same message is not held by all workers. As such, the pliable index coding problem is formulated with the objective of minimizing the number of broadcast transmissions under the constraint of a maximum number of workers that can receive the same message. Secondly, the correlation of messages between iterations is reduced. The reduction of the correlation of messages prevents the workers from performing computations on the same dataset after shuffling, which may be redundant. A two-layer hierarchical structure is proposed for data shuffling. In the upper layer, the messages are partitioned into multiple groups, of which each group of messages is transmitted to a fraction of the workers. In the lower layer, each group of messages and the corresponding allocated workers are formulated as a constrained pliable index coding problem. Randomization occurs in two stages: (i) when the master node selects the messages in each group and transmits them to the workers, and (ii) when the workers discard old messages from their cache. Experimental results show that the proposed pliable index coding scheme requires only 12% of the broadcast transmissions needed by an uncoded scheme, i.e., a random shuffling with replacement scheme.
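The stochastic quantization step of Quantized SGD [117], mentioned under the compression techniques above, can be sketched as follows (a simplified version with hypothetical parameters; the real scheme additionally applies a variable-length encoding to the quantized values):

```python
import numpy as np

def quantize(g, levels, rng):
    """Quantize each component of g to one of `levels` uniform levels of
    |g_i| / ||g||, rounding up stochastically so the estimate is unbiased."""
    norm = np.linalg.norm(g)
    scaled = np.abs(g) / norm * levels
    low = np.floor(scaled)
    q = low + (rng.random(g.shape) < scaled - low)   # probabilistic rounding
    return np.sign(g) * q, norm                      # small integers + one float

def dequantize(q, norm, levels):
    return q * norm / levels

rng = np.random.default_rng(7)
g = rng.normal(size=5)
norm = np.linalg.norm(g)
# averaging many quantized copies converges to g (unbiasedness)
avg = np.mean([dequantize(quantize(g, 4, rng)[0], norm, 4)
               for _ in range(20000)], axis=0)
assert np.allclose(avg, g, atol=0.02)
```

Each worker then transmits only the sign, the small integer per component, and one scalar norm, which is where the bandwidth saving over full-precision gradients comes from.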
2) Coding across multiple iterations:
Most works on coded iterative algorithms focus on the optimization of a single computation iteration or the minimization of the communication load in a single communication round [81]–[83], [120]. However, multiple rounds of communication are generally required to solve distributed iterative problems. In the studies of [121] and [122], the results of multiple iterations are transmitted in a single round of communication by jointly coding across several iterations of the distributed computation task. By leveraging the computation and storage redundancy of the workers, the number of communication rounds between the master node and the workers is greatly reduced, resulting in a reduction in the communication costs. However, the computation and storage costs may not be optimal as compared to uncoded computing schemes, e.g., [123] and [124], which achieve near-optimal computation and storage costs.
3) Problem-specific coding approaches:
Fig. 8: Server-rack architecture where multiple servers in each rack are connected via a Top of Rack switch and the Root switch connects multiple Top of Rack switches.

Apart from the typical distributed computation problems, the MapReduce framework can also be used to solve distributed graph computation problems [103], [104]. However, for graph computing systems, the computation at each vertex is a function of the graph structure, which means that each computation only needs data from its neighbouring vertices. More specifically, the communication load in the Shuffle phase depends on the connectivity probabilities of the vertices in the graph, in which each vertex is only allowed to communicate reliably with a subset of random vertices [104]. As a result, the CDC scheme proposed in [17] (previously discussed in Section III-A) is not applicable to solving the graph computation problems. In view of this, the authors in [103] propose a coded scheme to solve the problem of computation over random Erdős-Rényi graphs while minimizing the communication load in the Shuffle phase. A similar inverse tradeoff curve between the computation load and the average communication load is obtained by using coding techniques in solving distributed graph computations. Moreover, many distributed computing applications consist of multiple stages of MapReduce computations. The multistage data flow can be represented by a layered DAG [105] in which the processing nodes, i.e., vertices in a particular computation stage, are grouped into a single layer. Each vertex computes a user-defined function, i.e., a Map or Reduce computation that transforms the given input files into intermediate results, whereas the edges represent the data flow between the processing nodes. By exploiting the redundancy of the computing nodes, coding techniques are applied to the processing nodes individually to minimize the communication costs. The proposed work in [105] considers a uniform resource allocation strategy where the computation of each vertex is distributed across all processing nodes. However, the communication load can be further reduced by reducing the number of processing nodes that are used to compute each vertex. Given that fewer processing nodes are used to compute each vertex, the computation load performed by each processing node increases and the processing nodes have more local information, thus reducing the need for communication to obtain the required information.
Therefore, a dynamic resource allocation strategy is needed to further minimize the communication load in multistage MapReduce problems.
C. Consideration of Underlying Network Architecture
Although various techniques can be used jointly with coding techniques to increase the efficiency of the data shuffling phase, it is important to consider the underlying network architecture, i.e., how the servers are connected to each other, in designing the coded shuffling algorithms. In [109], Hybrid Coded MapReduce is proposed by considering the server-rack architecture (Fig. 8) in distributed computing systems. There are two types of communication in the Shuffle phase: (i) cross-rack communication, where the data is shuffled across different racks, and (ii) intra-rack communication, where the data is shuffled within a rack. In the first stage, where cross-rack communication takes place, the Coded MapReduce algorithm [18] (Fig. 4) is used to create multicasting opportunities for the transmission of messages. In the second stage, where the intra-rack communication is performed, data is shuffled in a unicast manner and no coding technique is applied. The simulation results show that the cross-rack communication costs incurred by the hybrid scheme are lower than those of both Coded MapReduce [18] and the uncoded scheme [6], at the expense of higher intra-rack communication costs. Although the Coded MapReduce scheme still achieves the lowest total communication costs, the overall communication costs of the hybrid scheme can be further reduced by parallelizing the intra-rack operations to provide a more accurate comparison between the different computation schemes. The CDC computation scheme proposed in [17] is useful for networks with processing nodes that are closely located to each other and connected via a common communication bus. However, in practical distributed computing networks, it is hard to implement a common-bus topology for physically separated processing nodes. Hence, to reduce the communication load of the distributed computation tasks, it is important to consider a practical data center network topology to reap the coding benefits of the CDC schemes.
As such, in [125], the authors propose a CDC scheme based on a widely-used low-cost network topology, the t-ary fat-tree topology [126], [127]. It has the characteristics of network symmetry and connectivity between any two processing nodes in the network. Given a practical network topology design, the proposed topological CDC scheme achieves the optimal max-link communication load over all links in the topology, for which the optimal tradeoff between the max-link communication load and the computation load is characterized. Although the coded shuffling algorithm proposed in [19] reduces the communication load of the data shuffling process, there are two main limitations: (i) it assumes a perfect broadcast channel between the master node and the workers, and (ii) the theoretical guarantee on the number of broadcast transmissions only holds when the number of data points approaches infinity. To overcome the limitations of the coded shuffling algorithm proposed in [19], UberShuffle [128] fills the missing entries in the encoding tables by reallocating the data points between the workers. This reduces the number of transmitted encoded packets, resulting in a reduction of the communication load. The performance of the UberShuffle algorithm is evaluated when applied to different problem settings. The experimental results show that, in comparison to the coded shuffling algorithm in [19], the UberShuffle algorithm reduces the shuffling time by up to 47.2% and 35.7% when implemented with the distributed SGD algorithm for a low-rank matrix completion problem and the parallel SGD algorithm for a linear regression problem, respectively.
D. Function Allocation
In most studies of the CDC computation framework, non-cascaded systems are considered. In other words, each Reduce function can only be computed at exactly one processing node. Considering that the processing nodes have different storage, computational and communication capabilities, a non-cascaded heterogeneous system [129] where each Reduce function is computed exactly once is considered. In this proposed scheme, the processing nodes are allocated different numbers of Reduce functions, where the processing nodes with greater storage and computational capabilities are allocated more Reduce functions. This heterogeneous Reduce function allocation creates a symmetry among the multicast groups such that each processing node in a group requests the same number of intermediate outputs from the other processing nodes in the same group. The heterogeneous CDC system achieves a lower communication load than an equivalent homogeneous CDC system. However, it is desirable for the Reduce functions to be computed at multiple processing nodes in some applications, e.g., iterative algorithms in the Spark model [46], where the output of a MapReduce procedure acts as the input to the MapReduce procedure in the next iteration. Although the work in [17] generalizes the CDC framework by allowing each Reduce function to be computed at more than one processing node, it only applies to homogeneous systems. Similar to [129], the heterogeneities of the systems are also considered in [106]. However, instead of a non-cascaded system, the authors propose a more general framework of cascaded CDC [106] for heterogeneous systems. In other words, each Reduce function is allowed to be computed at multiple processing nodes. Since the processing nodes have different storage capacities, the number of subfiles stored at each processing node differs.
The processing nodes with larger storage capacities are allocated more subfiles and thus compute more intermediate results. Instead of allocating the files and functions randomly as in the work in [17], this cascaded CDC scheme uses a hypercube approach [113] to allocate the files and functions to the processing nodes. The simulation results show that for the same number of processing nodes in the network, the proposed cascaded CDC scheme achieves a smaller communication load than the state-of-the-art schemes that consider homogeneous systems [17], [113]. In the study of [130], the allocation of the functions to the processing nodes does not depend on their capabilities but on the data stored at the nodes, where the data placement is assumed to be random. This is very useful for applications in which the processing nodes are mobile and collect data on the move. If the probability that the processing nodes contain the data is higher than a pre-defined threshold, it is possible to allocate the Reduce functions such that the processing nodes do not need to exchange their intermediate results for the computation of the Reduce functions, i.e., each processing node can compute the Reduce functions based on its locally stored data. This reduces the number of broadcast transmissions in the Shuffle phase, thus minimizing the communication load. Although the heterogeneities of the systems are taken into consideration in some of the works [107], [129], [106], they merely focus on either file allocation or function allocation. On one hand, the work in [107] proposes an optimal file allocation strategy in consideration of the heterogeneous storage capabilities of the processing nodes. However, the Reduce functions are distributed uniformly among the processing nodes. On the other hand, the works in [129] and [106] propose the allocation of Reduce functions based on the computational capabilities of the processing nodes.
However, the input file is split equally and distributed among the processing nodes. Considering a more generalized heterogeneous system, a joint file and function allocation strategy is proposed in [131] to reduce the communication load in the Shuffle phase. The file allocation and function assignment strategies allocate more subfiles and Reduce functions, respectively, to the processing nodes with higher computational and storage capabilities. Generally, the Reduce function assignment is related to the input file allocation, as the processing nodes with more input files have greater storage and computational capabilities and hence are more capable of computing more output functions. In particular, there are two proposed schemes of function assignment, i.e., computation-aware function assignment and shuffle-aware function assignment. In the computation-aware function assignment strategy, the number of functions allocated is proportional to the computational capabilities of the processing nodes in order to reduce the computation latency. In the shuffle-aware function assignment strategy, the functions are mostly allocated to the processing nodes with high computational capabilities so that the communication load in the Shuffle phase is minimized. The simulation results show that the communication loads achieved by both the computation-aware and shuffle-aware function assignment strategies are lower than that of the uniform function allocation strategy. Besides, the computation-aware function assignment strategy requires a smaller number of output functions as compared to the proposed schemes in [106] and [129]. However, the number of input files required is much larger, especially when the number of processing nodes in the system increases. While there has been great attention on the design of file allocation, coded shuffling algorithms and function allocation, all these works assume a fixed number of processing nodes in the distributed computing systems.
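The proportionality principle behind such computation-aware assignment can be sketched with a largest-remainder split (the capability scores are hypothetical; the actual schemes in [131] also account for the shuffle structure):

```python
def allocate_reduce_functions(Q, capability):
    """Split Q Reduce functions across nodes in proportion to their
    capability scores, using largest-remainder rounding."""
    total = sum(capability)
    shares = [Q * c / total for c in capability]
    alloc = [int(s) for s in shares]
    # hand leftover functions to the largest fractional remainders
    order = sorted(range(len(alloc)), key=lambda i: shares[i] - alloc[i],
                   reverse=True)
    for i in order[:Q - sum(alloc)]:
        alloc[i] += 1
    return alloc

# 10 Reduce functions over four nodes; the strongest node gets the most
assert allocate_reduce_functions(10, [4, 2, 1, 1]) == [5, 3, 1, 1]
assert sum(allocate_reduce_functions(7, [1, 1, 1])) == 7
```

Largest-remainder rounding keeps the total exactly at Q while staying within one function of each node's ideal proportional share.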
For a given computation task that specifies the number of subfiles and the number of output functions, the allocation of functions and the data shuffling schemes can be implemented with the minimum number of processing nodes. In the study of [132], the resource allocation problem is formulated as an optimization problem that minimizes the overall job execution time with the optimal number of processing nodes used. For more practical implementation, the resource allocation strategy should consider the heterogeneity in the processing speeds of the nodes, since the straggler effects cause an increase in computation latency which increases the overall job execution time.

By exploiting the fact that the processing nodes have time-varying computing resources, e.g., the processing nodes may have different central processing unit (CPU) power over time, an optimal computation task scheduling scheme helps to reduce the communication load. In the scheduling of tasks under dynamic computing resources, the total communication load is minimized by optimizing the number of allocated tasks and the load redundancy at each time slot [133].

E. Summary and Lessons Learned
In this section, we have reviewed four approaches to minimize communication costs in distributed computing systems. For each approach, we discuss the solutions that are proposed in different works. We summarize the coding schemes to minimize communication costs in Table VI. The lessons learned are as follows:

• To handle the increasing amounts of data generated, more processing nodes are needed to complete distributed computations, given the limited capabilities of each processing node. With more processing nodes connected to the network, more communication rounds of the computed intermediate results are required, resulting in higher communication costs and hence a longer job execution time. Besides, the high communication costs in the data shuffling phase impede the implementation of efficient distributed iterative algorithms, which are useful for the training of machine learning models. As such, minimizing communication costs is a key step towards reducing the overall job execution time of distributed computation tasks.

• While having more processing nodes perform the computation tasks in parallel helps to reduce the computation load of each processing node, the communication costs may increase, slowing down the entire computation process. Instead of generating an infinitely large number of subfiles and distributing them among a large number of processing nodes, several studies focus on determining the optimal number of subfiles to avoid splitting the input file too finely. For example, the resolvable design based schemes [102], [111] and PDA approaches [114]–[116] are adopted to split the input files. Hence, it may not be optimal to use all the processing nodes that are connected to the network. In fact, the authors in [132] propose an optimal resource allocation scheme that determines the minimum number of processing nodes needed to achieve the minimum overall job execution time.
• Coding techniques have been shown to be effective in reducing the communication load in the data shuffling phase at the expense of higher computation load [17], [19]. However, this two-dimensional tradeoff is insufficient to fully evaluate the performance of CDC schemes. Apart from leveraging the computational capabilities of the processing nodes, their storage capabilities can be exploited. For example, more data can be stored at processing nodes with higher storage capacities such that the number of communication rounds is reduced [108]. Moreover, by considering the data stored at the processing nodes, an allocation of functions that maximizes data locality helps to reduce the need for communication bandwidth [109]. Hence, instead of the two-dimensional tradeoff between computation load and communication load, the three-dimensional tradeoff between computation, communication and storage costs [71] has to be carefully managed for the implementation of efficient CDC schemes.

• For effective implementation of CDC schemes in practical distributed computing systems, the underlying architecture has to be considered. Generally, distributed computing systems operate under a server architecture which consists of multiple racks, where each rack has several servers. The Hybrid Coded MapReduce scheme [109] reduces cross-rack communication at the expense of higher intra-rack communication. Besides, the communication channels between the master node and the workers are imperfect. As a result, the theoretical gains of the coded data shuffling schemes are not achievable under practical setups. In order to design and analyze the performance of CDC schemes, the limitations of the distributed computing systems should be taken into consideration.

• Most of the CDC schemes focus on minimizing communication costs in the data shuffling phase at the expense of a reasonable increase in computation load. However, the computational overhead of the algorithm is not negligible under some settings. For example, in a much faster broadcast environment, the UberShuffle algorithm [128] incurs significant encoding and decoding costs such that the shuffling gain is offset by the high computational overhead. For future works, more practical CDC schemes can be proposed such that the communication costs are minimized while maintaining low computational cost to improve the performance of distributed computing systems. Given that the uncoded computing schemes, e.g., [123] and [124], achieve near-optimal computation and storage costs, one possible research direction is to merge the proposed communication-efficient schemes with the uncoded computing schemes to reduce the computation and storage costs.

• Although some of the works consider heterogeneities in the capabilities of the processing nodes to allocate files and functions, the presence of stragglers, which have slower processing speeds, still hinders the performance of the distributed computing systems. Therefore, we further discuss the approaches to mitigate the straggler effects in the next section.

TABLE VI: CDC schemes to reduce communication costs.

Approach | Ref. | Coding Scheme | Key Ideas | Platform Support
File Allocation | [108] | - | Deterministic and systematic storage update strategy | Heterogeneous
File Allocation | [109] | Hybrid Coded MapReduce | Allocates Map tasks such that data locality is maximized | Homogeneous
File Allocation | [101] | Group-based Coded MapReduce | Allocates the dataset by using a group-based method in order to avoid a high subpacketization level and allow processing nodes to cooperate in the transmission of messages | Homogeneous
File Allocation | [102] | Resolvable Design | Uses a single-parity code to determine the number of subfiles and allocates the subfiles based on the corresponding resolvable design | Homogeneous
File Allocation | [114]–[116] | Placement Delivery Arrays | Construction of CDC schemes based on PDA, which has the property of illustrating the placement and delivery phases in a single array | Homogeneous
Coded Shuffling Design | [112] | Compressed CDC | Pre-combines computed intermediate values of the same function, followed by coding the pre-combined packets for communication between different processing nodes | Homogeneous
Coded Shuffling Design | [117] | Quantized SGD | Quantizes the components of the gradient vector to a discrete set of values and encodes the quantized gradients given their statistical properties | Homogeneous
Coded Shuffling Design | [118] | Pliable Index Coding | Semi-random data shuffling scheme based on modified pliable index coding to reduce the number of communication rounds | Homogeneous
Coded Shuffling Design | [121] | Cross-iteration Coded Computing | Jointly codes across multiple iterations for a single communication round | Homogeneous
Coded Shuffling Design | [103] | CDC for distributed graph computing systems | Instead of communicating with all other processing nodes, each processing node only needs to communicate with a subset of processing nodes to obtain the required data to complete its computation tasks | Homogeneous
Coded Shuffling Design | [105] | CDC for multistage dataflow | A more generalized CDC scheme is proposed to handle multistage dataflow computation tasks which are represented by layered DAGs | Homogeneous
Consideration of Underlying Network Architecture | [109] | Hybrid Coded MapReduce | Reduces cross-rack communication at the expense of higher intra-rack communication based on the server-rack architecture | Homogeneous
Consideration of Underlying Network Architecture | [125] | Topological CDC | Considers a t-ary fat-tree topology, which is a more practical topology to connect the physically separated processing nodes in data center networks | Homogeneous
Consideration of Underlying Network Architecture | [128] | UberShuffle | Considers an imperfect communication channel between the workers and the master node | Homogeneous
Consideration of Underlying Network Architecture | [134] | - | Considers a multicore setup where each computing node can have multiple cores, e.g., the CPU instances of publicly available cloud infrastructure can deliver up to 128 cores | Homogeneous
Function Allocation | [129] | - | Considers a non-cascaded system and allocates Reduce functions over a simplified heterogeneous network which comprises multiple homogeneous networks | Heterogeneous
Function Allocation | [106] | Cascaded CDC | Reduce functions are computed at more than one processing node and are allocated based on the combinatorial design in [113] | Heterogeneous
Function Allocation | [130] | - | Allocates functions to maximize data locality such that the number of communication rounds required is reduced | Homogeneous
Function Allocation | [131] | - | Joint file and function allocation strategy | Heterogeneous
Function Allocation | [133] | - | Considers the availability of time-varying computing resources | Homogeneous

V. MITIGATION OF STRAGGLERS
In distributed computing systems, processing nodes have different processing speeds, and thus the time taken to complete their allocated subtasks differs from node to node. Since the computation task is distributed among the processing nodes, the master node needs to wait for all processing nodes to complete their subtasks and return the computed results before recovering the final result. As such, the time taken to execute a computation task is determined by the slowest processing node. This is known as the straggler effect.

The problem of straggler effects has been widely observed in distributed computing systems. Previously, various methods such as straggler detection [65], [135], asynchronous execution and naive replication of jobs [14], [68] have been proposed to reduce the overall time taken to execute the computation tasks. Recently, coding approaches have been shown to outperform the aforementioned methods in reducing the computation latency of distributed computing systems. In this section, we discuss three approaches to mitigate the straggler effects:

• Computation Load Allocation:
Coding techniques can be implemented together with computation load allocation strategies to reduce the computation latency in distributed computing systems. It is important to take into account the variation in the computational capabilities of the processing nodes when allocating computation load. As such, different prediction methods such as a long short-term memory (LSTM) algorithm [136], an Auto Regressive Integrated Moving Average (ARIMA) model and a Markov model [137] are used to estimate the processing speeds of the nodes. Generally, the objective of the load allocation strategies is to minimize computation latency. However, in some applications where strict deadlines are imposed, the load allocation strategies aim to maximize the timely computation throughput [137].

• Approximate Coding:
For some applications, e.g., location-based recommendation systems, exact solutions are not necessary. Several studies explore different coding approaches to obtain approximate solutions to such problems. The approximate coding methods relax the requirement for convergence and thus reduce the number of workers that are required to return their computed results. This prevents stragglers from adversely affecting the system.

• Exploitation of Stragglers:
The straggling nodes may have completed a fraction of their allocated computation tasks, which would be wasted if ignored completely. In fact, the stragglers may not be persistent over the entire computation process [138], and thus their partial computed results can be useful for recovering the final result. In order to maximize the resource utilization of the straggling nodes, the workers are allowed to process their allocated subtasks sequentially and transmit their partial computed results continuously [139]. However, this may come at the expense of higher communication load, which needs to be carefully managed.
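The straggler effect underlying all three approaches, that an uncoded job waits for the slowest of n nodes while an (n, k) MDS-coded job only waits for the k-th fastest, can be illustrated with a toy simulation. All runtimes below are synthetic exponential draws, an assumption for illustration rather than a model from any surveyed work:

```python
# Toy illustration of the straggler effect: the uncoded job waits for
# the slowest of n nodes, while an (n, k) MDS-coded job only needs the
# k fastest results. Runtimes are synthetic exponential assumptions.
import random

random.seed(0)

def job_time_uncoded(times):
    return max(times)            # must wait for every node

def job_time_coded(times, k):
    return sorted(times)[k - 1]  # any k results suffice to decode

n, k, trials = 10, 7, 1000
uncoded = coded = 0.0
for _ in range(trials):
    times = [random.expovariate(1.0) for _ in range(n)]
    uncoded += job_time_uncoded(times)
    coded += job_time_coded(times, k)

print(f"avg uncoded: {uncoded / trials:.2f}, avg coded: {coded / trials:.2f}")
```

Since the k-th order statistic never exceeds the maximum, the coded job time is never worse per trial; the redundancy cost is that each of the n nodes computes a 1/k fraction of the coded task.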
A. Computation Load Allocation
Apart from having different straggling rates that affect the completion of tasks, the processing nodes have different capabilities, e.g., storage capacities, computing resources and communication bandwidths. To better handle the straggling nodes in distributed computing systems, an optimal load allocation strategy that takes these heterogeneities into account is necessary to minimize the overall job execution time.

Given the computation time parameters, i.e., the straggling and shift parameters of each worker, the Heterogeneous Coded Matrix Multiplication (HCMM) algorithm [140] determines the allocation of computation load to each worker. The HCMM scheme exploits the benefits of both coding techniques and a computation load allocation strategy to minimize the average computation time of the computation tasks. Since it is difficult to derive closed-form expressions for the expected computation time of heterogeneous processing nodes, a two-step alternative problem formulation is proposed. In the first step, given a time period, the number of results computed by the workers is maximized by optimizing the load allocation. In the second step, given the load allocation derived in the first step, the time needed to ensure that sufficient results are returned with a pre-defined probability is minimized. The simulation results show that when the workers' computation time is assumed to follow a shifted exponential runtime distribution, HCMM reduces the average computation time by up to 71%, 53% and 39% over the uniform uncoded, load-balanced uncoded and uniform coded load allocation schemes, respectively. In practical experiments over Amazon EC2 clusters, the combination of HCMM and LT codes outperforms the uniform uncoded, load-balanced uncoded and uniform coded load allocation schemes by up to 61%, 46% and 36%, respectively.

Although HCMM achieves asymptotically optimal computation time, its decoding complexity is high, which suggests the opportunity to further speed up the overall computation tasks.
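A much-simplified heuristic in the spirit of heterogeneity-aware load allocation can be sketched as follows: assign rows inversely proportional to each worker's expected per-row time under a shifted exponential model. This is a hedged sketch only; the actual HCMM scheme of [140] solves the two-step optimization described above, not this closed-form rule, and the shift/rate values are illustrative assumptions:

```python
# Simplified heterogeneity-aware load allocation (illustrative heuristic,
# not the actual HCMM optimization of [140]).

def allocate_rows(shifts, rates, total_rows):
    """shifts[i]: deterministic time per row of worker i.
    rates[i]: exponential rate of the random part (higher = less straggling).
    Rows are assigned inversely proportional to expected per-row time."""
    exp_per_row = [a + 1.0 / mu for a, mu in zip(shifts, rates)]
    weights = [1.0 / t for t in exp_per_row]
    total_w = sum(weights)
    loads = [round(total_rows * w / total_w) for w in weights]
    loads[-1] += total_rows - sum(loads)  # absorb rounding error
    return loads

# Faster workers (small shift, high rate) receive more rows.
print(allocate_rows(shifts=[0.1, 0.1, 0.5], rates=[10.0, 2.0, 1.0],
                    total_rows=600))
```

Equalizing expected finishing times in this way captures the intuition behind HCMM, while the real scheme additionally controls the probability that enough coded results return by a target time.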
In practical distributed computing systems, some processing nodes have the same computational capabilities, in terms of the same distributions of computation time, and thus they can be grouped together. By exploiting the group structure and the heterogeneities among different groups of processing nodes [141], [142], a combination of group codes and an optimal load allocation strategy not only approaches the optimal computation time achieved by MDS codes, but also has low decoding complexity. In addition, by varying the number of matrix rows allocated to the workers [142], the computation latency can be reduced by orders of magnitude over MDS codes with fixed computation load allocation [141] as the number of workers increases. The load allocation strategy proposed in [142] focuses mainly on the design of an optimal MDS code. Other efficient coding algorithms such as LT codes [74] can also be explored in future studies.

In addition to the heterogeneous capabilities of the processing nodes, the amount of available resources at the processing nodes may vary over time. If computation tasks are always allocated to the processing nodes with higher capabilities, delays may be incurred in completing the allocated tasks when those nodes start to work on newly-allocated computation tasks in parallel. At the same time, the resources of the processing nodes with lower capabilities may be under-utilized. In order to maximize the resource utilization of the processing nodes, dynamic workload allocation algorithms which are adaptive to the time-varying capabilities of the processing nodes are proposed [143], [136], [137]. To provide robustness against the straggling nodes, the design of the load allocation algorithms often depends on the historical data of the processing nodes, such as computation time, which can be used to predict the processing speeds by using an LSTM algorithm [136], an ARIMA model or a Markov model [137].
The best performing LSTM model achieves 5% lower prediction error than an ARIMA(1,0,0) model [136].

In the study of [143], the authors propose the Coded Cooperative Computation Protocol (C3P), which is a dynamic and adaptive coded cooperation framework that efficiently utilizes the available resources of the processing nodes while minimizing the overall job execution time. Specifically, the master node determines the coded packet transmission intervals based on the responsiveness of the processing nodes. Processing nodes which are not able to complete their tasks within the given transmission interval suffer from a longer waiting time for the next coded packets. In comparison to the HCMM scheme [140], which does not consider dynamic resource heterogeneity among workers, the C3P framework achieves more than 30% improvement in task completion delay.

The dynamic and adaptive load allocation algorithms are especially useful in providing timely services with deadline constraints, which are common in many IoT applications. For such applications, instead of minimizing task completion delay, the objective of the load allocation algorithms is to maximize the timely computation throughput, i.e., the average number of computation tasks that are successfully completed before the given deadline [137].

For some applications that need timely but not necessarily optimal results, it is more important to recover the final result with the highest accuracy possible by the stipulated deadline than to solve for an exact solution. An algorithm that solves for an approximate solution requires significantly shorter computation time than one that solves for an exact solution [144]. In the study of [145], the batch-processing based coded computing (BPCC) algorithm is proposed.
The workers first partition the allocated encoded matrix into several batches, i.e., submatrices, for computation. As soon as a worker completes the computation of a batch of the submatrix with a given vector, it returns the partial computation results, which are used to generate the approximate solution. Based on the computation time parameters, i.e., the straggling and shift parameters of the workers, the BPCC algorithm optimally allocates the computation load to each worker by considering two cases: (i) negligible batching overheads, and (ii) linear batching overheads. Hence, the allocation of computation load is optimized by jointly minimizing both the overall job execution time and the potential overhead of batch processing. In addition to reducing computation latency, the BPCC algorithm exploits the partial computation results of all processing nodes, including the straggling nodes, which contribute to approximate solutions of higher accuracy [139]. The simulation results show that the BPCC algorithm with negligible batching overheads achieves up to 73%, 56% and 34% reduction in average job execution time over the uniform coded, load-balanced uncoded and HCMM [140] load allocation schemes, respectively. The experimental results on Amazon EC2 clusters and an Unmanned Aerial Vehicle (UAV) based airborne computing platform also demonstrate similar results.

B. Approximate Coding
For some applications, it is not necessary to obtain exact solutions. In this subsection, we review the coding approaches that are used to derive approximate solutions for distributed computation tasks: (i) matrix multiplication, (ii) gradient descent, and (iii) non-linear computations. Table VII summarizes the approximate coding schemes for various distributed computation tasks.
1) Matrix multiplications:
The anytime coding scheme [144] derives an approximate solution by using the output results of the processing nodes that have completed their tasks at any given time. Based on singular value decomposition (SVD), the given computation task is decomposed into several subtasks of different priorities. More important subtasks are allocated more processing nodes for computation as they improve the accuracy of the approximation. To allow the users to receive useful information from time to time, approximate solutions can be transmitted to the users sequentially. This can be achieved by solving a sequence of approximated problems [146], instead of solving the original problem directly.

To further reduce the computation time of approximate matrix multiplications, sketching techniques [154], [155] can be used to remove redundancy in the structure of the matrices through dimensionality reduction. However, by using sketching techniques, the recovery threshold increases as the redundancy is removed. In contrast, coding techniques reduce the recovery threshold by introducing redundancy. As such, a combination of both techniques that carefully manages the tradeoff between the recovery threshold and the amount of redundancy can be implemented to minimize computation latency. In particular, the count-sketch technique [156] is combined with structured codes to mitigate the straggler effects by preserving a certain amount of redundancy, thereby achieving the optimal recovery threshold and hence computation latency [147], [148].

TABLE VII: Approximate CDC schemes to mitigate the straggler effects.

Problems | Ref. | Coding Schemes | Key Ideas
Matrix-Vector | [144] | Anytime Coding | Computations can be stopped anytime and the approximate solution is derived from the processing nodes that have completed their tasks
Matrix-Vector | [146] | Coded Sequential Computation Scheme | A sequence of approximated problems is designed such that the time required to solve these problems is shorter than solving the original problem directly
Matrix-Matrix | [147] | CodedSketch | Uses a combination of the count-sketch technique and structured polynomial codes
Matrix-Matrix | [148] | OverSketch | Divides the sketched matrices into blocks for computation
Gradient Descent | [120] | Bernoulli Gradient Codes | Use Bernoulli random variables as entries of the function assignment matrix
Gradient Descent | [149] | Stochastic Block Codes | Interpolation between BGC and FRC
Gradient Descent | [150] | Stochastic Gradient Coding | Distributes data based on a pair-wise balanced scheme and provides a rigorous convergence analysis of the proposed coding scheme
Gradient Descent | [15] | LDPC Codes | Encode the second moment of the data points
Gradient Descent | [151], [152] | Encoded Optimization | Encodes both the labels and data such that redundancy is introduced in the formulation of the optimization problems
Non-Linear | [96] | Generalized PolyDot | Generalization of PolyDot codes [79], used for the training of DNNs
Non-Linear | [153] | Learning-based Approach | Designs neural network architectures to learn and train the encoding and decoding functions to approximate unavailable outputs
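The SVD-based anytime idea for matrix-vector multiplication can be sketched as follows: the product y = A @ x is split into rank-1 subtasks ordered by singular value, and a partial result built from whichever subtasks have finished approximates y, with error shrinking as more subtasks complete. The matrix sizes and values are illustrative assumptions:

```python
# Sketch of SVD-based "anytime" approximation of y = A @ x: rank-1
# subtasks are prioritized by singular value; partial sums approximate
# the exact product. Sizes and values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
x = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A)

def anytime_result(num_done):
    """Approximate A @ x using the num_done highest-priority
    (largest singular value) rank-1 subtasks."""
    y = np.zeros(A.shape[0])
    for i in range(num_done):
        y += s[i] * U[:, i] * (Vt[i] @ x)  # one completed subtask
    return y

exact = A @ x
errs = [np.linalg.norm(anytime_result(k) - exact)
        for k in range(A.shape[0] + 1)]
# Because the left singular vectors are orthonormal, the error is
# non-increasing in k and reaches (numerically) zero at full rank.
print([round(e, 3) for e in errs])
```

Allocating more workers to the large-singular-value subtasks, as in [144], prioritizes exactly the terms that reduce this error fastest.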
2) Gradient descent:
To speed up distributed gradient descent tasks, several approximate gradient coding schemes have been proposed to approximately compute any sum of functions. Instead of constructing the gradient codes based on expander graphs [82], which are difficult to compute due to their high complexity, a more efficient and simpler Bernoulli Gradient Code (BGC) is proposed by using sparse random graphs [120], which introduce randomness into the entries of the function assignment matrix. Since the performance of the gradient codes depends on the efficiency of the decoding algorithms, the authors also present two decoding techniques to recover the approximate solution from the outputs of the non-straggling nodes. The simulation results show that the BGC schemes can handle adversaries with polynomial-time computations, but at the cost of higher decoding error than the FRC schemes [81]. Besides, the optimal decoding algorithm always achieves a lower decoding error than the one-step decoding algorithm across various gradient coding schemes. A rigorous convergence analysis of the approximate gradient codes and the performance of BGC on practical datasets such as the Amazon, Covertype and KC Housing datasets are presented in [157]. The stochastic block code [149], which is based on the stochastic block model from random graph theory, is an interpolation between FRC [81] and BGC [120]. On one hand, the FRC schemes achieve small reconstruction errors under random straggler selection; on the other hand, the BGC schemes are robust against polynomial-time adversarial stragglers.

Other approximate gradient coding methods such as Stochastic Gradient Coding (SGC) [150] and LDPC codes [15] are used to obtain an estimate of the true gradient.
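The Bernoulli-style assignment can be made concrete with a minimal sketch: each worker is assigned each data partition independently with probability p, returns the sum of its assigned partial gradients, and the master rescales the sum over non-stragglers into an approximately unbiased estimate. This is a hedged illustration of the idea only, not the exact construction or decoders of [120]; all sizes and the survivor set are assumptions:

```python
# Minimal Bernoulli-style gradient code sketch (illustrative, not the
# exact BGC construction or decoder of [120]). Toy 1-D gradients.
import random

random.seed(1)

d, n, p = 6, 8, 0.5
partial_grads = [float(i + 1) for i in range(d)]  # toy partial gradients
true_sum = sum(partial_grads)

# Bernoulli function-assignment matrix B (n workers x d partitions).
B = [[1 if random.random() < p else 0 for _ in range(d)] for _ in range(n)]
# Each worker returns the sum of its assigned partial gradients.
worker_out = [sum(b * g for b, g in zip(row, partial_grads)) for row in B]

survivors = [0, 2, 3, 5, 6]  # assumed non-straggling workers
# One-step rescaling: each partition is covered ~p of the time per worker.
estimate = sum(worker_out[i] for i in survivors) / (len(survivors) * p)
print(f"true sum {true_sum:.1f}, estimate {estimate:.1f}")
```

With p = 1 this degenerates to full replication and the estimate is exact; smaller p trades decoding error for lighter per-worker computation, which is the BGC/FRC tension discussed above.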
Similar to the idea of encoding data batches in the PCR schemes [85], the data variables in the optimization formulation can be efficiently encoded to mitigate the straggler effects in more general large-scale optimization problems such as support vector machines, linear regression and compressed sensing [151], [152]. Generally, in solving for approximate solutions to reduce the computation latency of distributed tasks, there is an inherent tradeoff between the recovery threshold and the approximation error, where the recovery threshold can be reduced at the expense of a larger approximation error [158].
3) Non-linear computations:
By leveraging the enormous amounts of data generated, machine learning algorithms are useful in making predictions and allowing devices to respond intuitively to user demands without human intervention. Different neural network architectures have been developed to make accurate inferences given a dataset. Since some of the layers of the neural networks, such as the max-pooling functions and the activation layers, are non-linear, the overall computation of the functions is non-linear. As a result, most of the prior works on linear computations discussed in Sections III-B, V-B1 and V-B2 are not applicable to the computation of the increasingly important non-linear neural networks, whose performance is also limited by the straggling nodes. One of the few approaches that can be extended to the training of DNNs is the Generalized PolyDot codes [96], which are used to compute matrix-vector multiplications. The Generalized PolyDot codes are used to code the linear operations at each layer of the neural network. This coding scheme allows for errors in the training of each layer. In other words, decoding can still be performed correctly given that the amount of errors does not exceed the maximum error tolerance level. The effectiveness of coding techniques in mitigating the straggler effects for different neural network architectures such as AlexNet [12] and the Visual Geometry Group (VGG) network [159] in an IoT system is illustrated in [160]. However, this unified coded DNN training strategy may not be applicable to the training of other neural networks which have a large number of non-linear functions. As such, the authors in [153] propose a learning-based approach for designing codes. Based on dilated CNNs and multilayer perceptrons (MLPs), neural network architectures and a training methodology are proposed to learn the encoding and decoding functions. The outputs of the decoding functions are used to approximate the unavailable outputs of any differentiable non-linear function.
The simulation results show that the learning-based approach to designing the encoding and decoding functions can accurately reconstruct 95.71%, 82.77% and 60.74% of the unavailable ResNet-18 classifier outputs on the MNIST, Fashion-MNIST and CIFAR-10 datasets, respectively.

C. Exploitation of Stragglers
To avoid delays caused by the straggling nodes in the network, most distributed computation schemes ignore the work completed by the straggling nodes, either by increasing the workload of the non-straggling nodes or by obtaining less accurate approximate solutions. However, the amount of work that has been completed by the straggling nodes, especially the non-persistent stragglers, is non-negligible and can be better utilized.

In order to exploit the computational capacities of these non-persistent stragglers, multi-message communication (MMC) is used, where the workers are allowed to send multiple messages to the master node at each iteration. This allows the workers to transmit their partial computed results whenever they complete a fraction of the allocated task, rather than completing the entire computation task before transmitting the computed result in a single communication round. The work in [138] considers the implementation of Lagrange Coded Computing (LCC) with MMC to minimize the average job execution time, at the expense of a higher communication load due to the increase in the number of messages transmitted by the workers to the master node. Since the LCC scheme utilizes polynomial interpolation to recover the final result, the decoding complexity and the number of transmissions can be further reduced by increasing the number of polynomials used in decoding the computed results returned by the workers. The simulation results show that by exploiting the computing resources of the non-persistent stragglers via MMC, the average job execution time decreases as the computation load of each worker increases. However, since the communication load of the LCC-MMC scheme is constant as the computation load increases, it is suitable only when computation time dominates the overall execution time of the distributed tasks. The total time needed to execute the distributed tasks includes the time needed for both computation and communication.
Otherwise, if the communication load is the cause of the bottleneck in the network, LCC without MMC should be used instead, since the communication load can be reduced at the expense of higher computation load.

Given that MMC is allowed, where the workers perform more than one round of communication for each iteration, the computation work done by the straggling nodes can be exploited by allowing sequential processing, where the workers transmit the computation results of a completed subtask before working on the next subtask. To fully exploit the useful information provided by the straggling nodes, the hierarchical coded computation scheme is proposed in [139] to utilize the computations from all workers. Each worker first divides the allocated computation task into layers of sub-computations, which are processed sequentially, i.e., the result of a layer of sub-computation is transmitted to the master node before the computation of the next layer starts. Since the workers have different processing speeds and the sub-computations are performed sequentially, the finishing time for each layer differs. MDS codes are used to encode the layers so that the finishing time of each layer is approximately the same. The top layers, which have a lower probability of erasure, are encoded with higher-rate MDS codes, whereas the bottom layers are encoded with lower-rate MDS codes. The simulation results show that for the same amount of tasks to be completed, the proposed hierarchical coded computation scheme achieves an improvement in expected finishing time as compared to the coded computation scheme proposed in [19], which ignores the computations of the straggling nodes. In the study of [161], by computing the block products sequentially, the partial computation results from the straggling nodes are used to aggregate the final result.
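The MDS-coded computation that underlies the layered scheme above can be sketched generically: the data matrix is split into k blocks, encoded into n coded blocks with a Vandermonde generator, and the master recovers A @ x from any k returned results. This is a hedged sketch of the generic primitive, not the hierarchical layering of [139]; sizes and evaluation points are illustrative assumptions:

```python
# Generic sketch of MDS-coded matrix-vector multiplication: encode k
# blocks into n coded blocks via a Vandermonde generator; any k of the
# n worker results suffice to decode. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 5
A = rng.standard_normal((6, 4))  # 6 rows -> k blocks of 2 rows each
x = rng.standard_normal(4)

blocks = np.split(A, k)                        # uncoded blocks A_0..A_{k-1}
points = np.arange(1, n + 1, dtype=float)      # distinct evaluation points
G = np.vander(points, k, increasing=True)      # n x k Vandermonde generator

# Worker j holds the coded block sum_i G[j, i] * A_i and returns its
# product with x.
coded_results = [sum(G[j, i] * blocks[i] for i in range(k)) @ x
                 for j in range(n)]

# Any k workers suffice: invert the k x k sub-generator for that subset.
alive = [0, 2, 4]                              # assumed non-stragglers
Y = np.stack([coded_results[j] for j in alive])
decoded = np.linalg.solve(G[alive], Y)         # rows are A_i @ x
y = np.concatenate(decoded)
assert np.allclose(y, A @ x)
print("recovered A @ x from", len(alive), "of", n, "workers")
```

The layered scheme of [139] applies this primitive per layer, choosing a higher code rate for layers that are more likely to finish everywhere.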
In order to preserve the sparsity of the matrices processed by the workers, instead of coding the entire matrix, the fraction of coded blocks can be specified to control the level to which coding is utilized in the solution. Two different approaches are considered, depending on the placement of the coded blocks. When the coded blocks appear at the bottom of the workers, it is easier for the master to decode. When the coded blocks appear at the top, the computation load of the workers is minimized. As such, the placement can be calibrated based on the task requirements.

While the coded computation schemes achieve low communication load and reduce the average job execution time of each iteration, the uncoded computation schemes have their own benefits of having no decoding complexity and allowing partial gradient updates. In order for a system to benefit from the advantages of both schemes, the coded partial gradient computation (CPGC) scheme is proposed in [162]. In the CPGC scheme, uncoded submatrices are allocated for the first computation, since there is a high probability that each worker completes its first computation task. Subsequently, coded submatrices are allocated to handle the straggling nodes. The master node is able to update the gradient parameters by using the computation results from a subset of workers and by exploiting the partial computations of the straggling nodes. As a result, the average job execution time of each iteration is reduced.

D. Summary and Lessons Learned
In this section, we have discussed three approaches (Fig. 9) to mitigate the straggler effects. The lessons learned are as follows:

Fig. 9: Approaches to mitigate the straggler effects include (a) computation load allocation by predicting the speed of the processing nodes [136], [137], (b) approximate coding, and (c) exploitation of stragglers by allowing multi-message communications [138], [139].

• The straggler effect is a key issue to be resolved in order to reduce computation latency, hence minimizing the overall job execution time. Due to various factors such as insufficient power, contention of shared resources, imbalanced workload allocation and network congestion [64], [65], some processing nodes may run slower than the average or even be disconnected from the network. Since the computation tasks are only completed when all processing nodes complete their computations, the time needed to complete the tasks is determined by the slowest processing node. Coding techniques have shown their effectiveness in reducing computation latency by introducing redundancy [19]. In this section, we have explored the use of coding techniques for different distributed computation tasks, e.g., matrix-vector multiplications, matrix-matrix multiplications, linear inverse problems, iterative algorithms, convolutions and non-linear problems. While most of the research focuses on the design of encoding techniques, the decoding complexity of the codes also affects the computation latency significantly. Apart from Reed-Solomon codes [83] and LDPC codes [15], more effective codes with low decoding complexity can be investigated in future studies.

• Considering heterogeneities in the capabilities of the processing nodes, effective computation load allocation strategies are implemented to allocate workload to the processing nodes. We have discussed the proposed computation load allocation algorithms under different constraints, e.g., strict deadlines and time-varying computing resources. In addition, different prediction methods, such as the LSTM algorithm [136], an ARIMA model and a Markov model [137], that predict the speeds of the processing nodes are explored. However, the stragglers may be non-persistent in nature and thus they may be useful when they are able to perform computations faster than the average rate.
Hence, load allocation based on the responsiveness of the processing nodes may be more useful in such situations.

• Instead of exact solutions, it is acceptable to derive approximate solutions for some applications, e.g., location-based recommender systems. Various coding techniques to derive approximate solutions are investigated. For example, in the studies of [147] and [148], sketching techniques are used with structured codes to minimize computation latency. However, there exists a tradeoff between the recovery threshold and the approximation error. For future works, an improvement to this tradeoff can be investigated.

• Although the straggling nodes run slower than the average, the computations that they complete may still be useful. It is wasteful not to utilize the partially computed results of the straggling nodes. Besides, these partial results can help to improve the accuracy of the estimates. For example, in [163], the stragglers are treated as soft errors instead of erasures to minimize the mean-squared error of iterative linear inverse solvers under a deadline constraint by using approximate weights. Outputs from all computing nodes, including the straggling nodes, are used to recover estimates that are as close as possible to the convergence values when the computation deadline is brought forward or when the number of computing nodes increases. Unfortunately, as the processing nodes are required to send their partial results once they complete them, more communication rounds are performed. The high communication costs may be the bottleneck of the distributed computation tasks. Given the advantages of using partial results from the straggling nodes, optimization approaches to minimize the communication costs between the master node and the workers should be explored.

• The current studies in this section have proposed effective coding schemes for implementation.
However, they do not consider security in the design of the coding schemes. For example, the FRC scheme [81] achieves high accuracy in the presence of stragglers, but it is susceptible to attacks from adversarial stragglers, which turn more processing nodes into straggling nodes. Besides, other than the straggling workers, there may exist malicious or curious workers that may compromise the privacy and security of the system. Therefore, approaches to ensure secure coding are discussed in depth in the next section.

VI. SECURE CODING FOR DISTRIBUTED COMPUTING
In distributed computing, the data owner, master node, and workers may not belong to the same entity. For example, the data owner may wish to perform a task on a massive dataset on which intensive computations have to be performed. The computations may be divided and distributed to multiple workers on third-party computing services. However, sensitive data, e.g., in healthcare services [164], may be involved. In this case, curious workers may collude to obtain information about the raw data, whereas malicious workers [165] may intentionally contribute erroneous inputs to introduce biases to the model. Besides, in some cases, the dataset does not belong to either the master node or the workers, and as such, the raw dataset has to be guarded against both parties.

To ensure that privacy and security are preserved during the computing tasks, conventional methods such as homomorphic encryption [166] have been proposed, in which the data is first encrypted before being shared with the workers. However, the encryption techniques are found to be costly in terms of computation and storage costs [167]. Besides, the secure multi-party computation (MPC) approaches [168]–[170] mainly focus on the correctness and privacy of the data [20], while neglecting to reduce the complexity of computation at each worker, or the number of workers required to perform the task. Recently, coding techniques that were originally introduced to mitigate straggling workers are increasingly utilized to provide information-theoretic privacy guarantees. Specifically, information-theoretic privacy considers a setup consisting of honest-but-curious workers, in which collusions formed between T of N workers do not leak information about the training dataset. Coding schemes can be used to mitigate not only the stragglers but also the colluding curious workers and malicious workers, as illustrated in Fig. 10. For ease of exposition, we classify the related studies into two main categories in this section:

• Secure Distributed Computing:
The studies in this category aim to reduce the number of workers needed for information-theoretic privacy, i.e., where the colluding workers are unable to infer sensitive information from the dataset. For some studies, this objective is met while simultaneously preserving the efficiency of distributed computing, e.g., through providing resiliency against straggling workers [21].

• Secure Distributed Matrix Multiplication (SDMM):
The studies in the aforementioned category mainly focus on generic operations, e.g., addition, subtraction, multiplication, or the computation of polynomial functions, whereas the studies in this category focus specifically on matrix multiplication. One key difference between the two categories is that SDMM considers the specific scenario in which both input matrices in the multiplication operation are private information, i.e., two-sided privacy [171], whereas the prior category mainly considers one-sided privacy, i.e., only one input matrix is private. In addition, a performance metric of interest in the SDMM literature is the download rate, i.e., the ratio of the size of the desired result to the total amount of information downloaded by a user from the workers.
A. Secure Distributed Computing
In Section III-B, we discussed that the polynomial codes proposed in [78] have the desirable property of an optimal recovery threshold that does not scale with the number of workers involved. In consideration of this useful property, the authors in [20] propose the polynomial sharing approach, which combines the polynomial codes and the Ben-Or, Goldwasser, and Wigderson (BGW) scheme [172]. The system model considered in this study is that the data originates from external sources, and thus has to be kept private against both the workers and the master node. In contrast to the BGW approach, which uses Shamir's scheme to encode the dataset, the study of [20] proposes to encode the dataset using the polynomial coding scheme. The authors show that the polynomial sharing approach may be applied to perform several procedures, e.g., addition, multiplication, and the computation of polynomial functions, while reducing the number of workers required to complete the task as compared to conventional MPC approaches, even when workers have capacity-limited communication links.

Fig. 10: Illustration of the coding framework for the objectives of mitigating stragglers, colluding curious workers, and malicious workers.

Typically, in conventional polynomial coding schemes, the dataset on which computations are performed is divided into multiple sub-tasks, with one sub-task encoded and assigned to each worker. In this case, faster workers that complete their task will be idle while waiting for straggling workers. To further mitigate the straggler effects, the authors in [173] leverage computation redundancy to propose the private asynchronous polynomial coding scheme, in which a computation task is divided into several relatively smaller sub-tasks for distribution to each worker. This results in two key advantages, in addition to retaining the privacy preservation properties of polynomial coding. Firstly, the smaller sub-tasks can be successfully completed by straggling workers with limited computing capacity. Secondly, the workers of the fastest groups are assigned more tasks to continue working throughout the whole duration rather than wait for the stragglers, thus reducing the computation time.

However, the studies [20], [173] mainly utilize polynomial coding for privacy preservation, which is restrictive in certain aspects, e.g., it only allows column-wise partitioning of the matrices [94].
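The secret-sharing primitive underlying the BGW scheme can be made concrete with a minimal sketch of Shamir's (T+1)-out-of-N scheme: the secret is the constant term of a random degree-T polynomial, so any T shares reveal nothing while any T + 1 shares reconstruct it. The prime modulus and all parameters below are illustrative choices of ours, not values from [172].

```python
import random

# Sketch of Shamir's (T+1)-out-of-N secret sharing over GF(P): any
# T shares leak nothing about the secret; any T + 1 shares recover it
# by Lagrange interpolation of the polynomial at x = 0.
P = 2_147_483_647  # a Mersenne prime, large enough for this demo

def share(secret, n, t):
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    return {i: sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P
            for i in range(1, n + 1)}

def reconstruct(shares):
    # Lagrange interpolation at x = 0 over GF(P); pow(., P-2, P) is the
    # modular inverse by Fermat's little theorem.
    s = 0
    for i, yi in shares.items():
        li = 1
        for j in shares:
            if j != i:
                li = li * (-j) * pow(i - j, P - 2, P) % P
        s = (s + yi * li) % P
    return s

shares = share(42, n=5, t=2)
subset = {k: shares[k] for k in (1, 3, 5)}  # any t + 1 = 3 shares suffice
assert reconstruct(subset) == 42
```

Polynomial sharing replaces the random Shamir polynomial with a structured polynomial code, which is what reduces the worker count for the computations listed above.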
TABLE VIII: Comparison of the Shamir (BGW), LCC, and Harmonic coding schemes for gradient-type computations [21], [172], [177].

                          Shamir          LCC            Harmonic
Min. number of workers    K(deg g + 1)    K deg g + 1    K(deg g − 1) + 2

As such, the entangled polynomial codes [94] are applied by [174] as an extension to polynomial sharing, so as to further reduce the restrictions during the data sharing phase, and hence the number of workers required to perform the same computations while meeting privacy constraints.

While the studies of [20], [174] consider the scenario in which honest-but-curious workers are involved, workers may randomly be malicious in nature. As an illustration, a group of workers may be involved to compute gradients towards training a machine learning model. However, the gradients may be intentionally misreported by the workers to introduce biases or inaccuracies to the model [165]. An existing approach is to perform median-based, rather than mean-based, aggregation of the gradients to eliminate misreports, which are usually outliers [175]. However, median-based aggregation is computationally costly and faces convergence issues. As such, the study of [176] proposes DRACO, which is based on the coding of gradients and algorithmic redundancy, i.e., each worker evaluates redundant gradients, such that the accurate gradients may be derived even in the presence of adversarial nodes. The simulation results show that DRACO is more than 3 times faster in achieving 90% test accuracy for gradient computations on the MNIST dataset as compared to the geometric median method.

An improvement to the studies of [20], [174], [176] is made in [21], which proposes LCC to achieve an optimal tradeoff between resiliency against straggling workers, security against malicious workers, and information-theoretic privacy amid colluding workers. In LCC, the dataset of the master is encoded using the Lagrange polynomial to create computational redundancy. Then, the coded data is shared with the workers, which compute on the encoded data as if the coding did not take place. In comparison with the BGW MPC scheme [172], LCC requires more workers. However, the Lagrange polynomial based encoding leads to a reduction in the amount of randomness required to encode the data, which translates to lower storage and computation costs incurred by each worker. LCC also outperforms the BGW based polynomial sharing [20] in terms of communication costs, given that the polynomial sharing scheme requires a communication round for each bilinear operation. In addition, LCC is less computationally costly than DRACO [176], which does not utilize the algebraic structure of the encoded gradients. However, Lagrange coding only works for computations involving arbitrary multivariate polynomial functions of the input dataset.
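The Lagrange encoding step at the heart of LCC can be sketched as follows. The example below is a simplified real-valued illustration of ours: LCC proper operates over a finite field and appends random blocks for privacy, and the evaluation points and block sizes here are arbitrary assumptions.

```python
import numpy as np

# Sketch of Lagrange encoding as in LCC: data blocks X_1..X_K define a
# degree-(K-1) matrix polynomial u with u(alpha_j) = X_j, and worker i
# receives the evaluation u(beta_i). Because any polynomial f of the
# data then satisfies f(u(z)) at the workers' points, the master can
# interpolate f(X_j) from enough worker results.
def lagrange_encode(blocks, alphas, betas):
    def u(z):
        total = np.zeros_like(blocks[0], dtype=float)
        for j, (aj, Xj) in enumerate(zip(alphas, blocks)):
            lj = np.prod([(z - ak) / (aj - ak)
                          for m, ak in enumerate(alphas) if m != j])
            total = total + lj * Xj
        return total
    return [u(b) for b in betas]

K, N = 3, 7
blocks = [np.random.randn(2, 2) for _ in range(K)]
alphas = np.arange(1.0, K + 1)   # interpolation points for the data
betas = np.arange(10.0, 10 + N)  # one evaluation point per worker
shares = lagrange_encode(blocks, alphas, betas)
# Sanity check: evaluating back at alpha_j recovers X_j exactly.
assert np.allclose(lagrange_encode(blocks, alphas, alphas)[1], blocks[1])
```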
As an extension, the study of [178] proposes CodedPrivateML, which adopts polynomial approximations to handle the non-linearities of the gradient computation when the sigmoid function is involved, such that logistic regression can be conducted on LCC-encoded data while providing information-theoretic privacy for the dataset. Given that the advantages of LCC are preserved, the experiments conducted on Amazon EC2 clusters validate that the proposed scheme is close to 34 times faster than the BGW based MPC approaches.

In light of the growing popularity of machine learning, the study of [177] proposes Harmonic coding for tasks specific to gradient-type computations, e.g., for loss function minimization in distributed model training. Harmonic coding leverages the structure of the intermediate partial gradients to enable the cancellation of redundant results, such that the encoding and decoding process is more efficient. As such, for the same level of privacy constraint, Harmonic coding improves on Shamir's secret sharing scheme [179] and LCC [21] in terms of requiring fewer workers to compute gradient-type functions. This result is summarized in Table VIII, where we present a comparison of the minimum number of workers required for the discussed schemes. Note that K refers to the number of partitions of the input dataset, g refers to the fixed multivariate polynomial, and deg g refers to the degree of g. Like LCC, Harmonic coding can also be applied universally to any gradient-type function computation. As such, the data encoding can be performed before the computing task is specified, thus further reducing the delay in computation.

B. Secure Distributed Matrix Multiplication (SDMM)
Matrix multiplication is a key operation in many popular machine learning algorithms [180], e.g., principal component analysis [90], support vector machines [181], and gradient-based computations. While the reviewed studies in Section VI-A discuss coded computing for privacy preservation in general operations, the studies discussed next consider tailored strategies for SDMM.

In [182] and [180], the authors propose the use of staircase codes in place of linear secret sharing codes, e.g., Shamir's codes [179]. As an illustration, we consider a master that encodes its data A with a random matrix R into three secret shares before transmitting a share to each of the workers to perform matrix multiplication. When linear secret sharing codes are used, the data and random matrix are not segmented but instead encoded and transmitted as a whole (Table IX). As such, the master has to wait for two full responses from any two of the three workers before being able to decode and derive the desired results. In contrast, when the staircase code is used, the data and random matrices are segmented into sub-shares before transmission to the workers. When sufficient sub-tasks have been completed by the workers, the master can then instruct the workers to cease computation. Clearly, the staircase coding approach reduces the computation cost of the workers and the communication costs incurred by the master node. Accordingly, the staircase coding approach can outperform the classical secret sharing code in terms of mean waiting time. With 4 workers considered, experiments conducted on the Amazon EC2 clusters show a 59% decrease in mean waiting time using staircase codes.

However, [182] and [180] still consider the case of one-sided privacy, i.e., the approach is designed to keep only one of the two input matrices involved in SDMM operations private.
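The linear secret sharing column of Table IX can be sketched directly: the master hides A behind a single random mask R and recovers A·B from any two of the three worker responses. Real-valued matrices stand in for finite-field arithmetic here, so the privacy claim is only illustrative.

```python
import numpy as np

# Sketch of one-sided SDMM via linear secret sharing: worker i receives
# the share s_i = R + i*A (over a finite field, one share alone reveals
# nothing about A), multiplies it by the second matrix B, and the
# master recovers A @ B from ANY two of the three responses.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # private input of the master
B = rng.standard_normal((3, 2))   # second input, known to the workers
R = rng.standard_normal((3, 3))   # random mask

responses = {i: (R + i * A) @ B for i in (0, 1, 2)}  # workers' replies

# (R + j*A) @ B - (R + i*A) @ B = (j - i) * A @ B, so two replies suffice.
i, j = 0, 2                       # suppose worker 1 straggles
AB = (responses[j] - responses[i]) / (j - i)
assert np.allclose(AB, A @ B)
```

The staircase code splits A and R into sub-shares so that partial replies are already useful, which is where its mean-waiting-time gain comes from.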
As such, several studies have shifted the focus towards applying coding techniques to the specific case in which two-sided privacy, i.e., in which both input matrices are private, is ensured. Among the first such studies is that of [171], which applies aligned secret sharing for two-sided privacy. Specifically, the input matrices are split into submatrices and encoded with random matrices. Then, the undesired terms are aligned such that the server only recovers the desired results and saves on communication costs. This leads to an improved download rate, i.e., the ratio of the size of the desired result to the amount of information downloaded by a user, over conventional secret sharing schemes.
TABLE IX: Comparison of linear secret sharing codes and staircase codes in the distribution of computation tasks in a system with three workers [180], [182].

                              S_1                 S_2                 S_3
Linear secret sharing code    R                   R + A               R + 2A
Staircase code                A_1 + 2A_2 + 4R,    A_1 + 2A_2 + 4R,    A_1 + 2A_2 + 4R,
                              R_1 + R_2           R_1 + 2R_2          R_1 + 3R_2

Following [171], the study of [183] proposes an inductive approach to find a close-to-optimal partition of the input matrices in consideration of two metrics, namely the download rate and the minimum number of required workers. The proposed scheme improves on the download rate, the number of tolerable colluding servers, and the computational complexity as compared to the study of [171]. Inspired by [78], the polynomial coding scheme is also extended to SDMM operations, and specifically convolution tasks, in [184], while preserving two-sided privacy and a download rate similar to that of [183], and further mitigating the straggler effects. For convolution tasks, the authors leverage the inherent property whereby the sums of convolutions of sub-vectors may be used to derive the convolution result. Then, the upper and lower bounds of the recovery threshold are derived to show that an order-optimal recovery threshold is achieved, i.e., it does not scale with the number of workers.

However, the key weakness of [171] and [183], as indicated in [185], is that the proposed theoretical results do not clarify the effect of matrix dimensions on the download rate, i.e., the download rates are derived in the case whereby matrices are simply assumed to have large dimensions but without any further specifications. Moreover, the study of [185] found that the results of [171], [183] may be violated in some cases for differing relative dimensions of the input matrices. Under this context, the model proposed in [185] allows the matrix dimensions to be specified, and a new converse bound for two-sided security SDMM is derived.

In general, the encoded results of matrix multiplication are sent to the master, where interpolation is performed to obtain the multiplication results, i.e., coefficients of a polynomial. In [171], [183], the encoding of the private matrices is such that the coefficients are mainly non-zero.
In contrast, the study of [186] proposes the Gap Additive Secure Polynomial (GASP) codes such that there are as many zero coefficients as possible. This allows the product to be interpolated and the desired results to be derived with fewer evaluations performed, which implies that fewer workers are required to perform the matrix multiplication. To assign the exponents for decodability while having as many zero coefficients as possible, the authors propose the degree table to solve the combinatorial problem. In [187], the authors further generalize the GASP codes to be applicable to different partitions of the input matrices and security parameters. The GASP codes are shown to outperform the approaches in [171], [183] in terms of download rate.
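The interpolation step shared by these polynomial-code SDMM schemes can be sketched with the simplest two-sided construction: mask each input with one random matrix, have worker i evaluate the product polynomial at point x_i, and interpolate the constant term. This is our own minimal real-valued example with three workers and tolerance to one curious worker, not the GASP construction itself.

```python
import numpy as np

# Sketch of two-sided SDMM with a degree-1 polynomial code: worker i
# sees only the masked shares A + R*x_i and B + S*x_i, returns their
# product h(x_i) = (A + R*x_i) @ (B + S*x_i), and the master
# interpolates h, a degree-2 matrix polynomial with coefficients
# [A@B, A@S + R@B, R@S], at x = 0 to extract A @ B.
rng = np.random.default_rng(2)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
R, S = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

xs = [1.0, 2.0, 3.0]                                  # one point per worker
responses = [(A + R * x) @ (B + S * x) for x in xs]   # workers' replies

# Lagrange-interpolate h at x = 0: the constant term is A @ B.
AB = sum(r * np.prod([-xj / (xi - xj) for xj in xs if xj != xi])
         for r, xi in zip(responses, xs))
assert np.allclose(AB, A @ B)
```

GASP's degree table chooses the exponents so that more coefficients of the product polynomial vanish, reducing the number of evaluations, and hence workers, needed for this interpolation.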
C. Summary and Lessons Learned
In this section, we have discussed studies that have adopted a coded computing approach towards ensuring secure distributed computing. Then, we discussed the specific applications of SDMM, which require two-sided privacy. The summary and lessons learned are as follows:

• Originally proposed to mitigate stragglers in distributed computing systems, coding approaches have also been extended to ensure information-theoretic privacy [20] and, in some studies, security against malicious workers [21], [176]. However, given that the coding techniques, e.g., polynomial codes, are designed with the intention to mitigate stragglers, the studies discussed in this section have focused on modifications to the approaches to simultaneously achieve both straggler mitigation and privacy. As mentioned in this section, the coding approaches to ensure privacy preservation have outperformed conventional secret sharing schemes that mainly focus on data correctness and privacy.

• With the recent popularity of machine learning, several studies have also focused on tailoring the coding approaches to various machine learning tasks, e.g., entangled polynomial coding for logistic regression [174] and Harmonic coding for gradient based computations [177]. In particular, given the importance of two-sided privacy in SDMM, we have also specifically discussed papers that focus on two-sided privacy separately from papers that only consider one-sided privacy. In general, papers that take advantage of the algebraic structure of the specific operations [21], e.g., convolution tasks [184] or gradient computations [177], for efficient encoding and decoding usually perform better.

• In most of the studies that we have discussed, the focus tends to be on information-theoretic privacy.
However, weak security is also important and may be explored in future works. Specifically, a system is weakly secure if attackers are unable to learn the sensitive intermediate values without having received a certain number of coded packets [188]. This may be relevant when there exist eavesdroppers with access to the communication links between the master and workers [184]. For example, in the study of [189], a data shuffling design and a redundancy reduction algorithm to assign computing tasks have been proposed to ensure weak security in the system. However, the focus has been on workers, rather than eavesdroppers that may not be involved in the computations.

• Given that the straggler effect is a fundamental concern in distributed computing, most papers have considered workers with heterogeneous computing capabilities. However, as discussed in [184], the issue of heterogeneous networks in other aspects has been under-explored in the aforementioned studies. For example, the workers may have different levels of reputation [190]. This enables the adoption of context-sensitive solutions, e.g., data security may be guarded against new workers or workers with low reputation, whereas it may not be required for trusted workers. With this, the computation complexity and duration may be reduced for trusted nodes.

Fig. 11: Application of CDC to the Network Function Virtualization (NFV) model for uplink channel decoding.
VII. CDC APPLICATIONS
A. Network Function Virtualization (NFV)
Network Function Virtualization serves as an enabling technology for optimizing 5G as well as emerging 6G networks [191], presenting a promising paradigm shift in the telecommunication service provisioning industry. By leveraging virtualization technologies, NFV simplifies the management and operations of networking services. In particular, NFV decouples the Network Functions (NFs), such as routing and baseband processing, from the physical network equipment on which they operate. The NFs are mapped to Virtual Network Functions (VNFs) that are supported by Commercial Off-The-Shelf (COTS) physical resources which provide storage, networking and computational capabilities. As the software component in the network is decoupled from the hardware component by a virtualization layer, the VNFs can be easily implemented over the distributed network locations which have the required hardware resources. Due to the flexibility in the deployment of the VNFs, NFV brings about a significant reduction in operating and capital expenses. Moreover, the development of new networking services is faster and cheaper, as the COTS resources can be instantiated easily to provide the required network connection services. Extensive research has been carried out on various aspects of NFV, such as architectural designs [191], [192], resource allocation [193], [194], energy efficiency [195], performance improvement [196] and security [197], [198]. More details can be found in [191], [199], [200].

However, one of the limiting factors of the performance of NFV lies in the reliability of the COTS hardware resources [191], [201]. Hardware failure due to several factors, such as component malfunctioning, temporary unavailability and malicious attacks, affects the implementation of NFV, hindering the provision of services.
Apart from the fault-tolerant virtualization strategies that are based on the diversity approach, which maps VNFs onto various virtual machines (VMs) such that the probability of a disruptive failure is minimized [199], a coding approach can also be used to minimize computation latency in NFV.

In the study of [202], the authors consider the highly complex uplink channel decoding of the Cloud Radio Access Network (C-RAN) architecture [203], which is a key application of NFV. In this system, the users communicate with the cloud via a Remote Radio Head (RRH). In order to ensure the reliability of channel decoding, the data frames received by the RRH are encoded by leveraging their algebraic structures before being distributed to the different VMs, as shown in Fig. 11. The simulation results show that the coding approach achieves a lower probability of error for decoding at the cloud than that of the diversity-based approach. However, several assumptions are made in this proposed scheme. Firstly, a binary symmetric communication channel is assumed between the users and the RRH. By considering other communication channels, such as additive Gaussian noise channels, different coding techniques may be applied. Secondly, this simple framework works well for a network with three processing nodes, but its performance for larger-size networks is not guaranteed.

Considering the same issue of uplink channel decoding in the C-RAN architecture, the authors in [204] propose a more generalized coded computation framework that works for any number of servers, random computing runtimes and random packet arrivals by adopting the coding approach proposed in [202]. Given the randomness in the arrivals of data frames transmitted by the users, two queue management policies are considered: (i) per-frame decoding, where one frame is decoded at any point of time, and (ii) continuous decoding, where the servers start to decode the next packet of a data frame upon completion of the first packet.
There is an underlying tradeoff between the average decoding latency and the frame unavailability probability, which is an indication of the reliability of the decoding process at the servers. The simulation results show that properly designed NFV codes are useful in achieving the desired tradeoffs by optimizing the minimum distance of the codes.

The idea of adopting a coding approach in NFV is relatively new. Different coding techniques can be explored in the future. Besides, instead of addressing uplink channel decoding, which is the most computationally-intensive baseband function [203], other network functions such as routing and security functions can be considered.
B. Edge Computing
With the enhanced sensing capabilities of end devices, an overwhelming amount of data is produced at the edge of the network today. Traditional schemes of computation offloading to the cloud are thus unsustainable. Moreover, certain edge applications may involve end devices in remote areas that have limited connectivity. This necessitates a paradigm shift towards edge computing, in which computation is performed closer to the edge of the network where the data is produced. However, resource-constrained devices may not be able to carry out complex computations individually, especially given the increasing size and complexity of state-of-the-art AI models [205]. As such, one of the enabling technologies of edge computing is cooperative computation, in which the available resources of end devices and edge nodes, e.g., roadside units in vehicular networking, can be pooled together to execute computation-intensive tasks collaboratively [23]. For ease of exposition, we refer to these participating end devices and edge nodes as workers in this section.

As the number of devices connected to the network increases, more information needs to be exchanged among the workers, resulting in a high communication load. However, the communication bandwidth is fixed and thus the network is unable to handle the high communication load, causing a bottleneck as a result. Moreover, the heterogeneous nature of workers in the edge computing paradigm, e.g., in terms of computational and communication capabilities, can lead to the straggler effects. In the face of these challenges, coding techniques can be used.

In the study of [206], the coded wireless distributed computing (CWDC) framework is proposed. The system model consists of multiple devices, i.e., workers, involved in cooperative computation. As an illustration, a worker may have an input, e.g., an image, that has to be processed, e.g., for object recognition. Individually, a worker may not have the storage or computational capabilities to execute the task.
Therefore, the inference model may be split and stored on each worker, whereas the cooperative computation of results can be implemented following the MapReduce framework as discussed in Section II-A. An access point, e.g., a base station or a Wi-Fi router, can then be utilized to facilitate the exchange of intermediate results among workers. The proposed framework achieves communication loads that are independent of the size of the network and the storage size of the workers. Moreover, the CWDC framework can be generalized and applied to different types of applications.

In practical distributed computing systems, the workers may have heterogeneous computational, communication, and storage capabilities. Based on a system model similar to that proposed in [206], where the workers communicate with each other via an access point, the study in [207] considers devices with heterogeneous storage capacities over wireless networks. For uplink transmission, the allocation of files is based on the scheme proposed in [107], as previously discussed in Section IV-A, whereas for downlink transmission, data is encoded at the access point for the reduction of the downlink communication load. However, the achievable scheme has only been validated in a small network that consists of just three processing nodes.

In light of the growing popularity of machine learning model training at the edge, the study of [208] considers the distribution of gradient descent computations across workers in the network to train a linear regression model. The proposed heterogeneous coded gradient descent (HCGD) scheme assigns each worker an optimal load partition, through modelling the computation delay of devices with a shifted exponential distribution. In consideration of data privacy, the authors in [209] propose the Coded Federated Learning (CFL) approach for privacy-preserving linear regression.
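Straggler tolerance in distributed gradient descent of this kind can be illustrated with a classic gradient-coding construction (a generic textbook example in the same spirit, not the HCGD or CFL schemes themselves): three workers each send one coded combination of the partial gradients, and the full gradient is recoverable from any two replies.

```python
import numpy as np

# Minimal gradient-coding example: partial gradients g1, g2, g3 are
# combined so that the full gradient g1 + g2 + g3 can be decoded from
# ANY 2 of the 3 worker messages, i.e., one straggler is tolerated.
g1, g2, g3 = (np.random.randn(4) for _ in range(3))

sent = {
    1: 0.5 * g1 + g2,   # worker 1's coded message
    2: g2 - g3,         # worker 2's coded message
    3: 0.5 * g1 + g3,   # worker 3's coded message
}
# For each pair of responders, the decoding coefficients (a, b) satisfy
# a * sent[i] + b * sent[j] = g1 + g2 + g3.
decode = {(1, 2): (2, -1), (1, 3): (1, 1), (2, 3): (1, 2)}

full = g1 + g2 + g3
for (i, j), (a, b) in decode.items():
    assert np.allclose(a * sent[i] + b * sent[j], full)
```

The redundancy here is that g1 is computed by two workers; CFL instead creates redundancy by uploading parity data so the server itself can substitute for stragglers.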
Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm proposed in [210], in which the sensitive data of data owners is kept locally, whereas only the model parameters are transmitted to the central server for aggregation to update the model. However, FL still suffers from the issues of straggling devices and communication inefficiency [165]. As such, in the proposed CFL scheme, each data owner first generates parity data from its local data for transmission to the central server. At the central server, gradients are also computed from the parity data simultaneously, such that only a subset of gradients from the data owners have to be received for completion of the model update. Figure 12 illustrates the implementation of the CFL scheme. The CFL approach is also shown to converge almost four times faster than an uncoded approach. However, the communication and computation costs involved in generating and transmitting the parity data have not been well elaborated in the study. Clearly, a major weakness of the aforementioned studies of [208], [209] is that they can only be applied to linear regression model training. The study of [211] extends the aforementioned works with the proposed CodedFedL scheme, designed to mitigate stragglers in non-linear regression and classification tasks.

Fig. 12: Illustration of the Coded Federated Learning (CFL) scheme in a FL system that consists of multiple data owners and a FL model owner.

In some cases, the resource levels of the devices may not be known by the network operator. To enable dynamic and adaptive coded sub-task allocation for cooperative computation, an Automatic Repeat reQuest (ARQ) mechanism is proposed in the study of [143], in which devices are allocated specific levels of packets for computation based on their responsiveness. Specifically, devices that are more responsive are assumed to have more available resources. These devices
These devices will thus be assigned more sub-tasks for computation.

Beyond reducing communication load, it is also important to consider communication efficiency, i.e., the achieved data rates, especially in wireless networks that have limited spectral resources or suffer from mutual interference among users. In order to improve spectral efficiency, a co-channel communication model is proposed in the study of [212], which consists of two stages, i.e., the uplink multiple access stage and the downlink broadcasting stage. This communication model turns out to be equivalent to a multiple-input multiple-output (MIMO) interference channel. Interference alignment [213] has been an effective approach for handling mutual interference among users: the signals are precoded into the same subspaces at the unintended receivers, and the desired signals are recovered at the intended receivers by using a decoding matrix. A linear coding scheme is adopted to establish the conditions for interference alignment, and a low-rank optimization problem is formulated to minimize the number of channel uses subject to the established interference alignment conditions. By solving this optimization problem, the achievable symmetric degrees-of-freedom (DoF), which indicates the extent to which interference is eliminated, can be maximized. In [212], an efficient difference-of-convex-functions (DC) algorithm based on a DC representation of the rank function is proposed to solve the low-rank optimization problem. The performance of the DC algorithm is evaluated in two scenarios by varying: (i) the number of files stored in the devices, and (ii) the number of antennas equipped by the devices. In both scenarios, the achievable DoF increases as the number of stored files or equipped antennas increases.
Furthermore, the simulation results show that the DC approach achieves a higher DoF than existing benchmark algorithms, e.g., the iterative reweighted least squares (IRLS) algorithm and the nuclear norm relaxation approach. However, the proposed scheme is based on a homogeneous network, i.e., the number of files stored is the same across all devices and all devices are equipped with the same number of antennas. As an extension, heterogeneous networks can be considered.

Instead of communicating with each other via an access point, the devices can communicate directly with each other over wireless interference channels [214]. In particular, the transmission of data in the Shuffle phase operates over wireless interference channels. While the CDC scheme in [17] allows communication based on a time-division multiple access (TDMA) scheme, in which each processing node transmits one coded information packet at any time, the one-shot linear scheme adopted in [214] allows more than one processing node to transmit information simultaneously in any given time slot. The transmitted symbols are linear combinations of coded intermediate results from the processing nodes. The transmitted symbols are broadcast to the processing nodes, following which the nodes can decode to recover the desired information. The study in [214] characterizes an improved computation-communication tradeoff compared to the study in [17]. However, the proposed scheme operates under the assumption of perfect channel state information (CSI), where the CSI is available to all processing nodes. As such, the authors in [215] propose a superposition coding scheme which performs better than the CDC scheme [17] and the one-shot linear scheme [214] under imperfect CSI conditions.
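To make the simultaneous-transmission idea concrete, the following is a toy numerical sketch (our own illustration under the perfect-CSI assumption, not the exact one-shot scheme of [214]): two nodes transmit coded intermediate values in the same time slot, each receive observation is a superposition of both transmissions through the channel matrix, and zero-forcing with the known channel separates the two values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two processing nodes transmit their coded intermediate results s[0], s[1]
# simultaneously in one time slot (instead of two sequential TDMA slots).
s = np.array([3.0, -1.5])

# Flat-fading channel gains from the two transmitters to the two receive
# observations; perfect CSI means H is known at the decoding nodes.
H = rng.normal(size=(2, 2))

# Each observation is a linear superposition of both transmissions.
r = H @ s

# Zero-forcing decoding: invert the known channel to separate the symbols.
s_hat = np.linalg.solve(H, r)
```

With perfect CSI, both coded values are recovered from a single time slot; with imperfect CSI, H is only estimated and the inversion leaves residual interference, which motivates the superposition coding scheme of [215].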
The study is extended to account for the presence of stragglers in the networks [216].

In general, the aforementioned studies have considered conventional implementations of the CDC scheme, in which each device has to complete its computation task first before transmitting the intermediate results, e.g., to other devices via the access point. However, given the limited computational capabilities of devices, the latency involved can still be significant. As such, the studies in [145] and [217] consider the batch-processing based coded computing (BPCC) framework, which allows each device to return partially completed results to a master node in batches, even before the task is fully completed. These partial results are then utilized to generate approximations to the full solution through the singular value decomposition (SVD) approach [144]. This is particularly useful in applications that require fast, but not necessarily optimized, results. The effectiveness of the BPCC scheme has been validated on an EC2 computing cluster, where latency is shown to be reduced. Moreover, in consideration of the energy limitations of Unmanned Aerial Vehicles (UAVs), the BPCC framework has been proposed for UAV-based mobile edge computing to provide energy-efficient computation offloading support on demand [218], [219].

C. Summary and Lessons Learned
In this section, we have discussed applications of CDC in NFV and edge computing. The lessons learned are as follows:
• The convergence of the recently popular edge computing and machine learning has given rise to edge intelligence, in which the computational capabilities of edge servers and devices are increasingly leveraged to conduct machine learning model training closer to where the data is produced. While this leads to several benefits, e.g., lower training latency and enhanced privacy preservation, the problem of straggling devices is still a bottleneck. As such, CDC approaches have been increasingly applied in this context.
• One major difference between distributed computing at the edge and at computing clusters is that the end devices and edge servers, i.e., workers, are not specifically dedicated to computations. For example, the workers may only share a fraction of their processing power [208] in time slots in which they are idle. As such, the heterogeneity in computational capabilities of workers may also be greater in this regard. In the face of these challenges, optimal load partition and allocation strategies have been explored. Moreover, in future works, the dynamics of the system may be captured using Deep Reinforcement Learning based resource allocation [220].
• In edge intelligence, there may be several data owners involved. Moreover, the data owners can be devices with computational constraints. As such, conventional techniques of data encoding before transmission to workers for computation may not be feasible. To meet this challenge, FL has been proposed in the work of [210], whereas CFL [209] and CodedFedL [211] have been proposed to mitigate the straggler effects in FL. However, the proposed methods are highly restrictive. For example, CFL can only be applied to linear regression problems, whereas both methods require costly computation and transmission of parity data.
Moreover, they have not been implemented on typical end devices to validate the feasibility of the schemes in practical implementations. For future works, studies on using CDC schemes in edge computing applications may adopt the approach of [145], [217], in which the schemes are implemented under practical hardware constraints.

VIII. CHALLENGES, OPEN ISSUES AND FUTURE WORKS
The utilization of CDC schemes to solve the implementation challenges of distributed computing systems is a relatively new approach. There are still challenges and open issues that have yet to be addressed, and these provide opportunities for new research directions. We present the major challenges that need to be addressed for effective implementation of CDC schemes.
• Heterogeneous nodes:
Compared to traditional distributed computing clusters, heterogeneities among computing nodes are much more significant when the computing nodes, e.g., smartphones and wearable devices, are connected in edge computing networks. Many studies, e.g., [107], [129], [131], [140], have considered heterogeneous systems in which the processing nodes have different computational, communication and storage capabilities. For example, in [131], a joint file and function allocation strategy is proposed to assign jobs to the processing nodes such that the communication load is minimized. In [140], the computation load is allocated based on the capabilities of the processing nodes. However, other aspects of heterogeneity, such as the reputation [190] and the willingness of the processing nodes to participate, are not taken into account. New allocation strategies need to consider different aspects of the heterogeneity of the processing devices so that coding techniques can be implemented effectively and securely. For example, computation tasks can be allocated to workers with higher reputation, which implies a higher probability that the workers complete their allocated tasks.
• Encoding and decoding complexities:
The studies that we have discussed in Section IV and Section V mainly minimize the communication load in the Shuffle phase and the computation latency in the Map phase, respectively. However, the complexities of encoding and decoding are often not evaluated. It is important to ensure low encoding and decoding complexities in order to minimize the overall job execution time. Otherwise, the speedup gain achieved in specific phases, e.g., the communication and computation phases, may be offset by high encoding and decoding complexities. For example, the UberShuffle algorithm [128] incurs high computational overhead in a fast broadcast environment, i.e., networks with large bandwidth, such that it is not feasible for implementation even though it achieves a significant shuffling gain. Hence, to better assess the performance of CDC schemes, the complexities of the encoding and decoding methods have to be evaluated.
• Non-static computing nodes:
In the commonly used distributed computing models discussed in Section II, such as cluster computing, grid computing and cloud computing, the computing nodes are static, i.e., the nodes are located at fixed locations. For example, the servers are located at specific data centers, and the data required for computations is transmitted over wireless communication channels to the servers. However, as edge devices, e.g., IoT devices, wearable devices and vehicles, gain greater communication and computational capabilities, new distributed computing models such as mobile edge computing [24], [221] and fog computing [222] have been developed. In [223], the basic CDC scheme is implemented in the context of fog computing. However, edge devices are usually mobile. The data that is processed by the edge devices depends on the locations that they visit [130], and hence the master node has no control over the data distribution to the workers. New coding approaches for edge and fog computing which involve moving workers can be proposed in future works. One proposed solution is to allocate Reduce functions based on the data stored at the processing nodes [130].
• Security concerns:
Coding techniques are able to mitigate the straggler effects while preserving privacy, as shown in the studies of [20], [21] and [176]. The proposed secure coding techniques are extensions of the coding techniques originally proposed to mitigate the straggler effects. As mentioned in [21], the tradeoff between resiliency against straggling workers, security against malicious workers and information-theoretic privacy amid colluding workers needs to be carefully managed. In addition, there may exist eavesdroppers which tap on the less secure communication links between the master node and the workers. As such, more research effort can be directed towards developing weakly secure systems which prevent eavesdroppers from retrieving sensitive information.
• Network architectures:
It is important to consider how the computing nodes are connected and communicate with each other for effective implementation of CDC schemes. For example, in [109] and [224], the authors introduce a hierarchical structure in which the master communicates with multiple submasters and each submaster leads a group of computing nodes. In the studies that we have reviewed, the network architecture is only considered for the implementation of CDC schemes to reduce communication load. However, it is also an important consideration when designing secure coding schemes. In practice, it may be safe for computing nodes within a group, e.g., from the same location, to share information freely with each other, but not with computing nodes from another group, e.g., from a different location. In addition, the communication channels are not perfect, i.e., they may not have perfect CSI [215], or the transmitted information may have missing entries. Some networks may have limited spectral resources or suffer from mutual interference among users [212]. As such, future research can work towards designing coding schemes that can be implemented in practical distributed computing systems. Besides the need to design effective coding schemes, there is also a need to design low-cost, easily implementable and scalable network architectures to which the coding schemes can be applied.
• Different computation frameworks:
Currently, most of the studies are based on the MapReduce computing model. Specifically, Coded MapReduce is proposed in [18] by implementing coding techniques in the MapReduce framework. However, there are limitations of the MapReduce model that hinder its adoption for all types of distributed computation tasks, as explained in Section II-A. In fact, there are other computing models, such as Spark, Dryad and CIEL, which support iterative algorithms and for which the feasibility of implementing coding techniques has not been explored. As such, the importance of these computing models motivates future directions such as the design of coding schemes that are specific to these computing models in order to solve any distributed computation task, e.g., convolution, Fourier transform and non-linear computations.
• Coding for both communication reduction and straggler mitigation:
As discussed in Section IV and Section V, coding techniques are used either to reduce the communication load or to mitigate the straggler effects; existing coding techniques cannot solve both implementation challenges simultaneously. As characterized in [99], there is a tradeoff between communication load and computation latency. Ideally, however, both communication load and computation latency should be minimized. Thus, it is important to carefully manage the tradeoff to achieve optimal performance of distributed computing systems. For future works on CDC schemes, there is a need to improve the latency-communication tradeoff curve so that the time taken to execute the allocated tasks can be significantly reduced. In addition, instead of the two-dimensional tradeoff, the three-dimensional tradeoff between computation, communication and storage cost [71], which is much more challenging to manage, should be considered.
• CDC applications:
Given these advantages, CDC schemes have been implemented in distributed computing applications such as NFV and edge computing. Apart from UAVs, CDC schemes can be extended to edge computing applications in other areas, e.g., vehicular networks, healthcare systems and industrial operations. In the studies on CDC applications that we have reviewed, the main focus lies in the implementation of coding techniques in various applications, without considering privacy and security. Given the importance of secure coding as discussed in Section VI, secure coding techniques need to be considered in the implementation of CDC applications. Besides, application-specific issues need to be addressed. For example, in vehicular networks where the vehicles are constantly moving, the CDC schemes need to be robust to vehicles which do not have consistent access to the wireless communication channels.

The idea of using coding techniques to overcome the challenges in distributed computing systems is relatively new. For effective implementation in practical distributed computing systems, various aspects, such as the heterogeneity of the computing nodes and the network architectures, are worth in-depth study. The promising research directions presented in this survey serve as useful guidelines and valuable references for future research in CDC.

IX. CONCLUSION
In this paper, we provided a tutorial on CDC schemes and a comprehensive survey of the two main lines of CDC works. We first motivated the need for CDC schemes, showing that the performance of distributed computing systems can be improved using coding schemes. Then, we described the fundamentals and principles of CDC schemes. We also reviewed CDC works which aim to minimize communication costs, mitigate straggler effects, and enhance privacy and security. In addition, we discussed the implementation of CDC schemes in practical distributed computing applications. Finally, we highlighted the challenges and discussed promising research directions.

REFERENCES
[1] V. Cristea, C. Dobre, C. Stratan, F. Pop, and A. Costan,
Large-Scale Distributed Computing and Applications: Models and Trends. IGI Global, 2010.
[2] V. P. Kumar, V. K. Prasanna, S. Iyengar, P. Spirakis, and M. Welsh, Distributed Computing in Sensor Systems. Springer Science & Business Media, 2005.
[3] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al., “Spanner: Google’s Globally Distributed Database,” ACM Transactions on Computer Systems (TOCS), vol. 31, no. 3, pp. 1–22, 2013.
[4] J. Gonzaga, L. A. C. Meleiro, C. Kiang, and R. Maciel Filho, “ANN-based Soft-sensor for Real-time Process Monitoring and Control of an Industrial Polymerization Process,” Computers & Chemical Engineering, vol. 33, no. 1, pp. 43–49, 2009.
[5] A. E. De Giusti, “Structured Parallel Programming: Patterns for Efficient Computation,” Journal of Computer Science and Technology, vol. 15, no. 01, pp. 43–44, 2015.
[6] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol. 51, pp. 107–113, Jan. 2008.
[7] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, “Tarazu: Optimizing MapReduce on Heterogeneous Clusters,” SIGARCH Comput. Archit. News, vol. 40, pp. 61–74, Mar. 2012.
[8] Y. Guo, J. Rao, D. Cheng, and X. Zhou, “iShuffle: Improving Hadoop Performance with Shuffle-on-Write,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1649–1662, 2017.
[9] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing Data Transfers in Computer Clusters with Orchestra,” in Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM ’11, (New York, NY, USA), pp. 98–109, Association for Computing Machinery, 2011.
[10] Z. Zhang, L. Cherkasova, and B. T. Loo, “Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments,” in, pp. 839–846, 2013.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[13] M. A. Attia and R. Tandon, “Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange,” arXiv preprint arXiv:1711.08452, 2017.
[14] D. Wang, G. Joshi, and G. Wornell, “Using Straggler Replication to Reduce Latency in Large-Scale Parallel Computing,” SIGMETRICS Perform. Eval. Rev., vol. 43, pp. 7–11, Nov. 2015.
[15] R. K. Maity, A. Singh Rawa, and A. Mazumdar, “Robust Gradient Descent via Moment Encoding and LDPC Codes,” in, pp. 2734–2738, 2019.
[16] M. A. Maddah-Ali and U. Niesen, “Fundamental Limits of Caching,”
IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2856–2867, 2014.
[17] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A Fundamental Tradeoff Between Computation and Communication in Distributed Computing,” IEEE Transactions on Information Theory, vol. 64, no. 1, pp. 109–128, 2018.
[18] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” in, pp. 964–971, 2015.
[19] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding Up Distributed Machine Learning Using Codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
[20] H. A. Nodehi and M. A. Maddah-Ali, “Limited-Sharing Multi-Party Computation for Massive Matrix Operations,” in, pp. 1231–1235, 2018.
[21] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. Avestimehr, “Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy,” arXiv preprint arXiv:1806.00939, 2018.
[22] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge Computing: Vision and Challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[23] W. Y. B. Lim, J. S. Ng, Z. Xiong, D. Niyato, C. Leung, C. Miao, and Q. Yang, “Incentive Mechanism Design for Resource Sharing in Collaborative Edge Learning,” arXiv preprint arXiv:2006.00511, 2020.
[24] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A Survey on Mobile Edge Computing: The Communication Perspective,” IEEE Communications Surveys Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
[25] K. Krauter, R. Buyya, and M. Maheswaran, “A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing,” Software: Practice and Experience, vol. 32, no. 2, pp. 135–164, 2002.
[26] H. Hussain, S. U. R. Malik, A. Hameed, S. U. Khan, G. Bickler, N. Min-Allah, M. B. Qureshi, L. Zhang, W. Yongji, N. Ghani, J. Kolodziej, A. Y. Zomaya, C.-Z. Xu, P. Balaji, A. Vishnu, F. Pinel, J. E. Pecero, D. Kliazovich, P. Bouvry, H. Li, L. Wang, D. Chen, and A. Rayes, “A Survey on Resource Allocation in High Performance Distributed Computing Systems,” Parallel Computing, vol. 39, no. 11, pp. 709–736, 2013.
[27] D. Datla, X. Chen, T. Tsou, S. Raghunandan, S. S. Hasan, J. H. Reed, C. B. Dietrich, T. Bose, B. Fette, and J.-H. Kim, “Wireless Distributed Computing: A Survey of Research Challenges,” IEEE Communications Magazine, vol. 50, no. 1, pp. 144–152, 2012.
[28] S. P. Ahuja and J. R. Myers, “A Survey on Wireless Grid Computing,” The Journal of Supercomputing, vol. 37, no. 1, pp. 3–21, 2006.
[29] G. L. Valentini, W. Lassonde, S. U. Khan, N. Min-Allah, S. A. Madani, J. Li, L. Zhang, L. Wang, N. Ghani, J. Kolodziej, H. Li, A. Y. Zomaya, C.-Z. Xu, P. Balaji, A. Vishnu, F. Pinel, J. E. Pecero, D. Kliazovich, and P. Bouvry, “An Overview of Energy Efficiency Techniques in Cluster Computing Systems,” Cluster Computing, vol. 16, no. 1, pp. 3–15, 2013.
[30] N. Sadashiv and S. M. D. Kumar, “Cluster, Grid and Cloud Computing: A Detailed Comparison,” in, pp. 477–482, 2011.
[31] H. Kamal Idrissi, A. Kartit, and M. El Marraki, “A Taxonomy and Survey of Cloud Computing,” in, pp. 1–5, 2013.
[32] Y. Xu and H. Qi, “Distributed Computing Paradigms for Collaborative Signal and Information Processing in Sensor Networks,” Journal of Parallel and Distributed Computing, vol. 64, no. 8, pp. 945–959, 2004.
[33] S. U. Khan, A. Y. Zomaya, and A. Abbas,
Handbook of Large-Scale Distributed Computing in Smart Healthcare. Springer, 2017.
[34] B. Tang, Z. Chen, G. Hefferman, S. Pei, T. Wei, H. He, and Q. Yang, “Incorporating Intelligence in Fog Computing for Big Data Analysis in Smart Cities,” IEEE Transactions on Industrial Informatics, vol. 13, no. 5, pp. 2140–2150, 2017.
[35] N. R. S. Raghavan and T. Waghmare, “DPAC: An Object-oriented Distributed and Parallel Computing Framework for Manufacturing Applications,” IEEE Transactions on Robotics and Automation, vol. 18, no. 4, pp. 431–443, 2002.
[36] A. A. Juan, J. Faulin, J. Jorba, J. Caceres, and J. M. Marquès, “Using Parallel & Distributed Computing for Real-time Solving of Vehicle Routing Problems with Stochastic Demands,” Annals of Operations Research, vol. 207, no. 1, pp. 43–65, 2013.
[37] I. Ahmad, M. K. Dhodhi, and A. Ghafoor, “Task Assignment in Distributed Computing Systems,” in Proceedings International Phoenix Conference on Computers and Communications, pp. 49–53, 1995.
[38] M. Kafil and I. Ahmad, “Optimal Task Assignment in Heterogeneous Distributed Computing Systems,” IEEE Concurrency, vol. 6, no. 3, pp. 42–50, 1998.
[39] Sung-Ho Woo, Sung-Bong Yang, Shin-Dug Kim, and Tack-Don Han, “Task Scheduling in Distributed Computing Systems with a Genetic Algorithm,” in Proceedings High Performance Computing on the Information Superhighway. HPC Asia ’97, pp. 301–305, 1997.
[40] R. V. Lopes and D. Menascé, “A Taxonomy of Job Scheduling on Distributed Computing Systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 12, pp. 3412–3428, 2016.
[41] R. Ranjan, A. Harwood, and R. Buyya, “A Case for Cooperative and Incentive-based Federation of Distributed Clusters,” Future Generation Computer Systems, vol. 24, no. 4, pp. 280–295, 2008.
[42] L. Duan, T. Kubo, K. Sugiyama, J. Huang, T. Hasegawa, and J. Walrand, “Incentive Mechanisms for Smartphone Collaboration in Data Acquisition and Distributed Computing,” in, pp. 1701–1709, 2012.
[43] Y. Xiao, Security in Distributed, Grid, Mobile, and Pervasive Computing. CRC Press, 2007.
[44] S. Pllana, I. Brandic, and S. Benkner, “Performance Modeling and Prediction of Parallel and Distributed Computing Systems: A Survey of the State of the Art,” in First International Conference on Complex, Intelligent and Software Intensive Systems (CISIS’07), pp. 279–284, IEEE, 2007.
[45] D. Jiang, B. C. Ooi, L. Shi, and S. Wu, “The Performance of MapReduce: An in-Depth Study,” Proc. VLDB Endow., vol. 3, pp. 472–483, Sept. 2010.
[46] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, et al., “Spark: Cluster Computing with Working Sets,” HotCloud, vol. 10, no. 10-10, p. 95, 2010.
[47] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, (New York, NY, USA), pp. 59–72, Association for Computing Machinery, 2007.
[48] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand, “CIEL: A Universal Execution Engine for Distributed Data-flow Computing,” in Proc. 8th ACM/USENIX Symposium on Networked Systems Design and Implementation, pp. 113–126, 2011.
[49] G. Joshi, E. Soljanin, and G. Wornell, “Efficient Redundancy Techniques for Latency Reduction in Cloud Systems,” ACM Trans. Model. Perform. Eval. Comput. Syst., vol. 2, Apr. 2017.
[50] J. Weets, M. K. Kakhani, and A. Kumar, “Limitations and Challenges of HDFS and MapReduce,” in, pp. 545–549, 2015.
[51] “Hadoop Terasort,” Aug. 13 2020. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html.
[52] B. Recht and C. Ré, “Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion,” Mathematical Programming Computation, vol. 5, no. 2, pp. 201–226, 2013.
[53] L. Bottou, “Stochastic Gradient Descent Tricks,” in Neural Networks: Tricks of the Trade, pp. 421–436, Springer, 2012.
[54] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv preprint arXiv:1502.03167, 2015.
[55] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “MapReduce with Communication Overlap (MaRCO),” Journal of Parallel and Distributed Computing, vol. 73, no. 5, pp. 608–620, 2013.
[56] B. Nicolae, C. H. A. Costa, C. Misale, K. Katrinis, and Y. Park, “Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1663–1674, 2017.
[57] W. Yu, Y. Wang, X. Que, and C. Xu, “Virtual Shuffling for Efficient Data Movement in MapReduce,” IEEE Transactions on Computers, vol. 64, no. 2, pp. 556–568, 2015.
[58] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: Fair Scheduling for Distributed Computing Clusters,” in
Proceedings of the ACM SIGOPS 22nd Symposium onOperating Systems Principles , SOSP â ˘A ´Z09, (New York, NY, USA),p. 261â ˘A¸S276, Association for Computing Machinery, 2009.[59] S. Suresh and N. Gopalan, “An Optimal Task Selection Scheme forHadoop Scheduling,”
IERI Procedia , vol. 10, pp. 70 – 75, 2014. Inter-national Conference on Future Information Engineering (FIE 2014).[60] J. Xie, F. Meng, H. Wang, H. Pan, J. Cheng, and X. Qin, “Research onscheduling scheme for hadoop clusters,”
Procedia Computer Science ,vol. 18, pp. 2468 – 2471, 2013. 2013 International Conference onComputational Science.[61] “Hadoop: Fair Scheduler,” Jul. 6 2020. https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.[62] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker,and I. Stoica, “Delay Scheduling: A Simple Technique for AchievingLocality and Fairness in Cluster Scheduling,” in
Proceedings of the5th European Conference on Computer Systems , EuroSys â ˘A ´Z10,(New York, NY, USA), p. 265â ˘A¸S278, Association for ComputingMachinery, 2010.[63] B. T. Rao and L. S. S. Reddy, “Survey on Improved Schedul-ing in Hadoop MapReduce in Cloud Environments,” arXiv preprintarXiv:1207.0780 , 2012.[64] J. Dean and L. A. Barroso, “The Tail at Scale,”
Commun. ACM , vol. 56,p. 74â ˘A¸S80, Feb. 2013.[65] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu,B. Saha, and E. Harris, “Reining in the Outliers in Map-ReduceClusters using Mantri.,” in
Osdi , p. 24, 2010.[66] R. D. Blumofe and C. E. Leiserson, “Scheduling Multithreaded Com-putations by Work Stealing,”
J. ACM , vol. 46, p. 720â ˘A¸S748, Sept.1999.[67] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, and E. Hyy-tia, “Reducing Latency via Redundant Requests: Exact Analysis,” in
Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’15, (New York, NY, USA), pp. 347–360, Association for Computing Machinery, 2015.
[68] N. B. Shah, K. Lee, and K. Ramchandran, “When Do Redundant Requests Reduce Latency?,” IEEE Transactions on Communications, vol. 64, no. 2, pp. 715–722, 2016.
[69] M. F. Aktas, P. Peng, and E. Soljanin, “Straggler Mitigation by Delayed Relaunch of Tasks,” SIGMETRICS Perform. Eval. Rev., vol. 45, pp. 224–231, Mar. 2018.
[70] M. F. Aktas, P. Peng, and E. Soljanin, “Effective Straggler Mitigation: Which Clones Should Attack and When?,” SIGMETRICS Perform. Eval. Rev., vol. 45, pp. 12–14, Oct. 2017.
[71] Q. Yan, S. Yang, and M. Wigger, “Storage, Computation, and Communication: A Fundamental Tradeoff in Distributed Computing,” in , pp. 1–5, 2018.
[72] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and S. Avestimehr, “Coded TeraSort,” in , pp. 389–398, 2017.
[73] Y. H. Ezzeldin, M. Karmoose, and C. Fragouli, “Communication vs Distributed Computation: An Alternative Trade-off Curve,” in , pp. 279–283, 2017.
[74] A. Mallick, M. Chaudhari, U. Sheth, G. Palanikumar, and G. Joshi, “Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication,” Proc. ACM Meas. Anal. Comput. Syst., vol. 3, Dec. 2019.
[75] S. Dutta, V. Cadambe, and P. Grover, “Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products,” in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), pp. 2100–2108, Curran Associates, Inc., 2016.
[76] S. Wang, J. Liu, N. Shroff, and P. Yang, “Fundamental Limits of Coded Linear Transform,” arXiv preprint arXiv:1804.09791, 2018.
[77] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional Coded Matrix Multiplication,” in , pp. 2418–2422, 2017.
[78] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4403–4413, Curran Associates, Inc., 2017.
[79] M. Fahim, H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and P. Grover, “On the Optimal Recovery Threshold of Coded Matrix Multiplication,” in , pp. 1264–1270, 2017.
[80] S. Wang, J. Liu, and N. Shroff, “Coded Sparse Matrix Multiplication,” arXiv preprint arXiv:1802.03430, 2018.
[81] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient Coding: Avoiding Stragglers in Distributed Learning,” in
Proceedings of the 34th International Conference on Machine Learning (D. Precup and Y. W. Teh, eds.), vol. 70 of Proceedings of Machine Learning Research, (International Convention Centre, Sydney, Australia), pp. 3368–3376, PMLR, 06–11 Aug 2017.
[82] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient Coding from Cyclic MDS Codes and Expander Graphs,” arXiv preprint arXiv:1707.03858, 2017.
[83] W. Halbawi, N. Azizan, F. Salehi, and B. Hassibi, “Improving Distributed Gradient Descent Using Reed-Solomon Codes,” in , pp. 2027–2031, 2018.
[84] S. Li, S. M. Mousavi Kalan, A. S. Avestimehr, and M. Soltanolkotabi, “Near-Optimal Straggler Mitigation for Distributed Gradient Methods,” in , pp. 857–866, 2018.
[85] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially Coded Regression: Optimal Straggler Mitigation via Data Encoding,” arXiv preprint arXiv:1805.09934, 2018.
[86] S. Dutta, V. Cadambe, and P. Grover, “Coded Convolution for Parallel and Distributed Computing within a Deadline,” in , pp. 2403–2407, 2017.
[87] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded Fourier Transform,” in , pp. 494–501, 2017.
[88] M. Blaum and R. M. Roth, “On Lowest Density MDS Codes,” IEEE Transactions on Information Theory, vol. 45, no. 1, pp. 46–59, 1999.
[89] S. Balakrishnama and A. Ganapathiraju, “Linear Discriminant Analysis - A Brief Tutorial,” in Institute for Signal and Information Processing, vol. 18, pp. 1–8, 1998.
[90] H. Abdi and L. J. Williams, “Principal Component Analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
[91] A. Severinson, A. Graell i Amat, and E. Rosnes, “Block-Diagonal and LT Codes for Distributed Computing With Straggling Servers,” IEEE Transactions on Communications, vol. 67, no. 3, pp. 1739–1753, 2019.
[92] H. Park and J. Moon, “Irregular Product Coded Computation for High-Dimensional Matrix Multiplication,” in , pp. 1782–1786, 2019.
[93] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-Proofing Massive-Scale Distributed Matrix Multiplication with D-Dimensional Product Codes,” in , pp. 1993–1997, 2018.
[94] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding,” IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1920–1933, 2020.
[95] M. Fahim and V. R. Cadambe, “Numerically Stable Polynomially Coded Computing,” in , pp. 3017–3021, 2019.
[96] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, “A Unified Coded Deep Neural Network Training Strategy based on Generalized PolyDot Codes,” in , pp. 1585–1589, 2018.
[97] G. Suh, K. Lee, and C. Suh, “Matrix Sparsification for Coded Matrix Multiplication,” in , pp. 1271–1278, 2017.
[98] M. Ye and E. Abbe, “Communication-Computation Efficient Gradient Coding,” arXiv preprint arXiv:1802.03475, 2018.
[99] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “A Unified Coding Framework for Distributed Computing with Straggling Servers,” in , pp. 1–6, 2016.
[100] J. Zhang and O. Simeone, “Improved Latency-communication Trade-off for Map-shuffle-reduce Systems with Stragglers,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8172–8176, 2019.
[101] E. Parrinello, E. Lampiris, and P. Elia, “Coded Distributed Computing with Node Cooperation Substantially Increases Speedup Factors,” in , pp. 1291–1295, 2018.
[102] K. Konstantinidis and A. Ramamoorthy, “Leveraging Coding Techniques for Speeding up Distributed Computing,” in , pp. 1–6, 2018.
[103] S. Prakash, A. Reisizadeh, R. Pedarsani, and S. Avestimehr, “Coded Computing for Distributed Graph Analytics,” in , pp. 1221–1225, 2018.
[104] S. R. Srinivasavaradhan, L. Song, and C. Fragouli, “Distributed Computing Trade-offs with Random Connectivity,” in , pp. 1281–1285, 2018.
[105] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded Distributed Computing: Straggling Servers and Multistage Dataflows,” in , pp. 164–171, 2016.
[106] N. Woolsey, R. Chen, and M. Ji, “Cascaded Coded Distributed Computing on Heterogeneous Networks,” in , pp. 2644–2648, 2019.
[107] M. Kiamari, C. Wang, and A. S. Avestimehr, “On Heterogeneous Coded Distributed Computing,” arXiv preprint arXiv:1709.00196, 2017.
[108] M. A. Attia and R. Tandon, “Information Theoretic Limits of Data Shuffling for Distributed Learning,” in , pp. 1–6, 2016.
[109] S. Gupta and V. Lalitha, “Locality-aware Hybrid Coded MapReduce for Server-Rack Architecture,” in , pp. 459–463, 2017.
[110] D. Stinson, Combinatorial Designs: Constructions and Analysis. Springer Science & Business Media, 2007.
[111] K. Konstantinidis and A. Ramamoorthy, “Resolvable Designs for Speeding up Distributed Computing,” arXiv preprint arXiv:1908.05666, 2019.
[112] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Compressed Coded Distributed Computing,” in , pp. 2032–2036, 2018.
[113] N. Woolsey, R. Chen, and M. Ji, “A New Combinatorial Design of Coded Distributed Computing,” in , pp. 726–730, 2018.
[114] J. Jiang and L.
Qu, “Coded Distributed Computing Schemes with Smaller Numbers of Input Files and Output Functions,” arXiv preprint arXiv:2001.04194, 2020.
[115] Q. Yan, X. Tang, and Q. Chen, “Placement Delivery Array and Its Applications,” in , pp. 1–5, 2018.
[116] V. Ramkumar and P. V. Kumar, “Coded MapReduce Schemes Based on Placement Delivery Array,” in , pp. 3087–3091, 2019.
[117] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 1709–1720, Curran Associates, Inc., 2017.
[118] L. Song, C. Fragouli, and T. Zhao, “A Pliable Index Coding Approach to Data Shuffling,” IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1333–1353, 2020.
[119] S. Brahma and C. Fragouli, “Pliable Index Coding,” IEEE Transactions on Information Theory, vol. 61, no. 11, pp. 6192–6203, 2015.
[120] Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Approximate Gradient Coding via Sparse Random Graphs,” arXiv preprint arXiv:1711.06771, 2017.
[121] F. Haddadpour, Y. Yang, V. Cadambe, and P. Grover, “Cross-Iteration Coded Computing,” in , pp. 196–203, 2018.
[122] F. Haddadpour, Y. Yang, M. Chaudhari, V. R. Cadambe, and P. Grover, “Straggler-Resilient and Communication-Efficient Distributed Iterative Linear Solver,” arXiv preprint arXiv:1806.06140, 2018.
[123] Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Communication-Efficient Algorithms for Statistical Optimization,” J. Mach. Learn. Res., vol. 14, pp. 3321–3363, Jan. 2013.
[124] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized Stochastic Gradient Descent,” in Advances in Neural Information Processing Systems 23 (J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, eds.), pp. 2595–2603, Curran Associates, Inc., 2010.
[125] K. Wan, M. Ji, and G. Caire, “Topological Coded Distributed Computing,” arXiv preprint arXiv:2004.04421, 2020.
[126] W. Xia, P. Zhao, Y. Wen, and H. Xie, “A Survey on Data Center Networking (DCN): Infrastructure and Operations,” IEEE Communications Surveys & Tutorials, vol. 19, no. 1, pp. 640–656, 2017.
[127] M. Al-Fares, A. Loukissas, and A. Vahdat, “A Scalable, Commodity Data Center Network Architecture,” in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM ’08, (New York, NY, USA), pp. 63–74, Association for Computing Machinery, 2008.
[128] J. Chung, K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Ubershuffle: Communication-efficient Data Shuffling for SGD via Coding Theory,”
Advances in NIPS, 2017.
[129] N. Woolsey, R.-R. Chen, and M. Ji, “Coded Distributed Computing with Heterogeneous Function Assignments,” arXiv preprint arXiv:1902.10738, 2019.
[130] L. Song, S. R. Srinivasavaradhan, and C. Fragouli, “The Benefit of Being Flexible in Distributed Computation,” in , pp. 289–293, 2017.
[131] F. Xu and M. Tao, “Heterogeneous Coded Distributed Computing: Joint Design of File Allocation and Function Assignment,” arXiv preprint arXiv:1908.06715, 2019.
[132] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “How to Optimally Allocate Resources for Coded Distributed Computing?,” in , pp. 1–7, 2017.
[133] M. Zhao, W. Wang, Y. Wang, and Z. Zhang, “Load Scheduling for Distributed Edge Computing: A Communication-Computation Tradeoff,” Peer-to-Peer Networking and Applications, vol. 12, no. 5, pp. 1418–1432, 2019.
[134] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Coded Computation for Multicore Setups,” in , pp. 2413–2417, 2017.
[135] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce Performance in Heterogeneous Environments,” in OSDI, p. 7, 2008.
[136] K. G. Narra, Z. Lin, M. Kiamari, S. Avestimehr, and M. Annavaram, “Slack Squeeze Coded Computing for Adaptive Straggler Mitigation,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, (New York, NY, USA), Association for Computing Machinery, 2019.
[137] C.-S. Yang, R. Pedarsani, and A. S. Avestimehr, “Timely-Throughput Optimal Coded Computing over Cloud Networks,” arXiv preprint arXiv:1904.05522, 2019.
[138] E. Ozfatura, D. Gündüz, and S. Ulukus, “Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers,” in , pp. 2729–2733, 2019.
[139] N. Ferdinand and S. C. Draper, “Hierarchical Coded Computation,” in , pp. 1620–1624, 2018.
[140] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded Computation Over Heterogeneous Clusters,” IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, 2019.
[141] M. Kim, J. Sohn, and J. Moon, “Coded Matrix Multiplication on a Group-Based Model,” in , pp. 722–726, 2019.
[142] D. Kim, H. Park, and J. Choi, “Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters,” arXiv preprint arXiv:1904.09496, 2019.
[143] Y. Keshtkarjahromi, Y. Xing, and H. Seferoglu, “Dynamic Heterogeneity-Aware Coded Cooperative Computation at the Edge,” in , pp. 23–33, 2018.
[144] N. S. Ferdinand and S. C. Draper, “Anytime Coding for Distributed Computation,” in , pp. 954–960, 2016.
[145] B. Wang, J. Xie, K. Lu, Y. Wan, and S. Fu, “On Batch-Processing Based Coded Computing for Heterogeneous Distributed Computing Systems,” arXiv preprint arXiv:1912.12559, 2019.
[146] J. Zhu, Y. Pu, V. Gupta, C. Tomlin, and K. Ramchandran, “A Sequential Approximation Framework for Coded Distributed Optimization,” in , pp. 1240–1247, 2017.
[147] T. Jahani-Nezhad and M. A. Maddah-Ali, “CodedSketch: Coded Distributed Computation of Approximated Matrix Multiplication,” in , pp. 2489–2493, 2019.
[148] V. Gupta, S. Wang, T. Courtade, and K. Ramchandran, “OverSketch: Approximate Matrix Multiplication for the Cloud,” in , pp. 298–304, 2018.
[149] Z. Charles and D. Papailiopoulos, “Gradient Coding via the Stochastic Block Model,” arXiv preprint arXiv:1805.10378, 2018.
[150] R. Bitar, M. Wootters, and S. El Rouayheb, “Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning,” IEEE Journal on Selected Areas in Information Theory, pp. 1–1, 2020.
[151] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning,” arXiv preprint arXiv:1803.05397, 2018.
[152] C. Karakus, Y. Sun, and S. Diggavi, “Encoded Distributed Optimization,” in , pp. 2890–2894, 2017.
[153] J. Kosaian, K. V. Rashmi, and S. Venkataraman, “Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation,” arXiv preprint arXiv:1806.01259, 2018.
[154] D. P. Woodruff, “Sketching as a Tool for Numerical Linear Algebra,” arXiv preprint arXiv:1411.4357, 2014.
[155] S. Wang, “A Practical Guide to Randomized Matrix Computations with MATLAB Implementations,” arXiv preprint arXiv:1505.07570, 2015.
[156] G. Cormode and M. Hadjieleftheriou, “Finding Frequent Items in Data Streams,” Proc. VLDB Endow., vol. 1, pp. 1530–1541, Aug. 2008.
[157] H. Wang, Z. Charles, and D. Papailiopoulos, “ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding,” arXiv preprint arXiv:1901.09671, 2019.
[158] W. Chang and R. Tandon, “Random Sampling for Distributed Coded Matrix Multiplication,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8187–8191, 2019.
[159] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
[160] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, “Robustly Executing DNNs in IoT Systems Using Coded Distributed Computing,” in
Proceedings of the 56th Annual Design Automation Conference 2019, DAC ’19, (New York, NY, USA), Association for Computing Machinery, 2019.
[161] A. B. Das, L. Tang, and A. Ramamoorthy, “C3LES: Codes for Coded Computation that Leverage Stragglers,” in , pp. 1–5, 2018.
[162] E. Ozfatura, S. Ulukus, and D. Gündüz, “Distributed Gradient Descent with Coded Partial Gradient Computations,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3492–3496, 2019.
[163] Y. Yang, P. Grover, and S. Kar, “Coded Distributed Computing for Inverse Problems,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 709–719, Curran Associates, Inc., 2017.
[164] W. Y. B. Lim, Z. Xiong, C. Miao, D. Niyato, Q. Yang, C. Leung, and H. V. Poor, “Hierarchical Incentive Mechanism Design for Federated Machine Learning in Mobile Networks,” IEEE Internet of Things Journal, 2020.
[165] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated Learning in Mobile Edge Networks: A Comprehensive Survey,” IEEE Communications Surveys & Tutorials, 2020.
[166] C. Gentry, “Fully Homomorphic Encryption Using Ideal Lattices,” in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 169–178, 2009.
[167] Z. Brakerski and V. Vaikuntanathan, “Efficient Fully Homomorphic Encryption from (Standard) LWE,” SIAM Journal on Computing, vol. 43, no. 2, pp. 831–871, 2014.
[168] O. Goldreich, “Secure Multi-Party Computation,” Manuscript. Preliminary version, vol. 78, 1998.
[169] Y. Huang, D. Evans, J. Katz, and L. Malka, “Faster Secure Two-Party Computation Using Garbled Circuits,” in USENIX Security Symposium, vol. 201, pp. 331–335, 2011.
[170] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A Framework for Fast Privacy-Preserving Computations,” in European Symposium on Research in Computer Security, pp. 192–206, Springer, 2008.
[171] W. Chang and R. Tandon, “On the Capacity of Secure Distributed Matrix Multiplication,” in , pp. 1–6, 2018.
[172] M. Ben-Or, S. Goldwasser, and A. Wigderson, Completeness Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation, pp. 351–371. New York, NY, USA: Association for Computing Machinery, 2019.
[173] M. Kim, H. Yang, and J. Lee, “Private Coded Computation for Machine Learning,” arXiv preprint arXiv:1807.01170, 2018.
[174] H. A. Nodehi, S. R. H. Najarkolaei, and M. A. Maddah-Ali, “Entangled Polynomial Coding in Limited-Sharing Multi-Party Computation,” in , pp. 1–5, 2018.
[175] Y. Chen, L. Su, and J. Xu, “Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent,”
Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 1, no. 2, pp. 1–25, 2017.
[176] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient Distributed Training via Redundant Gradients,” arXiv preprint arXiv:1803.09877, 2018.
[177] Q. Yu and A. S. Avestimehr, “Harmonic Coding: An Optimal Linear Code for Privacy-Preserving Gradient-Type Computation,” in , pp. 1102–1106, 2019.
[178] J. So, B. Guler, A. S. Avestimehr, and P. Mohassel, “CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning,” arXiv preprint arXiv:1902.00641, 2019.
[179] A. Shamir, “How to Share a Secret,” Communications of the ACM, vol. 22, no. 11, pp. 612–613, 1979.
[180] R. Bitar, P. Parag, and S. El Rouayheb, “Minimizing Latency for Secure Distributed Computing,” in , pp. 2900–2904, 2017.
[181] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support Vector Machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[182] R. Bitar, P. Parag, and S. El Rouayheb, “Minimizing Latency for Secure Coded Computing Using Secret Sharing via Staircase Codes,” IEEE Transactions on Communications, pp. 1–1, 2020.
[183] J. Kakar, S. Ebadifar, and A. Sezgin, “Rate-Efficiency and Straggler-Robustness through Partition in Distributed Two-Sided Secure Matrix Computation,” arXiv preprint arXiv:1810.13006, 2018.
[184] H. Yang and J. Lee, “Secure Distributed Computing With Straggling Servers Using Polynomial Codes,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 1, pp. 141–150, 2019.
[185] Z. Jia and S. A. Jafar, “On the Capacity of Secure Distributed Matrix Multiplication,” arXiv preprint arXiv:1908.06957, 2019.
[186] R. G. L. D’Oliveira, S. El Rouayheb, and D. Karpuk, “GASP Codes for Secure Distributed Matrix Multiplication,” IEEE Transactions on Information Theory, pp. 1–1, 2020.
[187] R. G. L. D’Oliveira, S. E. Rouayheb, D. Heinlein, and D. Karpuk, “Degree Tables for Secure Distributed Matrix Multiplication,” in , pp. 1–5, 2019.
[188] X. Chang, J. Wang, J. Wang, V. Lee, K. Lu, and Y. Yang, “On Achieving Maximum Secure Throughput Using Network Coding Against Wiretap Attack,” in , pp. 526–535, IEEE, 2010.
[189] R. Zhao, J. Wang, K. Lu, J. Wang, X. Wang, J. Zhou, and C. Cao, “Weakly Secure Coded Distributed Computing,” in , pp. 603–610, 2018.
[190] P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman, “Reputation Systems,” Communications of the ACM, vol. 43, no. 12, pp. 45–48, 2000.
[191] R. Mijumbi, J. Serrat, J. Gorricho, N. Bouten, F. De Turck, and R. Boutaba, “Network Function Virtualization: State-of-the-Art and Research Challenges,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 236–262, 2016.
[192] Q. Qi, W. Wang, X. Gong, and X. Que, “A SDN-based Network Virtualization Architecture with Autonomie Management,” in , pp. 178–182, 2014.
[193] V. Eramo, T. Catena, and F. G. Lavacca, “Proposal and Investigation of an Optical Reconfiguration Cost Aware Policy for Resource Allocation in Network Function Virtualization Infrastructures,” in , pp. 1–5, 2019.
[194] J. Sun, G. Zhu, G. Sun, D. Liao, Y. Li, A. K. Sangaiah, M. Ramachandran, and V. Chang, “A Reliability-Aware Approach for Resource Efficient Virtual Network Function Deployment,” IEEE Access, vol. 6, pp. 18238–18250, 2018.
[195] A. Al-Quzweeni, T. E. H. El-Gorashi, L. Nonde, and J. M. H. Elmirghani, “Energy Efficient Network Function Virtualization in 5G Networks,” in , pp. 1–4, 2015.
[196] L. Linguaglossa, S. Lange, S. Pontarelli, G. Rétvári, D. Rossi, T. Zinner, R. Bifulco, M. Jarschel, and G. Bianchi, “Survey of Performance Acceleration Techniques for Network Function Virtualization,” Proceedings of the IEEE, vol. 107, no. 4, pp. 746–764, 2019.
[197] H. Jang, J. Jeong, H. Kim, and J. Park, “A Survey on Interfaces to Network Security Functions in Network Virtualization,” in , pp. 160–163, 2015.
[198] A. Aljuhani and T. Alharbi, “Virtualized Network Functions Security Attacks and Vulnerabilities,” in , pp. 1–4, 2017.
[199] S. Cherrared, S. Imadali, E. Fabre, G. Gössler, and I. G. B. Yahia, “A Survey of Fault Management in Network Virtualization Environments: Challenges and Solutions,” IEEE Transactions on Network and Service Management, vol. 16, no. 4, pp. 1537–1551, 2019.
[200] P. v. Anvith, N. Gunavathi, B. Malarkodi, and B. Rebekka, “A Survey on Network Functions Virtualization for Telecom Paradigm,” in , pp. 302–306, 2019.
[201] J. Liu, Z. Jiang, N. Kato, O. Akashi, and A. Takahara, “Reliability Evaluation for NFV Deployment of Future Mobile Broadband Networks,”
IEEE Wireless Communications, vol. 23, no. 3, pp. 90–96, 2016.
[202] A. Al-Shuwaili, O. Simeone, J. Kliewer, and P. Popovski, “Coded Network Function Virtualization: Fault Tolerance via In-Network Coding,” IEEE Wireless Communications Letters, vol. 5, no. 6, pp. 644–647, 2016.
[203] N. Nikaein, “Processing Radio Access Network Functions in the Cloud: Critical Issues and Modeling,” in Proceedings of the 6th International Workshop on Mobile Cloud Computing and Services, MCS ’15, (New York, NY, USA), pp. 36–43, Association for Computing Machinery, 2015.
[204] M. Aliasgari, J. Kliewer, and O. Simeone, “Coded Computation Against Processing Delays for Virtualized Cloud-Based Channel Decoding,” IEEE Transactions on Communications, vol. 67, no. 1, pp. 28–38, 2019.
[205] J. Frankle and M. Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks,” arXiv preprint arXiv:1803.03635, 2018.
[206] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “A Scalable Framework for Wireless Distributed Computing,” IEEE/ACM Transactions on Networking, vol. 25, no. 5, pp. 2643–2654, 2017.
[207] M. Kiamari, C. Wang, and A. S. Avestimehr, “Coding for Edge-facilitated Wireless Distributed Computing with Heterogeneous Users,” in , pp. 536–540, 2017.
[208] S. Dhakal, S. Prakash, Y. Yona, S. Talwar, and N. Himayat, “Coded Computing for Distributed Machine Learning in Wireless Edge Network,” in , pp. 1–6, IEEE, 2019.
[209] S. Dhakal, S. Prakash, Y. Yona, S. Talwar, and N. Himayat, “Coded Federated Learning,” in , pp. 1–6, IEEE, 2019.
[210] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient Learning of Deep Networks from Decentralized Data,” in Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
[211] S. Prakash, S. Dhakal, M. Akdeniz, A. S. Avestimehr, and N. Himayat, “Coded Computing for Federated Learning at the Edge,” arXiv preprint arXiv:2007.03273, 2020.
[212] K. Yang, Y. Shi, and Z. Ding, “Data Shuffling in Wireless Distributed Computing via Low-Rank Optimization,” IEEE Transactions on Signal Processing, vol. 67, no. 12, pp. 3087–3099, 2019.
[213] N. Zhao, F. R. Yu, M. Jin, Q. Yan, and V. C. M. Leung, “Interference Alignment and Its Applications: A Survey, Research Issues, and Challenges,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 1779–1803, 2016.
[214] F. Li, J. Chen, and Z. Wang, “Wireless MapReduce Distributed Computing,” IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6101–6114, 2019.
[215] S. Ha, J. Zhang, O. Simeone, and J. Kang, “Wireless Map-Reduce Distributed Computing with Full-Duplex Radios and Imperfect CSI,” in , pp. 1–5, 2019.
[216] S. Ha, J. Zhang, O. Simeone, and J. Kang, “Coded Federated Computing in Wireless Networks with Straggling Devices and Imperfect CSI,” in , pp. 2649–2653, 2019.
[217] B. Wang, J. Xie, K. Lu, Y. Wan, and S. Fu, “Coding for Heterogeneous UAV-Based Networked Airborne Computing,” in , pp. 1–6, 2019.
[218] J. S. Ng, W. Y. B. Lim, H.-N. Dai, Z. Xiong, J. Huang, D. Niyato, X.-S. Hua, C. Leung, and C. Miao, “Joint Auction-Coalition Formation Framework for Communication-Efficient Federated Learning in UAV-Enabled Internet of Vehicles,” arXiv preprint arXiv:2007.06378, 2020.
[219] W. Y. B. Lim, J. Huang, Z. Xiong, J. Kang, D. Niyato, X.-S. Hua, C. Leung, and C. Miao, “Towards Federated Learning in UAV-Enabled Internet of Vehicles: A Multi-Dimensional Contract-Matching Approach,” arXiv preprint arXiv:2004.03877, 2020.
[220] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource Management with Deep Reinforcement Learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56, 2016.
[221] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile Edge Computing: A Survey,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 450–465, 2018.
[222] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog Computing and Its Role in the Internet of Things,” in Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, MCC ’12, (New York, NY, USA), pp. 13–16, Association for Computing Machinery, 2012.
[223] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coding for Distributed Fog Computing,” IEEE Communications Magazine, vol. 55, no. 4, pp. 34–40, 2017.
[224] H. Park, K. Lee, J. Sohn, C. Suh, and J. Moon, “Hierarchical Coding for Distributed Computing,” in 2018 IEEE International Symposium on Information Theory (ISIT)