Coded Computing for Federated Learning at the Edge
Saurav Prakash, Sagar Dhakal, Mustafa Akdeniz, A. Salman Avestimehr, Nageen Himayat
Abstract
Federated Learning (FL) is an exciting new paradigm that enables training a global model from data generated locally at the client nodes, without moving client data to a centralized server. Performance of FL in a multi-access edge computing (MEC) network suffers from slow convergence due to heterogeneity and stochastic fluctuations in compute power and communication link qualities across clients. A recent work, Coded Federated Learning (CFL), proposes to mitigate stragglers and speed up training for linear regression tasks by assigning redundant computations at the MEC server. Coding redundancy in CFL is computed by exploiting statistical properties of compute and communication delays. We develop CodedFedL, which addresses the difficult task of extending CFL to distributed non-linear regression and classification problems with multi-output labels. The key innovation of our work is to exploit distributed kernel embedding using random Fourier features, which transforms the training task into distributed linear regression. We provide an analytical solution for load allocation, and demonstrate significant performance gains for CodedFedL through experiments over benchmark datasets using practical network parameters.
1. Introduction
We live in an era where massive amounts of data are generated each day by the Internet of Things (IoT), comprising billions of devices (Saravanan et al., 2019). To utilize this sea of distributed data for learning powerful statistical models without compromising data privacy, an exciting paradigm called federated learning (FL) (Konečný et al., 2016; McMahan et al., 2017; Kairouz et al., 2019) has been rising recently. The FL framework comprises two major steps. First, every client carries out a local update on its dataset.
This work was part of Saurav Prakash's internship projects at Intel. Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA; Intel Labs, Santa Clara, CA. Correspondence to: Saurav Prakash <[email protected]>.

Second, a central server collects and aggregates the gradient updates to obtain a new model and transmits the updated model to the clients. This iterative procedure is carried out until convergence.

We focus on the implementation of FL in multi-access edge computing (MEC) platforms (Guo et al., 2018; Shahzadi et al., 2017; Ndikumana et al., 2019; Ai et al., 2018) that enable low-latency, efficient and cloud-like computing capabilities close to the client traffic. Furthermore, with the emergence of ultra-dense networks (Ge et al., 2016; An et al., 2017; Andreev et al., 2019), it is increasingly likely that message transmissions during distributed learning take place over wireless. Thus, carrying out FL over MEC suffers from some fundamental bottlenecks. First, due to the heterogeneity of compute and communication resources across clients, the overall gradient aggregation at the server can be significantly delayed by straggling computations and communication links. Second, FL suffers from wireless link failures during transmission. Re-transmission of messages can be done for failed communications, but it may drastically prolong training time. Third, data is typically non-IID across client nodes in an FL setting, i.e., data stored locally on a device does not represent the population distribution (Zhao et al., 2018). Thus, missing out on updates from some clients can lead to poor convergence.

In a recent work (Dhakal et al., 2019a), a novel technique called Coded Federated Learning (CFL), based on coding-theoretic ideas, was proposed to alleviate the aforementioned bottlenecks in FL for linear regression tasks. In CFL, at the beginning of the training procedure, each client generates masked parity data by taking linear combinations of features and labels in the local dataset, and shares it with the central server. The masking coefficients are never shared with the server, thus keeping the raw data private. During training, the server performs redundant gradient computations on the aggregate parity data to compensate for the erased or delayed parameter updates from the straggling clients. The combination of the coded gradient computed at the server and the gradients from the non-straggling clients stochastically approximates the full gradient over the entire dataset available at the clients.

Our Contributions: CFL is limited to linear regression tasks with scalar output labels. Additionally, CFL lacks a theoretical analysis and evaluations over real-world datasets. In this paper, we build upon CFL and develop CodedFedL to address the difficult problem of injecting structured redundancy for straggler mitigation in FL for general non-linear regression and classification tasks with vector labels.
The key idea of our approach is to perform the kernel Fourier feature mapping (Rahimi & Recht, 2008a) of the client data, transforming the distributed learning task into linear regression. This allows us to leverage the framework of CFL for generating parity data privately at the clients for straggler mitigation during training. For creating the parity data, we propose to take linear combinations over random features and vector labels (e.g., one-hot encoded labels for classification). Furthermore, for a given coded redundancy, we provide an analytical approach for finding the optimal load allocation policy for straggler mitigation, as opposed to the numerical approach proposed in CFL. The analysis reveals that the key subproblem of the underlying optimization can be formulated as a piece-wise concave problem with bounded domain, which can be solved using standard convex optimization tools. Lastly, we evaluate the performance of CodedFedL over two real-world datasets, MNIST and Fashion-MNIST, for which CodedFedL achieves significant gains of approximately $2.7\times$ and $2.4\times$ respectively over the uncoded approach for a small coding redundancy.

Related Works: Coded computing is a new paradigm that has been developed for injecting computation redundancy in unorthodox encoded forms to efficiently deal with communication bottlenecks and system disturbances like stragglers, outages, node failures, and adversarial computations in distributed systems (Li et al., 2017; Lee et al., 2017; Tandon et al., 2017; Karakus et al., 2017; Reisizadeh et al., 2019). Particularly, (Lee et al., 2017) proposed to use erasure coding for speeding up distributed matrix multiplication and linear regression tasks. (Tandon et al., 2017) proposed a coding method over gradients for synchronous gradient descent. (Karakus et al., 2019) proposed to encode over the data for avoiding the impact of stragglers in linear regression tasks. Many other works on coded computing for straggler mitigation in distributed learning have been proposed recently (Ye & Abbe, 2018; Dhakal et al., 2019b; Yu et al., 2017; Raviv et al., 2017; Charles et al., 2017). In all these works, the data placement and coding strategy is orchestrated by a central server. As a result, these works are not applicable in the FL setting, where data is privately owned by clients and cannot be shared with the central server. The CFL paper (Dhakal et al., 2019a) was the first to propose a distributed method for coding in a federated setting, but was limited to linear regression.

Our work, CodedFedL, proposes to develop coded parity data for non-linear regression and classification tasks with vector labels. For this, we leverage the popular kernel embedding based on random Fourier features (RFF) (Rahimi & Recht, 2008a; Pham & Pagh, 2013; Kar & Karnick, 2012), which has been a popular approach for dealing with kernel-based inference on large datasets. In the classical kernel approach (Shawe-Taylor et al., 2004), a kernel similarity matrix needs to be constructed, whose storage and compute costs are quadratic in the size of the dataset. RFF was proposed in (Rahimi & Recht, 2008a) to address this problem by explicitly constructing finite-dimensional random features from the data such that inner products between the random features approximate the kernel functions. Training and inference with random features have been shown to work considerably well in practice (Rahimi & Recht, 2008a;b; Shahrampour et al., 2018; Kar & Karnick, 2012; Rick Chang et al., 2016).
2. Problem Setup
In this section, we provide a preliminary background on linear regression and FL, followed by a description of our computation and communication models.
2.1. Linear Regression and Federated Learning

Consider a dataset $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where for $i \in \{1, \ldots, m\}$, data feature $x_i \in \mathbb{R}^{1 \times d}$ and label $y_i \in \mathbb{R}^{1 \times c}$. Many machine learning tasks consider the following optimization problem:
\[
\beta^* = \arg\min_{\beta \in \mathbb{R}^{d \times c}} \frac{1}{m}\|X\beta - Y\|_F^2 + \lambda\|\beta\|_F^2 = \arg\min_{\beta \in \mathbb{R}^{d \times c}} \frac{1}{m}\sum_{i=1}^{m}\|x_i\beta - y_i\|^2 + \lambda\|\beta\|_F^2, \tag{1}
\]
where $X \in \mathbb{R}^{m \times d}$ and $Y \in \mathbb{R}^{m \times c}$ denote the feature matrix and label matrix respectively, while $\beta$, $\lambda$ and $\|\cdot\|_F$ denote the model parameter, regularization parameter and Frobenius norm respectively. A common strategy to solve (1) is gradient descent. Specifically, the model is updated sequentially until convergence as follows: $\beta^{(r+1)} = \beta^{(r)} - \mu^{(r+1)}(g + \lambda\beta^{(r)})$, where $\mu^{(r+1)}$ denotes the learning rate. Here, $g$ denotes the gradient of the loss function over the dataset $D$, normalized by $|D| = m$: $g = \frac{1}{m}X^T(X\beta^{(r)} - Y)$.

In FL, the goal is to solve (1) for the dataset $D = \cup_{j=1}^{n} D_j$, where $D_j$ is the dataset that is available locally at client $j$. Let $\ell_j > 0$ denote the size of $D_j$, and let $X^{(j)} = [x_1^{(j)T}, \ldots, x_{\ell_j}^{(j)T}]^T$ and $Y^{(j)} = [y_1^{(j)T}, \ldots, y_{\ell_j}^{(j)T}]^T$ denote the feature set and label set respectively for the $j$-th client. Therefore, the combined feature and label sets across all clients can be represented as $X = [X^{(1)T}, \ldots, X^{(n)T}]^T$ and $Y = [Y^{(1)T}, \ldots, Y^{(n)T}]^T$ respectively.

During iteration $(r+1)$ of training, the server shares the current model $\beta^{(r)}$ with the clients. Client $j$ then computes the local gradient $g^{(j)} = \frac{1}{\ell_j}X^{(j)T}(X^{(j)}\beta^{(r)} - Y^{(j)})$. The server collects the client gradients and combines them to recover the full gradient $g = \frac{1}{m}\sum_{j=1}^{n}\ell_j g^{(j)} = \frac{1}{m}X^T(X\beta^{(r)} - Y)$. The server then carries out the model update to obtain $\beta^{(r+1)}$, which is shared with the clients in the following training iteration. The iterative procedure is carried out until sufficient convergence is achieved.
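To make the update rule concrete, here is a minimal NumPy sketch of one federated round; the client sizes, learning rate, and regularization value are illustrative assumptions, not values from the paper.

import numpy as np

def local_gradient(X_j, Y_j, beta):
    # g^(j) = (1/l_j) X^(j)T (X^(j) beta - Y^(j))
    return X_j.T @ (X_j @ beta - Y_j) / X_j.shape[0]

def federated_round(clients, beta, mu, lam):
    # Server recovers g = (1/m) sum_j l_j g^(j) and applies the
    # regularized gradient step from Section 2.1.
    m = sum(X_j.shape[0] for X_j, _ in clients)
    g = sum(X_j.shape[0] * local_gradient(X_j, Y_j, beta)
            for X_j, Y_j in clients) / m
    return beta - mu * (g + lam * beta)

# Toy run: n = 3 clients, d = 5 features, c = 2 outputs (assumed sizes).
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(40, 5)), rng.normal(size=(40, 2)))
           for _ in range(3)]
beta = np.zeros((5, 2))
for r in range(200):
    beta = federated_round(clients, beta, mu=0.1, lam=1e-3)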
2.2. Computation and Communication Models

To capture the stochastic nature of implementing FL in MEC, we consider probabilistic models for computation and communication resources, as illustrated next.

To statistically represent the compute heterogeneity, we assume a shifted exponential model for local gradient computation. Specifically, the computation time for the $j$-th client is given by the shifted exponential random variable $T^{(j)}_{cmp} = T^{(j,1)}_{cmp} + T^{(j,2)}_{cmp}$. Here, $T^{(j,1)}_{cmp} = \frac{\ell_j}{\mu_j}$ represents the non-stochastic part of the time in seconds to process the partial gradient over $\ell_j$ data points, where the processing rate is $\mu_j$ data points per second. $T^{(j,2)}_{cmp}$ models the stochastic component of compute time coming from random memory access during read/write cycles associated with Multiply-Accumulate (MAC) operations, where $p_{T^{(j,2)}_{cmp}}(t) = \gamma_j e^{-\gamma_j t}$, $t \geq 0$. Here, $\gamma_j = \frac{\alpha_j \mu_j}{\ell_j}$, with $\alpha_j > 0$ controlling the average time spent in computing vs. memory access.

The overall execution time for the $j$-th client during the $(r+1)$-th epoch also includes $T^{(j)}_{com-d}$, the time to download $\beta^{(r)}$ from the server, and $T^{(j)}_{com-u}$, the time to upload the partial gradient $g^{(j)}$ to the server. The communications between the server and clients take place over wireless links that fluctuate in quality. It is typical practice to model the wireless link between the server and the $j$-th client by a tuple $(r_j, p_j)$, where $r_j$ and $p_j$ denote the achievable data rate (in bits per second per Hz) and the link erasure probability, respectively (3GPP, 2014). Downlink and uplink communication delays are IID random variables given as $T^{(j)}_{com-d} = N^d_j \tau_j$ and $T^{(j)}_{com-u} = N^u_j \tau_j$. Here, $\tau_j = \frac{b}{r_j W}$ is the deterministic time to upload (or download) a packet of size $b$ bits containing the partial gradient $g^{(j)}$ (or the model $\beta^{(r)}$), and $W$ is the bandwidth in Hz assigned to the $j$-th worker device. $N^d_j$ and $N^u_j$ are IID copies of a geometric random variable $N_j \sim G(p = 1 - p_j)$ denoting the number of transmissions required for the first successful communication:
\[
\mathbb{P}\{N_j = x\} = p_j^{x-1}(1 - p_j), \quad x = 1, 2, 3, \ldots \tag{2}
\]
(For the purpose of this article, we assume the downlink and uplink delays to be reciprocal; generalizing our framework to an asymmetric delay model is easy to address.) Therefore, the total time $T^{(j)}$ taken by the $j$-th device to receive the updated model, compute and successfully communicate the partial gradient to the server is as follows:
\[
T^{(j)} = T^{(j)}_{cmp} + T^{(j)}_{com-d} + T^{(j)}_{com-u}, \tag{3}
\]
with average delay $\mathbb{E}(T^{(j)}) = \frac{\ell_j}{\mu_j}\left(1 + \frac{1}{\alpha_j}\right) + \frac{2\tau_j}{1 - p_j}$.
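The delay model above is easy to simulate; the following sketch draws per-epoch delays for one client and checks the sample mean against the closed form for $\mathbb{E}(T^{(j)})$. All parameter values are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)

def sample_delay(l, mu, alpha, tau, p, n=100000):
    # Shifted-exponential compute time plus geometric retransmissions
    # for the downlink and uplink, per Eqs. (2)-(3).
    t_cmp = l / mu + rng.exponential(l / (alpha * mu), size=n)
    n_tx = rng.geometric(1 - p, size=n) + rng.geometric(1 - p, size=n)
    return t_cmp + tau * n_tx

l, mu, alpha, tau, p = 400, 200.0, 2.0, 0.05, 0.1
empirical = sample_delay(l, mu, alpha, tau, p).mean()
closed_form = (l / mu) * (1 + 1 / alpha) + 2 * tau / (1 - p)
print(empirical, closed_form)  # the two should agree closely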
3. Proposed CodedFedL Scheme
In this section, we present the different modules of our proposed CodedFedL scheme for resilient FL in MEC networks: random Fourier feature mapping for non-linear regression, structured optimal redundancy for mitigating stragglers, and modified training at the server.
3.1. Random Fourier Feature Mapping

The linear regression procedure outlined in Section 2.1 is computationally favourable for low-powered personalized client devices, as the gradient computations involve matrix multiplications, which have low computational complexity. However, in many ML problems, a linear model does not perform well. To combine the advantages of non-linear models and the low-complexity gradient computations of linear regression in FL, we propose to leverage kernel embedding based on random Fourier feature mapping (RFFM) (Rahimi & Recht, 2008a). In RFFM, each feature $x_i \in \mathbb{R}^{1 \times d}$ is mapped to $\hat{x}_i$ using a function $\phi: \mathbb{R}^{1 \times d} \to \mathbb{R}^{1 \times q}$. RFFM approximates a positive definite kernel function $K: \mathbb{R}^{1 \times d} \times \mathbb{R}^{1 \times d} \to \mathbb{R}$ as represented below:
\[
K(x_i, x_j) \approx \hat{x}_i \hat{x}_j^T = \phi(x_i)\phi(x_j)^T. \tag{4}
\]
Before training starts, the $j$-th client carries out RFFM to transform its raw feature set $X^{(j)}$ to $\hat{X}^{(j)} = \phi(X^{(j)})$, and training proceeds with the transformed dataset $\hat{D} = (\hat{X}, Y)$, where $\hat{X} \in \mathbb{R}^{m \times q}$ is the matrix denoting all the transformed features across all clients. The goal then is to recover the full gradient $\hat{g} = \frac{1}{m}\hat{X}^T(\hat{X}\beta^{(r)} - Y)$ over $\hat{D}$ for $\beta^{(r)} \in \mathbb{R}^{q \times c}$. In this paper, we consider the commonly used RBF kernel (Vert et al., 2004), $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where $\sigma$ is the kernel width parameter. RFFM for the RBF kernel is obtained as follows (Rahimi & Recht, 2008b):
\[
\hat{x}_i = \sqrt{\tfrac{2}{q}}\left[\cos(x_i\omega_1 + \delta_1), \ldots, \cos(x_i\omega_q + \delta_q)\right], \tag{5}
\]
where the frequency vectors $\omega_s \in \mathbb{R}^{d \times 1}$ are drawn independently from $\mathcal{N}(0, \sigma^{-2}I_d)$, while the shift elements $\delta_s$ are drawn independently from the $\mathrm{Uniform}(0, 2\pi]$ distribution.
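A minimal sketch of the mapping in (5) for the RBF kernel follows; the values of q, sigma, and the seed are illustrative, and the shared seed anticipates Remark 1 below.

import numpy as np

def rbf_rff_map(X, q, sigma, seed):
    # Map features X (m x d) to q random Fourier features whose inner
    # products approximate the RBF kernel with width sigma, per Eq. (5).
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=(X.shape[1], q))
    delta = rng.uniform(0.0, 2.0 * np.pi, size=q)
    return np.sqrt(2.0 / q) * np.cos(X @ omega + delta)

# Sanity check: feature inner products vs. the exact kernel values.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 10))
Z = rbf_rff_map(X, q=2000, sigma=5.0, seed=42)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / (2 * 5.0 ** 2))
print(np.abs(Z @ Z.T - exact).max())  # small approximation error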
Remark 1. For distributed transformation of features at the clients, the server sends the same pseudo-random seed to every client, which then obtains the samples required for RFFM in (5). This mitigates the need for the server to communicate the samples $\omega_1, \ldots, \omega_q, \delta_1, \ldots, \delta_q$, thus minimizing client bandwidth overhead.

Along with the computational benefits of linear regression over the transformed dataset $\hat{D} = (\hat{X}, Y)$, applying RFFM allows us to apply the distributed encoding strategy for linear regression developed in (Dhakal et al., 2019a). The remaining part of Section 3 has been adapted from the CFL scheme proposed in (Dhakal et al., 2019a).

3.2. Distributed Encoding at the Clients

Client $j$ carries out random linear encoding over its transformed training dataset $\hat{D}_j = (\hat{X}^{(j)}, Y^{(j)})$ containing the transformed feature set $\hat{X}^{(j)}$ obtained from RFFM. Specifically, a random generator matrix $G_j \in \mathbb{R}^{u \times \ell_j}$ is used for encoding, where $u$ denotes the coding redundancy, which is the amount of parity data to be generated at each device. Typically, $u \ll m$. Further discussion on $u$ is deferred to Section 3.3, where the load allocation policy is described.

Entries of $G_j$ are drawn independently from a normal distribution with mean $0$ and variance $\frac{1}{u}$. $G_j$ is applied on the weighted local dataset to obtain $\breve{D}_j = (\breve{X}^{(j)}, \breve{Y}^{(j)})$ as follows: $\breve{X}^{(j)} = G_j W_j \hat{X}^{(j)}$, $\breve{Y}^{(j)} = G_j W_j Y^{(j)}$. For $w_j = [w_{j,1}, \ldots, w_{j,\ell_j}]$, the weight matrix $W_j = \mathrm{diag}(w_j)$ is an $\ell_j \times \ell_j$ diagonal matrix that weighs training data point $(\hat{x}^{(j)}_k, y^{(j)}_k)$ with $w_{j,k}$ based on the stochastic conditions of the compute and communication resources, $k \in [\ell_j]$. We defer the details of deriving $W_j$ to Section 3.4.

The central server receives the local parity data from all client devices and combines them to obtain the composite parity dataset $\breve{D} = (\breve{X}, \breve{Y})$, where $\breve{X} \in \mathbb{R}^{u \times q}$ and $\breve{Y} \in \mathbb{R}^{u \times c}$ are the composite parity feature set and label set: $\breve{X} = \sum_{j=1}^{n}\breve{X}^{(j)}$, $\breve{Y} = \sum_{j=1}^{n}\breve{Y}^{(j)}$. Therefore, we have:
\[
\breve{X} = GW\hat{X}, \quad \breve{Y} = GWY, \tag{6}
\]
where $G = [G_1, \ldots, G_n] \in \mathbb{R}^{u \times m}$ and $W \in \mathbb{R}^{m \times m}$ is a block-diagonal matrix given by $W = \mathrm{diag}([w_1, \ldots, w_n])$. Equation (6) represents the encoding over the entire decentralized dataset $\hat{D} = (\hat{X}, Y)$, performed implicitly in a distributed manner across clients.
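A sketch of this encoding step, assuming the per-point weights of Section 3.4 have already been computed; all names are illustrative.

import numpy as np

def encode_client(X_hat_j, Y_j, w_j, u, rng):
    # Private generator matrix with IID N(0, 1/u) entries; only the
    # parity pair returned here is shared with the server.
    G_j = rng.normal(scale=1.0 / np.sqrt(u), size=(u, X_hat_j.shape[0]))
    GW = G_j * w_j  # equivalent to G_j @ diag(w_j)
    return GW @ X_hat_j, GW @ Y_j

def composite_parity(parities):
    # Server sums the local parity datasets, Eq. (6).
    Xs, Ys = zip(*parities)
    return sum(Xs), sum(Ys)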
Remark 2. Although client $j \in [n]$ shares its locally coded dataset $\breve{D}_j = (\breve{X}^{(j)}, \breve{Y}^{(j)})$ with the central server, the local dataset $\hat{D}_j$ as well as the encoding matrix $G_j$ are private to the client and not shared with the server. It is an interesting future work to characterize the exact privacy leakage after the proposed randomization.

3.3. Load Allocation Policy

The server carries out a load policy design based on the statistical conditions of the MEC network for finding $\tilde{\ell}_j \leq \ell_j$, the number of data points to be processed at the $j$-th client, and $u \leq u_{max}$, the number of coded data points to be processed at the server, where $u_{max}$ is the maximum number of coded data points that the server can process. For each epoch, let $R_j(t; \tilde{\ell}_j)$ be the indicator random variable denoting the event that the server receives the partial gradient computed and communicated by the $j$-th client within time $t$, i.e., $R_j(t; \tilde{\ell}_j) = \tilde{\ell}_j \mathbb{1}\{T_j \leq t\}$. Clearly, $R_j(t; \tilde{\ell}_j) \in \{0, \tilde{\ell}_j\}$. The following denotes the total aggregate return for $t \geq 0$:
\[
R(t; (u, \tilde{\ell})) = R_C(t; u) + R_U(t; \tilde{\ell}), \tag{7}
\]
where $R_C(\cdot)$ denotes the indicator random variable for the event that the server finishes computing the coded gradient over the parity dataset $\breve{D} = (\breve{X}, \breve{Y})$, while $R_U(t; \tilde{\ell}) = \sum_{j=1}^{n} R_j(t; \tilde{\ell}_j)$ denotes the total aggregate return for the uncoded partial gradients from the clients. The goal is to have an expected return $\mathbb{E}(R(t; \tilde{\ell})) = m$ for a minimum waiting time $t = t^*$, where $m$ is the total number of data points at the clients. When the coding redundancy is large, clients need to compute less. This, however, results in a coarser approximation of the true gradient over the entire distributed client data, since encoding of training data results in colored noise, which may result in poor convergence.

Without loss of generality, we assume that the server has reliable and powerful computational capability, so that $R_C(t; u_{max}) = u_{max}$ a.s. for any $t \geq 0$. Therefore, the problem reduces to finding an expected aggregate return from the clients $\mathbb{E}(R_U(t; \tilde{\ell})) = (m - u_{max})$ for a minimum waiting time $t = t^*$ for each epoch. The two-step approach for the load allocation is described below:

Step 1: Optimize $\tilde{\ell} = (\tilde{\ell}_1, \ldots, \tilde{\ell}_n)$ to maximize the expected return $\mathbb{E}(R_U(t; \tilde{\ell}))$ for a fixed $t > 0$ by solving the following for the expected aggregate return from the clients:
\[
\tilde{\ell}^*(t) = \arg\max_{0 \leq \tilde{\ell} \leq (\ell_1, \ldots, \ell_n)} \mathbb{E}(R_U(t; \tilde{\ell})). \tag{8}
\]
The optimization problem in (8) is decomposable into $n$ independent optimization problems, one for each client $j$:
\[
\tilde{\ell}^*_j(t) = \arg\max_{0 \leq \tilde{\ell}_j \leq \ell_j} \mathbb{E}(R_j(t; \tilde{\ell}_j)). \tag{9}
\]
Remark 3. We prove in Section 4 that $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ is a piece-wise concave function in $\tilde{\ell}_j > 0$, where the interval boundaries are determined by the number of transmissions during the return time for the $j$-th client. Thus, (9) can be efficiently solved using any convex toolbox.

Step 2: Next, optimization of $t$ is considered in order to find the minimum waiting time so that the maximized expected aggregate return $\mathbb{E}(R(t; \tilde{\ell}^*(t)))$ is equal to $m$, where $\tilde{\ell}^*(t)$ is the solution to (8). Specifically, for a tolerance parameter $\epsilon \geq 0$, the following optimization problem is considered:
\[
t^* = \arg\min\left\{t \geq 0 : m - u \leq \mathbb{E}(R_U(t; \tilde{\ell}^*(t))) \leq m - u + \epsilon\right\}. \tag{10}
\]
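A sketch of the two-step policy, using the closed form for $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ from the Theorem in Section 4; the grid search stands in for a per-interval convex solver, and all client parameters are assumed for illustration.

import numpy as np

def expected_return(l, t, mu, alpha, tau, p):
    # Closed-form E(R_j(t; l)) from the Theorem in Section 4.
    nu_m = int(np.ceil(t / tau)) - 1
    total = 0.0
    for nu in range(2, nu_m + 1):
        slack = t - l / mu - tau * nu
        if slack > 0:
            total += (nu - 1) * (1 - p) ** 2 * p ** (nu - 2) \
                     * l * (1 - np.exp(-(alpha * mu / l) * slack))
    return total

def best_load(t, l_max, mu, alpha, tau, p, grid=200):
    # Step 1: maximize the piece-wise concave objective over (0, l_max].
    ls = np.linspace(1.0, l_max, grid)
    vals = np.array([expected_return(l, t, mu, alpha, tau, p) for l in ls])
    return ls[vals.argmax()], vals.max()

def waiting_time(clients, target, t_hi, eps=1.0, iters=40):
    # Step 2: binary search over t, valid because the optimized return
    # is monotonically increasing in t (Remark 4).
    t_lo = 0.0
    for _ in range(iters):
        t_mid = 0.5 * (t_lo + t_hi)
        ret = sum(best_load(t_mid, *c)[1] for c in clients)
        if ret < target:
            t_lo = t_mid
        elif ret > target + eps:
            t_hi = t_mid
        else:
            return t_mid
    return t_hi

# Toy usage: per-client tuples (l_max, mu, alpha, tau, p), all assumed.
clients = [(400, 200.0, 2.0, 0.05, 0.1), (400, 150.0, 2.0, 0.08, 0.1)]
print(waiting_time(clients, target=600.0, t_hi=20.0))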
Remark 4. In Section 4, we leverage our derived mathematical result for $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ to numerically show that $\mathbb{E}(R(t; \tilde{\ell}^*(t)))$ is monotonically increasing in $t$. Therefore, (10) can be efficiently solved using a binary search for $t$.
Remark 5. The optimization procedure outlined above can be generalized to solving (7) by treating the server as the $(n+1)$-th node, with $0 \leq \tilde{\ell}_{n+1} \leq u_{max}$. Then, $u = \tilde{\ell}^*_{n+1}(t^*)$.

3.4. Weight Matrix Design

Client $j$ samples $\tilde{\ell}^*_j(t^*)$ data points uniformly at random that it will process for local gradient computation during training. It is not revealed to the server which data points are sampled, adding another layer of privacy. The probability that the partial gradient computed at client $j$ is not received at the server by $t^*$ is $pnr_{j,1} = (1 - \mathbb{P}\{T_j \leq t^*\})$. Also, $(\ell_j - \tilde{\ell}^*_j(t^*))$ data points are never evaluated locally, which implies that their probability of no return is $pnr_{j,2} = 1$.

The diagonal weight matrix $W_j \in \mathbb{R}^{\ell_j \times \ell_j}$ captures this absence of updates reaching the server during the training procedure for different data points. Specifically, for the $\tilde{\ell}^*_j(t^*)$ data points processed at the client, the corresponding weight matrix coefficient is $w_{j,k} = \sqrt{pnr_{j,1}}$, while for the $(\ell_j - \tilde{\ell}^*_j(t^*))$ data points never processed, $w_{j,k} = \sqrt{pnr_{j,2}}$. As we illustrate next, this weighing ensures that the combination of the coded gradient and the partial gradient updates from the non-straggling clients stochastically approximates the full gradient over the entire dataset across the clients.

3.5. Modified Training at the Server

In each epoch, the server computes the coded gradient $g_C$ over the composite parity data $\breve{D} = (\breve{X}, \breve{Y})$ as follows:
\[
g_C = \breve{X}^T(\breve{X}\beta^{(r)} - \breve{Y}) = \hat{X}^TW^T\left(G^TG\right)W(\hat{X}\beta^{(r)} - Y) \tag{11}
\]
\[
\implies \mathbb{E}(g_C) \overset{(a)}{=} \hat{X}^TW^TW(\hat{X}\beta^{(r)} - Y) = \sum_{j=1}^{n}\sum_{k=1}^{\ell_j} w_{j,k}^2\, \hat{x}^{(j)T}_k(\hat{x}^{(j)}_k\beta^{(r)} - y^{(j)}_k). \tag{12}
\]
In $(a)$, we have replaced the quantity $\mathbb{E}(G^TG)$ by an identity matrix, since the entries in $G \in \mathbb{R}^{u \times m}$ are IID with mean $0$ and variance $1/u$. One can even replace $G^TG$ by an identity matrix as a good approximation for reasonably large $u$.

Client $j$ computes the gradient $g^{(j)}_U = \frac{1}{\tilde{\ell}^*_j(t^*)}\tilde{X}^{(j)T}(\tilde{X}^{(j)}\beta^{(r)} - \tilde{Y}^{(j)})$ over $\tilde{D}^{(j)} = (\tilde{X}^{(j)}, \tilde{Y}^{(j)})$, which is composed of the $\tilde{\ell}^*_j(t^*)$ data points that the $j$-th client samples for processing before training. The server waits for the partial gradients from the clients until the optimized waiting time $t^*$ and aggregates them to obtain $g_U = \sum_{j=1}^{n}\tilde{\ell}^*_j(t^*)\mathbb{1}\{T_j \leq t^*\}\,g^{(j)}_U$. The server combines the coded and uncoded gradients to obtain $g_M = \frac{1}{m}(g_C + g_U)$, which stochastically approximates the full gradient $\hat{g} = \frac{1}{m}\hat{X}^T(\hat{X}\beta^{(r)} - Y)$. Specifically, the expected aggregate gradient from the clients is as follows:
\[
\mathbb{E}(g_U) = \sum_{j=1}^{n}\mathbb{P}(T_j \leq t^*)\,\tilde{\ell}^*_j(t^*)\,g^{(j)}_U \overset{(a)}{=} \sum_{j=1}^{n}\sum_{k=1}^{\tilde{\ell}^*_j(t^*)}\mathbb{P}(T_j \leq t^*)\,\hat{x}^{(j)T}_k(\hat{x}^{(j)}_k\beta^{(r)} - y^{(j)}_k) \overset{(b)}{=} \sum_{j=1}^{n}\sum_{k=1}^{\ell_j}(1 - w^2_{j,k})\,\hat{x}^{(j)T}_k(\hat{x}^{(j)}_k\beta^{(r)} - y^{(j)}_k), \tag{13}
\]
where the inner sum in $(a)$ is over the data points in $\tilde{D}^{(j)} = (\tilde{X}^{(j)}, \tilde{Y}^{(j)})$, while in $(b)$ all the points in the local dataset are included, with $(1 - w^2_{j,k}) = 0$ for the points that are selected to not be processed by the $j$-th client. In light of (12) and (13), we can see that $\mathbb{E}(g_M) = \hat{g}$.
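One aggregation step at the server can then be sketched as follows; the structure of `returned` and all names are illustrative assumptions.

import numpy as np

def aggregate(X_brv, Y_brv, beta, returned, m):
    # Coded gradient over the composite parity data, Eq. (11).
    g_C = X_brv.T @ (X_brv @ beta - Y_brv)
    # Uncoded part: sum of l*_j g_U^(j) over clients meeting the deadline,
    # where each g_j is the client's normalized partial gradient.
    g_U = sum(l_star * g_j for l_star, g_j in returned)
    return (g_C + g_U) / m  # g_M, which satisfies E(g_M) = g_hat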
4. Analyzing CodedFedL
In this section, we analyze the expected return $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ defined in Section 3.3. We first present the main result.
Theorem. For the compute and communication models described in Section 2.2, let $0 \leq \tilde{\ell}_j \leq \ell_j$ be the number of data points processed by the $j$-th client in each training epoch. Then, for a waiting time of $t$ at the server, the expectation of the return $R_j(t; \tilde{\ell}_j) = \tilde{\ell}_j\mathbb{1}\{T_j \leq t\}$ satisfies the following:
\[
\mathbb{E}(R_j(t; \tilde{\ell}_j)) =
\begin{cases}
\sum_{\nu=2}^{\nu_m} U\!\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right) h_\nu f_\nu(t; \tilde{\ell}_j) & \text{if } \nu_m \geq 2,\\
0 & \text{otherwise},
\end{cases}
\]
where $U(\cdot)$ is the unit step function, $f_\nu(t; \tilde{\ell}_j) = \tilde{\ell}_j\left(1 - e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right)}\right)$, $h_\nu = (\nu - 1)(1 - p_j)^2 p_j^{\nu-2}$, and $\nu_m \in \mathbb{Z}$ satisfies $t - \tau_j\nu_m > 0$, $t - \tau_j(\nu_m + 1) \leq 0$.

The proof is in Appendix A.1. Next, we discuss the behavior of $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ for $\nu_m \geq 2$. For a fixed $t > 0$, consider the function $f_\nu(t; \tilde{\ell}_j)$ for $\tilde{\ell}_j > 0$. Then, the following holds:
\[
f''_\nu(t; \tilde{\ell}_j) = -e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t - \nu\tau_j - \frac{\tilde{\ell}_j}{\mu_j}\right)}\,\frac{\alpha_j^2\mu_j^2(t - \nu\tau_j)^2}{\tilde{\ell}_j^3} < 0.
\]
Thus, $f_\nu(t; \tilde{\ell}_j)$ is strictly concave in the domain $\tilde{\ell}_j > 0$. Also, $f_\nu(t; \tilde{\ell}_j) \leq 0$ for $\tilde{\ell}_j \geq \mu_j(t - \tau_j\nu)$, for $\nu \in \{2, \ldots, \nu_m\}$. Solving for $f'_\nu(t; \tilde{\ell}_j) = 0$, we obtain the optimal load as follows:
\[
\tilde{\ell}^*_j(t, \nu) = -\frac{\alpha_j\mu_j}{W_{-1}(-e^{-(1+\alpha_j)}) + 1}\,(t - \nu\tau_j)\,U(t - \nu\tau_j), \tag{14}
\]
where $W_{-1}(\cdot)$ is the minor branch of the Lambert $W$-function, the Lambert $W$-function being the inverse function of $f(W) = We^W$. Therefore, as highlighted in Remark 3, the expected return $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ is piece-wise concave in $\tilde{\ell}_j$ in the intervals $(0, \mu_j(t - \nu_m\tau_j)), \ldots, (\mu_j(t - 3\tau_j), \mu_j(t - 2\tau_j))$. Thus, for a given $t > 0$, the problem of maximizing the expected return decomposes into a finite number of convex optimization problems, which are efficiently solved in practice (Boyd & Vandenberghe, 2004). This piece-wise concave relationship is also highlighted in Fig. 1(a).

Figure 1. Illustrating the properties of the expected return $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ based on the result of the Theorem: (a) the piece-wise concavity of $\mathbb{E}(R_j(t; \tilde{\ell}_j))$ for $\tilde{\ell}_j > 0$; (b) the monotonic relationship between $\mathbb{E}(R_j(t; \tilde{\ell}^*_j(t)))$ and $t$ for the $j$-th client. We assume $\mu_j = 2$, and in Fig. 1(a) we have fixed $t = 10$.

Consider the optimized expected return $\mathbb{E}(R_j(t; \tilde{\ell}^*_j(t)))$. Intuitively, as we increase the waiting time at the server, the optimized load allocation should vary such that the server gets more return on average. We substantiate this intuition and illustrate the relationship in Fig. 1(b). As $\mathbb{E}(R_j(t; \tilde{\ell}^*_j(t)))$ is monotonically increasing for each client $j$, so is $\mathbb{E}(R_U(t; \tilde{\ell}^*(t))) = \sum_{j=1}^{n}\mathbb{E}(R_j(t; \tilde{\ell}^*_j(t)))$. Hence, the optimization problem in (10) can be solved efficiently using binary search, as claimed in Remark 4.
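The per-interval optimizer in (14) is directly computable with SciPy's Lambert-W routine; the parameter values below are assumptions for illustration.

import numpy as np
from scipy.special import lambertw

def optimal_load(t, nu, mu_j, alpha_j, tau_j):
    # Eq. (14): maximizer of f_nu(t; l) on the nu-th concavity interval.
    if t - nu * tau_j <= 0:
        return 0.0
    w = lambertw(-np.exp(-(1.0 + alpha_j)), k=-1).real  # branch W_{-1}
    return -alpha_j * mu_j * (t - nu * tau_j) / (w + 1.0)

print(optimal_load(t=10.0, nu=2, mu_j=2.0, alpha_j=2.0, tau_j=1.4))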
5. Empirical Evaluation of CodedFedL
We now illustrate the numerical gains of CodedFedL proposed in Section 3, comparing it with the uncoded approach, where each client computes the partial gradient over its entire local dataset and the server waits to aggregate the local gradients from all the clients. As a pre-processing step, kernel embedding is carried out to obtain random features for each dataset, as outlined in Section 3.1.
Figure 2. Illustrating the results for MNIST: (a) test accuracy with respect to wall-clock time under the uncoded scheme vs. the CodedFedL scheme; (b) test accuracy with respect to mini-batch update iteration under the uncoded scheme vs. the CodedFedL scheme.
We consider a wireless scenario consisting of $N = 30$ heterogeneous client nodes and an MEC server. We consider both MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). Training is done in batches, and encoding for CodedFedL is applied based on the global mini-batch, with only a small coded redundancy. The details of simulation parameters, such as those for compute and communication resources, pre-processing steps, and the strategy for modeling heterogeneity of data, are provided in Appendix A.2. For each dataset, training is performed on the training set, while accuracy is reported on the test set.

Figure 2(a) illustrates the generalization accuracy as a function of wall-clock time for MNIST, while Figure 2(b) illustrates the generalization accuracy vs. training iteration. Similar results for Fashion-MNIST are included in Appendix A.2. Clearly, CodedFedL has significantly better convergence time than the uncoded approach, and as highlighted in Section 3.5, the coded federated gradient aggregation approximates the uncoded gradient aggregation well for large datasets. To illustrate this further, let $\gamma$ be the target accuracy for a given dataset, while $t^U_\gamma$ and $t^C_\gamma$ respectively are the first time instants to reach the $\gamma$ accuracy for uncoded and CodedFedL. In Table 1 in Appendix A.2, we summarize the results, demonstrating a gain of up to $2.7\times$ in convergence time for CodedFedL over the uncoded scheme, even for a small coding redundancy.

References

3GPP. 14.2.0(14), 2014.

Ai, Y., Peng, M., and Zhang, K. Edge computing technologies for internet of things: a primer. Digital Communications and Networks, 4(2):77–86, 2018.
An, J., Yang, K., Wu, J., Ye, N., Guo, S., and Liao, Z. Achieving sustainable ultra-dense heterogeneous networks for 5G. IEEE Communications Magazine, 55(12):84–90, 2017.

Andreev, S., Petrov, V., Dohler, M., and Yanikomeroglu, H. Future of ultra-dense networks beyond 5G: harnessing heterogeneous moving cells. IEEE Communications Magazine, 57(6):86–92, 2019.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Charles, Z., Papailiopoulos, D., and Ellenberg, J. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771, 2017.

Dhakal, S., Prakash, S., Yona, Y., Talwar, S., and Himayat, N. Coded federated learning. IEEE, 2019a.

Dhakal, S., Prakash, S., Yona, Y., Talwar, S., and Himayat, N. Coded computing for distributed machine learning in wireless edge network. pp. 1–6. IEEE, 2019b.

Ge, X., Tu, S., Mao, G., Wang, C.-X., and Han, T. 5G ultra-dense cellular networks. IEEE Wireless Communications, 23(1):72–79, 2016.

Guo, H., Liu, J., and Zhang, J. Efficient computation offloading for multi-access edge computing in 5G hetnets. pp. 1–6. IEEE, 2018.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Kar, P. and Karnick, H. Random feature maps for dot product kernels. In Artificial Intelligence and Statistics, pp. 583–591, 2012.

Karakus, C., Sun, Y., Diggavi, S., and Yin, W. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems, pp. 5440–5448, 2017.

Karakus, C., Sun, Y., Diggavi, S. N., and Yin, W. Redundancy techniques for straggler mitigation in distributed optimization and learning. Journal of Machine Learning Research, 20(72):1–47, 2019.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, 2017.

Li, S., Maddah-Ali, M. A., Yu, Q., and Avestimehr, A. S. A fundamental tradeoff between computation and communication in distributed computing. IEEE Transactions on Information Theory, 64(1):109–128, 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. AISTATS, 2017.

Ndikumana, A., Tran, N. H., Ho, T. M., Han, Z., Saad, W., Niyato, D., and Hong, C. S. Joint communication, computation, caching, and control in big data multi-access edge computing. IEEE Transactions on Mobile Computing, 2019.

Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 239–247, 2013.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008a.

Rahimi, A. and Recht, B. Uniform approximation of functions with random bases. pp. 555–561. IEEE, 2008b.

Raviv, N., Tamo, I., Tandon, R., and Dimakis, A. G. Gradient coding from cyclic MDS codes and expander graphs. arXiv preprint arXiv:1707.03858, 2017.

Reisizadeh, A., Prakash, S., Pedarsani, R., and Avestimehr, A. S. Coded computation over heterogeneous clusters. IEEE Transactions on Information Theory, 65(7):4227–4242, 2019.

Rick Chang, J.-H., Sankaranarayanan, A. C., and Vijaya Kumar, B. Random features for sparse signal classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5404–5412, 2016.

Saravanan, V., Hussain, F., and Kshirasagar, N. Role of big data in internet of things networks. In Handbook of Research on Big Data and the IoT, pp. 273–299. IGI Global, 2019.

Shahrampour, S., Beirami, A., and Tarokh, V. On data-dependent random features for improved generalization in supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Shahzadi, S., Iqbal, M., Dagiuklas, T., and Qayyum, Z. U. Multi-access edge computing: open issues, challenges and future perspectives. Journal of Cloud Computing, 6(1):30, 2017.

Shawe-Taylor, J., Cristianini, N., et al. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning, pp. 3368–3376, 2017.

Vert, J.-P., Tsuda, K., and Schölkopf, B. A primer on kernel methods. Kernel Methods in Computational Biology, 47:35–70, 2004.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Ye, M. and Abbe, E. Communication-computation efficient gradient coding. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 5610–5619, 2018.

Yu, Q., Maddah-Ali, M., and Avestimehr, S. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pp. 4403–4413, 2017.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
A. Supplementary Material
This is the supplementary component of our submission.
A.1. Proof of Theorem
Theorem. For the compute and communication models described in Section 2.2, let $0 \leq \tilde{\ell}_j \leq \ell_j$ be the number of data points processed by the $j$-th client in each training epoch. Then, for a waiting time of $t$ at the server, the expectation of the return $R_j(t; \tilde{\ell}_j) = \tilde{\ell}_j\mathbb{1}\{T_j \leq t\}$ satisfies the following:
\[
\mathbb{E}(R_j(t; \tilde{\ell}_j)) =
\begin{cases}
\sum_{\nu=2}^{\nu_m} U\!\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right) h_\nu f_\nu(t; \tilde{\ell}_j) & \text{if } \nu_m \geq 2,\\
0 & \text{otherwise},
\end{cases}
\]
where $U(\cdot)$ is the unit step function, $f_\nu(t; \tilde{\ell}_j) = \tilde{\ell}_j\left(1 - e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right)}\right)$, $h_\nu = (\nu - 1)(1 - p_j)^2 p_j^{\nu-2}$, and $\nu_m \in \mathbb{Z}$ satisfies $t - \tau_j\nu_m > 0$, $t - \tau_j(\nu_m + 1) \leq 0$.

Proof. Using the computation and communication models presented in Section 2.2, we have the following for the execution time for one epoch for the $j$-th client:
\[
T^{(j)} = T^{(j,1)}_{cmp} + T^{(j,2)}_{cmp} + T^{(j)}_{com-d} + T^{(j)}_{com-u} = \frac{\tilde{\ell}_j}{\mu_j} + T^{(j,2)}_{cmp} + \tau_j N^{(j)}_{com}, \tag{15}
\]
where $N^{(j)}_{com} \sim NB(r = 2, p = 1 - p_j)$ has a negative binomial distribution, while $T^{(j,2)}_{cmp} \sim \mathcal{E}\left(\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\right)$ is exponential. We have used the fact that $T^{(j)}_{com-d}$ and $T^{(j)}_{com-u}$ are IID $G(p)$ geometric random variables, and that the sum of $r$ IID $G(p)$ random variables is $NB(r, p)$. Therefore, the probability distribution of $T^{(j)}$ is obtained as follows:
\begin{align*}
\mathbb{P}(T^{(j)} \leq t) &= \mathbb{P}\left(\frac{\tilde{\ell}_j}{\mu_j} + T^{(j,2)}_{cmp} + \tau_j N^{(j)}_{com} \leq t\right)\\
&= \sum_{\nu=2}^{\infty} \mathbb{P}(N^{(j)}_{com} = \nu)\,\mathbb{P}\left(T^{(j,2)}_{cmp} \leq t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu \,\Big|\, N^{(j)}_{com} = \nu\right)\\
&\overset{(a)}{=} \sum_{\nu=2}^{\infty} \mathbb{P}(N^{(j)}_{com} = \nu)\,\mathbb{P}\left(T^{(j,2)}_{cmp} \leq t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right)\\
&\overset{(b)}{=} \sum_{\nu=2}^{\infty} U\!\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right)(\nu - 1)(1 - p_j)^2 p_j^{\nu-2}\left(1 - \exp\left(-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t - \frac{\tilde{\ell}_j}{\mu_j} - \tau_j\nu\right)\right)\right),
\end{align*}
where $(a)$ holds due to the independence of $T^{(j,2)}_{cmp}$ and $N^{(j)}_{com}$, while in $(b)$ we have used $U(\cdot)$ to denote the unit step function with $U(x) = 1$ for $x > 0$ and $U(x) = 0$ for $x \leq 0$. For a fixed $t$, $\mathbb{P}(T^{(j)} \leq t) = 0$ if $t \leq 2\tau_j$. For $t > 2\tau_j$, let $\nu_m \geq 2$ satisfy the following criteria:
\[
(t - \tau_j\nu_m) > 0, \quad (t - \tau_j(\nu_m + 1)) \leq 0. \tag{16}
\]
Therefore, for $\nu > \nu_m$, the terms in $(b)$ are $0$. Finally, as $\mathbb{E}(R_j(t; \tilde{\ell}_j)) = \tilde{\ell}_j\mathbb{E}(\mathbb{1}\{T_j \leq t\}) = \tilde{\ell}_j\mathbb{P}(T^{(j)} \leq t)$, we arrive at the result of our Theorem.
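As a sanity check on the Theorem, the following Monte Carlo sketch compares the empirical success probability $\mathbb{P}(T_j \leq t)$ against the closed form $\mathbb{E}(R_j)/\tilde{\ell}_j$; all parameters are assumed values for illustration.

import numpy as np
from math import ceil

rng = np.random.default_rng(3)

def p_return_mc(l, t, mu, alpha, tau, p, n=200000):
    # Empirical P(T_j <= t) under the delay model of Eq. (15).
    t_cmp = l / mu + rng.exponential(l / (alpha * mu), size=n)
    n_tx = rng.geometric(1 - p, size=n) + rng.geometric(1 - p, size=n)
    return np.mean(t_cmp + tau * n_tx <= t)

def p_return_thm(l, t, mu, alpha, tau, p):
    # P(T_j <= t) from the Theorem, i.e., E(R_j(t; l)) / l.
    out = 0.0
    for nu in range(2, ceil(t / tau)):
        slack = t - l / mu - tau * nu
        if slack > 0:
            out += (nu - 1) * (1 - p) ** 2 * p ** (nu - 2) \
                   * (1 - np.exp(-(alpha * mu / l) * slack))
    return out

args = (8.0, 10.0, 2.0, 2.0, 1.4, 0.1)  # l, t, mu, alpha, tau, p
print(p_return_mc(*args), p_return_thm(*args))  # should agree closely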
A.2. Experiments

Simulation Setting: We consider a wireless scenario consisting of $N = 30$ client nodes and an MEC server, with computation and communication models similar to those used in (Dhakal et al., 2019b). Specifically, we use an LTE network, where each node is assigned 3 resource blocks. We use the same failure probability $p_j$ for all clients, capturing the typical practice in wireless of adapting the transmission rate to maintain a constant failure probability. To model heterogeneity, normalized link capacities are generated using the geometric sequence $\{1, k_1, k_1^2, \ldots\}$, and a random permutation of them is assigned to the clients. An overhead of 10% is assumed and each scalar is represented by 32 bits. The normalized processing powers are generated similarly using $\{1, k_2, k_2^2, \ldots\}$, with the pair $(k_1, k_2)$ fixed across experiments.

We consider two different datasets: MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). The features are vectorized, and the labels are one-hot encoded. For kernel embedding, the kernel width parameter is $\sigma = 5$. To model the non-IID data distribution, training data is sorted by class label and divided into 30 equally sized shards, one for each worker. Furthermore, training is done in batches, where each global batch is of size 12000, i.e., each epoch constitutes 5 global mini-batch steps. Similarly, encoding for CodedFedL is applied based on the global mini-batch, with a small coded redundancy. For both approaches, a fixed initial step size is used with periodic step decay, along with a fixed regularization parameter. For each dataset, training is performed on the training set, while accuracy is reported on the test set.
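The non-IID partition described above reduces to a few lines; a sketch assuming NumPy arrays for features and integer labels.

import numpy as np

def shard_non_iid(X, y, n_clients=30):
    # Sort by class label and split into equally sized contiguous shards,
    # so each client holds only a narrow slice of the label distribution.
    order = np.argsort(y, kind="stable")
    X, y = X[order], y[order]
    parts = np.array_split(np.arange(len(y)), n_clients)
    return [(X[i], y[i]) for i in parts]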
Speedup Results: Let $\gamma$ be the target accuracy for a given dataset, and let $t^U_\gamma$ and $t^C_\gamma$ respectively be the first time instants to reach the $\gamma$ accuracy for the uncoded scheme and CodedFedL. In Table 1, we summarize the results for the two datasets.
Table 1. Summary of Results

Dataset         γ (%)   t^U_γ (h)   t^C_γ (h)   Gain
MNIST           94.2    505         187         2.7×
Fashion-MNIST   84.2    513         216         2.4×
Convergence Curves for Fashion-MNIST: As in the case of MNIST, CodedFedL provides significant improvement in convergence performance over the uncoded scheme for Fashion-MNIST as well, as shown in Fig. 3(a). Fig. 3(b) illustrates that the coded federated aggregation described in Section 3.5 provides a good approximation of the true gradient over the entire distributed dataset across the client devices, even with a small coding redundancy.

Figure 3. Illustrating the results for Fashion-MNIST: (a) test accuracy with respect to wall-clock time under the uncoded scheme vs. the CodedFedL scheme; (b) test accuracy with respect to mini-batch update iteration under the uncoded scheme vs. the CodedFedL scheme.