Data-Aware Device Scheduling for Federated Edge Learning
Afaf Taïk, Student Member, IEEE, Zoubeir Mlika, Member, IEEE, and Soumaya Cherkaoui, Senior Member, IEEE
INTERLAB, Engineering Faculty, Université de Sherbrooke, Canada. {afaf.taik, zoubeir.mlika, soumaya.cherkaoui}@usherbrooke.ca

Abstract
Federated Edge Learning (FEEL) involves the collaborative training of machine learning models among edge devices, with the orchestration of a server in a wireless edge network. Due to frequent model updates, FEEL needs to be adapted to the limited communication bandwidth, the scarce energy of edge devices, and the statistical heterogeneity of edge devices' data distributions. Therefore, a careful scheduling of a subset of devices for training and uploading models is necessary. In contrast to previous work in FEEL, where the data aspects are under-explored, we place data properties at the heart of the proposed scheduling algorithm. To this end, we propose a new scheduling scheme for non-independent-and-identically-distributed (non-IID) and unbalanced datasets in FEEL. As the data is the key component of the learning, we propose a new set of considerations for data characteristics in wireless scheduling algorithms for FEEL. In fact, the data collected by the devices depends on the local environment and usage pattern; thus, the datasets vary in size and distribution among the devices. In the proposed algorithm, we consider both data and resource perspectives. In addition to minimizing the completion time of FEEL as well as the transmission energy of the participating devices, the algorithm prioritizes devices with rich and diverse datasets. We first define a general framework for data-aware scheduling and the main axes and requirements for diversity evaluation. Then, we discuss diversity aspects and some exploitable techniques and metrics. Next, we formulate the problem and present our FEEL scheduling algorithm. Evaluations in different scenarios show that our proposed FEEL scheduling algorithm can help achieve high accuracy in few rounds at a reduced cost.
Keywords
Edge Computing; Data Diversity; Federated Learning; Scheduling; Wireless Networks.
I. INTRODUCTION
Machine learning (ML) models require large and rich sets of data for training. Nonetheless, the collection of large volumes of data generated by connected devices over wireless networks raises concerns related to data privacy and network congestion. Federated edge learning (FEEL) [1, 2] was proposed to tackle these concerns by implementing distributed ML at the edge of the network. In addition to preserving privacy by keeping the data local, FEEL benefits from rapid access to the data generated by end devices and leverages their computational resources. In FEEL, the model training is performed on edge devices with the orchestration of a multi-access edge computing (MEC) server. Each device trains the model using its local data, and only the resulting model parameters or stochastic gradients are sent to the MEC server for aggregation.

The scarce resources, especially the communication bandwidth, limit the efficiency of FEEL operations, particularly for the transmission of large models. Consequently, most of the existing works in FEEL focus on designing scheduling algorithms with optimal resource usage. Several proposed works aim, for instance, to minimize the completion time of FEEL [3], the local computation energy [4], or the transmission energy of participating devices [5]. As a result, the number of scheduled devices is often restricted as a means to meet latency and energy constraints. This restriction often slows down the convergence of training [6, 5]. Therefore, scheduling algorithms aim to maximize the number of collected updates in each round, but this scheduling goal can be biased towards powerful devices with smaller datasets. The collected updates might then not be representative, as they are not trained on richer data. To avoid this issue, scheduling algorithms should also aim to diversify the participating devices through the use of fairness measures [7, 8]. However, the number of connected devices grows faster than network capacities, which will make scaling these algorithms harder in practice. Moreover, Internet of Things data is highly redundant and inherently unbalanced, given that the data collected by the devices depends on the local environment and the device's usage pattern. Therefore, the size and the statistical properties of local dataset distributions vary among devices [9]. Thus, a careful selection of participating devices imposes the consideration of their data properties, which motivates this work.

The main idea we advocate in this paper finds its roots in active learning [10, 11], where models are trained using fewer data points, provided that the chosen samples are diverse and informative. While in active learning the selection concerns single unlabelled data points, the selection in FEEL concerns complete datasets with already labelled data points, and therefore requires a different evaluation of diversity. Additionally, the incorporation of diversity measures in FEEL requires different considerations with regard to privacy and the properties of the FEEL setting [12]. In this paper, we consider diversity as the baseline criterion for choosing participating devices in FEEL. The diversity evaluation is applied to datasets, where priority is given to devices with potentially more informative datasets in order to speed up the training process.
To this end, we propose a method for incorporating datasets' diversity properties in FEEL scheduling by identifying a set of dataset diversity measures and designing a data-aware scheduling algorithm. The contributions of this paper can be summarized as follows:
1) we design a suitable diversity indicator, which serves as a priority criterion for the selection of devices;
2) we formulate a joint device selection and bandwidth allocation problem taking into account data diversity;
3) we prove that the formulated problem is NP-hard and we propose a scheduling algorithm based on an iterative decomposition technique to solve it; and
4) we evaluate the proposed diversity indicator and the proposed scheduling algorithm through extensive simulations.

The remainder of this paper is organized as follows. In Section II, we present the background for FEEL and related work. In Section III, we present the design of the proposed diversity measure, starting with the used uncertainty measures and their integration in FEEL. In Section IV, we integrate the proposed measure in the design of the joint selection and bandwidth allocation algorithm. Simulation results are presented in Section V. Finally, conclusions and final remarks are presented in Section VI.

II. BACKGROUND AND RELATED WORK
In this section, we start by briefly introducing the main concepts of FEEL. Then, we describe the existing challenges in deploying FL in wireless edge networks. Next, we discuss the related work, illustrate the existing research gaps, and motivate the need for a new scheduling scheme for FEEL.
A. Federated Edge Learning
In contrast to centralized training, FL keeps the training data at each device and learns a shared global model through the federation of distributed connected devices. Keeping data locally yields many benefits, namely preserved privacy, reduced bandwidth use, and rapid access to data. Applying FL to wireless edge networks forms the so-called concept of federated edge learning, or simply FEEL. FEEL involves a multi-access edge computing (MEC) [13] server that performs aggregation and edge devices that perform collaborative learning. The MEC server, which is equipped with a parameter server (PS), can be a next generation nodeB (gNB) or simply a base station (BS) in a wireless edge cellular network in which there are N edge devices that collaboratively train a shared model.

Each device k has a local dataset D_k with a data size of |D_k|. The goal is to find the optimal global model parameters w ∈ R^l that minimize the average prediction loss f(w):

$$\min_{w \in \mathbb{R}^l} f(w) = \frac{1}{D} \sum_{k=1}^{N} f_k(w), \qquad (1)$$

where w is the model parameter vector to be optimized with dimension l, f_k(w) is the loss value function computed by device k based on its local training data, and D is the total number of data points across all devices (i.e., $D = \sum_{k=1}^{N} |D_k|$). Several models' loss functions can be trained using FEEL, such as linear and logistic regression, support vector machines, and artificial neural networks.

Ideally, all the devices independently train their local models using their local training data. Then, each one uploads its gradient updates to the server for aggregation. The server aggregates the received local updates, typically by averaging, to obtain a global model. Afterwards, the server sends the global model to the edge devices, and a new iteration begins where each device computes the gradient updates and uploads them to the server. Nonetheless, the constrained edge resources and limited communication bandwidth in wireless (edge) networks result in significant delays for FEEL. The federated averaging (FedAvg) [1] algorithm was therefore proposed to perform FEEL in a communication-efficient way. FedAvg is perhaps the most adopted communication-efficient FEEL algorithm. The main idea behind FedAvg is to select a small subset of devices and to run local epochs, in parallel, using stochastic gradient descent (SGD) on the local datasets of the selected devices. Next, all devices' resulting model updates are averaged to obtain the global model. In contrast to the naive application of SGD, which requires sending updates very often, FedAvg performs more local computation and less frequent communication updates. Since FedAvg assumes synchronous update collection, this may result in large communication delays. In fact, the computation, storage, and communication capabilities among participating devices might be very different. Further, devices are frequently offline or unavailable, either due to low battery levels or because their resources are fully or partially used by other applications. Thus, due to the synchronous nature of FedAvg, straggler devices, i.e., devices with low performance, will cause large delays to the whole learning process. The weighted aggregation step at the core of FedAvg is illustrated below.
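As a concrete illustration of the aggregation step, the following minimal sketch (hypothetical code, not the reference FedAvg implementation) computes the data-size-weighted average of the local models returned by the selected devices:

```python
import numpy as np

def fedavg_aggregate(local_models, dataset_sizes):
    """Weighted average of local model parameter vectors (FedAvg-style).

    local_models:  list of 1-D numpy arrays, one per selected device
    dataset_sizes: list of |D_k| values for the same devices
    """
    total = float(sum(dataset_sizes))
    # Each device's update is weighted by its share of the round's data, |D_k| / D_r.
    return sum((n / total) * w for w, n in zip(local_models, dataset_sizes))

# Example: three devices with different dataset sizes
models = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.2, 0.4])]
sizes = [500, 1500, 1000]
global_model = fedavg_aggregate(models, sizes)
```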
Despite the promising theoretical results attained using FedAvg, deploying it or other FEEL algorithms in wireless edge networks is still challenging due to the fast-changing nature of the network, limited resources, and statistical heterogeneity. Statistical heterogeneity is a very important aspect of FEEL. In fact, most use cases of FEEL suppose that the system does not have control over participating devices and their training data. Furthermore, data distributions in user equipment depend heavily on the users' behaviour. As a result, any particular user's local dataset will not be representative of the population distribution. Additionally, the datasets are massively distributed, statistically heterogeneous, i.e., non-independent and identically distributed (non-IID) and unbalanced, and highly redundant. Moreover, the raw generated data is often privacy-sensitive, as it can reveal personal and confidential information.

The wireless edge environment comprises devices with heterogeneous and limited capabilities, as well as heterogeneous data distributions. As a result, many new considerations related to communication resources and edge devices' data should be reflected in the design and deployment of FEEL algorithms. This motivates several works in FEEL to adapt to the communication-restrained edge environment.

B. Related Work
Several prior works investigated communication-constrained FEEL systems from different perspectives. In addition to several compression [14] and partial participation [15] techniques, several mechanisms were proposed to adapt federated training to the constrained resources. For instance, the authors in [16] propose an accommodation mechanism where the numbers of global and local iterations are changed depending on the available communication and computation resources. Another suggested approach relies on collecting the updates in an asynchronous manner [17], which allows a smooth adaptation to heterogeneous resources and a flexible update collection. However, due to the effect of stale updates on learning, synchronous update collection remains the preferred method. As a result, a common approach is to selectively schedule a subset of devices to send their updates in each communication round. For example, the authors in [3] proposed a client selection algorithm to reduce the latency of model training, where only the end devices with good communication and computation capabilities are chosen, thus avoiding the straggler problem. Nevertheless, this method is biased toward powerful devices with better channel states, which discards devices with potentially more informative or important updates, and might lead to models that cannot generalize to a wide range of devices. To diversify the sources of updates, several works adopted scheduling algorithms that aim to maximize the number of participating devices. For instance, the authors in [5] proposed an energy-efficient joint bandwidth allocation and scheduling algorithm, which ensures the training speed by collecting the maximum number of updates possible. Nonetheless, this method does not guarantee the diversity of update sources. Consequently, fairness measures [7, 8] were adopted in scheduling policies to ensure gradient diversity. However, relying on a strict fairness-based policy may yield a low number of collected updates within a round.

Another approach to selective scheduling relies on evaluating the resulting model in an attempt to reduce the number of collected updates by removing the irrelevant ones [18]. This is achieved by measuring the significance of a local update relative to the current global model, and whether this update aligns with the collaborative convergence trend. Nonetheless, this approach may not be energy efficient, as it is applied post-training. Computing updates is an energy-consuming operation; as a result, disregarded updates are synonymous with wasted energy.

Despite the variety of research progress, the design of resource-efficient FEEL scheduling algorithms under highly heterogeneous dataset distributions remains a topic that is not well addressed. This motivates our work, in which we investigate a possible direction to evaluate the potential significance of the updates through local dataset characteristics, namely size and diversity.

III. DIVERSITY IN FEDERATED LEARNING
Our idea comes from the fact that many prior works in ML have imposed diversity on the construction of training batches to improve the efficiency of the learning process [19]. Furthermore, active learning [11, 10] is premised on the idea that models can be trained with fewer data points, provided that the selected samples are diverse and more informative. In active learning, diversity is used as a criterion for choosing informative data points for efficient ML training. However, to the best of our knowledge, this premise has never been used in FEEL prior to this work. Thus, we investigate the possibility of exploiting different dataset properties to carefully select devices with potentially more informative datasets and less redundancy, by measuring their size and diversity. Therefore, we propose a FEEL algorithm where these aspects are at the heart of the device selection.

The first question to be asked is: what would be a good diversity measure for FEEL? Various measures of uncertainty are used in active learning [20] to choose the samples that should be labeled. In FEEL, the device selection does not concern independent samples; instead, the diversity should be evaluated at the level of the entire dataset. Moreover, in the premise of FEEL, the labels are already known, which gives the possibility to use more informed measures. For instance, we can use the Shannon entropy [21] or the Gini-Simpson index [22] for classification problems. Other methods can be used for sequential data, such as approximate entropy (ApEn) and sample entropy (SampEn) [23].

The Gini-Simpson index is a modification of the Simpson index. The Simpson index measures the probability that two samples taken at random from the dataset of interest are from the same class. It is calculated as follows:

$$\lambda = \sum_{c=1}^{C} p_c^2, \qquad (2)$$

where C is the total number of classes and p_c is the probability of class c. The original Simpson index λ represents the probability that two samples taken at random from the dataset are of the same type (i.e., are within the same class). The Gini-Simpson index is its transformation 1 − λ, which represents the probability that the two samples belong to different classes. The Gini-Simpson index is used in different applications such as financial markets [24] and analyzing ECG signals [25].

The Shannon entropy also quantifies the uncertainty of a prediction, and was used in several applications such as text prediction and image classification [26]. In the context of FEEL, it can be used as follows:

$$H = -\sum_{c=1}^{C} p_c \log(p_c), \qquad (3)$$

where C is the total number of classes and p_c is the probability of class c. The Shannon entropy is not defined for the extreme case of zero samples in a class, which can be problematic in some highly unbalanced classification problems.

For sequence data, statistical measures such as the mean and the variance are not enough to illustrate regularity, as they are influenced by system noise. ApEn was proposed to quantify the amount of regularity and the unpredictability of time-series data [27]. It is based on the comparison between values of data in successive vectors, by quantifying how many successive data points vary more than a defined threshold. A random time series with fewer data points can have a lower ApEn than a more regular time series, whereas a longer random time series will have a higher ApEn. SampEn [28] was proposed as a modification of ApEn. It can be used for assessing the complexity of time-series data, with the advantage of being independent from the length of the vectors.
Both these measures can help eliminate outliers; however, it should be noted that computing ApEn and SampEn is a computationally heavy task, so they should be evaluated on a small sample rather than on the entire dataset.

To sum up, several diversity measures can be applied to datasets for different applications; a minimal computation sketch for the classification metrics is given below. Dataset diversity allows a more informed participant selection in FEEL. Choosing devices with diversified datasets can accelerate the training and avoid overfitting, as the datasets contain more information.
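As an illustration, the following sketch (hypothetical code, assuming labelled classification data) computes the Gini-Simpson index (2) and the Shannon entropy (3) from the per-class sample counts available on a device:

```python
import numpy as np

def gini_simpson(label_counts):
    """Gini-Simpson index 1 - sum(p_c^2) of a labelled dataset."""
    p = np.asarray(label_counts, dtype=float)
    p = p / p.sum()                 # class probabilities p_c
    return 1.0 - np.sum(p ** 2)     # probability two random samples differ in class

def shannon_entropy(label_counts):
    """Shannon entropy -sum(p_c log p_c); empty classes are skipped."""
    p = np.asarray(label_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                    # avoid log(0) for classes with zero samples
    return -np.sum(p * np.log(p))

counts = [120, 30, 0, 50]           # samples per class on one device
print(gini_simpson(counts), shannon_entropy(counts))
```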
IV. SYSTEM MODEL AND PROBLEM FORMULATION

Having discussed several diversity measures that can be used in various FEEL applications, let us introduce how such diversity measures can be used in the overall design of the FEEL algorithm. Hereinafter, we consider a FEEL system with multiple devices and a single MEC server. The system model is illustrated in Fig. 1. In this section, we introduce the different elements of the system model, then we formulate a joint device selection and bandwidth allocation problem. To study the suitability of the proposed selection criteria in the context of FEEL, we consider a wireless edge network composed of a MEC server and K devices collaboratively training a shared model. Each device k is characterized by a local dataset D_k with a data size |D_k|.
A. Learning Model

Fig. 1: FEEL system model.

First, the global model's architecture and weights are initialized by the MEC server. At the beginning of each training round r, the devices send their information and dataset diversity indicators to the MEC server. Based on the received information, alongside the evaluated channel state information, the server selects a subset S_r of the devices and allocates the necessary bandwidth to each scheduled device, which then receives the global model g. The scheduling of the devices, presented in Algorithm 2, is based on the trade-off between the datasets' diversity and the required time and energy, under the constraint of a minimum number of devices that should be scheduled in each round. In fact, given synchronous aggregation, the MEC server requires a minimum number N of updates to be collected to consider a round complete. Then, each device k in the chosen subset S_r uses |D_k| examples from its local dataset. SGD is then used by each device k to compute its local update over a period of E local epochs. The updated models w_k are sent to the MEC server for aggregation. Ideally, all devices transmit their trained local models to the MEC server simultaneously. The FEEL process is repeated over r_max communication rounds, and we use $D_r = \sum_{k \in S_r} |D_k|$ to denote the total size of the datasets of all selected devices.

In order to aggregate the client updates, the MEC server uses the weighted average technique of the FedAvg algorithm proposed in [1]. The MEC server aggregates the updates and sends the resulting parameters to a new subset of selected devices. This process is repeated until the desired prediction accuracy is reached or a maximum number of rounds is attained. The considered FEEL procedure is detailed in Algorithm 1.
B. Dataset Diversity Index Design
Due to the unbalanced and non-IID nature of the distributions, and under high bandwidth constraints, the dataset size and diversity need to be considered in the device selection.

Algorithm 1 FEEL Procedure
  while r < r_max or accuracy < desired accuracy do
    if r = 0 then
      initialize the model's parameters at the MEC server
    end if
    Receive devices' information (transmit power, available data size, dataset diversity index)
    Schedule a subset S_r of devices with at least N devices using Algorithm 2
    for device k ∈ S_r do
      k receives model g
      k trains on local data D_k for E epochs
      k sends updated model w_k to the MEC server
    end for
    MEC server computes the new global model using the weighted average: g ← Σ_{k∈S_r} (|D_k| / D_r) w_k
    start next round: r ← r + 1
  end while

Additionally, we consider a second aspect, which can be viewed at the system level: the diversity of sources. This goal is achieved by maximizing the fairness in the number of collected updates among devices, which guarantees the diversity of the data sources.

Therefore, the goals of the device selection are twofold: 1) select devices with potentially informative datasets, which is achieved by evaluating the size and diversity of the datasets; and 2) guarantee that the selected devices are diversified, which is attained by adding an age-of-update term to the designed diversity index.

Since our scheduling problem can consider multiple criteria, namely dataset diversity and fairness of the selection, and each measure is calculated with a function that has an output on a different scale, the function should be designed to output a weighted rank value bounded in [0, γ_i], where i ∈ {dataset diversity, dataset size, age}. The value of this function is given as v_i × γ_i, where γ_i is the adjustable weight for each metric assigned by the server and v_i is the normalized value of metric i, calculated as follows:

$$v_i = \frac{\text{measured value of metric } i}{\text{maximum for metric } i}.$$

We define the diversity index of dataset k as:

$$I_k = \sum_{i} v_{i,k} \, \gamma_{i,k}, \qquad (4)$$

where v_{i,k} and γ_{i,k} are defined as v_i and γ_i, respectively, but for the specific dataset k. Note that this measure is in line with federated learning principles, as it can be evaluated on-device, and it does not reveal any privacy-sensitive information about the dataset.

We formulate the first goal of the device selection problem as:

$$\max_{x} \sum_{k=1}^{K} I_k x_k, \qquad (5)$$

where x = [x_1, ..., x_K] and x_k, for k = 1, ..., K, is a binary variable that indicates whether or not device k is scheduled to send an update, and I_k is the diversity index. A minimal sketch of the index computation follows.
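The following sketch (hypothetical helper code; the metric names, weights γ_i, and normalization maxima are assumptions for illustration) computes the diversity index (4) for one device from its reported metrics:

```python
def diversity_index(metrics, weights, maxima):
    """Diversity index I_k = sum_i v_i * gamma_i (Eq. 4).

    metrics: dict of raw per-device values, e.g.
             {"diversity": gini_simpson_value, "size": |D_k|, "age": rounds_since_selected}
    weights: server-assigned gamma_i for each metric
    maxima:  maximum observed value of each metric, used for normalization
    """
    index = 0.0
    for name, value in metrics.items():
        v = value / maxima[name] if maxima[name] > 0 else 0.0  # v_i in [0, 1]
        index += v * weights[name]                             # term bounded by gamma_i
    return index

# Example with equal, hypothetical weights
metrics = {"diversity": 0.72, "size": 900, "age": 4}
maxima = {"diversity": 0.90, "size": 1500, "age": 10}
weights = {"diversity": 1 / 3, "size": 1 / 3, "age": 1 / 3}
print(diversity_index(metrics, weights, maxima))
```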
C. Transmission Model

As presented in Algorithm 1, the transmission aspect is also considered during device scheduling. Given that the bandwidth is the bottleneck of FEEL, it is essential to estimate the required transmission time and energy during scheduling, as a means to avoid stragglers and device drop-out due to low energy. Henceforth, we consider orthogonal frequency-division multiple access (OFDMA) for local model uploading from the devices to the MEC server, with a total available bandwidth of B Hz. We define α = [α_1, ..., α_K], where for each device k, α_k ∈ [0, 1] is the bandwidth allocation ratio. The channel gain between device k and the BS is denoted by g_k. Due to the limited bandwidth of the system, the bandwidth allocation ratios should respect the constraint $\sum_{k=1}^{K} \alpha_k \le 1$. The achievable rate of device k when transmitting to the BS is given by:

$$r_k = \alpha_k B \log_2\left(1 + \frac{g_k P_k}{\alpha_k B N_0}\right), \quad \forall k \in [1, K], \qquad (6)$$

where P_k is the transmit power of device k and N_0 is the power spectral density of the Gaussian noise. Based on the synchronous aggregation assumption, the duration of a communication round depends on the last scheduled device to finish uploading. The round duration is therefore given by:

$$T = \max_{k \in [1, K]} \left( (t_k^{\text{train}} + t_k^{\text{up}}) \, x_k \right), \qquad (7)$$

where t_k^train and t_k^up are, respectively, the training time and transmission time of device k. The training time t_k^train depends on device k's dataset properties as well as on the model to be trained. It can be estimated using Eq. (8):

$$t_k^{\text{train}} = \frac{E \, |D_k| \, C_k}{f_k}, \qquad (8)$$

where C_k (cycles/bit) is the number of CPU cycles required for computing one sample of data at device k and f_k is its computation capacity. To send an update of size s within a transmission time t_k^up, we must have:

$$t_k^{\text{up}} = \frac{s}{r_k}. \qquad (9)$$

Finally, the wireless transmit energy of device k is given by:

$$E_k = P_k \, t_k^{\text{up}}. \qquad (10)$$
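As a worked illustration of Eqs. (6)-(10), the sketch below (with hypothetical parameter values) estimates the rate, per-round time contributions, and transmit energy of one device:

```python
import math

def device_cost(alpha_k, B, g_k, P_k, N0, E, D_k_bits, C_k, f_k, s):
    """Per-device round cost from Eqs. (6)-(10).

    Returns (rate in bits/s, training time s, upload time s, transmit energy J).
    """
    # Eq. (6): achievable uplink rate over the allocated share of the band
    r_k = alpha_k * B * math.log2(1 + g_k * P_k / (alpha_k * B * N0))
    # Eq. (8): local training time for E epochs over |D_k| bits of data
    t_train = E * D_k_bits * C_k / f_k
    # Eq. (9): time to upload a model of s bits
    t_up = s / r_k
    # Eq. (10): transmit energy of the upload
    E_k = P_k * t_up
    return r_k, t_train, t_up, E_k

# Hypothetical values: 10% of a 1 MHz band, 100 kbit model, 2 GHz CPU
print(device_cost(alpha_k=0.1, B=1e6, g_k=1e-7, P_k=2.0, N0=1e-13,
                  E=1, D_k_bits=8e5, C_k=20, f_k=2e9, s=1e5))
```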
D. Device Scheduling Problem Formulation

Considering the collaborative aspect of FEEL and the communication bottleneck, we define the following goals for the device scheduling algorithm:
• From the perspective of accelerating learning, we adopt the goal defined in Subsection IV-B in Eq. (5).
• From the perspective of the devices, it is desirable to consume the least amount of energy to carry out the training and uploading tasks. Given that the participating devices are responsible for the training, the aspect that can be adjusted is the upload energy. Therefore, this goal is to minimize the consumed upload energy of the scheduled devices:

$$\min_{x, \alpha} \sum_{k=1}^{K} x_k E_k. \qquad (11)$$

• Due to the heterogeneous device capabilities and the largely varying data sizes, it is hard to estimate a suitable deadline for each round. To avoid the stragglers' problem, it is desirable for the MEC server to have a short round duration. Thus, a part of the objective is the minimization of the communication round duration:

$$\min_{x, \alpha} T. \qquad (12)$$

By combining these three goals, the problem is formulated as a multi-objective optimization problem as follows:

$$\underset{x, \alpha}{\text{minimize}} \quad \left\{ \sum_{k=1}^{K} x_k E_k, \;\; T, \;\; -\sum_{k=1}^{K} x_k I_k \right\} \qquad (13a)$$

subject to:

$$(t_k^{\text{train}} + t_k^{\text{up}}) \, x_k \le T, \quad \forall k \in [1, K], \qquad (13b)$$
$$\sum_{k=1}^{K} \alpha_k \le 1, \qquad (13c)$$
$$0 \le \alpha_k \le 1, \quad \forall k \in [1, K], \qquad (13d)$$
$$\sum_{k=1}^{K} x_k \ge N, \qquad (13e)$$
$$x_k \in \{0, 1\}, \quad \forall k \in [1, K]. \qquad (13f)$$
Problem (13) is a non-linear multi-objective problem and is thus very challenging to solve. Even worse, we show in the following that the problem is NP-hard even in a single-objective case. Problem (13) is equivalent to a knapsack problem and thus it is NP-hard. Indeed, for fixed transmit powers P_k, fixed α_k, and N = 1, the problem is equivalent to maximizing the weighted number of devices, i.e., Σ_k I_k x_k, subject to a knapsack capacity given by Σ_k α_k x_k ≤ 1. Constraints (13b) can be verified for each device to filter out the devices that do not respect them. Thus, the problem is equivalent to a knapsack problem, and since the latter is NP-hard, so is problem (13).

V. SCHEDULING ALGORITHM
In this section, we present our data-aware FEEL scheduling algorithm to optimize the multi-objective problem defined in the previous section. The defined multi-objective problem is a mixed-integer non-linear program. To solve it efficiently, we decompose it into two sub-problems, which we discuss in the sequel.

The first sub-problem (Sub1) is a selection problem in which we select the devices in order to optimize a weighted linear combination of the different objectives. The selection sub-problem is formulated as follows:

$$\underset{x}{\text{minimize}} \quad \lambda_E \sum_{k=1}^{K} x_k E_k + \lambda_T T - \lambda_I \sum_{k=1}^{K} x_k I_k \qquad (14a)$$

subject to:

$$x_k \in \{0, 1\}, \quad \forall k \in [1, K], \qquad (14b)$$
$$\sum_{k=1}^{K} x_k \ge N, \qquad (14c)$$

where λ_E, λ_T, and λ_I are positive scaling constants used first to scale the values of the objective function and second to combine the different conflicting objectives into a single linear one.

The second sub-problem (Sub2) is a bandwidth allocation problem in which the device selection decision is fixed by solving the previous selection sub-problem. The objective of the bandwidth allocation problem consists of a linear combination, using a positive constant ρ, of the consumed energy and the round's completion time. This problem is formulated as follows:

$$\underset{\alpha}{\text{minimize}} \quad \rho \sum_{k=1}^{K} x_k E_k + (1 - \rho) \, T \qquad (15a)$$

subject to:

$$\sum_{k=1}^{K} \alpha_k \le 1, \qquad (15b)$$
$$0 \le \alpha_k \le 1, \quad \forall k \in [1, K]. \qquad (15c)$$

To solve Sub1, we use relaxation and rounding. Specifically, we relax the integer constraint x_k ∈ {0, 1} to the real-valued constraint 0 ≤ x_k ≤ 1; the integer solution is then determined by rounding after solving the relaxed problem and verifying whether condition (14c) is satisfied. The relaxed problem can be written as:

$$\underset{x}{\text{minimize}} \quad \lambda_E \sum_{k=1}^{K} x_k E_k + \lambda_T T - \lambda_I \sum_{k=1}^{K} x_k I_k \qquad (16a)$$

subject to:

$$0 \le x_k \le 1, \quad \forall k \in [1, K]. \qquad (16b)$$

The continuous value of x_k can be viewed as the selection priority of device k; therefore, if condition (14c) is not satisfied, we set x_k = 1 for the N devices with the highest priorities. To solve Sub2, we applied off-the-shelf solvers in order to obtain the optimal bandwidth allocation ratios.

The proposed FEEL algorithm is an iterative algorithm that solves each sub-problem sequentially, as discussed previously, and updates the solution in each iteration; a sketch of the relax-and-round step is given below. The pseudo-code of the data-aware scheduling (DAS) algorithm is given in Algorithm 2. The algorithm iterates until convergence or until a maximum number of iterations is reached, whichever comes first. Convergence occurs when the values of x and α do not undergo a large change, i.e., when their values are almost the same as in the previous iteration, the loop is terminated. The maximum number of iterations, iterations_max, is used to guarantee the algorithm's termination.
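A minimal sketch of the rounding step (hypothetical code; the relaxed solution x_relaxed is assumed to come from an off-the-shelf solver such as scipy.optimize, and the 0.5 threshold is an assumption):

```python
import numpy as np

def round_selection(x_relaxed, N, threshold=0.5):
    """Round the relaxed solution of Sub1 and enforce condition (14c).

    x_relaxed: continuous priorities in [0, 1] from the relaxed problem (16)
    N:         minimum number of devices per round
    """
    x = (np.asarray(x_relaxed) >= threshold).astype(int)
    if x.sum() < N:
        # Condition (14c) violated: select the N devices with the highest priorities.
        top = np.argsort(x_relaxed)[::-1][:N]
        x = np.zeros_like(x)
        x[top] = 1
    return x

print(round_selection([0.9, 0.2, 0.7, 0.1, 0.4], N=3))  # -> [1 0 1 0 1]
```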
Algorithm 2 DAS algorithm for FEEL
  initialize x_k = 1, ∀k ∈ [1, K]
  uniformly allocate the bandwidth
  iterations ← 0
  while iterations < iterations_max and not convergence do
    Solve Sub1: return x
    Round x
    if condition (14c) is satisfied then
      continue
    else
      select the N devices with the highest priorities
    end if
    Solve Sub2: return α
    iterations ← iterations + 1
  end while

VI. SIMULATION AND RESULTS
In this section, we present the performance evaluation of the DAS algorithm. To do so, we first evaluate our proposed diversity index defined in Section IV-B, then we compare DAS to two scheduling strategies: 1) a baseline scheduling where all the devices participate in the training, with time and energy optimized by solving problem Sub2; we compare against this baseline in order to evaluate the scalability of the algorithm in terms of consumed energy and time; and 2) an age-of-update based scheduling (ABS) algorithm [8, 7]; specifically, we used the age-based priority function proposed in [8] with α = 1 and f(k) = log(1 + T(k)), where T(k) is the number of rounds since the last selection of device k. This algorithm considers both the number and the variety of the update sources. Therefore, by comparing DAS to ABS, we are able to measure the importance of dataset diversity in the algorithm.

The simulations were conducted on a desktop computer with a 2.6 GHz Intel i7 processor, 16 GB of memory, and an NVIDIA GeForce RTX 2070 Super graphics card. We used PyTorch [29] as the machine learning library and SciPy Optimize [30] for the optimization modeling and solver. In the numerical results, each presented value is the average of 50 independent runs.

We consider a cellular network modelled as a square area with one BS located at its center. The K edge devices are randomly deployed inside the square following a uniform distribution. Unless specified otherwise, the simulation parameters are as follows. We consider K = 100 edge devices and N = 1 as the minimum number of devices to be scheduled. The OFDMA bandwidth is B = 1 MHz. The channel gain g_k between edge device k and the BS includes large-scale pathloss and small-scale fading following a Rayleigh distribution, i.e., $|g_k| = d_k^{-\alpha} |h_k|$, where h_k is a Rayleigh random variable, α is the pathloss exponent, and d_k is the distance between edge device k and the BS. We set the scaling parameters of problem Sub1 (λ_E, λ_T, λ_I) and the parameter ρ of Sub2 to fixed values. The remainder of the used parameters are summarized in Table I.
TABLE I: Generated values.

  Devices' CPU frequency         [1, 3] GHz
  CPU cycles per bit             [10, 30] cycles/bit
  Transmit power                 [1, 5]
  Model size                     100 kbits
  Bandwidth                      1 MHz
  Number of shards per device    [1, 30]
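A sketch of how such a device population might be generated (hypothetical code reflecting Table I and the stated channel model; the square side and the pathloss exponent are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100  # number of edge devices

# Device capabilities drawn from the ranges in Table I
cpu_freq = rng.uniform(1e9, 3e9, K)      # CPU frequency f_k in Hz ([1, 3] GHz)
cycles_per_bit = rng.uniform(10, 30, K)  # C_k in cycles/bit
tx_power = rng.uniform(1, 5, K)          # transmit power P_k

# Channel model: |g_k| = d_k^{-alpha} |h_k|, Rayleigh small-scale fading
side = 500.0                             # assumed square side in meters
alpha_pl = 3.0                           # assumed pathloss exponent
positions = rng.uniform(0, side, (K, 2))
d_k = np.linalg.norm(positions - side / 2, axis=1)  # distance to the centered BS
h_k = rng.rayleigh(scale=1.0, size=K)
g_k = d_k ** (-alpha_pl) * h_k
```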
We used the benchmark image classification dataset MNIST [31], which we distribute among the simulated devices. The data distribution we adopted is as follows: we first sort the data by digit label, then we form 1200 shards composed of 50 images each. We then allocate a minimum of 1 shard and a maximum of 30 shards to each of the 100 devices considered in this simulation; a sketch of this partitioning is given below. We keep 10% of the distributed data for testing, and use the remainder for training.

We train two models: a convolutional neural network (CNN) with two 5x5 convolution layers (the first with 10 channels, the second with 20, each followed by 2x2 max pooling), two fully connected layers with 50 units and ReLU activation, and a final softmax output layer; and a simpler multi-layer perceptron (MLP) model with two fully connected layers. Since our goal in using these models is to evaluate our scheduling algorithm, and not to achieve state-of-the-art accuracy on MNIST, they are sufficient for our purpose. Furthermore, the selected models are fairly small, so they can realistically be trained on resource-constrained and legacy devices, using reasonable amounts of energy in short time windows. We first evaluate the diversity index through several experiments. Then we evaluate DAS over wireless networks by varying the model size and the number of local epochs.
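A minimal sketch of the shard-based non-IID partition described above (hypothetical code; `labels` is assumed to hold the MNIST training labels):

```python
import numpy as np

def shard_partition(labels, n_shards=1200, shard_size=50,
                    n_devices=100, min_shards=1, max_shards=30, seed=0):
    """Sort-by-label shard partition producing non-IID, unbalanced local datasets."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                      # sample indices sorted by digit label
    shards = [order[i * shard_size:(i + 1) * shard_size] for i in range(n_shards)]
    rng.shuffle(shards)                             # shuffle shard order, not labels
    per_device = rng.integers(min_shards, max_shards + 1, size=n_devices)
    datasets, cursor = [], 0
    for n in per_device:
        n = min(n, len(shards) - cursor)            # stop when the shard pool is exhausted
        chunk = shards[cursor:cursor + n]
        datasets.append(np.concatenate(chunk) if n > 0 else np.array([], dtype=int))
        cursor += n
    return datasets                                 # one array of sample indices per device
```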
A. Diversity Index Evaluation

To test the proposed index, we train both models (i.e., CNN and MLP) using the FedAvg algorithm over a total of 15 rounds. We used the Gini-Simpson index to evaluate the datasets' diversity, and we set the weights of the index equally. We compare DAS performance to random selection-based scheduling.

To clearly illustrate the efficiency of the diversity index in device selection, we stress-test the selection by limiting the number of selected devices. Furthermore, the trade-off between local updates and global aggregation imposes the evaluation of the performance when varying the number of local iterations.

As a first experiment, we limited the number of selected devices to 3, 5, and 7. Fig. 2 shows the obtained accuracy throughout the training rounds. The average accuracy obtained using DAS is significantly higher across the different simulations. Thus, it is clear that using the diversity index as a criterion for scheduling devices significantly accelerates the learning, especially when the number of devices that can be scheduled is low.

In a second experiment, we fix the number of selected devices to 7, and we vary the number of local epochs E ∈ {1, 2, 3}. Fig. 3 illustrates the obtained results for the CNN and MLP models. Adding more local computation allows significant gains in communication and in accuracy, especially for the MLP model. We notice that the data-aware device selection still surpasses random selection in these simulations when adding more local computation. All in all, these results validate our hypothesis on the importance of dataset diversity.

Fig. 2: Average test accuracy when varying the number of selected devices: (a) CNN, (b) MLP.

Fig. 3: Average test accuracy when varying the number of local epochs with a fixed number of devices: (a) CNN, (b) MLP.
B. Effect of Model Size
Fig. 4 and Fig. 5 show that using the DAS algorithm, the number of communication rounds required to reach the desired accuracy is always lower than (or at most equal to) the one needed using ABS. For both models, the model size can affect the convergence speed of the learning. It should be noted that ABS tends to select more devices in early communication rounds, which can yield higher accuracy than DAS, as shown in Fig. 4a. Nonetheless, ABS gives higher priority to devices that did not yet participate, which leads to a decrease in the number of devices that can be selected, and thus a slower convergence over the following rounds.

Fig. 4: The impact of the model size on the number of rounds required to achieve the desired accuracy using the CNN model: (a) s = 100 kbits, (b) s = 150 kbits, (c) s = 200 kbits.

Fig. 5: The impact of the model size on the number of rounds required to achieve the desired accuracy using the MLP model: (a) s = 100 kbits, (b) s = 150 kbits, (c) s = 200 kbits.

Indeed, the baseline scenario requires fewer rounds to reach high accuracy levels; nonetheless, it is hard to scale to a large number of devices in rapidly changing environments. In fact, while Fig. 6 and Fig. 7 show that ABS and DAS are comparable in terms of energy and completion time, their consumed energy and time represent only a fraction of the baseline scenario's requirements. This is mainly due to scheduling only a fraction of the devices, which for both algorithms did not exceed 20%.

Considering a goal accuracy of 77% for the MLP model and 92% for the CNN, the values in Fig. 6 and Fig. 7 represent gains in energy compared to the baseline as follows: in total, the consumed energy per device for ABS represents an average gain of 68.85% for training the MLP model and 76.56% for the CNN model. Even higher gains are achieved by DAS, with a gain of 78.86% when training the MLP model and 84.96% for the CNN. In terms of completion time, the time required across the different experiments for DAS is significantly lower than that required for ABS. These results are consistent with the number of rounds required for training presented in Fig. 4 and Fig. 5. Additionally, the ABS algorithm selects more devices in the first few rounds and far fewer devices in the later rounds, thus leading to longer rounds at the beginning of the training and shorter rounds later.

These results show that the careful selection of participating devices, jointly with the optimized bandwidth allocation, makes DAS scalable.

Fig. 6: Energy per device and completion time for training the CNN model for a goal accuracy of 92%: (a) energy per device, (b) completion time.

Fig. 7: Energy per device and completion time for training the MLP model for a goal accuracy of 77%: (a) energy per device, (b) completion time.
Fig. 8: The impact of the number of local epochs on the number of rounds required to achieve the desired accuracy using the CNN model: (a) E = 1, (b) E = 2, (c) E = 3.

Fig. 9: The impact of the number of local epochs on the number of rounds required to achieve the desired accuracy using the MLP model: (a) E = 1, (b) E = 2, (c) E = 3.
C. Local Computation
In this section, we study the effect of increasing the computation per device. We fix s = 100 kbits and add more local computation per client in each round.

By increasing the number of local epochs E, we take full advantage of the available parallelism on the client hardware, which leads to higher accuracy on the test set with fewer communication rounds. However, as shown in previous work [9, 1], long local computation may lead to divergence of the training loss. Furthermore, due to the changing environment caused by the mobility of the devices, it is hard to plan the communication rounds ahead for large E. Therefore, we limit the experiments to 1, 2, and 3 local epochs.

Fig. 8 and Fig. 9 show that adding more local epochs per round can produce a dramatic decrease in communication costs. We see that increasing the number of local epochs E particularly benefits DAS, as it achieves a behaviour closer to the baseline, especially for the less powerful MLP model.

Fig. 10 and Fig. 11 show how trading frequent communication for more local computation has several benefits, as it reduces the required transmission energy as well as the FEEL completion time. Similarly to the previous experiments, we considered a goal accuracy of 77% for the MLP model and 92% for the CNN; the gains in energy compared to the baseline when increasing the number of local epochs are as follows: in total, the consumed energy per device for ABS represents gains of 83.5% and 95.97% for the MLP model when training for 2 and 3 epochs, respectively, and on average 88.19% and 96.95% for the CNN model. Even higher gains are achieved by DAS, with gains of 86.65% and 96.54% when training the MLP model, and 90% and 96.54% for the CNN model. These results are consistent with the fraction of selected devices, which does not exceed 20% on average.

We noticed that when increasing the number of local epochs, the completion time required by DAS becomes slightly higher than that required by ABS. This is mainly due to prioritizing devices that have larger datasets, which leads to longer training durations. Nevertheless, the difference can be seen as marginal when considering the gain in energy.

Fig. 10: Energy per device and completion time for training the CNN model for a goal accuracy of 92%: (a) energy per device, (b) completion time.

Fig. 11: Energy per device and completion time for training the MLP model for a goal accuracy of 77%: (a) energy per device, (b) completion time.

VII. CONCLUSION
In this paper, we have investigated the problem of device scheduling in federated edge learning by formulating the following question: can the use of a suitable diversity index help achieve better accuracy in fewer rounds? To answer this question, we considered data properties as the key driver of device selection, and we designed a diversity index that can be adapted to a wide variety of use-cases. Additionally, we integrated the diversity index into a novel scheduling strategy for wireless networks, where the completion time and the energy efficiency of the transmission are also of high importance. To this end, we derived the time and energy consumption models for FEEL based on device and channel properties. With these models, we formulated a joint selection and bandwidth allocation problem, aiming to minimize a multi-objective function of the completion time and the total transmission energy, while balancing these against the goal of maximizing the diversity of the selected devices. We proposed to solve this problem through an iterative algorithm that starts with the selection of the devices and then allocates the bandwidth. Through extensive evaluations, we demonstrated the importance of data properties in FEEL and the efficacy of the diversity index. Furthermore, we showed that our proposed scheduling algorithm can effectively reduce the number of rounds required to achieve high accuracy levels, especially for large models, which also results in savings in time and energy.

ACKNOWLEDGEMENT

The authors would like to thank the Natural Sciences and Engineering Research Council of Canada for the financial support of this research.

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," arXiv:1602.05629 [cs], 2016. [Online]. Available: http://arxiv.org/abs/1602.05629
[2] G. Zhu, Y. Wang, and K. Huang, "Broadband Analog Aggregation for Low-Latency Federated Edge Learning," IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 491–506, Jan. 2020.
[3] T. Nishio and R. Yonetani, "Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge," in ICC 2019, May 2019, pp. 1–7. [Online]. Available: http://arxiv.org/abs/1804.08333
[4] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated Learning over Wireless Networks: Optimization Model Design and Analysis," in IEEE INFOCOM 2019, Apr. 2019, pp. 1387–1395.
[5] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, "Energy-Efficient Radio Resource Allocation for Federated Edge Learning," arXiv:1907.06040 [cs, math], Jul. 2019. [Online]. Available: http://arxiv.org/abs/1907.06040
[6] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated Learning via Over-the-Air Computation," IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
[7] H. H. Yang, A. Arafa, T. Q. S. Quek, and H. Vincent Poor, "Age-Based Scheduling Policy for Federated Learning in Mobile Edge Networks," in ICASSP 2020, May 2020.
[8] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling Policies for Federated Learning in Wireless Networks," arXiv:1908.06287 [cs, eess, math], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1908.06287
[9] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar, "LEAF: A Benchmark for Federated Settings," arXiv:1812.01097 [cs, stat], Dec. 2019. [Online]. Available: http://arxiv.org/abs/1812.01097
[10] L. Shi and Y.-D. Shen, "Diversifying convex transductive experimental design for active learning," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), New York, NY, USA, Jul. 2016, pp. 1997–2003.
[11] X. You, R. Wang, and D. Tao, "Diverse Expected Gradient Active Learning for Relative Attributes," IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 3203–3217, Jul. 2014.
[12] A. Taïk and S. Cherkaoui, "Federated Edge Learning: Design Issues and Challenges," arXiv:2009.00081 [cs], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2009.00081
[13] A. Filali, A. Abouaomar, S. Cherkaoui, A. Kobbane, and M. Guizani, "Multi-Access Edge Computing: A Survey," IEEE Access, vol. 8, pp. 197017–197046, 2020.
[14] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training," arXiv:1712.01887 [cs, stat], Jun. 2020. [Online]. Available: http://arxiv.org/abs/1712.01887
[15] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated Learning: Strategies for Improving Communication Efficiency," arXiv:1610.05492 [cs], Oct. 2017. [Online]. Available: http://arxiv.org/abs/1610.05492
[16] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive Federated Learning in Resource Constrained Edge Computing Systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[17] C. Xie, S. Koyejo, and I. Gupta, "Asynchronous Federated Optimization," arXiv:1903.03934 [cs], Sep. 2019. [Online]. Available: http://arxiv.org/abs/1903.03934
[18] L. Wang, W. Wang, and B. Li, "CMFL: Mitigating Communication Overhead for Federated Learning," in IEEE ICDCS 2019, Dallas, TX, USA, Jul. 2019, pp. 954–964. [Online]. Available: https://ieeexplore.ieee.org/document/8885054/
[19] C. Zhang, H. Kjellstrom, and S. Mandt, "Determinantal Point Processes for Mini-Batch Diversification," arXiv:1705.00607 [cs, stat], Aug. 2017. [Online]. Available: http://arxiv.org/abs/1705.00607
[20] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. G. Hauptmann, "Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization," International Journal of Computer Vision, vol. 113, no. 2, pp. 113–127, Jun. 2015. [Online]. Available: http://link.springer.com/10.1007/s11263-014-0781-x
[21] A. Holub, P. Perona, and M. C. Burl, "Entropy-based active learning for object recognition," Jun. 2008, pp. 1–8.
[22] L. Jost, "Entropy and diversity," Oikos, vol. 113, no. 2, pp. 363–375, 2006.
[23] A. Delgado-Bonal and A. Marshak, "Approximate Entropy and Sample Entropy: A Comprehensive Tutorial," Entropy.
[24] Physica A: Statistical Mechanics and its Applications.
[25] Biomedical Signal Processing and Control.
[26] arXiv:1910.02214 [cs, math], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.02214
[27] P. M. Catt, "Forecastability: Insights from Physics, Graphical Decomposition, and Information Theory," Foresight: The International Journal of Applied Forecasting, no. 13, pp. 24–33, 2009. [Online]. Available: https://ideas.repec.org/a/for/ijafaa/y2009113p24-33.html
[28] L. Montesinos, R. Castaldo, and L. Pecchia, "On the use of approximate entropy and sample entropy with centre of pressure time-series," Journal of NeuroEngineering and Rehabilitation.