Battery-constrained Federated Edge Learning in UAV-enabled IoT for B5G/6G Networks
Shunpu Tang, Wenqi Zhou, Lunyuan Chen, Lijia Lai, Junjuan Xia and Lisheng Fan

S. Tang, W. Zhou, L. Chen, L. Lai, J. Xia and L. Fan are all with the School of Computer Science, Guangzhou University, Guangzhou, China (e-mail: {tangshunpu, 2112006156, 2112019037, 2112006122}@e.gzhu.edu.cn, [email protected], [email protected]).
Abstract—In this paper, we study how to optimize federated edge learning (FEEL) in UAV-enabled Internet of things (IoT) for B5G/6G networks from a deep reinforcement learning (DRL) approach. Federated learning is an effective framework to train a shared model among decentralized edge devices or servers without exchanging raw data, which helps protect data privacy. In UAV-enabled IoT networks, latency and energy consumption are two important metrics limiting the performance of FEEL. Although most existing works have studied how to reduce the latency and improve the energy efficiency, few works have investigated the impact of the devices' limited batteries on FEEL. Motivated by this, we study battery-constrained FEEL, where the UAVs can adjust their operating CPU frequency to prolong battery life and avoid withdrawing from the federated learning training prematurely. We optimize the system by jointly allocating the computational resource and wireless bandwidth in time-varying environments. To solve this optimization problem, we employ a deep deterministic policy gradient (DDPG) based strategy, where a linear combination of latency and energy consumption is used to evaluate the system cost. Simulation results demonstrate that the proposed strategy outperforms conventional ones. In particular, it enables all the devices to complete all rounds of FEEL with limited batteries while reducing the system cost effectively.
Index Terms—UAV, federated learning, latency, energy consumption, mobile edge computing.
I. INTRODUCTION

SWARMS of unmanned aerial vehicles (UAVs) play a non-negligible role in the Internet of things (IoT) for beyond the fifth-generation (B5G) and the forthcoming sixth-generation (6G) wireless mobile networks [1]–[3]. Due to their mobility and flexibility, UAVs can effectively collect data and communicate in edge computing-based IoT networks. Thanks to these advantages, UAVs have been widely used in many application scenarios, such as environmental monitoring, emergency communication, transportation control and remote sensing.

In recent years, deep learning has shown great success in speech recognition, computer vision and many other domains [4]. Especially in the era of big data, millions or billions of data samples are collected by various sensors and applied to train deep learning models. The authors in [5] proposed a large-scale hierarchical database for image classification and promoted the emergence of state-of-the-art (SOTA) CNN models [6]–[8]. Large amounts of high-quality data are the key
to improve the performance of the deep learning model. As a matter of fact, data collection faces a series of challenges. Though a lot of data are produced by heterogeneous devices, especially IoT devices and smartphones at the mobile edge, these data are fragmented and scattered across various devices. This brings a huge communication cost to centralize data on the server for model training when the traditional centralized training method of deep learning is adopted. Meanwhile, increasing concerns about privacy make data collection more difficult. People do not want their personal data (e.g., photos, voice, chat history) to be uploaded to others' servers. There are also companies, banks and hospitals with a lot of sensitive data which they are not allowed to divulge. Moreover, laws on privacy protection have been promulgated continuously. For these reasons, data are difficult to centralize for training deep learning models, a situation called the "data island" problem [9]. In this context, federated learning has been proposed to break the "data island" and make full use of each device's data to train a model with fine performance. In federated learning, all devices use local data to train a model and share the model, on the premise of safety, to aggregate a global model with federated optimization [10]–[12].

In edge computing-based IoT networks, in order to speed up the training and inference of deep learning, a new concept, "Edge AI", was presented to bring models closer to the places where data are generated, as the computational capabilities of edge devices continually grow [13]–[15]. Federated learning is also applied to train deep learning models cooperatively in MEC networks, which is referred to as federated edge learning (FEEL) [16], [17]. Google Inc. [18] used an edge server to build a keyboard input prediction model based on FEEL, and FEEL can also be applied to improve the performance of content caching without gathering users' data centrally for training [19], [20].

Although FEEL has been successfully applied to many application scenarios, there still exist some challenges. One major challenge is the high requirement on latency and energy consumption in UAV-enabled IoT networks, which is one of the important and critical issues. More importantly, UAVs are powered by limited batteries and can only work for a few dozen minutes. It is dangerous for UAVs to run out of power while they are working, and it is of vital importance to control the remaining power. On the other hand, enough participants are the guarantee of the performance of the FEEL training. With limited battery power, how to complete more training rounds of federated learning is a largely unconsidered issue, and it motivates us to control the batteries effectively to prolong their service life. Accordingly, in this paper we study battery-constrained FEEL, where the UAVs can adjust their operating CPU frequency to prolong battery life and avoid withdrawing from the federated learning training prematurely. We optimize the system by jointly allocating the computational resource and wireless bandwidth in time-varying environments.
To solve this optimization problem, we employ a deep deterministic policy gradient (DDPG) based strategy, where a linear combination of latency and energy consumption is used to evaluate the system cost. The main contributions of this work are summarized below.

• We study the resource allocation strategy for FEEL in complicated scenarios, where UAVs in edge computing-based IoT networks are powered by limited batteries. Moreover, the characteristics of each device, such as computational capability and channel condition, are different and time-varying.

• We propose a resource allocation strategy based on deep reinforcement learning for the FEEL optimization, where the total latency and energy consumption are minimized by adjusting the CPU frequency of the devices and the uplink wireless bandwidth allocated to each device. More importantly, the strategy can prolong the battery life of each device and enable the devices to complete the required rounds of federated learning.

• We conduct simulation experiments to evaluate the performance of the proposed resource allocation strategy and compare it with some conventional strategies to demonstrate the superiority of the proposed approach.

The rest of this paper is organized as follows. Existing and relevant works are introduced in Sec. II. Then, Sec. III describes the system model and formulates the optimization problem, and Sec. IV provides the DDPG-based allocation strategy involving the optimization of computational resource and wireless bandwidth in time-varying environments. After that, Sec. V gives some simulation results and discussions, and we finally conclude our work in Sec. VI.

II. RELATED WORKS
Mobile edge computing networks: With the explosive growth of the number of IoT devices in the 5G and 6G era, more and more data are generated at the mobile edge, and mobile edge computing (MEC) is used to process a large quantity of data with lower latency and energy consumption [21]. Many researchers focus on the offloading strategy in MEC to reduce the system cost; e.g., Zhao [22] optimized the computation offloading based on the discrete particle swarm algorithm, the authors in [23] studied multi-user computational offloading, and Li [24] proposed a DRL-based approach to solve the offloading problem for multiuser and multi-CAP MEC networks, among others [25]–[27].
FL in wireless networks: In MEC networks, there are many existing works on reducing the latency and energy consumption by allocating the system resources. However, federated learning is inherently distributed and heterogeneous, and it needs to synchronize all the participants with different channel conditions and computational capabilities [28]. Hence, another challenge is how to reduce the cost of FEEL. In this direction, Nishio and Yonetani [29] proposed a client selection scheme for federated learning in the mobile edge to accelerate training. In [30], [31], the authors studied federated learning in wireless networks and tried to allocate more bandwidth to nodes with poor channels or weak computational capabilities. Due to the lack of continuous connection among UAV swarms, a joint power allocation design was proposed to optimize the convergence of federated learning [32]. The authors in [33] presented worker-centric model selection for federated learning in MEC networks. Zhou [34] proposed a blockchain-based FL framework in the B5G network. So far, few works have considered that most IoT devices are powered by limited batteries. Zhan et al. [35] employed DRL to adjust the CPU frequency, making a good trade-off between the latency and energy consumption, but this work did not quantify the amount of battery power.

Fig. 1. System model of FEEL in the UAV-enabled IoT networks.

III. SYSTEM MODEL AND PROBLEM FORMULATION
A. Federated Learning
Federated learning is a distributed machine learning method to train a shared model while protecting personal privacy. It allows users to train on their datasets locally instead of uploading the sensitive data to a server. The server collects the information from different users and generates a global model. This process can be expressed as

$$\arg\min_{\omega} F(\omega) = \frac{1}{M} \sum_{m=1}^{M} F_m(\omega), \quad (1)$$

where $\omega$ is the model weight, $M$ is the number of users, and $F_m(\cdot)$ is the loss function of user $m$. In practice, FedAvg [10] is used to aggregate a global model while reducing the number of communication rounds. Firstly, all the devices train their models locally and update the weights as

$$\omega_{k+1}^m \leftarrow \omega_k^m - \alpha \nabla F_m(\omega_k^m), \quad (2)$$

where $\alpha$ is a positive learning rate, $\nabla(\cdot)$ denotes the gradient operation and $k$ is the round index. Each user executes (2) several times and then uploads the weights $\omega_{k+1}^m$ to the server. The server aggregates the collected weights and calculates the weighted average as

$$\omega_{k+1} \leftarrow \sum_{m=1}^{M} \frac{n_m}{n} \omega_{k+1}^m, \quad (3)$$

where $n_m / n$ is the proportion of user $m$'s samples in the total. This is called model averaging [11], [36]. Lastly, the server broadcasts the new global model to each user. A minimal code sketch of this aggregation is given below.
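To make the update (2) and the aggregation (3) concrete, the following is a minimal PyTorch sketch of one FedAvg round. The model, the client data loaders and the per-client sample counts `n_m` are hypothetical placeholders, not the implementation used in this paper.

```python
import torch

def local_update(model, loader, lr=0.01, epochs=5):
    """One client's local training: several SGD steps of (2)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def aggregate(global_model, client_states, n_m):
    """Weighted model averaging of (3): w <- sum_m (n_m / n) * w_m."""
    n = float(sum(n_m))
    avg = {
        key: sum((n_m[m] / n) * client_states[m][key].float()
                 for m in range(len(client_states)))
        for key in global_model.state_dict()
    }
    global_model.load_state_dict(avg)
    return global_model
```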
B. System Model

Fig. 1 shows the system model of the considered FEEL framework, where there are $M$ distributed users and a centralized server. In practice, the $M$ users can be various UAV devices in the edge computing-based IoT network, and the parameter server can be the base station or one of the UAVs. We use $\mathcal{M} = \{1, 2, \cdots, M\}$ to denote the user set. The number of data samples that user $m$ needs to train locally is $d_m$. We assume that a round of FL training can be completed in a time slot. For each device $m \in \mathcal{M}$ in the $k$-th round, the base CPU frequency is $f_m^k$ and it requires $c_m$ CPU cycles to process one sample of the local dataset. To deal with the problem of limited communication resources, each device trains $e$ times locally before uploading the weights [10]. The local training time of device $m$ at the $k$-th round can be written as

$$t_{m,local}^k = \frac{e c_m d_m}{\eta_m^k f_m^k}, \quad (4)$$

where $\eta_m^k$ is the coefficient of frequency adjustment which determines the practical operating CPU frequency, and $K$ is the maximum number of FL rounds. The weights are uploaded to the centralized server through the wireless link after each device completes its local training. The data rate of the wireless link from device $m$ to the centralized server is

$$r_m^k = B_m^k \log_2\left(1 + \frac{P_m |h_m^k|^2}{\sigma^2}\right), \quad (5)$$

where $h_m^k \sim \mathcal{CN}(0, \beta)$ is the channel parameter of the link from device $m$ to the centralized server, $\sigma^2$ is the variance of the additive white Gaussian noise (AWGN) at the server, and $P_m$ is the transmit power of device $m$. Notation $B_m^k$ is the bandwidth of device $m$, which is allocated by the base station. At each round, $B_m^k$ should meet the following requirement:

$$\sum_{m=1}^{M} B_m^k = B_{total}. \quad (6)$$

According to (5), the transmission latency can be calculated as

$$t_{m,up}^k = \frac{\epsilon}{r_m^k}, \quad (7)$$

where $\epsilon$ is the size of the deep learning model's weights. The total time for device $m$ to provide a local model can be expressed as

$$T_m^k = t_{m,local}^k + t_{m,up}^k. \quad (8)$$

Each device trains its local model and uploads it in parallel. The centralized server has to wait for the local model from the slowest device, which can be viewed as the bottleneck of the system, to aggregate a new global model. So the system latency at the $k$-th round is

$$T^k = \max_{m \in \mathcal{M}} T_m^k. \quad (9)$$

Similarly, we can calculate the energy consumption in two stages. In the first stage, device $m$ performs algorithms like back propagation (BP) to update the weights at the cost of huge energy consumption. According to [2], the training energy consumption can be written as

$$E_{m,local}^k = \zeta_m c_m d_m (\eta_m^k f_m^k)^2, \quad (10)$$

where $\zeta_m$ is the energy consumption coefficient of the CPU chip. Then device $m$ uploads the weights with transmit power $P_m$, and the energy consumption of transmission can be expressed as

$$E_{m,up}^k = P_m t_{m,up}^k. \quad (11)$$

So the total energy consumption of device $m$ at the $k$-th round is

$$E_m^k = E_{m,local}^k + E_{m,up}^k. \quad (12)$$

Under ideal conditions, devices can continuously participate in FedAvg [10] and contribute their local models until achieving the best performance. However, in the case of limited resources, we have to consider that the battery of a device may run out. The energy model should meet the following requirement:

$$\sum_{k=1}^{K} E_m^k \leq \delta_m, \quad (13)$$

where $\delta_m$ is the total battery power of device $m$. A numerical sketch of the per-round latency and energy model in (4)–(12) is given below.
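As a quick numerical illustration of (4)–(12), the snippet below computes one device's round latency and energy. All constants are arbitrary example values, not the simulation settings of Sec. V.

```python
import math

def round_cost(eta, f, B, *, e=5, c=1e4, d=500, P=0.1,
               h2=1.0, sigma2=1e-8, eps=8e6, zeta=1e-27):
    """Per-round latency/energy of one device, following (4)-(12).
    eta: CPU-frequency adjustment coefficient in (0, 1].
    f:   base CPU frequency (cycles/s); B: allocated bandwidth (Hz).
    The keyword arguments are illustrative constants (cycles/sample,
    samples, transmit power, channel gain, noise variance, model size
    in bits, chip energy coefficient)."""
    t_local = e * c * d / (eta * f)                   # eq. (4)
    r = B * math.log2(1.0 + P * h2 / sigma2)          # eq. (5)
    t_up = eps / r                                    # eq. (7)
    E_local = zeta * c * d * (eta * f) ** 2           # eq. (10)
    E_up = P * t_up                                   # eq. (11)
    return t_local + t_up, E_local + E_up             # eqs. (8), (12)

T_m, E_m = round_cost(eta=0.6, f=1.5e9, B=1e6)
```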
To describe the total system cost, we use a linear combination of latency and energy consumption as follows [24], [35]:

$$\Phi = \sum_{k=1}^{K} \left[ \lambda T^k + (1-\lambda) \sum_{m=1}^{M} E_m^k \right], \quad (14)$$

where $\lambda \in [0, 1]$ is a factor describing the relative importance of latency and energy consumption in the system cost. We can adjust $\lambda$ to trade off the latency and energy consumption. Specifically, the linear combination approaches the energy consumption when $\lambda$ goes to zero, while it degenerates into the latency when $\lambda$ is near one. In fact, the linear combination is a standard method to solve multi-objective programming problems: by giving each objective a proper weight coefficient according to its importance, the problem is changed into a single-objective programming one. A short sketch of this scalarization is given below.
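For instance, a per-round version of the scalarization in (9) and (14), reusing the illustrative `round_cost` helper above:

```python
def system_cost(per_device, lam=0.5):
    """One round of (14): lam * T^k + (1 - lam) * sum_m E_m^k,
    where per_device is a list of (T_m, E_m) pairs."""
    T_k = max(t for t, _ in per_device)   # eq. (9): slowest device
    E_k = sum(e for _, e in per_device)
    return lam * T_k + (1.0 - lam) * E_k

cost = system_cost([round_cost(0.6, 1.5e9, 1e6),
                    round_cost(0.8, 2.0e9, 1e6)], lam=0.5)
```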
C. Problem Formulation

For practical battery-constrained devices, we expect that they can participate in as many rounds of FEEL training as possible while avoiding running out of power. Moreover, the system cost measured by the latency and energy consumption should be minimized, which is particularly important for IoT devices. Taking these factors into account, we can optimize the training rounds and system cost by adjusting the CPU frequency and bandwidth. The system optimization problem can be given by

$$\max_{\{\eta_m^k, B_m^k\}} \; K - \bar{\Phi} \quad \mathrm{s.t.} \quad C_1: \eta_m^k \in [0, 1], \;\; C_2: \sum_{m=1}^{M} B_m^k = B_{total}, \;\; C_3: \sum_{k=1}^{K} E_m^k \leq \delta_m, \quad (15)$$

where $\bar{\Phi} = \Phi / K$ is the average cost per round. It is generally very hard to solve the above optimization problem with conventional optimization methods such as convex optimization. Especially in a time-varying environment, the base CPU frequency and channel condition are different in each round, causing heterogeneity in the training process. For these reasons, a learning-based algorithm should be developed to adapt to different states in order to find a proper solution. The notations used in this section are summarized in Table I.
TABLE I
SYMBOL NOTATIONS

Notation | Definition
$\alpha$ | learning rate of FL
$\omega_k^m$ | model weight of the $m$-th user in the $k$-th round
$\epsilon$ | size of the deep learning model weights
$\zeta_m$ | energy consumption coefficient of the $m$-th user
$\Phi$ | system cost
$r_m^k$ | transmission rate between the $m$-th user and the server
$e$ | number of local training passes per round
$c_m$ | CPU cycles to process a sample of the $m$-th user
$d_m$ | number of data samples of the $m$-th user
$\delta_m$ | total battery power of device $m$
$\eta_m^k$ | coefficient of frequency adjustment of the $m$-th user in the $k$-th round
$f_m^k$ | base CPU frequency of the $m$-th user in the $k$-th round
$t_{m,local}^k$ | local latency of the $m$-th user in the $k$-th round
$t_{m,up}^k$ | transmission latency of the $m$-th user in the $k$-th round
$B_m^k$ | wireless bandwidth of the $m$-th user in the $k$-th round
$T^k$ | system latency in the $k$-th round
$E_{m,local}^k$ | local energy consumption of the $m$-th user in the $k$-th round
$E_m^k$ | total energy consumption of the $m$-th user in the $k$-th round
$F(\cdot)$ | loss function of FL
IV. DDPG-BASED ALLOCATION STRATEGY

In this section, we solve the system optimization problem in (15) by using the DDPG-based allocation strategy. Specifically, we first describe the Markov decision process (MDP), and then introduce how to implement the DDPG-based resource allocation strategy.
Fig. 2. Actor-critic framework.
A. Markov Decision Process
The MDP is used to model decision-making in time-varying environments, and it mainly consists of a 4-tuple $\{S, A, P_a, R_a\}$. Specifically, $S = \{k, O^k, F^k, H^k\}$ is the state space, where $O^k$ is a vector of the remaining battery power, $F^k$ is a vector of the current base CPU frequencies of all the devices, and $H^k$ contains the channel parameters of the $k$-th round, which can be obtained by some channel estimation method. We use $A = \{N^k, B^k\}$ to denote the action space, where $N^k = \{\eta_1^k, \eta_2^k, \cdots, \eta_M^k\}$ is the vector of CPU-frequency adjustment coefficients and $B^k = \{B_1^k, B_2^k, \cdots, B_M^k\}$ is the vector of allocated bandwidth.

At the $k$-th round of the FEEL training, the current state is $s_k \in S$ and the agent performs an action $a_k \in A$ in the environment. From the feedback of the environment, $s_k$ transits to $s_{k+1}$ with a conditional probability $P$. Meanwhile, the agent receives an instant reward from the environment, which can be expressed as

$$R_k = k - \phi, \quad (16)$$

where $k$ is a positive feedback term and $\phi$ is the instant cost of a round of FEEL training, denoted by the linear combination of energy consumption and latency. The agent aims to achieve a long-term average reward from the environment by maximizing $K$ and minimizing $\Phi$. However, it is difficult to estimate the conditional probability $P$ in many application scenarios. Hence, we turn to the DDPG algorithm to solve this problem. A minimal sketch of these MDP quantities is given below.
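For concreteness, one possible way to lay out the state, action and reward of the MDP in code; the field names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    k: int                 # current round index
    battery: np.ndarray    # O^k: remaining battery power per device
    base_freq: np.ndarray  # F^k: base CPU frequency per device
    channel: np.ndarray    # H^k: estimated channel gains

@dataclass
class Action:
    eta: np.ndarray        # N^k: frequency coefficients in [0, 1]
    bandwidth: np.ndarray  # B^k: per-device bandwidth, sums to B_total

def reward(k, phi):
    """Instant reward of (16): positive round feedback minus round cost."""
    return k - phi
```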
In the considered FEEL scenario, the CPU frequency and bandwidth allocation vary over a continuous range with infinite possibilities. Although a deep Q-learning network (DQN) [37], [38] can deal with discrete and low-dimensional action spaces, it cannot work efficiently with a continuous action space, since its performance severely relies on finding the maximum of the value function approximated by the neural network in each iteration. For this reason, we turn to the DDPG strategy to solve the optimization problem of CPU-frequency and bandwidth allocation. The deterministic policy gradient [39] was proposed to output a deterministic action, instead of the values of all the actions, under the actor-critic framework. Fig. 2 shows the actor-critic framework in DDPG, where a deep neural network called the actor network is used to produce a deterministic action, while a critic network is employed to approximate the value function which evaluates the action. To formulate the DDPG strategy, we use $\mu(s|\theta^\mu)$ and $Q(s, a|\theta^Q)$ to denote the actor and critic network, respectively. For a deterministic action $a_t$ in the state $s_t$, the Bellman equation for the action-state value function can be written as

$$Q(s, a) = \mathbb{E}_{s_{t+1} \sim S}\left[r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1})\right], \quad (17)$$

where $s_{t+1}$ is the next state when the agent executes the action $a_t$ in the state $s_t$, and $a_{t+1}$ is the next action given by the actor network.

Fig. 3. Implementation structure of DDPG.

Motivated by the idea of the double network in DQN, target networks and main networks are used in DDPG. Fig. 3 shows the implementation structure of DDPG, where four neural networks work together:

• Main actor network $\mu$: We get the deterministic action from $a_t = \mu(s|\theta^\mu)$, and $\mu$ is used to update the weights of the target actor network.

• Target actor network $\mu'$: The target actor network predicts the next action by $a_{t+1} = \mu'(s_{t+1}|\theta^{\mu'})$.

• Main critic network $Q$: According to the current state $s_t$ and the action given by the main actor network $\mu$, the main critic network $Q$ is responsible for giving the Q-value $Q(s, a|\theta^Q)$.

• Target critic network $Q'$: In the state $s_{t+1}$, the Q-value is evaluated by $Q'(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'})$.

The approach of target networks reduces the correlation between the current Q-value and the target Q-value, which helps increase the robustness of training. For the main critic network, the loss function can be expressed as

$$L(\theta^Q) = \mathbb{E}_{\mu'}\left[\left(Q(s, a|\theta^Q) - y_i\right)^2\right], \quad (18)$$

where

$$y_i = r(s_t, a_t) + \gamma Q'(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'}), \quad (19)$$

in which $\gamma$ is a positive discount factor of the reward. By executing the gradient descent method and minimizing the loss function in (18), we can update the main critic network. Additionally, to optimize action decisions and achieve higher rewards, the Q-value is considered as the loss function of the actor network. A sketch of the critic update in (18)–(19) is given below.
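As an illustration of (17)–(19), a hedged PyTorch sketch of one critic update step; `actor_t`, `critic`, `critic_t`, the optimizer and the replay batch are hypothetical stand-ins for the target/main networks and sampled transitions.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, critic_t, actor_t, opt, batch, gamma=0.99):
    """One critic step: build the target of (19), minimize the loss of (18)."""
    s, a, r, s_next = batch  # tensors sampled from the replay memory
    with torch.no_grad():
        a_next = actor_t(s_next)                   # a_{t+1} = mu'(s_{t+1})
        y = r + gamma * critic_t(s_next, a_next)   # target Q-value, eq. (19)
    q = critic(s, a)                               # Q(s, a | theta^Q)
    loss = F.mse_loss(q, y)                        # loss of eq. (18)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```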
Algorithm 1 DDPG-based resource allocation strategy

Input: current state $S = \{k, O^k, F^k, H^k\}$
Output: allocation action $A = \{N^k, B^k\}$
1: Initialize the target networks $Q'$ and $\mu'$ with $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$ in the server.
2: Initialize the experience replay memory $D$.
3: for episode = 1 to M do
4:   Initialize a random process $\mathcal{N}$ for exploration.
5:   Initialize the state $s$.
6:   while $s$ is not a final state do
7:     Devices upload their base frequencies to the server, and the server estimates the channel parameters.
8:     The server chooses an action $a_k = \mu(s_k|\theta^\mu) + \mathcal{N}_k$.
9:     Execute the action $a_k$ and observe the reward $r_k$ and the next state $s_{k+1}$.
10:    Store $(s_k, a_k, r_k, s_{k+1})$ in $D$.
11:    Randomly sample a mini-batch of transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$.
12:    $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$.
13:    Update the critic network by minimizing $L = \frac{1}{N}\sum_i \left(Q(s_i, a_i|\theta^Q) - y_i\right)^2$.
14:    Update the actor network by the gradient ascent in (20).
15:    Update the target networks by the soft updates in (21) and (22).
16:  end while
17: end for

The main actor network is updated by gradient ascent with

$$\nabla_{\theta^\mu} = \mathbb{E}_{\mu'}\left[\nabla_a Q(s, a|\theta^Q)|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)|_{s_t}\right]. \quad (20)$$

Different from DQN, which copies the weights from the main networks at certain intervals, the target networks in DDPG adopt a soft update in each step as

$$\theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\theta^{Q'}, \quad (21)$$

$$\theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\theta^{\mu'}, \quad (22)$$

where $\tau$ is an update factor. In addition, an experience replay unit (ERU) is used to collect training samples, and the agent randomly chooses batches of samples during the training process. This helps break the correlation and non-stationary distribution among training samples and improves the performance. A one-line sketch of the soft update in (21)–(22) follows.
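A minimal sketch of the soft update (21)–(22), assuming the main and target networks are PyTorch modules with matching parameter order:

```python
def soft_update(main_net, target_net, tau=1e-3):
    """theta' <- tau * theta + (1 - tau) * theta', as in (21)-(22)."""
    for p, p_t in zip(main_net.parameters(), target_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

With $\tau \ll 1$, the target networks track the main networks slowly, which is what stabilizes the bootstrapped target in (19).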
C. DDPG-based resource allocation strategy

In this part, we introduce the proposed DDPG-based resource allocation strategy. Firstly, we design a multi-task neural network which outputs the results of $N^k$ and $B^k$ simultaneously. In order to ensure that the outputs lie in the correct ranges, the Sigmoid and Softmax activation functions are used in the last layer of the network. Before each round of FEEL training, clients need to report their available resources, such as the base CPU frequency and the remaining battery power, to the server. At the same time, the server estimates the channel parameters of the links from it to the users. This information is fed as the input of the main actor network to get a deterministic action of adjusting the practical CPU frequency and bandwidth allocation. The base station then informs the participants of the operating CPU frequency and the wireless bandwidth available to them. The agent receives different rewards in each round according to the latency, energy consumption and the current number of rounds. When the experience replay unit has enough samples, the agent adjusts its strategy in the way introduced in the previous part. In order to guarantee the global convergence, we reset the environment and re-initialize the state $s$ to train the agent when the agent reaches the final state. After a number of episodes, upon convergence, the DDPG strategy can find a proper result of $\eta_m^k$ and $B_m^k$. In this way, the optimization problem (15) is solved. The whole procedure of the DDPG strategy is summarized in Algorithm 1. A sketch of the two-headed actor network described above is given below.
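A possible shape for this multi-task actor, as a hedged PyTorch sketch; the hidden sizes follow the simulation setup in Sec. V (two hidden layers with 64 and 256 nodes), while the rest is an assumption.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Two-headed actor: Sigmoid head for eta in [0, 1],
    Softmax head for bandwidth shares summing to 1."""
    def __init__(self, state_dim, n_devices):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.eta_head = nn.Linear(256, n_devices)
        self.bw_head = nn.Linear(256, n_devices)

    def forward(self, s):
        z = self.body(s)
        eta = torch.sigmoid(self.eta_head(z))        # N^k
        bw = torch.softmax(self.bw_head(z), dim=-1)  # B^k / B_total
        return eta, bw
```

The Softmax head outputs bandwidth shares, so multiplying them by $B_{total}$ automatically satisfies constraint $C_2$ in (15).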
V. SIMULATIONS AND DISCUSSIONS

In this section, we demonstrate the performance of the proposed DDPG-based resource allocation strategy by simulations. To simulate practical scenarios of FEEL, we suppose that there are 20 users whose limited battery power is subject to U(2 × , × ) and whose initial base CPU frequencies are subject to U(1 × , × ). The users train locally 5 times per round and participate in at most 1000 rounds of the FEEL training. In addition, the number of CPU cycles to process a sample is set subject to U(7 × , × ), the sizes of the users' datasets are subject to U(400, ), and the energy consumption coefficients of the CPU chips are set subject to U(1 × − , × − ). Moreover, the transmit power of all the devices is set to × − , the total wireless bandwidth is MHz, and the model size is set to MB. We assume that the conditions are static within a round, while they are time-varying across different rounds of training. To simulate the time-varying environments, we set the initial base CPU frequency varying from . to . times. Similarly, the average channel gain of the wireless links is set to unity, and the variance of the AWGN is set to − .

As to the DDPG network, we implement it with the well-known PyTorch library [40]. The actor networks consist of 2 hidden layers with 64 and 256 nodes, respectively. In the critic networks, there are 2 hidden layers with 30 nodes per layer. To enhance the fitting ability of the networks, the rectified linear unit (ReLU) [41] is used as the activation function. The Adam [42] method is used to optimize the loss function of the networks, and the learning rates are − and − in the main actor and main critic networks, respectively. We set the capacity of the ERU to and the batch size to . Besides, the value of γ is . , and τ is set to − . The DDPG agent trains times at the end of FEEL in each episode. The total number of DRL episodes is 800, and we repeat the experiments to reduce accidental error.

Fig. 4. Convergence of the proposed DDPG approach with λ = 0. .

Fig. 4 shows the training process of the proposed DDPG-based resource allocation strategy, where λ = 0. and there are 800 episodes in all. We focus on the total reward in each episode. From this figure, we can find that the curve of the total reward grows very fast in the first 200 episodes and then continues to increase with a slower growth tendency. After about 500 episodes, the proposed DDPG approach achieves an asymptotic reward value of about 4000. These results help verify the proposed DDPG-based strategy.

Fig. 5. Convergence of the maximum round.

Fig. 5 depicts the maximum round of the proposed strategy versus episodes, where λ = 0. and there are 800 episodes in all. From this figure, we can observe that at the beginning of the episodes, the battery-constrained devices easily run out of power and can only participate in about 500 rounds. As the episode index increases, the devices can participate in more rounds of training. This is because, with the learning of the DDPG agent, the proposed resource allocation strategy continues to improve and all the devices can prolong the life of their batteries. When the curve has converged, all the devices can complete 1000 rounds of training. This not only ensures the performance of training, but also avoids the danger caused by shutdown.
Fig. 6. Maximum rounds versus the static adjustment factor of frequency.

Fig. 6 shows the maximum round of the two strategies versus the static adjustment factor of frequency, where λ = 0. and the adjustment factor varies from 0.1 to 1. For comparison, we provide the performance of the static strategy, where the coefficient of frequency adjustment is fixed at each FL training round and the bandwidth allocation is even, with $B_m^k = B_{total}/M$ for each device. We compare against the static strategy because, in a practical time-varying scenario, it is difficult to determine how to set the proper operating CPU frequency of the devices, and the devices should run at a proportion of the base CPU frequency in order to reduce the energy consumption. From this figure, we can see that the proposed strategy remains unchanged with the adjustment factor, and it enables all the devices to complete 1000 rounds of training. In contrast, the static strategy can only obtain about 200 rounds of training when the adjustment factor is around 1, since all the devices operate at the base CPU frequency with a high energy consumption. When the adjustment factor decreases, the static strategy enables all devices to reduce their energy consumption and take part in more rounds of training. In particular, the static approach can achieve 1000 rounds of training when the adjustment factor is smaller than 0.3. The comparison results in this figure further verify the effectiveness of the proposed DDPG strategy.

Fig. 7. Energy consumption versus the static adjustment factor of frequency.

Fig. 7 shows the energy consumption versus the static adjustment factor of frequency, where λ = 0. . The energy consumption of the proposed strategy remains almost constant, and it is a little higher than that of the static approach when the adjustment factor is small. With the increase of the static adjustment factor, the energy consumption of the static approach grows explosively. This causes the battery power of the devices to be easily exhausted and influences the continuous training of FL.

Fig. 8. Latency versus the static adjustment factor of frequency.

Fig. 8 describes the latency versus the static adjustment factor of frequency, where λ = 0. . From this figure, we can observe that the latency of the static strategy is more than 200 when the adjustment factor is small, while the latency of the proposed strategy is only about 96. Hence, it is unacceptable to save energy by decreasing the adjustment factor too far. With the increase of the adjustment factor, the latency of the static strategy comes down. Moreover, the latency of the proposed strategy is almost identical to that of the static strategy at the largest adjustment factors. By combining the results in Figs. 7-8, we can conclude that the proposed strategy makes a good trade-off between the energy consumption and latency efficiently. This further verifies the effectiveness of the proposed DDPG strategy.

Fig. 9. Average system cost versus the static adjustment factor of frequency.

Fig. 9 shows the system cost versus the static adjustment factor of frequency. To describe it clearly, we use the linear combination to express the system cost, where λ = 0. . We can find from this figure that the system cost of the proposed DDPG strategy remains unchanged with the adjustment factor, which is similar to the phenomena in Figs. 6-8. In contrast, the static approach is affected by the adjustment factor significantly. Specifically, the system cost of the static approach becomes smaller as the adjustment factor increases in the low region of the adjustment factor. This is due to the energy saving, which however leads to a huge latency. On the contrary, the system cost of the static approach becomes larger as the adjustment factor increases in the high region of the adjustment factor. This is because the pursuit of reducing latency leads to a great energy consumption. In particular, the static approach achieves its minimum system cost when the adjustment factor is 0.3, which is still higher than that of the proposed DDPG strategy.

Fig. 10. Average system cost versus the number of users.

Fig. 10 depicts the system cost of several strategies versus the number of users, where λ = 0. . For comparison, we plot the performance of the proposed DDPG strategy with even bandwidth allocation, denoted by E-DDPG. From this figure, we can find that the proposed DDPG outperforms E-DDPG, since the former can exploit the wireless bandwidth resources in the learning process, which helps improve the system performance. Moreover, E-DDPG is much better than the static approach, since the latter cannot utilize the system communication and computational resources efficiently. Furthermore, the system cost of all three strategies increases with a larger number of users, as more users impose a heavier burden on the training process.

Fig. 11. Average system cost versus different bandwidth.

In Fig. 11, we plot the performance of several resource allocation strategies versus the wireless bandwidth, where λ = 0. and the wireless bandwidth varies from 1 MHz to 9 MHz. From this figure, we can see that the performance of the three strategies becomes better when the bandwidth increases, since a larger bandwidth helps reduce the transmission latency as well as the transmission energy consumption. Moreover, the proposed strategy outperforms the static approach and E-DDPG for various values of wireless bandwidth, since it can exploit the system communication and computational resources efficiently. In particular, when the total bandwidth is 1 MHz, the costs of the proposed DDPG strategy, E-DDPG and the static approach are about 130, 140, and 170, respectively. Furthermore, when the bandwidth increases, the transmission latency becomes negligible in the system cost for the proposed DDPG and E-DDPG strategies, which makes the performance gap between these two strategies decrease. The results in this figure further demonstrate the merits of the proposed DDPG strategy.

VI. CONCLUSIONS
This paper studied how to optimize FEEL in UAV-enabled IoT networks where the UAVs have limited batteries, from a deep reinforcement learning approach. Specifically, we provided an optimization framework in which the devices can adjust their operating CPU frequency to prolong battery life and avoid withdrawing from the federated learning training prematurely, by jointly allocating the computational resource and wireless bandwidth in time-varying environments. To solve this optimization problem, we employed a DDPG-based strategy, where a linear combination of latency and energy consumption was used to evaluate the system cost. Simulation results demonstrated that the proposed strategy can efficiently prevent devices from withdrawing from the FL training prematurely and meanwhile reduce the average system cost.
REFERENCES

[1] M. Mozaffari, A. T. Z. Kasgari, W. Saad, M. Bennis, and M. Debbah, "Beyond 5G with UAVs: Foundations of a 3D wireless cellular network," IEEE Transactions on Wireless Communications, vol. 18, no. 1, pp. 357–372, 2019.
[2] X. Li, Q. Wang, Y. Liu, T. A. Tsiftsis, Z. Ding, and A. Nallanathan, "UAV-aided multi-way NOMA networks with residual hardware impairments," IEEE Wireless Communications Letters, vol. 9, no. 9, pp. 1538–1542, 2020.
[3] M. Mozaffari, W. Saad, M. Bennis, Y. Nam, and M. Debbah, "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2334–2360, 2019.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1106–1114.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[9] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.
[10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 54, 2017, pp. 1273–1282.
[11] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[12] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[13] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[14] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications (MECOMM@SIGCOMM). ACM, 2018, pp. 31–36.
[15] E. Li, L. Zeng, Z. Zhou, and X. Chen, "Edge AI: On-demand accelerating deep neural network inference via edge computing," IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 447–457, 2020.
[16] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y. Liang, Q. Yang, D. Niyato, and C. Miao, "Federated learning in mobile edge networks: A comprehensive survey," IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
[17] Y. Guo, F. Liu, Z. Cai, L. Chen, and N. Xiao, "FEEL: A federated edge learning system for efficient and privacy-preserving mobile healthcare," 2020, pp. 1–11.
[18] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays, "Applied federated learning: Improving Google keyboard query suggestions," arXiv preprint arXiv:1812.02903, 2018.
[19] Z. Yu, J. Hu, G. Min, H. Lu, Z. Zhao, H. Wang, and N. Georgalas, "Federated learning based proactive content caching in edge computing," in IEEE Global Communications Conference (GLOBECOM), 2018, pp. 1–6.
[20] L. Cui, X. Su, Z. Ming, Z. Chen, S. Yang, Y. Zhou, and W. Xiao, "CREAT: Blockchain-assisted compression algorithm of federated learning for content caching in edge computing," IEEE Internet of Things Journal, pp. 1–1, 2020.
[21] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
[22] Z. Zhao, R. Zhao, J. Xia, X. Lei, D. Li, C. Yuen, and L. Fan, "A novel framework of three-hierarchical offloading optimization for MEC in industrial IoT networks," IEEE Transactions on Industrial Informatics, vol. 16, no. 8, pp. 5424–5434, 2019.
[23] Z. Liang, Y. Liu, T.-M. Lok, and K. Huang, "Multiuser computation offloading and downloading for edge computing with virtualization," IEEE Transactions on Wireless Communications, vol. 18, no. 9, pp. 4298–4311, 2019.
[24] C. Li, J. Xia, Y. Rao, F. Liu, L. Fan, G. K. Karagiannidis, and A. Nallanathan, "Dynamic offloading for multiuser multi-CAP MEC networks: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 72, no. 3, pp. 3424–3438, 2020.
[25] J. Feng, F. R. Yu, Q. Pei, J. Du, and L. Zhu, "Joint optimization of radio and computational resources allocation in blockchain-enabled mobile edge computing systems," IEEE Transactions on Wireless Communications, vol. 19, no. 6, pp. 4321–4334, 2020.
[26] J. Feng, Q. Pei, F. R. Yu, X. Chu, J. Du, and L. Zhu, "Dynamic network slicing and resource allocation in mobile edge computing systems," IEEE Transactions on Vehicular Technology, vol. 69, no. 7, pp. 7863–7878, 2020.
[27] Y. Wang, X. Tao, Y. T. Hou, and P. Zhang, "Effective capacity-based resource allocation in mobile edge computing with two-stage tandem queues," IEEE Transactions on Communications, vol. 67, no. 9, pp. 6221–6233, 2019.
[28] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, 2019.
[29] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in IEEE International Conference on Communications (ICC), 2019, pp. 1–7.
[30] W. Shi, S. Zhou, and Z. Niu, "Device scheduling with fast convergence for wireless federated learning," in IEEE International Conference on Communications (ICC), 2020, pp. 1–6.
[31] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, "Joint device scheduling and resource allocation for latency constrained wireless federated learning," IEEE Transactions on Wireless Communications, 2020.
[32] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, "Federated learning in the sky: Joint power allocation and scheduling with UAV swarms," in IEEE International Conference on Communications (ICC): Next-Generation Networking and Internet Symposium, 2020.
[33] H. Huang and Y. Yang, "WorkerFirst: Worker-centric model selection for federated learning in mobile edge computing," 2020, pp. 1039–1044.
[34] S. Zhou, H. Huang, W. Chen, P. Zhou, Z. Zheng, and S. Guo, "PIRATE: A blockchain-based secure framework of distributed machine learning in 5G networks," IEEE Network, vol. 34, no. 6, pp. 84–91, 2020.
[35] Y. Zhan, P. Li, and S. Guo, "Experience-driven computational resource allocation of federated learning by deep reinforcement learning," in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 234–243.
[36] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[38] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[39] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (ICML), vol. 32, 2014, pp. 387–395.
[40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[41] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 315–323.
[42] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.