Artificial Intelligence for UAV-enabled Wireless Networks: A Survey
Mohamed-Amine Lahmeri, Mustafa A. Kishk, and Mohamed-Slim Alouini
Abstract—Unmanned aerial vehicles (UAVs) are considered one of the promising technologies for next-generation wireless communication networks. Their mobility and their ability to establish line-of-sight (LOS) links with users make them key solutions for many potential applications. In the same vein, artificial intelligence is growing rapidly nowadays and has been very successful, particularly due to the massive amount of available data. As a result, a significant part of the research community has started to integrate intelligence at the core of UAV networks by applying machine learning (ML) algorithms to solve several problems in relation to drones. In this article, we provide a comprehensive overview of some potential applications of ML in UAV-based networks. We also highlight the limits of the existing works and outline some potential future applications of ML for UAV networks.
Index Terms—Machine learning, deep learning, reinforcement learning, AI, UAVs.
I. INTRODUCTION
Unmanned aerial vehicles, known as UAVs, have attracted a lot of research interest in the last decades due to many inherent attributes such as their mobility, their easy deployment, and their ability to establish line-of-sight (LOS) links with users [1]–[4]. In general, UAVs can be classified into two main types, namely fixed-wing and rotary-wing UAVs. Each type of UAV is adapted to a specific type of application. For example, fixed-wing UAVs are more appropriate for missions where stationarity is not required, e.g., military applications such as attack and surveillance. Rotary-wing UAVs, on the other hand, have more complex aerodynamics. They have the ability to remain stationary at a given location, but they cannot carry out long-range missions. For example, rotary-wing UAVs are better suited to provide temporary wireless coverage to ground users.

Moreover, the involvement of many industries in the manufacture of UAVs has helped to reduce the cost of UAVs on the market, making the use of a UAV network no longer a dream or a futuristic idea. In fact, UAVs have been used in many scenarios ranging from providing wireless connectivity to weather forecasting, disaster management, farming, delivery, and traffic control [5]–[10].

The main limitations of UAVs are their constrained battery and low computational capabilities [11]–[16]. In fact, most commercially available UAVs struggle to hover for more than two hours and must always return to base to recharge their batteries. Added to that, complex algorithms cannot run onboard due to the limited computational capacity; consequently, classical solutions and algorithms do not necessarily fit every UAV-related problem.
Mohamed-Amine Lahmeri, Mustafa A. Kishk, and Mohamed-Slim Alouini are with King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia (e-mail: [email protected]; [email protected]; [email protected]).
In another context, machine learning (ML) has emerged in recent years as a sub-field of artificial intelligence. Its use has become prevalent in scientific research, offering a new style usually referred to as the black-box technique, where one only cares about inputs and outputs, and hence the traditional statistical or purely mathematical techniques are no longer used. Furthermore, the huge amount of data available nowadays across the internet and the existence of high-performance computing (HPC) and powerful GPUs helped ML to see the light. As a result, ML is being actively used today in many fields, perhaps more than one would expect.

We can also notice the emergence of several sub-fields of ML such as deep learning (DL), reinforcement learning (RL), and federated learning (FL), each for a specific type of problem. For example, DL is a branch of ML that uses layers of artificial perceptrons to imitate human thinking. It is massively used in speech recognition, computer vision, and natural language processing. RL, in contrast, is a branch of ML that appeared around 1979 [17], wherein an agent learns how to take good actions in order to achieve maximum rewards. The learning process is achieved by exploitation and exploration of the different available states. RL is an active field of ML that has evolved and matured very quickly. Unlike DL, RL is massively used in robotics for path planning and for learning how to do complex tasks. This does not mean that it is limited to robotics; it is also used in many other decision-making problems that consider a goal-oriented agent interacting with a specific environment. Another new field of ML is FL, proposed by Google in 2016 and designed to support network systems with decentralized data. FL is considered an ML setting whose objective is to train a highly centralized model on devices sharing decentralized data, without the need to send the data to a shared local unit.
In other words, FL is used to run ML algorithms over a decentralized data architecture. This task is performed in a secure manner, and it is widely used in many types of networks (e.g., UAV networks or mobile networks).

In short, ML is one of the trending areas that brings intelligence to machines and makes them able to perform tasks even better than a human can. This is why we believe that combining the advantages of ML with UAV networks is a challenging and interesting idea at the same time. In the same vein, UAV-based applications can be improved by integrating ML at the core of UAV networks. For instance, and as we already mentioned, UAV batteries are limited, and ML can therefore play an important role in resource management for the UAV so that its performance gets optimized [18]. Moreover, the design of UAV trajectories and deployment is also subject to ML improvement by equipping the UAV with the ability to design its trajectory automatically. Imaging can also be improved for UAVs by applying the existing state of the art in computer vision to UAV imaging. A wide range of applications can be improved in this context, such as surveillance, traffic management, and landing-site detection.

To conclude, the performance of UAV-based networks can be highly improved with the integration of ML algorithms in order to automate complex tasks and enhance the overall system level of intelligence.

A. Previous survey and tutorial works:
With the vast amount of published work linking machine learning to UAV wireless networks, several tutorials and surveys have attempted to summarize the existing literature. The authors in [4] provided a tutorial on UAV-based wireless communication systems by covering potential applications and challenges and describing the open problems in the field. However, the aforementioned work does not consider the ML aspect for UAVs. Many other surveys and tutorials do not consider ML techniques; for instance, a guide to motion planning for UAVs was presented in [19], and a survey on UAV traffic monitoring is provided in [20].

In addition to the work mentioned above, there exist other tutorials and surveys oriented towards the application of ML tools in wireless communication networks. For example, a tutorial on artificial neural networks (ANN) for wireless networks is proposed in [19]. In the same context, a review of DL techniques for UAVs summarized some of the related works in [21].

However, all the works mentioned above do not consider ML techniques specifically for UAV applications. Nevertheless, one recent work can be considered close to this one, namely the survey on ML for UAV-based communication presented in [22]. Yet, this survey does not consider FL, which is a key technique that enables installing intelligence at the edge of the UAV in a decentralized and secure manner. A number of recent works published in late 2019 and in 2020 are covered in this article in Sec. IV. We would like to highlight the importance of FL, especially for 6G networks. Moreover, we tried to maintain a unique approach in describing the current state of the art for the RL and ML parts by focusing on the most recently published works.
B. Contributions:
In this article, we provide a holistic overview of the state of the art relating ML to UAV networks. We also discuss some limitations of the existing research work and outline some potential ideas that could be addressed in the near future. We will also study the implementation of intelligence at the edge of a UAV network by reporting some of the work done in federated learning for UAV-based networks. Furthermore, we provide a comprehensive introduction to each ML area studied in this work so that readers with different backgrounds have the ability to understand a major part of this article. This survey is organized as follows:
• In Sec. II, we start by reporting the works based on the supervised and unsupervised areas of ML and designed for UAV-based networks. A brief overview covers these two different areas of ML, and some typical algorithms and neural network (NN) architectures are provided for the reader's convenience.
• In Sec. III, we go over the works relating RL to UAV-related problems. We again start with a quick overview of RL and present a classic example of RL path planning in order to understand the basic concepts of RL.
• In Sec. IV, we outline the key research directions that enable installing intelligence at the edge of a UAV network by reporting some of the works relating federated learning to UAV-related problems.
All the above-mentioned sections finish with a discussion and a conclusion presenting the limits of the works and highlighting some possible future works.

II. SUPERVISED AND UNSUPERVISED MACHINE LEARNING FOR UAVS

ML is a recent buzzword related to artificial intelligence. In short, it is the subset of artificial intelligence that enables a computer to execute tasks accurately based on the experience gained by learning from previous examples. In fact, ML has been very successful over the last decade because of the large amount of data available on the Internet and today's powerful computing machines. A huge amount of research has been undertaken to apply ML in many areas. As a result, a new dimension is offered to these fields by bringing intelligence to their core. The areas of machine learning can be divided into different categories of problems; for instance, as shown in Fig. 2, they might be divided into supervised learning problems, unsupervised learning problems, and RL-based problems. In what follows, we distinguish between the supervised and unsupervised learning areas to avoid confusion later.
A. Supervised Learning Overview
In supervised learning, the provided data is labeled; in other words, we provide for each data entry the ground-truth value so that the algorithm uses these values to learn how to make a decision for a new unlabeled entry. An example is predicting a UAV price from its characteristics. In this example, the algorithm needs a set of training data that contains each UAV's characteristics and its associated label (the price). The dataset is usually divided into a training set and a test set. The training set is used to learn the relationship between the input and the output, and the test set is used to validate the model by measuring its accuracy. Supervised problems are often divided into either regression problems or classification problems. Regression problems provide continuous output values (e.g., predicting a price), whereas classification problems provide a discrete value indicating which class the input belongs to (e.g., classifying a cancer as benign or malignant).
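As an illustration of the supervised workflow described above, the following sketch fits a regression model on a synthetic, purely hypothetical UAV price dataset (the features, coefficients, and noise level are all invented for illustration), splits it into training and test sets, and measures accuracy on the held-out data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical dataset: each row holds two UAV characteristics
# (say, battery capacity and max payload); the price is a noisy linear
# function of them -- purely illustrative numbers.
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 120.0 * X[:, 0] + 80.0 * X[:, 1] + rng.normal(0.0, 5.0, size=200)

# Split into a training set (to fit the model) and a test set
# (to validate it), as described above.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit a linear regression by least squares (a regression task:
# the output, price, is continuous).
A = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Measure accuracy on the held-out test set.
pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
print(round(rmse, 1))
```

The recovered coefficients approach the generating values, and the test RMSE approaches the noise level, which is the behavior one hopes to verify with the test set.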
Fig. 1: Survey Organization (Sec. I: introduction to UAV networks and machine learning, previous surveys, contributions; Sec. II: supervised and unsupervised learning for UAVs; Sec. III: reinforcement learning for UAVs; Sec. IV: federated learning for UAVs; Sec. V: concluding notes and future research directions).

Fig. 2: Machine Learning Overview (supervised learning on labeled data: regression and classification tasks, e.g., SVM/GMM/ANN/CNN; unsupervised learning on unlabeled data: clustering and data generation tasks, e.g., GAN/AE/K-means; reinforcement learning: path planning and resource management tasks, e.g., Q-learning and deep Q-networks).

Some supervised algorithms and NN architectures:

1) Combined classification and regression algorithms: There are several supervised algorithms that can be used either for classification or regression. For instance, the Support Vector Machine (SVM) can perform both tasks, and decision trees can also be formulated to solve regression or classification problems, depending on the use case.
2) Regression algorithms: There exist algorithms that perform a pure regression task by predicting a continuous output value. For instance, we can mention two classical algorithms in ML, namely linear regression and logistic regression (although logistic regression, despite its name, is typically used for classification).
3) Classification algorithms: It also makes sense to talk about pure classifiers in ML. Although it is mentioned in some references that the Naive Bayes classifier, with "some modification", can be used for regression, we present it as a pure-classifier example since it was derived initially for classification based on the probabilistic Bayes theorem.
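To make the Bayes-theorem idea concrete, here is a minimal Gaussian Naive Bayes classifier on invented two-class synthetic data (the class means and spreads are assumptions chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic classes with well-separated means -- illustrative only.
X0 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([4.0, 4.0], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# "Training": estimate per-class mean and variance of each feature,
# plus the class priors, as Bayes' theorem requires.
classes = np.unique(y)
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0) for c in classes])
priors = np.array([np.mean(y == c) for c in classes])

def predict(x):
    # Log Gaussian likelihood of each feature, summed per class
    # (the "naive" feature-independence assumption), plus log prior.
    log_lik = -0.5 * np.sum(((x - means) / stds) ** 2
                            + np.log(2 * np.pi * stds ** 2), axis=1)
    return int(np.argmax(log_lik + np.log(priors)))

print(predict(np.array([0.2, -0.1])), predict(np.array([3.8, 4.1])))  # 0 1
```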
4) Multi-Layer Perceptron (MLP): To imitate biological human neural networks, ANNs are mathematically formulated for machine learning. ANNs are built with a number of partially connected nodes called perceptrons, grouped into different layers. Each perceptron is responsible for processing information from its input and delivering an output. As shown in Fig. 7, the MLP is the simplest form of an ANN: it consists of one input layer, one or more hidden layers, and an output layer where a classification or regression task is performed.
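The layered structure just described can be sketched as a forward pass with random (untrained) weights; the layer sizes below are arbitrary choices for illustration, and in practice the weights would be learned by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

# A minimal MLP: one input layer (4 features), one hidden layer of
# 8 perceptrons, and an output layer of 3 scores (e.g. a 3-class
# classification head).
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def mlp_forward(x):
    h = relu(x @ W1 + b1)      # hidden layer: perceptrons + activation
    scores = h @ W2 + b2       # output layer
    e = np.exp(scores - scores.max())
    return e / e.sum()         # softmax: class probabilities

p = mlp_forward(rng.normal(size=4))
print(p.shape, round(float(p.sum()), 6))
```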
5) Convolutional neural networks (CNN): The CNN is another type of ANN, designed initially for computer vision tasks. A CNN usually takes an image as input and assigns learnable weights and biases that are updated according to a specified algorithm. The CNN architecture is characterized by its convolutional layers, which extract high-level features from the image to be used later. Technical details such as activation functions, pooling layers, and the padding operation are beyond the scope of this survey. Fig. 7 shows a typical CNN architecture where feature extraction is performed in the first convolutional layers and classification is performed via a fully connected layer.
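The core operation of a convolutional layer can be sketched in a few lines; the toy "image" and edge-detecting kernel below are illustrative choices showing how a filter responds to a feature (here, a vertical edge):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most
    deep learning frameworks): slide the kernel over the image and
    take dot products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 6x6 toy image, dark on the left and bright on the right; the
# vertical-edge kernel's response peaks exactly at the edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
response = conv2d(image, sobel_x)
print(response.shape)   # (4, 4): "valid" output size
```

In a trained CNN the kernel values themselves are the learnable weights mentioned above.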
6) Recurrent neural networks (RNN): When the data is sequential in nature, RNNs step in to solve the problem. Examples include text, video, and sound recordings. RNNs are widely used in natural language processing (NLP), in speech recognition, and for generating image descriptions automatically. The RNN architecture is similar to that of a regular neural network, except that it contains a loop that allows the model to carry forward results from previous neurons. An RNN in its simplest form is composed of an output containing the prediction, denoted by h in Fig. 7, and a hidden state that represents the short-term memory of the system.

Fig. 3: Neural Network Architectures (deep learning architectures: MLP, CNN, RNN, AE, GAN).
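The loop and hidden state described above can be sketched as a vanilla RNN cell unrolled over a toy sequence (dimensions and random weights are illustrative assumptions; in practice they are learned):

```python
import numpy as np

rng = np.random.default_rng(3)

# A vanilla RNN cell: at each step the hidden state (the network's
# short-term memory) is updated from the current input and the
# previous hidden state, and a per-step output is read out.
d_in, d_hidden = 2, 4
Wx = rng.normal(scale=0.5, size=(d_in, d_hidden))
Wh = rng.normal(scale=0.5, size=(d_hidden, d_hidden))
Wo = rng.normal(scale=0.5, size=(d_hidden, 1))

def rnn_forward(sequence):
    h = np.zeros(d_hidden)               # initial hidden state
    outputs = []
    for x in sequence:                   # the loop carrying state forward
        h = np.tanh(x @ Wx + h @ Wh)     # new hidden state
        outputs.append(float(h @ Wo))    # per-step prediction
    return outputs, h

seq = rng.normal(size=(5, d_in))         # a toy sequence of 5 time steps
outputs, final_h = rnn_forward(seq)
print(len(outputs), final_h.shape)
```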
B. Unsupervised Learning Overview

Unlike supervised learning, unsupervised learning does not use labeled data; instead, it looks for some underlying structure or hidden pattern in the data and reveals it. For instance, clustering the data, reducing data dimensionality, and data generation are considered typical tasks for unsupervised learning. In what follows, we present some classical unsupervised algorithms.
Unsupervised algorithms and NN architectures:

1) K-means: K-means is a very popular clustering algorithm in ML. It takes a number of clusters K as input and allocates every data point to the nearest cluster, keeping the within-cluster distances to the centroids as small as possible.
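A minimal K-means sketch on two invented, well-separated blobs (which could stand in, say, for two groups of ground users to be clustered) follows; the blob positions and K = 2 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def kmeans(X, k, iters=20):
    # Initialize centroids from random data points, then alternate:
    # assign each point to its nearest centroid, and move each
    # centroid to the mean of its assigned points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)        # hard assignment
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids, labels

# Two synthetic blobs centered at (0, 0) and (5, 5).
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([5, 5], 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]).round(1))
```

Note that K and the initial centroids are chosen by hand here, which is exactly the practical concern raised later in the discussion of [10].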
2) Gaussian Mixture Modeling (GMM): GMM is another clustering algorithm in ML, but unlike the K-means algorithm, GMM is a probabilistic model. As its name indicates, the clusters are derived from Gaussian distributions and the association is soft. In other words, every data point has a probability of association with every cluster center, whereas the K-means algorithm follows a hard association policy.
3) Autoencoders (AE): The AE is a type of neural network used to learn a representation of the data and hence encode it. This technique is often used for dimensionality reduction. Surprisingly, the architecture of an AE is extremely simple, as Fig. 7 shows. It is usually formed by an input layer and a hidden layer called the "bottleneck", which forces a compressed knowledge representation of the original input.
4) Generative adversarial networks (GANs): GANs are algorithmic architectures that use two neural networks in order to generate new, synthetic instances of data that can pass for real data. They are widely used in image, video, and voice generation.
C. Supervised and Unsupervised Solutions for UAV-Based Problems

1) The positioning and deployment of the UAV: The authors in [10] investigate the optimal deployment of aerial base stations to offload terrestrial base stations by minimizing the power consumption of the drones. The provided solution is considered ML-assisted due to the fact that the UAVs are not required to continuously change their positions; instead, they are placed temporarily by predicting the congestion in the wireless network. The wireless traffic is predicted based on the Gaussian Mixture Model (GMM), the probabilistic unsupervised ML model defined previously, which assumes that the data distribution can be modeled by Gaussian distributions. First, a K-means algorithm divides the users into K clusters, and then a weighted expectation-maximization algorithm is performed on the K clusters in order to find the optimal parameters of the GMM model. The next step is to deduce the optimal deployment by formulating a power minimization problem for the UAVs. The numerical results show that the ML-assisted approach outperforms the classical solution by reducing the mobility and the power needed for downlink purposes. Although the work is of great importance in combining ML with optimization techniques, using a K-means algorithm to classify the users raises the question of how to manually choose the number of clusters K and how to initialize the centroid positions.

In the same context, the authors in [23] investigate an optimal placement of UAVs acting as aerial base stations by building a structured radio map. Due to the nature of the complex terrain and the difficulty of exploiting such a radio map, the authors propose a joint clustering and regression problem using a maximum-likelihood approach formulated on the K-segment ray-tracing model. ML is also used to predict the channel in order to reconstruct the radio map. In [24], the communication efficiency between a UAV and a base station is improved by predicting the location of the UAV given its past locations.
In fact, while offloading a terrestrial base station, a UAV can be subject to wind perturbation, which results in a certain degree of offset and hence a loss in capacity. To solve this issue, the authors propose an RNN-assisted framework where the next elevation and horizontal angles of the UAV with reference to the base station are predicted using the past angles. This method leads to predicting the specific location of a high-speed moving UAV. The authors kept tuning the RNN parameters, such as the number of hidden nodes and the number of hidden layers, and studied their impact on the prediction accuracy. Numerical results have shown that a high accuracy can be achieved with a 4-layer RNN with 16 hidden nodes.
2) Channel estimation:
Another application of ML is covered in [25], where air-to-air path loss is predicted. The predictions generated by the KNN and Random Forest algorithms are compared to empirical results. The path loss is predicted from several parameters such as the propagation distance, transmitter altitude, receiver altitude, and elevation angle. The comparison of the results with the data generated by ray-tracing software shows that machine learning performs well in these prediction tasks.

In the same context of channel estimation, the paper [26] uses ANNs to predict the signal strength of the UAV and estimate the channel propagation. A shallow artificial neural network is proposed to analyze the effect on the signal of several natural phenomena such as diffraction, reflection, and scattering. The input layer is composed of parameters like the distance to the UAV, altitude, frequency, and path loss. This exciting work may be impeded by the large processing time of the data by the ANN, which raises the question of whether this solution is adequate for real-time applications.

As we mentioned previously, the SVM algorithm can be used for regression in addition to classification; the author in [27] proposes a method for path loss prediction in urban outdoor environments using the support vector regression algorithm and compares the obtained results with the empirical ones.
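The KNN-regression idea used for path loss prediction can be sketched in a few lines. The training data below follows an invented log-distance path loss model with noise, NOT the measured air-to-air data of [25]; the feature set and model constants are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_regress(X_train, y_train, x, k=5):
    # Predict by averaging the targets of the k nearest training
    # points -- the same KNN idea as in [25], in its regression form.
    d = np.linalg.norm(X_train - x, axis=1)
    return float(y_train[np.argsort(d)[:k]].mean())

# Hypothetical training set: features are (distance in m, altitude
# in m); the "path loss" in dB follows an illustrative log-distance
# model plus noise.
X_train = rng.uniform([10.0, 10.0], [1000.0, 150.0], size=(500, 2))
y_train = 40.0 + 20.0 * np.log10(X_train[:, 0]) + rng.normal(0, 1, 500)

pred = knn_regress(X_train, y_train, np.array([500.0, 80.0]))
true = 40.0 + 20.0 * np.log10(500.0)
print(round(pred, 1), round(true, 1))
```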
3) UAV sound detection:
The authors in [28] present a real-time UAV detection system based on analyzing the sound data coming from the drone. For this purpose, two ML methods have been applied and compared in terms of accuracy. The first step consists in detecting the potential presence of a UAV by analyzing the frequency content and then checking whether the sound exceeds a predefined threshold for drones. The first ML method used is Plotted Image Machine Learning (PIL); this method uses the visualized FFT graph generated from the sound data to compare the average image similarity with a reference FFT target. The second method is based on the K-Nearest Neighbors (KNN) algorithm applied to the FFT CSV file and measures the average distance to the target. The simulation results show that the PIL method outperforms the KNN method and succeeds in providing good results. At this level, we point out that even if visual drone detection can be limited by the quality and resolution of the input image, sound data can also be highly affected by ambient noise in real applications. It is also not obvious whether all UAVs will have the same FFT profile used as the predefined target in the problem. Moreover, KNN is a simple and straightforward ML algorithm, and hence trying more sophisticated algorithms would be beneficial for the problem. We also highlight that a hybrid system that uses image, sound, and radio UAV transmission signals at the same time would be a very interesting futuristic idea.
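The first detection step described above (FFT of the sound, then a threshold check) can be sketched as follows; the sampling rate, the "drone band" of rotor harmonics, and the energy threshold are all invented for illustration and do not come from [28]:

```python
import numpy as np

rng = np.random.default_rng(6)

fs = 8000                     # sampling rate in Hz, illustrative
t = np.arange(fs) / fs        # one second of audio

def band_energy(signal, lo, hi):
    # Magnitude spectrum via FFT, then total energy in a band.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return float(np.sum(spectrum[band] ** 2))

# Hypothetical "drone band" (150-400 Hz rotor harmonics) and a
# hand-picked threshold -- both assumptions for this sketch.
def drone_present(signal, threshold=1e6):
    return band_energy(signal, 150.0, 400.0) > threshold

noise = rng.normal(0.0, 0.1, size=fs)
drone = noise + np.sin(2 * np.pi * 220.0 * t)   # a 220 Hz rotor tone
print(drone_present(noise), drone_present(drone))   # False True
```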
4) Imaging for UAVs:
Although computer vision is beyond the scope of this survey, one may find several topics that relate imaging to UAVs. For example, the authors in [11] investigate the detection of a forced (emergency) safe landing site. The detection is converted into a classification problem where two known classifiers (GMM and SVM) are tested. The classifier converts the real map into a safe or non-safe grid map. A filter is applied later to remove non-safe spots and keep the potential landing sites. The main reason why these types of problems are not considered in this study is that they can be treated as pure computer vision problems, and the application to UAVs does not change the nature of the task, except that the images are taken from a given altitude. In other words, the same techniques are applied to UAV imaging, such as CNNs, feature extractors, and edge detectors. For the reader's convenience, we summarize a number of recent works, from late 2017 until the present, in Table I. We also refer readers interested in more UAV imaging problems to Table 1 in [43], which summarizes some works done before 2017.
D. Discussion and future work
Although we tried to objectively critique some of the works that we have covered previously, we intend in this section to present our thoughts on the use of ML in wireless communication problems in a general fashion.

Firstly, it can be noted that ML tools are frequently used in the literature to solve problems that could be solved more simply and deterministically, giving the impression that the need to use ML is not well justified, which could lead in many cases to an ML misapplication.

Moreover, we remarked that in all the works covered so far, ML results always appear better than empirical results in the numerical simulations. This raises the question of whether ML tools always outperform classical methods, and the answer is simply no. At this point, we should mention that we are neither doubting the major success of ML nor questioning its efficiency in solving many problems; instead, we are highlighting the fact that, in some cases, the choice of data plays an important role in assessing the accuracy of an ML model.

To clarify the idea, let us consider a concrete example. Imagine that you are working on a computer vision object detection problem whose goal is to detect a UAV in images. If you do not provide a good quantity of non-UAV images to the model, you will find that the CNN provides good accuracy on UAV images but fails on non-UAV images. Moreover, if the test set is biased and has some similarity with the training set, you will end up with good accuracy, but in reality the model will fail on new unseen examples. Hence, the quality and the quantity of the data play a big role in evaluating the accuracy of the model, and neglecting this point will lead to a misleading ML accuracy.

TABLE I: Imaging for UAVs

| Reference | Model | Application | Date of publication |
|---|---|---|---|
| [29] | Faster R-CNN | Car detection | Late 2017 |
| [30] | Nazr-CNN | Damage detection | Late 2017 |
| [31] | CNN+SVM (CSVM) | General object detection | 2018 |
| [32] | Modified region-based CNN | Electrical equipment defect detection | 2018 |
| [33] | Faster R-CNN + region proposal network (RPN) | Pedestrian detection | 2018 |
| [34] | Semantic segmentation + CNN | UAV geolocalization | 2018 |
| [35] | CNN | Car detection | 2018 |
| [36] | CNN | Building crack detection | 2018 |
| [37] | Faster R-CNN + YOLOv3 + RetinaNet | Tree detection | 2019 |
| [38] | CNN + digital surface model (DSM) | Surface classification | 2019 |
| [39] | CNN with semi-supervised learning | Agricultural detection (soybean leaf and herbivorous pest) | 2019 |
| [7] | CNN | Rice-grain estimation | 2019 |
| [9] | YOLOv3 | Weed location | 2020 |
| [8] | CNN | Single tree detection | 2020 |
| [40] | YOLOv2 | Green mango detection | 2020 |
| [41] | Faster R-CNN | Maize tassel detection | 2020 |
| [42] | CNN | Counting and locating citrus trees | 2020 |

In the same context, we know that data plays an indispensable role, as the learning algorithm is used to discover and learn knowledge or properties from the data. That is why we strongly believe that the wireless communication community should give more importance to providing open-source, high-quality data: we remarked that there is not a sufficient quantity of data available online for wireless purposes compared to the amount of data available for, say, computer vision tasks.

Another ML drawback is noticeable in works where methods are compared in terms of performance. For instance, in CNN architecture comparisons, or even in [25], one may notice that when two ML models are compared, there is no mathematical explanation as to why one model is better than the other or why one NN architecture outperforms another.
This point illustrates the "black box" aspect of ML: in other words, it is a matter of tuning parameters and evaluating the result, and no further explanation can be provided. Consequently, for a given problem we sometimes cannot predict which model will perform well and which model is not promising.

However, with all that has been mentioned above, ML remains an interesting alternative and a promising tool for UAV-related problems in particular. Therefore, we believe that several ideas can be addressed in the future. In fact, more complex ML models can be tested on some UAV-related problems; for example, in path loss prediction, more regression tools can be tested. Also, for UAV detection problems, we noticed that the problem is solved either via sound detection or via image detection by converting it into a computer vision problem. Instead, we think that a complex hybrid system that uses different types of inputs (e.g., sound FFT, image, and radio transmission) is feasible with ML, where an adequate NN (e.g., a type of CNN for images and a given type of RNN for sound and radio) provides a score for each type of input, and a final NN then classifies the output using the previous scores.
Fig. 4: Reinforcement Learning Elements (agent and environment interacting through action A_t, state S_t, and reward R_t).

To conclude, supervised and unsupervised ML frameworks have succeeded in confronting many contradictory challenges by providing intelligent solutions for various problems involving UAVs.

III. REINFORCEMENT LEARNING SOLUTIONS FOR UAVS

A. RL Overview
Like the supervised and unsupervised learning areas of ML, RL is an area of ML dedicated to making decisions in a well-defined environment. Formally, a reinforcement learning problem always has five main elements, as shown in Fig. 4:
1) The agent: an entity that can take an action, denoted by A_t, and receives a reward R_t accordingly.
2) The environment: a representation of the real world in which the agent operates.
3) The policy: the mapping of each state S_t to an action A_t. We usually denote a policy by π.
4) The reward signal: the feedback that the agent receives after performing an action, denoted in Fig. 4 by R_t.
5) The value function: a measure of how good a state is, i.e., the total expected future reward starting from a given state. A value function is usually denoted by V(s), where s is the state of interest. Mathematically, V(s) = E(G_t), where G_t is the discounted sum of future rewards: G_t = Σ_{k≥0} γ^k R_{t+k+1}, with γ ∈ [0, 1].
The goal is to choose the correct actions (or policy) so as to maximize a predefined reward function that should be adapted to the type of problem. In addition to the five elements of RL mentioned above, another element, namely the model, can be considered in some cases. Depending on its presence or absence, RL problems can be divided into two main categories: model-based RL and model-free RL. In what follows, we differentiate between these two areas in order to avoid confusion later on.
1) Model-based RL:
As its name indicates, a model-based RL problem uses a model as a sixth element to mimic the behavior of the environment for the agent. Consequently, the agent becomes able to predict the state and the action at time T+1 given the state and the action at time T. At this level, supervised learning can be a powerful tool for the prediction work. Thus, unlike model-free RL, in model-based RL the update of the value function is based on the model and not on experience alone.
2) Model-free RL:
In model-free RL problems, the agent cannot predict the future, and this is the main difference with the model-based RL framework explained previously. Instead, the actions are based on the so-called "trial and error" method, where the agent, for instance, can search over the policy space, evaluate the different rewards, and finally pick an optimal one. A well-known classic example of model-free RL is the Q-learning method, which estimates the optimal Q-value of each action and chooses the action with the highest Q-value for the current state. To summarize and make it simple for the reader, differentiating between model-based and model-free RL problems is an easy task. Just ask yourself the following question: is the agent able to predict the next state and action? If the answer is yes, then you are dealing with model-based RL; otherwise, it is more likely a model-free RL problem.
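As a minimal sketch of the "trial and error" and Q-learning ideas just described (the corridor environment, reward, and hyperparameters below are all invented for illustration and are not from any surveyed work):

```python
import numpy as np

rng = np.random.default_rng(7)

# Tabular Q-learning on a 1-D corridor of 5 cells: the agent starts
# at cell 0 and gets reward 1 for reaching cell 4 -- a toy stand-in
# for grid-map path planning. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(500):                       # episodes of trial and error
    s = 0
    while s != 4:
        # epsilon-greedy: explore with probability eps, else exploit.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

greedy = [int(Q[s].argmax()) for s in range(4)]
print(greedy)   # learned policy: always move right -> [1, 1, 1, 1]
```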
3) Deep Reinforcement Learning (DRL) Overview:
While classical RL provides efficient solutions for many types of discrete decision problems, more realistic solutions can be obtained using DRL, which has proven its efficiency by reaching super-human levels of control. DRL is based on using ANNs to evaluate action values using the previous experiences of the agent. Many algorithms have been proposed in the literature; in the following section, we go over the most widely used ones. For the reader's convenience, two algorithms, Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG), are covered briefly. We kindly refer readers interested in their deeper technical details to the original publications, [44] for DQN and [45] for DDPG.
Deep Q Network (DQN)
Fig. 5: Grid map (X-Y grid showing the UAV starting position, the UAV goal, and the optimal path).

DQN was the first algorithm proposed in the context of DRL, by Mnih et al. in [44]. To understand the key concepts of DQN, a basic knowledge of the Q-learning algorithm is recommended; hence we refer interested readers to Sec. III-B1. It is worth mentioning that DQN was proposed as an improvement over Q-learning, which uses a discrete state and action space in order to build the Q-table. In contrast, the Q-values of DQN are approximated using an ANN by storing all the previous agent experience in a dataset and then feeding it to the ANN to generate the actions based on minimizing a predefined loss function derived from the Bellman equation. We should also mention that the idea of DQN was inspired by Neural Fitted Q-learning (NFQ), proposed in 2005, which suffered from overestimation problems and instabilities in convergence. There exist many other improved variations of DQN, such as double DQN, dueling DQN, and distributional DQN. Despite the remarkable success of DQN, especially when it was historically tested on Atari games, it has its own limitations, such as the fact that it cannot deal with continuous action spaces and cannot use stochastic policies.
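To make the two key ingredients above concrete, the replay memory and the Bellman-derived training target can be sketched in a few lines. The linear `q_network` stand-in and all numeric values are our own illustrative assumptions, not details from [44]:

```python
from collections import deque

import numpy as np

# Replay memory: the dataset of past (state, action, reward, next_state, done)
# experiences that is sampled to train the network.
memory = deque(maxlen=10_000)

def q_network(states, weights):
    """Stand-in for the ANN: a linear Q-function, one weight column per action."""
    return states @ weights

def dqn_targets(rewards, next_states, dones, weights, gamma=0.99):
    """Bellman targets r + gamma * max_a' Q(s', a'); DQN minimizes the squared
    gap between these targets and the network's own prediction Q(s, a)."""
    next_q = q_network(next_states, weights).max(axis=1)
    return rewards + gamma * next_q * (1.0 - dones)
```

In the real algorithm, `weights` are the parameters of a deep network and mini-batches are drawn from `memory`; terminal transitions (done = 1) drop the bootstrap term, which is exactly what the `(1.0 - dones)` factor implements.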
Deep Deterministic Policy Gradient (DDPG)
To overcome the restriction of discrete actions, the Deterministic Policy Gradient (DPG) algorithm was first proposed in DeepMind's publication in 2014 [45], based on an Actor-Critic off-policy approach. We refer readers who are not familiar with Actor-Critic RL methods to [17, Chapter 13]. For the sake of simplicity, let us keep in mind that Actor-Critic methods are generally composed of two parts: a Critic that estimates either the action-value or the state-value, and an Actor that updates the policy in the direction proposed by the Critic. Later on, in 2015, and based on the DPG algorithm, DeepMind proposed a new DRL algorithm called the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is a model-free, off-policy method that is based on the Actor-Critic algorithm. In short, DDPG is a deep reinforcement learning algorithm that helps the agent find an optimal strategy by maximizing the reward return signal. The main advantage of such a deep algorithm is that it performs well on high-dimensional/infinite continuous action spaces.

B. Case study
Motivated by its popularity among RL algorithms, we introduce Q-learning, which is a classical model-free RL algorithm. We intend to provide a comprehensive and practical explanation to the reader of how RL can be used in a path planning problem. Readers with a basic knowledge of RL can skip this section. We stick to a basic example where a UAV flying at a fixed altitude learns how to reach a given target while avoiding obstacles in the map shown in Fig. 5.
1) Q-learning overview:
The Q-learning algorithm is based on a Q-table used to select actions for the agent at each step. The table is composed of the combination of every state with every possible action, and hence its dimension is |States| × |Actions|. The Q-table is used to store and update the maximum expected future reward, referred to as Q(state_i, action_j), which is the (i-th, j-th) entry of the Q-table. This Q-table is of great importance to the Q-learning algorithm simply because it is used to determine which action the agent should take such that the expected future reward is maximized.
2) Update rule:
The update of the Q-table is done using a fundamental equation in RL, the Bellman equation:

Q_new(s_t, a_t) = (1 − α) Q_old(s_t, a_t) + α (R_{t+1} + γ max_a Q(s_{t+1}, a))   (1)

where s_t and a_t are, respectively, the state and the action taken at time t; α is the learning rate, which controls how much the old value of the Q-table influences the current update; and γ is the discount factor, a measure of how future rewards affect the system. After every action taken, the agent updates its Q-table values using Eq. 1; then, at a given state, it selects the action having the highest Q-value.
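The update rule of Eq. 1 translates directly into code. In the sketch below, the grid size, learning rate, and discount factor are illustrative assumptions rather than the exact values used in the case study:

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Bellman update of Eq. 1 applied to entry (s, a) of the Q-table."""
    best_next = np.max(Q[s_next])  # max over actions of Q(s_{t+1}, a)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + gamma * best_next)
    return Q

# Q-table of dimension |States| x |Actions|, initialized with zeros,
# e.g. a 5x5 grid map with 4 moves (up, down, left, right).
Q = np.zeros((25, 4))
Q = q_update(Q, s=0, a=2, reward=1.0, s_next=1)
```

After the call, only the entry Q[0, 2] has changed, to α(R + γ·0) = 0.1, since the table was all zeros beforehand.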
3) The exploration/exploitation dilemma:
One fundamental concept in RL, which is also visible in Q-learning, is the exploration/exploitation dilemma. To explain this duality, let us examine how the agent succeeds in reaching its goal. First, the agent makes a random step in the environment; then it starts updating the Q-table (initialized with zeros, for example) according to Eq. 1. However, if the agent only uses the Bellman equation, it may be stuck in a good state forever while better states exist on the map. This is similar to an optimization process that is stuck in a local minimum or maximum while better solutions can still be found by exploring the environment. To solve this problem, the exploitation/exploration trade-off is introduced: it adds randomness to the system so that, at each step, the agent can either exploit the environment by selecting actions that maximize the Q-values of the Q-table, or explore the system by selecting some random actions. The parameter that usually refers to the probability threshold for exploration is designated by ε. In our implementation, we used a decay technique that decreases the value of ε at each episode, so that exploration is encouraged at the beginning of the process, usually known as early exploration, and exploitation is prioritized afterwards so that the agent can use the learned paths. Fig. 6 shows the effect of the initial value of ε, denoted by ε_0, on the convergence of the system. The red line, corresponding to a low ε_0 value, converges more rapidly since the exploration
probability is low, and hence the system rapidly uses the optimal values from the Q-table to take actions. However, it is clear that for the blue line, which uses early exploration, additional randomness is introduced at the beginning of the process due to the high starting value of ε. We also remark that the number of steps decreases once the UAV has found its optimal path, shown by a solid black line in Fig. 5, starting from approximately episode 20. RL is considered a promising framework for UAV networks in many scenarios; in the following, we cover part of its applications based on the existing literature.

Fig. 6: Exploration/exploitation dilemma: steps per episode for ε_0 = 0.9 and ε_0 = 0.3 (effect of the decay technique).
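The ε-greedy action selection with decaying ε described above can be sketched as follows; the decay rate and episode count are illustrative assumptions of ours, not the exact settings behind Fig. 6:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action: explore
    return int(np.argmax(Q[state]))            # best known action: exploit

# Multiplicative decay: early exploration, then exploitation of learned paths.
epsilon0, decay, n_episodes = 0.9, 0.95, 100
epsilons = [epsilon0 * decay**ep for ep in range(n_episodes)]

rng = np.random.default_rng(42)
Q = np.zeros((25, 4))
action = epsilon_greedy(Q, state=0, epsilon=epsilons[0], rng=rng)
```

With this schedule, ε starts at 0.9 and falls below 0.01 by the last episode, reproducing the qualitative behavior of the decay technique: heavy exploration early on, near-pure exploitation at the end.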
C. RL solutions for UAV-based problems

1) RL for autonomous navigation:
A network of UAVs can no longer be controlled in a classical way by manually controlling the navigation of each UAV in the network separately. It is highly desirable nowadays to equip UAVs with the ability to make intelligent decisions by implementing a high level of control. Achieving such high autonomy for UAVs is a challenging task due to the continuous changes in the UAV environment and to the different constraints related to UAV navigation (e.g., battery, UAV dynamics). In [53], a high-level UAV control method is implemented for uncertain or unknown environments. Although model-based methods are generally not suitable for real-time applications due to the expensive computation needed for learning the model and deciding which actions to take, the paper uses a modified model-based RL by implementing the Texplore algorithm to perform the path planning task for a UAV. The advantage of Texplore is that it separates action selection, model learning, and planning by performing them in a parallel manner. Simulation results show that the Texplore algorithm outperforms the classical Q-learning method by avoiding exhaustive exploration of the environment. The work done so far is interesting but still limited to a 2D problem with a simplified map, and hence could be extended in the future to a more complex and challenging 3D environment where the UAV can adjust its height in order to avoid potential obstacles. Beyond pure path planning, the work in [54] investigates providing coverage for ground users by studying the deployment and the trajectory design of a network of UAVs in order to meet several performance metrics such as coverage,

TABLE II: Path planning for UAVs
Paper | RL technique | Application | 3D/2D | Single/Multiple UAVs | Wireless communication based parameters | Obstacles | LOS/NLOS | Battery | UAV dynamics | Users movement
[46] | Q-learning-based Texplore | Path planning | 2D | Single | ✗ | ✗ | ✗ | ✓ | ✓ | ✗
[47] | DRL ESN-based | Path planning | 2D | Multiple | Interference / wireless latency / transmit power | ✗ | ✗ | ✗ | ✗ | ✗
[48] | K-means + Q-learning | Deployment and path planning | 3D | Multiple | QoE at the users | ✗ | ✗ | ✗ | ✗ | ✓
[49] | Q-learning vs. NN-based Q-learning | Path planning | 2D | Single | Transmission rate | Single | ✓ | ✗ | ✗ | ✗
[50] | Q-learning | Path planning | 2D | Single | ✗ | Multiple | ✗ | ✗ | ✗ | ✗
[51] | Deep Q-network | Path planning | 3D | Single | ✗ | Multiple | ✗ | ✗ | ✗ | ✗
[52] | DDPG | Target tracking | 2D | Single | ✗ | Multiple | ✓ | ✗ | ✓ | ✗

minimum interference, and best QoE. The proposed model-free RL-assisted framework enables dynamic tracking of users' movement by adjusting the UAV locations accordingly. The idea starts by clustering the users using a classical GAK-means algorithm (a modified version of the K-means algorithm explained briefly in Sec. II). It is worth mentioning that deploying the UAVs at the cluster centers does not achieve optimality, simply because the performance metrics adopted in the problem are not only related to the Euclidean distance to the users; they also depend on other parameters such as the altitude of the UAV and the presence of LOS. Consequently, a Q-learning algorithm is proposed to first deploy the UAVs in a sub-optimal way and decide their trajectories later. The work done so far is of great importance; however, some assumptions made may be far from reality, such as the fact that users, when moving around, are not supposed to mix with other clusters. Moreover, positioning the UAVs initially using the K-means results might be better than selecting a random location in terms of fast convergence to the sub-optimal positions. Other works in the literature have also focused on coverage; for instance, in [55], optimal coverage is studied through a distributed RL algorithm based on Multi-Agent Reinforcement Learning (MARL). So far, we have covered classical RL solutions for UAV path planning problems; however, more complex autonomous navigation solutions can be provided via DRL. In what follows, we go over the most relevant works coupling DRL with autonomous navigation for UAVs. In the context of providing coverage for ground users, DRL can play an important role in building efficient solutions. In such a setup, the UAVs are usually deployed as flying base stations or relays. For instance, the authors in [56] investigated applying DRL to a multiple-input multiple-output (MIMO)-based UAV network where each UAV is equipped with a single antenna.
We would like to remind the reader that MIMO is a typical wireless communication scheme that improves communication performance by using multiple antennas at the transmitter and multiple antennas at the receiver. The proposed DRL solution is based on DQN, where the Signal-to-Interference-plus-Noise Ratio (SINR) is used as a metric for the quality of the channel and is the basis on which the reward signal is defined. The UAV maximizes the expected reward calculated based on the received signal strength. Consequently, the UAV maximizes its coverage efficiency, measured based on a predefined coverage score. The proposed solution was finally compared with other DRL methods and proved its superiority in some setups. However, we think that comparing it to a different type of DRL algorithm, such as DRL-JSAC proposed in [57], does not make complete sense, since the latter solution is based on DDPG, where the action space and the state space are continuous. In contrast, the state space considered in the solution provided in [58] is limited to three possible cases related to the received signal strength. In addition to providing ground wireless connectivity, there exists a plethora of areas where UAVs could be used efficiently; drone delivery is considered one of the recent areas whose time has come. In this frame, achieving drone delivery tasks through DRL was investigated in [58]. The authors used double DQN to propose a path planning method for a UAV whose objective is to reach a destination in an obstacle-filled environment. The proposed solution is an improvement over the authors' previous work in [59], where three DRL algorithms were tested, namely DQN, double DQN, and dueling DQN. As double DQN gave the best results, in [58] the same algorithm was used, and the depth information deduced from the image of the UAV stereo-vision front camera was used as an input.
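Returning to the SINR-based reward used in [56] above, the snippet below illustrates how such a reward signal might be wired up, mapping a channel measurement to a binary reward; the threshold and the power values are hypothetical choices of ours, not parameters from the paper:

```python
import numpy as np

def sinr_db(signal_mw, interference_mw, noise_mw):
    """SINR in dB: desired received power over interference plus noise."""
    return 10 * np.log10(signal_mw / (interference_mw + noise_mw))

def coverage_reward(sinr, threshold_db=5.0):
    """+1 when the channel quality clears the threshold, -1 otherwise."""
    return 1.0 if sinr >= threshold_db else -1.0

# A hypothetical measurement: 2 mW of signal against 1 mW of interference+noise.
r = coverage_reward(sinr_db(signal_mw=2.0, interference_mw=0.5, noise_mw=0.5))
```

In a full DQN loop, this reward would be fed into the Bellman target at each step, so that the UAV learns positions where the received signal strength keeps the SINR above the coverage threshold.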
Moreover, we envision futuristic scenarios where UAVs can be used as waiters and RL is used to accomplish the task of serving drinks to customers [60]. At this level, we have covered a couple of research works that are based on Q-learning methods, using either classical RL or DQN. However, policy gradient methods can be applied to a wider range of RL problems. For instance, the DDPG algorithm, which belongs to the set of policy gradient methods, is more suitable for complex problems, especially when dealing with continuous action spaces. To make it simpler, let us consider a UAV path planning problem: to apply DQN, you need to discretize the action space, and accordingly the UAV will have a well-defined set of movements that it can perform. The discretization process can sometimes be computationally expensive and, in other cases, it is even impossible to implement, especially when targeting real-time applications. As a solution, policy gradient methods can be easily implemented to perform continuous actions. For instance, in the context of UAV motion, actions can be related to the speed values and direction angles. We need to point out that policy gradient methods are not always better than Q-learning-based methods and that they have their own drawbacks, such as the high variance problem in estimating the expectation of the reward. In what follows, we cover some of the relevant works that apply this type of method to solve path planning tasks for UAVs. Back to providing wireless connectivity for ground users, the authors in [57] used an Actor-Critic-based method to solve a multi-objective control problem where the UAVs tend to minimize their energy consumption and maximize their coverage range in a fair manner. Unlike the previously reported DQN-based solutions, this work is based on a continuous action space formed by the UAV direction and the flying distance of each UAV.
Moreover, the authors take into consideration coverage fairness, which is an important indicator, since maximizing coverage alone could amount to covering only a small subset of ground users. As a solution for the defined multi-objective problem, the authors adjusted the DDPG algorithm accordingly and called it DRL-EC. The new algorithm was compared to two baseline methods and proved its superiority in terms of coverage score and energy consumption. In [61], the environment considered is a complex, large-scale, three-dimensional map. In other words, the map is crowded with obstacles, all directions are possible for the UAV, and it is also dynamic. These types of maps are quite challenging for RL path planning due to the fact that it is very difficult to rely on methods that use maps to represent the environment for the agent. The solution proposed is based on modeling the navigation problem as a partially observable Markov decision process (POMDP) and then solving it by applying a DRL-based algorithm called Fast-RDPG. It is worth mentioning that 1) POMDP is an extension of the Markov decision process (MDP), and that 2) the recurrent deterministic policy gradient (RDPG) algorithm belongs to another set of DRL algorithms that are based on DPG. In [62], the DDPG algorithm, briefly introduced in Sec. III-A3, is used to train the UAV to navigate in a 3D environment while avoiding obstacles. The proposed solution considers a continuous action space, which explains the use of a DDPG-based approach. The authors used transfer learning to accelerate the learning by reusing the weights learned by the UAV after being trained in a free-space environment. An urban environment is simulated by adding obstacles at specific locations and penalizing the UAV for any crash that occurs while reaching its target location. Numerical results showed that the UAV succeeded in reaching its target while avoiding all obstacles in the way. However, the success rate of the UAV decreases with the complexity of the environment as more obstacles are added.
The lack of precision was explained by the fact that using an infinite continuous action space makes it hard to reach full accuracy. DDPG is also used differently in [63] to jointly design paths for a network of UAVs in order to maximize its throughput. The proposed idea is to formulate the problem as an MDP where the reward is related to the throughput and the constraints are related to the total transmission power and channel availability. The actions taken by the UAV are related to adjusting both the 3D location and the transmission control. Because the actions and the states are continuous, the DDPG framework was used in three different setups; for each setup, the reward function is changed to achieve a given control objective. Several other works existing in the current literature are quite interesting, and due to space limitations, we cannot go over all of them in detail. For instance, environment exploration and obstacle avoidance problems for UAVs are solved via different RL methods, with both continuous and discrete action spaces, in [61], [64]–[75]. Other works tested RL for assisting a UAV in a landing operation [76]–[79]. In [80], anomaly detection is performed via RL in order to detect abnormalities in the functioning of the motor and launch the landing procedure immediately. All the previously mentioned works only focus on a specific type of RL application, which is path planning. In the coming section, we will cover more interesting potential applications of RL, such as event scheduling and resource allocation.
2) RL for scheduling and resource management :
Beyond path planning for smart UAVs, one can think about autonomously setting a smart event schedule for a drone network. In this context, the authors in [81] propose a spatiotemporal scheduling framework for autonomous UAVs. The proposed RL solution is model-free, based on the popular Q-learning algorithm. The algorithm handles unexpected events iteratively by checking for their existence at every time slot and updating the UAVs' schedule accordingly. After that, the trajectory of the UAV is updated according to the Q-learning strategy. Multiple parameters are taken into account for every event (e.g., starting time, processing time, location, priority). The considered work is interesting for many reasons: it deals with unexpected events efficiently, it considers the battery level, and it works within a cooperative UAV environment. However, it is still not clear how to select some parameters optimally. For instance, the time discretization parameter enables a trade-off between complexity and time efficiency; in other words, deciding the next event in an optimal way will inevitably result in increased processing time. This will certainly affect the coverage rate of the UAV badly. Moreover, the authors could have considered a more realistic case where multiple docking stations are available, instead of considering only one station for the whole network, so that the UAV would always consider moving to the nearest station if needed. In [82], a UAV network is managed by controlling the UAV connectivity given the available bandwidth and energy. The set of drones is charged by a wind-powered station, which enables green wireless power transfer. The number of UAVs authorized to take off is managed through classical RL by solving a system of Bellman optimality equations in order to extract the optimal policy.
The authors focused on the physical implementation of the charging station and the drone receiving pads by going through the different technical details of the wireless power transfer system. Among the assumptions made throughout the work is the fact that the charging time is constant, which could be hampered by several factors. First of all, while establishing wireless power transfer, the UAV could face a number of problems, such as losing the LOS connection with the station or misalignment issues. Secondly, the fact that the charging station uses wind power makes it subject to variability in the harvested power. The authors argued that the latter problem could be solved by setting an adaptive current control. Resource allocation represents another potential problem that could be paired with RL. The work in [18] is among the few works that go beyond UAV deployment or trajectory design; instead, it focuses on resource allocation for a network of multiple UAVs that communicate with ground users. The solution provided is based on Multi-Agent Reinforcement Learning (MARL), and the problem formulation is based on stochastic game theory. The authors investigate sub-channel selection, user selection, and power allocation for each user. Several parameters are taken into account, such as the SINR and the LOS/NLOS condition with the users. The work described is of great importance, especially considering the scarcity of publications in this particular application; hence, future works can exploit other RL techniques in this area.
D. Discussions and future work:
Based on the recent literature related to RL for UAV-related problems, we would like to offer the following observations. One can easily notice that the vast majority of the published works focus on path planning applications for UAVs. More specifically, we remarked that a great number of papers tend to use a Q-learning approach to propose autonomous path planning for UAVs. Although Q-learning is a classical algorithm and an interesting way to start solving such problems, it is somewhat impeded by the need for full knowledge of the map, which is not trivial in reality, especially when considering the high movement speed of UAVs. Added to that is the fact that Q-learning might be slow if optimality is needed. Consequently, the complexity-optimality trade-off must be carefully studied. To sum up, DRL techniques, such as neural-network-based Q-learning and DDPG, are more promising in terms of path planning and should gain more interest in the future. In addition, we also noted that most research contributions use a discrete approach for path planning problems. While solutions with a discrete set of actions and states are a classic approach to addressing RL problems, they do not reflect a real situation where actions can be infinite, as in real-world trajectory planning. Although solutions with a continuous state/action space are more difficult, solving them can only bring significant benefits to the area. Furthermore, we noticed that most of the existing works focus on traditional centralized approaches for RL solutions, which raises several challenges related to complexity and time management. That is why we strongly believe that distributed reinforcement learning, such as the distributed Q-learning algorithm, is an interesting technique for solving real-time UAV applications.
This type of RL technique is well suited for UAV networks where multiple agents are subject to collaborative decisions. Besides that, we think that other potential applications, such as resource allocation and event scheduling, are not well covered by the literature, which has created unbalanced research content oriented towards path planning problems. The existing works looking into these new topics are quite few, and hence future works can be directed toward applying other RL-based approaches to solve these problems.

IV. FEDERATED LEARNING FOR UAVS

So far, we have covered many techniques that could contribute to the development of smart UAV networks, ranging from supervised learning to unsupervised learning to RL. However, some of the algorithms covered previously are not compatible with certain constraints related to UAVs. More specifically, we point out the limited on-board computing capacity of the UAV. Hence, we question the applicability of artificial intelligence in a UAV network in a practical situation. In response to the latter question, Google recently put in place the so-called FL, envisioning a practical way of implementing ML algorithms in constrained networks [83], [84]. FL consists of executing ML algorithms in a decentralized way without the need to upload the training set to a central node (or server). It is not designed specifically for UAV networks; instead, it is designed for any type of network composed of a central server (e.g., a base station in our setup) and a number of clients (e.g., UAVs, mobile users).
A. FL principle
Without loss of generality, we provide a comprehensive explanation of the FL algorithm for the setup of a network of UAVs served by a terrestrial base station. As a typical task, we suppose that the UAVs are processing different images of the ground. We also assume that the optimization of the loss function is done via a simple stochastic gradient descent (SGD) algorithm. As illustrated in Fig. 8, the central server, which is the base station in our case, shares the current update of the global model, denoted by w_t, with a sub-set of the users. The sub-set size, usually denoted by C, is randomly selected by the server. Once a client UAV receives the current update of the global model, it uses its local training data to compute a local update of the global model. We should mention that several parameters are associated with each UAV, as indicated in Fig. 8: the mini-batch size, denoted by B, indicating the amount of local data used by each UAV; the index k of the UAV; and the number of training passes each client makes over its local dataset in each round, denoted by E. After performing the update, the UAV only communicates the update, denoted by w^k_{t+1}, to the central server, which is the base station in our case. For an SGD-based optimization, the update is calculated as follows:

w^k_{t+1} = w_t − η ∇l(w_t, B)   (2)

where η is the learning rate and l is the loss function. For example, the UAV with k = 4 in Fig. 8 performs a full-batch update and hence uses all its local data, since B = ∞. It then repeats Eq. 2 ten times and delivers the output w^k_{t+1} to the base station. Once the local update w^k_{t+1} is received by the central node, it improves the global model and then discards the update, which is no longer needed.

B. FL advantages

FL is the ultimate solution for constrained networks simply because exhaustive computation cannot be done on-board the clients.
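One FL round as described above can be sketched as follows: each client runs Eq. 2 on its own data, and the server aggregates the returned weights. The least-squares loss, the synthetic data, and the size-weighted (FedAvg-style) average are illustrative assumptions for this sketch, not details from [83], [84]:

```python
import numpy as np

def local_update(w_global, X, y, eta=0.1, epochs=1):
    """Client-side SGD (Eq. 2) on a least-squares loss; only the updated
    weights, never the raw data, are sent back to the server."""
    w = w_global.copy()
    for _ in range(epochs):                    # E passes over the local data
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the loss l(w, B)
        w -= eta * grad
    return w

def fedavg(updates, sizes):
    """Server aggregation: average client models, weighted by data size."""
    total = sum(sizes)
    return sum((n / total) * w for w, n in zip(updates, sizes))

rng = np.random.default_rng(0)
w_global = np.zeros(3)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
updates = [local_update(w_global, X, y) for X, y in clients]
w_global = fedavg(updates, sizes=[20, 20, 20])
```

Note that only the weight vectors cross the network in each direction, which is precisely the property that makes FL attractive for bandwidth- and privacy-constrained UAV links.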
It helps decouple model training from access to the raw data, since the UAVs are not required to share any data with the server; instead, they only transmit their local updates, as explained previously. Firstly, FL reduces privacy and security issues by minimizing data traffic. As a result, it is considered a key solution for confidential systems where data should not be shared. Secondly, FL is suitable for applications where the data is unbalanced; in other words, it may happen that a client is out of the region of interest and thus has a small amount of data
(Fig. 7 organizes the literature by problem type and application:
- Autonomous flight:
  - Path planning: Texplore [53]; Q-learning [58], [61]; DQN [64], [65], [69]; DRL (actor-critic) [67], [68], [70]–[72]
  - Drone delivery: DQN [58], [59]; Q-learning [60]
  - Providing coverage: Q-learning [54]; DQN [56]; DDPG [57]; MARL [55]
- Scheduling and resource management:
  - Event scheduling: Q-learning [81]
  - Resource allocation: RL [82]; MARL [18])
Fig. 7: RL taxonomy.
(Fig. 8 shows a base station (server) sharing the global model w_t with a selected client sub-set of C = 3 UAVs, with parameters (B = 50, E = 1, k = 2), (B = ∞, E = 10, k = 4), and (B = ∞, E = 10, k = 5), each of which returns its local update w^k_{t+1} over the FL links.)
compared to the other clients. Lastly, FL performs well on non-independent and identically distributed (non-IID) data; for example, the partition of data observed by a single drone cannot be representative of the overall data of the system, simply because the drone may be viewing only one part of a given process.

Fig. 8: Federated learning principle.
C. FL solutions for UAVs

1) FL for resource allocation and scheduling: In some UAV applications, such as autonomous path planning and collision avoidance in cooperative networks, communication is needed with extremely low errors and delays. In this context, the authors in [85], [86] investigate enabling ultra-reliable low-latency communication for a network of vehicular users. Although the work carried out does not target UAVs in particular, we think that the same methodology applies to a UAV network. In addition, we believe that more optimistic results are expected for the UAV network, as UAVs are more likely to establish LOS links. Consequently, the LOS link probability modeled in this work is expected to be higher. In short, the authors propose a joint power and resource allocation framework for an ultra-reliable and low-latency network of vehicles. Reliability is investigated in terms of probabilistic queue length, where extreme events are detected and handled using a powerful statistical tool called extreme value theory (EVT). An extreme event occurs when the queue length for a given vehicle exceeds a predefined threshold with non-negligible probability. The extreme event detection is done in a distributed manner by applying the FL principle. In other words, instead of applying a classical EVT method based on maximum likelihood estimation to learn the parameters of the queue of each node by sharing the queues with a central node, an FL algorithm simply allows the distributed nodes to keep their data and run the same model in a decentralized way, finally revealing extreme events. Remarkably, the FL-based method achieves the same performance as a centralized method, but with a significant reduction in data transfer that reaches 79%. This shows how FL can enable privacy in the network with the same performance as classical algorithms. Moreover, in the same context of resource allocation, the authors in [87] have investigated task and resource allocation for high-altitude balloon networks.
It is worth mentioning that this type of network bears a lot of similarity to a UAV network, as the high-altitude balloons operate as wireless base stations. The authors formulate an optimization problem for a mobile-edge-computing-enabled balloon network by minimizing energy and time consumption. However, solving this problem first requires specifying the user association with the high-altitude balloons. To solve the latter issue, a distributed SVM-based FL model was proposed to determine to which HAB every user connects. As usual, the FL principle guarantees privacy by minimizing data sharing across the network. To improve the efficiency of an Internet-of-Vehicles network, the authors in [88] suggest the use of UAVs as relays in order to overcome the problem of communication failure while executing an FL task. To do so, the authors propose the formation of a coalition of UAVs in order to facilitate the training process by improving the communication efficiency level. Each UAV in a coalition participates in the training in a sequential way; in other words, after the nearest UAV completes its maximum number of iterations, the second nearest UAV takes over and continues training the model, and so on, until all required iterations are done. A reward is received by each UAV depending on the number of iterations performed. In the same context, an auction was designed for the UAVs to find the optimal allocation that maximizes the profits of the drones.
2) FL for UAV path control:
The authors in [89] investigate the control of a massive number of UAVs starting from a source point and aiming to reach a destination spot. The UAVs' movement is perturbed by wind, which is the main source of randomness in the problem. This perturbation can result in fatal collisions between UAVs, so a path control scheme is proposed to avoid such scenarios. The authors use the mean-field game framework to control the UAV paths. However, in this framework, complex differential equations must be solved analytically, which is not feasible for real-time applications and especially for constrained networks. Approximate solutions are therefore proposed, based on one artificial neural network (ANN) for each of the two differential equations. At this level, even approximating the solution via DL is not enough for the convergence of the mean-field game framework. Thus, FL is used to share the model parameters of the two NNs between the UAVs, and as a result each UAV is able to account in its learning for the effect of samples that it cannot observe locally.
3) FL for UAV network security:
In [90], Flying Ad-Hoc Network (FANET) security is studied in depth. A FANET is a decentralized communication network composed of a number of UAVs. This type of network is vulnerable to jamming attacks that disrupt the communication at the receiver. To avoid such scenarios, the authors propose an FL-assisted solution for jamming attack detection. Several reasons stand behind selecting FL as a potential solution for FANETs. First, FANETs are usually heterogeneous networks in terms of power consumption constraints and communication range. Second, the data available at the different nodes is unbalanced. Lastly, the number of interacting nodes is huge. As we have already mentioned, FL performs well on these types of setups. Moreover, the FL technique is enhanced by a client selection algorithm based on the Dempster-Shafer theory of evidence. This technique enables a user-group prioritization mechanism that selects better clients for computing the global update to the model. The numerical results, elaborated on two different datasets, show that FL consistently outperforms distributed learning in many different setups. Furthermore, the client-selection-based FL model itself outperforms the traditional FL algorithm. This result is due to the differing latency and bandwidth available at each single UAV.
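As a toy illustration of how Dempster-Shafer evidence combination can drive client prioritization, the sketch below fuses two independent mass assignments (say, derived from latency and bandwidth measurements) over the frame {good, bad}. The mass values and the selection threshold are illustrative assumptions, not the actual design of [90]:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination over the frame {'good', 'bad'},
    with 'theta' denoting total ignorance (mass on the whole frame)."""
    combined = {'good': 0.0, 'bad': 0.0, 'theta': 0.0}
    conflict = 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            if a == 'theta':
                focal = b                # theta intersect b = b
            elif b == 'theta' or a == b:
                focal = a                # a intersect theta = a, or a = b
            else:
                conflict += pa * pb      # good intersect bad = empty set
                continue
            combined[focal] += pa * pb
    k = 1.0 - conflict                   # normalization constant
    return {s: v / k for s, v in combined.items()}

# Two independent evidence sources for one client (illustrative masses)
m_latency = {'good': 0.6, 'bad': 0.1, 'theta': 0.3}
m_bandwidth = {'good': 0.5, 'bad': 0.2, 'theta': 0.3}
belief = dempster_combine(m_latency, m_bandwidth)
selected = belief['good'] > 0.5          # illustrative selection threshold
```

The combined belief in 'good' exceeds either source alone when the two sources agree, which is what allows the server to prioritize clients whose quality indicators are consistently favorable.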
4) FL for content caching:
To address one of the drawbacks of 5G, namely the increased delay caused by the significant activity and congestion at the backhaul links, 6G networks employ content caching at the small-cell base stations, which are often flying UAVs acting as base stations. In this regard, content caching is considered a good alternative given the limited computing capacity and memory of the UAVs. In this context, the paper [90] investigates an intelligent caching technique for a 6G heterogeneous aerial-terrestrial network composed of heterogeneous base stations, such as UAVs and terrestrial remote radio heads. The proposed solution is based on FL techniques, and hence users are no longer required to explicitly share their reports and content preferences. Instead, a heterogeneous computing platform (HCP) proposed by the authors accurately predicts the content to cache at the different base stations depending on the mobile users' preferences. In this setup, the HCP plays the role of the server, and the different nodes of the network only share updates to the global model in a secure manner. Technically, a CNN is used so that the HCP learns the most popular files to cache at the heterogeneous base stations, and the optimization of the loss function is done via SGD, as described previously in Sec. IV-A. The HCP-based FL solution was tested on two different datasets of movie ratings by users and proved its efficiency compared to other baseline methods.
5) FL for swarm UAVs:
In [91], an optimization problem is formulated to design joint power allocation and scheduling for a UAV swarm network. The considered network is composed of one leader UAV and a group of following UAVs. Each following UAV runs an FL algorithm on its local data and then sends the update to the leader UAV, which aggregates all the local updates in order to perform a global update to the global model. While the updates are being transmitted between the UAVs, several phenomena affect the wireless transmissions, such as fading, wind, transmission delay, antenna gain and deviation, and interference. In the same work, the impact of these wireless transmission factors on the performance of the FL algorithm is analyzed, and the numerical simulations study the effect of multiple wireless parameters on the convergence of FL.
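The effect of unreliable wireless links on leader-side aggregation can be sketched as follows. Here link outages caused by fading or antenna deviation are collapsed into a single Bernoulli delivery probability, and the local "training" is a noisy step toward a scalar optimum; both are our own simplifications, not the model of [91]:

```python
import numpy as np

def leader_aggregate(global_w, follower_updates, link_up):
    """The leader UAV averages only the updates that survive the wireless
    links; link_up[i] is a Bernoulli proxy for per-link outages."""
    received = [w for w, up in zip(follower_updates, link_up) if up]
    if not received:               # every transmission lost: keep the old model
        return global_w
    return float(np.mean(received))

rng = np.random.default_rng(1)
optimum, w = 3.0, 0.0
for _ in range(200):
    # Each of 5 followers takes a noisy local step toward the optimum
    updates = [w + 0.2 * (optimum - w) + 0.05 * rng.normal() for _ in range(5)]
    link_up = rng.random(5) < 0.7  # 70% per-link delivery probability
    w = leader_aggregate(w, updates, link_up)
```

Even with 30% of updates lost per round, the global model still converges here; lowering the delivery probability slows convergence and raises the variance of the final model, which mirrors the kind of wireless-parameter sensitivity studied in [91].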
D. Client selection strategies for UAVs
Many published works related to FL rely on the optimistic assumption that all clients will unconditionally participate in FL whenever they are called by the server. However, it is obvious that deciding which nodes (UAVs) should participate in the learning is a sensitive task for FL that could influence the overall accuracy. Thus, in this section, we cover some client selection techniques and participation strategies that could be of great importance to FL.
Related to what was stated previously, the authors in [92] propose a contract-matching solution in which each UAV receives a reward according to its type. The contract proposed by the authors is multi-dimensional so that it takes into account the different sources of heterogeneity in the UAV types. After setting the contracts, a matching-based algorithm assigns the optimal UAVs to each region. The UAV parameters considered while designing the contracts are the sensing, computation, and transmission models. The proposed method selects the UAVs with the lowest costs for each target sub-region.
In the same context, in [93] a new algorithm called FedCS is proposed to mitigate the problem of poor channel quality or resource shortages at the clients, which could badly affect the training process. The proposed solution enables client selection based on the available resources of each client. The numerical results show that the proposed algorithm accelerates the training process significantly.
Some other works have already studied the use of incentive mechanisms for FL, which are used to motivate the nodes to join the training. For example, in [94], the authors design an incentive mechanism for FL based on game theory and deep reinforcement learning techniques. According to the proposed design, the edge nodes can adapt their training strategies.
However, in the same work, some of the assumptions made midway raise doubts about the applicability of the method in real situations. For instance, the assumptions that data quality is the same at all nodes and that the data is independent and identically distributed (IID) do not always hold, especially for FL, which is known to perform well on non-IID data. In the same vein, the authors in [95] used contract theory to design an effective incentive mechanism for the FL framework. Clients receive rewards for their participation in the FL training according to their data quality and accuracy. The proposed method encourages nodes with good-quality data to join the learning process, thereby improving the overall accuracy.
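Returning to the resource-aware selection idea of FedCS [93] discussed above, a deadline-driven round can be sketched in a few lines. This is a greatly simplified, greedy flavor of the idea (admit the fastest clients until the round deadline is exhausted); the timing values and the additive time model are illustrative assumptions, not the protocol of [93]:

```python
def fedcs_select(clients, deadline):
    """Greedy, FedCS-flavored selection: admit clients in order of their
    estimated contribution time (compute + upload) while the round deadline
    allows. clients is a list of (name, compute_time_s, upload_time_s)."""
    ranked = sorted(clients, key=lambda c: c[1] + c[2])
    selected, elapsed = [], 0.0
    for name, t_compute, t_upload in ranked:
        if elapsed + t_compute + t_upload <= deadline:
            selected.append(name)
            elapsed += t_compute + t_upload
    return selected

# (name, compute time [s], upload time [s]) -- hypothetical UAV clients
uavs = [("uav1", 2.0, 1.0), ("uav2", 5.0, 4.0),
        ("uav3", 1.0, 0.5), ("uav4", 3.0, 2.0)]
chosen = fedcs_select(uavs, deadline=10.0)
```

A slow or poorly connected client ("uav2" above) is simply left out of the round rather than stalling it, which is the mechanism behind the reported training speed-up.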
E. Discussion and future works
We would like to emphasize that FL is not necessarily applied only to UAV or mobile user networks; instead, it is being used successfully in many daily applications. For example, Google's Gboard implements FL to train an RNN that predicts the next word as the user types on the keyboard. However, we would like to point out that it is not clear how to select certain parameters in the FL algorithm as defined by Google in [83]. For example, the client selection process has been defined as random, which raises the question: is there a better way to assign clients in each round of the FL algorithm?
Although we have covered so far some of the works done in relation to client selection for UAV-based networks, this issue needs to be studied in depth for UAV networks, where several parameters could affect the client selection process. From a wireless communication perspective, channel quality, LOS condition, available data, and battery are crucial factors that could significantly affect the client selection process. To be specific, those parameters could make a subset of users more suitable to be selected for the FL training. Without a doubt, the client selection topic seems to be a very exciting and challenging direction for the future.
Furthermore, while a major part of the scientific community asserts that the primary purpose of FL is data privacy, others doubt this assumption and argue that even sharing only updates over the wireless network is not secure. In fact, the authors in [96] propose a secure aggregation algorithm for FL. This algorithm allows users to encrypt their local model updates before transmitting them over the channel.
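The pairwise additive masking idea underlying secure aggregation schemes such as [96] can be sketched in a few lines. In practice the masks are derived from cryptographically shared keys and client dropout must be handled; the sketch below only demonstrates the cancellation property, with illustrative mask ranges and toy scalar updates:

```python
import itertools
import random

def masked_updates(updates, seed=0):
    """Pairwise additive masking: clients i < j agree on a random mask r_ij;
    i adds it, j subtracts it. Each masked update looks random on its own,
    but all masks cancel in the server-side sum."""
    n = len(updates)
    rng = random.Random(seed)
    masks = {pair: rng.uniform(-100.0, 100.0)
             for pair in itertools.combinations(range(n), 2)}
    masked = []
    for i, u in enumerate(updates):
        m = u
        for j in range(n):
            if j == i:
                continue
            r = masks[(min(i, j), max(i, j))]
            m += r if i < j else -r
        masked.append(m)
    return masked

updates = [1.0, 2.0, 3.0, 4.0]
masked = masked_updates(updates)
aggregate = sum(masked)   # recovers (up to rounding) sum(updates) = 10.0
```

The server thus learns only the aggregate, never an individual client's update, but at the cost of extra key agreement and computation, which is exactly the complexity trade-off discussed next.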
The authors argue that it has been shown that, for a given neural network, the shared model parameters can be used to mount an attack and reconstruct training examples. Hence, based on the latter observation, FL loses one of its major advantages, as using encryption will certainly increase the overall complexity of the system, especially when a high level of privacy is needed in the network.
In addition to security issues, more attention should be paid to the convergence of an FL algorithm, which is not always guaranteed. Convergence depends on the specific type of problem, such as the convexity of the loss function and the number of updates performed on the model. For example, with a poor selection of clients, where the designated nodes are not available or do not have enough data, the optimization of the overall model will fail. One can notice that this issue overlaps with the client selection problem mentioned previously; however, it is related not only to client selection but also to the type of the loss function.
Another point that could be discussed regarding the applicability of the FL framework to ML algorithms is the fact that FL can only be applied to supervised ML problems where the input data is labeled. Therefore, future work can be oriented towards the design of an FL version adapted to unsupervised ML problems with unlabeled data.
To sum up, even with all the above-mentioned issues, FL remains a good alternative for UAV-based networks. As a result, we hope to see more future works in this area applying the FL framework to many of the supervised learning problems that we have already covered in the supervised ML section of this survey.
V. CONCLUDING NOTES
In this paper, we presented the major state-of-the-art works relating ML techniques to UAV-based networks. We are motivated by the fact that UAV networks are considered key components in a plethora of new applications, such as smart city architectures and the next-generation 6G wireless networks.
We started by providing an extensive overview of unsupervised and supervised machine learning techniques that have been applied in UAV networks. Then, we covered the RL area and summarized a number of the relevant works that implemented this ML technique for UAVs. We ended by going over a recent privacy-preserving area of ML, namely FL. Special emphasis was placed on providing constructive criticism of some existing works and on exposing open issues to the reader.
Finally, we highlight that intelligent UAV networks are a fertile area of research that should be further explored.
REFERENCES
[1] B. Li, Z. Fei, and Y. Zhang, "UAV communications for 5G and beyond: Recent advances and future trends," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2241–2263, 2019.
[2] Y. Zeng, Q. Wu, and R. Zhang, "Accessing from the sky: A tutorial on UAV communications for 5G and beyond," Proceedings of the IEEE, vol. 107, no. 12, pp. 2327–2375, 2019.
[3] X. Cao, P. Yang, M. Alzenad, X. Xi, D. Wu, and H. Yanikomeroglu, "Airborne communication networks: A survey," IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 1907–1926, 2018.
[4] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah, "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2334–2360, 2019.
[5] B. Zhang, L. Tang, and M. Roemer, "Probabilistic weather forecasting analysis for unmanned aerial vehicle path planning," Journal of Guidance, Control, and Dynamics, vol. 37, no. 1, pp. 309–312, 2014.
[6] A. Puri, "A survey of unmanned aerial vehicles (UAV) for traffic surveillance," Department of Computer Science and Engineering, University of South Florida, pp. 1–29, 2005.
[7] Q. Yang, L. Shi, and L. Lin, "Plot-scale rice grain yield estimation using UAV-based remotely sensed images via CNN with time-invariant deep features decomposition," in IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019, pp. 7180–7183.
[8] G. T. Miyoshi, M. d. S. Arruda, L. P. Osco, J. Marcato Junior, D. N. Gonçalves, N. N. Imai, A. M. G. Tommaselli, E. Honkavaara, and W. N. Gonçalves, "A novel deep learning method to identify single tree species in UAV-based hyperspectral images," Remote Sensing, vol. 12, no. 8, p. 1294, 2020.
[9] R. Zhang, C. Wang, X. Hu, Y. Liu, S. Chen et al., "Weed location and recognition based on UAV imaging and deep learning," International Journal of Precision Agricultural Aviation, vol. 3, no. 1, 2020.
[10] Q. Zhang, M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Machine learning for predictive on-demand deployment of UAVs for wireless communications," in , 2018, pp. 1–6.
[11] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless communications with unmanned aerial vehicles: Opportunities and challenges," IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, May 2016.
[12] M. Lahmeri, M. A. Kishk, and M. Alouini, "Stochastic geometry-based analysis of airborne base stations with laser-powered UAVs," IEEE Communications Letters, vol. 24, no. 1, pp. 173–177, 2020.
[13] O. M. Bushnaq, M. A. Kishk, A. Çelik, M.-S. Alouini, and T. Y. Al-Naffouri, "Cellular traffic offloading through tethered-UAV deployment and user association," arXiv preprint arXiv:2003.00713, 2020.
[14] Y. Qin, M. A. Kishk, and M. Alouini, "Performance evaluation of UAV-enabled cellular networks with battery-limited drones," IEEE Communications Letters, pp. 1–1, 2020.
[15] M. A. Kishk, A. Bader, and M. Alouini, "On the 3-D placement of airborne base stations using tethered UAVs," IEEE Transactions on Communications, vol. 68, no. 8, pp. 5202–5215, 2020.
[16] M. A. Kishk, A. Bader, and M.-S. Alouini, "Aerial base stations deployment in 6G cellular networks using tethered drones: The mobility and endurance trade-off," 2020.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[18] J. Cui, Y. Liu, and A. Nallanathan, "Multi-agent reinforcement learning-based resource allocation for UAV networks," IEEE Transactions on Wireless Communications, vol. 19, no. 2, pp. 729–743, 2019.
[19] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, 2019.
[20] C. Goerzen, Z. Kong, and B. Mettler, "A survey of motion planning algorithms from the perspective of autonomous UAV guidance," Journal of Intelligent and Robotic Systems, vol. 57, no. 1-4, p. 65, 2010.
[21] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. Campoy, "A review of deep learning methods and applications for unmanned aerial vehicles," Journal of Sensors, vol. 2017, 2017.
[22] P. S. Bithas, E. T. Michailidis, N. Nomikos, D. Vouyioukas, and A. G. Kanatas, "A survey on machine-learning techniques for UAV-based communications," Sensors, vol. 19, no. 23, p. 5170, 2019.
[23] J. Chen, U. Yatnalli, and D. Gesbert, "Learning radio maps for UAV-aided wireless networks: A segmented regression approach," in , 2017, pp. 1–6.
[24] K. Xiao, J. Zhao, Y. He, and S. Yu, "Trajectory prediction of UAV in smart city using recurrent neural networks," in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–6.
[25] Y. Zhang, J. Wen, G. Yang, Z. He, and X. Luo, "Air-to-air path loss prediction based on machine learning methods in urban environments," Wireless Communications and Mobile Computing, vol. 2018.
[26] S. Alsamhi, O. Ma, and S. Ansari, "Predictive estimation of the optimal signal strength from unmanned aerial vehicle over internet of things using ANN," 05 2018.
[27] R. D. Timoteo, D. C. Cunha, and G. D. Cavalcanti, "A proposal for path loss prediction in urban environments using support vector regression," in Proc. Advanced Int. Conf. Telecommun., 2014, pp. 1–5.
[28] J. Kim, C. Park, J. Ahn, Y. Ko, J. Park, and J. C. Gallagher, "Real-time UAV sound detection and analysis system," in , 2017, pp. 1–5.
[29] Y. Xu, G. Yu, Y. Wang, X. Wu, and Y. Ma, "Car detection from low-altitude UAV imagery with the faster R-CNN," Journal of Advanced Transportation, vol. 2017, 2017.
[30] N. Attari, F. Ofli, M. Awad, J. Lucas, and S. Chawla, "Nazr-CNN: Fine-grained classification of UAV imagery for damage assessment," in . IEEE, 2017, pp. 50–59.
[31] Y. Bazi and F. Melgani, "Convolutional SVM networks for object detection in UAV imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 6, pp. 3107–3118, 2018.
[32] M. Lan, Y. Zhang, L. Zhang, and B. Du, "Defect detection from UAV images based on region-based CNNs," in . IEEE, 2018, pp. 385–390.
[33] B. Zhang, L. Tang, and M. Roemer, "Probabilistic weather forecasting analysis for unmanned aerial vehicle path planning," Journal of Guidance, Control, and Dynamics, vol. 37, no. 1, pp. 309–312, 2013.
[34] A. Nassar, K. Amer, R. ElHakim, and M. ElHelw, "A deep CNN-based framework for enhanced aerial imagery registration with applications to UAV geolocalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1513–1523.
[35] C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C.-S. Bouganis, "DroNet: Efficient convolutional neural network detector for real-time UAV applications," in . IEEE, 2018, pp. 967–972.
[36] J. Vazquez-Nicolas, E. Zamora, I. Gonzalez-Hernandez, R. Lozano, and H. Sossa, "Towards automatic inspection: Crack recognition based on quadrotor UAV-taken images," in . IEEE, 2018, pp. 654–659.
[37] A. A. d. Santos, J. Marcato Junior, M. S. Araújo, D. R. Di Martini, E. C. Tetila, H. L. Siqueira, C. Aoki, A. Eltner, E. T. Matsubara, H. Pistori et al., "Assessment of CNN-based methods for individual tree detection on images captured by RGB cameras attached to UAVs," Sensors, vol. 19, no. 16, p. 3595, 2019.
[38] H. A. Al-Najjar, B. Kalantar, B. Pradhan, V. Saeidi, A. A. Halin, N. Ueda, and S. Mansor, "Land cover classification from fused DSM and UAV images using convolutional neural networks," Remote Sensing, vol. 11, no. 12, p. 1461, 2019.
[39] Y. Bazi and F. Melgani, "Convolutional SVM networks for object detection in UAV imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 6, pp. 3107–3118, 2018.
[40] J. Xiong, Z. Liu, S. Chen, B. Liu, Z. Zheng, Z. Zhong, Z. Yang, and H. Peng, "Visual detection of green mangoes by an unmanned aerial vehicle in orchards based on a deep learning method," Biosystems Engineering, vol. 194, pp. 261–272, 2020.
[41] Y. Liu, C. Cen, Y. Che, R. Ke, Y. Ma, and Y. Ma, "Detection of maize tassels from UAV RGB imagery with faster R-CNN," Remote Sensing, vol. 12, no. 2, p. 338, 2020.
[42] L. P. Osco, M. d. S. de Arruda, J. M. Junior, N. B. da Silva, A. P. M. Ramos, É. A. S. Moryia, N. N. Imai, D. R. Pereira, J. E. Creste, E. T. Matsubara et al., "A convolutional neural network approach for counting and geolocating citrus-trees in UAV multispectral imagery," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 160, pp. 97–106, 2020.
[43] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. Campoy, "A review of deep learning methods and applications for unmanned aerial vehicles," Journal of Sensors, vol. 2017, 2017.
[44] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[45] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," 2014.
[46] N. Imanberdiyev, C. Fu, E. Kayacan, and I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning," in . IEEE, 2016, pp. 1–6.
[47] U. Challita, W. Saad, and C. Bettstetter, "Interference management for cellular-connected UAVs: A deep reinforcement learning approach," IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2125–2140, 2019.
[48] X. Liu, Y. Liu, and Y. Chen, "Reinforcement learning in multiple-UAV networks: Deployment and movement design," IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8036–8049, 2019.
[49] H. Bayerlein, P. De Kerret, and D. Gesbert, "Trajectory optimization for autonomous flying base station via reinforcement learning," in . IEEE, 2018, pp. 1–5.
[50] I. Kim, S. Shin, J. Wu, S.-D. Kim, and C.-G. Kim, "Obstacle avoidance path planning for UAV using reinforcement learning under simulated environment," in IASER 3rd International Conference on Electronics, Electrical Engineering, Computer Science, Okinawa, 2017, pp. 34–36.
[51] H. X. Pham, H. M. La, D. Feil-Seifer, and L. V. Nguyen, "Autonomous UAV navigation using reinforcement learning," arXiv preprint arXiv:1801.05086, 2018.
[52] B. Li and Y. Wu, "Path planning for UAV ground target tracking via deep reinforcement learning," IEEE Access, vol. 8, pp. 29064–29074, 2020.
[53] N. Imanberdiyev, C. Fu, E. Kayacan, and I. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning," in , 2016, pp. 1–6.
[54] X. Liu, Y. Liu, and Y. Chen, "Reinforcement learning in multiple-UAV networks: Deployment and movement design," IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8036–8049, 2019.
[55] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, "Cooperative and distributed reinforcement learning of drones for field coverage," arXiv preprint arXiv:1803.07250, 2018.
[56] H. Huang, Y. Yang, H. Wang, Z. Ding, H. Sari, and F. Adachi, "Deep reinforcement learning for UAV navigation through massive MIMO technique," IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 1117–1121, 2020.
[57] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, "Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach," IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059–2070, 2018.
[58] G. Muñoz, C. Barrado, E. Çetin, and E. Salami, "Deep reinforcement learning for drone delivery," Drones, vol. 3, no. 3, p. 72, 2019.
[59] K. Kersandt, G. Munoz, and C. Barrado, "Self-training by reinforcement learning for full-autonomous drones of the future," in , 2018, pp. 1–10.
[60] E. Camci and E. Kayacan, "Waitress quadcopter explores how to serve drinks by reinforcement learning," in . IEEE, 2016, pp. 28–32.
[61] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2124–2136, 2019.
[62] O. Bouhamed, H. Ghazzai, H. Besbes, and Y. Massoud, "Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach," arXiv preprint arXiv:2003.10923, 2020.
[63] M. Zhu, X.-Y. Liu, and X. Wang, "Deep reinforcement learning for unmanned aerial vehicle-assisted vehicular networks," arXiv preprint arXiv:1906.05015, 2019.
[64] S.-Y. Shin, Y.-W. Kang, and Y.-G. Kim, "Obstacle avoidance drone by deep reinforcement learning and its racing with human pilot," Applied Sciences, vol. 9, no. 24, p. 5571, 2019.
[65] A. Singla, S. Padakandla, and S. Bhatnagar, "Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge," IEEE Transactions on Intelligent Transportation Systems, 2019.
[66] I. Kim, S. Shin, J. Wu, S.-D. Kim, and C.-G. Kim, "Obstacle avoidance path planning for UAV using reinforcement learning under simulated environment," in IASER 3rd International Conference on Electronics, Electrical Engineering, Computer Science, Okinawa, 2017, pp. 34–36.
[67] S.-Y. Shin, Y.-W. Kang, and Y.-G. Kim, "Reward-driven U-net training for obstacle avoidance drone," Expert Systems with Applications, vol. 143, p. 113064, 2020.
[68] Z. Ma, C. Wang, Y. Niu, X. Wang, and L. Shen, "A saliency-based reinforcement learning approach for a UAV to avoid flying obstacles," Robotics and Autonomous Systems, vol. 100, pp. 108–118, 2018.
[69] X. Han, J. Wang, J. Xue, and Q. Zhang, "Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning," in . IEEE, 2019, pp. 1–6.
[70] K. Wan, X. Gao, Z. Hu, and G. Wu, "Robust motion control for UAV in dynamic uncertain environments using deep reinforcement learning," Remote Sensing, vol. 12, no. 4, p. 640, 2020.
[71] L. He, N. Aouf, J. F. Whidborne, and B. Song, "Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data," arXiv preprint arXiv:2008.02521, 2020.
[72] O. Walker, F. Vanegas, F. Gonzalez, and S. Koenig, "A deep reinforcement learning framework for UAV navigation in indoor environments," in . IEEE, 2019, pp. 1–14.
[73] F. Bin, F. XiaoFeng, and X. Shuo, "Research on cooperative collision avoidance problem of multiple UAV based on reinforcement learning," in . IEEE, 2017, pp. 103–109.
[74] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, "Uncertainty-aware reinforcement learning for collision avoidance," arXiv preprint arXiv:1702.01182, 2017.
[75] B. G. Maciel-Pearson, L. Marchegiani, S. Akcay, A. Atapour-Abarghouei, J. Garforth, and T. P. Breckon, "Online deep reinforcement learning for autonomous UAV navigation and exploration of outdoor environments," arXiv preprint arXiv:1912.05684, 2019.
[76] A. Rodriguez-Ramos, C. Sampedro, H. Bavle, P. De La Puente, and P. Campoy, "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform," Journal of Intelligent & Robotic Systems, vol. 93, no. 1-2, pp. 351–366, 2019.
[77] R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sutton, and A. Cangelosi, "Autonomous quadrotor landing using deep reinforcement learning," arXiv preprint arXiv:1709.03339, 2017.
[78] A. Rodriguez-Ramos, C. Sampedro, H. Bavle, I. G. Moreno, and P. Campoy, "A deep reinforcement learning technique for vision-based autonomous multirotor landing on a moving platform," in . IEEE, 2018, pp. 1010–1017.
[79] R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sutton, and A. Cangelosi, "Toward end-to-end control for UAV autonomous landing via deep reinforcement learning," in , 2018, pp. 115–123.
[80] H. Lu, Y. Li, S. Mu, D. Wang, H. Kim, and S. Serikawa, "Motor anomaly detection for unmanned aerial vehicles using reinforcement learning," IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2315–2322, 2018.
[81] O. Bouhamed, H. Ghazzai, H. Besbes, and Y. Massoud, "A generic spatiotemporal scheduling for autonomous UAVs: A reinforcement learning-based approach," IEEE Open Journal of Vehicular Technology, vol. 1, pp. 93–106, 2020.
[82] G. Faraci, A. Raciti, S. A. Rizzo, and G. Schembra, "Green wireless power transfer system for a drone fleet managed by reinforcement learning in smart industry," Applied Energy, vol. 259, p. 114204, 2020.
[83] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[84] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, 2019.
[85] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, "Federated learning for ultra-reliable low-latency V2V communications," in . IEEE, 2018, pp. 1–7.
[86] ——, "Distributed federated learning for ultra-reliable low-latency vehicular communications," IEEE Transactions on Communications, vol. 68, no. 2, pp. 1146–1159, 2019.
[87] S. Wang, M. Chen, C. Yin, W. Saad, C. S. Hong, S. Cui, and H. V. Poor, "Federated learning for task and resource allocation in wireless high altitude balloon networks," arXiv preprint arXiv:2003.09375, 2020.
[88] J. Shyuan Ng, W. Y. B. Lim, H.-N. Dai, Z. Xiong, J. Huang, D. Niyato, X.-S. Hua, C. Leung, and C. Miao, "Joint auction-coalition formation framework for communication-efficient federated learning in UAV-enabled internet of vehicles," arXiv e-prints, pp. arXiv–2007, 2020.
[89] H. Shiri, J. Park, and M. Bennis, "Communication-efficient massive UAV online path control: Federated learning meets mean-field game theory," arXiv preprint arXiv:2003.04451, 2020.
[90] N. I. Mowla, N. H. Tran, I. Doh, and K. Chae, "Federated learning-based cognitive detection of jamming attack in flying ad-hoc network," IEEE Access, vol. 8, pp. 4338–4350, 2019.
[91] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, "Federated learning in the sky: Joint power allocation and scheduling with UAV swarms," arXiv preprint arXiv:2002.08196, 2020.
[92] B. Lim, J. Huang, Z. Xiong, J. Kang, D. Niyato, X.-S. Hua, C. Leung, and C. Miao, "Towards federated learning in UAV-enabled internet of vehicles: A multi-dimensional contract-matching approach," 04 2020.
[93] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–7.
[94] Y. Zhan, P. Li, Z. Qu, D. Zeng, and S. Guo, "A learning-based incentive mechanism for federated learning," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6360–6368, 2020.
[95] J. Kang, Z. Xiong, D. Niyato, H. Yu, Y. Liang, and D. I. Kim, "Incentive design for efficient federated learning in mobile networks: A contract theory approach," in , 2019, pp. 1–5.
[96] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for federated learning on user-held data," arXiv preprint arXiv:1611.04482.