Joint Resource Block and Beamforming Optimization for Cellular-Connected UAV Networks: A Hybrid D3QN-DDPG Approach
Yuanjian Li, A. Hamid Aghvami, Fellow, IEEE, and Yansha Deng, Member, IEEE
Abstract
Integrating unmanned aerial vehicles (UAVs) into existing cellular networks, which are delicately designed for terrestrial transmissions, faces many challenges. One of the most striking concerns is how to admit UAVs into cellular networks with little (or even no) adverse effect on ground users. In this paper, a cellular-connected UAV network is considered, in which multiple UAVs receive messages from terrestrial base stations (BSs) in the down-link, while the BSs are serving ground users in their cells. Moreover, line-of-sight (LoS) wireless links are more likely to be established in ground-to-air (G2A) transmission scenarios: on one hand, UAVs may potentially access more BSs; on the other hand, more co-channel interference can be involved. To enhance the wireless transmission quality between UAVs and BSs while protecting the ground users from being interfered with by the G2A communications, a joint time-frequency resource block (RB) and beamforming optimization problem is proposed and investigated in this paper. Specifically, for a given flying trajectory, the ergodic outage duration (EOD) of a UAV is minimized with the aid of RB allocation and beamforming design. Unfortunately, the proposed optimization problem is hard, if not impossible, to solve via standard optimization techniques. To crack this nut, a deep reinforcement learning (DRL) solution is proposed, where a deep double duelling Q network (D3QN) and deep deterministic policy gradient (DDPG) are invoked to deal with RB allocation in the discrete action domain and beamforming design in the continuous action regime, respectively. The hybrid D3QN-DDPG solution solves the outer Markov decision process (MDP) and the inner MDP interactively, so that it achieves a sub-optimal result for the considered optimization problem. Simulation results illustrate the effectiveness of the proposed hybrid D3QN-DDPG algorithm, compared to exhaustive/random search based benchmarks.
Yuanjian Li, A. Hamid Aghvami and Yansha Deng are with the Centre for Telecommunications Research (CTR), King's College London, London WC2R 2LS, U.K. (e-mail: {yuanjian.li, hamid.aghvami, yansha.deng}@kcl.ac.uk). This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION
In recent years, drones, a.k.a. unmanned aerial vehicles (UAVs), have been intensively applied in both military and enterprise applications, offering many attractive advantages over terrestrial transceivers, e.g., flexible deployment, larger coverage and the additional degrees of freedom enabled by their controllable mobility [1]–[3]. According to statistics from the Federal Aviation Administration (FAA), as of July 2018, more than 100,000 individuals had been certified to operate drones for commercial and recreational activities [4]. Besides, a report from Business Insider (BI) Intelligence estimates that consumer drone shipments will hit no less than 29 million by the year 2021, while commercial drone shipments will reach 805,000 and the overall sales of drones may surpass 12 billion USD [5]. This skyrocketing UAV market has also attracted attention from the wireless communication academia [6]–[9].

In most cases, UAVs communicate with ground transceivers through simple point-to-point channels over unlicensed spectrum, resulting in constrained transmission performance. To circumvent this, cellular-connected UAV communication is deemed a promising solution, where base stations (BSs) in terrestrial cellular networks are leveraged to achieve more satisfactory ground-to-air (G2A) transmission quality. Unfortunately, the existing cellular networks are exclusively established for serving ground user equipments (GUEs), barely considering aerial UEs. Specifically, antennas at BSs in current cellular networks are conventionally down-tilted towards the ground to mitigate terrestrial inter-cell interference (ICI), which means that UAVs can only be served via the side-lobes and satisfactory G2A connections cannot be guaranteed in general [10]. From the perspective of forthcoming 5G or 6G cellular networks, the main serving objects are still GUEs, and hence finding a proper way of admitting UAVs into cellular networks without posing negative impacts on terrestrial transmissions is inherently important. In fact, integrating drones into the existing cellular networks has already become one of the most important research directions, and is believed to further release the potential of drones in terms of reliability, coverage and throughput. Unlike terrestrial cellular transmissions, where non-line-of-sight (NLoS) pathloss appears more frequently, the first significant difference introduced by drones is that line-of-sight (LoS) links occur more often in G2A communications [11], [12], which acts as a double-edged sword. On one hand, LoS-dominant G2A links help relieve the severe multi-path fading, shadowing and pathloss that are very common "illnesses" in terrestrial transmissions due to blockages, e.g., buildings and trees. On the other hand, they may make drones generate stronger interference to BSs in the up-link (or suffer more severe interference from BSs in the down-link). Besides, drones can cover a larger region for data transmissions thanks to their high flying altitude, so greater macro-diversity gain can usually be achieved, because more BSs can cooperate to enhance the G2A communication quality in terms of throughput and reliability. Unfortunately, more co-channel interfering sources for the drones in the down-link may be involved as well (or, in the up-link, UAVs can act as interferers to more GUEs).
Therefore, the interference coordination issue for cellular-connected UAV networks is more intricate and must be treated seriously. Various interference management strategies have been investigated in the literature for terrestrial cellular transmission scenarios, e.g., inter-cell interference coordination (ICIC) [13], [14], cognitive beamforming [15] and coordinated multipoint (CoMP) communications [16]. However, they are most likely ineffective in handling the more sophisticated interference environment caused by UAVs with LoS-dominant G2A links and larger coverage. Therefore, interference management approaches adapted to cellular-connected UAV networks should be delicately designed to achieve efficient spectrum sharing with coexisting GUEs. To date, several related works have been devoted to interference management approaches for cellular-connected UAV networks [17]–[20]. Chandhar et al. [17] leveraged the multiple-input multiple-output (MIMO) technique to deal with the interference coordination problem of single-antenna UAV swarms served by a multiple-antenna BS. Senadhira et al. [18] studied the impacts of the UAV's trajectory and altitude for an up-link non-orthogonal multiple access (NOMA) cellular-connected UAV network, in which the ICI issue was dealt with via NOMA. However, protecting the GUEs located in the current cell or other cells within the coverage of UAVs was not considered in these works, which may significantly deteriorate the transmission performance of potential co-channel GUEs. Fortunately, some recent literature took care of the interference control issue while protecting GUEs in cellular-connected UAV networks [19], [20]. Liu et al. [19] proposed a new cooperative interference cancellation strategy for multi-beam cellular-connected UAV up-link transmissions, in which co-channel interference elimination and sum-rate maximization are investigated with the help of transmit beamforming design. Mei et al. [20] studied the interference mitigation issue in up-link communication from a UAV to BSs, where the weighted sum-rate of the UAV and GUEs is maximized via jointly optimizing up-link cell association and power allocation. However, these works contain practical limitations. First, they both assumed a fixed-location UAV in their considered models, without involving the UAV's mobility. Second, the G2A channel models they applied are based either on an oversimplified free-space pathloss channel model or on the more advanced probabilistic LoS channel model. It is worth noting that the probabilistic G2A channel model is statistical, which means that it can only reflect the G2A pathloss gain in an average manner, without considering the local building distribution where UAVs are actually deployed. Last but not least, most traditional optimization-based problems (e.g., those in [19], [20]) are highly non-convex and hard to tackle efficiently, even with adequate knowledge of the needed evaluation factors.

Motivated by the above observations, the interference coordination issue and beamforming design in down-link cellular-connected UAV networks are investigated in this paper. It is worth noting that the terrestrial transmissions between GUEs and BSs are protected from being contaminated by the down-link G2A channels. The main contributions of this paper can be summarized as follows.

• A joint time-frequency resource block (RB) allocation and beamforming design optimization problem is formulated to minimize the ergodic outage duration (EOD) of the UAV, with a given trajectory.
Specifically, the RB allocation assigns proper RB resources to UAVs while ensuring that the terrestrial transmissions are not violated by the potential co-channel interference generated at UAVs. To enhance the strength of the received signals at UAVs after RB allocation, transmit beamforming design is applied at the BSs.

• To deal with the difficulty of solving the proposed EOD minimization problem via traditional optimization-based methods, a deep reinforcement learning (DRL)-based solution is invoked, via mapping the proposed EOD minimization problem into an outer Markov decision process (MDP) and an inner MDP. The outer MDP reflects the dynamic RB possession environment at the BSs, while the inner MDP tracks the corresponding small-scale fading characteristics. The outer MDP contains a discrete action space (i.e., RB indexes), which is tackled by invoking a deep double duelling Q network (D3QN), while the continuous action space (i.e., beamforming vectors) in the inner MDP is dealt with by the deep deterministic policy gradient (DDPG) approach. The proposed hybrid D3QN-DDPG algorithm can optimize the EOD performance for UAVs via interacting with the outer and inner environments, without requiring prior knowledge of the channel model or propagation characteristics.

• In contrast to the majority of related literature adopting a statistical G2A channel model (e.g., the probabilistic G2A channel model), the LoS/NLoS G2A pathloss is determined in this paper by checking potential blockages between the UAV and BSs, according to one realization of the local building distribution suggested by the International Telecommunication Union (ITU) [21].
The considered G2A channel model is more practical than its statistical counterpart, which can only reflect the average pathloss gain over a large number of similar building distribution realizations, because the building distribution in a local area remains unchanged in practice.

The rest of this paper is organized as follows. Section II presents the system model, RB allocation criterion, channel model and problem formulation. Section III briefly introduces the preliminary knowledge of DRL, which is the necessary theoretical foundation for the proposed DRL-based solution. Section IV shows the proposed hybrid D3QN-DDPG algorithm. Simulation results are presented in Section V and conclusions are drawn in Section VI.

Figure 1: System model (desired down-link channels and co-channel interference links)

II. SYSTEM MODEL
In this paper, the joint optimization of RB allocation and beamforming design for a down-link cellular-connected UAV network is considered, where a set $\mathcal{B} = \{1, \ldots, B\}$ of $B$ terrestrial BSs serves a set $\mathcal{U} = \{1, \ldots, U\}$ of $U$ drone UEs (DUEs) and a set $\mathcal{G} = \{1, \ldots, G\}$ of $G$ GUEs using a set $\mathcal{K} = \{1, \ldots, K\}$ of $K$ RBs at each BS, in a given subregion (e.g., Fig. 1) of the cellular network. Each DUE is assumed to be equipped with a single antenna for receiving wireless information, and so is each GUE, while all the terrestrial BSs employ antenna arrays for message emission. Specifically, each terrestrial BS $b \in \mathcal{B}$ possesses $M$ antennas, serving $g_b$ GUEs with orthogonal RBs (so there is no intra-cell interference within each cell), where $g_b \geq 1, \forall b \in \mathcal{B}$, and $\sum_{b=1}^{B} g_b = G$. Different from the terrestrial transmission scenario, DUEs fly in the sky at relatively high altitudes, resulting in a higher probability of achieving LoS-dominant links from BSs. Thus, DUEs are able to connect with more BSs within their wireless coverage, which is a distinguishing feature compared to terrestrial transmissions. However, this characteristic is a double-edged sword, in terms of inducing not only more and stronger desired signals but also richer co-channel interference. To practically reflect this double-edged-sword feature, each DUE is considered to be associated with at least one BS when possible, taking advantage of macro-diversity gain from terrestrial BSs. Unfortunately, the assigned RB for a DUE might already be occupied by some GUEs due to heavy frequency reuse in cellular networks, severely interfering with the DUE via LoS-dominant channels. Therefore, RB allocation plays an important role in the considered cellular-connected UAV network. Besides, after RB assignment for a DUE, the wireless transmission performance can be enhanced by invoking the transmit beamforming technique at the corresponding serving BSs. Note that we do not consider a transmit power control strategy at each BS, and thus we fix $P_b = P$ for all terrestrial BSs. (Transmit power control is indeed an important approach for interference management in cellular networks. In our considered model, it is straightforward to infer that all BSs should communicate with their paired DUEs using the maximum transmit power; meanwhile, all the occupied BSs would be supposed to apply their minimum transmit power to reduce the level of co-channel interference to DUEs, which inevitably deteriorates the transmission quality of their serving GUEs. Therefore, to sidestep this dilemma, we fix the transmit powers of all considered BSs.)

The 3-dimensional (3D) locations of each DUE, each ground BS and each GUE are denoted as $\vec{q}_u = (x_u, y_u, h_u)$, $\vec{q}_b = (x_b, y_b, z_b)$ and $\vec{q}_g = (x_g, y_g, 0)$, respectively. For simplicity and without loss of generality, the flying altitude of each DUE is assumed universally as $h_u = h$ and the height of each BS's antenna is set identically as $z_b = z$, where $h \gg z$ always holds in the considered model. Each DUE is supposed to reach its destination $\vec{q}_u(D)$ from a predefined initial location $\vec{q}_u(I)$ within a time duration $T_u$. (For a specific DUE $u \in \mathcal{U}$, the flying duration $T_u$ is determined by its predefined trajectory and velocity; the trajectory planning task is beyond the scope of this paper and is one of our future research interests.) For clarity, the considered subregion is formulated as a cube specified by $[x_{\mathrm{lo}}, x_{\mathrm{up}}] \times [y_{\mathrm{lo}}, y_{\mathrm{up}}] \times [z_{\mathrm{lo}}, z_{\mathrm{up}}]$, where the subscripts "lo" and "up" represent the lower and upper borders of this 3D airspace, respectively. Furthermore, the coordinate of an arbitrary DUE $u$ at time $t \in [0, T_u]$ should stay in the range $\vec{q}_{\mathrm{lo}} \preceq \vec{q}_u(t) \preceq \vec{q}_{\mathrm{up}}$, where $\vec{q}_{\mathrm{lo}} = (x_{\mathrm{lo}}, y_{\mathrm{lo}}, z_{\mathrm{lo}})$, $\vec{q}_{\mathrm{up}} = (x_{\mathrm{up}}, y_{\mathrm{up}}, z_{\mathrm{up}})$ and $\preceq$ denotes the element-wise inequality. The start and final locations of each DUE are given by $\vec{q}_u(0) = \vec{q}_u(I)$ and $\vec{q}_u(T_u) = \vec{q}_u(D)$, respectively. Therefore, the trajectory of each DUE $u$ can be fully traced by $\vec{q}_u(t), \forall t \in [0, T_u]$.

A. The RB Allocation Criterion
To properly manage the ICI among the $G$ GUEs, the following RB assignment criterion is adopted at all BSs. The set $\mathcal{TI}_b(p)$ is defined to denote the first $p$-tier BSs that encompass a specific BS $b \in \mathcal{B}$ in the considered model, where $\mathcal{TI}_b(p)$ includes this focused BS itself. When an arbitrary RB has been assigned to any GUE in the serving cells of BSs from $\mathcal{TI}_b(p)$, the focused BS $b$ should avoid allocating this RB to other GUEs in its corresponding cell. To ensure that the total RB resource is sufficient for all GUEs in the cells of BSs from $\mathcal{TI}_b(p)$, the constraint $\sum_{\hat{b} \in \mathcal{TI}_b(p)} g_{\hat{b}} \leq K$ should hold, where $\mathrm{card}(\cdot)$ indicates the cardinality of a set. In this regard, the focused BS $b$ cannot generate any interference to GUEs in the serving cells of BSs from $\mathcal{TI}_b(p)$. For GUEs outside the serving cells of BSs from $\mathcal{TI}_b(p)$, the potential ICI caused by the focused BS $b$ is assumed to be negligible, owing to severe terrestrial NLoS pathloss and shadowing. (In the case of sufficiently large $p$, the ICI among all GUEs becomes negligible, thanks to sufficient frequency reuse and severe terrestrial pathloss.)

For each possible RB $k$, some BSs may already occupy it to serve GUEs in their corresponding cells. These BSs are recognized as the occupied BSs, denoted by the occupied BS set $\mathcal{B}_o^k \subset \mathcal{B}$. Furthermore, the set $\hat{\mathcal{B}}_o^k = \mathcal{B} \setminus \mathcal{B}_o^k$ includes all the potential BSs, at which RB $k$ is idle. For a specific RB $k$ assigned to serve a DUE, the corresponding associated BSs come from the potential set $\hat{\mathcal{B}}_o^k$, while all the non-associated co-channel interference roots from the occupied set $\mathcal{B}_o^k$. A DUE $u$ associated with RB $k$ is supposed to be paired with all BSs in the potential set $\hat{\mathcal{B}}_o^k$, to take advantage of macro-diversity gain. However, this may generate additional ICI to GUEs in the serving cells of BSs from $\mathcal{TI}_b(p), b \in \hat{\mathcal{B}}_o^k$. To avoid ICI attenuating the receiving quality of existing GUEs over the same RB, a potential BS $b \in \hat{\mathcal{B}}_o^k$ is allowed to pair with the DUE if and only if no other BS applies RB $k$ within its first $p$-tier neighbours, i.e.,

$$\mathcal{B}_o^k \cap \mathcal{TI}_b(p) = \emptyset. \qquad (1)$$

Then, the available BS set $\breve{\mathcal{B}}_o^k \subset \hat{\mathcal{B}}_o^k$ is defined to denote the potential BSs satisfying (1). A code sketch of this set classification is given below.
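As an illustration, the following minimal Python sketch (with illustrative data structures, not taken from the paper) derives the occupied, potential and available BS sets for a given RB from an RBP matrix and a precomputed p-tier neighbour map:

```python
import numpy as np

# Sketch of the RB allocation criterion of Sec. II-A (illustrative names).
# C[b, k] = 1 if BS b currently serves a GUE on RB k (the RBP map).
# tiers[b] is the set of BS indices in the first p tiers around BS b, including b itself.

def classify_bs_sets(C, k, tiers):
    """Return (occupied, potential, available) BS sets for RB k."""
    occupied = {b for b in range(C.shape[0]) if C[b, k] == 1}       # B_o^k
    potential = set(range(C.shape[0])) - occupied                    # \hat{B}_o^k
    # A potential BS may pair the DUE on RB k only if no BS already using
    # RB k for a GUE lies within its first p tiers, cf. (1).
    available = {b for b in potential if not (occupied & tiers[b])}  # \breve{B}_o^k
    return occupied, potential, available

# Toy example: 4 BSs, 3 RBs; BS 0 occupies RB 1; BSs 0 and 1 are mutual tier-1 neighbours.
C = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]])
tiers = {0: {0, 1}, 1: {0, 1}, 2: {2}, 3: {3}}
print(classify_bs_sets(C, k=1, tiers=tiers))  # ({0}, {1, 2, 3}, {2, 3})
```

Here BS 1 is excluded from the available set because the occupied BS 0 lies within its first p tiers, exactly the exclusion that (1) enforces.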
B. Channel Models

In contrast to the terrestrial transmission between a BS and a GUE (denoted as B2G hereafter), the wireless links between a BS and a DUE (denoted as B2U hereafter) have a higher probability of experiencing LoS pathloss. In this subsection, the channel model of the considered cellular-connected UAV network is introduced.
1) B2G Channel Model:
In practice, the B2G channel includes the large-scale fading caused by NLoS-dominated pathloss and the corresponding small-scale fading component. In this paper, we concentrate on the down-link interference management problem, where the terrestrial transmissions affect the B2U communication quality as part of the co-channel interference. This is because the occupied BSs may apply channel-aware precoding techniques to enhance their transmissions with the corresponding GUEs. For simplicity, we apply the popularly-used Rayleigh fading [22], [23] to model the terrestrial small-scale fading component, denoted as $\vec{h}_{bg}(t) \in \mathbb{C}^{1 \times M}, \forall b \in \mathcal{B}, g \in \mathcal{G}$.
2) B2U Channel Model:
The probabilistic pathloss model is widely applied in the current literature to characterize the wireless pathloss between a BS and a DUE, where LoS and NLoS channels are considered separately with different occurrence probabilities. According to the 3GPP urban-macro (UMa) channel model [24], the expected B2U pathloss in dB can be expressed as $\mathrm{PL}_{bu} = \mathrm{Pr}_{\mathrm{LoS}}\,\mathrm{PL}_{\mathrm{LoS}} + \mathrm{Pr}_{\mathrm{NLoS}}\,\mathrm{PL}_{\mathrm{NLoS}}$, where $\mathrm{Pr}_{\mathrm{LoS}}$ represents the occurrence probability of the LoS link, $\mathrm{Pr}_{\mathrm{NLoS}} = 1 - \mathrm{Pr}_{\mathrm{LoS}}$ indicates that of the NLoS channel, and $\mathrm{PL}_{\mathrm{LoS}}$ and $\mathrm{PL}_{\mathrm{NLoS}}$ denote the pathlosses for the LoS and NLoS links, respectively. Specifically, we have

$$\mathrm{Pr}_{\mathrm{LoS}} = \begin{cases} \min\left\{\dfrac{\varepsilon_1}{r_{bu}}, 1\right\} \times \left[1 - \exp\left(-\dfrac{r_{bu}}{\varepsilon_2}\right)\right] + \exp\left(-\dfrac{r_{bu}}{\varepsilon_2}\right), & 22.5\ \mathrm{m} < h \leq 100\ \mathrm{m}, \\ 1, & 100\ \mathrm{m} < h \leq 300\ \mathrm{m}, \end{cases} \qquad (2)$$

$$\mathrm{PL}_l = \begin{cases} 28.0 + 22\log_{10}(d_{bu}) + 20\log_{10}(f_c), & l = \mathrm{LoS}, \\ -17.5 + \left[46 - 7\log_{10}(h)\right]\log_{10}(d_{bu}) + 20\log_{10}\left(\dfrac{40\pi f_c}{3}\right), & l = \mathrm{NLoS}, \end{cases} \qquad (3)$$

in which $r_{bu} = \sqrt{d_{bu}^2 - h^2}$, $\varepsilon_1 = \max\{460\log_{10}(h) - 700, 18\}$, $\varepsilon_2 = 4300\log_{10}(h) - 3800$, $f_c$ represents the carrier frequency and $d_{bu} = \|\vec{q}_u - \vec{q}_b\|$ calculates the Euclidean distance between DUE $u$ and ground BS $b$. Nakagami-$m$ fading is invoked to represent the small-scale fading component for B2U channels, denoted as $\vec{h}_{bu}(t) \in \mathbb{C}^{1 \times M}, \forall b \in \mathcal{B}, u \in \mathcal{U}$. Note that the shape factor $m$ varies with the corresponding type of large-scale fading, i.e., LoS or NLoS. (In contrast to terrestrial communication scenarios, where Rayleigh fading is widely applied to model small-scale fading, Rician or Nakagami-$m$ fading is more suitable for tracking the characteristics of B2U small-scale fading, owing to the LoS-dominated B2U channels.)

To practically reflect the characteristics of B2U channels in the considered subregion, one realization of the statistical model suggested by the ITU is generated to formulate the building distribution (including the structures' 2D locations on the ground and their corresponding heights). There are three key parameters in the ITU building distribution model: 1) $\hat{\alpha}$ indicates the ratio of the land area covered by buildings to the total land area; 2) $\hat{\beta}$ represents the mean number of buildings per unit area; and 3) $\hat{\gamma}$ determines the distribution of building heights, which usually follows a Rayleigh distribution with mean $\hat{\gamma} > 0$. Note that in the vast majority of related literature, the B2U pathlosses are modelled and tracked in terms of the average large-scale channel gain, via calculating the occurrence probabilities of the LoS/NLoS links as depicted in (2). This kind of approach is more mathematically tractable; however, it can only reflect the ergodic characteristics of B2U channels over many realizations of the building distribution. On the contrary, in this paper, the occurrences of LoS/NLoS links are alternatively tracked via checking whether the line of the B2U channel is blocked or not by any building, given one realization of the ITU building distribution model. Then, the corresponding type of large-scale pathloss can be determined at each B2U channel regeneration. (Our approach is more practical because the building distribution of a subregion in the real world can hardly vary over time, say, days or even years.) Fig. 2 illustrates the considered realization of the local building distribution in this paper, where 25 building clusters and 37 BSs are depicted in a square subregion with side length $D = 3$ km, with road width $\hat{D}$, land-coverage ratio $\hat{\alpha}$, $\hat{\beta} = 103$ buildings/km$^2$ and $\hat{\gamma} = 20$ m. With these parameter settings, the total number of buildings is $\hat{\beta}D^2 = 927$ and the expected footprint of each building is $\hat{\alpha}/\hat{\beta}$ km$^2$. Besides, the maximum height of buildings is clipped to be under 70 m, and the locations of the BSs are presented by white asterisks in Fig. 2(a).

Figure 2: The considered building distribution ((a) 2D view of the local building and BS distribution; (b) 3D view of the local building distribution)
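To make the blockage-based LoS test concrete, the sketch below samples points along the BS-DUE segment against an assumed axis-aligned box model of the buildings and then selects the pathloss per the reconstructed UMa expressions in (3); all function and variable names are ours, not the paper's:

```python
import numpy as np

# Sketch of the blockage-based LoS test and pathloss selection of Sec. II-B2.
# Each building is an assumed axis-aligned box (xmin, xmax, ymin, ymax, height) in metres.

def is_los(q_bs, q_uav, buildings, n_samples=200):
    """Sample points on the BS-UAV segment and test them against every building."""
    for s in np.linspace(0.0, 1.0, n_samples):
        x, y, z = (1 - s) * np.asarray(q_bs, float) + s * np.asarray(q_uav, float)
        for (xmin, xmax, ymin, ymax, hgt) in buildings:
            if xmin <= x <= xmax and ymin <= y <= ymax and z <= hgt:
                return False        # the segment pierces a building -> NLoS
    return True

def pathloss_db(d_bu, h_u, f_c_ghz, los):
    """UMa-AV pathloss per (3); f_c in GHz, distances and heights in metres."""
    if los:
        return 28.0 + 22.0 * np.log10(d_bu) + 20.0 * np.log10(f_c_ghz)
    return (-17.5 + (46.0 - 7.0 * np.log10(h_u)) * np.log10(d_bu)
            + 20.0 * np.log10(40.0 * np.pi * f_c_ghz / 3.0))
```

The segment-sampling test is a simple numerical stand-in for an exact line-box intersection; for one fixed building realization, it only needs to be evaluated once per DUE location.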
C. SINR at DUE

Denote $C_u^k(t) \in \{0, 1\}$ as the RB association indicator, where $C_u^k(t) = 1$ means that DUE $u$ occupies RB $k$ at time $t$, and $C_u^k(t) = 0$ otherwise. Each DUE is assumed to occupy at most one single RB at a time, so we have $\sum_{k=1}^{K} C_u^k(t) \leq 1$. (In this paper, we focus on the scenario in which each DUE can only occupy one single RB at a time; integrating more sophisticated RB allocation approaches may be considered in our future works.)

If RB $k$ is feasible to be assigned to DUE $u$, i.e., $C_u^k(t) = 1$, it has to satisfy the RB assignment criterion presented in Subsection II-A. Then, all BSs in the potential set $\hat{\mathcal{B}}_o^k$ meeting the regulation (1), i.e., $b \in \breve{\mathcal{B}}_o^k$, are recognized as the available BSs for DUE $u$, to take advantage of macro-diversity gain. Besides, all BSs $b \in \mathcal{B}_o^k$ occupying the selected RB $k$ are classified as sources of co-channel ICI. Thus, the received signal of DUE $u$ over RB $k$ at time $t$ can be given by

$$y_u^k(t) = C_u^k(t)\sum_{b \in \breve{\mathcal{B}}_o^k} \sqrt{10^{-\mathrm{PL}_l/10}}\,\vec{h}_{bu}(t)\vec{w}_{bu}(t)x_u(t) + \sum_{b \in \mathcal{B}_o^k} \sqrt{10^{-\mathrm{PL}_l/10}}\,\vec{h}_{bu}(t)\vec{w}_{bg}(t)x_{bg}(t) + n_u^k(t), \qquad (4)$$

where $\vec{w}_{bu}(t) \in \mathbb{C}^{M \times 1}$ indicates the transmit beamforming vector at BS $b \in \breve{\mathcal{B}}_o^k$ for DUE $u$, $\vec{w}_{bg}(t) \in \mathbb{C}^{M \times 1}$ represents the transmit beamforming vector at BS $b \in \mathcal{B}_o^k$ for its corresponding GUE, $x_u(t) \sim \mathcal{CN}(0, P)$ is the intended message from BS $b$ to DUE $u$, $x_{bg}(t) \sim \mathcal{CN}(0, P)$ implies the signal for the GUE, and $n_u^k(t) \sim \mathcal{CN}(0, \sigma^2)$ denotes the received additive white Gaussian noise (AWGN) at DUE $u$. Note that the explicit type of large-scale fading between BS $b$ and DUE $u$ at time $t$, i.e., $l \in \{\mathrm{LoS}, \mathrm{NLoS}\}$, can be determined via checking possible blockages according to the considered realization of the local building distribution mentioned in Section II-B2. Taking advantage of macro-diversity gain, all signals from the associated BSs $b \in \breve{\mathcal{B}}_o^k$ are recognized as legitimate in-phase information and thus can be added constructively at DUE $u$. Therefore, the instantaneous signal-to-interference-plus-noise ratio (SINR) of DUE $u$ at time $t$ can be calculated as

$$\Gamma_u(t) = \sum_{k=1}^{K} C_u^k(t)\,\frac{\sum_{b \in \breve{\mathcal{B}}_o^k} P\,10^{-\mathrm{PL}_l/10}\,|\vec{h}_{bu}(t)\vec{w}_{bu}(t)|^2}{I_u^k(t) + \sigma^2}, \qquad (5)$$

where $I_u^k(t) = \sum_{b \in \mathcal{B}_o^k} P\,10^{-\mathrm{PL}_l/10}\,|\vec{h}_{bu}(t)\vec{w}_{bg}(t)|^2$ denotes the ICI introduced by the co-channel BSs in the occupied set $\mathcal{B}_o^k$. Note that we set $\vec{w}_{bg}(t) = \vec{h}_{bg}^{\dagger}(t)/\|\vec{h}_{bg}(t)\|$ for the considered model. (This paper focuses on interference management for cellular-connected UAV networks, so the precoding configuration of the terrestrial transmissions is not our interest; we assume that the occupied BSs simply perform maximum ratio transmission (MRT) for their serving GUEs.)
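A minimal sketch of evaluating (5), assuming 1-D complex numpy arrays for the channels and beamformers (all names are illustrative):

```python
import numpy as np

# Sketch of the instantaneous SINR (5) for one DUE on its assigned RB.
# h_av[b]: small-scale channel (length M) from available BS b to the DUE, w_av[b]: its beamformer;
# h_oc[b]: channel from occupied BS b to the DUE, h_bg[b]: channel from that BS to its own GUE;
# g_*: linear large-scale gains 10**(-PL_l/10), with l set by the LoS/NLoS blockage test.

def dsue_sinr(h_av, g_av, w_av, h_oc, g_oc, h_bg, P, sigma2):
    # Desired power: signals from all available BSs add constructively at the DUE.
    signal = sum(P * g * abs(h @ w) ** 2 for h, g, w in zip(h_av, g_av, w_av))
    ici = 0.0
    for h, g, hg in zip(h_oc, g_oc, h_bg):
        w_bg = hg.conj() / np.linalg.norm(hg)   # MRT towards the served GUE
        ici += P * g * abs(h @ w_bg) ** 2       # leakage received by the DUE
    return signal / (ici + sigma2)
```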
D. Problem Formulation

The received SINR of DUE $u$ at time $t$ in (5) is a random variable, owing to the randomness introduced by the small-scale fadings $\vec{h}_{bu}(t)$ and $\vec{h}_{bg}(t)$, as well as the RB allocation. Specifically, the RB allocation affects $\Gamma_u(t)$ in terms of how many available BSs and interfering BSs are involved, i.e., $\mathrm{card}(\breve{\mathcal{B}}_o^k)$ and $\mathrm{card}(\mathcal{B}_o^k)$, respectively. Then, with a given RB allocation, the transmit beamforming vector $\vec{\omega}_{bu}(t)$ should be designed to adapt to the small-scale fading $\vec{h}_{bu}(t)$ so that $|\vec{h}_{bu}(t)\vec{w}_{bu}(t)|$ can be optimized. Therefore, the corresponding transmission outage probability (TOP) can be formulated as a function of $C_u^k(t)$ and $\vec{\omega}_{bu}(t)$, given by

$$TOP_u\{C_u^k(t), \vec{\omega}_{bu}(t)\} = \Pr[\Gamma_u(t) < \Gamma_{\mathrm{th}}], \qquad (6)$$

where $\Pr$ outputs the probability calculated with respect to (w.r.t.) the aforementioned small-scale fadings, RB allocation and B2U transmit beamforming vector. Then, the ergodic outage duration (EOD) of DUE $u$ travelling along trajectory $\vec{q}_u(t), \forall t \in [0, T_u]$, from $\vec{q}_u(I)$ to $\vec{q}_u(D)$, can be expressed as

$$EOD_u\{C_u^k(t), \vec{\omega}_{bu}(t)\} = \int_0^{T_u} TOP_u\{C_u^k(t), \vec{\omega}_{bu}(t)\}\,dt. \qquad (7)$$

This paper assumes that the DUEs move along predefined trajectories $\vec{q}_u(t), \forall u \in \mathcal{U}, t \in [0, T_u]$, with constant velocity $V_u$; then $T_u$ in (7) is a fixed parameter posing no impact on the overall integral. Hence, the EOD of an arbitrary DUE $u$ is fully determined by $C_u^k(t)$ and $\vec{\omega}_{bu}(t)$. Without loss of generality, in the remainder of this paper one specific DUE in Fig. 1 is focused on to evaluate our proposed scheme, which can easily be applied to other DUEs with orthogonal RB assignment. To enhance the down-link transmission quality of the DUE across its travelling trajectory, this paper focuses on minimizing its EOD. Then, the corresponding optimization problem can be stated as
$$(\mathrm{P1}): \min_{C_u^k(t),\,\vec{\omega}_{bu}(t)}\; EOD_u\{C_u^k(t), \vec{\omega}_{bu}(t)\}, \qquad (8a)$$
$$\text{s.t.}\quad \sum_{k=1}^{K} C_u^k(t) \leq 1, \;\forall t \in [0, T_u], \qquad (8b)$$
$$\|\vec{w}_{bu}(t)\| = 1, \;\forall b \in \mathcal{B}, \forall t \in [0, T_u], \qquad (8c)$$
$$C_u^k(t) \in \{0, 1\}, \;\forall k \in \mathcal{K}, \forall t \in [0, T_u], \qquad (8d)$$
$$\|\dot{\vec{q}}_u(t)\| = V_u, \;\forall t \in [0, T_u], \qquad (8e)$$
$$\vec{q}_{\mathrm{lo}} \preceq \vec{q}_u(t) \preceq \vec{q}_{\mathrm{up}}, \;\forall t \in [0, T_u], \qquad (8f)$$
$$\vec{q}_u(0) = \vec{q}_u(I), \;\vec{q}_u(T_u) = \vec{q}_u(D). \qquad (8g)$$

The constraint (8b) makes sure that the DUE can occupy at most one single RB at a time. The constraint (8c) is the normalization requirement on the transmit beamforming vector, which ensures that the transmit power of each available BS $b \in \breve{\mathcal{B}}_o^k$ equals $P$. The constraint (8d) indicates that $C_u^k(t)$ is a binary variable. The constraints (8e)-(8g) specify the velocity, the feasible cubic airspace, and the start and final locations of the DUE, respectively.

It is extremely challenging to solve the proposed optimization problem (P1) under the listed constraints. The main difficulties can be summarized as follows: 1) the closed-form expression of $EOD_u\{C_u^k(t), \vec{\omega}_{bu}(t)\}$ would have to be derived, which is extraordinarily sophisticated, if not impossible; 2) the variations of the LoS/NLoS pathloss, the small-scale fading $\vec{h}_{bu}(t)$ and the B2G transmit beamforming vector $\vec{\omega}_{bg}(t)$ should be taken into consideration, and these are dynamic over the time horizon and dependent on their modellings; 3) even given the closed-form expression of the optimization objective (8a) and perfect knowledge of the considered cellular-connected UAV network, the problem remains mathematically inefficient to tackle, owing to the non-convexity of the mixed-integer constraint (8d) and that of the optimization objective (8a) w.r.t. $C_u^k(t)$ and $\vec{\omega}_{bu}(t)$. To provide a better alternative for solving the proposed optimization problem (P1), a DRL-aided solution is proposed in this paper.
III. PRELIMINARIES OF DRL

In this section, the basics of DRL are introduced, which form the theoretical foundation for understanding the proposed DRL-based solution.

Generally speaking, DRL-related problems can be mapped into a Markov decision process (MDP) consisting of the five elements listed in the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)$. Specifically, $s_t \in \mathcal{S}$ denotes the state observed from the environment at trial $t$, while $a_t \in \mathcal{A}$ represents the action picked by the DRL agent following the action selection policy $\pi(s_t, a_t): \mathcal{S} \times \mathcal{A} \to [0, 1]$. The policy $\pi$ specifies the probability distribution of picking action $a_t$ in state $s_t$, constrained by $\sum_{a_t \in \mathcal{A}} \pi(s_t, a_t) = 1$. After executing action $a_t$ in state $s_t$, the state transition $s_t \to s_{t+1}$ occurs in the environment following the state transition function $\mathcal{T} = \Pr(s_{t+1}|s_t, a_t)$. Then, a scalar reward $r_t(s_t, a_t)$ is generated, which reflects the immediate return of applying $a_t$ to $s_t$. Last but not least, the discount factor $\gamma \in [0, 1]$ is applied to discount the future rewards, characterizing how much the agent cares about rewards in the distant future. More variance is introduced by the reward function as the time horizon expands, while the discount factor $\gamma$ helps reduce such uncertainty and enables the convergence of DRL algorithms.

After exploring the environment, the agent can be trained from the experience $\{s_t, a_t, r_t, s_{t+1}\}$, which records the essential information for learning. The state-action value function $Q^{\pi}(s_t, a_t)$ (i.e., the Q function) calculates the accumulated rewards and specifies how good it is for an agent to perform action $a_t$ in state $s_t$ following policy $\pi$, defined as

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[G_t = \sum_{n=0}^{+\infty} \gamma^n r_{t+n} \,\middle|\, s_t = s, a_t = a\right], \qquad (9)$$

where $G_t$ denotes the discounted accumulated rewards. The Q function $Q^{\pi}(s_t, a_t)$ follows the Bellman equation, shown as

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_t + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{T}(s_{t+1}|s_t, a_t) \sum_{a_{t+1} \in \mathcal{A}} \pi(s_{t+1}, a_{t+1})\, Q^{\pi}(s_{t+1}, a_{t+1})\right]. \qquad (10)$$

Based on (10), the optimal Q function (Bellman optimality equation) can be derived as

$$Q^*(s_t, a_t) = r_t + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{T}(s_{t+1}|s_t, a_t) \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}). \qquad (11)$$

Note that the optimal Q function (11) is non-linear and has no closed-form solution, but it can be tackled via iterative algorithms. Specifically, (11) can be solved recursively to achieve the optimum $Q^*(s_t, a_t)$: via temporal difference (TD) learning when the explicit reward and state transition models are absent, or through dynamic programming (e.g., value iteration) when the agent possesses full information of the MDP. (When $\mathcal{T}$ is not available, the MDP can still be solved via a TD-based approach, which completes the "model-free" learning process.) The TD learning method is model-free, applying bootstrapping to update the Q function by directly sampling the experience $\{s_t, a_t, r_t, s_{t+1}\}$.
Q learning is a popular TD learning approach applying a recursive update rule to the state-action value function $Q(s_t, a_t)$, given by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Big[\underbrace{r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1})}_{\text{TD target}} - Q(s_t, a_t)\Big], \qquad (12)$$

where $\alpha \in (0, 1]$ is the learning rate, determining to what extent the newly-acquired information overrides its former counterpart. It has been validated that the Q learning method can achieve the optimum $Q^*(s_t, a_t)$ if the state-action pairs are sufficiently experienced and $\alpha$ is appropriately picked [25]. A minimal tabular implementation of (12) is sketched below.
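The following sketch applies one step of the update rule (12) to a Q table (all names illustrative):

```python
import numpy as np

# Minimal tabular Q-learning step implementing (12).
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped TD target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) towards the target
    return Q

Q = np.zeros((5, 3))                            # 5 states, 3 actions
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])                                  # 0.1
```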
A. Deep Q Network

Intuitively, DRL is a composition of reinforcement learning (RL) and artificial neural networks (ANNs), where "deep" refers to an ANN with multiple hidden layers. The ANN is a popular and powerful Q function approximator, which has been theoretically proved able to universally imitate any function (linear or non-linear), even with only one single hidden layer consisting of a sufficiently large number of neurons [25]. Furthermore, an ANN with multiple hidden layers, i.e., a DNN, is more suitable for approximating complex functions.

In DRL, e.g., the deep Q network (DQN) [26], the Q function is approximated in a parametric manner with parameter vector $\theta$, shown as

$$Q(s_t, a_t) \approx Q(s_t, a_t|\theta), \qquad (13)$$

where $\theta$ corresponds to the weight coefficients and biases of all links in the DNN. The DNN-based function approximation (13) introduces two distinguishing advantages over the tabular RL method: 1) it enables generalization, i.e., predicting Q values for inexperienced state-action pairs, because the state-action pairs are mutually coupled via $Q(s_t, a_t|\theta)$ and $\theta$; 2) only the parameter $\theta$ needs to be learnt, rather than recording and updating Q values for all state-action candidates, which can tremendously relieve the computing burden.

In DQN, the parameter vector $\theta$ in (13) can be updated via the bootstrapping method to minimize the loss function $loss(\theta)$, defined as

$$loss(\theta) = \left[r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}|\theta) - Q(s_t, a_t|\theta)\right]^2. \qquad (14)$$

Unfortunately, the loss function (14) is contaminated by the updating parameter vector $\theta$, leading to oscillations or divergence when applying standard deep training approaches. This nut can be cracked by adopting a target network, denoted as $Q(s_t, a_t|\theta^-)$ with parameter vector $\theta^-$ [26]. Note that the target network $Q(s_t, a_t|\theta^-)$ is simply a copy of the training network $Q(s_t, a_t|\theta)$, where $\theta^-$ is updated much less frequently than $\theta$. Specifically, the target network is synchronized to the training network at a given frequency, in terms of the update $\theta^- \leftarrow \theta$. Then, the loss function (14) can be reformulated as

$$loss(\theta) = \left[r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}|\theta^-) - Q(s_t, a_t|\theta)\right]^2. \qquad (15)$$

Other common obstacles encountered by DQN are highly correlated data in the time domain and the large variance of the updates, which can be relieved by introducing an experience replay buffer and the mini-batch updating technique. The experience replay buffer is a finite-sized memory storing experienced transitions $\{s_t, a_t, r_t, s_{t+1}\}$, while the mini-batch updating method randomly samples multiple experiences from the experience replay buffer to perform DNN updates. A sketch of computing (15) with a target network is given below.

Based on DQN, other advanced DRL algorithms have been proposed to further improve the learning performance, e.g., double DQN (DDQN), which can handle the maximization bias introduced by the max operation in (15) [27], and duelling DQN, which decouples the state value and the action advantages, achieving better learning performance when facing a large number of similar-value actions [28].
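As referenced above, a minimal PyTorch sketch of the loss in (15) with a separate, periodically-synchronized target network (layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the DQN loss (15) over a mini-batch sampled from a replay buffer.
def make_net(n_states, n_actions):
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(4, 3), make_net(4, 3)
target_net.load_state_dict(q_net.state_dict())        # theta_minus <- theta

def dqn_loss(batch, gamma=0.99):
    s, a, r, s_next = batch                           # tensors from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # target net is never trained directly
        y = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```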
B. Deep Deterministic Policy Gradient

Value-based DRL methods, e.g., DQN, DDQN and duelling DQN, are not suitable for tackling problems containing continuous actions, as it is extremely inefficient to find the maximum Q value over a continuous action space [29]. To deal with this obstacle, the policy gradient approach and the actor-critic architecture are invoked. In actor-critic algorithms, the actor is a policy network which takes states as inputs and produces a specific action, instead of a probability distribution over possible actions. Besides, the critic is a state-action value network, in which the action and state are treated as the inputs and the state-action values are the corresponding outputs. DDPG introduces the actor-critic architecture into DQN, and is model-free and off-policy [29]. The actor network in DDPG eliminates the need to locate the action maximizing the state-action function given the next state, and hence DDPG can robustly solve problems with continuous action spaces.
IV. THE PROPOSED ALGORITHM

In this section, the proposed optimization problem (P1) is tackled via a DRL-based method, i.e., the hybrid D3QN-DDPG algorithm.
A. The Formulation of MDP
To realize the DRL-based solution for the proposed optimization problem (P1), the first step is to formulate (P1) as an MDP based on discrete time slots. The length of a time slot is defined as $\delta_u$ for the considered model, and thus the number of time slots equals $N_u = T_u/\delta_u$ for DUE $u$. Note that the duration of the time slot $\delta_u$ should be designed to be as small as possible, so that the distances between the DUE and the BSs remain approximately constant within each time slot. In this regard, the EOD expression can be rewritten as

$$EOD_u\{C_u^k(n), \vec{\omega}_{bu}(n)\} \approx \sum_{n=1}^{N_u} \delta_u\, TOP_u\{C_u^k(n), \vec{\omega}_{bu}(n)\}. \qquad (16)$$

However, even with given $C_u^k(n)$, the closed-form expression of the transmission outage probability $TOP_u\{C_u^k(n), \vec{\omega}_{bu}(n)\}$ is still difficult to derive, owing to its complex formulation and the yet-to-be-designed B2U transmit beamforming vector $\vec{\omega}_{bu}(n)$. Alternatively, this challenge can be circumvented via numerical evaluation of the raw measurements of the received signals at the DUE. The reason is that, compared to the length of the time slot $\delta_u$ (typically on the order of seconds), the length of the channel coherence blocks (typically on the order of milliseconds) is relatively small. Then, provided with $C_u^k(n)$ for a time slot $n$, the indicator of TOP can be defined as $ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\} = 1$ in the case of $\Gamma_u(n, i) < \Gamma_{\mathrm{th}}$, and $ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\} = 0$ otherwise, where $\hat{h}(n, i)$ and $\vec{\omega}_{bu}(n, i)$ indicate one realization of the small-scale fadings and that of the corresponding beamforming vector, respectively. Then, the corresponding TOP can be calculated as

$$TOP_u\{C_u^k(n), \vec{\omega}_{bu}(n)\} = \mathbb{E}_{\hat{h},\vec{\omega}}\left[ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\}\right]. \qquad (17)$$

To realize the average $\mathbb{E}_{\hat{h},\vec{\omega}}$ over $\hat{h}$ and $\vec{\omega}$ in (17), $\varsigma$ SINR measurements should be performed. (Existing soft handover techniques, e.g., RSRP and RSRQ reports, can be applied to help complete this kind of task.) Furthermore, the arithmetic TOP of DUE $u$ can be expressed as

$$\overline{TOP}_u\{C_u^k(n), \vec{\omega}_{bu}(n)\} = \frac{1}{\varsigma}\sum_{i=1}^{\varsigma} ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\}. \qquad (18)$$
When a sufficiently large number of SINR measurements is performed, i.e., $\varsigma \gg 1$, the statistical average in (17) can be replaced by its arithmetic counterpart in (18). (In the case of $\varsigma \to +\infty$, $\lim_{\varsigma \to +\infty} \overline{TOP}_u\{C_u^k(n), \vec{\omega}_{bu}(n)\} = TOP_u\{C_u^k(n), \vec{\omega}_{bu}(n)\}$ is guaranteed theoretically.) Furthermore, the EOD expression in (16) can be modified as

$$EOD_u\{C_u^k(n), \vec{\omega}_{bu}(n)\} \approx \sum_{n=1}^{N_u}\sum_{i=1}^{\varsigma} \frac{\delta_u}{\varsigma}\, ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\}. \qquad (19)$$

Then, the original optimization problem (P1) can be approximately revised as
$$(\mathrm{P2}): \min_{C_u^k(n),\,\vec{\omega}_{bu}(n, i)}\; \sum_{n=1}^{N_u}\sum_{i=1}^{\varsigma} \frac{\delta_u}{\varsigma}\, ITOP_u\{C_u^k(n), \vec{\omega}_{bu}(n, i); \hat{h}(n, i)\}, \qquad (20a)$$
$$\text{s.t.}\quad \sum_{k=1}^{K} C_u^k(n) \leq 1, \;\forall n \in [1, N_u], \qquad (20b)$$
$$\|\vec{w}_{bu}(n, i)\| = 1, \;\forall b \in \mathcal{B}, \forall n \in [1, N_u], \qquad (20c)$$
$$C_u^k(n) \in \{0, 1\}, \;\forall k \in \mathcal{K}, \forall n \in [1, N_u], \qquad (20d)$$
$$\vec{q}(n+1) = \vec{q}(n) + V_u \delta_u \vec{V}_u(n), \;\forall n \in [1, N_u], \qquad (20e)$$
$$\vec{q}_{\mathrm{lo}} \preceq \vec{q}_u(n) \preceq \vec{q}_{\mathrm{up}}, \;\forall n \in [1, N_u], \qquad (20f)$$
$$\vec{q}_u(0) = \vec{q}_u(I), \;\vec{q}_u(N_u) = \vec{q}_u(D), \qquad (20g)$$

where $\vec{V}_u(n)$ represents the travelling direction at time slot $n$, satisfying $\|\vec{V}_u(n)\| = 1$.

In the considered system model, the terrestrial BSs are controlled by a central coordinator (C2) via high-speed broadband cables (e.g., optical fibres), to realize the joint RB allocation and beamforming design task. Once the DUE $u$ registers into the cellular network, the C2 will first check the overall RB availability of all BSs, after which a map of RB possession (RBP), formulated as a 2D matrix $\mathbf{C}(n) = [C_b^k(n)]_{b \times k}$, is generated. Note that $C_b^k(n) = 1$ if RB $k$ is occupied by BS $b$ at time slot $n$, and $C_b^k(n) = 0$ otherwise. Then, for each RB $k$, following the RB allocation criterion presented in Subsection II-A, the corresponding occupied set $\mathcal{B}_o^k$, potential set $\hat{\mathcal{B}}_o^k$ and available set $\breve{\mathcal{B}}_o^k$ can be determined. Taking advantage of macro-diversity gain, the C2 assigns all available BSs $b \in \breve{\mathcal{B}}_o^k$ to serve the DUE cooperatively. Note that $\mathbf{C}(n)$ remains constant within each time slot and varies between time slots (to avoid frequent handover, the selected RB $k$ is considered unchanged within each time slot), capturing the dynamics of the RBP at the terrestrial BSs. For each time slot, the current location of the DUE $\vec{q}_u(n)$ is observable. Then, the large-scale fading distribution map between the DUE and the BSs can be traced, via checking the potential blockages between the DUE and each BS according to the local building distribution, as mentioned in Section II-B. Specifically, we define the matrix $\mathbf{L}(n) = [LS_{bu}(n)]_{bu}$ to store the large-scale fading components, where $LS_{bu}(n) = 1$ denotes that a LoS link exists between $b \in \mathcal{B}$ and the DUE, and $LS_{bu}(n) = 0$ indicates the existence of a NLoS link. From the point of view of the SINR in (5), the RB $k$ allocated to serve the DUE affects the value of the SINR in terms of how many desired channels and interfering links are introduced. Hence, the selection of the RB resource inherently impacts the EOD performance and should be delicately assigned. Next, with a specific RB for each time slot, the beamforming strategy adapting to the dynamic small-scale fading component can further affect the EOD performance.

To handle the aforementioned two-step process, a hybrid D3QN-DDPG algorithm is proposed, in which an outer MDP is formulated for the D3QN agent, while an inner MDP is forged for the DDPG agent. Specifically, the D3QN determines which RB should be selected for each time slot, and the DDPG outputs the proper beamforming vector for each link between the DUE and the BSs in the available BS set. Furthermore, the considered cellular-connected UAV network is divided into an outer environment and an inner environment.
For time slot $n$, the DUE's location $\vec{q}_u(n)$ and the RBP map $\mathbf{C}(n)$ can be observed from the outer environment. Then, the large-scale fading distribution map $\mathbf{L}(n)$ can be generated. (Note that $\mathbf{L}(n)$ is solely determined by $\vec{q}_u(n)$, because the locations of all terrestrial BSs and the local building realization are assumed unchanged, in line with practice.) The inner environment is defined to reflect the time-varying characteristic of the small-scale fading, i.e., the Nakagami-$m$ distribution. Note that the inner environment is dependent on the outer environment, because the shape parameter $m$ of the Nakagami-$m$ distribution differs between LoS and NLoS links.
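Before describing the agents, note that the arithmetic TOP in (18) and the EOD estimate in (19) amount to the following Monte-Carlo routine (a sketch; sinr_sample is an assumed callable drawing one SINR realization, e.g., via dsue_sinr() above with freshly drawn Nakagami-m fadings):

```python
# Monte-Carlo estimate of TOP per (18) and of the EOD per (19); illustrative names.

def arithmetic_top(sinr_sample, n, gamma_th, n_meas=1000):
    """Fraction of SINR draws below the threshold in time slot n."""
    outages = sum(sinr_sample(n) < gamma_th for _ in range(n_meas))
    return outages / n_meas                       # \bar{TOP}_u in (18)

def eod_estimate(sinr_sample, n_slots, delta_u, gamma_th):
    """Sum the per-slot outage fractions, weighted by the slot length."""
    return delta_u * sum(arithmetic_top(sinr_sample, n, gamma_th)
                         for n in range(n_slots))  # EOD per (19)
```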
B. Description of the Hybrid D3QN-DDPG Solution

To derive a flexible solution that can solve the proposed optimization problem (P2) in a dynamic-RBP and time-varying small-scale fading scenario, both the D3QN and the DDPG networks in the proposed hybrid D3QN-DDPG algorithm are trained interactively. Specifically, the D3QN network maps the outer state and the RB selection into Q values, while the actor of the DDPG agent transforms the inner state into a beamforming vector and the critic of the DDPG network evaluates the corresponding Q values.

Figure 3: The duelling DQN architecture (input layer, hidden layers, and an output layer aggregating the state value and the K action advantages)
1) D3QN:
To tackle the RB allocation problem, the state-of-the-art DQN with duelling architecture is invoked to approximate the Q function for the outer MDP. Compared to the original DQN method, the duelling DQN explicitly separates the representation of the state value and the corresponding action advantages into two independent streams, as depicted in Fig. 3. Specifically, the duelling DQN first estimates the state value and the state-dependent action advantages, and then calculates the Q value for each state-action pair via aggregation. The duelling architecture helps approximate the Q function more robustly and efficiently, especially when the Q values of various actions for the same state are nearly indistinguishable.

The outer MDP for the D3QN agent can be formulated as follows. The outer state $s$ is a list of the RBP map $\mathbf{C}(n)$, while the outer action $a$ refers to the selected RB $k^* = \arg_k\{C_u^k(n) = 1\}$. The considered optimization problem is fully determined by the value of the SINR, given the SINR threshold. In other words, a larger available BS set and a smaller occupied BS set are favourable for minimizing the EOD. For an outer state $s$ and the selected outer action $a$, the corresponding available BS set $\breve{\mathcal{B}}_o^{k^*}$ and occupied BS set $\mathcal{B}_o^{k^*}$ can be determined according to Section II-A. Then, the outer reward function can be defined as

$$r = \frac{\mathrm{card}(\breve{\mathcal{B}}_o^{k^*})}{\mathrm{card}(\breve{\mathcal{B}}_o^{k^*}) + \mathrm{card}(\mathcal{B}_o^{k^*})}. \qquad (21)$$

The designed outer reward function (21) implies that a selected RB $k^*$ resulting in a larger available BS set and a smaller occupied BS set is more favourable. Given the formulation of the outer MDP, the duelling DQN is invoked to approximate $Q_D(s, a|\theta_D)$, where $\theta_D$ represents the parameter vector of the D3QN network. (The transition of the RBP map is stochastic and can be observed from the outer environment, which means that the D3QN learning process is model-free.) The D3QN network is trained to minimize its loss function via the gradient descent update rule, shown as

$$\theta_D(t+1) = \theta_D(t) - \alpha_D \nabla_{\theta_D} loss(\theta_D), \qquad (22)$$

where $\alpha_D$ denotes the learning rate and $\nabla_{\theta_D} loss(\theta_D)$ represents the gradient of the D3QN network's loss function w.r.t. $\theta_D$. For a mini-batch of $N_D$ transitions randomly sampled from the outer replay buffer, the mean-square loss function in (22) is defined as

$$loss(\theta_D) = \frac{1}{N_D}\sum_{t=1}^{N_D}\left[y_t - Q_D(s_t, a_t|\theta_D)\right]^2, \qquad (23)$$

where $y_t = r_t + \gamma Q_D(s_{t+1}, a^*_{t+1}|\theta_D^-)$ and $\theta_D^-$ indicates the parameter vector of the target D3QN network. Note that the optimal outer action for the next outer state $s_{t+1}$ is selected by the D3QN network instead of the target D3QN network, given by

$$a^*_{t+1} = \arg\max_{a_{t+1}} Q_D(s_{t+1}, a_{t+1}|\theta_D). \qquad (24)$$

In this manner, the bootstrapping outer action is evaluated by the target D3QN network while the selection of the outer action is achieved by the D3QN network, which completes the double Q learning procedure. If the outer action selection and evaluation were both accomplished via the traditional DQN method in (15), the Q values would be overestimated while bootstrapping, i.e., learning estimates from estimates. Applying the double Q learning method to separate action selection and bootstrapping evaluation into two networks addresses the overestimation bias introduced by the max operator in calculating the loss function. The sketch below illustrates the duelling aggregation and the double-Q target.
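A minimal PyTorch sketch of the duelling aggregation of Fig. 3 and the double-Q target of (24) (layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Duelling Q network: shared trunk, then separate value and advantage streams.
class DuelingQNet(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)            # state-value stream V(s)
        self.adv = nn.Linear(64, n_actions)      # action-advantage stream A(s, a)

    def forward(self, s):
        z = self.trunk(s)
        a = self.adv(z)
        # Q = V + (A - mean A): the mean subtraction keeps V and A identifiable.
        return self.value(z) + a - a.mean(dim=1, keepdim=True)

def double_q_target(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1)     # action selected by the online net, cf. (24)
        q_next = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)
    return r + gamma * q_next                    # evaluated by the target net, cf. (23)
```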
After several steps of updating the D3QN network, the target D3QN network is synchronized to the D3QN network via letting $\theta_D^- = \theta_D$.

Given the outer state $s$, the outer action selection strategy applied by the D3QN agent follows the popular $\epsilon$-greedy policy, shown as

$$a = \begin{cases} \mathrm{randi}(K), & \text{with probability } \epsilon, \\ \arg\max_{k=1,\ldots,K} Q_D(s, k|\theta_D), & \text{otherwise}, \end{cases} \qquad (25)$$

where the exploration parameter $\epsilon \in [0, 1]$ is used to balance exploration and exploitation in the learning process. Specifically, a larger $\epsilon$ encourages the D3QN agent to explore the outer environment, while a smaller $\epsilon$ results in more frequent exploitation of the learned knowledge. Usually, the exploration parameter $\epsilon$ is annealed alongside the learning process, guiding the D3QN agent from more frequent exploration towards a higher probability of exploitation.
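In code, the annealed policy (25) reduces to a few lines (a sketch, with an assumed multiplicative decay rate):

```python
import numpy as np

# Annealed epsilon-greedy RB selection per (25); names illustrative.
def epsilon_greedy(q_values, eps):
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))   # explore: uniform RB index
    return int(np.argmax(q_values))               # exploit learned knowledge

eps, dec = 1.0, 0.995
for episode in range(3):
    a = epsilon_greedy(np.array([0.2, 0.7, 0.1]), eps)
    eps *= dec                                    # anneal towards exploitation
```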
2) DDPG:
For each time slot $n$, the D3QN agent observes the outer environment, from which it obtains the DUE's location $\vec{q}_u(n)$ and the RBP map $\mathbf{C}(n)$. Then, the D3QN agent selects the outer action, i.e., the RB $k^*$. With the selected RB and the current RBP map, the corresponding set of available BSs $\breve{\mathcal{B}}_o^{k^*}$ can be determined. Thereafter, the type of large-scale fading between the DUE and each available BS can be obtained from $\mathbf{L}(n)$. Then, the inner MDP for the DDPG network can be formulated as follows. Each inner state $\hat{s}$ consists of a list of small-scale fading components $\vec{h}_{bu}(n, i)$ and their corresponding types, LoS or NLoS, i.e., $LS_{bu}(n)$, where $b \in \breve{\mathcal{B}}_o^{k^*}$. It is well known that ANNs can only accept real numbers as inputs, rather than complex values. To circumvent this problem, the complex-valued small-scale fading channel $\vec{h}_{bu}(n, i)$ is passed through a flatten layer, which decouples the complex values and reshapes their real and imaginary parts into a real-valued vector. Each possible inner action $\hat{a}$ generated by the actor network is a vector of real numbers, which is reshaped into a normalized complex-valued vector to construct the corresponding beamforming vector $\vec{w}_{bu}(n, i)$. The transitions of the inner states are governed by the Nakagami-$m$ distribution, where $m$ varies according to $LS_{bu}(n)$. The inner reward function evaluates how good the selected inner action is for each state transition. To reflect the quality of the selected inner action, the inner reward function is defined as

$$\hat{r} = \frac{|\vec{h}_{bu}(n, i)\vec{w}_{bu}(n, i)|}{\|\vec{h}_{bu}(n, i)\|}. \qquad (26)$$

The DDPG method belongs to the family of actor-critic algorithms, in which the critic network learns the Q function approximation $Q_P(\hat{s}, \hat{a}|\theta_P)$ and the actor network is the policy generator approximating the action $\mu(\hat{s}|\theta_\mu)$, where $\theta_P$ and $\theta_\mu$ denote the parameter vectors of the critic and actor networks, respectively. Specifically, the actor network takes the inner state as its input and generates a deterministic continuous action as its output, unlike DQN-related methods, which output a probability distribution over a discrete action space. Furthermore, the inner action generated by the actor network is fed to the input layer of the critic network together with the current inner state. Then, the corresponding state-action value is generated at the output layer of the critic network. The actor network is invoked to approximate the inner action, and thus the exhaustive search for the optimal inner action maximizing the Q function given the next inner state is avoided. Fig. 4 depicts the overall architecture of the DDPG network.

The gradient descent update on the critic network can be given by

$$\theta_P(t+1) = \theta_P(t) - \alpha_{P_c} \nabla_{\theta_P} loss(\theta_P), \qquad (27)$$
where $\alpha_{P_c}$ indicates the learning rate and $\nabla_{\theta_P} loss(\theta_P)$ denotes the gradient of the critic network's loss function w.r.t. $\theta_P$. The corresponding mean-square loss function is defined as

$$loss(\theta_P) = \frac{1}{N_P}\sum_{t=1}^{N_P}\left[\hat{y}_t - Q_P(\hat{s}_t, \hat{a}_t|\theta_P)\right]^2, \qquad (28)$$

where $\hat{y}_t = \hat{r}_t + \gamma Q_P[\hat{s}_{t+1}, \mu(\hat{s}_{t+1}|\theta_\mu^-)|\theta_P^-]$ represents the target Q value, $N_P$ is the size of a mini-batch of transitions randomly extracted from the inner replay buffer, and $\theta_P^-$ and $\theta_\mu^-$ denote the parameters of the target critic network and target actor network, respectively.

Moreover, the actor network aims to maximize its expected return, defined as

$$J(\theta) = \mathbb{E}_{\hat{s}_t}\{Q[\hat{s}_t, \mu(\hat{s}_t|\theta_\mu)|\theta_P]\}, \qquad (29)$$

whose derivative w.r.t. $\theta_\mu$ can be calculated with the help of the chain rule, shown as

$$\nabla_{\theta_\mu} J(\theta) \approx \mathbb{E}_{\hat{s}_t}\{\nabla_{\theta_\mu} Q[\hat{s}_t, \mu(\hat{s}_t|\theta_\mu)|\theta_P]\} = \frac{1}{N_P}\sum_{t=1}^{N_P} \nabla_a Q_P(\hat{s}_t, a|\theta_P)\,\nabla_{\theta_\mu}\mu(\hat{s}_t|\theta_\mu). \qquad (30)$$

Then, the gradient ascent update of the actor network can be expressed as

$$\theta_\mu(t+1) = \theta_\mu(t) + \alpha_{P_a} \nabla_{\theta_\mu} J(\theta), \qquad (31)$$

where $\alpha_{P_a}$ is the learning rate for the actor network.

Furthermore, Polyak averaging updates are applied to the target critic and actor networks to enhance the stability of learning, given by

$$\theta_P^- \leftarrow \tau\theta_P + (1 - \tau)\theta_P^-, \qquad (32)$$
$$\theta_\mu^- \leftarrow \tau\theta_\mu + (1 - \tau)\theta_\mu^-, \qquad (33)$$

respectively, where $\tau$ is the interpolation factor of the Polyak averaging method for the target networks, usually set close to zero, i.e., $\tau \ll 1$.

Different from the probabilistic action selection policy over discrete actions for the D3QN agent, exploration over continuous actions for the DDPG agent can be realized via adding noise sampled from a noise process $\mathcal{N}$ to the actor network's output, i.e., $\hat{a} \leftarrow \hat{a} + \mathcal{N}$, where $\mathcal{N}$ can be chosen to adapt to the inner environment [29]. For simplicity, Normal noise $\mathcal{N}(0, \sigma_P)$ is applied to generate artificial noise for the output of the actor network, in which $\sigma_P$ is annealed alongside the learning process to guide the DDPG agent from exploration to exploitation.
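The complex-to-real conversions around the actor network reduce to a simple flatten/reshape pair; the sketch below also evaluates the inner reward (26) (illustrative names, assuming an antenna count M = 4):

```python
import numpy as np

# Real <-> complex conversions around the actor network (Sec. IV-B2); illustrative sketch.
def flatten_channel(h):
    """Complex length-M channel -> real 2M vector fed to the actor's input layer."""
    return np.concatenate([h.real, h.imag])

def action_to_beamformer(a):
    """Real 2M actor output -> normalized complex beamformer with ||w|| = 1, cf. (20c)."""
    M = a.size // 2
    w = a[:M] + 1j * a[M:]
    return w / np.linalg.norm(w)

h = (np.random.randn(4) + 1j * np.random.randn(4)) / np.sqrt(2)
w = action_to_beamformer(np.random.randn(8))
reward = abs(h @ w) / np.linalg.norm(h)     # inner reward per (26)
```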
Figure 4: The architecture of the DDPG network (the actor network maps the inner state, via hidden layers and a reshape-and-normalization stage, to a beamforming action; the critic network maps the inner state and action to a Q value)
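A compact PyTorch sketch of one DDPG update per (27)-(33); the actor and critic are assumed torch modules (the critic taking state and action as two inputs), and all names are illustrative:

```python
import torch

# One DDPG update step: critic descent (27)-(28), actor ascent (29)-(31),
# and Polyak averaging of the target networks (32)-(33).
def ddpg_update(critic, actor, critic_t, actor_t, batch, opt_c, opt_a,
                gamma=0.99, tau=0.005):
    s, a, r, s_next = batch
    with torch.no_grad():                              # target y_t in (28)
        y = r + gamma * critic_t(s_next, actor_t(s_next)).squeeze(1)
    loss_c = torch.nn.functional.mse_loss(critic(s, a).squeeze(1), y)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step() # gradient descent, (27)

    loss_a = -critic(s, actor(s)).mean()               # ascend J(theta), (29)-(31)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step() # opt_a holds only actor params

    for p_t, p in zip(critic_t.parameters(), critic.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)      # Polyak averaging, (32)
    for p_t, p in zip(actor_t.parameters(), actor.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)      # Polyak averaging, (33)
```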
3) The Hybrid D3QN–DDPG Algorithm:
The overall pseudo-code and the interaction diagram of the proposed hybrid D3QN-DDPG solution are given in Algorithm 1 and Fig. 5, respectively. All the neural networks, as well as their corresponding target networks and replay buffers, are first initialized (line 1). For each learning episode, the outer environment is initialized, which means that the drone's location is reset to the start coordinate of the given trajectory and the RBP map is re-observed (line 3). For each outer epoch in a learning episode, the D3QN agent picks the outer action $a_i$ according to the $\epsilon$-greedy action selection policy (25), and then the corresponding available set $\breve{\mathcal{B}}_o^{a_i}$ and occupied set $\mathcal{B}_o^{a_i}$ can be determined following the RB allocation regulation of Section II-A (line 6). Based on the local building distribution introduced in Section II-B, the large-scale fading map $\mathbf{L}$ for the current drone location can be generated (line 7). Next, the types of wireless links (LoS or NLoS) between the DUE and the BSs in the available set $\breve{\mathcal{B}}_o^{a_i}$ can be extracted from $\mathbf{L}$. To initialize the inner environment for each outer epoch, a random available BS is selected from the set $\breve{\mathcal{B}}_o^{a_i}$ (line 8). Furthermore, the actor of the DDPG agent selects the inner action $\hat{a}_j$. After executing the noise-contaminated inner action, the DDPG agent observes the next inner state $\hat{s}_{j+1}$ from the inner environment and then calculates the immediate reward $\hat{r}_j$ (line 11). Transitions of the inner MDP are stored into the inner replay buffer, i.e., $(\hat{s}_j, \hat{a}_j, \hat{s}_{j+1}, \hat{r}_j) \to \hat{\mathrm{R}}$ (line 12). After at least $N_P$ interactions between the DDPG agent and the inner environment, a mini-batch of $N_P$ transitions is sampled from $\hat{\mathrm{R}}$ to train the critic and actor networks, via the gradient descent method in (27) and the gradient ascent approach in (31), respectively (line 13). Each time the DDPG network is trained, the target critic and target actor networks are updated following the Polyak averaging rule (line 14). After the evaluation and training of the DDPG agent, the selected outer action $a_i$ is executed, the next outer state $s_{i+1}$ is observed from the outer environment, and the immediate outer reward $r_i$ is derived (line 16). Furthermore, transitions of the outer MDP are stored into the outer replay buffer R, i.e., $(s_i, a_i, s_{i+1}, r_i) \to \mathrm{R}$ (line 17). When at least $N_D$ transitions have been recorded in R, a mini-batch of $N_D$ transitions is randomly sampled from R and utilized to train the D3QN network (line 18). Every $\Upsilon_D$ steps, the target D3QN network is updated to the D3QN network via letting $\theta_D^- = \theta_D$ (line 19). For each training episode, the exploration parameter $\epsilon$ and the Normal noise variance $\sigma_P$ are discounted by their decaying rates to handle the exploration-exploitation dilemma (line 21).
Outer ObservationOuter Action s i
Outer Reward r i
D3QN Replay Buffer { s i , a i , r i , s i +1 }
DDPG Replay Buffer
Store TransitionsSample Mini-batch { ˆ s j , ˆ a j , ˆ r j , ˆ s j +1 }
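As a complement to Algorithm 1, the following Python skeleton illustrates how the outer and inner MDPs interleave. The environment and agent interfaces (outer_env, inner_env, d3qn, ddpg and their methods) are hypothetical stand-ins introduced for illustration; the skeleton mirrors the walkthrough above rather than reproducing the authors' code.

```python
import numpy as np

def hybrid_training_loop(outer_env, inner_env, d3qn, ddpg,
                         num_episodes, epochs_outer, epochs_inner,
                         eps, sigma_P, dec_eps, dec_sigma, upsilon_D):
    """Skeleton of Algorithm 1 under assumed interfaces."""
    for episode in range(num_episodes):
        s = outer_env.reset()                      # reset UAV to trajectory start
        for i in range(epochs_outer):
            a = d3qn.select_action(s, eps)         # epsilon-greedy RB selection (25)
            inner_s = inner_env.reset(outer_env.random_available_bs(a))
            for j in range(epochs_inner):
                # Beamforming action plus exploration noise
                inner_a = ddpg.actor(inner_s) + np.random.normal(0.0, sigma_P)
                inner_s_next, inner_r = inner_env.step(inner_a)
                ddpg.store(inner_s, inner_a, inner_s_next, inner_r)
                ddpg.train()                       # critic (28), actor (31), targets (32)-(33)
                inner_s = inner_s_next
            s_next, r = outer_env.step(a)          # execute the outer action
            d3qn.store(s, a, s_next, r)
            d3qn.train()                           # gradient descent in (22)
            if i % upsilon_D == 0:
                d3qn.sync_target()                 # hard target update every Upsilon_D steps
            s = s_next
        eps *= dec_eps                             # decay exploration parameters
        sigma_P *= dec_sigma
```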
V. SIMULATION RESULTS
In this section, numerical results are provided to evaluate the performance of the proposed hybrid D3QN-DDPG solution. As shown in Fig. 2, an urban subregion of size [0, ·] × [0, ·] × [0, ·] (in km) is considered, in which the local building distribution is generated via one realization of the ITU statistical model, with parameters set in line with Section II-B2. Note that the generated building distribution remains unchanged during the entire simulation process, which is consistent with the practical scenario in real life. In the considered model, the DUE's location at each time slot is observed to determine the LoS/NLoS links via checking the potential blockages between the DUE and the BSs. Note that the number of possible variants of the RBP map grows exponentially with card(B) × card(K), and hence they cannot all be traversed in simulation, or even in practice. To generate reproducible simulation results, we assume that the number of RBP variants equals the number of time slots and that these RBP variants form the RBP pool. For each interaction between the D3QN agent and the outer environment, the RBP map varies randomly only within the RBP pool. This is a reasonable assumption because these RBP variants can be regarded as the most likely cases experienced in the considered cellular network, while the remaining RBP variants can be ignored owing to their rareness.

Algorithm 1: The proposed hybrid D3QN-DDPG solution
1: Initialization: randomly initialize the D3QN network Q_D(s, a | θ_D) and its target network Q_D(s, a | θ_D^-), with θ_D^- ← θ_D; randomly initialize the DDPG networks, including the actor network μ(s | θ_μ), the critic network Q_P(s, a | θ_P), the target actor network μ(s | θ_μ^-) and the target critic network Q_P(s, a | θ_P^-), with θ_μ^- ← θ_μ and θ_P^- ← θ_P; initialize the D3QN replay buffer R with capacity `D and the DDPG replay buffer R̂ with capacity ´D;
2: for episode = 1, ..., epi do
3:   Initialize the outer environment and reset the UAV's location to q⃗_u(0);
4:   for i = 1, ..., epo_outer do
5:     Observe the outer state s_i;
6:     Select the outer action a_i, observe the available set B̆_o^{a_i} and the occupied set B_o^{a_i};
7:     Get L(i) via checking the potential blockages between q⃗_u(i) and all BSs;
8:     Randomly select a BS b̆ ∈ B̆_o^{a_i} and initialize the inner environment with L_S^{b̆u}(i);
9:     for j = 1, ..., epo_inner do
10:       Observe the inner state ŝ_j;
11:       Select and execute the inner action â_j, then observe the next inner state ŝ_{j+1} and calculate the corresponding inner reward r̂_j;
12:       Store the transition (ŝ_j, â_j, ŝ_{j+1}, r̂_j) into R̂;
13:       Sample a mini-batch of N_P transitions from R̂, then update the critic network Q_P(s, a | θ_P) via the gradient descent method in (27) and the actor network μ(s | θ_μ) via the gradient ascent approach in (31);
14:       Update the DDPG target networks Q_P(s, a | θ_P^-) and μ(s | θ_μ^-), following the Polyak averaging rule in (32) and (33), respectively;
15:     end for
16:     Execute the outer action a_i, then observe the next outer state s_{i+1} and calculate the outer reward r_i;
17:     Store the transition (s_i, a_i, s_{i+1}, r_i) into R;
18:     Sample a mini-batch of N_D transitions from R, then update the D3QN network Q_D(s, a | θ_D) via the gradient descent method in (22);
19:     Update the D3QN target network Q_D(s, a | θ_D^-) every Υ_D steps, i.e., θ_D^- ← θ_D;
20:   end for
21:   Update ε ← ε × dec_ε and σ_P ← σ_P × dec_σ;
22: end for

For ease of implementation, the DUE's initial location and destination are fixed at q⃗_u(I) = (1, ·, ·) km and q⃗_u(D) = (2, ·, ·) km, respectively. The given trajectory is defined as the straight line between q⃗_u(I) and q⃗_u(D), whose length is ‖q⃗_u(D) − q⃗_u(I)‖ ≈ 1.4 km. Besides, the velocity of the DUE is set as V_u = 35 m/s, and hence the DUE spends T_u = 40 s travelling from q⃗_u(I) to q⃗_u(D). The shape factor m of the Nakagami-m fading is set to 3 for LoS links and 1 for NLoS links. Unless otherwise specified, the simulation parameters follow Table I. An illustrative sketch of the RBP-pool setup is given below, before Table I.
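As an illustration of the setup above, the snippet below generates a fixed pool of RBP maps whose size equals the number of time slots and draws one map per outer interaction. The map shapes, the one-slot-per-second discretization, the placeholder values of card(B) and card(K), and the uniform sampling are all assumptions made for illustration, not the authors' exact generator.

```python
import numpy as np

rng = np.random.default_rng(seed=0)       # fixed seed for reproducible results

V_u, T_u = 35.0, 40.0                     # DUE velocity (m/s) and travel time (s)
num_slots = int(T_u)                      # assuming one observation per second
card_B, card_K = 22, 20                   # placeholder numbers of BSs and RBs

# Fixed pool of binary RBP maps (one per time slot); entry (b, k) = 1 means
# RB k is occupied by a GUE in cell b.
rbp_pool = rng.integers(0, 2, size=(num_slots, card_B, card_K))

def observe_rbp_map():
    """Each outer interaction sees a map drawn uniformly from the fixed pool."""
    return rbp_pool[rng.integers(num_slots)]
```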
Table I: Simulation Parameter Setting (values lost in extraction are marked "·")

| Parameters | Values | Parameters | Values |
| Capacities of replay buffers `D / ´D | 100,000 / 100,000 | Given TOP threshold Γ_th | · |
| epi | 100 | Capacity of B | · |
| epo_outer | 22 | Capacity of K | · |
| epo_inner | 20 | Transmit power of each BS P | · |
| Υ_D | 500 | Number of antennas at each BS M | · |
| ε / σ_P | · | dec_ε / dec_σ | · |
| σ² | -90 dBm | Size of mini-batch N_D / N_P | · |
| f_c | · | τ_h / BS antenna height z | · |
| α_D / α_Pc / α_Pa | · | ς / γ / δ_u | · |

A. Construction of DNNs
The proposed hybrid D3QN-DDPG solution is implemented in Python 3.8 with TensorFlow 2.3.1 and Keras. The optimizer minimizing the mean square error (MSE) for all the applied DNNs is Adam with a fixed learning rate. The activation function at each hidden layer is the popular ReLU function, chosen for its simplicity and generality. Besides, the activation function of the output layers of the D3QN and of the critic network of the DDPG is linear, while that of the actor network of the DDPG is Tanh.
The DNN of the D3QN agent is a fully-connected feedforward ANN, in which 3 hidden layers contain 512, 256 and 128 neurons, respectively. The sizes of the input and output layers of the D3QN are determined by the dimension of the RBP map and the number of possible RBs, i.e., card(B) × card(K) and card(K), respectively. Between the last hidden layer and the output layer there is a duelling layer with card(K) + 1 neurons, where one neuron estimates the state value and the remaining card(K) neurons track the action advantages of the card(K) possible actions. After aggregation, the output layer generates the estimates of the card(K) state-action values, as depicted in Fig. 3.

Both the critic and actor networks of the DDPG agent are fully-connected feedforward ANNs with 2 hidden layers consisting of 512 and 128 neurons, respectively. The input layer of the critic network has M + 1 neurons and its output layer has a single neuron (the estimated Q value), while the input and output layers of the actor network have M + 1 and M neurons, respectively. This is because the Nakagami-m fading component takes complex values, which must be decoupled at the input layers of the critic and actor networks, and one additional neuron is added to these input layers to help the networks identify the LoS/NLoS inner environment. To calculate the inner reward function (26), the actor network's outputs are reconstructed into a complex-valued vector of dimension M × 1, after which the vector is normalized to satisfy constraint (20c).
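For concreteness, the following Keras sketch builds networks with the layer sizes just described. The duelling head uses the standard mean-subtracted advantage aggregation, the dimensions card_B, card_K and M are placeholders, and the critic is given the action as a second input in the standard DDPG fashion; this is an illustrative reconstruction under those assumptions, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

card_B, card_K, M = 22, 20, 8        # placeholder dimensions

def build_d3qn():
    """Duelling D3QN: 3 hidden layers, then V(s) and A(s, a) heads."""
    x_in = layers.Input(shape=(card_B * card_K,))            # flattened RBP map
    x = layers.Dense(512, activation="relu")(x_in)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    head = layers.Dense(card_K + 1, activation="linear")(x)  # duelling layer
    v, adv = head[:, :1], head[:, 1:]                        # state value / advantages
    # Aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q = v + adv - tf.reduce_mean(adv, axis=1, keepdims=True)
    return Model(x_in, q)

def build_actor():
    """Actor: Tanh output reshaped/normalized into a beamforming vector."""
    s_in = layers.Input(shape=(M + 1,))          # fading state + LoS/NLoS flag
    x = layers.Dense(512, activation="relu")(s_in)
    x = layers.Dense(128, activation="relu")(x)
    w = layers.Dense(M, activation="tanh")(x)
    w = w / tf.norm(w, axis=1, keepdims=True)    # normalize, cf. constraint (20c)
    return Model(s_in, w)

def build_critic():
    """Critic: scalar Q-value estimate with a linear output neuron."""
    s_in = layers.Input(shape=(M + 1,))
    a_in = layers.Input(shape=(M,))              # action input (standard DDPG critic)
    x = layers.Concatenate()([s_in, a_in])
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    q = layers.Dense(1, activation="linear")(x)
    return Model([s_in, a_in], q)
```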
Figure 6: Reward history. (a) Reward history of D3QN. (b) Reward history of DDPG.

B. Training of Hybrid D3QN-DDPG Algorithm

Fig. 6 shows the reward history versus training episodes for the proposed hybrid D3QN-DDPG solution. The average reward reflects the expected value of the epoch rewards in each episode, calculated by averaging the accumulated rewards over the training epochs within every episode. It can be observed from Fig. 6 that both the D3QN and DDPG networks exhibit an increasing trend of average reward along the training process, despite some fluctuations, which are a common phenomenon for DRL-related algorithms. Specifically, the D3QN's average reward converges to its optimum (around 0.6) after 85 training episodes, while the DDPG converges to its highest average reward (about 0.78) after 40 training episodes. Fig. 6(a) validates that the D3QN agent can adapt to the dynamic RBP environment by allocating a proper RB index to the DUE, while Fig. 6(b) verifies that the DDPG agent is able to adjust the transmit beamforming vector to fit the small-scale fading environment. After saving the hybrid D3QN-DDPG model with the highest average reward, we can re-load it for the EOD performance comparison illustrated in Section V-D; a minimal checkpointing sketch is given below.
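The save/re-load step can be realized with standard Keras weight checkpointing. The sketch below tracks the best episode-average reward; it is an illustrative pattern with arbitrary file names, not the authors' exact bookkeeping.

```python
best_avg_reward = float("-inf")

def checkpoint_if_best(avg_reward, d3qn_net, actor_net, critic_net):
    """Save weights whenever a new highest episode-average reward is reached."""
    global best_avg_reward
    if avg_reward > best_avg_reward:
        best_avg_reward = avg_reward
        d3qn_net.save_weights("d3qn_best.h5")
        actor_net.save_weights("ddpg_actor_best.h5")
        critic_net.save_weights("ddpg_critic_best.h5")

# Later, re-load the best model for the EOD comparison in Section V-D:
# d3qn_net.load_weights("d3qn_best.h5")
# actor_net.load_weights("ddpg_actor_best.h5")
```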
C. Impacts of Hyper-parameters
It is well known that the overall performance of DRL-related algorithms is sensitive to hyper-parameters, e.g., the target network update and the learning rate. The hyper-parameters should therefore be picked carefully for a given system setting, which helps realize satisfactory learning quality and convergence speed.

Figure 7: Impact of learning rates and target network update frequency. (a) Impact of α_D. (b) Impact of Υ_D.

Figure 8: Impact of learning rates and Polyak interpolation factor. (a) Impact of α_Pa and α_Pc. (b) Impact of τ.

Fig. 7(a) shows average D3QN reward curves versus training episodes for various α_D, while Fig. 8(a) shows average DDPG reward curves versus training episodes for different combinations of α_Pa and α_Pc. From these subfigures, it can be observed that the learning rates have a significant impact on learning performance and convergence speed. With the two relatively high values of α_D, although the D3QN converges quite rapidly, it reaches extremely unsatisfactory learning scores (around 0.25 and 0.32, respectively). With relatively small α_D, the D3QN agent achieves higher scores (about 0.6). Surprisingly, an extremely small α_D leads to unsatisfactory learning performance within the range of 100 training episodes. However, such an extremely small α_D may have the potential to help the D3QN agent reach a new highest score, at the price of many more training episodes (i.e., a less favourable convergence rate). For Fig. 8(a), one learning-rate combination [α_Pa, α_Pc] is selected as the anchor for comparison, which converges to its optimal score (around 0.78) after about 40 training episodes. With a higher α_Pa, the DDPG agent barely learns anything and achieves the worst score (around 0.06). With a smaller α_Pa, the DDPG agent converges faster (around 20 training episodes) but achieves a slightly worse score (about 0.74). With a higher α_Pc, the DDPG agent experiences more fluctuations (especially from episode 10 to 20) and achieves worse learning quality (around 0.69). With a smaller α_Pc, the DDPG agent reaches an approximately equivalent learning score (around 0.78) with slower convergence (after episode 60). From the above observations, it is straightforward to conclude that the proposed hybrid D3QN-DDPG solution is, unsurprisingly, sensitive to the learning rates, which should be selected delicately to accomplish a good trade-off between learning quality and convergence speed. Intuitively, the learning rate matters because it is the step size of the DNN weight updates during training with the stochastic gradient descent method: smaller learning rates lead to smaller weight changes per update but generally require more training epochs, whereas larger learning rates produce larger weight changes but usually require fewer training epochs.

Fig. 7(b) depicts average D3QN reward curves versus training episodes for different Υ_D, while Fig. 8(b) illustrates average DDPG reward curves versus training episodes for various τ. From Fig. 7(b), it can easily be concluded that the target network technique adopted in the proposed hybrid D3QN-DDPG algorithm is essential. Specifically, less frequent updating (i.e., larger Υ_D) of the D3QN's target network helps the D3QN agent achieve better learning scores, while a smaller amount of updating (i.e., smaller τ) of the DDPG's target networks is more favourable.
However, larger Υ_D and smaller τ can result in slower convergence. Hence, the choice of Υ_D and τ is important for the proposed hybrid D3QN-DDPG solution to balance learning performance against convergence speed. Intuitively, the target networks stabilize DRL-related algorithms and make the proposed solution more robust; note that the D3QN uses hard (periodic) target updates while the DDPG uses soft (Polyak) updates, as contrasted in the sketch below.

Figure 9: Performance comparison. (a) Performance comparison versus P. (b) Performance comparison versus M.
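The two target-update styles discussed above can be contrasted in a few lines of TensorFlow; the model handles and the update constants are hypothetical placeholders.

```python
def update_targets(step, upsilon_D, tau,
                   online_d3qn, target_d3qn, critic, target_critic):
    """Contrast of the two target-update rules used by the hybrid agent."""
    # Hard update (D3QN): copy all weights every Upsilon_D steps.
    if step % upsilon_D == 0:
        target_d3qn.set_weights(online_d3qn.get_weights())
    # Soft update (DDPG): Polyak averaging with a small tau after every step.
    for t_var, var in zip(target_critic.variables, critic.variables):
        t_var.assign(tau * var + (1.0 - tau) * t_var)
```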
D. Performance Comparison

For the performance comparison, the following benchmarks are provided; a sketch of the exhaustive-search baseline is given after this list.
1) RR w/o BD: the RB index selected for each time slot and the beamforming vector at each available BS are both randomly generated. This approach is expected to be the worst, leading the DUE to suffer the maximal transmission outage duration.
2) RR w/ BD: the RB index scheduled for each time slot is randomly selected, but the beamforming vectors at the available BSs are generated with the help of the trained DDPG agent.
3) ER w/ BD: the RB index assigned for each time slot is the optimum found via exhaustive search, which maximizes (21) for every observed RBP map, while the beamforming vector at each available BS is obtained from the trained DDPG agent. This benchmark serves as the lower bound of EOD performance, i.e., it should let the DUE suffer the minimal transmission outage duration.
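As referenced in item 3), the ER w/ BD baseline can be sketched as a brute-force scan over all RB indices. Here objective_21 (evaluating the metric in (21)), inner_state_for and trained_actor are hypothetical stand-ins, so this is an illustrative sketch under those assumptions rather than the authors' implementation.

```python
def exhaustive_rb_search(rbp_map, inner_state_for, trained_actor,
                         objective_21, card_K):
    """ER w/ BD: try every RB index and keep the one maximizing objective (21);
    beamforming for each candidate comes from the trained DDPG actor."""
    best_rb, best_score = None, float("-inf")
    for rb in range(card_K):
        s = inner_state_for(rb)               # inner state induced by this RB choice
        w = trained_actor(s)                  # beamforming from the trained DDPG agent
        score = objective_21(rbp_map, rb, w)  # evaluate metric (21)
        if score > best_score:
            best_rb, best_score = rb, score
    return best_rb
```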
The proposed hybrid D3QN-DDPG solution provides the proper RB index for each time slot and the designed beamforming vector for each available BS, with the aid of the trained D3QN agent and DDPG agent, respectively. Fig. 9(a) and Fig. 9(b) show the EOD curves of the proposed D3QN-DDPG solution and the benchmarks versus P and M, respectively. Fig. 9(a) clearly illustrates that the EOD curves decrease dramatically as P increases, which means that a higher P helps the DUE achieve better transmission outage performance (i.e., lower EOD). Comparing the EOD curves of RR w/o BD and RR w/ BD, an EOD performance enhancement can be observed (especially in the lower range of P), which validates the effectiveness of the DDPG component. Besides, a greater EOD performance improvement can be achieved with the help of the D3QN component, as seen by comparing the EOD curves of RR w/ BD and the proposed hybrid D3QN-DDPG solution. These observations validate that the D3QN and DDPG agents each offer an independent EOD performance gain, which is a favourable feature of the proposed hybrid D3QN-DDPG solution. Compared to the optimal method ER w/ BD, the proposed hybrid D3QN-DDPG solution helps the DUE achieve sub-optimal EOD performance: it performs slightly worse than the optimal approach but provides a significantly larger EOD reduction for the DUE than the other two benchmarks (i.e., the RR w/o BD and RR w/ BD methods). A similar conclusion can be drawn from Fig. 9(b), which demonstrates the EOD curves for various M. From these two subfigures, one can also observe that increasing P improves the EOD performance much more significantly than increasing M.

VI. CONCLUSION
This paper studied a joint RB allocation and beamforming design optimization problem in a cellular-connected UAV network while protecting the GUEs' transmission quality, in which the EOD of the DUE was minimized via the proposed hybrid D3QN-DDPG algorithm. Specifically, the D3QN and DDPG agents were trained to accomplish the RB allocation in the discrete action domain and the beamforming design in the continuous action regime, respectively. To realize this, an outer MDP was defined to characterize the dynamic RBP environment at the terrestrial BSs, while an inner MDP was formulated to trace the time-varying small-scale fading environment. The hybrid D3QN-DDPG solution was proposed to solve the outer and inner MDPs interactively so that sub-optimal EOD performance for the considered optimization problem could be achieved. Numerical results illustrated that the proposed hybrid D3QN-DDPG solution can significantly reduce the EOD of the DUE and achieve sub-optimal EOD performance compared to the provided benchmarks. Most importantly, the trained D3QN and DDPG agents were also validated to offer independent improvements in EOD performance. This work can be extended in various promising research directions, for example, a more complex RB allocation scenario in which the DUE can be assigned more than one RB index at a time, or sum-EOD minimization for a multi-drone cellular-connected UAV network subject to an individual EOD threshold at each DUE.
REFERENCES

[1] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, "Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927-1941, 2018.
[2] S. Zhang, H. Zhang, B. Di, and L. Song, "Cellular UAV-to-X communications: Design and optimization for multi-UAV networks," IEEE Trans. Wireless Commun., vol. 18, no. 2, pp. 1346-1359, 2019.
[3] J. Hu, H. Zhang, L. Song, R. Schober, and H. V. Poor, "Cooperative internet of UAVs: Distributed trajectory design by multi-agent deep reinforcement learning," IEEE Trans. Commun., vol. 68, no. 11, pp. 6807-6821, 2020.
[4] W. Mei and R. Zhang, "Cooperative downlink interference transmission and cancellation for cellular-connected UAV: A divide-and-conquer approach," IEEE Trans. Commun.
… in Proc. IEEE Global Commun. Conf. (GLOBECOM), Waikoloa, USA, 2019, pp. 1-6.
[7] F. Wu, H. Zhang, J. Wu, and L. Song, "Cellular UAV-to-device communications: Trajectory design and mode selection by multi-agent deep reinforcement learning," IEEE Trans. Commun., vol. 68, no. 7, pp. 4175-4189, 2020.
[8] X. Zhou, S. Yan, J. Hu, J. Sun, J. Li, and F. Shu, "Joint optimization of a UAV's trajectory and transmit power for covert communications," IEEE Trans. Signal Process., vol. 67, no. 16, pp. 4276-4290, 2019.
[9] G. Pan, H. Lei, J. An, S. Zhang, and M.-S. Alouini, "On the secrecy of UAV systems with linear trajectory," IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6277-6288, 2020.
[10] G. Hattab and D. Cabric, "Energy-efficient massive IoT shared spectrum access over UAV-enabled cellular networks," IEEE Trans. Commun., vol. 68, no. 9, pp. 5633-5648, 2020.
[11] F. Zhou, Y. Wu, H. Sun, and Z. Chu, "UAV-enabled mobile edge computing: Offloading optimization and trajectory design," in Proc. IEEE International Conference on Communications (ICC), Kansas City, USA, 2018, pp. 1-6.
[12] J. Hu, Y. Wu, R. Chen, F. Shu, and J. Wang, "Optimal detection of UAV's transmission with beam sweeping in covert wireless networks," IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1080-1085, 2019.
[13] G. Boudreau, J. Panicker, N. Guo, R. Chang, N. Wang, and S. Vrzic, "Interference coordination and cancellation for 4G networks," IEEE Commun. Mag., vol. 47, no. 4, pp. 74-81, 2009.
[14] C. Kosta, B. Hunt, A. U. Quddus, and R. Tafazolli, "On interference avoidance through inter-cell interference coordination (ICIC) based on OFDMA mobile systems," IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 973-995, 2012.
[15] R. Zhang, Y.-C. Liang, and S. Cui, "Dynamic resource allocation in cognitive radio networks," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 102-114, 2010.
[16] R. Irmer et al., "Coordinated multipoint: Concepts, performance, and field trial results," IEEE Commun. Mag., vol. 49, no. 2, pp. 102-111, 2011.
[17] P. Chandhar, D. Danev, and E. G. Larsson, "Massive MIMO for communications with drone swarms," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1604-1629, 2017.
[18] N. Senadhira, S. Durrani, X. Zhou, N. Yang, and M. Ding, "Uplink NOMA for cellular-connected UAV: Impact of UAV trajectories and altitude," IEEE Trans. Commun., vol. 68, no. 8, pp. 5242-5258, 2020.
[19] L. Liu, S. Zhang, and R. Zhang, "Multi-beam UAV communication in cellular uplink: Cooperative interference cancellation and sum-rate maximization," IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4679-4691, 2019.
[20] W. Mei, Q. Wu, and R. Zhang, "Cellular-connected UAV: Uplink association, power control and interference coordination," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5380-5393, 2019.
[21] P. Series, "Propagation data and prediction methods required for the design of terrestrial broadband radio access systems operating in a frequency range from 3 to 60 GHz," Recommendation ITU-R, pp. 1410-1415, 2013.
[22] Y. Li, R. Zhao, Y. Deng, F. Shu, Z. Nie, and A. H. Aghvami, "Harvest-and-opportunistically-relay: Analyses on transmission outage and covertness," IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 7779-7795, 2020.
[23] Y. Li, R. Zhao, X. Tan, and Z. Nie, "Secrecy performance analysis of artificial noise aided precoding in full-duplex relay systems," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Singapore, 2017, pp. 1-6.
[24] 3GPP TR 36.777, "Enhanced LTE support for aerial vehicles," Dec. 2017.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., 2018.
[26] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[27] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. of AAAI, 2016, pp. 2094-2100.
[28] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. of ICML, 2016, pp. 1995-2003.
[29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in Proc. of ICLR, 2016.