QoE Optimization for Live Video Streaming in UAV-to-UAV Communications via Deep Reinforcement Learning
Liyana Adilla binti Burhanuddin, Xiaonan Liu, Yansha Deng, Ursula Challita, and András Zahemszky

L. A. B. Burhanuddin, X. Liu, and Y. Deng are with the Department of Engineering, King's College London, London, UK. L. A. B. Burhanuddin is also with the School of Information Science and Technology, Xiamen University Malaysia, Sepang, Malaysia (e-mail: [email protected], [email protected], [email protected]). U. Challita and A. Zahemszky are with Ericsson AB, Stockholm, Sweden (e-mail: [email protected], [email protected]). Corresponding author: Yansha Deng.
Abstract—A challenge for rescue teams fighting wildfire in remote areas is the lack of information, such as the size and images of fire areas. As such, live streaming from Unmanned Aerial Vehicles (UAVs), capturing videos of dynamic fire areas, is crucial for firefighter commanders in any location to monitor the fire situation and respond quickly. The 5G network is a promising wireless technology to support such scenarios. In this paper, we consider a UAV-to-UAV (U2U) communication scenario, where a UAV at a high altitude acts as a mobile base station (UAV-BS) to stream videos from other flying UAV-users (UAV-UEs) through the uplink. Due to the mobility of the UAV-BS and UAV-UEs, it is important to determine the optimal movements and transmission powers of the UAV-BS and UAV-UEs in real time, so as to maximize the data rate of video transmission with smoothness and low latency, while mitigating interference according to the dynamics of the fire areas and wireless channel conditions. In this paper, we co-design the video resolution, the movement, and the power control of the UAV-BS and UAV-UEs to maximize the Quality of Experience (QoE) of real-time video streaming. To learn the dynamic fire areas and communication environment, we apply the Deep Q-Network (DQN) and Actor-Critic (AC) algorithms to maximize the QoE of video transmission from all UAV-UEs to a single UAV-BS. Simulation results show the effectiveness of our proposed algorithms in terms of QoE, delay, and video smoothness as compared to the Greedy algorithm.
Index Terms—Quality of Experience (QoE), UAV-to-UAV (U2U) communication, video streaming, Deep Q-Network (DQN), Actor-Critic (AC).
I. INTRODUCTION
Over the years, an increasing number of wildfires has caused challenges for firefighters to control and monitor fire in remote areas [1], [2]. Without new technology to monitor the incident area from the control station, current fire station practice lacks the means to remotely visualize the dynamic fire situation in real time for immediate action [2]. Therefore, monitoring multiple firefighting areas in different locations with dynamic fire heights and areas is vital. Unmanned Aerial Vehicles (UAVs), with low cost, high mobility, and the capability to capture high-definition video, can be a good solution to oversee the fire situation and facilitate the fire commander's response in choosing the number of firefighters and firefighting machines. The use of UAVs provides the fire commander with the overall situation of the fire and its dangers, such as explosions or humans requiring rescue. More importantly, it helps to reduce imminent dangers and obstacles to firefighters. Existing wireless technologies, such as WiFi, Bluetooth, and radio wave, can only support UAVs' communication within a short transmission range, which is inefficient for multi-UAV collaboration with limited multi-UAV control [3]. Meanwhile, cellular networks can support real-time video streaming from UAV users (UAV-UEs) with beyond-line-of-sight control, low latency, real-time communication, and ubiquitous coverage from flying base stations (UAV-BSs) with wireless backhaul to the core networks. Despite the growing interest in cellular-connected UAVs, many challenges remain unsolved for commercial deployment [3], [4]. A UAV was initially proposed as a relay to help other UAVs transmit to a nearby terrestrial base station (BS) with low signal-to-noise ratio (SNR) [4]. When the distance of UAV-to-UAV (U2U) communication decreases, the SNR of the transmission among the UAVs increases, resulting in a better transmission performance [5].

The use of UAVs in disaster scenarios has been investigated in the literature [6]–[13]. In [6], the UAV was introduced as an emergency BS to serve the affected ground users with limited coverage. In [7], multiple mini-UAVs were used to form a flying ad-hoc network (FANET) to explore large and disjoint terrain in disaster areas while adapting their transmission power to optimize energy usage. In [11], through optimizing the trajectory and the transmit power of the UAV and the mobile device, the outage probability of the UAV relay network in the disaster area was minimized. In [12], a UAV platform was developed to compensate for the communication loss during a natural disaster, with the aim of obtaining the optimal flight paths in high-rise urban and urban microcell environments. In [13], UAV-assisted networks were studied in a disaster area, and the proposed power control optimization problem was solved by relaxing the non-convex problem. Nevertheless, no studies have focused on the real-time video streaming between UAV-UEs and a UAV-BS.

Real-time video streaming has higher requirements in terms of data rate, latency, and smoothness compared to other data types. In a firefighting scenario, the network channel capacity fluctuates dramatically with the dynamic environment alongside the UAVs' movement, which can cause poor network performance and undesirable delays. This in turn makes it harder to learn the pattern variance of the channel capacity, thus resulting in failure to transmit with high capacity and high video quality. To overcome these limitations, the authors in [14] applied the Adaptive Bitrate (ABR) method with Deep Reinforcement Learning (DRL) to select a proper video resolution based on the previous communication rate and throughput. However, [14] only focused on a single video source ABR, which was guided by RL to make decisions based on network observations and video playback states for selecting the optimal video resolution.
In a search-and-rescue firefighting scenario, a non-ordinary optical camera [15] should be considered to ensure the reception of a high-quality video. To deal with more complex environments and practical scenarios, such as search-and-rescue firefighting scenarios, the DRL algorithm is a promising tool for solving the problem of jointly optimizing the UAVs' locations while maximizing the data rate [16].

In this paper, we consider a cellular-connected UAV-BS streaming the real-time video captured by UAV-UEs from the firefighting area for fire monitoring. The contributions of this paper are summarized as follows:

• We develop a framework for a dynamic UAV-to-UAV (U2U) communication model with a moving UAV-BS in multiple firefighting areas to capture a live-streaming panoramic view. We model the dynamic fire arrival with different heights in every fire area and the UAVs' request arrival as a Poisson process in each time slot, and design the UAV-UE location spaces to capture a full panoramic view with multiple UAVs.

• To guarantee the smoothness and latency of the live video streaming between the UAV-BS and UAV-UEs in this U2U network, we formulate a long-term Quality of Experience (QoE) maximization problem via optimizing the UAVs' positions, video resolution, and transmit power over each time slot.

• To solve the above problem, we propose a Deep Reinforcement Learning (DRL) approach based on the Actor-Critic (AC) and the Deep Q-Network (DQN). Our results show that our proposed AC and DQN approaches outperform the Greedy algorithm in terms of QoE.

The rest of this paper is organized as follows. The system model and problem formulation are given in Section II. The optimization problem via reinforcement learning is presented in Section III. Simulation results and conclusions are presented in Sections IV and V, respectively.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As illustrated in Fig. 1, we consider a single UAV-BS to provide network coverage for multiple UAV-UEs, satisfying the network rate requirement of each UAV-UE to stream high-quality video of multiple firefighting areas. The UAV-BS is located at the center of the environment, such as a forest area, with maximum coverage radius $r_{\max}$. The UAV-BS is connected through a wireless network to the fixed or mobile control station. We assume that the arrival distribution of the fire video streaming requests is the same as that of the fire arrival distribution [17], which follows a Poisson process with density $\lambda_a$.

Figure 1. Illustration of the system model.

The UAV-BS receives a request when a fire event occurs, and the $k$th UAV-UE automatically flies to the center of the $k$th flying region $FR_k$ to serve the $i$th fire area $A_i(x_i, y_i)$. We consider a video streaming task that lasts for $T$ time slots, each with an equal duration $t$. The selection of the optimal location to stream the video plays an important role in ensuring the UAV-UEs capture the full firefighting area $A_i$. Therefore, the $k$th UAV-UE needs to find the optimal position $U(x^*_k, y^*_k, h^*_k)$ to transmit the video to the UAV-BS. The size of the $k$th flying region $FR_k$ for the $k$th UAV-UE depends on the number of UAV-UEs that perform the video streaming for the $i$th fire area $A_i$. To make sure that all UAV-UEs can jointly capture the panoramic video of $A_i$, $K$ UAV-UEs are distributed evenly around $A_i$, as shown in Fig. 1. Meanwhile, the UAV-BS also searches for the optimal location $P(x^*_{BS}, y^*_{BS}, h^*_{BS})$ to satisfy the minimum data rate requirement of all UAV-UEs. In addition, a safety region around $A_i$ is considered to guarantee that $FR_k$ and $A_i$, as well as $A_i$ and $A_{i+1}$, do not overlap, so that the UAV-BS and UAV-UEs are safe from fire.

A. Request Arrival
The request contains the $i$th area $A_i$ with its centre at $(x_i, y_i)$ and radius $r_i$. We assume that $K$ UAV-UEs serve each fire area and stream real-time videos simultaneously. We assume that the height of the fire $h_i$ follows a log-normal distribution [18]; thus, the minimum flying height of all UAVs is $h_{\min}$, which satisfies $h_{\min} = \max(h_i)$. All UAV-UEs in $A_i$ operate at the same altitude. The environment is divided into $W$ square grids; thus, the length, width, and height of each grid are $\frac{X}{\sqrt{W}}$, $\frac{Y}{\sqrt{W}}$, and $\frac{Z}{\sqrt{W}}$, respectively. At the $t$th time slot, the flying position $\vec{U}(x_{i,k}, y_{i,k}, h_{i,k})$ of the $k$th UAV-UE can be calculated as
$$\vec{U}_{t+1}(x_{i,k}, y_{i,k}, h_{i,k}) = \vec{U}_t(x_{i,k}, y_{i,k}, h_{i,k}) + \vec{a}_t(x, y, z), \tag{1}$$
with
$$x_i - a \le x_{i,k} \le x_i + a, \tag{2}$$
$$y_i - a \le y_{i,k} \le y_i + b, \tag{3}$$
$$h_{\min} \le h_{i,k} \le h_{\max}, \tag{4}$$
where $\vec{a}_t(x, y, z)$ is the action vector, $a = r_i + r_s$, $b = r_i + r_s + l$, $r_s$ is the safe distance between $A_i$ and $FR_k$, $l$ is the length of the flying region, and $h_{\max}$ is the maximum height of a UAV-UE regulated by the government (i.e., 120 m in the UK [19]). Furthermore, to capture the full panoramic video, we propose the boundary flying areas for the UAV-UEs in each fire area, which can be written as
$$U_{t(i,k=1)} = \{(x, y, h) \mid x_i - a \le x_{i,1} \le x_i + a,\ y_i + a \le y_{i,1} \le y_i + b,\ h_i \le h \le h_{\max}\}, \tag{5a}$$
$$U_{t(i,k=2)} = \{(x, y, h) \mid x_i - b \le x_{i,2} \le x_i - a,\ y_i - a \le y_{i,2} \le y_i + a,\ h_i \le h \le h_{\max}\}, \tag{5b}$$
$$U_{t(i,k=3)} = \{(x, y, h) \mid x_i - a \le x_{i,3} \le x_i + a,\ y_i - b \le y_{i,3} \le y_i - a,\ h_i \le h \le h_{\max}\}, \tag{5c}$$
$$U_{t(i,k=4)} = \{(x, y, h) \mid x_i + a \le x_{i,4} \le x_i + b,\ y_i - a \le y_{i,4} \le y_i + a,\ h_i \le h \le h_{\max}\}. \tag{5d}$$

Figure 2. Flying boundary of the $k$th UAV-UE.
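To make the update concrete, the Python sketch below applies Eq. (1) and clips the result to the bounds of Eqs. (2)-(4); the function and variable names are illustrative assumptions, not the paper's implementation.

import numpy as np

def step_uav_ue(u_t, action, x_i, y_i, a, b, h_min, h_max):
    # Eq. (1): candidate next position U_{t+1} = U_t + a_t(x, y, z)
    u_next = np.asarray(u_t, dtype=float) + np.asarray(action, dtype=float)
    # Eqs. (2)-(4): keep the UAV-UE inside its flying boundary
    u_next[0] = np.clip(u_next[0], x_i - a, x_i + a)
    u_next[1] = np.clip(u_next[1], y_i - a, y_i + b)
    u_next[2] = np.clip(u_next[2], h_min, h_max)
    return u_next

# Example with a = r_i + r_s and b = r_i + r_s + l, as defined above.
u = step_uav_ue([300.0, 300.0, 60.0], [5.0, -5.0, 2.0],
                x_i=300.0, y_i=300.0, a=260.0, b=310.0,
                h_min=50.0, h_max=120.0)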
B. Channel Model

In the wireless network, we assume that the channel between the $k$th UAV-UE and the UAV-BS contains large-scale fading (path loss and channel gain) and small-scale fading [3]. We assume that the links between the UAVs are line-of-sight (LoS). The path loss from the $k$th UAV-UE to the UAV-BS can be written as
$$PL_{LoS,k}(t) = 20 \log_{10}\!\left(\frac{4\pi f_c d_{kD}(t)}{c}\right) + \eta_{LoS}, \tag{6}$$
where $f_c$ is the carrier frequency, $c$ is the speed of light in vacuum, $\eta_{LoS}$ is the additional attenuation factor due to the LoS connection, and $d_{kD}(t)$ is the distance between the $k$th UAV-UE and the UAV-BS, as shown in Fig. 3, which can be calculated as
$$d_{kD}(t) = \sqrt{(x_{BS}(t) - x_k(t))^2 + (y_{BS}(t) - y_k(t))^2 + (h_{BS}(t) - h_k(t))^2}. \tag{7}$$

Figure 3. UAV-to-UAV communication.

In our model, we use the Rician distribution [20], [21] to define the small-scale fading $p_\xi(d_k)$, which can be denoted as
$$p_\xi(d_k) = \frac{d_k}{\sigma^2} \exp\!\left(-\frac{d_k^2 + \rho^2}{2\sigma^2}\right) I_0\!\left(\frac{d_k \rho}{\sigma^2}\right), \tag{8}$$
with $d_k \ge 0$, where $\rho$ and $\sigma$ are the strengths of the dominant and scattered (non-dominant) paths, respectively, and $I_0(\cdot)$ is the zeroth-order modified Bessel function of the first kind. The Rice factor $\kappa$ can be defined as
$$\kappa = \frac{\rho^2}{2\sigma^2}. \tag{9}$$

It is possible that the selected position of each UAV-UE generates more interference to the nearby UAVs, which can result in poor transmission performance and make it difficult for the UAV-UE to maintain the connection with the UAV-BS. Power control can be a solution to minimize the uplink interference among UAV-UEs at an appropriate power level [22]. By properly controlling the transmit power of each UAV-UE in the uplink transmission, the interference among UAV-UEs can be mitigated. According to the 3GPP guidelines [23], we consider fractional power control for all UAVs, and the power transmitted by the $k$th UAV-UE while communicating with the UAV-BS can be given by
$$P_{U_k} = \min\!\left\{P^{\max}_{U_k},\ 10 \log_{10}(B) + \rho_{u_k} PL\right\}, \tag{10}$$
where $P^{\max}_{U_k}$ is the maximum transmit power of the UAV-UE, $B$ is the channel bandwidth, and $\rho_{u_k} \in \{0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}$ is the fractional path loss compensation power control parameter [22].

In the proposed wireless UAV network, the received power from the $k$th UAV-UE at the UAV-BS at the $t$th time slot is
$$P_k(t) = P_{U_k} G (d_{kD}(t))^{-\alpha} 10^{-p_\xi(d_k)/10}, \tag{11}$$
where $P_{U_k}$ is the transmit power of the $k$th UAV-UE, $G$ is the channel power gain factor introduced by the amplifier and antenna [4], $(d_{kD}(t))^{-\alpha}$ is the path loss, $\alpha$ is the path loss exponent, and $p_\xi(d_k)$ is the Rician small-scale fading. The interference from the $m$th UAV-UE at the UAV-BS at the $t$th time slot can be written as
$$I_{UU}(t) = \sum_{m \in K \setminus k} \psi_m(t) P_m(t), \tag{12}$$
where $\psi_m(t) = 1$ indicates that the transmission between the $m$th UAV-UE and the UAV-BS is active, otherwise $\psi_m(t) = 0$, and $P_m(t)$ is the received power from the $m$th UAV-UE. The signal-to-interference-plus-noise ratio (SINR) at the UAV-BS is given by
$$\gamma_k(t) = \frac{P_k(t)}{N_0 + \sum_{m \in K \setminus k} \psi_m(t) P_m(t)}, \tag{13}$$
where $N_0$ is the noise power at the UAV-BS, whose elements are averages of independent random Gaussian variables with variance $\sigma_n^2$. Then, the uplink transmission rate from the $k$th UAV-UE to the UAV-BS can be denoted as
$$R_k(t) = B \log_2(1 + \gamma_k(t)). \tag{14}$$
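As a sanity check on Eqs. (11)-(14), the following numpy sketch computes the received power, interference, SINR, and uplink rate; all parameter values and names are illustrative assumptions (the gain and noise figures are taken from Table II).

import numpy as np

def received_power_w(p_tx_w, g, d, alpha, fading_db):
    # Eq. (11): P_k(t) = P_Uk * G * d^(-alpha) * 10^(-p_xi(d_k)/10)
    return p_tx_w * g * d ** (-alpha) * 10 ** (-fading_db / 10)

def uplink_rate_bps(p_rx_w, interferers_w, noise_w, bandwidth_hz):
    sinr = p_rx_w / (noise_w + sum(interferers_w))   # Eqs. (12)-(13)
    return bandwidth_hz * np.log2(1 + sinr)          # Eq. (14)

# Example: one serving UAV-UE and two active interferers.
g = 10 ** (-31.5 / 10)                               # G = -31.5 dB
p_k = received_power_w(0.2, g, d=400.0, alpha=2.0, fading_db=3.0)
p_m = [received_power_w(0.2, g, d, 2.0, 3.0) for d in (900.0, 1200.0)]
noise_w = 10 ** (-96 / 10) * 1e-3                    # -96 dBm in watts
rate = uplink_rate_bps(p_k, p_m, noise_w, bandwidth_hz=1e6)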
C. Video Streaming Model

In this paper, we consider long-term video streaming modelled as consecutive video segments. Each segment consists of multiple frames, and a frame is considered the smallest data unit. The resolution of each frame corresponds to its minimum data rate requirement. Table I presents the types of video quality [24]. For example, if the communication rate (bitrate) is between 300 and 700 kbps, the video type that should be used is 240p. Since 144p corresponds to the smallest video type, all UAV-UEs need to satisfy the minimum uplink bitrate, i.e., $R_{\min} = 80$ kbps.

Each UAV-UE is equipped with a non-ordinary optical camera with resolution $r_{px} \times r_{py}$, and the video consists of multiple consecutive frames [15]. The camera is used to monitor the fire area with three main goals: 1) detect the size of the fire by continuously capturing the panoramic video; 2) verify and locate reported fires; and 3) closely monitor a known fire through streams distributed around the incident. The quality of a video frame depends on the resolution of the $i$th video frame at the $t$th time slot, $v_i(t)$. Furthermore, we assume that each video frame has the same playback time $T_l$, i.e., 2 ms to 4 ms, depending on whether the frame rate is 30 FPS or 60 FPS. In addition, the delay of video streaming via UAVs consists of three elements: capture time, encoding time, and transmission time. As all UAVs capture video using the same resolution, the capture time and the encoding time are constant. Thus, we mainly focus on the uplink transmission time, which can be expressed as
$$T_{i,k}(t) = \frac{D(v_i(t))}{R_k(t)} = \frac{r_{px} \cdot r_{py} \cdot b}{B \log_2(1 + \gamma_k(t))}, \tag{15}$$
where $b$ is the number of bits per pixel, and $D(v_i(t))$ is the data size based on $v_i(t)$. The video frames are processed in parallel in multi-core processors, and the time consumption at the $t$th time slot is $T(t) = \max\{T_{i,k}(t)\}$ [25]. To guarantee the smoothness and seamlessness of the video streaming, $T(t)$ must satisfy the delay constraint, namely, $T(t) < T_l$.
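A minimal sketch of the delay check implied by Eq. (15) and the constraint $T(t) < T_l$ follows; the resolution, bits-per-pixel, and rate values are assumptions chosen only for illustration.

def frame_tx_time_s(r_px, r_py, bits_per_pixel, rate_bps):
    # Eq. (15): uplink transmission time of one video frame
    return r_px * r_py * bits_per_pixel / rate_bps

# T(t) = max_k T_{i,k}(t) over the parallel-processed frames,
# and the slot is smooth only if T(t) < T_l.
times = [frame_tx_time_s(426, 240, 0.1, r) for r in (4e6, 5e6, 3.5e6)]
smooth = max(times) < 0.004   # T_l assumed to be 4 ms here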
D. Quality of Experience Model

The key parameters of video streaming are the video quality, quality variation, rebuffering time, and startup delay [26]. According to [14], the rebuffering time and startup delay can be ignored. Thus, the video transmission may suffer from a delay, which can be calculated as $D(t) = T(t) - T_l$, with $T_l$ as the delay constraint. The QoE can be formulated as the sum of the QoE over all areas and all UAV users, and denoted as
$$QoE = \frac{\kappa_{i,k}(t)}{IK} \left( \sum_{i=1}^{I} \sum_{k=1}^{K} q(R_{i,k}(t)) - |q(R_{i,k}(t)) - q(R_{i,k}(t-1))| \right) - \omega(t) D(t), \tag{16}$$
where $q(R_{i,k}(t))$ is the video quality metric [27], which can be written as
$$q(R_{i,k}(t)) = \log\!\left(\frac{R_{i,k}(t)}{R_{\min}(v_i(t))}\right), \tag{17}$$
and $\kappa_{i,k}(t)$ and $\omega(t)$ are the weights of the video quality and delay, respectively. As our aim is to maximize the QoE, the condition $\kappa_{i,k}(t) > \omega(t)$ must be guaranteed, and $R_{\min}(v_i(t))$ is the minimum rate that should be satisfied for the selected $v_i(t)$.
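A direct numpy transcription of Eqs. (16)-(17) is given below, assuming rates is an I x K array of the current per-UAV rates and prev_rates holds the previous slot; a sketch, not the authors' code.

import numpy as np

def quality(rates, r_min):
    # Eq. (17): logarithmic quality metric q(R) = log(R / R_min)
    return np.log(np.asarray(rates, dtype=float) / r_min)

def qoe(rates, prev_rates, r_min, kappa, omega, delay):
    # Eq. (16): mean quality minus quality variation, minus weighted delay
    q, q_prev = quality(rates, r_min), quality(prev_rates, r_min)
    return kappa * np.mean(q - np.abs(q - q_prev)) - omega * delay

# Example with I = 2 areas and K = 2 UAV-UEs (illustrative numbers):
value = qoe([[8e5, 9e5], [7e5, 8e5]], [[8e5, 8e5], [7e5, 9e5]],
            r_min=8e4, kappa=1.0, omega=0.5, delay=0.0)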
Table I. Type of Video Quality [24]

Video Quality | Resolution (pixels) | Frame rate (FPS) | Bitrate (average) | Data used per minute | Data used per 60 minutes
144p | 256x144 | 30 | 80-100 Kbps | 0.5-1.5 MB | 30-90 MB
240p | 426x240 | 30 | 300-700 Kbps | 3-4.5 MB | 180-250 MB
360p | 640x360 | 30 | 400-1,000 Kbps | 5-7.5 MB | 300-450 MB
480p | 854x480 | 30 | 500-2,000 Kbps | 8-11 MB | 480-660 MB
720p (HD) | 1280x720 | 30-60 | 1.5-6.0 Mbps | 20-45 MB | 1.2-2.7 GB
1080p (FHD) | 1920x1080 | 30-60 | 3.0-9.0 Mbps | 50-68 MB | 2.5-4.1 GB

E. Problem Formulation

Our aim is to maximize the QoE by jointly optimizing the positions of the UAV-BS and UAV-UEs and the adaptive bitrate selection. The fluctuation of the transmission link causes unstable network performance, which leads to low QoE and high delay. Thus, to minimize the delay at each Transmission Time Interval (TTI) and maximize the quality of video streaming, we jointly consider the UAV-BS location $P = (x_{BS}(t), y_{BS}(t), h_{BS}(t))$, the position of the $k$th UAV-UE $U = (x_{i,k}(t), y_{i,k}(t), h_{i,k}(t))$, the maximum power of the UAV-UE $P_{U_k}$, and the bitrate resolution $V = \{144, 240, 360, 480, 720, 1080\}$p. The optimization problem can be formulated as
$$\max_{\{P, U, P_{U_k}, V\}} \frac{\kappa_{i,k}(t)}{IK} \left( \sum_{i=1}^{I} \sum_{k=1}^{K} q(R_{i,k}(t)) - |q(R_{i,k}(t)) - q(R_{i,k}(t-1))| \right) - \omega(t) D(t), \tag{18}$$
$$\text{s.t.} \quad \max(h_i) < h_{BS}(t) < h_{\max}, \tag{19}$$
$$R_{i,k}(t) > R_{k,\min}(v_i(t)), \tag{20}$$
$$\sqrt{(x_{BS}(t) - x_i)^2 + (y_{BS}(t) - y_i)^2} > r_i + r_s, \tag{21}$$
$$U \in \text{Eq. (1)}. \tag{22}$$
The objective function in Eq. (18) captures the average QoE received at the UAV-BS. The UAV-BS's height must follow the condition in Eq. (19), i.e., it stays above the highest fire and below the regulatory ceiling. Eq. (20) guarantees that the rate $R_k$ obtained from $U_k$ meets the minimum data rate requirement of the UAV-UEs based on the adaptive bitrate selection. Then, Eq. (21) guarantees that the position of the UAV-BS does not intersect with the UAV-UEs' flying regions. $U$ follows the requirement of the flying region $FR_i$ presented in Eq. (1). In the experiment, the UAVs hover or fly at a constant speed.
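Before an action is rewarded, the constraints (19)-(21) have to hold; a hedged sketch of such a feasibility check is given below, where the function name, arguments, and the per-area loop are assumptions of this sketch.

import numpy as np

def feasible(p_bs, fire_heights, h_max, fire_centers, r_fire, r_s, rates, r_min):
    x, y, h = p_bs
    if not (max(fire_heights) < h < h_max):         # Eq. (19)
        return False
    if np.any(np.asarray(rates) <= r_min):          # Eq. (20)
        return False
    for (xc, yc), ri in zip(fire_centers, r_fire):  # Eq. (21), per fire area
        if np.hypot(x - xc, y - yc) <= ri + r_s:
            return False
    return True

ok = feasible((2500.0, 2500.0, 90.0), fire_heights=[40.0, 55.0], h_max=100.0,
              fire_centers=[(800.0, 900.0), (3800.0, 3100.0)],
              r_fire=[250.0, 250.0], r_s=50.0, rates=[5e5, 6e5], r_min=8e4)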
III. OPTIMIZATION PROBLEM VIA REINFORCEMENT LEARNING

In this section, we design DRL algorithms to solve the QoE maximization problem in the UAV-to-UAV network and compare them with an existing traditional method, the Greedy algorithm. Specifically, we propose two DRL algorithms, Deep Q-learning and Actor-Critic, to maximize the QoE of live video streaming in U2U communication.
A. Reinforcement Learning
For our proposed RL-based method, the UAV-BS acts as an agent to collect video from the UAV-UEs while maximizing the Quality of Experience (QoE). The QoE optimization problem is influenced by the delay, the UAVs' positions, and the bitrate selection during each Transmission Time Interval (TTI), and forms a Partially Observed Markov Decision Process (POMDP). Through learning algorithms, the UAV-BS is able to select the position of the UAV-BS $P$, the positions of the UAV-UEs $U$, and the adaptive resolution $V$, in order to maximize the QoE.
1) State Representation:
The current state $s(t)$ corresponds to the set of currently observed information. The state of the UAV-BS can be denoted as $s = [P, V, U, QoE]$, where $P = (x_{BS}(t), y_{BS}(t), h_{BS}(t))$ is the position of the UAV-BS, $V$ is the bitrate selection, and $U = (x_k(t), y_k(t), h_k(t))$ is the positions of the UAV-UEs.
2) Action Space:
The Q-agent chooses an action $a = (BP, BU, BV, P)$ from the set $A$. The dimension of the action set can be calculated as $A = BP \times BU_{i \times k} \times BV_i \times P$. The actions for the UAVs include (i) the UAV-BS's flying direction (BP), (ii) the UAV-UEs' flying directions (BU), (iii) the resolution of the $i$th UAV-UE (BV), and (iv) the UAV-UEs' power (P). The action space is presented as

• BP = (position coordinates following Eq. (21))
• BU = (position coordinates within the boundaries of Eq. (22))
• BV = (144, 240, 360, 480, 720, or 1080)p
• P = (23, 25, 30) dBm

To ensure the balance of exploration and exploitation actions of the UAV-BS, $\epsilon$-greedy ($0 < \epsilon \le 1$) exploration is deployed. At the $t$th TTI, the UAV-BS randomly generates a probability $p_\epsilon(t)$ to compare with $\epsilon$. If $p_\epsilon(t) < \epsilon$, the algorithm randomly selects an action from the feasible actions to improve the value of the non-greedy actions. However, if $p_\epsilon(t) \ge \epsilon$, the algorithm exploits the current knowledge of the Q-value table to choose the action that maximizes the expected reward.
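The exploration rule above reduces to a few lines of Python; this is a generic sketch of $\epsilon$-greedy selection, not the authors' code.

import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # Explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
a = epsilon_greedy(np.array([0.1, 0.7, 0.3]), epsilon=0.1, rng=rng)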
3) Rewards:
When action $a(t)$ is performed, the corresponding reward $re(t)$ is defined as
$$re(t) = \frac{\psi_{i,k}(t)}{IK} \left( \sum_{i=1}^{I} \sum_{k=1}^{K} q(R_{i,k}(t)) - |q(R_{i,k}(t)) - q(R_{i,k}(t-1))| \right) - \omega(t) D(t), \tag{23}$$
where $q(R_{i,k}(t))$ is the video quality metric [27], which can be written as
$$q(R_{i,k}(t)) = \log\!\left(\frac{R_{i,k}(t)}{R_{\min}(v_i(t))}\right), \tag{24}$$
and $\psi_{i,k}(t)$ and $\omega(t)$ are the weights of the video quality and delay, respectively. If $R_{i,k}(t)$ is unable to satisfy the minimum transmission rate $R_{k,\min}(v_i(t))$, namely, $R_{i,k}(t) < R_{k,\min}(v_i(t))$, the system receives a negative reward, i.e., $re(t) < 0$.

B. Q-learning
The learning algorithm uses a Q-table to store the state-action values for different states and actions. Through the policy $\pi(s, a)$, a value function $Q(s, a)$ can be obtained by performing actions based on the current state. At the $t$th time slot, according to the observed state $s(t)$, an action $a(t)$ is selected from all actions following the $\epsilon$-greedy approach. By obtaining a reward $re(t)$, the agent updates its policy $\pi$ for action $a(t)$. Meanwhile, the Bellman equation is used to update the state-action value function, which can be denoted as
$$Q(s(t), a(t)) = (1 - \alpha) Q(s(t), a(t)) + \alpha \left\{ re(t+1) + \gamma \max_{a(t) \in A} Q(s(t+1), a(t)) \right\}, \tag{25}$$
where $\alpha$ is the learning rate and $\gamma \in [0, 1]$ is the discount rate that determines how the current reward affects the updated value function. In particular, $\alpha$ is suggested to be set to a small value (e.g., $\alpha = 0.01$) to guarantee stable convergence of training.
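Eq. (25) in code, for a tabular Q stored as a 2-D array indexed by discrete state and action (a sketch under that assumption; the discount rate 0.8 follows Table III):

import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.01, gamma=0.8):
    # Eq. (25): Q(s,a) <- (1-alpha) Q(s,a) + alpha [re + gamma max_a' Q(s',a')]
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

Q = np.zeros((10, 4))   # 10 discrete states, 4 actions (illustrative sizes)
q_update(Q, s=3, a=1, reward=0.5, s_next=4)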
C. Deep Q-learning

However, the dimensions of both the state space and the action space can be very large if we use traditional tabular Q-learning, which causes high computational complexity. To solve this problem, deep learning is combined with Q-learning, yielding the Deep Q-Network (DQN), where a deep neural network (DNN) is used to approximate the state-action value function. $Q(s, a)$ is parameterized by a function $Q(s, a; \theta_{DQN})$, where $\theta_{DQN}$ is the weight matrix of a DNN with multiple layers. The state $s$ observed by the UAV acts as the input to the neural network (NN). The outputs are the selected actions in $A$. Furthermore, the intermediate layers contain multiple hidden layers connected with Rectified Linear Units (ReLU) using the function $f(x) = \max(0, x)$. At the $t$th time slot, the weight vector is updated using Stochastic Gradient Descent (SGD) with the Adam optimizer, which can be written as
$$\theta_{DQN}(t+1) = \theta_{DQN}(t) - \lambda_{ADAM} \cdot \nabla L(\theta_{DQN}(t)), \tag{26}$$
where $\lambda_{ADAM}$ is the Adam learning rate, and $\nabla L(\theta_{DQN}(t))$ is the gradient of the loss function $L(\theta_{DQN}(t))$, which can be written as
$$\nabla L(\theta_{DQN}(t)) = \mathbb{E}_{S_i, A_i, re_{i+1}, S_{i+1}} \left[ \left( Q_{tar} - Q(S_i, A_i; \theta_{DQN}(t)) \right) \cdot \nabla Q(S_i, A_i; \theta_{DQN}(t)) \right], \tag{27}$$
where the expectation is calculated with respect to a so-called minibatch of previous samples $(S_i, A_i, Re_{i+1}, S_{i+1})$ randomly selected for some $i \in \{t - M_r, t - M_r + 1, \ldots, t\}$, with $M_r$ being the replay memory size. The minibatch sampling improves the convergence reliability of the updated value function [28]. In addition, the target Q-value $Q_{tar}$ can be estimated by
$$Q_{tar} = re_{i+1} + \gamma \max_{a \in A} Q(S_{i+1}, a; \bar{\theta}_{DQN}(t)), \tag{28}$$
where $\bar{\theta}_{DQN}(t)$ is the weight vector of the target Q-network used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value $\theta_{DQN}(t)$ and kept fixed for a number of episodes. The DQN algorithm is presented in Algorithm 1.

Algorithm 1: Optimization by using DQN
Input: the set of UAV-BS positions $\{x_{BS}, y_{BS}, h_{BS}\}$, bitrate selection $V$, the position of the $k$th UAV-UE $U_k = (x^t_k, y^t_k, h^t_k)$, $\sum QoE$, and the number of iterations $I$.
Algorithm hyperparameters: learning rate $\alpha \in (0, 1]$, $\epsilon \in (0, 1]$, target network update frequency $K$.
Initialize the replay memory $M$, the primary Q-network $\theta$, and the target Q-network $\bar{\theta}$.
For $e \leftarrow 1$ to $I$:
  Initialize $s$ by executing a random action $a$.
  For $t \leftarrow 1$ to $T$:
    If $p_\epsilon < \epsilon$: randomly select action $a_t$ from $A$; else: select $a_t = \arg\max_{a \in A} Q(S_t, a, \theta)$.
    The UAV-BS performs $a_t$ at the $t$th TTI.
    The UAV-BS observes $s_{t+1}$ and calculates $re_{t+1}$ using Eq. (23).
    Store the transition $(s_t, a_t, re_{t+1}, s_{t+1})$ in the replay memory $M$.
    Sample a random minibatch of transitions $(S_i, A_i, Re_{i+1}, S_{i+1})$ from the replay memory $M$.
    Perform a gradient descent step for $Q(s, a; \theta)$ using Eq. (27).
    Every $K$ steps, update the target Q-network $\bar{\theta} = \theta$.
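One gradient step of Algorithm 1 can be sketched in PyTorch as follows; the 256-128-128 ReLU layout follows the simulation setup in Section IV and the 19-dimensional state follows the critic input size there, while the batch handling and all names are assumptions of this sketch.

import torch
import torch.nn as nn

def make_q_net(state_dim, n_actions):
    # Three ReLU hidden layers (256, 128, 128), as used in Section IV.
    return nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                         nn.Linear(256, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

def dqn_step(q_net, target_net, optimizer, s, a, r, s_next, gamma=0.8):
    # Eqs. (26)-(28): Adam step on the squared TD error of a minibatch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                    # Eq. (28)
        q_tar = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, q_tar)
    optimizer.zero_grad()
    loss.backward()                                          # Eq. (27)
    optimizer.step()                                         # Eq. (26)
    return loss.item()

q_net, target_net = make_q_net(19, 8), make_q_net(19, 8)
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-2)
s = torch.randn(32, 19); a = torch.randint(0, 8, (32,))
r = torch.randn(32); s2 = torch.randn(32, 19)
dqn_step(q_net, target_net, opt, s, a, r, s2)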
D. Actor-Critic

Different from the DQN algorithm, which obtains the optimal strategy indirectly by optimizing the state-action value function, the AC algorithm directly determines the strategy to be executed by observing the environment state. The AC algorithm combines the advantages of value-based and policy-based methods. In the AC algorithm, the agent consists of two parts, the actor network and the critic network, and it solves the problem using two neural networks. Meanwhile, the AC algorithm deploys a separate memory structure to explicitly represent the policy, which is independent of the value function. The policy structure is known as the actor network, which is used to select actions, while the estimated value function is known as the critic network, which is used to criticize the actions performed by the actor. The AC algorithm is an on-policy method, and the temporal difference (TD) error is deployed in the critic network. To sum up, the actor network aims to improve the current policy, while the critic network evaluates the current policy to improve the actor network during the learning process.

The critic network uses value-based learning to learn a value function. The state-value function $V(s(t), w(t))$ in the critic network can be denoted as
$$V(s, w(t)) = w^\top(t) \Phi(s(t)), \tag{29}$$
where $\Phi(s(t)) = s(t)$ is the state feature vector and $w(t)$ contains the critic parameters, which can be updated as
$$w(t+1) = w(t) + \alpha_c(t) \delta(t) \nabla_w V(s(t), w(t)), \tag{30}$$
where $\alpha_c$ is the learning rate of the critic network. After performing the selected action, the TD error $\delta(t)$ is used to evaluate whether the action selected for the current state performs well [29], and can be calculated as
$$\delta(t) = re(t+1) + \gamma V(s(t+1), w(t)) - V(s(t), w(t)). \tag{31}$$
Then, the actor network searches for the best policy to maximize the expected reward under the given policy with parameters $\theta_{AC}$, which can be updated as
$$\theta_{AC}(t+1) = \theta_{AC}(t) + \alpha_a \nabla_{\theta_{AC}} J(\pi_{\theta_{AC}}(t)), \tag{32}$$
where $\alpha_a$ is the learning rate of the actor network, which is positive and must be small enough to avoid oscillatory behavior in the policy. According to [29], $\nabla_{\theta_{AC}} J(\pi_{\theta_{AC}})$ can be calculated as
$$\nabla_{\theta_{AC}} J(\pi_{\theta_{AC}}(t)) = \delta(t) \nabla_{\theta_{AC}} \ln(\pi(a_t \mid s_t, \theta_{AC}(t))). \tag{33}$$
The AC algorithm is presented in Algorithm 2.

Algorithm 2: Actor-Critic algorithm
Input: the set of UAV-BS positions $\{x_{BS}, y_{BS}, h_{BS}\}$, bitrate selection $V$, the position of the $k$th UAV-UE $U_k = (x^t_k, y^t_k, h^t_k)$, $\sum QoE$, and the number of iterations $I$.
Algorithm hyperparameters: learning rate $\alpha_c \in (0, 1]$, $\epsilon \in (0, 1]$, target network update frequency $K$.
Initialize the policy parameter $\theta_{AC}$, the critic network weights $w$, and the value function $V$.
For $e \leftarrow 1$ to $I$:
  Initialize $s$ by executing a random action.
  For $t \leftarrow 1$ to $T$:
    Select action $a_t$ according to the current policy.
    The UAV-BS observes $s_{t+1}$ and calculates $re_{t+1}$ using Eq. (23).
    Store the transition $(s_t, a_t, re_{t+1}, s_{t+1})$.
    Update the TD-error function.
    Update the weights $w$ of the critic network by minimizing the loss.
    Update the policy parameter vector $\theta$ of the actor network.
    Update the policy $\theta_{AC}$ and the state-value function $V(s(t), w(t))$.
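For the linear critic of Eq. (29), the TD-error and parameter updates of Eqs. (30)-(33) reduce to the numpy sketch below; grad_log_pi stands for the score function of the sampled action and is assumed to be supplied by the actor, and all names and learning rates are illustrative.

import numpy as np

def ac_update(w, theta, s, s_next, reward, grad_log_pi,
              alpha_c=0.01, alpha_a=0.001, gamma=0.8):
    # Eq. (29): linear critic V(s) = w^T Phi(s) with Phi(s) = s
    v, v_next = w @ s, w @ s_next
    delta = reward + gamma * v_next - v             # Eq. (31): TD error
    w = w + alpha_c * delta * s                     # Eq. (30): grad_w V = s
    theta = theta + alpha_a * delta * grad_log_pi   # Eqs. (32)-(33)
    return w, theta, delta

w, theta = np.zeros(19), np.zeros(19)               # 19-dim state (Section IV)
s, s2 = np.random.rand(19), np.random.rand(19)
w, theta, d = ac_update(w, theta, s, s2, reward=0.3,
                        grad_log_pi=np.random.rand(19))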
IV. SIMULATION RESULTS

In this section, we evaluate our proposed learning algorithms in our problem setup. The area of the region is 5000 m x 5000 m x 100 m. In the simulation, the maximum flying height $h_{\max}$ of the UAV-BS is 100 m, which satisfies the 120 m maximum flying height stipulated by the UK government. We assume that the available video bitrates of the adaptive video streaming for each video frame correspond to the minimum required bitrates of the six video qualities in Table I, starting from 80 kbps for 144p. The target area of each fire $A_i$ is captured by $K$ UAV-UEs. At the beginning, the UAV-BS is deployed at the centre of the environment, i.e., (1250, 1250, $h_{\min}$), where $h_{\min}$ is the maximum height of the fire. When a fire occurs in a remote area, the UAV-UEs immediately fly to the fire location to stream and oversee the real-time situation. The heights of the UAV-UEs in each fire area are fixed and follow the distribution of the fire height [17]. The network parameters for the system are shown in Table II and follow the existing approach and 3GPP specifications in [4], [23], and [30]. The performance of all results is obtained by averaging over around 100 episodes, where each episode consists of 100 TTIs. Finally, the channel model parameters and grid environment parameters are set according to [4].

Table II. Parameters

Parameter | Value
Number of UAV-UEs | 12
Transmission power, $P_{U_e}$ | 23 dBm [4]
Noise power, $\sigma^2$ | -96 dBm [4]
Channel power gain, $G$ | -31.5 dB [4]
$\eta_{NLoS}$ | 21 [30]
Fire area radius, $r_i$ | 250 m

Table III. Hyperparameters

Hyperparameter | Value
Learning rate | 0.1, 0.01
Initial exploration | 1
Final exploration | 0.1
Discount rate | 0.8
Replay memory | 10000

Figure 4. Average QoE value for each frame via the AC, DQN, and Greedy algorithms.

In each scenario, our proposed DQN and AC algorithms are compared with the Greedy algorithm. The Greedy algorithm selects actions based on the immediate reward and a local optimum strategy. The DQN is designed with 3 hidden layers,
where each layer consists of 256, 128, and 128 ReLU units, respectively. For the AC method, the critic DNN consists of an input layer with 19 neurons, a fully-connected neural network with two hidden layers, each with 128 neurons, and an output layer with 1 neuron. The UAV-BS is initially set at the centre of the environment with height $h_{\min}$. In the wildfire environment, network coverage with smooth streaming is needed to overview the real-time situation. To guarantee high quality of video transmission from multiple UAVs in continuous time slots, a Recurrent Neural Network (RNN) is deployed.

Figure 5. Average QoE of the UAV-BS with different optimization schemes via different learning algorithms in each episode.

Figure 6. The request of the UAV-UEs in continuous time slots.

Figure 7. The power control of the UAV-UEs in continuous time slots with different learning algorithms.

Figure 8. The average adaptive resolution of the UAV-UEs in continuous time slots with different learning algorithms.

Fig. 4 plots the average QoE value over all frames via the AC, DQN, and Greedy algorithms. It can be seen that the DRL algorithms outperform the non-learning-based Greedy algorithm. Moreover, the convergence speed of the DRL algorithms is faster than that of the Greedy algorithm. Specifically, in the Greedy algorithm, the UAVs only exploit the current reward rather than exploring the long-term reward. Therefore, the UAVs are not able to achieve a higher expected reward compared to the DRL algorithms.

Fig. 5 plots the average QoE of the UAV-BS with different video transmission schemes via different learning algorithms in each episode. For simplicity, "Adaptive Resolution" represents the scheme with adaptive resolution, "AB" the scheme with adaptive resolution and a dynamic UAV-BS, and "ABU" the scheme with adaptive resolution, a dynamic UAV-BS, and dynamic UAV-UEs. It is observed that the average QoE of the AC algorithm outperforms all other algorithms, as it achieves an optimal trade-off between data rate, bitrate resolution selection, power control, and positions. It is also observed that, with the dynamic environment and the large action space, the AC algorithm is able to select proper positions for the UAVs and proper resolutions for the video frames. This is mainly due to the experience replay mechanism, which efficiently utilizes the training samples, and to the actor and critic functions, which smooth the training distribution over previous behaviours compared to DQN. In addition, we observe that the strategies selecting optimal positions for the UAVs achieve higher performance compared to UAVs with fixed locations. This result emphasizes the importance of the strategy with mobile UAVs. This is due to the fact that mobile
UAVs can move through the network to reach the optimal positions that are able to adapt to dynamic fire scenarios.

Figure 9. Average latency of video streaming with different learning algorithms.
Next, we provide a more in-depth investigation of the relationship between the number of UAV requests, the adaptive video resolution, the adaptive power control, and the throughput with different learning algorithms over 100 continuous time slots. The results are compared among the three algorithms, namely the DQN, AC, and Greedy algorithms. The detailed results show how the control optimization helps the UAVs maximize the QoE at each time slot.

Fig. 6 plots the UAV requests following the fire arrival distribution, which follows a Poisson process with density $\lambda$. In phase 1, there is a small number of fire arrivals, which leads to a low number of UAV requests. As time increases, the number of fire arrivals grows and a higher number of UAV requests is needed, as shown in phase 2; in phase 3, the requests drop and fewer UAVs are demanded. As the number of requests changes rapidly, we introduce power control to adjust the transmit power of the UAV-UEs, mitigating the interference among UAV-UEs and thus maximizing the achievable rate of each UAV-UE.

Following the fire arrival requests in Fig. 6, Fig. 7 plots the average power control over all UAV-UEs in continuous time slots with the AC, DQN, and Greedy algorithms. The power control helps mitigate the interference among UAV-UEs. As shown in phases 1 and 3 of Fig. 6, a small number of fire requests requires only a small number of UAVs to transmit data. However, when the number of requests increases, a large number of UAVs is demanded, as shown in phase 2 of Fig. 6. As can be seen from phase 2 of Fig. 7, the DRL algorithms learn the environment and effectively reduce the transmit power of each UAV-UE, reducing the interference from the UAV-UEs. We see that the Greedy algorithm maintains a higher power; even though high power can provide a strong received signal, it also causes high interference at the UAV-BS and failures in transmission.

Following the fire arrival requests in Fig. 6, Fig. 8 plots the minimum adaptive resolution over all UAV-UEs in continuous
time slots with different learning algorithms. It is shown that the minimum video resolution of the AC algorithm is higher than that of the DQN and the Greedy algorithms in all scenarios. The AC algorithm is able to maintain an optimal video resolution at each time slot and guarantee high-quality, smooth video playback with new requests. However, the Greedy algorithm exploits a minimum video resolution to maintain high rewards; it only uses a local optimal policy, which causes poor performance. In phases 1 and 3, when the number of requests is low at the $t$th time slot, the power is high and the throughput increases; thus, the video resolution is high. However, when the number of requests increases in phase 2, the AC algorithm is able to maintain a high resolution with the help of adaptive power control, which leads to a better QoE for each UAV-UE. This helps to reduce the interference and improve the quality of the video resolution.

In Fig. 9, we plot the average latency of video streaming with the AC, DQN, and Greedy algorithms. It can be seen that the latency performance of the AC algorithm outperforms that of the DQN algorithm. When multiple video streams exist in the U2U communication, interference among UAV-UEs occurs and causes higher latency. Based on the observed state, the AC algorithm is able to select proper positions and transmission powers of the UAV-UEs to mitigate the interference, which further decreases the latency. Thus, the AC algorithm is able to maximize the average QoE with the lowest average time latency. However, the Greedy algorithm is unable to avoid violating the latency constraints, which leads to higher latency and thus lower QoE.

Figure 10. Average smoothness penalty with different learning algorithms.

Fig. 10 plots the average smoothness penalty with the AC, DQN, and Greedy algorithms. The smoothness penalty demonstrates the average video stability of the UAV-UEs in each episode. When the learning algorithm is able to automatically choose suitable resolutions at the $t$th and $(t-1)$th time slots, it obtains a lower smoothness penalty and a higher QoE. Moreover, the AC algorithm is able to automatically choose the proper action based on the actor and critic functions, which leads to better smoothness than the DQN and Greedy algorithms. This shows that the AC algorithm guarantees the smoothness of video transmission with high QoE. Meanwhile, the Greedy algorithm shows the worst performance because it only makes local optimal selections.

V. CONCLUSION
In this paper, we developed a deep reinforcement learning approach for mobile U2U communication to maximize the Quality of Experience (QoE) of UAV-UEs, through optimizing the locations of all UAVs, the adaptive video resolution, and the transmission power of the UAV-UEs. The dynamic interference problem was handled by utilizing adaptive power control to achieve a higher achievable rate. Through our developed Deep Q-Network and Actor-Critic methods, the optimal adaptive video resolution can be selected to stream real-time video frames, and the optimal positions of the UAV-BS and UAV-UEs can be selected to satisfy the transmission rate requirement. Simulation results demonstrated the effectiveness of our proposed learning-based schemes compared to the Greedy algorithm in terms of higher QoE with low latency and high video smoothness.
REFERENCES

[1] M. Müller, L. Vilà-Vilardell, and H. Vacik, "Forest fires in the Alps: state of knowledge, future challenges and options for an integrated fire management," EUSALP Action Group, vol. 8, 2020.
[2] K. W. Sung et al., "PriMO-5G: making firefighting smarter with immersive videos through 5G," in Proc. 2019 IEEE 2nd 5G World Forum (5GWF), Sep. 2019, pp. 280–285.
[3] M. M. Azari, G. Geraci, A. Garcia-Rodriguez, and S. Pollin, "Cellular UAV-to-UAV communications," in Proc. IEEE 30th Annu. Int. Symp. Pers. Indoor Mobile Radio Commun. (PIMRC), Sep. 2019, pp. 120–127.
[4] S. Zhang, H. Zhang, B. Di, and L. Song, "Cellular UAV-to-X communications: Design and optimization for multi-UAV networks," IEEE Trans. Wireless Commun., vol. 18, no. 2, pp. 1346–1359, Feb. 2019.
[5] M. M. Azari, G. Geraci, A. Garcia-Rodriguez, and S. Pollin, "UAV-to-UAV communications in cellular networks," IEEE Trans. Wireless Commun., vol. 19, no. 9, pp. 6130–6144, Jun. 2020.
[6] X. Liu et al., "Transceiver design and multihop D2D for UAV IoT coverage in disasters," IEEE Internet of Things J., vol. 6, no. 2, pp. 1803–1815, Apr. 2019.
[7] A. Joshi, S. Dhongdi, S. Kumar, and K. Anupama, "Simulation of multi-UAV ad-hoc network for disaster monitoring applications," Jan. 2020, pp. 690–695.
[8] A. Masaracchia et al., "The concept of time sharing NOMA into UAV-enabled communications: An energy-efficient approach," Aug. 2020, pp. 61–65.
[9] U. Challita, W. Saad, and C. Bettstetter, "Deep reinforcement learning for interference-aware path planning of cellular-connected UAVs," in Proc. 2018 IEEE Int. Commun. Conf. (ICC), Jul. 2018, pp. 1–7.
[10] Y. Sadi, S. C. Ergen, and P. Park, "Minimum energy data transmission for wireless networked control systems," IEEE Trans. Wireless Commun., vol. 13, no. 4, pp. 2163–2175, Feb. 2014.
[11] S. Zhang et al., "Joint trajectory and power optimization for UAV relay networks," IEEE Commun. Lett., vol. 22, no. 1, pp. 161–164, Oct. 2017.
[12] G. E. G. Padilla, K.-J. Kim, S.-H. Park, and K.-H. Yu, "Flight path planning of solar-powered UAV for sustainable communication relay," IEEE Robot. Automat. Lett., vol. 5, no. 4, pp. 6772–6779, Aug. 2020.
[13] M. M. Selim et al., "On the outage probability and power control of D2D underlaying NOMA UAV-assisted networks," IEEE Access, vol. 7, pp. 16525–16536, Jan. 2019.
[14] X. Xiao et al., "Sensor-augmented neural adaptive bitrate video streaming on UAVs," IEEE Trans. Multimedia, pp. 1–12, Oct. 2019.
[15] K. Govil, M. L. Welch, J. T. Ball, and C. R. Pennypacker, "Preliminary results from a wildfire detection system using deep learning on remote camera images," Remote Sensing, vol. 12, no. 1, p. 166, 2020.
[16] N. Jiang, Y. Deng, A. Nallanathan, and J. A. Chambers, "Reinforcement learning for real-time optimization in NB-IoT networks," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1424–1440, Jun. 2019.
[17] J. J. Podur, D. L. Martell, and D. Stanford, "A compound Poisson model for the annual area burned by forest fires in the province of Ontario," Environmetrics, vol. 21, no. 5, pp. 457–469, 2010.
[18] M. Val Martin, R. Kahn, and M. Tosca, "A global analysis of wildfire smoke injection heights derived from space-based multi-angle imaging," Remote Sensing.
[20] G. D. Durgin, Space-Time Wireless Channels. Prentice Hall Professional, 2003.
[21] N. Goddemeier and C. Wietfeld, "Investigation of air-to-air channel characteristics and a UAV specific extension to the Rice model," Dec. 2015, pp. 1–5.
[22] V. Yajnanarayana et al., "Interference mitigation methods for unmanned aerial vehicles served by cellular networks," Jul. 2018, pp. 118–122.
[23] "Study on enhanced LTE support for aerial vehicles," 3GPP, TR 36.777, Dec. 2017, V15.0.0.
[24] "Recommended upload encoding settings - YouTube Help." [Online]. Available: https://support.google.com/youtube/answer/1722171?hl=en-G
[25] P. Carballeira, J. Cabrera, A. Ortega, F. Jaureguizar, and N. García, "A framework for the analysis and optimization of encoding latency for multiview video," IEEE J. Sel. Topics Signal Process., vol. 6, no. 5, pp. 583–596, Sept. 2012.
[26] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli, "A control-theoretic approach for dynamic adaptive video streaming over HTTP," in Proc. 2015 ACM Conf. on Special Interest Group on Data Commun., Aug. 2015, pp. 325–338.
[27] H. Mao, R. Netravali, and M. Alizadeh, "Neural adaptive video streaming with Pensieve," in Proc. Conf. of the ACM Special Interest Group on Data Commun., Aug. 2017, pp. 197–210.
[28] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
[29] Z. Zhang et al., "QoE aware transcoding for live streaming in SDN-based cloud-aided HetNets: An actor-critic approach," in Proc. 2019 IEEE Int. Commun. Conf. Workshops (ICC Workshops), May 2019, pp. 1–6.
[30] A. Al-Hourani, S. Kandeepan, and S. Lardner, "Optimal LAP altitude for maximum coverage," IEEE Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.
[31] C. She et al., "Ultra-reliable and low-latency communications in unmanned aerial vehicle communication systems," IEEE Trans. Commun., vol. 67, no. 5, pp. 3768–3781, May 2019.
[32] A. Al-Hourani, S. Kandeepan, and A. Jamalipour, "Modeling air-to-ground path loss for low altitude platforms in urban environments," in 2014 IEEE Global Commun. Conf.