Actor-Critic Learning Based QoS-Aware Scheduler for Reconfigurable Wireless Networks
Shahram Mollahasani, Melike Erol-Kantarci, Mahdi Hirab, Hoda Dehghan, Rodney Wilson
Shahram Mollahasani
School of Electrical Engineering and Computer Science, University of Ottawa [email protected]
Melike Erol-Kantarci,
Senior Member, IEEE
School of Electrical Engineering and Computer Science, University of Ottawa [email protected]
Mahdi Hirab
VMware Inc. [email protected]
Hoda Dehghan
VMware Inc. [email protected]
Rodney Wilson,
Senior Member, IEEE
Ciena Corp. [email protected]
Abstract—The flexibility offered by reconfigurable wireless networks provides new opportunities for applications such as online AR/VR gaming, high-quality video streaming, and autonomous vehicles, which demand high-bandwidth, reliable and low-latency communications. These applications come with very stringent Quality of Service (QoS) requirements and increase the burden on mobile networks. Spectrum is currently scarce due to the massive growth in data traffic, and this problem can be alleviated with the help of Reconfigurable Wireless Networks (RWNs), where nodes have reconfiguration and perception capabilities. This creates a need for AI-assisted algorithms for resource block allocation. To tackle this challenge, in this paper we propose an actor-critic learning-based scheduler for allocating resource blocks in an RWN. Various traffic types with different QoS levels are assigned to our agents to provide more realistic results. We also include mobility in our simulations to increase the dynamicity of the network. The proposed model is compared with another actor-critic model and with traditional schedulers: the proportional fair (PF) and Channel and QoS Aware (CQA) techniques. The proposed models are evaluated in terms of the delay experienced by user equipment (UEs), successful transmissions, and head-of-the-line (HOL) delays. The results show that the proposed model noticeably outperforms the other techniques in different aspects.
Index Terms—5G, AI-enabled networks, Reinforcement learning, Resource allocation
I. INTRODUCTION
The growing demand for social media platforms, Ultra-High-Definition (UHD) video, Virtual Reality (VR) and Augmented Reality (AR) enabled applications, the ever-growing data traffic of the Internet-of-Things (IoT), and the rapid advances in mobile devices and smartphones have led to exponential growth in the traffic load over wireless networks [1]. Due to these new applications, the variety of traffic types, and the unpredictability of physical channels (fading, path loss, etc.), maintaining quality of service (QoS) has become more challenging than ever before. In recent years, with the introduction of the Open Radio Access Network (O-RAN), Artificial Intelligence (AI) and Machine Learning (ML) have found applications in wireless networks, and Reconfigurable Wireless Networks (RWNs) have emerged. In RWNs, local networking nodes are controlled by groups of communicating nodes equipped with reconfigurable software, hardware, or protocols. Software reconfiguration is useful for updating, including, and excluding tasks, while hardware reconfiguration enables manipulation of the physical infrastructure. AI and ML techniques can provide a higher degree of automation than before (beyond the self-organized networks (SON) concept of 3GPP) and manage the growing complexity of RWNs [2].

Reinforcement learning (RL) is a machine learning technique that allows optimal control of systems by directing the system to a desired state through interaction with the environment and feedback from it [3]. RL techniques are widely used in cellular networks for various use cases such as video flow optimization [4], improving energy efficiency in mobile networks [5], and optimizing resource allocation [6]. The majority of RL-based methods used in wireless networks focus on Q-learning or deep Q-learning. Although promising results have been obtained with these techniques, they offer a single level of control. In contrast, actor-critic learning, which is a type of RL, can be implemented in multiple hierarchies and offer control from multiple points of view.

In this work, we propose an actor-critic learning approach [7] to allocate resource blocks in a way that enhances communication reliability and satisfies the QoS required by each UE. In the proposed model, we formulate the choice of the number of resource blocks (RBs) and their location in the RB map as a Markov Decision Process (MDP) [8], and we solve this problem using an actor-critic model. We consider channel quality, packet priorities, and the delay budget of each traffic stream in our reward function. We adopt two Advantage Actor-Critic (A2C) models [9]. The first technique schedules packets solely by giving priority to their scheduling delay budget (called D-A2C), while the second technique considers channel quality, delay budget, and packet types (called CDPA-A2C). We evaluate the performance of the proposed models using NS3 [10] with fixed and mobile scenarios. Our results show that, in the fixed scenario, the proposed model can reduce the mean delay significantly with respect to the proportional fair (PF) [11], Channel and QoS Aware (CQA) [12], and D-A2C schedulers. Additionally, CDPA-A2C can increase the packet delivery rate in the mobile scenario by up to 92% and 53% in comparison with PF and CQA, respectively.

The main contributions of this paper are:
• Proposing an actor-critic learning technique that can be implemented on disaggregated RAN functions and provide control at two levels.
• Proposing a comprehensive reward function that takes into account channel quality, packet priorities, and the delay budget of each traffic stream.
• Proposing two A2C models, where the first model schedules packets solely by giving priority to their scheduling delay budget (called D-A2C) and the second considers channel quality, delay budget, and packet types (called CDPA-A2C).

The rest of the paper is organized as follows. In Section II, we summarize the related work. In Section III, the system model is described. In Section IV, the proposed actor-critic resource block scheduler is explained. Numerical results and evaluation of the proposed model are presented in Section V, and finally, we conclude the paper in Section VI.

II. RELATED WORK
Providing ubiquitous connectivity for various devices with different QoS requirements is one of the most challenging issues for mobile network operators [13]. This problem is amplified in future 5G applications with strict QoS requirements [14]. Additionally, in order to handle all the new immersive applications (which are known for their heterogeneous QoS properties), advanced techniques are required to maintain the quality of experience (QoE) among the network's entities. To this end, packet schedulers need to share the network bandwidth dynamically among UEs in a way that UEs achieve their target QoS. Many scheduling algorithms that employ QoS in their models have been introduced previously. In [15], a scheduler is proposed which encapsulates different features of scheduling strategies for the downlink of cellular networks to guarantee multi-dimensional QoS for various radio channels and traffic types. However, most QoS-based schedulers prioritize some traffic types while ignoring the rest. For instance, in [16], a prioritized traffic scheduler named frame level scheduler (FLS) is introduced, which gives higher priority to real-time traffic in comparison with elastic traffic (such as HTTP or file transfer). Additionally, in [17] the required activity detection scheduler (RADS) is proposed, which prioritizes UEs based on fairness and their packet delay. However, most prioritizing schedulers are not capable of quickly reacting to the dynamics of cellular networks. Therefore, some traffic classes may experience degradation in their QoS, while others can be over-provisioned.

RL-based models have also been applied in different ways to optimize resource allocation in networks. In [18], an RL-based scheduler is presented for resource allocation in a reliable vehicle-to-vehicle (V2V) network. The presented RL scheduler interacts with the environment frequently to learn and optimize resource allocation to the vehicles. In this work, it is assumed that the whole network structure is connected to a centralized scheduler. Additionally, in [19], resource allocation and computation offloading in multi-channel multi-user mobile edge cloud (MEC) systems are evaluated. In this work, the authors present a deep reinforcement learning network to jointly optimize the total delay and energy consumption of all UEs. Moreover, in [20], an RL controller is implemented to schedule deadline-driven data transfers. In that paper, it is assumed that the requests are sent to a central network controller, where flows can be scheduled with respect to their pacing rates. In [21], [22], the authors introduce a low-complexity traffic predictor for network slices based on a soft gated recurrent unit. They also use the traffic predictor to feed several deep learning models, which are trained offline to apply end-to-end reliable and dynamic resource allocation under dataset-dependent generalized service level agreement (SLA) constraints. The authors consider resource-bounds-based SLA and violation-rate-based SLA in order to estimate the required resources in the network.

Traditional radio resource management (RRM) algorithms are not able to handle the stringent QoS requirements of users while adapting to the fast-varying conditions of RWNs. Recently, machine learning algorithms have been employed in schedulers to benefit from data in optimizing resource allocation, as opposed to using models [23]-[26]. An ML-based RRM algorithm is proposed in [27] to estimate the resources required in the network to handle on-demand traffic over HTTP. ML has been used in resource allocation by considering various QoS objectives such as packet loss [23], delay [24], and user fairness [25]. However, these models are defined for improving the delay of Ultra-Reliable and Low-Latency Communications (URLLC) or enhancing the throughput of enhanced Mobile Broadband (eMBB) UEs, while traffic types are considered homogeneous [26]. Hence, they ignore the effect of traffic types with various QoS requirements. In comparison with previous works, we propose actor-critic learning for resource allocation, which can be used at different levels of disaggregated RANs. We use a reward function that addresses channel quality, packet priorities, and the delay budget of each traffic stream. We evaluate our proposed scheme under various traffic types, interference levels and mobility. We provide detailed results on key performance indicators (KPIs) collected from the NS3 simulator and the integrated OpenAI Gym.

III. SYSTEM MODEL
Fig. 1: A Distributed RL-based RB Scheduler.

We assume the overall downlink bandwidth is divided into the total number of available RBs, and each RB contains 12 contiguous subcarriers. Moreover, a resource block group (RBG) is formed when consecutive RBs are grouped together. In order to reduce the number of states in our reinforcement learning approach, we consider the RBG as the unit of resource allocation in the frequency domain. The aim of the proposed actor-critic model is to assign RBGs by considering traffic types, their QoS requirements, and their priorities during each transmission time interval (TTI). In our system model, Base Stations (BSs) are actors, and each actor schedules the packets located in the transmission buffers of its associated UEs during each time interval such that the amount of time the packets stay in the UE buffers is reduced. The scheduling decision is made every TTI by considering the number of pending packets in the transmission buffers of active UEs. The overall delay experienced by a packet can be broken down into three main factors as shown below:
\mathrm{Packet\ Latency} = T_{HOL} + T_{tx} + T_{HARQ} \qquad (1)

where T_{HOL} is the time duration that a packet waits in the queue to get a transmission opportunity (the scheduling delay; HOL stands for head-of-the-line), T_{tx} is the communication delay, and T_{HARQ} is the round-trip time required to retransmit a packet. T_{tx} and T_{HARQ} depend mainly on the environment (path loss, shadowing, fading, etc.), the UE locations (propagation distance), and the channel condition (noise, interference, etc.). In order to satisfy packets with low-latency requirements (e.g., URLLC UEs), the scheduler needs to handle packets in the UE buffers as soon as they arrive, thus minimizing the HOL delay. We also need to limit the number of HARQ retransmissions to achieve lower delay during communication. However, limiting retransmissions can increase the packet drop rate and reduce reliability in the network [28]. Low reliability particularly affects the UEs located at the cell edge. The proposed RL-based scheduler aims to address this trade-off by enhancing reliability while meeting the required latency budget.
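For illustration only, the short sketch below evaluates eq. (1) for a single packet and checks the result against a QCI delay budget; the numerical values are hypothetical and simply mirror the 20 ms V2X budget that appears later in Table I.

```python
# Illustration of eq. (1): total packet latency as the sum of head-of-line,
# transmission, and HARQ retransmission delays (all values hypothetical).

def packet_latency_ms(t_hol: float, t_tx: float, t_harq: float) -> float:
    """Packet latency of eq. (1), in milliseconds."""
    return t_hol + t_tx + t_harq

# Example: 6 ms in the UE buffer, 2 ms over the air, one 8 ms HARQ round trip.
latency = packet_latency_ms(t_hol=6.0, t_tx=2.0, t_harq=8.0)   # 16 ms
meets_budget = latency <= 20.0   # compare against a 20 ms delay budget (V2X)
print(latency, meets_budget)     # 16.0 True
```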
IV. ACTOR-CRITIC SCHEDULING FRAMEWORK
It is well known that due to the scheduler's multi-dimensional and continuous state space, we cannot enumerate the scheduling problem exhaustively [29]. We can tackle this issue by employing RL and learning a proper scheduling rule. In actor-critic learning, the policy model and the value function are the two main parts of the policy gradient. In order to reduce the gradient variance of the policy, we learn a value function that is used to update and assist the system policy during each time interval; this combination is known as the actor-critic model. At the exploitation step, the learnt actor function is used to make decisions. In our model, we aim to prioritize certain traffic types and reduce their HOL delay during the scheduling decision. The obtained M-dimensional decision matrix is employed to schedule and prioritize the available traffic classes at each TTI. To do so, a neural network (NN) framework is employed to tackle the complexity and obtain an approximation of the best possible prioritization decision during each time interval. In the learning state, the weights of the neural network are updated every TTI by considering the interactions between the actor and the critic. During the exploitation state, the value of the updated weights is saved, and the NN is tuned as a non-linear function.

An RL agent essentially tries to achieve a higher value from each state s_t, which can be obtained through the state-value function V(s) and the action-value function Q(s, a). Using the action-value function, we can estimate the outcome of action a in state s during time interval t, and the average expected outcome of state s can be obtained using the state-value function. In this work, instead of approximating both the action-value and state-value functions, we estimate only V(s) by employing the Advantage Actor-Critic (A2C) model, which simplifies the learning process and reduces the number of required parameters. More specifically, the advantage is a value that determines how much better the performed action is than the expected V(s): A(s_t, a_t) = Q(s_t, a_t) − V(s_t). Moreover, the A2C model is synchronous and, with respect to the asynchronous actor-critic (A3C) [30], provides better consistency among agents, making it suitable for disaggregated deployments.

The proposed A2C model contains two neural networks:
• A neural network in the critic for estimating the value function, to criticize the actor's actions during each state.
• A neural network in the actor for approximating scheduling rules and prioritizing packet streams during each state.

In the presented model, the critic is responsible for inspecting the actor's actions and enhancing its decisions at each time interval, and it is located at an edge cloud, while the actor is at the BS. This flexible architecture is complemented by recent O-RAN efforts around the disaggregation of network functions. A high-level view of the proposed architecture is presented in Fig. 1.

In the following, we present the actor-critic architecture step by step.

Actor: We employ the actor as an agent that explores the required policy π (θ is the policy parameter) based on its observation (O) to obtain and apply the corresponding action (A):

\pi_\theta(O) = O \rightarrow A \qquad (2)

Therefore, the action chosen by an agent can be presented as:

a = \pi_\theta(O), \qquad (3)

where a ∈ A. Actions correspond to choosing a proper resource block in the resource block map of each agent. Due to the discrete nature of the actions, we employ a softmax function at the last (output) layer of the actor to obtain the corresponding value of each action. The action scores sum to 1 and represent the probability of achieving a high reward value with the chosen action.
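As an illustration (not the authors' implementation), the following PyTorch sketch shows an actor of this form: a small fully connected network whose softmax output assigns a probability to each candidate RBG. The layer width follows the 256-neuron, three-layer setting later reported in Table II, while the observation size and the number of RBGs are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta(O): maps a BS observation to one score per RBG
    (illustrative sketch; observation layout and sizes are assumptions)."""
    def __init__(self, obs_dim: int, num_rbgs: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_rbgs),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax over the RBG map: the scores sum to 1 (cf. eqs. (2)-(3)).
        return torch.softmax(self.net(obs), dim=-1)

# Example: sample an RBG for one observation, i.e. a = pi_theta(O) as in eq. (3).
actor = Actor(obs_dim=32, num_rbgs=13)          # dimensions are placeholders
probs = actor(torch.randn(1, 32))
action = torch.distributions.Categorical(probs).sample()
```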
Critic: We employ the critic to obtain the value function V(O). During each time interval t, after the agent executes the action chosen by the actor (a_t), it sends it to the critic along with the current observation (O_t). The critic then estimates the temporal difference (TD) by considering the next state (O_{t+1}) and the reward value (R_t) as follows:

\delta_t = R_t + \gamma V(O_{t+1}) - V(O_t). \qquad (4)

Here, δ_t is the TD error for the action-value at time t, and γ is a discount factor. At the updating step, the least-squares temporal difference (LSTD) is minimized to update the critic during each step:

V^* = \arg\min_V (\delta_t)^2, \qquad (5)

where the optimal value function is denoted V^*. The actor can be updated by the policy gradient, which can be obtained using the TD error as follows:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(O, a)\,\delta_t\right], \qquad (6)

where ∇_θ J(θ) is the gradient of the cost function with respect to θ, and the value of action a under the current policy is given by π_θ(O, a). Then, the change of the actor's parameter weights at time interval t can be calculated as:

\Delta\theta_t = \alpha \nabla_{\theta_t} \log \pi_{\theta_t}(O_t, a_t)\,\delta_t. \qquad (7)

Here, the gradient is estimated per time step (∇_{θ_t}) and the parameters are updated in this gradient direction; the learning rate α is between 0 and 1. Finally, the actor network can be updated using the policy gradient as follows:

\theta_{t+1} = \theta_t + \alpha \nabla_{\theta_t} \log \pi_{\theta_t}(O_t, a_t)\,\delta_t. \qquad (8)

Our main goal is to provide a channel, delay and priority aware actor-critic learning based scheduler (CDPA-A2C). Actors are located at BSs; therefore, during their observation, they can access the channel condition through the Channel Quality Indicator (CQI) feedback of UEs from the control channels, as well as the amount of time packets remain in the buffer before being scheduled. Moreover, actors can tune their future actions with respect to the reward value received at each iteration. In order to train the agents based on the network requirements, we need to include information about the channel condition, the load of the transmission buffer, and the priority of packets in our reward function. The reward function of actor i (BS_i) is designed as follows:

R = \varphi R_1 + \tau R_2 + \lambda R_3, \qquad (9)

R_1 = \max\left(\mathrm{sgn}\left(cqi_k - \frac{\sum_{j=0}^{K} cqi_j}{K}\right),\, 0\right), \qquad (10)

R_2 = \begin{cases} 1 & Packet_{URLLC} \\ 0 & \text{otherwise} \end{cases} \qquad (11)

R_3 = \mathrm{sinc}\left(\pi \left\lfloor \frac{Packet_{delay}}{Packet_{budget}} \right\rfloor\right). \qquad (12)

Here, cqi_k is the feedback sent by UE_k to agent i at time interval t, K is the total number of UEs associated with agent i, Packet_{URLLC} is an identifier for packets with a low delay budget and is associated with the QCI, Packet_{delay} is the HOL delay at the RLC buffer, and Packet_{budget} is the maximum tolerable delay for the corresponding packet type, which is defined based on the packet's type. ϕ, τ and λ are scalar weights that control the priority among traffic types, maintaining the delay budget and the UE condition (CQI feedback) [31], [32]. The reward function is tuned such that the UEs' received signal strength and packet delivery ratio are maximized, while giving higher priority to the critical traffic load to increase the QoS of URLLC users.

In this paper, to examine the effect of the proposed reward function, we define two A2C models as follows:
• In the first one, the scheduler schedules packets based only on the packets' delay budget, and it is named Delay-aware A2C (D-A2C). In this model, the priority of packets (in our scenario, URLLC packets) is ignored by setting τ = 0 in eq. (11).
• In the second one, the scheduler takes all performance metrics into consideration by using eqs. (10), (11) and (12). This scheduler is called CDPA-A2C. In this model, instead of giving priority to just one metric, the scheduler is equipped with a more complex model to be capable of handling RB allocation in different conditions.
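To make eqs. (9)-(12) concrete, a minimal sketch of the CDPA-A2C reward is given below. It assumes the clipped form max(sgn(·), 0) for eq. (10) and a 0/1 indicator for eq. (11), which are our reading of the partially legible equations, and it uses the CDPA-A2C weights of Table II (adaptive ϕ, τ = λ = 5).

```python
import math

def _sgn(x: float) -> int:
    """Sign function: -1, 0, or +1."""
    return (x > 0) - (x < 0)

def r_channel(cqi_k: float, cqi_all: list) -> float:
    """Eq. (10): 1 when UE k's CQI exceeds the cell average, 0 otherwise."""
    avg = sum(cqi_all) / len(cqi_all)
    return float(max(_sgn(cqi_k - avg), 0))

def r_priority(is_urllc: bool) -> float:
    """Eq. (11): indicator-style bonus for URLLC (low delay budget) packets."""
    return 1.0 if is_urllc else 0.0

def r_delay(delay_ms: float, budget_ms: float) -> float:
    """Eq. (12): sinc(pi * floor(delay/budget)) = 1 within budget, ~0 beyond it."""
    n = math.floor(delay_ms / budget_ms)
    return 1.0 if n == 0 else math.sin(math.pi * n) / (math.pi * n)

def cdpa_a2c_reward(cqi_k, cqi_all, is_urllc, delay_ms, budget_ms,
                    tau=5.0, lam=5.0):
    """Eq. (9) with Table II weights: phi = 1 - delay/budget, tau = lam = 5."""
    phi = 1.0 - delay_ms / budget_ms
    return (phi * r_channel(cqi_k, cqi_all)
            + tau * r_priority(is_urllc)
            + lam * r_delay(delay_ms, budget_ms))

# Example: a V2X packet 8 ms into its 20 ms budget, on a UE with above-average CQI.
print(cdpa_a2c_reward(cqi_k=12, cqi_all=[7, 9, 12, 10], is_urllc=True,
                      delay_ms=8.0, budget_ms=20.0))   # 0.6*1 + 5*1 + 5*1 = 10.6
```

Setting τ = 0 in this sketch recovers the D-A2C configuration of Table II.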
V. SIMULATION ENVIRONMENT
In this work, we implemented the proposed algorithms in ns3-gym [33], which is a framework for connecting ns-3 with OpenAI Gym (a toolkit for integrating machine learning libraries) [34]. The neural networks of CDPA-A2C and D-A2C were implemented in PyTorch. In our simulations, three BSs are considered, and the number of UEs varies between 30 and 90; the UEs are distributed randomly in the environment. Simulation results are based on 30 runs, and each run contains 5000 iterations. To run the model, we used a PC equipped with a Core™ i7-8700 CPU and 32 GB of RAM. The simulation time depends on the number of assigned UEs and BSs, the traffic types, and the UEs' mobility; it varies between 30 and 90 minutes for simulating 5000 TTIs under the scenarios defined in this paper. In this work, numerology zero is employed with 15 kHz subcarrier spacing, 12 subcarriers per resource block, and 14 symbols per subframe, and our scheduling interval is set to 1 ms [35].

We deployed two scenarios. In the first scenario, we assume the number of UEs varies between 30 and 90 and the UEs are not mobile. We distributed three traffic types (voice, video, and IP Multimedia Subsystem (IMS)) with different QoS requirements uniformly at random among the UEs. The UE applications generate traffic with Poisson arrivals. Each traffic type has its own properties and a different arrival time with respect to other packets.

In our simulations, when the amount of time a packet stays in a UE's buffer is lower than its delay budget, we consider it a satisfied packet, and when this value is higher than the delay budget, it is counted as an unsatisfied packet. In Table I, we present in detail the required QoS metrics for each traffic type considered in this paper.

TABLE I: The employed packet types and their properties [36].

QCI | Resource Type | Priority | Packet Delay Budget | Service Example
1   | GBR           | 2        | 100 ms              | Voice
5   | Non-GBR       | 1        | 100 ms              | IMS
6   | Non-GBR       | 6        | 300 ms              | Video
75  | GBR           | 2.5      | 20 ms               | V2X
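To show how such a run is driven in practice, the sketch below outlines the per-TTI control loop between the Python agent and ns-3 through ns3-gym, following the standard Gym interface. The port number and the action choice are placeholders, and the exact Ns3Env constructor options depend on the ns3-gym version and the ns-3 scenario.

```python
from ns3gym import ns3env  # ns3-gym Python bindings [33]

env = ns3env.Ns3Env(port=5555)          # attach to the running ns-3 scenario
obs = env.reset()
for tti in range(5000):                 # 5000 scheduling intervals per run
    action = env.action_space.sample()  # placeholder: the actor's RBG choice goes here
    obs, reward, done, info = env.step(action)
    # ...A2C update with (obs, action, reward), i.e. the TD error of eqs. (4)-(8)...
    if done:
        obs = env.reset()
env.close()
```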
In the second scenario, we have 70-110 UEs in our vehicular network, 10% of which are vehicular UEs (UE_v) and the rest are fixed users (UE_c). The UE_c require a high-capacity link, while the UE_v demands need to be satisfied by a low-latency link. In this scenario, in addition to "Voice, Video, and IMS" traffic, we have vehicle-to-everything "V2X" packets that are defined based on 5GAA standards [37]. Table II presents the assigned network parameters, the reward weights ϕ, τ and λ, and the neural network settings used in our simulations.

Before presenting the network performance results, in Fig. 2 we show the convergence of the proposed reward. The figure shows the behavior of the reward function (eq. 9) when the number of UEs is 90 and 10% of them are mobile.
TABLE II: Simulation parameters.

Parameter | Value
Number of neurons | 256 × 3 layers (Actor + Critic)
Scheduler algorithm | CDPA-A2C, D-A2C, PF, CQA
Number of BSs | 3
Number of UEs | 30-110
Maximum traffic load per UE (downlink) | 256 kbps
Traffic types | Voice, Video, IMS, V2X
Traffic streams per TTI | 50
D-A2C reward weights | ϕ = 1 − packet_delay / traffic_delay_budget, τ = 0, λ = 5
CDPA-A2C reward weights | ϕ = 1 − packet_delay / traffic_delay_budget, τ = 5, λ = 5
Discount factor | 0.9
Actor learning rate | 0.01
Critic learning rate | 0.05
Fig. 2: The convergence performance of the CDPA-A2C algorithm's reward function when the number of UEs is 90 and 10% of them are mobile.

In this work, an epsilon-greedy policy is used, in which, during the exploration phase, actors perform their actions either randomly or by choosing the RB with the highest weight assigned by the proposed actor-critic model. As shown in Fig. 2, the exploration phase ends after almost 3700 rounds, and the model converges after that.
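A minimal sketch of such an epsilon-greedy selection rule is shown below; the linear decay over roughly 3700 rounds is an assumption chosen to mirror the reported length of the exploration phase, not a setting taken from the paper.

```python
import random
import torch

def epsilon_greedy_rbg(actor_probs: torch.Tensor, epsilon: float) -> int:
    """With probability epsilon explore a random RBG; otherwise exploit the RBG
    with the highest weight assigned by the actor network."""
    if random.random() < epsilon:
        return random.randrange(actor_probs.numel())
    return int(torch.argmax(actor_probs).item())

# Example with a hypothetical linear decay over the exploration phase.
probs = torch.tensor([0.05, 0.60, 0.35])
for tti in range(5000):
    epsilon = max(0.0, 1.0 - tti / 3700)      # assumed schedule
    rbg = epsilon_greedy_rbg(probs, epsilon)
```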
A. Simulation Results with no mobility
In this scenario, we assume the UEs are fixed, and we have three packet streams (voice, video, IMS) with different QoS requirements (delay budget, packet loss ratio, etc.). As mentioned previously, we consider two A2C models with different reward functions, named D-A2C and CDPA-A2C, respectively. In addition to these two models, we compare our results with the traditional Proportional Fair (PF) scheduler and the Channel and QoS Aware (CQA) scheduler described in [12].

In Fig. 3, we present the mean delay for various traffic types and a varying number of UEs. Although PF and CQA can satisfy the required delay budget for all traffic types when the number of UEs is below 50 (below 70 in the case of the CQA scheduler), as we increase the number of UEs the mean HOL delay increases significantly and becomes higher than the target delay for the voice and IMS traffic types. In cases with a higher number of UEs, the D-A2C results are better than those of the traditional PF and CQA schedulers, while employing a more comprehensive reward function such as CDPA-A2C remarkably enhances network performance. In general, the CDPA-A2C scheduler can reduce the mean delay in comparison with the D-A2C, PF and CQA schedulers by up to 100 ms, 225 ms and 375 ms, respectively.

Fig. 3: The HOL delay for different numbers of UEs (with no mobility).
Fig. 4: Packet delivery ratio (with no mobility).

In Fig. 4, we present the packet delivery ratio (the ratio of packets that satisfy the predefined delay budget). As shown, the packet delivery ratio of CDPA-A2C and D-A2C can be up to 117% and 73% higher than PF and CQA, considering various numbers of UEs and different traffic types. Moreover, CDPA-A2C can considerably enhance the packet delivery ratio in comparison to D-A2C (by up to 63% for IMS packets). Note that we assume all applications are delay-sensitive and consider packets that are beyond their service targets as undelivered.
B. Simulation Results with mobility
In the second scenario, we consider the case where 10% of the UEs are mobile. We include V2X traffic based on 5GAA standards in addition to the other traffic types used in the previous simulations. Due to the better performance of CDPA-A2C, we omit D-A2C from the second scenario.

In Fig. 5, we evaluate the mean HOL delay for various numbers of UEs (70-110) when there are mobile UEs in the network. Although the PF and CQA schedulers can maintain the mean HOL delay below the delay budget, as the number of UEs increases, the mean HOL delay increases dramatically. Moreover, due to the high sensitivity of V2X packets to delay (V2X delay budget = 20 ms), PF and CQA cannot satisfy V2X packets in any of these cases. However, the proposed CDPA-A2C model, by scheduling packets on time and preventing congestion in the UE buffers, can provide a lower delay for the presented traffic types with respect to PF and CQA.

Fig. 5: The HOL delay for different numbers of UEs (with mobility).

In Fig. 6, we present the packet delivery ratio (packets that satisfy the delay target) of the proposed model with respect to the other schedulers. In Fig. 6a, we evaluate the effect of increasing the number of UEs on the packet delivery ratio when 10% of the UEs are mobile. As shown, the packet delivery ratio of CDPA-A2C for different traffic types and numbers of UEs is up to 92% and 53% higher than PF and CQA, respectively. Although CQA performs well in the delivery of voice, video, and IMS packets, CDPA-A2C can considerably enhance the packet delivery ratio for delay-sensitive traffic such as V2X packets in comparison with CQA. We also examine the effect of increasing the ratio of mobile UEs on the packet delivery ratio, when the total number of UEs is fixed at 90, in Fig. 6b. As we can see, the proposed model can significantly enhance the packet delivery ratio for different traffic types with respect to PF and CQA. Moreover, by increasing the density of mobile UEs, the packet loss ratio of V2X packets when the CQA and PF schedulers are employed increases by up to 40% and 75%, respectively. Therefore, the CDPA-A2C scheduler can noticeably enhance QoS by reducing the HOL delay and increasing the packet delivery ratio in comparison with the PF and CQA schedulers.

Fig. 6: Packet delivery ratio (with mobility). (a) The effect of the number of UEs on the packet delivery ratio when the ratio of mobile UEs is fixed to 10%. (b) The effect of the percentage of mobile UEs on the packet delivery ratio when the total number of UEs is 90.

VI. CONCLUSION
In this paper, we propose two actor-critic learning-based schedulers, namely the delay-aware actor-critic (D-A2C) and the channel, delay and priority-aware actor-critic (CDPA-A2C) techniques, for reconfigurable wireless networks, where the network is capable of autonomously learning and adapting itself to the dynamicity of the wireless environment to optimize the utility of the network resources. Reconfigurable wireless networks play a vital role in network automation, and machine learning algorithms are an appropriate candidate for building versatile models and making future networks smarter. The proposed model is constructed from two neural networks (actor and critic). Actors are employed to apply actions (RB allocation); simultaneously, the critic is used to monitor the agents' actions and tune their behavior in the following states to speed up convergence and optimize the model. The proposed comprehensive model (CDPA-A2C), in addition to considering the channel condition and the delay budget of each packet, prioritizes the received packets by considering their types and their QoS requirements. We include fixed and mobile scenarios to evaluate the performance of the proposed schemes. We compare the learning-based schemes to two well-known algorithms: the traditional proportional fair scheduler and the QoS-aware CQA algorithm. Our results show that CDPA-A2C significantly reduces the mean delay with respect to the PF and D-A2C schedulers. Additionally, CDPA-A2C can increase the packet delivery rate in the mobile scenario by up to 92% and 53% in comparison with PF and CQA, respectively.
VII. ACKNOWLEDGEMENT
This work is supported by the Ontario Centers of Excellence (OCE) 5G ENCQOR program.
REFERENCES

[1] J. Navarro-Ortiz, P. Romero-Diaz, S. Sendra, P. Ameigeiras, J. J. Ramos-Munoz, and J. M. Lopez-Soler, “A Survey on 5G Usage Scenarios and Traffic Models,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 905–929, 2020.
[2] M. Polese, R. Jana, V. Kounev, K. Zhang, S. Deb, and M. Zorzi, “Machine learning at the edge: A data-driven architecture with applications to 5G cellular networks,” IEEE Transactions on Mobile Computing, 2020.
[3] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
[4] M. Polese, R. Jana, V. Kounev, K. Zhang, S. Deb, and M. Zorzi, “Machine learning at the edge: A data-driven architecture with applications to 5G cellular networks,” IEEE Transactions on Mobile Computing, 2020.
[5] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, “Intelligent 5G: When cellular networks meet artificial intelligence,” IEEE Wireless Communications, vol. 24, no. 5, pp. 175–183, 2017.
[6] S. Chinchali, P. Hu, T. Chu, M. Sharma, M. Bansal, R. Misra, M. Pavone, and S. Katti, “Cellular network traffic scheduling with deep reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[7] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach,” IEEE Transactions on Wireless Communications, vol. 17, no. 1, pp. 680–692, 2017.
[8] C. C. White, “A survey of solution techniques for the partially observed Markov decision process,” Annals of Operations Research, vol. 32, no. 1, pp. 215–230, 1991.
[9] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
Advances inNeural Information Processing Systems , pp. 2775–2785, 2017. a) The effect of UEs’ number on packet delivery ratio when the ratio of mobile UEs is fixed to 10%.(b) The effect of percentage of mobile UEs on packet delivery ratio when the total number of UEs is 90.
Fig. 6: Packet delivery ratio (with mobility).
[10] A. K. Saluja, S. A. Dargad, and K. Mistry, “A Detailed Analogy of Network Simulators—NS1, NS2, NS3 and NS4,” Int. J. Future Revolut. Comput. Sci. Commun. Eng., vol. 3, pp. 291–295, 2017.
[11] M. T. Kawser, H. Farid, A. R. Hasin, A. M. Sadik, and I. K. Razu, “Performance comparison between round robin and proportional fair scheduling methods for LTE,” International Journal of Information and Electronics Engineering, vol. 2, no. 5, pp. 678–681, 2012.
[12] B. Bojovic and N. Baldo, “A new channel and QoS aware scheduler to enhance the capacity of voice over LTE systems,” in, pp. 1–6, IEEE, 2014.
[13] M. F. Audah, T. S. Chin, Y. Zulfadzli, C. K. Lee, and K. Rizaluddin, “Towards Efficient and Scalable Machine Learning-Based QoS Traffic Classification in Software-Defined Network,” in International Conference on Mobile Web and Intelligent Information Systems, pp. 217–229, Springer, 2019.
[14] M. A. Habibi, M. Nasimi, B. Han, and H. D. Schotten, “A comprehensive survey of RAN architectures toward 5G mobile communication system,” IEEE Access, vol. 7, pp. 70371–70421, 2019.
[15] S. Abedi, “Efficient radio resource management for wireless multimedia communications: a multidimensional QoS-based packet scheduler,” IEEE Transactions on Wireless Communications, vol. 4, no. 6, pp. 2811–2822, 2005.
[16] G. Piro, L. A. Grieco, G. Boggia, R. Fortuna, and P. Camarda, “Two-level downlink scheduling for real-time multimedia services in LTE networks,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 1052–1065, 2011.
[17] G. Monghal, D. Laselva, P.-H. Michaelsen, and J. Wigard, “Dynamic packet scheduling for traffic mixes of best effort and VoIP users in E-UTRAN downlink,” in, pp. 1–5, IEEE, 2010.
[18] T. Şahin, R. Khalili, M. Boban, and A. Wolisz, “Reinforcement learning scheduler for vehicle-to-vehicle communications outside coverage,” in, pp. 1–8, IEEE, 2018.
[19] S. Nath, Y. Li, J. Wu, and P. Fan, “Multi-user Multi-channel Computation Offloading and Resource Allocation for Mobile Edge Computing,” in ICC 2020 - 2020 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, 2020.
[20] G. R. Ghosal, D. Ghosal, A. Sim, A. V. Thakur, and K. Wu, “A Deep Deterministic Policy Gradient Based Network Scheduler For Deadline-Driven Data Transfers,” in, pp. 253–261, IEEE, 2020.
[21] H. Chergui and C. Verikoukis, “Offline SLA-constrained deep learning for 5G networks reliable and dynamic end-to-end slicing,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 350–360, 2019.
[22] H. Chergui and C. Verikoukis, “Big Data for 5G Intelligent Network Slicing Management,” IEEE Network, vol. 34, no. 4, pp. 56–61, 2020.
[23] I.-S. Comşa, S. Zhang, M. E. Aydin, P. Kuonen, Y. Lu, R. Trestian, and G. Ghinea, “Towards 5G: A reinforcement learning-based scheduling solution for data traffic management,” IEEE Transactions on Network and Service Management, vol. 15, no. 4, pp. 1661–1675, 2018.
[24] I.-S. Comşa, S. Zhang, M. Aydin, P. Kuonen, R. Trestian, and G. Ghinea, “A comparison of reinforcement learning algorithms in fairness-oriented OFDMA schedulers,” Information, vol. 10, no. 10, p. 315, 2019.
[25] M. Elsayed and M. Erol-Kantarci, “AI-enabled radio resource allocation in 5G for URLLC and eMBB users,” in, pp. 590–595, IEEE, 2019.
[26] M. Mohammadi and A. Al-Fuqaha, “Enabling cognitive smart cities using big data and machine learning: Approaches and challenges,” IEEE Communications Magazine, vol. 56, no. 2, pp. 94–101, 2018.
[27] A. Martin, J. Egaña, J. Flórez, J. Montalbán, I. G. Olaizola, M. Quartulli, R. Viola, and M. Zorrilla, “Network resource allocation system for QoE-aware delivery of media services in 5G networks,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 561–574, 2018.
[28] X. Du, Y. Sun, N. B. Shroff, and A. Sabharwal, “Balancing Queueing and Retransmission: Latency-Optimal Massive MIMO Design,” IEEE Transactions on Wireless Communications, vol. 19, no. 4, pp. 2293–2307, 2020.
[29] I.-S. Comşa, G.-M. Muntean, and R. Trestian, “An Innovative Machine-Learning-Based Scheduling Solution for Improving Live UHD Video Streaming Quality in Highly Dynamic Network Environments,” IEEE Transactions on Broadcasting, 2020.
[30] M. Sewak, “Actor-Critic Models and the A3C,” in Deep Reinforcement Learning, pp. 141–152, Springer, 2019.
[31] M. Elsayed and M. Erol-Kantarci, “Learning-based resource allocation for data-intensive and immersive tactile applications,” in, pp. 278–283, IEEE, 2018.
[32] M. Elsayed, M. Erol-Kantarci, B. Kantarci, L. Wu, and J. Li, “Low-latency communications for community resilience microgrids: A reinforcement learning approach,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1091–1099, 2019.
[33] P. Gawłowicz and A. Zubow, “NS-3 meets OpenAI Gym: The playground for machine learning in networking research,” in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 113–120, 2019.
[34] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
[35] J. Vihriälä, A. A. Zaidi, V. Venkatasubramanian, N. He, E. Tiirola, J. Medbo, E. Lähetkangas, K. Werner, K. Pajukoski, A. Cedergren, et al., “Numerology and frame structure for 5G radio access,” in, pp. 1–5, IEEE, 2016.
[36] “Table 6.1.7-A: Standardized QCI characteristics,” from 3GPP TS 23.203 V16.1.0.
[37] 5GAA: Paving the Way towards 5G.