Dealing with Limited Backhaul Capacity in Millimeter Wave Systems: A Deep Reinforcement Learning Approach
IEEE COMMUNICATIONS MAGAZINE, VOL. XXX, NO. XXX, MONTH YEAR
Mingjie Feng, Student Member, IEEE, and Shiwen Mao, Fellow, IEEE
Abstract—Millimeter wave (mmWave) communication is one of the key technologies of fifth generation (5G) wireless systems to achieve the expected 1000x data rate. With the large bandwidth at mmWave bands, the link capacity between users and a base station (BS) can be much higher than in sub-6 GHz wireless systems. Meanwhile, due to the high cost of infrastructure upgrades, it would be difficult for operators to drastically enhance the capacity of the backhaul links between mmWave BSs and the core network. As a result, the data rate provided by the backhaul may not be sufficient to support all mmWave links, and the backhaul connection becomes the new bottleneck that limits system performance. On the other hand, as mmWave channels are subject to random blockage, the data rates of mmWave users vary significantly over time. With limited backhaul capacity and highly dynamic user data rates, how to allocate backhaul resources to each user remains a challenge for mmWave systems. In this article, we present a deep reinforcement learning (DRL) approach to address this challenge. By learning the blockage pattern, the system dynamics can be captured and predicted, resulting in efficient utilization of the backhaul resource. We begin with a discussion of DRL and its applications in wireless systems. We then investigate the problem of backhaul resource allocation and present the DRL based solution. Finally, we discuss open problems for future research and conclude the article.
I. INTRODUCTION
With the explosion of smart devices and data-intensive wireless applications, the demand for high data rate services has drastically increased in recent years. To meet such demand, the fifth generation (5G) cellular network is under intensive research in both industry and academia. According to a recent report, 5G networks are expected to support massive connections with a minimum data rate of 100 Mbps and peak data rates higher than 10 Gbps [1]. To achieve this goal, several technologies are considered as candidates for 5G systems, including millimeter-wave (mmWave) communication, massive MIMO, and small cells. By operating at mmWave bands with large bandwidth, an mmWave system can significantly enhance the data rate performance to the multi-Gbps level.

As the data rates of the links between an mmWave base station (BS) and users are greatly enhanced, the capacity of the backhaul link between the BS and the core network becomes relatively limited, posing a new challenge to mmWave cellular networks. Compared to a long term evolution (LTE) system with typical cell throughput less than 150 Mbps [2], the cell throughput of an mmWave system can be greater than 1.5 Gbps [3], which
M. Feng and S. Mao are with the Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849-5201 USA. Email: [email protected], [email protected].

is comparable to the data rate of a current backhaul link. As a result, the backhaul links in mmWave cellular networks are expected to achieve much higher data rates than those in current cellular networks. In current LTE networks, a backhaul link is configured to support the peak cell throughput. However, this may not be feasible in mmWave networks. Due to cost concerns, it is highly unlikely for operators to upgrade existing infrastructure to drastically enhance the capacity of wired backhauls. In the case of wireless backhaul, e.g., mmWave-based wireless backhaul or free space optical, although the cost can be reduced, the challenge brought by limited backhaul capacity remains. On the one hand, the capacity of a wireless backhaul link is shared by multiple BS-user links. On the other hand, the backhaul links are likely to experience higher propagation loss than the BS-user links.

The tension caused by limited backhaul capacity may be aggravated in the future as the data rate of mmWave links is expected to keep increasing. For example, high-resolution virtual reality (VR) requires data rates on the order of 1 Gbps and latency of 1 ms. Based on a prediction in [1], 5G mmWave networks will need to support 50 Gbps data rates by 2024. In addition, due to the expected dense deployment of mmWave BSs [4], a large number of backhaul connections, which can be wired or wireless, would coexist. As a result, the achievable data rate of each backhaul link would be limited, which may be caused by resource sharing, mutual interference, potential congestion, or increased overhead [5]. Therefore, unlike traditional cellular networks (from 1G to 4G) in which the wireless transmission between the BS and users is the bottleneck, the backhaul becomes a potential bottleneck in mmWave systems.
Although some field tests were performed to demonstrate the potential of mmWave cellular systems, such as in [3], these tests are not based on actual cellular networks. Thus, the impact of limited backhaul capacity has not been tested and verified, which requires further investigation. The challenge of a possible bottleneck at the backhaul has been observed in the context of ultra-dense small cell deployment [4], in which the large number of small cells puts pressure on the backhaul links. Compared to the case of network densification, the bottleneck challenge in an mmWave system is caused by the significantly increased data rate of mmWave transmissions.

On the other hand, due to the short wavelength of mmWave communication, the transmissions between the BS and users are subject to random blockage. As a result, the data rate of each user is highly dynamic. In contrast, the data rate of a backhaul link is much more stable, since it is implemented by a wired connection or a line of sight (LOS) wireless connection. Therefore, the BS-user link is characterized by high data rate and unstable connection, while the backhaul link is characterized by relatively limited data rate and stable connection, as shown in Fig. 1. To balance such a mismatch and enhance the system performance, efficient backhaul resource allocation to each user is necessary. For example, when a user switches from LOS transmission to non line of sight (NLOS) transmission or outage, less resource should be allocated to this user. However, such adaptive control cannot be implemented by traditional resource allocation schemes, due to the varying system dynamics. To perform efficient scheduling, a BS needs to predict possible blockage and estimate the data rate of each user based on current channel state information (CSI). Then, it makes a decision on the backhaul resource allocation and sends a request to the core network. This way, the backhaul scheduling can be performed in a timely manner that captures the blockage pattern.

Fig. 1. System model of an mmWave system with limited backhaul capacity.

Deep reinforcement learning (DRL) is a new paradigm for intelligent decision-making [6], which can be implemented with frameworks such as TensorFlow and Keras. Combining reinforcement learning and deep neural networks, a DRL agent interacts with the environment and learns the pattern of a Markov decision process (MDP) through training experience. Specifically, a DRL agent employs a deep neural network to approximate the Q-values, where the Q-values are defined as the discounted cumulative rewards that can be obtained by taking different actions under certain system states. Then, the agent makes optimal decisions based on the estimated Q-values. Compared to other machine learning approaches, DRL is model-free and does not require data samples from an external supervisor. Due to these benefits, the application of DRL in wireless networks has drawn growing attention recently.
In this article, we apply DRL to deal with the challenge of limited backhaul capacity in mmWave networks. By learning the blockage pattern based on the CSI of mmWave users, a BS decides the resource allocation of the backhaul link with the objective of maximizing the sum utility of all users.

In the remainder of this article, we first introduce the background of DRL and review its recent applications in wireless systems. Then, we present a DRL based approach for backhaul resource allocation. Finally, we discuss open research problems and conclude the article.

II. DEEP REINFORCEMENT LEARNING FOR WIRELESS SYSTEMS
A. Preliminaries of Deep Reinforcement Learning
A reinforcement learning (RL) agent aims to learn from the environment and take actions to maximize the long term cumulative reward. The environment is modeled as an MDP with state space S, and an RL agent can take actions from space A. The agent interacts with the environment by taking actions, observing the reward and system state transition, and updating its knowledge about the environment. The objective of an RL algorithm is to find the optimal policy, which determines the strategy of taking actions under certain system states. A policy π is specified by π(a|s) = P{A_t = a | S_t = s}. In general, a policy is in a stochastic form to enable exploration over different actions. To find the optimal policy, the key component is to determine the value of each state-action function, also known as the Q-function, which is defined by

$$ Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_\pi(s') \qquad (1) $$

where R_s^a is the instant reward that can be obtained by taking action a under state s; P_{ss'}^a is the transition probability from state s to state s' under action a; γ is the discount factor used to balance the long-term and short-term rewards; and G_t is the cumulative reward from time t, given by G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}. In (1), v_π(s) is the state-value function, which indicates the expected reward if the system is in state s and follows policy π, given by v_π(s) = E_π[G_t | S_t = s] = Σ_{a∈A} π(a|s) Q_π(s, a). With Q-functions, an MDP is solved when the optimal policy is found, i.e., Q*(s, a) = max_π Q_π(s, a). A common RL technique for solving an MDP is Q-learning, which uses an empirical iterative approach to update the values of the Q-functions (Q-values).
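To make Eq. (1) concrete, the following sketch evaluates a fixed stochastic policy on a toy two-state, two-action MDP and checks that the converged Q-values satisfy the Bellman expectation equation. All numbers (transition probabilities, rewards, policy) are hypothetical placeholders chosen for illustration only.

```python
import numpy as np

# A toy 2-state, 2-action MDP (all values hypothetical).
# P[s, a, s'] : transition probability P_ss'^a; R[s, a] : instant reward R_s^a.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],           # pi(a|s): a fixed stochastic policy
               [0.4, 0.6]])
gamma = 0.9                          # discount factor

# Iterative policy evaluation of Eq. (1):
#   Q_pi(s,a) = R_s^a + gamma * sum_s' P_ss'^a v_pi(s'),
#   v_pi(s)   = sum_a pi(a|s) Q_pi(s,a).
Q = np.zeros((2, 2))
for _ in range(1000):
    v = (pi * Q).sum(axis=1)         # state-value function v_pi
    Q = R + gamma * P @ v            # Bellman expectation backup

# At the fixed point, Eq. (1) holds exactly.
v = (pi * Q).sum(axis=1)
assert np.allclose(Q, R + gamma * P @ v)
print(np.round(Q, 3))
```

Since the backup is a γ-contraction, the loop converges to the unique Q_π regardless of the initial guess.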
In particular, the agent interacts with the environment by taking actions and obtaining rewards, and then updates the Q-values by

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (2) $$

RL has been applied to decision-making problems in mmWave networks, such as in [7]. However, in large scale systems with large numbers of states and actions, the traditional Q-learning approach becomes infeasible, since a table is required to store all the Q-values. In addition, traditional Q-learning needs to visit and evaluate every state-action pair, resulting in huge complexity and slow convergence. An effective approach to deal with this challenge is to use a neural network (NN) to approximate the Q-values, given by Q(s, a, w) ≈ Q_π(s, a), where w are the weights of the NN. By training the NN with sampled data, the NN can map the inputs of state-action pairs to their corresponding Q-values. However, a direct application of an NN in Q-learning may be unstable or even diverge, due to the correlations between training samples and the correlations between Q-values and target values [6].

To reduce such correlations, a DRL approach was proposed in [6], in which a deep neural network (DNN) is used to approximate the Q-values, yielding a deep Q-network (DQN). In the DRL approach presented in [6], the agent first explores the environment by randomly taking actions and stores the experience, e_t = (s_t, a_t, R_t, s_{t+1}), in an experience memory. Then, a mechanism called experience replay is used, where the data are randomly sampled in minibatches from the experience memory to break the correlation in a sequence of observations. With the sampled experience, the weights of the DQN are updated by minimizing the loss function given by

$$ L_i(w_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a', w_i^-) - Q(s, a, w_i) \right)^2 \right] \qquad (3) $$

where w_i and w_i^- are the weights of the DQN and the target network at iteration i, respectively. The loss function (3) is the mean square error between the DQN and the target network, which can be minimized through stochastic gradient descent. To reduce the correlation between the DQN and the target network, the target network is updated less frequently. After the training of the DQN, the agent takes actions based on the estimated Q-values. The general framework of the DRL approach in [6] is shown in Fig. 2.

Fig. 2. Framework of the DRL approach in [6].

B. Applications in Wireless Networks
In the design of wireless networks, a major challenge is to solve the formulated combinatorial problems. While exhaustive search is infeasible due to its prohibitive complexity, existing solutions typically rely on network information exchange, which yields a tradeoff between overhead and performance. In DRL approaches, the network optimization is based on a trial and error process, which does not require explicit or instantaneous network information. In particular, a DRL algorithm is model-free and does not require explicit knowledge of the inter-dependent patterns of different nodes. In addition, with extensive offline training, a DRL agent is able to predict the system dynamics, which enables timely scheduling. Thus, compared to traditional approaches, DRL-based schemes have the potential to achieve better performance with reduced online overhead.

Due to such promising prospects, DRL algorithms have recently been considered in several wireless networks to perform intelligent decision making [8]–[13]. In [8], DRL is used to estimate the availability of cache and select a proper set of users for interference alignment. In [9], [10], [13], the problem of multi-channel access is considered, in which each user observes the channel dynamics from history, estimates the possible actions of other users, and then determines its channel access strategy. In [11], DRL is used to predict the QoS that can be obtained when handing over to another BS, resulting in an efficient handover process. In [12], continuous actions and states are considered, so that DQN-based DRL cannot be applied. The deep deterministic policy gradient (DDPG) method, which is based on the actor-critic framework, was employed to address the continuous space control problem. The general idea is to parameterize the Q-functions and derive the optimal values of the parameters through policy gradient. In [10], [11], [13], the problems are formulated as multi-agent control with interactions among agents.
As a result, experience replay for a single agent cannot be applied in such scenarios. To take the inter-agent impact into account, the long short term memory (LSTM) approach is used to generate target values. The key aspects of the system models in recent works are summarized in Table I.

III. DRL BASED BACKHAUL RESOURCE ALLOCATION
A. System Model
We consider an mmWave BS serving K user equipments (UEs) indexed by k = 1, ..., K. Each UE has three link states, LOS, NLOS, and outage, which are denoted by three binary 0-1 variables, x_{k,i}(t), i = 1, 2, 3. Specifically, x_{k,1}(t) = 1, x_{k,2}(t) = 1, and x_{k,3}(t) = 1 indicate that user k is under the LOS, NLOS, and outage state at time t, respectively. The link state of each user follows a Markov process with steady-state probabilities given in [3]. We assume that the BS can estimate the values of x_{k,i}(t) through the statistics of user signals. The BS can also measure the achievable data rate of the mmWave link for user k, C_k(t), via uplink signals.

We assume the backhaul resource is divided into M orthogonal blocks, M > K, where each block can be a period of time or a range of wavelength. The capacity of each block is U, so the total backhaul capacity is U · M. Let n_k(t) be the number of blocks allocated to user k; the backhaul capacity allocated to user k is then B_k(t) = U · n_k(t). The actual data rate of user k is R_k(t) = min{B_k(t), C_k(t)}.

B. DRL Framework
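The link-state dynamics and rate computation of the system model above can be sketched as follows. The transition matrix and per-state rates here are hypothetical placeholders; the article only specifies that the chain's steady-state probabilities follow [3].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-state Markov chain over {LOS, NLOS, outage}; rows sum to 1.
T = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.85, 0.05],
              [0.05, 0.15, 0.80]])
C_per_state = [2.0, 0.4, 0.0]      # achievable rate C_k(t) per state (Gbps; illustrative)

U, n_k = 0.5, 3                    # block capacity U and allocated blocks n_k(t)
B_k = U * n_k                      # backhaul capacity B_k(t) = U * n_k(t)

state, R = 0, []                   # start in the LOS state
for t in range(1000):
    C_k = C_per_state[state]
    R.append(min(B_k, C_k))        # actual rate R_k(t) = min{B_k(t), C_k(t)}
    state = rng.choice(3, p=T[state])

print(f"average R_k(t) over the run: {np.mean(R):.2f} Gbps")
```

Note how the allocated backhaul capacity B_k(t) caps the LOS-state rate, while in outage the mmWave link itself is the limiting factor; this mismatch is exactly what the DRL agent must track.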
The proposed DRL-based approach employs a DQN to find the resource allocation strategy under different system states. The key component of the system state is the achievable data rate of each UE, C_k(t). We also set the link state of each UE, x_{k,i}(t), as part of the system state, since it impacts the future
TABLE I
APPLICATIONS OF DRL IN DIFFERENT WIRELESS NETWORKS
Application | State | Action | Reward | Learning Objective
[8] Cache-based interference alignment | Channel power gain | User selection for interference alignment | Network throughput | Channel dynamics & cache availability
[9] Multi-channel access | Channel state (good/bad) | Channel selection of each user | Number of successful transmissions | Channel availability
[10] Resource management in LTE-Unlicensed | Current channel usage pattern | Channel access probability on selected channels | Total throughput | Channel access patterns of other users
[11] Handover control in ultra-dense network | Signal qualities from different BSs | BS selection | Weighted sum of data rate & handover energy | Prediction of channel qualities from different BSs
[12] Traffic allocation in multi-hop network | Throughput & delay of each session | Traffic split ratio | Total utility (weighted sum of throughput & delay) | Learn traffic pattern from experience
[13] Multi-channel random access | Channel access of other users | Channel access strategy | Number of successful transmissions | Probabilities of successful transmission over multiple channels

data rates. Then, the system state is used as the input of the DQN. The action taken by the agent indicates the backhaul capacity allocation, i.e., the number of blocks allocated to each user, n_k(t). The action space consists of all feasible resource allocations, which include multiple combinations of integers that satisfy Σ_k n_k(t) = M, and we index the actions by a(t) = 1, ..., A. To achieve a good system performance as well as guarantee fairness among users, we define the utility of each user as a concave function of its data rate. Then, the system reward is set as the sum of the utilities of all users. The architecture of the DQN is shown in Fig. 3. The input layer includes the link state and achievable data rate information of all UEs. The output layer presents the approximated Q-values, and there are several hidden layers between the input and output layers.
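As a concrete illustration, the mapping from a system state to per-action Q-values, the ε-greedy action selection, and the target value used in the loss (3) can be sketched in a few lines of numpy. This is a minimal sketch, not the implementation behind the results reported in this article; the layer sizes, weights, and state vector are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the real input holds the rate and link-state features of
# all UEs, and the output has one Q-value per feasible allocation action.
n_state, n_hidden, n_action = 6, 16, 4

w = {"W1": rng.normal(0.0, 0.1, (n_state, n_hidden)), "b1": np.zeros(n_hidden),
     "W2": rng.normal(0.0, 0.1, (n_hidden, n_action)), "b2": np.zeros(n_action)}

def q_values(s, w):
    """One-hidden-layer Q-network: state vector -> one Q-value per action."""
    h = np.maximum(0.0, s @ w["W1"] + w["b1"])       # ReLU hidden layer
    return h @ w["W2"] + w["b2"]

def epsilon_greedy(q, eps, rng):
    """Pick argmax_a Q(s,a) w.p. 1-eps, a uniformly random action w.p. eps."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

s = rng.normal(size=n_state)          # stand-in for the current system state
q = q_values(s, w)
a = epsilon_greedy(q, eps=0.1, rng=rng)

# Target value of the loss (3), using a separate, infrequently updated copy
# w_minus of the weights (the target network).
w_minus = {k: v.copy() for k, v in w.items()}
r, s_next, gamma = 1.0, rng.normal(size=n_state), 0.9
y = r + gamma * np.max(q_values(s_next, w_minus))
```

In a full training loop, (s, a, r, s') tuples would be drawn in minibatches from a replay memory and the squared error (y − Q(s, a, w))² minimized by gradient descent, as described above.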
To match the capacity of a backhaul resource block, we define D_k(t) = ⌈C_k(t)/U⌉ and use it at the input layer of the DQN. D_k(t) indicates the number of resource blocks needed to satisfy the data rate requirement of UE k.

The training procedure of the DQN is the same as the one in [6], which uses experience replay to reduce the correlation between training samples, as shown in Fig. 2. With the DQN, the agent at the BS first observes the current system state, i.e., the values of D_k(t) and x_{k,i}(t) of all users. Then, it obtains the Q-values of taking different actions, i.e., selecting different resource allocation strategies. With the Q-values, the agent takes an action according to the ε-greedy approach, which selects the action with the maximum Q-value with probability 1 − ε and randomly selects an action with probability ε.

C. Illustrative Example
We evaluate the performance of the DRL based approach with simulations. We consider an mmWave cell with a coverage radius of 100 m, in which users are randomly distributed. Let d be the distance between a user and the BS. The probabilities of a user being in different link states are functions of d. The probabilities of the outage, LOS, and NLOS states are p_out(d) = max(0, 1 − e^{−a_out d + b_out}), p_LOS(d) = (1 − p_out(d)) e^{−a_los d}, and p_NLOS(d) = 1 − p_out(d) − p_LOS(d), respectively [3], which are the steady-state probabilities of the Markov process of the link state x_{k,i}(t). We employ the channel model of the 73 GHz band in [3], where the NLOS links experience higher path loss than the LOS links. The system bandwidth is 1 GHz, and the transmission powers of the BS and UEs are 30 dBm and 20 dBm, respectively. The backhaul capacity is 10 Gbps, and the backhaul resource is divided into 20 resource blocks. There are two hidden layers in the DQN, and we use ReLU as the activation function.

Fig. 3. Architecture of the DQN for backhaul resource allocation.

We consider two DRL-based schemes, namely DRL-1 and DRL-2, with reward functions given as Σ_k log(R_k(t)) and Σ_k √(R_k(t)), respectively. With the logarithmic utility function, the DRL-1 scheme achieves proportional fairness. Compared to DRL-1, DRL-2 favors efficiency at the cost of fairness. Two benchmark schemes are considered for comparison, a myopic scheme and an equal allocation scheme. In the myopic scheme, the backhaul resource allocation is based on the current data rates of the mmWave links, without considering future changes of the link states.

Fig. 4 shows the sum rate performance under different numbers of users. As the number of users increases, the sum rates of all schemes grow at reduced rates, showing that the system performance is limited by the backhaul capacity. The proposed DRL-based schemes outperform the other schemes, and the performance gap is enlarged when the number of users increases. This is because the BS is able to predict the variation of the link states and allocate the resource based on long-term considerations. Then, the backhaul resource can be efficiently utilized, and such an advantage becomes significant when the number of users is large. Compared to the DRL-1 scheme, DRL-2 achieves a higher data rate, since its utility and reward functions are set to prioritize efficiency over fairness.

Fig. 4. Sum rate performance of different schemes versus the number of users.

The performance under different values of the blockage coefficient a_out is shown in Fig. 5. The blockage coefficient, defined in [3], indicates the likelihood that a user experiences blockage. Given the same BS-UE distance, a scenario with a larger a_out has a higher blockage probability than a scenario with a lower a_out. From Fig. 5, we can see that when a_out is small, the performance of the myopic scheme is close to that of the proposed DRL-based schemes, since the ratio of users under blockage is small and the data rates of the mmWave links are relatively stable. However, when a_out increases, the performance gap between the proposed schemes and the myopic scheme increases, showing that the DRL-based scheduler is effective in capturing the system dynamics and making intelligent decisions from the perspective of long-term benefit.

IV. OPEN PROBLEMS AND FUTURE RESEARCH
A. Joint Optimization of Backhaul and MmWave Link
The DRL based backhaul resource allocation presented in Section III is based on the given achievable rate of each user. To mitigate the pressure caused by limited backhaul capacity, the design of the BS-user links can also be considered. The design of resource allocation in LTE systems with limited backhaul capacity has been studied in [14]. In mmWave systems, the data rate of each mmWave link can be adjusted through precoding design. Considering the channel characteristics of different users, a joint consideration of backhaul resource allocation and precoding can provide a better solution to balance the tension between the limited backhaul and the increased mmWave data rate demand.

Fig. 5. Performance of different schemes versus the blockage coefficient a_out.

B. Dynamic Backhaul Capacity
In our model, we assume a fixed capacity for the backhaul, which corresponds to the case of a wired backhaul or a LOS mmWave backhaul with a highly stable data rate. However, in a practical system with wireless backhaul, the data rate of the backhaul would vary over time. Thus, it is necessary for the agent to learn such dynamics as well, and a more sophisticated design is required based on the proposed framework.
C. Multi-Cell Scenario

1) Capacity Allocation Among Different Backhauls: The design in Section III is based on a single cell scenario. From a multi-cell perspective, the capacity allocated to each backhaul can be optimized to further enhance the system performance. For example, an mmWave BS with heavy traffic and a high aggregated data rate requirement can share more capacity from the core network. However, load balancing and capacity allocation require coordination between different BSs, and an efficient design is required. In addition, how to address the scalability issue would be another challenge. Capacity allocation among different backhauls for load balancing has been investigated in other wireless networks, such as heterogeneous cloud radio access networks [15]. Due to the dynamic nature of mmWave communications, the varying capacity requirement of each backhaul needs to be learned to enable effective scheduling.
2) Adaptive User Association: To mitigate the pressure of limited backhaul, an effective approach is to perform load balancing. For a BS with a large deficit in backhaul capacity, part of the users served by the BS can be handed over to neighboring BSs to reduce the traffic demand on this BS. Thus, traffic-aware user association is another design factor that can be considered for better system performance.
D. Heterogeneous Network
In a heterogeneous network, the traffic of small cells is transmitted to a macrocell via backhaul connections and then forwarded to the core network via the backhaul of the macrocell. Then, the backhaul resource allocation becomes a two-tier problem, which requires a more complicated design. In addition, similar to the multi-cell case, the capacity allocation for different small cell backhaul links and adaptive user association are important design issues that should be jointly considered with backhaul resource allocation.
E. Caching Assisted System
BS caching, e.g., femtocaching, was recently proposed as an effective approach to enhance the data rates of users. By downloading popular contents in advance and storing them at local BSs, the files requested by users can be directly transmitted from the local BS. While the primary goal of caching is to increase the capacity of BS-user links and reduce delay, it is also a good solution to the limited backhaul capacity challenge. When the traffic load of an mmWave BS is low, it can request popular files from the core network. When the traffic load is increased, the popular files at the BS can be used to satisfy the demands of some users. As a result, the backhaul capacity is mainly used to satisfy the instantaneous demands from users, thus mitigating the traffic burden at the backhaul. Under the caching architecture, the key design issue is the selection of popular contents. With limited storage, it is necessary to learn the patterns of user preference and blockage. For example, when a user is under frequent blockage, caching and storing the content of this user would lead to under-utilization. However, if the content requested by the user is also frequently requested by other users, the utilization would be improved. Thus, the agent needs to learn multiple patterns to derive an efficient caching strategy.
F. Performance-Complexity Tradeoff
In the system model of Section III, we assume the backhaul resource is divided into M blocks. To improve resource utilization and enhance the system performance, a larger value of M is desirable. However, this results in increased dimensions of both the action and state spaces. Thus, an adaptive selection of M that achieves a good tradeoff between complexity and performance is another design issue.

V. CONCLUSION
In this article, we aim to address the challenge of limited backhaul capacity in mmWave networks with a DRL based approach. We first overview the background of DRL and its applications in wireless networks. Then, we present a DRL based approach to enable efficient backhaul resource allocation and show its effectiveness through an illustrative example. Finally, we discuss future research problems and conclude the article.

ACKNOWLEDGMENT
This work was supported in part by the NSF under Grant CNS-1702957 and by the Wireless Engineering Research and Education Center at Auburn University.

REFERENCES
IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1164–1179, June 2014.
[4] X. Ge, S. Tu, G. Mao, C.-X. Wang, and T. Han, "5G ultra-dense cellular networks," IEEE Wireless Commun. Mag., vol. 23, no. 1, pp. 72–79, Feb. 2016.
[5] M. Feng, S. Mao, and T. Jiang, "Joint frame design, resource allocation and user association for massive MIMO heterogeneous networks with wireless backhaul," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1937–1950, Mar. 2018.
[6] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[7] M. Mezzavilla, S. Goyal, S. Panwar, S. Rangan, and M. Zorzi, "An MDP model for optimal handover decisions in mmWave cellular networks," in Proc. IEEE EuCNC'16, Athens, Greece, June 2016, pp. 100–105.
[8] Y. He, Z. Zhang, F.R. Yu, N. Zhao, H. Yin, V.C.M. Leung, and Y. Zhang, "Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433–10445, Nov. 2017.
[9] S. Wang, H. Liu, P.H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Trans. Cognitive Commun. and Netw., vol. 4, no. 2, pp. 257–265, June 2018.
[10] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, July 2018.
[11] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, "Handover control in wireless systems via asynchronous multi-user deep reinforcement learning," IEEE Internet of Things J., DOI: 10.1109/JIOT.2018.2848295.
[12] Z. Xu et al., "Experience-driven networking: A deep reinforcement learning based approach," in Proc. IEEE INFOCOM'18, Honolulu, HI, Apr. 2018.
[13] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks," in Proc. IEEE GLOBECOM'17, Singapore, Dec. 2017.
[14] D.W.K. Ng, E.S. Lo, and R. Schober, "Energy-efficient resource allocation in multi-cell OFDMA systems with limited backhaul capacity," IEEE Trans. Wireless Commun., vol. 11, no. 10, pp. 3618–3631, Oct. 2012.
[15] C. Ran, S. Wang, and C. Wang, "Balancing backhaul load in heterogeneous cloud radio access networks," IEEE Wireless Commun. Mag., vol. 22, no. 3, pp. 42–48, June 2015.
Mingjie Feng [S'15] received his Ph.D. degree in Electrical and Computer Engineering from Auburn University, Auburn, AL, USA, in 2018. He received his Bachelor's and Master's degrees from the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China, in 2010 and 2013, respectively. He is currently a postdoctoral research associate in the Department of Electrical and Computer Engineering at the University of Arizona. In 2013, he was a visiting student in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include mmWave communication, massive MIMO, cognitive radio networks, heterogeneous networks, and full-duplex communication. He is a recipient of a Woltosz Fellowship at Auburn University.
Shiwen Mao [S'99-M'04-SM'09-F'19] received his Ph.D. in ECE from Polytechnic University, Brooklyn, NY in 2004. He is the Samuel Ginn Distinguished Professor and Director of the Wireless Engineering Research and Education Center at