Deep Reinforcement Learning for Dynamic Spectrum Sharing of LTE and NR
Ursula Challita and David Sandberg
Ericsson Research, Stockholm, Sweden
Email: {ursula.challita and david.sandberg}@ericsson.com

Abstract—In this paper, a proactive dynamic spectrum sharing scheme between 4G and 5G systems is proposed. In particular, a controller decides on the resource split between NR and LTE every subframe while accounting for future network states such as high-interference subframes and multimedia broadcast single frequency network (MBSFN) subframes. To solve this problem, a deep reinforcement learning (RL) algorithm based on Monte Carlo Tree Search (MCTS) is proposed. The introduced deep RL architecture is trained offline, whereby the controller predicts a sequence of future states of the wireless access network by simulating hypothetical bandwidth splits over time starting from the current network state. The action sequence resulting in the best reward is then assigned. This is realized by predicting the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each network state. Simulation results show that the proposed scheme is able to take actions while accounting for future states instead of being greedy in each subframe. The results also show that the proposed framework improves system-level performance.
I. INTRODUCTION
Dynamic spectrum sharing (DSS) has emerged as an effective solution for a smooth transition from 4G to 5G by introducing 5G systems in existing 4G bands without hard/static spectrum re-farming [1]. Using DSS, 4G LTE [2] and 5G NR [3] can operate in the same frequency band, where a controller distributes the available spectrum resources dynamically over time between the two radio access technologies (RATs). For instance, in LTE-NR downlink (DL) sharing, the LTE scheduler loans resources to NR for a certain time, and NR avoids the symbols used by LTE for cell-specific signals. Moreover, DSS helps ease the transition from non-standalone 5G networks to standalone 5G. That said, it is important to investigate an effective scheme for the bandwidth (BW) split between LTE and NR to reap the benefits of DSS.

While some recent literature has studied the problem of spectrum sharing between LTE and WiFi (i.e., LTE-unlicensed) [4], NR and WiFi (i.e., NR-unlicensed) [5], aerial and ground networks [6], and radars and communication systems [7], the performance analysis of 4G/5G DSS remains relatively scarce [8]. For instance, an instant spectrum sharing technique at subframe time scale has been proposed in [8]. The proposed scheme takes into account several pieces of information about the cell, such as the amount of data in the buffer, and splits the BW between 4G and 5G in every transmission time interval (TTI). Despite the promising results, this work considers a reactive spectrum sharing approach that does not account for future network states, which results in performance degradation. In a proactive approach, on the other hand, rather than reactively splitting the BW based on incoming demands and serving them when requested, the network takes future states into account for 4G/5G spectrum sharing, thus improving the overall system-level performance.

The main contribution of this paper is to introduce a novel model-based deep reinforcement learning (RL) algorithm for DSS between LTE and NR. The main scope of the proposed scheme is planning in the time domain, whereby the controller distributes the communication resources dynamically over time and frequency between LTE and NR at a subframe level while accounting for future network states over a specific time horizon. To enable efficient planning, we propose a deep RL technique based on Monte Carlo Tree Search (MCTS) [9]. When a model of the environment is available, algorithms like AlphaZero [10] have been used with great success. However, in the case of DSS, the LTE and NR schedulers are part of the environment, and these are not easily modelled. Inspired by the MuZero work [11], we use a learned model of the environment for planning in the time domain. When applied iteratively, the proposed solution predicts the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each state. This in turn enables the controller to predict a sequence of future states of the wireless network by simulating hypothetical communication resource assignments over time, starting from the current network state, and evaluating a reward function for each hypothetical assignment over the time window. As such, the communication resources in the current subframe are assigned based on the simulated hypothetical BW split action associated with the maximized reward over the time window. To the best of our knowledge, this is the first work that exploits the framework of deep RL for DSS between 4G and 5G systems.
Simulation results show that the proposed approach improves quality of service in terms of latency. Results also show that the proposed algorithm yields gains in different scenarios by accounting for several features while planning in the time domain, such as multimedia broadcast single frequency network (MBSFN) subframes and diverse user service requirements.

The rest of this paper is organized as follows. Section II presents the system model. Section III describes the proposed deep RL algorithm. In Section IV, simulation results are analyzed. Finally, conclusions are drawn in Section V.

II. SYSTEM MODEL
Consider the downlink of a wireless cellular system composed of a co-located cell operating over NR and LTE, serving a set J of J users. NR and LTE are assumed to operate in the 3.5 GHz frequency band and apply FDD as the duplexing method. We consider a 15 kHz NR numerology and assume that LTE and NR subframes are aligned in time and frequency. Each RAT s serves a set K_s ⊆ J of UEs. The total system bandwidth, B, is divided into a set C of C resource blocks (RBs). Each RAT s is allocated a set C_s ⊆ C of RBs, and each UE j ∈ K_s is allocated a set C_{j,s} ⊆ C_s of C_{j,s} RBs by its serving RAT s.

For an efficient spectrum sharing model of LTE and NR, one must design a mechanism for dividing the available bandwidth for data and control transmission for each of the RATs. For the control region, we consider the following:
• LTE PDCCH is restricted to the first two symbols.
• NR has no signals/channels in the first two symbols.
• NR PDCCH is limited to symbol 2, assuming that the UE only supports type-A scheduling (no mini-slots).
• In LTE subframes where no NR PDCCH is transmitted in the overlapped NR slots, LTE PDCCH could span 3 symbols.
For data transmission, a controller decides on the resource split, C_s, between NR and LTE every subframe.

A. Channel Model
We assume the 3GPP Urban Macro propagation model [12] with Rayleigh fading. The path loss between UE j at location a and its serving BS s, ξ_{j,s,a}, is given by Model 1 [13] for the 3.5 GHz frequency band and takes the form

\xi_{j,s,a} = c_1 + c_2 \log_{10}(d),    (1)

where d is the distance between the UE and the BS in meters, and the constants c_1 and c_2 are those specified by Model 1 in [13]. The signal-to-noise ratio (SNR), Γ_{j,s,c,a}, of the UE-BS link between UE j at location a served by RAT s over RB c is:

\Gamma_{j,s,c,a} = \frac{P_{j,s,c,a}\, h_{j,s,c,a}}{B_c N_0},    (2)

where P_{j,s,c,a} = P_{j,s,a}/C_{j,s} is the transmit power of BS/RAT s to UE j at location a over RB c, and P_{j,s,a} is the total transmit power of BS/RAT s to UE j at location a. Here, the total transmit power of RAT s is assumed to be distributed uniformly among all of its associated RBs. h_{j,s,c,a} = g_{j,s,c,a} \cdot 10^{-\xi_{j,s,a}/10} is the channel gain between UE j and BS/RAT s on RB c at location a, where g_{j,s,c,a} is the Rayleigh fading complex channel coefficient. N_0 is the noise power spectral density and B_c is the bandwidth of an RB c. Therefore, the achievable data rate of UE j at location a associated with RAT s can be defined as:

R_{j,s,a} = \sum_{c=1}^{C_{j,s}} B_c \log_2(1 + \Gamma_{j,s,c,a}).    (3)
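As an illustration, the following minimal numpy sketch evaluates (1)-(3) for a single UE. The path-loss constants PL_A and PL_B are placeholders standing in for the Model 1 values in [13], the received power uses |g|^2 for the Rayleigh coefficient, and all parameter values are illustrative rather than taken from the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical path-loss constants (placeholders, not the Model 1 values of [13]).
PL_A, PL_B = 28.0, 22.0

def path_loss_db(d_m: float) -> float:
    """Path loss xi (dB) as a linear function of log10(distance), eq. (1)."""
    return PL_A + PL_B * np.log10(d_m)

def snr(p_tot_w: float, n_rbs: int, d_m: float, rb_bw_hz: float, n0_w_per_hz: float):
    """Per-RB SNR, eq. (2): total power split uniformly over the UE's RBs."""
    p_rb = p_tot_w / n_rbs  # P_{j,s,c,a} = P_{j,s,a} / C_{j,s}
    g = (rng.standard_normal(n_rbs) + 1j * rng.standard_normal(n_rbs)) / np.sqrt(2)
    h = np.abs(g) ** 2 * 10 ** (-path_loss_db(d_m) / 10)  # channel gain per RB
    return p_rb * h / (rb_bw_hz * n0_w_per_hz)

def rate_bps(p_tot_w, n_rbs, d_m, rb_bw_hz=180e3,
             n0_w_per_hz=10 ** (-174 / 10) * 1e-3):  # thermal noise, -174 dBm/Hz
    """Achievable rate, eq. (3): Shannon capacity summed over allocated RBs."""
    return float(np.sum(rb_bw_hz * np.log2(1 + snr(p_tot_w, n_rbs, d_m,
                                                   rb_bw_hz, n0_w_per_hz))))

print(f"rate at 100 m over 10 RBs: {rate_bps(8.0, 10, 100.0) / 1e6:.1f} Mbit/s")
```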
B. Traffic Model

We assume a periodic traffic arrival per UE j with a fixed periodicity λ_j and a fixed packet size β_j. Time domain scheduling is typically governed by a scheduling weight, whereby a high weight corresponds to a high priority for scheduling that particular UE. We adopt a similar mechanism for measuring the quality of bandwidth splits between LTE and NR, where a UE not fulfilling its QoS is associated with a high weight. The weight for user j in subframe p can be calculated as:

w_{p,j}(t) = \begin{cases} \alpha_j t, & t < \delta_j, \\ \alpha_j t + \eta_j, & t \geq \delta_j, \end{cases}    (4)

where t is the time the oldest packet has been waiting in the buffer, δ_j and η_j correspond to the step delay and step weight of the delay weight function of user j, respectively, and α_j is a small positive factor that makes the weight non-zero when there is data in the buffer. Note that a UE with zero weight will not be scheduled. Here, the step delay corresponds to the maximum tolerable delay in order to maintain QoS. If a packet remains in the buffer for a time period longer than δ_j, the weight for user j increases by η_j.

Given this system model, we next develop an effective spectrum sharing scheme that can allocate the appropriate bandwidth to each RAT, at a subframe time scale, while accounting for future network states.
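A minimal sketch of the delay-weight function in (4) is given below; the parameter names mirror the text, and the α value used in the example is an illustrative assumption.

```python
# Sketch of the delay-weight function in eq. (4).
def delay_weight(t_ms: float, alpha: float, delta_ms: float, eta: float,
                 has_data: bool) -> float:
    """Scheduling weight w_{p,j}(t) when the oldest packet has waited t_ms ms."""
    if not has_data:
        return 0.0                # a UE with zero weight is not scheduled
    w = alpha * t_ms              # small, slowly growing term
    if t_ms >= delta_ms:
        w += eta                  # step penalty once the QoS deadline is missed
    return w

# Example: eta = 5 as in Section IV; alpha = 1e-3 is an assumed small value.
print(delay_weight(2.0, alpha=1e-3, delta_ms=3.0, eta=5.0, has_data=True))  # 0.002
print(delay_weight(4.0, alpha=1e-3, delta_ms=3.0, eta=5.0, has_data=True))  # 5.004
```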
III. DEEP REINFORCEMENT LEARNING FOR DYNAMIC SPECTRUM SHARING

In this section, we propose a proactive approach for DSS, enabling LTE and NR to operate on the same BW simultaneously. In this regard, we propose a deep RL framework that enables the controller to learn the BW split between LTE and NR during subframe t while accounting for future network states over a time window T. To realize this, we first propose the adopted RL algorithm for training the controller to learn the optimal policy for the BW split. Then, we introduce the RL architecture and components for the DSS problem.

A. Deep RL Algorithm
To enable a proactive BW split between LTE and NR, we adopt in this paper the MuZero algorithm [11]. One of the main challenges of the proposed solution technique is that it requires a model of the individual schedulers for LTE and NR, which is hard to devise. Instead, we propose in this paper to learn the scheduling dynamics via a model-based reinforcement learning algorithm that addresses this issue by simultaneously learning a model of the environment's dynamics and planning with respect to the learned model [11]. This approach is more data efficient than model-free methods, whose current state-of-the-art algorithms may require millions of samples before any near-optimal policy is learned.

During the training phase of the proposed algorithm, the prediction comprises performing an MCTS over the action space and over the time window T to find the sequence of actions that maximizes the reward function. MCTS iteratively explores the action space, gradually biasing the exploration towards regions of states and actions where an optimal policy might exist. To enable our model to learn the best explored sequence of actions for each network state, we define three neural networks: the representation function (h), the dynamics function (g), and the prediction function (f). The motivation for incorporating each of these neural networks in the proposed algorithm is as follows:
• A representation function (h) encodes the observation in subframe p into an initial hidden state (s_0).
• A dynamics function (g) computes a new hidden state (s_{i+1}) and reward (r_{i+1}) given the current state (s_i) and an action (a_{i+1}).
• A prediction function (f) outputs a policy (p_i) and a value (v_i) from a hidden state (s_i).
During the training phase, the model predicts the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each state. The proposed training algorithm is summarized in Algorithm 1 and its main steps are as follows:
• Step 1: The model receives the observation of the network state as an input and transforms it into a hidden state (s_0).
• Step 2: The prediction function (f) is then used to predict the value v_i and the policy vector p_i for the current hidden state s_i.
• Step 3: The hidden state is then updated iteratively to a next hidden state s_{i+1} by a recurrent process consisting of N_unroll steps, using the dynamics function (g), with an input representing the previous hidden state s_i and a hypothetical next action a_{i+1}, i.e., a communication resource assignment selected from the action space comprising the allowable bandwidth splits between LTE and NR.
• Step 4: Having defined a policy target, reward, and value, the representation function (h), the dynamics function (g), and the prediction function (f) are trained jointly, end-to-end, by backpropagation-through-time (BPTT).
Meanwhile, the testing algorithm refers to the actual execution of the algorithm after the weights of (h), (g), and (f) have been optimized, and is implemented for execution during run time. Given that DSS is performed on a 1 ms basis, it is too demanding to run MCTS online. As such, we use only the representation (h) and prediction (f) functions at test time. The main steps performed by the controller at test time are summarized in Algorithm 2.
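To make the three functions concrete, the following is a minimal PyTorch sketch of (h), (g), and (f) together with one hypothetical unroll step. The hidden-state width (10, tanh) and the 64-unit ReLU layers follow the architecture described in Section IV (Figure 2); the observation size and action count are illustrative placeholders.

```python
import torch
import torch.nn as nn

OBS_DIM, HIDDEN, STATE, N_ACTIONS = 32, 64, 10, 3  # OBS_DIM, N_ACTIONS: illustrative

class Representation(nn.Module):          # h: observation -> initial hidden state s_0
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(),
                                 nn.Linear(HIDDEN, STATE), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):                # g: (s_i, a_{i+1}) -> (s_{i+1}, r_{i+1})
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE + N_ACTIONS, HIDDEN), nn.ReLU())
        self.state_head = nn.Sequential(nn.Linear(HIDDEN, STATE), nn.Tanh())
        self.reward_head = nn.Linear(HIDDEN, 1)          # scalar, linear activation
    def forward(self, s, a_onehot):
        x = self.body(torch.cat([s, a_onehot], dim=-1))
        return self.state_head(x), self.reward_head(x)

class Prediction(nn.Module):              # f: s_i -> (policy logits p_i, value v_i)
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE, HIDDEN), nn.ReLU())
        self.policy_head = nn.Linear(HIDDEN, N_ACTIONS)  # softmax over actions
        self.value_head = nn.Linear(HIDDEN, 1)           # scalar, linear activation
    def forward(self, s):
        x = self.body(s)
        return self.policy_head(x), self.value_head(x)

# One hypothetical unroll step (Steps 1-3):
h, g, f = Representation(), Dynamics(), Prediction()
s0 = h(torch.randn(1, OBS_DIM))                          # Step 1: encode observation
logits, v0 = f(s0)                                       # Step 2: policy and value
a = nn.functional.one_hot(torch.tensor([0]), N_ACTIONS).float()
s1, r1 = g(s0, a)                                        # Step 3: next state, reward
```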
Algorithm 1 Training phase
Input: Representation (h), dynamics (g), and prediction (f) functions.
for i = 0 ... N_iter − 1 do
  Data Generation
  Step 1: Sample N_episode environments (E) with random parameters.
  for env ∈ E do
    for p = 0 ... N_timestep − 1 do
      Step 2: Encode the observation o_p into an initial hidden state, s_0.
      Step 3: Run N_mcts MCTS simulations from this state using (g) and (f).
      Step 4: Sample an action to take in the environment.
      Step 5: Store the policy (π) and reward (u) in the replay buffer (R).
    end for
  end for
  Neural Network Training
  for p = 0 ... N_step − 1 do
    Step 6: Sample a batch of sequences from the replay buffer R.
    Step 7: Compute the total discounted reward (z) over each sequence.
    Step 8: Take a training step using BPTT to make p ≈ π, v ≈ z, r ≈ u.
  end for
end for
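Step 7 of Algorithm 1 forms the value target z for each position in a sampled sequence. Below is a minimal sketch, assuming an n-step discounted return bootstrapped with the predicted value as in MuZero [11], with γ = 0.99 and N_td = 16 as in Table II.

```python
from typing import List

def n_step_return(rewards: List[float], bootstrap_value: float,
                  gamma: float = 0.99, n: int = 16) -> float:
    """z_t = sum_{k<n} gamma^k * u_{t+k} + gamma^n * v_{t+n}."""
    z = sum(gamma ** k * u for k, u in enumerate(rewards[:n]))
    if len(rewards) >= n:          # bootstrap only if the sequence is long enough
        z += gamma ** n * bootstrap_value
    return z

print(n_step_return([1.0, 0.5, 0.25], bootstrap_value=0.8, n=3))  # ~2.52
```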
Algorithm 2 Execution phase
Input: Representation (h) and prediction (f) functions.
for n = 0 ... N_timestep − 1 do
  Step 1: Encode the observation into an initial hidden state, s_0.
  Step 2: Calculate the action probabilities using (f) and select the best action.
  Step 3: Find the BW split to use and send that to the schedulers.
end for

B. Deep RL Components

In this subsection, we define the RL framework components, namely the observations, actions, and rewards.
• Action space: the BW split between LTE and NR for DL transmission in subframe p, denoted as a_p ∈ {a_1, a_2, ..., a_N}, where N is the size of the action space. Here, an action corresponds to a horizontal line splitting the BW, with one side assigned to LTE and the other side to NR. The possible BW splits are chosen by grouping sets of multiple RBs, thereby resulting in a quantized action set. This in turn reduces the action space size and is valid because the gain between bandwidth splits from consecutive RBs is negligible.
• Observation: the observation for subframe p, denoted as o_p, is divided into two parts, where the first part, (o_{p,1}), consists of components of size (J x 1), whereas the second part, (o_{p,2}), consists of components of size (J x T), where T is the time window consisting of a set of future subframes. The different observation components are summarized as follows:
  – NR support: a vector with J x 1 elements that indicates whether user j is an NR user or not.
  – Buffer state: a vector with J x 1 elements containing the number of bits in the buffer of user j.
  – MBSFN subframe: a matrix with J x T elements that indicates, for each subframe p, p ∈ T, whether a UE is configured with MBSFN or not. By configuring LTE UEs with MBSFN subframes, some broadcast signalling can be avoided at the cost of decreased scheduling flexibility.
  – Predicted number of bits per PRB and TTI for each UE j: a matrix with J x T elements, where each element contains the estimated average number of bits that can be transmitted for user j in subframe p, taking into account the estimated channel quality of user j during subframe p, p ∈ T.
  – Predicted packet arrivals: a matrix with J x T elements indicating the number of bits that will arrive in the buffer for each user j over a set of future subframes T.
• Reward function: the reward function is modelled as the exponential of the negated sum of the per-user delay weights, each driven by the user's most delayed packet, and can be expressed as:

r_p = e^{-\sum_{j=1}^{J} w_{p,j}},    (5)

where w_{p,j} is the delay weight function of user j in subframe p, as described in (4). The intuition behind this reward function is that a high total weight is penalized with a low reward in subframe p. Meanwhile, if the controller manages to keep the user buffers empty, the reward per subframe will be one. If a highly prioritized UE is queued for several subframes, its weight will increase and thus the reward will approach zero.
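A minimal sketch of the reward in (5), reusing the delay weights of (4):

```python
import math
from typing import List

def subframe_reward(weights: List[float]) -> float:
    """r_p = exp(-sum_j w_{p,j}); equals 1.0 when all buffers are empty."""
    return math.exp(-sum(weights))

print(subframe_reward([0.0, 0.0]))      # 1.0   (all buffers empty)
print(subframe_reward([0.002, 5.004]))  # ~0.007 (one user past its deadline)
```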
Fig. 1: A schematic illustration of the proposed setup summarizing the connection between the controller, network state, and LTE and NR schedulers.

Figure 1 summarizes the relationship between the network state, the controller, and the LTE and NR schedulers. At each subframe, the LTE scheduler, the NR scheduler, and the controller receive the network state information. This information is then used by the controller to generate observations and take an action for the BW split between LTE and NR. This action is then conveyed to the LTE and NR schedulers. Given the network state information and the corresponding BW split, each of the schedulers allocates its respective users to the corresponding BW portion for the current subframe. Finally, the weights of the users are fed back to the controller and used as an input for the calculation of the reward. Next, we provide simulation results and analysis for the proposed RL framework.

Table I: Simulation parameters for the radio environment.
Parameter | Value
Frequency | 3.5 GHz
Bandwidth | 25 PRBs (5 MHz)
Traffic model | Periodic
UE speed | 3 m/s
Transmit power | 0.8 W/PRB
Noise power (N_0) | −112.5 dBm/PRB
Antenna config | 1 Tx, 2 Rx

IV. SIMULATION RESULTS AND ANALYSIS
In this section, we provide simulation results and analysis for the performance of the proposed algorithm under four different scenarios where planning in the time domain for dynamic spectrum sharing is relevant. Tables I and II provide a summary of the main simulation parameters.

The structure of the representation, dynamics, and prediction neural networks is depicted in Figure 2. All dense layers except for the output layers use 64 activations with ReLU activation. The representation outputs (s) use 10 activations with tanh activation. The reward (r) and value (v) outputs are scalar with linear activation, and the policy (p) has the same number of activations as the number of actions, with softmax activation.

Fig. 2: Neural network architectures for a) the representation function, b) the dynamics function, and c) the prediction function.

Table II: Simulation parameters for the RL framework.
Parameter | Value
Number of MCTS simulations (N_mcts) | 64
Episode length (N_timestep) | 16 subframes
Discount factor (γ) | 0.99
Window size (T) | 10 subframes
Batch size | 32 examples
Number of unroll steps (N_unroll) | 3
Number of TD steps (N_td) | 16
Optimizer | Adam
Learning rate | –
Number of episodes per iteration (N_episode) | 100
Representation size | 10
Algorithm 3 Baseline Algorithm
Input: Observation, o_t.
Output: Action, a_t.
Step 1: Calculate the weight of each user according to Eq. (4).
Step 2: Sort the users in order of decreasing weight.
Step 3: Schedule users from the list until the spectrum is full.
Step 4: Check how many PRBs are needed for the NR and LTE users.
Step 5: Select the action, a_t, that splits the BW proportionally between the RATs.

Next, we provide a detailed description of the simulation results and analysis for each of the four studied scenarios. Note that in all of the scenarios, the episode length is 16 and thus the evaluation score for a perfectly solved scenario is also 16. Moreover, we assume that LTE users (if any) are scheduled on the lower part of the spectrum band and NR users (if any) are scheduled on the higher part of the band. As for the baseline, we split the available spectrum proportionally to the number of required RBs between LTE and NR users, as summarized in Algorithm 3. We also compare the performance of the proposed algorithm to an equal BW split and to alternating the BW between LTE and NR. The user weight is calculated using Eq. (4), with η_j = 5 and a small positive α_j for all users. The step delay, δ_j, is set appropriately for the different users in the different scenarios, as specified below.
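A minimal sketch of the baseline in Algorithm 3 is given below; the user representation and function name are illustrative, not taken from the paper's simulator.

```python
from typing import List, Tuple

# Greedy weight-based scheduling followed by a proportional band split.
# Users are (weight, prbs_needed, is_nr) tuples.
def baseline_split(users: List[Tuple[float, int, bool]], total_prbs: int) -> int:
    """Return the number of PRBs given to LTE (the rest go to NR)."""
    scheduled, free = [], total_prbs
    for w, prbs, is_nr in sorted(users, key=lambda u: -u[0]):  # Steps 1-2
        if w > 0 and prbs <= free:                             # Step 3
            scheduled.append((prbs, is_nr))
            free -= prbs
    lte = sum(p for p, is_nr in scheduled if not is_nr)        # Step 4
    nr = sum(p for p, is_nr in scheduled if is_nr)
    if lte + nr == 0:
        return total_prbs // 2                                 # no demand: even split
    return round(total_prbs * lte / (lte + nr))                # Step 5

# Example: an LTE user needing 10 PRBs and an NR user needing 15, out of 25 PRBs.
print(baseline_split([(5.0, 10, False), (0.5, 15, True)], 25))  # 10
```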
A. Scenario 1: MBSFN subframes

LTE requires cell-specific reference signals (CRSs) to enable demodulation of data. Therefore, if only NR UEs are scheduled, the CRSs are not needed and hence constitute an overhead. If there is a lot of NR traffic to be scheduled, LTE can be configured with so-called MBSFN subframes. In these subframes, no CRSs are transmitted; it is therefore not possible to schedule LTE users, but this can result in improved efficiency for NR users. This scenario aims to investigate whether the controller can learn to account for MBSFN subframes during planning, thus enabling time-critical LTE traffic to be served before MBSFN subframes.
1) Scenario description:
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 4 ms and a step delay δ = 3 ms. The packet size is 45000 bits and 15000 bits for the NR and LTE users, respectively. The system is configured with a repeating MBSFN pattern with a periodicity of 4 subframes, where the first two subframes are non-MBSFN (i.e., both LTE and NR UEs can be scheduled) and the last two subframes in the pattern are MBSFN subframes (i.e., only NR UEs can be scheduled).
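For concreteness, a hypothetical configuration object for this scenario could look as follows; the field names are illustrative and not from the paper's simulator.

```python
# Hypothetical Scenario 1 configuration (field names are illustrative).
scenario_1 = {
    "users": [
        {"rat": "NR",  "period_ms": 4, "packet_bits": 45000, "delta_ms": 3},
        {"rat": "LTE", "period_ms": 4, "packet_bits": 15000, "delta_ms": 3},
    ],
    # Repeating 4-subframe pattern: False = non-MBSFN (LTE and NR schedulable),
    # True = MBSFN (NR only).
    "mbsfn_pattern": [False, False, True, True],
    "episode_length": 16,  # subframes, as in Table II
}
```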
2) Optimal bandwidth split:
To solve this scenario optimally, both packets must be served within 3 ms. As such, the LTE user should be served in the non-MBSFN subframes to make resources available for the NR user later in the cycle. Therefore, the optimal strategy is to start scheduling LTE such that its buffer is emptied before the MBSFN subframes.
3) Results and analysis:
From Figure 3, we can see that the proposed algorithm converges to the optimal strategy in 12 iterations. Also, note that the performance of the proposed scheme exceeds that of an equal bandwidth split between LTE and NR, as well as that of the case where MBSFN subframes are allocated to the NR user and non-MBSFN subframes are allocated to the LTE user. With this MBSFN configuration, the amount of overhead due to, e.g., LTE CRSs can be minimized, which results in improved efficiency at the network level. The controller can learn to account for the MBSFN subframes by scheduling in a way that maximizes the quality of service despite the reduced scheduling flexibility caused by the MBSFN subframes.

Fig. 3: Evaluation score as a function of the number of iterations for scenario 1 with MBSFN subframes.
B. Scenario 2: Periodic high interference
In this scenario, we investigate the controller's ability to learn to account for future high interference on one of the users during planning. Periodic high interference can, for instance, occur when a user is at the cell edge and is interfered with by another base station, or in the case of unsynchronized time division duplexing scenarios.
1) Scenario description:
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 2 ms. We assume a larger packet size for the NR user compared to that of the LTE user, so that we can observe the gain of NR benefiting from the 2 extra symbols of LTE PDCCH if it is allocated the full bandwidth. Users have a small weight value when the delay is less than 2 ms, but the weight then increases abruptly to 2 after 2 ms (i.e., δ = 2 ms). Moreover, periodic high interference is observed on the LTE user every 3 subframes. Here, the periodic interference term is added artificially for analysis purposes.
2) Optimal bandwidth split:
The optimal strategy for this scenario is to allocate the full bandwidth to NR during subframes with high interference on the LTE user.
3) Results and analysis:
From Figure 4, we can see that the proposed algorithm converges to the optimal strategy in 18 and 28 iterations for action spaces of size 2 and 3, respectively. The proposed approach outperforms the baseline algorithm, the equal bandwidth split, and the alternating bandwidth split: the controller learns to allocate the full bandwidth to NR during subframes with high interference on the LTE user, as opposed to taking actions based on buffer status only. This allows the controller to split the bandwidth between LTE and NR such that the impact of the interference from neighboring cells is reduced, resulting in improved system-level performance.

Fig. 4: Evaluation score as a function of the number of iterations for scenario 2 with periodic high interference.
C. Scenario 3: Mixed services
In this scenario, we investigate the controller's ability to handle users with different delay requirements.
1) Scenario description:
We consider two users, one high-priority NR user (δ = 5 ms) with 90000 bits, and one low-priority LTE user (δ = 10 ms) with 90000 bits. Data arrives in subframe 1 for both users.
2) Optimal bandwidth split:
The optimal strategy for this scenario is to postpone the scheduling of the low-priority LTE user in order to allow the high-priority NR user to be scheduled. When the buffer of the high-priority user is emptied, the controller can schedule the LTE user.
3) Results and analysis:
From Figure 5, we can see that the proposed approach converges to the optimal policy within 5 iterations. The controller learns to prioritize the NR user with a tight delay requirement over the LTE user, thus outperforming the baseline algorithm as well as the equal BW split.

Fig. 5: Evaluation score as a function of the number of iterations for scenario 3 with mixed services.
D. Scenario 4: Time multiplexing
In this scenario, we investigate the controller's ability to learn to perform time multiplexing (as opposed to frequency multiplexing) between LTE and NR. Time multiplexing can yield two extra symbols for NR when no LTE is scheduled, because no LTE PDCCH needs to be transmitted. This in turn results in increased efficiency when the RATs are scheduled in a time-multiplexed fashion.
1) Scenario description:
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 2 ms. The packet size for the NR user is larger (14000 bits) compared to that of the LTE user (10000 bits). Users have a small weight when the delay is less than 2 ms, but the weight then increases abruptly to 5 after 2 ms (i.e., δ = 2 ms).
2) Optimal bandwidth split:
The optimal strategy for this scenario is to perform time multiplexing, whereby the full bandwidth is allocated to a particular RAT every other subframe. As such, NR can benefit from the 2 extra symbols of LTE PDCCH when it is given the full bandwidth. This results in a larger transport block size, and thus the large NR packet can be served within one subframe.
3) Results and analysis:
From Figure 6, we can see that the proposed approach converges to the optimal action strategy within 14 and 15 iterations for the cases of 3 and 4 actions, respectively. The proposed approach outperforms the baseline algorithm, the equal BW split, and the alternating BW split: the network learns to perform time multiplexing between LTE and NR, resulting in increased spectrum efficiency. For the studied scenario, when NR is scheduled alone, i.e., without overhead from LTE PDCCH, the maximum transport block size is 14112 bits. On the other hand, when LTE is scheduled together with NR, there is extra overhead for LTE PDCCH and thus the maximum transport block size decreases to 12576 bits. Consequently, the NR packet of 14000 bits can be scheduled in one subframe only if NR is scheduled alone during that subframe.

Fig. 6: Evaluation score as a function of the number of iterations for scenario 4 considering a time multiplexing scenario.

V. CONCLUSION
In this paper, we have proposed a novel AI planning framework for dynamic spectrum sharing of LTE and NR. Results have shown that the controller can split the bandwidth between LTE and NR in an intelligent way while accounting for future network states, such as MBSFN subframes and high interference levels, thus resulting in improved system-level performance. This gain comes from the fact that the proposed algorithm uses knowledge (or beliefs) about future network states to make decisions that perform well on a longer timescale, rather than being greedy in the current subframe. As part of future work, we aim to further investigate whether the suggested algorithm can learn to account for uncertainties in the observations.
REFERENCES
[1] Ericsson, "Sharing for the best performance - stay ahead of the game with Ericsson spectrum sharing," Ericsson white paper, 2019.
[2] E. Dahlman, S. Parkvall, and J. Skold, 4G: LTE/LTE-Advanced for Mobile Broadband, 1st ed. USA: Academic Press, Inc., 2011.
[3] E. Dahlman, S. Parkvall, and J. Skold, 5G NR: The Next Generation Wireless Access Technology, 1st ed. USA: Academic Press, Inc., 2018.
[4] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4674-4689, July 2018.
[5] 3GPP TR 38.889, "Study on NR-based access to unlicensed spectrum."
[6] U. Challita, W. Saad, and C. Bettstetter, "Interference management for cellular-connected UAVs: A deep reinforcement learning approach," IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2125-2140, March 2019.
[7] A. Khawar, A. Abdelhadi, and C. Clancy, Spectrum Sharing Between Radars and Communication Systems. Springer, 2018.
[8] S. Kinney, "Dynamic spectrum sharing vs. static spectrum sharing," RCR Wireless, March 2020.
[9] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1-43, 2012.
[10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," 2017.
[11] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, "Mastering Atari, Go, chess and shogi by planning with a learned model," arXiv:1911.08265, Nov. 2019.
[12] 3GPP TR 38.901, "Study on channel model for frequencies from 0.5 GHz to 100 GHz," V15.0.0.
IEEE Transactions onComputational Intelligence and AI in Games , vol. 4, no. 1, pp. 1–43,2012.[10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez,M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan,and D. Hassabis, “Mastering chess and shogi by self-play with a generalreinforcement learning algorithm,” 2017.[11] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre,S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap,and D. Silver, “Mastering atari, go, chess and shogi by planning with alearned model,” arXiv:1911.08265 , Nov. 2019.[12] 3GPP, “Study on channel model for frequencies from 0.5 GHz to 100GHz,”3GPP TR 38.901, V15.0.0