Risk-Sensitive Reinforcement Learning for URLLC Traffic in Wireless Networks
Nesrine Ben-Khalifa⋄, Mohamad Assaad⋄, Mérouane Debbah∗
⋄TCL Chair on 5G, Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, France
∗Huawei Technologies, Boulogne-Billancourt, France
Abstract—In this paper, we study the problem of dynamic channel allocation for URLLC traffic in a multi-user multi-channel wireless network where urgent packets have to be successfully transmitted in a timely manner. We formulate the problem as a finite-horizon Markov Decision Process with a stochastic constraint related to the QoS requirement, defined as the packet loss rate for each user. We propose a novel weighted formulation that takes into account both the total expected reward (the number of successfully transmitted packets) and the risk, which we define as the violation of the QoS requirement. First, we use the value iteration algorithm to find the optimal policy, which assumes perfect knowledge of all the model parameters, namely the channel statistics. We then propose a Q-learning algorithm in which the controller learns the optimal policy without knowledge of either the CSI or the channel statistics. We illustrate the performance of our algorithms with numerical studies.
Index Terms—URLLC, risk sensitivity, resource allocation, constrained MDP, reinforcement learning
I. INTRODUCTION
In fifth generation (5G) wireless networks, there are new service categories with heterogeneous and challenging requirements, among them Ultra Reliable Low Latency (URLLC) traffic [6], designed for delay- and reliability-sensitive applications like real-time remote control, autonomous driving, and mission-critical traffic. For URLLC traffic, the End-to-End (E2E) latency defined by 3GPP must be lower than 1 ms, along with a packet error rate requirement on the order of $10^{-5}$ to $10^{-9}$ [6], [15]. A plausible solution to the latency requirement is to transmit without Channel State Information (CSI) knowledge at the transmitter side. To increase reliability, exploiting frequency diversity is beneficial: the same packet is transmitted in parallel over different subcarriers of an Orthogonal Frequency Division Multiplexing (OFDM) system, where each subcarrier experiences different channel characteristics.

However, this solution is costly in terms of system capacity. Therefore, the number of parallel transmissions should not be fixed in advance but should rather be variable and depend on several parameters, such as the position of a user in the cell or the statistics of his packet losses over the previous time slots. For example, a user who experienced a high number of packet losses in the previous time slots should be allocated a high number of subchannels to increase his success probability, whereas a user with a low number of dropped packets may be assigned a low number of subcarriers. Hence, it is crucial to design efficient dynamic schemes able to adapt the number of parallel transmissions for each user to his experienced QoS.

In this work, we study the problem of dynamic channel allocation for URLLC traffic in a multi-user multi-channel wireless network under QoS constraints. A channel here refers to a frequency band or a subcarrier in an OFDM system, and the QoS is related to the packet loss rate for each user, defined as the average number of dropped packets. Besides, we introduce a notion of risk related to the violation of the QoS requirements; more precisely, a risk occurs, or equivalently, a risk state is reached, when the QoS requirement is violated for a user. Furthermore, we consider that the transmitter has neither the CSI nor the channel statistics at the transmission moment. In fact, due to the urgency of URLLC packets mentioned previously, there is not enough time for the BS to perform channel estimation and probing as in conventional wireless communications.

A. Related Work
The issue of deadline-constrained traffic scheduling has been investigated in several works, including [8]–[10], [17]. For example, in [8], the authors study the problem of dynamic channel allocation in a single-user multi-channel system with service costs and deadline-constrained traffic. They propose online algorithms, based on Thompson sampling for multi-armed bandit problems, that enable the controller to learn the optimal policy. The MDP framework and reinforcement learning approaches for downlink packet scheduling are considered in [1]–[3], [9], [10], [14]. In [10], the authors propose an MDP for the deadline-constrained packet scheduling problem and use dynamic programming to find the optimal scheduling policies. The authors do not consider QoS constraints in the scheduling problem.

Most risk-sensitive approaches consist in analyzing higher-order statistics than the average metric, such as the variance of the reward [5], [6], [13], [16]. For instance, risk-sensitive reinforcement learning is studied in [20] for millimeter-wave communications to optimize both the bandwidth and the transmit power. The authors consider a utility (data rate) that incorporates both the average and the variance in order to capture the tail distribution of the rate, useful for the reliability requirement of URLLC traffic. The authors do not exploit frequency diversity.

In this work, we consider an alternative approach to risk, which consists in minimizing the risk-state visitation probability. In fact, due to the stochastic nature of the problem (time-varying channels and random arrival traffic in our context), giving a low reward to an undesirable or risk state may be insufficient to minimize the probability of visiting such a state [12]. Therefore, in addition to maximizing the total expected reward, we propose to consider a second criterion, which consists in minimizing the probability of visiting risk states, where a risk state here is related to the violation of the QoS requirements.
B. Addressed Issues and Contribution
In this work, we address the following issues:

• We formulate the dynamic channel allocation problem for URLLC traffic as a finite-horizon MDP wherein the state represents the QoS of the users, that is, the average number of dropped packets or packet loss rate of the users. The decision variable is the number of channels to assign to each user. We define a risk state as any state where the QoS requirement is violated for at least one user. Besides, we define a stochastic constraint related to the risk-state visitation probability.

• Assuming the channel statistics are known to the controller, we use the finite-horizon value iteration algorithm to find the optimal policy for the weighted formulation of the problem, which takes into account both the total expected reward over the planning horizon and the risk criterion (the QoS requirement violation probability).

• When the channel statistics are unknown to the controller, we propose a reinforcement learning algorithm (Q-learning) for the weighted formulation of the problem, which enables the controller to learn the optimal policy. We illustrate the performance of our algorithms with numerical studies.
C. Paper Structure
In Section II, we present the system model for the multi-user multi-channel wireless network with URLLC packets and time-varying channels, along with the QoS definition. In Section III, we introduce the constrained MDP formulation with all its components. In Section IV, we present both the finite-horizon value iteration algorithm and the reinforcement learning algorithm. Section V is devoted to numerical results. Finally, we conclude the paper in Section VI.

II. SYSTEM MODEL
We consider a multi-user multi-channel wireless network where URLLC packets have to be transmitted over time-varying and fading channels. Due to the strict latency requirement of URLLC packets in 5G networks mentioned previously, there is not enough time for the BS to estimate the channel, and the packets are then immediately transmitted in the absence of CSI at the transmitter side. When a packet is successfully decoded, the receiver sends an acknowledgment feedback, which is assumed to be instantaneous and error-free. We consider a centralized controller which dynamically distributes the channels to the users based on their QoS (see Fig. 1).

[Fig. 1: Dynamic allocation of channels $(\ell_1, \ldots, \ell_K)$ to the users based on their QoS $(\rho_1, \ldots, \rho_K)$.]

Furthermore, we make the following assumptions:

Packet arrival process: the packet arrival process is an independent and identically distributed (i.i.d.) random process over the finite set $\mathcal{I} = \{0, 1, \ldots, A_{\max}\}$, where $A_{\max}$ is a positive constant, and is identical for all the users. Let $\alpha_a$ denote the probability that $a \in \mathcal{I}$ packets arrive for a given user at the beginning of a time slot.

Deadline-constrained traffic: given the strict URLLC latency requirement specified by 3GPP (lower than 1 ms), each packet has a lifetime of one time slot and can either be served or dropped; if there are available channels, the packet is transmitted, otherwise it is dropped, because after one time slot it becomes outdated and useless. Furthermore, one packet is transmitted per channel.
Channel model: we consider i.i.d. Bernoulli channels with mean $\mu \in [0, 1]$. In millimeter-wave communications, the links are characterized by their intermittence and high sensitivity, and this channel model reflects the existence of a line-of-sight (LOS) channel state [4], [8]. To increase reliability, a user can be assigned more channels than his number of waiting packets (depending on his experienced QoS). Some packets are then simultaneously sent over multiple parallel channels.

Channel split: for each user, all the packets are equally important: when the number of available channels is larger than that of waiting packets, we assume that some packets are picked uniformly at random to be replicated. A packet is obviously more likely to be successfully transmitted when sent over many channels simultaneously. However, assigning more channels to a user will affect the QoS experienced by the other users. Note that the channel split across the packets (which occurs in the same manner for all the users) should not be confused with the channel split across the users (which takes into account the QoS perceived by the users).

For user $k$, the distribution of the available channels $\ell_k$ over the waiting packets $a_k$ occurs as follows: each packet is transmitted over $(\ell_k \wedge a_k)$ channels and may furthermore be replicated once with probability $\frac{\ell_k \vee a_k}{a_k}$, where $\ell_k \wedge a_k$ denotes the largest integer $m$ such that $m a_k \leq \ell_k$, and $\ell_k \vee a_k$ denotes the remainder of the division of $\ell_k$ by $a_k$. The probability that a packet is successfully transmitted, given that there are $a_k$ waiting packets at the transmitter and $\ell_k$ assigned channels, can then be expressed as
$$\nu_k(a_k, \ell_k) = \left(1 - \frac{\ell_k \vee a_k}{a_k}\right)\left(1 - (1-\mu)^{\ell_k \wedge a_k}\right) + \left(\frac{\ell_k \vee a_k}{a_k}\right)\left(1 - (1-\mu)^{(\ell_k \wedge a_k)+1}\right). \quad (1)$$
The expected number of successfully transmitted packets for user $k$ is then given by
$$\mathbb{E}[N_k(\ell_k)] = \sum_{a_k \in \mathcal{I}} a_k \, \alpha_{a_k} \, \nu_k(a_k, \ell_k). \quad (2)$$

QoS criterion: for each user $k$, we define the packet loss rate at time slot $t$, $\rho_k(t)$, as
$$\rho_k(t) = \frac{1}{t} \sum_{i=0}^{t-1} \frac{n_k(i)}{a_k(i)}, \quad t > 0, \quad (3)$$
where $n_k(t)$ denotes the number of lost packets for user $k$ at time slot $t$. Note that $\rho_k \in [0, 1]$ since $n_k(t) \leq a_k(t)$. A packet is lost when either of the two following events occurs: (i) it is not transmitted because of insufficient available channels, or (ii) it is transmitted but the ACK feedback is not received. The parameter $\rho_k$ reflects the QoS perceived by user $k$: higher values of $\rho_k$ mean a higher number of lost packets and poor QoS, whereas lower values of $\rho_k$ mean good QoS. To ensure good QoS for the users, the resource allocation scheme should take account of their experienced QoS and keep this parameter within an acceptable range for all users.

Finally, the decision variable is the number of channels assigned to each user $k$ at each time slot, denoted by $\ell_k$, which satisfies
$$\sum_{k=1}^{K} \ell_k(t) = L, \quad (4)$$
where $L$ denotes the total number of available channels.
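To make the channel-split rule and Eqs. (1)–(2) concrete, here is a minimal Python sketch (our own illustration, not part of the paper); the function names `success_prob` and `expected_throughput` are ours, and the handling of the corner case $a_k = 0$ is our assumption, as Eq. (1) leaves it implicit:

```python
def success_prob(a_k: int, l_k: int, mu: float) -> float:
    """Per-packet success probability nu_k(a_k, l_k) of Eq. (1)."""
    if a_k == 0:
        return 1.0           # assumption: nothing can be lost in an empty slot
    m = l_k // a_k           # l_k ∧ a_k: channels guaranteed to each packet
    rem = l_k % a_k          # l_k ∨ a_k: leftover channels
    p_extra = rem / a_k      # probability a packet gets one extra replication
    return ((1 - p_extra) * (1 - (1 - mu) ** m)
            + p_extra * (1 - (1 - mu) ** (m + 1)))

def expected_throughput(l_k: int, mu: float, alpha) -> float:
    """Expected number of successfully transmitted packets, Eq. (2).
    alpha[a] is the arrival probability of a packets, a = 0..A_max."""
    return sum(a * alpha[a] * success_prob(a, l_k, mu)
               for a in range(len(alpha)))
```

Note that no special case is needed for $\ell_k < a_k$: then $m = 0$ and the formula reduces to $\frac{\ell_k}{a_k}\mu$, i.e., each packet gets a channel with probability $\ell_k / a_k$.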
III. CONSTRAINED MDP FRAMEWORK
The stochastic nature of the wireless channel leads us to consider an MDP framework for the decision problem. In this section, we first introduce the constrained MDP formulation along with its components. We then derive the optimality equations.
A. Model Formulation
We define the following finite-horizon MDP:

• State Space: the finite set $\mathcal{T} \times \mathcal{S}$, where $\mathcal{T} = \{0, \ldots, T\}$, $\mathcal{S} = \{\rho_1 \times \cdots \times \rho_K\}$, $\rho_k$ for $k = 1, \ldots, K$ is defined in (3), and the symbol $\times$ stands for the Cartesian product.

• Action Space: the finite set $\mathcal{L} = \{(\ell_1, \ldots, \ell_K) \text{ satisfying } (4)\}$, where $\ell_k$ denotes the number of channels assigned to user $k$.

• Reward: we define the reward $r$ at time slot $t$, when the controller chooses action $\ell \in \mathcal{L}$ in state $s_t$, as the expected total number of successfully transmitted packets over all the users, that is,
$$r(s_t, \ell) = \mathbb{E}\left[\sum_{k=1}^{K} N_k(\ell_k)\right]. \quad (5)$$
Note that the reward depends only on the number of channels allocated to each user (the action), and not on the current state $s_t$. Besides, the reward is a non-linear function of the action.

• Transition Probabilities: first, we define the probability that $n$ packets are lost for user $k$, as a function of the number of waiting packets $a_k$ and the number of assigned channels $\ell_k$ at a given time slot, as
$$\sigma_k(n, a_k, \ell_k) = \binom{a_k}{n} \left(1 - \nu_k(a_k, \ell_k)\right)^n \nu_k(a_k, \ell_k)^{a_k - n},$$
where $n \leq a_k$ and $\binom{a_k}{n}$ denotes the binomial coefficient. The state transition probability for user $k$ is given by
$$p(\rho'_k \mid \rho_k(t), \ell_k) = \alpha_{a_k} \, \sigma_k(n, a_k, \ell_k), \quad (6)$$
where
$$\rho'_k = \frac{t}{t+1} \rho_k + \frac{1}{t+1} \frac{n}{a_k}. \quad (7)$$
Finally, let $s_{t+1} = \rho'_1 \times \cdots \times \rho'_K$ and $s_t = \rho_1 \times \cdots \times \rho_K$; the transition probability from state $s_t$ to state $s_{t+1}$ when action $\ell$ is taken is then given by
$$p(s_{t+1} \mid s_t, \ell) = \prod_{k=1}^{K} p(\rho'_k \mid \rho_k(t), \ell_k). \quad (8)$$
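As an illustration of Eqs. (6)–(8), the following sketch enumerates the per-user transition distribution; it reuses the `success_prob` helper from the earlier sketch, and the treatment of empty slots ($a_k = 0$, for which Eq. (7) is undefined) is our own assumption:

```python
from math import comb

def loss_prob(n: int, a_k: int, l_k: int, mu: float) -> float:
    """sigma_k(n, a_k, l_k): probability that n of the a_k packets are lost,
    with i.i.d. per-packet failure probability 1 - nu_k."""
    nu = success_prob(a_k, l_k, mu)
    return comb(a_k, n) * (1 - nu) ** n * nu ** (a_k - n)

def user_transition(rho_k: float, t: int, l_k: int, mu: float, alpha) -> dict:
    """Distribution of rho_k(t+1) given rho_k(t) and l_k, per Eqs. (6)-(7).
    Returns {rho_k': probability}; the joint transition of Eq. (8) is the
    product of one such factor per user."""
    dist = {}
    for a_k, p_a in enumerate(alpha):
        if p_a == 0.0:
            continue
        if a_k == 0:
            # assumption: an empty slot contributes a loss ratio of 0
            rho_next = (t / (t + 1)) * rho_k
            dist[rho_next] = dist.get(rho_next, 0.0) + p_a
            continue
        for n in range(a_k + 1):
            rho_next = (t / (t + 1)) * rho_k + (1 / (t + 1)) * (n / a_k)  # Eq. (7)
            prob = p_a * loss_prob(n, a_k, l_k, mu)                       # Eq. (6)
            dist[rho_next] = dist.get(rho_next, 0.0) + prob
    return dist
```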
Regarding the strict requirements of URLLC packets described earlier, we now introduce the notion of a risk state.

Definition 1. A risk state is any state where $\rho_k > \rho_{\max}$ for some $k \in \{1, \ldots, K\}$, where $\rho_{\max} > 0$ is a constant fixed by the controller. The set of risk states $\Phi$ is then
$$\Phi = \{\rho_1 \times \cdots \times \rho_K : \exists k \text{ such that } \rho_k > \rho_{\max}\}.$$

Besides, a risk state is an absorbing state, that is, the process ends when it reaches a risk state [12].

A deterministic policy $\pi$ assigns an action to each state at each time step. Our goal is to find an optimal deterministic policy $\pi^*$ which maximizes the total expected reward $V^\pi_T(s)$ given by
$$V^\pi_T(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{T} r(s_t, \pi(s_t)) \mid s_0 = s\right], \quad (9)$$
with the reward $r$ defined in (5), while satisfying the QoS constraint
$$\eta^\pi(s) < w, \quad (10)$$
where $\eta^\pi(s)$ denotes the probability of visiting a risk state over the planning horizon, given that the initial state (at time slot $0$) is $s$ and policy $\pi$ is followed, and $w$ is a positive constant. Formally,
$$\eta^\pi(s) = P^\pi(\exists t \text{ such that } s_t \in \Phi \mid s_0 = s). \quad (11)$$
In order to explicitly characterize $\eta^\pi(s)$, we introduce in the following the risk signal $\bar{r}$.
Definition 2. We define the risk signal $\bar{r}$ as follows:
$$\bar{r}(s_t, \ell_t, s_{t+1}) = \begin{cases} 1 & \text{if } s_{t+1} \in \Phi \\ 0 & \text{otherwise,} \end{cases} \quad (12)$$
where $s_t$ and $\ell_t$ denote the state and action at time slot $t$, respectively, and $s_{t+1}$ denotes the subsequent state.
Proposition 1. The probability of visiting a risk state, $\eta^\pi(s)$, is given by
$$\eta^\pi(s) = \bar{V}^\pi_T(s), \quad (13)$$
where we set
$$\bar{V}^\pi_T(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{T} \bar{r}(s_t, \pi(s_t), s_{t+1}) \mid s_0 = s\right]. \quad (14)$$
Proof. The random sequence $\bar{r}(t=0), \bar{r}(t=1), \ldots, \bar{r}(t=T)$ may contain a single $1$ if a risk state is visited; otherwise all its components are equal to zero (recall that a risk state is an absorbing state). Therefore, $\sum_{t=0}^{T} \bar{r}(t)$ is a Bernoulli random variable with mean equal to the probability of reaching a risk state, that is, relation (13) holds. ∎
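As a sanity check on Proposition 1, one can estimate $\eta^\pi(s)$ empirically by rolling out the policy and averaging the cumulative risk signal. A sketch under an assumed simulator interface `env.step(t, s, a) -> (s_next, r, r_bar)` (our own construction, not the paper's):

```python
def estimate_risk(env, policy, T, s0, n_episodes=10_000):
    """Monte-Carlo estimate of eta^pi(s0) via Proposition 1: each episode
    contributes either 0 or a single 1 to the cumulative risk signal."""
    hits = 0
    for _ in range(n_episodes):
        s = s0
        for t in range(T + 1):
            a = policy[t][s]
            s, _, r_bar = env.step(t, s, a)   # assumed simulator interface
            if r_bar == 1:                    # risk state entered (absorbing)
                hits += 1
                break
    return hits / n_episodes
```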
B. Optimality Equations

By virtue of Proposition 1, we associate a state value function $\bar{V}^\pi_T$ with the probability of visiting a risk state. Now, we define a new weighted value function $V^\pi_{\xi,T}$, which incorporates both the reward and the risk, as follows:
$$V^\pi_{\xi,T}(s) = \xi V^\pi_T(s) - \bar{V}^\pi_T(s), \quad (15)$$
where $\xi > 0$ is the weighting parameter, determined by the risk level the controller is willing to tolerate. The function $V^\pi_{\xi,T}$ can be seen as a standard value function associated with the reward $\xi r - \bar{r}$. The case $\xi = 0$ corresponds to a minimum-risk policy, whereas the case $\xi \to \infty$ corresponds to a maximum-value policy.

Let $\Pi$ denote the set of deterministic policies, and define
$$V^*_T(s) = \max_{\pi \in \Pi} V^\pi_T(s), \quad \bar{V}^*_T(s) = \min_{\pi \in \Pi} \bar{V}^\pi_T(s), \quad V^*_{\xi,T}(s) = \max_{\pi \in \Pi} V^\pi_{\xi,T}(s).$$
Besides, we define $u^\pi_t$, $\bar{u}^\pi_t$, and $u^\pi_{\xi,t}$ for $0 \leq t \leq T$, respectively, by
$$u^\pi_t(s) = \mathbb{E}^\pi\left[\sum_{i=t}^{T} r(s_i, \pi(s_i)) \mid s_t = s\right], \quad (16)$$
$$\bar{u}^\pi_t(s) = \mathbb{E}^\pi\left[\sum_{i=t}^{T} \bar{r}(s_i, \pi(s_i), s_{i+1}) \mid s_t = s\right], \quad (17)$$
$$u^\pi_{\xi,t}(s) = \xi u^\pi_t(s) - \bar{u}^\pi_t(s). \quad (18)$$
Note that $V^\pi_T$ incorporates the total expected reward over the entire planning horizon, whereas $u_t$ incorporates the rewards from decision epoch $t$ to the end of the planning horizon only. Besides, $\bar{u}_t(s)$ is the probability of visiting a risk state given that at time $t$ the system is in state $s \in \mathcal{S} \setminus \Phi$, and is thus a measure of the risk.

The optimality equations are given by (the proof is similar to that in [18], Chap. 4, and is skipped here for brevity)
$$u^*_t(s) = \max_{\ell \in \mathcal{L}} \Big\{ r(s_t, \ell) + \sum_{j \in \mathcal{S}} p(j \mid s_t, \ell) \, u^*_{t+1}(j) \Big\}, \quad (19)$$
$$\bar{u}^*_t(s) = \min_{\ell \in \mathcal{L}} \Big\{ \sum_{j \in \mathcal{S}} p(j \mid s_t, \ell) \left( \bar{r}(s_t, \ell, j) + \bar{u}^*_{t+1}(j) \right) \Big\}, \quad (20)$$
$$u^*_{\xi,t}(s) = \max_{\ell \in \mathcal{L}} \Big\{ \sum_{j \in \mathcal{S}} p(j \mid s_t, \ell) \left( \xi r(s_t, \ell) - \bar{r}(s_t, \ell, j) + u^*_{\xi,t+1}(j) \right) \Big\}, \quad (21)$$
for $t = 0, \ldots, T-1$. For the boundary conditions, that is, at time slot $T$, $u^*_T(s)$, $\bar{u}^*_T(s)$, and $u^*_{\xi,T}(s)$ are set to zero for each $s \in \mathcal{S}$.

In a non-risk state, the reward $r$ is given by (5) and the risk signal is equal to zero, whereas in a risk state the reward $r$ is set to zero and the risk signal $\bar{r}$ is set to one.

IV. ALGORITHM DESIGN
In this section, we present two algorithms: (i) the finite-horizon value iteration algorithm, which assumes that all the model parameters, namely the channel statistics, are known to the controller (model-based algorithm), and (ii) a reinforcement learning algorithm, which does not require the controller to know the channel statistics (model-free algorithm).
A. Value Iteration Algorithm
In order to find a policy that maximizes the weighted value function defined in (15), we use the value iteration algorithm [18]. In this algorithm, we proceed backwards: we start by determining the optimal action at time slot $T$ for each state, and successively consider the previous stages until reaching time slot $0$ (see Algorithm 1).
Algorithm 1 Finite-Horizon Value Iteration Algorithm
  Initialization: for each $s$: $u^*_T(s) \leftarrow 0$, $\bar{u}^*_T(s) \leftarrow 0$, $u^*_{\xi,T}(s) \leftarrow 0$
  $t \leftarrow T - 1$
  while $t \geq 0$ do
    for each $s$: update $u^*_t(s)$, $\bar{u}^*_t(s)$, and $u^*_{\xi,t}(s)$ according to (19), (20), and (21), respectively
    $t \leftarrow t - 1$
  end while
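Below is a minimal Python sketch of Algorithm 1 specialized to the weighted backup (21); the `transition`, `reward`, and `risk` callables are assumed interfaces standing in for Eqs. (8), (5), and (12), with risk-state absorption assumed to be encoded inside `transition`:

```python
def value_iteration(states, actions, transition, reward, risk, T, xi):
    """Finite-horizon backward induction for the weighted criterion, Eq. (21).

    transition(s, a) -> {s': prob}   (Eq. (8), risk states absorbing)
    reward(s, a)     -> float        (Eq. (5); zero in risk states)
    risk(s, a, s')   -> 0 or 1       (Eq. (12))
    Returns the weighted values u_xi[t][s] and the greedy policy pi[t][s].
    """
    u_xi = [{s: 0.0 for s in states} for _ in range(T + 1)]  # boundary: u_xi[T] = 0
    policy = [dict() for _ in range(T)]
    for t in range(T - 1, -1, -1):          # backwards from T-1 down to 0
        for s in states:
            best_val, best_a = float("-inf"), None
            for a in actions:
                q = sum(p * (xi * reward(s, a) - risk(s, a, s2) + u_xi[t + 1][s2])
                        for s2, p in transition(s, a).items())
                if q > best_val:
                    best_val, best_a = q, a
            u_xi[t][s] = best_val
            policy[t][s] = best_a
    return u_xi, policy
```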
[Fig. 2: Reinforcement learning model — the learning controller interacts with the environment (wireless channel): at each step it performs an action and observes the new state, the reward $r$, and the risk signal $\bar{r}$.]
B. Risk-Sensitive Reinforcement Learning Algorithm
During the learning phase, the controller builds estimates of the value of each state-action pair. It updates its estimates through interaction with the environment: at each iteration it performs an action and then observes the reward, the risk signal $\bar{r}$, and the next state (see Fig. 2).

The learning controller chooses an action at each learning step following the $\varepsilon$-greedy policy, that is, it selects an action that maximizes its current estimate with probability $1 - \varepsilon$, or a random action with probability $\varepsilon$. The parameter $\varepsilon$ captures the exploration-exploitation trade-off: when $\varepsilon \to 0$, the controller tends to choose an action that maximizes its current state's estimated value, whereas when $\varepsilon \to 1$, the controller tends to choose an action at random, favoring exploration.

The state-action value function is given by [19], [21]
$$Q^\pi(s_t, \ell) = r(s_t, \ell) + \sum_{j \in \mathcal{S}} p(j \mid s_t, \ell) \, u^\pi_{t+1}(j),$$
where the first term denotes the immediate reward, that is, the number of successfully transmitted packets over all the users when action $\ell$ is performed in state $s_t$, and the second term denotes the expected reward when policy $\pi$ is followed in the subsequent decision stages. Similarly to the state-action value function associated with the reward, we define the state-action value function associated with the risk, $\bar{Q}^\pi$, as
$$\bar{Q}^\pi(s_t, \ell) = \sum_{j \in \mathcal{S}} p(j \mid s_t, \ell) \left( \bar{r}(s_t, \ell, j) + \bar{u}^\pi_{t+1}(j) \right).$$
Note that the introduction of the risk signal $\bar{r}$ enables us to define a state-action value function $\bar{Q}$ for the risk. Besides, the state-action value function associated with the weighted formulation, $Q^\pi_\xi$, is given by
$$Q^\pi_\xi(s_t, \ell) = \xi Q^\pi(s_t, \ell) - \bar{Q}^\pi(s_t, \ell).$$
Finally, the Q-function updates at learning step $n$ (which should not be confused with the decision epoch $t$) are given by [21]
$$Q^{(n+1)}(s_t, \ell) \leftarrow \left[1 - \alpha_n(s_t, \ell)\right] Q^{(n)}(s_t, \ell) + \alpha_n(s_t, \ell) \Big[ r + \max_{\ell' \in \mathcal{L}} Q^{(n)}(s_{t+1}, \ell') \Big], \quad (22)$$
$$\bar{Q}^{(n+1)}(s_t, \ell) \leftarrow \left[1 - \alpha_n(s_t, \ell)\right] \bar{Q}^{(n)}(s_t, \ell) + \alpha_n(s_t, \ell) \Big[ \bar{r} + \min_{\ell' \in \mathcal{L}} \bar{Q}^{(n)}(s_{t+1}, \ell') \Big], \quad (23)$$
and
$$Q^{(n+1)}_\xi(s_t, \ell) \leftarrow \left[1 - \alpha_n(s_t, \ell)\right] Q^{(n)}_\xi(s_t, \ell) + \alpha_n(s_t, \ell) \Big[ \xi r - \bar{r} + \max_{\ell' \in \mathcal{L}} Q^{(n)}_\xi(s_{t+1}, \ell') \Big], \quad (24)$$
where $\alpha_n(s_t, \ell)$ denotes the learning rate at step $n$ when state $s_t$ and action $\ell$ are visited.

The learning algorithm converges to the optimal state-action value function when each state-action pair is performed infinitely often and when the learning rate satisfies, for each $(s_t, \ell)$ pair (the proof is given in [7], [21] and skipped here for brevity),
$$\sum_{n=1}^{\infty} \alpha_n(s_t, \ell) = \infty, \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(s_t, \ell) < \infty.$$
In this case, the Q-functions are related to the value functions as follows:
$$\max_{\ell \in \mathcal{L}} Q(s_t, \ell) = u^*_t(s_t), \quad \min_{\ell \in \mathcal{L}} \bar{Q}(s_t, \ell) = \bar{u}^*_t(s_t), \quad \max_{\ell \in \mathcal{L}} Q_\xi(s_t, \ell) = u^*_{\xi,t}(s_t).$$
When a risk state is reached during the learning phase, the system is restarted to a non-risk state chosen according to the uniform distribution. In addition, when $t > T$, we consider that an artificial absorbing state is reached and we reinitialize $t$ (see Algorithm 2).
Algorithm 2 Q-learning Algorithm
  Initialization: $t \leftarrow 0$, $s \leftarrow s_0$, $n \leftarrow 0$; for each $\ell \in \mathcal{L}$: $Q(s, \ell) \leftarrow 0$, $\bar{Q}(s, \ell) \leftarrow 0$, $Q_\xi(s, \ell) \leftarrow 0$
  repeat
    observe current state $s_t$
    select and perform action $\ell$ in state $s_t$
    observe the new state $s_{t+1}$, the reward $r$, and the risk signal $\bar{r}$
    update the Q-functions $Q(s_t, \ell)$, $\bar{Q}(s_t, \ell)$, $Q_\xi(s_t, \ell)$ according to (22), (23), (24), respectively
    $t \leftarrow t + 1$; $n \leftarrow n + 1$; update $\alpha_n$
    if $t = T$ then $t \leftarrow 0$ (artificial absorbing state reached)
    if $s_t \in \Phi$ then $s_t \sim \mathrm{Unif}\{\mathcal{S} \setminus \Phi\}$ (absorbing risk state reached)
  until convergence
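A compact sketch of Algorithm 2, tracking only the weighted function $Q_\xi$ via update (24) (the paper also maintains $Q$ and $\bar{Q}$ via (22)–(23), omitted here for brevity); `env` is the same assumed simulator interface as above, extended with an `is_risk` predicate, and the learning-rate exponent $\gamma = 0.6$ is an assumed value:

```python
import random
from collections import defaultdict

def q_learning(env, states, actions, T, xi, eps=0.1, gamma=0.6, n_iters=200_000):
    Q = defaultdict(float)      # Q_xi[(t, s, a)]; boundary Q[(T, ., .)] stays 0
    visits = defaultdict(int)   # n(s, a): per-pair visit counts for Eq. (26)
    non_risk = [s for s in states if not env.is_risk(s)]
    t, s = 0, random.choice(non_risk)
    for _ in range(n_iters):
        # epsilon-greedy action selection on the current estimate
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(t, s, x)])
        s_next, r, r_bar = env.step(t, s, a)
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)]) ** gamma      # Eq. (26)
        target = xi * r - r_bar + max(Q[(t + 1, s_next, x)] for x in actions)
        Q[(t, s, a)] += alpha * (target - Q[(t, s, a)])    # Eq. (24)
        t, s = t + 1, s_next
        if t == T or env.is_risk(s):        # horizon end or absorbing risk state
            t, s = 0, random.choice(non_risk)   # restart, per Algorithm 2
    return Q
```

Note that the incremental form `Q += alpha * (target - Q)` is algebraically identical to the convex combination $(1-\alpha_n)Q^{(n)} + \alpha_n[\cdot]$ in (24).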
V. PERFORMANCE EVALUATION

In this section, we present the numerical results obtained with the value iteration and learning algorithms in a variety of scenarios. We consider a setting with two users and $L = 5$ channels. For the arrival traffic, we consider the following truncated Poisson distribution:
$$\mathrm{Prob}(a = m) = \begin{cases} \dfrac{\lambda^m / m!}{\sum_{i=0}^{A_{\max}} \lambda^i / i!} & \text{if } m \leq A_{\max} \\ 0 & \text{otherwise,} \end{cases} \quad (25)$$
where $\lambda = 3$ and $A_{\max} = 6$. The mean of the Bernoulli channel $\mu$ and the value of the parameter $\rho_{\max}$ are held fixed throughout this section.
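For reproducibility, a short sketch of the arrival distribution (25) and how one might sample from it (our own illustration):

```python
import math
import random

def truncated_poisson(lam: float = 3.0, a_max: int = 6):
    """Arrival distribution of Eq. (25): Poisson(lam) truncated to {0,...,a_max}."""
    weights = [lam ** m / math.factorial(m) for m in range(a_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

alpha = truncated_poisson()                                   # alpha[a] = Prob(a arrivals)
arrivals = random.choices(range(len(alpha)), weights=alpha, k=10)  # sample a few slots
```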
A. Minimum-Risk vs. Maximum-Value Policy

First, we compare the performance of the minimum-risk policy (obtained when $\xi = 0$), the maximum-value policy (obtained when $\xi \to \infty$), the weighted policy (when $\xi > 0$), and the fixed policy, which consists in assigning the same number of channels to each user at each time slot ($\ell_1 = 2$ and $\ell_2 = 3$). We depict in Fig. 3 (top) the reward $u_t(s_0)$ given in (19) as a function of time when the different policies are followed. We observe that the maximum-value policy clearly outperforms the fixed and minimum-risk policies. In Fig. 3 (bottom), showing $\bar{u}_t(s_0)$ given in (20), we observe that the probability of visiting a risk state when the fixed policy is followed is much higher than that obtained when the minimum-risk policy $\bar{\pi}^*$ is performed; for example, at time step $t = 5$, $\bar{u}_t(s_0)$ under the fixed policy $\pi_f$ is markedly higher than under $\bar{\pi}^*$. In fact, the fixed policy does not take account of the experienced QoS of the users, and it is therefore the policy which results in the highest risk-state visitation probability. Besides, this probability decreases over time for all the policies: as time goes on, the probability of entering a risk state over the remaining time steps decreases.

The reward $u_t(s_0)$ increases for the lower values of $t$ until reaching a maximum value and then decreases, for all the policies. In fact, for the lower values of $t$, the probability of visiting a risk state is high, and this affects the expected value of the reward (recall that in a risk state the reward is equal to zero). As time goes on, this probability decreases, and thus the expected reward increases. However, at the later time steps, the number of remaining decision stages is low, and hence the expected reward (the total number of successfully transmitted packets over the remaining time slots) decreases.

B. Learning
In the learning algorithm, we simulate the wireless channel with Bernoulli random variables, with a number of trials equal to the number of channels assigned to each packet of each user. For the learning rate $\alpha_n$, we considered the following expression [11]:
$$\alpha_n = \frac{1}{\left(1 + n(s_t, \ell)\right)^{\gamma}}, \quad (26)$$
where $n(s_t, \ell)$ denotes the number of times the state-action pair $(s_t, \ell)$ has been visited up to iteration $n$, and $\gamma \in [0.5, 1]$ is a positive parameter [11].

[Fig. 3: Performance of the minimum-risk policy $\bar{\pi}^*$, the maximum-value policy $\pi^*$, the weighted policy $\pi^*_\xi$, and the fixed policy $\pi_f$: $u_t(s_0)$ (top) and $\bar{u}_t(s_0)$ (bottom), with $T = 9$.]

We depict in Fig. 4 the optimal (minimum-risk) policy computed by the learning algorithm, that is, the number of channels $\ell_1 \in \{0, \ldots, 5\}$ to assign to user 1, as a function of the time steps (decision epochs) and of $\rho_1$, when $\rho_2$ is fixed to 0. The figure shows a monotonicity property: the number of channels to assign to user 1 increases with time and with $\rho_1$. In fact, as the QoS of user 1 degrades ($\rho_1$ increases), more channels are assigned to him to compensate for this degradation; and as time goes on, the policy becomes more sensitive to this degradation, assigning more channels for the same values of $\rho_1$ at later time steps.
[Fig. 4: Optimal policy $\ell_1^*$ as a function of the time steps and $\rho_1$, with $\rho_2 = 0$ and $T = 5$.]

VI. CONCLUSION
In this work, we studied the problem of dynamic channel allocation for URLLC traffic in a multi-user multi-channel wireless network within a novel framework. Due to the stochastic nature of the problem, related to time-varying, fading channels and random arrival traffic, we considered a finite-horizon MDP framework. We explicitly characterized the probability of visiting a risk state and wrote it as a cumulative return (risk signal). We then introduced a weighted global value function which incorporates two criteria: reward and risk. By means of the value iteration algorithm, we determined the optimal policy. Furthermore, we used a Q-learning algorithm to enable the controller to learn the optimal policy in the absence of channel statistics. We illustrated the performance of our algorithms with numerical studies, and we showed that by adapting the number of parallel transmissions in a smart way, the performance of the system can be substantially enhanced. In future work, we would like to take account of spatial diversity in the dynamic allocation scheme, where both the BS and the user terminals can be equipped with multiple antennas to enhance the system performance.

REFERENCES
[1] R. Aggarwal, M. Assaad, C. E. Koksal, and P. Schniter. OFDMA downlink resource allocation via ARQ feedback. In Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, pages 1493–1497, Pacific Grove, CA, Nov 2009.
[2] A. Ahmad and M. Assaad. Joint resource optimization and relay selection in cooperative cellular networks with imperfect channel knowledge. In Proc. of the IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).
[3] A. Ahmad and M. Assaad. Optimal resource allocation framework for downlink OFDMA system with channel estimation error. In Proc. of the IEEE Wireless Communication and Networking Conference, pages 1–5, Sydney, April 2010.
[4] M. R. Akdeniz, Y. Liu, M. K. Samimi, S. Sun, S. Rangan, T. S. Rappaport, and E. Erkip. Millimeter wave channel modeling and cellular capacity evaluation. IEEE Journal on Selected Areas in Communications, 32(6):1164–1179, June 2014.
[5] M. Assaad, A. Ahmad, and H. Tembine. Risk sensitive resource control approach for delay limited traffic in wireless networks. In Proc. of the IEEE Global Telecommunications Conference (GLOBECOM), pages 1–5, Kathmandu, Dec 2011.
[6] M. Bennis, M. Debbah, and H. V. Poor. Ultra-reliable and low-latency wireless communication: tail, risk and scale. Technical Report, arXiv, May 2018.
[7] D. P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[8] S. Cayci and A. Eryilmaz. Learning for serving deadline-constrained traffic in multi-channel wireless networks. In Proc. of the 15th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pages 1–8, May 2017.
[9] A. Destounis, G. S. Paschos, J. Arnau, and M. Kountouris. Scheduling URLLC users with reliable latency guarantees. In Proc. of the 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pages 1–8, May 2018.
[10] A. Dua and N. Bambos. Downlink wireless packet scheduling with deadlines. IEEE Transactions on Mobile Computing, 6(12):1410–1425, Dec 2007.
[11] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2004.
[12] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24(1):81–108, July 2005.
[13] A. Gosavi. Finite horizon Markov control with one-step variance penalties. In Proc. of the 48th Annual Allerton Conference on Communication, Control, and Computing, pages 1355–1359, Sept 2010.
[14] N. U. Hassan and M. Assaad. Dynamic resource allocation in multi-service OFDMA systems with dynamic queue control. IEEE Transactions on Communications, 59(6):1664–1674, June 2011.
[15] H. Ji, S. Park, J. Yeo, Y. Kim, J. Lee, and B. Shim. Ultra-reliable and low-latency communications in 5G downlink: Physical layer aspects. IEEE Wireless Communications, 25(3):124–130, June 2018.
[16] A. Kumar, V. Kavitha, and N. Hemachandra. Finite horizon risk sensitive MDP and linear programming. In Proc. of the 54th IEEE Conference on Decision and Control (CDC), pages 7826–7831, Dec 2015.
[17] N. Nomikos, N. Pappas, T. Charalambous, and Y. Pignolet. Deadline-constrained bursty traffic in random access wireless networks. In SPAWC, May 2018.
[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.
[19] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.
[20] T. K. Vu, M. Bennis, M. Debbah, M. Latva-aho, and C. S. Hong. Ultra-reliable communication in 5G mmWave networks: A risk-sensitive approach. IEEE Communications Letters, 22(4):708–711, April 2018.
[21] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.