Risk-Based Optimization of Virtual Reality over Terahertz Reconfigurable Intelligent Surfaces

Abstract

In this paper, the problem of associating reconfigurable intelligent surfaces (RISs) to virtual reality (VR) users is studied for a wireless VR network. In particular, this problem is considered within a cellular network that employs terahertz (THz) operated RISs acting as base stations. To provide a seamless VR experience, high data rates and reliable low latency need to be continuously guaranteed. To address these challenges, a novel risk-based framework based on the entropic value-at-risk is proposed for rate optimization and reliability performance. Furthermore, a Lyapunov optimization technique is used to reformulate the problem as a linear weighted function, while ensuring that higher order statistics of the queue length are maintained under a threshold. To address this problem, given the stochastic nature of the channel, a policy-based reinforcement learning (RL) algorithm is proposed. Since the state space is extremely large, the policy is learned through a deep-RL algorithm. In particular, a recurrent neural network (RNN) RL framework is proposed to capture the dynamic channel behavior and improve the speed of conventional RL policy-search algorithms. Simulation results demonstrate that the maximal queue length resulting from the proposed approach is only within 1% of the optimal solution. The results show a high accuracy and fast convergence for the RNN with a validation accuracy of 91.92%.


Christina Chaccour∗, Mehdi Naderi Soorki†, Walid Saad∗, Mehdi Bennis‡, and Petar Popovski§

∗ Wireless@VT, Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA
† Faculty of Engineering, Shahid Chamran University of Ahvaz, Ahvaz, Iran
‡ Centre for Wireless Communications, University of Oulu, Finland
§ Department of Electronic Systems, Aalborg University, Denmark

Emails: {christinac, mehdin, walids}@vt.edu, mehdi.bennis@oulu.fi, [email protected]

Index Terms: Virtual Reality, Terahertz, Reliability

I. INTRODUCTION

Virtual reality (VR) applications will revolutionize the way in which humans interact by allowing them to be immersed, in real time, in a wide range of virtual environments [1]. Nevertheless, unleashing the potential of VR requires its integration into wireless networks in order to provide a seamless and immersive VR experience [2]. However, deploying wireless VR services faces many technical challenges, the most fundamental of which is providing high-rate wireless links with high reliability. On the one hand, VR communication requires high data rates to guarantee a seamless visual experience while delivering 360° VR content. On the other hand, providing reliable haptic VR communications will also require maintaining a very low end-to-end (E2E) delay in the face of extreme and uncertain network conditions.

Guaranteeing this dual performance requirement constitutes a major departure from classical ultra-reliable low-latency communications (URLLC) services limited to low-rate sensors [3], or traditional enhanced mobile broadband (eMBB) services limited to high capacity delivered to dense networks [4]. In order to overcome the rate challenge of VR communications, one can explore the high bandwidth available at the terahertz (THz) frequency bands [5]. However, the reliability of the THz channel can be impeded by its susceptibility to blockage, molecular absorption, and limited communication range. This, in turn, can violate the reliability requirements of VR systems. In order to alleviate these reliability concerns, one can deploy reconfigurable intelligent surfaces (RISs) [6] acting as base stations (BSs) that can provide nearly continuous line-of-sight (LoS) connectivity to VR users. The RIS concept can be viewed as a scaled-up version of conventional multiple-input multiple-output (MIMO) systems beyond their traditional large array concept. However, an RIS exhibits several key differences from massive MIMO systems [6]. Most fundamentally, RISs will be densely located in both indoor and outdoor spaces, making it possible to perform near-field communications through an LoS path. Hence, coupling RISs with THz communications can potentially provide connectivity that exhibits both high data rates and high reliability (in terms of guaranteeing LoS communication). Moreover, VR users will always be in the proximity of physical structures with high-rate wireless capabilities. Thus, it is imperative to understand whether THz-operated RISs can indeed provide an immersive VR experience by delivering continuously reliable connectivity with low E2E delay and high data rate.

A number of recent works attempted to address the challenges of VR communications [1], [7]–[9]. In [7], the authors study the spectrum resource allocation problem with a brain-aware quality-of-service (QoS) constraint. The work in [1] proposes a VR model that captures the tracking and delay components of VR QoS. Meanwhile, the work in [8] proposes a novel framework that uses cellular-connected unmanned aerial vehicles to collect VR content for reliable wireless transmission. In [9], the authors study the issue of concurrent support of visual and haptic perceptions over wireless cellular networks. However, the works in [1] and [7]–[9] do not account for realistic delays and their statistics, and their solutions cannot satisfy high rates and low latency simultaneously. In contrast, to provide reliable VR, it is of interest to explore the possibility of deploying RISs. If properly operated, serving VR users through existing walls and structures with wireless capabilities will unleash the potential of reliable VR. Specifically, equipping RISs with THz capabilities will guarantee the overall seamless experience. We also note that despite the surge of recent works on THz communications (e.g., see [10] and [5], and references therein) and RIS design and optimization (e.g., see [11], [12], and references therein), these works focus on the physical layer and do not address VR or networking challenges of THz communications.

The main contribution of this paper is a novel rate and reliability optimization framework for VR systems leveraging THz-operated RISs. We consider the downlink of a cellular network in which THz-operated RISs serve VR users. In this network, due to the mobility of users and the stochastic nature of the channel, the RISs must be dynamically and intelligently scheduled to VR users. Also, to guarantee reliability and capture full knowledge of the delay statistics, we propose a novel approach that exploits the concept of entropic value-at-risk (EVaR) from economics, which coherently measures the risk associated with a random event [13]. Hence, the EVaR is employed to capture higher order statistics of the delay, thus allowing us to define a concrete measure of the risk associated with delay unreliability. We then formulate a high-reliability and sum-rate maximization scheduling problem by combining both Lyapunov optimization and deep neural networks (DNNs). Using the Lyapunov optimization technique, the problem is transformed into a linear weighted function, which ensures that the maximum queue length and the maximum queue length variance among VR users remain bounded. To solve the proposed Lyapunov optimization problem, we propose a reinforcement learning (RL) algorithm based on recurrent neural networks (RNNs) that can find the user associations to RISs while capturing the dynamic temporal behavior of the users in the channel. Simulation results show that the gap between the proposed approach and the optimal solution is minimal.

The rest of this paper is organized as follows. The system model is presented in Section II. The risk-aware association for VR users is proposed in Section III. The RL approach is presented in Section IV. In Section V, we provide simulation results. Finally, conclusions are drawn in Section VI.

This research was supported by the U.S. National Science Foundation under Grant CNS-1836802, and in part, by the Academy of Finland Project CARMA, by the Academy of Finland Project MISSION, by the Academy of Finland Project SMARTER, as well as by the INFOTECH Project NOOR.

II. SYSTEM MODEL

Consider the downlink of an RIS-based wireless network in a confined indoor area, servicing a set $\mathcal{U}$ of $U$ mobile wireless VR users via a set $\mathcal{B}$ of $B$ RISs acting as THz-operated BSs, as depicted in Fig. 1. (The uplink of VR requests is assumed to follow an arbitrary URLLC scheme and is outside the scope of this paper.) The VR users are mobile and may change their locations and orientations at any point in time. We consider discrete time slots indexed by $t$ with fixed duration $\tau$. Each RIS is a BS that is provided with a feeder (antenna) with a corresponding transmit power denoted by $p$. Hence, the transmitted data is encoded onto the phases of the signals reflected from the different reconfigurable meta-surfaces that compose the RIS [6]. Henceforth, if the RIS consists of $N$ meta-surfaces whose reflection phases can be optimized independently, then an $N$-stream virtual MIMO system can be realized by using a single radio frequency (RF) active chain [6]. We assume that the RF source is close enough to the RIS surface so that the transmission between each pair of RF source and RIS is not affected by fading. Then, the electromagnetic response of the $N$ meta-surface elements can be programmed by using a centralized controller, which generates input signals that tune varactors and change the phase of the reflected signal [11].

Fig. 1: Illustrative example of our system model.

Let $\boldsymbol{\Phi}_{bu,t} = [\phi_{bun,t}]_{N \times 1}$ be the phase shift vector of RIS $b$ with respect to user equipment (UE) $u$ at time slot $t$, where $\phi_{bun,t} \in \Phi$, $n$ is the index of the meta-surface of each RIS, and $\Phi = \left\{-\pi + \frac{2z\pi}{Z-1} \mid z = 0, 1, \ldots, Z-1\right\}$. Here, $Z$ is the number of possible phase shifts per meta-surface element.

A. Channel Model

Due to the mobility of the VR users, the THz link between a VR user and its respective RISs may be blocked by self-blockage, i.e., the blocking of the received signal by UE $u$'s own body, or by dynamic blockages, i.e., the blocking of the signal by other VR users' bodies. Let $s_{bun,t}$ be a random binary variable where $s_{bun,t} = 1$ if there is an LoS link between RIS $b$ and VR UE $u$ at time slot $t$, and $s_{bun,t} = 0$ otherwise. As a byproduct of directional beamforming and propagation differences, the network considered is noise limited. Thus, the random channel gain between RIS $b$ and VR UE $u$ at time slot $t$ is given by [14]:

$$h_{bu,t} = \begin{cases} \left(\dfrac{\lambda}{4\pi d_{bu,t}}\right)^{2} \left(e^{-k(f)\, d_{bu,t}}\right)^{2}, & \text{with } \Pr(s_{bun,t} = 1), \\ 0, & \text{with } \Pr(s_{bun,t} = 0), \end{cases}$$

where $d_{bu,t}$ is the distance between RIS $b$ and VR UE $u$ at time slot $t$, $k(f)$ is the overall molecular absorption coefficient of the medium at the THz band, and $f$ is the operating frequency. Let $\psi_{bun,t}$ be the phase shift of the channel between VR UE $u$ and meta-surface $n$ of RIS $b$ at time slot $t$. Then, for a given reflection phase shift vector $\boldsymbol{\Phi}_{bu,t}$, the transmission rate from RIS $b$ to VR UE $u$ will be (under an approximate average signal-to-noise ratio (SNR) value across the THz band):

$$r_{bu,t} = W \log_2\left(1 + \frac{p\, h_{bu,t} \left|\sum_{n=1}^{N} e^{j(\phi_{bun,t} - \psi_{bun,t})}\right|^{2} s_{bun,t}}{N(d_{bu,t}, p, f)}\right), \tag{1}$$

where $N(d_{bu,t}, p, f) = N_0 + \sum_{b=1}^{B} p A_0 d_{bu,t}^{-2}\left(1 - e^{-K(f)\, d_{bu,t}}\right)$, $N_0 = \frac{W \lambda^{2}}{4\pi} k_B T_0$, $k_B$ is the Boltzmann constant, $T_0$ is the temperature in Kelvin, $A_0 = \frac{c^{2}}{16 \pi^{2} f^{2}}$, and $c$ is the speed of light [14], [15]. Note that the optimal choice of $\phi_{bun,t}$ for every RIS association is equal to $\psi_{bun,t}$, thus maximizing the rate, which we denote by $\tilde{r}_{bu,t}$, as shown in [6]. This selection will be made by the controller after learning the RIS association and optimizing it, as shown in subsequent sections.

B. Queuing Model

Each RIS is equipped with mobile edge computing (MEC) capabilities, and, thus, we model the queuing and transmission of each VR content as an M/G/1 queue at the MEC server of the RIS. We define a binary decision variable $x_{bu,t}$ that is equal to 1 if RIS $b$ is scheduled to serve the VR content queue of user $u$ at time slot $t$, and $x_{bu,t} = 0$ otherwise. Note that multiple users can be associated to one RIS; however, each user is connected to a single RIS. Let $Q_{u,t}$ be the queue length corresponding to UE $u$'s requested VR images at the beginning of slot $t$. Then, the queue dynamics are given by:

$$Q_{u,t+1} = \max\{Q_{u,t} - \tilde{R}_{bu,t}, 0\} + A_{u,t}, \quad \text{if } x_{bu,t} = 1, \tag{2}$$

where $A_{u,t}$ is the number of VR images queued for transmission at time slot $t$. The arrival of VR content follows a Poisson arrival process with mean rate $\lambda_u$. $\tilde{R}_{bu,t}$ is the rate of VR image transmission over the THz link between RIS $b$ and VR UE $u$ at time slot $t$, given by $\tilde{R}_{bu,t} = \frac{\tilde{r}_{bu,t}\,\tau}{M}$, where $M$ is the size of the VR image. Given that the availability of the THz LoS link is a random variable, $\tilde{R}_{bu,t}$ is a stochastic random variable with respect to time.

III. RISK-AWARE RIS-VR USER ASSOCIATION

A. Problem Formulation

Our goal is to characterize the RIS-UE association policy which determines the system parameters over a finite horizon of length $T$. The objective of this optimal policy is to maximize the sum-rate while maintaining reliable transmission. Formally, we define a policy $\Pi_t = \{x_{bu,t} \mid \forall b \in \mathcal{B}, \forall u \in \mathcal{U}\}$ for the controller that associates each RIS to its respective VR users. The control policy at a given slot $t$ depends on unknown environmental changes, which are a consequence of the stochastic nature of the channel and of the sudden changes that might block the LoS signal between RISs and mobile VR UEs. Thus, the transition probabilities $\Pr(s_{bun,t} = j \mid s_{bun,t-1} = i)$, $\forall b \in \mathcal{B}, \forall u \in \mathcal{U}$, of the LoS THz links are unknown. Furthermore, the reliability metric is satisfied as long as the cumulative distribution function (CDF) of the E2E delay does not exceed the reliability constraint associated with its respective network. Subsequently, to account for the risk of loss incurred when reliability is not satisfied, the value-at-risk (VaR) concept, defined as $\mathrm{VaR}_{1-\alpha}(X) = -\inf_{t \in \mathbb{R}}\{P(X \le t) \ge 1 - \alpha\}$ [13], can be used. However, VaR is not a coherent risk measure, which makes its analysis intractable. Thus, we define the EVaR as $\phi_t = \frac{\log \mathbb{E}[\exp(-\gamma Q_t)]}{\gamma}$, which is a coherent risk measure that corresponds to the tightest possible upper bound obtained from the VaR. In the EVaR, $Q_t := \max_{u \in \mathcal{U}}\{Q_{u,t}\}$ and $0 < \gamma \ll 1$. Subsequently, to ensure reliability, the following condition needs to be met: $\lim_{t \to \infty} \phi_t < \kappa$. Expanding the Maclaurin series of $\phi_t$ with respect to the $\log$ and $\exp$ functions, we obtain $\phi_t = \mathbb{E}[Q_t] + \vartheta(Q_t) - \mathcal{O}(Q_t)$, where $\vartheta(Q_t) = \mathbb{E}[Q_t^{2}] - \mathbb{E}[Q_t]^{2}$ is the variance of the maximum queue length. Thus, to minimize $\lim_{t \to \infty} \phi_t$, it is sufficient to minimize the first two terms of its Maclaurin series. Consequently, we formulate the RIS association and phase shift-control problem for an RIS-assisted THz indoor network as follows:

\begin{align}
\max_{\{\Pi_t\}} \quad & \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t}, \tag{3} \\
\text{s.t.} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[Q_t] < \varepsilon, \tag{4} \\
& \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[Q_t^{2}] < \eta, \tag{5} \\
& \phi_{bu,t} \in \Phi, \quad \forall b \in \mathcal{B}, \forall u \in \mathcal{U}, \forall t \in \mathcal{T}, \tag{6} \\
& \sum_{u \in \mathcal{U}} x_{bu,t} \le 1, \quad \forall b \in \mathcal{B}, \forall t \in \mathcal{T}, \tag{7} \\
& x_{bu,t} \in \{0, 1\}, \quad \forall b \in \mathcal{B}, \forall u \in \mathcal{U}, \forall t \in \mathcal{T}. \tag{8}
\end{align}

Here, maximizing the objective function in (3) ensures that the visual component of the VR experience is guaranteed, thus delivering a seamless experience. On the other hand, (4) and (5) ensure that the constraint of mitigating the risk will be satisfied, where $\eta = \varepsilon + 2[\gamma(\kappa + 1) - \varepsilon]$. This further guarantees that the haptic component of the VR experience will be delivered successfully. Given that the length of the queues changes with random events and that their probability distribution is not known a priori, the optimization problem in (3) cannot be solved using traditional stochastic optimization techniques [16]. Next, we propose a tunable minimum-drift-plus-penalty optimization problem based on Lyapunov optimization to reformulate the problem stated previously.

B. Lyapunov Optimization

We use a Lyapunov optimization approach [16] to solve problem (3). This approach allows us to convert the constraints into a tractable form. Henceforth, to ensure (4) and (5), we define two virtual queues $Z_1$ and $Z_2$ with the following dynamics:

$$Z_{1,t+1} = \max\{Z_{1,t} + Q_t - \varepsilon, 0\}, \tag{9}$$

$$Z_{2,t+1} = \max\{Z_{2,t} + Q_t^{2} - \eta, 0\}. \tag{10}$$

Moreover, given that our initial optimization problem is a maximization problem, our aim is to minimize the drift-plus-penalty expression given by:

$$\Delta_t - V \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t}, \tag{11}$$

where $\Delta_t = \mathbb{E}[L_{t+1} - L_t \mid Q_t]$ and $L_t$ is the Lyapunov function given by $L_t = \frac{1}{2}\left(Z_{1,t}^{2} + Z_{2,t}^{2} + \sum_{u \in \mathcal{U}} Q_{u,t}^{2}\right)$. Next, we transform problem (3) into one whose objective is a linear weighted function and whose constraints are no longer a function of $Q_t$ as in (4) and (5).

Proposition 1. The conditional Lyapunov drift-plus-penalty under any feasible control policy $\Pi_t$ is bounded as follows:

$$\Delta_t - V \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t} \le \Upsilon + \sum_{u=1}^{U} Q_{u,t}\left(A_{u,t} - \tilde{R}_{bu,t}\right) + Z_{1,t}(Q_t - \varepsilon) + Z_{2,t}(Q_t^{2} - \eta) - V \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t}. \tag{12}$$

Proof: Given that, $\forall x \in \mathbb{R}$, $\max\{x, 0\}^{2} \le x^{2}$, we square (2) and subtract $Q_{u,t}^{2}$ from both sides as follows:

$$Q_{u,t+1}^{2} - Q_{u,t}^{2} \le \left(Q_{u,t} - \tilde{R}_{bu,t}\right)^{2} + A_{u,t}^{2} + 2 A_{u,t}\left(Q_{u,t} - \tilde{R}_{bu,t}\right) - Q_{u,t}^{2}.$$

Simplifying this expression leads to the following:

$$Q_{u,t+1}^{2} - Q_{u,t}^{2} \le \left(\tilde{R}_{bu,t} - A_{u,t}\right)^{2} + 2 Q_{u,t}\left(A_{u,t} - \tilde{R}_{bu,t}\right).$$

Similarly,

$$Z_{1,t+1}^{2} - Z_{1,t}^{2} \le (Q_t - \varepsilon)^{2} + 2 Z_{1,t}(Q_t - \varepsilon),$$

$$Z_{2,t+1}^{2} - Z_{2,t}^{2} \le \left(Q_t^{2} - \eta\right)^{2} + 2 Z_{2,t}\left(Q_t^{2} - \eta\right).$$

After some mathematical manipulation, we obtain:

$$L_{t+1} - L_t \le \Upsilon + \sum_{u=1}^{U} Q_{u,t}\left(A_{u,t} - \tilde{R}_{bu,t}\right) + Z_{1,t}(Q_t - \varepsilon) + Z_{2,t}(Q_t^{2} - \eta), \tag{13}$$

where $\Upsilon = U \left(\max_{b \in \mathcal{B}, u \in \mathcal{U}, t} \tilde{R}_{bu,t}\right)^{2} + \varepsilon^{2} + \eta^{2}$.

Thus, instead of minimizing the drift-plus-penalty expression, we minimize the upper bound of the one-slot conditional Lyapunov drift-plus-penalty in (12). The initial optimization problem is reformulated as:

\begin{align}
\max_{\{\Pi_t\}} \quad & V \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t} - \sum_{u=1}^{U} Q_{u,t}\left(A_{u,t} - \tilde{R}_{bu,t}\right) - Z_{1,t}(Q_t - \varepsilon) - Z_{2,t}(Q_t^{2} - \eta), \tag{14} \\
\text{s.t.} \quad & \phi_{bu,t} \in \Phi, \quad \forall b \in \mathcal{B}, \forall u \in \mathcal{U}, \forall t \in \mathcal{T}, \tag{15} \\
& \sum_{u \in \mathcal{U}} x_{bu,t} \le 1, \quad \forall b \in \mathcal{B}, \forall t \in \mathcal{T}, \tag{16} \\
& x_{bu,t} \in \{0, 1\}, \quad \forall b \in \mathcal{B}, \forall u \in \mathcal{U}, \forall t \in \mathcal{T}. \tag{17}
\end{align}

Thus, employing virtual queues and Lyapunov optimization allowed us to transform our optimization problem into a linear weighted function. Nevertheless, solving problem (14) using integer programming will be very complex due to its combinatorial nature and the stochasticity of the variables. Given that the distribution of the system parameters cannot be characterized, the problem cannot be solved using stochastic matching theory or stochastic optimization. Next, to solve (14), we propose a centralized and low-complexity RNN RL framework that provides the optimal policy for the RIS-UE association. Moreover, RNN RL is suitable for this problem since it can reduce the dimensionality of the large state space while capturing dynamic temporal behaviors [17].

IV. RECURRENT NEURAL NETWORKS RL FOR WIRELESS VR IN THZ OPERATED RIS NETWORK

In this section, an adaptive control policy based on a deep RL framework is proposed. The proposed framework will allow us to learn the policy that solves the RIS-UE association problem in (14). We model (14) as a Markov decision process (MDP) represented by the tuple $\{\mathcal{S}, \mathcal{A}, P, R\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is an unknown state transition function, $P(s', s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, and $R(a_t, s_t)$ is the reward function [18]. Our action space is the set of all possible associations of RISs to VR UEs, $\mathcal{A} = \{[x_{bu}]_{B \times U} \mid x_{bu} \in \{0, 1\}, b = 1, \ldots, B, u = 1, \ldots, U\}$, and the reward is

$$R(a_t, s_t) = V \sum_{b \in \mathcal{B}} \sum_{u \in \mathcal{U}} x_{bu,t}\, \tilde{R}_{bu,t} - \sum_{u=1}^{U} Q_{u,t}\left(A_{u,t} - \tilde{R}_{bu,t}\right) - Z_{1,t}(Q_t - \varepsilon) - Z_{2,t}(Q_t^{2} - \eta),$$

which is the current objective function in (14). The state is the set of VR UE queue lengths, virtual queue lengths, and the state of the LoS links between RISs and VR UEs, $\mathcal{S} = \{[s_{bu}]_{B \times U}, [Q_u]_{U \times 1}, Z_1, Z_2 \mid s_{bu} \in \{0, 1\}, \{Q_u, Z_1, Z_2\} \in \mathbb{Z}^{+}, b = 1, \ldots, B, u = 1, \ldots, U\}$.

We represent the class of parameterized policies of our MDP as $\Pi_t = \{\pi_\theta(a_t \mid s_t) \mid \theta \in \mathbb{R}^{m}\}$, where $\pi_\theta(a_t \mid s_t) = \Pr\{a = a_t \mid s = s_t, \theta\}$. The stochastic reward function $R(a_t, s_t)$ during the next time slot has a transition probability of $\rho_t = \prod_{s' \in \mathcal{S}} \pi_\theta(a_t \mid s_t) \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$. To solve the optimization problem in (14) directly, the controller would need full knowledge of the transition probability and of all possible values of $R(a_t, s_t)$ for all possible states of the MDP under a given policy $\pi_\theta$. Given that our model is highly dynamic due to the mobility of the users and the nature of the channel, the transition probability of the states cannot be characterized through probability distribution functions (PDFs). Thus, it is necessary to use an RL framework to solve (14).
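As a rough illustration of how the quantities in this MDP interact, the following Python sketch advances the queues by one slot using (2), (9), and (10), and evaluates the reward that appears as the objective in (14). The function names, the array shapes, and the choice to let unscheduled users simply accumulate arrivals are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def step_queues(Q, Z1, Z2, x, R, A, eps, eta):
    """Advance the real and virtual queues by one time slot.

    Q: (U,) queue lengths, x: (B, U) binary association matrix,
    R: (B, U) image transmission rates, A: (U,) arrivals.
    Assumption: a user with no scheduled RIS receives zero service.
    """
    served = (x * R).sum(axis=0)               # service seen by each user
    q_max = Q.max()                            # Q_t = max_u Q_{u,t}
    Q_next = np.maximum(Q - served, 0.0) + A   # queue dynamics (2)
    Z1_next = max(Z1 + q_max - eps, 0.0)       # virtual queue (9)
    Z2_next = max(Z2 + q_max ** 2 - eta, 0.0)  # virtual queue (10)
    return Q_next, Z1_next, Z2_next

def slot_reward(Q, Z1, Z2, x, R, A, V, eps, eta):
    """Evaluate the per-slot reward, i.e. the objective in (14)."""
    served = (x * R).sum(axis=0)
    q_max = Q.max()
    return (V * (x * R).sum()
            - np.dot(Q, A - served)
            - Z1 * (q_max - eps)
            - Z2 * (q_max ** 2 - eta))
```

An RL environment built on this sketch would call `slot_reward` with the current state and the chosen association, and then `step_queues` to obtain the queue part of the next state; the LoS indicators would come from a separate mobility and blockage model.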
In particular, we use a policy-search approach to find the optimal RIS to VR user association while maintaining the high reliability and high data rate formulated in (14). For each policy, we define its value as:

$$J(\theta) = \sum_{s' \in \mathcal{S}} R(a_t, s_t)\, \rho_t. \tag{18}$$

Hence, to find the optimal policy, we need to find $\theta^{*} = \arg\max_\theta J(\theta)$. To do so, we need to perform gradient ascent on the policy parameters, which requires deriving $\nabla_\theta J(\theta)$. Similarly to [18], by writing $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$, we can reformulate (18) as follows:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\Lambda_t}\left\{\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(a_t, s_t)\right\}, \tag{19}$$

where $\Lambda_t = \{a_t, s_{t+1} = s' \mid s' \in \mathcal{S}\}$ is the trajectory of the MDP for the next time slot. Subsequently, we can use (19) to solve the optimization problem in (14) using a gradient ascent algorithm such as REINFORCE [18]. Nevertheless, given that the number of states is considerably high, this procedure will be intractable, which motivates the need for a function approximator through the use of DNNs. Given that the reliability depends on the prediction of the VR users' mobility pattern, which will determine their associations to RISs, it is important to implement a framework that is capable of capturing the dynamic behavior exhibited. To address these challenges, using an RNN to represent the policy of the RL algorithm will extract the channel's dynamic features and learn an optimized sequence guaranteeing reliability at each time instant, based on the input features [17]. Since we deal with time-varying policies, it is natural to resort to RNNs. Indeed, RNNs are known to be effective in processing time-related data and capturing dynamic temporal behaviors. As such, we represent the policy $\pi_\theta(a \mid s)$ by an RNN [19] that can learn the user associations to RISs.

In particular, we use a many-to-many RNN as shown in Fig. 2. The overall RNN consists of an encoder and a decoder network. The encoder network comprises three long short-term memory (LSTM) layers, two fully connected layers, and two rectified linear unit (ReLU) layers. The state of the MDP is thus encoded in the output of the LSTM layers, $h$. As for the decoder network, it consists of the last fully connected layer and the $B$ softmax layers. Moreover, the input consists of the states $s_t \in \mathcal{S}$ of the MDP that are fed to the first LSTM layer of $h$ hidden layers. Subsequently, the $B$ softmax layers output the actions of the MDP $a_t \in \mathcal{A}$, i.e., the $b$-th softmax layer outputs $\{x_{bu} \mid x_{bu} \in \{0, 1\}, b = 1, \ldots, B, u = 1, \ldots, U\}$.

Fig. 2: Illustrative example of the RNN architecture.

This architecture was chosen given that the LSTM layers allow us to avoid the problem of vanishing gradients [18]. More precisely, compared to other DNNs, it provides a fast RL algorithm via a slow RL algorithm. That is, instead of depending on the convergence speed of the RL algorithm through conventional gradient ascent, its policy is represented by an RNN. Subsequently, given that the RNN receives all the typical information that a regular RL algorithm would receive, the activations of the RNN store the state that would improve the speed of the RL algorithm on the current MDP [20].

Fig. 3: The training and validation process of the RNN.

V. SIMULATION RESULTS AND ANALYSIS

For our simulations, we consider the following parameters: temperature $T_0 = 300$ K, $p = 1$ W, $M = 10$ Mbits, $f = 1$ THz, $W = 30$ GHz, and a molecular absorption coefficient $K(f)$ set according to the percentage of water vapor molecules, as in [21]. The RISs are deployed over the 4 walls of an indoor area modeled as a square of size 40 m × 40 m. All statistical results are averaged over a large number of independent runs. In order to train the network, we consider the RNN architecture shown in Fig. 2 with a fixed maximum number of epochs and a fixed mini-batch size. Furthermore, the network was trained with data generated from VR users moving according to a random walk, which constitutes the most general scheme characterizing users' mobility [19] (our approach can accommodate any other mobility model). In Fig. 3, we analyze the convergence of our proposed RNN-RL algorithm in terms of accuracy and training loss. Note that 80% of our dataset is used for the training process, 10% is used for validation, and the remaining 10% is used for testing. In particular, the training process uses data generated from the users' random walks and the optimal solution to fit the model. Subsequently, throughout the validation process, the random walk data is used to provide unbiased estimators of the hyper-parameters of the RNN. Fig. 3 shows a validation accuracy of 91.92%. Both the accuracy and the loss curves of the training and validation processes are smooth. Finally, after obtaining all the hyper-parameters of the RNN, the test dataset provides an unbiased evaluation of the final model fit, from which the testing error is obtained.
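The data preparation described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the fixed step length, the uniformly random heading, the clipping of positions at the walls, and the chronological (unshuffled) split are choices made here and are not specified in the paper. Only the 40 m × 40 m room and the 80%/10%/10% proportions come from the text.

```python
import numpy as np

def random_walk(num_steps, room=40.0, step=1.0, rng=None):
    """Generate a 2-D random walk confined to a room x room square.

    Assumptions: fixed step length (in meters), heading drawn uniformly
    at random in each slot, positions clipped at the walls.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.empty((num_steps, 2))
    pos[0] = rng.uniform(0.0, room, size=2)          # random start point
    for t in range(1, num_steps):
        angle = rng.uniform(0.0, 2.0 * np.pi)        # random heading
        move = step * np.array([np.cos(angle), np.sin(angle)])
        pos[t] = np.clip(pos[t - 1] + move, 0.0, room)
    return pos

def split_80_10_10(samples):
    """Split samples into training, validation, and test parts."""
    n = len(samples)
    n_train = (8 * n) // 10
    n_val = n // 10
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```

For example, `split_80_10_10(random_walk(1000))` yields 800 training, 100 validation, and 100 test positions for a single user. In the setting described above, each position would additionally be paired with the corresponding LoS states, queue lengths, and optimal association labels before being fed to the RNN.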

Fig. 4:

Maximum queue length vs. number of time slots.

Fig. 5:

Sum-rate of VR content vs. number of time slots.

We compare our RNN RL policy to the optimal solution for different numbers of users. In Fig. 4, the maximum queue length is plotted against the number of time slots. Given that the RNN was capable of capturing the dynamic behavior of the channel, for $U = 7$, the maximum queue length of the policy-based approach is within 1% of the optimal maximum queue length. Meanwhile, for $U = 3$, it is nearly equal to the optimal solution. This confirms that combining the risk-based approach with the RNN RL leads to highly reliable results. Clearly, the maximum queue length increases as the number of VR users in the network increases. Fig. 5 shows the sum-rate of VR content, which constitutes the objective function in (3), over time. Here, the policy-based controller offers a solution that is considerably farther from the optimal solution than in the case of the maximum queue length in Fig. 4. The reason for this is that a sum of rates is being compared rather than individual rates; thus, the inaccuracy in every term propagates into a higher inaccuracy when the terms are added. As we can see, for both $U = 7$ and $U = 3$, the sum-rate of the policy-based approach is lower than the optimal sum-rate. Clearly, as the number of VR users increases, the sum-rate increases, and so does the gap between the optimal solution and the policy-based approach.

VI. CONCLUSION

In this paper, we have investigated the problem of RIS-VR user association while guaranteeing reliable, low-latency, and high-rate communications. We have proposed a risk-aware optimization problem that takes into account the higher order statistics of the queue length, thus guaranteeing continuous reliability. The proposed problem was further transformed into a linear weighted function using Lyapunov optimization. Furthermore, the problem was solved using an RNN RL framework to reduce the dimensionality of the state space and capture channel dynamics and user mobility.

REFERENCES

[1] M. Chen, W. Saad, and C. Yin, “Virtual reality over wireless networks: Quality-of-service model and learning-based resource management,” IEEE Transactions on Communications, vol. 66, no. 11, pp. 5621–5635, Jun. 2018.
[2] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” arXiv preprint arXiv:1902.10265, 2019.
[3] G. Gerardino, B. Alvarez, K. Pedersen, and P. Mogensen, “MAC layer enhancements for ultra-reliable low-latency communications in cellular networks,” in Proc. of IEEE International Conference on Communications (ICC), Paris, France, pp. 1–7.
[4] M. Steeg and A. Stöhr, “High data rate 6 Gbit/s steerable multibeam 60 GHz antennas for 5G hot-spot use cases,” in Proc. of IEEE Photonics Society Summer Topical Meeting Series (SUM), San Juan, Puerto Rico, Jul. 2017, pp. 141–142.
[5] A. Moldovan, P. Karunakaran, I. F. Akyildiz, and W. H. Gerstacker, “Coverage and achievable rate analysis for indoor terahertz wireless networks,” in Proc. of IEEE International Conference on Communications (ICC), Paris, France, Jul. 2017, pp. 1–7.
[6] E. Basar, M. Di Renzo, J. de Rosny, M. Debbah, M.-S. Alouini, and R. Zhang, “Wireless communications through reconfigurable intelligent surfaces,” arXiv preprint arXiv:1906.09490, 2019.
[7] A. T. Z. Kasgari, W. Saad, and M. Debbah, “Human-in-the-loop wireless communications: Machine learning and brain-aware resource management,” to appear, IEEE Transactions on Communications, 2019.
[8] M. Chen, W. Saad, and C. Yin, “Echo-liquid state deep learning for 360° content transmission and caching in wireless VR networks with cellular-connected UAVs,” IEEE Transactions on Communications, vol. 67, no. 9, pp. 6386–6400, 2019.
[9] J. Park and M. Bennis, “URLLC-eMBB slicing to support VR multimodal perceptions over wireless cellular systems,” in Proc. of IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, UAE, Dec. 2018.
[10] A. S. Cacciapuoti, R. Subramanian, K. R. Chowdhury, and M. Caleffi, “Software-defined network controlled switching between millimeter wave and terahertz small cells,” arXiv preprint arXiv:1702.02775, 2017.
[11] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable intelligent surfaces for energy efficiency in wireless communication,” IEEE Transactions on Wireless Communications, vol. 18, no. 8, pp. 4157–4170, 2019.
[12] M. Jung, W. Saad, Y. Jang, G. Kong, and S. Choi, “Uplink data rate in large intelligent surfaces: Asymptotic analysis under channel estimation errors,” submitted to Proc. of IEEE Wireless Communications and Networking Conference (WCNC), Marrakech, Morocco, 2019.
[13] A. Ahmadi-Javid, “Entropic value-at-risk: A new coherent risk measure,” Journal of Optimization Theory and Applications, vol. 155, no. 3, pp. 1105–1123, 2012.
[14] C. Chaccour, R. Amer, B. Zhou, and W. Saad, “On the reliability of wireless virtual reality at terahertz (THz) frequencies,” in Proc. of the 10th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Canary Islands, Spain, Jun. 2019.
[15] R. Zhang, K. Yang, Q. H. Abbasi, K. A. Qaraqe, and A. Alomainy, “Analytical modelling of the effect of noise on the terahertz in-vivo communication channel for body-centric nano-networks,” Nano Communication Networks, vol. 15, pp. 59–68, 2018.
[16] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
[17] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Artificial neural networks-based machine learning for wireless networks: A tutorial,” IEEE Communications Surveys & Tutorials, 2019.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[19] M. Naderi Soorki, W. Saad, and M. Bennis, “Ultra-reliable millimeter-wave communications using an artificial intelligence-powered reflector,” in Proc. IEEE Global Communications Conference, Waikoloa, HI, USA, Dec. 2019.
[20] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL2: Fast reinforcement learning via slow reinforcement learning,” arXiv preprint arXiv:1611.02779, 2016.
[21] C.-C. Wang, X.-W. Yao, C. Han, and W.-L. Wang, “Interference and coverage analysis for terahertz band communication in nanonetworks,” in