Hierarchical Reinforcement Learning for Relay Selection and Power Optimization in Two-Hop Cooperative Relay Network
Yuanzhe Geng, Erwu Liu, Senior Member, IEEE, Rui Wang, Senior Member, IEEE, and Yiming Liu, Student Member, IEEE
Abstract
Cooperative communication is an effective approach to improve spectrum utilization. In order to reduce the outage probability of a communication system, most studies propose various schemes for relay selection and power allocation, which are based on the assumption of known channel state information (CSI). However, it is difficult to obtain accurate CSI in practice. In this paper, we study the problem of minimizing outage probability subject to a total transmission power constraint in a two-hop cooperative relay network. We use reinforcement learning (RL) methods to learn strategies for relay selection and power allocation, which do not need any prior knowledge of CSI but simply rely on interaction with the communication environment. It is noted that conventional RL methods, including most deep reinforcement learning (DRL) methods, cannot perform well when the search space is too large. Therefore, we first propose a DRL framework with an outage-based reward function, which is then used as a baseline. Then, we further propose a hierarchical reinforcement learning (HRL) framework and training algorithm. A key difference from other RL-based methods in the existing literature is that our proposed HRL approach decomposes relay selection and power allocation into two hierarchical optimization objectives, which are trained at different levels. With the simplification of the search space, the HRL approach can solve the problem of sparse reward, while the conventional RL method fails. Simulation results reveal that, compared with the traditional DRL method, the HRL training algorithm can reach convergence 30 training iterations earlier and further reduce the outage probability in a two-hop relay network with the same outage threshold.

Index Terms: cooperative communication, outage probability, relay selection, power allocation, hierarchical reinforcement learning

This work is supported in part by grants from the National Science Foundation of China (No. 61571330, No. 61771345), the Shanghai Integrated Military and Civilian Development Fund (No. JMRH-2018-1075), and the Science and Technology Commission of Shanghai Municipality (No. 19511102002). Corresponding author: Erwu Liu. Yuanzhe Geng, Erwu Liu, and Yiming Liu are with the College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China. E-mail: [email protected], [email protected], ymliu [email protected]. Rui Wang is with the College of Electronics and Information Engineering and the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 201804, China. E-mail: [email protected].
I. INTRODUCTION
The rapid development of communication technology makes wireless spectrum resources very tight [1]–[3]. Therefore, in recent years, cooperative communication has received much attention, for it helps improve spectrum utilization and system throughput in multi-user scenarios. Cooperative communication systems usually use outage probability as a metric to measure the Quality-of-Service (QoS) and the robustness of the system. An outage occurs when the received signal-to-noise ratio (SNR) falls below a certain threshold [4], [5]. In order to minimize the outage probability of the cooperative relay network, there are usually two intuitive approaches, that is, optimizing the relay selection scheme or the power allocation scheme.
Relay Selection Schemes:
For a scenario with multiple relays, it is usually possible to select multiple relays to coordinate data transmission and assign them orthogonal channels to avoid interference. Jedrzejczak et al. [6] studied the relay selection problem by calculating the harmonic mean of channel gains. Islam et al. [7] demonstrated the influence of network coverage capability on relay selection. Das and Mehta [8] proposed an approach for relay selection by analyzing the outage probability. The drawback of employing too many relays is that it may lead to extensive time consumption and wasted frequency resources when forwarding the signal. To solve this issue, Bletsas et al. [9] proposed an opportunistic relay selection scheme that chooses only the best relay according to the channel state, which can obtain the full diversity gain. However, these methods all assume exact channel state information (CSI), which is not practical because of the existence of inevitable noise.
Power Allocation Schemes:
Based on a given relay, reasonable power allocation can further improve the received SNR and reduce the outage probability. Given partial channel information, Wang and Chen [10] derived a closed-form formula for optimal power allocation based on maximizing a tight capacity lower bound. In [11] and [12], the authors considered optimal power allocation schemes in different situations of conventional relaying and opportunistic relaying. Tabataba et al. [13] deduced the expression of the system outage probability under the condition of high SNR, and studied power allocation using the AF protocol. However, these studies still require prior knowledge of the channel, and thus cannot be further applied to other situations.

Recently, some researchers have successfully applied reinforcement learning (RL) methods to cooperative communication. RL is one of the three paradigms of machine learning, which can achieve high-precision function fitting through powerful computing capability. Unlike traditional methods, RL methods do not need prior knowledge of the environment, that is, we do not have to add any assumptions to the learning process.

In RL approaches, the source node is empowered with the learning ability to determine the optimal relay or power allocation for the current moment, based on previous observations of the system state and rewards. Shams et al. [14] employed a Q-learning algorithm to solve the power control problem, and Wang et al. [15] proposed a Q-learning based relay selection scheme in relay-aided communication scenarios. The drawback of these studies is obvious, as they are only suitable for simple problems with a low dimension. Su et al. [16] proposed a deep Q network (DQN) based relay selection scheme with detailed mutual information (MI) as the reward, but did not take power allocation into consideration. In order to solve the joint optimization problem, Su et al. [17] employed convex optimization and DQN to deal with relay selection and power allocation, respectively. However, these RL methods and frameworks can only be used in some simple environments, and do not work well in high-dimensional search spaces.

In this paper, we propose a hierarchical reinforcement learning (HRL) approach for relay selection and power allocation, to minimize the outage probability of a two-hop cooperative communication system. Unlike traditional optimization methods, our method can learn a behavior policy without assuming any prior knowledge of the channel state. It also differs from existing RL-based methods in that we design an outage-based reward function, which uses a binary signal to represent the success or failure of communication. Furthermore, we propose a novel hierarchical framework to reduce the search space and improve learning efficiency. Specifically, the contributions of this paper can be summarized as follows.

• In our two-hop cooperative communication model, we transform the traditional outage probability optimization problem into a statistical problem, so that RL methods can be used to solve it. By employing RL methods, we no longer need to add any assumptions about channel distributions, and only rely on interaction with the communication environment.

• We propose an outage-based reward function. Compared with other existing RL methods, our method needs less information fed back from the environment. Rewards are determined only by binary signals of success or failure, and do not include other concrete representations of information. This is practical because such additional feedback may not be available in certain situations.
• We further design an HRL framework with two levels for the cooperative relay network, where relay selection and power allocation are disassembled into two optimization objectives. Traditional deep reinforcement learning (DRL) methods consider relay selection and power allocation together, which leads to a more complex action space and may affect the learning performance. By decomposing different optimization objectives into different levels, the complex action space is simplified in our HRL framework.

The rest of this paper is organized as follows. Section II introduces the preliminaries of DRL. Section III analyzes our system model and Section IV formulates the outage minimization problem. Section V describes our outage-based method using the DQN framework. Section VI describes our proposed HRL framework and learning algorithm, and presents our pre-training algorithm in detail. Section VII presents simulation results. Finally, Section VIII concludes this paper and outlines future works.

II. PRELIMINARIES
In the field of cooperative communication, recent studies have proposed several machine learning methods for relay selection or power allocation. Traditional communication methods make assumptions about CSI, while these methods use data for learning and then make channel predictions.

As one of the three paradigms of machine learning, RL is an emerging tool for solving decision-making problems such as resource management in communication [18]–[20]. RL methods use an agent, which can be regarded as an intelligent robot, to interact with the environment. The agent in RL methods has no access to prior knowledge of the environment, but can only get familiar with the environment through an interactive process called a Markov Decision Process (MDP). In order to minimize the outage probability, the agent repeatedly interacts with the communication environment, and chooses a suitable relay and allocates transmission power according to the current system state. In addition, it continuously adjusts its behavior policy according to the feedback from the communication environment. In this section, we introduce these preliminaries of reinforcement learning.
A. Markov Decision Process
An MDP consists of an environment $\mathcal{E}$, a state space $\mathcal{S}$, an action space $\mathcal{A}$, and a reward function $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$. At each discrete time step $t$, the agent observes the current state $s_t \in \mathcal{S}$, and selects an action $a_t \in \mathcal{A}$ according to a policy $\pi: \mathcal{S} \to P(\mathcal{A})$, which maps states to a probability distribution over actions. After executing action $a_t$, the agent receives a scalar reward $r(s_t, a_t)$ from the environment $\mathcal{E}$ and observes the next state $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$. This process continues until a terminal state is reached.

The goal of the agent is to find the optimal policy to maximize the expected long-term discounted reward, i.e., maximize the expected accumulated return $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ from each state $s_t$, where $T$ denotes the total number of steps, and $\gamma \in [0, 1]$ denotes the discount factor that trades off the importance of immediate and future rewards.

The action-value function is usually used to describe the expected return after selecting action $a_t$ in state $s_t$ according to policy $\pi$:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_t \sim \mathcal{E},\, a_t \sim \pi(s_t)}\left( R_t \mid s_t, a_t \right). \quad (1)$$
We can obtain the preceding action-value function via the recursive relationship known as the Bellman equation:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim \mathcal{E}}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi(s_{t+1})}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right]. \quad (2)$$
Moreover, the optimal action-value function $Q^{*}(s_t, a_t) = \max_{\pi \in \Pi} Q^{\pi}(s, a)$ gives the maximum action value under state $s$ and action $a$, and it also obeys the Bellman equation:
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \right]. \quad (3)$$
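For concreteness, the following minimal Python sketch computes the discounted return $R_t$ defined above for a recorded reward sequence; the reward values are illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i), computed here for t = 0."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: binary, outage-style rewards over a short episode.
print(discounted_return([1, 0, 1, 1, 0]))  # 1 + 0 + 0.81 + 0.729 + 0 = 2.539
```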
B. Reinforcement Learning

In practice, we usually do not know the underlying state transition probability, i.e., we are in a model-free situation. This requires the agent to interact with the environment, learn from the feedback, and constantly adjust its behavior to maximize the expected reward.

In reinforcement learning, temporal difference (TD) methods [21] were proposed by combining Monte Carlo methods and dynamic programming methods, which enable the agent to learn directly from raw experience. Therefore, we have the following well-known Q-learning algorithm:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \delta_t \quad (4)$$
with
$$\delta_t = r(s_t, a_t) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \quad (5)$$
where $\delta_t$ denotes the TD error and $\alpha \in [0, 1]$ denotes the learning rate.

Through continuous iterative updating, the Q values of the different actions selected in each state finally become stable, which can then provide a policy for subsequent action selection.
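As a concrete illustration of (4) and (5), the sketch below performs one tabular Q-learning update; the state and action indexing and all parameter values are illustrative.

```python
import numpy as np

n_states, n_actions = 16, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))  # tabular action-value estimates

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha * TD error, as in (4)-(5)."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error

# Example transition: in state 3, action 1 yields reward 1 and leads to state 7.
q_update(s=3, a=1, r=1.0, s_next=7)
```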
III. SYSTEM MODEL

Consider a wireless network with an $N_S$-antenna source $S$, an $N_D$-antenna destination $D$, and a group of single-antenna relays $\mathcal{R} = \{R_1, R_2, \ldots, R_K\}$, as shown in Fig. 1. We assume that the source is far from the destination, so the help of relay nodes is needed. Due to the equipment limitations of the relays, we consider a half-duplex signaling mode where the communication from $S$ to $D$ via the selected relay $R_i$ takes two time slots. In the first time slot, $S$ broadcasts its signal, and all the other nodes, including the destination, listen to this transmission. In the second time slot, the selected relay forwards the decoded signal to the destination.
Fig. 1: Relay network model.

Depending on how the cooperative relay processes the received signal, the relay mode can be mainly divided into amplify-and-forward (AF) and decode-and-forward (DF). Next, we analyze the MI obtained by using the AF protocol and the DF protocol, respectively.

A. Amplify-and-Forward Relaying
In the case that all relays can only scale the received signal and send it to the destination, we employ the AF protocol to realize cooperative communication. In the first phase, the received signal at $R_i$ can be written as
$$y_{si}(t) = \sqrt{P_s}\, \mathbf{h}_{si}^{\dagger}(t)\, \mathbf{x}(t) + n_i(t), \quad (6)$$
where $P_s \in [0, P_{\max}]$ is the transmission power at the source with $P_{\max}$ being the maximum value, $\mathbf{x}(t)$ is an $N_S \times 1$ data symbol vector with $\|\mathbf{x}(t)\| = 1$, $\mathbf{h}_{si}(t) = [h^{1}_{si}(t), \ldots, h^{N_S}_{si}(t)]^{T}$ is an $N_S \times 1$ channel vector between the source and the relay, whose elements are complex Gaussian random variables with zero mean and variance $\sigma^{2}_{si}$, and $n_i(t) \sim \mathcal{CN}(0, \sigma^{2}_{n})$ is the complex Gaussian noise at the relay. Similarly, the received signal at $D$ can be written as
$$\mathbf{y}_{sd}(t) = \sqrt{P_s}\, \mathbf{H}_{sd}^{\dagger}(t)\, \mathbf{x}(t) + \mathbf{n}_d(t), \quad (7)$$
where $\mathbf{H}_{sd}(t)$ denotes an $N_S \times N_D$ channel matrix, and $\mathbf{n}_d(t) \sim \mathcal{CN}(\mathbf{0}, \sigma^{2}_{n} \mathbf{I}_{N_D})$ denotes the complex Gaussian noise at the destination, where $\mathbf{I}$ is the identity matrix.

In the second time slot, the selected relay amplifies the signal and transmits it to the destination. The received signal at the destination from the relay can be written as
$$\mathbf{y}_{id,AF}(t) = \sqrt{P_r}\, \mathbf{h}_{id}^{\dagger}(t)\, \beta\, y_{si}(t) + \mathbf{n}_d(t), \quad (8)$$
where $P_r \in [0, P_{\max}]$ is the transmission power at the relay with $P_{\max}$ being the maximum value, and $\mathbf{h}_{id}(t) = [h^{1}_{id}(t), \ldots, h^{N_D}_{id}(t)]$ is a $1 \times N_D$ channel vector between the relay and the destination; similarly, each element is a complex Gaussian random variable with zero mean and variance $\sigma^{2}_{id}$. $\beta$ is the amplification factor, which can be written as follows [22], [23]:
$$\beta = \sqrt{\frac{1}{P_s \|\mathbf{h}_{si}\|^{2} + \sigma^{2}_{n}}}. \quad (9)$$
The destination combines the data from the source and the relay using maximal ratio combining (MRC), and after some manipulations according to [24], [25], we have the following final end-to-end SNR:
$$\varphi_z = \frac{\varphi_{si}\, \varphi_{id}}{\varphi_{si} + \varphi_{id} + 1}, \quad (10)$$
where $\varphi_{si} = P_s \|\mathbf{h}_{si}\|^{2} / \sigma^{2}_{n}$ and $\varphi_{id} = P_r \|\mathbf{h}_{id}\|^{2} / \sigma^{2}_{n}$.

Similarly, we can obtain the SNR of the direct transmission from the source to the destination, which can be represented as $\varphi_{sd} = P_s \|\mathbf{H}_{sd}\|^{2} / \sigma^{2}_{n}$. Then we have the MI between the source and the destination using the AF protocol:
$$I_{AF} = \frac{1}{2} \log_{2}(1 + \varphi_{AF}) = \frac{1}{2} \log_{2}\left(1 + \varphi_{sd} + \frac{\varphi_{si}\, \varphi_{id}}{\varphi_{si} + \varphi_{id} + 1}\right). \quad (11)$$
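To make (10) and (11) concrete, the following Python sketch computes the AF end-to-end MI from given channel realizations; the dimensions, power values, and unit noise variance are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def af_mutual_information(h_si, h_id, h_sd, P_s, P_r, sigma2_n=1.0):
    """End-to-end MI of the AF link, following (10) and (11)."""
    phi_si = P_s * np.linalg.norm(h_si) ** 2 / sigma2_n
    phi_id = P_r * np.linalg.norm(h_id) ** 2 / sigma2_n
    phi_sd = P_s * np.linalg.norm(h_sd) ** 2 / sigma2_n
    phi_af = phi_sd + phi_si * phi_id / (phi_si + phi_id + 1.0)  # MRC-combined SNR
    return 0.5 * np.log2(1.0 + phi_af)  # 1/2: transmission spans two time slots

# Example with random complex Gaussian channels (illustrative dimensions).
rng = np.random.default_rng(0)
cg = lambda *shape: (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
print(af_mutual_information(cg(4), cg(2), cg(4, 2), P_s=2.0, P_r=2.0))
```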
B. Decode-and-Forward Relaying

Assume that all relays are able to decode the signal from the source, and then re-encode and transmit the signal to the destination. The first time slot in DF mode is the same as that in AF mode, and (6) and (7) give the received signals at $R_i$ and $D$ in this time slot.

In the second time slot, different from AF mode, the selected relay decodes and forwards the signal to the destination, and the received signal at the destination from the relay can be written as
$$\mathbf{y}_{id,DF}(t) = \sqrt{P_r}\, \mathbf{h}_{id}^{\dagger}(t)\, \hat{x}(t) + \mathbf{n}_d(t), \quad (12)$$
where $\hat{x}(t)$ denotes the decoded and re-encoded signal. When employing the DF protocol, the relay first needs to successfully decode the signal from the source, on the condition that the MI is above the required transmission rate. Then, the signals received in the two time slots are combined at the destination using MRC [9], [25], and we have the following instantaneous MI between the source and the destination:
$$I_{DF} = \frac{1}{2} \log_{2}(1 + \varphi_{DF}) = \frac{1}{2} \log_{2}(1 + \varphi_{sd} + \varphi_{id}). \quad (13)$$

IV. PROBLEM FORMULATION
Suppose that there is an agent in the communication environment, which has access to the channel states in previous time slots. The agent estimates the current channel state based on historical CSI, and accordingly selects a relay and allocates transmission power. Afterwards, it receives a reward from the environment, which indicates whether the communication is successful.

In this section, we model this process as an MDP, where the historical channel state is regarded as the system state, and relay selection along with power allocation is considered as the system action. Then we describe the variables in our two-hop cooperative communication scenario and formulate our optimization problem.

A. State Space
The full observation of our two-hop communication system consists of the channel states between any two nodes in the previous time slot. Therefore, the state space in the current time slot is a union of the different wireless channel states, which can be denoted as
$$S_t \triangleq [\mathbf{h}_{si}(t-1),\, \mathbf{h}_{id}(t-1),\, \mathbf{H}_{sd}(t-1)], \quad (14)$$
where the integer $i$ satisfies $i \in [1, K]$.

In order to characterize the temporal correlation between time slots for each channel, we employ the following widely adopted Gaussian Markov block fading autoregressive model [22], [26]:
$$\mathbf{h}_{ij}(t) = \rho\, \mathbf{h}_{ij}(t-1) + \sqrt{1 - \rho^{2}}\, \mathbf{e}(t), \quad (15)$$
where $\rho$ denotes the normalized channel correlation coefficient between corresponding elements in $\mathbf{h}(t)$ and $\mathbf{h}(t-1)$, and $\mathbf{e}(t) \sim \mathcal{CN}(\mathbf{0}, \sigma^{2}_{e} \mathbf{I})$ denotes the error variable, which is uncorrelated with $\mathbf{h}_{ij}(t)$. According to Jakes' fading spectrum, we have $\rho = J_{0}(2 \pi f_d \tau)$, where $J_{0}(\cdot)$ denotes the zeroth-order Bessel function of the first kind, and $f_d$ and $\tau$ denote the Doppler frequency and the length of a time slot, respectively.
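A minimal sketch of one step of the autoregressive model (15) with the Jakes-based correlation coefficient; the Doppler frequency, slot length, and error variance below are illustrative values.

```python
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def next_channel(h_prev, f_d=10.0, tau=1e-3, sigma_e=1.0, rng=None):
    """One step of the Gauss-Markov model (15): h(t) = rho*h(t-1) + sqrt(1-rho^2)*e(t)."""
    rng = rng or np.random.default_rng()
    rho = j0(2 * np.pi * f_d * tau)  # Jakes' model correlation coefficient
    e = sigma_e * (rng.standard_normal(h_prev.shape)
                   + 1j * rng.standard_normal(h_prev.shape)) / np.sqrt(2)
    return rho * h_prev + np.sqrt(1 - rho ** 2) * e

# Evolve a 4x1 channel vector over a few slots.
h = np.ones(4, dtype=complex)
for _ in range(3):
    h = next_channel(h)
```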
B. Action Space

The full action space includes relay selection $a^{R}(t)$, source power allocation $a^{P_s}(t)$, and relay power allocation $a^{P_r}(t)$.

Considering that the total power is constrained, i.e., $P_s + P_r \le P_{\max}$, we can assume that the sum of the power used by the source and its selected optimal relay is $P_{\max}$. So $P_r$ can be directly represented by the difference between $P_{\max}$ and $P_s$. Then, we can reduce the number of actions that need to be optimized and derive the following reduced action space:
$$A_t \triangleq [a^{R}(t), a^{P_s}(t)]. \quad (16)$$
The first part of the action space is relay selection, which is denoted by
$$a^{R}(t) = [a^{R}_{1}(t), a^{R}_{2}(t), \ldots, a^{R}_{K}(t)], \quad (17)$$
where $a^{R}_{k}(t) = 1$ means relay $R_k$ is selected in time slot $t$ and $a^{R}_{k}(t) = 0$ otherwise.

The second part is power allocation for the source node. Similarly, it is denoted by
$$a^{P_s}(t) = [a^{P_s}_{1}(t), a^{P_s}_{2}(t), \ldots, a^{P_s}_{L}(t)], \quad (18)$$
where $P_{\max}$ is divided into $L$ power levels, and $a^{P_s}_{l}(t) = 1$ means the $l$-th power level is selected for source node transmission in time slot $t$, and $a^{P_s}_{l}(t) = 0$ otherwise.

Fig. 2: DRL framework for relay selection and power allocation.
C. Reward and Optimization Problem

The outage probability minimization problem that jointly optimizes relay selection and power allocation can be intuitively formulated as
$$\min_{A_t}\ \mathrm{Prob}(I < \lambda), \quad (19)$$
where the positive scalar $\lambda$ denotes the outage threshold, and $I$ can be calculated according to (11) and (13).

Traditional methods establish a probabilistic model, where the distribution employed to describe the channel uncertainty is assumed artificially. However, we do not rely on underlying channel distributions in this paper, and thus traditional probabilistic analysis methods are not applicable. In our problem, the agent can only make use of the communication result, which denotes success or failure, from the cooperative communication environment. Therefore, we define the following indicator function of the event $I < \lambda$, which represents the result after each selection:
$$f(a^{R}, a^{P_s}; \mathbf{h}) \triangleq \mathbb{1}_{I<\lambda} = \begin{cases} 1, & \text{if } I < \lambda \\ 0, & \text{otherwise.} \end{cases} \quad (20)$$
Consider the fact that, when an indicator function is employed to represent each occurrence of an event, the expectation of the indicator function can be used to calculate the probability of the original event. Therefore, we can reformulate the optimization problem for minimizing the outage probability of our communication system in statistical form. This problem can then be solved using RL methods, and is formulated as follows:
$$\begin{aligned} \min_{A_t}\ & \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} f\left(a^{R}(t), a^{P_s}(t); \mathbf{h}(t)\right) \right] \\ \text{s.t. } C_1{:}\ & \sum_{k=1}^{K} a^{R}_{k}(t) = 1, \\ C_2{:}\ & \sum_{l=1}^{L} a^{P_s}_{l}(t) = 1, \\ C_3{:}\ & a^{R}_{k}(t),\ a^{P_s}_{l}(t) \in \{0, 1\}. \end{aligned} \quad (21)$$
In an MDP, the reward is fed back to the agent to evaluate the selected action under the current system state. In this paper, we design an outage-based reward function, which consists only of the communication result. Note that the goal of the agent is to find the optimal behavior policy that maximizes the expected long-term discounted reward, so our binary reward function is denoted as
$$r_t = 1 - f\left(a^{R}(t), a^{P_s}(t); \mathbf{h}(t)\right). \quad (22)$$
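The indicator (20) and the binary reward (22) reduce to two lines of Python; the threshold value below matches the one used later in training, and the function names are illustrative.

```python
def outage_indicator(I, lam=2.0):
    """f in (20): 1 if the mutual information I falls below threshold lam."""
    return 1 if I < lam else 0

def reward(I, lam=2.0):
    """Binary outage-based reward (22): 1 on success, 0 on outage."""
    return 1 - outage_indicator(I, lam)

# Example: an MI of 1.7 under threshold 2.0 is an outage, so the reward is 0.
print(reward(1.7, lam=2.0))
```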
V. DRL BASED SOLUTION

In reinforcement learning, we often estimate the action-value function by using the Bellman equation as an iterative update that converges to the optimum. Unfortunately, a traditional RL method that only employs the Bellman equation has no generalization ability. Since the channel state is uncountable, traditional RL methods often fail to make decisions when faced with channel states that have never appeared. DRL is a combination of deep neural networks (DNN) and traditional RL methods, which was proposed to solve the generalization problem in large state or action spaces [27], [28]. In this section, we adopt the DQN framework to design a DRL-based solution for relay selection and power allocation.

The DRL framework for relay selection and power allocation is shown in Fig. 2. Note that the source node has no prior knowledge of the communication system, which means the distributions of the wireless channels between any two nodes are all unknown to it. The agent can only observe the current state from the communication environment and get state $s_t$. Then a deep neural network is employed as a nonlinear function approximator to deal with the input data, which can be applied to estimate the action-value function for a high-dimensional state space. According to (3), we have
$$Q^{\pi}(s_t, a_t; \theta) = \mathbb{E}\left[ r_t + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1}; \theta) \right]. \quad (23)$$
After calculation, the agent chooses the best action, which induces the maximal value of $Q^{\pi}(s_t, a_t; \theta)$, and then the environment gives the corresponding reward and updates the system state. So far, we have obtained a complete experience tuple $e_t = (s_t, a_t, r_t, s_{t+1})$, which is stored in the experience replay buffer $\mathcal{B} = \{e_1, e_2, \ldots, e_t\}$. When training, a batch of experience is sampled and used to optimize the following set of loss functions:
$$L_i(\theta_i) = \mathbb{E}_{e_t \sim \mathcal{B}}\left[ \left( y_i - Q(s_t, a_t; \theta_i) \right)^{2} \right], \quad (24)$$
with
$$y_i = r + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1}; \theta^{-}), \quad (25)$$
where $\theta^{-}$ denotes the parameters of a separate target network from a previous iteration, which are held fixed when optimizing and are replaced by $\theta_{i-1}$ from the evaluation network after a period of time. Differentiating these loss functions yields the following expression:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\left[ \left( y_i - Q(s_t, a_t; \theta_i) \right) \nabla_{\theta_i} Q(s_t, a_t; \theta_i) \right]. \quad (26)$$
The following standard non-centered RMSProp optimization algorithm [29], [30] is then adopted to minimize the loss function and update the parameters of the Q network:
$$\theta \leftarrow \theta - \eta \frac{\Delta \theta}{\sqrt{\upsilon + \epsilon}}, \quad (27)$$
with
$$\upsilon = \kappa \upsilon + (1 - \kappa) \Delta \theta^{2}, \quad (28)$$
where $\kappa$ is a momentum term and $\Delta \theta$ denotes the accumulated gradients.

Note that this is a model-free approach, as the agent uses states and rewards sampled from the environment rather than estimating transition probabilities. It is also an off-policy method, because an epsilon-greedy method is employed as the behavior policy. For a two-hop cooperative relay network, Algorithm 1 employs the DQN framework to make dynamic relay selection and power allocation. In the Evaluation part of this paper, we test the algorithm and use it for comparison. Pseudocode of the algorithm can be found in Algorithm 1.
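Before the pseudocode, a minimal PyTorch sketch of the target computation (25) and the TD loss (24) with a frozen target network; the network architecture and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network (illustrative architecture)."""
    def __init__(self, state_dim, n_actions, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss (24), with the frozen target network providing y_i as in (25)."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # theta^- is held fixed when optimizing
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```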
Algorithm 1 DRL Based Relay Selection and Power Allocation
Initialize the experience replay buffer $\mathcal{B}$.
Initialize the Q network with random weights $\theta$.
Initialize the target Q network with $\theta^{-} = \theta$.
for episode $u = 1, 2, \ldots, u_{\max}$ do
  Initialize the environment and get state $s_1$.
  for time slot $t = 1, 2, \ldots, t_{\max}$ do
    Choose action $a_t$ using the epsilon-greedy method with a fixed parameter $\epsilon$.
    Execute action $a_t$, and observe reward $r_t$ and next state $s_{t+1}$.
    Collect and save the tuple $e_t$ in $\mathcal{B}$.
    Sample a batch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $\mathcal{B}$.
    if the episode terminates at time slot $j + 1$ then
      Set $y_j = r_j$.
    else
      Calculate $y_j$ according to (25).
    end if
    Perform gradient descent and update the Q network according to (26).
    Update the current state, and every $C$ steps reset $\theta^{-} = \theta$.
  end for
end for

VI. HRL BASED SOLUTION
Traditional DRL-based approaches put all variables together in the action, which results in a complex search space. HRL is a recent technology based on DRL, which has developed rapidly in recent years and is considered a promising method for solving problems with sparse rewards in complex environments. HRL enables more efficient exploration of the environment by abstracting complex tasks into different levels [31]–[33]. In this section, we propose a novel two-level HRL framework for cooperative communication, to learn the relay selection policy and the power allocation policy at different levels.
Fig. 3: HRL framework for decomposing relay selection and power allocation into different levels.
A. Proposed Framework
As shown in Fig. 3, the communication agent has two levels. In the higher level, the meta-controller receives the observation of the state from the external communication environment and outputs a goal $g_t$. The controller in the lower level is supervised with the goals that are learned and proposed by the meta-controller; it observes the state of the external communication environment and selects an action $a_t$. Note that the meta-controller gives a goal every $n$ steps, and the goal remains fixed until the low-level controller reaches the terminal. We employ a standard experience replay buffer, and it is worth noting that experience tuples $(s_t, g_t, r_{e,t}, s_{t+n})$ for the meta-controller and $(s_t, g_t, a_t, r_{i,t}, s_{t+1})$ for the controller are stored in disjoint spaces for training.

The hierarchical environment includes the state, the high-level goal, the low-level action, and the reward, which are described below.

State: The state in our hierarchical framework is the same as that in the DRL environment, which consists of the channel states between any two nodes in the previous time slot. The expression for the state space is given in (14).
High-Level Goal: In cooperative communication systems, we can intuitively find that relay selection plays the major role. Therefore, we separate the different action components into different levels, and extract relay selection as the high-level goal for overall planning. Denoting $g_t$ as the goal in the higher level, we have
$$g_t \in \mathcal{G} \triangleq [g^{R}(t)] = [g^{R}_{1}(t), g^{R}_{2}(t), \ldots, g^{R}_{K}(t)]. \quad (29)$$
In fact, the goal selection in the high level is similar to the relay selection action in the previous DRL method. Therefore, $g^{R}(t)$ should meet the same constraint that $\sum_{k=1}^{K} g^{R}_{k}(t) = 1$, $g^{R}_{k}(t) \in \{0, 1\}$.

Low-Level Action: By decomposing relay selection and power allocation into different levels, we can further reduce the action space. The low-level action space then has only one variable $a^{P_s}(t)$, which satisfies $C_2$ and $C_3$ in (21).

Reward: Note that the higher level and the lower level work on different time scales. The meta-controller first proposes a temporarily fixed goal for the lower level, and then the controller performs actions over a period of time according to both the system state and the high-level goal, receiving feedback from the environment. Therefore, we can denote the internal reward for the low-level controller as
$$r_{i,t} = 1 - f\left(g^{R}(t), a^{P_s}(t); \mathbf{h}(t)\right). \quad (30)$$
On the other hand, we use the communication success rate of a given relay over a period of time $n$ to measure the quality of the current relay selection. Therefore, the external reward for the high-level meta-controller can be represented as
$$r_{e,t} = \frac{1}{n} \sum_{t=1}^{n} r_{i,t}, \quad (31)$$
whose expectation the agent aims to maximize.
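The two-timescale interaction described above can be sketched as a nested loop; the env, meta_controller, and controller objects and their methods are hypothetical stand-ins, not interfaces from the paper.

```python
import numpy as np

def run_episode(env, meta_controller, controller, n=10, t_max=100):
    """Hypothetical HRL loop: a goal is fixed for n steps while the controller acts."""
    s = env.reset()
    for t in range(t_max):
        if t % n == 0:
            g = meta_controller.propose_goal(s)   # relay choice, held for n steps
            internal = []
        a = controller.act(s, g)                  # power-level choice at each step
        s_next, r_i = env.step(g, a)              # binary internal reward (30)
        internal.append(r_i)
        if (t + 1) % n == 0:
            r_e = np.mean(internal)               # external reward (31)
            meta_controller.update(g, r_e)
        s = s_next
```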
B. Hierarchical Learning Policy

For the meta-controller in the higher level, we use the gradient bandit method to learn a goal-policy for dynamically proposing goals according to a given system state. Recall that we have $K$ relays to choose from. Therefore, we first establish the following probability distribution:
$$\pi^{h}_{t}(R_i) \triangleq \mathrm{Pr}\{g_t = R_i\} \triangleq \frac{e^{M_t(R_i)}}{\sum_{b=1}^{K} e^{M_t(R_b)}}, \quad (32)$$
where $\pi^{h}$ is the high-level policy, and $\pi^{h}_{t}(R_i)$ denotes the probability that relay $R_i$ is selected as the goal in time slot $t$. $M_t(R_b)$ denotes the preference value for choosing relay $R_b$, which is updated every $n$ steps.

Then, we employ stochastic gradient ascent to update the preference values:
$$M_{t+n}(R_i) \triangleq M_t(R_i) + \zeta \frac{\partial\, \mathbb{E}[r_{e,t}]}{\partial M_t(R_i)}, \quad (33)$$
where $\zeta > 0$ denotes the learning step size, and the expectation of $r_{e,t}$ can be equally calculated by $\sum_{b=1}^{K} \pi^{h}_{t}(R_b) Q^{*}_{h}(R_b)$. Replacing the expectation in (33), we have
$$\frac{\partial\, \mathbb{E}[r_{e,t}]}{\partial M_t(R_i)} = \frac{\partial}{\partial M_t(R_i)} \left[ \sum_{b=1}^{K} \pi^{h}_{t}(R_b)\, Q^{*}_{h}(R_b) \right] = \sum_{b=1}^{K} Q^{*}_{h}(R_b) \frac{\partial \pi^{h}_{t}(R_b)}{\partial M_t(R_i)} = \sum_{b=1}^{K} \left( Q^{*}_{h}(R_b) - \bar{r}_{e,t} \right) \frac{\partial \pi^{h}_{t}(R_b)}{\partial M_t(R_i)} = \sum_{b=1}^{K} \pi^{h}_{t}(R_b) \left( Q^{*}_{h}(R_b) - \bar{r}_{e,t} \right) \frac{1}{\pi^{h}_{t}(R_b)} \frac{\partial \pi^{h}_{t}(R_b)}{\partial M_t(R_i)}, \quad (34)$$
where the newly introduced scalar $\bar{r}_{e,t}$ is independent of $b$; it can be subtracted as a baseline because $\sum_{b=1}^{K} \partial \pi^{h}_{t}(R_b) / \partial M_t(R_i) = 0$. It denotes the average of all external rewards, i.e., the average success rate of our communication system. Further, the partial derivative can be written as
$$\frac{\partial \pi^{h}_{t}(R_b)}{\partial M_t(R_i)} = \frac{\partial}{\partial M_t(R_i)} \left[ \frac{e^{M_t(R_b)}}{\sum_{b'=1}^{K} e^{M_t(R_{b'})}} \right] = \mathbb{1}_{i=b}\, \pi^{h}_{t}(R_b) - \pi^{h}_{t}(R_b)\, \pi^{h}_{t}(R_i). \quad (35)$$
Note that $\mathbb{E}[r_{e,t} \mid g_t] = Q^{*}_{h}(g_t)$ and $r_{e,t}$ is independent of the other variables. Then we can derive the following equation in expectation form:
$$\frac{\partial\, \mathbb{E}[r_{e,t}]}{\partial M_t(R_i)} = \sum_{b=1}^{K} \pi^{h}_{t}(R_b) \left( Q^{*}_{h}(R_b) - \bar{r}_{e,t} \right) \left( \mathbb{1}_{i=b} - \pi^{h}_{t}(R_i) \right) = \mathbb{E}\left[ (r_{e,t} - \bar{r}_{e,t}) \left( \mathbb{1}_{g_t = R_i} - \pi^{h}_{t}(R_i) \right) \right]. \quad (36)$$
In the training process, sampling is conducted every $n$ time steps, and the gradient in (33) is replaced by the value of the single sample. Therefore, we finally obtain the following update expression for the preference value:
$$M_{t+n}(R_i) \triangleq M_t(R_i) + \zeta\, (r_{e,t} - \bar{r}_{e,t}) \left( \mathbb{1}_{g_t = R_i} - \pi^{h}_{t}(R_i) \right). \quad (37)$$
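A compact Python sketch of the softmax goal-policy (32) and the sample-based preference update (37), using the running average success rate as the baseline; the class structure and parameter values are illustrative.

```python
import numpy as np

class GradientBandit:
    """Gradient bandit meta-controller over K relays, following (32) and (37)."""
    def __init__(self, K, zeta=0.1, rng=None):
        self.M = np.zeros(K)          # preference values M_t(R_i)
        self.zeta = zeta
        self.r_bar = 0.0              # running average of external rewards
        self.count = 0
        self.rng = rng if rng is not None else np.random.default_rng()

    def policy(self):
        e = np.exp(self.M - self.M.max())   # numerically stable softmax, eq. (32)
        return e / e.sum()

    def propose_goal(self):
        return self.rng.choice(len(self.M), p=self.policy())

    def update(self, goal, r_e):
        """Preference update (37) with the average reward as baseline."""
        self.count += 1
        self.r_bar += (r_e - self.r_bar) / self.count
        one_hot = np.eye(len(self.M))[goal]
        self.M += self.zeta * (r_e - self.r_bar) * (one_hot - self.policy())
```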
For the controller in the lower level, it learns an action-policy for selecting actions according to both the state and the goal, aiming to maximize the long-term discounted expected internal reward.

In order to reflect the differences in the values of different actions, we make some changes to the architecture of the traditional deep Q network. Inspired by [34], we further employ a dueling network, which can enhance the stability of the DRL algorithm by ignoring subtle changes in the environment and focusing on key states.

A schematic illustration of the dueling architecture is shown in Fig. 4. The input layer and hidden layers are the same as those in a traditional deep Q network. The key difference is that there is a sub-output layer in our dueling network, where the traditional Q-function output is separated into a state-goal valuation function $V(s_t, g_t)$ and an advantage evaluation function $A(s_t, g_t, a_t)$. In the state-goal valuation part, there is only one neuron, which represents the assessment of the current state and goal. In the advantage evaluation part, the number of neurons is equal to that in the output layer, representing the advantage of choosing each optional action.
Fig. 4: Dueling network in controller for low-level learning.

Considering the fact that $\mathbb{E}_{a_t \sim \pi^{l}}[Q(s_t, g_t, a_t)] = V(s_t, g_t)$, we have $\mathbb{E}_{a_t \sim \pi^{l}}[A(s_t, g_t, a_t)] = 0$.
In this section, we first introduce the setup of simulation environment. We then carry outexperiments to evaluate the proposed algorithms. lgorithm 2
Please refer to Algorithm 2 for the detailed procedure of the hierarchical algorithm.

Algorithm 2 HRL Based Relay Selection and Power Allocation
Initialize the experience replay buffers: $\mathcal{B}_h$ for the higher level, and $\mathcal{B}_l$ for the lower level.
Initialize the deep neural network parameters $\theta_l$ for the low-level Q network.
Initialize the exploration probability $\epsilon = 1$ and the annealing factor $\sigma$ for the controller.
for episode $u = 1, 2, \ldots, u_{\max}$ do
  Initialize the environment and obtain the initial state $s_1$.
  Choose goal $g_u$ according to policy $\pi^{h}$.
  for time slot $t = 1, 2, \ldots, t_{\max}$ do
    Set goal $g_t = g_u$ for the current time slot.
    Choose action $a_t$ using the epsilon-greedy method with parameter $\epsilon_{l, g_t}$.
    Execute action $a_t$, receive internal reward $r_{i,t}$ from the environment, and observe the next state $s_{t+1}$.
    Collect and save the tuple $(s_t, g_t, a_t, r_{i,t}, s_{t+1})$ in $\mathcal{B}_l$.
    Randomly choose a batch of indices, and sample transitions from $\mathcal{B}_l$.
    Perform gradient descent and update the low-level Q network according to (41).
    Update the current state.
    Update the internal exploration probability $\epsilon \leftarrow \epsilon - \sigma$.
  end for
  Calculate the external reward $r_e$ according to (31).
  Collect and save the tuple $(s_1, g_u, r_e, s_{t_{\max}})$ in $\mathcal{B}_h$.
  Read data from $\mathcal{B}_h$ and update the preference values according to (37).
  Update the probability distribution according to (32).
end for

VII. EVALUATION

In this section, we first introduce the setup of the simulation environment. We then carry out experiments to evaluate the proposed algorithms.
A. Experiment Setup
Similar to [26], in the two-hop cooperative relay network, the channel vectors between any two nodes in each time slot are calculated according to formula (15), where the correlation coefficient $\rho$ is set to 0.95. The maximum total transmission power $P_{\max}$, which is the sum of the power of the source and the relay, is limited to 4 W. On the other hand, we set the outage threshold to $\lambda = 2.0$ during the training process, which means an outage occurs when the MI is lower than 2.0.

To implement our proposed framework, the learning step size $\zeta$ in the high-level meta-controller is set to 0.1. The elements of the preference value vector are initialized to 0, and the vector is updated after each inner loop is completed.

In the low-level controller, we use two separate dueling deep Q networks, which share the same structure. Both dueling deep Q networks include two hidden layers, each of which has 50 neurons, and we employ the ReLU function as the activation function for all hidden layers. The number of neurons in the input layer is equal to the sum of the numbers of states and goals, and the number of neurons in the output layer corresponds to the dimension of the low-level action.

Note that the deep Q network requires the action space to be discrete, so we have discretized the power allocation in the environment and set $L$ different power levels for the agent to choose from. For comparison, we use the following methods as baselines in our experiments.

Random Selection: For each time slot, the agent randomly selects a relay to perform cooperative communication with random transmission power.
DRL Based Approach: Our DRL method for minimizing the outage probability is proposed in Section V. We employ the traditional DQN framework to make relay selection and power allocation at the same time, and it is now used as one of the baseline methods.
B. Numerical Results
Considering that our DRL method and HRL method both use the structure of a deep Q network, we first study the influence of different hyper-parameters on the convergence performance, to obtain the optimal network structure. Note that the average success rate with each hyper-parameter value is tested 10 times, and the mean curves and ranges are recorded.

First of all, the learning rate for updating the network parameters should have an appropriate value. If the learning rate is too large, such as 0.1 (orange solid line), it leads to a local optimum. If the learning rate is too small, such as 0.0001 (pink dotted line), it takes much more time to converge. As shown in Fig. 5(a), we finally set the learning rate to 0.001 for the following simulations.

The experience replay buffer stores the experience tuples obtained by the agent. In Fig. 5(b), we study the effect of the replay buffer size, i.e., the memory size, on the convergence performance. Unlike Fig. 5(a), different memory sizes have little influence on the final value to which the average success rate converges. Therefore, we directly select the replay buffer size as 8000.

When training, a batch of data is sampled from the experience buffer to improve the DNN. In Fig. 5(c), we further fix the memory size, and study the effects of different batch sizes during training on the convergence performance.
Fig. 5: Average success rate under different parameters: (a) learning rate, (b) memory size, (c) batch size, (d) training interval.

It can be found that training with a small batch size cannot take advantage of all the data stored in the experience buffer, and converges slowly, while training with a large batch size, such as 256 (pink dotted line), has the fastest convergence speed, although it consumes much more time during the training process.

Finally, we investigate the convergence performance under different training intervals, as shown in Fig. 5(d). Theoretically, the shorter the training interval, the faster the convergence speed. However, a shorter training interval also means more training iterations, which results in a certain waste of computing resources. On the other hand, we find that the final convergence values with training intervals of 5 and 10 are very close. Considering the above reasons, we finally set the training interval to 10.

We set the above hyper-parameters to their optimal values and apply them to all deep Q networks, and then carry out the following experiments. In the training process, we evaluate the performance of the different methods in the AF environment and the DF environment, with relay number $K = 10$ and power level number $L = 10$. The result is depicted in Fig. 6.
Fig. 6: Average success rate using different protocols.

It can be observed that when using the random selection method, the performance is always very poor. On the other hand, both the DRL method and our hierarchical algorithm can be effectively trained, and their average success rate curves eventually converge to a stable value with slight fluctuations.

However, our HRL method can achieve a lower outage probability. Taking the DF protocol as an example, with the DRL method the average success rate is only about 0.82, which means the outage probability of the communication system is about 0.18. When employing our hierarchical method, the average success rate is closer to 0.9. On the other hand, our HRL method has an obviously faster learning speed, converging after about 10 iterations, while the DRL method needs about 40 iterations to reach the convergence value.

When comparing the results under different protocols, Fig. 6 shows that there is little difference in the convergence speed of the same method, but the fluctuation of the convergence value using the DRL method is larger. We also observe that the average success rate using the AF protocol is generally lower than that using the DF protocol, which leads to fewer successful experiences for the agents to learn from. The performance of the DRL method is obviously affected by this factor, but our HRL method can still converge to a value with smaller fluctuations. Moreover, in the case of the AF protocol, the average communication success rate obtained by our HRL method is noticeably higher than that obtained by the DRL method; compared with the DF protocol, the performance gap between the two methods under the AF protocol is larger. In all, our HRL agent can learn a better strategy faster for dynamic relay selection and power allocation, in both the AF and DF communication environments.

Then, we evaluate the performance of our proposed hierarchical method and the DRL method under different search space scales. We conduct this experiment in the DF communication environment. As shown in Fig. 7, we study two scenarios where the number of relays $K$ and the number of power levels $L$ are both set to 10 or 20.
Fig. 7: Outage probability under different relay number $K$ and power level number $L$.

As $K$ and $L$ increase, there is a small increase in the average success rate of our HRL method, which rises from 0.87 to 0.91. With more optional power levels $L$, transmission power can be allocated more efficiently at the source and relay nodes, resulting in an improved communication success rate. In addition, Fig. 7 also vividly shows that the result curve obtained by our HRL method is much smoother than that obtained by the DRL method.

On the other hand, the DRL method performs worse under larger $K$ and $L$, where the average success rate drops noticeably after convergence. In a larger search space, it becomes more difficult for the agent to select the appropriate relay and power simultaneously. As a result, there are fewer successful explorations, which leads to a problem of sparse reward. Therefore, it takes more training iterations for the DRL agent to converge, and the fluctuation becomes larger. It is worth noting that the number of successful explorations is also small when using the AF protocol (in Fig. 6), but that situation is actually different from the one shown in Fig. 7. In the previous experiment, even if the optimal action is taken, there is still a high probability that the communication will fail due to the uncertainty of the channel. However, in this experiment, an optimal action with a good return exists, but the agent may not be able to find this action policy because the search space is too large. The problem of sparse reward usually refers to the latter situation.

Traditional DRL methods usually perform poorly in environments with sparse rewards, due to the lack of positive experience to learn from. However, by introducing different hierarchies, our method reduces the complexity of the search space, which ensures the efficiency of exploration and learning. Therefore, when employing our proposed hierarchical method, we can still obtain a stable behavior policy for relay selection and power allocation.

After 100 iterations of training, we obtain dynamic relay selection and power allocation policies by applying both the HRL method and the DRL based method. To further evaluate the robustness of the different methods, we test the performance of these well-trained policies under different outage thresholds, and the result is depicted in Fig. 8.

This experiment is conducted in the DF environment, and the communication outage threshold ranges from 1.6 to 2.4 in the testing process. The only difference between testing and training is that the parameters of all networks are fixed in the testing process, which means the DNN is only used to provide the best action rather than executing further learning.

As we can see from Fig. 8, both the HRL policy and the DRL policy trained in the smaller search space can be applied to other situations. However, the DRL policy trained in the larger search space performs poorly when testing, while we can still obtain proper actions in different environments by following our HRL policy.

Taking a fixed threshold in this range as an example, the outage probability under the different policies differs markedly. In terms of the small search space, the outage probability using the HRL policy is lower than 0.03, while those using the DRL policy and random selection are about 0.07 and 0.8, respectively. In terms of the large search space, the outage probability using the HRL policy is about 0.04, while those using the DRL policy and random selection are 0.8 and nearly 1.0, respectively.
It is obvious that our HRL method is more robust and can greatly reduce the outage probability, which means that the HRL agent can perform better relay selection and adjust power allocation more reasonably according to the current state after training.

Fig. 8: Outage probability under different outage thresholds $\lambda$.

VIII. CONCLUSION AND FUTURE WORKS
In this paper, we propose an HRL method to dynamically select the relay and allocate transmission power in a two-hop cooperative communication model, in order to minimize the outage probability under a total transmission power constraint. Unlike traditional studies, our method does not require any assumptions about the channel distribution, but relies on the interaction between the agent and the communication environment. Compared with existing RL-based methods, we propose an outage-based reward function. Our reward function uses only a binary reward that indicates the result of communication, while other RL-based methods require concrete representations of feedback information, such as instantaneous SNR or MI. We further design an HRL framework by decomposing relay selection and power allocation into two sub-tasks, which reduces the search space. Simulation results show that our HRL method can reduce the outage probability and reach convergence 30 iterations earlier than the DRL method in both the AF and DF communication environments. In addition, the hierarchical method can effectively solve the problem of sparse reward, while other methods can hardly deal with it.

Our HRL method provides a novel way for the research of resource allocation and optimization in the field of communication. However, the total transmission power is discretized into enumerable power levels in our framework, which can be improved further. In future works, we would like to explore new methods applicable to continuous action spaces.

REFERENCES
[1] F. Zhong, X. Xia, H. Li, and Y. Chen, "Distributed linear convolutional space-time coding for two-hop full-duplex relay 2x2x2 cooperative communication networks," IEEE Transactions on Wireless Communications, vol. 17, no. 5, pp. 2857–2868, May 2018.
[2] C. Wang, T. Cho, T. Tsai, and M. Jan, "A cooperative multihop transmission scheme for two-way amplify-and-forward relay networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 9, pp. 8569–8574, Sept. 2017.
[3] Y. Liu, E. Liu, and R. Wang, "Energy efficiency analysis of intelligent reflecting surface system with hardware impairments," in IEEE Global Communications Conference (GLOBECOM), Taipei, Taiwan, December 2020.
[4] Y. Shi, A. Konar, N. D. Sidiropoulos, X. Mao, and Y. Liu, "Learning to beamform for minimum outage," IEEE Transactions on Signal Processing, vol. 66, no. 19, pp. 5180–5193, Oct. 2018.
[5] Y. Liu, E. Liu, R. Wang, and Y. Geng, "Beamforming designs and performance evaluations for intelligent reflecting surface enhanced wireless communication system with hardware impairments," arXiv preprint arXiv:2006.00664, 2020.
[6] J. Jedrzejczak, G. J. Anders, M. Fotuhi-Firuzabad, H. Farzin, and F. Aminifar, "Reliability assessment of protective relays in harmonic-polluted power systems," IEEE Transactions on Power Delivery, vol. 32, no. 1, pp. 556–564, Feb. 2017.
[7] S. N. Islam, M. A. Mahmud, and A. M. T. Oo, "Relay aided smart meter to smart meter communication in a microgrid," in IEEE International Conference on Smart Grid Communications (SmartGridComm), Sydney, NSW, Australia, Nov. 2016, pp. 128–133.
[8] P. Das and N. B. Mehta, "Direct link-aware optimal relay selection and a low feedback variant for underlay CR," IEEE Transactions on Communications, vol. 63, no. 6, pp. 2044–2055, Jun. 2015.
[9] A. Bletsas, A. Khisti, D. P. Reed, and A. Lippman, "A simple cooperative diversity method based on network path selection," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 659–672, Mar. 2006.
[10] C. Wang and J. Chen, "Power allocation and relay selection for AF cooperative relay systems with imperfect channel estimation," IEEE Transactions on Vehicular Technology, vol. 65, no. 9, pp. 7809–7813, Sept. 2016.
[11] O. Amin, S. S. Ikki, and M. Uysal, "On the performance analysis of multirelay cooperative diversity systems with channel estimation errors," IEEE Transactions on Vehicular Technology, vol. 60, no. 5, pp. 2050–2059, Jun. 2011.
[12] M. Seyfi, S. Muhaidat, and J. Liang, "Amplify-and-forward selection cooperation over Rayleigh fading channels with imperfect CSI," IEEE Transactions on Wireless Communications, vol. 11, no. 1, pp. 199–209, Jan. 2012.
[13] F. S. Tabataba, P. Sadeghi, and M. R. Pakravan, "Outage probability and power allocation of amplify and forward relaying with channel estimation errors," IEEE Transactions on Wireless Communications, vol. 10, no. 1, pp. 124–134, Jan. 2011.
[14] F. Shams, G. Bacci, and M. Luise, "Energy-efficient power control for multiple-relay cooperative networks using Q-learning," IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1567–1580, Mar. 2015.
[15] X. Wang, T. Jin, L. Hu, and Z. Qian, "Energy-efficient power allocation and Q-learning-based relay selection for relay-aided D2D communication," IEEE Transactions on Vehicular Technology, vol. 69, no. 6, pp. 6452–6462, Jun. 2020.
[16] Y. Su, X. Lu, Y. Zhao, L. Huang, and X. Du, "Cooperative communications with relay selection based on deep reinforcement learning in wireless sensor networks," IEEE Sensors Journal, vol. 19, no. 20, pp. 9561–9569, Oct. 2019.
[17] Y. Su, M. LiWang, Z. Gao, L. Huang, X. Du, and M. Guizani, "Optimal cooperative relaying and power control for IoUT networks with reinforcement learning," IEEE Internet of Things Journal, pp. 1–1, Jul. 2020.
[18] Y. Hua, R. Li, Z. Zhao, X. Chen, and H. Zhang, "GAN-powered deep distributional reinforcement learning for resource management in network slicing," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 334–349, Feb. 2020.
[19] L. P. Qian, A. Feng, X. Feng, and Y. Wu, "Deep RL-based time scheduling and power allocation in EH relay communication networks," in IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, pp. 1–7.
[20] L. Huang, S. Bi, and Y. J. A. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Transactions on Mobile Computing, vol. 19, no. 11, pp. 2581–2593, Nov. 2020.
[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[22] H. A. Suraweera, T. A. Tsiftsis, G. K. Karagiannidis, and A. Nallanathan, "Effect of feedback delay on amplify-and-forward relay networks with beamforming," IEEE Transactions on Vehicular Technology, vol. 60, no. 3, pp. 1265–1271, Mar. 2011.
[23] A. Ribeiro, X. Cai, and G. B. Giannakis, "Symbol error probabilities for general cooperative links," IEEE Transactions on Wireless Communications, vol. 4, no. 3, pp. 1264–1273, May 2005.
[24] J. Boyer, D. D. Falconer, and H. Yanikomeroglu, "Multihop diversity in wireless relaying channels," IEEE Transactions on Communications, vol. 52, no. 10, pp. 1820–1830, Oct. 2004.
[25] R. Annavajjala, P. C. Cosman, and L. B. Milstein, "Statistical channel knowledge-based optimum power allocation for relaying protocols in the high SNR regime," IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 292–305, Feb. 2007.
[26] Z. Chen and X. Wang, "Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach," arXiv preprint arXiv:1812.07394, 2018.
[27] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[28] L. Espeholt, H. Soyer, R. Munos et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," in International Conference on Machine Learning (ICML), Stockholm, Sweden, Jul. 2018, pp. 1407–1416.
[29] V. Mnih, A. P. Badia, and M. Mirza, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning (ICML), New York City, NY, USA, Jun. 2016, pp. 1928–1937.
[30] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[31] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Neural Information Processing Systems (NIPS), Barcelona, Spain, Dec. 2016, pp. 3675–3683.
[32] O. Nachum, S. Gu, H. Lee, and S. Levine, "Data-efficient hierarchical reinforcement learning," in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 3303–3313.
[33] N. Dilokthanakul, C. Kaplanis, N. Pawlowski, and M. Shanahan, "Feature control as intrinsic motivation for hierarchical reinforcement learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3409–3418, Nov. 2019.
[34] Z. Wang, T. Schaul et al., "Dueling network architectures for deep reinforcement learning," in International Conference on Machine Learning (ICML), New York City, NY, USA, Jun. 2016.