Centralized & Distributed Deep Reinforcement Learning Methods for Downlink Sum-Rate Optimization
Ahmad Ali Khan,
Student Member, IEEE,
Raviraj Adve,
Fellow, IEEE
Abstract
For a multi-cell, multi-user cellular network, downlink sum-rate maximization through power allocation is a nonconvex and NP-hard optimization problem. In this paper, we present an effective approach to solving this problem through single- and multi-agent actor-critic deep reinforcement learning (DRL). Specifically, we use finite-horizon trust region optimization. Through extensive simulations, we show that we can simultaneously achieve higher spectral efficiency than state-of-the-art optimization algorithms like weighted minimum mean-squared error (WMMSE) and fractional programming (FP), while offering execution times more than two orders of magnitude faster than these approaches. Additionally, the proposed trust region methods demonstrate superior performance and convergence properties compared to the Advantage Actor-Critic (A2C) DRL algorithm. In contrast to prior approaches, the proposed decentralized DRL approaches allow for distributed optimization with limited CSI and controllable information exchange between BSs while offering competitive performance and reduced training times.
Index Terms
Deep reinforcement learning, optimization, sum-rate maximization
I. INTRODUCTION
Improving spectral efficiency is a central focus of modern research in wireless communication. As device density grows and more consumer-centric, data-intensive applications become popular, it is clear that coordinated resource allocation is necessary to achieve this goal. Unfortunately, developing effective techniques remains fundamentally challenging due to the nature of sum-rate maximization: not only is the associated optimization problem for the broadcast channel non-convex, but, as proved by Luo and Zhang in [1], it is also NP-hard. It follows that methods to find the globally optimal solution to this problem, such as outer polyblock approximation, are computationally prohibitive and generally impractical for cellular networks involving multiple cells and multiple users [2].

The solutions proposed in the literature balance a tradeoff between exchange of channel state information (CSI), computational complexity, and performance. On the one hand, heuristic methods such as random, greedy, and max-power allocation are extremely simple to implement and require no exchange of CSI between cooperating base stations [3]. However, since they ignore the impact of inter-cell and intra-cell interference, they tend to have correspondingly poor performance.

In contrast, centralized optimization algorithms like weighted minimum mean-squared error minimization (WMMSE) and fractional programming (FP) offer much better performance since they take interference into account. In particular, the work by Shi et al. in [4] shows the equivalence of weighted sum-rate maximization and weighted mean-squared error minimization and develops an iterative algorithm for finding a resource allocation strategy. Somewhat similarly, in [5], Shen and Yu develop the fractional programming approach, based on minorization-maximization methods, to solve the downlink power allocation problem iteratively. Both of these approaches are guaranteed to converge to a local optimum of the original max sum-rate problem and, as such, offer excellent performance. At the same time, this performance comes with two undesirable compromises: first, these algorithms need to optimize across all cooperating BSs and thus have very high (albeit polynomial-time) computational complexity; and second, they require extensive exchange of network CSI between cooperating BSs. In other words, they assume that all the cooperating base stations are connected via high-speed backhaul links to a central cloud processor that receives the complete downlink CSI from the BSs and computes and forwards their coordinated power allocation based on this information exchange [6]. As a result, due to both the computational complexity and the CSI exchange requirements, these methods do not scale well for deployment in real-world wireless networks, where ensuring this stringent level of cooperation and shared information across a large number of cells is infeasible. To the best of our knowledge, FP is the state-of-the-art among locally optimal schemes [6].

To reduce the information exchange requirements, distributed optimization algorithms have been studied extensively in the literature. One approach is to model the sum-rate maximization problem as a noncooperative game, in which the BSs attempt to reach a socially acceptable power allocation strategy through best-response dynamics without having access to the complete network CSI [7], [8]. The work in [7], for example, analyzes the conditions under which a Nash equilibrium can be reached for a discretized variant of the sum-rate maximization problem in which BSs have limited access to the network CSI. On the other hand, as noted by the authors, Nash equilibria and Pareto optimal points are not necessarily global or even local optima of the problem; thus, game-theoretic distributed optimization methods tend to have noticeably worse performance than centralized methods like FP and WMMSE.
This is particularlytrue when BSs have access to only partial or imperfect CSI [8].Distributed methods for wireless resource management have also been developed under the frameworksof robust and stochastic optimization [9]–[11]. Similar to game-theoretic approaches, however, thesemethods achieve performance inferior to centralized optimization approaches as the number of usersbecomes larger [11]. It is worth emphasizing that distributed optimization implementations of high-performance methods like FP and WMMSE are not possible: as observed in [6], convergence to a localoptimum of the sum-rate maximization problem is only guaranteed if the BSs have access to the fullnetwork CSI.Resource allocation schemes based on deep supervised learning have been recently explored as a meansto overcome some of these challenges [3], [12], [13]. In [3], Sun et al. utilized supervised regression toenable a deep feedforward neural network to approximate the locally optimal power allocation solutionachieved by the WMMSE algorithm. Similarly, in [12], the authors utilize a deep supervised learningalgorithm to approximate the globally optimal power allocation strategy for sum-rate maximization inthe uplink of a massive MIMO network. In both of these cases, the nonlinear function approximationproperty of neural networks was successfully leveraged to learn a mapping from the input CSI to thedesired solution of the problem.While such an approach reduces execution time once the trained model is deployed, it also suffersfrom an obvious drawback: the level of CSI exchange required is identical to the optimization algorithmssupervised learning aims to mimic. A further disadvantage is that the supervised learning algorithm in[3] can at best match the performance of the WMMSE algorithm, not exceed it. For the approach usedin [12], a major concern is generating samples of the globally optimal solution for the algorithm to learnfrom, since, as mentioned earlier, the original optimization problem is NP -hard. In [13], Mattheisen etal. aim to mitigate this problem by using branch-and-bound to find globally optimal solutions to the(also NP -hard) energy efficiency maximization problem; however, the computational cost of generatinga sufficiently large number of samples to enable effective supervised learning for a deep neural networkremains extremely high. Due to the issues associated with supervised learning methods, deep reinforcement learning (DRL) hasemerged as a viable alternative to solve wireless resource management problems [14]–[17]. Crucially,DRL methods do not require prior solutions to the problem unlike supervised learning methods; instead,the algorithms improve over time through a mechanism of feedback and interaction. For example, in[14], the authors present a multi-agent deep Q-learning (DQL) approach to solve the problem of linkspectral efficiency maximization; the proposed approach provides performance that closely matches theFP algorithm but requires discretization of the power control variables. In a similar fashion, the work in[15] explores the use of DQL, REINFORCE, and deep deterministic policy gradient (DDPG) algorithmsfor link sum-rate maximization. 
The DQL and REINFORCE algorithms presented once again requirediscretization of the power allocation variables; this introduces uncertainty since there is no known methodfor choosing the optimal discretization factor.A major issue with the approaches presented in [14] and [15] is that direct optimization of the networkspectral efficiency is not possible; instead, as a proxy, each transmitter aims to maximize the differencebetween its own spectral efficiency and the weighted leakage to other transmitters. This proxy objectivefunction is only proportional to the sum-rate when the number of transmitter-receiver pairs is very large.Additionally, the approaches proposed require CSI exchange between the transmitters to function. Thelearning algorithms presented also require significant feature engineering; for example, in [15], in additionto the current CSI, historical power allocation values are needed in order to output current power allocationvalues; this adds additional requirements to the already burdensome information exchange problem.Finally, we observe that the aforementioned deep learning methods do not allow for distributed op-timization with limited CSI available at each BS. For example, the supervised learning method appliedin [3] to approximate the WMMSE algorithm can only be implemented in a centralized fashion which,once again, requires complete CSI exchange between all BSs in the network. As the authors of [3]demonstrate, supervised learning would be unable to succeed in approximating the WMMSE algorithm ina decentralized environment, since there would not be a one-to-one mapping between the CSI input andpower allocation output at each BS. Similarly, the reinforcement learning approaches implemented in [14]and [15] require full CSI exchange between BSs, in addition to prior channel state information (and per-cellspectral efficiency) from previous time slots. Furthermore, to compute the power allocation for each link,the transmitter requires a fixed number of the strongest interference channels as input; this necessitates theexchange of CSI between BSs even after the agents have been deployed and training is complete. These processing and information exchange requirements necessarily prevent the implementation of distributedoptimization of the network spectral efficiency. Thus, we conclude that the current optimization approaches,both learning-based and otherwise, compromise by either having prohibitively high computation andinformation exchange requirements or poor performance. It is this gap in the literature that we aim toaddress with our work.In this paper, we present a class of DRL methods based on Markov Decision Processes (MDP), trust-region policy optimization (TRPO) for downlink power allocation. In contrast to the deep Q-learningmethod, trust region methods are capable of directly solving continuous-valued problems; this removesthe problem of decision spaces growing exponentially larger as the discretization of the optimizationvariables is increased. Compared to the REINFORCE approach, trust region methods demonstrate greaterrobustness since we can ensure that updates meet practically justifiable constraints; thus, they have beenshown to achieve far higher performance across a variety of continuous-control reinforcement learningproblems [18], [19]. 
Specifically, we utilize finite-horizon trust region methods to derive centralized and decentralized algorithms for sum-rate optimization which require varying degrees of CSI exchange between BSs.

The contributions of this paper can be summarized as follows:

• We demonstrate that the use of finite-horizon trust region policy optimization is effective in directly solving the sum-rate maximization problem in a multi-cell, multi-user environment. Unlike previous deep RL-based attempts, which necessitate discretization of the power variables and the use of proxy objective functions, we solve the continuous-valued power allocation problem with the network sum-rate taken directly as the reward function, thereby simplifying the algorithm design and implementation.

• The approaches we present provide notably higher spectral efficiency than the state-of-the-art FP and WMMSE algorithms across different network sizes.

• We show that agents trained using the proposed approach produce solutions to the sum-rate maximization problem over two orders of magnitude faster than the aforementioned conventional optimization algorithms. The proposed TRPO methods also demonstrate lower variance and higher performance than the conventional Advantage Actor-Critic (A2C) algorithm.

• The multi-agent distributed approaches allow us to flexibly control information exchange between BSs, which is not possible in prior distributed optimization methods or prior deep learning approaches. In particular, the partially decentralized strategy we propose allows BSs to exchange only power allocation information to effectively improve the network sum-rate while utilizing only local
CSI at each BS.

• The trained DRL methods demonstrate robust performance under changes in the reward function (i.e., the network sum-rate) across a range of BS transmit powers.

This paper is organized as follows: in Section II, we present the system model and formulate the network sum-rate maximization problem. In Section III, we describe the Markov Decision Process (MDP) framework and derive the proposed centralized and decentralized DRL algorithms. This is followed by training, performance, and execution time results and comparisons in Section IV. We draw some conclusions in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION
Our system model is similar to that used in [5]. We consider a time-division duplexed network comprising $B$ single-antenna base stations and $K$ single-antenna users per cell, uniformly distributed over the cell area. The combined downlink channel gain from BS $b$ to user $k$ associated with base station $b'$ in the $n$th time slot is denoted by $h^{(n)}_{b \to k, b'}$ and given by
$$h^{(n)}_{b \to k, b'} = g^{(n)}_{b \to k, b'} \sqrt{\beta^{(n)}_{b \to k, b'}}$$
where $g^{(n)}_{b \to k, b'} \sim \mathcal{CN}(0, 1)$ represents the complex-valued small-scale Rayleigh fading component and $\beta^{(n)}_{b \to k, b'}$ represents the real-valued pathloss component. The latter is given by:
$$\beta^{(n)}_{b \to k, b'} = \left( \frac{d^{(n)}_{b \to k, b'}}{d_0} \right)^{-\alpha}$$
where $d^{(n)}_{b \to k, b'}$ denotes the Euclidean distance in meters between the base station and user in question, $d_0$ is a reference distance, and $\alpha$ is the pathloss exponent. Hence, $h^{(n)}_{b \to k, b}$ denotes the information-bearing channel from BS $b$ to user $k$ within its cell, while $h^{(n)}_{b \to k, b'}$ indicates the interference channel from the same BS to user $k$ being served by BS $b'$. Furthermore, we assume that the Rayleigh fading coefficients of the information-bearing and interference channels are independent across users and time slots.

For notational convenience, the network CSI in the $n$th time slot is denoted by the $K \times B \times B$ tensor $\mathbf{H}^{(n)}$, i.e.,
$$\mathbf{H}^{(n)} = \left\{ h^{(n)}_{b \to k, b'} \;\middle|\; b, b' = 1, \ldots, B;\; k = 1, \ldots, K \right\}$$
The transmit power allocated by BS $b$ to user $k$ within its cell is denoted by $p_{b \to k}$ and is similarly collected in the $K \times B$ matrix $\mathbf{P}^{(n)}$. Assuming a receiver noise power of $z^{(n)}_{k,b}$ at user $k$ served by BS $b$, the network sum-rate for the $n$th time slot is given by:
$$R\left(\mathbf{H}^{(n)}, \mathbf{P}^{(n)}\right) = \sum_{(b,k)} \log \left( 1 + \frac{p^{(n)}_{b \to k} \left| h^{(n)}_{b \to k, b} \right|^2}{\sum_{(b',k') \neq (b,k)} p^{(n)}_{b' \to k'} \left| h^{(n)}_{b' \to k, b} \right|^2 + z^{(n)}_{k,b}} \right)$$
Based on these definitions, the goal of maximizing the network spectral efficiency for a single time slot can be expressed as the following optimization problem:
$$\text{maximize}_{\mathbf{P}^{(n)}} \quad R\left(\mathbf{H}^{(n)}, \mathbf{P}^{(n)}\right) \qquad \text{(1a)}$$
$$\text{subject to} \quad 0 \leq p^{(n)}_{b \to k} \leq P_{\max}, \quad k = 1, \ldots, K;\; b = 1, \ldots, B \qquad \text{(1b)}$$
where the constraint in (1b) ensures that the power allocated to each of the users is nonnegative and does not exceed the maximum allowed transmit power. As stated earlier, this optimization problem has been shown to be non-convex and NP-hard in [1].
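As a concrete illustration of the objective in (1), the short sketch below evaluates the network sum-rate for randomly drawn channels and a feasible power allocation. The toy dimensions, noise power, pathloss values, and the use of log base 2 (i.e., spectral efficiency rather than a rate in Mbps) are illustrative assumptions, not the simulation parameters of Section IV.

```python
import numpy as np

def network_sum_rate(h, p, noise_power):
    """Sum spectral efficiency for channels h[b, k, b_prime] and powers p[b, k].

    h[b, k, b_prime] is the channel from BS b to user k served by BS b_prime;
    p[b, k] is the power BS b allocates to its k-th user.
    """
    B, K, _ = h.shape
    rate = 0.0
    for b in range(B):
        for k in range(K):
            signal = p[b, k] * np.abs(h[b, k, b]) ** 2
            # Interference: every other (BS, user) transmission, received over the
            # channels terminating at user k served by BS b.
            interference = 0.0
            for bp in range(B):
                for kp in range(K):
                    if (bp, kp) != (b, k):
                        interference += p[bp, kp] * np.abs(h[bp, k, b]) ** 2
            rate += np.log2(1.0 + signal / (interference + noise_power))
    return rate

# Toy example with Rayleigh fading and assumed pathloss values.
rng = np.random.default_rng(0)
B, K, P_MAX, NOISE = 3, 2, 1.0, 1e-3
beta = rng.uniform(1e-3, 1.0, size=(B, K, B))
g = (rng.standard_normal((B, K, B)) + 1j * rng.standard_normal((B, K, B))) / np.sqrt(2)
h = g * np.sqrt(beta)
p = rng.uniform(0.0, P_MAX, size=(B, K))   # a feasible point of constraint (1b)
print(network_sum_rate(h, p, NOISE))
```

Exhaustively searching over the power matrix with such a routine quickly becomes intractable, which is precisely the difficulty the learning-based approaches below are designed to avoid.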
III. PROPOSED DEEP REINFORCEMENT LEARNING APPROACH

A. Markov Decision Process (MDP)
Our proposed DRL-based algorithms employ trust-region methods to solve problem (1) in both single-agent centralized and multi-agent distributed fashion. Both these variants are modeled using the Markov Decision Process (MDP) framework; accordingly, in this section we define some related terminology that will be used throughout this paper.

An agent is an entity capable of processing information from its environment and taking decisions directed towards the maximization of a chosen reward function. The agent interacts with its environment at discrete time instants $n = 1, \ldots, N$ (where $N$, the episode length, is fixed). The interaction model of the agent with its environment can be described as follows:

• At time step $n$, the agent is assumed to be in state $s_n \in \mathcal{S}$. The state comprises the information that the agent has access to and defines its situation relative to its environment. The state space, $\mathcal{S}$, is defined as the set of all possible states that can be encountered by the agent.

• Acting only upon the state at time step $n$, the agent takes action $a_n \in \mathcal{A}$ from the set of all possible actions, i.e., the action space $\mathcal{A}$. As illustrated in Figure 1, the action is sampled from a conditional probability distribution known as the policy, denoted by $\pi_\theta(a_n | s_n)$, where the subscript indicates that the policy is parameterized through the (generally tensor-valued) variable $\theta$. In this work, we use the terms policy and policy network interchangeably since a deep neural network is used to parametrize the policy.

Fig. 1: Graphical model illustrating transitions in a Markov Decision Process.

• We assume a Markov state transition model; thus, the probability of transition to $s_{n+1}$ is assumed to depend only on the current state, $s_n$, and current action, $a_n$:
$$p(s_{n+1} | s_n, a_n, \ldots, s_1, a_1) = p(s_{n+1} | s_n, a_n) \qquad \text{(2)}$$
Note that the transition probability can be completely deterministic based on the agent's current action; likewise, it may be completely independent of the current action [20]. This Markov transition property is depicted further in Figure 1. The state transition likelihoods are contained in the transition operator $\mathcal{T}$. Furthermore, the state distribution arising as a result of the policy $\pi_\theta$ and environmental transition dynamics is denoted as $d_{\pi_\theta}$ and referred to as the policy state distribution.

• At the end of the time step, the agent receives a scalar reward, $r(s_n, a_n)$, which is a function of the current state and current action, i.e., $r(s_n, a_n): \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$.

Taken together, the tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$ defines an MDP. Using the Markov property of state transitions, the likelihood of observing a given sequence of states and actions in the MDP can be decomposed as:
$$p(s_1, a_1, \ldots, s_N, a_N) = p(s_1) \prod_{n'=1}^{N} p(s_{n'+1} | s_{n'}, a_{n'}) \, \pi_\theta(a_{n'} | s_{n'}) \qquad \text{(3)}$$
The average discounted reward achieved over an episode is a function of $\theta$ and is given by
$$\bar{R}(\theta) = \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_\theta}} \left[ \sum_{n'=1}^{N} \gamma^{n'-1} r(s_{n'}, a_{n'}) \right] \qquad \text{(4)}$$
where $\gamma \in [0, 1]$ is a discount factor controlling the value of future rewards relative to current rewards. The subscripts $a_n \sim \pi_\theta$ and $s_n \sim d_{\pi_\theta}$ indicate that the expectation is with respect to actions sampled from the policy and states sampled from the policy state distribution.

We also define the state-action value function, the state value function, and the advantage function. The state-action value function $Q_{\pi_\theta}(s_n, a_n)$ (also known as the $Q$-function) is the total expected reward that can be accumulated by taking action $a_n$ from state $s_n$ and following the policy $\pi_\theta$ in subsequent timesteps, i.e.,
$$Q_{\pi_\theta}(s_n, a_n) \triangleq \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_\theta}} \left[ \sum_{n'=n}^{N} \gamma^{n'-n} r(s_{n'}, a_{n'}) \,\middle|\, s_n, a_n \right] \qquad \text{(5)}$$
The expectation of the state-action value function, $V_{\pi_\theta}(s_n)$, over all possible actions with respect to the policy and its corresponding state distribution is defined as the state-value function and is given by:
$$V_{\pi_\theta}(s_n) \triangleq \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_\theta}} \left[ \sum_{n'=n}^{N} \gamma^{n'-n} r(s_{n'}, a_{n'}) \,\middle|\, s_n \right] = \mathbb{E}_{a_n \sim \pi_\theta} \left[ Q_{\pi_\theta}(s_n, a_n) \right] \qquad \text{(6)}$$
In other words, $Q_{\pi_\theta}(s_n, a_n)$ indicates the expected value of choosing a particular action $a_n$ in timestep $n$ and afterwards following the policy, whereas $V_{\pi_\theta}(s_n)$ indicates the expected reward that can be obtained by simply following the policy throughout.

The difference between the state-action value function and the state-value function is called the advantage function and is denoted by $A_{\pi_\theta}(s_n, a_n)$:
$$A_{\pi_\theta}(s_n, a_n) \triangleq Q_{\pi_\theta}(s_n, a_n) - V_{\pi_\theta}(s_n) \qquad \text{(7)}$$
Intuitively, the advantage function can be interpreted as a measure of how much better an action $a_n$ is than the average action chosen by the policy $\pi_\theta$.

The goal in reinforcement learning is to find the optimum value of the policy variable $\theta$ that leads to maximization of the expected reward:
$$\theta^* = \arg\max_{\theta'} \bar{R}(\theta') \qquad \text{(8)}$$
Direct differentiation of the expected reward $\bar{R}(\theta)$ with respect to the policy parameter yields a closed-form expression for the policy gradient, a Monte-Carlo sampled estimate of which can then be used to update the policy through the well-known REINFORCE algorithm as developed by Williams [21]. In other words, the REINFORCE algorithm attempts to change the policy parameter in the direction of increasing reward as obtained from samples of the policy. Actor-critic algorithms employ a similar approach, but utilize a separate function approximator (typically also a deep neural network with parameters $\phi$), called the value network, to estimate the state-action value function $Q_{\pi_\theta}(s_n, a_n)$ for gradient computation; it can be shown that this results in estimates of the gradient with lower variance and hence improved performance [22].
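For concreteness, the following sketch computes the finite-horizon discounted return and a simple Monte-Carlo advantage estimate of the form in (7), using a critic's value predictions as the baseline. The per-step rewards and value estimates are placeholder numbers, and practical actor-critic implementations may use more elaborate advantage estimators.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_n = sum over n' >= n of gamma^(n'-n) * r_{n'} for a finite-horizon episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for n in reversed(range(len(rewards))):
        running = rewards[n] + gamma * running
        returns[n] = running
    return returns

def advantage_estimates(rewards, values, gamma):
    """Monte-Carlo advantage estimate A_n ~ G_n - V(s_n), with V from a critic."""
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)

# Toy episode: rewards play the role of per-step sum-rates, values come from a critic.
rewards = [95.0, 102.3, 99.1]
values = [97.0, 100.0, 98.5]
print(advantage_estimates(rewards, values, gamma=0.99))
```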
Fig. 2: For the centralized approach, each BS forwards its downlink CSI to a policy network which then determines the power allocation strategy for the entire network.

Unfortunately, both the REINFORCE and standard actor-critic algorithms suffer from a serious defect: choosing a suitable step size for the policy parameter update is extremely challenging. Depending on the parameterization, even small changes in the policy parameter $\theta$ can lead to drastic changes in the policy output $\pi_\theta$; this is especially true for policies utilizing deep neural networks, in which the output is a highly nonlinear function of the weights and biases. It is therefore desirable to seek algorithms that modify both the ascent direction and the step size to achieve stable, incremental changes in the policy space rather than the parameter space.

B. Centralized Trust Region Policy Optimization
In this section, we consider the application of trust-region policy optimization to the sum-rate maximization problem. We begin by designing a centralized approach in which a single agent (i.e., a policy network with parameter $\theta$) receives the downlink CSI from all base stations, computes the network power allocation strategy, and forwards it to the base stations. Thus, it follows that the state and action for the centralized agent are the complete network CSI and power allocation matrix respectively, i.e., $s_n = \mathbf{H}^{(n)}$ and $a_n = \mathbf{P}^{(n)}$. We parameterize a probabilistic policy over the actions as follows: the policy network takes as input the current state and outputs the mean and log-standard deviation values for each user's normally-distributed power allocation, as illustrated in Figure 3. The power allocation variables are then sampled using the following normal distribution:
$$p^{(n)}_{b \to k} \sim \left[ \mathcal{N}\left( \mu\left(p^{(n)}_{b \to k}\right), \sigma\left(p^{(n)}_{b \to k}\right) \right) \right]_0^{P_{\max}} \qquad \text{(9)}$$
where $\mu\left(p^{(n)}_{b \to k}\right)$ and $\sigma\left(p^{(n)}_{b \to k}\right)$ are the policy network output mean and standard deviation values for the power allocated by the $b$th BS to the $k$th user, and the limits are set to ensure that the power constraints in (1b) are enforced.
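A minimal sketch of the sampling step in (9): the policy head produces a mean and a log-standard deviation for each power variable, a sample is drawn from the corresponding normal distribution, and the result is clipped to the feasible interval [0, P_max] so that constraint (1b) is respected. The shapes and numerical values below are assumptions for illustration only.

```python
import numpy as np

def sample_powers(mu, log_std, p_max, rng):
    """Draw per-user transmit powers from N(mu, sigma^2), clipped to [0, p_max]."""
    sigma = np.exp(log_std)
    raw = rng.normal(loc=mu, scale=sigma)
    return np.clip(raw, 0.0, p_max)

rng = np.random.default_rng(1)
K, B, P_MAX = 2, 3, 1.0
mu = rng.uniform(0.2, 0.8, size=(K * B,))        # policy-network mean outputs
log_std = np.full(K * B, -1.0)                   # policy-network log-std outputs
powers = sample_powers(mu, log_std, P_MAX, rng)  # the action, i.e., the power matrix flattened
print(powers.reshape(B, K))
```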
Fig. 3: The value network outputs an estimate of the value function; in contrast, the policy network produces a mean and log-standard deviation of the action as output. (a) Value network; (b) policy network.

Thus, the policy network's input size is determined by the dimensions of the network CSI tensor and its output size by the number of power allocation variables (a mean and a log-standard deviation for each). On the other hand, the value network takes the current state and action as inputs and outputs an estimate of the state-action value function (which is a scalar, hence the output size is 1).

Our goal is to optimize the policy network parameter $\theta$ to maximize the expected reward, which we choose as the network sum spectral efficiency averaged across multiple time slots:
$$\bar{R}(\theta) = \mathbb{E}_{a_n \sim \pi_\theta,\, s_n \sim d_{\pi_\theta}} \left[ \frac{1}{N} \sum_{n=1}^{N} \gamma^{n-1} R\left(\mathbf{H}^{(n)}, \mathbf{P}^{(n)}\right) \right]$$
It should be noted that for this centralized approach, the likelihood of transition to a new state (i.e., $\mathbf{H}^{(n)}$) in the $n$th time slot is, in fact, independent of the actions taken in the previous time slot (i.e., $\mathbf{P}^{(n-1)}$). However, as we shall see, this does not affect the derivations of the algorithms that follow.

As illustrated in Figure 2, the centralized method is similar to the FP and WMMSE algorithms in terms of the computation and CSI exchange. Further, we emphasize that since we consider an episode to consist of a single time slot, the scenario is conceptually similar to the well-known 'multi-armed bandit' problem in reinforcement learning [23], in which the episode length consists of a single time slot. This single time-slot episode approach has also been utilized in prior works exploring the use of policy-based deep reinforcement learning methods to solve resource allocation problems, such as [24] and [25].

To optimize the policy network parameter, we adopt trust-region methods; these alter the ascent direction while utilizing decaying step sizes to avoid destructively large policy updates [26]. By bounding the change in expected reward as a function of the change in policy parameters, we can develop an iterative optimization approach that uses second-order curvature information, rather than just the gradient, to achieve robust policy updates. Our approach can be viewed as a hybrid of the approaches presented in [18], [19]: we utilize a finite-horizon approach derived from that in [19] with the bounds established in [18].

We begin by casting the policy optimization problem in the following equivalent form: suppose that the current policy parameter is $\theta$; then the problem of maximizing expected reward is equivalent to finding the new policy parameter $\theta^*$ that maximizes the difference in expected reward over the current policy:
$$\theta^* = \arg\max_{\theta'} \; \bar{R}(\theta') - \bar{R}(\theta) \qquad \text{(10)}$$
Instead of attempting to optimize the difference in expected reward directly, we derive the following lower bound similar to the approach in [19]:
$$\bar{R}(\theta') - \bar{R}(\theta) = \mathbb{E}_{a_{n'} \sim \pi_{\theta'},\, s_{n'} \sim d_{\pi_{\theta'}}} \left[ \sum_{n'=1}^{N} \gamma^{n'-1} A_{\pi_\theta}(s_{n'}, a_{n'}) \right] \qquad \text{(11a)}$$
$$= \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_{\theta'}}} \left[ \sum_{n'=1}^{N} \gamma^{n'-1} \frac{\pi_{\theta'}(a_{n'} | s_{n'})}{\pi_\theta(a_{n'} | s_{n'})} A_{\pi_\theta}(s_{n'}, a_{n'}) \right] \qquad \text{(11b)}$$
$$\geq \mathcal{L}_\theta(\theta') - C \sqrt{\mathbb{E}_{s_{n'} \sim d_{\pi_\theta}} \left[ D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \right]} \qquad \text{(11c)}$$
where
$$\mathcal{L}_\theta(\theta') \triangleq \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_\theta}} \left[ \sum_{n'=1}^{N} \gamma^{n'-1} \frac{\pi_{\theta'}(a_{n'} | s_{n'})}{\pi_\theta(a_{n'} | s_{n'})} A_{\pi_\theta}(s_{n'}, a_{n'}) \right] \qquad \text{(12)}$$
and the optimal value of the constant $C$ to make this inequality tight can be determined analytically provided that $\gamma$ is fixed [18]; $D_{\mathrm{KL}}(\pi_{\theta'}(s_n) \,\|\, \pi_\theta(s_n))$ represents the Kullback-Leibler (KL) divergence between the new and old policy and is given by
$$D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \triangleq \mathbb{E}_{a_n \sim \pi_{\theta'}} \left[ \log \frac{\pi_{\theta'}(a_n | s_n)}{\pi_\theta(a_n | s_n)} \right]$$
The equality in (11a) follows from the definition of the advantage function in (7) and algebraic manipulation; the equality in (11b) follows since we utilize importance sampling to change the expectation to be over actions sampled from $\pi_\theta$ instead of $\pi_{\theta'}$. The inequality in (11c) follows from utilizing a finite-time horizon in Corollary 2 in [18]. Furthermore, we observe that by setting $\theta' = \theta$, both the difference in expected reward and the lower bound equal zero. Thus, we conclude that the given lower bound minorizes the difference in discounted rewards, and by optimizing this lower bound while ensuring that it is nonnegative, we can always find a value of the policy parameters that does not lead to a decrease in expected reward.

The lower bound in (11c) is useful because it allows us to bound the improvement in the expected reward in terms of the change to the policy parameters via the KL-divergence; this is in sharp contrast to the REINFORCE algorithm, in which the effect of taking too large a step cannot be quantified. At the same time, directly optimizing this lower bound leads to conservatively small step sizes and impractically long training times [19]. Instead, following [18], we consider the following proxy optimization problem:
$$\text{maximize}_{\theta'} \quad \mathcal{L}_\theta(\theta') \qquad \text{(13a)}$$
$$\text{subject to} \quad \mathbb{E}_{s_{n'} \sim d_{\pi_\theta}} \left[ D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \right] \leq \delta \qquad \text{(13b)}$$
In its current form, this problem is non-convex and thus mathematically intractable. However, the objective function and constraint in (13) can be approximated using their Taylor series expansions as follows:
$$\mathcal{L}_\theta(\theta') \approx \mathcal{L}_\theta(\theta) + \mathbf{g}_\theta^T (\theta' - \theta) \qquad \text{(14a)}$$
$$\mathbb{E}_{s_{n'} \sim d_{\pi_\theta}} \left[ D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \right] \approx \frac{1}{2} (\theta' - \theta)^T \mathbf{F}_\theta (\theta' - \theta) \qquad \text{(14b)}$$
where
$$\mathbf{F}_\theta \triangleq \nabla^2_{\theta'} \, \mathbb{E}_{s_{n'} \sim d_{\pi_\theta}} \left[ D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \right] \Big|_{\theta' = \theta} = \mathbb{E}_{a_{n'} \sim \pi_\theta,\, s_{n'} \sim d_{\pi_\theta}} \left[ \nabla_{\theta'} \log \pi_{\theta'}(a_{n'} | s_{n'}) \big|_{\theta} \, \nabla_{\theta'} \log \pi_{\theta'}(a_{n'} | s_{n'}) \big|_{\theta}^T \right] \qquad \text{(15)}$$
We remark at this juncture that $\mathbf{F}_\theta$ can be interpreted as the Fisher Information Matrix (FIM) of the policy and describes the curvature as a function of the policy parameter. Utilizing these approximations to both the objective function and the constraint, we obtain the following convex proxy optimization problem:
$$\text{maximize}_{\theta'} \quad \mathbf{g}_\theta^T (\theta' - \theta) \qquad \text{(16a)}$$
$$\text{subject to} \quad \frac{1}{2} (\theta' - \theta)^T \mathbf{F}_\theta (\theta' - \theta) \leq \delta \qquad \text{(16b)}$$
The proxy problem in (16) can be solved analytically to yield the optimal value of $\theta'$ in terms of the current parameter $\theta$ as:
$$\theta' = \theta + \sqrt{\frac{2\delta}{\mathbf{g}_\theta^T \mathbf{F}_\theta^{-1} \mathbf{g}_\theta}} \, \mathbf{F}_\theta^{-1} \mathbf{g}_\theta \qquad \text{(17)}$$
We observe that the proxy optimization problem requires knowledge of $\mathbf{g}_\theta$ and $\mathbf{F}_\theta$; however, we are unable to evaluate them analytically. This can be overcome as follows: we sample $M$ episodes $\varepsilon_1, \ldots, \varepsilon_M$ from the current policy $\pi_\theta$, where:
$$\varepsilon_m = \{ s_{m1}, a_{m1}, r_{m1}, \ldots, s_{mN}, a_{mN}, r_{mN} \}$$
These episodes can be used to obtain unbiased estimates of the policy gradient and FIM as follows:
$$\hat{\mathbf{g}}_\theta = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n'=1}^{N} \gamma^{n'-1} \nabla_{\theta'} \log \pi_{\theta'}(a_{mn'} | s_{mn'}) \, \hat{A}_{\pi_{\theta'}}(s_{mn'}, a_{mn'}) \Big|_{\theta' = \theta} \qquad \text{(18)}$$
$$\hat{\mathbf{F}}_\theta = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n'=1}^{N} \left[ \nabla_{\theta'} \log \pi_{\theta'}(a_{mn'} | s_{mn'}) \cdot \nabla_{\theta'} \log \pi_{\theta'}(a_{mn'} | s_{mn'})^T \right] \Big|_{\theta' = \theta} \qquad \text{(19)}$$
While directly utilizing the update in (17) is possible [27], it can lead to a violation of the KL-divergence constraint in (13b) since we utilize Taylor approximations rather than the original objective function and constraint. To overcome this issue, we use a backtracking line search with decaying step sizes as proposed in [18]; thus, denoting
$$\Delta\theta = \sqrt{\frac{2\delta}{\hat{\mathbf{g}}_\theta^T \hat{\mathbf{F}}_\theta^{-1} \hat{\mathbf{g}}_\theta}} \, \hat{\mathbf{F}}_\theta^{-1} \hat{\mathbf{g}}_\theta \qquad \text{(20)}$$
we use the parameter update
$$\theta' = \theta + \zeta^j \Delta\theta \qquad \text{(21)}$$
instead of the one in (17), where $\zeta \in (0, 1)$ is the step size and the integer $j = 0, 1, \ldots$ is incremented (thus decaying the step size by a factor of $\zeta$ with each increment) until the KL-divergence constraint is satisfied and the objective function from (12) is nonnegative, i.e.,
$$\mathcal{L}_\theta(\theta') \geq 0 \qquad \text{(22)}$$
$$\mathbb{E}_{s_{n'} \sim d_{\pi_\theta}} \left[ D_{\mathrm{KL}}(\pi_{\theta'} \,\|\, \pi_\theta) \right] \leq \delta \qquad \text{(23)}$$
Finally, combining all these steps together, we obtain the deep trust-region reinforcement learning algorithm in Algorithm 1. We emphasize at this point that the value network is necessary only during the training phase; once the centralized policy has been trained using Algorithm 1, the current state can be directly forward-propagated through the policy network to obtain the network power allocation.

Algorithm 1
Trust Region Policy Optimization for Centralized Power Allocation

initialize centralized policy and value network parameters θ, φ.
for i = 0, 1, ..., N_iterations do
    for m = 0, 1, ..., M do
        Sample π_{θ_i} to collect ε_m = {s_{m1}, a_{m1}, R_m(H^(1), P^(1)), ..., s_{mN}, a_{mN}, R_m(H^(N), P^(N))}.
    end for
    Estimate Â_{π_{θ_i}}.
    Update critic network parameter φ_i to fit the estimated advantage values.
    Compute policy gradient estimate ĝ_{θ_i} using (18).
    Compute FIM estimate F̂_{θ_i} using (19).
    Compute policy update direction Δ_i using (20).
    for j = 0, 1, ... do
        Compute: θ_{i+1} = θ_i + ζ^j Δ_i
        if (22) and (23) are satisfied then
            break
        end if
    end for
end for
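The numerical core of Algorithm 1 reduces to a natural-gradient step followed by a backtracking line search, as sketched below under simplifying assumptions: the sampled gradient, Fisher estimate, surrogate objective, and KL divergence are passed in as placeholders, and a dense linear solve stands in for the conjugate-gradient procedure a practical implementation would use.

```python
import numpy as np

def trpo_step(theta, g_hat, F_hat, delta, zeta, surrogate, kl, max_backtracks=10):
    """One trust-region update: direction from (20), decaying-step search from (21)-(23)."""
    # Small damping keeps the estimated FIM invertible.
    F_damped = F_hat + 1e-6 * np.eye(F_hat.shape[0])
    nat_grad = np.linalg.solve(F_damped, g_hat)
    step = np.sqrt(2.0 * delta / (g_hat @ nat_grad)) * nat_grad
    for j in range(max_backtracks):
        theta_new = theta + (zeta ** j) * step     # candidate update with decayed step
        if surrogate(theta_new) >= 0.0 and kl(theta_new) <= delta:
            return theta_new                       # conditions (22) and (23) satisfied
    return theta                                   # reject the update if the search fails

# Toy quadratic stand-ins for the surrogate objective and the KL divergence.
theta0 = np.zeros(4)
g = np.array([1.0, 0.5, -0.2, 0.1])
F = np.eye(4)
new_theta = trpo_step(theta0, g, F, delta=0.01, zeta=0.9,
                      surrogate=lambda th: g @ (th - theta0),
                      kl=lambda th: 0.5 * (th - theta0) @ F @ (th - theta0))
print(new_theta)
```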
C. Decentralized Multi-Agent Approaches

The centralized single-agent approach we have developed so far has the same information exchange requirements as the FP and WMMSE algorithms, which become burdensome as the number of cells and users increases. More importantly, the increasing size of both the state and action spaces leads to extremely slow convergence. This issue, commonly referred to as the 'curse of dimensionality' [28], [29], renders the centralized DRL approach unsuitable for large wireless networks as the associated training times become impractically long.

Multi-agent approaches are attractive from a computational perspective since the policy network only computes the actions for a single agent; this circumvents the problem of the increasing size of the state and action spaces that is inevitably incurred with a centralized approach.
Fig. 4: In multi-agent DRL methods, each agent acts individually. The collective actions of all agents influence the common reward and the (usually distinct) states the agents will encounter in the next time step. (a) Single-agent DRL; (b) multi-agent DRL.

Furthermore, as shown in Figures 4a and 4b, each agent processes only its own state information, but can learn under a common reward. Intuitively, training a single policy for deployment across all BSs allows the network size to grow without necessarily increasing the training time required. As a result, multi-agent approaches are generally utilized to solve large-scale reinforcement learning problems as opposed to single-agent methods [30]–[32]. However, the trust region methods employed in [18] are developed for a centralized single-agent approach.

We overcome these challenges by modifying the trust region policy optimization approach to function in a decentralized fashion with limited information exchange. More specifically, we consider each BS in the network to be an individual agent with partial access to network CSI; however, we design the learning process so that these agents learn under the common reward of network spectral efficiency. This is in contrast to prior DRL approaches explored in the literature, since methods like REINFORCE are typically not robust enough to be used in a multi-agent setting, due to the aforementioned sensitivity to changes in the policy parameters. While deep Q-learning has been used in multi-agent settings, as mentioned earlier, it cannot directly optimize the network spectral efficiency and requires significant feature engineering. On the other hand, constrained policy optimization algorithms like trust region policy optimization have been successfully employed in solving multi-agent common-objective learning problems [19], [30].

Partially Decentralized Approach: Accordingly, we propose a partially decentralized multi-agent approach with reduced information exchange and per-BS CSI requirements. We propose a round-robin scheme in which the BSs sequentially allocate powers, in a randomly chosen order, to the users within their cells, using an identical policy network deployed at each BS. Thus, the time horizon for the multi-
We remarkthat this procedure leads to slight differences in the definition of an episode: for the centralized setting,an episode can encompass multiple time slots, but for the partially decentralized case we consider anepisode to be complete once all BSs have allocated the transmit powers to users for that particular timeslot. Additionally, the power allocation values are sampled from a normal distribution dependent on thepolicy network output similar to the centralized setting.Unlike the centralized approach, however, the probability of transition to a new state does depend uponthe previous action since an identical policy is utilized at each BS. Specifically, each BS’s state consistsof its own downlink CSI and the power allocated by the previous BSs; the former is independent of theprevious action as in the centralized approach, while the latter directly depends upon the previous actionchosen by the policy. With this crucial difference, the actions of all BSs are coupled; and the problemcan now be considered in the more general MDP framework as introduced in Section III earlier withmultiple time-steps. We remark once again that this decentralized setting is similar to the game-theoreticapproaches detailed earlier [7], [11] as well as prior works that have utilized DRL for solving resourceallocation problems [33].Allowing each BS access to its downlink CSI is obviously necessary; the additional power allocationinformation of the previous BSs should also allow it to avoid creating intercell interference. For instance, Note that an episode can alternatively be defined to comprise multiple time slots; however, in this work we consider a single time-slotper episode without loss of generality. Decentralized Policy Network Decentralized Policy NetworkDecentralized Policy Network CSI Power (a)
Fig. 5: For the partially decentralized approach, the BSs sequentially exchange power allocation information in a predetermined order until the last BS is reached. For the fully decentralized approach, no CSI or power information is exchanged, and the BSs simultaneously and independently determine their power allocation strategy.

We emphasize that such an approach with partial information sharing is only possible in the framework of multi-agent reinforcement learning; model-based optimization algorithms like WMMSE and FP require full exchange of CSI between all BSs to enable cooperative power allocation. Also, the prior decentralized approaches in the literature do not allow for flexible information sharing: for example, the sequential sharing of power information by BSs in our proposed partially decentralized algorithm would simply not be possible in either the game-theoretic decentralized approaches presented in [7], [8] or the optimization approaches presented in [9]–[11], as this information would be useless without the corresponding downlink channel state information of other BSs.

As with the centralized approach, the reward function is chosen to be the network sum spectral efficiency; however, to ensure that all BSs learn under a common reward, the reward is accessible only at the end of each episode (i.e., once all BSs have allocated powers to their associated users) and intermediate values of the reward are set to zero. Hence, a complete episode is given by:
$$\varepsilon = \left\{ s_1, a_1, 0, \ldots, s_B, a_B, R\left(\mathbf{H}^{(n)}, \mathbf{P}^{(n)}\right) \right\}$$
The ascent direction is once again calculated using (20). Since an identical policy is deployed at all BSs, the same gradient update is applied to every agent. Combining all steps together, the partially decentralized approach is summarized in Algorithm 2.
Algorithm 2
Trust Region Policy Optimization for Partially Decentralized Power Allocation

initialize policy and value network parameters θ, φ at BSs b = 1, ..., B.
for i = 0, 1, ..., N_iterations do
    for m = 0, 1, ..., M do
        for b = 1, ..., B do
            Propagate s_{mb} through π_{θ_i} at BS b to obtain a_{mb}.
            Forward a_{m1}, ..., a_{mb} to BS b + 1.
        end for
        Collect ε_m = {s_{m1}, a_{m1}, 0, ..., s_{mB}, a_{mB}, R_m(H^(n), P^(n))}.
    end for
    Estimate Â_{π_{θ_i}}.
    Update critic network parameter φ_i at BSs b = 1, ..., B to fit the estimated advantage values.
    Compute policy gradient estimate ĝ_{θ_i} using (18).
    Compute FIM estimate F̂_{θ_i} using (19).
    Compute policy update direction Δ_i using (20).
    for j = 0, 1, ... do
        Compute: θ_{i+1} = θ_i + ζ^j Δ_i
        if (22) and (23) are satisfied then
            Apply the policy update at BSs b = 1, ..., B.
            break
        end if
    end for
end for

Similar to the centralized setting, we remark that once the (identical) policy network at each BS has been trained, the value network is no longer necessary. Likewise, the common gradient update needs to be applied only during the training phase; once the trained agents are deployed, the state information can directly be propagated through the policy network at each BS to obtain the desired power allocation for that cell.
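The deployment-time behaviour of Algorithm 2 can be summarized by the following sketch of a single round-robin sweep, in which an identical (already trained) policy is queried at each BS in a random order and the chosen powers are forwarded onward; the placeholder policy and the state encoding are assumptions for illustration.

```python
import numpy as np

def round_robin_allocation(policy, csi_per_bs, K, p_max, rng):
    """One deployment-time sweep: BSs act in a random order, each seeing only its own
    downlink CSI plus the powers chosen by the BSs scheduled before it."""
    B = len(csi_per_bs)
    prev = np.zeros((B - 1) * K)                       # power slots of earlier BSs
    order = rng.permutation(B)
    powers = {}
    for i, b in enumerate(order):
        state = np.concatenate([np.abs(csi_per_bs[b]).ravel(), prev])
        mu, log_std = policy(state)                    # shared trained policy network
        p_b = np.clip(rng.normal(mu, np.exp(log_std)), 0.0, p_max)
        powers[b] = p_b
        if i < B - 1:
            prev[i * K:(i + 1) * K] = p_b              # forwarded to the next BS in order
    return powers

# Placeholder policy standing in for the trained network.
rng = np.random.default_rng(3)
B, K, P_MAX = 3, 2, 1.0
dummy_policy = lambda s: (np.full(K, 0.5), np.full(K, -1.0))
csi = [rng.standard_normal((K, B)) + 1j * rng.standard_normal((K, B)) for _ in range(B)]
print(round_robin_allocation(dummy_policy, csi, K, P_MAX, rng))
```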
Fully Decentralized Approach:
For the fully decentralized approach, we eliminate all communication between the BSs in the network; hence, the state information for each BS comprises only its downlink CSI:
$$s_b = \left\{ h^{(n)}_{b \to k, b'} \;\middle|\; b' = 1, \ldots, B;\; k = 1, \ldots, K \right\}$$
$$a_b = \left\{ p^{(n)}_{b \to k} \;\middle|\; k = 1, \ldots, K \right\}$$
Since no information is exchanged between the agents, all BSs independently and simultaneously determine their individual power allocation strategies, as illustrated in Figure 5b. The reward function is once again chosen to be the network sum-rate and is only accessible at the end of each episode; likewise, an identical policy is utilized at each BS and updated at the end of each training episode. The fully decentralized algorithm is summarized in Algorithm 3.

Finally, we note that the feedback of the network sum-rate to both the centralized agent and the distributed agents is only necessary during the training phase. Once a trained agent has been deployed, there is no further need for this information.

Algorithm 3
Trust Region Policy Optimization for Fully Decentralized Power Allocation

initialize policy and value network parameters θ, φ at BSs b = 1, ..., B.
for i = 0, 1, ..., N_iterations do
    for m = 0, 1, ..., M do
        for b = 1, ..., B do
            Propagate s_{mb} through π_{θ_i} at BS b to obtain a_{mb}.
        end for
        Collect ε_m = {s_{m1}, a_{m1}, 0, ..., s_{mB}, a_{mB}, R_m(H^(n), P^(n))}.
    end for
    Estimate Â_{π_{θ_i}}.
    Update critic network parameter φ_i at BSs b = 1, ..., B to fit the estimated advantage values.
    Compute policy gradient estimate ĝ_{θ_i} using (18).
    Compute FIM estimate F̂_{θ_i} using (19).
    Compute policy update direction Δ_i using (20).
    for j = 0, 1, ... do
        Compute: θ_{i+1} = θ_i + ζ^j Δ_i
        if (22) and (23) are satisfied then
            Apply the policy update at BSs b = 1, ..., B.
            break
        end if
    end for
end for
IV. NUMERICAL RESULTS
To evaluate the performance of the proposed methods, we simulated different cellular networks with the following system parameters:

TABLE I: Numerical values of system model parameters
Total bandwidth: W = MHz
BS maximum transmit power: P_max = dBm
Noise PSD: N_0 = − dBm/Hz
Noise figure: N_f = dB
Reference distance: d_0 = m
Cell radius: m

The parameters utilized for the proposed centralized, partially decentralized, and fully decentralized deep learning algorithms are chosen identically as follows:

TABLE II: Numerical values of deep learning parameters
KL-divergence constraint: δ = 0.01
Step size: ζ = 0.90
Discount factor: γ = 0.99
Policy network hidden layers: 3
Value network hidden layers: 3
Neurons per hidden layer: 256
Episodes per iteration: M = 1000

Furthermore, the neurons in the hidden layers for both the policy and value networks are chosen to be exponential linear units and have an activation function given by:
$$\sigma(z) = \begin{cases} z & z > 0 \\ e^{z} - 1 & z \leq 0 \end{cases} \qquad \text{(24)}$$
The proposed DRL schemes were implemented using Python and TensorFlow. The machine utilized for training and evaluation of all schemes was equipped with a dual-core Core i-series processor and no discrete GPU.
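A sketch of the policy and value networks implied by Table II, i.e., three hidden layers of 256 exponential linear units, built with tf.keras. The input encoding of the CSI tensor and the exact output parameterization are implementation assumptions rather than details taken from the paper.

```python
import numpy as np
import tensorflow as tf

def build_mlp(output_dim, hidden_layers=3, width=256):
    """Feedforward network following Table II: 3 hidden layers of 256 ELU units."""
    layers = [tf.keras.layers.Dense(width, activation="elu") for _ in range(hidden_layers)]
    layers.append(tf.keras.layers.Dense(output_dim))
    return tf.keras.Sequential(layers)

# Illustrative dimensions for a small centralized setting (B = 3, K = 2); treating each
# channel entry as one real-valued input feature is an assumption for this sketch.
B, K = 3, 2
state_dim = K * B * B
policy_net = build_mlp(2 * K * B)   # a mean and a log-std for each power variable
value_net = build_mlp(1)            # scalar state-action value estimate

dummy_state = np.zeros((1, state_dim), dtype=np.float32)
dummy_action = np.zeros((1, K * B), dtype=np.float32)
mean_and_log_std = policy_net(dummy_state)
q_estimate = value_net(np.concatenate([dummy_state, dummy_action], axis=1))
print(mean_and_log_std.shape, q_estimate.shape)
```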
A. Training Performance

We begin by considering training performance for the proposed approaches in a 3-cell network with the parameters in Table I and K = 2 users per cell; additionally, the path loss exponent utilized in this scenario was set to α = 3. . Note that we utilize this small network size to fully evaluate training performance and algorithm convergence for all three approaches. The policy and value networks are characterized using the parameters in Table II. It should be noted that different random initializations of the neural networks may converge to different final policy and value networks; thus, we plot training curves for different parameter initializations to analyze the convergence behaviour. Furthermore, to ensure a fair comparison, we train the centralized, partially decentralized, and fully decentralized algorithms for × training steps each; this corresponds to × and /× episodes for the centralized and decentralized approaches respectively.

Figure 6(a) illustrates the evolution of per-episode reward for the centralized approach. Starting from an average of around Mbps, the algorithm converges to a per-episode reward of
Mbps. Furthermore, we observe that the training curves for the different random initializations are highly consistent in converging to a similar final reward. We observe some variance during the training process; this is to be expected, since the channel realizations (and hence the network sum-rate achieved) during each time slot are generated randomly. The convergence behaviour can be visualized better by plotting an exponentially weighted average of the training performance, which is illustrated in Figure 6(b). (The exponentially weighted average s[n] of a sequence x[n] is calculated using the relation s[n] = w x[n] + (1 − w) s[n − 1] for n > 1 and s[1] = x[1], where w represents the smoothing factor; in this paper, we use w = 0. .) As we can observe, the different random initializations follow closely matching training trajectories in converging to nearly identical final rewards.

In Figures 7 and 8, respectively, we plot similar training curves for random initializations of the partially and fully decentralized algorithms. In these cases too, we observe that the average reward improves as training progresses. However, two crucial differences emerge in comparison to the centralized algorithm: first, the decentralized algorithms converge to a lower final reward, and second, they demonstrate notably higher variance across runs. In particular, considering the smoothed training curves, we observe that both the partially and fully decentralized approaches achieve a final reward of around Mbps. The fully decentralized algorithm also demonstrates the greatest spread of final rewards among the different random initializations and, on average, is outperformed by the partially decentralized algorithm in this setting.

As a baseline, we also compare the training performance against the well-known A2C (Advantage Actor-Critic) DRL algorithm first proposed in [22]. Like TRPO, A2C utilizes a critic network to help reduce variance in the policy gradient estimates; however, unlike TRPO, the step size is fixed. We plot the training performance of the A2C algorithm for the 3-cell centralized setting for random policy initializations with a fixed step size of ×10− and a smoothing factor of 0. in Figure 9. The A2C algorithm is unable to converge in this setting and demonstrates poor performance across all initializations. Specifically, the average reward attained fluctuates between approximately and Mbps; this is in contrast to the corresponding centralized TRPO method, which achieves an average reward of around
Mbps with a variation of less than Mbps across all initializations. This can be directly attributed to the fact that even small changes in the policy parameters can lead to unexpected changes in terms of the reward function.
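The exponentially weighted average used for the smoothed training curves follows directly from the recursion given above; a small sketch with the smoothing factor w left as a parameter:

```python
import numpy as np

def smoothed(x, w):
    """Exponentially weighted average: s[1] = x[1], s[n] = w*x[n] + (1-w)*s[n-1] for n > 1."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    s[0] = x[0]
    for n in range(1, len(x)):
        s[n] = w * x[n] + (1.0 - w) * s[n - 1]
    return s

print(smoothed([90.0, 110.0, 100.0, 120.0], w=0.5))
```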
B. Sum-Rate Performance
In this section, we compare the performance of the proposed approaches against the following uncoordinated and state-of-the-art coordinated algorithms:
1) Maximum Power Allocation: In this scheme, each BS transmits at the maximum power P_max to every user within its cell. As the simplest uncoordinated scheme, it necessarily achieves poor performance; nonetheless, it is the least computationally complex and requires no exchange of information between the BSs. Thus, it serves as a useful benchmark to evaluate the performance and complexity of more sophisticated resource management solutions.
2) Random Power Allocation: For this scheme, the power allocated to each user by its serving base station is chosen uniformly at random from the interval [0, P_max]. Like the maximum power allocation scheme, this approach is desirable from a computation and information exchange perspective, although its performance is worse than that of coordinated schemes.
3) Fractional Programming:
FP is an iterative method based on minorization-maximization developed by Shen and Yu in [5], and is the highest-performing resource allocation scheme developed thus far in the literature [5], [6], [14]. As emphasized in the introduction, FP is guaranteed to converge to a local optimum of the network sum-rate optimization problem; however, the number of iterations needed to converge is unknown a priori.
Fig. 6: Centralized training convergence for random initializations, B = 3, K = 2: (a) training convergence; (b) exponentially weighted average training convergence.
Fig. 7: Partially decentralized training convergence for random initializations, B = 3, K = 2: (a) training convergence; (b) exponentially weighted average training convergence.
Fig. 8: Fully decentralized training convergence for random initializations, B = 3, K = 2: (a) training convergence; (b) exponentially weighted average training convergence.
Fig. 9: Exponentially weighted centralized training convergence for the A2C algorithm for random initializations; B = 3, K = 2.

4) Weighted Minimum Mean-Squared Error: WMMSE transforms the original sum-rate maximization problem into an equivalent optimization problem of minimum mean-squared error minimization. Like FP, WMMSE is guaranteed to converge in a monotonically nondecreasing fashion to a local optimum of the problem. Specifically, we compare our proposed methods against the SISO variant of the WMMSE algorithm as utilized in [3], [5], [14], [15].

Note that we do not compare against the decentralized approaches in [7], [8], [10], [11] as we consider the state-of-the-art FP and WMMSE algorithms as the performance benchmarks.

We begin by considering the performance of the different algorithms for the aforementioned B = 3 cells, K = 2 users per cell setting considered for the training curves. Figure 10 plots the convergence of the network sum-rate for a single random initialization of channel values according to the given system model. As expected, the uncoordinated max-power and random-power allocation schemes achieve the worst performance; on the other hand, both FP and WMMSE converge in a monotonically nondecreasing fashion to local optima of the network sum-rate maximization problem. In comparison, both the coordinated and uncoordinated schemes are outperformed by the proposed deep learning approaches. As with the training results, the centralized approach outperforms the partially decentralized approach which, in turn, achieves a small performance advantage over the fully decentralized scheme. Also, the
Fig. 10: Sum-rate convergence for a single random channel realization, B = 3 , K = 2 .coordinated FP and WMMSE algorithms require multiple iterations to reach the local optimum of theproblem; this is in contrast to the proposed DRL approaches, in which the state information can be directlypropagated through the policy network to obtain the desired solution to the optimization problem.The results in Figure are for a single time slot with user locations and channels generated randomlyaccording to the distributions described in the system model. To compare fairly with the stated benchmarks,in Figure we consider the average network sum-rate achieved across independent random channelrealizations. These results display a similar trend: the centralized scheme achieves the highest averagenetwork spectral efficiency, followed by the partially and fully decentralized schemes respectively. Themodel-based optimization algorithms are outperformed by the DRL schemes while the uncoordinatedschemes once again achieve the worst performance. We also remark that even the fully decentralizedapproach outperforms the state-of-the-art FP and WMMSE algorithms, despite the fact that it uses only afraction of the downlink CSI available to these coordinated schemes. This result is all the more remarkablewhen we consider that none of the proposed approaches are trained for particular channel realizations;instead, during training, the user locations and channel values are generated randomly from the stateddistributions in the channel model. In other words, these results indicate that the DRL approaches achievegood generalization, as they are able to perform well on channel realizations that are drawn from the Cent. RL Part. Decent. RL Fully Decent. RL FP WMMSE Random Max-Power020406080100120140 A v e r age N e t w o r k S u m - R a t e ( M bp s ) Fig. 11: Averaged network sum-rate for channel realizations, B = 3 , K = 2 .same distribution as, but not identical to , the training episodes [34].As stated earlier, the chief drawback of the centralized scheme is that the training times becomeimpractically large compared to the decentralized schemes as the cellular network size grows. Furthermore,the information exchange requirements are identical to the FP and WMMSE algorithms and as thesebecome burdensome for larger wireless networks. Accordingly, we proceed by testing the performance ofthe decentralized approaches in a much larger -cell hexagonal network with wraparound; the networkparameters are chosen to be identical to the -cell setting, with the exception of the pathloss exponent α which is set to . . Figure demonstrates the convergence of network sum-rate for a single randomrealization of channels for this setting; similar to the 3-cell network, the partially and fully decentralizedapproaches once again achieve higher network spectral efficiency than the benchmark schemes.We also consider the average sum-rate achieved across a large number of independent channel realiza-tions in Figure , similar to Figure . These results display similar trends to the -cell setting, with thepartially decentralized approach once again achieving the highest network spectral efficiency and achieving7% higher network sum-rate than FP. The fully decentralized approach is once again outperformed by thepartially decentralized approach, but still manages to achieve higher objective function values than theFP, WMMSE, random and max-power schemes. 
Fig. 12: Sum-rate convergence for a single random channel realization, B = 7, K = 8.

Fig. 13: Averaged network sum-rate for channel realizations, B = 7, K = 8.

To test the robustness of the proposed RL approaches, we also plot the average sum-rate achieved across multiple time slots for P max values ranging between 20 and 50 dBm for the 3-cell setting in Figure 14.
Fig. 14: Averaged network sum-rate for different transmit powers, B = 3, K = 2.

For this test, we utilize the policy networks previously trained for a transmit power of
43 dBm, simply scaling their output to match the new transmit power constraint; this tests the performance of the DRL approaches when the reward function differs from the one used during training. As we can see, all three DRL methods demonstrate excellent performance across the entire range, closely matching the FP and WMMSE algorithms at the lower transmit power levels. At higher transmit power levels, the proposed DRL approaches outperform the WMMSE algorithm while continuing to match the FP algorithm.

For completeness, we also consider a slight variation of the original network sum-rate maximization problem with a per-BS sum-power constraint instead of a per-user power constraint for the 3-cell setting. This power constraint is enforced for the RL algorithms by simply rescaling the power output during training so that the per-BS sum power does not exceed P max. As the results in Figure 15 show, the fully centralized and partially decentralized RL algorithms still outperform the FP and WMMSE algorithms, as well as the uncoordinated approaches.

Fig. 15: Averaged network sum-rate for sum-power constraint, B = 3, K = 2.
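The two adjustments described above (reusing a policy trained at 43 dBm under a different per-user budget, and enforcing a per-BS sum-power budget by rescaling) can be sketched as follows; the function names and the proportional rescaling rule are our own illustrative choices rather than a verbatim description of our implementation.

import numpy as np

def rescale_per_user(p, p_max_trained, p_max_new):
    # Reuse a policy trained under one per-user budget at a different budget
    # by proportionally scaling its output powers.
    return p * (p_max_new / p_max_trained)

def enforce_sum_power(p, p_sum_max):
    # Rescale the raw per-user powers of a BS whenever their total exceeds
    # the sum-power budget, leaving them unchanged otherwise.
    total = np.sum(p)
    return p * (p_sum_max / total) if total > p_sum_max else p

# Example: powers from a policy trained with roughly a 20 W (43 dBm) per-user budget,
# evaluated under a 1 W (30 dBm) per-user constraint.
p_new = rescale_per_user(np.array([12.0, 7.5]), p_max_trained=20.0, p_max_new=1.0)

Both operations keep the allocation feasible without retraining, which is why the same trained agents can be evaluated over the whole range of transmit powers in Figure 14.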
C. Channel State Information Exchange

In comparing the performance of different schemes, it is important to consider the amount of information exchange between BSs necessary to achieve the resulting performance. As mentioned earlier, the FP and WMMSE algorithms require all BSs to send their complete downlink CSI to a central cloud processor, which then computes and forwards the power allocation variables. For a network consisting of B cells and K users per cell, this corresponds to the central processor receiving O(KB²) scalars from the cooperating BSs. The centralized single-agent approach requires the same CSI exchange between the BSs.

For the partially decentralized approach, the state information for each BS comprises the downlink CSI from itself to all the users in the network as well as the power allocation decisions of the previous BSs. This corresponds to an information exchange requirement of O(KB). For the fully decentralized approach, on the other hand, the state information comprises only the downlink channels from the BS itself to the users in the network; accordingly, there is no information exchange between the BSs of the network, as in the uncoordinated max-power and random-power schemes. These results are summarized in Table III, along with the necessary CSI per BS.

TABLE III: Comparison of CSI and information exchange requirements
Resource Allocation Scheme   | Information Exchange | Per-BS CSI Needed
Centralized RL               | O(KB²)               | O(KB²)
Partially Decentralized RL   | O(KB)                | O(KB)
Fully Decentralized RL       | None                 | O(KB)
FP                           | O(KB²)               | O(KB²)
WMMSE                        | O(KB²)               | O(KB²)
Max-power                    | None                 | None
Random                       | None                 | None
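To make these orders concrete, the short sketch below counts the scalars exchanged per power allocation for a given network size under the assumption of one real scalar per channel gain; the function name and dictionary keys are our own, and the counts simply evaluate the expressions listed in Table III.

def exchange_counts(B, K):
    # Scalars exchanged over the backhaul per power allocation, following Table III.
    return {
        "Centralized RL / FP / WMMSE": K * B ** 2,
        "Partially decentralized RL": K * B,
        "Fully decentralized RL": 0,
    }

# For the 7-cell, 8-user-per-cell setting this gives 392, 56 and 0 scalars respectively.
print(exchange_counts(B=7, K=8))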
D. Execution Times

An essential aspect of comparing different resource allocation algorithms is the time and complexity necessary to calculate a solution. A key benefit of deep reinforcement learning is that propagating the current state through the policy network to find an action is typically quite fast; thus, once the agents have been trained, power allocation solutions can be found in a fraction of the time required by conventional optimization techniques.
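A rough wall-clock measurement of this inference step can be obtained as sketched below; the network shape matches the illustrative policy sketched earlier and is an assumption, and the corresponding benchmark figure would be taken around a complete FP or WMMSE run to convergence.

import time
import torch
import torch.nn as nn

# Illustrative policy of the size assumed in the earlier sketch
policy = nn.Sequential(nn.Linear(36, 128), nn.ReLU(),
                       nn.Linear(128, 128), nn.ReLU(),
                       nn.Linear(128, 6), nn.Sigmoid())
state = torch.randn(1, 36)

start = time.perf_counter()
with torch.no_grad():
    policy(state)                       # one forward pass = one power allocation
elapsed = time.perf_counter() - start
print(f"Policy inference time: {elapsed:.2e} s")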
TABLE IV: Average execution times of different resource allocation schemes
Resource Allocation Scheme   | Average Execution Time (seconds)
Centralized RL               | · × 10^−·
Partially Decentralized RL   | · × 10^−·
Fully Decentralized RL       | · × 10^−·
FP                           | · × 10^−·
WMMSE                        | · × 10^−·
Max-power                    | ·
Random                       | · × 10^−·

Table IV lists the execution times of the trained models for our implementations of the different schemes. As we can observe, the DRL algorithms require over two orders of magnitude less time to calculate a power allocation strategy than the WMMSE and FP algorithms. All times have been measured for the 3-cell setting; since we use identical neural network sizes throughout, these results are equally representative of the 7-cell network.

The partially decentralized approach requires the longest execution time among the DRL approaches because of the sequential sharing of power allocation information: each BS must propagate its state through its policy network to produce its power allocation strategy and then forward it to the next BS, which repeats the process, and so on. It follows that the last BS has to wait until all other BSs have computed their power allocation strategies, so the execution time of this approach will be B times longer than that of the fully decentralized approach, which can be executed in parallel at each BS. The latency introduced in this way is nonetheless still much lower than that of the FP and WMMSE algorithms; for example, projecting from the execution times in Table IV, it would still be around an order of magnitude lower.

V. CONCLUSIONS
In this paper, we employed a deep reinforcement learning approach to directly solve the continuous-valued, non-convex and NP-hard downlink sum-rate optimization problem. We proposed multiple variants: a fully centralized single-agent approach as well as partially and fully decentralized multi-agent approaches. The centralized and partially decentralized TRPO approaches achieve a higher network sum-rate than the state-of-the-art FP, WMMSE and A2C algorithms while, once trained, all three approaches are capable of finding solutions in a fraction of the time needed by the conventional optimization methods. Of greater consequence, the framework of trust region policy optimization allows us to design decentralized schemes that enable varying degrees of information exchange between base stations while overcoming the 'curse-of-dimensionality' issues associated with centralized reinforcement learning.

This work represents a preliminary step in investigating centralized and decentralized reinforcement learning algorithms for solving wireless resource management problems, and many fruitful directions are possible for future research. In particular, reducing the number of samples and the amount of computation necessary to achieve effective performance is possibly the most important open problem in reinforcement learning. Our work is, to the best of our knowledge, the first to implement information sharing between DRL agents to solve such optimization problems. While the exchange of power information helps increase network spectral efficiency compared to the fully decentralized scheme, it is possible that feature engineering could improve performance further. Finally, we note that the extension to multiple-input multiple-output (MIMO) wireless networks remains challenging for machine learning algorithms due to input representation and prohibitively large input dimensionality; distributed DRL algorithms such as those proposed in this work may help overcome the latter obstacle.

REFERENCES
[1] Z.-Q. Luo and S. Zhang, “Dynamic Spectrum Management: Complexity and Duality,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, 2008.
[2] L. Liu, R. Zhang, and K.-C. Chua, “Achieving Global Optimality for Weighted Sum-Rate Maximization in the K-User Gaussian Interference Channel with Multiple Antennas,” IEEE Trans. Wireless Commun., vol. 11, no. 5, pp. 1933–1945, 2012.
[3] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to Optimize: Training Deep Neural Networks for Interference Management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, 2018.
[4] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An Iteratively Weighted MMSE Approach to Distributed Sum-Utility Maximization for a MIMO Interfering Broadcast Channel,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, 2011.
[5] K. Shen and W. Yu, “Fractional Programming for Communication Systems - Part I: Power Control and Beamforming,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, 2018.
[6] A. A. Khan, R. Adve, and W. Yu, “Optimizing Multicell Scheduling and Beamforming via Fractional Programming and Hungarian Algorithm,” in IEEE Globecom Workshops (GC Wkshps), Abu Dhabi, Dec. 2018, pp. 1–6.
[7] I. Menache and A. Ozdaglar, “Network Games: Theory, Models, and Dynamics,” Synthesis Lectures Commun. Netw., vol. 4, no. 1, pp. 1–159, 2011.
[8] P. De Kerret, S. Lasaulce, D. Gesbert, and U. Salim, “Best-Response Team Power Control for the Interference Channel with Local CSI,” in IEEE Int. Conf. on Commun. (ICC), London, Jun. 2015, pp. 4132–4136.
[9] O. Tervo, H. Pennanen, D. Christopoulos, S. Chatzinotas, and B. Ottersten, “Distributed Optimization for Coordinated Beamforming in Multicell Multigroup Multicast Systems: Power Minimization and SINR Balancing,” IEEE Trans. Signal Process., vol. 66, no. 1, pp. 171–185, 2017.
[10] R. Fritzsche and G. P. Fettweis, “Distributed Robust Sum Rate Maximization in Cooperative Cellular Networks,” Jun. 2013.
[11] S.-H. Park, H. Park, and I. Lee, “Distributed Beamforming Techniques for Weighted Sum-Rate Maximization in MISO Interference Channels,” IEEE Commun. Lett., vol. 14, no. 12, pp. 1131–1133, 2010.
[12] C. D’Andrea, A. Zappone, S. Buzzi, and M. Debbah, “Uplink Power Control in Cell-Free Massive MIMO via Deep Learning,” in IEEE Int. Workshop Comput. Advances Multi-Sensor Adaptive Process. (CAMSAP), Le Gosier, Dec. 2019, pp. 554–558.
[13] B. Matthiesen, A. Zappone, E. A. Jorswieck, and M. Debbah, “Deep Learning for Real-Time Energy-Efficient Power Control in Mobile Networks,” in IEEE Int. Workshop Signal Process. Advances in Wireless Commun. (SPAWC), Cannes, Jul. 2019, pp. 1–5.
[14] Y. S. Nasir and D. Guo, “Multi-Agent Deep Reinforcement Learning for Dynamic Power Allocation in Wireless Networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239–2250, 2019.
[15] F. Meng, P. Chen, L. Wu, and J. Cheng, “Power Allocation in Multi-User Cellular Networks: Deep Reinforcement Learning Approaches,” IEEE Trans. Wireless Commun., 2020.
[16] Y. Wei, F. R. Yu, M. Song, and Z. Han, “User Scheduling and Resource Allocation in HetNets with Hybrid Energy Supply: An Actor-Critic Reinforcement Learning Approach,” IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, 2017.
[17] X. Zhang, M. R. Nakhai, G. Zheng, S. Lambotharan, and B. Ottersten, “Calibrated Learning for Online Distributed Power Allocation in Small-Cell Networks,” IEEE Trans. Commun., 2019.
[18] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained Policy Optimization,” in Int. Conf. Mach. Learn., Sydney, Aug. 2017, pp. 22–31.
[19] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust Region Policy Optimization,” in Int. Conf. Mach. Learn., Lille, Jul. 2015, pp. 1889–1897.
[20] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep Reinforcement Learning: A Brief Survey,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, 2017.
[21] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Mach. Learn., vol. 8, no. 3-4, pp. 229–256, 1992.
[22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” in Int. Conf. Mach. Learn., New York City, Jun. 2016, pp. 1928–1937.
[23] V. Kuleshov and D. Precup, “Algorithms for Multi-Armed Bandit Problems,” Feb. 2014. [Online]. Available: https://arxiv.org/abs/1402.6028
[24] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards Optimal Power Control via Ensembling Deep Neural Networks,” IEEE Trans. Commun., vol. 68, no. 3, pp. 1760–1776, 2019.
[25] M. Eisen, C. Zhang, L. F. Chamon, D. D. Lee, and A. Ribeiro, “Learning Optimal Resource Allocations in Wireless Systems,” IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2775–2790, 2019.
[26] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-Dimensional Continuous Control Using Generalized Advantage Estimation,” Oct. 2018. [Online]. Available: https://arxiv.org/abs/1506.02438
[27] S. M. Kakade, “A Natural Policy Gradient,” in Advances Neural Inf. Process. Syst., 2002, pp. 1531–1538.
[28] T. Zahavy, M. Haroush, N. Merlis, D. J. Mankowitz, and S. Mannor, “Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning,” in Advances Neural Inf. Process. Syst., 2018, pp. 3562–3573.
[29] J. Han, A. Jentzen, and E. Weinan, “Solving High-Dimensional Partial Differential Equations Using Deep Learning,” Proc. Nat. Acad. Sci., vol. 115, no. 34, pp. 8505–8510, 2018.
[30] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, “Emergent Tool Use From Multi-Agent Autocurricula,” Aug. 2019. [Online]. Available: https://arxiv.org/abs/1909.07528
[31] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep Decentralized Multi-Task Multi-Agent Reinforcement Learning under Partial Observability,” in Int. Conf. Mach. Learn., Sydney, Aug. 2017, pp. 2681–2690.
[32] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, “Learning to Communicate with Deep Multi-Agent Reinforcement Learning,” in Advances Neural Inf. Process. Syst., 2016, pp. 2137–2145.
[33] P. de Kerret, D. Gesbert, and M. Filippone, “Team Deep Neural Networks for Interference Channels,” in IEEE Int. Conf. Commun. Workshops (ICC Workshops), Kansas City, May 2018, pp. 1–6.
[34] C. M. Bishop,