Deep Actor-Critic Learning for Distributed Power Control in Wireless Mobile Networks
Yasar Sinan Nasir and Dongning Guo
Department of Electrical and Computer Engineering
Northwestern University, Evanston, IL 60208
Abstract—Deep reinforcement learning offers a model-free alternative to supervised deep learning and classical optimization for solving the transmit power control problem in wireless networks. The multi-agent deep reinforcement learning approach considers each transmitter as an individual learning agent that determines its transmit power level by observing the local wireless environment. Following a certain policy, these agents learn to collaboratively maximize a global objective, e.g., a sum-rate utility function. This multi-agent scheme is easily scalable and practically applicable to large-scale cellular networks. In this work, we present a distributively executed continuous power control algorithm with the help of deep actor-critic learning, and more specifically, by adapting deep deterministic policy gradient. Furthermore, we integrate the proposed power control algorithm into a time-slotted system where devices are mobile and channel conditions change rapidly. We demonstrate the functionality of the proposed algorithm using simulation results.
This material is based upon work supported by the National Science Foundation under Grants No. CCF-1910168 and No. CNS-2003098, as well as a gift from Intel Corporation.

I. INTRODUCTION
With the ever-increasing number of cellular devices, interference management has become a key challenge in developing newly emerging technologies for wireless cellular networks. An access point (AP) may increase its transmit power to improve the data rate to its devices, but this causes more interference to nearby devices. Power control is a well-known interference mitigation tool in wireless networks. It typically targets a non-convex sum-rate objective, and the problem becomes NP-hard when multiple devices share a frequency band [1].

Various state-of-the-art optimization methods have been applied to power control, such as fractional programming (FP) [2] and the weighted minimum mean square error (WMMSE) algorithm [3], which are model-driven and require a mathematically tractable and accurate model [4]. FP and WMMSE are iterative and executed in a centralized fashion, neglecting the delay caused by the feedback mechanism between a central controller and the APs. Both require full channel state information (CSI), and the APs need to wait until the centralized controller sends the outcome back over a backhaul once the iterations converge.

Data-driven methods are promising in a realistic wireless context where varying channel conditions impose serious challenges such as imperfect or delayed CSI. Reference [3] uses a deep neural network to mimic an optimization algorithm; the network is trained on a dataset composed of many optimization runs. The main motivation in [3] is to reduce the computational complexity while maintaining sum-rate performance comparable to WMMSE.
However, the training dataset relies on model-based optimization algorithms. In this paper, we consider a purely data-driven approach called model-free deep reinforcement learning.

Similar to this work, we earlier proposed a centralized-training, distributed-execution framework based on the deep Q-learning algorithm for dynamic (real-time) power control [5]. Since Q-learning applies only to discrete action spaces, transmit power had to be quantized in [5]. As a result, the quantizer design and the number of quantization levels, i.e., the number of possible actions, have an impact on the performance. For example, an extension of our prior work shows that quantizing the action space with a logarithmic step size gives better outcomes than a linear step size [6].

In this work, we replace deep Q-learning with an actor-critic method called the deep deterministic policy gradient (DDPG) algorithm [7], which applies to continuous action spaces. A distributively executed DDPG scheme has been applied to power control for a fixed channel and perfect CSI [6]. To the best of our knowledge, we are the first to study actor-critic based dynamic power control that involves mobility of cellular devices. Our prior work assumed immobile devices, where the large-scale fading component was the steady state of the channel. We adapt our previous approach to make it applicable to our new system model that involves mobility, where channel conditions vary due to both small- and large-scale fading. To ensure practicality, we assume delayed and incomplete CSI, and using simulations, we compare the sum-rate outcome with WMMSE and FP, both given full perfect CSI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this paper, we consider a special case where $N$ mobile devices are uniformly randomly placed in $K$ homogeneous hexagonal cells. This deployment scenario is similar to the interfering multiaccess channel scenario also examined in [3], [5]. Let $\mathcal{N} = \{1, \ldots, N\}$ and $\mathcal{K} = \{1, \ldots, K\}$ denote the sets of link and cell indexes, respectively. Here we are not concerned with the device association problem. As device $n \in \mathcal{N}$ is inside cell $k \in \mathcal{K}$, its associated AP $n$ is located at the center of cell $k$. We denote the cell association of device $n$ as $b_n \in \mathcal{K}$, so AP $n$ is positioned at the center of cell $b_n$. All transmitters and receivers use a single antenna, and we consider a single frequency band with flat fading. The network is assumed to be a fully synchronized time-slotted system with slot duration $T$. We employ a block-fading model and denote the downlink channel gain from a transmitter located at the center of cell $k$ to the receiver antenna of device $n$ in time slot $t$ as

$$\bar{g}_{k \to n}^{(t)} = \big| h_{k \to n}^{(t)} \big|^2 \alpha_{k \to n}^{(t)}, \qquad t = 1, 2, \ldots \quad (1)$$

In (1), $\alpha_{k \to n}^{(t)} \ge 0$ represents the large-scale fading component, including path loss and log-normal shadowing, which varies as mobile device $n$ changes its position. Let $x_k$ denote the 2D position, i.e., the $(x, y)$-coordinates, of cell $k$'s center. Similarly, we represent the location of mobile device $n$ at slot $t$ as $x_n^{(t)}$. Then, the large-scale fading can be expressed in dB as

$$\alpha_{\mathrm{dB},\, k \to n}^{(t)} = \mathrm{PL}\big(x_k, x_n^{(t)}\big) + X_{k \to n}^{(t)}, \quad (2)$$

where $\mathrm{PL}$ is the distance-dependent path loss in dB and $X_{k \to n}^{(t)}$ is the log-normal shadowing from $x_k$ to $x_n^{(t)}$. For each device $n$, we compute the shadowing from all $K$ possible AP positions in the network. The shadowing parameter is updated by $X_{k \to n}^{(t)} = \rho_{s,n}^{(t)} X_{k \to n}^{(t-1)} + \sigma_s e_{s,k \to n}^{(t)}$, where $\sigma_s$ is the log-normal shadowing standard deviation and the correlation is $\rho_{s,n}^{(t)} = e^{-\Delta x_n^{(t)} / d_{\mathrm{cor}}}$, with $\Delta x_n^{(t)} = \| x_n^{(t)} - x_n^{(t-1)} \|$ being the displacement of device $n$ during the last slot and $d_{\mathrm{cor}}$ the correlation length of the environment. Note that $X_{k \to n}^{(0)} \sim \mathcal{N}(0, \sigma_s^2)$ and the shadowing innovation process $e_{s,k \to n}^{(1)}, e_{s,k \to n}^{(2)}, \ldots$ consists of independent and identically distributed (i.i.d.) Gaussian variables with distribution $\mathcal{N}\big(0,\, 1 - (\rho_{s,n}^{(t)})^2\big)$. Following [8], we model the change in the movement behavior of each device as incremental steps on its speed and direction.

Using the Jakes fading model [5], we introduce the small-scale Rayleigh fading component of (1) as a first-order complex Gauss-Markov process: $h_{k \to n}^{(t)} = \rho_n^{(t)} h_{k \to n}^{(t-1)} + e_{k \to n}^{(t)}$, where $h_{k \to n}^{(0)} \sim \mathcal{CN}(0, 1)$ is circularly symmetric complex Gaussian (CSCG) with unit variance and the independent channel innovation process $e_{k \to n}^{(1)}, e_{k \to n}^{(2)}, \ldots$ consists of i.i.d. CSCG random variables with distribution $\mathcal{CN}\big(0,\, 1 - (\rho_n^{(t)})^2\big)$. The correlation is $\rho_n^{(t)} = J_0\big(2 \pi f_{d,n}^{(t)} T\big)$, where $J_0(\cdot)$ is the zeroth-order Bessel function of the first kind and $f_{d,n}^{(t)} = v_n^{(t)} f_c / c$ is device $n$'s maximum Doppler frequency at slot $t$, with $v_n^{(t)} = \Delta x_n^{(t)} / T$ being device $n$'s speed, $c = 3 \times 10^8$ m/s, and $f_c$ the carrier frequency.
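To make the channel dynamics concrete, the following is a minimal numpy/scipy sketch of one slot of the shadowing update in (2) and the Jakes small-scale model above; the function name, array layout, and default arguments are illustrative assumptions for this sketch and are not taken from the paper's released code.

```python
# Minimal sketch (not the paper's code) of one slot of channel evolution:
# the correlated shadowing update of (2) and the Jakes fading model.
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def step_channel(h, X, pos, prev_pos, fc=2e9, T=20e-3, d_cor=10.0, sigma_s=10.0):
    """Advance fading h and shadowing X (both K x N, X in dB) by one slot.

    pos, prev_pos: 2 x N arrays of device coordinates in meters (assumed layout).
    """
    c = 3e8
    delta = np.linalg.norm(pos - prev_pos, axis=0)   # per-device displacement (m)
    f_d = (delta / T) * fc / c                       # Doppler: f_d = v * fc / c
    rho = j0(2.0 * np.pi * f_d * T)                  # fading correlation per device
    # Small-scale: first-order complex Gauss-Markov process; the CSCG innovation
    # e ~ CN(0, 1 - rho^2) has real/imag parts with variance (1 - rho^2)/2 each.
    e = np.sqrt((1.0 - rho**2) / 2.0) * (
        np.random.randn(*h.shape) + 1j * np.random.randn(*h.shape))
    h = rho * h + e
    # Shadowing: X <- rho_s * X + sigma_s * e_s with e_s ~ N(0, 1 - rho_s^2)
    rho_s = np.exp(-delta / d_cor)
    X = rho_s * X + sigma_s * np.sqrt(1.0 - rho_s**2) * np.random.randn(*X.shape)
    return h, X
```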
Let $b_n^{(t)}$ and $p_n^{(t)}$ denote device $n$'s associated cell and the transmit power of its associated AP in time slot $t$, respectively. Hence, the association and power allocation in time slot $t$ can be denoted as $b^{(t)} = [b_1^{(t)}, \ldots, b_N^{(t)}]^\top$ and $p^{(t)} = [p_1^{(t)}, \ldots, p_N^{(t)}]^\top$, respectively. The signal-to-interference-plus-noise ratio at receiver $n$ in time slot $t$ can be defined as a function of the association $b^{(t)}$ and the allocation $p^{(t)}$:

$$\gamma_n^{(t)}\big(b^{(t)}, p^{(t)}\big) = \frac{\bar{g}_{b_n^{(t)} \to n}^{(t)}\, p_n^{(t)}}{\sum_{m \neq n} \bar{g}_{b_m^{(t)} \to n}^{(t)}\, p_m^{(t)} + \sigma^2}, \quad (3)$$

where $\sigma^2$ is the additive white Gaussian noise power spectral density, which is assumed to be the same at all receivers without loss of generality. Then, the downlink spectral efficiency of device $n$ at time $t$ is

$$C_n^{(t)} = \log_2\big(1 + \gamma_n^{(t)}(b^{(t)}, p^{(t)})\big). \quad (4)$$

For a given association $b^{(t)}$, the power control problem at time slot $t$ can be defined as a sum-rate maximization problem:

$$\underset{p^{(t)}}{\text{maximize}} \ \sum_{n=1}^{N} C_n^{(t)} \quad \text{subject to} \ 0 \le p_n \le P_{\max},\ n = 1, \ldots, N, \quad (5)$$

where $P_{\max}$ is the maximum power spectral density that an AP can emit. The real-time allocator solves the problem in (5) at the beginning of slot $t$ and its solution becomes $p^{(t)}$. For ease of notation, throughout the paper we write $g_{m \to n}^{(t)} = \bar{g}_{b_m^{(t)} \to n}^{(t)}$.
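As a quick illustration of (3)-(5), the snippet below evaluates the sum-rate objective for a candidate power vector; the $N \times N$ gain-matrix layout G[m, n] = $g_{m \to n}$ is an assumption made for this sketch.

```python
# Sketch: evaluate the sum-rate objective of (5) from cross gains and powers.
import numpy as np

def sum_rate(G, p, noise):
    """G[m, n] = g_{m->n} (assumed layout), p = transmit powers, noise = sigma^2."""
    desired = np.diag(G) * p             # g_{n->n} p_n
    interference = G.T @ p - desired     # sum over m != n of g_{m->n} p_m
    sinr = desired / (interference + noise)        # eq. (3)
    return float(np.sum(np.log2(1.0 + sinr)))      # eqs. (4)-(5)
```

A centralized solver such as FP or WMMSE needs the full matrix G to evaluate and improve p, which is exactly the requirement the distributed scheme of Section III avoids.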
III. PROPOSED POWER CONTROL ALGORITHM

A. Reinforcement Learning Overview
A learning agent interacts with its environment in a sequence of discrete time steps. At each step $t$, the agent first observes the state of the environment, i.e., the key relevant environment features, $s^{(t)} \in \mathcal{S}$, with $\mathcal{S}$ being the set of possible states. Then, it picks an action $a^{(t)} \in \mathcal{A}$, where $\mathcal{A}$ is the set of actions, following a policy that is either deterministic or stochastic, denoted by $\mu$ with $a^{(t)} = \mu(s^{(t)})$ or by $\pi$ with $a^{(t)} \sim \pi(\cdot \mid s^{(t)})$, respectively. As a result of this interaction, the environment moves to a new state $s^{(t+1)}$ following a transition probability matrix that maps state-action pairs onto a distribution of states at the next step. The agent perceives how good or bad taking action $a^{(t)}$ at state $s^{(t)}$ is via a reward signal $r^{(t+1)}$. We describe the above interaction as an experience at $t+1$, denoted as $e^{(t+1)} = \big(s^{(t)}, a^{(t)}, r^{(t+1)}, s^{(t+1)}\big)$.

Model-free reinforcement learning learns directly from these interactions without any information on the transition dynamics and aims to learn a policy that maximizes the agent's long-term cumulative discounted reward at time $t$,

$$R^{(t)} = \sum_{\tau=0}^{\infty} \gamma^{\tau} r^{(t+\tau+1)}, \quad (6)$$

where $\gamma \in (0, 1)$ is the discount factor.

Two main approaches to train agents with model-free reinforcement learning are value-function-based and policy-search-based methods [9]. The well-known Q-learning algorithm is value-based and learns an action-value function $Q(s, a)$. Classical Q-learning uses a lookup table to represent the Q-function, which does not scale well for large state spaces, i.e., a high number of environment features or some continuous environment features. Deep Q-learning overcomes this challenge by employing a deep neural network to represent the Q-function in place of a lookup table. However, the action space still remains discrete, which requires quantization of transmit power levels in a power control problem. Policy search methods can directly handle continuous action spaces. In addition, compared to Q-learning, which indirectly optimizes the agent's performance by learning a value function, policy search methods directly optimize a policy, which is often more stable and reliable [10]. By contrast, policy search algorithms are typically on-policy, meaning each policy iteration only uses data collected by the most recent policy. Q-learning can reuse data collected at any point during training and is consequently more sample efficient. Another specific advantage of off-policy learning for a wireless network application is that the agents do not need to wait for the most recent policy update and can simultaneously collect samples while the new policy is being trained. Since both value-based and policy-based approaches have their strengths and drawbacks, there is also a hybrid approach called actor-critic learning [9].

Reference [7] proposed the DDPG algorithm, which is based on the actor-critic architecture and allows continuous action spaces. The DDPG algorithm iteratively trains an action-value function using a critic network and uses this function estimate to train a deterministic policy parameterized by an actor network. For a policy $\pi$, the Q-function at state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ is

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[R^{(t)} \,\big|\, s^{(t)} = s,\ a^{(t)} = a\big]. \quad (7)$$

For a certain state $s$, a deterministic policy $\mu : \mathcal{S} \to \mathcal{A}$ returns action $a = \mu(s)$.
In a stationary Markov decision process setting, the optimal Q-function associated with the target policy $\mu$ satisfies the Bellman property, and we can make use of this recursive relationship:

$$Q^{\mu}(s, a) = \mathbb{E}\big[r^{(t+1)} + \gamma Q^{\mu}(s', \mu(s')) \,\big|\, s^{(t)} = s,\ a^{(t)} = a\big], \quad (8)$$

where the expectation is over $s'$, which follows the distribution of the state of the environment. As the target policy is deterministic, the expectation in (8) depends only on the environment transition dynamics. Hence, an off-policy learning method similar to deep Q-learning can be used to learn a Q-function parameterized by a deep neural network called the critic network. The critic network is denoted as $Q_{\phi}(s, a)$, with $\phi$ being its parameters. Similarly, we parameterize the policy using another deep neural network named the actor network, $\mu_{\theta}(s)$, with policy parameters $\theta$.

Let the past interactions be stored in an experience-replay memory $\mathcal{D}$ until time $t$ in the form $e = (s, a, r', s')$. This memory needs to be large enough to avoid over-fitting and small enough for fast training. DDPG also applies another trick called the quasi-static target network approach and defines two separate networks to be used in training, namely the train and target critic networks, with parameters denoted as $\phi$ and $\phi_{\text{target}}$, respectively. To train $\phi$, at each time slot DDPG minimizes the following mean-squared Bellman error:

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r',s') \sim \mathcal{D}}\big[\big(y(r', s') - Q_{\phi}(s, a)\big)^2\big], \quad (9)$$

where the target is $y(r', s') = r' + \gamma Q_{\phi_{\text{target}}}\big(s', \mu_{\theta}(s')\big)$. Hence, $\phi$ is updated by sampling a random mini-batch $B$ from $\mathcal{D}$ and running gradient descent using

$$\nabla_{\phi}\, \frac{1}{|B|} \sum_{(s,a,r',s') \in B} \big(y(r', s') - Q_{\phi}(s, a)\big)^2. \quad (10)$$

Note that after each training iteration $\phi_{\text{target}}$ is updated by $\phi$. In addition, the policy parameters are updated to learn a policy $\mu_{\theta}(s)$ that gives the action maximizing $Q_{\phi}(s, a)$. Since the action space is continuous, $Q_{\phi}(s, a)$ is differentiable with respect to the action, and $\theta$ is updated by gradient ascent using

$$\nabla_{\theta}\, \frac{1}{|B|} \sum_{s \in B} Q_{\phi}\big(s, \mu_{\theta}(s)\big). \quad (11)$$

To ensure exploration during training, a noise term is added to the deterministic policy output [7]. In our multi-agent framework, to be discussed in the next section, we instead employ the $\epsilon$-greedy algorithm of Q-learning for easier tuning.
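The following condensed PyTorch sketch shows one iteration of (9)-(11) with the hard target update described above. The layer sizes, learning rates, and state/action dimensions are placeholders; the paper's actual hyperparameters live in its released source code.

```python
# Condensed sketch of one DDPG iteration per (9)-(11); hyperparameters are placeholders.
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 50, 1, 0.99      # illustrative dimensions

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Sigmoid())  # scaled by Pmax outside
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
critic_target = copy.deepcopy(critic)           # quasi-static target network
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s2):
    """One gradient step on a mini-batch (s, a, r', s') sampled from D."""
    with torch.no_grad():                       # target y(r', s') of eq. (9);
        # per the text, the target uses the current actor, not an actor target
        y = r + gamma * critic_target(torch.cat([s2, actor(s2)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()   # eq. (10)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()         # eq. (11)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    critic_target.load_state_dict(critic.state_dict())  # phi_target <- phi
```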
B. Proposed Multi-Agent Learning Scheme for Power Control

Fig. 1: Diagram of the proposed power control algorithm.

For the proposed power control scheme in Fig. 1, we let each transmitter be a learning agent. Hence, the next state of each agent is determined by the joint actions of all agents, and the environment is no longer stationary. In order to avoid instability, we gather the experiences of all agents in a single replay memory and train a global actor network $\theta_{\text{agent}}$ to be shared by all agents. At slot $t$, each agent $n \in \mathcal{N}$ observes its local state $s_n^{(t)}$ and sets its own action $a_n^{(t)}$ using $\theta_{\text{agent}}$.

For each link $n$, we first describe the neighboring sets that allow the distributed execution. Link $n$'s set of interfering neighbors at time slot $t$ consists of nearby AP indexes whose received signal-to-noise ratio (SNR) at device $n$ was above a certain threshold during the past time slot:

$$I_n^{(t)} = \big\{ i \in \mathcal{N},\ i \neq n \,\big|\, g_{i \to n}^{(t-1)} p_i^{(t-1)} > \eta \sigma^2 \big\}. \quad (12)$$

Conversely, we define link $n$'s set of interfered neighbors at time slot $t$ using the received SNR from AP $n$, i.e.,

$$O_n^{(t)} = \big\{ o \in \mathcal{N},\ o \neq n \,\big|\, g_{n \to o}^{(t-1)} p_n^{(t-1)} > \eta \sigma^2 \big\}. \quad (13)$$

To satisfy the practical constraints introduced in [5], we limit the information exchange between AP $n$ and its neighboring APs as depicted in Fig. 2. Although it is assumed in [5] that receiver $n$ may take a more recent received power measurement from AP $i \in I_n^{(t+1)}$ just before the beginning of time slot $t+1$, i.e., $\bar{g}_{b_i^{(t)} \to n}^{(t+1)} p_i^{(t)}$, we prefer not to require it for our new model that involves mobility. Note that as device $n$'s association changes, i.e., $b_n^{(t+1)} \neq b_n^{(t)}$, we assume that the neighboring sets are still determined with respect to the previous positioning of AP $n$, and the feedback history from past neighbors is preserved at the new AP position to be used in agent $n$'s state. We let the association of a device change only after it stays within a new cell for $T_{\text{register}}$ consecutive slots.

Fig. 2: The information exchange in time slot $t$.

For the training process, as a major modification of [5], we introduce training episodes where we execute a training process for $T_{\text{train}}$ slots and let devices do a random walk without any training for $T_{\text{travel}}$ slots before the next training episode. The $T_{\text{travel}}$-slot-long traveling period induces change in the channel conditions and consequently allows the policy to observe more varied states during its training, which intuitively increases its robustness to changes in channel conditions. To train a policy from scratch for a random wireless network initialization, we run $E$ training episodes indexed by $e = 1, \ldots, E$. The $e$-th training episode starts at slot $t_e = (e - 1)(T_{\text{train}} + T_{\text{travel}})$ and is composed of two parallel procedures called centralized training and distributed execution.
After an interaction with the environment, each agent sends its newly acquired experience to the centralized trainer, which executes the centralized training process. The trainer clears its experience-replay memory at the beginning of each training episode. Due to the backhaul delay shown in Fig. 1, we assume that the most recent experience the trainer has received from agent $n$ at slot $t$ is $e_n^{(t-1)}$. During slot $t$, after acquiring all recent experiences in the memory $\mathcal{D}$, the trainer runs one gradient step for the actor and critic networks. Since the purpose of the critic network is to guide the actor network during training, only the actor network needs to be broadcast to the network agents, and during the inference mode only the actor network is required. The trainer starts to broadcast $\theta_{\text{broadcast}} \leftarrow \theta_{\text{agent}}$ once every $T_u$ slots, and we assume $\theta_{\text{broadcast}}$ is received by the agents after $T_d$ slots, again due to delay. In addition, compared to the deep Q-network in [5], which reserves an output port for each discrete action, each actor network has just one output port.

The local state of agent $n$ at time $t$, i.e., $s_n^{(t)}$, is a tuple of local environment features that are significantly affected by the agent's and its neighbors' actions. As described in [5], the state set design is a combination of three feature groups. The first feature group is called "local information" and occupies six neural network input ports. The first input port is agent $n$'s latest transmit power $p_n^{(t-1)}$, which is followed by its contribution to the network objective (5), i.e., $C_n^{(t-1)}$. Next, agent $n$ appends the last two measurements of its direct downlink channel and of the total interference-plus-noise power at receiver $n$: $g_{n \to n}^{(t)}$, $g_{n \to n}^{(t-1)}$, $\big(\sum_{m \in \mathcal{N}, m \neq n} g_{m \to n}^{(t-1)} p_m^{(t-1)} + \sigma^2\big)$, and $\big(\sum_{m \in \mathcal{N}, m \neq n} g_{m \to n}^{(t-2)} p_m^{(t-2)} + \sigma^2\big)$.

These are followed by the "interfering neighbors" feature group. Since we are concerned with scalability, we limit the number of interfering neighbors the algorithm involves to $c$ by prioritizing the elements of $I_n^{(t)}$ by their amount of interference at receiver $n$, i.e., $g_{i \to n}^{(t-1)} p_i^{(t-1)}$. We form $\bar{I}_n^{(t)}$ by taking the first $c$ sorted elements of $I_n^{(t)}$. If $|I_n^{(t)}| < c$, we fill the shortage with virtual neighbors that have zero downlink and interfering channel gains. We also set a virtual neighbor's spectral efficiency to an arbitrary negative number. Hence, a virtual neighbor is just a placeholder that ineffectively fills the neural network inputs. Next, for each $i \in \bar{I}_n^{(t)}$, we reserve three input ports, among them $g_{i \to n}^{(t)} p_i^{(t-1)}$ and $C_i^{(t-1)}$. This makes a total of $3c$ input ports used for the current interfering neighbors. In addition, agent $n$ also includes the history of interfering neighbors and appends another $3c$ inputs using $\bar{I}_n^{(t-1)}$.

Finally, we have the "interfered neighbors" feature group. If agent $n$ did not transmit during slot $t-1$, then $O_n^{(t)} = \emptyset$ and there is no useful interfered-neighbor information with which to build $s_n^{(t)}$. Hence, we define time slot $t'_n$ as the last slot with $p_n^{(t'_n)} > 0$ and we consider $O_n^{(t'_n + 1)}$ in our state set design. We also assume that when agent $n$ becomes inactive, it still carries on its information exchange with each $o \in O_n^{(t'_n + 1)}$ without the knowledge of $g_{n \to o}^{(t-1)}$.
Similar to the scheme described above, agent $n$ regulates $O_n^{(t'_n + 1)}$ to set $|\bar{O}_n^{(t)}| = c$. For $o \in O_n^{(t'_n + 1)}$, the prioritization criterion is now agent $n$'s share of the interference at receiver $o$, i.e., $g_{n \to o}^{(t-1)} p_n^{(t-1)} \big(\sum_{m \in \mathcal{N}, m \neq o} g_{m \to o}^{(t-1)} p_m^{(t-1)} + \sigma^2\big)^{-1}$. For each interfered neighbor $o \in \bar{O}_n^{(t)}$, $s_n^{(t)}$ accommodates four features, including $g_{o \to o}^{(t-1)}$, $C_o^{(t-1)}$, and $g_{n \to o}^{(t'_n)} p_n^{(t'_n)} \big(\sum_{m \in \mathcal{N}, m \neq o} g_{m \to o}^{(t-1)} p_m^{(t-1)} + \sigma^2\big)^{-1}$.

The reward of agent $n$, $r_n^{(t+1)}$, is computed by the centralized trainer and used in the training process. Similar to [5], $r_n^{(t+1)}$ is defined as the agent's contribution to the objective (5):

$$r_n^{(t+1)} = C_n^{(t)} - \sum_{o \in O_n^{(t+1)}} \pi_{n \to o}^{(t)}, \quad (14)$$

with $\pi_{n \to o}^{(t)} = \log_2\big(1 + \gamma_o^{(t)}\big(b^{(t)}, [\ldots, p_{n-1}^{(t)}, 0, p_{n+1}^{(t)}, \ldots]^\top\big)\big) - C_o^{(t)}$ being the externality that link $n$ causes to interfered link $o$, i.e., the spectral-efficiency gain receiver $o$ would obtain if AP $n$ were silent.
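To tie the pieces together, here is a sketch of how a trainer could compute the neighbor sets (12)-(13) and the reward (14) from the previous slot's gains and powers; the G[m, n] = $g_{m \to n}$ layout is the same assumption as in the earlier snippets.

```python
# Sketch: neighbor sets (12)-(13) and the reward (14) with its externality term.
import numpy as np

def neighbor_sets(G_prev, p_prev, noise, eta):
    """I_n (who interferes with n) and O_n (whom n interferes with), per (12)-(13)."""
    snr = G_prev * p_prev[:, None] / noise        # snr[i, n]: from AP i at receiver n
    mask = snr > eta
    np.fill_diagonal(mask, False)
    I = [np.flatnonzero(mask[:, n]) for n in range(len(p_prev))]
    O = [np.flatnonzero(mask[n, :]) for n in range(len(p_prev))]
    return I, O

def rate(o, G, p, noise):
    """Spectral efficiency C_o of link o, per (3)-(4)."""
    interf = G[:, o] @ p - G[o, o] * p[o]
    return np.log2(1.0 + G[o, o] * p[o] / (interf + noise))

def reward(n, G, p, noise, O_n):
    """r_n = C_n minus the externality pi_{n->o} summed over interfered neighbors."""
    p_off = p.copy()
    p_off[n] = 0.0                                # link n silenced, as in (14)
    externality = sum(rate(o, G, p_off, noise) - rate(o, G, p, noise) for o in O_n)
    return rate(n, G, p, noise) - externality
```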
IV. SIMULATIONS

Following the LTE standard, the path loss is simulated by $120.9 + 37.6 \log_{10}(d)$ (in dB) with $f_c = 2$ GHz, where $d$ is the transmitter-to-receiver distance in km. We set $\sigma_s = 10$ dB, $d_{\text{cor}} = 10$ meters, $T = 20$ ms, $P_{\max} = 38$ dBm, and $\sigma^2 = -114$ dBm.
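For reference, the dB-valued constants above convert to linear scale as in the short sketch below; the path-loss intercept follows the LTE-style model used in [5] and should be treated as this sketch's assumption.

```python
# Sketch: simulation constants in linear scale (values as assumed above).
import numpy as np

def pathloss_db(d_km):
    """LTE-style distance-dependent path loss at fc = 2 GHz, d in km."""
    return 120.9 + 37.6 * np.log10(d_km)

P_MAX = 10 ** ((38.0 - 30.0) / 10.0)    # 38 dBm  -> ~6.31 W
NOISE = 10 ** ((-114.0 - 30.0) / 10.0)  # -114 dBm -> ~4.0e-15 W
```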
Fig. 3: Example movement until the end of episode e = 3 (AP locations and device trajectories across training episodes and travel periods).
Fig. 4: Test results (normalized sum-rate performance) for the 10-cell, 20-link scenario. Curves: WMMSE with perfect CSI, FP with perfect CSI, pre-trained policy with mobility, pre-trained policy without mobility, FP with 1-slot delayed CSI, random power, and full power.

We simulate the mobility using Haas' model [8] with a capped maximum speed. Each mobile device randomly updates its speed and direction every second by increments drawn uniformly from fixed symmetric intervals in m/s and radians, respectively. Fig. 3 shows an example movement scenario until the end of the third training episode with $T_{\text{train}} = 5{,}000$ and $T_{\text{travel}} = 50{,}000$ slots. The DDPG implementation and parameters are included in the source code (GitHub repository: https://github.com/sinannasir/Power-Control-asilomar). Both WMMSE and FP start from a full power allocation, since it gives better performance than a random initialization. WMMSE takes more iterations to converge than FP, resulting in a higher sum-rate.

We first train two policies for a network deployment with $K = 10$ cells and $N = 20$ links for $E = 10$ training episodes. The first policy is trained with mobile devices, whereas the latter is trained without mobility, i.e., with a steady channel. We set $f_d$ to 10 Hz for all time slots as in [5]. We save the policy parameters during training for testing on several random deployments with $(K, N) = (10, 20)$ and mobility. As shown in Fig. 4, without mobility there is no significant sum-rate gain after the first training episode, and the policy converges to FP's sum-rate performance. As a remark, FP is centralized and has full CSI, whereas the actor network is distributively executed with limited information exchange. As we include device mobility and a certain travel time between training episodes, the policy is able to experience various device positions and interference conditions during training, so its sum-rate performance consistently increases. Additionally, in Table I, we show that an actor network trained for $(K, N) = (10, 20)$ can keep up with the sum-rate performance of the optimization algorithms as the network gets larger.
TABLE I: Average sum-rate performance in bps/Hz per link.

(cells, links)   policy trained for (10,20)   WMMSE   FP     FP w delay   random   full
(10, 20)         2.59                         2.61    2.45   2.37         0.93     0.91
(20, 40)         1.97                         2.09    1.98   1.87         0.68     0.68
(20, 60)         1.58                         1.68    1.59   1.50         0.37     0.35
(20, 100)        1.14                         1.23    1.15   1.09         0.18     0.17

Hence, running centralized training from scratch is not necessary as device positions change or new devices register, since a policy pre-trained for a smaller and different deployment performs quite well. For the 20-link scenario, on average, WMMSE and FP converge in 42 and 24 iterations, respectively. For 100 links, WMMSE requires 74 iterations. Conversely, the learning agent needs just one policy evaluation per slot.
V. CONCLUSION
In this paper, we presented a distributively executed deep actor-critic framework for power control. During training, only the actor network is broadcast to the learning agents. Simulations show that a pre-trained policy gives performance comparable to WMMSE and FP, and that a policy trained for a smaller deployment is applicable to a larger network without additional training thanks to the distributed execution scheme. Further, we have shown that the proposed actor-critic framework enables real-time power control under certain practical constraints and is compatible with the case of mobile devices. In fact, DDPG uses the mobility to increase its sum-rate performance by experiencing more varied channel conditions.
REFERENCES

[1] Z. Q. Luo and S. Zhang, "Dynamic spectrum management: Complexity and duality," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 57-73, Feb. 2008.
[2] K. Shen and W. Yu, "Fractional programming for communication systems, Part I: Power control and beamforming," IEEE Transactions on Signal Processing, vol. 66, no. 10, pp. 2616-2630, May 2018.
[3] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438-5453, Oct. 2018.
[4] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, "Deep learning in physical layer communications," IEEE Wireless Communications, vol. 26, no. 2, pp. 93-99, 2019.
[5] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239-2250, 2019.
[6] F. Meng, P. Chen, L. Wu, and J. Cheng, "Power allocation in multi-user cellular networks: Deep reinforcement learning approaches," CoRR, vol. abs/1901.07159, 2019. [Online]. Available: http://arxiv.org/abs/1901.07159
[7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv e-prints, p. arXiv:1509.02971, Sep. 2015.
[8] Z. J. Haas, "A new routing protocol for the reconfigurable wireless networks," in Proceedings of ICUPC '97 - 6th International Conference on Universal Personal Communications, vol. 2, Oct. 1997, pp. 562-566.
[9] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26-38, Nov. 2017.