Deep Reinforcement Learning for Energy-Efficient Beamforming Design in Cell-Free Networks
Weilai Li, Wanli Ni, Hui Tian, and Meihui Hua
State Key Lab. of Networking and Switching Technology, Beijing Univ. of Posts and Telecommun., Beijing, China
E-mail: {liweilai, charleswall, tianhuin, huameihui}@bupt.edu.cn

Abstract—The cell-free network is considered a promising architecture for satisfying more demands of future wireless networks, where distributed access points coordinate with an edge cloud processor to jointly serve a smaller number of user equipments in a compact area. In this paper, the problem of uplink beamforming design is investigated for maximizing the long-term energy efficiency (EE) with the aid of deep reinforcement learning (DRL) in the cell-free network. Firstly, based on minimum mean square error channel estimation and exploiting successive interference cancellation for signal detection, the expression of the signal to interference plus noise ratio (SINR) is derived. Secondly, according to the formulation of the SINR, we define the long-term EE, which is a function of the beamforming matrix. Thirdly, to address the dynamic beamforming design with continuous state and action spaces, a DRL-enabled beamforming design is proposed based on the deep deterministic policy gradient (DDPG) algorithm, taking advantage of its double-network architecture. Finally, simulation results indicate that the DDPG-based beamforming design is capable of converging to the optimal EE performance. Furthermore, the influence of hyper-parameters on the EE performance of the DDPG-based beamforming design is investigated, and it is demonstrated that an appropriate discount factor and hidden layer size can improve the EE performance.
I. INTRODUCTION
An innovative network architecture called the cell-free network has attracted considerable attention recently and is considered to offer numerous benefits for future wireless networks [1]–[3]. There are no cells or cell boundaries in the cell-free network. All access points (APs) are fully connected to a smaller number of user equipments (UEs) and coordinate together with an edge cloud processor (ECP) in a compact area, in order to exploit favorable propagation and channel hardening, as well as to mitigate the cell-edge problem [4]. The APs estimate the channel locally based on the non-orthogonal pilots transmitted from the UEs due to insufficient orthogonal resources, which in turn leads to pilot contamination [5]. In order to alleviate inter-user interference, non-orthogonal pilot contamination and excessive control signaling, beamforming, one of the essential signal processing technologies, is utilized in cell-free networks, and it has been shown that appropriate beamforming design is conducive to improving the system throughput, energy efficiency (EE) and coverage probability [6]–[8].

The state-of-the-art literature on beamforming design in cell-free networks focuses on diverse aspects [8]–[12]. In
This paper is funded by the Beijing Univ. of Posts and Telecommun.-China Mobile Research Institute Joint Innovation Center.

[9], the authors proposed an iterative algorithm utilizing the max-min beamformer and derived the capacity lower bound of the cell-free network with channel estimation error. In [10], the authors presented a modification of conjugate beamforming for the forward link in cell-free networks that eliminates the self-interference completely with no action required at the receivers. In [11], the authors applied a distributed conjugate beamforming scheme at each AP using local channel state information (CSI), and the scheme could enhance the total EE. However, beamforming design is a dynamic, successive decision-making problem under uncertainty, and these conventional optimization schemes are limited by their myopic decision criteria, poor scalability and high complexity. In view of this, deep reinforcement learning (DRL) is an adaptive method to overcome these challenges [13]. In [8], the authors formulated a novel hybrid model using deep deterministic policy gradient (DDPG) and deep double Q-network for jointly optimizing the clustering of APs and the beamforming vectors, and the simulation results demonstrated that the model is efficient in maximizing the per-user transmission rate. In [12], the authors proposed a distributed dynamic downlink-beamforming coordination method with partial observability of the CSI using deep Q-learning, and the simulation results proved that the method is effective for improving the achievable transmission rate. The literature above reveals the effectiveness of DRL in beamforming optimization for network throughput improvement. However, energy consumption becomes a critical issue as the future network size scales up, and the trade-off between throughput and energy consumption deserves comprehensive attention in more practical scenarios.

Motivated by the aforementioned discussion, the main contributions of this paper are as follows: 1) The expression of the signal to interference plus noise ratio (SINR) for each UE is derived under minimum mean square error (MMSE) channel estimation and successive interference cancellation (SIC) detection. 2) We further derive the closed-form expression of the long-term EE, and the DDPG algorithm is utilized for beamforming design to maximize the long-term EE. 3) The convergence and optimality of our proposed DDPG-based beamforming design are analyzed theoretically, its EE performance is evaluated numerically, and its superiority over benchmarks is demonstrated. 4) The influence of the hyper-parameters on the EE performance is further explored, and it is concluded that an appropriate discount factor and hidden layer size improve the EE performance.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model
As illustrated in Fig. 1, a cell-free network with M single-antenna APs and K single-antenna UEs is considered, where the locations of the APs are fixed, and the UEs are initially randomly distributed and assumed to be of low mobility. In the cell-free network, all APs are fully connected to all UEs and link to the ECP via perfect backhaul links. All UEs are uniformly served by the distributed APs in a collaborative manner. The channel estimation is conducted at each AP locally, and signal detection occurs in the centralized processing unit (CPU) pool in the ECP based on the channel state information (CSI) sent from the APs. Subsequently, the CPU performs the beamforming design and returns the beamforming decision to all APs as feedback.

To acquire CSI, a random set of pilot sequences is assigned to the UEs for channel estimation. Firstly, we assume that the channel between the k-th UE and the m-th AP is modeled as follows:

$$g_{mk} = \beta_{mk}^{1/2} h_{mk}, \qquad (1)$$

where $h_{mk}$ is the small-scale fading coefficient between the k-th UE and the m-th AP, and $\beta_{mk}$ is the large-scale fading coefficient, which is given by:

$$\beta_{mk} = \varsigma \, (d_{mk})^{-\alpha}, \qquad (2)$$

where $\varsigma$ is the path loss at the reference distance of 1 meter, $\alpha$ is the path-loss exponent, and $d_{mk}$ denotes the access link distance. We assume that $h_{mk}$ ($\forall m, \forall k$) are independent and identically distributed (i.i.d.) random variables, i.e., $h_{mk} \sim \mathcal{CN}(0, 1)$.
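For concreteness, the following Python sketch shows one way the channel coefficients in (1)–(2) could be generated; the disk layout, the value of ς and the path-loss exponent α are illustrative assumptions rather than the paper's exact simulation settings.

```python
import numpy as np

def generate_channels(M, K, radius=20.0, varsigma=1e-3, alpha=3.0, rng=None):
    """Sample g_mk = beta_mk^{1/2} h_mk per (1)-(2).

    varsigma (path loss at 1 m) and alpha (path-loss exponent) are
    illustrative assumptions; the paper does not report exact values.
    """
    rng = np.random.default_rng() if rng is None else rng

    def drop(n):
        # Uniformly drop n nodes in a disk of the given radius.
        r = radius * np.sqrt(rng.uniform(size=n))
        theta = rng.uniform(0, 2 * np.pi, size=n)
        return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

    ap_pos, ue_pos = drop(M), drop(K)
    d = np.linalg.norm(ap_pos[:, None, :] - ue_pos[None, :, :], axis=2)
    beta = varsigma * np.maximum(d, 1.0) ** (-alpha)   # large-scale fading (2)
    h = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
    return np.sqrt(beta) * h, beta                     # g_mk and beta_mk in (1)
```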
Fig. 1. System model of cell-free networks.

B. Channel Estimation

The channel estimation method assigns pilot sequences to the UEs in the coverage area. The pilot sequence of the k-th UE can be represented as $\varphi_k = [\varphi_{k,1}, \cdots, \varphi_{k,\tau_l}]^H$ with $\|\varphi_k\| = 1$, where $\tau_l$ is less than the coherence time of the channel $\tau_c$. It is worth mentioning that different UEs may be assigned the same pilot sequence on account of the limited pilot length $\tau_l$, i.e., $\tau_l \leq K$; hence the pilot sequences are partially non-orthogonal, which satisfies $|\varphi_k^H \varphi_n| \neq 0$ for $k \neq n$. In this way, the pilot signal received at the m-th AP can be expressed as:

$$\mathbf{y}_{m,p} = \sum_{k=1}^{K} \sqrt{\tau_l \delta_l}\, g_{mk} \varphi_k + \sigma_{m,p}, \qquad (3)$$

where $\delta_l$ is the normalized transmission power for each symbol of the k-th UE's pilot vector, and $\sigma_{m,p} \in \mathbb{C}^{\tau_l \times 1}$ is the complex-valued additive white Gaussian noise (AWGN) vector related to the pilot symbols with i.i.d. entries, i.e., $\sigma_{m,p} \sim \mathcal{CN}(0, \sigma^2)$.

It is assumed that the large-scale fading coefficient $\beta_{mk}$ is known, while the small-scale fading coefficient $h_{mk}$ is unknown. In other words, estimating the CSI is equivalent to estimating the channel coefficient $\hat{g}_{mk}$. In the first phase, we denote $\hat{y}_{mk}$ as the projection of $\mathbf{y}_{m,p}$ onto $\varphi_k^H$ and obtain:

$$\hat{y}_{mk} = \varphi_k^H \mathbf{y}_{m,p} = \sqrt{\tau_l \delta_l}\, g_{mk} + \sqrt{\tau_l \delta_l} \sum_{k' \neq k}^{K} g_{mk'} \varphi_k^H \varphi_{k'} + \varphi_k^H \sigma_{m,p}. \qquad (4)$$

When we set the estimation coefficient $\mu_{mk}$ and $\hat{g}_{mk} = \mu_{mk} \hat{y}_{mk}$, the estimation error is $e = \hat{g}_{mk} - g_{mk}$ accordingly. By utilizing the MMSE criterion to minimize $\mathbb{E}\{e^* e\}$ [14], we require:

$$\frac{\partial \mathbb{E}\{e^* e\}}{\partial \mu_{mk}} = 0; \qquad (5)$$

after a series of operations, the estimation coefficient $\mu_{mk}$ is calculated by:

$$\mu_{mk} = \frac{\mathbb{E}[\hat{y}_{mk}^* g_{mk}]}{\mathbb{E}[|\hat{y}_{mk}|^2]} = \frac{\sqrt{\tau_l \delta_l}\, \beta_{mk}}{\tau_l \delta_l \sum_{n=1}^{K} \beta_{mn} |\varphi_k^H \varphi_n|^2 + \sigma^2}. \qquad (6)$$

Ultimately, $\hat{g}_{mk}$ can be formulated as $\hat{g}_{mk} = \mu_{mk} \varphi_k^H \mathbf{y}_{m,p}$.
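As a numerical sanity check on (3)–(6), a minimal NumPy sketch of the pilot projection and MMSE coefficient is given below; the pilot matrix `Phi` and all scalings are placeholder assumptions.

```python
import numpy as np

def mmse_estimate(G, beta, Phi, tau_l, delta_l, sigma2, rng=None):
    """Per-AP MMSE channel estimation following (3)-(6).

    G: (M, K) true channels; beta: (M, K) known large-scale coefficients;
    Phi: (tau_l, K) pilot matrix with unit-norm (possibly non-orthogonal)
    columns. All scalings are illustrative placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    M, K = G.shape
    N = (rng.standard_normal((M, tau_l)) + 1j * rng.standard_normal((M, tau_l))) \
        * np.sqrt(sigma2 / 2)
    Y = np.sqrt(tau_l * delta_l) * G @ Phi.T + N   # received pilot signal (3)
    y_hat = Y @ Phi.conj()                         # projections onto phi_k^H (4)
    C = np.abs(Phi.conj().T @ Phi) ** 2            # cross-correlations |phi_k^H phi_n|^2
    mu = np.sqrt(tau_l * delta_l) * beta / (tau_l * delta_l * beta @ C + sigma2)  # (6)
    return mu * y_hat                              # g_hat_mk = mu_mk * y_hat_mk
```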
C. Uplink Data Transmission

In the cell-free network, the signal from each UE is received by all APs. The APs weight the baseband signals by the beamforming weight $w_{mk}$ and transmit the signals to the CPU pool through backhaul links. Intuitively, maximizing the desired signal while minimizing the interference, pilot contamination and noise improves the user experience. The signal of each UE is detected in the CPU pool, and the detected signal of the k-th UE can be expressed as:

$$\hat{x}_k = \sum_{m=1}^{M} \sum_{n=1}^{K} w_{mk} \big( \hat{g}_{mn} \sqrt{p_u}\, x_n + \tilde{\sigma}_m \big) = \sum_{m=1}^{M} w_{mk} \sum_{n=1}^{K} \sqrt{\tau_l \delta_l p_u}\, x_n \mu_{mn} g_{mn} + \sum_{m=1}^{M} w_{mk} \Big( \sum_{p=1}^{K} \sum_{q \neq p}^{K} \sqrt{\tau_l \delta_l p_u}\, x_p \mu_{mp} |\varphi_p^H \varphi_q| g_{mq} \Big) + \sum_{m=1}^{M} w_{mk} \Big( \sum_{s=1}^{K} \sqrt{p_u}\, \mu_{ms} x_s |\varphi_s^H \sigma_{m,p}| + \tilde{\sigma}_m \Big), \qquad (7)$$

where $0 \leq w_{mk} \leq 1$ is the beamforming weight between the k-th UE and the m-th AP, $p_u$ is the UE's uplink signal transmission power with $0 \leq p_u \leq P_u$, where $P_u$ is the maximum allowable signal transmission power, $x_n$ is the n-th UE's transmitted symbol satisfying $\mathbb{E}\{|x_n|^2\} = 1$, and $\tilde{\sigma}_m$ is the AWGN at the m-th AP with $\tilde{\sigma}_m \sim \mathcal{CN}(0, \sigma^2)$.

Equation (7) is composed of three components: the first is the desired signal mixed with inter-user interference, the second is the non-orthogonal pilot contamination, and the last is the AWGN-related estimation error together with the AWGN itself. For the purpose of enhancing the SINR, we first simplify the elements in (7):

$$\tilde{g}_{mn} = \sqrt{\tau_l \delta_l p_u}\, \mu_{mn} g_{mn}, \qquad (8a)$$
$$\tilde{g}_{mq} = \sqrt{\tau_l \delta_l p_u}\, \mu_{mp} |\varphi_p^H \varphi_q| g_{mq}, \qquad (8b)$$
$$\psi_k = \sum_{s=1}^{K} \sqrt{p_u}\, \mu_{ms} |\varphi_s^H \sigma_{m,p}| + \tilde{\sigma}_m. \qquad (8c)$$

Accordingly, we derive the closed-form expression of the k-th UE's SINR from (7):

$$\gamma_k = \frac{\sum_{m=1}^{M} w_{mk} |\tilde{g}_{mk}|^2}{\sum_{m=1}^{M} w_{mk} \Big( \sum_{n \neq k}^{K} |\tilde{g}_{mn}|^2 + \sum_{p=1}^{K} \sum_{q \neq p}^{K} |\tilde{g}_{mq}|^2 + \psi_k^2 \Big)}. \qquad (9)$$

For signal detection, SIC is exploited and we assume that the effective channels are arranged in ascending order as follows [8]:

$$\sum_{m=1}^{M} |\tilde{g}_{m1}|^2 \leq \cdots \leq \sum_{m=1}^{M} |\tilde{g}_{mk}|^2 \leq \cdots \leq \sum_{m=1}^{M} |\tilde{g}_{mK}|^2, \qquad (10)$$

and the SINR of the k-th UE can be modified as:

$$\gamma_k = \frac{\sum_{m=1}^{M} w_{mk} |\tilde{g}_{mk}|^2}{\sum_{m=1}^{M} w_{mk} \Big( \sum_{n=1}^{k-1} |\tilde{g}_{mn}|^2 + \sum_{p=1}^{K} \sum_{q \neq p}^{K} |\tilde{g}_{mq}|^2 + \psi_k^2 \Big)}, \qquad (11)$$

and it is clear that SIC effectively raises the SINR by reducing the inter-user interference.
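To illustrate how (9)–(11) might be evaluated numerically, the sketch below computes the per-UE SINR under SIC from precomputed effective channel powers; the aggregation of the pilot-contamination and ψ_k terms into per-AP and per-UE arrays is an assumption made for brevity.

```python
import numpy as np

def sic_sinr(W, G_eff2, G_pc2, psi2):
    """Per-UE SINR under SIC, following (11).

    W:      (M, K) beamforming weights in [0, 1]
    G_eff2: (M, K) effective channel powers |g_tilde_mk|^2 from (8a),
                   with UEs pre-sorted in ascending order per (10)
    G_pc2:  (M,)   per-AP pilot-contamination power, i.e. the double sum
                   over |g_tilde_mq|^2 from (8b), precomputed for brevity
    psi2:   (K,)   noise-plus-estimation-error terms psi_k^2 from (8c)
    """
    M, K = W.shape
    gamma = np.empty(K)
    for k in range(K):
        signal = np.sum(W[:, k] * G_eff2[:, k])
        # SIC leaves only the k-1 weaker UEs as residual interference.
        interference = np.sum(W[:, k] * (G_eff2[:, :k].sum(axis=1) + G_pc2 + psi2[k]))
        gamma[k] = signal / interference
    return gamma
```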
D. Problem Formulation

We define the EE as the ratio between the normalized transmission rate (bps/Hz) and the total energy consumption (Joule). The total energy consumption of the entire cell-free network is mainly made up of three components: the total signal transmission energy consumption ($P_K$), and the hardware energy consumption of the APs ($P_{AP}$) and of the UEs ($P_{UE}$), respectively. As a consequence, the total energy consumption can be given by:

$$P_{\text{total}}(t) = P_K(t) + K P_{UE} + M P_{AP}, \qquad (12)$$

where $P_K(t) = \sum_{m=1}^{M} \sum_{k=1}^{K} w_{mk}(t)\, \tau_l \delta_l p_u$, and $P_{AP}$, $P_{UE}$ are regarded as constants. With the SINR given by (11), the k-th UE's normalized transmission rate can be derived by the Shannon formula as follows:

$$R_k(t) = \log_2 (1 + \gamma_k(t)). \qquad (13)$$

The long-term EE is measured as the performance metric in the cell-free network, and our goal is to find the optimal beamforming design for maximizing the long-term EE. Based on the aforementioned discussion, the beamforming optimization problem to maximize the long-term EE $\eta_{EE}$ can be formulated as follows:

$$\max_{\mathbf{W}} \; \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{k=1}^{K} \log_2 (1 + \gamma_k(t))}{P_K(t) + K P_{UE} + M P_{AP}} \qquad (14a)$$
$$\text{s.t.} \quad \sum_{m=1}^{M} w_{ml}(t) \Big( |\tilde{g}_{ml}|^2 - \sum_{n=1}^{l-1} |\tilde{g}_{mn}|^2 \Big) \geq P_s, \qquad (14b)$$
$$\sum_{m=1}^{M} w_{mk}(t) = 1, \qquad (14c)$$
$$P_K(t) \leq P_{\max}, \qquad (14d)$$
$$(\forall l = 2, \ldots, K \text{ and } \forall k = 1, \ldots, K)$$

where $\mathbf{W} \in [0,1]^{M \times K}$ is the overall beamforming matrix. Constraint (14b) represents the successful SIC operation with the sensitivity $P_s$ of the SIC receiver. Constraint (14c) represents that the beamforming vector of each UE is normalized. Constraint (14d) guarantees that the total signal transmission power does not exceed $P_{\max}$, where $P_{\max}$ is the maximum allowable total signal transmission power in cell-free networks. Since the actual channels are time-varying and the instantaneous CSI is unavailable at the CPU, we consider the channel statistics and replace the channel parameters $|\tilde{g}_{mi}|^2$ ($i = k, n, q$) and $\psi_k^2$ in (11) with $\mathbb{E}\{|\tilde{g}_{mi}|^2\}$ and $\mathbb{E}\{\psi_k^2\}$.

The optimization of the long-term EE is closely bound up with long-term benefits, and a DRL algorithm is well suited to solving it. The DDPG algorithm, a branch of DRL that empowers agents, interacts with the environment and improves its policy through learning. The main benefit of the DDPG algorithm is its strong capability of handling successive decision-making problems. In consequence, we utilize the DDPG algorithm to instruct the CPU to perform the beamforming design and maximize the long-term EE in the cell-free network.
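One summand of the objective (14a) can then be evaluated directly, as in the following sketch, which reuses the hypothetical `sic_sinr` helper from the previous snippet and checks constraint (14c).

```python
import numpy as np

def energy_efficiency(W, G_eff2, G_pc2, psi2, tau_l, delta_l, p_u, P_UE, P_AP):
    """Instantaneous EE (bps/Hz/Joule), i.e. one summand of (14a)."""
    M, K = W.shape
    # (14c): each UE's beamforming vector sums to one across APs.
    assert np.allclose(W.sum(axis=0), 1.0), "W violates constraint (14c)"
    gamma = sic_sinr(W, G_eff2, G_pc2, psi2)     # per-UE SINR, eq. (11)
    rate = np.log2(1.0 + gamma).sum()            # sum rate, eq. (13)
    P_K = W.sum() * tau_l * delta_l * p_u        # transmit energy, below (12)
    return rate / (P_K + K * P_UE + M * P_AP)    # EE ratio in (14a)
```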
III. BEAMFORMING DESIGN USING DDPG

In this section, we propose a solution for the beamforming design with the DDPG algorithm, and the complete design process is shown in Fig. 2. The DDPG algorithm configures a double-network architecture with target and online networks. Besides, the DDPG algorithm exploits an experience replay mechanism and deep neural networks to make the learning process more stable and the convergence rate faster. In each training step, the CPU performs the immediate beamforming design based on the current SINR and informs the APs via control signals. Afterwards, the environment of the cell-free network reacts to the action with the rewards of all UEs, so that the environment switches to a new state. In each training episode, the mini-batch stochastic gradient descent algorithm is used to train the parameters of the value network, and the stochastic gradient ascent algorithm is exploited to update the parameters of the policy network. At the start of each episode, we randomly generate an initial state denoted by $s_1$, and the environment switches to the final state after the max-episode-steps, accompanied by tuples of designed parameter transitions stored in the replay buffer R, from which we select a mini-batch of N transitions for the parameter updates.
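A minimal replay buffer matching this description might look as follows; this is a generic sketch with standard DDPG choices, not the authors' implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay for (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of N transitions for the updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```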
A. State-Action-Reward Construction

As mentioned in the system model, the AP locations are fixed and the UEs are of low mobility. We assume that the UEs' locations are static and ignore their mobility. Hence, we can regard each UE's SINR as affected only by the beamforming design, and our formulated problem can be modeled approximately as a Markov decision process (MDP). We first describe the MDP elements, namely the state, the action and the reward function. The state of the environment, $s_t = \{s_1, s_2, \ldots, s_K\}$, is the SINR of the K UEs. The action is the beamforming matrix $\mathbf{W} \in [0,1]^{M \times K}$. More importantly, a beneficial design of the reward function is closely associated with the long-term EE in the cell-free network. Actions lead to a promotion or reduction of the EE and receive a reward or penalty correspondingly. The reward function is designed as follows [15]:

$$r(t) = \Delta \eta(t) = \frac{\sum_{k=1}^{K} \log_2 (1 + \gamma_k(t))}{P_K(t) + K P_{UE} + M P_{AP}} - \frac{\sum_{k=1}^{K} \log_2 (1 + \gamma_k(t-1))}{P_K(t-1) + K P_{UE} + M P_{AP}}, \qquad (15)$$

and this form of the reward function empirically contributes to the convergence and performance of the DDPG algorithm.
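Given the EE helper sketched in Section II, the difference reward in (15) reduces to a subtraction of two EE evaluations; the `env` container below is a hypothetical bundle of the channel statistics and power constants.

```python
def reward(W_t, W_prev, env):
    """Difference-of-EE reward r(t) from (15).

    env bundles the channel statistics and power constants used by the
    hypothetical energy_efficiency() helper sketched earlier.
    """
    ee_now = energy_efficiency(W_t, env.G_eff2, env.G_pc2, env.psi2,
                               env.tau_l, env.delta_l, env.p_u,
                               env.P_UE, env.P_AP)
    ee_prev = energy_efficiency(W_prev, env.G_eff2, env.G_pc2, env.psi2,
                                env.tau_l, env.delta_l, env.p_u,
                                env.P_UE, env.P_AP)
    return ee_now - ee_prev
```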
B. DDPG-Based Beamforming Design

Each element $w_{mk}$ of the beamforming matrix is continuous in the range [0, 1], and the DDPG algorithm provides a solution for problems with continuous state and action spaces. Thus, the DDPG algorithm can be applied to search for the optimal beamforming matrix $\mathbf{W}$.

Fig. 2. DDPG-based beamforming design.
The DDPG algorithm equips a double-network architecture with a policy network and a value network, $a = u(s|\theta^u)$ and $Q(s, a|\theta^Q)$, respectively. In the DDPG algorithm, states are mapped to actions directly, so no probability distribution over a discrete action space is needed. $Q(s, a|\theta^Q)$ is estimated by the Bellman equation as follows:

$$Q^u(s_t, a_t) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \in R} \{ r(s_t, a_t) + \zeta Q^u(s_{t+1}, u(s_{t+1})) \}. \qquad (16)$$

Equation (16) defines the estimation of the current action value based on the current state and the deterministic policy $u$, where R is the set of experience and $\zeta$ is the discount factor. In order to make the DDPG algorithm more stable and efficient, two neural networks are created for each of the two networks independently: the online networks with parameters $\theta^u$, $\theta^Q$ and the target networks with parameters $\theta^{u'}$, $\theta^{Q'}$.

The objective function is defined as the expectation of the discounted accumulated reward in the DDPG algorithm, which can be written as:

$$J_\beta(u) = \mathbb{E}_u \{ r_1 + \zeta r_2 + \zeta^2 r_3 + \cdots + \zeta^{n-1} r_n \}; \qquad (17)$$

the policy $u^*$ that finds the optimal deterministic action is equivalent to the policy maximizing the objective function $J_\beta(u)$, i.e., $u^* = \arg\max_u J_\beta(u)$ [16], and the gradient of the policy network is:

$$\nabla_{\theta^u} J \approx \frac{1}{N} \sum_{i} \nabla_a Q(s, a)|_{s=s_i, a=u(s_i)} \nabla_{\theta^u} u(s)|_{s=s_i}, \qquad (18)$$

and we optimize the objective function by the stochastic gradient ascent algorithm with learning rate $lr_u$. In the online value network, the gradient can be represented as:

$$\nabla_{\theta^Q} = \frac{1}{N} \sum_{i} \big[ (y_i - Q(s_i, a_i)) \nabla_{\theta^Q} Q(s_i, a_i) \big], \qquad (19)$$

where $y_i = r_i + \zeta Q'(s_{i+1}, u'(s_{i+1}|\theta^{u'})|\theta^{Q'})$, and we update the online value network by stochastic gradient descent with learning rate $lr_Q$. Finally, $\theta^{u'}$ and $\theta^{Q'}$ in the target networks are updated with the Polyak averaging factor $\tau$:

$$\theta^{Q'} = \tau \theta^Q + (1 - \tau) \theta^{Q'}, \qquad (20a)$$
$$\theta^{u'} = \tau \theta^u + (1 - \tau) \theta^{u'}. \qquad (20b)$$

In conclusion, the ultimate aim of the DDPG algorithm is to simultaneously maximize the objective function $J_\beta(u)$ in the policy network and minimize the loss of the action value $Q$ in the value network. Algorithm 1 summarizes the DDPG-based beamforming design.

Algorithm 1 DDPG-Based Beamforming Design
Randomly initialize the value network $Q(s, a|\theta^Q)$ and the policy network $a = u(s|\theta^u)$ with weights $\theta^Q$ and $\theta^u$;
Initialize the target value network $Q'$ and the target policy network $u'$ with weights $\theta^{Q'} = \theta^Q$ and $\theta^{u'} = \theta^u$;
for episode = 1 to Max-number-episodes do
  Randomly initialize a process $\mathcal{N}$ for action exploration;
  Initialize the replay buffer R and randomly generate $s_1$;
  for t = 1 to Max-episode-steps do
    The CPU executes the beamforming design based on the state $s_t$ and the policy $u$, with $a_t = u(s_t|\theta^u) + \mathcal{N}_t$;
    The APs perform the action $a_t$, and the CPU records the reward $r_t$ and the next state $s_{t+1}$;
    Store the transition $(s_t, a_t, r_t, s_{t+1})$ in R;
  end for
  Randomly sample a mini-batch of N transitions from R, where N is the mini-batch size;
  Minimize the loss function to update the online value network:
    $Loss = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i|\theta^Q))^2$,
    $y_i = r_i + \zeta Q'(s_{i+1}, u'(s_{i+1}|\theta^{u'})|\theta^{Q'})$,
    $\nabla_{\theta^Q} = \frac{1}{N} \sum_i [(y_i - Q(s_i, a_i)) \nabla_{\theta^Q} Q(s_i, a_i)]$,
  and update $\theta^Q = \theta^Q - lr_Q \cdot \nabla_{\theta^Q}$;
  Update the online policy network by the sampled stochastic policy gradient ascent:
    $\nabla_{\theta^u} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a)|_{s=s_i, a=u(s_i)} \nabla_{\theta^u} u(s)|_{s=s_i}$,
  and update $\theta^u = \theta^u + lr_u \cdot \nabla_{\theta^u}$;
  Soft update the target value network and the target policy network with the Polyak averaging factor $\tau$:
    $\theta^{Q'} = \tau \theta^Q + (1 - \tau) \theta^{Q'}$, $\theta^{u'} = \tau \theta^u + (1 - \tau) \theta^{u'}$;
end for
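As a concrete illustration of the update steps in Algorithm 1, the PyTorch sketch below performs one mini-batch update (value step, policy step and soft update); the network interfaces and the default ζ and τ values are assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, zeta=0.9, tau=0.01):
    """One mini-batch update of Algorithm 1.

    batch: tensors s (N, |S|), a (N, |A|), r (N, 1), s_next (N, |S|).
    zeta is the discount factor; tau is the Polyak averaging factor
    (tau=0.01 is an illustrative default, not the paper's value).
    """
    s, a, r, s_next = batch
    # Value step: minimize (y_i - Q(s_i, a_i))^2 with y_i from the target nets.
    with torch.no_grad():
        y = r + zeta * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Policy step: ascend the sampled policy gradient (18) by descending -Q.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft update (20a)-(20b): theta' = tau * theta + (1 - tau) * theta'.
    for net, targ in ((critic, critic_targ), (actor, actor_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
```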
C. Complexity Analysis
In this subsection, we discuss the complexity of the DDPG algorithm. Assume that M APs and K UEs are considered in the cell-free network and the overall beamforming matrix $\mathbf{W} \in [0,1]^{M \times K}$ demands optimization. When conventional approaches search the beamforming design over a grid with a certain step size $\Delta$, their complexity, $\mathcal{O}\big((1/\Delta)^{M+K}\big)$, is exponential. The complexity of the DDPG algorithm depends on two aspects: the inference floating-point operations (FLOPS) and the convergence rate. The number of FLOPS during inference is mainly determined by the structure of the policy network and the value network. Denoting by $|S|$, $|A|$ and $|H_i|$ the numbers of elements in the state, the action and the i-th hidden layer of the policy and value networks, the inference FLOPS of the policy network and the value network can be computed as:

$$\text{FLOPS}_u = |S||H_1| + \sum_{i=2}^{I} |H_{i-1}||H_i| + |A||H_I|, \qquad (21a)$$
$$\text{FLOPS}_Q = (|S| + |A|)|H_1| + \sum_{i=2}^{I} |H_{i-1}||H_i| + |H_I|, \qquad (21b)$$

where $I$ represents the number of hidden layers in each neural network. As a consequence, the number of FLOPS during inference results in low complexity because it involves only scalar multiplications in the DDPG algorithm. Correspondingly, the convergence rate is faster than that of the conventional approaches, as shown in Fig. 3. In general, the DDPG algorithm is a superior scheme for handling our formulated problem.
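Equations (21a)–(21b) transcribe directly into a small helper for checking layer-size choices; the example input below mirrors the 256 × 128 sizing used in the simulation section and is otherwise an assumption.

```python
def inference_flops(S, A, hidden):
    """FLOPS counts of the policy and value networks per (21a)-(21b).

    S, A: state/action dimensions; hidden: hidden layer sizes [H_1, ..., H_I].
    """
    inner = sum(h_prev * h for h_prev, h in zip(hidden[:-1], hidden[1:]))
    flops_u = S * hidden[0] + inner + A * hidden[-1]     # policy network (21a)
    flops_q = (S + A) * hidden[0] + inner + hidden[-1]   # value network (21b)
    return flops_u, flops_q

# Example: M = 10 APs, K = 6 UEs -> |S| = 6, |A| = 60, hidden layers 256x128.
print(inference_flops(S=6, A=60, hidden=[256, 128]))
```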
IV. SIMULATION RESULTS

In the numerical analysis, we evaluate the EE of our proposed DDPG-based beamforming design in the cell-free network. We consider a cell-free network with M = 10 and K = 6, where the APs and UEs are uniformly located within a radius of r = 20 meters. The hardware power consumption of the APs and UEs is 20 dBm each, the pilot length is $\tau_l = 6$ samples, the pilot transmission power per symbol is $\delta_l = 20$ dBm, the uplink transmission power is $p_u = 16$ dBm, the SIC sensitivity is $P_s = 1$ dBm [8], and the noise power is $\sigma^2$. We train our proposed model with Python and PyTorch 1.4.0, and the number of training episodes is 1000 with 200 steps in each episode. Both networks have two fully-connected hidden layers of size 256 × 128 with a ReLU activation in each layer; the output layer is a softmax function in the policy network and has no activation in the value network. The loss function is the mean squared error in the value network, and the gradient of the Q-value is used in the policy network update.
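The described architecture could be realized as in the following PyTorch sketch; the layer sizes and activations follow the text, while the per-UE softmax across APs (which also enforces constraint (14c)) and the input dimensions are our interpretation.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """u(s | theta_u): SINR state -> beamforming matrix W in [0, 1]^{M x K}."""
    def __init__(self, K, M):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(K, 256), nn.ReLU(),
                                  nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, M * K))
        self.M, self.K = M, K

    def forward(self, s):
        w = self.body(s).view(-1, self.M, self.K)
        return torch.softmax(w, dim=1)  # softmax over APs enforces (14c)

class ValueNet(nn.Module):
    """Q(s, a | theta_Q): state-action value, no output activation."""
    def __init__(self, K, M):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(K + M * K, 256), nn.ReLU(),
                                  nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, s, a):
        return self.body(torch.cat([s, a.flatten(1)], dim=1))
```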
Adam is employed as the optimizer of both networks. The hyper-parameters of the DDPG algorithm are set as follows: the discount factor $\zeta = 0.9$, the learning rates $lr_u$ and $lr_Q$, the Polyak averaging factor $\tau$, the mini-batch size N = 32, and the replay buffer size R.

Next, we verify the effectiveness of the DDPG algorithm and the EE performance of our proposed DDPG-based beamforming design in the cell-free network. Furthermore, we take the water-filling scheme (a larger $g_{mk}$ determines a larger $w_{mk}$) and the random scheme as benchmarks. We examine the simulation results in terms of three aspects:
1) Convergence of the DDPG-based beamforming design:
Fig. 3. Convergence of EE.
Fig. 4. EE vs. transmission power.
Fig. 5. EE vs. discount factor.
Fig. 6. EE vs. hidden layers size.
(All figures are for M = 10, K = 6.)

Fig. 3 illustrates the convergence of the DDPG-based beamforming design over the training episodes. We may draw the conclusion that the DDPG algorithm is capable of converging within the 1000 episodes, while the EE performance of the other two methods remains poor over the 1000 episodes. In addition, the EE performance of the DDPG-based beamforming design reaches 90% of its best performance after about 180 episodes, and the EE performance gap widens from about 50 episodes onward.
2) Energy efficiency versus transmission power:
Fig. 4 characterizes the EE versus the uplink signal transmission power $p_u$. In the simulation, we observe the variation of the EE performance as $p_u$ ranges from 4 dBm to 24 dBm. From Fig. 4, the EE rises as $p_u$ increases from 4 dBm to 16 dBm and declines from 16 dBm to 24 dBm. This is because the EE rises while $p_u$ satisfies the demand of signal transmission, whereas when $p_u$ is sufficiently high, the redundant energy consumption causes the EE performance to decline.
3) The influence of the hyper-parameters:
Fig. 5 illustrates that $\zeta = 0.9$ achieves the optimal EE performance, which degrades within an acceptable range for $\zeta = 0.7$, $0.8$ and $0.1$. However, extreme values of $\zeta$ lead to egregious EE performance: when $\zeta = 10^{-10}$, the Bellman equation is associated almost exclusively with the instantaneous reward; when $\zeta = 1 - 10^{-10}$, the Bellman equation largely stands for the accumulated sum reward over the whole episode, which leads to poor EE performance. Fig. 6 shows that hidden layers with a neuron size of 256 × 128 result in better EE performance than a neuron size of 512 × 256. This is because redundant neurons in the hidden layers may cause overfitting, which makes the training become stuck in local minima instead of the global optimum.

V. CONCLUSION
This paper investigated DRL for energy-efficient beamforming design in cell-free networks. Based on MMSE channel estimation and SIC signal detection, the closed-form per-user SINR and the long-term EE function were derived. The DDPG algorithm was exploited to perform centralized beamforming design for the long-term EE maximization problem with continuous state and action spaces. It was demonstrated that the DDPG-based algorithm is convergent and reduces the exponential computational complexity to a polynomial level. The simulation results indicated that the DDPG-based beamforming design outperforms the benchmarks in terms of EE under different network setups. Moreover, an appropriate discount factor and hidden layer size lead to preferable performance.

REFERENCES
[1] H. Q. Ngo et al., "Cell-free massive MIMO versus small cells," IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1834–1850, Jan. 2017.
[2] S. Chen et al., "Structured massive access for scalable cell-free massive MIMO systems," IEEE Journal on Selected Areas in Communications, Aug. 2020, accepted.
[3] T. K. Nguyen et al., "Max-min QoS power control in generalized cell-free massive MIMO-NOMA with optimal backhaul combining," IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 10949–10964, Jun. 2020.
[4] M. Alonzo et al., "Energy-efficient power control in cell-free and user-centric massive MIMO at millimeter wave," IEEE Transactions on Green Communications and Networking, vol. 3, no. 3, pp. 651–663, Mar. 2019.
[5] F. Tan et al., "Energy-efficient non-orthogonal multicast and unicast transmission of cell-free massive MIMO systems with SWIPT," IEEE Journal on Selected Areas in Communications, Sep. 2020, accepted.
[6] S. Jin et al., "Spectral and energy efficiency in cell-free massive MIMO systems over correlated Rician fading," IEEE Systems Journal, pp. 1–12, May 2020.
[7] M. Hua et al., "Channel estimation and resource allocation in NOMA enhanced cell-free massive MIMO networks," in Proc. IEEE ICC, Montreal, Canada, 2021, under review.
[8] Y. Al-Eryani et al., "Multiple access in cell-free networks: Outage performance, dynamic clustering, and deep reinforcement learning-based design," IEEE Journal on Selected Areas in Communications, Aug. 2020, accepted.
[9] S. Mosleh et al., "Downlink resource allocation in cell-free massive MIMO systems," in Proc., Honolulu, HI, USA, Apr. 2019, pp. 883–887.
[10] M. Attarifar et al., "Modified conjugate beamforming for cell-free massive MIMO," IEEE Wireless Communications Letters, vol. 8, no. 2, pp. 616–619, Jan. 2019.
[11] H. Q. Ngo et al., "On the total energy efficiency of cell-free massive MIMO," IEEE Transactions on Green Communications and Networking, vol. 2, no. 1, pp. 25–39, Mar. 2018.
[12] J. Ge et al., "Deep reinforcement learning for distributed dynamic MISO downlink-beamforming coordination," IEEE Transactions on Communications, vol. 68, no. 10, pp. 6070–6085, Jun. 2020.
[13] M. Chen et al., "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, Jul. 2019.
[14] T. Van Chien et al., "Joint power allocation and load balancing optimization for energy-efficient cell-free massive MIMO networks," IEEE Transactions on Wireless Communications, vol. 19, no. 10, pp. 6798–6812, Jul. 2020.
[15] X. Liu et al., "RIS enhanced massive non-orthogonal multiple access networks: Deployment and passive beamforming design," IEEE Journal on Selected Areas in Communications, Aug. 2020, accepted.
[16] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.