Joint Deep Reinforcement Learning and Unfolding: Beam Selection and Precoding for mmWave Multiuser MIMO with Lens Arrays
Qiyu Hu, Yanzhen Liu, Yunlong Cai, Guanding Yu, and Zhi Ding
Abstract
The millimeter wave (mmWave) multiuser multiple-input multiple-output (MU-MIMO) systems with discrete lens arrays (DLA) have received great attention due to their simple hardware implementation and excellent performance. In this work, we investigate the joint design of beam selection and digital precoding matrices for mmWave MU-MIMO systems with DLA to maximize the sum-rate subject to the transmit power constraint and the constraints on the structure of the selection matrix. The investigated non-convex problem with discrete variables and coupled constraints is challenging to solve, and we propose an efficient framework of joint neural network (NN) design to tackle it. Specifically, the proposed framework consists of a deep reinforcement learning (DRL)-based NN and a deep-unfolding NN, which are employed to optimize the beam selection and digital precoding matrices, respectively. For the DRL-based NN, we formulate the beam selection problem as a Markov decision process and develop a double deep Q-network algorithm to solve it. The base station is considered to be an agent, for which the state, action, and reward function are carefully designed. Regarding the design of the digital precoding matrix, we develop a deep-unfolding NN induced by the iterative weighted minimum mean-square error algorithm, which unfolds this algorithm into a layer-wise structure with introduced trainable parameters. Simulation results verify that this jointly trained NN remarkably outperforms the existing iterative algorithms, with reduced complexity and stronger robustness.
Index Terms
Discrete lens arrays, deep reinforcement learning, deep-unfolding, beam selection, precoding design.
I. INTRODUCTION
Q. Hu, Y. Liu, Y. Cai, and G. Yu are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Z. Ding is with the Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA (e-mail: [email protected]).

Recently, millimeter wave (mmWave) communication has received great attention in the design of wireless systems due to its benefits of mitigating the spectrum shortage and supporting high data rates [1]. The short wavelength of mmWave makes it practically feasible to equip a large number of antennas at the transceiver in a cellular network, where massive multiple-input multiple-output (MIMO) techniques can be readily realized to perform highly directional transmissions [2]–[4]. Nevertheless, conventional precoding relying on fully digital (FD) processing structures generally leads to an unaffordable cost of radio frequency (RF) chains and high power consumption in massive MIMO scenarios. The hybrid analog-digital (AD) transceiver, consisting of a cascaded baseband digital precoder and an RF analog beamformer, has been proposed as a trade-off between cost and performance; it employs fewer RF chains than antennas, while imposing a constant-modulus constraint on the analog beamforming matrix [5]. Moreover, the use of phase shifters still causes large energy consumption.

To address these issues, a promising mmWave MIMO technique with discrete lens arrays (DLA) has been proposed in [6] to reduce hardware implementation complexity without remarkable performance loss [7]–[15]. A DLA generally consists of two parts: an electromagnetic lens and a matching antenna array. It transforms the conventional MIMO spatial channels into beamspace channels with angle-dependent energy-focusing capabilities [7]. In practice, due to the sparsity of beamspace channels, only a small number of beams with large gains are selected. In addition, each RF chain selects a single beam, so the required number of RF chains can be dramatically reduced.
Moreover, the conventional phase shifter-enabled hybrid precoding structure can be substituted by a switching network, which greatly reduces the hardware cost and cuts down the power budget. In spite of these superiorities of the DLA, it has been proved in [8] that the beam selection problem is NP-hard, and how to handle this problem efficiently remains an open issue. In [9], an interference-aware algorithm has been proposed to solve this NP-hard problem heuristically. A penalty dual decomposition (PDD) algorithm has been developed in [10] for jointly optimizing the beam selection and digital precoding matrices. The beam selection problem for the high-dimensional multiuser (MU) case has been investigated in [11]. In [12], several algorithms have been proposed, e.g., maximum magnitude beam selection and maximum signal-to-interference-plus-noise ratio (SINR) beam selection. The authors in [13] have developed an energy-max algorithm for beam selection and a successive interference cancellation based precoding method in massive MIMO systems.

However, the optimization techniques usually incur very high computational complexity and are easily trapped in a local optimum when dealing with a non-convex problem. In addition, the existing algorithms mainly optimize the beam selection matrix and the digital precoding matrix separately. Thus, the joint design, which is more challenging and belongs to mixed-integer non-linear programming (MINLP), is worth studying for further performance improvement. Recently, machine learning techniques have emerged as an efficient approach, and a number of studies have employed neural networks (NNs) to solve such problems [16], including precoding design [17], beam selection [18]–[20], and resource allocation [21]. The idea is to regard the iterative algorithm as a black box and employ NNs to learn the relationship between the inputs and outputs.
The authors in [17] have applied an NN with fully-connected (FC) layers to approximate the iterative weighted minimum mean-square error (WMMSE) algorithm for precoding. In [18], the beam selection problem has been regarded as a multiclass classification problem solved by the support vector machine. The authors in [19] have developed a dataset for investigating beam selection techniques in the vehicle-to-infrastructure scenario. The authors in [20] have made full use of the angle-of-arrival (AoA) information to design a multi-layer perceptron (MLP) that optimizes the beam selection matrix.

Nevertheless, these black-box NNs generally have poor interpretability and generalization ability, and always require a great number of training samples. To address these issues, so-called deep-unfolding NNs have been developed in [22]–[26], where iterative optimization algorithms are unfolded into a layer-wise structure similar to an NN. In addition, deep reinforcement learning (DRL) [27] has also been widely applied to solve high-dimensional non-convex optimization problems. Combined with a deep neural network (DNN), the deep Q-network (DQN) achieves better performance than conventional Q-learning. The DQN has been widely applied in mobile offloading [28], dynamic channel access [29], radio access [30], resource allocation [31], [32], and mobile-edge computing (MEC) [33]. In [32], a multi-agent DRL approach has been developed to maximize the network utility in heterogeneous cellular networks. The authors in [33] have developed a DRL-based online offloading framework for the MEC network.

To the best of our knowledge, the problem of joint beam selection and precoding design has not been fully investigated, since it is an NP-hard problem with discrete variables and coupling constraints, for which it is difficult to find a satisfactory solution with low complexity.
In this work, we propose a novel framework to jointly optimize the beam selection and digital precoding matrices. Firstly, we formulate the beam selection problem as a Markov decision process (MDP) and develop a double deep Q-network (DDQN) algorithm to solve it. We aim at selecting the optimal beams to maximize the sum-rate of the served users while ensuring that the constraints on the beam selection matrix are satisfied. The base station (BS) is treated as an agent, and the state, action, and reward function are carefully designed for it. Moreover, we reduce the dimension of the state by exploiting the sparsity of the beamspace channel and model the selection of each beam as an action.

Given the beam selection matrix, the digital precoding design is a classic sum-rate maximization problem with power constraints that can be settled by the iterative WMMSE algorithm [34]. It achieves satisfactory sum-rate performance but with high complexity, since it requires high-dimensional matrix inversions and a great number of iterations. To address this problem, a deep-unfolding NN is developed, where the iterative WMMSE algorithm is unfolded into a layer-wise structure. In this way, a much smaller number of iterations, i.e., the layers of the deep-unfolding NN, is applied to approach the iterative WMMSE algorithm, and the matrix inversion is avoided to reduce the computational complexity. In addition, some trainable parameters are introduced to improve the sum-rate performance. Furthermore, a training method is proposed to jointly train the DRL-based NN and the deep-unfolding NN, which is totally different from the existing joint design scheme [35]. We also compare the computational complexity of the proposed joint NN design with benchmarks. Numerical results verify that our proposed jointly trained NN significantly outperforms the existing iterative algorithms with reduced computational complexity.
In addition, the proposed jointly trained NN has much stronger robustness against channel state information (CSI) errors. The contributions of this work can be summarized as follows.
• We formulate the problem of joint beam selection and precoding design for a downlink mmWave MU-MIMO system with DLA. To address this problem, we develop an efficient framework of joint NN design, where a DRL-based NN and a deep-unfolding NN are designed to jointly optimize the beam selection and digital precoding matrices.
• As for the DRL-based NN, we formulate the beam selection problem as an MDP and develop a DDQN algorithm to solve it. The state, action, and reward function are carefully designed for the agent BS, where we utilize the beamspace channel's characteristics and guarantee that the constraints on the beam selection matrix are satisfied.
• We develop a deep-unfolding NN with introduced trainable parameters to optimize the digital precoding matrix, where the iterative WMMSE algorithm is unfolded into a layer-wise structure. A much smaller number of iterations is applied to approximate the iterative WMMSE algorithm, and the matrix inversion is avoided to reduce the computational complexity.
• A training method is proposed to jointly train the two NNs. Simulation results show that our proposed jointly trained NN significantly outperforms the existing iterative algorithms with much lower computational complexity.
The rest of the paper is structured as follows. Section II introduces the system model of the downlink MU-MIMO system with DLA and formulates the problem mathematically. Section III describes the framework of the joint NN design. The DRL-based NN designed for the beam selection matrix is presented in Section IV. In Section V, the deep-unfolding NN is developed for the digital precoding matrix and the computational complexity is analyzed. We present our experimental results in Section VI. Finally, the paper is concluded in Section VII.

Fig. 1: A downlink mmWave MU-MIMO system with discrete lens array.
Notations:
Scalars, vectors, and matrices are denoted by lower case, boldface lower case, and boldface upper case letters, respectively. $\mathbf{I}$ represents an identity matrix and $\mathbf{0}$ denotes an all-zero matrix. For a matrix $\mathbf{A}$, $\mathbf{A}^T$, $\mathbf{A}^*$, $\mathbf{A}^H$, and $\|\mathbf{A}\|$ denote its transpose, conjugate, conjugate transpose, and Frobenius norm, respectively. For a vector $\mathbf{a}$, $\|\mathbf{a}\|$ represents its Euclidean norm. $\mathbb{E}\{\cdot\}$ denotes the statistical expectation. $\Re\{\cdot\}$ ($\Im\{\cdot\}$) denotes the real (imaginary) part of a variable. $\mathrm{Tr}\{\cdot\}$ denotes the trace operation. $|\cdot|$ denotes the absolute value of a complex scalar, and $\circ$ denotes the element-wise multiplication of two matrices, i.e., the Hadamard product. $\mathbb{C}^{m\times n}$ ($\mathbb{R}^{m\times n}$) denotes the space of $m\times n$ complex (real) matrices.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, we first introduce the system model for joint beam selection and precoding designand then formulate the optimization problem mathematically.
A. System Model
As shown in Fig. 1, we study a downlink MU-MIMO system consisting of a BS equipped with a single-sided DLA, $M_s$ transmit antennas, and $N_{RF} \ll M_s$ RF chains. The BS serves $K$ users simultaneously, each of which is equipped with a single-antenna receiver. To ensure the spatial multiplexing gain for the $K$ users, the relation $N_{RF} \geq K$ should be satisfied. The precoded data vector at the BS is expressed as
$$\mathbf{x} = \mathbf{P}\mathbf{s} = \sum_{k=1}^{K} \mathbf{p}_k s_k, \quad (1)$$
where $\mathbf{s} = [s_1, s_2, \cdots, s_K]^T$, $s_k$ denotes the transmit signal for user $k \in \mathcal{K} \triangleq \{1, 2, \cdots, K\}$ with zero mean and $\mathbb{E}[\mathbf{s}\mathbf{s}^H] = \mathbf{I}$, $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_K] \in \mathbb{C}^{N_{RF} \times K}$ denotes the precoding matrix, and $\mathbf{p}_k$ is the digital precoding vector for user $k$. Then, the received signal vector $\mathbf{y} \in \mathbb{C}^{K \times 1}$ for all $K$ users is given by
$$\mathbf{y} = \mathbf{H}^H \mathbf{F} \mathbf{P} \mathbf{s} + \mathbf{n}, \quad (2)$$
where $\mathbf{H} \in \mathbb{C}^{M_s \times K}$ denotes the beamspace channel matrix, $\mathbf{F} \in \mathbb{C}^{M_s \times N_{RF}}$ is the beam selection matrix whose entries $f_{ij}$, $(i,j) \in \mathcal{T}$, are either 0 or 1, and $\mathcal{T} \triangleq \{(i,j)\,|\,i = 1, 2, \cdots, M_s,\ j = 1, 2, \cdots, N_{RF}\}$. In addition, $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_K)$ denotes the $K \times 1$ additive white Gaussian noise (AWGN) vector, where $\sigma^2$ represents the noise variance.

B. Beamspace Channel Model
The beamspace channel matrix $\mathbf{H}$ is modeled from the physical spatial MIMO channel as
$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_K] = [\mathbf{U}\mathbf{g}_1, \mathbf{U}\mathbf{g}_2, \cdots, \mathbf{U}\mathbf{g}_K], \quad (3)$$
where $\mathbf{U} \in \mathbb{C}^{M_s \times M_s}$ is a discrete Fourier transform (DFT) matrix and $\mathbf{g}_k \in \mathbb{C}^{M_s \times 1}$ denotes the spatial-domain channel vector between the BS and user $k$. The DFT matrix $\mathbf{U}$ contains the array steering vectors of $M_s$ orthogonal beams [7],
$$\mathbf{U} = [\mathbf{a}(\varphi_1), \mathbf{a}(\varphi_2), \cdots, \mathbf{a}(\varphi_{M_s})]^H, \quad (4)$$
where $\varphi_m = \frac{1}{M_s}\big(m - \frac{M_s+1}{2}\big)$, $m = 1, 2, \cdots, M_s$, are the normalized spatial directions, $\mathbf{a}(\varphi_m) = \frac{1}{\sqrt{M_s}}\big[e^{-j2\pi\varphi_m i}\big]_{i \in \mathcal{I}}$ denotes the corresponding $M_s \times 1$ array steering vector, and $\mathcal{I} \triangleq \{n - \frac{M_s-1}{2}\,|\,n = 0, 1, \cdots, M_s-1\}$ is the index set of the array elements. Note that the columns of $\mathbf{U}$ are orthonormal, i.e., $\mathbf{U}^H\mathbf{U} = \mathbf{I}$. Then, we apply the well-known channel model [9], [12]
$$\mathbf{g}_k = \rho_k^{(0)} \mathbf{a}(\phi_k^{(0)}) + \sum_{l=1}^{L} \rho_k^{(l)} \mathbf{a}(\phi_k^{(l)}), \quad (5)$$
where $\rho_k^{(0)}\mathbf{a}(\phi_k^{(0)})$ and $\rho_k^{(l)}\mathbf{a}(\phi_k^{(l)})$ denote the line-of-sight (LoS) and the $l$-th non-line-of-sight (NLoS) channel components between the BS and user $k$, respectively. Moreover, $\rho_k^{(0)}$ and $\rho_k^{(l)}$ are the complex gains of the LoS and NLoS paths, respectively, and $\phi_k^{(0)}$ and $\phi_k^{(l)}$ represent the spatial directions. For simplicity, we apply the 2D formulation that only considers the azimuth angle of departure (AoD). The number of NLoS components $L$ in (5) is much smaller than $M_s$ since the number of scatterers in a mmWave channel is limited, which leads to the sparse structure of $\mathbf{H}$. In this channel model, $\mathbf{H}$ consists of $M_s$ beams and each row of $\mathbf{H}$ represents a beam vector.

C. Problem Formulation
We focus on the joint design of the beam selection matrix $\mathbf{F}$ and the digital precoding matrix $\mathbf{P}$ to maximize the downlink sum-rate of the system. The sum-rate maximization problem can be mathematically formulated as
$$\max_{\{\mathbf{F},\mathbf{P}\}}\ \sum_{k=1}^{K}\log_2\bigg(1+\frac{|\mathbf{h}_k^H\mathbf{F}\mathbf{p}_k|^2}{\sum_{i\neq k}^{K}|\mathbf{h}_k^H\mathbf{F}\mathbf{p}_i|^2+\sigma^2}\bigg) \quad (6a)$$
$$\text{s.t.}\ \ \mathrm{Tr}(\mathbf{P}^H\mathbf{F}^T\mathbf{F}\mathbf{P})\leq P_s, \quad (6b)$$
$$\sum_{i=1}^{M_s} f_{ij}=1,\ \forall j, \quad (6c)$$
$$\sum_{j=1}^{N_{RF}} f_{ij}\leq 1,\ \forall i, \quad (6d)$$
$$f_{ij}\in\{0,1\},\ \forall (i,j)\in\mathcal{T}, \quad (6e)$$
where $P_s$ is the transmit power budget at the BS. Note that the constraint (6b) can be simplified as $\mathrm{Tr}(\mathbf{P}^H\mathbf{P})\leq P_s$ since $\mathbf{U}^H\mathbf{U}=\mathbf{I}_{M_s}$ and $\mathbf{F}^T\mathbf{F}=\mathbf{I}_{N_{RF}}$. The constraint (6c) guarantees that each RF chain feeds a single beam, and the constraint (6d) ensures that each beam is chosen by at most one RF chain. Together, these constraints guarantee that $N_{RF}$ beams are selected from the $M_s$ beams to serve the $K$ users. Problem (6) is difficult to solve since it is an MINLP, which is non-convex and involves the discrete variable $\mathbf{F}$. Therefore, we propose a joint NN design to solve it in the following sections.
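As a quick numerical check of the formulation above, the following sketch builds a DFT-based beamspace channel as in (3)–(5), a feasible selection matrix satisfying (6c)–(6e), and evaluates the objective (6a). The dimensions, the random path gains, and the matched-filter precoder are illustrative choices of ours, not the paper's simulation setup.

```python
import numpy as np

def steering(M_s, phi):
    # Steering vector a(phi) from (4), indices i in {n - (M_s-1)/2}.
    idx = np.arange(M_s) - (M_s - 1) / 2
    return np.exp(-2j * np.pi * phi * idx) / np.sqrt(M_s)

def beamspace_channel(M_s, K, L=2, rng=None):
    # H = [U g_1, ..., U g_K] from (3), with g_k drawn from the model (5).
    rng = rng or np.random.default_rng(0)
    phi_grid = (np.arange(1, M_s + 1) - (M_s + 1) / 2) / M_s
    U = np.stack([steering(M_s, p).conj() for p in phi_grid])  # rows a(phi_m)^H
    H = np.zeros((M_s, K), dtype=complex)
    for k in range(K):
        g = steering(M_s, rng.uniform(-0.5, 0.5))              # LoS component
        for _ in range(L):                                     # NLoS components
            rho = 0.1 * (rng.standard_normal() + 1j * rng.standard_normal())
            g = g + rho * steering(M_s, rng.uniform(-0.5, 0.5))
        H[:, k] = U @ g
    return H, U

def sum_rate(H, F, P, sigma2):
    # Objective (6a): sum_k log2(1 + SINR_k).
    G = np.abs(H.conj().T @ F @ P) ** 2        # G[k, i] = |h_k^H F p_i|^2
    sig = np.diag(G)
    interference = G.sum(axis=1) - sig
    return float(np.sum(np.log2(1 + sig / (interference + sigma2))))

M_s, N_RF, K, sigma2, P_s = 16, 4, 4, 0.1, 1.0
H, U = beamspace_channel(M_s, K)
beams = np.sort(np.argsort(np.sum(np.abs(H) ** 2, axis=1))[-N_RF:])
F = np.zeros((M_s, N_RF))
F[beams, np.arange(N_RF)] = 1.0               # one beam per RF chain: (6c)-(6e)
P = (H.conj().T @ F).conj().T                 # matched-filter precoder (illustrative)
P *= np.sqrt(P_s / np.real(np.trace(P @ P.conj().T)))   # power constraint (6b)
rate = sum_rate(H, F, P, sigma2)
```

Note that the orthonormality $\mathbf{U}^H\mathbf{U}=\mathbf{I}$ claimed after (4) can be verified numerically on the constructed $\mathbf{U}$.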
III. FRAMEWORK OF THE JOINT NN DESIGN
In this section, a framework of joint NN design for beam selection and digital precoding is developed. As shown in Fig. 2, the proposed framework includes two parts: the joint NN design and the signal flow simulator, which are elaborated as follows.
A. Joint NN Design
To solve the MINLP (6), we propose a joint NN design consisting of a DRL-based NN for the beam selection matrix $\mathbf{F}$ and a deep-unfolding NN for the digital precoding matrix $\mathbf{P}$.

Fig. 2: Proposed framework of joint beam selection and digital precoding design.

Firstly, the channel matrix $\mathbf{H}$ is converted into a $2 \times M_s \times K$ real-valued tensor, whose real part and imaginary part are stored separately. Then, we input $\mathbf{H}$ into the DRL-based NN with a novel DDQN architecture. It aims at selecting the optimal beams and maximizing the sum-rate for the served users while ensuring that the constraints (6c)-(6e) are satisfied. The DRL-based NN outputs the corresponding beam selection matrix $\mathbf{F}$. Then, the equivalent channel matrix $\bar{\mathbf{H}}^H = \mathbf{H}^H\mathbf{F} \in \mathbb{C}^{K \times N_{RF}}$ is obtained, which has a much smaller dimension than $\mathbf{H}$.

Given the equivalent channel matrix $\bar{\mathbf{H}}$, the digital precoding problem can be solved by the iterative WMMSE algorithm [34], but with significantly high complexity. To address this, a deep-unfolding NN is developed, where the iterative WMMSE algorithm is unfolded into a layer-wise structure with some introduced trainable parameters. Specifically, $\bar{\mathbf{H}}$ is input into the deep-unfolding NN and the output is the unnormalized digital precoder $\bar{\mathbf{P}}$. Due to the structure of the investigated problem, the optimal solution always meets the power constraint (6b) with equality. To satisfy (6b), we apply the following normalization layer to map $\bar{\mathbf{P}}$ into the digital precoder $\mathbf{P}$:
$$\mathbf{P} = \frac{\sqrt{P_s}}{\sqrt{\mathrm{Tr}(\bar{\mathbf{P}}\bar{\mathbf{P}}^H)}}\,\bar{\mathbf{P}}. \quad (7)$$
Finally, $\mathbf{H}$, $\mathbf{F}$, and $\mathbf{P}$ are substituted into the loss function
$$\mathcal{L}(\theta) = -\frac{1}{N_s}\sum_{n=1}^{N_s} f(\mathbf{H}_n, \mathbf{F}_n, \mathbf{P}_n; \theta), \quad (8)$$
where $N_s$ denotes the size of the training data set, $\theta$ represents the trainable parameters of the NNs, and $f(\mathbf{H}_n, \mathbf{F}_n, \mathbf{P}_n; \theta)$ denotes the sum-rate (6a) achieved with the $n$-th channel realization. In the training stage, with input $\mathbf{H}$, the DRL-based NN and the deep-unfolding NN generate $\mathbf{F}$ and $\mathbf{P}$, respectively. We then perform back propagation (BP) to jointly train the two NNs and employ stochastic gradient descent (SGD) to update the trainable parameters.
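A minimal numpy sketch of the normalization layer (7) and the loss (8); the function names are ours, and `rates` stands for the per-sample sum-rates $f(\mathbf{H}_n, \mathbf{F}_n, \mathbf{P}_n; \theta)$ assumed to be computed elsewhere.

```python
import numpy as np

def normalize_precoder(P_bar, P_s):
    """Normalization layer (7): scale P_bar so that Tr(P P^H) = P_s."""
    power = np.real(np.trace(P_bar @ P_bar.conj().T))
    return np.sqrt(P_s / power) * P_bar

def training_loss(rates):
    """Loss (8): negative average sum-rate over the N_s training samples."""
    return -float(np.mean(rates))
```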
In addition, the outputs of the deep-unfolding NN are substituted into (8), which forms part of the reward of the DRL-based NN and helps to train it efficiently.

B. Signal Flow Simulator
In the prediction stage, the signal flow simulator imitates the process from the transmitted data symbol $\mathbf{s}$ to the detected symbol $\mathbf{y}$ over the wireless fading channel $\mathbf{H}$ with the AWGN $\mathbf{n}$ generated in the simulation environment. The channel $\mathbf{H}$ is input into the NNs and the outputs are $\mathbf{F}$ and $\mathbf{P}$. Thereafter, the digital precoder module and the beam selection module are replaced by the generated precoding matrix $\mathbf{P}$ and the beam selection matrix $\mathbf{F}$, respectively.

IV. DRL-BASED NN FOR BEAM SELECTION
In this section, we present the DRL-based NN for beam selection after introducing the basic ideasof DRL.
A. MDP and Value Function

1) MDP:
The MDP is defined by the quintuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, $\mathcal{R}: \mathcal{S} \to \mathbb{R}$ denotes the reward function, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ represents the state transition probability, and $\gamma \in (0, 1]$ denotes the discount factor. Note that $s_t$, $a_t$, and $r_t$ denote the state, action, and reward at time step $t$, respectively. Then, we have $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim p(\cdot|s_t, a_t)$, $r_t \triangleq r(s_t, a_t) \sim \mathcal{R}$, and the cumulative discounted reward $\sum_{t \geq 0} \gamma^t r_t$, where $p$ is the transition probability and $\pi: \mathcal{S} \mapsto \mathcal{A}$ denotes the policy.
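For a finite episode, the cumulative discounted reward $\sum_{t \geq 0} \gamma^t r_t$ can be accumulated backwards (a small helper of ours, not part of the paper's algorithm):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward sum_{t>=0} gamma^t * r_t of one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # Horner-style backward accumulation
    return g
```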
2) Value Function:
The value function $V_\pi(s)$ at state $s$ is defined as the expected reward following the policy $\pi$:
$$V_\pi(s) = \mathbb{E}\bigg\{\sum_{t \geq 0} \gamma^t r_t \,\Big|\, s_0 = s, \pi\bigg\}, \quad (9)$$
where the expectation $\mathbb{E}\{\cdot\}$ represents the empirical average over a batch of samples. Similarly, the action-value function $Q_\pi(s, a)$ denotes the expected reward obtained after taking action $a$ at state $s$ under the policy $\pi$:
$$Q_\pi(s, a) = \mathbb{E}\bigg\{\sum_{t \geq 0} \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a, \pi\bigg\}. \quad (10)$$
The optimal policy $\pi^\star$ is characterized by the Bellman equation
$$Q_{\pi^\star}(s_t, a_t) = \max_\pi \mathbb{E}\bigg\{r_{t+1} + \gamma \max_a Q_\pi(s_{t+1}, a) \,\Big|\, s_t, a_t\bigg\}. \quad (11)$$
Deep Q-learning is proposed to estimate the action-value function. The objective of deep Q-learning is to minimize the loss
$$L(\theta) = \mathbb{E}\Big\{\big(y_t - Q(s_t, a_t; \theta)\big)^2\Big\}, \quad (12)$$
where $\theta$ denotes the parameters of the NN. We apply the DDQN [36] to learn the target value $y_t$, which avoids the overestimation and stabilizes the NN:
$$y_t = \mathbb{E}\bigg\{r_{t+1} + \gamma Q\Big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta'\Big)\bigg\}, \quad (13)$$
where $\theta$ and $\theta'$ denote the parameters of the two NNs, which are applied to select and evaluate the action, respectively. The gradient is expressed as
$$\nabla_\theta L(\theta) = \mathbb{E}\bigg\{\Big(r_{t+1} + \gamma Q\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta'\big) - Q(s_t, a_t; \theta)\Big)\,\nabla_\theta Q(s_t, a_t; \theta)\bigg\}. \quad (14)$$

B. DRL for Beam Selection
Firstly, we set up the process of beam selection as an MDP, where the BS is considered as the agent. Then, we carefully define the state, action, state transition, and reward function. The difficulties and innovations of solving this MDP mainly include: (i) the beam selection problem is NP-hard with discrete variables and coupling constraints, where the constraints (6c)-(6e) must be satisfied; (ii) we aim at selecting the optimal beams and maximizing the sum-rate (6a) by carefully designing the reward function, and hence need to make full use of the beamspace channel's characteristics and take into consideration the selected beam energy, the fairness among users, and the SINR; (iii) the state space is quite large due to the large $M_s$, which severely affects the convergence performance of the NN, so dimension reduction is required to accelerate the convergence rate without affecting the sum-rate performance; (iv) the DRL-based NN should be trained jointly with the deep-unfolding NN, which will be illustrated in detail in the next section.
1) MDP of Beam Selection:
The MDP is formulated mathematically as follows.
• Agent: The BS observes the current state $s_t$ and selects an action $a_t$ based on the policy $\pi$ to interact with the environment. The BS adjusts its policy $\pi$ based on the reward fed back from the environment. We aim at finding the optimal policy $\pi$ to maximize the expected cumulative discounted reward $\mathbb{E}\{\sum_{t \geq 0} \gamma^t r_t\}$.
• State space: We convert the beamspace channel matrix $\mathbf{H} \in \mathbb{C}^{M_s \times K}$ into a $2 \times M_s \times K$ real-valued tensor with the real part and imaginary part stored separately. The state consists of two parts: (i) the channel tensor with dimension $2 \times M_s \times K$; (ii) an indicator tensor with dimension $1 \times M_s \times K$ that indicates whether each beam has been selected. The elements of the indicator tensor are all initialized as 0.
• Action space: The action is the choice of a candidate beam, i.e., $\mathcal{A} = \{1, 2, \cdots, M_s\}$, and only one beam is selected at each time step $t$. The MDP has $N_{RF}$ time steps in an episode, i.e., $t \in \{1, 2, \cdots, N_{RF}\}$. By this design, the constraints (6c) and (6e) are satisfied.
• State transition: Given the current state $s_t$ and the action $a_t = i$, the state transits to $s_{t+1}$: the $i$-th row of the indicator tensor is set to 1, while the channel matrix $\mathbf{H}$ remains unchanged.
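The state encoding and transition above can be sketched as a tiny environment. The class itself and the single-channel indicator layout are illustrative assumptions of ours; rewards are handled separately.

```python
import numpy as np

class BeamSelectionEnv:
    """Minimal sketch of the beam-selection MDP (state tensor and transition)."""

    def __init__(self, H):                 # H: complex M_s x K beamspace channel
        self.H = H
        self.M_s, self.K = H.shape
        self.reset()

    def reset(self):
        self.indicator = np.zeros((1, self.M_s, self.K))
        return self.state()

    def state(self):
        # Real/imaginary channel planes stacked with the selection indicator.
        channel = np.stack([self.H.real, self.H.imag])       # 2 x M_s x K
        return np.concatenate([channel, self.indicator])     # 3 x M_s x K

    def step(self, beam):
        # Action a_t = beam: mark the chosen beam's row; H stays unchanged.
        repeated = bool(self.indicator[0, beam, 0] == 1.0)   # flags a (6d) violation
        self.indicator[0, beam, :] = 1.0
        return self.state(), repeated
```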
2) Reward Function:
Firstly, we need to ensure that the constraint (6d) is satisfied, i.e., each beam is selected at most once. The agent receives a penalty $\varrho$ if a beam is selected more than once, i.e., if the locations corresponding to the selected beam in the indicator tensor are already 1. If the selected beam has not been chosen before, the reward contains the following parts, which evaluate the selected beam from different aspects.
• First of all, part of the reward is defined as the $l_2$ norm of the selected beam vector at each time step, since it reflects the energy of the selected beam. Because $\mathbf{H}$ has a sparse structure, this part of the reward prevents the BS from choosing beams with low energy.
• Secondly, at the last time step $t = N_{RF}$, all $N_{RF}$ beams have been selected and we obtain the beam selection matrix $\mathbf{F}$. Since we aim at maximizing (6a), another part of the reward at time step $t = N_{RF}$ is designed as
$$\sum_{k=1}^{K}\log_2\bigg(1+\frac{|\mathbf{h}_k^H\mathbf{F}\mathbf{p}_k|^2}{\sum_{i\neq k}^{K}|\mathbf{h}_k^H\mathbf{F}\mathbf{p}_i|^2+\sigma^2}\bigg), \quad (15)$$
where the digital precoder $\mathbf{p}_k$ is computed by the deep-unfolding NN introduced in the next section.
• Thirdly, since (15) can only be applied at time step $t = N_{RF}$, we add the following expression as another part of the reward at each time step $t < N_{RF}$. It is an approximation of the SINR [15], which measures the interference among the users:
$$\sum_{k=1}^{K}\frac{|h_{jk}^t|^2}{\sum_{i\neq k}^{K}|h_{ji}^t|^2+\sigma^2}, \quad (16)$$
where $j$ is the index of the beam selected at time step $t$ and $h_{ji}^t$ represents the element in the $j$-th row and the $i$-th column of $\mathbf{H}$.
• Furthermore, to avoid the case where no beam is aligned with some user, whose rate might then approach zero, we add the average energy of each user as part of the reward at each time step $t \geq 2$ to ensure fairness among the users:
$$\sum_{k=1}^{K}\frac{\|\tilde{\mathbf{h}}_k^t\|^2-\|\tilde{\mathbf{h}}_k^{t-1}\|^2}{\|\tilde{\mathbf{h}}_k^t\|^2+\varepsilon}\Big(\mathrm{sgn}\big(\delta-\|\tilde{\mathbf{h}}_k^{t-1}\|^2\big)+1\Big). \quad (17)$$
Note that $\tilde{\mathbf{H}}^t \in \mathbb{C}^{t \times K}$ denotes the channel matrix consisting of all the beams chosen up to time step $t$, $\tilde{\mathbf{h}}_k^t \in \mathbb{C}^{t \times 1}$ represents the $k$-th column of $\tilde{\mathbf{H}}^t$, $\varepsilon$ is added to avoid numerical error, and $\delta$ is a given threshold. The term $\|\tilde{\mathbf{h}}_k^t\|^2-\|\tilde{\mathbf{h}}_k^{t-1}\|^2$ represents the energy contributed by the beam selected at time step $t$ for user $k$. The sign function $\mathrm{sgn}(\cdot)$ is included so that (17) gives an extra reward for selecting a beam aligned with user $k$ when no beam has been allocated to it and its beam energy is very low, i.e., $\|\tilde{\mathbf{h}}_k^{t-1}\|^2 < \delta$.

Fig. 3: The dimension reduction of the beamspace channel, where a darker color represents the elements (beams) with larger magnitude (higher energy).
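As a sketch, the per-step SINR-style reward term (16) for a selected beam $j$ can be computed as follows (the function name is ours):

```python
import numpy as np

def sinr_reward(H, j, sigma2):
    """Approximate-SINR reward (16) for the beam (row) j selected at step t."""
    row = np.abs(H[j, :]) ** 2            # |h^t_{jk}|^2 for each user k
    total = row.sum()
    # For user k, the interference is row j's energy towards the other users.
    return float(np.sum(row / (total - row + sigma2)))
```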
3) Some Tricks for Improving the Performance:
• Dimension reduction: The dimension of the beamspace channel matrix $\mathbf{H} \in \mathbb{C}^{M_s \times K}$ is very high due to the large $M_s$. We choose the $\bar{M}_s$ ($\bar{M}_s \ll M_s$) highest-energy beams to construct a beamspace channel matrix with reduced dimension. Correspondingly, the dimensions of the state and action space are reduced to $2 \times \bar{M}_s \times K$ and $\bar{M}_s$, respectively. The DRL-based NN is more stable and shows better convergence performance with lower-dimensional state and action spaces. Owing to the sparse structure of $\mathbf{H}$, the dimension reduction does not lead to performance degradation when $\bar{M}_s$ is chosen properly. The hyperparameter $\bar{M}_s$ needs to be fine-tuned, since a large $\bar{M}_s$ leads to a high dimension while a small $\bar{M}_s$ might leave out some good beams. Fig. 3 presents the dimension reduction process, taking the scenario of $K = N_{RF} = 4$, $M_s = 16$, and $\bar{M}_s = 8$ as an example: the beams with very low energy are eliminated and the beamspace channel is reduced to $\bar{M}_s = 8$ beams.
• Prioritized experience replay buffer [33]: To overcome learning instability and reduce the correlation among training examples, we use a replay buffer $\mathcal{D}$ to store the transitions $(s_t, a_t, r_t, s_{t+1})$ obtained from the environment under the policy $\pi$, as shown in Fig. 4. A mini-batch of samples stored in $\mathcal{D}$ is drawn for training according to priority, where the samples with worse performance are given higher priority.
• $\epsilon$-greedy strategy [32]: We select a random action $a_t$ with probability $\epsilon$ and the action $a_t = \arg\max_a Q_{\pi^\star}(s_t, a; \theta)$ with probability $1-\epsilon$, where $\epsilon$ decays with time. This incorporates more candidate actions into the training samples.
• Noisy network: To enhance the exploration ability of the NN, we add noise to the trainable parameters of the FC layers.
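The dimension-reduction trick amounts to keeping the $\bar{M}_s$ highest-energy rows of $\mathbf{H}$; a minimal sketch (the helper name is ours):

```python
import numpy as np

def reduce_beams(H, M_bar):
    """Keep the M_bar highest-energy beams (rows of H); returns the reduced
    channel and the kept beam indices in their original order."""
    energy = np.sum(np.abs(H) ** 2, axis=1)          # per-beam energy
    keep = np.sort(np.argsort(energy)[-M_bar:])      # strongest M_bar beams
    return H[keep, :], keep
```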
C. Architecture of the DRL-based NN

1) DDQN: Since the same NN is applied to both choose and evaluate the action in the DQN, the Q-value function might be over-estimated. To avoid the overestimation, we apply the DDQN architecture [36] with two NNs to select and evaluate the action, referred to as the main network and the target network, respectively. The procedure of the DRL-based NN with the DDQN architecture is presented in Fig. 4 (a). In particular, the next state $s_{t+1}$ is employed by the main network to select the action and by the target network to evaluate the value $Q(s_{t+1}, a_{t+1}; \theta')$. Then, the target value $y$ is calculated with the discount factor $\gamma$ and the reward $r_t$. Finally, the error is computed by subtracting $y$ from the optimal value $Q(s_t, a_t^*; \theta)$, and is backpropagated to update the trainable parameters $\theta$.
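The DDQN target (13) decouples action selection (main network $\theta$) from evaluation (target network $\theta'$); a per-transition sketch with hypothetical Q-value vectors:

```python
import numpy as np

def ddqn_target(r_next, q_next_main, q_next_target, gamma=0.99, terminal=False):
    """DDQN target (13): the main network picks a* = argmax_a Q(s', a; theta),
    the target network supplies Q(s', a*; theta')."""
    if terminal:
        return float(r_next)
    a_star = int(np.argmax(q_next_main))
    return float(r_next + gamma * q_next_target[a_star])
```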
2) Dueling Architecture:
The Q-value function depicts how beneficial it is to take an action $a$ at the state $s$. We apply the dueling architecture to estimate the value function $V(s)$ and the advantage function $A(s, a) = Q(s, a) - V(s)$, where $A(s, a)$ describes the advantage of the action $a$ compared with the other actions. Thus, as shown in Fig. 4 (b), the last layer of the DDQN is split into two subnetworks to evaluate $V(s)$ and $A(s, a)$ separately, and we restrict $\sum_a A(s, a) = 0$. Note that $V(s)$ can be interpreted as the mean value of the alternative actions at state $s$, and $A(s, a)$ evaluates the value of action $a$ compared to the mean value $V(s)$. Then, the action-value function $Q(s, a)$ is estimated by combining $V(s)$ with $A(s, a)$.
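With the zero-mean restriction, the combination is $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$; a minimal sketch:

```python
import numpy as np

def dueling_q(V, A):
    """Combine the value and advantage streams; subtracting the mean of A
    enforces sum_a A(s, a) = 0 as described above."""
    A = np.asarray(A, dtype=float)
    return float(V) + (A - A.mean())
```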
Fig. 4: (a) Proposed DRL-based NN for beam selection; (b) dueling architecture; (c) structure of the residual block in MobileNet.
3) Structure of the NN:
We adopt the well-known "MobileNet" architecture for the main network and the target network, each consisting of several cascaded residual blocks and an FC layer. Each residual block is composed of convolutional layers with 1×1 and 3×3 filters, the PReLU activation, and batch normalization (BN) layers, as presented in Fig. 4(c). Compared with an MLP, it has a much smaller number of trainable parameters, which leads to faster and more stable convergence. Based on the aforementioned design, the training procedure of the DRL-based NN for beam selection is described in Algorithm 1.

V. DEEP-UNFOLDING NN FOR PRECODING DESIGN
In this section, we propose a deep-unfolding NN to design the digital precoder and analyze its computational complexity.

Algorithm 1
Training procedure of the DRL-based NN for beam selection
1: Input: batch size $B$, learning rate $\eta$, replay period $Q$, number of episodes $J$, and capacity of the replay memory $\mathcal{D}$;
2: Initialize the replay memory $\mathcal{D} = \emptyset$ and the action-value function $Q^{\pi}(s,a)$ with random parameters $\theta$;
3: for episode $= 1, 2, \cdots, J$ do
4:   Observe the initial state $s$ and choose action $a \sim \pi_{\theta}(s)$;
5:   for time step $t = 1, 2, \cdots, N_{RF}$ do
6:     Select a random action $a_t$ with probability $\epsilon$, and select $a_t = \arg\max_a Q^{\pi}(s_t, a; \theta)$ with probability $1-\epsilon$;
7:     Execute the action $a_t$ and obtain the reward $r_t$ based on Section IV-B2;
8:     Observe the next state $s_{t+1}$ and store the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $\mathcal{D}$;
9:   end for
10:  if episode $\equiv 0 \pmod{Q}$ then
11:    Sample a random mini-batch of $B$ transitions $(s_t, a_t, r_t, s_{t+1})$ from the replay buffer $\mathcal{D}$;
12:    Compute the loss function (12) and apply SGD to update the trainable parameters $\theta$ based on (14);
13:    Copy the trainable parameters $\theta$ into the target network from time to time;
14:  end if
15: end for

A. Iterative WMMSE Algorithm
The beam selection matrix $F$ can be constructed from the selected $N_{RF}$ beams, yielding the equivalent channel matrix $\bar{H}^H = H^H F$. Then, for a given $F$, the sum-rate maximization problem (6) is rewritten as

$$\max_{\{p_k\}} \; \sum_{k=1}^{K}\log\!\Big(1+\frac{|\bar h_k^H p_k|^2}{\sum_{i\ne k}^{K}|\bar h_k^H p_i|^2+\sigma^2}\Big) \tag{18a}$$
$$\mathrm{s.t.}\;\; \sum_{k=1}^{K}\mathrm{Tr}(p_k p_k^H)\le P_s, \tag{18b}$$

where $\bar H \triangleq [\bar h_1, \bar h_2, \cdots, \bar h_K] \in \mathbb{C}^{N_{RF}\times K}$ and $\bar h_k$ denotes the equivalent channel vector of user $k$, i.e., $\bar h_k^H = h_k^H F$. It was demonstrated in [34] that the MMSE problem (19) shown below is equivalent to the sum-rate maximization problem (18), in the sense that the optimal solutions of the two problems are identical:

$$\min_{\{p_k, w_k, u_k\}} \; \sum_{k=1}^{K}\big(w_k e_k - \log w_k\big) \tag{19a}$$
$$\mathrm{s.t.}\;\; \sum_{k=1}^{K}\mathrm{Tr}(p_k p_k^H)\le P_s, \tag{19b}$$

Fig. 5: The architecture of the deep-unfolding NN for digital precoding design.

where $u_k$ and $w_k$ are introduced auxiliary variables, and

$$e_k \triangleq |u_k \bar h_k^H p_k|^2 - 2\Re(u_k \bar h_k^H p_k) + 1 + \sigma^2|u_k|^2 + \sum_{i\ne k}|u_k \bar h_k^H p_i|^2.$$

The authors in [34] developed an iterative WMMSE algorithm to address this problem, in which one variable of $\{u_k, w_k, p_k\}$ is optimized alternately while the other two are fixed. The iterative closed-form expressions are given by

$$u_k = \Big(\sum_{i=1}^{K}\bar h_k^H p_i p_i^H \bar h_k + \sigma^2\Big)^{-1} p_k^H \bar h_k, \;\; \forall k, \tag{20a}$$
$$w_k = \big(1 - u_k \bar h_k^H p_k\big)^{-1}, \;\; \forall k, \tag{20b}$$
$$p_k = \Big(\sum_{i=1}^{K} w_i \bar h_i u_i^* u_i \bar h_i^H + \lambda I\Big)^{-1} w_k \bar h_k u_k^*, \;\; \forall k, \tag{20c}$$

where $\lambda$ denotes the Lagrangian multiplier associated with the power constraint.
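For concreteness, the updates (20a)-(20c) can be sketched in NumPy as follows. This is a hedged sketch, not the paper's implementation: the Lagrangian multiplier is held fixed and the power constraint is enforced by a simple rescaling instead of a bisection search, and a self-consistent conjugate convention is used throughout.

```python
import numpy as np

def sum_rate(Hb, P, sigma2):
    """Sum-rate (18a) for equivalent channel Hb (N_RF x K) and precoders P (N_RF x K)."""
    K = Hb.shape[1]
    rate = 0.0
    for k in range(K):
        gains = np.abs(Hb[:, k].conj() @ P) ** 2        # |h_k^H p_i|^2 for all i
        rate += np.log2(1 + gains[k] / (gains.sum() - gains[k] + sigma2))
    return rate

def wmmse_sweep(Hb, P, sigma2, Ps, lam=1e-2):
    """One pass of (20a)-(20c), followed by a projection onto the power constraint."""
    N_RF, K = Hb.shape
    g = np.array([Hb[:, k].conj() @ P[:, k] for k in range(K)])
    denom = np.array([np.sum(np.abs(Hb[:, k].conj() @ P) ** 2) + sigma2
                      for k in range(K)])
    u = g / denom                                        # (20a)
    w = 1.0 / (1.0 - (np.conj(u) * g).real)              # (20b), strictly > 1
    A = sum(w[i] * np.abs(u[i]) ** 2 * np.outer(Hb[:, i], Hb[:, i].conj())
            for i in range(K)) + lam * np.eye(N_RF)
    P_new = np.linalg.solve(A, Hb * (w * u)[None, :])    # (20c), all users at once
    power = np.trace(P_new @ P_new.conj().T).real
    if power > Ps:                                       # crude power projection
        P_new *= np.sqrt(Ps / power)
    return P_new

# Small demo: random equivalent channel, matched-filter initialization.
rng = np.random.default_rng(1)
N_RF, K, sigma2, Ps = 8, 3, 0.1, 10.0
Hb = (rng.standard_normal((N_RF, K)) + 1j * rng.standard_normal((N_RF, K))) / np.sqrt(2)
P = Hb * np.sqrt(Ps / np.trace(Hb @ Hb.conj().T).real)
for _ in range(20):
    P = wmmse_sweep(Hb, P, sigma2, Ps)
rate = sum_rate(Hb, P, sigma2)
```

The expensive pieces are visible here: the $N_{RF}\times N_{RF}$ system solve in each sweep, and the outer loop of sweeps, which is what the deep-unfolding NN below shortens and approximates.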
The iterative WMMSE algorithm proceeds by executing (20a), (20b), and (20c) iteratively until (19a) converges. Based on the iterative WMMSE algorithm given by the expressions in (20a)-(20c), we develop a novel deep-unfolding NN.

B. Deep-Unfolding NN for Precoding Design
We now introduce the deep-unfolding NN, in which the iterative WMMSE algorithm is unfolded into a layer-wise structure, as shown in Fig. 5. The computational complexity of the iterative WMMSE algorithm is high, since the matrix inversion in (20c) is expensive and the algorithm usually requires a large number of iterations. We can employ a much smaller number of layers and avoid the matrix inversion to reduce the computational complexity. Moreover, trainable parameters are introduced to improve the sum-rate performance and to avoid the bisection search for the Lagrangian multiplier.

Firstly, we define a novel non-linear operation, denoted $A^{\diamond}$, that takes the reciprocal of the diagonal elements of a matrix $A$ and sets the off-diagonal elements to zero. Taking a $3\times 3$ matrix as an example,

$$A=\begin{bmatrix}a_{11}&a_{12}&a_{13}\\a_{21}&a_{22}&a_{23}\\a_{31}&a_{32}&a_{33}\end{bmatrix},\qquad A^{\diamond}=\begin{bmatrix}a_{11}^{-1}&0&0\\0&a_{22}^{-1}&0\\0&0&a_{33}^{-1}\end{bmatrix}. \tag{21}$$

Note that $A^{\diamond} = A^{-1}$ when $A$ is a diagonal matrix. In the iterative WMMSE algorithm, the diagonal elements of the matrices to be inverted are much larger than the off-diagonal elements; hence $A^{\diamond}$ is a good estimate of $A^{-1}$. The matrix inversion $A^{-1}$ is approximated by the combination of two parts: (i) $A^{\diamond}X$, with the element-wise non-linear operation $A^{\diamond}$ and a trainable parameter $X$; (ii) recalling the first-order Taylor expansion of the inverse around a point $A_0$, i.e., $A^{-1} \approx 2A_0^{-1} - A_0^{-1} A A_0^{-1}$, we apply $AY + Z$ with trainable parameters $Y$ and $Z$.

The architecture of the deep-unfolding NN is presented in Fig. 5, where $\Psi$, $\Phi$, and $\Omega$ represent the iterative processes of $u_k$ in (20a), $w_k$ in (20b), and $p_k$ in (20c), respectively. We replace the matrix inversion $A_k^{-1}$ with $A_k^{\diamond}X_k + A_k Y_k + Z_k$, where $A_k \triangleq \sum_{i=1}^{K} w_i \bar h_i u_i^* u_i \bar h_i^H + \lambda_k I$. To avoid the bisection search for the Lagrangian multiplier, we introduce $\lambda_k$ as a trainable parameter for each $p_k$. Thus, in the deep-unfolding NN, the iterative process of $p_k$ in (20c) becomes

$$p_k^{l} = \Big((A_k^{l})^{\diamond}X_k^{l} + A_k^{l}Y_k^{l} + Z_k^{l}\Big)\, w_k^{l}\,\bar h_k\,(u_k^{l})^{*} + O_k^{l}, \;\; \forall k, \tag{22}$$

where $(\cdot)^{l}$ denotes the $l$-th layer and $\{X_k^{l}, Y_k^{l}, Z_k^{l}, O_k^{l}, \lambda_k^{l}\}$ are the introduced trainable parameters. To satisfy the power constraint (18b), we apply the projection operator

$$\Pi\{p_k\}=\begin{cases}p_k, & \sum_{k=1}^{K}\mathrm{Tr}(p_k p_k^H)\le P_s,\\[2pt] \dfrac{\sqrt{P_s}}{\|P\|_F}\,p_k, & \text{otherwise},\end{cases} \tag{23}$$

where $P \triangleq [p_1, p_2, \ldots, p_K]$. To improve the performance, we add a modified procedure in the last layer of the deep-unfolding NN, namely, employing the exact iterative process of $p_k$ in (20c).

The training process of the deep-unfolding NN can be summarized as follows. Firstly, choose a batch of samples $\bar H^H$ from the training dataset and input them into the deep-unfolding NN.
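As a brief numerical aside on the operator defined in (21): its value as an inverse approximation for diagonally dominant matrices can be checked directly (a sketch with illustrative matrices; the trainable combination $A^{\diamond}X + AY + Z$ is not reproduced here):

```python
import numpy as np

def diag_inv(A):
    """The operator in (21): reciprocal of the diagonal, zeros elsewhere."""
    return np.diag(1.0 / np.diag(A))
```

For a diagonal matrix the operator equals the exact inverse, and for a diagonally dominant matrix (as arises in the WMMSE updates) it is already close, which is why the unfolded layer only needs to learn a cheap correction on top of it.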
Then, the digital precoding matrix $P$ is obtained and normalized to satisfy the power constraint. Next, $\{\bar H^H, P\}$ is substituted into (18a) to calculate the sum-rate, followed by computing the gradients of the sum-rate (18a) with respect to the trainable parameters $\{X_k^{l}, Y_k^{l}, Z_k^{l}, O_k^{l}, \lambda_k^{l}\}$. Finally, back propagation is performed and SGD is applied to update these parameters.

In Algorithm 2, we propose a method to train the two NNs jointly, which is fundamentally different from the existing joint design scheme [35], since the gradients cannot be passed directly from the deep-unfolding NN to the DRL-based NN.

Algorithm 2
Joint training procedure of the two NNs
1: Input the initialized parameters of the DRL-based NN as in lines 1-2 of Algorithm 1. Input the parameters of the deep-unfolding NN, i.e., the number of layers $L$, the batch size $B$, and the convergence tolerance $\epsilon$, and initialize its trainable parameters $\{X_k^{l}, Y_k^{l}, Z_k^{l}, O_k^{l}, \lambda_k^{l}\}$ randomly;
2: while the loss function of the DRL-based NN and the sum-rate have not converged do
3:   for episode $= 1, 2, \cdots, J$ do
4:     Execute lines 4-9 of Algorithm 1 for one episode to generate $(s_t, a_t, r_t, s_{t+1})$ and $\bar H^H$;
5:     Calculate the precoding matrix $P$ via the forward propagation of the deep-unfolding NN with input $\bar H^H$; then compute the reward (15) based on $\bar H^H$ and $P$;
6:     Store $(s_t, a_t, r_t, s_{t+1})$ and $\bar H^H$ into buffers as the training datasets for the DRL-based NN and the deep-unfolding NN, respectively;
7:   end for
8:   if episode $\equiv 0 \pmod{Q}$ then
9:     Train the DRL-based NN: execute lines 11-13 of Algorithm 1;
10:    Train the deep-unfolding NN: choose a batch of samples $\bar H^H$ from the training dataset and input them into the deep-unfolding NN to perform the forward propagation; compute the gradients of the sum-rate (18a) with respect to the trainable parameters and perform the back propagation with SGD to update them;
11:  end if
12: end while

Compared with conventional separate training, the proposed joint training method couples the two NNs in two ways: (i) the reward (15) of the DRL-based NN is computed based on the digital precoding matrix $P$ produced by the deep-unfolding NN; (ii) the training samples $\bar H^H$ of the deep-unfolding NN are generated by the DRL-based NN.

C. Computational Complexity
Firstly, we analyze the computational complexity of the proposed joint NN design, which consists of two parts: the DRL-based NN and the deep-unfolding NN. The complexity of the DRL-based NN is $\mathcal{O}\big(\sum_{l=1}^{L-1} Q_l^2 S_l^2 C_{l-1} C_l + K N_{RF} C_{L-1} C_{out}\big)$, where $L$ is the number of layers, $S_l$ represents the size of the convolutional kernel, $C_l$ is the number of channels in the $l$-th layer, $C_{out}$ denotes the output size of the FC layer, and $Q_l$ denotes the output size of the $l$-th layer, which depends on the input size, padding, and stride. The complexity of the deep-unfolding NN is $\mathcal{O}\big(I_n(K^2 N_{RF} + K N_{RF}^{2.37})\big)$, where $I_n$ is the number of layers. Compared with the complexity of the iterative WMMSE algorithm, $\mathcal{O}\big(I_m(K^2 N_{RF} + K N_{RF}^3)\big)$, the deep-unfolding NN has much lower complexity for two reasons: (i) the number of layers of the deep-unfolding NN is smaller than the number of iterations of the iterative WMMSE algorithm, i.e., $I_n < I_m$; (ii) the iterative WMMSE algorithm involves a matrix inversion with complexity $\mathcal{O}(N_{RF}^3)$, while the deep-unfolding NN only requires matrix multiplications with complexity $\mathcal{O}(N_{RF}^{2.37})$. In Table I, we present the computational complexity of the following schemes:
• Joint NN design: the proposed joint design of the DRL-based NN and the deep-unfolding NN;
• PDD: the PDD-based iterative algorithm developed in [10];
• MM-WMMSE: the maximum-magnitude beam selection method with the iterative WMMSE precoding algorithm;
• IA-ZF: the interference-aware beam selection method with ZF precoding developed in [9];
• MS-ZF: the maximization-of-SINR beam selection method with ZF precoding proposed in [12];
• FD-ZF: the fully digital ZF precoding;
• FD-WMMSE: the fully digital WMMSE precoding [34].
In Table I, $I_{p,1}$ and $I_{p,2}$ represent the numbers of inner and outer iterations of the PDD-based algorithm, and $I_m$ and $I_w$ are the numbers of iterations of the MM-WMMSE and FD-WMMSE, respectively.
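As a sanity check on the first expression above, the per-layer convolution cost $Q_l^2 S_l^2 C_{l-1} C_l$ plus the FC term can be tallied with a tiny calculator (the layer sizes below are illustrative, not the paper's actual network):

```python
def conv_layer_cost(Q, S, C_in, C_out):
    """Multiply count of one conv layer: output grid Q*Q, kernel S*S, C_in -> C_out."""
    return Q * Q * S * S * C_in * C_out

def drl_nn_cost(layers, K, N_RF, C_last, C_out):
    """Sum over conv layers plus the final FC term K * N_RF * C_{L-1} * C_out."""
    conv = sum(conv_layer_cost(Q, S, C_in, C_out_l) for (Q, S, C_in, C_out_l) in layers)
    return conv + K * N_RF * C_last * C_out

# Illustrative two-conv-layer network on a 16x16 input grid.
layers = [(16, 3, 2, 32), (16, 3, 32, 32)]
total = drl_nn_cost(layers, K=20, N_RF=20, C_last=32, C_out=64)
```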
Based on the complexity of beam selection and precoding, the overall complexity in Table I is calculated under the condition $M_s \gg N_{RF} \ge K$. Note that the overall complexity of the proposed joint NN design does not depend on $M_s$, which reduces the complexity considerably. From Table I, we can see that the FD-WMMSE precoding has very high computational complexity, since it requires as many RF chains as transmit antennas, which also results in excessive hardware costs. In contrast, the proposed joint NN design achieves lower complexity while still approaching the performance of FD precoding, as will be shown in the simulation results. Moreover, the IA-ZF and MS-ZF algorithms generally have lower computational complexity than the PDD and MM-WMMSE algorithms, but their performance is not as good. Therefore, our proposed joint NN design achieves a practical trade-off between complexity and performance.

VI. SIMULATION RESULTS
In this section, we evaluate the performance of the aforementioned algorithms by performingnumerical simulations.
A. Simulation Setup
The sum-rate performance of the proposed joint NN design is evaluated on the testing dataset after the NN converges. The system configuration is described as follows.

TABLE I: Computational complexity of the analyzed schemes.

Algorithms | Beam selection | Precoding | Overall ($M_s \gg N_{RF} \ge K$)
Joint NN design | $\mathcal{O}\big(\sum_{l=1}^{L-1} Q_l^2 S_l^2 C_{l-1} C_l + K N_{RF} C_{L-1} C_{out}\big)$ | $\mathcal{O}\big(I_n(K^2 N_{RF} + K N_{RF}^{2.37})\big)$ | $\mathcal{O}\big(\sum_{l=1}^{L-1} Q_l^2 S_l^2 C_{l-1} C_l + I_n(K^2 N_{RF} + K N_{RF}^{2.37})\big)$
PDD | $\mathcal{O}(I_{p,1} I_{p,2} M_s N_{RF} K)$ | $\mathcal{O}(I_{p,1} I_{p,2} M_s^3)$ | $\mathcal{O}\big(I_{p,1} I_{p,2}(M_s N_{RF} K + M_s^3)\big)$
MM-WMMSE | $\mathcal{O}(M_s \log M_s)$ | $\mathcal{O}\big(I_m(K^2 N_{RF} + K N_{RF}^3)\big)$ | $\mathcal{O}(M_s \log M_s + I_m K N_{RF}^3)$
IA-ZF | $\mathcal{O}(N_{RF} M_s)$ | $\mathcal{O}(N_{RF} K^2)$ | $\mathcal{O}(N_{RF} M_s + N_{RF} K^2)$
MS-ZF | $\mathcal{O}(N_{RF} K M_s)$ | $\mathcal{O}(N_{RF} K^2)$ | $\mathcal{O}(N_{RF} K M_s)$
FD-ZF | $-$ | $\mathcal{O}(M_s K^2)$ | $\mathcal{O}(M_s K^2)$
FD-WMMSE | $-$ | $\mathcal{O}(I_w K M_s^3)$ | $\mathcal{O}(I_w K M_s^3)$

Unless otherwise specified, the BS is equipped with a DLA with $M_s = 256$ antennas and $N_{RF} = 20$ RF chains to serve $K = 20$ users. The transmission power and the noise power are set to fixed values in dBm. The parameters of the channel model (5) are selected based on [11]: (i) we adopt one LoS link and $L = 3$ NLoS links; (ii) the LoS gain is distributed as $\rho_k^{(0)} \sim \mathcal{CN}(0, 1)$ and the NLoS gains as $\rho_k^{(l)} \sim \mathcal{CN}(0, 0.1)$ for $l = 1, 2, 3$; (iii) the angles $\phi_k^{(0)}$ and $\phi_k^{(l)}$ are generated uniformly at random; (iv) $\rho_k^{(0)}$, $\rho_k^{(l)}$, $\phi_k^{(0)}$, and $\phi_k^{(l)}$ are statistically independent. The parameters of the DRL-based NN are set as follows: discount factor $\gamma$, buffer size $D = 16000$, batch size $B = 40$, replay period $Q = 10$, and a negative penalty $\varrho$ for repeatedly selected beams. The number of layers of the deep-unfolding NN is set to $L = 6$. Separate training and testing datasets are generated. We run the simulations on the platform "Pytorch 1.5.0" with "Python 3.6". Since the platform cannot handle complex matrices natively, we convert $A \in \mathbb{C}^{a\times b}$ into a $2 \times a \times b$ real-valued tensor, with the real part $\Re\{A\}$ and the imaginary part $\Im\{A\}$ stored separately. Then, the real and imaginary parts of the product of two complex matrices $A$ and $B$ are computed as $(\Re\{A\}\Re\{B\} - \Im\{A\}\Im\{B\})$ and $(\Re\{A\}\Im\{B\} + \Im\{A\}\Re\{B\})$, respectively.
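The real-tensor bookkeeping above is exact and easy to verify, as is the matrix-calculus identity $\mathrm{d}\log\det(X) = \mathrm{Tr}(X^{-1}\mathrm{d}X)$ that underlies the hand-coded gradients discussed next (a self-contained sketch; the matrix sizes are illustrative):

```python
import numpy as np

def complex_matmul_as_real(Ar, Ai, Br, Bi):
    """Product of A = Ar + j*Ai and B = Br + j*Bi using four real matmuls."""
    return Ar @ Br - Ai @ Bi, Ar @ Bi + Ai @ Br

def logdet_directional(X, dX, eps=1e-6):
    """Finite-difference vs closed-form directional derivative of log det(X)."""
    fd = (np.linalg.slogdet(X + eps * dX)[1] - np.linalg.slogdet(X)[1]) / eps
    cf = np.trace(np.linalg.solve(X, dX))   # Tr(X^{-1} dX)
    return fd, cf
```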
For the inversion and determinant operations on complex matrices, we override the automatic differentiation of "Pytorch" by computing the closed-form gradients with "Numpy", based on the formulas

$$\mathrm{d}\log\det(X) = \mathrm{Tr}(X^{-1}\mathrm{d}X), \qquad \mathrm{d}\,\mathrm{Tr}(X^{-1}) = -\mathrm{Tr}\big(X^{-1}(\mathrm{d}X)X^{-1}\big), \tag{24}$$

which provide much higher accuracy.

B. Convergence Performance of the Joint NN Design
We first present the convergence performance of the proposed joint NN design. Fig. 6(a) shows the convergence of the loss function, i.e., the mean square error (MSE), under different learning rates.

Fig. 6: (a) Convergence performance of the loss function (MSE) with different learning rates; (b) the constraint violation (number of repeated selected beams).

Fig. 7: Convergence performance of the sum-rate with different learning rates.

It is observed that a smaller learning rate achieves better MSE performance, while a larger learning rate yields faster convergence. Fig. 6(b) presents the constraint violation, i.e., the number of repeatedly selected beams over the testing samples. The constraint (6d) is satisfied when this number equals zero, and a larger number means that the constraint (6d) is violated more severely. We can see that the constraint is severely violated at the beginning of training and is well satisfied after a number of episodes, where a larger learning rate again leads to faster convergence. Fig. 7 presents the sum-rate performance of the proposed joint NN design achieved by different
learning rates. Similar to the MSE in Fig. 6(a), a smaller learning rate achieves better sum-rate performance with more stable convergence, while a larger learning rate leads to faster convergence.

Fig. 8: Achievable system sum-rate versus the number of RF chains $N_{RF}$.

C. Sum-Rate Performance
Fig. 8 compares the sum-rate versus the number of RF chains achieved by the schemes mentioned above, where $K = 16$. The sum-rate of all schemes increases monotonically with the number of RF chains. The gap between the proposed joint NN design and the FD-ZF precoding narrows as $N_{RF}$ increases. Moreover, the joint NN design approaches the sum-rate of the FD-ZF precoding with a small number of RF chains, i.e., $N_{RF} = 30$. When $N_{RF}$ becomes larger, the sum-rate saturates and grows more slowly. From the results, the best performance is achieved by the FD-ZF precoding, followed by the joint NN design, PDD, MM-WMMSE, IA-ZF, and MS-ZF schemes. The PDD and MM-WMMSE outperform the IA-ZF and MS-ZF mainly because the former two schemes apply the iterative WMMSE algorithm for digital precoding, which outperforms ZF precoding. In addition, the PDD outperforms the MM-WMMSE and the IA-ZF outperforms the MS-ZF, since the PDD and IA select better beams than the MM and MS.

Fig. 9 presents the achievable sum-rate versus the number of users, where $N_{RF} = 20$. We see that the sum-rate achieved by all these algorithms increases with $K$. The joint NN design achieves performance close to the FD-ZF precoding, with a gap that increases with $K$.
Fig. 9: Achievable system sum-rate versus the number of users K .
Fig. 10: Achievable system sum-rate versus the number of transmit antennas $M_s$.

From the results, the proposed joint NN design delivers better sum-rate performance than the PDD, MM-WMMSE, IA-ZF, and MS-ZF schemes, which demonstrates its superior ability to mitigate multiuser interference. Besides, the performance gap between the proposed joint NN design and the other schemes widens as the number of users increases, which shows that the joint NN design is promising for applications with many users.

Fig. 10 shows the achievable sum-rate for different numbers of transmit antennas. For this figure, we set $N_{RF} = 16$ and $K = 16$. The joint NN design achieves nearly the same sum-rate as the PDD and MM-WMMSE when $M_s = 16$, since then all 16 beams are selected for the 16 RF chains and the digital precoders designed by these three schemes are nearly identical. Similarly, the IA-ZF and MS-ZF achieve the same performance since they both apply ZF precoding. In addition, the gap between the former three schemes and the latter two at $M_s = 16$ is mainly caused by the difference between WMMSE and ZF precoding. The performance of all schemes increases monotonically with $M_s$, and the joint NN design achieves the best sum-rate performance, followed by the PDD, MM-WMMSE, IA-ZF, and MS-ZF schemes. The gap among these schemes increases with $M_s$, since the angular resolution of the array improves, leading to larger distinctions among the different beams. Besides, the rate of improvement slows down with $M_s$, since the angular resolution is already sufficient when $M_s$ is large, e.g., $M_s = 256$.

Fig. 11: Achievable system sum-rate versus the SNR.

Fig. 11 illustrates the achievable system sum-rate versus the SNR for the different schemes. The sum-rate achieved by all algorithms increases monotonically with the SNR. Besides, the joint NN design significantly outperforms the benchmarks, and the gap increases with the SNR. This is mainly because at low SNR the differences among beams are small, so beam selection and digital precoding have only a minor effect. At high SNR, the joint NN design achieves a clear performance gain over the PDD and an even larger gain over the other benchmarks, which verifies the advantages of the proposed joint NN design. Thus, we conclude that the joint NN design provides an efficient and attractive solution for this problem, especially in the high-SNR regime.
Fig. 12: Sum-rate performance of the joint NN design with various mismatches: (a) MIMO configuration mismatch ($N_{RF}$ and $M_s$, with $M_s \in \{64, 128, 256\}$); (b) $K$ and SNR mismatches (SNR $\in \{25, 30, 35\}$ dB).

D. Generalization Ability
In this subsection, we analyze the generalization ability of the proposed joint NN design. A jointly trained NN with system configuration $(N_{RF}, M_s, K)$ can be straightforwardly transferred to test samples with a smaller configuration $(N_{RF}', M_s', K')$, rather than training a new NN. To test samples from the configuration $(N_{RF}', M_s', K')$, we need to input a channel with the same dimension as the training data; however, the channel in the testing data has dimension $M_s' \times K'$, so we append $(K - K')$ zero column vectors. Moreover, the time step of the DRL-based NN in each episode is set to $N_{RF}'$. As for the deep-unfolding NN, the input is the equivalent channel $\bar H \in \mathbb{C}^{N_{RF}\times K}$, but the dimension of the equivalent channel in the testing data is $N_{RF}' \times K'$; thus, we append $(N_{RF} - N_{RF}')$ zero row vectors and $(K - K')$ zero column vectors to $\bar H$. In addition, an NN trained at a certain SNR can be directly employed to test samples with different SNR values, since the SNR is part of the input of the proposed joint NN design. To improve the generalization ability of the NN with respect to the SNR, the training dataset could include samples with different SNR values.

Fig. 12(a) presents the sum-rate performance of the joint NN design with the MIMO configuration mismatch. We train the DRL-based NN and the deep-unfolding NN in the configuration of $K = 16$, SNR $= 40$ dB, $N_{RF} = 30$, and $M_s = 256$, and test the trained NNs in different settings of $N_{RF}$ and $M_s$ with fixed $K = 16$ and SNR $= 40$ dB. From the figure, we can see that though the NNs
are employed in different MIMO configurations, there is only a small performance loss due to the mismatch. Moreover, the mismatched joint NN design still outperforms the PDD, which demonstrates the satisfactory generalization ability of the proposed joint NN design across MIMO configurations. In addition, the performance loss between the mismatched joint NN design and the design without mismatch decreases with $N_{RF}$ and $M_s$, mainly because the performance of the two is close when the mismatch between the training and testing configurations is small. Fig. 12(b) presents the sum-rate performance of the joint NN design with mismatched SNR and $K$. We train the DRL-based NN and the deep-unfolding NN in the setting of $K = 20$, SNR $= 30$ dB, $N_{RF} = 20$, and $M_s = 256$, and test the trained NNs with different values of SNR and $K$, with fixed $N_{RF} = 20$ and $M_s = 256$. The mismatched joint NN design always outperforms the PDD, even though there is a small performance loss. The sum-rate of the mismatched joint NN design is closest to that of the matched design at the SNR closest to the training SNR. Furthermore, the performance loss decreases with $K$. The small performance loss demonstrates the satisfactory generalization ability of the proposed joint NN design for different values of $K$ and SNR.

Fig. 13: Selected beam energy of each user.

E. Fairness, Imperfect CSI, and Complexity Comparison
Fig. 13 presents the selected beam energy of each user in the setting of $K = 6$, SNR $= 30$ dB, $N_{RF} = 8$, and $M_s = 128$, which shows the fairness among users. The selected beam energy of the $k$-th user is defined as the $\ell_2$-norm of the $k$-th column of the equivalent channel $\bar H$, i.e., $\|\bar h_k\|_2$. It is
Fig. 14: Achievable system sum-rate versus the channel estimation error.obvious that the selected beam energy of different users achieved by the IA and joint NN design ismore balanced than that of the PDD, MM, and MS, which demonstrates that the IA and joint NNdesign achieves better fairness among users. It is mainly because the IA selects the beam aligned foreach user and the joint NN design takes the fairness (17) into consideration.Fig. 14 presents the achievable system sum-rate versus the channel estimation error of the analyzedschemes. The channel estimation error is characterized by the angular estimation error defined as ∆ φ πM s , where ∆ φ denotes the error between the estimated angle and the actual angle of the LoS in thechannel model (5), and πM s is the angular resolution of the antenna. From the results, the sum-rateperformance achieved by all algorithms degrades with the angular estimation error. The proposed jointNN design provides the best performance, followed by the PDD, MM-WMMSE, MS-ZF, and IA-ZF,which verifies the ability of the joint NN design to handle channel uncertainties. It is worth notingthat the sum-rate achieved by the FD-ZF degrades severely and the joint NN design outperformsFD-ZF when the angular estimation error reaches . It is mainly because: (i) The FD-ZF calculatesthe precoding matrix based on the original channel H while the other schemes apply the equivalentchannel ¯ H with selected beams and much lower dimension; (ii) The iterative WMMSE algorithmand deep-unfolding NN have better robustness than the ZF precoding since the ZF precoding willcause severe interference among users with channel estimation errors. Furthermore, the gap betweenthe joint NN design and the other benchmarks escalates with the angular estimation error since theDRL-based NN and deep-unfolding NN have better robustness fed with a large number of samplesin the training process.
Fig. 15: Convergence time versus the number of RF chains $N_{RF}$.

Fig. 15 shows the convergence time versus the number of RF chains for the aforementioned schemes, where we simulate scenarios in which the number of users equals the number of RF chains, i.e., $K = N_{RF}$. The convergence time of the joint NN design is computed by averaging the inference time over the testing samples. The convergence time increases monotonically with $N_{RF}$. Consistent with the complexity analysis in Table I, the PDD has the highest complexity and the largest convergence time, followed by the MM-WMMSE, MS-ZF, joint NN design, and IA-ZF. Note that the convergence time of the MS-ZF exceeds that of the joint NN design when $N_{RF} = 22$. Considering the convergence time, sum-rate performance, and hardware costs together, we conclude that the proposed joint NN design for beam selection and digital precoding achieves a good balance among performance, cost, and computational complexity.

VII. CONCLUSION
In this work, we investigated the joint beam selection and precoding design for mmWave MU-MIMO systems with a single-sided DLA. An efficient joint NN design has been proposed to solve this challenging problem. Specifically, the DRL-based NN has been applied to obtain the beam selection matrix, and the deep-unfolding NN has been developed to optimize the digital precoding matrix. Simulation results showed that the proposed jointly trained NN significantly outperforms the existing iterative algorithms, with reduced computational complexity and stronger robustness. Thus, we conclude that the proposed joint NN design can be employed as an alternative to iterative optimization algorithms in practical systems. Future work could generalize the proposed joint NN design into a framework for solving other challenging MINLPs in communications, where the DRL-based NN and the deep-unfolding NN optimize the discrete variables and the continuous variables, respectively.

REFERENCES

[1] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, "Five disruptive technology directions for 5G," IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
[2] T. L. Marzetta, "Noncooperative cellular wireless with unlimited numbers of base station antennas," IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, Nov. 2010.
[3] A. Ghosh, T. A. Thomas, M. C. Cudak, R. Ratasuk, P. Moorut, F. W. Vook, T. S. Rappaport, G. R. MacCartney, S. Sun, and S. Nie, "Millimeter-wave enhanced local area systems: A high-data-rate approach for future wireless networks," IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1152–1163, Jun. 2014.
[4] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, "Scaling up MIMO: Opportunities and challenges with very large arrays," IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[5] O. E. Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, "Spatially sparse precoding in millimeter wave MIMO systems," IEEE Trans. Wireless Commun., vol. 13, no. 3, pp. 1499–1513, Mar. 2014.
[6] J. Brady, N. Behdad, and A. M. Sayeed, "Beamspace MIMO for millimeter-wave communications: System architecture, modeling, analysis, and measurements," IEEE Trans. Antennas Propag., vol. 61, no. 7, pp. 3814–3827, Jul. 2013.
[7] Y. Zeng and R. Zhang, "Millimeter wave MIMO with lens antenna array: A new path division multiplexing paradigm," IEEE Trans. Commun., vol. 64, no. 4, pp. 1557–1571, Apr. 2016.
[8] H. Liu, X. Yuan, and Y. Zhang, "Statistical beamforming for FDD downlink massive MIMO via spatial information extraction and beam selection," arXiv preprint arXiv:2003.03041, 2020.
[9] X. Gao, L. Dai, Z. Chen, Z. Wang, and Z. Zhang, "Near-optimal beam selection for beamspace mmWave massive MIMO systems," IEEE Commun. Lett., vol. 20, no. 5, pp. 1054–1057, May 2016.
[10] R. Guo, Y. Cai, M. Zhao, Q. Shi, B. Champagne, and L. Hanzo, "Joint design of beam selection and precoding matrices for mmWave MU-MIMO systems relying on lens antenna arrays," IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 313–325, May 2018.
[11] A. Sayeed and J. Brady, "Beamspace MIMO for high-dimensional multiuser communication at millimeter-wave frequencies," in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Dec. 2013, pp. 3679–3684.
[12] P. V. Amadori and C. Masouros, "Low RF-complexity millimeter-wave beamspace-MIMO systems by beam selection," IEEE Trans. Commun., vol. 63, no. 6, pp. 2212–2223, Jun. 2015.
[13] W. Shen, X. Bu, X. Gao, C. Xing, and L. Hanzo, "Beamspace precoding and beam selection for wideband millimeter-wave MIMO relying on lens antenna arrays," IEEE Trans. Signal Process., vol. 67, no. 24, pp. 6301–6313, Dec. 2019.
[14] N. J. Myers, A. Mezghani, and R. W. Heath, "FALP: Fast beam alignment in mmWave systems with low-resolution phase shifters," IEEE Trans. Commun., vol. 67, no. 12, pp. 8739–8753, Dec. 2019.
[15] R. Pal, A. K. Chaitanya, and K. V. Srinivas, "Low-complexity beam selection algorithms for millimeter wave beamspace MIMO systems," IEEE Commun. Lett., vol. 23, no. 4, pp. 768–771, Apr. 2019.
[16] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, "Deep learning in physical layer communications," IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
[17] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[18] Y. Long, Z. Chen, J. Fang, and C. Tellambura, "Data-driven-based analog beam selection for hybrid beamforming under mmWave channels," IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 340–352, May 2018.
[19] A. Klautau, P. Batista, N. González-Prelcic, Y. Wang, and R. W. Heath, "5G MIMO data for machine learning: Application to beam-selection using deep learning," 2018, pp. 1–9.
[20] C. Antón-Haro and X. Mestre, "Learning and data-driven beam selection for mmWave communications: An angle of arrival-based approach," IEEE Access, vol. 7, pp. 20404–20415, 2019.
[21] Y. Wei, F. R. Yu, M. Song, and Z. Han, "User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, Jan. 2018.
[22] A. Balatsoukas-Stimming and C. Studer, "Deep unfolding for communications systems: A survey and some new directions," 2019, pp. 266–271.
[23] J. Chien and C. Lee, "Deep unfolding for topic models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 318–331, 2018.
[24] M. Borgerding, P. Schniter, and S. Rangan, "AMP-inspired deep networks for sparse linear inverse problems," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4293–4308, Aug. 2017.
[25] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu, and Z. Ding, "Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems,"
IEEE Trans. Wireless Commun. , to appear.[26] H. He, C. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for MIMO detection,”
IEEE Trans. Signal Process. , vol. 68,pp. 1702–1715, 2020.[27] V. Mnih and et al. , “Human-level control through deep reinforcement learning,”
Nature , vol. 518, pp. 529–233, Feb. 2015.[28] L. Xiao, Y. Li, X. Huang, and X. Du, “Cloud-based malware detection game for mobile devices with offloading,”
IEEE Trans.Mobile Comput. , vol. 16, no. 10, pp. 2742–2750, Oct. 2017.[29] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wirelessnetworks,”
IEEE Trans. Cogn. Commun. Netw. , vol. 4, no. 2, pp. 257–265, Jun. 2018.[30] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning-based mode selection and resource management for green fog radioaccess networks,”
IEEE Internet Things J. , vol. 6, no. 2, pp. 1960–1971, Apr. 2019.[31] U. Challita, L. Dong, and W. Saad, “Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,”
IEEE Trans. Wireless Commun. , vol. 17, no. 7, pp. 4674–4689, Jul. 2018.[32] N. Zhao, Y. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep reinforcement learning for user association and resourceallocation in heterogeneous cellular networks,”
IEEE Trans. Wireless Commun. , vol. 18, no. 11, pp. 5141–5152, Nov. 2019.[33] L. Huang, S. Bi, and Y. J. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edgecomputing networks,”
IEEE Trans. Mobile Comput. , vol. 19, no. 11, pp. 2581–2593, Nov. 2020.[34] Q. Shi, M. Razaviyayn, Z. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximizationfor a MIMO interfering broadcast channel,”
IEEE Trans. Signal Process. , vol. 59, no. 9, pp. 4331–4340, Sep. 2011.[35] J. Tao, J. Chen, J. Xing, S. Fu, and J. Xie, “Autoencoder neural network based intelligent hybrid beamforming design for mmWavemassive MIMO systems,”
IEEE Trans. Cogn. Commun. Netw. , vol. 6, no. 3, pp. 1019–1030, Sep. 2020.[36] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in