Reinforcement Learning for Beam Pattern Design in Millimeter Wave and Massive MIMO Systems
Yu Zhang, Muhammad Alrabeiah, and Ahmed Alkhateeb
Abstract—Employing large antenna arrays is a key characteristic of millimeter wave (mmWave) and terahertz communication systems. However, due to the adoption of fully analog or hybrid analog/digital architectures, as well as non-ideal hardware or arbitrary/unknown array geometries, accurate channel state information becomes hard to acquire. This impedes the design of the beamforming/combining vectors that are crucial to fully exploit the potential of large-scale antenna arrays in providing sufficient receive signal power. In this paper, we develop a novel framework that leverages deep reinforcement learning (DRL) and a Wolpertinger-variant architecture and learns how to iteratively optimize the beam pattern (shape) for serving one or a small set of users, relying only on receive power measurements and without requiring any explicit channel knowledge. The proposed model accounts for key hardware constraints such as the phase-only, constant-modulus, and quantized-angle constraints. Further, the proposed framework can efficiently optimize the beam patterns for systems with non-ideal hardware and for arrays with unknown or arbitrary geometries. Simulation results show that the developed solution is capable of finding near-optimal beam patterns based only on receive power measurements.
I. INTRODUCTION
Leveraging the large bandwidth available at millimeter wave (mmWave) frequency bands requires the deployment of large antenna arrays. However, because of the high cost and power consumption of mixed-circuit components, mmWave systems normally rely either fully or partially on analog beamforming, where transmitters/receivers employ networks of quantized phase shifters [1], [2]. This makes basic MIMO signal processing functions, such as channel estimation, challenging, as the channels are seen only through the RF lens. As a result, classical beamforming/combining design approaches, e.g., [3], [4], may not be feasible because of the unavailability of the channels as well as the new constraints of the design problem. Besides, the hardware is possibly not ideal due to the use of inexpensive and low-precision radio components. In this case, the performance of commonly used beams (such as those in classical beamsteering codebooks) degrades drastically, because these beams are unaware of the environment and of the hardware/array geometry.
Prior Work: Designing efficient beamforming and combining vectors is essential for realizing the potential of MIMO communications, and it has been an important research topic in the MIMO signal processing literature [1], [3]–[6]. For MIMO systems with no hardware constraints, i.e., with fully-digital processing and no constraints on the RF hardware, maximum ratio transmission and combining maximize the achievable SNR with single-stream transmission/reception [3]. To realize these solutions, however, the MIMO system must be able to control the magnitude and phase of the signal at each antenna. When only the phase can be controlled, equal-gain transmission solutions have been developed to maximize the SNR or diversity gains [4]. This is particularly interesting for mmWave and terahertz systems, where the beamforming/precoding processing is fully or partially done in the RF domain using analog phase shifters [1]. In these systems, however, the phase shifters can normally take only quantized phase shift values, which makes the search over the space of quantized phase shift values highly complex (e.g., for a 32-element antenna array with 2-bit phase shifters, there are $4^{32}$ possible beamforming vectors) [1], [5], [6]. Further, in analog beamforming architectures, the channel is seen through the RF lens, which makes it hard to acquire at the baseband, especially for systems with arbitrary or unknown array geometries. To address these challenges, this paper designs a reinforcement learning based approach to efficiently learn analog beamforming patterns that adapt to the surrounding environment and the adopted hardware/array geometry without requiring explicit channel knowledge.

Contribution: In this paper, we propose a deep reinforcement learning based framework that can learn how to optimize the beam pattern for serving a single user or a set of users with similar channels. The developed framework relies only on receive power measurements and does not require any channel knowledge. This framework adapts the beam pattern based on the surrounding environment and learns how to compensate for hardware impairments. This is done by utilizing a novel Wolpertinger architecture [7], which is designed to efficiently explore the large discrete action space. The proposed model accounts for key hardware constraints such as the phase-only, constant-modulus, and quantized-angle constraints [1]. This is realized by defining the state directly as the phases of the analog phase shifters and the action as the change of phases within the quantized phase set. Simulation results show that the proposed solution is capable of finding a near-optimal beam pattern and achieving a beamforming/combining gain comparable to that of equal gain combining.

Yu Zhang, Muhammad Alrabeiah, and Ahmed Alkhateeb are with Arizona State University (Email: y.zhang, malrabei, [email protected]). This work is supported by the National Science Foundation under Grant No. 1923676.

II. SYSTEM AND CHANNEL MODELS
In this section, we introduce in detail the adopted system and channel models. We also describe how the models account for arbitrary array geometries with possible hardware impairments.

A. System Model
We consider a system model where a mmWave massive MIMO base station (BS) with $M$ antennas communicates with a single-antenna user. Further, given the high cost and power consumption of mixed-signal components, we consider a practical system where the BS has only one radio frequency (RF) chain and employs analog-only beamforming/combining using a network of $r$-bit quantized phase shifters. Therefore, the beamforming/combining vector can be written as
$$\mathbf{w} = \frac{1}{\sqrt{M}} \left[ e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_M} \right]^T, \quad (1)$$
where each phase shift $\theta_m$ is selected from a finite set $\boldsymbol{\Theta}$ with $2^r$ possible discrete values drawn uniformly from $(-\pi, \pi]$. In the uplink transmission, if a user $u$ transmits a symbol $x \in \mathbb{C}$ to the base station, where the transmitted symbol satisfies the average power constraint $\mathbb{E}\left[|x|^2\right] = P_x$, the received signal at the base station after combining can be expressed as
$$y_u = \mathbf{w}^H \mathbf{h}_u x + \mathbf{w}^H \mathbf{n}, \quad (2)$$
where $\mathbf{h}_u \in \mathbb{C}^{M \times 1}$ is the uplink channel vector between user $u$ and the base station antennas, and $\mathbf{n} \sim \mathcal{N}_\mathbb{C}\left(\mathbf{0}, \sigma_n^2 \mathbf{I}\right)$ is the receive noise vector at the base station.

B. Channel Model
We adopt a general geometric channel model for $\mathbf{h}_u$. Assume that the signal propagation between user $u$ and the base station consists of $L$ paths. Each path $\ell$ has a complex gain $\alpha_\ell$ and an angle of arrival $\phi_\ell$. Then, the channel vector can be written as
$$\mathbf{h}_u = \sum_{\ell=1}^{L} \alpha_\ell \mathbf{a}(\phi_\ell), \quad (3)$$
where $\mathbf{a}(\phi_\ell)$ is the array response vector of the base station. The definition of $\mathbf{a}(\phi_\ell)$ depends on the array geometry and hardware impairments, which we discuss next in more detail.

C. Hardware Impairments Model
Most of the prior work on mmWave signal processing has assumed uniform antenna arrays with perfect calibration and ideal hardware [1], [2], [8], [9]. In this paper, we consider a more general antenna array model that accounts for arbitrary geometry and hardware impairments, and we target learning beam patterns that mitigate the influence of these unknown factors. While the beam pattern learning solution developed in this paper is general for various kinds of array geometries and hardware impairments, we evaluate it in Section V with respect to two main characteristics of interest, namely non-uniform spacing and phase mismatch between the antenna elements. For linear arrays, the array response vector can be modeled to capture these characteristics as follows
$$\mathbf{a}(\phi_\ell) = \left[ e^{j(k d_1 \cos(\phi_\ell) + \Delta\theta_1)}, e^{j(k d_2 \cos(\phi_\ell) + \Delta\theta_2)}, \ldots, e^{j(k d_M \cos(\phi_\ell) + \Delta\theta_M)} \right]^T, \quad (4)$$
where $d_m$ is the position of the $m$-th antenna, and $\Delta\theta_m$ is the additional phase shift incurred at the $m$-th antenna (to model the phase mismatch). Without loss of generality, we assume that $d_m$ and $\Delta\theta_m$ are fixed yet unknown random realizations drawn from the distributions $\mathcal{N}\left((m-1)d, \sigma_d^2\right)$ and $\mathcal{N}\left(0, \sigma_p^2\right)$, respectively, where $d$ is the ideal antenna spacing, and $\sigma_d$ and $\sigma_p$ are the standard deviations of the random antenna position and phase mismatch. Besides, we impose the additional constraint $d_1 < d_2 < \cdots < d_M$ to ensure that the generated antenna positions are physically meaningful.

III. PROBLEM DEFINITION
In this paper, we investigate the beam pattern design problem for mmWave and massive MIMO systems with unknown array geometry and hardware impairments. Given the system and channel models described in Section II, the SNR after combining for user $u$ can be written as
$$\mathrm{SNR}_u = \frac{\left|\mathbf{w}^H \mathbf{h}_u\right|^2}{\|\mathbf{w}\|^2}\, \rho = \left|\mathbf{w}^H \mathbf{h}_u\right|^2 \rho, \quad (5)$$
where $\|\mathbf{w}\|^2 = 1$ is implicitly used and $\rho = \frac{P_x}{\sigma_n^2}$. Besides, we define the beamforming/combining gain of adopting $\mathbf{w}$ as a transmit/receive beamformer for user $u$ as
$$g_u = \left|\mathbf{w}^H \mathbf{h}_u\right|^2. \quad (6)$$
It can be seen that maximizing (6) is equivalent to maximizing the SNR in (5). Therefore, the objective of this paper is to design (learn) the beamformer $\mathbf{w}$ that maximizes the beamforming/combining gain given by (6) for a single user or a set of users with similar channels. The beam pattern learning problem can thus be formulated as
$$\mathbf{w}_{\mathrm{opt}} = \arg\max_{\mathbf{w}} \frac{1}{|\mathcal{H}|} \sum_{\mathbf{h}_u \in \mathcal{H}} \left|\mathbf{w}^H \mathbf{h}_u\right|^2, \quad (7)$$
$$\mathrm{s.t.} \quad w_m = \frac{1}{\sqrt{M}} e^{j\theta_m}, \quad \forall m = 1, \ldots, M, \quad (8)$$
$$\theta_m \in \boldsymbol{\Theta}, \quad \forall m = 1, \ldots, M, \quad (9)$$
where $w_m$ is the $m$-th element of the beamforming vector, and $\mathcal{H}$ is the channel set, which contains either a single channel or multiple similar channels. It is worth mentioning that the constraint in (8) is imposed to uphold the adopted analog-only system model, and the constraint in (9) respects the quantized phase-shifter hardware constraint.

Due to the unknown array geometry as well as possible hardware impairments, accurate channel state information is generally hard to acquire. This means that all the channels $\mathbf{h}_u \in \mathcal{H}$ in the objective function are possibly unknown. Instead, the base station may only have access to the beamforming/combining gain $g_u$, or equivalently the Received Signal Strength Indicator (RSSI).
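To make the quantities in (1)–(6) concrete, the following sketch builds an impaired array response as in (4), a geometric channel as in (3), a quantized analog beamformer as in (1), and evaluates the gain in (6). The array size, impairment levels, and path statistics are illustrative assumptions, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 32   # number of BS antennas
r = 3    # phase-shifter resolution in bits
# Quantized phase set Theta: 2^r values drawn uniformly from (-pi, pi]
Theta = -np.pi + 2 * np.pi * np.arange(1, 2**r + 1) / 2**r

# Impaired linear array as in (4); impairment levels are illustrative
d = 0.5                                                              # ideal spacing (wavelengths)
pos = np.sort(np.arange(M) * d + 0.1 * d * rng.standard_normal(M))   # perturbed, increasing positions
dtheta = 0.05 * np.pi * rng.standard_normal(M)                       # per-antenna phase mismatch

def array_response(phi):
    """Array response vector a(phi) with position and phase impairments."""
    k = 2 * np.pi  # wavenumber, with positions measured in wavelengths
    return np.exp(1j * (k * pos * np.cos(phi) + dtheta))

# Geometric channel as in (3) with L paths (gains/angles are random placeholders)
L = 5
alpha = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)
phi = rng.uniform(0, np.pi, L)
h = sum(a_l * array_response(p_l) for a_l, p_l in zip(alpha, phi))

# Quantized analog beamformer as in (1), here with randomly chosen phases
theta = rng.choice(Theta, M)
w = np.exp(1j * theta) / np.sqrt(M)

# Beamforming gain as in (6); this is the only feedback the learning agent needs
g = np.abs(np.vdot(w, h)) ** 2
```

Note that $g$ can never exceed the equal gain combining (EGC) bound $\left(\sum_m |h_m|\right)^2 / M$, which is the upper bound used in the evaluations of Section V.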
Therefore, problem (7) is hard to solve in a general sense because of the unknown parameters in the objective function as well as the non-convex constraint (8) and the discrete constraint (9). Given that this problem is essentially a search problem in a dauntingly huge yet finite and discrete space, we consider leveraging the powerful exploration capability of deep reinforcement learning to efficiently search over this space and find the optimal or near-optimal solution.

Fig. 1. The proposed beam pattern design framework with deep reinforcement learning. The schematic shows the agent architecture and the way it interacts with the environment.

IV. BEAM PATTERN LEARNING
In this section, we present our proposed DRL-based algorithm for addressing the beam pattern design problem (7). It is worth mentioning that, when viewed from a reinforcement learning perspective, the problem features a finite yet very high dimensional action space. This makes traditional learning frameworks (such as deep Q-learning, deep deterministic policy gradient, etc.) hard to apply. Therefore, we adopt the Wolpertinger architecture to enable efficient search in a large discrete action space; its details can be found in [7].
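As a rough illustration of how a Wolpertinger-style agent selects a discrete action, the sketch below maps a continuous proto-action from the actor to a few valid quantized candidates and lets the critic pick among them. The actor and critic here are trivial stand-ins (not the paper's networks), and the candidate-generation step is a simplified perturb-and-quantize variant.

```python
import numpy as np

rng = np.random.default_rng(1)

M, r = 8, 2                                          # toy sizes for illustration
Theta = -np.pi + 2 * np.pi * np.arange(1, 2**r + 1) / 2**r

def actor(state):
    """Stand-in for the actor network: emits a continuous proto-action."""
    return state + 0.3 * rng.standard_normal(M)

def critic(state, action):
    """Stand-in for the critic network: scores a (state, action) pair."""
    return -np.abs(action).sum()

def nearest_in_Theta(x):
    """Snap each entry of x to its closest quantized phase in Theta."""
    return Theta[np.argmin(np.abs(Theta[None, :] - x[:, None]), axis=1)]

def wolpertinger_action(state, k=4):
    proto = actor(state)
    # Generate k valid candidates around the proto-action, then refine with
    # the critic (this paper's variant effectively uses pure quantization, k = 1)
    candidates = [nearest_in_Theta(proto + 0.1 * rng.standard_normal(M))
                  for _ in range(k)]
    return max(candidates, key=lambda a: critic(state, a))

s = rng.choice(Theta, M)
a = wolpertinger_action(s)
```

The key design choice is that the agent never searches the full discrete space (of size $(2^r)^M$); it only evaluates a handful of valid actions near the actor's continuous output.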
1) Reinforcement Learning Setup: To solve the problem with reinforcement learning, we first specify the corresponding building blocks of the learning algorithm as follows:

• State: We define the state $\mathbf{s}_t$ as the vector of the phases of all the phase shifters at the $t$-th iteration, that is, $\mathbf{s}_t = [\theta_1, \theta_2, \ldots, \theta_M]^T$. This phase vector can be converted to the actual beamforming vector by applying (1). Since all the phases in $\mathbf{s}_t$ are selected from $\boldsymbol{\Theta}$, and all the phase values in $\boldsymbol{\Theta}$ are within $(-\pi, \pi]$, (1) essentially defines a bijective mapping from the phase vector to the beamforming vector. Therefore, for simplicity, we use the term "beamforming vector" to refer to both the phase vector and the actual beamforming vector (with the conversion given by (1)), according to the context.

• Action: We define the action $\mathbf{a}_t$ as the element-wise changes to all the phases in $\mathbf{s}_t$. Since the phases can only take values in $\boldsymbol{\Theta}$, a change of a phase means that the phase shifter selects a value from $\boldsymbol{\Theta}$. Therefore, the action directly specifies the next state, i.e., $\mathbf{s}_{t+1} = \mathbf{a}_t$.

• Reward: We define a ternary reward mechanism, i.e., the reward $r_t$ takes values from $\{+1, 0, -1\}$. We compare the beamforming gain achieved by the current beamforming vector, denoted $g_t$, with two values: (i) an adaptive threshold $\beta_t$, and (ii) the previous beamforming gain $g_{t-1}$. The reward is computed using the following rule:
– if $g_t > \beta_t$, then $r_t = +1$;
– if $g_t \leq \beta_t$ and $g_t > g_{t-1}$, then $r_t = 0$;
– if $g_t \leq \beta_t$ and $g_t \leq g_{t-1}$, then $r_t = -1$.

It is important to note that the adopted adaptive threshold mechanism does not rely on any prior knowledge of the channel distribution. The threshold starts from zero, and whenever the BS tries a new beam whose beamforming gain surpasses the current threshold, the system updates the threshold to the value of this new beamforming gain. Besides, since an update of the threshold also marks the detection of a new beam achieving the best beamforming gain so far, the BS records this beamforming vector. As can be seen from the reward definition, in order to calculate the reward, the system always tracks two quantities: the previous beamforming gain and the best beamforming gain achieved so far (i.e., the threshold).
2) Environment Interaction:
As mentioned in Sections I and III, due to the possible hardware impairments, accurate channel state information is generally unavailable. Therefore, the base station can only resort to the beamforming/combining gain to adjust its beam pattern in order to achieve better performance. Upon forming a new beam $\tilde{\mathbf{w}}$, the base station uses this beam to receive the pilots transmitted from every user. Then, it averages all the beamforming gains
$$\bar{g} = \frac{1}{|\mathcal{H}|} \sum_{\mathbf{h}_u \in \mathcal{H}} \left|\tilde{\mathbf{w}}^H \mathbf{h}_u\right|^2, \quad (10)$$
where $\mathcal{H}$ represents the targeted user channel set. Recall that (10) is the same as evaluating the objective function of (7) with the current beamforming vector $\tilde{\mathbf{w}}$. Depending on whether or not the new average beamforming gain surpasses the previous one as well as the current threshold, the base station receives either a reward or a penalty, based on which it can judge the "quality" of the current beam and decide how to move.

Algorithm 1 DRL-Based Beam Pattern Learning
Initialize actor network $\mu(\mathbf{s}|\theta^\mu)$ and critic network $Q(\mathbf{s}, \mathbf{a}|\theta^Q)$ with random weights $\theta^\mu$ and $\theta^Q$
Initialize target networks $\mu'$ and $Q'$ with the weights of the actor and critic networks: $\theta^{\mu'} \leftarrow \theta^\mu$ and $\theta^{Q'} \leftarrow \theta^Q$
Initialize the replay memory $\mathcal{D}$ and the minibatch size $B$
Initialize the adaptive threshold $\beta = 0$ and the previous average beamforming gain $g_0 = 0$
Initialize a random process $\mathcal{N}$ for action exploration
Initialize a random phase vector as the initial state $\mathbf{s}_1$
for $t = 1$ to $T$ do
  Receive a predicted action from the actor network with exploration noise: $\hat{\mathbf{a}}_t = \mu(\mathbf{s}_t|\theta^\mu) + \mathcal{N}_t$
  Quantize the predicted action to a valid beamforming vector $\mathbf{a}_t$ according to (11)
  Execute action $\mathbf{a}_t$, observe the reward $r_t$, and update the state to $\mathbf{s}_{t+1} = \mathbf{a}_t$
  Update the threshold $\beta$ and the previous gain $g_t$
  Store the transition $(\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}_{t+1})$ in $\mathcal{D}$
  Sample a random minibatch of $B$ transitions $(\mathbf{s}_b, \mathbf{a}_b, r_b, \mathbf{s}_{b+1})$ from $\mathcal{D}$
  Calculate the target $y_b = r_b + \gamma Q'(\mathbf{s}_{b+1}, \mu'(\mathbf{s}_{b+1}|\theta^{\mu'})|\theta^{Q'})$
  Update the critic network by minimizing the mean squared loss $L = \frac{1}{B} \sum_b \left(y_b - Q(\mathbf{s}_b, \mathbf{a}_b|\theta^Q)\right)^2$
  Update the actor network using the sampled policy gradient $\frac{1}{B} \sum_{b=1}^{B} \nabla_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a}|\theta^Q)\big|_{\mathbf{s}=\mathbf{s}_b, \mathbf{a}=\mu(\mathbf{s}_b|\theta^\mu)} \nabla_{\theta^\mu} \mu(\mathbf{s}|\theta^\mu)\big|_{\mathbf{s}=\mathbf{s}_b}$
  Update the target networks every $C$ iterations
end for
3) Exploration:
The exploration happens after the actor network predicts the action $\hat{\mathbf{a}}_{t+1}$ based on the current state (beam) $\mathbf{s}_t$. Upon obtaining the predicted action, additive noise is applied element-wise to $\hat{\mathbf{a}}_{t+1}$ for the purpose of exploration, which is customary in the context of reinforcement learning with continuous action spaces [10], [11]. In our problem, we use temporally correlated noise samples generated by an Ornstein-Uhlenbeck process [12], which is also used in [7]. It is worth mentioning that a proper configuration of the noise generation parameters has a significant impact on the learning process. Normally, the extent of exploration (noise power) is set to be a decreasing function of the iteration number, reflecting the well-known exploration-exploitation tradeoff [10]. Furthermore, the exact configuration of the noise power should relate to the specific application. In our problem, for example, the noise is directly added to the predicted phases. Thus, at the very beginning, the noise should be strong enough to perturb a predicted phase to any other phase in $\boldsymbol{\Theta}$. By contrast, when the learning process approaches termination (i.e., the learned beam already performs well), the noise power should be decreased to a smaller level that is only capable of perturbing a predicted phase to its adjacent phases in $\boldsymbol{\Theta}$.
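A minimal sketch of such annealed Ornstein-Uhlenbeck exploration noise follows. The mean-reversion rate, initial scale, and decay schedule are illustrative assumptions; the paper does not report its exact values.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with an annealed scale.
    theta_ou, sigma0, decay, and sigma_min are illustrative choices."""
    def __init__(self, dim, theta_ou=0.15, sigma0=np.pi, decay=0.999, sigma_min=0.05):
        self.dim = dim
        self.theta_ou = theta_ou
        self.sigma = sigma0
        self.decay = decay
        self.sigma_min = sigma_min
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(2)

    def sample(self):
        # Discretized OU step: dx = -theta_ou * x + sigma * N(0, I)
        self.x += -self.theta_ou * self.x + self.sigma * self.rng.standard_normal(self.dim)
        # Anneal: strong perturbations early, small ones near convergence
        self.sigma = max(self.sigma_min, self.sigma * self.decay)
        return self.x.copy()

noise = OUNoise(dim=32)
samples = [noise.sample() for _ in range(1000)]
```

Starting the scale near $\pi$ lets the noise initially push a predicted phase anywhere in $(-\pi, \pi]$, while the floor `sigma_min` keeps just enough late-stage exploration to reach adjacent phases in $\boldsymbol{\Theta}$.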
4) Quantization:
The predicted beam (with exploration noise added) should be quantized in order to be a valid new beam that can be implemented by the discrete phase shifters.

Fig. 2. The top view of the considered communication scenario.

Therefore, each quantized phase in the new vector can be calculated as
$$[\mathbf{s}_{t+1}]_m = \arg\min_{\theta \in \boldsymbol{\Theta}} \left|\theta - [\hat{\mathbf{s}}_{t+1}]_m\right|, \quad \forall m = 1, 2, \ldots, M, \quad (11)$$
which is essentially a nearest neighbor lookup (i.e., a KNN classifier with $k = 1$).
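The nearest-neighbor lookup in (11) can be sketched in a few lines (the example phases are arbitrary):

```python
import numpy as np

r = 3
Theta = -np.pi + 2 * np.pi * np.arange(1, 2**r + 1) / 2**r  # quantized phase set

def quantize_phases(s_hat):
    """Nearest-neighbor lookup of (11): snap each noisy predicted phase
    to its closest value in the quantized set Theta."""
    idx = np.argmin(np.abs(Theta[None, :] - s_hat[:, None]), axis=1)
    return Theta[idx]

s_hat = np.array([0.1, -3.0, 1.6, 3.1])   # example noisy predicted phases
s_next = quantize_phases(s_hat)
```

Note that, exactly like (11), the distance here is measured on the real line rather than modulo $2\pi$.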
5) Forward Computation and Backward Update:
The current state $\mathbf{s}_t$ and the new state $\mathbf{s}_{t+1}$ (recall that we directly set $\mathbf{s}_{t+1} = \mathbf{a}_t$) are then fed into the critic network to compute the Q value, based on which the targets of both the actor and critic networks are calculated. This completes a forward pass. Following that, a backward update is performed on the parameters of the actor and critic networks. Pseudo code for the complete procedure is given in Algorithm 1.

V. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed solution. We first describe the scenario and dataset adopted in our simulations and then discuss the results.
A. Scenario and Dataset
In our simulations, we consider the outdoor scenario 'O1 60' offered by the DeepMIMO dataset [13], which is generated based on the accurate 3D ray-tracing simulator Wireless InSite [14]. This scenario comprises two streets and one intersection with three uniform x-y user grids, as shown in Fig. 2. To generate the channels from the users to the base station, we adopt the following DeepMIMO parameters: (1) Scenario name: O1 60; (2) Active BSs: 3; (3) Active users: Row 1200 to 1200; (4) Number of BS antennas in (x, y, z): (1, 32, 1); (5) System bandwidth: 1 GHz; (6) Number of OFDM sub-carriers: 1 (single-carrier); (7) Number of multipaths: 5. From the generated dataset, we further select the user at row 1200 and column 181 of the scenario. The locations of both the selected user and the base station are marked in Fig. 2.
B. Performance Evaluation
Fig. 3. The beam pattern learning results for a single user with LOS connection to the base station. The base station employs a perfect uniform linear array with 32 antennas and 3-bit phase shifters. The figure shows the learning process and the beam patterns learned at three different stages during the iterations. The learned beam patterns are plotted with a solid red line, and the equal gain combining/beamforming vector with a dashed black line.

We first evaluate our proposed DRL-based beam pattern learning solution on learning a single beam that serves a single user with LOS connection to the base station. In Fig. 3, we compare the performance of the learned single beam with a 32-beam classical beamsteering codebook. As is commonly known, a classical beamsteering codebook normally performs very well in LOS scenarios. However, our proposed method achieves a higher beamforming gain than the best beam in the classical beamsteering codebook, and it does so within a negligible number of iterations. More interestingly, after a modest number of iterations, the proposed solution approaches the EGC upper bound. It is worth mentioning that the EGC upper bound can only be reached when the user's channel is known and unquantized phase shifters are deployed. By contrast, our proposed solution ultimately comes close to the EGC upper bound with 3-bit phase shifters and without any channel information.

We also plot the learned beam patterns at three different stages (iterations 1000, 5000, and 100000) of the learning process, which helps illustrate how the beam pattern evolves over time. As shown in Fig. 3, at iteration 1000, the learned beam pattern has very strong side lobes, weakening the main lobe gain to a great extent. At iteration 5000, the gain of the main lobe becomes stronger; however, there are still multiple side lobes with relatively high gains. Finally, at iteration 100000, the main lobe has quite a strong gain compared with the side lobes, with at least 10 dB gain over the second strongest side lobe, and most of the side lobes are suppressed well below the main lobe.
Besides, the learned beam pattern captures the EGC beam pattern very well, which explains the good performance it achieves. The slight mismatch is mainly caused by the use of quantized phase shifters: with 3-bit resolution, each phase shifter can only realize 8 different phase shift values drawn uniformly from $(-\pi, \pi]$.

The proposed beam pattern learning solution is also evaluated on a system with hardware impairments (for the same user considered above). This is a more realistic and interesting scenario, since mmWave systems are susceptible to hardware mismatches such as antenna spacing mismatch and phase mismatch. The wavelength in mmWave bands is so small that even a slight mismatch can lead to a drastic degradation of performance. This calls for an intelligent design process that is capable of adapting the beam pattern to the hardware, mitigating the loss caused by hardware mismatches. The simulation results confirm that our proposed solution is able to learn such an optimized beam pattern for a system with hardware impairments.

Fig. 4(a) shows the beam patterns of both the equal gain combining/beamforming vector (plotted in black) and the learned beam (plotted in red). At first glance, the learned beam appears distorted and has multiple low-gain lobes. However, the performance of this beam is excellent. This can be explained by comparing the beam patterns of the learned beam and the equal gain combining/beamforming vector: our proposed solution intelligently approximates the optimal beam, with all the dominant lobes well captured.
By contrast, the classical beamsteering codebook fails when the hardware is not perfect, as depicted in Fig. 4(b). This is because the distorted array pattern incurred by the hardware impairments leaves the pointed beams of the classical beamsteering codebook able to capture only a small portion of the transmitted power, which results in an inferior beamforming/combining gain. The learned beam shown in Fig. 4(a) approaches the EGC upper bound within a moderate number of iterations, as shown in Fig. 4(b). This is especially interesting given that the proposed solution does not rely on any channel state information: channel estimation in this case would first require a full calibration of the hardware, which is a hard and expensive process.

Fig. 4. The beam pattern learned for a single user with LOS connection to the base station. The base station employs a uniform linear array with 32 antennas and 3-bit phase shifters, where hardware impairments exist; the antenna positions and phase mismatches are random with standard deviations $\sigma_d$ and $\sigma_p$ as described in Section II-C. (a) shows the beam patterns of the equal gain combining/beamforming vector (black) and the learned beam (red); a $\sqrt{\cdot}$ transformation is used to better show the finer structure of the beams. (b) shows the learning process.

VI. CONCLUSIONS AND DISCUSSIONS
In this paper, we developed a DRL-based approach to learn an optimized beam pattern for a single user or a group of users with similar channels, relying only on receive power measurements and without any channel knowledge. This approach relaxes the coherence/synchronization requirements and is important for the fully-analog or hybrid analog/digital architectures that are commonly adopted in mmWave/terahertz communication systems. The proposed learning framework respects key hardware constraints such as the phase-only, constant-modulus, and quantized-angle constraints. Simulation results show that the proposed solution is capable of finding a near-optimal beam pattern that achieves a beamforming/combining gain comparable to that of equal gain combining, without any explicit channel knowledge.

REFERENCES
[1] A. Alkhateeb, J. Mo, N. Gonzalez-Prelcic, and R. W. Heath, "MIMO precoding and combining solutions for millimeter-wave systems," IEEE Communications Magazine, vol. 52, no. 12, pp. 122–131, 2014.
[2] A. Alkhateeb, O. El Ayach, G. Leus, and R. Heath, "Channel estimation and hybrid precoding for millimeter wave cellular systems," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 5, pp. 831–846, Oct. 2014.
[3] T. K. Y. Lo, "Maximum ratio transmission," IEEE Transactions on Communications, vol. 47, no. 10, pp. 1458–1461, 1999.
[4] D. Love and R. Heath Jr., "Equal gain transmission in multiple-input multiple-output wireless systems," IEEE Transactions on Communications, vol. 51, no. 7, pp. 1102–1110, 2003.
[5] X. Li, Y. Zhu, and P. Xia, "Enhanced analog beamforming for single carrier millimeter wave MIMO systems," IEEE Transactions on Wireless Communications, vol. 16, no. 7, pp. 4261–4274, 2017.
[6] O. El Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. Heath, "Spatially sparse precoding in millimeter wave MIMO systems," IEEE Transactions on Wireless Communications, vol. 13, no. 3, pp. 1499–1513, Mar. 2014.
[7] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," 2015.
[8] S. Hur, T. Kim, D. Love, J. Krogmeier, T. Thomas, and A. Ghosh, "Millimeter wave beamforming for wireless backhaul and access in small cell networks," IEEE Transactions on Communications, vol. 61, no. 10, pp. 4391–4403, Oct. 2013.
[9] J. Wang, Z. Lan, C. Pyo, T. Baykas, C. Sum, M. Rahman, J. Gao, R. Funada, F. Kojima, H. Harada et al., "Beam codebook based beamforming protocol for multi-Gbps millimeter-wave WPAN systems," IEEE Journal on Selected Areas in Communications, vol. 27, no. 8, pp. 1390–1399, Nov. 2009.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015.
[12] G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the Brownian motion," Phys. Rev., vol. 36, pp. 823–841, Sep. 1930. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRev.36.823
[13] A. Alkhateeb, "DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications," in