Deep Learning based Antenna Selection and CSI Extrapolation in Massive MIMO Systems
Bo Lin, Student Member, IEEE, Feifei Gao, Fellow, IEEE, Shun Zhang, Senior Member, IEEE, Ting Zhou, and Ahmed Alkhateeb
Abstract
A critical bottleneck of massive multiple-input multiple-output (MIMO) systems is the huge training overhead incurred by downlink transmission tasks such as channel estimation, downlink beamforming, and covariance observation. In this paper, we propose to use the channel state information (CSI) of a small number of antennas to extrapolate the CSI of the other antennas and thereby reduce the training overhead. Specifically, we design a deep neural network, termed the antenna domain extrapolation network (ADEN), that can exploit the correlation among antennas. We then propose a deep learning (DL) based antenna selection network (ASN) that selects a limited number of antennas to optimize the extrapolation, a task that is conventionally a combinatorial optimization problem and is difficult to solve. We carefully design a constrained degradation algorithm that generates a differentiable approximation of the discrete antenna selection vector, such that the back-propagation of the neural network is guaranteed. Numerical results show that the proposed ADEN outperforms the traditional fully connected network, and that the antenna selection scheme learned by the ASN is much better than the trivially used uniform selection.
Index Terms
Channel extrapolation, deep learning, antenna selection, channel covariance matrix, beam prediction
B. Lin and F. Gao are with the Department of Automation, Tsinghua University, State Key Lab of Intelligent Technologies and Systems, Tsinghua University, State Key for Information Science and Technology (TNList), Beijing 100084, P. R. China (e-mail: [email protected]; [email protected]). S. Zhang is with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, P. R. China (e-mail: [email protected]). T. Zhou is with the Shanghai Frontier Innovation Research Institute, Chinese Academy of Sciences, Shanghai 201210, P. R. China (e-mail: [email protected]). A. Alkhateeb is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: [email protected]).

I. INTRODUCTION
Massive MIMO has attracted tremendous attention in the area of wireless communications. In such systems, the base station (BS) is equipped with a large number of antennas and can simultaneously serve multiple users. It is widely acknowledged that massive MIMO could significantly boost the system capacity and transmission rate, making it a promising technique for both 5G and future wireless communications [1]. Nevertheless, accurate downlink channel state information (CSI) is the prerequisite for achieving the full potential of massive MIMO, and the pilot length is proportional to the number of transmit antennas. Hence, the training overhead for downlink transmission becomes extremely large.

Most existing works assumed sparsity when performing channel estimation, since the BS is usually deployed at a high position and the massive MIMO system mostly works in the millimeter wave (mmWave) frequency band [2]. Accordingly, many channel estimation algorithms, such as compressive sensing (CS) methods [3]-[5] and angle domain MIMO channel reconstruction [6]-[8], have been explored. However, these approaches mainly rely on simple mathematical models, which may not be accurate in complicated channel environments.

Recently, deep learning (DL), a new artificial intelligence (AI) method, has demonstrated its powerful advantages in many research areas, such as image processing, speech processing, and natural language processing. The application of DL in physical layer communications is also sweeping [9], and many efforts have been made in channel estimation [10], [11], signal detection [11], [12], and beam prediction [13], [14], etc. In terms of saving the training overhead, a number of DL based channel prediction methods have been proposed [15], [16] and achieved better results than the traditional methods. In [17], Dong et al.
designed a machine learning method to predict the channels of a part of the antennas from those of the other antennas, where channel prediction is modeled as a linear function and is solved by linear regression (LR) and support vector regression (SVR). In [18], Alrabeiah et al. raised the concept of channel extrapolation in space and frequency. An important observation made in [18] is that there exists an implicit mapping function between the channels of two antenna sets with different frequencies and positions as long as the position-to-channel mapping is bijective. Subsequently, Taha et al. proposed a DL
based method to find the optimal reconfigurable intelligent surface (RIS) reflection matrices that approaches the maximum achievable rate with only a few active antenna elements [19]. Moreover, Yang et al. predicted the downlink channel from the uplink channel for FDD massive MIMO systems with acceptable accuracy [20].

Nevertheless, [18]-[21] simply applied fully-connected deep neural networks (DNN), while few specific structures have been designed to fit the antenna domain channel extrapolation. Moreover, it is readily known that different antenna selection schemes will achieve different extrapolation accuracy, while the existing works [17]-[19] simply adopted uniformly selected antennas for channel extrapolation. Although uniform selection is effective in many traditional works, it cannot guarantee optimality due to the various electromagnetic field characteristics of the environment. It is also noted that many traditional antenna selection algorithms [22]-[24] refer to selecting the optimal antennas for data transmission after the channels are known, which is obviously not applicable to channel extrapolation.

In this paper, we propose an antenna domain extrapolation network (ADEN) to perform channel extrapolation and an antenna selection network (ASN) to choose the optimal antennas for the extrapolation. Specifically, the antenna domain extrapolation is divided into two steps: coarse CSI extrapolation and fine CSI extrapolation. The coarse CSI extrapolation is realized by a fully connected neural network, and the fine CSI extrapolation is modeled as an ordinary differential equation (ODE) initial value problem, where the coarse extrapolated CSI is the initial value and the fine extrapolated CSI is the final value. Moreover, a key challenge of the ASN is that the operation of selecting antennas is non-differentiable and cannot guarantee back-propagation.
We then design a constrained degradation algorithm (CDA) that formulates a differentiable approximation of the antenna selection vector. To enhance the overall performance, we next propose to train the ASN and ADEN jointly by penalizing both the extrapolation error and the antenna selection vector. Based on the ASN and ADEN, we present three typical applications: (i) extrapolating the CSI of a part of the antennas to all antennas; (ii) extrapolating the channel covariance matrix (CCM) of a part of the antennas to all antennas; (iii) using the CSI of a part of the antennas to directly predict the downlink beamforming coefficients of all antennas. The simulation results
show that the proposed ADEN is better than the existing fully connected DNN, and that the proposed ASN is much better than the trivially adopted uniform selection.

The remainder of this paper is organized as follows. Section II introduces the system and channel model. Section III designs the ASN and the ADEN. Section IV presents three typical applications of the CSI extrapolation. Section V provides the simulation results, and Section VI draws the conclusion.
Notation: Bold uppercase $\mathbf{A}$ is a matrix, bold lowercase $\mathbf{a}$ is a column vector, non-bold letters $a$ and $A$ are scalars, and calligraphic letter $\mathcal{A}$ is a set; $|a|$ is the magnitude of a scalar, $\|\mathbf{a}\|_p$ is the $p$-norm of a vector, $\|\mathbf{A}\|_F$ is the Frobenius norm of a matrix, and $|\mathcal{A}|$ is the cardinality of a set; $\mathbf{A}^T$, $\mathbf{A}^*$, and $\mathbf{A}^H$ are the transpose, conjugate, and Hermitian (conjugate transpose) of $\mathbf{A}$; $\odot$ and $\otimes$ represent the Hadamard product and Kronecker product operators, respectively; $\Re(s)$ and $\Im(s)$ are the real and imaginary components of $s$; $\mathbb{E}[\cdot]$ is the expectation.

II. SYSTEM AND CHANNEL MODEL
A. System Model
Let us consider a system where a BS is communicating with a mobile user. The BS has $N_t \gg 1$ antennas and the user has only one antenna. Denote $\mathbf{h} \in \mathbb{C}^{N_t}$ as the downlink channel vector from the BS to the user, $\mathbf{P} \in \mathbb{C}^{N \times N_t}$ as the downlink pilot matrix with $N$ being the length of the pilots, and $\mathbf{v} \in \mathbb{C}^{N}$ as the noise vector with power $\sigma^2$. The received signal at the user is
$$\mathbf{y} = \mathbf{P}\mathbf{h} + \mathbf{v}. \quad (1)$$
There are many traditional methods to perform channel estimation, such as least squares (LS) channel estimation and linear minimum mean square error (LMMSE) channel estimation. The LS channel estimate can be formulated as
$$\mathbf{h}_{\mathrm{LS}} = \mathbf{P}^{\dagger}\mathbf{y}, \quad (2)$$
where $\mathbf{P}^{\dagger} = \mathbf{P}^H(\mathbf{P}\mathbf{P}^H)^{-1}$ is the pseudo-inverse of the matrix $\mathbf{P}$. When the signal-to-noise ratio (SNR) is not high enough, LS estimation will bring a large estimation error. In this case,
LMMSE channel estimation could be adopted to obtain higher estimation accuracy:
$$\mathbf{h}_{\mathrm{MMSE}} = \left(\mathbf{y}^T\left(\mathbf{P}\mathbf{R}_h\mathbf{P}^H + \sigma^2\mathbf{I}\right)^{-1}\mathbf{P}\mathbf{R}_h\right)^T, \quad (3)$$
where $\mathbf{I} \in \mathbb{C}^{N \times N}$ is the identity matrix. Nevertheless, to perform LMMSE channel estimation, the statistical CSI, i.e., the CCM $\mathbf{R}_h = \mathbb{E}\left[\mathbf{h}\mathbf{h}^H\right] \in \mathbb{C}^{N_t \times N_t}$, is needed. From (2) and (3), we see that the training consumption is extremely high for a massive number of antennas. A natural question then arises: can we use the channels of $M_t$ ($M_t \ll N_t$) antennas to obtain the information of all $N_t$ antennas?

Define $\mathcal{A}$ as the complete set of all antennas and $\mathcal{B}$ as a subset of $\mathcal{A}$ with size $M_t$. Moreover, denote $\mathbf{h}_{\mathcal{A}}$ (same as $\mathbf{h}$ in (1)) as the vector that contains the channels of antenna set $\mathcal{A}$ and $\mathbf{h}_{\mathcal{B}}$ as the subvector of $\mathbf{h}_{\mathcal{A}}$ that contains the channels of antenna set $\mathcal{B}$. It has been proved in [18] that if the position-to-channel mapping is bijective, then the channel-to-channel mapping exists. For a given static communication environment, including the geometry, materials, antenna positions, etc., the location of the user and the channel usually correspond strictly, i.e., the position-to-channel mapping function is usually bijective [18]. Hence the following channel mapping exists:
$$\Phi_h : \{\mathbf{h}_{\mathcal{B}}\} \rightarrow \{\mathbf{h}_{\mathcal{A}}\}. \quad (4)$$

Extrapolation Based Beam Prediction: In massive MIMO systems, downlink beamforming is necessary for spatial multiplexing. The optimal beam for channel $\mathbf{h}_{\mathcal{A}}$ is chosen from the beamforming codebook $\mathcal{F} = \{\mathbf{f}_1, \mathbf{f}_2, \cdots, \mathbf{f}_{|\mathcal{F}|}\}$ as the one that maximizes the system rate:
$$\mathbf{f}_{\mathcal{A}} = \arg\max_{\mathbf{f} \in \mathcal{F}} \log_2\left(1 + \mathrm{SNR}\left|\mathbf{h}_{\mathcal{A}}^T\mathbf{f}\right|^2\right). \quad (5)$$
The number of beamforming vectors in the codebook is proportional to the number of antennas at the BS. Hence, the time and computation cost of selecting the optimal beam is also large in massive MIMO systems.
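To make the estimators and the codebook search concrete, the following minimal NumPy sketch implements (2), (3) (in its standard, untransposed form), and the beam search of (5). The toy sizes, the exponential-correlation CCM, and the DFT codebook are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
Nt, N, sigma2 = 8, 8, 0.1          # small toy sizes; the paper uses Nt = 64

# Toy correlated channel h ~ CN(0, R_h), using an assumed exponential-correlation CCM.
rho = 0.9
R_h = rho ** np.abs(np.subtract.outer(np.arange(Nt), np.arange(Nt)))
L_chol = np.linalg.cholesky(R_h)
h = L_chol @ (rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)) / np.sqrt(2)

# Pilot matrix P (N x Nt) and received signal y = P h + v, as in (1).
P = (rng.standard_normal((N, Nt)) + 1j * rng.standard_normal((N, Nt))) / np.sqrt(2)
v = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = P @ h + v

# LS estimate (2): h_LS = pinv(P) @ y.
h_ls = np.linalg.pinv(P) @ y

# LMMSE estimate (3), standard form: h_MMSE = R_h P^H (P R_h P^H + sigma^2 I)^{-1} y.
h_mmse = R_h @ P.conj().T @ np.linalg.solve(P @ R_h @ P.conj().T + sigma2 * np.eye(N), y)

# Beam selection (5): pick the codeword maximizing |h^T f|^2 (DFT codebook assumed).
F = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)     # columns are candidate beams
best_beam = int(np.argmax(np.abs(h @ F) ** 2))
```

Since the rate in (5) is monotone in $|\mathbf{h}^T\mathbf{f}|^2$, the sketch maximizes that quantity directly.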
To save time and computation resources, we propose beam extrapolation, which utilizes the channels of a part of the antennas to predict the beam index of the whole antenna set. In fact, from (4) and (5), we know that the channel-to-beam mapping exists and can be denoted as
$$\Phi_{\mathrm{beam}} : \{\mathbf{h}_{\mathcal{B}}\} \rightarrow \{\mathbf{f}_{\mathcal{A}}\}. \quad (6)$$

Extrapolation Based Covariance Prediction: In addition to the channel and the beam, the CCM is also an important parameter for transceiver design. For example, the CCM is used both to design optimal pilots and to compute the LMMSE channel estimate. The CCM can also be used to find the subspace of the beamforming vector in a coarse and blind way. However, the cost of obtaining the CCM is huge. We therefore propose to utilize the CCM of a part of the antennas to extrapolate the CCM of the whole antenna set. Before proving the existence of the CCM-to-CCM mapping, we adopt the following assumption:

Assumption 1: The mapping $g_{\mathcal{B}} : \{\mathcal{C}\} \rightarrow \{\mathbf{R}_{\mathcal{B}}\}$ is bijective, where $\mathcal{C}$ denotes the location of an area, and $\mathbf{R}_{\mathcal{B}}$ denotes the CCM of the area for the antenna set $\mathcal{B}$.

Assumption 1 means that each area in the candidate set $\{\mathcal{C}\}$ has a unique CCM. Note that the bijectiveness of the mapping $g_{\mathcal{B}}$ depends on several facts: (i) the signal attenuation from the BS to different areas is different; (ii) the geometry and materials of different areas are different; (iii) the scattering paths in different areas are different. Now, the inverse mapping of $g_{\mathcal{B}}$ can be described as
$$g_{\mathcal{B}}^{-1} : \{\mathbf{R}_{\mathcal{B}}\} \rightarrow \{\mathcal{C}\}. \quad (7)$$
For antenna set $\mathcal{A}$, there also exists a mapping $g_{\mathcal{A}} : \{\mathcal{C}\} \rightarrow \{\mathbf{R}_{\mathcal{A}}\}$. Hence, the CCM-to-CCM mapping exists, i.e.,
$$\Phi_R = g_{\mathcal{A}} \circ g_{\mathcal{B}}^{-1} : \{\mathbf{R}_{\mathcal{B}}\} \rightarrow \{\mathbf{R}_{\mathcal{A}}\}.$$
(8)

The previously described channel-to-channel mapping (4), channel-to-beam mapping (6), and CCM-to-CCM mapping (8) can be summarized in a unified extrapolation function
$$\Phi : \{\mathbf{u}_{\mathcal{B}}\} \rightarrow \{\mathbf{u}_{\mathcal{A}}\}, \quad (9)$$
where $\mathbf{u}_{\mathcal{B}}$ denotes the information of antenna set $\mathcal{B}$ and $\mathbf{u}_{\mathcal{A}}$ denotes the information of antenna set $\mathcal{A}$.

Fig. 1: Block diagram of the described antenna selection and extrapolation model.

As the exact mathematical function of the extrapolation is hard to obtain, we adopt deep neural networks (DNN) to fit this function with the aid of training data. Then, the extrapolation function can be described as
$$\{\mathbf{u}_{\mathcal{A}}\} = f(\{\mathbf{u}_{\mathcal{B}}\}, \Theta_e), \quad (10)$$
where $\Theta_e$ denotes the parameters of the DNN.

B. Channel Model

We adopt a 3-D geometry-based channel model [25] where the signal emitted by the transmitter reaches the receiver via multiple paths through reflection, diffraction, and refraction [26]. Denote $\alpha_l$ as the attenuation coefficient of the $l$-th path, $\phi_l^{a,D}$ as the azimuth angle of departure (AoD) of the $l$-th path, $\phi_l^{e,D}$ as the elevation AoD of the $l$-th path, $\phi_l^{a,A}$ as the azimuth angle of arrival (AoA) of the $l$-th path, $\phi_l^{e,A}$ as the elevation AoA of the $l$-th path, $\vartheta_l$ as the phase of path $l$, and $\tau_l$ as the propagation delay of the $l$-th path. The channel vector $\mathbf{h}$ is given by [27]
$$\mathbf{h} = \sum_{l=1}^{L} \alpha_l e^{j(\vartheta_l + 2\pi\tau_l B)}\, \mathbf{a}(\phi_l^{a,A}, \phi_l^{e,A})\, \mathbf{a}^*(\phi_l^{a,D}, \phi_l^{e,D}), \quad (11)$$

Footnote: The mapping can be treated as interpolation in the antenna domain. Interpolation is usually based on an explicit model; for example, interpolation over orthogonal frequency division multiplexing (OFDM) subcarriers is realized using the discrete Fourier transform (DFT). However, in the antenna domain, there is no explicit model describing the mapping (9). Hence, a linear interpolation result would be very poor.

Fig.
2: The overall structure of ASN and ADEN.

where $B$ is the signal bandwidth, and $\mathbf{a}(\phi_l^{a,A}, \phi_l^{e,A})$ and $\mathbf{a}(\phi_l^{a,D}, \phi_l^{e,D})$ are the steering vectors at the arrival and departure sides. The mathematical expression of $\mathbf{a}(\cdot)$ is
$$\mathbf{a}(\phi_l^{a,A}, \phi_l^{e,A}) = \mathbf{a}_z(\phi_l^{e,A}) \otimes \mathbf{a}_y(\phi_l^{a,A}, \phi_l^{e,A}) \otimes \mathbf{a}_x(\phi_l^{a,A}, \phi_l^{e,A}), \quad (12)$$
where $\mathbf{a}_x(\cdot)$, $\mathbf{a}_y(\cdot)$, $\mathbf{a}_z(\cdot)$ are the BS array response vectors in the $x$, $y$, and $z$ directions (the operation is the same for the AoD). Moreover, $\mathbf{a}_x(\cdot)$, $\mathbf{a}_y(\cdot)$, $\mathbf{a}_z(\cdot)$ are respectively defined as
$$\mathbf{a}_x(\phi_l^{a,A}, \phi_l^{e,A}) = \left[1,\, e^{j\frac{2\pi d_x}{\lambda}\sin(\phi_l^{e,A})\cos(\phi_l^{a,A})},\, \cdots,\, e^{j\frac{2\pi d_x}{\lambda}(N_x-1)\sin(\phi_l^{e,A})\cos(\phi_l^{a,A})}\right],$$
$$\mathbf{a}_y(\phi_l^{a,A}, \phi_l^{e,A}) = \left[1,\, e^{j\frac{2\pi d_y}{\lambda}\sin(\phi_l^{e,A})\sin(\phi_l^{a,A})},\, \cdots,\, e^{j\frac{2\pi d_y}{\lambda}(N_y-1)\sin(\phi_l^{e,A})\sin(\phi_l^{a,A})}\right],$$
$$\mathbf{a}_z(\phi_l^{e,A}) = \left[1,\, e^{j\frac{2\pi d_z}{\lambda}\cos(\phi_l^{e,A})},\, \cdots,\, e^{j\frac{2\pi d_z}{\lambda}(N_z-1)\cos(\phi_l^{e,A})}\right], \quad (13)$$
where $\lambda$ is the carrier wavelength, while $d_x$, $d_y$, $d_z$ are the antenna spacings in the $x$-, $y$-, and $z$-directions.

III. DEEP LEARNING BASED ANTENNA SELECTION AND ANTENNA DOMAIN EXTRAPOLATION

We here propose a DL based joint design that contains two subnetworks, the antenna selection network (ASN) and the antenna domain extrapolation network (ADEN), as shown in Fig. 2.

Fig. 3: The three parts of ASN.

Define the output of the ASN as the antenna selection vector $\mathbf{s} = \{s_1, s_2, \cdots, s_{N_t}\} \in \{0,1\}^{N_t}$, which is an $M_t$-hot vector with $M_t$ elements being '1' and the other elements being '0'. Specifically, we set $s_i = 1$ if the $i$-th antenna is selected, and $s_i = 0$ otherwise. The input of the ADEN is $\mathbf{u}_{\mathcal{B}} = \mathbf{s} \odot \mathbf{u}_{\mathcal{A}}$, and the output of the ADEN is the extrapolated information $\hat{\mathbf{u}}_{\mathcal{A}}$. The ASN is trained to find the antenna selection vector that minimizes the extrapolation error of the ADEN.
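As a concrete instance of the array responses in (12)-(13), the sketch below builds the UPA steering vector via Kronecker products in NumPy. The $2\pi d/\lambda$ phase convention and the example angles are assumptions made for illustration.

```python
import numpy as np

def steering_1d(n_ant, spacing_wl, direction_term):
    """Per-axis array response [1, e^{j 2 pi (d/lambda) g}, ..., e^{j 2 pi (d/lambda)(N-1) g}],
    where g is the direction-dependent term, e.g. sin(el)*cos(az) for the x-axis."""
    return np.exp(1j * 2 * np.pi * spacing_wl * np.arange(n_ant) * direction_term)

def steering_upa(az, el, n=(8, 8, 1), d=(0.5, 0.5, 0.5)):
    """Full UPA steering vector a = a_z (x) a_y (x) a_x, as in (12)-(13)."""
    a_x = steering_1d(n[0], d[0], np.sin(el) * np.cos(az))
    a_y = steering_1d(n[1], d[1], np.sin(el) * np.sin(az))
    a_z = steering_1d(n[2], d[2], np.cos(el))
    return np.kron(a_z, np.kron(a_y, a_x))

a = steering_upa(az=0.3, el=1.2)   # an 8 x 8 x 1 UPA gives a 64-element response
```

The 8 x 8 x 1 geometry matches the simulation setup of Table I; other sizes follow by changing `n` and `d`.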
We also propose to connect the ASN and ADEN through a product operation, and then jointly train them via back-propagation at the same time.

A. Antenna Selection Network

The ASN is composed of three parts, as shown in Fig. 3. The first part is a layer of randomly initialized parameters $\theta$. The second part contains several layers of fully connected neurons that generate a probability vector $\mathbf{p} = \{p_1, p_2, \cdots, p_{N_t}\}$, where $p_i$ represents the probability of the $i$-th antenna being selected. The overall vector $\mathbf{p}$ satisfies the condition $\sum_{i=1}^{N_t} p_i = 1$. Denote the output of the layer before the probability layer (also the input of the probability vector) as $\theta_p = \{\theta_1, \theta_2, \cdots, \theta_{N_t}\}$. Then $p_i$ is generated by
$$p_i = \mathrm{softmax}(\theta_p)_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{N_t}\exp(\theta_j)}. \quad (14)$$
The third part is the antenna selection vector $\mathbf{s}$ that is generated based on $\mathbf{p}$. Specifically, define the indices of the biggest $M_t$ elements as
$$\mathcal{I}_s = \arg\mathrm{top}_{M_t}\{\mathbf{p}\}, \quad (15)$$
where $\arg\mathrm{top}_{M_t}\{\mathbf{x}\}$ is a function that finds the biggest $M_t$ elements in vector $\mathbf{x}$, and $\mathcal{I}_s$ is an index set. The elements of $\mathbf{s}$ with indices in $\mathcal{I}_s$ are 1, and the others are 0.

We adopt the back-propagation algorithm to train the ASN, which requires all operations in the neural network to be differentiable. However, when generating $\mathbf{s}$, the function $\arg\mathrm{top}_{M_t}\{\cdot\}$ is not differentiable, which is the key obstacle to performing antenna selection via DL. To solve this problem, let us first provide the following lemma:

Lemma 1: For two positive integers $M_t$ and $N_t$ with $M_t < N_t$, ...

Fig. 4: Constrained degradation algorithm: forward-propagation link and back-propagation link of ASN.

The following penalty is adopted to enforce the constraints (16):
$$\mathcal{L}_{\mathrm{ASN}} = \alpha_1\left(\|\widetilde{\mathbf{p}}\|_1 - M_t\right)^2 + \alpha_2\left(\|\widetilde{\mathbf{p}}\|_2^2 - M_t\right)^2, \quad (18)$$
where $\widetilde{\mathbf{p}} = M_t \cdot \mathbf{p}$ is the scaled, differentiable version of $\mathbf{p}$ (cf. Algorithm 1), and $\alpha_1 > 0$ and $\alpha_2 > 0$ are the tuning parameters of the two penalty terms. During the training process, the penalty $\mathcal{L}_{\mathrm{ASN}}$ keeps decreasing and will approach zero.
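The selection steps (14)-(15) and the differentiable surrogate of the CDA can be sketched as follows. The squared-penalty form and the choice of the 1-norm and squared 2-norm in (18) are assumptions, since the norm subscripts are not legible in the source; the surrogate $\widetilde{\mathbf{p}} = M_t \cdot \mathbf{p}$ itself follows Algorithm 1.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())        # numerically stable softmax, as in (14)
    return e / e.sum()

def asn_forward(theta, M_t):
    """Forward pass of the selection part of the ASN, sketching (14)-(15)."""
    p = softmax(theta)                     # probability vector, sums to 1
    idx = np.argsort(p)[-M_t:]             # indices of the M_t largest probabilities
    s = np.zeros_like(p)
    s[idx] = 1.0                           # M_t-hot selection vector (non-differentiable)
    p_tilde = M_t * p                      # differentiable surrogate used in backprop
    return s, p_tilde

def asn_penalty(p_tilde, M_t, alpha1=1.0, alpha2=1.0):
    """Penalty pushing p_tilde toward an M_t-hot vector; an assumed form of (18)."""
    return (alpha1 * (np.abs(p_tilde).sum() - M_t) ** 2
            + alpha2 * ((p_tilde ** 2).sum() - M_t) ** 2)

theta = np.random.default_rng(1).standard_normal(64)
s, p_tilde = asn_forward(theta, M_t=8)
```

Note that an exactly $M_t$-hot vector yields a zero penalty, which is what the training drives $\widetilde{\mathbf{p}}$ toward.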
Hence, both $\|\widetilde{\mathbf{p}}\|_1 - M_t$ and $\|\widetilde{\mathbf{p}}\|_2^2 - M_t$ will approach zero, and $\widetilde{\mathbf{p}}$ will tend to satisfy constraints (16), i.e., $\widetilde{\mathbf{p}}$ will tend to $\mathbf{s}$. Interestingly, $\widetilde{\mathbf{p}}$ remains differentiable while it gradually approaches $\mathbf{s}$. Hence, the key idea of the CDA is to utilize the differentiable $\widetilde{\mathbf{p}}$ as an approximation of the non-differentiable $\mathbf{s}$ during back-propagation. The forward-propagation and back-propagation links are summarized in Fig. 4.

B. Antenna Domain Extrapolation Network

Fig. 5: Antenna domain extrapolation network.

The ADEN is composed of two parts, as shown in Fig. 5. The first part is the coarse extrapolation subnetwork that includes several fully connected neural layers. The second part is the fine extrapolation subnetwork that improves the extrapolation accuracy. Denote the output of the coarse extrapolation subnetwork as $\hat{\mathbf{u}}^c_{\mathcal{A}}$ and the output of the fine extrapolation subnetwork as $\hat{\mathbf{u}}^f_{\mathcal{A}}$. We propose to formulate the fine extrapolation (from $\hat{\mathbf{u}}^c_{\mathcal{A}}$ to $\hat{\mathbf{u}}^f_{\mathcal{A}}$) as an optimization problem that satisfies the following ODE:
$$\frac{d\mathbf{u}_f(t)}{dt} = \psi[\mathbf{u}_f(t), t], \quad \mathbf{u}_f(t_0) = \hat{\mathbf{u}}^c_{\mathcal{A}}, \quad (19)$$
where $\mathbf{u}_f(\cdot)$ denotes the fine extrapolation function, and $\psi(\cdot)$ denotes the derivative function of $\mathbf{u}_f(t)$. The initial condition $\mathbf{u}_f(t_0)$ is $\hat{\mathbf{u}}^c_{\mathcal{A}}$, while the final condition $\mathbf{u}_f(t_N)$ is $\hat{\mathbf{u}}^f_{\mathcal{A}}$. The derivation process can be found in Appendix C. Traditional methods to solve ODEs are the Runge-Kutta [28] and multi-step methods [29].
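For reference, one classical Runge-Kutta (RK4) step for an initial value problem of the form (19) looks as follows. The test function $\psi(u,t) = -u$ is a stand-in chosen because its exact solution is known; the true $\psi$ of the extrapolation problem is unknown, which is exactly why the paper learns it instead.

```python
import numpy as np

def rk4_step(psi, u, t, dt):
    """One classical fourth-order Runge-Kutta step for du/dt = psi(u, t)."""
    k1 = psi(u, t)
    k2 = psi(u + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = psi(u + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = psi(u + dt * k3, t + dt)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: psi(u, t) = -u has the exact solution u(t) = u0 * exp(-t).
u, t, dt = 1.0, 0.0, 0.01
for _ in range(100):
    u = rk4_step(lambda u, t: -u, u, t, dt)
    t += dt
# after integrating over [0, 1], u is close to exp(-1)
```

The nested intermediate evaluations $k_1, \ldots, k_4$ are what the ADEN's sub-networks $K_i$ mimic in (21) below.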
With precise knowledge of $\psi(\cdot)$, we would have
$$\mathbf{u}_f(t_N) = \mathbf{u}_f(t_0) + \int_{t_0}^{t_N} \frac{d\mathbf{u}_f(t)}{dt}\,dt = \mathbf{u}_f(t_0) + \int_{t_0}^{t_N} \psi[\mathbf{u}_f(t), t]\,dt. \quad (20)$$
However, since $\psi(\cdot)$ is not available in the considered extrapolation, we cannot use (20) to solve (19). We therefore design the ADEN by combining a deep neural network with the structure of the Runge-Kutta solution (in [28]), whose structure is shown in Fig. 5 and whose mathematical expression is formulated as
$$\mathbf{K}_1 = f_1(\hat{\mathbf{u}}^c_{\mathcal{A}}), \quad \mathbf{K}_2 = f_{K_2}(\mathbf{K}_1), \quad \mathbf{K}_3 = f_{K_3}(\mathbf{K}_1 + a_2\mathbf{K}_2),$$
$$\mathbf{K}_4 = f_{K_4}(\mathbf{K}_1 + a_3\mathbf{K}_3), \quad \mathbf{K}_5 = f_{K_5}(\mathbf{K}_1 + a_4\mathbf{K}_4),$$
$$\hat{\mathbf{u}}^f_{\mathcal{A}} = f_6(\mathbf{K}_1 + b_1\mathbf{K}_2 + b_2\mathbf{K}_3 + b_3\mathbf{K}_4 + b_4\mathbf{K}_5), \quad (21)$$
where $f_{K_i}(\cdot)$ is the nonlinear mapping (or function) of sub-network $K_i$, and $a_i$, $b_i$ are parameters that will be trained.

The penalty of the ADEN is set as
$$\mathcal{L}_{\mathrm{ADEN}} = \beta_1\left\|\mathbf{u}_{\mathcal{A}} - \hat{\mathbf{u}}^c_{\mathcal{A}}\right\|_2^2 + \beta_2\left\|\mathbf{u}_{\mathcal{A}} - \hat{\mathbf{u}}^f_{\mathcal{A}}\right\|_2^2, \quad (22)$$
where $\beta_1 > 0$ and $\beta_2 > 0$ are the tuning parameters of the two penalty terms.

C. Joint Training of ASN and ADEN

We adopt a combined loss function to jointly train the ASN and ADEN:
$$\mathcal{L} = \mathcal{L}_{\mathrm{ASN}} + \rho\,\mathcal{L}_{\mathrm{ADEN}}, \quad (23)$$
where $\rho$ is the weight that balances the penalties of the ASN and ADEN.

The detailed steps of the joint training algorithm are described in Algorithm 1. Since the neural network can only process real numbers, we construct the input $\mathbf{u}_{\mathcal{A}}$ (in Fig. 2) as
$$\mathbf{Z}_{\mathrm{in}} \triangleq [\Re(\mathbf{u}_{\mathcal{A}}), \Im(\mathbf{u}_{\mathcal{A}})]. \quad (24)$$
Correspondingly, the antenna selection vector should also be constructed as
$$\bar{\mathbf{s}} \triangleq [\mathbf{s}^T, \mathbf{s}^T]^T. \quad (25)$$
Then the input of the ADEN is $\mathbf{Z}_s \triangleq \bar{\mathbf{s}} \odot \mathbf{Z}_{\mathrm{in}}$. The output of the ADEN represents the real and imaginary parts of the extrapolated $\hat{\mathbf{u}}_{\mathcal{A}}$ (in Fig. 2):
$$\mathbf{Z}_{\mathrm{out}} \triangleq [\Re(\hat{\mathbf{u}}_{\mathcal{A}}), \Im(\hat{\mathbf{u}}_{\mathcal{A}})]. \quad (26)$$
After the joint training of the ASN and ADEN, we obtain an antenna selection vector $\mathbf{s}$. During the online evaluation, since the antenna selection vector $\mathbf{s}$ has been obtained, we can delete the ASN and use $\mathbf{s} \odot \mathbf{u}_{\mathcal{A}}$ for antenna domain extrapolation.
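A minimal NumPy sketch of the composition in (21) is given below. The sub-network sizes, the random weights, and the fixed $a_i$, $b_i$ values are hypothetical stand-ins for the trained parameters; the point is only the Runge-Kutta-style wiring of the sub-networks.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16                                  # toy feature dimension

def make_subnet(din, dout):
    """A tiny fully connected layer with ReLU, standing in for one f_{K_i}."""
    W = rng.standard_normal((dout, din)) * 0.1
    b = np.zeros(dout)
    return lambda x: np.maximum(W @ x + b, 0.0)

f1, fK2, fK3, fK4, fK5, f6 = (make_subnet(dim, dim) for _ in range(6))
a2, a3, a4 = 0.5, 0.5, 1.0                # hypothetical values; trainable in the paper
b1, b2, b3, b4 = 1/6, 1/3, 1/3, 1/6       # hypothetical Runge-Kutta-style weights

def aden_fine(u_coarse):
    """Fine extrapolation with the Runge-Kutta-like structure of (21)."""
    K1 = f1(u_coarse)
    K2 = fK2(K1)
    K3 = fK3(K1 + a2 * K2)
    K4 = fK4(K1 + a3 * K3)
    K5 = fK5(K1 + a4 * K4)
    return f6(K1 + b1 * K2 + b2 * K3 + b3 * K4 + b4 * K5)

out = aden_fine(rng.standard_normal(dim))
```

In the paper the coefficients $a_i$, $b_i$ are learned jointly with the sub-network weights rather than fixed as here.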
Algorithm 1 Joint Training of ASN and ADEN

Require: training dataset $\mathcal{D}$, number of iterations $n_{\mathrm{iter}}$, hyperparameters $\alpha_1$, $\alpha_2$, $\rho$, initialized trainable parameters $\Theta_s$, $\Theta_e$, and initial coefficients $a_i$, $b_i$ (with $a_3 = 1$).
Ensure: trained ASN parameters $\Theta_s$, antenna selection vector $\mathbf{s}$, and ADEN parameters $\Theta_e$.
for $k = 1$ to $n_{\mathrm{iter}}$ do
- Draw mini-batch $\mathcal{D}_k$: a random subset of $\mathcal{D}$. (ASN Phase)
- Generate the antenna selection vector $\mathbf{s}$ by $\mathbf{s} = M_t\text{-hot}(\arg\mathrm{top}_{M_t}\{p_i \mid i = 1, 2, \cdots, N_t\})$.
- Generate a differentiable approximation of $\mathbf{s}$: $\widetilde{\mathbf{p}} = M_t \cdot \mathbf{p}$.
- Separate the real and imaginary parts of $\mathbf{u}_{\mathcal{A}}$: $\mathbf{Z}_{\mathrm{in}} \triangleq [\Re(\mathbf{u}_{\mathcal{A}}), \Im(\mathbf{u}_{\mathcal{A}})]$.
- Concatenate $\mathbf{s}$ for the real and imaginary parts of $\mathbf{u}_{\mathcal{A}}$ (same operation for $\widetilde{\mathbf{p}}$): $\bar{\mathbf{s}} = [\mathbf{s}^T, \mathbf{s}^T]^T$.
- Perform antenna selection (Hadamard product; same operation for $\widetilde{\mathbf{p}}$): $\mathbf{Z}_s \triangleq \bar{\mathbf{s}} \odot \mathbf{Z}_{\mathrm{in}}$.
- Compute the output of the extrapolation network: $\mathbf{Z}_{\mathrm{out}} \triangleq f(\mathbf{Z}_s, \Theta_e)$. (ADEN Phase)
- Compute the loss function $\mathcal{L}$.
- Replace $\mathbf{s}$ with $\widetilde{\mathbf{p}}$. (Back-propagation Phase)
- Use the Adam optimizer to update $\Theta_s$ and $\Theta_e$.
end for

IV. TYPICAL APPLICATIONS IN TRANSCEIVER DESIGN

The proposed antenna selection can be applied to many communication tasks that need to select antennas for a certain purpose, for example, antenna selection for channel estimation in hybrid massive MIMO systems, the antenna activation strategy for RIS, antenna selection for data transmission, etc. Similarly, the antenna domain extrapolation can be applied to many extrapolation problems in multi-antenna systems. Next, we explain how the proposed model can be applied to the channel, beam, and covariance extrapolation problems.

A. Channel Extrapolation

In the channel extrapolation case, we have $\mathbf{u}_{\mathcal{A}} = \mathbf{h}_{\mathcal{A}}$ and $\mathbf{u}_{\mathcal{B}} = \mathbf{h}_{\mathcal{B}}$ in (9).
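The real/imaginary stacking steps of Algorithm 1 (cf. (24)-(25)) amount to the following. The random mask here is a stand-in for the ASN output.

```python
import numpy as np

rng = np.random.default_rng(3)
Nt, Mt = 64, 8

# Complex CSI of all antennas (stand-in data).
u_A = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)

# (24): stack real and imaginary parts so the network sees only real numbers.
Z_in = np.concatenate([u_A.real, u_A.imag])          # shape (2*Nt,)

# (25): duplicate the selection vector so it masks both halves consistently.
s = np.zeros(Nt)
s[rng.choice(Nt, Mt, replace=False)] = 1.0           # random M_t-hot stand-in
s_bar = np.concatenate([s, s])

# Hadamard product gives the masked network input Z_s.
Z_s = s_bar * Z_in
```

The same mask positions are zeroed in both the real and imaginary halves, so a deselected antenna contributes nothing to the network input.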
The channel extrapolation can be described as
$$\mathbf{h}_{\mathcal{A}} = f_h(\mathbf{h}_{\mathcal{B}}), \quad (27)$$
where $f_h(\cdot)$ is the channel extrapolation function learned by the ADEN. We set the network input as $\mathbf{h}_{\mathrm{in}} = [\Re(\mathbf{h}_{\mathcal{B}}^T), \Im(\mathbf{h}_{\mathcal{B}}^T)]^T$, and the corresponding label is $\mathbf{h}_{\mathrm{lab}} = [\Re(\mathbf{h}_{\mathcal{A}}^T), \Im(\mathbf{h}_{\mathcal{A}}^T)]^T$. The quality of the channel extrapolation result is evaluated by the NMSE indicator
$$\mathrm{NMSE} = \frac{\mathbb{E}\left[\left\|\mathbf{h}_{\mathcal{A}} - \hat{\mathbf{h}}_{\mathcal{A}}\right\|^2\right]}{\mathbb{E}\left[\left\|\mathbf{h}_{\mathcal{A}}\right\|^2\right]}. \quad (28)$$

B. Beam Prediction

Beamforming is used to increase the downlink transmission rate of massive MIMO systems, and the optimal downlink beamforming vector $\mathbf{f}_{\mathcal{A}}$ can be generated from (5). When only a limited number of pilots are available and only the channels $\mathbf{h}_{\mathcal{B}}$ of a small number of antennas can be obtained, we propose to directly predict the beam index $\mathrm{Beam}_{\mathcal{A}}$ from $\mathbf{h}_{\mathcal{B}}$. In this case, we set $\mathbf{u}_{\mathcal{A}} = \mathrm{Beam}_{\mathcal{A}}$ and $\mathbf{u}_{\mathcal{B}} = \mathbf{h}_{\mathcal{B}}$ in (9). The mathematical formula of beam prediction is
$$\{\mathrm{Beam}_{\mathcal{A}}\} = f_b(\{\mathbf{h}_{\mathcal{B}}\}), \quad (29)$$
where $f_b(\cdot)$ is the beam extrapolation function that will be learned by the ADEN. After $\mathrm{Beam}_{\mathcal{A}}$ is predicted, the corresponding beamforming vector can be obtained by looking up the codebook.

C. CCM Extrapolation

The CCM is a statistical characteristic of the channel and is conventionally obtained by accumulating a sufficient number of estimated channel vectors. For massive MIMO systems, unfortunately, the required number of estimated channel vectors is significantly large. Nevertheless, according to the mapping (8), the CCM can also be extrapolated from a small number of antennas.

Fig. 6: The communications scenario.

Let $\mathbf{u}_{\mathcal{A}} = \mathbf{R}_{\mathcal{A}}$ and $\mathbf{u}_{\mathcal{B}} = \mathbf{R}_{\mathcal{B}}$ in (9); then
$$\mathbf{R}_{\mathcal{A}} = f_c(\mathbf{R}_{\mathcal{B}}), \quad (30)$$
where $f_c(\cdot)$ denotes the CCM extrapolation function that will be learned by the ADEN.
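The metric (28) can be computed as below; averaging over a leading batch dimension is an implementation assumption.

```python
import numpy as np

def nmse(h_true, h_est):
    """Empirical NMSE over a batch of channel vectors, as in (28).
    h_true, h_est: complex arrays of shape (batch, Nt)."""
    err = np.sum(np.abs(h_true - h_est) ** 2, axis=1)   # per-sample squared error
    ref = np.sum(np.abs(h_true) ** 2, axis=1)           # per-sample channel energy
    return float(np.mean(err / ref))
```

A perfect estimate gives an NMSE of 0, while the all-zero estimate gives exactly 1, which makes the scale easy to interpret.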
Similarly to the channel vector extrapolation, the input of the CCM extrapolation network is $\mathbf{R}_{\mathrm{in}} = \{\Re(\mathbf{R}_{\mathcal{B}}), \Im(\mathbf{R}_{\mathcal{B}})\}$, and the corresponding label is $\mathbf{R}_{\mathrm{lab}} = \{\Re(\mathbf{R}_{\mathcal{A}}), \Im(\mathbf{R}_{\mathcal{A}})\}$. Denote $\mathbf{R}_{\mathrm{out}} = \{\Re(\mathbf{R}), \Im(\mathbf{R})\}$ as the output of the ADEN. Since CCMs are positive semi-definite, we add one more layer to ensure the positive semi-definiteness, and the corresponding output is
$$\mathbf{R}'_{\mathrm{out}} = \{\Re(\mathbf{R}) + \Re(\mathbf{R}^T), \Im(\mathbf{R}) - \Im(\mathbf{R}^T)\} = \{\Re(\mathbf{R} + \mathbf{R}^H), \Im(\mathbf{R} + \mathbf{R}^H)\}. \quad (31)$$

V. SIMULATION RESULTS

In this section, we evaluate the performance of the proposed ASN and ADEN for channel extrapolation, beam prediction, and CCM extrapolation.

TABLE I: DeepMIMO Dataset Parameters

Scenario name: O1_28
Active BS: BS 15
Active users: rows 3252-3852
Number of BS antennas: 64
Number of BS antennas in x-axis: 8
Number of BS antennas in y-axis: 8
Number of BS antennas in z-axis: 1
Antenna spacing (wavelengths): 0.5
Bandwidth (GHz): 0.2
Number of OFDM subcarriers: 1
OFDM sampling factor: 1
OFDM limit: 1
Number of paths: 11

A. Communications Set Up

Let us consider a scenario from the DeepMIMO dataset [30], which is constructed from the 3D ray-tracing software Wireless InSite [31] and can capture the channel dependence on frequency and location. Specifically, we use the outdoor scenario 'O1_28' [30] at frequency $f_c = 28$ GHz, as shown in Fig. 6. The BS (BS 15 in Fig. 6) is equipped with a uniform planar array (UPA) of $8 \times 8$ antennas, and the antenna spacing $d$ is set to $0.5\lambda_c$, where $\lambda_c$ is the carrier wavelength. The bandwidth of the system is set to 200 MHz and the number of paths is set to 11. The corresponding rows of the communication scenario in Fig. 6 are from 3252 to 3852. Each row contains 181 users, and each user represents a position in the scenario. Hence, there are a total of 108,781 channels.
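A sketch of the per-block empirical CCM estimate and the symmetrization layer of (31) is given below. The random matrix stands in for a raw network output; note that $\mathbf{R} + \mathbf{R}^H$ is guaranteed Hermitian by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
Nt, K = 8, 25                        # 25 channel samples per collection block

# Empirical CCM R_h = E[h h^H], estimated by averaging over K channel vectors.
H = (rng.standard_normal((K, Nt)) + 1j * rng.standard_normal((K, Nt))) / np.sqrt(2)
R_h = H.conj().T @ H / K             # Hermitian and positive semi-definite

# The extra output layer of (31): for a raw (unstructured) network output R,
# form R' = R + R^H, which is Hermitian and hence has real eigenvalues.
R_raw = rng.standard_normal((Nt, Nt)) + 1j * rng.standard_normal((Nt, Nt))
R_out = R_raw + R_raw.conj().T
```

Averaging outer products over the 25 locations of a collection block is exactly how the CCM labels of the dataset are obtained.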
The channel vectors are generated based on formula (11) and the parameters in Table I. We use the algorithm [32] in [33] to generate a beamforming codebook based on the antenna array parameters in Table I. Then we select the index of the beamforming vector according to formula (5) to form the label of the beam prediction network.

TABLE II: Network Training Hyper-Parameters

Parameter: Channel / Beam / CCM
Solver: Adam
Initial learning rate:
Sub-network K_i: fully connected layer and ReLU
Number of neurons in K_i: 512 / 512 / 16,386
Dataset size: 108,781 / 108,781 / 52,569
Dataset split: 80%-20%
Penalty factors: beta_1 = 1, beta_2 = 10
Scale factor rho: initial rho = 5, then scaled every epoch

The dataset for CCM extrapolation is generated from the channel vectors collected in a small area around each user. Define one such area as a collection block. For each collection block, we evenly collect channels at 25 locations, i.e., 5 rows and 5 columns, and then obtain the channel covariance matrix as $\mathbf{R}_h = \mathbb{E}\left[\mathbf{h}\mathbf{h}^H\right]$. We collect the CCMs of the users from row 3252 to row 3852 and generate a total of 52,569 collection blocks.

B. Neural Network Training

The configurations of the neural network in the three applications are as follows:

1) Channel Extrapolation: In the case of channel extrapolation, the ASN is composed of three layers. Each layer has 64 neurons, and the output of the ASN is $\mathbf{s} \in \{0,1\}^{64 \times 1}$. The input of the ADEN is $\bar{\mathbf{s}} \odot [\Re(\mathbf{h}_{\mathcal{A}}^T), \Im(\mathbf{h}_{\mathcal{A}}^T)]^T \in \mathbb{R}^{128 \times 1}$. Moreover, each sub-network contains 512 neurons and a ReLU layer.

2) Beam Prediction: The input is the same as that of the channel extrapolation case. The output and label of the network are the beam index, and the loss function is the cross-entropy function. Moreover, each layer includes 128 neurons and the activation function is ReLU.
In the beam prediction case, since the output of the ADEN is the beam index, i.e., a one-dimensional number, only a few layers are needed to achieve satisfactory accuracy. Hence we delete the coarse extrapolation subnetwork and only use the fine extrapolation subnetwork for beam prediction.

3) CCM Extrapolation: In the case of CCM extrapolation, the ASN is composed of three layers. Each layer has 64 neurons, and the output of the ASN is $\mathbf{s} \in \{0,1\}^{64 \times 1}$. Different from channel extrapolation, sampling the covariance matrix in the antenna domain requires expanding the antenna selection vector into a two-dimensional matrix $\mathbf{S} = \mathbf{s}\mathbf{s}^T \in \{0,1\}^{64 \times 64}$. Then we concatenate $\mathbf{S}$ to generate $\bar{\mathbf{S}} = [\mathbf{S}^T, \mathbf{S}^T]^T \in \{0,1\}^{128 \times 64}$. The CCM should also be constructed as $\mathbf{R}_{\mathrm{in}} \in \mathbb{R}^{128 \times 64}$ with the real and imaginary parts separated. Then we perform the Hadamard product of $\bar{\mathbf{S}}$ and $\mathbf{R}_{\mathrm{in}}$ as $\bar{\mathbf{S}} \odot [\Re(\mathbf{R}_{\mathcal{A}}^T), \Im(\mathbf{R}_{\mathcal{A}}^T)]^T$. Before entering the ADEN, the result of the Hadamard product is reshaped into a column vector in $\mathbb{R}^{8192 \times 1}$.

Fig. 7: Antenna selection patterns at SNR = 30 dB: (a) uniform antenna selection pattern; (b) antenna selection pattern learned by 'ASN+DNN'; (c) antenna selection pattern learned by 'ASN+ADEN'.

C. Performance Evaluation

For all simulations, we compare four channel extrapolation schemes with the same number of neurons: (i) 'Uniform + DNN' (using a traditional DNN to extrapolate from the uniform antenna selection pattern); (ii) 'Uniform + ADEN' (using the ADEN to extrapolate from the uniform antenna selection pattern); (iii) 'ASN + DNN' (using a DNN to extrapolate from the learned antenna selection pattern); (iv) 'ASN + ADEN' (using the ADEN to extrapolate from the learned antenna selection pattern).

1) Channel Extrapolation: We use the channels of 8 antennas to extrapolate the channels of all 64 antennas. The uniform antenna selection pattern is shown in Fig. 7(a). At SNR = 30 dB, the antenna selection patterns learned by 'ASN + DNN' and 'ASN + ADEN' are shown in Fig. 7(b), Fig.
7(c), respectively, which look quite different from the uniform one. The NMSE of channel extrapolation versus the number of epochs for the four schemes is displayed in Fig. 8. It is seen that the extrapolation NMSE of ‘Uniform + DNN’ is 0.060, while that of ‘Uniform + ADEN’ reaches 0.050. Moreover, the extrapolation NMSE of ‘ASN + DNN’ reduces to 0.017, while that of ‘ASN + ADEN’ significantly drops to 0.006. We then test the channel extrapolation NMSE at different SNRs in Fig. 9. Clearly, ADEN outperforms the traditional DNN in terms of extrapolation accuracy, and the proposed ASN performs much better than the uniform selection. From Fig. 8 and Fig. 9, the proposed ADEN performs only slightly better than the DNN under uniform selection. However, with the optimized antenna selection, the accuracy of ADEN is much better than that of the DNN, which demonstrates the effectiveness of the proposed joint training scheme.

Fig. 8: The NMSE of channel extrapolation versus epochs with 8 antennas and SNR=30dB.
Fig. 9: The NMSE of channel extrapolation versus SNR with 8 antennas.
Fig. 10: Antenna selection patterns at SNR=30dB: (a) antenna selection pattern learned by ‘ASN+DNN’; (b) antenna selection pattern learned by ‘ASN+ADEN’.
Fig. 11: Beam prediction accuracy versus epochs at SNR=30dB.
Fig. 12: Beam prediction accuracy versus SNR.

2) Beam Prediction: For beam prediction, we utilize the channel of 8 antennas to predict the beam index of 64 antennas.
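The NMSE figures quoted above follow the standard per-sample normalization, $\mathrm{NMSE} = \mathbb{E}[\|\mathbf{h} - \hat{\mathbf{h}}\|^2 / \|\mathbf{h}\|^2]$. A minimal sketch of the metric (the function and array names, and the batch-averaging convention, are our assumptions rather than the paper's exact code):

```python
import numpy as np

def nmse(h_true, h_pred):
    """Normalized MSE, E[||h - h_hat||^2 / ||h||^2], averaged over samples.

    h_true, h_pred: complex arrays of shape (num_samples, num_antennas).
    """
    err = np.sum(np.abs(h_true - h_pred) ** 2, axis=-1)    # per-sample error energy
    power = np.sum(np.abs(h_true) ** 2, axis=-1)           # per-sample channel energy
    return float(np.mean(err / power))
```

With this convention, a perfect extrapolation gives NMSE 0, and predicting all zeros gives NMSE 1, so the reported values (e.g., 0.006 for ‘ASN + ADEN’) are directly comparable across schemes.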
At SNR=30dB, the antenna selection patterns learned by ‘ASN + DNN’ and ‘ASN + ADEN’ are shown in Fig. 10. The beam prediction accuracies of the four schemes are displayed in Fig. 11. The accuracies of ‘ASN + ADEN’, ‘ASN + DNN’, ‘Uniform + ADEN’, and ‘Uniform + DNN’ are 0.965, 0.933, 0.890, and 0.880, respectively. We then test the beam prediction accuracy at different SNRs in Fig. 12. Similarly to channel extrapolation, ADEN achieves higher beam prediction accuracy than the traditional DNN, and the antenna selection pattern learned by ASN is better than the uniform pattern.

3) CCM Extrapolation: For CCM extrapolation, we first use the CCM of 8 antennas to extrapolate the CCM of 64 antennas. The antenna selection patterns for CCM extrapolation learned by ‘ASN + DNN’ and ‘ASN + ADEN’ are shown in Fig. 13, and the NMSE of CCM extrapolation for the four schemes is displayed in Fig. 14. It is seen that the NMSEs of CCM extrapolation for ‘Uniform + DNN’, ‘Uniform + ADEN’, ‘ASN + DNN’, and ‘ASN + ADEN’ are 0.036, 0.026, 0.016, and 0.007, respectively.

Fig. 13: Antenna selection patterns at SNR=30dB: (a) antenna selection pattern learned by ‘ASN+DNN’; (b) antenna selection pattern learned by ‘ASN+ADEN’.
Fig. 14: The NMSE of CCM extrapolation versus epochs with 8 antennas.

We then show the NMSE of CCM extrapolation using different numbers of antennas in Fig. 15. It is seen that the improvement brought by the antenna selection is quite significant when |B| is small. Nevertheless, when |B| increases, the improvement brought by the antenna selection reduces, and the improvement of extrapolation mostly comes from the designed ADEN. Moreover, the proposed ‘ASN+ADEN’ always achieves the best extrapolation accuracy with different antenna numbers.
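The CCM masking described in Section V-B (expanding $\mathbf{s}$ into $\mathbf{S} = \mathbf{s}\mathbf{s}^T$, stacking it for the real and imaginary parts, and taking the Hadamard product) can be sketched as follows for the 64-antenna setup. This is an illustrative reconstruction; the function and variable names are ours, not the paper's:

```python
import numpy as np

def aden_ccm_input(R, s):
    """Mask a 64x64 complex CCM R with a 0/1 antenna-selection vector s (64,):
    expand s into S = s s^T, stack S once per real/imaginary part, take the
    Hadamard product, and flatten to the 8192x1 ADEN input vector."""
    S = np.outer(s, s)                    # 64x64 selection mask S = s s^T
    S_bar = np.vstack([S, S])             # 128x64, one copy per real/imag part
    R_in = np.vstack([R.real, R.imag])    # 128x64 CCM with re/im separated
    return (S_bar * R_in).reshape(-1, 1)  # Hadamard product -> 8192x1 column
```

Rows and columns of the CCM belonging to unselected antennas are zeroed out, so the network only observes the sampled sub-covariance while the input dimension stays fixed.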
Fig. 15: The NMSE of CCM extrapolation versus antenna numbers.

TABLE III: Channel Extrapolation Error with Different Initial Weights

Antenna Number | Channel Extrapolation NMSE (×1e-03)                     | Variance
8              | 6.936  7.220  8.426  4.845  7.062  6.714  8.920  6.670  | 1.313e-06
16             | 2.631  3.455  2.821  4.006  4.397  4.612  1.878  3.960  | 7.943e-07
24             | 1.813  1.494  2.675  1.621  2.315  1.707  2.528  2.269  | 1.755e-07
32             | 1.414  1.412  1.742  1.405  1.629  1.160  1.379  1.446  | 2.645e-08

D. Sensitivity Analysis

Configuring the neural networks with different initial weights yields different antenna selection patterns, so it is meaningful to study the variance of the extrapolation error under different initialization conditions. We repeat training ‘ASN + ADEN’ for channel extrapolation with different antenna numbers several times and compute the variances reported in Table III. Although the learned antenna selection pattern differs from training to training, the extrapolation error does not fluctuate much, and the error variance is further reduced as the number of antennas increases.

VI. CONCLUSIONS

In this paper, we investigated antenna domain channel extrapolation for massive MIMO systems, where the channels of the whole antenna array can be predicted from those of a few antennas. We first designed the ASN to achieve the optimal antenna selection, where we proposed a constrained degradation method to approximate the derivative of the antenna selection vector such that the gradient can be back-propagated when training the network. We next designed the ADEN to complete the channel extrapolation, where the ODE-inspired network structure is adopted to enhance the performance over the conventional DNN. The ASN and ADEN are jointly trained to find the optimal parameters.
We then presented three typical applications: channel extrapolation, CCM extrapolation, and beam prediction. Simulation results show that the learned antenna selection is superior to the uniform selection, and the ADEN performs better than the traditional DNN.

APPENDIX A
PROOF OF LEMMA

Proof: Since $\boldsymbol{\nu}'$ is a rearrangement of $\boldsymbol{\nu}$, it still satisfies the equality constraints (16). Without loss of generality, we assume that $\boldsymbol{\nu}'$ has $r$ nonzero entries $\nu_{k_1}, \nu_{k_2}, \ldots, \nu_{k_r}$. Hence, the equality (32) holds, and we obtain

$\nu_{k_1}^2 = \nu_{k_2}^2 = \cdots = \nu_{k_r}^2 = q$.  (34)

Solving (34), we obtain $\nu_{k_1} = \nu_{k_2} = \cdots = \nu_{k_r} = \sqrt{q}$. Substituting $\nu_{k_i} = \sqrt{q}$ ($i = 1, \ldots, r$) into (16), we obtain the following equations

$r \cdot \sqrt{q} = M_t, \quad r \cdot (\sqrt{q})^2 = M_t$.  (35)

Solving (35), we obtain $q = 1$ and $r = M_t$, and hence $\nu_{k_1} = \nu_{k_2} = \cdots = \nu_{k_{M_t}} = 1$ and $\nu_{k_{M_t+1}} = \nu_{k_{M_t+2}} = \cdots = \nu_{k_{N_t}} = 0$. Therefore, the vector $\boldsymbol{\nu}$ is an $M_t$-hot vector under the constraints (16). □

APPENDIX B
GEOMETRIC EXPLANATION OF LEMMA

We display the $l_1$-, $l_2$-, and $l_3$-norm balls of two-dimensional vectors in Fig. 16. It is seen that the intersections of these norm balls all lie on the x-axis and y-axis, with coordinates (1,0), (0,1), (−1,0), (0,−1). If we restrict the horizontal and vertical coordinates to be non-negative numbers, then there are only two intersections, (1,0) and (0,1), whose coordinates are exactly the two one-hot vectors. We then display the norm balls of three-dimensional vectors in Fig. 17, where there are three intersections (1,0,0), (0,1,0), and (0,0,1). We see that the intersections of the $l_1$-norm, $l_2$-norm, and $l_3$-norm balls are the three one-hot vectors.

Next, we show how to yield a K-hot vector by deforming the norm balls. Taking three-dimensional vectors as an example, we wish to find several graphics such that the coordinates of their intersections are the two-hot vectors (1,1,0), (0,1,1), and (1,0,1), respectively.

Fig. 17: Norm balls of 3-dimensional vectors.
Fig. 18: Norm balls of 3-dimensional vectors with $\|\mathbf{x}\|_p = \sqrt[p]{2}$.
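The K-hot characterization from the lemma above can be spot-checked numerically. The brute-force grid search below over nonnegative 3-dimensional vectors is our own illustrative check, assuming the $l_1$/$l_2$/$l_3$-norm formulation: only the K-hot grid points should satisfy all three norm constraints simultaneously.

```python
import itertools
import math

def khot_solutions(n, K, step=0.5):
    """Grid-search nonnegative n-dim vectors with entries in [0, 1] whose
    l1, l2, l3 norms equal K, sqrt(K), and cbrt(K), respectively; by the
    lemma, the survivors should be exactly the K-hot vectors."""
    grid = [i * step for i in range(int(1 / step) + 1)]   # e.g. 0.0, 0.5, 1.0
    sols = []
    for x in itertools.product(grid, repeat=n):
        l1 = sum(x)
        l2 = sum(v ** 2 for v in x) ** 0.5
        l3 = sum(v ** 3 for v in x) ** (1 / 3)
        if (abs(l1 - K) < 1e-9 and abs(l2 - math.sqrt(K)) < 1e-9
                and abs(l3 - K ** (1 / 3)) < 1e-9):
            sols.append(x)
    return sols
```

For n = 3 and K = 2 the search returns exactly the three two-hot vectors (1,1,0), (1,0,1), and (0,1,1); fractional grid points such as (1, 0.5, 0.5) meet the $l_1$ constraint but fail the $l_2$ one.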
Since the $l_p$-norm of a two-hot vector is $\sqrt[p]{2}$, we construct the following three graphics

$\|\mathbf{x}\|_1 = 2, \quad \|\mathbf{x}\|_2 = \sqrt{2}, \quad \|\mathbf{x}\|_3 = \sqrt[3]{2}$,  (36)

as shown in Fig. 18. It is seen from Fig. 18 that the coordinates of the three intersections of these graphics are exactly the two-hot vectors we need. For the more general case, by limiting the $l_1$-, $l_2$-, and $l_3$-norms of the vectors to $\sqrt[p]{K}$ ($p = 1, 2, 3$), the intersections are exactly the K-hot vectors. The strict proof can be found in Section III.

APPENDIX C
DERIVATION OF ODE

We here derive the mathematical model of the fine extrapolation subnetwork. The target is to obtain the fine $\hat{\mathbf{u}}_A^f$ from the coarse $\hat{\mathbf{u}}_A^c$. A popular way to improve the prediction accuracy is to increase the depth of the DNN. However, simply increasing the depth may bring various issues such as overfitting and vanishing gradients. Hence, many neural network structures [35]–[37] based on skip connections [38] were proposed to increase the number of layers and have achieved advanced performance. The skip connection can be formulated as

$\mathbf{u}^f(n+1) = \mathbf{u}^f(n) + \psi[\mathbf{u}^f(n), \omega(n)]$,  (37)

where $n$ and $n+1$ are layer indices in the DNN (which can also be seen as time marks) and $\mathbf{u}^f(n+1)$, $\mathbf{u}^f(n)$ are two neuron layers. Equation (37) is also known as the discretized Euler equation [39]. Replacing the discrete variable $n$ with a continuous variable $t$, when the time interval becomes small ($\Delta t \to 0$), or equivalently the number of layers between the connected layers becomes large, equation (37) can be written as

$\dfrac{d\mathbf{u}^f(t)}{dt} = \psi[\mathbf{u}^f(t), \omega(t)]$.  (38)

For the fine extrapolation subnetwork, the input is $\mathbf{u}^f(0) = \hat{\mathbf{u}}_A^c$. The fine extrapolation can then be formulated as an ordinary differential equation (ODE) initial value problem

$\dfrac{d\mathbf{u}^f(t)}{dt} = \psi[\mathbf{u}^f(t), \omega(t)], \quad \mathbf{u}^f(0) = \hat{\mathbf{u}}_A^c$.  (39)

REFERENCES

[1] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L.
Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, 2014.
[2] H. Xie, F. Gao, S. Zhang, and S. Jin, “A unified transmission strategy for TDD/FDD massive MIMO systems with spatial basis expansion model,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3170–3184, 2016.
[3] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proc. IEEE, vol. 98, no. 6, pp. 1058–1076, 2010.
[4] C. Steffens, Y. Yang, and M. Pesavento, “Multidimensional sparse recovery for MIMO channel parameter estimation,” in Proc. Eur. Signal Process. Conf., 2016, pp. 66–70.
[5] P. Cheng, Z. Chen, Y. Rui, Y. J. Guo, L. Gui, M. Tao, and Q. Zhang, “Channel estimation for OFDM systems over doubly selective channels: A distributed compressive sensing based approach,” IEEE Trans. Commun., vol. 61, no. 10, pp. 4173–4185, 2013.
[6] B. Wang, F. Gao, S. Jin, H. Lin, and G. Y. Li, “Spatial- and frequency-wideband effects in millimeter-wave massive MIMO systems,” IEEE Trans. Signal Process., vol. 66, no. 13, pp. 3393–3406, 2018.
[7] M. Jian, F. Gao, Z. Tian, S. Jin, and S. Ma, “Angle-domain aided UL/DL channel estimation for wideband mmWave massive MIMO systems with beam squint,” IEEE Trans. Wireless Commun., vol. 18, no. 7, pp. 3515–3527, 2019.
[8] J. Zhao, F. Gao, W. Jia, S. Zhang, S. Jin, and H. Lin, “Angle domain hybrid precoding and channel tracking for millimeter wave massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 16, no. 10, pp. 6868–6880, 2017.
[9] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, 2019.
[10] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Deep learning-based channel estimation for beamspace mmWave massive MIMO systems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852–855, 2018.
[11] H. Ye, G. Y. Li, and B.-H.
Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, 2017.
[12] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for joint MIMO channel estimation and signal detection,” arXiv preprint arXiv:1907.09439, 2019.
[13] P. Zhou, X. Fang, X. Wang, Y. Long, R. He, and X. Han, “Deep learning-based beam management and interference coordination in dense mmWave networks,” IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 592–603, 2018.
[14] A. Klautau, P. Batista, N. González-Prelcic, Y. Wang, and R. W. Heath, “5G MIMO data for machine learning: Application to beam-selection using deep learning,” in Proc. IEEE Inf. Theory Appl. Workshop, Feb. 2018, pp. 1–9.
[15] J. Guo, C.-K. Wen, S. Jin, and G. Y. Li, “Convolutional neural network-based multiple-rate compressive sensing for massive MIMO CSI feedback: Design, simulation, and analysis,” IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2827–2840, 2020.
[16] W. Ma, C. Qi, Z. Zhang, and J. Cheng, “Sparse channel estimation and hybrid precoding using deep learning for millimeter wave massive MIMO,” IEEE Trans. Commun., vol. 68, no. 5, pp. 2838–2849, 2020.
[17] P. Dong, H. Zhang, and G. Y. Li, “Machine learning prediction based CSI acquisition for FDD massive MIMO downlink,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2018, pp. 1–6.
[18] M. Alrabeiah and A. Alkhateeb, “Deep learning for TDD and FDD massive MIMO: Mapping channels in space and frequency,” in Proc. Asilomar Conf. Signals, Syst. Comput., 2019, pp. 1465–1470.
[19] A. Taha, M. Alrabeiah, and A. Alkhateeb, “Deep learning for large intelligent surfaces in millimeter wave and massive MIMO systems,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[20] Y. Yang, F. Gao, G. Y. Li, and M. Jian, “Deep learning-based downlink channel prediction for FDD massive MIMO system,” IEEE Commun. Lett., vol. 23, no. 11, pp.
1994–1998, 2019.
[21] Y. Yang, F. Gao, Z. Zhong, B. Ai, and A. Alkhateeb, “Deep transfer learning based downlink channel prediction for FDD massive MIMO systems,” IEEE Trans. Commun., 2020.
[22] A. F. Molisch and M. Z. Win, “MIMO systems with antenna selection,” IEEE Microw. Mag., vol. 5, no. 1, pp. 46–56, 2004.
[23] S. Sanayei and A. Nosratinia, “Antenna selection in MIMO systems,” IEEE Commun. Mag., vol. 42, no. 10, pp. 68–73, 2004.
[24] A. F. Molisch, M. Z. Win, Y.-S. Choi, and J. H. Winters, “Capacity of MIMO systems with antenna selection,” IEEE Trans. Wireless Commun., vol. 4, no. 4, pp. 1759–1772, 2005.
[25] M. K. Samimi and T. S. Rappaport, “3-D millimeter-wave statistical channel model for 5G wireless system design,” IEEE Trans. Microw. Theory Techn., vol. 64, no. 7, pp. 2207–2225, 2016.
[26] A. M. Sayeed, T. Sivanadyan, K. Liu, and S. Haykin, “Wireless communication and sensing in multipath environments using multi-antenna transceivers,” in Handbook on Array Processing and Sensor Networks. Wiley Online Library, 2010.
[27] A. M. Sayeed, “Deconstructing multiantenna fading channels,” IEEE Trans. Signal Process., vol. 50, no. 10, pp. 2563–2579, 2002.
[28] J. R. Dormand and P. J. Prince, “A family of embedded Runge-Kutta formulae,” Journal of Computational and Applied Mathematics, vol. 6, no. 1, pp. 19–26, 1980.
[29] W. B. Gragg and H. J. Stetter, “Generalized multistep predictor-corrector methods,” Journal of the ACM (JACM), vol. 11, no. 2, pp. 188–209, 1964.
arXiv e-prints, arXiv:1910.02900, Oct. 2019.
[34] J. M. Steele, The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
[36] ——, “Identity mappings in deep residual networks,” in Proc. ECCV. Springer, 2016, pp. 630–645.
[37] X. He, Z. Mo, P. Wang, Y. Liu, M. Yang, and J.
Cheng, “ODE-inspired network design for single image super-resolution,” in Proc. CVPR, 2019, pp. 1732–1741.
[38] X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in Advances in Neural Information Processing Systems, 2016, pp. 2802–2810.
[39] Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” in International Conference on Machine Learning. PMLR, 2018, pp. 3276–3285.