Deep Learning Based Channel Covariance Matrix Estimation with User Location and Scene Images
Weihua Xu, Feifei Gao, Jianhua Zhang, Xiaoming Tao, Ahmed Alkhateeb
Abstract—The channel covariance matrix (CCM) is one critical parameter for designing communications systems. In this paper, a novel framework of deep learning (DL) based CCM estimation is proposed that exploits the perception of the transmission environment without any channel samples or pilot signals. Specifically, as the CCM is affected by the user's movement within a specific environment, we design a deep neural network (DNN) to predict the CCM from the user location and user speed, and the corresponding estimation method is named ULCCME. A location denoising method is further developed to reduce the positioning error and improve the robustness of ULCCME. For cases when user location information is not available, we propose an interesting way that uses environmental 3D images to predict the CCM, and the corresponding estimation method is named SICCME. Simulation results show that both proposed methods are effective and will benefit the subsequent channel estimation between the transceivers.
Index Terms—Deep learning, covariance estimation, location denoising, scene image, pilot free
W. Xu and F. Gao are with the Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, P.R. China, 100084 (email: [email protected], [email protected]). J. Zhang is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]). X. Tao is with the Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China, 100084 (email: [email protected]). A. Alkhateeb is with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712-1687 USA (e-mail: [email protected]).

I. INTRODUCTION

The channel covariance matrix (CCM) is essential information for enhancing the channel estimation and transceiver performance [1], especially in massive MIMO communications that need accurate downlink beamforming [2]. However, traditional CCM estimation methods need long training signals to obtain sufficient channel samples, which causes severe time and energy overhead. Therefore, acquiring the CCM with low overhead has received much attention in recent years [5]-[10]. In [5]-[7], the authors resorted to the uplink CCM to predict the downlink CCM for the frequency division duplex (FDD) massive MIMO system without the need of downlink channel estimation. However, these methods cannot avoid the huge overhead of uplink channel estimation when the base station (BS) contains a large number of antennas. Based on the sparse characteristics of millimeter wave channels, the authors of [8] applied compressive sensing to estimate the CCM with low training overhead. The authors of [9] translated the CCM from sub-6 GHz to the mmWave band by the compressed signal recovery theorem and thus reduced the overhead of CCM estimation at the mmWave band. Though the methods of [8] and [9] can improve the efficiency of CCM estimation, they rely heavily on the assumption of channel sparsity. In [10], a deep learning (DL) based CCM estimation is designed by utilizing the rich channel information contained in the uplink signals at multiple base stations (BSs). Although [10] has demonstrated that the DL method can significantly reduce the training overhead compared with compressive sensing methods, it requires coordination among multiple BSs and is hard to implement in practice. Nevertheless, [10] formulated the insight that DL tools can potentially be used to improve CCM estimation, which deserves further exploration.

It was recently found that side information of communication environments, such as user locations [11]-[13], scene point clouds [14]-[16], and scene RGB images [17]-[19], can reflect rich channel characteristics. Specifically, most system parameters of the channel matrix can be regarded as functions of the environment information, like objects' distributions, shapes, materials, etc. [20]-[22].
These functions are difficult to express mathematically or model accurately, but they can be well represented by machine learning models and trained on real sample data in an offline manner. Indeed, side information has already been exploited to enhance the efficiency of beam alignment with the aid of the feature extraction capability of machine learning tools [11]-[19]. This motivates us to design a highly efficient and more general approach for CCM estimation from environmental information, and thus reduce or even remove the requirement of training sequences.

In this paper, we present a novel framework that utilizes side information, such as the user location, user speed, and the scene images surrounding the user, to estimate the CCM in a pilot-free manner. The main contributions of this paper are listed as follows:

1) User location based CCM estimation (ULCCME). We propose to first estimate a rough user moving region that contains all possible user locations at a certain moment. Then a location based CCM estimation DNN, called LCNET, is designed to learn the mapping from the user's starting location and speed to the CCM of the corresponding user moving region. The well-trained DNN can be utilized to predict the CCM for estimating the channel at the concerned moment.

2) Deep learning based location denoising. As user locations are hard to estimate accurately, we design a location denoising algorithm to reduce such error. We first train a location estimation DNN, called LENET, to learn the mapping from the user channel to the user location. Then, we utilize the error distribution of LENET and the noise distribution of the uploaded location to derive the maximum likelihood (ML) estimate of the current user location and correct the uploaded user location.

3) Scene image based CCM estimation (SICCME). When the user location is unavailable due to strong noise corruption or the protection of the user's privacy, we propose to utilize scene images of the surrounding environment to replace the user location as the input feature of the DNN, and the so obtained image based CCM estimation DNN is called ICNET.

This paper is organized as follows. Section II introduces the signal model and illustrates the existence of the mapping between the user moving trajectory and the channel covariance. Then, Section III proposes the ULCCME method and the location denoising method, while Section IV presents the SICCME method. Section V describes the performance metric, the simulation setup, and the data set generation, and then provides numerous simulation results and fruitful discussions. Finally, the conclusions are drawn in Section VI, and future work is illustrated in Section VII.

Notation: $\mathbf{A}$ is a matrix; $\mathbf{a}$ is a vector; $a$ is a scalar; $[\mathbf{a}]_i$ is the $i$th element of $\mathbf{a}$; $\mathbf{a}_{i:j}$ is the column vector $[[\mathbf{a}]_i, [\mathbf{a}]_{i+1}, \cdots, [\mathbf{a}]_j]^T$; $[\mathbf{A}]_{i,j}$ is the element of the $i$th row and $j$th column of $\mathbf{A}$; $[\mathbf{A}]_{i,:}$ and $[\mathbf{A}]_{:,j}$ are the $i$th row and the $j$th column of $\mathbf{A}$, respectively; $\mathbf{I}_r$ is the identity matrix with rank $r$; $\mathcal{N}(\mathbf{m}_g, \mathbf{R}_g)$/$\mathcal{CN}(\mathbf{m}_g, \mathbf{R}_g)$ is the real/complex Gaussian distribution with mean $\mathbf{m}_g$ and covariance $\mathbf{R}_g$; $\mathrm{E}\{\cdot\}$ is the expectation operator.

II. PROBLEM FORMULATION
A. Signal Model
We consider downlink communications with one BS and one user. The BS is equipped with a uniform planar array (UPA) of $N_B = N^B_{\mathrm{ele}} \times N^B_{\mathrm{az}}$ antennas, and the user is equipped with a single antenna. The signal received at the user can be expressed as
$$y = \mathbf{h}^H \mathbf{s} + n, \qquad (1)$$
where $\mathbf{h} \in \mathbb{C}^{N_B \times 1}$ is the downlink channel vector, $\mathbf{s} \in \mathbb{C}^{N_B \times 1}$ is the transmit signal at the BS, and $n \in \mathcal{CN}(0, \sigma^2)$ is the Gaussian noise.

We adopt the widely used geometric channel model [23]
$$\mathbf{h} = \sum_{l=1}^{L} \alpha_l \mathbf{a}_t(\theta^t_l, \phi^t_l), \qquad (2)$$
where $\alpha_l$ is the complex gain of the $l$th path, $\theta^t_l$ and $\phi^t_l$ are the elevation and azimuth of the $l$th path's angle of departure, and $\mathbf{a}_t(\theta, \phi) \in \mathbb{C}^{N_B \times 1}$ is the complex steering vector of the transmit array.

The antenna spacing of the UPA is set as $d$, and the steering vector $\mathbf{a}_t(\theta, \phi)$ is given by
$$\mathbf{a}_t(\theta, \phi) = \frac{1}{\sqrt{N^B_{\mathrm{ele}} N^B_{\mathrm{az}}}} \mathbf{a}_{\mathrm{az}}(\theta, \phi) \otimes \mathbf{a}_{\mathrm{ele}}(\theta), \qquad (3)$$
where
$$\mathbf{a}_{\mathrm{ele}}(\theta) = [1, e^{j\frac{2\pi d}{\lambda}\cos(\theta)}, \cdots, e^{j(N^B_{\mathrm{ele}}-1)\frac{2\pi d}{\lambda}\cos(\theta)}]^T, \qquad (4)$$
$$\mathbf{a}_{\mathrm{az}}(\theta, \phi) = [1, e^{j\frac{2\pi d}{\lambda}\sin(\theta)\sin(\phi)}, \cdots, e^{j(N^B_{\mathrm{az}}-1)\frac{2\pi d}{\lambda}\sin(\theta)\sin(\phi)}]^T, \qquad (5)$$
$\lambda$ is the carrier wavelength, and $\otimes$ represents the Kronecker product.

B. The Mapping from User Location Distribution to CCM
It is generally known that the channels depend on the scene objects between the BS and user, i.e., buildings, cars, trees, people, etc., as the propagation paths and attenuation of transmission signals can be completely determined by the environmental information and the BS/user locations. In fact, it can be assumed that there exists a one-to-one mapping between the user position and the channel [24], i.e.,
$$\mathbf{p}_{\mathrm{user}} \ \overset{\mathcal{M}(\cdot)}{\underset{\mathcal{M}^{-1}(\cdot)}{\rightleftharpoons}} \ \mathbf{h}_{\mathrm{user}}, \qquad (6)$$
where $\mathbf{p}_{\mathrm{user}} \in \mathbb{R}^{3}$ is the 3D coordinates of the user location, $\mathbf{h}_{\mathrm{user}}$ is the channel corresponding to $\mathbf{p}_{\mathrm{user}}$, while $\mathcal{M}$ and $\mathcal{M}^{-1}$ represent the mapping function and the inverse mapping function, respectively.

Unfortunately, since the accurate user location is generally an uncertain variable for the BS (or user), measuring the user location accurately in real time, i.e., the localization problem, is very difficult [25]-[27]. The location error of traditional GPS sensors is normally on the order of meters. Hence, it is much easier for the BS (or user) to obtain a probability distribution of the user location at a certain moment $t$. Interestingly, the mapping function (6) also indicates that the user location distribution can reflect its channel distribution.

Let us denote the probability density function (PDF) obtained by the BS (or user) of the user location at moment $t$ as $p_{u,t}(\mathbf{x})$, $\mathbf{x} \in C$, where $\mathbf{x} \in \mathbb{R}^{3}$ is the 3D coordinates of the user location and $C$ is a set that contains all possible values of $\mathbf{x}$. From (6), $p_{u,t}(\mathbf{x})$, $\mathbf{x} \in C$, can determine the user's channel distribution and further determine the CCM at moment $t$. Specifically, the CCM can be expressed as
$$\mathbf{Cov} = \int_{C} p_{u,t}(\mathbf{x}) \mathcal{M}(\mathbf{x}) \mathcal{M}^H(\mathbf{x}) \, \mathrm{d}\mathbf{x}. \qquad (7)$$
Then, the mapping from the user location distribution to the CCM can be expressed as
$$p_{u,t}(\mathbf{x}), \ \mathbf{x} \in C \ \rightarrow \ \mathbf{Cov}. \qquad (8)$$
Interestingly, (8) motivates us to utilize the user location distribution to directly obtain the CCM and avoid the huge training overhead of the traditional approaches.

III. USER LOCATION BASED CCM ESTIMATION
Fig. 1 shows the frame structure of the proposed ULCCME method. We consider a block fading scenario and assume the channel changes slowly during a time period $[\tau, \tau + T_c]$ for any $\tau$, where $T_c$ is the channel coherence time (CCT). Because the CCM is a slowly changing parameter compared to the instantaneous channel, we assume one CCM spans multiple CCTs and define $T_{co}$ as the covariance coherence time (COCT). For each COCT, the proposed frame structure includes three stages. In the first stage of length $T_u$, the user feeds back its location and speed to the BS. In the second stage of length $T_e$, the BS performs the CCM estimation by utilizing the uploaded user location and speed. After CCM estimation, data service is provided during the third stage, which is assumed to have length $N T_c$, while within each CCT the linear minimum mean-square error (LMMSE) channel estimation can be performed with the aid of the estimated CCM. Effective pilot signals can also be designed from the estimated CCM for each COCT.

Fig. 1. The time domain diagram of the proposed user location based CCM estimation.

A. CCM Estimation Method
Without loss of generality, for the $k$th COCT, we assume the user location and speed at moment $T^S_k = (k-1)T_{co}$ are fed back to the BS, while for the $q$th CCT of the $k$th COCT, the channel at moment $T^C_{k,q} = (k-1)T_{co} + T_u + T_e + (q-1)T_c$ is estimated at the user and fed back to the BS. Define the user trajectory as $\mathbf{x}_u(t) \in \mathbb{R}^{3\times 1}$ and the user speed as $v_u(t)$, $t \in [0, +\infty)$. The channel corresponding to $\mathbf{x}_u(T^C_{k,q})$ can be expressed as $\mathcal{M}[\mathbf{x}_u(T^C_{k,q})]$ as in (6). Note that the CCM related to the estimation of channel $\mathcal{M}[\mathbf{x}_u(T^C_{k,q})]$ can be obtained from the probability density of $\mathbf{x}_u(T^C_{k,q})$ together with the mapping from the user location distribution to the CCM, as shown in (8). However, obtaining the actual probability density of $\mathbf{x}_u(T^C_{k,q})$ is very difficult, because the user moving trajectory is hard to predict. For simplicity, we assume the user speed does not change during one COCT, i.e., $v_u(T^S_k) = v_u(T^S_k + \tau)$, $\forall \tau \in [0, T_{co})$, and only utilize $\mathbf{x}_u(T^S_k)$ and $v_u(T^S_k)$ to approximately estimate the probability distribution of $\mathbf{x}_u(T^C_{k,q})$, $q = 1, 2, \cdots, N$.

As shown in Fig. 2, let us consider the case where the user only moves on a plane $A$. Since $\mathbf{x}_u(T^C_{k,q})$, $k = 1, 2, \cdots$, $q = 1, 2, \cdots, N$, satisfies
$$\|\mathbf{x}_u(T^C_{k,q}) - \mathbf{x}_u(T^S_k)\| \leq v_u(T^S_k)(T^C_{k,q} - T^S_k), \qquad (9)$$
the circular area $\hat{C}(\mathbf{x}_u(T^S_k), v_u(T^S_k) T_q)$, i.e., the user moving region in Fig. 2, contains all possible values of $\mathbf{x}_u(T^C_{k,q})$, where $\hat{C}(\hat{\mathbf{x}}, \hat{r}) = \{\mathbf{x} \,|\, \|\mathbf{x} - \hat{\mathbf{x}}\| \leq \hat{r}, [\mathbf{x}]_3 = H\}$, $H$ is the height coordinate of plane $A$, and $T_q = T^C_{k,q} - T^S_k$.

Fig. 2. The BS coverage area and user moving region.

Moreover, with no additional information, the BS would reasonably assume that $\mathbf{x}_u(T^C_{k,q})$ obeys the uniform distribution over $\hat{C}(\mathbf{x}_u(T^S_k), v_u(T^S_k) T_q)$. Then, the corresponding channel covariance for a given $\mathbf{x}_u(T^C_{k,q})$ can be represented as
$$\mathbf{R}_{k,q} = \psi_q(\mathbf{x}_u(T^S_k), v_u(T^S_k)), \qquad (10)$$
where
$$\psi_q(\mathbf{x}, v) = \frac{1}{\pi v^2 T_q^2} \int_0^{2\pi} \int_0^{v T_q} \mathcal{M}[\mathbf{c}(\mathbf{x}, \rho, \theta)] \mathcal{M}^H[\mathbf{c}(\mathbf{x}, \rho, \theta)] \rho \, \mathrm{d}\rho \, \mathrm{d}\theta \qquad (11)$$
and $\mathbf{c}(\mathbf{x}, \rho, \theta) = [[\mathbf{x}]_1 + \rho\cos(\theta), [\mathbf{x}]_2 + \rho\sin(\theta), H]^T$. By utilizing $\mathbf{x}_u(T^S_k)$, $v_u(T^S_k)$, and the mapping $\psi_q(\mathbf{x}, v)$, the channel covariance for estimating $\mathcal{M}[\mathbf{x}_u(T^C_{k,q})]$ can be obtained. However, the scale of the DNN for learning all the mappings $\psi_q(\mathbf{x}, v)$, $q = 1, 2, \cdots, N$, can be very large, which causes serious training and computation overhead. Therefore, we define the CCM $\bar{\mathbf{R}}_k$ that estimates all channels within one COCT as the average value
$$\bar{\mathbf{R}}_k = \psi(\mathbf{x}_u(T^S_k), v_u(T^S_k)) = \frac{1}{N}\sum_{q=1}^{N} \psi_q(\mathbf{x}_u(T^S_k), v_u(T^S_k)) = \frac{1}{N}\sum_{q=1}^{N} \mathbf{R}_{k,q}. \qquad (12)$$
Hence, once we obtain the mapping $\psi(\mathbf{x}, v)$, an effective CCM can be estimated from $\mathbf{x}_u(T^S_k)$ and $v_u(T^S_k)$, as shown in (12).

Fig. 3. The proposed architecture of LCNET.

Next we present the detailed design of the DNN for the location based CCM estimation, called LCNET, where the feature fusion architecture [28] is adopted to approximately learn the mapping $\psi(\mathbf{x}, v)$, as shown in Fig. 3. The proposed architecture contains different subnetworks to extract features from the multiple inputs respectively and then concatenates all the extracted features as the overall input for the next network layer. Since $[\mathbf{x}]_3$ is a constant $H$, the input features of LCNET can be taken as the plane coordinates $[\mathbf{x}]_{1:2} \in \mathbb{R}^{2\times 1}$ of the user location $\mathbf{x}$ and the user speed $v$. The output of LCNET is then an $N_B^2 \times 1$ vector $\mathrm{vec}(\mathbf{O})$ with
$$[\mathbf{O}]_{i,j} = \begin{cases} \mathrm{Re}\{[\psi(\mathbf{x}, v)]_{i,j}\}, & i \geq j \\ \mathrm{Im}\{[\psi(\mathbf{x}, v)]_{i,j}\}, & i < j \end{cases}. \qquad (13)$$
Since $\psi(\mathbf{x}, v)$ is conjugate symmetric, the output of LCNET can be utilized to recover $\psi(\mathbf{x}, v)$. It is worth noting that though the dimension of $\psi(\mathbf{x}, v)$ is generally much larger than that of $[\mathbf{x}]_{1:2}$, we still use one DNN to predict the real part and the imaginary part of $\psi(\mathbf{x}, v)$ so as to learn the correlation between different elements of $\psi(\mathbf{x}, v)$. Moreover, as the CCM estimated through LCNET may not be positive semi-definite, all the negative eigenvalues of the estimated CCM will be replaced by the minimum nonnegative eigenvalue.

B. Enhanced CCM Estimation with Denoising
In practice, the reported user locations inevitably contain errors, which degrade the performance of ULCCME. We propose that the BS obtain a historical user location from the previously estimated channels to perform denoising for the current location. For the $k$th COCT, we know $\mathbf{x}_u(T^C_{k,N})$ is the user location of the $N$th CCT and can be very close to the user location $\mathbf{x}_u(T^S_{k+1})$ that will be fed back in the next COCT. Thus, the BS can utilize the inverse mapping $\mathcal{M}^{-1}$ to estimate $\mathbf{x}_u(T^C_{k,N})$ and then improve the estimation accuracy of $\mathbf{x}_u(T^S_{k+1})$.

Specifically, we design a location estimation DNN, called LENET, with multiple fully connected layers to learn the mapping $\mathcal{M}^{-1}$, as shown in Fig. 4.

Fig. 4. The proposed architecture of LENET.

The input feature of LENET is chosen as the vector concatenating the normalized real part and the normalized imaginary part of one channel $\mathbf{h}$, i.e., $[\zeta^{-1}\mathrm{Re}(\mathbf{h})^T, \zeta^{-1}\mathrm{Im}(\mathbf{h})^T]^T$, where $\zeta = \mathrm{E}\{\|\mathbf{h}\|\}$ is the average channel amplitude. The output of LENET is the plane coordinates $[\mathbf{x}]_{1:2}$ of the user location corresponding to the channel $\mathbf{h}$. It is worth noting that the channel of each sample in the training set of LENET is corrupted by the Gaussian noise $\tilde{\mathbf{n}} \in \mathcal{CN}(\mathbf{0}, \tilde{\sigma}^2 \mathbf{I}_{N_B})$ in order to enhance the robustness of LENET. Once LENET is well trained, the BS can estimate $\mathbf{x}_u(T^C_{k,N})$ at the $k$th COCT for any $k$.

Denote the estimated channel of the $N$th CCT at the $k$th COCT as $\mathbf{h}_{k,N}$. Since $\mathbf{h}_{k,N}$ is a random vector, the location estimation error of LENET, $\mathbf{n}_k = \hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) - \tilde{\mathbf{x}}_u(T^C_{k,N})$, is a two dimensional real random vector, where $\hat{\mathcal{M}}^{-1}(\cdot)$ denotes the learned mapping of LENET and $\tilde{\mathbf{x}}_u(t) = [[\mathbf{x}_u(t)]_1, [\mathbf{x}_u(t)]_2]^T$ is the plane coordinates of $\mathbf{x}_u(t)$. As the probability distribution of $\mathbf{n}_k$ is hard to determine, we use the Gaussian distribution $\mathcal{N}(\mathbf{m}_n, \mathbf{R}_n)$ to approximate the actual probability distribution of $\mathbf{n}_k$, and then estimate $\mathbf{m}_n$ and $\mathbf{R}_n$ on the test set of LENET. Specifically, by utilizing the test set $\{(\mathbf{h}_{T,i}, \mathbf{x}_{T,i}) \,|\, i = 1, 2, \cdots, V\}$, $\mathbf{m}_n$ and $\mathbf{R}_n$ can be derived as
$$\mathbf{m}_n = \frac{1}{VQ}\sum_{i=1}^{V}\sum_{j=1}^{Q} \mathbf{n}_{T,i,j}, \quad \mathbf{R}_n = \frac{1}{VQ-1}\sum_{i=1}^{V}\sum_{j=1}^{Q} (\mathbf{n}_{T,i,j} - \mathbf{m}_n)(\mathbf{n}_{T,i,j} - \mathbf{m}_n)^T, \qquad (14)$$
where
$$\mathbf{n}_{T,i,j} = \hat{\mathcal{M}}^{-1}(\mathbf{h}_{T,i} + \tilde{\mathbf{n}}_{i,j}) - \tilde{\mathbf{x}}_{T,i}, \qquad (15)$$
$\mathbf{h}_{T,i}$ is the accurate test channel used as input feature, $\tilde{\mathbf{x}}_{T,i} \in \mathbb{R}^{2\times 1}$ is the corresponding user plane coordinates used as sample label, $V$ is the number of test samples, and $\tilde{\mathbf{n}}_{i,j}$, $i = 1, 2, \cdots, V$, $j = 1, 2, \cdots, Q$, are samples from $\mathcal{CN}(\mathbf{0}, \tilde{\sigma}^2 \mathbf{I}_{N_B})$.

To estimate $\tilde{\mathbf{x}}_u(T^S_{k+1})$ from $\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N})$ more accurately, the distribution of $\tilde{\mathbf{x}}_u(T^S_{k+1}) - \tilde{\mathbf{x}}_u(T^C_{k,N})$ needs to be obtained.
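As a concrete illustration, the error statistics in (14)-(15) and the ML location correction derived below in (17) can be sketched in numpy as follows. Here `lenet` is a hypothetical placeholder for the learned mapping $\hat{\mathcal{M}}^{-1}$ (a fixed linear map, not a trained network), and all data are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def lenet(h):
    # Hypothetical stand-in for LENET's learned inverse mapping M^-1:
    # maps a channel vector to 2D plane coordinates (fixed linear map here).
    W = np.ones((2, h.size)) / h.size
    return W @ np.abs(h)

# --- Estimate m_n and R_n on a test set, as in (14)-(15) ---
V, Q, N_B, sigma_t = 50, 10, 8, 0.1
h_test = rng.standard_normal((V, N_B)) + 1j * rng.standard_normal((V, N_B))
x_test = rng.uniform(0, 30, size=(V, 2))          # plane-coordinate labels

residuals = []
for i in range(V):
    for _ in range(Q):
        noise = sigma_t * (rng.standard_normal(N_B)
                           + 1j * rng.standard_normal(N_B)) / np.sqrt(2)
        residuals.append(lenet(h_test[i] + noise) - x_test[i])  # eq. (15)
n_T = np.array(residuals)                         # (V*Q, 2) error samples
m_n = n_T.mean(axis=0)                            # sample mean, eq. (14)
R_n = (n_T - m_n).T @ (n_T - m_n) / (V * Q - 1)   # sample covariance, eq. (14)

# --- ML fusion of the DNN estimate and the uploaded location, eq. (17) ---
def ml_location(h_kN, x_uploaded, sigma_c2):
    x_C = lenet(h_kN) - m_n                       # bias-corrected DNN estimate
    A = np.linalg.inv(np.eye(2) / sigma_c2 + np.linalg.inv(R_n))
    return A @ (np.linalg.inv(R_n) @ x_C + x_uploaded / sigma_c2)

x_ml = ml_location(h_test[0], x_test[0], sigma_c2=4.0)
```

The fusion step simply weights the two location sources by their inverse covariances, which is what the closed form in (17) expresses.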
Since $\|\tilde{\mathbf{x}}_u(T^S_{k+1}) - \tilde{\mathbf{x}}_u(T^C_{k,N})\| \leq v_u(T^S_k) T_c$ and a typical user speed can only lead to a very small displacement within milliseconds, i.e., $v_u(T^S_k) T_c \ll 1$, we may assume $\tilde{\mathbf{x}}^S_k = \tilde{\mathbf{x}}_u(T^S_{k+1}) - \tilde{\mathbf{x}}_u(T^C_{k,N}) \approx \mathbf{0}$. We further assume the estimation noise of $\tilde{\mathbf{x}}_u(T^S_{k+1})$ follows the Gaussian distribution $\mathcal{N}(\mathbf{0}, \sigma_c^2 \mathbf{I}_2)$ and denote the estimate of $\tilde{\mathbf{x}}_u(T^S_{k+1})$ as $\breve{\mathbf{x}}_k \in \mathbb{R}^{2\times 1}$, namely, the distribution of $\breve{\mathbf{x}}_k$ is $\mathcal{N}(\tilde{\mathbf{x}}_u(T^S_{k+1}), \sigma_c^2 \mathbf{I}_2)$. From the conditional probability density $p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}), \breve{\mathbf{x}}_k \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1}))$, we can obtain the maximum likelihood estimate of $\tilde{\mathbf{x}}_u(T^S_{k+1})$ by utilizing the obtained distributions of $\mathbf{n}_k$ and $\breve{\mathbf{x}}_k$.

Fig. 5. The timing diagram of the proposed image based covariance estimation.

Without loss of generality, we assume $\mathbf{n}_k$, $\breve{\mathbf{x}}_k$, and $\tilde{\mathbf{x}}_u(T^S_{k+1})$ are statistically independent of each other. Then $\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N})$ and $\breve{\mathbf{x}}_k$ are conditionally independent, i.e.,
$$p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}), \breve{\mathbf{x}}_k \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1})) = p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1}))\, p(\breve{\mathbf{x}}_k \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1})) = p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1})) \frac{1}{2\pi\sigma_c^2} e^{-\frac{[\breve{\mathbf{x}}_k - \tilde{\mathbf{x}}_u(T^S_{k+1})]^T[\breve{\mathbf{x}}_k - \tilde{\mathbf{x}}_u(T^S_{k+1})]}{2\sigma_c^2}}. \qquad (16)$$
Note that $p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1}))$ is approximately the probability density function of $\mathcal{N}(\tilde{\mathbf{x}}_u(T^S_{k+1}) + \mathbf{m}_n, \mathbf{R}_n)$ since $\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) = \tilde{\mathbf{x}}_u(T^S_{k+1}) - \tilde{\mathbf{x}}^S_k + \mathbf{n}_k \approx \tilde{\mathbf{x}}_u(T^S_{k+1}) + \mathbf{n}_k$. We then obtain the ML estimate of $\tilde{\mathbf{x}}_u(T^S_{k+1})$ as
$$\begin{aligned} \tilde{\mathbf{x}}^{ML}_k &= \arg\max_{\tilde{\mathbf{x}}_u(T^S_{k+1})} p(\hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}), \breve{\mathbf{x}}_k \,|\, \tilde{\mathbf{x}}_u(T^S_{k+1})) \\ &= \arg\min_{\tilde{\mathbf{x}}_u(T^S_{k+1})} \big\{ \sigma_c^{-2} [\breve{\mathbf{x}}_k - \tilde{\mathbf{x}}_u(T^S_{k+1})]^T[\breve{\mathbf{x}}_k - \tilde{\mathbf{x}}_u(T^S_{k+1})] + [\tilde{\mathbf{x}}^C_k - \tilde{\mathbf{x}}_u(T^S_{k+1})]^T \mathbf{R}_n^{-1} [\tilde{\mathbf{x}}^C_k - \tilde{\mathbf{x}}_u(T^S_{k+1})] \big\} \\ &= (\sigma_c^{-2}\mathbf{I}_2 + \mathbf{R}_n^{-1})^{-1}(\mathbf{R}_n^{-1}\tilde{\mathbf{x}}^C_k + \sigma_c^{-2}\breve{\mathbf{x}}_k), \end{aligned} \qquad (17)$$
where $\tilde{\mathbf{x}}^C_k = \hat{\mathcal{M}}^{-1}(\mathbf{h}_{k,N}) - \mathbf{m}_n$. As will be verified in the simulations later, treating $\tilde{\mathbf{x}}^{ML}_k$ as the corrected estimate of $\tilde{\mathbf{x}}_u(T^S_{k+1})$ can greatly improve the robustness of the proposed ULCCME.

IV. SCENE IMAGE BASED COVARIANCE ESTIMATION METHOD
The user location may be unavailable due to strong noise/interference corruption or the user's privacy protection. In these cases, we propose to use scene images taken from the user terminal to estimate the CCM. Note that the camera has been widely used as an auxiliary equipment on many mobile intelligent terminals, such as autonomous vehicles, unmanned aerial vehicles, and even smart phones. It is expected that scene images taken by the user can reflect the user location information and represent the spatial characteristics between the BS and user. Fig. 5 shows the frame structure of the proposed SICCME method. Compared with the ULCCME method, the user feeds back its speed and the image feature rather than its location, and the BS utilizes this information to estimate the CCM during each COCT. For a fair comparison with ULCCME, the dimension of the image feature can be set as 2, which is the same as the dimension of the user location.

Specifically, we design an image based DNN for CCM estimation, called ICNET, with the feature fusion architecture to learn the mapping $\hat{\psi}_I(\mathcal{I}, v): (\mathcal{I}, v) \rightarrow \psi(\mathbf{x}, v)$, as shown in Fig. 5, where $\mathcal{I}$ is the set of scene images taken by the user at location $\mathbf{x}$ from multiple camera angles. Once ICNET is well trained, we divide it into two parts: ICNET$_U$ and ICNET$_B$. ICNET$_U$ is the partial subnetwork corresponding to the input images $\mathcal{I}$ and is used for image feature extraction through the convolutional layers. Since the dimension of the image feature vector is much lower than that of the original images, ICNET$_U$ can be deployed at the user terminal to avoid the need of directly feeding back the images. Before the BS starts to estimate the CCM $\bar{\mathbf{R}}_k$ for the $k$th COCT, the user will be asked to take images at moment $T^S_k$ and only needs to feed back the image feature obtained by ICNET$_U$.
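The split between the user-side ICNET$_U$ and the BS-side ICNET$_B$ can be sketched as follows. The tiny linear "networks" below are hypothetical placeholders (the paper's actual layer configurations are in Table II); the sketch only illustrates that a low-dimensional feature, not the raw images, crosses the feedback link:

```python
import numpy as np

rng = np.random.default_rng(1)
N_B = 8  # example number of BS antennas

# Hypothetical stand-in for ICNET_U: compresses the stacked multi-view
# image tensor into a 2-dimensional feature, mirroring the dimension of
# a plane-coordinate user location.
W_u = rng.standard_normal((2, 12 * 16 * 16))

def icnet_u(images):               # images: (16, 16, 12) concatenated views
    return W_u @ images.ravel()    # 2-dim image feature, fed back to the BS

# Hypothetical stand-in for ICNET_B: maps (image feature, speed) to the
# N_B^2-dim real output vector vec(O) of (13).
W_b = rng.standard_normal((N_B * N_B, 3))

def icnet_b(feature, speed):
    return W_b @ np.concatenate([feature, [speed]])

images = rng.uniform(0.0, 1.0, size=(16, 16, 12))
feature = icnet_u(images)          # computed at the user terminal
out = icnet_b(feature, speed=1.5)  # computed at the BS
```

Only the 2-dimensional feature (plus the scalar speed) is transmitted over the air; the covariance reconstruction then happens entirely at the BS.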
Then, the CCM $\bar{\mathbf{R}}_k$ can be estimated through ICNET$_B$.

Since more images may more accurately reflect the user location and the surrounding environment, the performance of SICCME depends on the number and camera angles of the scene images in $\mathcal{I}$. We here illustrate an example where the user takes four pictures at a certain location along the positive X-axis, negative X-axis, positive Y-axis, and negative Y-axis of plane $A$, respectively. As shown in Fig. 6, we concatenate the four images in the dimension of the RGB channels to obtain a 12-channel image and then normalize each channel of the 12-channel image to generate the input feature of ICNET$_U$. Specifically, the set $\mathcal{I}$ contains four images $\mathbf{F}_i \in \mathbb{R}^{f_H \times f_W \times 3}$, $i = 1, 2, 3, 4$, with RGB channels. We then obtain a 12-channel image $\mathbf{F}$ from $[\mathbf{F}]_{:,:,(3i-2):3i} = \mathbf{F}_i$, $i = 1, 2, 3, 4$.

Fig. 6. The input feature generation of ICNET$_U$ from the photos taken at a user location.

To enhance the learning performance of ICNET, we utilize the min-max way to normalize the $j$th channel of $\mathbf{F}$ as
$$\hat{\mathbf{F}}_j = \frac{[\mathbf{F}]_{:,:,j} - F_{\min,j}}{F_{\max,j} - F_{\min,j}}, \quad j = 1, 2, \cdots, 12, \qquad (18)$$
where $F_{\max,j}$/$F_{\min,j}$ is the maximum/minimum pixel value in the $j$th channel of the concatenated image $\mathbf{F}$ over all of ICNET's training samples. Then, we utilize the normalized image $\hat{\mathbf{F}}$ as the input feature, where $[\hat{\mathbf{F}}]_{:,:,j} = \hat{\mathbf{F}}_j$, $j = 1, 2, \cdots, 12$. The design of the output of ICNET is the same as that of LCNET, as shown in (13), and the strategy to guarantee the positive semi-definiteness of the covariance acquired by ICNET is the same as in ULCCME.

V. SIMULATION RESULTS
A. Performance Metric
The normalized mean-square error (NMSE) of CCM estimation is used to evaluate the learning accuracy of ULCCME and SICCME, defined as
$$\mathrm{NMSE}_R = \frac{\sum_{k=1}^{K} \|\mathbf{R}_k - \hat{\mathbf{R}}_k\|_F^2}{\sum_{k=1}^{K} \|\mathbf{R}_k\|_F^2}, \qquad (19)$$
where $\hat{\mathbf{R}}_k$ is the estimate of $\mathbf{R}_k$, $\forall k$, through LCNET or ICNET.

After obtaining $\hat{\mathbf{R}}_k$, the linear minimum mean-square error (LMMSE) channel estimation [30] is adopted within any CCT. By defining the eigen-decomposition of $\hat{\mathbf{R}}_k$ as $\hat{\mathbf{R}}_k = \mathbf{\Gamma}\mathbf{\Lambda}\mathbf{\Gamma}^H$, we can design the pilot signal matrix for one COCT as
$$\mathbf{P}_{\mathrm{pilot}}(\hat{\mathbf{R}}_k) = \sigma\sqrt{N_U}\, \mathbf{\Gamma}\big[([\mu\mathbf{I} - \mathbf{\Lambda}^{-1}]^{+})^{\frac{1}{2}},\ \mathbf{0}_{N_B \times (M_p - N_B)}\big]\mathbf{U}, \qquad (20)$$
where $\mathbf{U} \in \mathbb{C}^{M_p \times M_p}$ can be any unitary matrix and $M_p$ is the affordable length of the pilot signals. Moreover, $\mu$ can be derived from the well-known water-filling approach [31] to satisfy the transmit energy constraint $\mathrm{tr}(\mathbf{P}_{\mathrm{pilot}}(\hat{\mathbf{R}}_k)\mathbf{P}^H_{\mathrm{pilot}}(\hat{\mathbf{R}}_k)) = P$. Then, the LMMSE channel estimator at the receiver for the $k$th COCT is expressed as
$$\mathbf{A}_{\mathrm{opt}}(\hat{\mathbf{R}}_k) = (\mathbf{P}^H_{\mathrm{pilot}}\hat{\mathbf{R}}_k\mathbf{P}_{\mathrm{pilot}} + \sigma^2\mathbf{I})^{-1}\mathbf{P}^H_{\mathrm{pilot}}\hat{\mathbf{R}}_k. \qquad (21)$$
The NMSE of channel estimation is similarly defined as
$$\mathrm{NMSE}_H = \frac{\sum_{i=1}^{K} \mathrm{E}\{\|\tilde{\mathbf{h}}_i - \hat{\mathbf{h}}_i\|^2\}}{\sum_{i=1}^{K} \|\tilde{\mathbf{h}}_i\|^2}, \qquad (22)$$
where $\hat{\mathbf{h}}_i$ is the estimate of $\tilde{\mathbf{h}}_i$.

Moreover, the location estimation accuracy of LENET is evaluated by the root mean-square error (RMSE), expressed as
$$\mathrm{RMSE}_L = \sqrt{\frac{1}{VQ}\sum_{i=1}^{V}\sum_{j=1}^{Q} \|\mathbf{n}_{T,i,j}\|^2}. \qquad (23)$$

B. Simulation Setup
To evaluate ULCCME and SICCME, we resort to Wireless Insite [32], a ray tracing software, for channel generation so as to ensure a reasonable correlation between the channel and the environmental parameters, like the user location, user trajectory, and the scene images. The simulation steps are illustrated as follows:
1) Channel Generation:
We adopt the Rosslyn city model in Wireless Insite as the 3D scene model of the BS's surrounding environment, as shown in Fig. 7. The BS coverage area $A$ is set as the selected rectangular area shown in Fig. 8. The coordinates of the four vertices of $A$ are $(100, \cdot, \cdot)$, $(130, \cdot, \cdot)$, $(100, \cdot, \cdot)$, and $(130, \cdot, \cdot)$, respectively, and the coordinates of the BS are set to $(75, \cdot, \cdot)$. For a user at any location in $A$, Wireless Insite can produce all the parameters involved in the corresponding channel (2).
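The channel construction in (2), with the UPA steering vectors of (3)-(5), can be assembled from the per-path parameters that a ray tracer exports. The numpy sketch below uses random stand-ins for the path gains and angles that Wireless Insite would actually produce:

```python
import numpy as np

def a_ele(theta, n_ele, d_over_lam):
    # Elevation steering vector, eq. (4)
    k = np.arange(n_ele)
    return np.exp(1j * 2 * np.pi * d_over_lam * k * np.cos(theta))

def a_az(theta, phi, n_az, d_over_lam):
    # Azimuth steering vector, eq. (5)
    k = np.arange(n_az)
    return np.exp(1j * 2 * np.pi * d_over_lam * k * np.sin(theta) * np.sin(phi))

def a_t(theta, phi, n_ele, n_az, d_over_lam=0.5):
    # UPA steering vector via the Kronecker product, eq. (3)
    return np.kron(a_az(theta, phi, n_az, d_over_lam),
                   a_ele(theta, n_ele, d_over_lam)) / np.sqrt(n_ele * n_az)

def channel(alphas, thetas, phis, n_ele, n_az):
    # Geometric channel, eq. (2): sum of per-path gain times steering vector
    return sum(a * a_t(th, ph, n_ele, n_az)
               for a, th, ph in zip(alphas, thetas, phis))

# Random stand-ins for the L path gains/angles a ray tracer would export
rng = np.random.default_rng(2)
L = 5
alphas = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)
thetas = rng.uniform(0, np.pi, L)         # elevation angles of departure
phis = rng.uniform(-np.pi, np.pi, L)      # azimuth angles of departure
h = channel(alphas, thetas, phis, n_ele=4, n_az=8)  # N_B = 32 channel vector
```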
2) Scene Image Generation:
Though the Rosslyn city model in Wireless Insite is set as the BS's surrounding environment, the original 3D model file does not have the necessary texture information, so the color information of the buildings cannot be reflected. The missing texture information causes a serious lack of visual distinction between the buildings in the original city model and makes the scene images unrealistic. We therefore use Blender [33], a 3D modeling software, to approximately replace the original buildings in the Rosslyn city model with buildings rendered with proper textures. To reduce the repetition of the textures, only some buildings around the BS coverage area are replaced by the rendered buildings. For example, as shown in Fig. 9, the buildings $B_1$, $B_2$, $B_3$, $B_4$ are selected to be replaced with the rendered buildings. Then, in Blender, the images to be taken by users can be simulated by the strategy in Section IV.

(Footnote: Ray tracing software has been adopted in much other research [10]-[19].)

Fig. 8. The simulated BS coverage area $A$.

Fig. 9. The city model with the rendered buildings.
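Looking ahead to the input-feature pipeline of Section IV, the four rendered views can be concatenated along the channel axis and min-max normalized as in (18). A small numpy sketch with random stand-in images (and, for brevity, per-image rather than training-set-wide min-max statistics):

```python
import numpy as np

rng = np.random.default_rng(3)
f_H, f_W = 16, 16   # toy image resolution for illustration

# Four RGB views (positive/negative X and Y directions), as rendered by Blender
views = [rng.uniform(0, 255, size=(f_H, f_W, 3)) for _ in range(4)]

# Stack along the channel axis into one 12-channel image F
F = np.concatenate(views, axis=2)            # shape (f_H, f_W, 12)

# Min-max statistics per channel; in the paper these are computed over the
# whole training set, here over this single image only
F_min = F.min(axis=(0, 1))
F_max = F.max(axis=(0, 1))

# Channel-wise min-max normalization, eq. (18)
F_hat = (F - F_min) / (F_max - F_min)
```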
3) Trajectory Generation:
As shown in Section III, the user speed is assumed constant during a COCT. The user moving trajectory affects the actual probability distribution of $\mathbf{x}_u(T^C_{k,q})$, as well as the performance of the proposed covariance estimation methods. Here, we consider both a constant trajectory and a dynamic trajectory of the user to evaluate the channel estimation performance of ULCCME and SICCME, as shown in Fig. 10.

Fig. 10. (a) The $m$th simulated user trajectory with the constant direction pattern for the $n$th COCT. (b) The $m$th simulated user trajectory with the inconstant direction pattern for the $n$th COCT.

Specifically, we simulate $N_{\mathrm{tra}}$ trajectories for the user. Each simulated trajectory contains $N_{\mathrm{co}}$ COCTs. For the constant trajectory case, the user's moving direction does not change during one COCT. Hence, the $n$th COCT of the $m$th trajectory of the user, denoted as $\mathbf{x}_{u,m}(t)$, $t \in [T^S_n, T^S_{n+1}]$, is a straight line. The speed $v_{m,n}$ and the moving direction of the trajectory $\mathbf{x}_{u,m}(t)$, $t \in [T^S_n, T^S_{n+1}]$, are generated from $[v_L, v_U]$ m/s and $[0, 2\pi]$ through independent uniform sampling, respectively, for any $m$ and $n$. For the dynamic trajectory case, the user's moving direction changes at the beginning of each CCT in the $n$th COCT of the $m$th trajectory, for any $m$ and $n$. The moving direction of the trajectory $\mathbf{x}_{u,m}(t)$, $t \in [T^S_n, T^C_{n,1}]$, is generated from $[0, 2\pi]$ by independent uniform sampling, and then the moving direction changes at $t = T^C_{n,q}$, $q = 1, 2, \cdots, N$. The angle change $\theta_{n,q}$ of the user moving direction at $t = T^C_{n,q}$ obeys the truncated Gaussian distribution over $(-\pi, \pi)$ with $\mathcal{N}(0, \sigma^2)$. Hence, the trajectory $\mathbf{x}_{u,m}(t)$, $t \in [T^S_n, T^S_{n+1}]$, is a piecewise linear line. The speed $v_{m,n}$ of $\mathbf{x}_{u,m}(t)$, $t \in [T^S_n, T^S_{n+1}]$, is also generated from $[v_L, v_U]$ m/s through independent uniform sampling, for any $m$ and $n$.
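The two trajectory patterns above can be sketched as follows; the numeric values ($v_L$, $v_U$, $T_c$, the truncation std) are illustrative stand-ins, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_trajectory(x0, v, T_c, N, dynamic=False, sigma_ang=0.5):
    # Piecewise-linear 2D trajectory over one COCT containing N CCTs.
    # Constant pattern: one uniformly drawn direction for the whole COCT.
    # Dynamic pattern: the direction is perturbed at each CCT boundary by
    # a truncated-Gaussian angle increment over (-pi, pi).
    pts = [np.asarray(x0, dtype=float)]
    ang = rng.uniform(0.0, 2.0 * np.pi)       # initial moving direction
    for q in range(N):
        if dynamic and q > 0:
            d_ang = rng.normal(0.0, sigma_ang)
            while abs(d_ang) >= np.pi:        # truncate by resampling
                d_ang = rng.normal(0.0, sigma_ang)
            ang += d_ang
        pts.append(pts[-1] + v * T_c * np.array([np.cos(ang), np.sin(ang)]))
    return np.array(pts)                      # (N+1, 2) waypoints

v = rng.uniform(1.0, 3.0)                     # speed drawn from [v_L, v_U]
traj_const = simulate_trajectory([110.0, 10.0], v, T_c=0.05, N=10)
traj_dyn = simulate_trajectory([110.0, 10.0], v, T_c=0.05, N=10, dynamic=True)
```

Note that both patterns move exactly $v T_c$ per CCT, so either trajectory stays inside the moving region bound (9).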
4) Training Dataset Generation:
To generate the training dataset of LCNET, we select numerous user locations from $A$ with the uniform distribution. For each selected user location, we further choose numerous values from $[v_L, v_U]$ m/s as the user speed with the uniform distribution. Then, for each user location and user speed, the output of LCNET is the vector shown in (13) that is constructed by the covariance calculated through (10), (11), and (12).

Note that the integral in (11) cannot be calculated exactly, as the channels of infinitely many location points would need to be acquired. Hence, for a selected user location $\mathbf{x}_s$ and a user moving speed $v_s$, we only choose finitely many points from the estimated user moving region $\hat{C}(\mathbf{x}_s, v_s T_N)$ to approximate $\psi(\mathbf{x}_s, v_s)$ in (11). Specifically, let us express $\psi(\mathbf{x}_s, v_s)$ as
$$\psi(\mathbf{x}_s, v_s) = \int_0^{2\pi}\int_0^{v_s T_N} w(\rho, \theta)\, \mathcal{M}[\mathbf{c}(\mathbf{x}_s, \rho, \theta)] \mathcal{M}^H[\mathbf{c}(\mathbf{x}_s, \rho, \theta)]\, \rho \, \mathrm{d}\rho \, \mathrm{d}\theta, \qquad (24)$$
where
$$w(\rho, \theta) = \frac{1}{N}\sum_{q=1}^{N} f_q(\rho, \theta), \quad f_q(\rho, \theta) = \begin{cases} \frac{1}{\pi v_s^2 T_q^2}, & \rho \leq v_s T_q \\ 0, & \text{otherwise} \end{cases}. \qquad (25)$$
Note that $w(\rho, \theta)$ is a probability density function of $\mathbf{x} = \mathbf{c}(\mathbf{x}_s, \rho, \theta)$ over $\hat{C}(\mathbf{x}_s, v_s T_N)$. When only finitely many points are chosen from $\hat{C}(\mathbf{x}_s, v_s T_N)$ to form the set $G$, the discrete probability distribution corresponding to these sampled points is given by
$$P\{\mathbf{x} = \hat{\mathbf{x}}_s\} = \frac{w(\|\hat{\mathbf{x}}_s - \mathbf{x}_s\|, \theta)}{\sum_{\hat{\mathbf{x}} \in G} w(\|\hat{\mathbf{x}} - \mathbf{x}_s\|, \theta)}, \quad \forall \hat{\mathbf{x}}_s \in G, \ \forall \theta, \qquad (26)$$
and can be used as an approximation of $w(\rho, \theta)$. Thus, $\psi(\mathbf{x}_s, v_s)$ can be approximately calculated as
$$\hat{\psi}(\mathbf{x}_s, v_s) = \sum_{\hat{\mathbf{x}}_s \in G} P\{\mathbf{x} = \hat{\mathbf{x}}_s\} \mathcal{M}(\hat{\mathbf{x}}_s) \mathcal{M}^H(\hat{\mathbf{x}}_s). \qquad (27)$$
In the simulation, we uniformly choose $N_p \gg 1$ points $\mathbf{p}_{s1}, \mathbf{p}_{s2}, \cdots, \mathbf{p}_{sN_p}$ from $A$ and generate the channels of all these $N_p$ points with Wireless Insite. Then, we select the points contained in $\hat{C}(\mathbf{x}_s, v_s T_N)$ from $\mathbf{p}_{s1}, \mathbf{p}_{s2}, \cdots, \mathbf{p}_{sN_p}$ to form the set $G$ and calculate (27). It is worth mentioning that the coordinate space (CS) shown in Fig. 6 is adopted for determining the coordinates of the user location, where the coordinate space's origin is set as a vertex of $A$, and $A$ lies in the first quadrant of the coordinate space. The input feature of the user location is the user plane coordinates $[\mathbf{x}]_{1:2}$ under the CS.

To train LENET, we also select numerous user locations from $A$ with the uniform distribution and generate the corresponding channels of these selected locations to produce the input features of LENET. The plane coordinates of these selected locations are set as the output labels of LENET. By utilizing the corresponding channels $\mathbf{h}_{t,i}$, $i = 1, 2, \cdots, V_t$, of these selected locations, we can approximately estimate $\zeta$ as $\zeta = \frac{1}{V_t}\sum_{i=1}^{V_t} \|\mathbf{h}_{t,i}\|$.

For each training sample, four scene images are taken at the corresponding user location by Blender to generate the input feature $\hat{\mathbf{F}}$ of ICNET, as shown in Section IV, and the output label of ICNET is the same as the label of this sample. Then, the datasets of ICNET and LCNET have the same number of samples. The image resolution in our simulations is taken as $\cdot \times \cdot$.

Note that, though the target output covariance is $\mathbf{R}_i$ for the $i$th training sample of LCNET or ICNET, the practically adopted output label of the $i$th training sample is the vector (13) constructed by the normalized covariance $\frac{N_B S \mathbf{R}_i}{\sum_{i=1}^{S} \mathrm{tr}(\mathbf{R}_i)}$ to improve the learning performance of the DNNs, where $S$ is the number of training samples. Thus, the covariance output from LCNET or ICNET should be scaled by the coefficient $\frac{\sum_{i=1}^{S} \mathrm{tr}(\mathbf{R}_i)}{N_B S}$ to recover the original CCM.

TABLE I. CRITICAL PARAMETERS OF WIRELESS INSITE FOR RAY TRACING
Parameter                        | Value
---------------------------------|---------
Carrier Frequency                | 2.4 GHz
Propagation Model                | X3D
Building Material                | Concrete
Maximum Number of Reflections    | 6
Maximum Number of Diffractions   | 1
Maximum Paths Per Receiver Point | 50

TABLE II: The Parameters of the Convolutional/Pooling Layers of ICNET$_U$

Layer (in order) | Kernel/Pool Size | Strides | Filters
-----------------|------------------|---------|--------
Convolutional    | (5, 5)           | (2, 2)  | 4
Pooling          | (3, 2)           | (5, 5)  | None
Convolutional    | (3, 3)           | (2, 2)  | 8
Pooling          | (2, 2)           | (2, 2)  | None
Convolutional    | (3, 3)           | (2, 2)  | 12
Pooling          | (2, 2)           | (2, 2)  | None
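For concreteness, the sampled approximation (27) of the covariance integral can be sketched as follows. Here `toy_M` is a hypothetical stand-in for the map $M(\cdot)$ (in the simulations this quantity comes from the ray-traced channels), and the constant weight function used in the usage example is an illustrative choice, not the density in (25).

```python
import numpy as np

def toy_M(x, N_B=12):
    # Hypothetical stand-in for M(.): a location-dependent complex vector.
    # In the paper's setup, M is obtained from the channel at location x.
    phase = (np.arange(N_B)[:, None] * np.asarray(x)[None, :]).sum(axis=1)
    return np.exp(1j * phase)[:, None]          # shape (N_B, 1)

def approx_psi(x_s, v_s, T_N, candidates, weight_fn, N_B=12):
    """Discrete approximation (27): keep the candidate points that fall in
    the disk of radius v_s*T_N around x_s, normalize their weights into the
    discrete distribution (26), and sum the weighted outer products."""
    d = np.linalg.norm(candidates - np.asarray(x_s), axis=1)
    mask = d <= v_s * T_N
    G = candidates[mask]                        # sampled points in the region
    w = np.array([weight_fn(r) for r in d[mask]])
    p = w / w.sum()                             # discrete probabilities (26)
    psi = np.zeros((N_B, N_B), dtype=complex)
    for p_i, x_hat in zip(p, G):
        M = toy_M(x_hat, N_B)
        psi += p_i * (M @ M.conj().T)           # weighted M(x) M(x)^H terms
    return psi
```

With a uniform grid of candidate points and a constant weight, the returned matrix is Hermitian and, since the discrete probabilities sum to one, its trace equals $N_B$ for the unit-modulus `toy_M` used here.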
5) Simulation Parameters:
The dimensions of the array at the BS are taken as $N_B^{ele} = 3$ and $N_B^{az} = 4$, and the critical parameters of Wireless InSite used to generate the channels are given in Table I. $T_o = T_u + T_e$ and $T_c$ are both set to 0.005 s. $N_{tra} = 20$ trajectories are generated, and each trajectory contains $N_{co} = 10$ COCTs. $N$ and $N_p$ are set to 50 and 22500, respectively. For each generated trajectory, the user speed is uniformly selected from $[v_L, v_U] = [2, 10]$ m/s. $Q$ and $M_p$ are set to 20 and 60, respectively. $\tilde{\sigma}$ is set to satisfy $N_B\tilde{\sigma}/\mathbb{E}\{\|\mathbf{h}\|\} = 10^{-}$, where $\mathbb{E}\{\|\mathbf{h}\|\}$ is approximated by $\frac{1}{V_t}\sum_{i=1}^{V_t}\|\mathbf{h}_{t,i}\|$.

For LCNET, the node numbers of the layers of the subnetwork for user location before the concatenate layer are set to 2, 50 and 100, respectively, and the node numbers of the layers of the subnetwork for user speed are set to 1, 20 and 50, respectively. The node numbers of the fully connected layers beginning with the concatenate layer are set to 150, 200, 200, 150, 150 and 144, respectively. ICNET$_U$ has 9 layers: three 2D convolutional layers, three 2D average pooling layers, and three final fully connected layers with 100, 50 and 2 nodes, respectively. The parameters of the convolutional and pooling layers of ICNET$_U$ are shown in Table II. ICNET$_B$ has the same structure as LCNET. The node numbers of the layers of LENET are set to 24, 50, 100, 100, 50 and 2, respectively. The activation function and optimizer adopted for LCNET, LENET and ICNET are ReLU and Adam, respectively.

The training and test sets of LCNET and ICNET are constructed from 20000 and 1000 selected user locations, respectively. For each selected user location in the training set, 40 speed values are selected from $[v_L, v_U]$ m/s; for each selected user location in the test set, 5 speed values are selected from $[v_L, v_U]$ m/s. The training and test sets of LENET include $V = 90000$ samples and $V_t = 9000$ samples, respectively.
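As an illustration of the LCNET layout just described, the following sketch wires up the two input branches and the fully connected trunk with random, untrained weights; the layer sizes follow the text, while the weight initialization and the linear (non-ReLU) output layer are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(x, n_out):
    # Fully connected layer with ReLU; random weights stand in for training.
    W = 0.1 * rng.standard_normal((n_out, x.size))
    return np.maximum(W @ x, 0.0)

def lcnet_forward(location, speed):
    """Two-branch LCNET sketch: a 2-50-100 location branch and a 1-20-50
    speed branch are concatenated (150 nodes), then passed through
    200-200-150-150 hidden layers and a 144-node output layer."""
    a = relu_layer(relu_layer(np.asarray(location, float), 50), 100)
    b = relu_layer(relu_layer(np.asarray([speed], float), 20), 50)
    h = np.concatenate([a, b])                  # concatenate layer
    for n in (200, 200, 150, 150):
        h = relu_layer(h, n)
    W_out = 0.1 * rng.standard_normal((144, h.size))
    return W_out @ h                            # vectorized CCM label (13)
```

The 144-node output matches the vectorized covariance of the $N_B = N_B^{ele} N_B^{az} = 12$ element array, i.e., $N_B^2 = 144$ real entries.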
Fig. 11. NMSE$_R$ of ULCCME and SICCME with different input training set sizes.
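The label normalization described in the dataset-generation subsection, i.e., scaling each $\mathbf{R}_i$ by $N_B S / \sum_{j=1}^{S}\mathrm{tr}(\mathbf{R}_j)$ before training and undoing the scale at the network output, can be sketched as:

```python
import numpy as np

def normalize_labels(R_list, N_B):
    """Scale every covariance by N_B*S / sum_j tr(R_j) so that the training
    labels have average trace N_B; returns the list and the common factor."""
    S = len(R_list)
    c = N_B * S / sum(np.trace(R).real for R in R_list)
    return [c * R for R in R_list], c

def recover_ccm(R_out, c):
    # Undo the normalization on a network output to recover the CCM scale.
    return R_out / c
```

By construction, the normalized labels satisfy $\sum_i \mathrm{tr}(\mathbf{R}'_i) = N_B S$, so their average trace is exactly $N_B$, which keeps the regression targets on a consistent scale across samples.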
Fig. 12. NMSE$_R$ of ULCCME and SICCME under different $\sigma_c$.

C. Results and Discussions
We first analyze the learning accuracies of ULCCME and SICCME under different training set sizes in Fig. 11. As the input proportion of training samples increases, the NMSE$_R$ of both ULCCME and SICCME decreases, while ULCCME remains better than SICCME. This is not unexpected, since ULCCME relies on more accurate location information than SICCME does. Nevertheless, the gap is not severe, which demonstrates the effectiveness of relying purely on scene images.

We further consider the influence of the estimation errors of the uploaded user location and user speed on the CCM estimation accuracy. We assume the estimation errors of user location and user speed follow the Gaussian distributions $\mathcal{N}(0, \sigma_c^2)$ and $\mathcal{N}(0, \sigma_v^2)$, respectively. If the coordinates of the noisy location fall outside the area $\mathcal{A}$, we correct them in the following way: denote the coordinate of any one dimension of the uploaded location as $x$, and define the minimum/maximum bound coordinates of $\mathcal{A}$ at the dimension corresponding to $x$
as $x_L/x_U$. Since the actual user location must be in $\mathcal{A}$, the coordinate value $x$ is corrected to $\min(\max(x, x_L), x_U)$. The same principle is used to correct the uploaded user speed whenever its value falls outside the interval $[v_L, v_U]$ m/s.

Fig. 13. NMSE$_R$ of ULCCME and SICCME under different $\sigma_v$.

We then demonstrate the NMSE$_R$ of ULCCME and SICCME under different location noise levels $\sigma_c$ in Fig. 12. It is seen that the NMSE$_R$ of SICCME remains constant over $\sigma_c$. The reason is that SICCME depends on the environment images and is therefore unaffected by the location noise. It is also seen that the NMSE$_R$ of ULCCME deteriorates as $\sigma_c$ increases and becomes worse than that of SICCME once $\sigma_c$ exceeds the crossover point shown in Fig. 12. Hence, in scenarios with serious location noise, SICCME will achieve better performance than ULCCME. As shown in Fig. 13, the NMSE$_R$ of ULCCME and SICCME under different $\sigma_v$ is plotted. It is seen that the NMSE$_R$ of both ULCCME and SICCME deteriorates as $\sigma_v$ increases. It is also seen that the performance of ULCCME and SICCME is more robust to the speed noise than to the location noise: even when $\sigma_v = 2$ m/s, the NMSE$_R$ of ULCCME and SICCME increases only slightly. Therefore, in the following simulations, we mainly demonstrate the performance of ULCCME and SICCME under location noise.

As the performance of the proposed location denoising method depends on the location estimation accuracy of LENET, we plot the RMSE$_L$ of LENET under different input proportions of the training set in Fig. 14. It is seen that LENET achieves an optimal RMSE$_L$ slightly above 2 m when the overall training set is used. It is also seen that the decrease of RMSE$_L$ slows down as more training samples are used; thus, only half of the dataset is needed to achieve an RMSE$_L$ that approaches the optimal location estimation performance.

Fig. 14. RMSE$_L$ of LENET with different input training set sizes.

We then compare the channel estimation performance with the CCM obtained from ULCCME and SICCME, as well as from two traditional methods [30], in Fig. 15 and Fig. 16, respectively. The first traditional method is the least squares (LS) channel estimator, while the second utilizes the normalized identity matrix $\frac{\sum_{i=1}^{S}\mathrm{tr}(\mathbf{R}_i)}{N_B S}\mathbf{I}_{N_B}$ as the CCM to design the pilot signal matrix as well as to perform LMMSE channel estimation. We also present the channel estimation performance of ULCCME with/without the location denoising and of SICCME. Moreover, the channel estimation performance achieved by the perfect CCM $\frac{1}{N}\sum_{q=1}^{N} M[\mathbf{x}_{u,m}(T^C_{n,q})]\, M^H[\mathbf{x}_{u,m}(T^C_{n,q})]$ is also provided as a benchmark. The SNR is defined as
$$\mathrm{SNR} = \frac{P M_p}{\sigma^2 N_{tra} N_{co} N} \sum_{m=1}^{N_{tra}}\sum_{n=1}^{N_{co}}\sum_{q=1}^{N} \|\check{\mathbf{h}}_{m,n,q}\|^2, \quad (28)$$
where $\check{\mathbf{h}}_{m,n,q}$ is the channel of the $q$th CCT within the $n$th COCT of the $m$th simulated trajectory.

Fig. 15. NMSE$_H$ of different covariance estimation methods for the constant trajectory case.

Fig. 16. NMSE$_H$ of different covariance estimation methods for the dynamic trajectory case.

As shown in Fig. 15 and Fig. 16, the NMSE$_H$ of the ULCCME and SICCME methods is better than that of the two traditional methods, even with location noise of standard deviation $\sigma_c = 2$ m. This indicates that, though the accurate probability distribution of $\mathbf{x}_u(T^C_{k,q})$ is replaced by the uniform distribution to simplify the analysis, the proposed CCM estimation methods can still outperform the traditional methods in both trajectory cases. Meanwhile, although the optimal RMSE$_L$ of LENET, i.e., the standard deviation of its location estimation error, is worse than the standard deviation $\sigma_c = 2$ m of the location noise, the proposed location denoising method can still effectively improve the channel estimation performance of the ULCCME method with the aid of LENET. It is also seen that, when there is no location noise, the channel estimation performance of the SICCME method is worse than that of the ULCCME method, because ULCCME achieves more accurate CCM estimation than SICCME. However, for the case with location noise $\sigma_c = 2$ m, the channel estimation performance achieved by ULCCME becomes worse than that of SICCME. Moreover, the channel estimation with the proposed ULCCME and SICCME methods approaches the optimal channel estimation achieved by the perfect covariances at relatively low SNR.

To demonstrate the effectiveness of the location denoising method, we plot the channel estimation performance of the ULCCME method with/without the location denoising under different $\sigma_c$ in Fig. 17 and Fig. 18, respectively. It is seen that the performance improvement from location denoising is greater for larger location noise and lower SNR, which indicates that the proposed location denoising method is most suitable for scenarios with serious positioning noise and low SNR.

VI. CONCLUSION
We have proposed two deep learning based methods for CCM estimation that utilize environmental information, e.g., user location and scene images, and thereby avoid the need for prior channel samples. A location denoising method is also designed to improve the robustness of ULCCME by jointly analyzing the estimation error of the trained DNN and the positioning error of the user. In fact, if the user motion exhibits certain regularity, then a more accurate probability density for the user location can be designed to further improve the effectiveness of ULCCME. Simulation results show that the proposed ULCCME and SICCME are effective and achieve better channel estimation than the traditional methods. Moreover, the proposed SICCME method has a significant performance gain over the ULCCME method when the location noise is large, which also motivates us to further explore scene image information in the field of wireless communications.

Fig. 17. NMSE$_H$ of the location based method with/without location denoising under different $\sigma_c$ for the constant trajectory case.

Fig. 18. NMSE$_H$ of the location based method with/without location denoising under different $\sigma_c$ for the dynamic trajectory case.

VII. FUTURE WORK
Though many dynamic objects exist in practical communication environments, in this paper we only consider the impact of the surrounding static buildings, as one of the very first attempts to merge image processing into wireless communication. In fact, with dynamic objects, we may obtain the statistical channel $\mathbb{E}\{\mathbf{h}_{user}\}$ at location $\mathbf{p}_{user}$ and assume there exists a one-to-one mapping from location to statistical channel, which can be expressed as
$$\mathbf{p}_{user} \ \xrightleftharpoons[\hat{M}^{-1}(\cdot)]{\hat{M}(\cdot)} \ \mathbf{h}_{stats} = \mathbb{E}\{\mathbf{h}_{user}\}. \quad (29)$$
Then, the proposed CCM estimation methods can be generalized to dynamic environments. However, the major obstacle that prevents us from doing so is the availability of a large dataset that contains the channels for various distributions of dynamic objects. To the best of the authors' knowledge, there is no ray tracing software that could help generate such a channel set.

REFERENCES

[1] H. Xie, F. Gao, S. Jin, J. Fang, and Y. Liang, "Channel estimation for TDD/FDD massive MIMO systems with channel covariance computing,"
IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4206–4218, Jun. 2018.
[2] W. Roh et al., "Millimeter-wave beamforming as an enabling technology for 5G cellular communications: Theoretical feasibility and prototype results," IEEE Commun. Mag., vol. 52, no. 2, pp. 106–113, Feb. 2014.
[3] S. Han, C.-L. I, Z. Xu, and C. Rowell, "Large-scale antenna systems with hybrid analog and digital beamforming for millimeter wave 5G," IEEE Commun. Mag., vol. 53, no. 1, pp. 186–194, Jan. 2015.
[4] A. Alkhateeb, O. El Ayach, G. Leus, and R. W. Heath, Jr., "Channel estimation and hybrid precoding for millimeter wave cellular systems," IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 831–846, Oct. 2014.
[5] Y.-C. Liang and F. P. S. Chin, "Downlink channel covariance matrix (DCCM) estimation and its applications in wireless DS-CDMA systems," IEEE J. Sel. Areas Commun., vol. 19, no. 2, pp. 222–232, Feb. 2001.
[6] B. K. Chalise, L. Haering, and A. Czylwik, "Robust uplink to downlink spatial covariance matrix transformation for downlink beamforming," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2004, vol. 5, pp. 3010–3014.
[7] A. Decurninge, M. Guillaud, and D. T. M. Slock, "Channel covariance estimation in massive MIMO frequency division duplex systems," in Proc. IEEE Globecom Workshop, 2015, pp. 1–6.
[8] S. Park and R. W. Heath, "Spatial channel covariance estimation for the hybrid MIMO architecture: A compressive sensing-based approach," IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8047–8062, Dec. 2018.
[9] A. Ali, N. González-Prelcic, and R. W. Heath, "Spatial covariance estimation for millimeter wave hybrid systems using out-of-band information," IEEE Trans. Wireless Commun., vol. 18, no. 12, pp. 5471–5485, Dec. 2019.
[10] X. Li, A. Alkhateeb, and C. Tepedelenlioğlu, "Generative adversarial estimation of channel covariance in vehicular millimeter wave systems," in Proc. Asilomar Conf. Signals, Syst., Comput., Pacific Grove, CA, USA, 2018, pp. 1572–1576.
[11] Y. Wang, A. Klautau, M. Ribero, A. C. K. Soong, and R. W. Heath, "MmWave vehicular beam selection with situational awareness using machine learning," IEEE Access, vol. 7, pp. 87479–87493, 2019.
[12] K. Satyanarayana, M. El-Hajjar, A. A. M. Mourad, and L. Hanzo, "Deep learning aided fingerprint-based beam alignment for mmWave vehicular communication," IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 10858–10871, Nov. 2019.
[13] J. C. Aviles and A. Kouki, "Position-aided mm-wave beam training under NLOS conditions," IEEE Access, vol. 4, pp. 8703–8714, 2016.
[14] A. Klautau, N. González-Prelcic, and R. W. Heath, "LIDAR data for deep learning-based mmWave beam-selection," IEEE Wireless Commun. Lett., vol. 8, no. 3, pp. 909–912, 2019.
[15] M. Dias, A. Klautau, N. González-Prelcic, and R. W. Heath, "Position and LIDAR-aided mmWave beam selection using deep learning," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Cannes, France, 2019, pp. 1–5.
[16] W. Xu, F. Gao, S. Jin, and A. Alkhateeb, "3D scene based beam selection for mmWave communications," IEEE Wireless Commun. Lett., 2020.
[17] M. Alrabeiah, A. Hredzak, Z. Liu, and A. Alkhateeb, "ViWi: A deep learning dataset framework for vision-aided wireless communications," in Proc. IEEE Veh. Technol. Conf. (VTC2020-Spring), Antwerp, Belgium, 2020, pp. 1–5.
[18] M. Alrabeiah, J. Booth, A. Hredzak, and A. Alkhateeb, "ViWi vision-aided mmWave beam tracking: Dataset, task, and baseline solutions," arXiv preprint arXiv:2002.02445, 2020.
[19] G. Charan, M. Alrabeiah, and A. Alkhateeb, "Vision-aided dynamic blockage prediction for 6G wireless communication networks," arXiv preprint arXiv:2006.09902, 2020.
[20] T. Rappaport, Y. Xing, G. R. MacCartney, Jr., A. F. Molisch, E. Mellios, and J. Zhang, "Overview of millimeter wave communications for fifth-generation (5G) wireless networks-with a focus on propagation models," IEEE Trans. Antennas Propag., vol. 65, no. 12, pp. 6213–6230, Dec. 2017.
[21] J. Zhang, "The interdisciplinary research of big data and wireless channel: A cluster-nuclei based channel model," China Commun., vol. 13, no. S2, pp. 14–26, 2016.
[22] Y. Zhang, M. Alrabeiah, and A. Alkhateeb, "Learning beam codebooks with neural networks: Towards environment-aware mmWave MIMO," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Atlanta, GA, USA, 2020, pp. 1–5.
[23] A. M. Sayeed, "Deconstructing multiantenna fading channels," IEEE Trans. Signal Processing, vol. 50, no. 10, pp. 2563–2579, Oct. 2002.
[24] M. Alrabeiah and A. Alkhateeb, "Deep learning for TDD and FDD massive MIMO: Mapping channels in space and frequency," arXiv preprint arXiv:1905.03761, 2019.
[25] P. Misra and P. Enge, Global Positioning System: Signals, Measurements and Performance, 2nd ed. Lincoln, MA: Ganga-Jamuna Press, 2004.
[26] G. Seco-Granados, J. Lopez-Salcedo, D. Jimenez-Banos, and G. Lopez-Risueno, "Challenges in indoor global navigation satellite systems: Unveiling its core features in signal processing," IEEE Signal Process. Mag., vol. 29, no. 2, pp. 108–131, Mar. 2012.
[27] Z. R. Zaidi and B. L. Mark, "Real-time mobility tracking algorithms for cellular networks based on Kalman filtering," IEEE Trans. Mobile Comput., vol. 4, no. 2, pp. 195–208, Apr. 2005.
[28] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, 2017, pp. 6526–6534.
[29] L. Liang, H. Peng, G. Y. Li, and X. Shen, "Vehicular communications: A physical layer perspective," IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10647–10659, Dec. 2017.
[30] M. Biguesh and A. B. Gershman, "Training based MIMO channel estimation: A study of estimator tradeoffs and optimal training signals," IEEE Trans. Signal Processing, vol. 54, pp. 884–893, Mar. 2006.
[31] F. Gao, T. Cui, and A. Nallanathan, "On channel estimation and optimal training design for amplify and forward relay network,"