High Dimensional Channel Estimation Using Deep Generative Networks
Eren Balevi, Akash Doshi, Ajil Jalal, Alexandros Dimakis, Jeffrey G. Andrews
Abstract
This paper presents a novel compressed sensing (CS) approach to high dimensional wireless channel estimation by optimizing the input to a deep generative network. Channel estimation using generative networks relies on the assumption that the reconstructed channel lies in the range of a generative model. Channel reconstruction using generative priors outperforms conventional CS techniques and requires fewer pilots. It also eliminates the need for a priori knowledge of the sparsifying basis, instead using the structure captured by the deep generative model as a prior. Using this prior, we also perform channel estimation from one-bit quantized pilot measurements, and propose a novel optimization objective function that attempts to maximize the correlation between the received signal and the generator's channel estimate while minimizing the rank of the channel estimate. Our approach significantly outperforms sparse signal recovery methods such as Orthogonal Matching Pursuit (OMP) and Approximate Message Passing (AMP) algorithms such as EM-GM-AMP for narrowband mmWave channel reconstruction, and its execution time is not noticeably affected by an increase in the number of received pilot symbols.
Index Terms

MIMO channel estimation, Generative Adversarial Networks (GAN), compressed sensing, one-bit receivers

The authors are with the University of Texas at Austin, TX, USA. Contact Author Email: [email protected]. This work was supported in part by Intel. This paper was presented in part at the IEEE Signal Processing Advances in Wireless Communications Workshop, May 2020, in the Special Session for Machine Learning in Communications [1].
I. INTRODUCTION
A. Motivation
To meet the demand for extremely high bit rates and much lower energy consumption per bit, future wireless systems are trending to bandwidths larger than 1 GHz and carrier frequencies above 100 GHz. As an example, a future communication system (6G and beyond) may operate at a carrier frequency approaching 300 GHz with well over 10,000 cross-polarized antenna elements at each transceiver, and antenna spacings on the order of 1-2 mm [2], [3]. For channel estimation in many-antenna systems, the number of pilots is typically assumed to be larger than the number of transmit antennas, resulting in significant training overhead that does not scale well to such high dimensional future communication systems. Using sparsity with a compressed sensing method to alleviate this problem leads to solving a complex optimization problem at every coherence interval, whose complexity scales with the number of antennas and will become infeasible. Thus, existing approaches to channel estimation will not scale to this regime in terms of complexity, power consumption, or pilot overhead, and fundamentally new methods are needed. The key to simplifying channel estimation in such high dimensional systems is to exploit stronger prior knowledge of the channel structure. In this paper we propose a novel unsupervised learning-based approach using deep generative networks for channel estimation.
B. Related Work
Traditional training-based channel estimators such as least-squares (LS) are optimum Maximum Likelihood estimators for rich multipath channels. Furthermore, for Gaussian signal recovery with a known correlation matrix, minimum mean-squared error (MMSE) estimators find the signal estimate x that maximizes the a posteriori probability p(x|y) and outperform LS [4]. However, recent channel measurements conducted for mmWave and THz cellular systems have indicated that, due to clustering of the paths into small, relatively narrowbeam clusters, high dimensional channels are often very sparse in their beamspace representation [2] or their spatial covariance matrix is low rank [5]. Among the first papers to highlight the need for exploiting these sparse structures, which LS and MMSE cannot exploit, was [6], which exploited channel sparsity in the beamspace representation of a multi-antenna channel to formulate channel estimation as a CS problem, while [7] also highlighted how to exploit sparsity in the delay-Doppler domain. MmWave channel estimation is made difficult by the low received SNR due to high omnidirectional path loss, and to combat this path loss, large antenna arrays are used to obtain beamforming gain. In [8], a sparse formulation of the mmWave channel estimation problem was given by expressing the sensing matrix Ψ as a function of the transmit and receive antenna array response vectors, in addition to the training precoders and combiners. An open loop strategy for downlink mmWave channel estimation and design of precoders/combiners that minimize the coherence of Ψ, while incorporating hybrid constraints at the transceiver, was presented in [9], enabling reconstruction from a small number of measurements. In [8] and [9], Orthogonal Matching Pursuit (L0 norm minimization) and Basis Pursuit Denoising (L1 norm minimization) were employed for sparse channel reconstruction.
Approximate Message Passing (AMP) is another robust class of techniques for compressed sensing [10], and variants such as EM-GM-AMP [11] and VAMP [12] outperform OMP and BPDN for a large class of sensing matrices. AMP has been widely advocated for MIMO channel estimation in the research community, especially for low resolution receivers [13], [14]. AMP has even been extended to adaptively learn the clustered structure in the angle-delay domain in [15].

However, real world channels are never exactly sparse in the DFT basis, nor do we know the basis that would yield the most sparse representation. Moreover, all these techniques involve solving a complex optimization problem at each coherence interval, and require a large number of pilots, especially in low resolution receivers. These are some of the reasons why CS-based methods are still not employed in conventional WiFi receivers for channel estimation, which typically employ LS channel estimation with frequency domain smoothing that leverages the coherence bandwidth of the channel [16].

Meanwhile, there has been a rapid advancement in the application of techniques from deep learning to channel estimation for massive MIMO and mmWave systems. One of the approaches taken was to perform Joint Channel Estimation and Signal Detection (JCESD) [17], [18], thus performing channel estimation implicitly. Not recovering the channel estimate prevents precoder and/or combiner optimization, and these techniques call for extensive signal processing changes at the transceiver. One obvious way to recover the channel estimate is to train a Neural Network (NN) in a supervised manner, such that it is trained to take as input the pilot measurements and output the channel matrix. This approach is taken in many recent papers [19]–[23]. In particular, [19] also appends the LS channel estimate of the current and previous received pilot signal to the NN's input to improve its performance.
In [20], a variant of the AMP technique called LDAMP is unfolded into a NN by making the parameters of LDAMP learnable. Exploiting the inherent structure in a spatial channel matrix, which makes its estimation analogous to image reconstruction, [21] and [22] employ Convolutional Neural Networks (CNNs) in place of Fully Connected NNs to learn a channel estimator. A novel refinement called SIP-DNN was proposed in [23], which chose to estimate the channel at all the receive antennas using only the signal received by the high-resolution ADC antennas.

However, building such labeled channel datasets for a supervised task is time-consuming, and most of these techniques would not perform well if the received signal were corrupted by hardware impairments and/or transient effects such as shadow fading. A few recent papers have used techniques from unsupervised learning to overcome this limitation of having to build a huge labeled dataset. In [24] and [25], an LS estimator is combined with an underparameterized CNN-based denoiser called Deep Decoder [26] to exploit correlation in the channel estimate to improve its quality. In [27], an autoencoder is trained to learn a compressed representation for the channel that could immensely reduce channel state information (CSI) feedback overhead in massive MIMO, while [28] trains a CRNet with the same objective in multi-resolution receivers.

C. Contributions
In summary, most of the proposed deep learning techniques are discriminative, meaning that a priori information is not exploited (as opposed to generative models), and they often call for drastic changes in transceiver signal processing, while the existing signal processing techniques are designed for sparse signal recovery. As mentioned before, there is no explicit way to determine the basis that will generate the sparse channel representation with the fewest non-zero entries, which would allow perfect recovery for a wider range of sensing matrices with the same number of measurements. This is where compressed sensing using generative models proves useful. By finding an approximate solution in the span of a generative model, [29] shows how to achieve CS guarantees without employing sparsity. The authors of [29] present a simple gradient descent based algorithm that enables signal recovery for inherently sparse or structured signals from compressed measurements by exploiting the prior learnt by a generative model. In this paper, we draw inspiration from the approach presented in [29] to perform the estimation of high dimensional wireless channels from compressive pilot measurements. Our contributions are elaborated below:
Training a GAN to learn the channel distribution:
The underlying probability distribution of spatial channel matrices for a particular environment can be very complex and analytically intractable. We describe how to train a Wasserstein GAN [30] using a set of simulated channel realizations, such that it learns a generator model that is capable of drawing samples from the underlying channel distribution.
Full resolution channel estimation:
The trained generative model will output channel realizations for different input vectors. We describe a procedure to find the optimal input vector such that we can use the prior of the trained generator to find the channel estimate from a low number of noisy pilot measurements. Moreover, the optimization problem defined by the generative network operates in a low-dimensional subspace, whose dimensionality is independent of the number of received pilots, and achieves a significant reduction in computational complexity. Simultaneously, our technique also helps to develop a channel representation that drastically reduces CSI feedback overhead, and learns a prior that enables it to significantly outperform conventional CS techniques, without knowledge of the sparsifying basis.
One-bit quantized channel estimation:
We design a custom loss function that aims to find the channel estimate, in the range of the generator's output, that has low rank while maximizing correlation with the received one-bit measurements. We compare its performance with state-of-the-art CS techniques such as EM-GM-AMP [11], and find that it significantly improves the quality of the channel estimate, while still requiring only a limited number of pilots. We validate the improvement in the channel estimate by evaluating the throughput for a hybrid precoded data transmission, where the RF and baseband precoders were designed using OMP [31].

The paper is organized as follows. The system model is outlined in Section II. The generative channel estimator is explained in detail in Section III, the NN architecture details, simulation benchmarks and results are outlined in Sections IV and V respectively, and the conclusions are highlighted in Section VI.

II. SYSTEM MODEL
A. Training based channel estimation
Consider a point-to-point downlink (DL) MIMO setup, where the base station (BS) is equipped with N_t transmit antennas and the User Equipment (UE) is equipped with N_r receive antennas. For simplicity, the exposition that follows considers only a single narrowband frequency channel, but it can easily be extended to multiple (N_f > 1) subcarriers. We consider hybrid beamformers and combiners, and present a training-based channel estimation approach.

In the DL channel estimation phase, the BS uses a training beamformer p ∈ C^(N_t × 1) to transmit a symbol s ∈ C. To simplify analysis, we set s = 1 in all experiments, but retain it in the equations for ease of understanding. The UE employs N_r RF chains, hence for each beamforming vector p, N_r measurements are produced at the UE. We assume that the training combiner q_i ∈ C^(N_r × 1), i ∈ [N_r], is a 1-sparse vector with a 1 at the i-th position. As explained in [9], the number of measurements per time instant at the UE does not depend on the number of RF chains employed at the BS. The total number of measurements is M = N_r N_p, where N_p is the number of distinct beamforming vectors p employed by the BS during training. We denote this sequence as P = [p_1 ... p_{N_p}] ∈ C^(N_t × N_p). It is assumed that the channel coherence time is greater than N_p T, where T is the symbol period, hence the spatial channel matrix H ∈ C^(N_r × N_t) remains constant over the N_p time slots. Hence the received training signal Y ∈ C^(N_r × N_p) at the UE can be written as [8]

Y = HPs + N,   (1)

where the elements of N ∈ C^(N_r × N_p) are independent and identically distributed complex Gaussian random variables with mean 0 and variance σ². To obtain more compact expressions, the matrices are rewritten as vectors by concatenating their columns, yielding

y = vec(HP)s + n,   (2)

where y, vec(HP), n ∈ C^(N_r N_p × 1).
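As an illustrative numerical check of the measurement model (1)-(2), the following NumPy sketch simulates the received training signal with small, hypothetical dimensions and QPSK training beamformers, and verifies the standard column-stacking identity vec(HP) = (P^T ⊗ I_{N_r}) vec(H) that underlies the compact vectorized form:

```python
import numpy as np

rng = np.random.default_rng(0)
Nr, Nt, Np = 4, 8, 6          # hypothetical small dimensions for illustration
s, sigma = 1.0, 0.1           # pilot symbol s = 1, noise standard deviation

# Spatial channel H and training beamformer sequence P (i.i.d. QPSK entries).
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
P = (rng.choice([-1, 1], (Nt, Np)) + 1j * rng.choice([-1, 1], (Nt, Np))) / np.sqrt(2)

# (1): received training signal, then (2): column-stacked vector form.
N = sigma * (rng.standard_normal((Nr, Np)) + 1j * rng.standard_normal((Nr, Np)))
Y = H @ P * s + N
y = Y.reshape(-1, order="F")  # vec(Y): stack the columns of Y

# vec(HP) = (P^T kron I_Nr) vec(H): the identity behind the compact form.
lhs = (H @ P).reshape(-1, order="F")
rhs = np.kron(P.T, np.eye(Nr)) @ H.reshape(-1, order="F")
assert np.allclose(lhs, rhs)
```

Note the `order="F"` (column-major) reshape, which matches the column-concatenation convention used in the text.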
Writing HP as I_{N_r} H P, and utilizing the expansion vec(ABC) = (C^T ⊗ A) vec(B), we have

y = (P^T ⊗ I_{N_r}) vec(H) s + n,   (3)

where (·)^T denotes the transpose operator and ⊗ denotes the Kronecker product. Clearly, the system of equations represented by (3) does not have a unique solution if N_p < N_t. In other words, the LS channel estimate ĥ = vec(Ĥ) given by

ĥ = argmin_{h ∈ C^(N_r N_t × 1)} || y − (P^T ⊗ I_{N_r}) h s ||²   (4)

has multiple solutions. Thus, in the low density pilot regime, one cannot directly use LS channel estimation. If H is inherently sparse or structured in a known basis, this can be exploited by CS algorithms, as explained as part of the baselines for comparison in Section IV-E. The above notation also extends easily to the case where the received signal is quantized, with (3) being rewritten as

y = Q_n((P^T ⊗ I_{N_r}) vec(H) s + n),   (5)

where Q_n denotes the n-bit quantization operator.

B. Hybrid Precoding for Data Transmission
In Section II-A, we presented a training-based channel estimation approach, hence the training beamformers used were random sequences of QPSK symbols (one can also use a random subset of the columns of the DFT matrix or Zadoff-Chu sequences). Now we move from the training stage to the data transmission phase, where, to obtain a higher throughput, one performs optimization of the precoder matrices F_RF and F_BB at the BS, in which P = F_RF F_BB. To achieve this, the channel estimate recovered at the UE is conveyed to the BS, to maximize the information-theoretic capacity, while incorporating the hardware and power constraints imposed on the entries of F_RF and F_BB. As outlined in [31], we utilize spatially sparse precoding via Orthogonal Matching Pursuit to find the optimum F_RF and F_BB and evaluate the throughput.

III. GENERATIVE CHANNEL ESTIMATOR
A. Training a GAN to learn the channel distribution
We use Generative Adversarial Networks (GANs) for training a generative model. Despite the extensive recent application of deep learning to wireless communications, few communication papers have employed GANs, owing to their perceived training instability [32]. In [33], the authors proposed the use of variational GANs to accurately learn the channel distribution. However, they restricted themselves to additive noise, and did not consider fading or MIMO. In [34], a conditional GAN is trained to output the received signal when the transmitted signal and the received pilot information are appended to the input of the GAN. However, when extending it to fading channels, they assumed that the real channel response was available as input to the GAN. Moreover, none of these papers exploit the compressed representation that the generator of a GAN learns for a given output signal. We now give an overview of the training procedure for GANs in the context of spatial channel matrix generation.

A GAN [35] consists of two feed-forward neural networks, a generator G(z; θ_g) and a discriminator D(x; θ_d), engaging in an iterative two-player minimax game with the value function V(G, D):

min_G max_D V(G, D) = E_{x∼P_r(x)}[h_D(D(x; θ_d))] + E_{z∼P_z(z)}[h_G(D(G(z; θ_g); θ_d))],   (6)

where G(z) represents a mapping from the input noise variable z ∼ P_z(z) to the data space x ∼ P_r(x), while D(x) represents the probability that x came from the data rather than G. The exact form of h(·) depends on the choice of loss function. In [35], h_D(D(x)) = log D(x) whereas h_G(D(G(z))) = log(1 − D(G(z))). On the other hand, in the Wasserstein GAN proposed in [30], h_D(D(x)) = D(x) and h_G(D(G(z))) = −D(G(z)). Given z ∈ R^d and G(z) ∈ R^n, typically z ∼ N(0, I_d) and d ≪ n.
For example, when a GAN is trained on an image dataset, d can be 100, while n = 64 × 64 × 3 (where 64 represents the image height and width in pixels and 3 represents the RGB color triplet). G is said to implicitly learn the distribution P_g (stored in its weights θ_g), which on convergence should approach P_r.

Since the seminal paper [35], numerous variants of GAN have been published, differing in the architecture and/or training procedure of G and D, or in the loss function used for penalizing the output of D [36], [30]. However, GANs are known to be difficult to train, one of the reasons being that they are subject to mode collapse: that is, they learn to characterize only a few modes of the distribution [32]. The objective of training a GAN is that by varying the weights θ_g and θ_d of G(z; θ_g) and D(x; θ_d), we want P_g → P_r. In [30], the Wasserstein-1 (EM) distance is shown to be much weaker than the KL (Kullback-Leibler) or JS (Jensen-Shannon) divergences, such that simple sequences of probability distributions converge under EM but not under KL or JS. Using the continuous and differentiable EM distance as the loss function for the output of D during training, together with weight clipping, eliminates the need for a careful balance in the training of D and G and in the design of the NN architecture. It also drastically reduces mode collapse, since we can train D to optimality. Hence in this paper, we employ the Wasserstein GAN [30] for learning the spatial channel distribution. An outline of the procedure for training a Wasserstein GAN in the context of spatial channel matrix generation is given in Alg. 1 (adapted from [30]) and depicted in Fig. 1.

Algorithm 1:
Minibatch stochastic gradient descent training of Wasserstein GANs for spatial channel matrix generation, with n_d = 5 and c = 0.01. D should output 1 for a true channel realization x ∼ P_r(x) and 0 for a generated fake channel realization G(z) ∼ P_g when z is sampled from P_z.

for number of training iterations do
    for n_d iterations do
        • Sample a minibatch of m noise samples {z_1, ..., z_m} ∼ P_z. Update D by ascending its stochastic gradient ∇_{θ_d} (1/m) Σ_{i=1}^{m} (−D(G(z_i)))
        • Sample a minibatch of m channel realizations {x_1, ..., x_m} ∼ P_r. Update D by ascending its stochastic gradient ∇_{θ_d} (1/m) Σ_{i=1}^{m} D(x_i)
        • θ_d = clip(θ_d, −c, c)
    end
    Sample a minibatch of m noise samples {z_1, ..., z_m} ∼ P_z. Update G by descending its stochastic gradient ∇_{θ_g} (1/m) Σ_{i=1}^{m} (−D(G(z_i)))
end

A set of probability distributions P_n is said to converge to P_∞ under a distance metric ρ if ρ(P_n, P_∞) → 0 as n → ∞. By "weaker", we mean that the set of convergent sequences under EM is a superset of the sequences convergent under KL or JS. The original paper [30] refers to the discriminator as the critic and uses n_critic = 5, which we refer to as n_d.

Fig. 1. Training a GAN for spatial channel matrices
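The loop structure of Alg. 1 can be sketched in PyTorch. This is a minimal sketch only: the tiny fully connected G and D, the data dimensions, and the Gaussian stand-in for the channel dataset are illustrative assumptions, not the architecture or data used in this paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 8, 8                  # latent and flattened-channel dimensions (hypothetical)
n_d, c, m = 5, 0.01, 32      # critic steps per G step, clip value, minibatch size

G = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, n))
D = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, 1))  # critic: no sigmoid
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def sample_real(m):
    # Stand-in for channel realizations drawn from the training set.
    return torch.randn(m, n)

for it in range(10):
    for _ in range(n_d):                   # train the critic n_d times per G update
        x, z = sample_real(m), torch.randn(m, d)
        # Descending this loss ascends D(x) - D(G(z)), the EM-distance surrogate.
        loss_d = -(D(x).mean() - D(G(z).detach()).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        with torch.no_grad():              # weight clipping, as in Alg. 1
            for p in D.parameters():
                p.clamp_(-c, c)
    z = torch.randn(m, d)
    loss_g = -D(G(z)).mean()               # generator ascends D(G(z))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The RMSProp optimizer and learning rate mirror the training parameters reported later in Table II; everything else is a placeholder.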
B. CS-based channel estimation using generative networks
Consider a noisy compressive measurement y of an image x∗ such that y = Ax∗ + n. A simple gradient descent based approach for compressed sensing using generative networks was proposed in [29] to find the low dimensional representation z∗ of the given input image x∗ such that the reconstructed image G(z∗) has small measurement error ||y − AG(z)||². While this is a non-convex objective to optimize (since G(z) is a non-convex function of z), gradient descent was found empirically to work well. To reconstruct the image, [29] solves the following optimization problem:

z∗ = arg min_z f(y, AG(z)),   (7)

where y is the vector of received samples, G is a generative model, A is a measurement matrix, and f is a loss function. For example, we could have f(y, AG(z)) = ||y − AG(z)||². Here, we minimize the loss function over the input variable z to the generator. The reconstructed image is then G(z∗). As long as gradient descent finds a good approximate solution to (7), [29] gives a theoretical proof showing that G(z∗) will be almost as close to the true x∗ as the closest possible point in the range of G, when the entries of the sensing matrix A are sub-Gaussian. (A random variable X ∈ R is said to be sub-Gaussian with variance proxy σ² if E[X] = 0 and its moment generating function satisfies E[exp(sX)] ≤ exp(σ²s²/2) for all s ∈ R.)

To adapt the framework presented in [29] for channel estimation, we first train a Wasserstein GAN [30] using a set of realistic channel realizations H (details of the channel parameters are presented in Section IV) as defined in (1). We then extract the trained generator G. The trained generator, having implicitly learned the underlying probability distribution of the channel matrices, will output channel realizations G(z) for a given L2-norm bounded input vector z. In the testing phase, we will be given the noisy pilot measurements y as defined in (3).
We consider two possible cases: when the measurements are full resolution and when they are one-bit quantized. For each case, we have heuristically developed loss functions that define the optimization problem to be solved at every coherence interval using gradient descent. An illustration of the framework is shown in Fig. 2 and the approach is summarized in Alg. 2. We refer to this framework as the Generative Channel Estimator (GCE).

Fig. 2. Generative Channel Estimator Framework
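As a concrete illustration of the GCE recovery loop summarized in Alg. 2, the following PyTorch sketch minimizes the measurement error ||y − AG(z)||² over the latent input of a frozen toy generator. The generator, dimensions, step size, regularization weight, and the real-valued measurement matrix are all illustrative assumptions, not the trained WGAN or sensing matrix of this paper.

```python
import torch

torch.manual_seed(0)
d, Nr, Nt, M = 8, 4, 8, 24   # latent dim, antennas, measurement count (illustrative)

# Frozen toy "generator" standing in for the trained WGAN generator G.
G = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, Nr * Nt))
for p in G.parameters():
    p.requires_grad_(False)            # generator weights stay fixed during GCE

A = torch.randn(M, Nr * Nt)            # real-valued stand-in for P^T kron I_Nr
y = A @ G(torch.randn(d))              # noiseless pilots of a channel in range(G)

lam = 0.01                             # regularization weight (placeholder)
z = torch.randn(d, requires_grad=True) # initial point sampled from P_z
opt = torch.optim.Adam([z], lr=0.1)
start = float(((y - A @ G(z)) ** 2).sum())
for _ in range(200):                   # gradient descent over the latent z only
    loss = ((y - A @ G(z)) ** 2).sum() + lam * (z ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

h_hat = G(z).detach().reshape(Nr, Nt)  # reconstructed channel estimate G(z*)
end = float(((y - A @ G(z).detach()) ** 2).sum())
```

Note that only the d-dimensional latent vector is optimized, which is why the per-coherence-interval cost of GCE does not grow with the number of received pilots.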
Full Resolution Channel Estimation:
Replacing the sensing matrix A by P^T ⊗ I_{N_r} as derived in (3), and imposing an L2 bound on z via regularization, we attempt to solve the following non-convex optimization problem:

z∗ = arg min_{z ∈ R^d} || y − (P^T ⊗ I_{N_r}) G(z) s ||² + λ_reg ||z||²,   (8)

where d is the dimension of the input vector to the GAN and λ_reg serves as a regularization parameter. The reconstructed channel estimate is then simply G(z∗). Note that the entries in the training precoder P were chosen i.i.d. from QPSK symbols. As a consequence, all the entries of the matrix A = P^T ⊗ I_{N_r} are bounded (being either 0 or QPSK symbols) with mean 0, and from Hoeffding's Lemma applied separately to the real and imaginary parts, it follows that each entry of A will be sub-Gaussian.

Quantized Channel Estimation:
We now consider the case where the received signal is 1-bit quantized. As a result, MIMO channel estimation, even in the noiseless setting with sufficient pilot symbols, is an under-determined problem. In [37], the low-rank nature of mmWave channels (due to clustering in the propagation environment) is exploited to constrain the space of channel estimates to matrices H with low nuclear norm ||H||_∗ (a relaxation of the low-rank constraint). In [38], the authors solve the same optimization problem as (7) with the measurements y being one-bit quantized, and, under certain assumptions on the measurement matrix (A in (7)) and the architecture of the GAN, design a custom loss function to solve for z∗. We draw inspiration from the approaches taken in [37] and [38] to design the following non-convex optimization problem for recovery in the one-bit setting:

z∗ = arg min_{z ∈ R^d} −λ_reg Σ_{i=1}^{N_p N_r} Q(y_i) ⟨(P^T ⊗ I_{N_r})_i, G(z)⟩ s + ||G(z)||_∗.   (9)

This heuristically designed loss function attempts to minimize the nuclear norm ||G(z)||_∗ of the output of the generator G(z) while maximizing the correlation between Q(y) (which is ±1) and ⟨(P^T ⊗ I_{N_r}), G(z)⟩. The summation in (9) should be interpreted as the sum over the real and imaginary parts separately,

Σ_{i=1}^{N_p N_r} Q(y_{i,real}) ⟨(P^T ⊗ I_{N_r})_{i,real}, G(z)_real⟩ s + Σ_{i=1}^{N_p N_r} Q(y_{i,imag}) ⟨(P^T ⊗ I_{N_r})_{i,imag}, G(z)_imag⟩ s.   (10)

C. Beamforming using the Generative Channel Estimator
Having recovered the channel estimate G(z∗) from the compressed pilot measurements at the UE, we now use this channel estimate to design the optimum RF and baseband precoders F_RF and F_BB. The optimal latent input vector z∗ of the generator G provides a compressed representation of the channel. If we could convey the weights and architecture of the generator from the UE to the BS during the initial access phase, then in subsequent data transmissions, the CSI overhead would be considerably reduced. At every coherence time, we would simply feed back z∗ to the BS and use G(z∗) as the channel estimate to design the precoder matrices F_RF and F_BB.

(Hoeffding's Lemma states that for any random variable X with E[X] = 0 such that a ≤ X ≤ b w.p. 1, for all s ∈ R, E[exp(sX)] ≤ exp(s²(b − a)²/8). Hence X is sub-Gaussian with variance proxy (b − a)²/4.)

Algorithm 2: Channel Estimation using Deep Generative Networks
1. Train a GAN using a set of realistic channel realizations.
2. Extract the trained generator G(z).
3. Given the noisy pilot measurements y, reconstruct the channel y encodes by solving the following optimization problem using gradient descent:
   • For full resolution pilot measurements: z∗ = arg min_{z ∈ R^d} || y − (P^T ⊗ I_{N_r}) G(z) s ||² + λ_reg ||z||²
   • For quantized pilot measurements: z∗ = arg min_{z ∈ R^d} −λ_reg Σ_{i=1}^{N_p N_r} Q(y_i) ⟨(P^T ⊗ I_{N_r})_i, G(z)⟩ s + ||G(z)||_∗
   The initial point z_0 for gradient descent is sampled from P_z.
4. The reconstructed channel estimate is then G(z∗), which is of dimensions N_r × N_t.

IV. SIMULATION DETAILS & BENCHMARKS
The performance metric is the normalized mean square error (NMSE), defined as

NMSE = E[ ||H − Ĥ||² / ||H||² ],   (11)

where H and Ĥ are column vectors that specify the actual and estimated channel taps in the frequency domain over all antennas, respectively.

A. Data Generation
Channel realizations have been generated using the 5G Toolbox in MATLAB in accordance with the 3GPP specification TR 38.901. The channel simulation parameters are listed in Table I. In order to generate structure in the channel realizations, some degree of correlation is required between neighbouring antennas at the BS and the UE. To generate this correlation, the antenna element spacing in the uniform linear arrays (ULA) at the BS and UE was assumed to be smaller than the conventional λ/2. This reduced antenna spacing is a crucial assumption, and we will justify its requirement in Section V-D. Each channel realization generated in MATLAB was of dimension (N_f, N_sym, N_r, N_t), the first and second dimensions being the number of subcarriers and the number of OFDM symbols respectively. To focus on exploitation of the spatial structure of the channel matrices, we simply extract the (N_r, N_t) matrix corresponding to the first subcarrier and first OFDM symbol for the purpose of these simulations.

TABLE I: Channel Simulation Parameters
Delay Profile: CDL-D
Subcarrier Spacing: 15 kHz
Sampling Rate: 15.36 MHz
Carrier Frequency: 40 GHz
Delay Spread: 30 ns
Doppler Shift: 5 Hz

B. Data Pre-processing
Note that G(z) has dimensions (N_t, N_r, 2), where the last dimension corresponds to the real and imaginary parts. Thus, in the training dataset, H has to be split into its real and imaginary parts and concatenated to obtain H_G ∈ R^(N_r × N_t × 2), while G(z) has to be reshaped as a complex-valued matrix before being utilized for optimization in (8) or (9). Before using the data for training the GAN, we normalize the channel matrices H_G ∈ R^(N_r × N_t × 2) element-wise as

μ_i = E[H_{G_i}],   σ_i² = E[(H_{G_i} − μ_i)²],   (12)

H_{G_i,norm} = (H_{G_i} − μ_i) / σ_i,   (13)

where i ∈ [2 N_t N_r] and the subscript i is used to denote the i-th element in the array. While testing, we do not have access to the element-wise mean and variance, hence we continue to use the training mean and variance. This implies that G(z) in (8) is replaced by

G(z)_i ← μ_i + σ_i G(z)_i.   (14)

We performed a simulation to ascertain the impact of this artifact, and found it to be negligible. The need for normalization arises from empirical evidence that the GAN is unable to learn mean-shifted distributions [32].

C. NN Architecture of Generator
The GAN was implemented in Keras and PyTorch, with the basic implementation available online. The generator and discriminator employed in the Wasserstein GAN were deep convolutional NNs. While the discriminator architecture was adopted from [30], the generator was fine-tuned to improve its ability to learn the underlying probability distribution, and its architecture is described next.

The generator G takes an input z ∈ R^d, passes it through a dense layer, and reshapes the result to a tensor of size (N_t/4, N_r/4, C) for some channel depth C. This latent representation is then passed through k = 2 layers, each consisting of the following units: upsampling, 2D convolution with a kernel size of 4, and batch normalization. At each stage, 2× upsampling is performed, i.e. the input is reshaped from (N_t/n, N_r/n, C) to (2N_t/n, 2N_r/n, C) by replicating the corresponding values. The performance of the generator is sensitive to this choice of upsampling factor, with an upsampling factor of 4 or above preventing the generator from learning the channel distribution. Similarly, a kernel size of 4 corresponds to using a 4 × 4 filter in the first two dimensions to replace each value by a weighted average of the neighboring values within a 4 × 4 square surrounding it. Both upsampling and 2D convolution thus model the local correlations in a spatial channel matrix, with larger upsampling factors and kernel filter sizes corresponding to a greater estimated spatial correlation. The output is finally passed through a 2D convolutional layer with a kernel size of 4 and linear activation to obtain G(z), the N_r × N_t channel estimate.

Keras: https://github.com/fchollet/keras. PyTorch: https://github.com/pytorch/pytorch. Implementation: https://github.com/eriklindernoren/Keras-GAN.

D. GAN Training Details & GCE
The training and test parameters for the Wasserstein GAN are specified in Table II. The generator thus obtained is utilized in the GCE to find the optimal z∗ for each channel realization in the test dataset. To minimize the loss function in (8) or (9), as the case may be, we use two approaches. A derivative-free optimization procedure known as Powell's conjugate direction method [39], with a small relative error tolerance ε, was employed in minimizing (8) and (9) for the generative model trained in Keras, since a trained Keras model does not provide for differentiation of the loss function in (8) with respect to the input vector z. However, as explained in [40], PyTorch allows automatic differentiation, and hence an Adam [41] optimizer with a small learning rate η and an iteration count of 100 is utilized in minimizing (8) for the generative model trained in PyTorch.

TABLE II: GAN Training and Test Parameters
Training dataset size: 3654
Test dataset size: 12
Optimizer: RMSProp
Learning Rate: 0.00005
Batch size: 200
Epochs: 3000

E. Compressed Sensing Based DL Channel Estimation
In this subsection, we describe the baselines used for assessing the performance of GCE. Since we consider the narrowband clustered channel model, we can use the virtual channel model [42] to obtain a sparse representation of the channel matrix in the DFT basis. More specifically, assuming uniformly spaced linear arrays at the transmitter and receiver, the array response matrices are given by the unitary DFT matrices A_T ∈ C^(N_t x N_t) and A_R ∈ C^(N_r x N_r). Then we can represent H in terms of a K-sparse matrix H_v ∈ C^(N_r x N_t) as

H = A_R H_v A_T^H  ⟹  vec(H) = ((A_T^H)^T ⊗ A_R) vec(H_v). (15)

Therefore, the received signal y at the UE is given by

y = ((A_T^H P)^T ⊗ A_R) vec(H_v) s + n. (16)

Denoting A_sp = ((A_T^H P)^T ⊗ A_R), as explained in [9], the reconstruction of the channel can be formulated as a non-convex combinatorial problem

minimize over vec(H_v) ∈ C^(N_r N_t):  ||vec(H_v)||_0  subject to  ||y − A_sp vec(H_v) s||_2 ≤ σ. (17)

A variety of Matching Pursuit (MP) and Approximate Message Passing (AMP) algorithms have been proposed to solve (17). In particular, we consider three approaches:

i) Orthogonal Matching Pursuit (OMP):
We directly solve (17) using OMP, as described in [9]. The stopping criterion for OMP is based on the power of the residual error: we stop when the energy in the residual is smaller than a given threshold, chosen to be σ.

ii) Lasso Baseline:
Consider the L1 convex relaxation of (17), and use Basis Pursuit Denoising to solve this problem. In its Lagrangian form, it can be written as:

minimize over vec(H_v) ∈ C^(N_r N_t):  ||vec(H_v)||_1 + λ_sp ||y − A_sp vec(H_v) s||_2^2. (18)

However, all the norms and matrices involved are complex-valued. Hence, the L1-norm minimization problem gets converted into a second-order conic programming (SOCP) problem [43], and can be solved by standard convex solvers such as CVXPY [44].

iii) EM-GM-AMP:
Approximate Message Passing algorithms such as EM-GM-AMP [11] are well-established Bayesian techniques for sparse signal recovery from noisy compressive linear measurements that are known to hold for a large class of sensing matrices. Using the EM-GM-AMP implementation described in [11], we input y and A_sp and recover the channel estimate H_v, which is then used to recover H using the array response matrices A_T and A_R. It is to be noted that assuming an antenna spacing of λ/10, with the columns of A_T as well as A_R being independent, leads to the entries of H_v being correlated. This correlation is, however, not exploited by EM-GM-AMP. Improved benchmarking comparisons with algorithms such as EMturboGAMP [45], which attempt to exploit structured sparsity in non-i.i.d. signals, are left for future work.

These are the three sparse signal recovery baselines - each requiring knowledge of the sparsifying basis - that we use to assess the performance of the proposed GCE. It should be noted that the beamspace sparsity exploited by the CS algorithms is in no way utilized by GCE. (The maximum number of OMP iterations is set to 100; if OMP is allowed to run further, it fits to the noise at low SNR and the NMSE increases.)

V. RESULTS
The first experiment performed is to determine the optimal latent dimension d of the input z to the generator. Since d is ideally determined in the absence of noise, we fix the SNR at a high value of 40 dB and evaluate the NMSE as a function of the number of pilot measurements N_p for varying values of d, as shown in Fig. 3, using full-resolution measurements.
Fig. 3. NMSE vs. α = N_p/N_t for varying dimension d of the input z to the generator G.

From Fig. 3, we can see that d = 35 appears sufficient at N_p/N_t = 0.4. Increasing the number of pilot measurements N_p/N_t beyond 0.4 does not have any measurable impact on the NMSE. This indicates that further measurements would not improve the accuracy of the channel prediction. More importantly, it highlights that there exists a compressed representation for the channel in an unknown basis: using the optimal latent input vector z* defined in (8), we can recover the channel prediction perfectly without knowing, for example, that the channel is sparse in the DFT basis. We obtain a nearly 50x compressed representation of the channel, with under 40 parameters needed to represent a 16 x 64 channel matrix realization (= 2048 real values). While current mmWave channel estimation techniques focus on the optimal design of training precoders and combiners under the assumption of virtual channel models [42], UIU models [46], and others, the GCE minimizes the need for their optimal design and provides a model-free approach for representing inherently sparse or structured channels. This may prove valuable for future deployments at progressively higher carrier frequencies, where these models may not hold. With d = 35, we now vary the SNR and observe the NMSE vs. SNR for varying α = N_p/N_t in the case of full-resolution and one-bit quantized pilot measurements. The OMP, Lasso and EM-GM-AMP baselines are also plotted.

A. Full Resolution Channel Estimation
As shown in Fig. 4, the GCE offers a large improvement in NMSE over the EM-GM-AMP baseline: at least 5 dB at an SNR of -10 dB, and up to 8 dB at an SNR of 15 dB for the smaller values of α. The GCE's performance also does not change significantly as α increases from 0.4 to 0.75, indicating that the prior learnt by the generator G is informative enough to require only a fraction of the total number of pilots that would have been needed by a well-posed channel estimation problem to reconstruct the channel. Moreover, the improvement in NMSE offered by the GCE decreases as α increases from 0.2 to 1, with the gap between EM-GM-AMP and GCE reduced to 2 dB at an SNR of 15 dB for the largest α. However, at low and medium SNR, the GCE continues to outperform all CS-based methods significantly.
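The GCE estimates discussed above come from descending the loss in (8) over the latent input of a frozen generator. A minimal PyTorch sketch follows; the regularization weight lam, learning rate lr, and the linear toy generator in the usage example are illustrative assumptions, not the paper's exact setup.

```python
import torch

def gce_estimate(G, y, A, d=35, lam=0.001, lr=0.1, iters=500):
    """Sketch of the GCE latent-space search: Adam gradient descent on the
    input z of a frozen generator G, minimizing the regularized measurement
    error ||y - A G(z)||^2 + lam * ||z||^2. lam and lr are illustrative."""
    z = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((y - A @ G(z)) ** 2).sum() + lam * (z ** 2).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return G(z)

# Toy usage: recover a channel lying in the range of a linear "generator"
# from m < n measurements (all sizes illustrative).
torch.manual_seed(0)
d, n, m = 5, 40, 12
W = torch.randn(n, d)
G = lambda z: W @ z                      # stand-in generator, weights frozen
h_true = G(torch.randn(d))               # channel in the range of G
A = torch.randn(m, n) / m ** 0.5
h_hat = gce_estimate(G, A @ h_true, A, d=d)
```

Even with far fewer measurements than unknowns, the search succeeds here because the target lies exactly in the generator's range, which is the modeling assumption behind GCE.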
Fig. 4. NMSE vs. SNR for various values of α = N_p/N_t ∈ {0.2, 0.4, 0.75, 1}. The Lasso curve is omitted for α = 1 since CVXPY [44] takes too long to converge due to the large number of optimization variables.

B. One-bit Quantized Channel Estimation
The NMSE for the case of 1-bit quantized pilot measurements is defined slightly differently, since with one-bit measurements we cannot determine the relative scaling factor for the reconstructed channel matrices:

NMSE = E[ ||H − κĤ||^2 / ||H||^2 ], (19)

where κ = argmin_κ ||H − κĤ||^2 for a given H and Ĥ. Note that though this may seem genie-aided, the precoder optimization that ultimately determines the achievable rate is not affected by this scaling factor. The dependence of NMSE on SNR for one-bit measurements with a varying number of pilots is shown in Fig. 5 and contrasted with the performance of EM-GM-AMP on the same measurements. As one can clearly see, the GCE brings about an immense improvement in NMSE, which can be attributed to the rich prior stored in the weights of the generator.

Fig. 5. NMSE vs. SNR as a function of α = N_p/N_t with one-bit quantization.

C. Hybrid Precoding for Quantized Channel Estimation
To validate the improvement in channel estimate quality postulated in Section V-B, we calculate the spectral efficiency obtained using hybrid precoding in the data transmission phase. We assume N_s = min(N_t, N_r) = 16 and that optimal unconstrained combiners are employed at the UE. The RF and baseband precoders F_RF and F_BB are computed as explained in Section II using OMP. Three different channel estimates are used for designing these precoders: the estimate returned by the GCE, that of the AMP algorithm EM-GM-AMP [11], and the ground truth channel realization (for computing the perfect CSI curve). The spectral efficiency vs. SNR plots are shown in Fig. 6 for varying α. As is evident, the GCE channel estimate enables the design of precoders that support higher-capacity data transmissions than EM-GM-AMP.
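The scale-invariant NMSE of (19) used for these one-bit experiments is cheap to evaluate, since the minimizing κ has a closed form, κ = ⟨Ĥ, H⟩ / ||Ĥ||^2. A small NumPy sketch (allowing a complex-valued κ is a modeling choice here):

```python
import numpy as np

def nmse_one_bit(H, H_hat):
    """Scale-invariant NMSE of (19): the scalar kappa minimizing
    ||H - kappa*H_hat||_F^2 has the closed form <H_hat, H> / ||H_hat||_F^2
    (complex-valued kappa assumed). np.vdot conjugates its first argument,
    so it computes the Frobenius inner product directly."""
    kappa = np.vdot(H_hat, H) / np.vdot(H_hat, H_hat)
    err = np.linalg.norm(H - kappa * H_hat, 'fro') ** 2
    return err / np.linalg.norm(H, 'fro') ** 2

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 64)) + 1j * rng.standard_normal((16, 64))
print(nmse_one_bit(H, 5j * H))  # ~0: any complex rescaling is a perfect estimate
```

Since κ = 0 is always feasible, this NMSE never exceeds 1, which is why it is a fair metric for estimates recovered up to an unknown scale.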
Fig. 6. Spectral efficiency vs. SNR as a function of α = N_p/N_t with one-bit quantization using OMP-based precoding.

D. Explanations & Caveats
The benefit obtained from the GCE is clear in the low pilot density and low SNR regime. As the number of pilot symbols increases, the performance of standard CS-based methods gets closer to that of the GCE, and would be similar for N_p ≥ N_t. At low SNR, the received pilot measurements are of very poor quality, hence CS-based methods do not perform well, but the GCE utilizes its prior to obtain performance that cannot be achieved by the CS-based methods. This is clearly evident in the one-bit quantized case (Fig. 5), where the GCE curves are roughly parallel to the EM-GM-AMP curves, with the constant gap being the generative prior gain. It can be expected that as the number of antennas packed onto a planar array increases with the move toward THz carrier frequencies, sending an adequate number of pilots would lead to an unsustainable overhead, and recovering the channel estimate from an insufficient number of pilots will become critical. While the GCE outperforms the three CS-based methods, it is important to note the following caveats:

High Spatial Correlation:
GCE required a reduced antenna spacing of λ/10, rather than λ/2, to successfully learn the channel distribution. As shown in Fig. 7(a), the singular value profile of a λ/2 channel realization has a higher effective rank than a λ/10 realization, due to its lower spatial correlation. As a consequence, the generator of a GAN trained on λ/2 channel realizations was unable to learn the underlying probability distribution, and the resulting performance of GCE was poor, as shown on the right in Fig. 7(b) for α = 0.75. Since the GAN was originally designed to learn the prior for image datasets, which have extremely high spatial correlation, the GCE was also found to work in a similar domain. However, as thousands of antennas get deployed at a transmitter or receiver due to their tiny size, it is expected that such singular value profiles will become more commonly observed, and only the maximum eigenvector will be needed to achieve capacity in this regime. A recent paper [47] shows how metamaterial antennas can be used for wireless communications, including LTE and WiFi. Conventional antennas that are very small compared to the wavelength reflect most of the signal back to the source. However, a metamaterial antenna steps up the antenna's radiated power and behaves as if it were much larger than its actual size, because its novel structure stores and re-radiates energy, which could lead to the deployment of sub-wavelength antennas.

(a) Singular values of channel realizations in descending order of magnitude
(b) NMSE vs. SNR for the two datasets of channel realizations with antenna spacing λ/2 and λ/10.

Fig. 7. The left figure shows the singular value profile of a channel realization with an antenna spacing of λ/2 and of λ/10. The higher correlation in the λ/10 realization enables the generator to learn a rich prior and the GCE to obtain a significantly lower NMSE, as shown in the figure on the right.

Rich Generative Prior:
The weights θ_g of the generator G(z; θ_g) encode a probability distribution over the space of permissible spatial channel matrices, such that by inputting z, we can draw samples from that distribution. Conventional CS techniques have no such prior knowledge of the distribution of the channel matrices; however, they capitalize on the sparsity of the beamspace representation of the channel, which the GCE does not utilize. The results seem to indicate that the generative prior is much more informative than the sparsifying basis, but we have no means of quantifying this yet. Recent efforts in theoretical machine learning [48] have attempted to quantify the information in the weights of a NN in terms of the impact that perturbing a weight has on the cross-entropy loss. Such work could prove very useful in quantifying the information gain of a generative prior.

Training on Simulated Channel Realizations:
We have currently trained a GAN using simulated channel realizations, since obtaining realistic channel data has not proven possible, even with our industry partners. One can only hope to recover the channel estimate based on pilot measurements from current transceiver chips. It remains to be seen whether the GAN can succeed in learning the channel distribution even from these noisy channel estimates. The original GAN proposed in [35] is known to learn discriminators with poor generalization capabilities, and many recent works [30], [32], [49] have taken different approaches to justify the design of custom objective functions for the discriminator that help the generator better approximate the target distribution and improve the generalization capability of the discriminator.
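The spatial-correlation caveat above can be reproduced qualitatively on synthetic data. The exponential-correlation model below is an illustrative stand-in for the λ/2 vs. λ/10 datasets, not the paper's clustered channel model; the correlation coefficients and the 1% effective-rank threshold are assumptions.

```python
import numpy as np

def correlated_channel(Nr, Nt, rho, rng):
    """Draw an Nr x Nt channel with exponential antenna correlation rho^|i-j|
    at both ends, by coloring i.i.d. complex Gaussian entries with Cholesky
    factors of the correlation matrices. An illustrative model, not the
    paper's clustered channel."""
    R_r = rho ** np.abs(np.subtract.outer(np.arange(Nr), np.arange(Nr)))
    R_t = rho ** np.abs(np.subtract.outer(np.arange(Nt), np.arange(Nt)))
    W = rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))
    return np.linalg.cholesky(R_r) @ W @ np.linalg.cholesky(R_t).T

def effective_rank(H, thresh=0.01):
    """Count singular values above thresh * (largest singular value)."""
    s = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(s >= thresh * s[0]))

rng = np.random.default_rng(0)
H_tight = correlated_channel(16, 64, rho=0.99, rng=rng)  # ~ tight spacing
H_half = correlated_channel(16, 64, rho=0.30, rng=rng)   # ~ half-wavelength
print(effective_rank(H_tight), "<", effective_rank(H_half))
```

The highly correlated draw has a much faster singular-value decay, mirroring the λ/10 profile in Fig. 7(a) that made the channel learnable by the GAN.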
E. Timing Analysis
Using the PyTorch-based generative model, optimization of (8) involves only performing gradient descent with respect to z ∈ R^d, with d = 35 in our case. Hence one would expect each iteration to be computationally inexpensive. To determine the computational advantage of using GCE, we compare its execution time per iteration against the CS baselines; the results are tabulated in Table III. The number of iterations required to achieve the NMSE results in Fig. 4 for each method is also given in Table III. The evaluation of the first three methods was performed on an Intel i9-8950HK CPU @ 2.90GHz. The results for GCE are given both when performed on the Intel i9-8950HK CPU without acceleration and when accelerated using an Nvidia GeForce GTX 2070 GPU. As expected, a GPU speeds up backpropagation through the NN immensely, as required for computing z* in (8).

TABLE III: Comparison of execution time per iteration (in milliseconds) for OMP, Lasso, EM-GM-AMP and GCE on a single channel realization at an SNR of -10 dB.

The most important finding from Table III is that the execution time of GCE is not noticeably affected by the increase in the number of pilot symbols, while the execution time of the CS baselines increases with α. The gradient of (8) with respect to z is given by

∇_z f(y, AG(z)) = 2((∇_z G(z))^T A^T (AG(z) − y) + λ_reg z), (20)

where each row of the Jacobian ∇_z G(z) is d = 35 dimensional. This involves only direct matrix multiplications with A. On the other hand, for OMP, one of the steps involves a least-squares inversion over the columns of A_sp having maximum inner product with y, whose complexity scales as O(N_p^m) for some m > 1. Similarly, the Lasso and EM-GM-AMP optimization problems have complexity scaling with N_p^m. Moreover, as explained in Section IV-E, Lasso involves solving an SOCP, and hence takes much longer than the other algorithms.
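The gradient of the regularized measurement error ||y − AG(z)||^2 + λ_reg ||z||^2, namely 2((∇_z G(z))^T A^T (AG(z) − y) + λ_reg z), can be checked against automatic differentiation. A sketch with a toy differentiable stand-in generator (all sizes and the tanh generator are illustrative):

```python
import torch

# Check the analytic gradient of f(z) = ||y - A G(z)||^2 + lam * ||z||^2,
#   grad_z f = 2 * (J(z)^T A^T (A G(z) - y) + lam * z),
# against PyTorch autograd, where J(z) is the Jacobian of the generator.
torch.manual_seed(0)
d, n, m, lam = 5, 12, 7, 0.01
W = torch.randn(n, d)
G = lambda z: torch.tanh(W @ z)          # toy differentiable generator
A = torch.randn(m, n)
y = torch.randn(m)
z = torch.randn(d, requires_grad=True)

f = ((y - A @ G(z)) ** 2).sum() + lam * (z ** 2).sum()
f.backward()                              # autograd gradient lands in z.grad

J = torch.autograd.functional.jacobian(G, z.detach())   # n x d Jacobian
analytic = 2 * (J.T @ A.T @ (A @ G(z.detach()) - y) + lam * z.detach())
print(torch.allclose(z.grad, analytic, atol=1e-4, rtol=1e-4))  # True
```

Since the gradient is a handful of matrix-vector products with A plus one backward pass through G, its cost is dominated by the generator's weights rather than by N_p, consistent with the timing behavior reported above.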
The impact of N_p on the execution time of GCE will only be seen at much higher values of N_p, unlike the CS-based algorithms, for which the impact of increasing N_p is immediately apparent. Note, however, that the complexity of computing ∇_z G(z) is quite high owing to the large number of weights θ_g in the trained generator G; hence the execution time of GCE is comparable to OMP in the low pilot density regime. (Each entry of a matrix multiplication can be computed in parallel, but is limited by the number of threads available on the CPU/GPU. Matrix inversion cannot be parallelized in the absence of an LU factorization.)

VI. CONCLUSION
We presented a compressed sensing-based channel estimation approach using deep generative networks that achieves a significant performance gain over prior techniques for sparse signal recovery when applied to CDL channel models. Notable aspects of this approach are that it does not require knowledge of the sparsifying basis of the channel, and that it immensely reduces the number of pilots required to achieve the same NMSE as Lasso/OMP/EM-GM-AMP channel estimation, even in the case of one-bit quantized pilot measurements. Importantly, as a consequence of the gradient computation of (8) requiring only matrix multiplications, its execution time is approximately independent of the number of received pilot symbols N_p when N_p is small.

VII. ACKNOWLEDGEMENTS
The authors would like to thank Nitin Myers for discussions on low resolution quantization, and Shilpa Talwar, Nageen Himayat, and Ariela Zeira at Intel for their invaluable support, technical advice and feedback.

REFERENCES

[1] A. Doshi, E. Balevi, and J. G. Andrews, "Compressed representation of high dimensional channels using deep generative networks," in Proc. IEEE Signal Proc. Adv. in Wireless Comm. (SPAWC), May 2020.
[2] T. S. Rappaport, Y. Xing, O. Kanhere, S. Ju, A. Madanayake, S. Mandal, A. Alkhateeb, and G. C. Trichopoulos, "Wireless communications and applications above 100 GHz: Opportunities and challenges for 6G and beyond," IEEE Access, vol. 7, pp. 78729-78757, Jun. 2019.
[3] H. Elayan, O. Amin, R. M. Shubair, and M.-S. Alouini, "Terahertz communication: The opportunities of wireless technology beyond 5G," in IEEE Intl. Conf. on Advanced Comm. Technologies and Networking (CommNet), Apr. 2018, pp. 1-5.
[4] E. Björnson, J. Hoydis, L. Sanguinetti et al., Massive MIMO Networks: Spectral, Energy, and Hardware Efficiency. Foundations and Trends in Signal Processing, Now Publishers, Inc., Nov. 2017.
[5] P. A. Eliasi, S. Rangan, and T. S. Rappaport, "Low-rank spatial channel estimation for millimeter wave cellular systems," IEEE Trans. on Wireless Communications, vol. 16, no. 5, pp. 2748-2759, Apr. 2017.
[6] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, "Compressed channel sensing: A new approach to estimating sparse multipath channels," Proc. IEEE, vol. 98, no. 6, pp. 1058-1076, Jun. 2010.
[7] W. U. Bajwa, A. Sayeed, and R. Nowak, "Sparse multipath channels: Modeling and estimation," in IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop, 2009, pp. 320-325.
[8] A. Alkhateeb, O. El Ayach, G. Leus, and R. W. Heath, "Channel estimation and hybrid precoding for millimeter wave cellular systems," IEEE J. Sel. Topics Sig. Process., vol. 8, no. 5, pp. 831-846, Oct. 2014.
[9] R. Méndez-Rial, C. Rusu, N. González-Prelcic, A. Alkhateeb, and R. W. Heath, "Hybrid MIMO architectures for millimeter wave communications: Phase shifters or switches?" IEEE Access, vol. 4, pp. 247-267, Jan. 2016.
[10] S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in IEEE Intl. Symposium on Information Theory Proceedings, Jul. 2011, pp. 2168-2172.
[11] J. P. Vila and P. Schniter, "Expectation-maximization Gaussian-mixture approximate message passing," IEEE Trans. on Signal Processing, vol. 61, no. 19, pp. 4658-4672, Jul. 2013.
[12] S. Rangan, P. Schniter, and A. K. Fletcher, "Vector approximate message passing," IEEE Trans. on Info. Theory, vol. 65, no. 10, pp. 6664-6684, May 2019.
[13] J. Mo, P. Schniter, N. G. Prelcic, and R. W. Heath, "Channel estimation in millimeter wave MIMO systems with one-bit quantization," in Asilomar Conf. on Signals, Systems and Computers, Nov. 2014, pp. 957-961.
[14] J. Mo, P. Schniter, and R. W. Heath, "Channel estimation in broadband millimeter wave MIMO systems with few-bit ADCs," IEEE Trans. on Signal Processing, vol. 66, no. 5, pp. 1141-1154, Dec. 2017.
[15] X. Lin, S. Wu, C. Jiang, L. Kuang, J. Yan, and L. Hanzo, "Estimation of broadband multiuser millimeter wave massive MIMO-OFDM channels by exploiting their sparse structure," IEEE Trans. on Wireless Communications, vol. 17, no. 6, pp. 3959-3973, June 2018.
[16] D. Katselis, C. R. Rojas, M. Bengtsson, and H. Hjalmarsson, "Frequency smoothing gains in preamble-based channel estimation for multicarrier systems," Signal Processing, vol. 93, no. 9, pp. 2777-2782, Sep. 2013.
[17] H. Ye, G. Y. Li, and B.-H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114-117, Feb. 2018.
[18] H. He, C.-K. Wen, S. Jin, and G. Y. Li, "Model-driven deep learning for joint MIMO channel estimation and signal detection," arXiv preprint arXiv:1907.09439, Feb. 2019.
[19] Y. Yang, F. Gao, X. Ma, and S. Zhang, "Deep learning-based channel estimation for doubly selective fading channels," IEEE Access, vol. 7, pp. 36579-36589, Mar. 2019.
[20] H. He, C.-K. Wen, S. Jin, and G. Y. Li, "Deep learning-based channel estimation for beamspace mmWave massive MIMO systems," IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 852-855, Oct. 2018.
[21] X. Ru, L. Wei, and Y. Xu, "Model-driven channel estimation for OFDM systems based on image super-resolution network," arXiv preprint arXiv:1911.13106, Nov. 2019.
[22] P. Dong, H. Zhang, G. Y. Li, I. S. Gaspar, and N. NaderiAlizadeh, "Deep CNN-based channel estimation for mmWave massive MIMO systems," IEEE J. Sel. Topics Sig. Process., vol. 13, no. 5, pp. 989-1000, Jul. 2019.
[23] S. Gao, P. Dong, Z. Pan, and G. Y. Li, "Deep learning based channel estimation for massive MIMO with mixed-resolution ADCs," arXiv preprint arXiv:1908.06245, Feb. 2019.
[24] E. Balevi and J. G. Andrews, "Deep learning-based channel estimation for high-dimensional signals," arXiv preprint arXiv:1904.09346, 2019.
[25] E. Balevi, A. Doshi, and J. G. Andrews, "Massive MIMO channel estimation with an untrained deep neural network," IEEE Trans. on Wireless Communications, vol. 19, no. 3, pp. 2079-2090, Jan. 2020.
[26] R. Heckel and P. Hand, "Deep Decoder: Concise image representations from untrained non-convolutional networks," in Proc. ICLR, Feb. 2019.
[27] C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 748-751, Mar. 2018.
[28] Z. Lu, J. Wang, and J. Song, "Multi-resolution CSI feedback with deep learning in massive MIMO system," arXiv preprint arXiv:1910.14322, Oct. 2019.
[29] A. Bora, A. Jalal, E. Price, and A. G. Dimakis, "Compressed sensing using generative models," in Intl. Conf. on Machine Learning (ICML), Aug. 2017, pp. 537-546.
[30] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Intl. Conf. on Machine Learning (ICML), Dec. 2017, pp. 214-223.
[31] O. El Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, "Spatially sparse precoding in millimeter wave MIMO systems," IEEE Trans. on Wireless Communications, vol. 13, no. 3, pp. 1499-1513, Jan. 2014.
[32] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, "VEEGAN: Reducing mode collapse in GANs using implicit variational learning," in Adv. in Neural Info. Process. Systems, Dec. 2017, pp. 3308-3318.
[33] T. J. O'Shea, T. Roy, and N. West, "Approximating the void: Learning stochastic channel models from observation with variational generative adversarial networks," in IEEE Intl. Conf. on Computing, Net. and Comm., Apr. 2019, pp. 681-686.
[34] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," in IEEE GC Wkshps, Dec. 2018, pp. 1-5.
[35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Adv. in Neural Info. Process. Systems, Dec. 2014, pp. 2672-2680.
[36] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. ICLR, Nov. 2015.
[37] N. J. Myers, K. N. Tran, and R. W. Heath Jr, "Low-rank mmWave MIMO channel estimation in one-bit receivers," arXiv preprint arXiv:1910.09141, Oct. 2019.
[38] S. Qiu, X. Wei, and Z. Qiu, "Robust one-bit recovery via ReLU generative networks: Improved statistical rates and global landscape analysis," in NeurIPS Deep Inverse Workshop, Dec. 2019.
[39] M. J. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," The Computer Journal, vol. 7, no. 2, pp. 155-162, Jan. 1964.
[40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," Neural Info. Process. Systems (NIPS) Workshop Autodiff, Oct. 2017.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, Dec. 2014.
[42] A. M. Sayeed, "Deconstructing multiantenna fading channels," IEEE Trans. Signal Process., vol. 50, no. 10, pp. 2563-2579, Oct. 2002.
[43] S. Winter, H. Sawada, and S. Makino, "On real and complex valued ℓ1-norm minimization for overcomplete blind source separation," in IEEE Wkshp on Appl. of Sig. Process. to Audio and Acoustics, Nov. 2005, pp. 86-89.
[44] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, vol. 17, no. 83, pp. 1-5, Apr. 2016.
[45] P. Schniter, "Turbo reconstruction of structured sparse signals," in Conf. on Information Sciences and Systems (CISS), Mar. 2010, pp. 1-6.
[46] A. M. Tulino, A. Lozano, and S. Verdú, "Capacity-achieving input covariance for single-user multi-antenna channels," IEEE Trans. on Wireless Communications, vol. 5, no. 3, pp. 662-671, Mar. 2006.
[47] M. M. Hasan, M. R. I. Faruque, and M. T. Islam, "Dual band metamaterial antenna for LTE/Bluetooth/WiMAX system," Scientific Reports, vol. 8, no. 1, pp. 1-17, Jan. 2018.
[48] A. Achille and S. Soatto, "Where is the information in a deep neural network?" arXiv preprint arXiv:1905.12213, May 2019.
[49] H. Thanh-Tung, T. Tran, and S. Venkatesh, "Improving generalization and stability of generative adversarial networks," in Proc. ICLR, 2019.