A Markovian Model-Driven Deep Learning Framework for Massive MIMO CSI Feedback
Zhenyu Liu, Mason del Rosario, and Zhi Ding
Abstract—Forward channel state information (CSI) often plays a vital role in scheduling and capacity-approaching transmission optimization for massive multiple-input multiple-output (MIMO) communication systems. In frequency division duplex (FDD) massive MIMO systems, forward-link CSI reconstruction at the transmitter relies critically on CSI feedback from receiving nodes and must carefully weigh the tradeoff between reconstruction accuracy and feedback bandwidth. Recent studies on the use of recurrent neural networks (RNNs) have demonstrated strong promise, though the cost of computation and memory remains high for massive MIMO deployment. In this work, we exploit channel coherence in time to substantially improve the feedback efficiency. Using a Markovian model, we develop a deep convolutional neural network (CNN)-based framework, MarkovNet, that differentially encodes forward CSI in time to effectively improve reconstruction accuracy. Furthermore, we explore important physical insights, including spherical normalization of input data and convolutional layers for feedback compression. We demonstrate that the proposed MarkovNet achieves substantial performance improvement and complexity reduction over RNN-based works in recovering forward CSI estimates accurately. We further explore practical considerations in feedback quantization and show that MarkovNet outperforms RNN-based CSI estimation networks at a fraction of the computational cost.
Index Terms—Massive MIMO, CSI compressed feedback, deep learning, FDD
I. INTRODUCTION
The massive MIMO wireless interface has been identified as a critical radio technology at the physical layer, capable of substantially improving bandwidth efficiency and delivering Gigabit/s services to many heterogeneous subscribers simultaneously. The efficacy of such a massive MIMO downlink depends on the availability of accurate forward (downlink) CSI estimates at the base station (BS) for transmission precoding. Given the large number of antennas in massive MIMO and the potentially broad bandwidth, such downlink CSI estimation and acquisition require a substantial amount of feedback from each subscriber user equipment (UE). To support high-mobility UEs in modern mobile wireless, timely feedback for time-varying (i.e., fading) CSI estimation [1], [2] poses critical challenges. Frequent reporting of CSI for massive MIMO coverage would consume too much network bandwidth and UE power. The need for efficient CSI feedback in massive MIMO networks strongly motivates research efforts aimed at downlink CSI compression, feedback, and reconstruction.

The problem of CSI feedback and reconstruction in massive MIMO has been an active research area in recent years. Traditional vector quantization and codebook-based methods reduce feedback overhead by quantizing the CSI at the UE side [3]–[6]. However, the feedback overhead grows with the number of antennas, often requiring a large amount of uplink bandwidth or accepting low accuracy for practical massive MIMO wireless transmission. Compressive sensing (CS)-based approaches exploit channel sparsity in some domain to lower the CSI feedback overhead [7]–[9]. However, CS-based approaches often hinge on strong channel sparsity conditions that are not strictly satisfied in some domains. Moreover, iterative CS reconstruction methods may need a large amount of computation time to accurately recover downlink CSI estimates. More recently, there has been a surge of interest in applying artificial neural networks to forward CSI estimation [10]–[13].
The popularity and versatility of deep learning (DL) have motivated a number of recent works that explored deep neural networks (DNNs) for downlink channel compression and recovery, particularly for the massive MIMO wireless interface. Typically, these DNNs have utilized two prevailing architectures that are successful in other applications. First, Convolutional Neural Networks (CNNs), which have demonstrated state-of-the-art performance in image processing tasks [14], have been integrated in an autoencoder for CSI compression and recovery of a single snapshot [15]. Second, Recurrent Neural Networks (RNNs) have been investigated to exploit temporal CSI coherence for feedback compression in massive MIMO systems [16]–[20]. RNNs can leverage hidden states, through architectures such as long short-term memory (LSTM) cells, to exploit the effect of past inputs.

Existing works have demonstrated that DNNs can provide efficient CSI feedback and reconstruction for time-varying MIMO channels [16]–[20]. However, important issues remain unresolved in at least the following two aspects:

1)
Complexity and Storage: The number of parameters in the RNN layers for CSI compression and reconstruction of massive MIMO systems can be staggeringly large. For example, the RNN module can add many additional parameters [16], which raises storage and computation concerns. A fully connected layer-based autoencoder has been proposed for CSI feedback in time-varying channels to reduce the computational complexity and required memory [20]. However, its accuracy is less favorable in comparison to [16]. While other works have investigated RNNs of reduced size [19], [20], even the least computationally expensive of these models still requires many parameters per snapshot. Also, the networks in [19], [20] suffer a significant feedback performance drop when the compression ratio is small, since they must use the same compression ratio in successive time slots and cannot obtain accurate prior information in the initial time slot. Despite the reported success of "stacked" LSTMs, the minimum depth of recurrent layers necessary for CSI recovery accuracy has not been adequately evaluated. Considering the large RNN parameter count, the performance improvement should be substantial to justify the memory overhead.

2) Physical Insight: The success of RNNs in areas such as video processing [21] and natural language processing (NLP) [22], [23] has stimulated their application to forward CSI feedback and reconstruction. However, despite the apparent similarity in terms of time series prediction, the physical nature of the underlying CSI in massive MIMO is considerably different from that of video and image contents. Leveraging domain knowledge and the physical characteristics of mobile wireless channels can be beneficial. For example, LTE frames (subframes) occupy 10 ms (1 ms) of airtime and permit CSI feedback intervals that are integer multiples of either.
DNN-based CSI feedback and recovery should consider the practical constraints of how often such feedback can be transmitted and how the CSI of fading channels would vary due to the Doppler effect.

In order to reduce computational complexity and model size, we seek to systematically exploit physical channel characteristics such as the temporal coherence of forward CSI. Instead of training an RNN as a black box to learn the underlying CSI characteristics for compression, feedback, and recovery, we directly leverage the known temporal channel coherence by developing a simple but effective Markovian model-driven differential CSI feedback framework, MarkovNet, to improve CSI recovery accuracy and reduce computational complexity. A spherical CSI feedback framework and an enhanced CSI feedback network structure are proposed to provide accurate prior information in the initial time slot. CNN-based CSI feedback networks are trained to further compress and recover the differential CSI effectively. We show that this simple MarkovNet can directly take advantage of the channel fading property to deliver much more efficient CSI compression and recovery.

This paper is organized as follows. Section II describes the massive MIMO system model commonly adopted in this and similar works. Section III presents two approaches to exploit CSI temporal coherence: RNNs and differential encoding. Section IV describes our proposed differential encoding-based CSI feedback framework, MarkovNet, as well as data pre-processing techniques, such as power-based spherical normalization, that further improve CSI feedback accuracy for individual channel snapshots. Section V introduces the proposed CNN-based dimension compression and decompression module for model size and complexity reduction. Section VI presents our experimental results, including computational analysis and performance under feedback quantization, for MarkovNet in comparison with a benchmark RNN-based network. Section VII concludes this manuscript.

II. SYSTEM MODEL
A. Forward-link Channel Estimation and Reconstruction
In this paper, we consider a massive MIMO BS, known in 5G as a gNB, equipped with $N_b \gg 1$ antennas to serve a number of single-antenna UEs within its cell. We apply orthogonal frequency division multiplexing (OFDM) in downlink transmission over $N_f$ subcarriers.

To model the received signal of a UE, consider the $m$-th subcarrier at time $t$. Let $\mathbf{h}_{t,m} \in \mathbb{C}^{N_b \times 1}$ denote the channel vector, $\mathbf{w}_{t,m} \in \mathbb{C}^{N_b \times 1}$ the transmit precoding vector, $x_{t,m} \in \mathbb{C}$ the transmitted data symbol, and $n_{t,m} \in \mathbb{C}$ the additive noise. Then the received signal of the UE on the $m$-th subcarrier at time $t$ is given by
$$y_{t,m} = \mathbf{h}_{t,m}^H \mathbf{w}_{t,m} x_{t,m} + n_{t,m}, \qquad (1)$$
where $(\cdot)^H$ represents the conjugate transpose. The downlink CSI matrix in the spatial-frequency domain at time $t$ is denoted as $\tilde{\mathbf{H}}_t = [\mathbf{h}_{t,1}, \ldots, \mathbf{h}_{t,N_f}]^H \in \mathbb{C}^{N_f \times N_b}$.

Based on the downlink channel matrix $\tilde{\mathbf{H}}_t$, the gNB can determine the transmit precoding vector for each subcarrier. However, since the size of the CSI matrix $\tilde{\mathbf{H}}_t$ is $N_f \times N_b$, the UE's CSI feedback payload is large and consumes a staggering amount of uplink bandwidth in massive MIMO systems.

To reduce the feedback overhead, we first exploit the sparsity of CSI in a different projection space, the delay domain. Multipath effects cause short delay spreads, resulting in sparse CSI matrices in the delay domain [24]. With the help of the 2D discrete Fourier transform (DFT), the CSI matrix $\mathbf{H}_f$ in the spatial-frequency domain can be transformed to $\mathbf{H}_d$ in the angular-delay domain using
$$\mathbf{F}_d^H \mathbf{H}_f \mathbf{F}_a = \mathbf{H}_d, \qquad (2)$$
where $\mathbf{F}_d$ and $\mathbf{F}_a$ denote the $N_f \times N_f$ and $N_b \times N_b$ unitary DFT matrices, respectively. After the 2D DFT of $\mathbf{H}_f$, most elements of the $N_f \times N_b$ matrix $\mathbf{H}_d$ are negligible except for the first $R_d$ rows that dominate the channel response [15]. Therefore, we can approximate the channel by truncating the CSI matrix to its first $R_d$ rows. $\mathbf{H}_t$ denotes the first $R_d$ rows of the matrix obtained after the 2D DFT of $\tilde{\mathbf{H}}_t$.
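As a concrete illustration, the angular-delay transform of (2) and the row truncation can be sketched in a few lines of NumPy. The toy channel below, built from exactly $R_d$ delay taps, is an assumption purely for illustration, so its delay-domain support is confined to the first $R_d$ rows by construction:

```python
import numpy as np

def to_angular_delay(H_f):
    """Transform spatial-frequency CSI (N_f x N_b) to the angular-delay
    domain via the 2D unitary DFT of (2): H_d = F_d^H @ H_f @ F_a."""
    N_f, N_b = H_f.shape
    F_d = np.fft.fft(np.eye(N_f)) / np.sqrt(N_f)   # N_f x N_f unitary DFT
    F_a = np.fft.fft(np.eye(N_b)) / np.sqrt(N_b)   # N_b x N_b unitary DFT
    return F_d.conj().T @ H_f @ F_a

def truncate(H_d, R_d):
    """Keep only the first R_d delay-domain rows that dominate the response."""
    return H_d[:R_d, :]

# Toy example: a channel with R_d dominant delay taps is sparse in delay domain.
rng = np.random.default_rng(0)
N_f, N_b, R_d = 1024, 32, 32
taps = np.zeros((N_f, N_b), dtype=complex)
taps[:R_d] = rng.standard_normal((R_d, N_b)) + 1j * rng.standard_normal((R_d, N_b))
F_d = np.fft.fft(np.eye(N_f)) / np.sqrt(N_f)
F_a = np.fft.fft(np.eye(N_b)) / np.sqrt(N_b)
H_f = F_d @ taps @ F_a.conj().T                    # inverse of the 2D transform

H_d = to_angular_delay(H_f)
H_t = truncate(H_d, R_d)
# Energy outside the first R_d rows is numerically zero for this toy channel.
tail = np.linalg.norm(H_d[R_d:]) / np.linalg.norm(H_d)
print(round(tail, 6))  # -> 0.0
```

The truncation reduces the matrix fed to the autoencoder from $N_f \times N_b$ to $R_d \times N_b$, which is the input size used throughout the rest of the paper.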
Using $\mathbf{H}_t$ as a supervised learning objective, a DL-based encoder and decoder, often referred to as an autoencoder, can be jointly trained to achieve efficient CSI compression and reconstruction, as shown in Fig. 1(a). Several recent works that adopted this autoencoder structure [15], [25] have reported notable successes.

To allow the gNB to track the time-varying characteristics of wireless fading channels, UEs need to periodically estimate and feed back instantaneous CSI with high power and bandwidth efficiency. Considering a duration of $T$ successive time slots, the sequence of time-varying channel matrices is defined as $\{\mathbf{H}_t\}_{t=1}^T = \{\mathbf{H}_1, \mathbf{H}_2, \cdots, \mathbf{H}_T\}$.

B. High Efficiency CSI Feedback Encoding
To reduce feedback overhead, temporal coherence of theradio fading channels can be exploited. Since RF channelsof mobile UEs are governed by physical scatters, multi-paths, bandwidth, and Doppler effect, the fading CSI exhibitsphysically coherent characteristics including coherence time,coherence bandwidth, and coherence space. For mobile users,coherence time measures temporal channel variations anddescribes the Doppler effect caused by UE mobility. For most t s …… ˆ( ) t φ − Η s ( ) E Η Encoder 1Decoder 1 Η ˆ Η -+ Encoder tDecoder t t Η ˆ t Η -+ Fig. 1: Illustration of the temporal correlation based CSIfeedback. ( t > application scenarios, the massive MIMO channels do not varyabruptly. By exploiting the channel coherence time, the UEand the gNB can rely on their previously stored CSI estimatesto encode only the innovation components within the CSI.Specifically, the UE can encode and feed back CSI variationsinstead of the full CSI to substantially reduce feedback cost.Accordingly, gNB can combine the new feedback with itspreviously recovered CSI within coherence time to reconstructsubsequent CSI estimates.We can adopt a general first order Markovian channel model p ( H t | H t − , · · · , H ) = p ( H t | H t − ) . (3)Given knowledge of the CSI at the previous time slot, theminimum mean square estimation (MMSE) of H t can bedefined as φ ( H t − ) = E { H t | H t − } . (4)We define the MMSE estimation error as V t = H t − E { H t | H t − ) = H t − φ ( H t − ) . (5)Consider the scenario that, at time t − , the UE and the gNBhave successfully exchanged the CSI H t − . Then it would bemore efficient for the UE to compress and feed back the CSIestimation error V t to the gNB instead of the raw H t .Based on this CSI model, we can develop a novel DLencoder and decoder architecture by exploiting a trainable neu-ral network to learn the unknown MMSE estimation function φ ( H t − ) = E { H t | H t − } . This new DL encoder and decoderarchitecture is shown in Fig. 
1.As shown in Fig. 1, the feedback for the CSI matrixsequence can be divided into two phases: a) The feedbackof CSI at the first (initial) time slot ( t = 1 ) without priorinformation; b) The feedback of CSI in subsequent time slots( t = 2 , , ..., T ) given the prior CSI information. Denote ˆH t as the reconstruction of CSI matrix H t at time slot t .Define the encoding and decoding function as f e ( · ) and f d ( · ) ,respectively. For downlink CSI feedback architecture in thefirst time slot, the encoder network and decoder network canbe denoted, respectively, by s = f e, ( H − E { H } ) , (6) ˆ H = f d, ( s ) + E { H } (7)This initial phase assumes that the CSI mean is known fromtraining or past information. If such information is unavailable, Fig. 2: (a) Illustration of a “stacked” LSTM network ofdepth 3 shown with recurrent connections. (b) Same network“unrolled” into T timeslots. The network is trained with eitherperfect or quantized CSI, ¯ H , to generate CSI estimates, ˆ H .then E { H } = 0 shall be applied. For downlink CSI feedbackarchitecture of subsequent time slot t ( t ≥ ), the encodernetwork and decoder network can be executed, respectively,by s t = f e,t ( H t − φ ( ˆ H t − )) , (8) ˆ H t = f d,t ( s t ) + φ ( ˆ H t − ) (9)Since the optimum function φ ( ˆ H t − )) is unknown, one directsolution is to approximate the function with deep neuralnetwork architecture trained by using a set of CSI samples.III. E XPLOITING C HANNEL T EMPORAL C OHERENCE
We now discuss two avenues for exploiting the temporal coherence of CSI at successive time slots: a DNN architecture that utilizes long short-term memory (LSTM) layers, and an information-theoretic differential encoding approach.
A. Recurrent Neural Networks
Recurrent neural networks (RNNs) include layers that encode memory of previous states. Through backpropagation training, recurrent layers learn whether to incorporate information stored in memory in the layer's output and whether that information should be kept in memory [26]. This memory incorporation enables RNNs to store, remember, and process information that resides in past signals over long time periods. RNNs can thus utilize past input sequence samples to predict future states [27].

RNNs have found wide application in areas such as natural language processing (NLP), including machine translation [22] and sentiment extraction [23]. For NLP tasks, empirical results have demonstrated the effectiveness of "deep" or "stacked" RNNs, networks which use the outputs of hidden recurrent layers as inputs to subsequent recurrent layers [28].

Prior works have investigated stacked RNNs for CSI estimation. Several proposals have favored the use of the Long Short-Term Memory (LSTM) cell [16]–[18], a recurrent unit that can tackle the vanishing gradient problem inherent in recurrent backpropagation [29]. Existing LSTM-based works in CSI estimation have assumed that stacked LSTMs are better than shallow LSTMs, presenting models that used LSTM cells of depth 3 [16]. Fig. 2 illustrates the principle of this LSTM network for CSI feedback and estimation. This bias towards deep RNNs is likely due to the aforementioned successes in NLP, where deep recurrent layers are theorized to learn hierarchical levels of semantic abstraction [23], [30].

This RNN approach has been recently proposed in [16], which we shall consider as the benchmark method in this work. However, deep LSTMs can be problematic, as the number of parameters per LSTM cell can be quite large.
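To make the parameter cost concrete, a standard LSTM cell with input width $n_{in}$ and hidden width $n_h$ holds $4(n_{in} n_h + n_h^2 + n_h)$ parameters: four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. The sizing below, with the hidden width matched to a flattened $32 \times 32$ complex CSI snapshot, is an illustrative assumption rather than the exact benchmark configuration:

```python
def lstm_params(n_in, n_h):
    """Parameter count of one standard LSTM cell: four gates, each with an
    input weight matrix, a recurrent weight matrix, and a bias vector."""
    return 4 * (n_in * n_h + n_h * n_h + n_h)

# Illustrative sizing (an assumption, not the exact benchmark configuration):
# a flattened 32 x 32 complex CSI snapshot -> 2 * 32 * 32 = 2048 real inputs,
# with the hidden state kept at the same width.
n = 2 * 32 * 32
per_layer = lstm_params(n, n)
print(f"{per_layer:,} parameters per layer")        # 33,562,624
print(f"{3 * per_layer:,} for a depth-3 stack")     # 100,687,872
```

Even a single recurrent layer of this width approaches 34 million parameters, which is why the memory overhead of stacked depth deserves scrutiny.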
If a parsimonious model is desired due to memory constraints, then memory-intensive RNNs can be very costly. In this work, we explore ways to simplify the LSTM architectures without CSI performance loss, as deeper networks do not necessarily lead to better estimation accuracy. In fact, we shall show later (Fig. 11, Section VI-C) that a single LSTM layer (i.e., $D = 1$) can yield more accurate CSI estimates than deeper LSTMs (i.e., $D \geq 2$) in some cases.

B. CSI Entropy and Feedback Encoding
Despite the success of deep RNNs in CSI estimation and recovery, several important research questions remain.
• First, what simplifications can be made to reduce computational complexity while maintaining efficient CSI feedback and accurate CSI recovery?
• Second, how much CSI feedback bandwidth, in terms of bitwidth per CSI coefficient, is sufficient?
• Third, how frequently should a UE provide CSI feedback for the gNB to update its CSI estimate?
It is therefore important to tackle these open questions that hamper the practical application and efficacy of DL-based CSI estimation and recovery in massive MIMO networks.

Consider a random channel matrix $\mathbf{H}_t$ that consists of complex fading coefficients for the $t$-th timeslot. We denote its joint probability density function $p(\mathbf{H}_t)$ and define the corresponding entropy as
$$H(\mathbf{H}_t) = -\sum_{\mathbf{H}_t} p(\mathbf{H}_t) \log p(\mathbf{H}_t), \qquad (10)$$
where (10) is the sum over all realizations of the r.v. $\mathbf{H}_t$. The CSI entropy of (10) describes the required number of bits for the UE to feed back its CSI estimate to the gNB for reconstruction. Denote the $(i,j)$-th CSI element within $\mathbf{H}_t$ as $H_{t,(i,j)}$. If all elements are independent, then we have a simple upper bound on the entropy of the full CSI matrix:
$$H(\mathbf{H}_t) \leq H_{UB} = \sum_{i,j} H(H_{t,(i,j)}). \qquad (11)$$
This entropy bound $H_{UB}$ describes the approximate number of total bits necessary for direct encoding of forward-link CSI for UE feedback.

Fortunately, in mobile wireless networks, CSI within a coherence time exhibits strong correlation [31]. Therefore, instead of reconstructing CSI independently from the CSI feedback of individual time slots, the gNB can utilize this CSI dependency by leveraging both previously reconstructed CSIs and the current CSI feedback. In other words, the UE feedback should focus on providing information that is not available at the gNB from CSIs of previous time slots.

Fig. 3: Averaged conditional entropy (bits/element) for the indoor (5.3 GHz, 0.001 m/s) and outdoor (300 MHz, 0.9 m/s) channels under different feedback intervals ($\delta$).

Taking advantage of the Markovian CSI model, we can investigate how much the gNB can benefit from the previous CSI. Given the Markovian channel model of (3), the conditional CSI entropy quantifies the amount of information needed to characterize the CSI matrix based on the available CSIs from earlier reconstruction:
$$H(\mathbf{H}_t \mid \mathbf{H}_{t-1}, \ldots, \mathbf{H}_1) = H(\mathbf{H}_t \mid \mathbf{H}_{t-1}) = -\sum_{\mathbf{H}_{t-1}} \sum_{\mathbf{H}_t} p(\mathbf{H}_t, \mathbf{H}_{t-1}) \log p(\mathbf{H}_t \mid \mathbf{H}_{t-1}). \qquad (12)$$
From the well-known relationship $H(\mathbf{H}_t \mid \mathbf{H}_{t-1}) \leq H(\mathbf{H}_t)$, it is clear that by utilizing the most recently reconstructed CSI, the gNB would require less feedback bandwidth, thereby improving UE feedback efficiency.

A stationary first-order Markovian channel model is characterized by the conditional probability density function $p(\mathbf{H}_t \mid \mathbf{H}_{t-1})$. In practice, such distribution information on CSI is difficult to acquire analytically. To gain insight into the time coherence between CSI at different feedback intervals, we numerically evaluate typical wireless channel models by comparing the entropy and the conditional entropy of the forward-link CSI parameters. Note the following relationship between CSI entropies at $t$ and $t - \delta$, where $\delta$ is the feedback interval:
$$H(H_{t,(i,j)} \mid \mathbf{H}_{t-\delta}) \leq H(H_{t,(i,j)} \mid H_{t-\delta,(i,j)}) \leq H(H_{t,(i,j)}). \qquad (13)$$
For practical reasons, we shall numerically evaluate the conditional entropy $H(H_{t,(i,j)} \mid H_{t-\delta,(i,j)})$ averaged over the coefficients in $\mathbf{H}_t$.
Such information provides important guidelines for determining the number of bits for CSI feedback and how often the UE should provide such feedback for CSI estimation at the gNB in massive MIMO systems.

In this experiment, we consider a link with $N_b = 32$ transmit antennas and 1 receive antenna over $N_f = 1024$ subcarriers. After applying the 2D DFT, $R_d = 32$ rows of significant CSI elements in the delay domain are retained in $\mathbf{H}_t$. For each element within $\mathbf{H}_t$, we apply a 14-bit uniform quantizer to encode raw CSI values, resulting in a normalized mean square quantization error of -40 dB. Since the complex CSI matrices are always divided into a real part and an imaginary part as the real-valued input to the neural network [15]–[20], we consider the conditional entropy of the CSI's real part and imaginary part individually. Fig. 3 shows the estimated conditional entropy averaged over the $2 \times N_b \times R_d$ CSI elements.

We generate 10,000 random indoor and outdoor channel responses using the channel models given in [32] and [16]. Following the examples in [16], the indoor channel is in the 5.3 GHz band, with little or no mobility at a velocity of 0.001 m/s. The outdoor channel is in the 300 MHz band, at a velocity of 0.9 m/s. The bandwidth for both the indoor and outdoor channels is 20 MHz. The conditional entropy is evaluated for feedback intervals $\delta$ = 40 ms, 80 ms, 160 ms, and $\infty$ (i.e., no feedback).

As shown in Fig. 3, the average conditional entropy varies between 1 and 5 bits/element for both the indoor and outdoor channel models tested. As the duration of the feedback interval grows, the conditional entropy increases because of the limited channel coherence time. In addition, it is intuitive that the outdoor channel exhibits higher conditional entropy, since higher velocity corresponds to shorter coherence time [33]. For both channel models, the average entropy of the CSI elements without prior CSI corresponds to $\delta = \infty$, which attains its maximum value of approximately 5 bits.
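The entropy comparison above can be mimicked on a synthetic Gauss-Markov process standing in for a single CSI coefficient. The sketch below uses a coarse 4-bit quantizer (not the paper's 14-bit one, which would need far more samples for a reliable joint histogram) and estimates both $H(x_t)$ and $H(x_t \mid x_{t-1})$ from empirical histograms; the correlation value `rho` is an assumed stand-in for temporal channel coherence:

```python
import numpy as np

def entropy(counts):
    """Empirical entropy (bits) of a histogram of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def cond_entropy_estimate(rho=0.95, n=200_000, bits=4, seed=0):
    """Empirical H(x_t) and H(x_t | x_{t-1}) for a quantized Gauss-Markov
    process x_t = rho * x_{t-1} + w_t (a stand-in for one CSI coefficient)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n) * np.sqrt(1 - rho**2)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + w[t]
    # Uniform quantizer over +/- 4 standard deviations.
    levels = 2 ** bits
    q = np.clip(((x + 4) / 8 * levels).astype(int), 0, levels - 1)
    joint = np.zeros((levels, levels))
    np.add.at(joint, (q[:-1], q[1:]), 1)               # joint histogram
    h_x = entropy(joint.sum(axis=0))                   # H(x_t)
    h_joint = entropy(joint.ravel())                   # H(x_{t-1}, x_t)
    return h_x, h_joint - h_x                          # H(x_t), H(x_t | x_{t-1})

h, h_cond = cond_entropy_estimate()
print(f"H(x_t) ~ {h:.2f} bits, H(x_t | x_t-1) ~ {h_cond:.2f} bits")
assert h_cond < h   # conditioning on the previous sample never adds entropy
```

The gap between the two estimates plays the same role as the gap between the $\delta = \infty$ and finite-$\delta$ curves in Fig. 3: it is the feedback bitwidth that knowledge of the previous CSI saves.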
These numerical results strongly motivate a systematic selection of the feedback interval and feedback bandwidth. For example, for a feedback interval of 80 ms, an average of approximately 3 bits per CSI coefficient can be used for outdoor CSI feedback by UEs when prior CSI is utilized by the gNB. For the same feedback interval, an average bitwidth of 1 bit per CSI coefficient can be used for indoor CSI feedback.

The reduction of CSI entropy when prior CSI knowledge is available motivates condition-based encoding, such as differential encoding by the UE. Encoding the difference between successive feedback instants, $\mathbf{H}_t$ and $\hat{\mathbf{H}}_{t-\delta}$, can reduce the required number of UE feedback bits, allowing more compression without loss of information [34].

IV. DIFFERENTIAL CSI ENCODING
A. A Simplified Markovian Model
Although the general Markovian CSI model motivates the use of a DNN to approximate the conditional mean $\phi(\hat{\mathbf{H}}_{t-1})$ through training, we can examine simpler CSI models in order to develop a low-complexity encoder-decoder structure with consistently strong performance. Consider the simplified Markovian CSI model of [35]:
$$\mathbf{H}_t = \gamma \mathbf{H}_{t-1} + \mathbf{V}_t, \qquad (14)$$
where $\gamma$ is a constant and $\mathbf{V}_t$ is a zero-mean i.i.d. random matrix.

Fig. 4: Architecture of spherical CSI feedback in SphNet.

Given ensemble samples of $\mathbf{H}_{t-1}$ and $\mathbf{H}_t$, the unknown $\gamma$ can be estimated via
$$\hat{\gamma} = \frac{\mathrm{Trace}(E\{\mathbf{H}_t \mathbf{H}_{t-1}^H\})}{E\|\mathbf{H}_{t-1}\|^2}. \qquad (15)$$
Based on this simplified first-order autoregressive (AR) model, we propose a low-complexity encoder-decoder architecture for time slot $t$ ($t \geq 2$) as
$$\mathbf{s}_t = f_{e,t}(\mathbf{H}_t - \gamma \hat{\mathbf{H}}_{t-1}), \qquad (16)$$
$$\hat{\mathbf{H}}_t = f_{d,t}(\mathbf{s}_t) + \gamma \hat{\mathbf{H}}_{t-1}. \qquad (17)$$
Based on this simplified model, we propose a differential encoding architecture named "MarkovNet" for efficient CSI feedback and reconstruction in massive MIMO systems. The differential CSI feedback framework consists of two phases: a first network used for the initial CSI at $t = 1$ (spherical CSI feedback), and a second network in subsequent timeslots to compress and encode differential information as described by the encoder of (16).

To fully exploit the temporal CSI coherence, accurate CSI at the initial time slot $t = 1$ is required to provide sufficient baseline information for the CSI feedback in subsequent time slots. To this end, our proposed framework applies CSI pre-processing and optimizes the neural network structure. Specifically,
• We propose spherical normalization for CSI pre-processing, so that the network's objective function better matches the commonly adopted accuracy metric, NMSE.
• We optimize the CSI encoder-decoder to enhance CSI recovery accuracy.
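A sample-average version of the estimator in (15), applied to a synthetic AR(1) CSI batch with a known (assumed) $\gamma$, can be sketched as:

```python
import numpy as np

def estimate_gamma(H_prev, H_curr):
    """Sample estimate of gamma in H_t = gamma * H_{t-1} + V_t, following
    gamma_hat = Tr(E[H_t H_{t-1}^H]) / E[||H_{t-1}||_F^2] over the batch."""
    num = np.mean([np.trace(Hc @ Hp.conj().T) for Hp, Hc in zip(H_prev, H_curr)])
    den = np.mean([np.linalg.norm(Hp) ** 2 for Hp in H_prev])
    return (num / den).real

# Synthetic AR(1) CSI batch with a known gamma (assumed for illustration).
rng = np.random.default_rng(1)
gamma_true, n, shape = 0.9, 500, (32, 32)
H_prev = rng.standard_normal((n, *shape)) + 1j * rng.standard_normal((n, *shape))
V = rng.standard_normal((n, *shape)) + 1j * rng.standard_normal((n, *shape))
H_curr = gamma_true * H_prev + np.sqrt(1 - gamma_true**2) * V

gamma_hat = estimate_gamma(H_prev, H_curr)
print(round(gamma_hat, 2))  # close to 0.9

# The differential feedback of (16) then compresses H_t - gamma_hat * H_{t-1},
# whose energy is much smaller than that of the raw H_t.
residual = H_curr[0] - gamma_hat * H_prev[0]
print(np.linalg.norm(residual) ** 2 / np.linalg.norm(H_curr[0]) ** 2)
```

The trace in the numerator averages out the zero-mean innovation $\mathbf{V}_t$, which is why the batch mean converges to $\gamma$ as the ensemble grows.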
B. Transforming CSI Feedback into Spherical Coordinates
How to effectively apply DL techniques to exploit channel data properties and optimization objectives remains an open research issue, as many existing DL-based works mainly reuse the deep learning architectures and loss functions developed for other application areas. Direct adoption of DL architectures without customization for CSI data characteristics risks unsatisfactory performance. In particular, data processing methods and loss functions developed for computer vision may not be well suited for CSI encoding and reconstruction.

To begin, many existing DL-based CSI encoding-decoding schemes conveniently view the 2D MIMO channel matrix $\mathbf{H}_t$ as akin to an image, such that the normalized elements of the CSI matrix are utilized as image-like input to DL networks in both training and testing. However, multipath fading MIMO channels exhibit special properties and probability distributions different from 2D image data.

Among other differences, images are represented as matrices of intensity pixel values. For color images, each color channel corresponds to a 2D matrix of pixel values that are unsigned integers, e.g., in the range between 0 and 255. Normalizing these pixels can strongly benefit the preparation of images as inputs to a DL model. However, unlike different images, whose pixel values are mostly of the same order of magnitude, the dynamic range of CSI data can be much greater. For example, RF path loss grows polynomially with the distance between gNB and UE [36]. As a result, the CSI of one UE can differ from the CSI of another UE by several orders of magnitude, depending on their respective distances to the gNB. A naive normalization can render the CSIs of some UEs too small for the DL networks to respond to.
Another distinct feature is that baseband CSI parameters are complex-valued, consisting of both magnitude and phase, whereas image pixels are nonnegative real values (after normalization).

In terms of learning objectives, several existing DL-based CSI feedback works adopted a loss function similar to that of image recovery for training the DL networks. Specifically, the objective is to minimize the mean square error (MSE):
$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \|\mathbf{H}_k - \hat{\mathbf{H}}_k\|^2, \qquad (18)$$
where $k$ and $N$ are, respectively, the index and total number of samples in the data set, and $\|\cdot\|$ denotes the Frobenius norm. On the other hand, it is typically more meaningful in CSI estimation to apply the normalized MSE (NMSE),
$$\mathrm{NMSE} = \frac{1}{N} \sum_{k=1}^{N} \|\mathbf{H}_k - \hat{\mathbf{H}}_k\|^2 / \|\mathbf{H}_k\|^2, \qquad (19)$$
to assess the accuracy of CSI recovery at the gNB [37] and feedback efficiency [15], [16], [25]. By directly using MSE as the loss function, the DL networks would be biased toward the CSI accuracy of stronger MIMO channels.

In response to the domain-specific characteristics of the data and objective in CSI estimation, we propose a spherical CSI data structure for feedback, as shown in Fig. 4. The spherical CSI feedback architecture splits the downlink CSI matrix $\mathbf{H}_k$ into a magnitude $p_k$ and a spherical downlink CSI matrix $\check{\mathbf{H}}_k$, where $p_k = \|\mathbf{H}_k\|$ is the size of the CSI matrix and the unit-norm spherical CSI $\check{\mathbf{H}}_k = \mathbf{H}_k / \|\mathbf{H}_k\|$ lies on the surface of the unit hypersphere. The UE encodes and feeds back the size $p_k$ and the spherical CSI matrix $\check{\mathbf{H}}_k$ separately.

The spherical CSI feedback architecture presents numerical advantages. First, we can construct an encoder DL network that focuses on compressing and encoding the spherical CSI matrix $\check{\mathbf{H}}_k$.
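The magnitude/direction split and the NMSE metric of (19) can be sketched as follows; the "decoder" here is a lossless placeholder used only to show that UEs at very different path losses contribute equally after normalization:

```python
import numpy as np

def spherical_split(H):
    """Split CSI into magnitude p = ||H||_F and a unit-norm direction H / p."""
    p = np.linalg.norm(H)
    return p, H / p

def nmse(H, H_hat):
    """Normalized MSE of (19), averaged over a batch of CSI matrices."""
    return np.mean([np.linalg.norm(h - hh) ** 2 / np.linalg.norm(h) ** 2
                    for h, hh in zip(H, H_hat)])

# Two UEs whose CSI magnitudes differ by orders of magnitude (path loss).
rng = np.random.default_rng(0)
H_near = rng.standard_normal((32, 32))
H_far = 1e-3 * rng.standard_normal((32, 32))

for H in (H_near, H_far):
    p, H_dir = spherical_split(H)
    assert abs(np.linalg.norm(H_dir) - 1.0) < 1e-12  # both lie on the unit sphere
    # Feed back p directly; reconstruct as p * (decoded direction).
    H_hat = p * H_dir                                # lossless placeholder decoder
    print(nmse([H], [H_hat]))                        # ~0, up to rounding
```

Because both the near and far UE present a unit-norm matrix to the encoder, the training gradients and the NMSE contributions of the two are on the same scale regardless of path loss.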
The size of the CSI is sent back to the gNB directly and separately, since it contains little redundancy. Thus, even CSI matrices of different magnitudes are equally important in training the encoder and decoder networks: during training, the gradients for UEs that are far away from the gNB would no longer be negligible [38]. Moreover, the decoder will have a more limited domain, enabling more accurate CSI recovery under spherical normalization [39].

As shown in Fig. 4, our joint CSI compression and reconstruction architecture still utilizes the effective autoencoder structure, in which the encoder at the UE attempts to learn a low-dimensional representation of a relatively high-dimensional dataset of spherical CSI matrices. The decoder at the gNB reconstructs the CSI matrix based on the feedback information extracted by the UE encoder and the direct feedback of the CSI magnitude $p_k$.

Fig. 5: Architecture of CsiNet Pro (input and output: $32 \times 32 \times 2$ real and imaginary parts).

C. CsiNet Pro: A Novel CSI Encoder-Decoder Network
We propose an efficient neural network structure, named CsiNet Pro, for UE encoding and gNB decoding of CSI in massive MIMO networks. The structure of CsiNet Pro is illustrated in Fig. 5. In comparison with existing neural networks such as those of [15], [16], CsiNet Pro provides a deeper encoder that uses more convolutional layers to better extract CSI features. A corresponding decoder at the gNB also contains 4 convolutional layers.

The design of the encoder for dimension compression is crucial. However, the encoders in [15], [16], [25] all utilized one convolutional layer and one fully connected layer. As a major departure, the encoder of CsiNet Pro utilizes 4 convolutional layers for feature extraction and 1 fully connected layer for dimension compression; the 4 convolutional layers apply fixed-size kernels to generate successive feature maps (see Fig. 5).
Fig. 6: Illustration of the multi-stage, differential CSI feedback framework MarkovNet.

Another major change in CsiNet Pro is the use of a different normalization range and output activation function. Recall that the decoder network utilizes 4 convolutional layers, as shown in Fig. 5. Unlike the nonnegative pixel values in image reconstruction, CSI values contain both real and imaginary parts that can be either positive or negative. Thus, unlike previous works that normalize the CSI values to fall within $[0, 1]$ in order to use sigmoid or ReLU as the activation function of the last layer, our proposed CsiNet Pro normalizes the real and imaginary CSI values to the range $[-1, 1]$ and uses "tanh" as the activation function of the last layer. We integrate CsiNet Pro into the spherical CSI feedback framework shown in Fig. 4 to enhance CSI recovery accuracy.

D. Differential CSI Encoding
Motivated by the simplified first-order AR model for CSI, we propose a differential CSI feedback framework, MarkovNet, to improve bandwidth efficiency. Different from RNN-based networks such as LSTM, which rely on neural networks to learn the required information sharing and the corresponding CSI compression simultaneously, MarkovNet proactively leverages the simplified AR model (14) for CSI and encodes the CSI prediction error between two successive time slots, as shown in (16). Recall that the first-order prediction difference between the CSI in two adjacent time slots, H_t − γ̂ H_{t−1}, is an approximation of the innovation V_t. As shown in Fig. 6, for time slots beyond the initial time slot, the linear prediction difference is sent to the encoder network, which computes s_t = f_{e,t}(H_t − γ Ĥ_{t−1}) as given in (16). At the gNB receiver, the decoder network can utilize the previously recovered CSI Ĥ_{t−1} to reconstruct Ĥ_t according to Ĥ_t = f_{d,t}(s_t) + γ Ĥ_{t−1}, as described in (17). MarkovNet from t2 onward employs the same network architecture as CsiNet Pro, shown in Fig. 5. Compared with the network for t1, which uses a larger compression ratio to ensure high recovery accuracy in the first timeslot, MarkovNet from t2 can afford a smaller compression ratio to achieve higher bandwidth efficiency with the help of prior information.

MarkovNet exhibits several additional advantages in practical implementation. First, compared to RNN-based CSI feedback, MarkovNet can exploit a pretrained model as initial neural network parameters for the models used in later timeslots to improve training efficiency, since the CSI at adjacent time slots shares similar data features. Second, the differential CSI matrix tends to be more sparse, thereby enabling MarkovNet to achieve a higher degree of compression during feedback. Third, for most wireless network applications, both gNB and UEs have limited power, computation, and storage resources.
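The differential loop of (16) and (17) can be illustrated numerically. In this toy sketch (our construction, not the paper's code), a simple uniform quantizer stands in for the learned encoder/decoder pair, and the CSI follows the first-order AR model of (14); all sizes and the AR coefficient are assumed values for illustration:

```python
import math
import random

def quantize(v, step=0.05):
    """Lossy stand-in for the learned encoder/decoder pair f_e, f_d:
    uniform quantization of each residual entry."""
    return [step * round(x / step) for x in v]

random.seed(0)
gamma = 0.9          # assumed AR(1) coefficient, cf. Eq. (14)
T, N = 10, 64        # timeslots, CSI entries per slot

# Generate a first-order Markov (AR(1)) CSI sequence: H_t = gamma*H_{t-1} + V_t
H = [[random.gauss(0, 1) for _ in range(N)]]
for t in range(1, T):
    H.append([gamma * h + math.sqrt(1 - gamma**2) * random.gauss(0, 1)
              for h in H[-1]])

# Differential feedback: the UE sends the quantized residual
# H_t - gamma*H_hat_{t-1}; the gNB reconstructs
# H_hat_t = dec(s_t) + gamma*H_hat_{t-1}, cf. Eqs. (16)-(17).
H_hat = [quantize(H[0])]                 # first slot fed back directly
for t in range(1, T):
    residual = [h - gamma * hh for h, hh in zip(H[t], H_hat[t - 1])]
    s = quantize(residual)               # compressed/quantized feedback
    H_hat.append([gamma * hh + r for hh, r in zip(H_hat[t - 1], s)])

# The residual has lower power than the raw CSI, so it compresses better.
def power(v): return sum(x * x for x in v) / len(v)
raw_pow = power(H[1])
res_pow = power([h - gamma * hh for h, hh in zip(H[1], H_hat[0])])
print(res_pow < raw_pow)   # the residual approximates the innovation V_t
```

Note that the reconstruction error does not accumulate over time: at every slot the error H_t − Ĥ_t equals only the quantization error of that slot's residual, which is why the scheme can sustain a small per-slot feedback budget.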
MarkovNet simplifies the learning tasks of the neural networks and is applicable to a wider variety of wireless deployment scenarios.

V. MODEL REDUCTION
Practical implementation of deep neural networks for CSI feedback and recovery can be challenging for some mobile devices. Because DL network architectures often use large numbers of parameters, they require substantial computation and memory resources. Unrolled RNN models, such as the LSTM layers in Fig. 2, are particularly computationally expensive. One of the main advantages of MarkovNet (see Fig. 6) is its relatively low parameter count: at a comparable compression ratio (CR), MarkovNet requires roughly three orders of magnitude fewer parameters per timeslot than CsiNet-LSTM [16]. Our proposed MarkovNet clearly reduces the model size by eliminating the repeated structure used to learn from the sequence data in RNN-style architectures. It is important to note, however, that the fully connected (FC) layers for dimension compression and decompression in the current MarkovNet still contain a large number of parameters and account for a large share of the total parameter count.

Fig. 7: Proposed CNN-based dimension compression module. (a) Fully connected layer based dimension compression; (b) CNN based dimension compression.

The FC layers for dimension compression and decompression, as shown in Fig. 7(a), have often been adopted in deep learning based CSI feedback [15]–[18], [25]. However, elements of the CSI matrix only exhibit strong correlation with their neighbors in the angular-delay domain. Thus, we recognize that the FC layers, though effective and popular, still contain a large fraction of non-essential connections with very weak weight parameters. This realization presents another opportunity for model reduction. To further reduce model size, we propose a CNN-based latent structure to replace the FC layers for dimension compression. As shown in Fig. 7(b), we slice the two square feature maps into single-row feature maps. We then design M CNN kernels, matched to the row length, to compress the codeword dimension; the integer M is chosen in accordance with the encoder compression ratio. Through this feature processing, connections between CSI elements that are far apart in the angular-delay domain are removed, while strongly correlated features of the CSI matrix across the angular-delay domain can effectively be captured by the small CNN kernels.

TABLE I: Number of parameters and FLOPs for the FC-based and the proposed CNN-based dimension compression and decompression modules at each tested compression ratio. M: million, K: thousand.

To illustrate the effect of the proposed model size reduction, we summarize the number of parameters and the floating point operations (FLOPs) in Table I. This information provides a comparison of the storage size and computational complexity between the FC-layer and the proposed CNN-layer in the CSI compression module and the corresponding decompression module.
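As a rough, back-of-the-envelope illustration of the savings, the two bottleneck designs can be compared directly. The dimensions and kernel layout below are our assumptions (a 2 × 32 × 32 angular-delay CSI matrix, consistent with N_b = 32 and R_d = 32, and one kernel per row strip), not the paper's exact architecture:

```python
# Parameter count of the dimension-compression bottleneck, FC vs CNN.
# Assumed dimensions: 2 x 32 x 32 angular-delay CSI (real/imag channels).
H, W, C = 32, 32, 2
n_in = C * H * W                      # flattened CSI length: 2048

def fc_params(cr):
    """FC bottleneck: dense map from n_in to n_in*cr (weights + biases)."""
    n_code = int(n_in * cr)
    return n_in * n_code + n_code

def cnn_params(cr, kernel_w=W):
    """CNN bottleneck sketch: the two square maps are sliced into 1 x W rows
    and M kernels of width W compress them (our reading of Fig. 7(b));
    each kernel spans one row, so params = M*kernel_w + M biases."""
    M = max(1, int(cr * C * H))       # kernel count chosen to hit the CR
    return M * kernel_w + M

for cr in (1/4, 1/16, 1/64):
    fc, cnn = fc_params(cr), cnn_params(cr)
    print(f"CR=1/{round(1/cr)}: FC={fc:,} params, CNN={cnn:,} params, "
          f"ratio={fc // cnn}x")
```

Even under these simplified assumptions, the FC bottleneck grows with the product of input and codeword lengths while the row-wise CNN grows only with the kernel count, which is the qualitative gap Table I quantifies.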
As shown in Table I, the proposed CNN-based dimension compression and decompression module reduces the number of parameters more than 100-fold and the number of FLOPs at least 4-fold. The comparison results demonstrate that our new CNN design for CSI compression and decompression represents an important step in broadening the range of practical applications for effectively deploying deep learning based CSI encoding, feedback, and reconstruction in massive MIMO wireless systems.

VI. PERFORMANCE EVALUATION
We assess the performance of both the RNN-based CsiNet-LSTM [16] and MarkovNet for two different massive MIMO scenarios generated from the well-known COST 2100 model [32]:
1) Indoor channels using a 5.3 GHz downlink at 0.001 m/s UE velocity, served by a gNB at the center of a square coverage area.
2) Outdoor channels using a 300 MHz downlink at 0.9 m/s UE velocity, served by a gNB at the center of a square coverage area.
We give N_b = 32 antennas to the gNB to serve single-antenna UEs randomly distributed within the coverage area. We use N_f = 1024 subcarriers and truncate the delay-domain CSI matrix to include the first R_d = 32 rows. The gNB uses antennas arranged in a uniform linear array (ULA) with half-wavelength spacing. UEs are randomly positioned within the coverage area such that their CSIs are random. For each indoor/outdoor environment, we generate a dataset of sample channels and divide it into training and testing sets. The batch size for the training of MarkovNet is 200. MarkovNet at t1 was trained for 1000 epochs using MSE as the loss function. For MarkovNet after t1, only 150 epochs are used, with initialization from the pretrained model of the previous timeslot to reduce training expenses. We utilize the Adam optimizer with the default learning rate; hyperparameters (i.e., batch size, epochs) for each test will be clarified in each relevant subsection. NMSE is used to compare the CSI recovery accuracy of different networks.

Fig. 8: NMSE of different networks (CsiNet, CsiNet-Sph, CsiNet Pro, SphNet) in the first time slot of MarkovNet over varying compression ratios (CR): (a) Indoor; (b) Outdoor.

Fig. 9: NMSE comparison between MarkovNet and CsiNet-LSTM over varying compression ratios (CR): (a) indoor; (b) outdoor.

A. MarkovNet
In this part, we evaluate the performance of MarkovNet considering the performance at the first timeslot (t1), the overall performance of MarkovNet, and the performance of MarkovNet-CNN.
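Throughout this section, recovery accuracy is reported as NMSE in dB. A minimal sketch of the metric as conventionally defined (the per-sample normalization is our assumption of the paper's convention):

```python
import math

def nmse_db(H, H_hat):
    """NMSE in dB between a true CSI sample H and its reconstruction H_hat,
    both given as flat lists of complex entries:
    NMSE = ||H - H_hat||^2 / ||H||^2, reported as 10*log10(NMSE)."""
    err = sum(abs(h - hh) ** 2 for h, hh in zip(H, H_hat))
    ref = sum(abs(h) ** 2 for h in H)
    return 10.0 * math.log10(err / ref)

# A reconstruction whose error power is 1% of the signal power
# corresponds to -20 dB NMSE:
H = [1 + 1j, -0.5 + 0.25j, 0.3 - 0.8j, -1.2 + 0j]
H_hat = [h * (1 + 0.1) for h in H]   # uniform 10% amplitude error
print(round(nmse_db(H, H_hat), 2))   # -> -20.0
```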
1) Performance evaluation at t1: To enable efficient differential CSI feedback, high-accuracy CSI feedback is required at t1 to provide a good starting CSI condition for subsequent timeslots. Here, we demonstrate that our proposed spherical CSI feedback framework improves the CSI recovery accuracy for a single timeslot compared to different CSI feedback frameworks. Fig. 8 compares the channel reconstruction performance of CsiNet, CsiNet-Sph (CsiNet with the help of spherical feedback), CsiNet Pro, and SphNet (CsiNet Pro with the help of the spherical feedback framework). As shown in Fig. 8, SphNet achieves the best performance in single-shot feedback for CSI recovery without relying on prior CSI knowledge, which means that SphNet can improve the accuracy of the prior information for MarkovNet. On the one hand, CsiNet Pro outperforms CsiNet across different CRs and scenarios, which means the enhanced network structure is effective. On the other hand, we observe that spherical feedback provides the most noticeable performance gain to both CsiNet and CsiNet Pro. This establishes the strength of spherical normalization in efficiently capturing the CSI data features.

Fig. 10: NMSE comparison between MarkovNet and MarkovNet-CNN over varying CR: (a) indoor; (b) outdoor.
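Spherical normalization (the "Sph" pre-processing, detailed in [39]) can be sketched under the assumption that it rescales each CSI sample by its norm, feeding the scalar magnitude back separately; the exact form below is our reading, not a quoted implementation:

```python
import math

def spherical_normalize(H):
    """Sketch of spherical feedback pre-processing (assumed form, cf. [39]):
    scale a CSI sample by its Frobenius norm so every input to the encoder
    lies on the unit sphere; the scalar norm is side information that can
    be fed back separately at negligible cost."""
    norm = math.sqrt(sum(abs(h) ** 2 for h in H))
    return [h / norm for h in H], norm

H = [3 + 4j, -1 + 0j, 0 + 2j]
H_unit, norm = spherical_normalize(H)
# The normalized sample has unit power regardless of channel gain,
# which stabilizes the scale of the data seen by the network.
print(round(sum(abs(h) ** 2 for h in H_unit), 6))   # -> 1.0
```

Decoupling the sample's magnitude from its "shape" in this way keeps the encoder's input statistics stable across UEs with very different path losses.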
2) Overall performance evaluation of MarkovNet:
Every instance of MarkovNet contains two different compression ratios. For the first timeslot, we initialize MarkovNet with CR = 1/4 at t1 to provide an accurate CSI baseline for subsequent timeslots. For the remaining timeslots (t2 to t10), MarkovNet maintains the same CR. To evaluate MarkovNet's performance under different amounts of compression, we vary the second CR used in the later timeslots from 1/4 to 1/64 and train each network. For example, in Fig. 9, "MarkovNet, CR=1/16" uses CR = 1/16 at timeslots t2 through t10 and CR = 1/4 at timeslot t1. Fig. 9 compares the performance of MarkovNet and CsiNet-LSTM. The benefit of differential CSI encoding is demonstrated by the CSI recovery accuracy of MarkovNet at different compression ratios beyond t1 in comparison with CsiNet-LSTM. MarkovNet consistently achieves higher CSI accuracy than CsiNet-LSTM at every CR level. With the help of differential CSI encoding, MarkovNet is an effective encoding framework given limited UE power and bandwidth for CSI encoding. For the indoor channels, MarkovNet delivers reliable CSI accuracy of -30 dB even at the lowest tested compression ratio, a 10 dB improvement over CsiNet-LSTM. Although the outdoor scenario remains more challenging, our results show that the 1/4 and 1/8 compression ratios still achieve substantially lower NMSE than CsiNet-LSTM at the same ratios.
3) Performance and Complexity Trade-off of MarkovNet-CNN:
Fig. 10 shows the performance comparison between MarkovNet and MarkovNet-CNN at different meaningful compression ratios. Since the trend of CSI accuracy is similar over time, we focus on the performance over the earlier timeslots. For the first timeslot, we initialize MarkovNet and MarkovNet-CNN with CR = 1/4 at t1 to provide an accurate CSI baseline for subsequent timeslots. Note that, to show the influence of the CNN-based dimension compression module at t1, the results shown in Fig. 10 at t1 are from the labeled compression ratios. Both MarkovNet and MarkovNet-CNN achieve comparable CSI accuracy at t1, indicating that the CNN layer for compression and decompression is not only more efficient in memory and computation, but also delivers similar CSI accuracy. For the remaining timeslots, MarkovNet maintains the same CR. Beyond t1, MarkovNet-CNN achieves modestly higher accuracy for indoor channels, as shown in Fig. 10(a). The benefit of MarkovNet-CNN likely arises from the removal of many redundant weights from the FC layer, leaving fewer opportunities for convergence to poor local minima. For outdoor channels, MarkovNet-CNN achieves CSI accuracy comparable to MarkovNet at compression ratios of 1/8 and 1/16 while exhibiting a modest loss of accuracy at CR = 1/4. One possible reason is that outdoor channels can benefit more from a higher number of connections in the compression and feature-extraction layers because of their much more complex characteristics.
We demonstrate that latent convolutional layers require significantly fewer parameters than FC layers without loss of performance. Table II compares the model size and computational complexity of CsiNet-LSTM, MarkovNet, and MarkovNet-CNN associated with a single timeslot. Across all compression ratios, MarkovNet uses 60 times fewer parameters than CsiNet-LSTM. More importantly, MarkovNet-CNN can use 1/3000 the number of parameters needed by CsiNet-LSTM while achieving better CSI recovery accuracy. Table II also shows the average number of floating point operations (FLOPs) associated with a single timeslot for each network [40], [41]. MarkovNet and MarkovNet-CNN substantially reduce the computation load relative to CsiNet-LSTM at every compression ratio. For CsiNet-LSTM, the amount of computation does not change significantly even at low compression ratios. For example, a 16-fold drop in compression ratio (from 1/4 to 1/64) only results in a 1% saving of FLOPs. In contrast, MarkovNet and MarkovNet-CNN require proportionally much lower computational complexity at lower compression ratios; the same 16-fold CR reduction (from 1/4 to 1/64) reduces their number of FLOPs by 9% and 2%, respectively.

We note that when deploying MarkovNet and MarkovNet-CNN as a cooperative learning mechanism at both UE and gNB, 50% additional parameters and FLOPs are required in comparison with the training phase. This is because the trained decoder must be duplicated at the UE side to generate the decoded CSI for the previous timeslot used by the encoder. Despite this additional cost, both MarkovNet and MarkovNet-CNN still reduce the number of parameters by orders of magnitude and substantially reduce FLOPs in comparison with CsiNet-LSTM.

Fig. 11: Stacked LSTMs of 3 depths trained on perfect and quantized CSI: (a) the Indoor 5.3 GHz network; (b) the Outdoor 300 MHz network.

TABLE II: Model size and computational complexity of the tested networks at each compression ratio (parameters and FLOPs for CsiNet-LSTM, MarkovNet, and MarkovNet-CNN). M: million, K: thousand.
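To see why unrolled LSTM stages dominate the parameter budget, the standard parameter-count formulas can be compared directly; the layer sizes below are hypothetical, chosen only to match the 2 × 32 × 32 CSI dimensions used in this paper:

```python
def lstm_params(input_size, hidden_size):
    """A standard LSTM layer has 4 gates, each with an input weight matrix,
    a recurrent weight matrix, and a bias: 4*(n*m + n*n + n)."""
    m, n = input_size, hidden_size
    return 4 * (n * m + n * n + n)

def conv2d_params(in_ch, out_ch, k):
    """A k x k conv layer: in_ch*out_ch*k*k weights + out_ch biases."""
    return in_ch * out_ch * k * k + out_ch

# Hypothetical sizes for illustration: an LSTM over the flattened
# 2x32x32 CSI vs. a small 3x3 conv layer of the kind used in
# CsiNet-style encoders.
flat = 2 * 32 * 32
print(f"LSTM layer ({flat} -> {flat}): {lstm_params(flat, flat):,} params")
print(f"3x3 conv (2 -> 16 ch): {conv2d_params(2, 16, 3):,} params")
```

The recurrent weight matrix grows quadratically with the hidden size, which is why replacing the LSTM stages with small convolutions yields the orders-of-magnitude gap reported in Table II.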
C. LSTMs for CSI Estimation
In this section, we explore the effects of varying LSTM depth on network performance. For all experiments, we use the Adam optimizer with a batch size of 100. For the LSTM-only networks in the ablation study, we train for 500 epochs. For CsiNet-LSTM, we pretrain CsiNet at each compression ratio for 600 epochs, and we then initialize the CsiNet at each timeslot of CsiNet-LSTM with the pretrained weights before training for 500 epochs.

TABLE III: Average NMSE across ten timeslots (T = 10) for stacked LSTMs of increasing depth trained on quantized CSI under 11-bit uniform quantization ('UQ11' in Fig. 11). NMSE of quantized CSI under uniform quantization ('Raw UQ11') is shown for comparison.

Environment   Raw UQ11    Depth 1     Depth 2     Depth 3
Indoor        -10.66 dB   -21.03 dB   -21.26 dB   -19.54 dB
Outdoor       -6.41 dB    -14.64 dB   -13.11 dB   -10.40 dB
1) Ablation study on LSTM depth:
We seek to know whether shallow RNNs perform comparably well to deep networks. To investigate the effect of network depth, we train stacked LSTMs of increasing depth on {H̄_t}, the CSI matrices represented under two different schemes: 1) single-precision floating point (FP32) and 2) 11-bit uniform quantization (UQ11). For a visual depiction of this network, see Fig. 2(b). We train each network with Adam using the default hyperparameters.

Fig. 12: CsiNet-LSTM over varying compression ratios (CR) compared to LSTMs trained on perfect CSI and quantized CSI. DN indicates a stacked LSTM of depth N.

Figure 11 shows the NMSE per timeslot for each of these RNNs, and the average performance across all timeslots is shown in Table III. At all depths, LSTMs are able to improve the test NMSE relative to quantized CSI. However, there is not a clear linear relationship between network depth and NMSE performance. For the outdoor network (Fig. 11(b)), network performance and LSTM depth appear negatively correlated: increasing depth results in decreasing performance. For the indoor network, the best network has a depth of D = 2, indicating that the best choice of depth is channel-dependent.
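The UQ11 scheme referenced above can be sketched as a plain mid-tread uniform quantizer, assuming the CSI entries are normalized to [-1, 1] (the range and rounding convention are our reading of the setup, not a quoted implementation):

```python
def uniform_quantize(x, bits=11):
    """Uniform quantization of a value assumed normalized to [-1, 1]:
    2**bits levels with step 2 / 2**bits ('UQ11' uses bits=11)."""
    step = 2.0 / (1 << bits)
    q = round(x / step) * step
    return max(-1.0, min(1.0, q))       # clip back into range

# Quantization error is bounded by half a step (~4.9e-4 for 11 bits),
# which sets the noise floor the LSTMs in Fig. 11 must denoise.
step = 2.0 / (1 << 11)
samples = [-0.73, -0.1, 0.0, 0.33333, 0.999]
errors = [abs(x - uniform_quantize(x)) for x in samples]
print(max(errors) <= step / 2 + 1e-12)   # -> True
```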
2) LSTM Depth in CsiNet-LSTM:
While LSTMs can perform admirably when using noisy CSI samples, these samples were not subject to compression. Compression is imperative for channel feedback, as transmitting uncompressed CSI would consume an undue amount of bandwidth. In this experiment, we use a known CNN/RNN for CSI estimation, CsiNet-LSTM [16]. Fig. 12 illustrates the NMSE for different depths of CsiNet-LSTM in each of the 10 timeslots. The original network utilizes LSTMs of depth 3 (D3 in Figs. 12(a) and 12(b)), and we also train CsiNet-LSTM with one LSTM layer (D1) for comparison. We compare the performance of the D3 and D1 networks to LSTMs trained directly on FP32 and UQ11 CSI samples (i.e., the same networks as in Fig. 11). We train each network end-to-end based on the original paper's hyperparameters and dataset splits. Figure 12(a) shows the NMSE of channel reconstruction by CsiNet-LSTM for the indoor dataset. While the shallower D1 network with fewer parameters in fact outperformed the deeper D3 network in the FP32 and UQ11 scenarios, the D3 variant of CsiNet-LSTM performs better than the D1 version for all three compression ratios. This performance trend relative to LSTM depth does not hold for the outdoor network. Figure 12(b) shows the NMSE of channel reconstruction for the outdoor dataset. Clearly, the shallower D1 network performs similarly to the D3 network at the tested compression ratios. These results indicate that a simple, shallower LSTM can perform similarly to complex, deeper networks for both indoor and outdoor datasets.

TABLE IV: MarkovNet and CsiNet-LSTM mean NMSE degradation (increase) under different feedback quantization bit levels. The mean is taken across all tested compression ratios, and the degradation in NMSE is relative to 32-bit floating point precision.

Network            Environment   6 bits    4 bits
MarkovNet          Indoor        0.70 dB   5.49 dB
                   Outdoor       0.03 dB   0.58 dB
CsiNet-LSTM (D3)   Indoor        2.30 dB   11.96 dB
                   Outdoor      0.07 dB   1.25 dB
CsiNet-LSTM (D1)   Indoor        2.44 dB   11.55 dB
                   Outdoor      0.30 dB   1.21 dB
D. Network Performance Under Feedback Quantization
To understand the effect of feedback quantization, we apply μ-law companding to the encoded layer of both tested networks. μ-law companding uses a logarithmic transformation that emphasizes lower-magnitude samples. For signal value $x$, the compression portion of the μ-law scheme is written as

$f(x) = \mathrm{sgn}(x)\,\frac{\ln(1 + \mu|x|)}{\ln(1 + \mu)}, \quad 0 \le |x| \le 1. \qquad (20)$

Uniform quantization is applied to the compressed signal. For signal value $x$, the quantization/dequantization operation produces a value $\hat{x}$, which can be written as

$\hat{f}(x) = \Delta \left\lfloor \frac{f(x)}{\Delta} \right\rceil \qquad (21)$

for a fixed step size $\Delta$, where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer. After the quantized feedback is received, we expand the result using the inverse of (20),

$F(\hat{x}) = \mathrm{sgn}(\hat{f}(x))\,\frac{(1 + \mu)^{|\hat{f}(x)|} - 1}{\mu}, \quad -1 \le \hat{f}(x) \le 1. \qquad (22)$

Fig. 13 and Fig. 14 show the performance of MarkovNet and CsiNet-LSTM with μ-law companding and a fixed quantization step size at two different quantization levels, 6 bits and 4 bits, in comparison to the non-quantized network (i.e., 32-bit floating point).

Fig. 13: MarkovNet and CsiNet-LSTM NMSE performance (dB) for the indoor network with feedback subject to μ-law quantization using fixed step size $\Delta = 2^{-(b-1)}$ for $b$ bits: (a) DiffNet; (b) CsiNet-LSTM (D3); (c) CsiNet-LSTM (D1).
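Equations (20)-(22) translate directly into code. In this minimal sketch, the companding parameter is an assumed value (μ = 255, the common telephony choice; the paper does not restate its μ in this section), and the step size follows the captioned $\Delta = 2^{-(b-1)}$ for $b$ bits:

```python
import math

MU = 255.0   # assumed companding parameter (mu = 255 is the common choice)

def compress(x):
    """mu-law compression, Eq. (20): emphasizes low-magnitude samples."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def quantize(y, bits):
    """Uniform quantization with fixed step, Eq. (21); the step assumes the
    companded value lies in [-1, 1]."""
    step = 2.0 ** -(bits - 1)
    return step * round(y / step)

def expand(y):
    """mu-law expansion, Eq. (22): inverse of the compression in (20)."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)

def mu_law_feedback(x, bits=6):
    """Compand -> quantize -> expand, as applied to the encoded codeword."""
    return expand(quantize(compress(x), bits))

# Small codeword values are reproduced with small *relative* error,
# which is the point of companding before uniform quantization.
for x in (0.9, 0.05, -0.003):
    print(f"{x:+.4f} -> {mu_law_feedback(x):+.4f}")
```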
Fig. 14: MarkovNet and CsiNet-LSTM NMSE performance (dB) for the outdoor network with feedback subject to μ-law quantization using fixed step size $\Delta = 2^{-(b-1)}$ for $b$ bits: (a) DiffNet; (b) CsiNet-LSTM (D3); (c) CsiNet-LSTM (D1).

The networks with quantized feedback use 8-bit quantization at the first timeslot to establish good initial CSI estimates. Note that the networks are not re-trained or fine-tuned after applying quantization. MarkovNet is more robust to feedback quantization noise than CsiNet-LSTM at either LSTM depth: the former maintains NMSE better than -10 dB in both environments, while the latter only exceeds -10 dB at low compression ratios in the indoor environment. Table IV summarizes the mean degradation in NMSE for each network; for both 6-bit and 4-bit μ-law quantization, MarkovNet shows a smaller mean degradation in NMSE performance than CsiNet-LSTM.

VII. CONCLUSION
To better exploit temporal correlation, we provide an information-theoretic basis for utilizing differential encoding with CNNs rather than applying overly parameterized LSTMs. We propose MarkovNet, a CNN with differential encoding, which achieves superior estimation accuracy and lower computational complexity relative to an LSTM-based CSI estimation network. MarkovNet maintains accurate CSI estimates even under feedback quantization. We expand on a prior LSTM-based estimation technique and show that more parsimonious models can yield comparable or better performance.

ACKNOWLEDGMENT
The authors wish to thank Prof. S. Jin of Southeast University for his kind assistance with the source code of [16] and related questions during the preparation of this manuscript.

REFERENCES

[1] E. P. Simon, L. Ros, H. Hijazi, J. Fang, D. P. Gaillot, and M. Berbineau, "Joint carrier frequency offset and fast time-varying channel estimation for MIMO-OFDM systems,"
IEEE Transactions on Vehicular Technology ,vol. 60, no. 3, pp. 955–965, 2011. [2] W.-G. Song and J.-T. Lim, “Channel estimation and signal detection formimo-ofdm with time varying channels,” IEEE Communications Letters ,vol. 10, no. 7, pp. 540–542, 2006.[3] B. Makki and T. Eriksson, “On hybrid arq and quantized csi feedbackschemes in quasi-static fading channels,”
IEEE Transactions on Com-munications , vol. 60, no. 4, pp. 986–997, April 2012.[4] D. J. Love, R. W. Heath, V. K. N. Lau, D. Gesbert, B. D. Rao, andM. Andrews, “An overview of limited feedback in wireless communi-cation systems,” vol. 26, no. 8, pp. 1341–1365, 2008.[5] H. Shirani-Mehr and G. Caire, “Channel state feedback schemes for mul-tiuser mimo-ofdm downlink,”
IEEE Transactions on Communications ,vol. 57, no. 9, pp. 2713–2723, Sep. 2009.[6] A. Hindy, U. Mittal, and T. Brown, “CSI feedback overhead reductionfor 5g massive mimo systems,” in , 2020, pp. 0116–0120.[7] X. Rao and V. K. N. Lau, “Distributed compressive csit estimation andfeedback for fdd multi-user massive mimo systems,”
IEEE Transactionson Signal Processing , vol. 62, no. 12, pp. 3261–3271, June 2014.[8] M. E. Eltayeb, T. Y. Al-Naffouri, and H. R. Bahrami, “Compressivesensing for feedback reduction in mimo broadcast channels,”
IEEETransactions on Communications , vol. 62, no. 9, pp. 3209–3222, Sep.2014.[9] Z. Gao, L. Dai, S. Han, C. I, Z. Wang, and L. Hanzo, “Compressivesensing techniques for next-generation wireless communications,”
IEEEWireless Communications , vol. 25, no. 3, pp. 144–153, 2018.[10] E. Chen, R. Tao, and X. Zhao, “Channel equalization for ofdm systembased on the bp neural network,” in , vol. 3. IEEE, 2006.[11] N. Tas¸pinar and M. N. Seyman, “Back propagation neural networkapproach for channel estimation in ofdm system,” in . IEEE, 2010, pp. 265–268.[12] C.-H. Cheng, Y.-P. Cheng et al. , “Using back propagation neural networkfor channel estimation and compensation in ofdm systems,” in . IEEE, 2013, pp. 340–345.[13] K. Hiray and K. V. Babu, “A neural network based channel estimationscheme for ofdm system,” in . IEEE, 2016, pp. 0438–0441.[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learningfor image recognition,” , Jun 2016. [Online]. Available:http://dx.doi.org/10.1109/cvpr.2016.90[15] C. Wen, W. Shih, and S. Jin, “Deep learning for massive mimo csifeedback,”
IEEE Wireless Communications Letters , vol. 7, no. 5, pp.748–751, Oct 2018.[16] T. Wang, C. Wen, S. Jin, and G. Y. Li, “Deep learning-based csifeedback approach for time-varying massive mimo channels,”
IEEEWireless Communications Letters , vol. 8, no. 2, pp. 416–419, April 2019.[17] C. Lu, W. Xu, H. Shen, J. Zhu, and K. Wang, “Mimo channel infor-mation feedback using deep recurrent network,”
IEEE CommunicationsLetters , vol. 23, no. 1, pp. 188–191, Jan 2019.[18] Y. Liao, H. Yao, Y. Hua, and C. Li, “Csi feedback based on deep learningfor massive mimo systems,”
IEEE Access , vol. 7, pp. 86 810–86 820,2019.[19] X. Li and H. Wu, “Spatio-temporal representation with deep neural re-current network in mimo csi feedback,”
IEEE Wireless CommunicationsLetters , vol. 9, no. 5, pp. 653–657, 2020.[20] Y. Jang, G. Kong, M. Jung, S. Choi, and I. Kim, “Deep autoencoderbased csi feedback with feedback errors and feedback delay in fddmassive mimo systems,”
IEEE Wireless Communications Letters
Advances in Neural Information ProcessingSystems , vol. 4, no. January, pp. 3104–3112, 2014.[23] O. Irsoy and C. Cardie, “Opinion mining with deep recurrent neuralnetworks,”
EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 720–728, 2014. [24] R. W. Heath Jr. and A. Lozano, Foundations of MIMO Communication. Cambridge University Press, 2018. [25] Z. Liu, L. Zhang, and Z. Ding, "Exploiting Bi-Directional Channel Reciprocity in Deep Learning for Low Rate Massive MIMO CSI Feedback,"
IEEE Wireless Communications Letters , vol. 8, no. 3, pp.889–892, 2019.[26] M. Hermans and B. Schrauwen, “Training and analysing deeprecurrent neural networks,” in
Advances in Neural InformationProcessing Systems 26 , C. J. C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc.,2013, pp. 190–198. [Online]. Available: http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf[27] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deeprecurrent neural networks,” , no. March2014, 2014.[28] Yoav Goldberg, “A Primer on Neural Network Models for NaturalLanguage Processing,”
Journal of Artificial Intelligence Research
Neuralcomputation , vol. 9, pp. 1735–80, 12 1997.[30] Y. Bengio, “Learning deep architectures for AI,”
Foundations and Trendsin Machine Learning , vol. 2, no. 1, pp. 1–27, 2009.[31] D. Tse and P. Viswanath,
Fundamentals of wireless communication .Cambridge university press, 2005.[32] L. Liu, C. Oestges, J. Poutanen, K. Haneda, P. Vainikainen, F. Quitin,F. Tufvesson, and P. D. Doncker, “The cost 2100 mimo channel model,”
IEEE Wireless Communications , vol. 19, no. 6, pp. 92–99, December2012.[33] T. S. Rappaport,
Wireless communications: principles and practice .prentice hall PTR New Jersey, 1996, vol. 2.[34] S. Dhanani and M. Parker, “11 - entropy, predictive codingand quantization,” in
Digital Video Processing for Engineers
IEEEtransactions on wireless communications , vol. 5, no. 9, pp. 2458–2466,2006.[36] A. Goldsmith,
Wireless communications . Cambridge university press,2005.[37] S. Gao, P. Dong, Z. Pan, and G. Y. Li, “Deep learning based channelestimation for massive mimo with mixed-resolution adcs,”
IEEE Communications Letters, vol. 23, no. 11, pp. 1989–1993, Nov 2019. [38] Y. LeCun, L. Bottou, G. Orr, and K. Müller, "Efficient backprop," in
Neural networks: Tricks of the trade. Springer, 2012, pp. 9–48. [39] Z. Liu, M. del Rosario, X. Liang, L. Zhang, and Z. Ding, "Spherical normalization for learned compressive feedback in massive MIMO CSI acquisition," in , 2020, pp. 1–6. [40] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," arXiv preprint arXiv:1611.06440, 2016. [41] A. Nisar, J. A. Sue, and J. Teich, "Performance comparison between machine learning based LTE downlink grant predictors," in